1 Introduction
Recently, topic models such as latent Dirichlet allocation (LDA) [BNJ03] and its variants [TJBB06] have proven extremely successful in modeling large, complex text corpora. These models assume that the words in a document are generated from a mixture of latent topics, each represented by a multinomial distribution over a given dictionary. The major inference problem therefore becomes recovering the latent topics from a text corpus. Popular inference algorithms for LDA include variational inference [BNJ03, TKW07, WPB11, HBWP13], sampling methods [GS04, PNI08], and, more recently, tensor decomposition [AGM12, AFH12, AGH12]. However, all of them require the number of topics as input.
It is well known that model selection, i.e., choosing the appropriate number of topics, plays a vital role in successfully applying LDA models [TMN14, KRS14]. For example, [TMN14] has shown that an overly large number of topics leads to severe deterioration in the learning rate, and [KRS14] points out that an incorrect number of mixture components can result in unpredictable error when estimating the parameters of a mixture model via spectral methods. Moreover, as the number of topics increases, the computational cost of inference for the LDA model grows significantly.*

*Dehua Cheng and Xinran He contributed equally to this article.

Unfortunately, choosing the number of topics for the LDA model is extremely challenging. In practice, [Tad12] approximates the marginal likelihood via Laplace's method, while [AEF10, GS04] compute the likelihood via MCMC. [Tad12] also proposes a model selection method based on the analysis of residuals; however, it only provides a rough measure of the evidence in favor of a larger number of topics. Other model selection criteria, such as AIC [Aka74], BIC [S78], and cross validation, can also be applied. Though they achieve practical success [AEF10], they only enjoy asymptotic model selection consistency. Moreover, they require multiple runs of the learning algorithm over a wide range of candidate topic numbers, which limits their practicality on large-scale datasets. Bayesian nonparametrics, such as hierarchical Dirichlet processes (HDP) [TJBB06], provide an alternative for selecting the number of topics in a principled way. However, a recent paper [MH13] has shown that HDP is inconsistent for estimating the number of topics in LDA, even with an infinite amount of data.
In this paper, we provide a theoretical analysis of the number of topics for latent topic models using spectral decomposition methods. By the results of Anandkumar et al. [AGH12], the second-order moment of the LDA model has a special structure: it is a weighted sum of the outer products of the topic vectors. We show that a spectral decomposition of the empirical second-order moment, with proper thresholding of the singular values, recovers the correct number of topics. Under mild assumptions, our analysis provides both a lower bound and an upper bound on the number of topics in the LDA model. To the best of our knowledge, this is the first work to analyze the number of topics with provable guarantees by utilizing results from the tensor decomposition approach.

Our main contributions are:

- For LDA, we analyze the empirical second-order moment and derive an upper bound on its variance in terms of the corpus statistics, i.e., the number of documents, the length of each document, and the number of unique words. Essentially, our results provide a computable guideline for the convergence of the second-order moment. This contribution is valuable in itself, e.g., for determining the correct downsampling rate on a large-scale dataset.

- We analyze the spectral structure of the true second-order moment for LDA. That is, we provide spectral information on the covariance of the Dirichlet design matrix.

- Based on the results on the empirical and the true second-order moments for LDA, we derive three inequalities regarding the number of topics, which in turn provide both upper and lower bounds on the number of topics in terms of known parameters or constants. We also present a simulation study for our theoretical results.

- We show that our results and techniques generalize to other mixture models; the results for Gaussian mixture models are presented as an example.
The rest of the paper is organized as follows. In Section 2, we present our main results on analyzing the number of topics in the LDA model. In Section 3, we carry out experiments on synthetic datasets to demonstrate the validity and tightness of our bounds. We conclude and show how our methodology generalizes to other mixture models in Section 4.
2 Analyzing the Number of Topics in LDA
Latent Dirichlet allocation (LDA) [BNJ03] is a powerful generative model for topic modeling. It has been applied in a variety of applications and also serves as a building block in other powerful models. Most existing methods follow the empirical Bayes approach for parameter estimation [BNJ03, TKW07, GS04, PNI08]. Recently, the method of moments has been explored, leading to a series of interesting works and new insights into the LDA model. It has been shown in [AFH12, AGH12] that the latent topics can be derived directly from a properly constructed third-order moment (which can be estimated directly from the data) by orthogonal tensor decomposition. Following this line of work, we observe that the low-order moments are also useful for discovering the number of topics in the LDA model. In this section, we investigate the structure of both the empirical and the true second-order moments, and show that they lead to effective bounds on the number of topics.

Table 1: Notation.

Notation  Definition
()  Number (index) of documents
()  Number (index) of words in a document
()  Number (index) of unique words
()  Number (index) of latent topics
Multinomial parameters for a given topic
Collection of all topics
Collection of all words in a given document
A single word in a given document
Topic mixing proportions for a given document
Topic assignment for a given word
Hyperparameter for the document-topic distribution
Hyperparameter for generating topics
2.1 Notation and Problem Formulation
As introduced in [BNJ03], the full generative process for a document in the LDA model is as follows:

1. Generate the topic mixing proportions from the Dirichlet prior.

2. For each word in the document:

(a) Generate a topic assignment from the multinomial distribution given by the topic mixing proportions.

(b) Generate a word from the multinomial distribution whose parameter vector is associated with the assigned topic.

The notation is summarized in Table 1. Each word is represented in the natural basis as an indicator (one-hot) vector over the dictionary, so that the vector for a word equals the standard basis vector of its dictionary index.
In [AGH12], the authors propose the method of moments for learning the LDA model, where the empirical first-order moment is defined as the average of the word indicator vectors,
and the empirical second-order moment as the average of the outer products of pairs of distinct words within each document,
where the outer product of a column vector with itself is the usual rank-one matrix. Following the notation of [AGH12], we write $M_1$ and $M_2$ for the true first-order and second-order moments, defined as the expectations of the corresponding empirical moments, and $\widehat M_1$, $\widehat M_2$ for the empirical moments themselves. Furthermore, it has been shown that $M_2$ equals a weighted sum of the outer products of the topic parameter vectors [AGH12], i.e., $M_2 = \sum_{k=1}^{K} c_k\,\mu_k \mu_k^\top$ with weights $c_k > 0$, where $\mu_k$ denotes the multinomial parameter vector of the $k$-th topic.
This implies that the rank of $M_2$ is exactly the number of topics $K$. Another interesting observation from this derivation is that, since $M_2$ is a sum of $K$ rank-one matrices and all the topics are linearly independent almost surely under the full generative model, the $K$-th largest singular value satisfies $\sigma_K(M_2) > 0$ while the $(K+1)$-th largest singular value satisfies $\sigma_{K+1}(M_2) = 0$. Therefore, the number of nonzero singular values of $M_2$ is exactly the number of topics, which provides a direct way to estimate $K$ in the noiseless scenario. However, in practice we only have access to $\widehat M_2$ as an approximation to the true second-order moment $M_2$. As a result, the rank of $\widehat M_2$ may differ from $K$, and $\sigma_{K+1}(\widehat M_2)$ may be larger than zero. To overcome this obstacle, we need to study (1) the spectral structure of $M_2$, and (2) the relationship between $M_2$ and its estimator $\widehat M_2$.
2.2 Solution Outline
The second-order moment can be estimated directly from the observations, without inferring the topic mixing proportions or estimating the model parameters. Our idea is that, when the sample size becomes large enough, $\widehat M_2$ approximates $M_2$ well; that is, $\sigma_{K+1}(\widehat M_2)$ is very close to zero while $\sigma_K(\widehat M_2)$ is bounded away from zero. Then, by picking a proper threshold $\tau$ satisfying $\sigma_{K+1}(\widehat M_2) < \tau < \sigma_K(\widehat M_2)$, we can obtain the value of $K$ by simply counting the number of singular values of $\widehat M_2$ greater than $\tau$. We work along two directions to achieve this goal: (1) examine the convergence rate of the singular values of $\widehat M_2$; (2) investigate the relationship between the spectral structure of $M_2$ and the model parameters. Next we provide the analysis from both directions.
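The counting step can be sketched as follows. This is a minimal sketch, not the paper's procedure: the pure-stdlib Jacobi eigenvalue routine, the synthetic rank-$K$ matrix standing in for a real $\widehat M_2$, and the threshold value are all illustrative choices.

```python
import math
import random

def sym_eigvals(a):
    # Eigenvalues of a symmetric matrix via cyclic Jacobi rotations.
    n = len(a)
    a = [row[:] for row in a]
    for _ in range(100):
        off = max(abs(a[i][j]) for i in range(n) for j in range(n) if i != j)
        if off < 1e-12:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                th = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(th), math.sin(th)
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
    return sorted((a[i][i] for i in range(n)), reverse=True)

def count_topics(m2_hat, tau):
    # For a symmetric matrix the singular values are |eigenvalues|:
    # count those exceeding the threshold tau.
    return sum(1 for lam in sym_eigvals(m2_hat) if abs(lam) > tau)

# Synthetic check: a rank-K matrix plus small symmetric noise.
rng = random.Random(1)
V, K = 8, 3
m2 = [[0.0] * V for _ in range(V)]
for _ in range(K):
    mu = [rng.uniform(-1.0, 1.0) for _ in range(V)]
    for i in range(V):
        for j in range(V):
            m2[i][j] += mu[i] * mu[j]
noise = [[rng.uniform(-1e-3, 1e-3) for _ in range(V)] for _ in range(V)]
m2_hat = [[m2[i][j] + 0.5 * (noise[i][j] + noise[j][i]) for j in range(V)]
          for i in range(V)]
print(count_topics(m2_hat, tau=0.1))
```

Because the noise is tiny relative to the threshold, the count recovers the planted rank.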
2.3 Convergence of the Empirical Second-Order Moment
Without loss of generality, we assume that both the topic mixing proportions and the topics are generated from symmetric Dirichlet distributions, and, for simplicity, that all documents have the same length. Since $\widehat M_2$
is an unbiased estimator of $M_2$
by definition, we can bound the difference between the singular values of $\widehat M_2$ and those of $M_2$ by bounding the variance, as follows.

Theorem 2.1.
In particular, when , we have
(1) 
Proof.
Let and denote the spectral and Frobenius norms of the difference matrix, respectively. We denote by the
th largest eigenvalue of a matrix
. We establish the result through the following chain of inequalities: step (i) follows directly from the fact that $M_2$ is positive semidefinite and $\widehat M_2$ is symmetric; the detailed proof is deferred to Lemma A.1 in the Appendix. Steps (ii) and (iii) are well-known results on matrix norms and matrix perturbation theory [HJ]. In Lemma 2.2, we provide the upper bound on the Frobenius norm of the difference matrix. Because , i.e., for , we therefore have . ∎
Lemma 2.2.
For the LDA model, with probability at least , we have .
Proof.
We first compute the expectation and then use Markov's inequality to complete the proof. The squared Frobenius norm is . Since we have , . The expectation of can be calculated as
It remains to calculate the conditional variances of and , which are given in Lemma 2.3.
Then, by Markov's inequality, for any , we have
Setting , with probability at least , we have
∎
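The Markov step above can be written generically as follows (hedged notation: $\Delta = \widehat M_2 - M_2$ and $\delta$ denotes the failure probability):

```latex
\Pr\!\left[\|\Delta\|_F^2 \ge \frac{\mathbb{E}\|\Delta\|_F^2}{\delta}\right] \le \delta
\quad\Longrightarrow\quad
\|\Delta\|_F \le \sqrt{\frac{\mathbb{E}\|\Delta\|_F^2}{\delta}}
\ \text{ with probability at least } 1-\delta .
```

This is why a computable upper bound on the expected squared Frobenius norm translates directly into a high-probability bound on the estimation error.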
Lemma 2.3.
For the LDA model, the following holds
and
for , where denotes the higher-order terms.
We make a few relaxations and introduce notation (keeping the dominant terms and absorbing the rest into the higher-order term, to obtain an upper bound on the variance). To be rigorous, we make the following assumptions on the scale of each statistic and parameter: , , , , , , and . The calculation of the variance is provided in Appendix D.
It is interesting to examine the roles of the corpus statistics in the bound. The variance bound decreases to zero as the number of documents grows; even if there are only two words in each document, the bound would still converge to zero. A similar observation is made in [AGH12]. The remaining quantities have a similar influence on the bound.
To apply the results above, we simply ignore the higher-order terms. However, because the higher-order terms grow as the corpus statistics decrease, one should pay extra attention when these statistics are far from the asymptotic regime. As shown in our simulation studies, our bound yields convincing results when the statistics are on the scale of hundreds or above, which is common in real-world applications.
2.4 Spectral Structure of the True Second-Order Moment
The spectral structure of $M_2$ depends on the topic parameters and the Dirichlet hyperparameters. We use the following theorem to characterize it.
Theorem 2.4.
Assume that , , and . Then the following hold:

With probability at least we have
(2) 
With probability at least , we have
(3)
Proof.
We have , where is a matrix and is a diagonal matrix. The first singular values of are also the first singular values of . Moreover, we have
and
To estimate the singular values of , we utilize the fact that
. The random variables in the same column of
are dependent on one another; thus, powerful results from random matrix theory cannot be applied directly. To decouple the dependency, we design a diagonal matrix
, whose diagonal elements are drawn independently from . In this way, is a matrix with independent elements, i.e., each element is an i.i.d. random variable following . We denote each row of as ; then . In order to apply the matrix Chernoff bound [Tro12], we need to bound the spectral norm of each summand, i.e., . Because each summand is a rank-one matrix, we have . By Lemma C.3 (see the Appendix) and the union bound, with probability greater than , we have
We also have and . Applying the matrix Chernoff bound to , with probability greater than
we have
And with probability greater than
we have
With certain assumptions on and , we can fully utilize the bounds above. If we assume that , , then and . Therefore, $\sigma_K$ decreases rapidly as $K$ increases, approximately as . This fact leads to increasing difficulty in distinguishing topics with small singular values from noise. Note that the largest singular value also decreases, at a slower rate, as $K$ increases.
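The qualitative behavior of $\sigma_K$ can be checked numerically. The sketch below is our own illustration, not the paper's construction: it uses uniform weights $1/K$ and a symmetric Dirichlet(1) prior over the topics, together with the fact that the nonzero eigenvalues of $\frac{1}{K}\sum_k \mu_k\mu_k^\top$ equal the eigenvalues of the $K \times K$ Gram matrix with entries $\mu_i \cdot \mu_j / K$.

```python
import math
import random

def sym_eigvals(a):
    # Eigenvalues of a symmetric matrix via cyclic Jacobi rotations.
    n = len(a)
    a = [row[:] for row in a]
    for _ in range(100):
        off = max(abs(a[i][j]) for i in range(n) for j in range(n) if i != j)
        if off < 1e-12:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                th = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(th), math.sin(th)
                for k in range(n):
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
                for k in range(n):
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
    return sorted((a[i][i] for i in range(n)), reverse=True)

def sample_dirichlet(dim, beta, rng):
    g = [rng.gammavariate(beta, 1.0) for _ in range(dim)]
    s = sum(g)
    return [x / s for x in g]

def kth_singular_value(V, K, beta, rng):
    # sigma_K of (1/K) * sum_k mu_k mu_k^T, computed via the K x K Gram matrix.
    mus = [sample_dirichlet(V, beta, rng) for _ in range(K)]
    gram = [[sum(a * b for a, b in zip(mus[i], mus[j])) / K
             for j in range(K)] for i in range(K)]
    return min(sym_eigvals(gram))

rng = random.Random(0)
for K in (2, 5, 10):
    vals = [kth_singular_value(30, K, 1.0, rng) for _ in range(20)]
    print(K, sum(vals) / len(vals))
```

Averaged over a few random topic draws, the printed $\sigma_K$ shrinks as $K$ grows, matching the qualitative claim above.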
2.5 Analysis of the Number of Topics
Figure 1: Simulation results for the convergence of the empirical second-order moment (panels (a)-(c)) and for the spectral structure of the true moment (panels (d)-(f)).
The convergence of the empirical moment and the spectral structure of the true moment provide us with upper and lower bounds on the singular values of the empirical second-order moment $\widehat M_2$. We can infer the number of topics via the following steps:
First, by setting the threshold to , thresholding provides a lower bound on $K$, since with high probability every spurious topic has a singular value smaller than the threshold.¹

Secondly, if we set the threshold to , thresholding provides an upper bound on $K$, since with high probability every true topic has a singular value greater than the threshold. However, this threshold is not computable, since it depends on the true number of topics $K$.

Instead, we can directly utilize the upper bound on the largest singular value to provide an upper bound for $K$. We have the inequality shown in Theorem 2.4: the left-hand side is determined by the observed corpus, and the right-hand side is a function of $K$. When the right-hand side decreases as $K$ increases (see the discussion in Section 2.4), solving the inequality leads to an upper bound on $K$.

¹ Strictly speaking, there is no one-to-one correspondence between topics and the singular values of the second-order moment. Here we refer to the correspondence in terms of the total number of topics.
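Given an observed top singular value and a computable, non-increasing right-hand side $g(K)$, the upper bound can be extracted by a simple integer bisection. The sketch below is illustrative only: `g` is a hypothetical placeholder for the theorem's actual bound, which we do not reproduce here.

```python
def upper_bound_topics(sigma1_hat, g, k_max=10**6):
    """Largest K with g(K) >= sigma1_hat, assuming g is non-increasing.

    Returns None if even K = 1 is inconsistent with the observation.
    """
    if g(1) < sigma1_hat:
        return None
    lo, hi = 1, k_max
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if g(mid) >= sigma1_hat:
            lo = mid  # mid is still consistent; the answer lies at mid or above
        else:
            hi = mid - 1
    return lo

# Illustrative, hypothetical bound g(K) = c / K with c = 10.
print(upper_bound_topics(0.5, lambda k: 10.0 / k))  # prints 20
```

The bisection needs only monotonicity of the bound in $K$, which is exactly the regime discussed above.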
3 Experimental Results
We validate our theoretical results by conducting experiments on synthetic datasets generated according to the LDA model. For each experimental setting, we report results averaged over five random runs.
In the first set of experiments, we test the convergence of the second-order moment as a function of the corpus statistics. The parameter setting is as follows: , and . We vary the dictionary size, the document length, or the number of documents while keeping the other two fixed. The detailed settings are summarized as follows:

(a) Fix and , and vary the length of the documents from to .

(b) Fix and , and vary the number of documents from to .

(c) Fix and , and vary the size of the dictionary from to .
Figure 1 (a-c) shows the matrix norms of the estimation error and the $K$-th and $(K+1)$-th largest singular values of $\widehat M_2$. The results match our theoretical analysis nicely, in that the bound serves as an accurate upper bound on the Frobenius norm of the error. When the amount of data is large enough, the red line goes below the purple line, which indicates that with enough data, thresholding provides a tight lower bound on the number of topics.
In the second experiment, we evaluate our bounds on the spectral structure of in Theorem 2.4. Similarly, we vary , or while keeping the other two parameters fixed. The detailed settings are as follows:

(a) Fix , , and , and vary from to .

(b) Fix , , and , and vary the number of topics from to .

(c) Fix , , and , and vary the size of the dictionary from to .
The results in Figure 1 (df) match well with our theoretical analysis.
In the last experiment, we calculate the upper and lower bounds on $K$ while varying the number of documents or the length of the documents. The results are presented in Figure 2. As we can see, the lower bound indeed converges to the true number of topics. However, the upper bound converges to a value other than the ground truth, partly because the upper bound involves quantities that do not change as the size of the dataset increases. The experimental results demonstrate that our upper and lower bounds can effectively narrow down the range of possible values of $K$.
Figure 2: Upper and lower bounds on the number of topics: (a) varying , (b) varying , (c) varying .
4 Discussion and Conclusions
So far we have shown that, for the LDA model, by investigating the convergence of the empirical moment and the spectral structure of the expected moment, the singular values of the empirical moment provide useful information on the number of topics. This line of research suggests an interesting direction for analyzing mixture models in general [HK13]. Next we show how to generalize our methodology, using Gaussian mixture models (GMMs) as an example.
4.1 Generalization
Our analysis can be easily generalized to other mixture models whose low-order moments have the same structure, namely a weighted sum of the outer products of the mixture components. Convergence analysis of the empirical moment leads to the lower bound on the number of mixture components, while solving an inequality on the first singular value provides an upper bound. In order to derive the convergence bound, the variance of the empirical moment needs to be computed. Moreover, we need to explore the spectral structure of the true moment to provide an upper bound and a lower bound on the first and the th singular values, respectively.
As an example, we next show how to carry out the analysis for the Gaussian mixture model [Bis06] with spherical mixture components.
A GMM assumes that the data points are generated from a mixture of multivariate Gaussian components. That is, for a dataset generated from a spherical Gaussian mixture with components, we assume that
where is the mixture probability, is the component assignment for the th data point, and is a
-dimensional spherical Gaussian distribution with
. We further assume and for a Bayesian version of the GMM, and we assume that the following parameters are known: .

The problem of correctly choosing the number of mixture components has been studied extensively, for example with traditional methods (cross validation, AIC, and BIC [LV10]), penalized likelihood methods [THK13], and variational approaches [CB01]. As with the LDA model, we show that analyzing the empirical moments provides an alternative way to bound the number of mixture components.
We define the empirical second-order moment as and the second-order moment as the expectation of the empirical moment, namely . Then, by a similar analysis, we have the following theorem for GMMs:
Theorem 4.1.
Let , then:

(1) Let be the number of singular values of such that , where
then with probability at least , we have

(2) Let be the maximal integer such that
Then with probability at least , we have
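The moment construction for the spherical GMM can be sketched as follows. This is a hedged sketch: it relies on the identity $\mathbb{E}[xx^\top] = \sum_k w_k\,\mu_k\mu_k^\top + \sigma^2 I$, which holds for spherical components with known variance, to strip the isotropic part; the function names and parameter values are our own illustrative choices.

```python
import random

def sample_gmm(n, means, weights, sigma, rng):
    # Draw n points from a spherical Gaussian mixture.
    data = []
    for _ in range(n):
        r, acc, z = rng.random(), 0.0, len(weights) - 1
        for k, w in enumerate(weights):
            acc += w
            if r < acc:
                z = k
                break
        data.append([m + rng.gauss(0.0, sigma) for m in means[z]])
    return data

def empirical_second_moment(data, sigma):
    # hat M2 = (1/n) sum_i x_i x_i^T - sigma^2 I; its expectation is
    # sum_k w_k mu_k mu_k^T, a rank-K matrix.
    n, d = len(data), len(data[0])
    m2 = [[0.0] * d for _ in range(d)]
    for x in data:
        for i in range(d):
            for j in range(d):
                m2[i][j] += x[i] * x[j] / n
    for i in range(d):
        m2[i][i] -= sigma * sigma
    return m2

rng = random.Random(0)
means = [[2.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0]]
weights = [0.5, 0.3, 0.2]
sigma = 0.2
data = sample_gmm(4000, means, weights, sigma, rng)
m2_hat = empirical_second_moment(data, sigma)

# Compare against the rank-K target sum_k w_k mu_k mu_k^T in Frobenius norm.
d = 4
target = [[sum(w * mu[i] * mu[j] for w, mu in zip(weights, means))
           for j in range(d)] for i in range(d)]
err = sum((m2_hat[i][j] - target[i][j]) ** 2
          for i in range(d) for j in range(d)) ** 0.5
print(err)
```

With enough samples the stripped moment concentrates around the rank-$K$ target, so the same singular-value counting strategy used for LDA applies.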
4.2 Conclusion
In this paper, we provide a theoretical analysis for model selection in LDA. Specifically, we present both an upper bound and a lower bound on the number of topics, based on the connection between the second-order moment and the latent topics. The lower bound is obtained by bounding the difference between the estimated second-order moment and the true moment, while the upper bound is obtained by analyzing the largest singular value. Furthermore, our analysis can be easily generalized to other latent variable models, such as Gaussian mixture models.
One major limitation of our approach is that all of our analysis assumes the data are generated exactly according to the LDA model. As a result, the results may not hold when applied to real-world datasets.
For future work, we will examine effective ways to strengthen the theoretical results, for example by bounding higher-order moments or by replacing Markov's inequality with tighter concentration inequalities. Moreover, we could bound the spectral norm of the error directly instead of its Frobenius norm, which would potentially yield tighter bounds.
5 Acknowledgments
We thank Fei Sha and David Kale for helpful discussions and suggestions. The research was sponsored by the NSF research grants IIS-1254206 and the U.S. Defense Advanced Research Projects Agency (DARPA) under the Social Media in Strategic Communication (SMISC) program, Agreement Number W911NF-12-1-0034. The views and conclusions are those of the authors and should not be interpreted as representing the official policies of the funding agency or the U.S. Government.
References
 [AEF10] Edoardo M. Airoldi, Elena A. Erosheva, Stephen E. Fienberg, Cyrille Joutard, Tanzy Love, and Suyash Shringarpure. Reconceptualizing the classification of PNAS articles. Proceedings of the National Academy of Sciences, 107(49):20899–20904, 2010.
 [AFH12] Anima Anandkumar, Dean P. Foster, Daniel Hsu, Sham Kakade, and Yi-Kai Liu. A spectral algorithm for latent Dirichlet allocation. In NIPS, pages 926–934, 2012.
 [AGH12] Anima Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. arXiv preprint arXiv:1210.7559, 2012.
 [AGM12] Sanjeev Arora, Rong Ge, and Ankur Moitra. Learning topic models – going beyond SVD. In FOCS, pages 1–10, 2012.
 [Aka74] Hirotugu Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723, 1974.
 [Bis06] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag New York, Inc., 2006.
 [BNJ03] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. JMLR, 3:993–1022, 2003.
 [CB01] Adrian Corduneanu and Christopher M. Bishop. Variational Bayesian model selection for mixture distributions. In AISTATS, pages 27–34, 2001.
 [GS04] Thomas L. Griffiths and Mark Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101:5228–5235, April 2004.
 [HBWP13] Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. JMLR, 14(1):1303–1347, 2013.
 [HJ] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 1985.
 [HK13] Daniel Hsu and Sham M. Kakade. Learning mixtures of spherical Gaussians: moment methods and spectral decompositions. In ITCS, 2013.
 [KRS14] Alex Kulesza, N. Raj Rao, and Satinder Singh. Low-rank spectral learning. In ICML, 2014.
 [LM00] Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5):1302–1338, 2000.
 [LV10] Olga Lukociene and Jeroen K. Vermunt. Determining the number of components in mixture models for hierarchical data. In Advances in Data Analysis, Data Handling and Business Intelligence, pages 241–249, 2010.
 [MH13] Jeffrey W. Miller and Matthew T. Harrison. A simple example of Dirichlet process mixture inconsistency for the number of components. In NIPS, pages 199–206, 2013.
 [PNI08] Ian Porteous, David Newman, Alexander Ihler, Arthur Asuncion, Padhraic Smyth, and Max Welling. Fast collapsed Gibbs sampling for latent Dirichlet allocation. In KDD, pages 569–577, 2008.
 [S78] Gideon Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.
 [SR09] Russell J. Steele and Adrian E. Raftery. Performance of Bayesian model selection criteria for Gaussian mixture models. Technical Report 559, Department of Statistics, University of Washington, 2009.
 [Tad12] Matt Taddy. On estimation and selection for topic models. In AISTATS, pages 1184–1193, 2012.
 [THK13] Huang Tao, Peng Heng, and Zhang Kun. Model selection for Gaussian mixture models. arXiv preprint arXiv:1301.3558, 2013.
 [TJBB06] Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476), 2006.
 [TKW07] Yee Whye Teh, Kenichi Kurihara, and Max Welling. Collapsed variational inference for HDP. In NIPS, 2007.
 [TMN14] Jian Tang, Zhaoshi Meng, Xuanlong Nguyen, Qiaozhu Mei, and Ming Zhang. Understanding the limiting factors of topic modeling via posterior contraction analysis. In ICML, 2014.
 [Tro12] Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.
 [Ver10] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
 [WPB11] Chong Wang, John W. Paisley, and David M. Blei. Online variational inference for the hierarchical Dirichlet process. In AISTATS, pages 752–760, 2011.
Appendix A Theoretical Results for LDA
A.1 Coefficient Settings for Theorem 2.4
Bound of
We have that with probability greater than
we have
We can choose and as follows to simplify the formula of the bound:

(1) Choose , so that the first probability term is less than .

(2) Choose , so that the third probability term is less than .

(3) Choose as
so that the second probability term is less than .

As a result, with probability greater than , we have
As an alternative, we can choose and as follows to simplify the formula of the bound:

(1) Choose , so that the first probability term is less than .

(2) Choose , so that the third probability term is less than .

(3) Choose , so that the second probability term is less than .

As a result, with probability greater than
we have
Bound of
We have that with probability greater than
we have
We can choose and as follows to simplify the formula of the bound:

(1) Choose , so that the first probability term is less than .

(2) Choose , so that the third probability term is less than .

(3) Choose as
so that the second probability term is less than .

As a result, with probability greater than , we have
As an alternative, we can choose and as follows to simplify the formula of the bound:

(1) Choose , so that the third probability term is less than .

(2) Choose , so that the first probability term is less than .

(3) Choose , so that the second probability term is less than .

As a result, with probability greater than
we have
A.2 Lemma for Theorem 2.1
Lemma A.1.
With and as previously defined, we have that
Proof.
Because is a symmetric positive semidefinite matrix, we have
And because is a symmetric matrix, we have
for some permutation .
Because , we have .
Let be the smallest index such that ; then for , we have
By the fact that