1 Introduction
Topic modeling is a popular method that learns thematic structure from large document collections without human supervision. The model is simple: documents are mixtures of topics, which are modeled as distributions over a vocabulary [Blei, 2012]. Each word token is generated by selecting a topic from a document-specific distribution, and then selecting a specific word from that topic-specific distribution. Posterior inference over document-topic and topic-word distributions is intractable: in the worst case it is NP-hard even for just two topics [Arora et al., 2012b]. As a result, researchers have used approximate inference techniques such as singular value decomposition [Deerwester et al., 1990], variational inference [Blei et al., 2003], and MCMC [Griffiths & Steyvers, 2004].
Recent work in theoretical computer science focuses on designing provably efficient algorithms for topic modeling. These treat the topic modeling problem as one of statistical recovery: assuming the data was generated perfectly from the hypothesized model using an unknown set of parameter values, the goal is to recover the model parameters in polynomial time given a reasonable number of samples.
Arora et al. [2012b] present an algorithm that provably recovers the parameters of topic models provided that the topics meet a certain separability assumption [Donoho & Stodden, 2003]. Separability requires that every topic contains at least one anchor word that has nonzero probability only in that topic. If a document contains this anchor word, then it is guaranteed that the corresponding topic is among the set of topics used to generate the document. The algorithm proceeds in two steps: first it selects anchor words for each topic; and second, in the recovery step, it reconstructs topic distributions given those anchor words. The input for the algorithm is the second-order moment matrix of word-word co-occurrences.
Anandkumar et al. [2012] present a provable algorithm based on third-order moments that does not require separability, but, unlike the algorithm of Arora et al., assumes that topics are not correlated. Although standard topic models like LDA [Blei et al., 2003] assume that the topics used to generate a document are uncorrelated, there is strong evidence that topics are dependent [Blei & Lafferty, 2007, Li & McCallum, 2007]: economics and politics are more likely to co-occur than economics and cooking.
Both algorithms run in polynomial time, but the bounds that have been proven on their sample complexity are weak and their empirical runtime performance is slow. The algorithm presented by Arora et al. [2012b] solves numerous linear programs to find anchor words. Bittorf et al. [2012] and Gillis [2012] reduce the number of linear programs needed. All of these algorithms infer topics given anchor words using matrix inversion, which is notoriously unstable and noisy: matrix inversion frequently generates negative values for topic-word probabilities.
In this paper we present three contributions. First, we replace linear programming with a combinatorial anchor selection algorithm. So long as the separability assumption holds, we prove that this algorithm is stable in the presence of noise and thus has polynomial sample complexity for learning topic models. Second, we present a simple probabilistic interpretation of topic recovery given anchor words that replaces matrix inversion with a new gradient-based inference method. Third, we present an empirical comparison between recovery-based algorithms and existing likelihood-based topic inference. We study both the empirical sample complexity of the algorithms on synthetic distributions and the performance of the algorithms on real-world document corpora. We find that our algorithm performs as well as collapsed Gibbs sampling on a variety of metrics, and runs at least an order of magnitude faster.
Our algorithm both inherits the provable guarantees from Arora et al. [2012a, b] and results in simple, practical implementations. We view our work as a step toward bridging the gap between statistical recovery approaches to machine learning and maximum likelihood estimation, allowing us to circumvent the computational intractability of maximum likelihood estimation yet still be robust to model error.
2 Background
We consider the learning problem for a class of admixture distributions that are frequently used for probabilistic topic models. Examples of such distributions include latent Dirichlet allocation [Blei et al., 2003], correlated topic models [Blei & Lafferty, 2007], and Pachinko allocation [Li & McCallum, 2007]. We denote the number of words in the vocabulary by $V$ and the number of topics by $K$. Associated with each topic $k$ is a multinomial distribution over the words in the vocabulary, which we will denote as the column vector $A_k$ of length $V$. Each of these topic models postulates a particular prior distribution $\tau$ over the topic distribution of a document. For example, in latent Dirichlet allocation (LDA) $\tau$ is a Dirichlet distribution, and for the correlated topic model $\tau$ is a logistic Normal distribution. The generative process for a document $d$ begins by drawing the document’s topic distribution $W_d \sim \tau$. Then, for each position $i$ we sample a topic assignment $z_i \sim W_d$, and finally a word $w_i \sim A_{z_i}$.
We can combine the column vectors $A_k$ for each of the $K$ topics to obtain the word-topic matrix $A$ of dimension $V \times K$. We can similarly combine the column vectors $W_d$ for $M$ documents to obtain the topic-document matrix $W$ of dimension $K \times M$. We emphasize that $W$ is unknown and stochastically generated: we can never expect to be able to recover it. The learning task that we consider is to find the word-topic matrix $A$. For the case when $\tau$ is Dirichlet (LDA), we also show how to learn hyperparameters of $\tau$.
Maximum likelihood estimation of the word-topic distributions is NP-hard even for two topics [Arora et al., 2012b], and as a result researchers typically use approximate inference. The most popular approaches are variational inference [Blei et al., 2003], which optimizes an approximate objective, and Markov chain Monte Carlo [McCallum, 2002], which asymptotically samples from the posterior distribution but has no guarantees of convergence.
Arora et al. [2012b] present an algorithm that provably learns the parameters of a topic model given samples from the model, provided that the word-topic distributions satisfy a condition called separability:
Definition 2.1.
The word-topic matrix $A$ is $p$-separable for $p > 0$ if for each topic $k$, there is some word $i$ such that $A_{i,k} \ge p$ and $A_{i,k'} = 0$ for $k' \neq k$.
Such a word is called an anchor word because when it occurs in a document, it is a perfect indicator that the document is at least partially about the corresponding topic, since there is no other topic that could have generated the word. Suppose that each document is of length $D \ge 2$, and let $R = E_\tau[W W^T]$ be the $K \times K$ topic-topic covariance matrix. Let $\alpha_k$ be the expected proportion of topic $k$ in a document generated according to $\tau$. The main result of Arora et al. [2012b] is:
Theorem 2.2.
There is a polynomial time algorithm that learns the parameters of a topic model if the number of documents is at least
where is defined above, is the condition number of , and . The algorithm learns the wordtopic matrix and the topictopic covariance matrix up to additive error .
Unfortunately, this algorithm is not practical. Its running time is prohibitively large because it solves linear programs, and its use of matrix inversion makes it unstable and sensitive to noise. In this paper, we will give various reformulations and modifications of this algorithm that alleviate these problems altogether.
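The generative process and the separability condition of Definition 2.1 can be illustrated with a short sketch. This is only an illustration under stated assumptions, not part of the original algorithm: it assumes an LDA-style Dirichlet prior, and the function names `is_separable` and `sample_document`, the toy matrix, and the hyperparameter values are all hypothetical.

```python
import numpy as np

def is_separable(A, p):
    """Check Definition 2.1: every topic k has some word i with
    A[i, k] >= p and A[i, k'] == 0 for every other topic k'."""
    for k in range(A.shape[1]):
        others = np.delete(A, k, axis=1)            # columns of the other topics
        is_anchor = (A[:, k] >= p) & np.all(others == 0, axis=1)
        if not is_anchor.any():
            return False
    return True

def sample_document(A, alpha, length, rng):
    """Draw one document: topic proportions from a Dirichlet prior (assumed),
    then a topic per position, then a word from that topic's column of A."""
    V, K = A.shape
    w_d = rng.dirichlet(alpha)                      # document's topic distribution
    z = rng.choice(K, size=length, p=w_d)           # topic assignment per position
    return np.array([rng.choice(V, p=A[:, k]) for k in z])

# Toy word-topic matrix (V = 4 words, K = 2 topics); columns sum to 1.
# Word 0 is an anchor for topic 0 and word 1 is an anchor for topic 1.
A_toy = np.array([[0.5, 0.0],
                  [0.0, 0.4],
                  [0.3, 0.3],
                  [0.2, 0.3]])
rng = np.random.default_rng(0)
doc = sample_document(A_toy, alpha=np.array([0.5, 0.5]), length=50, rng=rng)
```

In this toy matrix the weakest anchor has probability 0.4, so the matrix is $p$-separable for any $p \le 0.4$ but not for larger $p$.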
3 A Probabilistic Approach to Exploiting Separability
The Arora et al. [2012b] algorithm has two steps: anchor selection, which identifies anchor words, and recovery, which recovers the parameters of $A$ and of $\tau$. Both anchor selection and recovery take as input the matrix $Q$ (of size $V \times V$) of word-word co-occurrence counts, whose construction is described in the supplementary material. $Q$ is normalized so that the sum of all entries is 1. The high-level flow of our complete learning algorithm is described in Algorithm 1, and follows the same two steps. In this section we will introduce a new recovery method based on a probabilistic framework. We defer the discussion of anchor selection to the next section, where we provide a purely combinatorial algorithm for finding the anchor words.
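The construction of the co-occurrence matrix is deferred to the supplementary material and is not reproduced here. As a rough sketch only, the code below assumes one natural estimator: each document contributes its count of ordered pairs of distinct word positions, normalized by the number of such pairs, and the contributions are averaged over documents. The function name `cooccurrence_matrix` and the input format (lists of integer word ids) are assumptions.

```python
import numpy as np

def cooccurrence_matrix(docs, V):
    """Estimate a V x V word-word co-occurrence matrix whose entries sum to 1.

    Assumed estimator: for a document with word-count vector h and length n,
    add (h h^T - diag(h)) / (n (n - 1)), i.e. the fraction of ordered pairs of
    distinct positions holding words (i, j); then average over documents.
    """
    Q = np.zeros((V, V))
    used = 0
    for doc in docs:
        n = len(doc)
        if n < 2:                      # a document needs at least two tokens
            continue
        h = np.bincount(doc, minlength=V).astype(float)
        Q += (np.outer(h, h) - np.diag(h)) / (n * (n - 1))
        used += 1
    return Q / used

docs_toy = [[0, 1, 1, 2], [2, 2, 3], [0, 3]]
Q_toy = cooccurrence_matrix(docs_toy, V=4)
```

Each document contributes a matrix summing to 1, so the average also sums to 1, matching the normalization described above.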
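The original recovery procedure (which we call “Recover”) from Arora et al. [2012b] is as follows. First, it permutes the $Q$ matrix so that the first $K$ rows and columns correspond to the anchor words. We will use the notation $Q_S$ to refer to the first $K$ rows, and $Q_{S,S}$ for the first $K$ rows and just the first $K$ columns. If constructed from infinitely many documents, $Q$ would be the second-order moment matrix $Q = E[A W W^T A^T] = A\, E[W W^T]\, A^T = A R A^T$, with the following block structure:
$$Q = A R A^T = \begin{pmatrix} D \\ U \end{pmatrix} R \begin{pmatrix} D & U^T \end{pmatrix} = \begin{pmatrix} D R D & D R U^T \\ U R D & U R U^T \end{pmatrix},$$
where $D$ is a diagonal matrix of size $K \times K$. Next, it solves for $A$ and $R$ using the algebraic manipulations outlined in Algorithm 2.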
The use of matrix inversion in Algorithm 2 results in substantial imprecision in the estimates when we have small sample sizes. The returned $A$ and $R$ matrices can even contain small negative values, requiring a subsequent projection onto the simplex. As we will show in Section 5, the original recovery method performs poorly relative to a likelihood-based algorithm. Part of the problem is that the original Recover algorithm uses only $K$ rows of the matrix $Q$ (the rows for the anchor words), whereas $Q$ is of dimension $V \times V$. Besides ignoring most of the data, this has the additional complication that it relies completely on co-occurrences between a word and the anchors, and this estimate may be inaccurate if both words occur infrequently.
Here we adopt a new probabilistic approach, which we describe below after introducing some notation. Consider any two words in a document and call them $w_1$ and $w_2$, and let $z_1$ and $z_2$ refer to their topic assignments. We will use $A_{i,k}$ to index the matrix of word-topic distributions, i.e. $A_{i,k} = p(w_1 = i \mid z_1 = k) = p(w_2 = i \mid z_2 = k)$. Given infinite data, the elements of the $Q$ matrix can be interpreted as $Q_{i,j} = p(w_1 = i, w_2 = j)$. The row-normalized $Q$ matrix, denoted $\bar{Q}$, which plays a role in both finding the anchor words and the recovery step, can be interpreted as a conditional probability $\bar{Q}_{i,j} = p(w_2 = j \mid w_1 = i)$.
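Denoting the indices of the anchor words as $S = \{s_1, s_2, \ldots, s_K\}$, the rows indexed by elements of $S$ are special in that every other row of $\bar{Q}$ lies in the convex hull of the rows indexed by the anchor words. To see this, first note that for an anchor word $s_k$,
$$\bar{Q}_{s_k,j} = \sum_{k'} p(z_1 = k' \mid w_1 = s_k)\, p(w_2 = j \mid z_1 = k') \qquad (1)$$
$$\phantom{\bar{Q}_{s_k,j}} = p(w_2 = j \mid z_1 = k), \qquad (2)$$
where (1) uses the fact that in an admixture model $w_2 \perp w_1 \mid z_1$, and (2) is because $p(z_1 = k \mid w_1 = s_k) = 1$. For any other word $i$, we have
$$\bar{Q}_{i,j} = \sum_{k} p(z_1 = k \mid w_1 = i)\, p(w_2 = j \mid z_1 = k).$$
Denoting the probability $p(z_1 = k \mid w_1 = i)$ as $C_{i,k}$, we have $\bar{Q}_{i,j} = \sum_k C_{i,k} \bar{Q}_{s_k,j}$. Since $C$ is nonnegative and $\sum_k C_{i,k} = 1$, we have that any row of $\bar{Q}$ lies in the convex hull of the rows corresponding to the anchor words. The mixing weights give us $p(z_1 \mid w_1 = i)$! Using this together with $p(w_1 = i)$, we can recover the $A$ matrix simply by using Bayes’ rule:
$$p(w_1 = i \mid z_1 = k) = \frac{p(z_1 = k \mid w_1 = i)\, p(w_1 = i)}{\sum_{i'} p(z_1 = k \mid w_1 = i')\, p(w_1 = i')}.$$
Finally, we observe that $p(w_1 = i)$ is easy to solve for since $\sum_j Q_{i,j} = \sum_j p(w_1 = i, w_2 = j) = p(w_1 = i)$.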
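The argument above can be checked numerically in the infinite-data setting. The sketch below is an illustration only: the toy word-topic matrix, the toy topic-topic moment matrix, and all variable names are assumptions. It builds the exact co-occurrence matrix, confirms that every row of its row-normalized form is a convex combination of the anchor rows, and then recovers the word-topic matrix by Bayes' rule.

```python
import numpy as np

# Toy separable word-topic matrix (V = 5, K = 2); words 0 and 1 are anchors.
A = np.array([[0.3, 0.0],
              [0.0, 0.2],
              [0.4, 0.1],
              [0.2, 0.3],
              [0.1, 0.4]])
# Toy topic-topic second-moment matrix E[W W^T]; its entries sum to 1.
R = np.array([[0.35, 0.15],
              [0.15, 0.35]])
anchors = [0, 1]

Q = A @ R @ A.T                        # Q[i, j] = p(w1 = i, w2 = j)
p_w = Q.sum(axis=1)                    # p(w1 = i): row sums of Q
Q_bar = Q / p_w[:, None]               # Q_bar[i, j] = p(w2 = j | w1 = i)

# Ground-truth mixing weights C[i, k] = p(z1 = k | w1 = i), via Bayes' rule
# from A and p(z1 = k) (the row sums of R).
p_z = R.sum(axis=1)
C = A * p_z[None, :] / p_w[:, None]

# Every row of Q_bar is the convex combination C[i] of the anchor rows.
reconstruction = C @ Q_bar[anchors]

# Bayes' rule in the other direction recovers A from C and p(w1).
joint = C * p_w[:, None]               # p(z1 = k, w1 = i)
A_recovered = joint / joint.sum(axis=0, keepdims=True)
```

In practice the mixing weights are not known and must be estimated from the empirical matrix; here they are computed from the ground truth only to verify the identities.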
Our new algorithm finds, for each row $\bar{Q}_i$ of the empirical row-normalized co-occurrence matrix, the coefficients $p(z_1 \mid w_1 = i)$ that best reconstruct it as a convex combination of the rows that correspond to anchor words. This step can be solved quickly and in parallel (independently) for each word using the exponentiated gradient algorithm. Once we have $p(z_1 \mid w_1)$, we recover the $A$ matrix using Bayes’ rule. The full algorithm using KL divergence as an objective is found in Algorithm 3. Further details of the exponentiated gradient algorithm are given in the supplementary material.
One reason to use KL divergence as the measure of reconstruction error is that the recovery procedure can then be understood as maximum likelihood estimation. In particular, we seek the parameters $p(w_1)$, $p(z_1 \mid w_1)$, $p(w_2 \mid z_1)$ that maximize the likelihood of observing the word co-occurrence counts, $\hat{p}(w_1, w_2)$. However, the optimization problem does not explicitly constrain the parameters to correspond to an admixture model.
We can also define a similar algorithm using quadratic loss, which we call RecoverL2. This formulation has the extremely useful property that both the objective and gradient can be kernelized so that the optimization problem is independent of the vocabulary size. To see this, notice that the objective can be rewritten as
$$\|\bar{Q}_i - C_i^T \bar{Q}_S\|^2 = \|\bar{Q}_i\|^2 - 2\, C_i^T (\bar{Q}_S \bar{Q}_i^T) + C_i^T (\bar{Q}_S \bar{Q}_S^T)\, C_i,$$
where $\bar{Q}_S \bar{Q}_S^T$ is $K \times K$ and can be computed once and used for all words, and $\bar{Q}_S \bar{Q}_i^T$ is $K \times 1$ and can be computed once prior to running the exponentiated gradient algorithm for word $i$.
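A minimal sketch of the quadratic-loss recovery step for a single word, using exponentiated gradient updates on the kernelized objective, might look as follows. It is not the implementation from the supplementary material: the function name `recover_l2_row`, the fixed step size, the fixed iteration count, and the uniform initialization are all assumptions, and a practical version would use a convergence test or line search.

```python
import numpy as np

def recover_l2_row(q_i, Q_S, step=0.5, iters=2000):
    """Find simplex weights c minimizing ||q_i - c^T Q_S||^2.

    q_i : length-V row of the row-normalized co-occurrence matrix.
    Q_S : K x V matrix whose rows are the anchor-word rows.
    """
    G = Q_S @ Q_S.T                   # K x K, shared by all words
    b = Q_S @ q_i                     # length K, computed once per word
    K = Q_S.shape[0]
    c = np.full(K, 1.0 / K)           # start at the center of the simplex
    for _ in range(iters):
        grad = 2.0 * (G @ c - b)      # gradient of the kernelized objective
        c = c * np.exp(-step * grad)  # multiplicative (exponentiated) update
        c /= c.sum()                  # renormalize onto the simplex
    return c

# Toy example: a word row that is an exact mixture of three anchor rows.
Q_S_toy = np.array([[0.7, 0.1, 0.1, 0.1],
                    [0.1, 0.7, 0.1, 0.1],
                    [0.1, 0.1, 0.7, 0.1]])
c_true = np.array([0.2, 0.5, 0.3])
c_hat = recover_l2_row(c_true @ Q_S_toy, Q_S_toy)
```

After `G` and `b` are formed, each iteration costs $O(K^2)$ regardless of vocabulary size, which is the point of the kernelized form.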
To recover the $R$ matrix for an admixture model, recall that $Q = A R A^T$. This may be an over-constrained system of equations with no solution for $R$, but we can find a least-squares approximation to $R$ by pre- and post-multiplying $Q$ by the pseudo-inverse $A^\dagger$. For the special case of LDA we can learn the Dirichlet hyperparameters. Recall that in applying Bayes’ rule we calculated $p(z_1) = \sum_{i'} p(z_1 \mid w_1 = i')\, p(w_1 = i')$. These values for $p(z)$ specify the Dirichlet hyperparameters up to a constant scaling. This constant could be recovered from the $R$ matrix [Arora et al., 2012b], but in practice we find it is better to choose it using a grid search to maximize the likelihood of the training data.
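The least-squares step can be sketched with a pseudo-inverse. The example below is illustrative only; the function name `recover_topic_moments` and the toy matrices are assumptions. With a noiseless co-occurrence matrix and a word-topic matrix of full column rank, the recovery is exact.

```python
import numpy as np

def recover_topic_moments(Q, A):
    """Least-squares estimate of the topic-topic matrix from Q ~ A R A^T."""
    A_pinv = np.linalg.pinv(A)        # K x V Moore-Penrose pseudo-inverse
    return A_pinv @ Q @ A_pinv.T

A_demo = np.array([[0.3, 0.0],
                   [0.0, 0.2],
                   [0.4, 0.1],
                   [0.2, 0.3],
                   [0.1, 0.4]])
R_demo = np.array([[0.35, 0.15],
                   [0.15, 0.35]])
R_hat = recover_topic_moments(A_demo @ R_demo @ A_demo.T, A_demo)

# Row sums of the recovered matrix give the expected topic proportions,
# which fix the direction (but not the scale) of the Dirichlet hyperparameters.
topic_proportions = R_hat.sum(axis=1)
```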
We will see in Section 5 that our nonnegative recovery algorithm performs much better on a wide range of performance metrics than the recovery algorithm in Arora et al. [2012b]. In the supplementary material we show that it also inherits the theoretical guarantees of Arora et al. [2012b]: given polynomially many documents, our algorithm returns an estimate $\hat{A}$ at most $\epsilon$ from the true word-topic matrix $A$.
4 A Combinatorial Algorithm for Finding Anchor Words
Here we consider the anchor selection step of the algorithm where our goal is to find the anchor words. In the infinite data case where we have infinitely many documents, the convex hull of the rows in $\bar{Q}$ will be a simplex where the vertices of this simplex correspond to the anchor words. Since we only have a finite number of documents, the rows of $\bar{Q}$ are only an approximation to their expectation. We are therefore given a set of $V$ points $d_1, d_2, \ldots, d_V$ that are each a perturbation of $a_1, a_2, \ldots, a_V$ whose convex hull $P$ defines a simplex. We would like to find an approximation to the vertices of $P$. See Arora et al. [2012a] and Arora et al. [2012b] for more details about this problem.
Arora et al. [2012a] give a polynomial time algorithm that finds the anchor words. However, their algorithm is based on solving $V$ linear programs, one for each word, to test whether or not a point is a vertex of the convex hull. In this section we describe a purely combinatorial algorithm for this task that avoids linear programming altogether. The new “FastAnchorWords” algorithm is given in Algorithm 4. To find all of the anchor words, our algorithm iteratively finds the furthest point from the subspace spanned by the anchor words found so far.
Since the points we are given are perturbations of the true points, we cannot hope to find the anchor words exactly. Nevertheless, the intuition is that even if one has only found $r$ points $S$ that are close to (distinct) anchor words, the point that is furthest from $\mathrm{span}(S)$ will itself be close to a (new) anchor word. The additional advantage of this procedure is that when faced with many choices for a next anchor word to find, our algorithm tends to find the one that is most different from the ones we have found so far.
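The greedy idea described above can be sketched as follows. This is a simplified illustration and not a transcription of Algorithm 4: it measures distances relative to the first selected point (an assumption about how the span is formed), and it omits both the dimensionality reduction and the cleanup phase discussed later in this section. The function name `greedy_anchors` is hypothetical.

```python
import numpy as np

def greedy_anchors(points, K):
    """Pick K row indices by repeatedly taking the point farthest from the
    span of the points chosen so far (Gram-Schmidt style)."""
    first = int(np.argmax(np.linalg.norm(points, axis=1)))  # farthest from origin
    chosen = [first]
    shifted = points - points[first]          # treat the first anchor as origin
    basis = np.zeros((0, points.shape[1]))    # orthonormal basis of the span
    for _ in range(K - 1):
        # Remove the component of every point that lies in the current span.
        residual = shifted - shifted @ basis.T @ basis
        dist = np.linalg.norm(residual, axis=1)
        j = int(np.argmax(dist))              # farthest point from the span
        chosen.append(j)
        basis = np.vstack([basis, residual[j] / dist[j]])
    return chosen

# Noiseless toy data: three simplex vertices (rows 2, 3, 4) and four interior points.
vertices = np.eye(3)
weights = np.array([[0.5, 0.3, 0.2],
                    [0.2, 0.2, 0.6],
                    [0.1, 0.8, 0.1],
                    [1 / 3, 1 / 3, 1 / 3]])
points_toy = np.vstack([weights[:2] @ vertices, vertices, weights[2:] @ vertices])
anchors_toy = greedy_anchors(points_toy, K=3)
```

With exact (unperturbed) points, the distance to a subspace is a convex function and is therefore maximized at a vertex of the simplex, which is why the greedy choice lands on vertices in this toy case.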
The main contribution of this section is a proof that the FastAnchorWords algorithm succeeds in finding points that are close to anchor words. To precisely state the guarantees, we recall the following definition from [Arora et al., 2012a]:
Definition 4.1.
A simplex $P$ is $\gamma$-robust if for every vertex $v$ of $P$, the $\ell_2$ distance between $v$ and the convex hull of the rest of the vertices is at least $\gamma$.
In most reasonable settings the parameters of the topic model define lower bounds on the robustness of the polytope . For example, in LDA, this lower bound is based on the largest ratio of any pair of hyperparameters in the model [Arora et al., 2012b]. Our goal is to find a set of points that are close to the vertices of the simplex, and to make this precise we introduce the following definition:
Definition 4.2.
Let $\{a_i\}$ be a set of points whose convex hull $P$ is a simplex with vertices $\{v_i\}$. Then we say $a_i$ $\epsilon$-covers $v_j$ if when $a_i$ is written as a convex combination of the vertices as $a_i = \sum_j c_j v_j$, then $c_j \ge 1 - \epsilon$. Furthermore we will say that a set of $K$ points $\epsilon$-covers the vertices if each vertex is $\epsilon$-covered by some point in the set.
We will prove the following theorem: suppose there is a set of points whose convex hull is robust and has vertices (which appear in ) and that we are given a perturbation of the points so that for each , , then:
Theorem 4.3.
There is a combinatorial algorithm that runs in time (see footnote 1) and outputs a subset of of size that covers the vertices provided that .
Footnote 1: In practice we find setting dimension to 1000 works well. The running time is then .
This new algorithm not only helps us avoid linear programming altogether in inferring the parameters of a topic model, but also can be used to solve the nonnegative matrix factorization problem under the separability assumption, again without resorting to linear programming. Our analysis rests on the following lemmas, whose proof we defer to the supplementary material. Suppose the algorithm has found a set of points that are each close to distinct vertices in and that .
Lemma 4.4.
There is a vertex whose distance from is at least .
The proof of this lemma is based on a volume argument, and the connection between the volume of a simplex and the determinant of the matrix of distances between its vertices.
Lemma 4.5.
The point found by the algorithm must be close to some vertex .
This lemma is used to show that the error does not accumulate too badly in our algorithm, since only depends on , (not on the used in the previous step of the algorithm). This prevents the error from accumulating exponentially in the dimension of the problem, which would be catastrophic for our proof.
After running the first phase of our algorithm, we run a cleanup phase (the second loop in Alg. 4) that can reduce the error in our algorithm. When we have $K - 1$ points close to $K - 1$ vertices, only one of the vertices can be far from their span. The farthest point must be close to this missing vertex. The following lemma shows that this cleanup phase can improve the guarantees of Lemma 4.5:
Lemma 4.6.
Suppose and each point in is close to some vertex , then the farthest point found by the algorithm is close to the remaining vertex.
This algorithm is a greedy approach to maximizing the volume of the simplex. The larger the volume is, the more words per document the resulting model can explain. Better anchor word selection is an open question for future work. We have experimented with a variety of other heuristics for maximizing simplex volume, with varying degrees of success.
Related work. The separability assumption has also been studied under the name “pure pixel assumption” in the context of hyperspectral unmixing. A number of algorithms have been proposed that overlap with ours, such as the VCA [Nascimento & Dias, 2004] algorithm (which differs in that there is no cleanup phase) and the N-FINDR [Gomez et al., 2007] algorithm, which attempts to greedily maximize the volume of a simplex whose vertices are data points. However, these algorithms have only been proven to work in the infinite data case, and for our algorithm we are able to give provable guarantees even when the data points are perturbed (e.g., as the result of sampling noise). Recent work of Thurau et al. [2010] and Kumar et al. [2012] follows the same pattern as our paper, but uses nonnegative matrix factorization under the separability assumption. While both give applications to topic modeling, in realistic applications the term-by-document matrix is too sparse to be considered a good approximation to its expectation (because documents are short). In contrast, our algorithm works with the Gram matrix so that we can give provable guarantees even when each document is short.
5 Experimental Results
We compare three parameter recovery methods, Recover [Arora et al., 2012b], RecoverKL and RecoverL2, to a fast implementation of Gibbs sampling [McCallum, 2002] (see footnote 2). Linear programming-based anchor word finding is too slow to be comparable, so we use FastAnchorWords for all three recovery algorithms. Using Gibbs sampling we obtain the word-topic distributions by averaging over 10 saved states, each separated by 100 iterations, after 1000 burn-in iterations.
Footnote 2: We were not able to obtain Anandkumar et al. [2012]’s implementation of their algorithm, and our own implementation is too slow to be practical.
5.1 Methodology
We train models on two synthetic data sets to evaluate performance when model assumptions are correct, and on real documents to evaluate real-world performance. To ensure that synthetic documents resemble the dimensionality and sparsity characteristics of real data, we generate semi-synthetic corpora. For each real corpus, we train a model using MCMC and then generate new documents using the parameters of that model (these parameters are not guaranteed to be separable).
We use two realworld data sets, a large corpus of New York Times articles (295k documents, vocabulary size 15k, mean document length 298) and a small corpus of NIPS abstracts (1100 documents, vocabulary size 2500, mean length 68). Vocabularies were pruned with document frequency cutoffs. We generate semisynthetic corpora of various sizes from models trained with from NY Times and NIPS, with document lengths set to 300 and 70, respectively, and with documenttopic distributions drawn from a Dirichlet with symmetric hyperparameters .
We use a variety of metrics to evaluate models. For the semi-synthetic corpora, we can compute reconstruction error between the true word-topic matrix and learned topic distributions. Given a learned matrix $\hat{A}$ and the true matrix $A$, we use an LP to find the best matching between topics. Once topics are aligned, we evaluate the distance between each pair of topics. When true parameters are not available, a standard evaluation for topic models is to compute held-out probability, the probability of previously unseen documents under the learned model. This computation is intractable but there are reliable approximation methods [Wallach et al., 2009, Buntine, 2009]. Topic models are useful because they provide interpretable latent dimensions. We can evaluate the semantic quality of individual topics using a metric called Coherence. Coherence is based on two functions, $D(w)$ and $D(w_1, w_2)$, which are the number of documents with at least one instance of $w$, and of $w_1$ and $w_2$, respectively [Mimno et al., 2011]. Given a set of words $\mathcal{W}$, coherence is
$$\mathrm{Coherence}(\mathcal{W}) = \sum_{w_1, w_2 \in \mathcal{W}} \log \frac{D(w_1, w_2) + \epsilon}{D(w_2)}. \qquad (3)$$
The parameter $\epsilon$ is used to avoid taking the log of zero for words that never co-occur [Stevens et al., 2012]. This metric has been shown to correlate well with human judgments of topic quality. If we perfectly reconstruct topics, all the high-probability words in a topic should co-occur frequently; otherwise, the model may be mixing unrelated concepts. Coherence measures the quality of individual topics, but does not measure redundancy, so we measure inter-topic similarity. For each topic, we gather the set of the most probable words. We then count how many of those words do not appear in any other topic’s set of most probable words. Some overlap is expected due to semantic ambiguity, but lower numbers of unique words indicate less useful models.
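The two topic-quality metrics described above can be sketched in a few lines. This is an illustration under assumptions rather than the evaluation code used for the experiments: documents are represented as sets of word types, the sum in the coherence score is taken over ordered pairs of distinct words, the default smoothing value is a placeholder, and the function names `coherence` and `unique_word_counts` are hypothetical. Every scored word is assumed to occur in at least one document.

```python
import math
from itertools import permutations

def coherence(words, docs, eps=0.01):
    """Sum over ordered pairs (w1, w2) of log((D(w1, w2) + eps) / D(w2)),
    where D counts the documents containing all of the given words."""
    def doc_count(*ws):
        return sum(1 for d in docs if all(w in d for w in ws))
    return sum(math.log((doc_count(w1, w2) + eps) / doc_count(w2))
               for w1, w2 in permutations(words, 2))

def unique_word_counts(topics):
    """For each topic's top-word list, count the words that appear in no
    other topic's top-word list."""
    counts = []
    for t, top in enumerate(topics):
        others = set()
        for u, other in enumerate(topics):
            if u != t:
                others.update(other)
        counts.append(sum(1 for w in top if w not in others))
    return counts

docs_demo = [{"a", "b"}, {"a"}, {"b", "c"}]
score = coherence(["a", "b"], docs_demo)
uniques = unique_word_counts([["a", "b"], ["b", "c"], ["d", "e"]])
```

Higher (less negative) coherence indicates that a topic's top words tend to appear in the same documents, while a low unique-word count signals redundancy between topics.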
5.2 Efficiency
The Recover algorithms, in Python, are faster than a heavily optimized Java Gibbs sampling implementation [Yao et al., 2009].
Fig. 1 shows the time to train models on synthetic corpora on a single machine. Gibbs sampling is linear in the corpus size. RecoverL2 is also linear (), but only varies from 33 to 50 seconds. Estimating is linear, but takes only 7 seconds for the largest corpus. FastAnchorWords takes less than 6 seconds for all corpora.
5.3 Semi-synthetic documents
The new algorithms have good reconstruction error on semi-synthetic documents, especially for larger corpora. Results for semi-synthetic corpora drawn from topics trained on NY Times articles are shown in Fig. 2 for corpus sizes ranging from 50k to 2M synthetic documents. In addition, we report results for the three Recover algorithms on “infinite data,” that is, the true $Q$ matrix from the model used to generate the documents. Error bars show variation between topics. Recover performs poorly in all but the noiseless, infinite data setting. Gibbs sampling has lower error with smaller corpora, while the new algorithms get better recovery and lower variance with more data (although more sampling might reduce MCMC error further).
Results for semisynthetic corpora drawn from NIPS topics are shown in Fig. 3. Recover does poorly for the smallest corpora (topic matching fails for , so is not meaningful), but achieves moderate error for comparable to the NY Times corpus. RecoverKL and RecoverL2 also do poorly for the smallest corpora, but are comparable to or better than Gibbs sampling, with much lower variance, after 40,000 documents.
5.4 Effect of separability
The nonnegative algorithms are more robust to violations of the separability assumption than the original Recover algorithm. In Fig. 3, Recover does not achieve zero error even with noiseless “infinite” data. Here we show that this is due to lack of separability. In our semi-synthetic corpora, documents are generated from the LDA model, but the topic-word distributions are learned from data and may not satisfy the anchor words assumption. We test the sensitivity of algorithms to violations of the separability condition by adding a synthetic anchor word to each topic that is by construction unique to the topic. We assign the synthetic anchor word a probability equal to that of the most probable word in the original topic. This causes the distribution to sum to greater than 1.0, so we renormalize. Results are shown in Fig. 4. The error goes to zero for Recover, and close to zero for RecoverKL and RecoverL2. RecoverKL and RecoverL2 do not reach exactly zero because we do not solve the optimization problems to perfect optimality.
5.5 Effect of correlation
The theoretical guarantees of the new algorithms apply even if topics are correlated. To test how algorithms respond to correlation, we generated new synthetic corpora from the same model trained on NY Times articles. Instead of a symmetric Dirichlet distribution, we use a logistic normal distribution with a block-structured covariance matrix. We partition topics into 10 groups. For each pair of topics in a group, we add a nonzero off-diagonal element to the covariance matrix. This block structure is not necessarily realistic, but shows the effect of correlation. Results for two levels of covariance are shown in Fig. 5.
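The correlated document-topic distributions described above could be generated along the following lines. This is only a sketch: the unit diagonal variance, the zero mean, the specific off-diagonal value, the requirement that the number of topics be divisible by the number of groups, and the function names `block_covariance` and `sample_logistic_normal` are all assumptions, not details taken from the experiments.

```python
import numpy as np

def block_covariance(K, groups, rho, var=1.0):
    """K x K covariance with `var` on the diagonal and `rho` between every
    pair of topics in the same group (K must be divisible by `groups`)."""
    size = K // groups
    cov = np.eye(K) * var
    same_block = np.eye(size, dtype=bool)
    for g in range(groups):
        lo, hi = g * size, (g + 1) * size
        cov[lo:hi, lo:hi] = np.where(same_block, var, rho)
    return cov

def sample_logistic_normal(cov, n, rng):
    """Draw n topic distributions: Gaussian draws mapped through softmax."""
    x = rng.multivariate_normal(np.zeros(cov.shape[0]), cov, size=n)
    x -= x.max(axis=1, keepdims=True)          # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)

cov_demo = block_covariance(K=20, groups=10, rho=0.1)
W_demo = sample_logistic_normal(cov_demo, n=5, rng=np.random.default_rng(0))
```

Each row of the sampled matrix is a document's topic distribution; topics in the same group tend to receive high or low weight together.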
Results for Recover are much worse in both cases than for the Dirichlet-generated corpora in Fig. 2. The other three algorithms, especially Gibbs sampling, are more robust to correlation, but performance consistently degrades as correlation increases, and improves with larger corpora. With infinite data, the error is equal to the error in the uncorrelated synthetic corpus (nonzero because of violations of the separability assumption).
5.6 Real documents
The new algorithms produce comparable quantitative and qualitative results on real data. Fig. 6 shows three metrics for both corpora. Error bars show the distribution of log probabilities across held-out documents (top panel) and coherence and unique words across topics (center and bottom panels). Held-out sets are 230 documents for NIPS and 59k for NY Times. For the small NIPS corpus we average over 5 non-overlapping train/test splits. The matrix inversion in Recover failed for the smaller corpus (NIPS). In the larger corpus (NY Times), Recover produces noticeably worse held-out log probability per token than the other algorithms. Gibbs sampling produces the best average held-out probability (under a paired $t$-test), but the difference is within the range of variability between documents. We tried several methods for estimating hyperparameters, but the observed differences did not change the relative performance of algorithms.
Gibbs sampling has worse coherence than the Recover algorithms, but produces more unique words per topic. These patterns are consistent with semi-synthetic results for similarly sized corpora (details are in supplementary material).
For each NY Times topic learned by RecoverL2 we find the closest Gibbs topic by distance. The closest, median, and farthest topic pairs are shown in Table 1 (see footnote 3). We observe that when there is a difference, Recover-based topics tend to have more specific words (Anaheim Angels vs. pitch).

Table 1
Algorithm   Most probable words
RecoverL2   run inning game hit season zzz_anaheim_angel
Gibbs       run inning hit game ball pitch
RecoverL2   father family zzz_elian boy court zzz_miami
Gibbs       zzz_cuba zzz_miami cuban zzz_elian boy protest
RecoverL2   file sport read internet email zzz_los_angeles
Gibbs       web site com www mail zzz_internet

Footnote 3: The UCI NY Times corpus includes named-entity annotations, indicated by the zzz prefix.
6 Conclusions
We present new algorithms for topic modeling, inspired by Arora et al. [2012b], which are efficient and simple to implement yet maintain provable guarantees. The running time of these algorithms is effectively independent of the size of the corpus. Empirical results suggest that the sample complexity of these algorithms is somewhat greater than MCMC, but, particularly for the RecoverL2 variant, they provide comparable results in a fraction of the time. We have tried to use the output of our algorithms as initialization for further optimization (e.g. using MCMC) but have not yet found a hybrid that outperforms either method by itself. Finally, although we defer parallel implementations to future work, these algorithms are parallelizable, potentially supporting web-scale topic inference.
References
 Anandkumar et al. [2012] Anandkumar, A., Foster, D., Hsu, D., Kakade, S., and Liu, Y. Two svds suffice: Spectral decompositions for probabilistic topic modeling and latent dirichlet allocation. In NIPS, 2012.
 Arora et al. [2012a] Arora, S., Ge, R., Kannan, R., and Moitra, A. Computing a nonnegative matrix factorization – provably. In STOC, pp. 145–162, 2012a.
 Arora et al. [2012b] Arora, S., Ge, R., and Moitra, A. Learning topic models – going beyond svd. In FOCS, 2012b.
 Bittorf et al. [2012] Bittorf, V., Recht, B., Re, C., and Tropp, J. Factoring nonnegative matrices with linear programs. In NIPS, 2012.
 Blei [2012] Blei, D. Introduction to probabilistic topic models. Communications of the ACM, pp. 77–84, 2012.
 Blei & Lafferty [2007] Blei, D. and Lafferty, J. A correlated topic model of science. Annals of Applied Statistics, pp. 17–35, 2007.
 Blei et al. [2003] Blei, D., Ng, A., and Jordan, M. Latent dirichlet allocation. Journal of Machine Learning Research, pp. 993–1022, 2003. Preliminary version in NIPS 2001.
 Buntine [2009] Buntine, Wray L. Estimating likelihoods for topic models. In Asian Conference on Machine Learning, 2009.
 Deerwester et al. [1990] Deerwester, S., Dumais, S., Landauer, T., Furnas, G., and Harshman, R. Indexing by latent semantic analysis. JASIS, pp. 391–407, 1990.
 Donoho & Stodden [2003] Donoho, D. and Stodden, V. When does nonnegative matrix factorization give the correct decomposition into parts? In NIPS, 2003.
 Gillis [2012] Gillis, N. Robustness analysis of hotttopixx, a linear programming model for factoring nonnegative matrices, 2012. http://arxiv.org/abs/1211.6687.

 Gomez et al. [2007] Gomez, C., Borgne, H. Le, Allemand, P., Delacourt, C., and Ledru, P. Nfindr method versus independent component analysis for lithological identification in hyperspectral imagery. Int. J. Remote Sens., 28(23), January 2007.
 Griffiths & Steyvers [2004] Griffiths, T. L. and Steyvers, M. Finding scientific topics. Proceedings of the National Academy of Sciences, 101:5228–5235, 2004.
 Kivinen & Warmuth [1995] Kivinen, Jyrki and Warmuth, Manfred K. Exponentiated gradient versus gradient descent for linear predictors. Inform. and Comput., 132, 1995.
 Kumar et al. [2012] Kumar, A., Sindhwani, V., and Kambadur, P. Fast conical hull algorithms for nearseparable nonnegative matrix factorization. 2012. http://arxiv.org/abs/1210.1190v1.
 Li & McCallum [2007] Li, W. and McCallum, A. Pachinko allocation: Dagstructured mixture models of topic correlations. In ICML, pp. 633–640, 2007.
 McCallum [2002] McCallum, A.K. Mallet: A machine learning for language toolkit, 2002. http://mallet.cs.umass.edu.
 Mimno et al. [2011] Mimno, David, Wallach, Hanna, Talley, Edmund, Leenders, Miriam, and McCallum, Andrew. Optimizing semantic coherence in topic models. In EMNLP, 2011.
 Nascimento & Dias [2004] Nascimento, J. M. P. and Dias, J. M. B. Vertex component analysis: A fast algorithm to unmix hyperspectral data. IEEE Trans. Geosci. Remote Sens., 43:898–910, 2004.
 Nocedal & Wright [2006] Nocedal, J. and Wright, S. J. Numerical Optimization. Springer, New York, 2nd edition, 2006.
 Stevens et al. [2012] Stevens, Keith, Kegelmeyer, Philip, Andrzejewski, David, and Buttler, David. Exploring topic coherence over many models and many topics. In EMNLP, 2012.
 Thurau et al. [2010] Thurau, C., Kersting, K., and Bauckhage, C. Yes we can – simplex volume maximization for descriptive web–scale matrix factorization. In CIKM–10, 2010.
 Wallach et al. [2009] Wallach, Hanna, Murray, Iain, Salakhutdinov, Ruslan, and Mimno, David. Evaluation methods for topic models. In ICML, 2009.
 Wedin [1972] Wedin, P. Perturbation bounds in connection with singular value decomposition. BIT Numerical Mathematics, 12(1):99–111, 1972.
 Yao et al. [2009] Yao, Limin, Mimno, David, and McCallum, Andrew. Efficient methods for topic model inference on streaming document collections. In KDD, 2009.
Appendix A Proof for Anchor-Words Finding Algorithm
Recall that the correctness of the algorithm depends on the following lemmas:
Lemma A.1.
There is a vertex whose distance from is at least .
Lemma A.2.
The point found by the algorithm must be close to some vertex .
In order to prove Lemma A.1, we use a volume argument. First we show that the volume of a robust simplex cannot change by too much when the vertices are perturbed.
Lemma A.3.
Suppose are the vertices of a robust simplex . Let be a simplex with vertices , each of the vertices is a perturbation of and . When the volume of the two simplices satisfy
Proof: As the volume of a simplex is proportional to the determinant of a matrix whose columns are the edges of the simplex, we first show the following perturbation bound for determinant.
Claim A.4.
Proof: Since , we can multiply both and by . Hence .
The Frobenius norm of is bounded by
Let the eigenvalues of be , then by definition of Frobenius Norm . The eigenvalues of are just , and the determinant . Hence it suffices to show
To do this we apply the method of Lagrange multipliers and show that the minimum is attained only when all ’s are equal. The optimal value must be attained at a local optimum of
Taking partial derivatives with respect to ’s, we get the equations (here using is small so ). The right-hand side is a constant, so each must be one of the two solutions of this equation. However, only one of the solutions is larger than , and therefore all the ’s are equal.
For the lower bound, we can project the perturbed subspace to the dimensional space. Such a projection cannot increase the volume and the perturbation distances only get smaller. Therefore we can apply the claim directly: the columns of are just for ; the columns of are just . The smallest eigenvalue of is at least because the polytope is robust, which is equivalent to saying that after orthogonalization each column still has length at least . The Frobenius norm of is at most . We get the lower bound directly by applying the claim.
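The extremal step of this argument can be sanity-checked numerically. The Python sketch below compares det(I + M), for random matrices M rescaled to a fixed Frobenius norm c, against the value (1 - c/sqrt(K))^K obtained when all K eigenvalues equal -c/sqrt(K). The names `M`, `c`, and `k`, and the restriction c < 1/2, are assumptions made for illustration. They are not the constants of Claim A.4, whose statement is not reproduced here.

```python
import numpy as np

def equal_eigenvalue_value(c, k):
    """det(I + M) when M = -(c / sqrt(k)) * I, i.e. all eigenvalues are equal."""
    return (1.0 - c / np.sqrt(k)) ** k

def min_det_ratio(k, c, trials=1000, seed=0):
    """Smallest observed det(I + M) / equal_eigenvalue_value over random M.

    Each M is rescaled so that its Frobenius norm is exactly c (assumed < 1/2).
    """
    rng = np.random.default_rng(seed)
    identity = np.eye(k)
    bound = equal_eigenvalue_value(c, k)
    worst = np.inf
    for _ in range(trials):
        m = rng.standard_normal((k, k))
        m *= c / np.linalg.norm(m, "fro")
        worst = min(worst, np.linalg.det(identity + m) / bound)
    return worst

# A ratio of at least 1 means no random perturbation beat the equal-eigenvalue case.
print(min_det_ratio(k=6, c=0.3))
```

A ratio below 1 would contradict the claim that the minimum is attained when all eigenvalues are equal, so the check fails loudly if the extremal case is misidentified.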
For the upper bound, swap the two sets and and use the argument for the lower bound. The only thing we need to show is that the smallest eigenvalue of the matrix generated by points in is still at least . This follows from Wedin’s Theorem [Wedin, 1972] and the fact that .
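The volume argument of Lemma A.3 rests on the fact that the volume of a simplex is proportional to a determinant built from its edge vectors. The following minimal Python sketch computes that volume through the Gram matrix of the edges, so that it also applies when the simplex lies in a higher-dimensional space, and compares a simplex with a slightly perturbed copy. The function name `simplex_volume` and the noise level are hypothetical choices made for illustration.

```python
import numpy as np
from math import factorial

def simplex_volume(vertices):
    """Volume of the simplex whose vertices are the rows of `vertices`.

    Uses sqrt(det(E^T E)) / m!, where the m columns of E are the edges
    leaving the first vertex.
    """
    vertices = np.asarray(vertices, dtype=float)
    edges = (vertices[1:] - vertices[0]).T
    gram = edges.T @ edges
    # Clamp tiny negative determinants caused by rounding.
    return np.sqrt(max(np.linalg.det(gram), 0.0)) / factorial(edges.shape[1])

# Compare a simplex with a copy whose vertices are moved by small noise.
rng = np.random.default_rng(0)
original = rng.random((4, 10))
perturbed = original + 1e-3 * rng.standard_normal(original.shape)
print(simplex_volume(perturbed) / simplex_volume(original))
```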
Now we are ready to prove Lemma A.1.
Proof: The first case is for the first step of the algorithm, when we try to find the farthest point from the origin. Here essentially . For any two vertices , since the simplex is robust, the distance between and is at least . This means , so one of them must be at least .
For the later steps, recall that contains vertices of a perturbed simplex. Let be the set of original vertices corresponding to the perturbed vertices in . Let be any vertex in which is not in . Now we know the distance between and is equal to . On the other hand, we know . Using Lemma A.3 to bound the ratio between the two pairs and , we get:
when .
Lemma A.2 is based on the following observation: in a simplex the point with the largest norm is always a vertex. Even if two vertices have the same norm, if they are not close to each other then the points on the edge connecting them will have significantly lower norm.
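This observation is easy to check numerically. The sketch below, in Python with hypothetical function names, samples random convex combinations of a set of vertices and confirms that none has a larger Euclidean norm than the best vertex. It also evaluates the norm of the midpoint of two equal-norm vertices.

```python
import numpy as np

def max_norm_is_at_vertex(vertices, n_samples=10000, seed=0):
    """Check that no sampled convex combination has a larger norm than the best vertex."""
    rng = np.random.default_rng(seed)
    weights = rng.dirichlet(np.ones(len(vertices)), size=n_samples)
    combos = weights @ vertices
    best_vertex = np.linalg.norm(vertices, axis=1).max()
    return bool(np.linalg.norm(combos, axis=1).max() <= best_vertex + 1e-12)

def midpoint_norm(u, v):
    """Norm of the midpoint of the edge connecting u and v."""
    return float(np.linalg.norm((np.asarray(u) + np.asarray(v)) / 2.0))

rng = np.random.default_rng(1)
print(max_norm_is_at_vertex(rng.random((5, 8))))
# Two unit-norm vertices at distance sqrt(2): the midpoint is well inside the unit ball.
print(midpoint_norm([1.0, 0.0], [0.0, 1.0]))
```

For two unit vectors at distance d, the midpoint has norm sqrt(1 - d^2/4), so the drop in norm grows with the separation of the two vertices.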
Proof: (Lemma A.2)
Since is the point found by the algorithm, let us consider the point before perturbation. The point is inside the simplex, therefore we can write as a convex combination of the vertices:
Let be the vertex with largest coefficient . Let be the largest distance from some vertex to the space spanned by points in (. By Lemma A.1 we know . Also notice that we are not assuming .
Now we rewrite as , where is a vector in the convex hull of vertices other than . Observe that must be far from , because is the farthest point found by the algorithm. Indeed:
The second inequality holds because there must be some point that corresponds to the farthest vertex and has . Thus, as is the farthest point, .
The point is on the segment connecting and , and the distance between and is not much smaller than that of and . Following the intuition in norm, when and are far apart we would expect to be very close to either or . Since it cannot be really close to , it must be really close to . We formalize this intuition by the following calculation (see Figure 8):
Project everything to the orthogonal subspace of (points in are now at the origin). After projection distance to is just the norm of a vector. Without loss of generality we assume because these two have length at most , and extending these two vectors to have length can only increase the length of .
The point must be far from by applying Lemma A.1: consider the set of vertices . The set satisfies the assumptions in Lemma A.1 so there must be one vertex that is far from , and it can only be . Therefore even after projecting to the orthogonal subspace of , is still far from any convex combination of . The vertices that are not in all have very small norm after projecting to the orthogonal subspace (at most ) so we know the distance of and is at least .
Now the problem becomes a two-dimensional calculation. When is fixed, the length of strictly increases as the distance between and decreases, so we assume the distance is . A simple calculation (using essentially just the Pythagorean theorem) shows
The right hand side is largest when (since the vectors are in unit ball) and the maximum value is . When this value is smaller than , we must have . Thus and .
The cleanup phase tries to find the farthest point from a subset of vertices, and uses that point as the th vertex. This will improve the result because when we have points close to vertices, only one of the vertices can be far from their span. Therefore the farthest point must be close to the only remaining vertex. Another way of viewing this is that the algorithm is trying to greedily maximize the volume of the simplex, which makes sense because the larger the volume is, the more words/documents the final LDA model can explain.
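The greedy selection and the cleanup phase described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the FastAnchorWords implementation. It omits any dimensionality reduction, it measures distance from the affine span of the chosen points (from the origin when nothing has been chosen yet), and the names `farthest_from_span` and `greedy_anchors` are hypothetical.

```python
import numpy as np

def farthest_from_span(points, chosen):
    """Index of the row of `points` farthest from the affine span of `chosen`.

    With nothing chosen, distance is measured from the origin.
    """
    if len(chosen) == 0:
        return int(np.argmax(np.linalg.norm(points, axis=1)))
    residual = points - chosen[0]          # translate so chosen[0] is the origin
    directions = chosen[1:] - chosen[0]
    if len(directions) > 0:
        q, _ = np.linalg.qr(directions.T)  # orthonormal basis of the span
        residual = residual - (residual @ q) @ q.T
    return int(np.argmax(np.linalg.norm(residual, axis=1)))

def greedy_anchors(points, k, cleanup=True):
    """Greedy farthest-point selection of k rows, with an optional cleanup pass."""
    selected = []
    for _ in range(k):
        chosen = points[np.asarray(selected, dtype=int)]
        selected.append(farthest_from_span(points, chosen))
    if cleanup:
        # Drop each selected point in turn and re-pick the point farthest
        # from the span of the remaining k - 1 points.
        for i in range(k):
            others = selected[:i] + selected[i + 1:]
            chosen = points[np.asarray(others, dtype=int)]
            selected[i] = farthest_from_span(points, chosen)
    return selected

# Noise-free toy data: rows 0..3 are the vertices, the rest are convex combinations.
rng = np.random.default_rng(0)
vertices = rng.random((4, 10))
mixtures = rng.dirichlet(np.ones(4), size=200) @ vertices
points = np.vstack([vertices, mixtures])
print(sorted(greedy_anchors(points, k=4)))
```

On noise-free data the cleanup pass leaves the selection unchanged. Its purpose in the analysis is to tighten the guarantee when the points are perturbed.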
The following lemma makes the intuitions rigorous and shows how cleanup improves the guarantee of Lemma A.2.
Lemma A.5.
Suppose and each point in is close to some vertex , then the farthest point found by the algorithm is close to the remaining vertex.
Proof: We still look at the original point and express it as . Without loss of generality let be the vertex that does not correspond to anything in . By Lemma A.1 is far from . On the other hand all other vertices are at least close to . We know the distance , this cannot be true unless .
These lemmas directly lead to the following theorem:
Theorem A.6.
The FastAnchorWords algorithm runs in time and outputs a subset of of size that covers the vertices, provided that .
Appendix B Proof for Nonnegative Recover Procedure
In order to show that RecoverL2 learns the parameters even when the rows of are perturbed, we need the following lemma, which shows that when the columns of are close to their expectation, the posteriors computed by the algorithm are also close to the true values.
Lemma B.1.
For a robust simplex with vertices , let be a point in the simplex that can be represented as a convex combination . If the vertices of are perturbed to where and is perturbed to where . Let be the point in that is closest to , and , when for all .
Proof: Consider the point , by triangle inequality: . Hence , and is in . The point is the point in that is closest to , so and .
Then we need to show when a point () moves a small distance, its representation also changes by a small amount. Intuitively this is true because is robust. By Lemma A.1 when , the simplex is also robust. For any , let and be the projections of and in the orthogonal subspace of , then
and this completes the proof.
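To make the statement of Lemma B.1 concrete, the following Python sketch finds the convex coefficients of the point in a simplex closest to a target, and then repeats the computation after perturbing both the vertices and the target. The solver is an exponentiated-gradient update in the spirit of [Kivinen & Warmuth, 1995]. Its use here, the fixed step size, the function name `simplex_coefficients`, and the noise level are all assumptions made for illustration.

```python
import numpy as np

def simplex_coefficients(vertices, target, steps=2000):
    """Convex weights c minimizing ||c @ vertices - target||^2.

    Uses multiplicative (exponentiated-gradient) updates, which keep c
    nonnegative and summing to one at every step.
    """
    k = vertices.shape[0]
    gram = vertices @ vertices.T
    cross = vertices @ target
    eta = 1.0 / (2.0 * np.abs(gram).max())  # conservative fixed step size
    c = np.full(k, 1.0 / k)
    for _ in range(steps):
        grad = 2.0 * (gram @ c - cross)
        c = c * np.exp(-eta * grad)
        c /= c.sum()
    return c

# Clean problem: the target is an exact convex combination of the vertices.
vertices = np.eye(4)
true_c = np.array([0.1, 0.2, 0.3, 0.4])
clean = simplex_coefficients(vertices, true_c @ vertices)

# Perturbed problem: small noise on every vertex and on the target.
rng = np.random.default_rng(0)
noisy_vertices = vertices + 0.01 * rng.standard_normal(vertices.shape)
noisy_target = true_c @ vertices + 0.01 * rng.standard_normal(4)
noisy = simplex_coefficients(noisy_vertices, noisy_target)
print(np.abs(clean - true_c).max(), np.abs(noisy - true_c).max())
```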
With this lemma it is not hard to show that RecoverL2 has polynomial sample complexity.
Theorem B.2.
When the number of documents is at least
our algorithm using the conjunction of FastAnchorWords and RecoverL2 learns the matrix with entrywise error at most .
Proof: (sketch) We can assume without loss of generality that each word occurs with probability at least and furthermore that if is at least then the empirical matrix is entrywise within an additive to the true ; see [Arora et al., 2012b] for the details. Also, the anchor rows of form a simplex that is robust.
The error in each column of can be at most . By Theorem A.6 when (which is satisfied when ), the anchor words found are close to the true anchor words. Hence by Lemma B.1 every entry of has error at most .
With this many documents, all the word probabilities are estimated more accurately than the entries of , so we omit their perturbations here for simplicity. When we apply Bayes’ rule, we know , where is , which is lower bounded by . The numerator and denominator are both combinations of entries of with positive coefficients that sum to at most 1. Therefore the errors and are at most the error of a single entry of , which is bounded by . Applying a Taylor expansion to , the error on entries of is at most . When , we have , and we get the desired accuracy of . The number of documents required is .
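The Bayes’ rule step can be illustrated with a short Python sketch. It assumes a matrix `posteriors` whose entry (i, k) is the probability of topic k given word i, and a vector `word_probs` of word probabilities. Both names are hypothetical, and the round trip below uses synthetic distributions rather than data from the experiments.

```python
import numpy as np

def topics_from_posteriors(posteriors, word_probs):
    """Invert p(topic | word) into p(word | topic) with Bayes' rule.

    posteriors[i, k] = p(z = k | w = i), word_probs[i] = p(w = i).
    Returns a matrix whose column k is the word distribution of topic k.
    """
    joint = posteriors * word_probs[:, None]  # p(w = i, z = k)
    topic_probs = joint.sum(axis=0)           # p(z = k)
    return joint / topic_probs

# Round trip on synthetic distributions.
rng = np.random.default_rng(0)
n_words, n_topics = 50, 5
topic_word = rng.dirichlet(np.ones(n_words), size=n_topics).T  # columns sum to 1
topic_prior = rng.dirichlet(np.ones(n_topics))
joint = topic_word * topic_prior
word_probs = joint.sum(axis=1)
posteriors = joint / word_probs[:, None]
recovered = topics_from_posteriors(posteriors, word_probs)
print(np.abs(recovered - topic_word).max())
```

The normalizing term in the denominator is the topic probability, which is why a lower bound on it controls how errors in the posteriors propagate to the recovered topic matrix.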
The sample complexity for can then be bounded using matrix perturbation theory.
Appendix C Empirical Results
This section contains plots for , held-out probability, coherence, and uniqueness for all semi-synthetic data sets. Higher is better for all metrics except error.
C.1 Sample Topics
Tables 2, 3, and 4 show 100 topics trained on real NY Times articles using the RecoverL2 algorithm. Each topic is followed by the most similar topic (measured by distance) from a model trained on the same documents with Gibbs sampling. When the anchor word is among the top six words by probability it is highlighted in bold. Note that the anchor word is frequently not the most prominent word.
RecoverL2  run inning game hit season zzz_anaheim_angel 
Gibbs  run inning hit game ball pitch 
RecoverL2  king goal game team games season 
Gibbs  point game team play season games 
RecoverL2  yard game play season team touchdown 
Gibbs  yard game season team play quarterback 
RecoverL2  point game team season games play 
Gibbs  point game team play season games 
RecoverL2  zzz_laker point zzz_kobe_bryant zzz_o_neal game team 
Gibbs  point game team play season games 
RecoverL2  point game team season player zzz_clipper 
Gibbs  point game team season play zzz_usc 
RecoverL2  ballot election court votes vote zzz_al_gore 
Gibbs  election ballot zzz_florida zzz_al_gore votes vote 
RecoverL2  game zzz_usc team play point season 
Gibbs  point game team season play zzz_usc 
RecoverL2  company billion companies percent million stock 
Gibbs  company million percent billion analyst deal 
RecoverL2  car race team season driver point 
Gibbs  race car driver racing zzz_nascar team 
RecoverL2  zzz_dodger season run inning right game 
Gibbs  season team baseball game player yankees 
RecoverL2  palestinian zzz_israeli zzz_israel official attack zzz_palestinian 
Gibbs  palestinian zzz_israeli zzz_israel attack zzz_palestinian zzz_yasser_arafat 
RecoverL2  zzz_tiger_wood shot round player par play 
Gibbs  zzz_tiger_wood shot golf tour round player 
RecoverL2  percent stock market companies fund quarter 
Gibbs  percent economy market stock economic growth 
RecoverL2  zzz_al_gore zzz_bill_bradley campaign president zzz_george_bush vice 
Gibbs  zzz_al_gore zzz_george_bush campaign presidential republican zzz_john_mccain 
RecoverL2  zzz_george_bush zzz_john_mccain campaign republican zzz_republican voter 
Gibbs  zzz_al_gore zzz_george_bush campaign presidential republican zzz_john_mccain 
RecoverL2  net team season point player zzz_jason_kidd 
Gibbs  point game team play season games 
RecoverL2  yankees run team season inning hit 
Gibbs  season team baseball game player yankees 
RecoverL2  zzz_al_gore zzz_george_bush percent president campaign zzz_bush 
Gibbs  zzz_al_gore zzz_george_bush campaign presidential republican zzz_john_mccain 
RecoverL2  zzz_enron company firm zzz_arthur_andersen companies lawyer 
Gibbs  zzz_enron company firm accounting zzz_arthur_andersen financial 
RecoverL2  team play game yard season player 
Gibbs  yard game season team play quarterback 
RecoverL2  film movie show director play character 
Gibbs  film movie character play minutes hour 
RecoverL2  zzz_taliban zzz_afghanistan official zzz_u_s government military 
Gibbs  zzz_taliban zzz_afghanistan zzz_pakistan afghan zzz_india government 
RecoverL2  palestinian zzz_israel israeli peace zzz_yasser_arafat leader 
Gibbs  palestinian zzz_israel peace israeli zzz_yasser_arafat leader 
RecoverL2  point team game shot play zzz_celtic 
Gibbs  point game team play season games 
RecoverL2  zzz_bush zzz_mccain campaign republican tax zzz_republican 
Gibbs  zzz_al_gore zzz_george_bush campaign presidential republican zzz_john_mccain 
RecoverL2  zzz_met run team game hit season 
Gibbs  season team baseball game player yankees 
RecoverL2  team game season play games win 
Gibbs  team coach game player season football 
RecoverL2  government war zzz_slobodan_milosevic official court president 
Gibbs  government war country rebel leader military 
RecoverL2  game set player zzz_pete_sampras play won 
Gibbs  player game match team soccer play 
RecoverL2  zzz_al_gore campaign zzz_bradley president democratic zzz_clinton 
Gibbs  zzz_al_gore zzz_george_bush campaign presidential republican zzz_john_mccain 
RecoverL2  team zzz_knick player season point play 
Gibbs  point game team play season games 
RecoverL2  com web www information sport question 
Gibbs  palm beach com statesman daily american 
RecoverL2  season team game coach play school 
Gibbs  team coach game player season football 
RecoverL2  air shower rain wind storm front 
Gibbs  water fish weather storm wind air 
RecoverL2  book film beginitalic enditalic look movie 
Gibbs  film movie character play minutes hour 
RecoverL2  zzz_al_gore campaign election zzz_george_bush zzz_florida president 
Gibbs  zzz_al_gore zzz_george_bush campaign presidential republican zzz_john_mccain 
RecoverL2  race won horse zzz_kentucky_derby win winner 
Gibbs  horse race horses winner won zzz_kentucky_derby 
RecoverL2  company companies zzz_at percent business stock 
Gibbs  company companies business industry firm market 
RecoverL2  company million companies percent business customer 
Gibbs  company companies business industry firm market 
RecoverL2  team coach season player jet job 
Gibbs  team player million season contract agent 
RecoverL2  season team game play player zzz_cowboy 
Gibbs  yard game season team play quarterback 
RecoverL2  zzz_pakistan zzz_india official group attack zzz_united_states 
Gibbs  zzz_taliban zzz_afghanistan zzz_pakistan afghan zzz_india government 
RecoverL2  show network night television zzz_nbc program 
Gibbs  film movie character play minutes hour 
RecoverL2  com information question zzz_eastern commentary daily 
Gibbs  com question information zzz_eastern daily commentary 
RecoverL2  power plant company percent million energy 
Gibbs  oil power energy gas prices plant 
RecoverL2  cell stem research zzz_bush human patient 
Gibbs  cell research human scientist stem genes 
RecoverL2  zzz_governor_bush zzz_al_gore campaign tax president plan 
Gibbs  zzz_al_gore zzz_george_bush campaign presidential republican zzz_john_mccain 
RecoverL2  cup minutes add tablespoon water oil 
Gibbs  cup minutes add tablespoon teaspoon oil 
RecoverL2  family home book right com children 
Gibbs  film movie character play minutes hour 
RecoverL2  zzz_china chin 