1 Introduction
Topic modeling is a statistical tool for the automatic discovery and comprehension of the latent thematic structure, or topics, assumed to pervade a corpus of documents.
Suppose that we have a corpus of $M$ documents composed of words from a vocabulary of $W$ distinct words indexed by $w = 1, \ldots, W$. In the classic "bag of words" modeling paradigm widely used in Probabilistic Latent Semantic Analysis [1] and Latent Dirichlet Allocation (LDA) [2, 3], each document is modeled as being generated by $N$ independent and identically distributed (iid) drawings of words from an unknown $W \times 1$ document word-distribution vector. Each document word-distribution vector is itself modeled as an unknown probabilistic mixture of $K$ latent topic word-distribution vectors that are shared among the $M$ documents in the corpus. The goal of topic modeling is then to estimate the latent topic word-distribution vectors, and possibly the topic mixing weights of each document, from the empirical word-frequency vectors of all $M$ documents. Topic modeling has also been applied to various types of data other than text, e.g., images, videos (with photometric and spatiotemporal feature vectors interpreted as the words), genetic sequences, hyperspectral images, voice, and music, for signal separation and blind deconvolution.
If $\beta$ denotes the unknown $W \times K$ topic matrix whose columns are the $K$ latent topic word-distribution vectors and $\theta$ denotes the $K \times M$ weight matrix whose columns are the mixing weights over the $K$ topics for the $M$ documents, then each column of the $W \times M$ matrix $A := \beta\theta$ corresponds to a document word-distribution vector. Let $X$ denote the observed words-by-documents matrix whose columns are the empirical word-frequency vectors of the $M$ documents when each document is generated by $N$ iid drawings of words from the corresponding column of the $A$ matrix. Then, given only $X$ and $K$, the goal is to estimate the topic matrix $\beta$ and possibly the weight matrix $\theta$. This can be formulated as a nonnegative matrix factorization (NMF) problem [4, 5, 6, 7] where the typical solution strategy is to minimize a cost function of the form
$$\min_{\beta \geq 0,\; \theta \geq 0} \; \|X - \beta\theta\|_F^2 + \lambda\,\psi(\beta, \theta) \qquad (1)$$
where the regularization term $\psi(\beta, \theta)$ is introduced to enforce desirable properties in the solution such as uniqueness of the factorization, sparsity, etc. The joint optimization of (1) with respect to $\beta$ and $\theta$ is, however, nonconvex and necessitates the use of suboptimal strategies such as alternating minimization, greedy gradient descent, local search, approximations, and heuristics. These are also typically sensitive to small sample sizes (number of words per document $N$), especially when $N \ll W$, because many words may not be sampled and $X$ may be far from $A$ in Euclidean distance. In LDA, the columns of $\beta$ and $\theta$ are modeled as iid random drawings from Dirichlet prior distributions. The resulting maximum a posteriori probability estimation, however, turns out to be a fairly complex nonconvex problem. One then takes recourse to suboptimal solutions based on variational Bayes approximations of the posterior distribution, or on other methods based on Gibbs sampling and expectation propagation. In contrast to these approaches, we adopt the nonnegative matrix factorization framework and propose a new geometrically motivated algorithm that has competitive performance compared to the current state of the art and is free of heuristics and approximations.
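As a concrete illustration of why practitioners resort to suboptimal alternating strategies for (1), here is a minimal sketch of the classic multiplicative-update heuristic for an NMF cost, with a simple Frobenius-norm regularizer standing in for the generic regularization term; the function name and the choice of regularizer are illustrative, not the method proposed in this paper:

```python
import numpy as np

def nmf_alt_min(X, K, lam=0.0, iters=200, seed=0):
    """Locally minimize ||X - B @ T||_F^2 + lam*(||B||_F^2 + ||T||_F^2)
    by Lee-Seung-style multiplicative updates (a local-search heuristic)."""
    rng = np.random.default_rng(seed)
    W, M = X.shape
    B = rng.random((W, K))              # topic matrix estimate
    T = rng.random((K, M))              # weight matrix estimate
    eps = 1e-12                         # avoid division by zero
    for _ in range(iters):
        # multiplicative updates preserve nonnegativity of B and T
        B *= (X @ T.T) / (B @ T @ T.T + lam * B + eps)
        T *= (B.T @ X) / (B.T @ B @ T + lam * T + eps)
    B /= B.sum(axis=0, keepdims=True)   # make the topic columns stochastic
    return B, T
```

Each subproblem (fixing one factor) is convex, but the joint problem is not, so different random initializations can converge to different local optima — precisely the sensitivity discussed above.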
2 A new geometric approach
A key ingredient of the new approach is the so-called "separability" assumption introduced in [5] to ensure the uniqueness of nonnegative matrix factorization. Applied to $\beta$, this means that each topic contains "novel" words which appear only in that topic – a property that has been found to hold in the estimates of topic matrices produced by several algorithms [8]. More precisely, a topic matrix $\beta$ is separable if for each topic $k$, there exists a row of $\beta$ that has a single nonzero entry, which is in the $k$th column. Figure 1 shows an example of a separable topic matrix with $K = 3$ topics. Words 1 and 2 are unique (novel) to topic 1, words 3 and 4 to topic 2, and word 5 to topic 3.
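The separability condition is easy to check mechanically: every topic column must own at least one row whose only nonzero entry lies in that column. The sketch below (illustrative code, not from the paper) tests this on a matrix of the kind described for Fig. 1:

```python
import numpy as np

def is_separable(beta, tol=1e-12):
    """True iff every topic (column) k owns at least one 'novel' word,
    i.e., a row whose only nonzero entry is in column k."""
    W, K = beta.shape
    owners = set()
    for w in range(W):
        nz = np.flatnonzero(np.abs(beta[w]) > tol)
        if nz.size == 1:               # row supported on a single topic
            owners.add(int(nz[0]))
    return len(owners) == K

# A Fig.-1-style example: words 1-2 novel to topic 1, words 3-4 to topic 2,
# word 5 to topic 3, word 6 non-novel (columns sum to one).
beta = np.array([[0.3, 0.0, 0.0],
                 [0.2, 0.0, 0.0],
                 [0.0, 0.4, 0.0],
                 [0.0, 0.3, 0.0],
                 [0.0, 0.0, 0.5],
                 [0.5, 0.3, 0.5]])
assert is_separable(beta)
```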
Let $\mathcal{C}_k$ be the set of novel words of topic $k$, for $k = 1, \ldots, K$, and let $\mathcal{C}_0$ be the set of remaining (non-novel) words in the vocabulary. Let $A_w$ and $\theta_k$ denote the $w$th and $k$th row vectors of $A$ and $\theta$ respectively. Observe that all the row vectors of $A$ that correspond to the novel words of the same topic $k$ are just differently scaled versions of the same row vector: for each $w \in \mathcal{C}_k$, $A_w = \beta_{wk}\theta_k$. Thus if $\bar{A}$, $\bar{X}$, and $\bar{\theta}$ denote the row-normalized versions (i.e., unit row sums) of $A$, $X$, and $\theta$ respectively, then $\bar{A}_w = \bar{\theta}_k$ for all $w \in \mathcal{C}_k$ (e.g., in Fig. 1, $\bar{A}_1 = \bar{A}_2 = \bar{\theta}_1$), and for all $w \in \mathcal{C}_0$, $\bar{A}_w$ lives in the convex hull of the $\bar{\theta}_k$'s (in Fig. 1, the row of a non-novel word is in the convex hull of $\bar{\theta}_1, \bar{\theta}_2, \bar{\theta}_3$).
This geometric viewpoint reveals how to extract the topic matrix $\beta$ from $A$: (1) Row-normalize $A$ to $\bar{A}$. (2) Find the extreme points of $\bar{A}$'s row vectors. (3) Cluster the row vectors of $\bar{A}$ that correspond to the same extreme point into the same group. There will be $K$ disjoint groups, and each group will correspond to the novel words of one topic. (4) Express the remaining row vectors of $\bar{A}$ as convex combinations of the extreme points. The convex-combination weights determine $\beta$ up to row and column scalings. (5) Finally, renormalize to obtain the column-stochastic $\beta$.
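In the idealized, noise-free setting the five steps can be sketched end-to-end as follows. Two caveats: the extreme-row search here is the classic successive-projection heuristic rather than this paper's Algorithms 1–2, and the convex-combination step is done by plain least squares, which suffices only because the ideal rows lie exactly in the convex hull of the extreme rows:

```python
import numpy as np

def recover_beta_ideal(A, K):
    """Recover the column-stochastic topic matrix from the ideal matrix
    A = beta @ theta, assuming separability and no sampling noise."""
    row_sums = A.sum(axis=1, keepdims=True)
    Abar = A / row_sums                                 # (1) row-normalize
    # (2) find K extreme rows by successive projection: repeatedly pick the
    # row of largest residual norm and project its direction out.
    R = Abar.copy()
    extremes = []
    for _ in range(K):
        w = int(np.argmax(np.linalg.norm(R, axis=1)))
        extremes.append(w)
        u = R[w] / np.linalg.norm(R[w])
        R = R - np.outer(R @ u, u)
    E = Abar[extremes]                                  # K extreme rows (cluster reps)
    # (3)-(4) express every row as a combination of the extreme rows; in the
    # noise-free case the weights are nonnegative and sum to one automatically.
    C, *_ = np.linalg.lstsq(E.T, Abar.T, rcond=None)
    Bbar = np.clip(C.T, 0.0, None)                      # one weight row per word
    # (5) undo the row normalization, then make the columns stochastic.
    beta = Bbar * row_sums
    return beta / beta.sum(axis=0, keepdims=True)
```

On exact data $A = \beta\theta$ with a separable $\beta$, this recovers $\beta$ up to a permutation of the topic columns.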
The reality, however, is that we only have access to $X$, not $A$. The above algorithm, when applied to $\bar{X}$, would work well if $\bar{X}$ were close to $\bar{A}$, which would happen if $N$ is large. When $N$ is small, two problems arise: (i) points corresponding to novel words of the same topic may become multiple extreme points that are far from each other, and (ii) points in the convex hull may also become "outlier" extreme points (see Fig. 1). As a step towards overcoming these difficulties, we observe that in practice the novel words of any topic occur in only a few documents. This implies that the rows of $\theta$ are sparse and that the row vectors of $\bar{X}$ corresponding to the novel words of the same topic are likely to form a low-dimensional subspace, since their supports are subsets of the support of the same row vector of $\theta$. If we make the further assumption that for any pair of distinct topics there are several documents in which their novel words do not co-occur, then the row subspaces of $\bar{X}$ corresponding to the novel words of any two distinct topics are likely to be significantly disjoint (although they might share a common low-dimensional subspace). Finally, the row vectors of $\bar{X}$ corresponding to non-novel words are unlikely to be close to the row subspace of $\bar{X}$ corresponding to the novel words of any one topic. These observations and assumptions motivate the revised steps in Algorithm 1 for extracting $\beta$ from $X$.
Step (2) of Algorithm 1 finds rows of $\bar{X}$, many of which are likely to correspond to the novel words of topics and some to outliers (non-novel words). This step uses Algorithm 2, a linear-complexity procedure for finding, with high probability, extreme points and points close to them (the candidate novel words of topics) using a small number of random projections. Step (3) uses the state-of-the-art sparse subspace clustering algorithm from [9, 10] to identify $K$ clusters of novel words, one for each topic, and an additional cluster containing the outliers (non-novel words). Step (4) expresses the rows of $\bar{X}$ corresponding to non-novel words as convex combinations of these groups of rows, and step (5) estimates the entries of the topic matrix and normalizes it to make it column-stochastic. In many applications, non-novel words occur in only a few topics. The group-sparsity penalty proposed in [11] is therefore used in step (4) of Algorithm 1 to favor solutions in which the row vectors of non-novel words are convex combinations of as few groups of novel words as possible. Our proposed algorithm runs in time polynomial in $W$, $M$, and $N$, and all the optimization problems involved are convex.
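The random-projection idea behind step (2) — an extreme point of a point cloud is, with probability one, the maximizer (or minimizer) of some random linear functional, whereas strictly interior points never are — can be illustrated with the following generic sketch (an illustration of the principle, not the paper's actual Algorithm 2):

```python
import numpy as np

def candidate_extremes(rows, num_projections=200, seed=0):
    """Project the rows onto random directions; each direction's argmax and
    argmin are extreme points of the cloud with probability one."""
    rng = np.random.default_rng(seed)
    candidates = set()
    for _ in range(num_projections):
        d = rng.standard_normal(rows.shape[1])  # random direction
        proj = rows @ d
        candidates.add(int(np.argmax(proj)))
        candidates.add(int(np.argmin(proj)))
    return sorted(candidates)
```

Each projection costs one matrix–vector product, so the total cost is linear in the number of points; how often an index is hit also gives a rough confidence score for candidate novel words.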
3 Experimental results
3.1 Synthetic Dataset
In this section, we validate our algorithm on synthetic examples. We generate a $W \times K$ separable topic matrix with a fixed number of novel words per topic as follows: first, iid row vectors corresponding to the non-novel words are generated uniformly on the probability simplex. Then, iid values are generated for the nonzero entries in the rows of novel words. The resulting matrix is column-normalized to get one realization of $\beta$. Next, $M$ iid column vectors are generated for the $\theta$ matrix according to a Dirichlet prior $\text{Dir}(\alpha)$. Following [12], we set $\alpha_k = 50/K$ for all $k$. Let $A = \beta\theta$. Finally, we obtain $X$ by generating $N$ iid words for each document.
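The generation procedure above can be sketched as follows; the number of novel words per topic and the uniform draw for their nonzero entries are illustrative choices consistent with the description:

```python
import numpy as np

def make_synthetic_corpus(W, K, M, N, novel_per_topic=3, seed=0):
    """Generate (beta, theta, X): a separable column-stochastic topic matrix,
    Dirichlet(50/K) mixing weights, and empirical word frequencies."""
    rng = np.random.default_rng(seed)
    beta = np.zeros((W, K))
    n_novel = novel_per_topic * K
    for k in range(K):                       # novel-word rows: single nonzero entry
        rows = slice(k * novel_per_topic, (k + 1) * novel_per_topic)
        beta[rows, k] = rng.random(novel_per_topic)
    # non-novel rows: uniform on the probability simplex
    beta[n_novel:] = rng.dirichlet(np.ones(K), size=W - n_novel)
    beta /= beta.sum(axis=0, keepdims=True)  # column-normalize
    theta = rng.dirichlet(np.full(K, 50.0 / K), size=M).T   # K x M mixing weights
    A = beta @ theta
    # each document: N iid word draws, reported as empirical frequencies
    X = np.column_stack([rng.multinomial(N, A[:, m]) for m in range(M)]) / N
    return beta, theta, X
```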
For different settings of $W$, $K$, $M$, and $N$, we calculate the error of the estimated topic matrix $\hat{\beta}$ with respect to the ground truth $\beta$. For each setting we average the error over several random samples. In sparse subspace clustering, one tuning parameter is set as in [10] (it depends on the size of the candidate set) and another as in [9]. In Step 4 of Algorithm 1, the group-sparsity regularization parameter is kept fixed across all settings.
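Since the columns of an estimated topic matrix can come back in any order, the error between $\hat{\beta}$ and $\beta$ must be computed after matching topic columns. The paper's exact error formula is not preserved in this extract, so the following is one reasonable permutation-invariant choice:

```python
import itertools
import numpy as np

def topic_error(beta_true, beta_est):
    """Frobenius-norm error minimized over all column permutations.
    Brute force is fine for small K; use the Hungarian algorithm otherwise."""
    K = beta_true.shape[1]
    return min(np.linalg.norm(beta_true - beta_est[:, list(p)])
               for p in itertools.permutations(range(K)))
```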
We compare our algorithm against the LDA algorithm [2] and a state-of-the-art NMF-based algorithm [13]. This NMF algorithm is chosen because it compensates for the type of noise present in our topic model. Our LDA implementation uses Gibbs sampling for inference. Figure 2 depicts the estimation error as a function of the number of documents $M$ (top) and the number of words per document $N$ (bottom). Evidently, our algorithm is uniformly better than the comparable techniques. Specifically, while NMF attains error similar to our algorithm's for large $N$, it performs relatively poorly as a function of $M$. On the other hand, LDA has error performance similar to ours for large $M$ but performs poorly as a function of $N$. Note that both of these algorithms have comparably high error rates when $M$ and $N$ are small.
3.2 Swimmer Image Dataset
[Figures: topic images for the 16 single-limb positions — Pos. LA 1–4, RA 1–4, LL 1–4, RL 1–4 — shown in panels a), b), c) (first figure) and a), b) (second figure); the images themselves are not reproduced here.]
In this section we apply our algorithm to the synthetic swimmer image dataset introduced in [5]. There are 256 binary images, each of $32 \times 32$ pixels. Each image represents a swimmer composed of four limbs, each of which can be in one of 4 distinct positions, and a torso.
We interpret pixel positions as words in a dictionary. Documents are images, where an image is interpreted as a collection of pixel positions with nonzero values. Since each of the four limbs can independently take one of four positions, it turns out that the topic matrix satisfies the separability assumption with $K = 16$ "ground truth" topics that correspond to single limb positions. Following the setting of [13], we set body pixel values to 10 and background pixel values to 1. We then take each "clean" image, suitably normalized, as an underlying distribution across pixels and generate a "noisy" document of $N$ iid "words" according to the topic model. Examples are shown in Fig. 3. We then apply our algorithm to the "noisy" dataset and again compare it against LDA and the NMF algorithm from [13]. Results are shown in Figures 4 and 5. The values of the tuning parameters are set as in Sec. 3.1.
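The clean-image-to-noisy-document step described above (normalize a nonnegative image into a distribution over pixel positions, then draw iid pixel "words") can be sketched as follows; the image content below is a hypothetical stand-in, not the actual swimmer renderer:

```python
import numpy as np

def noisy_document_from_image(image, N, rng):
    """Treat a nonnegative image as an unnormalized distribution over pixel
    positions and sample N iid 'words' (pixel indices) from it."""
    p = image.ravel().astype(float)
    p /= p.sum()
    counts = rng.multinomial(N, p)       # word counts per pixel position
    return counts.reshape(image.shape)

rng = np.random.default_rng(0)
clean = np.ones((32, 32))                # background pixels set to value 1
clean[8:24, 15:17] = 10.0                # a torso-like stripe at value 10
noisy = noisy_document_from_image(clean, N=500, rng=rng)
```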
This dataset is a good validation test for different algorithms since the ground-truth topics are known and unique. As we see in Fig. 5, both LDA and NMF produce topics that do not correspond to any pure left/right arm/leg positions; indeed, many estimated topics are composed of multiple limbs. In contrast, our algorithm makes no such errors, and our topic estimates are closer to the ground-truth images.
3.3 Text Corpora
Table 1: Most frequent words of typical estimated topics. Top: NIPS dataset; bottom: NY Times dataset.

“chips”   “vision”   “networks”   “learning”
chip      visual     network      learning
circuit   cells      routing      training
analog    ocular     system       error
current   cortical   delay        SVM
gate      activity   load         model

“election”   “law”     “market”    “game”
state        case      market      game
politics     law       executive   play
election     lawyer    industry    team
campaign     charge    sell        run
vote         court     business    season
In this section, we apply our algorithm to two different text corpora, namely the NIPS dataset [14] and the New York (NY) Times dataset [15]. The two corpora differ substantially in scale: the NY Times dataset contains far more documents and a larger vocabulary than the NIPS dataset. In each case, the vocabulary is obtained by deleting a standard "stop" word list used in computational linguistics, including numbers, individual characters, and some common English words such as "the". Words that occur too few times in the dataset, or that occur in too few documents, are removed from the vocabulary as well. The tuning parameters are set in the same way as in Sec. 3.1.
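The vocabulary pruning described above (stop-word removal plus term-frequency and document-frequency thresholds) can be sketched as follows; the threshold values are placeholders, since the paper's exact cutoffs are not preserved in this extract:

```python
from collections import Counter

def prune_vocabulary(docs, stop_words, min_count=5, min_docs=3):
    """Remove stop words, rare words, and words confined to very few documents.
    `docs` is a list of token lists; thresholds here are illustrative."""
    term_count = Counter()   # total occurrences of each word
    doc_count = Counter()    # number of documents containing each word
    for doc in docs:
        term_count.update(doc)
        doc_count.update(set(doc))
    return {w for w in term_count
            if w not in stop_words
            and term_count[w] >= min_count
            and doc_count[w] >= min_docs}
```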
Table 1 depicts typical topics extracted by our algorithm. For each topic, we show its five most frequent words, listed in descending order of estimated probability. Although there is no "ground truth" to compare against, the most frequent words in the estimated topics form recognizable themes. For example, in the NIPS dataset, the set of (most frequent) words "chip", "circuit", etc., can be annotated as "IC design"; the words "visual", "cells", etc., can be labeled as "human visual system". As a point of comparison, we also experimented with related convex programming algorithms [8, 7] that have recently appeared in the literature and found that they fail to produce meaningful results for these datasets.
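Given an estimated column-stochastic topic matrix and the vocabulary, tables like Table 1 are produced by sorting each topic column by probability; a minimal sketch:

```python
import numpy as np

def top_words(beta, vocab, n=5):
    """Return each topic's n most probable words, in descending order."""
    return [[vocab[w] for w in np.argsort(beta[:, k])[::-1][:n]]
            for k in range(beta.shape[1])]
```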
References

[1] T. Hofmann, “Probabilistic latent semantic analysis,” in Uncertainty in Artificial Intelligence, San Francisco, CA, 1999, pp. 289–296, Morgan Kaufmann Publishers.
[2] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” J. Mach. Learn. Res., vol. 3, pp. 993–1022, Mar. 2003.
[3] D. M. Blei, “Probabilistic topic models,” Commun. ACM, vol. 55, no. 4, pp. 77–84, Apr. 2012.
[4] D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, no. 6755, pp. 788–791, Oct. 1999.
[5] D. Donoho and V. Stodden, “When does non-negative matrix factorization give a correct decomposition into parts?,” in Advances in Neural Information Processing Systems 16, Cambridge, MA, 2004, MIT Press.
[6] A. Cichocki, R. Zdunek, A. H. Phan, and S. Amari, Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation, Wiley, 2009.
[7] B. Recht, C. Re, J. Tropp, and V. Bittorf, “Factoring nonnegative matrices with linear programs,” in Advances in Neural Information Processing Systems 25, 2012, pp. 1223–1231.
[8] S. Arora, R. Ge, and A. Moitra, “Learning topic models – going beyond SVD,” arXiv:1204.1956v2 [cs.LG], 2012.
[9] M. Soltanolkotabi and E. J. Candès, “A geometric analysis of subspace clustering with outliers,” arXiv e-prints, Dec. 2011.
[10] E. Elhamifar and R. Vidal, “Sparse subspace clustering: algorithm, theory, and applications,” IEEE Trans. Pattern Analysis and Machine Intelligence, 2012.
[11] E. Esser, M. Möller, S. Osher, G. Sapiro, and J. Xin, “A convex model for nonnegative matrix factorization and dimensionality reduction on physical space,” IEEE Trans. Image Processing, vol. 21, pp. 3239–3252, Jul. 2012.
[12] T. Griffiths and M. Steyvers, “Finding scientific topics,” Proc. National Academy of Sciences, vol. 101, pp. 5228–5235, 2004.
[13] V. Y. F. Tan and C. Févotte, “Automatic relevance determination in nonnegative matrix factorization with the beta-divergence,” IEEE Trans. Pattern Analysis and Machine Intelligence, in press.
[14] A. Globerson, G. Chechik, F. Pereira, and N. Tishby, “Euclidean embedding of co-occurrence data,” J. Mach. Learn. Res., vol. 8, pp. 2265–2295, 2007.
[15] A. Chaney and D. M. Blei, “Visualizing topic models,” in International AAAI Conference on Weblogs and Social Media, 2012.