A New Geometric Approach to Latent Topic Modeling and Discovery

01/05/2013, by Weicong Ding, et al.

A new geometrically-motivated algorithm for nonnegative matrix factorization is developed and applied to the discovery of latent "topics" for text and image "document" corpora. The algorithm is based on robustly finding and clustering extreme points of empirical cross-document word-frequencies that correspond to novel "words" unique to each topic. In contrast to related approaches that are based on solving non-convex optimization problems using suboptimal approximations, locally-optimal methods, or heuristics, the new algorithm is convex, has polynomial complexity, and has competitive qualitative and quantitative performance compared to the current state-of-the-art approaches on synthetic and real-world datasets.






1 Introduction

Topic modeling is a statistical tool for the automatic discovery and comprehension of the latent thematic structure, or topics, assumed to pervade a corpus of documents.

Suppose that we have a corpus of documents composed of words from a vocabulary of distinct words. In the classic “bag of words” modeling paradigm widely used in Probabilistic Latent Semantic Analysis [1] and Latent Dirichlet Allocation (LDA) [2, 3], each document is modeled as being generated by independent and identically distributed (iid) draws of words from an unknown document word-distribution vector. Each document word-distribution vector is itself modeled as an unknown probabilistic mixture of unknown latent topic word-distribution vectors that are shared among the documents in the corpus. The goal of topic modeling is then to estimate the latent topic word-distribution vectors, and possibly the topic mixing weights of each document, from the empirical word-frequency vectors of all documents. Topic modeling has also been applied to various types of data other than text, e.g., images, videos (with photometric and spatio-temporal feature vectors interpreted as words), genetic sequences, hyper-spectral images, voice, and music, for signal separation and blind deconvolution.

Let β denote the unknown topic matrix whose columns are the latent topic word-distribution vectors, and let θ denote the weight matrix whose columns are the mixing weights over topics for the documents. Then each column of the product βθ corresponds to a document word-distribution vector. Let X denote the observed words-by-documents matrix whose columns are the empirical word-frequency vectors of the documents, where each document is generated by iid draws of words from the corresponding column of βθ. Given only X and the number of topics, the goal is to estimate the topic matrix β and possibly the weight matrix θ. This can be formulated as a nonnegative matrix factorization (NMF) problem [4, 5, 6, 7], where the typical solution strategy is to minimize a cost function of the form

$$\min_{\beta \ge 0,\; \theta \ge 0} \; \|X - \beta\theta\|_F^2 + \lambda\,\psi(\beta, \theta), \qquad (1)$$

where the regularization term ψ(β, θ) is introduced to enforce desirable properties in the solution, such as uniqueness of the factorization or sparsity. The joint optimization of (1) with respect to β and θ is, however, non-convex and necessitates suboptimal strategies such as alternating minimization, greedy gradient descent, local search, approximations, and heuristics. These are also typically sensitive to small sample sizes (few words per document): when the number of words per document is small, many words may not be sampled at all and X may be far from βθ in Euclidean distance. In LDA, the columns of β and θ are modeled as iid draws from Dirichlet prior distributions. The resulting maximum a posteriori estimation of β and θ, however, turns out to be a fairly complex non-convex problem, and one must take recourse to suboptimal solutions based on variational Bayes approximations of the posterior distribution, or on other methods such as Gibbs sampling and expectation propagation.
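To make the alternating-minimization strategy concrete, here is a minimal sketch using the classical multiplicative updates of Lee and Seung [4]; the function name, iteration count, and initialization are our own illustrative choices, not the paper's method:

```python
import numpy as np

def nmf_alternating(X, K, n_iter=200, seed=0):
    """Minimal multiplicative-update NMF sketch (Lee-Seung style),
    illustrating the non-convex alternating strategy: each factor is
    updated in turn so that ||X - B T||_F^2 is non-increasing.
    X: nonnegative (words x documents) matrix; K: number of topics."""
    rng = np.random.default_rng(seed)
    W, M = X.shape
    B = rng.random((W, K)) + 1e-3   # topic-matrix estimate
    T = rng.random((K, M)) + 1e-3   # weight-matrix estimate
    eps = 1e-12                     # avoids division by zero
    for _ in range(n_iter):
        T *= (B.T @ X) / (B.T @ B @ T + eps)
        B *= (X @ T.T) / (B @ T @ T.T + eps)
    return B, T
```

Each update is globally optimal only for one factor with the other held fixed, so the iteration converges to a local minimum that depends on the random initialization, which is exactly the weakness motivating the convex approach of this paper.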

In contrast to these approaches, we adopt the nonnegative matrix factorization framework and propose a new geometrically motivated algorithm that has competitive performance compared to the current state of the art and is free of heuristics and approximations.

2 A new geometric approach

A key ingredient of the new approach is the so-called “separability” assumption introduced in [5] to ensure the uniqueness of nonnegative matrix factorization. Applied to the topic matrix, this means that each topic contains “novel” words which appear only in that topic – a property that has been found to hold in the estimates of topic matrices produced by several algorithms [8]. More precisely, a topic matrix is separable if, for each topic, there exists a row of the matrix that has a single non-zero entry, located in that topic's column. Figure 1 shows an example of a separable topic matrix with three topics. Words 1 and 2 are unique (novel) to topic 1, words 3 and 4 to topic 2, and word 5 to topic 3.
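The separability condition can be checked mechanically. The following sketch (the function name is ours) tests whether every topic of a given topic matrix has at least one novel word:

```python
import numpy as np

def is_separable(beta):
    """Return True iff every topic (column of beta) has at least one
    novel word: a row whose single nonzero entry lies in that column."""
    W, K = beta.shape
    novel_topics = set()
    for row in beta:
        nz = np.flatnonzero(row)
        if len(nz) == 1:              # row supported on a single topic
            novel_topics.add(int(nz[0]))
    return novel_topics == set(range(K))
```

On the Figure 1 example (two novel words for topic 1, two for topic 2, one for topic 3, plus mixed rows) this returns True; removing the single novel word of any topic breaks separability.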

Group the novel words by topic, and let the remaining words of the vocabulary be the non-novel words. Consider the rows of the matrix product of the topic matrix and the weight matrix. All rows of this product that correspond to novel words of the same topic are simply different scaled versions of the same row of the weight matrix. Thus, after row-normalization (unit row sums), the rows corresponding to novel words of the same topic become identical and coincide with the corresponding row-normalized row of the weight matrix (e.g., the rows of words 1 and 2 in Fig. 1 coincide), while each row corresponding to a non-novel word lies in the convex hull of the normalized rows of the weight matrix (e.g., the non-novel word in Fig. 1 lies in the convex hull of the three extreme points).

Figure 1: A separable topic matrix and the underlying geometric structure. Solid circles represent normalized rows corresponding to novel words; empty circles represent normalized rows corresponding to non-novel words.
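The collapse of novel-word rows onto a common extreme point, and the convex-hull property of non-novel-word rows, can be verified numerically on a tiny separable example in the spirit of Fig. 1 (the specific matrix values below are illustrative assumptions):

```python
import numpy as np

# Separable topic matrix: rows 0-1 novel to topic 1, rows 2-3 to topic 2,
# row 4 to topic 3; rows 5-6 are non-novel words. Columns sum to 1.
beta = np.array([
    [0.3, 0.0, 0.0],
    [0.1, 0.0, 0.0],
    [0.0, 0.2, 0.0],
    [0.0, 0.3, 0.0],
    [0.0, 0.0, 0.4],
    [0.2, 0.2, 0.3],
    [0.4, 0.3, 0.3],
])
rng = np.random.default_rng(0)
theta = rng.dirichlet(np.ones(3), size=10).T    # 3 x 10 mixing weights
A = beta @ theta                                # ideal word-document distributions
A_bar = A / A.sum(axis=1, keepdims=True)        # row-normalize to unit row sums
extremes = A_bar[[0, 2, 4]]                     # one extreme point per topic
```

Rows 0 and 1 (novel to topic 1) become identical after normalization, as do rows 2 and 3, while each non-novel row is an exact convex combination of the three extreme points.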

This geometric viewpoint reveals how to extract the topic matrix from the matrix of document word-distributions: (1) Row-normalize the matrix to unit row sums. (2) Find the extreme points of its row-vectors. (3) Cluster the row-vectors that correspond to the same extreme point into the same group; there will be as many disjoint groups as topics, and each group will contain the novel words of one topic. (4) Express the remaining row-vectors as convex combinations of the extreme points; the resulting convex weights, rescaled by the row sums, give the entries of the topic matrix. (5) Finally, renormalize the columns to obtain a column-stochastic topic matrix.
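Assuming the novel-word groups are already known (in practice they come from steps 2 and 3), the five steps recover the topic matrix exactly from the ideal statistics. A minimal sketch, with names of our own choosing:

```python
import numpy as np

def recover_beta(A, novel_groups):
    """Five-step extraction on the ideal matrix A = (topic matrix) @
    (weight matrix). `novel_groups` lists the row indices of the novel
    words of each topic; here they are assumed known for illustration."""
    row_sums = A.sum(axis=1, keepdims=True)
    A_bar = A / row_sums                              # step 1: row-normalize
    E = np.stack([A_bar[g[0]] for g in novel_groups]) # steps 2-3: extreme points
    W, K = A.shape[0], len(novel_groups)
    B = np.zeros((W, K))
    for i in range(W):                                # step 4: convex combinations
        c, *_ = np.linalg.lstsq(E.T, A_bar[i], rcond=None)
        B[i] = np.clip(c, 0.0, None)
    B *= row_sums                                     # step 5: undo row scaling and
    return B / B.sum(axis=0, keepdims=True)           #         make columns stochastic
```

Because the convex weight of word i on topic k equals that word's topic-matrix entry times the (common) scale of topic k's weight row, undoing the row normalization and then column-normalizing cancels the per-topic scales and returns the topic matrix exactly.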

The reality, however, is that we only have access to the empirical word-frequency matrix, not the ideal document word-distributions. The above algorithm, applied to the empirical matrix, would work well if it were close to the ideal one, which happens when the number of words per document is large. When that number is small, two problems arise: (i) points corresponding to novel words of the same topic may split into multiple extreme points that are far from each other (cf. Fig. 1); (ii) points inside the convex hull may also become “outlier” extreme points (cf. Fig. 1).

As a step towards overcoming these difficulties, we observe that in practice the novel words of any topic occur in only a few documents. This implies that the corresponding rows of the weight matrix are sparse, and that the empirical rows corresponding to novel words of the same topic are likely to form a low-dimensional subspace (cf. Fig. 1), since their supports are subsets of the support of the same row of the weight matrix. If we make the further assumption that for any pair of distinct topics there are several documents in which their novel words do not co-occur, then the row subspaces corresponding to the novel words of any two distinct topics are likely to be significantly disjoint (although they might share a common low-dimensional subspace). Finally, the empirical rows corresponding to non-novel words are unlikely to be close to the row subspace of the novel words of any one topic (cf. Fig. 1). These observations and assumptions motivate the revised five-step Algorithm 1 for extracting the topic matrix from the empirical word-frequency matrix.

Algorithm 1: Topic Discovery
Input: word-document matrix; number of topics.
Output: estimate of the topic matrix.
1: Row-normalize the word-document matrix to unit row sums, recording the row sums.
2: Apply Algorithm 2 to the rows of the normalized matrix to obtain a subset of rows corresponding to candidate novel words; the remaining row indices form the non-candidate set.
3: Apply the sparse subspace clustering algorithm of [9, 10] to the candidate rows to obtain one cluster of novel words per topic plus one cluster of outlier words. Rearrange the rows indexed by the novel-word clusters into a matrix of extreme points.
4: For each remaining row, solve a convex program (with a group-sparsity penalty) expressing it as a convex combination of the extreme-point rows; record the optimal weights.
5: Set the entries of the topic-matrix estimate from the optimal weights, undo the row normalization, and normalize each column to be column-stochastic.

Algorithm 2: Find candidate novel words
Input: set of probability row-vectors; number of projections; tolerance.
Output: set of candidate novel row-vectors.
1: Initialize the candidate set to be empty.
2: Generate a random direction uniformly distributed on the unit sphere.
3: Add to the candidate set the rows whose projections onto this direction are within the tolerance of the maximum or of the minimum projection.
4: Repeat steps 2 and 3 for the given number of projections.

Step (2) of Algorithm 1 finds rows of the normalized matrix, many of which are likely to correspond to novel words of topics and some to outliers (non-novel words). This step uses Algorithm 2, a linear-complexity procedure that, with high probability, finds extreme points and points close to them (the candidate novel words) using a small number of random projections. Step (3) uses the state-of-the-art sparse subspace clustering algorithm from [9, 10] to identify one cluster of novel words per topic and an additional cluster containing the outliers (non-novel words). Step (4) expresses the rows corresponding to non-novel words as convex combinations of these groups of rows, and step (5) estimates the entries of the topic matrix and normalizes it to be column-stochastic. In many applications, non-novel words occur in only a few topics; the group-sparsity penalty proposed in [11] is therefore used in step (4) of Algorithm 1 to favor solutions in which the rows of non-novel words are convex combinations of as few groups of novel words as possible. Our proposed algorithm runs in time polynomial in the vocabulary size, the number of documents, and the number of topics, and all the optimization problems involved are convex.
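The random-projection search of Algorithm 2 can be sketched as follows. For simplicity this version keeps only the exact maximizer and minimizer of each projection rather than a tolerance band, and the function name and defaults are our own:

```python
import numpy as np

def candidate_novel_rows(Xbar, n_proj=50, seed=0):
    """Random-projection search for extreme points: for each random
    direction on the unit sphere, the rows attaining the maximum and
    minimum projections are extreme points of the row set with
    probability one. Cost is linear in the number of rows per projection."""
    rng = np.random.default_rng(seed)
    W, M = Xbar.shape
    candidates = set()
    for _ in range(n_proj):
        d = rng.normal(size=M)
        d /= np.linalg.norm(d)          # uniform direction on the unit sphere
        proj = Xbar @ d
        candidates.add(int(np.argmax(proj)))
        candidates.add(int(np.argmin(proj)))
    return sorted(candidates)
```

On a point set consisting of simplex vertices plus interior points, only the vertices are ever returned, and with enough projections every vertex is found with high probability.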

3 Experimental results

3.1 Synthetic Dataset

Figure 2: Error of the estimated topic matrix in Frobenius norm. Upper: as a function of the number of documents; Lower: as a function of the number of words per document.

In this section, we validate our algorithm on synthetic examples. We generate a separable topic matrix with a fixed number of novel words per topic as follows: first, iid row-vectors corresponding to the non-novel words are generated uniformly on the probability simplex; then iid values are generated for the nonzero entries in the rows of the novel words. The resulting matrix is column-normalized to obtain one realization of the topic matrix. Next, iid columns of the weight matrix are generated according to a Dirichlet prior; following [12], we set all Dirichlet parameters equal. Finally, we obtain the observed matrix by generating iid words for each document.
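A minimal generator for this synthetic setup might look as follows; the prior on the novel-word entries, the default sizes, and the function name are assumptions, since the exact values are not reproduced here:

```python
import numpy as np

def make_synthetic(W=100, K=5, n_novel=3, M=500, N=100, alpha=0.1, seed=0):
    """Generate one synthetic corpus: W words, K topics, n_novel novel
    words per topic, M documents, N iid words per document, symmetric
    Dirichlet(alpha) mixing weights. Returns (beta, theta, X)."""
    rng = np.random.default_rng(seed)
    beta = np.zeros((W, K))
    # Novel-word rows: a single nonzero entry per row (illustrative prior).
    for k in range(K):
        beta[k * n_novel:(k + 1) * n_novel, k] = rng.random(n_novel)
    # Non-novel rows: drawn uniformly on the probability simplex.
    beta[K * n_novel:] = rng.dirichlet(np.ones(K), size=W - K * n_novel)
    beta /= beta.sum(axis=0, keepdims=True)              # column-stochastic
    theta = rng.dirichlet(alpha * np.ones(K), size=M).T  # K x M weights
    A = beta @ theta
    # Empirical word frequencies: N iid word draws per document.
    X = np.stack([rng.multinomial(N, A[:, j]) for j in range(M)], axis=1) / N
    return beta, theta, X
```

Note that the multinomial sampling is exactly the “noise” model of the paper: columns of X concentrate around the columns of the product matrix only as N grows.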

For different settings of the number of documents, the number of words per document, and the number of topics, we calculate the Frobenius-norm error of the estimated topic matrix, matching estimated topics to true topics since topics are recovered only up to a column permutation. For each setting we average the error over several random realizations. In sparse subspace clustering, the tuning parameters are set as in [10] (depending on the size of the candidate set) and as in [9]. In step 4 of Algorithm 1, the group-sparsity regularization weight is kept the same across all settings.
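Since topics are recovered only up to relabeling, a permutation-minimized Frobenius error is the natural metric. A brute-force sketch (our own naming), adequate for small numbers of topics:

```python
import numpy as np
from itertools import permutations

def topic_error(beta, beta_hat):
    """Frobenius-norm error between true and estimated topic matrices,
    minimized over column permutations of the estimate. Brute force over
    K! permutations; for larger K use the Hungarian algorithm instead."""
    K = beta.shape[1]
    return min(np.linalg.norm(beta - beta_hat[:, list(p)])
               for p in permutations(range(K)))
```

A column-permuted copy of the true matrix thus scores zero error, as it should.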

We compare our algorithm against the LDA algorithm [2] and a state-of-the-art NMF-based algorithm [13]. This NMF algorithm is chosen because it compensates for the type of noise in our topic model. Our LDA baseline uses Gibbs sampling for inference. Figure 2 depicts the estimation error as a function of the number of documents (top) and the number of words per document (bottom). Evidently, our algorithm is uniformly better than the comparable techniques. Specifically, while the NMF algorithm attains error similar to ours when one of these two parameters is large, it performs relatively poorly as a function of the other; LDA shows the complementary behavior. Note that both of these algorithms have comparably high error rates when both the number of documents and the number of words per document are small.

3.2 Swimmer Image Dataset

Figure 3: (a) Example “clean” images (columns of the ideal word-distribution matrix) in the Swimmer dataset; (b) corresponding images with sampling “noise” (columns of the observed matrix); (c) examples of ideal topics (columns of the ground-truth topic matrix).
Figure 4: Topics estimated for the noisy Swimmer dataset by (a) the proposed algorithm, (b) LDA inference using the code in [12], (c) the NMF algorithm using the code in [13]. For each method, the estimated topics closest to the 16 ideal (ground-truth) topics LA 1–4, RA 1–4, LL 1–4, RL 1–4 are shown. LDA and NMF each miss several of the ground-truth topics, while our algorithm recovers all 16, and our topic estimates look less noisy.
Figure 5: Topic errors for (a) the LDA algorithm [12] and (b) the NMF algorithm [13] on the Swimmer dataset. The figure depicts topics that are extracted by LDA and NMF but are not close to any “ground truth” topic. The ground-truth topics correspond to the 16 different positions of the left/right arms and legs.

In this section we apply our algorithm to the synthetic Swimmer image dataset introduced in [5]. The dataset consists of binary images, each representing a swimmer composed of four limbs, each of which can be in one of four distinct positions, and a torso.

We interpret pixel positions as words in a dictionary. Documents are images, where an image is interpreted as a collection of pixel positions with non-zero values. Since each of the four limbs can independently take one of four positions, the topic matrix satisfies the separability assumption with 16 “ground truth” topics that correspond to single limb positions. Following the setting of [13], we set body pixel values to 10 and background pixel values to 1. We then take each “clean” image, suitably normalized, as an underlying distribution across pixels and generate a “noisy” document of iid “words” according to the topic model. Examples are shown in Fig. 3. We then apply our algorithm to the “noisy” dataset and again compare it against LDA and the NMF algorithm from [13]. Results are shown in Figures 4 and 5. The tuning parameters are set as in Sec. 3.1.

This dataset is a good validation test for different algorithms since the ground-truth topics are known and unique. As we see in Fig. 5, both LDA and NMF produce topics that do not correspond to any pure left/right arm/leg position; indeed, many estimated topics are composed of multiple limbs. In contrast, our algorithm makes no such errors, and our topic estimates are closer to the ground-truth images.

3.3 Text Corpora

“chips”      “vision”     “networks”   “learning”
chip         visual       network      learning
circuit      cells        routing      training
analog       ocular       system       error
current      cortical     delay        SVM
gate         activity     load         model

“election”   “law”        “market”     “game”
state        case         market       game
politics     law          executive    play
election     lawyer       industry     team
campaign     charge       sell         run
vote         court        business     season

Table 1: Most frequent words in examples of estimated topics. Upper: NIPS; Lower: NY Times.

In this section, we apply our algorithm to two different text corpora, namely the NIPS dataset [14] and the New York (NY) Times dataset [15]. The vocabulary of each corpus is obtained by deleting a standard “stop” word list used in computational linguistics, including numbers, individual characters, and some common English words such as “the”. Words that occur too few times in the dataset, or that occur in too few documents, are removed from the vocabulary as well. The tuning parameters are set in the same way as in Sec. 3.1.
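The vocabulary pruning described above can be sketched as follows; the threshold values and the function name are placeholders, since the exact cutoffs are not given here:

```python
import numpy as np

def prune_vocab(X, stop_ids, min_count=5, min_docs=3):
    """Prune a vocabulary given a words-by-documents count matrix X:
    drop stop words (rows indexed by stop_ids), words with total corpus
    frequency below min_count, and words appearing in fewer than
    min_docs documents. Returns the indices of the surviving rows."""
    keep = np.ones(X.shape[0], dtype=bool)
    keep[list(stop_ids)] = False
    keep &= X.sum(axis=1) >= min_count           # total corpus frequency
    keep &= (X > 0).sum(axis=1) >= min_docs      # document frequency
    return np.flatnonzero(keep)
```

The surviving indices can then be used to slice both the count matrix and the word list before running Algorithm 1.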

Table 1 depicts typical topics extracted by our algorithm. For each topic we show its most frequent words, listed in descending order of estimated probability. Although there is no “ground truth” to compare against, the most frequent words in the estimated topics do form recognizable themes. For example, in the NIPS dataset, the set of (most frequent) words “chip”, “circuit”, etc., can be annotated as “IC Design”; the words “visual”, “cells”, etc., can be labeled as “human visual system”. As a point of comparison, we also experimented with related convex programming algorithms [8, 7] that have recently appeared in the literature; we found that they fail to produce meaningful results for these datasets.


  • [1] T. Hofmann, “Probabilistic latent semantic analysis,” in Uncertainty in Artificial Intelligence, San Francisco, CA, 1999, pp. 289–296, Morgan Kaufmann Publishers.
  • [2] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” J. Mach. Learn. Res., vol. 3, pp. 993–1022, Mar. 2003.
  • [3] D. M. Blei, “Probabilistic topic models,” Commun. ACM, vol. 55, no. 4, pp. 77–84, Apr. 2012.
  • [4] D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, no. 6755, pp. 788–791, Oct. 1999.
  • [5] D. Donoho and V. Stodden, “When does non-negative matrix factorization give a correct decomposition into parts?,” in Advances in Neural Information Processing Systems 16, Cambridge, MA, 2004, MIT Press.
  • [6] A. Cichocki, R. Zdunek, A. H. Phan, and S. Amari, Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation, Wiley, 2009.
  • [7] B. Recht, C. Re, J. Tropp, and V. Bittorf, “Factoring nonnegative matrices with linear programs,” in Advances in Neural Information Processing Systems 25, 2012, pp. 1223–1231.
  • [8] S. Arora, R. Ge, and A. Moitra, “Learning topic models – going beyond SVD,” arXiv:1204.1956v2 [cs.LG], 2012.
  • [9] M. Soltanolkotabi and E. J. Candes, “A geometric analysis of subspace clustering with outliers,” ArXiv e-prints, Dec. 2011.
  • [10] E. Elhamifar and R. Vidal, “Sparse subspace clustering: algorithm, theory, and applications,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012.
  • [11] E. Esser, M. Moller, S. Osher, G. Sapiro, and J. Xin, “A convex model for nonnegative matrix factorization and dimensionality reduction on physical space,” IEEE Trans. Image Processing, vol. 21, pp. 3239–3252, Jul. 2012.
  • [12] T. Griffiths and M. Steyvers, “Finding scientific topics,” Proceedings of the National Academy of Sciences, vol. 101, pp. 5228–5235, 2004.
  • [13] V. Y. F. Tan and C. Févotte, “Automatic relevance determination in nonnegative matrix factorization with the beta-divergence,” IEEE Transactions on Pattern Analysis and Machine Intelligence, in press.
  • [14] A. Globerson, G. Chechik, F. Pereira, and N. Tishby, “Euclidean embedding of co-occurrence data,” The Journal of Machine Learning Research, vol. 8, pp. 2265–2295, 2007.
  • [15] A. Chaney and D. M. Blei, “Visualizing topic models,” in International AAAI Conference on Weblogs and Social Media, 2012.