1 Introduction
Statistical topic modeling is useful in exploratory data analysis [Blei et al., 2003], but model inference is known to be NP-hard even for the simplest models with only two topics [Sontag and Roy, 2011], and training often remains a black box to users. Likelihood-based training requires expensive approximate inference such as variational methods [Blei et al., 2003], which are deterministic but sensitive to initialization, or Markov chain Monte Carlo (MCMC) methods [Griffiths and Steyvers, 2004], which have no finite convergence guarantees. Recently Arora et al. proposed the Anchor Words algorithm [Arora et al., 2013], which casts topic inference as statistical recovery using a separability assumption: each topic has a specific anchor word that appears only in the context of that single topic. Each anchor word can be used as a unique pivot to disambiguate the corresponding topic distribution. We then reconstruct the word co-occurrence pattern of each non-anchor word as a convex combination of the co-occurrence patterns of the anchor words.
This algorithm is fast, requiring only one pass through the training documents, and provides provable guarantees, but results depend entirely on selecting good anchor words. [Arora et al., 2013] propose a greedy method that finds an approximate convex hull around a set of vectors corresponding to the word co-occurrence patterns for each vocabulary word. Although this method is an improvement over previous work that used impractical linear programming methods [Arora et al., 2012], serious problems remain. The method greedily chooses the farthest point from the current subspace until the given number of anchors have been found. Particularly at the early stages of the algorithm, the words associated with the farthest points are likely to be infrequent and idiosyncratic, and thus form poor bases for human interpretation and topic recovery. This poor choice of anchors noticeably affects topic quality: the anchor words algorithm tends to produce large numbers of nearly identical topics.
Besides providing a separability criterion, anchor words also have the potential to improve topic interpretability. After learning topics for given text collections, users often request a label that summarizes each topic. Manually labeling topics is arduous, and labels often do not carry over between random initializations and models with differing numbers of topics. Moreover, it is hard to control the subjectivity of labelings across annotators, which leaves them open to interpretive errors. There has been considerable interest in automating the labeling process [Mei et al., 2007, Lau et al., 2011, Chuang et al., 2012]. [Chuang et al., 2012] propose a measure of saliency: a good summary term should be both distinctive specifically to one topic and probable in that topic. Anchor words are by definition optimally distinct, and therefore may seem to be good candidates for topic labels, but greedily selecting extreme words often results in anchor words that have low probability.
In this work we explore the opposite of Arora et al.’s method: rather than finding an approximate convex hull for an exact set of vectors, we find an exact convex hull for an approximate set of vectors. We project the word co-occurrence matrix to visualizable 2- and 3-dimensional spaces using methods such as t-SNE [van der Maaten and Hinton, 2008], resulting in an input matrix up to 3600 times narrower than the original input for our training corpora. Despite this radically low-dimensional projection, the method not only finds topics that are as good as or better than those of the greedy anchor method, it also finds highly salient, interpretable anchor words and provides users with a clear visual explanation for why the algorithm chooses particular words, all while maintaining the original algorithm’s computational benefits.
2 Related Work
Latent Dirichlet allocation (LDA) [Blei et al., 2003] models $D$ documents with a vocabulary of size $V$ using a predefined number of topics $K$. LDA views $\{A_k\}_{k=1}^{K}$, a set of $K$ topic-word distributions for each topic $k$; $\{W_d\}_{d=1}^{D}$, a set of $D$ document-topic distributions for each document $d$; and $\{z_d\}_{d=1}^{D}$, a set of topic-assignment vectors for word tokens in the document $d$, as randomly generated from known stochastic processes. Merging $A_k$ as the $k$th column vector of the $V \times K$ matrix $A$ and $W_d$ as the $d$th column vector of the $K \times D$ matrix $W$, the learning task is to estimate the posterior distribution of the latent variables $A$, $W$, and $z_d$ given the $V \times D$ word-document matrix $\tilde{M}$, which is the only observed variable, where the $d$th column corresponds to the empirical word frequencies in the training document $d$.
[Arora et al., 2013] recover the word-topic matrix $A$ and the topic-topic matrix $R = \mathbb{E}[WW^T]$ instead of $W$, in the spirit of nonnegative matrix factorization. Though the true underlying word distribution for each document is unknown and could be far from the sample observation $\tilde{M}$, the empirical word-word matrix $\tilde{Q}$ converges to its expectation $A\,\mathbb{E}[WW^T]\,A^T = ARA^T$ as the number of documents increases. Thus the learning task is to approximately recover $A$ and $R$, pretending that the empirical $\tilde{Q}$ is close to the true second-order moment matrix $Q$.
The critical assumption for this method is that every topic $k$ has a specific anchor word $s_k$ that occurs with non-negligible probability ($> 0$) only in that topic. The anchor word $s_k$ need not appear in every document about the topic $k$, but we can be confident that a document is at least to some degree about the topic $k$ if it contains $s_k$. This assumption drastically improves inference by guaranteeing the presence of a diagonal submatrix inside the word-topic matrix $A$. After constructing an estimate $\tilde{Q}$, the algorithm in [Arora et al., 2013] first finds a set $S = \{s_1, \dots, s_K\}$ of $K$ anchor words ($K$ is user-specified), and subsequently recovers $A$ and $R$ based on $S$. Due to this structure, overall performance depends heavily on the quality of the anchor words.
In the matrix algebra literature this greedy anchor finding method is called QR with row-pivoting. Previous work classifies a matrix into two sets of row (or column) vectors where the vectors in one set can effectively reconstruct the vectors in the other set; such methods are called subset-selection algorithms. [Gu and Eisenstat, 1996] suggest one important deterministic algorithm. A randomized algorithm provided by [Boutsidis et al., 2009] is the state of the art, using a pre-stage that selects the candidates in addition to [Gu and Eisenstat, 1996]. We found no change in anchor selection using these algorithms, verifying the difficulty of the anchor finding process. This difficulty is mostly because anchors must be nonnegative convex bases, whereas the classified vectors from the subset-selection algorithms yield unconstrained bases.
The t-SNE model has previously been used to display high-dimensional embeddings of words in 2D space by Turian (http://metaoptimize.com/projects/wordreprs/). Low-dimensional embeddings of topic spaces have also been used to support user interaction with models: [Eisenstein et al., 2011] use a visual display of a topic embedding to create a navigator interface. Although t-SNE has been used to visualize the results of topic models, for example by [Lacoste-Julien et al., 2008] and [Zhu et al., 2009], we are not aware of any use of the method as a fundamental component of topic inference.
3 Low-dimensional Embeddings
Real text corpora typically involve vocabularies in the tens of thousands of distinct words. As the input matrix $\tilde{Q}$ scales quadratically with the vocabulary size $V$, the Anchor Words algorithm must depend on a low-dimensional projection of $\tilde{Q}$ in order to be practical. Previous work [Arora et al., 2013] uses random projections via either Gaussian random matrices [Johnson and Lindenstrauss, 1984] or sparse random matrices [Achlioptas, 2001], reducing the representation of each word to around 1,000 dimensions. Since the dimensionality of the compressed word co-occurrence space is an order of magnitude larger than the number of topics $K$, we must still approximate the convex hull by choosing extreme points as before.
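The sparse random projection mentioned above can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the function names `sparse_projection_matrix` and `project` are hypothetical, the default of 10% nonzero entries (half negative, half positive) mirrors the setting reported later for the greedy baseline, and the usual rescaling constant is omitted because it does not change which points are extreme.

```python
import random


def sparse_projection_matrix(n_rows, n_cols, p_nonzero=0.10, seed=0):
    """Build an n_rows x n_cols matrix with entries in {-1, 0, +1}.

    Roughly p_nonzero / 2 of the entries are -1 and p_nonzero / 2 are +1.
    (Hypothetical helper; the scaling factor is deliberately left out.)
    """
    rng = random.Random(seed)
    half = p_nonzero / 2.0
    matrix = []
    for _ in range(n_rows):
        row = []
        for _ in range(n_cols):
            u = rng.random()
            if u < half:
                row.append(-1.0)
            elif u < p_nonzero:
                row.append(1.0)
            else:
                row.append(0.0)
        matrix.append(row)
    return matrix


def project(Q, R):
    """Multiply Q (V x V, mostly zeros) by R (V x d), giving a V x d matrix."""
    d = len(R[0])
    return [
        [sum(q * R[j][c] for j, q in enumerate(row) if q != 0.0) for c in range(d)]
        for row in Q
    ]
```

Each row of the projected matrix is the compressed co-occurrence vector of one word; with around 1,000 columns this is still far too high-dimensional for an exact convex hull.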
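In this work we explore two projection methods: PCA and t-SNE [van der Maaten and Hinton, 2008]. Principal Component Analysis (PCA) is a commonly used dimensionality reduction scheme that linearly transforms the data to new coordinates where the largest variances are orthogonally captured for each dimension. By choosing only a few such principal axes, we can represent the data in a lower-dimensional space. In contrast, t-SNE embedding performs a nonlinear dimensionality reduction preserving the local structures. Given a set of $N$ points $\{x_i\}$ in a high-dimensional space $\mathcal{X}$, t-SNE allocates probability mass for each pair of points so that pairs of similar (closer) points become more likely to co-occur than dissimilar (distant) points.
$$p_{j|i} = \frac{\exp\left(-d(x_i, x_j)^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-d(x_i, x_k)^2 / 2\sigma_i^2\right)} \qquad (1)$$
$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N} \qquad (2)$$
Then it generates a set of new points $\{y_i\}$ in a low-dimensional space $\mathcal{Y}$ so that the probability distribution over points in $\mathcal{Y}$ behaves similarly to the distribution over points in $\mathcal{X}$, by minimizing the KL-divergence between the two distributions:
$$q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}} \qquad (3)$$
$$\{y_i\}^{*} = \arg\min_{\{y_i\}} \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}} \qquad (4)$$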
Instead of approximating a convex hull in such a high-dimensional space, we select the exact vertices of the convex hull formed in a low-dimensional projected space, which can be calculated efficiently. Figures 1 and 2 show the convex hulls for 2D projections of $\tilde{Q}$ using t-SNE and PCA for a corpus of Yelp reviews. Figure 3 illustrates the convex hulls for a 3D t-SNE projection of the same corpus. Anchor words correspond to the vertices of these convex hulls. Note that we present the 2D projections as illustrative examples only; we find that three-dimensional projections perform better in practice.
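To make the "exact vertices of the convex hull" step concrete, here is a minimal sketch for the 2D case. It uses Andrew's monotone chain algorithm, not the Quickhull algorithm used in the experiments, because monotone chain fits in a few lines of dependency-free Python; the function names and the toy coordinates are illustrative assumptions. For 3D embeddings one would normally rely on an existing Quickhull implementation instead.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull_2d(points):
    """Return indices of the hull vertices in counter-clockwise order."""
    order = sorted(range(len(points)), key=lambda i: points[i])
    if len(order) <= 2:
        return order
    lower = []
    for i in order:
        # Drop points that would make a clockwise (or straight) turn.
        while len(lower) >= 2 and cross(points[lower[-2]], points[lower[-1]], points[i]) <= 0:
            lower.pop()
        lower.append(i)
    upper = []
    for i in reversed(order):
        while len(upper) >= 2 and cross(points[upper[-2]], points[upper[-1]], points[i]) <= 0:
            upper.pop()
        upper.append(i)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


# Toy 2D embedding: four corner words and two interior words.
embedding = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1), (1, 0.5)]
anchor_ids = convex_hull_2d(embedding)  # indices of candidate anchor words
```

In this toy example only the four corner points are returned; the two interior points can be written as convex combinations of the corners and are therefore not anchor candidates.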
In addition to the computational advantages, this approach benefits anchor-based topic modeling in two respects. First, as we now compute the exact convex hull, the number of topics depends on the dimensionality of the embedding. For example, in the figures the 2D projection has 21 vertices, whereas the 3D projection supports 69 vertices. This implies users can easily tune the granularity of topic clusters by varying the embedding dimensionality, without increasing the number of topics by one each time. Second, we can effectively visualize the thematic relationships between topic anchors and the rest of the words in the vocabulary, enhancing both interpretability and options for further vocabulary curation.
4 Experimental Results
We find that radically low-dimensional t-SNE projections are effective at finding anchor words that are much more salient than those of the greedy method, and topics that are more distinctive, while maintaining comparable held-out likelihood and semantic coherence. As noted in Section 1, the previous greedy anchor words algorithm tends to produce many nearly identical topics. For example, 37 out of 100 topics trained on a 2008 political blog corpus have obama, mccain, bush, iraq or palin as their most probable word, including 17 just for obama. Only 66% of topics have a unique top word. In contrast, the t-SNE model on the same dataset has only one topic whose most probable word is obama, and 86% of topics have a unique top word (mccain is the most frequent top word, with five topics).
We use three real datasets: business reviews from the Yelp Academic Dataset (https://www.yelp.com/academic_dataset), political blogs from the 2008 US presidential election [Eisenstein and Xing, 2010], and New York Times articles from 2007 (http://catalog.ldc.upenn.edu/LDC2008T19). Details are shown in Table 1. Documents with fewer than 10 word tokens are discarded due to possible instability in constructing $\tilde{Q}$. We perform minimal vocabulary curation, eliminating a standard list of English stopwords (the list of 524 stop words included in the Mallet library) and terms that occur below frequency cutoffs: 100 times (Yelp, Blog) and 150 times (NYT). We further restrict possible anchor words to words that occur in more than three documents. As our datasets are not artificially synthesized, we reserve 5% of each set of documents for held-out likelihood computation.
Table 1: Dataset details.
Name    Documents    Vocab.    Avg. length
----    ---------    ------    -----------
Yelp       20,000     1,606           40.6
Blog       13,000     4,447          161.3
NYT        41,000    10,713          277.8
Unlike [Arora et al., 2013], which presents results on synthetic datasets to compare performance across different recovery methods given increasing numbers of documents, we are interested in comparing anchor finding methods, and are mainly concerned with semantic quality. As a result, although we have conducted experiments on synthetic document collections, we focus on real datasets for this work. (None of the algorithms are particularly effective at finding synthetically introduced anchor words, possibly because there are other candidates around anchor vertices that approximate the convex hull to almost the same degree.) We also choose to compare only anchor finding algorithms, so we do not report comparisons to likelihood-based methods, which can be found in [Arora et al., 2013].
For both PCA and t-SNE, we use three-dimensional embeddings across all experiments. This projection results in matrices that are 0.03% as wide as the original $V \times V$ matrix for the NYT dataset. Without low-dimensional embedding, each word is represented by a $V$-dimensional vector in which only a few entries are nonzero due to the sparse co-occurrence patterns. Thus a vertex captured by the greedy anchor-finding method is likely to be one of many eccentric vertices in a very high-dimensional space. In contrast, t-SNE creates an effective dense representation where a small number of pivotal vertices are more clearly visible, improving both performance and interpretability.
Note that since we can find an exact convex hull in these spaces (we use the Quickhull algorithm to do so efficiently), there is an upper bound on the number of topics that can be found given a particular projection. If more topics are desired, one can simply increase the dimensionality of the projection. For the greedy algorithm we use sparse random projections with 1,000 dimensions with 5% negative entries and 5% positive entries. PCA and t-SNE choose (49, 32, 47) and (69, 77, 107) anchors, respectively, for the Yelp, Blog, and NYT datasets.
4.1 Anchor-word Selection
We begin by comparing the behavior of low-dimensional embeddings to the behavior of the standard greedy algorithm. Table 2 shows ordered lists of the first 12 anchor words selected by three algorithms: t-SNE embedding, PCA embedding, and the original greedy algorithm. Anchor words selected by t-SNE (police, business, court) are more general than anchor words selected by the greedy algorithm (cavalry, alsadr, yiddish). Additional examples of anchor words and their associated topics are shown in Table 3 and discussed in Section 4.3.
Table 2: The first 12 anchor words selected by each algorithm.
 #    t-SNE       PCA          Greedy
--    --------    ---------    ----------------
 1    police      beloved      cavalry
 2    bonds       york         biodiesel
 3    business    family       h/w
 4    day         loving       kingsley
 5    initial     late         mourners
 6    million     president    pcl
 7    article     people       carlisle
 8    wife        article      alsadr
 9    site        funeral      kaye
10    mother      million      abc’s
11    court       board        yiddish
12    percent     percent      greatgrandmother
Table 3: Example topics for each dataset. # is the order of the topic and HR is the hard rank of its anchor word.
Type      #    HR    Top Words (Yelp)
------   --   ---    ----------------
t-SNE    16     0    mexican good service great eat restaurant authentic delicious
PCA      15     0    mexican authentic eat chinese don’t restaurant fast salsa
Greedy   34    35    good great food place service restaurant it’s mexican

t-SNE     6     0    beer selection good pizza great wings tap nice
PCA      39     6    wine beer selection nice list glass wines bar
Greedy   99    11    beer selection great happy place wine good bar

t-SNE     3     0    prices great good service selection price nice quality
PCA      12     0    atmosphere prices drinks friendly selection nice beer ambiance
Greedy   34    35    good great food place service restaurant it’s mexican

t-SNE    10     0    chicken salad good lunch sauce ordered fried soup
PCA      10     0    chicken salad lunch fried pita time back sauce
Greedy   69    12    chicken rice sauce fried ordered i’m spicy soup

Type      #    HR    Top Words (Blog)
------   --   ---    ----------------
t-SNE    10     0    hillary clinton campaign democratic bill party win race
PCA       4     0    hillary clinton campaign democratic party bill democrats vote
Greedy   45    19    obama hillary campaign clinton obama’s barack it’s democratic

t-SNE     3     0    iraq war troops iraqi mccain surge security american
PCA       9     1    iraq iraqi war troops military forces security american
Greedy   91     8    iraq mccain war bush troops withdrawal obama iraqi

t-SNE     9     0    allah muhammad qur verses unbelievers ibn muslims world
PCA      18    14    allah muhammad qur verses unbelievers story time update
Greedy    4     5    allah muhammad people qur verses unbelievers ibn sura

t-SNE    19     0    catholic abortion catholics life hagee time biden human
PCA       2     0    people it’s time don’t good make years palin
Greedy   40     1    abortion parenthood planned people time state life government

Type      #    HR    Top Words (NYT)
------   --   ---    ---------------
t-SNE     0     0    police man yesterday officers shot officer yearold charged
PCA       6     0    people it’s police way those three back don’t
Greedy   50   198    police man yesterday officers officer people street city

t-SNE    19     0    senator republican senate democratic democrat state bill
PCA      33     2    state republican republicans senate senator house bill party
Greedy   85    33    senator republican president state campaign presidential people

t-SNE     2     0    business chief companies executive group yesterday billion
PCA      21     0    billion companies business deal group chief states united
Greedy   55    10    radio business companies percent day music article satellite

t-SNE    14     0    market sales stock companies prices billion investors price
PCA      11     0    percent market rate week state those increase high
Greedy   77    44    companies percent billion million group business chrysler people
The Gram-Schmidt process used by Arora et al. greedily selects anchors in high-dimensional space. As each word is represented within $V$ dimensions, finding the word that has the next most distinctive co-occurrence pattern tends to prefer overly eccentric words with only short, intense bursts of co-occurring words. While the bases corresponding to these anchor words could be theoretically relevant for the original high-dimensional space, they are less likely to be equally important in low-dimensional space. Thus projecting down to low-dimensional space can rearrange the points to emphasize not only uniqueness but also longevity, making it possible to form measurably more specific topics.
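The greedy selection described here can be sketched as a pivoted Gram-Schmidt loop. This is a simplified illustration under stated assumptions, not Arora et al.'s implementation: it measures distance from the linear span of the rows chosen so far (starting from the origin), and `greedy_anchors` is a hypothetical name.

```python
def greedy_anchors(rows, k):
    """Pick up to k row indices, each time taking the row farthest from
    the span of the rows already chosen (pivoted Gram-Schmidt)."""
    residual = [list(r) for r in rows]  # work on a copy
    anchors = []
    for _ in range(k):
        sq_norms = [sum(x * x for x in r) for r in residual]
        best = max(range(len(residual)), key=sq_norms.__getitem__)
        if sq_norms[best] == 0.0:
            break  # every remaining row already lies in the span
        anchors.append(best)
        norm = sq_norms[best] ** 0.5
        basis = [x / norm for x in residual[best]]
        # Remove the component along the new basis direction from every row.
        for r in residual:
            coeff = sum(a * b for a, b in zip(r, basis))
            for c in range(len(r)):
                r[c] -= coeff * basis[c]
    return anchors
```

Because the first pick is simply the row with the largest norm, and later picks are the rows with the largest residual, rows with a few large, unusual entries tend to be selected early, which is the behavior the paragraph above criticizes.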
Concretely, neither cavalry, alsadr, yiddish nor police, business, court are full representations of New York Times articles, but the latter is a much better basis than the former due to its greater generality. We see the effect of this difference in the specificity of the resulting topics (for example in the 17 obama topics). Most words in the vocabulary have little connection to the word cavalry, so the probability $p(z \mid w = i)$ does not change much across different words $i$. When we convert these distributions into $p(w \mid z)$ using Bayes’ rule, the resulting topics are very close to the corpus distribution, a unigram distribution $p(w)$. This lack of specificity results in the observed similarity of topics.
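A small numeric sketch of this argument, assuming a three-word vocabulary and two topics (all values and the name `topics_from_posteriors` are made up for illustration): when $p(z \mid w)$ is the same for every word, Bayes' rule returns the unigram distribution for every topic.

```python
def topics_from_posteriors(p_z_given_w, p_w):
    """Apply Bayes' rule: p(w=i | z=k) is proportional to p(z=k | w=i) * p(w=i)."""
    num_topics = len(p_z_given_w[0])
    topics = []
    for k in range(num_topics):
        joint = [p_z_given_w[i][k] * p_w[i] for i in range(len(p_w))]
        total = sum(joint)
        topics.append([x / total for x in joint])
    return topics


p_w = [0.5, 0.3, 0.2]                       # corpus (unigram) distribution
flat = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]  # p(z | w) identical for all words
flat_topics = topics_from_posteriors(flat, p_w)    # both topics equal p_w

sharp = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]  # p(z | w) varies by word
sharp_topics = topics_from_posteriors(sharp, p_w)  # topics now differ
```

With the flat posteriors both recovered topics coincide with `p_w`; only when the posteriors vary across words do the topics separate.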
4.2 Quantitative Results
In this section we compare PCA and t-SNE projections to the greedy algorithm along several quantitative metrics. To show the effect of different values of $K$, we report results for varying numbers of topics. As the anchor finding algorithms are deterministic, the anchor words in a $K$-dimensional model are identical to the first $K$ anchor words in a $(K+1)$-dimensional model. For the greedy algorithm we select anchor words in the order they are chosen. For the PCA and t-SNE methods, which find anchors jointly, we sort words in descending order by their distance from their centroid.
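The ordering rule for jointly found anchors can be sketched as below. This assumes that "their centroid" means the mean of the anchor points themselves in the embedded space; `order_anchors` is a hypothetical name.

```python
import math


def order_anchors(anchor_points):
    """Return anchor indices sorted by descending distance from the
    centroid of the anchor points (assumed interpretation)."""
    n = len(anchor_points)
    dim = len(anchor_points[0])
    centroid = [sum(p[c] for p in anchor_points) / n for c in range(dim)]
    return sorted(
        range(n),
        key=lambda i: math.dist(anchor_points[i], centroid),
        reverse=True,
    )
```

A model with a smaller number of topics then uses the first entries of this ordering, in the same way that the greedy method uses its first few picks.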
Recovery Error. Each non-anchor word is approximated by a convex combination of the $K$ anchor words. The projected gradient algorithm [Arora et al., 2013] determines these convex coefficients so that the gap between the original word vector and the approximation is minimized. As choosing a good basis of anchor words decreases this gap, Recovery Error (RE) is defined as the average residual across all words. Writing $\bar{Q}_i$ for the co-occurrence vector of word $i$ and $C_{ik}$ for its convex coefficient on anchor $s_k$,
$$RE = \frac{1}{V} \sum_{i=1}^{V} \Big\| \bar{Q}_i - \sum_{k=1}^{K} C_{ik} \bar{Q}_{s_k} \Big\|_2 \qquad (5)$$
Recovery error decreases with the number of topics, and improves substantially after the first 10–15 anchor words for all methods. The t-SNE method has slightly better performance than the greedy algorithm, but they are similar. Results for recovery with the original, unprojected matrix (not shown) are much worse than the other algorithms, suggesting that the initial anchor words chosen are especially likely to be uninformative.
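As a rough sketch of the recovery error computation: the function below takes the word vectors, the anchor indices, and an already fitted coefficient matrix, and averages the Euclidean residuals. The use of the Euclidean norm and the names `Q`, `C`, and `recovery_error` are assumptions for illustration; fitting the coefficients themselves (the projected gradient step) is not shown.

```python
import math


def recovery_error(Q, anchors, C):
    """Average distance between each word vector Q[i] and its
    reconstruction sum_k C[i][k] * Q[anchors[k]]."""
    total = 0.0
    for i, row in enumerate(Q):
        approx = [
            sum(C[i][k] * Q[s][c] for k, s in enumerate(anchors))
            for c in range(len(row))
        ]
        total += math.dist(row, approx)
    return total / len(Q)


Q = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
good = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # word 2 is an even mix of the anchors
poor = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]  # word 2 forced onto anchor 0
```

With the `good` coefficients every word is reconstructed exactly, while the `poor` coefficients leave a residual on the third word.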
Normalized Entropy. As shown previously, if the probability of topics given a word is close to uniform, the probability of that word in topics will be close to the corpus distribution. Normalized Entropy (NE) measures the entropy of this distribution relative to the entropy of a $K$-dimensional uniform distribution:
$$NE = \frac{1}{V} \sum_{i=1}^{V} \frac{H(z \mid w = i)}{\log K} \qquad (6)$$
The normalized entropy of topics given word distributions usually decreases as we add more topics, although both t-SNE and PCA show a dip in entropy for low numbers of topics. This result indicates that words become more closely associated with particular topics as we increase the number of topics. The low-dimensional embedding methods (t-SNE and PCA) have consistently lower entropy.
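A minimal sketch of normalized entropy, assuming the input is a list of per-word distributions over at least two topics and that the score is averaged over words (the function name is hypothetical):

```python
import math


def normalized_entropy(p_z_given_w):
    """Mean entropy of p(z | w=i) over words, divided by log K so that
    a uniform distribution over K topics scores 1 and a one-hot scores 0."""
    num_topics = len(p_z_given_w[0])
    total = 0.0
    for row in p_z_given_w:
        total += -sum(p * math.log(p) for p in row if p > 0)
    return total / (len(p_z_given_w) * math.log(num_topics))
```

Values near 1 mean that words carry little information about topics; values near 0 mean that each word is tied to essentially one topic.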
Topic Specificity and Topic Dissimilarity. We want topics to be both specific (that is, not overly general) and different from each other. When there is an insufficient number of topics, $p(w \mid z)$ often resembles the corpus distribution $p(w)$, where high-frequency terms become the top words contributing to most topics. Topic Specificity (TS) is defined as the average KL divergence from each topic’s conditional distribution to the corpus distribution. (We prefer specificity to [AlSumait et al., 2009]’s term vacuousness because the metric increases as we move away from the corpus distribution.)
$$TS = \frac{1}{K} \sum_{k=1}^{K} KL\big( p(w \mid z = k) \,\|\, p(w) \big) \qquad (7)$$
One way to define the distance between multiple points is the minimum radius of a ball that covers every point. Whereas this is simply the distance from the centroid to the farthest point in Euclidean space, it is itself a difficult optimization problem to find such a centroid of distributions under metrics such as KL-divergence and Jensen-Shannon divergence. To avoid this problem, we measure Topic Dissimilarity (TD) by viewing each conditional distribution as a simple $V$-dimensional vector in $\mathbb{R}^V$. Recall $A_{ik} = p(w = i \mid z = k)$,
$$TD = \max_{1 \le k \le K} \Big\| \frac{1}{K} \sum_{k'=1}^{K} A_{*k'} - A_{*k} \Big\|_2 \qquad (8)$$
Specificity and dissimilarity increase with the number of topics, suggesting that with few anchor words, the topic distributions are close to the overall corpus distribution and very similar to one another. The t-SNE and PCA algorithms produce consistently better specificity and dissimilarity, indicating that they produce more useful topics early, with small numbers of topics. The greedy algorithm produces topics that are closer to the corpus distribution and less distinct from each other (as in the 17 obama topics).
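Both metrics can be sketched in a few lines. The sketch assumes each topic is given as a list of word probabilities, that specificity is the mean KL divergence from each topic to the corpus distribution, and that dissimilarity is the largest Euclidean distance from the topic centroid; the function names are hypothetical.

```python
import math


def topic_specificity(topics, p_w):
    """Mean KL(p(w | z=k) || p(w)) over topics; assumes p_w[i] > 0
    wherever a topic puts mass on word i."""
    def kl(p):
        return sum(a * math.log(a / b) for a, b in zip(p, p_w) if a > 0)
    return sum(kl(t) for t in topics) / len(topics)


def topic_dissimilarity(topics):
    """Largest Euclidean distance from the centroid of the topic vectors."""
    vocab = len(topics[0])
    centroid = [sum(t[i] for t in topics) / len(topics) for i in range(vocab)]
    return max(math.dist(t, centroid) for t in topics)
```

Topics that all equal the corpus distribution score zero on both metrics, which corresponds to the degenerate case of many near-identical, overly general topics.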
Topic Coherence is known to correlate with the semantic quality of topics as judged by human annotators [Mimno et al., 2011]. Let $W_k^{(T)}$ be the $T$ most probable words (i.e., top words) for the topic $k$.
$$TC\big(W_k^{(T)}\big) = \sum_{w_1 \neq w_2 \in W_k^{(T)}} \log \frac{D(w_1, w_2) + \epsilon}{D(w_2)} \qquad (9)$$
Here $D(w_1, w_2)$ is the co-document frequency, which is the number of documents containing both words $w_1$ and $w_2$. $D(w)$ is the simple document frequency of the word $w$. The numerator contains a smoothing count $\epsilon$ in order to avoid taking the logarithm of zero.
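A minimal sketch of a coherence score built from document and co-document frequencies is given below. It follows the ordered-pair form of [Mimno et al., 2011], in which each top word is paired with the higher-ranked words before it; the default smoothing value `eps=1.0` and the function name are assumptions, and every top word is assumed to occur in at least one document.

```python
import math


def topic_coherence(top_words, documents, eps=1.0):
    """Sum of log((D(w_m, w_l) + eps) / D(w_l)) over pairs of top words,
    where w_l is ranked above w_m."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            co = doc_freq(top_words[m], top_words[l])
            score += math.log((co + eps) / doc_freq(top_words[l]))
    return score


docs = [["a", "b"], ["a", "b"], ["a"], ["c"]]
together = topic_coherence(["a", "b"], docs)  # words that co-occur often
apart = topic_coherence(["a", "c"], docs)     # words that never co-occur
```

Pairs of words that frequently share documents contribute values near zero, while pairs that never co-occur contribute large negative values, so higher (less negative) scores indicate more coherent top-word lists.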
Coherence scores for t-SNE and PCA are worse than those for the greedy method, but this result must be understood in combination with the Specificity and Dissimilarity metrics. The most frequent terms in the overall corpus distribution often appear together in documents. Thus a model creating many topics similar to the corpus distribution is likely to achieve high Coherence, but low Specificity by definition.
Saliency. [Chuang et al., 2012] define saliency for topic words as a combination of distinctiveness and probability within a topic. Anchor words are distinctive by construction, so we can increase saliency by selecting more probable anchor words. We measure the probability of anchor words in two ways. First, we report the zero-based rank of anchor words within their topics. Examples of this metric, which we call “hard” rank, are shown in Table 3. The hard ranks of the anchors in the PCA and t-SNE models are close to zero, while the anchor words for the greedy algorithm are ranked much lower, well below the range usually displayed to users. Second, while hard rank measures the perceived difference in rank of contributing words, position may not fully capture the relative importance of the anchor word. “Soft” rank quantifies the average log ratio between the probabilities of the most prominent word $w_k^{*}$ and the anchor word $s_k$.
$$SR = \frac{1}{K} \sum_{k=1}^{K} \log \frac{p(w = w_k^{*} \mid z = k)}{p(w = s_k \mid z = k)} \qquad (10)$$
Lower values of soft rank (Fig. 8) indicate that the anchor word has greater relative probability of occurring within a topic. As we increase the number of topics, anchor words become more prominent in topics learned by the greedy method, but t-SNE anchor words remain relatively more probable within their topics as measured by soft rank.
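The two rank measures can be sketched as follows, assuming each topic is a list of word probabilities indexed by word id, that anchor words have nonzero probability in their own topic, and that the "prominent word" is the most probable word of the topic (function names are hypothetical):

```python
import math


def hard_rank(topic, anchor):
    """Zero-based rank of the anchor word: the number of words that are
    strictly more probable than it within the topic."""
    return sum(1 for p in topic if p > topic[anchor])


def soft_rank(topics, anchors):
    """Mean over topics of log(p(top word) / p(anchor word))."""
    ratios = [math.log(max(t) / t[a]) for t, a in zip(topics, anchors)]
    return sum(ratios) / len(ratios)


topics = [[0.5, 0.3, 0.2], [0.1, 0.2, 0.7]]
anchors = [0, 1]  # anchor of topic 0 is word 0, anchor of topic 1 is word 1
```

In this toy example the first anchor is the top word of its topic (hard rank 0, log ratio 0), while the second anchor sits one place below the top word and contributes a positive log ratio.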
Held-out Probability. Given an estimate of the topic-word matrix $A$, we can compute the marginal probability of held-out documents under that model. We use the left-to-right estimator introduced by [Wallach et al., 2009], which uses a sequential algorithm similar to a Gibbs sampler. This method requires a smoothing parameter for document-topic Dirichlet distributions, which we set to a fixed value.
We note that the greedy algorithm run on the original, unprojected matrix has better held-out probability values than t-SNE for the Yelp corpus, but as this method does not scale to realistic vocabularies we compare here to the sparse random projection method used in [Arora et al., 2013]. The t-SNE method appears to do best, particularly on the NYT corpus, which has a larger vocabulary and longer training documents. The length of individual held-out documents has no correlation with held-out probability.
We emphasize that Held-out Probability is sensitive to smoothing parameters and should only be used in combination with a range of other topic-quality metrics. In initial experiments, we observed significantly worse held-out performance for the t-SNE algorithm. This occurred because setting the probability of anchor words to zero for all but their own topics led to large negative values in held-out log probability for those words. As t-SNE tends to choose more frequent terms as anchor words, these “spikes” significantly affected overall probability estimates. To make the calculation fairer, we added a small constant to any zero entries for anchor words in the topic-word matrix across all models and renormalized.
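The smoothing step can be sketched as below. The matrix layout (`A[i][k]` as the probability of word `i` in topic `k`), the function name, and the value passed as `eps` are all assumptions for illustration; the sketch only shows the mechanics of filling zero anchor entries and renormalizing each topic.

```python
def smooth_anchor_zeros(A, anchors, eps):
    """Replace zero entries in anchor-word rows with eps, then renormalize
    every column so that each topic sums to one again."""
    smoothed = [list(row) for row in A]  # leave the input untouched
    for i in anchors:
        smoothed[i] = [x if x > 0 else eps for x in smoothed[i]]
    num_topics = len(smoothed[0])
    for k in range(num_topics):
        col_sum = sum(row[k] for row in smoothed)
        for row in smoothed:
            row[k] /= col_sum
    return smoothed


# Words 0 and 1 are anchors: each has probability zero outside its own topic.
A = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.5]]
S = smooth_anchor_zeros(A, anchors=[0, 1], eps=1e-5)
```

After smoothing, an anchor word appearing in a held-out document no longer forces a zero probability under every topic except its own.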
Because t-SNE is a stochastic model, different initializations can result in different embeddings. To evaluate how stable anchor word selection is, we ran five random initializations for each dataset. For the Yelp dataset, the number of anchor words varies from 59 to 69, and 43 of those are shared across at least four trials. For the Blog dataset, the number of anchor words varies from 80 to 95, with 56 shared across at least four trials. For the NYT dataset, this number varies between 83 and 107, with 51 shared across at least four trials.
4.3 Qualitative Results
Table 3 shows topics trained by three methods (t-SNE, PCA, and greedy) for all three datasets. For each dataset, we select topics at random from the t-SNE model, and then find the closest topic from each of the other models. If anchor words are present in the top eight words, they are shown in boldface.
A fundamental difference between anchor-based inference and traditional likelihood-based inference is that we can give an order to topics according to their contribution to the word co-occurrence convex hull. This order is intrinsic to the original algorithm, and we heuristically give orders to t-SNE and PCA topics based on their contributions. This order is listed as # in Table 3. For all but one topic, the closest topic from the greedy model has a higher order number than the associated t-SNE topic. As shown above, the standard algorithm tends to pick less useful anchor words at the initial stage; only the later, higher-ordered topics are specific.
The clearest distinction between models is the rank of anchor words, represented by Hard Rank for each topic. Only one topic (the one anchored by initial) has an anchor word that does not coincide with the top-ranked word. For the greedy algorithm, anchor words are often tens of words down the list in rank, indicating that users are unlikely to connect them to the topic’s semantic core. In cases where the anchor word is highly ranked (unbelievers, parenthood) the word is a good indicator of the topic, but still less decisive.
t-SNE and PCA are often consistent in their selection of anchor words, which provides useful validation that low-dimensional embeddings discern more relevant anchor words regardless of whether the projection is linear or nonlinear. Note that we are only varying the anchor selection part of the Anchor Words algorithm in these experiments, recovering topic-word distributions in the same manner given anchor words. As a result, any differences between topics with the same anchor word (for example chicken) are due to differences in either the number of topics or the rest of the anchor words. Since PCA suffers from a crowding problem in lower-dimensional projection (see Figure 2) and the problem could be severe in a dataset with a large vocabulary, t-SNE is more likely to find the proper number of anchors given a specified granularity.
5 Conclusion
One of the main advantages of the anchor words algorithm is that the running time is largely independent of corpus size. Adding more documents would not affect the size of the co-occurrence matrix; it would only require more time to construct the co-occurrence matrix at the beginning. While inference is scalable, depending only on the size of the vocabulary, finding quality anchor words is crucial for the performance of the inference.
[Arora et al., 2013] present a greedy anchor finding algorithm that improves over previous linear programming methods, but finding quality anchor words remains an open problem in spectral topic inference. We have shown that previous approaches have several limitations. Exhaustively finding anchor words by eliminating words that are reproducible by other words [Arora et al., 2012] is impractical. The anchor words selected by the greedy algorithm are overly eccentric, particularly at the early stages of the algorithm, causing topics to be poorly differentiated. We find that using low-dimensional embeddings of word co-occurrence statistics allows us to approximate a better convex hull. The resulting anchor words are highly salient, being both distinctive and probable. The models trained with these words have better quantitative and qualitative properties along various metrics. Most importantly, using radically low-dimensional projections allows us to provide users with clear visual explanations for the model’s anchor word selections.
An interesting property of using low-dimensional embeddings is that the number of topics depends only on the projecting dimension. Since we can efficiently find an exact convex hull in low-dimensional space, users can obtain topics at their preferred level of granularity by changing the projection dimension. We do not insist this is the “correct” number of topics for a corpus, but this method, along with the range of metrics described in this paper, provides users with additional perspective when choosing a dimensionality that is appropriate for their needs.
We find that the t-SNE method, besides its well-known ability to produce high-quality layouts, provides the best overall anchor selection performance. This method consistently selects higher-frequency terms as anchor words, resulting in greater clarity and interpretability. Embeddings with PCA are also effective, but they result in less well-formed spaces, being less effective in held-out probability for sufficiently large corpora.
Anchor word finding methods based on low-dimensional projections offer several important advantages for topic model users. In addition to producing more salient anchor words that can be used effectively as topic labels, the relationship of anchor words to a visualizable word co-occurrence space offers significant potential. Users who can see why the algorithm chose a particular model will have greater confidence in the model and in any findings that result from topic-based analysis. Finally, visualizable spaces offer the potential to produce interactive environments for semi-supervised topic reconstruction.
Acknowledgments
We thank David Bindel and the anonymous reviewers for their valuable comments and suggestions, and Laurens van der Maaten for providing his t-SNE implementation.
References
 [Achlioptas, 2001] Dimitris Achlioptas. 2001. Database-friendly random projections. In SIGMOD, pages 274–281.
 [AlSumait et al., 2009] Loulwah AlSumait, Daniel Barbará, James Gentle, and Carlotta Domeniconi. 2009. Topic significance ranking of LDA generative models. In ECML.
 [Arora et al., 2012] S. Arora, R. Ge, and A. Moitra. 2012. Learning topic models – going beyond SVD. In FOCS.
 [Arora et al., 2013] Sanjeev Arora, Rong Ge, Yonatan Halpern, David Mimno, Ankur Moitra, David Sontag, Yichen Wu, and Michael Zhu. 2013. A practical algorithm for topic modeling with provable guarantees. In ICML.
 [Blei et al., 2003] D. Blei, A. Ng, and M. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, pages 993–1022. Preliminary version in NIPS 2001.
 [Boutsidis et al., 2009] Christos Boutsidis, Michael W. Mahoney, and Petros Drineas. 2009. An improved approximation algorithm for the column subset selection problem. In SODA, pages 968–977.
 [Chuang et al., 2012] Jason Chuang, Christopher D. Manning, and Jeffrey Heer. 2012. Termite: Visualization techniques for assessing textual topic models. In International Working Conference on Advanced Visual Interfaces (AVI), pages 74–77.
 [Eisenstein and Xing, 2010] Jacob Eisenstein and Eric Xing. 2010. The CMU 2008 political blog corpus. Technical report, CMU, March.
 [Eisenstein et al., 2011] Jacob Eisenstein, Duen Horng Chau, Aniket Kittur, and Eric P. Xing. 2011. Topicviz: Semantic navigation of document collections. CoRR, abs/1110.6200.
 [Griffiths and Steyvers, 2004] T. L. Griffiths and M. Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences, 101:5228–5235.
 [Gu and Eisenstat, 1996] Ming Gu and Stanley C. Eisenstat. 1996. Efficient algorithms for computing a strong rank-revealing QR factorization. In SIAM J. Sci Comput, pages 848–869.
 [Johnson and Lindenstrauss, 1984] William B. Johnson and Joram Lindenstrauss. 1984. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206.
 [Lacoste-Julien et al., 2008] Simon Lacoste-Julien, Fei Sha, and Michael I. Jordan. 2008. DiscLDA: Discriminative learning for dimensionality reduction and classification. In NIPS.
 [Lau et al., 2011] Jey Han Lau, Karl Grieser, David Newman, and Timothy Baldwin. 2011. Automatic labelling of topic models. In HLT, pages 1536–1545.
 [Mei et al., 2007] Qiaozhu Mei, Xuehua Shen, and ChengXiang Zhai. 2007. Automatic labeling of multinomial topic models. In KDD, pages 490–499.
 [Mimno et al., 2011] David Mimno, Hanna Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum. 2011. Optimizing semantic coherence in topic models. In EMNLP.
 [Sontag and Roy, 2011] D. Sontag and D. Roy. 2011. Complexity of inference in latent Dirichlet allocation. In NIPS, pages 1008–1016.
 [van der Maaten and Hinton, 2008] L.J.P. van der Maaten and G.E. Hinton. 2008. Visualizing high-dimensional data using t-SNE. JMLR, 9:2579–2605, Nov.
 [Wallach et al., 2009] Hanna Wallach, Iain Murray, Ruslan Salakhutdinov, and David Mimno. 2009. Evaluation methods for topic models. In ICML.
 [Zhu et al., 2009] Jun Zhu, Amr Ahmed, and Eric P. Xing. 2009. MedLDA: Maximum margin supervised topic models for regression and classification. In ICML.