Robust Spectral Inference for Joint Stochastic Matrix Factorization

11/01/2016 · by Moontae Lee, et al. · Cornell University

Spectral inference provides fast algorithms and provable optimality for latent topic analysis. But for real data these algorithms require additional ad-hoc heuristics, and even then often produce unusable results. We explain this poor performance by casting the problem of topic inference in the framework of Joint Stochastic Matrix Factorization (JSMF) and showing that previous methods violate the theoretical conditions necessary for a good solution to exist. We then propose a novel rectification method that learns high quality topics and their interactions even on small, noisy data. This method achieves results comparable to probabilistic techniques in several domains while maintaining scalability and provable optimality.


1 Introduction

Summarizing large data sets using pairwise co-occurrence frequencies is a powerful tool for data mining. Objects can often be better described by their relationships than their inherent characteristics. Communities can be discovered from friendships [1], song genres can be identified from co-occurrence in playlists [2], and neural word embeddings are factorizations of pairwise co-occurrence information [3, 4]. Recent Anchor Word algorithms [5, 6] perform spectral inference on co-occurrence statistics to infer topic models [7, 8]. Co-occurrence statistics can be calculated using a single parallel pass through a training corpus. While these algorithms are fast, deterministic, and come with provable guarantees, they are sensitive to observation noise and small samples, often producing effectively useless results on real documents that pose no problems for probabilistic algorithms.

Figure 1: 2D visualizations show the low-quality convex hull found by Anchor Words [6] (left) and a better convex hull (middle) found by discovering anchor words on a rectified space (right).

We cast this general problem of learning overlapping latent clusters as Joint-Stochastic Matrix Factorization (JSMF), a subset of non-negative matrix factorization that contains topic modeling as a special case. We explore the conditions necessary for inference from co-occurrence statistics and show that the Anchor Words algorithms necessarily violate such conditions. Then we propose a rectified algorithm that matches the performance of probabilistic inference—even on small and noisy datasets—without losing efficiency and provable guarantees. Validating on both real and synthetic data, we demonstrate that our rectification not only produces better clusters, but also, unlike previous work, learns meaningful cluster interactions.

Let the matrix C represent the co-occurrence of pairs drawn from N objects: C_ij is the joint probability p(X1 = i, X2 = j) for a pair of objects i and j. Our goal is to discover K latent clusters by approximately decomposing C ≈ BAB^T. B is the N×K object-cluster matrix, in which each column corresponds to a cluster and B_ik = p(X = i | Z = k) is the probability of drawing an object i conditioned on the object belonging to cluster k; A is the K×K cluster-cluster matrix, in which A_kl = p(Z1 = k, Z2 = l) represents the joint probability of pairs of clusters. We call the matrices C and A joint-stochastic (i.e., non-negative with entries summing to 1) due to their correspondence to joint distributions; B is column-stochastic. Example applications are shown in Table 1.

Domain Object Cluster Basis
Document Word Topic Anchor Word
Image Pixel Segment Pure Pixel
Network User Community Representative
Legislature Member Party/Group Partisan
Playlist Song Genre Signature Song
Table 1: JSMF applications, with anchor-word equivalents.
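As a concrete illustration of the factorization above, here is a minimal synthetic sketch (the toy sizes and the Dirichlet prior are illustrative choices of ours, not part of the model):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, M = 12, 3, 500                      # toy numbers of objects, clusters, examples

# Column-stochastic object-cluster matrix B: each column sums to 1.
B = rng.random((N, K))
B /= B.sum(axis=0, keepdims=True)

# Joint-stochastic cluster-cluster matrix A built from toy cluster distributions.
W = rng.dirichlet(np.ones(K), size=M)     # M cluster distributions (rows sum to 1)
A = (W.T @ W) / M                         # empirical joint distribution over cluster pairs

C = B @ A @ B.T                           # co-occurrence implied by the factorization

assert np.allclose(B.sum(axis=0), 1.0)               # B is column-stochastic
assert (A >= 0).all() and np.isclose(A.sum(), 1.0)   # A is joint-stochastic
assert (C >= 0).all() and np.isclose(C.sum(), 1.0)   # so is C
assert np.allclose(C, C.T)                           # and symmetric
```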

Anchor Word algorithms [5, 6] solve JSMF problems using a separability assumption: each topic contains at least one “anchor” word that has non-negligible probability exclusively in that topic. The algorithm uses the co-occurrence patterns of the anchor words as a summary basis for the co-occurrence patterns of all other words. The initial algorithm [5] is theoretically sound but unable to produce a column-stochastic word-topic matrix B due to unstable matrix inversions. A subsequent algorithm [6] fixes negative entries in B, but still produces large negative entries in the estimated topic-topic matrix A. As shown in Figure 3, the proposed algorithm infers valid topic-topic interactions.

2 Requirements for Factorization

In this section we review the probabilistic and statistical structures of JSMF and then define geometric structures of co-occurrence matrices required for successful factorization. C is a joint-stochastic matrix constructed from M training examples, each of which contains some subset of the N objects. We wish to find K latent clusters by factorizing C into a column-stochastic matrix B and a joint-stochastic matrix A, satisfying C ≈ BAB^T.

Figure 2: The JSMF event space differs from LDA’s. JSMF deals only with pairwise co-occurrence events and does not generate observations/documents.

Probabilistic structure.

Figure 2 shows the event space of our model. The distribution W_m over clusters, which induces the distribution over pairs of clusters, is generated first from a stochastic process with a hyperparameter α. If the m-th training example contains a total of n_m objects, our model views the example as consisting of all n_m(n_m − 1) possible pairs of objects. (Due to the bag-of-words assumption, every object can pair with any other object in that example, except itself; one implication of our work is a better understanding of self-co-occurrences, the diagonal entries of the co-occurrence matrix.) For each of these pairs, cluster assignments are sampled from the selected distribution (z1, z2 ∼ W_m). Then an actual object pair is drawn with respect to the corresponding cluster assignments (x1 ∼ B_{z1}, x2 ∼ B_{z2}). Note that this process does not explain how each training example is generated from a model, but shows how our model understands the objects in the training examples.

Following [5, 6], our model views B as a set of parameters rather than random variables. (In LDA, each column of B is generated from a known Dirichlet distribution.) The primary learning task is to estimate B; we then estimate A to recover the hyperparameter α. Due to the conditional independence of X1 and X2 given (Z1, Z2), the factorization C ≈ BAB^T is equivalent to

    p(X1 = i, X2 = j) = Σ_{k,l} p(X1 = i | Z1 = k) p(Z1 = k, Z2 = l) p(X2 = j | Z2 = l) = (BAB^T)_{ij}.

Under the separability assumption, each cluster k has a basis object s_k such that B_{s_k,k} > 0 and B_{s_k,l} = 0 for all l ≠ k. In matrix terms, we assume the submatrix of B comprised of the rows with indices S = {s_1, ..., s_K} is diagonal. As these rows form a non-negative basis for the row space of C, the assumption implies rank_+(C) = K. (rank_+(B) denotes the non-negative rank of the matrix B, whereas rank(B) denotes the usual rank.) Providing identifiability to the factorization, this assumption becomes crucial for inference of both B and A. Note that the JSMF factorization is unique up to column permutation, meaning that no specific ordering exists among the discovered clusters, as in probabilistic topic models (see the Appendix).

Statistical structure.

Let f(α) be a (known) distribution over cluster distributions from which a distribution W_m is sampled for each training example. Writing W_m ∼ f(α), we have M i.i.d. samples {W_m}_{m=1}^M which are not directly observable. Defining the posterior cluster-cluster matrix A = (1/M) Σ_m W_m W_m^T and its expectation A* = E_{W∼f(α)}[W W^T], Lemma 2.2 in [5] showed that (this convergence is not trivial, whereas (1/M) Σ_m W_m → E[W] as M → ∞ by the Central Limit Theorem)

    A = (1/M) Σ_{m=1}^{M} W_m W_m^T  →  A*   as M → ∞.   (1)

Denote the posterior co-occurrence for the m-th training example by C_m = B (W_m W_m^T) B^T and for all examples by C* = (1/M) Σ_m C_m. Then C* = BAB^T, and by (1),

    C* = BAB^T  →  BA*B^T   as M → ∞.   (2)

Denote the noisy observation of the co-occurrence for the m-th training example by Ĉ_m, and the average over all examples by Ĉ. Let B be the matrix of topics. We will construct Ĉ_m so that Ĉ is an unbiased estimator of C*. Thus, as M → ∞,

    Ĉ  →  BA*B^T.   (3)
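The convergence in (1) can be checked numerically. A small sketch, using a symmetric Dirichlet prior as a stand-in for f(α) (an illustrative assumption; any prior with a computable second moment works the same way):

```python
import numpy as np

rng = np.random.default_rng(1)
K = 5
alpha = np.full(K, 0.2)                       # toy Dirichlet hyperparameter
a0 = alpha.sum()

# Closed-form A* = E[W W^T] for a Dirichlet prior:
# (alpha alpha^T + diag(alpha)) / (a0 (a0 + 1)).
A_star = (np.outer(alpha, alpha) + np.diag(alpha)) / (a0 * (a0 + 1.0))

for M in (100, 10_000, 1_000_000):
    W = rng.dirichlet(alpha, size=M)          # M i.i.d. unobserved cluster distributions
    A = (W.T @ W) / M                         # posterior cluster-cluster matrix
    print(M, np.abs(A - A_star).max())        # max-entry error shrinks as M grows
```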

Geometric structure.

Though the separability assumption allows us to identify B even from the noisy observation Ĉ, we need to thoroughly investigate the structure of the cluster interactions, because that structure determines how much useful information the co-occurrence between the corresponding anchor bases contains, enabling us to make the best use of our training data. Let D be the set of doubly non-negative matrices: matrices that are entrywise non-negative and positive semidefinite (PSD).

Claim.  A ∈ D and A* ∈ D.

Proof.  Take any vector v ∈ R^K. As A is defined as a sum of outer products,

    v^T A v = (1/M) Σ_{m=1}^{M} v^T W_m W_m^T v = (1/M) Σ_{m=1}^{M} (W_m^T v)^2 ≥ 0.   (4)

Thus A is PSD. In addition, A_kl = (1/M) Σ_m W_{mk} W_{ml} ≥ 0 for all k, l, so A ∈ D. Proving A* ∈ D is analogous by the linearity of expectation. Relying on the double non-negativity of A, Equation (3) implies not only the low-rank structure of the posterior co-occurrence C*, but also the double non-negativity of C*, by a similar proof (see the Appendix).
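A quick numerical check of the claim and the remark above (a toy sketch; B and the W_m are random stand-ins of ours):

```python
import numpy as np

rng = np.random.default_rng(2)
N, K, M = 30, 4, 200
W = rng.dirichlet(np.ones(K), size=M)         # toy cluster distributions W_m
B = rng.random((N, K))
B /= B.sum(axis=0, keepdims=True)             # toy column-stochastic B

A = (W.T @ W) / M                             # sum of outer products, as in Eq. (4)
C_star = B @ A @ B.T                          # posterior co-occurrence

for X in (A, C_star):
    assert (X >= 0).all()                               # entrywise non-negative
    assert np.linalg.eigvalsh(X).min() > -1e-10         # PSD up to round-off
assert np.linalg.matrix_rank(C_star) <= K               # low-rank structure of C*
```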

The Anchor Word algorithms in [5, 6] consider neither the double non-negativity of cluster interactions nor its implications for co-occurrence statistics. Indeed, the empirical co-occurrence matrices collected from limited data are generally indefinite and full-rank, whereas the posterior co-occurrences must be positive semidefinite and low-rank. Our new approach will efficiently enforce double non-negativity and low-rankness of the co-occurrence matrix based on the geometric property of its posterior behavior. We will later clarify how this process substantially improves the quality of the clusters and their interactions by eliminating noise and restoring missing information.

3 Rectified Anchor Words Algorithm

In this section, we describe how to estimate the co-occurrence matrix Ĉ from the training data, and how to rectify Ĉ so that it is low-rank and doubly non-negative. We then decompose the rectified Ĉ in a way that preserves the doubly non-negative structure in the cluster-interaction matrix A.

Generating co-occurrence Ĉ.

Let H_m be the vector of object counts for the m-th training example, and let p_m = BW_m, where W_m is the document’s latent topic distribution. Then H_m is assumed to be a sample from a multinomial distribution Multi(n_m, p_m) where n_m = Σ_i H_{mi}, and recall E[H_m] = n_m p_m and Cov(H_m) = n_m (diag(p_m) − p_m p_m^T). As in [6], we generate the co-occurrence for the m-th example by

    Ĉ_m = ( H_m H_m^T − diag(H_m) ) / ( n_m (n_m − 1) ).   (5)

The diagonal penalty −diag(H_m) in Eq. 5 cancels out the diagonal matrix term in the variance-covariance matrix, making the estimator unbiased: E[Ĉ_m | W_m] = p_m p_m^T = B W_m W_m^T B^T. Putting Ĉ = (1/M) Σ_m Ĉ_m, we have E[Ĉ | W] = BAB^T = C* by the linearity of expectation.
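A minimal sketch of this estimator applied to raw count vectors (function and variable names are ours; examples with fewer than two tokens cannot form a pair and are skipped):

```python
import numpy as np

def example_cooccurrence(H):
    """Unbiased per-example co-occurrence estimate from a count vector H (Eq. 5)."""
    H = np.asarray(H, dtype=float)
    n = H.sum()
    return (np.outer(H, H) - np.diag(H)) / (n * (n - 1.0))

def corpus_cooccurrence(counts):
    """Average the per-example estimates over the usable training examples."""
    usable = [H for H in counts if np.sum(H) >= 2]
    return sum(example_cooccurrence(H) for H in usable) / len(usable)

# Toy usage: three tiny 'documents' over a vocabulary of four objects.
docs = [np.array([2, 1, 0, 1]), np.array([0, 3, 1, 0]), np.array([1, 1, 1, 1])]
C_hat = corpus_cooccurrence(docs)
assert np.isclose(C_hat.sum(), 1.0)   # each term is joint-stochastic, so the average is too
```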

Rectifying co-occurrence Ĉ.

While Ĉ is an unbiased estimator for C* in our model, in reality the two matrices often differ due to a mismatch between our model assumptions and the data (there is no reason to expect real data to be generated from topics, much less from exactly K latent topics) or due to error in estimation from limited data. The computed Ĉ is generally full-rank with many negative eigenvalues, causing a large approximation error. As the posterior co-occurrence C* must be low-rank, doubly non-negative, and joint-stochastic, we propose two rectification methods: Diagonal Completion (DC) and Alternating Projection (AP). DC modifies only diagonal entries so that Ĉ becomes low-rank, non-negative, and joint-stochastic, while AP modifies every entry and enforces the same properties as well as positive semi-definiteness. As our empirical results strongly favor alternating projection, we defer the details of diagonal completion to the Appendix.

Based on the desired properties of the posterior co-occurrence C*, we seek to project our estimator Ĉ onto the set of joint-stochastic, doubly non-negative, low-rank matrices. Alternating projection methods like Dykstra’s algorithm [9] allow us to project onto an intersection of finitely many convex sets using projections onto each individual set in turn. In our setting, we consider the intersection of three sets of symmetric matrices: the elementwise non-negative matrices, the normalized matrices whose entries sum to 1, and the positive semi-definite matrices of rank at most K, which we denote P_K. We project onto these three sets as follows:

    Π_NN(C) = max(C, 0)  (elementwise),
    Π_NOR(C) = C + ( (1 − Σ_{ij} C_ij) / N^2 ) · 1 1^T,
    Π_{P_K}(C) = U Λ_K U^T,

where C = U Λ U^T is an eigendecomposition and Λ_K is the matrix Λ modified so that all negative eigenvalues and all but the K largest positive eigenvalues are set to zero. Truncated eigendecompositions can be computed efficiently, and the other projections are likewise efficient. While the first two sets are convex, P_K is not. However, [10] show that alternating projection with a non-convex set still works under certain conditions, guaranteeing local convergence. Thus iterating the three projections in turn until convergence rectifies Ĉ to lie in the desired space. We show in Section 5 how these conditions are satisfied and examine the convergence behavior.
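A sketch of the alternating-projection rectification using the three projections above (the iteration count and the ordering within each round are our choices; the paper’s exact schedule may differ):

```python
import numpy as np

def rectify_ap(C_hat, K, iters=150):
    """Alternate projections onto the non-negative, sum-to-one, and rank-K PSD sets."""
    C = (C_hat + C_hat.T) / 2.0                     # work with a symmetric matrix
    N = C.shape[0]
    for _ in range(iters):
        C = np.maximum(C, 0.0)                      # Pi_NN: elementwise non-negativity
        C += (1.0 - C.sum()) / (N * N)              # Pi_NOR: entries sum to one
        vals, vecs = np.linalg.eigh(C)              # Pi_P_K: keep K largest positive eigenvalues
        keep = np.argsort(vals)[-K:]
        lam = np.clip(vals[keep], 0.0, None)
        C = (vecs[:, keep] * lam) @ vecs[:, keep].T
    return C
```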

Selecting basis S.

The first step of the factorization is to select the subset S of objects that satisfy the separability assumption. We want the best K rows of the row-normalized co-occurrence matrix C̄ so that all other rows lie nearly in the convex hull of the selected rows. [6] use the Gram-Schmidt process to select anchors, which amounts to a pivoted QR decomposition, but does not exploit the sparsity of C̄. To scale beyond small vocabularies, they use random projections that approximately preserve distances between the rows of C̄. For all experiments we use a new pivoted QR algorithm (see the Appendix) that exploits sparsity instead of using random projections, and thus preserves deterministic inference. (To use random projections effectively, it is necessary either to find proper dimensions based on multiple trials or to perform low-dimensional random projection multiple times [11] and merge the resulting anchors.)
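A dense sketch of this greedy, pivoted-QR-style selection (the paper’s algorithm additionally exploits the sparsity of the row-normalized matrix, which this toy version does not):

```python
import numpy as np

def select_anchors(C, K):
    """Pick K rows of the row-normalized co-occurrence whose convex hull covers the rest."""
    C_bar = C / np.maximum(C.sum(axis=1, keepdims=True), 1e-30)   # row-normalize
    R = C_bar.copy()                      # residuals of all rows against the chosen span
    anchors = []
    for _ in range(K):
        norms = np.linalg.norm(R, axis=1)
        s = int(np.argmax(norms))         # row farthest from the current span
        anchors.append(s)
        q = R[s] / norms[s]               # new orthonormal direction (Gram-Schmidt step)
        R -= np.outer(R @ q, q)           # deflate every row by its component along q
    return anchors
```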

Recovering object-cluster B.

After finding the set of basis objects S, we can infer each entry of B by Bayes’ rule as in [6]. Let {p(Z1 = k | X1 = i)}_k be the coefficients that reconstruct the i-th row of the row-normalized co-occurrence C̄ in terms of the basis rows corresponding to S. Since p(X1 = i | Z1 = k) ∝ p(Z1 = k | X1 = i) p(X1 = i), we can use the corpus frequencies to estimate p(X1 = i). Thus the main task for this step is to solve N simplex-constrained QPs to infer a set of such coefficients for each object. We use an exponentiated gradient algorithm to solve the problem, similar to [6]. Note that this step can be done efficiently in parallel for each object.
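A sketch of this recovery step with a simple exponentiated-gradient solver for the simplex-constrained subproblems (the step size and iteration count are arbitrary choices of ours):

```python
import numpy as np

def simplex_ls(y, X, iters=500, eta=20.0):
    """Minimize ||c @ X - y||^2 over the probability simplex by exponentiated gradient."""
    c = np.full(X.shape[0], 1.0 / X.shape[0])
    for _ in range(iters):
        grad = 2.0 * X @ (c @ X - y)              # gradient of the quadratic objective
        c = c * np.exp(-eta * grad)               # multiplicative update keeps c positive
        c /= c.sum()                              # renormalize onto the simplex
    return c

def recover_B(C, anchors):
    """Recover p(object | cluster) from anchor-reconstruction coefficients via Bayes' rule."""
    p_x = C.sum(axis=1)                           # corpus frequencies p(X = i)
    C_bar = C / np.maximum(p_x[:, None], 1e-30)   # row-normalized co-occurrence
    X = C_bar[anchors]                            # basis rows for the selected anchors
    P = np.array([simplex_ls(row, X) for row in C_bar])   # row i holds p(Z = k | X = i)
    B = P * p_x[:, None]                          # unnormalized p(X = i | Z = k)
    return B / np.maximum(B.sum(axis=0, keepdims=True), 1e-30)
```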

Figure 3: The algorithm of [6] (first panel) produces negative cluster co-occurrence probabilities. A probabilistic reconstruction alone (this paper & [5], second panel) removes negative entries but has no off-diagonal mass and does not sum to one. The same reconstruction after rectification (this paper, third panel) produces a valid joint-stochastic matrix.

Recovering cluster-cluster A.

[6] recover A by minimizing the element-wise approximation error ||Ĉ − BAB^T||; but the inferred A generally has many negative entries, failing to model the probabilistic interactions between topics. While we could further project this A onto the joint-stochastic matrices, doing so produces a large approximation error.

We consider an alternate recovery method that again leverages the separability assumption. Let Ĉ_SS be the K×K submatrix of Ĉ whose rows and columns correspond to the selected objects S, and let B_S be the diagonal K×K submatrix of B formed by the rows corresponding to S. Then

    Ĉ_SS ≈ B_S A B_S^T   ⟹   A = B_S^{-1} Ĉ_SS B_S^{-T}.   (6)

This approach efficiently recovers a cluster-cluster matrix based mostly on the co-occurrence information between the corresponding anchor bases, and produces no negative entries due to the stability of diagonal matrix inversion. Note that the principal submatrices of a PSD matrix are also PSD; hence, if Ĉ ∈ D then Ĉ_SS ∈ D. Thus, not only is the recovered A an unbiased estimator for A*, but it is also doubly non-negative because Ĉ ∈ D after the rectification. (We later realized that essentially the same approach was previously tried in [5], but it was not able to generate a valid topic-topic matrix, as shown in the middle panel of Figure 3.)
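A sketch of Eq. (6) given the rectified co-occurrence, the recovered B, and the anchor indices (the final renormalization is a safeguard we add; with exact inputs it is a no-op):

```python
import numpy as np

def recover_A(C_rect, B, anchors):
    """Recover the cluster-cluster matrix from the anchor-anchor block of C (Eq. 6)."""
    idx = np.asarray(anchors)
    C_SS = C_rect[np.ix_(idx, idx)]           # anchor-anchor co-occurrence block
    b_s = B[idx, np.arange(len(idx))]         # diagonal of B_S under separability
    A = C_SS / np.outer(b_s, b_s)             # B_S^{-1} C_SS B_S^{-T} for a diagonal B_S
    return A / A.sum()                        # renormalize (a no-op for exact inputs)
```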

4 Experimental Results

Our Rectified Anchor Words algorithm with alternating projection fixes many problems in the baseline Anchor Words algorithm [6] while matching the performance of Gibbs sampling [12] and maintaining spectral inference’s determinism and independence from corpus size. We evaluate direct measurements of matrix quality as well as indicators of topic utility. We use two text datasets: NIPS full papers and New York Times news articles (https://archive.ics.uci.edu/ml/datasets/Bag+of+Words). We eliminate a minimal list of 347 English stop words, prune rare words based on tf-idf scores, and remove documents with fewer than five tokens after vocabulary curation. We also prepare two non-textual item-selection datasets: users’ movie reviews from the Movielens 10M dataset (http://grouplens.org/datasets/movielens), and music playlists from the complete Yes.com dataset (http://www.cs.cornell.edu/~shuochen/lme). We perform similar vocabulary curation and document tailoring, with the exception of frequent stop-object elimination. Playlists often contain the same songs multiple times, but users are unlikely to review the same movie more than once, so we augment the movie dataset so that each review repeats each movie a number of times determined by the half-scaled rating, which varies from 0.5 stars to 5 stars. Statistics of our datasets are shown in Table 2.

Dataset   Documents (M)   Vocabulary (N)   Avg. Len
NIPS      1,348           5k               380.5
NYTimes   269,325         15k              204.9
Movies    63,041          10k              142.8
Songs     14,653          10k              119.2
Table 2: Statistics of the four datasets.

We run DC 30 times for each experiment, randomly permuting the order of objects and using the median results to minimize the effect of different orderings. We also run 150 iterations of AP, alternating the three projections Π_NN, Π_NOR, and Π_{P_K} in turn. For probabilistic Gibbs sampling, we use Mallet with standard options, running 1,000 iterations. All metrics are evaluated against the original Ĉ, not against the rectified Ĉ, whereas B and A are inferred from the rectified Ĉ.

Qualitative results.

Although [6] report results comparable to probabilistic algorithms for LDA, the algorithm fails under many circumstances. It prefers rare and unusual anchor words that form a poor basis, so topic clusters repeat the same high-frequency terms, as shown in the upper third of Table 3. In contrast, our algorithm with AP rectification successfully learns themes similar to the probabilistic algorithm. One can also verify that the cluster interactions given in the third panel of Figure 3 explain how the five topics correlate with each other.

Arora et al. 2013 (Baseline)
neuron layer hidden recognition signal cell noise
neuron layer hidden cell signal representation noise
neuron layer cell hidden signal noise dynamic
neuron layer cell hidden control signal noise
neuron layer hidden cell signal recognition noise
This paper (AP)
neuron circuit cell synaptic signal layer activity
control action dynamic optimal policy controller reinforcement
recognition layer hidden word speech image net
cell field visual direction image motion object orientation
gaussian noise hidden approximation matrix bound examples
Probabilistic LDA (Gibbs)
neuron cell visual signal response field activity
control action policy optimal reinforcement dynamic robot
recognition image object feature word speech features
hidden net layer dynamic neuron recurrent noise
gaussian approximation matrix bound component variables
Table 3: Each line is a topic from NIPS (K = 5). Previous work simply repeats the most frequent words in the corpus five times.

Similar to [13], we visualize the five anchor words in the co-occurrence space after 2D PCA of Ĉ. Each panel in Figure 1 shows a 2D embedding of the NIPS vocabulary as blue dots and the five selected anchor words in red. The first plot shows standard anchor words in the original co-occurrence space. The second plot shows anchor words selected from the rectified space overlaid on the original co-occurrence space. The third plot shows the same anchor words as the second plot overlaid on the AP-rectified space. The rectified anchor words provide better coverage in both spaces, explaining why we are able to achieve reasonable topics even with K = 5.

Rectification also produces better clusters in the non-textual movie dataset. Each cluster is notably more genre-coherent and year-coherent than the clusters from the original algorithm. When K = 15, for example, we find a cluster of Walt Disney 2D animations mostly from the 1990s and a cluster of fantasy movies represented by the Lord of the Rings films, similar to clusters found by probabilistic Gibbs sampling. The Baseline algorithm [6] repeats Pulp Fiction and Silence of the Lambs 15 times.

Figure 4: Experimental results on the real datasets. The x-axis indicates K, which varies in steps of 5 up to 25 topics and in steps of 25 up to 100 or 150 topics. Whereas the Baseline algorithm largely fails with small K and does not infer quality B and A even with large K, Alternating Projection (AP) not only finds better basis vectors (Recovery), but also shows stable behavior comparable to probabilistic inference (Gibbs) in every metric.

Quantitative results.

We measure the intrinsic quality of inference and summarization with respect to the JSMF objectives as well as the extrinsic quality of resulting topics. Lines correspond to four methods: Baseline for the algorithm in the previous work [6] without any rectification, DC for Diagonal Completion, AP for Alternating Projection, and Gibbs for Gibbs sampling.

Anchor objects should form a good basis for the remaining objects. We measure Recovery error with respect to the original matrix Ĉ, not the rectified matrix. AP reduces error in almost all cases and is more effective than DC. Although we expect error to decrease as we increase the number of clusters K, reducing recovery error for a fixed K by choosing better anchors is extremely difficult: no other subset selection algorithm [14] decreased error by more than 0.001. A good matrix factorization should have small element-wise Approximation error ||Ĉ − BAB^T||. DC and AP preserve more of the information in the original matrix than the Baseline method, especially when K is small. (In the NYTimes corpus even a small absolute error is large: each element is tiny because the matrix contains so many normalized entries.) We expect non-trivial interactions between clusters, even when we do not explicitly model them as in [15]. Greater diagonal Dominancy of A indicates lower correlation between clusters. (Dominancy on the Songs corpus lacks Baseline results at some values of K because dominancy is undefined if an algorithm picks as a basis object a song that occurs at most once in each playlist. In this case, the original construction of Ĉ, and hence of A, has a zero diagonal element, making dominancy NaN.) AP and Gibbs results are similar. We do not report held-out probability because we find that relative results are determined by user-defined smoothing parameters [13, 16].

Specificity measures how much each cluster is distinct from the corpus distribution. When anchors produce a poor basis, the conditional distribution of clusters given objects becomes uniform, making each cluster’s distribution similar to the corpus distribution. Inter-topic Dissimilarity counts the average number of objects in each cluster that do not occur in any other cluster’s top 20 objects. Our experiments validate that AP and Gibbs yield comparably specific and distinct topics, while Baseline and DC simply repeat the corpus distribution as in Table 3. Coherence penalizes topics that assign high probability to top-ranked words that do not occur together frequently. AP produces results close to Gibbs sampling, and far from Baseline and DC. While this metric correlates with human evaluation of clusters [17], “worse” coherence can actually be better because the metric does not penalize repetition [13].
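Two of these quantities can be computed directly from the definitions above; a hedged sketch (the paper’s evaluation may differ in normalization details):

```python
import numpy as np

def approximation_error(C_hat, B, A):
    """Element-wise approximation error between the original C and its factorization."""
    return float(np.linalg.norm(C_hat - B @ A @ B.T))

def dissimilarity(B, top=20):
    """Average number of a cluster's top objects absent from every other cluster's top list."""
    tops = [set(np.argsort(-B[:, k])[:top]) for k in range(B.shape[1])]
    uniq = [len(tops[k] - set().union(*(tops[:k] + tops[k + 1:]))) for k in range(len(tops))]
    return float(np.mean(uniq))
```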

In semi-synthetic experiments [6], AP matches Gibbs sampling and outperforms the Baseline, but the discrepancies in topic quality metrics are smaller than in the real experiments (see the Appendix). We speculate that semi-synthetic data is more “well-behaved” than real data, explaining why these issues were not recognized previously.

5 Analysis of Algorithm

Why does AP work?

Before rectification, the diagonal entries of the empirical matrix Ĉ may be far from correct. Bursty objects yield diagonal entries that are too large; extremely rare objects that occur at most once per document yield zero diagonals. Rare objects are problematic in general: the corresponding rows of Ĉ are sparse and noisy, and these rows are likely to be selected by the pivoted QR. Because rare objects are likely to be anchors, the recovered A is likely to be highly diagonally dominant and provides an uninformative picture of topic correlations. These problems are exacerbated when K is small relative to the effective rank of Ĉ, so that an early choice of a poor anchor precludes a better choice later on, and when the number of documents is small, in which case the empirical Ĉ is relatively sparse and strongly affected by noise. To mitigate this issue, [16] run an exhaustive grid search over document-frequency cutoffs to find informative anchors. As model performance is inconsistent across cutoffs and the search requires cross-validation for each case, it is nearly impossible to find good heuristics for every dataset and number of topics.

Fortunately, a low-rank PSD matrix cannot have too many diagonally dominant rows, since this would violate the low-rank property. Nor can it have diagonal entries that are small relative to off-diagonals, since this would violate positive semi-definiteness. Because the anchor word assumption implies that the non-negative rank and the ordinary rank of C are the same, the AP algorithm ideally does not remove the information we wish to learn; rather, 1) the low-rank projection in AP suppresses the influence of the small number of noisy rows associated with rare words, which may not be well correlated with the others, and 2) the PSD projection in AP recovers missing information in the diagonals. (As illustrated in the Dominancy panel of the Songs corpus in Figure 4, AP shows valid dominancy values even at larger K where the Baseline algorithm fails.)

Why does AP converge?

AP enjoys local linear convergence [10] if 1) the initial Ĉ is near the convergence point, 2) P_K is super-regular at the convergence point, and 3) strong regularity holds at the convergence point. For the first condition, recall that we rectify Ĉ by pushing it toward C*, which is the ideal convergence point inside the intersection. Since Ĉ is an unbiased estimator of C* as shown in (5), Ĉ is close to C* as desired. Prox-regular sets (a set is prox-regular at a point if the projection onto it is locally unique near that point) are a subset of super-regular sets, so prox-regularity of P_K at the convergence point is sufficient for the second condition. For a permutation-invariant set M, the spectral set of symmetric matrices is defined as λ^{-1}(M) = {X symmetric : λ(X) ∈ M}, and λ^{-1}(M) is prox-regular if and only if M is prox-regular [18, Th. 2.4]. Let M be λ(P_K). Since each element of M has exactly K positive components and all others equal to zero, P_K = λ^{-1}(M). By the definitions of M and P_K, the projection onto P_K is locally unique almost everywhere, satisfying the second condition almost surely. (As the intersection of the convex set of PSD matrices and the smooth manifold of rank-K matrices, P_K is a smooth manifold almost everywhere.)

Checking the third condition a priori is challenging, but we expect noise in the empirical Ĉ to prevent an irregular solution, following the argument of Numerical Example 9 in [10]. We therefore expect AP to converge locally linearly, and we can verify local convergence of AP in practice: empirically, the ratios of average distances between consecutive iterations are consistently bounded below one on the NYTimes dataset (see the Appendix), and the other datasets behave similarly. Note again that our rectified Ĉ is the result of pushing the empirical Ĉ toward the ideal C*. Because the approximation factors of [6] are all computed in terms of how far the empirical Ĉ and its co-occurrence shape can be from those of C*, all provable guarantees of [6] hold better with our rectified Ĉ.
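This check can be reproduced by tracking distances between successive AP iterates; a sketch that reuses one round of the three projections from Section 3 (the number of rounds is an arbitrary choice of ours):

```python
import numpy as np

def ap_round(C, K):
    """One round of the three projections used in the rectification."""
    N = C.shape[0]
    C = np.maximum(C, 0.0)                                # non-negative
    C = C + (1.0 - C.sum()) / (N * N)                     # entries sum to one
    vals, vecs = np.linalg.eigh((C + C.T) / 2.0)          # rank-K PSD truncation
    keep = np.argsort(vals)[-K:]
    lam = np.clip(vals[keep], 0.0, None)
    return (vecs[:, keep] * lam) @ vecs[:, keep].T

def convergence_ratios(C0, K, rounds=50):
    """Ratios ||C_{t+1}-C_t|| / ||C_t-C_{t-1}||; ratios settling below one suggest linear convergence."""
    iterates = [C0]
    for _ in range(rounds):
        iterates.append(ap_round(iterates[-1], K))
    d = [np.linalg.norm(b - a) for a, b in zip(iterates, iterates[1:])]
    return [d[t + 1] / max(d[t], 1e-30) for t in range(len(d) - 1)]
```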

6 Related and Future Work

JSMF is a specific structure-preserving Non-negative Matrix Factorization (NMF) performing spectral inference. [19, 20] exploit a similar separable structure for NMF problems. To tackle hyperspectral unmixing problems, [21, 22] assume pure pixels, a separability equivalent in computer vision. For more general NMF without such structures, RESCAL [23] studies a tensorial extension of a similar factorization, and SymNMF [24] infers C ≈ BB^T rather than C ≈ BAB^T. For topic modeling, [25] performs spectral inference on the third-moment tensor, assuming topics are uncorrelated.

As the core of our algorithm is to rectify the input co-occurrence matrix, it can be combined with several recent developments. [16] propose two regularization methods for recovering a better B. [13] nonlinearly projects the co-occurrence to a low-dimensional space via t-SNE and achieves better anchors by finding the exact anchors in that space. [11] performs multiple random projections to low-dimensional spaces and recovers approximate anchors efficiently by a divide-and-conquer strategy. In addition, our work opens several promising research directions. How exactly do anchors found in the rectified space form better bases than ones found in the original space? Since the topic-topic matrix A is again doubly non-negative and joint-stochastic, can we learn super-topics in a multi-layered hierarchical model by recursively applying JSMF to the topic-topic co-occurrence A?

Acknowledgments

This research is supported by NSF grant HCC:Large-0910664. We thank Adrian Lewis for valuable discussions on AP convergence.

References

  • [1] Alan Mislove, Bimal Viswanath, Krishna P. Gummadi, and Peter Druschel. You are who you know: Inferring user profiles in Online Social Networks. In Proceedings of the 3rd ACM International Conference of Web Search and Data Mining (WSDM’10), New York, NY, February 2010.
  • [2] Shuo Chen, J. Moore, D. Turnbull, and T. Joachims. Playlist prediction via metric embedding. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), pages 714–722, 2012.
  • [3] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.
  • [4] Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. In NIPS, 2014.
  • [5] S. Arora, R. Ge, and A. Moitra. Learning topic models – going beyond SVD. In FOCS, 2012.
  • [6] Sanjeev Arora, Rong Ge, Yonatan Halpern, David Mimno, Ankur Moitra, David Sontag, Yichen Wu, and Michael Zhu. A practical algorithm for topic modeling with provable guarantees. In ICML, 2013.
  • [7] T. Hofmann. Probabilistic latent semantic analysis. In UAI, pages 289–296, 1999.
  • [8] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, pages 993–1022, 2003. Preliminary version in NIPS 2001.
  • [9] James P. Boyle and Richard L. Dykstra. A method for finding projections onto the intersection of convex sets in Hilbert spaces. In Advances in Order Restricted Statistical Inference, volume 37 of Lecture Notes in Statistics, pages 28–47. Springer New York, 1986.
  • [10] Adrian S. Lewis, D. R. Luke, and Jérôme Malick. Local linear convergence for alternating and averaged nonconvex projections. Foundations of Computational Mathematics, 9:485–513, 2009.
  • [11] Tianyi Zhou, Jeff A Bilmes, and Carlos Guestrin. Divide-and-conquer learning by anchoring a conical hull. In Advances in Neural Information Processing Systems 27, pages 1242–1250, 2014.
  • [12] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101:5228–5235, 2004.
  • [13] Moontae Lee and David Mimno. Low-dimensional embeddings for interpretable anchor-based topic inference. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1319–1328. Association for Computational Linguistics, 2014.
  • [14] Mary E. Broadbent, Martin Brown, Kevin Penner, I. Ipsen, and R. Rehman. Subset selection algorithms: Randomized vs. deterministic. SIAM Undergraduate Research Online, 3:50–71, 2010.
  • [15] D. Blei and J. Lafferty. A correlated topic model of science. Annals of Applied Statistics, pages 17–35, 2007.
  • [16] Thang Nguyen, Yuening Hu, and Jordan Boyd-Graber. Anchors regularized: Adding robustness and extensibility to scalable topic-modeling algorithms. In Association for Computational Linguistics, 2014.
  • [17] David Mimno, Hanna Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum. Optimizing semantic coherence in topic models. In EMNLP, 2011.
  • [18] A. Daniilidis, A. S. Lewis, J. Malick, and H. Sendov. Prox-regularity of spectral functions and spectral sets. Journal of Convex Analysis, 15(3):547–560, 2008.
  • [19] Christian Thurau, Kristian Kersting, and Christian Bauckhage. Yes we can: simplex volume maximization for descriptive web-scale matrix factorization. In CIKM’10, pages 1785–1788, 2010.
  • [20] Abhishek Kumar, Vikas Sindhwani, and Prabhanjan Kambadur. Fast conical hull algorithms for near-separable non-negative matrix factorization. CoRR, 2012.
  • [21] José M. P. Nascimento and José M. Bioucas-Dias. Vertex component analysis: A fast algorithm to unmix hyperspectral data. IEEE Transactions on Geoscience and Remote Sensing, pages 898–910, 2005.
  • [22] Cécile Gomez, H. Le Borgne, Pascal Allemand, Christophe Delacourt, and Patrick Ledru. N-FindR method versus independent component analysis for lithological identification in hyperspectral imagery. International Journal of Remote Sensing, 28(23):5315–5338, 2007.
  • [23] Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. A three-way model for collective learning on multi-relational data. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), ICML, pages 809–816. ACM, 2011.
  • [24] Da Kuang, Haesun Park, and Chris H. Q. Ding. Symmetric nonnegative matrix factorization for graph clustering. In SDM. SIAM / Omnipress, 2012.
  • [25] Anima Anandkumar, Dean P. Foster, Daniel Hsu, Sham Kakade, and Yi-Kai Liu. A spectral algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems 25, pages 926–934, 2012.