# Topic Discovery through Data Dependent and Random Projections

We present algorithms for topic modeling based on the geometry of cross-document word-frequency patterns. This perspective gains significance under the so-called separability condition, which posits the existence of novel words that are unique to each topic. We present a suite of highly efficient algorithms, based on data-dependent and random projections of word-frequency patterns, to identify novel words and their associated topics. We also discuss statistical guarantees for the data-dependent projections method under two mild assumptions on the prior density of the topic-document weight matrix. Our key insight is that the maximum and minimum values of cross-document frequency patterns projected along any direction are associated with novel words. While our sample complexity bounds for topic recovery are similar to the state of the art, the computational complexity of our random projection scheme scales linearly with the number of documents and the number of words per document. We present several experiments on synthetic and real-world datasets to demonstrate the qualitative and quantitative merits of our scheme.


## 1 Introduction

We consider a corpus of M documents, each composed of words chosen from a vocabulary of W distinct words. We adopt the classic “bag of words” modeling paradigm widely used in probabilistic topic modeling (Blei, 2012). Each document is modeled as being generated by independent and identically distributed (iid) draws of words from an unknown document word-distribution vector. Each document word-distribution vector is itself modeled as an unknown probabilistic mixture of K unknown latent topic word-distribution vectors that are shared among the documents in the corpus. Documents are generated independently.

For future reference, we adopt the following notation. We denote the unknown topic matrix, whose columns are the latent topic word-distribution vectors, by β, and the weight matrix, whose columns are the mixing weights over topics for the M documents, by θ. These columns are assumed to be iid samples from a prior distribution. Each column of the matrix A = βθ corresponds to a document word-distribution vector. X denotes the observed word-by-document matrix realization; its columns are the empirical word-frequency vectors of the M documents. Our goal is to estimate the latent topic word-distribution vectors (the topic matrix β) from the empirical word-frequency vectors of all documents (X).

A fundamental challenge here is that the word-by-document distribution matrix A is unknown, and only a realization X is available through the sampled word frequencies of each document. Another challenge is that even when these distributions are exactly known, the decomposition of A into the product of the topic matrix β and the topic-document weight matrix θ, known as Nonnegative Matrix Factorization (NMF), has been shown to be NP-hard in general. In this paper, we develop computationally efficient algorithms with provable guarantees for estimating β when the topic matrix satisfies the separability condition (Donoho & Stodden, 2004; Arora et al., 2012b).

###### Definition 1.

(Separability) A topic matrix β is separable if for each topic k there is some word i such that β_ik > 0 and β_il = 0 for all l ≠ k.

The condition posits the existence of novel words that are unique to each topic. Our algorithm has three main steps. In the first step, we identify novel words by means of data-dependent or random projections. A key insight here is that when each word is associated with the vector of its occurrences across all M documents, the novel words correspond to extreme points of the convex hull of these vectors. A highlight of our approach is the identification of novel words based on data-dependent and random projections: whenever a convex object is projected along a direction, the maximum and minimum values along that direction are attained at extreme points of the object. While our method identifies novel words with negligible false-alarm and miss rates, multiple novel words can evidently be associated with the same topic. To account for this, we apply a distance-based clustering algorithm to group novel words belonging to the same topic. Our final step uses linear regression to estimate the topic word frequencies from the novel words.

We show that our scheme has a sample complexity similar to that of the state of the art, such as (Arora et al., 2012a). On the other hand, the computational complexity of our random projection scheme scales linearly with the number of documents M and the average number of words per document. We then present a set of experiments on synthetic and real-world datasets. The results demonstrate the qualitative and quantitative superiority of our scheme in comparison to other state-of-the-art schemes.

## 2 Related Work

The literature on topic modeling and discovery is extensive. One line of work is based on solving a nonnegative matrix factorization (NMF) problem. To address the scenario where only the realization X is known and not A, several papers (Lee & Seung, 1999; Donoho & Stodden, 2004; Cichocki et al., 2009; Recht et al., 2012) attempt to minimize a regularized cost function. Nevertheless, this joint optimization is non-convex, and suboptimal strategies have been used in this context. Unfortunately, when the number of words per document is much smaller than the vocabulary size, which is often the case, many words do not appear in X, and such methods often fail in these cases.

Latent Dirichlet Allocation (LDA) (Blei et al., 2003; Blei, 2012) is a statistical approach to topic modeling. In this approach, the columns of θ are modeled as iid random draws from some prior distribution such as the Dirichlet. The goal is to compute MAP (maximum a posteriori probability) estimates for the topic matrix. This setup is inherently non-convex, and MAP estimates are computed using variational Bayes approximations of the posterior distribution, Gibbs sampling, or expectation propagation.

A number of methods with provable guarantees have also been proposed. (Anandkumar et al., 2012) describe a novel method-of-moments approach. While their algorithm does not impose structural assumptions on the topic matrix β, it requires Dirichlet priors for the weight matrix θ. One issue is that such priors do not permit certain classes of correlated topics (Blei & Lafferty, 2007; Li & McCallum, 2007). Also, their algorithm is not agnostic, since it uses the parameters of the Dirichlet prior. Furthermore, the suggested algorithm involves computing empirical moments and singular value decompositions, which can be cumbersome for large matrices.

Our work is closely related to the recent work of (Arora et al., 2012b) and (Arora et al., 2012a), with some important differences. They describe methods with provable guarantees when the topic matrix satisfies the separability condition. Their algorithm discovers novel words from empirical word co-occurrence patterns, and then in a second step estimates the topic matrix. Their key insight is that when each word i is associated with a W-dimensional vector, whose jth component is the probability that word i and word j occur in the same document in the corpus, the novel words correspond to extreme points of the convex hull of these vectors. (Arora et al., 2012a) presents combinatorial algorithms to recover novel words, with computational complexity that grows as the element-wise tolerable error of the topic matrix β shrinks. An important computational remark is that this tolerable error typically must shrink as W grows, i.e., probability values in β get small as W is increased, so one needs a smaller error tolerance to safely estimate β when W is large. The other issue with their method is that empirical estimates of the joint probabilities in the word-word co-occurrence matrix can be unreliable, especially when the corpus is not large enough. Finally, their novel word detection algorithm requires linear independence of the extreme points of the convex hull. This can be a serious problem in some datasets where word co-occurrences lie on a low-dimensional manifold.

Major Differences: Our work also assumes separability and the existence of novel words. We associate each word with an M-dimensional vector consisting of the word’s frequency of occurrence across the M documents, rather than word co-occurrences as in (Arora et al., 2012b, a). We also show that the extreme points of the convex hull of these cross-document frequency patterns are associated with novel words. While these differences appear technical, they have important consequences. In several experiments our approach significantly outperforms (Arora et al., 2012a) and mirrors the performance of more conventional methods such as LDA (Griffiths & Steyvers, 2004). Furthermore, our approach can deal with degenerate cases found in some image datasets, where the data vectors can lie on a manifold of lower dimension than the number of topics. At a conceptual level, our approach hinges on the distinct cross-document support patterns of novel words belonging to different topics. Such support patterns are typically robust to sampling fluctuations, in comparison to the word co-occurrence statistics of the corpus. Our approach also differs algorithmically: we develop novel algorithms based on data-dependent and random projections that find extreme points efficiently, with computational complexity scaling linearly in the corpus size for the random scheme.

Organization: We present the motivating topic geometry in Section 3. We then present our three-step algorithm in Section 4, along with intuition and computational complexity. The statistical correctness of each step of the proposed approach is summarized in Section 5. We address practical issues in Section 6.

## 3 Topic Geometry

Recall that X and A respectively denote the empirical and actual document word-distribution matrices, and A = βθ, where β is the latent topic word-distribution matrix and θ is the underlying weight matrix. Let ˜X, ˜A and ˜θ denote the X, A and θ matrices after row normalization. We set ˜β_ik = β_ik ‖θ_k‖₁ / ‖A_i‖₁, so that ˜A = ˜β˜θ. Let ˜X_i and ˜A_i respectively denote the ith rows of ˜X and ˜A, representing the cross-document pattern of word i. We assume that β is separable (Def. 1), and we distinguish, for each topic k, its set of novel words from the set of non-novel words.

The geometric intuition underlying our approach is formalized in the following proposition:

###### Proposition 1.

Let β be separable. Then for every novel word i of topic k, ˜A_i = ˜θ_k, and for every non-novel word i, ˜A_i is a convex combination of the ˜θ_k’s, k = 1, …, K.

Proof: Note that for all i,

 ∑_{k=1}^{K} ˜β_ik = 1

and ˜β_ik ≥ 0 for all i, k. Moreover, we have

 ˜A_i = ∑_{k=1}^{K} ˜β_ik ˜θ_k

Hence ˜A_i = ˜θ_k for a novel word i of topic k. In addition, ˜A_i is a convex combination of the ˜θ_k’s for non-novel words i.

Fig. 1 illustrates this geometry. Without loss of generality, we can assume that the novel word vectors are not in the convex hull of the other rows of ˜A. Hence, the problem of identifying novel words reduces to finding the extreme points among all the ˜A_i’s.
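The geometry of Proposition 1 is easy to verify numerically. The following sketch (matrix sizes, the seed, and the use of numpy are our own illustrative choices, not the paper's) builds a separable topic matrix, draws Dirichlet weights, and checks that a novel word's row-normalized cross-document pattern coincides with the corresponding normalized topic row:

```python
import numpy as np

rng = np.random.default_rng(0)
W, K, M = 8, 3, 2000  # vocabulary, topics, documents (illustrative sizes)

# Separable topic matrix: word k is novel for topic k (rows 0..K-1).
beta = np.vstack([np.eye(K), rng.random((W - K, K))])
beta /= beta.sum(axis=0)                      # columns are word distributions

theta = rng.dirichlet(np.ones(K), size=M).T   # K x M topic weights
A = beta @ theta                              # W x M document word distributions
A_tilde = A / A.sum(axis=1, keepdims=True)    # row-normalized cross-document patterns

# For a novel word i of topic k, A_tilde[i] equals the row-normalized topic
# row theta_tilde[k]; for a non-novel word it is a convex combination of them.
theta_tilde = theta / theta.sum(axis=1, keepdims=True)
print(np.allclose(A_tilde[0], theta_tilde[0]))  # → True (word 0 is novel for topic 0)
```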

Furthermore, retrieving the topic matrix β is straightforward given K distinct novel words, one per topic:

###### Proposition 2.

If the matrix ˜A and K distinct novel words, one per topic, are given, then β can be calculated using linear regression.

Proof: By Proposition 1, we have ˜θ_k = ˜A_{i_k} for a novel word i_k of topic k. Next, ˜A = ˜β˜θ, so ˜β can be computed by solving a linear system of equations. Specifically, if we let B = diag(A1)˜β, where A1 denotes the vector of row sums of A, then β can be obtained by column-normalizing B.
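As a concrete illustration of Proposition 2, the following sketch recovers β from the exact ˜A and known novel-word indices by least squares, followed by the row un-normalization and column normalization described in the proof (the sizes, seed, and numpy usage are our illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
W, K, M = 10, 3, 5
beta = np.vstack([np.eye(K), rng.random((W - K, K))])
beta /= beta.sum(axis=0)
theta = rng.dirichlet(np.ones(K), size=M).T
A = beta @ theta

row_sums = A.sum(axis=1, keepdims=True)
A_tilde = A / row_sums
novel = [0, 1, 2]                      # one novel word per topic (known here)
theta_tilde = A_tilde[novel]           # Proposition 1: rows of the novel words

# Solve A_tilde = beta_tilde @ theta_tilde for beta_tilde (row-wise least squares)
beta_tilde, *_ = np.linalg.lstsq(theta_tilde.T, A_tilde.T, rcond=None)
beta_tilde = beta_tilde.T

# Un-normalize the rows and column-normalize to recover beta
B = row_sums * beta_tilde
beta_hat = B / B.sum(axis=0)
print(np.allclose(beta_hat, beta))     # → True (exact recovery from exact A)
```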

Propositions 1 and 2 validate the approach of estimating β by identifying novel words, given access to ˜A. However, only X, a realization of A, is available in practice, and ˜X is not close to ˜A in typical settings of interest, where the number of words per document is small. Even so, if we collect enough documents (M → ∞), the proposed algorithm can still estimate β with arbitrary precision, as we discuss in the following sections.

## 4 Proposed Algorithm

The geometric intuition of Propositions 1 and 2 motivates the following three-step approach to topic discovery:

(1) Novel Word Detection: Given the empirical word-by-document matrix X, extract the set of novel words. We present variants of projection-based algorithms in Sec. 4.1.

(2) Novel Word Clustering: Given a set of novel words, possibly several per topic, cluster them into K groups corresponding to the topics and pick a representative for each group. We adopt a distance-based clustering algorithm (Sec. 4.2).

(3) Topic Estimation: Estimate the topic matrix β as suggested in Proposition 2 by constrained linear regression (Section 4.3).

### 4.1 Novel Word Detection

Fig. 1 illustrates the key insight used to identify novel words as extreme points of a convex body: when every point of a convex body is projected onto some direction, the maximum and minimum values are attained at extreme points. Our proposed approaches, data-dependent and random projections, both exploit this fact; they differ only in the choice of projection directions.

A. Data Dependent Projections (DDP)
To simplify our analysis, we randomly split each document into two halves, obtaining two statistically independent document collections X and X′, and row-normalize them as ˜X and ˜X′. For some threshold d, to be specified later, and for each word i, we consider the set J_i of all other words that are sufficiently different from word i in the following sense:

 J_i = { j ∣ M (˜X_i − ˜X_j)(˜X′_i − ˜X′_j)⊤ ≥ d/2 } (1)

We then declare word i to be a novel word if all words j ∈ J_i are uniformly uncorrelated with word i, with some margin γ to be specified later:

 M ⟨˜X_i, ˜X′_i⟩ ≥ M ⟨˜X_i, ˜X′_j⟩ + γ/2,  ∀ j ∈ J_i (2)

The correctness of the DDP algorithm is established by the following proposition and is further discussed in Section 5. The proof is given in the Supplementary section.

###### Proposition 3.

Suppose conditions P1 and P2 (defined in Section 5) on the prior distribution of θ hold. Then there exist two positive constants d and γ such that, if i is a novel word, condition (2) holds for all j ∈ J_i with high probability (converging to one as M → ∞). In addition, if i is a non-novel word, there exists some j ∈ J_i for which condition (2) fails with high probability.

The algorithm is elaborated in Algorithm 1. The running time of the algorithm is summarized in the following proposition. Detailed justification is provided in the Supplementary section.

###### Proposition 4.

The running time of Algorithm 1 is polynomial in the corpus size and the vocabulary size W; in particular, it does not depend on the element-wise precision of the topic matrix.

Proof Sketch. Note that ˜X is sparse, since each document contains far fewer distinct words than W. Hence, by exploiting this sparsity, ˜X˜X′⊤ can be computed efficiently. For each word i, finding J_i and evaluating condition (2) over J_i then takes time at most linear in W.
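A compact sketch of the DDP test in Eqs. (1)-(2) is below. For simplicity it takes the two (nominally independent) normalized halves as inputs; the toy example feeds it the exact ˜A for both halves, with thresholds d and γ chosen by hand — all of which are our illustrative assumptions rather than the paper's Algorithm 1 verbatim:

```python
import numpy as np

def ddp_novel_words(X_tilde, Xp_tilde, d, gamma):
    """Declare word i novel if condition (2) holds for every word j in the
    set J_i of words at least d/2-far from i in the sense of Eq. (1)."""
    W, M = X_tilde.shape
    G = M * (X_tilde @ Xp_tilde.T)                 # M * <X_i, X'_j> for all pairs
    novel = []
    for i in range(W):
        # Eq. (1): M (X_i - X_j)(X'_i - X'_j)^T for all j at once
        dist = G[i, i] - G[i, :] - G[:, i] + np.diag(G)
        J = np.where(dist >= d / 2)[0]
        # Eq. (2): word i must dominate every j in J_i by the margin gamma/2
        if len(J) > 0 and np.all(G[i, i] >= G[i, J] + gamma / 2):
            novel.append(i)
    return novel

# Toy corpus: 3 topics with disjoint document supports, words 0-2 novel.
beta = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1],
                 [1, 1, 1], [1, 1, 1], [1, 1, 1]], float)
beta /= beta.sum(axis=0)
theta = np.repeat(np.eye(3), 2, axis=1)            # 3 topics x 6 pure documents
A = beta @ theta
A_tilde = A / A.sum(axis=1, keepdims=True)
print(ddp_novel_words(A_tilde, A_tilde, d=1.0, gamma=1.0))  # → [0, 1, 2]
```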

B. Random Projections (RP)
DDP uses a separate, data-dependent direction for each word to find all the extreme points. Here we use random directions instead. This significantly reduces the time complexity by decreasing the number of required projections.

The Random Projection algorithm (RP) draws a number of random directions uniformly iid over the unit sphere. For each direction r, we project all the ˜X_i’s onto it and choose the maximum and minimum.

Note that the projection of ˜X_i converges to that of ˜A_i as M increases. Moreover, only the extreme points ˜A_i can attain the maximum or minimum projection value. This provides the intuition for the consistency of RP. Since the directions are independent, we expect to find all the novel words using a moderate number of random projections.
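The RP step can be sketched as follows; the Gaussian draw normalized to the sphere, the number of projections, and the toy data are our illustrative choices:

```python
import numpy as np

def rp_novel_words(X_tilde, n_projections, seed=0):
    """Collect the argmax and argmin words along iid random directions;
    these are extreme points of the convex hull of the rows of X_tilde,
    hence candidate novel words."""
    rng = np.random.default_rng(seed)
    W, M = X_tilde.shape
    candidates = set()
    for _ in range(n_projections):
        r = rng.standard_normal(M)
        r /= np.linalg.norm(r)            # uniform direction on the unit sphere
        proj = X_tilde @ r
        candidates.add(int(np.argmax(proj)))
        candidates.add(int(np.argmin(proj)))
    return sorted(candidates)

# Toy geometry as in Proposition 1: words 0-2 are the only extreme points.
beta = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1],
                 [1, 1, 1], [1, 1, 1], [1, 1, 1]], float)
beta /= beta.sum(axis=0)
theta = np.repeat(np.eye(3), 2, axis=1)
A_tilde = beta @ theta
A_tilde /= A_tilde.sum(axis=1, keepdims=True)
print(rp_novel_words(A_tilde, 50))        # → [0, 1, 2]
```

With enough directions, every extreme point is the maximum or minimum along some draw, while interior rows (words 3-5 here) can never be.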

C. Random Projections with Binning

Another alternative to RP is the Binning algorithm, which is computationally more efficient. Here the corpus is split into equal-sized bins. For each bin, a random direction r is chosen, and the word with the maximum projection along r is declared the winner of that bin. We then count the number of wins for each word i and divide these winning frequencies by the number of bins as an estimate of p_i, the probability that word i wins a bin. This probability can be shown to be zero for all non-novel words, while for a non-degenerate prior over θ it converges to a strictly positive value for each novel word. Hence, estimating the p_i’s helps in identifying novel words: we choose the indices of the largest values as novel words. The Binning algorithm is outlined in Algorithm 3.
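A sketch of the Binning variant is below; the bin sizes, Gaussian directions, and toy data are our illustrative choices, and the paper's Algorithm 3 may differ in details:

```python
import numpy as np

def binning_win_frequencies(X_tilde, n_bins, seed=0):
    """Split the documents into equal-sized bins, draw one random direction
    per bin, and record which word wins (attains the maximum projection);
    the returned winning frequencies estimate p_i, which vanishes for
    non-novel words."""
    rng = np.random.default_rng(seed)
    W, M = X_tilde.shape
    wins = np.zeros(W)
    for idx in np.array_split(np.arange(M), n_bins):
        r = rng.standard_normal(len(idx))
        proj = X_tilde[:, idx] @ r
        wins[int(np.argmax(proj))] += 1
    return wins / n_bins

# Toy corpus: 3 topics on disjoint blocks of 20 documents; words 0-2 novel.
beta = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1],
                 [1, 1, 1], [1, 1, 1], [1, 1, 1]], float)
beta /= beta.sum(axis=0)
theta = np.repeat(np.eye(3), 20, axis=1)          # 3 topics x 60 documents
A_tilde = beta @ theta
A_tilde /= A_tilde.sum(axis=1, keepdims=True)
p_hat = binning_win_frequencies(A_tilde, n_bins=30)
print(p_hat[3:].max())                            # → 0.0 (non-novel words never win)
```

Each direction touches only the documents in its bin, which is what makes the variant cheaper than projecting the full corpus per direction.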

In contrast with DDP, the RP algorithm is completely agnostic and parameter-free: it requires no parameters such as d and γ to find the novel words. Moreover, it significantly reduces the computational complexity:

###### Proposition 5.

The running times of the RP and Binning algorithms are linear in the size of the corpus, with Binning being the more efficient of the two.

###### Proof.

We sketch the proof here and provide a more detailed justification in the Supplementary section. Note that Binning requires fewer operations than RP to compute the projections, since each random direction is applied only to the documents of its bin. In addition, finding the maxima is also cheaper for Binning. In sum, both algorithms find all the novel words in time linear in the corpus size. ∎

### 4.2 Novel Word Clustering

Since there may be multiple novel words for a single topic, our DDP and RP algorithms can extract several novel words per topic. This necessitates clustering to group the copies. We can show that our clustering scheme is consistent if we assume that the correlation matrix R of the prior is positive definite:

###### Proposition 6.

For novel words i and j, let d̂(i, j) = M (˜X_i − ˜X_j)(˜X′_i − ˜X′_j)⊤. If R is positive definite, then d̂(i, j) converges to zero in probability whenever i and j are novel words of the same topic. Moreover, if i and j are novel words of different topics, it converges in probability to a strictly positive value greater than some constant.

The proof is presented in the Supplementary section.

As Proposition 6 suggests, we construct a binary graph whose vertices correspond to the novel words. An edge between words i and j is established if d̂(i, j) falls below a threshold τ. Clustering then reduces to finding the connected components. The procedure is described in Algorithm 4.

In Algorithm 4, we simply choose an arbitrary word of each cluster as the representative of its topic. This choice is made for ease of theoretical analysis; in practice, we could set the representative to be the average of the data points in each cluster, which is more noise-resilient.
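The thresholded-graph clustering can be sketched with a small union-find over pairwise distances in the spirit of Proposition 6; the toy data, the use of exact patterns for both halves (so the distance reduces to a plain squared norm), and the threshold value are our illustrative choices:

```python
import numpy as np

def cluster_novel_words(X_tilde, novel, tau):
    """Connect novel words i, j when M * ||X_i - X_j||^2 < tau, then return
    the connected components; same-topic novel words have near-identical
    cross-document patterns, so they fall into one component."""
    M = X_tilde.shape[1]
    parent = {i: i for i in novel}
    def find(i):                          # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for a in range(len(novel)):
        for b in range(a + 1, len(novel)):
            i, j = novel[a], novel[b]
            diff = X_tilde[i] - X_tilde[j]
            if M * (diff @ diff) < tau:   # edge: same-topic distance ~ 0
                parent[find(i)] = find(j)
    clusters = {}
    for i in novel:
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

# Toy: two novel words per topic (0-1, 2-3, 4-5), three non-novel words.
beta = np.array([[1, 0, 0]] * 2 + [[0, 1, 0]] * 2 + [[0, 0, 1]] * 2
                + [[1, 1, 1]] * 3, float)
beta /= beta.sum(axis=0)
theta = np.repeat(np.eye(3), 2, axis=1)
A_tilde = beta @ theta
A_tilde /= A_tilde.sum(axis=1, keepdims=True)
print(cluster_novel_words(A_tilde, [0, 1, 2, 3, 4, 5], tau=1.0))
# → [[0, 1], [2, 3], [4, 5]]
```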

### 4.3 Topic Matrix Estimation

Given K novel words of different topics, we can directly estimate β as in Proposition 2. This is described in Algorithm 5. We note that this part of the algorithm is similar to other topic modeling approaches that exploit separability, and the consistency of this step is also validated in (Arora et al., 2012b). In fact, one may use the convergence of extremum estimators (Amemiya, 1985) to show the consistency of this step.

## 5 Statistical Complexity Analysis

In this section, we describe the sample complexity bound for each step of our algorithm. Specifically, we provide guarantees for the DDP algorithm under two mild assumptions on the distribution over θ. The analysis of the random projection algorithm is much more involved and requires elaborate arguments; we omit it in this paper.

We require the following technical assumptions on the correlation matrix R and the mean vector of θ:

(P1) R is positive definite, with its minimum eigenvalue lower bounded by λ_∧ > 0.

(P2) There exists a positive value ζ such that, for any two distinct topics k ≠ l, the expected squared distance between ˜θ_k and ˜θ_l is at least ζ.

The second condition captures the following intuition: if two novel words are from different topics, they must appear in substantially different sets of documents. Note that for two novel words i and j of different topics k and l, ˜A_i − ˜A_j = ˜θ_k − ˜θ_l. Hence, this requirement means that ˜θ_k − ˜θ_l should be fairly distant from the origin, which implies that the number of documents in which these two words occur with similar probabilities should be small. This is a reasonable assumption, since otherwise we would rather group the two related topics into one. In fact, we show in the Supplementary section (Section A.5) that both conditions hold for the Dirichlet distribution, which is a traditional choice of prior distribution in topic modeling. Moreover, we have tested the validity of these assumptions numerically for the logistic normal distribution (with non-degenerate covariance matrices), which is used in Correlated Topic Modeling (CTM) (Blei & Lafferty, 2007).
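For the Dirichlet prior, the second-moment matrix has the known closed form E[θθ⊤] = (diag(α) + αα⊤) / (α₀(α₀ + 1)) with α₀ = Σ_k α_k, which is positive definite since diag(α) is. The following numerical check (our own sketch, with an arbitrary α) verifies the formula by Monte Carlo and confirms the minimum eigenvalue is bounded away from zero:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([1.0, 2.0, 3.0])                 # arbitrary Dirichlet parameters
a0 = alpha.sum()

# Closed-form second-moment matrix E[theta theta^T] for Dirichlet(alpha)
R = (np.diag(alpha) + np.outer(alpha, alpha)) / (a0 * (a0 + 1))

# Monte Carlo estimate from Dirichlet samples
samples = rng.dirichlet(alpha, size=200_000)
R_mc = samples.T @ samples / len(samples)

print(np.abs(R - R_mc).max() < 5e-3)              # close to the closed form
print(np.linalg.eigvalsh(R).min() > 0)            # positive definite
```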

### 5.1 Novel Word Detection Consistency

In this section, we provide analysis only for the DDP algorithm; the sample complexity analysis of the randomized projection algorithms is more involved and is the subject of ongoing research. Suppose P1 and P2 hold. Denote by β_∧ a positive lower bound on the non-zero elements of ˜β and by λ_∧ the minimum eigenvalue of R, respectively. We have:

###### Theorem 1.

For appropriate choices of the parameters d and γ, the DDP algorithm is consistent as M → ∞. Specifically, true novel and non-novel words are asymptotically declared as novel and non-novel, respectively. Furthermore, for

 M ≥ C₁ (log W + log(1/δ₁)) / (β²_∧ η⁸ min(λ²_∧ β²_∧, ζ² a²_∧))

where C₁ is a constant, Algorithm 1 finds all novel words, without any outliers, with probability at least 1 − δ₁.

Proof Sketch. The detailed justification is provided in the Supplementary section. The main idea of the proof is a sequence of statements:

• For a novel word i, the set J_i defined in Algorithm 1 is asymptotically, with high probability, a subset of a corresponding population-level set of words distant from i. Moreover, for a non-novel word i, J_i is, with high probability, a superset of a corresponding population-level set.

• For a novel word i, the test statistic in condition (2) converges to a value exceeding a strictly positive margin for all j ∈ J_i, while for a non-novel word i there exists some j ∈ J_i for which it converges to a non-positive value.

These statements imply Proposition 3, which proves the consistency of the DDP algorithm.

The term η⁸ appears to be the dominating factor in the sample complexity bound. Roughly, η represents the minimum proportion of documents in which a word appears. This is not surprising, as the rate at which ˜X_i converges to ˜A_i depends on how often word i is observed: as η decreases, the convergence slows. Viewed differently, given that the number of words per document is bounded, a large number of documents is needed to observe all the words sufficiently often. It is remarkable that a similar term also arises in the sample complexity bound of (Arora et al., 2012b), where the corresponding quantity is the minimum non-zero element of the diagonal part of the word co-occurrence matrix. Finally, although the sample complexity bound scales only logarithmically with W, the quantities η and β_∧ would typically decrease as W increases.

### 5.2 Novel Word Clustering Consistency

We similarly prove the consistency and sample complexity of the novel word clustering algorithm:

###### Theorem 2.

For an appropriate threshold τ, given all true novel words as input, the clustering algorithm, Algorithm 4 (ClusterNovelWords), asymptotically (as M → ∞) recovers novel word indices of different types; namely, the supports of the corresponding rows of ˜β are different for any two retrieved indices. Furthermore, if

 M ≥ C₂ (log W + log(1/δ₂)) / (η⁸ λ²_∧ β⁴_∧)

then Algorithm 4 clusters all novel words correctly with probability at least 1 − δ₂, where C₂ is a constant.

Proof Sketch. A more detailed analysis, including the convergence rate, is provided in the Supplementary section. We can show that d̂(i, j) converges to a strictly positive value if i and j are novel words of different topics, and to zero if they are novel words of the same topic. Hence, all novel words of the same topic are asymptotically connected in the graph with high probability, and with high probability there is no edge between novel words of different topics. Therefore, the connected components of the graph asymptotically correspond to the true clusters.

Notably, the sample complexity of clustering is similar to that of novel word detection. This suggests that, for the proposed algorithms, novel word detection and distance-based clustering are almost equally hard.

### 5.3 Topic Estimation Consistency

Finally, we show that the topic estimation by regression is also consistent.

###### Theorem 3.

Suppose that Algorithm 5 outputs β̂ given the indices of K distinct novel words. Then β̂ converges to β in probability. Specifically, if

 M ≥ C₃ W⁴ (log(W) + log(K) + log(1/δ₃)) / (λ²_∧ η⁸ ϵ⁴ a⁸_∧)

then for all i and k, β̂_ik will be ϵ-close to β_ik with probability at least 1 − δ₃, where C₃ is a constant.

Proof Sketch. We provide a detailed analysis in the Supplementary section. To prove the consistency of the regression step, we use a consistency result for extremum estimators: if Q_M(b) is a stochastic objective function minimized at b̂_M under the constraint b ∈ B for a compact B, and Q_M converges uniformly to a function Q that is uniquely minimized at b* ∈ B, then b̂_M converges in probability to b* (Amemiya, 1985). In our setting, we take Q_M to be the objective function in Algorithm 5. If R is positive definite, its uniform limit Q is uniquely minimized at ˜β, which satisfies the constraints of the optimization. Moreover, Q_M converges to Q uniformly as a result of Lipschitz continuity. Therefore, by Slutsky’s theorem, the constrained estimate converges to ˜β, and hence, after the un-normalization and column normalization of Proposition 2, the estimate converges to β.

In sum, consider the approach outlined at the beginning of Section 4 based on the data-dependent projections method, and let β̂ denote its output. Then,

###### Theorem 4.

The output β̂ of the topic modeling algorithm converges in probability to β element-wise. To be precise, if

 M ≥ max{ C′₂ W⁴ log(WK/δ) / (λ²_∧ η⁸ ϵ⁴ a⁸_∧), C′₁ log(W/δ) / (β²_∧ η⁸ min(λ²_∧ β²_∧, ζ² a²_∧)) }

then with probability at least 1 − δ, for all i and k, β̂_ik will be ϵ-close to β_ik, where C′₁ and C′₂ are two constants.

The proof is a combination of Theorems 1, 2 and 3.

## 6 Experimental Results

### 6.1 Practical Considerations

The DDP algorithm requires two parameters, d and γ. In practice, we can apply DDP adaptively and agnostically without knowing them. Note that d is used only for the construction of J_i; we can instead construct J_i by finding the words that are maximally distant from word i in the sense of Eq. 1. To bypass γ, we can rank the values of the test statistic in Eq. 2 across all words and declare the words with the topmost values to be the novel words.

The clustering algorithm also requires the parameter τ. Note that τ is used only for thresholding a weighted graph. In practice, we can avoid hard thresholding by using the distances d̂(i, j) as weights for the graph and applying spectral clustering. We point out that the number of novel words extracted in Algorithm 4 is typically of the same order as K. Hence the spectral clustering operates on a relatively small graph, which adds little computational complexity.

Implementation Details: We choose the parameters of DDP and RP as follows. For DDP, in all datasets except the Donoho image corpus, we use the agnostic algorithm discussed in Section 6.1; for the image dataset, we set d and γ explicitly. For RP, we fix the number of random projections across all datasets.

### 6.2 Synthetic Dataset

In this section, we validate our algorithm on synthetic examples. We generate a separable topic matrix with a fixed number of novel words per topic as follows: first, iid row-vectors corresponding to non-novel words are generated uniformly on the probability simplex. Then, iid values are generated for the nonzero entries in the rows of the novel words. The resulting matrix is column-normalized to obtain one realization of β. Next, iid column-vectors are generated for the weight matrix θ according to a Dirichlet prior; following (Griffiths & Steyvers, 2004), we use a symmetric Dirichlet parameter across topics. Finally, we obtain X by generating iid words for each document.
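The generation procedure above can be sketched as follows; the sizes, the number of novel words per topic, and the symmetric Dirichlet parameter are our illustrative stand-ins for the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)
W, K, M, N = 100, 5, 500, 50       # vocabulary, topics, documents, words/doc
n_novel = 2                         # novel words per topic (illustrative)

# Separable topic matrix: novel-word rows have a single nonzero entry,
# non-novel rows are drawn uniformly on the probability simplex.
beta = np.zeros((W, K))
for k in range(K):
    beta[k * n_novel:(k + 1) * n_novel, k] = rng.random(n_novel)
beta[K * n_novel:] = rng.dirichlet(np.ones(K), size=W - K * n_novel)
beta /= beta.sum(axis=0)            # column-normalize to word distributions

# Dirichlet document weights and the empirical word-frequency matrix
theta = rng.dirichlet(np.full(K, 0.1), size=M).T
A = beta @ theta                    # W x M document word distributions
X = np.stack([rng.multinomial(N, A[:, m]) for m in range(M)], axis=1) / N
```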

For different settings of W, K, M, and the number of words per document, we calculate the error of the estimated topic matrix relative to the ground truth after finding the best matching between the two sets of topics. For each setting, we average the error over random samples. For RP and DDP, we set the parameters as discussed in the implementation details.

We compare DDP and RP against the Gibbs sampling approach (Griffiths & Steyvers, 2004) (Gibbs), a state-of-the-art NMF-based algorithm (Tan & Févotte, in press) (NMF), and the most recent practical provable algorithm in (Arora et al., 2012a) (RecL2). The NMF algorithm is chosen because it compensates for the type of noise in our topic model. Fig. 2 depicts the estimation error as a function of the number of documents (top) and the number of words per document (bottom). RP and DDP have similar performance and are uniformly better than the comparable techniques. Gibbs performs relatively poorly in the first setting and NMF in the second, while RecL2 performs worse in all settings. Note that the number of words per document is relatively small compared to the vocabulary size; DDP and RP outperform the other methods even with fairly small sample sizes. Meanwhile, as also observed in (Arora et al., 2012a), RecL2 performs poorly with a small number of words per document.

### 6.3 Swimmer Image Dataset

In this section we apply our algorithms to the synthetic swimmer image dataset introduced in (Donoho & Stodden, 2004). There are 256 binary images, each with 32×32 pixels. Each image represents a swimmer composed of four limbs, each of which can be in one of 4 distinct positions, and a torso. We interpret pixel positions as words; each image is then a document composed of the pixel positions with non-zero values. Since each position of a limb features some pixels unique to it, the topic matrix satisfies the separability assumption, with 16 “ground truth” topics corresponding to single limb positions.

Following the setting of (Tan & Févotte, in press), we set body pixel values to 10 and background pixel values to 1. We then take each “clean” image, suitably normalized, as an underlying distribution across pixels and generate a “noisy” document of iid “words” according to the topic model. Examples are shown in Fig. 3. We then apply RP and DDP algorithms to the “noisy” dataset and compare against Gibbs (Griffiths & Steyvers, 2004), NMF (Tan & Févotte, in press), and RecL2 (Arora et al., 2012a). Results are shown in Figs. 4 and 5. We set the parameters as discussed in the implementation details.

This dataset is a good validation test for different algorithms, since the ground truth topics are known and unique. As we see in Fig. 4, both Gibbs and NMF produce topics that do not correspond to any pure left/right arm/leg positions; indeed, many of them are composed of multiple limbs. In contrast, as shown in Fig. 5, no such errors occur with RP and DDP, and our topic estimates are closer to the ground truth images. Meanwhile, the RecL2 algorithm failed to work even on the clean data. Although it also extracts extreme points of a convex body, it additionally requires these points to be linearly independent, and the extreme points of a convex body can be linearly dependent (for example, a 2-D square embedded in a 3-D simplex). This is exactly the case in the swimmer dataset. As the last row of Fig. 5 shows, RecL2 produces only a few topics close to the ground truth, and its topics extracted from the noisy images, shown in Fig. 4, are similarly far from the ground truth.

### 6.4 Real World Text Corpora

In this section, we apply our algorithms to two real-world text corpora from (Frank & Asuncion, 2010). The smaller corpus is the NIPS proceedings dataset; the other is a large corpus of New York (NY) Times articles. The vocabulary is obtained by deleting a standard “stop” word list used in computational linguistics, including numbers, individual characters, and some common English words such as “the”.

In order to compare with the practical algorithm of (Arora et al., 2012a), we followed the same pruning as in their experimental setup to shrink the vocabulary sizes for NIPS and NY Times. Following typical settings in (Blei, 2012) and (Arora et al., 2012a), we set the number of topics for each corpus, and set our remaining parameters as discussed in the implementation details.

We compare the DDP and RP algorithms against RecL2 (Arora et al., 2012a) and the widely successful Gibbs sampling algorithm (Griffiths & Steyvers, 2004) (Gibbs). Tables 1 and 2 depict typical topics extracted by the different methods (in the NY Times vocabulary, the zzz prefix annotates a named entity). For each topic, we show its most frequent words, listed in descending order of estimated probability. Two topics extracted by different algorithms are grouped if they are close in distance.
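The grouping step can be sketched as follows; since this excerpt does not fix the distance measure or threshold, the l1 distance and the 0.5 cutoff below are assumptions:

```python
import numpy as np

def group_topics(topics_a, topics_b, threshold=0.5):
    """Greedily pair topics from two methods whose word distributions are
    close. The l1 distance and the threshold are assumptions; the text
    only says topics are grouped when "close in distance"."""
    pairs, used = [], set()
    for i, ta in enumerate(topics_a):
        dists = [np.abs(ta - tb).sum() if j not in used else np.inf
                 for j, tb in enumerate(topics_b)]
        j = int(np.argmin(dists))
        if dists[j] < threshold:
            pairs.append((i, j))
            used.add(j)
    return pairs
```

For example, running this on two topic sets that are permutations of each other recovers the permutation as the matching.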

Different algorithms extract some fraction of similar topics that are easy to recognize. Table 1 indicates that most of the topics extracted by RP and DDP are similar to each other and comparable with those of Gibbs. We observe that the recognizable themes formed by DDP or RP topics are more abundant than those of RecL2. For example, the topic on "chip design" shown in the first panel of Table 1 is not extracted by RecL2, and the topics in Table 2 on "weather" and "emotions" are missing from RecL2. Meanwhile, RecL2 produces some obscure topics: in the last panel of Table 1, its topic mixes more than one theme, and in the last panel of Table 2 it produces an unfathomable combination of words. More details about the extracted topics are given in the Supplementary section.

## 7 Conclusion and Discussion

We summarize our proposed approaches (DDP, Binning, and RP) and compare them with existing methods in terms of assumptions, computational complexity, and sample complexity (see Table 3). Among these algorithms, DDP and RecL2 are the most competitive. While the DDP algorithm has a polynomial sample complexity, its running time is better than that of RecL2, which depends on the precision needed to recover the topics: as the problem size grows, the required precision decreases, resulting in a larger time complexity for RecL2. In contrast, the time complexity of DDP does not scale with this precision. On the other hand, the sample complexities of both DDP and RecL2, while polynomially scaling, depend on many different terms, which makes a direct comparison difficult. However, terms corresponding to similar concepts appear in the two bounds: both involve the frequency of the rarest words, because the novel words are possibly the most rare words; the condition numbers appearing in the two bounds are closely related; and both involve the maximum and minimum values of related quantities.

## References

• Amemiya (1985) Amemiya, T. Advanced econometrics. Harvard University Press, 1985.
• Anandkumar et al. (2012) Anandkumar, A., Foster, D., Hsu, D., Kakade, S., and Liu, Y. Two svds suffice: Spectral decompositions for probabilistic topic modeling and latent dirichlet allocation. In Neural Information Processing Systems (NIPS), 2012.
• Arora et al. (2012a) Arora, S., Ge, R., Halpern, Y., Mimno, D., Moitra, A., Sontag, D., Wu, Y., and Zhu, Michael. A Practical Algorithm for Topic Modeling with Provable Guarantees. ArXiv e-prints, Dec. 2012a.
• Arora et al. (2012b) Arora, S., Ge, R., and Moitra, A. Learning topic models – going beyond SVD. arXiv:1204.1956v2 [cs.LG], Apr. 2012b.
• Blei & Lafferty (2007) Blei, D. and Lafferty, J. A correlated topic model of science. Annals of Applied Statistics, pp. 17–35, 2007.
• Blei (2012) Blei, D. M. Probabilistic topic models. Commun. ACM, 55(4):77–84, Apr. 2012.
• Blei et al. (2003) Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, Mar. 2003. ISSN 1532–4435.
• Cichocki et al. (2009) Cichocki, A., Zdunek, R., Phan, A. H., and Amari, S. Nonnegative matrix and tensor factorizations: applications to exploratory multi-way data analysis and blind source separation. Wiley, 2009.
• Donoho & Stodden (2004) Donoho, D. and Stodden, V. When does non-negative matrix factorization give a correct decomposition into parts? In Advances in Neural Information Processing Systems 16, Cambridge, MA, 2004. MIT Press.
• Frank & Asuncion (2010) Frank, A. and Asuncion, A. UCI machine learning repository, 2010.
• Griffiths & Steyvers (2004) Griffiths, T. and Steyvers, M. Finding scientific topics. In Proceedings of the National Academy of Sciences, volume 101, pp. 5228–5235, 2004.
• Lee & Seung (1999) Lee, D. D. and Seung, H. S. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, Oct. 1999. ISSN 0028-0836. doi: 10.1038/44565.
• Li & McCallum (2007) Li, W. and McCallum, A. Pachinko allocation: Dag-structured mixture models of topic correlations. In International Conference on Machine Learning, 2007.
• Recht et al. (2012) Recht, B., Re, C., Tropp, J., and Bittorf, V. Factoring nonnegative matrices with linear programs. In Advances in Neural Information Processing Systems 25, pp. 1223–1231, 2012.
• Tan & Févotte (in press) Tan, V. Y. F. and Févotte, C. Automatic relevance determination in nonnegative matrix factorization with the beta-divergence. IEEE Transactions on Pattern Analysis and Machine Intelligence, in press. URL http://arxiv.org/abs/1111.6085.

## Appendix A Proofs

Given that the topic matrix is separable, we can reorder its rows so that the novel-word rows form a diagonal block on top. We will assume this structure throughout the section.

### a.1 Proof of Proposition 3

Proposition 3 is a direct consequence of Theorem 1; see Section A.7 for details.

### a.2 Proof of Proposition 4

Recall that Proposition 4 summarizes the computational complexity of the DDP Algorithm 1. Here we provide more details.

Proposition 4 (in Section 4.1). The running time of the data-dependent projection algorithm (DDP, Algorithm 1) is as stated there.

Proof: We can show that, owing to sparsity, the required statistic can be computed efficiently. First, note that it is a scaled word-word co-occurrence matrix, which can be calculated by summing the co-occurrence matrices of the individual documents. This running time can be achieved if all words in the vocabulary are first indexed by a hash table. Then, since each document consists of a bounded number of words, the co-occurrence matrix of each document can be computed in time quadratic in that bound. Finally, summing these per-document matrices yields the stated total time complexity. Moreover, for each word, we must find its projection extremes and test the novel-word condition against all other words; the worst-case cost of this step is as stated.
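The document-by-document accumulation described in this proof can be sketched as follows (an illustrative Python sketch; the scaling and normalization of the actual statistic are omitted):

```python
import numpy as np
from collections import Counter

def cooccurrence(docs, vocab):
    """Accumulate a word-word co-occurrence matrix document by document.
    `docs` is a list of token lists; the per-document cost is quadratic
    in the number of distinct tokens, matching the argument above."""
    index = {w: i for i, w in enumerate(vocab)}   # hash-table indexing
    C = np.zeros((len(vocab), len(vocab)))
    for doc in docs:
        counts = Counter(doc)
        ids = [index[w] for w in counts]
        vals = np.array([counts[w] for w in counts], dtype=float)
        # The outer product adds this document's co-occurrence counts.
        C[np.ix_(ids, ids)] += np.outer(vals, vals)
    return C
```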

### a.3 Proof of Proposition 5

Recall that Proposition 5 summarizes the computational complexity of RP (Algorithm 2) and Binning (see Appendix Section B for more details). Here we provide a more detailed proof.

Proposition 5 (in Section 4.1). The running times of RP (Algorithm 2) and the Binning algorithm (Appendix Section B) are as stated there.

Proof: Note the number of operations needed to compute the projections in Binning and in RP. This can be achieved by first indexing the words with a hash table and then computing the projection of each document along the corresponding component of the random directions, which takes time linear in the length of each document. In addition, finding the word with the maximum projection value (in RP) or the winner in each bin (in Binning) takes time linear in the vocabulary size, summed over all projections in RP and over all bins in Binning. Adding the running times of these two parts gives the stated computational complexities of the RP and Binning algorithms, respectively.
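The projection step of RP can be sketched as follows (a simplified sketch; the actual algorithm's normalization and candidate post-processing are omitted). Per the paper's key insight, both the maximum and the minimum of the projected cross-document frequency patterns are kept as novel-word candidates.

```python
import numpy as np

def novel_word_candidates(X, num_projections, rng):
    """X is the (words x documents) matrix of normalized word frequencies.
    For each random direction over documents, project every word's
    cross-document frequency pattern and keep the extreme words."""
    W, M = X.shape
    candidates = set()
    for _ in range(num_projections):
        d = rng.standard_normal(M)            # random direction
        proj = X @ d                          # one value per word
        candidates.add(int(np.argmax(proj)))  # max -> novel-word candidate
        candidates.add(int(np.argmin(proj)))  # min as well
    return candidates
```

On a toy matrix whose third row is an average of the first two, only the first two rows (the extreme points) can ever be selected.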

### a.4 Proof of Proposition 6

Proposition 6 (in Section 4.2) is a direct consequence of Theorem 2; see Section A.8 for the detailed proof.

### a.5 Validation of Assumptions in Section 5 for the Dirichlet Distribution

In this section, we prove the validity of the two assumptions made in Section 5.

For with , has pdf . Let and .

Proposition A.1 For a Dirichlet prior :

1. The correlation matrix is positive definite with minimum eigenvalue ,

2. , .

###### Proof.

The covariance matrix of , denoted as , can be written as

$$\Sigma_{i,j}=\begin{cases}\dfrac{-\alpha_i\alpha_j}{\alpha_0^2(\alpha_0+1)} & \text{if } i\neq j\\[6pt]\dfrac{\alpha_i(\alpha_0-\alpha_i)}{\alpha_0^2(\alpha_0+1)} & \text{otherwise}\end{cases}\qquad(3)$$

Compactly we have with . The mean vector . Hence we obtain

$$R=\frac{1}{\alpha_0^2(\alpha_0+1)}\left(-\alpha\alpha^\top+\alpha_0\,\mathrm{diag}(\alpha)\right)+\frac{1}{\alpha_0^2}\,\alpha\alpha^\top=\frac{1}{\alpha_0(\alpha_0+1)}\left(\alpha\alpha^\top+\mathrm{diag}(\alpha)\right)$$
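As a sanity check, the closed form for the correlation matrix can be verified numerically against Monte Carlo samples (a numpy sketch; the choice of alpha is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([0.7, 1.3, 2.0])
a0 = alpha.sum()

# Monte Carlo estimate of R = E[theta theta^T] under Dir(alpha).
theta = rng.dirichlet(alpha, size=100_000)
R_empirical = theta.T @ theta / theta.shape[0]

# Closed form derived above: R = (alpha alpha^T + diag(alpha)) / (a0 (a0 + 1)).
R_closed = (np.outer(alpha, alpha) + np.diag(alpha)) / (a0 * (a0 + 1))
```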

Note that for all ,