Topic modeling refers to a family of generative models and associated algorithms for discovering the (latent) topical structure shared by a large corpus of documents. They are important for organizing, searching, and making sense of a large text corpus. In this paper we describe a novel geometric approach, with provable statistical and computational efficiency guarantees, for learning the latent topics in a document collection. This work is a culmination of a series of recent publications on certain structure-leveraging methods for topic modeling with provable theoretical guarantees [2, 3, 4, 5].
We consider a corpus of documents, indexed by , each composed of words from a fixed vocabulary of size . The distinct words in the vocabulary are indexed by . Each document is viewed as an unordered “bag of words” and is represented by an empirical word-counts vector , where is the number of times that word appears in document [6, 1, 7, 5]. The entire document corpus is then represented by the matrix . (Footnote 1: When it is clear from the context, we will use to represent either the empirical word-count or, by suitable column-normalization of , the empirical word-frequency.) A “topic” is a distribution over the vocabulary. A topic model posits the existence of latent topics that are shared among all documents in the corpus. The topics can be collectively represented by the columns of a column-stochastic “topic matrix” . Each document is conceptually modeled as being generated independently of all other documents through a two-step process: 1) first draw a document-specific distribution over topics from a prior distribution on the probability simplex with some hyper-parameters; 2) then draw iid words according to a document-specific word distribution over the vocabulary, given by the product of the topic matrix and the document's topic weights, which is a convex combination (probabilistic mixture) of the latent topics. Our goal is to estimate the topic matrix from the matrix of empirical observations . To appreciate the difficulty of the problem, consider a typical benchmark dataset such as a news article collection from the New York Times (NYT)  that we use in our experiments. In this dataset, after suitable pre-processing, , , and, on average, . Thus, , is very sparse, and is very large. Typically, .
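The two-step generative process described above can be sketched in a few lines. This is only an illustration under stated assumptions: the prior is taken to be a Dirichlet distribution (one common choice; the model allows other priors), and the names `sample_corpus`, `beta`, `alpha`, `num_docs`, and `words_per_doc` are hypothetical rather than notation from the paper.

```python
import numpy as np

def sample_corpus(beta, alpha, num_docs, words_per_doc, seed=0):
    """Draw a word-count matrix (vocabulary x documents) from a topic model.

    beta  : column-stochastic topic matrix, shape (vocab_size, num_topics)
    alpha : Dirichlet hyper-parameters (assumed prior on the topic weights)
    """
    rng = np.random.default_rng(seed)
    vocab_size, _ = beta.shape
    counts = np.zeros((vocab_size, num_docs), dtype=int)
    for m in range(num_docs):
        theta = rng.dirichlet(alpha)      # step 1: document-specific topic weights
        word_dist = beta @ theta          # convex combination of the topic columns
        # step 2: iid word draws, summarized as a bag-of-words count vector
        counts[:, m] = rng.multinomial(words_per_doc, word_dist)
    return counts

# Toy example: 3 words, 2 topics, short documents, many documents.
beta = np.array([[0.5, 0.0],
                 [0.0, 0.5],
                 [0.5, 0.5]])
X = sample_corpus(beta, alpha=[0.5, 0.5], num_docs=1000, words_per_doc=20)
```

Each column of `X` is one document; with few words per document and many documents, the matrix mimics the sparse regime discussed above.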
This estimation problem in topic modeling has been extensively studied. The prevailing approach is to compute the MAP/ML estimate . The true posterior of given , however, is intractable to compute and the associated MAP and ML estimation problems are in fact NP-hard in the general case [9, 10]. This necessitates the use of sub-optimal methods based on approximations and heuristics such as Variational-Bayes and MCMC [6, 11, 12, 13]. While they produce impressive empirical results on many real-world datasets, guarantees of asymptotic consistency or efficiency for these approaches are either weak or non-existent. This makes it difficult to evaluate model fidelity: failure to produce satisfactory results in new datasets could be due to the use of approximations and heuristics or due to model mis-specification which is more fundamental. Furthermore, these sub-optimal approaches are computationally intensive for large text corpora [7, 5].
To overcome the hardness of the topic estimation problem in its full generality, a new approach has emerged to learn the topic model by imposing additional structure on the model parameters [9, 7, 3, 5, 14, 15]. This paper focuses on a key structural property of the topic matrix called topic separability [7, 3, 5, 15] wherein every latent topic contains at least one word that is novel to it, i.e., the word is unique to that topic and is absent from the other topics. This is, in essence, a property of the support of the latent topic matrix . The topic separability property can be motivated by the fact that for many real-world datasets, the empirical topic estimates produced by popular Variational-Bayes and Gibbs Sampling approaches are approximately separable [7, 5]. Moreover, it has recently been shown that the separability property will be approximately satisfied with high probability when the dimension of the vocabulary scales sufficiently faster than the number of topics and is a realization of a Dirichlet prior that is typically used in practice . Therefore, separability is a natural approximation for most high-dimensional topic models.
Our approach exploits the following geometric implication of the key separability structure. If we associate each word in the vocabulary with a row-vector of a suitably normalized empirical word co-occurrence matrix, the set of novel words corresponds to the extreme points of the convex hull formed by the row-vectors of all words. We leverage this geometric insight and develop a provably consistent and efficient algorithm. Informally speaking, we establish the following result:
If the topic matrix is separable and the mixing weights satisfy a minimum information-theoretically necessary technical condition, then our proposed algorithm runs in polynomial time in , and estimates the topic matrix consistently as with held fixed. Moreover, our proposed algorithm can estimate to within an element-wise error with a probability at least if .
The asymptotic setting with held fixed is motivated by text corpora in which the number of words in a single document is small while the number of documents is large. We note that our algorithm can be applied to any family of topic models whose topic mixing weights prior satisfies a minimum information-theoretically necessary technical condition. In contrast, the standard Bayesian approaches such as Variational-Bayes or MCMC need to be hand-designed separately for each specific topic mixing weights prior.
The highlight of our approach is to identify the novel words as extreme points through appropriately defined random projections. Specifically, we project the row-vector of each word in an appropriately normalized word co-occurrence matrix along a few independent and isotropically distributed random directions. The fraction of times that a word attains the maximum value along a random direction is a measure of its degree of robustness as an extreme point. This process of random projections followed by counting the number of times a word is a maximizer can be efficiently computed and is robust to the perturbations induced by sampling noise associated with having only a very small number of words per document . In addition to being computationally efficient, it turns out that this random projections based approach requires the minimum information-theoretically necessary technical conditions on the topic prior for asymptotic consistency, and can be naturally parallelized and distributed. As a consequence, it can provably achieve the efficiency guarantees of a centralized method while requiring insignificant communication between distributed document collections . This is attractive for web-scale topic modeling of large distributed text corpora.
Another advance of this paper is the identification of necessary and sufficient conditions on the mixing weights for consistent separable topic estimation. In previous work we showed that a simplicial condition on the mixing weights is both necessary and sufficient for consistently detecting all the novel words . In this paper we complete the characterization by showing that an affine independence condition on the mixing weights is necessary and sufficient for consistently estimating a separable topic matrix. These conditions are satisfied by practical choices of topic priors such as the Dirichlet distribution . All these necessary conditions are information-theoretic and algorithm-independent, i.e., they are irrespective of the specific statistics of the observations or the algorithms that are used. The provable statistical and computational efficiency guarantees of our proposed algorithm hold true under these necessary and sufficient conditions.
The rest of this paper is organized as follows. We review related work on topic modeling as well as the separability property in various domains in Sec. II. We introduce the separability property on , the simplicial and affine independence conditions on mixing weights, and the extreme point geometry that motivates our approach in Sec. III. We then discuss how the solid angle can be used to identify robust extreme points to deal with a finite number of samples (words per document) in Sec. IV. We describe our overall algorithm and sketch its analysis in Sec. V. We demonstrate the performance of our approach in Sec. VI on various synthetic and real-world examples. Proofs of all results appear in the appendices.
II Related Work
The idea of modeling text documents as mixtures of a few semantic topics was first proposed in  where the mixing weights were assumed to be deterministic. Latent Dirichlet Allocation (LDA) in the seminal work of  extended this to a probabilistic setting by modeling topic mixing weights using Dirichlet priors. This setting has been further extended to include other topic priors such as the log-normal prior in the Correlated Topic Model . LDA models and their derivatives have been successful on a wide range of problems in terms of achieving good empirical performance [1, 13].
The prevailing approaches for estimation and inference problems in topic modeling are based on MAP or ML estimation. However, the computation of posterior distributions conditioned on observations is intractable. Moreover, the MAP estimation objective is non-convex and has been shown to be NP-hard [10, 9]. Therefore various approximation and heuristic strategies have been employed. These approaches fall into two major categories – sampling approaches and optimization approaches. Most sampling approaches are based on Markov Chain Monte Carlo (MCMC) algorithms that seek to generate (approximately) independent samples from a Markov Chain that is carefully designed to ensure that the sample distribution converges to the true posterior [11, 19]. Optimization approaches are typically based on the so-called Variational-Bayes methods. These methods optimize the parameters of a simpler parametric distribution so that it is close to the true posterior in terms of KL divergence [6, 12]. Expectation-Maximization-type algorithms are typically used in these methods. In practice, while both Variational-Bayes and MCMC algorithms have similar performance, Variational-Bayes is typically faster than MCMC [20, 1].
Nonnegative Matrix Factorization (NMF) is an alternative approach for topic estimation. NMF-based methods exploit the fact that both the topic matrix and the mixing weights are nonnegative and attempt to decompose the empirical observation matrix into a product of a nonnegative topic matrix and the matrix of mixing weights by minimizing a cost function of the form [21, 22, 23, 20]
where is some measure of closeness and is a regularization term which enforces desirable properties, e.g., sparsity, on the topic matrix and the mixing weights. The NMF problem, however, is also known to be non-convex and NP-hard in general. Sub-optimal strategies such as alternating minimization, greedy gradient descent, and heuristics are used in practice.
In contrast to the above approaches, a new approach has recently emerged which is based on imposing additional structure on the model parameters [9, 7, 3, 5, 14, 15]. These approaches show that the topic discovery problem lends itself to provably consistent and polynomial-time solutions by making assumptions about the structure of the topic matrix [25, 14]. The algorithm in  uses second order empirical moments and is shown to be asymptotically consistent when the topic matrix has a special sparsity structure. The algorithm in  uses the third order tensor of observations. It is, however, strongly tied to the specific structure of the Dirichlet prior on the mixing weights and requires knowledge of the concentration parameters of the Dirichlet distribution. Furthermore, in practice these approaches are computationally intensive and require some initial coarse dimensionality reduction, gradient descent speedups, and GPU acceleration to process large-scale text corpora like the NYT dataset.
Our work falls into the family of approaches that exploit the separability property of and its geometric implications [9, 7, 3, 5, 15, 26, 27]. An asymptotically consistent polynomial-time topic estimation algorithm was first proposed in . However, this method requires solving linear programs, each with variables and is computationally impractical. Subsequent work improved the computational efficiency [23, 15], but theoretical guarantees of asymptotic consistency (when fixed, and the number of documents ) are unclear. Algorithms in  and  are both practical and provably consistent. Each requires a stronger and slightly different technical condition on the topic mixing weights than . Specifically,  imposes a full-rank condition on the second-order correlation matrix of the mixing weights and proposes a Gram-Schmidt procedure to identify the extreme points. Similarly,  imposes a diagonal-dominance condition on the same second-order correlation matrix and proposes a random projections based approach. These approaches are tied to the specific conditions imposed and they both fail to detect all the novel words and estimate topics when the imposed conditions (which are sufficient but not necessary for consistent novel word detection or topic estimation) fail to hold in some examples . The random projections based algorithm proposed in  is both practical and provably consistent. Furthermore, it requires fewer constraints on the topic mixing weights.
In other related work, a singular value decomposition based approach is proposed for topic estimation. It has also been shown that the standard Variational-Bayes approximation can be asymptotically consistent if the topic matrix is separable. However, the additional constraints proposed essentially boil down to the requirement that each document contain predominantly only one topic. In addition to assuming the existence of such “pure” documents, that work also requires a strict initialization. It is thus unclear how this can be achieved using only the observations.
The separability property has been re-discovered and exploited in the literature across a number of different fields and has found application in several problems. To the best of our knowledge, this concept was first introduced as the Pure Pixel Index assumption in the Hyperspectral Image unmixing problem. This work assumes the existence of pixels in a hyper-spectral image containing predominantly one species. Separability has also been studied in the NMF literature in the context of ensuring the uniqueness of NMF. Subsequent work has led to the development of NMF algorithms that exploit separability [23, 30]. The uniqueness and correctness results in this line of work have primarily focused on the noiseless case. We finally note that separability has also been recently exploited in the problem of learning multiple ranking preferences from pairwise comparisons for personal recommendation systems and information retrieval [31, 32] and has led to provably consistent and efficient estimation algorithms.
III Topic Separability, Necessary and Sufficient Conditions, and the Geometric Intuitions
In this section, we unravel the key ideas that motivate our algorithmic approach by focusing on the ideal case where there is no “sampling-noise”, i.e., each document is infinitely long (). In the next section, we will turn to the finite case. We recall that and denote the topic matrix and the empirical word counts/frequency matrix respectively. Also, , and denote, respectively, the number of documents, the vocabulary size, and the number of topics. For convenience, we group the document-specific mixing weights, the ’s, into a weight matrix and the document-specific distributions, the ’s, into a document distribution matrix . The generative procedure that describes a topic model then implies that . In the ideal case considered in this section (), the empirical word frequency matrix . Notation: A vector without specification will denote a column-vector, the all-ones column vector of suitable size, the -th column vector and the -th row vector of matrix , and a suitably row-normalized version (described later) of a nonnegative matrix . Also, .
III-A Key Structural Property: Topic Separability
We first introduce separability as a key structural property of a topic matrix . Formally,
(Separability) A topic matrix is separable if for each topic there is some word that has positive probability under that topic and zero probability under every other topic.
Topic separability implies that each topic contains word(s) which appear only in that topic. We refer to these words as the novel words of the topics.
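The definition can be checked mechanically on a given topic matrix by inspecting its support. The sketch below is illustrative only; the function names `novel_words` and `is_separable` and the tolerance `tol` are hypothetical and not part of the paper.

```python
import numpy as np

def novel_words(beta, tol=0.0):
    """Map each topic index to the list of words that are novel to it.

    A word is novel to a topic if its row in beta has exactly one
    entry above tol, located in that topic's column.
    """
    support = beta > tol
    result = {k: [] for k in range(beta.shape[1])}
    for w in range(beta.shape[0]):
        topics = np.flatnonzero(support[w])
        if topics.size == 1:
            result[int(topics[0])].append(w)
    return result

def is_separable(beta, tol=0.0):
    """True if every topic has at least one novel word."""
    return all(len(words) > 0 for words in novel_words(beta, tol).values())

# Words 0 and 1 are novel to topic 0, word 2 is novel to topic 1, word 3 is shared.
beta_separable = np.array([[0.3, 0.0],
                           [0.2, 0.0],
                           [0.0, 0.6],
                           [0.5, 0.4]])
# Topic 1 places all of its mass on a word shared with topic 0: not separable.
beta_not_separable = np.array([[0.6, 1.0],
                               [0.4, 0.0]])
```

Since separability is a property of the support only, the check never looks at the magnitudes of the nonzero entries.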
Figure 1 shows an example of a separable with topics. Words and are novel to topic , words and to topic , and word to topic . Other words that appear in multiple topics are called non-novel words (e.g., word ). Identifying the novel words for distinct topics is the key step of our proposed approach.
We note that separability has been empirically observed to be approximately satisfied by topic estimates produced by Variational-Bayes and MCMC based algorithms [7, 5, 26]. More fundamentally, in very recent work , it has been shown that topic separability is an inevitable consequence of having a relatively small number of topics in a very large vocabulary (high-dimensionality). In particular, when the columns (topics) of are independently sampled from a Dirichlet distribution (on a -dimensional probability simplex), the resulting topic matrix will be (approximately) separable with probability tending to as scales to infinity sufficiently faster than . A Dirichlet prior on is widely-used in smoothed settings of topic modeling .
As we will discuss next in Sec. III-C, the topic separability property combined with additional conditions on the second-order statistics of the mixing weights leads to an intuitively appealing geometric property that can be exploited to develop a provably consistent and efficient topic estimation algorithm.
III-B Conditions on the Topic Mixing Weights
Topic separability alone does not guarantee that there will be a unique that is consistent with all the observations . This is illustrated in Fig. 2 . Therefore, in an effort to develop provably consistent topic estimation algorithms, a number of different conditions have been imposed on the topic mixing weights in the literature [9, 7, 3, 5, 15]. Complementing the work in  which identifies necessary and sufficient conditions for consistent detection of novel words, in this paper we identify necessary and sufficient conditions for consistent estimation of a separable topic matrix. Our necessity results are information-theoretic and algorithm-independent in nature, meaning that they are independent of any specific statistics of the observations and the algorithms used. The novel words and the topics can only be identified up to a permutation and this is accounted for in our results.
Let and be the expectation vector and the correlation matrix of the weight prior . Without loss of generality, we can assume that the elements of are strictly positive since otherwise some topic(s) will not appear in the corpus. A key quantity is which may be viewed as a “normalized” second-moment matrix of the weight vector. The following conditions are central to our results.
(Simplicial Condition) A matrix is (row-wise) -simplicial if any row-vector of is at a Euclidean distance of at least from the convex hull of the remaining row-vectors. A topic model is -simplicial if its normalized second-moment is -simplicial.
(Affine-Independence) A matrix is (row-wise) -affine-independent if , where is the -th row of and the minimum is over all such that and . A topic model is -affine-independent if its normalized second-moment is -affine-independent.
Here, and are called the simplicial and affine-independence constants respectively. They are condition numbers which measure the degree to which the conditions that they are respectively associated with hold. The larger that these condition numbers are, the easier it is to estimate the topic matrix. Going forward, we will say that a matrix is simplicial (resp. affine independent) if it is -simplicial (resp. -affine-independent) for some (resp. ). The simplicial condition was first proposed in  and then further investigated in . This paper is the first to identify affine-independence as both necessary and sufficient for consistent separable topic estimation. Before we discuss their geometric implications, we point out that affine-independence is stronger than the simplicial condition:
If a matrix is -affine-independent, then it is at least -simplicial. The reverse implication is false in general.
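To make the affine-independence constant concrete, the following sketch computes it numerically for a small matrix. It assumes that both norms in the definition are Euclidean (the simplicial condition above is stated with Euclidean distance; the norm for affine-independence is our assumption), in which case the minimum equals the smallest singular value of the transposed matrix restricted to the subspace of coefficient vectors that sum to zero. The function name is hypothetical.

```python
import numpy as np

def affine_independence_constant(mat):
    """min ||sum_k lam_k * row_k|| / ||lam|| over lam != 0 with sum(lam) == 0.

    Assumes Euclidean norms and a matrix with at least 2 rows and at
    least (rows - 1) columns, e.g. a square second-moment matrix.
    """
    k = mat.shape[0]
    # Orthonormal basis of {lam : sum(lam) = 0}: the right singular vectors
    # of the all-ones row, excluding the first one (which spans the ones direction).
    _, _, vt = np.linalg.svd(np.ones((1, k)))
    basis = vt[1:].T                                  # shape (k, k - 1)
    # With lam = basis @ c we have ||lam|| = ||c||, so the ratio is minimized
    # by the smallest singular value of mat.T @ basis.
    return np.linalg.svd(mat.T @ basis, compute_uv=False).min()

gamma_identity = affine_independence_constant(np.eye(3))
collinear_rows = np.array([[0.0, 0.0, 0.0],
                           [1.0, 0.0, 0.0],
                           [2.0, 0.0, 0.0]])
gamma_collinear = affine_independence_constant(collinear_rows)
```

For the identity matrix the constant is 1, while three collinear rows give a constant of (numerically) zero, i.e., they are not affinely independent.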
The Simplicial Condition is both Necessary and Sufficient for Novel Word Detection: We first focus on detecting all the novel words of the distinct topics. For this task, the simplicial condition is an algorithm-independent, information-theoretic necessary condition. Formally,
(Simplicial Condition is Necessary for Novel Word Detection [4, Lemma 1]) Let be separable and . If there exists an algorithm that can consistently identify all novel words of all topics from , then is simplicial.
The key insight behind this result is that when is non-simplicial, we can construct two distinct separable topic matrices with different sets of novel words which induce the same distribution on the empirical observations . Geometrically, the simplicial condition guarantees that the rows of will be extreme points of the convex hull that they themselves form. Therefore, if is not simplicial, there will exist at least one redundant topic which is just a convex combination of the other topics.
It turns out that being simplicial is also sufficient for consistent novel word detection. This is a direct consequence of the consistency guarantees of our approach as outlined in Theorem 3.
Affine-Independence is Necessary and Sufficient for Separable Topic Estimation: We now focus on estimating a separable topic matrix , which is a stronger requirement than detecting novel words. It naturally requires conditions that are stronger than the simplicial condition. Affine-independence turns out to be an algorithm-independent, information-theoretic necessary condition. Formally,
(Affine-Independence is Necessary for Separable Topic Estimation) Let be separable with . If there exists an algorithm that can consistently estimate from , then its normalized second-moment is affine-independent.
Similar to Lemma 1, if is not affine-independent, we can construct two distinct and separable topic matrices that induce the same distribution on the observation which makes consistent topic estimation impossible. Geometrically, every point in a convex set can be decomposed uniquely as a convex combination of its extreme points, if, and only if, the extreme points are affine-independent. Hence, if is not affine-independent, a non-novel word can be assigned to different subsets of topics.
The sufficiency of the affine-independence condition in separable topic estimation is again a direct consequence of the consistency guarantees of our approach as in Theorems 3 and 4. We note that since affine-independence implies the simplicial condition (Proposition 1), affine-independence is sufficient for novel word detection as well.
In one line of prior work, the second-order correlation matrix of the mixing weights is assumed to have full rank (with minimum eigenvalue). In , is assumed to be diagonal-dominant, i.e., . They are both sufficient conditions for detecting all the novel words of all distinct topics. The constants and are condition numbers which measure the degree to which the full-rank and diagonal-dominance conditions hold respectively. They are counterparts of and and like them, the larger they are, the easier it is to consistently detect the novel words and estimate . The relationships between these conditions are summarized in Proposition 2 and illustrated in Fig. 3.
Let be the normalized second-moment of the topic prior. Then,
1) If the normalized second-moment is full rank with minimum eigenvalue , then it is at least -affine-independent, and hence at least -simplicial.
2) If the normalized second-moment is -diagonal-dominant, then it is at least -simplicial.
3) The normalized second-moment being diagonal-dominant neither implies nor is implied by it being affine-independent (or full-rank).
We note that in our earlier work , the provable guarantees for estimating the separable topic matrix require to have full rank. The analysis in this paper provably extends the guarantees to the affine-independence condition.
III-C Geometric Implications and Random Projections Based Algorithm
We now demonstrate the geometric implications of topic separability combined with the simplicial/ affine-independence condition on the topic mixing weights. To highlight the key ideas we focus on the ideal case where . Then, the empirical document word-frequency matrix .
Novel Words are Extreme Points: To expose the underlying geometry, we normalize the rows of and to obtain row-stochastic matrices and . Then since , we have where is a row-normalized “topic matrix” which is both row-stochastic and separable with the same sets of novel words as .
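The row-normalization identity can be verified numerically in the noiseless case. In the sketch below, the row-normalized topic matrix is obtained by scaling each topic column by the corresponding row sum of the weight matrix and then normalizing the rows; this construction is our own derivation of one matrix satisfying the stated properties, and all variable names are hypothetical.

```python
import numpy as np

def row_normalize(mat):
    """Divide each row by its sum so that every row sums to one."""
    return mat / mat.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
# Separable topic matrix: words 0, 1, 2 are novel to topics 0, 1, 2; word 3 is shared.
beta = np.array([[0.4, 0.0, 0.0],
                 [0.0, 0.5, 0.0],
                 [0.0, 0.0, 0.6],
                 [0.6, 0.5, 0.4]])
theta = rng.dirichlet([0.3, 0.3, 0.3], size=50).T    # topics x documents
A = beta @ theta                                     # noiseless word distributions

A_bar = row_normalize(A)
theta_bar = row_normalize(theta)
# Scale column k of beta by the k-th row sum of theta, then row-normalize.
B_bar = row_normalize(beta * theta.sum(axis=1))
```

Here `A_bar` equals `B_bar @ theta_bar`, and the rows of `B_bar` for the novel words are indicator vectors, so the corresponding rows of `A_bar` coincide with rows of `theta_bar`.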
Now consider the row vectors of and . First, it can be shown that if is simplicial (cf. Condition 1) then, with high probability, no row of will be in the convex hull of the others (see Appendix -D). Next, the separability property ensures that if is a novel word of topic , then and so that . Revisiting the example in Fig. 1, the rows of which correspond to novel words, e.g., words through , are all row-vectors of and together form a convex hull of extreme points. For example, and . If, however, is a non-novel word, then lives inside the convex hull of the rows of . In Fig. 1, row which corresponds to non-novel word , is inside the convex hull of . In summary, the novel words can be detected as extreme points of all the row-vectors of . Also, multiple novel words of the same topic correspond to the same extreme point (e.g., ). Formally,
Let be simplicial and be separable. Then, with probability at least , the -th row of is an extreme point of the convex hull spanned by all the rows of if, and only if, word is novel. Here the constant and . The model parameters are defined as follows. is the minimum element of . is the maximum singular-value of .
To see how identifying novel words can help us estimate , recall that the row-vectors of corresponding to novel words coincide with the rows of . Thus is known once one novel word for each topic is known. Also, for all words , . Thus, if we can uniquely decompose as a convex combination of the extreme points, then the coefficients of the decomposition will give us the -th row of . A unique decomposition exists with high probability when the normalized second-moment is affine-independent and can be found by solving a constrained linear regression problem. This gives us the row-normalized topic matrix. Finally, noting that , the topic matrix can be recovered by suitably renormalizing rows and then columns of . To sum up,
Let and one novel word per distinct topic be given. If is affine-independent, then, with probability at least , can be recovered uniquely via constrained linear regression. Here the constant and . The model parameters are defined as follows. is the minimum element of . is the maximum singular-value of .
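In the noiseless case, the decomposition of a row into a convex combination of the extreme rows reduces to a linear system. The sketch below is a simplification and not the constrained regression used in the paper: it enforces only the sum-to-one constraint via least squares and does not enforce nonnegativity, which holds automatically when the row lies exactly in the convex hull of affinely independent extreme rows. The function name and the toy data are hypothetical.

```python
import numpy as np

def convex_weights(extreme_rows, target_row):
    """Solve for weights b with b @ extreme_rows = target_row and sum(b) = 1.

    Simplified sketch: nonnegativity is not enforced. The solution is unique
    when the extreme rows are affinely independent.
    """
    k = extreme_rows.shape[0]
    # Stack the reconstruction equations with the sum-to-one equation.
    system = np.vstack([extreme_rows.T, np.ones((1, k))])
    rhs = np.append(target_row, 1.0)
    weights, *_ = np.linalg.lstsq(system, rhs, rcond=None)
    return weights

extreme_rows = np.array([[0.7, 0.2, 0.1],
                         [0.1, 0.8, 0.1],
                         [0.2, 0.2, 0.6]])
true_weights = np.array([0.2, 0.3, 0.5])
recovered = convex_weights(extreme_rows, true_weights @ extreme_rows)
```

With noisy rows, a solver that also enforces nonnegativity would be needed; that step is omitted here.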
Lemmas 3 and 4 together provide a geometric approach for learning from (equivalently ): (1) Find extreme points of rows of . (2) Cluster the rows of that correspond to the same extreme point into the same group. (3) Express the remaining rows of as convex combinations of the distinct extreme points. (4) Renormalize to obtain .
Detecting Extreme Points using Random Projections: A key contribution of our approach is an efficient random projections based algorithm to detect novel words as extreme points. The idea is illustrated in Fig. 1: if we project every point of a convex body onto an isotropically distributed random direction , the maximum (or minimum) projection value must correspond to one of the extreme points with probability 1. On the other hand, the non-novel words will not have the maximum projection value along any random direction. Therefore, by repeatedly projecting all the points onto a few isotropically distributed random directions, we can detect all the extreme points with very high probability as the number of random directions increases. An explicit bound on the number of projections needed appears in Theorem 3.
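The random projections idea can be illustrated on a toy point set. The sketch below draws isotropic Gaussian directions, projects all points, and records how often each point attains the maximum; it illustrates the principle only and omits the normalization and thresholding of the full algorithm. The function name and the example points are hypothetical.

```python
import numpy as np

def extreme_point_frequencies(points, num_projections=200, seed=0):
    """Fraction of random directions along which each row attains the maximum."""
    rng = np.random.default_rng(seed)
    # Standard Gaussian vectors are isotropically distributed in direction.
    directions = rng.standard_normal((points.shape[1], num_projections))
    winners = np.argmax(points @ directions, axis=0)   # maximizer per direction
    return np.bincount(winners, minlength=points.shape[0]) / num_projections

# Three extreme points (simplex vertices) and one point strictly inside their hull.
points = np.vstack([np.eye(3), np.full((1, 3), 1.0 / 3.0)])
freq = extreme_point_frequencies(points)
```

The interior point (last row) never attains the maximum, while each vertex does so for a positive fraction of the directions; this fraction is the robustness measure referred to in the text.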
Finite in Practice: The geometric intuition discussed above was based on the row-vectors of . When , the matrix of row-normalized empirical word-frequencies of all documents. If is finite but very large, can be well-approximated by thanks to the law of large numbers. However, in real-world text corpora, (e.g., while in the NYT dataset). Therefore, the row-vectors of are significantly perturbed away from the ideal rows of as illustrated in Fig. 1. We discuss the effect of small and how we address the accompanying issues next.
IV Topic Geometry with Finite Samples: Word Co-occurrence Matrix Representation, Solid Angle, and Random Projections based approach
The extreme point geometry sketched in Sec. III-C is perturbed when is small as highlighted in Fig. 1. Specifically, the rows of the empirical word-frequency matrix deviate from the rows of . This creates several problems:
points in the convex hull corresponding to non-novel words may also become “outlier” extreme points (e.g., in Fig. 1); some extreme points that correspond to novel words may no longer be extreme (e.g., in Fig. 1); multiple novel words corresponding to the same extreme point may become multiple distinct extreme points (e.g., and in Fig. 1). Unfortunately, these issues do not vanish as increases with fixed – a regime which captures the characteristics of typical benchmark datasets – because the dimensionality of the rows (equal to ) also increases. There is no “averaging” effect to smooth out the sampling noise.
Our solution is to seek a new representation, a statistic of , which can not only smooth out the sampling noise of individual documents, but also preserve the same extreme point geometry induced by the separability and affine independence conditions. In addition, we also develop an extreme point robustness measure that naturally arises within our random projections based framework. This robustness measure can be used to detect and exclude the “outlier” extreme points.
IV-A Normalized Word Co-occurrence Matrix Representation
We construct a suitably normalized word co-occurrence matrix from as our new representation. The co-occurrence matrix converges almost surely to an ideal statistic as for any fixed . Simultaneously, in the asymptotic limit, the original novel words continue to correspond to extreme points in the new representation and overall extreme point geometry is preserved.
The new representation is (conceptually) constructed as follows. First randomly divide all the words in each document into two equal-sized independent halves and obtain two empirical word-frequency matrices and each containing words. Then normalize their rows like in Sec. III-C to obtain and which are row-stochastic. The empirical word co-occurrence matrix of size is then given by
We note that in our random projection based approach, is not explicitly constructed by multiplying and . Instead, we keep and and exploit their sparsity properties to reduce the computational complexity of all subsequent processing.
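The construction and the implicit projection can be sketched as follows. Two points are assumptions on our part, since the defining formula is not reproduced in the text above: the co-occurrence matrix is taken to be the product of one row-normalized half with the transpose of the other, and it is scaled by the number of documents. All function and variable names are hypothetical, and dense arrays are used for brevity where a practical implementation would use sparse matrices.

```python
import numpy as np

def split_halves(counts, rng):
    """Randomly split the words of each document (column) into two halves."""
    vocab_size, num_docs = counts.shape
    first = np.zeros_like(counts)
    for m in range(num_docs):
        # Expand the counts into individual word tokens and shuffle them.
        tokens = rng.permutation(np.repeat(np.arange(vocab_size), counts[:, m]))
        first[:, m] = np.bincount(tokens[: tokens.size // 2], minlength=vocab_size)
    return first, counts - first

def row_normalize(mat):
    """Row-normalize integer counts; all-zero rows are left as zeros."""
    return mat / np.maximum(mat.sum(axis=1, keepdims=True), 1)

rng = np.random.default_rng(0)
counts = rng.integers(0, 4, size=(6, 40))      # toy vocabulary x documents counts
first, second = split_halves(counts, rng)
X1_bar, X2_bar = row_normalize(first), row_normalize(second)
num_docs = counts.shape[1]

direction = rng.standard_normal(counts.shape[0])
# Explicit route: form the (vocabulary x vocabulary) co-occurrence matrix first.
explicit = (num_docs * (X1_bar @ X2_bar.T)) @ direction
# Implicit route: two matrix-vector products, never forming the full matrix.
implicit = num_docs * (X1_bar @ (X2_bar.T @ direction))
```

Both routes give the same projection values, but the implicit one only needs matrix-vector products with the two (sparse) halves.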
Asymptotic Consistency: The first nice property of the word co-occurrence representation is its asymptotic consistency when is fixed. As the number of documents , the empirical converges, almost surely, to an ideal word co-occurrence matrix of size . Formally,
Here is the same normalized second-moment of the topic priors as defined in Sec. III and is a row-normalized version of . We make note of the abuse of notation for which was defined in Sec. III-C. It can be shown that the defined in Lemma 5 is the limit of the one defined in Sec. III-C as . The convergence result in Lemma 5 shows that the word co-occurrence representation can be consistently estimated by as and the deviation vanishes exponentially in which is large in typical benchmark datasets.
Novel Words are Extreme Points: Another reason for using this word co-occurrence representation is that it preserves the extreme point geometry. Consider the ideal word co-occurrence matrix . It is straightforward to show that if is separable and is simplicial then is also simplicial. Using these facts it is possible to establish the following counterpart of Lemma 3 for :
(Novel Words are Extreme Points [5, Lemma 1]) Let be simplicial and be separable. Then, a word is novel if, and only if, the -th row of is an extreme point of the convex hull spanned by all the rows of .
In other words, the novel words correspond to the extreme points of all the row-vectors of the ideal word co-occurrence matrix . Consider the example in Fig. 4 which is based on the same topic matrix as in Fig. 1. Here, , and are distinct extreme points of all row-vectors of and , which corresponds to a non-novel word, is inside the convex hull.
Once the novel words are detected as extreme points, we can follow the same procedure as in Lemma 4 and express each row of as a unique convex combination of the extreme rows of or equivalently the rows of . The weights of the convex combination are the ’s. We can then apply the same row and column renormalization to obtain . The following result is the counterpart of Lemma 4 for :
Let and one novel word for each distinct topic be given. If is affine-independent, then can be recovered uniquely via constrained linear regression.
One can follow the same steps as in the proof of Lemma 4. The only additional step is to check that is affine-independent if is affine-independent.
We note that the finite sampling noise perturbation is still not but vanishes as (in contrast to the representation in Sec. III-C). However, there is still a possibility of observing “outlier” extreme points if a non-novel word lies on the facet of the convex hull of the rows of . We next introduce an extreme point robustness measure based on a certain solid angle that naturally arises in our random projections based approach, and discuss how it can be used to detect and distinguish between “true” novel words and such “outlier” extreme points.
IV-B Solid Angle Extreme Point Robustness Measure
To handle the impact of a small but nonzero perturbation , we develop an extreme point “robustness” measure. This is necessary not only for applying our approach to real-world data but also for establishing finite sample complexity bounds. Intuitively, a robustness measure should be able to distinguish between the “true” extreme points (row vectors that are novel words) and the “outlier” extreme points (row vectors of non-novel words that become extreme points due to the nonzero perturbation). Towards this goal, we leverage a key geometric quantity, namely, the Normalized Solid Angle subtended by the convex hull of the rows of at an extreme point. To visualize this quantity, we revisit our running example in Fig. 4 and indicate the solid angles attached to each extreme point by the shaded regions. It turns out that this geometric quantity naturally arises in the context of the random projections that were discussed earlier. To see this connection, in Fig. 4 observe that the shaded region attached to any extreme point coincides precisely with the set of directions along which its projection is larger (taking sign into account) than that of any other point (whether extreme or not). For example, in Fig. 4 the projection of along is larger than that of any other point. Thus, the solid angle attached to a point (whether extreme or not) can be formally defined as the set of directions . This set is nonempty only for extreme points. The solid angle defined above is a set. To derive a scalar robustness measure from this set and tie it to the idea of random projections, we adopt a statistical perspective and define the normalized solid angle of a point as the probability that the point will have the maximum projection value along an isotropically distributed random direction. Concretely, for the -th word (row vector), the normalized solid angle is defined as
where is drawn from an isotropic distribution in such as the spherical Gaussian. The condition in Eq. (3) is introduced to exclude the multiple novel words of the same topic that correspond to the same extreme point. For instance, in Fig. 4 , Hence, for , is excluded. To make it practical to handle finite sample estimation noise we replace the condition by the condition for some suitably defined .
As illustrated in Fig. 4, the solid angles of all the extreme points are strictly positive given that is -simplicial. On the other hand, for that is non-novel, the corresponding solid angle is zero by definition. Hence the extreme point geometry in Lemma 6 can be re-expressed in terms of solid angles as follows:
(Novel Words have Positive Solid Angles) Let be simplicial and be separable. Then, word is a novel word if, and only if, .
We denote the smallest solid angle among the distinct extreme points by . This is a robust condition number of the convex hull formed by the rows of and is related to the simplicial constant of .
In a real-world dataset we have access to only an empirical estimate of the ideal word co-occurrence matrix . If we replace with , then the resulting empirical solid angle estimate will be very close to the ideal if is close enough to . Then, the solid angles of “outlier” extreme points will be close to while they will be bounded away from zero for the “true” extreme points. One can then hope to correctly identify all extreme points by rank-ordering all empirical solid angle estimates and selecting the distinct row-vectors that have the largest solid angles. This forms the basis of our proposed algorithm. The problem now boils down to efficiently estimating the solid angles and establishing the asymptotic convergence of the estimates as . We next discuss how random projections can be used to achieve these goals.
IV-C Efficient Solid Angle Estimation via Random Projections
and then propose to estimate it by
where are iid directions drawn from an isotropic distribution in . Algorithmically, by Eq. (5), we approximate the solid angle at the -th word (row-vector) by first projecting all the row-vectors onto iid isotropic random directions and then calculating the fraction of times each row-vector achieves the maximum projection value. It turns out that the condition is equivalent to in terms of its ability to exclude multiple novel words from the same topic and is adopted for its simplicity. (Footnote 2: We abuse the symbol by using it to indicate different thresholds in these conditions.)
This procedure of taking random projections followed by calculating the number of times a word is a maximizer via Eq. (5) provides a consistent estimate of the solid angle in Eq. (3) as and the number of projections increases. The high-level idea is simple: as increases, the empirical average in Eq. (5) converges to the corresponding expectation. Simultaneously, as increases, . Overall, the approximation proposed in Eq. (5) using random projections converges to .
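The counting procedure can be sketched as follows. This is a simplified illustration, not the authors' Eq. (5): the function name `estimate_solid_angles` is hypothetical, the co-occurrence matrix is passed in explicitly as a small dense array, and the condition that excludes near-identical rows (multiple novel words of the same topic) is omitted.

```python
import numpy as np

def estimate_solid_angles(E, num_projections, rng):
    """Fraction of random directions along which each row of E attains the maximum projection."""
    counts = np.zeros(E.shape[0])
    for _ in range(num_projections):
        d = rng.standard_normal(E.shape[1])   # isotropic (spherical Gaussian) direction
        counts[np.argmax(E @ d)] += 1         # the row with the largest projection wins this round
    return counts / num_projections

# Toy example: three extreme rows and one row at the centroid of their convex hull
E = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1 / 3, 1 / 3, 1 / 3]])
q_hat = estimate_solid_angles(E, 2000, np.random.default_rng(0))
```

In this toy case the projection of the centroid is the average of the other three projections, so it can never be the strict maximizer and its estimate is exactly zero, while each extreme row receives roughly one third of the wins.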
This random projections based approach is also computationally efficient for the following reasons. First, it enables us to avoid the explicit construction of the dimensional matrix : Recall that each column of and has no more than nonzero entries. Hence and are both sparse. Since , the projection can be calculated using two sparse matrix-vector multiplications. Second, it turns out that the number of projections needed to guarantee consistency is small. In fact in Theorem 3 we provide a sufficient upper bound for which is a polynomial function of , and other model parameters, where is the probability that the algorithm fails to detect all the distinct novel words.
Parallelization, Distributed and Online Settings: Another advantage of the proposed random projections based approach is that it can be parallelized and is naturally amenable to online or distributed settings. This is based on the following observation that each projection has an additive structure:
The projections can also be computed independently. Therefore,
In a distributed setting in which the documents are stored on distributed servers, we can first share the same random directions across servers and then aggregate the projection values. The communication cost is only the “partial” projection values and is therefore insignificant and does not scale as the number of observations increases.
In an online setting in which the documents are streamed in an online fashion, we only need to keep all the projection values and update the projection values (hence the empirical solid angle estimates) when new documents arrive.
The additive and independent structure guarantees that the statistical efficiency of these variations is the same as that of the centralized “batch” implementation. For the rest of this paper, we only focus on the centralized version.
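The additive structure can be checked numerically with a small sketch. It assumes that the row-normalization constants have already been computed (in a distributed setting they would presumably have to be aggregated across servers first), and it uses random dense stand-ins for the two row-normalized half-document matrices; the shard layout and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
W, M, P = 5, 12, 3                      # words, documents, projections (toy sizes)

# Stand-ins for the two row-normalized half-document matrices
X1_bar = rng.random((W, M))
X1_bar /= X1_bar.sum(axis=1, keepdims=True)
X2_bar = rng.random((W, M))
X2_bar /= X2_bar.sum(axis=1, keepdims=True)

D = rng.standard_normal((W, P))         # random directions shared by all servers

# Centralized ("batch") projection values, one column per direction
batch = X1_bar @ (X2_bar.T @ D)

# Each "server" holds a subset of documents (columns) and computes a partial result
shards = np.array_split(np.arange(M), 3)
partials = [X1_bar[:, s] @ (X2_bar[:, s].T @ D) for s in shards]

# Aggregation only requires summing W x P arrays, independent of the number of documents
aggregated = sum(partials)
```

The same identity covers the online case: when a new batch of documents arrives, its partial projection is simply added to the running total.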
Outline of Overall Approach: Our overall approach can be summarized as follows: (1) Estimate the empirical solid angles using iid isotropic random directions as in Eq. (5). (2) Select the words with distinct word co-occurrence patterns (rows) that have the largest empirical solid angles. (3) Estimate the topic matrix using constrained linear regression as in Lemma 4. We will discuss the details of our overall approach in the next section and establish guarantees for its computational and statistical efficiency.
V Algorithm and Analysis
Algorithm 1 describes the main steps of our overall random projections based algorithm, which we call RP. The two main steps, novel word detection and topic matrix estimation, are outlined in Algorithms 2 and 3 respectively. Algorithm 2 outlines the random projection and rank-ordering steps. Algorithm 3 describes the constrained linear regression and the renormalization steps in a combined way.
Computational Efficiency: We first summarize the computational efficiency of Algorithm 1:
Let the number of novel words for each topic be a constant relative to . Then, the running time of Algorithm 1 is .
This efficiency is achieved by exploiting the sparsity of and the property that there are only a small number of novel words in a typical vocabulary. A detailed analysis of the computational complexity is presented in the appendix. Here we point out that in order to upper bound the computation time of the linear regression in Algorithm 3 we used for matrix inversions, one for each of the words in the vocabulary. In practice, a gradient descent implementation, which is much more efficient, can be used for the constrained linear regression. We also note that these optimization problems are decoupled given the set of detected novel words. Therefore, they can be parallelized in a straightforward manner.
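One possible gradient-based variant of the constrained regression is projected gradient descent over the probability simplex, sketched below. This is an illustration under stated assumptions rather than the paper's Algorithm 3: the function names, the fixed iteration count, and the toy extreme rows are all hypothetical, and the objective is the plain squared error between one row and a convex combination of the detected extreme rows.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based method)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

def simplex_regression(target, extreme_rows, steps=2000):
    """Minimize 0.5 * ||target - b @ extreme_rows||^2 over b in the simplex."""
    K = extreme_rows.shape[0]
    G = extreme_rows @ extreme_rows.T            # K x K Gram matrix
    c = extreme_rows @ target
    step = 1.0 / np.linalg.eigvalsh(G)[-1]       # 1 / (Lipschitz constant of the gradient)
    b = np.full(K, 1.0 / K)                      # start from the uniform weights
    for _ in range(steps):
        b = project_to_simplex(b - step * (G @ b - c))
    return b

# Toy check: a row that is an exact convex combination of three extreme rows
extreme_rows = np.array([[0.7, 0.2, 0.1],
                         [0.1, 0.8, 0.1],
                         [0.2, 0.2, 0.6]])
true_weights = np.array([0.2, 0.3, 0.5])
target = true_weights @ extreme_rows
weights = simplex_regression(target, extreme_rows)
```

Because each word's regression depends only on its own row and the shared extreme rows, calls to `simplex_regression` for different words are independent and can run in parallel.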
Asymptotic Consistency and Statistical Efficiency: We now summarize the asymptotic consistency and sample complexity bounds for Algorithm 1. The analysis is a combination of the consistency of the novel word detection step (Algorithm 2) and the topic estimation step (Algorithm 3). We state the results for both of these steps. First, for detecting all the novel words of the distinct topics, we have the following result:
Let topic matrix be separable and be -simplicial. If the projection directions are iid sampled from any isotropic distribution, then Algorithm 2 can identify all the novel words of the distinct topics as . Furthermore, , if
then Algorithm 2 fails with probability at most . The model parameters are defined as follows. where , , is the maximum eigenvalue of , , and is the set of non-novel words. Finally, is the minimum solid angle of the extreme points of the convex hull of the rows of .
The detailed proof is presented in the appendix. The results in Eq. (6) provide a sufficient finite sample complexity bound for novel word detection. The bound is polynomial with respect to , and other model parameters. The number of projections that impacts the computational complexity scales as in this sufficient bound where can be upper bounded by . In practice, we have found that setting is a good choice.
We note that the result in Theorem 3 only requires the simplicial condition which is the minimum condition required for consistent novel word detection (Lemma 1). This theorem holds true if the topic prior satisfies stronger conditions such as affine-independence. We also point out that our proof in this paper holds for any isotropic distribution on the random projection directions . The previous result in , however, only applies to some specific isotropic distributions such as the spherical Gaussian or the uniform distribution in a unit ball. In practice, we use the spherical Gaussian since sampling from such a distribution is simple and requires only time for generating each random direction.
Next, given the successful detection of the set of novel words for all topics, we have the following result for the accurate estimation of the separable topic matrix :
Let topic matrix be separable and be -affine-independent. Given the successful detection of novel words for all distinct topics, the output of Algorithm 3 element-wise (up to a column permutation). Specifically, if
then , will be close to with probability at least , for any . is the same as in Theorem 3. is the minimum value in .
VI Experimental Results
In this section, we present experimental results on both synthetic and real world datasets. We report different performance measures that have been commonly used in the topic modeling literature. When the ground truth is available (Sec. VI-A), we use the reconstruction error between the ground truth topics and the estimates after proper topic alignment. For the real-world text corpus in Sec. VI-B, we report the held-out probability, which is a standard measure used in the topic modeling literature. We also qualitatively (semantically) compare the topics extracted by the different approaches using the top probable words for each topic.
VI-A Semi-synthetic text corpus
In order to validate our proposed algorithm, we generate “semi-synthetic” text corpora by sampling from a synthetic, yet realistic, ground truth topic model. To ensure that the semi-synthetic data is similar to real-world data, in terms of dimensionality, sparsity, and other characteristics, we use the following generative procedure adapted from [7, 5].
We first train an LDA model (with ) on a real-world dataset using a standard Gibbs Sampling method with default parameters (as described in [11, 33]) to obtain a topic matrix of size . The real-world dataset that we use to generate our synthetic data is derived from a New York Times (NYT) articles dataset . The original vocabulary is first pruned based on document frequencies. Specifically, as is standard practice, only words that appear in more than documents are retained. Thereafter, again as per standard practice, the words in the so-called stop-word list are deleted as recommended in . After these steps, , , and the average document length . We then generate semi-synthetic datasets, for various values of , by fixing and using and a Dirichlet topic prior. As suggested in  and used in [7, 5], we use symmetric hyper-parameters () for the Dirichlet topic prior.
The topic matrix may not be separable. To enforce separability, we create a new separable dimensional topic matrix by inserting synthetic novel words (one per topic) having suitable probabilities in each topic. Specifically, is constructed by transforming as follows. First, for each synthetic novel word in , the value of the sole nonzero entry in its row is set to the probability of the most probable word in the topic (column) of for which it is a novel word. Then the resulting dimensional nonnegative matrix is renormalized column-wise to make it column-stochastic. Finally, we generate semi-synthetic datasets, for various values of , by fixing and using and the same symmetric Dirichlet topic prior used for .
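The construction described above can be sketched in a few lines. The function name `add_synthetic_novel_words` and the toy input are hypothetical; the steps follow the description: append one new row per topic whose sole nonzero entry equals the largest probability in that topic's column, then renormalize every column.

```python
import numpy as np

def add_synthetic_novel_words(beta0):
    """Append one synthetic novel word per topic and renormalize the columns."""
    # Row k of the appended block is nonzero only in column k, where it equals
    # the probability of the most probable word of topic k.
    novel_rows = np.diag(beta0.max(axis=0))
    beta = np.vstack([beta0, novel_rows])
    return beta / beta.sum(axis=0, keepdims=True)   # make it column-stochastic again

# Toy column-stochastic topic matrix with 8 words and 3 topics (illustrative sizes)
rng = np.random.default_rng(2)
beta0 = rng.dirichlet(np.ones(8), size=3).T
beta = add_synthetic_novel_words(beta0)
```

After renormalization, each appended word still has nonzero probability in exactly one topic, so the resulting matrix is separable by construction.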
We use the name Semi-Syn to refer to datasets that are generated using and the name Semi-SynNovel for datasets generated using .
In our proposed random projections based algorithm, which we call RP, we set , , and . We compare RP against the provably efficient algorithm RecoverL2 in  and the standard Gibbs Sampling based LDA algorithm (denoted by Gibbs) in [11, 33]. In order to measure the performance of different algorithms in our experiments based on semi-synthetic data, we compute the norm of the reconstruction error between and . Since all column permutations of a given topic matrix correspond to the same topic model (for a corresponding permutation of the topic mixing weights), we use a bipartite graph matching algorithm to optimally match the columns of with those of (based on minimizing the sum of distances between all pairs of matching columns) before computing the norm of the reconstruction error between and .
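The column-matching step before computing the error can be illustrated as follows. Two assumptions are made here: the distance between columns is taken to be the ℓ1 distance, and the optimal matching is found by brute force over all permutations, which is only feasible for a handful of topics; a proper bipartite matching (Hungarian-type) routine would be needed at realistic sizes. The function name `matched_l1_error` is hypothetical.

```python
import numpy as np
from itertools import permutations

def matched_l1_error(beta_true, beta_est):
    """Total l1 error after optimally matching estimated columns to true columns."""
    K = beta_true.shape[1]
    # cost[i, j] = l1 distance between true column i and estimated column j
    cost = np.abs(beta_true[:, :, None] - beta_est[:, None, :]).sum(axis=0)
    # Brute-force search over all column permutations (toy-sized K only)
    best = min(permutations(range(K)),
               key=lambda p: sum(cost[i, p[i]] for i in range(K)))
    return sum(cost[i, best[i]] for i in range(K)), best

# Toy check: the estimate is the true matrix with its columns shuffled
beta_true = np.array([[0.6, 0.1, 0.2],
                      [0.3, 0.7, 0.1],
                      [0.1, 0.2, 0.7]])
beta_est = beta_true[:, [2, 0, 1]]
error, matching = matched_l1_error(beta_true, beta_est)
```

Here `matching[i]` is the index of the estimated column paired with true column `i`; dividing `error` by the number of topics gives a per-topic normalized error.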
The results on both Semi-SynNovel NYT and Semi-Syn NYT are summarized in Fig. 5 for all three algorithms for various choices of the number of documents . We note that in these figures the norm of the error has been normalized by the number of topics ().
As Fig. 5 shows, when the separability condition is strictly satisfied (Semi-SynNovel), the reconstruction error of RP converges to 0 as becomes large and RP outperforms the approximation-based Gibbs. When the separability condition is not strictly satisfied (Semi-Syn), the reconstruction error of RP is comparable to that of Gibbs (a practical benchmark).
Solid Angle and Model Selection: In our proposed algorithm RP, the number of topics (the model-order) needs to be specified. When is unavailable, it needs to be estimated from the data. Although not the focus of this work, Algorithm 2, which identifies novel words by sorting and clustering the estimated solid angles of words, can be suitably modified to estimate .
Indeed, in the ideal scenario where there is no sampling noise (, and ), only novel words have positive solid angles (’s) and the rows of corresponding to the novel words of the same topic are identical, i.e., the distance between the rows is zero or, equivalently, they are within a neighborhood of size zero of each other. Thus, the number of distinct neighborhoods of size zero among the non-zero solid angle words equals .
In the nonideal case is finite. If is sufficiently large, one can expect that the estimated solid angles of non-novel words will not all be zero. They are, however, likely to be much smaller than those of novel words. Thus to reliably estimate one should not only exclude words with exactly zero solid angle estimates, but also those below some nonzero threshold. When is finite, the rows of corresponding to the novel words of the same topic are unlikely to be identical, but if is sufficiently large they are likely to be close to each other. Thus, if the threshold in Algorithm 2, which determines the size of the neighborhood for clustering all novel words belonging to the same topic, is made sufficiently small, then each neighborhood will have only novel words belonging to the same topic.
With the two modifications discussed above, the number of distinct neighborhoods of a suitably nonzero size (determined by ) among the words whose solid angle estimates are larger than some threshold will provide an estimate of . The values of and should, in principle, decrease to zero as increases to infinity. Leaving the task of unraveling the dependence of and on to future work, here we only provide a brief empirical validation on both the Semi-SynNovel and Semi-Syn NYT datasets. We set so that the reconstruction error has essentially converged (see Fig. 5), and consider different choices of the threshold .
We run Algorithm 2 with , , and a new line of code: 16’: (if , break); inserted between lines 16 and 17 (this corresponds to ). The input hyperparameter is not the actual number of estimated topics. It should be interpreted as specifying an upper bound on the number of topics. The value of (little) when Algorithm 2 terminates (see lines 14–21) provides an estimate of the number of topics.
Figure 6 illustrates how the solid angles of all words, sorted in descending order, decay for different choices of and how they can be used to detect the novel words and estimate the value of . We note that in both the semi-synthetic datasets, for a wide range of values of (0.1–5), the modified Algorithm 2 correctly estimates the value of as . When is large (e.g., in Fig. 6), many interior points would be declared as novel words and multiple ideal novel words would be grouped into one cluster. This causes to be underestimated (46 and 41 in Fig. 6).
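The modified selection rule can be sketched as a greedy pass over the words in decreasing order of estimated solid angle. This is a hedged illustration, not the authors' Algorithm 2: the function name, the use of squared Euclidean distance between rows as the neighborhood measure, and the parameter names `zeta` (neighborhood size), `tau` (solid angle threshold), and `k_max` (upper bound on the number of topics) are assumptions made for the example.

```python
import numpy as np

def estimate_num_topics(E, q_hat, zeta, tau, k_max):
    """Greedily pick words with large solid angles whose rows are mutually far apart."""
    chosen = []
    for i in np.argsort(-q_hat):                     # descending solid angle estimates
        if len(chosen) == k_max or q_hat[i] < tau:   # early exit: the remaining angles are too small
            break
        # Keep word i only if its row is outside the neighborhood of every chosen word
        if all(np.sum((E[i] - E[j]) ** 2) >= zeta for j in chosen):
            chosen.append(int(i))
    return len(chosen), chosen

# Toy rows: words 0 and 1 are novel words of the same topic, 2 and 3 of two other
# topics, and word 4 is an interior (non-novel) word with a tiny solid angle estimate.
E = np.array([[1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.4, 0.3, 0.3]])
q_hat = np.array([0.30, 0.28, 0.22, 0.18, 0.02])
k_est, novel = estimate_num_topics(E, q_hat, zeta=0.01, tau=0.05, k_max=10)
```

In this toy run, word 1 is skipped because its row coincides with that of word 0, and the loop stops at word 4 because its estimate falls below `tau`, giving an estimate of three topics.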
VI-B Real-world data
We now describe results on the actual real-world NYT dataset that was used in Sec. VI-A to construct the semi-synthetic datasets. Since ground truth topics are unavailable, we measure performance using the so-called predictive held-out log-probability. This is a standard measure which is typically used to evaluate how well a learned topic model fits real-world data. To calculate this for each of the three topic estimation methods (Gibbs [11, 33], RecoverL2 , and RP), we first randomly select documents to test the goodness of fit and use the remaining documents to produce an estimate of the topic matrix. Next we assume a Dirichlet prior on the topics and estimate its concentration hyper-parameter . In Gibbs, this estimate is a byproduct of the algorithm. In RecoverL2 and RP this can be estimated from and . We then calculate the probability of observing the test documents given the learned topic model and :
Since an exact evaluation of this predictive log-likelihood is intractable in general, we calculate it using the MCMC based approximation proposed in  which is now a standard approximation tool . For RP, we use , , and as in Sec. VI-A. We report the held-out log probability, normalized by the total number of words in the test documents, averaged across 5 training/testing splits. The results are summarized in Table I.
As shown in Table I, Gibbs has the best descriptive power for new documents. RP and RecoverL2 have similar, but somewhat lower values than Gibbs. This may be attributed to missing novel words that appear only in the test set and are crucial to the success of RecoverL2 and RP. Specifically, in real-world examples, there is a model-mismatch as a result of which the data likelihoods of RP and RecoverL2 suffer.
Finally, we qualitatively assess the topics produced by our RP algorithm. We show some example topics extracted by RP trained on the entire NYT dataset of documents in Table II. (Footnote 3: The zzz prefix in the NYT vocabulary is used to annotate certain special named entities. For example, zzz_nfl annotates NFL.)
| Topic label | Words in decreasing order of estimated probabilities |
|---|---|
| “weather” | weather wind air storm rain cold |
| “feeling” | feeling sense love character heart emotion |
| “election” | election zzz_florida ballot vote zzz_al_gore recount |
| “game” | yard game team season play zzz_nfl |
For each topic, its most frequent words are listed. As can be seen, the estimated topics do form recognizable themes that can be assigned meaningful labels. The full list of all topics estimated on the NYT dataset can be found in .
VII Conclusion and Discussion
This paper proposed a provably consistent and efficient algorithm for topic discovery. We considered a natural structural property – topic separability – on the topic matrix and exploited its geometric implications. We established the necessary and sufficient conditions that can guarantee consistent novel word detection as well as separable topic estimation. We then proposed a random projections based algorithm that has not only provably polynomial statistical and computational complexity but also state-of-the-art performance on semi-synthetic and real-world datasets.
While we focused on the standard centralized batch implementation in this paper, it turns out that our random projections based scheme is naturally amenable to an efficient distributed implementation which is of interest when the documents are stored on a network of distributed servers. This is because the iid isotropic projection directions can be precomputed and shared across document servers, and counts, projections, and co-occurrence matrix computations have an additive structure which allows partial computations to be performed at each document server locally and then aggregated at a fusion center with only a small communication cost. It turns out that the distributed implementation can provably match the polynomial computational and statistical efficiency guarantees of its centralized counterpart. As a consequence, it provides a provably efficient alternative for the distributed topic estimation problem, which has been tackled using variations of MCMC or Variational-Bayes in the literature [35, 20, 36, 37]. This is appealing for modern web-scale databases, e.g., those generated by Twitter Streaming. A comprehensive theoretical and empirical investigation of the distributed variation of our algorithm can be found in .
Separability of general measures: We defined and studied the notion of separability for a topic matrix which is a finite collection of probability distributions over a finite set (of size ). It turns out that we can extend the notion of separability to a finite collection of measures over a measurable space. This necessitates making a small technical modification to the definition of separability to accommodate the possibility of only having “novel subsets” that have zero measure. We also show that our generalized definition of separability is equivalent to the so-called irreducibility property of a finite collection of measures that has recently been studied in the context of mixture models to establish conditions for the identifiability of the mixing components [38, 39].
Consider a collection of measures over a measurable space , where is a set and is a -algebra over . We define the generalized notion of separability for measures as follows.
(Separability) A collection of measures over a measurable space is separable if for all ,
Separability requires that for each measure , there exists a sequence of measurable sets , of nonzero measure with respect to , such that, for all , the ratios vanish asymptotically. Intuitively, this means that for each measure there exists a sequence of nonzero-measure measurable subsets that are asymptotically “novel” for that measure. When is a finite set as in topic modeling, this reduces to the existence of novel words as in Definition 1 and are simply the sets of novel words for topic .
The separability property just defined is equivalent to the so-called irreducibility property. Informally, a collection of measures is irreducible if only nonnegative linear combinations of them can produce a measure. Formally,
(Irreducibility) A collection of measures