Since its introduction by Blei et al. (2003) and Pritchard et al. (2000), the Latent Dirichlet Allocation (LDA) model has remained an important tool for exploring and organizing large corpora of texts and images. The goal of topic modeling can be summarized as finding a set of topics that summarizes the observed corpus, where each document is a combination of topics lying on the topic simplex.
There are many extensions of LDA, including a nonparametric extension based on the Dirichlet process called the Hierarchical Dirichlet Process (Teh et al., 2005), a correlated topic extension based on a logistic normal prior on the topic proportions (Lafferty and Blei, 2006), and a time-varying topic modeling extension (Blei and Lafferty, 2006). There are two main approaches to estimating the parameters of probabilistic topic models: the variational approximation popularized by Blei et al. (2003) and the sampling-based approach studied by Pritchard et al. (2000). These inference algorithms either approximate or sample from the posterior distributions of the latent variables representing the topic labels. Therefore, the estimates do not necessarily have a meaningful geometric interpretation in terms of the topic simplex, complicating assessment of goodness of fit to the model. To address this problem, Yurochkin and Nguyen (2016) introduced Geometric Dirichlet Means (GDM), a novel geometric approach to topic modeling. It is based on a geometric loss function that is a surrogate to the LDA likelihood and builds upon a weighted k-means clustering algorithm with a bias correction. It avoids the excessive redundancy of the latent topic label variables and thus improves computation speed and learning accuracy. This geometric viewpoint was extended to a nonparametric setting by Yurochkin et al. (2017).
LDA-type models also arise in the hyperspectral unmixing problem. Similar to the documents in topic modeling, hyperspectral image pixels are assumed to be mixtures of a few spectral signatures, called endmembers (equivalent to topics). Unmixing procedures aim to identify the number of endmembers, their spectral signatures, and their abundances at each pixel (equivalent to topic proportions). One difference between topic modeling and unmixing is that hyperspectral spectra are not normalized. Nonetheless, algorithms for hyperspectral unmixing are similar to topic model algorithms, and similar models have been applied to both problems. Geometric approaches in the hyperspectral unmixing literature take advantage of the fact that linearly mixed vectors also lie in a simplex set or in a positive cone. Some of the early geometric approaches to unmixing were introduced in Nascimento and Dias (2005) and Bioucas-Dias (2009), which aim to first identify the K-dimensional subspace of the data and then estimate the endmembers that minimize the volume of the simplex spanned by these endmembers. Bioucas-Dias (2009) estimates the endmembers by minimizing the log-determinant of the endmember matrix, as the log-determinant is proportional to the volume of the simplex defined by the endmembers. This idea of minimizing the simplex volume motivated the algorithm proposed in this paper for topic modeling. In Bioucas-Dias (2009), however, the authors experience an optimization issue, as their formulation is highly non-convex. It was found that the local minima of the objective in Bioucas-Dias (2009) may be unstable.
The topic modeling problem also has similarities to matrix factorization. In particular, nonnegative matrix factorization, while it does not enforce a sum-to-one constraint, is directly applicable to topic modeling (Deerwester et al., 1990; Xu et al., 2003; Anandkumar et al., 2012; Arora et al., 2013; Fu et al., 2018). RecoverKL, recently introduced by Arora et al. (2013), provides a fast algorithm that identifies the model under a separability assumption, i.e. the assumption that the sample set includes the vertices of the true topic simplex (pure endmembers). As the separability assumption is often not satisfied in practice, Fu et al. (2018) introduced a weaker assumption called the sufficiently scattered assumption. We provide a theoretical justification of our geometric minimum volume method under this weaker assumption.
We propose a new geometric inference method for LDA that is formulated as minimizing the volume of the topic simplex. The estimator is shown to be identifiable under both the separability assumption and the sufficiently scattered assumption. Compared to Bioucas-Dias (2009), our geometric objective involves the log-determinant of a symmetric positive semidefinite matrix instead of an asymmetric one, making our objective function convex. At the same time, this term remains proportional to the volume enclosed by the topic matrix and simplifies the optimization. In particular, we propose a convex relaxation of the minimization problem whose global minimum is equivalent to that of the original problem. This relaxed objective function is minimized using an iterative augmented Lagrangian approach, implemented using the alternating direction method of multipliers (ADMM), which is shown to be locally convergent.
We use the following notation. We are given a corpus with $M$ documents, $K$ topics, vocabulary size $V$, and $N_m$ words in document $m$, for $m = 1, \dots, M$. Let $\mathbb{S}^{a \times b}$ be the space of $a \times b$ row-stochastic matrices. Then, our goal is to decompose the document matrix as $\theta\beta$, where $\theta \in \mathbb{S}^{M \times K}$ is the matrix of topic proportions, and $\beta \in \mathbb{S}^{K \times V}$ is the topic-term matrix. Finally, $\Delta^{V-1}$ represents the $(V-1)$-dimensional probability simplex. It is assumed that the documents in the corpus obey the following generative LDA model.
1. For each topic $k$, for $k = 1, \dots, K$:
   (a) Draw a topic distribution $\beta_k \sim \mathrm{Dir}_V(\eta)$.
2. For the $m$-th document in the corpus, for $m = 1, \dots, M$:
   (a) Choose the topic proportions $\theta_m \sim \mathrm{Dir}_K(\alpha)$.
   (b) For each word $n = 1, \dots, N_m$ in the document:
       i. Choose a topic $z_{mn} \sim \mathrm{Multinomial}(\theta_m)$.
       ii. Choose a word $w_{mn} \sim \mathrm{Multinomial}(\beta_{z_{mn}})$.
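As an illustration, the generative process above can be simulated directly. The sketch below uses numpy with arbitrary illustrative values for the number of documents, topics, vocabulary size, document lengths, and Dirichlet hyperparameters; none of these values are prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)

M, K, V = 100, 3, 50           # documents, topics, vocabulary size (illustrative)
alpha, eta = 0.5, 0.1          # Dirichlet hyperparameters (illustrative)

beta = rng.dirichlet(eta * np.ones(V), size=K)     # topic-word matrix, K x V
theta = rng.dirichlet(alpha * np.ones(K), size=M)  # topic proportions, M x K

docs = []
for m in range(M):
    N_m = rng.poisson(80)                          # document length
    z = rng.choice(K, size=N_m, p=theta[m])        # topic label for each token
    w = np.array([rng.choice(V, p=beta[k]) for k in z], dtype=int)
    docs.append(w)

# Each row of theta and beta lies on a probability simplex.
assert np.allclose(theta.sum(axis=1), 1.0)
assert np.allclose(beta.sum(axis=1), 1.0)
```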
2 Proposed Approach
We assume that the number of topics $K$ is known in advance and is much smaller than the size of the vocabulary, i.e. $K \ll V$. Furthermore, since LDA models each document as lying inside the topic simplex, it is advantageous to represent the documents in a $(K-1)$-dimensional subspace basis.
Let $U$ be a matrix of dimension $V \times (K-1)$ whose columns are orthogonal directions spanning the document subspace. Specifically, we define the columns of $U$ as the top $K-1$ eigenvectors of the sample covariance matrix of the documents $w_1, \dots, w_M$.
Most of the paper focuses on working with the projected coordinates $\tilde{w}_m = U^\top (w_m - \bar{w})$, which correspond to the coordinates of $w_m$ in the subspace spanned by $U$. Note that we can recover the projected documents in the original $V$-dimensional space by $\hat{w}_m = U \tilde{w}_m + \bar{w}$, where $\bar{w}$ is the sample average of the observed documents. Therefore, $\tilde{w}_m = \theta_m \tilde{\beta}$, where $\theta_m$ belongs to the simplex $\Delta^{K-1}$. This $(K-1)$-dimensional probability simplex is defined by the topic distributions, which are the rows of $\tilde{\beta}$. For the rest of the paper, given a vector in the original space, we use a tilde to denote its coordinates in the projected subspace and a hat to denote the projected vector in the original $V$-dimensional space.
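The projection and back-projection described above can be sketched numerically as follows. The variable names are illustrative, and exact recovery holds here only because the synthetic documents are noiseless mixtures of the topics.

```python
import numpy as np

rng = np.random.default_rng(1)
M, K, V = 200, 3, 30

# Synthetic normalized documents: exact mixtures of K topics, so they lie
# on a (K-1)-dimensional affine subspace of the word simplex.
beta = rng.dirichlet(np.ones(V), size=K)    # K x V topic matrix
theta = rng.dirichlet(np.ones(K), size=M)   # M x K topic proportions
W = theta @ beta                            # M x V documents

w_bar = W.mean(axis=0)                      # sample average document
C = np.cov(W, rowvar=False)                 # V x V sample covariance
eigvals, eigvecs = np.linalg.eigh(C)        # eigenvalues in ascending order
U = eigvecs[:, -(K - 1):]                   # top K-1 eigenvectors, V x (K-1)

W_tilde = (W - w_bar) @ U                   # projected coordinates, M x (K-1)
W_hat = W_tilde @ U.T + w_bar               # back-projection into R^V
```

Because the documents are exact mixtures, `W_hat` recovers `W` up to numerical error; with noisy data the back-projection would only be approximate.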
2.1 Topic Estimation
Let $Q = \tilde{\beta}^{-1}$. Then, it follows that $\theta_m = \tilde{w}_m Q$. We know that $\tilde{\beta}$ is invertible as we assume that there are $K$ distinct topics, and the rank of the topic matrix is $K$. Then, as noted in Nascimento and Bioucas-Dias (2012), the likelihood with respect to $Q$ can be written as
This formulation gives a nice geometric interpretation.
Geometric interpretation of the log-likelihood: as we increase the number of documents $M$, the dominant term is $M \log |\det(\tilde{\beta}^{-1})|$. Note that $|\det \tilde{\beta}|$ is proportional to the volume enclosed by the row vectors of $\tilde{\beta}$, i.e. the topic simplex in the projected subspace. In other words, minimizing the volume of the estimated topic simplex is asymptotically equivalent to maximizing the asymptotic form of the log-likelihood (1). This is the main motivation for our proposal to minimize the volume of the topic simplex.
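The determinant-volume relation that motivates the objective can be checked numerically. The helper below is purely illustrative and not part of the proposed algorithm: the volume of a simplex with $K$ vertices in $(K-1)$-dimensional coordinates equals the absolute determinant of the edge-difference matrix divided by $(K-1)!$.

```python
import math
import numpy as np

def simplex_volume(vertices):
    """Volume of the simplex whose K vertices are the rows of `vertices`,
    given in (K-1)-dimensional coordinates: |det(v_k - v_K)| / (K-1)!."""
    diffs = vertices[:-1] - vertices[-1]
    K = vertices.shape[0]
    return abs(np.linalg.det(diffs)) / math.factorial(K - 1)

# The unit triangle in the plane has area 1/2.
tri = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
assert np.isclose(simplex_volume(tri), 0.5)

# Scaling the vertices by c scales the volume by c^(K-1), consistent
# with the volume being proportional to the determinant.
assert np.isclose(simplex_volume(2 * tri), 4 * simplex_volume(tri))
```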
3 Minimum Volume Topic Modeling
In the remote sensing literature, Nascimento and Bioucas-Dias (2012) proposed to work with the likelihood (1) by modeling the abundances as a Dirichlet mixture. However, their endmembers are spectra and do not necessarily satisfy the sum-to-one constraints on the endmember matrix, constraints which are fundamental to topic modeling. These additional constraints on the endmembers complicate the minimization of (1). The first difficulty arises from the log-determinant term, as its argument, the inverse of the projected topic matrix, is not a symmetric matrix, which makes the log-likelihood non-convex. Due to this non-convexity issue, Nascimento and Bioucas-Dias (2012) propose using a second-order approximation to the term; yet no rigorous justification has been provided for their approach. In contrast, we propose replacing the log-determinant of this asymmetric matrix with that of a symmetric positive semidefinite one, prove identifiability under the sufficiently scattered assumption, and derive an ADMM update.
As we are optimizing over $\tilde{\beta}$ directly, we use this notation in the sequel. We can then rewrite the objective (2) as the constrained problem (3). The first set of constraints corresponds to the sum-to-one and non-negativity constraints on the topic proportions $\theta$, and the second constraint imposes the same conditions on $\beta$. Thus, the problem (3) provides an exact solution to the asymptotic estimation of (1). However, this is not a convenient formulation of the optimization problem, as it involves a constraint on the inverse of $\tilde{\beta}$. Note that as $\beta$ intrinsically lives in a $(K-1)$-dimensional subspace, there is a one-to-one mapping between $\beta$ and $\tilde{\beta}$. Throughout this paper, we will make use of this relationship. Here, working with a geometric interpretation of the second set of constraints, we propose a relaxed version of (3).
Sum-to-one constraint on $\beta$: combined with the non-negativity constraint, the sum-to-one constraint forces the rows of $\beta$ to lie in the $(K-1)$-dimensional topic simplex within the word simplex. To be specific, it narrows our search space to an affine subspace, which is accomplished by projecting the documents onto this $(K-1)$-dimensional affine subspace. This projection takes care of the sum-to-one constraint in the objective (3).
Non-negativity constraint on $\beta$: we propose relaxing the non-negativity constraint to a ball constraint, where the radius is the minimum singular value $\sigma_{\min}$ of $\tilde{\beta}$. As illustrated in Figure 1, this is interpreted as replacing the non-negativity constraint on the elements of the matrix $\beta$ with a radius-$\sigma_{\min}$ ball constraint on the rows of the matrix $\tilde{\beta}$. As noted before, there is a mapping between $\beta$ and $\tilde{\beta}$. Thus, if $\tilde{\beta}$ is the current iterate of an iterative optimization algorithm, to be specified below, then we can represent the corresponding $k$-th topic vector in the projected space as the $k$-th row of $\tilde{\beta}$. It follows that
With this spectral relaxation of the non-negativity constraint, the relaxed version of the problem (3) becomes
Here we establish identifiability of the model obtained by solving problem (5). Identifiability has gained interest in the topic modeling literature (Arora et al., 2013; Fu et al., 2018). We show identifiability under the sufficiently scattered condition, and first state the following lemma.
Let be a solution to the problem (5). If , we have that , where
Intuitively, Lemma 3.1 tells us that we cannot have a solution outside of the blue triangle in Figure 2. If there were a solution outside of the triangle (Figure 2(a)), we could find its projection (Figure 2(b)) onto the word simplex (blue triangle) that still satisfies the constraint yet has a smaller volume, which is a contradiction.
We prove this statement by contradiction. Suppose . Then, as is an optimal solution to the problem (5), we have that . Furthermore, since and is obtained from PCA, we have that . Thus, it follows that
Therefore, the only constraint that could possibly violate is non-negativity of . Let be the projection of onto the simplex and let . Then, satisfies all the constraints in the optimization problem (5), but we also have that
since the volume of is smaller than that of . This is a contradiction as is the optimal solution to the problem (5). Thus, it follows that . ∎
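The projection step used in the proof above can be made concrete. Below is a minimal sketch of Euclidean projection onto the probability simplex using the standard sort-based algorithm; the function name is ours and the example values are arbitrary.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex
    {x : x_i >= 0, sum(x) = 1}, via the standard sort-based algorithm."""
    u = np.sort(v)[::-1]                          # sort in decreasing order
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * idx > (css - 1.0))[0][-1]
    tau = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - tau, 0.0)

# A point outside the simplex is moved onto it; a point already on the
# simplex is left unchanged.
p = project_to_simplex(np.array([1.2, 0.3, -0.5]))
assert np.isclose(p.sum(), 1.0) and (p >= 0).all()
```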
We now state the sufficiently scattered assumption from Huang et al. (2016).
Assumption 1 (sufficiently scattered condition, Huang et al. (2016)): let $\mathrm{cone}(\theta^\top) = \{\theta^\top x : x \ge 0\}$ be the polyhedral cone of $\theta^\top$ and let $\mathcal{C} = \{x \in \mathbb{R}^K : \|x\|_2 \le \mathbf{1}^\top x / \sqrt{K-1}\}$ be the second-order cone. The matrix $\theta$ is called sufficiently scattered if it satisfies:
1) $\mathcal{C} \subseteq \mathrm{cone}(\theta^\top)$;
2) $\mathrm{cone}(\theta^\top)^* \cap \partial \mathcal{C}^* = \{\lambda e_k : \lambda \ge 0,\ k = 1, \dots, K\}$, where $\partial \mathcal{C}^*$ denotes the boundary of the dual cone $\mathcal{C}^* = \{y : x^\top y \ge 0 \ \text{for all} \ x \in \mathcal{C}\}$.
The sufficiently scattered assumption can be interpreted as an assumption that we observe a sufficient number of documents on the faces of the topic simplex. In real-world topic model applications, such an assumption is not unreasonable since there are usually documents in the corpora having sparse representations.
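The intuition that sparse documents populate the faces of the topic simplex can be checked with a quick simulation. The sketch below counts Dirichlet samples with a near-zero coordinate; the concentration values, sample size, and tolerance are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
K = 5

def frac_near_face(conc, n=2000, tol=1e-3):
    """Fraction of Dirichlet(conc, ..., conc) samples with a near-zero
    coordinate, i.e. samples lying close to a face of the simplex."""
    theta = rng.dirichlet(conc * np.ones(K), size=n)
    return (theta.min(axis=1) < tol).mean()

# Small concentration -> sparse proportions -> many samples near the faces.
assert frac_near_face(0.1) > frac_near_face(10.0)
```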
Let $\tilde{\beta}^*$ be the optimal solution to problem (5) and $\beta^*$ be the corresponding topic matrix. If the topic proportion matrix $\theta$ is sufficiently scattered and the true topic matrix $\beta$ has rank $K$, then $\beta^* = \Pi \beta$, where $\Pi$ is a permutation matrix.
The proof structure is similar to the one in Huang et al. (2016), and we include it here for completeness.
Given a corpus, let $\beta$ be the true topic-word matrix. Suppose $\mathrm{rank}(\beta) = K$ and $\theta$ is sufficiently scattered. Let $\beta^*$ be the solution to problem (5). Then, by Lemma 3.1, the rows of $\beta^*$ lie in the word simplex. Furthermore, since $\mathrm{rank}(\beta) = K$, we have that $\mathrm{rank}(\theta\beta) = K$, as $\mathrm{rank}(\theta) = K$. It also follows that $\mathrm{rank}(\beta^*) = K$ due to the constraint in (5). Therefore, the volumes of $\beta$ and $\beta^*$ are strictly positive; in other words, we cannot have a trivial solution to (5), as the objective is bounded. As $\beta$ and $\beta^*$ are full row rank, there exists an invertible matrix $A$ such that $\beta^* = A\beta$. Also, as the rows of both $\beta^*$ and $\beta$ sum to one, it follows that $A\mathbf{1} = \mathbf{1}$, and
The inequality constraint tells us that the rows of $A$ are contained in the dual cone $\mathrm{cone}(\theta^\top)^*$. As $\theta$ is sufficiently scattered, the rows of $A$ are contained in $\mathcal{C}^*$ by the first condition of (A1). Then, by the definition of the second-order cone $\mathcal{C}^*$, it follows that
The first inequality comes from the Hadamard inequality, for which equality holds if and only if the rows $a_k$ of $A$ are orthogonal to each other. The second inequality holds when $\|a_k\|_2 \le 1$ for every $k$. Then, together with (6), it follows that $|\det A| \le 1$. Thus, $|\det A|$ achieves its maximum at 1 when each row of $A$ sums to one and $A$ is an orthogonal matrix, i.e. when $A$ is a permutation matrix.
Furthermore, since $\beta^* = A\beta$, the volume of the simplex of $\beta^*$ is related to that of $\beta$ through $|\det A|$, where equality of the volumes holds when $A$ is a permutation matrix. In other words, the minimum is achieved when $A$ is a permutation matrix. Therefore, our solution to problem (5) and the corresponding $\beta^*$ are equal to the true topic-word matrix up to permutation. ∎
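The Hadamard inequality invoked in the proof, $|\det A| \le \prod_k \|a_k\|_2$ with equality when the rows are orthogonal, is easy to verify numerically:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hadamard's inequality: |det A| <= product of the row norms of A.
A = rng.standard_normal((4, 4))
row_norms = np.linalg.norm(A, axis=1)
assert abs(np.linalg.det(A)) <= np.prod(row_norms) + 1e-12

# Equality holds when the rows are orthogonal: take an orthogonal matrix Q
# and rescale its rows so that B has orthogonal rows with norms 1..4.
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
B = np.diag([1.0, 2.0, 3.0, 4.0]) @ Q
assert np.isclose(abs(np.linalg.det(B)), 1.0 * 2.0 * 3.0 * 4.0)
```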
Assumption 2 (separability assumption from Arora et al. (2013)): there exists a set of indices $\{n_1, \dots, n_K\}$ such that the corresponding columns of the topic-word matrix form $\mathrm{Diag}(c)$, where each $c_k > 0$.
The separability assumption, also known as the anchor-word assumption, states that every topic $k$ has a unique word that only occurs in topic $k$. These words are referred to as the anchor words, as introduced in Arora et al. (2013).
Remark: the identifiability statement in Proposition 3.1 holds under the separability assumption as well, since the separability assumption implies the weaker sufficiently scattered assumption.
3.2 Augmented Lagrangian Formulation
We work with the following augmented Lagrangian version of the constrained optimization problem (5), where $\lambda$ is a penalty parameter and $h(\cdot)$ is a hinge loss that captures the non-negativity constraint on $\beta$. Furthermore, the linear sum-to-one constraint is converted into an equivalent linear equality constraint on the projected variables.
The Lagrangian objective function in (9) can be written as
Introducing auxiliary optimization variables, we reformulate (5). For a penalty parameter $\rho$ and a Lagrange multiplier matrix, we consider the augmented Lagrangian of this problem
This function can be minimized using an iterative ADMM update scheme on the arguments , , , , and . The update for and can be accomplished by standard proximal operators that implement soft-thresholding and a projection. Furthermore, the -update can be derived in a closed form by solving a quadratic equation in its singular values. The details of the ADMM updates are included in the supplement. First, consider the -subproblem without the linear constraint . Then, as derived in the supplement, the resulting update equation for is
where is defined in the supplement. Using , we obtain a closed-form solution to the sub-problem in (12) as follows
This solution to the linear constrained problem can be easily derived as a stationary point of the convex function that is minimized in (13). Note that, by construction, .
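The exact MVTM updates are deferred to the supplement, but the general ADMM pattern they follow, an exact quadratic block update, a proximal step (here soft-thresholding), and a scaled dual ascent, can be sketched on a toy lasso problem. This is an illustration of the update scheme only, not the MVTM algorithm itself; all names and parameter values are ours.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def admm_lasso(A, b, lam=0.1, rho=1.0, iters=300):
    """Minimize 0.5*||Ax - b||^2 + lam*||x||_1 via ADMM with the split x = z."""
    n = A.shape[1]
    z = np.zeros(n)
    u = np.zeros(n)                                  # scaled dual variable
    AtA, Atb = A.T @ A, A.T @ b
    Q = AtA + rho * np.eye(n)                        # fixed quadratic system
    for _ in range(iters):
        x = np.linalg.solve(Q, Atb + rho * (z - u))  # exact quadratic update
        z = soft_threshold(x + u, lam / rho)         # proximal step
        u = u + x - z                                # scaled dual ascent
    return z
```

With `A` equal to the identity, the lasso solution reduces to `soft_threshold(b, lam)`, which gives a quick correctness check.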
In the nonnegative matrix factorization literature, Liu et al. (2017) used a large-cone penalty that constrains either the volume or the pairwise angles of the simplex vertices. However, their formulation does not impose a sum-to-one constraint on the topics. Furthermore, our formulation has an advantage over the problem in Liu et al. (2017) as we work directly with the latent topic proportions $\theta$. This is possible in our formulation because we decoupled $\theta$ from $\beta$ using the ADMM mechanism.
This follows by applying a standard convergence proof of the ADMM algorithm (Algorithm 1) based on the KKT condition. The proposition states that our ADMM formulation converges to a stationary point. However, while the unconstrained objective function in (10) is convex, the constraint on the minimum singular value makes the constrained optimization function non-convex. Thus, our algorithm is only guaranteed to converge to a stationary point of (10).
Figure 3 demonstrates the convergence of our algorithm on synthetic data generated from an LDA model, with the initial iterate equal to the identity matrix. The left panel shows the relative Frobenius error between the iterates and the true topic matrix. The right panel shows the convergence in terms of the objective values.
4 Performance Comparison
To demonstrate the performance of the proposed minimum volume topic model (MVTM) estimation algorithm (Algorithm 1), we generate LDA data while varying the Dirichlet hyperparameter $\alpha$ for the topic proportions. For ease of visualization, the first two dimensions of the projected documents and the estimated topics are used. The first scenario (small $\alpha$) in Figure 4 shows that the performance of our algorithm is comparable to the vertex-based method GDM (Yurochkin and Nguyen, 2016) when there are plenty of observed documents around the vertices. While there is no anchor word in the generated dataset, we observe enough documents around the vertices; in other words, the separability assumption is only slightly violated.
With higher values of $\alpha$, however, Figure 5 shows the advantages of our method, denoted MVTM. Note that higher values of $\alpha$ correspond to the situation where the sufficiently scattered condition is satisfied but the separability condition is violated. Thus, we can see that the vertex-based method (GDM) starts to suffer even in oracle performance. In contrast, with an appropriate choice of the hinge-loss parameter, our method recovers the correct topics even in the well-mixed scenario. Figure 4(b) shows that there is a kink in the optimization path, where MVTM finds the right orientation of the true simplex. Furthermore, the absence of loops in the optimization path illustrates the identifiability of MVTM.
Lastly, we explore the asymptotic behavior by varying the document lengths, with the remaining model parameters fixed and 100 held-out documents. MVTM is always initialized at the identity matrix, and VEM is given 10 restarts, as the objective for the variational method is non-convex.
Figure 6 shows that: 1) Gibbs sampling and MVTM have comparable performance in terms of perplexity; 2) MVTM and VEM both show computational advantages over the Gibbs sampling method; and 3) VEM suffers in statistical performance due to the non-convexity of its objective function. Additional simulation results can be found in the supplement.
4.1 NIPS dataset
To illustrate the performance of MVTM on real-world data, we apply our algorithm to the NIPS dataset. We preprocess the raw data using a standard stop-word list and filter the resulting data through a stemmer. After preprocessing, only words that appeared more than 25 times across the whole corpus are retained. Then, we further remove the documents that have fewer than 10 words. The final dataset contains 4492 unique words and 1491 documents with a mean document length of 1187. We compare our algorithm's performance to GDM and Gibbs sampling at K = 5, 10, 15, and 20. The perplexity score is used to perform the comparison in Table 1.
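Perplexity, used for the comparison in Table 1, can be computed as in the generic sketch below, where `word_dists` stands in for the per-document word distributions under a fitted model (both names are ours).

```python
import numpy as np

def perplexity(docs, word_dists):
    """Held-out perplexity: exp(-(total log-likelihood) / (total tokens)).
    `docs` is a list of word-index arrays; `word_dists[m]` is the per-word
    distribution of document m (e.g. theta_m @ beta under a fitted model)."""
    total_ll, total_tokens = 0.0, 0
    for w, p in zip(docs, word_dists):
        total_ll += np.log(p[w]).sum()
        total_tokens += len(w)
    return float(np.exp(-total_ll / total_tokens))

# Under a uniform word distribution over V = 4 words, perplexity equals 4.
docs = [np.array([0, 1, 2, 3])]
assert np.isclose(perplexity(docs, [np.full(4, 0.25)]), 4.0)
```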
The additional time comparison and top 10 words of top 10 learned topics for MVTM, GDM, and Gibbs sampling are provided in the supplement.
This paper presents a new estimation procedure for LDA topic modeling based on minimization of the volume of the topic simplex. This formulation can be thought of as an asymptotic estimation of the LDA model. The proposed minimum volume topic model (MVTM) algorithm differs from moment-based methods such as RecoverKL and vertex-based methods such as GDM. We proved the identifiability of MVTM under the sufficiently scattered assumption introduced in Huang et al. (2016). When the sufficiently scattered assumption is satisfied and the separability assumption is violated, MVTM continues to perform well with an appropriate choice of the hinge-loss parameter.
There are open questions on the statistical convergence of our estimator in terms of the document length and the number of documents. Such relationships have been explored in the work of Tang et al. (2014), and it would be interesting to see if these could be applied to the proposed MVTM. Understanding the statistical behavior of MVTM would provide theoretical guidance on the choice of the hinge-loss parameter. Beyond these theoretical questions, MVTM also has some potential modeling extensions. An immediate extension is the nonparametric setting, where one would also estimate the number of topics $K$.
This research was partially supported by grant ARO W911NF-15-1-0479.
- Anandkumar et al. (2012) Anandkumar, A., Foster, D. P., Hsu, D. J., Kakade, S. M., and Liu, Y.-K. (2012). A spectral algorithm for latent dirichlet allocation. In Advances in Neural Information Processing Systems, pages 917–925.
- Arora et al. (2013) Arora, S., Ge, R., Halpern, Y., Mimno, D., Moitra, A., Sontag, D., Wu, Y., and Zhu, M. (2013). A practical algorithm for topic modeling with provable guarantees. In International Conference on Machine Learning, pages 280–288.
- Bioucas-Dias (2009) Bioucas-Dias, J. M. (2009). A variable splitting augmented lagrangian approach to linear spectral unmixing. In Hyperspectral Image and Signal Processing: Evolution in Remote Sensing, 2009. WHISPERS’09. First Workshop on, pages 1–4. IEEE.
- Blei and Lafferty (2006) Blei, D. M. and Lafferty, J. D. (2006). Dynamic topic models. In Proceedings of the 23rd international conference on Machine learning, pages 113–120. ACM.
- Blei et al. (2003) Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent dirichlet allocation. Journal of machine Learning research, 3(Jan):993–1022.
- Deerwester et al. (1990) Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American society for information science, 41(6):391–407.
- Fu et al. (2018) Fu, X., Huang, K., Sidiropoulos, N. D., Shi, Q., and Hong, M. (2018). Anchor-free correlated topic modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Hoffman et al. (2013) Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. (2013). Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347.
- Huang et al. (2016) Huang, K., Fu, X., and Sidiropoulos, N. D. (2016). Anchor-free correlated topic modeling: Identifiability and algorithm. In Advances in Neural Information Processing Systems, pages 1786–1794.
- Lafferty and Blei (2006) Lafferty, J. D. and Blei, D. M. (2006). Correlated topic models. In Advances in neural information processing systems, pages 147–154.
- Li et al. (2014) Li, A. Q., Ahmed, A., Ravi, S., and Smola, A. J. (2014). Reducing the sampling complexity of topic models. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 891–900. ACM.
- Liu et al. (2017) Liu, T., Gong, M., and Tao, D. (2017). Large-cone nonnegative matrix factorization. IEEE Transactions on Neural Networks and Learning Systems, 28(9):2129–2142.
- Nascimento and Bioucas-Dias (2012) Nascimento, J. M. and Bioucas-Dias, J. M. (2012). Hyperspectral unmixing based on mixtures of dirichlet components. IEEE Transactions on Geoscience and Remote Sensing, 50(3):863–878.
- Nascimento and Dias (2005) Nascimento, J. M. and Dias, J. M. (2005). Vertex component analysis: A fast algorithm to unmix hyperspectral data. IEEE transactions on Geoscience and Remote Sensing, 43(4):898–910.
- Nguyen (2015) Nguyen, X. (2015). Posterior contraction of the population polytope in finite admixture models. Bernoulli, 21(1):618–646.
- Pritchard et al. (2000) Pritchard, J. K., Stephens, M., and Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics, 155(2):945–959.
- Tang et al. (2014) Tang, J., Meng, Z., Nguyen, X., Mei, Q., and Zhang, M. (2014). Understanding the limiting factors of topic modeling via posterior contraction analysis. In International Conference on Machine Learning, pages 190–198.
- Teh et al. (2005) Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2005). Sharing clusters among related groups: Hierarchical dirichlet processes. In Advances in neural information processing systems, pages 1385–1392.
- Xu et al. (2003) Xu, W., Liu, X., and Gong, Y. (2003). Document clustering based on non-negative matrix factorization. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval, pages 267–273. ACM.
- Yuan et al. (2015) Yuan, J., Gao, F., Ho, Q., Dai, W., Wei, J., Zheng, X., Xing, E. P., Liu, T.-Y., and Ma, W.-Y. (2015). Lightlda: Big topic models on modest computer clusters. In Proceedings of the 24th International Conference on World Wide Web, pages 1351–1361. International World Wide Web Conferences Steering Committee.
- Yurochkin et al. (2017) Yurochkin, M., Guha, A., and Nguyen, X. (2017). Conic scan-and-cover algorithms for nonparametric topic modeling. In Advances in Neural Information Processing Systems, pages 3881–3890.
- Yurochkin and Nguyen (2016) Yurochkin, M. and Nguyen, X. (2016). Geometric dirichlet means algorithm for topic inference. In Advances in Neural Information Processing Systems, pages 2505–2513.
Appendix A Algorithm Analysis
A.1 ADMM update derivation
For completeness, we derive the ADMM steps of the problem in (12). Given current iterates and ,
where we soft-threshold the matrix with the regularization parameter .
where and is the projection onto the set .
where we have that
We can derive this update, as it is a convex problem with a linear constraint. First, consider (16) without the linear constraint. Then, we can rewrite the unconstrained subproblem as
where and . Then we can solve the above problem element by element. Looking at the -th entry, we can take the derivative and set it to zero. That is
leading to the following quadratic formula
which has the solution
Then, using these diagonal elements , it follows that
We make the final adjustment to satisfy the linear constraint. Thus, the update is
A.2 Proof of Proposition 3.2
The first order conditions of the updates in Algorithm 1 give us
Note that the first-order condition for the constrained block is different, as it is an equality-constrained convex problem. Also, by the definitions of the auxiliary variables,
Then, combining these two sets of equations, we have that
Now consider a sequence of iterates with a limit point. By the last two equations of (19), the primal residuals vanish at the limit point. Therefore, the first two equations give us that
Lastly, using the third equation in (19), it follows that