Neural Sinkhorn Topic Model

08/12/2020, by He Zhao, et al.

In this paper, we present a new topic modelling approach via the theory of optimal transport (OT). Specifically, we represent a document with two distributions: a distribution over the words (doc-word distribution) and a distribution over the topics (doc-topic distribution). For one document, the doc-word distribution is the observed, sparse, low-level representation of the content, while the doc-topic distribution is the latent, dense, high-level representation of the same content. Learning a topic model can then be viewed as a process of minimising the transportation of the semantic information from one distribution to the other. This new viewpoint leads to a novel OT-based topic modelling framework, which enjoys appealing simplicity, effectiveness, and efficiency. Extensive experiments show that our framework significantly outperforms several state-of-the-art models in terms of both topic quality and document representations.


1 Introduction

As an unsupervised approach, topic modelling has enjoyed great success in automatic text analysis. In general, a topic model aims to discover a set of latent topics from a collection of documents, each of which describes an interpretable semantic concept. The most important element of a topic model is how it models the relationships among three key concepts: “document”, “word”, and “topic”. A document consists of multiple tokens, each of which corresponds to a word in the vocabulary (we use “token” to describe a term in a document and “word” to denote the term’s type in the vocabulary).

Following the Bag-of-Words model, a document is represented by a vector of word counts, indicating their occurrences. A topic is a distribution over all the words, where the words with the largest weights are used to describe its meaning. Given a set of topics, a document is endowed with a distribution over all the topics, which describes its semantic focuses. Many existing topic models follow the above settings, such as Latent Dirichlet Allocation (LDA) blei2003latent , while in this paper, we present a novel view of topic models from the angle of Optimal Transport (OT).

Specifically, for a document, we consider its content to be encoded by two representations: the observed representation, $\mathbf{x}$, a distribution over all the words in the vocabulary, and the latent representation, $\mathbf{z}$, a distribution over all the topics. $\mathbf{x}$ can be obtained by normalising a document’s word count vector, while $\mathbf{z}$ needs to be learned by a model. For a document collection, the vocabulary size (i.e., the number of unique words) can be very large but one individual document usually consists of a tiny subset of the words. Therefore, $\mathbf{x}$ is a sparse and low-level representation of the semantic information of a document. As the number of topics is much smaller than the vocabulary size, $\mathbf{z}$ is the relatively dense and high-level representation of the same content. Therefore, the learning of a topic model can be viewed as the process of finding proper topics for the document collection and a proper $\mathbf{z}$ for each document, which minimises the transportation of semantic information from $\mathbf{x}$ to $\mathbf{z}$.

Motivated by the above view, we propose a new topic model based on the theory of OT, which is a powerful tool for measuring the distance travelled in transporting the mass of one distribution to match another, given a specific cost function. To be specific, we first embed topics and words into an embedding space, where the semantic distance between a topic and a word is modelled by their physical distance in that space. Next, we compute the OT distance between $\mathbf{x}$ and $\mathbf{z}$, which are two discrete distributions on the support of words and topics, respectively. The cost function of the OT distance is defined according to the embedding distances between topics and words. In this way, the OT distance measures the transportation of the semantic information of a document from one representation to the other. Intuitively, consider a document that consists of a lot of “sport” words: a large amount of $\mathbf{x}$’s mass is put on those words. The OT distance between $\mathbf{x}$ and $\mathbf{z}$ is expected to be small if $\mathbf{z}$ assigns a large amount of its mass to “sport” topics, as transporting “sport” words to “sport” topics is less expensive, according to their semantic distances. With the recent development of computational OT (e.g., in cuturi2013sinkhorn ; frogner2015learning ; seguy2018large ; peyre2019computational ), it is feasible to minimise the OT distance in terms of $\mathbf{z}$, which semantically pushes $\mathbf{z}$ and $\mathbf{x}$ close to each other. Moreover, represented by the embeddings, topics themselves are learned by minimising the same OT distance in terms of the cost function cuturi2014ground ; sun2020learning . The above model construction and learning process lead to a novel topic modelling framework based on OT, which enjoys appealing properties and state-of-the-art performance.

Our contributions in this paper can be highlighted as follows: i) We provide a novel OT viewpoint of topic modelling, which is different from many previous models; ii) The viewpoint leads to a new topic modelling framework, which can be efficiently learned with recent developments in computational OT; iii) The connections between the proposed framework and previous models are comprehensively studied; iv) In extensive experiments, the proposed framework achieves significantly better performance in comparison with state-of-the-art topic models; v) The framework enjoys appealing simplicity, effectiveness, and efficiency, which facilitates further development of extensions and variants.

2 Neural Sinkhorn Topic Model

2.1 Reminders on optimal transport

OT distances have been widely used for the comparison of probabilities. Here we limit our discussion to OT for discrete distributions, although it applies to continuous distributions as well. Specifically, let us consider two probability vectors $\mathbf{z} \in \Delta^{K}$ and $\mathbf{x} \in \Delta^{V}$, where $\Delta^{D}$ denotes the $D$-dimensional simplex. The OT distance (to be precise, an OT distance becomes a “distance metric” in mathematics only if the cost function is induced from a distance metric; we call it “OT distance” to assist the readability of our paper) between the two probability vectors can be defined as:

$$d_{\mathbf{M}}(\mathbf{z}, \mathbf{x}) := \min_{\mathbf{P} \in U(\mathbf{z}, \mathbf{x})} \langle \mathbf{P}, \mathbf{M} \rangle, \qquad (1)$$

where $\langle \cdot, \cdot \rangle$ denotes the Frobenius dot-product; $\mathbf{M} \in \mathbb{R}_{\geq 0}^{K \times V}$ is the cost matrix/function of the transport; $\mathbf{P} \in \mathbb{R}_{> 0}^{K \times V}$ is the transport matrix/plan; $U(\mathbf{z}, \mathbf{x})$ denotes the transport polytope of $\mathbf{z}$ and $\mathbf{x}$, which is the polyhedral set of $K \times V$ matrices: $U(\mathbf{z}, \mathbf{x}) := \{\mathbf{P} \in \mathbb{R}_{> 0}^{K \times V} \mid \mathbf{P} \mathbf{1}_{V} = \mathbf{z}, \; \mathbf{P}^{\top} \mathbf{1}_{K} = \mathbf{x}\}$; and $\mathbf{1}_{D}$ is the $D$-dimensional vector of ones. Intuitively, if we consider two discrete random variables with marginals $\mathbf{z}$ and $\mathbf{x}$, the transport matrix $\mathbf{P}$ is a joint probability of the two variables, and $U(\mathbf{z}, \mathbf{x})$ is the set of all such joint probabilities. The above optimal transport distance can be computed by finding the optimal transport matrix $\mathbf{P}^{*}$. It is also noteworthy that the Wasserstein distance can be viewed as a specific case of the OT distances, where the cost matrix takes the form of a geodesic distance.
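As a concrete illustration of Eq. (1) (not part of the paper; the toy sizes and random inputs are our own), the exact OT distance can be computed by solving the linear program over the transport polytope directly, e.g., with SciPy:

```python
import numpy as np
from scipy.optimize import linprog

K, V = 3, 5
rng = np.random.default_rng(0)
M = rng.random((K, V))                      # cost matrix
z = np.ones(K) / K                          # doc-topic distribution
x = rng.random(V); x /= x.sum()             # doc-word distribution

# Equality constraints of U(z, x): P @ 1_V = z (row sums) and P.T @ 1_K = x (column sums).
A_eq = np.zeros((K + V, K * V))
for k in range(K):
    A_eq[k, k * V:(k + 1) * V] = 1.0        # row-sum constraints
for v in range(V):
    A_eq[K + v, v::V] = 1.0                 # column-sum constraints
b_eq = np.concatenate([z, x])

res = linprog(M.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print("exact OT distance d_M(z, x):", res.fun)
```

For realistic vocabulary sizes this linear program becomes expensive, which motivates the entropic regularisation discussed next.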

As directly optimising Eq. (1) can be time-consuming for large-scale problems, a regularised optimal transport distance with an entropic constraint is introduced in cuturi2013sinkhorn , named the Sinkhorn distance:

$$d_{\mathbf{M}, \alpha}(\mathbf{z}, \mathbf{x}) := \min_{\mathbf{P} \in U_{\alpha}(\mathbf{z}, \mathbf{x})} \langle \mathbf{P}, \mathbf{M} \rangle, \qquad (2)$$

where $U_{\alpha}(\mathbf{z}, \mathbf{x}) := \{\mathbf{P} \in U(\mathbf{z}, \mathbf{x}) \mid h(\mathbf{P}) \geq h(\mathbf{z}) + h(\mathbf{x}) - \alpha\}$, $h(\cdot)$ is the entropy function, and $\alpha \in [0, \infty]$.

As shown in cuturi2013sinkhorn , the Sinkhorn distance coincides with the original OT distance when $\alpha$ is large enough ($\alpha \rightarrow \infty$). When $\alpha = 0$, the Sinkhorn distance has the closed form of $d_{\mathbf{M}, 0}(\mathbf{z}, \mathbf{x}) = \mathbf{z}^{\top} \mathbf{M} \mathbf{x}$. To compute the Sinkhorn distance, a Lagrange multiplier $\lambda$ is introduced for the entropy constraint to minimise Eq. (2), i.e., $d^{\lambda}_{\mathbf{M}}(\mathbf{z}, \mathbf{x}) := \langle \mathbf{P}^{\lambda}, \mathbf{M} \rangle$ with $\mathbf{P}^{\lambda} := \operatorname{argmin}_{\mathbf{P} \in U(\mathbf{z}, \mathbf{x})} \langle \mathbf{P}, \mathbf{M} \rangle - \frac{1}{\lambda} h(\mathbf{P})$, resulting in the Sinkhorn algorithm, widely used for discrete OT problems.
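For reference, a minimal NumPy sketch of the Sinkhorn algorithm in the Lagrangian form above (our own implementation, not the paper's code; `lam`, the tolerance, and the iteration cap are illustrative values):

```python
import numpy as np

def sinkhorn_distance(z, x, M, lam=20.0, tol=5e-3, max_iter=1000):
    """Approximate d_M^lam(z, x) = <P^lam, M> for probability vectors z (K,) and x (V,)."""
    H = np.exp(-lam * M)                     # element-wise kernel e^{-lam * M}
    b = np.ones_like(x)
    for _ in range(max_iter):
        a = z / (H @ b)                      # scale rows so that P 1_V = z
        b_new = x / (H.T @ a)                # scale columns so that P^T 1_K = x
        if np.max(np.abs(b_new - b)) < tol:  # stop when the scaling vector stabilises
            b = b_new
            break
        b = b_new
    P = a[:, None] * H * b[None, :]          # transport plan diag(a) H diag(b)
    return float(np.sum(P * M))
```

The larger `lam` is, the closer the result is to the exact OT distance in Eq. (1), at the price of slower convergence and potential numerical underflow.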

2.2 Proposed model

Suppose the document collection to be analysed has $V$ unique words (i.e., the vocabulary size). For any document, its word count vector is denoted as $\mathbf{y} \in \mathbb{N}^{V}$ and the total number of tokens is $S := \sum_{v=1}^{V} y_v$. The document’s observed representation, i.e., the doc-word distribution, is a probability vector obtained by normalising the word count vector: $\mathbf{x} := \mathbf{y} / S \in \Delta^{V}$. Besides, the document’s latent representation in our model is a distribution over $K$ topics: $\mathbf{z} \in \Delta^{K}$. To push $\mathbf{z}$ towards $\mathbf{x}$ semantically, we propose the following minimisation of the OT distance between them:

$$\min_{\mathbf{z} \in \Delta^{K}} d_{\mathbf{M}}(\mathbf{z}, \mathbf{x}). \qquad (3)$$

Here $\mathbf{M} \in \mathbb{R}_{\geq 0}^{K \times V}$ is the cost matrix, where $m_{kv}$ indicates the semantic distance between topic $k$ and word $v$. Therefore, each row of $\mathbf{M}$ can be used to sort the importance of the words in the corresponding topic (the smaller $m_{kv}$, the more important word $v$ is to topic $k$), which is similar to the topic-word distributions in conventional topic models, except that the rows of $\mathbf{M}$ are unnormalised. With $K \times V$ parameters involved, $\mathbf{M}$ can be hard to learn for a large number of topics or a huge vocabulary. Therefore, we embed the topics and words into an $L$-dimensional space and specify the following cost function:

$$m_{kv} := 1 - \cos(\mathbf{g}_k, \mathbf{e}_v), \qquad (4)$$

where $\cos(\cdot, \cdot)$ is the cosine similarity; $\mathbf{g}_k \in \mathbb{R}^{L}$ and $\mathbf{e}_v \in \mathbb{R}^{L}$ are the embeddings of topic $k$ and word $v$, respectively. The embeddings are expected to capture the semantic information of the topics and words, and then the cost function is able to measure their semantic distances.
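A small sketch of the cost function in Eq. (4) (our own helper; `G` and `E` are assumed to be the stacked topic and word embeddings introduced below):

```python
import numpy as np

def cost_matrix(G, E, eps=1e-8):
    """Eq. (4): m_kv = 1 - cos(g_k, e_v). G: (K, L) topic embeddings, E: (V, L) word embeddings."""
    Gn = G / (np.linalg.norm(G, axis=1, keepdims=True) + eps)   # L2-normalise rows
    En = E / (np.linalg.norm(E, axis=1, keepdims=True) + eps)
    return 1.0 - Gn @ En.T                                      # entries lie in [0, 2]
```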

Figure 1: Demonstration of our proposed model.

The above construction reduces the parameter space from $K \times V$ to $(K + V) \times L$, which can be further reduced to $K \times L$ when pretrained word embeddings such as word2vec mikolov2013efficient and GloVe pennington2014glove are used. Importantly, this enables the model to incorporate word embeddings in a natural and straightforward way. For ease of presentation, we denote $\mathbf{G} \in \mathbb{R}^{K \times L}$ and $\mathbf{E} \in \mathbb{R}^{V \times L}$ as the collections of the embeddings of all topics and words, respectively. Instead of learning $\mathbf{z}$ individually for each document, we also propose to generate it by a nonlinear transformation of $\mathbf{x}$: $\mathbf{z} := f_{\mathbf{W}}(\mathbf{x})$, where $f_{\mathbf{W}}$ can be implemented by deep neural networks (with parameters $\mathbf{W}$) that output a distribution over the $K$ topics. Now we can rewrite Eq. (3) as:

$$\min_{\mathbf{W}, \mathbf{G}} d_{\mathbf{M}}\big(f_{\mathbf{W}}(\mathbf{x}), \mathbf{x}\big), \qquad (5)$$

where $\mathbf{M}$ is constructed from $\mathbf{G}$ and $\mathbf{E}$ by Eq. (4). We depict our proposed model in Figure 1.
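The following sketch shows one possible parameterisation of $f_{\mathbf{W}}$ (our assumption of a simple one-hidden-layer network with a softmax output; the model only requires that the output be a distribution over topics):

```python
import numpy as np

def softmax(u):
    u = u - u.max(axis=-1, keepdims=True)       # numerical stability
    e = np.exp(u)
    return e / e.sum(axis=-1, keepdims=True)

def doc_topic(X, W1, b1, W2, b2):
    """f_W(x): map normalised doc-word distributions X (B, V) to doc-topic distributions (B, K)."""
    h = np.maximum(X @ W1 + b1, 0.0)            # ReLU hidden layer
    return softmax(h @ W2 + b2)                 # rows lie on the K-dimensional simplex
```

Plugging `doc_topic` and `cost_matrix` into a Sinkhorn routine gives the objective in Eq. (5), which is then minimised with respect to the network weights and the topic embeddings.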

2.3 Connections to other topic models

Before discussing the connections, we first present topic models from the viewpoint of Nonnegative Matrix Factorisation (NMF). When applied to modelling documents, an NMF factorises a document’s word count vector into several latent factors (i.e., topics): $\mathbf{y} \approx \boldsymbol{\Phi} \mathbf{z}$, where $\mathbf{z}$ corresponds to the doc-topic distribution (named the factor score vector in NMFs) and $\boldsymbol{\Phi} \in \mathbb{R}_{\geq 0}^{V \times K}$ corresponds to the topic-word distributions (named the factor loading matrix). According to lee2001algorithms , NMFs can be learned by minimising the Kullback–Leibler (KL) divergence between $\mathbf{y}$ and $\boldsymbol{\Phi} \mathbf{z}$, i.e., $\mathrm{KL}(\mathbf{y} \,\|\, \boldsymbol{\Phi} \mathbf{z})$. It turns out that this is equivalent to building a generative model of $\mathbf{y}$ with the multinomial likelihood parameterised by $\boldsymbol{\Phi} \mathbf{z}$ and learning it by Maximum Likelihood Estimation (MLE). Many topic models can be presented in the above NMF framework; for example, in the newly-proposed Neural Topic Models (NTMs) (e.g., in miao2016neural ; srivastava2017autoencoding ; krishnan2018challenges ; burkhardt2019decoupling ) built on the framework of Variational AutoEncoders (VAEs) kingma2013auto ; rezende2014stochastic , $\mathbf{z}$ is generated from an encoder and $\boldsymbol{\Phi}$ is captured in the neural network weights of a decoder. NTMs are trained by maximising the Evidence Lower BOund (ELBO) in Variational Inference (VI), which consists of the maximisation of the multinomial likelihood as above and the minimisation of the KL divergence between the posterior and prior of $\mathbf{z}$ as the regulariser.
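As a toy check of the equivalence claimed above (our own construction with random arrays), the part of the generalised KL divergence $\mathrm{KL}(\mathbf{y} \,\|\, \boldsymbol{\Phi}\mathbf{z})$ that depends on the parameters is exactly the negative multinomial log-likelihood, so their sum stays constant when the parameters change:

```python
import numpy as np

rng = np.random.default_rng(0)
V, K = 30, 5
y = rng.integers(0, 5, size=V).astype(float)        # word counts of one toy document

def gkl(p, q):
    """Generalised KL divergence used in KL-NMF: sum p log(p/q) - p + q."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask])) - p.sum() + q.sum()

for _ in range(3):
    Phi = rng.dirichlet(np.ones(V), size=K).T        # (V, K), each column a topic-word distribution
    z = rng.dirichlet(np.ones(K))                    # doc-topic distribution
    yhat = Phi @ z
    loglik = np.sum(y * np.log(yhat))                # multinomial log-likelihood (up to a constant)
    print(gkl(y, yhat) + loglik)                     # the same value for every (Phi, z)
```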

Next, we present a theorem to reveal the relationships between other topic models and ours, whose proof is shown in the appendix.

Theorem 1.

Given that $\mathbf{M}$ is constructed by Eq. (4) (so that $m_{kv} \in [0, 2]$), by defining $\mathbf{r} := \mathrm{softmax}\big((2 - \mathbf{M})^{\top} \mathbf{z}\big)$, we have:

$$\mathbf{x} \log \mathbf{r}^{\top} \leq - d_{\mathbf{M}}(\mathbf{z}, \mathbf{x}), \qquad (6)$$

when $V \geq e^{2}$.

Note that $2 - \mathbf{M}$ serves a similar role to $\boldsymbol{\Phi}$ in NMFs, as it models the similarity between topics and words. This makes $\mathbf{r}$ correspond to the reconstruction $\boldsymbol{\Phi} \mathbf{z}$. With Theorem 1, we have:

Lemma 1.

Maximising the multinomial likelihood or minimising the KL divergence in topic models/NMFs is equivalent to minimising an upper bound of the OT distance in our model.
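A self-contained numerical check of the bound in Eq. (6) on random toy inputs (our own script; the exact OT distance is obtained with a small linear program):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
K, V = 4, 20                                        # V >= e^2, as required by Theorem 1
M = 1.0 - (2.0 * rng.random((K, V)) - 1.0)          # entries in [0, 2], mimicking 1 - cos
z = rng.dirichlet(np.ones(K))
x = rng.dirichlet(np.ones(V))

# r := softmax over words of (2 - M)^T z, as in Theorem 1.
logits = (2.0 - M).T @ z
r = np.exp(logits - logits.max()); r /= r.sum()
lhs = x @ np.log(r)                                 # x log r^T

# Exact OT distance d_M(z, x) as a linear program over the transport polytope.
A_eq = np.zeros((K + V, K * V))
for k in range(K):
    A_eq[k, k * V:(k + 1) * V] = 1.0                # row sums equal z
for v in range(V):
    A_eq[K + v, v::V] = 1.0                         # column sums equal x
res = linprog(M.ravel(), A_eq=A_eq, b_eq=np.concatenate([z, x]), bounds=(0, None))
print(lhs, "<=", -res.fun, bool(lhs <= -res.fun))   # the bound of Eq. (6) should hold
```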

More discussion on the connections will be provided in Section 2.4.

2.4 Learning

input : Input documents, pretrained word embeddings $\mathbf{E}$, topic number $K$, $\lambda$, $\gamma$
output : $\mathbf{W}$, $\mathbf{G}$
Randomly initialise $\mathbf{W}$ and $\mathbf{G}$;
while Not converged do
       Sample a batch of input documents $\mathbf{Y}$;
       Compute cost matrix $\mathbf{M}$ with $\mathbf{G}$ and $\mathbf{E}$ by Eq. (4);
       Compute doc-topic distributions $\mathbf{Z} := f_{\mathbf{W}}(\mathbf{X})$;
       # Sinkhorn iterations #
       $\mathbf{H} := e^{-\lambda \mathbf{M}}$;
       $\mathbf{b} := \mathbf{1}$;
       while $\mathbf{b}$ changes or any other relevant stopping criterion do
             $\mathbf{a} := \mathbf{z} \oslash (\mathbf{H} \mathbf{b})$ for each document in the batch;
             $\mathbf{b} := \mathbf{x} \oslash (\mathbf{H}^{\top} \mathbf{a})$ for each document in the batch;
       end while
       Compute the Sinkhorn distances $d^{\lambda}_{\mathbf{M}}(\mathbf{z}, \mathbf{x}) = \mathbf{a}^{\top} (\mathbf{H} \odot \mathbf{M}) \mathbf{b}$;
       Compute the multinomial likelihood loss given $\mathbf{Z}$, $\mathbf{M}$, and $\mathbf{Y}$;
       Compute the gradient of the loss in Eq. (7) in terms of $\{\mathbf{W}, \mathbf{G}\}$;
       Update $\{\mathbf{W}, \mathbf{G}\}$;
end while
Algorithm 1 Training algorithm for NSTM. $\mathbf{Y}$, $\mathbf{X}$, and $\mathbf{Z}$ consist of the word count vectors, their normalisations, and the doc-topic distributions of the documents in the batch, respectively; $\odot$ and $\oslash$ are the element-wise multiplication and division.
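The following PyTorch sketch mirrors one training step of Algorithm 1 (a hedged illustration with made-up sizes and data; the paper's actual implementation is in TensorFlow, and the architecture and hyperparameter values here are assumptions):

```python
import torch
import torch.nn as nn

V, K, L, B = 2000, 50, 50, 200           # vocab size, topics, embedding dim, batch size (toy values)
lam, gamma = 20.0, 1.0                   # Sinkhorn multiplier and likelihood weight (assumed values)

E = torch.randn(V, L)                                        # stand-in for pretrained word embeddings
G = nn.Parameter(torch.randn(K, L))                          # topic embeddings (learned)
encoder = nn.Sequential(nn.Linear(V, 200), nn.ReLU(), nn.Linear(200, K))
opt = torch.optim.Adam(list(encoder.parameters()) + [G], lr=1e-3)

Y = torch.randint(0, 3, (B, V)).float()                      # toy word-count vectors
X = Y / Y.sum(dim=1, keepdim=True).clamp(min=1.0)            # normalised doc-word distributions

# Cost matrix by Eq. (4) and doc-topic distributions Z = f_W(X).
Gn = G / G.norm(dim=1, keepdim=True)
En = E / E.norm(dim=1, keepdim=True)
M = 1.0 - Gn @ En.T                                          # (K, V)
Z = torch.softmax(encoder(X), dim=1)                         # (B, K)

# Batched Sinkhorn iterations: a, b rescale exp(-lam*M) to match the marginals Z and X.
H = torch.exp(-lam * M)                                      # (K, V)
b = torch.ones(B, V)
for _ in range(50):
    a = Z / (b @ H.T + 1e-12)                                # (B, K)
    b = X / (a @ H + 1e-12)                                  # (B, V)
sinkhorn = ((a @ (H * M)) * b).sum(dim=1)                    # <P, M> per document

# Multinomial likelihood term with r = softmax((2 - M)^T z) over the vocabulary.
log_r = torch.log_softmax(Z @ (2.0 - M), dim=1)              # (B, V)
loss = (sinkhorn - gamma * (Y * log_r).sum(dim=1)).mean()

opt.zero_grad(); loss.backward(); opt.step()
```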

As the multinomial likelihood is easier to maximise and can be helpful in guiding the optimisation of the original OT distance, inspired by Theorem 1, we propose to add it to the learning objective of the model. To assist the learning, we also replace the OT distance with the Sinkhorn distance cuturi2013sinkhorn , which leads to the final training objective:

$$\min_{\mathbf{W}, \mathbf{G}} \sum_{\text{documents}} \Big( d^{\lambda}_{\mathbf{M}}(\mathbf{z}, \mathbf{x}) - \gamma \, \mathbf{y} \log \mathbf{r}^{\top} \Big), \qquad (7)$$

where $\mathbf{z} := f_{\mathbf{W}}(\mathbf{x})$ is parameterised by $\mathbf{W}$; $\mathbf{M}$ is parameterised by $\mathbf{G}$ (with the pretrained $\mathbf{E}$ fixed); $\mathbf{r} := \mathrm{softmax}\big((2 - \mathbf{M})^{\top} \mathbf{z}\big)$ is parameterised by $\mathbf{W}$ and $\mathbf{G}$; $\mathbf{y}$ and $\mathbf{x}$ are the word count vector and its normalisation, respectively; and $\lambda$ and $\gamma$ are hyperparameters.

Specifically, $\gamma$ is the weight of the multinomial likelihood. If $\gamma$ is large, the model approximately reduces to maximum likelihood estimation. This is similar to learning a Probabilistic Latent Semantic Analysis (PLSA) hofmann1999probabilistic model without any priors by MLE, which usually leads to less diverse topics wallach2009rethinking . Moreover, $\lambda$ controls the strength of the entropic regularisation in the Sinkhorn distance. When the entropy constraint is tightest ($\alpha = 0$), the feasible set $U_{0}(\mathbf{z}, \mathbf{x})$ is the singleton $\{\mathbf{z} \mathbf{x}^{\top}\}$ (cuturi2013sinkhorn , Appendix). Therefore, if $\lambda$ is too small, the Sinkhorn distance reduces to something similar to PLSA with MLE as well. We will empirically study these parameters in the experiment section.

To compute the Sinkhorn distance, we leverage the Sinkhorn algorithm cuturi2013sinkhorn . Accordingly, we name our model Neural Sinkhorn Topic Model (NSTM), whose training algorithm is shown in Algorithm 1. After training the model, we can infer the doc-topic distribution of a document by conducting a forward pass of the neural network with the input $\mathbf{x}$.

3 Related works

In topic modelling, there have been considerable Bayesian hierarchical extensions to LDA blei2003latent , such as in blei2010nested ; paisley2015nested ; gan2015learning ; zhou2016augmentable . Compared with these models, due to the nature of the proposed NSTM, we consider Neural Topic Models (NTMs), a recent update of topic modelling with deep generative models, as a closer line of work to ours. Mainly based on the framework of VAEs, NTMs such as in miao2016neural ; srivastava2017autoencoding ; krishnan2018challenges ; card2018neural ; burkhardt2019decoupling use an encoder that takes $\mathbf{x}$ as input and approximates the posterior of $\mathbf{z}$. The posterior samples are further input into a decoder to generate data. Although NSTM uses neural networks to generate $\mathbf{z}$, NSTM is based on OT instead of VAEs or VI and is thus different from the above NTMs in terms of both the modelling and learning processes.

Recently, word embeddings have been widely-used as complementary metadata for topic models, especially for modelling short texts. For Bayesian probabilistic topic models, word embeddings are usually incorporated into the generative process of word counts, such as in petterson2010word ; nguyen2015improving ; li2016topic ; zhao2017word ; dieng2019topic . Due to the flexibility of NTMs, word embeddings can be incorporated as part of the encoder input, which helps the inference of the doc-topic distributions. Our novelty with NSTM is that word embeddings are leveraged to define the cost function of the OT distance between words and topics.

Finally, to our knowledge, the works that connect topic modelling with OT are still very limited. In yurochkin2019hierarchical , it is proposed to compare two documents’ similarity with the OT distance between their doc-topic distributions extracted from a pretrained LDA, but the aim is not to learn a topic model. Another recent work related to ours is Wasserstein LDA (WLDA) nan2019topic , which adapts the framework of Wasserstein AutoEncoders (WAEs) tolstikhin2017wasserstein . The key difference from ours is that WLDA minimises the Wasserstein distance between the fake data generated with topics and the real data, which can be viewed as an OT variant of VAEs. In contrast, our NSTM directly minimises the OT distance between doc-topic and doc-word distributions, without a generative process from topics to data. The most related work to ours is Distilled Wasserstein Learning (DWL) xu2018distilled , which adapts the idea of Wasserstein barycentres and Wasserstein Dictionary Learning schmitz2018wasserstein . There are fundamental differences between DWL and ours in terms of the relations between documents, topics, and words. Specifically, in DWL, documents and topics live in one space of words (i.e., both are distributions over words) and a doc-word distribution can be approximated with the weighted Wasserstein barycentres of all the topic-word distributions, where the weights can be interpreted as the topic proportions of the document. However, in NSTM, a document lives in both the topic space and the word space, while topics and words are embedded in an additional embedding space. These differences lead to different views of topic modelling and different frameworks as well. Moreover, DWL mainly focuses on learning word embeddings and representations for International Classification of Diseases (ICD) codes, while NSTM aims to be a general method of topic modelling.

4 Experiments

We conduct extensive experiments on several benchmark text datasets to evaluate the performance of NSTM against the state-of-the-art neural topic models.

4.1 Experimental settings

Datasets:  Our experiments are conducted on five widely-used benchmark text datasets of varying sizes: 20 News Groups (20NG, http://qwone.com/~jason/20Newsgroups/); Web Snippets (WS) phan2008learning ; Tag My News (TMN) vitale2012classification (http://acube.di.unipi.it/tmn-dataset/); Reuters, extracted from the Reuters-21578 dataset (https://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html); and Reuters Corpus Volume 2 (RCV2) lewis2004rcv1 (https://trec.nist.gov/data/reuters/reuters.html). The statistics of these datasets are shown in Table 1. In particular, the document lengths of WS and TMN are relatively short compared with the others; 20NG, WS, and TMN are associated with document labels (we do not consider the labels of Reuters and RCV2 as there are multiple labels for one document).

Dataset Number of docs Vocabulary size (V) Total number of words Number of labels
20NG 18,846 22,636 2,037,671 20
WS 12,337 100,052 192,483 8
TMN 32,597 13,368 592,973 7
Reuters 11,367 8,817 836,397 N/A
RCV2 804,414 7,282 60,209,009 N/A
Table 1: Statistics of the datasets

Evaluation metrics:  We report Topic Coherence (TC) and Topic Diversity (TD) as performance metrics for topic quality. TC measures the semantic coherence of the most significant words (top words) of a topic, given a reference corpus. We apply the widely-used Normalized Pointwise Mutual Information (NPMI) aletras2013evaluating ; lau2014machine computed over the top 10 words of each topic, using the Palmetto package roder2015exploring (http://palmetto.aksw.org). For a model on one dataset, we report the average score over the top 50% of topics with the highest NPMI, where “rubbish” topics are eliminated, following yang2015efficient ; zhao2018dirichlet . TD, as its name implies, measures how diverse the discovered topics are. We define topic diversity to be the percentage of unique words in the top 25 words dieng2019topic of the top 50% of topics with the highest NPMI, similarly to topic coherence. TD close to 0 indicates redundant topics; TD close to 1 indicates more varied topics. As doc-topic distributions can be viewed as unsupervised document representations, to evaluate the quality of such representations, we perform a document clustering task and report the purity and Normalized Mutual Information (NMI) manning2008introduction on 20NG, WS, and TMN, where document labels are available. A dataset is split into 80% training and 20% testing documents. We then train a model on the training documents and infer the doc-topic distributions on the testing documents. The most significant topic of a testing document is used as its clustering assignment to compute purity and NMI. Note that our goal is not to achieve state-of-the-art document clustering results but to compare the document representations of topic models. Finally, higher values of the four metrics indicate better performance.
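A sketch of the TD and clustering metrics described above (our own helper functions based on the standard definitions; NPMI itself is delegated to the Palmetto service):

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def topic_diversity(topic_word_scores, top_n=25):
    """topic_word_scores: (K', V) scores of the selected topics; TD = fraction of unique top words."""
    top = np.argsort(-topic_word_scores, axis=1)[:, :top_n]
    return len(np.unique(top)) / top.size

def purity(y_true, y_pred):
    """Each cluster votes for its majority label; purity is the fraction of correctly assigned docs."""
    total = 0
    for c in np.unique(y_pred):
        _, counts = np.unique(y_true[y_pred == c], return_counts=True)
        total += counts.max()
    return total / len(y_true)

# Example usage (doc_topic: (N, K) inferred doc-topic distributions, labels: (N,) classes):
# y_pred = doc_topic.argmax(axis=1)            # most significant topic as cluster assignment
# print(purity(labels, y_pred), normalized_mutual_info_score(labels, y_pred))
```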

Baseline methods and their settings:  We select state-of-the-art models that are closely related to ours (we are unable to compare with DWL xu2018distilled as its code is not publicly available): LDA with Products of Experts (ProdLDA) srivastava2017autoencoding , which replaces the mixture model in LDA with a product of experts and uses autoencoded VI for training; Dirichlet VAE (DVAE) burkhardt2019decoupling , which is a neural topic model imposing a Dirichlet prior on $\mathbf{z}$ (we use the variant of DVAE with rejection sampling VI, which is reported to perform the best); Embedding Topic Model (ETM) dieng2019topic , which is a topic model that incorporates word embeddings and is learned by autoencoded VI; and Wasserstein LDA (WLDA) nan2019topic , which is a WAE-based topic model. For all the above baselines, we use their official TensorFlow/PyTorch/MXNet code with the best reported settings.

Settings for NSTM:

NSTM is implemented with TensorFlow. For $f_{\mathbf{W}}$, to keep things simple, we use a fully-connected neural network with one hidden layer of 200 units and ReLU as the activation function, following the settings of burkhardt2019decoupling . For the Sinkhorn algorithm, following cuturi2013sinkhorn , the maximum number of iterations is 1,000 and the stop tolerance is 0.005 (the Sinkhorn algorithm usually reaches the stop tolerance in fewer than 50 iterations in NSTM). In all the experiments on all the datasets, we fix $\lambda$ and $\gamma$. We further vary the two parameters to study our model’s sensitivity to them in Figure 2. The optimisation of NSTM is done by Adam kingma2014adam with learning rate 0.001 and batch size 200 for at most 50 iterations. For NSTM and ETM, the 50-dimensional GloVe word embeddings pennington2014glove pretrained on Wikipedia (https://nlp.stanford.edu/projects/glove/) are used. For all the methods and datasets, we use the same number of topics $K$.
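For completeness, a sketch of preparing the word-embedding matrix $\mathbf{E}$ from the public GloVe release (our own loader; the file name and the random fallback for out-of-vocabulary words are assumptions):

```python
import numpy as np

def load_glove_embeddings(path, vocab, dim=50):
    """Return E of shape (V, dim) for the given vocabulary list; missing words get small random vectors."""
    table = {}
    with open(path, encoding="utf8") as f:
        for line in f:
            parts = line.rstrip().split(" ")        # GloVe format: word followed by its vector values
            table[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    rng = np.random.default_rng(0)
    E = np.stack([table.get(w, 0.01 * rng.standard_normal(dim).astype(np.float32)) for w in vocab])
    return E

# E = load_glove_embeddings("glove.6B.50d.txt", vocab)   # hypothetical local path and vocabulary
```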

4.2 Results

20NG WS TMN Reuters RCV2 sum
ProdLDA -0.073±0.004 -0.031±0.073 -0.053±0.056 -0.067±0.005 -0.194±0.002 -0.419
DVAE 0.013±0.007 0.082±0.010 0.096±0.011 -0.010±0.003 -0.026±0.009 0.155
ETM 0.019±0.008 0.003±0.014 0.115±0.011 -0.012±0.008 0.067±0.005 0.192
WLDA 0.019±0.009 0.040±0.007 0.036±0.013 -0.060±0.007 -0.088±0.005 -0.052
NSTM 0.102±0.003 0.122±0.005 0.137±0.008 0.068±0.011 0.117±0.003 0.548
Table 2: Topic coherence.
20NG WS TMN Reuters RCV2 sum
ProdLDA 0.822±0.012 0.761±0.120 0.796±0.148 0.714±0.034 0.444±0.007 3.539
DVAE 0.678±0.019 0.549±0.018 0.632±0.028 0.576±0.014 0.623±0.186 3.060
ETM 0.550±0.026 0.584±0.025 0.264±0.026 0.479±0.034 0.501±0.020 2.380
WLDA 0.383±0.028 0.220±0.042 0.369±0.181 0.296±0.009 0.951±0.018 2.222
NSTM 0.646±0.031 0.944±0.007 0.965±0.004 0.705±0.025 0.638±0.006 3.900
Table 3: Topic diversity.
Purity NMI
20NG WS TMN 20NG WS TMN
ProdLDA 0.417±0.004 0.293±0.023 0.405±0.157 0.321±0.004 0.066±0.016 0.091±0.101
DVAE 0.281±0.006 0.284±0.005 0.477±0.012 0.187±0.005 0.059±0.001 0.113±0.004
ETM 0.063±0.003 0.215±0.001 0.556±0.022 0.005±0.005 0.003±0.003 0.328±0.010
WLDA 0.117±0.001 0.239±0.003 0.260±0.002 0.060±0.001 0.026±0.001 0.009±0.001
NSTM 0.477±0.011 0.451±0.009 0.637±0.010 0.415±0.012 0.201±0.004 0.334±0.004
Table 4: Purity and NMI for document clustering.

For all the models in comparison, we run them five times with different random seeds and report the mean and standard deviation (as error bars). We show the results of topic coherence and diversity in Tables 2 and 3, respectively. The rightmost column of each of the two tables is the sum of the values over all the datasets, aiming to provide an overall view of the performance. The best and second-best scores of each dataset are highlighted in boldface and with an underline, respectively. For TC, our proposed NSTM outperforms the others significantly, on every individual dataset as well as on the sum over all datasets. For TD, NSTM takes either the best or the second-best place on four out of five datasets, indicating its ability to discover more diverse topics. Moreover, it achieves the best sum for topic diversity. The results of the document clustering experiment are shown in Table 4. It can be observed that NSTM again performs the best among the compared models. This demonstrates that NSTM is not only able to discover interpretable topics with better quality but also learns good document representations for clustering.

Instead of fixing the values of $\lambda$ and $\gamma$ as in the previous experiments, we report the performance of NSTM on 20NG (blue lines) under different settings of the two hyperparameters in Figure 2. Moreover, we also propose a variant of NSTM that removes the Sinkhorn distance term in the training loss of Eq. (7) (i.e., only the multinomial likelihood term is left). This variant mimics the case where $\gamma$ is extremely large, and its performance is shown as the red lines. It can be observed that with larger $\gamma$ or smaller $\lambda$, purity and NMI become higher, indicating better quality of the document representations. This is reasonable because larger $\gamma$ or smaller $\lambda$ pushes NSTM closer to PLSA with MLE, which essentially tries its best to reconstruct the word counts of documents. However, this also leads to less diverse topics, in line with the analysis in Section 2.4. Without the Sinkhorn distance, the variant does not perform as well as the original NSTM, especially in terms of TC and TD.

Recall that each topic in our model is embedded in the embedding space. To qualitatively examine the topic embeddings of NSTM, we show their t-SNE maaten2008visualizing visualisation in Figure 3. Specifically, we select the top 50 topics with the highest NPMI learned by a run of NSTM on RCV2 and feed their (50-dimensional) embeddings into the t-SNE method, which reduces the dimensions to 2. We also show the top five words and the topic number (1 to 50) of each topic. In Figure 3, we can observe that although the words of the topics are different, the semantic similarity between the topics captured by the embeddings is highly interpretable. For example, the group with topics 20, 23, and 40 focuses on financial and legal aspects, while the group of topics 26, 19, 31, and 10 is about sport. More qualitative analysis of the topics is provided in the appendix.
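For reference, the visualisation can be reproduced along the following lines (an illustrative sketch, not the authors' plotting code; `G` and `top_words` are assumed to hold the learned topic embeddings and their top words):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_topic_embeddings(G, top_words, perplexity=10):
    """Project (K, L) topic embeddings to 2-D with t-SNE and annotate each topic with its top words."""
    coords = TSNE(n_components=2, perplexity=perplexity, init="pca",
                  random_state=0).fit_transform(G)
    plt.scatter(coords[:, 0], coords[:, 1], c="red", s=15)
    for i, (px, py) in enumerate(coords):
        plt.annotate(f"{i + 1}: " + " ".join(top_words[i][:5]), (px, py), fontsize=6)
    plt.show()
```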

Figure 2: Parameter sensitivity of NSTM on 20News: TC, TD, purity, and NMI under varying settings of the two hyperparameters (panels (a)–(h)).
Figure 3: t-SNE visualisation of topic embeddings. One red dot represents a topic. The top 5 words and the topic number (1 to 50) of each topic are also shown.

5 Conclusion

In this paper, we have presented a novel topic modelling framework based on optimal transport, where a document is endowed with two representations: a doc-word distribution, $\mathbf{x}$, and a doc-topic distribution, $\mathbf{z}$. An OT distance is leveraged to compare the semantic distance between the two distributions, whose cost function is defined according to the cosine similarities between topics and words in the embedding space. $\mathbf{z}$ is obtained from a neural network that takes $\mathbf{x}$ as input and is trained by minimising the OT distance between $\mathbf{z}$ and $\mathbf{x}$. With pretrained word embeddings, topic embeddings are learned by the same minimisation of the OT distance in terms of the cost function. We have provided theoretical analysis of the connections of the proposed framework with previous models. In addition, extensive experiments have been conducted, showing that our model achieves state-of-the-art performance on both discovering quality topics and deriving useful document representations. Thanks to the flexibility and simplicity of the framework, future work will be on developing extensions and variants that discover more complex topic patterns, e.g., similar to Correlated Topic Models lafferty2006correlated and Dynamic Topic Models blei2006dynamic .

References

  • [1] N. Aletras and M. Stevenson. Evaluating topic coherence using distributional semantics. In International Conference on Computational Semantics, pages 13–22, 2013.
  • [2] D. M. Blei, T. L. Griffiths, and M. I. Jordan. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM, 57(2):7, 2010.
  • [3] D. M. Blei and J. D. Lafferty. Dynamic topic models. In ICML, pages 113–120, 2006.
  • [4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. JMLR, 3:993–1022, 2003.
  • [5] S. Burkhardt and S. Kramer. Decoupling sparsity and smoothness in the Dirichlet variational autoencoder topic model. JMLR, 20(131):1–27, 2019.
  • [6] D. Card, C. Tan, and N. A. Smith. Neural models for documents with metadata. In ACL, pages 2031–2040, 2018.
  • [7] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In NIPS, pages 2292–2300, 2013.
  • [8] M. Cuturi and D. Avis. Ground metric learning. JMLR, 15(1):533–564, 2014.
  • [9] A. B. Dieng, F. J. Ruiz, and D. M. Blei. Topic modeling in embedding spaces. arXiv preprint arXiv:1907.04907, 2019.
  • [10] C. Frogner, C. Zhang, H. Mobahi, M. Araya, and T. A. Poggio. Learning with a Wasserstein loss. In NIPS, pages 2053–2061, 2015.
  • [11] Z. Gan, R. Henao, D. Carlson, and L. Carin. Learning deep sigmoid belief networks with data augmentation. In AISTATS, pages 268–276, 2015.
  • [12] T. Hofmann. Probabilistic latent semantic analysis. In UAI, pages 289–296, 1999.
  • [13] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [14] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. ICLR, 2013.
  • [15] R. Krishnan, D. Liang, and M. Hoffman. On the challenges of learning with inference networks on sparse, high-dimensional data. In AISTATS, pages 143–151, 2018.
  • [16] J. D. Lafferty and D. M. Blei. Correlated topic models. In NIPS, pages 147–154, 2006.
  • [17] J. H. Lau, D. Newman, and T. Baldwin. Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality. In EACL, pages 530–539, 2014.
  • [18] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In NIPS, pages 556–562, 2001.
  • [19] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. JMLR, 5(Apr):361–397, 2004.
  • [20] C. Li, H. Wang, Z. Zhang, A. Sun, and Z. Ma. Topic modeling for short texts with auxiliary word embeddings. In SIGIR, pages 165–174, 2016.
  • [21] L. v. d. Maaten and G. Hinton. Visualizing data using t-SNE. JMLR, 9(Nov):2579–2605, 2008.
  • [22] C. D. Manning, P. Raghavan, H. Schütze, et al. Introduction to Information Retrieval. Cambridge University Press, Cambridge, 2008.
  • [23] Y. Miao, L. Yu, and P. Blunsom. Neural variational inference for text processing. In ICML, pages 1727–1736, 2016.
  • [24] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. ICLR, 2013.
  • [25] F. Nan, R. Ding, R. Nallapati, and B. Xiang. Topic modeling with Wasserstein autoencoders. In ACL, pages 6345–6381, 2019.
  • [26] D. Q. Nguyen, R. Billingsley, L. Du, and M. Johnson. Improving topic models with latent feature word representations. TACL, 3:299–313, 2015.
  • [27] J. Paisley, C. Wang, D. M. Blei, and M. I. Jordan. Nested hierarchical Dirichlet processes. TPAMI, 37(2):256–270, 2015.
  • [28] J. Pennington, R. Socher, and C. Manning. GloVe: Global vectors for word representation. In EMNLP, pages 1532–1543, 2014.
  • [29] J. Petterson, W. Buntine, S. M. Narayanamurthy, T. S. Caetano, and A. J. Smola. Word features for latent Dirichlet allocation. In NIPS, pages 1921–1929, 2010.
  • [30] G. Peyré, M. Cuturi, et al. Computational optimal transport. Foundations and Trends® in Machine Learning, 11(5-6):355–607, 2019.
  • [31] X.-H. Phan, L.-M. Nguyen, and S. Horiguchi. Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In WWW, pages 91–100, 2008.
  • [32] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, pages 1278–1286, 2014.
  • [33] M. Röder, A. Both, and A. Hinneburg. Exploring the space of topic coherence measures. In WSDM, pages 399–408, 2015.
  • [34] M. A. Schmitz, M. Heitz, N. Bonneel, F. Ngole, D. Coeurjolly, M. Cuturi, G. Peyré, and J.-L. Starck. Wasserstein dictionary learning: Optimal transport-based unsupervised nonlinear dictionary learning. SIAM Journal on Imaging Sciences, 11(1):643–678, 2018.
  • [35] V. Seguy, B. B. Damodaran, R. Flamary, N. Courty, A. Rolet, and M. Blondel. Large scale optimal transport and mapping estimation. ICLR, 2018.
  • [36] A. Srivastava and C. Sutton. Autoencoding variational inference for topic models. ICLR, 2017.
  • [37] H. Sun, H. Zhou, H. Zha, and X. Ye. Learning cost functions for optimal transport. arXiv preprint arXiv:2002.09650, 2020.
  • [38] I. Tolstikhin, O. Bousquet, S. Gelly, and B. Schoelkopf. Wasserstein auto-encoders. ICLR, 2018.
  • [39] D. Vitale, P. Ferragina, and U. Scaiella. Classification of short texts by deploying topical annotations. In ECIR, pages 376–387, 2012.
  • [40] H. M. Wallach, D. M. Mimno, and A. McCallum. Rethinking LDA: Why priors matter. In NIPS, pages 1973–1981, 2009.
  • [41] H. Xu, W. Wang, W. Liu, and L. Carin. Distilled Wasserstein learning for word embedding and topic modeling. In NeurIPS, pages 1716–1725, 2018.
  • [42] Y. Yang, D. Downey, and J. Boyd-Graber. Efficient methods for incorporating knowledge into topic models. In EMNLP, pages 308–317, 2015.
  • [43] M. Yurochkin, S. Claici, E. Chien, F. Mirzazadeh, and J. M. Solomon. Hierarchical optimal transport for document representation. In NeurIPS, pages 1599–1609, 2019.
  • [44] H. Zhao, L. Du, and W. Buntine. A word embeddings informed focused topic model. In ACML, pages 423–438, 2017.
  • [45] H. Zhao, L. Du, W. Buntine, and M. Zhou. Dirichlet belief networks for topic structure learning. In NeurIPS, pages 7966–7977, 2018.
  • [46] M. Zhou, Y. Cong, and B. Chen. Augmentable gamma belief networks. JMLR, 17(163):1–44, 2016.

Appendix

Proof of Theorem 1

Proof.

Before showing the proof, we introduce the following notation: we denote $k$ and $v$ as the indices of a topic and a word, respectively; the $s$-th ($s \in \{1, \dots, S\}$) token of the document picks a word in the vocabulary, denoted by $w_s$; the normaliser in the softmax function of $\mathbf{r}$ is denoted as $Z^{\mathbf{r}} := \sum_{v=1}^{V} e^{\sum_{k=1}^{K} z_k (2 - m_{kv})}$, so that $r_v = e^{\sum_{k=1}^{K} z_k (2 - m_{kv})} / Z^{\mathbf{r}}$.

With these notations, we first have:

$$\mathbf{x} \log \mathbf{r}^{\top} = \frac{1}{S} \sum_{s=1}^{S} \log r_{w_s} = \frac{1}{S} \sum_{s=1}^{S} \Big( \sum_{k=1}^{K} z_k (2 - m_{k w_s}) - \log Z^{\mathbf{r}} \Big) = 2 - \log Z^{\mathbf{r}} - \frac{1}{S} \sum_{s=1}^{S} \sum_{k=1}^{K} z_k m_{k w_s}. \qquad (8)$$

Recall that in Eq. (1), the transport matrix $\mathbf{P}$ is one of the joint distributions of the topic index and the word index. We introduce the conditional distribution of a topic $k$ given a word $v$ as $q(k|v)$, where $q(k|v)$ indicates the probability of assigning a token of word $v$ to topic $k$.

Given that $\mathbf{P}$ satisfies $\mathbf{P} \mathbf{1}_{V} = \mathbf{z}$ and $\mathbf{P}^{\top} \mathbf{1}_{K} = \mathbf{x}$, the matrix $\mathbf{Q} := [q(k|v)]$ must satisfy $\mathbf{Q} \in U'(\mathbf{z}, \mathbf{x}) := \{\mathbf{Q} \in \mathbb{R}_{\geq 0}^{K \times V} \mid \sum_{v=1}^{V} x_v q(k|v) = z_k, \; \sum_{k=1}^{K} q(k|v) = 1\}$. With $p_{kv} = x_v q(k|v)$, we can rewrite the OT distance as:

$$d_{\mathbf{M}}(\mathbf{z}, \mathbf{x}) = \min_{\mathbf{Q} \in U'(\mathbf{z}, \mathbf{x})} \sum_{k=1, v=1}^{K, V} x_v q(k|v) m_{kv} = \frac{1}{S} \min_{\mathbf{Q} \in U'(\mathbf{z}, \mathbf{x})} \sum_{k=1}^{K} \sum_{s=1}^{S} q(k|w_s) m_{k w_s}.$$

If we let $q(k|v) = z_k$ for all $v$, meaning that all the tokens of a document are assigned to the topics according to the document’s doc-topic distribution, then $\mathbf{Q}$ satisfies the constraints of $U'(\mathbf{z}, \mathbf{x})$, which leads to:

$$d_{\mathbf{M}}(\mathbf{z}, \mathbf{x}) \leq \frac{1}{S} \sum_{k=1}^{K} \sum_{s=1}^{S} z_k m_{k w_s}.$$

Together with Eq. (8), the definition of $\mathbf{M}$ in Eq. (4) (so that $m_{kv} \in [0, 2]$), and the fact that $2 - \log Z^{\mathbf{r}} = -\log\big(\sum_{v=1}^{V} e^{-\sum_{k=1}^{K} z_k m_{kv}}\big)$, we have:

$$\mathbf{x} \log \mathbf{r}^{\top} = 2 - \log Z^{\mathbf{r}} - \frac{1}{S} \sum_{s=1}^{S} \sum_{k=1}^{K} z_k m_{k w_s} \leq -\log\Big(\sum_{v=1}^{V} e^{-\sum_{k=1}^{K} z_k m_{kv}}\Big) - d_{\mathbf{M}}(\mathbf{z}, \mathbf{x}) \leq -(\log V - 2) - d_{\mathbf{M}}(\mathbf{z}, \mathbf{x}) \leq -d_{\mathbf{M}}(\mathbf{z}, \mathbf{x}),$$

where the last inequality holds if $\log V \geq 2$, i.e., $V \geq e^{2}$. ∎