MetaLDA: a Topic Model that Efficiently Incorporates Meta information

09/19/2017 · He Zhao, et al. · Monash University

Besides the text content, documents and their associated words usually come with rich sets of meta information, such as categories of documents and semantic/syntactic features of words, like those encoded in word embeddings. Incorporating such meta information directly into the generative process of topic models can improve modelling accuracy and topic quality, especially in the case where the word co-occurrence information in the training data is insufficient. In this paper, we present a topic model, called MetaLDA, which is able to leverage either document or word meta information, or both of them jointly. With two data augmentation techniques, we can derive an efficient Gibbs sampling algorithm, which benefits from the full local conjugacy of the model. Moreover, the algorithm is favoured by the sparsity of the meta information. Extensive experiments on several real-world datasets demonstrate that our model achieves comparable or improved performance in terms of both perplexity and topic quality, particularly in handling sparse texts. In addition, compared with other models using meta information, our model runs significantly faster.


I Introduction

With the rapid growth of the internet, huge amounts of text data are generated in social networks, online shopping and news websites, etc. These data create demand for powerful and efficient text analysis techniques. Probabilistic topic models such as Latent Dirichlet Allocation (LDA) [1] are popular approaches for this task, discovering latent topics from text collections. Many conventional topic models discover topics purely based on word co-occurrences, ignoring the meta information (a.k.a., side information) associated with the content. In contrast, when we humans read text, it is natural to leverage meta information such as categories, authors, timestamps, and the semantic meanings of the words to improve our comprehension. Therefore, topic models capable of using meta information should yield improved modelling accuracy and topic quality.

In practice, various kinds of meta information are available at the document level and the word level in many corpora. At the document level, labels of documents can be used to guide topic learning so that more meaningful topics can be discovered. Moreover, it is highly likely that documents with common labels discuss similar topics, which could further result in similar topic distributions. For example, if we use authors as labels for scientific papers, the topics of the papers published by the same researcher can be closely related.

At the word level, different semantic/syntactic features are also accessible, for example, features regarding word relationships, such as synonyms obtained from WordNet [2], word co-occurrence patterns obtained from a large corpus, and linked concepts from knowledge graphs. It is preferable that words having similar meanings but different morphological forms, like “dog” and “puppy”, are assigned to the same topic, even if they barely co-occur in the modelled corpus. Recently, word embeddings generated by GloVe [3] and word2vec [4] have attracted a lot of attention in natural language processing and related fields. It has been shown that word embeddings capture both the semantic and syntactic features of words, so that similar words are close to each other in the embedding space. It is therefore reasonable to expect that word embeddings can improve topic modelling [5, 6].

Conventional topic models can suffer from a large performance degradation on short texts (e.g., tweets and news headlines) because of insufficient word co-occurrence information. In such cases, meta information of documents and words can play an important role in analysing short texts by compensating for the lost information in word co-occurrences. At the document level, for example, tweets are usually associated with hashtags, users, locations, and timestamps, which can be used to alleviate the data sparsity problem. At the word level, word semantic similarity and embeddings obtained or trained on large external corpora (e.g., Google News or Wikipedia) have been proven useful in learning meaningful topics from short texts [7, 8].

The benefit of using document and word meta information separately has been shown in several models such as [9, 10, 6]. However, existing models are usually not efficient enough due to non-conjugacy and/or complex model structures, and most of them use only one kind of meta information (either at the document level or at the word level). In this paper, we propose MetaLDA (code at https://github.com/ethanhezhao/MetaLDA/), a topic model that can effectively and efficiently leverage arbitrary document and word meta information encoded in binary form. Specifically, the labels of a document in MetaLDA are incorporated in the prior of the per-document topic distributions: if two documents have similar labels, their topic distributions should be generated with similar Dirichlet priors. Analogously, at the word level, the features of a word are incorporated in the prior of the per-topic word distributions, which encourages words with similar features to have similar weights across topics. Therefore, both document and word meta information, if and when they are available, can be flexibly and simultaneously incorporated using MetaLDA. MetaLDA has the following key properties:

  1. MetaLDA jointly incorporates various kinds of document and word meta information for both regular and short texts, yielding better modelling accuracy and topic quality.

  2. With two data augmentation techniques, the inference of MetaLDA can be done by an efficient, closed-form Gibbs sampling algorithm that benefits from the full local conjugacy of the model.

  3. The simple structure for incorporating meta information and the efficient inference algorithm give MetaLDA an advantage in running speed over other models with meta information.

We conduct extensive experiments with several real datasets including regular and short texts in various domains. The experimental results demonstrate that MetaLDA achieves improved performance in terms of perplexity, topic coherence, and running time.

II Related Work

In this section, we review three lines of related work: models with document meta information, models with word meta information, and models for short texts.

At the document level, Supervised LDA (sLDA) [11] models document labels by learning a generalised linear model with an appropriate link function and exponential family dispersion function, but it restricts each document to a single label. Labelled LDA (LLDA) [12] assumes that each label has a corresponding topic and that a document is generated by a mixture of those topics. Although multiple labels are allowed, LLDA requires the number of topics to equal the number of labels, i.e., exactly one topic per label. As an extension to LLDA, Partially Labelled LDA (PLLDA) [10] relaxes this requirement by assigning multiple topics to a label. The Dirichlet Multinomial Regression (DMR) model [9] incorporates document labels in the prior of the topic distributions, like our MetaLDA, but with the logistic-normal transformation. As full conjugacy does not exist in DMR, part of the inference has to be done by numerical optimisation, which is slow for large sets of labels and topics. Similarly, in the Hierarchical Dirichlet Scaling Process (HDSP) [13], conjugacy is broken as well, since the topic distributions have to be renormalised. [14] introduces a Poisson factorisation model with hierarchical document labels, but its techniques cannot be applied to regular topic models, as the topic proportion vectors are also unnormalised.

Recently, there has been growing interest in incorporating word features into topic models. For example, DF-LDA [15] incorporates word must-links and cannot-links using a Dirichlet forest prior in LDA; MRF-LDA [16] encodes word semantic similarity in LDA with a Markov random field; WF-LDA [17] extends LDA to model word features with the logistic-normal transform; LF-LDA [6] integrates word embeddings into LDA by replacing the topic-word Dirichlet multinomial component with a mixture of a Dirichlet multinomial component and a word embedding component; and instead of generating word types (tokens), Gaussian LDA (GLDA) [5] directly generates word embeddings with the Gaussian distribution. Despite the exciting applications of the above models, their inference is usually less efficient due to non-conjugacy and/or complicated model structures.

Analysis of short texts with topic models has been an active area with the development of social networks. Generally, there are two ways to deal with the sparsity problem in short texts: using the intrinsic properties of short texts, or leveraging meta information. For the first way, one popular approach is to aggregate short texts into pseudo-documents; for example, [18] introduces a model that aggregates tweets containing the same word, and more recently, PTM [19] aggregates short texts into latent pseudo-documents. Another approach is to assume one topic per short document, known as mixture of unigrams or Dirichlet Multinomial Mixture (DMM), as in [20, 7]. For the second way, document meta information can be used to aggregate short texts; for example, [18] aggregates tweets by their authors, and [21] shows that aggregating tweets by their hashtags yields superior performance over other aggregation methods. Closely related to ours are the models that use word features for short texts. For example, [7] introduces an extension of GLDA on short texts which samples an indicator variable that chooses to generate either the type of a word or the embedding of a word, and GPU-DMM [8] extends DMM with word semantic similarity obtained from embeddings. Despite their improved performance, challenges remain for existing models: (1) for aggregation-based models, it is usually hard to choose which meta information to use for aggregation; (2) the “single topic” assumption makes DMM models lose the flexibility to capture different topic ingredients of a document; and (3) the incorporation of meta information in existing models is usually less efficient.

To our knowledge, attempts to jointly leverage document and word meta information are relatively rare. For example, meta information can be incorporated via first-order logic in Logit-LDA [22] and via score functions in SC-LDA [23]. However, the first-order logic and score functions need to be defined separately for each kind of meta information, and such definitions can be infeasible when incorporating document and word meta information simultaneously.

III The MetaLDA Model

Given a corpus, LDA uses the same Dirichlet prior for all the per-document topic distributions and the same prior for all the per-topic word distributions [24]. In MetaLDA, by contrast, each document has a specific Dirichlet prior on its topic distribution, computed from the meta information of the document, and the parameters of the prior are estimated during training. Similarly, each topic has a specific Dirichlet prior computed from the word meta information. Here we elaborate MetaLDA, in particular how the meta information is incorporated. Hereafter, we use labels as the document meta information, unless otherwise stated.

Fig. 1: The graphical model of MetaLDA

Given a collection of $D$ documents, MetaLDA generates document $d$ with a mixture of $K$ topics, where each topic $k$ is a distribution over the vocabulary of $V$ tokens, denoted by $\phi_k$. For document $d$ with $N_d$ words, to generate the $i$th ($1 \le i \le N_d$) word $w_{d,i}$, we first sample a topic $z_{d,i}$ from the document's topic distribution $\theta_d$, and then sample $w_{d,i}$ from $\phi_{z_{d,i}}$. Assume the labels of document $d$ are encoded in a binary vector $f_d \in \{0,1\}^{L_{doc}}$, where $L_{doc}$ is the total number of unique labels; $f_{d,l} = 1$ indicates that label $l$ is active in document $d$ and vice versa. Similarly, the features of token $v$ are stored in a binary vector $g_v \in \{0,1\}^{L_{word}}$. Therefore, the document and word meta information associated with the corpus are stored in the matrices $F \in \{0,1\}^{D \times L_{doc}}$ and $G \in \{0,1\}^{V \times L_{word}}$ respectively. Although MetaLDA incorporates binary features, categorical features and real-valued features can be converted into binary values with proper transformations such as discretisation and binarisation.

Fig. 1 shows the graphical model of MetaLDA and the generative process is as follows:

  1. For each topic $k = 1, \dots, K$:

    1. For each doc-label $l = 1, \dots, L_{doc}$: Draw $\lambda_{l,k} \sim \mathrm{Ga}(\mu, 1/\mu)$

    2. For each word-feat $l' = 1, \dots, L_{word}$: Draw $\delta_{l',k} \sim \mathrm{Ga}(\nu, 1/\nu)$

    3. For each token $v = 1, \dots, V$: Compute $\beta_{k,v} = \prod_{l'} \delta_{l',k}^{g_{v,l'}}$

    4. Draw $\phi_k \sim \mathrm{Dir}(\beta_{k,1}, \dots, \beta_{k,V})$

  2. For each document $d = 1, \dots, D$:

    1. For each topic $k$: Compute $\alpha_{d,k} = \prod_{l} \lambda_{l,k}^{f_{d,l}}$

    2. Draw $\theta_d \sim \mathrm{Dir}(\alpha_{d,1}, \dots, \alpha_{d,K})$

    3. For each word $w_{d,i}$ in document $d$:

      1. Draw topic $z_{d,i} \sim \mathrm{Cat}(\theta_d)$

      2. Draw word $w_{d,i} \sim \mathrm{Cat}(\phi_{z_{d,i}})$

where $\mathrm{Ga}(\cdot,\cdot)$, $\mathrm{Dir}(\cdot)$, $\mathrm{Cat}(\cdot)$ are the gamma distribution, the Dirichlet distribution, and the categorical distribution respectively, and $\mu$ and $\nu$ are the hyper-parameters.

To incorporate document labels, MetaLDA learns a specific Dirichlet prior over the topics for each document by using the label information. Specifically, the information of document $d$'s labels is incorporated in $\alpha_d$, the parameter of the Dirichlet prior on $\theta_d$. As shown in Step 2a, $\alpha_{d,k}$ is computed as a log-linear combination of the labels $f_d$. Since $f_d$ is binary, $\alpha_{d,k}$ is indeed the product of $\lambda_{l,k}$ over all the active labels of document $d$, i.e., $\alpha_{d,k} = \prod_{l: f_{d,l}=1} \lambda_{l,k}$. Drawn from the gamma distribution with mean 1, $\lambda_{l,k}$ controls the impact of label $l$ on topic $k$. If label $l$ has no or little impact on topic $k$, $\lambda_{l,k}$ is expected to be 1 or close to 1, and then it will have no or little influence on $\alpha_{d,k}$, and vice versa. The hyper-parameter $\mu$ controls the variation of $\lambda_{l,k}$. The incorporation of word features is analogous, but happens in the parameter $\beta_{k,v}$ of the Dirichlet prior on the per-topic word distributions, as shown in Step 1c.

The intuition of our way of incorporating meta information is as follows. At the document level, if two documents have more labels in common, their Dirichlet parameters $\alpha$ will be more similar, resulting in more similar topic distributions $\theta$. At the word level, if two words have similar features, their parameters $\beta_{k,\cdot}$ in topic $k$ will be similar, so we can expect their probabilities under $\phi_k$ to be more or less the same; the two words will then have similar probabilities of showing up in topic $k$. In other words, if a topic “prefers” a certain word, we expect that it will also prefer other words with similar features to that word. Moreover, at both the document and the word level, different labels/features may have different impacts on the topics (via $\lambda$ and $\delta$), which are automatically learnt in MetaLDA.
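To make this concrete, here is a minimal numpy sketch of the generative process on toy dimensions; all sizes and variable names (lam for $\lambda$, delta for $\delta$, etc.) are illustrative assumptions, not the paper's actual implementation (which is written in Java on top of Mallet).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (illustrative): D docs, K topics, V tokens,
# L_doc document labels, L_word word features.
D, K, V, L_doc, L_word = 4, 3, 10, 2, 3
mu, nu = 1.0, 1.0                            # gamma hyper-parameters (mean-1 priors)

F = rng.integers(0, 2, size=(D, L_doc))      # binary document-label matrix
G = rng.integers(0, 2, size=(V, L_word))     # binary word-feature matrix

lam = rng.gamma(mu, 1.0 / mu, size=(L_doc, K))      # lambda_{l,k}   (Step 1a)
delta = rng.gamma(nu, 1.0 / nu, size=(L_word, K))   # delta_{l',k}   (Step 1b)

# Log-linear combination of active labels/features (Steps 1c and 2a):
# alpha_{d,k} = prod_l lambda_{l,k}^{f_{d,l}},  beta_{k,v} = prod_l' delta_{l',k}^{g_{v,l'}}
alpha = np.exp(F @ np.log(lam))        # D x K
beta = np.exp(G @ np.log(delta)).T     # K x V

theta = np.vstack([rng.dirichlet(a) for a in alpha])   # theta_d ~ Dir(alpha_d)
phi = np.vstack([rng.dirichlet(b) for b in beta])      # phi_k ~ Dir(beta_k)

docs = []
for d in range(D):                     # draw 5 words per toy document
    z = rng.choice(K, size=5, p=theta[d])
    docs.append([rng.choice(V, p=phi[k]) for k in z])
```

Note how a document with no active labels gets $\alpha_{d,k} = 1$ for every topic, i.e., a flat Dirichlet prior, which is why the default label described in Section V-B acts as a bias term.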

IV Inference

Unlike most existing methods, our way of incorporating the meta information facilitates the derivation of an efficient Gibbs sampling algorithm. With two data augmentation techniques (i.e., the introduction of auxiliary variables), MetaLDA admits local conjugacy, and a closed-form Gibbs sampling algorithm can be derived. Note that MetaLDA incorporates the meta information in the Dirichlet priors, so we can still use LDA's collapsed Gibbs sampling algorithm for the topic assignments $z$. Moreover, Steps 2a and 1c show that one only needs to consider the non-zero entries of $f_d$ and $g_v$ in computing the full conditionals, which further reduces the inference complexity.

Similar to LDA, the complete model likelihood (i.e., joint distribution) of MetaLDA is:

\prod_{k=1}^{K} \prod_{v=1}^{V} \phi_{k,v}^{n_{k,v}} \prod_{d=1}^{D} \prod_{k=1}^{K} \theta_{d,k}^{n_{d,k}}   (1)

where $n_{k,v} = \sum_{d,i} \mathbf{1}(z_{d,i} = k, w_{d,i} = v)$, $n_{d,k} = \sum_{i} \mathbf{1}(z_{d,i} = k)$, and $\mathbf{1}(\cdot)$ is the indicator function.

IV-A Sampling $\lambda$:

To sample $\lambda$, we first marginalise out $\theta_d$ in the right part of Eq. (1) with the Dirichlet multinomial conjugacy:

\prod_{d=1}^{D} \left[ \frac{\Gamma(\alpha_{d,\cdot})}{\Gamma(\alpha_{d,\cdot} + n_{d,\cdot})} \prod_{k=1}^{K} \frac{\Gamma(\alpha_{d,k} + n_{d,k})}{\Gamma(\alpha_{d,k})} \right]   (2)

where $\alpha_{d,\cdot} = \sum_k \alpha_{d,k}$, $n_{d,\cdot} = \sum_k n_{d,k}$, and $\Gamma(\cdot)$ is the gamma function. The first ratio of gamma functions (Gamma ratio 1) in Eq. (2) can be augmented with a set of Beta random variables $q_d$ as:

\frac{\Gamma(\alpha_{d,\cdot})}{\Gamma(\alpha_{d,\cdot} + n_{d,\cdot})} \propto \int_0^1 q_d^{\alpha_{d,\cdot} - 1} (1 - q_d)^{n_{d,\cdot} - 1} \, dq_d   (3)

where for each document $d$, $q_d \sim \mathrm{Beta}(\alpha_{d,\cdot}, n_{d,\cdot})$. Given a set of $q_d$ for all the documents, Gamma ratio 1 can be approximated by the product of $q_d^{\alpha_{d,\cdot}}$, i.e., $\prod_d q_d^{\alpha_{d,\cdot}}$.

The second ratio (Gamma ratio 2) in Eq. (2) is the Pochhammer symbol for a rising factorial, which can be augmented with an auxiliary variable $t_{d,k}$ [25, 26, 27, 28] as follows:

\frac{\Gamma(\alpha_{d,k} + n_{d,k})}{\Gamma(\alpha_{d,k})} = \sum_{t_{d,k}=0}^{n_{d,k}} S_{t_{d,k}}^{n_{d,k}} \, \alpha_{d,k}^{t_{d,k}}   (4)

where $S_{t}^{n}$ indicates an unsigned Stirling number of the first kind. Gamma ratio 2 is a normalising constant for the probability of the number of tables in the Chinese Restaurant Process (CRP) [29], so $t_{d,k}$ can be sampled by a CRP with $\alpha_{d,k}$ as the concentration and $n_{d,k}$ as the number of customers:

t_{d,k} = \sum_{i=1}^{n_{d,k}} \mathrm{Bern}\left( \frac{\alpha_{d,k}}{\alpha_{d,k} + i - 1} \right)   (5)

where $\mathrm{Bern}(\cdot)$ samples from the Bernoulli distribution. The complexity of sampling $t_{d,k}$ by Eq. (5) is $O(n_{d,k})$. For large $n_{d,k}$, as the standard deviation of $t_{d,k}$ grows only slowly with $n_{d,k}$ [29], one can sample $t_{d,k}$ in a small window around the current value at much reduced cost.
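A direct way to sample the table count in Eq. (5) is to simulate the Bernoulli draws of the CRP; the following sketch (function name and signature are ours) does exactly that:

```python
import numpy as np

def sample_table_count(alpha_dk: float, n_dk: int, rng) -> int:
    """t_{d,k} via Eq. (5): customer i opens a new table with
    probability alpha_dk / (alpha_dk + i - 1); return the table total."""
    i = np.arange(1, n_dk + 1)
    return int(rng.binomial(1, alpha_dk / (alpha_dk + i - 1)).sum())
```

The $O(n_{d,k})$ cost of this loop is what the window-sampling shortcut above avoids for large counts.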

By ignoring the terms unrelated to $\alpha_{d,k}$, the augmentation of Eq. (4) can be simplified to a single term $\alpha_{d,k}^{t_{d,k}}$. With the auxiliary variables now introduced, we simplify Eq. (2) to:

\prod_{d=1}^{D} \left[ q_d^{\alpha_{d,\cdot}} \prod_{k=1}^{K} \alpha_{d,k}^{t_{d,k}} \right]   (6)

Replacing $\alpha_{d,k}$ with $\prod_l \lambda_{l,k}^{f_{d,l}}$, we get:

\prod_{d=1}^{D} \left[ q_d^{\sum_k \prod_l \lambda_{l,k}^{f_{d,l}}} \prod_{k=1}^{K} \prod_{l=1}^{L_{doc}} \lambda_{l,k}^{f_{d,l} t_{d,k}} \right]

Recall that all the document labels are binary and $\lambda_{l,k}$ is involved in computing $\alpha_{d,k}$ iff $f_{d,l} = 1$. Extracting all the terms related to $\lambda_{l,k}$ in the expression above, we get the marginal posterior of $\lambda_{l,k}$:

\Pr(\lambda_{l,k} \mid -) \propto \mathrm{Ga}(\lambda_{l,k}; \mu, 1/\mu) \prod_{d: f_{d,l}=1} q_d^{\alpha_{d,k}^{\neg l} \lambda_{l,k}} \, \lambda_{l,k}^{t_{d,k}}

where $\alpha_{d,k}^{\neg l} = \alpha_{d,k} / \lambda_{l,k}$ is the value of $\alpha_{d,k}$ with $\lambda_{l,k}$ removed when $f_{d,l} = 1$. With the data augmentation techniques, the posterior is transformed into a form that is conjugate to the gamma prior of $\lambda_{l,k}$. Therefore, it is straightforward to yield the following sampling strategy for $\lambda_{l,k}$:

q_d \sim \mathrm{Beta}(\alpha_{d,\cdot}, n_{d,\cdot})   (7)

t_{d,k} \sim \mathrm{CRP}(\alpha_{d,k}, n_{d,k}), \text{ as in Eq. (5)}   (8)

\lambda_{l,k} \sim \mathrm{Ga}\left( \mu + \sum_{d: f_{d,l}=1} t_{d,k}, \; \frac{1}{\mu - \sum_{d: f_{d,l}=1} \alpha_{d,k}^{\neg l} \log q_d} \right)   (9)

We can compute and cache the values of $\alpha_{d,k}^{\neg l}$ first. After $\lambda_{l,k}$ is sampled, $\alpha_{d,k}$ can be updated by:

\alpha_{d,k} \leftarrow \alpha_{d,k}^{\neg l} \, \lambda_{l,k}^{\mathrm{new}}   (10)

where $\lambda_{l,k}^{\mathrm{new}}$ is the newly-sampled value of $\lambda_{l,k}$.

To sample/compute Eqs. (7)-(10), one only iterates over the documents where label $l$ is active (i.e., $f_{d,l} = 1$). Thus, sampling all of $\lambda$ takes $O(K L_{doc} \bar{D})$, where $\bar{D}$ is the average number of documents in which a label is active (i.e., the column-wise sparsity of $F$). Usually $\bar{D} \ll D$, because a label that exists in nearly all the documents provides little discriminative information. This demonstrates how the sparsity of the document meta information is leveraged. Moreover, sampling all the tables $t_{d,k}$ takes $O(N)$ ($N$ is the total number of words in the corpus), which can be accelerated with the window sampling technique explained above.
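Under the reconstruction of Eqs. (7)-(10) above, resampling one $\lambda_{l,k}$ touches only the documents where label $l$ is active. A sketch of that update, with data layout and names of our own choosing:

```python
import numpy as np

def resample_lambda(l, k, lam, alpha, tables, log_q, docs_with_label, mu, rng):
    """Gamma-conjugate update of lambda_{l,k}: alpha (D x K) caches
    alpha_{d,k}, tables (D x K) holds t_{d,k}, log_q[d] = log q_d < 0,
    and docs_with_label[l] lists the documents with f_{d,l} = 1."""
    shape, rate = mu, mu
    for d in docs_with_label[l]:               # only docs where label l is active
        alpha_neg = alpha[d, k] / lam[l, k]    # alpha_{d,k} with lambda_{l,k} removed
        shape += tables[d, k]
        rate -= alpha_neg * log_q[d]           # log q_d < 0 keeps the rate positive
    new = rng.gamma(shape, 1.0 / rate)
    for d in docs_with_label[l]:               # Eq. (10): refresh the cached alpha
        alpha[d, k] *= new / lam[l, k]
    lam[l, k] = new
```

The two loops make the leveraged sparsity visible: the cost per $(l, k)$ pair is the number of documents carrying label $l$, not $D$.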

IV-B Sampling $\delta$:

Since the derivation of sampling $\delta$ is analogous to that of $\lambda$, we directly give the sampling formulas:

q'_k \sim \mathrm{Beta}(\beta_{k,\cdot}, n_{k,\cdot})   (11)

t'_{k,v} \sim \mathrm{CRP}(\beta_{k,v}, n_{k,v})   (12)

\delta_{l',k} \sim \mathrm{Ga}\left( \nu + \sum_{v: g_{v,l'}=1} t'_{k,v}, \; \frac{1}{\nu - \sum_{v: g_{v,l'}=1} \beta_{k,v}^{\neg l'} \log q'_k} \right)   (13)

where $\beta_{k,\cdot} = \sum_v \beta_{k,v}$, $n_{k,\cdot} = \sum_v n_{k,v}$, and $\beta_{k,v}^{\neg l'} = \beta_{k,v} / \delta_{l',k}$; the two auxiliary variables $q'_k$ and $t'_{k,v}$ are sampled as in Eqs. (3) and (5). Similarly, sampling all of $\delta$ takes $O(K L_{word} \bar{V})$, where $\bar{V}$ is the average number of tokens for which a feature is active (i.e., the column-wise sparsity of $G$; usually $\bar{V} \ll V$), and sampling all the tables $t'_{k,v}$ takes $O(N)$.

IV-C Sampling topic $z_{d,i}$:

Given $\alpha$ and $\beta$, the collapsed Gibbs sampling of a new topic for word $w_{d,i} = v$ in MetaLDA is:

\Pr(z_{d,i} = k \mid -) \propto \left( \alpha_{d,k} + n_{d,k}^{\neg d,i} \right) \frac{\beta_{k,v} + n_{k,v}^{\neg d,i}}{\beta_{k,\cdot} + n_{k,\cdot}^{\neg d,i}}   (14)

where the superscript $\neg d,i$ indicates that the current topic assignment of $w_{d,i}$ is excluded from the counts. This has exactly the same form as in LDA.
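Eq. (14) is the familiar LDA conditional with document- and topic-specific smoothing from the meta information; a minimal sketch of one draw (count arrays are assumed to already exclude the word's current assignment):

```python
import numpy as np

def sample_topic(alpha_d, beta, n_dk, n_kv, n_k, v, rng):
    """alpha_d: K prior weights for doc d; beta: K x V prior weights;
    n_dk: doc-topic counts for d; n_kv: topic-token counts; n_k: topic totals."""
    p = (alpha_d + n_dk) * (beta[:, v] + n_kv[:, v]) / (beta.sum(axis=1) + n_k)
    p /= p.sum()
    return rng.choice(len(p), p=p)
```

In a real sampler beta.sum(axis=1) would be cached rather than recomputed per word; caching strategies of this kind are discussed in Section V-F.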

V Experiments

In this section, we evaluate the proposed MetaLDA against several recent advances that also incorporate meta information on 6 real datasets including both regular and short texts. The goal of the experimental work is to evaluate the effectiveness and efficiency of MetaLDA’s incorporation of document and word meta information both separately and jointly compared with other methods. We report the performance in terms of perplexity, topic coherence, and running time per iteration.

V-A Datasets

In the experiments, three regular text datasets and three short text datasets were used:

  • Reuters is a widely used corpus extracted from the Reuters-21578 dataset, where documents without any labels are removed. (MetaLDA is able to handle documents/words without labels/features, but for fair comparison with other models, we removed the documents without labels and the words without features.) There are 11,367 documents and 120 labels. Each document is associated with multiple labels. The vocabulary size is 8,817 and the average document length is 73.

  • 20NG, 20 Newsgroup, is a widely used dataset consisting of 18,846 news articles in 20 categories. The vocabulary size is 22,636 and the average document length is 108.

  • NYT, New York Times, is extracted from the documents in the category “Top/News/Health” in the New York Times Annotated Corpus (https://catalog.ldc.upenn.edu/ldc2008t19). There are 52,521 documents and 545 unique labels. Each document is associated with multiple labels. The vocabulary contains 21,421 tokens and there are 442 words in a document on average.

  • WS, Web Snippet, used in [8], contains 12,237 web search snippets and each snippet belongs to one of 8 categories. The vocabulary contains 10,052 tokens and there are 15 words in one snippet on average.

  • TMN, Tag My News, used in [6], consists of 32,597 English RSS news snippets from Tag My News. With a title and a short description, each snippet belongs to one of 7 categories. There are 13,370 tokens in the vocabulary and the average length of a snippet is 18.

  • AN, ABC News, is a collection of 12,495 short news descriptions, each labelled with multiple of 194 categories. There are 4,255 tokens in the vocabulary and the average length of a description is 13.

All the datasets were tokenised by Mallet (http://mallet.cs.umass.edu) and we removed the words that appear in fewer than 5 documents or in more than 95% of the documents.

V-B Meta Information Settings

Document labels and word features. At the document level, the labels associated with the documents in each dataset were used as the meta information. At the word level, we used a set of 100-dimensional binarised word embeddings as word features, obtained from the 50-dimensional GloVe word embeddings pre-trained on Wikipedia (https://nlp.stanford.edu/projects/glove/). To binarise the word embeddings, we first adopted the following method, similar to [30]:

b_{v,m} = \begin{cases} +1 & \text{if } e_{v,m} > \bar{e}^{+} \\ -1 & \text{if } e_{v,m} < \bar{e}^{-} \\ 0 & \text{otherwise} \end{cases}   (15)

where $e_v$ is the original embedding vector for word $v$, $b_{v,m}$ is the binarised value for element $m$ of $e_v$, and $\bar{e}^{+}$ and $\bar{e}^{-}$ are the average values of all the positive elements and all the negative elements respectively. The insight is that we only consider features with strong opinions (i.e., large positive or negative values) on each dimension. To transform $b_v$ to the final $g_v$, we use two binary bits to encode one dimension of $b_v$: the first bit is on iff $b_{v,m} = +1$ and the second is on iff $b_{v,m} = -1$. Besides, MetaLDA can work with other word features, such as semantic similarity, as well.
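A small sketch of this binarisation; we read “the average value of all the positive elements and negative elements” as global averages over the whole embedding matrix, which is an assumption on our part (per-dimension averages would be a one-line change):

```python
import numpy as np

def binarise_embeddings(E: np.ndarray) -> np.ndarray:
    """Map V x M real embeddings to V x 2M binary features (Eq. 15):
    the first M bits fire for strongly positive values, the last M for
    strongly negative ones; weak values activate no bit at all."""
    pos_avg = E[E > 0].mean()        # average of all positive elements
    neg_avg = E[E < 0].mean()        # average of all negative elements
    return np.concatenate([E > pos_avg, E < neg_avg], axis=1).astype(np.int8)
```

Applied to the 50-dimensional GloVe vectors, this yields the 100-dimensional binary features used in the experiments.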

Default feature. Besides the labels/features associated with the datasets, a default label/feature that is always active is introduced in MetaLDA for each document/word. The default can be interpreted as the bias term in $\alpha_{d,k}$/$\beta_{k,v}$, which captures the information unrelated to the labels/features. When there are no document labels or word features, with the defaults, MetaLDA is equivalent in model to the asymmetric-asymmetric LDA of [24].

Model            | Compute $\alpha$ with   | Compute $\beta$ with
MetaLDA          | Document labels         | Word features
MetaLDA-dl-def   | Document labels         | Default feature
MetaLDA-dl-0.01  | Document labels         | Symmetric 0.01 (fixed)
MetaLDA-def-wf   | Default label           | Word features
MetaLDA-0.1-wf   | Symmetric 0.1 (fixed)   | Word features
MetaLDA-def-def  | Default label           | Default feature
TABLE I: MetaLDA and its variants.

V-C Compared Models and Parameter Settings

We evaluate the performance of the following models:

  • MetaLDA and its variants: the proposed model and its variants. Here we use MetaLDA to indicate the model considering both document labels and word features. Several variants of MetaLDA with document labels and word features used separately were also studied, as shown in Table I; these variants differ in the method of estimating $\alpha$ and $\beta$. All the models listed in Table I were implemented on top of Mallet. The hyper-parameters $\mu$ and $\nu$ were set to 1.0.

  • LDA [1]: the baseline model. The Mallet implementation of SparseLDA [31] is used.

  • LLDA, Labelled LDA [12] and PLLDA, Partially Labelled LDA [10]: two models that make use of multiple document labels. The original implementation (https://nlp.stanford.edu/software/tmt/tmt-0.4/) is used.

  • DMR, LDA with Dirichlet Multinomial Regression [9]: a model that can use multiple document labels. The Mallet implementation of DMR based on SparseLDA was used. Following Mallet, we set the mean of the regression parameters to 0.0, and their variances to 100.0 for the default label and 1.0 for the document labels.

  • WF-LDA, Word Feature LDA [17]: a model with word features. We implemented it on top of Mallet and used the default settings in Mallet for the optimisation.

  • LF-LDA, Latent Feature LDA [6]: a model that incorporates word embeddings. The original implementation (https://github.com/datquocnguyen/LFTM) was used. Following the paper, we used 1500 and 500 MCMC iterations for initialisation and sampling respectively, set the mixture weight to 0.6, and used the original 50-dimensional GloVe word embeddings as word features.

  • GPU-DMM, Generalized Pólya Urn DMM [8]: a model that incorporates word semantic similarity. The original implementation (https://github.com/NobodyWHU/GPUDMM) was used. The word similarity was generated from the distances of the word embeddings. Following the paper, we set the two hyper-parameters to 0.1 and 0.7 respectively, and used the paper's setting for the symmetric document Dirichlet prior.

  • PTM, Pseudo document based Topic Model [19]: a model for short text analysis. The original implementation (http://ipv6.nlsde.buaa.edu.cn/zuoyuan/) was used. Following the paper, we set the number of pseudo documents to 1000 and the other hyper-parameter to 0.1.

For all the models, except where noted, the symmetric parameters of the document and the topic Dirichlet priors were set to 0.1 and 0.01 respectively, and 2000 MCMC iterations were used to train the models.

V-D Perplexity Evaluation

Perplexity is a measure that is widely used [24] to evaluate the modelling accuracy of topic models; the lower the score, the higher the modelling accuracy. To compute perplexity, we randomly selected some documents in a dataset as the training set and used the remainder as the test set. We first trained a topic model on the training set to get the word distributions of the topics ($\phi$). Each test document was then split into two halves containing every first and every second word respectively. We fixed the topics, trained the models on the first half to get the topic proportions ($\theta_d$) of each test document $d$, and computed perplexity for predicting the second half. In regard to MetaLDA, we fixed the $\lambda$ and $\delta$ matrices output from the training procedure; on the first half of test document $d$, we computed the Dirichlet prior $\alpha_d$ with $\lambda$ and the labels of the test document (see Step 2a), and then point-estimated $\theta_d$. We ran all the models 5 times with different random number seeds and report the average scores and the standard deviations.
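The evaluation reduces to standard held-out perplexity under this document-completion setup; a minimal sketch, with array shapes assumed by us:

```python
import numpy as np

def perplexity(theta, phi, second_halves):
    """theta: D' x K topic proportions estimated on the first halves;
    phi: K x V topics fixed after training; second_halves[d]: the token
    ids of document d's second half, whose likelihood is measured."""
    log_lik, n_words = 0.0, 0
    for d, words in enumerate(second_halves):
        p_w = theta[d] @ phi[:, words]       # p(w|d) = sum_k theta_dk phi_kw
        log_lik += np.log(p_w).sum()
        n_words += len(words)
    return float(np.exp(-log_lik / n_words))
```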

In testing, we may encounter words that never occur in the training documents (a.k.a., unseen or out-of-vocabulary words). There are two strategies for handling unseen words when calculating perplexity on test documents: ignoring them, or keeping them. Here we investigate both strategies:

V-D1 Perplexity Computed without Unseen Words

In this experiment, the perplexity is computed only on the words that appear in the training vocabulary. Here we used 80% documents in each dataset as the training set and the remaining 20% as the test set.

Dataset           |           Reuters             |               20NG                |      NYT
#Topics           | 50     100    150    200      | 50       100      150      200    | 200      500
No meta info:
LDA               | 677±1  634±2  629±1  631±1    | 2147±7   1930±7   1820±5   1762±3 | 2293±8   2154±4
MetaLDA-def-def   | 648±3  592±2  559±1  540±1    | 2093±6   1843±7   1708±5   1626±4 | 2258±9   2079±8
Doc labels:
DMR               | 640±1  577±1  544±2  526±2    | 2080±8   1811±8   1670±4   1578±1 | 2231±13  2013
MetaLDA-dl-0.01   | 649±2  582±2  551±3  530±2    | 2067±9   1821±7   1680±5   1590±1 | 2219±4   2018±4
MetaLDA-dl-def    | 642±3  576±3  543±1  526±1    | 2050±4   1804±6   1675±8   1589±2 | 2230±3   2022±5
Word features:
LF-LDA            | 841±4  787±4  772±3  771±4    | 2855±21  2576±3   2433±7   2326±8 | 2831±2   2700±5
WF-LDA            | 659±2  616±2  615±1  613±1    | 2089±7   1875±2   1784±2   1727±3 | 2287±6   2134±6
MetaLDA-0.1-wf    | 659±3  621±1  619±1  623±1    | 2098±7   1887±8   1796±8   1744±4 | 2283±4   2143±2
MetaLDA-def-wf    | 643±2  582±4  552±3  535±1    | 2068±6   1819±1   1685±7   1600±3 | 2260±7   2095±6
Doc labels & word features:
MetaLDA           | 633±2  568±2  536±2  517±1    | 2025±12  1781±8   1640±5   1551±6 | 2217±6   2020±6

#Topics per label | 5      10     20     50       | 5        10       20       50     | 2        5
Doc labels:
PLLDA             | 714    708    733    829      | 1997     1786     1605     1482   | 2839     2846
LLDA              | 834                           | 2607                              | 2948
TABLE II: Perplexity comparison on the regular text datasets (average ± standard deviation over 5 runs). The best results are highlighted in boldface.
Dataset           |             WS                  |               TMN                     |     AN
#Topics           | 50      100     150     200     | 50        100       150       200     | 50       100
No meta info:
LDA               | 961±6   878±8   869±6   888±5   | 1969±14   1873±6    1881±9    1916±4  | 406±14   422±12
MetaLDA-def-def   | 884±10  733±6   671±6   625±6   | 1800±11   1578±19   1469±4    1422±6  | 352±16   336±11
Doc labels:
DMR               | 845±7   683±4   607±1   562±2   | 1750±8    1506±3    1391±7    1323±5  | 326±6    290±5
MetaLDA-dl-0.01   | 840±7   693±6   618±3   588±4   | 1767±11   1528±10   1416±7    1345±13 | 321±13   303±8
MetaLDA-dl-def    | 832±4   679±5   622±7   582±5   | 1720±7    1505±16   1395±11   1325±12 | 319±9    293±7
Word features:
LF-LDA            | 1164±6  1039±17 1019±11 992±6   | 2415±35   2393±11   2371±10   2374±14 | 482±17   514±19
WF-LDA            | 894±6   839±6   827±10  842±4   | 1853±6    1766±12   1830±60   1854±45 | 397±5    410±6
MetaLDA-0.1-wf    | 889±6   832±3   839±2   853±4   | 1865±4    1784±2    1799±9    1831±6  | 388±3    410±8
MetaLDA-def-wf    | 830±6   688±8   624±5   584±4   | 1730±14   1504±3    1402±13   1342±4  | 346±15   332±8
Doc labels & word features:
MetaLDA           | 774±9   627±6   572±3   534±4   | 1657±4    1415±16   1304±6    1235±6  | 314±9    293±9

#Topics per label | 5       10      20      50      | 5         10        20        50      | 5        10
Doc labels:
PLLDA             | 1060    886     735     642     | 2181      1863      1647      1456    | 440      525
LLDA              | 1543                            | 2958                                  | 392
TABLE III: Perplexity comparison without unseen words on the short text datasets (average ± standard deviation over 5 runs). The best results are highlighted in boldface.

Tables II and III show the average perplexity scores with standard deviations for all the models. (For GPU-DMM and PTM, perplexity is not evaluated because the inference code for unseen documents is not publicly available. The random number seeds used in the code of LLDA and PLLDA are pre-fixed in the package, so the standard deviations of these two models are not reported.) Note that: (1) The scores on AN with 150 and 200 topics are not reported, due to overfitting observed in all the compared models. (2) Given the size of NYT, the scores for 200 and 500 topics are reported. (3) The number of latent topics in LLDA must equal the number of document labels. (4) For PLLDA, we varied the number of topics per label from 5 to 50 (2 and 5 topics on NYT); the number of topics in PLLDA is the product of the number of labels and the number of topics per label.

Fig. 2: Perplexity comparison with unseen words in different proportions of the training documents: (a) Reuters with 200 topics, (b) 20NG with 200 topics, (c) TMN with 100 topics, (d) WS with 50 topics. Each pair of numbers on the horizontal axis gives the proportion of the training documents and the proportion of unseen tokens in the vocabulary of the test documents, respectively. The error bars are the standard deviations over 5 runs.

The results show that MetaLDA outperformed all the competitors in terms of perplexity on nearly all the datasets, showing the benefit of using both document and word meta information. Specifically, we have the following remarks:

  • By looking at the models using only the document-level meta information, we can see their significant improvement over LDA, which indicates that document labels can play an important role in guiding topic modelling. Although the performance of the two variants of MetaLDA with document labels is comparable to DMR's, our models run much faster than DMR, as studied later in Section V-F.

  • It is interesting that PLLDA with 50 topics for each label has better perplexity than MetaLDA with 200 topics in the 20NG dataset. With the 20 unique labels, the actual number of topics in PLLDA is 1000. However, if 10 topics for each label in PLLDA are used, which is equivalent to 200 topics in MetaLDA, PLLDA is outperformed by MetaLDA significantly.

  • At the word level, MetaLDA-def-wf performed the best among the models with word features only, and has an obvious advantage in running speed (see Table V). Furthermore, comparing MetaLDA-def-wf with MetaLDA-def-def, and MetaLDA-0.1-wf with LDA, we can see that using the word features indeed improved perplexity.

  • The scores show that the improvement gained by MetaLDA over LDA on the short text datasets is larger than that on the regular text datasets. This is as expected because meta information serves as complementary information in MetaLDA and can have more significant impact when the data is sparser.

  • It can be observed that models usually gained improved perplexity when the document Dirichlet prior is sampled/optimised rather than fixed, in line with [24].

  • On the AN dataset, there is no statistically significant difference between MetaLDA and DMR. On NYT, a similar trend is observed: the improvement over LDA is obvious in the models with the document labels but not in the models with the word features. Given the number of document labels (194 in AN and 545 in NYT), it is possible that the document labels already offer enough information, so the word embeddings contribute little on these two datasets.

V-D2 Perplexity Computed with Unseen Words

To test the hypothesis that the incorporation of meta information in MetaLDA can significantly improve the modelling accuracy in cases where the corpus is sparse, we varied the proportion of documents used in training from 20% to 80% and used the remainder for testing. Naturally, when the training proportion is small, the number of unseen words in the testing documents is large. Instead of simply excluding the unseen words as in the previous experiments, here we compute the perplexity with unseen words for LDA, DMR, WF-LDA and the proposed MetaLDA. This calculation needs $\phi_{k,v}$ for each topic $k$ and each token $v$ in the test documents. If $v$ occurs in the training documents, $\phi_{k,v}$ can be obtained directly; if $v$ is unseen, $\phi_{k,v}$ can be estimated by the prior. For LDA and DMR, which do not use word features, the prior over unseen tokens is symmetric; for WF-LDA and MetaLDA, which use word features, the prior is computed from the features of the unseen token. Following Step 1c, for MetaLDA, $\beta_{k,v} = \prod_{l'} \delta_{l',k}^{g_{v,l'}}$.
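For MetaLDA this means the prior weight of an unseen token follows directly from its features; a sketch, where the normalisation against the smoothed topic totals is our assumption about how the estimate is formed:

```python
import numpy as np

def phi_unseen(g_v, delta, beta_sum, n_k):
    """g_v: binary feature vector of the unseen token; delta: L_word x K;
    beta_sum[k] = sum over training tokens of beta_{k,v}; n_k: topic totals.
    Returns a K-vector of prior estimates of phi_{k,v} (normalisation assumed)."""
    beta_v = np.exp(g_v @ np.log(delta))   # beta_{k,v} = prod_l' delta_{l',k}^{g_{v,l'}}
    return beta_v / (beta_sum + n_k)
```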

Figure 2 shows the perplexity scores on Reuters, 20NG, TMN and WS with 200, 200, 100 and 50 topics respectively. MetaLDA outperformed the other models significantly when the proportion of training documents is lower and the proportion of unseen words is relatively higher, and the gap between MetaLDA and the other three models widens as the training proportion decreases. This indicates that the meta information helps MetaLDA achieve better modelling accuracy in predicting unseen words.

V-E Topic Coherence Evaluation

We further evaluate the semantic coherence of the words in the topics learnt by LDA, PTM, DMR, LF-LDA, WF-LDA, GPU-DMM and MetaLDA. Here we use the Normalised Pointwise Mutual Information (NPMI) [32, 33] to calculate the coherence of a topic from its $T$ top words:

\mathrm{NPMI} = \sum_{j=2}^{T} \sum_{i=1}^{j-1} \frac{\log \frac{P(w_j, w_i)}{P(w_i) P(w_j)}}{-\log P(w_j, w_i)}

where $P(w_i)$ is the probability of word $w_i$, and $P(w_i, w_j)$ is the joint probability of words $w_i$ and $w_j$ co-occurring within a sliding window. These probabilities were computed on an external large corpus, i.e., a 5.48GB Wikipedia dump in our experiments. The NPMI score of each topic was calculated with the top 10 words ($T = 10$) by the Palmetto package (http://palmetto.aksw.org). Again, we report the average scores and the standard deviations over 5 random runs.
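A sketch of the NPMI computation for a single topic, assuming the marginal and sliding-window joint probabilities have already been estimated on the external corpus (here passed as plain dictionaries; the Palmetto package handles this internally):

```python
import numpy as np

def topic_npmi(top_words, p, p_joint, eps=1e-12):
    """NPMI summed over all pairs of a topic's top words; p[w] and
    p_joint[(wi, wj)] come from the external reference corpus."""
    scores = []
    for i, wi in enumerate(top_words):
        for wj in top_words[i + 1:]:
            pij = p_joint.get((wi, wj), eps)   # eps guards never-co-occurring pairs
            pmi = np.log(pij / (p[wi] * p[wj]))
            scores.append(pmi / -np.log(pij))  # normalise PMI into [-1, 1]
    return float(np.sum(scores))
```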

                  |              All 100 topics                      |              Top 20 topics
                  | WS              TMN             AN               | WS              TMN             AN
No meta info:
LDA               | -0.0030±0.0047  0.0319±0.0032   -0.0636±0.0033   | 0.1025±0.0067   0.1370±0.0043   -0.0010±0.0052
PTM               | -0.0029±0.0048  0.0355±0.0016   -0.0640±0.0037   | 0.1033±0.0081   0.1527±0.0052   0.0004±0.0037
Doc labels:
DMR               | 0.0091±0.0046   0.0396±0.0044   -0.0457±0.0024   | 0.1296±0.0085   0.1472±0.1507   0.0276±0.0101
Word features:
LF-LDA            | 0.0130±0.0052   0.0397±0.0026   -0.0523±0.0023   | 0.1230±0.0153   0.1456±0.0087   0.0272±0.0042
WF-LDA            | 0.0091±0.0046   0.0390±0.0051   -0.0457±0.0024   | 0.1296±0.0085   0.1507±0.0055   0.0276±0.0101
GPU-DMM           | -0.0934±0.0106  -0.0970±0.0034  -0.0769±0.0012   | 0.0836±0.0105   0.0968±0.0076   -0.0613±0.0020
Doc labels & word features:
MetaLDA           | 0.0311±0.0038   0.0451±0.0034   -0.0326±0.0019   | 0.1511±0.0093   0.1584±0.0072   0.0590±0.0065
TABLE IV: Topic coherence (NPMI) on the short text datasets (average ± standard deviation over 5 runs).
Dataset           |            Reuters                 |              WS                    |      NYT
#Topics           | 50      100     150     200        | 50      100     150     200        | 200      500
No meta info:
LDA               | 0.0899  0.1023  0.1172  0.1156     | 0.0219  0.0283  0.0301  0.0351     | 0.7509   1.1400
PTM               | 4.9232  5.8885  7.2226  7.7670     | 1.1840  1.6375  1.8288  2.0030     | -        -
Doc labels:
DMR               | 0.6112  0.9237  1.2638  1.6066     | 0.4603  0.8549  1.2521  1.7173     | 13.7546  31.9571
MetaLDA-dl-0.01   | 0.1187  0.1387  0.1646  0.1868     | 0.0396  0.0587  0.0769  0.1121     | 2.4679   4.9928
Word features:
LF-LDA            | 2.6895  5.3043  8.3429  11.4419    | 2.4920  6.0266  9.1245  11.5983    | 95.5295  328.0862
WF-LDA            | 1.0495  1.6025  3.0304  4.8783     | 1.8162  3.7802  6.1863  8.6599     | 14.0538  31.4438
GPU-DMM           | 0.4193  0.7190  1.0421  1.3229     | 0.1206  0.1855  0.2487  0.3118     | -        -
MetaLDA-0.1-wf    | 0.2427  0.4274  0.6566  0.9683     | 0.1083  0.1811  0.2644  0.3579     | 4.6205   12.4177
Doc labels & word features:
MetaLDA           | 0.2833  0.5447  0.7222  1.0615     | 0.1232  0.2040  0.3282  0.4167     | 6.4644   16.9735
TABLE V: Running time (seconds per iteration) on 80% of the documents of each dataset.

It is known that conventional topic models directly applied to short texts suffer from low-quality topics, caused by insufficient word co-occurrence information. Here we study whether the meta information helps MetaLDA improve topic quality, compared with other topic models that can also handle short texts. Table IV shows the NPMI scores on the three short text datasets; higher scores indicate better topic coherence. All the models were trained with 100 topics. Besides the NPMI scores averaged over all 100 topics, we also show the scores averaged over the top 20 topics with the highest NPMI, where “rubbish” topics are eliminated, following [23]. It is clear that MetaLDA performed significantly better than all the other models on the WS and AN datasets in terms of NPMI, which indicates that MetaLDA can discover more meaningful topics with the document and word meta information. We would like to point out that on the TMN dataset, even though the average score of MetaLDA is still the best, its score overlaps with the others' within one standard deviation, which indicates that the difference is not statistically significant.

V-F Running Time

In this section, we empirically study the efficiency of the models in terms of per-iteration running time. The implementation details of MetaLDA are as follows: (1) The SparseLDA framework [31] reduces the complexity of LDA to be sub-linear in the number of topics by breaking the conditional of LDA into three “buckets”, where the “smoothing only” bucket is cached for all the documents and the “document only” bucket is cached for all the tokens in a document. We adopted a similar strategy when implementing MetaLDA. When only the document meta information is used, the Dirichlet parameters for different documents in MetaLDA are different and asymmetric; therefore, the “smoothing only” bucket has to be computed for each document, but we can cache it for all the tokens, which still gives a considerable reduction in computing complexity. However, when the word meta information is used, the SparseLDA framework no longer works in MetaLDA, as the parameters for each topic and each token are different. (2) By adapting the DistributedLDA framework [34], our MetaLDA implementation runs in parallel with multiple threads, which makes MetaLDA able to handle larger document collections. The parallel implementation was used on the NYT dataset.

The per-iteration running time of all the models is shown in Table V. Note that: (1) On the Reuters and WS datasets, all the models ran with a single thread on a desktop PC with a 3.40GHz CPU and 16GB RAM. (2) Due to the size of NYT, we report the running time for the models that are able to run in parallel. All the parallelised models ran with 10 threads on a cluster with a 14-core 2.6GHz CPU and 128GB RAM. (3) All the models were implemented in JAVA. (4) As the models with meta information add extra complexity to LDA, the per-iteration running time of LDA can be treated as the lower bound.

At the document level, both MetaLDA-dl-0.01 and DMR use priors to incorporate the document meta information, and both were implemented in the SparseLDA framework. However, our variant is about 6 to 8 times faster than DMR on the Reuters dataset and more than 10 times faster on the WS dataset. Moreover, the larger the number of topics, the faster our variant is relative to DMR. At the word level, similar patterns can be observed: MetaLDA-0.1-wf ran significantly faster than WF-LDA and LF-LDA, especially when more topics are used (20-30 times faster on WS). It is not surprising that GPU-DMM has comparable running speed with our variant, because only one topic is allowed for each document in GPU-DMM. With both document and word meta information, MetaLDA still ran several times faster than DMR, LF-LDA, and WF-LDA. On NYT with the parallel settings, MetaLDA maintains its efficiency advantage as well.

VI Conclusion

In this paper, we have presented a topic modelling framework named MetaLDA that can efficiently incorporate document and word meta information, gaining a significant improvement over comparable models in terms of perplexity and topic quality. With two data augmentation techniques, MetaLDA enjoys full local conjugacy, allowing efficient Gibbs sampling, as demonstrated by its per-iteration running time. Furthermore, without losing generality, MetaLDA works with both regular and short texts, and its improvement over other models that also use meta information is most remarkable when the word co-occurrence information is insufficient. As MetaLDA takes a particular approach to incorporating meta information into topic models, it is possible to apply the same approach to other Bayesian probabilistic models where Dirichlet priors are used. Moreover, it would be interesting to extend our method to use real-valued meta information directly, which is the subject of future work.

Acknowledgement

Lan Du was partially supported by Chinese NSFC project under grant number 61402312. Gang Liu was partially supported by Chinese PostDoc Fund under grant number LBH-Q15031.

References

  • [1] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet Allocation,” JMLR, pp. 993–1022, 2003.
  • [2] G. A. Miller, “WordNet: a lexical database for English,” Communications of the ACM, pp. 39–41, 1995.
  • [3] J. Pennington, R. Socher, and C. Manning, “GloVe: Global vectors for word representation,” in EMNLP, 2014, pp. 1532–1543.
  • [4] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in NIPS, 2013, pp. 3111–3119.
  • [5] R. Das, M. Zaheer, and C. Dyer, “Gaussian LDA for topic models with word embeddings,” in ACL, 2015, pp. 795–804.
  • [6] D. Q. Nguyen, R. Billingsley, L. Du, and M. Johnson, “Improving topic models with latent feature word representations,” TACL, pp. 299–313, 2015.
  • [7] G. Xun, V. Gopalakrishnan, F. Ma, Y. Li, J. Gao, and A. Zhang, “Topic discovery for short texts using word embeddings,” in ICDM, 2016, pp. 1299–1304.
  • [8] C. Li, H. Wang, Z. Zhang, A. Sun, and Z. Ma, “Topic modeling for short texts with auxiliary word embeddings,” in SIGIR, 2016, pp. 165–174.
  • [9] D. Mimno and A. McCallum, “Topic models conditioned on arbitrary features with Dirichlet-multinomial regression,” in UAI, 2008, pp. 411–418.
  • [10] D. Ramage, C. D. Manning, and S. Dumais, “Partially labeled topic models for interpretable text mining,” in SIGKDD, 2011, pp. 457–465.
  • [11] J. D. Mcauliffe and D. M. Blei, “Supervised topic models,” in NIPS, 2008, pp. 121–128.
  • [12] D. Ramage, D. Hall, R. Nallapati, and C. D. Manning, “Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora,” in EMNLP, 2009, pp. 248–256.
  • [13] D. Kim and A. Oh, “Hierarchical Dirichlet scaling process,” Machine Learning, pp. 387–418, 2017.
  • [14] C. Hu, P. Rai, and L. Carin, “Non-negative matrix factorization for discrete data with hierarchical side-information,” in AISTATS, 2016, pp. 1124–1132.
  • [15] D. Andrzejewski, X. Zhu, and M. Craven, “Incorporating domain knowledge into topic modeling via Dirichlet forest priors,” in ICML, 2009, pp. 25–32.
  • [16] P. Xie, D. Yang, and E. Xing, “Incorporating word correlation knowledge into topic modeling,” in NAACL, 2015, pp. 725–734.
  • [17] J. Petterson, W. Buntine, S. M. Narayanamurthy, T. S. Caetano, and A. J. Smola, “Word features for Latent Dirichlet Allocation,” in NIPS, 2010, pp. 1921–1929.
  • [18] L. Hong and B. D. Davison, “Empirical study of topic modeling in twitter,” in Workshop on social media analytics, 2010, pp. 80–88.
  • [19] Y. Zuo, J. Wu, H. Zhang, H. Lin, F. Wang, K. Xu, and H. Xiong, “Topic modeling of short texts: A pseudo-document view,” in SIGKDD, 2016, pp. 2105–2114.
  • [20] J. Yin and J. Wang, “A Dirichlet multinomial mixture model-based approach for short text clustering,” in SIGKDD, 2014, pp. 233–242.
  • [21] R. Mehrotra, S. Sanner, W. Buntine, and L. Xie, “Improving LDA topic models for microblogs via tweet pooling and automatic labeling,” in SIGIR, 2013, pp. 889–892.
  • [22] D. Andrzejewski, X. Zhu, M. Craven, and B. Recht, “A framework for incorporating general domain knowledge into Latent Dirichlet Allocation using first-order logic,” in IJCAI, 2011, pp. 1171–1177.
  • [23] Y. Yang, D. Downey, and J. Boyd-Graber, “Efficient methods for incorporating knowledge into topic models,” in EMNLP, 2015, pp. 308–317.
  • [24] H. M. Wallach, D. M. Mimno, and A. McCallum, “Rethinking LDA: Why priors matter,” in NIPS, 2009, pp. 1973–1981.
  • [25] C. Chen, L. Du, and W. Buntine, “Sampling table configurations for the hierarchical Poisson-Dirichlet process,” in ECML, 2011, pp. 296–311.
  • [26] Y. Teh, M. Jordan, M. Beal, and D. Blei, “Hierarchical Dirichlet processes,” Journal of the American Statistical Association, pp. 1566–1581, 2006.
  • [27] M. Zhou and L. Carin, “Negative binomial process count and mixture modeling,” TPAMI, pp. 307–320, 2015.
  • [28] H. Zhao, L. Du, and W. Buntine, “Leveraging node attributes for incomplete relational data,” in ICML, 2017, pp. 4072–4081.
  • [29] W. Buntine and M. Hutter, “A Bayesian view of the Poisson-Dirichlet process,” arXiv preprint arXiv:1007.0296v2 [math.ST], 2012.
  • [30] J. Guo, W. Che, H. Wang, and T. Liu, “Revisiting embedding features for simple semi-supervised learning,” in EMNLP, 2014, pp. 110–120.
  • [31] L. Yao, D. Mimno, and A. McCallum, “Efficient methods for topic model inference on streaming document collections,” in SIGKDD, 2009, pp. 937–946.
  • [32] N. Aletras and M. Stevenson, “Evaluating topic coherence using distributional semantics,” in International Conference on Computational Semantics, 2013, pp. 13–22.
  • [33] J. H. Lau, D. Newman, and T. Baldwin, “Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality,” in EACL, 2014, pp. 530–539.
  • [34] D. Newman, A. Asuncion, P. Smyth, and M. Welling, “Distributed algorithms for topic models,” JMLR, pp. 1801–1828, 2009.