I. Introduction
With the rapid growth of the internet, huge amounts of text data are generated in social networks, online shopping and news websites, etc. These data create demand for powerful and efficient text analysis techniques. Probabilistic topic models such as Latent Dirichlet Allocation (LDA) [1], which discover latent topics from text collections, are popular approaches for this task. Many conventional topic models discover topics purely based on word occurrences, ignoring the meta information (a.k.a., side information) associated with the content. In contrast, when we humans read text, it is natural to leverage meta information, such as categories, authors, timestamps, and the semantic meanings of the words, to improve our comprehension. Therefore, topic models capable of using meta information should yield improved modelling accuracy and topic quality.
In practice, various kinds of meta information are available at the document level and the word level in many corpora. At the document level, labels of documents can be used to guide topic learning so that more meaningful topics can be discovered. Moreover, it is highly likely that documents with common labels discuss similar topics, which could further result in similar topic distributions. For example, if we use authors as labels for scientific papers, the topics of the papers published by the same researcher can be closely related.
At the word level, different semantic/syntactic features are also accessible, for example, features regarding word relationships, such as synonyms obtained from WordNet [2], word co-occurrence patterns obtained from a large corpus, and linked concepts from knowledge graphs. It is preferable that words having similar meanings but different morphological forms, like “dog” and “puppy”, are assigned to the same topic, even if they barely co-occur in the modelled corpus. Recently, word embeddings generated by GloVe [3] and word2vec [4] have attracted a lot of attention in natural language processing and related fields. It has been shown that word embeddings can capture both the semantic and syntactic features of words, so that similar words are close to each other in the embedding space. It seems reasonable to expect that these word embeddings will improve topic modelling
[5, 6].

Conventional topic models can suffer from a large performance degradation on short texts (e.g., tweets and news headlines) because of insufficient word co-occurrence information. In such cases, the meta information of documents and words can play an important role in analysing short texts by compensating for the lost word co-occurrence information. At the document level, for example, tweets are usually associated with hashtags, users, locations, and timestamps, which can be used to alleviate the data sparsity problem. At the word level, word semantic similarity and embeddings obtained or trained on large external corpora (e.g., Google News or Wikipedia) have been proven useful in learning meaningful topics from short texts [7, 8].
The benefit of using document and word meta information separately has been shown in several models such as [9, 10, 6]. However, existing models are usually not efficient enough, due to non-conjugacy and/or complex model structures. Moreover, most existing models use only one kind of meta information (either at the document level or at the word level). In this paper, we propose MetaLDA (code at https://github.com/ethanhezhao/MetaLDA/), a topic model that can effectively and efficiently leverage arbitrary document and word meta information encoded in binary form. Specifically, the labels of a document in MetaLDA are incorporated in the prior of the per-document topic distributions. If two documents have similar labels, their topic distributions should be generated with similar Dirichlet priors. Analogously, at the word level, the features of a word are incorporated in the prior of the per-topic word distributions, which encourages words with similar features to have similar weights across topics. Therefore, both document and word meta information, if and when they are available, can be flexibly and simultaneously incorporated in MetaLDA. MetaLDA has the following key properties:

MetaLDA jointly incorporates various kinds of document and word meta information for both regular and short texts, yielding better modelling accuracy and topic quality.

With data augmentation techniques, the inference of MetaLDA can be done by an efficient and closed-form Gibbs sampling algorithm that benefits from the full local conjugacy of the model.

The simple structure for incorporating meta information and the efficient inference algorithm give MetaLDA an advantage in running speed over other models with meta information.
We conduct extensive experiments with several real datasets including regular and short texts in various domains. The experimental results demonstrate that MetaLDA achieves improved performance in terms of perplexity, topic coherence, and running time.
II. Related Work
In this section, we review three lines of related work: models with document meta information, models with word meta information, and models for short texts.
At the document level, Supervised LDA (sLDA) [11] models document labels by learning a generalised linear model with an appropriate link function and exponential family dispersion function. However, sLDA is restricted to one label per document. Labelled LDA (LLDA) [12] assumes that each label has a corresponding topic and that a document is generated by a mixture of the topics. Although multiple labels are allowed, LLDA requires the number of topics to equal the number of labels, i.e., exactly one topic per label. As an extension of LLDA, Partially Labelled LDA (PLLDA) [10] relaxes this requirement by assigning multiple topics to a label. The Dirichlet Multinomial Regression (DMR) model [9] incorporates document labels in the prior of the topic distributions like our MetaLDA, but with the logistic-normal transformation. As full conjugacy does not exist in DMR, a part of the inference has to be done by numerical optimisation, which is slow for large sets of labels and topics. Similarly, in the Hierarchical Dirichlet Scaling Process (HDSP) [13], conjugacy is broken as well, since the topic distributions have to be renormalised. [14] introduces a Poisson factorisation model with hierarchical document labels, but the techniques cannot be applied to regular topic models, as the topic proportion vectors are also unnormalised.
Recently, there has been growing interest in incorporating word features in topic models. For example, DF-LDA [15] incorporates word must-links and cannot-links using a Dirichlet forest prior in LDA; MRF-LDA [16] encodes word semantic similarity in LDA with a Markov random field; WF-LDA [17] extends LDA to model word features with the logistic-normal transform; LF-LDA [6] integrates word embeddings into LDA by replacing the topic-word Dirichlet multinomial component with a mixture of a Dirichlet multinomial component and a word embedding component. Instead of generating word types (tokens), Gaussian LDA (GLDA) [5] directly generates word embeddings with the Gaussian distribution. Despite the exciting applications of the above models, their inference is usually less efficient due to non-conjugacy and/or complicated model structures.
Analysis of short texts with topic models has been an active area with the development of social networks. Generally, there are two ways to deal with the sparsity problem in short texts: using the intrinsic properties of short texts or leveraging meta information. For the first way, one popular approach is to aggregate short texts into pseudo-documents; for example, [18] introduces a model that aggregates tweets containing the same word, and recently, PTM [19] aggregates short texts into latent pseudo-documents. Another approach is to assume one topic per short document, known as mixture of unigrams or Dirichlet Multinomial Mixture (DMM), as in [20, 7]. For the second way, document meta information can be used to aggregate short texts; for example, [18] aggregates tweets by the corresponding authors, and [21] shows that aggregating tweets by their hashtags yields superior performance over other aggregation methods. Closely related to our work are the models that use word features for short texts. For example, [7] introduces an extension of GLDA on short texts which samples an indicator variable that chooses to generate either the type of a word or the embedding of a word, and GPU-DMM [8] extends DMM with word semantic similarity obtained from embeddings. Despite their improved performance, challenges remain for existing models: (1) for aggregation-based models, it is usually hard to choose which meta information to use for aggregation; (2) the “single topic” assumption makes DMM models lose the flexibility to capture different topic ingredients of a document; and (3) the incorporation of meta information in existing models is usually less efficient.
To our knowledge, attempts to jointly leverage document and word meta information are relatively rare. For example, meta information can be incorporated by first-order logic in LogitLDA [22] and by score functions in SCLDA [23]. However, the first-order logic and score functions need to be defined for each kind of meta information, and the definition can be infeasible when incorporating both document and word meta information simultaneously.

III. The MetaLDA Model
Given a corpus, LDA uses the same Dirichlet prior for all the per-document topic distributions and the same Dirichlet prior for all the per-topic word distributions [24]. In MetaLDA, by contrast, each document has a specific Dirichlet prior on its topic distribution, which is computed from the meta information of the document, and the parameters of the prior are estimated during training. Similarly, each topic has a specific Dirichlet prior computed from the word meta information. Here we elaborate on MetaLDA, in particular on how the meta information is incorporated. Hereafter, we will use labels as the document meta information, unless otherwise stated.
Given a collection of D documents, MetaLDA generates each document d with a mixture of K topics, and each topic k is a distribution over the vocabulary of V tokens, denoted by φ_k. For document d with N_d words, to generate the n-th (n ∈ {1, …, N_d}) word w_{d,n}, we first sample a topic z_{d,n} from the document's topic distribution θ_d, and then sample w_{d,n} from φ_{z_{d,n}}. Assume the labels of document d are encoded in a binary vector f_d ∈ {0,1}^L, where L is the total number of unique labels: f_{d,l} = 1 indicates that label l is active in document d, and vice versa. Similarly, the features of token v are stored in a binary vector g_v ∈ {0,1}^{L′}. Therefore, the document and word meta information associated with the corpus are stored in the matrices F ∈ {0,1}^{D×L} and G ∈ {0,1}^{V×L′} respectively. Although MetaLDA incorporates binary features, categorical features and real-valued features can be converted into binary values with proper transformations such as discretisation and binarisation.
Fig. 1 shows the graphical model of MetaLDA, and the generative process is as follows:

1. For each topic k ∈ {1, …, K}:

 (a) For each doc-label l ∈ {1, …, L}: Draw λ_{l,k} ∼ Ga(μ, 1/μ)

 (b) For each word-feat f′ ∈ {1, …, L′}: Draw δ_{f′,k} ∼ Ga(ν, 1/ν)

 (c) For each token v ∈ {1, …, V}: Compute β_{k,v} = ∏_{f′=1}^{L′} δ_{f′,k}^{g_{v,f′}}

 (d) Draw φ_k ∼ Dir(β_k)

2. For each document d ∈ {1, …, D}:

 (a) For each topic k: Compute α_{d,k} = ∏_{l=1}^{L} λ_{l,k}^{f_{d,l}}

 (b) Draw θ_d ∼ Dir(α_d)

 (c) For each of the N_d words in document d:

  i. Draw topic z_{d,n} ∼ Cat(θ_d)

  ii. Draw word w_{d,n} ∼ Cat(φ_{z_{d,n}})

where Ga(·, ·), Dir(·), and Cat(·) are the gamma distribution, the Dirichlet distribution, and the categorical distribution respectively.
μ and ν are the hyperparameters.

To incorporate document labels, MetaLDA learns a specific Dirichlet prior over the topics for each document by using the label information. Specifically, the information of document d's labels is incorporated in α_d, the parameter of the Dirichlet prior on θ_d. As shown in Step 2a, α_{d,k} is computed as a log-linear combination of the labels f_d. Since f_d is binary, α_{d,k} is simply the product of λ_{l,k} over all the active labels of document d, i.e., α_{d,k} = ∏_{l: f_{d,l}=1} λ_{l,k}. Drawn from a gamma distribution with mean 1, λ_{l,k} controls the impact of label l on topic k. If label l has no or little impact on topic k, λ_{l,k} is expected to be 1 or close to 1, and then it will have no or little influence on α_{d,k}, and vice versa. The hyperparameter μ controls the variation of λ_{l,k}. The incorporation of word features is analogous, but in β_k, the parameter of the Dirichlet prior on the per-topic word distributions, as shown in Step 1c.
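To make Step 2a concrete, the per-document Dirichlet parameters can be computed with a few lines of NumPy. This is an illustrative sketch only; the function and variable names are ours, not from the released implementation (which is built on Mallet):

```python
import numpy as np

def doc_dirichlet_prior(labels, lam):
    """Compute alpha[d, k] as the product of lam[l, k] over the
    document's active labels, i.e. a log-linear combination of the
    binary label vector.
    labels: binary (D x L) matrix F; lam: positive (L x K) matrix of
    per-label, per-topic weights (lambda)."""
    # exp(F @ log(lambda)) multiplies lam[l, k] over labels with f[d, l] = 1
    return np.exp(labels @ np.log(lam))

# toy example: 2 documents, 3 labels, 2 topics
F = np.array([[1, 0, 1],
              [0, 1, 0]])
lam = np.array([[2.0, 0.5],
                [1.0, 1.0],
                [3.0, 1.0]])
alpha = doc_dirichlet_prior(F, lam)
# doc 0 has labels 0 and 2 active: alpha[0] = [2*3, 0.5*1] = [6.0, 0.5]
```

Inactive labels contribute a factor of 1, so only the non-zero entries of F matter, which keeps the computation cheap for sparse label matrices.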
The intuition behind our way of incorporating meta information is as follows. At the document level, if two documents have more labels in common, their Dirichlet parameters α will be more similar, resulting in more similar topic distributions θ. At the word level, if two words have similar features, their entries in β_k will be similar for each topic k, and we can then expect that their probabilities in φ_k will be more or less the same; the two words will therefore have similar probabilities of showing up in topic k. In other words, if a topic “prefers” a certain word, we expect that it will also prefer other words with features similar to that word's. Moreover, at both the document and the word level, different labels/features may have different impacts on the topics (via λ/δ), which are automatically learnt in MetaLDA.

IV. Inference
Unlike most existing methods, our way of incorporating the meta information facilitates the derivation of an efficient Gibbs sampling algorithm. With two data augmentation techniques (i.e., the introduction of auxiliary variables), MetaLDA admits local conjugacy, and a closed-form Gibbs sampling algorithm can be derived. Note that MetaLDA incorporates the meta information in the Dirichlet priors, so we can still use LDA's collapsed Gibbs sampling algorithm for the topic assignments Z. Moreover, Steps 2a and 1c show that one only needs to consider the non-zero entries of f_d and g_v in computing the full conditionals, which further reduces the inference complexity.
Similar to LDA, the complete model likelihood (i.e., joint distribution) of MetaLDA is:

P(W, Z, Θ, Φ | Λ, Δ) = ∏_{d=1}^{D} [ Dir(θ_d | α_d) ∏_{k=1}^{K} θ_{d,k}^{n_{d,k}} ] · ∏_{k=1}^{K} [ Dir(φ_k | β_k) ∏_{v=1}^{V} φ_{k,v}^{m_{k,v}} ]    (1)

where n_{d,k} = Σ_{n=1}^{N_d} 1(z_{d,n} = k), m_{k,v} = Σ_{d=1}^{D} Σ_{n=1}^{N_d} 1(z_{d,n} = k) 1(w_{d,n} = v), and 1(·) is the indicator function.
IV-A. Sampling λ:
To sample λ, we first marginalise out Θ in the right part of Eq. (1) with the Dirichlet-multinomial conjugacy:

P(Z | Λ) = ∏_{d=1}^{D} [ Γ(α_{d,·}) / Γ(α_{d,·} + N_d) ] · [ ∏_{k=1}^{K} Γ(α_{d,k} + n_{d,k}) / Γ(α_{d,k}) ]    (2)

where α_{d,·} = Σ_{k=1}^{K} α_{d,k}, Γ(·) is the gamma function, and we refer to the first bracketed term as Gamma ratio 1 and the second as Gamma ratio 2. Gamma ratio 1 in Eq. (2) can be augmented with a set of Beta random variables as:

Γ(α_{d,·}) / Γ(α_{d,·} + N_d) = (1 / Γ(N_d)) ∫_0^1 q_d^{α_{d,·} − 1} (1 − q_d)^{N_d − 1} dq_d    (3)

where for each document d, q_d ∼ Beta(α_{d,·}, N_d). Given a set of q_d for all the documents, Gamma ratio 1 can be approximated by the product of q_d^{α_{d,·}}, i.e., ∏_{d=1}^{D} q_d^{α_{d,·}}.
Gamma ratio 2 in Eq. (2) is the Pochhammer symbol for a rising factorial, which can be augmented with an auxiliary variable t_{d,k} [25, 26, 27, 28] as follows:

Γ(α_{d,k} + n_{d,k}) / Γ(α_{d,k}) = Σ_{t_{d,k}=0}^{n_{d,k}} S(n_{d,k}, t_{d,k}) · α_{d,k}^{t_{d,k}}    (4)

where S(n, t) indicates an unsigned Stirling number of the first kind. Since Gamma ratio 2 is a normalising constant for the probability of the number of tables in the Chinese Restaurant Process (CRP) [29], t_{d,k} can be sampled by a CRP with α_{d,k} as the concentration and n_{d,k} as the number of customers:

t_{d,k} = Σ_{i=1}^{n_{d,k}} Bern( α_{d,k} / (α_{d,k} + i − 1) )    (5)

where Bern(·) samples from the Bernoulli distribution. The complexity of sampling t_{d,k} by Eq. (5) is O(n_{d,k}). For large n_{d,k}, as the standard deviation of t_{d,k} is small relative to n_{d,k} [29], one can sample t_{d,k} in a small window around the current value with reduced complexity.

By ignoring the terms unrelated to α_{d,k}, the augmentation of Eq. (4) can be simplified to a single term α_{d,k}^{t_{d,k}}. With the auxiliary variables now introduced, we simplify Eq. (2) to:
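Eq. (5) is simple to implement: the table count is a sum of independent Bernoulli draws, one per customer. A minimal sketch with NumPy (the helper name is ours):

```python
import numpy as np

def sample_table_count(alpha, n, rng):
    """CRP table count for concentration alpha and n customers:
    customer i (1-based) opens a new table w.p. alpha / (alpha + i - 1),
    so the first customer always opens one."""
    if n == 0:
        return 0
    i = np.arange(n)                    # 0 .. n-1, so p_new[0] = 1 exactly
    p_new = alpha / (alpha + i)
    return int(rng.binomial(1, p_new).sum())

rng = np.random.default_rng(0)
t = sample_table_count(2.0, 100, rng)   # always in {1, ..., 100}
```

The cost is linear in n, matching the O(n_{d,k}) complexity noted above; the windowed variant instead restricts the update to a small interval around the current value.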
P(Z, q, t | Λ) ∝ ∏_{d=1}^{D} [ q_d^{α_{d,·}} ∏_{k=1}^{K} α_{d,k}^{t_{d,k}} ]    (6)
Replacing α_{d,k} with ∏_{l=1}^{L} λ_{l,k}^{f_{d,l}}, we can get:

P(Z, q, t | Λ) ∝ ∏_{d=1}^{D} [ exp( log q_d · Σ_{k=1}^{K} ∏_{l=1}^{L} λ_{l,k}^{f_{d,l}} ) · ∏_{k=1}^{K} ( ∏_{l=1}^{L} λ_{l,k}^{f_{d,l}} )^{t_{d,k}} ]

Recall that all the document labels are binary, so λ_{l,k} is involved in computing α_{d,k} iff f_{d,l} = 1. Extracting all the terms related to λ_{l,k} from the expression above, we get the marginal posterior of λ_{l,k}:

P(λ_{l,k} | Z, q, t) ∝ Ga(λ_{l,k}; μ, 1/μ) · λ_{l,k}^{Σ_{d: f_{d,l}=1} t_{d,k}} · exp( λ_{l,k} Σ_{d: f_{d,l}=1} α_{d,k}^{¬l} log q_d )

where α_{d,k}^{¬l} is the value of α_{d,k} with λ_{l,k} removed when f_{d,l} = 1, i.e., α_{d,k} = λ_{l,k} α_{d,k}^{¬l}. With the data augmentation techniques, the posterior is transformed into a form that is conjugate to the gamma prior of λ_{l,k}. Therefore, it is straightforward to yield the following sampling strategy for λ:
q_d ∼ Beta( α_{d,·}, N_d )    (7)

t_{d,k} ∼ CRP( α_{d,k}, n_{d,k} )  (i.e., by Eq. (5))    (8)

λ_{l,k} ∼ Ga( μ + Σ_{d: f_{d,l}=1} t_{d,k},  1 / ( μ − Σ_{d: f_{d,l}=1} α_{d,k}^{¬l} log q_d ) )    (9)
We can compute and cache the value of Σ_{d: f_{d,l}=1} α_{d,k}^{¬l} log q_d first. After λ_{l,k} is sampled, α_{d,k} can be updated by:

α_{d,k} ← α_{d,k} · λ*_{l,k} / λ_{l,k}    (10)

where λ*_{l,k} is the newly-sampled value of λ_{l,k}.
To sample/compute Eqs. (7)-(10), one only iterates over the documents where label l is active (i.e., f_{d,l} = 1). Thus, the sampling of all λ takes time proportional to K·L·D̄, where D̄ is the average number of documents in which a label is active (i.e., the column-wise sparsity of F). Usually D̄ ≪ D, because a label that exists in nearly all the documents provides little discriminative information. This demonstrates how the sparsity of the document meta information is leveraged. Moreover, sampling all the tables t takes O(N), where N is the total number of words in the corpus, and can be accelerated with the window sampling technique explained above.
IV-B. Sampling δ:
Since the derivation of sampling δ is analogous to that of λ, we directly give the sampling formulas:

δ_{f′,k} ∼ Ga( ν + Σ_{v: g_{v,f′}=1} t′_{k,v},  1 / ( ν − Σ_{v: g_{v,f′}=1} β_{k,v}^{¬f′} log q′_k ) )    (11)

β_{k,v}^{¬f′} = β_{k,v} / δ_{f′,k}  (for g_{v,f′} = 1)    (12)

β_{k,v} ← β_{k,v} · δ*_{f′,k} / δ_{f′,k}    (13)

where the two auxiliary variables can be sampled by: q′_k ∼ Beta(β_{k,·}, m_{k,·}) and t′_{k,v} ∼ CRP(β_{k,v}, m_{k,v}), with β_{k,·} = Σ_{v=1}^{V} β_{k,v} and m_{k,·} = Σ_{v=1}^{V} m_{k,v}, and δ*_{f′,k} is the newly-sampled value of δ_{f′,k}. Similarly, sampling all δ takes time proportional to K·L′·V̄, where V̄ is the average number of tokens for which a feature is active (i.e., the column-wise sparsity of G, and usually V̄ ≪ V), and sampling all the tables t′ takes O(N).
IV-C. Sampling topic z:
Given Λ and Δ, the collapsed Gibbs sampling of a new topic for word w_{d,n} = v in MetaLDA is:

P(z_{d,n} = k | Z^{¬dn}, W) ∝ ( α_{d,k} + n_{d,k}^{¬dn} ) · ( β_{k,v} + m_{k,v}^{¬dn} ) / ( β_{k,·} + m_{k,·}^{¬dn} )    (14)

where the superscript ¬dn indicates that the current token is excluded from the counts. This is exactly the same as in LDA.
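As an illustration of Eq. (14), the conditional for a single token can be assembled from count matrices that exclude that token; the names below are ours, not from the Mallet-based implementation:

```python
import numpy as np

def sample_topic(d, v, alpha, beta, n_dk, m_kv, rng):
    """Collapsed Gibbs draw for one token with word id v in document d.
    n_dk: (D, K) doc-topic counts; m_kv: (K, V) topic-word counts,
    both with the current token already decremented.
    alpha: (D, K) document-specific Dirichlet parameters; beta: (K, V)."""
    p = (alpha[d] + n_dk[d]) * (beta[:, v] + m_kv[:, v]) \
        / (beta.sum(axis=1) + m_kv.sum(axis=1))
    p /= p.sum()
    return rng.choice(len(p), p=p)
```

The only difference from LDA's sampler is that alpha and beta are document- and topic-specific rather than shared symmetric constants.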
V. Experiments
In this section, we evaluate the proposed MetaLDA against several recent advances that also incorporate meta information on 6 real datasets including both regular and short texts. The goal of the experimental work is to evaluate the effectiveness and efficiency of MetaLDA’s incorporation of document and word meta information both separately and jointly compared with other methods. We report the performance in terms of perplexity, topic coherence, and running time per iteration.
V-A. Datasets
In the experiments, three regular text datasets and three short text datasets were used:

Reuters, a widely used corpus extracted from the Reuters-21578 collection, with documents that have no labels removed. (MetaLDA is able to handle documents/words without labels/features; but for a fair comparison with the other models, we removed the documents without labels and the words without features.) There are 11,367 documents and 120 labels; each document is associated with multiple labels. The vocabulary size is 8,817 and the average document length is 73.

20NG, 20 Newsgroups, a widely used dataset consisting of 18,846 news articles in 20 categories. The vocabulary size is 22,636 and the average document length is 108.

NYT, New York Times, extracted from the documents in the category “Top/News/Health” in the New York Times Annotated Corpus (https://catalog.ldc.upenn.edu/ldc2008t19). There are 52,521 documents and 545 unique labels; each document is associated with multiple labels. The vocabulary contains 21,421 tokens and there are 442 words in a document on average.

WS, Web Snippet, used in [8], contains 12,237 web search snippets and each snippet belongs to one of 8 categories. The vocabulary contains 10,052 tokens and there are 15 words in one snippet on average.

TMN, Tag My News, used in [6], consists of 32,597 English RSS news snippets from Tag My News. With a title and a short description, each snippet belongs to one of 7 categories. There are 13,370 tokens in the vocabulary and the average length of a snippet is 18.

AN, ABC News, a collection of 12,495 short news descriptions, each associated with multiple of the 194 categories. There are 4,255 tokens in the vocabulary and the average length of a description is 13.
All the datasets were tokenised by Mallet (http://mallet.cs.umass.edu), and we removed the words that occur in fewer than 5 documents or in more than 95% of the documents.
V-B. Meta Information Settings
Document labels and word features. At the document level, the labels associated with the documents in each dataset were used as the meta information. At the word level, we used a set of 100-dimensional binarised word embeddings as word features, obtained from the 50-dimensional GloVe word embeddings pre-trained on Wikipedia (https://nlp.stanford.edu/projects/glove/). To binarise the word embeddings, we first adopted the following method, similar to [30]:
g′_{v,i} = +1 if e_{v,i} > ē⁺;  −1 if e_{v,i} < ē⁻;  0 otherwise    (15)

where e_v is the original embedding vector for word v, g′_{v,i} is the binarised value for element i of e_v, and ē⁺ and ē⁻ are the average values of all the positive elements and all the negative elements respectively. The insight is that we only consider features with strong opinions (i.e., large positive or negative values) on each dimension. To transform g′_v into the final binary vector g_v, we use two binary bits to encode each dimension of g′_v: the first bit is on iff g′_{v,i} = +1, and the second is on iff g′_{v,i} = −1. Besides embeddings, MetaLDA can work with other word features such as semantic similarity as well.
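A sketch of this binarisation for a single embedding vector (the helper name is ours; whether the means are taken per vector or over the whole vocabulary is an assumption here — we compute them per vector):

```python
import numpy as np

def binarise_embedding(e):
    """Map an embedding to 2*len(e) binary features: for each dimension,
    bit one fires if the value exceeds the mean of the positive entries,
    bit two fires if it is below the mean of the negative entries."""
    pos_mean = e[e > 0].mean() if (e > 0).any() else 0.0
    neg_mean = e[e < 0].mean() if (e < 0).any() else 0.0
    hi = (e > pos_mean).astype(int)   # strongly positive dimensions
    lo = (e < neg_mean).astype(int)   # strongly negative dimensions
    return np.concatenate([hi, lo])

g = binarise_embedding(np.array([0.9, 0.1, -0.8, -0.1]))
# pos_mean = 0.5, neg_mean = -0.45, so g = [1, 0, 0, 0, 0, 0, 1, 0]
```

Values near zero produce no active bits, matching the idea of keeping only dimensions with “strong opinions”.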
Default feature. Besides the labels/features associated with the datasets, a default label/feature that is always equal to 1 is introduced in MetaLDA for each document/word. The default can be interpreted as a bias term in α/β, capturing the information unrelated to the labels/features. When there are no document labels or word features, MetaLDA with only the defaults is equivalent in model to the asymmetric-asymmetric LDA of [24].
Table I: Variants of MetaLDA studied in the experiments.

Model            | α computed with        | β computed with
-----------------|------------------------|------------------------
MetaLDA          | Document labels        | Word features
MetaLDA-dl-def   | Document labels        | Default feature
MetaLDA-dl-0.01  | Document labels        | Symmetric 0.01 (fixed)
MetaLDA-def-wf   | Default label          | Word features
MetaLDA-0.1-wf   | Symmetric 0.1 (fixed)  | Word features
MetaLDA-def-def  | Default label          | Default feature
V-C. Compared Models and Parameter Settings
We evaluate the performance of the following models:

MetaLDA and its variants: the proposed model and its variants. Here we use MetaLDA to denote the model considering both document labels and word features. Several variants of MetaLDA with document labels or word features only were also studied, as shown in Table I. These variants differ in the method of computing α and β. All the models listed in Table I were implemented on top of Mallet. The hyperparameters μ and ν were set to .

LLDA, Labelled LDA [12], and PLLDA, Partially Labelled LDA [10]: two models that make use of multiple document labels. The original implementation (https://nlp.stanford.edu/software/tmt/tmt0.4/) was used.

DMR, LDA with Dirichlet Multinomial Regression [9]: a model that can use multiple document labels. The Mallet implementation of DMR, based on SparseLDA, was used. Following Mallet, we set the mean of the label parameters to 0.0, and the variances for the default label and the document labels to 100.0 and 1.0 respectively.
WF-LDA, Word Feature LDA [17]: a model with word features. We implemented it on top of Mallet and used Mallet's default settings for the optimisation.

LF-LDA, Latent Feature LDA [6]: a model that incorporates word embeddings. The original implementation (https://github.com/datquocnguyen/LFTM) was used. Following the paper, we used 1500 and 500 MCMC iterations for initialisation and sampling respectively, set the mixture weight λ to 0.6, and used the original 50-dimensional GloVe word embeddings as word features.

GPU-DMM, Generalized Pólya Urn DMM [8]: a model that incorporates word semantic similarity. The original implementation (https://github.com/NobodyWHU/GPUDMM) was used. The word similarity was generated from the distances between the word embeddings. Following the paper, we set the two hyperparameters to 0.1 and 0.7 respectively, and the symmetric document Dirichlet prior to .

PTM, the Pseudo-document-based Topic Model [19]: a model for short text analysis. The original implementation (http://ipv6.nlsde.buaa.edu.cn/zuoyuan/) was used. Following the paper, we set the number of pseudo-documents to 1000 and to 0.1.
For all the models, except where noted, the symmetric parameters of the document and the topic Dirichlet priors were set to 0.1 and 0.01 respectively, and 2000 MCMC iterations were used to train the models.
V-D. Perplexity Evaluation
Perplexity is a measure widely used [24] to evaluate the modelling accuracy of topic models; the lower the score, the higher the modelling accuracy. To compute perplexity, we randomly selected some documents of a dataset as the training set and the rest as the test set. We first trained a topic model on the training set to obtain the word distributions of the topics (Φ). Each test document was split into two halves containing every first and every second word respectively. We then fixed the topics and trained the models on the first half to get the topic proportions (θ_d) of each test document d, and computed perplexity for predicting the second half. In regard to MetaLDA, we fixed the matrices Λ and Δ output by the training procedure. On the first half of test document d, we computed the Dirichlet prior α_d with Λ and the labels of the test document (see Step 2a), and then point-estimated θ_d. We ran all the models 5 times with different random number seeds and report the average scores with standard deviations.
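The held-out computation can be sketched as follows, assuming point estimates of the topic proportions and topic-word distributions are already available (the names are illustrative):

```python
import numpy as np

def perplexity(docs, theta, phi):
    """Per-word perplexity exp(-(1/N) sum log p(w|d)) on held-out words,
    where p(w=v|d) = sum_k theta[d, k] * phi[k, v].
    docs: list of token-id lists (the second halves of the test docs);
    theta: (D, K) topic proportions; phi: (K, V) topic-word distributions."""
    log_lik, n_words = 0.0, 0
    for d, doc in enumerate(docs):
        p = theta[d] @ phi               # (V,) mixture word probabilities
        log_lik += np.log(p[doc]).sum()
        n_words += len(doc)
    return float(np.exp(-log_lik / n_words))
```

As a sanity check, a uniform model over a vocabulary of size V yields a perplexity of exactly V.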
In testing, we may encounter words that never occur in the training documents (a.k.a., unseen words or outofvocabulary words). There are two strategies for handling unseen words for calculating perplexity on test documents: ignoring them or keeping them in computing the perplexity. Here we investigate both strategies:
V-D1. Perplexity Computed without Unseen Words
In this experiment, the perplexity is computed only on the words that appear in the training vocabulary. Here we used 80% documents in each dataset as the training set and the remaining 20% as the test set.
Table II: Perplexity on the regular text datasets (mean±std over 5 runs).

Dataset         | Reuters                       | 20NG                                  | NYT
#Topics         | 50     100    150    200      | 50       100      150      200        | 200      500
LDA             | 677±1  634±2  629±1  631±1    | 2147±7   1930±7   1820±5   1762±3     | 2293±8   2154±4
MetaLDA-def-def | 648±3  592±2  559±1  540±1    | 2093±6   1843±7   1708±5   1626±4     | 2258±9   2079±8
DMR             | 640±1  577±1  544±2  526±2    | 2080±8   1811±8   1670±4   1578±1     | 2231±13  2013
MetaLDA-dl-0.01 | 649±2  582±2  551±3  530±2    | 2067±9   1821±7   1680±5   1590±1     | 2219±4   2018±4
MetaLDA-dl-def  | 642±3  576±3  543±1  526±1    | 2050±4   1804±6   1675±8   1589±2     | 2230±3   2022±5
LF-LDA          | 841±4  787±4  772±3  771±4    | 2855±21  2576±3   2433±7   2326±8     | 2831±2   2700±5
WF-LDA          | 659±2  616±2  615±1  613±1    | 2089±7   1875±2   1784±2   1727±3     | 2287±6   2134±6
MetaLDA-0.1-wf  | 659±3  621±1  619±1  623±1    | 2098±7   1887±8   1796±8   1744±4     | 2283±4   2143±2
MetaLDA-def-wf  | 643±2  582±4  552±3  535±1    | 2068±6   1819±1   1685±7   1600±3     | 2260±7   2095±6
MetaLDA         | 633±2  568±2  536±2  517±1    | 2025±12  1781±8   1640±5   1551±6     | 2217±6   2020±6

Dataset           | Reuters             | 20NG                    | NYT
#Topics per label | 5    10   20   50   | 5     10    20    50    | 2     5
PLLDA             | 714  708  733  829  | 1997  1786  1605  1482  | 2839  2846
LLDA              | 834                 | 2607                    | 2948
Table III: Perplexity on the short text datasets (mean±std over 5 runs).

Dataset         | WS                               | TMN                                       | AN
#Topics         | 50      100      150      200    | 50        100       150       200         | 50       100
LDA             | 961±6   878±8    869±6    888±5  | 1969±14   1873±6    1881±9    1916±4      | 406±14   422±12
MetaLDA-def-def | 884±10  733±6    671±6    625±6  | 1800±11   1578±19   1469±4    1422±6      | 352±16   336±11
DMR             | 845±7   683±4    607±1    562±2  | 1750±8    1506±3    1391±7    1323±5      | 326±6    290±5
MetaLDA-dl-0.01 | 840±7   693±6    618±3    588±4  | 1767±11   1528±10   1416±7    1345±13     | 321±13   303±8
MetaLDA-dl-def  | 832±4   679±5    622±7    582±5  | 1720±7    1505±16   1395±11   1325±12     | 319±9    293±7
LF-LDA          | 1164±6  1039±17  1019±11  992±6  | 2415±35   2393±11   2371±10   2374±14     | 482±17   514±19
WF-LDA          | 894±6   839±6    827±10   842±4  | 1853±6    1766±12   1830±60   1854±45     | 397±5    410±6
MetaLDA-0.1-wf  | 889±6   832±3    839±2    853±4  | 1865±4    1784±2    1799±9    1831±6      | 388±3    410±8
MetaLDA-def-wf  | 830±6   688±8    624±5    584±4  | 1730±14   1504±3    1402±13   1342±4      | 346±15   332±8
MetaLDA         | 774±9   627±6    572±3    534±4  | 1657±4    1415±16   1304±6    1235±6      | 314±9    293±9

Dataset           | WS                   | TMN                     | AN
#Topics per label | 5     10   20   50   | 5     10    20    50    | 5    10
PLLDA             | 1060  886  735  642  | 2181  1863  1647  1456  | 440  525
LLDA              | 1543                 | 2958                    | 392
Tables II and III show the average perplexity scores with standard deviations for all the models. (For GPU-DMM and PTM, perplexity is not evaluated because the inference code for unseen documents is not publicly available. The random number seeds in the LLDA and PLLDA package are prefixed, so the standard deviations of these two models are not reported.) Note that: (1) The scores on AN with 150 and 200 topics are not reported due to overfitting observed in all the compared models. (2) Given the size of NYT, the scores for 200 and 500 topics are reported. (3) The number of latent topics in LLDA must equal the number of document labels. (4) For PLLDA, we varied the number of topics per label from 5 to 50 (2 and 5 topics on NYT). The number of topics in PLLDA is the product of the number of labels and the number of topics per label.
The results show that MetaLDA outperformed all the competitors in terms of perplexity on nearly all the datasets, showing the benefit of using both document and word meta information. Specifically, we have the following remarks:

By looking at the models using only the document-level meta information, we can see a significant improvement of these models over LDA, which indicates that document labels can play an important role in guiding topic modelling. Although the performance of the two variants of MetaLDA with document labels is comparable to DMR's, our models run much faster than DMR, as studied later in Section V-F.

It is interesting that PLLDA with 50 topics per label achieves better perplexity than MetaLDA with 200 topics on the 20NG dataset. With the 20 unique labels, the actual number of topics in PLLDA is 1000. However, with 10 topics per label in PLLDA, which is equivalent to 200 topics in MetaLDA, PLLDA is outperformed by MetaLDA significantly.

At the word level, MetaLDA-def-wf performed best among the models with word features only. Moreover, our model has a clear advantage in running speed (see Table V). Furthermore, comparing MetaLDA-def-wf with MetaLDA-def-def, and MetaLDA-0.1-wf with LDA, we can see that using the word features indeed improved perplexity.

The scores show that the improvement gained by MetaLDA over LDA on the short text datasets is larger than that on the regular text datasets. This is as expected because meta information serves as complementary information in MetaLDA and can have more significant impact when the data is sparser.

It can be observed that the models usually gained improved perplexity when the document Dirichlet parameter α is sampled/optimised rather than fixed, in line with [24].

On the AN dataset, there is no statistically significant difference between MetaLDA and DMR. On NYT, a similar trend is observed: the improvement of the models with the document labels over LDA is obvious, but not that of the models with the word features. Given the numbers of document labels (194 for AN and 545 for NYT), it is possible that the document labels already offer enough information, so the word embeddings contribute little on these two datasets.
V-D2. Perplexity Computed with Unseen Words
To test the hypothesis that the incorporation of meta information in MetaLDA can significantly improve the modelling accuracy when the corpus is sparse, we varied the proportion of documents used in training from 20% to 80% and used the remainder for testing. Naturally, when the proportion is small, the number of unseen words in the testing documents will be large. Instead of simply excluding the unseen words as in the previous experiments, here we compute the perplexity with unseen words for LDA, DMR, WF-LDA and the proposed MetaLDA. For the perplexity calculation, φ_{k,v} is needed for each topic k and each token v in the test documents. If v occurs in the training documents, φ_{k,v} can be obtained directly, while if v is unseen, φ_{k,v} can be estimated by the prior: φ_{k,v} ≈ β_{k,v} / β_{k,·}. For LDA and DMR, which do not use word features, β_{k,v} is the constant symmetric prior; for WF-LDA and MetaLDA, which use word features, β_{k,v} is computed with the features of the unseen token. Following Step 1c, for MetaLDA, β_{k,v} = ∏_{f′=1}^{L′} δ_{f′,k}^{g_{v,f′}}.
Figure 2 shows the perplexity scores on Reuters, 20NG, TMN and WS with 200, 200, 100 and 50 topics respectively. MetaLDA significantly outperformed the other models when the proportion of training documents was low and the proportion of unseen words relatively high. The gap between MetaLDA and the other three models widens as the training proportion decreases, indicating that the meta information helps MetaLDA achieve better modelling accuracy in predicting unseen words.
V-E. Topic Coherence Evaluation
We further evaluate the semantic coherence of the words in the topics learnt by LDA, PTM, DMR, LF-LDA, WF-LDA, GPU-DMM and MetaLDA. Here we use the Normalised Pointwise Mutual Information (NPMI) [32, 33] to calculate the topic coherence score for topic k with its T top words: NPMI(k) = Σ_{j=2}^{T} Σ_{i=1}^{j−1} [ log( P(w_j, w_i) / (P(w_i) P(w_j)) ) / ( −log P(w_j, w_i) ) ], where P(w_i) is the probability of word w_i, and P(w_i, w_j) is the joint probability of words w_i and w_j co-occurring within a sliding window. These probabilities were computed on an external large corpus, i.e., a 5.48 GB Wikipedia dump in our experiments. The NPMI score of each topic was calculated with the top 10 words (T = 10) by the Palmetto package (http://palmetto.aksw.org). Again, we report the average scores and standard deviations over 5 random runs.
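The pairwise score and its aggregation over a topic's top words can be sketched as follows (the helper names are ours; Palmetto additionally handles the window-based counting on the reference corpus):

```python
import math

def npmi(p_i, p_j, p_ij, eps=1e-12):
    """Normalised PMI: log(p_ij / (p_i * p_j)) / (-log p_ij), in [-1, 1]."""
    p_ij = max(p_ij, eps)               # guard against zero co-occurrence
    return math.log(p_ij / (p_i * p_j)) / -math.log(p_ij)

def topic_npmi(top_words, p, p_joint):
    """Average NPMI over all pairs of a topic's top words.
    p: word -> marginal probability; p_joint: frozenset pair -> probability."""
    pairs = [(wi, wj) for i, wi in enumerate(top_words)
             for wj in top_words[i + 1:]]
    return sum(npmi(p[wi], p[wj], p_joint.get(frozenset((wi, wj)), 0.0))
               for wi, wj in pairs) / len(pairs)
```

Independent word pairs score 0, perfectly co-occurring pairs score 1, and pairs that never co-occur approach −1, so higher averages indicate more coherent topics.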
Table IV. NPMI topic coherence scores (mean±std over 5 random runs), averaged over all 100 topics and over the top 20 topics.

| Model   | All 100 topics: WS | TMN           | AN            | Top 20 topics: WS | TMN           | AN            |
|---------|--------------------|---------------|---------------|-------------------|---------------|---------------|
| LDA     | 0.0030±0.0047      | 0.0319±0.0032 | 0.0636±0.0033 | 0.1025±0.0067     | 0.1370±0.0043 | 0.0010±0.0052 |
| PTM     | 0.0029±0.0048      | 0.0355±0.0016 | 0.0640±0.0037 | 0.1033±0.0081     | 0.1527±0.0052 | 0.0004±0.0037 |
| DMR     | 0.0091±0.0046      | 0.0396±0.0044 | 0.0457±0.0024 | 0.1296±0.0085     | 0.1472±0.1507 | 0.0276±0.0101 |
| LFLDA   | 0.0130±0.0052      | 0.0397±0.0026 | 0.0523±0.0023 | 0.1230±0.0153     | 0.1456±0.0087 | 0.0272±0.0042 |
| WFLDA   | 0.0091±0.0046      | 0.0390±0.0051 | 0.0457±0.0024 | 0.1296±0.0085     | 0.1507±0.0055 | 0.0276±0.0101 |
| GPUDMM  | 0.0934±0.0106      | 0.0970±0.0034 | 0.0769±0.0012 | 0.0836±0.0105     | 0.0968±0.0076 | 0.0613±0.0020 |
| MetaLDA | 0.0311±0.0038      | 0.0451±0.0034 | 0.0326±0.0019 | 0.1511±0.0093     | 0.1584±0.0072 | 0.0590±0.0065 |
Table V. Per-iteration running time of the models.

| Model         | Reuters: 50 | 100    | 150    | 200     | WS: 50 | 100    | 150    | 200     | NYT: 200 | 500      |
|---------------|-------------|--------|--------|---------|--------|--------|--------|---------|----------|----------|
| LDA           | 0.0899      | 0.1023 | 0.1172 | 0.1156  | 0.0219 | 0.0283 | 0.0301 | 0.0351  | 0.7509   | 1.1400   |
| PTM           | 4.9232      | 5.8885 | 7.2226 | 7.7670  | 1.1840 | 1.6375 | 1.8288 | 2.0030  | –        | –        |
| DMR           | 0.6112      | 0.9237 | 1.2638 | 1.6066  | 0.4603 | 0.8549 | 1.2521 | 1.7173  | 13.7546  | 31.9571  |
| MetaLDAdf0.01 | 0.1187      | 0.1387 | 0.1646 | 0.1868  | 0.0396 | 0.0587 | 0.0769 | 0.1121  | 2.4679   | 4.9928   |
| LFLDA         | 2.6895      | 5.3043 | 8.3429 | 11.4419 | 2.4920 | 6.0266 | 9.1245 | 11.5983 | 95.5295  | 328.0862 |
| WFLDA         | 1.0495      | 1.6025 | 3.0304 | 4.8783  | 1.8162 | 3.7802 | 6.1863 | 8.6599  | 14.0538  | 31.4438  |
| GPUDMM        | 0.4193      | 0.7190 | 1.0421 | 1.3229  | 0.1206 | 0.1855 | 0.2487 | 0.3118  | –        | –        |
| MetaLDA0.1wf  | 0.2427      | 0.4274 | 0.6566 | 0.9683  | 0.1083 | 0.1811 | 0.2644 | 0.3579  | 4.6205   | 12.4177  |
| MetaLDA       | 0.2833      | 0.5447 | 0.7222 | 1.0615  | 0.1232 | 0.2040 | 0.3282 | 0.4167  | 6.4644   | 16.9735  |
It is known that conventional topic models applied directly to short texts suffer from low-quality topics, caused by insufficient word co-occurrence information. Here we study whether the meta information helps MetaLDA improve topic quality, compared with other topic models that can also handle short texts. Table IV shows the NPMI scores on the three short-text datasets; higher scores indicate better topic coherence. All the models were trained with 100 topics. Besides the NPMI scores averaged over all 100 topics, we also show the scores averaged over the top 20 topics with the highest NPMI, which eliminates "rubbish" topics, following [23]. MetaLDA performed significantly better than all the other models on the WS and AN datasets in terms of NPMI, which indicates that MetaLDA can discover more meaningful topics with the document and word meta information. We note that on the TMN dataset, even though the average score of MetaLDA is still the best, its score overlaps with the others' within one standard deviation, so the difference is not statistically significant.
V-F Running Time
In this section, we empirically study the efficiency of the models in terms of per-iteration running time. The implementation details of our MetaLDA are as follows: (1) The SparseLDA framework [31] reduces the complexity of LDA to be sub-linear by breaking the conditional of LDA into three "buckets", where the "smoothing only" bucket is cached for all the documents and the "document only" bucket is cached for all the tokens in a document. We adopted a similar strategy when implementing MetaLDA. When only the document meta information is used, the Dirichlet parameters for different documents in MetaLDA are different and asymmetric. Therefore, the "smoothing only" bucket has to be computed for each document, but we can cache it for all the tokens in that document, which still gives a considerable reduction in computational complexity. However, when the word meta information is used, the SparseLDA framework no longer applies to MetaLDA, as the parameters for each topic and each token are different. (2) By adapting the DistributedLDA framework [34], our MetaLDA implementation runs in parallel with multiple threads, which enables MetaLDA to handle larger document collections. The parallel implementation was used on the NYT dataset.
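For illustration, a simplified single-token version of the three-bucket decomposition is sketched below in Python (our implementation is in Java; a real SparseLDA-style sampler also caches the smoothing and document buckets instead of recomputing them per token, which is where the speed-up comes from):

```python
import random

def sample_topic(doc_topic, word_topic, topic_sum, alpha, beta, V):
    """Draw a topic for one token using the three-bucket decomposition:
    p(z=k) ~ (alpha[k] + n_dk) * (beta + n_kw) / (V*beta + n_k)
           = alpha[k]*beta*c_k          ("smoothing only")
           + n_dk*beta*c_k              ("document only")
           + (alpha[k] + n_dk)*n_kw*c_k ("topic word"),
    with c_k = 1 / (V*beta + n_k).

    doc_topic:  sparse dict {k: n_dk} for the current document
    word_topic: sparse dict {k: n_kw} for the current token
    topic_sum:  list of n_k over all K topics
    """
    K = len(topic_sum)
    coef = [1.0 / (V * beta + topic_sum[k]) for k in range(K)]
    s = sum(alpha[k] * beta * coef[k] for k in range(K))
    r = sum(n * beta * coef[k] for k, n in doc_topic.items())
    q = {k: (alpha[k] + doc_topic.get(k, 0)) * n * coef[k]
         for k, n in word_topic.items()}
    u = random.random() * (s + r + sum(q.values()))
    if u < s:                              # smoothing bucket: dense scan
        for k in range(K):
            u -= alpha[k] * beta * coef[k]
            if u <= 0:
                return k
    u -= s
    if u < r:                              # document bucket: only topics in this doc
        for k, n in doc_topic.items():
            u -= n * beta * coef[k]
            if u <= 0:
                return k
    u -= r
    for k, mass in q.items():              # topic-word bucket: only topics with this word
        u -= mass
        if u <= 0:
            return k
    return K - 1                           # floating-point fallback
```

Because the document and topic-word buckets iterate only over non-zero counts, the per-token cost is far below O(K) on sparse data; the catch noted above is that with word-level features the prior varies per (topic, token) pair, so this factorisation no longer holds.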
The per-iteration running time of all the models is shown in Table V. Note that: (1) On the Reuters and WS datasets, all the models ran with a single thread on a desktop PC with a 3.40GHz CPU and 16GB RAM. (2) Due to the size of NYT, we report the running time only for the models that are able to run in parallel; all the parallelised models ran with 10 threads on a cluster with a 14-core 2.6GHz CPU and 128GB RAM. (3) All the models were implemented in Java. (4) As the models with meta information add extra complexity to LDA, the per-iteration running time of LDA can be treated as the lower bound.
At the document level, both MetaLDAdf0.01 and DMR use priors to incorporate the document meta information, and both were implemented in the SparseLDA framework. However, our variant is about 6 to 8 times faster than DMR on the Reuters dataset and more than 10 times faster on the WS dataset. Moreover, the larger the number of topics, the greater the speed-up of our variant over DMR. At the word level, similar patterns can be observed: our MetaLDA0.1wf ran significantly faster than WFLDA and LFLDA, especially when more topics are used (20 to 30 times faster on WS). It is not surprising that GPUDMM has running speed comparable to our variant, because GPUDMM allows only one topic for each document. With both document and word meta information, MetaLDA still ran several times faster than DMR, LFLDA, and WFLDA. On NYT with the parallel settings, MetaLDA maintains its efficiency advantage as well.
VI Conclusion
In this paper, we have presented a topic modelling framework named MetaLDA that can efficiently incorporate both document and word meta information, yielding significant improvements over comparable models in terms of perplexity and topic quality. With two data augmentation techniques, MetaLDA enjoys full local conjugacy, allowing efficient Gibbs sampling, as demonstrated by its superior per-iteration running time. Furthermore, without loss of generality, MetaLDA works with both regular texts and short texts. The improvement of MetaLDA over other models that also use meta information is most remarkable when the word co-occurrence information is insufficient. As MetaLDA incorporates meta information through the Dirichlet priors, it is possible to apply the same approach to other Bayesian probabilistic models where Dirichlet priors are used. Moreover, it would be interesting to extend our method to use real-valued meta information directly, which we leave for future work.
Acknowledgement
Lan Du was partially supported by Chinese NSFC project under grant number 61402312. Gang Liu was partially supported by Chinese Post-Doc Fund under grant number LBHQ15031.
References
 [1] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet Allocation,” JMLR, pp. 993–1022, 2003.
 [2] G. A. Miller, “WordNet: a lexical database for English,” Communications of the ACM, pp. 39–41, 1995.
 [3] J. Pennington, R. Socher, and C. Manning, “GloVe: Global vectors for word representation,” in EMNLP, 2014, pp. 1532–1543.

 [4] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in NIPS, 2013, pp. 3111–3119.
 [5] R. Das, M. Zaheer, and C. Dyer, “Gaussian LDA for topic models with word embeddings,” in ACL, 2015, pp. 795–804.
 [6] D. Q. Nguyen, R. Billingsley, L. Du, and M. Johnson, “Improving topic models with latent feature word representations,” TACL, pp. 299–313, 2015.
 [7] G. Xun, V. Gopalakrishnan, F. Ma, Y. Li, J. Gao, and A. Zhang, “Topic discovery for short texts using word embeddings,” in ICDM, 2016, pp. 1299–1304.
 [8] C. Li, H. Wang, Z. Zhang, A. Sun, and Z. Ma, “Topic modeling for short texts with auxiliary word embeddings,” in SIGIR, 2016, pp. 165–174.
 [9] D. Mimno and A. McCallum, “Topic models conditioned on arbitrary features with Dirichlet-multinomial regression,” in UAI, 2008, pp. 411–418.
 [10] D. Ramage, C. D. Manning, and S. Dumais, “Partially labeled topic models for interpretable text mining,” in SIGKDD, 2011, pp. 457–465.
 [11] J. D. Mcauliffe and D. M. Blei, “Supervised topic models,” in NIPS, 2008, pp. 121–128.
 [12] D. Ramage, D. Hall, R. Nallapati, and C. D. Manning, “Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora,” in EMNLP, 2009, pp. 248–256.
 [13] D. Kim and A. Oh, “Hierarchical Dirichlet scaling process,” Machine Learning, pp. 387–418, 2017.
 [14] C. Hu, P. Rai, and L. Carin, “Non-negative matrix factorization for discrete data with hierarchical side-information,” in AISTATS, 2016, pp. 1124–1132.
 [15] D. Andrzejewski, X. Zhu, and M. Craven, “Incorporating domain knowledge into topic modeling via Dirichlet forest priors,” in ICML, 2009, pp. 25–32.
 [16] P. Xie, D. Yang, and E. Xing, “Incorporating word correlation knowledge into topic modeling,” in NAACL, 2015, pp. 725–734.
 [17] J. Petterson, W. Buntine, S. M. Narayanamurthy, T. S. Caetano, and A. J. Smola, “Word features for Latent Dirichlet Allocation,” in NIPS, 2010, pp. 1921–1929.
 [18] L. Hong and B. D. Davison, “Empirical study of topic modeling in twitter,” in Workshop on social media analytics, 2010, pp. 80–88.
 [19] Y. Zuo, J. Wu, H. Zhang, H. Lin, F. Wang, K. Xu, and H. Xiong, “Topic modeling of short texts: A pseudo-document view,” in SIGKDD, 2016, pp. 2105–2114.
 [20] J. Yin and J. Wang, “A Dirichlet multinomial mixture model-based approach for short text clustering,” in SIGKDD, 2014, pp. 233–242.
 [21] R. Mehrotra, S. Sanner, W. Buntine, and L. Xie, “Improving LDA topic models for microblogs via tweet pooling and automatic labeling,” in SIGIR, 2013, pp. 889–892.
 [22] D. Andrzejewski, X. Zhu, M. Craven, and B. Recht, “A framework for incorporating general domain knowledge into Latent Dirichlet Allocation using first-order logic,” in IJCAI, 2011, pp. 1171–1177.
 [23] Y. Yang, D. Downey, and J. BoydGraber, “Efficient methods for incorporating knowledge into topic models,” in EMNLP, 2015, pp. 308–317.
 [24] H. M. Wallach, D. M. Mimno, and A. McCallum, “Rethinking LDA: Why priors matter,” in NIPS, 2009, pp. 1973–1981.
 [25] C. Chen, L. Du, and W. Buntine, “Sampling table configurations for the hierarchical Poisson-Dirichlet process,” in ECML, 2011, pp. 296–311.
 [26] Y. Teh, M. Jordan, M. Beal, and D. Blei, “Hierarchical Dirichlet processes,” Journal of the American Statistical Association, pp. 1566–1581, 2006.
 [27] M. Zhou and L. Carin, “Negative binomial process count and mixture modeling,” TPAMI, pp. 307–320, 2015.
 [28] H. Zhao, L. Du, and W. Buntine, “Leveraging node attributes for incomplete relational data,” in ICML, 2017, pp. 4072–4081.
 [29] W. Buntine and M. Hutter, “A Bayesian view of the Poisson-Dirichlet process,” arXiv preprint arXiv:1007.0296v2 [math.ST], 2012.

 [30] J. Guo, W. Che, H. Wang, and T. Liu, “Revisiting embedding features for simple semi-supervised learning,” in EMNLP, 2014, pp. 110–120.
 [31] L. Yao, D. Mimno, and A. McCallum, “Efficient methods for topic model inference on streaming document collections,” in SIGKDD, 2009, pp. 937–946.
 [32] N. Aletras and M. Stevenson, “Evaluating topic coherence using distributional semantics,” in International Conference on Computational Semantics, 2013, pp. 13–22.
 [33] J. H. Lau, D. Newman, and T. Baldwin, “Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality,” in EACL, 2014, pp. 530–539.
 [34] D. Newman, A. Asuncion, P. Smyth, and M. Welling, “Distributed algorithms for topic models,” JMLR, pp. 1801–1828, 2009.