I Introduction
Existing topic modeling is often based on latent Dirichlet allocation (LDA) [1] and involves analyzing a given corpus to produce a distribution over words for each latent topic and a distribution over latent topics for each document. The distributions representing topics are often useful and generally representative of a linguistic topic. Unfortunately, assigning labels to these topics is often left to manual interpretation.
Identifying topic labels is useful in summarizing a set of words with a single label. For example, words such as pencil, laptop, ruler, eraser, and book can be mapped to the label “School Supplies.” Adding descriptive semantics to each topic can help people, especially those without domain knowledge, to understand topics obtained by topic modeling.
A motivating application of accurate topic labeling is the development of summarization systems for primary care physicians, who face the challenge of being inundated with too much data for a patient and too little time to comprehend it all [2]. The labels can be used to give a faster and more appropriate overview, or summary, of a patient’s medical history, leading to better outcomes for the patient. This added information can bring significant value to the field of clinical informatics, which already utilizes topic modeling without labeling [3, 4, 5].
Existing approaches in labeling topics usually do their fitting of labels to topics after completion of the unsupervised topic modeling process. A topic produced by this approach may not always match well with any semantic concepts and would therefore be difficult to categorize with a single label. These problems are best illustrated via a simple case study.
I-1 Case Study
Suppose a corpus from a news source consists of two articles, given by documents $d_1$ and $d_2$, each with three words:
 $d_1$: pencil, pencil, umpire
 $d_2$: ruler, ruler, baseball
LDA (with the traditionally used collapsed Gibbs sampler, standard hyperparameters, and the number of topics ($K$) set to two) would output different results for different runs due to its inherently stochastic nature. It is quite possible to obtain the following topic assignments:
 , ,
 , ,
But these assignments differ from the ideal solution, which requires knowing the context of the topics from which these words come. If the topic modeling were to incorporate prior knowledge about the topics “School Supplies” and “Baseball”, then the topic modeling process would more likely generate the ideal topic assignments of:
 $d_1$: 1, 1, 2
 $d_2$: 1, 1, 2
and assign a label of “School Supplies” to topic 1 and “Baseball” to topic 2. Furthermore, it is advantageous to incorporate this prior knowledge during the topic modeling process. Consider the following table, which displays the labels produced by four different mapping techniques applied to the first result, using the Wikipedia articles of “School Supplies” and “Baseball” as the prior knowledge:
Technique      Topic 1   Topic 2
JS Divergence  Baseball  Baseball
TF-IDF/CS      (same)    (same)
Counting       Baseball  Baseball
PMI            (same)    (same)
Applying this labeling after topic modeling can lead to problems that stem from the topics themselves. This is not so much a problem of the mapping techniques as of the topics used as input. By separating the topics during inference, this problem of combining different semantic topics can be avoided.
To overcome this problem, one may take a supervised approach that incorporates such prior knowledge into the topic modeling process to improve the quality of topic assignments and label topics more effectively. However, existing supervised approaches [6, 7, 8] are either too lenient or too strict. For example, in the Concept-topic model (CTM) [6], a multinomial distribution is placed over known concepts with associated word sets. This pioneering approach does integrate prior knowledge, but it does not take word distributions into account. For example, if a document is generated about the topic “School Supplies”, it is much more probable to see the word “pencil” than the word “compass”, even though both words may be associated with the topic “School Supplies”. This technique also requires some supervision, namely manually inputting preexisting concepts and their bags of words.
Another approach, given by Hansen et al. as explicit Dirichlet allocation [7], incorporates a preexisting distribution based on Wikipedia but does not allow for variance from the Wikipedia distribution. This approach fulfills the goal of incorporating prior knowledge together with its distributions, but it requires the topics in the generated corpus to strictly follow the Wikipedia word distributions.
To address these limitations, we propose the SourceLDA model, which is a balance between these two approaches. The goal is to allow for simultaneous discovery of both known and unknown topics. Given a collection of known topics and their word distributions, SourceLDA is able to identify the subset of these topics that appear in a given corpus. It allows some variance in word distributions to the extent that it optimizes the topic modeling. The contributions of this work are summarized as follows:

We propose a novel technique for topic modeling in a semi-supervised fashion that takes into account preexisting topic distributions.

We show how to find the appropriate topics in a corpus given an input set that contains a subset of the topics used to generate a corpus.

We explain how to make use of prior knowledge sources. In particular, we show how to use Wikipedia articles to form word distributions.

We introduce an approach that allows for variance from an input topic to the latent topic discovered during the topic modeling process.
The rest of this paper is organized as follows: In Section 2, we give a brief introduction to the LDA algorithm and the Dirichlet distribution. A more detailed description of the SourceLDA algorithm is presented in Section 3. In Section 4, the algorithm is used and evaluated under various metrics. Related literature is highlighted in Section 5. Section 6 gives the conclusions of this paper.
For reproducible research, we make all of our code available online (https://github.com/uclascai/SourceLDA).
II Preliminaries
II-A Dirichlet Distribution
The Dirichlet distribution is a distribution over probability mass functions with a specific number of atoms and is commonly used in Bayesian models. A property of the Dirichlet that is often used in inference of Bayesian models is conjugacy to the multinomial distribution. This allows for the posterior of a random variable with a multinomial likelihood and a Dirichlet prior to also be a Dirichlet distribution.
The parameters are given as a vector denoted by $\alpha$. The probability density function for a given probability mass function (PMF) $\theta$ and parameter vector $\alpha$ of length $K$ is defined as:
$$f(\theta; \alpha) = \frac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} \theta_i^{\alpha_i - 1}$$
A sample from the Dirichlet distribution produces a PMF that is parameterized by $\alpha$. The choice of a particular set of $\alpha$ values influences the outcome of the generated PMF. If all $\alpha$ values are the same (symmetric parameter), then as $\alpha$ approaches $0$ the probability will be concentrated on a smaller set of atoms. As $\alpha$ approaches infinity, the PMF will become the uniform distribution. If all $\alpha_i$ are natural numbers, then each individual $\alpha_i$ can be thought of as the “virtual” count for the $i$th value [9].
II-B Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA) is the basis for many existing probabilistic topic models, and the framework for the approach presented by this paper. Since we enhance the LDA model in our proposed approach it is worth giving a brief overview of the algorithm and model of LDA.
LDA is a hierarchical Bayes model which utilizes Dirichlet priors to estimate the intractable latent variables of the model. At a high level, LDA is based on a generative model in which each word of an input document from a corpus is chosen by first selecting a topic that corresponds to that word and then selecting the word from a topic-to-word distribution. Each topic-to-word distribution and word-to-topic distribution is drawn from its respective Dirichlet distribution. The formal definition of the generative algorithm over a corpus is:
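The standard LDA generative process can be sketched in code as follows. This is a minimal illustration rather than the paper's own listing: the function names, the fixed document length (LDA normally draws the length from a separate distribution), and the symmetric `alpha` and `beta` values are assumptions made for the example.

```python
import random


def sample_dirichlet(params, rng):
    """Draw a PMF from Dirichlet(params) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(pmf, rng):
    """Draw one index from a discrete distribution."""
    return rng.choices(range(len(pmf)), weights=pmf, k=1)[0]


def generate_corpus(num_docs, doc_len, num_topics, vocab_size, alpha, beta, seed=0):
    """Generate (word, topic) pairs for every token of a synthetic corpus."""
    rng = random.Random(seed)
    # One topic-to-word distribution per topic, drawn from a symmetric Dirichlet(beta).
    phi = [sample_dirichlet([beta] * vocab_size, rng) for _ in range(num_topics)]
    corpus = []
    for _ in range(num_docs):
        # One distribution over topics per document, drawn from Dirichlet(alpha).
        theta = sample_dirichlet([alpha] * num_topics, rng)
        doc = []
        for _ in range(doc_len):  # fixed length is a simplification for this sketch
            z = sample_index(theta, rng)   # choose the topic for this token
            w = sample_index(phi[z], rng)  # choose the word from that topic
            doc.append((w, z))
        corpus.append(doc)
    return phi, corpus


phi, corpus = generate_corpus(num_docs=5, doc_len=8, num_topics=3,
                              vocab_size=10, alpha=0.5, beta=0.1, seed=1)
```

Inference runs this process in reverse: only the words are observed, and the topic assignments and both families of distributions must be estimated.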
From the generative algorithm the resultant Bayes model is shown by Figure 1(a).
Bayes’ law is used to infer the latent $\theta$ distribution, $\phi$ distribution, and $z$:
$$P(\theta, \phi, z \mid w, \alpha, \beta) = \frac{P(\theta, \phi, z, w \mid \alpha, \beta)}{P(w \mid \alpha, \beta)}$$
Unfortunately, the exact computation of this equation is intractable. Hence, it must be approximated with techniques such as expectation-maximization [1], Gibbs sampling, or collapsed Gibbs sampling [10].
III Proposed Approach
SourceLDA is an extension of the LDA generative model. In SourceLDA, after a known set of topics is determined, an initial word-to-topic distribution is generated from corresponding Wikipedia articles. The desideratum is to enhance existing LDA topic modeling by integrating prior knowledge into the topic modeling process. The relevant terms and concepts used in the following discussion are defined below.
Definition 1 (Knowledge source)
A knowledge source is a collection of documents that are focused on describing a set of concepts. For example, the knowledge source used in our experiments is the set of Wikipedia articles that describe the categories we select from the Reuters dataset.
Definition 2 (Source Distribution)
The source distribution is a discrete probability distribution over the words of a document describing a topic. The probability mass function is given by
$$f(w) = \frac{n_w}{\sum_{w' \in W} n_{w'}}$$
where $W$ is the set of all words in the document, $w \in W$, and $n_w$ is the number of times word $w$ appears in the document.
Definition 3 (Source Hyperparameters)
For a given document in a knowledge source, the knowledge source hyperparameters are defined by the vector $\delta = (\delta_1, \delta_2, \ldots, \delta_V)$, where $\delta_i = n_{w_i} + \epsilon$ and $\epsilon$ is a very small positive number that allows for nonzero probability draws from the Dirichlet distribution. $V$ is the size of the vocabulary of the corpus for which we are topic modeling, and $n_{w_i}$ is the number of times the word $w_i$ from the corpus vocabulary appears in the knowledge source document.
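Definitions 2 and 3 can be illustrated with a short sketch. It assumes that the source distribution is the normalized word count of the knowledge source document and that each source hyperparameter is the word's count plus a small constant; the function names and the default `epsilon` are hypothetical choices for this example.

```python
from collections import Counter


def source_distribution(doc_tokens):
    """Normalized word counts of one knowledge source document (Definition 2)."""
    counts = Counter(doc_tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def source_hyperparameters(doc_tokens, corpus_vocab, epsilon=1e-6):
    """Count of each corpus-vocabulary word in the document, plus epsilon (Definition 3).

    The vector is indexed by the corpus vocabulary, so words that never occur in
    the knowledge source document still receive a small positive value.
    """
    counts = Counter(doc_tokens)
    return [counts.get(word, 0) + epsilon for word in corpus_vocab]


article = ["pencil", "pencil", "ruler", "eraser"]
dist = source_distribution(article)
hyper = source_hyperparameters(article, ["pencil", "umpire"], epsilon=1e-6)
```

Here `hyper` is indexed by the corpus vocabulary rather than the article vocabulary, which is why "umpire" receives only the `epsilon` term.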
We detail three approaches to capture the intent of SourceLDA. The first approach is a simple enhancement to the LDA model that allows topic distributions to be influenced, but it requires more user intervention. The second approach allows for the mixing of unknown topics, and the third approach combines the previous two. It moves toward a complete solution to topic modeling based on prior knowledge sources.
III-A Bijective Mapping
In the simplest approach, the SourceLDA model assumes that there exists a one-to-one mapping between a known set of topics and the topics used to generate a corpus. The generative model then assumes that, instead of the topic-to-word distributions being sampled from a symmetric Dirichlet distribution, a set of source distributions is given as input and sampled from after each topic assignment is sampled for a given token position. The generative process for a corpus, adapted from the traditional LDA generative model during the construction of the $\phi$ distributions, is as follows (for brevity only the relevant parts of the existing LDA algorithm are shown):
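where $\delta_k$ represents the knowledge source hyperparameters for the $k$th knowledge source document. The generative model differs from the traditional LDA model only in how each $\phi_k$ is built, so the derivation for inference changes by a single factor as well. To approximate the distributions for $\theta$ and $\phi$, a collapsed Gibbs sampler can approximate the $z$ assignments as follows.
From the Bayesian model the following equations can easily be derived:
$$P(z_i = j \mid z_{-i}, w) \propto P(w_i \mid z_i = j, z_{-i}, w_{-i}) \, P(z_i = j \mid z_{-i})$$
with
$$P(w_i \mid z_i = j, z_{-i}, w_{-i}) = \frac{n^{(w_i)}_{-i,j} + \delta_{j,w_i}}{\sum_{w=1}^{V} \left( n^{(w)}_{-i,j} + \delta_{j,w} \right)}$$
and
$$P(z_i = j \mid z_{-i}) = \frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + K\alpha}$$
$n^{(w)}$ and $n^{(d)}$ in this and the following equations represent count matrices for the number of times a word is assigned to a topic and the number of times a topic is assigned to a document, respectively. For brevity, since the prior probability is unchanged in the “Bijective Mapping” model, we skip the derivation, which is well documented elsewhere [10, 11, 12].
Putting the two equations together gives the final Gibbs sampling equation:
$$P(z_i = j \mid z_{-i}, w) \propto \frac{n^{(w_i)}_{-i,j} + \delta_{j,w_i}}{\sum_{w=1}^{V} \left( n^{(w)}_{-i,j} + \delta_{j,w} \right)} \cdot \frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + K\alpha}$$
Given the approximation to the topic assignments, the $\phi$ and $\theta$ distributions are calculated as:
$$\phi_{j,w} = \frac{n^{(w)}_{j} + \delta_{j,w}}{\sum_{w'=1}^{V} \left( n^{(w')}_{j} + \delta_{j,w'} \right)}, \qquad \theta_{d,j} = \frac{n^{(d)}_{j} + \alpha}{n^{(d)}_{\cdot} + K\alpha} \qquad (1)$$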
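A collapsed Gibbs sweep for the bijective model can be sketched as below. This is an illustrative reading of the sampler, not the authors' implementation: `delta[k][w]` is a hypothetical name for the source hyperparameter of word `w` under topic `k`, which simply takes the place of the symmetric word prior of standard LDA. The document-side denominator is dropped because it is the same for every topic. The toy data reuse the case study (pencil=0, ruler=1, umpire=2, baseball=3) with made-up hyperparameter values.

```python
import random


def init_counts(docs, z, num_topics, vocab_size):
    """Build topic-word, topic-total, and document-topic counts from assignments."""
    n_wt = [[0] * vocab_size for _ in range(num_topics)]
    n_t = [0] * num_topics
    n_dt = [[0] * num_topics for _ in docs]
    for d, doc in enumerate(docs):
        for w, k in zip(doc, z[d]):
            n_wt[k][w] += 1
            n_t[k] += 1
            n_dt[d][k] += 1
    return n_wt, n_t, n_dt


def gibbs_sweep(docs, z, counts, delta, alpha, rng):
    """Resample every topic assignment once, updating the counts in place."""
    n_wt, n_t, n_dt = counts
    num_topics = len(delta)
    delta_sum = [sum(row) for row in delta]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k_old = z[d][i]
            # Remove the current token from all counts.
            n_wt[k_old][w] -= 1
            n_t[k_old] -= 1
            n_dt[d][k_old] -= 1
            # Unnormalized conditional: word term times document term.
            weights = [
                (n_wt[k][w] + delta[k][w]) / (n_t[k] + delta_sum[k])
                * (n_dt[d][k] + alpha)
                for k in range(num_topics)
            ]
            k_new = rng.choices(range(num_topics), weights=weights, k=1)[0]
            z[d][i] = k_new
            n_wt[k_new][w] += 1
            n_t[k_new] += 1
            n_dt[d][k_new] += 1


docs = [[0, 0, 2], [1, 1, 3]]
eps = 1e-6  # assumed small constant keeping every hyperparameter positive
delta = [[5 + eps, 5 + eps, eps, eps],   # hypothetical "School Supplies" source
         [eps, eps, 5 + eps, 5 + eps]]   # hypothetical "Baseball" source
rng = random.Random(0)
z = [[rng.randrange(2) for _ in doc] for doc in docs]
counts = init_counts(docs, z, num_topics=2, vocab_size=4)
for _ in range(20):
    gibbs_sweep(docs, z, counts, delta, alpha=0.5, rng=rng)
```

Because the source hyperparameters are asymmetric, each topic is pulled toward its own knowledge source from the first sweep, which is what ties a latent topic to a label.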
In the case when all topics are known, this model has the advantage of conforming the $\phi$ distributions to the source distributions, but it has three drawbacks. First, even though there is some variability between the $\phi$ distribution and the source distribution, as illustrated by Figure 2, there may be cases in which this constraint should be relaxed even further. This is because it is entirely possible to generate a corpus about a known topic without exactly following the frequencies at which the topic is discussed in its respective article. Second, this model requires the user to input the known topics, and other possible supervised approaches may be better suited to the task [14, 15, 16]. Third, we are not allowing for the possibility that the corpus was generated from a mixture of known and unknown topics, which is a more realistic scenario for an arbitrary document. The next model aims to resolve this last deficiency.
III-B Known Mixture of Topics
The next model assumes that the topic model is given how many topics are known topics (as well as their word distributions) and how many are unknown topics. The previous approach works quite well in this situation, in that an unknown topic will have a symmetric $\beta$ parameter which will capture assignments that were left unallocated due to a low probability of matching any known topic.
The resulting model helps to solve the existing problems of the bijective model and requires only a minor change to the existing generative model. It works well alongside the bijective model in that the symmetric Dirichlet prior can be used to guide a topic toward being either a general unknown topic or a known topic. The model changes as shown below, with a minor change to the generative algorithm and the collapsed Gibbs sampling.
Where is the total number of nonsource topics. The change required to the collapsed Gibbs sampling is then:
and
(2) 
This approach gives the benefit of allowing a mixture of known topics and unknown topics, but problems still arise in that the Dirichlet distributions for the source distribution may be too restricting.
III-C SourceLDA
By using the counts as hyperparameters, the resultant $\phi$ distribution will take on the shape of the word distribution derived from the knowledge source. However, this might be at odds with the aim of enhancing existing topic modeling. With the goal being to influence the $\phi$ distribution, it is entirely plausible to have divergence between the two distributions. In other words, $\phi$ may not need to strictly follow the corresponding knowledge source distribution.
III-C1 Variance from the source distribution
To allow for this relaxation, another parameter $\lambda$ is introduced into the model, which is used to allow for a higher deviance from the source distribution. To obtain this variance, each source hyperparameter is raised to the power of $\lambda$. Thus as $\lambda$ approaches $0$, each hyperparameter approaches $1$ and the subsequent Dirichlet draw will allow all discrete distributions with equal probability. As $\lambda$ approaches $1$, the Dirichlet draw will be tightly conformed to the source distribution.
The addition of $\lambda$ changes the existing generative model only slightly and allows for a variance for each individual source hyperparameter, which frees us from an overly restrictive binding to the associated knowledge source distribution. The parameter $\lambda$ acts as a measure of how much divergence is allowed for a given modeled topic from the knowledge source distribution. Figure 3 shows how the JS divergence changes with changes to the $\lambda$ parameter.
With the introduction of $\lambda$ as an input parameter, the new topic model has the advantage of allowing variance while leaving the collapsed Gibbs sampling equation unchanged. However, this also imposes a uniform variance from the knowledge base distribution for all latent topics. This can be a problem if the corpus was generated with some topics influenced strongly and others less so. To solve this we can introduce $\lambda$ as a hidden parameter of the model.
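The effect of the exponent on the source hyperparameters can be seen in a small sketch. It assumes the exponent lies between 0 and 1 and uses the fact that the mean of a Dirichlet distribution is its normalized parameter vector; the names `relax_hyperparameters`, `dirichlet_mean`, and `lam` are hypothetical.

```python
def relax_hyperparameters(hyper, lam):
    """Raise every source hyperparameter to the power lam."""
    return [x ** lam for x in hyper]


def dirichlet_mean(params):
    """Mean of a Dirichlet distribution: the normalized parameter vector."""
    total = sum(params)
    return [p / total for p in params]


hyper = [4.0, 1.0, 0.25]
flat = relax_hyperparameters(hyper, 0.0)    # every parameter becomes 1
half = relax_hyperparameters(hyper, 0.5)    # square roots of the parameters
strict = relax_hyperparameters(hyper, 1.0)  # parameters left unchanged
```

With an exponent of 0 the prior is flat over the simplex, and with an exponent of 1 its mean equals the normalized source counts; intermediate values interpolate between the two.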
III-C2 Approximating $\lambda$
In the ideal situation $\lambda$ will be close to $1$ for most knowledge-based latent topics, with the flexibility to deviate as required by the data. For this we assume a Gaussian prior over $\lambda$ with mean set to $\mu$. The variance then becomes a modeled parameter that conceptually can be thought of as how much variance from the knowledge source distribution we wish to allow in our topic model. In assuming a Gaussian prior for $\lambda$, we must integrate $\lambda$ out of the collapsed Gibbs sampling equations (only the probability of $w_i$ under topic $j$ is shown; the probability of topic $j$ in document $d$ is unchanged and omitted).
then becomes
Unfortunately closed form expressions for these integrals are hard to obtain and so they must be approximated numerically during sampling.
Another problem arises in that the effect of changing $\lambda$ is not on par with the change of the Gaussian distribution, as can be seen in Figure 3. To make the changes of $\lambda$ more in line with what is expected from the Gaussian PDF, we must map each individual $\lambda$ value in the range $0$ to $1$ to a value which produces a change in the JS divergence in a linear fashion. We approximate such a mapping function, which has a linear derivative, as shown in Figure 4. The function is approximated by linear interpolation over an aggregate of a large number of samples for each point taken in the range $0$ to $1$. Our collapsed Gibbs sampling equations then become:
(3)
and
(4) 
III-C3 Superset Topic Reduction
A third problem involves knowing the right mixture of known topics and unknown topics. It is also entirely possible that many known topics may not be used by the generative model. Our desire to leave the model as unsupervised as possible calls for input that is a superset of the actual generative topic selection in order to avoid manual topic selection. In the case of modeling only a specific number of topics over the corpus, the problem then becomes how to choose which knowledge source latent topics to allow in the model vs. how many unlabeled topics to allow.
The goal then is to allow for a superset of knowledge source topics as input and, during inference, to select the best subset of these together with a mixture of unknown topics, where the total number of unlabeled topics is given as input. The approach is to use a mixture of unlabeled topics alongside the labeled knowledge source topics. The total number of topics then becomes the number of unlabeled topics plus the number of knowledge source topics. During inference we eliminate topics which are not assigned to any documents. At the end of the sampling phase we can then use a clustering algorithm (such as k-means, JS divergence) to further reduce the modeled topics to the desired total number of topics. As described further in the experimental section, with the goal of capturing topics that occur frequently in the corpus, topics not appearing in a sufficient number of documents were eliminated.
The complete generative process is shown in Figure 1(b) and described below:
The full collapsed Gibbs sampling algorithm is given in Algorithm 1.
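The document-frequency thresholding used to reduce the superset of topics can be sketched as follows. This is one possible reading of the reduction step and not the authors' algorithm; the function name and the `min_docs` threshold are hypothetical.

```python
def surviving_topics(z, num_topics, min_docs):
    """Keep topics that are assigned to at least one token in min_docs documents.

    z[d] is the list of topic assignments for the tokens of document d.
    """
    doc_freq = [0] * num_topics
    for doc_assignments in z:
        for k in set(doc_assignments):  # count each topic once per document
            doc_freq[k] += 1
    return [k for k in range(num_topics) if doc_freq[k] >= min_docs]


assignments = [[0, 0, 2], [0, 2, 2], [1, 0, 0]]
kept = surviving_topics(assignments, num_topics=4, min_docs=2)
```

In this toy example topic 3 is never assigned and topic 1 appears in only one document, so both are dropped.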
III-C4 Analysis
By using a clustering algorithm or thresholding the topic document frequency, the collapsed Gibbs algorithm is guaranteed to produce topics. The running time is a function of the number of iterations , average words per document , number of documents , number of topics and number of approximation steps , and is . This differs only from the traditional collapsed Gibbs sampling in LDA by an increase of . But since we have built the approach to potentially have a large this difference can have a significant impact on running times.
Approaches exist that can parallelize the sampling procedure, but these are often approximations or can potentially have slower than baseline running times [17, 18, 19]. We present two modifications to the original algorithm that allow for inference while guaranteeing the exactness of the results to the original Gibbs sampling. The first one makes use of prefix sums rules [20] and guarantees a running time of:
with being the number of parallel units. This algorithm is given by Algorithm 2.
This algorithm is practical in situations where is large, but suffers from the limitations of the number of context switches required for the threads to wait at their respective barriers. A simpler implementation approach that reduces the number of context switches is to add the sums for each thread then wait for a barrier. When the barrier is released we add the end values together and then in parallel we add the remaining necessary items. This approach is given in Algorithm 3. The running time is then:
These two algorithms allow for mitigation of the increase in the number of topics and should approach times very similar to those of standard LDA runs. They are also very extensible and can be used in other optimization algorithms.
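The blocked variant described above can be illustrated with a sequential simulation. This sketch is not Algorithm 2 or Algorithm 3 themselves: the two phases marked as parallelizable are run here in an ordinary loop, the function names are hypothetical, and the weight list is assumed to be non-empty.

```python
from bisect import bisect_right
from itertools import accumulate


def blocked_prefix_sums(weights, num_blocks):
    """Inclusive prefix sums computed block by block, as parallel workers would."""
    n = len(weights)
    size = -(-n // num_blocks)  # ceiling division: items per block
    blocks = [weights[i:i + size] for i in range(0, n, size)]
    # Phase 1 (parallelizable): each block computes its own running sums.
    local = [list(accumulate(block)) for block in blocks]
    # Barrier: combine the block totals into one starting offset per block.
    offsets = [0.0]
    for sums in local[:-1]:
        offsets.append(offsets[-1] + sums[-1])
    # Phase 2 (parallelizable): shift every local sum by its block offset.
    return [offset + s for offset, sums in zip(offsets, local) for s in sums]


def sample_topic(prefix, u):
    """Return the topic whose cumulative interval contains u * total, for u in [0, 1)."""
    return bisect_right(prefix, u * prefix[-1])


prefix = blocked_prefix_sums([1, 2, 3, 4, 5], num_blocks=3)
topic = sample_topic(prefix, 0.5)
```

Because the result equals the ordinary running sum, drawing a topic from it gives the same outcome as the sequential sampler for the same random number, which is the sense in which such a scheme stays exact.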
III-C5 Input determination
Determining the necessary parameters and inputs into LDA is an established research area [21], but since the proposed model introduces additional input requirements a brief overview will be given about how to best set the parameters and determine the knowledge source.
Parameter selection
To determine the appropriate parameters, techniques utilizing log likelihood have previously been established [10]. Since these approaches generally require held out data and are a function of the , , and variables the introduction of and will not differentiate from their original equations. For example the perplexity calculations used for SourceLDA are based off of importance sampling [22], or latent variable estimation via Gibbs sampling [23]. Importance sampling is only a function of given by Equation 4, and estimation via Gibbs sampling can made using Equation 4 and by the following equation (, , and represent the corresponding variables in the test document set):
and
It is recommended to set the parameters so as to maximize the log likelihood. Further analysis, such as whether or not the parameters can be learned a priori from the data, is not the focus of this paper and is thus left as an open research area.
Knowledge source selection
SourceLDA is designed to be used only with a corpus that has a known superset of topics which account for a large portion of the tokens. An example of such a case is a corpus consisting of clinical patient notes. Since there are extensive knowledge sources covering essentially all medical topics, SourceLDA can be useful in discovering and labeling these existing topics. In cases where it is not so easy to collect a superset of topics, traditional approaches may be more useful.
IV Evaluation
To test the results of the SourceLDA algorithm, we set up experiments against competing models. The models most similar to our proposed approach were used for comparison. These are: latent Dirichlet allocation (LDA) [1], explicit Dirichlet allocation (EDA) [7], and the Concept-topic model (CTM) [6]. Other approaches such as supervised latent Dirichlet allocation (sLDA) [14], discriminative LDA (DiscLDA) [15], and labeled LDA (LLDA) [16] are not used, since a main desideratum of SourceLDA is to require much less supervision than is needed by these methods. Likewise, hierarchical methods [24] are omitted because there is no established hierarchy in the knowledge source data for this model. We describe the experimental setups and the metrics used to compare results in more detail below.
IV-A A Graphical Example
Following a previously established experiment [10], we show the utility of SourceLDA by visualizing topics created with words that correspond to the pixel locations in a picture; but we add a key difference. The original topics are augmented, used to generate a corpus, and then hidden. Only the non augmented topics are given as input with the goal of discovering the augmented topics using the corpus and their original topics.
IV-A1 Experimental Setup
We start by creating ten topics with the vocabulary being the set of pixel locations in a picture. The vocabulary () and bag of words representation of a topic () are defined as:
The topics are shown by Figure 5(a) with the intensity () of a pixel corresponding to word in topic equal to:
The representation of topics in this manner leads to a total of topics. These original topics are then augmented by pairing each topic with a random different topic and swapping a random word (pixel) that is assigned to each topic given that the swapped words do not belong to their original assignments. Figure 5(b) shows the augmented topics which represent a augmentation rate between the original topics. From the set of augmented topics we generate a 2,000 document corpus using the generative model of LDA. Each document consists of words with topic assignments drawn from a distribution sampled from the Dirichlet distribution parameterized by . With the knowledge source consisting solely of the original non augmented topics we run SourceLDA on the corpus hoping to discover and properly label the augmented topics. For comparative analysis we also run EDA and CTM against the same data set.
IV-A2 Experimental Results
As shown in Figure 6, SourceLDA discovers the augmented topics given the set of original topics. Not only is SourceLDA able to match the discovered topics to the augmented distributions used in the generation of the corpus, but it is also able to match them to their respective non-augmented source distributions. This simple experiment highlights a big advantage of SourceLDA, which is the ability to discover topics that differ from their respective supervised input set. Other models such as EDA and CTM are unable to label the augmented topics correctly because the topics contain a word (pixel) not in the original distribution. The comparative average JS divergence was , , and for SourceLDA, EDA, and CTM respectively.
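For reference, the JS divergence between two discrete distributions can be computed as in the sketch below. The base-2 logarithm, which bounds the value between 0 and 1, is an assumption of this example; the text does not state which base was used.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


same = js_divergence([0.5, 0.5], [0.5, 0.5])
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
```

Unlike the KL divergence, this measure is symmetric and stays finite when one distribution assigns zero probability to a word, which matters when comparing a learned topic against a source distribution.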
IV-B Integrating $\lambda$
A reasonable assumption about a corpus in which some topics are generated from a knowledge source is that the topics used in the corpus will deviate, to a greater or lesser degree, from their respective source distributions, and that each individual topic will deviate at a different rate than other topics. The introduction of $\lambda$ to SourceLDA as a parameter to be learned from the data gives different topics the flexibility to be influenced to different degrees by their source distributions, but comes at an increase in computational cost. To show that in certain cases this flexibility is needed to obtain more accurate results, we devise an experiment consisting of topics with different deviations from their respective source distributions.
IV-B1 Experimental Setup
A synthetic document corpus is generated from a knowledge source of randomly selected Wikipedia topics. The corpus is generated using the bijective model of SourceLDA as outlined in Section 3(A), consisting of topics, an average word count per document of words, , and . Furthermore even though for each topic was drawn from we bound the value drawn to the interval for comparative analysis. We then run SourceLDA under the bijective model for a baseline of , against runs of SourceLDA with fixed. After each run we compare the classification accuracy and perplexity values.
IV-B2 Experimental Results
For all fixed $\lambda$ runs, the baseline approach of varying $\lambda$ in accordance with the normal distribution results in a higher classification accuracy. By allowing $\lambda$ to deviate, the model can make up for incorrect parameter assignments due to a misleading perplexity value. As shown in Figure 7, classification accuracy is not perfectly correlated with perplexity. This is shown by the baseline method reporting a higher perplexity value than the fixed value while maintaining a higher classification accuracy. Even though we still recommend perplexity or other log-likelihood maximization approaches to set the parameters in any unknown data set, maximizing log-likelihood has been shown to be a less than perfect metric for evaluating topic models [25, 26]. In this experiment and the remaining experiments we take classification accuracy to be a more appropriate measurement for evaluating topic models.
Table I
Inventories                         Natural Gas                      Balance of Payments
SRCLDA      IRLDA        CTM        SRCLDA   IRLDA       CTM         SRCLDA    IRLDA    CTM
inventory   systems      sales      gas      corp        gas         account   said     said
cost        products     year       natural  contract    said        surplus   public   june
stock       said         sold       used     company     total       deficit   state    april
accounting  information  retail     water    services    value       current   private  beginning
goods       technology   given      oil      unit        near        balance   planned  great
management  company      place      carbon   subsidiary  natural     currency  reduce   later
time        data         marketing  cubic    completed   properties  trade     local    remain
costs       network      improved   energy   work        california  exchange  added    reserve
financial   kodak        passed     fuel     dlr         wells       capital   make     equivalent
process     available    addition   million  received    future      foreign   did      imported
IV-C Reuters Newswire Analysis
To show the type of topics discovered by SourceLDA, we run the model on an existing dataset. This collection contains documents from the Reuters newswire from 1987. The dataset contains 21,578 articles spread across a large set of categories. One important feature of the dataset is the set of given categories that we can use for our topic labeling. These include broad categories such as shipping, interest rates, and trade, as well as more refined categories such as rubber, zinc, and coffee. We chose to apply our topic labeling method to this dataset because the Reuters dataset is widely used for information retrieval and text categorization applications. Due to its widespread use, it can considerably aid us in comparing our results to other studies. Additionally, because it contains distinct categories that we can use as our known set of topics, we can easily demonstrate the viability of our model.
IV-C1 Experimental Setup
SourceLDA, LDA, and CTM were run against the Reuters-21578 newswire collection. Since EDA does not discover new topics, nor does it update the word distributions of the input topics, we do not include EDA in this experiment. From the original 21,578 document corpus we select a subset of 2,000 documents. The SourceLDA and CTM supplementary distributions were generated by first obtaining a list of topics from the Reuters-21578 dataset. Next, for each topic, the corresponding Wikipedia article was crawled and the words in the topic were counted, forming their respective distributions. Querying Wikipedia resulted in distinct topics as our superset for the knowledge source. Out of the crawled available topics, only topics appear in the 2,000 document corpus. This represents the ideal conditions in which SourceLDA is to be applied: a corpus in which a significant portion of tokens are generated from a subset of a larger and relatively easy to obtain topic set. For all models, a symmetric Dirichlet parameter of (where is the number of topics) and (where is the size of the vocabulary) was used for and respectively. For SourceLDA, and were determined by experimentally finding a local minimum value of perplexity which resulted from the parameter values of for and for . The bag of words used in the CTM was taken from the top 10,000 words by frequency for each topic. The models showed good convergence after 1,000 iterations. After sampling was complete for LDA, the resulting topic-to-word distribution was mapped using an information retrieval (IR) approach. The IR approach was to use cosine similarity of documents mapped to term frequency-inverse document frequency (TF-IDF) vectors with TF-IDF weighted query vectors formed from the top 10 words per topic.
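The IR-style labeling step can be sketched as follows. This is a simplified stand-in for the approach described, not the authors' code: the smoothed IDF formula, the function names, and the toy knowledge source articles are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """TF-IDF vector (as a dict) for each document, plus the IDF table."""
    n = len(docs)
    df = Counter(word for doc in docs for word in set(doc))
    # Assumed IDF variant: log(N / df) + 1, so shared words keep a nonzero weight.
    idf = {word: math.log(n / df[word]) + 1.0 for word in df}
    vectors = [{w: c * idf[w] for w, c in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def label_topic(top_words, labeled_docs):
    """Return the label whose article is most similar to the topic's top words."""
    labels = list(labeled_docs)
    vectors, idf = tfidf_vectors([labeled_docs[label] for label in labels])
    query = {w: idf[w] for w in top_words if w in idf}  # TF-IDF weighted query
    scores = [cosine(query, vec) for vec in vectors]
    return labels[scores.index(max(scores))]


articles = {
    "School Supplies": ["pencil", "ruler", "eraser", "pencil"],
    "Baseball": ["umpire", "baseball", "bat", "umpire"],
}
label = label_topic(["umpire", "bat"], articles)
```

A labeling step of this kind always returns some label, even when every similarity score is low, so a poor topic still receives a name.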
IV-C2 Experimental Results
After the LDA model converged, we labeled the topics using the IR approach described above (we refer to this topic labeling method as IRLDA). Given similar labels from the models, an intuitive approach is to compare the word assignments of each topic model. Example comparisons are shown in Table I. The label assignments generated by SourceLDA show a more accurate assignment of labels to topics than both IRLDA and CTM. IRLDA appears to suffer from mixing different concepts into a single topic; for example, for the topic “Inventories,” the topic assignments could possibly be a combination of “Inventories” and “Information Technology”. The CTM seems to assign more weight to less important words. One approach to rectify this problem for CTM is to use a smaller number of words for the bag of words, but this leads to significant dropout and no labeled topics are passed through. Out of the total returned topics, CTM only discovered labeled topics, with SourceLDA discovering . Since the IR approach forces all topics to a label regardless of the quality of the label, LDA required all topics to be matched to a label. Out of the labeled CTM topics only were overlapping with SourceLDA and IRLDA and are shown in Table I. The remaining CTM topics were bad matches for the label with an average of of words not appropriate for the label as determined by human judgment (we acknowledge the potential for bias). Meanwhile SourceLDA mismatched at a rate of , with IRLDA at a rate of . SourceLDA is more consistent with the meaning of the topic, as opposed to the words one may find when talking about the topic, which can apply generally to many concepts.
IV-D Wikipedia Corpus
A comparison of SourceLDA against EDA and CTM is made using a corpus generated from a known knowledge source corresponding to medical topics extracted from MedlinePlus (a consumer-friendly medical dictionary) [27]. We evaluate the strength of SourceLDA under the different models proposed in Section 3 using the metrics of classification accuracy, JS divergence, and pointwise mutual information (PMI).
PMI is an established evaluation of learned topics which takes as input a subset of the most popular tokens comprising a topic and determines the frequency of all pairs in the subset occurring at a given input distance from each other in the corpus. The more often these pairs occur close to each other, the better the learned topics. PMI differs from the JS divergence evaluation in this experiment in that PMI tells us how good our topics are, whereas JS divergence tells us how good our distribution over topics for each document is.
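As a rough illustration of the PMI evaluation described above, the following sketch averages the pointwise mutual information over all pairs of a topic's top words, counting co-occurrence within a sliding window. The function name, the default window size, and the choice to score never co-occurring pairs as zero are assumptions of this example, not details taken from the experiments.

```python
import math
from itertools import combinations

def topic_pmi(top_words, docs, window=10):
    """Mean PMI over all pairs of top words, using sliding-window co-occurrence."""
    windows = []
    for doc in docs:
        if len(doc) <= window:
            windows.append(set(doc))
        else:
            windows.extend(set(doc[i:i + window])
                           for i in range(len(doc) - window + 1))
    if not windows:
        return 0.0
    total = len(windows)

    def prob(*words):
        # Fraction of windows that contain every given word.
        return sum(all(w in win for w in words) for win in windows) / total

    scores = []
    for a, b in combinations(top_words, 2):
        p_ab = prob(a, b)
        if p_ab > 0:
            scores.append(math.log(p_ab / (prob(a) * prob(b))))
        else:
            # Assumption: pairs that never co-occur contribute zero.
            scores.append(0.0)
    return sum(scores) / len(scores) if scores else 0.0

# Toy corpus: a coherent topic scores higher than a mixed one.
toy_docs = [
    ["pencil", "ruler", "eraser"],
    ["umpire", "baseball", "pitcher"],
    ["pencil", "ruler", "book"],
    ["umpire", "baseball", "inning"],
]
coherent = topic_pmi(["pencil", "ruler"], toy_docs)
mixed = topic_pmi(["pencil", "umpire"], toy_docs)
print(coherent, mixed)
```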
IV-D1 Experimental Setup
A corpus of Wikipedia vocabulary articles was generated by following the steps of the generative model for SourceLDA, where the chosen topics are a subset of a larger collection of Wikipedia topics. The topics consisted of Wikipedia articles representing the collection of topic labels from MedlinePlus. The number of topics () was given as , chosen from an entire collection of topics (), the number of documents () was given as and the average document word count () as , and were set to and for the bijective evaluation and for the SourceLDA model respectively. After these documents were generated, the topic assignments were recorded and used as the ground truth measurement. The word assignments were used as the corpus, and the different topic models were applied to these documents. The first round of topic models consisted of comparing SourceLDA, EDA, and CTM. For SourceLDA and were set to match that of the generative model. For all models, a symmetric Dirichlet parameter of and was used for and respectively. After convergence, the models were evaluated against the ground truth measurement. In the second round of experiments each topic model was run under the bijective model; that is, they only considered topics which were used in the ground truth assignments.
To compare SourceLDA against LDA using PMI, five corpora were generated under the bijective model with the number of topics ranging from to . , , , , and were set to , , , , and respectively. The parameters for SourceLDA followed the generative model, and all other parameters were the same as in the previous experiments. After iterations the top words given for each topic were used in the PMI assessment.
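The idea of generating a corpus from known topics and keeping the per-token topic assignments as ground truth can be sketched as follows. This is a plain LDA-style generative loop, not the full SourceLDA generative model; the function names, the fixed document length (the experiments use an average length), and the toy topics are assumptions of this example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(topics, n_docs, doc_len, alpha, seed=0):
    """topics: dict label -> {word: probability}. Returns (docs, assignments)."""
    rng = random.Random(seed)
    labels = list(topics)
    docs, truth = [], []
    for _ in range(n_docs):
        # Per-document distribution over topics.
        theta = sample_dirichlet([alpha] * len(labels), rng)
        words, assignments = [], []
        for _ in range(doc_len):
            z = rng.choices(labels, weights=theta)[0]  # topic for this token
            vocab, probs = zip(*topics[z].items())
            words.append(rng.choices(vocab, weights=probs)[0])
            assignments.append(z)  # recorded as ground truth
        docs.append(words)
        truth.append(assignments)
    return docs, truth

# Hypothetical known topics with disjoint vocabularies.
known_topics = {
    "School Supplies": {"pencil": 0.5, "ruler": 0.5},
    "Baseball": {"umpire": 0.5, "baseball": 0.5},
}
docs, truth = generate_corpus(known_topics, n_docs=5, doc_len=20, alpha=1.0, seed=1)
print(docs[0][:5], truth[0][:5])
```

Only `docs` would be handed to the topic models under comparison; `truth` is held back for scoring.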
IV-D2 Experimental Results
The topic assignments for each token in the corpus were recorded for all models and the results compared against each other. Since we know a priori the correct topic assignment for each token, we use the number of correct topic assignments as a measure of classification accuracy. Note that in evaluations where the ground truth is known, classification accuracy is a much better indicator of the goodness of a model than likelihood-based measures such as perplexity, and therefore we do not evaluate the models using perplexity. In Figure 8, all topic models run under the full SourceLDA model are tagged with an “Unk” label, and likewise topic models run under the bijective model are tagged with “Exact”. The overall number of correct topic assignments for each model is shown in Figure 8(a) for the mixed model and Figure 8(b) for the bijective model. Since the LDA model has unknown topics, JS divergence was used to map each LDA topic to its best matching Wikipedia topic. As expected, the SourceLDA model (SRCUnk and SRCExact) had the best classification accuracy among all topic models.
In the second analysis, the topic-to-document distributions were analyzed using sorted JS divergence, which is independent of any unknown mapping. The results again show the SourceLDA model to be effective in accurately mapping topics to documents, whether the topics used in the generative model are unknown (Figure 8(d)) or a known set of topics (Figure 8(e)). Even though an accurate alignment of by itself does not lend much weight to any one model being superior, we do find it important to demonstrate how is being affected by the different algorithms.
The PMI analysis detailed in Figure 8(c) shows that SourceLDA provides a better mapping of labels to topics over the input corpora. This is an encouraging result, even though the differences are not large, since LDA is a function of topic proximity in a document and word frequency in a topic, whereas SourceLDA is a function of the same plus the likelihood of a word being in an augmented source distribution.
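The mapping of unlabeled topics to known topics and the accuracy measure can be sketched as follows. The helper names and the toy distributions are hypothetical, and all distributions are assumed to be defined over the same ordered vocabulary.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two aligned distributions."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def map_topics(learned, reference):
    """Map each learned topic index to the label of its closest reference topic."""
    return {
        k: min(reference, key=lambda lab: js_divergence(dist, reference[lab]))
        for k, dist in enumerate(learned)
    }

def accuracy(predicted, truth):
    """Fraction of tokens whose predicted topic label equals the ground truth."""
    pairs = [(p, t) for pd, td in zip(predicted, truth) for p, t in zip(pd, td)]
    return sum(p == t for p, t in pairs) / len(pairs)

# Hypothetical reference topics and two learned topics over a 4-word vocabulary.
reference = {"A": [0.7, 0.2, 0.1, 0.0], "B": [0.0, 0.1, 0.2, 0.7]}
learned = [[0.1, 0.1, 0.2, 0.6], [0.6, 0.3, 0.1, 0.0]]
mapping = map_topics(learned, reference)
score = accuracy([["A", "B"], ["B", "B"]], [["A", "A"], ["B", "B"]])
print(mapping, score)
```

Note that this greedy mapping allows two learned topics to share one label; whether the experiments enforce a one-to-one assignment is not stated here.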
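One plausible reading of "sorted JS divergence" is that each document's topic proportions are sorted before they are compared, so the score does not depend on which topic index carries which label. The sketch below follows that reading; it is an assumption of this example rather than a confirmed description of the evaluation, and the zero-padding for models with different numbers of topics is likewise assumed.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two aligned distributions."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def sorted_js(theta_a, theta_b):
    """JS divergence between two topic mixtures after sorting each in descending order."""
    n = max(len(theta_a), len(theta_b))
    # Assumption: pad the shorter mixture with zeros when topic counts differ.
    a = sorted(theta_a, reverse=True) + [0.0] * (n - len(theta_a))
    b = sorted(theta_b, reverse=True) + [0.0] * (n - len(theta_b))
    return js_divergence(a, b)

# The same mixture under a different topic ordering is treated as identical.
permuted = sorted_js([0.7, 0.2, 0.1], [0.1, 0.7, 0.2])
different = sorted_js([0.7, 0.2, 0.1], [0.4, 0.3, 0.3])
print(permuted, different)
```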
IV-E Performance Benchmarking
To show the performance gains achieved by the parallel sampling algorithm, an experiment was set up to generate topics randomly from a given vocabulary. The corpus was generated using the same parameters as in Section 4(B) but with ranging from to . The benchmarking is visualized in Figure 8(d). The results indicate that SourceLDA scales linearly and is easily parallelized.
V Related Work
A substantial body of literature relates to the approach proposed in this paper. These methods are mainly extensions of LDA and add to the original model by introducing enhancements such as topic labeling, integration with contextual information, and hierarchical modeling.
V-A Topic Labeling
In the early research stage, labels were often generated by hand [28, 29, 30, 31]. Though manual labeling may generate more understandable and accurate semantics for a topic, it costs a lot of human effort and is prone to subjectivity [32]. For example, in the most conventional LDA model, topics are interpreted by selecting the top words in the distribution [1, 33, 32, 28]. The Topics over Time (TOT) model associates continuous time stamps with each topic [32]. The model has been applied to three kinds of datasets, and the results show more accurate topics and better timestamp predictions. However, the interpretation of topics is manual, and post-hoc labeling can be time-consuming and subjective.
Mei et al. proposed probabilistic approaches to automatically and objectively interpreting multinomial topic models [28]. The intuition of this algorithm was to minimize the semantic distance between the topic model and the label. To this end, they extracted candidate labels from noun phrases chunked by an NLP chunker and from the most significant 2-grams. They then ranked labels to minimize Kullback-Leibler divergence and maximize mutual information between a topic model and a label. The approach achieved the automatic interpretation of topics, but available candidate labels were limited to phrases inside documents.
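The ranking idea can be illustrated by ordering candidate labels by the Kullback-Leibler divergence from the topic's word distribution to each label's word distribution. This is only a sketch of the general principle, not Mei et al.'s scoring function; the function names, the floor value used for unseen words, and the toy distributions are assumptions of this example.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for sparse word distributions stored as dicts.

    Assumption: words missing from q are floored at eps instead of being smoothed.
    """
    return sum(pw * math.log(pw / max(q.get(w, 0.0), eps))
               for w, pw in p.items() if pw > 0)

def rank_labels(topic, candidates):
    """Return candidate labels ordered from smallest to largest divergence."""
    return sorted(candidates, key=lambda lab: kl_divergence(topic, candidates[lab]))

# Hypothetical topic and candidate-label word distributions.
topic = {"pencil": 0.5, "ruler": 0.3, "eraser": 0.2}
candidates = {
    "baseball": {"umpire": 0.5, "baseball": 0.5},
    "school supplies": {"pencil": 0.4, "ruler": 0.3, "eraser": 0.3},
}
ranking = rank_labels(topic, candidates)
print(ranking)
```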
Lau et al. proposed an automatic topic label generation method which obtains candidate labels from Wikipedia articles containing the top-ranking topic terms, top-ranked document titles, and sub-phrases [33]. To rank those candidate topic labels, they used different lexical measures, such as pointwise mutual information, Student’s t-test, Dice’s coefficient, and the log likelihood ratio [34]. Supervised methods like support vector regression were also applied in the ranking process. Results showed that the supervised algorithm outperforms the unsupervised baseline in all four corpora.
In previous approaches, topics were treated individually and relations among topics were not considered. Mao et al. created hierarchical descriptors for topics, and their results showed that inter-topic relations could increase the accuracy of topic labels [35]. Hulpus et al. proposed a graph-based approach for topic labeling [36]. Mehdad et al. built an entailment graph over phrases and then aggregated relevant phrases by generalization and merging [37].
Conceptual labeling is an approach to generate a minimum-sized set of labels that best describe a bag of words, which includes topics generated from topic modeling [38]. Concepts used in the topic labeling are taken from a semantic network and deemed appropriate using the Minimum Description Length metric. This approach is applied after topic modeling and represents an effective way of labeling topics compared with existing approaches.
V-B Supervised Labeling
Supervised Latent Dirichlet Allocation (sLDA) is a supervised approach to labeling topics [14]. The approach incorporates a response variable into the LDA model to obtain latent topics that potentially provide an optimal prediction for the response variable of a new unlabeled document. This approach requires, during training, the manual input of individual topic labels and is constrained to permitting one label per topic.
Similar to sLDA is Discriminative LDA (DiscLDA), which attempts to solve the same problem as sLDA but differs in the approach [15]. The differing approach is centered around introducing a class-dependent linear transformation on the topic mixture proportions. This transformation matrix is learned through a conditional likelihood criterion. This method has the benefit of both reducing the dimension of documents in the corpus and labeling the lower dimension documents.
Both sLDA and DiscLDA only allow for a supervised input set that labels a single topic. An approach that allows for multiple labels in a topic is given by Labeled LDA (L-LDA) [16]. This model differs in the generation of the multinomial distribution theta over the topics in the model. The scaling parameter is modified by a label projection matrix to restrict the distribution to those topics considered most relevant to the document.
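The effect of the label projection matrix can be sketched in a few lines: a document's binary label indicator selects rows of an identity-like matrix, and multiplying that matrix by the Dirichlet parameter keeps only the entries for the document's own labels. The function names and the toy values are hypothetical; the sketch shows only the restriction step, not L-LDA inference.

```python
def projection_matrix(label_indicator):
    """Build an (active labels x all topics) 0/1 matrix from a binary label vector."""
    active = [k for k, on in enumerate(label_indicator) if on]
    n_topics = len(label_indicator)
    return [[1.0 if k == j else 0.0 for k in range(n_topics)] for j in active]

def project_alpha(label_indicator, alpha):
    """Restrict the Dirichlet parameter to the topics whose labels the document has."""
    matrix = projection_matrix(label_indicator)
    return [sum(l * a for l, a in zip(row, alpha)) for row in matrix]

# Hypothetical document tagged with labels 1 and 2 out of four possible labels.
doc_labels = [0, 1, 1, 0]
alpha = [0.1, 0.2, 0.3, 0.4]
restricted = project_alpha(doc_labels, alpha)
print(restricted)
```

The document's topic mixture would then be drawn from a Dirichlet with the restricted parameter, so topics outside the document's label set receive no probability mass.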
V-C Contextual Integration
An existing approach that takes into account concepts supplied by prior sources requires a manual input set of relevant terms [39]. Within the topic model, these concepts are then applied to the assignment of topics to tokens in a document. Alongside this concept-topic model, a hierarchical method can also be used to incorporate concepts into a hierarchical structure. This work shows the utility of bringing prior knowledge into topic modeling.
An approach that integrates Wikipedia information into topic modeling differs from the supervised approach by only requiring an existing Wikipedia article [7]. The assumption in this work is that in the generative process the topics are selected from the Wikipedia word distributions. The results show that Wikipedia articles can be used as effective topics in topic modeling.
Wikipedia was again shown as a basis for topic modeling, albeit for a tangential task, entity disambiguation [7]. The approach involved topic modeling as a way of annotating entities in text. This required a large dataset of topics, so efficient methods were introduced. Experiments against a public dataset resulted in state-of-the-art performance.
VI Conclusion
We have described in this paper a novel methodology for semi-supervised topic modeling with meaningful labels, and have provided parallel algorithms to speed up the inference process. This methodology uses prior knowledge sources to influence a topic model in order to allow the labels from these external sources to be used for topics generated over a corpus of interest. In addition, this approach results in more meaningful topics, depending on the quality of the external knowledge source. We have tested our methodology against the Reuters-21578 newswire collection corpus for labeling and Wikipedia as external knowledge sources. The analysis of the quality of topic models using PMI shows the ability of SourceLDA to enhance existing topic models.
Acknowledgments
This work was supported by NIH-NCI National Cancer Institute T32CA201160 to JW, the NIH-National Library of Medicine R21LM011937 to CA, and NIH U01HG008488, NIH R01GM115833, NIH U54GM114833, and NSF IIS-1313606 to WW. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. We would also like to thank Tianran Zhang, Jiayun Li, Karthik Sarma, Mahati Kumar, Sara Melvin, Jie Yu, Nicholas Matiasz, Ariyam Das and all the reviewers for their thoughtful input into different aspects of this paper.
References

[1] D. M. Blei et al., “Latent dirichlet allocation,” Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.
[2] R. S. Margalit et al., “Electronic medical record use and physician-patient communication: an observational study of Israeli primary care encounters,” Patient Education and Counseling, vol. 1, pp. 131–141, 2006.
[3] C. W. Arnold et al., “Clinical case-based retrieval using latent topic analysis,” AMIA Annual Symposium Proceedings, vol. 2010, p. 26, 2010.
[4] H. Bisgin et al., “Mining FDA drug labels using an unsupervised learning technique - topic modeling,” BMC Bioinformatics, vol. 12, no. S10, p. S11, 2011.
[5] W. Speier, M. K. Ong, and C. W. Arnold, “Using phrases and document metadata to improve topic modeling of clinical reports,” Journal of Biomedical Informatics, vol. 61, pp. 260–266, 2016. [Online]. Available: http://dx.doi.org/10.1016/j.jbi.2016.04.005
[6] C. Chemudugunta et al., “Text modeling using unsupervised topic models and concept hierarchies,” CoRR, vol. abs/0808.0973, 2008.
[7] J. A. Hansen et al., “Probabilistic explicit topic modeling using wikipedia,” in Language Processing and Knowledge in the Web - 25th International Conference, GSCL 2013, Darmstadt, Germany, September 25-27, 2013. Proceedings, ser. Lecture Notes in Computer Science, I. Gurevych, C. Biemann, and T. Zesch, Eds., vol. 8105. Springer, 2013, pp. 69–82. [Online]. Available: http://dx.doi.org/10.1007/9783642407222
[8] J. Jagarlamudi et al., “Incorporating lexical priors into topic models,” in EACL 2012, 13th Conference of the European Chapter of the Association for Computational Linguistics, Avignon, France, April 23-27, 2012, W. Daelemans, M. Lapata, and L. Màrquez, Eds. The Association for Computer Linguistics, 2012, pp. 204–213. [Online]. Available: http://aclweb.org/anthologynew/E/E12/
[9] T. P. Minka, “Bayesian inference, entropy, and the multinomial distribution,” 2000.
[10] T. L. Griffiths and M. Steyvers, “Finding scientific topics,” Proceedings of the National Academy of Sciences, vol. 101, no. Suppl. 1, pp. 5228–5235, Apr. 2004.
[11] W. M. Darling, “A theoretical and practical implementation tutorial on topic modeling and gibbs sampling,” in Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, 2011, pp. 642–647.
[12] T. Griffiths, “Gibbs sampling in the generative model of latent dirichlet allocation,” 2002.
[13] M. W. Beck, “Average dissertation and thesis length,” https://github.com/fawda123/diss_proc, 2014.
[14] D. M. Blei and J. D. McAuliffe, “Supervised topic models,” in Advances in Neural Information Processing Systems 20, Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 3-6, 2007, J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, Eds. Curran Associates, Inc., 2007, pp. 121–128. [Online]. Available: http://papers.nips.cc/book/advancesinneuralinformationprocessingsystems202007
[15] S. Lacoste-Julien et al., “Disclda: Discriminative learning for dimensionality reduction and classification,” in Advances in Neural Information Processing Systems 21, Proceedings of the Twenty-Second Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 8-11, 2008, D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, Eds. Curran Associates, Inc., 2008, pp. 897–904. [Online]. Available: http://papers.nips.cc/book/advancesinneuralinformationprocessingsystems212008
[16] D. Ramage et al., “Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora,” in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, EMNLP 2009, 6-7 August 2009, Singapore, A meeting of SIGDAT, a Special Interest Group of the ACL. ACL, 2009, pp. 248–256.
[17] Y. Wang et al., “PLDA: parallel latent dirichlet allocation for large-scale applications,” in Algorithmic Aspects in Information and Management, 5th International Conference, AAIM 2009, San Francisco, CA, USA, June 15-17, 2009. Proceedings, ser. Lecture Notes in Computer Science, A. V. Goldberg and Y. Zhou, Eds., vol. 5564. Springer, 2009, pp. 301–314. [Online]. Available: http://dx.doi.org/10.1007/9783642021589
[18] D. Newman et al., “Distributed inference for latent dirichlet allocation,” in Advances in Neural Information Processing Systems 20, Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 3-6, 2007, J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, Eds. Curran Associates, Inc., 2007, pp. 1081–1088. [Online]. Available: http://papers.nips.cc/book/advancesinneuralinformationprocessingsystems202007
[19] I. Porteous et al., “Fast collapsed gibbs sampling for latent dirichlet allocation,” in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, Nevada, USA, August 24-27, 2008, Y. Li, B. Liu, and S. Sarawagi, Eds. ACM, 2008, pp. 569–577.
[20] G. E. Blelloch, “Prefix sums and their applications,” Synthesis of Parallel Algorithms, Tech. Rep., 1990.
[21] H. M. Wallach et al., “Rethinking LDA: why priors matter,” in Advances in Neural Information Processing Systems 22: 23rd Annual Conference on Neural Information Processing Systems 2009. Proceedings of a meeting held 7-10 December 2009, Vancouver, British Columbia, Canada., Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, Eds. Curran Associates, Inc., 2009, pp. 1973–1981. [Online]. Available: http://papers.nips.cc/paper/3854rethinkingldawhypriorsmatter
[22] H. M. Wallach et al., “Evaluation methods for topic models,” in Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, Montreal, Quebec, Canada, June 14-18, 2009, ser. ACM International Conference Proceeding Series, A. P. Danyluk, L. Bottou, and M. L. Littman, Eds., vol. 382. ACM, 2009, pp. 1105–1112. [Online]. Available: http://doi.acm.org/10.1145/1553374.1553515
[23] G. Heinrich, “Parameter estimation for text analysis,” University of Leipzig, Tech. Rep, 2008.
[24] J. Kang et al., “Transfer topic modeling with ease and scalability,” in Proceedings of the Twelfth SIAM International Conference on Data Mining, Anaheim, California, USA, April 26-28, 2012. SIAM / Omnipress, 2012, pp. 564–575. [Online]. Available: http://dx.doi.org/10.1137/1.9781611972825.49
[25] J. Chang et al., “Reading tea leaves: How humans interpret topic models,” in Advances in Neural Information Processing Systems 22: 23rd Annual Conference on Neural Information Processing Systems 2009. Proceedings of a meeting held 7-10 December 2009, Vancouver, British Columbia, Canada., Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, Eds. Curran Associates, Inc., 2009, pp. 288–296. [Online]. Available: http://papers.nips.cc/paper/3700readingtealeaveshowhumansinterprettopicmodels
[26] C. W. Arnold, A. Oh, S. Chen, and W. Speier, “Evaluating topic model interpretability from a primary care physician perspective,” Computer Methods and Programs in Biomedicine, vol. 124, pp. 67–75, 2016. [Online]. Available: http://dx.doi.org/10.1016/j.cmpb.2015.10.014
[27] “Medlineplus [internet],” https://www.nlm.nih.gov/medlineplus/.
[28] Q. Mei et al., “Automatic labeling of multinomial topic models,” in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, California, USA, August 12-15, 2007, P. Berkhin, R. Caruana, and X. Wu, Eds. ACM, 2007, pp. 490–499.
[29] Q. Mei et al., “A probabilistic approach to spatiotemporal theme pattern mining on weblogs,” in Proceedings of the 15th international conference on World Wide Web, WWW 2006, Edinburgh, Scotland, UK, May 23-26, 2006, L. Carr, D. D. Roure, A. Iyengar, C. A. Goble, and M. Dahlin, Eds. ACM, 2006, pp. 533–542.
[30] Q. Mei and C. Zhai, “Discovering evolutionary theme patterns from text: an exploration of temporal text mining,” in Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, Illinois, USA, August 21-24, 2005, R. Grossman, R. J. Bayardo, and K. P. Bennett, Eds. ACM, 2005, pp. 198–207.
[31] Q. Mei and C. Zhai, “A mixture model for contextual text mining,” in Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, August 20-23, 2006, T. Eliassi-Rad, L. H. Ungar, M. Craven, and D. Gunopulos, Eds. ACM, 2006, pp. 649–655.
[32] X. Wang and A. McCallum, “Topics over time: a non-Markov continuous-time model of topical trends,” in Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, August 20-23, 2006, T. Eliassi-Rad, L. H. Ungar, M. Craven, and D. Gunopulos, Eds. ACM, 2006, pp. 424–433.
[33] J. H. Lau et al., “Automatic labelling of topic models,” in The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, 19-24 June, 2011, Portland, Oregon, USA, D. Lin, Y. Matsumoto, and R. Mihalcea, Eds. The Association for Computer Linguistics, 2011, pp. 1536–1545.
[34] P. Pecina, “Lexical association measures and collocation extraction,” Language Resources and Evaluation, vol. 44, no. 1-2, pp. 137–158, 2010.
[35] X. Mao et al., “Automatic labeling hierarchical topics,” in 21st ACM International Conference on Information and Knowledge Management, CIKM’12, Maui, HI, USA, October 29 - November 02, 2012, X. Chen, G. Lebanon, H. Wang, and M. J. Zaki, Eds. ACM, 2012, pp. 2383–2386. [Online]. Available: http://dl.acm.org/citation.cfm?id=2396761
[36] I. Hulpus et al., “Unsupervised graph-based topic labelling using dbpedia,” in Sixth ACM International Conference on Web Search and Data Mining, WSDM 2013, Rome, Italy, February 4-8, 2013, S. Leonardi, A. Panconesi, P. Ferragina, and A. Gionis, Eds. ACM, 2013, pp. 465–474. [Online]. Available: http://dl.acm.org/citation.cfm?id=2433396
[37] Y. Mehdad et al., “Towards topic labeling with phrase entailment and aggregation,” in Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, June 9-14, 2013, Westin Peachtree Plaza Hotel, Atlanta, Georgia, USA, L. Vanderwende, H. D. III, and K. Kirchhoff, Eds. The Association for Computational Linguistics, 2013, pp. 179–189.
[38] X. Sun et al., “On conceptual labeling of a bag of words,” in Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, Q. Yang and M. Wooldridge, Eds. AAAI Press, 2015, pp. 1326–1332. [Online]. Available: http://ijcai.org/Abstract/15/191
[39] M. Steyvers et al., “Combining background knowledge and learned topics,” topiCS, vol. 3, no. 1, pp. 18–47, 2011.
[40] Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, Eds., Advances in Neural Information Processing Systems 22: 23rd Annual Conference on Neural Information Processing Systems 2009. Proceedings of a meeting held 7-10 December 2009, Vancouver, British Columbia, Canada. Curran Associates, Inc., 2009. [Online]. Available: http://papers.nips.cc/book/advancesinneuralinformationprocessingsystems222009
[41] J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, Eds., Advances in Neural Information Processing Systems 20, Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 3-6, 2007. Curran Associates, Inc., 2008. [Online]. Available: http://papers.nips.cc/book/advancesinneuralinformationprocessingsystems202007
[42] T. Eliassi-Rad, L. H. Ungar, M. Craven, and D. Gunopulos, Eds., Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, August 20-23, 2006. ACM, 2006.