We live in the information age. With the Internet, information can be obtained easily and almost instantly. This has changed the dynamics of information acquisition: we can now (1) attain knowledge by visiting digital libraries, (2) keep abreast of the world by reading news online, (3) seek opinions on social media, and (4) engage in political debates via web forums. As technology advances, more information is created, to the point where it is infeasible for a person to digest all the available content. To illustrate, the PubMed healthcare database grew by approximately 3,000 new entries per day over the ten-year period from 2003 to 2013 (Suominen et al., 2014). This motivates the use of machines to automatically organise, filter, summarise, and analyse the available data for users. To this end, researchers have developed various methods, which can be broadly categorised into computer vision (Low, 1991; Mai, 2010), speech recognition (Rabiner and Juang, 1993; Jelinek, 1997), and natural language processing (NLP, Manning and Schütze, 1999; Jurafsky and Martin, 2000). This article focuses on text analysis within NLP.
In text analytics, researchers seek to accomplish various goals, including sentiment analysis or opinion mining (Pang and Lee, 2008; Liu, 2012), information retrieval (Manning et al., 2008), text summarisation (Lloret and Palomar, 2012), and topic modelling (Blei, 2012). To illustrate, sentiment analysis can be used to extract digestible summaries of reviews on products and services, which can be valuable to consumers. Topic models, on the other hand, attempt to discover abstract topics that are present in a collection of text documents.
Topic models were inspired by latent semantic indexing (LSI, Landauer et al., 2007) and its probabilistic variant, probabilistic latent semantic indexing (pLSI), also known as probabilistic latent semantic analysis (pLSA, Hofmann, 1999). Pioneered by Blei et al. (2003), latent Dirichlet allocation (LDA) is a fully Bayesian extension of pLSI and can be considered the simplest Bayesian topic model. LDA has since been extended into many different topic models: some are designed for specific applications (Wei and Croft, 2006; Mei et al., 2007), some model the structure in the text (Blei and Lafferty, 2006; Du, 2012), while others incorporate extra information into their modelling (Ramage et al., 2009; Jin et al., 2011).
On the other hand, due to the well-known correspondence between the Gamma-Poisson family of distributions and the Dirichlet-multinomial family, Gamma-Poisson factor models (Canny, 2004), their nonparametric extensions, and other Poisson-based variants of non-negative matrix factorisation (NMF) form a methodological continuum with topic models. These NMF methods are often applied to text; however, we do not consider them here.
This article will concentrate on topic models that take into account additional information. This information can be auxiliary data
(or metadata) that accompany the text, such as keywords (or tags), dates, authors, and sources; or external resources like word lexicons. For example, on Twitter, a popular social media platform, messages, known as tweets, are often associated with metadata such as the location, the time published, and the user who wrote the tweet. This information is often utilised; for instance, Kinsella et al. (2011) model tweets with location data, while Wang et al. (2011b) use hashtags for sentiment classification of tweets. Many topic models have also been designed to perform bibliographic analysis using auxiliary information. Most notable of these is the author-topic model (ATM, Rosen-Zvi et al., 2004), which, as its name suggests, incorporates authorship information. In addition to authorship, the Citation Author Topic model (Tu et al., 2010) and the Author Cite Topic Model (Kataria et al., 2011) make use of citations to model research publications. There are also topic models that employ external resources to improve modelling. For instance, He (2012) and Lim and Buntine (2014) incorporate a sentiment lexicon as prior information for weakly supervised sentiment analysis.
Independently of the use of auxiliary data, recent advances in nonparametric Bayesian methods have produced topic models that utilise nonparametric Bayesian priors. The simplest examples replace Dirichlet distributions with the Dirichlet process (DP, Ferguson, 1973); in particular, the hierarchical Dirichlet process LDA (HDP-LDA) proposed by Teh et al. (2006) replaces just the document–topic matrix in LDA. One can further extend topic models with the Pitman-Yor process (PYP, Ishwaran and James, 2001), which generalises the DP, by replacing the second Dirichlet distribution, the one that generates the topic–word matrix in LDA. This includes the work of Sato and Nakagawa (2010), Du et al. (2012b), and Lindsey et al. (2012), among others. Like the DP, PYPs can be stacked to form hierarchical Pitman-Yor processes (HPYP), which are used in more complex models. Another fully nonparametric extension to topic modelling uses the Indian buffet process (Archambeau et al., 2015) to sparsify both the document–topic matrix and the topic–word matrix in LDA.
The advantages of employing nonparametric Bayesian methods in topic models are the ability to estimate the topic and word priors and to infer the number of clusters (known as the number of topics in topic modelling) from the data. Using the PYP also allows the modelling of the power-law property exhibited by natural languages (Goldwater et al., 2005). These touted advantages have been shown to yield significant improvements in performance (Buntine and Mishra, 2014). However, we note that the best known approach for learning with hierarchical Dirichlet (or Pitman-Yor) processes is to use the Chinese restaurant franchise (Teh and Jordan, 2010). Because this requires dynamic memory allocation to implement the hierarchy, there has been extensive research into efficiently implementing just the HDP-LDA extension of LDA, mostly based around variational methods (Teh et al., 2008; Wang et al., 2011a; Bryant and Sudderth, 2012; Sato et al., 2012; Hoffman et al., 2013). Variational methods have rarely been applied to more complex topic models such as those we consider here, and unfortunately Bayesian nonparametric methods are gaining a reputation for being difficult to use. A newer collapsed and blocked Gibbs sampler (Chen et al., 2011) has been shown to generally outperform the variational methods, as well as the original Chinese restaurant franchise, in computational time and space and in some standard performance metrics (Buntine and Mishra, 2014). Moreover, this technique does appear suitable for the more complex topic models we consider here.
This article, which adapts and extends our previous work (Lim et al., 2013), extends the algorithm of Chen et al. (2011) to show how to develop fully nonparametric and relatively efficient Bayesian topic models that incorporate auxiliary information, with the goal of producing more accurate models that work well in several applications. As a by-product, we wish to encourage the use of state-of-the-art Bayesian techniques, and the incorporation of auxiliary information, in modelling.
The remainder of this article is as follows. We first provide a brief background on the Pitman-Yor process in Section 2. Then, in Section 3, we detail our modelling framework by illustrating it on a simple topic model. We continue through to the inference procedure on the topic model in Section 4. Finally, in Section 5, we present an application on modelling social network data, utilising the proposed framework. Section 6 concludes.
2 Background on Pitman-Yor Process
We provide a brief, informal review of the Pitman-Yor process (PYP, Ishwaran and James, 2001) in this section. We assume the readers are familiar with basic probability distributions (see Walck, 2007) and the Dirichlet process (DP, Ferguson, 1973). In addition, we refer the readers to Hjort et al. (2010) for a tutorial on Bayesian nonparametric modelling.
2.1 Pitman-Yor Process
The Pitman-Yor process (PYP, Ishwaran and James, 2001) is also known as the two-parameter Poisson-Dirichlet process. The PYP is a two-parameter generalisation of the DP, with an extra parameter, the discount parameter $\alpha$, in addition to the concentration parameter $\beta$. Similar to the DP, a sample from a PYP corresponds to a discrete distribution (known as the output distribution) with the same support as its base distribution $H$. The underlying distribution of the PYP is the Poisson-Dirichlet distribution (PDD), which was introduced by Pitman and Yor (1997).
The PDD is defined by its construction process. For $0 \le \alpha < 1$ and $\beta > -\alpha$, let $V_k$ be distributed independently as follows:
$$V_k \sim \mathrm{Beta}(1-\alpha,\, \beta + k\alpha)\,, \qquad k = 1, 2, \dots,$$
and define $p = (p_1, p_2, \dots)$ as
$$p_k = V_k \prod_{i=1}^{k-1} (1 - V_i)\,.$$
If we let $\tilde{p}$ be a sorted version of $p$ in descending order, then $\tilde{p}$ is Poisson-Dirichlet distributed with parameters $\alpha$ and $\beta$:
$$\tilde{p} \sim \mathrm{PDD}(\alpha, \beta)\,.$$
Note that the unsorted version $p$ follows a $\mathrm{GEM}(\alpha, \beta)$ distribution, which is named after Griffiths, Engen and McCloskey (Pitman, 2006).
With the PDD defined, we can then define the PYP formally. Let $H$ be a distribution over a measurable space $\mathcal{X}$. For $0 \le \alpha < 1$ and $\beta > -\alpha$, suppose that $p = (p_1, p_2, \dots)$ follows a PDD (or GEM) with parameters $\alpha$ and $\beta$; then $G \sim \mathrm{PYP}(\alpha, \beta, H)$ is given by the formula
$$G = \sum_{k=1}^{\infty} p_k\, \delta_{X_k}\,, \qquad\qquad (1)$$
where the $X_k$ are independent samples drawn from the base measure $H$ and $\delta_{X_k}$ represents a probability point mass concentrated at $X_k$ (i.e., it is an indicator function that is equal to $1$ when $X = X_k$ and $0$ otherwise).
This construction, Equation (1), is named the stick-breaking process. The PYP can also be constructed using an analogue to Chinese restaurant process (which explicitly draws a sequence of samples from the base distribution). A more extensive review on the PYP is given by Buntine and Hutter (2012).
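To make the stick-breaking construction concrete, the sketch below draws a truncated PYP sample: only the first `num_sticks` sticks are generated, so the weights sum to slightly less than one. The function name and truncation level are our own illustrative choices, not from the references.

```python
import numpy as np

def pyp_stick_breaking(discount, concentration, num_sticks, rng):
    """Draw truncated stick-breaking weights of a PYP (in GEM order)."""
    k = np.arange(1, num_sticks + 1)
    # V_k ~ Beta(1 - discount, concentration + k * discount)
    v = rng.beta(1.0 - discount, concentration + k * discount)
    # p_k = V_k * prod_{i<k} (1 - V_i): break off a fraction V_k of
    # whatever stick length remains after the first k-1 breaks.
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return v * remaining

rng = np.random.default_rng(0)
weights = pyp_stick_breaking(discount=0.5, concentration=1.0,
                             num_sticks=1000, rng=rng)
print(weights.sum())  # close to 1 for a large truncation
```

Sorting `weights` in descending order gives an (approximate) PDD sample, while the unsorted output approximates a GEM sample.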
A PYP is often more suitable than a DP in modelling since it exhibits a power-law behaviour (when $\alpha > 0$), which is observed in natural languages (Goldwater et al., 2005; Teh and Jordan, 2010). The PYP has also been employed in genomics (Favaro et al., 2009) and economics (Aoki, 2008). Note that when the discount parameter $\alpha$ is $0$, the PYP simply reduces to a DP.
2.2 Pitman-Yor Process with a Mixture Base
Note that the base measure $H$ of a PYP is not necessarily restricted to a single probability distribution. $H$ can also be a mixture distribution such as
$$H = \sum_{i=1}^{n} \rho_i\, G_i\,,$$
where $\sum_{i=1}^{n} \rho_i = 1$ with each $\rho_i \ge 0$, and $\{G_1, \dots, G_n\}$ is a set of distributions over the same measurable space as $H$.
With this specification of $H$, the PYP is also named the compound Poisson-Dirichlet process in Du (2012), or the doubly hierarchical Pitman-Yor process in Wood and Teh (2009). A special case of this is the DP equivalent, which is also known as the DP with mixed random measures in Kim et al. (2012). Note that we have assumed constant values for the $\rho_i$, though of course we can go fully Bayesian and assign a prior distribution to them; a natural prior would be the Dirichlet distribution.
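A small numerical sketch of a mixture base over a shared discrete space; the mixture weights and component distributions below are made-up values for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Mixture base H = sum_i rho_i * G_i over three shared outcomes.
rho = np.array([0.7, 0.3])            # mixture weights, sum to 1
G = np.array([[0.5, 0.3, 0.2],        # G_1 over the three outcomes
              [0.1, 0.1, 0.8]])       # G_2 over the same outcomes
H = rho @ G                           # the compound base measure

# Atoms X_k of the PYP are i.i.d. draws from H; combined with
# stick-breaking weights p_k, they give G = sum_k p_k * delta_{X_k}.
atoms = rng.choice(len(H), size=10000, p=H)
print(H)  # [0.38 0.24 0.38]
```

Placing a Dirichlet prior on `rho`, as suggested above, would simply make the mixture weights random rather than constant.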
2.3 Remark on Bayesian Inference
Performing exact Bayesian inference on nonparametric models is often intractable due to the difficulty of deriving closed-form posterior distributions. This motivates the use of Markov chain Monte Carlo (MCMC) methods (see Gelman et al., 2013) for approximate inference. Most notable among the MCMC methods are the Metropolis-Hastings (MH) algorithms (Metropolis et al., 1953; Hastings, 1970) and Gibbs samplers (Geman and Geman, 1984). These algorithms serve as building blocks for more advanced samplers, such as the MH algorithms with delayed rejection (Mira, 2001). Generalisations of the MCMC method, including reversible jump MCMC (Green, 1995) and its delayed rejection variant (Green and Mira, 2001), can also be employed for Bayesian inference; however, they are outside the scope of this article.
Instead of sampling one parameter at a time, one can develop an algorithm that updates more parameters in each iteration, a so-called blocked Gibbs sampler (Liu, 1994). Also, in practice we are usually only interested in a certain subset of the parameters; in such cases we can sometimes derive more efficient collapsed Gibbs samplers (Liu, 1994) by integrating out the nuisance parameters. In the remainder of this article, we will employ a combination of the blocked and collapsed Gibbs samplers for Bayesian inference.
3 Modelling Framework with Hierarchical Pitman-Yor Process
In this section, we discuss the basic design of our nonparametric Bayesian topic models using hierarchical Pitman-Yor processes (HPYP). In particular, we introduce a simple topic model that will be extended later. We discuss the general inference algorithm for the topic model and hyperparameter optimisation.
The development of topic models is fundamentally motivated by their applications. Depending on the application, a specific topic model that is most suitable for the task should be designed and used. However, despite the ease of designing a model, the majority of time is spent on implementing, assessing, and redesigning it. This calls for a more efficient design cycle, that is, one that spends less time on implementation and more time on model design and development.
We can achieve this by a higher level implementation of the algorithms for topic modelling. This has been made possible in other statistical domains by BUGS (Bayesian inference using Gibbs sampling, Lunn et al., 2000) or JAGS (just another Gibbs sampler, Plummer, 2003), albeit with standard probability distributions. Theoretically, BUGS and JAGS will work on LDA; however, in practice, running Gibbs sampling for LDA with BUGS and JAGS is very slow. This is because their Gibbs samplers are uncollapsed and not optimised. Furthermore, they cannot be used in a model with stochastic processes, like the Gaussian process (GP) and DP.
Below, we present a framework that allows us to implement HPYP topic models efficiently. This framework allows us to test variants of our proposed topic models without significant reimplementation.
3.1 Hierarchical Pitman-Yor Process Topic Model
The HPYP topic model is a simple network of PYP nodes since all distributions on the probability vectors are modelled by the PYP. For simplicity, we assume a topic model with three PYP layers, although in practice there is no limit to the number of PYP layers. We present the graphical model of our generic topic model in Figure 1. This model is a variant of those presented in Buntine and Mishra (2014), and is presented here as a starting model for illustrating our methods and for subsequent extensions.
At the root level, we have $\mu$ and $\gamma$ distributed as PYPs:
$$\mu \sim \mathrm{PYP}(\alpha_\mu, \beta_\mu, H^\mu)\,, \qquad \gamma \sim \mathrm{PYP}(\alpha_\gamma, \beta_\gamma, H^\gamma)\,.$$
The variable $\mu$ is the root node for the topics in a topic model, while $\gamma$ is the root node for the words. To allow an arbitrary number of topics to be learned, we let the base distribution $H^\mu$ for $\mu$ be a continuous distribution or a discrete distribution with infinite samples.
We usually choose a discrete uniform distribution for $H^\gamma$ based on the word vocabulary size of the text corpus. This decision is technical in nature, as we are able to assign a tiny probability to words not observed in the training set, which eases the evaluation process. Thus $H^\gamma_w = 1/|\mathcal{V}|$ for each word $w \in \mathcal{V}$, where $\mathcal{V}$ is the word vocabulary of the text corpus.
We now consider the topic side of the HPYP topic model. Here we have $\nu$, which is the child node of $\mu$. It follows a PYP given $\mu$, which acts as its base distribution:
$$\nu \sim \mathrm{PYP}(\alpha_\nu, \beta_\nu, \mu)\,.$$
For each document $d$ in a text corpus of size $D$, we have a document–topic distribution $\theta_d$, which is a topic distribution specific to document $d$:
$$\theta_d \sim \mathrm{PYP}(\alpha_\theta, \beta_\theta, \nu)\,, \qquad d = 1, \dots, D\,.$$
Each $\theta_d$ tells us about the topic composition of a document.
On the vocabulary side, for each topic $k$ learned by the model, we have a topic–word distribution $\phi_k$, which tells us about the words associated with each topic. The topic–word distribution is PYP distributed given the parent node $\gamma$, as follows:
$$\phi_k \sim \mathrm{PYP}(\alpha_\phi, \beta_\phi, \gamma)\,, \qquad k = 1, \dots, K\,.$$
Here, $K$ is the number of topics in the topic model.
For every word $w_{dn}$ in a document $d$, which is indexed by $n$ (from $1$ to $N_d$, the number of words in document $d$), we have a latent topic $z_{dn}$ (also known as a topic assignment) which indicates the topic the word represents. The $z_{dn}$ and $w_{dn}$ are categorical variables generated from $\theta_d$ and $\phi_{z_{dn}}$ respectively:
$$z_{dn} \sim \mathrm{Discrete}(\theta_d)\,, \qquad w_{dn} \sim \mathrm{Discrete}(\phi_{z_{dn}})\,.$$
The above $\alpha$ and $\beta$ are the discount and concentration parameters of the PYPs (see Section 2.1); note that they are called the hyperparameters of the model. We present a list of the variables used in this section in Table 1.
|$z_{dn}$|Topic|Topical label for word $w_{dn}$.|
|$w_{dn}$|Word|Observed word or phrase at position $n$ in document $d$.|
|$\phi_k$|Topic–word distribution|Probability distribution in generating words for topic $k$.|
|$\theta_d$|Document–topic distribution|Probability distribution in generating topics for document $d$.|
|$\gamma$|Global word distribution|Word prior for $\phi_k$.|
|$\mu$|Global topic distribution|Topic prior for $\nu$.|
|$\nu$|Global topic distribution|Topic prior for $\theta_d$.|
|$\alpha_{\mathcal{N}}$|Discount|Discount parameter for PYP $\mathcal{N}$.|
|$\beta_{\mathcal{N}}$|Concentration|Concentration parameter for PYP $\mathcal{N}$.|
|$H^{\mathcal{N}}$|Base distribution|Base distribution for PYP $\mathcal{N}$.|
|$c^{\mathcal{N}}_k$|Customer count|Number of customers having dish $k$ in restaurant $\mathcal{N}$.|
|$t^{\mathcal{N}}_k$|Table count|Number of tables serving dish $k$ in restaurant $\mathcal{N}$.|
|$\mathbf{Z}$|All topics|Collection of all topics $z_{dn}$.|
|$\mathbf{W}$|All words|Collection of all words $w_{dn}$.|
|$\Xi$|All hyperparameters|Collection of all hyperparameters and constants.|
|$\mathbf{C}$|All customer counts|Collection of all customer counts $c^{\mathcal{N}}_k$.|
|$\mathbf{T}$|All table counts|Collection of all table counts $t^{\mathcal{N}}_k$.|
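The generative process described above can be sketched end-to-end by drawing each PYP with truncated stick-breaking over its parent's distribution. The truncation level, corpus sizes, and hyperparameter values below are arbitrary illustrative choices, and the finite uniform base for the topic root replaces the infinite base assumed by the model.

```python
import numpy as np

rng = np.random.default_rng(2)

def pyp_draw(discount, conc, base_probs, num_sticks=500):
    """Truncated stick-breaking draw of a PYP whose base is a discrete pmf."""
    k = np.arange(1, num_sticks + 1)
    v = rng.beta(1.0 - discount, conc + k * discount)
    p = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    atoms = rng.choice(len(base_probs), size=num_sticks, p=base_probs)
    out = np.zeros(len(base_probs))
    np.add.at(out, atoms, p)       # accumulate stick weights per atom
    return out / out.sum()         # renormalise the truncated draw

K, V_size, D = 10, 50, 3           # topics, vocabulary size, documents
mu = pyp_draw(0.5, 1.0, np.full(K, 1.0 / K))                # topic root
gamma = pyp_draw(0.5, 1.0, np.full(V_size, 1.0 / V_size))   # word root
nu = pyp_draw(0.5, 1.0, mu)                                 # topic prior
theta = [pyp_draw(0.5, 1.0, nu) for _ in range(D)]          # document-topic
phi = [pyp_draw(0.5, 1.0, gamma) for _ in range(K)]         # topic-word

docs = []
for d in range(D):
    z = rng.choice(K, size=100, p=theta[d])                 # latent topics
    w = [rng.choice(V_size, p=phi[zi]) for zi in z]         # observed words
    docs.append(w)
```

This forward sampler is only a visualisation aid; inference in the article works with the collapsed representation that follows, never with explicit probability vectors.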
3.2 Model Representation and Posterior Likelihood
In a Bayesian setting, posterior inference requires us to analyse the posterior distribution of the model variables given the observed data. For instance, the joint posterior distribution for the HPYP topic model is
$$p(\mathbf{Z}, \boldsymbol{\theta}, \boldsymbol{\phi}, \mu, \nu, \gamma \,|\, \mathbf{W}, \Xi) \propto p(\mathbf{Z}, \mathbf{W}, \boldsymbol{\theta}, \boldsymbol{\phi}, \mu, \nu, \gamma \,|\, \Xi)\,. \qquad (15)$$
Here, we use bold face capital letters to represent the set of all relevant variables. For instance, $\mathbf{W}$ captures all the words in the corpus. Additionally, we denote by $\Xi$ the set of all hyperparameters and constants in the model.
Note that deriving the posterior distribution analytically is almost impossible due to its complex nature. This leaves us with approximate Bayesian inference techniques as mentioned in Section 2.3. However, even with these techniques, performing posterior inference with the posterior distribution is difficult due to the coupling of the probability vectors from the PYPs.
The key to an efficient inference procedure with the PYPs is to marginalise out the PYPs in the model and record various associated counts instead, which yields a collapsed sampler. To achieve this, we adopt a Chinese restaurant process (CRP) metaphor (Teh and Jordan, 2010; Blei et al., 2010) to represent the variables in the topic model. With this metaphor, all data in the model (e.g., topics and words) are the customers, while the PYP nodes are the restaurants the customers visit. In each restaurant, each customer is seated at only one table, though each table can have any number of customers. Each table in a restaurant serves a dish; the dish corresponds to the categorical label a data point may have (e.g., the topic label or word). Note that there can be more than one table serving the same dish. In a HPYP topic model, the tables in a restaurant are treated as the customers of the parent restaurant (the restaurant whose PYP acts as the base distribution in the graphical model), and they share the same dish. This means that the data are passed up recursively until the root node. For illustration, we present a simple example in Figure 2, showing the seating arrangement of the customers in two restaurants.
Naïvely recording the seating arrangement (table and dish) of each customer brings about computational inefficiency during inference. Instead, we adopt the table multiplicity (or table counts) representation of Chen et al. (2011), which requires no dynamic memory, thus consuming only a constant factor of memory at no loss of inference efficiency. Under this representation, we store only the customer counts and table counts associated with each restaurant. The customer count $c^{\mathcal{N}}_k$ denotes the number of customers having dish $k$ in restaurant $\mathcal{N}$. The corresponding symbol without subscript, $c^{\mathcal{N}}$, denotes the collection of customer counts in restaurant $\mathcal{N}$, that is, $c^{\mathcal{N}} = \{c^{\mathcal{N}}_k\}_k$. The total number of customers in a restaurant is denoted by the capitalised symbol $C^{\mathcal{N}} = \sum_k c^{\mathcal{N}}_k$. Similar to the customer count, the table count $t^{\mathcal{N}}_k$ denotes the number of non-empty tables serving dish $k$ in restaurant $\mathcal{N}$; the corresponding $t^{\mathcal{N}}$ and $T^{\mathcal{N}}$ are defined similarly. The table multiplicity representation of the example in Figure 2 is illustrated in Figure 3. We refer the readers to Chen et al. (2011) for a detailed derivation of the posterior likelihood of a restaurant.
For the posterior likelihood of the HPYP topic model, we marginalise out the probability vectors associated with the PYPs and represent them with the customer counts and table counts, following Chen et al. (2011, Theorem 1). We present a modularised version of the full posterior of the HPYP topic model, which allows the posterior to be computed very quickly. The full posterior consists of the modularised likelihood associated with each PYP $\mathcal{N}$ in the model, defined as
$$f(\mathcal{N}) = \frac{\bigl(\beta_{\mathcal{N}} \,\big|\, \alpha_{\mathcal{N}}\bigr)_{T^{\mathcal{N}}}}{\bigl(\beta_{\mathcal{N}}\bigr)_{C^{\mathcal{N}}}} \prod_{k} S^{c^{\mathcal{N}}_k}_{t^{\mathcal{N}}_k,\,\alpha_{\mathcal{N}}}\,, \qquad (16)$$
where $(x)_n$ denotes the Pochhammer symbol (rising factorial), $(x\,|\,y)_n$ denotes the Pochhammer symbol with increment $y$, and $S^{n}_{m,\alpha}$ denotes the generalised Stirling number (see Buntine and Hutter, 2012).
With the CRP representation, the full posterior of the HPYP topic model can now be written, in terms of $f(\mathcal{N})$ given in Equation (16), as
$$p(\mathbf{Z}, \mathbf{T}, \mathbf{C} \,|\, \mathbf{W}, \Xi) \propto f(\mu)\, f(\nu)\, f(\gamma) \left(\prod_{d=1}^{D} f(\theta_d)\right) \left(\prod_{k=1}^{K} f(\phi_k)\right) \prod_{v} \bigl(H^{\gamma}_v\bigr)^{t^{\gamma}_v}\,. \qquad (19)$$
This result is a generalisation of Chen et al. (2011, Theorem 1) that accounts for a discrete base distribution: the last term in Equation (19) corresponds to the base distribution of $\gamma$, and $v$ indexes each unique word in the vocabulary set $\mathcal{V}$. The bold face $\mathbf{T}$ and $\mathbf{C}$ denote the collection of all table counts and customer counts, respectively. Note that the topic assignments $\mathbf{Z}$ are implicitly captured by the customer counts:
$$c^{\theta_d}_k = \sum_{n=1}^{N_d} \mathrm{I}(z_{dn} = k)\,, \qquad (20)$$
where $\mathrm{I}(\cdot)$ is the indicator function, which evaluates to $1$ when the statement inside the function is true, and $0$ otherwise. We would like to point out that even though the probability vectors of the PYPs are integrated out and not explicitly stored, they can easily be reconstructed; this is discussed in Section 4.4. We move on to Bayesian inference in the next section.
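The collapsed state of each restaurant is thus just two count maps plus a link to its parent. A minimal sketch of this bookkeeping (class and method names are our own) is:

```python
from collections import defaultdict

class Restaurant:
    """A collapsed PYP node: stores only customer counts c_k and
    table counts t_k per dish k, plus a link to its parent node."""
    def __init__(self, discount, concentration, parent=None):
        self.discount = discount
        self.concentration = concentration
        self.parent = parent
        self.c = defaultdict(int)   # customers having dish k
        self.t = defaultdict(int)   # non-empty tables serving dish k

    @property
    def C(self):                    # total customers in the restaurant
        return sum(self.c.values())

    @property
    def T(self):                    # total tables in the restaurant
        return sum(self.t.values())

    def add_customer(self, dish, open_table):
        """Seat a customer for `dish`. `open_table` lists, child first,
        whether a new table opens at each level; during inference these
        choices are sampled, here they are supplied for illustration."""
        self.c[dish] += 1
        if open_table[0]:
            self.t[dish] += 1
            # a new table becomes a customer of the parent restaurant
            if self.parent is not None:
                self.parent.add_customer(dish, open_table[1:])

root = Restaurant(0.5, 1.0)
node = Restaurant(0.5, 1.0, parent=root)
node.add_customer('k1', [True, True])   # opens tables up to the root
node.add_customer('k1', [False])        # joins an existing table
```

Note how the second customer leaves the parent restaurant untouched: only table openings propagate upwards.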
4 Posterior Inference for the HPYP Topic Model
We focus on MCMC methods for Bayesian inference on the HPYP topic model. An MCMC method on topic models follows a simple procedure: decrement the counts contributed by a word, sample a new topic for the word, and update the model by accepting or rejecting the proposed sample. Here, we describe the collapsed blocked Gibbs sampler for the HPYP topic model. Note that the PYPs are marginalised out, so we only deal with the counts.
4.1 Decrementing the Counts Associated with a Word
The first step in a Gibbs sampler is to remove a word and its corresponding latent topic, then decrement the associated customer counts and table counts. To give an example from Figure 2, if we remove the red customer from Restaurant 2, we would decrement the corresponding customer count in Restaurant 2 by $1$. Additionally, we also decrement the corresponding table count by $1$ because the red customer is the only customer at its table; this in turn decrements the corresponding customer count in the parent restaurant by $1$. However, this requires us to keep track of the customers' seating arrangement, which leads to increased memory requirements and poorer performance due to inadequate mixing (Chen et al., 2011).
To overcome the above issue, we follow the concept of the table indicator (Chen et al., 2011) and introduce a new auxiliary Bernoulli indicator variable $u$, which indicates whether removing a customer also removes the table at which the customer is seated. Note that our Bernoulli indicator is different to that of Chen et al. (2011), which indicates the restaurant a customer contributes to. The Bernoulli indicator is sampled as needed in the decrementing procedure and is not stored; this means that we simply "forget" the seating arrangements and re-sample them later when needed, so we do not need to store the seating arrangement. The Bernoulli indicator for a customer having dish $k$ in restaurant $\mathcal{N}$ depends solely on the customer counts and the table counts:
$$u \sim \mathrm{Bernoulli}\!\left(\frac{t^{\mathcal{N}}_k}{c^{\mathcal{N}}_k}\right). \qquad (21)$$
In the context of the HPYP topic model described in Section 3.1, we formally present how we decrement the counts associated with the word $w_{dn}$ and latent topic $z_{dn}$ from document $d$ and position $n$. First, on the vocabulary side (see Figure 1), we decrement the customer count $c^{\phi_{z_{dn}}}_{w_{dn}}$ associated with $\phi_{z_{dn}}$ by $1$, then sample a Bernoulli indicator according to Equation (21). If the indicator is $1$, we decrement the table count $t^{\phi_{z_{dn}}}_{w_{dn}}$ and also the customer count $c^{\gamma}_{w_{dn}}$ by one. In this case, we would sample a Bernoulli indicator for $\gamma$, and decrement $t^{\gamma}_{w_{dn}}$ if it is $1$. We do not decrement the respective customer count if the Bernoulli indicator is $0$. Second, we need to decrement the counts associated with the latent topic $z_{dn}$. The procedure is similar: we decrement $c^{\theta_d}_{z_{dn}}$ by $1$ and sample the corresponding Bernoulli indicator, proceeding up through $\nu$ towards $\mu$. Note that whenever we decrement a customer count, we sample the corresponding Bernoulli indicator. We repeat this procedure recursively until the Bernoulli indicator is $0$ or until the procedure hits the root node.
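The recursive decrement can be sketched with plain dictionaries; the path from a leaf node to the root is given child-first, and the Bernoulli indicator is drawn with probability $t_k/c_k$ at each level, as in Equation (21). The data layout and names here are our own.

```python
import random

def decrement(path, dish, rng=random):
    """Remove one customer for `dish` from the first restaurant in
    `path` (a child-first list of {'c': ..., 't': ...} count dicts),
    recursing towards the root while tables empty out."""
    for level in path:
        c, t = level['c'], level['t']
        # Bernoulli indicator: the customer's table is also removed
        # with probability t_k / c_k.
        remove_table = rng.random() < t[dish] / c[dish]
        c[dish] -= 1
        if not remove_table:
            break              # parent counts are unaffected
        t[dish] -= 1           # the emptied table was a customer of
                               # the parent restaurant, so continue up

leaf = {'c': {'k': 1}, 't': {'k': 1}}
root = {'c': {'k': 1}, 't': {'k': 1}}
decrement([leaf, root], 'k')
# with t_k == c_k the indicator fires with probability 1, so every
# count along the path drops to zero
```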
4.2 Sampling a New Topic for a Word
After decrementing the variables associated with a word $w_{dn}$, we use a blocked Gibbs sampler to sample a new topic for the word and the corresponding customer counts and table counts. The conditional posterior used in sampling can be computed quickly when the full posterior is represented in a modularised form. To illustrate, the conditional posterior for $z_{dn}$ and its associated customer counts and table counts is
$$p(z_{dn}, \mathbf{T}, \mathbf{C} \,|\, \mathbf{Z}^{-dn}, \mathbf{T}^{-dn}, \mathbf{C}^{-dn}, \mathbf{W}, \Xi) \propto \frac{p(\mathbf{Z}, \mathbf{T}, \mathbf{C} \,|\, \mathbf{W}, \Xi)}{p(\mathbf{Z}^{-dn}, \mathbf{T}^{-dn}, \mathbf{C}^{-dn} \,|\, \mathbf{W}^{-dn}, \Xi)}\,, \qquad (22)$$
which is further broken down by substituting the posterior likelihood defined in Equation (19), giving the following ratios of the modularised likelihoods:
$$p(z_{dn}, \mathbf{T}, \mathbf{C} \,|\, \mathbf{Z}^{-dn}, \mathbf{T}^{-dn}, \mathbf{C}^{-dn}, \mathbf{W}, \Xi) \propto \frac{f(\mu)}{f(\mu^{-dn})}\, \frac{f(\nu)}{f(\nu^{-dn})}\, \frac{f(\theta_d)}{f(\theta_d^{-dn})}\, \frac{f(\gamma)}{f(\gamma^{-dn})}\, \frac{f(\phi_{z_{dn}})}{f(\phi_{z_{dn}}^{-dn})} \bigl(H^{\gamma}_{w_{dn}}\bigr)^{t^{\gamma}_{w_{dn}} - t^{\gamma,-dn}_{w_{dn}}}\,.$$
The superscript $-dn$ indicates that the variables associated with the word $w_{dn}$ are removed from the respective sets, that is, the customer counts and table counts are those after the decrementing procedure. Since we only sample the topic assignment associated with one word, the customer counts and table counts can each increment by at most $1$; see Table 2 for a list of all possible proposals. We note that the proposals can include states that are invalid, but this is not an issue since those states have zero posterior probability and thus will not be sampled.
This allows the ratios of the modularised likelihoods, which consist of ratios of Pochhammer symbols and ratios of Stirling numbers, to simplify further. For instance, the ratios of Pochhammer symbols can be reduced to constants, as follows:
$$\frac{(\beta \,|\, \alpha)_{T+1}}{(\beta \,|\, \alpha)_{T}} = \beta + T\alpha\,, \qquad \frac{(\beta)_{C+1}}{(\beta)_{C}} = \beta + C\,.$$
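The ratios of generalised Stirling numbers, in contrast, are usually read from a precomputed cache. A minimal sketch of such a cache, using the standard recurrence $S^{n+1}_{m} = S^{n}_{m-1} + (n - m\alpha)\, S^{n}_{m}$ (see Buntine and Hutter, 2012), is given below; real implementations work in log space to avoid overflow.

```python
import numpy as np

def stirling_table(n_max, discount):
    """Generalised Stirling numbers S^n_{m, discount} for n, m <= n_max,
    via S^{n+1}_m = S^n_{m-1} + (n - m * discount) * S^n_m."""
    S = np.zeros((n_max + 1, n_max + 1))
    S[0, 0] = 1.0
    for n in range(n_max):
        for m in range(n + 2):
            prev = S[n, m - 1] if m >= 1 else 0.0
            S[n + 1, m] = prev + (n - m * discount) * S[n, m]
    return S

S = stirling_table(5, 0.5)
# a ratio such as S^{c+1}_{t, 0.5} / S^{c}_{t, 0.5} is then a single
# table lookup and division during sampling
```

With `discount = 0`, the table reduces to the unsigned Stirling numbers of the first kind, matching the DP special case.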
With the conditional posterior defined, we proceed to the sampling process. Our first step involves finding all possible changes to the topic $z_{dn}$, the customer counts, and the table counts (hereafter known as the 'state') associated with adding the removed word back into the topic model. Since only one word is added into the model, the customer counts and the table counts can only increase by at most $1$, constraining the possible states to a reasonably small number. Furthermore, the customer count of a parent node will only be incremented when the table count of its child node increases. Note that it is possible for the added customer to generate a new dish (topic) for the model; this requires the customer to increment the table count of a new dish in the root node by $1$ (from $0$).
Next, we compute the conditional posterior (Equation (22)) for all possible states. The conditional posterior (up to a proportionality constant) can be computed quickly by breaking down the posterior and calculating the relevant parts. We then normalise these values and sample one of the states to be the proposed next state. Note that the proposed state will always be accepted, which is a characteristic of the Gibbs sampler.
Finally, given the proposal, we update the HPYP model by incrementing the relevant customer counts and table counts.
4.3 Optimising the Hyperparameters
Choosing the right hyperparameters for the priors is important for topic models. Wallach et al. (2009a) show that optimised hyperparameters increase the robustness of topic models and improve their model fitting. The hyperparameters of the HPYP topic model are the discount parameters and concentration parameters of the PYPs. Here, we propose a procedure to optimise the concentration parameters, but leave the discount parameters fixed due to their coupling with the Stirling numbers cache.
The concentration parameters of all the PYPs are optimised using an auxiliary variable sampler similar to Teh (2006). Being Bayesian, we assume the concentration parameter $\beta_{\mathcal{N}}$ of a PYP node $\mathcal{N}$ has the following hyperprior:
$$\beta_{\mathcal{N}} \sim \mathrm{Gamma}(a, b)\,,$$
where $a$ is the shape parameter and $b$ is the rate parameter. The gamma prior is chosen due to its conjugacy, which gives a gamma posterior for $\beta_{\mathcal{N}}$.
To optimise $\beta_{\mathcal{N}}$, we first sample the auxiliary variables $x$ and $y_i$ given the current value of $\beta_{\mathcal{N}}$ and the counts, as follows:
$$x \,|\, \beta_{\mathcal{N}} \sim \mathrm{Beta}\bigl(\beta_{\mathcal{N}} + 1,\, C^{\mathcal{N}} - 1\bigr)\,, \qquad y_i \,|\, \beta_{\mathcal{N}} \sim \mathrm{Bernoulli}\!\left(\frac{\beta_{\mathcal{N}}}{\beta_{\mathcal{N}} + \alpha_{\mathcal{N}}\, i}\right), \quad i = 1, \dots, T^{\mathcal{N}} - 1\,.$$
With these, we can then sample a new $\beta_{\mathcal{N}}$ from its conditional posterior
$$\beta_{\mathcal{N}} \,|\, x, y \sim \mathrm{Gamma}\!\left(a + \sum_{i=1}^{T^{\mathcal{N}}-1} y_i\,,\; b - \log x\right).$$
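A sketch of one such auxiliary-variable update, following the scheme in Teh (2006); the function signature, the single-restaurant simplification, and the specific count values are our own illustrative assumptions.

```python
import numpy as np

def sample_concentration(beta, discount, num_customers, num_tables,
                         shape, rate, rng):
    """One auxiliary-variable update for a PYP concentration with a
    Gamma(shape, rate) prior. Assumes num_customers > 1."""
    # x | beta ~ Beta(beta + 1, C - 1)
    x = rng.beta(beta + 1.0, num_customers - 1.0)
    # y_i | beta ~ Bernoulli(beta / (beta + discount * i)), i = 1..T-1
    i = np.arange(1, num_tables)
    y = rng.random(len(i)) < beta / (beta + discount * i)
    # beta | x, y ~ Gamma(shape + sum(y), rate - log(x));
    # numpy's gamma is parameterised by scale = 1 / rate
    return rng.gamma(shape + y.sum(), 1.0 / (rate - np.log(x)))

rng = np.random.default_rng(3)
beta = 1.0
for _ in range(10):   # a few sweeps of the update
    beta = sample_concentration(beta, 0.5, num_customers=200,
                                num_tables=40, shape=1.0, rate=1.0, rng=rng)
```

In the full model this update is performed per PYP node, using that node's own customer and table totals.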
The collapsed Gibbs sampler is summarised by Algorithm 1.
4.4 Estimating the Probability Vectors of the PYPs
Recall that the aim of topic modelling is to analyse the posterior of the model parameters, such as the one in Equation (15). Although we have marginalised out the PYPs in the above Gibbs sampler, the PYPs can be reconstructed from the associated customer counts and table counts. Recovering the full posterior distribution of the PYPs is a complicated task, so, instead, we will analyse the PYPs via the expected value of their conditional marginal posterior distribution, or simply, their posterior mean.
The posterior mean of a PYP corresponds to the probability of sampling a new customer for the PYP. To illustrate, we consider the posterior of the document–topic distribution $\theta_d$. We let $z_{d,N_d+1}$ be an unknown future latent topic in addition to the known $z_{d1}, \dots, z_{dN_d}$. With this, we can write the posterior mean of $\theta_d$ as
$$\mathbb{E}\bigl[\theta_{dk} \,\big|\, \mathbf{T}, \mathbf{C}, \Xi\bigr] = p(z_{d,N_d+1} = k \,|\, \mathbf{T}, \mathbf{C}, \Xi)\,,$$
the posterior predictive distribution of $z_{d,N_d+1}$, and note that $z_{d,N_d+1}$ can be sampled using the CRP, as follows:
$$p(z_{d,N_d+1} = k \,|\, \mathbf{T}, \mathbf{C}, \Xi) = \frac{c^{\theta_d}_k - \alpha_\theta\, t^{\theta_d}_k}{\beta_\theta + C^{\theta_d}} + \frac{\beta_\theta + \alpha_\theta\, T^{\theta_d}}{\beta_\theta + C^{\theta_d}}\; p(z_{d,N_d+1} = k \,|\, \nu)\,.$$
Thus, the posterior mean of $\theta_d$ is given as
$$\mathbb{E}\bigl[\theta_{dk} \,\big|\, \mathbf{T}, \mathbf{C}, \Xi\bigr] = \frac{c^{\theta_d}_k - \alpha_\theta\, t^{\theta_d}_k}{\beta_\theta + C^{\theta_d}} + \frac{\beta_\theta + \alpha_\theta\, T^{\theta_d}}{\beta_\theta + C^{\theta_d}}\; \mathbb{E}\bigl[\nu_k \,\big|\, \mathbf{T}, \mathbf{C}, \Xi\bigr]\,,$$
which is written in terms of the posterior mean of its parent PYP, $\nu$. The posterior means of the other PYPs, such as $\nu$ and $\phi_k$, can be derived by taking a similar approach. Generally, the posterior mean corresponding to a PYP $\mathcal{N}$ (with parent PYP $\mathcal{P}$) is as follows:
$$\mathbb{E}\bigl[\mathcal{N}_k \,\big|\, \mathbf{T}, \mathbf{C}, \Xi\bigr] = \frac{c^{\mathcal{N}}_k - \alpha_{\mathcal{N}}\, t^{\mathcal{N}}_k}{\beta_{\mathcal{N}} + C^{\mathcal{N}}} + \frac{\beta_{\mathcal{N}} + \alpha_{\mathcal{N}}\, T^{\mathcal{N}}}{\beta_{\mathcal{N}} + C^{\mathcal{N}}}\; \mathbb{E}\bigl[\mathcal{P}_k \,\big|\, \mathbf{T}, \mathbf{C}, \Xi\bigr]\,. \qquad (34)$$
By applying Equation (34) recursively, we obtain the posterior means of all the PYPs in the model.
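The recursion in Equation (34) takes only a few lines of code; the dict-based node layout and example counts below are our own illustration.

```python
def posterior_mean(node, dish):
    """E[N_k] for a collapsed PYP node via Equation (34):
    (c_k - a*t_k)/(b + C) + (b + a*T)/(b + C) * E[parent_k].
    `node` is a dict with keys: a (discount), b (concentration),
    c, t (per-dish count dicts), parent (node dict, or a base pmf
    given as an indexable sequence at the root)."""
    a, b = node['a'], node['b']
    C = sum(node['c'].values())
    T = sum(node['t'].values())
    if isinstance(node['parent'], dict):
        parent_k = posterior_mean(node['parent'], dish)   # recurse up
    else:
        parent_k = node['parent'][dish]                   # base distribution
    return ((node['c'].get(dish, 0) - a * node['t'].get(dish, 0))
            + (b + a * T) * parent_k) / (b + C)

example = {'a': 0.5, 'b': 1.0, 'c': {0: 3}, 't': {0: 2},
           'parent': [0.5, 0.5]}
mean0 = posterior_mean(example, 0)
# (3 - 0.5*2 + (1 + 0.5*2)*0.5) / (1 + 3) = 0.75
```

The two dishes' means (0.75 and 0.25 here) sum to one, as expected of a probability vector.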
We note that the dimension of the topic distributions ($\theta_d$, $\nu$, $\mu$) is $K + 1$, where $K$ is the number of observed topics. The extra dimension accounts for the generation of a new topic associated with a new customer, though the probability of generating a new topic is usually much smaller. In practice, we may instead ignore the extra dimension during the evaluation of a topic model, since it does not provide a useful interpretation. One way to do this is to simply discard the extra dimension of all the probability vectors after computing the posterior means. Another approach is to normalise the posterior mean of the root node after discarding the extra dimension, before computing the posterior means of the other PYPs. Note that for a considerably large corpus, the difference between the above approaches is too small to notice.
4.5 Evaluations on Topic Models
Generally, there are two ways to evaluate a topic model. The first is to evaluate the topic model based on the task it performs, for instance, the ability to make predictions. The second approach is the statistical evaluation of the topic model on modelling the data, which is also known as the goodness-of-fit test. In this section, we will present some commonly used evaluation metrics that are applicable to all topic models, but we first discuss the procedure for estimating variables associated with the test set.
4.5.1 Predictive Inference on the Test Documents
Test documents, which are used for evaluation, are set aside from the learning documents. As such, the document–topic distributions $\theta_d$ associated with the test documents are unknown and hence need to be estimated. One estimate for $\theta_d$ is its posterior mean given the variables learned from the Gibbs sampler:
$$\hat{\theta}_{dk} = \mathbb{E}\bigl[\theta_{dk} \,\big|\, \mathbf{T}, \mathbf{C}, \Xi\bigr]\,, \qquad (35)$$
obtainable by applying Equation (34). Note that since the latent topics corresponding to the test set are not sampled, the customer counts and table counts associated with $\theta_d$ are $0$; thus $\hat{\theta}_d$ is equal to $\hat{\nu}$, the posterior mean of $\nu$. However, this is not a good estimate for the topic distribution of a test document, since it would be identical for all the test documents. To overcome this issue, we will instead use some of the words in the test documents to obtain a better estimate for $\theta_d$. This method is known as document completion (Wallach et al., 2009b), as we use part of the text to estimate $\theta_d$, and use the rest for evaluation.
Obtaining a better estimate requires us to first sample some of the latent topics in the test documents. The proper way to do this is to run an algorithm akin to the collapsed Gibbs sampler, but this would be excruciatingly slow due to the need to re-sample the customer counts and table counts for all the parent PYPs. Instead, we assume that the variables learned from the Gibbs sampler are fixed, and sample the latent topics from their conditional posterior sequentially, given the previously sampled latent topics:
Whenever a latent topic is sampled, we increment the corresponding customer count for the test document. For simplicity, we set the table counts to be half the corresponding customer counts; this avoids the expensive operation of sampling the table counts. Additionally, the topic distribution is re-estimated using Equation (35) before sampling the next latent topic. We note that the estimated variables are unbiased.
The final estimate becomes an estimate for the topic distribution of the test document. The above procedure is repeated several times to give multiple samples of the topic distribution, which are used to compute its Monte Carlo estimate:
This Monte Carlo estimate can then be used for computing the evaluation metrics. Note that when estimating the topic distributions, we have ignored the possibility of generating a new topic; that is, the latent topics are constrained to the existing topics, as previously discussed in Section 4.4.
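As an illustrative sketch of this document-completion procedure, the following substitutes a simple Dirichlet-style smoothed estimate for the paper's PYP posterior mean (Equations 34–35); the function names, the smoothing parameter `alpha`, and the number of Monte Carlo samples are all assumptions of the sketch, not the paper's exact sampler:

```python
import numpy as np

rng = np.random.default_rng(0)

def document_completion_estimate(words, phi, prior, alpha=1.0):
    """Sample latent topics for the held-in words sequentially, updating
    the topic counts after each draw, then return a smoothed estimate of
    the document's topic distribution. `phi` is the K x V topic-word
    matrix and `prior` the parent topic distribution."""
    K = phi.shape[0]
    counts = np.zeros(K)
    for w in words:
        # Current estimate of the document-topic distribution
        theta = (counts + alpha * prior) / (counts.sum() + alpha)
        # Conditional posterior of the latent topic for word w
        p = theta * phi[:, w]
        p /= p.sum()
        counts[rng.choice(K, p=p)] += 1
    return (counts + alpha * prior) / (counts.sum() + alpha)

def monte_carlo_theta(words, phi, prior, n_samples=10):
    """Average several document-completion estimates (Monte Carlo)."""
    return np.mean(
        [document_completion_estimate(words, phi, prior) for _ in range(n_samples)],
        axis=0,
    )
```

The remaining (held-out) words of the test document would then be scored against this estimate.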
4.5.2 Goodness-of-fit Test
Measures of goodness-of-fit usually involve computing the discrepancy between the observed values and the values predicted under the model. However, the observed variables in a topic model are the words in the corpus, which are not quantifiable since they are discrete labels. Thus, evaluations of topic models are usually based on the model likelihood instead.
A popular metric commonly used to evaluate the goodness-of-fit of a topic model is perplexity, which is negatively related to the likelihood of the observed words given the model. For a test corpus of $N$ words, it is defined as
\[
\mathrm{Perplexity} = \exp\!\left( - \frac{1}{N} \sum_{d} \sum_{n} \log p(w_{dn} \,|\, \theta_d, \Phi) \right),
\]
where $p(w_{dn} \,|\, \theta_d, \Phi)$ is the likelihood of sampling the word $w_{dn}$ given the document–topic distribution $\theta_d$ and the topic–word distributions $\Phi$. Computing this likelihood requires us to marginalise out the latent topic $z_{dn}$ from their joint distribution, as follows:
\[
p(w_{dn} \,|\, \theta_d, \Phi) = \sum_{k} p(w_{dn}, z_{dn} = k \,|\, \theta_d, \Phi) = \sum_{k} \theta_{dk} \, \phi_{k w_{dn}} .
\]
Although perplexity can be computed on the whole corpus, in practice we compute it on the test documents, to measure whether the topic model generalises well to unseen data. A good topic model is able to predict the words in the test set better, thereby assigning a higher probability to generating those words. Since perplexity is negatively related to the likelihood, a lower perplexity is better.
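A minimal sketch of computing test-set perplexity, assuming the document–topic distributions and the topic–word matrix are available as NumPy arrays (the function name is illustrative):

```python
import numpy as np

def perplexity(docs, thetas, phi):
    """Per-word test perplexity: the exponential of the negative mean
    log-likelihood. `docs` is a list of word-index lists, `thetas` the
    per-document topic distributions, and `phi` the K x V topic-word
    matrix; the latent topic of each word is marginalised out."""
    log_lik, n_words = 0.0, 0
    for theta, doc in zip(thetas, docs):
        word_probs = theta @ phi          # sum_k theta_k * phi_k,w for every w
        log_lik += np.sum(np.log(word_probs[doc]))
        n_words += len(doc)
    return np.exp(-log_lik / n_words)
```

As a sanity check, a model that assigns uniform probability over a vocabulary of size $V$ to every word attains a perplexity of exactly $V$.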
4.5.3 Document Clustering
We can also evaluate the clustering ability of topic models. Note that topic models assign a topic to each word in a document, essentially performing a soft clustering (Erosheva and Fienberg, 2005) of the documents, in which the membership is given by the document–topic distribution $\theta_d$. To evaluate the clustering of the documents, we convert the soft clustering to a hard clustering by choosing the topic that best represents each document, hereafter called the dominant topic. The dominant topic of a document corresponds to the topic that has the highest proportion in its topic distribution, that is,
\[
\bar{z}_d = \operatorname*{argmax}_{k} \; \theta_{dk} .
\]
Two commonly used evaluation measures for clustering are purity and normalised mutual information (NMI, Manning et al., 2008). Purity is a simple clustering measure which can be interpreted as the proportion of documents correctly clustered, while NMI is an information-theoretic measure used for clustering comparison. If we denote the ground-truth classes as $\mathbb{C} = \{c_1, \dots, c_J\}$ and the obtained clusters as $\mathbb{K} = \{k_1, \dots, k_M\}$, where each $c_j$ and $k_m$ represents a collection (set) of documents and $D$ denotes the total number of documents, then the purity and NMI can be computed as
\[
\mathrm{Purity}(\mathbb{K}, \mathbb{C}) = \frac{1}{D} \sum_{m} \max_{j} |k_m \cap c_j| , \qquad
\mathrm{NMI}(\mathbb{K}, \mathbb{C}) = \frac{2 \, I(\mathbb{K}; \mathbb{C})}{H(\mathbb{K}) + H(\mathbb{C})} ,
\]
where $I(\cdot\,;\cdot)$ denotes the mutual information between two partitions and $H(\cdot)$ denotes the entropy. They are defined as follows:
\[
I(\mathbb{K}; \mathbb{C}) = \sum_{m} \sum_{j} \frac{|k_m \cap c_j|}{D} \log \frac{D \, |k_m \cap c_j|}{|k_m| \, |c_j|} , \qquad
H(\mathbb{K}) = - \sum_{m} \frac{|k_m|}{D} \log \frac{|k_m|}{D} .
\]
Note that the higher the purity or NMI, the better the clustering.
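The dominant-topic assignment, purity, and NMI can be sketched as follows, with clusters and classes represented as sets of document indices (function names are illustrative):

```python
import math
import numpy as np

def dominant_topics(thetas):
    """Convert soft clustering to hard clustering: each document is
    assigned the topic with the highest proportion."""
    return np.argmax(thetas, axis=1)

def purity(clusters, classes):
    """Proportion of documents that fall in their cluster's majority class."""
    total = sum(len(c) for c in clusters)
    return sum(max(len(c & g) for g in classes) for c in clusters) / total

def _entropy(partition, total):
    return -sum(len(p) / total * math.log(len(p) / total)
                for p in partition if p)

def nmi(clusters, classes):
    """Normalised mutual information between two partitions of the corpus."""
    total = sum(len(c) for c in clusters)
    mi = sum(
        len(c & g) / total * math.log(total * len(c & g) / (len(c) * len(g)))
        for c in clusters for g in classes if c & g
    )
    return 2 * mi / (_entropy(clusters, total) + _entropy(classes, total))
```

A perfect clustering attains a purity and NMI of 1; random assignments drive both towards 0.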
5 Application: Modelling Social Network on Twitter
This section looks at how we can employ the framework discussed above for an application of tweet modelling, using auxiliary information that is available on Twitter. We propose the Twitter-Network topic model (TNTM) to jointly model the text and the social network in a fully Bayesian nonparametric way, in particular, by incorporating the authors, hashtags, the “follower” network, and the text content in modelling. The TNTM employs a HPYP for text modelling and a Gaussian process (GP) random function model for social network modelling. We show that the TNTM significantly outperforms several existing nonparametric models due to its flexibility.
The emergence of web services such as blogs, microblogs, and social networking websites allows people to contribute information freely and publicly. This user-generated information is generally more personal and informal, and often contains personal opinions. In aggregate, it can be useful for reputation analysis of entities and products (Aula, 2010), natural disaster detection (Karimi et al., 2013), obtaining first-hand news (Broersma and Graham, 2012), or even demographic analysis (Correa et al., 2010). We focus on Twitter, an accessible source of information that allows users to freely voice their opinions and thoughts in short texts known as tweets.
Although LDA (Blei et al., 2003) is a popular model for text modelling, applying it directly to tweets often yields poor results, as tweets are short and often noisy (Zhao et al., 2011; Baldwin et al., 2013): they are unstructured and often contain grammatical and spelling errors, as well as informal words such as user-defined abbreviations due to the 140-character limit. LDA fails on short tweets since it relies heavily on word co-occurrence. Also notable is that the text in tweets may contain special tokens known as hashtags; they are used as keywords and allow users to link their tweets with other tweets tagged with the same hashtag. Nevertheless, hashtags are informal since there is no standard for their use; they can serve as both inline words and categorical labels. When used as labels, hashtags are often noisy, since users can create new hashtags easily and use any existing hashtag in any way they like. A notable example is hashtag hijacking, where a well-defined hashtag is used in an “inappropriate” way: although #McDStories was initially created to promote happy stories about McDonald’s, the hashtag was hijacked with negative stories about McDonald’s. Hence, instead of being treated as hard labels, hashtags are best treated as special words which can be the themes of the tweets. These properties of tweets make them challenging for topic models, and ad hoc alternatives are often used instead. For instance, Maynard et al. (2012) advocate the use of shallow methods for tweets, and Mehrotra et al. (2013) utilise a tweet-pooling approach to group short tweets into larger documents. In other text analysis applications, tweets are often ‘cleansed’ by NLP methods such as lexical normalisation (Baldwin et al., 2013). However, the use of normalisation has also been criticised (Eisenstein, 2013), as it can change the meaning of the text.
In the following, we propose a novel method for better modelling of microblogs by leveraging the auxiliary information that accompanies tweets. This information, complementing word co-occurrence, also opens the door to more applications, such as user recommendation and hashtag suggestion. Our major contributions include (1) a fully Bayesian nonparametric model named the Twitter-Network topic model (TNTM) that models tweets well, and (2) a combination of both the HPYP and the GP to jointly model text, hashtags, authors and the followers network. Despite the seeming complexity of the TNTM model, its implementation is made relatively straightforward using the flexible framework developed in Section 3. Indeed, a number of other variants were rapidly implemented with this framework as well.
5.2 The Twitter-Network Topic Model
The TNTM makes use of the accompanying hashtags, authors, and followers network to model tweets better. The TNTM is composed of two main components: a HPYP topic model for the text and hashtags, and a GP based random function network model for the followers network. The authorship information serves to connect the two together. The HPYP topic model is illustrated by region b⃝ in Figure 4 while the network model is captured by region a⃝.
5.2.1 HPYP Topic Model
The HPYP topic model described in Section 3 is extended as follows. For the word distributions, we first generate a parent word distribution prior for all topics:
where the base distribution is a discrete uniform distribution over the complete word vocabulary, which contains all the words and hashtags seen in the corpus. Then, we sample the hashtag distribution and the word distribution for each topic, with the parent word distribution as the base distribution:
Note that the tokens of the hashtags are shared with the words; that is, the hashtag #happy shares the same token as the word happy, and the two are thus treated as the same word. This treatment is important since some hashtags are used as words instead of labels, as illustrated by the following tweet: “i want to get into #photography. can someone recommend a good beginner #camera please? i dont know where to start”. Additionally, this also allows any word to be a hashtag, which will be useful for hashtag recommendation.
For the topic distributions, we generate a global topic distribution, which serves as a prior, from a GEM distribution. We then generate an author–topic distribution for each author, and a miscellaneous topic distribution to capture topics that deviate from the authors’ usual topics:
For each tweet, given the author–topic distribution of the observed author, we sample the document–topic distribution as follows:
Next, we generate the topic distributions for the observed hashtags and the observed words, following the technique used in the adaptive topic model (Du et al., 2012a). We explicitly model the influence of hashtags on words by generating the words conditioned on the hashtags. The intuition is that hashtags are the themes of a tweet and drive its content. Specifically, we sample the mixing proportions, which control the contributions of the parent distributions to the base distribution of the hashtags’ topic distribution, and then generate that distribution:
We set the document–topic distribution and the hashtags’ topic distribution as the parent distributions of the words’ topic distribution. This flexible configuration allows us to investigate the relationships among them; that is, we can examine whether the words’ topics are determined directly by the document’s topics, or indirectly through the hashtags’ topics. The mixing proportions and the words’ topic distribution are generated similarly:
The hashtags and words are then generated in a similar fashion to LDA. For each hashtag in a tweet, we sample a latent topic and then the hashtag itself:
Similarly, for each word in the tweet, we sample a latent topic and then the word:
We note that the discount and concentration parameters above are the hyperparameters of the model. We show the importance of the above modelling choices with ablation studies in Section 5.6. Although the HPYP topic model may seem complex, it is simply a network of PYP nodes, since all the distributions on probability vectors are modelled by PYPs.
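As a rough sketch of this LDA-style generation step (ignoring the PYP machinery and the hashtag–word coupling), each token is produced by first drawing a topic from the tweet's topic distribution and then a word from that topic's word distribution; all names and distributions here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def generate_tweet(theta, phi_words, n_words):
    """Generate word indices for one tweet: for each token, draw a
    latent topic z from the tweet's topic distribution `theta`, then a
    word from that topic's word distribution (row z of `phi_words`)."""
    words = []
    for _ in range(n_words):
        z = rng.choice(len(theta), p=theta)
        words.append(int(rng.choice(phi_words.shape[1], p=phi_words[z])))
    return words
```

In the full TNTM, the same two-step draw is used for hashtags, with the hashtags' topic distribution in place of the words' one.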
5.2.2 Random Function Network Model
The network model is connected to the HPYP topic model via the author–topic distributions, which we treat as inputs to the GP in the network model. The GP gives rise to a random function whose value for a pair of authors determines the link between them, that is, the existence of a social link from one author to the other. For each pair of authors, we sample their connection with the following random function network model:
where $\sigma(\cdot)$ is the sigmoid function:
\[
\sigma(x) = \frac{1}{1 + e^{-x}} .
\]
By marginalising out the random function, we can express the joint distribution of the link variables directly in terms of a vectorised collection over author pairs, where the mean vector and the covariance matrix are those of the GP:
where the remaining symbols are the hyperparameters associated with the kernel. The kernel is built on a similarity function with range between 0 and 1, here chosen to be the cosine similarity due to its ease of computation and popularity.
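A minimal sketch of the sigmoid link and a cosine-similarity-based covariance matrix follows; the hyperparameters `scale` and `jitter` are hypothetical stand-ins for the paper's kernel hyperparameters, and the exact kernel form here is an assumption, not the paper's:

```python
import math
import numpy as np

def sigmoid(x):
    """Link function mapping the GP output to a connection probability."""
    return 1.0 / (1.0 + math.exp(-x))

def cosine_similarity(u, v):
    """Cosine similarity; lies in [0, 1] for nonnegative topic vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def gp_covariance(author_topics, scale=1.0, jitter=1e-6):
    """Covariance matrix over authors from a cosine-similarity kernel on
    the author-topic distributions."""
    n = len(author_topics)
    K = np.array([[scale * cosine_similarity(author_topics[i], author_topics[j])
                   for j in range(n)] for i in range(n)])
    return K + jitter * np.eye(n)  # jitter keeps K positive definite
```

Authors with similar topic distributions thus receive a high covariance, making a social link between them more probable under the GP.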
5.2.3 Relationships with Other Models
The TNTM is related to many existing models when certain components are removed. When the hashtags and the network components are removed, the TNTM reduces to a nonparametric variant of the author–topic model (ATM). Conversely, if the authorship information is discarded, the TNTM resembles the correspondence LDA (Blei and Jordan, 2003), although it differs in that it allows hashtags and words to be generated from a common vocabulary.
In contrast to existing parametric models, the network model in the TNTM provides possibly the most flexible way of modelling the network, via a nonparametric Bayesian prior (a GP), following Lloyd et al. (2012). Unlike Lloyd et al. (2012), we propose a new kernel function that fits our purpose better and achieves a significant improvement over the original kernel.
5.3 Representation and Model Likelihood
As in previous sections, we represent the TNTM using the CRP representation discussed in Section 3.2. However, since the PYP variables in the TNTM can have multiple parents, we extend the representation following Du et al. (2012a). The distinction is that we store multiple table counts for each PYP: for each parent, we record the number of tables in the PYP serving a given dish that contribute to the customer counts of that parent, together with the corresponding total table count. The number of tables in a PYP serving a dish is then the sum of these per-parent counts, and likewise for the total number of tables. We refer the reader to Lim et al. (2013, Appendix B) for a detailed discussion.
We use boldface capital letters to denote the set of all relevant lower-case variables; for example, we write the set of all words and hashtags, the set of all topic assignments for the words and hashtags, the set of all table counts, the set of all customer counts, and the set of all hyperparameters. By marginalising out the latent variables, we write down the model likelihood corresponding to the HPYP topic model in terms of these counts:
where the first factor is the modularised likelihood corresponding to each node, as defined by Equation (16), and the second factor is the likelihood corresponding to the probabilities that control which parent node a customer is sent to, defined as