Models of bibliographic data need to consider many kinds of information. Articles are usually accompanied by metadata such as authors, publication data, categories and time. Cited papers can also be available. When authors’ topic preferences are modelled, we need to somehow associate the document topic information with the authors’. Jointly modelling text data with citation network information can be challenging for topic models, and the problem is compounded when author-topic relationships are also modelled.
In this paper, we propose a topic model to jointly model authors’ topic preferences, text content (abstract and publication title) and the citation network. The model is a nonparametric extension of previous models discussed in Section 2. Using simple assumptions and approximations, we derive a novel algorithm that allows the probability vectors in the model to be integrated out. This yields a Markov chain Monte Carlo (MCMC) inference via discrete sampling.
As an extension of our previous work (LimBuntineCNTM), we propose a supervised approach to improve document clustering, by making use of categorical information that is available. Our method allows the level of supervision to be adjusted through a variable, giving a model that can be unsupervised, semi-supervised or fully supervised. Additionally, we present a more extensive qualitative analysis of the learned topic models, and display a visualisation snapshot of the learned author-topics network. We also perform additional diagnostic tests to assess our proposed topic model. For example, we study the convergence of the proposed learning algorithm and report on the computational complexity of the algorithm.
In the next section, we discuss the related work. Sections 3, 4 and 5 detail our topic model and its inference algorithm. We describe the datasets in Section 6 and report on experiments in Section 7. Applying our model on research publication data, we demonstrate the model’s improved performance, on both model fitting and a clustering task, compared to several baselines. Additionally, in Section 8, we qualitatively analyse the inference results produced by our model. We find that the learned topics have high comprehensibility, and we present a visualisation snapshot of the learned topic models. Finally, we perform diagnostic assessment of the topic model in Section 9 and conclude the paper in Section 10.
2 Related Work
Latent Dirichlet Allocation (LDA) (Blei:2003:LDA:944919.944937) is the simplest Bayesian topic model used in modelling text, which also allows easy learning of the model. TehJor2010a proposed the Hierarchical Dirichlet process (HDP) LDA, which utilises the Dirichlet process (DP) as a nonparametric prior which allows a non-symmetric, arbitrary dimensional topic prior to be used. Furthermore, one can replace the Dirichlet prior on the word vectors with the Pitman-Yor Process (PYP, also known as the two-parameter Poisson Dirichlet process) (Teh:2006:HBL:1220175.1220299), which models the power-law of word frequency distributions in natural language (Goldwater:2011:PPD:1953048.2021075), yielding significant improvement (Sato:2010:TMP:1835804.1835890).
Variants of LDA allow incorporating more aspects of a particular task and here we consider authorship and citation information. The author-topic model (ATM) (Rosen-Zvi:2004:AMA:1036843.1036902) uses the authorship information to restrict topic options based on author. Some recent work jointly models the document citation network and text content. This includes the relational topic model (chang2010hierarchical), the Poisson mixed-topic link model (PMTLM) (ZhuYGM:2013) and Link-PLSA-LDA (Nallapati:2008:JLT:1401890.1401957). An extensive review of these models can be found in ZhuYGM:2013. The Citation Author Topic (CAT) model (Tu:2010:CAT:1944566.1944711) models the author-author network on publications based on citations using an extension of the ATM. Note that our work differs from CAT in that we model the author-document-citation network instead of the author-author network.
The Topic-Link LDA (Liu:2009:TLJ:1553374.1553460) jointly models author and text by using the distance between the document and author topic vectors. Similarly the Twitter-Network topic model (Lim2013Twitter) models the author network (here, the Twitter follower network) based on author topic distributions, but using a Gaussian process to model the network. Note that, in contrast to Liu:2009:TLJ:1553374.1553460, our work considers the author-document-citation network. We use the PMTLM of ZhuYGM:2013 to model the network, which lets one integrate PYP hierarchies with the PMTLM using efficient MCMC sampling.
There is also existing work on analysing the degree of authors’ influence. On publication data, Kataria:2011:CST:2283696.2283777 and Mimno:2007:MDL:1255175.1255196 analyse influential authors with topic models, while Weng:2010:TFT:1718487.1718520, Tang:2009:SIA:1557019.1557108, and Liu:2010:MTI:1871437.1871467 use topic models to analyse users’ influence on social media.
3 Supervised Citation Network Topic Model
In our previous work (LimBuntineCNTM), we proposed the Citation Network Topic Model (CNTM) that jointly models the text, authors, and the citation network of research publications (documents). The CNTM allows us to model both the authors and text better by exploiting the correlation between the authors and their research topics. However, the benefit of the above modelling is not realised when the author information is simply missing from the data. This could be due to errors in data collection (e.g. metadata not properly formatted), or simply because the author information was lost during preprocessing.
In this section, we propose an extension of the CNTM that remedies the above issue, by making use of additional metadata that is available. For example, the metadata could be the research areas or keywords associated with the publications, which are usually provided by the authors during the publication submission. However, this information might not always be reliable as it is not standardised across different publishers or conferences. In this paper, rather than using the mentioned metadata, we will instead incorporate the categorical labels that were previously used as ground truth for evaluation. As such, our extension gives rise to a supervised model, which we will call the Supervised Citation Network Topic Model (SCNTM).
We first describe the topic model part of SCNTM for which the citations are not considered; this variant will be used for comparison later in Section 7. We then complete the SCNTM with the discussion on its network component. The full graphical model for SCNTM is displayed in Figure 1.
To clarify the notation used in this paper, variables without a subscript represent a collection of variables of the same notation. For instance, represents all the words in document , that is, where is the number of words in document ; and represents all words in a corpus, , where is the number of documents.
3.1 Hierarchical Pitman-Yor Topic Model
The SCNTM uses both the Griffiths-Engen-McCloskey (GEM) distribution (pitman1996some) and the Pitman-Yor process (PYP) (Teh:2006:HBL:1220175.1220299) to generate probability vectors. Both the GEM distribution and the PYP are parameterised by a discount parameter and a concentration parameter . The PYP is additionally parameterised by a base distribution , which is also the mean of the PYP when it can be represented by a probability vector. Note that the base distribution can also be a PYP. This gives rise to the hierarchical Pitman-Yor process (HPYP).
In modelling authorship, the SCNTM modifies the approach of the author-topic model (Rosen-Zvi:2004:AMA:1036843.1036902), which assumes that the words in a publication are equally attributed to the different authors. This is not reflected in practice, since publications are often written mostly by the first author, except when the author order is alphabetical. Thus, we assume that the first author is dominant and attribute all the words in a publication to the first author. Although we could model the contribution of each author to a publication by, say, using a Dirichlet distribution, we found that considering only the first author gives a simpler learning algorithm and cleaner results.
The generative process of the topic model component of the SCNTM is as follows. We first sample a root topic distribution with a GEM distribution to act as a base distribution for the author-topic distributions for each author , and also for the category-topic distributions for each category :
Here, represents the set of all authors while denotes the set of all categorical labels in the text corpus. Note we have used the same symbol () for both the author-topic distributions and the category-topic distributions.
We introduce a parameter called the author threshold which controls the level of supervision used by SCNTM. We say an author is significant if the author has produced more than or equal to publications, i.e.
Here, represents the author for document , and is the indicator function that evaluates to if is true, else .
Next, for each document in a publication collection of size , we sample the document-topic prior from or depending on whether the author for the document is significant:
where is the categorical label associated with document . For the sake of notational simplicity, we introduce a variable to capture both the author and the category. We let take the value of for each author in , and take the value of for the categories in . Note that . Thus, we can also write the distribution of as
where if , else .
By modelling this way, we are able to handle missing authors and incorporate supervision into the SCNTM. For example, choosing allows us to make use of the categorical information for documents that have no valid author. Alternatively, we could select a higher , which smooths out the document-topic distributions for documents that are written by authors who have authored only a small number of publications. This treatment leads to a better clustering result, as these authors are usually not discriminative enough for prediction. At the extreme, we can set to achieve full supervision. We note that the SCNTM reverts to the CNTM when , in which case the model is not supervised.
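As a concrete illustration, the choice between an author-topic and a category-topic prior under a given author threshold can be sketched as follows (a minimal sketch; the function and variable names are ours, not the paper's):

```python
from collections import Counter

def topic_prior_source(first_authors, categories, threshold):
    """For each document, pick the source of its document-topic prior:
    the first author's topic distribution if that author is 'significant'
    (has at least `threshold` publications as first author), otherwise
    the category-topic distribution for the document's label."""
    n_pubs = Counter(first_authors)  # publications per first author
    return [('author', a) if n_pubs[a] >= threshold else ('category', c)
            for a, c in zip(first_authors, categories)]
```

With a threshold of zero every author is significant and the model reverts to the unsupervised CNTM; a very large threshold makes every document fall back to its category label, giving full supervision.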
We then sample the document-topic distribution given :
Note that instead of modelling a single document-topic distribution, we model a document-topic hierarchy with and . The primed represents the topics of the document in the context of the citation network. The unprimed represents the topics of the text, naturally related to but not the same. Such modelling gives citation information a higher impact, to take into account the relatively low number of citations compared to the amount of text. The technical details on the effect of such modelling are presented in Section 9.2.
For the vocabulary side, we generate a background word distribution given , a discrete uniform vector of length , i.e. . is the set of distinct word tokens observed in a corpus. Then, we sample a topic-word distribution for each topic , with as the base distribution:
Modelling word burstiness (Buntine:2014:ENT:2623330.2623691) is important since words in a document are likely to repeat in the document. The same applies to publication abstracts, as shown in Section 6. To address this property, we make the topics bursty so each document only focuses on a subset of words in the topic. This is achieved by defining the document-specific topic-word distribution for each topic in document as:
Finally, for each word in document , we sample the corresponding topic assignment from the document-topic distribution ; while the word is sampled from the topic-word distribution given :
Note that includes words from the publications’ title and abstract, but not the full article. This is because the title and abstract provide a good summary of a publication’s topics and are thus better suited for topic modelling, while the full article contains too much technical detail that may be less relevant.
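The generative process above can be sketched with finite symmetric Dirichlets standing in for the GEM/PYP hierarchy. This is only an approximation of the paper's nonparametric priors, and it omits the authorship and burstiness components; all names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_corpus(n_docs, n_topics, vocab_size, doc_len,
                    alpha=0.5, beta=0.1):
    """Minimal sketch of the SCNTM text component using finite symmetric
    Dirichlets in place of the GEM/PYP hierarchy (an approximation)."""
    # topic-word distributions, one per topic
    topic_word = rng.dirichlet([beta] * vocab_size, size=n_topics)
    docs = []
    for _ in range(n_docs):
        theta = rng.dirichlet([alpha] * n_topics)        # document-topic dist.
        z = rng.choice(n_topics, size=doc_len, p=theta)  # topic per word
        w = [rng.choice(vocab_size, p=topic_word[k]) for k in z]
        docs.append((z, w))
    return docs
```

The hierarchy matters in the real model because the author-topic or category-topic distribution acts as the base measure for the document-topic prior, which a flat Dirichlet cannot express.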
In the next section, we describe the modelling of the citation network accompanying a publication collection. This completes the SCNTM.
3.2 Citation Network Poisson Model
To model the citation network between publications, we assume that the citations are generated conditioned on the topic distributions of the publications. Our approach is motivated by the degree-corrected variant of PMTLM (ZhuYGM:2013). Denoting as the number of times document cites document , we model
with a Poisson distribution with mean parameter:
Here, is the propensity of document to cite and represents the popularity of cited document , while scales the -th topic, effectively penalising common topics and strengthening rare topics. Hence, a citation from document to document is more likely when these documents have relevant topics in common. Due to the limitation of the data, the can only be or , i.e. it is a Boolean variable. Nevertheless, the Poisson distribution is used instead of a Bernoulli distribution because it leads to dramatically reduced complexity in analysis (ZhuYGM:2013). Note that the Poisson distribution is similar to the Bernoulli distribution when the mean parameter is small. We present a list of variables associated with the SCNTM in Table 1.
Table 1. Variables of the SCNTM:

Topic: topical label for word .
Word: observed word or phrase at position in document .
Citations: number of times document cites document .
Author: author for document .
Category: category label for document .
Document-topic-word distribution: probability distribution in generating words given document and topic .
Topic-word distribution: word prior for .
Document-topic distribution: probability distribution in generating topics for document .
Document-topic prior: topic prior for .
Author/category-topic distribution: probability distribution in generating topics for author or category .
Global word distribution: word prior for .
Global topic distribution: topic prior for .
Discount: discount parameter of the PYP .
Concentration: concentration parameter of the PYP .
Base distribution: base distribution of the PYP .
Rate: rate parameter or the mean for .
Cite propensity: propensity to cite for document .
Cited propensity: propensity to be cited for document .
Scaling factor: citation scaling factor for topic .
4 Model Representation and Posterior Likelihood
Before presenting the posterior used to develop the MCMC sampler, we briefly review the handling of hierarchical PYP models in Section 4.1. We cannot provide a fully detailed review in this paper, so we present only the main ideas.
4.1 Modelling with Hierarchical PYPs
The key to efficient sampling with PYPs is to marginalise out the probability vectors (e.g. topic distributions) in the model and record various associated counts instead, thus yielding a collapsed sampler. While a common approach here is to use the hierarchical Chinese Restaurant Process (CRP) of TehJor2010a, we use another representation that requires no dynamic memory and has better inference efficiency (Chen:2011:STC:2034063.2034095).
We denote as the marginalised likelihood associated with the probability vector . Since the vector is marginalised out, the marginalised likelihood is in terms of — using the CRP terminology — the customer counts and the table counts . The customer count corresponds to the number of data points (e.g. words) assigned to group (e.g. topic) for variable . Here, the table counts represent the subset of that gets passed up the hierarchy (as customers for the parent probability vector of ). Thus , and if and only if since the counts are non-negative. We also denote as the total customer counts for node , and similarly, is the total table counts. The marginalised likelihood , in terms of and , is given as
is the generalised Stirling number that is easily tabulated; both and denote the Pochhammer symbol (rising factorial), see 2012arXiv1007.0296B for details. Note the GEM distribution behaves like a PYP in which the table count is always for non-zero .
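The generalised Stirling numbers are tabulated with a simple recurrence, S_a(n, t) = S_a(n-1, t-1) + (n-1 - t a) S_a(n-1, t) with S_a(0, 0) = 1, which reduces to the unsigned Stirling numbers of the first kind when the discount a is zero. A plain-value sketch (in practice the cache is kept in log space to avoid overflow):

```python
def stirling_table(n_max, discount):
    """Table of generalised Stirling numbers S_a(n, t) via the standard
    recurrence used for collapsed PYP samplers. S[n][t] is valid for
    0 <= t <= n <= n_max; entries outside that range stay zero."""
    S = [[0.0] * (n_max + 1) for _ in range(n_max + 1)]
    S[0][0] = 1.0
    for n in range(1, n_max + 1):
        for t in range(1, n + 1):
            S[n][t] = S[n - 1][t - 1] + (n - 1 - t * discount) * S[n - 1][t]
    return S
```

With discount zero the entries match the unsigned Stirling numbers of the first kind, e.g. S(3, 2) = 3 and S(4, 2) = 11, which gives a quick sanity check on the cache.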
The innovation of Chen:2011:STC:2034063.2034095 was to notice that sampling with Equation 14 directly led to poor performance. The problem was that, when sampling an assignment to a latent variable, say moving a customer from group to (so decreases by and increases by ), the potential effect on and could not immediately be measured. In contrast, the hierarchical CRP automatically includes table configurations in its sampling process and thus includes the influence of the hierarchy in the sampling; sampling directly with Equation 14 therefore led to comparatively poor mixing. As a solution, Chen:2011:STC:2034063.2034095 developed a collapsed version of the hierarchical CRP following the well known practice of Rao-Blackwellisation of sampling schemes (casella1996rao), which, while not as fast per step, has two distinct advantages: (1) it requires no dynamic memory, and (2) the sampling has significantly lower variance and so converges much faster. This has empirically been shown to lead to better mixing of the samplers (Chen:2011:STC:2034063.2034095) and has been confirmed on different complex topic models (Buntine:2014:ENT:2623330.2623691).
The technique for collapsing the hierarchical CRP uses Equation 14 but the counts () are now derived variables. They are derived from Boolean variables associated with each data point. The technique comprises the following conceptual steps: (1) add Boolean indicators to the data from which the counts and can be derived, (2) modify the marginalised posterior accordingly, and (3) derive a sampler for the model.
4.1.1 Adding Boolean indicators
We first consider , which has a “+1” contributed to for every in document , hence . We now introduce a new Bernoulli indicator variable associated with , which is “on” (or ) when the data also contributed a “+1” to . Note that , so every data point contributing a “+1” to may or may not contribute a “+1” to . The result is that one derives .
Now consider the parent of , which is . Its customer count is derived as . Its table count can now be treated similarly. Those data that contribute a “+1” to (and thus ) have a new Bernoulli indicator variable , which is used to derive , as before. Note that if then necessarily .
Similarly, one can define Boolean indicators for , , , , and to have a full suite from which all the counts and are now derived. We denote , as the collection of the Boolean indicators for data (, ).
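The derivation of counts from indicators at a single node can be sketched as follows (a simplified illustration; in the real model the indicators live on a hierarchy of nodes, and all names are ours):

```python
def derive_counts(topic_assignments, table_indicators):
    """Given per-word topic assignments at one node and a Boolean
    indicator per word saying whether that word also contributed a
    table (a '+1' passed up to the parent), derive the customer
    counts and table counts for the node. A valid configuration
    additionally needs t_k > 0 whenever c_k > 0 (not enforced here)."""
    n_topics = max(topic_assignments) + 1
    customers = [0] * n_topics
    tables = [0] * n_topics
    for k, u in zip(topic_assignments, table_indicators):
        customers[k] += 1
        if u:
            tables[k] += 1
    # the table counts become the customer counts at the parent node
    parent_customers = tables
    return customers, tables, parent_customers
```

This makes the constraint t_k <= c_k immediate: a table can only be contributed by a data point that is already a customer.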
4.1.2 Probability of Boolean indicators
By symmetry, if there are Boolean indicators “on” (out of ), we are indifferent as to which is on. Thus the indicator variable is not stored, that is, we simply “forget” who contributed a table count and re-sample as needed:
Moreover, this means that the marginalised likelihood of Equation 14 is extended to include the probability of , which is written in terms of , and as:
4.2 Likelihood for the Hierarchical PYP Topic Model
We use bold face capital letters to denote the set of all relevant lower case variables. For example, denotes the set of all topic assignments. Variables , , and are similarly defined, that is, they denote the set of all words, table counts, customer counts, and Boolean indicators respectively. Additionally, we denote
as the set of all hyperparameters (such as the ’s). With the probability vectors replaced by the counts, the likelihood of the topic model can be written — in terms of as given in Equation 16 — as
Note that the last term in Equation 17 corresponds to the parent probability vector of (see Section 3.1), and indexes the unique word tokens in vocabulary set . Note that the extra terms for are simply derived using Equation 16 and not stored in the model. So in the discussions below we will usually represent implicitly by and , and introduce the when explicitly needed.
Note that even though the probability vectors are integrated out and not explicitly stored, they can easily be estimated from the associated counts. The probability vector can be estimated from its posterior mean given the counts and parent probability vector :
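The estimate follows the standard PYP predictive rule: discounted customer counts plus a back-off to the parent probability vector (a sketch; this is the usual form of the posterior mean, which is the role Equation 18 plays here):

```python
def pyp_posterior_mean(customers, tables, discount, conc, parent_probs):
    """Posterior mean estimate of a marginalised PYP probability vector
    from its customer counts c_k, table counts t_k, and the parent's
    probability vector: (c_k - a*t_k + (b + a*T)*parent_k) / (b + C)."""
    C = sum(customers)
    T = sum(tables)
    return [
        (c - discount * t + (conc + discount * T) * p) / (conc + C)
        for c, t, p in zip(customers, tables, parent_probs)
    ]
```

Because the back-off weight is exactly the mass not claimed by the discounted counts, the estimate is a proper probability vector whenever the parent's is.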
4.3 Likelihood for the Citation Network Poisson Model
For the citation network, the Poisson likelihood for each is given as
Note that the term is dropped in Equation 19 due to the limitation of the data that , thus is evaluated to . With conditional independence of , the joint likelihood for the whole citation network can be written as
where is the number of citations for publication , , and is the number of times publication is cited, . We also make a simplifying assumption that for all documents , that is, all publications are treated as self-cited. This assumption is important since defining allows us to rewrite the joint likelihood into Equation 20, which leads to a cleaner learning algorithm that utilises efficient caching. Note that if we do not define , we have to explicitly consider the case when in Equation 20, which results in messier summations and products.
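The per-document citation counts under the self-citation convention can be computed as follows (names are ours):

```python
def citation_degree_counts(cite_pairs, n_docs):
    """Number of citations made and received per document, under the
    paper's simplifying convention that every document cites itself
    (x_ii = 1), so each count starts at one."""
    made = [1] * n_docs      # self-citation included
    received = [1] * n_docs
    for i, j in cite_pairs:
        if i != j:
            made[i] += 1
            received[j] += 1
    return made, received
```

Starting both counts at one is exactly the effect of treating every publication as self-cited, which is what lets the likelihood factorise uniformly over document pairs.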
Note the likelihood in Equation 20 contains the document-topic distribution in vector form. This is problematic as performing inference with the likelihood requires the probability vectors , and to be stored explicitly (instead of counts as discussed in Section 4.1). To overcome this issue, we propose a novel representation that allows the probability vectors to remain integrated out. Such representation also leads to an efficient sampling algorithm for the citation network, as we will see in Section 5.
We introduce an auxiliary variable , named the citing topic, to denote the topic that prompts publication to cite publication . To illustrate, for a biology publication that cites a machine learning
publication for the learning technique, the citing topic would be ‘machine learning’ instead of ‘biology’. From Equation 13, we model the citing topic as jointly Poisson with :
Incorporating , the set of all , we rewrite the citation network likelihood as
where is the number of connections publication made due to topic .
To integrate out , we note the term appears like a multinomial likelihood, so we absorb them into the likelihood for where they correspond to additional counts for , with added to . To disambiguate the source of the counts, we will refer to these customer counts contributed by as network counts, and denote the augmented counts ( plus network counts) as . For the exponential term, we use the delta method (delta_method) to approximate , where is the expected value according to a distribution proportional to . This approximation is reasonable as long as the terms in the exponential are small (see Appendix A). The approximate full posterior of SCNTM can then be written as
where . We note that is the same as Equation 17 but now with instead of .
In the next section, we demonstrate that our model representation gives rise to an intuitive sampling algorithm for learning the model. We also show how the Poisson model integrates into the topic modelling framework.
5 Inference Techniques
Here, we derive the Markov chain Monte Carlo (MCMC) algorithms for learning the SCNTM. We first describe the sampler for the topic model and then for the citation network. The full inference procedure is performed by alternating between the two samplers. Finally, we outline the hyperparameter samplers that are used to estimate the hyperparameters automatically.
5.1 Sampling for the Hierarchical PYP Topic Model
To sample the words’ topic and the associated counts and in the SCNTM, we design a Metropolis-Hastings (MH) algorithm based on the collapsed Gibbs sampler designed for the PYP (Chen:2011:STC:2034063.2034095). The concept of the MH sampler is analogous to LDA, which consists of (1) decrementing the counts associated with a word, (2) sampling the respective new topic assignment for the word, and (3) incrementing the associated counts. However, our sampler is more complicated than LDA. In particular, we have to consider the indicators described in Section 4.1 operating on the hierarchy of PYPs. Our MH sampler consists of two steps. First we sample the latent topic associated with the word . We then sample the customer counts and table counts .
The sampler proceeds by considering the latent variables associated with a given word . First, we decrement the counts associated with the word and the latent topic . This is achieved by sampling the suite of indicators according to Equation 15 and decrementing the relevant customer counts and table counts. For example, we decrement by 1 if . After decrementing, we apply a Gibbs sampler to sample a new topic from its conditional posterior distribution, given as
Note that the joint distribution inEquation 24 can be written as the ratio of the likelihood for the topic model (Equation 17):
Here, the superscript indicates that the topic , indicators and the associated counts for word are not observed in the respective sets, i.e. the state after decrement. Additionally, we use the superscripts and to denote the proposed sample and the old value respectively. The modularised likelihood of Equation 17 allows the conditional posterior (Equation 24) to be computed easily, since it simplifies to ratios of likelihood , which simplifies further since the counts differ by at most during sampling. For instance, the ratio of the Pochhammer symbols, , simplifies to , while the ratio of Stirling numbers, such as , can be computed quickly via caching (2012arXiv1007.0296B).
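The Pochhammer ratio simplification can be verified directly: since the generalised Pochhammer symbol is a product of T factors, incrementing T just multiplies in one more factor, so the ratio is a single term (a small illustrative sketch):

```python
def poch(b, a, T):
    """Generalised Pochhammer symbol (b|a)_T = b(b+a)...(b+(T-1)a)."""
    out = 1.0
    for i in range(T):
        out *= b + i * a
    return out

b, a, T = 0.8, 0.5, 4
ratio = poch(b, a, T + 1) / poch(b, a, T)
# ratio equals b + T*a, so the sampler never forms the full products
```

This is why the likelihood ratios in the sampler are cheap: only the single changed factor (and a cached ratio of Stirling numbers) is ever evaluated.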
Next, we proceed to sample the relevant customer counts and table counts given the new . We propose an MH algorithm for this. We define the proposal distribution for the new customer counts and table counts as
Here, the potential sample space for and are restricted to just and where is either or . Doing so allows us to avoid considering the exponentially many possibilities of and . The acceptance probability associated with the newly sampled and is
Thus we always accept the proposed sample. (We call this an MH algorithm rather than Gibbs sampling because the sample space for the counts is restricted, so we are not sampling from the posterior directly.) Note that since is GEM distributed, incrementing is equivalent to sampling a new topic, i.e. the number of topics increases by .
5.2 Sampling for the Citation Network
For the citation network, we propose another MH algorithm. The MH algorithm can be summarised in three steps: (1) estimate the document-topic prior , (2) propose a new citing topic , and (3) accept or reject the proposed following an MH scheme. Note that the MH algorithm is similar to the sampler for the topic model, where we decrement the counts, sample a new state and update the counts. Since all probability vectors are represented as counts, we do not need to deal with their vector form. Additionally, our MH algorithm is intuitive and simple to implement. Like the words in a document, each citation is assigned a topic, hence the words and citations can be thought of as voting to determine a document’s topics.
We describe our MH algorithm for the citation network as follows. First, for each document , we estimate the expected document-topic prior from Equation 18. Then, for each document pair where , we decrement the network counts associated with , and re-sample with a proposal distribution derived from Equation 21:
which can be further simplified since the terms inside the exponential are very small, so the exponential term is approximately . We empirically inspected the exponential terms and found that almost all of them lie between and . This means the ratio of the exponentials is not significant when sampling a new citing topic , so we ignore the exponential term and let
We compute the acceptance probability for the newly sampled , changed from , and the successive change to the document-topic priors (from to ):
Note that we have abused notation for and in the above equation, where the and in the summation index all documents rather than the particular documents and . We refrain from introducing additional variables to keep the notation uncluttered.
Finally, if the sample is accepted, we update and the associated customer counts. Otherwise, we discard the sample and revert the changes.
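The citing-topic proposal itself is a simple discrete draw with weight proportional to the topic scaling times the two documents' topic prior entries, once the near-unity exponential term is dropped (a sketch; names are ours):

```python
import random

def propose_citing_topic(nu, theta_i, theta_j, rng=None):
    """Draw a proposal for the citing topic z_ij with probability
    proportional to nu[k] * theta_i[k] * theta_j[k], i.e. the joint
    Poisson form with the exponential term dropped."""
    rng = rng or random.Random(0)
    weights = [n * a * b for n, a, b in zip(nu, theta_i, theta_j)]
    total = sum(weights)
    r = rng.random() * total
    acc = 0.0
    for k, w in enumerate(weights):
        acc += w
        if r < acc:
            return k
    return len(weights) - 1
```

The proposal is then accepted or rejected with the MH ratio above, which corrects for both the dropped exponential and the change it induces in the document-topic priors.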
5.3 Hyperparameter Sampling
Hyperparameter sampling for the priors is important (WallachPrior2009). In our inference algorithm, we sample the concentration parameters of all PYPs with an auxiliary variable sampler (Teh06abayesian), but leave the discount parameters fixed. We do not sample the discount parameters due to their coupling with the cache of Stirling numbers.
Here we outline the procedure to sample the concentration parameter of a PYP distributed variable , using an auxiliary variable sampler. Assuming each concentration parameter has a Gamma hyperprior with given shape and rate, we first sample the auxiliary variables and for :
We then sample a new from the following conditional posterior given the auxiliary variables:
In addition to the PYP hyperparameters, we also sample , and with a Gibbs sampler. We let the hyperpriors for , and be Gamma distributed with shape and rate . With the conjugate Gamma prior, the posteriors for , and are also Gamma distributed, so they can be sampled directly.
We apply vague priors to the hyperpriors by setting .
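A single-node paraphrase of the auxiliary-variable update for a PYP concentration parameter, after the scheme of Teh (2006), can be sketched as follows. This is our simplified reading, not the paper's exact sampler; in the full model the Bernoulli auxiliaries are accumulated across all nodes sharing the concentration parameter:

```python
import math
import random

def sample_concentration(conc, discount, n_customers, n_tables,
                         shape=1.0, rate=1.0, rng=None):
    """One auxiliary-variable update for a PYP concentration at one
    node: draw a Beta and a set of Bernoulli auxiliaries, then
    resample the concentration from its conditional Gamma posterior.
    Assumes a Gamma(shape, rate) hyperprior; needs n_customers >= 2."""
    rng = rng or random.Random(0)
    x = rng.betavariate(conc + 1.0, n_customers - 1.0)
    y = sum(1 for i in range(1, n_tables)
            if rng.random() < conc / (conc + discount * i))
    # gammavariate takes (shape, scale); the rate gains -log(x)
    return rng.gammavariate(shape + y, 1.0 / (rate - math.log(x)))
```

Iterating this update over the PYP nodes, interleaved with the topic and citation samplers, completes the inference procedure.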