Joint Modeling of Topics, Citations, and Topical Authority in Academic Corpora

06/02/2017 ∙ by Jooyeon Kim, et al. ∙ KAIST (Department of Mathematical Sciences) ∙ Australian National University

Much of scientific progress stems from previously published findings, but searching through the vast sea of scientific publications is difficult. We often rely on metrics of scholarly authority to find the prominent authors, but these authority indices do not differentiate authority based on research topics. We present Latent Topical-Authority Indexing (LTAI) for jointly modeling the topics, citations, and topical authority in a corpus of academic papers. Compared to previous models, LTAI differs in two main aspects. First, it explicitly models the generative process of the citations, rather than treating the citations as given. Second, it models each author's influence on citations of a paper based on the topics of the cited papers, as well as the citing papers. We fit LTAI to four academic corpora: CORA, Arxiv Physics, PNAS, and Citeseer. We compare the performance of LTAI against various baselines, from latent Dirichlet allocation to more advanced models, including the author-link topic model and the dynamic author-citation topic model. The results show that LTAI achieves improved accuracy over other similar models when predicting words, citations, and authors of publications.


1 Introduction

Figure 1: Overview of Latent Topical Authority Indexing (LTAI). Based on content, citation, and authorship information (top), the LTAI discovers the topical authority of authors; an author's authority in a topic increases when a paper on that topic gets cited (bottom). The topical authority examples are results of the LTAI on the CORA dataset with 100 topics.

With a corpus of scientific literature, we can observe the complex and intricate process of scientific progress. We can learn the major topics in journal articles and conference proceedings, follow authors who are prolific and influential, and find papers that are highly cited. The huge number of publications and authors, however, makes it practically impossible to attain any deep or detailed understanding beyond the very broad trends. For example, if we want to identify authors who are particularly influential in a specific research field, it is difficult to do so without the aid of automatic analysis.

Online publication archives, such as Google Scholar, provide near real-time metrics of scholarly impact, such as the h-index [Hirsch2005], the journal impact factor [Garfield2006], and citation count. Those indices, however, are still at a coarse level of granularity. For example, both Michael Jordan and Richard Sutton are researchers with very high citation counts and h-indices, but they are authoritative in different topics: Jordan in the more general machine learning topic of statistical learning, and Sutton in the topic of reinforcement learning. It would be much more helpful to know that via topical authority scores, as shown in Figure 1.

Fortunately, various academic publication archives contain the full contents, references, and meta-data including titles, venues, and authors. With such data, we can build and fit a model to partition researchers' scholarly domains into topics at a much finer grain and discover their academic authority within each topic. To do that, we propose a model named Latent Topical-Authority Indexing (LTAI), based on latent Dirichlet allocation, to jointly model the topics, authors' topical authority, and citations among the publications.

We illustrate the modeling power of the LTAI with four corpora encompassing a diverse set of academic fields: CORA, Arxiv Physics, PNAS, and Citeseer. To show the improvements over other related models, we carry out prediction tasks on word, citation and authorship using the LTAI and compare the results with those of latent Dirichlet allocation [Blei et al.2003], relational topic model [Chang and Blei2010], author-link topic model, and dynamic author-cite topic model [Kataria et al.2011], as well as simple baselines of topical h-index. The results show that the LTAI outperforms these other models for all prediction tasks.

The rest of this paper is organized as follows. In section 2, we describe related work, including models that are most similar to the LTAI, and describe how the LTAI fits in and contributes to the field. In section 3, we describe the LTAI model in detail and present the generative process. In section 4, we explain the algorithm for approximate inference, and in section 5, we present a faster algorithm for scalability. In section 6, we describe the experimental setup and in section 7, we present the results to show that the LTAI performs better than other related models for word, citation and authorship prediction.

2 Related Work

In this section, we review related papers, first in the field of NLP and ML-based analysis of scientific corpora, then the approaches based on Bayesian topic models for academic corpora, and lastly joint models of topics, authors, and citations. In analyzing scientific corpora, previous research includes classifying scientific publications [Caragea et al.2015], recommending yet unlinked citations [Huang et al.2015, Neiswanger et al.2014, Wang et al.2015, Jiang2015], summarizing and extracting key phrases [Cohan and Goharian2015, Caragea et al.2014], improving model fit [He et al.2015], incorporating authorship information to increase content and link predictability [Sim et al.2015], estimating a paper's potential influence on the academic community [Dong et al.2015], and finding and classifying different functions of citation practices [Moravcsik and Murugesan1975, Teufel et al.2006, Valenzuela et al.2015].

Several variants of topic modeling consider the relationship between topics and citations in academic corpora. Topic models that use text and citation networks are divided into two types: (a) models that generate text given the citation network [Dietz et al.2007, Foulds and Smyth2013] and (b) models that generate the citation network given text [Nallapati et al.2008, Liu et al.2009, Chang and Blei2010]. While our model falls into the latter category, we also take into account the influence of the authors on the citation structure.

Most closely related to the LTAI are the citation author topic model [Tu et al.2010], the author-link topic model, and the dynamic author-cite topic model [Kataria et al.2011]. Similar to the LTAI, they are designed to capture the influence of the authors. However, these models infer authority by referencing only the citing papers' text, while our authority measure is based on predictive modeling that compares both the citing and the cited papers. Furthermore, the LTAI defines a generative model of citations and publications by introducing a latent authority index, whereas the previous models assume the citation structure is given. The LTAI thus explicitly gives a topical authority index, which directly answers the question of which author increases the probability of a paper being cited.

3 Latent Topical-Authority Indexing

The LTAI models the complex relationships among the topics of publications, the topical authority of the authors, and the citations among these publications. The generative process of the LTAI can be divided into two parts: content generation and citation network generation. We make several assumptions in the LTAI to model the citation structure of academic corpora. First, we assume a citation is more likely to occur between two papers that are similar in their topic proportions. Second, we assume that an author's authority (i.e., potential to induce citations) differs across topics, and that an author's topical authority positively correlates with the probability of citation among publications. Also, in the LTAI, when a cited publication has multiple authors, their contributions to forming citations with respect to different citing papers vary according to their topical authority. Lastly, we assign different concentration parameters to pairs of papers with and without citations. In this paper, we use the terms positive and negative links to denote pairs of papers with and without citations, respectively.

Figure 2 illustrates the graphical model of the LTAI, and we summarize the generative process of the LTAI, where the variables of the model are explained in the remainder of this section, as follows:

  1. For each topic k = 1, …, K, draw topic β_k ~ Dirichlet(η).

  2. For each document d:

    1. Draw topic proportion θ_d ~ Dirichlet(α).

    2. For each word n:

      1. Draw topic assignment z_{d,n} ~ Multinomial(θ_d).

      2. Draw word w_{d,n} ~ Multinomial(β_{z_{d,n}}).

  3. For each author a and topic k:

    1. Draw authority index A_{a,k} ~ N(0, σ_A²).

  4. For each document pair (d, d') with d' = 1, …, D:

    1. Draw influence proportion parameter π_{d,d'} ~ Dirichlet(γ).

    2. Draw author x_{d,d'} ~ Multinomial(π_{d,d'}).

    3. Draw link y_{d,d'} ~ N(z̄_d^T (z̄_{d'} ∘ A_{x_{d,d'}}), c_{d,d'}^{-1}).

Figure 2: Graphical representation of the LTAI. The LTAI jointly models the content-related variables (topics, topic proportions, and topic assignments) and the author- and citation-related variables (authority indices and citation links).

3.1 Content Generation

To model the content of publications, we follow the standard document generative process of latent Dirichlet allocation (LDA) [Blei et al.2003]. We also inherit the LDA notation: θ_d is the per-document topic distribution, β_k is the per-topic word distribution, z_{d,n} is the topic assignment for each word in a document where w_{d,n} is the corresponding word, and α and η are the Dirichlet parameters of θ and β, respectively.
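As a concrete illustration of this content-generation step, the following minimal sketch samples toy documents from the standard LDA process using the notation above; the function name and all parameter values are illustrative, not taken from the released implementation.

```python
import numpy as np

def generate_content(n_docs, n_topics, vocab_size, doc_len, alpha=0.1, eta=0.01, seed=0):
    """Sample toy documents from an LDA-style generative process (illustrative only)."""
    rng = np.random.default_rng(seed)
    beta = rng.dirichlet(np.full(vocab_size, eta), size=n_topics)   # per-topic word distributions
    theta = rng.dirichlet(np.full(n_topics, alpha), size=n_docs)    # per-document topic proportions
    docs = []
    for d in range(n_docs):
        z = rng.choice(n_topics, size=doc_len, p=theta[d])          # topic assignment per word
        w = np.array([rng.choice(vocab_size, p=beta[k]) for k in z])  # word drawn from its topic
        docs.append(w)
    return beta, theta, docs

beta, theta, docs = generate_content(n_docs=5, n_topics=3, vocab_size=50, doc_len=20)
```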

3.2 Citation Generation

Let y_{d,d'} be a binary-valued variable which indicates that publication d cites publication d'. We formulate a continuous variable that is a linear combination of the authority variable and the topic proportion variable to approximate y_{d,d'} by minimizing the sum of squared errors between the two variables. There is a body of research on using continuous user- and item-related variables to approximate binary variables in the field of recommender systems [Rennie and Srebro2005, Koren et al.2009].

Approximating binary variables using a linear combination of continuous variables can be probabilistically generalized [Salakhutdinov and Mnih2007]. Using probabilistic matrix factorization, we approximate the probability mass function of y_{d,d'} using a Gaussian probability density function, where the precision parameter c_{d,d'} can be set differently for each pair of papers, as discussed below.

Content Similarity Between Publications: In the LTAI, we model the relationship between a pair of documents d and d'. The probability of publication d citing publication d' is proportional to the similarity of the topic proportions of the two publications, i.e., it positively correlates with the inner product of their topic proportions. Following the relational topic model's approach [Chang and Blei2010], we use the empirical topic frequencies z̄_d, the normalized counts of the topic assignments z_{d,n}, instead of the topic proportion parameter θ_d.

Topical Authority of Cited Paper: We introduce a K-dimensional vector A_a for representing the topical authority index of author a. Each element A_{a,k} is a real number drawn from a zero-mean normal distribution with variance σ_A². Given the authority indices of author a of cited publication d', the probability of citation is further modeled through the mean z̄_d^T (z̄_{d'} ∘ A_a), where the authority indices can promote or demote the probability of citation.

Different Degree of Contribution among Multiple Authors: Academic publications are often written by more than one author. Thus, we need to distinguish the influence of each author on a citation between two publications. Let 𝒜_{d'} be the set of authors of publication d'. To measure the influence proportion of author a on the citation from d to d', we introduce the one-hot author indicator x_{d,d'}, drawn from a multinomial whose proportion parameter π_{d,d'} is drawn from a Dirichlet distribution with an |𝒜_{d'}|-dimensional parameter γ. Its element π_{d,d',a} measures the influence of author a on the citation from d to d' and sums to one over all authors of publication d' (Σ_{a∈𝒜_{d'}} π_{d,d',a} = 1). We approximate the probability of citation from publication d to publication d' by Σ_{a∈𝒜_{d'}} π_{d,d',a} N(y_{d,d'} | z̄_d^T (z̄_{d'} ∘ A_a), c_{d,d'}^{-1}), which is a mixture of normal distributions with precision parameter c_{d,d'}. Therefore, if the topic distributions of papers d and d' are similar and the authority values A_a of the cited paper's authors are high, the citation formation probability increases; on the other hand, a dissimilar or topically irrelevant pair of papers whose cited paper has less authoritative authors will be assigned a low probability of citation formation.

Different Treatment between Positive and Negative Links: Citation is a binary problem where y_{d,d'} is either one or zero. When y_{d,d'} is zero, this can be interpreted in two ways: 1) the authors of the citing publication d are unaware of the publication d', or 2) the publication d' is not relevant to publication d. Identifying which case is true is impossible unless we are the authors of the publication. Therefore, the model embraces this uncertainty in the absence of a link between publications. We control the ambiguity through the Gaussian distribution with precision parameter c_{d,d'} as follows:

c_{d,d'} = c_+ if y_{d,d'} = 1, and c_{d,d'} = c_- if y_{d,d'} = 0,    (1)

where c_+ > c_- to ensure that we have more confidence in the observed citations. This is an implicit feedback approach that permits using negative examples (y_{d,d'} = 0) from sparse observations by mitigating their importance [Hu et al.2008, Wang and Blei2011, Purushotham et al.2012]. Setting different values of the precision parameter according to y_{d,d'} induces cyclic dependencies between the two variables, and due to this cycle, the model is no longer a Bayesian network, or a directed acyclic graph. However, we note that this setting does lead to better experimental results, and we show the pragmatic benefit of the setting in the Evaluation section.
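To make the citation-generation assumptions concrete, the sketch below scores a candidate link with the ingredients described in this section: topic similarity between the two papers, element-wise modulation by the cited authors' topical authority, an author-influence mixture, and a per-pair precision that is larger for observed citations (c_pos and c_neg stand in for c_+ and c_-). All function and variable names, as well as the numeric values, are illustrative rather than taken from the released code.

```python
import numpy as np

def link_log_likelihood(zbar_citing, zbar_cited, authority, influence, y, c_pos=10.0, c_neg=0.1):
    """Log-likelihood of a (possible) citation y in {0,1} under an LTAI-style link model.

    zbar_citing, zbar_cited : (K,) empirical topic proportions of the citing/cited papers
    authority               : (A, K) topical authority vectors of the cited paper's authors
    influence               : (A,) author influence proportions (sum to one)
    """
    precision = c_pos if y == 1 else c_neg                 # more confidence in observed citations
    means = zbar_citing @ (zbar_cited * authority).T       # one predicted link strength per author
    # Mixture of Gaussians over the cited paper's authors, evaluated at the binary observation y
    densities = np.sqrt(precision / (2.0 * np.pi)) * np.exp(-0.5 * precision * (y - means) ** 2)
    return float(np.log(influence @ densities))

rng = np.random.default_rng(0)
K, A = 4, 2
zb_citing, zb_cited = rng.dirichlet(np.ones(K)), rng.dirichlet(np.ones(K))
auth = rng.normal(size=(A, K))                             # zero-mean authority indices
infl = np.array([0.7, 0.3])
print(link_log_likelihood(zb_citing, zb_cited, auth, infl, y=1))
```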

3.3 Joint Modeling of the LTAI

In the LTAI, the topics and the link structures are simultaneously learned, and thus the content-related variables and the citation-related variables mutually reshape one another during the posterior inference. On the other hand, if content and citation data are modeled separately, the topics would not reflect any information about the document citation structure. Thus, in the LTAI, documents with shared links are more likely to have similar topic distributions which leads to better model fit. We develop and explain this joint inference in section 4. In section 7, we illustrate the differences in word-level predictive powers of the LTAI and LDA.

4 Posterior Inference

We develop a hybrid inference algorithm in which the posteriors of the content-related parameters θ, z, and β are approximated by variational inference, and the author-related parameters A and x are estimated by EM. In Algorithm 1, we summarize the full inference procedure of the LTAI.

4.1 Content Parameters: Variational Update

Since computing the posterior distribution of the LTAI is intractable, we use variational inference and optimize variational parameters, each of which corresponds to an original content-related variable. Following the standard mean-field variational approach, we define fully factorized variational distributions over the topic-related latent variables, where each factorized distribution is in the same family as the corresponding original distribution. Using the variational distributions, we bound the log-likelihood of the model as follows:

(2)

where is the negative entropy of .

Taking the derivatives of this lower bound with respect to each variational parameter, we obtain the coordinate ascent updates. The updates for the variational Dirichlet parameters are the same as the standard variational updates for LDA [Blei et al.2003]. The update for the variational multinomial is:

(3)

where the gradient of expected log probabilities of both incoming link and outgoing link contribute to the variational parameter. The first expectation can be rewritten as

(4)

where is the set of authors of . We take the lower bound of the expectation using Jensen’s inequality. The last term is approximated by the first order Taylor expansion . Finally, the approximated gradient of with respect to the incoming directions to document is

(5)

where diag is a diagonalization operator and is . We can compute the gradient with respect to the outgoing directions in the same way.

Initialize , , , and randomly
Set learning-rate parameter that satisfies
Robbins-Monro condition
Set subsample sizes , , and
repeat
     Variational update: local publication parameters
      randomly sampled publications
     for  in  do
          for  to  do
                Set of random samples
               Update using Equation 4, 5, 9.
          end for
          
     end for
     
     EM update: local author parameters
      randomly sampled authors
     for  in  do
           random publication pairs
          Update using Equation 7, 10
          for  in and to  do
               
          end for
     end for
     
     Stochastic variational update
     for  to  do
          
     end for
     Set
until satisfying converge criteria
Algorithm 1 Posterior inference algorithm for the LTAI

4.2 Author Parameters: EM Step

We use the EM algorithm to update the author-related parameters A and x based on the lower bound computed by variational inference. In the E step, we compute the probability of each author's contribution to the link between documents d and d'.

(6)

In the M step, we optimize the authority parameter A_a for each author. Given the other estimated parameters, taking the gradient of the lower bound with respect to A_a and setting it to zero leads to the following update equation:

(7)

Let be the set of documents written by author and be the th document written by . Then is a vertical stack of matrices , whose th row is , the Hadamard product between and . Similarly, is a vertical stack of matrices whose th diagonal element is , and is a vertical stack of vectors whose th element is . Finally, we update .
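Under the simplifying assumption of a single author per cited paper (so the influence weights fold into the per-pair precisions), the M-step described above reduces to a weighted ridge-regression-style solve; the sketch below illustrates that simplified solve and omits the residual adjustment for co-authors, so it should be read as an approximation of the update rather than the exact Equation 7.

```python
import numpy as np

def update_authority(zbar_cited, zbar_citing_stack, y, c_weights, sigma_a=1.0):
    """Weighted ridge-regression-style solve for one author's authority vector.

    zbar_cited        : (K,) topic proportions of a paper written by the author
    zbar_citing_stack : (M, K) topic proportions of M candidate citing papers
    y                 : (M,) observed links (1 = citation, 0 = no citation)
    c_weights         : (M,) per-pair precisions (larger for observed citations)
    """
    X = zbar_citing_stack * zbar_cited                      # rows are Hadamard products
    C = np.diag(c_weights)
    K = X.shape[1]
    return np.linalg.solve(X.T @ C @ X + np.eye(K) / sigma_a**2, X.T @ C @ y)

rng = np.random.default_rng(1)
K, M = 5, 8
zbar_cited = rng.dirichlet(np.ones(K))
zbar_citing = rng.dirichlet(np.ones(K), size=M)
y = rng.integers(0, 2, size=M).astype(float)
c = np.where(y == 1, 10.0, 0.1)                             # more confidence on observed citations
print(update_authority(zbar_cited, zbar_citing, y, c))
```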

5 Faster Inference Using Stochastic Optimization

To model topical authority, the LTAI considers the linkage information. If two papers are linked by citation, the topical authority of the cited paper's authors increases, while the negative links buffer the potential noise from irrelevant topics. This algorithmic design of the LTAI results in high model complexity. To remedy this issue, we adopt the noisy gradient method from stochastic approximation [Robbins and Monro1951] and subsample negative links when updating the per-document topic variational parameter and the authority parameter A. Prior work on using subsampled negative links to reduce computational complexity includes [Raftery et al.2012]. We also elucidate how stochastic variational inference [Hoffman et al.2013] is applied in our model to update the global per-topic word variational parameter.

5.1 Updating and

Updating the topic variational parameter for a document in the variational update requires iterating over every other document and computing the gradient of the link probability. This leads to a per-document time complexity that is linear in the total number of documents.

To apply the noisy gradient method, we divide the gradient of the expected log probability of link into two parts:

(8)

where the first and second terms on the RHS are the gradient sums over the positive and negative links, respectively. Compared to the positive links, the number of negative links is on the order of the total number of documents, and thus computing the second term is computationally inefficient. However, in our model we reduce the importance of the negative links by assigning them a larger variance than the positive links, and the empirical mean over the negative links follows its Dirichlet expectation due to their large number. Therefore, we approximate the expectation of the gradient for the negative links using the noisy gradient as follows:

(9)

where the scaling factor is the number of negative links of the document divided by the subsample size used for the variational update. We randomly sample documents, compute gradients on the sampled documents, and then scale the average gradient by the number of negative links. This noisy gradient method reduces the per-document updating time complexity from linear in the number of documents to linear in the subsample size.

Now, we discuss how to approximate an author's topical authority based on Equation 7. The computational bottleneck is the matrix computation in Equation 7, whose time complexity grows with the total number of document pairs involving the author's papers. To alleviate this complexity, we once again approximate the large number of negative links using a smaller number of subsamples. Specifically, while keeping the positive-link rows intact, we approximate the negative-link rows using a smaller matrix whose number of rows equals the subsample size for the EM step. Using this approximation, we can represent the update as

(10)

with a reduced time complexity governed by the number of subsampled negative-link rows. Also, although we do not conduct a rigorous analysis of the model's performance with respect to the subsample size, we confirm that a negative-link subsample size greater than 100 does not degrade model performance in any of our experiments.
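The following sketch illustrates the subsampling idea common to both updates in this subsection: the gradient contribution of all negative links is estimated by averaging over a small random subsample and rescaling by the total number of negative links. The gradient function here is a stand-in for the link-gradient terms of the LTAI.

```python
import numpy as np

def noisy_negative_gradient(grad_fn, neg_indices, n_samples, rng):
    """Estimate the sum of grad_fn(j) over all negative links from a rescaled subsample."""
    sample = rng.choice(neg_indices, size=min(n_samples, len(neg_indices)), replace=False)
    avg = np.mean([grad_fn(j) for j in sample], axis=0)
    return len(neg_indices) * avg        # scale the average gradient to the full negative-link count

# Toy check: the exact sum and the noisy estimate should be close on average.
rng = np.random.default_rng(0)
values = rng.normal(size=(10_000, 3))    # stand-in for per-document gradient vectors
neg = np.arange(10_000)
exact = values.sum(axis=0)
estimate = noisy_negative_gradient(lambda j: values[j], neg, n_samples=500, rng=rng)
print(exact, estimate)
```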

Figure 3: Training time of the LTAI on the CORA dataset with stochastic and batch variational inference. Using stochastic variational inference, the per-word predictive log likelihood converges faster than with batch variational inference.
Dataset # Tokens # Documents # Authors Avg C/D Avg C/A
CORA 17,059 13,147 12,111 3.46 12.17
Arxiv-Physics 49,807 27,770 10,950 12.70 67.93
PNAS 39,664 31,054 9,862 1.57 13.18
Citeseer 21,223 4,255 6,384 1.24 4.38
Table 1: Datasets. From left to right, each column shows the number of word tokens, number of documents, number of authors, average citations per document (Avg C/D), and average citations per author (Avg C/A).

5.2 Updating

In traditional coordinate-ascent variational inference, the global variational parameter is updated infrequently because all the other local parameters need to be updated beforehand. This problem is more noticeable in the LTAI since updating the per-document parameters using the update in section 4.1 is slower than in vanilla LDA; moreover, the per-author topical authority variable is another local variable that the algorithm needs to update beforehand. With stochastic variational inference, however, the global parameters are updated after only a small portion of the local parameters are updated [Hoffman et al.2013]. Applying stochastic variational inference to the LTAI is straightforward: we calculate an intermediate topic-word variational parameter from the noisy estimate of the natural gradient with respect to the subsampled local parameters, rescaled according to the number of words per document and the minibatch size. The final global parameter for the iteration is then updated as a convex combination of the previous value and the intermediate estimate, weighted by the learning rate. Posterior inference is guaranteed to converge to a local optimum when the learning rate satisfies the Robbins-Monro conditions [Robbins and Monro1951]. In Figure 3, we confirm that stochastic variational inference is applicable to the LTAI and reduces the training time compared to the batch counterpart, while maintaining similar performance.
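A small sketch of the stochastic update for the global parameter, assuming the usual SVI form in which the current value is blended with a rescaled minibatch estimate using a Robbins-Monro learning rate; the symbols lam (global topic-word parameter) and rho (learning rate) are illustrative.

```python
import numpy as np

def svi_global_update(lam, lam_hat, t, kappa=0.7, tau=1.0):
    """Blend the current global parameter with a minibatch-based intermediate estimate."""
    rho = (t + tau) ** -kappa            # Robbins-Monro: sum(rho) = inf, sum(rho^2) < inf
    return (1.0 - rho) * lam + rho * lam_hat

rng = np.random.default_rng(0)
lam = np.ones((3, 50))                   # K x V global topic-word variational parameter
for t in range(5):
    lam_hat = np.abs(rng.normal(size=lam.shape))   # stand-in for the rescaled minibatch estimate
    lam = svi_global_update(lam, lam_hat, t)
```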

6 Experimental Settings

In this section, we introduce the four academic corpora used to fit the LTAI, describe the comparison models, and provide information about the evaluation metric and parameter settings for the LTAI. Code and datasets are available at http://uilab.kaist.ac.kr/research/TACL2017/.

6.1 Datasets

We experiment with four academic corpora: CORA [McCallum et al.2000], Arxiv-Physics [Gehrke et al.2003], the Proceedings of the National Academy of Sciences (PNAS), and Citeseer [Lu and Getoor2003]. The CORA, Arxiv-Physics, and PNAS datasets contain abstracts only, and the locations of the citations within each paper are not preserved, whereas the Citeseer dataset contains the citation locations. For CORA, Arxiv-Physics, and PNAS, we lemmatize words, remove stop words, and discard words that occur fewer than four times in the corpus. Table 1 describes the datasets in detail. Note that we obtain citation data from the entire documents, not only from the abstracts. Also, we consider within-corpus citations only, which leads to fewer than 13 citations per document on average for all corpora.

Figure 4: Word-level prediction results on (a) CORA, (b) Arxiv-Physics, (c) PNAS, and (d) Citeseer. We measure per-word log predictive probability on the four datasets; our model performs better than LDA.

6.2 Comparison Models

We compare the predictive performance of the LTAI with five other models. The comparison models have different degrees of expressive power, and each conducts certain types of prediction tasks; while RTM, ALTM, and DACTM predict citation structures, the topical h-index predicts authorship information. The baseline topic models are implemented based on the inference methods suggested in the corresponding papers; LDA, RTM, and the LTAI variants use variational inference, while ALTM and DACTM use collapsed Gibbs sampling. Finally, all implementation conditions, such as the choice of programming language and modules, are set identically except for the parts that convey each model's unique assumptions; thus, the performance differences between models are due to their modeling assumptions and different degrees of data usage, rather than implementation technicalities.

Latent Dirichlet Allocation: LDA [Blei et al.2003] discovers topics and represents each publication as a mixture of the topics. Compared to the other models, LDA uses only the content information.

LTAI-%: In LTAI-%, we remove a percentage of the actual citations and replace them with arbitrarily selected false connections. Note that the links are displaced rather than removed; if the citation links were simply removed, the LTAI and LTAI-% could not be compared fairly, as the density of the citation structure would change and each model would need different concentration values. The performance difference between the LTAI and this variant indicates that, under identical conditions, using the correct linkage information is indeed beneficial for prediction.

LTAI-C: In LTAI-C, the precision parameter has a constant value, rather than taking different values depending on whether a citation is observed, as discussed in section 3.

LTAI-SEP: LTAI-SEP has a structure identical to the LTAI, but the topic and the authority variables are learned separately. Once the topic variables are learned using vanilla LDA, the authority and citation variables are then inferred consecutively. Thus, the performance edge of the LTAI over LTAI-SEP highlights the necessity of the LTAI's joint modeling, in which topic- and authority-related variables reshape one another in an iterative fashion.

Relational Topic Model: RTM [Chang and Blei2010] jointly models content and citation, and thus the topic proportions of a pair of publications become similar if the pair is connected by a citation. Compared to the LTAI, RTM does not consider author information, the link structure has no directionality, and the model does not consider negative links.

Author-Link Topic Model: ALTM [Kataria et al.2011] is a variation of the author topic model (ATM) [Rosen-Zvi et al.2004] that models both the topical interests and the influence of authors in scientific corpora. The model uses the content of citing papers and the names of the cited authors as word tokens. ALTM outputs a per-topic author distribution that functions as an author influence index.

Dynamic Author-Citation Topic Model: DACTM [Kataria et al.2011] is an extension of ALTM that requires publication corpora that preserve sentence structure. To model author influence, DACTM selectively uses words that are close to the point where the citation appears. Among our corpora, only the Citeseer dataset preserves sentence structure.

Topical h-index: To compute the topical h-index, we separate the papers into several clusters using LDA and calculate the h-index within each cluster. The topical h-index is used for author prediction in the same manner as our model, except that the topic proportions are replaced with LDA's results and the authority indices are replaced with the topical h-index values.
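A sketch of this baseline as described: papers are grouped by their dominant LDA topic and an author's h-index is computed within each group from per-paper citation counts. The data layout is illustrative.

```python
from collections import defaultdict

def h_index(citation_counts):
    """Largest h such that the author has h papers with at least h citations each."""
    counts = sorted(citation_counts, reverse=True)
    return sum(1 for i, c in enumerate(counts, start=1) if c >= i)

def topical_h_index(papers):
    """papers: list of dicts with 'authors', 'citations', and 'dominant_topic' (from LDA)."""
    per_topic_author = defaultdict(lambda: defaultdict(list))
    for p in papers:
        for a in p["authors"]:
            per_topic_author[p["dominant_topic"]][a].append(p["citations"])
    return {t: {a: h_index(cs) for a, cs in authors.items()}
            for t, authors in per_topic_author.items()}

papers = [
    {"authors": ["A"], "citations": 10, "dominant_topic": 0},
    {"authors": ["A"], "citations": 3, "dominant_topic": 0},
    {"authors": ["A", "B"], "citations": 1, "dominant_topic": 1},
]
print(topical_h_index(papers))           # {0: {'A': 2}, 1: {'A': 1, 'B': 1}}
```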

Figure 5: Citation prediction results on (a) CORA, (b) Arxiv-Physics, (c) PNAS, and (d) Citeseer. The task is to find which paper is originally linked to a cited paper. We measure mean reciprocal rank (MRR) to evaluate model performance. In all cases, the LTAI performs better than the other methods.

6.3 Evaluation Metric and Parameter Settings

We use the mean reciprocal rank (MRR) [Voorhees1999] to measure the predictive performance of the LTAI and the comparison models. MRR is a widely used metric for evaluating link prediction tasks [Balog and De Rijke2007, Diehl et al.2007, Radlinski et al.2008, Huang et al.2015]. When a model outputs the rank of each correct answer, MRR is the inverse of the harmonic mean of those ranks, i.e., the mean of the reciprocal ranks.
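A minimal sketch of the metric: given the 1-based rank of the correct answer for each test case, MRR is the mean of the reciprocal ranks.

```python
def mean_reciprocal_rank(ranks):
    """ranks: 1-based rank of the correct answer for each test case."""
    return sum(1.0 / r for r in ranks) / len(ranks)

print(mean_reciprocal_rank([1, 2, 5]))   # (1 + 0.5 + 0.2) / 3 ≈ 0.567
```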

We report the parameter values used for evaluation. For all datasets, we set the shared hyperparameter to 1. To predict citations, we set the dataset-specific parameter to 10,000, 100, 1,000, and 10, and to predict authorship, to 1,000, 1,000, 10,000, and 1,000 for the CORA, Arxiv-Physics, PNAS, and Citeseer datasets, respectively. These values are obtained through exhaustive parameter search. The remaining hyperparameters are set to fixed values, and we fix the subsample sizes to 500. Although we do not present a thorough sensitivity analysis in this paper, we confirm that the performance of our model is robust to adjusting the parameters within a factor of 2. For a fair comparison, all parameters that the LTAI and the baseline models share are set to the same values, and parameters that belong only to the baseline models are tuned exhaustively, as for the LTAI. Finally, we note that all parameters are tuned using the training set; the test set is used only for testing.

7 Evaluation

We conduct the evaluation of the LTAI with three quantitative tasks, along with one qualitative analysis. In the first task, we check whether using the citation and authorship information in the LTAI helps increase the word-level predictive performance. In the second and third tasks, we measure the ability of the LTAI to predict missing publication-publication links and author-publication links; with these two tasks, we compare the predictive power of the LTAI with the comparison models and use MRR as the evaluation metric. Finally, we observe famous researchers' topical authority scores generated by the LTAI and investigate how these scores capture notable academic characteristics of the researchers.

Figure 6: Author prediction results on (a) CORA, (b) Arxiv-Physics, (c) PNAS, and (d) Citeseer. The task is to find who the author of a cited paper is, given all the citing papers. In all cases, the LTAI performs better than the other methods.

7.1 Word-level Prediction

In the LTAI, citation and authorship information affects the per-document topic proportions, as can be seen in the update in section 4.1. This joint modeling of content and linkage structure, compared to vanilla LDA which uses content data only, yields better performance in predicting missing words in documents. In this task, we use log predictive probability, a metric widely used for measuring model fit [Teh et al.2006, Asuncion et al.2009, Hoffman et al.2013]. For each corpus, we hold out one third of the documents as a test set, and for each test document, we use half of the words to train the per-document topic proportion and predict the probability of word occurrence for the remaining half. Specifically, the predictive probability for a held-out word is computed from the estimated topic proportion of its document and the learned topics.
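A sketch of this metric under the assumption that point estimates of the topic proportion (from the observed half of a test document) and of the topics are available; the per-word log predictive probability is then the average log probability of the held-out words under the resulting mixture over topics.

```python
import numpy as np

def per_word_log_predictive(theta_hat, beta, heldout_words):
    """Average log p(w) of held-out words under a mixture of topics.

    theta_hat     : (K,) topic proportions estimated from the observed half of the document
    beta          : (K, V) per-topic word distributions
    heldout_words : list of word ids from the held-out half
    """
    word_probs = theta_hat @ beta        # (V,) predictive distribution over the vocabulary
    return float(np.mean(np.log(word_probs[heldout_words])))

rng = np.random.default_rng(0)
K, V = 3, 100
beta = rng.dirichlet(np.ones(V), size=K)
theta_hat = rng.dirichlet(np.ones(K))
print(per_word_log_predictive(theta_hat, beta, heldout_words=[2, 17, 85]))
```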

Figure 4 illustrates the per-word log predictive probability on each corpus. We confirm that with the LTAI, the log predictive probability converges at a higher value than with LDA. Also, as we corrupt increasing portions of the link structure, the predictive performance of the LTAI gradually decreases. Thus, the LTAI's superior predictive performance is attributable to its use of correct citations rather than to algorithmic bias.

7.2 Citation Prediction

We evaluate the models' ability to predict which publication originally cites a given publication. Specifically, we randomly remove one citation from each of the documents in the test set. To predict the citation link between publications, we compute the probability that one publication cites another: given the topic proportions of the cited publication and the topical authorities of its authors, we compute which publication is most likely to cite it. Based on our model assumptions in subsection 3.2, using topical authority increases the performance of predicting the linkage structure.
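A rough sketch of this protocol: every document is scored as a potential citing paper for the held-out cited paper, and the rank of the true citing document is recorded (its reciprocal contributes to MRR). The score here collapses the author mixture to the best-scoring author, so it is a simplification of the model's citation probability, not the exact implementation.

```python
import numpy as np

def rank_of_true_citer(zbar_all, zbar_cited, authority_cited, true_citer, cited_idx):
    """Rank the true citing document among all candidates by a simplified link score."""
    modulated = zbar_cited * authority_cited.max(axis=0)    # best-author modulation (simplification)
    scores = zbar_all @ modulated                           # one score per candidate citing document
    scores[cited_idx] = -np.inf                             # a paper cannot cite itself
    order = np.argsort(-scores)
    return int(np.where(order == true_citer)[0][0]) + 1     # 1-based rank of the true citing paper

rng = np.random.default_rng(0)
D, K, A = 200, 5, 2
zbar_all = rng.dirichlet(np.ones(K), size=D)
rank = rank_of_true_citer(zbar_all, zbar_all[7], rng.normal(size=(A, K)), true_citer=3, cited_idx=7)
print(rank, 1.0 / rank)                                     # rank and its reciprocal-rank contribution
```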

In Figure 5, the LTAI yields better citation prediction performance than the other models for all datasets and most numbers of topics. Since the LTAI incorporates topical authority for predicting citations, it performs better than RTM, which does not discover topical authority. We attribute the better performance of the LTAI compared to ALTM and DACTM to the LTAI's model assumptions explained in section 3. We note that DACTM requires additional information, such as citation locations and sentence structure, and is thus applicable only to limited kinds of datasets.

Author h-index # cite # paper Representative Topic T Authority
D Padua 12 291 21 parallel, efficient, computation, runtime 10.36
V Lesser 11 303 48 interaction, intelligent, multiagent, autonomous 11.92
M Lam 11 440 20 memory, processor, cache, synchronization 12.74
M Bellare 11 280 43 scheme, security, signature, attack 13.21
L Peterson 10 297 24 operating, mechanism, interface, thread 9.28
D Ferrari 10 377 18 traffic, delay, bandwidth, allocation 14.16
O Goldreich 9 229 49 proof, known, extended, notion 12.57
M Jordan 9 263 27 approximation, intelligence, artificial, correlation 10.15
D Culler 9 565 30 operating, mechanism, interface, thread 12.37
A Pentland 8 207 39 image, motion, visual, estimate 10.82
Table 2: Authors with the highest h-index scores and their statistics from the CORA dataset. We show the authors with their h-index, number of citations (# cite), number of papers (# paper), representative topic, and the topical authority (T Authority) of the corresponding topic. While these authors all have the highest h-indices, with many papers written and many citations earned, the topics in which they exert authority vary.

7.3 Author Prediction

For author prediction, we randomly remove one of the authors from each document in the test set while preserving the citation structure. Similar to citation prediction, we predict which author is most likely to have written the cited publication based on the topic proportions of the cited publication and its set of citing publications. Because the mixture proportion of an unknown author cannot be obtained during posterior inference, we assume the cited publication is written by a single author to approximate the probability. For author prediction, we choose the author that maximizes this probability. In Figure 6, the LTAI outperforms the comparison models in most of the settings.
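A rough sketch of the author-prediction step under the single-author assumption stated above: each candidate author is scored by how strongly their authority vector, applied to the cited paper's topics, explains the set of citing papers, and the highest-scoring author is predicted. This scoring is a simplification for illustration, not the exact probability used in the paper.

```python
import numpy as np

def predict_author(zbar_cited, zbar_citing_set, candidate_authority):
    """Pick the candidate author whose authority best explains the observed citations.

    zbar_cited          : (K,) topic proportions of the cited paper
    zbar_citing_set     : (M, K) topic proportions of the papers citing it
    candidate_authority : (A, K) authority vectors of all candidate authors
    """
    modulated = zbar_cited * candidate_authority            # (A, K): cited topics modulated per author
    scores = (zbar_citing_set @ modulated.T).sum(axis=0)    # sum link strengths over citing papers
    return int(np.argmax(scores))

rng = np.random.default_rng(0)
K, M, A = 5, 4, 50
print(predict_author(rng.dirichlet(np.ones(K)),
                     rng.dirichlet(np.ones(K), size=M),
                     rng.normal(size=(A, K))))
```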

7.4 Qualitative Analysis

To highlight additional characteristics of our model that are not captured by the quantitative analysis, we examine the assigned topical authority indices as well as other statistics of some researchers in the dataset. In the analyses, we set the number of topics to 100 and use the CORA dataset for demonstration.

We first examine the authoritative topics of famous authors that can be unveiled using our model. In Table 2, we list the top 10 authors with the highest h-indices along with their number of citations, number of papers, and representative topics. An author's representative topic is the topic with the highest authority score. In the table, we observe that all authors with top h-indices have written at least 18 papers and earned at least 207 citations, which are the top 0.8% and 0.2% values, respectively. However, their authoritative topics retrieved by the LTAI do not overlap for any of the authors. This illustrates that each of the top authors exerts authority on a different academic topic that can be captured by the LTAI, even though these authors all have the highest h-index scores and similar topic-agnostic statistics.

We now highlight attributes of the topical authority index that distinguish it from topic-irrelevant statistics. In Tables 3 to 5, we show example topics extracted by our model and list notable authors within each topic with their topical authority indices, h-indices, numbers of citations, and numbers of papers. In the tables, we first find that the authors with the highest topical authority values, Monica Lam, Alex Pentland, Michael Jordan, and Mihir Bellare, are also listed in the topic-irrelevant authority rankings in Table 2. From this, we confirm that the authority score of the LTAI has a certain degree of correlation with other statistics, while it splits the authors by their authoritative topics.

At the same time, the topical authority score correlates less with topic-irrelevant statistics than those statistics correlate with one another. In Table 5, Oded Goldreich has a lower topical authority score for the computer security topic while having higher topic-irrelevant scores than the four researchers above, because his main research field is the theory of computation and randomness. We can also spot authors who exert high authority in multiple academic fields, such as Tomaso Poggio in Table 3 and in Table 4. Similarly, when comparing Federico Girosi and Tomaso Poggio in Table 4, the two researchers have similar authority indices for this topic while Tomaso Poggio has higher values for the other three topic-irrelevant indices. This is a reasonable outcome when we investigate the two researchers' publication histories. Federico Girosi has a relatively focused academic interest, with his publication history skewed towards machine-learning-related subjects, while Tomaso Poggio has broader topical interests that include computer vision and statistical learning, and also co-authored most of the papers that Federico Girosi wrote. Thus, Federico Girosi has a similar authority index for this topic but lower authority indices for other topics than Tomaso Poggio.

Also, our model is able to capture topic-specific authoritative researchers who have relatively low topic-irrelevant scores. For example, researchers such as Stan Sclaroff and Kentaro Toyama are among the top 5 authoritative researchers in the computer vision topic according to the LTAI, but it is difficult to single them out from many other authoritative authors using topic-irrelevant scores.

Finally, the LTAI detects topical authority that is peripheral but not negligible. Mark Jones in Table 4, who has a high h-index, many citations, and many papers, is a researcher whose academic interest lies in programming language design and application. However, while the main topics of most of his papers concern programming languages, he often uses machine learning inference techniques and algorithms in his papers. Our model captures that tendency and assigns him some authority score in machine learning.


Topic: image, motion, visual, estimate,
robust, shape, scene, geometric
Rank Author Topical Authority h-index # cite # paper
1 A Pentland 10.82 8 207 39
2 J Fessler 9.09 6 92 26
3 T Poggio 8.22 6 178 27
4 S Sclaroff 7.61 3 69 11
5 K Toyama 6.65 4 41 10
Table 3: Authors who have high authority scores in the computer vision topic.
Topic: approximation, intelligence, artificial,
correlation, support, recognition, model, representation
Rank Author Topical Authority h-index # cite # paper
1 M Jordan 10.15 9 263 27
2 M Warmuth 9.57 8 160 17
13 T Poggio 3.48 6 178 27
17 F Girosi 3.22 3 101 9
34 M Jones 2.06 7 151 20
Table 4: Authors who have high authority scores in the artificial intelligence topic.

Topic: scheme, security, signature, attack,
threshold, authentication, cryptographic, encryption
Rank Author Topical Authority h-index # cite # paper
1 M Bellare 13.21 11 280 43
2 P Rogaway 11.98 7 117 13
3 H Krawczyk 7.29 6 75 15
4 R Canetti 7.13 4 40 10
9 O Goldreich 3.70 9 229 49
Table 5: Authors who have high authority scores in the computer security topic.

8 Conclusion and Discussion

We proposed Latent Topical-Authority Indexing (LTAI) to model the topical authority of academic researchers. Based on the hypothesis that authors play an important role in citation, we focus specifically on their authority and develop a Bayesian model to capture it. In addition to the model assumptions needed for extracting convincing and interpretable topical authority values for authors, we proposed speed-up methods based on stochastic optimization.

While there is prior research in topic modeling that provides topic-specific indices when modeling the link structure, those indices do not extend to individual authors, and most previous citation-based indices are defined per individual without considering topics. Our model combines the merits of both topic-specific and individual-specific indices to provide topical authority information for academic researchers.

With four academic datasets, we demonstrated that the joint modeling of publication- and author-related variables improves topic quality compared to vanilla LDA. We also quantitatively showed that including authority variables increases the predictive performance for citation and author prediction. Finally, we qualitatively demonstrated the interpretability of the LTAI's topical-authority outcomes on the CORA corpus.

Finally, there are issues that can be dealt with in future work. In our model, we do not consider temporal information about when papers are published and when pairs of papers are linked; we could use datasets that incorporate timestamps to enhance the model's capability to predict future citations and authorships.

Acknowledgments

We thank Jae Won Kim for his help in collecting and refining the dataset and contributing to the early version of the manuscript, the anonymous reviewers as well as the TACL editor Noah Smith for detailed and thoughtful comments, and Joon Hee Kim and other UILab members for providing helpful insights. This work was supported by the Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) (No. B0101-15-0307, Basic Software Research in Human-level Lifelong Machine Learning (Machine Learning Center)).

References

  • [Asuncion et al.2009] Arthur Asuncion, Max Welling, Padhraic Smyth, and Yee Whye Teh. 2009. On smoothing and inference for topic models. In UAI.
  • [Balog and De Rijke2007] Krisztian Balog and Maarten De Rijke. 2007. Determining expert profiles (with an application to expert finding). In IJCAI.
  • [Blei et al.2003] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. JMLR, pages 993–1022.
  • [Caragea et al.2014] Cornelia Caragea, Florin Adrian Bulgarov, Andreea Godea, and Sujatha Das Gollapalli. 2014. Citation-enhanced keyphrase extraction from research papers: A supervised approach. In EMNLP.
  • [Caragea et al.2015] Cornelia Caragea, Florin Bulgarov, and Rada Mihalcea. 2015. Co-training for topic classification of scholarly data. In EMNLP.
  • [Chang and Blei2010] Jonathan Chang and David M. Blei. 2010. Hierarchical relational models for document networks. The Annals of Applied Statistics, pages 124–150.
  • [Cohan and Goharian2015] Arman Cohan and Nazli Goharian. 2015. Scientific article summarization using citation-context and article's discourse structure. In EMNLP.
  • [Diehl et al.2007] Christopher P. Diehl, Galileo Namata, and Lise Getoor. 2007. Relationship identification for social network discovery. In AAAI.
  • [Dietz et al.2007] Laura Dietz, Steffen Bickel, and Tobias Scheffer. 2007. Unsupervised prediction of citation influences. In ICML.
  • [Dong et al.2015] Yuxiao Dong, Reid A. Johnson, and Nitesh V. Chawla. 2015. Will this paper increase your h-index?: Scientific impact prediction. In WSDM.
  • [Foulds and Smyth2013] James R. Foulds and Padhraic Smyth. 2013. Modeling scientific impact with topical influence regression. In EMNLP.
  • [Garfield2006] Eugene Garfield. 2006. The history and meaning of the journal impact factor. JAMA, 295(1):90–93.
  • [Gehrke et al.2003] Johannes Gehrke, Paul Ginsparg, and Jon Kleinberg. 2003. Overview of the 2003 KDD Cup. ACM SIGKDD Explorations Newsletter, 5(2):149–151.
  • [He et al.2015] Yuan He, Cheng Wang, and Changjun Jiang. 2015. Discovering canonical correlations between topical and topological information in document networks. In CIKM.
  • [Hirsch2005] Jorge E. Hirsch. 2005. An index to quantify an individual’s scientific research output. PNAS, 102(46):16569–16572.
  • [Hoffman et al.2013] Matthew D. Hoffman, David M. Blei, Chong Wang, and John W. Paisley. 2013. Stochastic variational inference. JMLR, 14(1):1303–1347.
  • [Hu et al.2008] Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative filtering for implicit feedback datasets. In ICDM.
  • [Huang et al.2015] Wenyi Huang, Zhaohui Wu, Chen Liang, Prasenjit Mitra, and C. Lee Giles. 2015. A neural probabilistic model for context based citation recommendation. In AAAI.
  • [Jiang2015] Zhuoren Jiang. 2015. Chronological scientific information recommendation via supervised dynamic topic modeling. In WSDM.
  • [Kataria et al.2011] Saurabh Kataria, Prasenjit Mitra, Cornelia Caragea, and C. Lee Giles. 2011. Context sensitive topic models for author influence in document networks. In IJCAI.
  • [Koren et al.2009] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer, 42(8).
  • [Liu et al.2009] Yan Liu, Alexandru Niculescu-Mizil, and Wojciech Gryc. 2009. Topic-link LDA: joint models of topic and author community. In ICML.
  • [Lu and Getoor2003] Qing Lu and Lise Getoor. 2003. Link-based classification. In ICML.
  • [McCallum et al.2000] Andrew Kachites McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore. 2000. Automating the construction of internet portals with machine learning. Information Retrieval, 3(2):127–163.
  • [Moravcsik and Murugesan1975] Michael J. Moravcsik and Poovanalingam Murugesan. 1975. Some results on the function and quality of citations. Social studies of science, 5(1):86–92.
  • [Nallapati et al.2008] Ramesh M. Nallapati, Amr Ahmed, Eric P. Xing, and William W Cohen. 2008. Joint latent topic models for text and citations. In SIGKDD.
  • [Neiswanger et al.2014] Willie Neiswanger, Chong Wang, Qirong Ho, and Eric P. Xing. 2014. Modeling citation networks using latent random offsets. In UAI.
  • [Purushotham et al.2012] Sanjay Purushotham, Yan Liu, and C.-C. Jay Kuo. 2012. Collaborative topic regression with social matrix factorization for recommendation systems. arXiv preprint arXiv:1206.4684.
  • [Radlinski et al.2008] Filip Radlinski, Madhu Kurup, and Thorsten Joachims. 2008. How does clickthrough data reflect retrieval quality? In CIKM.
  • [Raftery et al.2012] Adrian E. Raftery, Xiaoyue Niu, Peter D. Hoff, and Ka Yee Yeung. 2012. Fast inference for the latent space network model using a case-control approximate likelihood. Journal of Computational and Graphical Statistics, 21(4):901–919.
  • [Rennie and Srebro2005] Jasson D.M. Rennie and Nathan Srebro. 2005. Fast maximum margin matrix factorization for collaborative prediction. In ICML.
  • [Robbins and Monro1951] Herbert Robbins and Sutton Monro. 1951. A stochastic approximation method. The annals of mathematical statistics, pages 400–407.
  • [Rosen-Zvi et al.2004] Michal Rosen-Zvi, Thomas Griffiths, Mark Steyvers, and Padhraic Smyth. 2004. The author-topic model for authors and documents. In UAI.
  • [Salakhutdinov and Mnih2007] Ruslan Salakhutdinov and Andriy Mnih. 2007. Probabilistic matrix factorization. In NIPS.
  • [Sim et al.2015] Yanchuan Sim, Bryan R. Routledge, and Noah A. Smith. 2015. A utility model of authors in the scientific community. In EMNLP.
  • [Teh et al.2006] Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. 2006. Hierarchical Dirichlet processes. Journal of the American Statistical Association.
  • [Teufel et al.2006] Simone Teufel, Advaith Siddharthan, and Dan Tidhar. 2006. Automatic classification of citation function. In EMNLP.
  • [Tu et al.2010] Yuancheng Tu, Nikhil Johri, Dan Roth, and Julia Hockenmaier. 2010. Citation author topic model in expert search. In COLING.
  • [Valenzuela et al.2015] Marco Valenzuela, Vu Ha, and Oren Etzioni. 2015. Identifying meaningful citations. In Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence.
  • [Voorhees1999] Ellen M. Voorhees. 1999. The TREC-8 question answering track report. In TREC, volume 99, pages 77–82.
  • [Wang and Blei2011] Chong Wang and David M. Blei. 2011. Collaborative topic modeling for recommending scientific articles. In SIGKDD.
  • [Wang et al.2015] Jingang Wang, Dandan Song, Zhiwei Zhang, Lejian Liao, Luo Si, and Chin-Yew Lin. 2015. LDTM: A latent document type model for cumulative citation recommendation. In EMNLP.