With the rapid evolution of scientific research, a huge volume of literature is published every year, and this situation is expected to persist for the foreseeable future. Fig. 1 shows statistics from AMiner (Tang et al., 2008), a large literature database in Computer Science. Fig. 1(a) visualizes the explosive increase in the volume of publications over the past years: the quantity of literature grows at an exponential rate. Effective scientific research requires reviewing previous work. However, it is neither wise nor possible for researchers to track all existing related work, given the extremely large volume of existing publications. In general, researchers follow, or cite, merely a small proportion of high-quality publications. Accordingly, citation count offers a quantitative proxy for a publication’s importance or a scientist’s standing in the research community.
Citation count (Margolis, 1967) has long been the main evaluation measure for the quality and influence of scientific work. Owing to its dominant frequency of use, it stands out among the many measures of scientific impact. Many other important evaluation criteria for authors (e.g., h-index (Hirsch, 2005)) and journals (e.g., Impact Factor (Garfield, 2001)) are calculated from publication citation counts. Fig. 1(b) illustrates the citation distribution (the number of papers vs. citation counts) of about two million papers in AMiner. Not all publications attract equal attention in academia: a few research papers accumulate the vast majority of citations, while most other papers attract only a few (Barabási et al., 2012). That is, some research papers are more likely to attract scientists’ attention than others. Given the ever-growing quantity of literature, it is valuable to forecast which papers are more likely to attract attention in academia.
As widely recognized metrics of scientific impact, the current citation count and its derived metrics capture only past accomplishment. They lack the predictive power to quantify future impact (Wang and Barabási, 2013). Predicting an individual paper’s citation count over time is significant, but (arguably) very difficult. To predict the citation count of individual items within a complex evolving system, current models fall into two main paradigms. One models the citation network and uses graph mining techniques to solve the citation count prediction problem (McGovern et al., 2003; Pobiedina and Ichise, 2016). The other prevalent line of research formulates the citation count over time as a time series, making predictions either by exploiting temporal correlations (Szabo and Huberman, 2010) or by fitting the time series with certain classes of functions (Matsubara et al., 2012; Bao et al., 2013), including regression models (Yan et al., 2011), counting processes (Vu et al., 2011), point processes or specific Poisson processes (Wang and Barabási, 2013), the Reinforced Poisson Process (RPP) (Shen et al., 2014), the self-excited Hawkes process (Peng, 2016), and RPP combined with the self-excited Hawkes process (Xiao et al., 2016).
In this paper, we jointly formulate four major phenomena that have been confirmed independently in previous studies of long-term scientific impact quantification: the intrinsic quality of publications, the aging effect, the Matthew effect, and the recency effect. Building on these formulations, we propose a long-term individual-level citation count prediction (LT-CCP) model via a recurrent neural network (RNN) with long short-term memory (LSTM) units. We validate the proposed model by applying it to a large real-world citation dataset from AMiner. Experimental results demonstrate that our proposed model consistently outperforms the existing models. Our contributions in this paper are as follows: (1) we are the first to simultaneously consider the four key phenomena of long-term scientific impact quantification; (2) we are the first to model citation count prediction with an RNN and to formulate the long-term effects with LSTM units; (3) the LT-CCP model achieves a significant performance improvement in long-term citation count prediction.
2. Problem Formulation
The citations received by an individual paper over a time period are characterized by a time-stamped sequence, in which each element represents the number of citations received by the paper at the corresponding time. Given this historical citation record, the goal is to model the future citation count and predict it at an arbitrary future time.
Definition (Citation count). Given the literature corpus, the citation count of an article at a given time is defined as the number of articles in the corpus that cite it.
The citation count prediction problem can be formalized as follows.
Input: For each paper, the input is a time-stamped sequence of observations, in which each observation is expressed as a feature vector of fixed dimension and is accompanied by the citation count of the paper at the corresponding time. Without loss of generality, we have .
Learning: The goal of citation count prediction is to learn a predictive function that predicts the citation count of an article after a given time period. Formally, the function maps the input of a paper to a predicted citation count, which is compared against the actual one. In this paper, the prediction function can be learned independently for each paper.
Prediction: Based on the learned prediction function, we can predict the citation count of a paper for the following years; for example, the citation count of a paper at a given future time is obtained by evaluating the learned function at that time.
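The Input, Learning, and Prediction steps above can be sketched for a single paper as follows. This is only an illustrative outline: the function names, the example counts, and the constant-rate predictor (a stand-in for the learned prediction function) are hypothetical and are not taken from the paper.

```python
def split_history(yearly_counts, train_years):
    """Split one paper's per-year citation counts into (observed, future)."""
    if not 0 < train_years < len(yearly_counts):
        raise ValueError("train_years must fall strictly inside the sequence")
    return yearly_counts[:train_years], yearly_counts[train_years:]


def predict_constant_rate(observed, horizon):
    """Placeholder predictor (hypothetical): repeat the mean yearly count.

    In the paper this role is played by the learned model; the constant
    rate is used here only so that the sketch is runnable.
    """
    rate = sum(observed) / len(observed)
    return [rate] * horizon


# Hypothetical citation record of one paper, one entry per year.
yearly = [2, 5, 9, 7, 4, 3]

# Input: the first four years are observed; the rest is held out.
observed, future = split_history(yearly, train_years=4)

# Prediction: one value per held-out year.
predicted = predict_constant_rate(observed, horizon=len(future))
```

Because the function is fitted per paper, the same three steps would simply be repeated for every paper in the corpus.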
In the citation count prediction model, we mainly consider four key phenomena confirmed independently in previous studies of long-term scientific impact quantification: (1) intrinsic quality, characterizing the inherent competitiveness of an item against others; (2) the aging effect, capturing the fact that each paper’s novelty eventually fades over time; (3) the Matthew effect, documenting the well-known “rich-get-richer” phenomenon; (4) the recency effect, favoring recent citations. Building on these observations, we derive our LT-CCP model via an RNN with LSTM units by considering these four major phenomena.
Fig. 2 illustrates the proposed prediction model, which is constructed from a two-layer RNN with LSTM units. Given a time-stamped sequence as input, the RNN generates the hidden states for the current working state and outputs a sequence (Pascanu et al., 2012). In the proposed prediction model, the LSTM unit is used for its popularity and its well-known capability for efficient long-range dependency learning (Xiao et al., 2017).
In this paper, our major contribution is that we simultaneously consider the four aforementioned phenomena of long-term scientific impact quantification. The detailed specifications of these four phenomena are formulated as follows:
Intrinsic quality. The intrinsic quality captures the inherent differences between papers, accounting for the perceived novelty and importance of a publication. In practice, highly cited papers are more visible and more likely to be cited again than less-cited papers (Wang and Barabási, 2013). In our proposed prediction model, the input space of every paper, together with its citation count records, reflects the intrinsic quality of the paper.
Aging effect. The aging effect can be modeled by the forget gate in the LSTM cell, which provides a way to discard previously stored memory values. The forget gate applies the sigmoid function to a weighted combination of its inputs, and its weights govern the behavior of the gate. The values of the forget gate vector therefore lie between 0 and 1. If one of these values is 0 (or close to 0), the LSTM cell should remove the information held in the corresponding component of the memory. If one of the values is 1, the corresponding information is kept.
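For reference, the forget gate of a standard LSTM cell is commonly written as below. The symbols ($f_t$ for the gate, $W_f$ and $b_f$ for its weights and bias, $h_{t-1}$ for the previous hidden state, $x_t$ for the current input) follow common textbook notation and are assumptions, not necessarily the notation used in this paper.

```latex
f_t = \sigma\left( W_f \, [h_{t-1}, x_t] + b_f \right)
```

Here $\sigma$ is the sigmoid function, which is why every component of $f_t$ lies between 0 and 1.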
Matthew effect. The Matthew effect of accumulated advantage is summarized by the “rich-get-richer” phenomenon, i.e., previously accumulated attention triggers more subsequent attention (Crane and Sornette, 2008). We therefore need to update the model while taking long-term dependencies into consideration. The model update is controlled by an update gate.
Similar to the forget gate, the update gate is also a vector of values between 0 and 1.
Remembering information for long periods of time is practically the default behavior of LSTM. The long-term accumulative influence is carried by the memory of the LSTM cell, which combines the retained part of the previous memory with the newly admitted information.
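In a standard LSTM cell, the update step and the accumulated long-term memory are commonly written as follows. The notation is an assumption borrowed from textbook presentations: $u_t$ is the update (input) gate, $\tilde{c}_t$ the candidate memory, $c_t$ the cell memory, $f_t$ the forget gate, and $\odot$ the element-wise product.

```latex
\begin{aligned}
u_t         &= \sigma\left( W_u \, [h_{t-1}, x_t] + b_u \right) \\
\tilde{c}_t &= \tanh\left( W_c \, [h_{t-1}, x_t] + b_c \right) \\
c_t         &= f_t \odot c_{t-1} + u_t \odot \tilde{c}_t
\end{aligned}
```

The last line shows how previously accumulated memory $c_{t-1}$ carries forward whenever $f_t$ stays close to 1, which is one way to read the “rich-get-richer” accumulation described above.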
Recency effect. Aggregating all past citations in the model is less effective at capturing the citation dynamics (Xiao et al., 2016). In the citation count prediction model, we need to place emphasis on newly arriving citations. The recent items stored in the current working state have an advantage in reading over those stored in the long-term memory, which makes it possible to capture the recency effect. Building on the recency effect, the prediction model can naturally address the problem in RPP that some papers are simulated with a spiking citation curve.
The formulation of the four aforementioned phenomena is illustrated in Fig. 3. The long-term memory of the LSTM unit is carried by its cell state. The current working memory, which reflects the current state of the LSTM unit, is updated at every time step. Finally, the citation count of a paper at a given time is given by the model’s prediction, which is calculated from the current working memory.
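To make the roles of the gates concrete, the following is a minimal pure-Python sketch of one step of a standard LSTM cell. It is a textbook formulation under stated assumptions, not the authors' implementation: the parameter names (`Wf`, `Wu`, `Wc`, `Wo` and the matching biases), the helper `make_params`, and the toy sizes are all hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, b, v):
    """Return W @ v + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def lstm_step(x, h_prev, c_prev, p):
    """One step of a standard LSTM cell (textbook form, assumed here)."""
    z = h_prev + x  # list concatenation: [h_{t-1}, x_t]
    f = [sigmoid(v) for v in affine(p["Wf"], p["bf"], z)]    # forget gate (aging)
    u = [sigmoid(v) for v in affine(p["Wu"], p["bu"], z)]    # update gate
    g = [math.tanh(v) for v in affine(p["Wc"], p["bc"], z)]  # candidate memory
    o = [sigmoid(v) for v in affine(p["Wo"], p["bo"], z)]    # output gate
    # Long-term memory: retained old memory plus newly admitted information.
    c = [fi * ci + ui * gi for fi, ci, ui, gi in zip(f, c_prev, u, g)]
    # Current working memory (hidden state), read out from the cell memory.
    h = [oi * math.tanh(ci) for oi, ci in zip(o, c)]
    return h, c


def make_params(forget_bias):
    """Hypothetical toy parameters: 1 hidden unit, 1 input, zero weights."""
    zero_w = [[0.0, 0.0]]
    return {"Wf": zero_w, "bf": [forget_bias],
            "Wu": zero_w, "bu": [0.0],
            "Wc": zero_w, "bc": [0.0],
            "Wo": zero_w, "bo": [0.0]}


# A neutral forget gate (value 0.5) halves the stored memory ...
h_keep, c_keep = lstm_step([1.0], [0.0], [2.0], make_params(forget_bias=0.0))
# ... while a strongly negative forget bias erases it almost entirely.
h_drop, c_drop = lstm_step([1.0], [0.0], [2.0], make_params(forget_bias=-50.0))
```

A predicted citation count would then be read out from `h`, for example through a linear layer; that readout is omitted here because its exact form is not specified in the text.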
In this section, we demonstrate the effectiveness of the proposed citation count prediction model.
Experiments are conducted on a real-world dataset (https://www.aminer.cn/data) extracted from AMiner, an academic search and mining platform. We select Computer Science publications spanning multiple years, together with the researchers who authored them. In the full citation network contained in this dataset, the vertices are papers and the edges are citations.
4.2. Baseline Models and Evaluation Metrics
To compare the predictive performance of the proposed LT-CCP model against other models, we introduce several published models that have been used to model and predict citation counts. Specifically, the comparison methods in our experiments are as follows. RPP (Wang and Barabási, 2013; Shen et al., 2014) incorporates three key ingredients, namely the intrinsic attractiveness, the aging effect, and the reinforcement mechanism, using a reinforced Poisson process. CART and SVR are included because they perform better in citation count prediction than LR and KNN in (Yan et al., 2011).
Two metrics used for evaluating popularity dynamics in (Shen et al., 2014; Xiao et al., 2016) are also used here: Mean Absolute Percentage Error (MAPE) and Accuracy (ACC). For each paper, we compare the observed citations up to a given time with the predicted ones. MAPE measures the average relative deviation between the predicted and observed citations over all papers in the dataset.
ACC measures the fraction of papers correctly predicted under a given error tolerance. Specifically, a paper counts as correctly predicted when the relative error of its prediction does not exceed the tolerance.
The indicator function used in ACC returns 1 if the statement is true and 0 otherwise. We find that our method always outperforms the baselines regardless of the value of the tolerance. In this paper, we set the tolerance following (Xiao et al., 2016).
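The two metrics can be sketched in a few lines of Python. This follows the usual definitions (mean relative error for MAPE, fraction of papers whose relative error is within the tolerance for ACC); the function names, the example values, and the tolerance used in the example are hypothetical, and every observed count is assumed to be positive.

```python
def relative_errors(predicted, observed):
    """Per-paper relative error |c - r| / r (assumes every r > 0)."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    return [abs(c - r) / r for c, r in zip(predicted, observed)]


def mape(predicted, observed):
    """Mean Absolute Percentage Error over all papers."""
    errors = relative_errors(predicted, observed)
    return sum(errors) / len(errors)


def acc(predicted, observed, eps):
    """Fraction of papers whose relative error is at most eps."""
    errors = relative_errors(predicted, observed)
    return sum(1 for e in errors if e <= eps) / len(errors)


# Hypothetical predicted and observed citation counts for four papers.
predicted = [12.0, 8.0, 30.0, 5.0]
observed = [10.0, 10.0, 20.0, 5.0]

mape_value = mape(predicted, observed)
acc_value = acc(predicted, observed, eps=0.3)  # eps chosen only for illustration
```

A lower MAPE and a higher ACC both indicate better predictions, which is why the two metrics are reported side by side.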
4.3. Results and Discussion
We found in the experiments that the longer the training period, the better the long-term prediction performance. In this paper, we fix the length of the training period and then predict the citation counts of each paper for a number of consecutive years after it. The observation year is counted from the end of the training period; for example, the first value denotes the first observation year after the training period.
As shown in Table 1, the proposed LT-CCP model exhibits the best performance in terms of ACC for every observation year considered. That is, LT-CCP consistently achieves higher accuracy than the other models across different observation times. Moreover, the LT-CCP model also exhibits the best performance in terms of MAPE in all of these settings, so it achieves higher accuracy and lower error simultaneously. As shown in Fig. 4(a), the advantage of the LT-CCP model over the other methods in terms of ACC increases with the number of years after the training period, and the model achieves a significant ACC improvement over both CART and RPP. As illustrated in Fig. 4(b), the models used for comparison all achieve an acceptably low error rate, except RPP. This problem can be avoided by RPP with prior (Shen et al., 2014), which incorporates a conjugate prior for the fitness parameter. However, RPP with prior does not improve the ACC performance. That is to say, our proposed LT-CCP model also outperforms RPP with prior in terms of ACC. Fig. 4(c) illustrates the distribution of the citations predicted by the LT-CCP model. It shows that the predicted distribution matches that of the real citations on the studied dataset very well.
Publication evaluation is always a key point in decision making concerning recruitment and funding in the scientific community. In this paper, we present a citation count prediction model for individual publications via an RNN with LSTM units. Specifically, we jointly formulate four major phenomena confirmed independently in previous studies of long-term scientific impact quantification: the intrinsic quality of publications, the aging effect, the Matthew effect, and the recency effect. Experiments on a large real-world citation dataset demonstrate that our proposed model consistently outperforms the existing prediction models.
More importantly, the formal formulation via LSTM provides insight into the fundamental mechanisms behind long-term publication citation counts. In the future, we plan to further integrate citation patterns into the proposed model and to incorporate it into a Bayesian network to improve its interpretability.
- Bao et al. (2013) Peng Bao, Hua Wei Shen, Junming Huang, and Xue Qi Cheng. 2013. Popularity prediction in microblogging network: a case study on sina weibo. In International Conference on World Wide Web. 177–178.
- Barabási et al. (2012) Albert-László Barabási, Chaoming Song, and Dashun Wang. 2012. Handful of papers dominates citation. Nature 491, 7422 (2012), 40.
- Crane and Sornette (2008) R Crane and D Sornette. 2008. Robust dynamic classes revealed by measuring the response function of a social system. Proceedings of the National Academy of Sciences of the United States of America 105, 41 (2008), 15649–53.
- Garfield (2001) Eugene Garfield. 2001. Impact factors, and why they won’t go away. Nature 411, 6837 (2001), 522.
- Hirsch (2005) J. E. Hirsch. 2005. An Index to Quantify an Individual’s Scientific Research Output. Proceedings of the National Academy of Sciences of the United States of America 102, 46 (2005), 16569–16572.
- Margolis (1967) J Margolis. 1967. Citation indexing and evaluation of scientific papers. Science 155, 3767 (1967), 1213–1219.
- Matsubara et al. (2012) Yasuko Matsubara, Yasushi Sakurai, B. Aditya Prakash, Lei Li, and Christos Faloutsos. 2012. Rise and fall patterns of information diffusion: model and implications. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 6–14.
- McGovern et al. (2003) Amy McGovern, Lisa Friedland, Michael Hay, Brian Gallagher, Andrew Fast, Jennifer Neville, and David Jensen. 2003. Exploiting relational structure to understand publication patterns in high-energy physics. ACM SIGKDD Explorations Newsletter 5, 2 (2003), 165–172.
- Pascanu et al. (2012) Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2012. On the difficulty of training Recurrent Neural Networks. 52, 3 (2012), III–1310.
- Peng (2016) Bao Peng. 2016. Modeling and Predicting Popularity Dynamics via an Influence-based Self-Excited Hawkes Process. In ACM International on Conference on Information and Knowledge Management. 1897–1900.
- Pobiedina and Ichise (2016) Nataliia Pobiedina and Ryutaro Ichise. 2016. Citation count prediction as a link prediction problem. Applied Intelligence 44, 2 (2016), 252–268.
- Shen et al. (2014) Hua Wei Shen, Dashun Wang, Albert-László Barabási, and Chaoming Song. 2014. Modeling and Predicting Popularity Dynamics via Reinforced Poisson Processes. In Twenty-Eighth AAAI Conference on Artificial Intelligence. 291–297.
- Szabo and Huberman (2010) Gabor Szabo and Bernardo A. Huberman. 2010. Predicting the popularity of online content. Communications of the ACM 53 (2010), 80–88.
- Tang et al. (2008) Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. 2008. ArnetMiner: extraction and mining of academic social networks. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 990–998.
- Vu et al. (2011) Duy Q Vu, Arthur U Asuncion, David R Hunter, and Padhraic Smyth. 2011. Dynamic egocentric models for citation networks. In International Conference on Machine Learning. 857–864.
- Wang and Barabási (2013) Dashun Wang and Albert-László Barabási. 2013. Quantifying long-term scientific impact. Science 342, 6154 (2013), 127–132.
- Xiao et al. (2017) Shuai Xiao, Junchi Yan, Stephen M. Chu, Xiaokang Yang, and Hongyuan Zha. 2017. Modeling The Intensity Function Of Point Process Via Recurrent Neural Networks. (2017).
- Xiao et al. (2016) Shuai Xiao, Junchi Yan, Changsheng Li, and Bo Jin. 2016. On Modeling and Predicting individual Paper Citation Count over Time. In Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16). 2676–2682.
- Yan et al. (2011) Rui Yan, Jie Tang, Xiaobing Liu, Dongdong Shan, and Xiaoming Li. 2011. Citation count prediction: learning to estimate future citations for literature. In ACM International Conference on Information and Knowledge Management. 1247–1252.