The increasing availability of digital data on scholarly outcomes offers unprecedented opportunities to explore the science of science (SciSci) (Fortunato et al., 2018). Based on empirical analysis of big data, SciSci provides a quantitative understanding of scientific discovery, creativity, and practice. It finds that new scientific ideas are mainly inspired by previously published knowledge, and that citation is a relatively faithful reflection of this cumulative nature of scientific research. Citation count, which has long been used to evaluate the quality and influence of scientific work, stands out among the many quantitative metrics of scientific impact. With the rapid evolution of scientific research, a huge volume of literature is published every year, and this situation is expected to persist for the foreseeable future. Fig. 1 shows statistics on AMiner (Tang et al., 2008), a large literature database in Computer Science. Fig. 1(a) visualizes the explosive growth in the volume of publications over the past years, showing that the quantity of literature grows at an exponential rate.
Useful scientific research requires reviewing previous work. It is neither wise nor possible for researchers to track all existing related work, given the enormous volume of publications. In general, researchers follow or cite only a small proportion of high-quality publications. SciSci provides several quantitative methods for measuring scientific impact at the article, author, and journal levels. Much SciSci work has addressed evaluation metrics for the quality and influence of scientific work, including citation count, the h-index (Hirsch, 2005), and the impact factor (Garfield, 2001). One of the most basic metrics of scientific impact is citation count, which measures the number of citations an article has received. Many other essential evaluation criteria for authors (e.g., the h-index) and journals (e.g., the Impact Factor) are calculated from citation counts.
Many SciSci researchers have focused on characterizing scientific impact, such as the universality of citation distributions (Radicchi et al., 2008), the characteristics of citation networks (Hunter et al., 2011; Pan et al., 2012; Kuhn et al., 2014), and the growth patterns of scientific impact (Dong et al., 2015). The results reveal regularities of scientific progress: a few research papers attract the vast majority of citations (Barabási et al., 2012), and long-distance interdisciplinarity leads to higher scientific impact (Larivière et al., 2015; Yegros-Yegros et al., 2015). Fig. 1(b) illustrates the citation distribution (the number of papers vs. citation counts) of about two million papers in AMiner. Naturally, not all publications attract equal attention in academia: a few research papers accumulate the vast majority of citations, while most others attract only a few (Barabási et al., 2012). The citation distribution follows a power law, so a small number of scholarly outputs are far more likely to attract scientists' attention than the vast majority of the rest. Given the ever-growing quantity of literature, it is valuable to forecast which papers are more likely to attract attention in academia.
In fact, the current citation count and the metrics derived from it can only capture past accomplishment; they lack the predictive power to quantify future impact (Acuna et al., 2012). Predicting an individual paper's citation count over time is significant, but (arguably) very difficult. To predict the citation count of individual items within a complex evolving system, current models fall into two main paradigms. One formulates the citation count over time as a time series and then makes predictions either by exploiting temporal correlations (Szabo and Huberman, 2010) or by fitting these time series with certain classes of designed functions (Matsubara et al., 2012; Bao et al., 2013), including regression models (Yan et al., 2011), counting processes (Vu et al., 2011), point processes such as the Poisson process (Wang et al., 2013), the Reinforced Poisson Process (RPP) (Shen et al., 2014), the self-excited Hawkes process (Peng, 2016), and RPP combined with a self-excited Hawkes process (Xiao et al., 2016). These designed functions take various factors into account.
The other prevalent line utilizes Deep Neural Network (DNN) based models to solve the scientific impact prediction problem. Recently, Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) have received considerable attention from both academia and industry. RNNs have been proven to perform particularly well on temporal data series (Sutskever et al., 2014). However, due to the vanishing gradient problem, a plain RNN often fails to handle the temporal contingencies present in input/output sequences spanning long intervals (Bengio et al., 1994). Long short-term memory (LSTM) has been proven capable of learning long-term dependencies: its gated memory cells allow information to persist over long periods. An RNN with LSTM units performs rather well in handling long-term temporal data series (Xiao et al., 2017).
All the existing methods try to make the predicted citation distribution match the power-law distribution as exactly as possible. However, this paper argues that the effectiveness of quantifying long-term scientific impact is fundamentally limited by this line of thinking, and proposes instead to put more attention on a small set of specific items. The authors validate the proposed approach on a real large-scale citation dataset. Extensive experimental results demonstrate that the proposed method possesses remarkable power in predicting long-term scientific citations. The most important contribution is that this paper is the first to change the line of thinking in quantifying long-term scientific impact: instead of simulating the power-law distribution, researchers should make better use of limited attention in order to better stand on the shoulders of giants.
2. Problem Formulation
The basic evaluation metric for scientific impact is citation count. The citation record of an individual paper $d$ during an observation period of $T$ time steps is characterized by a time-stamped sequence $C^d = \{c_1^d, c_2^d, \dots, c_T^d\}$, where $c_t^d$ represents the number of citations received by paper $d$ up to time $t$ and is an integer greater than or equal to zero. Given the historical citation records, the goal is to model the future citation count and predict it at an arbitrary time.
Definition 2.1 (Scientific Impact). Given the literature corpus $D = \{d_1, d_2, \dots, d_N\}$, the scientific impact of a literature article $d \in D$ at time $t$ is defined as its accumulated citation count $c_t^d$.
The citation count here is understood as the accumulated number of citations, which makes it possible to compare citations for different items at different times. The long-term scientific impact of an individual item can then be formalized as the time series $C^d = \{c_1^d, c_2^d, \dots, c_T^d\}$. Without loss of generality, the accumulated citation count increases over time, so we have $c_1^d \le c_2^d \le \dots \le c_T^d$.
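For concreteness, the accumulation step can be sketched in a few lines of Python (an illustrative helper, not part of the proposed model; the yearly counts are made up):

```python
# Turn per-year citation counts into the non-decreasing accumulated
# series c_1 <= c_2 <= ... <= c_T used as the long-term impact signal.
from itertools import accumulate

def cumulative_citations(yearly_counts):
    """Accumulate yearly citation counts into a cumulative series."""
    return list(accumulate(yearly_counts))

yearly = [0, 3, 7, 5, 2]             # citations received in each year (made up)
print(cumulative_citations(yearly))  # [0, 3, 10, 15, 17]
```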
The scientific impact prediction problem can be formalized as follows.
Input: For each paper $d$, the input is the sequence $\{(x_t^d, c_t^d)\}_{t=1}^{T}$, where $x_t^d$ is expressed as a $k$-dimensional feature vector and $c_t^d$ is the citation count of paper $d$ at time $t$.
Learning: The goal of citation count prediction is to learn a predictive function $f(\cdot)$ to predict the citation count of an article after a given time period $\Delta t$. Formally, we have $\hat{c}_{T+\Delta t}^d = f\big(x_1^d, \dots, x_T^d;\ c_1^d, \dots, c_T^d\big)$, where $\hat{c}_{T+\Delta t}^d$ is the predicted citation count and $c_{T+\Delta t}^d$ is the actual one.
Prediction: Based on the learned prediction function, we can predict the citation count of a paper for the following years; for example, the citation count of paper $d$ at time $T + \Delta t$ is given by $\hat{c}_{T+\Delta t}^d$.
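The input/learning/prediction split above can be illustrated with a small, hypothetical helper (the names `T` and `delta_t` mirror the training-period length and the prediction horizon; the series is made up):

```python
# Split one paper's accumulated citation series into an (input, target)
# pair: the first T observations are the input, and the accumulated
# count delta_t years later is the prediction target.
def make_example(cum_citations, T, delta_t):
    """Build one supervised example from an accumulated citation series."""
    assert len(cum_citations) >= T + delta_t
    x = cum_citations[:T]               # observed history c_1 .. c_T
    y = cum_citations[T + delta_t - 1]  # future accumulated count c_{T+delta_t}
    return x, y

series = [1, 4, 9, 15, 20, 26, 30]      # accumulated citations per year (made up)
x, y = make_example(series, T=3, delta_t=2)
print(x, y)  # [1, 4, 9] 20
```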
Fig. 2. The deep learning attention model.
3. Scientific Impact Prediction
As the most efficient scientific impact prediction method found so far, RNN has already achieved compelling performance in predicting scientific impact. This paper uses an RNN with LSTM units as the base model and then emphasizes highly cited papers through the proposed attention mechanism. Although the attention mechanism has been used in many other fields, the proposed method gives new insight into quantifying long-term scientific impact. Instead of adapting the predicted citation distribution to a power-law distribution, the findings in this paper provide a new line of thinking for SciSci research.
3.1. Deep Learning Attention Mechanism
Given a time-stamped sequence $\{(x_t, c_t)\}_{t=1}^{T}$, a $k$-dimensional feature vector $x_t$ needs to be designed as input. The input space of every item with its popularity records reflects the intrinsic quality of the item. Fig. 2(a) gives an overview of the model architecture, which has two key components: the RNN with LSTM units and the attention model. As illustrated in Fig. 2(b), the LSTM units are arranged in the form of an RNN with multiple layers; in the deep neural network, the number of layers depends on the input scale. The RNN is popular and well known for efficient time-series learning (Xiao et al., 2017), and the LSTM units capture the long-range dependencies in long-term scientific impact quantification.
The RNN with LSTM Units. The LSTM units are arranged in the form of an RNN, as illustrated in Fig. 2(b). There are four major components in a standard LSTM unit: a memory cell, a forget gate $f_t$, an input gate $i_t$, and an output gate $o_t$. The gates are responsible for information processing and storage over arbitrary time intervals. Usually, the outputs of these gates lie between 0 and 1. A recent study suggests pushing the output values of the gates towards 0 or 1, so that the gates are mostly open or closed instead of being in a middle state (Li et al., 2018). This paper arranges the LSTM units in the form of an RNN; introducing the memory cell mitigates the vanishing gradient problem, so the LSTM unit can store information for either short or long periods.
Intuitively, the input gate controls the extent to which a new value flows into the memory cell. A function of the inputs passes through the input gate and is added to the cell state to update it. The following formula for the input gate is used:

$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$

where the matrix $W_i$ collects the weights of the input and recurrent connections, and the symbol $\sigma$ represents the sigmoid function. The values of the vector $i_t$ are between 0 and 1. If one of the values of $i_t$ is 0 (or close to 0), the input gate is closed and no new information is allowed into the corresponding component of the memory cell at time $t$. If one of the values is 1, the input gate is open for the incoming value at time $t$. Otherwise, the gate is partially open.
The forget gate controls the extent to which a value remains in the memory cell. It provides a way to get rid of a previously stored memory value. The formulation of the forget gate is:

$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$

where $W_f$ is the weight matrix that governs the behavior of the forget gate. Similar to $i_t$, $f_t$ is also a vector of values between 0 and 1. If one of the values of $f_t$ is 0 (or close to 0), the memory cell should remove the piece of information in the corresponding component of the cell; if one of the values is 1, the corresponding information is kept.
Remembering information for long periods of time is practically the default behavior of LSTM. The long-term accumulative influence is formulated as follows:

$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$

where $\odot$ denotes the Hadamard product (the element-wise multiplication of matrices), and the candidate content $\tilde{C}_t$ is calculated as follows:

$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$

That is, the information in memory cell $C_t$ consists of two parts: the retained old information (controlled by the forget gate) and the newly arriving information (controlled by the input gate).
The output gate controls the extent to which the value in the cell is used to compute the output activation of the LSTM unit. The following output function is used:

$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$

The weight matrices and bias vector parameters need to be learned during training. This paper updates the current working state with the following formula:

$h_t = o_t \odot \tanh(C_t)$
In the time-series modeling of scientific impact, the recent items stored in the short-term working state $h_t$ have an advantage in being read over those stored in the long-term memory. The next step introduces the attention mechanism based on $h_t$.
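Under these definitions, a single LSTM update can be sketched in NumPy. This is a minimal illustration of the gate equations above with random placeholder weights, not the authors' trained model; the stacked-weight layout is one common convention:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM update; W maps [h_prev, x_t] to the four stacked gates."""
    n = h_prev.size
    z = W @ np.concatenate([h_prev, x_t]) + b
    i = sigmoid(z[0 * n:1 * n])        # input gate i_t
    f = sigmoid(z[1 * n:2 * n])        # forget gate f_t
    o = sigmoid(z[2 * n:3 * n])        # output gate o_t
    C_tilde = np.tanh(z[3 * n:4 * n])  # candidate cell content
    C = f * C_prev + i * C_tilde       # retained old + gated new information
    h = o * np.tanh(C)                 # current working state h_t
    return h, C

rng = np.random.default_rng(0)
n_hidden, n_in = 4, 3
W = rng.normal(scale=0.1, size=(4 * n_hidden, n_hidden + n_in))
b = np.zeros(4 * n_hidden)
x_t = rng.normal(size=n_in)
h, C = lstm_step(x_t, np.zeros(n_hidden), np.zeros(n_hidden), W, b)
print(h.shape, C.shape)  # (4,) (4,)
```

Because the gates are sigmoids, every component of $h_t$ stays strictly inside $(-1, 1)$, which keeps the recurrence numerically stable over long sequences.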
The Attention Model. The artificial attention mechanism, inspired by attentional behavior in neuroscience, has been applied in deep learning for speech recognition, translation, and visual object identification. Broadly, attention mechanisms are components of prediction systems that allow the system to focus on different subsets of the input sequentially. The mechanism aims to capture the critical points, focusing on the relevant parts more than the remote parts, as a human does. More specifically, content-based attention generates an attention distribution so that only a subset of the input information is emphasized. The attention function needs to be differentiable, so every part of the input receives some attention, just to different extents.
The deep learning attention mechanism used in this paper works as follows: given an input sequence $x_1, \dots, x_T$, the aforementioned LSTM units generate the hidden states $h_1, \dots, h_T$ to represent the hidden patterns of the input. The output is a summary that focuses on the information most linked to the input. In this formulation, attention produces a fixed-length embedding of the input sequence by computing an adaptive weighted average of the state sequence $h_1, \dots, h_T$.
The graphical representation of the attention model is shown in Fig. 2(c). The input and the hidden layer of the LSTM network (an RNN composed of LSTM units) are the inputs of the attention model. It first computes a score for each hidden state:

$e_t = \tanh(W_a h_t + b_a)$

where $W_a$ is the weight matrix. An important remark here is that each $e_t$ is computed independently, without looking at the other $h_{t'}$ for $t' \ne t$. Each $e_t$ is then fed into a softmax layer, whose function is given by:

$\alpha_t = \frac{\exp(e_t)}{\sum_{t'=1}^{T} \exp(e_{t'})}$

where $\alpha_t$ is the softmax of the scores $e_t$ projected on a learned direction. The output $z$ is a weighted arithmetic mean of the hidden states, with the weights $\alpha_t$ reflecting the relevance between $h_t$ and the input. It is calculated with the following formula:

$z = \sum_{t=1}^{T} \alpha_t h_t$

Finally, the popularity of item $d$ at time $T + \Delta t$ is given by the prediction $\hat{c}_{T+\Delta t}^d$ computed from $z$.
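The three attention steps described above (independent scores, softmax, weighted average) can be sketched in NumPy; the scoring direction and the hidden-state values below are illustrative assumptions, not trained parameters:

```python
import numpy as np

def attention(H, w, b=0.0):
    """H: (T, n) hidden states; w: (n,) learned scoring direction."""
    scores = np.tanh(H @ w + b)                    # e_t, computed independently
    alpha = np.exp(scores) / np.exp(scores).sum()  # softmax attention weights
    z = alpha @ H                                  # adaptive weighted average
    return z, alpha

H = np.array([[0.1, 0.2],
              [0.4, 0.3],
              [0.9, 0.5]])                 # toy hidden states h_1..h_3
z, alpha = attention(H, w=np.array([1.0, -0.5]))
print(round(float(alpha.sum()), 6))        # 1.0 (weights form a distribution)
```

Note that the softmax assigns the largest weight to the hidden state with the highest score, which is how the model can concentrate on the most informative part of the citation history.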
3.2. Key Factor in Quantifying Long-term Impact
As widely acknowledged, the citation distribution follows a power law, and this finding has led the way for research in this domain: researchers try to simulate the citation distribution as a power-law distribution. This paper changes that line of thinking. Although the number of research papers has exploded, the reading time of scientists has not, and attention shifts toward the top papers over time (Barabási et al., 2012). Even though the citation distribution follows a power law, attention is also vital in quantifying long-term scientific impact.
Given limited attention, the Matthew effect dominates in quantifying long-term scientific impact, as the experiments will confirm. The citation count captures the inherent differences between papers, accounting for the perceived novelty and importance of a paper. The “rich-get-richer” phenomenon summarizes the Matthew effect of accumulated advantage: previously accumulated attention triggers more subsequent attention (Crane and Sornette, 2008). In fact, highly popular items are more visible and more likely to be viewed than others. The proposed model emphasizes highly cited papers under limited attention. The memory cell in the LSTM unit captures the long-term dependencies; as shown in Eq. (5), previously accumulated attention stored in the long-term memory triggers more subsequent attention. What is more, the attention model, which focuses on the most popular part of the time series as Eq. (11) does, also emphasizes the Matthew effect.
4. Experiments

This section demonstrates the effectiveness of putting particular emphasis on the vital factor in quantifying long-term scientific impact.
4.1. Dataset

The authors extract the data from an academic search and mining platform called AMiner and construct a real large-scale scholarly dataset (https://www.aminer.cn/data). The full citation network contained in this dataset has millions of vertices (papers) and millions of edges (citations). In detail, the dataset is composed of digitalized papers spanning many years, together with the citations between them. By convention, the authors eliminate papers that received too few citations during the first years after publication and retain only the remaining papers as training data; as a result, only papers published early enough to have a sufficiently long citation history are retained.
4.2. Baseline Models and Evaluation Metrics
To compare the predictive performance of the proposed attention model against other models, we introduce several published models that have been used to predict scientific impact. Specifically, the comparison methods in the experiments are LR, CART, SVR (the three basic machine learning methods used in (Yan et al., 2011)), RPP (Wang et al., 2013; Shen et al., 2014), and RNN (Xiao et al., 2017). An advantage of deep learning is the utilization of various features; for the sake of fairness, the authors only use the citation count records and the same features used in (Wang et al., 2013; Shen et al., 2014).
This paper uses two basic metrics for scientific impact evaluation: Mean Absolute Percentage Error (MAPE) and Accuracy (ACC). Let $c_t^d$ be the observed citation count of paper $d$ up to time $t$, and $\hat{c}_t^d$ be the predicted one. MAPE measures the average relative deviation between the predicted and observed citations over all papers. For a dataset of $N$ papers, MAPE is given by:

$\mathrm{MAPE} = \frac{1}{N} \sum_{d=1}^{N} \left| \frac{\hat{c}_t^d - c_t^d}{c_t^d} \right|$
ACC measures the fraction of papers correctly predicted under a given error tolerance $\epsilon$. Specifically, the accuracy of citation prediction over $N$ papers is defined as:

$\mathrm{ACC} = \frac{1}{N} \sum_{d=1}^{N} \mathbb{1}\!\left[ \left| \frac{\hat{c}_t^d - c_t^d}{c_t^d} \right| \le \epsilon \right]$

where $\mathbb{1}[\cdot]$ is an indicator function that returns 1 if the statement is true and 0 otherwise. We find that our method always outperforms the others regardless of the value of $\epsilon$, which is fixed in this paper.
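The two metrics can be written directly from their definitions; the sketch below uses made-up counts and an illustrative tolerance `eps` (the paper's exact tolerance value is not reproduced here):

```python
# MAPE and ACC over N papers, following the definitions above.
def mape(actual, predicted):
    """Mean Absolute Percentage Error over all papers."""
    return sum(abs(p - a) / a for a, p in zip(actual, predicted)) / len(actual)

def acc(actual, predicted, eps):
    """Fraction of papers whose relative error is within tolerance eps."""
    hits = sum(abs(p - a) / a <= eps for a, p in zip(actual, predicted))
    return hits / len(actual)

actual = [10, 50, 200]                    # observed citation counts (made up)
predicted = [12, 45, 150]                 # model predictions (made up)
print(round(mape(actual, predicted), 3))  # 0.183 (mean of 0.2, 0.1, 0.25)
print(acc(actual, predicted, eps=0.3))    # 1.0 (all within 30% error)
```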
4.3. Model Setting
The experimental results show that the longer the duration of the training set, the better the long-term prediction performance. This paper fixes the length of the training period in years and then predicts the citation counts for each paper in the years after the training period; for example, $t = 1$ denotes the first observation year after the training period. In the experiments, the features with positive contributions are the citation history, the h-index of the paper's authors, and the level of the publication journal. For the convenience of performance comparison, the input feature used here is the citation history over every sub-window of fixed length. The loss function used here is MAPE, Adadelta is used as the gradient descent optimization algorithm, and the attention layer is fully connected with tanh activation.
As shown in Table 1, the proposed model exhibits the best performance in terms of ACC for all the observation times considered, meaning that DLAM consistently achieves higher accuracy than the other models across different observation times. What is more, the proposed model also exhibits the best performance in terms of MAPE in all the situations mentioned above. That is, the proposed model achieves higher accuracy and lower error rates simultaneously. In the experiments, all the models used for comparison achieve acceptably low error rates, except RPP. RPP can avoid this problem with a prior (Shen et al., 2014), which incorporates a conjugate prior for the fitness parameter. However, RPP with the prior does not improve the ACC performance. Overall, the proposed model also outperforms RPP with the prior.
The advantage of the proposed model over the other methods in terms of ACC and MAPE increases with the number of years after the training period. Compared to RNN (the most efficient method certified in recent works), the proposed model achieves modest performance improvements in terms of MAPE and ACC when $t$ is small. However, for larger $t$, the proposed model achieves significant performance improvements in both MAPE and ACC. In other words, the proposed model shows clear superiority over the other models in scientific impact prediction, especially in the long-term situation.
5. Further Exploration
Effectiveness of the attention mechanism. The authors remove the attention module of the proposed model to verify the effectiveness of the attention mechanism. The remainder is the RNN with LSTM units (labeled LT-CCP), which has been proven useful in long-term citation count prediction. Next, we add the attention mechanism in two different ways: first, we add the attention module before the RNN module (labeled ATT-B-LT, attention before LT-CCP); second, we add the attention module after the RNN module (labeled ATT-A-LT, attention after LT-CCP). As shown in Fig. 3(b) and Fig. 3(a), the ACC increases and the corresponding MAPE decreases: both ATT-B-LT and ATT-A-LT perform better than LT-CCP in terms of MAPE and ACC. Introducing the attention module improves the ability of scientific impact prediction, which verifies the effectiveness of the attention mechanism.
In addition, we can see that ATT-A-LT performs better than ATT-B-LT. This indicates that the deep learning model can learn the implicit features underlying the citation records, which provides a further boost in performance.
Analysis of the citation distribution. We illustrate the actual and predicted citation distributions of LT-CCP (RNN with LSTM), ATT-B-LT, and ATT-A-LT (DLAM) in Fig. 3(c), Fig. 3(d), and Fig. 3(e), respectively. LT-CCP (RNN with LSTM), illustrated in Fig. 3(c), gives the best simulation of the power-law distribution, whereas ATT-B-LT, shown in Fig. 3(d), and ATT-A-LT (DLAM), shown in Fig. 3(e), simulate the power-law distribution poorly. That is, the distribution predicted by LT-CCP (RNN with LSTM) matches that of the real citations very well, but those of ATT-B-LT and ATT-A-LT (DLAM) do not. It is usually believed that the more closely the predicted distribution resembles the power law, the better the overall result, so at first glance it seems that LT-CCP (RNN with LSTM) performs best.
However, this first impression is wrong. As verified in Fig. 3(a) and Fig. 3(b), LT-CCP (RNN with LSTM) performs the worst. In fact, LT-CCP only fits the papers with few citations well; on the contrary, ATT-B-LT and ATT-A-LT (DLAM) fit the highly cited papers better. The methods with the attention mechanism achieve better overall performance, which accords better with practical prediction requirements, since a few papers account for the vast majority of citations. This further proves the effectiveness of the attention model. The experimental results indicate that we need to change the fixed pattern of thinking in quantifying long-term scientific impact.
6. Conclusion

Scientific impact evaluation is always a key point in decision making concerning recruitment and funding in the scientific community. Based on big-data empirical analysis, the science of science provides a quantitative understanding of scientific impact. In this paper, the authors introduce the attention mechanism into long-term scientific impact prediction and verify its effectiveness. More importantly, this paper provides great insight into understanding the key factor in quantifying long-term scientific impact. It is usually believed that the more closely the predicted citation distribution resembles the power law, the better the overall result; however, the experimental results in this paper discredit this conclusion. In future research, we need to change the fixed pattern of thinking in quantifying long-term scientific impact and make better use of limited attention to better stand on the shoulders of giants.
Acknowledgments

This work is supported by grants from the National Natural Science Foundation of China (NSFC), including an NSFC grant for Distinguished Young Scholars.
References

- Acuna et al. (2012). Future impact: predicting scientific success. Nature 489(7415), 201.
- Bao et al. (2013). Popularity prediction in microblogging network: a case study on Sina Weibo. In International Conference on World Wide Web, 177–178.
- Barabási et al. (2012). Publishing: handful of papers dominates citation. Nature 491(7422), 40.
- Bengio et al. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5(2), 157–166.
- Crane and Sornette (2008). Robust dynamic classes revealed by measuring the response function of a social system. Proceedings of the National Academy of Sciences 105(41), 15649–15653.
- Dong et al. (2015). Will this paper increase your h-index? Scientific impact prediction. In ACM International Conference on Web Search and Data Mining, 149–158.
- Fortunato et al. (2018). Science of science. Science 359(6379).
- Garfield (2001). Impact factors, and why they won't go away. Nature 411(6837), 522.
- Hirsch (2005). An index to quantify an individual's scientific research output. Proceedings of the National Academy of Sciences 102(46), 16569–16572.
- Hunter et al. (2011). Dynamic egocentric models for citation networks. In International Conference on Machine Learning, 857–864.
- Kuhn et al. (2014). Inheritance patterns in citation networks reveal scientific memes. Physical Review X 4(4), 041036.
- Larivière et al. (2015). Long-distance interdisciplinarity leads to higher scientific impact. PLoS ONE 10(3).
- Li et al. (2018). Towards binary-valued gates for robust LSTM training. In International Conference on Machine Learning, 3001–3010.
- Matsubara et al. (2012). Rise and fall patterns of information diffusion: model and implications. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 6–14.
- Pan et al. (2012). World citation and collaboration networks: uncovering the role of geography in science. Scientific Reports 2, 902.
- Peng (2016). Modeling and predicting popularity dynamics via an influence-based self-excited Hawkes process. In ACM International Conference on Information and Knowledge Management, 1897–1900.
- Radicchi et al. (2008). Universality of citation distributions: toward an objective measure of scientific impact. Proceedings of the National Academy of Sciences 105(45), 17268–17272.
- Shen et al. (2014). Modeling and predicting popularity dynamics via reinforced Poisson processes. In AAAI Conference on Artificial Intelligence.
- Sutskever et al. (2014). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, 3104–3112.
- Szabo and Huberman (2010). Predicting the popularity of online content. Communications of the ACM 53.
- Tang et al. (2008). ArnetMiner: extraction and mining of academic social networks. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 990–998.
- Vu et al. (2011). Dynamic egocentric models for citation networks. In International Conference on Machine Learning, 857–864.
- Wang et al. (2013). Quantifying long-term scientific impact. Science 342(6154), 127–132.
- Xiao et al. (2016). On modeling and predicting individual paper citation count over time. In International Joint Conference on Artificial Intelligence, 2676–2682.
- Xiao et al. (2017). Modeling the intensity function of point process via recurrent neural networks. In AAAI Conference on Artificial Intelligence, 1597–1603.
- Yan et al. (2011). Citation count prediction: learning to estimate future citations for literature. In ACM International Conference on Information and Knowledge Management, 1247–1252.
- Yegros-Yegros et al. (2015). Does interdisciplinary research lead to higher citation impact? The different effect of proximal and distal interdisciplinarity. PLoS ONE 10(8).