Heavy-tailed heterogeneous distributions have been found in many empirical data sets since their first observation in the economy Mitzenmacher2004 . Such a distribution is characterized by a long and heavy tail, the part of the distribution that contains infrequently occurring events with large values. Such highly skewed distributions seem to be ubiquitous, from cyberspace, e.g. the World Wide Web Albert1999 , to biological systems, e.g. the metabolic networks of living organisms Jeong2000 , and have been believed to be a universal law of nature. Consequently, for almost 20 years, topological studies on large-scale systems have focused on interesting features related to scale-free distributions: degree distribution, small-world property, robustness against failure, community structure, core-periphery structure, etc. Power laws and related distributions have commonly been considered a holy grail of complex systems due to their innate properties Albert2000 ; Barrat2000 . Scholars occasionally assume that such long tails imply a power law, in which a quantity $x$ is drawn from a probability distribution of the form

$$p(x) \propto x^{-\alpha},$$

where $\alpha$ is called the exponent and is a constant parameter that characterizes the power-law distribution. Generally, empirical observations follow a power law only for a limited range of the exponent $\alpha$ and for values of $x$ larger than a certain minimum $x_{\min}$. However, a recent study showed that such simple power-law or completely scale-free distributions are rare Broido2018 .
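As an illustration of how a power-law exponent can be estimated above a minimum value, the following minimal sketch uses the standard maximum-likelihood estimator for a continuous power law described in Clauset2009. The function names, the synthetic data, and the choice of a continuous (rather than discrete) model are assumptions made for illustration only; they are not taken from the analysis in this paper.

```python
import math
import random


def sample_power_law(n, alpha, x_min, rng):
    """Draw n values from p(x) ~ x^(-alpha), x >= x_min, by inverse-transform sampling."""
    # 1 - rng.random() lies in (0, 1], so the negative power is always defined.
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


def fit_alpha(values, x_min):
    """Maximum-likelihood exponent of a continuous power law on the tail x >= x_min."""
    tail = [x for x in values if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    return 1.0 + len(tail) / log_sum


# Hypothetical check on synthetic data with a known exponent.
rng = random.Random(0)
data = sample_power_law(20000, alpha=2.5, x_min=1.0, rng=rng)
alpha_hat = fit_alpha(data, x_min=1.0)
print(f"estimated exponent: {alpha_hat:.3f}")
```

The estimate should fall close to the exponent used to generate the synthetic sample; for real citation counts, which are integers, a discrete estimator would be the more appropriate choice.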
Estimating an appropriate citation distribution is a crucial component of scientometrics for establishing an unbiased statistical background for policy and decision makers Vieira2010 ; Ruiz2012 ; Bornmann2017 . Unfortunately, identifying that distribution is challenging, because most citation distributions have long tails with rare events, which are inevitably accompanied by large fluctuations in the observed distributions Clauset2009 . Even when observed data behave like a certain model distribution, it is still hard to rule out alternative distributions, e.g. log normals and power laws Redner1998 ; Redner2005 . Indeed, scholars have proposed several types of model distributions for citation counts. One possible candidate is the power law and its variants Redner1998 ; Price1976 ; Albarran2011 ; Brzezinski2015 . An exponential and a stretched exponential have also been reported Wallace2009 . In addition, recent studies indicated that citation distributions can be explained by the (discretized) log normal Thelwall2014 ; Thelwall2016 . Despite the significant contributions of such endeavours, previous studies mainly focused on the citation count accumulated from the publication year up to a certain citation window and often neglected the citations earned yearly by academic literature, e.g. journal articles. Several studies also investigated the dynamics and ageing of citation counts for individual articles Glanzel2004 ; Mingers2006 ; Bouabid2011 ; however, these rarely paid attention to the evolution of the yearly citation distribution as a whole. A comprehensive understanding of citation evolution requires a complementary methodology with a more in-depth analysis of the yearly increment of citations.
In this study, we perform a detailed analysis of the dynamic pattern of citation distributions using the citation history of 42 million pieces of academic literature published between 1996 and 2016. First, we use the entire citation history to assess the model distribution of yearly acquired citations. Second, we propose a journal- and year-normalization method for citation counts. Our analysis of articles belonging to distinct disciplines shows that the proper model distributions for the raw counts and the normalized counts are different. In this process, we demonstrate that a journal’s prestige has a strong impact on the raw citation count, so that proper normalization is required to comprehend the dynamics of citation evolution. We show that our normalization method can practically remove the citation surplus owing to the journal’s prestige, which appears as a long-term correlation of the yearly acquired citation count.
2 Assessing empirical citation distributions
2.1 Data set
For our analysis on paper metadata, we use the dump of the entire SCOPUS CUSTOM XML DATA for 22 August 2017. This custom data contains the complete copy of data from the SCOPUS website from the very beginning, i.e., January 1996 to August 2017, and includes title, journal, abstract, author information, and citation records in XML format. Each type of document plays different roles in knowledge formation. For example, conference proceedings are a conventional method for presenting new research in the fields of computer science, whereas journal articles are the primary method for many other disciplines. One should note that several disciplines in social science also acknowledge books and reports as essential archives of knowledge. Therefore, to prevent a possible bias towards specific disciplines, we use the entire metadata regardless of the citation type in SCOPUS.
In this data set, there are a total of records of academic literature. The metadata of each record specifies the journal and timestamp of the document. The metadata also includes the All Science Journal Classification (ASJC) codes of each journal SCOPUSDATA . This system consists of a two-level hierarchical classification with 27 subject areas and 334 subject categories, yet some of the subject categories are barely used Wang2016 . To tackle this issue, we use the Scimago Journal & Country Rank (SJR), which consists of 309 refined subject categories and 27 subject areas from the ASJC scheme SJR ; Gomez2011 . In this study, we take the SJR classification of 2016 regardless of the articles’ publication year for consistency. Journals documented in the SJR database are assigned to at least one ASJC subject category; however, SJR excludes some journals based on criteria regarding journal quality and volume size SJR ; Gomez2011 . Therefore, we also exclude journals that do not belong to the SJR classification system from the analyses performed separately on each subject category and area. In addition, some journals have multiple IDs due to changes in scope, ISBN or ISSN, publisher, journal title, etc. The journal classification may also vary during such changes; thus, we merge the classification information of journals with multiple IDs if journals with distinct IDs share an identical journal title. The timestamps of the publications are preferentially extracted from the publicationdate element. The xocs:sort-year element is used instead only in the infrequent cases where the publicationdate element is empty or incomplete in the metadata. If the timestamp of a publication does not fall between 1996 and 2017, we consider the data invalid and ignore it.
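The timestamp fallback and the title-based merging of journal classifications described above can be sketched as follows. This is only an illustrative sketch: the record layout (plain dictionaries with the keys `publicationdate`, `sort_year`, `title`, and `categories`) is hypothetical and stands in for the parsed XML, and "incomplete" is interpreted here as "lacking a usable year", which is an assumption.

```python
from collections import defaultdict


def extract_year(record):
    """Return a valid publication year, or None if the record should be ignored.

    `record` is a hypothetical dict; 'publicationdate' may hold a 'year' field,
    and 'sort_year' stands in for the xocs:sort-year element.
    """
    pub = record.get("publicationdate") or {}
    year = pub.get("year") or record.get("sort_year")
    try:
        year = int(year)
    except (TypeError, ValueError):
        return None
    # Timestamps outside 1996-2017 are treated as invalid.
    return year if 1996 <= year <= 2017 else None


def merge_categories_by_title(journals):
    """Merge the subject categories of journal IDs that share an identical title."""
    merged = defaultdict(set)
    for journal in journals:
        merged[journal["title"]].update(journal["categories"])
    return {title: sorted(cats) for title, cats in merged.items()}


# Hypothetical usage.
print(extract_year({"publicationdate": {"year": "2004"}}))       # primary element
print(extract_year({"publicationdate": {}, "sort_year": "1999"}))  # fallback element
print(merge_categories_by_title([
    {"id": 1, "title": "Journal A", "categories": ["Physics"]},
    {"id": 2, "title": "Journal A", "categories": ["Mathematics"]},
]))
```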
2.2 Best fit model distributions for the annual acquired citation count.
The power law is characterized by a long tail and rare events, which are inevitably accompanied by large fluctuations in the observed distributions Clauset2009 . It is thus hard to infer a suitable distribution from an empirical observation. In fact, even if a power law fits the observed data well, there is always the possibility of alternative distributions: exponential, log normal, and so on. Moreover, such model distributions might fit better than our primary assumption of a power law. Although many candidates for heavy-tailed distributions have been proposed, we choose the following six frequently claimed candidate models:
Simple power-law distribution
Power-law distribution with an exponential cut-off
Stretched exponential distribution
Log normal and positive log-normal distributions
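For orientation, the functional forms commonly used for these families in the heavy-tail fitting literature (e.g. Clauset2009) are summarized below, up to normalization and for $x \geq x_{\min}$. These are standard textbook forms written under assumptions, not necessarily the exact parametrizations used in this study: the exponential, named earlier in the text as an alternative, is included on the assumption that it is the sixth candidate, and the positive log normal is taken to be a log normal restricted to $\mu > 0$.

```latex
\begin{align*}
\text{Power law:} \quad & p(x) \propto x^{-\alpha} \\
\text{Power law with an exponential cut-off:} \quad & p(x) \propto x^{-\alpha} e^{-\lambda x} \\
\text{Exponential:} \quad & p(x) \propto e^{-\lambda x} \\
\text{Stretched exponential:} \quad & p(x) \propto x^{\beta - 1} e^{-\lambda x^{\beta}} \\
\text{Log normal:} \quad & p(x) \propto \frac{1}{x} \exp\!\left[-\frac{(\ln x - \mu)^{2}}{2\sigma^{2}}\right] \\
\text{Positive log normal:} \quad & \text{as the log normal, with } \mu > 0
\end{align*}
```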
We begin investigating the empirical evidence for the best-fit model distributions with the maximum likelihood ratio method Clauset2009 . We use the yearly acquired number of citations, $c_{i,t}$, defined as the number of citations that article $i$ obtained in year $t$. This value reflects the level of attention paid to a single piece of academic literature in a particular year, unlike the citations accumulated since the publication year, which reflect the long-term cumulative impact of the literature. As a first step, we fit the empirical distribution of $c_{i,t}$ to all six candidate model distributions for each publication year and each cited year between 1996 and 2016. For every pair of publication and cited years, we scan the parameters, including the minimum value of $c_{i,t}$, to find the best fit according to its log-likelihood value, because the citation distribution may not be characterized by a single distribution alone Redner1998 . The log-likelihood value also indicates how suitable a particular model distribution is relative to the alternative models.
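The comparison of two candidate models by their log-likelihoods can be sketched as below, following the likelihood-ratio test described in Clauset2009. For brevity the sketch compares only a continuous power law against a shifted exponential, both of which have closed-form maximum-likelihood estimates, and it keeps the minimum value fixed rather than scanning it. The function names and the synthetic data are assumptions for illustration and do not reproduce the fitting pipeline of this paper.

```python
import math
import random


def loglik_power_law(tail, x_min):
    """Pointwise log-likelihoods under the MLE continuous power law on x >= x_min."""
    n = len(tail)
    alpha = 1.0 + n / sum(math.log(x / x_min) for x in tail)
    return [math.log((alpha - 1.0) / x_min) - alpha * math.log(x / x_min) for x in tail]


def loglik_exponential(tail, x_min):
    """Pointwise log-likelihoods under the MLE exponential shifted to start at x_min."""
    n = len(tail)
    lam = n / sum(x - x_min for x in tail)
    return [math.log(lam) - lam * (x - x_min) for x in tail]


def likelihood_ratio(ll_a, ll_b):
    """Return (R, p). R > 0 favours model A; p is the two-sided significance of R."""
    n = len(ll_a)
    diffs = [a - b for a, b in zip(ll_a, ll_b)]
    ratio = sum(diffs)
    mean = ratio / n
    var = sum((d - mean) ** 2 for d in diffs) / n
    p_value = math.erfc(abs(ratio) / math.sqrt(2.0 * n * var))
    return ratio, p_value


# Hypothetical data: synthetic power-law samples with exponent 2.5 and x_min = 1.
rng = random.Random(1)
x_min = 1.0
tail = [x_min * (1.0 - rng.random()) ** (-1.0 / 1.5) for _ in range(5000)]

R, p = likelihood_ratio(loglik_power_law(tail, x_min), loglik_exponential(tail, x_min))
print(f"log-likelihood ratio R = {R:.1f}, p = {p:.3g}")
```

A positive ratio indicates that the first model describes the tail better than the second; the p-value indicates whether the sign of the ratio can be trusted. Extending the sketch to the other candidates requires numerical optimization of their parameters.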
We first apply the maximum likelihood ratio method to the entire set of publications regardless of discipline. Unexpectedly, we observe a mixture of three distributions instead of a single dominant model (Fig. 1a). More specifically, we observe a mixture of log normal (LN), power law with an exponential cut-off (PLE), and basic power law (PL) as the best fits of the probability density of the yearly citation count. The other three distributions are observed only in negligible numbers. The estimated power-law exponent of the yearly citation count varies over a wide range (Fig. 1b). This result is also supported by the visual inspection of the probability densities, whose curves are widely spread (Fig. 1c). Therefore, it is hard to conclude that the yearly citation count follows any universal distribution across the different publication years and cited years.
One should note that this observation does not completely exclude the possibility of universality. It was reported that one can obtain a distribution similar to a power law by stacking many log-normal distributions with different means Stringer2008 . In other words, this mixture of power laws (with or without an exponential cut-off) and log normals may result from the superposition of many log-normal distributions. Indeed, the mean citation count per piece of academic literature varies largely across disciplines Waltman2011 ; thus, academic citation is an exemplar of the stacking of multiple distributions with different means. To resolve this puzzle, we perform a similar likelihood ratio analysis that takes into account the differences in citation behavior between academic fields. Fig. 2 shows the number of disciplines classified into each model distribution. As suspected, we observe that the log normal dominates across the disciplines for every year and at both classification levels (see Fig. 2 c and f). The power law with an exponential cut-off is occasionally detected (see Fig. 2 b and e), whereas the basic power law is rarely identified (see Fig. 2 a and d). Note that the counts for subject categories show a clearer disparity than those for subject areas (compare Fig. 2 a–c and d–f). Stacking more distributions makes it harder to determine the distribution precisely.
2.3 Journal- and Time- normalized citation score
To proceed with an in-depth analysis of the citation distribution, we stress the fact that the mean of the yearly citation count also varies largely by journal (see Fig. 3), implying the existence of inherited citations due to the prestige of the journal Lariviere2010 ; Stegehuis2015 . This background effect itself may make it unfair to directly compare the citation counts of articles from different journals and years. Moreover, an ageing effect that consistently reduces the preference for citation was also reported Eom2011 ; Hajra2005 . To compensate for such over-representation and ageing effects, we propose the rescaled measure of citation as follows:

$$\tilde{c}_{i,t} = \frac{c_{i,t}}{\frac{1}{|S_i|}\sum_{j \in S_i} c_{j,t}},$$

where $c_{i,t}$ is the citation count of article $i$ in the cited year $t$, and $S_i$ is the set of articles published in the same journal and publication year ($y_i$) as article $i$. The rescaled citation $\tilde{c}_{i,t}$ represents the relative excellence of a piece of academic literature among the most similar publications in terms of age and journal.
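A rescaling of this kind can be sketched in a few lines. The sketch assumes that the rescaled count is the raw yearly count divided by the mean yearly count of all articles from the same journal and publication year; this reading, the tuple layout of the records, and the function name are assumptions made for illustration. Articles that received no citation in a given year must appear with a count of zero so that they enter the group mean.

```python
from collections import defaultdict


def rescale_citations(records):
    """Rescale yearly citation counts by the journal- and year-specific mean.

    `records` is a list of (article_id, journal, pub_year, cited_year, count).
    Returns {(article_id, cited_year): count / group_mean}, where the group is
    every article from the same journal and publication year in that cited year.
    """
    totals = defaultdict(float)
    sizes = defaultdict(int)
    for _, journal, pub_year, cited_year, count in records:
        key = (journal, pub_year, cited_year)
        totals[key] += count
        sizes[key] += 1

    rescaled = {}
    for article_id, journal, pub_year, cited_year, count in records:
        key = (journal, pub_year, cited_year)
        mean = totals[key] / sizes[key]
        if mean > 0:  # groups without any citation are left undefined
            rescaled[(article_id, cited_year)] = count / mean
    return rescaled


# Hypothetical records: two journals with very different citation levels.
records = [
    ("a", "J1", 2000, 2005, 3),
    ("b", "J1", 2000, 2005, 1),
    ("c", "J2", 2000, 2005, 30),
    ("d", "J2", 2000, 2005, 10),
]
print(rescale_citations(records))
```

In this toy example, articles "a" and "c" obtain the same rescaled value even though their raw counts differ by a factor of ten, which is the kind of journal effect the normalization is meant to remove.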
Unlike for the raw citation count, we find that a single distribution dominates for our rescaled citation measure over the entire range of publication and citation years (Fig. 4a). Across all citation and publication years, the most plausible distribution is the power law with an exponential cut-off, except for four pairs of citation and publication years (which show a stretched exponential; only 1.6% of all pairs). Considering the complex mixture of distributions observed for the raw citation count, such dominance is noteworthy. The estimated power-law exponents of the rescaled citation count for those distributions converge to a narrow range (Fig. 4b). This finding is also visually supported by the probability densities themselves, whose curves are more tightly gathered across the years than those of the raw citation count (compare Fig. 4c with Fig. 1c).
Our analysis of the rescaled citation count that considers the differences in citation behavior between academic fields shows remarkable regularity as well (Fig. 5). Most disciplines (subject areas and subject categories) exhibit the power law with an exponential cut-off as the best-fit distribution (see Fig. 5 b and e). The other two distributions are rarely detected (see Fig. 5 a, c, d, f). One should note that the dominance of a single best-fit model, log normal for the raw count and power law with an exponential cut-off for the rescaled count, is more prominent for the rescaled count than for the raw count. Specifically, 92.4% of the distributions of the rescaled count are identified as power law with an exponential cut-off among the subject areas (6 313 out of 6 831 distributions), whereas only 63.7% of the distributions of the raw count show log-normal behavior (4 355 out of 6 831 distributions). The analysis across subject categories shows a somewhat weaker but similar result: only 73.0% of the distributions of the raw count are found to be log normal (57 040 out of 78 177 distributions), yet 83.4% of the distributions of the rescaled count are power law with an exponential cut-off.
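The distribution totals quoted above can be cross-checked with simple arithmetic. The sketch below assumes that one distribution is fitted per discipline and per pair of publication year and cited year, with the cited year not earlier than the publication year and both years in the valid timestamp range 1996 to 2017; this reading is an assumption, adopted because it reproduces the quoted totals for 27 subject areas and 309 subject categories.

```python
# Assumed range of valid years (taken from the timestamp validity rule).
first_year, last_year = 1996, 2017
n_years = last_year - first_year + 1

# Number of (publication year, cited year) pairs with cited year >= publication year.
n_pairs = n_years * (n_years + 1) // 2

n_area_distributions = 27 * n_pairs       # 27 subject areas
n_category_distributions = 309 * n_pairs  # 309 subject categories


def share(part, whole):
    """Percentage share rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


ple_share_areas = share(6313, n_area_distributions)
ln_share_categories = share(57040, n_category_distributions)
stretched_share_pairs = share(4, n_pairs)

print(n_pairs, n_area_distributions, n_category_distributions)
print(ple_share_areas, ln_share_categories, stretched_share_pairs)
```

Under this reading, the total of 6 831 distributions corresponds to the subject areas and the total of 78 177 to the subject categories.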
Although the raw citation count can hardly be considered to follow a power law (with an exponential cut-off) according to the aforementioned observations, it is still worthwhile to measure the power-law exponent because it can be used as a proxy for the heterogeneity of probability distributions Hu2008 . The above estimation of the exponent suggests that the raw citation distribution becomes more heterogeneous as time passes after publication (Fig. 1c). The exponent of the normalized measure changes less than that of the raw citation distribution (Fig. 4c). A logical next step is to examine the degree of fluctuation across disciplines. In Fig. 6, we find that the exponent of the rescaled citation count not only fluctuates less with the time difference between the year of citation and the year of publication but is also more stable across disciplines than that of the raw citation count. Therefore, the rescaled measure is nearly free from journal and year effects, making it possible to bring a large amount of scientific literature from different years and journals together for analysis, away from the inherited impact of the journals.
2.4 Memory effect of raw and rescaled citation counts
Identifying emerging concepts in science and technology is an obviously desirable goal for both researchers and policymakers. Scanning highly cited papers is a common tool for detecting emerging or breakthrough concepts Oppenheim1978 ; Aksnes2003 ; Schneider2017 ; however, citation behaviors differ between disciplines Waltman2011 . Thus, the definition and interpretation of a highly cited paper are complicated. Moreover, the ageing effect makes it more difficult to compare citations from different years Glanzel2004 ; Mingers2006 ; Bouabid2011 ; Eom2011 ; Hajra2005 . The proposed rescaled measure successfully reduces the influence of discipline and ageing (see Fig. 5 and Fig. 6, respectively). In short, we successfully minimize the discussed fluctuations with a simple normalization.
One remaining property of citation behavior is the rich-get-richer phenomenon Borner2004 ; Wang2014 , which is possibly influenced by the fame of the journal in which a paper appears, as papers published in high-profile journals easily obtain early citations. To probe this, we compute Pearson’s correlation between the citation counts in two different cited years over the list of papers published in a given year. As expected, the raw citation count shows a strong correlation between two citation years (Fig. 7a–d). The influence of early citations lasts more than a decade; likewise, citations received more than a decade after publication still correlate strongly with citations in later periods. Considering that later citations are influenced by earlier ones, the impact of early citations cannot be neglected. In contrast, the sequence of rescaled citation counts shows insignificant correlations across the cited years (Fig. 7a–d). Additionally, the impact of early citations is almost zero, which can be regarded as the absence of an early-citation effect (see the correlations of Cited year 2 in Fig. 7). Gathering up the threads, we find that the influence of initial citations lasts long, but it can be neutralized by our normalization method.
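The correlation between the citation counts of the same papers in two cited years can be sketched as follows. The data layout (a dictionary mapping each article to its yearly counts), the function names, and the treatment of missing years as zero citations are assumptions for illustration; the same function applies unchanged to raw or rescaled counts.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def year_correlation(citations, year_a, year_b):
    """Correlate the counts of the same articles in two cited years.

    `citations` maps article_id -> {cited_year: count}; a missing year is
    treated as zero citations.
    """
    articles = sorted(citations)
    xs = [citations[a].get(year_a, 0) for a in articles]
    ys = [citations[a].get(year_b, 0) for a in articles]
    return pearson(xs, ys)


# Hypothetical counts for four papers published in the same year.
citations = {
    "a": {2001: 10, 2010: 8},
    "b": {2001: 2, 2010: 1},
    "c": {2001: 5, 2010: 6},
    "d": {2001: 0, 2010: 1},
}
print(f"{year_correlation(citations, 2001, 2010):.3f}")
```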
In this study, we investigated the structure of academic citation through a massive history of citation metadata over the past two decades, highlighting the influence of the journals’ prestige. Our findings suggest that citation evolution is affected not solely by the influence of the paper itself but also by the overall influence of its attributes: discipline, ageing, and early citations due to the journal’s prestige. We also supply additional evidence for the influence of journals on citations, which supports the findings of previous studies Stegehuis2015 ; Didegah2013 ; Tahamtan2016 . We believe that extending our analysis to various agents, e.g. the impact of countries, authors, institutes, and disciplines, is necessary to understand the ecosystem of science and technology more deeply, yet we leave these tasks for further study.
Our approach also has notable implications for policy-making, especially when combined with other elaborate methodologies Shibata2008 . While evaluations of scientific investments are conventionally based on research achievements, e.g., the number of citations, publications, and reputation among colleagues, citation boosting by the fame of authors, journals, countries, and other agents is easily overlooked. Even though we investigated only the influence of journals, our results imply that the halo effect, i.e. citations attributable to the environment, lasts for a long time. Such an impact may also reduce the accuracy of evaluations; thus, a comprehensive understanding of these factors is required to set an unbiased and fair standard. Beyond the impact on citations examined in this study, a myriad of online resources responding to science and technology can also be rescaled in a similar spirit, because altmetric scores can also be influenced by the fame of early spreaders. Going one step further, we would like to emphasize that data-driven analysis should be accompanied by proper normalization and aptly integrated with content-oriented and qualitative approaches that span the entire process of knowledge accumulation Waltman2016 ; Leydesdorff2016a . If this task is accomplished, the resulting synergy will bring benefits for the application of citation analysis. Finally, we hope that our approach sheds light on an unbiased understanding of the citation dynamics of science and technology in the future.
4 Author’s contribution
Jinhyuk Yun: Conceived and designed the analysis; Collected the data; Performed the analysis; Wrote the paper. Sejung Ahn: Conceived and designed the analysis; Wrote the paper. June Young Lee: Conceived and designed the analysis; Collected the data; Wrote the paper.
This work was supported by the National Research Foundation of Korea Grant funded by the Korean Government through Grant No. NRF-2017R1E1A1A03070975 (J.Y.; S.A.) and the Korea Institute of Science and Technology Information. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
- (1) M. Mitzenmacher, A brief history of generative models for power law and lognormal distributions, Internet mathematics 1 (2) (2004) 226–251.
- (2) R. Albert, H. Jeong, A.-L. Barabási, Internet: Diameter of the world-wide web, Nature 401 (6749) (1999) 130.
- (3) H. Jeong, B. Tombor, R. Albert, Z. Oltvai, A.-L. Barabási, The large-scale organization of metabolic networks, Nature 407 (6804) (2000) 651.
- (4) R. Albert, H. Jeong, A.-L. Barabási, Error and attack tolerance of complex networks, Nature 406 (6794) (2000) 378–382.
- (5) A. Barrat, M. Weigt, On the properties of small-world network models, The European Physical Journal B-Condensed Matter and Complex Systems 13 (3) (2000) 547–560.
- (6) A. D. Broido, A. Clauset, Scale-free networks are rare, arXiv preprint arXiv:1801.03400.
- (7) E. S. Vieira, J. A. Gomes, Citations to scientific articles: Its distribution and dependence on the article features, Journal of Informetrics 4 (1) (2010) 1–13.
- (8) J. Ruiz-Castillo, The evaluation of citation distributions, SERIEs 3 (1-2) (2012) 291–310.
- (9) L. Bornmann, L. Leydesdorff, Skewness of citation impact data and covariates of citation distributions: A large-scale empirical analysis based on web of science data, Journal of Informetrics 11 (1) (2017) 164–175.
- (10) A. Clauset, C. R. Shalizi, M. E. Newman, Power-law distributions in empirical data, SIAM review 51 (4) (2009) 661–703.
- (11) S. Redner, How popular is your paper? an empirical study of the citation distribution, The European Physical Journal B-Condensed Matter and Complex Systems 4 (2) (1998) 131–134.
- (12) S. Redner, Citation statistics from 110 years of physical review, Physics today 58 (6) (2005) 49–54.
- (13) D. d. S. Price, A general theory of bibliometric and other cumulative advantage processes, Journal of the Association for Information Science and Technology 27 (5) (1976) 292–306.
- (14) P. Albarrán, J. A. Crespo, I. Ortuño, J. Ruiz-Castillo, The skewness of science in 219 sub-fields and a number of aggregates, Scientometrics 88 (2) (2011) 385–397.
- (15) M. Brzezinski, Power laws in citation distributions: evidence from scopus, Scientometrics 103 (1) (2015) 213–228.
- (16) M. L. Wallace, V. Larivière, Y. Gingras, Modeling a century of citation distributions, Journal of Informetrics 3 (4) (2009) 296–303.
- (17) M. Thelwall, P. Wilson, Distributions for cited articles from individual subjects and years, Journal of Informetrics 8 (4) (2014) 824–839.
- (18) M. Thelwall, Citation count distributions for large monodisciplinary journals, Journal of Informetrics 10 (3) (2016) 863–874.
- (19) W. Glänzel, Towards a model for diachronous and synchronous citation analyses, Scientometrics 60 (3) (2004) 511–522.
- (20) J. Mingers, Q. L. Burrell, Modeling citation behavior in management science journals, Information processing & management 42 (6) (2006) 1451–1464.
- (21) H. Bouabid, Revisiting citation aging: a model for citation distribution and life-cycle prediction, Scientometrics 88 (1) (2011) 199.
- (22) Elsevier. Scopus custom data [online] (Accessed: 2017-04-22).
- (23) Q. Wang, L. Waltman, Large-scale analysis of the accuracy of the journal classification systems of web of science and scopus, Journal of Informetrics 10 (2) (2016) 347–364.
- (24) SCImago. Scimago journal & country rank [online] (Accessed: 2017-08-30).
- (25) A. J. Gómez-Núñez, B. Vargas-Quesada, F. de Moya-Anegón, W. Glänzel, Improving scimago journal & country rank (sjr) subject classification through reference analysis, Scientometrics 89 (3) (2011) 741.
- (26) M. J. Stringer, M. Sales-Pardo, L. A. N. Amaral, Effectiveness of journal ranking schemes as a tool for locating information, PLoS One 3 (2) (2008) e1683.
- (27) L. Waltman, N. J. van Eck, T. N. van Leeuwen, M. S. Visser, A. F. van Raan, Towards a new crown indicator: An empirical analysis, Scientometrics 87 (3) (2011) 467–481.
- (28) V. Larivière, Y. Gingras, The impact factor’s matthew effect: A natural experiment in bibliometrics, Journal of the American Society for Information Science and Technology 61 (2) (2010) 424–427.
- (29) C. Stegehuis, N. Litvak, L. Waltman, Predicting the long-term citation impact of recent publications, Journal of informetrics 9 (3) (2015) 642–657.
- (30) Y.-H. Eom, S. Fortunato, Characterizing and modeling citation dynamics, PloS one 6 (9) (2011) e24926.
- (31) K. B. Hajra, P. Sen, Aging in citation networks, Physica A: Statistical Mechanics and its Applications 346 (1-2) (2005) 44–48.
- (32) H.-B. Hu, X.-F. Wang, Unified index to quantifying heterogeneity of complex networks, Physica A: Statistical Mechanics and its Applications 387 (14) (2008) 3769–3780.
- (33) C. Oppenheim, S. P. Renn, Highly cited old papers and the reasons why they continue to be cited, Journal of the Association for Information Science and Technology 29 (5) (1978) 225–231.
- (34) D. W. Aksnes, Characteristics of highly cited papers, Research evaluation 12 (3) (2003) 159–170.
- (35) J. W. Schneider, R. Costas, Identifying potential “breakthrough” publications using refined citation analyses: Three related explorative approaches, Journal of the Association for Information Science and Technology 68 (3) (2017) 709–723.
- (36) K. Börner, J. T. Maru, R. L. Goldstone, The simultaneous evolution of author and paper networks, Proceedings of the National Academy of Sciences 101 (suppl 1) (2004) 5266–5273.
- (37) J. Wang, Unpacking the matthew effect in citations, Journal of Informetrics 8 (2) (2014) 329–339.
- (38) F. Didegah, M. Thelwall, Which factors help authors produce the highest impact research? collaboration, journal and document properties, Journal of Informetrics 7 (4) (2013) 861–873.
- (39) I. Tahamtan, A. S. Afshar, K. Ahamdzadeh, Factors affecting number of citations: a comprehensive review of the literature, Scientometrics 107 (3) (2016) 1195–1225.
- (40) N. Shibata, Y. Kajikawa, Y. Takeda, K. Matsushima, Detecting emerging research fronts based on topological measures in citation networks of scientific publications, Technovation 28 (11) (2008) 758–775.
- (41) L. Waltman, A review of the literature on citation impact indicators, Journal of Informetrics 10 (2) (2016) 365–391.
- (42) L. Leydesdorff, L. Bornmann, J. A. Comins, S. Milojević, Citations: Indicators of quality? the impact fallacy, Frontiers in Research Metrics and Analytics 1 (2016) 1.