Predicting publication productivity for researchers: a piecewise Poisson model

08/20/2019 ∙ by Zheng Xie, et al.

Abstract

Predicting the scientific productivity of researchers is a basic task for academic administrators and funding agencies. This study provides a model for the publication dynamics of researchers, inspired by the distribution of the number of publications per researcher. It is a piecewise Poisson model that analyzes and predicts the publication productivity of researchers by regression. The model rests on explaining this distribution as the result of an inhomogeneous Poisson process that can be approximated by a piecewise Poisson process. The model’s principle was validated on the high-quality dblp dataset, and its effectiveness was tested in predicting the publication productivity of the majority of researchers and the evolutionary trend of their productivity. Tests to confirm or disconfirm the model are also proposed. The model has the advantage of providing results in an unbiased way; it is thus useful for funding agencies that evaluate a vast number of applications with a quantitative index on publications.

Keywords: Scientific publication, Productivity prediction, Data modelling.

Introduction

Scientific fields such as informetrics, scientometrics, and bibliometrics have established a range of models and methods to evaluate the impact of scientific publications, and then to predict the scientific success of researchers[1]. Although publication productivity correlates with scientific success, most attention on this topic has concentrated on citation-based indexes and on the h-index proposed by Hirsch[2], a popular measure of scientific success. The h-index is the maximum value of h such that a researcher has produced h publications that have each been cited at least h times. Its popularity is attributable to its simplicity and to its addressing both the productivity and the citation impact of publications[3].
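The definition above can be made concrete with a minimal sketch. The function name `h_index` and the sample citation counts are illustrative assumptions, not taken from the paper.

```python
def h_index(citations):
    """Return the largest h such that at least h papers have >= h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank      # the top `rank` papers all have at least `rank` citations
        else:
            break         # further papers have even fewer citations
    return h


# Hypothetical citation counts for one researcher's five papers.
print(h_index([10, 8, 5, 4, 3]))
```

Sorting once and scanning keeps the computation at O(n log n), which is enough for per-researcher use.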

The success in predicting citation-based indexes and the h-index can be thought to result from the cumulative advantage in receiving citations found by Price in 1965[4], which has been extended into a general theory of bibliometric and other cumulative advantage processes[5, 6, 7]. From the perspective of statistics, the success is due to the predictable components of these indexes that can be extracted via autoregression. In more concrete terms, the current h-index, the number of annual citations, and the number of five-year citations are found to be positive predictors of these indexes[8, 9].

The cumulative advantage in producing publications is weaker than that in receiving citations. Empirically, it shows up as the tail of the distribution of the number of publications produced by a group of researchers being much shorter than that of the citation distribution of those researchers[10]. This study also shows that the productivity of researchers in the dblp dataset does not have a predictable component that can be extracted via autoregression. For the time series of an individual researcher’s cumulative productivity, the autocorrelation coefficients with a lag larger than 1 are almost all smaller than 0.5, suggesting that a researcher’s productivity cannot be predicted from his or her historical publications alone. Therefore, the critical factor behind the successful prediction of citation-based indexes and the h-index is absent from the task of predicting publication productivity.

Is there any predictability in publication patterns? Given the factors involved in publishing, such as the intrinsic value of research work, timing, and the publishing venue, finding regularities in the publication history of researchers is an elusive task. Lehman’s Age and Achievement[11] is probably the most comprehensive attempt to empirically determine the changes in researchers’ creativity, as reflected by changes in their publication productivity. In network science, age and achievement correspond to the aging phenomenon and the cumulative advantage, which dominate the evolution of coauthorship networks[12, 13]. Hence, productivity has been theoretically expressed as a curvilinear function of age[14]. This theoretical result suits productive researchers with long research careers. However, it cannot fit the productivity evolution of many researchers in the dblp dataset analyzed here.

Empirical datasets from several disciplines show that the number of a researcher’s publications approximately follows a generalized Poisson distribution with a fat tail[15]. Can this feature be reproduced by dynamical random models? Previous studies show that this distribution can be thought of as a mixture of Poisson distributions[16]. Samples that follow the same Poisson distribution can be regarded as drawn from the same population. This means that researchers can be partitioned into several populations, each of which has a certain homogeneity in publication patterns. Finding such a partition would help to reveal the mystery of publication patterns, which inspires this study.

We partitioned the researchers in the dblp dataset into several subsets, each of which consists of the researchers with the same number of publications produced in a given time interval. This partition eliminates the diversity in publishing experience. For each subset, we found that the number of publications of a member follows a Poisson distribution in each of the following short time intervals. This inspired us to propose a piecewise Poisson model to find significant predictors of publication productivity. The finding is that, for each subset, there is a significant relationship between time and publication productivity. This relationship allows us to infer the future productivity of researcher subsets. We provide two methods to validate the prediction results of our model: one for the evolutionary trend of productivity and one for the quantitative distribution of publications.

This paper is organized as follows. The literature review and the empirical data are described in Sections 2 and 3. The model is described in Sections 4-6, where the experiments and comparisons with previous results are also analyzed. The results are discussed and conclusions are drawn in Section 7.

Literature review

There are three main aspects to the prediction of scientific success: the h-index, citation-based indexes, and publication productivity. Although our study focuses on predicting the publication productivity of researchers, reviewing the methods for the first two aspects helps to determine which of those methods can and cannot be applied to the third.

As a popular measure of scientific success, the h-index of researchers has attracted considerable attention as a prediction target. Acuna et al. analyzed the data of 3,085 neuroscientists, 57 Drosophila researchers, and 151 evolutionary scientists by linear regression with elastic net regularization. They presented a formula to predict the h-index, and indicated that the current h-index is the most significant positive predictor, compared with the number of current papers, the years since publishing the first paper, etc.[8]. Dong et al. used standard linear regression and logistic regression to analyze more features, such as the average citations of an author’s papers and the number of coauthors[17]. McCarty et al. analyzed the coauthorship data of 238 authors collected from the Web of Science, and showed that the number of coauthors and their h-index are also positive predictors[18].

The number of received citations is a widely used measure of success for publications and researchers. To predict highly cited publications based only on short-term citation data, Mazloumian applied a multi-level regression model[19], Wang et al. derived a mechanistic model[20], Newman defined z-scores[9], Gao et al. used a Gaussian mixture model[21], Pobiedina applied link prediction[22], and Abrishami et al. used deep learning[23]. Together with the impact factors of journals, Stern and Abramo each applied linear regression models to this prediction task[25, 24], and Kosteas introduced the rankings of journals[26]. To improve prediction precision, information about authors and the contents of publications has been used: Bornmann et al. added publications’ length[27]; Bai et al. applied maximum likelihood estimation, and introduced the aging of publications’ impact[28]; Sarigol et al. used random decision forests, and introduced specific characteristics of coauthorship networks (e.g. centrality)[29]; Yu et al. provided a stepwise regression model synthesizing specific features of publications, authors, and journals[30]; Klimek et al. used the centrality measures of term-document networks[31].

Returning to the prediction of publication productivity, studies on this aspect are quite few compared with those on the h-index and citation-based indexes. Empirical studies found the cumulative advantage in producing publications and the aging of researchers’ creativity[32, 33]. Laurance et al. analyzed the publications of 182 researchers using the Pearson correlation coefficient, and found that pre-PhD publication success strongly correlated with long-term success[34]. On the theoretical side, Lehman concluded that achievement tends to be a curvilinear function of age. From the onset of a researcher’s career, productivity tends to increase rapidly, reaches a peak at the most productive age, and thereafter slowly declines with aging[11]. Simonton provided a formula to model this process[14].

The aforementioned methods for citation-based indexes and the h-index all rely on the positive correlation between the current indexes and their history. Essentially, the mechanism underlying their success is the cumulative advantage on those indexes. However, the effect of cumulative advantage in producing publications is not so strong: the tail of the distribution of the number of publications produced by a group of researchers is much shorter than that of the citation distribution of those researchers. Therefore, prediction methods for publication productivity would be quite different from those for citation-based indexes and the h-index.

The data

In order to predict publication productivity for the majority of researchers in a given dataset, the provided model needs clean training data containing enough researchers with enough publications. For example, the experiment shown in Section 5 requires enough researchers with a given number of publications for regression to be applicable. Therefore, the model needs a large dataset spanning a long time interval. Meanwhile, name ambiguities exist in bibliographic data, and manifest themselves in two ways: one person is identified as two or more entities (splitting error); two or more persons are identified as one entity (merging error)[35, 36]. Merging errors would generate names with far more publications than the ground truth, invalidating the prediction results of the model, because their effects would be significant on the researcher subsets whose members have many publications. Therefore, a dataset with limited errors is required.

The dblp computer science bibliography provides a dataset satisfying the above requirements. It consists of open bibliographic information on major journals and proceedings of computer science (https://dblp.org). The dataset has been manually cleaned, and certain methods of name disambiguation have also been applied to it. For example, ORCID information has been used regularly to correct numerous cases of homonyms and synonyms. We extracted parts of the data at certain time intervals as training and testing data (Table 1). In total, these parts consist of 220,344 publications in 1,586 journals and proceedings, produced by 328,690 researchers in the years from 1951 to 2018. Sets 1 and 2 are used to extract the historical publication quantity of researchers in Sets 3 and 4 respectively. Set 5 is used as training data. Sets 6 and 7 are used to test the prediction results for the researchers in Sets 3 and 4 respectively. Given the size and the time span of the analyzed dataset, this study should not be treated as a mere case study. The provided model is at least suitable for the community of computer science. Note that the term “researcher” in this paper refers to an author in the dblp dataset.

Dataset  Time interval  Researchers  Publications  Journals  Pubs. per researcher  Authors per pub.
Set 1    1951–1995           20,781        20,666       346                 1.556             1.565
Set 2    1951–2000           38,149        35,643       542                 1.571             1.681
Set 3    1995                 3,709         2,268       160                 1.137             1.859
Set 4    2000                 5,741         3,600       257                 1.184             1.888
Set 5    1995–2009           87,140        62,636       931                 1.538             2.139
Set 6    1996–2013          116,231        80,193     1,150                 1.557             2.257
Set 7    2001–2018          301,741       184,701     1,495                 1.733             2.831

Columns: the time interval of the data, the number of researchers, the number of publications, the number of journals, the average number of publications per researcher, and the average number of authors per publication.

Table 1: Certain subsets of the dblp dataset.

This study is data-driven, inspired by the features of the analyzed data. The quantitative distributions of publications produced by some groups of researchers in the dblp dataset have a fat tail. The Kolmogorov-Smirnov (KS) test rejects the hypothesis that they are Poisson distributions (Fig. 1). The emergence of their fat tail can be explained as a result of the cumulative advantage in producing publications or of the diversity of researchers’ abilities emerging over time[13]. This feature also appears in other empirical datasets of publications[16], which supports the generality of the provided model. In more detail, previous studies show that the distributions feature a trichotomy, comprising a generalized Poisson head, a power-law middle part, and an exponential cutoff[37].

Figure 1: The quantitative distribution of researchers’ publications. Given a year from to , the considered researchers are those who have publications at and no more than 10 publications at the years from to , constituting of the total researchers with publications at . In all except four cases, the KS test rejects that the quantitative distribution of the considered researchers’ publications at the year (red circles) is a Poisson distribution (blue lines) at the significance level 5% ().

The trichotomy can be derived from a range of “coin flipping” behaviors, where the probability of observing “head” is dependent on observed events [38]. The event of producing a publication can be regarded as analogous to observing “head”, where the probability of publishing is also affected by previous events. Researchers would produce their second publication more easily than their first one. This is a cumulative advantage: research experience accumulates in the process of producing publications. It appears as the transition from the generalized Poisson head to the power-law part. The aging of researchers’ creativity works against cumulative advantage, and appears as the transition from the power-law part to the exponential cutoff.

The quantitative distributions of researchers’ publications can be fitted by a mixture of Poisson distributions[16]. Therefore, we could expect to partition researchers into specific subsets, such that the quantitative distribution of publications produced by the researchers of each subset is a Poisson distribution. When restricted to a short time interval, the effects of cumulative advantage and aging would not be significant. However, the diversity of researchers in publication history cannot be eliminated only by shrinking the observation window in the time dimension. Therefore, we provide a splitting scheme to eliminate the diversity as follows.

Consider the researchers with no more than publications, and partition them into subsets, where is a given integer (for example, in the experiment shown in Section 4). For any , the -th subset or subset contains the researchers with publications at a given time interval (which is the time interval from the year to for the testing data in Section 4). The finding is that for each subset , the number of a researcher’s publications at each observed time interval, namely each year from 2001 to 2018, follows a Poisson distribution (Fig. 2). This inspires us to provide a piecewise Poisson model in the next section.
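The splitting scheme can be sketched as follows. This is a minimal illustration, not the authors' code: the names `partition_by_history` and `dispersion_index`, the input layout, and the choice to skip researchers with zero historical publications are all assumptions. The paper checks the Poisson hypothesis with the KS test; the variance-to-mean ratio used here is a cruder stand-in (it is close to 1 for Poisson data).

```python
from collections import defaultdict


def partition_by_history(history_counts, max_count):
    """Group researcher ids by their historical publication count.

    history_counts: dict mapping researcher id -> publications in the
    historical window (hypothetical input layout).
    Researchers with 0 or more than max_count publications are left out.
    """
    subsets = defaultdict(list)
    for researcher, count in history_counts.items():
        if 1 <= count <= max_count:
            subsets[count].append(researcher)
    return dict(subsets)


def dispersion_index(counts):
    """Sample variance divided by mean; roughly 1 for Poisson-like counts."""
    n = len(counts)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(counts) / n
    if mean == 0:
        return float("nan")
    variance = sum((c - mean) ** 2 for c in counts) / (n - 1)
    return variance / mean


# Made-up example: five researchers and their historical counts.
history = {"a": 1, "b": 3, "c": 1, "d": 50, "e": 0}
print(partition_by_history(history, max_count=40))
```

A dispersion index well above 1 within a subset would hint at remaining heterogeneity, in which case a formal test such as the KS test is the appropriate next step.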

Figure 2: Eliminating the diversity in the historical quantity of publications induces Poisson distributions. For the year from 1995 to 2012 and , the KS test cannot reject the hypothesis that the quantitative distribution of the publications at the year produced by subset (the researchers who have publications in the years from 1951 to ) is a Poisson distribution. Panels show the p-value of the test (red circles) and the baseline (blue lines).

The piecewise Poisson model

Partition researchers into subsets that are defined in above section, where is the largest number of publications that we can predict. For each subset, we considered its publication productivity at a time interval , where . Partition into intervals with cutpoints . The half-closed interval is referred to as the -th time interval or interval , where .

Denote the publication productivity of researcher subset at the time interval by for any possible and . This productivity is assumed to be

(1)

where is a covariate at the time interval , and is the effect of the covariate on researcher subset . The baseline of the publication productivity for researcher subset is , which is the publication productivity of subset at the first time interval . The index in Eq. (1) can be regarded as a dummy index. The proportionate changes the productivity according to the value of covariate .

Now let us recall the definition of the Poisson model, a generalized linear model of regression analysis[39]. It is used to model count data and contingency tables, and thus has the potential to predict publication productivity. The Poisson model assumes that the response variable $Y$ follows a Poisson distribution, and that the logarithm of its expected value can be expressed as a linear combination of covariates. Let $x$ be a vector of covariates, and $\beta$ be a vector of the covariates’ effects. The Poisson model takes the form

$\log E(Y \mid x) = \alpha + \beta^{T} x$ (2)

where $\alpha$ is a real constant, and $E(Y \mid x)$ is the conditional expected value of $Y$ given $x$.
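As a short worked restatement of the Poisson model, using the conventional symbols $Y$, $x$, $\alpha$, and $\beta$ (assumed here, since they are the standard notation for Poisson regression), the log-linear form is equivalent to a multiplicative effect on the expected count. With a single covariate, a unit increase in $x$ multiplies the expected count by $e^{\beta}$:

```latex
\log \mathbb{E}(Y \mid x) = \alpha + \beta^{\top} x
\quad\Longleftrightarrow\quad
\mathbb{E}(Y \mid x) = e^{\alpha}\, e^{\beta^{\top} x},
\qquad
\frac{\mathbb{E}(Y \mid x + 1)}{\mathbb{E}(Y \mid x)} = e^{\beta}
\quad \text{(one covariate)}.
```

This is why a negative fitted slope on time corresponds to productivity that decays by a constant factor per time step.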

For each subset , the provided model is exactly the Poisson model, because the number of publications produced by a researcher in subset at a year follows a Poisson distribution (Fig. 2). Therefore, we named the provided model the piecewise Poisson model. Note that Eq. (1) can be generalized to deal with several characteristics varying with , namely a vector . As a starting point, this study only considered the simplest case: one characteristic associated with time .

Now let us show how to calculate the publication productivity. Consider the training data consisting of the researchers having publications at the time interval and their publications at the time interval , where is an integer larger than . Given a parameter , the researchers with no more than publications in the training data are partitioned into subsets according to their number of publications at .

The publication productivity of subset at the -th time interval can be estimated as

(3)

where , , is the number of researchers with publications at the time interval , and is the number of publications produced by those researchers at the -th interval .
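The estimate described here, the number of publications produced by a subset in an interval divided by the number of researchers in that subset, can be sketched as below. The function name `productivity_table` and the input layout are hypothetical, chosen only for illustration.

```python
def productivity_table(subsets, interval_counts, n_intervals):
    """Average publications per researcher for each subset and interval.

    subsets: dict mapping subset index i -> list of researcher ids.
    interval_counts: dict mapping researcher id -> list with the number of
    publications in each of the n_intervals intervals (hypothetical layout).
    Returns a dict mapping i -> list of per-interval productivity estimates.
    """
    table = {}
    for i, members in subsets.items():
        n_i = len(members)
        if n_i == 0:
            continue  # no researchers, no estimate
        table[i] = [
            sum(interval_counts[r][j] for r in members) / n_i
            for j in range(n_intervals)
        ]
    return table


# Made-up example: one subset with two researchers observed over two intervals.
subsets = {1: ["a", "b"]}
interval_counts = {"a": [1, 0], "b": [0, 2]}
print(productivity_table(subsets, interval_counts, n_intervals=2))
```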

Recalling Eq. (1) and taking logs in Eq. (3), we obtained

(4)

where . Note that is a dummy variable, and for empirical data. Linear regression can then be used to calculate and . Eq. (4) describes the relationship between the publication productivity and the covariate .
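The regression step can be illustrated with a minimal sketch. It assumes that the logarithm of a subset's productivity is linear in a single time covariate, which matches the linear relationship between time and log productivity reported in the experiments; the function names and the synthetic numbers are hypothetical.

```python
import numpy as np


def fit_log_linear(times, productivity):
    """Least-squares fit of log(productivity) = alpha + beta * time."""
    times = np.asarray(times, dtype=float)
    productivity = np.asarray(productivity, dtype=float)
    if np.any(productivity <= 0):
        raise ValueError("productivity must be positive to take logs")
    # np.polyfit returns coefficients from highest degree down: slope, intercept.
    beta, alpha = np.polyfit(times, np.log(productivity), deg=1)
    return float(alpha), float(beta)


def predict_productivity(alpha, beta, t):
    """Predicted productivity exp(alpha + beta * t) at time(s) t."""
    return np.exp(alpha + beta * np.asarray(t, dtype=float))


# Synthetic data generated from alpha = 0.5, beta = -0.1 (made-up values).
times = [1, 2, 3, 4]
productivity = np.exp(0.5 - 0.1 * np.array(times))
alpha, beta = fit_log_linear(times, productivity)
print(alpha, beta, predict_productivity(alpha, beta, 5))
```

Fitting on the log scale requires strictly positive productivity values, which is consistent with the remark above about empirical data.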

For the majority of researchers in the training data in the next section, we found a significant relationship between and the covariate defined by (), which allows us to take as an estimate of subset ’s publication productivity at any time interval , where . Fig. 3 shows an illustration of the provided model.

Figure 3: An illustration of the piecewise Poisson model. The training data are used to calculate the number of researchers who have publications at the time interval and their publication quantity at the time interval for and . The predicted publication productivity is calculated by our model. The vector records the publication quantity of a researcher in subset at each time interval .

Based on the significant relationship found, we can infer the number of publications produced by subset at the time interval by the algorithm shown in Table 2. Denote the publication quantity of researcher in subset at the time interval by , then . The algorithm gives the predicted publication quantity of at the time interval . Due to its random nature, this algorithm cannot exactly predict the publication quantity of an individual researcher, but it can be suitable for a group of researchers. Note that the training data would not be enough for linear regression when is large. Therefore, the model can only predict publication quantities that are not very high, for example, no larger than 40 publications in the example shown in the next section.

Input: Parameter ; publication productivity , where , and .
For subset do:
    calculate and by applying linear regression to Eq. (4);
    for a researcher in subset do:
        initialize ;
        for do:
            sample an integer from the Poisson distribution with mean ;
            let ;
        let .
Output: the predicted productivity to any researcher .

Note that we only predict the publication productivity for the researchers with a historical publication quantity no larger than , where satisfies that for any researcher .

Table 2: Simulating the increasing process of publication quantity.
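The simulation in Table 2 can be sketched roughly as follows. Several details are assumptions made for illustration rather than the authors' specification: the Poisson mean is looked up by the researcher's current cumulative count, the covariate is the interval index, counts above the largest fitted subset reuse that subset's rate, and all names and parameter values are hypothetical.

```python
import numpy as np


def simulate_trajectory(initial_count, n_intervals, rate, max_count, rng):
    """Simulate one researcher's cumulative publication count.

    rate(count, j) is a hypothetical callable returning the Poisson mean for
    a researcher whose current cumulative count is `count` in interval j.
    """
    count = initial_count
    trajectory = []
    for j in range(1, n_intervals + 1):
        # Assumption: beyond max_count there is no fitted rate, so reuse the
        # rate of the largest fitted subset.
        lookup = min(count, max_count)
        count += int(rng.poisson(rate(lookup, j)))
        trajectory.append(count)
    return trajectory


def log_linear_rate(alpha, beta):
    """Build rate(count, j) = exp(alpha[count] + beta[count] * j)."""
    def rate(count, j):
        return float(np.exp(alpha[count] + beta[count] * j))
    return rate


rng = np.random.default_rng(0)
alpha = {i: np.log(0.2 * i) for i in range(1, 6)}  # made-up intercepts
beta = {i: -0.05 for i in range(1, 6)}             # made-up slopes
rate = log_linear_rate(alpha, beta)

# Average final count over a group, since single trajectories are random.
group = [simulate_trajectory(2, 10, rate, 5, rng) for _ in range(1000)]
print(np.mean([t[-1] for t in group]))
```

Because each run is random, only group-level summaries such as the mean or the distribution of final counts are meaningful, in line with the remark that the algorithm suits groups rather than individuals.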

Experiments

Now the model is applied to the dblp dataset. The training data are the researchers who have publications at the years from to and their publications from to . The testing data are the researchers who have publications at the year and their publications from to . Hence, the model parameters are , , , ,…, ,…, . So , and . The covariate , where .

The publication productivity is calculated based on the training data, where , and . So . For example, is the number of researchers with one publication at the years from to , and is the number of those researchers’ publications at the year . Then, we calculated and for by applying the linear regression to Eq. (4).

We found that for more than researchers in the training data (those with publications no larger than ), there is a significant linear relationship between the time (unit: year) and the logarithm of publication productivity (which is verified by the p-value of the test shown in Fig. 4). Thus, we can utilize to estimate the publication productivity for subset at each year from to . In the experiment here, we can only predict the publication quantity for 98.76% of the researchers in the testing data (who have no more than publications at the time interval ), because their predicted number of publications is no more than , namely the maximum publication quantity that can be predicted by the model. Two methods are provided to validate the effectiveness of the model as follows.

Figure 4: The relationship between publication productivity and time. Consider the researchers who have publications at the years from to , where . Calculate their publication productivity at the next year . The red squares show the productivity for from to and for each . The solid dot lines show the results predicted by the Poisson regression, and the dashed lines show confidence intervals. The p-value indicates that the relationship between publication productivity and time is significant for , , and .

Firstly, the prediction results of our model are validated by the correlation between the evolutionary trend of the empirical publication quantity and that predicted by our model. Let be the average number of publications produced by the subset (where ) of the testing data at the time interval , and be that predicted by the model. Fig. 5 shows the correlated trends of and on , given each year from to . Note that the index is treated as a dummy index. The correlation of the trends is measured by the Pearson correlation coefficient and Spearman’s rank correlation coefficient[40]. Fig. 5 also shows that both coefficients decrease over time, which indicates that the prediction precision of the model decreases over time. Note that the Pearson coefficient indicates the strength of the relationship between two variables only if the conditional expected value of each variable given the other is linear or approximately linear. The visual examinations shown in Fig. 5 support the validity of the correlation analysis here.
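The two coefficients used in this validation can be computed as in the sketch below. It uses only NumPy; the helper names and the sample vectors are hypothetical, and Spearman's coefficient is obtained as the Pearson coefficient of average ranks.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    return float(np.corrcoef(np.asarray(x, dtype=float),
                             np.asarray(y, dtype=float))[0, 1])


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        mask = values == v
        ranks[mask] = ranks[mask].mean()
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up empirical and predicted averages across subsets.
empirical = [1.0, 4.0, 9.0, 16.0]
predicted = [1.0, 2.0, 3.0, 4.0]
print(pearson(empirical, predicted), spearman(empirical, predicted))
```

In this example the relationship is monotone but not linear, so the Spearman coefficient is 1 while the Pearson coefficient is below 1, which illustrates why both are reported.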

Figure 5: Model-data fittings on the evolutionary trend of publication quantity. The red dots show the average number of publications produced by the subset of Set 4 at the time interval . The solid dot lines express this average number predicted by our model. Index is the Spearman’s rank correlation coefficient of and on , and is the Pearson correlation coefficient.

Secondly, the prediction results are validated by comparing the quantitative distribution of publications of each subset of the testing data (where those publications are produced at the time interval ) with the predicted one for any possible and . Fig. 6 shows that a fat tail emerges in the evolution of the ground-truth distributions, because a small fraction of researchers produced many publications. However, our model cannot predict such extreme productivity, because the training data are not enough for regression. In the example here, the predicted quantity is no more than . Therefore, the KS test rejects the hypothesis that the compared distributions are the same (see the p-value in Fig. 6), although they coincide in their head.

Figure 6: Model-data fittings on the quantitative distribution of researchers’ publications. Red circles show this distribution for the publications produced by the researchers of Set 4 at the time interval . Blue squares show the corresponding one predicted by our model. Index is the p-value of the KS test with the null hypothesis that the samples are drawn from the same distribution.

Comparisons with previous results

Firstly, we discussed the possibility of utilizing the prediction formula provided by Simonton[14]:

(5)

where . Parameter is termed the “ideation rate”, is termed the “elaboration rate”, , and represents the maximum number of publications a researcher can produce in his lifespan. This formula theoretically expresses a researcher’s publication productivity by a function of time . With the parameters in Reference[14], the shape of this curvilinear function is shown in Fig. 7.
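To visualize the kind of curve discussed here, the sketch below assumes the commonly cited two-exponential form of Simonton's model, p(t) = c (exp(-a t) - exp(-b t)), with a the ideation rate and b the elaboration rate. Both this form and the default values (a = 0.04, b = 0.05, c = 61) are illustrative assumptions and should be checked against Reference[14] before use.

```python
import math


def simonton_productivity(t, a=0.04, b=0.05, c=61.0):
    """Assumed form: p(t) = c * (exp(-a*t) - exp(-b*t)), t in career years."""
    return c * (math.exp(-a * t) - math.exp(-b * t))


def peak_age(a=0.04, b=0.05):
    """Career age maximizing p(t); obtained by setting dp/dt = 0."""
    return math.log(b / a) / (b - a)


# Tabulate the curve every ten career years with the illustrative parameters.
for year in range(0, 61, 10):
    print(year, round(simonton_productivity(year), 3))
print("peak at career age", round(peak_age(), 2))
```

Under this assumed form the curve starts at zero, rises to a single peak, and then declines slowly, which is the curvilinear shape described in the text.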

Figure 7: The publication productivity predicted by the formula in Eq. (5). The panel shows the curve of this formula with the parameters provided by Simonton: , , and .

As aforementioned, the cumulative advantage and the aging of creativity have impacts on researchers’ publication productivity. One can think that a researcher’s publication productivity is proportional to his or her publication quantity in the early stage of research: the more publications a researcher has, the higher his or her publication productivity. As age increases, creativity decreases and will dry up in the later period of research. The formula in Eq. (5) expresses this evolution of publication productivity.

Fig. 8 shows the publication productivity of the subset (where ) of the testing data at each year from to , which cannot be fitted by the formula in Eq. (5). One possible explanation of this discrepancy is the change in the composition of the researchers who produce publications. Note that the formula was proposed in 1984. Over the past thirty years, the number of master’s and doctoral students has increased dramatically. They contribute a large fraction of publications during their study periods. Many of them will not do research after graduation, and thus will not continue to produce publications. Therefore, the formula in Eq. (5) is unsuitable for describing the evolution of their productivity. It may still be suitable for productive scientists who have long research careers. However, in the dblp dataset, the number of these researchers is quite small, because more than researchers produced no more than publications.

Figure 8: The relationship between publication productivity and time. The red circles show the average number of the publications at each year from to , produced by the subset (where ) of the researchers of Set 4.

Secondly, we discussed the practicability of using a researcher’s historical publication productivity to predict his or her future productivity. To do so, we need to know whether the productivity has any predictable components that can be extracted via autoregression. Previous studies found that such components are significant positive predictors of citation-based indexes and the h-index, constituting the principal terms of the prediction formulae of those indexes.

The mechanism generating these autoregressive components is the cumulative advantage in receiving citations and in the evolution of the h-index. Previous empirical studies show that the number of citations received in the future depends on the number of citations already received[5]. However, the effect of cumulative advantage in producing publications is much weaker than that in receiving citations. This is reflected by the short tail of the quantitative distribution of a group of researchers’ publications, compared with that of the citation distribution of the same researchers[10].

In statistics, autoregressive models specify that the response variable depends linearly on its previous values plus a stochastic term. The advantage of those models is that not much information is required, only the series of the variable itself. If the autocorrelation coefficients of the response variable series are smaller than 0.5, autoregressive models are not suitable for the prediction task. That is, the coefficients of autoregressive components are not large enough to be significant predictors.

The autocorrelation coefficient of with lag (where ) is defined as

(6)

where is the mean of ’s elements[41]. We constructed a time series to record the quantitative information of publications for a researcher , where is the number of his cumulative publications produced at the time interval for .
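A minimal sketch of the computation follows, assuming the standard sample autocorrelation: the lagged sum of products of deviations from the mean, divided by the total sum of squared deviations. The function name and the example series are hypothetical.

```python
import numpy as np


def autocorrelation(series, lag):
    """Sample autocorrelation of a series at a given non-negative lag."""
    x = np.asarray(series, dtype=float)
    n = len(x)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(series)")
    deviations = x - x.mean()
    denominator = np.sum(deviations ** 2)
    if denominator == 0:
        return float("nan")  # a constant series has no defined autocorrelation
    return float(np.sum(deviations[: n - lag] * deviations[lag:]) / denominator)


# Made-up cumulative publication counts for one researcher.
cumulative = [1, 2, 3, 4, 5]
print([autocorrelation(cumulative, lag) for lag in range(0, 4)])
```

With this normalization the coefficient at lag 0 is always 1, and coefficients at larger lags shrink toward zero for short series.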

We calculated the autocorrelation coefficients of for any researcher in the testing data. We found that these coefficients with a lag larger than 1 are almost all smaller than 0.5. Therefore, for an individual researcher, his or her historical publication productivity is not sufficient to predict his or her future productivity. This indicates that autoregressive models may not be suitable for predicting publication productivity; thus the schemes of the successful prediction methods for citation-based indexes and the h-index may also be infeasible.

Figure 9: Autocorrelation coefficients of the time series on cumulative publication productivity. The time series for a researcher is the vector of the numbers of his or her publications over successive time intervals. Panels show the average of the coefficients on each subset of the researchers of Set 4. The index is the proportion of each subset.

Discussion and conclusions

We provided a model to predict the publication productivity of researchers. The model needs a large training dataset, but requires little information about researchers: only the production times of their publications. The model’s practicability was validated on the dblp dataset, where it exhibited its ability to predict publication productivity through good model-data fits on the evolutionary trend of productivity and on the quantitative distribution of publications. The model, whose prediction results are unbiased, may be useful for funding agencies in evaluating how likely applicants are to fulfil the quantitative publication targets stated in their applications.

This model offers convincing evidence that the publication patterns of many researchers are characterized by a piecewise Poisson process. Even where our model does not provide an exact productivity prediction for an individual researcher, it may still be of use in its ability to provide a satisfactory prediction for a group of researchers on average. The prediction results of our model offer some comfort by showing that the future of a group of researchers is not so random. The occasional rejection of a paper may feel unjust and indiscriminate, but for a group, such factors seem to average out, rendering the trajectories of researchers’ publication productivity relatively predictable.

The prediction precision of the model can be improved by utilizing more features of researchers, such as the network features of their coauthorship (degree, betweenness, centrality, etc.), because previous studies showed that research collaboration contributes to scientific productivity[44, 45, 46]. In addition, our model has the potential to be extended to assess the confidence level of prediction results, and thus has clear applicability to empirical research.

Yet little is known about the mechanisms governing the evolution of researchers’ publication productivity. The productivity of an individual researcher cannot be predicted by regression alone, as this study did for a group of researchers, because of the randomness of an individual’s research. This study exhibits that randomness through the relatively small autocorrelation coefficients of the time series of a researcher’s cumulative publication productivity. However, analyzing massive data to track scientific careers would help to advance our understanding of how researchers’ productivity evolves. Therefore, advanced algorithms are needed to jointly analyze the aforementioned data features together with further features, such as journals’ annual issue volume, impact factors, language, and so on.

Acknowledgments

The author thanks Professor Jinying Su at the National University of Defense Technology for her helpful comments and feedback. This work is supported by the National Natural Science Foundation of China (Grant No. 61773020) and the National Education Science Foundation of China (Grant No. DIA180383).

References

  •  1. Sinatra R, Wang D, Deville P, Song C, Barabási AL (2016) Quantifying the evolution of individual scientific impact. Science, 354(6312), aaf5239
  •  2. Hirsch JE (2005) An index to quantify an individual’s scientific research output. Proc Natl Acad Sci USA 102, 16569-16572.
  •  3. Schubert A (2007) Successive h-indices. Scientometrics, 70, 201-205.
  •  4. Price DJS (1965) Networks of scientific papers. Science 149(3683): 510-515.
  •  5. Price DJS (1976) A general theory of bibliometric and other cumulative advantage process. J Am Soc Inf Sci, 27(5): 292-306.
  •  6. Barabási AL, Albert R (1999) Emergence of scaling in random networks. Science, 286(5439): 509-512.
  •  7. Perc M (2014) The Matthew effect in empirical data. J R Soc Interface, 11: 20140378.
  •  8. Acuna DE, Allesina S, Kording KP (2012) Future impact: Predicting scientific success. Nature, 489(7415), 201.
  •  9. Newman MEJ (2014) Prediction of highly cited papers. Europhys Lett, 105(2), 28002.
  •  10. Xie Z, Xie ZL, Li M, Li JP, Yi DY (2017) Modeling the coevolution between citations and coauthorship of scientific papers. Scientometrics 112: 483-507.
  •  11. Lehman HC (2017) Age and achievement (Vol. 4970). Princeton University Press.
  •  12. Glänzel W (2014) Analysis of co-authorship patterns at the individual level. Transinformacao 26: 229-238.
  •  13. Xie Z, Ouyang ZZ, Li JP, Dong EM, Yi DY (2018) Modelling transition phenomena of scientific coauthorship networks. J Assoc Inf Sci Technol 69(2): 305-317.
  •  14. Simonton DK (1984) Creative productivity and age: A mathematical model based on a two-step cognitive process. Dev Rev, 4(1), 77-111.
  •  15. Xie Z, Li M, Li JP, Duan XJ, Ouyang ZZ (2018) Feature analysis of multidisciplinary scientific collaboration patterns based on pnas. EPJ Data Science 7: 5.
  •  16. Xie Z, Ouyang ZZ, Li JP (2016) A geometric graph model for coauthorship networks. J Informetr 10: 299-311.
  •  17. Dong Y, Johnson RA, Chawla NV (2016) Can scientific impact be predicted? IEEE Transactions on Big Data, 2(1), 18-30.
  •  18. Mccarty C, Jawitz JW, Hopkins A, Goldman A (2013) Predicting author h-index using characteristics of the co-author network. Scientometrics, 96(2), 467-483.
  •  19. Mazloumian A (2012) Predicting researchers’ scientific impact. Plos One, 7(11), 1-5.
  •  20. Wang D, Song C, Barabási AL (2013) Quantifying long-term scientific impact. Science, 342(6154), 127-132.
  •  21. Cao X, Chen Y, Liu KR (2016) A data analytic approach to quantifying scientific impact. J Informetr, 10(2), 471-484.
  •  22. Pobiedina N, Ichise R (2016) Citation count prediction as a link prediction problem. Appl Intell, 44(2), 252-268.
  •  23. Abrishami A, Aliakbary S (2019) Predicting citation counts based on deep neural network learning techniques. J Informetr, 13(2), 485-499.
  •  24. Abramo G, D’Angelo CA, Felici G (2019) Predicting publication long-term impact through a combination of early citations and journal impact factor. J Informetr, 13(1), 32-49.
  •  25. Stern DI (2014) High-ranked social science journal articles can be identified from early citation information. Plos One, 9(11), e112520.
  •  26. Kosteas VD (2018) Predicting long-run citation counts for articles in top economics journals. Scientometrics, 115(3), 1395-1412.
  •  27. Bornmann L, Leydesdorff L, Wang J (2014) How to improve the prediction based on citation impact percentiles for years shortly after the publication date? J Informetr, 8(1), 175-180.
  •  28. Bai XM, Zhang LI, Lee I (2019) Predicting the citations of scholarly paper. J Informetr, 13, 407-418.
  •  29. Sarigöl E, Pfitzner R, Scholtes I, Garas A, Schweitzer F (2014) Predicting scientific success based on coauthorship networks. EPJ Data Science, 3(1), 9.
  •  30. Yu T, Yu G, Li PY, Wang L (2014) Citation impact prediction for scientific papers using stepwise regression analysis. Scientometrics, 101(2), 1233-1252.
  •  31. Klimek P, Jovanovic AS, Egloff R, Schneider R (2016) Successful fish go with the flow: citation impact prediction based on centrality measures for term-document networks. Scientometrics, 107(3), 1265-1282.
  •  32. Newman M (2001) Clustering and preferential attachment in growing networks. Phys Rev E 64(2): 025102.
  •  33. Tomassini M, Luthi L (2007) Empirical analysis of the evolution of a scientific collaboration network. Physica A 385(2): 750-764.
  •  34. Laurance WF, Useche DC, Laurance SG, Bradshaw CJ (2013) Predicting publication success for biologists. BioScience, 63(10), 817-823.
  •  35. Milojević S (2013) Accuracy of simple, initials-based methods for author name disambiguation. J Informetr 7: 767-773.
  •  36. Xie Z (2019) A Bayesian model on the merging errors of coauthorship data. Physica A, 527, 121140.
  •  37. Xie Z (2019) A cooperative game model for the multimodality of coauthorship networks, Scientometrics, https://doi.org/10.1007/s11192-019-03183-z.
  •  38. Consul PC, Jain GC (1973) A generalization of the Poisson distribution. Technometrics 15(4): 791-799.
  •  39. Nelder JA, Wedderburn RW (1972) Generalized linear models. J R Stat Soc Ser A-G, 135(3), 370-384.
  •  40. Hollander M, Wolfe DA (1973) Nonparametric Statistical Methods. Wiley.
  •  41. Box GE, Jenkins GM, Reinsel GC, Ljung GM (2015) Time series analysis: forecasting and control. John Wiley & Sons.
  •  42. Cox DR (1972) Regression models and life-tables. J Roy Stat Soc B Met, 34(2), 187-202.
  •  43. Dennis W (1954) Predicting scientific productivity in later maturity from records of earlier decades. J Gerontol, 9(4), 465-467.
  •  44. Lee S, Bozeman B (2005) The impact of research collaboration on scientific productivity. Soc Stud Sci 35: 673-702.
  •  45. Ductor L (2015) Does co-authorship lead to higher academic productivity? Oxford B Econ Stat, 77(3), 385-407.
  •  46. Qi M, Zeng A, Li M, Fan Y, Di Z (2017) Standing on the shoulders of giants: the effect of outstanding scientists on young collaborators’ careers. Scientometrics, 111(3), 1839-1850.

Appendix A: The piecewise exponential model

The formula of the provided model is similar to that of the piecewise exponential model in survival analysis, which is defined as follows[42]. Assume that the duration $T$ of an event is a continuous random variable with probability density function $f(t)$. Let $F(t) = \Pr(T < t)$, which is the cumulative distribution function. It is the probability that the event has occurred by duration $t$. The survival function is defined as $S(t) = 1 - F(t)$, and the hazard function as $\lambda(t) = f(t)/S(t)$.
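The link between the hazard and the survival function can be made explicit with a short derivation. It is a standard identity rather than a result of this paper, and it assumes a non-negative duration with density $f$, survival function $S$ (so that $S(0) = 1$ and $f = -S'$) and hazard $\lambda = f/S$.

```latex
\lambda(t) = \frac{f(t)}{S(t)} = -\frac{S'(t)}{S(t)} = -\frac{d}{dt}\log S(t)
\quad\Longrightarrow\quad
S(t) = \exp\!\left(-\int_0^{t} \lambda(u)\,du\right)
```

When the hazard is constant, $\lambda(t) = \lambda$, this reduces to $S(t) = e^{-\lambda t}$, the exponential distribution that gives the piecewise exponential model its name.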

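Let $x_i$ be a vector of covariates for individual $i$, and $\beta$ be the vector of the covariates’ effects. The hazard function at $t$ for individual $i$ is assumed to be

$$ \lambda_i(t) = \lambda_0(t)\exp(x_i^{\top}\beta) \tag{7} $$

where $\lambda_0(t)$ is a baseline hazard function that describes the risk for an individual with $x_i = 0$, and $\exp(x_i^{\top}\beta)$ is the relative risk.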

Subdivide time into reasonably small intervals and assume that the baseline hazard is constant within each interval, leading to a piecewise exponential model

$$ \lambda_{ij} = \lambda_j \exp(x_i^{\top}\beta) \tag{8} $$

where $\lambda_{ij}$ is the hazard corresponding to individual $i$ at interval $j$, and $\lambda_j$ is the baseline hazard at interval $j$. Write $x_i$ as $x_{ij}$ and $\beta$ as $\beta_j$ to allow for a time-dependent effect of the predictor vector. Then, we would write

$$ \lambda_{ij} = \lambda_j \exp(x_{ij}^{\top}\beta_j) \tag{9} $$

which is the formula of the piecewise exponential model.
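A piecewise exponential model of this kind can be illustrated with a small sketch. It assumes a baseline hazard that is constant within each interval and scaled by a relative risk of the form exp(x · beta), and it uses the standard relation that the survival function is the exponential of minus the cumulative hazard. The function names `interval_hazards` and `survival`, the interval boundaries, and all numeric values are hypothetical choices for illustration.

```python
import math


def interval_hazards(baseline, x, beta):
    """Hazards baseline[j] * exp(x . beta) for one individual, one per interval."""
    relative_risk = math.exp(sum(xi * bi for xi, bi in zip(x, beta)))
    return [lam * relative_risk for lam in baseline]


def survival(t, cuts, hazards):
    """S(t) = exp(-cumulative hazard) under a piecewise-constant hazard.

    cuts = [c_0, c_1, ..., c_J] are interval boundaries and hazards[j]
    applies on the interval from c_j to c_{j+1}.
    """
    if len(cuts) != len(hazards) + 1:
        raise ValueError("need exactly one more boundary than hazards")
    if not cuts[0] <= t <= cuts[-1]:
        raise ValueError("t must lie within the covered time range")
    cumulative = 0.0
    for j, lam in enumerate(hazards):
        start, end = cuts[j], cuts[j + 1]
        if t <= start:
            break
        cumulative += lam * (min(t, end) - start)  # exposure in interval j
    return math.exp(-cumulative)


# Hypothetical values for illustration only.
cuts = [0.0, 1.0, 2.0, 3.0]
baseline = [0.10, 0.20, 0.15]
beta = [math.log(2.0)]          # a single covariate that doubles the hazard

hazards = interval_hazards(baseline, x=[1.0], beta=beta)
print(hazards, survival(1.5, cuts, hazards))
```

Here the covariate effect is the same in every interval; a time-dependent effect would replace the single `beta` with one coefficient vector per interval.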

Although Eq. (1) and Eq. (9) are similar in formula, they are essentially different. The index in Eq. (1) is the time, and index is about researcher subset. The regression is used to calculate and , which vary with researcher subset and are free of time . But in Eq. (9), the baseline and the effect is free of but depends on time .

Appendix B: Another example

The training data are the same as those in Section 4. The testing data are the researchers who have publications at the year and their publications from to . Hence, the model parameters are , , , ,…, ,…, . So , and . The covariate , where . We only predicted the publications for the researchers with no more than publications, who account for 98.8% of the researchers in the testing data here. Figs. 10 and 11 show the results of applying the validation methods in Section 5 to the predicted productivity by our model.

Figure 10: Model-data fittings on the evolutionary trend of publication quantity. The red dots show the average number of publications produced by the subset of Set 3 at the time interval . The lines with solid dots show this average number as predicted by the piecewise Poisson model.
Figure 11: Model-data fittings on the quantitative distribution of publications. Red circles show this distribution for the publications produced by the researchers of Set 3 at the time interval . Blue squares show the corresponding one predicted by our model.