1 Introduction
Research publications are one of the measures of the output of a research institute. In this paper, our objective is to find the factors which cause scholarly communities to engage in active research and publication. Nowadays, researchers rely heavily on digital libraries like ScienceDirect, ResearchGate, Web of Science, etc. In this study, we try to understand how the web activity of scientists impacts their research publications. The problem of discovering causal impact is ubiquitous in biological science, economics, public policy, and many aspects of our daily life that require logical reasoning and decision-making, see e.g., [22, 11]. Causal questions concern the mechanism behind the data, or predictions after a novel intervention is applied to the system, see [16].
When it comes to learning causality from data, we should be careful about the difference between statistical association and causation. [12] presented the example that when weather temperatures are high, the owner of an ice cream parlor may see high electric bills along with high sales, meaning there would be a strong association between the electricity bill and the revenue. However, the high electricity bill did not cause the high sales. In this case, the weather temperature is the common cause of both the high electricity usage and the high sales numbers. We say that temperature is a confounder of the causal effect of electricity usage on ice cream sales. Standard statistical analysis focuses on correlation and does not necessarily address the issues of causal inference, see [23].
As computing systems integrate with people's daily lives, it is important to understand the causal effects of these interventions correctly. The digital world generates a staggering amount of data. While these massive data sets have unlocked novel opportunities to understand scientific issues, there is still greater potential for research and practice, especially for causal inference. The review in [18] explores several key issues that have arisen around big data in public health. Recently, much progress has been made in developing statistical causal inference tools (see, e.g., [22, 14, 30]) which enable scientists to assess causal hypotheses and learn the corresponding effects empirically. Several causal inference methods, such as causal mediation approaches [24], propensity score techniques (see, e.g., [28, 8]), and sensitivity analyses [15], are already part of the regular methodological toolbox of causal inference researchers.
However, existing methods may still be too simple to answer complex questions of causation, in particular research questions that go beyond the main effects of an experimental study, like A/B testing [20]. There are three issues which are often not adequately addressed by standard statistical tools: (i) causation due to latent variables, (ii) complex networks of causation, and (iii) the iterative and dynamic processes of both the temporal change in individual behavior [27] and the change in community-level factors that affect the sustainability of evidence, see e.g., [2, 19, 8, 6, 7, 27] and [10]. Therefore, we should pay much more attention to designing proper studies for causal inference.
Conventional Statistical Machine Learning (SML) methods are insufficient for causal analysis, because these algorithms are built on pattern recognition and correlation, and focus on prediction, see [26]. For instance, in policymaking, one may want to use an SML algorithm to predict the number of papers that will be published by a scientist in the future. However, that may amount to posing a target in front of the scientist. As a result, it might create a negative feedback loop and compromise the quality of the publications. [21] presented several case studies which showed how SML-based predictive models create negative feedback loops and disempower the members of the society under consideration. The need for institutes to adapt their practices and policies, using the available massive amounts of observational data, prompted the requirement for causal discovery. Hence, for a research institute, it is more useful to discover the factors that lead to research publication than to predict the number of papers for each scientist. For example, causal inference might find a factor like 'collaboration' behind successful research output. This kind of insight will lead the academic community to take more initiatives which help young researchers take part in more collaborations. Such a policy will be more empowering for the members of the scientific community than setting them up for failure by giving them a target number of papers to publish.
The rest of the paper is structured as follows. In Section 2, we present the related research. In Section 3, we describe the data set, the data privacy policy, and the study design. We present the exploratory data analysis in Section 4 and the research methodology in Section 5. In Section 6, we discuss the results and analysis of our findings. Finally, we conclude the paper with a few summary remarks in Section 7.
2 Related Research
Several studies reported a positive relationship between collaboration and research publications, see, e.g., [9, 32, 5, 31, 1]. [32] looked into email communication and research productivity. To the best of our knowledge, there is no literature which looks into the web-search activity of scientists and their research productivity.
[9] studied the effect of collaboration on the publication productivity of 65 biomedical scientists at a New Zealand university over 14 years. The findings suggest that international collaboration has a stronger positive effect on a scientist's research publications than domestic collaboration. Moreover, collaborations are linked to article quality. Since [9] used a longitudinal design, the study establishes the causal effect of scientific collaboration on research publication for biomedical scientists beyond a reasonable doubt. However, [9] did not explore the effect of internet usage by scientists on research publication, which we highlight in this paper.
[32] found a strong association between email communication and research productivity, using email counts as a proxy for internet usage. However, [32] did not address causality through their study design; therefore the result indicates association and not causation. For example, a scientist might be engaged in organizing a conference or in administrative work, which has nothing to do with research productivity. [32] used negative binomial regression models over the Poisson model, but made an ad hoc choice of the negative binomial model instead of using a statistical model selection criterion, such as the Akaike Information Criterion. [32] reported a lack of evidence of a relationship between research productivity and scientific collaboration, which is counterintuitive. However, [32] found an association between publication productivity, professional network size, and degree of collaboration. Since [32] did not address the design issues of a causal study, the study fails to find evidence connecting scientific collaboration and publication productivity. In this paper, we found a strong association between scientific collaboration and publication productivity.
[5] reported that the impact of research publications is associated with factors like different dimensions of collaboration, for example individual, institutional, and international collaboration; journal and reference impacts; abstract readability; reference and keyword totals; and paper, abstract, and title lengths. The findings of [5] suggest that collaboration and journal are significantly associated with higher citations, and that researchers should include relevant references, write extended abstracts, and engage with the widest possible working team. [5] also reported that international collaboration has a high impact on research publications. In summary, [5] reported that different aspects of collaboration are significantly associated with measures of research productivity like citations.
[31] studied how the network embeddedness of scientists affects their research output and impact. The results of [31] indicate that the network dynamics of collaboration behind the generation of quality output contrast dramatically with those behind quantity.
[1] discussed the relationship between different types of collaboration and research productivity. In particular, [1] showed that only collaboration at the intramural and domestic level has a positive effect on research productivity, while all forms of collaboration are positively affected by research productivity.
3 Data Sets and Study Design
3.1 Data Set Description
We implemented the study for the scholarly community of the Indira Gandhi Centre for Atomic Research (IGCAR), one of the premier research institutes in India. We divided the research activities of the community into four branches, namely Physical Sciences, Chemical Sciences, Engineering Sciences, and Other Sciences; atmospheric, earth, and general sciences were assigned to Other Sciences. There are 262 members in the community who published at least one paper in the year 2016. We analyze the factors of those members which lead them to publish articles in the year 2017. The factors may be demographic data, like years of experience in the institute, research branch, the number of collaborations in the year 2016, etc., and internet activity, like the amount of time spent browsing scientific articles. The database contains data sets from four sources: (i) the Publication database of 2017, (ii) the Publication database of 2016, (iii) the Demographic database, and (iv) the Weblog data set of 2016.
Variable of Interest: The number of research publications in 2017, from the Publication database of 2017, is our target variable of interest, and we want to test which predictors affect it.
Demographic Predictors: We considered demographic variables like (i) whether the member has a Ph.D. or not, (ii) the years of experience of the member, and (iii) the branch of research interest.
Publication Database of 2016: It consists of the publications of all 262 members. We derived the number of collaborations made by each scientist in the year 2016 from this database. For example, suppose a scientist A publishes two papers, the first with B and C and the second with C and D; then we consider that A has two unique collaborations. Similarly, we counted the collaborations of the 262 members in the year 2016. Under this technique, if three scientists A, B, and C publish four papers together, then we count that as one collaboration.
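The counting rule above (distinct co-author sets, not papers) can be sketched in a few lines; here each paper is assumed to be stored as a list of author identifiers, and the names `papers_2016` and `count_unique_collaborations` are illustrative:

```python
def count_unique_collaborations(papers, scientist):
    """Count the distinct co-author sets a scientist published with."""
    coauthor_sets = set()
    for authors in papers:
        if scientist in authors:
            others = frozenset(a for a in authors if a != scientist)
            if others:  # ignore single-author papers
                coauthor_sets.add(others)
    return len(coauthor_sets)

papers_2016 = [
    ["A", "B", "C"],  # first paper: A with B and C
    ["A", "C", "D"],  # second paper: A with C and D
]
print(count_unique_collaborations(papers_2016, "A"))  # 2 unique collaborations
```

Note that four papers by the same trio A, B, C yield a single co-author set, so they count as one collaboration, matching the rule above.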
Weblog Variables of 2016: Access to e-journals leaves a large trail in the form of logs on the web servers. We considered variables like the number of hits; the time spent browsing scientific articles; the number of hours spent on weekends; the number of visits to the top-viewed journal groups, like IEEE, Science Direct, Springer, Elsevier, Taylor and Francis, etc.; the maximum number of hits made in a particular month; the number of hours spent in the morning/evening; the month in which the maximum number of hits was made; and the maximum download size by the members.
3.2 Data Privacy
We are committed to the privacy of the members of the community under study; hence we considered data privacy an essential part of our work. We employed data masking to preserve the privacy of the members. Randomly generated identifiers replaced all personal information, such as names, email IDs, gender, and departmental groups, in all databases.
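As a minimal sketch of this masking step (not the exact procedure used in our pipeline), each real identifier can be replaced by a consistent, randomly generated pseudonym, so the same person maps to the same masked ID across all databases:

```python
# Illustrative data-masking sketch: map each identifier to a random,
# non-reversible pseudonym, reused consistently across databases.
import secrets

def build_mask(identifiers):
    """Map each real identifier to a randomly generated pseudonym."""
    return {real: f"member_{secrets.token_hex(4)}" for real in set(identifiers)}

names = ["Alice", "Bob", "Alice"]   # hypothetical raw identifiers
mask = build_mask(names)
masked = [mask[n] for n in names]
assert masked[0] == masked[2]       # same person, same pseudonym
assert masked[0] != masked[1]       # different people stay distinct
```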
3.3 Study Design
One must be careful when establishing the causal connection between the variable of interest and the estimated effects of the factors. Covariation is a necessary but not a sufficient condition for causal inference, see e.g., [29]. Correlation is not causation; however, it is a good sign that causation may exist. We often forget that a causal association must be established by design and should not rely upon statistical models whose postulates are seldom defended, see e.g., [25]. We present our study design in Figure 1, where the 'Publication Database of 2016', the 'Weblog Database of 2016', and the 'Demographic Database' are considered as factors which might have a causal impact on the 'Publication Database of 2017'. We assumed the 'Demographic Database' of the IGCAR scientists to be constant over 2016 and 2017. We understand that the 'Demographic Database' is dynamic over a long period, but that is out of the scope of the current analysis. The 'Publication Database of 2016' surely has an impact on the 'Publication Database of 2017', because most IGCAR scientists work on long-running collaborative projects. Our objective is to identify whether the 'Weblog Database of 2016' has any effect on the 'Publication Database of 2017', after accounting for the impact of the 'Publication Database of 2016' and the 'Demographic Database'.
4 Exploratory Data Analysis
In Figure 2, we present the number of publications in 2017 versus the number of scientists who published at least one paper in 2016. Figure 2 indicates that the distribution of the number of publications has a decaying shape; hence a Poisson or negative binomial probability model would be an appropriate probability model for the number of publications. From Figure 2, we see that the mean (2.68) is less than the standard deviation (3.19), which indicates that the Poisson model may not be appropriate. However, the negative binomial could be a more appropriate model, as the variance of the negative binomial model is larger than its mean.
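This mean-versus-variance check can be reproduced with a few lines of standard-library Python; the counts below are hypothetical placeholders, not our data:

```python
# Overdispersion check: the Poisson model forces mean == variance, so if
# the sample variance exceeds the mean, a negative binomial model is a
# better candidate. Counts here are illustrative.
import statistics

pub_counts = [0, 1, 1, 2, 2, 3, 5, 8, 12]   # hypothetical publication counts
mean = statistics.mean(pub_counts)
var = statistics.variance(pub_counts)        # sample variance

if var > mean:
    print("overdispersed: prefer negative binomial over Poisson")
else:
    print("no overdispersion evidence: Poisson may suffice")
```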
In Table 1, we present the average and median number of publications by scientists with and without a Ph.D. The table indicates that scientists with a Ph.D. tend to publish two more papers than their colleagues without a Ph.D. Figure 5 presents the years of experience of the scientists and the number of publications in 2017; both the trend and the variability increase with years of experience. Figure 3 presents a strong positive relationship between the number of unique collaborations in 2016 and the total number of publications in 2017. This strong positive association is in line with the literature of the last decade, see [9, 32, 5, 31, 1]. Finally, in Figure 4, we plot the 3D surface of years of experience, unique collaborations of 2016, and the number of publications of 2017. The plot indicates possible nonlinear behavior among the three variables.
5 Research Methodology
In this section, we present the two models and the rationale for formulating the null and alternative hypothesis for our analysis under the proposed models. We also present the underlying assumptions of the models and analysis.
5.1 Assumptions Related to Causal Inference
We follow [13] and make the following assumptions throughout our study.

Unconfoundedness: The assignment is free of dependence on the potential outcomes. That is, the web activity of a scientist in 2016 does not depend on how many publications he/she is going to have in 2017.

Individualistic Assignment: The assignment mechanism is individualistic. The assignment probability for a sample unit (i.e., a randomly chosen scientist) is a function of the pre-treatment variables for that unit only, free of dependence on the values of the pre-treatment variables for other units. That is, the web activity of a randomly chosen scientist is independent of the web activity of any other scientist in the community.

Probabilistic Assignment: The assignment mechanism is probabilistic, so the probability of receiving any level of the treatment is strictly between zero and one. In other words, the level of web activity of a scientist is random in nature.
5.2 Probability Models
Poisson Regression Model
In this study, we consider the measure of research output (i.e., the total number of publications in 2017) as a count variable. The most popular probability model for a count variable is the Poisson distribution, and the probability of the number of publications can be modeled as
P(Y_i = y_i) = \frac{e^{-\lambda_i}\lambda_i^{y_i}}{y_i!}, \qquad y_i = 0, 1, 2, \ldots,
where Y_i denotes the number of publications in 2017 by the i-th scientist and \lambda_i is the rate of publication in 2017. The rate of publication can be modeled as
\log(\lambda_i) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip},
where x_{i1}, \ldots, x_{ip} are the predictor variables, such as the number of unique collaborations in 2016, the scientist's years of experience, whether the scientist has a Ph.D. (or not), web-search activities, etc.
Negative Binomial Regression Model
We can use the negative binomial probability model as an alternative model for the total number of publications in 2017 in the following way,
P(Y_i = y_i) = \frac{\Gamma(y_i + r)}{\Gamma(r)\, y_i!} \left(\frac{r}{r + \lambda_i}\right)^{r} \left(\frac{\lambda_i}{r + \lambda_i}\right)^{y_i}, \qquad y_i = 0, 1, 2, \ldots,
where the rate \lambda_i can be modeled through the same predictor variables as
\log(\lambda_i) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}.
The regression coefficients of the Poisson and Negative Binomial model can be estimated using standard tools, see e.g., [3, 4].
5.3 Hypothesis
Null Hypothesis
The last decades of research, see e.g., [9, 32, 5, 31, 1], establish beyond a reasonable doubt that collaboration is a main factor in research productivity. Hence, under the null hypothesis, we assume that the number of publications in 2017 (the measure of research productivity) is a function only of the number of collaborations in 2016 and the demographic variables, like years of experience and whether the scientist has a Ph.D. or not. Our exploratory data analysis also indicates the same. Under the null model, the weblog variables, like the scientist's web-search activity of 2016, have no impact on the research productivity of 2017. Hence we have the null model as
\log(\lambda_i) = \beta_0 + \beta_1\,\mathrm{Collaboration}_i + \beta_2\,\mathrm{Experience}_i + \beta_3\,\mathrm{PhD}_i.
Alternative Hypothesis
We formulate our alternative hypothesis as follows. In addition to the number of collaborations in 2016 and the demographic variables, the weblog variables, like the scientist's web-search activity of 2016, do have an impact on the research productivity of 2017. For example, the maximum number of hits by a scientist on sci-indexed journal sites may have an effect on the research output of 2017. Hence we have the alternative model as
\log(\lambda_i) = \beta_0 + \beta_1\,\mathrm{Collaboration}_i + \beta_2\,\mathrm{Experience}_i + \beta_3\,\mathrm{PhD}_i + \boldsymbol{\gamma}^{\top}\mathbf{w}_i, \qquad (5.1)
where \mathbf{w}_i denotes the vector of weblog variables (and their interactions) of 2016 for the i-th scientist.
6 Results and Analysis
In Table 2, we present Akaike's Information Criterion (AIC) for both the Poisson and negative binomial regressions under the null and alternative models. The AIC for the negative binomial regression is smaller than that of the Poisson regression, which indicates that the negative binomial regression performs better. Hence, from now on, we present all our analysis based only on the negative binomial regression. Note that [32] reported that the negative binomial regression is a better model compared to the Poisson regression, so our findings are in line with the previous findings of [32]. The findings of [32] were based on agricultural scientists in the Philippines, whereas our findings are based on the research productivity of Indian scientists. The two independent studies indicate that the negative binomial regression is perhaps a good model for studying research productivity.
In Table 3, we present the likelihood ratio (LR) based chi-square test between the null and alternative models. The P-value of the test indicates that we reject the null model in favor of the alternative model at the 0.01% level of significance. This indicates that the web-search activity of 2016 has a statistically significant effect on the number of research publications in 2017. Note that the LR test in Table 3 only indicates that the web-search activity of 2016 has a significant effect on the research publications of 2017; it does not say anything about the direction of the effect. Hence we present further analysis.
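The LR test between the nested models reduces to a simple computation: LR = 2 (ll_alt − ll_null), compared against a chi-square distribution whose degrees of freedom equal the number of extra weblog parameters. The log-likelihoods and parameter count below are hypothetical placeholders, not the fitted values behind Table 3:

```python
# Likelihood-ratio test sketch with illustrative numbers.
import math

def chi2_sf_even_df(x, df):
    """Chi-square survival function; closed form valid for even df."""
    m = df // 2
    return math.exp(-x / 2) * sum((x / 2) ** i / math.factorial(i) for i in range(m))

ll_null = -512.4    # hypothetical log-likelihood, fitted null model
ll_alt = -486.9     # hypothetical log-likelihood, fitted alternative model
extra_params = 8    # hypothetical number of added weblog coefficients

lr_stat = 2 * (ll_alt - ll_null)            # likelihood-ratio statistic
p_value = chi2_sf_even_df(lr_stat, extra_params)
print(f"LR = {lr_stat:.1f}, p = {p_value:.2e}")
```

A tiny p-value, as here, leads to rejecting the null model in favor of the alternative.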
Based on the analysis of the alternative negative binomial regression model in Equation 5.1, presented in Table 4, the demographic variables like Ph.D., experience, and unique collaborations of 2016 have a statistically significant impact on the number of research publications of 2017. This supports the findings of [5]. In addition to the demographic variables, the weblog variables of 2016, like viewing scientific indexed journals, the document size of the downloaded documents, etc., also have a statistically significant effect on the research publications of 2017. The positive coefficient estimates for the 2016 weblog variables like 'hits new', 'sci indexed journal sites', and 'download document size' indicate that these variables have a significant positive impact on the research publications of 2017. Note that the coefficients of the interactions between the variables are also significant; however, the negative values of the interactions indicate that the system is complex. Hence, in order to understand the effect of this complex system, we present three scenarios in a what-if analysis.
What-if Analysis
We considered three scenarios. In Table 5, we present the first, second, and third quartiles of web-browsing activity, where we considered the total number of hits and the size of the downloaded documents as representative features of web-browsing activity. If a scientist's number of hits and downloaded document size are both at the first-quartile level, then we consider that scientist to be less active in web browsing. Similarly, if a scientist's number of hits and downloaded document size are both at the third-quartile level, then we consider the scientist to have a high level of web activity.
SciIndex = 1 indicates that the scientist's maximum hits are on the website of some sci-indexed journal, and SciIndex = 0 indicates that the scientist's maximum hits are on a website which is not a sci-indexed journal site. The median of SciIndex is 1, which means that for more than 50% of the scientists, the maximum web-search activity is on sci-indexed journal sites.
We considered two scientists with exactly the same demographics: both have a Ph.D., five collaborations in 2016, and 10 years of experience.
In all three scenarios, the first scientist always has a high level of web activity and his/her maximum hits are on sci-indexed journal sites. We present the results of the what-if analysis in Table 6. We estimated the standard errors and confidence intervals using bootstrap statistics with 5000 bootstrap samples.
Scenario I
In the first scenario, we considered that the second scientist has low web activity and his/her maximum hit is not on a sci-indexed journal site. The difference in the expected number of publications between the two scientists is 1.62, with the standard error and confidence interval reported in Table 6. This indicates that the difference between the two scientists is statistically significant, and the one with high web activity publishes at least one more paper than the scientist with a low level of web activity.
Scenario II
In the second scenario, we considered that the second scientist has high web activity but his/her maximum hit is not on a sci-indexed journal site. The difference in the expected number of publications between the two scientists is 1.35, with the standard error and confidence interval reported in Table 6. This indicates that the difference between the two scientists is statistically significant. Both scientists now have a high level of web activity, but the second scientist's maximum number of hits is not on a sci-indexed journal site. This means that the main feature differentiating the two scientists is whether the maximum number of hits is on a sci-indexed journal site.
Scenario III
The second scenario leads us to the third scenario. In the third scenario, we considered the web activity of the two scientists to be the same as in Scenario I: the first scientist has a high level of web activity and the second has a low level. However, both scientists' maximum hits are on sci-indexed journal sites. Now the difference reduces to 0.35, with a confidence interval that includes zero. This indicates that the second scientist, who has a low level of web activity, is as effective as the scientist with a high level of web activity, because his/her maximum hits are on a sci-indexed journal site. In other words, a scientist can have less web activity, but if that activity is related only to research, then his/her productivity will be as high as the other's.
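The bootstrap used for the standard errors and confidence intervals in these scenarios can be sketched as follows: resample the data with replacement 5000 times, recompute the quantity of interest each time, and take the standard deviation and percentile interval of the resampled values. The groups and the quantity of interest below (a simple mean difference on hypothetical counts, rather than the model-based expected-publication difference) are illustrative:

```python
# Bootstrap SE and 95% percentile CI on simulated publication counts.
import random
import statistics

random.seed(1)
high_activity = [3, 4, 2, 6, 5, 4, 3, 7]   # hypothetical counts, high web activity
low_activity = [2, 1, 3, 2, 1, 4, 2, 2]    # hypothetical counts, low web activity

def mean_diff(a, b):
    return statistics.mean(a) - statistics.mean(b)

boot = []
for _ in range(5000):                      # 5000 bootstrap resamples
    a = random.choices(high_activity, k=len(high_activity))
    b = random.choices(low_activity, k=len(low_activity))
    boot.append(mean_diff(a, b))

boot.sort()
se = statistics.stdev(boot)
ci = (boot[int(0.025 * 5000)], boot[int(0.975 * 5000)])
print(f"estimate = {mean_diff(high_activity, low_activity):.2f}, "
      f"SE = {se:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

A confidence interval that excludes zero (Scenarios I and II) indicates a significant difference; one that includes zero (Scenario III) does not.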
7 Discussion
The concept of causation has long been controversial in qualitative research, and many qualitative researchers have rejected causal explanation as incompatible with an interpretive or constructivist approach [17]. A limitation of our study is that we are unable to analyze the quality of the publications. In the future, we propose to measure the quality of the publications through citations. Causal inference eliminates the possibility of the bad feedback loop involved in predictive modeling. Instead, it discovers useful inferences which can be adopted as policies in the institute for more publications. It helps to assist the members of our community rather than directing them with rules that are difficult to comply with. We also took care to preserve the privacy of the members of our community.
Our analysis indicates that the negative binomial regression performs better than the Poisson regression, perhaps because it captures the overdispersion that exists in the data. The demographic variables like Ph.D., experience, and unique collaborations of 2016 have a statistically significant impact on the number of research publications of 2017. In addition to the demographic variables, the weblog variables of 2016, like viewing scientific indexed journals, the document size of the downloaded documents, etc., also show a statistically significant effect on the research publications of 2017.
The what-if analysis indicates that web-browsing activity leads to a larger number of publications. Interestingly, however, we see that a scientist with low web activity can be as productive as others if his/her maximum hits are on sci-indexed journal sites. That is, if a scientist uses web browsing only for research-related activity, then he/she can be equally productive even if his/her web activity is lower than that of fellow scientists.
References
 [1] Giovanni Abramo, Andrea Ciriaco D'Angelo and Gianluca Murgia. The relationship among research productivity, research collaboration, and their determinants. Journal of Informetrics, 11:1016–1030, 2017.
 [2] S. T. Lanza, N. M. Butera and D. L. Coffman. A framework for estimating causal effects in latent class analysis: Is there a causal link between early sex and subsequent profiles of delinquency? Prevention Science, 15:397–407, 2014.
 [3] Sourish Das and Dipak K. Dey. On bayesian analysis of generalized linear models using the jacobian technique. The American Statistician, 60(3):264–268, 2006.
 [4] Sourish Das and Dipak K. Dey. On dynamic generalized linear models with applications. Methodology and Computing in Applied Probability, (2):407–421, 2013.
 [5] Fereshteh Didegah and Mike Thelwall. Which factors help authors produce the highest impact research? collaboration, journal and document properties. Journal of Informetrics, 7:861–873, 2013.
 [6] Wolfgang Wiedermann, Nianbo Dong and Alexander von Eye. Advances in statistical methods for causal inference in prevention science: Introduction to the special section. Prevention Science, 20(3):390–393, 2019.
 [7] K. Imai, D. Tingley and T. Yamamoto. Experimental designs for identifying causal mechanisms. Journal of the Royal Statistical Society A, 176:5–51, 2013.
 [8] A. von Eye and W. Wiedermann. On direction of dependence in latent variable contexts. Educational and Psychological Measurement, 74(1):5–30, 2014.

 [9] Zi-Lin He, Xue-Song Geng and Colin Campbell-Hunt. Research collaboration and research output: A longitudinal study of 65 biomedical scientists in a New Zealand university. Research Policy, 38:306–317, 2009.
 [10] D. A. Chambers, R. E. Glasgow and K. C. Stange. The dynamic sustainability framework: Addressing the paradox of sustainment amid ongoing change. Implementation Science, 8:12–23, 2013.
 [11] Isabelle Guyon, Constantin Aliferis, Greg Cooper, André Elisseeff, Jean-Philippe Pellet, Peter Spirtes and Alexander Statnikov. Design and analysis of the causation and prediction challenge. In JMLR: Workshop and Conference Proceedings, volume 3, pages 1–33, 2008.
 [12] Ruocheng Guo, Lu Cheng, Jundong Li, P. Richard Hahn and Huan Liu. A survey of learning causality with data: Problems and methods. ACM Computing Surveys, 53(4):1–37, 2020.
 [13] Guido. W. Imbens and Donald. B. Rubin. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press, 2015.
 [14] Jonas Peters, Dominik Janzing and Bernhard Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms. Cambridge, MA: MIT Press, 2017.
 [15] W. Liu, S. J. Kuramoto and E. A. Stuart. An introduction to sensitivity analysis for unobserved confounding in nonexperimental prevention research. Prevention Science, 14(6):570–580, 2013.
 [16] Marloes H. Maathuis and Preetam Nandy. A review of some recent advances in causal inference. https://arxiv.org/abs/1506.07669/ last accessed Sep 2019, 2015.
 [17] Joseph A. Maxwell. The importance of qualitative research for causal explanation in education. Qualitative Inquiry, 18(8):655–661, 2012.
 [18] Stephen J. Mooney and Vikas Pejaver. Big data in public health: Terminology, machine learning, and privacy. The Annual Review of Public Health, 39:95–112, 2018.
 [19] B. Muthén and T. Asparouhov. Causal effects in mediation modeling: An introduction with applications to latent variables. Structural Equation Modeling, 22:12–23, 2015.

 [20] Cathy O'Neil. Doing Data Science. O'Reilly Media, 2014.
 [21] Cathy O'Neil. Weapons of Math Destruction. Crown Books, 2016.
 [22] Judea Pearl. Causal inference in statistics: An overview. Statistics Surveys, 3:96–146, 2009.
 [23] Judea Pearl. The science and ethics of causal modeling. In A. T. Panter and S. K. Sterba, editors, Handbook of Ethics in Quantitative Methodology, pages 384–409. Taylor & Francis, 2011.
 [24] Judea Pearl. The causal mediation formula: A guide to the assessment of pathways and mechanisms. Prevention Science, 13:426–436, 2012.
 [25] Jasjeet S. Sekhon. The Neyman-Rubin model of causal inference and estimation via matching methods. In Janet M. Box-Steffensmeier, Henry E. Brady and David Collier, editors, The Oxford Handbook of Political Methodology. Oxford University Press, 2008.
 [26] Amit Sharma. Tutorial on causal inference and counterfactual reasoning. https://causalinference.gitlab.io/kddtutorial/ Last accessed: September 2019, 2018.
 [27] Adriene M. Beltz, Aidan G. C. Wright, Briana N. Sprague and Peter C. M. Molenaar. Bridging the nomothetic and idiographic approaches to the analysis of clinical data. Assessment, 23:447–458, 2016.
 [28] V. S. Harder, E. A. Stuart and J. C. Anthony. Propensity score techniques and the assessment of measured covariate balance to test causal associations in psychological research. Psychological Methods, 15:234–249, 2010.
 [29] Edward Tufte. The Cognitive Style of PowerPoint: Pitching Out Corrupts Within, Second edition. Graphics Press, 2006.
 [30] T. J. VanderWeele. Explanation in causal inference: Methods for mediation and interaction. Oxford University Press., 2015.
 [31] Claudia N. GonzalezBrambila , Francisco M. Veloso and David Krackhardt. The impact of network embeddedness on research output. Research Policy, 42:1555–1567, 2013.
 [32] Marcus Antonius Ynalvez and Wesley M. Shrum. Professional networks, scientific collaboration, and publication productivity in resourceconstrained research institutions in a developing country. Research Policy, 40:204–216, 2011.