Research publications are one measure of the output of a research institute. In this paper, our objective is to find the factors that cause scholarly communities to engage in active research and publication. Nowadays, researchers rely heavily on digital libraries such as ScienceDirect, ResearchGate, and Web of Science. In this study, we try to understand how the web activity of scientists affects their research publications. The problem of discovering causal impact is ubiquitous in biological science, economics, public policy, and many aspects of daily life that require logical reasoning and decision-making, see e.g., [22, 11]. Causal questions are about the mechanism behind the data or about predictions after a novel intervention is applied to the system, see .
When it comes to learning causality from data, we should be careful about the difference between statistical association and causation. The example presented in  illustrates this: when the weather temperature is high, the owner of an ice cream parlor may see high electricity bills along with high sales. There would then be a strong association between the electricity bill and the revenue. However, the high electricity bill did not cause the high sales. In this case, the weather temperature is the common cause of both the high electricity usage and the high sales numbers. We say that temperature is a confounder of the effect of electricity usage on ice cream sales. Standard statistical analysis focuses on correlation and does not necessarily address causal inference, see .
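The confounding in the ice cream example can be illustrated with a small simulation. This is only a sketch under assumed numbers: the data-generating process, the coefficients, and the helper names `pearson` and `residuals` are hypothetical and chosen for illustration. Temperature drives both variables, so they correlate strongly, yet the correlation vanishes once the temperature effect is removed from each.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def residuals(ys, xs):
    """Residuals of a simple least-squares regression of ys on xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]

rng = random.Random(0)
# Hypothetical process: temperature drives both variables,
# and neither variable affects the other.
temperature = [rng.uniform(15, 40) for _ in range(2000)]
electricity = [3.0 * t + rng.gauss(0, 5) for t in temperature]
sales = [2.0 * t + rng.gauss(0, 5) for t in temperature]

raw_corr = pearson(electricity, sales)
# Partial correlation: correlate what remains after removing the temperature effect.
partial_corr = pearson(residuals(electricity, temperature),
                       residuals(sales, temperature))
print(f"raw: {raw_corr:.2f}, adjusted for temperature: {partial_corr:.2f}")
```

Adjusting for a measured confounder in this way only works when the confounder is observed, which is one reason the study design matters as much as the statistical model.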
As computing systems become integrated with people’s daily lives, it is important to understand the causal effects of these interventions correctly. The digital world generates a staggering amount of data. While these massive data sets have unlocked novel opportunities to understand scientific issues, there is still greater potential for research and practice, especially for causal inference. The review in [18] explores several key issues that have arisen around big data in public health. Recently, considerable progress has been made in developing statistical causal inference tools (see, e.g., [22, 14, 30]), which enable scientists to assess causal hypotheses and learn the corresponding effects empirically. Several causal inference methods, such as causal mediation approaches , propensity score techniques (see e.g., [28, 8]), and sensitivity analyses , are already part of researchers’ regular methodological toolbox.
However, existing methods may still be too simple to answer complex questions of causation. This holds in particular for research questions that go beyond the main effects of an experimental study, such as A/B testing . Three issues are often not adequately addressed by standard statistical tools: (i) causation due to latent variables, (ii) complex networks of causation, and (iii) iterative and dynamic processes, both the temporal change in individual behavior  and the change in community-level factors that affect the sustainability of evidence, see e.g., [2, 19, 8, 6, 7, 27] and . Therefore, we should pay much more attention to designing a proper study for causal inference.
2 Related Research
Several studies have reported a positive relationship between collaboration and research publications, see, e.g., [9, 32, 5, 31, 1]. Ynalvez and Shrum [32] looked into email communication and research productivity. To the best of our knowledge, no literature examines the web-search activity of scientists in relation to their research productivity.
He et al. [9] studied the effect of collaboration on the publication productivity of 65 biomedical scientists at a New Zealand university over 14 years. Their findings suggest that international collaboration has a more positive effect on a scientist’s research publications than domestic collaboration. Moreover, collaborations are linked to article quality. Since [9] used a longitudinal design, the study establishes the causal effect of scientific collaboration on research publication for biomedical scientists beyond a reasonable doubt. However, [9] did not explore the effect of the scientists’ internet usage on research publication, which we address in this paper.
Also, [32] found a strong association between email communication and research productivity. The authors used email counts as a proxy for internet usage and found a positive association with research productivity. However, [32] did not address causality through the study design, so the result indicates association and not causation. For example, a scientist might use email heavily while organizing a conference or doing administrative work, which has nothing to do with research productivity. [32] uses negative binomial regression rather than the Poisson model, but the choice was ad hoc and not based on a statistical model selection criterion such as the Akaike Information Criterion. [32] reported a lack of evidence for an association between research productivity and scientific collaboration, which is counter-intuitive. However, [32] did find an association between publication productivity, professional network size, and degree of collaboration. Since [32] did not address the design issues of a causal study, the study fails to find evidence linking scientific collaboration and publication productivity. In this paper, we find a strong association between scientific collaboration and publication productivity.
Didegah and Thelwall [5] reported that the impact of research publications is associated with several factors, including different dimensions of collaboration. For example, individual, institutional and international collaboration; journal and reference impacts; abstract readability; reference and keyword totals; and paper, abstract and title lengths are associated with the impact of a research publication. The findings of [5] suggest that collaboration and journal are significantly associated with higher citation counts. The results of [5] suggest that researchers should include relevant references, write extended abstracts, and engage in the widest possible working team. Also, [5] reported that international collaboration has a high impact on research publications. In summary, [5] reported that different aspects of collaboration are significantly associated with research productivity measures such as citations.
Gonzalez-Brambila et al. [31] studied how the network embeddedness of scientists affects their research output and impact. The results of [31] indicate that the network dynamics of collaboration behind the generation of quality output contrast dramatically with those behind quantity.
Abramo et al. [1] discussed the relationship between the different types of collaboration and research productivity. In particular, [1] showed that only collaboration at the intramural and domestic level has a positive effect on research productivity, and that all forms of collaboration are positively affected by research productivity.
3 Data Sets and Study Design
3.1 Data Set Description
We carried out the study for the scholarly community of the Indira Gandhi Centre for Atomic Research (IGCAR). IGCAR is one of the premier research institutes in India. We divided the research activities of the community into four branches, namely Physical Sciences, Chemical Sciences, Engineering Sciences, and Other Sciences. We grouped atmospheric, earth, and general sciences under Other Sciences. There are 262 members in the community who published at least one paper in the year 2016. We analyze the factors that lead those members to publish articles in the year 2017. The factors may be demographic, such as years of experience in the institute, research branch, and the number of collaborations in the year 2016, or related to internet activity, such as the amount of time spent browsing scientific articles. The database combines data sets from four sources: (i) Publication database of 2017, (ii) Publication database of 2016, (iii) Demographic Database and (iv) Weblog data set of 2016.
Variable of Interest: The number of research publications in 2017, taken from the Publication database of 2017, is our target variable. We want to test which predictors affect the number of research publications in 2017.
Demographic Predictors: We considered the following demographic variables: (i) whether the member holds a Ph.D., (ii) the years of experience of the member, and (iii) the branch of research interest.
Publication Database of 2016: It consists of the publications of all 262 members. We derived the number of collaborations made by each scientist in the year 2016 from this database. For example, suppose scientist A publishes two papers, the first with B and C and the second with C and D; we then consider that A has two unique collaborations. We counted the collaborations of the 262 members in the year 2016 in this way. Under this scheme, if three scientists A, B, and C publish four papers together, we count them as one collaboration.
Weblog variables of 2016: Access to e-journals is recorded in large volumes as logs on the web servers. We considered variables such as the number of hits; the time spent browsing scientific articles; the number of hours spent on weekends; the number of visits by the members to the top viewed journal groups like IEEE, Science Direct, Springer, Elsevier, Taylor and Francis, etc.; the maximum number of hits made in a single month; the number of hours spent in the morning/evening hours; the month in which the maximum hits were made; and the maximum download size by the members.
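The counting rule for unique collaborations can be sketched as follows. This is one reading of the examples in the text, not the authors' code: each distinct co-author team that includes the scientist counts once, however many papers the team wrote. The function name `unique_collaborations` and the choice to ignore single-author papers are assumptions.

```python
def unique_collaborations(scientist, papers):
    """Count distinct co-author teams that `scientist` published with.

    `papers` is an iterable of author lists. A team is the set of all
    authors on a paper; single-author papers are ignored (an assumption).
    """
    teams = {
        frozenset(authors)
        for authors in papers
        if scientist in authors and len(set(authors)) > 1
    }
    return len(teams)

# A publishes with {B, C} and with {C, D}: two unique collaborations.
papers_2016 = [["A", "B", "C"], ["A", "C", "D"]]
print(unique_collaborations("A", papers_2016))

# A, B and C publish four papers together: one collaboration.
four_joint = [["A", "B", "C"]] * 4
print(unique_collaborations("A", four_joint))
```

Using a set of frozen author sets makes repeated papers by the same team collapse to a single entry, which matches the second example in the text.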
3.2 Data Privacy
We are committed to the privacy of the members of the community under study, and we considered data privacy an essential part of our work. We employed data masking to preserve the privacy of the members. Randomly generated identifiers replaced all personal information, such as names, mail IDs, gender, and departmental groups, in all databases.
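A minimal sketch of this kind of masking is shown below, assuming records are stored as dictionaries. The field names (`name`, `email`, `gender`, `group`, `member_id`) and the helpers `build_mask` and `mask_records` are hypothetical; the text does not describe the actual implementation. The same random identifier is reused for a member across databases so that the sources can still be joined.

```python
import secrets

def build_mask(names):
    """Map each distinct name to a random identifier with no personal content."""
    mask, used = {}, set()
    for name in names:
        if name in mask:
            continue
        token = secrets.token_hex(8)
        while token in used:  # collisions are extremely unlikely; keep ids unique anyway
            token = secrets.token_hex(8)
        used.add(token)
        mask[name] = token
    return mask

def mask_records(records, mask, drop_fields=("name", "email", "gender", "group")):
    """Return copies of the records with personal fields replaced by a masked id."""
    masked = []
    for record in records:
        clean = {k: v for k, v in record.items() if k not in drop_fields}
        clean["member_id"] = mask[record["name"]]
        masked.append(clean)
    return masked

# Hypothetical rows from two of the sources.
publications = [{"name": "Scientist One", "email": "one@example.org", "papers_2017": 3}]
weblog = [{"name": "Scientist One", "group": "Physics", "hits": 120}]

mask = build_mask(r["name"] for db in (publications, weblog) for r in db)
masked_pubs = mask_records(publications, mask)
masked_logs = mask_records(weblog, mask)
print(masked_pubs[0], masked_logs[0])
```

In practice the mapping from names to identifiers would have to be stored securely or discarded, since anyone holding it can reverse the masking.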
3.3 Study Design
One must be careful when establishing a causal connection between the variable of interest and the estimated effect of the factors. Co-variation is a necessary but not sufficient condition for causal inference, see e.g., . Correlation is not causation; however, it is a sign that causation may exist. We often forget that a causal association must be established by design and should not rely on statistical models whose postulates are seldom defended, see e.g., . We present our study design in Figure 1, where the ‘Publication Database of 2016’, the ‘Weblog Database of 2016’ and the ‘Demographic Database’ are considered as factors that might have a causal impact on the ‘Publication Database of 2017’. We assume that the ‘Demographic Database’ of the IGCAR scientists is constant over 2016 and 2017. We understand that the ‘Demographic Database’ is dynamic over a long period, but this is outside the scope of the current analysis. The ‘Publication Database of 2016’ surely has an impact on the ‘Publication Database of 2017’, because most of the IGCAR scientists work on long-running collaborative projects. Our objective is to identify whether the ‘Weblog Database of 2016’ has any effect on the ‘Publication Database of 2017’ after accounting for the impact of the ‘Publication Database of 2016’ and the ‘Demographic Database.’
4 Exploratory Data Analysis
Figure 2 indicates that the distribution of the number of publications decays as the count increases. Hence a Poisson or negative binomial model would be an appropriate probability model for the ‘number of publications.’ From Figure 2, we see that the mean (2.68) is less than the standard deviation (3.19), so the variance (about 10.18) is well above the mean. This overdispersion indicates that the Poisson model, whose variance equals its mean, may not be appropriate. The negative binomial model could be more appropriate, as its variance is larger than its mean.
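The overdispersion argument can be checked with a few lines of arithmetic. The sketch below uses only the mean and standard deviation reported in the text; the variable names are illustrative.

```python
mean_pubs = 2.68  # sample mean reported in the text
sd_pubs = 3.19    # sample standard deviation reported in the text

variance_pubs = sd_pubs ** 2
# Under a Poisson model the variance equals the mean, so this ratio would be near 1.
dispersion_ratio = variance_pubs / mean_pubs

print(f"variance = {variance_pubs:.2f}, variance/mean = {dispersion_ratio:.2f}")
# variance = 10.18, variance/mean = 3.80
```

A ratio of nearly four suggests the counts vary far more than a Poisson model allows, which motivates the negative binomial alternative.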
In Table 1, we present the average and the median number of publications for scientists with and without a Ph.D. The table indicates that scientists with a Ph.D. tend to publish two more papers than their colleagues without a Ph.D. Figure 5 presents the years of experience of the scientists against the number of publications in 2017. Both the trend and the variability increase with years of experience. Figure 3 shows a strong positive relationship between the number of unique collaborations in 2016 and the total number of publications in 2017. This strong positive association is in line with the literature of the last decade, see [9, 32, 5, 31, 1]. Finally, in Figure 4, we plot the 3D surface of years of experience, unique collaborations in 2016 and the number of publications in 2017. The plot indicates possible non-linear behavior among the three variables.
5 Research Methodology
In this section, we present the two models and the rationale for formulating the null and alternative hypotheses for our analysis under the proposed models. We also present the underlying assumptions of the models and the analysis.
5.1 Assumptions Related to Causal Inference
Following , we make the following assumptions throughout our study.
Unconfoundedness: The assignment mechanism does not depend on the potential outcomes. That is, the web activity of a scientist in 2016 does not depend on how many publications she/he is going to have in 2017.
Individualistic Assignment: The assignment mechanism is individualistic. The assignment probability for a sample unit (i.e., a randomly chosen scientist) is a function of the pre-treatment variables for that unit only and does not depend on the values of the pre-treatment variables for other units. That is, the web activity of a randomly chosen scientist is independent of the web activity of any other scientist in the community.
Probabilistic Assignment: The assignment mechanism is probabilistic, so that the probability of receiving any level of the treatment is strictly between zero and one. In other words, the web activity of a scientist is random in nature.
5.2 Probability Models
Poisson Regression Model
In this study, we consider the measure of research output (i.e., the total number of publications in 2017) as a count variable. The most popular probability model for a count variable is the Poisson distribution, and the probability of the number of publications can be modeled as
$$P(Y = y) = \frac{e^{-\lambda}\,\lambda^{y}}{y!}, \qquad y = 0, 1, 2, \ldots,$$
where $Y$ denotes the number of publications in 2017 and $\lambda$ is the rate of publication in 2017. The rate of publication can be modeled as
$$\log(\lambda) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p,$$
where $x_1, \ldots, x_p$ are the predictor variables, such as the number of unique collaborations in 2016, the scientist’s years of experience, whether the scientist has a Ph.D., web-search activities, etc.
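To make the Poisson regression model concrete, the sketch below evaluates the Poisson probabilities for a rate obtained from a log-linear predictor. The log link is the usual choice for Poisson regression and is assumed here. The coefficient values and the function names `poisson_pmf` and `publication_rate` are hypothetical placeholders, not estimates or code from the study.

```python
import math

def poisson_pmf(y, rate):
    """P(Y = y) for a Poisson count with the given rate."""
    return math.exp(-rate + y * math.log(rate) - math.lgamma(y + 1))

def publication_rate(coefficients, predictors):
    """Rate from a log-linear predictor: exp(b0 + b1*x1 + ... + bp*xp)."""
    intercept, *slopes = coefficients
    return math.exp(intercept + sum(b * x for b, x in zip(slopes, predictors)))

# Hypothetical coefficients for (intercept, collaborations, experience, has_phd);
# placeholders only, not estimates from the study.
beta = (0.1, 0.08, 0.01, 0.4)
rate = publication_rate(beta, predictors=(5, 10, 1))

probabilities = [poisson_pmf(y, rate) for y in range(60)]
expected_count = sum(y * p for y, p in enumerate(probabilities))
print(f"rate = {rate:.2f}, P(Y = 0) = {probabilities[0]:.3f}")
```

Working on the log scale with `math.lgamma` avoids overflow in the factorial for large counts.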
Negative Binomial Regression Model
We can use the negative binomial probability model as an alternative model for the total number of publications in 2017 in the following way,
where the predictor variables can be modeled as
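A common way to write the negative binomial model uses a mean and a size (dispersion) parameter, with variance equal to mean + mean²/size. The sketch below assumes this parametrization, which may differ from the exact form used in the study; the size value and the function name `neg_binomial_pmf` are hypothetical.

```python
import math

def neg_binomial_pmf(y, mean, size):
    """P(Y = y) for a negative binomial count with the given mean and size."""
    log_p = (
        math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
        + size * math.log(size / (size + mean))
        + y * math.log(mean / (size + mean))
    )
    return math.exp(log_p)

mean, size = 2.68, 1.0  # mean from the text; size is a hypothetical value
support = range(2000)   # long enough that the truncated tail is negligible
probs = [neg_binomial_pmf(y, mean, size) for y in support]

nb_mean = sum(y * p for y, p in zip(support, probs))
nb_var = sum((y - nb_mean) ** 2 * p for y, p in zip(support, probs))
print(f"mean = {nb_mean:.2f}, variance = {nb_var:.2f}")
```

Because the variance always exceeds the mean for finite size, this model can accommodate the overdispersion that the Poisson model cannot.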
The last decades of research, see e.g., [9, 32, 5, 31, 1], establish beyond a reasonable doubt that collaboration is the main factor of research productivity. Hence, we assume that the number of publications in 2017 (the measure of research productivity) is a function only of the number of collaborations in 2016 and other demographic variables such as years of experience and Ph.D. status. Our exploratory data analysis also indicates the same. Under the null model, web-log variables such as the scientist’s web-search activity in 2016 have no impact on research productivity in 2017. Hence we have the null model as
We formulate our alternative hypothesis as follows. In addition to the number of collaborations in 2016 and other demographic variables, web-log variables such as the scientist’s web-search activity in 2016 do have an impact on research productivity in 2017. For example, the number of hits by a scientist on SCI-indexed journal sites may have an effect on the research output of 2017. Hence we have the alternative model as
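As a sketch only, assuming a log link and purely additive terms (the exact specification, including any interaction terms, is an assumption here), the two nested models can be written for scientist $i$ with expected 2017 publication count $\mu_i$ as:

```latex
\begin{align*}
H_0:\ \log(\mu_i) &= \beta_0 + \beta_1\,\mathrm{PhD}_i + \beta_2\,\mathrm{Experience}_i
                     + \beta_3\,\mathrm{Collaboration}_i, \\
H_1:\ \log(\mu_i) &= \beta_0 + \beta_1\,\mathrm{PhD}_i + \beta_2\,\mathrm{Experience}_i
                     + \beta_3\,\mathrm{Collaboration}_i + \sum_{j=1}^{q} \gamma_j\, w_{ij},
\end{align*}
```

where $w_{i1}, \ldots, w_{iq}$ stand for the web-log variables of 2016 and the symbols $\gamma_j$ and $q$ are introduced here for illustration. The null model is the special case $\gamma_1 = \cdots = \gamma_q = 0$, which is what makes a likelihood ratio comparison between the two models possible.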
6 Results and Analysis
In Table 2, we present Akaike’s Information Criterion (AIC) for both Poisson and negative binomial regression under the null and alternative models. The AIC for negative binomial regression is smaller than that for Poisson regression, which indicates that negative binomial regression fits better. Hence, from here onwards, we base all our analysis on negative binomial regression only. Note that [32] also reported that negative binomial regression is a better model than Poisson regression, so our findings are in line with the previous findings of [32]. The findings of [32] were based on agricultural scientists in the Philippines, whereas our findings are based on the research productivity of Indian scientists. The two independent studies indicate that negative binomial regression is perhaps a good model for studying research productivity.
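The AIC comparison can be sketched as follows. AIC is defined as 2k − 2 log L, where k is the number of estimated parameters and L the maximized likelihood. The log-likelihood values and parameter counts below are hypothetical placeholders, not the values in Table 2.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 log L; smaller is better."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical maximized log-likelihoods, not values from Table 2.
candidates = {
    "poisson": aic(log_likelihood=-620.0, n_params=4),
    # The negative binomial model has one extra dispersion parameter.
    "negative_binomial": aic(log_likelihood=-540.0, n_params=5),
}
best_model = min(candidates, key=candidates.get)
print(candidates, best_model)
```

The penalty term 2k means the negative binomial model is preferred only if its extra dispersion parameter improves the log-likelihood by more than one unit.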
In Table 3, we present the likelihood ratio (LR) based chi-square test between the null and alternative models. The p-value for the test indicates that we reject the null model in favor of the alternative model at the 0.01% level of significance. This indicates that the web-search activity of 2016 has a statistically significant effect on the number of research publications in 2017. Note that the LR test in Table 3 only indicates that the effect is significant; it does not say anything about its direction. Hence we present further analysis.
Based on the analysis of the alternative negative binomial regression model in Equation 5.1, presented in Table 4, the demographic variables Ph.D., Experience and unique Collaborations of 2016 have a statistically significant impact on the number of research publications in 2017. This supports the findings of . In addition to the demographic variables, the log variables of 2016, such as viewing SCI-indexed journals and the size of the downloaded documents, also have a statistically significant effect on the research publications of 2017. The positive coefficient estimates for log variables of 2016 such as ‘hits new’, ‘sci indexed journal sites’ and ‘download document size’ indicate that these variables have a positive, significant impact on the research publications of 2017. Note that the coefficients of the interactions between the variables are significant; however, the negative values of the interaction terms indicate that the system is complex. Hence, in order to understand the effects in this complex system, we present three scenarios in a ‘What-if’ analysis.
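The likelihood ratio test can be sketched in plain Python. The statistic is twice the difference in maximized log-likelihoods, compared against a chi-square distribution whose degrees of freedom equal the number of extra parameters in the alternative model. The log-likelihood values and the number of extra parameters below are hypothetical, and the helper `chi2_sf` is an illustrative implementation for integer degrees of freedom.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        k, tail = 2, math.exp(-half)              # Q(2, x) = exp(-x/2)
    else:
        k, tail = 1, math.erfc(math.sqrt(half))   # Q(1, x) = erfc(sqrt(x/2))
    while k < df:
        # Q(k + 2, x) = Q(k, x) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1)
        tail += math.exp((k / 2) * math.log(half) - half - math.lgamma(k / 2 + 1))
        k += 2
    return tail

def likelihood_ratio_test(loglik_null, loglik_alt, extra_params):
    """Return the LR statistic and its chi-square p-value."""
    stat = 2.0 * (loglik_alt - loglik_null)
    return stat, chi2_sf(stat, extra_params)

# Hypothetical log-likelihoods and number of added web-log terms.
lr_stat, p_value = likelihood_ratio_test(loglik_null=-540.0, loglik_alt=-520.0,
                                         extra_params=6)
print(f"LR = {lr_stat:.1f}, p = {p_value:.2e}")
```

A p-value below 0.0001 corresponds to rejecting the null model at the 0.01% level; the test is valid here because the null model is nested in the alternative.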
We considered three scenarios. In Table 5, we present the first, second and third quartiles of web browsing activity, where we take the total number of hits and the size of the downloaded documents as representative features of web browsing activity. If a scientist’s number of hits and download size are both at the first quartile level, we consider that scientist less active in web browsing. Similarly, if a scientist’s number of hits and download size are both at the third quartile level, we consider the scientist to have a high level of web activity.
Sci-Index = 1 indicates that the site with the scientist’s maximum hits is the website of an SCI-indexed journal, and Sci-Index = 0 indicates that it is not. The median of Sci-Index is 1, which means that for more than 50% of the scientists the maximum web-search activity is on SCI-indexed journal sites.
We assumed that the demographics of the two scientists are exactly the same: both have a Ph.D., five collaborations in 2016 and 10 years of experience.
In all three scenarios, the first scientist always has high web activity and maximum hits on SCI-indexed journal sites. We present the results of the what-if analysis in Table 6. We estimated the standard error and confidence interval using bootstrap statistics with 5000 bootstrap samples.
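The bootstrap step can be sketched with a percentile bootstrap. In the study the bootstrapped quantity is the difference in expected publications between two what-if profiles, which requires refitting the regression on each resample; to stay self-contained, the sketch below bootstraps a simple mean instead. The data values and the function name `bootstrap_ci` are hypothetical.

```python
import random
import statistics

def bootstrap_ci(data, statistic, n_boot=5000, level=0.95, seed=0):
    """Bootstrap standard error and percentile confidence interval."""
    rng = random.Random(seed)
    n = len(data)
    estimates = sorted(
        statistic([data[rng.randrange(n)] for _ in range(n)])  # resample with replacement
        for _ in range(n_boot)
    )
    std_error = statistics.stdev(estimates)
    lower = estimates[int(round((1 - level) / 2 * n_boot))]
    upper = estimates[int(round((1 + level) / 2 * n_boot)) - 1]
    return std_error, (lower, upper)

# Hypothetical publication counts, for illustration only.
counts = [0, 1, 1, 2, 2, 3, 3, 4, 5, 8, 0, 1, 2, 6, 10, 1, 0, 3, 2, 4]
std_error, (ci_low, ci_high) = bootstrap_ci(counts, statistics.fmean)
print(f"SE = {std_error:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

When the interval for a difference excludes zero, the difference is judged statistically significant, which is how the scenario results are interpreted.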
In the first scenario, the second scientist has low web activity, and her/his maximum hits are not on an SCI-indexed journal site. The difference in the expected number of publications between the two scientists is 1.62 with a standard error of and confidence interval . This indicates that the difference between the two scientists is statistically significant and that the one with high web activity publishes at least one more paper than the scientist with low web activity.
In the second scenario, the second scientist has high web activity, and her/his maximum hits are not on an SCI-indexed journal site. The difference in the expected number of publications between the two scientists is 1.35 with a standard error of and confidence interval . This indicates that the difference between the two scientists is statistically significant. Both scientists now have a high level of web activity, but the second scientist’s maximum hits are not on an SCI-indexed journal site. The main feature that differentiates the two scientists is therefore whether the maximum hits are on an SCI-indexed journal site.
The second scenario leads us to check the third scenario. In the third scenario, the web activity of the two scientists is the same as in the first scenario: the first scientist has a high level of web activity and the second has a low level. However, both scientists have their maximum hits on SCI-indexed journal sites. The difference now reduces to 0.35, with a confidence interval that includes zero. This indicates that the second scientist, who has a low level of web activity, is as productive as the scientist with a high level of web activity, because the second scientist’s maximum hits are on an SCI-indexed journal site. In other words, a scientist can have less web activity, but if that activity is related only to research, the scientist’s productivity will be as high as that of a more active colleague.
The concept of causation has long been controversial in qualitative research, and many qualitative researchers have rejected causal explanation as incompatible with an interpretive or constructivist approach [17]. We were unable to analyze the quality of the publications; in the future, we propose to measure it through citations. Causal inference avoids the harmful feedback loops that can arise with predictive models. Instead, it yields useful inferences that can be adopted as institute policies to encourage more publications. It helps to assist the members of our community rather than direct them with rules that are difficult to comply with. We also tried to preserve the privacy of the members of our community.
Our analysis indicates that negative binomial regression performs better than Poisson regression, perhaps because it captures the overdispersion that exists in the data. The demographic variables Ph.D., Experience and unique Collaborations of 2016 have a statistically significant impact on the number of research publications in 2017. In addition to the demographic variables, the web-log variables of 2016, such as viewing SCI-indexed journals and the size of the downloaded documents, show a statistically significant effect on the research publications of 2017.
The what-if analysis indicates that web browsing activity leads to a larger number of publications. Interestingly, however, a scientist with low web activity can be as productive as others if her/his maximum hits are on SCI-indexed journal sites. That is, if the scientist uses web browsing only for research-related activity, she/he can be equally productive even if her/his web activity is lower than that of fellow scientists.
References
- [1] Giovanni Abramo, Andrea Ciriaco D’Angelo and Gianluca Murgia. The relationship among research productivity, research collaboration, and their determinants. Journal of Informetrics, 11:1016–1030, 2017.
- [2] N. M. Butera, S. T. Lanza and D. L. Coffman. A framework for estimating causal effects in latent class analysis: Is there a causal link between early sex and subsequent profiles of delinquency? Prevention Science, 15:397–407, 2014.
- [3] Sourish Das and Dipak K. Dey. On Bayesian analysis of generalized linear models using the Jacobian technique. The American Statistician, 60(3):264–268, 2006.
- [4] Sourish Das and Dipak K. Dey. On dynamic generalized linear models with applications. Methodology and Computing in Applied Probability, (2):407–421, 2013.
- [5] Fereshteh Didegah and Mike Thelwall. Which factors help authors produce the highest impact research? Collaboration, journal and document properties. Journal of Informetrics, 7:861–873, 2013.
- [6] Wolfgang Wiedermann, Nianbo Dong and Alexander von Eye. Advances in statistical methods for causal inference in prevention science: Introduction to the special section. Prevention Science, 20(3):390–393, 2019.
- [7] K. Imai, D. Tingley and T. Yamamoto. Experimental designs for identifying causal mechanisms. Journal of the Royal Statistical Society A, 176:5–51, 2013.
- [8] A. von Eye and W. Wiedermann. On direction of dependence in latent variable contexts. Educational and Psychological Measurement, 74(1):5–30, 2014.
- [9] Zi-Lin He, Xue-Song Geng and Colin Campbell-Hunt. Research collaboration and research output: A longitudinal study of 65 biomedical scientists in a New Zealand university. Research Policy, 38:306–317, 2009.
- [10] D. A. Chambers, R. E. Glasgow and K. C. Stange. The dynamic sustainability framework: Addressing the paradox of sustainment amid ongoing change. Implementation Science, 8:12–23, 2013.
- [11] Isabelle Guyon, Constantin Aliferis, Greg Cooper, André Elisseeff, Jean-Philippe Pellet, Peter Spirtes and Alexander Statnikov. Design and analysis of the causation and prediction challenge. In JMLR: Workshop and Conference Proceedings, volume 3, pages 1–33. 2008.
- [12] Ruocheng Guo, Lu Cheng, Jundong Li, P. Richard Hahn and Huan Liu. A survey of learning causality with data: Problems and methods. ACM Transactions on Web, 9(4):1559–1131, 2010.
- [13] Guido W. Imbens and Donald B. Rubin. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press, 2015.
- [14] Jonas Peters, Dominik Janzing and Bernhard Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press, Cambridge, 2017.
- [15] W. Liu, S. J. Kuramoto and E. A. Stuart. An introduction to sensitivity analysis for unobserved confounding in nonexperimental prevention research. Prevention Science, 14(6):570–580, 2013.
- [16] Marloes H. Maathuis and Preetam Nandy. A review of some recent advances in causal inference. https://arxiv.org/abs/1506.07669/, last accessed September 2019, 2015.
- [17] Joseph A. Maxwell. The importance of qualitative research for causal explanation in education. Qualitative Inquiry, 18(8):655–661, 2012.
- [18] Stephen J. Mooney and Vikas Pejaver. Big data in public health: Terminology, machine learning, and privacy. Annual Review of Public Health, 39:95–112, 2018.
- [19] B. Muthén and T. Asparouhov. Causal effects in mediation modeling: An introduction with applications to latent variables. Structural Equation Modeling, 22:12–23, 2015.
- [20] Cathy O’Neil and Rachel Schutt. Doing Data Science. O’Reilly, 2014.
- [21] Cathy O’Neil. Weapons of Math Destruction. Crown Books, 2016.
- [22] Judea Pearl. Causal inference in statistics: An overview. Statistics Surveys, 3:96–146, 2009.
- [23] Judea Pearl. The science and ethics of causal modeling. In A. T. Panter and S. K. Sterba, editors, Handbook of Ethics in Quantitative Methodology, pages 384–409. Taylor & Francis, 2011.
- [24] Judea Pearl. The causal mediation formula: A guide to the assessment of pathways and mechanisms. Prevention Science, 13:426–436, 2012.
- [25] Jasjeet S. Sekhon. The Neyman-Rubin model of causal inference and estimation via matching methods. In Janet M. Box-Steffensmeier, Henry E. Brady and David Collier, editors, The Oxford Handbook of Political Methodology. Oxford University Press, 2008.
- [26] Amit Sharma. Tutorial on causal inference and counterfactual reasoning. https://causalinference.gitlab.io/kdd-tutorial/, last accessed September 2019, 2018.
- [27] Adriene M. Beltz, Aidan G. C. Wright, Briana N. Sprague and Peter C. M. Molenaar. Bridging the nomothetic and idiographic approaches to the analysis of clinical data. Assessment, 23:447–458, 2016.
- [28] V. S. Harder, E. A. Stuart and J. C. Anthony. Propensity score techniques and the assessment of measured covariate balance to test causal associations in psychological research. Psychological Methods, 15:234–249, 2010.
- [29] Edward Tufte. The Cognitive Style of PowerPoint: Pitching Out Corrupts Within, Second edition. Graphics Press, 2006.
- [30] T. J. VanderWeele. Explanation in Causal Inference: Methods for Mediation and Interaction. Oxford University Press, 2015.
- [31] Claudia N. Gonzalez-Brambila, Francisco M. Veloso and David Krackhardt. The impact of network embeddedness on research output. Research Policy, 42:1555–1567, 2013.
- [32] Marcus Antonius Ynalvez and Wesley M. Shrum. Professional networks, scientific collaboration, and publication productivity in resource-constrained research institutions in a developing country. Research Policy, 40:204–216, 2011.