The Rise of GitHub in Scholarly Publications

08/09/2022
by Emily Escamilla, et al.

The definition of scholarly content has expanded to include the data and source code that contribute to a publication. While major archiving efforts to preserve conventional scholarly content, typically in PDFs (e.g., LOCKSS, CLOCKSS, Portico), are underway, no analogous effort has yet emerged to preserve the data and code referenced in those PDFs, particularly the scholarly code hosted online on Git Hosting Platforms (GHPs). Similarly, the Software Heritage Foundation is working to archive public source code, but there is value in archiving the issue threads, pull requests, and wikis that provide important context to the code while maintaining their original URLs. In current implementations, source code and its ephemera are not preserved, which presents a problem for scholarly projects where reproducibility matters. To understand and quantify the scope of this issue, we analyzed the use of GHP URIs in the arXiv and PMC corpora from January 2007 to December 2021. In total, there were 253,590 URIs to GitHub, SourceForge, Bitbucket, and GitLab repositories across the 2.66 million publications in the corpora. We found that GitHub, GitLab, SourceForge, and Bitbucket were collectively linked to 160 times in 2007 and 76,746 times in 2021. In 2021, one out of five publications in the arXiv corpus included a URI to GitHub. The complexity of GHPs like GitHub is not amenable to conventional Web archiving techniques. Therefore, the growing use of GHPs in scholarly publications points to an urgent and growing need for dedicated efforts to archive their holdings in order to preserve research code and its scholarly ephemera.
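
As a rough illustration of the kind of measurement described above, the following is a minimal sketch of how URIs to the four Git Hosting Platforms might be pulled out of a publication's text and tallied per platform. It is not the authors' pipeline: the regular expression, the host-to-platform table, and the function names `classify_ghp` and `count_ghp_uris` are assumptions made for this example, and the actual study may have applied different matching and deduplication rules.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Assumed mapping from registered domain to platform name (illustrative only).
GHP_HOSTS = {
    "github.com": "GitHub",
    "gitlab.com": "GitLab",
    "bitbucket.org": "Bitbucket",
    "sourceforge.net": "SourceForge",
}

# Simple pattern for http(s) URIs; stops at whitespace and common closing delimiters.
URI_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+", re.IGNORECASE)


def classify_ghp(uri):
    """Return the platform name for a URI, or None if it is not a known GHP."""
    try:
        host = (urlparse(uri).hostname or "").lower()
    except ValueError:
        # Malformed URIs (e.g. broken IPv6 brackets) are skipped.
        return None
    for domain, platform in GHP_HOSTS.items():
        # Match the domain itself or any subdomain of it.
        if host == domain or host.endswith("." + domain):
            return platform
    return None


def count_ghp_uris(text):
    """Count URIs per platform found in a block of publication text."""
    counts = Counter()
    for match in URI_PATTERN.findall(text):
        uri = match.rstrip(".,;:")  # drop trailing sentence punctuation
        platform = classify_ghp(uri)
        if platform is not None:
            counts[platform] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "Code is at https://github.com/a/b. See also "
        "http://foo.sourceforge.net/ and https://example.org/x, "
        "(https://gitlab.com/g/p)."
    )
    print(count_ghp_uris(sample))
```

Applied to every publication in a corpus and grouped by publication year, per-platform counts like these would give the sort of year-over-year totals reported in the abstract. Note that this sketch treats any subdomain of a platform's domain as belonging to that platform, which is a design choice of the example rather than something the source states.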
