Growth and Duplication of Public Source Code over Time: Provenance Tracking at Scale

06/19/2019
by Guillaume Rousseau, et al.

We study the evolution of the largest known corpus of publicly available source code, i.e., the Software Heritage archive (4B unique source code files, 1B commits capturing their development histories across 50M software projects). On this corpus we quantify the growth rate of original, never-seen-before source code files and commits. We find the growth rates to be exponential over a period of more than 40 years. We then estimate the multiplication factor, i.e., how often the same artifacts (e.g., files or commits) appear in different contexts (e.g., commits or source code distribution places). We observe a combinatorial explosion in the multiplication of identical source code files across different commits. We discuss the implications of these findings for the problem of tracking the provenance of source code artifacts (e.g., where and when a given source code file or commit has been observed in the wild) for the entire body of publicly available source code. To that end we benchmark different data models for capturing software provenance information at this scale and growth rate. We identify a viable solution that is deployable on commodity hardware and appears to be maintainable for the foreseeable future.
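To make the notion of a multiplication factor concrete, here is a minimal sketch on toy data. It assumes one simple reading of the term: the average number of distinct commits in which each unique file content appears. The function name `multiplication_factor`, the input layout (commit id mapped to file content hashes), and the toy hashes are all hypothetical and chosen for illustration; the paper's exact definition and measurement pipeline may differ.

```python
from collections import Counter


def multiplication_factor(commits):
    """Average number of commits in which each unique file content appears.

    `commits` maps a commit id to an iterable of file content hashes.
    This layout is an assumption made for illustration only.
    """
    occurrences = Counter()
    for blobs in commits.values():
        # Count each file content at most once per commit.
        occurrences.update(set(blobs))
    if not occurrences:
        return 0.0
    # Total (file, commit) occurrences divided by the number of unique files.
    return sum(occurrences.values()) / len(occurrences)


# Toy data: three commits sharing some identical file contents.
toy_commits = {
    "c1": ["a", "b"],
    "c2": ["a", "b", "c"],
    "c3": ["a", "c", "d"],
}

factor = multiplication_factor(toy_commits)
print(factor)
```

On the toy data, four unique file contents account for eight (file, commit) occurrences, so each file appears in two commits on average. At archive scale the same ratio would have to be computed over billions of files, which is why the choice of provenance data model matters.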

research
02/12/2021

The Software Heritage Filesystem (SwhFS): Integrating Source Code Archival with Development

We introduce the Software Heritage filesystem (SwhFS), a user-space file...
research
09/17/2019

Breaking Imphash

There are numerous schemes to generically signature artifacts. We specif...
research
08/22/2023

The Software Heritage License Dataset (2022 Edition)

Context: When software is released publicly, it is common to include wit...
research
01/23/2020

Referencing Source Code Artifacts: a Separate Concern in Software Citation

Among the entities involved in software citation, software source code r...
research
09/24/2019

How to use Software Heritage for archiving and referencing your source code: guidelines and walkthrough

Software source code is an essential research output, and many research ...
research
07/22/2022

Efficient Prior Publication Identification for Open Source Code

Free/Open Source Software (FOSS) enables large-scale reuse of preexistin...
research
02/15/2022

Worldwide Gender Differences in Public Code Contributions

Gender imbalance is a well-known phenomenon observed throughout sciences...
