World of Code: Enabling a Research Workflow for Mining and Analyzing the Universe of Open Source VCS data

by Yuxing Ma, et al.

Open source software (OSS) is essential for modern society and, while substantial research has been done on individual (typically central) projects, only a limited understanding of the periphery of the entire OSS ecosystem exists. For example, how are the tens of millions of projects in the periphery interconnected through technical dependencies, code sharing, or knowledge flow? To answer such questions we: a) create a very large and frequently updated collection of version control data for the entire FLOSS ecosystem, named World of Code (WoC), that can completely cross-reference authors, projects, commits, blobs, dependencies, and history of the FLOSS ecosystem, and b) provide capabilities to efficiently correct, augment, query, and analyze that data. Our current WoC implementation is capable of being updated on a monthly basis and contains over 18B Git objects. To evaluate its research potential and to create vignettes for its usage, we employ WoC in conducting several research tasks. In particular, we find that it is capable of supporting trend evaluation, ecosystem measurement, and the determination of package usage. We expect WoC to spur investigation into global properties of OSS development, leading to increased resiliency of the entire OSS ecosystem. Our infrastructure facilitates the discovery of key technical dependencies, code flow, and social networks that provide the basis to determine the structure and evolution of the relationships that drive FLOSS activities and innovation.




Release as a Contract: A Concept of Meta-Maintenance for the Entire FLOSS Ecosystem

We advocate for a paradigm shift in supporting free/libre and open sourc...

The CROSS Incubator: A Case Study for funding and training RSEs

The incubator and research projects sponsored by the Center for Research...

Building the Collaboration Graph of Open-Source Software Ecosystem

The Open-Source Software community has become the center of attention fo...

Open Source Software Sustainability: Combining Institutional Analysis and Socio-Technical Networks

Open Source Software (OSS) forms much of the fabric of our digital socie...

The Maven Dependency Graph: a Temporal Graph-based Representation of Maven Central

The Maven Central Repository provides an extraordinary source of data to...

Constructing Temporal Networks of OSS Programming Language Ecosystems

One of the primary factors that encourage developers to contribute to op...

A Longitudinal View at the Adoption of Multipath TCP

Multipath TCP (MPTCP) extends traditional TCP to enable simultaneous use...
