Public Git Archive: a Big Code dataset for all

03/20/2018
by Vadim Markovtsev, et al.

The number of open source software projects has been growing exponentially. GitHub, the major online software repository host, has accumulated tens of millions of publicly available Git version-controlled repositories. Although the research potential of this open source code is clearly substantial, no significant large-scale open source code datasets exist. In this paper, we present the Public Git Archive, a dataset of 182,014 top-bookmarked Git repositories from GitHub. We describe the novel data retrieval pipeline that can be used to reproduce it. We also elaborate on the strategy for performing dataset updates and on the related legal issues. The Public Git Archive occupies 3.0 TB on disk and is an order of magnitude larger than existing source code datasets. The dataset is available over HTTP and provides the source code of the projects, the related metadata, and the development history. The data retrieval pipeline employs an optimized worker queue model and an optimized archive format that stores forked Git repositories efficiently, reducing the amount of data to download and persist. The Public Git Archive aims to open a myriad of new opportunities for "Big Code" research.
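The abstract does not say how the worker queue or the archive format work internally. The sketch below only illustrates the general idea, under loud assumptions: workers pull repositories from a queue and write their objects into a shared content-addressed pool, so objects that a fork shares with its upstream are kept once. The function `store_repositories`, the in-memory `pool`, and the use of byte strings as stand-ins for Git objects are all hypothetical and are not the authors' pipeline or format.

```python
import hashlib
import queue
import threading


def store_repositories(repos, num_workers=4):
    """Store repositories in a shared content-addressed pool (illustrative only).

    `repos` maps a repository name to a list of byte strings that stand in
    for Git objects. Returns (pool, manifests).
    """
    tasks = queue.Queue()
    pool = {}        # object hash -> object bytes, shared by all repositories
    manifests = {}   # repository name -> ordered list of object hashes
    lock = threading.Lock()

    def worker():
        while True:
            item = tasks.get()
            if item is None:          # sentinel: no more work for this worker
                tasks.task_done()
                return
            name, objects = item
            hashes = []
            for obj in objects:
                digest = hashlib.sha1(obj).hexdigest()
                with lock:
                    # An object already stored by another repository is reused.
                    pool.setdefault(digest, obj)
                hashes.append(digest)
            with lock:
                manifests[name] = hashes
            tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for item in repos.items():
        tasks.put(item)
    for _ in threads:
        tasks.put(None)               # one sentinel per worker
    tasks.join()
    for t in threads:
        t.join()
    return pool, manifests
```

With an upstream and a fork that share two of their three objects, the pool would hold four objects instead of six, which is the kind of saving the abstract attributes to storing forks together.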


