Sampling Projects in GitHub for MSR Studies

03/08/2021
by Ozren Dabic, et al.

Almost every Mining Software Repositories (MSR) study requires, as a first step, the selection of the subject software repositories. These repositories are usually collected from hosting services such as GitHub using selection criteria dictated by the study goal. For example, a study on licensing might select only projects that explicitly declare a license. Once the selection criteria have been defined, utilities such as the GitHub APIs can be used to "query" the hosting service. However, researchers have to deal with the usage limits imposed by these APIs and with missing information. For example, the GitHub search APIs allow 30 requests per minute and, when searching repositories, return only limited information (e.g., the number of commits in a repository is not included). To support researchers in sampling projects from GitHub, we present GHS (GitHub Search), a dataset containing 25 characteristics (e.g., number of commits, license) of 735,669 repositories written in 10 programming languages. The set of characteristics was derived by looking for frequently used project selection criteria in MSR studies, and the dataset is continuously updated to (i) always provide fresh data about the existing projects, and (ii) increase the number of indexed projects. The GHS dataset can be queried through a web application we built, which lets users set many combinations of the selection criteria needed for a study and download information about the matching repositories: https://seart-ghs.si.usi.ch.
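
To make the API constraints described above concrete, the following is a minimal sketch of how a researcher might query the GitHub repository search endpoint directly while staying under the 30 requests per minute quoted in the abstract. It is not part of GHS. The helper names `build_search_url` and `Throttle` are hypothetical, and the selection criteria (language, minimum stars, license) are illustrative assumptions. The sketch only builds URLs and spaces out calls; it performs no network requests.

```python
import time
from urllib.parse import urlencode

SEARCH_URL = "https://api.github.com/search/repositories"
REQUESTS_PER_MINUTE = 30  # search limit quoted in the abstract


def build_search_url(language, min_stars, license_key=None, page=1):
    """Build a repository-search URL from simple selection criteria.

    Hypothetical helper: the chosen criteria are only an illustration.
    """
    qualifiers = [f"language:{language}", f"stars:>={min_stars}"]
    if license_key:
        qualifiers.append(f"license:{license_key}")
    params = {"q": " ".join(qualifiers), "per_page": 100, "page": page}
    return f"{SEARCH_URL}?{urlencode(params)}"


class Throttle:
    """Space calls evenly, one every 60 / per_minute seconds."""

    def __init__(self, per_minute=REQUESTS_PER_MINUTE,
                 clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / per_minute
        self.clock = clock      # injectable so the logic can be tested
        self.sleep = sleep
        self.next_allowed = None

    def wait(self):
        """Block until the next request is allowed, then reserve a slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.interval


if __name__ == "__main__":
    throttle = Throttle()
    for page in range(1, 4):
        throttle.wait()
        # A real client would issue the HTTP request here.
        print(build_search_url("python", 100, "mit", page=page))
```

Even with such throttling, the search results would not include the number of commits, so a per-repository follow-up request would still be needed. This is the kind of overhead a precomputed dataset is meant to avoid.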


