Sampling Projects in GitHub for MSR Studies

by Ozren Dabic, et al.

Almost every Mining Software Repositories (MSR) study requires, as a first step, the selection of the subject software repositories. These repositories are usually collected from hosting services like GitHub using specific selection criteria dictated by the study goal. For example, a study related to licensing might be interested in selecting projects that explicitly declare a license. Once the selection criteria have been defined, utilities such as the GitHub APIs can be used to "query" the hosting service. However, researchers have to deal with usage limitations imposed by these APIs and with a lack of required information. For example, the GitHub search APIs allow 30 requests per minute and, when searching repositories, provide only limited information (e.g., the number of commits in a repository is not included). To support researchers in sampling projects from GitHub, we present GHS (GitHub Search), a dataset containing 25 characteristics (e.g., number of commits, license, etc.) of 735,669 repositories written in 10 programming languages. The set of characteristics has been derived by looking for frequently used project selection criteria in MSR studies, and the dataset is continuously updated to (i) always provide fresh data about the existing projects, and (ii) increase the number of indexed projects. The GHS dataset can be queried through a web application we built, which allows users to set many combinations of selection criteria needed for a study and to download the information of matching repositories.








I Introduction

The amount of data available in software repositories is growing faster than ever. At the time of writing, GitHub [GitHub] hosts over 80 million public repositories, accounting for over 1 billion commit activities. Such an unprecedented amount of software data represents the main ingredient of many Mining Software Repositories (MSR) studies.

One of the first steps in MSR studies consists of selecting the subject projects, i.e., the software repositories to analyze in order to answer the research questions (RQs) of interest. Such a step is crucial to achieve generalizability of the findings and to ensure that the selected projects result in useful data points for the goal of the study. For example, a study investigating the types of issues reported in GitHub [Bissyande:issre2013] requires the selection of repositories that regularly use the GitHub integrated issue tracker. Instead, a study interested in the pull request (PR) process of OSS projects [Zampetti:saner2019] must ensure that the subject systems actually adopt the PR mechanism (e.g., by verifying that a minimum number of PRs have been submitted in a given repository). In addition to RQ-specific selection criteria, several studies adopt specific filters to exclude toy and personal projects. For example, previous works excluded repositories having a low number of stars [wen2020empirical], commits [pecorelli2020developer], or issues [Bissyande:issre2013].

Once the selection criteria have been defined, software repositories satisfying them must be identified. Frequently, the search space is represented by all projects hosted on GitHub which, as noted above, number in the tens of millions. To query such a collection of repositories, developers can use the official GitHub APIs [GitHubAPI] that, however, come with a number of limitations both in terms of the number of requests that can be triggered and the information that can be retrieved. For example, the GitHub search API allows for a maximum of 30 requests per minute and each request can return at most 100 results. Searching only for some basic information about the public Java repositories hosted on GitHub would require, at the time of writing, 160k requests (88 hours). If additional information is required for each repository (e.g., its number of commits), additional requests must be triggered, making the process even more time-consuming. Moreover, setting an appropriate value for the selection criteria (e.g., a project must have at least 100 commits) without having an overall view of the available data can be tricky. For instance, researchers cannot easily select the top 10% of repositories in terms of number of commits without first collecting this information for the entire population. Finally, given a set of selection criteria, the GitHub search API provides at most the first 1,000 results (through 10 requests). This means there is no easy way to retrieve all matching results for a query if they exceed this upper bound.
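The request budget quoted above can be checked with a few lines of arithmetic. The sketch below uses only the limits stated in the text (30 requests per minute, 100 results per request, 1,000 results per query); the function names are illustrative and are not part of GHS.

```python
import math

REQUESTS_PER_MINUTE = 30       # search API limit quoted in the text
RESULTS_PER_REQUEST = 100      # maximum page size quoted in the text
MAX_RESULTS_PER_QUERY = 1_000  # result cap per query quoted in the text


def requests_needed(n_results: int) -> int:
    """Number of paged requests needed to list n_results items."""
    return math.ceil(n_results / RESULTS_PER_REQUEST)


def hours_needed(n_requests: int) -> float:
    """Wall-clock hours to issue n_requests at the maximum allowed rate."""
    return n_requests / REQUESTS_PER_MINUTE / 60


if __name__ == "__main__":
    n_requests = 160_000  # figure quoted in the text for public Java repositories
    print(f"{n_requests} requests take roughly {hours_needed(n_requests):.1f} hours")
    print(f"requests per capped query: {requests_needed(MAX_RESULTS_PER_QUERY)}")
```

At the maximum rate, 160k requests take a little under 89 hours, in line with the figure reported above, and this does not yet include any per-repository follow-up request.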

To support developers in mining GitHub, several solutions have been proposed. Popular ones are GHTorrent [Gousi13] and GHArchive [GHArchive]. Both projects continuously monitor public events on GitHub and archive them. While the value of these tools is undisputed, as are the benefits they brought to the research community, they do not provide a handy solution to support the sampling of projects on GitHub according to the desired selection criteria. For example, computing the number of commits, issues, etc. for a repository in GHTorrent would require MySQL queries joining multiple tables.

We present GHS (GitHub Search) [GitHubSearch], a dataset and a tool to simplify the sampling of projects to use in MSR studies. GHS continuously mines a set of 25 characteristics of GitHub repositories that have often been used as selection criteria in MSR studies and that, according to our experience in the field, can be useful for sampling projects (e.g., adopted license, number of commits, contributors, issues, and pull requests).

The tool behind GHS can be configured to mine projects written in specific programming languages. As of today, it has mined information about over 700k repositories written in 10 different languages (i.e., Python, Java, C++, C, C#, Objective-C, JavaScript, TypeScript, Swift, and Kotlin).

A stable version of the dataset is hosted on Zenodo [ghs-dataset] and features 735,669 repositories written in the previously mentioned languages. As detailed in the following, GHS has been designed with scalability in mind and to specifically support the sampling of projects for MSR studies. While the user can download the dataset and query it with an ad-hoc script, a querying interface with export features is available online [GitHubSearch].

II The Dataset

This section describes GHS [ghs-dataset], a dataset containing information about 735,669 public GitHub repositories that can be used by researchers to easily select projects for an empirical study. In particular, 25 characteristics of each project are mined, stored, and continuously updated. Our mining tool exploits the GitHub search API and an ad-hoc crawler we built to collect specific information from the repositories’ web pages. Table I lists the collected characteristics, together with a short description of each, the source from which the information is mined, and one example of a work in the literature that used the characteristic in its empirical study (or “-” if we did not find a related reference).

Fig. 1 depicts the main steps behind the data collection process put into place to build GHS. The following subsections detail such a process.

Fig. 1: The GHS architecture
| Characteristic | Description | Mining Source | Used in |
| name | Name of the repository in the form user_name/repo_name | GitHub Search API | [muse2020prevalence] |
| commits | Number of commits on the default branch | Repository’s landing page | [gonzalez2020did] |
| last_commits_sha | The SHA-1 hash of the latest commit on the default branch | Repository’s landing page | - |
| last_commits | The date of the latest commit on the default branch | Repository’s landing page | [gonzalez2020state] |
| license | The license used for the repository (if any) | GitHub Search API | [vendome2017license] |
| branches | Number of remote branches | Repository’s landing page | [Han:compsac2019] |
| default_branch | Name of the default branch | GitHub Search API | - |
| contributors | Number of contributors | Repository’s landing page | [pecorelli2020developer] |
| releases | Number of releases [releases] | Repository’s landing page | [Moreno:tse2017] |
| watchers | Number of users watching the repository | Repository’s landing page | [sheoran2014understanding] |
| stars | Number of stars the repository received | GitHub Search API | [Zampetti:saner2019] |
| forks | Number of repositories forked from this repository | GitHub Search API | [gonzalez2020state] |
| is_fork_project | Whether the project is a fork | GitHub Search API | [bryksin2020using] |
| size | The size of the project (in kilobytes) | GitHub Search API | [borrelli2020detecting] |
| created_at | Date when the repository was created | GitHub Search API | [bryksin2020using] |
| pushed_at | Latest date when a commit was pushed to any of the repository’s branches | GitHub Search API | [gonzalez2020did] |
| updated_at | Latest date when the repository object was updated, e.g., description changed | GitHub Search API | - |
| homepage | The repository’s homepage URL (if any) | GitHub Search API | [Aghajani:icse2019] |
| main_language | The main language in which the repository’s source code is written | GitHub Search API | [nakamaru2020empirical] |
| total_issues | Total number of issues (both open and closed) | Repository’s issues page | [Bissyande:issre2013] |
| open_issues | Number of open issues | Repository’s issues page | [Bissyande:issre2013] |
| total_pull_requests | Total number of pull requests (both open and closed) | Repository’s pull requests page | [Zampetti:saner2019] |
| open_pull_requests | Number of open pull requests | Repository’s pull requests page | [Zampetti:saner2019] |
| has_wiki | Whether the repository has a wiki | GitHub Search API | [Tantisuwankul:jss2019] |
| archived | Whether the repository is marked as archived (i.e., read-only) | GitHub Search API | [Coelho:esem18] |
TABLE I: Characteristics stored in GHS for each GitHub project

II-A Data Extraction

As depicted in Fig. 1, the GHS data collection process is carried out through three main components.

1. GitHub API Invoker: This component has two main responsibilities. First, it can retrieve the list of repositories (i) written in a specific language, and (ii) created or updated during a certain time period. The latter feature is needed, as detailed later, to overcome the GitHub API maximum result limit of 1,000 results per request. To retrieve, for example, the list of repositories written in Java and updated in March 2020, the following GitHub API request is triggered:

For each collected repository, the information in Table I having “GitHub Search API” as mining source is retrieved.

Second, this component is in charge of monitoring whether the GitHub access token being used for mining has exceeded its request limit. Indeed, we use authenticated requests to increase the usage limits imposed by the GitHub API.
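As a rough illustration of the two responsibilities of this component, the sketch below builds a repository search URL for one language and one date window, and decides how long to pause once the token's quota is used up. It is not the GHS implementation: the function names are hypothetical, and the choice of the `pushed:` qualifier to approximate "created or updated" is an assumption. The endpoint, the `language:` and `stars:` qualifiers, and the `X-RateLimit-Remaining` / `X-RateLimit-Reset` response headers belong to the public GitHub REST API.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.github.com/search/repositories"


def build_search_url(language, start, end, min_stars=10, page=1, per_page=100):
    """Build a repository search URL for one language and one date window.

    `start` and `end` are ISO dates (YYYY-MM-DD). Using `pushed:` here is an
    assumption standing in for the "created or updated" filter in the text.
    """
    query = f"language:{language} pushed:{start}..{end} stars:>={min_stars}"
    params = {"q": query, "per_page": per_page, "page": page}
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


def seconds_to_wait(headers, now_epoch):
    """Return how long to pause before sending the next request.

    Reads the rate-limit headers of the previous response and returns 0
    while requests remain in the current window.
    """
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    reset_epoch = int(headers.get("X-RateLimit-Reset", "0"))
    if remaining > 0:
        return 0
    return max(0, reset_epoch - now_epoch)


if __name__ == "__main__":
    # Java repositories active in March 2020, first page of results.
    print(build_search_url("java", "2020-03-01", "2020-03-31"))
```

Keeping URL construction and quota handling as pure functions makes both easy to test without sending any request.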

2. GitHub Website Crawler: This component is used to collect, for a given repository, all information in Table I having a repository’s webpage as mining source. Since the information of interest is scattered in different pages, this component mines the repository’s (i) landing page [landingPage], (ii) issues page [issuesPage], and (iii) pull requests page [prPage].

We parse the HTML of these pages by using CSS selectors that identify the elements containing the information of interest. For this task we primarily rely on the jsoup library [jsoup]. Unfortunately, due to the use of dynamic content generation in the GitHub pages, not all elements are present when downloading the content of a page, e.g., the number of contributors is dynamically generated and cannot always be captured using jsoup (it depends on the time required for loading the needed information). When jsoup fails to retrieve a specific piece of information, we rely on the Selenium WebDriver for Chrome [selenium], which provides the possibility to wait for the required information to load. Since Selenium introduces a significant performance drawback, it is only used as a backup strategy when jsoup returns an error.

We are aware that relying on CSS selectors to collect information can require future updates if the GitHub UI substantially changes. We considered such a scenario in our implementation by using, when possible, generic selectors that are unlikely to change over time. Also, this “maintenance cost” is counterbalanced by the high performance that web page parsing ensures in retrieving the required information.
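The "fast parser first, slow browser only on failure" strategy can be sketched as follows. This is an illustration rather than the GHS code: Python's standard `html.parser` stands in for jsoup, a plain callable stands in for the Selenium-based retrieval, and the class name `contributors` used in the usage example is a made-up selector, not one taken from the GitHub pages.

```python
from html.parser import HTMLParser


class CounterExtractor(HTMLParser):
    """Collect the text of the first element carrying a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self._capturing = False
        self.value = None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.value is None and self.css_class in classes:
            self._capturing = True

    def handle_endtag(self, tag):
        # An element that closes without text was not populated yet.
        self._capturing = False

    def handle_data(self, data):
        if self._capturing and data.strip():
            self.value = data.strip()
            self._capturing = False


def parse_count(text):
    """Convert a counter such as '1,234' to an int."""
    return int(text.replace(",", ""))


def extract_count(static_html, css_class, slow_fetch):
    """Try the static HTML first; call slow_fetch() only if the value is missing.

    `slow_fetch` is a hypothetical callable representing the expensive
    browser-based retrieval; it returns the counter text.
    """
    parser = CounterExtractor(css_class)
    parser.feed(static_html)
    if parser.value is not None:
        return parse_count(parser.value), "static"
    return parse_count(slow_fetch()), "fallback"
```

For example, `extract_count('<span class="contributors">1,234</span>', "contributors", fetch)` never calls `fetch`, whereas an empty `<span class="contributors"></span>` triggers the fallback.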

3. Repository Miner: This is the core component orchestrating the collection of the GHS dataset. Before describing how it works, it is important to clarify that the set of programming languages of interest (i.e., the ones for which repositories will be mined) is defined by the GHS administrator. In our case, we set the 10 languages composing the current version of the dataset. The Repository Miner implements a mining algorithm that is triggered every six hours to continuously update the information in GHS. For each programming language of interest, the algorithm checks whether any prior mining has been conducted. If no record of prior mining is found, the GitHub API Invoker is triggered to mine all repositories created or updated between January 1st 2008 (GitHub started in February 2008) and the current time minus two hours (we ignore the last two hours since it takes time for GitHub’s internal database to sync newly created projects). If, instead, a previous mining process has been performed for the specific language, the GitHub API Invoker collects all repositories created or updated between the last mined date and the current time minus two hours.

In both cases the GitHub API Invoker collects all repositories (i) written in the selected language, (ii) created/updated during the selected interval, and (iii) having at least 10 stars. The decision to collect only repositories having at least 10 stars aims at drastically reducing the number of repositories we store and makes the data collection more scalable (e.g., from preliminary analyses we performed on Java, 5% of repositories have at least 10 stars). We acknowledge that, as also shown in previous work [Munaiah2017], the number of stars is not a good proxy for repository quality or relevance, and there are better ways to automatically identify engineered GitHub projects (e.g., the Reaper tool [Munaiah2017]). However, we believe that the 10-star threshold provides a reasonable compromise between the quality of the data and the time required to mine and continuously update all projects.

If the GitHub API Invoker retrieves more than 1,000 repositories for a time interval, it splits the interval in half, and the two new time intervals are pushed to a priority queue handling the requests to process. Such a mechanism is needed since the GitHub API only provides the first 1,000 results for a request. The algorithm repeatedly picks and processes the oldest interval from the queue until the queue is empty, meaning the mining for the current language is completed.

Otherwise, if an interval yields at most 1,000 results, the algorithm iterates over the result list. For each retrieved repository, the algorithm scrapes the missing information from the repository web pages (using the GitHub Website Crawler), and saves the full record to a database. Our algorithm can mine/update 20k repositories every day.
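The window selection and interval-halving logic described for the Repository Miner can be sketched as below. This is a minimal sketch under stated assumptions, not the GHS code: `search` is a hypothetical callable returning the total number of matches for an interval together with the (capped) list of results, and the one-second lower bound on interval length is an added safeguard against intervals that cannot be split further.

```python
import heapq
from datetime import datetime, timedelta

FIRST_DATE = datetime(2008, 1, 1)  # start of the very first mining window
SYNC_DELAY = timedelta(hours=2)    # the most recent two hours are skipped
MAX_RESULTS = 1_000                # result cap of the search API
MIN_SPAN = timedelta(seconds=1)    # assumption: do not split below one second


def mining_window(last_mined, now):
    """Return the (start, end) window to mine for one language."""
    start = last_mined if last_mined is not None else FIRST_DATE
    return start, now - SYNC_DELAY


def mine(window, search):
    """Process a window oldest-first, halving any interval over the cap.

    `search(start, end)` is a hypothetical callable returning a pair
    (total_count, results) for repositories in [start, end).
    """
    queue = [window]  # min-heap ordered by interval start (oldest first)
    mined = []
    while queue:
        start, end = heapq.heappop(queue)
        total, results = search(start, end)
        if total > MAX_RESULTS and end - start > MIN_SPAN:
            middle = start + (end - start) / 2
            heapq.heappush(queue, (start, middle))
            heapq.heappush(queue, (middle, end))
        else:
            # Here each repository would be crawled and saved to the database.
            mined.extend(results)
    return mined
```

Because the heap is keyed on the interval start, repositories are processed in chronological order, which matches the "oldest interval first" policy described above.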

II-B Data Storage

The data collected for all repositories (Table I) is stored in a MySQL database. When updated information about a previously mined repository is collected, the corresponding rows for that repository will be updated with the new information (i.e., no new row is created).

While this ensures that the repository data contained within GHS is kept updated, GHS does not offer an overview of the historic evolution of said characteristics.
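The update-in-place policy amounts to an "upsert" keyed on the repository. The sketch below illustrates it with Python's built-in `sqlite3` module in place of MySQL, so that it runs without a database server; the table and column names are illustrative only. In MySQL the same effect is commonly obtained with `INSERT ... ON DUPLICATE KEY UPDATE`, although the text does not say which statement GHS actually uses.

```python
import sqlite3


def open_store():
    """Create an in-memory table keyed by repository name."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE repo (name TEXT PRIMARY KEY, commits INTEGER, stars INTEGER)"
    )
    return conn


def save(conn, name, commits, stars):
    """Insert a repository, or overwrite its row if it was mined before."""
    conn.execute(
        "INSERT INTO repo (name, commits, stars) VALUES (?, ?, ?) "
        "ON CONFLICT(name) DO UPDATE SET "
        "commits = excluded.commits, stars = excluded.stars",
        (name, commits, stars),
    )
    conn.commit()


if __name__ == "__main__":
    conn = open_store()
    save(conn, "example/repo", 100, 50)  # first mining run
    save(conn, "example/repo", 120, 55)  # later run overwrites the same row
    print(conn.execute("SELECT * FROM repo").fetchall())
```

Only the latest values survive, which is why the historic evolution of a characteristic cannot be reconstructed from the stored data.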

A stable version of the dataset, exported on January 28th 2021, is hosted on Zenodo [ghs-dataset] and features 735,669 repositories written in 10 languages. DOI: 10.5281/zenodo.4476391

Fig. 2: GUI to query GHS [GitHubSearch] (left) and results page with export options (right).

II-C Querying GHS

The latest and continuously growing version of our dataset can be downloaded/queried through our online platform [GitHubSearch]. Fig. 2 depicts the GUI we provide to query GHS (left part) and an example of the results page obtained by searching for the Apache Java repositories having at least 100 commits (right).

General filters (1) can be applied to select projects containing a specific string in their name (e.g., “apache/” will return all projects run by the Apache Software Foundation), having a specific license, written in a given language or using specific labels for their issues (e.g., “refactoring”). The latter feature is still under development, which is why we do not present issue labels in the stable version of GHS.

Projects can also be filtered based on their history and activity (e.g., number of commits, releases) (2), optionally retrieving only repositories that had activity in a specific time frame (3). Finally, filters labeled with (4) concern popularity indicators, while those labeled with (5) allow users to further refine the results list by removing, for example, forks.

By clicking on the “Search” button, the repositories satisfying the search criteria are shown, allowing the user to inspect the results list and, if desired, download it in different formats (6).
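For users who prefer an ad-hoc script over the GUI, the same kind of filtering can be applied to a downloaded copy of the data. The sketch below assumes a CSV export whose column names match the characteristics in Table I; the actual export format may differ, and the sample rows are invented purely for illustration.

```python
import csv
import io


def select(rows, name_contains="", language=None, min_commits=0, exclude_forks=False):
    """Yield the rows that satisfy all of the given selection criteria."""
    for row in rows:
        if name_contains and name_contains not in row["name"]:
            continue
        if language and row["main_language"] != language:
            continue
        if int(row["commits"]) < min_commits:
            continue
        if exclude_forks and row["is_fork_project"].lower() == "true":
            continue
        yield row


# Invented sample data; a real script would open the downloaded file instead.
SAMPLE = """name,main_language,commits,is_fork_project
apache/example-a,Java,2500,false
apache/example-b,Java,40,false
someone/example-c,Java,900,true
apache/example-d,Python,300,false
"""

if __name__ == "__main__":
    rows = csv.DictReader(io.StringIO(SAMPLE))
    # Mirrors the query in Fig. 2: Apache Java repositories with >= 100 commits.
    for row in select(rows, name_contains="apache/", language="Java", min_commits=100):
        print(row["name"], row["commits"])
```

Because every characteristic is already stored per repository, such a script needs no further API requests to evaluate a threshold.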

III Related Work

To support researchers in MSR, several solutions have been proposed. GHArchive [GHArchive] records the public GitHub activities on an hourly basis as JSON archives. This is done by mining the GitHub public event stream (e.g., a user creating a repository, a repository gaining a new watcher) through the use of webhooks [GitHubAPIWebhook]. This means that, for example, to sample all Java repositories created in 2012 we must retrieve all repositories linked to a “create” event from each hour, of each day, of each month of the year. This translates into scanning over 8 thousand files for said events. Thus, while GHArchive is a fantastic data source for MSR studies, it is not convenient for sampling repositories.
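The order of magnitude of this scan follows directly from the hourly granularity. The short sketch below counts the hourly slots in a calendar year, assuming one archive per hour as described above; the function name is illustrative.

```python
from datetime import datetime, timedelta


def hourly_slots(year):
    """Yield one datetime per hour of the given calendar year."""
    current = datetime(year, 1, 1)
    end = datetime(year + 1, 1, 1)
    while current < end:
        yield current
        current += timedelta(hours=1)


if __name__ == "__main__":
    # 2012 is a leap year: 366 days * 24 hours.
    print(sum(1 for _ in hourly_slots(2012)))
```

Every one of these archives would have to be scanned for “create” events, and the language filter would still need to be applied afterwards.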

GHTorrent [Gousi13] continuously collects data from the GitHub API storing it in both relational and non-relational databases. It likely offers the most used dataset in MSR studies, thanks to the huge amount of stored data and no limitations posed on its querying. However, as mentioned in Section I, retrieving specific information such as the number of commits in a repository may require formulating queries on quite a large dataset. GHS, as compared to GHTorrent, (i) stores only basic repository information needed for making projects’ sampling convenient, and (ii) provides a handy GUI to query the dataset.

Software Heritage [dicosmo:hal2017] aims at preserving software in source code form including, e.g., projects deleted from GitHub. It contains, at the time of writing, over 150M repositories featuring almost 10B source files. The focus of such a dataset is different from GHS since Software Heritage is not explicitly meant to simplify project sampling for empirical studies based on (pre-computed) selection criteria.

Surana et al. [surana2020tool] proposed GitRepository, a tool to extract structured information from GitHub repositories related to contributors, issues, pull requests, releases, and subscribers. The authors do not provide a dataset, but a tool able to create a dataset using the GitHub API. GHS provides a wider variety of information, allowing for better sampling, made easy through its GUI. In addition to the discussed works, some older projects are no longer active.

Markovtsev and Long introduced Public Git Archive [markovtsev2018public], a dataset of 180k repositories having at least 50 stars. The dataset was released in 2018 and, to the best of our knowledge, is not kept updated.

Bissyandé et al. [Bissyande:2013] presented Orion, a corpus of software projects collected from GitHub, Google Code [googleCode] and Freecode [freeCode]. To query Orion, a custom-designed DSL must be used. The project webpage [orion] is no longer accessible.

IV Future Work

There are four main directions in which we are improving GHS. First, we will add more and more programming languages over time. Doing this is as easy as changing a configuration file. Second, we will finalize the collection of the issue labels that can be used, for example, when a researcher is interested in repositories explicitly using specific labels such as refactoring or documentation. The GUI already supports such a feature, while the crawling of this information is not yet finalized. Third, the code behind GHS is open source [githubRepo] and we plan to collect requests for additional project characteristics to include in GHS from the research community through its issue tracker. Lastly, we will focus on improving performance, especially in terms of data mining.

V Conclusions

We presented GHS (GitHub Search), a dataset to simplify the sampling of projects for MSR studies. A stable version of GHS is available on Zenodo [ghs-dataset] and features information about 735,669 GitHub repositories written in 10 languages. The dataset is continuously updated and expanded, with its latest version available through our online platform [GitHubSearch] together with a handy querying interface.

VI Acknowledgments

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 851720).