LabelGit: A Dataset for Software Repositories Classification using Attributed Dependency Graphs

03/16/2021
by Cezar Sas, et al.

Software repository hosting services contain large amounts of open-source software; GitHub alone hosts more than 100 million repositories, from new to established ones. Given this vast number of projects, there is a pressing need for search based on a software's content and features. However, even though GitHub offers various solutions to aid software discovery, most repositories do not have any labels, which reduces the utility of search and topic-based analysis. Moreover, classifying software modules is gaining importance given the rise of Component-Based Software Development. Previous work, however, has focused on software classification using keyword-based approaches or proxies for the project (e.g., the README), which are not always available. In this work, we create a new annotated dataset of GitHub Java projects called LabelGit. Our dataset uses information taken directly from the source code, such as the dependency graph and neural representations of the source code derived from its identifiers. With this dataset, we hope to aid the development of solutions that do not rely on proxies but instead use the entire source code to perform classification.
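
To make the idea of an attributed dependency graph concrete, the following is a minimal sketch in Python, not the LabelGit format itself. The class `AttributedDependencyGraph`, the helper `mean_vector`, the Java-style class names, and the two-dimensional vectors are all hypothetical and chosen only for illustration; in practice the node vectors would come from a neural model applied to the identifiers.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class AttributedDependencyGraph:
    """Directed dependency graph whose nodes carry a feature vector.

    Hypothetical layout for illustration; not the schema used by LabelGit.
    """

    features: dict[str, list[float]] = field(default_factory=dict)
    edges: set[tuple[str, str]] = field(default_factory=set)

    def add_node(self, name: str, vector: list[float]) -> None:
        # The vector stands in for an embedding of the node's identifiers.
        self.features[name] = list(vector)

    def add_dependency(self, source: str, target: str) -> None:
        # Record that `source` depends on `target`.
        if source not in self.features or target not in self.features:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges.add((source, target))

    def dependencies_of(self, name: str) -> list[str]:
        # Direct dependencies of `name`, sorted for stable output.
        return sorted(t for s, t in self.edges if s == name)


def mean_vector(vectors) -> list[float]:
    """Average equal-length vectors component-wise."""
    vectors = list(vectors)
    if not vectors:
        raise ValueError("need at least one vector")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Toy graph with made-up class names and made-up 2-d "embeddings".
graph = AttributedDependencyGraph()
graph.add_node("app.Main", [3.0, 0.0])
graph.add_node("app.Parser", [0.0, 1.0])
graph.add_node("app.Lexer", [0.0, 2.0])
graph.add_dependency("app.Main", "app.Parser")
graph.add_dependency("app.Parser", "app.Lexer")

# One simple (assumed) way to get a single project-level vector:
# average the node attributes.
project_vector = mean_vector(graph.features.values())
```

Averaging node vectors is only one possible way to summarize a project and ignores the edges entirely; a classifier that exploits the graph structure would consume `edges` as well.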
