LabelGit: A Dataset for Software Repositories Classification using Attributed Dependency Graphs

by Cezar Sas, et al.

Software repository hosting services contain large amounts of open-source software; GitHub alone hosts more than 100 million repositories, from new to established ones. Given this vast number of projects, there is a pressing need for search based on a project's content and features. Although GitHub offers various solutions to aid software discovery, most repositories do not have any labels, which reduces the utility of search and topic-based analysis. Classifying software modules is also becoming more important given the rise of Component-Based Software Development. However, previous work has focused on software classification using keyword-based approaches or proxies for the project (e.g., the README), which are not always available. In this work, we create a new annotated dataset of GitHub Java projects called LabelGit. Our dataset uses information taken directly from the source code, such as the dependency graph and neural representations of the source code derived from its identifiers. With this dataset, we hope to aid the development of solutions that do not rely on proxies but instead use the entire source code to perform classification.





The Software Heritage Graph Dataset: Large-scale Analysis of Public Software Development History

Software Heritage is the largest existing public archive of software sou...

GitRanking: A Ranking of GitHub Topics for Software Classification using Active Sampling

GitHub is the world's largest host of source code, with more than 150M r...

Using Source Code Density to Improve the Accuracy of Automatic Commit Classification into Maintenance Activities

Source code is changed for a reason, e.g., to adapt, correct, or adapt i...

HiGitClass: Keyword-Driven Hierarchical Classification of GitHub Repositories

GitHub has become an important platform for code sharing and scientific ...

A survey of data transfer and storage techniques in prevalent cryptocurrencies and suggested improvements

This thesis focuses on aspects related to the functioning of the gossip ...

A Dataset for GitHub Repository Deduplication

GitHub projects can be easily replicated through the site's fork process...

To Automatically Map Source Code Entities to Architectural Modules with Naive Bayes

Background: The process of mapping a source code entity onto an architec...