GitRanking: A Ranking of GitHub Topics for Software Classification using Active Sampling

05/19/2022
by Cezar Sas, et al.

GitHub is the world's largest host of source code, with more than 150M repositories. However, most of these repositories are unlabeled or inadequately labeled, making it harder for users to find relevant projects. Various approaches to software application domain classification have been proposed in recent years, but they lack a well-defined taxonomy that is hierarchical, grounded in a knowledge base, and free of irrelevant terms. This work proposes GitRanking, a framework for creating a classification of topics ranked into discrete levels based on how general or specific their meaning is. We collected 121K topics from GitHub and considered 60% of the most frequent ones for the ranking. GitRanking 1) uses active sampling to minimize the number of required annotations; and 2) links each topic to Wikidata, reducing ambiguities and improving the reusability of the taxonomy. Our results show that developers, when annotating their projects, avoid using terms with a high degree of specificity. This makes it harder for other users to find and discover their projects. Furthermore, we show that GitRanking can effectively rank terms according to their general or specific meaning. This ranking would be an essential asset for developers to build upon, allowing them to complement their annotations with more precise topics. Finally, we show that GitRanking is a dynamically extensible method: it can currently accept further terms to be ranked with a minimal number of annotations (∼15). This paper is the first collective attempt to build a ground-up taxonomy of software domains.


