HiGitClass: Keyword-Driven Hierarchical Classification of GitHub Repositories

10/16/2019
by Yu Zhang, et al.

GitHub has become an important platform for code sharing and scientific exchange. With the massive number of repositories available, there is a pressing need for topic-based search. Although topic labels have been introduced, the majority of GitHub repositories do not have any, which limits the utility of search and topic-based analysis. This work frames automatic repository classification as keyword-driven hierarchical classification: users only need to provide a label hierarchy with keywords as supervision. This setting is flexible, adapts to users' needs, accounts for the different granularity of topic labels, and requires minimal human effort. We identify three key challenges of this problem: (1) the presence of multi-modal signals; (2) supervision scarcity and bias; (3) supervision format mismatch. To address these challenges, we propose the HiGitClass framework, comprising three modules: heterogeneous information network embedding; keyword enrichment; topic modeling and pseudo document generation. Experimental results on two GitHub repository collections confirm that HiGitClass is superior to existing weakly-supervised and dataless hierarchical classification methods, especially in its ability to integrate both structured and unstructured data for repository classification.
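To make the supervision format concrete, the sketch below shows one way a user-provided label hierarchy with keywords might be represented, together with a naive top-down keyword-matching baseline. This is not the HiGitClass method: it uses no network embedding, keyword enrichment, or pseudo document generation. The names `LabelNode`, `subtree_score`, `classify`, and the example labels and keywords are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LabelNode:
    """One node of a user-provided label hierarchy (hypothetical structure)."""
    name: str
    keywords: List[str]
    children: List["LabelNode"] = field(default_factory=list)


def subtree_score(node: LabelNode, tokens: List[str]) -> int:
    """Count keyword hits for a node and all of its descendants."""
    own = sum(tokens.count(k) for k in node.keywords)
    return own + sum(subtree_score(c, tokens) for c in node.children)


def classify(top_level: List[LabelNode], text: str) -> List[str]:
    """Walk the hierarchy top-down, picking the best-scoring child each level.

    Naive baseline for illustration only; ties go to the first candidate.
    """
    tokens = text.lower().split()
    path, level = [], top_level
    while level:
        best = max(level, key=lambda n: subtree_score(n, tokens))
        path.append(best.name)
        level = best.children
    return path


# Illustrative hierarchy: labels and keywords are assumptions, not from the paper.
HIERARCHY = [
    LabelNode("machine-learning", ["learning"], [
        LabelNode("computer-vision", ["image"]),
        LabelNode("nlp", ["text"]),
    ]),
    LabelNode("bioinformatics", ["genome"], [
        LabelNode("sequence-analysis", ["sequence"]),
    ]),
]

result = classify(HIERARCHY, "A toolkit for image segmentation with deep learning")
print(result)
```

A baseline like this relies only on exact keyword hits in the repository text, which is where the scarcity and bias of keyword supervision mentioned in the abstract would be expected to hurt most.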


research
11/05/2022

Hierarchical Multi-Label Classification of Scientific Documents

Automatic topic classification has been studied extensively to assist ma...
research
10/06/2021

Weakly-supervised Text Classification Based on Keyword Graph

Weakly-supervised text classification has received much attention in rec...
research
03/16/2021

LabelGit: A Dataset for Software Repositories Classification using Attributed Dependency Graphs

Software repository hosting services contain large amounts of open-sourc...
research
12/04/2017

Topics and Label Propagation: Best of Both Worlds for Weakly Supervised Text Classification

We propose a Label Propagation based algorithm for weakly supervised tex...
research
10/26/2020

Hierarchical Metadata-Aware Document Categorization under Weak Supervision

Categorizing documents into a given label hierarchy is intuitively appea...
research
10/18/2020

Topic Recommendation for Software Repositories using Multi-label Classification Algorithms

Many platforms exploit collaborative tagging to provide their users with...
research
05/24/2018

An experimental comparison of label selection methods for hierarchical document clusters

The focus of this paper is on the evaluation of sixteen labeling methods...
