HiGitClass: Keyword-Driven Hierarchical Classification of GitHub Repositories

10/16/2019 ∙ by Yu Zhang, et al. ∙ 11

GitHub has become an important platform for code sharing and scientific exchange. With the massive number of repositories available, there is a pressing need for topic-based search. Even though the topic label functionality has been introduced, the majority of GitHub repositories do not have any labels, impeding the utility of search and topic-based analysis. This work targets the automatic repository classification problem as keyword-driven hierarchical classification. Specifically, users only need to provide a label hierarchy with keywords to supply as supervision. This setting is flexible, adaptive to the users' needs, accounts for the different granularity of topic labels and requires minimal human effort. We identify three key challenges of this problem, namely (1) the presence of multi-modal signals; (2) supervision scarcity and bias; (3) supervision format mismatch. In recognition of these challenges, we propose the HiGitClass framework, comprising of three modules: heterogeneous information network embedding; keyword enrichment; topic modeling and pseudo document generation. Experimental results on two GitHub repository collections confirm that HiGitClass is superior to existing weakly-supervised and dataless hierarchical classification methods, especially in its ability to integrate both structured and unstructured data for repository classification.



There are no comments yet.


page 1

page 3

page 8

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

For the computer science field, code repositories are an indispensable part of the knowledge dissemination process, containing valuable details for reproduction. For software engineers, sharing code also promotes the adoption of best practices and accelerates code development. The needs of the scientific community and that of software developers have facilitated the growth of online code collaboration platforms, the most popular of which is GitHub, with over 96 million repositories and 31 million users as of 2018. With the overwhelming number of repositories hosted on GitHub, there is a natural need to enable search functionality so that users can quickly target repositories of interest. To accommodate this need, GitHub introduced topic labels111https://help.github.com/en/articles/about-topics which allowed users to declare topics for their own repositories. However, topic-based search on GitHub is still far from ideal. For example, when searching for repositories related to “phylogenetics”, a highly relevant repository opentree222https://github.com/OpenTreeOfLife/opentree with many stars and forks does not even show up in the first 10 pages of search results as it does not contain the “phylogenetics” tag. Hence, to improve the search and analysis of GitHub repositories, a critical first step is automatic repository classification.

In the process of examining the automatic repository classification task, we identify three different cases of missing labels: (1) Missing annotation: the majority of repositories (73% in our Machine-Learning dataset and 78% in our Bioinformatics dataset) have no topic labels at all; (2) Incomplete annotation: since topic labels can be arbitrarily general or specific, some repositories may miss coarse-grained labels while others miss fine-grained ones; (3) Evolving label space: related GitHub topics tags may not have existed at the time of creation, so the label is naturally missing. Missing annotation is the major drive behind automatic classification, but this also implies that labeled data is scarce and expensive to obtain. Incomplete annotation reflects the hierarchical relationship between labels: repositories should not only be assigned to labels of one level of granularity, but correspond to a path in the class hierarchy. Finally, the evolving label space requires the classification algorithm to quickly adapt to a new label space, or take the label space as part of the input. Combining these observations, we define our task as keyword-driven hierarchical classification for GitHub repositories. By keyword-driven, we imply that we are performing classification using only a few keywords as supervision.

Compared to the common setting of fully-supervised classification of text documents, keyword-driven hierarchical classification of GitHub repositories poses unique challenges. First of all, GitHub repositories are complex objects with metadata, user interaction and textual description. As a result, multi-modal signals can be utilized for topic classification, including user ownership information, existing tags and README text. To jointly model structured and unstructured data, we propose to construct a heterogeneous information network (HIN) centered upon words. By learning node embeddings in this HIN, we obtain word representations that reflect the co-occurrence of multi-modal signals that are unique to the GitHub repository dataset. We also face the supervision scarcity and bias problem as users only provide one keyword for each class as guidance. This single keyword may reflect user’s partial knowledge of the class and may not achieve good coverage of the class distribution. In face of this challenge, we introduce a keyword enrichment module that expands the single keyword to a keyword set

for each category. The newly selected keywords are required to be close to the target class in the embedding space. Meanwhile, we keep mutual exclusivity among keyword sets so as to create a clear separation boundary. Finally, while users provide a label hierarchy, the classification algorithm ultimately operates on repositories, so there is a mismatch in the form of supervision. Since we already encode the structured information through the HIN embeddings, in our final classification stage we represent each repository as a document. To transform keywords into documents, we first model each class as a topic distribution over words and estimate the distribution parameters. Then based on the topic distributions, we follow a two-step procedure to generate pseudo documents for training. This also allows us to employ powerful classifiers such as CNNs for classification, which would not be possible with the scarce labels.

To summarize, we have the following contributions:

  • We present the task of keyword-driven hierarchical classification of GitHub repositories. While GitHub has been of widespread interest to the research community, no previous efforts have been devoted to the task of automatically assigning topic labels to repositories, which can greatly facilitate repository search and analysis. To deal with the evolving hierarchical label space and circumvent expensive annotation efforts, we only rely on the user-provided label hierarchy and keywords to train the classifier.

  • We design the HiGitClass framework, which consists of three modules: HIN construction and embedding; keyword enrichment; topic modeling and pseudo document generation. The three modules are carefully devised to overcome three identified challenges of our problem: the presence of multi-modal signals; supervision scarcity and bias; supervision format mismatch.

  • We collect two datasets of GitHub repositories from the machine learning and bioinformatics research community. On both datasets we show that our proposed framework HiGitClass outperforms existing supervised and semi-supervised models for hierarchical classification.

The remainder of this paper is organized as follows. In Section II, we formally introduce our problem definition, multi-modal signals in a GitHub repository and heterogeneous information networks. In Section III, we elaborate our framework HiGitClass with its three components. Then in Section IV, we present experimental results and discuss our findings. Section V covers related literature and we conclude in Section VI.

Ii Preliminaries

Ii-a Problem Definition

We study hierarchical classification of GitHub repositories where the categories form a tree structure. Traditional hierarchical classification approaches [7, 19] rely on a large set of labeled training documents. In contrast, to tackle the evolving label space and alleviate annotation efforts, we formulate our task as keyword-driven classification, where users just need to provide the label hierarchy and one keyword for each leaf category. This bears some similarities with the dataless classification proposed in [33] that utilizes an user-defined label hierarchy and class descriptions. There is no requirement of any labeled repository. Formally, our task is defined as follows.

Problem Definition. (Keyword-Driven Hierarchical Classification.) Given a collection of unlabeled GitHub repositories, a tree-structured label hierarchy and one keyword for each leaf class , our task is to assign appropriate category labels to the repositories, where the labels can be either a leaf or an internal node in .

Ii-B GitHub Repositories

Fig. 1: A sample GitHub repository with the user’s name, the repository name, description tags, and README (only the first paragraph is shown).
Fig. 2: The HiGitClass framework. Three key modules (i.e., HIN encoding, keyword enrichment, and pseudo document generation) are used to tackle the aforementioned three challenges, respectively.

Fig. 1 shows a sample GitHub repository333https://github.com/Natsu6767/DCGAN-PyTorch. With the help of GitHub API444https://developer.github.com/v3/, we are able to extract comprehensive information of a repository including metadata, source code and team dynamics. In HiGitClass, we utilize the following information:

User. Users usually have consistent interests and skills. If two repositories share the same user (“Natsu6767” in Fig. 1

), they are more likely to have similar topics (e.g., deep learning or image generation).

Name. If two repositories share the same name, it is likely that one is forked from the other and they should belong to the same topic category. Besides, indicative keywords can be obtained by segmenting the repository name properly (e.g., “DCGAN” and “PyTorch” in Fig. 1).

Description. The description is a concise summary of the repository. It usually contains topic-indicating words (e.g., “DCGAN” and “CelebA” in Fig. 1).

Tags. Although a large proportion of GitHub repositories are not tagged, when available, tags are strong indicators of a repository’s topic (e.g., “dcgan” and “generative-model” in Fig. 1).

README. The README file is the main source of textual information in a repository. In contrast to the description, it elaborates more on the topic but may also diverge to other issues (e.g., installation processes and code usages). The latter introduces noises to the task of topic inference.

We concatenate the description and README fields into a single Document field, which serves as the textual feature of a repository.

Ii-C Heterogeneous Information Networks

In our proposed framework HiGitClass, we model the multi-modal signals of GitHub repositories as a heterogeneous information network (HIN) [35, 34]. HINs are an extension of homogeneous information networks to support multiple node types and edge types. We formally define a heterogeneous information network as below:

Heterogeneous Information Network (HIN). An HIN is defined as a graph with a node type mapping and an edge type mapping . Either the number of node types or the number of relation types is larger than 1.

As we all know, one advantage of networks is the ability to go beyond direct links and model higher-order relationships which can be captured by paths between nodes. We introduce the notion of meta-paths, which account for different edge types in HINs.

Meta-Path. In an HIN, meta-paths [35] are an abstraction of paths proposed to describe multi-hop relationships. For an HIN , a meta-path is a sequence of edge types --…- (). Any path that has the same types as the metapath is an instance of the metapath. When edge types are a function of the node types, we also represent a metapath as --…- ( and - for any ).

Iii Method

We lay out our HiGitClass framework in Fig. 2. HiGitClass consists of three key modules, which are proposed to solve the three challenges mentioned in Introduction, respectively.

To deal with multi-modal signals, we propose an HIN encoding module (Section III-A). Given the label hierarchy and keywords, we first construct an HIN to characterize different kinds of connections between words, documents, users, tags, repository names and labels. Then we adopt ESim [31], a meta-path guided heterogeneous network embedding technique, to obtain good node representations.

To tackle supervision scarcity and bias, we introduce a keyword enrichment module (Section III-B). This module expands the user-specified keyword to a semantically concentrated keyword set for each category. The enriched keywords are required to share high proximity with the user-given one from the view of embeddings. Meanwhile, we keep mutual exclusivity among keyword sets so as to create a clear separation boundary.

To overcome supervision format mismatch, we present a pseudo-document generation technique (Section III-C). We first model each class as a topic distribution over words and estimate the distribution parameters. Then based on the topic distributions, we follow a two-step procedure to generate pseudo documents for training. This step allows us to employ powerful classifiers such as convolutional neural network [16]. Intuitively, the neural classifier is fitting the learned word distributions instead of a small set of keywords, which can effectively prevent it from overfitting.

Iii-a HIN Construction and Embedding

HIN Construction. The first step of our model is to construct an HIN that can capture all the interactions between different types of information regarding GitHub repositories. We include six types of nodes: words (), documents (), users (), tags (), tokens segmented from repository names () and labels (). There is a one-to-one mapping between documents () and repositories (), thus document nodes may also serve as a representation of its corresponding repository in the network. Since the goal of this module is to learn accurate word representations for the subsequent classification step, we adopt a word-centric star schema [36, 37]. The schema is shown in Fig. 3(a). We use a sample ego network of the word “DCGAN” to help illustrate our schema (Fig. 3(b)). The word vocabulary is the union of words present in the documents, tags, segmented repository names and user-provided keywords.

Following the star schema, we then have 5 types of edges in the HIN that represent 5 types of word co-occurrences:

(1) . The word-document edges describe document-level co-occurrences, where the edge weight between word and document indicates the number of times appears in (i.e., term frequency, or ). From the perspective of second-order proximity [38], edges reflect the fact that two words tend to have similar semantics when they appear in the same repository’s document.

(a) Our HIN schema
(b) A sample ego network under the schema
Fig. 3: Our HIN schema and a sample network under the schema. The five edge types characterize different kinds of second-order proximity between words.

(2) . We add an edge between a word and a user node if the user is the owner of a repository that contains the word in its document field. The edge weight between word and user is the sum of the term frequency of the word in each of the user’s repositories:

(3) . The word-tag relations encode tag-level word co-occurrences. The edge weight between word and tag is

(4) . We segment the repository name using “-”, “_” and whitespace as separators. For example, we obtain two tokens “DCGAN” and “PyTorch” by segmenting the repository name “DCGAN-PyTorch” in Fig. 1. The edge weight between word and name token also defined through term frequency:

(5) . The word-label relations describe category-level word co-occurrences. Only user-provided keywords will be linked with label nodes and its parents. For example, if we select “DCGAN” as the keyword of a (leaf) category “$Image-Generation”, “DCGAN” will have links to “$Image-Generation” and all of its parent categories (e.g., “$Computer-Vision”).

HIN Embedding.

Once we have an HIN, we proceed to learn representations for nodes in the network. Then the embedding vectors of word nodes can be applied for repository topic classification.

There are many popular choices for network representation learning on HINs, such as metapath2vec [6] and hin2vec [9]. We adopt ESim [31] as our HIN embedding algorithm as it achieves the best performance in our task. (We will validate this choice in Section IV-C.) Random walk based methods [27, 11] have enjoyed success in learning node embeddings on homogeneous graphs in a scalable manner. ESim adapts the idea for use on HINs, and restricts the random walk under guidance of user-specified meta-paths. In HiGitClass, we choose , , , and as our meta-paths, modeling the five different types of second-order proximity between words.

Following the selected meta-paths, we can sample a large number of meta-path instances in our HIN (e.g., is a valid node sequence, while is not). Given a meta-path and its corresponding node sequence –…–

, we assume that the probability of observing a path given a metapath constraint follows that of a first-order Markov chain:




Here, is the global bias of meta-path . and are -dimensional local bias of . and are -dimensional embedding vectors of nodes and , respectively. , , , and can be learned through maximizing the likelihood.

However, the denominator in Equation (1) requires summing over all nodes, which is very computationally expensive given the large network size. In our actual computation, we estimate this term through negative sampling [24].

where is the set of nodes that serve as negative samples.

Iii-B Keyword Enrichment

Since we only ask users to provide one keyword for each category, in case of scarcity and bias, we devise a keyword enrichment module to automatically expand the single keyword the a keyword set so as to better capture the semantics of the category.

From the HIN embedding step, we have obtained the embedding vector for each word . We perform normalization so that all embedding vectors reside on the unit sphere (i.e., ). Then the inner product of two embedding vectors is adopted to characterize the proximity between two words and . For each class , we add words sharing the highest proximity with into its enriched keyword set. Meanwhile, to create a clear separation boundary between categories, we require to be mutually exclusive. Therefore, the expansion process terminates when any two of the keyword sets tend to intersect. Algorithm 1 describes the process.

1:  ,
2:  ,
3:  while   do
4:     for  to  do
7:     end for
8:  end while
9:  ,    //Remove the last added keyword to keep mutual exclusivity
10:  output
Algorithm 1 KeywordEnrich()

Note that on a unit sphere, the inner product is a reverse measure of the spherical distance between two points. Therefore, we are essentially expanding the keyword set with the nearest neighbors of the given keyword. The termination condition is that two “neighborhoods” have overlaps.

Iii-C Topic Modeling and Pseudo Document Generation

To leverage keywords for classification, we face two problems: (1) a typical classifier needs labeled repositories as input; (2) although the keyword sets have been enriched, a classifier will likely be overfitted if it is trained solely on these keywords. To tackle these issues, we assume we can generate a training document for class given through the following process:

Here is the topic distribution of parameterized by , with which we can “smooth” the small set of keywords to a continuous distribution. For simplicity, we adopt a “bag-of-words” model for the generated documents, so . Then we draw samples of words from to form a pseudo document following the technique proposed in [22].

Spherical Topic Modeling. Given the normalized embeddings, we characterize the word distribution for each category using a mixture of von Mises-Fisher (vMF) distributions [1, 10]. To be specific, the probability to generate keyword from category is defined as

where , as a vMF distribution, is the -th component in the mixture with a weight

. The vMF distribution can be interpreted as a normal distribution confined to a unit sphere. It has two parameters: the mean direction vector

and the concentration parameter . The keyword embeddings concentrate around , and are more concentrated if is large. is a normalization constant.

Following [23], we choose the number of vMF components differently for leaf and internal categories: (1) For a leaf category , the number of components is set to 1 and the mixture model degenerates to a single vMF distribution. (2) For an internal category , we set the number of components to be the number of ’s children in the label hierarchy.

Given the enriched keyword set , we can derive and

using Expectation Maximization (EM)

[1]. Recall that the keywords of an internal category are aggregated from its children categories. In practice, we use the approximation procedure based on Newton’s method [1] to derive .

Pseudo Document Generation. To generate a pseudo document for , we first sample a document vector from . Then we build a local vocabulary that contains top- words similar with in the embedding space. ( in our model.) Given , we repeatedly generate a number of words from a background distribution with probability and from the document-specific distribution with probability . Formally,

where is the background distribution (i.e., word distribution in the entire corpus).

(a) Machine-Learning
(b) Bioinformatics
Fig. 4: Label hierarchy and provided keywords (in blue) on the two datasets.

Here we generate the pseudo document in two steps: first sampling the document vector and then sampling words from a mixture of the document language model and a background language model. Compared to directly sampling words from , the two-step process ensures better coverage of the class distribution. In direct sampling, with high probability we will include words that are close to the centroid of the topic . The classifier may learn to ignore all other words and use only these words to determine the predicted class. By first sampling the document vector, we would like to lead the classifier to learn that all documents that fall within the topic distribution belong to the same class.

The synthesized pseudo documents are then used as training data for a classifier. In HiGitClass, we adopt convolutional neural networks (CNN) [16] for the classification task. One can refer to [16, 22] for more details of the network architecture. The embedding vectors of word nodes obtained by ESim in the previous HIN module are used as pre-trained embeddings.

Recall the process of generating pseudo documents, if we evenly split the fraction of the background distribution into the children categories of , the “true” label distribution (an -dimensional vector) of a pseudo document can be defined as

Since our label is a distribution instead of a one-hot vector, we compute the loss as the KL divergence between the output label distribution and the pseudo label.

Iv Experiments

We aim to answer two questions in our experiments. First, does HiGitClass achieve supreme performances in comparison with various baselines (Section IV-B)? Second, we propose three key modules in HiGitClass. How do they contribute to the overall performances? (The effects of these three modules will be explored one by one in Sections IV-C, IV-D and IV-E).

Iv-a Experimental Setup

Datasets. We collect two datasets of GitHub repositories covering different domains.555Our code and datasets are available at https://github.com/yuzhimanhua/HiGitClass. Their statistics are summarized in Table I.

Dataset #Repos #Classes (Level 1 + Level 2)
Machine-Learning 1,596 3 + 14
Bioinformatics 876 2 + 10
TABLE I: Dataset statistics.
  • Machine-Learning. This dataset is collected by the Paper With Code project666https://paperswithcode.com/media/about/evaluation-tables.json.gz. It contains a list of GitHub repositories implementing state-of-the-art algorithms of various machine learning tasks, where the tasks are organized as a taxonomy.

  • Bioinformatics. This dataset is extracted from research articles published on four venues Bioinformatics, BioNLP, MICCAI and Database from 2014 to 2018. In each article, authors may put a code link, and we extract the links pointing to a GitHub repository. Meanwhile, each article has an issue section, which is viewed as the topic label of the associated repository.

Note that more than 73% (resp., 78%) of the repositories in our Machine-Learning (resp., Bioinformatics) dataset have no tags.

Method Level-1 Micro Level-1 Macro Level-2 Micro Level-2 Macro Overall Micro Overall Macro
HierSVM [7] 54.20 4.53 46.58 3.59 27.40 3.55 31.99 5.34 40.80 1.20 34.57 3.94
HierDataless [33] 76.63 33.44 13.72 8.80 45.18 13.15
WeSTClass [22] 61.78 3.90 48.00 2.04 37.71 2.72 34.34 1.70 49.75 3.24 36.75 1.67
WeSHClass [23] 70.69 2.14 52.73 2.18 38.08 2.07 33.87 2.23 54.39 2.11 37.60 1.67
PCNB [41] 77.79 1.92 62.53 2.55 26.77 3.89 22.85 1.98 52.28 1.92 29.85 1.74
PCEM [41] 77.28 2.00 59.27 2.29 22.34 4.02 19.28 2.25 49.81 2.00 26.34 1.83
HiGitClass w/o HIN 75.28 6.99 60.85 5.51 40.21 2.91 39.77 2.22 57.74 4.07 43.49 2.10
HiGitClass w/o Enrich 86.48 1.41 72.19 2.53 36.71 1.13 43.75 2.85 61.60 0.25 48.77 1.94
HiGitClass w/o Hier 57.31 1.63 59.30 3.14 40.45 2.51 44.41 2.66 48.88 2.07 47.04 1.63
HiGitClass 87.68 2.23 72.06 4.39 43.93 2.93 45.97 2.29 65.81 3.55 50.57 2.98
TABLE II: Performances of Compared Algorithms on the Machine-Learning

Dataset. HierDataless does not have a standard deviation since it is a deterministic algorithm.

Method Level-1 Micro Level-1 Macro Level-2 Micro Level-2 Macro Overall Micro Overall Macro
HierSVM [7] 80.39 2.72 70.16 6.05 13.49 10.2 10.04 9.07 46.94 4.66 20.06 7.12
HierDataless [33] 81.39 78.75 39.27 36.57 60.33 43.60
WeSTClass [22] 61.78 5.75 52.73 4.86 20.39 3.48 17.52 2.59 41.09 4.28 23.91 2.82
WeSHClass [23] 63.17 3.56 59.65 3.51 26.44 1.33 24.94 0.98 44.80 2.26 30.72 1.30
PCNB [41] 77.03 2.89 59.43 3.51 31.77 3.80 22.90 4.84 54.40 2.89 28.99 4.82
PCEM [41] 78.51 3.06 61.99 4.18 32.80 2.88 18.93 5.28 55.66 3.06 26.11 5.41
HiGitClass w/o HIN 66.21 8.33 64.39 7.00 31.87 3.65 30.47 3.15 49.04 5.75 36.13 3.52
HiGitClass w/o Enrich 54.55 10.0 53.57 9.15 22.95 1.46 23.18 1.89 38.74 4.30 28.24 0.87
HiGitClass w/o Hier 78.79 2.57 73.71 2.28 39.42 4.47 41.58 2.63 59.10 3.51 46.94 2.48
HiGitClass 81.71 3.95 77.11 3.89 42.44 8.46 41.67 8.04 62.08 6.11 47.57 7.34
TABLE III: Performances of Compared Algorithms on the Bioinformatics Dataset. HierDataless does not have a standard deviation since it is a deterministic algorithm.

Baselines. We evaluate the performance of HiGitClass against the following hierarchical classification algorithms:

Note that HierSVM, PCNB and PCEM can only take document-level supervision (i.e., labeled repositories). To align the experimental settings, we first label all the repositories using TFIDF scores by treating the keyword set of each class as a query. Then, we select top-ranked repositories per class as the supervision to train HierSVM, PCNB and PCEM. Since the baselines are all text classification approaches, we append the information of user, tags and repository name to the end of the document for each repository so that the baselines can exploit these signals.

Besides the baselines, we also include the following three ablation versions of HiGitClass into comparison.

  • w/o HIN skips the HIN embedding module and relies on word2vec [24] to generate word embeddings for the following steps.

  • w/o Enrich skips the keyword enrichment module and directly uses one single keyword in spherical topic modeling.

  • w/o Hier directly classifies all repositories to the leaf layer and then assigns internal labels to each repository according to its leaf category.

Evaluation Metrics. We use F1 scores to evaluate the performances of all methods. Denote , and, as the instance numbers of true-positive, false-positive and false negative for category . Let (resp., ) be the set of all Level-1 (resp., Level-2/leaf) categories. The Level-1 Micro-F1 is defined as , where and . The Level-1 Macro-F1 is defined as , where and . Accordingly, Level-2 Micro/Macro-F1 and Overall Micro/Macro-F1 can be defined on and .

Iv-B Performance Comparison with Baselines

Tables II and III demonstrate the performances of compared methods on two datasets. We repeat each experiment 5 times (except HierDataless, which is a deterministic algorithm) with the mean and standard deviation reported.

As we can observe from Tables II and III, on both datasets, a significant improvement is achieved by HiGitClass compared to the baselines. On Machine-Learning, HiGitClass notably outperforms the second best approach by 22.1% on average. On Bioinformatics, the only metric in terms of which HiGitClass does not perform the best is Level-1 Macro-F1, and the main opponent of HiGitClass is HierDataless. As mentioned above, HierDataless incorporates Wikipedia articles as its external knowledge. When user-provided keywords can be linked to Wikipedia (e.g., “genomic”, “genetic” and “phylogenetic” in Bioinformatics), HierDataless can exploit the external information well. However, when the keywords cannot be wikified (e.g., names of new deep learning algorithms such as “dcgan”, “rcnn” and “densenet” in Machine-Learning), the help from Wikipedia is limited. In fact, on Machine-Learning, HierDataless performs poorly. Note that on GitHub, it is common that names of recent algorithms or frameworks are provided as keywords.

Besides outperforming baseline approaches, HiGitClass shows a consistent and evident improvement against three ablation versions. The average boost of the HIN module over the six metrics is 15.0% (resp., 28.6%) on the Machine-Learning (resp., Bioinformatics) dataset, indicating the importance of encoding multi-modal signals on GitHub. The Level-1 F1 scores of HiGitClass w/o Enrich is close to the full model on Machine-Learning, but the keyword enrichment module demonstrates its power when we go deeper. This finding is aligned with the fact that the topic distributions of coarse-grained categories are naturally distant from each other on the sphere. Therefore, one keyword per category may be enough to estimate the distributions approximately. However, when we aim at fine-grained classification, the topic distributions become closer, and the inference process may be easily interfered by a biased keyword. HiGitClass w/o Hier performs poorly on Machine-Learning, which highlights the importance of utilizing the label hierarchy during the training process. The same phenomenon occurs in the comparison between WeSTClass and WeSHClass.

Iv-C Effect of HIN Construction and Embedding

(a) Machine-Learning, Micro
(b) Bioinformatics, Micro
(c) Machine-Learning, Macro
(d) Bioinformatics, Macro
Fig. 5: Performances of algorithms with different HIN modules.

We have demonstrated the contribution of our HIN module by comparing HiGitClass and HiGitClass w/o HIN. To explore the effectiveness of HIN construction and embedding in a more detailed way, we perform an ablation study by changing one “factor” in the HIN and fixing all the other modules in HiGitClass. To be specific, our HIN has five types of edges, each of which corresponds to a meta-path. We consider five ablation versions (No W-D-W, No W-U-W, No W-T-W, No W-N-W and No W-L-W). Each version ignores one edge type/meta-path. Moreover, given the complete HIN, we consider to use two popular approaches, metapath2vec [6] and hin2vec [9], as our embedding technique, which generates two variants metapath2vec as embedding and HIN2vec as embedding. Fig. 5 shows the performances of these variants and our Full model.

We have the following findings from Fig. 5. First, our Full model outperforms the five ablation models ignoring different edge types, indicating that each meta-path (as well as each node type incorporated in the HIN construction step) plays a positive role in classification. Second, our Full model outperforms metapath2vec as embedding and hin2vec as embedding in most cases, which validates our choice of using ESim as the embedding technique. The possible reason that ESim is more suitable for our task may be that we have a simple word-centric star schema and a clear goal of embedding word nodes. Therefore, the choices of meta-paths can be explicitly specified (i.e., –?–) and do not need to be inferred from data (as hin2vec does). Third, among the five ablation models ignoring edge types, No W-U-W performs the worst on Machine-Learning, which means - edges (i.e., the user information) contribute the most in repository classification. Meanwhile, on Bioinformatics, - edges have the largest offering. This can be explained by the following statistics: in the Machine-Learning dataset, 348 pairs of repositories share the same user, out of which 217 (62%) have the same leaf label; in the Bioinformatics dataset, there are 356 pairs of repositories having at least two overlapping tags, among which 221 (62%) belong to the same leaf category. Fourth, - edges also contribute a lot to the performances. This observation is aligned with the results in [37], where document-level word co-occurrences play a crucial role in text classification.

Iv-D Effect of Keyword Enrichment

Class $Semantic-Segmentation $Pose-Estimation


Keyword segmentation pose ner squad
Enriched Keywords semantic estimation entity question
papandreou person tagger answering
scene human lample bidaf
pixel mpii - -
segment 3d - -
Class $Gene-Expression $Genetics-and-Population $Structural-Bioinformatics $Phylogenetics
Keyword expression genetic structure phylogenetic
Enriched Keywords gene traits protein trees
genes trait pdb newick
rna markers residues phylogenetics
cell phenotypes pymol phylogenies
isoform associations residue evolution
TABLE IV: Entity enrichment results on the two datasets. Four leaf categories are shown for each dataset.

Quantitatively, the keyword enrichment module has a positive contribution to the whole framework according to previous experiments. We now show its effect qualitatively. Table IV lists top-5 words selected by HiGitClass during keyword enrichment. Besides topic-indicating words (e.g., “tagger”, “trait”, “trees”, etc.), popular algorithm/tool names (e.g., “bidaf” and “pymol”), dataset names (e.g., “mpii”, “pdb”) and author names (e.g., “papandreou” and “lample”) are also included in the expanded keyword set. Note that some provided keywords are more or less ambiguous (e.g., “segmentation” and “structure”), and directly using them for topic modeling may introduce noises. In contrast, the expanded set as a whole can better characterize the semantics of each topic category.

Iv-E Effect of Pseudo Documents

(a) Machine-Learning
(b) Bioinformatics
Fig. 6: Performances of HiGitClass with different numbers of pseudo documents.

In all previous experiments, when we build a classifier for an internal category , we generate 500 pseudo documents for each child of . What if we use less/more synthesized training data? Intuitively, if the amount of generated pseudo documents is too small, signals in previous modules cannot fully propagate to the training process. On the contrary, if we have too many generated data, the training time will be unnecessarily long. To see whether 500 is a good choice, we plot the performances of HiGitClass with 10, 50, 100, 500, 1000 and 2000 pseudo documents in Fig. 6.

On the one side, when the number of pseudo documents is too small (e.g., 10, 50 or 100), information carried in the synthesized training data will be insufficient to train a good classifier. On the other side, when we generate too many pseudo documents (e.g, 1000 or 2000), putting efficiency aside, the performances are not guaranteed to increase. In fact, on both datasets, the F1 scores start to fluctuate when the number of pseudo documents becomes large. In our task, generating 500 to 1000 pseudo documents for each class will strike a good balance.

V Related Work

GitHub Repository Mining. As a popular code collaboration community, GitHub presents many opportunities for researchers to learn how people write code and design tools to support the process. As a result, GitHub data has received attention from both software engineering and social computing researchers. Analytic studies [5, 40, 39, 15, 29, 21] have investigated how user activities (e.g., collaboration, following and watching) affect development practice. Algorithmic studies [44, 32, 28, 8] exploit README files and repository metadata to perform data mining tasks such as similarity search [44] and clustering [32]. In this paper, we focus on the task of automatically classifying repositories whereas previous works [15, 29] have relied on human effort to annotate each repository with its topic.

HIN Embeddings. Many node embeddings techniques have been proposed for HIN, including [31, 6, 9, 42]. From the application point of view, typical applications of learned embeddings include node classification [31, 6, 9, 42], node clustering [6] and link prediction [9, 42]. Several studies apply HIN node embeddings into downstream classification tasks, such as malware detection [13] and medical diagnosis [12]. Different from the fully-supervised settings in [13, 12], our repository classification task relies on a very small set of guidance. Moreover, most information used in [13, 12] is structured. In contrast, we combine structured information such as user-repository ownership relation with unstructured text for classification.

Dataless Text Classification. Although deep neural architectures [16, 43, 14] demonstrate their advantages in fully-supervised text classification, their requirement of massive training data prohibits them from being adopted in some practical scenarios. Under weakly-supervised or dataless settings, there have been solutions following two directions: latent variable models extending topic models (e.g., PLSA and LDA) by incorporating user-provided seed information [20, 4, 17, 18] and embedding-based models deriving vectorized representations for words and documents [3, 37, 22]. There are also some work on semi-supervised text classification [25, 30], but they require a set of labeled documents instead of keywords.

Hierarchical Text Classification. Under fully-supervised settings, [7] and [19] first propose to train SVMs to distinguish the children classes of the same parent node. [2]

further defines hierarchical loss function and applies cost-sensitive learning to generalize SVM learning for hierarchical classification.


proposes a graph-CNN based model to convert text to graph-of-words, on which the graph convolution operations are applied for feature extraction. Under

weakly-supervised or dataless settings, previous approaches include HierDataless [33], WeSHClass [23] and PCNB/PCEM [41], which have been introduced in Section IV-A. All above mentioned studies focus on text data without additional information. In HiGitClass, we are able to go beyond plain text classification and utilize multi-modal signals.

Vi Conclusions and Future Work

In this paper, we have studied the problem of keyword-driven hierarchical classification of GitHub repositories: the end user only needs to provide a label hierarchy and one keyword for each leaf category. Given such scarce supervision, we design HiGitClass with three modules: heterogeneous information network embedding; keyword enrichment; pseudo document generation. Specifically, HIN embeddings take advantage of the multi-modal nature of GitHub repositories; keyword enrichment alleviates the supervision scarcity of each category; pseudo document generation transforms the keywords into documents, enabling the use of powerful neural classifiers. Through experiments on two repository collections, we show that HiGitClass consistently outperforms all baselines by a large margin, particularly on the lower levels of the hierarchy. Further analysis shows that the HIN module contributes the most to the boost in performance and keyword enrichment demonstrates its power deeper down the hierarchy where the differences between classes become more subtle.

For future work, we would like to explore the possibility of integrating different forms of supervision (e.g., the combination of keywords and labeled repositories). The HIN embedding module may also be coupled more tightly with the document classification process by allowing the document classifier’s prediction results to propagate along the network.


We thank Xiaotao Gu for useful discussions. The research was sponsored in part by U.S. Army Research Lab. under Cooperative Agreement No. W911NF-09-2-0053 (NSCTA), DARPA under Agreement No. W911NF-17-C0099, National Science Foundation IIS 16-18481, IIS 17-04532, and IIS-17-41317, DTRA HDTRA11810026, and grant 1U54GM114838 awarded by NIGMS through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative (www.bd2k.nih.gov). The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agencies. We thank anonymous reviewers for valuable and insightful feedback.


  • [1] A. Banerjee, I. S. Dhillon, J. Ghosh, and S. Sra. Clustering on the unit hypersphere using von mises-fisher distributions. Journal of Machine Learning Research, 6(Sep):1345–1382, 2005.
  • [2] L. Cai and T. Hofmann.

    Hierarchical document categorization with support vector machines.

    In CIKM’04, pages 78–87, 2004.
  • [3] M.-W. Chang, L.-A. Ratinov, D. Roth, and V. Srikumar. Importance of semantic representation: Dataless classification. In AAAI’08, pages 830–835, 2008.
  • [4] X. Chen, Y. Xia, P. Jin, and J. Carroll. Dataless text classification with descriptive lda. In AAAI’15, 2015.
  • [5] L. Dabbish, C. Stuart, J. Tsay, and J. Herbsleb. Social coding in github: transparency and collaboration in an open software repository. In CSCW’12, pages 1277–1286, 2012.
  • [6] Y. Dong, N. V. Chawla, and A. Swami. metapath2vec: Scalable representation learning for heterogeneous networks. In KDD’17, pages 135–144, 2017.
  • [7] S. Dumais and H. Chen. Hierarchical classification of web content. In SIGIR’00, pages 256–263, 2000.
  • [8] Y. Fan, Y. Zhang, S. Hou, L. Chen, Y. Ye, C. Shi, L. Zhao, and S. Xu. idev: Enhancing social coding security by cross-platform user identification between github and stack overflow. In IJCAI’19, pages 2272–2278, 2019.
  • [9] T.-y. Fu, W.-C. Lee, and Z. Lei. Hin2vec: Explore meta-paths in heterogeneous information networks for representation learning. In CIKM’17, pages 1797–1806, 2017.
  • [10] S. Gopal and Y. Yang. Von mises-fisher clustering models. In ICML’14, pages 154–162, 2014.
  • [11] A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. In KDD’16, pages 855–864, 2016.
  • [12] A. Hosseini, T. Chen, W. Wu, Y. Sun, and M. Sarrafzadeh. Heteromed: Heterogeneous information network for medical diagnosis. In CIKM’18, pages 763–772, 2018.
  • [13] S. Hou, Y. Ye, Y. Song, and M. Abdulhayoglu. Hindroid: An intelligent android malware detection system based on structured heterogeneous information network. In KDD’17, pages 1507–1515, 2017.
  • [14] J. Howard and S. Ruder. Universal language model fine-tuning for text classification. In ACL’18, pages 328–339, 2018.
  • [15] E. Kalliamvakou, G. Gousios, K. Blincoe, L. Singer, D. M. German, and D. Damian. The promises and perils of mining github. In MSR’14, pages 92–101, 2014.
  • [16] Y. Kim. Convolutional neural networks for sentence classification. In EMNLP’14, pages 1746–1751, 2014.
  • [17] C. Li, J. Xing, A. Sun, and Z. Ma. Effective document labeling with very few seed words: A topic model approach. In CIKM’16, pages 85–94, 2016.
  • [18] X. Li, C. Li, J. Chi, J. Ouyang, and C. Li. Dataless text classification: A topic modeling approach with document manifold. In CIKM’18, pages 973–982, 2018.
  • [19] T.-Y. Liu, Y. Yang, H. Wan, H.-J. Zeng, Z. Chen, and W.-Y. Ma. Support vector machines classification with a very large-scale taxonomy. SIGKDD Explorations, 7(1):36–43, 2005.
  • [20] Y. Lu and C. Zhai. Opinion integration through semi-supervised topic modeling. In WWW’08, pages 121–130, 2008.
  • [21] W. Ma, L. Chen, X. Zhang, Y. Zhou, and B. Xu. How do developers fix cross-project correlated bugs? a case study on the github scientific python ecosystem. In ICSE’17, pages 381–392, 2017.
  • [22] Y. Meng, J. Shen, C. Zhang, and J. Han. Weakly-supervised neural text classification. In CIKM’18, pages 983–992, 2018.
  • [23] Y. Meng, J. Shen, C. Zhang, and J. Han. Weakly-supervised hierarchical text classification. In AAAI’19, pages 6826–6833, 2019.
  • [24] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS’13, pages 3111–3119, 2013.
  • [25] T. Miyato, A. M. Dai, and I. J. Goodfellow. Adversarial training methods for semi-supervised text classification. In ICLR’16, 2016.
  • [26] H. Peng, J. Li, Y. He, Y. Liu, M. Bao, L. Wang, Y. Song, and Q. Yang. Large-scale hierarchical text classification with recursively regularized deep graph-cnn. In WWW’18, pages 1063–1072, 2018.
  • [27] B. Perozzi, R. Al-Rfou, and S. Skiena. Deepwalk: Online learning of social representations. In KDD’14, pages 701–710, 2014.
  • [28] G. A. A. Prana, C. Treude, F. Thung, T. Atapattu, and D. Lo. Categorizing the content of github readme files. Empirical Software Engineering, pages 1–32, 2018.
  • [29] P. H. Russell, R. L. Johnson, S. Ananthan, B. Harnke, and N. E. Carlson. A large-scale analysis of bioinformatics code on github. PLOS One, 13(10):e0205898, 2018.
  • [30] D. S. Sachan, M. Zaheer, and R. Salakhutdinov. Revisiting lstm networks for semi-supervised text classification via mixed objective function. In AAAI’19, pages 6940–6948, 2019.
  • [31] J. Shang, M. Qu, J. Liu, L. M. Kaplan, J. Han, and J. Peng. Meta-path guided embedding for similarity search in large-scale heterogeneous information networks. arXiv preprint arXiv:1610.09769, 2016.
  • [32] A. Sharma, F. Thung, P. S. Kochhar, A. Sulistya, and D. Lo. Cataloging github repositories. In EASE’17, pages 314–319, 2017.
  • [33] Y. Song and D. Roth. On dataless hierarchical text classification. In AAAI’14, pages 1579–1585, 2014.
  • [34] Y. Sun and J. Han. Mining heterogeneous information networks: principles and methodologies. Synthesis Lectures on Data Mining and Knowledge Discovery, 3(2):1–159, 2012.
  • [35] Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu. Pathsim: Meta path-based top-k similarity search in heterogeneous information networks. PVLDB, 4(11):992–1003, 2011.
  • [36] Y. Sun, Y. Yu, and J. Han. Ranking-based clustering of heterogeneous information networks with star network schema. In KDD’09, pages 797–806. ACM, 2009.
  • [37] J. Tang, M. Qu, and Q. Mei. Pte: Predictive text embedding through large-scale heterogeneous text networks. In KDD’15, pages 1165–1174, 2015.
  • [38] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. Line: Large-scale information network embedding. In WWW’15, pages 1067–1077, 2015.
  • [39] J. Tsay, L. Dabbish, and J. Herbsleb. Influence of social and technical factors for evaluating contribution in github. In ICSE’14, pages 356–366. ACM, 2014.
  • [40] J. T. Tsay, L. Dabbish, and J. Herbsleb.

    Social media and success in open source projects.

    In CSCW’12, pages 223–226, 2012.
  • [41] H. Xiao, X. Liu, and Y. Song. Efficient path prediction for semi-supervised and weakly supervised hierarchical text classification. In WWW’19, pages 3370–3376, 2019.
  • [42] C. Yang, Y. Feng, P. Li, Y. Shi, and J. Han. Meta-graph based hin spectral embedding: Methods, analyses, and insights. In ICDM’18, pages 657–666, 2018.
  • [43] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy. Hierarchical attention networks for document classification. In NAACL’16, pages 1480–1489, 2016.
  • [44] Y. Zhang, D. Lo, P. S. Kochhar, X. Xia, Q. Li, and J. Sun. Detecting similar repositories on github. In SANER’17, pages 13–23, 2017.