Open-source software (OSS) communities provide a wide range of functional and technical features for software teams and developers to collaborate, share, and explore software repositories. Many of these repositories are similar to each other, i.e., they have similar objectives, employ similar technologies or implement similar functionality. Users explore hosted repositories to search for interesting software components tailored to their needs. However, as the community grows, it becomes harder to effectively organize these repositories so that users can efficiently retrieve and reuse them.
Collaborative tagging has significantly impacted the information retrieval field for the better, and it seems to be a promising solution to the above problem wang2018entagrec++. Tags are a form of metadata used to annotate various entities based on their main concepts. They are often more useful compared to textual descriptions as they capture the salient aspects of an entity in a simple token. In fact, through encapsulating human knowledge, tags help bridge the gap between technical and social aspects of software development treude2009tagging. Thus, tags can be used for organizing and searching for software repositories as well. Software tags describe categories a repository may belong to, its main programming language, the intended audience, the type of user interface, and its other key characteristics. Furthermore, tagging can link topic-related repositories to each other and provide a soft categorization of the content wang2018entagrec++. Software repositories and QA platforms rely on users to generate and assign tags to software entities. Moreover, several studies have exploited tags to build recommender systems for software QA platforms such as Stack Overflow xia2013tag; wang2014tag; wang2018entagrec++; liu2018fasttagrec.
In 2017, GitHub enabled its users to assign topic tags to repositories. We believe topic tags, which we will refer to as “topics” in this paper, are a useful resource for training models to predict high-level specifications of software repositories. However, as of February 2020, only 5% of public repositories on GitHub had at least one topic assigned to them (information retrieved using the GitHub API). We discovered over 118K unique user-specified topics in our data. According to our calculations, the majority of tagged repositories have only a limited number of high-quality topics. Unfortunately, as users keep creating and assigning new topics based on their personalized terminology and style, the number of defined topics explodes, and their quality degrades golder2006usage. This is because tagging is a distributed process with no centralized coordination. Thus, similar entities can be tagged differently xia2013tag. This results in an increasing number of redundant topics, which in turn makes it hard to retrieve similar entities based on differently written synonymous topics. For example, the same topic can be written in full or abbreviated, in plural or singular form, with or without special characters such as ‘-’, or may contain human-language errors such as typos. Take repositories working on a deep learning model named Convolutional Neural Network as an example. We identified 16 differently written topics or combinations of separate topics for this concept, including cnn, CNN, convolutional, convolutional-deep-learning, ccn-model, cnn-architecture, and convolutional + neural + network. The different forms of the same concept are called aliases. This high level of redundancy and customization adversely affects model training. That is, the quality of topics (e.g., their conciseness, completeness, and consistency) impacts the efficacy of operations that rely on them. Fortunately, GitHub has recently provided a limited set of refined topics called featured topics. This has allowed us to use this set as an initial seed to train supervised models that automatically tag software repositories and, consequently, create an inventory of them.
We treat the problem of assigning existing topics to new repositories as a multi-label classification problem. We use the set of featured topics as labels for supervising our models. Each software repository can be labeled with multiple topics. Using both traditional machine-learning techniques and advanced deep neural networks, we trained different models to automatically predict these topics. The input to our model is straightforward: repositories’ textual information and their file names. Recommender systems return ranked lists of suggestions. Thus, for a given repository, our model outputs a fixed number of topics with the highest predicted probabilities. We evaluate our model with respect to various evaluation metrics, including Recall, Precision, Success Rate, and Label Ranking Average Precision (LRAP). The results indicate that our approach achieves high Recall, Success Rate, and LRAP scores and improves upon the baseline approach on each of the compared metrics. Furthermore, we compared the recommendations of our model with those of the baseline approach from users’ perspectives. Participants evaluated the recommendations based on two measures: correctness and completeness. On average, our model recommends more correct topics for the sample repositories than the baseline. Moreover, developers indicated that our model also provides a more complete set of recommendations than the baseline. Our main contributions are the following:
We apply rigorous text-processing techniques to user-specified topics to augment GitHub’s initial set of 355 featured topics with about 29K sub-topics. We evaluate the quality of the mapping between user-specified and featured topics. The results indicate that we can map these topics accurately.
We train several multi-label classification models to automatically recommend topics for repositories. Then, we quantitatively and qualitatively evaluate our proposed approach. The results indicate that we outperform the baseline in both cases by large margins.
We make our models and datasets publicly available for use by others (https://GitHub.com/MalihehIzadi/SoftwareTagRecommender).
Finally, we develop an online tool, Repository Catalogue, for automatically predicting topics for GitHub repositories. Our tool is also publicly available at https://www.repologue.com/.
2 Problem Definition
An OSS community such as GitHub hosts a set of repositories. Each software repository may contain various types of textual information, such as a description, README files, and wiki pages describing the repository’s goal and features in detail. It also contains an arbitrary number of files, including its source code. Figure 1 provides a sample repository from GitHub which is tagged with six topics such as rust and tui. We preprocess and combine the textual information of these repositories, such as their name, description, README file, and wiki pages, with the list of their file names as the input of our approach. Furthermore, we preprocess their set of user-specified topics and use them as the labels for our supervised machine learning techniques. Topics are transformed according to the initial candidate set of featured topics. For each repository, each element of the resulting label vector is either 0 or 1 and indicates whether the corresponding topic is assigned to the target repository. Our goal is to recommend several topics from the candidate set to each repository by learning the relationship between existing repositories’ textual information and their corresponding sets of topics.
In this section, we provide preliminary information on the methods we have used in our proposed approach, covering both traditional classifiers and deep models.
Naive Bayes: Multinomial NB (MNB) is a variant of naive Bayes frequently used in text classification. MNB is a probabilistic classifier used for multinomially distributed data. The second naive Bayes variant, Gaussian NB (GNB), is used when the continuous values associated with each class follow a Gaussian distribution.
Logistic Regression: This classifier uses a logistic function to model the probabilities describing the possible outcomes of a single trial.
FastText: Developed by Facebook, FastText is a library for learning word representations and sentence classification that handles rare words well by exploiting character-level information joulin2017bag. We have used FastText to train a supervised text classifier.
Transformers are state-of-the-art models that use attention mechanisms and dispense with the recurrent component of Recurrent Neural Networks (RNNs) vaswani2017attention. Transformers have been shown to generate higher-quality results, to be more parallelizable, and to require significantly less time to train compared to RNNs. Using the transformer concept, Bidirectional Encoder Representations from Transformers (BERT) was proposed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context devlin2018bert. BERT is pre-trained with two tasks, Masked Language Modeling (MLM) and Next Sentence Prediction (NSP), on a large corpus constructed from the Toronto Book Corpus and Wikipedia. DistilBERT, developed by HuggingFace sanh2019distilbert, was proposed to pre-train a smaller general-purpose language model compared to BERT. DistilBERT combines language modeling, distillation, and cosine-distance losses to leverage the inductive biases learned by larger pre-trained models. The authors have shown that DistilBERT can be fine-tuned with good performance on a variety of tasks. They claim that, compared to BERT, DistilBERT decreases the model size by 40% while retaining 97% of its language understanding capabilities and being 60% faster.
4 Proposed Method
In this section, we first present the high-level architecture of our approach. Then, we discuss its main components, including the data preparation technique and the multi-label classifiers in more detail.
4.1 Approach Overview
Figure 2 presents the overall workflow of our proposed approach, which consists of three main phases: (1) data preparation, (2) training, and (3) prediction.
The first phase is composed of two parts: preparing the set of featured topics and preparing the textual data of repositories, which serve as the labels and inputs of the multi-label classifiers, respectively. For each repository, we extract its available user-specified topics, name, description, README files, wiki pages, and finally a list of source file names (including their extensions). User-specified topics assigned to the repositories go through several text-processing steps and are then compared to the set of featured topics. After applying the preprocessing steps, if the cleaned version of a user-specified topic is found in the list of featured topics, it is included; otherwise, it is discarded. Our classifier treats the list of topics for each repository as its labels. We transform these per-repository lists of featured topics into multi-hot-encoded vectors and use them in the multi-label classifiers. We also process and concatenate textual data from the repositories along with their source file names to form our corpus. We feed the concatenated list of a repository’s textual information (description, README, wiki, project name, and file names) to the transformer-based and FastText classifiers as is. For traditional classifiers, on the other hand, we use either TF-IDF or Doc2Vec embeddings to represent the input textual information of repositories.
Next, in the training phase, the resulting representations are fed to the classifiers to capture the semantic regularities in the corpus. The classifiers detect the relationship between the repositories’ textual information and the topics assigned to the repositories and learn to predict the probability of each featured topic being assigned to the repositories.
Finally, in the prediction phase, the trained models predict topics for the repositories in the test dataset. Specifically, our model outputs a vector containing the probabilities of assigning each topic to a sample repository. We sort the output probability vector and then retrieve the corresponding topics for the top candidates (highest probabilities) based on the recommendation list’s size.
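The ranking step of the prediction phase can be sketched as follows. This is a minimal illustration, assuming the classifier has already produced one probability per featured topic; the function name, the topic names, and the probability values are hypothetical and not taken from the paper.

```python
def recommend_topics(probabilities, topic_names, k=5):
    """Return the k topics with the highest predicted probabilities."""
    if len(probabilities) != len(topic_names):
        raise ValueError("one probability per topic is required")
    # Sort topic indices by descending probability and keep the first k.
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: probabilities[i],
                    reverse=True)
    return [topic_names[i] for i in ranked[:k]]


# Hypothetical model output for one repository.
topics = ["python", "rust", "docker", "react", "linux"]
probs = [0.10, 0.85, 0.40, 0.05, 0.60]
print(recommend_topics(probs, topics, k=3))
```

The list size k plays the role of the recommendation list's size mentioned above.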
4.2 Data Preparation
We collected the raw data of repositories with at least one user-specified topic using the GitHub API, which resulted in about two million repositories. This data contains the repositories’ various document files, such as the description, README files (crawled in different formats, e.g., README.md, README, readme.txt, and readme.rst, in both upper- and lower-case characters), wiki pages, a complete list of their file names, and finally the project’s name. We also retrieved the set of user-specified topics for these repositories.
Initially, we remove repositories with no README and no description. Then, we discard repositories that have fewer than ten stars kalliamvakou2016depth. This results in about 180K repositories and 118K unique user-specified topics. After performing all the preprocessing steps described in the next sections, we remove repositories that are left with no input data (either textual information or cleaned topics). As a result, about 152K repositories and 228 featured topics remain in the final data.
In the next sections, we review the above-mentioned preprocessing steps in more detail. Considering the differences between our input sources, we treat textual information from these resources differently. We first clean textual information such as descriptions, READMEs, and wiki pages together. Then we clean project names and file names.
4.2.1 Preprocessing Descriptions, READMEs, and Wiki Pages
We perform the following preprocessing methods on these types of data.
Exclude repositories in which more than half of the README and description consist of non-English characters,
Remove punctuation, digits, non-English and non-ASCII characters,
Replace popular SE- and CS-related abbreviations and acronyms such as lib, app, config, DB, doc, and env with their formal form in the dataset (the complete list of these tokens is available in our repository),
Remove abstract concepts such as emails, URLs, usernames, markdown symbols, code snippets, dates, and times to normalize the text using regular expressions,
Split tokens based on several naming conventions including SnakeCase, camelCase, and underscores using an identifier splitting tool called Spiral (https://github.com/casics/spiral),
Convert tokens to lower case,
Omit stop words, then tokenize and lemmatize documents to retain their correct word formats. We do not perform stemming since some of our methods (e.g., DistilBERT) have their own preprocessing techniques,
Remove tokens with a frequency of less than 50 to limit the vocabulary size for traditional classifiers. Less-frequent words are typically special names or typos. According to our experiments, using these tokens has little to no impact on the accuracy.
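A simplified sketch of a few of the steps above, using only the Python standard library. It is not the authors' pipeline: the abbreviation map and the stop-word list are tiny illustrative stand-ins for the full lists, a plain regular expression stands in for Spiral, and lemmatization is omitted because it needs an external NLP library. All function and constant names are hypothetical.

```python
import re
from collections import Counter

# Illustrative subsets only; the paper's complete lists live in its repository.
ABBREVIATIONS = {"lib": "library", "app": "application", "config": "configuration",
                 "db": "database", "doc": "document", "env": "environment"}
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "for", "is", "with", "on"}

URL = re.compile(r"https?://\S+|www\.\S+")
EMAIL = re.compile(r"\S+@\S+\.\S+")
CAMEL = re.compile(r"(?<=[a-z])(?=[A-Z])")   # boundary inside camelCase identifiers
NON_LETTER = re.compile(r"[^A-Za-z]+")       # punctuation, digits, non-ASCII


def clean_text(text):
    """Normalize a description/README/wiki string into a list of tokens."""
    text = URL.sub(" ", text)
    text = EMAIL.sub(" ", text)
    text = CAMEL.sub(" ", text)
    text = NON_LETTER.sub(" ", text)
    tokens = [ABBREVIATIONS.get(t, t) for t in text.lower().split()]
    return [t for t in tokens if t not in STOP_WORDS]


def drop_rare_tokens(documents, min_count):
    """Remove tokens whose corpus-wide frequency is below min_count."""
    counts = Counter(token for doc in documents for token in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in documents]


sample = "Config the DB via https://example.com or mail dev@example.com: parseJsonFile 2020!"
print(clean_text(sample))
```

In the paper's setting, `min_count` would be 50 for this type of text; a much smaller value is needed to try the function on toy data.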
4.2.2 Preprocessing Project and Source File Names
The reason for incorporating this type of information in our approach is that names are usually a good indicator of the main functionality of an entity. Therefore, we crawled a list of all the file names available inside each repository. As this information cannot be obtained using the GitHub API, we cloned every project and then parsed all their directories. Before cleaning file names, our dataset had an average of 488 and a median of 50 files per repository. We perform the following steps on the names:
Split the project name into the owner and the repository name.
Drop special (e.g., ‘-’ and ‘.’) or non-English characters from all names,
Split names according to the naming conventions, including SnakeCase, camelCase, and underscores (using Spiral).
Extract a list of the most frequent and useful name tokens such as lib and api from the list of all names,
Omit stop words, and apply tokenization and lemmatization on the names,
For the source file names, remove the most frequent but not useful name tokens that are common in various types of repositories regardless of their topic and functionality. These include names such as license, readme, body, run, new, gitignore, and frequent file formats such as txt (the complete list of these tokens is available in our repository). These tokens are frequently used but do not convey much information about the topic. For instance, if a token such as manager or style is repeatedly used in the description or README of a repository, it implies that the repository’s functionality is related to these tokens. However, an arbitrary repository can contain several files named style or manager while its main functionality is unrelated to these topics. Since we concatenate all the processed tokens from each repository into a single document and feed this document as the input to our models, we removed these domain-specific tokens from the list of file names to avoid any misinterpretation by the models.
Remove tokens with a frequency of less than 20. This omits uninformative personal tokens such as user names.
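The name-splitting and filtering steps might look roughly like the sketch below. The paper uses Spiral for identifier splitting; here a single regular expression is swapped in as a rough stand-in, and the set of uninformative tokens is only the handful of examples quoted above. The helper names are hypothetical.

```python
import re

# Only the examples named in the text; the full list is in the paper's repository.
UNINFORMATIVE = {"license", "readme", "body", "run", "new", "gitignore", "txt"}

# Split at camelCase boundaries and at any run of non-letter characters
# (dashes, dots, underscores, slashes, digits).
TOKEN_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|[^A-Za-z]+")


def split_name(name):
    """Split an identifier or path into lower-case word tokens."""
    return [part.lower() for part in TOKEN_BOUNDARY.split(name) if part]


def project_tokens(full_name):
    """Split 'owner/repository' into owner tokens and repository tokens."""
    owner, _, repo = full_name.partition("/")
    return split_name(owner), split_name(repo)


def file_name_tokens(file_names):
    """Tokenize file names and drop frequent but uninformative tokens."""
    tokens = []
    for name in file_names:
        tokens.extend(t for t in split_name(name) if t not in UNINFORMATIVE)
    return tokens


print(project_tokens("MalihehIzadi/SoftwareTagRecommender"))
print(file_name_tokens(["README.md", "src/dataLoader.py", "LICENSE", "style_manager.css"]))
```

Stop-word removal, lemmatization, and the frequency threshold of 20 would follow as separate passes over the resulting tokens.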
4.2.3 Statistics of Input Information
Based on the distribution of input data types, we truncate a fixed number of tokens and concatenate them to make a list of single input documents. To be exact, we extract a maximum of 10, 50, 400, 100, and 100 tokens from project names, descriptions, READMEs, wiki pages, and file names, respectively. In our dataset, most of the data for each repository comes from its README files. Figure 3 presents a histogram of prevalence of the number of input tokens among the repositories in our dataset. Table LABEL:tab:input_stat summarizes some statistics about our input data. The average number of input tokens per repository is 235.
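The truncation and concatenation described above can be sketched as follows. The per-source limits are the ones stated in the text; the dictionary keys, the function name, and the concatenation order (which simply follows the order in which the sources are listed) are assumptions.

```python
# Maximum number of tokens kept per input source, as stated in the text.
MAX_TOKENS = {"name": 10, "description": 50, "readme": 400, "wiki": 100, "files": 100}


def build_input_document(fields):
    """Truncate each token list to its limit and concatenate the results."""
    document = []
    for source, limit in MAX_TOKENS.items():
        document.extend(fields.get(source, [])[:limit])
    return document


# Hypothetical repository with long README and no wiki.
repo = {
    "name": ["n"] * 30,
    "description": ["d"] * 60,
    "readme": ["r"] * 1000,
    "wiki": [],
    "files": ["f"] * 5,
}
print(len(build_input_document(repo)))
```

With these limits, a single input document never exceeds 660 tokens.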
4.2.4 Preprocessing User-specified Topics
After processing the repositories’ textual information as reviewed in the previous sections, we clean the set of assigned user-specified topics. GitHub provides a set of community-curated topics online (https://GitHub.com/GitHub/explore/tree/master/topics). Each of these topics may have several aliases as well. In February 2020, GitHub provided a total of 355 featured topics along with 777 aliases. Among our 180K repositories, about 136K repositories contain at least one featured topic. However, our dataset also contains 118K unique user-specified topics, and the number of aliases for these featured topics (777) was very limited.
The large number of user-specified topics is due to the fact that topics are written in free-text format. For instance, topics could be written as their abbreviation/acronym or in their full form, in plural or singular, with or without numbers (denoting version, date, etc.), and with numbers in digits or letters. Moreover, the same topic can take different forms, such as having “ing” or “ed” at its end. Some users include stop words in their topics, some do not. Some have typos. Some include words such as plugin, app, application, and so on in one topic (with or without a dash). Note that topics written in different lexicons can represent the same concepts. Furthermore, a topic that has different parts, if split, can represent completely different concepts compared to what it was originally intended to represent. For example, single-page-application as a whole represents a website design approach. However, if split, a part of the topic such as single may lose its meaning or, worse, become misleading.
To address the above issues, we preprocess user-specified topics to map them to their respective featured topics. The goal is to (1) augment Github’s featured set by utilizing the large number of available user-specified topics in the community and (2) provide as many properly-labeled repositories as possible for the models to train. We define a set of heuristics, and perform the following steps:
Before performing any text-processing technique, extract existing featured topics from the list of user-specified topics (if any). Then, perform the following steps on the remaining tokens,
Remove versioning such as v3 in react-router-v3,
Remove arbitrary digits at the end of a topic, (note that we cannot simply remove any digits since topics, such as 3d, and d2v will lose their meaning),
Extract most frequent topics such as api, tool, or package from the rest of user-specified topics such as twitch-api,
Convert plural forms to singular, (one cannot simply remove ‘s’ from the end of a topic because topics such as css, and iOS will become meaningless)
Remove stop words,
Lemmatize topics (to preserve the correct word forms),
Aggregate split topics, e.g., combine neural and network into neural-network,
Augment the set of a repository’s featured topics (output of the first step) with our set of the mapped featured topics (recovered from the rest of above steps). Figure 4 depicts the process of augmenting featured topics with our sub-topics.
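The heuristics above could be approximated along the following lines. This is only a sketch under strong assumptions: the featured-topic set is a small hypothetical subset, the singularization rule and the exception list are deliberately naive, stop-word removal and lemmatization are left out, and the exact rules used by the authors are not specified in this section.

```python
import re

# Hypothetical subset of featured topics, for illustration only.
FEATURED = {"react", "api", "neural-network", "css", "ios", "3d", "python"}
# Topics ending in "s" that are not plurals (illustrative list).
NOT_PLURAL = {"css", "ios"}


def normalize(topic):
    """Apply version removal, trailing-digit removal, and naive singularization."""
    topic = topic.lower()
    topic = re.sub(r"-v\d+(\.\d+)*$", "", topic)   # react-router-v3 -> react-router
    topic = re.sub(r"(?<=[a-z])\d+$", "", topic)   # python2 -> python; 3d, d2v unchanged
    if topic.endswith("s") and not topic.endswith("ss") and topic not in NOT_PLURAL:
        topic = topic[:-1]
    return topic


def map_to_featured(user_topics):
    """Return directly extracted plus mapped featured topics for one repository."""
    direct = {t.lower() for t in user_topics if t.lower() in FEATURED}
    rest = [normalize(t) for t in user_topics if t.lower() not in FEATURED]
    mapped = {t for t in rest if t in FEATURED}
    # Frequent featured tokens embedded in compound topics, e.g. twitch-api -> api.
    for topic in rest:
        mapped.update(part for part in topic.split("-") if part in FEATURED)
    # Aggregate neighbouring topics, e.g. neural + network -> neural-network.
    for first, second in zip(rest, rest[1:]):
        if f"{first}-{second}" in FEATURED:
            mapped.add(f"{first}-{second}")
    return direct | mapped


user_topics = ["React", "twitch-api", "neural", "network", "python2", "3d", "react-router-v3"]
print(sorted(map_to_featured(user_topics)))
```

Matching only whole dash-separated parts against the featured set is one way to limit the risk noted earlier, where splitting a compound topic yields misleading fragments.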
As a result of these steps, we discovered about 29K unique sub-topics that can be mapped to their corresponding featured topics. Furthermore, we recover 16K more repositories (from our 180K repositories) and increase the total number of featured topics used in the dataset by 20%. At this stage, the data contains about 152K repositories with 355 unique featured topics and a total of 307K tagged featured topics. In order to have a sufficient number of sample repositories in both the training and testing sets, we remove featured topics used fewer than 100 times in the dataset. This leaves a set of 228 featured topics.
It is worth mentioning that while GitHub provides on average two aliases for each featured topic, we were able to identify on average 94 sub-topics for each featured topic. Moreover, while GitHub does not provide any alias for 95 featured topics, we were able to recover at least one sub-topic for half of them (48 out of 95). Table LABEL:tab:subtopics_stat summarizes the statistics of GitHub’s aliases and our sub-topics per repository. Table LABEL:tab:sample_repo_subtopics presents a sample of GitHub repositories, their user-specified topics, the directly extracted featured topics, and the additional featured topics mapped using our approach. In Section 6.1, we perform a human evaluation on a statistically representative sample of this dataset of 29K sub-topics and assess the accuracy of the mapped (sub-topic, featured topic) pairs.
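[Table: token number per repository.]
[Table: sample repositories with their user-specified topics, directly extracted featured topics, and additional mapped featured topics. Surviving rows: kubernetes-sigs/gcp-compute-persistent-disk-csi-driver (user-specified: k8s-sig-gcp, gcp; directly extracted: -; mapped: google-cloud, ...); microsoft/vscode-java-debug (user-specified: java, java-debugger, ...); fandaL/beso (user-specified: topology-optimization, calculix-fem-solver, ...). Other surviving topic fragments: tdd-utilities, bdd-framework, python2.]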
As displayed in Figure 5, in our dataset, the top 20% of topics cover more than 80% of the topics’ cumulative frequencies over all repositories. In other words, the cumulative frequencies of the top 45 topics cover 80% of the cumulative frequencies of all topics. Figure 5 also shows the distribution of the top 45 topics in the final dataset.
After employing all the preprocessing steps described above, we concatenate all the data of each repository into a single document file and generate the representations for feeding to classifiers.
4.3 Multi-label Classification
The classifiers we have reviewed in Section 3 are some of the most efficient and widely used supervised machine learning models for text classification. We train the following set of traditional classifiers with the preprocessed data acquired from the previous phase: MNB, GNB, and LR. The input data in text classification for these classifiers is typically represented as TF-IDF or Doc2Vec vectors. Usually, the MNB variant is applied to classification problems where multiple occurrences of words are important. We use MNB with TF-IDF vectors and GNB with Doc2Vec vectors. We also use LR with both TF-IDF and Doc2Vec vectors. To be comprehensive, we employ a FastText classifier as well, which can accept multi-label input data. As for the deep learning approaches, we fine-tune a pre-trained DistilBERT model to predict the topics. We discuss our approach in more detail in the following sections.
4.3.1 Multi-hot Encoding
Multi-label classification is a classification problem where multiple target labels can be assigned to each observation, instead of only one label as in standard classification. That is, each repository can have an arbitrary number of assigned topics. Since we have multiple topics for repositories, we treat our problem as a multi-label classification problem and encode the labels corresponding to each repository in a multi-hot-encoded vector. That is, for each repository we have a vector whose size equals the number of featured topics, with each element corresponding to one of our featured topics. The value of each element is either 0 or 1, depending on whether that repository has been assigned the target topic.
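A minimal sketch of multi-hot encoding in plain Python. The four featured topics and the helper name are hypothetical; in the paper's setting the vector would have one position per retained featured topic.

```python
def multi_hot(topics, label_index):
    """Encode a repository's topics as a 0/1 vector over the featured topics."""
    vector = [0] * len(label_index)
    for topic in topics:
        # Topics outside the featured set are ignored.
        if topic in label_index:
            vector[label_index[topic]] = 1
    return vector


# Hypothetical featured-topic vocabulary.
featured = ["docker", "python", "rust", "tui"]
label_index = {topic: i for i, topic in enumerate(featured)}

print(multi_hot(["rust", "tui", "my-custom-topic"], label_index))
```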
4.3.2 Problem Transformation
Problem transformation is an approach for transforming multi-label classification into binary or multi-class classification problems. The OneVsRest (OVR) strategy is a form of problem transformation that fits exactly one classifier per class. For each classifier, the class is fitted against all the other classes. Since each class is represented by only one classifier, OVR is an efficient and interpretable technique and is the most commonly used strategy when using traditional machine learning classifiers for a multi-label classification task. The classifiers take an indicator matrix as input, in which each cell indicates whether a given repository is assigned a given topic. Using this approach of problem transformation, we converted our multi-label problem to several simple binary classification problems, one for each topic.
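The decomposition behind OneVsRest can be illustrated without any library: each column of the indicator matrix becomes the binary target of its own classifier. The toy `MajorityClassifier` below is a hypothetical placeholder for a real binary model such as naive Bayes or logistic regression, and all names are illustrative.

```python
class MajorityClassifier:
    """Toy binary classifier that always predicts the most common training label."""

    def fit(self, X, y):
        # Predict 1 only if at least half of the training labels are 1.
        self.label_ = int(sum(y) * 2 >= len(y))
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def fit_one_vs_rest(X, indicator, topics, make_classifier):
    """Fit one binary classifier per topic (one column of the indicator matrix)."""
    models = {}
    for j, topic in enumerate(topics):
        y = [row[j] for row in indicator]   # 1 if the repository has this topic
        models[topic] = make_classifier().fit(X, y)
    return models


# Three hypothetical repositories and two topics.
X = ["doc a", "doc b", "doc c"]
topics = ["python", "rust"]
indicator = [[1, 0],
             [1, 1],
             [1, 0]]

models = fit_one_vs_rest(X, indicator, topics, MajorityClassifier)
print({topic: model.predict(["new doc"]) for topic, model in models.items()})
```

Because each topic has its own model, the per-topic predictions can be inspected independently, which is what makes the strategy easy to interpret.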
4.3.3 Fine-tuning Transformers
Recently, Transformers and the BERT model have significantly impacted the NLP domain. This is because the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of NLP tasks (in our case multi-label classification), without major task-specific architecture modifications. Therefore, we exploit DistilBERT, a successful variant of BERT in our approach. We add a multi-label classification layer on top of this model and fine-tune it on our dataset. Figure 6 depicts the architecture of our model.
4.3.4 Handling Imbalanced Data
As shown in Section 4.2, the distribution of topics in our dataset is very unbalanced (long-tailed distribution). That is, most repositories are assigned only a few of the topics, while many other topics are used less frequently (have less support). In such cases, the classifier can be biased toward predicting the more frequent topics more often, hence increasing precision and decreasing recall for the least-frequent topics. Therefore, we need to assign more importance to certain topics and define more penalties for their misclassification. To this end, we pass a vector containing the weights corresponding to our topics to the fit method of our classifiers. It is a list of weights with the same length as the number of topics. We populate this list from a dictionary that maps each topic to its weight. The weight $w_t$ for a topic $t$ is equal to the ratio of the total number of repositories, denoted as $N$, to the frequency of that topic ($f_t$), as shown in Equation 2:

\[ w_t = \frac{N}{f_t} \tag{2} \]

Thus, less-frequent topics will have higher weights when calculating loss functions, and the model learns to better predict them.
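The weighting scheme, the ratio of the total number of repositories to a topic's frequency, can be computed directly from the indicator matrix. A small sketch, with a hypothetical function name and a toy matrix; returning 0.0 for a topic that never occurs is an assumption made to avoid division by zero.

```python
def topic_weights(indicator):
    """Weight of each topic = number of repositories / frequency of the topic."""
    n_repositories = len(indicator)
    frequencies = [sum(column) for column in zip(*indicator)]
    return [n_repositories / f if f else 0.0 for f in frequencies]


# Four hypothetical repositories, two topics: the first is frequent, the second rare.
indicator = [[1, 0],
             [1, 0],
             [1, 1],
             [1, 0]]
print(topic_weights(indicator))
```

The rare topic receives a weight four times larger than the frequent one, so its misclassification contributes more to the loss.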
5 Experimental Design
In this section, we present our experimental setting. We aim at answering the following research questions to address different aspects of both our data component and the classifier models. We first perform a human evaluation to assess the quality and accuracy of our mappings between sub-topics and their corresponding featured topics in RQ1. We then evaluate the performance of our models based on various metrics including Recall, Precision, Success Rate and LRAP scores of the recommended lists by answering RQ2. To answer RQ3, we designed a user study and evaluated the correctness and completeness of recommended topics by our proposed approach and the baseline. Finally, with RQ4, we aim at investigating the necessity of different parts of our input data. The list of our research questions is as follows:
RQ1: How well can we map sub-topics to their corresponding featured topics?
RQ2: How accurately can we recommend topics for repositories?
RQ3: How accurate and complete is the set of recommended topics from the users’ perspective?
RQ4: Does the combination of input types actually improve the accuracy of the models?
5.1 Dataset and Models
We divided our preprocessed dataset of GitHub repositories (Section 4.2) into three subsets: training, validation, and testing. We first split the data into train and test sets with ratios of 80% and 20%, respectively. Then we split the train set into two subsets to obtain a validation set as well (with ratios of 90% to 10%). We have about 152K repositories with 228 selected featured topics. Input data consists of projects’ names, descriptions, READMEs, wiki pages, and file names concatenated together. To answer RQ4, we feed the model with different combinations of input types and evaluate the performance of the two best models.
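The two-stage split can be sketched as follows. The ratios come from the text; the shuffling, the fixed seed, and the function name are assumptions, since the section does not say how repositories were assigned to the subsets.

```python
import random


def split_dataset(items, test_ratio=0.2, val_ratio=0.1, seed=42):
    """Split into train/validation/test: 80/20 first, then 90/10 on the train part."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = round(len(items) * test_ratio)
    test, train_full = items[:n_test], items[n_test:]
    n_val = round(len(train_full) * val_ratio)
    validation, train = train_full[:n_val], train_full[n_val:]
    return train, validation, test


train, validation, test = split_dataset(range(1000))
print(len(train), len(validation), len(test))
```

Overall, this yields 72% of the data for training, 8% for validation, and 20% for testing.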
To train traditional classifiers, we use the scikit-learn library (https://scikit-learn.org). We exploit its OneVsRestClassifier feature for some of our traditional models such as NB and LR. Furthermore, we use the HuggingFace (https://huggingface.co) and SimpleTransformers (https://github.com/ThilinaRajapakse/simpletransformers) libraries for the implementation of our DistilBERT-based classifier. We set the learning rate, the number of epochs, the maximum input length, and the batch size. We also cap the maximum number of features for the TF-IDF and Doc2Vec embeddings; higher numbers would result in overfitted models and/or greatly increased training time. In addition, we set the minimum frequency count for Doc2Vec and the n-gram range for TF-IDF. As for FastText, we first optimize it using its automatic tuning option with a fixed time budget. The best parameters retrieved for our data comprise the learning rate, the minimum frequency count, and the n-gram size. We set the remaining parameters to default values. Our experiments are conducted on a server equipped with two GeForce RTX 2080 GPUs, an AMD Ryzen Threadripper 1920X CPU with 12 cores, and 64 GB of RAM.
5.2 Evaluation Metrics
To evaluate our methods we considered standard evaluation metrics applied in both recommendation systems and multi-label classification scenarios such as Recall, Precision, F1 measure, Success Rate, and Label Ranking Average Precision (LRAP) to address different aspects of our model izadi2014unifying. The evaluation metrics used in our study are as follows.
Recall, Precision, and F1 measure: These are the most commonly used metrics in assessing a recommender system’s performance in the top-k suggested topics jalili2018evaluating. Precision is the ratio $tp / (tp + fp)$, where $tp$ is the number of true positives and $fp$ the number of false positives. Thus, P@k for a repository is the percentage of correctly predicted topics among the top-k recommended topics for that repository. Similarly, Recall is the ratio $tp / (tp + fn)$, where $fn$ is the number of false negatives. Thus, R@k for a repository is the percentage of correctly predicted topics among the topics that are actually assigned to that repository.
The F1 measure, as expected, is the harmonic mean of the previous two and is calculated as $F1 = 2 \cdot P \cdot R / (P + R)$. We report these metrics for top-k recommendation lists. Moreover, we show how much these metrics are affected by changing the size of the recommendation list.
Success Rate: We denote the success rate for different top-k recommendation lists as S@k and report S@1 and S@5. S@1 measures whether the most probable predicted topic for each repository is correct. S@5 measures whether there is at least one correct suggestion among the first five recommendations.
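LRAP: This metric is used for multi-label classification problems, where the aim is to assign better ranks to the topics truly associated with each repository schapire2000boostexter. That is, for each ground truth topic, LRAP evaluates what fraction of higher-ranked topics were true topics. LRAP is threshold-independent and its score is always between $0$ and $1$, with $1$ being the best value. Equation 3 calculates LRAP. Given a binary indicator matrix of the ground truth topics $y \in \{0, 1\}^{n_{\text{samples}} \times n_{\text{labels}}}$ and the score associated with each topic $\hat{f} \in \mathbb{R}^{n_{\text{samples}} \times n_{\text{labels}}}$, the average precision is defined as
$$LRAP(y, \hat{f}) = \frac{1}{n_{\text{samples}}} \sum_{i=0}^{n_{\text{samples}}-1} \frac{1}{\|y_i\|_0} \sum_{j : y_{ij} = 1} \frac{|\mathcal{L}_{ij}|}{\text{rank}_{ij}} \tag{3}$$
where $\mathcal{L}_{ij}$ is $\{k : y_{ik} = 1, \hat{f}_{ik} \geq \hat{f}_{ij}\}$, $\text{rank}_{ij}$ is $|\{k : \hat{f}_{ik} \geq \hat{f}_{ij}\}|$, $|\cdot|$ computes the cardinality of the set, that is, the number of elements in the set, and $\|\cdot\|_0$ is the $\ell_0$ “norm”.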
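To make the LRAP definition concrete, the following pure-Python sketch implements it directly from the description above: for every true topic, it divides the number of true topics ranked at or above it by its overall rank, then averages per repository and over repositories. The function name and the treatment of repositories without any true topic (counted as 1.0) are assumptions; the toy matrices are illustrative and not taken from the paper's data.

```python
def lrap(y_true, y_score):
    """Label Ranking Average Precision for binary indicator rows `y_true`
    and real-valued score rows `y_score` (illustrative implementation)."""
    total = 0.0
    for truth, scores in zip(y_true, y_score):
        relevant = [j for j, label in enumerate(truth) if label == 1]
        if not relevant:
            # Assumed convention: a sample without true topics scores 1.0.
            total += 1.0
            continue
        sample_sum = 0.0
        for j in relevant:
            # rank_ij: number of topics scored at least as high as topic j.
            rank = sum(1 for s in scores if s >= scores[j])
            # |L_ij|: number of true topics scored at least as high as topic j.
            true_at_or_above = sum(1 for k in relevant if scores[k] >= scores[j])
            sample_sum += true_at_or_above / rank
        total += sample_sum / len(relevant)
    return total / len(y_true)


# Toy example: the single true topic is ranked 2nd in the first row, 3rd in the second.
y_true = [[1, 0, 0], [0, 0, 1]]
y_score = [[0.75, 0.5, 1.0], [1.0, 0.2, 0.1]]
print(round(lrap(y_true, y_score), 3))  # (1/2 + 1/3) / 2
```

A perfect ranking, in which every true topic outscores every other topic, yields a score of 1.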
5.3 Human Evaluation for Mapping Sub-topics to Featured Topics
To answer RQ1, we assessed the quality of the sub-topic dataset with the help of software engineering experts. As mentioned in Section 4.2.4, through cleaning 118K user-specified topics, we built a dataset of about 29K unique sub-topics which can be mapped to the set of GitHub’s 355 featured topics.
Fourteen software engineers participated in our evaluation, five females and nine males. All our participants have either an MSc or a PhD in Software Engineering or Computer Science. Moreover, they have a minimum of , and an average of years of experience in software engineering and programming.
As the set of sub-topics is too large to be manually examined in its entirety, we randomly selected a statistically representative sample of 7215 sub-topics from the dataset and generated their corresponding pairs as (sub-topic, featured topic). This sample size should allow us to generalize the conclusion about the success rate of the mappings to all our pairs with a confidence level of 95% and a confidence interval of 1%. We tried to retrieve at least 25 sub-topics corresponding to each featured topic. Note that for 47 featured topics, we were not able to extract this number of sub-topics since they only had a few sub-topics.
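The reported sample size can be sanity-checked with a short calculation. The sketch below assumes Cochran's formula with a finite population correction, a worst-case proportion of 0.5, and a population of exactly 29,000 sub-topics; the text only states "about 29K" and does not say which formula was used, so these are assumptions.

```python
import math


def sample_size(population, z=1.96, margin=0.01, p=0.5):
    """Sample size for estimating a proportion in a finite population.

    z      -- z-score of the confidence level (1.96 for 95%)
    margin -- half-width of the confidence interval (0.01 for 1%)
    p      -- assumed proportion; 0.5 gives the most conservative size
    """
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)  # infinite-population size
    n = n0 / (1 + (n0 - 1) / population)         # finite population correction
    return math.ceil(n)


# Assumed population of 29,000 sub-topics ("about 29K" in the text).
print(sample_size(29000))
```

Under these assumptions the result matches the 7215 pairs used in the evaluation.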
We developed a Telegram bot and provided participants with a simple question: “Considering the pair (featured topic, sub-topic), does the sub-topic convey all or part of the concept conveyed by the featured topic?” to which the participants could answer ‘Yes’, ‘No’, or ‘I am not sure!’. To provide better context for the participants, we also included the definition of the featured topics and some sample repositories tagged with the sub-topic. This would help them get a good understanding of the definition and usage of that particular topic among GitHub’s repositories. We asked our participants to take their time, carefully consider each pair, and answer with ‘Yes’ or ‘No’. In case they could not decide, they were instructed to use the ‘I am not sure!’ button. These cases were later analyzed by a third expert and were labeled either as ‘Yes’ or ‘No’ in the final round. For this experiment, we collected a minimum of two answers per pair of (featured topic, sub-topic). We consider pairs with at least one ‘No’ label as failure and pairs with unanimous ‘Yes’ labels as success. Figure 7 shows a screenshot of this Telegram bot.
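The decision rule described above (unanimous ‘Yes’ is a success, any ‘No’ is a failure) can be sketched as follows. The function names and the example pairs are hypothetical, and the sketch assumes that every ‘I am not sure!’ answer has already been resolved to ‘Yes’ or ‘No’ by the third expert.

```python
def pair_outcome(labels):
    """Return 'success' if all final labels are 'Yes', else 'failure'.

    `labels` holds the final answers for one (featured topic, sub-topic)
    pair; at least two answers per pair are required.
    """
    if len(labels) < 2:
        raise ValueError("each pair needs at least two answers")
    if any(label not in ("Yes", "No") for label in labels):
        raise ValueError("unresolved or unknown label")
    return "success" if all(label == "Yes" for label in labels) else "failure"


def mapping_success_rate(answers_per_pair):
    """Share of pairs whose outcome is 'success'."""
    outcomes = [pair_outcome(labels) for labels in answers_per_pair.values()]
    return outcomes.count("success") / len(outcomes)


# Hypothetical answers, keyed by (featured topic, sub-topic).
answers = {
    ("less", "lesscode"): ["Yes", "No"],
    ("unity", "unity3d"): ["Yes", "Yes"],
    ("code-quality", "linting"): ["Yes", "Yes", "Yes"],
    ("electron", "electron-app"): ["Yes", "Yes"],
}
print(mapping_success_rate(answers))
```

The rule is deliberately strict: a single dissenting participant is enough to mark a mapping as failed.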
5.4 User Study to Evaluate Recommendation Lists
We also designed a questionnaire to assess the quality of our recommended topics from users’ perspectives. We randomly selected repositories and included the topics recommended by our approach (LR with TF-IDF embeddings), the topics recommended by the baseline approach (Di Sipio et al. di2020multinomial), and the set of original featured topics (the ground truths). We presented these sets of topics to the participants as outputs of three anonymous methods to prevent biasing them. We asked the participants to rate the three recommendation lists for each repository based on their correctness and completeness. That is, for each repository, they answered the following questions:
Correctness: how many correct topics are included in each recommendation list,
Completeness: compare and rank the methods for each repository based on the completeness of the correct recommendations.
As this would require a long questionnaire and assessing all samples could jeopardize the accuracy of evaluations, we randomly assigned the sample repositories to the participants and made sure to cover each of the repositories at least by participants. To provide better context, we also include the content of the README file of repositories for the users.
6 Results
In this section, we present the results of our experiments and discuss them. First, we evaluate the quality of our sub-topics dataset using human evaluation. Then we present the results of our multi-label classification models, and finally the data ablation study to address the research questions.
6.1 RQ1: Human Evaluation of the Sub-topics Dataset
Our success rate was 98.6%, i.e., the participants confirmed that for 98.6% of pairs of the sample set, the sub-topic was correctly mapped to its corresponding featured topic. Only 101 pairs were identified as failed matches. Two of the authors discussed all the cases for which at least one participant had stated that the sub-topic and the featured topic should not be mapped. After a careful round of analysis, the incorrectly mapped topics were found to be related to a limited number of featured topics, namely unity, less, 3d, aurelia, composer, quality, c, electron, V, fish, and code-review. For instance, we had wrongly mapped data-quality-monitoring to code-quality, lesscode to less, and nycopportunity to unity. Moreover, there were also some cases where a common abbreviation such as SLM was used for two different concepts.
After performing this evaluation, we updated our sub-topic dataset accordingly. In other words, we removed all the instances of wrong matches from the dataset. The updated dataset is available in our repository for public use.
To answer RQ1, we conclude that our approach successfully maps sub-topics to their corresponding featured topics. Our participants confirmed that these sub-topics indeed convey a part or all of the concept conveyed by the corresponding featured topic in almost all instances of the sample set.
6.2 RQ2: Recommendation Accuracy
To answer RQ2, we present the results of both the baselines and the proposed models based on our evaluation metrics. The baseline models here are the approach of Di Sipio et al. di2020multinomial and variations of the Naive Bayes algorithm, namely MNB and GNB. We chose the latter two because the core algorithm in the baseline di2020multinomial is an MNB. Furthermore, these techniques lack balancing while our proposed models use balancing techniques. Di Sipio et al. di2020multinomial first extract a balanced subset of the training dataset by taking only sample repositories for each of their selected featured topics. They then proceed to train an MNB on this data. In the prediction phase, the authors use a source code analysis tool called GuessLang to predict the programming language of each repository separately. In the end, they take the topics predicted by their classifier and concatenate them with the programming language topic extracted by GuessLang to generate their recommendation list.
We set , and report the results for , , , and in Table LABEL:tab:results_topk. As shown by the results, we outperform the baselines by large margins regarding all evaluation metrics. In other words, we improve the baseline di2020multinomial by , , , and in terms of , , , and , respectively. Among our proposed models, the LR classifier with TF-IDF embeddings and the DistilBERT-based classifier achieve similar results and both outperform all other models.
|Di Sipio et al. di2020multinomial||0.465||0.750||0.561||0.210||0.289||0.553||20s||93s|
Another aspect of these models’ performance is the time it takes to train them and predict topics. Table LABEL:tab:results_topk presents the training time for each model as and the prediction time of a complete set of topics for a repository as . To estimate the prediction time, we measure the prediction time of sample recommendation lists for each model and report the average time per list. The values are in milliseconds, minutes, and hours. Note that the prediction time of the baseline di2020multinomial is significantly larger than that of our models. This delay is caused by using the GuessLang tool to predict programming language topics for repositories. Although the training time is a one-time expense, prediction time can be a key factor when choosing the best models.
Moreover, we vary the size of the recommendation lists and analyze its impact on the results. We evaluate five different list sizes and report the outcome in Figure 8. As expected, as the size of the recommendation list increases, so does Recall. However, while Recall goes up, Precision goes down and thus the F1 measure decreases. Note that both the LR and the DistilBERT-based classifiers perform very closely for all recommendation sizes and metrics.
To investigate whether there is a significant difference between the results of our proposed approach and the baseline, we followed the guideline and the tool provided by Herbold herbold2020autorank. We conducted a statistical analysis for three approaches, namely Di Sipio et al. di2020multinomial, LR, and the DistilBERT-based classifier, and used 30280 paired samples. We reject the null hypothesis that the population is normal for the three populations generated by these approaches. Because we have more than two populations and due to the fact that they are not normal, we use the non-parametric Friedman test to investigate the differences between the median values of the populations friedman1940comparison. We employed the post-hoc Nemenyi test to determine which of the aforementioned differences are statistically significant nemenyi1962distribution. The Nemenyi test uses a critical distance (CD) to evaluate which differences are significant. If the difference between two approaches is greater than the CD, then the two approaches are statistically significantly different. We reject the null hypothesis of the Friedman test that there is no difference in the central tendency of the populations. Therefore, we assume that there is a statistically significant difference between the median values of the populations. Based on the post-hoc Nemenyi test, we assume that there are no significant differences within the following group: LR and the DistilBERT-based classifier. All other differences are significant.
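For intuition on how small the critical distance becomes with this many paired samples, the sketch below uses the standard Nemenyi formula $CD = q_{\alpha} \sqrt{k(k+1)/(6N)}$ for $k$ approaches and $N$ samples. The significance level of 0.05, the tabulated $q_{\alpha}$ values, and the mean ranks in the usage example are assumptions for illustration; the tool used in the paper may compute the test with different settings.

```python
import math

# Commonly tabulated critical values q_alpha at alpha = 0.05 for the
# two-tailed Nemenyi test, indexed by the number of compared approaches
# (assumed here; not stated in the text).
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728}


def nemenyi_critical_distance(k, n_samples, q_table=Q_ALPHA_005):
    """Critical distance between mean ranks of k approaches over n_samples."""
    return q_table[k] * math.sqrt(k * (k + 1) / (6.0 * n_samples))


def significantly_different(mean_rank_a, mean_rank_b, cd):
    """Two approaches differ if their mean ranks are more than CD apart."""
    return abs(mean_rank_a - mean_rank_b) > cd


cd = nemenyi_critical_distance(k=3, n_samples=30280)
print(round(cd, 4))

# Hypothetical mean ranks, for illustration only.
print(significantly_different(1.60, 1.61, cd))  # within CD
print(significantly_different(1.60, 2.79, cd))  # beyond CD
```

With three approaches and 30280 paired samples, even a difference of a few hundredths in mean rank exceeds the critical distance, which is why only the two closely matched classifiers fall into the same group.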
Figure 9 depicts the results of hypothesis testing for the F1@5 measure. The Friedman test rejects the null hypothesis that there is no difference between the median values of the approaches. Consequently, we accept the alternative hypothesis that there is a difference between the approaches. Based on Figure 9 and the post-hoc Nemenyi test, we cannot say that there is a significant difference between LR and DistilBERT. All of the other differences are statistically significant.
In Table LABEL:tab:results_bestClasses, we present the results based on different topics. About 100 topics have Recall and Precision scores higher than 80% and 50%, respectively. Furthermore, only six topics out of 228 topics have Recall scores lower than 50%. Thus, in the following we investigate cases for which the model reports low Precision. We divide these topics into two groups: (1) topics assigned to a low number of repositories (weakly supported topics), and (2) topics assigned to a high number of repositories (strongly supported topics). In the first row, we report 36 topics of the first group, such as phpunit, code-review, dependency-management, less, package-manager, storybook, and code-quality, that are assigned to repositories fewer than 80 times in our data. Note that we have used balancing techniques in our models, which helped recommend less-frequent and specific topics correctly as much as possible. However, some of these topics seem to convey concepts used in general cases, such as operating-system, privacy, npm, mobile, and frontend. Therefore, we believe augmenting the dataset with more sample repositories tagged by these topics can boost the performance of our classifiers.
In the second row, we have 12 popular topics, including framework, nodejs, and server.
|Weakly supported topics||operating-system, p2p, privacy, neovim, eslint, yaml, hacktoberfest, aurelia, csv, web-components, gulp, maven, styled-components, homebrew, mongoose, nuget, firefox-extension, threejs, localization, wpf, scikit-learn, pip, webextension, virtual-reality, github-api, ajax, archlinux, nosql, vanilla-js, package-manager, less, storybook, code-quality, dependency-management, code-review, phpunit|
Finally, Table LABEL:tab:results_sample_list presents our model’s recommended topics for a few sample repositories. As confirmed by the user study, our proposed approach is not only capable of recommending correct topics but can also recommend missing topics (shown in green in the table).
|Repositories||Featured topics||Recommended topics (LR)|
|grpc/grpc||-||Csharp, framework, library, java, ruby|
|vysheng/tg||-||Telegram, linux, shell, lua, bash|
|google/gson||-||Java, json, xml, library, android|
|git/git||c, shell||git, documentation, security, c, shell|
6.3 RQ3: Results of the User Study
Figure 10 shows two groups of box plots comparing the correctness and completeness of the topics recommended by the three methods included in the user study. With regard to the correctness of the suggestions, the median and average numbers of correct topics for our model are and out of 5 recommended topics, while the median and average for the baseline approach, Di Sipio et al. di2020multinomial, are and correct topics out of 5 recommended topics. Regarding the completeness of the suggestions, the median and average rank assigned by the participants to our approach are and , respectively. This means that in almost all cases, our approach recommends the most complete set of correct topics, although there are a couple of outlier cases in which our proposed approach is ranked second or third (Figure 10-b). The median and average of the assigned rank for the baseline method are and . That is, in most cases, participants ranked the baseline last in terms of completeness.
Interestingly, our approach could recommend missing topics. In fact, users indicated that our recommended topics were often more complete than the featured topics of the repositories. This is probably because repository owners sometimes forget to tag their repositories with a complete set of topics. Thus, some correct topics will be missing from the repository (missing topics). However, our ML-based model has learned from the dataset and is able to predict more correct topics. This can also be the reason for the low Precision score of the ML-based models, because the ground truth lacks some useful and correct topics. As will be shown in the data ablation study in the next section, by mapping user-specified topics to featured topics we are able to extract more valuable information from the data and indeed increase the Precision and F1 scores.
Therefore, to answer RQ3, we conclude that our approach can successfully recommend accurate topics for repositories. Moreover, it is able to recommend more complete sets compared to both the baseline’s and the featured sets of topics.
6.4 RQ4: Data Ablation Study
To answer RQ4, we train our proposed models using different types of repository information (i.e., description, README, wiki pages, and file names) as the input. According to the results (Table LABEL:tab:results_diff_inputs), as a single input, wiki pages carry the least valuable information. This is probably because only a small number of repositories (about 10%) contained wiki pages. On the other hand, among single-source inputs, READMEs provide better results. This is probably because READMEs are the main source of information about a repository’s goals and characteristics. Thus, they have an advantage over other sources regarding both the quality and quantity of tokens. Consequently, READMEs contribute more to training. Therefore, to answer RQ4, while READMEs are essential for training models, adding more sources of information such as descriptions and file names indeed helps boost the models’ performance. Furthermore, these sources of information complement each other in case a repository lacks a description, a README, or an adequate number of files.
|All but file names||86.7%||33.6%||45.9%||78.1%|
|All but file names||86.6%||33.5%||45.8%||77.8%|
6.4.1 Different Number of Topics
We also investigate whether there is a relationship between the performance of different models and the number of topics they are trained on. We train several models on the most frequent 60, 120, 180, and 228 featured topics, respectively. Figure 11 depicts the results of this experiment. The interesting insight here is that both our proposed models (LR and the DistilBERT-based classifier) start from the same score for each metric and almost always overlap for all numbers of topics. This is shown in our qualitative analysis of the results as well (negligible difference between these two models). On the other hand, the MNB classifier (baseline) both starts from much lower scores and decreases faster.
6.4.2 Training with Separate Inputs
Here we report the results of training the models with separate input data. A repository’s description, README, and wiki pages consist of sentences; thus, they are inherently sequential. On the other hand, file names do not have any order. Therefore, we separate (1) descriptions, README files, and wiki pages from (2) project names and source file names and feed them separately to the models. For TF-IDF embeddings, we set the maximum number of features to 18K and 2K for textual data and file names, respectively. This is because most of the input of our repositories consists of textual information (descriptions, README files, and wiki pages). In the same manner, we set the maximum number of features to 800 and 200 for Doc2Vec vectors. Then we concatenated these vectors and fed them to the models. Table LABEL:tab:results_two_inputs shows the results of this experiment. Interestingly, the models behave differently. Some are improved (such as MNB), some underperform the previous case (such as GNB), and the performance of some models is not affected significantly (such as LR, both with TF-IDF and D2V). Therefore, one should take these differences into account when choosing the models and their settings.
|MNB, TF-IDF||Separate inputs||71.0%||27.2%||37.2%||62.0%|
|GNB, D2V||Separate inputs||58.0%||22.0%||30.2%||41.9%|
|LR, TF-IDF||Separate inputs||88.0%||34.1%||46.6%||79.4%|
|LR, D2V||Separate inputs||79.7%||30.3%||41.6%||67.0%|
6.4.3 Training before and after Topic Augmentation
Table LABEL:tab:results_without_subtopics compares several models trained on featured topics versus augmented topics (sub-topics mapped to their original featured topics). Our results indicate that adding more featured topics through mapping sub-topics improves the results in terms of Precision and F1 measure in all cases. A slight decrease in the Recall score is expected due to the increase in the number of true topics in the data.
|Model||Topics||Recall||Precision||F1|
|MNB, TF-IDF||Only Featured topics||66.2%||21.7%||31.1%|
|MNB, TF-IDF||Augmented with sub-topics||65.9%||25.3%||34.6%|
|LR, TF-IDF||Only Featured topics||90.9%||30.2%||43.1%|
|LR, TF-IDF||Augmented with sub-topics||89.0%||34.6%||47.0%|
|DistilBERT||Only Featured topics||89.8%||29.7%||42.5%|
|DistilBERT||Augmented with sub-topics||88.4%||34.3%||46.9%|
6.5 Practical Implications and Future Work
One of the major challenges in the management of software repositories is to provide an efficient organization of software projects such that users will be able to easily navigate through the projects and search for their target repositories. Our research can be the grounding step towards a solution for this problem. The direct value of topic recommenders is to assign fine-grained topics to repositories and maintain the size and quality of the topics set. In this work, we have tried to tackle this problem. Figure 12 presents a screenshot of our online tool, Repologue (https://www.repologue.com/). Our tool recommends the most related featured topics for any given public repository on GitHub. Users first enter the name of the target repository and ask for recommendations. Repologue first retrieves both the textual information and the file names of the queried repository. Then, using our trained LR model, it recommends the top topics to the user, sorted by their corresponding probabilities.
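The final ranking step described above can be sketched in a few lines of Python, assuming the trained classifier yields one probability per featured topic. The function name, the list size of five, and the probabilities are hypothetical and are not taken from Repologue's implementation.

```python
def recommend_top_n(topic_probabilities, n=5):
    """Return the n topics with the highest predicted probability,
    ordered from most to least probable."""
    ranked = sorted(topic_probabilities.items(),
                    key=lambda item: item[1], reverse=True)
    return [topic for topic, _ in ranked[:n]]


# Hypothetical per-topic probabilities for one repository.
probabilities = {
    "json": 0.81, "java": 0.93, "android": 0.22,
    "library": 0.47, "xml": 0.55, "docker": 0.03,
}
print(recommend_top_n(probabilities, n=5))
```

Sorting by probability rather than applying a fixed threshold guarantees that every queried repository receives a recommendation list of the same length.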
In the next step, the set of tagged repositories can also be the input to a more coarse-grained classification technique for software repositories. Such a classifier can facilitate the navigation task for users. In other words, the next steps to our research could be to analyze these topics, find the relationship between them, and build a taxonomy of topics. Then, using this taxonomy, one can identify the major classes existing in software repositories and build a classification model for categorizing repositories in their respective domain. Such categorization can help organize these systems and users will be able to efficiently search and navigate through software repositories. Another approach could be to utilize topics as a complementary input in a search engine. Current search engines mainly operate based on the similarity of textual data in the repositories. Feeding these topics as a weighted input to the search engines can improve the search results.
7 Related Work
In this section, we review previous approaches to this research problem. We organize related work in the following subgroups, including approaches on (i) predicting the topic of a software repository, and (ii) recommending topics for other software entities.
7.1 Topic Recommendation for GitHub Repositories
In 2015, Vargas-Baldrich et al. vargas2015automated presented Sally, a tool to generate tags for Maven-based software projects by analyzing their bytecode and the dependency relations among them. This tool is based on an unsupervised multi-label approach. Unlike this approach, we have employed supervised machine-learning-based methods. Furthermore, our approach does not require inspecting the bytecode of programs and, hence, can be used for all types of repositories.
Cai et al. cai2016greta proposed a graph-based cross-community approach, GRETA, for assigning topics to repositories. The authors built a tagging system for GitHub by constructing an Entity-Tag Graph and taking a random walk on the graph to assign tags to repositories. Note that this work was conducted in 2016, prior to the time that GitHub enabled users to assign topics to repositories; thus, the authors focused on building the tagging system from scratch and used cross-community domain knowledge, i.e., question tags from the Stack Overflow QA website. Contrary to this work, for training our model we used topics assigned by GitHub developers who actually own these repositories and are well aware of their salient characteristics and core functionality. Furthermore, the final set of topics, i.e., the featured topics, is carefully selected by the SE community and the official GitHub team. Therefore, apart from applying different methods, the domain knowledge and the quality of topics in our work are much more accurate and more relevant to the repositories.
Although both works have concentrated on building a tagging system for exploring and finding similar software projects, they differ in the approach and the type of input information.
Just recently, Di Sipio et al. di2020multinomial proposed using an MNB algorithm for classification of about 134 topics from GitHub. In each top-$n$ recommendation list for a repository, the authors predict topics using the MNB (text analysis) and one programming language topic using a tool called GuessLang (source code analysis).
Similar to our work, they have used featured topics for training multi-label classifiers. However, we perform rigorous preprocessing techniques on both user-specified topics and the input textual information. We provide and evaluate a dataset of 29K sub-topics for mapping to 228 featured topics. Our human evaluation of this dataset has shown that we successfully map these topics and thus, we are able to extract more valuable information out of the repositories’ documentation. Not only do we consider README files, but also we process and use other sources of available textual information such as descriptions, projects and repository names, wiki pages, and finally file names in the repositories. The Data Ablation Study confirms that each type of the information we introduce to the model improves its performance. Furthermore, we apply more suitable supervised models and balancing techniques. As a result of our design choices, we outperform their model by a large margin (from 29% to 65% improvement in terms of various metrics). Moreover, we perform a user study and assess the quality of our recommendation from users’ perspectives. Our approach outperforms the baseline here as well. Finally, we have also developed an online tool that predicts topics for given repositories.
Note that we believe since GitHub already provides the programming-language of each repository using a thorough code analysis approach on all its source code files, there is not much need for predicting only the programming-language topics using code analysis. However, we believe code analysis can be used for more useful goals such as finding the relations between topics through analyzing API calls, etc. which we plan to do in the future.
7.2 Tag Recommendation in Software Information Sites
There are several pieces of research on tag recommendation in software information websites such as Stack Overflow, Ask Ubuntu, Ask Different, and Super User wang2018entagrec++; wang2014tag; zhou2017scalable; xia2013tag; liu2018fasttagrec; maity2019deeptagrec. Question tags have been shown to help users get answers to their questions faster wang2018entagrec++. They have also helped in detecting and removing duplicate questions. Furthermore, it has been shown that more complete tags support developers’ learning (through easier browsing and navigation) held2012learning. The discussion around these tags and their usability in the SE community has become so prominent that the Stack Overflow platform has also developed a tag recommendation system of its own.
These approaches mostly employ word similarity-based and semantic similarity-based techniques. The first approach xia2013tag focuses on calculating the similarity based on the textual description. Xia et al. xia2013tag proposed TagCombine to predict tags for questions using a multi-label ranking method based on OneVsRest Naive Bayes classifiers. It also uses a similarity-based ranking component and a tag-term based ranking component. However, the performance of this approach is limited by the semantic gap between questions. Semantic similarity-based techniques wang2018entagrec++; wang2014tag; liu2018fasttagrec consider text semantic information and perform significantly better than the former approach. Wang et al. wang2018entagrec++; wang2014tag proposed ENTAGREC and ENTAGREC++. These two use a mixture model based on LLDA which considers all tags together. They contain six processing components: Preprocessing Component (PC), Bayesian Inference Component (BIC), Frequentist Inference Component (FIC), User Information Component (UIC), Additional Tag Component (ATC), and Composer Component (CC). They link historical software objects posted by the same user together. Liu et al. liu2018fasttagrec proposed FastTagRec for tag recommendation using a neural-network-based classification algorithm and bags of n-grams (bag-of-words with word order).
8 Threats to the Validity
In this section, we review threats to the validity of our research findings based on three groups of internal, external, and construct validity feldt2010validity.
Internal validity relates to the variables used in the approach and their effect on the outcomes. The set of topics used in our study can affect the outcome of our approach. As mentioned before, a user can generate topics in free-format text; thus, we need an upper bound on the number of topics used for training our models. To mitigate this problem, we first carefully preprocessed all the topics available in the dataset. Then we used the community-curated set of featured topics provided by the GitHub team. We mapped our processed sub-topics to their corresponding featured topics, and finally extracted a polished, widely used set of 228 topics. To assess the accuracy of these mappings, we performed a human evaluation on a randomly selected subset of the dataset. According to the results, the Success Rate of our mapping was 98.6%. We then analyzed the failed cases and updated our dataset accordingly to avoid misleading the models while extracting more information from the repositories’ documentation. Another factor can be errors in our code or in the libraries that we have used. To reduce this threat, we have double-checked the source code. However, there could still be experimental errors in the setup that we did not notice. Therefore, we have released our code and dataset publicly to enable other researchers in the community to replicate our work (https://GitHub.com/MalihehIzadi/SoftwareTagRecommender).
Compatibility: We have evaluated the final recommended topics both quantitatively and qualitatively. As shown in previous sections, their outcomes are compatible.
External validity refers to the generalizability of the results. To make our results as generalizable as possible, we have collected a large number of repositories in our dataset. Hence, we tried to make the approach extendable for automatic topic recommendation in other software platforms as well. Also for training the models, datasets were randomly split to avoid any biases being introduced to the model.
Construct validity relates to theoretical concepts and the use of appropriate evaluation metrics. We have used standard theoretical concepts that are already evaluated and established in the academic community. Furthermore, we have carefully evaluated our results based on various evaluation metrics for assessing both multi-label classification methods and recommender systems. Our results indicate that the employed approach has been successful in recommending topics for software entities.
Recommending topics for software repositories helps developers and software engineers access, document, browse, and navigate through repositories more efficiently. By giving users the ability to tag repositories, GitHub made it possible for repository owners to define the main features of their repositories with few simple textual topics. In this study, we proposed several multi-label classifiers to automatically recommend topics for repositories based on their textual information including their name, description, README files, wiki pages, and their file names. We first employed rigorous text-processing steps on both topics and the input textual information. We augmented the initial featured topics provided by the GitHub team by adding sub-topics and mapping them to their corresponding featured topics. Then we trained several multi-label classifiers including LR and DistilBERT-based models for predicting featured topics of GitHub repositories. We evaluated our models both quantitatively and qualitatively. Our experimental results indicate that our models can suggest topics with high and scores of and , respectively. According to users’ assessment, our approach can recommend on average correct topics out of topics and it outperforms the baseline. In the future, we plan to take into account the correlation between the topics more properly. We also can exploit code analysis approaches to augment our models.
Furthermore, using the output of our models, we will design new approaches for finding similar repositories or categorizing them using this set of featured topics.