Software maintenance involves tasks for mitigating potential defects in the code, as well as for evolving it according to users’ emerging needs smr.2316. Thus, it is crucial for the success of software projects. Issue tracking systems support these tasks by providing facilities to efficiently signal, manage, and address tickets reporting potential problems arising in software systems. In this context, software developers are required to react promptly to issues reported in issue trackers and to solve them with the lowest possible effort, in order to keep software maintenance costs low floris2010. However, especially in popular projects, tens or hundreds of issues are reported daily. This complicates issue management activities, resulting in heavier workloads for developers bissyande2013got; PanichellaBPCA14.
In projects hosted on GitHub, issue submitters report new issues by simply providing a title and an optional description of the issue. Since issues of different types (e.g., asking questions, proposing features, signaling bugs) and of varying quality can be submitted, GitHub also offers a customizable labeling system that developers can use to tag issue reports (e.g., by specifying the issue category or the related development tasks). Such labeling has positive effects on issue processing exploring:2018, easing issue management and prioritization DBLP:conf/wcre/IzquierdoCRBC15. More specifically, labels assigned to issues help to classify and filter the reports, allowing more efficient issue handling processes. However, manually labeling issues can be labor-intensive, error-prone, and time-consuming for project managers DBLP:conf/esem/FanYYWW17 and, for this reason, labels are barely used on GitHub exploring:2015; bissyande2013got.
To help maintainers deal with issue processing, we developed Ticket Tagger DBLP:conf/icsm/KallisSCP19, a tool able to automatically label issue reports. Differently from previous approaches aimed at automatically identifying issue types DBLP:conf/cascon/AntoniolAPKG08; DBLP:journals/smr/ZhouTGG16, our tool exclusively relies on the textual features contained in the titles and descriptions of the reports, since GitHub (given its lightweight structure) does not provide any structured information about issues. This enables the automated labeling of reports immediately after they are submitted, which is beneficial for developers who need to handle new issues DBLP:conf/wcre/IzquierdoCRBC15.
In this paper we briefly illustrate Ticket Tagger, a GitHub app that can easily be installed on any software repository hosted on GitHub and that automatically marks newly submitted issues with a relevant label. In addition, we assess the classification performance achieved by using different machine learning strategies and investigate the extent to which confounding factors of different types can degrade classification results.
2 Approach and Tool’s Overview
To classify an issue report, Ticket Tagger processes the report’s title and body to represent the textual information (extracted from the issue) in a vectorial space. By inspecting the resulting components, the tool can assign a relevant label to the mentioned report.
The Machine Learning Model. Different Machine Learning (ML) algorithms can be adopted to efficiently classify textual information joulin2016bag; SorboPVPCG16; PanichellaSGVCG15. However, complex ML strategies may require long training times and consume substantial memory. Since we wanted to deploy the model on low-end server hardware (an AWS EC2 t2.nano instance with 1 vCPU, 512 MB RAM, and a 20 GB SSD), we opted for fastText, a tool using linear models with a rank constraint and fast loss approximation that achieves classification results comparable to several deep learning-based approaches joulin2016bag.
Issue Reports Pre-processing and Vectorial Representation. To allow the fastText linear classifier to make issue type predictions, the title and body of each report are concatenated into a single textual paragraph. The resulting text is then tokenized, and the tokenized text is used to obtain the bag-of-words representation of the issue. This bag-of-words representation, in which each word is represented by a vector of character n-grams, is the input of the fastText-based classifier.
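As an illustration, the pre-processing above can be sketched as follows. This is a minimal sketch; the helper names are ours, not part of Ticket Tagger's code. Note that fastText wraps each word in the boundary symbols `<` and `>` before extracting character n-grams:

```python
import re

def to_fasttext_line(label, title, body):
    """Concatenate title and body into one paragraph and prepend a
    fastText supervised-learning label (format: __label__<name>)."""
    text = f"{title} {body}".lower()
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return f"__label__{label} {text}"

def char_ngrams(word, n=3):
    """Character n-grams of a word, wrapped in the '<' and '>'
    boundary symbols as fastText does for subword features."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]
```

For example, `char_ngrams("where")` yields `['<wh', 'whe', 'her', 'ere', 're>']`.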
Issues Classification. The fastText model classifies issues by minimizing the following objective function over the possible labels:

$$ -\frac{1}{N}\sum_{n=1}^{N} y_n \log\big(f(BAx_n)\big) $$

where $x_n$ is the bag of features of the $n$-th issue, $y_n$ is its label, $A$ is the weight matrix of the average text embeddings, $B$ is the weight matrix that converts the embedding to pre-softmax values for each class, and $f$ is the hierarchical softmax function used to minimize computational complexity DBLP:conf/icsm/KallisSCP19.
We set up fastText using the default values for most parameters (for further details, see https://fasttext.cc/docs/en/options.html) and applied the following customizations:
word n-gram features are not captured, i.e., the wordNgrams parameter is left at its default of 1;
we only consider words that occur at least 14 times in the dataset, i.e., the minCount parameter is set to 14.
Both settings have been applied according to the disk constraints of our server hardware. Indeed, these decisions allowed us to obtain a trained model requiring less than 5 MB of disk space while imposing a performance penalty of less than 10%.
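The effect of the minCount cutoff can be sketched as follows. This is our own illustrative Python, not fastText's implementation; the fastText call shown in the comment uses the real fasttext Python API, but the file name train.txt is a placeholder:

```python
from collections import Counter

def build_vocabulary(corpus, min_count=14):
    """Keep only words occurring at least `min_count` times, mirroring
    fastText's minCount cutoff; rare words are dropped, shrinking the
    embedding table (and hence the model's on-disk size)."""
    counts = Counter(word for doc in corpus for word in doc.split())
    return {w for w, c in counts.items() if c >= min_count}

# The corresponding fastText training call would be (requires the
# `fasttext` package and a training file in __label__ format):
# import fasttext
# model = fasttext.train_supervised("train.txt", wordNgrams=1, minCount=14)

toy_corpus = ["crash on start", "crash when saving", "feature request"]
vocab = build_vocabulary(toy_corpus, min_count=2)  # only "crash" repeats
```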
Ticket Tagger is currently able to classify issues according to three categories reflecting the intent SorboPVPCG16; DBLP:conf/sigsoft/SorboPASVCG16 of the writer: bug report, enhancement, and question. These labels are included by default in every GitHub repository and are the three most used labels on GitHub exploring:2015. Our model can, however, be easily re-trained to adapt Ticket Tagger to specific projects’ needs, enabling the prediction of additional issue types.
Tool’s Overview. When a new issue report is submitted to a GitHub repository on which Ticket Tagger is installed, the tool automatically assigns a relevant label to the new report. In particular, Ticket Tagger is a Node.js-based GitHub app that automatically (i) gathers issue report information from a GitHub repository, and (ii) labels newly reported issues by leveraging the pre-trained fastText model previously discussed. The app is freely accessible and can be easily installed on any existing GitHub repository: after navigating to the Ticket Tagger app webpage (https://github.com/apps/ticket-tagger), the repository administrator only has to click on the “Install” button and specify the repository. From this moment on, as depicted in Figure 1, when a user opens a new issue ticket on the repository, GitHub calls the hook endpoint exposed by Ticket Tagger and passes the information related to the newly created issue. Such information is used by the app to classify the ticket. In order to automatically label the issue report, GitHub provides a temporary access token to Ticket Tagger, which is consumed by assigning the predicted label to the issue. The automated issue labeling performed by Ticket Tagger allows developers to (i) timely react to urgent issues, (ii) postpone less pressing tasks (such as enhancement requests), or (iii) assign questions to specific users.
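The webhook flow can be sketched as follows. This is a simplified sketch in Python rather than the tool's actual Node.js code; the payload fields follow GitHub's `issues` webhook event, while `classify` is a stand-in for the fastText model:

```python
def handle_issue_opened(payload, classify, token):
    """Given a GitHub `issues` webhook payload (action == "opened"),
    predict a label and build the REST call that would assign it.
    Returns (url, headers, body) instead of performing the request."""
    issue = payload["issue"]
    text = f'{issue["title"]} {issue.get("body") or ""}'
    label = classify(text)  # e.g. "bug", "enhancement", "question"
    url = (f'https://api.github.com/repos/'
           f'{payload["repository"]["full_name"]}'
           f'/issues/{issue["number"]}/labels')
    headers = {"Authorization": f"token {token}",
               "Accept": "application/vnd.github+json"}
    return url, headers, {"labels": [label]}
```

A POST of `body` to `url` with `headers` corresponds to GitHub's "add labels to an issue" REST endpoint, consuming the temporary installation token mentioned above.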
3 Performance Evaluation
In this section, we describe the datasets and baseline approach used to assess the classification performance of the fastText model integrated into Ticket Tagger (described in Section 2).
Datasets Construction. To assess Ticket Tagger’s effectiveness in classifying GitHub issues we collected two datasets. The first, balanced dataset contains 30,000 issues (https://tinyurl.com/y23kgdro). It was obtained by first collecting issues from 12,112 heterogeneous projects by querying the GitHub Archive (https://gharchive.org) using Google BigQuery (https://cloud.google.com/bigquery). After this initial step, we randomly sampled issues from the set of all GitHub issues closed during 2018, then selected all issues having a label matching one of the following strings: bug, enhancement, or question. With this random selection process, we selected, on average, about 2.5 issues per project. One third of the 30,000 issues had the bug label assigned; one third had the enhancement label (this label refers to improvements and new features) assigned; the remaining 10,000 issues had the question label assigned. To construct the second, unbalanced dataset, we ran a query (https://tickettagger.blob.core.windows.net/scripts/github-labels-top3-34k.sql) over the GitHub Archive using Google BigQuery. We queried for issues containing any of the three labels, i.e., bug, enhancement, and question, between the 1st and 9th of March 2018, obtaining approximately 34,000 issues (https://tickettagger.blob.core.windows.net/datasets/github-labels-top3-34k.csv). The resulting distribution of issue types in this dataset is as follows: 16,355 (48%) tickets labeled as bug, 14,228 (41.8%) tickets marked as enhancement, and 3,458 (10.2%) question issues. While the first dataset contains an identical number of tickets from each category, the second one presents an unbalanced distribution of labels and is more representative of reality.
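The balanced sampling step can be sketched as follows (our own illustrative code; the seed and helper names are assumptions, not part of the original pipeline):

```python
import random

def sample_balanced(issues, labels=("bug", "enhancement", "question"),
                    per_label=10_000, seed=42):
    """Randomly draw `per_label` issues for each label from a list of
    (label, issue_text) pairs, yielding a balanced dataset."""
    rng = random.Random(seed)
    balanced = []
    for label in labels:
        pool = [pair for pair in issues if pair[0] == label]
        balanced.extend(rng.sample(pool, min(per_label, len(pool))))
    return balanced
```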
Evaluation Methodology. The goal of our experiments is twofold. On the one hand, we compare Ticket Tagger against a baseline approach, to observe whether simpler ML-based approaches are able to achieve comparable or better results. On the other hand, we evaluate the extent to which Ticket Tagger is able to automatically identify the correct labels to assign to issue reports in a realistic scenario. More specifically, we compare Ticket Tagger with the J48 machine learning (ML) algorithm, which was successfully used in previous work on the assessment of ML strategies for textual classification problems PanichellaSGVCG15; DBLP:conf/sigsoft/SorboPASVCG16. To perform such a comparison, a 10-fold cross-validation strategy 10489157 on the balanced dataset is used to evaluate the classification performance achieved by both Ticket Tagger and the baseline J48 ML algorithm.
For training the J48 model, we leverage all the terms contained in both titles and descriptions of the issues in our dataset to build a document-term matrix, where each row represents an issue and each column represents a term. Every entry (i, j) of this matrix represents the weight, or importance, of the j-th term in the i-th issue, computed according to the tf-idf weighting scheme BaezaYates:1999, which has been successfully used in recent work on the classification of GitHub issues DBLP:conf/esem/FanYYWW17 and vulnerabilities DBLP:journals/jss/RussoSVC19. The evaluation is performed without the custom settings used for reducing fastText’s disk space (described in Section 2).
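The tf-idf weighting can be sketched as follows. This is a minimal illustration using raw term frequency and the textbook formulation idf(t) = log(N / df(t)); practical implementations often apply smoothed variants:

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Build a document-term matrix of tf-idf weights.
    tf(t, d) = raw count of t in d; idf(t) = log(N / df(t))."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # document frequency of each term
    terms = sorted(df)
    matrix = []
    for tokens in tokenized:
        tf = Counter(tokens)
        matrix.append([tf[t] * math.log(n_docs / df[t]) for t in terms])
    return terms, matrix
```

A term that appears in every document gets weight 0 (idf = log 1), while a term concentrated in few documents is weighted up.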
With the aim of assessing Ticket Tagger’s capability of recognizing issue types in a realistic setting, i.e., with an unbalanced distribution of issue types, we carry out a further experiment in which Ticket Tagger is trained on the whole balanced dataset, while the unbalanced dataset is used for evaluating the classification performance. This particular setting, i.e., a balanced training set and an unbalanced test set, is motivated by the need to prevent the resulting model from being biased towards the majority class(es). Well-known information retrieval metrics, namely precision, recall, and F-measure BaezaYates:1999, are adopted to evaluate the classification performance in our experiments.
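For reference, the three metrics can be computed per class as follows (our own sketch, using the standard definitions):

```python
def prf(actual, predicted, label):
    """Per-class precision, recall and F-measure.
    precision = TP / (TP + FP); recall = TP / (TP + FN);
    F-measure = harmonic mean of precision and recall."""
    tp = sum(a == label and p == label for a, p in zip(actual, predicted))
    fp = sum(a != label and p == label for a, p in zip(actual, predicted))
    fn = sum(a == label and p != label for a, p in zip(actual, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f
```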
| | | bug | enhancement | question |
| Ticket Tagger | Precision | 0.82 (+0.24) | 0.89 (+0.29) | 0.78 (+0.13) |
| | Recall | 0.84 (+0.25) | 0.76 (+0.13) | 0.87 (+0.26) |
| | F-measure | 0.83 (+0.24) | 0.82 (+0.20) | 0.83 (+0.20) |
Results. Table 1 reports the classification performance achieved by both Ticket Tagger and the baseline approach (J48) using 10-fold cross validation on the balanced dataset. In particular, Table 1 shows how Ticket Tagger obtained F-measure values above 0.80 for each considered label, confirming the practical usefulness of the proposed approach for improving issue management practices on GitHub. In addition, we can observe how, on the balanced dataset, Ticket Tagger always outperforms the baseline approach (J48) for all labels and on all of precision, recall, and F-measure.
Table 2 shows the performance of Ticket Tagger in identifying bug, enhancement, and question issues when trained on the balanced dataset and tested on the unbalanced one. The results of this second experiment highlight that our tool automatically identifies issues of the bug and enhancement types with reasonably high effectiveness, i.e., an F-measure of about 0.75, while lower classification performance is obtained for the question category. On the one hand, these findings confirm the practical usefulness of our tool, as it achieves reasonably high performance in automatically recognizing issues reporting bugs or requesting features, which are the most important feedback for developers performing software maintenance and evolution activities DBLP:conf/sigsoft/SorboPASVCG16. On the other hand, we believe that further effort and tuning are required to improve the tool’s capability of recognizing issues of the question type.
In recent work, Herbold et al. abs-2003-05357 considered Ticket Tagger in a quantitative comparison, showing that fastText outperforms the competition on the issue labeling problem even without particular tuning. Herbold et al.’s approach achieves slightly higher precision than our model because it leverages fastText’s auto-tuning feature, which we did not use in Ticket Tagger, and exploits structural information about the issues.
Discussion of confounding factors. Several factors can potentially influence Ticket Tagger’s performance, as discussed below.
Impact of function words: for issues belonging to the bug and enhancement classes, both precision and recall are above 0.70, while Ticket Tagger produces higher numbers of false positives and false negatives for the question category, i.e., a lower precision and a lower recall are achieved for this class. We believe that the heavy use of function words that typically introduce questions, e.g., “how” or “what”, in the issue title or description could lead the classifier to erroneously assign the question label to issues that actually belong to different classes, degrading the precision achieved for the question category. In addition, the lower recall obtained for this class could be connected to the fact that developers (and users) ask questions about a wide range of topics DBLP:journals/jcst/YangLXWS16, making it hard to learn all the patterns that could lead to the assignment of this label.
Language Consistency of Issue Tickets: we automatically detected the language of each ticket using a tool based on heuristics over character sets and trigrams. The results of this experiment suggest that language consistency in issue tickets has a positive effect on the classification performance.
Presence of Code Snippets in Issue Tickets: we investigated whether the presence of code snippets in tickets affects the performance of our model. To this end, we generated two datasets, one consisting of 6,000 tickets containing code snippets and one baseline dataset of 6,000 tickets sampled at random using the previously mentioned method. The presence of code snippets is recognized by detecting pieces of text enclosed in triple backticks, the syntax recommended by the GitHub Flavored Markdown language (https://docs.github.com/en/github/writing-on-github/basic-writing-and-formatting-syntax) to highlight code snippets. The results show that the presence of snippets does not significantly affect the classification performance.
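The snippet detection described above can be sketched with a simple regular expression (our own illustrative code, matching paired triple-backtick fences):

```python
import re

# A fenced code block is a pair of triple-backtick markers, per the
# GitHub Flavored Markdown syntax for highlighting code.
FENCE = re.compile(r"```.*?```", re.DOTALL)

def contains_code_snippet(ticket_body):
    """Return True if the ticket body contains at least one
    triple-backtick fenced code block."""
    return bool(FENCE.search(ticket_body or ""))
```

Note that a lone, unclosed ``` marker is not counted as a snippet by this pattern.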
4 Threats to Validity
Threats to construct validity. We compared Ticket Tagger with a baseline approach (J48) on a dataset comprising equal numbers of bugs, enhancements and questions. This could represent a threat to construct validity as in real scenarios the distributions of the different types of issues may be unbalanced. To counteract this issue, we also assessed Ticket Tagger on a second unbalanced dataset where the proportion between the different classes is close to reality.
Threats to internal validity. Our results could be misleading if a significant percentage of the collected issues were later re-labeled. To mitigate this concern and reduce the likelihood of re-labeling for the considered samples, we only collected GitHub issues having the closed status assigned.
Threats to external validity. The main threat to external validity is related to the potential specificity of our datasets: the collected issues might not be adequately representative of all the issues present on GitHub. However, to increase the heterogeneity of the data, we selected issues from projects (i) having different natures, (ii) implemented in different programming languages, and (iii) developed by different developer communities. Further confirming the low specificity of our datasets and the quality of our results, in recent work Ticket Tagger was considered in a quantitative comparison abs-2003-05357, which demonstrated that fastText outperforms state-of-the-art approaches addressing the issue labeling problem.
5 Conclusion and Future Work
In this work, we presented Ticket Tagger, an app released on the GitHub marketplace that automatically assigns suitable labels to issues opened on GitHub projects. The core of Ticket Tagger is a machine learning model that analyzes the title and the textual description of an issue in order to determine whether it can be labeled as a bug report, a feature request, or a question.
With the aim of assessing the classification performance achieved by our tool, we conducted four main evaluation experiments. The results showed that Ticket Tagger automatically assigns labels with reasonably high levels of precision and recall, outperforming a baseline approach. Our findings also showed that the use of a consistent language can improve Ticket Tagger’s classification performance, while the presence of code snippets does not affect the results significantly.
Future work will be aimed (i) at comparing Ticket Tagger’s accuracy and functionality with other existing solutions, as well as (ii) at investigating its usefulness through the analysis of direct feedback from end-users.
The authors express their gratitude to the anonymous reviewers who dedicated their considerable time and expertise to the paper. Sebastiano Panichella gratefully acknowledges the Horizon 2020 (EU Commission) support for the project COSMOS (DevOps for Complex Cyber-physical Systems), Project No. 957254-COSMOS.
Current executable software version
| Nr. | Software metadata description | |
| S1 | Current software version | 2.1.4 |
| S2 | Permanent link to executables of this version | https://github.com/rafaelkallis/ticket-tagger/releases/tag/v2.1.4 |
| S3 | Legal Software License | GNU General Public License (GPL) |
| S4 | Computing platform / Operating System | macOS, Linux |
| S5 | Installation requirements & dependencies | nodejs 12 |
| S6 | If available, link to user manual | https://github.com/rafaelkallis/ticket-tagger/blob/master/README.md |
| S7 | Support email for questions | firstname.lastname@example.org |
Current code version
| Nr. | Code metadata description | |
| C1 | Current code version | 2.1.4 |
| C2 | Permanent link to code/repository of this code version | https://github.com/rafaelkallis/ticket-tagger |
| C3 | Legal Code License | GNU General Public License (GPL) |
| C4 | Code versioning system used | git |
| C6 | Compilation requirements, operating environments & dependencies | nodejs 12 |
| C7 | If available, link to developer documentation/manual | https://github.com/rafaelkallis/ticket-tagger |
| C8 | Support email for questions | email@example.com |