A Bug or a Suggestion? An Automatic Way to Label Issues

09/03/2019 · Yuxiang Zhu, et al. · Nanjing University

More and more users and developers are using Issue Tracking Systems (ITSs) to report issues, including bugs, feature requests, enhancement suggestions, etc. Different information, however, is gathered from users when issues are reported on different ITSs, which presents considerable challenges for issue classification tools to work effectively across the ITSs. Besides, bugs often take higher priority when it comes to classifying the issues, while existing approaches to issue classification seldom focus on distinguishing bugs and the other non-bug issues, leading to suboptimal accuracy in bug identification. In this paper, we propose a deep learning-based approach to automatically identify bug-reporting issues across various ITSs. The approach implements the k-NN algorithm to detect and correct misclassifications in data extracted from the ITSs, and trains an attention-based bi-directional long short-term memory (ABLSTM) network using a dataset of over 1.2 million labelled issues to identify bug reports. Experimental evaluation shows that our approach achieved an F-measure of 85.6% in distinguishing bugs and other issues, significantly outperforming the other benchmark and state-of-the-art approaches examined in the experiment.


I Introduction

User feedback is crucial in requirements engineering and software process management [18], and how to automatically collect and analyze user feedback is a major task in this field. With the development of the open source software (OSS) movement, issue reports are playing an essential role in the two-way communication among end users, contributing developers, and core developers [19, 7, 5].

Issue Tracking Systems (ITSs), commonly used in large projects, provide platforms to help developers collect, maintain, manage, and track issues like bugs, feature requests, improvement suggestions, and other user feedback [13]. Jira is one of the most famous and widely used ITSs [21]. Many open source projects and organizations, such as Hibernate (https://hibernate.atlassian.net/), JBoss (https://jira.jboss.org/), Spring (https://jira.spring.io/), and the Apache Software Foundation (https://issues.apache.org/jira/), use Jira to manage issues. An example issue in the Jira ITS for project ‘Spring Boot’ is shown in Figure 1. The issue has a title, a type, and other properties like priority, status, and resolution. Another popular ITS is the one used by the world’s leading software development platform, GitHub. Figure 2 shows an overview of three issues collected from the GitHub project ‘google/guava’ (https://github.com/google/guava/issues). As the figure shows, each issue lists the package it affects, its status, and its type. For ease of presentation, we refer to this ITS simply as the GitHub ITS in the rest of this paper. Other popular Issue Tracking Systems include, e.g., Bugzilla, Redmine, and Mantis.

Fig. 1: An issue in the Jira ITS for project ‘Spring XD’.
Fig. 2: Three issues in the GitHub ITS for project ‘google/guava’.

In the open source community, a relatively higher priority is often placed on bug reports when it comes to triaging issues, since bug fixing is often more urgent than other tasks [8]. Therefore, most ITSs allow reporters to manually label, or classify, an issue as reporting a bug or something else. Such classification, however, may be incorrect due to limitations of the reporters’ knowledge and experience. Misclassifications may cause issues to be assigned to the wrong handler or delay their resolution [6, 27, 15]. Besides, misclassifications constitute a major threat to the validity of models learned from ITSs [11], since they introduce noise into the learning process. For example, the type field indicates whether an issue is a bug or not in Figures 1 and 2; the issue in Figure 1 is currently labeled as a bug, although it would be more appropriately classified as a dependency-upgrade or enhancement request. A tool that accurately distinguishes bugs from other types of issues can help developers better prioritize their tasks when processing issues.

In this paper, we propose a novel approach to effective issue classification across various ITSs, in which an attention-based bi-directional long short-term memory (ABLSTM) network is learned and used to distinguish bugs from non-bug issues. In particular, the k-NN algorithm is employed to identify and correct misclassifications of issues extracted from existing ITSs, so that the model learned from these issues has greater discriminative power for bugs. Compared to existing state-of-the-art approaches and those based on traditional methods [3, 9, 23], our approach significantly improves the F-measure in identifying bugs.

Overall, we make the following contributions in this paper:

  • We propose a novel approach to automatically and effectively labelling bugs in various Issue Tracking Systems;

  • We implement the approach into a prototype tool;

  • We carry out an empirical study and show the superiority of our approach by comparing its effectiveness with existing state-of-the-art approaches and other benchmark approaches.

The remainder of this paper is structured as follows: Section II presents the main steps involved in applying the proposed approach; Section III reports on the experiments we conducted to evaluate the approach; Section IV gives the experimental results; Section V lists several major threats to our work’s validity; Section VI reviews related work; Section VII concludes the paper.

II Methodology

In this section, we present our proposed approach, which can automatically label issues in ITSs. An overview of our approach is shown in Figure 3. The rest of this section describes the approach in detail.

Fig. 3: The overall architecture of our approach

II-A Data Collection

Our main dataset is collected from three separate Jira ITSs, which contain all issues of the Apache, JBoss, and Spring projects, respectively. The dataset contains over 1.2 million issues, all labelled by their reporters.

Moreover, we also collected issue data from GitHub. We noticed that only a few projects, even among highly starred ones, are comprehensively and carefully labelled. We therefore employed the following inclusion criteria to determine whether a project is suitable for data collection: 1) the number of issues exceeds 500; 2) more than half of the issues are labelled, suggesting the developers made a serious attempt to manage issue labels; 3) labels are used to denote the issues’ types (some projects use labels only to denote, e.g., the solving status).

After inspecting hundreds of top-starred Java projects on GitHub, we carefully selected 23 open source projects whose developers manage, label, and resolve issues conscientiously. We collected 178,390 issues from these projects, 92.3% of which are labelled. Among the labelled issues, 85,833 are labelled with bug/non-bug information, while the others are labelled only with irrelevant information such as status and affected modules and are not considered in the following processing. More details about our dataset are presented in Section III-A.

II-B Dataset Preprocessing

1) Reporter Type Extraction: In different projects, reporters use different phraseology to denote the types of issues. For instance, labels like ‘bug’, ‘bug report’, ‘crash’, and ‘defect’ have been used to denote bugs, while labels like ‘suggestion’, ‘enhancement’, ‘improvement’, and ‘enhancement request’ have been used to denote non-bug issues. Besides, it is conventional in some projects to use prefixes like “platform=”, “type=”, and “status=” to denote the labels’ function, as demonstrated in Figure 2.

To figure out the types of issues as assigned by their reporters, or the issues’ reporter types, we therefore extract the labels from all issues in our dataset, identify and remove the prefix in each label, and then decide manually whether the remainder of each label indicates a bug or a non-bug issue type. The reporter type of an issue is ‘bug’ if at least one of its labels suggests so, ‘non-bug’ if the issue is assigned a type other than ‘bug’, or null if no type is assigned to the issue by its reporter.
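As an illustration, a minimal Python sketch of this mapping step is given below; the prefix list and the bug/non-bug keyword sets are illustrative examples drawn from the labels mentioned above, not the complete lists used in our implementation.

```python
# Hypothetical sketch of reporter-type extraction: strip known prefixes from
# labels and map the remainder to 'bug' / 'non-bug' / None.
BUG_KEYWORDS = {"bug", "bug report", "crash", "defect"}
NONBUG_KEYWORDS = {"suggestion", "enhancement", "improvement", "enhancement request"}
PREFIXES = ("platform=", "type=", "status=")

def strip_prefix(label):
    label = label.strip().lower()
    for prefix in PREFIXES:
        if label.startswith(prefix):
            return label[len(prefix):]
    return label

def reporter_type(labels):
    """Return 'bug', 'non-bug', or None for the labels of one issue."""
    stripped = [strip_prefix(l) for l in labels]
    if any(l in BUG_KEYWORDS for l in stripped):
        return "bug"          # at least one label suggests a bug
    if any(l in NONBUG_KEYWORDS for l in stripped):
        return "non-bug"      # the issue is typed, but not as a bug
    return None               # no type information in the labels

print(reporter_type(["status=pending", "type=defect"]))  # -> 'bug'
```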

Some labels provide other information about issues rather than their reporter types. For example, the labels ‘mac’ and ‘pending’ can be used to indicate the operating system on which an issue occurred and the processing status of an issue, respectively. Such labels have no bearing on the reporter types of their issues.

Only issues with reporter type ‘bug’ or ‘non-bug’ are retained and used for further processing (i.e., model training, as we describe in Section II-D); all other issues are discarded.

2) Issue Title Preprocessing: First, we stem all words in the issue titles using the NLTK WordNet stemmer [17]. Since we compare similarities between sentences using the k-NN algorithm, it is helpful to unify words in different tenses and voices. Then, stop words are filtered out, as is commonly done in natural language processing.

We notice that issue titles often contain names of program entities like Java classes, methods, and file names. Some of these names, e.g., ‘NullPointerException’ and ‘pom.xml’, are obviously useful for bug classification, but most of them, e.g., ‘JsonDecoder’ and ‘derby.jar’, are too specific and not indicative of issue types. We therefore maintain a dictionary of the 20,000 most frequently occurring words and convert all other words to the token ‘<UKN>’. In this way, uncommon names are filtered out. Identifiers in camel case are not split into words in this step.

Finally, we tokenize the issue titles using white space characters as delimiters and remove all punctuation.
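The following Python sketch illustrates this preprocessing pipeline under a few assumptions: NLTK’s WordNetLemmatizer stands in for the WordNet stemmer mentioned above, lowercasing is applied before lemmatization, and the exact ordering of steps is simplified.

```python
# Illustrative sketch of the title preprocessing: stemming/lemmatization,
# stop-word removal, and vocabulary truncation to the 20,000 most frequent words.
# Requires the NLTK corpora 'wordnet' and 'stopwords' to be downloaded.
import re
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

LEMMATIZER = WordNetLemmatizer()
STOP_WORDS = set(stopwords.words("english"))

def preprocess_title(title):
    # Strip punctuation (keeping dots inside file names) and split on whitespace;
    # camel-case identifiers are deliberately kept intact.
    tokens = re.sub(r"[^\w\s.]", " ", title).split()
    tokens = [LEMMATIZER.lemmatize(t.lower()) for t in tokens]
    return [t for t in tokens if t not in STOP_WORDS]

def build_vocabulary(token_lists, size=20000):
    counts = Counter(t for tokens in token_lists for t in tokens)
    return {word for word, _ in counts.most_common(size)}

def replace_rare(tokens, vocab):
    # Uncommon, project-specific names fall back to the '<UKN>' token.
    return [t if t in vocab else "<UKN>" for t in tokens]
```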

II-C Misclassification Correction

Next, we calculate the similarity between issues based on their titles and implement the k-NN algorithm to decide the actual types of retained issues.

1) Doc2Vec Training and Ball Tree Building: We use the Gensim [24] doc2vec implementation to train document vectors. Gensim’s doc2vec is based on the Distributed Memory (DM) model by Quoc Le and Tomas Mikolov [16]. We train a 128-dimensional vector for every issue in our dataset.

In the k-NN algorithm, for every issue we have to search the entire data space to find its nearest neighbors, which is very time-consuming. Therefore, we introduce the ball tree, a data structure for organizing points in a multi-dimensional space [20], which dramatically accelerates nearest-neighbor search. We use the scikit-learn (sklearn) toolkit [22], a free machine learning library for Python, to build the ball tree model.
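A minimal sketch of this step is shown below, assuming the preprocessed titles are available as lists of tokens; the 128-dimensional vector size matches the text above, while the remaining doc2vec hyperparameters (epochs, min_count, etc.) are illustrative.

```python
# Sketch: train a Distributed Memory doc2vec model on issue titles (Gensim),
# then index the resulting 128-dimensional vectors in a scikit-learn BallTree
# for fast nearest-neighbor queries.
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.neighbors import BallTree

def build_issue_index(token_lists):
    docs = [TaggedDocument(tokens, [i]) for i, tokens in enumerate(token_lists)]
    model = Doc2Vec(docs, vector_size=128, dm=1, min_count=2, epochs=20, workers=4)
    vectors = np.vstack([model.dv[i] for i in range(len(docs))])  # Gensim >= 4.0
    return model, vectors, BallTree(vectors)
```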

2) Misclassification Correction: In the traditional k-NN algorithm [2], an object is classified by a majority vote of its neighbors, which means that even if the numbers of neighbors from two different classes are close, the algorithm still gives a result based on this narrow margin. Such a narrow victory may result in uncertainty and contingency, and thus a decrease in overall precision.

Therefore, we enlarge the margin between different classification judgments. We predefine a judgment threshold $p$ (set to 0.8 by default), and an object is classified to the majority class only when the majority’s proportion among the neighbors equals or exceeds $p$. In our misclassification identification process, for every issue in our dataset, we first identify its $k$ nearest neighbors ($k$ is set to 20 by default). Then, if at least $p \cdot k$ of the neighbors are of a different type than the issue, we mark the issue as misclassified. The types of the misclassified issues are corrected at the end of the procedure; correcting the type of an issue means changing it from ‘bug’ to ‘non-bug’ or vice versa.
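Under the assumption that issue vectors and binary reporter types (1 for ‘bug’, 0 for ‘non-bug’) are already available, a sketch of this correction step could look as follows; it reuses the BallTree index from the previous sketch.

```python
# Sketch of the misclassification corrector: flag an issue when at least p*k of
# its k nearest neighbors (by title vector) carry the opposite reporter type,
# then flip its type. Defaults k=20 and p=0.8 follow the text above.
import numpy as np

def correct_misclassifications(tree, vectors, types, k=20, p=0.8):
    types = np.asarray(types)
    corrected = types.copy()
    # Query k+1 neighbors because every point is its own nearest neighbor.
    _, indices = tree.query(vectors, k=k + 1)
    for i, neighbors in enumerate(indices):
        neighbors = neighbors[neighbors != i][:k]
        disagreeing = np.sum(types[neighbors] != types[i])
        if disagreeing >= p * k:
            corrected[i] = 1 - types[i]   # flip 'bug' <-> 'non-bug'
    return corrected
```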

II-D Classifying Issues Using Neural Networks

In this paper, we apply an attention-based bi-directional LSTM (ABLSTM) network. Our model contains five components: input layer, embedding layer, LSTM layer, attention layer, and output layer. Figure 4 illustrates our network’s architecture.

Fig. 4: The architecture of our neural network

1) Embedding Layer: In the embedding layer, every token is mapped to a real-valued vector representation. Suppose $v_i$ is the one-hot vector for the $i$-th word $w_i$ in sentence $S$, where $v_i$ has value 1 at the index of $w_i$ and 0 at all other positions. The embedding matrix $M \in \mathbb{R}^{|V| \times d}$ is a parameter to be learned, and its hyperparameters include the vocabulary size $V$ and the word embedding dimension $d$. We can then translate a word into its word embedding vector $e_i$ by computing the product of the one-hot vector and the embedding matrix:

$$e_i = v_i M \qquad (1)$$

Then, for a sentence $S = \{w_1, w_2, \dots, w_m\}$, we map each word $w_i$ to its embedding vector $e_i$ and feed the resulting sequence into the next layer.

2) Bi-LSTM Layer:

In order to alleviate the gradient vanishing or exploding problem, the LSTM introduces gate and memory mechanisms to keep track of long-term dependencies [12]. An LSTM consists of a memory cell $c_t$ and three gates: an input gate $i_t$, a forget gate $f_t$, and an output gate $o_t$. The final output is calculated based on the state of the memory cell.

Consider time step $t$, and let $h_{t-1}$ and $c_{t-1}$ be the previous hidden and cell states of the LSTM layer. The current hidden state $h_t$ and cell state $c_t$ can then be computed by the following equations:

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \qquad (2)$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \qquad (3)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \qquad (4)$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c) \qquad (5)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \qquad (6)$$
$$h_t = o_t \odot \tanh(c_t) \qquad (7)$$

where $\sigma$ is the sigmoid activation function, $W_i, W_f, W_o, W_c \in \mathbb{R}^{N \times d}$, $U_i, U_f, U_o, U_c \in \mathbb{R}^{N \times N}$, and $b_i, b_f, b_o, b_c \in \mathbb{R}^{N}$ are learnable parameters of the LSTM, and $h_t$ is the output of the LSTM cell. Note that $N$ is the hidden layer size and $d$ is the dimension of the input vector.

However, the standard LSTM network processes its input in a single direction, so it can make use of past context but ignores future information. To address this, the Bidirectional LSTM (BLSTM) introduces another hidden layer of the opposite direction so that the output layer can exploit past and future context concurrently [25, 10].

As shown in Figure 4, two sub-LSTMs pass over the input forward and backward simultaneously. For time step $t$, $\overrightarrow{h_t}$ denotes the output of the forward LSTM, while $\overleftarrow{h_t}$ denotes the output of the backward LSTM obtained by reversing the order of the word sequence. The output at time step $t$ is:

$$h_t = \overrightarrow{h_t} \oplus \overleftarrow{h_t} \qquad (8)$$

where $\oplus$ is the element-wise sum.

3) Attention Layer:

Recently, the attention mechanism has proved effective in many NLP tasks such as neural machine translation [4, 14] and document classification [28]. Specifically, an attention mechanism allows a model to focus on the parts of the input that are more important for predicting the output label. For instance, the word ‘crash’ may be more discriminative in deciding whether an issue is a bug.

We introduce the attention mechanism into our architecture. Let $H$ be the matrix of outputs from the Bi-LSTM, namely $H = [h_1, h_2, \dots, h_m]$, where $m$ is the sentence length. The output of the attention layer is then:

$$A = \tanh(H) \qquad (9)$$
$$\alpha = \mathrm{softmax}(w^{\top} A) \qquad (10)$$
$$r = H \alpha^{\top} \qquad (11)$$

where $w$ is a parameter vector to be learned, $\alpha$ is the vector of attention weights, and $r$ is the attention-weighted sum of the Bi-LSTM outputs. For every sentence, we calculate the final representation used for classification:

$$h^{*} = \tanh(r) \qquad (12)$$

4) Classification: For a sentence $S$, we predict its label $\hat{y}$ by:

$$\hat{p}(y \mid S) = \mathrm{softmax}(W^{(S)} h^{*} + b^{(S)}) \qquad (13)$$
$$\hat{y} = \arg\max_{y} \hat{p}(y \mid S) \qquad (14)$$

where $W^{(S)}$ and $b^{(S)}$ are learnable parameters.
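To make the architecture concrete, the following Keras sketch wires these layers together in the order described above (embedding, bidirectional LSTM merged by element-wise sum, attention, softmax output). It is a sketch rather than the authors’ implementation: layer sizes follow Section III-B, the attention layer follows the reconstruction of Eqs. (9)–(12), and details such as the maximum sentence length and initializers are assumptions.

```python
# Sketch of an attention-based bidirectional LSTM (ABLSTM) issue classifier.
import tensorflow as tf
from tensorflow.keras import layers

class Attention(layers.Layer):
    """Attention over Bi-LSTM outputs, following Eqs. (9)-(12)."""
    def build(self, input_shape):
        # w: the learned parameter vector of Eq. (10).
        self.w = self.add_weight(name="w", shape=(input_shape[-1], 1),
                                 initializer="glorot_uniform")

    def call(self, H):
        A = tf.tanh(H)                                        # Eq. (9)
        alpha = tf.nn.softmax(tf.matmul(A, self.w), axis=1)   # Eq. (10)
        r = tf.reduce_sum(alpha * H, axis=1)                  # Eq. (11)
        return tf.tanh(r)                                     # Eq. (12)

def build_ablstm(vocab_size=20000, embed_dim=256, hidden=256, max_len=50):
    inputs = layers.Input(shape=(max_len,), dtype="int32")
    x = layers.Embedding(vocab_size, embed_dim)(inputs)       # Eq. (1)
    # merge_mode="sum" realizes the element-wise sum of Eq. (8).
    x = layers.Bidirectional(layers.LSTM(hidden, return_sequences=True),
                             merge_mode="sum")(x)
    x = Attention()(x)
    x = layers.Dropout(0.5)(x)                                # dropout rate from Sec. III-B
    outputs = layers.Dense(2, activation="softmax")(x)        # Eqs. (13)-(14)
    return tf.keras.Model(inputs, outputs)
```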

III Experimental Design

We conducted experiments on our approach to evaluate its effectiveness and efficiency. In this section, we describe the design of the experiments.

We aim to address the following research questions:

  • RQ1: Does our model predict better when the misclassification corrector is enabled?

  • RQ2: Can our approach achieve better performance than the baseline methods and other similar approaches in this field?

III-A Dataset Detail

We downloaded all of the issues in the Apache, JBoss, and Spring ITSs (as of March 2019), all of which are based on Jira. Note that a Jira ITS is typically configured to support multiple projects managed by an organization. Specifically, the Apache Jira ITS tracks the issues of 620 projects, including Zookeeper, Groovy, Hadoop, Maven, Struts, and many other well-known Apache projects; 418 projects, including those for many JBoss components like JBoss Web, JBoss Reliance, and Netty, are hosted on the JBoss Jira ITS; and the Spring Jira ITS tracks the issues of 95 projects such as Spring Framework and Spring IDE.

We also crawled 23 well-known projects on GitHub, as described in Section II-A, to collect all their issues. We found it harder to collect issues from GitHub than from Jira because only a few GitHub projects are well-labelled. We did not automatically crawl all top-ranked projects on GitHub because poorly labelled issues in many projects could harm the discriminative power of our model. The GitHub projects we crawled are: AxonFramework, TypeScript, visualfsharp, vscode, OpenRefine, PowerShell, pulsar, deeplearning4j, che, elasticsearch, guava, google-cloud-java, hazelcast, javaparser, junit5, lettuce-core, micronaut-core, pinpoint, realm-java, spring-boot, spring-framework, spring-security, vavr. Table I reports basic statistics of the issues we collected from the various ITSs.

Data Source | Labelled Issues | Projects | Percent of Bugs
Jira (Apache) | 815,338 | 620 | 54.2%
Jira (JBoss) | 329,552 | 418 | 49.3%
Jira (Spring) | 65,446 | 95 | 39.7%
GitHub | 85,833 | 23 | 44.7%
Total | 1,296,169 | 1,156 | 51.6%
TABLE I: Statistics of the issues from various ITSs.

III-B Model Training and Testing Detail

We based our model on TensorFlow [1], an open source framework for building and training neural networks. TensorFlow is user-friendly, robust, and has proven reliable in deep learning software development.

Since there is no other RNN-based issue classification work, we borrowed and combined training settings from best practices in RNN-based text classification in other areas and fine-tuned the hyperparameters. The size of the word embeddings is 256; the size of the hidden layers is 256; the batch size is 1024. To prevent over-fitting, we used dropout [26] with a dropout rate of 0.5. The model is validated on prediction accuracy after every epoch and saved after every epoch. The maximum number of epochs is 20.
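Continuing the ABLSTM sketch from Section II-D, the training configuration described above could be expressed as follows. The optimizer, loss function, and checkpoint paths are not specified in the paper and are therefore assumptions; the train/validation arrays stand for integer-encoded titles and binary labels prepared as in Section II-B.

```python
# Sketch of the training setup: batch size 1024, dropout 0.5 (inside the model),
# per-epoch validation and checkpointing, at most 20 epochs.
import tensorflow as tf

model = build_ablstm()                      # from the earlier sketch
model.compile(optimizer="adam",             # optimizer assumed, not stated in the paper
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "ablstm-epoch{epoch:02d}.h5", save_freq="epoch")  # save the model every epoch

model.fit(train_titles, train_labels,
          validation_data=(val_titles, val_labels),   # validated every epoch
          batch_size=1024, epochs=20,
          callbacks=[checkpoint])
```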

In the testing phase, we loaded the three models with the highest validation accuracy and obtained their test accuracy. We then selected the model with the highest test accuracy for evaluation. Moreover, we collected the prediction for every issue in the test dataset and calculated precision, recall, and F-measure. Ten-fold cross-validation was also employed in the testing phase.

III-C Evaluation Metrics: Precision, Recall, and F-measure

Precision is defined as the number of true positive results divided by the total number of positive results predicted by a classifier, while recall is the number of true positive results divided by the number of all results that should have been labelled positive. That is, recall measures ‘how complete the results are’ and precision measures ‘how error-free the results are’. The F-measure (also known as F-score) combines the precision and recall measures and is calculated as their harmonic mean:

$$F = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (15)$$

To evaluate the overall performance, we also calculate the weighted average F-measure over both classes, as was done in [9]:

$$F_{avg} = \frac{N_{bug} \times F_{bug} + N_{nonbug} \times F_{nonbug}}{N_{bug} + N_{nonbug}} \qquad (16)$$

where $F_{bug}$ and $F_{nonbug}$ denote the F-measure for bugs and non-bug issues, respectively, while $N_{bug}$ and $N_{nonbug}$ denote the numbers of bugs and non-bug issues, respectively.
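A small sketch of how these metrics can be computed with scikit-learn is given below; the label encoding (1 for ‘bug’, 0 for ‘non-bug’) is an assumption.

```python
# Sketch: per-class precision/recall/F-measure plus the support-weighted
# average F-measure of Eq. (16).
from sklearn.metrics import precision_recall_fscore_support

def weighted_f_measure(y_true, y_pred):
    # Scores are returned in the order of `labels`: index 0 = bug, index 1 = non-bug.
    precision, recall, f1, support = precision_recall_fscore_support(
        y_true, y_pred, labels=[1, 0])
    f_bug, f_nonbug = f1[0], f1[1]
    n_bug, n_nonbug = support[0], support[1]
    # Eq. (16): weight each class's F-measure by its number of issues.
    return (n_bug * f_bug + n_nonbug * f_nonbug) / (n_bug + n_nonbug)
```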

III-D Evaluation for Misclassification Corrector

In order to evaluate the effectiveness of our misclassification corrector, we randomly selected 3,000 labelled issues from our dataset and manually classified them into two categories: misclassified or correctly classified. We recruited three postgraduate students for the task. All of them major in software engineering and are experienced in open source software development. They were asked to first read and classify each selected issue independently, and then discuss the issues to which they had assigned different types. An issue is marked as misclassified only if all the students reach a consensus that the issue has a type different from its reporter type.

For the revised k-NN algorithm used in our misclassification corrector, two parameters need tuning: the number of nearest neighbors $k$ and the judgment threshold $p$. In a pilot study, we found that, because our dataset is large enough, the model performs similarly when $k$ varies in the interval [15, 30]; when $k$ is less than 15 or more than 30, the performance decreases significantly. Therefore, we chose a medium value, $k = 20$, for a good balance between cost and effectiveness. For the judgment threshold $p$, we conducted our experiments with $p$ = 1.0, 0.95, …, 0.5, which means that the number of nearest neighbors of the different type should be no smaller than 20, 19, …, 10 for an issue to be marked as misclassified. We use $M_p$ to denote the set of issues that are regarded as misclassified by our corrector with judgment threshold $p$.

Let $D$ be the set of 3,000 issues we manually examined and $D_{mis}$ the set of misclassified issues discovered in our manual analysis. We can then estimate the precision and recall of our misclassification corrector, w.r.t. a threshold $p$, using the following formulas:

$$\mathrm{Precision}(p) = \frac{|M_p \cap D_{mis}|}{|M_p \cap D|} \qquad (17)$$
$$\mathrm{Recall}(p) = \frac{|M_p \cap D_{mis}|}{|D_{mis}|} \qquad (18)$$
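Expressed as set operations over issue identifiers, this estimation can be sketched as follows; the function and argument names are ours, not from the paper.

```python
# Sketch of Eqs. (17)-(18): estimate the corrector's precision and recall on the
# manually examined sample D, where D_mis is the manually identified misclassified
# subset and M_p the set flagged by the corrector at threshold p.
def corrector_precision_recall(M_p, D, D_mis):
    M_p, D, D_mis = set(M_p), set(D), set(D_mis)
    flagged_in_sample = M_p & D            # flagged issues we have manual labels for
    true_hits = flagged_in_sample & D_mis  # flagged issues that are truly misclassified
    precision = len(true_hits) / len(flagged_in_sample) if flagged_in_sample else 0.0
    recall = len(true_hits) / len(D_mis) if D_mis else 0.0
    return precision, recall
```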

To measure to what extent the misclassification corrector helps improve the overall performance of bug classification, we first train a neural network model (described in Section II-D) with misclassification correction disabled and use the resulting model as the control group. Then, for each judgment threshold $p$, we train another neural network model with misclassification correction enabled. We use the same training, validation, and test sets in all cases (the misclassification corrector is applied to all three sets). Finally, we calculate the precision, recall, and F-score of the final classification.

III-E Baseline Selection and Implementation

The bug classifier we propose aims to work effectively across multiple ITSs like Jira and GitHub. No existing approach, however, was designed with such cross-platform applicability in mind, so none can be used directly as a baseline to evaluate our approach: the approaches proposed by Antoniol et al. [3] and Pingclasai et al. [23] utilize information like priority and severity, which issues on the GitHub ITS do not contain, while the approach developed by Fan et al. [9] uses detailed developer information that is only available for issues in the GitHub ITS.

Therefore, we implement our own baseline bug classifiers using traditional machine learning algorithms: Logistic Regression (LR), Support Vector Machine (SVM), and k-Nearest Neighbor (k-NN). We intentionally include k-NN because it is also used in our misclassification corrector; note, however, that the two uses of k-NN are driven by different issue features and serve different purposes. All the baseline approaches share the same data collection process as described in Section II.

We use not only the whole dataset, but also partial datasets specific to the GitHub, Apache, JBoss, and Spring ITSs, to evaluate the performance of our approach and the baseline methods. We adopt the same experimental settings as used in [9].
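A sketch of how such baselines can be assembled with scikit-learn is shown below; since the paper does not detail the baselines’ feature representation or hyperparameters, TF-IDF features over the preprocessed titles and the listed parameter values are assumptions.

```python
# Sketch of the traditional-ML baselines (LR, SVM, k-NN) over issue titles.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def build_baselines():
    return {
        "LR": make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000)),
        "SVM": make_pipeline(TfidfVectorizer(), LinearSVC()),
        "kNN": make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=20)),
    }

# Each pipeline is trained with .fit(titles, labels) and scored with the same
# weighted F-measure used for the main model.
```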

IV Results

IV-A RQ1: Effect of Misclassification Corrector

In this subsection, we first report the performance of the misclassification corrector on its own, and then discuss its impact on the whole model.

1) Performance of Misclassification Corrector with Variable Parameters:

Table II reports the precision, recall, and F-measure of our misclassification corrector. In general, we reasonably assume that most issue types assigned by reporters are correct; therefore, we try not to change those types unless there is strong evidence that they are wrong. We also estimated the correction rate in each case, which is the number of corrected issues divided by the dataset size. The experiments indicate that, by setting a judgment threshold larger than 0.5, the corrector successfully improves the precision of detection without affecting a large proportion of the original data. As we can see from the table, when $p$ increases, meaning our corrector selects and corrects items more ‘strictly’, the precision increases while the recall and the correction rate decrease.

p | 1 | 0.95 | 0.9 | 0.85 | 0.8 | 0.75 | 0.7 | 0.65 | 0.6 | 0.55 | 0.5
Precision | 0.88 | 0.873 | 0.85 | 0.837 | 0.789 | 0.739 | 0.672 | 0.59 | 0.516 | 0.453 | 0.4
Recall | 0.018 | 0.06 | 0.127 | 0.218 | 0.317 | 0.433 | 0.546 | 0.643 | 0.732 | 0.816 | 0.889
F-measure | 0.035 | 0.112 | 0.221 | 0.346 | 0.452 | 0.546 | 0.603 | 0.615 | 0.605 | 0.583 | 0.552
C-rate | 0.003 | 0.01 | 0.021 | 0.036 | 0.056 | 0.082 | 0.114 | 0.152 | 0.199 | 0.252 | 0.312
TABLE II: Prediction results of the misclassification corrector under different judgment thresholds p (C-rate: correction rate)
Method | Class | GitHub Prec./Rec./F-M | Apache Prec./Rec./F-M | JBoss Prec./Rec./F-M | Spring Prec./Rec./F-M | All Prec./Rec./F-M
LR | Bug | 0.693/0.607/0.647 | 0.721/0.765/0.743 | 0.743/0.73/0.736 | 0.732/0.594/0.656 | 0.721/0.74/0.73
LR | Nonbug | 0.708/0.78/0.742 | 0.7/0.649/0.674 | 0.747/0.758/0.752 | 0.711/0.863/0.815 | 0.719/0.699/0.709
LR | Average | 0.701/0.702/0.7 | 0.712/0.712/0.711 | 0.745/0.745/0.745 | 0.756/0.759/0.753 | 0.72/0.72/0.72
SVM | Bug | 0.714/0.604/0.655 | 0.762/0.75/0.756 | 0.793/0.729/0.759 | 0.749/0.592/0.662 | 0.771/0.723/0.747
SVM | Nonbug | 0.714/0.804/0.756 | 0.71/0.722/0.716 | 0.756/0.815/0.784 | 0.711/0.874/0.819 | 0.72/0.769/0.744
SVM | Average | 0.714/0.714/0.711 | 0.738/0.737/0.738 | 0.774/0.772/0.772 | 0.762/0.764/0.758 | 0.747/0.745/0.745
kNN | Bug | 0.64/0.45/0.528 | 0.726/0.614/0.665 | 0.761/0.55/0.639 | 0.686/0.446/0.54 | 0.722/0.564/0.633
kNN | Nonbug | 0.644/0.797/0.712 | 0.611/0.723/0.662 | 0.657/0.833/0.735 | 0.705/0.867/0.777 | 0.625/0.77/0.69
kNN | Average | 0.642/0.643/0.63 | 0.673/0.664/0.664 | 0.708/0.694/0.687 | 0.697/0.7/0.684 | 0.675/0.664/0.661
Our work | Bug | 0.779/0.826/0.802 | 0.855/0.864/0.859 | 0.859/0.877/0.868 | 0.808/0.823/0.816 | 0.842/0.875/0.858
Our work | Nonbug | 0.844/0.801/0.822 | 0.850/0.840/0.845 | 0.887/0.870/0.879 | 0.883/0.872/0.878 | 0.871/0.838/0.854
Our work | Average | 0.814/0.812/0.813 | 0.853/0.853/0.853 | 0.874/0.874/0.874 | 0.853/0.853/0.853 | 0.857/0.856/0.856
TABLE III: Detailed experimental results for RQ2 (Prec.: precision, Rec.: recall, F-M: F-measure)

2) Impact of Misclassification Corrector: Figure 5 shows how the average F-measure changes with the judgment threshold $p$. The orange line shows the results of the control group, i.e., the model trained without misclassification correction, whose average F-measure is 0.843. In other words, without a misclassification corrector, we can achieve an F-measure of 0.843. The blue line represents the performance curve of our model with the assistance of the misclassification corrector. The figure shows that our approach with the misclassification corrector outperforms the control group when $p$ is equal to or greater than 0.65; when $p$ is less than 0.65, performance decreases quickly. Moreover, $p = 0.8$ is the optimal threshold: when $p$ equals 0.8, we achieve an F-measure of 0.856. Considering possible sampling error, we consider [0.75, 0.85] to be a reasonable interval for $p$.

Fig. 5: F-score curve with respect to the judgment threshold $p$

IV-B RQ2: Comparison with Baseline

1) Comparison with our baseline methods: Figure 6 compares our results with the baseline methods on each sub-dataset and on the entire dataset, and Table III provides the details of the experiment. From the table, we can see that our approach significantly outperforms all baselines on all datasets in terms of weighted average F-measure, precision, and recall.

Fig. 6: Comparison with baseline works

Figure 6 shows that our classifiers are much better than the traditional methods, and that SVM performs slightly better than k-NN and LR, which echoes Fan et al.’s investigation [9]. Specifically, compared with SVM, the best baseline method, our work improves the weighted average F-measure from 71.1% to 81.3% for GitHub, from 73.8% to 85.3% for Apache, from 77.2% to 87.4% for JBoss, and from 75.8% to 85.3% for Spring. For the whole dataset, we increase the performance from 74.5% to 85.6%. In addition, whether in bug detection or in non-bug categorization, our model surpasses all of the baseline methods in both precision and recall by a large margin.

Besides, we calculated the F-measure of our approach without the corrector and plotted the results in Figure 6 as well. Under all circumstances, the misclassification corrector improves the model’s performance, which reinforces our answer to RQ1.

We noticed that both the baseline methods and our approach perform noticeably worse on the GitHub data than on the Jira data. The reason may be that the number of projects in our GitHub dataset is small, so the dictionary relies heavily on the terminology of certain projects. The unbalanced distribution within the GitHub data may also contribute to its lower quality.

2) Comparison with the state-of-the-art methods: The most recent comparable works in this field are those of Antoniol et al. [3], Fan et al. [9], and Pingclasai et al. [23]. We did not replicate their work because our dataset is cross-ITS, which means we only used data fields shared by all ITSs: labels and textual descriptions. Unfortunately, most previous works are ITS-specific (some discussion is presented in Section VI) and thus cannot be replicated on our dataset; in other words, these ITS-specific approaches require extra data fields that are absent from our dataset.

Also, most previous research was based on relatively small datasets, whereas our approach was designed to be trained on a large number of issues. For instance, Zhou et al. [29] trained on a dataset of only about 3k bug reports, involving 5 open source projects in Bugzilla and Mantis. Another example is the dataset provided by Herzig et al. [11] (https://www.st.cs.uni-saarland.de/softevo/bugclassify/) in 2013, which includes only 7,401 issues from 5 Java projects.

Antoniol et al. [3] distinguish bugs from other kinds of issues, building classifiers that make between 77% and 82% correct decisions, while the precision of our classifier reaches 84.2% for bug prediction and 87.1% for non-bug prediction, together with higher recall. Fan et al. [9] classified issues in GitHub; as they reported, their approach improved on traditional SVM from 75% to 78% in F-measure, while our approach trained on the GitHub dataset improves the F-measure from 71.1% (for SVM) to 81.4%.

Pingclasai et al. [23] used a topic modeling technique to identify bug reports, involving about 5k issues from 3 open source projects. Instead of the weighted average F-measure, they used the micro F-measure to evaluate their model: in micro-averaging, the individual true positives, false positives, and false negatives of the bug and non-bug classes are summed up before the statistics are computed. Their micro F-measure lies between 0.65 and 0.82 for different projects. We also computed the micro F-measure, and our approach reaches a micro F-measure of 85.6% on the overall dataset.

Therefore, although we did not use the same datasets and data preprocessing procedures as our predecessors, the results of our experiments strongly suggest that the performance of our model is much better than that of other recent works in this field.

V Threats to Validity

We have identified several major threats to validity. The first threat concerns the human annotation of misclassification. Different reviewers may have different standards or opinions about whether an issue is misclassified, and other factors also influence misclassification annotation. For instance, Herzig et al. [11], in 2013, found 33.8% of all bug reports to be misclassified; in the same year, Wright et al. [27] estimated that there are between 499 and 587 misclassified bugs in the MySQL bug database, which includes 59,447 bug reports in total. This shows the huge gap between different studies in identifying misclassification. From our manual labelling process, we roughly estimate that about 10%-15% of the issues in our dataset are misclassified.

The second threat concerns the dataset we used. Because of the poor average quality of issue labels, we selected 23 projects from GitHub and collected over 170k issues, which may be too small compared to the size of the Jira data we collected. In addition, other traditional ITSs, such as Bugzilla and Redmine, are not included in our dataset. Although our approach is designed to handle issues regardless of the platform they come from, further adjustment and evaluation are needed for other ITSs.

VI Related Work

For space reasons, this section reviews only the research on issue classification that is most relevant to this work.

Antoniol et al. [3] were among the first to study the issue classification problem. In their approach, features were extracted from issue titles, descriptions, and related discussions, and traditional machine learning algorithms, such as the Naïve Bayes classifier and decision trees, were employed to classify issues. For a given issue, however, a considerable amount of discussion may take place several days after the issue is reported, which may hurt the effectiveness of the approach when prompt classification of issues is expected. Pingclasai et al. [23] apply topic modeling to corpora of bug reports together with traditional machine learning techniques, including the naive Bayes classifier, logistic regression, and decision trees. Similar to [3], they extract three kinds of textual information: title, description, and discussion. Their classification performance, measured in F-measure, varies between 0.66-0.76, 0.65-0.77, and 0.71-0.82 for the HTTPClient, Jackrabbit, and Lucene projects, respectively.

Fan et al. [9] proposed an approach to classifying issues in GitHub. In their approach, features are extracted both from the textual information of issues (including, e.g., issue title and description) and from personal information of issue reporters, on the assumption that the reporters’ background may influence classification; for example, they argue that skilled developers are more likely to report bug-prone issues and provide more useful bug reports. The median weighted average F-score of their approach was around 0.78, while the median F-score of SVM was about 0.75, suggesting that ITS-specific data can be utilized to achieve better classification results. In comparison, our approach uses data that are easier to collect and achieves a better F-measure.

Compared to other issue classification works, the work of Zhou et al. [29] is special because they did not try to predict the type of a raw issue; instead, they aimed to answer whether a given bug-labelled issue is a corrective bug report or merely documents developers’ other concerns. They utilized structural information of issues, including priority and severity, which is available in most ITSs. However, in lightweight ITSs like the one used by GitHub, issues do not necessarily carry such information.

VII Conclusion

In this paper, we proposed a novel approach to automatically distinguishing bug and non-bug issues in Issue Tracking Systems. Our strategy is, in a nutshell, to 1) collect a large amount of data from different ITSs, 2) preprocess the data and correct misclassified issues that may harm the model’s performance, and 3) train an attention-based bi-directional LSTM network to label issues with ‘bug’ or ‘non-bug’ tags. We carried out an empirical study which shows that our approach outperforms the state-of-the-art approaches and achieves better results on text classification evaluation metrics. Our approach is also easy to apply across different ITSs, since it only requires issue titles as input.

References

  • [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. (2016) Tensorflow: large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467. Cited by: §III-B.
  • [2] N. S. Altman (1992) An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician 46 (3), pp. 175–185. Cited by: §II-C.
  • [3] G. Antoniol, K. Ayari, M. Di Penta, F. Khomh, and Y. Guéhéneuc (2008) Is it a bug or an enhancement?: a text-based approach to classify change requests.. In CASCON, Vol. 8, pp. 304–318. Cited by: §I, §III-E, §IV-B, §IV-B, §VI.
  • [4] D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §II-D.
  • [5] D. Bertram, A. Voida, S. Greenberg, and R. Walker (2010) Communication, collaboration, and bugs: the social nature of issue tracking in small, collocated teams. In Proceedings of the 2010 ACM conference on Computer supported cooperative work, pp. 291–300. Cited by: §I.
  • [6] N. Bettenburg, S. Just, A. Schröter, C. Weiss, R. Premraj, and T. Zimmermann (2008) What makes a good bug report?. In Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of software engineering, pp. 308–318. Cited by: §I.
  • [7] T. F. Bissyandé, D. Lo, L. Jiang, L. Réveillere, J. Klein, and Y. Le Traon (2013) Got issues? who cares about it? a large scale investigation of issue trackers from github. In 2013 IEEE 24th international symposium on software reliability engineering (ISSRE), pp. 188–197. Cited by: §I.
  • [8] S. Breu, R. Premraj, J. Sillito, and T. Zimmermann (2010) Information needs in bug reports: improving cooperation between developers and users. In Proceedings of the 2010 ACM conference on Computer supported cooperative work, pp. 301–310. Cited by: §I.
  • [9] Q. Fan, Y. Yu, G. Yin, T. Wang, and H. Wang (2017) Where is the road for issue reports classification based on text mining?. In 2017 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), pp. 121–130. Cited by: §I, §III-C, §III-E, §III-E, §IV-B, §IV-B, §IV-B, §VI.
  • [10] A. Graves, S. Fernández, and J. Schmidhuber (2005) Bidirectional lstm networks for improved phoneme classification and recognition. In International Conference on Artificial Neural Networks, pp. 799–804. Cited by: §II-D.
  • [11] K. Herzig, S. Just, and A. Zeller (2013) It’s not a bug, it’s a feature: how misclassification impacts bug prediction. In Proceedings of the 2013 international conference on software engineering, pp. 392–401. Cited by: §I, §IV-B, §V.
  • [12] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §II-D.
  • [13] J. Janák (2009) Issue tracking systems. Ph.D. Thesis, Masarykova univerzita, Fakulta informatiky. Cited by: §I.
  • [14] S. Jiang, A. Armaly, and C. McMillan (2017) Automatically generating commit messages from diffs using neural machine translation. In Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering, pp. 135–146. Cited by: §II-D.
  • [15] P. S. Kochhar, T. B. Le, and D. Lo (2014) It’s not a bug, it’s a feature: does misclassification affect bug localization?. In Proceedings of the 11th Working Conference on Mining Software Repositories, pp. 296–299. Cited by: §I.
  • [16] Q. Le and T. Mikolov (2014) Distributed representations of sentences and documents. In International conference on machine learning, pp. 1188–1196. Cited by: §II-C.
  • [17] E. Loper and S. Bird (2002) NLTK: the natural language toolkit. arXiv preprint cs/0205028. Cited by: §II-B.
  • [18] M. Maguire and N. Bevan (2002) User requirements analysis. In IFIP World Computer Congress, TC 13, pp. 133–148. Cited by: §I.
  • [19] T. Merten, B. Mager, P. Hübner, T. Quirchmayr, B. Paech, and S. Bürsner (2015) Requirements communication in issue tracking systems in four open-source projects.. In REFSQ Workshops, pp. 114–125. Cited by: §I.
  • [20] S. M. Omohundro (1989) Five balltree construction algorithms. International Computer Science Institute Berkeley. Cited by: §II-C.
  • [21] M. Ortu, G. Destefanis, M. Kassab, and M. Marchesi (2015) Measuring and understanding the effectiveness of jira developers communities. In Proceedings of the Sixth International Workshop on Emerging Trends in Software Metrics, pp. 3–10. Cited by: §I.
  • [22] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. (2011) Scikit-learn: machine learning in python. Journal of machine learning research 12 (Oct), pp. 2825–2830. Cited by: §II-C.
  • [23] N. Pingclasai, H. Hata, and K. Matsumoto (2013) Classifying bug reports to bugs and other requests using topic modeling. In 2013 20th Asia-Pacific Software Engineering Conference (APSEC), Vol. 2, pp. 13–18. Cited by: §I, §III-E, §IV-B, §IV-B, §VI.
  • [24] R. Rehurek and P. Sojka (2010) Software framework for topic modelling with large corpora. In In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, Cited by: §II-C.
  • [25] M. Schuster and K. K. Paliwal (1997) Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45 (11), pp. 2673–2681. Cited by: §II-D.
  • [26] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §III-B.
  • [27] J. L. Wright, J. W. Larsen, and M. McQueen (2013) Estimating software vulnerabilities: a case study based on the misclassification of bugs in mysql server. In 2013 International Conference on Availability, Reliability and Security, pp. 72–81. Cited by: §I, §V.
  • [28] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy (2016) Hierarchical attention networks for document classification. In Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies, pp. 1480–1489. Cited by: §II-D.
  • [29] Y. Zhou, Y. Tong, R. Gu, and H. Gall (2016) Combining text mining and data mining for bug report classification. Journal of Software: Evolution and Process 28 (3), pp. 150–176. Cited by: §IV-B, §VI.