Identifying non-natural language artifacts in bug reports

by Thomas Hirsch, et al.
TU Graz

Bug reports are a popular target for natural language processing (NLP). However, bug reports often contain artifacts such as code snippets, log outputs and stack traces. These artifacts not only inflate the bug reports with noise, but often constitute a real problem for the NLP approach at hand and have to be removed. In this paper, we present a machine learning based approach, implemented in Python, that classifies content into natural language and artifacts at line level. We show how data from GitHub issue trackers can be used for automated training set generation, and present a custom preprocessing approach for bug reports. Our model scores 0.95 ROC-AUC and 0.93 F1 against our manually annotated validation set, and classifies 10k lines in 0.72 seconds. We cross evaluated our model against a foreign dataset and a foreign R model for the same task. The Python implementation of our model and our datasets are made publicly available under an open source license.




I Introduction

Natural language processing (NLP) approaches analyzing documents originating from software development processes are an increasingly popular and promising research field. In particular, bug reports are very prominent targets for NLP approaches [1]. NLP applications on the basis of textual bug reports are used, for example, to categorize the impact and root causes of bug reports [2], to classify bugs automatically according to the ODC classification scheme [3], to assign programmers to bug reports [4, 5], to locate the source code that needs to be changed to fix a bug [6], to label the severity of a bug [7, 8], to prioritize bugs [9], to detect duplicates [10], to distinguish bug reports from other issues [11], and to find security related bug reports [12].

However, in contrast to classic NLP problems, bug reports are often cluttered with non-natural language artifacts such as code snippets, stack traces, log outputs, config files, and file listings. While some approaches, most notably information retrieval (IR) approaches [13], detect and leverage specific artifacts, other NLP tasks, e.g., language detection and automated root cause classification, consider such artifacts as mere noise that may even have negative effects on the task’s performance. The size of such artifacts can also pose a problem for some approaches and technologies, as especially unshortened stack traces, core dumps, and thread dumps often lead to unwanted growth of the vocabulary and excessive runtimes. For example, we found bug reports as big as 200 kB of uncompressed text when we created our dataset.

Issue trackers usually provide formatting mechanisms, such as Markdown, to allow users to format their issue comments. One of the simplest forms of artifact detection would be to parse issue descriptions for fragments formatted as code blocks. Unfortunately, not all users/reporters use them properly. Therefore, relying on such formatting alone is not a viable option for artifact detection.

Since certain artifacts add valuable information to some approaches (e.g., stack traces for automated fault localization), researchers developed numerous techniques for identifying and parsing specific artifact types. Several researchers [14, 15, 16] manually built sets of regular expressions after they had investigated the underlying dataset. Although labor intensive, this approach works reasonably well for datasets originating from a small number of software projects.

However, regular expressions have to be adapted when using them on other software projects, and manually annotating data is time-consuming, as pointed out by Mäntylä et al. [17]. Manually created rule sets do not scale to bigger datasets due to the size and number of regular expressions necessary to account for different logging frameworks, code style guidelines, build systems, configuration file formats, underlying OSs, and IDEs. These scalability and portability issues led researchers to the application of machine learning (ML) techniques [17, 18]. While ML approaches circumvent the manual creation of rules, they introduce the need for manually annotated training sets.

In this work, we propose and discuss a pragmatic approach that minimizes the manual labor required when creating an ML classifier. Our approach does not require extensive knowledge about the artifacts that are supposed to be removed, while providing good classification performance at a low computational cost once trained. To achieve this, we tackle the aforementioned problems from two sides: (1) by automating the training set creation process, and (2) by using a custom preprocessing step. The automated training set creation process leverages project documentation Markdown files, and GitHub Markdown annotated bug reports. In our custom preprocessing step, whitespace and special characters alongside other features are tokenized and therefore become parts of the feature vectors used by the ML-based classifier algorithm.

We use a standard Python ML library as basis of our implementation, and we provide an easy to use pretrained model that can be used in preprocessing for various NLP tasks concerning textual bug reports. We train and validate our approach on 101 Java software projects hosted on GitHub. Our model’s performance on a manually created validation set is 0.95 ROC-AUC and 0.93 F1. The model can be trained in less than two minutes, and classifies 10k lines in 0.72 seconds. We test our approach on a foreign dataset and discuss its portability and limitations.

This work is structured as follows: We discuss related work in Section II and define the problem in Section III. Section IV contains a detailed description of our approach. In Section V, we present our results, discussion of these results, limitations, and threats to validity. We conclude this paper in Section VI.

II Related work

InfoZilla [19] detects stack traces, source code, patches, and enumerations in bug reports using regular expressions, island parsing and heuristics. Bacchelli et al. [20] used island parsing to extract structured data from natural language documents such as emails. Later on, they proposed to use supervised ML to classify the content of emails line-by-line into natural language, junk, code, patch and stack trace. To train and test the classifier, they manually classified the content of nearly 1500 emails from four software systems [18].

Rigby and Robillard [21] developed an automated code element resolution tool (ASE) that is based on island parsing. ASE automatically extracts code elements from natural language documents such as StackOverflow posts and determines which elements are important for the document. Ye et al. [22] use a semi-supervised ML approach to detect API mentions in text written on social platforms.

Ponzanelli et al. [23] provide a parsed dataset from Stack Overflow that contains heterogeneous abstract syntax trees for Java code, stack traces, XML/HTML elements and JSON fragments. They used island parsing to identify these fragments.

Calefato et al. [24] reported that they tried to use regular expressions to remove code snippets from email text, but found that this approach did not scale well enough, in particular when several programming languages are used. This highlights the need for more generic approaches to artifact detection, such as machine learning.

The work that is closest to ours is the Natural Language or Not (NLoN) Package [17]. This R package classifies text lines into text or artifact by using eleven language features and character tri-grams.

III Problem definition

While intuition tells us that the line between natural language and non-natural language should be clear-cut, closer investigation reveals the complexity of this problem and a certain amount of overlap of the two domains. Examples of such border cases are code comments and bug report templates: Comments contained in code snippets are natural language texts. However, they may not have been authored by the bug reporter. Bug report templates include headers, questions, and other text. While these are natural language, they are again not written by the bug reporter and can be considered automatically generated text. Migration from other bug tracking systems often introduces generated text portions. These are also natural language, but their origin is artificial. As far as we are aware, there exists no formal definition, established guideline, or agreement within the research community working with textual bug reports on what is to be considered natural language when dealing with bug reports.

For this work, we define artifacts and natural language portions of bug reports as follows: We consider text that was typed by the bug reporter as natural language, and content that was copy-pasted from an IDE, terminal, or other tool to be an artifact. Automatically generated natural language text of the bug tracker, template, or migration processes are considered natural language. Comments in pasted code snippets, elaborate natural language logging messages and error messages are considered artifacts. Further, we consider standalone URLs and Markdown links as artifacts.

Occurrences of non-natural language portions in a natural language sentence are limited mostly to variable names, class names, and short formulas or mathematical equations. Removing such occurrences may render a natural language sentence syntactically and semantically incorrect and unreadable for a human. We therefore consider a line of natural language text interwoven with non-natural language portions as natural language. Log outputs or code snippets always start on a new line. Thus, we detect artifacts on a line by line level and describe the task as a binary classification problem similar to Mäntylä et al. [17].

IV Approach

First, we explain in Section IV-A what features are used in our machine learning approach. In Section IV-B, we discuss the creation of the training, test and validation sets, before we discuss the setup of the ML approach in Section IV-C.

IV-A Feature Selection

For humans, separating artifacts from natural language is often possible without actually reading the text. Formatting and structure play a significant role in a human’s capability to classify a given text segment very fast. For example, indentation of code snippets provides a very good indicator. Therefore, we include representations of whitespace in the feature vectors used by the ML classifier.

A closer look at artifacts further reveals that frequency and position of special characters also carry a significant amount of information for our task. While the most common special characters in English text are ‘,’ and ‘.’, the characters ‘<’, ‘>’, and ‘/’ are probably the most common in XML. For this reason, we tokenize special characters to include them in the feature vectors. The full replacement table can be found in the online appendix; an excerpt of this table is shown in Table I.


| Character / Regex | Token |
|---|---|
| ␣ ␣ (two whitespaces) | Jdoublespace |
| \t | Jtabulator |
| ( | Jroundbracketopen |
| { | Jcurlybracketopen |
| ; | Jsemicolon |
| ([A-Z]?[a-z0-9]+)([A-Z][a-z0-9]*)+ | Jcamelcased |
| [0-9]+ | Jnumber |

TABLE I: Excerpt of introduced tokens

Regarding the position of special characters, lines of English text will often end with ‘.’, ‘?’, and ‘!’, while lines of Java code will often end with ‘{’, ‘}’ or ‘;’. A bag of words (unigram) approach is not suitable to encapsulate such position information. Thus, we add tokens that represent the beginning and end of a line, and employ tri-gram vectorization.
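The token replacement described in this section can be illustrated with a minimal sketch. It uses only the rules from the excerpt in Table I. The order in which the rules are tried and the names of the line boundary tokens (`Jlinestart`, `Jlineend`) are assumptions made for illustration and are not taken from the published implementation.

```python
import re

# Rules from the Table I excerpt. The order is an assumption of this sketch:
# camel case is tried before numbers so that "item2Count" stays one token.
RULES = [
    (re.compile(r"([A-Z]?[a-z0-9]+)([A-Z][a-z0-9]*)+"), "Jcamelcased"),
    (re.compile(r"[0-9]+"), "Jnumber"),
    (re.compile(r"  "), "Jdoublespace"),
    (re.compile(r"\t"), "Jtabulator"),
    (re.compile(r"\("), "Jroundbracketopen"),
    (re.compile(r"\{"), "Jcurlybracketopen"),
    (re.compile(r";"), "Jsemicolon"),
]

# Hypothetical names for the tokens that mark the beginning and end of a line.
LINE_START, LINE_END = "Jlinestart", "Jlineend"


def tokenize_line(line: str) -> str:
    """Return the line as a space separated token string."""
    tokens = [LINE_START]
    plain = ""  # characters that no rule has claimed yet
    pos = 0
    while pos < len(line):
        for pattern, token in RULES:
            match = pattern.match(line, pos)
            if match:
                if plain:
                    tokens.append(plain)
                    plain = ""
                tokens.append(token)
                pos = match.end()
                break
        else:
            char = line[pos]
            if char == " ":
                # A single space only separates words.
                if plain:
                    tokens.append(plain)
                    plain = ""
            else:
                plain += char
            pos += 1
    if plain:
        tokens.append(plain)
    tokens.append(LINE_END)
    return " ".join(tokens)
```

With these rules, an indented Java statement such as `    int fooBar = 42;` becomes a sequence in which the indentation, the camel-cased identifier, the number, and the trailing semicolon are all explicit tokens, followed directly by the end-of-line token.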

IV-B Training, Test and Validation Sets

For our experiments, we mined 101 open source Java projects hosted on GitHub. All of the projects utilize GitHub’s built-in issue tracker. We collected 53 288 issue tickets that were labeled with ‘bug’, ‘defect’, or ‘regression’. Furthermore, we used the projects’ documentation as additional source of training data.

Figure 1 illustrates the training, test and validation set creation process. We divided the set of issue tickets into training and test portions (see Section IV-B1 for details). The larger set with 41 771 issue tickets is used together with 5 262 project documentation files as basis for the training set. Details on the processing of the issue tickets and project documentation files are provided in Sections IV-B2 and IV-B3. The smaller set with 11 517 issue tickets is used to source the test set. From these issue tickets, 100 tickets are manually inspected by the authors of this paper to provide validation sets (see Section IV-B4). Furthermore, we reused the NLoN dataset as additional validation set (see Section IV-B5).

Fig. 1: Training, test and validation set creation

IV-B1 Test and training split

Since parts of our dataset are used in other research, we separate the training and test set along this line: The test set exclusively consists of issue tickets that have commits linked to them, as they are used in downstream research that will utilize this pretrained classifier model as a preprocessing step.

The training set contains all remaining issue tickets, i.e., 910 478 lines of non-natural language text, and 360 047 lines of natural language text. The test set contains 193 727 and 36 579 lines, respectively. We balance both sets to contain an equal number of lines of artifacts and natural language text. The resulting balanced training set contains 720 094 lines, and the resulting test set contains 73 158 lines in total.
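The balanced set sizes follow directly from the size of the smaller class. The sketch below assumes that balancing is done by randomly downsampling the larger class; the text does not state the exact procedure, and the function name is hypothetical.

```python
import random


def balance_by_downsampling(artifact_lines, natural_lines, seed=0):
    """Randomly reduce both classes to the size of the smaller one."""
    rng = random.Random(seed)
    size = min(len(artifact_lines), len(natural_lines))
    return rng.sample(artifact_lines, size), rng.sample(natural_lines, size)


# Sizes reported above: twice the smaller class gives the balanced total.
balanced_training_size = 2 * min(910_478, 360_047)  # 720 094 lines
balanced_test_size = 2 * min(193_727, 36_579)       # 73 158 lines
```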

IV-B2 Issue tickets

GitHub’s built-in issue tracker offers Markdown for reporters to format their issue reports. For this work, we focus mainly on the following Markdown features: triple ticks that start and end a code highlighting block, indentation by four spaces signaling a code block, lines that are full quotes from start to end, Markdown style links, tables, URLs, and embedded images. If all issue reporters made use of these Markdown features to properly wrap non-natural language artifacts, the task of artifact removal would be trivial, but this is unfortunately not the case.

About 45 % of the issue tickets contain Markdown formatted code blocks. From these issue tickets, we manually examined 300 issue tickets and found that 93.3 % used Markdown formatting consistently. The other issue tickets contained code snippets, stack traces and/or log output that were not properly wrapped in Markdown blocks. Analogously, we also sampled 300 issue tickets from the 55 % that do not contain Markdown code blocks. 18 % of those tickets contain artifacts that are not properly Markdown formatted.

Since Markdown is not consistently used, the trivial approach of using Markdown features to identify artifacts is insufficient. However, we can leverage the issue tickets that contain Markdown code highlighting features to automate the creation of a data set for an ML classifier. Figure 2 illustrates this process.

Fig. 2: Automatic separation of human-written text and artifacts

In detail, we search for all issue tickets that contain such Markdown code block tags, and split these texts accordingly into natural language and non-natural language portions. To capture these various Markdown code blocks, we use a small set of six regular expressions. This process is based on the assumption that a reporter who utilizes Markdown for one portion of a report will also do so for the other portions of the same report.
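A minimal sketch of this separation is given below. It covers only two of the Markdown features named above, fenced code blocks and indentation by four spaces, and uses a simple line scanner instead of the six regular expressions, which are not reproduced here. The function name and the handling of tab indentation are assumptions.

```python
FENCE = "`" * 3  # Markdown marker that opens and closes a code block


def split_markdown(text: str):
    """Split a Markdown text into natural language lines and artifact lines."""
    natural, artifacts = [], []
    in_fence = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith(FENCE):
            # The fence line itself is dropped; it only toggles the state.
            in_fence = not in_fence
            continue
        if not stripped:
            continue  # empty lines carry no label
        if in_fence or line.startswith("    ") or line.startswith("\t"):
            artifacts.append(line)
        else:
            natural.append(line)
    return natural, artifacts
```

Applied to an issue ticket with a fenced Java snippet, the lines inside the fence end up on the artifact side and the surrounding description on the natural language side. Any artifact that the reporter did not wrap stays on the natural language side, which is the source of the noise discussed next.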

However, this assumption does not always hold, and the process therefore produces supposedly natural language text portions that are in fact artifacts of some kind. We examine the produced datasets, and augment the approach described above by employing a set of regular expressions to filter common artifact types from the natural language side of the dataset. The first part of these regular expressions can be easily reused in any context: Two regular expressions remove Unix and Windows style prompts, and three regular expressions remove JSON- and XML-like content. The second part of the regular expressions depends on the programming language used: Five regular expressions specifically aim at Java code, and four regular expressions target logging formats. We finally use two regular expressions to remove lines whose formatting does not allow us to distinguish them via regular expressions (e.g., Markdown block quotes using ‘>’ that are used for replies or follow-ups in conversations as well as for code highlighting).
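To give an impression of what such a filter can look like, the following sketch removes lines that resemble shell prompts. Both patterns are hypothetical examples written for this illustration; they are not the regular expressions used to build our dataset.

```python
import re

# Hypothetical patterns, for illustration only.
PROMPT_PATTERNS = [
    re.compile(r"^\s*\$\s+\S"),            # Unix style, e.g. "$ mvn test"
    re.compile(r"^\s*[A-Za-z]:\\[^>]*>"),  # Windows style, e.g. "C:\Users\me> dir"
]


def drop_prompt_lines(lines):
    """Keep only the lines that match none of the prompt patterns."""
    return [
        line for line in lines
        if not any(pattern.search(line) for pattern in PROMPT_PATTERNS)
    ]
```

Filters of this kind are applied only to the natural language side of the dataset, so a pattern that is too broad removes training examples but does not introduce wrong labels.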

To measure the noise in the final dataset, Researcher 1 manually inspected 600 artifact lines and 600 natural language lines randomly sampled from the test set. While the artifact portion of the generated dataset did not contain any natural language lines, the natural language text portion contains 35 lines that are artifacts. Most of these mislabeled lines constitute corner cases, e.g., information on operating systems including detailed version numbers. We estimate that about 6 % of the lines labeled as natural language and therefore 3 % of all lines contained in our training set and test set are mislabeled.

IV-B3 Project documentation files

The documentation files of a software project are a great source of training data for the purpose of artifact detection. Given the documentation in the form of Markdown files, we employ the same approach as for issue tickets discussed above (see Figure 2). We processed 5 262 .md files, resulting in 284 267 artifact lines and 219 598 lines of natural language. The resulting dataset obtained from documentation files is significantly cleaner, since in documentation files Markdown features are usually used properly and uniformly by the projects’ maintainers.

IV-B4 Validation set

To estimate the effect of aforementioned noise on the classifier model and to enable reproducible and objective performance measurements, we created a human annotated validation set to serve as ground truth. We randomly sampled 100 issue tickets from the set of issue tickets that were used to create the test set. Both authors individually classified these bug reports, annotating each line in the bug report to be either artifact or natural language. They were provided with the full textual bug reports for this work, in order to provide context and therefore a qualitatively better ground truth. The Cohen’s Kappa [25] inter-rater agreement of Researcher 1 and Researcher 2 is 0.96 and can be interpreted as ‘almost perfect’ agreement [26]. We noticed that Researcher 1 tended to classify lines in the gray area as artifacts, while Researcher 2 classified them as natural text. The resulting data sets, excluding empty lines, contain 1 816 lines. 76 % of the lines are artifacts, the remainder are natural language. Thus the data sets are imbalanced.
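For reference, Cohen’s Kappa relates the observed agreement of two raters to the agreement expected by chance. The following library-free sketch shows the computation for two lists of line labels; the function name is hypothetical and the toy labels in the usage note are not taken from our data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two raters who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Share of items on which both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if both raters labeled independently,
    # each with their own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    labels = set(freq_a) | set(freq_b)
    expected = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

For example, two raters who agree on three of four lines, with label frequencies of 3:1 and 2:2, reach an observed agreement of 0.75 against a chance agreement of 0.5, which gives a Kappa of 0.5.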

Since Researcher 1 implemented the preprocessing pipeline, we consider the classifications of Researcher 2 as the ground truth for performance measurements, so as to avoid any unwanted bias. From here on, we refer to these datasets as validation set 1 and validation set 2, corresponding to their human classifiers Researcher 1 and Researcher 2.

IV-B5 NLoN data set

The dataset of Mäntylä et al. [17] is publicly available. It was built from three different sources: comments on Mozilla’s issue tracker of multiple different projects (C++), chats from Kubernetes’ public Slack channel (Go), and messages on Apache Lucene’s mailing list archives (Java). For each source, 2000 data samples were manually labeled as natural text or artifact. The full dataset comprises 6 000 lines, of which 29 % are artifacts.

IV-C Machine Learning

Here we present our model’s setup in detail, from preprocessing, through utilized ML algorithms, to our model evaluation strategy.

IV-C1 Preprocessing

In the first step, we use regular expressions and basic string operations to perform the replacements and tokenizations discussed in Section IV-A. This step is implemented as a scikit-learn transformer. Doing so enables us to utilize standard tokenization and vectorization steps of the Python library, without any adaptations. We do not replace stop words, as these are important features for our task to differentiate natural language from other artifacts. Further, we do not perform case folding, as this also carries some information for the task at hand (e.g. all caps words are more common in artifacts). To encapsulate positional information of the tokens in the feature vectors (as discussed in Section IV-A), we vectorize into uni-, bi-, and tri-grams that are combined into a single feature vector using a simple count vectorizer.
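The combined uni-, bi-, and tri-gram counting can be sketched without any library as shown below. In scikit-learn, a `CountVectorizer` configured with `ngram_range=(1, 3)` and `lowercase=False` would play this role; whether these are the exact settings of our implementation is not stated here and should be treated as an assumption. The function name and the line boundary tokens in the usage note are hypothetical.

```python
from collections import Counter


def ngram_counts(tokenized_line: str, max_n: int = 3) -> Counter:
    """Count all n-grams of length 1 to max_n in a whitespace separated line."""
    tokens = tokenized_line.split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            counts[" ".join(tokens[start:start + n])] += 1
    return counts
```

For a tokenized line such as `Jlinestart return Jnumber Jsemicolon Jlineend`, the bi-gram `Jsemicolon Jlineend` records that the semicolon is the last token of the line, which a pure unigram count cannot express.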

IV-C2 ML models

We use classic ML models such as Support Vector Machines (SVM), Random Forest Classifier (RFC), Logistic Regression Classifier (LRC), and Multinomial Naive Bayes (MNB), due to their ease of use and low requirements in terms of computational resources for training and prediction. We do not perform hyperparameter tuning, and keep the default values of the classifiers in the used library (MNB: , SVM: , RFC: , LRC: ). In a preliminary experiment, the classification performance and capabilities of all classifiers were very similar, but the prediction and training times varied. Given the similarity in classification performances, we chose SVM for the following experiments.

IV-C3 Performance evaluation

We apply the test and training sets presented in Section IV-B. We measure the classification performance on the test set and the validation set and the prediction runtime on the test set. To enable comparison and a discussion of the external validity, we measure the performance of our model trained on our dataset on the NLoN [17] dataset, as well as the performance of our model when trained on the NLoN dataset.

V Evaluation

First, we present the results in Section V-A before we discuss them in Section V-B. Then, we deal with the limitations and threats to validity in Section V-C.

V-A Results

Fig. 3: Mean ROC-AUC learning curve and standard deviation

Figure 3 shows the mean performance and standard deviation of ROC-AUC performance for ten randomized runs in relation to different training set sizes. Our model’s performance on both validation sets continually increases with the training set size. However, as the classification performance increases, so does the time required for training the model, and more importantly, the memory and storage space requirements of the pretrained model.

We therefore randomly sample 40 % of the training set to be used to train our final model, to control these properties. This resulting training set contains 288 038 lines, and the results reported in this section stem from a model trained on this set. This final model can be trained in less than two minutes, its size is under 60MB, and it takes on average 0.72 seconds to classify 10k lines of input (as single-threaded Python 3.7 process on Debian 10, on Intel i5-4460, 3.20GHz). The first two rows of Table II contain the average classification performance from ten runs of this model as macro average F1 and ROC-AUC.

| Model | Training | Evaluation | F1 | ROC-AUC |
|---|---|---|---|---|
| Base validation | | | | |
| Ours | 40% Training set | Test set | 0.96 | 0.96 |
| Ours | 40% Training set | Validation set 2 | 0.93 | 0.95 |
| Cross validation | | | | |
| Ours | 40% Training set | NLoN dataset | 0.86 | 0.85 |
| NLoN | NLoN dataset | Validation set 2 | 0.81 | 0.83 |
| Ours | NLoN dataset | NLoN dataset | 0.93 | 0.93 |
| NLoN | 40% Training set | Validation set 2 | 0.90 | 0.92 |

TABLE II: Evaluation results of our model and NLoN model
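The macro average F1 reported in Table II weights both classes equally, which matters for the imbalanced validation sets. A library-free sketch of the computation is given below; the function names are hypothetical, and the example labels in the usage note are toy values.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 score when the given label is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)
```

For true labels `[1, 1, 1, 0]` and predictions `[1, 1, 0, 0]`, the per-class scores are 0.8 and about 0.67, so the macro average is about 0.73 regardless of how many lines each class contains.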

We cross evaluated our pretrained model against a pretrained NLoN model. Each model was used to classify the data of the other model: the NLoN model on validation set 2, and our model on the NLoN-Researcher 2 targets. The results of this evaluation are given in the third and fourth row of Table II.

We further cross evaluated our classifier approach against the NLoN approach by training and validating the models on the opposing datasets. An NLoN classifier was trained on our training set and scored against our validation set. Analogously, we trained and scored our classifier on the NLoN dataset. Since our employed SVM algorithm is sensitive to imbalanced training sets, we balanced the NLoN dataset by downsampling the majority class (natural language). Given the small size of the dataset, we used the Bootstrap algorithm to calculate the 95 % confidence intervals with 0.8/0.2 training/test splits.

The bottom two rows of Table II show the resulting performance scores for both models. Both the F1 and ROC-AUC confidence intervals of our classifier on the NLoN dataset are 0.91 to 0.94 with a mean of 0.93 over all 100 iterations. The NLoN classifier performed at F1 0.90 and ROC-AUC 0.92 on our validation set. Due to excessive training times for this NLoN classifier (multiple hours), we only performed a single training and evaluation run.
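The general idea of deriving an interval from repeated runs can be sketched as follows. The sketch uses repeated random 0.8/0.2 splits and a percentile interval over the resulting scores. How exactly the Bootstrap procedure resamples the data is not recoverable from the text above, so this resampling scheme, the function names, and the `train_and_score` callback are assumptions; a classic bootstrap would draw samples with replacement.

```python
import random


def repeated_split_scores(samples, train_and_score, iterations=100,
                          train_fraction=0.8, seed=0):
    """Score a model on repeated random training/test splits.

    train_and_score is a caller supplied function (hypothetical here) that
    trains on the first argument and returns a score on the second.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(iterations):
        shuffled = list(samples)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        scores.append(train_and_score(shuffled[:cut], shuffled[cut:]))
    return scores


def percentile_interval(scores, confidence=0.95):
    """Return the lower and upper score that bound the central share."""
    ordered = sorted(scores)
    tail = (1 - confidence) / 2
    lower = ordered[int(tail * (len(ordered) - 1))]
    upper = ordered[int(round((1 - tail) * (len(ordered) - 1)))]
    return lower, upper
```

With 100 iterations, the interval is read off the sorted scores near the 2.5th and 97.5th percentile, and the mean of the same scores gives the reported average.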

An important question is how well the automatic separation of text and artifacts with regular expressions as described in Section IV-B2 performs on validation set 2. Line-wise classification performs poorly with an F1 score of 0.65 and a ROC-AUC score of 0.77. Performing this task on complete and continuous bug reports aids this approach by enabling the detection of Markdown triple tick code blocks, and performs at 0.85 F1 and 0.91 ROC-AUC.

V-B Discussion

Despite the noise in our automatically generated training and test sets, our model scores well. The classification performance on our manually created validation set is close to the performance on our automatically generated test set with 0.96 vs. 0.93 F1 scores and 0.96 vs. 0.95 ROC-AUC scores. We investigated misclassifications of our automated dataset generation process in comparison to our manual classification efforts and found that there is an overlap in the disagreement: The same type of artifact misidentified by our automated dataset generation is also often mismatched between the two reviewers. This together with the high classification performance of the model supports our assumption that our automated training and test set creation approach is valid.

The cross evaluation of our model and the NLoN model highlights the limitations in portability of such pretrained models. The origins of the data (bug reports, comments on bug reports, mailing lists, or chats) seem to be an important factor. Further, the different programming languages in the projects used to create these datasets heavily influence the type and form of artifacts found in these texts.

To investigate the reason for this low portability, we retrained and evaluated our classification model on the NLoN dataset. The resulting classifier performed at an average of 0.93 F1 and ROC-AUC. Given an inter-rater F1 score of 0.94 of the NLoN dataset, these performance scores support the validity of our preprocessing and modeling approach. These scores demonstrate that the low cross evaluation scores of the pretrained model arise from dissimilarities in the datasets, while the classifier pipeline including our custom preprocessing is well portable.

However, these scores are lower than the ones reported for our dataset. The reasons for this are twofold: First, and most importantly, the training set size plays a significant role in our approach as already shown in Figure 3, and the NLoN dataset is significantly smaller than the training set used in our experiments. Second, the inter-rater agreement within the NLoN dataset with a Cohen’s Kappa of 0.88 is lower than for our manual validation set (0.96). This lower inter-rater agreement in the NLoN dataset can be explained by their process of randomly sampling lines for manual classification, while our manual classification process provided full bug tickets as context to the human classifier.

The performance comparison of our ML approach with simple regular expression parsing on the validation set (F1 score 0.93 for ML vs. 0.65 for regular expressions on a per line basis / 0.85 for regular expressions and Markdown blocks) shows the advantage of ML based classifier models over simple regular expressions.

V-C Limitations and Threats to Validity

We used only open source projects in our experiments. Therefore, we cannot generalize our results to closed source software projects. Since a dataset containing an ample amount of complete bug reports from commercial software is hard to come by, we cannot validate our approach in this domain.

A threat to external validity is that our internal dataset was mined solely from GitHub. Other bug trackers may encourage or discourage certain behaviors of reporters, may enforce more or less rules for more structured bug reports, and may vary in length, format, and tone of natural language. Further, our internal dataset was constructed exclusively from Java projects. Other programming languages may differ in frequency and usage of special characters and may employ different formatting rules, both being important features that we leverage in our approach.

We addressed this threat by validating our approach on the NLoN dataset, which was sourced from mailing lists, chat messages, and comments on bug reports. Two of the three projects used in the NLoN dataset are not written in Java (Go, C++). Using this dataset, we showed that our preprocessing and ML classifier approach does not overfit on any features predominant in GitHub bug reports or language features, and is portable to other data sources.

Another threat to external validity is that the natural language portions used in our experiment are in English. While logographic writing systems or other languages using non-Latin alphabets probably ease the task of distinguishing natural language from artifacts, our main concern is languages using a Latin alphabet. Since our approach leverages the projects’ documentation to mine natural language items, this poses an issue if the documentation and the bug reports are written in different languages.

A threat to internal validity is our manual classification effort in order to create a validation set. It is subject to human error, and given the task at hand, also subject to human preference regarding what is actually considered an artifact or human language. To counter this threat, two researchers independently classified the bug reports for the validation set. We analyzed the inter-rater agreement on this dataset, and used the classifications of Researcher 2 who was not involved in the implementation of the system to prevent any bias based on implementation details of the approach.

VI Conclusion

In this work, we present our Python framework for removing non-natural language text portions from bug reports. This includes a process for automated training set generation, preprocessing steps aimed at feature extraction for this task, and an LSVM classifier trained on our data set. We demonstrate the viability of this approach on a manually annotated dataset, by showing that despite the inherent noise in automatically created datasets we can achieve an F1 score of 0.93 and a ROC-AUC score of 0.95.

The proposed preprocessing and machine learning portion leverages differences between natural language and artifacts such as stack traces and code snippets that are normally lost in common NLP preprocessing approaches. These differences are formatting and structure, the frequency and position of special characters, and specific word constructs, e.g., camel-cased names and names utilizing underscores.
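To illustrate the kind of information such a preprocessing step can preserve, the following sketch tokenizes a single line while keeping indentation, special characters, and name shapes. It is a simplified illustration under our own assumptions, not the released implementation; the placeholder tokens (`<INDENT>`, `<CAMEL_NAME>`, `<UNDERSCORE_NAME>`, `<NUM>`) and the regular expressions are hypothetical.

```python
import re

# Alternatives are tried in order, so more specific shapes come first.
TOKEN_PATTERN = re.compile(
    r"""
    (?P<INDENT>^[ \t]+)                                 # leading whitespace (formatting)
  | (?P<UNDERSCORE>[A-Za-z0-9]+(?:_[A-Za-z0-9]+)+)      # names joined by underscores
  | (?P<CAMEL>[A-Za-z][a-z0-9]*(?:[A-Z][a-z0-9]+)+)     # camelCase or PascalCase names
  | (?P<NUMBER>\d+)                                     # digit runs
  | (?P<WORD>[A-Za-z]+)                                 # ordinary words
  | (?P<SPECIAL>[^\sA-Za-z0-9])                         # every single special character
    """,
    re.VERBOSE,
)

# Token kinds that are replaced by a placeholder describing their shape.
PLACEHOLDERS = {
    "INDENT": "<INDENT>",
    "UNDERSCORE": "<UNDERSCORE_NAME>",
    "CAMEL": "<CAMEL_NAME>",
    "NUMBER": "<NUM>",
}


def tokenize_line(line):
    """Turn one line into tokens that keep structure and special characters."""
    tokens = []
    for match in TOKEN_PATTERN.finditer(line):
        kind = match.lastgroup
        if kind in PLACEHOLDERS:
            tokens.append(PLACEHOLDERS[kind])
        elif kind == "WORD":
            tokens.append(match.group().lower())
        else:  # SPECIAL: keep the character itself as a token
            tokens.append(match.group())
    return tokens


stack_frame = tokenize_line("    at com.example.FooBar.run(FooBar.java:42)")
sentence = tokenize_line("Please set max_retries to 3.")
```

The resulting token sequences can be passed to any bag-of-words or n-gram vectorizer. The stack frame yields many punctuation and placeholder tokens, whereas the sentence yields mostly plain words, which is the contrast a line-level classifier can learn from.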

Mäntylä et al. [17] manually investigated the underlying datasets to create a list of specific features (e.g., a line ending with ‘{’) that scored exceptionally well in their evaluation. Our proposed approach instead aims at enabling the ML algorithm to learn such features from the training set, rather than providing them explicitly, by including the aforementioned formatting and structure as well as special characters in tokenization and vectorization. In doing so, we avoid the manual effort required for feature identification and implementation.

We cross-evaluated NLoN and our model by applying the pretrained models to the opposing validation/test sets. Both pretrained models are limited in their portability, with ROC-AUC scores between 0.83 and 0.85. However, we show that our model can be trained on a foreign dataset without any parameter tuning and still reach reasonable performance. To enable comparison, we trained and scored our model on the NLoN dataset. This yielded mean scores of 0.91 F1 and 0.93 ROC-AUC, despite the different programming languages used in that dataset and its origin in mailing lists, comments on bug reports, and chat messages, in contrast to our dataset of bug reports. We found that our approach benefits from bigger datasets. Such large datasets can be created automatically with our approach.

Our automated dataset creation process relies on parts of the input data being properly Markdown formatted. We demonstrate our process on bug reports for Java projects mined from GitHub issue trackers. We are confident that our approach can be easily ported to other programming languages, provided that the underlying bug tickets are Markdown annotated. Depending on the noisiness of the new data, this may require some effort to add domain-specific regular expressions. We hope that our proposed approach is useful for researchers dealing with textual bug reports.
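The reliance on Markdown formatting can be illustrated with a small sketch that separates the lines inside fenced code blocks of an issue body from the remaining lines. This is a simplified illustration and not the released mining pipeline: the function name is hypothetical, the fence handling is looser than the Markdown specification, and treating fenced lines as artifact candidates is an assumption made for the example.

```python
# Fence markers are built programmatically to keep this listing easy to embed.
FENCE_MARKERS = ("`" * 3, "~" * 3)  # backtick fences and tilde fences


def split_markdown_lines(issue_body):
    """Return (text_lines, fenced_lines) for a Markdown issue body.

    Lines inside fenced code blocks are collected as artifact candidates;
    fence marker lines and blank lines are dropped.
    """
    text_lines, fenced_lines = [], []
    open_marker = None  # marker that opened the current fenced block, if any
    for line in issue_body.splitlines():
        stripped = line.strip()
        if open_marker is None:
            if stripped.startswith(FENCE_MARKERS):
                open_marker = stripped[:3]  # opening fence, possibly with an info string
            elif stripped:
                text_lines.append(line)
        elif stripped.startswith(open_marker):
            open_marker = None  # closing fence
        elif stripped:
            fenced_lines.append(line)
    return text_lines, fenced_lines


fence = "`" * 3
sample_issue = "\n".join([
    "App crashes on startup.",
    "",
    fence + "java",
    'Exception in thread "main" java.lang.NullPointerException',
    "    at Foo.bar(Foo.java:1)",
    fence,
    "Any ideas?",
])
text_lines, fenced_lines = split_markdown_lines(sample_issue)
```

Lines outside fences are not necessarily clean natural language, since reporters often paste logs or code without any markup. This is one reason why noisy input may call for additional, domain-specific regular expressions.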

VII Data availability

The created datasets from 101 open source Java projects, the manual validation sets, and our model’s Python source code are made publicly available on Zenodo and GitHub.


The work described in this paper has been funded by the Austrian Science Fund (FWF): P 32653-N (Automated Debugging in Use).


  • [1] J. Zhang, X. Y. Wang, D. Hao, B. Xie, L. Zhang, and H. Mei, “A survey on bug-report analysis,” Science China Information Sciences, vol. 58, pp. 1–24, feb 2015.
  • [2] C. Zhou, B. Li, X. Sun, and L. Bo, “Why and what happened? Aiding bug comprehension with automated category and causal link identification,” Empirical Software Engineering, vol. 26, no. 6, pp. 1–36, aug 2021. [Online]. Available:
  • [3] F. Thung, D. Lo, and L. Jiang, “Automatic defect categorization,” in Working Conference on Reverse Engineering (WCRE), 2012, pp. 205–214.
  • [4] S. Mani, A. Sankaran, and R. Aralikatte, “DeepTriage: Exploring the effectiveness of deep learning for bug triaging,” in ACM India Joint International Conference on Data Science and Management of Data (CoDS-COMAD ’19), jan 2019, pp. 171–179. [Online]. Available:
  • [5] D. Devaiya, J. Anvik, M. Bheree, and F. Yeasmin Omee, “Evaluating a Tool for Creating Bug Report Assignment Recommenders,” in 33rd International Conference on Software Engineering & Knowledge Engineering, 2021. [Online]. Available:
  • [6] J. Zhou, H. Zhang, and D. Lo, “Where should the bugs be fixed? More accurate information retrieval-based bug localization based on bug reports,” in International Conference on Software Engineering (ICSE 2012), 2012, pp. 14–24.
  • [7] L. Kumar, T. G. Dastidar, L. B. Murthy Neti, S. M. Satapathy, S. Misra, V. Kocher, and S. Padmanabhuni, “Deep-Learning Approach with DeepXplore for Software Defect Severity Level Prediction,” in International Conference on Computational Science and Its Applications (ICCSA 2021).   Springer, Cham, sep 2021, pp. 398–410. [Online]. Available:{_}28
  • [8] A. Kukkar, R. Mohana, A. Nayyar, J. Kim, B.-G. Kang, and N. Chilamkurti, “A Novel Deep-Learning-Based Bug Severity Classification Technique Using Convolutional Neural Networks and Random Forest with Boosting,” Sensors, vol. 19, no. 13, pp. 2964:1–22, jul 2019. [Online]. Available:
  • [9] M. Ortu, G. Destefanis, S. Swift, and M. Marchesi, “Measuring high and low priority defects on traditional and mobile open source software,” in 7th International Workshop on Emerging Trends in Software Metrics (WETSoM 2016), may 2016, pp. 1–7.
  • [10] A. Kukkar, R. Mohana, Y. Kumar, A. Nayyar, M. Bilal, and K. S. Kwak, “Duplicate Bug Report Detection and Classification System Based on Deep Learning Technique,” IEEE Access, vol. 8, pp. 200 749–200 763, 2020.
  • [11] I. Chawla and S. K. Singh, “An automated approach for bug categorization using fuzzy logic,” in 8th India Software Engineering Conference (ISEC 2015).   ACM, feb 2015, pp. 90–99. [Online]. Available:
  • [12] K. Goseva-Popstojanova and J. Tyo, “Identification of Security related Bug Reports via Text Mining using Supervised and Unsupervised Classification,” in IEEE International Conference on Software Quality, Reliability and Security Identification (QRS’18), 2018, pp. 344–355.
  • [13] R. K. Saha, M. Lease, S. Khurshid, and D. E. Perry, “Improving bug localization using structured information retrieval,” in 28th IEEE/ACM International Conference on Automated Software Engineering (ASE 2013), 2013, pp. 345–355.
  • [14] L. Tan, C. Liu, Z. Li, X. Wang, Y. Zhou, and C. Zhai, “Bug characteristics in open source software,” Empirical Software Engineering, vol. 19, no. 6, pp. 1665–1705, oct 2014.
  • [15] B. Ray, D. Posnett, V. Filkov, and P. Devanbu, “A large scale study of programming languages and code quality in GitHub,” in ACM SIGSOFT Symposium on the Foundations of Software Engineering (FSE’14), nov 2014, pp. 155–165.
  • [16] M. Soltani, F. Hermans, and T. Bäck, “The significance of bug report elements,” Empirical Software Engineering, vol. 25, no. 6, pp. 5255–5294, sep 2020.
  • [17] M. Mäntylä, F. Calefato, and M. Claes, “Natural Language or Not (NLoN) - A Package for Software Engineering Text Analysis Pipeline,” in IEEE/ACM 15th International Conference on Mining Software Repositories (MSR), 2018, pp. 387–391.
  • [18] A. Bacchelli, T. D. Sasso, M. D’Ambros, and M. Lanza, “Content classification of development emails,” in 34th International Conference on Software Engineering (ICSE 2012), 2012, pp. 375–385.
  • [19] N. Bettenburg, T. Zimmermann, R. Premraj, and S. Kim, “Extracting structural information from bug reports,” in International Conference on Software Engineering (ICSE 2008), 2008, pp. 27–30.
  • [20] A. Bacchelli, A. Cleve, M. Lanza, and A. Mocci, “Extracting structured data from natural language documents with island parsing,” in 26th IEEE/ACM International Conference on Automated Software Engineering (ASE 2011), 2011, pp. 476–479. [Online]. Available:
  • [21] P. C. Rigby and M. P. Robillard, “Discovering essential code elements in informal documentation,” in 35th International Conference on Software Engineering (ICSE 2013), 2013, pp. 832–841.
  • [22] D. Ye, Z. Xing, C. Y. Foo, J. Li, and N. Kapre, “Learning to extract API mentions from informal natural language discussions,” in IEEE International Conference on Software Maintenance and Evolution (ICSME 2016), jan 2017, pp. 389–399.
  • [23] L. Ponzanelli, A. Mocci, and M. Lanza, “StORMeD: Stack overflow ready made data,” in IEEE International Working Conference on Mining Software Repositories, vol. 2015-Augus, aug 2015, pp. 474–477.
  • [24] F. Calefato, F. Lanubile, and B. Vasilescu, “A large-scale, in-depth analysis of developers’ personalities in the Apache ecosystem,” Information and Software Technology, vol. 114, pp. 1–20, oct 2019.
  • [25] J. Cohen, “A Coefficient of Agreement for Nominal Scales,” Educational and Psychological Measurement, vol. 20, no. 1, pp. 37–46, apr 1960. [Online]. Available:
  • [26] J. R. Landis and G. G. Koch, “The Measurement of Observer Agreement for Categorical Data,” Biometrics, vol. 33, no. 1, p. 159, mar 1977.