Defect prediction is an active direction of software engineering research with hundreds of publications. The systematic literature review by Hall et al. already found 208 studies on defect prediction published between 2000 and 2010, and many more have been published since then. Many of these studies were enabled by the sharing of data, highlighted by the early efforts of the PROMISE repository, which is nowadays known as Seacraft. Only a few publications on defect prediction collect new data. Instead, most researchers rely on well-known data sets, e.g., the NASA data, the SOFTLAB data, or the data about Java projects from Jureczko and Madeyski, often referred to as PROMISE. A recent literature review on cross-project defect prediction highlights that these and other data sets have become the de facto standard for defect prediction research. While sharing and re-using data is a good thing in general, heavy re-use may also lead to problems with the external validity of results. Unfortunately, there is evidence that this is the case for defect prediction due to two issues: 1) problems with the defect labels; and 2) limitations regarding the features used by researchers.
The first problem concerns the defect labels. Different publications identified issues with different aspects of the defect labeling process. The focus of these publications is mostly on the SZZ algorithm, which was used to create most of the currently used data sets (see Section 2). Research revealed several problems with SZZ, e.g., due to ignoring the affected version field of issue reports or the identification of irrelevant changes. The use of a six-month time frame for the assignment of defects to releases has also recently been identified as an issue. Moreover, SZZ relies on the correct labeling of issues as bugs by the developers in the issue tracking system. However, research shows that about 33% of bug reports are mislabeled and are actually improvements or other issues, like outdated documentation. Additionally, SZZ was designed for version control systems that used a mostly linear development process on a main development branch. Due to the success of Git, this is often not the case anymore and there are many new challenges that need to be considered. For example, prior research found that data that takes branches into account leads to better results.
The second problem with the re-use of the existing data is the limited feature space that researchers use to create defect prediction models. If a data set does not contain certain features, it is unlikely that they are added by other researchers, even if research indicates that these features may be useful. For example, multiple publications indicate that features based on code changes potentially outperform static metrics as features [16, 17]. Regardless, researchers mostly rely on data sets that only consist of static features . Thus, many publications are using a potentially inferior set of features, which could alter their results.
Thus, we know from the related work about many separate issues with defect prediction data, especially with respect to the way software artifacts are labeled as defective, but also due to a potential lack of relevant features. However, each of the prior publications on this topic focuses on single issues with defect prediction data. What is missing is a view on the impact of the problems if they are not considered in isolation, but together. Within this article, we close this gap. We provide insights into the quality issues that existing data may have, with a focus on the defect labeling. To this aim, we performed an in-depth analysis of the weaknesses of existing defect labeling strategies with a focus on SZZ. We analyze all aspects of the defect labeling process, i.e., the links between commits and issues, the impact of mislabeled issues, the identification of affected files and the inducing changes, as well as the assignment of defects to releases. Additionally, we use the large sample of data we generate to assess the impact of the lack of features on the performance of defect prediction results.
The key findings of our assessment are the following.
About one quarter of the links to defects detected by SZZ are wrong, due to both missed links and false positive links.
We confirm the results by Herzig et al. and find that for every issue that is correctly labeled as a bug, there are 0.74 mislabeled bug issues.
Due to the combination of wrong links and mislabeled issues, only about half of the commits SZZ identifies are actually bug fixing and SZZ misses about one fifth of all bug fixing commits.
The assignment of defects to releases based on a six-month time frame, as well as based on the affected versions field of issue tracking systems, is unreliable. With SZZ and a six-month time frame, we found that for every file that is correctly labeled as defective, there are roughly two files that are incorrectly labeled as defective and two files that are incorrectly labeled as non-defective. Moreover, the quality of the data in the affected version field is questionable for mining purposes without prior manual validation.
The difference between using many features of different types and using only static features of the source code is negligible: more features only allow slightly more bugs to be predicted, without increasing the precision of the predictions.
As part of our analysis, we devise new or modified techniques for defect labeling, that mitigate the issues we found through our inspection of SZZ. This article contributes the following improvements to defect labeling:
A semi-automated algorithm for the identification of links between Jira issues and commits.
A new algorithm to assign defects to releases based on bug-inducing changes that neither requires a time-window, nor the affected version field in the issue tracking system.
All algorithms were designed to work correctly with Git and take all branches into account.
Finally, we combined the results from the manual analysis of data with the improved algorithms for defect labeling, and a large scale collection of features, in order to create a new defect prediction data set, that has 4198 features, including change metrics [16, 18, 17] and different aggregation strategies . The data set contains defect data for 398 releases of 38 projects from the Apache ecosystem.
The remainder of this paper is structured as follows. We discuss the state of practice for the collection of defect prediction data with respect to existing data sets in Section 2, followed by an analysis of issues with the established data collection methods reported in the state of the art in Section 3. Afterwards, we discuss our suggested improvements to the state of practice in Section 4. Section 5 presents the results of our empirical study on defect labeling and the impact of feature sets. We discuss our results in Section 6 and address the threats to the validity of our work in Section 7. Finally, we conclude the article in Section 8.
2 Existing Data Sets
This article is about the state of practice regarding the collection of defect prediction data. Therefore, the related work are articles that collected defect prediction data. Articles that only discuss specific aspects of this data collection are instead discussed together with the issues with the currently established ways for the collection of defect prediction data in Section 3. We discuss the prior defect prediction data sets with respect to the following criteria:
the number of distinct projects, as well as the number of releases;
the level of abstraction of the data set, e.g., modules, files, or classes;
the features that are provided; and
the defect labeling strategy and the labels that are provided.
|Dataset|#Projects / #Releases|Type|Language|Granularity|Features|Defect Linking|Label Type|Year|
|---|---|---|---|---|---|---|---|---|
|NASA|13 / 13|PROP|C/C++/Java|Module|SIZE, COM|NA|Binary|2003|
|ECLIPSE|1 / 3|OSS|Java|File, Package|SIZE, COM, CHURN|SZZ|Counts|2007|
|SOFTLAB|5 / 5|PROP|C/C++|Module|SIZE, COM|NA|Binary|2009|
|PROMISE|37 / 92|Mixed|Java|Class|SIZE, COM, OO|REGEX|Counts|2010|
|RELINK|3 / 3|OSS|Java|Class|SIZE, COM|GOLDEN|Binary|2011|
|AEEEM|5 / 5|OSS|Java|Class|SIZE, COM, CHURN|SZZ|Counts|2012|
|NETGENE|4 / 4|OSS|Java|File|SIZE, COM, GEN|SZZ|Counts|2013|
|MJ12A|18 / 70|Mixed|Java|Class|SIZE, COM, OO, CHURN|NA|NA|2015|
|SHIPPEY|23 / 69|OSS|Java|Class, Method|SIZE, COM|SZZ|Counts|2016|
|GITHUB|15 / 15|OSS|Java|Class, File|SIZE, COM, DOC, CLONE|SZZ|Binary|2016|
|UNIFIED|37 / 71|OSS|Java|Class, File|SIZE, COM, OO, DOC|NA|NA|2018|
|RNALYTICA|9 / 32|OSS|Java|File|SIZE, COM, OO, CHURN|Affected Version|Counts|2019|
|JIT|6 / -|OSS|Java, Perl, C++, Ruby|Commit|CHURN, DEV|SZZ|Binary|2013|
|AUDI|3 / -|PROP|C|Commit|SIZE, COM, CHURN|SZZ|Binary|2015|
The table above gives an overview of the data sets used in the literature. Overall, we are aware of fifteen publicly available data sets for defect prediction as of August 2019. The data contains mostly open source projects (OSS), but also proprietary projects (PROP). The PROMISE data is actually a mix of 15 open source projects with 48 releases, 6 proprietary projects with 27 releases, and 17 student projects with 17 releases.
These data sets can be distinguished between data for release-level defect prediction and just-in-time defect prediction. There are two major differences between the release-level and just-in-time data sets. 1) Release-level data contains features for all software artifacts for a certain revision (usually a release) of a project, while just-in-time data contains data for every commit of a project, possibly restricted to the main development branch. 2) The release-level data consists mostly of features that measure the source code directly, e.g., its size, structure, or coupling. Just-in-time data consists mostly of features that measure the changes, e.g., the number of lines that are changed. The second difference has a major consequence regarding the programming languages that are considered: the collection of data about the source code structure requires language-specific tooling for the (static) analysis of source code, while the collection of data about code changes and ownership can be done directly using the version control system. This is reflected directly in the languages of the projects: while most release-level data sets are only for one specific programming language, two out of three just-in-time data sets are for a diverse set of languages.
We note that there is a strong focus on Java in the release-level data sets. Although we have no scientific evidence for this, we believe that the reason for this is likely the good tool support for the static analysis of Java. Moreover, we note that the features for the release-level data sets are mostly static product metrics of the types that measure the size (SIZE), code complexity (COM), or aspects of object orientation (OO), e.g., using Chidamber and Kemerer’s metrics. The GITHUB and UNIFIED data sets also contain other features based on static product metrics, i.e., regarding the code documentation (DOC) and code clones (CLONE). The notable exceptions are the AEEEM, MJ12A, and RNALYTICA data, which also contain features based on code changes, e.g., added and deleted lines (CHURN). The AEEEM data even contains features that measure the entropy of code changes as proposed by Hassan and D’Ambros et al. The ECLIPSE, AEEEM, and MJ12A data also contain metrics regarding prior bug fixes. The MJ12A data is an extension of a subset of the PROMISE data with CHURN metrics. The UNIFIED data is a special case: this data set is actually a combination of the PROMISE (only OSS projects), AEEEM, ECLIPSE, and GITHUB data. All data is preserved as is in the data set and augmented with additional metric data to create the UNIFIED data set. Most data sets for release-level defect prediction use the file, class, or module level. These are all very similar, as they encompass the complete contents of a single file in most cases, with the exception of anonymous and inner classes, which can lead to differences. The notable exception is SHIPPEY, which also contains method-level data.
For the just-in-time defect prediction data, we note that JIT and FJIT are independent of the programming language, which is a big advantage for generalizing the defect prediction approach. Both data sets use metrics that can directly be inferred from the version control system. The difference between the data sets is that the JIT data only contains the information about which commits induced a defect. In comparison, FJIT contains information about which changes to files (hereafter referred to as file actions) introduced a new defect. The AUDI data set is not comparable to the other two data sets: the data is about source code that was not written by developers, but instead generated from Simulink models developed by engineers. Moreover, the data contains not only information that is collected from the version control system, but also static product metrics about SIZE and COM.
The identification of defective artifacts is a major aspect of defect prediction data that can greatly affect the quality of the data. For example, Yatish et al. recently found that labeling based on the affected releases leads to large differences in comparison to labeling based on keywords and a six-month time window as proposed by Fischer et al. The de facto standard approach in research is the SZZ algorithm as it was used by Zimmermann et al. The SZZ algorithm first identifies bug fixing commits and then the corresponding inducing changes. However, the identification of inducing changes is, to the best of our knowledge, so far ignored by all release-level data sets. Instead, only bug fixing commits are identified using SZZ, and then the six-month time window proposed by Fischer et al. is applied, as was done by Zimmermann et al. This approach was used for the collection of the ECLIPSE, AEEEM, NETGENE, SHIPPEY, and GITHUB data sets. The just-in-time defect prediction data sets JIT, AUDI, and FJIT use SZZ including the identification of inducing commits. The RELINK data was created as a case study for an issue linking approach. The authors created manually validated issue links and used these links for the identification of bug fixing commits. All files that were implicated in any bug fix are considered as defective, without using a time window. The PROMISE and MJ12A data identify bug fixing commits using regular expressions applied to commit messages and a six-month time window. For the NASA and SOFTLAB data, no information on how the defects are linked to source code is given. Since the UNIFIED data is an aggregation of other data sets that reuses defect labels, there is no defect identification approach. The RNALYTICA data uses an approach for the identification of defects based on the affected version field of the bug tracking system.
The authors establish links between commits and issues based on references from issues to the commits in which they were addressed. They then use the hunks changed in these commits to identify which files were changed. The authors then use the affected version field of the issue tracker to assign the defect for the file to releases.
We note that there may be additional data sets, that we did not discuss above. For example, we excluded the data used by Zhang et al. [32, 19] based on the census data by Mockus . The reason for this exclusion is that this data is, to the best of our knowledge, not publicly available anymore, because the links in both papers do not work anymore. Regardless, these data sets would not add anything regarding the methodology for collecting defect prediction data, as the approach is almost exactly the same as for PROMISE and MJ12A: the defect identification is based on commit messages and a six month time window, the metrics are SIZE and COM, and in case of  also CHURN.
3 Issues with Existing Data Sets
The sharing of public data sets in defect prediction is a success story that enabled defect prediction researchers to conduct many experiments with the data. Moreover, the use of the same data by different authors enables comparisons between approaches through meta studies, as was, e.g., done by Hall et al.  who exploited that many papers are based on the NASA data. However, there are a number of potential problems that researchers found regarding the data collection procedures used for the creation of the defect prediction data sets. Within this section, we summarize issues regarding algorithms for defect labeling and the features available in current data sets.
3.1 Defect Labeling
Defect labels are the key component of any defect prediction data set. These labels mark artifacts as defective, e.g., files in a release or in a commit. They are the dependent variable that defect prediction models try to predict based on the independent variables, i.e., the features. In the existing defect prediction data, labels are either binary or defect counts. Noisy defect labels may negatively affect the training of defect prediction models or make the evaluation of results unreliable. The impact of the noise on the evaluation of results is especially problematic. For example, if a defect labeling approach marks too many instances as defective, i.e., produces false positives, the commonly used measures recall, precision, and F-measure are not trustworthy anymore, because their values may change if the distribution between defective and non-defective instances changes. Consider an example with 100 software artifacts of which 25 are actually defective, but the defect labeling algorithm introduces noise and labels 50 artifacts as defective. A trivial model that predicts everything as defective will overestimate the precision as 0.5 instead of the actual 0.25. This also affects the F-measure, which would be about 0.67 instead of 0.4.
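The example with 100 artifacts can be reproduced with a few lines of Python. This is only an illustrative sketch: the helper `precision_recall_f1` and the variable names are ours, and we assume that the 50 artifacts labeled as defective include the 25 artifacts that are actually defective.

```python
def precision_recall_f1(predicted, labels):
    """Compute precision, recall, and F-measure for binary predictions."""
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# 100 artifacts, of which the first 25 are actually defective.
true_labels = [i < 25 for i in range(100)]
# Assumption: the noisy labeling marks the first 50 artifacts as defective,
# i.e., the 25 real defects plus 25 false positives.
noisy_labels = [i < 50 for i in range(100)]
# Trivial model that predicts every artifact as defective.
predictions = [True] * 100

measured = precision_recall_f1(predictions, noisy_labels)
actual = precision_recall_f1(predictions, true_labels)
print("measured on noisy labels:", measured)
print("actual on true labels:   ", actual)
```

Evaluated against the noisy labels, the trivial model reaches a precision of 0.5 and an F-measure of 2/3, while the values on the true labels are 0.25 and 0.4.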
As we discussed above, the de facto standard for labeling defective instances is the use of the SZZ algorithm  in the variant used by Zimmermann et al. . The SZZ algorithm works in two steps. First, bug fixing commits are identified. The SZZ algorithm tries to find a matching issue, based on the numbers found in the commit messages. In case any number is found, the algorithm tries to find an issue for the project that has the same number. If an issue is found, semantic checks for the following properties are performed :
The issue was resolved as FIXED at least once.
The author of the commit is assigned to the issue.
The title or description of the issue is contained in the commit message.
One or more files that are changed by the commit are attached to the issue.
A commit is identified as bug fixing if there is at least one linked issue, that passes at least two of the above semantic checks. In case only one semantic check is passed, the commit is labeled as bug fixing, if the commit message contains a term like ”bug” or ”fix”, or it is clear that a number in the commit is a link to an issue, e.g., because the number starts with ”Bug #111”, or the commit contains only a list of numbers.
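The decision rule described above can be sketched as follows. This is a simplified illustration and not the original SZZ implementation: the classes `Issue` and `Commit`, their fields, and the regular expressions are assumptions we introduce for this example. An explicit reference like "Bug #111" is covered by the keyword check, because it contains the term "bug".

```python
import re
from dataclasses import dataclass, field


@dataclass
class Issue:  # hypothetical data model for an issue report
    title: str
    description: str
    assignee: str
    was_resolved_fixed: bool
    attached_files: set = field(default_factory=set)


@dataclass
class Commit:  # hypothetical data model for a commit
    message: str
    author: str
    changed_files: set = field(default_factory=set)


KEYWORD = re.compile(r"bug|fix", re.IGNORECASE)
ONLY_NUMBERS = re.compile(r"\s*\d+(?:[\s,;]+\d+)*\s*")


def text_matches(text, message):
    """True if a non-empty text is contained in the commit message."""
    return bool(text) and text.lower() in message.lower()


def passed_semantic_checks(issue, commit):
    """Count how many of the four semantic checks a linked issue passes."""
    checks = [
        issue.was_resolved_fixed,
        commit.author == issue.assignee,
        text_matches(issue.title, commit.message)
        or text_matches(issue.description, commit.message),
        bool(commit.changed_files & issue.attached_files),
    ]
    return sum(checks)


def has_syntactic_hint(message):
    """Keyword such as 'bug' or 'fix', or a message that is only a list of numbers."""
    return bool(KEYWORD.search(message) or ONLY_NUMBERS.fullmatch(message))


def is_bug_fixing(commit, linked_issues):
    """Two passed checks suffice; one passed check needs a syntactic hint."""
    for issue in linked_issues:
        passed = passed_semantic_checks(issue, commit)
        if passed >= 2:
            return True
        if passed == 1 and has_syntactic_hint(commit.message):
            return True
    return False
```

For example, a commit by the assignee of an issue that was resolved as FIXED passes two checks and is labeled as bug fixing, regardless of the wording of the commit message.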
Once a commit is identified as bug fixing, the second part of the SZZ algorithm is the identification of the inducing changes. SZZ identifies the last changes to each line that was touched as part of a bug fixing commit as candidate for an inducing change. All candidates, that took place before the reporting date of the issue are immediately considered as bug inducing changes. Changes that took place after the reporting date of the issue are suspect, because they were performed after the bug was already in the software. However, because of the chance of bad fixes or partial bug fixes, the changes are not automatically discarded. If the suspects are part of a bug fixing commit (partial fix) or the commit contains changes that are inducing for a different bug (weak suspect), they are considered as inducing. The remaining suspects are considered to be hard suspects and not inducing for the bug fix. The identification of the bug inducing commits is only used for the just-in-time data. For the release level data sets, defects are assigned to releases based on the reporting date: all bugs that were reported in the first six months after the release are assigned to a release .
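The classification of candidates for inducing changes can be summarized in a short sketch. The class `Candidate` and its boolean fields are hypothetical names for this illustration; we assume that the information whether a candidate belongs to a bug fixing commit or is inducing for a different bug was already computed.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Candidate:  # hypothetical model of the last change to a fixed line
    commit_date: datetime
    is_bug_fixing: bool = False      # change is part of a bug fixing commit
    induces_other_bug: bool = False  # change is inducing for a different bug


def classify_candidate(candidate, issue_report_date):
    """Classify a candidate relative to the reporting date of the issue."""
    if candidate.commit_date < issue_report_date:
        return "inducing"
    # Changes after the reporting date are suspects.
    if candidate.is_bug_fixing:
        return "partial fix"
    if candidate.induces_other_bug:
        return "weak suspect"
    return "hard suspect"


def is_inducing(candidate, issue_report_date):
    """Only hard suspects are not considered as inducing."""
    return classify_candidate(candidate, issue_report_date) != "hard suspect"
```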
However, in recent years quality issues with the labels produced by SZZ came into focus. The impact of using all changes in a bug fixing commit as the foundation for the defect labeling was investigated in detail by Mills et al. The authors manually validated which files that were modified in a bug fixing commit were actually part of the bug fix and found that about 64% of file changes made in bug fixing commits are not part of the bug fixes, but other changes. They found that mistakenly identified files are due to code that is only added and not modified or deleted (46.58% of all cases), changes to test code (30.90%), refactorings (8.73%), and changes of comments (8.49%). SZZ already ignores pure additions for the identification of inducing changes, because there is no prior commit in which the code was last changed. To the best of our knowledge, none of the SZZ implementations used to create the defect prediction data sets ignores test code, refactorings, or comments. Excluding the pure additions from the 64% leaves 0.64 × (1 − 0.4658) ≈ 0.34. This means that, based on the estimation of Mills et al., about 34% of files in bug fixing commits are false positives, i.e., incorrectly identified as defective.
Another potential source of false positives of SZZ are commits that are mistakenly identified as bug fixes. With SZZ, there are two main sources for this problem. The first is the strategy for the identification of links between commits and issues, which works based on numbers. If a core developer of a project fixes a bug with an issue number like one, 256, or another frequently occurring number, every commit by this developer that contains this number will be identified as a bug fixing commit. While Bird et al. found that this problem can be mitigated through additional filters, e.g., based on the date of the commit and the issue resolution, this is not part of the standard SZZ algorithm. There are also approaches that try to recover links between commits and issues that are not explicit, e.g., ReLink. Such links also cannot be captured by SZZ. The second source of false positives for bug fixing commits are the issues themselves. According to Herzig et al., about 33% of issues that are reported as a bug in the issue tracking system are actually requests for new features, bad documentation, or simply result in refactorings. They found that, due to this, 39% of the identified defective files were actually not defective. To the best of our knowledge, only the NETGENE data is based on manually validated issues. That both happen in practice can, e.g., be seen with the issue NUTCH-1 (https://issues.apache.org/jira/browse/NUTCH-1). This test issue created by a core developer was not for any real bug in the software. However, all commits by this developer for the Apache Nutch project in which the message contains the number one will be mistakenly identified as bug fixing.
The identification of the inducing file actions has also come under scrutiny. Da Costa et al. investigated how well SZZ identifies inducing commits and found that SZZ implementations perform better if they ignore changes that only modify whitespace. This is in line with the findings by Mills et al. Moreover, they suggest that using the affected versions field of issue tracking systems can improve the validity of SZZ results. Developers can use this field to mark versions of a software that are affected by a specific defect. Regardless, all data sets we discussed in Section 2 use a basic SZZ variant that neither ignores whitespace changes nor uses the affected version field. However, da Costa et al. do not suggest how the affected version should be integrated into the SZZ algorithm. Yatish et al. directly used the affected version field to assign defects to releases. Theoretically, this could lead to a perfect assignment of defects to releases. However, this depends on the maintenance of this field in the issue tracker by the developers. In practice, the value of this field is usually set by the reporter of an issue as the version of the software that the reporter currently uses. An analysis of whether this defect was already in the software in earlier versions is often not performed and, consequently, the field is not updated. An example for this is the issue CAY-1657 (https://issues.apache.org/jira/browse/CAY-1657). The affected version of that issue is 3.1M3. However, as part of the description, the author writes ”I am sure this affects ALL versions of Cayenne, but my testing is done on 3.1 M3/M4”. Thus, this approach is likely to lead to many false negatives, i.e., defects that are mistakenly not assigned to affected releases, because the field is not maintained properly. To the best of our knowledge, there is no empirical evidence regarding the quality of the data in the affected versions field.
Regarding the six-month time frame that was proposed by Zimmermann et al. for the release assignment with SZZ, we could not find an empirical basis for this in the literature. Yatish et al. already broke with this rule and demonstrated that there are both defects within the six-month time frame that were introduced after the release, leading to false positives, as well as defects that were fixed more than six months after the release, leading to false negatives. Thus, the work by Yatish et al. provides a strong indication that the complete history of a project after the releases should be considered for assigning defects to releases. We are not aware of other studies that evaluate the impact of the six-month time frame.
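The two kinds of errors of the time frame can be illustrated with a small sketch. All names and dates are hypothetical. We approximate six months as 183 days and, as a simplifying assumption, consider a release to be affected by a defect if the defect was introduced before the release and fixed after it.

```python
from datetime import datetime, timedelta

# Assumption: "six months" is approximated as 183 days.
SIX_MONTHS = timedelta(days=183)


def assigned_by_time_window(release_date, report_date, window=SIX_MONTHS):
    """Time frame heuristic: the bug was reported within the window after the release."""
    return release_date <= report_date <= release_date + window


def affects_release(release_date, inducing_date, fix_date):
    """Simplified ground truth (assumption): introduced before and fixed after the release."""
    return inducing_date <= release_date < fix_date


release = datetime(2019, 1, 1)

# Bug A: introduced after the release, but reported within six months.
bug_a = {"induced": datetime(2019, 2, 1),
         "reported": datetime(2019, 3, 1),
         "fixed": datetime(2019, 3, 10)}
# Bug B: already present in the release, but reported more than six months later.
bug_b = {"induced": datetime(2018, 6, 1),
         "reported": datetime(2019, 9, 1),
         "fixed": datetime(2019, 9, 15)}


def compare(bug):
    """Return (label from the time frame, simplified ground truth)."""
    return (assigned_by_time_window(release, bug["reported"]),
            affects_release(release, bug["induced"], bug["fixed"]))


result_a = compare(bug_a)
result_b = compare(bug_b)
```

In this sketch, bug A is a false positive of the time frame (assigned, but not affecting the release) and bug B is a false negative (affecting the release, but not assigned).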
3.2 Lack of Features
The features (or independent variables) are at the core of every learning algorithm: they are the information that is available to make a decision or they can be correlated with the outcome. If good features are missing in defect prediction data, this can lead to a loss in performance. This loss in performance can vary between algorithms used for training prediction models. This leads to a troubling question: if researchers find a difference between defect prediction approaches on data that does not contain all important features, would the difference still be there if all relevant features were available? As a consequence, conclusions regarding the performance of defect prediction algorithms based on data that does not contain features that were demonstrated to be valuable have a severe problem with the external validity of the findings. The defect prediction literature suggests that there are at least two kinds of such features: CHURN related features, as well as different variants of aggregated features.
CHURN related features are based on the findings that defects are more likely in often changed parts of a software, especially in case a prior change already removed defects . Moreover, such features may also include information about past defects (BUG), e.g., the number of prior defects that were already corrected in a file. Publications that include CHURN features consistently find that CHURN features are among the most important predictors for defects. This already started with the pioneering work by Ostrand et al.  at Bell labs and was later confirmed, e.g., by Moser et al.  and D’Ambros et al. . Hassan  proposed to use the concepts of entropy and linear decay to further improve the impact of CHURN features, which was also confirmed by D’Ambros et al. . Another strong indicator for the importance of CHURN related features is that they are the main features in the just-in-time defect prediction data sets in combination with code ownership. Regardless, the only release-level data sets that contain CHURN features are ECLIPSE, AEEEM, MJ12A, and RNALYTICA.
However, there is another known issue related to features that take the history into account, such as CHURN features. The current data sets collect this data only based on the main development branch of a repository. However, due to the advent of Git as version control system, features are often developed on branches. These feature branches are merged through a single merge commit into the main development branch. If only the main branch is considered, information about the history of the development is ignored. To the best of our knowledge, all current defect prediction data sets only use the main development branch. Kovalenko et al. evaluated the impact of using feature branches as part of the data collection on results of various software mining approaches, including defect prediction. They found that the performance of defect prediction may improve slightly if data from branches is included in the mining process. In particular, they found that the results are never worse. Thus, while this issue probably does not have the same impact as the general lack of CHURN features, this could still lead to underestimating the performance of defect prediction approaches.
While CHURN features are part of some data sets, there are other types of features that are ignored by current defect prediction data altogether. Plosch et al. analyzed the correlation between bugs and warnings produced by the static analysis tools FindBugs (http://findbugs.sourceforge.net/) and PMD (https://pmd.github.io/). They found that the warnings have a stronger correlation with bugs than OO and SIZE metrics. In contrast, Rahman et al. found that features from static analysis tools do not improve the performance of defect prediction models. Due to the contradictory results, we believe that more data is required, e.g., through additional studies that evaluate the impact of such features.
The impact of aggregated features is a recent result of the defect prediction literature. Zhang et al. found that how measurements from lower level software artifacts are aggregated into metrics for higher level artifacts impacts the performance of defect prediction models. A popular example of such a metric is the Weighted Methods per Class (WMC) metric from Chidamber and Kemerer’s metrics for object-oriented software. This metric measures the complexity of a class by summation of the complexity of the methods. However, Zhang et al. found that aggregation through a single scheme, like summation, can actually lead to inferior defect prediction models. Instead, they recommend using different aggregation strategies to provide multiple aggregations, such as summation, median, and the standard deviation, and later using feature reduction techniques to remove redundancies. We note that while Zhang et al. found that using all aggregation schemes leads to the best results overall, they also observed that using only summation is a close second, i.e., the advantage of using multiple aggregation schemes may be negligible. Regardless, to the best of our knowledge, none of the current publicly available data sets support this kind of analysis.
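The following sketch illustrates why a single aggregation scheme can hide information. The function `aggregate` and the example values are hypothetical, and the use of the population standard deviation is our assumption for this illustration.

```python
import statistics


def aggregate(method_values):
    """Aggregate method-level measurements into class-level features."""
    if not method_values:
        return {"sum": 0, "median": 0, "std": 0.0}
    return {
        "sum": sum(method_values),
        "median": statistics.median(method_values),
        # Assumption: population standard deviation over the methods of a class.
        "std": statistics.pstdev(method_values),
    }


# Two hypothetical classes with the same summed method complexity.
class_a = aggregate([1, 1, 10])  # one complex method, two trivial ones
class_b = aggregate([4, 4, 4])   # three methods of equal complexity
print(class_a)
print(class_b)
```

Both classes have the same sum of 12, so a WMC-like feature cannot distinguish them, while the median and the standard deviation differ.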
4 Improving Defect Labeling
In principle, we believe that the SZZ algorithm  provides a very good foundation for the labeling of defects. Thus, we do not propose a radically new algorithm, but rather modify the SZZ algorithm to work well together with the Jira issue tracking system, as well as take the issues from the state of the art into account.
As we described in Section 3.1, the SZZ algorithm may suffer from misidentified defect links due to recurring numbers like 1. We note that SZZ was designed with the Bugzilla issue tracker in mind. Here, issue identifiers are just a single number, so there is no good way to resolve this ambiguity. This is different in Jira, where the issue identifiers have the structure PROJECTID-NUMBER. Thus, we modified the identification of linked issues to take this structure into account to define a new linking approach we call JL (short for Jira Links).
JL exploits the semantics of the string descriptor of Jira, i.e., we search for the complete identifier in commit messages, and not just any number. The drawback of this is that spelling problems in the project identifier would mean that we miss issue links. To account for this, we manually check all strings that are a combination of a string followed by an integer and supply a list with all wrong spellings, such that they can be corrected by the linking algorithm. While this requires manual effort, it can be done in a matter of minutes. The problem with JL is that it also captures links from commits in which an issue is only mentioned, but not actually addressed. Moreover, if the numbers are alone, i.e., not part of a Jira identifier, JL may miss links. SZZ can detect such links, because the algorithm works only on numbers. Thus, to account for links that SZZ detects but JL misses, we semi-automatically analyze all messages of commits that contain a link determined by JL or SZZ. The goal of this additional step is to create a validated set of links from commits to issues. For many commits, this is not a problem. In case we determined only one link from a commit to an issue, and this link is established because the exact name of the issue occurs at the start of the commit message, we assume that this commit addresses the mentioned issue. An example of such a commit message from the ant-ivy project is "IVY-1391 - IvyPublish fails when using extend tags with no explicit location attribute". An expert must inspect all remaining commit messages for which a link was detected regarding two criteria: 1) are the links correct, i.e., are the issues actually mentioned by the commit message; and 2) which issues were actually addressed by the commit and which issues were only mentioned. Only correct links that were actually addressed in the commit are then validated by the expert. We refer to the combination of JL with the validated data as JLM (short for Jira Links Manual).
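The identifier-based linking and the heuristic for automatically validated links can be sketched as follows. This is a minimal illustration under our own assumptions and not the linkSHARK implementation: the regular expression, the function names, and the misspelling "IVVY" are hypothetical.

```python
import re

# Jira identifiers have the structure PROJECTID-NUMBER, e.g., IVY-1391.
JIRA_KEY = re.compile(r"\b([A-Za-z][A-Za-z0-9]*)-(\d+)\b")

def find_jira_links(message, project_key, misspellings=()):
    """Return the issue keys of `project_key` that a commit message mentions."""
    accepted = {project_key.upper()} | {m.upper() for m in misspellings}
    links = []
    for match in JIRA_KEY.finditer(message):
        if match.group(1).upper() in accepted:
            # Known misspellings are corrected to the real project key.
            key = f"{project_key.upper()}-{match.group(2)}"
            if key not in links:
                links.append(key)
    return links

def is_auto_validated(message, links):
    """Heuristic: exactly one link, and the identifier starts the message."""
    return len(links) == 1 and message.lstrip().upper().startswith(links[0])

message = ("IVY-1391 - IvyPublish fails when using extend tags "
           "with no explicit location attribute")
links = find_jira_links(message, "IVY")
auto = is_auto_validated(message, links)

# Hypothetical message with two mentions, one of them misspelled.
other = find_jira_links("Fix regression, see IVY-882 and ivvy-900", "IVY", ["IVVY"])
```

Messages with more than one link, or with a link that is not at the start, are not validated by the heuristic and would be passed to the expert. A bare number such as "#1391" is not matched by this pattern, which is the case where the number-based SZZ linking can still find a link.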
For both JL and JLM we use rules similar to SZZ  to determine if a commit is bug fixing:
a bug fixing commit must have a validated link to an issue that is validated as BUG; and
the linked issue must have been in a closed or resolved state at some point in its lifetime.
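The two rules can be expressed as a small predicate. The sketch below assumes that each issue is available as a dictionary with a type and a flag that records whether the issue was ever closed or resolved; this data layout and the issues IVY-1400 and IVY-1401 are hypothetical.

```python
def is_bug_fixing(linked_issue_keys, issues):
    """Apply the two rules to the validated links of a single commit."""
    for key in linked_issue_keys:
        issue = issues.get(key)
        if issue is None:
            continue  # link to an unknown issue, cannot be used
        if issue["type"] == "BUG" and issue["was_closed_or_resolved"]:
            return True
    return False

# Hypothetical issue data for illustration.
issues = {
    "IVY-1391": {"type": "BUG", "was_closed_or_resolved": True},
    "IVY-1400": {"type": "IMPROVEMENT", "was_closed_or_resolved": True},
    "IVY-1401": {"type": "BUG", "was_closed_or_resolved": False},
}

fixing = is_bug_fixing(["IVY-1391"], issues)
not_fixing = is_bug_fixing(["IVY-1400", "IVY-1401"], issues)
```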
The major assumption behind JLM is that the labeling of issues as bugs in the issue tracker is correct. Since we know from the results by Herzig et al.  that this is often not the case, we propose that the type of issues should be manually validated. Taking pattern from Herzig et al. , we used the following five categories.
BUG for null pointer exceptions, runtime or memory issues caused by defects, or semantic changes to the code to perform corrective maintenance tasks. This is the same as the BUG category from Herzig et al.
IMPROVEMENT for feature requests or the non-corrective improvement of existing features. This bundles the categories RFE (Feature Request), IMPR (Improvement Request), REFAC (Refactoring Request) from Herzig et al.
TEST for issues that only require changes to the software tests. This category was not used by Herzig et al.
DOC for requested changes to the documentation of the software. This is the same as the DOC category from Herzig et al.
OTHER for all other issues, e.g., questions or brainstormings. This is the same as the OTHER category from Herzig et al.
The differences between our work and Herzig et al. are mainly due to efficiency. The different types of IMPROVEMENT are often very hard to distinguish based on the description of an issue, while they all lead to improvements of the software. We kept DOC and OTHER and added TEST because these are clearly distinguishable from the other issue types. For maximal efficiency, one could also use a simple binary classification, i.e., BUG and not BUG. For our research, we decided against this to facilitate research using this data regarding the automated correction of issue types.
We propose that the manual validation should be done in two steps: First, all linked issues of type BUG should be independently labeled by two experts. The experts have access to the description and comments of the issue, as well as the source code that was changed as part of the commits that were linked to this issue. If both experts agree, we assume their assessment is correct. In case of disagreement, the issues should be presented to a panel of at least two experts, one of which did not participate in the initial labeling. The experts then decide the issue type based on the blinded labels determined by the two experts, the issue description and comments, and the source code changes. This validation should be based on the principle ”innocent unless proven guilty”, i.e., in case there is doubt whether the issue is a BUG or not, the experts should not modify the label, i.e., always label such issues as BUG.
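The two-step procedure can be summarized as a small decision function. This is only a sketch of the process described above: the function name is hypothetical, and we assume that an undecided panel is represented by `None`, in which case the original BUG label is kept.

```python
ISSUE_TYPES = {"BUG", "IMPROVEMENT", "TEST", "DOC", "OTHER"}

def resolve_issue_type(label_a, label_b, panel_label=None):
    """Combine two independent expert labels for an issue reported as BUG."""
    for label in (label_a, label_b):
        if label not in ISSUE_TYPES:
            raise ValueError(f"unknown issue type: {label}")
    if label_a == label_b:
        # Step 1: agreement of both experts is accepted as correct.
        return label_a
    if panel_label is None:
        # "Innocent unless proven guilty": in case of doubt, keep BUG.
        return "BUG"
    if panel_label not in ISSUE_TYPES:
        raise ValueError(f"unknown issue type: {panel_label}")
    # Step 2: the panel resolves the disagreement.
    return panel_label

agreed = resolve_issue_type("IMPROVEMENT", "IMPROVEMENT")
undecided = resolve_issue_type("BUG", "DOC")
decided = resolve_issue_type("BUG", "DOC", panel_label="DOC")
```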
While issues of types other than BUG could also be manually analyzed, the work by Herzig et al. showed that bugs are almost always correctly labeled as BUG. Thus, we suggest restricting the manual labeling to issues of type BUG, due to the time-intensive nature of the manual labeling step. In the following, we use JLMIV (Jira Links Manual Issues Validated) to refer to bug fix labeling that accounts for the manual validation of issue types.
The results by Da Costa et al. and Mills et al. indicate problems with the way SZZ determines bug inducing changes. Both studies highlight that changes that only affect whitespace or modify comments should be ignored during the identification of bug inducing changes. To address this concern, we use a regular expression approach to identify changes that only modify comments or whitespace and ignore them. Mills et al. also found that changes to tests are inadvertently covered during the search for bug inducing changes. Since bugs can, by definition, only be in production code, these changes are all false positives. We extend this notion to changes to examples or tutorials, which may contain code files. These are also not production code, but documentation of the project. Based on the results by Mills et al., these modifications should be able to account for about 85% of the false positive bug inducing changes. We refer to this improvement of the detection of inducing changes as JLMIV+.
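A regular expression based filter of this kind could look as follows. The patterns and function names are our own simplified assumptions for Java code and are not taken from the inducingSHARK. In particular, the comment pattern is a line-based heuristic that would also skip a code line that starts with `*`.

```python
import re

# Blank lines and lines that start a line comment, a block comment,
# or a block comment continuation (simplified heuristic).
IGNORABLE_LINE = re.compile(r"^\s*((//|/\*|\*).*)?$")

# Directories assumed to contain tests, examples, or tutorials.
NON_PRODUCTION_PATH = re.compile(r"(^|/)(tests?|examples?|tutorials?)/", re.IGNORECASE)

def only_whitespace_changed(old_line, new_line):
    """True if both lines consist of the same whitespace-separated tokens."""
    return old_line.split() == new_line.split()

def is_blame_candidate(path, old_line, new_line):
    """Decide if a changed line should be traced back to an inducing change."""
    if NON_PRODUCTION_PATH.search(path):
        return False  # not production code
    if IGNORABLE_LINE.match(old_line):
        return False  # blank line or comment
    if only_whitespace_changed(old_line, new_line):
        return False  # formatting only
    return True

code_change = is_blame_candidate("src/main/java/Foo.java", "int x = 1;", "int x = 2;")
test_change = is_blame_candidate("src/test/java/FooTest.java", "int x = 1;", "int x = 2;")
```

Only the lines that pass this filter would then be handed to the blame step that searches for the inducing change.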
Da Costa et al. also noted that SZZ should take the affected version field of issue tracking systems into account, because this gives further information about the time when the software was defective. In their study, Da Costa et al. mark all changes after the release of the earliest affected version as incorrect. Yatish et al. assign bug fixes directly to releases based on the affected version field and do not use SZZ at all. In contrast to Yatish et al. and Da Costa et al., we do not consider the affected version label as ground truth, due to the reasons we discussed in Section 3.1. Because we assume that the affected version field is likely missing completely or at least missing older releases that were also affected by a bug, we believe that the approaches by Da Costa et al. and Yatish et al. are too strict for real-world data and would lead to false negatives, i.e., not assigning bugs to affected releases because the affected versions field is incomplete. A less strict variant that takes the affected version field into account would be to integrate the affected version in the strategy to determine inducing changes of the SZZ algorithm, based on how the bug reporting date in the issue tracker is used. SZZ assumes that changes after the reporting date of a bug are suspect, but may still be inducing for the bug fix, e.g., because they are bad fixes or partial fixes. The same logic can be applied to changes that happen after the release of an affected version. Thus, our proposal to utilize the affected version field to enhance the SZZ algorithm is to extend the notion of suspect changes and use the minimum of the release dates of all affected versions and the reporting date of the bug as the boundary for suspects. We refer to this approach as JLMIV+AV.
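The extended notion of suspect changes boils down to a different boundary date. The following sketch uses hypothetical function names and dates, and assumes that a change is suspect if it is strictly later than the boundary.

```python
from datetime import date

def suspect_boundary(report_date, affected_release_dates=()):
    """Earliest date at which the bug is known to have affected the software."""
    return min([report_date, *affected_release_dates])

def split_inducing_changes(change_dates, report_date, affected_release_dates=()):
    """Separate candidate inducing changes into non-suspect and suspect ones."""
    boundary = suspect_boundary(report_date, affected_release_dates)
    non_suspect = [d for d in change_dates if d <= boundary]
    suspect = [d for d in change_dates if d > boundary]
    return non_suspect, suspect

# Hypothetical bug that was reported on 2018-03-01 and lists two affected releases.
reported = date(2018, 3, 1)
releases = [date(2018, 1, 15), date(2018, 2, 20)]
changes = [date(2017, 12, 1), date(2018, 2, 1)]

with_av = split_inducing_changes(changes, reported, releases)
without_av = split_inducing_changes(changes, reported)
```

With the affected versions, the change from 2018-02-01 becomes suspect, because it happened after the earliest affected release. Without them, only the reporting date is used and both changes are non-suspect.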
For release-level data, there is another issue to consider, i.e., how we decide which releases were affected by which bugs and label files within releases accordingly. We already discussed that the six-month time frame has no empirical foundation and leads to mislabels, as demonstrated by Yatish et al. However, since we believe that the affected version field is unreliable, we propose a different approach than Yatish et al. We propose an approach that is directly based on the bug inducing changes. If the bug inducing changes are determined correctly, this means that the bug was in the software when the last non-suspect bug inducing change took place. Suspect changes have to be excluded here, because there is confirmation in the issue tracking system that the bug already affected the system when this change took place. Similarly, we can determine when the bug was fixed as the last bug fixing commit for the bug. Consequently, a bug affects a release if all non-suspect inducing changes took place before the release, i.e., there is a path in the commit graph from all inducing changes to the release, and at least one bug fixing commit took place after the release. We refer to this approach as IND-JLMIV.
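The assignment of bugs to releases can be sketched on a commit graph that maps each commit to its parents. The sketch is a simplification under two assumptions of ours: a fix counts as "after the release" if it is not reachable from the release commit, and a bug without any inducing change is never assigned to a release. The commit names are hypothetical.

```python
def ancestors(commit, parents):
    """All commits reachable from `commit` via parent edges, including itself."""
    seen = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, ()))
    return seen

def bug_affects_release(release, inducing, fixing, parents):
    """Check if a bug was in the code at the release commit."""
    if not inducing:
        return False  # e.g., fixes that only add code have no inducing change
    contained = ancestors(release, parents)
    all_induced_before = all(c in contained for c in inducing)
    fixed_after = any(c not in contained for c in fixing)
    return all_induced_before and fixed_after

# Hypothetical linear history a <- b <- c <- d, given as child -> parents.
parents = {"b": ["a"], "c": ["b"], "d": ["c"]}

# The bug was induced in b and fixed in d.
affects_c = bug_affects_release("c", inducing=["b"], fixing=["d"], parents=parents)
affects_a = bug_affects_release("a", inducing=["b"], fixing=["d"], parents=parents)
affects_d = bug_affects_release("d", inducing=["b"], fixing=["d"], parents=parents)
```

Because the check uses reachability instead of commit dates, it also works for histories with branches and merges.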
5 Empirical Study
Within this section, we describe the results of an empirical study we conducted. The goal of this study was two-fold. On the one hand, we want to determine the impact of the issues we discussed in Section 3. On the other hand, we want to determine if our proposed improvements can effectively resolve these issues. For all of these analyses, we only present aggregated results for all projects we analyzed. The detailed tables with all data, including all code required for an exact replication of our work as well as the collected release-level and just-in-time defect prediction data sets, are part of the supplemental material (https://hdl.handle.net/21.11101/0000-0007-D827-A). We will create a long-term archive on Zenodo or a similar archive and use a DOI to cite the material in case of acceptance.
We conducted our study on a convenience sample of projects from the Apache Software Foundation (http://www.apache.org). Apache projects must have reached a certain level of maturity in order to be considered as a top-level project. In particular, the use of Jira as issue tracking system is highly recommended and followed by most Apache projects. Additionally, Bissyandé et al. found that "Apache developers are meticulous in their efforts to insert bug references in the change logs of the commits". This is an important property for the projects under consideration because it removes the need for link recovery and we can safely assume that links from commits to issues are available in the data. Yatish et al. also used a convenience sample of Apache projects for the same reason. To further ensure the maturity of projects, we used the criteria listed in Table II. In addition, we had a soft criterion that we focused on projects with less than 10,000 commits on the main development branch. The reason for this is the very high demand on the resources for the collection of the metric data for every commit in the Git repository. (While this criterion is irrelevant for the evaluation of the defect labeling, we also selected the projects with the goal to provide a new defect prediction data set.) Please note that the total number of commits can still be larger than 10,000, because we collect the data for all branches.
|Uses Git||Most projects either already use a Git repository, or provide a Git mirror of a SVN repository.|
|Java as main language||Our static analysis only works for Java code.|
|Uses Jira||The Jira of the Apache Software Foundation is the main resource for tracking issues of most Apache projects.|
|At least two years old||Project has a sufficient development history.|
|Not in incubator stage||Project has been fully accepted by the Apache Software Foundation.|
|100 Issues in Jira||Project is mature and actively uses Jira.|
|Commits||Project has a sufficient development history.|
|100 Files||Project should have a reasonable size.|
|Activity since 2018-01-01||Project is still active in both Jira and Git.|
Table III lists the 38 projects for which we collected data, including the versions of the 398 releases for which we collected release-level data. The releases were determined using the project homepages. For each release, we looked up the commit of the release in the Git repository. For most releases, there was a related tag in the Git repository. If this was not the case, we manually analyzed the commit history to determine the release commit, using the information we found on the project homepage, as well as related tags and branches.
|ant-ivy||3,189||535||1.4.1, 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.4.0|
|archiva||10,262||542||1.0, 1.1, 1.2, 1.3, 2.0.0, 2.1.0, 2.2.0|
|calcite||2,926||842||1.0.0, 1.1.0, 1.2.0, 1.3.0, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.9.0, 1.10.0, 1.11.0, 1.12.0, 1.13.0, 1.14.0, 1.15.0|
|commons-bcel||1,429||53||5.0, 5.1, 5.2, 6.0, 6.1, 6.2|
|commons-beanutils||1,341||76||1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7.0, 1.8.0, 1.9.0|
|commons-codec||1,838||64||1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 1.10, 1.11|
|commons-collections||3,380||115||1.0, 2.0, 2.1, 3.0, 3.1, 3.2, 3.3, 4.0, 4.1|
|commons-compress||2,755||172||1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 1.10, 1.11, 1.12, 1.13, 1.14, 1.15, 1.16|
|commons-configuration||3,717||188||1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 1.10, 2.0, 2.1, 2.2|
|commons-dbcp||2,205||127||1.0, 1.1, 1.2, 1.3, 1.4, 2.0, 2.1, 2.2.0, 2.3.0, 2.4.0, 2.5.0|
|commons-digester||2,525||26||1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 2.0, 2.1, 3.0, 3.1, 3.2|
|commons-io||2,262||131||1.0, 1.1, 1.2, 1.3, 1.4, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5|
|commons-jcs||1,622||80||1.0, 1.1, 1.3, 2.0, 2.1, 2.2|
|commons-jexl||3,276||84||1.0, 1.1, 2.0, 2.1, 3.0, 3.1|
|commons-lang||5,792||318||1.0, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7|
|commons-math||7,222||415||1.0, 1.1, 1.2, 2.0, 2.1, 2.2, 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6|
|commons-net||2,270||176||1.0.0, 1.1.0, 1.2.0, 1.3.0, 1.4.0, 2.0, 2.1, 2.2, 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6|
|commons-scxml||1,216||70||0.5, 0.6, 0.7, 0.8, 0.9|
|commons-validator||3,416||73||1.0, 1.1.0, 1.2.0, 1.3.0, 1.4.0, 1.5.0, 1.6.0|
|commons-vfs||2,212||156||1.0, 2.0, 2.1, 2.2|
|deltaspike||2,311||302||0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 1.0.0, 1.1.0, 1.2.0, 1.3.0, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0|
|eagle||1,119||225||0.3.0, 0.4.0, 0.5.0|
|giraph||1,121||337||0.1.0, 1.0.0, 1.1.0|
|gora||1,329||113||0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8|
|jspwiki||8,809||274||1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 2.0.36, 2.2.19, 2.4.56, 2.6.0, 2.8.0, 2.9.0, 2.10.0|
|knox||2,069||568||0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0, 0.8.0, 0.9.0, 0.10.0, 0.11.0, 0.12.0, 0.13.0, 0.14.0, 1.0.0|
|kylin||12,975||732||0.6.1, 0.7.1, 1.0, 1.1, 1.2, 1.3, 1.5.0, 1.6.0, 2.0.0, 2.1.0, 2.2.0|
|mahout||4,167||513||0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.10.0, 0.11.0, 0.12.0, 0.13.0|
|manifoldcf||5,936||633||0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 1.10, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, 2.10|
|nutch||3,532||643||0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 1.10, 1.11, 1.12, 1.13, 1.14, 2.0, 2.1, 2.2, 2.3|
|opennlp||2,685||219||1.6.0, 1.7.0, 1.8.0|
|parquet-mr||2,249||1,413||1.0.0, 1.1.0, 1.2.0, 1.3.0, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.9.0|
|santuario-java||3,376||83||1.0.0, 1.2, 1.4.5, 1.5.9, 2.0.0, 2.1.0|
|systemml||6,196||452||0.9, 0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 1.0.0, 1.1.0, 1.2.0|
|tika||4,933||605||0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.10, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 1.10, 1.11, 1.12, 1.13, 1.14, 1.15, 1.16, 1.17|
|wss4j||3,734||241||1.5.0, 1.6.0, 2.0.0, 2.1.0, 2.2.0|
We collected all data using the SmartSHARK  platform. The advantage of this approach is that we aggregate all collected data in a single MongoDB database, including the results of our manual validations.
We used the tools vcsSHARK, mecoSHARK, coastSHARK, changeSHARK, and refSHARK to collect data from the version control system (all *SHARK tools and mynbou are available at https://github.com/smartshark). The vcsSHARK collects meta data about commits, e.g., the messages, the committer, as well as the actual changes, i.e., the file actions and hunks. The mecoSHARK is a wrapper around SourceMeter (https://www.sourcemeter.com/), a tool that calculates static software metrics, clone metrics, as well as the warnings by the static analysis tool PMD. The coastSHARK collects AST node counts and the import statements of Java classes, i.e., low level data about the use of language constructs and the dependencies of classes. The changeSHARK is a wrapper around the ChangeDistiller that determines the types of changes performed in commits using the classification from Zhao et al. The refSHARK is a wrapper around the RefDiff tool for refactoring detection. These tools are executed for every commit in the repositories to collect data about the source code evolution. Additionally, we use the tool memeSHARK to remove redundancies from the collected data, e.g., because the data did not change between commits, for a more efficient storage. We use the tool issueSHARK to collect data from the issue tracking system of the projects, e.g., the identifiers, comments, status, and other meta data about the issues.
The tools linkSHARK, labelSHARK, and inducingSHARK implement the approaches for issue linking, labeling of bug fixing commits, and the inference of inducing commits that we evaluate in this empirical study. The manual validation was supported by the visualSHARK, a web application that presented the data that requires manual validation to the experts and stores the results of the validation in the MongoDB. The information that the web interface provides is similar to the LINKSTER tool .
We use the tool mynbou to create CSV files with release-level data with files as level of abstraction. For each file, the data contains the software metrics and PMD warnings collected by the mecoSHARK and the coastSHARK, the number of the different kinds of changes and refactorings collected by the changeSHARK and the refSHARK from the last six months, and churn metrics proposed by Moser et al., Hassan, and D'Ambros et al. Additionally, mynbou provides all thirteen aggregations that were proposed by Zhang et al. for the software metrics the mecoSHARK collected that are not on the file level, i.e., class, method, interface, enum, attribute, and annotation metrics. The data set contains a total of 4198 features. The defect labels are assigned by mynbou using the JLMIV+IND approach. For each file, mynbou stores the number of bugs that were fixed in this file. Moreover, the data contains a matrix that has the issues as columns and the files as rows. This matrix contains the value one if the issue affected a file. This allows a fine-grained analysis of which issues affected which file in a release and also which issues affected multiple files. The column names of this matrix contain the identifier of the issue, the severity of the issue, and the date of the last bug fixing commit for the issue. This meta data allows for later filtering, e.g., to exclude trivial bugs or to ensure that there is no data leakage, e.g., to exclude bugs that were not yet fixed at the time of a release for which a prediction model is trained.
5.2 Evaluation Criteria
For the evaluation of the different aspects of defect labeling, we determine baselines and then determine how well other approaches perform with respect to the baseline. We evaluate two aspects: how many artifacts determined by the baseline are correctly identified and how many additional artifacts are identified. Artifacts are, for example, links from commits to issues, commits, or files. This approach is similar to the concept of true positives and false positives. For example, if a baseline determines $b$ artifacts as defective, and an approach for comparison A identifies $c$ of these artifacts as defective as well, but also $a$ additional artifacts, we say that A identifies $c/b$ of the artifacts correctly as true positives and $a/b$ additional false positive artifacts. For these comparisons, we report the median and the Median Absolute Difference (MAD), which is defined as $MAD = 1.4826 \cdot \mathrm{median}(|x_i - \mathrm{median}(x)|)$ for a sample $x = (x_1, \ldots, x_n)$. The median and the MAD have the advantage over the mean value and the standard deviation that they are robust in case of non-normal distributions, which is the case for most data in our empirical study. The value 1.4826 is a scaling factor for MAD that makes the values of MAD similar to the standard deviation of normally distributed data.
We also evaluate defect prediction models as part of our empirical study, to evaluate the impact of feature sets. We use four performance metrics that are frequently used in defect prediction research [45, 7]. The first three metrics are computed from the confusion matrix, where $tp$ and $tn$ are the numbers of true positive and true negative predictions, and $fp$ and $fn$ are the numbers of false positive and false negative predictions, respectively. The fourth metric is the AUC, which is the area under the curve of the recall plotted versus the false positive rate, which is defined as $fpr = \frac{fp}{tn + fp}$.
We use Wilcoxon's signed-rank test for paired samples with a significance level of 0.0125 (0.05 with Bonferroni correction for the four tests that we conduct in Section 5.9) to evaluate if the differences in these four metrics are significant. In case differences are significant, we use Cliff's $\delta$ to estimate the effect sizes. According to Romano et al., the effect is negligible if $|\delta| < 0.147$, small if $|\delta| < 0.33$, medium if $|\delta| < 0.474$, and large if $|\delta| \geq 0.474$.
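For reference, the scaled MAD and Cliff's $\delta$ can be computed with a few lines of code. This is a minimal sketch with hypothetical function names. The thresholds 0.147, 0.33, and 0.474 are the values commonly attributed to Romano et al. and are an assumption of the sketch, and the quadratic-time computation of $\delta$ is chosen for clarity, not for efficiency.

```python
import statistics

def mad(sample):
    """Median absolute difference, scaled by 1.4826."""
    med = statistics.median(sample)
    return 1.4826 * statistics.median([abs(x - med) for x in sample])

def cliffs_delta(xs, ys):
    """Cliff's delta: (#(x > y) - #(x < y)) / (|xs| * |ys|)."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))

def magnitude(delta):
    """Interpretation of the effect size with the commonly used thresholds."""
    d = abs(delta)
    if d < 0.147:
        return "negligible"
    if d < 0.33:
        return "small"
    if d < 0.474:
        return "medium"
    return "large"

# The outlier 100 has no influence on the median and the MAD.
spread = mad([1, 2, 3, 4, 100])
delta = cliffs_delta([1, 2, 3], [4, 5, 6])
```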
5.3 Identification of Issue Links
The first step of defect labeling is the identification of links between commits and issues. We restricted the analysis only to links to issues of type BUG that were closed and fixed at least once in their lifetime. Thus, we restrict this analysis to the links to issues that are relevant for defect labels.
Our JLM approach that is based on a semi-automated validation of the links found by SZZ and JL identified 18,721 correct links from commits to issues. 5,311 of these links were manually validated by the first author of this article, the remaining 13,410 links were validated by our heuristic, i.e., had a link to a single issue directly at the beginning. We sampled 1000 of the links that were validated by our heuristic to evaluate the correctness of the heuristic. The heuristic was correct in all cases. Moreover, we randomly sampled 1000 commits from all commits for which neither SZZ nor JL found a link to a Jira issue. We found no links to Jira that we failed to identify. However, we found 38 links to the Bugzilla (https://bz.apache.org) of the Apache Software Foundation. Since the data is not available anymore, we could not validate if these issues are bugs or improvements. Therefore, about 3.8% of the total commits may also be bug fixing, but cannot be determined as such anymore because the issue data is missing. Otherwise, we found no errors in our data. Thus, while we believe that there may still be missed or invalid links in the data, the amount of data that is affected would be very small. Therefore, we can consider JLM as ground truth for the links between commits and Jira issues. We note that these findings are in line with the empirical study by Bissyandé et al. on issue links in Apache projects.
We evaluate the performance of SZZ and JL with respect to the ground truth data determined with JLM. Figure 1 summarizes the results. JL finds almost all correct links with a median of 99.7% (MAD=0.4%). In the worst case, JL still identifies 96.2% of all links. However, JL finds a median of 1.7% (MAD=1.9%) additional links that are wrong. In the worst case, JL identifies 59.4% additional links. This happened for the commons-jcs project and was due to the very frequent usage of version numbers within commit messages. Further investigation revealed that the usage of version numbers was the main reason for the false positive links by JL for all projects.
SZZ finds a median of 85.4% (MAD=19.6%) of the correct links to issues. We note that the results of SZZ strongly vary: in the worst case, SZZ only finds 18.8% of the correct links. This happened for the commons-collections project. When we evaluated this, we found that for commons-collections, many issues were never assigned to a developer in Jira. This breaks the semantic check of SZZ for the equality of the assignee in the issue tracker and the author of the commit. Further investigation revealed that this semantic check did not hold for all missed links by SZZ. Moreover, SZZ identifies a median of 12.3% (MAD=15.5%) additional links that are wrong. We note that while the median is relatively low, there is a long tail of projects with many additional links. There are even two outliers which are not shown in Figure 1. The outliers are for parquet-mr (430.4% additional links) and cayenne (124.3% additional links). In both cases, the broken links are due to links to pull requests on GitHub, which have the pattern #NUMBER. Since SZZ cannot distinguish between different issue tracking systems, all these numbers are checked against the Jira of the projects and lead to additional links. Further investigation revealed that links to pull requests were the most frequent reason for additional links in general. Commonly used numbers were also problematic, but not as frequent.
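The percentages of correct and additional links can be read as set operations on the links of one project. The sketch below assumes that both values are reported relative to the number of baseline links, which is our reading of how more than 100% additional links can occur. The function name and the example links are hypothetical.

```python
def compare_to_baseline(baseline, candidate):
    """Return (% of baseline artifacts found, % additional artifacts)."""
    baseline, candidate = set(baseline), set(candidate)
    found = 100.0 * len(baseline & candidate) / len(baseline)
    additional = 100.0 * len(candidate - baseline) / len(baseline)
    return found, additional

# Hypothetical links as (commit, issue) pairs.
validated_links = {("c1", "A-1"), ("c2", "A-2"), ("c3", "A-3"), ("c4", "A-4")}
detected_links = {("c1", "A-1"), ("c2", "A-2"), ("c3", "A-3"),
                  ("c5", "A-1"), ("c6", "A-12")}

found, additional = compare_to_baseline(validated_links, detected_links)
```

In this example, three of the four validated links are found (75%), and the two wrong links amount to 50% additional links. The per-project values are then summarized with the median and the MAD.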
5.4 Issue Type Validation
The second part of the validation of the quality of the issue data is a partial conceptual replication of the results by Herzig et al. Based on the data for five projects by Herzig et al., we expect that between 27.4% and 42.9% of BUG issues are mislabeled with 99.5% confidence. The first and third author of this paper manually labeled the types for all linked issues of type BUG, regardless of whether the link was established by SZZ, JL, or JLM. All three authors determined the correct label as committee in case there was a conflict between the two independent labels. This way, we manually validated the issue type for all 11,295 issues that were linked by commits. Figure 2 summarizes the results of the evaluation. Overall, we found that a median of 42.3% (MAD=13.3%) of linked BUG issues are mislabeled: 29.0% (MAD=6.4%) are actually IMPROVEMENT, 5.2% (MAD=3.6%) are DOC, 3.0% (MAD=2.0%) are TEST, and 3.8% (MAD=3.1%) are OTHER. (The sum of the median values for IMPROVEMENT, DOC, TEST, and OTHER is 41.0%, i.e., less than the median of not being a BUG, which is 42.5%. This is possible because the median is not linear.) Thus, our results replicate the findings by Herzig et al., even though we are close to the upper bound of the confidence interval. Another way to read these numbers is that for every correctly labeled BUG issue, there is a median of 0.73 incorrectly labeled issues. Figure 2 demonstrates that all projects are affected by this kind of noise in the data, i.e., even in the best case about one fifth of the bugs are mislabeled.
5.5 Bug Fixing Commits
Neither the broken links, nor the wrong issue types have direct impact on defect prediction. The impact on defect prediction research only manifests, if there are false positives or false negatives for the labeling of bug fixing commits based on this data. To validate how the bug fixing commits change, we compare SZZ, JL, JLM, and JLMIV with each other, using JLMIV as baseline. JLMIV constitutes our ground truth for this evaluation, because it is based on the validated links and the validated issues. JLMIV labels a median of 5.6% (MAD=3.3%) of the total commits of a project as bug fixing. Figure 3 summarizes the actual bug fixes that are detected correctly (true positives) and the wrongly detected bug fixing commits (false positives). The results for the true positives mirror the results of the detection between the commits and issues. SZZ identifies a median of 86.9% (MAD=18.2%) of all bug fixing commits, JL is almost perfect with a median of 100% (MAD=0.4%). JLM identifies all bug fixes identified by JLMIV, because JLMIV uses the same links as JLM and only reduces the bug fixing commits, because fewer issues are considered as BUG.
Regarding the false positives, our results are mostly influenced by the issue type validation. SZZ finds a median of 81.1% (MAD=40.0%) false positive bug fixing commits, JL finds a median of 86.3% (MAD=40.4%), and JLM finds a median of 78.9% (MAD=39.3%). While these numbers are very high, they are expected given that there is a median of 0.73 wrong BUG issues for every correct BUG issue. There are also additional false positives for SZZ and JL due to additional issue links that are wrong. We note that SZZ has a lower median than JL. This is counterintuitive, because SZZ should have more false positives, since SZZ has more additional issue links than JL. However, this is offset by the correct links that SZZ missed. These not only cause SZZ to miss true positives, but also to miss false positives due to wrong issue labels.
5.6 Ground Truth for Inducing Changes and Affected Releases
We were able to establish ground truth data for all our empirical data so far. Unfortunately, this was not possible for us for the bug inducing changes and the assignment of bugs to releases, including the quality of the data in the affected versions field of Jira. When we considered if we could achieve this through manual validation, we came to the conclusion that this is impossible on this scale for a group of researchers (Section 6.3). Even on a smaller scale, we could often not be sure if our assessment is correct, because we lack understanding of the details of the source code within the projects. We could also not re-use an established approach from the literature. To the best of our knowledge, only Mills et al.  determined similar ground truth data, but only for the file actions addressed in bug fixing commits, not their inducing counterparts. The projects that are part of our empirical study that were also investigated by Mills et al. are mahout and tika. Mills et al. sampled 34 issues for mahout and 22 issues for tika, 23 of these issues are validated bug fixes in our data. We cross-checked our data with these 23 issues to determine if all files that were found to be part of the bug fix would be detected as such by our approach. This is the case, if there is an inducing change for the file. We found that we correctly detect which file actions were actually part of the fix for 21 of the issues. For the issue TIKA-1110, we fail to identify one file, that Mills et al. say is part of the fix. However, this is the deletion of a file, because there are no further references to it. Since the deletion does not cause any change to the logic of the code, this is not a problem of our data. The cases where we miss changes are for TIKA-1070 and TIKA-961. In both cases, there is a pure addition of source code. Since pure additions do not have inducing changes, we cannot identify an inducing commit and would, therefore, miss the related file. 
Overall, our approach was correct for 21 of 23 issues, i.e., 91% of the issues.
While this small sample gives us some confidence regarding the file actions in the bug fixing commits, we have no data yet regarding the inducing changes. Since the manual labeling of inducing changes is even more difficult than the labeling of fixing changes, we wanted to rely on the developers for this task. Thus, we looked for information in our data, which we could use to accurately determine inducing changes.
We were able to extract this knowledge from our data by exploiting the false positive links of our JL approach. One reason for false positives is that an issue is referenced in a commit message as the cause of a bug. Thus, we scanned all bug fixing commits identified by JLMIV for false positive issue links. We manually checked the commit messages and found twenty commits which clearly state that one issue was the cause of another issue. It follows that the bug was in the software when the work on the inducing issue was finished. Thus, we looked up the commits in our data which were also linked to the inducing issue. In case there were multiple commits on the inducing issue, we manually validated which of the commits on the inducing issue was the latest change that was related to the work on the fixed issue and marked this commit as the inducing commit. In all but one case, this was the latest commit on the inducing issue. The exception is the work on TIKA-2483 (https://github.com/apache/tika/commit/06486c), where the inducing change was not in the latest change (https://github.com/apache/tika/commit/6930ff), but one of the prior changes (https://github.com/apache/tika/commit/3aab15). Table IV lists the data we retrieved. We first note that ten of the issues we found only existed for less than five days in the projects. This is expected, because this data is not an unbiased sample from all bug fixing commits, but rather a sample of commits where developers identified a concrete issue as the reason. In these cases, a developer immediately noticed and fixed the problem and referenced the prior work. The other ten issues lived longer, one issue even existed for more than one year.
We use this data to evaluate three aspects: 1) the accuracy of the detection of the latest inducing change with JLMIV+; 2) the quality of the data in the affected versions field of the issue tracking system; and 3) the correctness of the assignment of the bugs to releases based on a six month timeframe (6M), the affected versions field (AV), and the inducing changes determined by JLMIV+ (IND).
JLMIV+ finds the correct latest commit regarding the inducing file action for sixteen of the twenty issues. In four cases, JLMIV+ finds a commit that is newer than the actual inducing change. This is an expected weakness of JLMIV+ and of strategies that use the blame mechanism to find the most recent change in a version control system in general. This happens if there is a change on the defective source code between the inducing change and the bug fixing change.
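The weakness of blame-based strategies described above can be illustrated with a minimal sketch. It assumes a strongly simplified model in which commits are integers ordered along a linear history and we already know which commits touched the defective line; the function name and the commit numbers are hypothetical and not part of our tooling.

```python
def blame_latest_change(touching_commits, fix_commit):
    """Return the most recent commit before the fix that touched the defective line.

    Commits are modeled as positions in a linear history (a hypothetical
    simplification); real blame works on file revisions in the version
    control system.
    """
    earlier = [c for c in touching_commits if c < fix_commit]
    return max(earlier) if earlier else None


# Commit 3 introduced the bug. Commit 7 later modified the same line for an
# unrelated reason (e.g., a renamed variable). The fix is commit 10.
actual_inducing = 3
blamed = blame_latest_change([3, 7], fix_commit=10)

# The blamed commit is newer than the actual inducing change.
deviation = blamed - actual_inducing
```

In this toy history, the blame mechanism reports commit 7 because it is the latest change to the defective line, which mirrors the four cases in which JLMIV+ finds a commit that is newer than the actual inducing change.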
Regarding the affected versions field, we note that there are nine cases, in which the field was not used in the issue tracker. In seven of those cases, this is correct, as the bugs never affected a release, i.e., they were introduced and fixed between releases. In the other two cases, we validated that the bugs affected multiple releases of the software. Of the eleven cases, where the affected version field is defined, only one entry is completely correct (VALIDATOR-376). For two issues, the affected version fields contain a partially correct entry (TIKA-599 and TIKA-2483). In both of these cases, only the latest release is mentioned, the older releases which are also affected are ignored. The remaining eight entries of the affected version field are wrong, i.e., they list releases which are not affected by the bugs. In six cases, the bugs were fixed prior to the release (IVY-882, NUTCH-683, PARQUET-214, LENS-538, GIRAPH-88, GIRAPH-34), in the other two cases the bugs were only induced after the release (KYLIN-3223, EAGLE-573). In the first six cases, the developers assigned the version number of the release that is currently a work in progress, in the last two cases they assigned the version number of the latest release.
The problems with the affected version field are more severe than we expected. Overall, the bugs in this sample affect 10 releases, but the affected version field only mentions three releases correctly. We expected that we would find this kind of error in the data of the affected version field, even though we expected fewer differences. Our analysis revealed that the affected versions field may also contain affected versions that are wrong, which we did not expect. The case where the version of the work in progress release is used is relatively harmless for defect prediction: since the bug fixing commit is before the release, the commit will not be considered during the labeling of the release and the wrong value of the field will, consequently, be ignored. Similarly, our proposed improvement for the detection of inducing changes JLMIV+AV would just use the date of the reporting of the bug and, therefore, also not be affected. The second case, where the latest release is entered even though the bug was never in that release, is problematic. This leads to an additional assignment to a release and also breaks JLMIV+AV, as the actual inducing changes are after the date of this wrongly mentioned release and would, therefore, be flagged as suspect.
The last aspect is the assignment to releases. Assignment based on bug fixes six months after the release is correct for twelve issues, assignment based on affected versions is correct for eleven issues, and assignment based on the inducing changes is correct for all twenty issues. The correctness of the six month time frame depends on two factors: the time to fix and the activity of the project. In case the time to fix was more than six months, the correct release was missed. In case the project was very active, e.g., with multiple releases within the last six months, the bug would be assigned to a release in which it was not yet introduced into the software. With the exception of VALIDATOR-273, the release assignment based on the affected version field is correct if no release was affected and the affected versions field either contained a not yet released version or was empty. The assignment based on inducing changes is correct, even in the four cases where a wrong change is identified as inducing. In one of these cases, there is only a small deviation of less than one week between the actual time to fix and the determined inducing change. In the cases of PARQUET-214 and VALIDATOR-376, the inducing change is relatively far off and it is pure chance due to the project activity that the assignment is correct.
|Fixed Issue||Inducing Issue||Fix Date||Inducing Date||Time to Fix||Deviation JLMIV+||Affected Versions Field||Actually Affected Releases||6M||AV||IND|
|IVY-882||IVY-857||2008-08-22||2008-07-08||45 days||0 days||2.0-RC1||-||✓||(✓)||✓|
|CALCITE-2253||CALCITE-2206||2018-04-16||2018-04-11||5 days||0 days||-||-||✓||✓|
|CALCITE-1215||CALCITE-1212||2016-04-26||2016-04-26||0 days||0 days||-||-||✓||✓|
|CALCITE-822||CALCITE-783||2015-08-08||2015-07-21||17 days||0 days||-||-||✓||✓|
|KYLIN-3223||KYLIN-3239||2018-02-10||2018-02-09||1 days||0 days||2.2.0||-||✓|
|NUTCH-683||NUTCH-676||2009-02-11||2009-01-21||20 days||0 days||1.0.0||-||✓||(✓)||✓|
|PARQUET-214||PARQUET-139||2015-03-31||2015-02-05||54 days||-29 days||1.6.0||-||✓||(✓)||✓|
|TIKA-599||TIKA-528||2011-03-09||2010-10-12||147 days||0 days||0.9||0.8, 0.9||✓||✓|
|TIKA-2483||TIKA-2311||2017-11-14||2017-05-01||196 days||0 days||1.16||1.15, 1.16||✓||✓|
|SYSTEMML-1126||SYSTEMML-584||2017-02-25||2016-04-07||323 days||-7 days||-||0.10, 0.11, 0.12, 0.13||✓|
|SYSTEMML-2162||SYSTEMML-1919||2018-02-28||2017-09-17||163 days||0 days||-||1.0.0||✓|
|SYSTEMML-2275||SYSTEMML-2217||2018-04-22||2018-03-30||23 days||0 days||-||-||✓||✓|
|LENS-538||LENS-486||2015-05-06||2015-05-05||0 days||0 days||2.2||-||✓||(✓)||✓|
|KNOX-1134||KNOX-1119||2017-12-01||2017-11-29||2 days||0 days||-||-||✓|
|VALIDATOR-376||VALIDATOR-273||2015-10-25||2014-07-07||474 days||-174 days||1.4.1||(1.4.1)||✓||✓||✓|
|GIRAPH-918||GIRAPH-908||2014-06-10||2014-06-08||2 days||0 days||-||-||✓||✓|
|GIRAPH-832||GIRAPH-792||2014-01-30||2014-01-28||1 days||0 days||-||-||✓||✓|
|GIRAPH-88||GIRAPH-11||2011-11-15||2011-11-15||0 days||0 days||0.1.0||-||✓||(✓)||✓|
|GIRAPH-34||GIRAPH-27||2011-09-16||2011-09-12||4 days||-1 days||0.1.0, 1.0.0||-||✓||(✓)||✓|
|EAGLE-573||EAGLE-569||2016-09-28||2016-09-27||0 days||0 days||0.5.0||-||✓||✓|
5.7 Bug Inducing Changes
For just-in-time data, the identification of bug fixing commits is only the precursor for finding the inducing changes, which are then the target of the prediction. Moreover, we described in Section 4 how the inducing changes can be used for assigning defects to releases. To this aim, we compare four approaches for the identification of bug inducing changes: 1) the standard SZZ algorithm; 2) JLMIV, i.e., our improved linking with the issue validation, but standard SZZ to determine inducing changes; 3) JLMIV+, i.e., the improvement of standard SZZ to ignore changes to non-Java files, whitespace only changes, and documentation changes; and 4) JLMIV+AV, which further extends JLMIV+ by using the affected versions field.
We do not have ground truth data for this analysis, as discussed in Section 5.6. Instead, we use JLMIV+ as a proxy for the ground truth and analyze the differences between SZZ, JLMIV, and JLMIV+AV with respect to JLMIV+. The reason for this is two-fold. First, the inducing changes of JLMIV+ are a subset of JLMIV that only reduces the inducing changes, e.g., due to whitespace changes. Thus, in case of deviations, the change identified by JLMIV is always a false positive. Second, JLMIV+ is based on our ground truth for bug fixing commits. Since SZZ uses the same inducing strategy as JLMIV, but is based on the inferior SZZ labels for bug fixing commits, all deviations of SZZ from JLMIV+ are also mislabels. Regarding JLMIV+AV, we cannot state whether deviations from JLMIV+ are correct or not: this depends on the affected version field. In case the affected version field contains valid data, JLMIV+AV is likely to be correct, because the identification of suspect changes is improved. In case of invalid data, JLMIV+ is likely to be correct, because the inducing changes would be wrongly flagged as suspect by JLMIV+AV.
Figure 4 summarizes the results for the inducing strategies. JLMIV+ finds that a median of 5.7% (MAD=3.7%) of the commits are bug inducing. If we only consider the 78,296 commits in which at least one Java production file (i.e., a Java file excluding tests and documentation) was changed, the percentage of bug inducing commits has a median of 8.6% (MAD=5.4%). When we consider this on the level of changes to files, as is done by Pascarella et al., we find that a median of 2.7% (MAD=1.3%) of all changes to Java production files are inducing for a bug.
SZZ identifies a median of 92.8% (MAD=10.9%) of the correct bug inducing commits and a median of 86.3% (MAD=53.5%) false positive bug inducing commits. SZZ identifies a median of 91.6% (MAD=12.1%) of the correct bug inducing file actions and a median of 99.6% (MAD=77.1%) of false positive inducing file actions of Java production files. These values are similar to the results for the bug fixing commits, i.e., mislabels due to the inducing strategy are hidden by the large number of mislabeled bug fixes. The evaluation of JLMIV gives a better insight into the inducing strategy, because there is no noise due to mislabeled bug fixing commits. JLMIV identifies a median of 6.8% (MAD=5.6%) false positive bug inducing commits and a median of 7.7% (MAD=5.7%) false positive inducing file actions for Java production files. This reduction is in line with the expectations due to the results from Mills et al., who found that 8.7% of false positives for the bug fixing actions are due to changes to comments and whitespace only changes. JLMIV+ also excludes file changes to non-production code. The comparison above is already restricted to production code in order not to give JLMIV+ an unfair advantage, as such file actions could also be excluded by downstream analysis. If we compare the file actions between JLMIV and JLMIV+ on all Java file changes, JLMIV finds a median of 56.1% (MAD=33.3%) additional file changes. This demonstrates that the inclusion of documentation code or test code can drastically alter the resulting data.
With respect to JLMIV+AV, we find that a median of 4.4% (MAD=4.8%) fewer commits and a median of 4.7% (MAD=4.4%) fewer file actions are detected than with JLMIV+. Based on our limited data regarding the correctness of the affected versions field, we would expect that roughly half of these file actions are actually false positives (Section 5.6), i.e., are incorrectly detected by JLMIV+ and constitute noise. Thus, the potential impact of the affected versions field is relatively small, with an expected reduction of false positive inducing changes by about 2.4% of the total amount of inducing file actions. Regardless, we cannot recommend using JLMIV+AV without first validating the data in the affected version field, because the potential benefit due to fewer false positives is offset by an equally large loss due to false negatives.
5.8 Assignment to Releases
The literature suggests either to assign all bugs that are fixed within six months after a release to the release (6M) or to use the affected versions field (AV). In this article, we propose to use the inducing changes instead (IND). We evaluate the release assignment from two perspectives: the assigned issues and the files that are labeled as defective due to the assigned issues. As for the inducing changes, we do not have ground truth. Regardless, the results from Section 5.6 indicate that the assignment based on IND is the most reliable strategy, even though it is likely not flawless. Therefore, we evaluate the deviations of 6M and AV from IND. We use the 6M strategy both with the bug fixing commits determined by SZZ (6M-SZZ) and with those determined by JLMIV (6M-JLMIV). For the affected versions, we also use the bug fixing commits determined by JLMIV (AV-JLMIV). For the assignment based on inducing changes, we use JLMIV+ (IND-JLMIV+).
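The difference between 6M and IND can be made concrete with a minimal sketch. It assumes that IND assigns a bug to every release that was published after the inducing change and before the fix, and that six months are approximated by 183 days; the release names, dates, and function names are hypothetical and only serve as illustration.

```python
from datetime import date, timedelta


def assign_6m(fix_date, releases, window_days=183):
    """6M: assign the bug to every release that precedes the fix by at most ~six months."""
    window = timedelta(days=window_days)
    return [name for name, released in releases.items()
            if timedelta(0) <= fix_date - released <= window]


def assign_ind(inducing_date, fix_date, releases):
    """IND (assumed definition): releases published after the inducing change and before the fix."""
    return [name for name, released in releases.items()
            if inducing_date <= released < fix_date]


# Case 1: long time to fix. The bug is in release 1.0, but the fix comes too late for 6M.
slow = {"1.0": date(2014, 9, 1)}
slow_6m = assign_6m(date(2015, 10, 25), slow)
slow_ind = assign_ind(date(2014, 7, 7), date(2015, 10, 25), slow)

# Case 2: bug induced and fixed shortly after release 2.0, so 2.0 never contained it.
fast = {"2.0": date(2017, 12, 1)}
fast_6m = assign_6m(date(2018, 2, 10), fast)
fast_ind = assign_ind(date(2018, 2, 9), date(2018, 2, 10), fast)
```

Under these assumptions, 6M misses the affected release in the first case and assigns the bug to an unaffected release in the second case, whereas IND yields the expected assignment in both.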
Figure 5 summarizes the results for the release assignments. Overall, a median of 14 (MAD=14.8%) bugs affect a median of 5.2% (MAD=4.8%) files of a release. For 32 releases, we did not find any bug fixes. We marked these releases in italic in Table III. Seventeen of these releases are the first releases for the projects, ten releases are the last in our data. The other five releases are for stable versions of Apache Commons projects. Without these 33 releases, the median of bugs that affect a release is 16 (MAD=14.8%) and 5.7% (MAD=4.6%) files are defective. We report the results of 6M-SZZ, 6M-JLMIV and AV-JLMIV without the 33 releases that do not have any bug fix as they may skew the results.
6M-SZZ determines a median of 16.7% (MAD=24.7%) of the bugs and a median of 32.7% (MAD=39.2%) of the files correctly, and determines 32.1% (MAD=47.6%) additional bugs and 36.1% (MAD=53.5%) false positive defective files. 6M-JLMIV determines a median of 23.3% (MAD=26.3%) of the bugs and a median of 32.7% (MAD=36.7%) of the files correctly, and determines a median of 11.0% (MAD=16.3%) additional bugs and 11.9% (MAD=17.6%) false positive defective files. AV-JLMIV determines a median of 13.5% (MAD=20.0%) of the bugs and a median of 19.1% (MAD=28.4%) of the files correctly, and determines 5.3% (MAD=7.9%) additional bugs and 5.5% (MAD=8.1%) false positive defective files. Thus, AV-JLMIV labels the fewest files as defective of all variants. We compared these results with the ground truth data from Section 5.6. While the deviations are not equal, they show similar trends to the sample depicted in Table IV, both regarding the 6M strategy and the AV strategy.
5.9 Impact of Feature Sets
The focus of our empirical analysis is on the quality of defect labels. Regardless, we also wish to provide an indication whether the lack of features in many data sets is really a problem. A commonly used approach is to perform a correlation analysis between features and the labels of the data. However, this correlation does not reveal if there are actual advantages in using more features for predictions. Thus, we rather perform a release-level defect prediction experiment based on all releases that have more than ten defective files. Overall, these are 239 releases in our data. For these releases, we perform an out-of-sample bootstrap experiment with 100 bootstrap samples, as suggested for performance estimations by Tantithamthavorn et al. For each release, we randomly draw a bootstrap sample from the data for a release as training data and use all samples that are not part of the bootstrap sample as test data. We repeat this 100 times for each release and use the mean as estimate of the prediction performance. We evaluate the difference in prediction performance between these two feature sets using the F-measure, AUC, recall, and precision. Benchmarks show that xgBoost has prediction power that is competitive with random forests. The advantage of using xgBoost is that the algorithm provides a deterministic estimation of the importance of each feature, which we can exploit to get some insights into the usage of the available features.
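The out-of-sample bootstrap described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not our actual experiment code; `evaluate` is a hypothetical callback that would train a classifier on the training indices and return a performance score on the test indices.

```python
import random


def out_of_sample_bootstrap(n_samples, rng):
    """Draw n_samples indices with replacement for training.

    All indices that were never drawn form the test data.
    """
    train = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(train)
    test = [i for i in range(n_samples) if i not in drawn]
    return train, test


def bootstrap_estimate(n_samples, evaluate, n_iterations=100, seed=0):
    """Mean score over the bootstrap iterations (hypothetical helper)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_iterations):
        train, test = out_of_sample_bootstrap(n_samples, rng)
        if test:  # skip the rare iteration in which every index was drawn
            scores.append(evaluate(train, test))
    if not scores:
        raise ValueError("no iteration produced a non-empty test set")
    return sum(scores) / len(scores)


# Dummy evaluation: the fraction of the data that ends up as test data.
share = bootstrap_estimate(200, lambda train, test: len(test) / 200)
```

With the dummy evaluation, `share` approximates the expected fraction of data that is not part of a bootstrap sample, which is roughly 37% for larger data sets.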
We train the xgBoost classifiers using two different feature sets: 1) all features in our data (ALL) and 2) only features based on static product metrics for classes (aggregated to the file level by summation) and files (SM), i.e., features similar to those of the PROMISE data, one of the most popular defect prediction data sets. Figure 6 shows on the y-axes the performance we achieved with ALL features plotted versus the performance achieved with the SM features on the x-axes. With ALL features, we measured a median F-measure of 21.7% (MAD=22.5%), AUC of 57.4% (MAD=6.3%), recall of 42.6% (MAD=28.5%), and precision of 16.5% (MAD=18.0%). With the SM features, we measured a median F-measure of 18.4% (MAD=18.4%), AUC of 55.7% (MAD=6.6%), recall of 32.8% (MAD=22.7%), and precision of 14.3% (MAD=15.6%). The differences for F-measure (p=0.0008, effect size 0.0407) and recall (p10, effect size 0.1465) are statistically significant with negligible effect sizes. The differences for AUC (p=0.0026) and precision (p=0.2909) are not statistically significant. Thus, using more features may mildly improve prediction performance due to a better recall, which is responsible for the significant increase in F-measure. We note that the AUC is only barely non-significant and the effect for the improvement is extremely close to the boundary of a small effect size.
We also used the feature importances to assess which features were actually used. For this, we considered the aggregated feature importances over all bootstrap iterations for all releases together, i.e., for a total of 239 × 100 = 23,900 classifiers. We considered two aspects: 1) if a feature was used at all, i.e., if the feature importance was greater than zero for any classifier; and 2) the Top-10% of the used features given the mean feature importances over all classifiers. We explicitly avoid the reporting of concrete ranks or importance scores here. The reason is that a consideration of the mean feature importances over all classifiers does not account for correlations between features and is, therefore, biased. For example, the different aggregations of metrics are strongly correlated with each other.
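The two aspects of the feature importance analysis can be sketched as follows. The sketch assumes that the importances are available as one list of scores per feature, with one score per classifier, and that the Top-10% is rounded up to at least one feature; the function name and the example values are hypothetical.

```python
def summarize_importances(importances):
    """Split features into never-used ones and the Top-10% of the used ones.

    importances maps a feature name to its importance scores, one per classifier.
    """
    # Aspect 1: a feature counts as used if any classifier gave it importance > 0.
    used = {name: scores for name, scores in importances.items()
            if any(score > 0 for score in scores)}
    never_used = sorted(set(importances) - set(used))

    # Aspect 2: rank the used features by their mean importance over all classifiers.
    mean = {name: sum(scores) / len(scores) for name, scores in used.items()}
    k = max(1, (len(used) + 9) // 10)  # ten percent of the used features, rounded up
    top = sorted(mean, key=mean.get, reverse=True)[:k]
    return never_used, top


example = {
    "loc_sum": [0.5, 0.3],
    "annotation_count": [0.0, 0.0],
    "churn": [0.1, 0.0],
}
never_used, top = summarize_importances(example)
```

In the example, `annotation_count` is never used and `loc_sum` is the single feature in the Top-10% of the two used features. As noted above, such a ranking by mean importance ignores correlations between features.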
We found that 2022 features were never used. These features are mostly static metrics, especially for annotations (720 of 780 never used), enums (580 of 780) and interfaces (529 of 780). There are also many PMD warning types (71 of 193) that are not used. Prior issues are also not all relevant, especially if the severity is not one of blocker, critical, major, minor, or trivial. We attribute the latter to the sparsity of linked issues of other types. The Top-10% of features are mostly static features of the source code, either (aggregations of) static metrics measured with SourceMeter (159) or AST node counts (33), but also ten metrics proposed by Moser et al., five features regarding PMD warnings, four metrics by Hassan, four change types determined by ChangeDistiller, as well as the number of major bugs fixed in the last six months. Thus, while prediction models mostly use static features, they also exploit the large number of features that are available. Regardless, the high number of static features in the Top-10% of features supports our prior finding, i.e., that including other types of features yields only a small advantage in prediction performance.
6 Discussion
Within this section, we discuss the implications of the results of our empirical study. We consider three different aspects: the defect labels, the impact of features on predictions, as well as open issues.
6.1 Defect Labels
Through our empirical study, we showed that the concerns raised in recent years regarding the validity of defect labeling are valid. The two biggest factors that influence the quality of defect labels determined by SZZ are mislabeled issues in Jira and bugs that are mistakenly assigned to releases. We proposed a solution for the latter problem by assigning bugs to releases based on inducing commits. The problem of mislabeled issues is more severe: to the best of our knowledge, there is no automated heuristic that can identify mislabeled issues in issue trackers. Our approach was to employ manual analysis by experts, as did Herzig et al. However, this is extremely time consuming and does not scale well. In our case, we validated 11,295 issues overall for the 38 projects. While we did not measure the exact time, two authors of this article spent at least one person month each on the independent labeling, i.e., at least 176 hours each. Additionally, all three authors spent at least 20 hours each on the resolution of disagreements. Thus, we spent at least 412 working hours on the issue type validation; the actual number is more likely around 600 working hours. Consequently, we required between two and three minutes per issue. This estimate is similar to the four minutes that Herzig et al. reportedly required.
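The effort estimate above follows from simple arithmetic, which the following sketch reproduces. All numbers are taken from the text; only the variable and function names are hypothetical.

```python
ISSUES = 11295             # validated issues
LABELING_HOURS = 2 * 176   # two authors, at least one person month each
RESOLUTION_HOURS = 3 * 20  # three authors, at least 20 hours each

lower_bound_hours = LABELING_HOURS + RESOLUTION_HOURS  # 412 hours
likely_hours = 600                                     # more realistic estimate


def minutes_per_issue(hours):
    """Average validation time per issue in minutes."""
    return hours * 60 / ISSUES


low = minutes_per_issue(lower_bound_hours)
high = minutes_per_issue(likely_hours)
```

The lower bound corresponds to roughly 2.2 minutes per issue and the more realistic estimate to roughly 3.2 minutes per issue.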
We note that the mislabeled issues are mislabeled from our perspective as researchers that want to analyze defects in software repositories. Thus, issues like incompatibilities due to new Java versions, failing tests, or missing documentation, are mislabels, because these do not constitute bugs in the sense of wrong run-time behavior of the software at the time of the release. From the perspective of the developers, these may not be mislabels, because they may use a more practical definition of bug in the issue tracker: something that is undesirable. Thus, we believe that these mislabels will remain a systematic issue for the analysis of issue tracking data.
Regarding the impact of the problems with defect labeling, our study revealed that all problems are replicable and that the impact increases if the problems are considered together. The problems with issue linking and the issue types in the repository combined mean that SZZ misses about one fifth of the bug fixing commits, and only about half of the commits SZZ identifies as bug fixing are actually bug fixing. This is because SZZ only identifies about 80% of the actual bug fixing commits, but also mislabels roughly the same amount of commits as bug fixing because of the mislabeled issues. If this is combined with a six month time frame for assigning defects to releases, the problem becomes even more severe. For every file that is correctly labeled as defective, there is roughly one file that is incorrectly labeled as defective, and two files that are incorrectly labeled as non-defective. Yatish et al. proposed using the affected version field of issue trackers as a solution for the release assignment problem. While this is in principle a perfect solution, the reality of the data in the issue tracking system shows that the information contained in the affected version field is unreliable. Overall, using inducing changes currently seems to be the best heuristic.
Another important aspect to consider here is that there is still noise in our data due to false positive defect labels. We did not validate the file actions and instead only applied a heuristic that ignored changes to whitespace and comments. Given the results from Mills et al., we may have as much noise as one false positive defect label for every four correct defect labels, which would affect both release-level and just-in-time defect prediction. Neto et al. recently found similar results to Mills et al. with respect to the percentage of refactorings in the data using a new variant of SZZ that filters changes using the refactoring detection tool RefDiff. Thus, while we may have measured the largest part of the noise in the defect labels, there is still an uncertain region, which is also not accounted for in our data. Refactoring detection can further improve the identification of inducing changes and, therefore, improve the results of the release assignment and of the labeling of files. We note that these additional improvements would only lead to more deviations of SZZ from the actual data, because SZZ would identify even more false positives.
All data sets we discussed in Section 2 are affected by the problems regarding the defect labels. This is a severe threat to all publications that use this data and, therefore, basically to the complete state of the art of defect prediction. If and how results are impacted by this is unclear. It may be that empirical results are not affected, because the signal of the defective data was still strong enough to be picked up by the analysis. It may also be that the outcome of experiments changes, because different software artifacts are labeled as defective. We expect that the impact of these findings on release-level defect prediction is stronger than on just-in-time defect prediction. The reason for this is that the 6M release assignment strategy as a source of noise is not present for just-in-time data.
6.2 Features
Our initial motivation for this research was that we wanted to have a defect prediction data set with a broad selection of features, because we believed that this was the key ingredient that was missing for highly performing defect prediction models. The indications from the literature suggested that this was true (Section 3.2), especially due to churn related features. Because we considered how we should collect the defect data, we discovered the need for our analysis of the defect labels and the focus of our research evolved: due to our findings regarding the defect labels, the analysis of the features is now only in the background of this article. Regardless, we believed that the small experiment we described in Section 5.9 would show that the large feature set is beneficial and that we could point to future research to find a suitable subset from the large amount of features we use, e.g., to minimize the effort to collect the features without a loss in prediction performance.
Our results do not allow for the definitive conclusion that more features are really required. The significant increase in F-measure is negligible and is explained by an also significant but negligible increase in recall. Thus, more features predict more bugs, although the difference is likely negligible and the precision is not affected. The lack of a larger difference may be due to the simplicity of the setup: no explicit feature selection, no treatment of the class level imbalance, and no hyperparameter optimization. Please note that we intentionally avoided class level imbalance treatment because we believe that the precision is crucial for defect prediction and Tantithamthavorn et al. found that while there is a potential positive effect on AUC and recall, precision may be negatively affected by treating the class level imbalance. Nevertheless, the gradient boosting trees we used implicitly perform a feature selection and are relatively robust to class level imbalances. The lack of optimization should affect both feature sets. Thus, while our results may be suboptimal, the difference should still be more pronounced given the indications from the literature.
A closer look at the literature reveals a potential reason for this lack of a stronger effect. The studies that clearly find that more features, especially churn-related features, lead to better prediction results were all conducted on relatively small data sets, i.e., on 26 releases, three releases, and five releases. On the other hand, Zhang et al. used 255 and only found a small effect of using aggregations. Rahman et al. even found no statistically significant difference when they added features based on PMD or FindBugs warnings to static metrics. Thus, the expectations that larger feature sets that go beyond static source code metrics improve predictions may be inflated.
6.3 Open Issues
While we empirically explore many issues regarding data for defect prediction, there are still open issues left, as well as new issues we discovered due to our results.
We have only used manual validation for the bug fixing commits, but not for the file actions in those commits or for the inducing changes that were detected. While we used smaller samples to get insights into the quality, this only helps us estimate the remaining noise in the data. Thus, the first open issue is to extend this data with validated file actions for bug fixes, inducing changes, and release assignments. For the data in our empirical study, we would need to manually validate 46,422 file actions for 10,515 bug fixing commits, as well as the release assignment of 6,530 bugs. These are 34.5 times more file actions than in the study by Mills et al., with the additional effort for validation of the inducing changes. Thus, this kind of problem cannot be solved by single research groups, but must be tackled by the complete community. If and how this could be done is, to the best of our knowledge, an open problem.
Moreover, our results raise several interesting and partially concerning questions for further research. We can only speculate how our results regarding the defect labels affect the state of the art. We may find that the same prediction models as before are the best, simply because they are good models, independent of the data. The results may also change, because with the different data, other algorithms may perform better. We are especially looking forward to how our data affects findings that trivial baselines may outperform machine learning, e.g., by Zhou et al. Recent work already considered similar issues with other variants of SZZ, but without accounting for the biggest source of noise, i.e., the mislabeled issues. The results indicate that these changes will have an effect.
Through a small experiment, we have already (inadvertently) shown that some results regarding the importance of features may need to be revisited. While we found improvements, they were negligible. However, because we only performed a relatively simple study, this only means that the impact is not as obvious as we expected. Future work may uncover subsets of our features which lead to bigger improvements or demonstrate that we only found negligible differences due to our use of gradient boosting trees. Even if future research finds that there really is not much of an improvement if features other than static metrics are used, the question of which features are best is still interesting. For example, just-in-time defect prediction relies mostly on features that are independent of the programming language. Whether the same would be possible for release-level defect prediction without losing predictive performance is unknown. Such results could help to broaden the scope of future research, because current research only considers a relatively small set of programming languages. Vice versa, just-in-time research avoids using static analysis tools and, hence, there is a lack of research on the use of features like the complexity of code changes. Future research could explore if the language-independent features are really sufficient.
7 Threats to Validity
Due to the scope of our empirical study, there are many threats to the validity of our findings. We discuss the construct validity, internal validity, external validity, and reliability separately, as suggested by Runeson and Höst.
7.1 Construct Validity
There are several threats to the construct validity of our experiments. We wrote a large amount of software for these experiments, which may contain defects. However, all software was tested and the results were manually checked. Especially the large amounts of manual analysis we conducted revealed many corner cases, which we could then handle correctly, mitigating the threats due to bugs in our software. Additionally, we may have selected unsuitable baselines for the comparison of results. To mitigate this threat, we created ground truth data as baseline where possible. In case this was not possible, we evaluated a sample from our baseline manually to establish whether our proxy for the baseline was suitable. Moreover, we cross-checked all our results with findings from related studies to evaluate the plausibility of our results. Finally, we may have used inadequate metrics for the measurement of differences. We mitigate this threat by only reporting deviations from the ground truth. To further mitigate this threat, we looked at the raw data and validated that the deviations are accurate reflections of the raw results.
7.2 Internal Validity
The results of the analysis of the defect labeling directly follow from the properties of the defect labeling algorithms, e.g., missing links to issues are the only source for false negative bug fixing commits with SZZ in our data. Therefore, there are no threats to the internal validity of this part of our empirical study. The conclusion that the difference with a larger set of features is negligible may be wrong or misleading. Other factors, especially properties that we cannot easily capture with performance metrics may yield different results, e.g., the acceptance of prediction models by developers could be higher because the recall is improved. Moreover, we only consider a pure classification scenario and no ranking of files by their likelihood of defect or a regression scenario for the prediction of the number of defects in a file. The additional features may lead to bigger differences in performance under these considerations.
7.3 External Validity
The main threat regarding the external validity of the results is due to our focus on Java projects that are developed under the umbrella of the Apache Software Foundation. Thus, it is unclear if and how our findings generalize to projects using other programming languages or software development outside of the Apache Software Foundation. We note that our analysis regarding the defect labeling issues is mostly independent of the programming language, with the exception of the identification of whitespace and comment-only changes. For example, whitespace changes may actually be changes to the logic of a program written in Python. Moreover, the Apache Software Foundation attracts a large number of developers both from industry and from the open source community. This increases the likelihood that aspects like the labeling of issues as bugs or the use of the affected versions field are similar in other contexts.
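The Python exception can be illustrated with a minimal sketch. The two snippets below are hypothetical and not taken from the studied projects. They differ only in the indentation of the last line, so a filter that ignores whitespace would treat the change as irrelevant, although the syntax tree and the behavior both change.

```python
import ast

# Hypothetical example: the two versions differ only in the
# indentation of the final return statement.
BEFORE = """
def count_positive(values):
    total = 0
    for v in values:
        if v > 0:
            total += 1
    return total
"""

AFTER = """
def count_positive(values):
    total = 0
    for v in values:
        if v > 0:
            total += 1
        return total
"""


def strip_whitespace(source: str) -> str:
    """Drop all whitespace, as a naive whitespace-insensitive comparison would."""
    return "".join(source.split())


def run(source: str, values):
    """Define count_positive from the given source and call it."""
    namespace = {}
    exec(source, namespace)
    return namespace["count_positive"](values)


# The texts are identical once whitespace is ignored ...
same_text = strip_whitespace(BEFORE) == strip_whitespace(AFTER)
# ... but the parsed programs are not, and neither are the results.
same_ast = ast.dump(ast.parse(BEFORE)) == ast.dump(ast.parse(AFTER))
result_before = run(BEFORE, [1, 2, 3])
result_after = run(AFTER, [1, 2, 3])
```

In a brace-delimited language such as Java, the same re-indentation would leave the program logic untouched, which is why a whitespace filter is reasonable there but not directly transferable to Python.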
To avoid bias in the manual validation of data, we involved multiple people. The issues were validated by two authors independently, and conflicts were resolved by three authors. While the initial validation of the issue links was conducted by only one author, a different author performed the manual analysis of a sample of 1000 issues for mislabels. Thus, we minimized the impact of individuals on the results to mitigate the threat to the reliability of the research.
Within this article we performed a critical assessment of the state of practice of the collection of defect prediction data. We summarized existing data sets and found that the SZZ algorithm is the standard approach for defect labeling and that most data sets only offer a limited set of features. This is in contrast to the state of the art that found issues with defect labeling using SZZ, as well as diverse features that should be valuable for defect prediction. To assess the impact of this difference, we performed an empirical study with the focus on the issues of defect labeling and found that SZZ identifies one incorrect bug fixing commit for each correct bug fixing commit, while still missing about one fifth of the bug fixing commits. The main reason for the mislabeled commits is mislabeled issues, a problem initially found by Herzig et al. For release-level defect prediction data, this problem is even worse, because most data sets use a six-month time frame to assign defects to releases. The combination of these issues means that for every correctly labeled defective file, there is one incorrectly labeled defective file and two missed defective files. Thus, there is a large amount of noise in the defect prediction data that is currently used and we can only speculate how this affects the state of the art.
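To make these ratios more tangible, they can be restated as the precision and recall of the labels themselves. The following sketch rests on an assumption that goes beyond the statements above: it reads correct labels as true positives, incorrect labels as false positives, and missed items as false negatives. The counts are illustrative units chosen to match the stated ratios, not measured values, and "about one fifth" is treated as exactly one fifth.

```python
from fractions import Fraction


def precision_recall(tp: int, fp: int, fn: int):
    """Precision and recall of a labeling, computed from raw counts."""
    precision = Fraction(tp, tp + fp)
    recall = Fraction(tp, tp + fn)
    return precision, recall


# Commit level: one incorrect bug fixing commit per correct one, and
# one of every five actual bug fixing commits missed (assumed exact).
commit_precision, commit_recall = precision_recall(tp=4, fp=4, fn=1)

# File level: per correctly labeled defective file, one incorrectly
# labeled defective file and two missed defective files.
file_precision, file_recall = precision_recall(tp=1, fp=1, fn=2)

print(f"commits: precision={float(commit_precision):.2f}, recall={float(commit_recall):.2f}")
print(f"files:   precision={float(file_precision):.2f}, recall={float(file_recall):.2f}")
```

Under this reading, the commit labels would have a precision of 1/2 and a recall of 4/5, while the file labels would have a precision of 1/2 and a recall of only 1/3.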
Regarding the features, we found that the difference of the prediction performance with more features is negligible, even though there is an improvement in the recall without a negative effect on the precision. This is in contrast to prior findings that highlighted the importance of, e.g., churn features. However, since our analysis of feature importances and the impact of larger feature sets was only rudimentary, additional research is required to establish what a suitable set of features for defect prediction looks like.
Another contribution of this article is a new defect prediction data set for both just-in-time and release-level defect prediction. Our data set is larger than any currently used data set, i.e., it contains more releases and projects, as well as more features. We hope that the data we produced as part of our work will help the research community to resolve the problems we found. On the one hand, we are looking forward to studies of defect prediction models using our data, both replications of existing work with the de-noised data and the assessment of new approaches and techniques. On the other hand, our data may also be used to improve automated defect labeling, e.g., by trying to automatically correct bug issue labels in issue trackers. Moreover, we hope that our data will be used as the foundation for the manual validation of file actions, to provide a ground truth assessment of the assignment of defects to files and releases.
This work is partially funded by DFG Grant HE 7854/4-1. We also want to thank the GWDG for the support in using their high-performance computing infrastructure, which enabled the collection of the large amounts of software metric data.
-  T. Hall, S. Beecham, D. Bowes, D. Gray, and S. Counsell, “A systematic literature review on fault prediction performance in software engineering,” IEEE Transactions on Software Engineering, vol. 38, no. 6, pp. 1276–1304, Nov 2012.
-  T. Menzies, R. Krishna, and D. Pryor, “The promise repository of empirical software engineering data,” 2015.
-  ——, “The seacraft repository of empirical software engineering data,” 2017.
-  “Nasa iv & v facility metrics data program,” 2004. [Online]. Available: http://web.archive.org/web/20110421024209/http://mdp.ivv.nasa.gov/repository.html
-  B. Turhan, T. Menzies, A. B. Bener, and J. Di Stefano, “On the relative value of cross-company and within-company data for defect prediction,” Empirical Software Engineering, vol. 14, no. 5, pp. 540–578, Oct 2009. [Online]. Available: https://doi.org/10.1007/s10664-008-9103-7
-  M. Jureczko and L. Madeyski, “Towards identifying software project clusters with regard to defect prediction,” in Proceedings of the 6th International Conference on Predictive Models in Software Engineering, ser. PROMISE ’10. New York, NY, USA: ACM, 2010, pp. 9:1–9:10. [Online]. Available: http://doi.acm.org/10.1145/1868328.1868342
-  S. Hosseini, B. Turhan, and D. Gunarathna, “A systematic literature review and meta-analysis on cross project defect prediction,” IEEE Transactions on Software Engineering, vol. PP, no. 99, pp. 1–1, 2017.
-  F. Trautsch, S. Herbold, P. Makedonski, and J. Grabowski, “Addressing problems with replicability and validity of repository mining studies through a smart data platform,” Empirical Software Engineering, Aug. 2017.
-  J. Śliwerski, T. Zimmermann, and A. Zeller, “When do changes induce fixes?” in Proceedings of the 2005 International Workshop on Mining Software Repositories, ser. MSR ’05. New York, NY, USA: ACM, 2005, pp. 1–5. [Online]. Available: http://doi.acm.org/10.1145/1082983.1083147
-  D. A. da Costa, S. McIntosh, W. Shang, U. Kulesza, R. Coelho, and A. E. Hassan, “A framework for evaluating the results of the szz approach for identifying bug-introducing changes,” IEEE Transactions on Software Engineering, vol. 43, no. 7, pp. 641–657, July 2017.
-  C. Mills, J. Pantiuchina, E. Parra, G. Bavota, and S. Haiduc, “Are bug reports enough for text retrieval-based bug localization?” in 2018 IEEE International Conference on Software Maintenance and Evolution (ICSME), Sep. 2018, pp. 381–392.
-  S. Yatish, J. Jiarpakdee, P. Thongtanunam, and C. Tantithamthavorn, “Mining software defects: Should we consider affected releases?” in Proceedings of the 41st International Conference on Software Engineering, ser. ICSE ’19. Piscataway, NJ, USA: IEEE Press, 2019, pp. 654–665. [Online]. Available: https://doi.org/10.1109/ICSE.2019.00075
-  K. Herzig, S. Just, and A. Zeller, “It’s not a bug, it’s a feature: How misclassification impacts bug prediction,” in Proceedings of the 2013 International Conference on Software Engineering, ser. ICSE ’13. Piscataway, NJ, USA: IEEE Press, 2013, pp. 392–401. [Online]. Available: http://dl.acm.org/citation.cfm?id=2486788.2486840
-  C. Bird, P. C. Rigby, E. T. Barr, D. J. Hamilton, D. M. German, and P. Devanbu, “The promises and perils of mining git,” in 2009 6th IEEE International Working Conference on Mining Software Repositories, May 2009, pp. 1–10.
-  V. Kovalenko, F. Palomba, and A. Bacchelli, “Mining file histories: Should we consider branches?” in Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ser. ASE 2018. New York, NY, USA: ACM, 2018, pp. 202–213. [Online]. Available: http://doi.acm.org/10.1145/3238147.3238169
-  R. Moser, W. Pedrycz, and G. Succi, “A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction,” in Proceedings of the 30th International Conference on Software Engineering, ser. ICSE ’08. New York, NY, USA: ACM, 2008, pp. 181–190. [Online]. Available: http://doi.acm.org/10.1145/1368088.1368114
-  M. D’Ambros, M. Lanza, and R. Robbes, “Evaluating defect prediction approaches: A benchmark and an extensive comparison,” Empirical Softw. Engg., vol. 17, no. 4-5, pp. 531–577, Aug. 2012. [Online]. Available: http://dx.doi.org/10.1007/s10664-011-9173-9
-  A. E. Hassan, “Predicting faults using the complexity of code changes,” in Proceedings of the 31st International Conference on Software Engineering, ser. ICSE ’09. Washington, DC, USA: IEEE Computer Society, 2009, pp. 78–88. [Online]. Available: http://dx.doi.org/10.1109/ICSE.2009.5070510
-  F. Zhang, A. E. Hassan, S. McIntosh, and Y. Zou, “The use of summation to aggregate software metrics hinders the performance of defect prediction models,” IEEE Transactions on Software Engineering, vol. 43, no. 5, pp. 476–491, May 2017.
-  T. Zimmermann, R. Premraj, and A. Zeller, “Predicting defects for eclipse,” in Proceedings of the Third International Workshop on Predictor Models in Software Engineering, ser. PROMISE ’07. Washington, DC, USA: IEEE Computer Society, 2007, pp. 9–. [Online]. Available: http://dx.doi.org/10.1109/PROMISE.2007.10
-  R. Wu, H. Zhang, S. Kim, and S.-C. Cheung, “Relink: Recovering links between bugs and changes,” in Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, ser. ESEC/FSE ’11. New York, NY, USA: ACM, 2011, pp. 15–25. [Online]. Available: http://doi.acm.org/10.1145/2025113.2025120
-  K. Herzig, S. Just, A. Rau, and A. Zeller, “Predicting defects using change genealogies,” in 2013 IEEE 24th International Symposium on Software Reliability Engineering (ISSRE), Nov 2013, pp. 118–127.
-  L. Madeyski and M. Jureczko, “Which process metrics can significantly improve defect prediction models? an empirical study,” Software Quality Journal, vol. 23, no. 3, pp. 393–422, Sep 2015. [Online]. Available: https://doi.org/10.1007/s11219-014-9241-7
-  T. Shippey, T. Hall, S. Counsell, and D. Bowes, “So you need more method level datasets for your software defect prediction?: Voilà!” in Proceedings of the 10th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, ser. ESEM ’16. New York, NY, USA: ACM, 2016, pp. 12:1–12:6. [Online]. Available: http://doi.acm.org/10.1145/2961111.2962620
-  Z. Tóth, P. Gyimesi, and R. Ferenc, “A public bug database of github projects and its application in bug prediction,” in Computational Science and Its Applications – ICCSA 2016, O. Gervasi, B. Murgante, S. Misra, A. M. A. Rocha, C. M. Torre, D. Taniar, B. O. Apduhan, E. Stankova, and S. Wang, Eds. Cham: Springer International Publishing, 2016, pp. 625–638.
-  R. Ferenc, Z. Tóth, G. Ladányi, I. Siket, and T. Gyimóthy, “A public unified bug dataset for java,” in Proceedings of the 14th International Conference on Predictive Models and Data Analytics in Software Engineering, ser. PROMISE’18. New York, NY, USA: ACM, 2018, pp. 12–21. [Online]. Available: http://doi.acm.org/10.1145/3273934.3273936
-  Y. Kamei, E. Shihab, B. Adams, A. E. Hassan, A. Mockus, A. Sinha, and N. Ubayashi, “A large-scale empirical study of just-in-time quality assurance,” IEEE Transactions on Software Engineering, vol. 39, no. 6, pp. 757–773, June 2013.
-  H. Altinger, S. Siegl, Y. Dajsuren, and F. Wotawa, “A novel industry grade dataset for fault prediction based on model-driven developed automotive embedded software,” in Proceedings of the 12th Working Conference on Mining Software Repositories, ser. MSR ’15. Piscataway, NJ, USA: IEEE Press, 2015, pp. 494–497. [Online]. Available: http://dl.acm.org/citation.cfm?id=2820518.2820596
-  L. Pascarella, F. Palomba, and A. Bacchelli, “Fine-grained just-in-time defect prediction,” Journal of Systems and Software, vol. 150, pp. 22–36, 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0164121218302656
-  S. R. Chidamber and C. F. Kemerer, “A metrics suite for object oriented design,” IEEE Trans. Softw. Eng., vol. 20, no. 6, pp. 476–493, Jun. 1994.
-  M. Fischer, M. Pinzger, and H. Gall, “Populating a release history database from version control and bug tracking systems,” in International Conference on Software Maintenance, 2003. ICSM 2003. Proceedings., Sep. 2003, pp. 23–32.
-  F. Zhang, A. Mockus, I. Keivanloo, and Y. Zou, “Towards building a universal defect prediction model,” in Proceedings of the 11th Working Conference on Mining Software Repositories, ser. MSR 2014. New York, NY, USA: ACM, 2014, pp. 182–191. [Online]. Available: http://doi.acm.org/10.1145/2597073.2597078
-  A. Mockus, “Amassing and indexing a large sample of version control systems: Towards the census of public source code history,” in 2009 6th IEEE International Working Conference on Mining Software Repositories, May 2009, pp. 11–20.
-  C. Bird, A. Bachmann, E. Aune, J. Duffy, A. Bernstein, V. Filkov, and P. Devanbu, “Fair and balanced?: Bias in bug-fix datasets,” in Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on The Foundations of Software Engineering, ser. ESEC/FSE ’09. New York, NY, USA: ACM, 2009, pp. 121–130. [Online]. Available: http://doi.acm.org/10.1145/1595696.1595716
-  F. Rahman, D. Posnett, A. Hindle, E. Barr, and P. Devanbu, “Bugcache for inspections: Hit or miss?” in Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering. ACM, 2011. [Online]. Available: http://doi.acm.org/10.1145/2025113.2025157
-  T. Ostrand, E. Weyuker, and R. Bell, “Predicting the location and number of faults in large software systems,” IEEE Trans. Softw. Eng., vol. 31, no. 4, pp. 340–355, 2005.
-  R. Plosch, H. Gruber, A. Hentschel, G. Pomberger, and S. Schiffer, “On the relation between external software quality and static code analysis,” in 2008 32nd Annual IEEE Software Engineering Workshop, Oct 2008, pp. 169–174.
-  F. Rahman, S. Khatri, E. T. Barr, and P. Devanbu, “Comparing static bug finders and statistical prediction,” in Proceedings of the 36th International Conference on Software Engineering, ser. ICSE 2014. New York, NY, USA: ACM, 2014, pp. 424–434. [Online]. Available: http://doi.acm.org/10.1145/2568225.2568269
-  T. F. Bissyandé, F. Thung, S. Wang, D. Lo, L. Jiang, and L. Réveillère, “Empirical evaluation of bug linking,” in 2013 17th European Conference on Software Maintenance and Reengineering, March 2013, pp. 89–98.
-  B. Fluri, M. Würsch, M. Pinzger, and H. Gall, “Change distilling: Tree differencing for fine-grained source code change extraction,” IEEE Transactions on Software Engineering, vol. 33, no. 11, pp. 725–743, 2007.
-  Y. Zhao, H. Leung, Y. Yang, Y. Zhou, and B. Xu, “Towards an understanding of change types in bug fixing code,” Information and Software Technology, vol. 86, pp. 37–53, 2017. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0950584917301313
-  D. Silva and M. T. Valente, “Refdiff: Detecting refactorings in version histories,” in 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR), May 2017, pp. 269–279.
-  C. Bird, A. Bachmann, F. Rahman, and A. Bernstein, “Linkster: Enabling efficient manual inspection and annotation of mined data,” in Proceedings of the Eighteenth ACM SIGSOFT International Symposium on Foundations of Software Engineering, ser. FSE ’10. New York, NY, USA: ACM, 2010, pp. 369–370. [Online]. Available: http://doi.acm.org/10.1145/1882291.1882352
-  P. J. Rousseeuw and C. Croux, “Alternatives to the median absolute deviation,” Journal of the American Statistical Association, vol. 88, no. 424, pp. 1273–1283, 1993. [Online]. Available: https://www.tandfonline.com/doi/abs/10.1080/01621459.1993.10476408
-  S. Herbold, A. Trautsch, and J. Grabowski, “A comparative study to benchmark cross-project defect prediction approaches,” IEEE Transactions on Software Engineering, vol. PP, no. 99, pp. 1–1, 2017.
-  F. Wilcoxon, “Individual comparisons by ranking methods,” Biometrics Bulletin, vol. 1, no. 6, pp. 80–83, 1945. [Online]. Available: http://www.jstor.org/stable/3001968
-  N. Cliff, “Dominance statistics: Ordinal analyses to answer ordinal questions.” Psychological Bulletin, vol. 114, no. 3, p. 494, 1993.
-  J. Romano, J. Kromrey, J. Coraggio, and J. Skowronek, “Appropriate statistics for ordinal level data: Should we really be using t-test and Cohen’s d for evaluating group differences on the NSSE and other surveys?” in Annual Meeting of the Florida Association of Institutional Research, 2006, pp. 1–3.
-  F. Shull, J. Carver, S. Vegas, and N. Juristo, “The role of replications in empirical software engineering,” Empirical Software Engineering, vol. 13, no. 2, pp. 211–218, 2008.
-  C. Tantithamthavorn, S. McIntosh, A. E. Hassan, and K. Matsumoto, “An empirical comparison of model validation techniques for defect prediction models,” IEEE Transactions on Software Engineering, vol. 43, no. 1, pp. 1–18, Jan 2017.
-  T. Chen and C. Guestrin, “Xgboost: A scalable tree boosting system,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’16. New York, NY, USA: ACM, 2016, pp. 785–794. [Online]. Available: http://doi.acm.org/10.1145/2939672.2939785
-  S. Pafka, “benchm-ml,” https://github.com/szilard/benchm-ml, 2019. [Online]. Available: https://github.com/szilard/benchm-ml
-  E. C. Neto, D. A. da Costa, and U. Kulesza, “The impact of refactoring changes on the szz algorithm: An empirical study,” in 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER), March 2018, pp. 380–390.
-  C. Tantithamthavorn, A. E. Hassan, and K. Matsumoto, “The impact of class rebalancing techniques on the performance and interpretation of defect prediction models,” IEEE Transactions on Software Engineering, pp. 1–1, 2018.
-  C. Tantithamthavorn, S. McIntosh, A. E. Hassan, and K. Matsumoto, “Automated parameter optimization of classification techniques for defect prediction models,” in Proceedings of the 38th International Conference on Software Engineering. ACM, 2016.
-  Y. Zhou, Y. Yang, H. Lu, L. Chen, Y. Li, Y. Zhao, J. Qian, and B. Xu, “How far we have progressed in the journey? an examination of cross-project defect prediction,” ACM Trans. Softw. Eng. Methodol., vol. 27, no. 1, pp. 1:1–1:51, Apr. 2018. [Online]. Available: http://doi.acm.org/10.1145/3183339
-  Y. Fan, X. Xia, D. Alencar da Costa, D. Lo, A. E. Hassan, and S. Li, “The impact of changes mislabeled by szz on just-in-time defect prediction,” IEEE Transactions on Software Engineering, pp. 1–1, 2019.
-  P. Runeson and M. Höst, “Guidelines for conducting and reporting case study research in software engineering,” Empirical Software Engineering, vol. 14, no. 2, p. 131, Dec 2008. [Online]. Available: https://doi.org/10.1007/s10664-008-9102-8