The idea of trying to predict which software components (e.g., classes, files or even methods) are likely to be defect-prone has gained a great deal of traction in software engineering research over the past three decades. Being able to reliably distinguish between defect-prone and clean components is clearly desirable, since QA resources can then be allocated more effectively. A considerable number of systematic reviews Cata09 ; Hall12 ; Malh15 ; Hoss17 ; Ozak18 ; Son19 ; Li20 have identified and summarised many hundreds of such studies.
These defect prediction studies have generally approached the problem empirically in the form of computational experiments, where different prediction systems are compared over data with known outcomes (i.e., labelled defect-prone or not defect-prone). Comparisons are made using various classification performance metrics, typically F1 and Area under the Curve (AUC), as the response variables. (More correctly speaking, F1 is a specific instantiation ($\beta = 1$) of the F-measure, which is defined as $F_\beta = \frac{(1+\beta^2)\cdot \mathrm{Precision}\cdot \mathrm{Recall}}{\beta^2\cdot \mathrm{Precision} + \mathrm{Recall}}$.) Unfortunately, despite its widespread use, statisticians and machine learning specialists have drawn attention to various difficulties with F1, particularly when it is used for two-class problems Soko09 ; Powe11 ; Luqu19 . Section 2.2 reviews these difficulties in some detail.
So we pose the question: do the problems with F1 actually matter, or are the results from the many studies utilising F1 good enough approximations to the ‘truth’? This is an important question because the use of F1 remains widespread (more than a third of papers over the period 2015-20). To answer it we locate studies that report results both with F1 and also the unbiased Matthews correlation coefficient (MCC) Bald00 which enables us to make comparisons and check whether the conclusion changes depending upon the choice of classification performance metric.
This paper extends our previous analysis Yao20 in five ways.
We provide a comprehensive review and critique of the widely-used, classification performance metric F1 drawing from both the machine learning and statistical literature, complemented with an analysis of all permutations of the N=40 confusion matrix.
We have extended the searches for relevant, defect prediction studies which increases the number of primary studies from 8 to 38 meaning that the number of individual results in the meta-analysis is now 12,471 (a more than threefold increase).
We investigate how widely the F1 metric is used in software defect prediction experiments.
We undertake an in-depth investigation of the circumstances when the metric F1 is most likely to be misleading, in particular imbalance.
The updated raw data and code are available from http://doi.org/10.5281/zenodo.4608552.
The remainder of this paper is structured as follows. Section 2 first reviews different approaches to evaluating defect prediction performance, followed by a detailed critique of the F1 metric and an assessment of its role within software defect prediction research. Section 3 describes the systematic review we undertook to find as many F1 and MCC results as possible for our meta-analysis. Then, we present our bibliometric findings in Section 4. This is followed by our meta-analysis, results and discussion contained in Section 5. The article concludes (Section 6) with a summary of our findings, threats to validity and a set of recommendations for researchers and for readers of defect prediction studies.
In this section we review the most widely utilised classification performance metrics (for more exhaustive reviews see Powers Powe11 and Luque et al. Luqu19 ). Next, we focus on the F1 measure in detail, highlight some problems and contrast it with an alternative metric, namely the Matthews correlation coefficient. Finally, we briefly examine the question of how widely F1 is used in software defect prediction experiments.
2.1 Classification performance metrics in software defect prediction
In this discussion we focus on two-class classification problems as per all of our included studies. This approach restricts the analysis to defect-prone (positive) and not defect-prone (negative) software components. These classes are mutually exclusive. Thus the structure of the classifier performance can be represented as a confusion matrix (see Table 1) which, since we have two classes, is a contingency table of predicted versus actual class. From this matrix we are able to derive most classification performance metrics.
| ||Actual Positive||Actual Negative|
|Predicted Positive||TP||FP|
|Predicted Negative||FN||TN|
The four cells of the confusion matrix comprise counts of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). For our problem domain these correspond, respectively, to defective components correctly classified as defective, defect-free components correctly classified as defect-free, defect-free components incorrectly classified as defective and defective components incorrectly classified as defect-free.
For the discussion regarding confusion matrices we use the following terminology and concepts.
- Prevalence ($p$):
is the proportion of actual positive cases (i.e., defect density), hence the prevalence of negative cases is simply $1-p$. Alternatively, $p = (TP+FN)/n$ where $n$ is the cardinality. This is an important concept since it gives rise to problems of imbalanced data sets, which in turn cause difficulties for training classifiers. For the problem domain of software defect prediction, it is common, but not invariably so, that $p$ is close to zero and the datasets are imbalanced Wang13 . Unfortunately this also leads to difficulties with biased classification performance metrics, as we will shortly demonstrate.
- Precision:
is defined as $TP/(TP+FP)$, which is the proportion of correctly identified defect-prone units from all cases classified as defect-prone.
- Recall:
is also referred to as sensitivity or the true positive rate (TPR). It is defined as $TP/(TP+FN)$, the proportion of positive cases that are correctly predicted positive (defect-prone) out of all positive cases.
- Specificity:
or the true negative rate (TNR) is defined as $TN/(TN+FP)$, the proportion of negative cases that are correctly considered as negative from all negative cases. For a given classifier, Specificity and Recall typically trade off against each other as the threshold for accepting a case as positive is varied: increasing one tends to decrease the other.
- False positive rate (FPR):
is defined as the proportion of negative cases that are mistakenly considered as positive out of all negative cases. It is also sometimes referred to as Fallout and is a way of characterising the contamination of positive predictions by negative examples.
- Accuracy:
is defined as the proportion of cases correctly classified to all cases, $(TP+TN)/(TP+TN+FP+FN)$. However, it is not chance-corrected and is therefore often a misleading guide. Consider the situation where 95% of cases are positive (so the prevalence is 0.95); then trivially a classifier could achieve 95% accuracy simply by predicting that all cases belong to the modal (in this case positive) class.
- F1:
combines Precision and Recall, via their harmonic mean, such that they have equal weight. Although Precision and Recall can each be trivially optimised independently (respectively, by predicting almost no cases as positive or by predicting all cases as positive), the idea of combining both measures is intended to take a more balanced view. For this reason F1 has been widely deployed as a means of assessing classifier performance. NB A harmonic mean, as opposed to an arithmetic mean, will penalise more extreme differences between Precision and Recall.
- Bookmaker’s odds:
is also known as Youden’s J Youd50 or Informedness for multi-class classification. It is defined as $J = TPR - FPR$. It yields the proportion of time we are making an informed decision as opposed to guessing Powe03 . A classifier that has a Bookmaker score of zero is doing no better than chance and a negative score implies worse than chance. This is important information when evaluating classifiers. Its chief value is as a simple benchmark of the extent to which a classifier adds value.
- Matthews correlation coefficient (MCC):
is the Pearson correlation for a contingency table and is known as $\phi$ (the phi coefficient) by statisticians. It is defined using TP, TN, FP and FN, and so includes all parts of the confusion matrix. As with any correlation coefficient, it ranges from -1 to +1, with values further from zero representing a stronger association. Thus +1 indicates perfect classification, -1 indicates perfectly perverse classification, and zero indicates random predictions, i.e., no classification value. It is related to the chi-square statistic for a contingency table such that $|MCC| = \sqrt{\chi^2/n}$ where $n$ is the number of cases.
- Receiver operating characteristic (ROC) curve:
is a two-dimensional chart where TPR (Recall) is plotted on the Y axis and FPR (Fallout) is plotted on the X axis. A ROC curve describes the relative trade-offs between TPs and FPs for different thresholds of accepting a case as being positive ranging from all (when TPR=1) to none (when FPR=0). As a two-dimensional assessment index, it can be problematic when comparing the performance of different classifiers. In order to compare classifiers using a single scalar value, the usual method is to calculate the area under the ROC curve (AUC) Fawc06 . Since AUC is a proportion of the total area of the unit square, its value must fall between 0 and 1, where AUC=1 means the classifier can perfectly distinguish between positive and negative classes, whilst AUC=0 means the classifier is perfectly perverse, i.e., it predicts all positive cases as negative ones and vice versa. When AUC=0.5, the classifier has no discriminative value and is equivalent to random guessing (equivalently Youden’s J or MCC equal zero). Therefore, any value greater than 0.5 represents a better than chance classification for the two-class case. Note, however, this metric refers to a family of possible classifiers rather than any specific classifier. Thus, unless the ROC curve for classifier A strictly dominates classifier B we cannot make any remarks about our preference for A over B since, in practice, we can only deploy a single classifier. Morasca and Lavazza Mora20 suggest this difficulty might be reduced by only examining “relevant areas”, that is regions of interest. But we also note that AUC has come under considerable criticism (for example that it uses different misclassification cost distributions for different classifiers Hand09 ; Flac15 ). Moreover, its purpose is somewhat different from our primary interest, which is to address performance metrics for particular classifiers. For this reason we do not explore AUC further in this paper.
Table 2 gives formal definitions of the most commonly deployed classification performance metrics in terms of the confusion matrix. It also indicates which metrics are chance-corrected, in other words which make a comparison with a guessing strategy; for example, a negative MCC score indicates worse-than-chance performance. By contrast, F1 is not chance-corrected because the F1 value obtained by guessing that all cases belong to the modal class will depend upon its prevalence.
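|Metric||Definition||Range||Better||Chance-corrected|
|Recall or True Positive Rate (TPR)||TP/(TP+FN)||[0,1]||High||No|
|False Positive Rate (FPR)||FP/(FP+TN)||[0,1]||Low||No|
|Bookmaker’s odds or Youden’s J||TPR - FPR||[-1,1]||High||Yes|
|Matthews correlation coefficient (MCC) or φ||(TP×TN - FP×FN)/√((TP+FP)(TP+FN)(TN+FP)(TN+FN))||[-1,1]||High||Yes|
|Area Under Curve (AUC)||FPR versus TPR||[0,1]||High||Yes|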
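As a minimal sketch of how the metrics defined in this section follow from the four cells of the confusion matrix, the Python fragment below computes them for a single matrix. The function names and the example counts are hypothetical, chosen only for illustration, and the convention of returning `None` for undefined ratios is an assumption of this sketch rather than something prescribed by the studies reviewed.

```python
import math
from typing import Optional


def ratio(num: float, den: float) -> Optional[float]:
    """Return num / den, or None when the ratio is undefined (den == 0)."""
    return num / den if den != 0 else None


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Derive common two-class performance metrics from one confusion matrix."""
    n = tp + fp + fn + tn
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)        # sensitivity, TPR
    specificity = ratio(tn, tn + fp)   # TNR
    fpr = ratio(fp, fp + tn)           # fallout

    # F1 as the harmonic mean of precision and recall; left undefined (None)
    # when either component is undefined or both are zero.
    f1 = None
    if precision is not None and recall is not None and precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)

    # Bookmaker's odds (Youden's J) = TPR - FPR.
    bookmaker = recall - fpr if recall is not None and fpr is not None else None

    # MCC: Pearson correlation between predicted and actual class.
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / den if den != 0 else None

    return {
        "prevalence": ratio(tp + fn, n),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "fpr": fpr,
        "accuracy": ratio(tp + tn, n),
        "f1": f1,
        "bookmaker": bookmaker,
        "mcc": mcc,
    }


# Hypothetical counts, for illustration only (10% prevalence).
m = classification_metrics(tp=5, fp=10, fn=5, tn=80)
print({name: round(value, 3) for name, value in m.items()})
```

Note that F1 never consults `tn`, whereas the Bookmaker's odds and MCC both do.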
2.2 A critique of F1
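F1 is a widely used performance metric in the field of software defect prediction. It is the harmonic mean of recall and precision (see Table 2) and is a specific instantiation ($\beta = 1$) of the F-measure that is defined as:
$$F_\beta = \frac{(1+\beta^2)\cdot \mathrm{Precision}\cdot \mathrm{Recall}}{\beta^2\cdot \mathrm{Precision} + \mathrm{Recall}}$$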
It originates from the information retrieval community and was first proposed by van Rijsbergen vanR79 in the 1970s. This metric is only sensitive to the positive class as the definition does not include TNs. Precision is the proportion of true positives in the cases predicted positive and recall is the proportion of positive cases that are correctly predicted. Hence their values are entirely independent of the number of true negatives. Ignoring negative cases, except inasmuch as they contaminate predictions, makes perfect sense when the problem domain is essentially a single-class problem, e.g., retrieval of relevant pages from the web when the number of irrelevant pages correctly not retrieved (TNs) is vast, cannot be determined and is not of interest.
So we have a classification metric suitable for single-class information retrieval problems being redeployed for two-class problems. Moreover, we speculate that some researchers have not fully considered the ramifications of doing so. For software defects (and many other problem domains) we most definitely have two classes. Knowing that a software unit has been correctly classified as not defect-prone is important in terms of project resources and software quality. In addition, the dichotomous view of prediction is an over-simplification since most classifiers predict class membership with a given confidence or probability. The threshold for positive class assignment is therefore both flexible and arbitrary in that changing the acceptance threshold for positive cases can move a software unit from being predicted positive (defect-prone) to not defect-prone. The problem of two classes is compounded by much variation in the prevalence of defect-prone cases in training data sets. Typically these positive cases are very much in the minority, hence most data sets are highly imbalanced Sun09 .
A second problem is that F1 is difficult to interpret, other than that zero is the worst case and unity the best case. (Strictly speaking, even zero can be problematic because in the event TP=0 then F1 is undefined, although it is customary to record this situation as F1=0.) Specifically, the chance component of the metric is unknown, unlike a correlation coefficient or AUC (see Table 2). So, for example, it is hard to know what a value such as F1 = 0.25 means. Is it better than chance? Is the classifier actually predicting or would we be better off just guessing? In contrast, a correlation coefficient equal to 0.25 means there is a small positive effect and that the classifier is indeed doing better than chance.
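To illustrate why the unknown chance component matters, consider a trivial classifier that labels every component as defect-prone. This strategy is an assumption introduced here purely for illustration. Writing $p$ for the prevalence of positive cases and $n$ for the number of cases, we have $TP = pn$, $FP = (1-p)n$ and $FN = 0$, so that:

```latex
\mathrm{Precision} = \frac{pn}{pn + (1-p)n} = p, \qquad
\mathrm{Recall} = \frac{pn}{pn + 0} = 1, \qquad
F1 = \frac{2 \cdot p \cdot 1}{p + 1} = \frac{2p}{1+p}
```

Under this assumption F1 = 0.25 is exactly what the trivial classifier attains when $p = 1/7 \approx 0.14$, whereas the same strategy scores F1 $\approx 0.67$ when $p = 0.5$. In both cases TPR = FPR = 1, so TPR - FPR = 0 and the classifier adds nothing beyond guessing.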
Third, F1 is not chance-corrected. An alternative way to think about the difficulties of F1 is in terms of its relationship to the so-called Bookmaker’s odds (otherwise known as Youden’s J Youd50 or informedness for multi-class classification problems Powe11 ). It gives the probability that the classifier is doing better than chance (see Table 2) and is independent of the relative proportions of positive and negative (defect-prone and not defect-prone) instances. Youden suggested that the metric ranges [0,1] since he appears not to have considered the possibility of a perverse classifier. Unfortunately such circumstances do arise in machine learning experiments; see, for example, the meta-analysis in Shep14 . Thus, the Bookmaker’s odds provides a simple benchmark to assess F1. (Whilst we utilise the Bookmaker’s odds as a simple benchmark to evaluate the chance component of any classification performance, we consider correlation coefficients such as MCC to be more useful for overall performance evaluation. In the case of MCC it has the added utility of being related to a chi-squared distribution.)
In Figure 1 we plot, for all 12341 possible permutations of an N=40 confusion matrix, the F1 score, the associated Bookmaker’s odds and the degree of imbalance, which is defined in terms of the prevalence of positive cases. Note that for 940 permutations one or both metrics has no defined value, e.g., because TP=0. We can observe that there is some general tendency for the Bookmaker’s odds to increase as the F1 score increases, as indicated by the green smoothed line. However, there are many deviations and if we examine the lower right-hand quadrant (F1 > 0.5 and Bookmaker’s odds < 0) we see there is extensive potential for F1 to provide very misleading scores for a classifier that is in actual fact worse than random. As an extreme example, the confusion matrix TP=35, FP=1, FN=4, TN=0 yields F1 = 0.93 but Bookmaker’s odds of -0.10. In other words, a near perfect F1 score corresponds to a classifier that is, in reality, slightly worse than guessing: it is perverse!
Whilst these are hypothetical examples, they reveal a potentially highly misleading performance metric. We also note that these more extreme values tend to correspond to high imbalance scores (i.e., where positive cases have very low or very high prevalence).
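The enumeration of confusion matrices described above can be sketched as follows. This is an illustrative reconstruction rather than the analysis code accompanying the paper: the helper names are hypothetical, and treating F1 as undefined whenever TP = 0 is an assumption that follows the wording of the text. The sketch does not attempt to reproduce the figure or the count of undefined permutations, which depends on the conventions adopted.

```python
N = 40  # total number of cases in each confusion matrix


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; None when TP = 0 (treated as undefined)."""
    if tp == 0:
        return None
    return 2 * tp / (2 * tp + fp + fn)


def bookmaker(tp, fp, fn, tn):
    """Bookmaker's odds (Youden's J) = TPR - FPR; None when either class is empty."""
    if tp + fn == 0 or fp + tn == 0:
        return None
    return tp / (tp + fn) - fp / (fp + tn)


# Every way of distributing N cases over the four cells (TP, FP, FN, TN).
matrices = [
    (tp, fp, fn, N - tp - fp - fn)
    for tp in range(N + 1)
    for fp in range(N + 1 - tp)
    for fn in range(N + 1 - tp - fp)
]

# Matrices where F1 looks respectable (> 0.5) although the classifier is
# worse than chance (Bookmaker's odds < 0).
misleading = []
for tp, fp, fn, tn in matrices:
    f1, j = f1_score(tp, fp, fn), bookmaker(tp, fp, fn, tn)
    if f1 is not None and j is not None and f1 > 0.5 and j < 0:
        misleading.append(((tp, fp, fn, tn), f1, j))

# Matrices whose rounded scores are F1 = 0.93 and Bookmaker's odds = -0.10.
extreme = [m for m, f1, j in misleading if round(f1, 2) == 0.93 and round(j, 2) == -0.10]

print(len(matrices), len(misleading), extreme)
```

Every matrix collected in `misleading` lies in the lower right-hand quadrant discussed above.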
Fourthly, researchers have commented on the way that F1 combines two distinct quantities, Precision and Recall, and that this is accomplished via the harmonic mean, which will distort the impact of extreme values Hand18 . By contrast, an arithmetic mean is more intuitively interpretable.
So we conclude that F1 is, at least in theory, an unreliable indicator of software defect prediction performance. As an alternative, we propose the Matthews correlation coefficient (MCC) Bald00 , since it is chance-corrected (see Table 2), is based upon both classes (it utilises the complete confusion matrix) and has a straightforward interpretation. It is also directly related to the chi-square statistic for the confusion matrix (see Section 2.1).
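The relationship between MCC and the chi-square statistic can be checked numerically: for a 2x2 table, $n \cdot MCC^2$ equals the Pearson chi-square statistic computed without continuity correction. The sketch below uses hypothetical counts and hypothetical function names, and is intended only as an illustration of that identity.

```python
import math


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient; None when any marginal total is zero."""
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / den if den != 0 else None


def chi_square(tp, fp, fn, tn):
    """Pearson chi-square (no continuity correction) for the 2x2 confusion matrix."""
    n = tp + fp + fn + tn
    observed = [[tp, fp], [fn, tn]]       # rows: predicted class, columns: actual class
    row_totals = [tp + fp, fn + tn]
    col_totals = [tp + fn, fp + tn]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat


# Hypothetical counts, for illustration only.
tp, fp, fn, tn = 5, 10, 5, 80
n = tp + fp + fn + tn
r = mcc(tp, fp, fn, tn)
chi2 = chi_square(tp, fp, fn, tn)
print(round(n * r ** 2, 4), round(chi2, 4))
```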
2.3 How widely is F1 used in software defect prediction experiments?
Two previous systematic reviews have explicitly tried to quantify the extent to which F1 is used as the response variable for software defect prediction studies. Malhotra et al. Malh15 reported that 17/64 (≈27%) of included studies between 1991 and 2013 used F1 directly, and that an additional 37% and 66% used Precision and Recall respectively, which are the two constituent components of F1 (see Table 2). Another systematic review by Hosseini et al. Hoss17 reports that out of 30 studies (2006-2016), 11 (≈37%) use F1 and 21 (70%) use precision and recall.
To obtain a more up to date, though somewhat approximate, view of current utilisation rates we applied the following two searches using Google Scholar (29th May, 2020). We excluded patents and citations.
General search: "software defect prediction" AND ("experiment" OR "empirical")
F1 subset search: "software defect prediction" AND ("experiment" OR "empirical") AND ("F1" OR "F-measure" OR "F-score")
These searches retrieved 2250 (General search) and 978 (F1 subset search) results respectively, which suggests that in the past five years of the order of 43% (978/2250) of articles that discussed software defect prediction experiments also mentioned F1. We then randomly sampled 30% of these 978 papers and read them carefully to determine if they actually employed F1 in their analysis. (The sampling was conducted by randomly sampling 3 papers from each page (of 10 papers) returned by Google Scholar, 97 pages in total. Overall we examined 291 papers.) We found that 82% (239/291) of papers actually used the F1 metric in their methods. From this we argue that in the last five years there are of the order of 800 software defect prediction studies (82% of 978) that make use of the F1 classification performance metric. Given our concerns regarding this metric we find this somewhat worrying.
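The arithmetic behind this estimate can be laid out explicitly. The sketch below uses only the counts reported above; the variable names are hypothetical, and the extrapolation assumes that the 291 sampled papers are representative of all 978 hits.

```python
general_hits = 2250       # General search results
f1_hits = 978             # F1 subset search results
sampled = 291             # papers read (3 per page over 97 pages)
sampled_using_f1 = 239    # sampled papers that actually used F1

mention_rate = f1_hits / general_hits          # share of articles mentioning F1
usage_rate = sampled_using_f1 / sampled        # share of sampled papers using F1
estimated_f1_studies = usage_rate * f1_hits    # extrapolated to all F1 hits
overall_share = estimated_f1_studies / general_hits

print(f"{mention_rate:.1%} {usage_rate:.1%} {estimated_f1_studies:.0f} {overall_share:.1%}")
```

On these figures the extrapolated share of all retrieved articles that use F1 is a little over a third.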
3 The systematic review
Next we seek to find software defect prediction studies that publish results using F1 and another more reliable metric, MCC, so that we can make comparisons.
This review was carried out in January 2021. The goal was to locate primary studies that undertook experiments to assess software defect prediction methods on historical data sets. Specifically, we needed papers that reported results with both the widely used, but problematic, F1 metric and the Matthews correlation coefficient (MCC). This would enable us to determine whether differences between these metrics are merely a theoretical concern or have real-world impact.
We conducted a basic search in our earlier conference paper Yao20 and located 8 studies. However, for this work we decided to conduct a more in-depth search which resulted in a further 30 studies making a total of 38 papers. The details are given below (and summarised in Table 3).
|Objective||To find experimental results where both the F1 and MCC classifier performance metrics are reported|
|Target domain||Software defect prediction experiments|
|Target audience||(i) Ourselves (for a meta-analysis) and (ii) other researchers|
|Databases searched||Google Scholar|
|Additional searches||Forward and backward chaining, results from previous search Yao20|
|Inclusion validation||JY and MS independently checked papers for potential inclusion and disagreements were discussed|
|Grey literature||Not included|
|Study quality||(i) Refereed and (ii) for predictive studies uses some cross-validation procedure|
|Data collected||(i) Bibliographic data including: authors, title, year and publication venue and types of classification performance metric collected, (ii) result count, (iii) individual F1 and MCC results for each combination of dataset and classifier, (iv) inference procedure details including: use of NHST and correction procedures e.g., Bonferroni|
|Data published||zenodo 10.5281/zenodo.3949897|
We decided to restrict our search to the domain of software defect prediction experiments. This was because there are aspects to the researcher’s choice of classifier performance metric that are domain-specific, namely (i) whether true negatives can be enumerated, i.e., is this count knowable and (ii) the prevalence of true cases, i.e., defect-prone software components. In the case of software defects, not only is the number of true negatives knowable, it’s important since these are the software components that are correctly identified as not being defect-prone. Such components can then be allocated reduced testing resources. The other aspect is that in general, the majority of data sets have few positive (defect-prone) cases, i.e., they are imbalanced. These two conditions render F1 an unsafe choice of performance metric for software defect prediction experiments (see Section 2.2).
We wanted to ensure that we collected high quality experimental results for our meta-analysis. For this reason we only used peer reviewed studies, meaning that the so-called grey literature has been excluded. Although some researchers are strongly advocating ‘multi-vocal’ literature reviews e.g., Garo19 our purpose is slightly different. First, we are not motivated by “closing the gap between academic research and professional practice” (Elmore Elmo91 quoted in Garousi et al. Garo19 ) because our target is researchers. Second, we are aware of the ease with which it is possible to make errors in computational experiments and therefore are motivated to maximise independent scrutiny. Of course, the peer reviewed literature is still replete with mistakes Dono09 ; Alli16 and more specifically in defect prediction Shep19 ; Li20 .
The complete list of exclusion criteria is given in Table 4. The long list of 194 articles was constructed by scanning the title and venue. Where there was any doubt, the article remained in the long list. If we were unable to obtain content through the usual channels we attempted to email the authors. Each criterion was successively applied in the order listed in Table 4, consequently the counts only refer to the residual articles (e.g., if an article is unavailable we make no judgement as to whether it is written in English). Note ‘new data’ refers to the situation where the same experimental results are presented in more than one paper. Finally, ‘suitable data’ is something of a catch all where we are unable to use the data for a range of particular reasons such as no meaningful differences between the classifiers being compared or no comparisons (i.e., results being presented without benchmarks).
|Long list of candidate papers||-||194|
|Not written in English||6||179|
|Different problem domain to software defect prediction||45||135|
|Provides F1 and MCC data||69||43|
|No cross-validation (where appropriate)||0||38|
4 Bibliometric summary data
As indicated, we located 38 research papers that described computational experiments that compare the performance of different classifiers, e.g., logistic regression or random forest algorithms, across various software defect data sets, e.g., NASA MDP and Eclipse. The publication venues are quite widely distributed, though we note there are three papers from Promise conferences, EMSE and JSS, two from IST, TSE and MECS (the Intl. J. of Modern Education and Computer Science). In total there are 12 conference and 26 journal papers. A complete list is given in Appendix A.
The papers range from 2012 to 2020. On the whole, however, providing MCC results seems to be a relatively recent phenomenon, with 32/38 papers published since 2018. We also observed that just over half (20/38) of the papers also reported AUC; however, we did not analyse these data as being beyond the scope of our study. (Area under the curve (AUC) would also appear to be growing in popularity amongst software defect prediction researchers. However, it is a metric based on the frontier between the true and false positive rates and thus is a characteristic of a family of classifiers rather than any specific classifier Fawc06 ; Hand09 ; Powe11 .) The line plot in Fig. 2 gives an indication of the clear, steep upwards trend using a loess smoother.
In total, the 38 papers contain 12,471 usable results, that is where we have complete cases, and pairs of values for each of F1 and MCC. The number of results per paper varied hugely from 6 to 1890 with a median of 140.5. We noted that the papers containing a large number of results tended to be major benchmarking exercises, whilst those with few results tended to promote new algorithms or approaches and just compared these with some baseline approach.
These papers between them utilise 97 distinct datasets, however many are used multiple times across multiple studies. Unfortunately, it would seem that different studies refer to the same data set by different names, e.g., Promise and NASA MDP. Also, it’s not always clear which version of a data set is being used plus different studies deploy differing data-cleaning strategies. Nevertheless this indicates the breadth of research activity.
In terms of analysis, we observe from Table 5 that 24/38 papers use null hypothesis significance testing (NHST) to determine whether a result is ‘statistically significant’ or ‘meaningful’. Fig. 2 shows a clear upward trend such that in the past couple of years this is the dominant means of reasoning about comparative classification performance. Of these 24 papers, half (12/24) use some procedure (e.g., a post hoc Nemenyi test) to adjust the acceptance threshold when making multiple, inferential tests. A further 10 papers merely compare values, so a higher score of the performance metric (F1 or MCC) is to be preferred to a lower score, irrespective of the magnitude. The remaining 4 papers either use alternative methods, are unclear or have some purpose other than comparing competing defect predictors.
5.1 Classification performance metrics
First, we examine the spread of values for F1 and MCC, recalling that $F1 \in [0,1]$ but $MCC \in [-1,1]$, so the summary statistics in Table 6 are not directly comparable. Although the statistics are derived from all reported results from our systematic review, it is noteworthy that many classifiers appear to perform poorly. Some even have negative correlation coefficient values (592/25,467 observations).
As discussed in Section 2.2, the F1 metric is not easy to interpret with respect to chance odds, but we did find 235 instances of the boundary case of F1 = 0, which comprise a small proportion of all observations. We assume the authors are reporting a divide-by-zero error as zero. This could be caused by either no positive cases (TP + FN = 0) or no true positives (TP = 0). The former situation might arise from the vagaries of cross-validation (if there are few positive cases relative to the size of each fold). The latter might arise from a very poorly performing classifier.
We can also visualise the individual distributions as violin plots (see Fig. 3). Note that, unlike MCC, the density plot for F1 is truncated at zero since this is the minimum possible value. The kernel density estimators suggest quite complex, but positively skewed distributions, i.e., the means are greater than the medians. In other words, most classifiers do not predict well. However, we do stress that all results are included, encompassing possibly quite naïve or simplistic baselines that the researchers only intended as comparators.
Although the two performance metrics are measured on different scales, we can use Spearman’s correlation coefficient to evaluate the strength of a monotonic association, and we find an association that some researchers would refer to as being of ‘moderate’ strength Scho18 . Probably more informative is to examine the relationship graphically in the scatter plot given by Fig. 4. Although the relationship is positive, i.e., as one metric value increases so does the other, we observe a good deal of scatter and some extreme outliers. This breakdown in the relationship is most noticeable for the higher values of F1. In other words, the relationship between the two classification performance metrics is far from straightforward and, paradoxically, the nearer F1 is to unity the less trustworthy is the result. A near perfect score for F1 can in practice mean anything from a similarly near perfect MCC score to a negative correlation and everything in between! Recall, these are real, reported results in the refereed scientific literature.
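For readers unfamiliar with Spearman's coefficient, it is simply the Pearson correlation computed on ranks, which is why it is indifferent to the two metrics having different scales. The sketch below is a self-contained illustration using only the standard library. The helper names and the five paired scores are hypothetical and are not taken from the primary studies.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical paired results for five classifier-dataset combinations.
f1_scores = [0.93, 0.40, 0.25, 0.70, 0.55]
mcc_scores = [0.10, 0.33, 0.05, 0.60, 0.20]
rho = spearman(f1_scores, mcc_scores)
print(round(rho, 3))
```

In practice a library routine such as `scipy.stats.spearmanr` would normally be used instead.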
5.2 Difference between F1 and MCC
In order to extract information from what is a diverse set of primary studies, we must adopt a standardised view. Essentially, all our studies are conducting computational experiments where the:
treatments are the different predictive algorithms or classifiers that are being investigated and manipulated by the researcher;
response variables are the different classification performance metrics, in our case F1 and MCC;
experimental units are software project data for which predictions are made;
blocks are aggregations of similar experimental units, such as a data set, which is typically the level of reporting.
Almost without exception, a repeated measures type of design is deployed by our set of primary studies, that is all treatments are applied to all experimental units. This is possible — unlike for many experiments involving humans — since there is no potential for carryover or other ordering effects. Results are presented as tables of response variables, for example, see Table 7.
When researchers assess their results they do so by making comparisons between treatments. Which one is to be preferred? These are generally decomposed to pairwise comparisons. (Although omnibus tests, e.g., a Friedman test, are sometimes utilised to make comparisons of multiple results, such tests can always be decomposed to their primitive components.) In Table 7 we see some hypothetical results. In this simple example we have two comparisons, one for each block or dataset, of Logistic Regression (LR) compared to Naïve Bayes (NB). For example, using F1 we have Naïve Bayes out-performing Logistic Regression for both data sets, since its F1 score is the higher of the two in each case; however, the results using MCC are not fully concordant, since Naïve Bayes has the higher MCC for one data set but the lower MCC for the other. In the latter case, which treatment or type of defect predictor we prefer depends upon the choice of the measurement function. In such a situation we refer to the results as being discordant. One can imagine the hypothetical example being extended with a third classifier, say Random Forest, which would then mean the table of results could be decomposed to six comparisons (LR-NB, LR-RF, NB-RF twice over, once per dataset). To generalise, a table reporting $t$ treatments over $d$ data sets can be decomposed to $d \cdot t(t-1)/2$ comparisons.
More formally, the comparisons are made using different metrics or measurement functions. For our analysis, these are F1 and MCC. These functions are applied to different dataset-treatment combinations, yielding, for example, F1(Dataset1, LR), which gives the F1 metric from applying Logistic Regression to Dataset1. By taking pairs of measures, we can establish preference relations, e.g., F1(Dataset1, LR) < F1(Dataset1, NB), in other words, NB is to be preferred to LR for Dataset1. The question arises whether other measurement functions, MCC in our analysis, yield concordant or discordant relations. If the choice of metric governs the outcome this is concerning, the more so because we know that F1 is a flawed metric in the context of two-class classification in software defect prediction.
Comparisons between pairs of treatment-blocks can differ in either magnitude or direction. Discordance, our focus, addresses the latter. In other words, using F1 leads one to prefer X to Y, yet using MCC would lead the investigator to conclude the opposite. Researchers adopt a range of approaches when considering how to evaluate differences in magnitude, particularly small differences. A common, though controversial, approach is the use of null hypothesis significance testing (NHST) Cohe94 ; Gelm06 ; Colq14 . Here the idea is to distinguish between small differences in magnitude that might be merely due to noise, and differences of greater import. In such circumstances one can deploy a “not worse than” preference relation. Scott-Knott is another approach with a similar goal Mitt15 .
We are agnostic about the direction of the difference, since whether a classifier is interpreted as treatment 1 or 2 is entirely arbitrary; thus we look at absolute differences. Figure 5 shows the distribution of differences together with a kernel density estimator. Both metrics show a similar, highly positively skewed distribution (partly as a consequence of taking the absolute difference). Nevertheless, it is noteworthy that for both metrics the median differences between pairs of observations are small (F1 = 0.060 and MCC = 0.068). This suggests that for the majority of comparisons between pairs of defect predictors, the dissimilarities are small.
The question arises: if a difference in direction occurs, could this be due to a trivial change in an accuracy metric? In other words, what if many differences in pairs of observations are very small? However, given that the majority of studies (24/38) use NHST as a decision procedure, even small differences are likely to be interpreted as meaningful due to the generally quite large data sets. (Footnote 7: Using a simulation based on the NASA MDP data sets, which are the most widely used, with the median dataset size and a median standard deviation of sd = 0.03465 reported by Tran et al. Tran19 , we determined that a Welch test on this data could detect (i.e., find statistically significant) a small difference between treatments in F1 metric values (95% CI 0.0065, 0.0127 with p-value = 2.204e-09). This means that even very small differences in pairs of classification performance would be viewed as ‘significant’ when viewed through the lens of NHST. For our purposes this means even small differences between F1 and MCC might lead researchers to very different conclusions.) Furthermore, another 10 papers simply make direct comparisons. Being conservative, we could argue that a difference in F1 of 0.01 or more could be identified as ‘meaningful’. If smaller differences were discarded this would eliminate 426 results, or 426/2737, which is approximately 15.6% of the conclusion changes. In other words, only a small proportion of the direction changes that we identify might be considered too small to count as conclusion changes by the researchers. We return to this point in the threats to validity (see Section 6.1).
Next, we turn to the question of what proportion of reported results exhibit discordance. This addresses the problem of how much it practically matters that many research papers have used a biased classification metric as the response variable for their experiments. Since the two metrics use different scales, we initially focus on direction rather than magnitude. Overall, from the 12,471 results there are 2737 (or 21.95%) instances of a conclusion (or direction) change. We use the Agresti-Coull method Brow01 to estimate the 95% confidence interval of this binomial proportion.
Note that given the large number of observations the confidence interval (CI) is tightly defined. Therefore these results indicate that more than a fifth of the reported results contain conclusions that will change if the biased F1 metric is replaced by a less problematic metric such as MCC.
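The interval can be recomputed from the two counts reported above. The sketch below is our own implementation of the standard Agresti-Coull adjustment, not the authors' script; the function name and its `confidence` parameter are assumptions.

```python
from math import sqrt
from statistics import NormalDist


def agresti_coull(successes, n, confidence=0.95):
    """Agresti-Coull interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    n_adj = n + z ** 2                              # adjusted sample size
    p_adj = (successes + z ** 2 / 2) / n_adj        # adjusted proportion
    half_width = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    return p_adj - half_width, p_adj + half_width


# Counts from the text: 2737 conclusion changes out of 12,471 results.
low, high = agresti_coull(2737, 12471)
print(f"{low:.4f} to {high:.4f}")  # roughly 0.212 to 0.227
```

With this many observations the interval is less than 1.5 percentage points wide, which is what is meant by it being tightly defined.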
We can also examine the relationship between the magnitude of differences between pairs of treatments as captured by F1 and by the unbiased MCC. Fig. 6 shows the comparisons for F1 and MCC plotted against each other. When the metrics are concordant the data points fall in the lower left and upper right quadrants (coloured red), whilst a change in direction is signified when the points lie in the upper left and lower right quadrants (coloured blue). The number of instances is given in each quadrant. The plot shows that whilst there is a broad trend, namely that as the differences captured by one metric increase so do those captured by the other, there are, however, many extreme outliers.
5.3 Can we predict the likelihood of a conclusion change?
Here we explore whether the magnitude of the effect is an explanatory factor for the probability of a conclusion change when we replace the biased F1 classification performance metric with MCC.
One seemingly reasonable hypothesis is that the larger the observed difference between the two treatments (i.e., the effect magnitude), the less likely it is to matter which performance metric we employ. In other words, can a large effect size be trusted however it is measured? To investigate we use a simple procedure popularised by Gelman and Park Gelm09 based on a tertile split and comparing the bottom and top tertiles, in our case based upon absolute F1 difference (since we only care about magnitude, not direction). The reason for ignoring the middle tertile is to avoid comparing adjacent items, which is the well-known disadvantage of a median split.
Table 8 reveals a considerable difference between tertiles T1 and T3, which is reflected in the odds ratio of OR = 4.11 and its 95% CI. Again we use the Agresti-Coull method Brow01 . From this we can see that a small effect or magnitude (in our case, an absolute difference between F1 values that falls in the bottom tertile) is more likely than not to be in the wrong direction. Note that the effect is the absolute difference in F1 scores, as opposed to a standardised effect size such as the d-family of statistics like the widely used Cohen’s d. In other words, irrespective of “statistical significance” or other arguments, a result showing a small difference in competing classifiers, as captured by F1, is as likely as not to be in the wrong direction.
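The tertile procedure can be sketched as follows. The nine data points are invented purely to show the mechanics and bear no relation to Table 8; the function name is ours, and the simple rank-based split (which ignores ties at the tertile boundaries) is an assumption.

```python
def tertile_odds_ratio(abs_f1_diffs, discordant):
    """Odds ratio of a conclusion change: bottom versus top tertile of |F1 difference|.

    Assumes each compared tertile contains at least one changed and one
    unchanged result, otherwise the odds are undefined.
    """
    pairs = sorted(zip(abs_f1_diffs, discordant))   # order comparisons by effect magnitude
    k = len(pairs) // 3                             # size of each tertile
    bottom, top = pairs[:k], pairs[-k:]             # the middle tertile is ignored

    def odds(group):
        changed = sum(flag for _, flag in group)
        return changed / (len(group) - changed)

    return odds(bottom) / odds(top)


# Hypothetical data: nine comparisons, three per tertile.
diffs = [0.01, 0.02, 0.03, 0.05, 0.06, 0.07, 0.20, 0.30, 0.40]
changed = [True, True, False, False, True, False, False, False, True]
print(tertile_odds_ratio(diffs, changed))  # 4.0: odds of 2/1 (bottom) against 1/2 (top)
```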
This paper has posed the question: to what extent can we rely on research results from software defect prediction studies that are based on the problematic F1 performance metric? Unfortunately, although we, and many others before us, have shown that F1 is not a good choice of metric in the context of defect prediction (see Section 2.2), F1 has been very widely used. So, should we be concerned or is this just a minor academic quibble? This question is the theme of our paper.
To summarise our investigation, we have searched the literature for software defect prediction studies that report performance both as F1 and MCC. By this means we retrieved 38 refereed papers that contain a combined 12,471 pairs of results. We then analysed these results by assessing whether the F1 and MCC results are concordant. So, for example, we say a result is concordant if we prefer predictor A to B irrespective of whether we use F1 or MCC to make that comparison. Contrariwise, we say that the results are discordant if the direction of the preference depends upon which classification performance metric is used; that is, the conclusion would change.
Our main findings are:
Although not a new finding we show that F1 is problematic when used in a two-class problem domain such as software defect prediction. By enumerating and then plotting all N=40 confusion matrix permutations, we show how misleading F1 is because it is not chance-adjusted. We demonstrate this with respect to simple Bookmaker’s odds.
The F1 metric is still widely used by researchers investigating classifiers for software defect prediction. Our analysis of the literature suggests of the order of 800 software defect prediction papers have used this metric in the past five years alone.
We find that more than a fifth (21.95%) of all results change not only in magnitude but most importantly, in conclusion (or direction) when the unbiased MCC is used, instead of the F1 metric.
In passing, we also note that some classifiers do not perform well, i.e., less well than chance. This is not apparent if researchers rely on F1, although clear from a negatively valued correlation coefficient.
Unsurprisingly, the smaller the effect (the difference in performance between pairs of classifiers) the more likely the conclusion will change. The odds ratio between the bottom and top tertiles is 4.11.
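To make the first finding concrete, the sketch below computes both metrics from a confusion matrix using their standard definitions and enumerates every matrix with N=40. The function names are ours, returning 0 when a metric is undefined is a convention adopted here, and reading "all permutations" as every non-negative (TP, FP, FN, TN) summing to 40 is our assumption. The example matrix is hypothetical.

```python
from math import sqrt


def f1_score(tp, fp, fn, tn):
    """F1 = 2TP / (2TP + FP + FN); true negatives play no part."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def confusion_matrices(n):
    """Yield every (tp, fp, fn, tn) of non-negative integers summing to n."""
    for tp in range(n + 1):
        for fp in range(n + 1 - tp):
            for fn in range(n + 1 - tp - fp):
                yield tp, fp, fn, n - tp - fp - fn


# Hypothetical N=40 case: 30 defective units, 36 predicted defective,
# with predictions statistically independent of the true labels.
chance_level = (27, 9, 3, 1)
print(round(f1_score(*chance_level), 3))        # 0.818
print(mcc(*chance_level))                       # 0.0
print(sum(1 for _ in confusion_matrices(40)))   # 12341
```

A classifier that performs no better than guessing therefore still obtains an F1 above 0.8 on this imbalanced example, whereas MCC reports zero association.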
This tendency to use F1 in software defect studies has wider ramifications than just single studies, since it then propagates through into meta-analyses which are often based on this metric Hoss17 ; Malh15 . Other meta-analyses have been obliged to discard significant amounts of data when researchers only reported results in terms of F1 e.g., Shep14 .
Finally, we wish to be clear that we are not making a criticism of the authors of the 38 primary studies included in our meta-analysis. To their considerable credit they have provided the necessary data to make our investigation possible. Nor are we claiming that they have relied upon the F1 metric. They have, however, provided the means whereby we can answer the question: how much does using F1 for software defect prediction studies matter? Sadly, the answer seems to be: a good deal.
6.1 Threats to validity
Internal: threats relate to the extent to which the design of our investigation enables us to argue there is evidence to support our claim (i.e., that using F1 as a response variable for defect prediction studies causes misleading results and therefore conclusions).
Can we be sure that discordance is an appropriate way to reveal problems with study conclusions due to using F1? We believe using the idea of sign or direction change, in other words reasoning in terms of preference relations, is the most fundamental way of considering pairs of results. Changing a preference for Classifier A to Classifier B inverts the meaning of the results. The alternative of setting some minimum effect magnitude threshold seems arbitrary by comparison.
Perhaps researchers only focus on very large differences between F1 measures? So a change in direction or discordant results might not matter if both sets of results are very close to a zero effect size. Researchers have indeed used a range of decision procedures to determine whether the magnitude of the effect matters (see Table 5) but the majority (24/38) use NHST. One of the many criticisms of NHST is that very small effects can be ‘significant’ when the sample size is large, which is typically the case for software defect studies.
Measurement error may be present, for instance with regard to boundary conditions, e.g., F1=0. Elsewhere we have found that reporting and/or measurement errors can be depressingly prevalent Shep19 . However, it is not obvious why F1 would be more impacted than MCC.
The data sets used by researchers generally assume simple relationships and traceability between defects and repairs. Herbold Herb19 has argued, with some justification, that we should expect m:n relationships between defects and software units. Whilst this may well weaken the practical relevance of software defect prediction research, our focus is on how we assess classification performance and whether it matters if F1 is employed. Hence we believe that this threat, although valid, is somewhat peripheral.
In line with all the research included in this study, we ignore costs by assuming the costs of false positives and false negatives are equal. Penalising one class of error more than the other is a potentially important area of software defect prediction research Khos98 ; Herb19 , but one outside the scope of this study.
External: threats concern the generalisability of our findings.
Is our sample large enough? We have more than 12,000 pairs of results plus we have sought to locate all studies that provide the data we need for our investigation.
Suppose we’d looked at other studies? Requiring the study to publish both F1 and MCC results might skew the findings? This is possible and it does seem that using MCC is a relatively recent practice. We excluded the grey literature (i.e., unrefereed studies). Given that (hopefully) research methods and practice improve over time and our focusing on demonstrably refereed studies, this would suggest that if anything we are biased to higher quality studies. Consequently, the overall picture could conceivably be worse than our meta-analysis reveals.
These results have implications both for researchers and for consumers of their research (both other researchers and practitioners).
The first, and most obvious, implication is that researchers should stop using the F1 metric to analyse and compare software defect classifiers. We should not reason that because people have previously used F1 we should continue to do so. Otherwise our research will be perpetually mired in the past! Minimally, we suggest researchers should provide other unbiased metrics such as the Matthews correlation coefficient Bald00 .
We need full reporting of data, results and code / scripts Muna17 . Preferably, papers should provide all the confusion matrices so that a wide range of metrics can potentially be computed as secondary analysis.
When undertaking meta-analyses these should not be based upon F1 results. Instead, it may be possible to either use results based upon other metrics or derive them from other information reported Bowe14 . Otherwise, the risk that 22% of the results used in a meta-analysis are completely misleading must be viewed as rendering the results critically contaminated.
When reading past studies based upon F1, consider the absolute size of the effect plus the confidence interval but ignore statistical significance. Unless the absolute F1 difference is non-trivial, we recommend little credence should be given to such a result. Even then, there remains a chance that not only is the magnitude of the effect wrong but it is actually in the opposite direction. So instead of classifier A being considerably better than B, it turns out that B is better than A. How can we expect practitioners to deploy substantial resources in the real world, e.g., guiding their testing effort, when such advice could be completely misguided?
It seems a great deal of research effort has been deployed on the clearly important problem of how to predict where testing effort should be focused in large software systems. As important as that question might be, we can hardly expect our research to have much practical impact unless we, as a community, take reasonable steps to ensure our computational experiments have meaning.
Appendix A Details of the primary studies included in the systematic review
|Abae18||Abaei et al.||2018||J||A fuzzy logic expert system to predict module fault proneness using unlabeled data||14|
|AlDa18||Al Dallal||2018||J||Predicting fault-proneness of reused object-oriented classes in software post-releases||27|
|Ali20||Ali et al.||2020||J||Software Defect Prediction Using Variant based Ensemble Learning and Feature Selection Techniques|| |
|Amas18||Amasaki||2018||C||Cross-version defect prediction using cross-project defect prediction approaches: does it work?||1404|
|Amas20||Amasaki||2020||J||Cross-version defect prediction: use historical data, cross-project data, or both?||171|
|Ayon19||Ayon||2019||C||Neural network based software defect prediction using genetic algorithm and particle swarm optimization||30|
|Bangash20||Bangash et al.||2020||J||On the time-based conclusion stability of cross-project defect prediction models||190|
|Bowe12||Bowes et al.||2012||C||Comparing the performance of fault prediction models which report multiple performance measures: recomputing the confusion matrix||6|
|Bowe15||Bowes et al.||2015||C||Different classifiers find different defects although with different level of consistency||12|
|Bowe18||Bowes et al.||2018||J||Software defect prediction: do different classifiers find the same defects?||18|
|Chen20||Chen et al.||2020||J||An empirical study on heterogeneous defect prediction approaches||336|
|Felix20||Felix et al.||2020||J||Predicting the number of defects in a new software version||30|
|Ge18||Ge et al.||2018||C||Comparative study on defect prediction algorithms of supervised learning software based on imbalanced classification data sets|| |
|Gong19||Gong et al.||2019||J||An improved transfer adaptive boosting approach for mixed-project defect prediction||1512|
|Herb18||Herbold et al.||2018||J||A comparative study to benchmark cross-project defect prediction approaches||1890|
|Iqba20||Iqbal||2020||J||A classification framework for software defect prediction using multi-filter feature selection technique and MLP||588|
|Lena20||Lenarduzzi et al.||2020||C||Are SonarQube rules inducing bugs?||36|
|Matl19||Matloob et al.||2019||J||A framework for software defect prediction using feature selection and ensemble learning techniques||85|
|Naseem20||Naseem et al.||2020||J||Investigating Tree Family Machine Learning Techniques for a Predictive System to Unveil Software Defects||450|
|Nezh20||Nezhad et al.||2020||J||Software defect prediction using over-sampling and feature extraction based on Mahalanobis distance|| |
|Niu20||Niu et al.||2020||J||Cost-sensitive Dictionary Learning for Software Defect Prediction||1595|
|Pan19||Pan et al.||2019||J||An improved cnn model for within-project software defect prediction||336|
|Pand20||Pandey et al.||2020||J||Bpdet: an effective software bug prediction model using deep representation and ensemble learning techniques||429|
|Pecorelli20||Pecorelli et al.||2020||J||A large empirical assessment of the role of data balancing in machine-learning-based code smell detection||11|
|Rizw17||Rizwan et al.||2017||C||Empirical study on software bug prediction||78|
|Rodr14||Rodriguez et al.||2014||C||Preliminary comparison of techniques for dealing with imbalance in software defect prediction||726|
|Ship18||Shippey et al.||2018||C||Code cleaning for software defect prediction: a cautionary tale||18|
|Tian20||Tian et al.||2020||C||How Well Just-In-Time Defect Prediction Techniques Enhance Software Reliability?||110|
|Tong18||Tong et al.||2018||J||Software defect prediction using stacked denoising autoencoders and two-stage ensemble learning|| |
|Tong20||Tong et al.||2020||J||Credibility Based Imbalance Boosting Method for Software Defect Proneness Prediction||231|
|Tran19||Tran et al.||2019||C||Combining feature selection, feature learning and ensemble learning for software fault prediction||36|
|Xuan15||Xuan et al.||2015||C||Evaluating defect prediction approaches using a massive set of metrics: an empirical study||15|
|Xu19||Xu et al.||2019||J||Software defect prediction based on kernel pca and weighted extreme learning machine||237|
|Xu20||Xu et al.||2020||J||Imbalanced metric learning for crashing fault residence prediction||237|
|Zhan16||Zhang et al.||2016||J||Towards building a universal defect prediction model with rank transformed predictors||50|
|Zhang20||Zhang et al.||2020||J||Automated defect identification via path analysis-based features with transfer learning|| |
|Zhao19||Zhao et al.||2019||J||Siamese dense neural network for software defect prediction with small data||900|
The authors wish to thank the authors of the 38 primary studies included for providing sufficient information to make this analysis possible. We also wish to stress that our criticism of F1 does not mean we are criticising their papers. On the contrary, their foresight that alternative metrics to F1 are needed, has been invaluable. Jingxiu Yao wishes to acknowledge the support of the China Scholarship Council.
Conflict of interest
The authors declare that they have no conflict of interest.
- (1) C. Catal, B. Diri, A systematic review of software fault prediction studies, Expert Systems with Applications 36 (4) (2009) 7346–7354.
- (2) T. Hall, S. Beecham, D. Bowes, D. Gray, S. Counsell, A systematic literature review on fault prediction performance in software engineering, IEEE Transactions on Software Engineering 38 (6) (2012) 1276–1304.
- (3) R. Malhotra, A systematic review of machine learning techniques for software fault prediction, Applied Soft Computing 27 (2015) 504–518. doi:10.1016/j.asoc.2014.11.023.
- (4) S. Hosseini, B. Turhan, D. Gunarathna, A systematic literature review and meta-analysis on cross project defect prediction, IEEE Transactions on Software Engineering 45 (2) (2017) 111–147.
- (5) R. Özakıncı, A. Tarhan, Early software defect prediction: A systematic map and review, Journal of Systems and Software 144 (2018) 216–239.
- (6) L. Son, N. Pritam, M. Khari, R. Kumar, P. Phuong, P. Thong, et al., Empirical study of software defect prediction: a systematic mapping, Symmetry 11 (2).
- (7) N. Li, M. Shepperd, Y. Guo, A systematic review of unsupervised learning techniques for software defect prediction, Information and Software Technology online (2020) 106287.
- (8) M. Sokolova, G. Lapalme, A systematic analysis of performance measures for classification tasks, Information Processing and Management 45 (4) (2009) 427–437.
- (9) D. Powers, Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation, Journal of Machine Learning Technologies 2 (1) (2011) 37–63.
- (10) A. Luque, A. Carrasco, A. Martín, A. de las Heras, The impact of class imbalance in classification performance metrics based on the binary confusion matrix, Pattern Recognition 91 (2019) 216–231.
- (11) P. Baldi, S. Brunak, Y. Chauvin, C. Andersen, H. Nielsen, Assessing the accuracy of prediction algorithms for classification: an overview, Bioinformatics 16 (5) (2000) 412–424.
- (12) J. Yao, M. Shepperd, Assessing software defection prediction performance: Why using the Matthews correlation coefficient matters, in: Proceedings of the 24th International Conference on Evaluation and Assessment in Software Engineering, 2020.
- (13) S. Wang, X. Yao, Using class imbalance learning for software defect prediction, IEEE Transactions on Reliability 62 (2) (2013) 434–443.
- (14) W. Youden, Index for rating diagnostic tests, Cancer 3 (1) (1950) 32–35.
- (15) D. Powers, Recall & precision versus the bookmaker, International Conference on Cognitive Science, 2003.
- (16) T. Fawcett, An introduction to ROC analysis, Pattern Recognition Letters 27 (8) (2006) 861–874.
- (17) S. Morasca, L. Lavazza, On the assessment of software defect prediction models via ROC curves, Empirical Software Engineering 25 (5) (2020) 3977–4019.
- (18) D. Hand, Measuring classifier performance: a coherent alternative to the area under the ROC curve, Machine Learning 77 (2009) 103–123. doi:10.1007/s10994-009-5119-5.
- (19) P. Flach, M. Kull, Precision-recall-gain curves: PR analysis done right, in: Advances in Neural Information Processing Systems (NIPS 2015), 2015, pp. 838–846.
- (20) C. van Rijsbergen, Information Retrieval, 2nd Edition, Butterworths, 1979.
- (21) Y. Sun, A. Wong, M. Kamel, Classification of imbalanced data: A review, International Journal of Pattern Recognition and Artificial Intelligence 23 (04) (2009) 687–719.
- (22) M. Shepperd, D. Bowes, T. Hall, Researcher bias: The use of machine learning in software defect prediction, IEEE Transactions on Software Engineering 40 (6) (2014) 603–616.
- (23) D. Hand, P. Christen, A note on using the F-measure for evaluating record linkage algorithms, Statistics and Computing 28 (3) (2018) 539–547.
- (24) V. Garousi, M. Felderer, M. Mäntylä, Guidelines for including grey literature and conducting multivocal literature reviews in software engineering, Information and Software Technology 106 (2019) 101–121.
- (25) R. Elmore, Comment on “towards rigor in reviews of multivocal literatures: applying the exploratory case study method”, Review of educational research 61 (3) (1991) 293–297.
- (26) D. Donoho, A. Maleki, I. Rahman, M. Shahram, V. Stodden, Reproducible research in computational harmonic analysis, Computing in Science and Engineering 11 (1) (2009) 8–18.
- (27) D. Allison, A. Brown, B. George, K. Kaiser, Reproducibility: A tragedy of errors, Nature 530 (7588) (2016) 27–29.
- (28) M. Shepperd, Y. Guo, N. Li, M. Arzoky, A. Capiluppi, S. Counsell, G. Destefanis, S. Swift, A. Tucker, L. Yousefi, The prevalence of errors in machine learning experiments, in: International Conference on Intelligent Data Engineering and Automated Learning, Springer, 2019, pp. 102–109.
- (29) P. Schober, C. Boer, L. Schwarte, Correlation coefficients: Appropriate use and interpretation, Anesthesia & Analgesia 126 (5) (2018) 1763–1768.
- (30) J. Cohen, The earth is round (p < .05), American Psychologist 49 (12) (1994) 997–1003.
- (31) A. Gelman, H. Stern, The difference between “significant” and “not significant” is not itself statistically significant, The American Statistician 60 (4) (2006) 328–331.
- (32) D. Colquhoun, An investigation of the false discovery rate and the misinterpretation of p-values, Royal Society Open Science 1 (140216). doi:10.1098/rsos.140216.
- (33) N. Mittas, I. Mamalikidis, L. Angelis, A framework for comparing multiple cost estimation methods using an automated visualization toolkit, Information and Software Technology 57 (2015) 310–328.
- (34) H. Tran, L. Hanh, N. Binh, Combining feature selection, feature learning and ensemble learning for software fault prediction, in: 11th IEEE International Conference on Knowledge and Systems Engineering (KSE), 2019, pp. 1–8.
- (35) L. Brown, T. Cai, A. DasGupta, Interval estimation for a binomial proportion, Statistical Science 16 (2) (2001) 101–117.
- (36) A. Gelman, D. Park, Splitting a predictor at the upper quarter or third and the lower quarter or third, The American Statistician 63 (1) (2009) 1–8.
- (37) S. Herbold, On the costs and profit of software defect prediction, IEEE Transactions on Software Engineering online.
- (38) T. M. Khoshgoftaar, E. B. Allen, Classification of fault-prone software modules: Prior probabilities, costs, and model evaluation, Empirical Software Engineering 3 (3) (1998) 275–298.
- (39) M. Munafò, B. Nosek, D. Bishop, K. Button, C. Chambers, N. du Sert, U. Simonsohn, E. Wagenmakers, J. Ware, J. Ioannidis, A manifesto for reproducible science, Nature Human Behaviour 1 (1) (2017) 0021.
- (40) D. Bowes, T. Hall, D. Gray, DConfusion: a technique to allow cross study performance evaluation of fault prediction studies, Automated Software Engineering 21 (2) (2014) 287–313.
- (41) G. Abaei, A. Selamat, J. Al Dallal, A fuzzy logic expert system to predict module fault proneness using unlabeled data, Journal of King Saud University-Computer and Information Sciences online.
- (42) J. Al Dallal, Predicting fault-proneness of reused object-oriented classes in software post-releases, Arabian Journal for Science and Engineering 43 (12) (2018) 7153–7166.
- (43) U. Ali, S. Aftab, A. Iqbal, Z. Nawaz, M. S. Bashir, M. A. Saeed, Software defect prediction using variant based ensemble learning and feature selection techniques., International Journal of Modern Education & Computer Science 12 (5).
- (44) S. Amasaki, Cross-version defect prediction using cross-project defect prediction approaches: Does it work?, in: Proceedings of the 14th ACM International Conference on Predictive Models and Data Analytics in Software Engineering, 2018, pp. 32–41.
- (45) S. Amasaki, Cross-version defect prediction: use historical data, cross-project data, or both?, Empirical Software Engineering 25 (2020) 1573–1595.
- (47) S. Ayon, Neural network based software defect prediction using genetic algorithm and particle swarm optimization, in: 1st International Conference on Advances in Science, Engineering and Robotics Technology (ICASERT), IEEE, 2019, pp. 1–4.
- (48) A. A. Bangash, H. Sahar, A. Hindle, K. Ali, On the time-based conclusion stability of cross-project defect prediction models, Empirical Software Engineering 25 (6) (2020) 5047–5083.
- (49) D. Bowes, T. Hall, D. Gray, Comparing the performance of fault prediction models which report multiple performance measures: recomputing the confusion matrix, in: 8th ACM International Conference on Predictive Models in Software Engineering, 2012, pp. 109–118.
- (50) D. Bowes, T. Hall, J. Petrić, Different classifiers find different defects although with different level of consistency, in: 11th ACM International Conference on Predictive Models and Data Analytics in Software Engineering, 2015, pp. 1–10.
- (51) D. Bowes, T. Hall, J. Petrić, Software defect prediction: do different classifiers find the same defects?, Software Quality Journal 26 (2) (2018) 525–552.
- (52) H. Chen, X. Jing, Z. Li, D. Wu, Z. Peng, Y. Huang, An empirical study on heterogeneous defect prediction approaches, IEEE Transactions on Software Engineering online.
- (53) E. A. Felix, S. P. Lee, Predicting the number of defects in a new software version, PloS one 15 (3) (2020) e0229131.
- (54) J. Ge, J. Liu, W. Liu, Comparative study on defect prediction algorithms of supervised learning software based on imbalanced classification data sets, in: 2018 19th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), IEEE, 2018, pp. 399–406.
- (55) L. Gong, S. Jiang, L. Jiang, An improved transfer adaptive boosting approach for mixed-project defect prediction, Journal of Software: Evolution and Process 31 (10) (2019) e2172.
- (56) S. Herbold, A. Trautsch, J. Grabowski, A comparative study to benchmark cross-project defect prediction approaches, IEEE Transactions on Software Engineering 44 (9) (2018) 811–833.
- (57) A. Iqbal, S. Aftab, A classification framework for software defect prediction using multi-filter feature selection technique and mlp, International Journal of Modern Education & Computer Science 12 (1).
- (58) V. Lenarduzzi, F. Lomio, H. Huttunen, D. Taibi, Are SonarQube rules inducing bugs?, in: International Conference on Software Analysis, Evolution and Reengineering (SANER 2020), 2020.
- (59) F. Matloob, S. Aftab, A. Iqbal, A framework for software defect prediction using feature selection and ensemble learning techniques, International Journal of Modern Education and Computer Science 12 (2019) 14–20.
- (60) R. Naseem, B. Khan, A. Ahmad, A. Almogren, S. Jabeen, B. Hayat, M. A. Shah, Investigating tree family machine learning techniques for a predictive system to unveil software defects, Complexity 2020.
- (61) M. NezhadShokouhi, M. Majidi, A. Rasoolzadegan, Software defect prediction using over-sampling and feature extraction based on Mahalanobis distance, The Journal of Supercomputing 76 (1) (2020) 602–635.
- (62) L. Niu, J. Wan, H. Wang, K. Zhou, Cost-sensitive dictionary learning for software defect prediction, Neural Processing Letters 52 (3) (2020) 2415–2449.
- (63) C. Pan, M. Lu, B. Xu, H. Gao, An improved CNN model for within-project software defect prediction, Applied Sciences 9 (10).
- (64) S. Pandey, R. Mishra, A. Tripathi, BPDET: An effective software bug prediction model using deep representation and ensemble learning techniques, Expert Systems with Applications 144.
- (65) F. Pecorelli, D. Di Nucci, C. De Roover, A. De Lucia, A large empirical assessment of the role of data balancing in machine-learning-based code smell detection, Journal of Systems and Software 169 (2020) 110693.
- (66) S. Rizwan, T. Wang, X. Su, Salahuddin, Empirical study on software bug prediction, in: Proceedings of the 2017 International Conference on Software and e-Business, 2017, pp. 55–59.
- (67) D. Rodriguez, I. Herraiz, R. Harrison, J. Dolado, J. Riquelme, Preliminary comparison of techniques for dealing with imbalance in software defect prediction, in: Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering, ACM, 2014, p. 43.
- (68) T. Shippey, D. Bowes, S. Counsell, T. Hall, Code cleaning for software defect prediction: A cautionary tale, in: 44th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), IEEE, 2018, pp. 239–243.
- (69) Y. Tian, N. Li, J. Tian, W. Zheng, How well just-in-time defect prediction techniques enhance software reliability?, in: 2020 IEEE 20th International Conference on Software Quality, Reliability and Security (QRS), IEEE, 2020, pp. 212–221.
- (70) H. Tong, B. Liu, S. Wang, Software defect prediction using stacked denoising autoencoders and two-stage ensemble learning, Information and Software Technology 96 (2018) 94–111.
- (71) H. Tong, S. Wang, G. Li, Credibility based imbalance boosting method for software defect proneness prediction, Applied Sciences 10 (22) (2020) 8059.
- (72) X. Xuan, D. Lo, X. Xia, Y. Tian, Evaluating defect prediction approaches using a massive set of metrics: An empirical study, in: Proceedings of the 30th Annual ACM Symposium on Applied Computing, 2015, pp. 1644–1647.
- (73) Z. Xu, J. Liu, X. Luo, Z. Yang, Y. Zhang, P. Yuan, Y. Tang, T. Zhang, Software defect prediction based on kernel PCA and weighted extreme learning machine, Information and Software Technology 106 (2019) 182–200.
- (74) Z. Xu, K. Zhao, M. Yan, P. Yuan, L. Xu, Y. Lei, X. Zhang, Imbalanced metric learning for crashing fault residence prediction, Journal of Systems and Software 170 (2020) 110763.
- (75) F. Zhang, A. Mockus, I. Keivanloo, Y. Zou, Towards building a universal defect prediction model with rank transformed predictors, Empirical Software Engineering 21 (5) (2016) 2107–2145.
- (76) Y. Zhang, D. Jin, Y. Xing, Y. Gong, Automated defect identification via path analysis-based features with transfer learning, Journal of Systems and Software 166 (2020) 110585.
- (77) L. Zhao, Z. Shang, L. Zhao, A. Qin, Y. Tang, Siamese dense neural network for software defect prediction with small data, IEEE Access 7 (2019) 7663–7677.