The impact of using biased performance metrics on software defect prediction research

by   Jingxiu Yao, et al.
Brunel University London

Context: Software engineering researchers have undertaken many experiments investigating the potential of software defect prediction algorithms. Unfortunately, some widely used performance metrics are known to be problematic, most notably F1, which nevertheless remains in widespread use. Objective: To investigate the potential impact of using F1 on the validity of this large body of research. Method: We undertook a systematic review to locate relevant experiments and then extracted all pairwise comparisons of defect prediction performance using F1 and the unbiased Matthews correlation coefficient (MCC). Results: We found a total of 38 primary studies. These contain 12,471 pairs of results. Of these, 21.95% changed direction when the MCC metric is used instead of the biased F1 metric. Unfortunately, we also found evidence suggesting that F1 remains widely used in software defect prediction research. Conclusions: We reiterate the concerns of statisticians that F1 is a problematic metric outside of an information retrieval context, since we are concerned about both classes (defect-prone and not defect-prone units). This inappropriate usage has led to a substantial number (more than one fifth) of erroneous (in terms of direction) results. Therefore we urge researchers to (i) use an unbiased metric and (ii) publish detailed results including confusion matrices such that alternative analyses become possible.




1 Introduction

The idea of trying to predict which software components e.g., classes, files or even methods are likely to be defect-prone has gained a great deal of traction in software engineering research over the past three decades. To be able to reliably distinguish between defect-prone and clean components is clearly desirable since QA resources can then be allocated more effectively. A considerable number of systematic reviews Cata09 ; Hall12 ; Malh15 ; Hoss17 ; Ozak18 ; Son19 ; Li20 have identified and summarised many hundreds of such studies.

These defect prediction studies have generally approached the problem empirically in the form of computational experiments, where different prediction systems are compared over data with known outcomes (i.e., labelled defect-prone or not defect-prone). Comparisons are made using various classification performance metrics, typically F1 and Area under the Curve (AUC), as the response variables. [Footnote 1: More correctly speaking, F1 is a specific instantiation ($\beta = 1$) of the F-measure, which is defined as $F_\beta = \frac{(1+\beta^2) \cdot Precision \cdot Recall}{\beta^2 \cdot Precision + Recall}$. However, it is a near universal practice in software defect prediction to set $\beta = 1$, so for simplicity in this paper we will simply refer to F1.] Unfortunately, despite its widespread use, statisticians and machine learning specialists have drawn attention to various difficulties with F1, particularly when it is used for two-class problems Soko09 ; Powe11 ; Luqu19 . Section 2.2 reviews these difficulties in some detail.

So we pose the question: do the problems with F1 actually matter, or are the results from the many studies utilising F1 good enough approximations to the ‘truth’? This is an important question because the use of F1 remains widespread (more than a third of papers over the period 2015-20). To answer it we locate studies that report results both with F1 and also the unbiased Matthews correlation coefficient (MCC) Bald00 which enables us to make comparisons and check whether the conclusion changes depending upon the choice of classification performance metric.

This paper extends our previous analysis Yao20 in five ways.

  • We provide a comprehensive review and critique of the widely-used, classification performance metric F1 drawing from both the machine learning and statistical literature, complemented with an analysis of all permutations of the N=40 confusion matrix.

  • We have extended the searches for relevant, defect prediction studies which increases the number of primary studies from 8 to 38 meaning that the number of individual results in the meta-analysis is now 12,471 (a more than threefold increase).

  • We investigate how widely the F1 metric is used in software defect prediction experiments;

  • We undertake an in-depth investigation of the circumstances when the metric F1 is most likely to be misleading, in particular imbalance.

  • The updated raw data and code are available from

The remainder of this paper is structured as follows. Section 2 first reviews different approaches to evaluating defect prediction performance, followed by a detailed critique of the F1 metric and an assessment of its role within software defect prediction research. Section 3 describes the systematic review we undertook to find as many F1 and MCC results as possible for our meta-analysis. Then, we present our bibliometric findings in Section 4. This is followed by our meta-analysis, results and discussion contained in Section 5. The article concludes (Section 6) with a summary of our findings, threats to validity and a set of recommendations for researchers and for readers of defect prediction studies.

2 Background

In this section we review the most widely utilised classification performance metrics (for more exhaustive reviews see Powers Powe11 and Luque et al. Luqu19 ). Next, we focus on the F1 measure in detail, highlight some problems and contrast it with an alternative metric, namely the Matthews correlation coefficient. Finally, we briefly examine the question of how widely F1 is used in software defect prediction experiments.

2.1 Classification performance metrics in software defect prediction

In this discussion we focus on two-class classification problems, as per all of our included studies. This approach restricts the analysis to defect-prone (positive) and not defect-prone (negative) software components. These classes are mutually exclusive. Thus the structure of the classifier performance can be represented as a confusion matrix (see Table 1) which, since we have two classes, is a $2 \times 2$ contingency table of predicted versus actual class. From this matrix we are able to derive most classification performance metrics.

 | Actual Positive | Actual Negative
Predicted Positive | TP | FP
Predicted Negative | FN | TN
Table 1: The confusion matrix

The four cells of the confusion matrix comprise counts of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). For our problem domain these correspond, respectively, to defective components correctly classified as defective, defect-free components correctly classified as defect-free, defect-free components incorrectly classified as defective and defective components incorrectly classified as defect-free.

For the discussion regarding confusion matrices we use the following terminology and concepts.


Prevalence:

$p$ is the proportion of actual positive cases (i.e., defect density), hence the prevalence of negative cases is simply $1 - p$. Alternatively, $p = (TP + FN)/n$ where $n$ is the cardinality. This is an important concept since it gives rise to problems of imbalanced data sets, which in turn cause difficulties for training classifiers. For the problem domain of software defect prediction it is common, but not invariably so, that $p$ is close to zero and the datasets are imbalanced Wang13 . Unfortunately this also leads to difficulties with biased classification performance metrics, as we will shortly demonstrate.


Precision:

is defined as $TP/(TP+FP)$, which is the proportion of correctly identified defect-prone units from all cases classified as defect-prone.


Recall:

is also referred to as sensitivity or the true positive rate (TPR). It is the proportion of positive cases that are correctly predicted positive (defect-prone) out of all positive cases, i.e., $TP/(TP+FN)$.


Specificity:

or the true negative rate (TNR), is defined as the proportion of negative cases that are correctly classified as negative out of all negative cases, i.e., $TN/(TN+FP)$. For a given classifier, Specificity and Recall tend to trade off against each other: moving the decision threshold to increase Specificity generally decreases Recall, and vice versa.

False positive rate (FPR):

is defined as the proportion of negative cases that are mistakenly considered as positive out of all negative cases. It is also sometimes referred to as Fallout and is a way of characterising the contamination of positive predictions by negative examples.


Accuracy:

is defined as the proportion of all cases that are correctly classified, i.e., $(TP+TN)/n$. However, it is not chance-corrected and is therefore often a misleading guide. Consider the situation where 95% of cases are positive (so $p = 0.95$): then trivially a classifier could achieve 95% accuracy simply by predicting that all cases belong to the modal (in this case positive) class.


F-measure (F1):

is the harmonic mean of Precision and Recall. It is based on the F-family of measures, specifically the case where $\beta = 1$, which weights Precision and Recall equally. Although Precision and Recall can each be trivially optimised independently (either by predicting no positive cases or by predicting all positive cases), the idea of combining both measures is intended to take a more balanced view. For this reason F1 has been widely deployed as a means of assessing classifier performance. NB A harmonic mean, as opposed to an arithmetic mean, will penalise more extreme differences between Precision and Recall.

Bookmaker’s odds:

is also known as Youden’s J Youd50 or Informedness for multi-class classification. It is defined as: $J = TPR + TNR - 1 = TPR - FPR$. It yields the proportion of time we are making an informed decision as opposed to guessing Powe03 . A classifier that has a Bookmaker score of zero is doing no better than chance and a negative score implies worse than chance. This is important information when evaluating classifiers. Its chief value is as a simple benchmark of the extent to which a classifier adds value.

Matthews correlation coefficient (MCC):

is the Pearson correlation for a $2 \times 2$ contingency table and is known as $\phi$ or $r_\phi$ by statisticians. It is defined using TP, TN, FP and FN, and so includes all parts of the confusion matrix. As with any correlation coefficient, it ranges from -1 to +1, with values nearer +1 representing better performance. Thus +1 indicates perfect classification, -1 indicates perfectly perverse classification, and zero indicates random predictions, i.e., no classification value. It is related to the chi-square statistic for a $2 \times 2$ contingency table such that $|MCC| = \sqrt{\chi^2 / n}$ where $n$ is the number of cases.

Receiver operating characteristic (ROC) curve:

is a two-dimensional chart where TPR (Recall) is plotted on the Y axis and FPR (Fallout) is plotted on the X axis. A ROC curve describes the relative trade-offs between TPs and FPs for different thresholds of accepting a case as being positive ranging from all (when TPR=1) to none (when FPR=0). As a two-dimensional assessment index, it can be problematic when comparing the performance of different classifiers. In order to compare classifiers using a single scalar value, the usual method is to calculate the area under the ROC curve (AUC) Fawc06 . Since AUC is a proportion of the total area of the unit square, its value must fall between 0 and 1, where AUC=1 means the classifier can perfectly distinguish between positive and negative classes, whilst AUC=0 means the classifier is perfectly perverse, i.e., it predicts all positive cases as negative ones and vice versa. When AUC=0.5, the classifier has no discriminative value and is equivalent to random guessing (equivalently Youden’s J or MCC equal zero). Therefore, any value greater than 0.5 represents a better than chance classification for the two-class case. Note, however, this metric refers to a family of possible classifiers rather than any specific classifier. Thus, unless the ROC curve for classifier A strictly dominates classifier B we cannot make any remarks about our preference for A over B since, in practice, we can only deploy a single classifier. Morasca and Lavazza Mora20 suggest this difficulty might be reduced by only examining ”relevant areas”, that is regions of interest. But we also note that AUC has come under considerable criticism (for example that it uses different misclassification cost distributions for different classifiers Hand09 ; Flac15 ). Moreover, its purpose is somewhat different from our primary interest which is to address performance metrics for particular classifiers. For this reason we do not explore AUC further in this paper.

Table 2 gives formal definitions of the most commonly deployed classification performance metrics in terms of the confusion matrix. It also denotes those metrics that are chance-corrected, in other words those that make a comparison with a guessing strategy; for example, a negative MCC score signals performance worse than guessing. By contrast, F1 is not chance-corrected because the F1 value obtained by predicting that every case belongs to the modal class depends upon the prevalence of that class.

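Metric | Definition | Range | Better | Chance corrected
Cardinality, $n$ | $TP+FP+TN+FN$ | | n.a. | n.a.
Prevalence, $p$ | $(TP+FN)/n$ | [0,1] | n.a. | n.a.
Accuracy | $(TP+TN)/n$ | [0,1] | High | No
Precision | $TP/(TP+FP)$ | [0,1] | High | No
Recall or True Positive Rate (TPR) | $TP/(TP+FN)$ | [0,1] | High | No
Specificity (TNR) | $TN/(TN+FP)$ | [0,1] | High | No
False Positive Rate (FPR) | $FP/(FP+TN)$ | [0,1] | Low | No
Bookmaker’s odds or Youden’s J | $TPR + TNR - 1$ | [-1,1] | High | Yes
F-measure (F1) | $2 \cdot Precision \cdot Recall / (Precision + Recall)$ | [0,1] | High | No
Matthews correlation coefficient (MCC) or $\phi$ | $(TP \cdot TN - FP \cdot FN) / \sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}$ | [-1,1] | High | Yes
Area Under Curve (AUC) | Area under the curve of FPR versus TPR | [0,1] | High | Yes
Table 2: Commonly used classification performance metrics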
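The definitions in Table 2 can be sketched in a few lines of Python. This is an illustrative sketch rather than the code used in the study: the function names `ratio` and `classification_metrics` are hypothetical, and returning `None` for an undefined ratio is our own convention. The example reproduces the situation described under Accuracy, where 95% of cases are positive and the classifier predicts every case as positive.

```python
from math import sqrt
from typing import Dict, Optional


def ratio(num: float, den: float) -> Optional[float]:
    """Return num / den, or None when the denominator is zero (undefined)."""
    return num / den if den else None


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, Optional[float]]:
    """Compute the Table 2 metrics from the four confusion matrix counts."""
    n = tp + fp + fn + tn
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)        # true positive rate (TPR)
    specificity = ratio(tn, tn + fp)   # true negative rate (TNR)

    # F1 as the harmonic mean of Precision and Recall. It is undefined when
    # either component is undefined or when both are zero (i.e., TP = 0).
    f1 = None
    if precision is not None and recall is not None and precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)

    # Bookmaker's odds (Youden's J) = TPR + TNR - 1.
    youden_j = None
    if recall is not None and specificity is not None:
        youden_j = recall + specificity - 1

    # MCC is undefined when any marginal total of the matrix is zero.
    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = ratio(tp * tn - fp * fn, mcc_den)

    return {
        "prevalence": ratio(tp + fn, n),
        "accuracy": ratio(tp + tn, n),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "fpr": ratio(fp, fp + tn),
        "youden_j": youden_j,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical data: 95 positive and 5 negative cases, all predicted positive.
trivial = classification_metrics(tp=95, fp=5, fn=0, tn=0)
for name, value in trivial.items():
    print(f"{name:12s} {value}")
```

In this example Accuracy and F1 both look excellent, whereas the Bookmaker's odds are exactly zero and MCC is undefined, because the classifier never predicts the negative class.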

2.2 A critique of F1

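F1 is a widely used performance metric in the field of software defect prediction. It is the harmonic mean of recall and precision (see Table 2) and is a specific instantiation ($\beta = 1$) of the F-measure, which is defined as:

$$F_\beta = \frac{(1+\beta^2) \cdot Precision \cdot Recall}{\beta^2 \cdot Precision + Recall}$$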

It originates from the information retrieval community and was first proposed by van Rijsbergen vanR79 in the 1970s. This metric is only sensitive to the positive class, as the definition does not include TNs. Precision is the proportion of true positives among the cases predicted positive and recall is the proportion of positive cases that are correctly predicted. Hence their values are entirely independent of the number of true negatives. Ignoring negative cases, except inasmuch as they contaminate predictions, makes perfect sense when the problem domain is essentially a single-class problem, e.g., retrieval of relevant pages from the web, when the number of irrelevant pages correctly not retrieved (TNs) is vast, cannot be determined and is not of interest.

So we have a classification metric suitable for single-class information retrieval problems being redeployed for two-class problems. Moreover, we speculate that some researchers have not fully considered the ramifications of doing so. For software defects (and many other problem domains) we most definitely have two classes. Knowing that a software unit has been correctly classified as not defect-prone is important in terms of project resources and software quality. In addition, the dichotomous view of prediction is an over-simplification since most classifiers predict class membership with a given confidence or probability. The threshold for positive class assignment is therefore both flexible and arbitrary, in that changing the acceptance threshold for positive cases can move a software unit from being predicted positive (defect-prone) to not defect-prone. The problem of two classes is compounded by much variation in the prevalence of defect-prone cases in training data sets. Typically these positive cases are very much in the minority, hence most data sets are highly imbalanced Sun09 .

A second problem is that F1 is difficult to interpret, other than that zero is the worst case and unity the best case. [Footnote 2: Strictly speaking, even zero can be problematic because in the event TP=0 then F1 is undefined, although it is customary to record this situation as F1=0.] Specifically, the chance component of the metric is unknown, unlike a correlation coefficient or AUC (see Table 2). So, for example, it is hard to know what a particular intermediate F1 value means. Is it better than chance? Is the classifier actually predicting or would we be better off just guessing? In contrast, a correlation coefficient equal to 0.25 means there is a small positive effect and that the classifier is indeed doing better than chance.

Third, F1 is not chance-corrected. An alternative way to think about the difficulties of F1 is in terms of its relationship to the so-called Bookmaker’s odds (otherwise known as Youden’s J Youd50 or informedness for multi-class classification problems Powe11 ). It gives the probability that the classifier is doing better than chance (see Table 2) and is independent of the relative proportions of positive and negative (defect-prone and not defect-prone) instances. Youden suggested that the metric ranges [0,1] since he appears not to have considered the possibility of a perverse classifier. Unfortunately such circumstances do arise in machine learning experiments, see for example the meta-analysis in Shep14 . Thus, the Bookmaker’s odds provides a simple benchmark to assess F1. [Footnote 3: Whilst we utilise the Bookmaker’s odds as a simple benchmark to evaluate the chance component of any classification performance, we consider correlation coefficients such as MCC to be more useful for overall performance evaluation. In the case of MCC it has the added utility of being related to a chi-squared distribution.]
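To illustrate why the lack of chance correction matters, consider a classifier that ignores its inputs and predicts positive with some fixed probability. The sketch below works with expected cell counts (an assumption made for simplicity, so it gives the F1 of the expected confusion matrix rather than the expectation of F1), and the function names and the symbol `q` for the guessing rate are hypothetical. With prevalence `p`, Precision equals `p` and Recall equals `q`, so F1 = 2pq/(p+q), whereas TPR and FPR both equal `q` and the Bookmaker's odds are zero whatever the prevalence.

```python
def guessing_f1(p: float, q: float) -> float:
    """F1 of a pure guesser from expected counts.

    p: prevalence of the positive class (0 < p <= 1)
    q: probability with which the guesser predicts positive (0 < q <= 1)
    Expected counts per case: TP = p*q, FP = (1-p)*q, FN = p*(1-q),
    so Precision = p and Recall = q.
    """
    return 2 * p * q / (p + q)


def guessing_bookmaker(q: float) -> float:
    """Bookmaker's odds (TPR - FPR) of the same guesser: both rates equal q."""
    tpr = q
    fpr = q
    return tpr - fpr


# The 'always predict positive' strategy (q = 1) at several prevalences.
for p in (0.05, 0.25, 0.50, 0.75, 0.95):
    print(f"p={p:.2f}  F1={guessing_f1(p, 1.0):.3f}  J={guessing_bookmaker(1.0):.1f}")
```

The same uninformed strategy therefore scores anywhere from roughly 0.1 to above 0.97 on F1 depending solely on the prevalence, which is why an F1 value cannot be judged against chance without knowing `p`.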

In Figure 1 we plot, for all 12341 possible permutations of an $N=40$ confusion matrix, the F1 score, the associated Bookmaker’s odds and the degree of imbalance, which is defined in terms of the prevalence $p$ (of positive cases). Note that for 940 permutations one or both metrics has no defined value, e.g., because TP=0. We can observe that there is a general tendency for the Bookmaker’s odds to increase as the F1 score increases, as indicated by the green smoothed line. However, there are many deviations and if we examine the lower right-hand quadrant ($F1 > 0.5$ and Bookmaker’s odds $< 0$) we see there is extensive potential for F1 to provide very misleading scores for a classifier that is in actual fact worse than random. As an extreme example, one such confusion matrix yields F1 = 0.93 but Bookmaker’s odds of -0.10. In other words, a near perfect F1 score corresponds to a classifier that is, in reality, slightly worse than guessing: it is perverse!

Whilst these are hypothetical examples, they reveal a potentially highly misleading performance metric. We also note that the more extreme values tend to correspond to high imbalance scores (i.e., where the positive cases have very low or very high prevalence).
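The enumeration behind Figure 1 can be sketched as follows. This is a minimal reconstruction under stated assumptions, not the authors' analysis code: we treat F1 as undefined when TP=0 and the Bookmaker's odds as undefined when there are no actual negative cases, and the helper name `f1_and_bookmaker` is hypothetical. The specific matrix checked at the end (TP=35, FP=1, FN=4, TN=0) is our own choice of one $N=40$ matrix that is consistent with the quoted values of F1 = 0.93 and Bookmaker's odds of -0.10; it is not taken from the source.

```python
from typing import Optional, Tuple

N = 40  # total number of cases in each hypothetical confusion matrix


def f1_and_bookmaker(tp: int, fp: int, fn: int, tn: int) -> Optional[Tuple[float, float]]:
    """Return (F1, Bookmaker's odds), or None if either is undefined."""
    if tp == 0 or fp + tn == 0:
        return None
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    bookmaker = recall - fp / (fp + tn)  # TPR - FPR
    return f1, bookmaker


# Every way of distributing N cases over the four cells (TP, FP, FN, TN).
matrices = [
    (tp, fp, fn, N - tp - fp - fn)
    for tp in range(N + 1)
    for fp in range(N + 1 - tp)
    for fn in range(N + 1 - tp - fp)
]
print(len(matrices), "confusion matrices enumerated")

# Matrices in the 'misleading' region: F1 above 0.5 yet worse than chance.
misleading = [
    m for m in matrices
    if (scores := f1_and_bookmaker(*m)) is not None
    and scores[0] > 0.5 and scores[1] < 0
]
print(len(misleading), "matrices have F1 > 0.5 but negative Bookmaker's odds")

example_f1, example_j = f1_and_bookmaker(35, 1, 4, 0)
print(f"F1={example_f1:.2f}  Bookmaker={example_j:.2f}")
```

The count of enumerated matrices is the number of ways of placing 40 cases into four cells, which is the binomial coefficient C(43, 3).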

Figure 1: Plot of F1 scores by Bookmaker’s Odds for all valid permutations of a N=40 confusion matrix

Imbalance is derived from the prevalence and is shaded between blue (low) and red (high). The green trend line is based on a loess smoother with bootstrapped 95% confidence limits shown as a shaded grey area. The grey long dashed line shows Bookmaker’s odds of zero, below which a classifier is performing worse than chance.

Fourth, researchers have commented on the way that F1 combines two distinct quantities, Precision and Recall, and that this is accomplished via the harmonic mean, which will distort the impact of extreme values Hand18 . By contrast, an arithmetic mean is more intuitively interpretable.

So we conclude that F1 is, at least in theory, an unreliable indicator of software defect prediction performance. As an alternative, we propose the Matthews correlation coefficient (MCC) Bald00 , since it is chance-corrected (see Table 2), is based upon both classes (it utilises the complete confusion matrix) and has a straightforward interpretation. It is also directly related to the chi-square statistic.
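The link between MCC and the chi-square statistic can be checked numerically. The sketch below computes the Pearson chi-square statistic (without continuity correction) from observed and expected cell counts and compares it with $n \cdot MCC^2$. The function names and the confusion matrix counts are hypothetical, and the code assumes every marginal total is non-zero.

```python
from math import sqrt


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews correlation coefficient for a 2x2 confusion matrix."""
    den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / den


def chi_square(tp: int, fp: int, fn: int, tn: int) -> float:
    """Pearson chi-square for the 2x2 table, from observed vs expected counts."""
    n = tp + fp + fn + tn
    observed = ((tp, fp), (fn, tn))   # rows: predicted positive / negative
    row_totals = (tp + fp, fn + tn)   # predicted positive, predicted negative
    col_totals = (tp + fn, fp + tn)   # actual positive, actual negative
    total = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            total += (observed[i][j] - expected) ** 2 / expected
    return total


# Hypothetical confusion matrix with n = 100 cases.
tp, fp, fn, tn = 20, 5, 10, 65
n = tp + fp + fn + tn
print(f"MCC = {mcc(tp, fp, fn, tn):.4f}")
print(f"chi-square = {chi_square(tp, fp, fn, tn):.4f}")
print(f"n * MCC^2  = {n * mcc(tp, fp, fn, tn) ** 2:.4f}")
```

The last two printed values agree, which is the relationship $\chi^2 = n \cdot MCC^2$ for a two-class confusion matrix.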

2.3 How widely is F1 used in software defect prediction experiments?

Two previous systematic reviews have explicitly tried to quantify the extent to which F1 is used as the response variable for software defect prediction studies. Malhotra et al. Malh15 reported that 17/64 (27%) of included studies between 1991 and 2013 used F1 directly, and an additional 37% and 66% used Precision and Recall respectively, which are the two constituent components of F1 (see Table 2). Another systematic review by Hosseini et al. Hoss17 reports that out of 30 studies (2006-2016), 11 (37%) use F1 and 21 (70%) use precision and recall.

To obtain a more up to date, though somewhat approximate, view of current utilisation rates we applied the following two searches using Google Scholar (29th May, 2020). We excluded patents and citations.

General search: "software defect prediction" AND ("experiment" OR "empirical")

F1 subset search: "software defect prediction" AND ("experiment" OR "empirical") AND ("F1" OR "F-measure" OR "F-score")

These searches retrieved 2250 (General search) and 978 (F1 subset search) results respectively, which suggests that in the past five years of the order of 43% (978/2250) of articles that discussed software defect prediction experiments also mentioned F1. We then randomly sampled 30% of these 978 papers and read them carefully to determine whether they actually employed F1 in their analysis. [Footnote 4: The sampling was conducted by randomly selecting 3 papers from each page (of 10 papers) returned by Google Scholar, 97 pages in total. Overall we examined 291 papers.] We found that 82% (239/291) of papers actually used the F1 metric in their methods. From this we argue that in the last five years there are of the order of 800 (82% of 978) software defect prediction studies that make use of the F1 classification performance metric. Given our concerns regarding this metric we find this somewhat worrying.
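The arithmetic behind this estimate is simple enough to reproduce. The sketch below uses only the counts quoted in the text; the variable names are ours, and scaling the sampled usage rate up to all 978 search results is our reading of how the overall figure is obtained.

```python
# Counts reported for the two Google Scholar searches.
general_hits = 2250      # general search
f1_hits = 978            # F1 subset search

# Sample: 3 papers from each of 97 result pages.
pages = 97
papers_per_page_sampled = 3
sampled = pages * papers_per_page_sampled
confirmed_f1_users = 239  # sampled papers that actually used F1

mention_rate = f1_hits / general_hits
usage_rate = confirmed_f1_users / sampled
estimated_f1_studies = usage_rate * f1_hits

print(f"papers sampled:        {sampled}")
print(f"mention rate:          {mention_rate:.1%}")
print(f"usage rate in sample:  {usage_rate:.1%}")
print(f"estimated F1 studies:  {estimated_f1_studies:.0f}")
```

Because the sample covers only the F1 subset search, the estimate is a lower-order approximation of usage and says nothing about papers that use F1 without matching the search terms.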

3 The systematic review

Next we seek to find software defect prediction studies that publish results using F1 and another more reliable metric, MCC, so that we can make comparisons.

This review was carried out in January 2021. The goal was to locate primary studies that undertook experiments to assess software defect prediction methods on historical data sets. Specifically, we needed papers that reported results with both the widely used, but problematic, F1 metric and the Matthews correlation coefficient (MCC). This would enable us to determine whether differences between these metrics are merely a theoretical concern or have real-world impact.

We conducted a basic search in our earlier conference paper Yao20 and located 8 studies. However, for this work we decided to conduct a more in-depth search which resulted in a further 30 studies making a total of 38 papers. The details are given below (and summarised in Table 3).

Characteristic Description
Objective To find experimental results where both the F1 and MCC classifier performance metrics are reported
Target domain Software defect prediction experiments
Target audience (i) Ourselves (for a meta-analysis) and (ii) other researchers
Date January 2021
Databases searched Google Scholar
Additional searches Forward and backward chaining, results from previous search Yao20
Inclusion validation JY and MS independently checked papers for potential inclusion and disagreements were discussed
Grey literature Not included
Study quality (i) Refereed and (ii) for predictive studies uses some cross-validation procedure
Data collected (i) Bibliographic data including: authors, title, year and publication venue and types of classification performance metric collected, (ii) result count, (iii) individual F1 and MCC results for each combination of dataset and classifier, (iv) inference procedure details including: use of NHST and correction procedures e.g., Bonferroni
Data published zenodo 10.5281/zenodo.3949897
Table 3: Systematic review summary

We decided to restrict our search to the domain of software defect prediction experiments. This was because there are aspects to the researcher’s choice of classifier performance metric that are domain-specific, namely (i) whether true negatives can be enumerated, i.e., is this count knowable and (ii) the prevalence of true cases, i.e., defect-prone software components. In the case of software defects, not only is the number of true negatives knowable, it’s important since these are the software components that are correctly identified as not being defect-prone. Such components can then be allocated reduced testing resources. The other aspect is that in general, the majority of data sets have few positive (defect-prone) cases, i.e., they are imbalanced. These two conditions render F1 an unsafe choice of performance metric for software defect prediction experiments (see Section 2.2).

We wanted to ensure that we collected high quality experimental results for our meta-analysis. For this reason we only used peer reviewed studies, meaning that the so-called grey literature has been excluded. Although some researchers are strongly advocating ‘multi-vocal’ literature reviews e.g., Garo19 our purpose is slightly different. First, we are not motivated by “closing the gap between academic research and professional practice” (Elmore Elmo91 quoted in Garousi et al. Garo19 ) because our target is researchers. Second, we are aware of the ease with which it is possible to make errors in computational experiments and therefore are motivated to maximise independent scrutiny. Of course, the peer reviewed literature is still replete with mistakes Dono09 ; Alli16 and more specifically in defect prediction Shep19 ; Li20 .

The complete list of exclusion criteria is given in Table 4. The long list of 194 articles was constructed by scanning the title and venue. Where there was any doubt, the article remained in the long list. If we were unable to obtain content through the usual channels we attempted to email the authors. Each criterion was successively applied in the order listed in Table 4, consequently the counts only refer to the residual articles (e.g., if an article is unavailable we make no judgement as to whether it is written in English). Note ‘new data’ refers to the situation where the same experimental results are presented in more than one paper. Finally, ‘suitable data’ is something of a catch all where we are unable to use the data for a range of particular reasons such as no meaningful differences between the classifiers being compared or no comparisons (i.e., results being presented without benchmarks).

Exclusion criterion Removed Remaining
Long list of candidate papers - 194
Content unavailable 9 185
Not written in English 6 179
Different problem domain to software defect prediction 45 135
Not refereed 24 113
Provides F1 and MCC data 69 43
New data 1 42
Suitable data 4 38
No cross-validation (where appropriate) 0 38
Table 4: Systematic review exclusion criteria

4 Bibliometric summary data

As indicated, we located 38 research papers that described computational experiments that compare the performance of different classifiers, e.g., logistic regression or random forest algorithms, across various software defect data sets, e.g., NASA MDP and Eclipse. The publication venues are quite widely distributed, though we note there are three papers each from Promise conferences, EMSE and JSS, and two each from IST, TSE and MECS (the Intl. J. of Modern Education and Computer Science). In total there are 12 conference and 26 journal papers. A complete list is given in Appendix 


The papers range from 2012 to 2020. On the whole, however, providing MCC results seems to be a relatively recent phenomenon, with 32/38 papers published since 2018. The line plot in Fig. 2 gives an indication of the clear, steep upwards trend using a loess smoother. We also observed that just over half (20/38) of the papers also reported AUC; however, we did not analyse these data as they are beyond the scope of our study. [Footnote 5: Area under the curve (AUC) would also appear to be growing in popularity amongst software defect prediction researchers. However, it is a metric based on the frontier between the true and false positive rates and thus is a characteristic of a family of classifiers rather than any specific classifier Fawc06 ; Hand09 ; Powe11 .]

Figure 2: Line plots of papers published by year (i) reporting F1 and MCC results and (ii) making NHST-based inferences

NB We show counts of papers since 2010, however, our search was unconstrained date-wise. The NHST counts refer to papers that make use of null hypothesis significance testing for making inferences about classifier comparisons. The dashed lines are loess smoothers.

In total, the 38 papers contain 12,471 usable results, that is where we have complete cases, and pairs of values for each of F1 and MCC. The number of results per paper varied hugely from 6 to 1890 with a median of 140.5. We noted that the papers containing a large number of results tended to be major benchmarking exercises, whilst those with few results tended to promote new algorithms or approaches and just compared these with some baseline approach.

These papers between them utilise 97 distinct datasets, however many are used multiple times across multiple studies. Unfortunately, it would seem that different studies refer to the same data set by different names, e.g., Promise and NASA MDP. Also, it’s not always clear which version of a data set is being used plus different studies deploy differing data-cleaning strategies. Nevertheless this indicates the breadth of research activity.

Procedure Count
Simple comparison 10
Other 4
Total 38
Table 5: Result decision procedure

In terms of analysis, we observe from Table 5 that 24/38 papers use null hypothesis significance testing (NHST) to determine whether a result is “statistically significant” or ‘meaningful’. Fig. 2 shows a clear upward trend such that in the past couple of years this is the dominant means of reasoning about comparative classification performance. Of these, half (12/24) use some procedure (e.g., a post hoc Nemenyi test) to adjust the acceptance threshold when making multiple, inferential tests. A further 10 papers merely compare values, so a higher score of the performance metric (F1 or MCC) is to be preferred to a lower score, irrespective of the magnitude. The remaining 4 papers either use alternative methods, are unclear or have some alternative purpose than comparing competing defect predictors.

5 Meta-analysis

5.1 Classification performance metrics

First, we examine the spread of values for F1 and MCC, recalling that $F1 \in [0,1]$ but $MCC \in [-1,1]$, so the summary statistics in Table 6 are not directly comparable. Although the statistics are derived from all reported results from our systematic review, it is noteworthy that many classifiers appear to perform poorly. Some even have negative correlation coefficient values (592/25,467 observations).

As discussed in Section 2.2, the F1 metric is not easy to interpret with respect to chance odds, but we did find 235 instances of the boundary case $F1 = 0$ among the observations. We assume the authors are reporting a divide-by-zero error as zero. This could be caused by either no positive cases ($TP + FN = 0$) or no true positives ($TP = 0$). The former situation might arise from the vagaries of cross-validation (if there are few positive cases relative to the size of each fold). The latter might arise from a very poorly performing classifier.

Statistic F1 MCC
Minimum 0.000 -0.161
1st Quartile 0.296 0.178
Median 0.400 0.260
Mean 0.437 0.288
3rd Quartile 0.534 0.360
Maximum 0.999 0.976
Table 6: Summary statistics for all reported F1 and MCC results

We can also visualise the individual distributions as violin plots (see Fig. 3). Note that, unlike MCC, the density plot for F1 is truncated at zero since this is the minimum possible value. The kernel density estimators suggest quite complex, but positively skewed distributions, i.e., the means are greater than the medians. In other words, most classifiers do not predict well. However, we do stress that all results are included, encompassing possibly quite naïve or simplistic baselines that the researchers only intended as comparators.

Figure 3: Violin plots of the classification performance metrics F1 and MCC

NB the horizontal bar within the box represents the median, the box represents the interquartile range and the black dots are outliers. Also recall that the two metrics are measured on different scales: F1 is [0, 1] and MCC is [-1, 1].

Although the two performance metrics are measured on different scales, we can use Spearman’s correlation coefficient to evaluate the strength of a monotonic association and find . Some researchers would refer to this as ‘moderate’ strength Scho18 . Probably more informative is to examine the relationship graphically in the scatter plot given by Fig. 4. Although the relationship is positive, i.e., as one metric value increases so does the other, we observe a good deal of scatter and some extreme outliers. This breakdown in the relationship is most noticeable for the higher values of F1. In other words, the relationship between the two classification performance metrics is far from straightforward and, paradoxically, the nearer F1 is to unity the less trustworthy is the result. A near perfect score for F1 can in practice mean anything from a similarly near perfect MCC score to a negative correlation and everything in between! Recall, these are real, reported results in the refereed scientific literature.
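To make concrete how a near-perfect F1 can coexist with a negative MCC, the following sketch evaluates both metrics on a single hypothetical confusion matrix with very high defect prevalence. The counts and the function names `f1` and `mcc` are illustrative assumptions and are not taken from any primary study.

```python
import math

def f1(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); true negatives play no part."""
    return 2 * tp / (2 * tp + fp + fn)

def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews correlation coefficient computed from all four cells."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator

# Hypothetical classifier on 100 units, 95 of which are defect-prone.
# It labels almost everything defect-prone and gets no negative case right.
tp, fp, fn, tn = 90, 5, 5, 0

f1_value = f1(tp, fp, fn)
mcc_value = mcc(tp, fp, fn, tn)

print(f"F1 = {f1_value:.3f}, MCC = {mcc_value:.3f}")
```

Because F1 ignores the true negatives, it rewards this classifier for the high prevalence of the positive class, whereas MCC shows it performs slightly worse than chance.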

Figure 4: Scatter plot of the classification performance metrics F1 vs MCC

NB the red-dashed line represents a loess smoother. The blue-dashed vertical and horizontal lines represent the median values for F1 and MCC respectively.

5.2 Difference between F1 and MCC

In order to extract information from what is a diverse set of primary studies, we must adopt a standardised view. Essentially, all our studies are conducting computational experiments where the:

  • treatments are the different predictive algorithms or classifiers that are being investigated and manipulated by the researcher;

  • response variables are the different classification performance metrics, in our case F1 and MCC;

  • experimental units are software project data for which predictions are made;

  • blocks are aggregations of similar experimental units, such as a data set, which is typically the level of reporting.

Almost without exception, a repeated measures type of design is deployed by our set of primary studies, that is all treatments are applied to all experimental units. This is possible — unlike for many experiments involving humans — since there is no potential for carryover or other ordering effects. Results are presented as tables of response variables, for example, see Table 7.

          Logistic regression   Naïve Bayes
Block     F1      MCC           F1      MCC
Dataset1  0.5     0.4           0.7     0.6
Dataset2  0.3     0.2           0.8     0.1
Table 7: A hypothetical example of reported results

When researchers assess their results they do so by making comparisons between treatments. Which one is to be preferred? These are generally decomposed to pairwise comparisons. (Although omnibus tests, e.g., a Friedman test, are sometimes utilised to make comparisons of multiple results, such tests can always be decomposed into their primitive components.) In Table 7 we see some hypothetical results. In this simple example we have two comparisons, one for each block or dataset, of Logistic Regression (LR) compared to Naïve Bayes (NB). For example, using F1 we have Naïve Bayes out-performing Logistic Regression for both data sets (0.7 > 0.5 and 0.8 > 0.3); however, the results using MCC are not fully concordant (0.6 > 0.4 but 0.1 < 0.2). In the latter case, which treatment or type of defect predictor we prefer depends upon the choice of measurement function. In such a situation we refer to the results as being discordant. One can imagine the hypothetical example being extended with a third classifier, say Random Forest (RF), which would then mean the table of results could be decomposed into six comparisons (LR-NB, LR-RF, NB-RF twice over, once per dataset). To generalise, a table reporting t treatments over d data sets can be decomposed into d·t(t−1)/2 pairwise comparisons.

More formally, the comparisons are made using different metrics or measurement functions M. For our analysis, we have M_F1 and M_MCC. These functions are applied to different dataset-treatment combinations, yielding, for example, M_F1(Dataset1, LR), which gives the F1 metric from applying Logistic Regression to Dataset1. By taking pairs of measures, we can establish preference relations, e.g., M_F1(Dataset1, LR) < M_F1(Dataset1, NB); in other words, NB is to be preferred to LR for Dataset1. The question arises whether other measurement functions, MCC in our analysis, yield concordant or discordant relations. If the choice of metric governs the outcome this is concerning, the more so because we know that F1 is a flawed metric in the context of two-class classification in software defect prediction.
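The decomposition into pairwise comparisons and the concordance check can be sketched as follows, using the hypothetical values of Table 7. The names `results`, `direction` and `pairwise_comparisons` are illustrative, and the treatment of ties (a tie under only one metric counts as discordant) is an assumption rather than a rule stated in our analysis.

```python
from itertools import combinations

# Hypothetical results from Table 7: results[dataset][classifier] = (F1, MCC).
results = {
    "Dataset1": {"LR": (0.5, 0.4), "NB": (0.7, 0.6)},
    "Dataset2": {"LR": (0.3, 0.2), "NB": (0.8, 0.1)},
}

def direction(x, y):
    """Return -1, 0 or +1 according to whether x is below, equal to or above y."""
    return (x > y) - (x < y)

def pairwise_comparisons(results):
    """Decompose a results table into per-dataset pairwise comparisons."""
    rows = []
    for dataset, by_classifier in results.items():
        for a, b in combinations(sorted(by_classifier), 2):
            f1_a, mcc_a = by_classifier[a]
            f1_b, mcc_b = by_classifier[b]
            # Assumption: any disagreement in direction, including a tie under
            # one metric only, is counted as discordant.
            discordant = direction(f1_a, f1_b) != direction(mcc_a, mcc_b)
            rows.append((dataset, a, b, discordant))
    return rows

comparisons = pairwise_comparisons(results)
n_discordant = sum(1 for row in comparisons if row[3])

for dataset, a, b, discordant in comparisons:
    print(dataset, a, "vs", b, "discordant" if discordant else "concordant")
print(n_discordant, "of", len(comparisons), "comparisons are discordant")
```

With two classifiers and two datasets this yields two comparisons, of which the Dataset2 comparison is discordant. Adding a third classifier to each dataset would yield six.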

Comparisons between pairs of treatment-blocks can differ in either magnitude or direction. Discordance, our focus, addresses the latter. In other words, using F1 leads one to prefer X to Y, yet using MCC would lead the investigator to conclude the opposite. Researchers adopt a range of approaches when considering how to evaluate differences in magnitude, particularly small differences. A common, though controversial, approach is the use of null hypothesis significance testing (NHST) Cohe94 ; Gelm06 ; Colq14 . Here the idea is to distinguish between small differences in magnitude that might be merely due to noise, and differences of greater import. In such circumstances one can deploy a “not worse than” preference relation. Scott-Knott is another approach with a similar goal Mitt15 .

We are agnostic about the direction of the difference, since whether a classifier is interpreted as treatment 1 or 2 is entirely arbitrary; thus we look at absolute differences. Figure 5 shows the distribution of differences together with a kernel density estimator. Both metrics show a similar highly positively skewed distribution (partly as a consequence of taking the absolute difference). Nevertheless, it is noteworthy that for both metrics the median differences between pairs of observations are small (F1 = 0.060 and MCC = 0.068). This suggests that for the majority of comparisons between pairs of defect predictors, the dissimilarities are small.

The question arises: if a difference in direction occurs, could this be due to a trivial change in an accuracy metric? In other words, what if many differences in pairs of observations are very small? However, given that the majority of studies (24/38) use NHST as a decision procedure, even small differences are likely to be interpreted as meaningful due to the generally quite large data sets. (Using a simulation based on the NASA MDP data sets, which are the most widely used, with their median dataset size and a median standard deviation of sd = 0.03465 reported by Tran et al. Tran19 , we determined that a Welch test on this data could detect, i.e., find statistically significant, a small difference between treatments in F1 metric values: 95% CI 0.0065, 0.0127 with p-value = 2.204e-09.) This means that even very small differences in pairs of classification performance would be viewed as ‘significant’ when viewed through the lens of NHST. For our purposes this means even small differences between F1 and MCC might lead researchers to very different conclusions. Furthermore, another 10 papers simply make direct comparisons. Being conservative, we could argue that a difference in F1 of 0.01 or more could be identified as ‘meaningful’. If smaller differences were discarded this would eliminate 426 results, or 426/2737, which is approximately 15% of the conclusion changes. In other words, only a small proportion of the direction changes that we identify might be considered too small to be regarded as conclusion changes by the researchers. We return to this point in the threats to validity (see Section 6.1).

Figure 5: Violin plots of the absolute differences between treatments captured by F1 and MCC

Next, we turn to the question of what proportion of reported results exhibit discordance. This addresses the question of how much it matters in practice that many research papers have used a biased classification metric as the response variable for their experiments. Since the two metrics use different scales, we initially focus on direction rather than magnitude. Overall, from the 12,471 results there are 2737 (or 21.95%) instances of a conclusion (or direction) change. We use the Agresti-Coull method Brow01 to estimate the 95% confidence interval of this binomial proportion.

Given the large number of observations, the confidence interval (CI) is tightly defined. Therefore these results indicate that more than a fifth of the reported results contain conclusions that will change if the biased F1 metric is replaced by a less problematic metric such as MCC.
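For readers who wish to recompute such an interval, the following is a minimal sketch of the standard Agresti-Coull procedure applied to the counts reported above (2737 changes out of 12,471 results). The function name `agresti_coull` is illustrative, and the output is a recomputation from the stated counts, not a figure quoted from our analysis.

```python
import math

def agresti_coull(successes: int, n: int, z: float = 1.959964):
    """Agresti-Coull interval for a binomial proportion.

    Adds z^2 pseudo-observations (half of them successes) before applying
    the usual normal-approximation interval. z = 1.959964 gives 95% coverage.
    """
    n_adj = n + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    half_width = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return p_adj - half_width, p_adj + half_width

lower, upper = agresti_coull(successes=2737, n=12471)
print(f"95% CI for the proportion of conclusion changes: ({lower:.4f}, {upper:.4f})")
```

Run on these counts, the interval spans roughly 0.21 to 0.23, which is consistent with the observation that the CI is tightly defined.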

We can also examine the relationship between the magnitude of differences between pairs of treatments as captured by F1 and by the unbiased MCC. Fig. 6 shows the comparisons for F1 and MCC plotted against each other. When the metrics are concordant the data points fall in the lower left and upper right quadrants (coloured red), whilst a change in direction is signified when the points lie in the upper left and lower right quadrants (coloured blue). The number of instances is given in each quadrant. The plot shows that whilst there is a broad trend (as differences captured by one metric increase, so do those captured by the other), there are many extreme outliers.

Figure 6: Scatter plot of differences between pairs of treatment effects as measured by F1 and MCC

The quadrants indicate where the results are concordant (coloured red) or discordant (coloured blue, thereby indicating a conclusion change since the effect direction reverses when an unbiased metric is deployed).

Figure 7: Stacked bar plot of the proportion of results that change by paper

The dotted line shows the median proportion of changed conclusions by paper, whilst the dashed line shows the mean proportion of changed conclusions (21.95%) calculated from all results. The difference is in part due to a few papers with many results and many conclusion changes.

5.3 Can we predict the likelihood of a conclusion change?

Here we explore whether magnitude of the effect is an explanatory factor for the probability of a conclusion change when we replace the biased F1 classification performance metric with MCC.

One seemingly reasonable hypothesis is that the larger the observed difference between the two treatments (i.e., the effect magnitude) the less likely it matters which performance metric we employ. In other words, a large effect size can be trusted however measured? To investigate we use a simple procedure popularised by Gelman and Park Gelm09 based on a tertile split and comparing the bottom and top tertiles, in our case based upon absolute F1 difference (since we only care about magnitude, not direction). The reasoning to ignore the middle tertile is to avoid comparison of adjacent items which is the well known disadvantage of a median split.

Tertile  No change (N)  Change (Y)  Odds (Y/N)
T1       2716           1364        0.502
T2       3495           943         0.270
T3       3523           430         0.122
Table 8: Tertile analysis of conclusion change

Table 8 reveals a considerable difference between tertiles T1 and T3, which is reflected in the odds ratio of OR = 4.11 (the 95% CI is again based on the Agresti-Coull method Brow01 ). From this we can see that a small effect or magnitude (in our case, an absolute difference between F1 values falling in the bottom tertile, T1) has odds of 0.502 of being in the wrong direction, i.e., about one such result in three changes direction (1364/4080). NB The effect is the absolute difference in F1 scores, as opposed to a standardised effect size such as the d-family of statistics like the widely used Cohen’s d. In other words, irrespective of “statistical significance” or other arguments, a result showing a small difference in competing classifiers, as captured by F1, has a substantial chance of being in the wrong direction.
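The odds and odds ratio can be recomputed directly from the counts in Table 8, as in the sketch below. The variable names are illustrative. The interval shown uses the Woolf (log odds ratio) approximation, which is our assumption for this sketch and is not the Agresti-Coull based procedure referred to above.

```python
import math

# Counts from Table 8: tertile -> (no conclusion change, conclusion change).
table8 = {"T1": (2716, 1364), "T2": (3495, 943), "T3": (3523, 430)}

# Odds of a conclusion change within each tertile.
odds = {tertile: changed / unchanged for tertile, (unchanged, changed) in table8.items()}

# Odds ratio comparing the bottom tertile (T1) with the top tertile (T3).
odds_ratio = odds["T1"] / odds["T3"]

# Proportion of T1 comparisons whose direction changes.
p_change_t1 = table8["T1"][1] / sum(table8["T1"])

# Assumption: Woolf interval on the log scale, used here only as an illustration.
z = 1.959964
se_log_or = math.sqrt(sum(1 / count for t in ("T1", "T3") for count in table8[t]))
ci_low = math.exp(math.log(odds_ratio) - z * se_log_or)
ci_high = math.exp(math.log(odds_ratio) + z * se_log_or)

for tertile, value in odds.items():
    print(tertile, f"odds = {value:.3f}")
print(f"OR (T1 vs T3) = {odds_ratio:.2f}, proportion changed in T1 = {p_change_t1:.3f}")
```

The middle tertile T2 is computed for completeness but, following the Gelman and Park procedure, does not enter the odds ratio.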

6 Conclusions

This paper has posed the question: to what extent can we rely on research results from software defect prediction studies that are based on the problematic F1 performance metric? Unfortunately, although we, and many others before us, have shown that F1 is not a good choice of metric in the context of defect prediction (see Section 2.2), F1 has been very widely used. So, should we be concerned or is this just a minor academic quibble? This question is the theme of our paper.

To summarise our investigation, we have searched the literature for software defect prediction studies that report performance both as F1 and MCC. By this means we retrieved 38 refereed papers that contain a combined 12,471 pairs of results. We then analysed these results by assessing whether the F1 and MCC results are concordant. So, for example, we say a result is concordant if we prefer predictor A to B irrespective of whether we use F1 or MCC to make that comparison. Contrariwise, we say that the results are discordant if the direction of the preference depends upon which classification performance metric is used. That is the conclusion would change.

Our main findings are:

  1. Although not a new finding we show that F1 is problematic when used in a two-class problem domain such as software defect prediction. By enumerating and then plotting all N=40 confusion matrix permutations, we show how misleading F1 is because it is not chance-adjusted. We demonstrate this with respect to simple Bookmaker’s odds.

  2. The F1 metric is still widely used by researchers investigating classifiers for software defect prediction. Our analysis of the literature suggests of the order of 800 software defect prediction papers have used this metric in the past five years alone.

  3. We find that more than a fifth (21.95%) of all results change not only in magnitude but most importantly, in conclusion (or direction) when the unbiased MCC is used, instead of the F1 metric.

  4. In passing, we also note that some classifiers do not perform well, i.e., less well than chance. This is not apparent if researchers rely on F1, although clear from a negatively valued correlation coefficient.

  5. Unsurprisingly, the smaller the effect (difference in performance between pairs of classifiers), the more likely the conclusion will change. The odds ratio between the lowest and top tertiles is 4.11.

This tendency to use F1 in software defect studies has wider ramifications than just single studies, since it then propagates through into meta-analyses which are often based on this metric Hoss17 ; Malh15 . Other meta-analyses have been obliged to discard significant amounts of data when researchers only reported results in terms of F1 e.g., Shep14 .

Finally, we wish to be clear that we are not making a criticism of the authors of the 38 primary studies included in our meta-analysis. To their considerable credit they have provided the necessary data to make our investigation possible. Nor are we claiming that they have relied upon the F1 metrics. They have, however, provided the means whereby we can answer the question: how much does using F1 for software defect prediction studies matter? Sadly, the answer seems to be: a good deal.

6.1 Threats to validity

Internal: threats relate to the extent to which the design of our investigation enables us to argue there is evidence to support our claim (i.e., that using F1 as a response variable for defect prediction studies causes misleading results and therefore conclusions).

  1. Can we be sure that discordance is an appropriate way to reveal problems with study conclusions due to using F1? We believe using the idea of sign or direction change, in other words preference relations, is the most fundamental way of considering pairs of results. Changing a preference for Classifier A to Classifier B inverts the meaning of the results. The alternative of setting some minimum effect magnitude threshold seems arbitrary by comparison.

  2. Perhaps researchers only focus on very large differences between F1 measures? So a change in direction or discordant results might not matter if both sets of results are very close to a zero effect size. Researchers have indeed used a range of decision procedures to determine whether the magnitude of the effect matters (see Table 5) but the majority (24/38) use NHST. One of the many criticisms of NHST is that very small effects can be ‘significant’ when n is large, which is typically the case for software defect studies.

  3. Measurement error, for instance with regard to the boundary conditions e.g., F1=0. Elsewhere we have found that reporting and/or measurement errors can be depressingly prevalent Shep19 . However, it is not obvious why F1 would be more impacted than MCC.

  4. The data sets used by researchers generally assume simple relationships and traceability between defects and repairs. Herbold Herb19 has argued, with some justification, that we should expect m:n relationships between defects and software units. Whilst this may well weaken the practical relevance of software defect prediction research, our focus is on how we assess classification performance and whether it matters if F1 is employed. Hence we believe that this threat, although valid, is somewhat peripheral.

  5. In line with all the research included in this study, we ignore costs by assuming the costs of false positives and false negatives are equal. Penalising one class of error more than the other is a potentially important area of software defect prediction research Khos98 ; Herb19 , but one outside the scope of this study.

External: threats concern the generalisability of our findings.

  1. Is our sample large enough? We have more than 12,000 pairs of results plus we have sought to locate all studies that provide the data we need for our investigation.

  2. Suppose we’d looked at other studies? Requiring the study to publish both F1 and MCC results might skew the findings? This is possible and it does seem that using MCC is a relatively recent practice. We excluded the grey literature (i.e., unrefereed studies). Given that (hopefully) research methods and practice improve over time and our focusing on demonstrably refereed studies, this would suggest that if anything we are biased to higher quality studies. Consequently, the overall picture could conceivably be worse than our meta-analysis reveals.

6.2 Recommendations

These results have implications both for researchers and for consumers of their research (other researchers and practitioners).

  • The first, and most obvious, implication is that researchers should stop using the F1 metric to analyse and compare software defect classifiers. We should not reason that because people have previously used F1 we should continue to do so. Otherwise our research will be perpetually mired in the past! Minimally, we suggest researchers should provide other unbiased metrics such as the Matthews correlation coefficient Bald00 .

  • We need full reporting of data, results and code / scripts Muna17 . Preferably, papers should provide all the confusion matrices so that a wide range of metrics can potentially be computed as secondary analysis.

  • When undertaking meta-analyses these should not be based upon F1 results. Instead, it may be possible to either use results based upon other metrics or derive them from other information reported Bowe14 . Otherwise, the risk that approximately 22% of the results used in a meta-analysis are completely misleading must be viewed as rendering the results critically contaminated.

  • When reading past studies based upon F1, consider the absolute size of the effect plus the confidence interval but ignore statistical significance. Unless the absolute F1 difference is non-trivial, we recommend little credence should be given to such a result. Even then, there remains a chance that not only is the magnitude of the effect wrong but it is actually in the opposite direction. So instead of classifier A being considerably better than B, it turns out that B is better than A. How can we expect practitioners to deploy substantial resources in the real world, e.g., guiding their testing effort, when such advice could be completely misguided?

It seems a great deal of research effort has been deployed on the clearly important problem of how to predict where testing effort should be focused in large software systems. As important as that question might be, we can hardly expect our research to have much practical impact unless we, as a community, take reasonable steps to ensure our computational experiments have meaning.

Appendix A Details of the primary studies included in the systematic review

Paper Authors Year Type Title #Results
Abae18 Abaei et al. 2018 J A fuzzy logic expert system to predict module fault proneness using unlabeled data 14
AlDa18 Al Dallal 2018 J Predicting fault-proneness of reused object-oriented classes in software post-releases 27
Ali20 Ali et al. 2020 J Software Defect Prediction Using Variant based Ensemble Learning and Feature Selection Techniques

Amas18 Amasaki 2018 C Cross-version defect prediction using cross-project defect prediction approaches: does it work? 1404
Amas20 Amasaki 2020 J Cross-version defect prediction: use historical data, cross-project data, or the both? 171
Anta20 Antal et al. 2020 J Enhanced Bug Prediction in JavaScript Programs with Hybrid Call-Graph Based Invocation Metrics 36
Ayon19 Ayon 2019 C Neural network based software defect prediction using genetic algorithm and particle swarm optimization 30
Bangash20 Bangash et al. 2020 J On the time-based conclusion stability of cross-project defect prediction models 190
Bowe12 Bowes et al. 2012 C Comparing the performance of fault prediction models which report multiple performance measures: recomputing the confusion matrix 6
Bowe15 Bowes et al. 2015 C Different classifiers find different defects although with different level of consistency 12
Bowe18 Bowes et al. 2018 J Software defect prediction: do different classifiers find the same defects? 18
Chen20 Chen et al. 2020 J An empirical study on heterogeneous defect prediction approaches 336
Felix20 Felix et al. 2020 J Predicting the number of defects in a new software version 30
Ge18 Ge et al. 2018 C Comparative study on defect prediction algorithms of supervised learning software based on imbalanced classification data sets

Gong19 Gong et al. 2019 J An improved transfer adaptive boosting approach for mixed-project defect prediction 1512
Herb18 Herbold et al. 2018 J A comparative study to benchmark cross-project defect prediction approaches 1890
Iqba20 Iqbal 2020 J A classification framework for software defect prediction using multi-filter feature selection technique and MLP 588
Lena20 Lenarduzzi et al. 2020 C Are SonarQube rules inducing bugs? 36
Matl19 Matloob et al. 2019 J A framework for software defect prediction using feature selection and ensemble learning techniques 85
Naseem20 Naseem et al. 2020 J Investigating Tree Family Machine Learning Techniques for a Predictive System to Unveil Software Defects 450
Nezh20 Nezhad et al. 2020 J Software defect prediction using over-sampling and feature extraction based on mahalanobis distance

Niu20 Niu et al. 2020 J Cost-sensitive Dictionary Learning for Software Defect Prediction 1595
Pan19 Pan et al. 2019 J An improved cnn model for within-project software defect prediction 336
Pand20 Pandey et al. 2020 J Bpdet: an effective software bug prediction model using deep representation and ensemble learning techniques 429
Pecorelli20 Pecorelli et al. 2020 J A large empirical assessment of the role of data balancing in machine-learning-based code smell detection 11
Rizw17 Rizwan et al. 2017 C Empirical study on software bug prediction 78
Rodr14 Rodrigues et al. 2014 C Preliminary comparison of techniques for dealing with imbalance in software defect prediction 726
Ship18 Shippey et al. 2018 C Code cleaning for software defect prediction: a cautionary tale 18
Tian20 Tian et al. 2020 C How Well Just-In-Time Defect Prediction Techniques Enhance Software Reliability? 110
Tong18 Tong et al. 2018 J Software defect prediction using stacked denoising autoencoders and two-stage ensemble learning

Tong20 Tong et al. 2020 J Credibility Based Imbalance Boosting Method for Software Defect Proneness Prediction 231
Tran19 Tran et al. 2019 C Combining feature selection, feature learning and ensemble learning for software fault prediction 36
Xuan15 Xuan et al. 2015 C Evaluating defect prediction approaches using a massive set of metrics: an empirical study 15
Xu19 Xu et al. 2019 J Software defect prediction based on kernel pca and weighted extreme learning machine 237
Xu20 Xu et al. 2020 J Imbalanced metric learning for crashing fault residence prediction 237
Zhan16 Zhang et al. 2016 J Towards building a universal defect prediction model with rank transformed predictors 50
Zhang20 Zhang et al. 2020 J Automated defect identification via path analysis-based features with transfer learning

Zhao19 Zhao et al. 2019 J Siamese dense neural network for software defect prediction with small data 900


The authors wish to thank the authors of the 38 primary studies included for providing sufficient information to make this analysis possible. We also wish to stress that our criticism of F1 does not mean we are criticising their papers. On the contrary, their foresight that alternative metrics to F1 are needed, has been invaluable. Jingxiu Yao wishes to acknowledge the support of the China Scholarship Council.

Conflict of interest

The authors declare that they have no conflict of interest.


  • (1) C. Catal, B. Diri, A systematic review of software fault prediction studies, Expert Systems with Applications 36 (4) (2009) 7346–7354.
  • (2) T. Hall, S. Beecham, D. Bowes, D. Gray, S. Counsell, A systematic literature review on fault prediction performance in software engineering, IEEE Transactions on Software Engineering 38 (6) (2012) 1276–1304.
  • (3) R. Malhotra, A systematic review of machine learning techniques for software fault prediction, Applied Soft Computing 27 (2015) 504–518.
  • (4) S. Hosseini, B. Turhan, D. Gunarathna, A systematic literature review and meta-analysis on cross project defect prediction, IEEE Transactions on Software Engineering 45 (2) (2017) 111–147.
  • (5) R. Özakıncı, A. Tarhan, Early software defect prediction: A systematic map and review, Journal of Systems and Software 144 (2018) 216–239.
  • (6) L. Son, N. Pritam, M. Khari, R. Kumar, P. Phuong, P. Thong, et al., Empirical study of software defect prediction: a systematic mapping, Symmetry 11 (2).
  • (7) N. Li, M. Shepperd, Y. Guo, A systematic review of unsupervised learning techniques for software defect prediction, Information and Software Technology online (2020) 106287.

  • (8) M. Sokolova, G. Lapalme, A systematic analysis of performance measures for classification tasks, Information Processing and Management, 45 (4) (2009) 427–437.
  • (9) D. Powers, Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation, Journal of Machine Learning Technologies 2 (1) (2011) 37–63.
  • (10) A. Luque, A. Carrasco, A. Martín, A. de las Heras, The impact of class imbalance in classification performance metrics based on the binary confusion matrix, Pattern Recognition 91 (2019) 216–231.

  • (11) P. Baldi, S. Brunak, Y. Chauvin, C. Andersen, H. Nielsen, Assessing the accuracy of prediction algorithms for classification: an overview, Bioinformatics 16 (5) (2000) 412–424.
  • (12) J. Yao, M. Shepperd, Assessing software defection prediction performance: Why using the Matthews correlation coefficient matters, in: Proceedings of the 24th International Conference on Evaluation and Assessment in Software Engineering, 2020.
  • (13) S. Wang, X. Yao, Using class imbalance learning for software defect prediction, IEEE Transactions on Reliability 62 (2) (2013) 434–443.
  • (14) W. Youden, Index for rating diagnostic tests, Cancer 3 (1) (1950) 32–35.
  • (15) D. Powers, Recall & precision versus the bookmaker, International Conference on Cognitive Science, 2003.
  • (16) T. Fawcett, An introduction to ROC analysis, Pattern Recognition Letters 27 (8) (2006) 861–874.
  • (17) S. Morasca, L. Lavazza, On the assessment of software defect prediction models via ROC curves, Empirical Software Engineering 25 (5) (2020) 3977–4019.
  • (18) D. Hand, Measuring classifier performance: a coherent alternative to the area under the ROC curve, Machine Learning 77 (2009) 103–123. doi:10.1007/s10994-009-5119-5.
  • (19) P. Flach, M. Kull, Precision-recall-gain curves: PR analysis done right, in: Advances in Neural Information Processing Systems (NIPS 2015), 2015, pp. 838–846.
  • (20) C. van Rijsbergen, Information Retrieval, 2nd Edition, Butterworths, 1979.
  • (21) Y. Sun, A. Wong, M. Kamel, Classification of imbalanced data: A review, International Journal of Pattern Recognition and Artificial Intelligence 23 (04) (2009) 687–719.

  • (22) M. Shepperd, D. Bowes, T. Hall, Researcher bias: The use of machine learning in software defect prediction, IEEE Transactions on Software Engineering 40 (6) (2014) 603–616.
  • (23) D. Hand, P. Christen, A note on using the F-measure for evaluating record linkage algorithms, Statistics and Computing 28 (3) (2018) 539–547.
  • (24) V. Garousi, M. Felderer, M. Mäntylä, Guidelines for including grey literature and conducting multivocal literature reviews in software engineering, Information and Software Technology 106 (2019) 101–121.
  • (25) R. Elmore, Comment on “towards rigor in reviews of multivocal literatures: applying the exploratory case study method”, Review of educational research 61 (3) (1991) 293–297.
  • (26) D. Donoho, A. Maleki, I. Rahman, M. Shahram, V. Stodden, Reproducible research in computational harmonic analysis, Computing in Science and Engineering 11 (1) (2009) 8–18.
  • (27) D. Allison, A. Brown, B. George, K. Kaiser, Reproducibility: A tragedy of errors, Nature 530 (7588) (2016) 27–29.
  • (28) M. Shepperd, Y. Guo, N. Li, M. Arzoky, A. Capiluppi, S. Counsell, G. Destefanis, S. Swift, A. Tucker, L. Yousefi, The prevalence of errors in machine learning experiments, in: International Conference on Intelligent Data Engineering and Automated Learning, Springer, 2019, pp. 102–109.
  • (29) P. Schober, C. Boer, L. Schwarte, Correlation coefficients: Appropriate use and interpretation, Anesthesia & Analgesia 126 (5) (2018) 1763–1768.
  • (30) J. Cohen, The earth is round (p < .05), American Psychologist 49 (12) (1994) 997–1003.
  • (31) A. Gelman, H. Stern, The difference between “significant” and “not significant” is not itself statistically significant, The American Statistician 60 (4) (2006) 328–331.
  • (32) D. Colquhoun, An investigation of the false discovery rate and the misinterpretation of p-values, Royal Society Open Science 1 (140216). doi:10.1098/rsos.140216.
  • (33) N. Mittas, I. Mamalikidis, L. Angelis, A framework for comparing multiple cost estimation methods using an automated visualization toolkit, Information and Software Technology 57 (2015) 310–328.
  • (34) H. Tran, L. Hanh, N. Binh, Combining feature selection, feature learning and ensemble learning for software fault prediction, in: 11th IEEE International Conference on Knowledge and Systems Engineering (KSE), 2019, pp. 1–8.
  • (35) L. Brown, T. Cai, A. DasGupta, Interval estimation for a binomial proportion, Statistical Science 16 (2) (2001) 101–117.
  • (36) A. Gelman, D. Park, Splitting a predictor at the upper quarter or third and the lower quarter or third, The American Statistician 63 (1) (2009) 1–8.
  • (37) S. Herbold, On the costs and profit of software defect prediction, IEEE Transactions on Software Engineering online.
  • (38) T. M. Khoshgoftaar, E. B. Allen, Classification of fault-prone software modules: Prior probabilities, costs, and model evaluation, Empirical Software Engineering 3 (3) (1998) 275–298.
  • (39) M. Munafò, B. Nosek, D. Bishop, K. Button, C. Chambers, N. du Sert, U. Simonsohn, E. Wagenmakers, J. Ware, J. Ioannidis, A manifesto for reproducible science, Nature Human Behaviour 1 (1) (2017) 0021.
  • (40) D. Bowes, T. Hall, D. Gray, DConfusion: a technique to allow cross study performance evaluation of fault prediction studies, Automated Software Engineering 21 (2) (2014) 287–313.
  • (41) G. Abaei, A. Selamat, J. Al Dallal, A fuzzy logic expert system to predict module fault proneness using unlabeled data, Journal of King Saud University-Computer and Information Sciences online.
  • (42) J. Al Dallal, Predicting fault-proneness of reused object-oriented classes in software post-releases, Arabian Journal for Science and Engineering 43 (12) (2018) 7153–7166.
  • (43) U. Ali, S. Aftab, A. Iqbal, Z. Nawaz, M. S. Bashir, M. A. Saeed, Software defect prediction using variant based ensemble learning and feature selection techniques, International Journal of Modern Education & Computer Science 12 (5).
  • (44) S. Amasaki, Cross-version defect prediction using cross-project defect prediction approaches: Does it work?, in: Proceedings of the 14th ACM International Conference on Predictive Models and Data Analytics in Software Engineering, 2018, pp. 32–41.
  • (45) S. Amasaki, Cross-version defect prediction: use historical data, cross-project data, or both?, Empirical Software Engineering 25 (2020) 1573–1595.
  • (46) G. Antal, Z. Tóth, P. Hegedűs, R. Ferenc, Enhanced bug prediction in JavaScript programs with hybrid call-graph based invocation metrics, Technologies 9 (1) (2021) 3.
  • (47) S. Ayon, Neural network based software defect prediction using genetic algorithm and particle swarm optimization, in: 1st International Conference on Advances in Science, Engineering and Robotics Technology (ICASERT), IEEE, 2019, pp. 1–4.
  • (48) A. A. Bangash, H. Sahar, A. Hindle, K. Ali, On the time-based conclusion stability of cross-project defect prediction models, Empirical Software Engineering 25 (6) (2020) 5047–5083.
  • (49) D. Bowes, T. Hall, D. Gray, Comparing the performance of fault prediction models which report multiple performance measures: recomputing the confusion matrix, in: 8th ACM International Conference on Predictive Models in Software Engineering, 2012, pp. 109–118.
  • (50) D. Bowes, T. Hall, J. Petrić, Different classifiers find different defects although with different level of consistency, in: 11th ACM International Conference on Predictive Models and Data Analytics in Software Engineering, 2015, pp. 1–10.
  • (51) D. Bowes, T. Hall, J. Petrić, Software defect prediction: do different classifiers find the same defects?, Software Quality Journal 26 (2) (2018) 525–552.
  • (52) H. Chen, X. Jing, Z. Li, D. Wu, Y. Peng, Z. Huang, An empirical study on heterogeneous defect prediction approaches, IEEE Transactions on Software Engineering online.
  • (53) E. A. Felix, S. P. Lee, Predicting the number of defects in a new software version, PloS one 15 (3) (2020) e0229131.
  • (54) J. Ge, J. Liu, W. Liu, Comparative study on defect prediction algorithms of supervised learning software based on imbalanced classification data sets, in: 2018 19th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), IEEE, 2018, pp. 399–406.
  • (55) L. Gong, S. Jiang, L. Jiang, An improved transfer adaptive boosting approach for mixed-project defect prediction, Journal of Software: Evolution and Process 31 (10) (2019) e2172.
  • (56) S. Herbold, A. Trautsch, J. Grabowski, A comparative study to benchmark cross-project defect prediction approaches, IEEE Transactions on Software Engineering 44 (9) (2018) 811–833.
  • (57) A. Iqbal, S. Aftab, A classification framework for software defect prediction using multi-filter feature selection technique and MLP, International Journal of Modern Education & Computer Science 12 (1).
  • (58) V. Lenarduzzi, F. Lomio, H. Huttunen, D. Taibi, Are SonarQube rules inducing bugs?, in: International Conference on Software Analysis, Evolution and Reengineering (SANER 2020), 2020.
  • (59) F. Matloob, S. Aftab, A. Iqbal, A framework for software defect prediction using feature selection and ensemble learning techniques, International Journal of Modern Education and Computer Science 12 (2019) 14–20.
  • (60) R. Naseem, B. Khan, A. Ahmad, A. Almogren, S. Jabeen, B. Hayat, M. A. Shah, Investigating tree family machine learning techniques for a predictive system to unveil software defects, Complexity 2020.
  • (61) M. NezhadShokouhi, M. Majidi, A. Rasoolzadegan, Software defect prediction using over-sampling and feature extraction based on Mahalanobis distance, The Journal of Supercomputing 76 (1) (2020) 602–635.
  • (62) L. Niu, J. Wan, H. Wang, K. Zhou, Cost-sensitive dictionary learning for software defect prediction, Neural Processing Letters 52 (3) (2020) 2415–2449.
  • (63) C. Pan, M. Lu, B. Xu, H. Gao, An improved CNN model for within-project software defect prediction, Applied Sciences 9 (10).
  • (64) S. Pandey, R. Mishra, A. Tripathi, BPDET: An effective software bug prediction model using deep representation and ensemble learning techniques, Expert Systems with Applications 144.
  • (65) F. Pecorelli, D. Di Nucci, C. De Roover, A. De Lucia, A large empirical assessment of the role of data balancing in machine-learning-based code smell detection, Journal of Systems and Software 169 (2020) 110693.
  • (66) S. Rizwan, T. Wang, X. Su, Salahuddin, Empirical study on software bug prediction, in: Proceedings of the 2017 International Conference on Software and e-Business, 2017, pp. 55–59.
  • (67) D. Rodriguez, I. Herraiz, R. Harrison, J. Dolado, J. Riquelme, Preliminary comparison of techniques for dealing with imbalance in software defect prediction, in: Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering, ACM, 2014, p. 43.
  • (68) T. Shippey, D. Bowes, S. Counsell, T. Hall, Code cleaning for software defect prediction: A cautionary tale, in: 44th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), IEEE, 2018, pp. 239–243.
  • (69) Y. Tian, N. Li, J. Tian, W. Zheng, How well just-in-time defect prediction techniques enhance software reliability?, in: 2020 IEEE 20th International Conference on Software Quality, Reliability and Security (QRS), IEEE, 2020, pp. 212–221.
  • (70) H. Tong, B. Liu, S. Wang, Software defect prediction using stacked denoising autoencoders and two-stage ensemble learning, Information and Software Technology 96 (2018) 94–111.
  • (71) H. Tong, S. Wang, G. Li, Credibility based imbalance boosting method for software defect proneness prediction, Applied Sciences 10 (22) (2020) 8059.
  • (72) X. Xuan, D. Lo, X. Xia, Y. Tian, Evaluating defect prediction approaches using a massive set of metrics: An empirical study, in: Proceedings of the 30th Annual ACM Symposium on Applied Computing, 2015, pp. 1644–1647.
  • (73) Z. Xu, J. Liu, X. Luo, Z. Yang, Y. Zhang, P. Yuan, Y. Tang, T. Zhang, Software defect prediction based on kernel PCA and weighted extreme learning machine, Information and Software Technology 106 (2019) 182–200.
  • (74) Z. Xu, K. Zhao, M. Yan, P. Yuan, L. Xu, Y. Lei, X. Zhang, Imbalanced metric learning for crashing fault residence prediction, Journal of Systems and Software 170 (2020) 110763.
  • (75) F. Zhang, A. Mockus, I. Keivanloo, Y. Zou, Towards building a universal defect prediction model with rank transformed predictors, Empirical Software Engineering 21 (5) (2016) 2107–2145.
  • (76) Y. Zhang, D. Jin, Y. Xing, Y. Gong, Automated defect identification via path analysis-based features with transfer learning, Journal of Systems and Software 166 (2020) 110585.
  • (77) L. Zhao, Z. Shang, L. Zhao, A. Qin, Y. Tang, Siamese dense neural network for software defect prediction with small data, IEEE Access 7 (2019) 7663–7677.