Software bugs can be very costly to both the users and the company providing the software. As the world becomes more dependent on software, detecting defects early in the development process becomes ever more critical: the earlier a bug is found and fixed, the less it costs (Capers2008).
Research in software defect prediction aims at the automatic and early identification of problematic software components, in order to direct most of the testing effort towards them.
A large number of classifiers have been investigated to build software defect prediction models (hall2011systematic). The most commonly used type of these classification models, two-class predictors, relies on training data consisting of instances of the two classes being studied (i.e., defective and non-defective instances). These include widely known models such as Random Forest (RF) (breiman2001random), Support Vector Machines (SVM) (burges1998tutorial), Naïve Bayes (NB) (witten2005practical), etc. The use of both classes allows such models to learn characteristics of both types of modules, possibly making the prediction of a new, unseen module more accurate.
One-class predictors are another type of classification models which have been recently gaining more attention. These techniques require the availability of one class only (i.e., non-defective instances) in order to learn characteristics of the training data. An unseen module is classified as an outlier if its characteristics are very different from those learnt by the model from the training set and hence it does not lie within the boundaries created by the technique.
The fact that the number of defective modules in real-world systems is much lower than that of non-defective ones leads to highly imbalanced datasets, which often cause two-class predictors to produce poor results (hall2011systematic; yao2020assessing). As a result, studies have attempted to improve the performance of defect prediction models by applying well-known under- and over-sampling approaches such as Random Under Sampling (RUS), the Synthetic Minority Over-sampling TEchnique (SMOTE) (chawla2002smote), the ADAptive SYNthetic sampling technique (ADASYN) (he2008adasyn), and SMOTUNED (agrawal2018better) to balance the datasets. However, in a recent study investigating the impact of class rebalancing techniques on the performance measures and interpretation of defect models, Tantithamthavorn et al. (tantithamthavorn2018impact) show that sampling affects the interpretability of defect prediction models and should be avoided when deriving knowledge and insights from them. Thus, a model that requires only non-defective instances, and no data balancing, to learn and accurately classify defective modules could offer an alternative way to address data imbalance.
While several studies have shown that one-class predictors can be successfully used to address various classification tasks suffering from imbalanced data (manevitz2001one; erfani2016high; zhang2006fall; li2010positive; khan2014one), only one preliminary investigation has been carried out on the use of one-class SVM (OCSVM) for software defect prediction, showing promising results (chen2016software).
In this paper, we revisit and further investigate the application of OCSVM in the light of the more recent advances in defect prediction empirical studies.
To this end, we conduct a large-scale empirical study involving 15 real-world software projects to compare the performance of OCSVM vs. seven traditional two-class classifiers under three validation scenarios (namely, within-project, cross-version and cross-project defect prediction). Specifically, we investigate both the canonical OCSVM, as proposed in the literature, as well as a novel way to perform OCSVM hyper-parameter tuning by using a minimal number of defective instances (we will refer to this version as OCSVM).
Our results reveal that, while we cannot recommend the use of OCSVM for the within-project scenario, its performance notably improves when heterogeneous data is used, as in the cross-project scenario, where OCSVM is able to achieve statistically significantly better estimates than three widely used two-class approaches (i.e., SVM, NB, LR). Moreover, it is able to outperform Random Forest with statistical significance in 22% of the cases for the within-project and cross-version scenarios, and in 44% of the cases for the cross-project scenario.
Although we cannot conclude that OCSVM is suitable for all scenarios, we believe reporting such findings can help advance the research agenda in defect prediction. Properly conducted studies with negative, null or neutral results, like ours, are essential for the progression of science and its self-correcting nature as much as positive results are (paige2017foreword; tichy2000hints; ferguson2012vast). Sharing these findings prevents other researchers from following the same route and it can help them adjust their own research plans, thus saving time and effort; it can also provide researchers with the knowledge needed to develop alternative strategies and evolve new better ideas (paige2017foreword; menzies2017negative).
In fact, even though our study shows that OCSVM does not consistently outperform all two-class classifiers considered herein, it also provides some interesting initial evidence on the potential advantages of using a minimal number of defective instances for hyper-parameter tuning, especially in the case of heterogeneous data (i.e., training data composed of different versions of the same target project, or of projects different from the target one).
Thus, we encourage future work to further explore the extent to which using different ratios of defective instances, coming from the target project or from different ones, can impact prediction performance, as this can serve as an alternative solution when data on defective instances is scarce or unavailable.
In the remainder of the paper, we present the design and the results of the empirical study we carried out to assess the applicability of OCSVM for defect prediction in Sections 2 and 3, respectively. Then we present our replication of the previous study by Chen et al. (chen2016software). We discuss the threats to validity of both studies in Section 5. Related work is discussed in Section 6. We conclude the paper by reflecting on the lessons learnt and discussing future work in Section 7.
|Repository||Dataset||No. of modules (faulty %)|
|NASA (petric2016jinx)||CM1||296 (12.84%)|
|Realistic (yatish2019mining)||Hive_0.12.0||2662 (8.00%)|
2 Empirical Study Design
2.1 Research Questions
First and foremost, we investigate if OCSVM is able to outperform a naive baseline.
Recent studies have stressed the importance of including a baseline benchmark when assessing any newly proposed prediction model (d2012evaluating; sarro2018linear). This check is essential to assess whether OCSVM is able to learn and differentiate non-defective modules from defective ones, instead of randomly classifying them. For this reason, we pose our first research question:
RQ1. OCSVM vs. Random: Does OCSVM outperform a Random classifier?
In order to answer RQ1, we compare OCSVM with a basic Random Classifier, which is completely independent of the training data (i.e., there is no learning phase), and instead generates predictions uniformly at random (scikitLearn). Any prediction system must outperform the Random Classifier, otherwise this would indicate that the prediction system is not learning any information from the training data (sarro2018linear).
Our second benchmark consists of assessing whether OCSVM performs better than its two-class counterpart, SVM. This is a required check: if the results reveal the opposite, there is no advantage in using the one-class classifier. To this end, we ask:
RQ2. OCSVM vs. SVM: Does OCSVM outperform SVM, its two-class counterpart?
To answer this question, we compare OCSVM against its two-class version using the same kernel. We use SVM both out-of-the-box and tuned, as the former has been widely used in past studies, although we discourage its use since the lack of proper hyper-parameter tuning might lead to less accurate predictions (tantithamthavorn2018impact; sarro2012further; di2011genetic).
A positive answer to RQ2 means that using only information about the non-defective class is sufficient to achieve accurate predictions with SVM. However, it might still happen that OCSVM is not comparable with other two-class classifiers. This leads us to our third and last research question, where we compare the performance of OCSVM to that of other well-known traditional two-class techniques:
RQ3. OCSVM vs. Traditional ML: Does OCSVM outperform traditional machine learning techniques?
To address this question, we compare OCSVM with three traditional two-class machine learning techniques, namely NB, LR and RF, which have been widely used in defect prediction studies (hall2011systematic). Similarly to RQ2, we experiment with both tuned and non-tuned versions.
In the remainder of this section, we describe in detail the experimental setting used to answer these RQs.
2.2 Datasets
In our empirical study we have used two sets of publicly available software project datasets: the NASA (petric2016jinx) and the Realistic (yatish2019mining) datasets.
The NASA datasets were made publicly available several years ago and have been extensively used in previous work (Hall et al. (hall2011systematic) found that more than a quarter of the relevant defect prediction studies published between 2000 and 2010 made use of the NASA datasets). However, they only contain one version per software project, and thus can be used only to assess the within-project defect prediction scenario.
On the other hand, the Realistic datasets have been collected more recently (2019) and contain multiple versions of the same software projects; they can therefore be used for the cross-version and cross-project defect prediction scenarios. In the following we describe the two sets in more detail.
2.2.1 NASA dataset
The NASA datasets, made publicly available by the NASA’s Metrics Data Program (MDP), contain data on the NASA Lunar space system software written both in C and Java. In our empirical study we use the NASA datasets curated by Petrić et al. (petric2016jinx), who applied rules to clean and remove the erroneous data contained in the original NASA datasets (petric2016jinx; shepperd2013data). We use the six datasets listed in Table 1.
They contain static code measures (LOC, Halstead, McCabe, etc.) and the number of defects for each software component.
2.2.2 Realistic dataset
The Realistic datasets have been collected from nine open-source software systems (i.e., ActiveMQ, Camel, Derby, Groovy, HBase, Hive, JRuby, Lucene, and Wicket), which vary in size, domain, and defect ratio, in order to reduce potential conclusion bias. The data has been extracted from the JIRA Issue Tracking System of these software systems by following a rigorous procedure, as explained elsewhere (yatish2019mining), resulting in less erroneous defect counts and hence representing a more realistic scenario of defective module collection. The metrics extracted include code, process, and ownership metrics, for a total of 65 metrics (i.e., 54 code metrics, 5 process metrics, and 6 ownership metrics), as detailed in the original paper (yatish2019mining).
In our experiment we consider two releases for each of these nine software systems, as listed in Table 1. This allows us to explore the applicability of OCSVM for the cross-version defect prediction (CVDP) scenario, where data from one release is used to build prediction models to identify defects in subsequent releases. We also explore the cross-project defect prediction (CPDP) scenario by exploiting the Realistic datasets, where data from various software projects is used all together to build prediction models for predicting defective instances in a different target project. CPDP is useful when the target project lacks historical or sufficient local data (herbold2017comparative). However, it has been shown that CPDP is a more difficult prediction problem than CVDP due to the use of heterogeneous data (nam2017heterogeneous).
2.3 Evaluation Criteria
The performance of a classification model is normally evaluated based on the confusion matrix shown in Table 2. The matrix describes four types of instances:
True Positives (TP), defective modules correctly classified as defective;
False Positives (FP), non-defective modules falsely classified as defective;
False Negatives (FN), defective modules falsely classified as non-defective;
True Negatives (TN), non-defective modules correctly classified as non-defective.
|Actual Value||Predicted Defective (1)||Predicted Non-Defective (0)|
|Defective (1)||TP||FN|
|Non-Defective (0)||FP||TN|
The values in a confusion matrix are used to compute a set of evaluation measures. Common ones include Recall, which describes the proportion of defective modules that are actually classified as defective; Precision, which measures the proportion of modules that are actually defective out of those classified as defective; and F-Measure, which is the harmonic mean of Precision and Recall. However, when the data is imbalanced, which is frequently the case in defect prediction, these measures have been shown to be biased (shepperd2014researcher; yao2020assessing). Instead, it is recommended to use the Matthews Correlation Coefficient (MCC), a more balanced measure which, unlike the other measures, takes into account all the values of the confusion matrix. In our empirical study we use MCC together with statistical significance tests and effect size, as further explained in Section 2.4.
The MCC represents the correlation coefficient between the actual and predicted classifications and it is defined as follows:

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

It outputs a value between −1 and +1, where a value of +1 indicates a perfect prediction, a value of 0 signifies that the prediction is no better than random guessing, and −1 represents a completely mis-classified output.
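The formula above can be sketched directly in Python; the helper name and the example counts below are illustrative, not taken from the paper:

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient from the four confusion-matrix counts."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: when a row or column of the matrix is empty, return 0.
    return num / den if den != 0 else 0.0

print(mcc(tp=10, fp=0, fn=0, tn=90))   # perfect prediction -> 1.0
print(mcc(tp=5, fp=45, fn=5, tn=45))   # chance-level prediction -> 0.0
```

Note how a classifier that labels everything defective on imbalanced data would score well on Recall but near zero on MCC, which is why MCC is preferred here.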
Moreover, in order to show whether there is any statistical significance between the results obtained by the models, we perform the Wilcoxon Signed-Rank Test (woolson2007wilcoxon), setting the confidence limit, α, at 0.05 and applying the Bonferroni correction (α/K, where K is the number of hypotheses) for multiple statistical testing (the most conservatively cautious of all corrections) (sarro2016multi). Unlike parametric tests, the Wilcoxon Signed-Rank Test raises the bar for significance by making no assumptions about the underlying data distributions. However, as pointed out by Arcuri and Briand (ArcuriB14), it is inadequate to merely show statistical significance alone without assessing whether the effect size is worthy of interest. To this end we use the Vargha and Delaney Â12 non-parametric effect size measure, as it is recommended to use a standardised measure rather than a pooled one like Cohen's d when not all samples are normally distributed (ArcuriB14), as in our case. The Â12 statistic measures the probability that an algorithm A yields better values for a given performance measure than running another algorithm B, based on the following formula:

Â12 = (R1/m − (m + 1)/2) / n

where R1 is the rank sum of the first data group we are comparing, and m and n are the number of observations in the first and second data sample, respectively. If the two algorithms are equivalent, then Â12 = 0.5. Given the first algorithm performing better than the second, Â12 is considered small for 0.6 ≤ Â12 < 0.7, medium for 0.7 ≤ Â12 < 0.8, and large for Â12 ≥ 0.8, although these thresholds are somewhat arbitrary (sarro2016multi). As we are interested in any improvement in predictive performance, no transformation of the Â12 metric is needed (NeumannHP15; sarro2016multi).
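The rank-sum formula above can be sketched as follows; the function name is ours and ties are handled with averaged ranks, as in the standard definition:

```python
def a12(sample1, sample2):
    """Vargha-Delaney A12: probability that a value drawn from sample1
    exceeds one drawn from sample2 (0.5 means the samples are equivalent)."""
    m, n = len(sample1), len(sample2)
    combined = sorted(sample1 + sample2)
    # Assign each distinct value the average of the ranks it occupies (tie handling).
    ranks = {}
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j] == combined[i]:
            j += 1
        ranks[combined[i]] = (i + 1 + j) / 2   # mean of ranks i+1 .. j
        i = j
    r1 = sum(ranks[v] for v in sample1)        # rank sum of the first group
    return (r1 / m - (m + 1) / 2) / n

print(a12([2, 2], [1, 1]))   # first sample always wins -> 1.0
print(a12([1, 2], [1, 2]))   # identical samples -> 0.5
```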
2.4 Validation Criteria
For the within-project scenario experiments, involving the NASA data, we follow common practice for hold-out validation, using 80% of the data for training and the remaining 20% for testing, and repeating this process 30 times, each time with a different seed, in order to reduce any possible bias resulting from the validation splits (arcuri2014hitchhiker).
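The repeated hold-out procedure can be sketched as below; the data and the choice of Random Forest here are synthetic placeholders standing in for a NASA dataset and any of the paper's classifiers:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))                             # placeholder static-code metrics
y = (X[:, 0] + rng.normal(size=300) > 1.2).astype(int)    # placeholder defect labels

scores = []
for seed in range(30):                                    # 30 repetitions, one seed each
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y)
    model = RandomForestClassifier(random_state=seed).fit(X_tr, y_tr)
    scores.append(matthews_corrcoef(y_te, model.predict(X_te)))

print(f"median MCC over 30 splits: {np.median(scores):.2f}")
```

Aggregating over the 30 MCC values (rather than a single split) is what makes the Wilcoxon comparison in Section 2.3 possible.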
For the experiments involving the Realistic data, we explore the performance of the models in two additional scenarios (namely, CVDP and CPDP), given that this data consists of multiple releases, as explained in Section 2.2. In the CVDP scenario, for each of the software systems, we train on one release and test on a different one, i.e., we train on version v_i and test on version v_j, where i < j, as done in previous work (see e.g., (harman2014less)). In the CPDP scenario, for each of the software systems, we consider the version with the higher release number as the test set and train the model on the union of the versions of the other datasets with a lower release number. The versions used as training and test sets are not subsequent releases, nor are they the system's most recent ones. In addition, there is always a window of at least five months between these releases. This reduces the likelihood of the snoring effect or unrealistic labelling, as described in previous studies (ahluwalia2020need; jimenez2019importance; bangash2020time).
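The two split strategies can be sketched as follows; the project and version names are placeholders, not the actual Realistic releases:

```python
# Illustrative release-labelled data: project -> ordered list of (version, data) pairs,
# earliest release first.
releases = {
    "ProjA": [("1.0", "A1"), ("2.0", "A2")],
    "ProjB": [("1.0", "B1"), ("2.0", "B2")],
    "ProjC": [("1.0", "C1"), ("2.0", "C2")],
}

def cvdp_split(project):
    """Cross-version: train on the earlier release, test on the later one."""
    (_, train), (_, test) = releases[project]
    return [train], test

def cpdp_split(target):
    """Cross-project: test on the target's later release, train on the union
    of the earlier releases of all other projects."""
    train = [vers[0][1] for proj, vers in releases.items() if proj != target]
    test = releases[target][-1][1]
    return train, test

print(cvdp_split("ProjA"))   # (['A1'], 'A2')
print(cpdp_split("ProjA"))   # (['B1', 'C1'], 'A2')
```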
In the following, we briefly describe each of the techniques used in our study, while the settings for each of the techniques are discussed in Section 2.5.1.
Naïve Bayes (NB) is a statistical classifier that uses the combined probabilities of the different attributes to predict the target variable, based on the principle of Maximum a Posteriori (witten2005practical). This technique assumes that there is no correlation between the features and therefore treats them independently, using one at a time to make the prediction. NB is known to perform well in spite of the underlying simplifying assumption of conditional independence.
Random Forest (RF) is a decision-tree-based classifier consisting of an ensemble of tree predictors (breiman2001random). It classifies a new instance by aggregating the predictions made by its decision trees. Each tree is constructed by randomly sampling N cases, with replacement, from the N-sized training set. At each node, it randomly selects m attributes from the full set of M attributes, where m ≪ M, and chooses the one that best splits the data according to their Gini impurity. This process is repeated until the tree is expanded to the largest extent, obtaining leaf nodes that refer to either one of the concerned classes. Given its nature, RF is known to perform well on large datasets and to be robust to noise.
Support Vector Machine (SVM) is a widely known classification technique (burges1998tutorial). A linear model of this technique uses a hyperplane in order to separate data points into two categories. However, in many cases, there might be several hyperplanes that can correctly separate the data. SVM seeks to find the hyperplane that has the largest margin, in order to achieve a maximum separation between the two categories. The margin is the summation of the shortest distance from the separating hyperplane to the nearest data point of both categories. The resulting model can be represented by Equation 1:

f(x) = Σ_i α_i ⟨x, x_i⟩ + b   (1)

where the α_i are the parameters (one per training observation), x is the new observation, x_i is an observation in the training data, and b is a bias term. When the data is not linearly separable, SVM maps the input space to a feature space. To achieve this, a kernel function is used instead of the inner product. This allows the formation of a non-linear decision boundary.
One-Class SVM (OCSVM) is an unsupervised version of SVM, whereby the technique trains on one class label only instead of two. Similar to its two-class counterpart SVM, it aims to draw a boundary around the instances that belong to the same class. However, given that this technique learns from one class only, it creates boundaries for the instances that belong to that class. Any instance that is not mapped inside the created boundary is considered an anomaly or an outlier, and hence classified as the other class.
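The one-class behaviour can be illustrated with scikit-learn's OneClassSVM (the library the study uses); the data and hyper-parameter values below are placeholders:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
clean = rng.normal(0, 1, size=(200, 4))     # placeholder non-defective modules
outliers = rng.normal(6, 1, size=(10, 4))   # placeholder defective modules, far from the cluster

# Train on the non-defective class only; nu bounds the fraction of training
# points allowed to fall outside the learnt boundary.
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(clean)

# predict() returns +1 for inliers (non-defective) and -1 for outliers (defective).
print(ocsvm.predict(outliers))
```

Any module falling outside the boundary fitted around the non-defective training data is thus flagged as defective, with no defective examples needed at training time.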
In our empirical study we benchmark OCSVM with respect to a Random Classifier (RQ1) and four traditional and widely used two-class classifiers: SVM (RQ2), and LR, NB and RF (RQ3). We experiment with both tuned and non-tuned versions of SVM and a tuned version of RF, since recent work has emphasised the importance of hyperparameter tuning for defect prediction (di2011genetic; tantithamthavorn2018impact; tantithamthavorn2016automated). We use the default parameters of Scikit-Learn (scikitLearn) for the non-tuned versions and perform Grid Search for the tuned ones (a technique for which hyperparameter tuning has been applied is denoted as: technique), using GridSearchCV from the Scikit-Learn v.0.20.2 Python (v.3.6.8) library (scikitLearn). By applying hyperparameter tuning, we search for and obtain the best parameter values for the ML techniques used in our study based on MCC.
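MCC-driven grid search with GridSearchCV can be sketched as below; the synthetic data and the small search grid are illustrative, not the paper's exact search space:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import matthews_corrcoef, make_scorer

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0.8).astype(int)   # placeholder defect labels

# make_scorer turns MCC into the criterion GridSearchCV optimises.
grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]}
search = GridSearchCV(SVC(), grid, scoring=make_scorer(matthews_corrcoef), cv=5)
search.fit(X, y)
print(search.best_params_)
```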
Since GridSearchCV cannot be applied to classifiers that learn only from modules of the same class, like OCSVM, we implement our own Grid Search method, namely GridSearchCV-OCSVM. In the latter, we introduce instances of the other class (i.e., defective modules) in the testing fold of Grid Search's inner CV (which is only used to assess the hyper-parameters and is not used in the training of the model or its validation). This allows us to obtain a confusion matrix, like the one obtained when GridSearchCV is performed on a two-class predictor, from which we are able to calculate the MCC and choose the most suitable parameter values. The number of defective instances added to the testing fold reflects the proportion of classes in the original training set; these instances are drawn randomly, with no duplicates.
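A minimal sketch of this idea follows; the function name, parameter grid and synthetic data are ours, not the authors' implementation:

```python
import numpy as np
from itertools import product
from sklearn.svm import OneClassSVM
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import KFold

def grid_search_ocsvm(X_clean, X_defective, param_grid, ratio, n_splits=5, seed=0):
    """Select OCSVM hyper-parameters: fit on non-defective folds only, then score
    each held-out fold after injecting a small sample of defective instances."""
    rng = np.random.default_rng(seed)
    best_params, best_score = None, -np.inf
    for params in (dict(zip(param_grid, v)) for v in product(*param_grid.values())):
        scores = []
        for tr, te in KFold(n_splits, shuffle=True, random_state=seed).split(X_clean):
            model = OneClassSVM(**params).fit(X_clean[tr])   # trained on one class only
            n_def = max(1, int(len(te) * ratio))             # mirror the class ratio
            inject = X_defective[rng.choice(len(X_defective), n_def, replace=False)]
            X_te = np.vstack([X_clean[te], inject])
            y_te = np.r_[np.ones(len(te)), -np.ones(n_def)]  # +1 clean, -1 defective
            scores.append(matthews_corrcoef(y_te, model.predict(X_te)))
        if np.mean(scores) > best_score:
            best_score, best_params = np.mean(scores), params
    return best_params

rng = np.random.default_rng(0)
best = grid_search_ocsvm(
    X_clean=rng.normal(0, 1, (100, 3)),
    X_defective=rng.normal(5, 1, (20, 3)),
    param_grid={"nu": [0.05, 0.1], "gamma": ["scale", 0.1]},
    ratio=0.1)
print(best)
```

The key design point is that the injected defective instances touch only the inner scoring folds, so the model itself is still fitted exclusively on non-defective data.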
3 Empirical Study Results
In this section we report the results of our empirical study, answering RQs 1–3 for each of the scenarios investigated: Section 3.1 reports on the within-project results, Section 3.2 on the cross-version results, and Section 3.3 on the cross-project ones.
3.1 Results for Within-Project Scenario
RQ1. OCSVM vs. Random: Table 3 shows that the MCC values obtained by OCSVM are higher than those obtained by Random on two datasets (i.e., KC3 and PC5), while on the other four datasets, the former obtains the same results, with values indicating that it is no better than random guessing. The Wilcoxon test and effect size measures, reported in Table 4, also support this by showing that the difference achieved on PC5 is statistically significant. On the other hand, OCSVM outperforms Random on all datasets, with a statistically significant difference and a large effect size on four datasets and medium effect size on the remaining two datasets.
RQ2. OCSVM vs. SVM: To address this question, we compare both OCSVM and OCSVM with their two-class counterparts, SVM and SVM. It is clear from the results shown in Table 3 that hyperparameter tuning enhances the performance of these techniques, as OCSVM and SVM generally obtain better results than OCSVM and SVM, respectively. We can also observe that, while SVM and OCSVM obtain the same results on five datasets, with the latter performing better on the remaining dataset (i.e., PC5), OCSVM outperforms both SVM and OCSVM on all the datasets with a statistically significant difference, with five of them showing a large effect size and the remaining dataset having a medium one, as shown in Table 4. OCSVM also outperforms SVM on five out of six datasets, with the difference being statistically significant on three of them with a large effect size.
|Validation||Dataset||OCSVM vs. Random||OCSVM vs. Random||OCSVM vs. SVM||OCSVM vs. SVM||OCSVM vs. SVM||OCSVM vs. SVM|
|HoldOut||CM1||0.484 (0.53)||0.003 (0.71)||0.373 (0.50)||1.000 (0.00)||<0.001 (0.90)||0.786 (0.47)|
|KC3||0.245 (0.57)||<0.001 (0.94)||0.164 (0.50)||<0.001 (1.00)||<0.001 (1.00)||<0.001 (1.00)|
|MW1||0.452 (0.53)||0.003 (0.72)||0.774 (0.50)||0.050 (0.68)||<0.001 (0.77)||<0.001 (0.80)|
|PC1||0.580 (0.47)||<0.001 (0.89)||0.009 (0.60)||0.755 (0.48)||<0.001 (0.97)||<0.001 (0.86)|
|PC3||0.540 (0.50)||<0.001 (0.97)||0.886 (0.50)||1.000 (0.08)||<0.001 (0.97)||0.060 (0.65)|
|PC5||0.021 (0.64)||<0.001 (0.81)||<0.001 (0.92)||1.000 (0.00)||<0.001 (0.92)||0.180 (0.60)|
|CVDP||ActiveMQ||<0.001 (0.93)||<0.001 (1.00)||<0.001 (1.00)||1.000 (0.00)||<0.001 (1.00)||1.000 (0.00)|
|Camel||<0.001 (1.00)||<0.001 (1.00)||<0.001 (1.00)||1.000 (0.00)||<0.001 (1.00)||<0.001 (1.00)|
|Derby||<0.001 (1.00)||<0.001 (1.00)||<0.001 (1.00)||1.000 (0.00)||<0.001 (1.00)||1.000 (0.00)|
|Groovy||<0.001 (1.00)||<0.001 (1.00)||<0.001 (1.00)||1.000 (0.00)||<0.001 (1.00)||1.000 (0.00)|
|HBase||<0.001 (1.00)||<0.001 (1.00)||<0.001 (1.00)||1.000 (0.00)||<0.001 (1.00)||<0.001 (1.00)|
|Hive||<0.001 (1.00)||<0.001 (1.00)||<0.001 (1.00)||1.000 (0.00)||<0.001 (1.00)||<0.001 (0.90)|
|JRuby||<0.001 (1.00)||<0.001 (1.00)||<0.001 (1.00)||<0.001 (1.00)||<0.001 (1.00)||<0.001 (1.00)|
|Lucene||<0.001 (0.90)||<0.001 (1.00)||<0.001 (1.00)||1.000 (0.00)||<0.001 (1.00)||<0.001 (1.00)|
|Wicket||<0.001 (0.73)||<0.001 (0.93)||<0.001 (1.00)||1.000 (0.00)||<0.001 (0.93)||1.000 (0.00)|
|CPDP||ActiveMQ||<0.001 (1.00)||<0.001 (1.00)||<0.001 (1.00)||1.000 (0.00)||<0.001 (1.00)||<0.001 (1.00)|
|Camel||<0.001 (1.00)||<0.001 (1.00)||<0.001 (1.00)||1.000 (0.00)||<0.001 (1.00)||<0.001 (1.00)|
|Derby||<0.001 (0.90)||<0.001 (1.00)||<0.001 (1.00)||1.000 (0.00)||<0.001 (1.00)||<0.001 (1.00)|
|Groovy||<0.001 (1.00)||<0.001 (1.00)||<0.001 (1.00)||<0.001 (1.00)||<0.001 (1.00)||<0.001 (1.00)|
|HBase||<0.001 (1.00)||<0.001 (1.00)||<0.001 (1.00)||1.000 (0.00)||<0.001 (1.00)||<0.001 (1.00)|
|Hive||<0.001 (1.00)||<0.001 (1.00)||<0.001 (1.00)||<0.001 (1.00)||<0.001 (1.00)||<0.001 (1.00)|
|JRuby||<0.001 (0.93)||<0.001 (1.00)||<0.001 (1.00)||1.000 (0.00)||<0.001 (1.00)||<0.001 (1.00)|
|Lucene||<0.001 (1.00)||<0.001 (1.00)||<0.001 (1.00)||1.000 (0.00)||<0.001 (1.00)||<0.001 (1.00)|
|Wicket||<0.001 (0.97)||<0.001 (1.00)||<0.001 (1.00)||1.000 (0.00)||<0.001 (1.00)||<0.001 (1.00)|
RQ3. OCSVM vs. traditional ML: To answer this question, we compare the performance of OCSVM to that of NB, LR and RF. We can observe from Tables 3 and 5 that OCSVM does not outperform LR, only performs better than RF on one dataset, with statistical significance and a large effect size, and outperforms NB on two datasets, with the differences being statistically significant and having a medium effect size. On the other hand, OCSVM outperforms traditional ML techniques on seven out of the 18 cases studied. Specifically, it outperforms NB and LR on two datasets each, obtaining differences that are statistically significant with large effect sizes when compared to NB. OCSVM also outperforms RF on three datasets, with two of them being statistically significant and having a large effect size.
|Validation||Dataset||OCSVM vs. NB||OCSVM vs. LR||OCSVM vs. RF||OCSVM vs. NB||OCSVM vs. LR||OCSVM vs. RF|
|HoldOut||CM1||1.000 (0.15)||0.999 (0.37)||<0.001 (0.80)||0.989 (0.34)||0.755 (0.45)||<0.001 (0.92)|
|KC3||1.000 (0.13)||1.000 (0.22)||1.000 (0.00)||0.962 (0.34)||0.067 (0.61)||<0.001 (0.94)|
|MW1||1.000 (0.03)||1.000 (0.33)||0.999 (0.33)||1.000 (0.17)||0.975 (0.40)||0.680 (0.47)|
|PC1||1.000 (0.07)||1.000 (0.20)||1.000 (0.07)||1.000 (0.20)||0.985 (0.34)||1.000 (0.15)|
|PC3||0.005 (0.77)||1.000 (0.27)||1.000 (0.00)||<0.001 (0.95)||0.604 (0.46)||0.974 (0.32)|
|PC5||0.008 (0.64)||0.995 (0.35)||1.000 (0.04)||<0.001 (0.87)||0.271 (0.54)||0.089 (0.63)|
|CVDP||ActiveMQ||1.000 (0.00)||1.000 (0.00)||1.000 (0.00)||1.000 (0.17)||1.000 (0.17)||1.000 (0.00)|
|Camel||1.000 (0.00)||<0.001 (1.00)||1.000 (0.00)||1.000 (0.00)||<0.001 (1.00)||<0.001 (0.99)|
|Derby||<0.001 (1.00)||1.000 (0.00)||1.000 (0.00)||<0.001 (1.00)||<0.001 (1.00)||1.000 (0.00)|
|Groovy||<0.001 (1.00)||<0.001 (1.00)||1.000 (0.00)||<0.001 (1.00)||<0.001 (1.00)||1.000 (0.01)|
|HBase||<0.001 (1.00)||<0.001 (1.00)||1.000 (0.00)||<0.001 (1.00)||<0.001 (1.00)||1.000 (0.00)|
|Hive||<0.001 (1.00)||<0.001 (1.00)||1.000 (0.00)||<0.001 (1.00)||<0.001 (1.00)||1.000 (0.00)|
|JRuby||1.000 (0.00)||<0.001 (1.00)||1.000 (0.00)||<0.001 (0.97)||<0.001 (1.00)||1.000 (0.00)|
|Lucene||1.000 (0.00)||1.000 (0.00)||1.000 (0.03)||1.000 (0.00)||<0.001 (1.00)||<0.001 (0.90)|
|Wicket||1.000 (0.00)||<0.001 (1.00)||1.000 (0.00)||1.000 (0.00)||<0.001 (0.93)||1.000 (0.00)|
|CPDP||ActiveMQ||<0.001 (1.00)||1.000 (0.00)||1.000 (0.00)||<0.001 (1.00)||<0.001 (1.00)||<0.001 (1.00)|
|Camel||<0.001 (1.00)||<0.001 (1.00)||1.000 (0.00)||<0.001 (1.00)||<0.001 (1.00)||1.000 (0.20)|
|Derby||1.000 (0.00)||<0.001 (1.00)||1.000 (0.00)||<0.001 (1.00)||<0.001 (1.00)||<0.001 (1.00)|
|Groovy||<0.001 (1.00)||<0.001 (1.00)||1.000 (0.00)||<0.001 (1.00)||<0.001 (1.00)||1.000 (0.13)|
|HBase||<0.001 (1.00)||<0.001 (1.00)||1.000 (0.00)||<0.001 (1.00)||<0.001 (1.00)||1.000 (0.05)|
|Hive||<0.001 (1.00)||<0.001 (1.00)||1.000 (0.00)||<0.001 (1.00)||<0.001 (1.00)||<0.001 (1.00)|
|JRuby||1.000 (0.00)||1.000 (0.00)||1.000 (0.00)||<0.001 (1.00)||<0.001 (1.00)||1.000 (0.00)|
|Lucene||1.000 (0.00)||<0.001 (1.00)||1.000 (0.00)||<0.001 (1.00)||<0.001 (1.00)||1.000 (0.00)|
|Wicket||<0.001 (1.00)||<0.001 (1.00)||1.000 (0.00)||<0.001 (1.00)||<0.001 (1.00)||<0.001 (0.93)|
Therefore, based on the above results we can state that:
3.2 Results for Cross-Version Scenario
In this section, we discuss the results of the use of OCSVM for cross-version defect prediction.
RQ1. OCSVM vs. Random: Table 3 shows the MCC values obtained by OCSVM and Random. We can observe that OCSVM obtains better results on all datasets, with the difference always being statistically significant and the effect size being large for eight of these nine datasets. When comparing OCSVM and Random, results show that OCSVM also outperforms Random on all datasets. The Wilcoxon Test and effect size results, described in Table 4, support this conclusion as they show a statistically significant difference and a large effect size on all cases considered.
RQ2. OCSVM vs. SVM: To address RQ2, we compare the performance of OCSVM with that of SVM. Results in Table 3 indicate that OCSVM always outperforms its two-class counterpart SVM. Table 4 also indicates that this conclusion is supported by the statistical tests, as it shows that the difference is always statistically significant with the effect size being large. However, this is not always the case when comparing the hyperparameter tuned version of both of these models, OCSVM and SVM. Results show that the former obtains better results than the latter on five out of the nine cases studied with all differences being statistically significant and the effect size being large.
RQ3. OCSVM vs. traditional ML:
Based on the results reported in Tables 3 and 5, we can observe that RF generally performs better than the other techniques, achieving the highest MCC values in six out of the nine cases under study. However, when analysing the results obtained by NB, LR, and OCSVM, we can see that the latter performs better than LR in 67% of the cases (six out of nine) and better than NB in 44% (four out of nine), respectively, with the differences always being statistically significant and having a large effect size. When hyperparameter tuning is applied, OCSVM obtains better results than LR, NB and RF 89% of the time (eight out of nine cases), 56% of the time (five out of nine cases) and 22% of the time (two out of nine cases), respectively, with the differences always being statistically significant and having large effect sizes.
Based on the results above we can conclude that:
3.3 Results for the Cross-Project Scenario
In this section, we discuss the results for cross-project defect prediction.
RQ1. OCSVM vs. Random: To address RQ1, we compare the performance of OCSVM with that of a Random classifier. Results reported in Table 3 show that both OCSVM and OCSVM outperform Random on all datasets, with all differences being statistically significant and the effect size always being large, as shown in Table 4. RQ2. OCSVM vs. SVM: When compared to its two-class counterpart, OCSVM performs better than SVM in all the cases considered (see Table 3). The Wilcoxon and effect size results, reported in Table 4, confirm that these differences are statistically significant with a large effect size. When hyperparameter tuning is applied, OCSVM also outperforms both SVM and SVM on all nine datasets, with statistically significant differences and large effect sizes. However, the MCC values as well as the Wilcoxon and effect size results show that SVM outperforms OCSVM in seven out of the nine cases considered, and performs better than SVM in all cases studied.
RQ3. OCSVM vs. traditional ML: To investigate RQ3, we compare the performance of OCSVM and its tuned version to that of the traditional two-class classifiers (i.e., NB, LR and RF). Results reported in Table 3 show that both the tuned OCSVM and RF perform better than the other techniques. Specifically, while OCSVM performs better than NB and LR in 67% of the cases (six out of nine cases each), it always performs worse than RF. On the other hand, when hyperparameter tuning is applied, the tuned OCSVM always performs better than NB and LR, with the differences being statistically significant and the effect size being large in all cases. When compared to RF, the tuned OCSVM performs better in 44% of the cases (four out of nine), with the differences always being statistically significant and the effect size being large.
Based on the results of RQs 1-3 for the cross-project scenario we can conclude that:
The results of our empirical study (Section 3) show that OCSVM generally outperforms its two-class counterpart, SVM; however, it does not achieve results consistently higher than traditional two-class classifiers, and its performance is not as promising as that reported in the work of Chen et al. (chen2016software). For due diligence and completeness, we replicate their study. Below we describe the design and results of our replication; a summary of the design is given in Table 6.
4.1.1 Research Questions
The study by Chen et al. (chen2016software) did not explicitly state multiple research questions, but rather an overall research goal: to investigate whether OCSVM can be used to predict defects, and whether it would outperform other ML techniques. Hence, in this replication we aim to address the same goal, organised as the research questions we described in Section 2.1. We therefore investigate whether (RQ1) OCSVM outperforms the Random Classifier; (RQ2) OCSVM outperforms its two-class counterpart, namely SVM; (RQ3) OCSVM outperforms traditional two-class classification techniques widely used for defect prediction.
Chen et al. chen2016software investigate six highly imbalanced datasets obtained from the public NASA repository (Sayyad-Shirabad+Menzies:2005): CM1, KC3, MC1, MW1, PC1, PC2.
In our replication we use the same datasets available from the tera-PROMISE repository (Sayyad-Shirabad+Menzies:2005). We report in Table 7 the number of modules and percentage of faulty modules per dataset. We observe that three out of the six datasets used (i.e., KC3, MW1, PC1) are identical, whereas the other three datasets (i.e., CM1, MC1, PC2) vary slightly in the number of instances and percentage of defects from those used by Chen et al. (chen2016software). The reason for this difference cannot be determined, given that no indication of pre-processing was stated in the original study.
4.1.3 Validation and Evaluation Criteria
Chen et al. (chen2016software) performed 20 independent Hold-Out validations, where each time 10% of the data was randomly selected for training and 90% for testing. We perform the same Hold-Out validation, but increase the number of independent runs to 30 in order to gather more robust results.
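This repeated Hold-Out design can be sketched as follows with scikit-learn; the helper name and toy data are illustrative only, not taken from either study:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def hold_out_splits(X, y, runs=30, train_size=0.10, seed=0):
    """Yield (X_train, X_test, y_train, y_test) for repeated hold-out.

    Each run randomly selects 10% of the data for training and the
    remaining 90% for testing, mirroring the replicated design."""
    rng = np.random.RandomState(seed)
    for _ in range(runs):
        yield train_test_split(X, y, train_size=train_size,
                               random_state=rng.randint(2**31 - 1))

# toy data: 100 modules described by 3 metrics each
X = np.random.rand(100, 3)
y = np.random.randint(0, 2, 100)
splits = list(hold_out_splits(X, y))
assert len(splits) == 30          # 30 independent runs
assert splits[0][0].shape[0] == 10  # 10% of the modules used for training
```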
Recall, False Positive Rate and G-mean were used to evaluate and compare the performance of the techniques in the original study; however, the main conclusions were drawn based on G-mean (chen2016software). We decided to assess the performance of the techniques using MCC (see Section 2.3), since it is a robust evaluation measure (shepperd2014researcher; yao2020assessing). For completeness, we also include the G-mean results for OCSVM and compare them with those obtained in the original study (OCSVM-O) in Table 10. We observe that the G-mean values obtained with OCSVM in our study are much lower than those reported in the work of Chen et al. (chen2016software).
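For reference, both measures can be computed directly from confusion-matrix counts; the helper names below are our own:

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def g_mean(tp, fp, tn, fn):
    """Geometric mean of recall (TPR) and specificity (TNR)."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(tpr * tnr)

# a perfect predictor scores 1 on both measures
assert mcc(10, 0, 90, 0) == 1.0
assert g_mean(10, 0, 90, 0) == 1.0
```

Unlike G-mean, MCC also accounts for false positives relative to the predicted positives, which is why it is regarded as more robust on imbalanced data.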
We use the Wilcoxon signed rank test and the Vargha and Delaney Â12 effect size (described in Section 2.3) to check for any statistically significant difference in order to strengthen the robustness of our conclusions.
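As an illustration (the MCC values below are synthetic, not our actual results), the test and the Â12 effect size can be computed with SciPy and a few lines of NumPy:

```python
import numpy as np
from scipy.stats import wilcoxon

def a12(x, y):
    """Vargha and Delaney A12: probability that a value drawn from x
    is larger than one drawn from y (ties count as 0.5)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    greater = sum((xi > y).sum() for xi in x)
    ties = sum((xi == y).sum() for xi in x)
    return (greater + 0.5 * ties) / (len(x) * len(y))

# synthetic per-run MCC values for two techniques
mcc_a = [0.40, 0.42, 0.38, 0.45, 0.41, 0.39, 0.44, 0.43, 0.40, 0.42]
mcc_b = [0.30, 0.31, 0.29, 0.33, 0.30, 0.28, 0.32, 0.31, 0.30, 0.29]
stat, p = wilcoxon(mcc_a, mcc_b)  # paired, non-parametric
print(p < 0.05, a12(mcc_a, mcc_b))  # A12 = 1.0: technique a always wins here
```

Values of Â12 above 0.71 (or below 0.29) are conventionally interpreted as a large effect.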
4.1.4 Technique Setting
For these experiments, to preserve a fair comparison with previous work, we use the non-tuned version of the techniques. Specifically, we use OCSVM and compare its performance to that of its two-class counterpart, SVM, and the three other traditional machine learners: RF, NB and LR (see Section 2.5.1).
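The core difference between the one-class and the two-class setting can be sketched with Scikit-Learn, which we used in our experiments; the toy data and parameter values below are illustrative only:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(42)
# toy metrics: non-defective modules cluster near 0, defective ones near 3
X_clean = rng.normal(0.0, 1.0, size=(200, 5))
X_defect = rng.normal(3.0, 1.0, size=(20, 5))

# one-class SVM: fitted on the non-defective instances only
ocsvm = OneClassSVM(nu=0.1).fit(X_clean)

# predict() returns +1 for inliers (non-defective) and -1 for outliers
pred = ocsvm.predict(X_defect)
print((pred == -1).mean())  # fraction of defective modules flagged as outliers
```

A two-class SVM, by contrast, would require both `X_clean` and `X_defect` (with labels) at training time.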
| Dataset | Chen et al. (chen2016software): Modules (faulty %) | This study: Modules (faulty %) |
|---|---|---|
| CM1 | 496 (9.68%) | 498 (9.83%) |
| KC3 | 458 (9.39%) | 458 (9.39%) |
| MC1 | 9277 (0.73%) | 9466 (0.72%) |
| MW1 | 403 (7.69%) | 403 (7.69%) |
| PC1 | 1107 (6.87%) | 1107 (6.87%) |
| PC2 | 5460 (0.42%) | 5589 (0.41%) |
In this section we report and discuss the results we obtained in the replication study.
| Dataset | OCSVM vs. Random | OCSVM vs. SVM | OCSVM vs. RF | OCSVM vs. NB | OCSVM vs. LR |
|---|---|---|---|---|---|
| CM1 | 0.306 (0.54) | 0.007 (0.77) | 0.997 (0.26) | 0.994 (0.14) | 1.000 (0.13) |
| KC3 | <0.001 (0.99) | <0.001 (1.00) | 0.846 (0.43) | 1.000 (0.14) | 0.998 (0.27) |
| MC1 | <0.001 (1.00) | 1.000 (0.21) | 1.000 (0.23) | 0.999 (0.14) | 0.396 (0.59) |
| MW1 | 0.245 (0.55) | <0.001 (0.85) | 1.000 (0.29) | 1.000 (0.17) | 1.000 (0.09) |
| PC1 | 0.169 (0.63) | 1.000 (0.31) | 1.000 (0.06) | 1.000 (0.06) | 1.000 (0.07) |
| PC2 | <0.001 (1.00) | <0.001 (1.00) | 0.002 (0.83) | 0.983 (0.37) | 0.708 (0.48) |
RQ1. OCSVM vs. Random: To address RQ1, we compare the performance of OCSVM to that of a Random Classifier. From Tables 8 and 9 we can observe that OCSVM outperforms the Random classifier on three out of the six datasets with statistically significant difference and a large effect size. On the other three datasets, OCSVM obtains the same results as the random classifier.
RQ2. OCSVM vs. SVM: To answer RQ2, we compare the performance of OCSVM with its two-class counterpart SVM. By looking at the MCC values reported in Table 8, we can see that OCSVM performs similarly to or better than SVM on four out of the six datasets (i.e., CM1, KC3, MW1, PC2), with the differences being statistically significant on all four and the effect size being medium in one case and large in the other three. This shows that, by learning from the non-defective modules only, a one-class technique is able to perform similarly to, and sometimes better than, its counterpart built and trained on both defective and non-defective modules.
RQ3. OCSVM vs. traditional ML: In order to verify whether OCSVM outperforms traditional two-class classifiers, we compare its performance to that of three different techniques widely used in defect prediction studies (i.e., RF, NB, LR). Results, reported in Tables 8 and 9, show that OCSVM only outperforms RF in one case, with the difference being statistically significant and the effect size being large. When compared to LR and NB, results show that OCSVM does not outperform these two techniques on any of the 12 cases considered. Based on the results above, we conclude that:
5 Threats to Validity
The validity of our empirical study can be affected by three types of threats: construct validity, conclusion validity, and external validity. The construct validity threat arising from the choice of the data and the way it has been collected was mitigated by the use of publicly available datasets that have been carefully collected and used in previous work (petric2016jinx; yatish2019mining).
In relation to conclusion validity, we carefully calculated the performance measures and applied statistical tests, verifying all the required assumptions. We use a robust evaluation measure (i.e., MCC) to evaluate the performance of the prediction models (yao2020assessing; d2012evaluating).
The conclusion drawn from our replication may be affected by the fact that we use different hold-out data splits, as this information was not present in the original study; however, to mitigate this bias, we run the experiments 30 times and report the average results herein. Similarly, the tools used to run the experiments may differ, given that the original study did not report this information and the authors could not provide additional details. Also, we include different benchmarks with respect to those used in the original study, as it is preferable to use basic and widely used classifiers rather than more complex variants, based on the rationale that any novel approach should be able to outperform basic ones (shepperd2012evaluating; sarro2018linear).
The external validity of our study can be biased by the ML techniques and subjects we considered. However, we have designed our study aiming at using ML techniques and datasets, which are as representative as possible of the defect prediction literature. We considered traditional two-class classification techniques widely used in previous studies (hall2011systematic) as our aim is to benchmark one-class predictors vs. traditional two-class predictors, and not to search for the best prediction technique. If OCSVM is not able to outperform such traditional baselines, it is reasonable to assume that it will not perform better than more sophisticated ones proposed for cross-version and cross-project defect prediction (e.g., (Nam2013; Xia2016; Herbold2018; Zhou18; Hosseini2019; amasaki2020cross)).
Moreover, we used techniques freely available from a popular API library (i.e., Scikit-Learn) to mitigate any bias/error arising from ad-hoc implementations. We also used publicly available datasets previously used in the literature, which are of different nature and size, and have been carefully curated in previous work as explained in Section 2.2. We cannot claim the results will generalise to other software, despite the fact that we have analysed the use of the proposed approaches for 15 real-world software projects having different characteristics and in three different scenarios (hold-out, cross-release, cross-project). The only way to mitigate this threat is to replicate the present study on other datasets.
In order to facilitate future replications and extensions of this work, we will make the code and data publicly available upon acceptance of the paper.
6 Related Work
A great deal of research has been conducted to predict defects in software modules. This includes work that explores a wide range of two-class classifiers as potential solutions to identify the possibility of a module being defective. A survey of this work can be found elsewhere (hall2011systematic). However, only two studies investigate the use of one-class classifiers to predict software defects (chen2016software; ding2019novel).
The work by Chen et al. (chen2016software), replicated herein, is the only one using an ML classifier (i.e., OCSVM) which learns solely from non-defective training data. The work by Ding et al. (ding2019novel) uses Isolation Forest as its ML approach, which instead is given some amount of defective data, together with non-defective data, to build the prediction model. Ding et al. validate this approach on five NASA datasets (i.e., KC1, KC2, CM1, PC1, JM1) collected from the PROMISE repository (Sayyad-Shirabad+Menzies:2005), and they compare the proposed approach, using 10-fold Cross Validation, to three other ensemble learning methods (i.e., Bagging, Boosting, RF) where C4.5 is adopted as the meta-predictor of ensemble learning. Various ensemble sizes were tested (i.e., 100, 150, 200) and the performance of all techniques was evaluated using F-measure and AUC. Results showed that, on average over all datasets, Isolation Forest obtained higher F-measure and AUC values; however, no statistical test was performed to verify whether the difference observed in performance was statistically significant.
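The Isolation Forest idea can be sketched with Scikit-Learn as follows; the synthetic data and parameter choices are ours, not those of the original study, apart from the ensemble size, which matches one of the values tested by Ding et al.:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
# mostly non-defective modules, with a small share of defective ones
X_train = np.vstack([rng.normal(0, 1, size=(190, 4)),
                     rng.normal(4, 1, size=(10, 4))])

# ensemble of 100 random isolation trees; contamination sets the
# expected share of anomalies used to place the decision threshold
iso = IsolationForest(n_estimators=100, contamination=0.05,
                      random_state=0).fit(X_train)
scores = iso.decision_function(X_train)  # lower score = more anomalous
labels = iso.predict(X_train)            # -1 = anomaly, +1 = normal
print((labels == -1).sum())  # flags the 5% most easily isolated modules
```

Points that are isolated in few random splits (i.e., far from the bulk of the data) receive low scores and are flagged as anomalies.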
Other studies have tackled the problem of class-imbalance in defect prediction data (khoshgoftaar2010attribute; wang2013using) by treating it as an anomaly detection problem, where defective instances are considered anomalies (neela2017modeling; afric2020repd). Neela et al. (neela2017modeling) construct an anomaly detection approach which incorporates both univariate and multivariate Gaussian distributions to model non-defective software modules. The defective modules, treated as anomalies, are then identified based on their deviation from the generated model. The proposed method was tested on seven NASA datasets collected from the PROMISE repository (Sayyad-Shirabad+Menzies:2005), and accuracy, probability of detection, probability of false alarm and G-mean were used to evaluate its performance. Results showed that the existence of correlations between attributes plays a significant role in predicting defects using anomaly detection algorithms; as a result, the multivariate Gaussian distribution generally outperforms univariate Gaussian distribution methods.
Afric et al. (afric2020repd) present a Reconstruction Error Probability Distribution (REPD) model for within-project defect prediction, and assess its effectiveness on five NASA datasets (i.e., CM1, JM1, KC1, KC2, and PC1) obtained from the PROMISE repository (Sayyad-Shirabad+Menzies:2005). The approach is designed to handle both point and collective anomalies, and its construction includes training an autoencoder to reconstruct non-defective examples. It is compared to five ML models (i.e., Gaussian NB, LR, k-NN, decision tree, and Hybrid SMOTE-Ensemble). F-measure is used to evaluate the performance of the techniques, and results show that the proposed approach improves the results by up to 7.12%, leading the authors to conclude that defective instances can be viewed as anomalies and therefore tackled using an anomaly detection technique.
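The reconstruction-error idea behind such autoencoder approaches can be illustrated with a simple linear autoencoder; here we use scikit-learn's MLPRegressor as a stand-in, and the architecture, data and threshold are illustrative assumptions, not the REPD design of Afric et al.:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.RandomState(0)
# non-defective modules lying on a 3-dimensional subspace of an 8-metric space
z = rng.normal(0, 1, size=(300, 3))
W = rng.normal(0, 1, size=(3, 8))
X_clean = z @ W

# a linear autoencoder with a 3-unit bottleneck, trained to reproduce its input
ae = MLPRegressor(hidden_layer_sizes=(3,), activation='identity',
                  solver='lbfgs', max_iter=5000, random_state=0)
ae.fit(X_clean, X_clean)

def reconstruction_error(model, X):
    return ((model.predict(X) - X) ** 2).mean(axis=1)

# threshold: 95th percentile of the error on the (non-defective) training data
threshold = np.percentile(reconstruction_error(ae, X_clean), 95)
x_far = np.full((1, 8), 4.0)  # a module unlike anything seen in training
print(reconstruction_error(ae, x_far)[0] > threshold)
```

Inputs resembling the non-defective training data are reconstructed well; inputs off the learnt manifold yield large reconstruction errors and are flagged as potential defects.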
7 Lessons Learnt and Future Work
In this paper we have investigated the effectiveness of OCSVM for software defect prediction by carrying out a comprehensive empirical study involving the most commonly used machine learners and evaluation scenarios (i.e., within-project, cross-version, cross-project).
We summarise the main lessons learnt below:
OCSVM does not pass the sanity check for the within-project scenario (i.e., its estimates are significantly better than random guessing in only 17% of the cases) and is not as effective as SVM, tuned SVM and the other traditional two-class classifiers (i.e., NB, LR, RF) for this scenario.
OCSVM passes the sanity check for both the cross-version and cross-project scenarios (i.e., its estimates are always statistically significantly better than random guessing); it also provides significantly better estimates than SVM in all cases, and than tuned SVM in 11% of the cases for cross-version and 22% of the cases for cross-project. On the other hand, it is able to outperform, with statistically significant results, the traditional two-class classifiers (i.e., NB, LR, RF) in 37% and 48% of the cases in total for the cross-version and cross-project scenarios, respectively.
When we consider the tuned OCSVM, which makes use of a minimal number of defective instances for hyper-parameter tuning, the overall results improve for all scenarios, as follows:
The tuned OCSVM performs statistically significantly better than random guessing in all cases for the within-project scenario (a very good improvement compared to the 11% of the untuned OCSVM), passing the sanity check. Similarly, its performance against SVM, tuned SVM and the other traditional classifiers (i.e., NB, LR, RF) improves to 43% of the cases (compared to the 20% of the untuned OCSVM). Overall, we conclude that the tuned OCSVM is effective in less than half of the cases for within-project defect prediction.
The tuned OCSVM performs statistically significantly better than random guessing and SVM in all cases for both the cross-version and cross-project scenarios, whereas it is significantly better than tuned SVM in 55% of the cases for the cross-version scenario and 100% of the cases for the cross-project scenario. The tuned OCSVM also performs statistically significantly better than the traditional ML techniques in 56% of the cases for cross-version and 81% of the cases for cross-project.
The results, overall, suggest that neither OCSVM nor its tuned version is as effective as the traditional two-class classifiers for the within-project scenario. Therefore, their use cannot be recommended in this case, with the only exception being that OCSVM should be preferred to SVM when it is not feasible to tune the latter. On the other hand, we observe that while the untuned OCSVM also remains ineffective for cross-version and cross-project defect prediction, its tuned counterpart achieves statistically significantly better results than traditional approaches in 64% of the cases for cross-version and in 67% of the cases for cross-project.
Although our study reveals negative results for OCSVM (i.e., OCSVM is not able to consistently outperform the more traditional two-class classifiers, and RF is overall the best performing approach according to RQ3), we believe the results also shed light on another interesting aspect. In fact, the above findings suggest that when the training data is more heterogeneous, using OCSVM tuned with a minimal number of defective instances can improve the estimates with respect to using traditional approaches trained on all the available defective and non-defective instances. This points to recommendations for future work to explore the extent to which performing hyper-parameter tuning on different ratios of non-defective instances affects the prediction performance.
This research is supported by the ERC grant EPIC (under fund no. 741278) and the Lebanese American University (under fund no. SRDC-F-2018-107).