Software testing activities are crucial quality assurance activities in a software development process. However, these activities are known to be expensive. In a scenario with limited resources (e.g., financial, time, manpower), proper management of the available resources is necessary. In this context, software defect prediction models can be used to identify which parts of a software system are more defect-prone and should thus receive more attention.
The predictive power of a model is driven by the machine learning technique employed and also by the quality of the training data. Usually, defect prediction models are investigated in the literature using a within-project context that assumes the existence of previous defect data for a given project. This approach is called Within-Project Defect Prediction (WPDP). However, in practice, not all companies maintain clear historical defect data records or sufficient data from previous versions of a project. For example, a software project in its first release has no historical defect data. In such cases, an alternative approach is to assemble a training set composed of known external projects. This approach, named Cross-Project Defect Prediction (CPDP), has in recent years attracted the interest of researchers due to its broader applicability.
Although the CPDP principle is attractive, predictive performance has been shown to be a limiting factor, and poor predictive power is commonly credited to the heterogeneity present in the data employed. Zimmermann et al., for example, show that only 3.4% of 622 pairs of projects presented adequate predictive power when one project of the pair was predicted with a model trained on the other.
Other studies in the literature focused on solutions to improve the predictive performance of CPDP models. Herein we emphasize transfer learning solutions. Such an approach is characterized by the use of some knowledge about the target domain in order to approximate the different distributions of source and target data. Different strategies are explored in this context, such as the transformation of data [8, 9], the filtering of relevant examples or projects [10, 11, 12], or the different weighting of relevant training examples. In addition, most of these methods can be associated with different machine learning algorithms, composing variations with different predictive performances. Knowledge about which methods are typically best could help practitioners select a method to use. However, to the best of our knowledge, no comprehensive comparison between such transfer learning solutions has been published to date. Existing comparative analyses are restricted to a few methods [5, 12, 2] or do not provide a uniform experimental setup.
As the first contribution of this study, we provide an extensive experimental comparison of 31 CPDP methods applied to 47 versions of 15 open source software projects. The 31 methods are derived from the combination of six state-of-the-art transfer learning solutions with the five most frequently used classifiers in the defect prediction literature. We designed a uniform experimental setup to mitigate potential sources of bias that could accidentally lead to conclusion instability [13, 14, 15, 16]. In this experimental comparison, we investigated which methods perform better across datasets. Based upon the results obtained, we identified the four CPDP methods with the best overall performance. We also investigated whether these four methods performed better for the same datasets. The answer is helpful given the different software domains included in the investigated collection of datasets. Results showed that, even though these four methods performed best across datasets, the most suitable method varied according to the project (or dataset) being predicted.
Such a result is also found in the broader machine learning research area. The no free lunch theorem states that no machine learning algorithm is suitable for all domains. This statement led us to also investigate: “which is the most suitable solution for a given domain?”. This task is named algorithm recommendation [18, 19, 20], which is part of the general meta-learning literature [21, 22]. A meta-learning model is characterized by its capacity to learn from previous experiences and to adapt its bias dynamically according to the target domain. In the context of this study, a meta-learning model should be able to recommend a suitable CPDP method according to the characteristics of the software being predicted.
Therefore, the second main contribution of this paper is to propose and evaluate a meta-learning solution specifically designed for the CPDP context. The particularities of this context are carefully investigated and included in a general meta-learning architecture. Such particularities include the unsupervised characterization of datasets. Traditionally, such characterization is given by supervised meta-attributes, which assume the existence of information about the target attribute [25, 26]. However, in the CPDP context this information may not be available (e.g., a project in its first release). Another particularity of the proposed solution relates to the multi-label context. The results obtained as part of our first contribution showed that more than one CPDP method can achieve the best performance for a project, which characterizes a multi-label learning task. These particularities differentiate the proposed solution from previous studies, such as the meta-learning solutions proposed by Nunes das Dores et al. and Nucci et al., designed for the WPDP context.
We consider two distinct meta-attributes sets for the characterization of datasets. Both are evaluated in two different levels - meta- and base-level. In the meta-level, we evaluate the learning capacity of each meta-attributes set in relation to a baseline. In the base-level, we evaluate the general performance of the proposed solution in relation to the four best CPDP methods applied individually.
In general, we aim to answer the following research questions encompassing both contributions:
RQ1: Which CPDP methods perform better across datasets?
We managed to identify the four best-ranked CPDP methods across datasets in terms of AUC.
RQ2: Do the best CPDP methods perform better for the same datasets?
These four methods presented the best performance for distinct groups of datasets, evidencing that the most suitable method depends on the project being predicted. These results motivate the investigation of new resources to assist the task of deciding which methods are most suitable for a given project.
RQ3: To what extent can meta-learning help us to select the most suitable CPDP method for a given dataset?
RQ3.1: Does the meta-learner learn? (Meta-level)
The meta-learner performed better than the four evaluated CPDP methods in terms of the frequency of predicted best solutions. In other words, the meta-learner was able to predict the best solution for a larger number of project versions. This result indicates proper predictive power of the proposed solution at the meta-level.
RQ3.2: How does the meta-learner perform across datasets? (Base-level)
Even though the meta-learner presented the highest mean AUC, it did not differ significantly from the best-ranked base method.
The remainder of this paper is organized as follows. In Section 2 we present the necessary background and related work. In Section 3 we present an extensive comparison of performance for the state-of-the-art CPDP methods. In Section 4 we propose a meta-learning architecture applied for CPDP and evaluate its performance. In Section 5 we present the threats to validity. In Section 6 we present the general conclusions and comments on future work.
2 Background and Related Work
2.1 Software Defect Prediction
The success of a software defect prediction model depends on two main factors: 1) building an adequate training set; and 2) applying a suitable machine learning technique.
In a classification task, the training set can be represented by a table of elements (see Table I). In this table, each row represents an example (software part or module); the independent variables (or attributes) represent the characteristics of each example; and the dependent variable (class or target attribute) represents the binary class - defective or not defective.
Much effort has been spent in order to find effective independent variables, i.e., software properties able to provide relevant information for the learning task. Seminal studies were presented in Basili et al. , where the authors investigated the use of object-oriented metrics. Ostrand et al.  proposed the use of code metrics and historical defect information for defect prediction in industrial large-scale systems. Zimmermann and Nagappan  proposed the use of social network analysis metrics, extracted from the software dependency graph. Further, other metrics were also investigated, such as developer-related metrics , organizational metrics , process metrics , change related metrics , antipatterns related metrics , to name a few. Herein we use the code metrics set investigated in Jureczko  because such metrics set has been reported as effective for defect prediction models [37, 11, 38, 39], and can be automatically extracted from a project’s source code. This latter characteristic is important due to two factors: 1) no historical or additional data is required (e.g., process metrics or change-related metrics); and 2) any software project with available source code can be used as input for the prediction model.
There are also additional studies focusing upon the performance of classification algorithms applied to defect prediction. Lessmann et al. compared the performances of 22 classifiers over 10 public domain datasets from the NASA Metrics Data Repository (MDP). Their results show no statistically significant differences between the top 17 classifiers. Later, Ghotra et al. revisited Lessmann et al.’s work considering two different collections of data: the cleaned NASA MDP; and the open source PROMISE corpus (http://openscience.us/repo/). Contrary to the previous results, Ghotra et al. concluded that the classification technique used had a significant impact on the performance of defect prediction models. These contrasting results reinforce the conclusion instability (discussed in Section 2.3), as well as highlight the influence of noisy data upon results. In this study, we employed data from the latest version of the open source PROMISE data corpus, provided by Madeyski and Jureczko (presented in Section 3.1.1).
Finally, Malhotra presents a systematic literature review (SLR) on machine learning techniques for software defect prediction, based on 64 primary studies (from 1991 to 2013). This SLR identified the five most frequently used classifiers: Naive Bayes; Random Forest; Support Vector Machines; Multilayer Perceptron; and C4.5. A more detailed description of each classifier is provided in Appendix B. These five classifiers are evaluated herein, combined with transfer learning solutions, since their different algorithms can lead to different performance.
All the studies abovementioned employed the same methodology, called within-project defect prediction (WPDP) . In this methodology, the training set includes examples from the same software. Therefore, the prediction of a software defect is based on this software’s previous versions. However, an appropriate amount of historical defect data may not be available. An alternative is to compose the training set with external known projects, as discussed next.
2.2 Cross-project Defect Prediction
Within the context of software defect prediction, the dependent variable, i.e., what we aim to predict, also has an important role. In fact, learning is only possible when defect data is available. However, in practice, not all software companies maintain clear records of historical defect data or sufficient data from previous projects. In this case, the training set can be composed of external projects with known defect information. This approach is called Cross-project Defect Prediction (CPDP). The positive aspect of such an approach is that it tackles the lack of historical defect data; however, on the negative side, it introduces heterogeneity in the data, which may decrease the efficiency of defect prediction models.
In a recent systematic literature review, Gunarathna et al. highlighted the contributions of 46 primary studies on CPDP from 2002 to 2015. Part of these works focus on the feasibility of CPDP. Briand et al. conducted one of the earliest studies on this topic and verified that CPDP models outperform random models in the studied case. Zimmermann et al. conducted a large scale experiment in which they evaluated 622 pairs of cross-project models. They found that a model trained on one project of a pair was able to predict the other with good performance in only 3.4% of the pairs.
Other studies are focused on possible solutions to mitigate the heterogeneity of the data and to improve the learning capacity of CPDP models. Among all solutions, we highlight the transfer learning solutions. In this approach, the solutions use some knowledge about the target domain in order to approximate the different distributions of source and target data. (The context of this work fits in a special case of transfer learning, called domain adaptation. In this case, source (external projects) and target (project to be predicted) domains may be different while sharing the same learning task. In addition, the target domain may be unlabelled, e.g., the first release of a software project, with no historical defect data.)
At least three different strategies can be identified among the transfer learning solutions: transformation of data [8, 9, 42, 43]; filtering a subset of the training data [10, 44, 11, 12]; and weighting the training data according to the target data .
Watanabe et al. proposed to standardize the target data based on the source mean values. Camargo Cruz and Ochimizu put forward an approach similar to the one by Watanabe et al., but based on the median values and associated with a power transformation. Nam et al. suggested projecting both source and target data in a common attribute space by means of TCA (Transfer Component Analysis). Moreover, they investigated the impact of min/max and Z-score standardizations applied over source and target data. Zhang et al. offered a universal defect prediction model based on clusters created using the project context.
Turhan et al. suggested the use of the k-nearest neighbour (KNN) filter to select the most similar training examples in relation to the target examples. Jureczko and Madeyski recommended performing clustering in order to identify the most suitable training projects for a target project. Herbold put forward the use of the k-nearest neighbour filter to select the most similar projects (instead of individual examples). He et al. suggested the selection of the most similar projects but considering the separability of each project in comparison to the target data. Moreover, they also suggested the use of the separability strategy to filter unstable attributes. Lastly, they applied an ensemble approach over the selected projects with filtered attributes.
Ma et al. used the idea of data gravitation to prioritize and weight the training data for the Naive Bayes classifier based on its similarity to the target data.
To the best of our knowledge, no extensive comparison of transfer learning methods has been conducted in the CPDP context. Commonly, authors compared their new solutions either with the KNN filter (e.g., [5, 12, 2]) or with some baseline (e.g., [11, 42, 2]). The baseline is usually defined as the direct application of a machine learning technique, with no CPDP treatment.
In the systematic literature review presented in Gunarathna et al. , the authors compared the performance of CPDP methods based on the information provided by each published work. However, the considered studies diverge in terms of data, classifiers, and performance measures. The absence of a uniform experimental setup precludes concrete comparisons of performances between studies on a large scale. This issue is better discussed in the next section.
2.3 Conclusion Instability
Defect prediction models assist testers in prioritizing test resources. Yet, there is another important decision in this process: which model is the most suitable for a specific domain? The answer to this question is not straightforward for two main reasons. First, as stated in Shepperd et al., no single defect prediction technique dominates. This statement is supported by the no free lunch theorem, which says that there is no machine learning algorithm suitable for all domains. Second, conclusion instability is found in experimental software engineering: different studies on the same subject frequently produce conflicting conclusions. For example, Kitchenham et al. reviewed empirical studies applied to effort estimation in which local data are evaluated in relation to data imported from other organizations. Of the seven reviewed studies, three concluded that imported data are not worse than local data, while four concluded that imported data are significantly worse than local data.
Part of this conclusion instability is credited to the bias produced by the different research groups . However, in a recent study, Tantithamthavorn et al.  argue that this bias is actually due to the strong tendency of a research group to reuse experimental resources such as datasets and metrics families.
Menzies and Shepperd also credit the conclusion instability to the variance between similar experiments. This variance is related to the internal resources and procedures used in each experimentation. One of the suggestions proposed by Menzies and Shepperd to reduce the conclusion instability is the use of a uniform experimental setup. In the context of CPDP, several factors can influence the performance analysis, such as:
The independent variables set: some of the data repositories mentioned above provide datasets with different metrics sets. The metrics set is an important source of performance variability ;
Data preprocessing: different methods can be applied in the preprocessing step, such as normalization, standardization, attribute selection, or data transformation. These methods modify the value range, distribution and scale of data, directly affecting the CPDP model performance [47, 42]; and
Performance measure: some traditional metrics (e.g., precision, recall, f-measure) are sensitive to the threshold that separates a defective from a non-defective example. Other metrics (e.g., AUC), however, are independent of this threshold. Furthermore, different metrics focus on different aspects of the performance. Thus, a proper performance comparison demands the use of equivalent measures.
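The difference between threshold-sensitive measures and AUC can be illustrated with a minimal sketch. The scores and labels below are made-up values, and the function names are hypothetical; AUC is computed here through the rank-based (Mann-Whitney) formulation, which is one of several equivalent ways to obtain it.

```python
def auc(scores, labels):
    """Area under the ROC curve: the fraction of (defective, non-defective)
    pairs in which the defective example receives the higher score."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Correctly ordered pairs count 1, ties count 0.5, inverted pairs count 0.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_recall(scores, labels, threshold):
    """Precision and recall after cutting the scores at a given threshold."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Illustrative data only: predicted defect-proneness and true class (1 = defective).
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0]

print(auc(scores, labels))                      # one value, no threshold involved
print(precision_recall(scores, labels, 0.50))   # changes with the threshold
print(precision_recall(scores, labels, 0.35))
```

The same model output yields different precision and recall at the two thresholds, while AUC is a single value for the whole ranking.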
Considering each of the bias factors mentioned above, in Section 3 we define a uniform experimental setup for the performance comparison of 31 CPDP methods derived from the combination of six state-of-the-art transfer learning solutions with the five most popular classifiers for defect prediction. (In this work, the term CPDP method refers to the entire process, transfer learning solution + classifier, which leads to the prediction model.)
In order to mitigate conclusion instability, Menzies and Shepperd  also suggest the investigation of approaches able to learn the properties of a particular domain and, as a result, suggest a suitable learner. This suggestion is strongly related to the meta-learning architecture proposed and evaluated in Section 4. We discuss the background of this proposal in the next section.
2.4 Meta-learning for Algorithm Recommendation
Considering the no free lunch theorem and the concept of conclusion instability discussed above, automated resources to assist practitioners in the difficult task of choosing a suitable model for a specific domain are desirable. This task is related to the algorithm recommendation task investigated in the meta-learning literature. The learning capacity of a traditional model (here identified as base-learner) is limited by its inductive bias. A meta-learning model, however, is characterized by its capacity to learn from previous experiences and to adapt its inductive bias dynamically according to the target domain. In other words, a meta-learning model is designed to learn under which conditions a given solution is most suitable.
The main challenge in the algorithm recommendation task is to discover the relationship between measurable features of the problem and the performance of different solutions . In the literature, this task is approached as a typical classification problem, although at a meta-level [23, 50, 22].
Similar to the traditional learning task, the success of a meta-model depends on two main factors: 1) building an accurate meta-data; and 2) applying a suitable machine learning algorithm to compose the meta-learner. The meta-data is represented by a table of meta-examples. Each meta-example represents a dataset (i.e., a previous learning task experience); the independent variables (or meta-attributes) characterize each dataset; and the dependent variable (or meta-target) represents the goal of the meta-learning task.
Different goals have been explored in the literature for both classification and regression tasks. In classification tasks, the meta-target is commonly associated to a label representing the solution with highest performance (or lower cost). Examples of meta-learning classification tasks are: to predict a suitable base-learner [19, 28]; to predict a suitable attribute selection method ; to predict if the tuning of hyper-parameters for SVMs is beneficial or not ; among others . In regression tasks, each meta-target represents the continuous value of performance (or cost) obtained by a specific solution. In this context, the meta-learning task is commonly designed to produce a rank of possible solutions . The recommended order can be useful either to guide the decision task  or to compose ensemble solutions .
Traditional meta-attributes are commonly grouped into five categories: general attributes, obtained directly from the properties of the dataset (e.g., number of examples, number of attributes); statistical attributes, obtained from statistical measures (e.g., correlation between attributes); information-theoretic attributes, typically obtained from entropy measures (e.g., class entropy, attribute entropy); model-based attributes, extracted from internal properties of an applied model (e.g., the width and height of a decision tree model); and landmarking attributes, obtained from the resulting performance of simple classifiers (e.g., the accuracy obtained from a 10-fold cross-validation with 1-nearest neighbour classifier).
In the context of this study, for example, all datasets present the same set of continuous attributes (see Section 3.1.1). In addition, the target data may be unlabelled, as discussed in Section 2.2. These two characteristics make most of the traditional meta-attributes unsuitable to use, since many of them are related to discrete attributes, to the dimensionality of the dataset, or are label dependent - including information-theoretic, model-based and landmarking attributes.
A viable alternative is to characterize the datasets based on unsupervised meta-attributes (non-dependent on the class attribute). However, this alternative is little explored in the algorithm recommendation literature. Although some of the traditional meta-attributes are not dependent on the class attribute, they are commonly used together with label dependent meta-attributes. In a recent study, dos Santos and de Carvalho proposed a set composed of 53 meta-attributes applied in the context of active learning, including only unsupervised attributes such as general attributes, statistical attributes and clustering based attributes. Another related approach is to represent a dataset based on its distributional characteristics (e.g., mean, maximum and standard deviation of all attributes). Although this latter approach is not commonly applied for algorithm recommendation tasks, it is already applied in the context of CPDP with different purposes [53, 12, 11]. For example, Herbold uses the distributional characteristics of datasets to filter the most similar projects to compose the final training set. These two unsupervised approaches are explored and evaluated in this study. A more detailed description is provided in Section 4.1.2.
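As an illustration of the distributional characterization mentioned above, the following minimal sketch summarizes an unlabelled dataset by the mean, maximum and standard deviation of each attribute. The function name, the key naming scheme and the example values are assumptions made for illustration; the actual meta-attribute sets used in this study are the ones described in Section 4.1.2.

```python
import statistics


def distributional_meta_attributes(dataset):
    """Characterize a dataset without using class labels.

    dataset: list of rows, each row a list of numeric attribute values.
    Returns a flat dictionary with three summary values per attribute.
    """
    meta = {}
    for j, column in enumerate(zip(*dataset)):   # iterate over attributes
        meta[f"attr{j}_mean"] = statistics.mean(column)
        meta[f"attr{j}_max"] = max(column)
        meta[f"attr{j}_sd"] = statistics.stdev(column)  # sample standard deviation
    return meta


# Illustrative values only: three examples described by two code metrics.
example_dataset = [[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]]
meta = distributional_meta_attributes(example_dataset)
print(meta)
```

Because no label is needed, the same characterization can be computed for a target project in its first release.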
In principle, any machine learning algorithm can be used at the meta-level. However, a common aspect of meta-learning tasks is the scarce set of training data. This issue is mitigated in the literature with lazy learning methods, such as the k-nearest neighbour classifier [21, 22], although other types of models have been successfully applied (e.g., neural networks, random forest, ensembles) [19, 20].
In Nunes das Dores et al., the authors proposed a meta-learning framework for algorithm recommendation in the context of WPDP. Their meta-learning solution includes a meta-data composed of seven classifiers applied over 71 distinct datasets. The datasets are characterized with traditional meta-attributes, as described above. They evaluated two meta-learners: a Random Forest model; and a majority voting ensemble, including all seven classifiers. Their experiments reveal a better performance obtained with the meta-learning solution across datasets in relation to each of the seven classifiers applied individually. However, no statistical analysis was reported. In Nucci et al., the authors proposed a different approach for the dynamic selection of classifiers in the context of WPDP. Their solution differs from Nunes das Dores et al. for two main reasons: 1) their solution considers a different set of classifiers applied over 30 distinct datasets; and 2) their meta-data is constructed at a different level of granularity, i.e., the meta-examples do not represent software project versions (or datasets) but single Java classes. In this way, the characterization of meta-examples is based on the same independent variables used at the base-level. The meta-learner considered in their work is also based on the Random Forest algorithm. Their results are positive since the proposed solution outperformed the evaluated stand-alone classifiers and a voting ensemble technique.
In Section 4, we propose and evaluate a different meta-learning solution designed specifically for the CPDP context. The particularities of this context lead to new issues, carefully investigated in this study. Some of such issues relate to the unsupervised meta-attributes set and the multi-label context (see Section 2.5). Furthermore, a series of other factors distinguish the proposed solution from previous meta-models, including the collection of datasets, the validation process, the performance measure, the attribute selection procedure, and the statistical analysis.
2.5 Multi-label Learning
The meta-learning solution proposed in this study also configures a multi-label learning task since each meta-example can be associated to more than one suitable CPDP method. In this section we present the main concepts related to this particular domain.
In the traditional single-label classification task, as described in Section 2.1, each example of a dataset is associated with a single label from a set of disjoint labels. The learning task is called binary classification when the label set contains exactly two labels and multi-class classification when it contains more than two. In a multi-label classification task, each example is associated with a set of labels.
Each training example consists of a real-valued example vector and a label vector with one binary entry per label. Each entry of the label vector indicates whether the corresponding label is relevant or irrelevant for that example, and the relevant labels of an example form its set of relevant labels. A training set is composed of the example matrix, which stacks the example vectors of all training examples, and the label matrix, which stacks their label vectors (see Table II).
Given a multi-label training set, the learning task can have two goals: classification or ranking. A multi-label classification model maps an example to a set of predicted labels. A multi-label ranking model maps an example to one real value per label, where each predicted value can be regarded as the confidence of relevance of that label. The ranking is given by ordering the labels by their confidence of relevance. A ranking model can also be used to define a classifier that assigns a single label to an example: the predicted single label is the one with the top ranked confidence of relevance.
Several methods have been proposed in the literature of multi-label learning . The existing methods can be grouped into two main categories: 1) problem transformation, in which the multi-label learning task is transformed into one or more single-label learning tasks; and 2) algorithm adaptation, in which specific learning algorithms are adapted to handle multi-label problems directly.
In this study we highlight a problem transformation method called Binary Relevance (BR). In this method, the multi-label problem is transformed into a set of binary classification problems, one for each label. A separate dataset is created for each label, containing all examples of the original dataset. Each example in this dataset is labelled positively if the corresponding label is relevant for it, and negatively otherwise.
In this study, this method is appropriate due to three main reasons: 1) BR presents a competitive performance with respect to more complex methods ; 2) this is a popular, simple and intuitive method ; and 3) after data transformation, any binary learning method can be taken as base learner . This characteristic enables us to construct the multi-label model discussed in Section 4.1.5.
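The Binary Relevance transformation, together with the ranking-based single-label prediction described in this section, can be sketched as follows. This is a minimal illustration under stated assumptions: the nearest-centroid base learner is a toy stand-in chosen only to keep the example self-contained (any binary learner that outputs a confidence could be used), relevance is encoded as 1 and irrelevance as 0, and all names and data are hypothetical.

```python
import math


class NearestCentroid:
    """Toy binary base learner (assumption for illustration only)."""

    def fit(self, X, y):
        self.centroids = {}
        for cls in (0, 1):
            rows = [x for x, label in zip(X, y) if label == cls]
            self.centroids[cls] = (
                [sum(col) / len(rows) for col in zip(*rows)] if rows else None
            )
        return self

    def confidence(self, x):
        """Confidence in [0, 1] that the label is relevant for x."""
        if self.centroids[1] is None:
            return 0.0
        if self.centroids[0] is None:
            return 1.0
        d_pos = math.dist(x, self.centroids[1])
        d_neg = math.dist(x, self.centroids[0])
        total = d_pos + d_neg
        return 0.5 if total == 0 else d_neg / total


def fit_binary_relevance(X, Y, make_learner=NearestCentroid):
    """Train one binary model per label; Y holds one 0/1 label vector per example."""
    n_labels = len(Y[0])
    return [make_learner().fit(X, [row[j] for row in Y]) for j in range(n_labels)]


def predict_confidences(models, x):
    """Ranking view: one confidence of relevance per label."""
    return [m.confidence(x) for m in models]


def predict_labels(models, x, threshold=0.5):
    """Classification view: the set of labels predicted as relevant."""
    return [1 if c > threshold else 0 for c in predict_confidences(models, x)]


def predict_single_label(models, x):
    """Single-label view: index of the top ranked label."""
    conf = predict_confidences(models, x)
    return max(range(len(conf)), key=conf.__getitem__)


# Illustrative data: four examples, two labels.
X = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
Y = [[1, 0], [1, 1], [0, 1], [0, 1]]
models = fit_binary_relevance(X, Y)
print(predict_labels(models, [0.5, 0.5]))
print(predict_single_label(models, [5.5, 5.0]))
```

Swapping `NearestCentroid` for another binary learner leaves the rest of the sketch unchanged, which is the property of BR highlighted in reason 3 above.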
3 Performance Evaluation of CPDP Methods
In this section we evaluate the performance of CPDP methods including six state-of-the-art transfer learning solutions and their variations combined with the five most popular classification learners in the literature of defect prediction. We provide a uniform experimental setup based on 47 versions of 15 open source software projects.
The contribution of this experimental analysis is twofold: 1) it provides a comprehensive evaluation of performance for the state-of-the-art CPDP methods, covering different classification algorithms and transfer learning methods; and 2) the results show that, despite the fact that a group of methods generally present better predictive performance than other methods across datasets, the most suitable method can vary according to the project being predicted - accrediting the meta-learning solution proposed in Section 4.
Below, we discuss the experimental setup and issues involved in this experimentation. Next, we present and discuss the obtained results. All the experiments and data analysis were implemented and conducted with R (https://www.r-project.org/).
3.1 Experimental Setup
3.1.1 Software Projects
All the conducted experiments are based on 47 versions of 15 Java open source projects, provided by Madeyski and Jureczko. The authors provide a link (http://purl.org/MarianJureczko/MetricsRepo/) with detailed information about the software projects and the construction of each dataset. Each example in a dataset represents a Java Object-Oriented class (OO class). The dependent variable corresponds to the number of defects found for each OO class. We converted the dependent variable to a binary classification problem (an OO class is labelled as defective when at least one defect was found for it, and as not defective otherwise). Table III lists the number of examples, number of defective examples, and defect rate for each of the analysed datasets.
| Project | # Examples | # Defective Examples | % Defective Examples |
The set of independent variables is composed of 20 code metrics and includes complexity metrics, C&K (Chidamber and Kemerer) metrics, and structural code metrics. This metrics set is briefly described in Appendix A. These metrics have been reported in the literature as good quality indicators, as discussed in Jureczko and Madeyski. In addition, they are numeric and can be automatically extracted directly from the source code with the Ckjm tool (http://gromit.iiar.pwr.wroc.pl/p_inf/ckjm). Further details can be found in Jureczko and Madeyski.
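The conversion of the defect counts into a binary class, and the per-dataset summary reported in Table III, can be sketched as below. The threshold (at least one defect means defective), the function names and the example counts are assumptions for illustration and are not taken from the original datasets.

```python
def binarize_defects(defect_counts):
    """Map the number of defects of each OO class to a binary class.

    Assumption: a class is defective (1) when at least one defect was found.
    """
    return [1 if count > 0 else 0 for count in defect_counts]


def defect_summary(defect_counts):
    """Number of examples, number of defective examples and defect rate (%)."""
    labels = binarize_defects(defect_counts)
    n_examples = len(labels)
    n_defective = sum(labels)
    return {
        "examples": n_examples,
        "defective": n_defective,
        "defect_rate_pct": 100.0 * n_defective / n_examples,
    }


# Illustrative defect counts for five OO classes of a hypothetical project version.
counts = [0, 2, 0, 1, 0]
print(binarize_defects(counts))
print(defect_summary(counts))
```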
3.1.2 Data Preprocessing
Data transformation is a resource used to re-express data in a new scale, aiming to spread skewed curves equally among the batches [9, 10]. We applied a log transformation, in which each numeric value is replaced by its logarithm. Since some metrics present zero values, whose logarithm is undefined (it tends to negative infinity), we applied a simple solution as follows: x = log(x + 1).
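The transformation can be illustrated with a minimal Python sketch. The natural logarithm is an assumption, since the text does not state the base, and the function name and metric values are hypothetical.

```python
import math

def log_transform(values):
    """Replace each metric value x by log(x + 1); zero stays zero."""
    # math.log1p(x) computes the natural logarithm of (1 + x).
    return [math.log1p(x) for x in values]

# Hypothetical values of one code metric (e.g., a complexity count).
metric = [0, 1, 9, 99]
transformed = log_transform(metric)
```

A zero-valued metric maps to zero, so no infinite values appear in the transformed data.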
3.1.3 CPDP Methods
In order to carry out this experiment we selected six state-of-the-art transfer learning solutions based on the benchmark proposed in Herbold. We re-implemented each solution in R following the instructions of the respective original publication. We refer to each solution with the abbreviation [year][first author], as described below:
2008Watanabe: the authors proposed a simple compensation method based on the mean value. Let $x_{ij}$ be the $j$th attribute of the test example $x_i$, and $\bar{s}_j$ and $\bar{t}_j$ be the mean values of the $j$th attribute of the training and test sets, respectively. The compensated test metric value $x'_{ij}$ is given by:
$$x'_{ij} = \frac{x_{ij} \cdot \bar{s}_j}{\bar{t}_j}$$
The base-learner is then applied over the transformed data.
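A minimal Python sketch of this mean-based compensation follows. It assumes that each test value is multiplied by the training mean and divided by the test mean of the same attribute. The function name, the handling of a zero test mean, and the data are hypothetical.

```python
def compensate(test_rows, train_rows):
    """Rescale test attributes so that their means match the training means."""
    n_attr = len(test_rows[0])
    train_mean = [sum(r[j] for r in train_rows) / len(train_rows) for j in range(n_attr)]
    test_mean = [sum(r[j] for r in test_rows) / len(test_rows) for j in range(n_attr)]
    compensated = []
    for row in test_rows:
        new_row = []
        for j in range(n_attr):
            if test_mean[j] == 0:
                # Assumption: leave the value untouched when the test mean is zero.
                new_row.append(row[j])
            else:
                new_row.append(row[j] * train_mean[j] / test_mean[j])
        compensated.append(new_row)
    return compensated

train = [[2.0, 10.0], [4.0, 30.0]]   # training means: 3 and 20
test = [[1.0, 5.0], [3.0, 15.0]]     # test means: 2 and 10
result = compensate(test, train)
```

After the compensation, the mean of each test attribute equals the corresponding training mean.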
2009Turhan: also known as KNN filter or Burak filter, this solution applies the k-nearest neighbour algorithm to select the most similar examples based on the Euclidean distance. For each test example, the solution searches for the $k$ most similar training examples (from all available projects). The selected examples compose the filtered training set. Duplicated examples are discarded. As in the original proposal, we set $k = 10$.
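The filter can be sketched in a few lines of Python. This is an illustrative brute-force version, not the re-implementation used in the study. The function name and toy data are hypothetical, and the default of 10 neighbours follows the original Burak filter proposal.

```python
import math

def knn_filter(train, test, k=10):
    """Keep the k nearest training examples of every test example, without duplicates."""
    selected = set()
    for t in test:
        # Indices of the training examples, ordered by Euclidean distance to t.
        by_distance = sorted(range(len(train)), key=lambda i: math.dist(train[i], t))
        selected.update(by_distance[:k])
    # The set discards examples selected by more than one test example.
    return [train[i] for i in sorted(selected)]

train = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (10.0, 10.0)]
test = [(0.0, 0.0), (1.0, 0.0)]
filtered = knn_filter(train, test, k=2)
```

Both test examples select the same two neighbours here, so the filtered training set contains two examples rather than four.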
2013He: this solution is composed of three steps. First, the project datasets most similar to the test set are selected to compose the training data. Let $S_i$ be the dataset of project $i$ and $T$ the test set. In this step, a new dataset is constructed containing examples from $S_i$ and examples from $T$. The dependent variable of this new dataset is a binary label differentiating training examples from test examples. The similarity between $S_i$ and $T$ is calculated based on the accuracy (rate of correct predictions) with which a classifier is able to differentiate (or separate) between training and test examples in this new dataset: the lower the accuracy, the more similar the two datasets. In this step we applied a Logistic Regression as the separability classifier and kept the parameter setting suggested by the authors.
Second, for each of the project datasets selected in the previous step, a predefined percentage of the most unstable attributes is removed. The dataset constructed in the first step is used to measure the information gain of each attribute in relation to its dependent variable (training versus test). The attributes with higher information gains are considered more unstable and are removed. It is important to note that each of the selected projects (step 1) generates a new training set with, possibly, a distinct attribute subset (step 2). We set this percentage as suggested by the authors.
Finally, the authors apply an ensemble strategy with majority voting to combine the predictions given by the selected training sets (obtained from steps 1 and 2). For each selected training set, a model is constructed using a classifier. Then, for each test example, a prediction score is generated for each constructed model. In the original proposal, a threshold is applied to each prediction score and the majority voting defines the final binary prediction.
In our implementation, since we are working with the AUC measure (obtained from the prediction score), we considered the average prediction score obtained from the ensemble strategy instead of the majority voting.
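The difference between the two combination strategies can be sketched as follows. The threshold of 0.5 and the prediction scores are hypothetical values chosen for illustration, and the function names are not taken from the original work.

```python
def majority_vote(scores, threshold=0.5):
    """Binary prediction: 1 when more than half of the models exceed the threshold."""
    votes = sum(1 for s in scores if s >= threshold)
    return 1 if votes > len(scores) / 2 else 0

def average_score(scores):
    """Continuous prediction score, which can be ranked to compute the AUC."""
    return sum(scores) / len(scores)

# Hypothetical scores given by three models to the same test example.
scores = [0.9, 0.4, 0.45]
vote = majority_vote(scores)
mean_score = average_score(scores)
```

The vote discards how confident each model was, while the averaged score keeps that information, which is what a threshold-free measure such as the AUC needs.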
2013Herbold : in the original work, the authors proposed and evaluated two distinct solutions. We chose to implement and evaluate in this study only the solution reported with better performance. As in the KNN filter, this solution is also based on the k-nearest neighbour algorithm. However, instead of filtering individual examples, this solution filters the most similar projects. Each project is represented by its distributional characteristics, given by the mean and standard deviation of each attribute. The authors suggest a neighbourhood size of from the available projects to compose the training data.
2012Ma: also known as Transfer Naive Bayes, this solution introduces the concept of data gravitation to prioritize and weight training examples. The weights of the training examples are inversely proportional to their distances from the test set. Given the indicator $h(a_{ij})$:
$$h(a_{ij}) = \begin{cases} 1, & \text{if } min_j \leq a_{ij} \leq max_j \\ 0, & \text{otherwise} \end{cases}$$
where $a_{ij}$ is the $j$th attribute of the training example $x_i$, and $min_j$/$max_j$ are the minimum/maximum values of the $j$th attribute across all test examples. The function $h$ indicates whether $a_{ij}$ is within the range of values of the test examples. The similarity between a training example $x_i$ and the test set is given by the number of its attributes that fall within this range:
$$s_i = \sum_{j=1}^{k} h(a_{ij})$$
where $k$ is the number of attributes. Then, the weighting measure is given by:
$$w_i = \frac{s_i}{(k - s_i + 1)^2}$$
According to this formula, higher weights are given to the more similar training examples. A weighted Naive Bayes classifier is then constructed according to the weights given by $w_i$.
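The weighting step can be sketched in Python as below. It assumes the formulation of Transfer Naive Bayes in which the number of in-range attributes s is turned into the weight s / (k - s + 1)^2, with k the number of attributes. The function name and data are hypothetical, and the weighted Naive Bayes classifier itself is not shown.

```python
def tnb_weights(train, test):
    """Weight each training example by how many of its attributes fall in the test range."""
    k = len(train[0])
    lo = [min(row[j] for row in test) for j in range(k)]
    hi = [max(row[j] for row in test) for j in range(k)]
    weights = []
    for row in train:
        # s counts the attributes within [min_j, max_j] of the test set.
        s = sum(1 for j in range(k) if lo[j] <= row[j] <= hi[j])
        weights.append(s / (k - s + 1) ** 2)
    return weights

test = [(0, 0, 0), (10, 10, 10)]
train = [(5, 5, 5), (5, 5, 20), (20, 20, 20)]
weights = tnb_weights(train, test)
```

An example with all three attributes in range gets the maximum weight, while an example entirely outside the test range gets a weight of zero.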
The solutions 2008Watanabe, 2009Turhan, 2009Cruz, 2013He and 2013Herbold are not dependent on a specific classifier. The solution 2012Ma, however, is intrinsically associated with the Naive Bayes algorithm. Therefore, in order to provide a fair comparison, we combined each of the independent solutions with the five most popular classifiers in the defect prediction literature: Random Forest (RF); Support Vector Machines (SVM); Multilayer Perceptron (MLP); C4.5 (C45); and Naive Bayes (NB). In Appendix B, we briefly describe each classifier. Detailed information can be found in the specialized machine learning literature [46, 22].
For all five classifiers we used the implementation and original parameters provided in the RWeka package (https://CRAN.R-project.org/package=RWeka). As a baseline, we also evaluate the performance of each classifier in its original form (Orig), with no transfer learning solution. In total, we evaluate 31 methods, as illustrated in Figure 1.
3.1.4 Performance Measure
We evaluate the predictive performance by means of the AUC (Area Under the Receiver Operating Characteristic Curve) measure. We chose this performance measure for three main reasons. First, the AUC measure provides a single scalar value balancing the influence of True Positive Rate (TPR) (benefit) and False Positive Rate (FPR) (cost). Further, the use of a single value facilitates the comparison across models. Second, the AUC measure is not sensitive to the thresholds used as cutoff parameters. Traditional measures, such as Precision and Recall, demand a cutoff probability configuration (or prediction threshold) that defines when an example is classified as defective or not-defective. Commonly, this parameter is configured to a probability of 0.5. However, this arbitrary configuration can introduce bias in the performance analysis. Lastly, the AUC measure is also robust to class imbalance, which is frequently present in software defect prediction datasets. The datasets used in this study also present class imbalance, as shown in Table III. An AUC value of 1 means a perfect classifier and a value of 0.5 means a random classifier.
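The threshold-free nature of the AUC can be seen in a small Python sketch that uses its pairwise interpretation: the probability that a randomly chosen defective example receives a higher score than a randomly chosen non-defective one. The function name and scores are hypothetical, and the sketch assumes that both classes are present.

```python
def auc(labels, scores):
    """AUC as the fraction of (defective, clean) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

perfect = auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
constant = auc([1, 1, 0, 0], [0.5, 0.5, 0.5, 0.5])
partial = auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1])
```

No cutoff appears anywhere in the computation: only the ordering of the prediction scores matters.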
3.1.5 Experimental Design
The experimental design is shown in Figure 2. We aim to evaluate the 31 CPDP methods applied to all the 47 project versions. For this, we follow a variation of the leave-one-out cross-validation procedure , here called Cross-project Leave-One-Out (CPLOO). In this variation, the training set contains all the available project versions except the versions of the same project of the current test dataset, as described below.
First, we joined all datasets in one unique Cross-Project Dataset ($CPD$). Then, we conducted a many-to-one cross-project analysis. Consider a project $P$, the set $V_P$ of all versions of $P$, and a specific version $v \in V_P$. For each $P$ and each $v \in V_P$, we used $v$ as test and $CPD \setminus V_P$ ($V_P$ subtracted from $CPD$) as training. In this way, we can analyse the predictive performance disregarding any bias from the different versions of the same project. This variation is important considering our experimentation context: to predict defects in a project based on defect patterns from external projects. This approach is also named strict CPDP in the literature.
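The CPLOO splitting rule can be sketched as a Python generator. The representation of a version as a (project, version) pair is an assumption, and the version names below are illustrative only.

```python
def cploo_splits(versions):
    """Yield (test_version, training_versions) pairs for strict CPDP.

    Every version is tested once; all versions of its own project are
    excluded from the corresponding training set.
    """
    for test in versions:
        project = test[0]
        train = [v for v in versions if v[0] != project]
        yield test, train

# Illustrative (project, version) pairs.
versions = [("ant", "1.3"), ("ant", "1.4"), ("jedit", "4.2"), ("poi", "3.0")]
splits = list(cploo_splits(versions))
```

When a version of one project is the test set, the other versions of that project are left out of training, unlike in an ordinary leave-one-out procedure.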
3.1.6 Statistical Test
The statistical tests used for statistical analysis were the Friedman test followed by Fisher’s LSD test . The use of Friedman test followed by a Post-hoc analysis is recommended for the comparison of multiple solutions over multiple datasets .
The performances are organized as a table of rankings in which the rows represent the datasets (project versions), the columns represent the compared CPDP methods, and each cell is filled with the respective performance ranking position of a method for a dataset. In this configuration, tied performances share the average ranking position. For example, the following AUC performances , , , , , are ranked as , , , , , .
As mentioned in Larson , the number of significant digits is justified by the specific research purpose, the measurement accuracy and sample size. In this statistical test, we rounded the AUC performances to only 2 significant digits in order to not differentiate very similar performances.
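The ranking with shared average positions can be sketched as follows. The AUC values are hypothetical, and rounding to two decimal places is used as a stand-in for two significant digits, which is equivalent for AUC values between 0.1 and 1.

```python
def rank_descending(aucs):
    """Rank AUC values (1 = best) after rounding; tied values share the average rank."""
    rounded = [round(a, 2) for a in aucs]
    order = sorted(range(len(rounded)), key=lambda i: -rounded[i])
    ranks = [0.0] * len(rounded)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and rounded[order[end + 1]] == rounded[order[pos]]:
            end += 1
        average = ((pos + 1) + (end + 1)) / 2
        for i in order[pos:end + 1]:
            ranks[i] = average
        pos = end + 1
    return ranks

ranks = rank_descending([0.80, 0.75, 0.75, 0.60])
close = rank_descending([0.771, 0.766])
```

In the second call, two performances that differ only in the third digit are rounded to the same value and therefore share the same rank.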
The Friedman test verifies whether the ranking performances of all methods are statistically equivalent. When the null hypothesis is rejected, a post-hoc analysis is applied to identify the solutions with a significant difference in performance. Fisher’s LSD test is indicated as the most powerful method in this scenario according to a recent study. In this test, we considered the confidence of (i.e., ).
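As an illustration of how the table of rankings feeds the test, the sketch below computes the classical Friedman chi-square statistic in Python. It omits the correction for ties and the p-value, which statistical packages provide, and the rank table is hypothetical.

```python
def friedman_statistic(rank_table):
    """Friedman chi-square for n datasets (rows) and k methods (columns), no tie correction."""
    n = len(rank_table)
    k = len(rank_table[0])
    rank_sums = [sum(row[j] for row in rank_table) for j in range(k)]
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)

# Three datasets on which three methods are always ranked in the same order.
consistent = friedman_statistic([[1, 2, 3], [1, 2, 3], [1, 2, 3]])
# Three datasets on which all methods tie (every method gets the average rank 2).
all_tied = friedman_statistic([[2, 2, 2], [2, 2, 2], [2, 2, 2]])
```

The statistic grows as the methods' rank sums diverge and is zero when all methods have the same rank sum.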
RQ1: Which CPDP methods perform better across datasets?
Two main factors are compared in this performance analysis. First, we extract the AUC performance for each of the 31 CPDP methods applied to all the 47 project versions (or datasets). Then, we extract the performance ranking by ordering the AUC performances of the evaluated methods for each dataset. Table IV presents the performance mean (and standard deviation) for each CPDP method. The methods are ordered according to the mean rank.
First, we can observe a generally better performance (across datasets) of CPDP methods based on the NB learner, including orig_nb. All 7 methods combined with NB performed among the top 10 positions. In contrast, the methods based on the C45 classifier performed among the last positions, except for 2013He_c45 (position 17). The other classifiers are spread along mixed positions with no clear pattern. Although the first positions in this table present close AUC mean values, the distance between the best and worst AUC mean performances is clear. The method in the last position performed close to a random classifier (when AUC = 0.5). In this study, we assume, for practical use, that an AUC $\geq 0.75$ represents a successful model. Although the success of a predictive model is relative to the application domain, it is commonly set to a performance greater than 75% in the CPDP literature [6, 59, 60].
|Pos.||Method||Mean Rank||Mean AUC|
|1||2012Ma_nb||7.57 (5.63)||0.771 (0.10)|
|2||2013He_rf||8.26 (6.02)||0.766 (0.09)|
|3||2009Turhan_nb||9.48 (6.12)||0.761 (0.09)|
|4||2013He_svm||9.90 (6.15)||0.765 (0.09)|
|5||2013Herbold_nb||10.00 (5.86)||0.760 (0.09)|
|6||2009Cruz_nb||10.38 (6.03)||0.757 (0.09)|
|7||orig_nb||10.52 (5.62)||0.758 (0.09)|
|8||2013He_nb||11.47 (5.88)||0.756 (0.09)|
|9||2008Watanabe_nb||11.55 (6.83)||0.752 (0.10)|
|10||orig_rf||11.74 (6.00)||0.756 (0.09)|
|11||2008Watanabe_rf||12.04 (6.68)||0.753 (0.10)|
|12||2013He_mlp||12.82 (8.10)||0.753 (0.09)|
|13||2013Herbold_rf||12.84 (7.10)||0.748 (0.08)|
|14||2009Turhan_rf||13.03 (7.56)||0.745 (0.10)|
|15||orig_mlp||14.86 (7.37)||0.739 (0.09)|
|16||2009Cruz_rf||14.97 (5.92)||0.737 (0.10)|
|17||2013He_c45||15.73 (8.27)||0.734 (0.09)|
|18||2013Herbold_mlp||17.00 (6.96)||0.729 (0.08)|
|19||2009Turhan_svm||17.72 (9.38)||0.702 (0.13)|
|20||2009Cruz_svm||18.17 (7.00)||0.717 (0.10)|
|21||orig_svm||18.17 (7.00)||0.717 (0.10)|
|22||2008Watanabe_mlp||18.78 (6.99)||0.714 (0.09)|
|23||2008Watanabe_svm||19.14 (7.05)||0.706 (0.11)|
|24||2013Herbold_svm||19.26 (9.58)||0.680 (0.13)|
|25||2009Cruz_mlp||21.10 (6.92)||0.693 (0.10)|
|26||2009Turhan_mlp||21.31 (8.26)||0.688 (0.10)|
|27||orig_c45||24.87 (6.88)||0.639 (0.09)|
|28||2013Herbold_c45||25.06 (6.99)||0.636 (0.10)|
|29||2009Turhan_c45||25.16 (7.22)||0.633 (0.10)|
|30||2009Cruz_c45||25.69 (6.82)||0.624 (0.09)|
|31||2008Watanabe_c45||27.39 (4.77)||0.602 (0.08)|
As discussed in Section 3.1.6, we analysed the statistical significance of results based on the Friedman test. The null hypothesis assumes all performances as equivalent. The alternative hypothesis is that at least one pair of predictive models has different performance. The null hypothesis is rejected with a p-value < 2.2e-16. Therefore, we analyse the pairwise difference of performances with Fisher’s LSD test. The result is presented in Figure 3. The letters group the methods with no significant difference.
No significant difference was found between 2012Ma_nb and the next 3 methods, although 2012Ma_nb performs significantly better than all remaining methods, including orig_nb. This comparison with the original learner is important since it exposes the real gain produced by each transfer learning solution. For example, except for 2012Ma_nb, no significant difference between the CPDP methods using NB and orig_nb was found, as indicated by group ’c’.
This relation between transfer learning solutions and their respective learner baseline is better exposed in Table V. When there is no significant difference, the relation is represented by the symbol “/”. Otherwise, it is filled with “(+)” or “(-)”, meaning a better or worse mean rank position, respectively.
From this table, we can observe that most of the applied transfer learning solutions do not lead to significant difference of performance in relation to the original classifiers. Moreover, some solutions diminish the performances when associated to a learner, as observed for the MLP classifier with the solutions 2008Watanabe, 2009Cruz, and 2009Turhan. On the other hand, the solution 2013He significantly improves the performances of the classifiers RF, SVM, and C45. The solution 2012Ma also improves the performance when associated to the NB classifier.
Based upon the statistical analysis presented in Figure 3, the methods 2012Ma_nb, 2013He_rf, 2009Turhan_nb, and 2013He_svm presented the best performances across datasets although the method 2009Turhan_nb did not present significant difference in relation to its respective original learner baseline.
However, other criteria can be considered for a comparative analysis. For example, in their original published work [5, 12], both 2012Ma and 2013He presented lower computation time cost in relation to the 2009Turhan solution, although 2013He presented higher complexity in relation to 2012Ma. On the other hand, 2013He is more robust to redundant and irrelevant attributes than 2012Ma, for two reasons. First, 2013He has an internal procedure to filter the most relevant attributes, and both classifiers RF and SVM are known to be robust in this context. Second, 2012Ma is sensitive to redundant and irrelevant attributes at two points: its internal weighting procedure, based on the attribute relation between testing and training examples; and its assumption of independence between attributes, inherited from the Naive Bayes algorithm.
RQ2: Do the best CPDP methods perform better for the same datasets?
To answer this question, we use the information shown in Figure 3. We evaluate the performances of the four CPDP methods with the best ranking across datasets: 2012Ma_nb, 2013He_rf, 2009Turhan_nb, and 2013He_svm. Then, for each dataset, we identify the best AUC performance obtained among these four CPDP methods. This association of best methods for each dataset is presented in Table VI. As already mentioned in Section 3.1.6, the AUC performance comparison is based on 2 significant digits only. In this way, we do not differentiate very similar performances.
|Project||Best AUC||# Best Methods||Best Methods|
|jedit-4.2.csv||0.92||4||2013He_rf, 2012Ma_nb, 2009Turhan_nb, 2013He_svm|
|poi-3.0.csv||0.85||3||2013He_rf, 2013He_svm, 2012Ma_nb|
|synapse-1.1.csv||0.68||3||2012Ma_nb, 2013He_svm, 2013He_rf|
We can extract some information from this table. First, the best CPDP method for a project version is not necessarily the same for all versions. More than one CPDP method can achieve the best AUC performance for the same project version. From all the 47 project versions: achieved a successful performance with ; performed in the range ; and only performed with AUC below . Last but not least, no method performs best in general. This is better visualized in Figure 4. In this figure, each CPDP method represents the set of all project versions for which it achieved the best performance. The intersections between sets (i.e., when a project version has more than one best method) are represented by the connected dots.
From all the 47 project versions: 28% achieved the best performance exclusively with 2012Ma_nb; 19% with 2013He_rf; 17% with 2009Turhan_nb; 9% with 2013He_svm; and 27% share the best performance with more than one CPDP method. As we can see, no method always performs best.
In the next section we investigate to what extent a meta-learning solution could predict the best CPDP method according to the project characteristics. The next experiments are based on the results discussed in this section.
4 Meta-learning for CPDP
In this section we propose a meta-learning architecture for the recommendation of CPDP methods. First, we present the general methodology and details related to the construction of the meta-model. Next, we evaluate the performance of the proposed solution. The experimental design is presented in Section 4.2 and the results are discussed in Section 4.3.
4.1 Proposed Methodology
As mentioned in Section 2.4, meta-learning is commonly applied for the recommendation of base learners in specific tasks. In this study, we propose a different application of meta-learning, designed for the recommendation of CPDP methods. The process is based on the general meta-learning approach proposed in Kalousis . Three main differences can be highlighted: 1) the performances are obtained from external project datasets with CPLOO cross-validation instead of a cross-validation within the datasets (see Section 3.1.5); 2) we adopted unsupervised meta-attributes instead of the traditional supervised meta-attributes, commonly used in the literature  (see Section 4.1.2); and 3) the meta-target characterizes a multi-label classification task  (see Section 4.1.3).
Figure 5 presents the general architecture of the proposed solution. The meta-target is determined by the performances obtained from the input datasets. The meta-attributes are also extracted from the characterization of these datasets. After the meta-data preprocessing, a multi-label meta-learner is applied. This meta-learner is associated with a wrapper attribute selection in order to select a subset of meta-attributes to compose the final meta-model. The meta-model can then be used to recommend suitable CPDP methods for new project datasets. Once the meta-model is constructed, only the new project datasets need to be characterized and preprocessed before predicting a suitable CPDP method. Further details and related issues are discussed in the following subsections.
4.1.1 Input Datasets
The collection of datasets (meta-examples) is composed of the 47 project versions already presented in Section 3.1.1. The amount of meta-examples is important since the more information is available, the more effective and generalizable the meta-model can be. However, as stated in Brazdil et al., scarce training data for the meta-learning task is a common aspect in this context. Although there is no recommended minimum amount of data, more than 50 datasets are desirable for a meta-learning analysis. Thus, the available amount of data is a limitation of this study.
4.1.2 Meta-attributes
As already mentioned in Section 2.4, in this study we focus on the unsupervised characterization of datasets. Contrary to the supervised characterization, the unsupervised approach does not demand the previous information about the class attribute (defective or not-defective). This approach is important in the context of this study since we work with the assumption that historical defect information may not be available for a software company. This approach, however, is little explored in the meta-learning literature .
Within the context of this study, we evaluate two different sets of meta-attributes. Both are related to continuous data and can be automatically extracted from the original dataset.
The first meta-attributes set (here called MS-Dist) corresponds to the direct application of distributional measures over all the dataset attributes. For this, we consider 5 distributional measures: mean (mean), standard deviation (sd), median (med), maximum (max) and minimum (min). This approach leads to 100 meta-attributes, considering all 5 distributional measures applied to all the 20 attributes of a dataset. We also included the size (number of examples) of a dataset, totalling 101 meta-attributes. The use of distributional measures to characterize datasets was already proposed in the CPDP literature but for different purposes [53, 12, 11].
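The construction of the MS-Dist set can be sketched in Python as follows. The meta-attribute names and the use of the sample standard deviation are assumptions, and the dataset is a hypothetical stand-in with 20 numeric attributes.

```python
import statistics

def ms_dist(dataset):
    """Characterize a dataset by 5 distributional measures per attribute plus its size."""
    meta = {}
    for j in range(len(dataset[0])):
        column = [row[j] for row in dataset]
        meta[f"mean_{j}"] = statistics.mean(column)
        meta[f"sd_{j}"] = statistics.stdev(column)   # sample standard deviation (assumption)
        meta[f"med_{j}"] = statistics.median(column)
        meta[f"max_{j}"] = max(column)
        meta[f"min_{j}"] = min(column)
    meta["size"] = len(dataset)
    return meta

# Hypothetical dataset: 3 examples described by 20 code metrics.
dataset = [[float(i + j) for j in range(20)] for i in range(3)]
meta = ms_dist(dataset)
```

None of these measures uses the class attribute, so the characterization remains applicable to projects without historical defect data.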
The second meta-attributes set (here called MS-Uns) is proposed in dos Santos and de Carvalho  and covers different characteristics of a dataset including general measures, statistical measures and clustering based measures. Originally, this set includes 53 unsupervised meta-attributes applied in the context of active learning. From the original set, we selected 44 measures applicable to the context of this study. The selected subset of meta-attributes is presented in Table VII. The measures matching the pattern (e.g., , , ) are obtained in two steps. First, the measure is extracted for each element of the dataset, generating a vector of values (e.g., the standard deviation extracted for all the attributes or the correlation extracted for all the pairs of attributes). Then, a distributional measure (e.g., min, max) is applied over this vector of values, generating a single value. The numbers , , and attached to the measures kurt, conn, dunn, and silh represents an internal parametrization indicating the proportion of clusters per class.
|size||Size (number of instances)|
|Logarithm of size|
|, , ,||Mean|
|, , ,||Standard deviation|
|, , ,||Normalized entropy|
|, , ,||Correlation between features|
|, , ,||Skewness|
|, , ,||Kurtosis|
|, ,||Connectivity k-means||Cluster validity measure |
|, ,||Connectivity hierarc. clust.||Cluster validity measure |
|, ,||Dunn index k-means||Cluster validity measure |
|, ,||Dunn index hierarc. clust.||Cluster validity measure |
|, ,||Silhouette k-means||Cluster validity measure |
|, ,||Silhouette hierarc. clust.||Cluster validity measure |
4.1.3 Meta-target
The meta-target is obtained from the experiment results presented in Section 3. From the 31 evaluated CPDP methods, we consider four possible labels for a project: 2012Ma_nb, 2013He_rf, 2009Turhan_nb, and 2013He_svm. These four methods refer to the best-ranked methods across datasets, as discussed in Section 3.2.
Another important characteristic of this study is that the meta-target is designed for a multi-label scenario. As mentioned in Section 3.1.6, we rounded the AUC performances to only 2 significant digits. On the one hand, this leads to a more accurate analysis since we do not differentiate CPDP methods with very similar performance. On the other hand, however, this leads to a multi-label classification task since more than one label can be associated with the same meta-example (see Section 2.5). Table VI presents the best AUC performance obtained for each project and the respective best methods with equivalent performances.
For this multi-label classification task we applied the Binary Relevance (BR) transformation method. As discussed in Section 2.5, this method creates $q$ distinct datasets ($q = 4$, total number of labels), one for each of the four possible labels. The multi-label problem is then transformed into four binary classification problems. For each dataset $D_{\lambda_j}$, $1 \leq j \leq q$, the class attribute is positive for meta-examples that belong to the label $\lambda_j$ and negative otherwise.
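The BR transformation can be sketched in Python as below. The label sets are the three rows of Table VI shown above, and the function and variable names are hypothetical.

```python
LABELS = ["2012Ma_nb", "2013He_rf", "2009Turhan_nb", "2013He_svm"]

def binary_relevance(meta_targets, labels=LABELS):
    """Turn one multi-label target into one binary target per label."""
    return {
        label: [1 if label in best_methods else 0 for best_methods in meta_targets]
        for label in labels
    }

# Best methods for jedit-4.2, poi-3.0 and synapse-1.1, as listed in Table VI.
meta_targets = [
    {"2013He_rf", "2012Ma_nb", "2009Turhan_nb", "2013He_svm"},
    {"2013He_rf", "2013He_svm", "2012Ma_nb"},
    {"2012Ma_nb", "2013He_svm", "2013He_rf"},
]
binary_targets = binary_relevance(meta_targets)
```

Each of the four resulting targets is then learned by an independent binary classifier over the same meta-attributes.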
4.1.4 Meta-data Preprocessing
In the preprocessing step, we address two issues. First, we mitigate the likely influence of the different ranges and scales of data on the meta-model performance. For this, we applied the z-score normalization. In this normalization technique, each meta-attribute column is centred by subtracting the mean and scaled by dividing each value by the standard deviation. Second, we address the class-imbalance issue, possibly resulting from the BR transformation method. For example, consider the label 2009Turhan_nb. This method achieved the best AUC for only 26% of all project datasets (see Table VI), which leads to an imbalanced binary dataset. In order to mitigate this issue, we applied the oversampling technique. This technique consists of randomly duplicating examples of the minority class until the desired class distribution is achieved. It allows adjusting the class distribution of a dataset without discarding information. This characteristic is important in the context of this study considering the limited amount of meta-examples. On the other hand, the duplication of examples may lead to overfitting.
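Both preprocessing steps can be sketched in Python as follows. The balanced (50/50) target distribution, the sample standard deviation, the fixed seed, and the toy data are assumptions made for illustration.

```python
import random
import statistics

def z_score(column):
    """Centre a meta-attribute column on its mean and scale by its standard deviation."""
    mu = statistics.mean(column)
    sd = statistics.stdev(column)
    return [(v - mu) / sd for v in column]

def oversample(examples, labels, seed=0):
    """Randomly duplicate minority-class examples until both classes have the same size."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        return list(examples), list(labels)   # nothing to duplicate
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    index = list(range(len(labels))) + extra
    return [examples[i] for i in index], [labels[i] for i in index]

normalized = z_score([1.0, 2.0, 3.0])
X, y = oversample([[0], [1], [2], [3]], [1, 0, 0, 0])
```

The single positive example is duplicated twice in this toy case, which also shows why oversampling can favour overfitting when meta-examples are scarce.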
4.1.5 Meta-learner
Given a new software project dataset, represented by the meta-example $x$, the meta-model must be able to recommend an appropriate CPDP method (or label $\lambda$ for $x$). For each dataset $D_{\lambda_j}$, $1 \leq j \leq q$ (where $q$ is the number of labels), generated with the BR transformation method (see Section 4.1.3), we apply a binary classifier able to generate the confidence of relevance $f(x, \lambda_j)$ (the confidence of $\lambda_j$ being a relevant label for $x$). The final recommended label is the one with the highest confidence of relevance, $\arg\max_{\lambda_j} f(x, \lambda_j)$.
For this task, we used the Random Forest algorithm . Based on our experiments, this classifier presented the best learning capacity compared to each of the five classifiers investigated in this study. This algorithm is also a common choice in the contexts of software defect prediction  and algorithm recommendation [19, 24].
4.1.6 Performance Measure
Several performance measures focused on different aspects of multi-label learning have been proposed in the literature. In this study we are specifically interested in evaluating the frequency (or accuracy) with which the top-ranked label (the one with the highest confidence of relevance $f(x_i, \lambda)$) is actually among the relevant labels of an example. This aspect can be obtained from the one-error measure. The one-error measures how many times the predicted label was not in $Y_i$ (the set of relevant labels for $x_i$). It is defined as follows:
$$\text{one-error} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[ \arg\max_{\lambda} f(x_i, \lambda) \notin Y_i \right]$$
where $N$ is the number of meta-examples and $\mathbb{1}[\cdot]$ equals 1 when the condition holds and 0 otherwise.
Note that, for single-label classification tasks, the one-error is identical to the ordinary error measure. Our general goal is to maximize the accuracy given by $1 - \text{one-error}$.
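The recommendation rule and the one-error measure can be sketched together in Python. The sketch assumes that the one-error is the fraction of meta-examples whose top-ranked label is not among their relevant labels. The confidence values are hypothetical.

```python
def recommend(confidences):
    """Return the label with the highest confidence of relevance."""
    return max(confidences, key=confidences.get)

def one_error(all_confidences, relevant_sets):
    """Fraction of meta-examples whose top-ranked label is not a relevant label."""
    misses = sum(
        1
        for confidences, relevant in zip(all_confidences, relevant_sets)
        if recommend(confidences) not in relevant
    )
    return misses / len(relevant_sets)

# Hypothetical confidences produced by the binary classifiers for two meta-examples.
all_confidences = [
    {"2012Ma_nb": 0.7, "2013He_rf": 0.2},
    {"2012Ma_nb": 0.6, "2013He_rf": 0.4},
]
relevant_sets = [{"2012Ma_nb"}, {"2013He_rf"}]
error = one_error(all_confidences, relevant_sets)
accuracy = 1 - error
```

Only the top-ranked label matters here: a meta-example with several relevant labels counts as correct when any one of them is recommended.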
4.1.7 Attribute Selection
As mentioned in Section 4.1.1, the amount of meta-data available is limited. High dimensional datasets, with redundant and irrelevant attributes, can lead to an ineffective performance . Therefore, we also apply an Attribute Selection method over the meta-data in order to achieve a suitable subset of meta-attributes.
We apply a Best-First Forward Wrapper strategy, adapted from Kalousis and Hilario. This strategy applies an extensive and systematic search in the state space of all possible attribute subsets using the Best-First heuristic. The search is guided by the estimated accuracy of each subset, provided by an induction algorithm. To estimate the accuracy of a given subset we apply the CPLOO cross-validation, where each version (or meta-example) is tested over a training set containing all the remaining project versions, except the versions of the same project (see Section 3.1.5). The accuracy measure is presented in Section 4.1.6 and the induction algorithm is discussed in Section 4.1.5.
At the end, the subset of meta-attributes with highest accuracy is selected.
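A generic best-first forward search can be sketched in Python as below. This is not the adapted strategy used in the study: the stopping rule (a hypothetical `patience` of non-improving expansions) and the toy scoring function are assumptions, and in practice `evaluate` would return the CPLOO accuracy estimate of the meta-learner on the given subset.

```python
import heapq

def best_first_forward(attributes, evaluate, patience=5):
    """Forward search over attribute subsets, always expanding the best open subset."""
    start = frozenset()
    best_subset, best_score = start, evaluate(start)
    open_list = [(-best_score, 0, start)]   # max-heap through negated scores
    visited = {start}
    counter, stale = 1, 0
    while open_list and stale < patience:
        _, _, subset = heapq.heappop(open_list)
        improved = False
        for attribute in attributes:
            child = subset | {attribute}
            if child in visited:
                continue
            visited.add(child)
            score = evaluate(child)
            heapq.heappush(open_list, (-score, counter, child))
            counter += 1
            if score > best_score:
                best_subset, best_score, improved = child, score, True
        stale = 0 if improved else stale + 1
    return best_subset, best_score

# Toy score: attributes "a" and "c" are useful, every other attribute costs 0.1.
def toy_score(subset):
    return len(subset & {"a", "c"}) - 0.1 * len(subset - {"a", "c"})

selected, score = best_first_forward(["a", "b", "c", "d"], toy_score)
```

Unlike a purely greedy search, the open list lets the search return to an earlier subset when the current branch stops improving.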
4.2 Performance Evaluation
In an ideal scenario, we would be able to construct a meta-model based on the 47 available datasets (as proposed in Section 4.1) and evaluate its performance with a different set of datasets. However, this scenario is not possible since this different set of datasets is not available. In order to approximate the ideal scenario, we evaluate the proposed solution based on a variation of the CPLOO procedure, here called meta-CPLOO. We adapted this leave-one-out procedure to the context of CPDP although it is a common approach in the meta-learning literature .
We separate one project (and its respective versions) for testing and construct the meta-model based on the remaining project versions. The meta-model is constructed following the configuration presented in the previous section. Finally, each version is tested separately with the respective constructed meta-model. It is important to observe that different meta-models (one for each project) will be constructed to test all the project versions. In the end, a CPDP method is recommended for each version.
On the one hand, the meta-CPLOO procedure enables us to estimate the performance of the proposed solution. On the other hand, it reduces the already limited amount of meta-data. For example, if we separate for test the five versions of the Ant project, the meta-model will be constructed based only on 42 versions. Consequently, the amount of data available for the internal CPLOO procedures is reduced.
Based on the meta-CPLOO procedure, we evaluate the performance of the meta-learning solution in two different levels: the meta-level and the base-level. In the meta-level we evaluate the meta-learning capacity, i.e., whether the meta-learner can learn from the meta-data in relation to the defined baselines. In the base-level we evaluate the performance of the meta-learning solution across datasets in terms of AUC and compare it in relation to the four considered base CPDP methods. In each level, we evaluate two different configurations of meta-learning: MS-Dist and MS-Uns; referring to the two meta-attributes sets presented in Section 4.1.2.
In the meta-level analysis, we evaluate two sources of results. The first source is obtained from the attribute selection step (see Section 4.1.7). For each of the 15 projects, the subset of attributes with highest accuracy estimate is selected to compose the meta-model. We compare these accuracy estimates in relation to a baseline, defined as the majority class. In this context, the majority class is given by the most frequent label of the respective meta-data. The statistical significance is verified with the non-parametric Wilcoxon signed rank test () .
The second source is obtained from the final recommendations provided by the proposed solution. We compare the accuracy (i.e., the rate of correct predictions) obtained with the meta-learning solution in relation to the majority class. In this context, the majority class is given by the most frequent label presented in Table VI.
In the base-level analysis, we compare the general performance (in terms of AUC) of the meta-learning solution in relation to the four considered CPDP methods applied individually. In addition, we compare all the performances in relation to a random baseline. For each project version, one of the four evaluated methods is randomly selected. This process is repeated 30 times. The random baseline is given by the mean AUC of the respective selected methods. The statistical analysis is based on the Friedman test followed by Fisher’s LSD test , as presented in Section 3.1.6.
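One reading of the random baseline can be sketched in Python as follows. It assumes that the 30 random selections are averaged per project version. The AUC values, the version names, and the seed are hypothetical.

```python
import random

def random_baseline(auc_by_version, repetitions=30, seed=0):
    """Mean AUC obtained when a method is picked at random for each version."""
    rng = random.Random(seed)
    baseline = {}
    for version, aucs in auc_by_version.items():
        methods = sorted(aucs)
        picks = [aucs[rng.choice(methods)] for _ in range(repetitions)]
        baseline[version] = sum(picks) / repetitions
    return baseline

# Hypothetical AUC values of two candidate methods on two project versions.
auc_by_version = {
    "v1": {"2012Ma_nb": 0.80, "2013He_rf": 0.80},
    "v2": {"2012Ma_nb": 0.60, "2013He_rf": 0.90},
}
baseline = random_baseline(auc_by_version)
```

A recommender that carries useful information about the project should reach a higher mean AUC than this uninformed selection.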
RQ3: To what extent can meta-learning help us to select the most suitable CPDP method for a given dataset?
RQ3.1: Does the meta-learner learn? (Meta-level)
To answer this question, we evaluate two sources of results, as discussed in Section 4.2.1. First, we compare the accuracy estimates produced in the attribute selection step (see Section 4.1.7) for each of the 15 generated meta-models. We use these accuracies to evaluate the learning capacity of the two meta-learning configurations: MS-Dist and MS-Uns. We compare their performances in relation to the majority class baseline. The obtained accuracies are presented in Table VIII.
|Best-First - Acc||Best-First - Selected Subset|
|ckjm||0.435||0.609||0.739||,||, , ,|
|forrest||0.533||0.733||0.733||, ,||, , ,|
|synapse||0.477||0.682||0.636||, ,||, , ,|
|xalan||0.395||0.674||0.488||, , ,|
|xerces||0.465||0.698||0.628||, , ,||, ,|
Initially, we can observe that both meta-attribute sets, MS-Uns and MS-Dist, produced an accuracy superior to the majority class for the meta-data of all projects. MS-Dist produced the best mean value, although it cannot be differentiated from MS-Uns with statistical significance. These performances are better visualized in Figure 6.
Although the obtained performances indicate some level of meta-learning capacity, they may also reflect overfitting, i.e., a subset achieves a high accuracy estimate but has poor predictive power for new examples. For small samples with a high-dimensional attribute space (as in the evaluated context), it is likely that one of the many attribute subsets leads to a hypothesis with high predictive accuracy.
The existence of overfitting can be verified by testing the meta-models on new examples. For this, we use each meta-model to recommend a suitable CPDP method for the respective project versions previously set aside for testing (see Section 4.2). The meta-learners MS-Dist and MS-Uns correctly recommended a suitable CPDP method for 25 (53%) and 16 (34%) of the 47 tested project versions, respectively. The majority class, represented by the label 2013He_rf, achieved the best AUC for 20 (43%) of all project versions. These results indicate that the meta-learning solution based on the meta-attribute set MS-Dist learned appropriately, since it produced an accuracy superior to that of the majority class, defined as the baseline.
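The comparison against the majority class can be illustrated with a short sketch. It assumes that a recommendation counts as correct when the recommended method belongs to the set of suitable (best-AUC) methods of a project version; the function names, the toy meta-data, and the version names are hypothetical. The last lines simply recompute the percentages reported above from the counts over the 47 test versions.

```python
from collections import Counter


def recommendation_accuracy(recommended, suitable):
    """Fraction of versions whose recommended method is among the suitable ones."""
    hits = sum(1 for version, method in recommended.items()
               if method in suitable[version])
    return hits / len(recommended)


def majority_label(suitable):
    """Most frequent label over all versions (a version may carry several labels)."""
    counts = Counter(label for labels in suitable.values()
                     for label in sorted(labels))
    return counts.most_common(1)[0][0]


# Hypothetical toy meta-data: suitable methods per project version.
suitable = {
    "v1": {"2013He_rf"},
    "v2": {"2013He_rf", "2012Ma_nb"},
    "v3": {"2012Ma_nb"},
    "v4": {"2013He_rf"},
}
# Hypothetical recommendations of a meta-learner.
recommended = {"v1": "2013He_rf", "v2": "2012Ma_nb",
               "v3": "2012Ma_nb", "v4": "2013He_rf"}

meta_acc = recommendation_accuracy(recommended, suitable)
majority = majority_label(suitable)
baseline_acc = recommendation_accuracy({v: majority for v in suitable}, suitable)

# Percentages reported in the text, recomputed from the counts (47 versions).
reported = {name: round(100 * hits / 47)
            for name, hits in [("MS-Dist", 25), ("MS-Uns", 16),
                               ("majority class", 20)]}
```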
RQ3.2: How does the meta-learner perform across datasets? (Base-level)
The general performance of the meta-learning solution is discussed below. We compare the recommendations produced in the meta-CPLOO procedure with the four evaluated CPDP methods applied individually. We also compare the meta-learning performances with the random baseline. Table IX presents the mean rank and mean AUC of each method, as well as the frequency with which each method achieved the best and the worst AUC value for a project among the four possible labels.
|Method||Mean Rank||Mean AUC||Freq. Best||Freq. Worst|
|Meta_MS-Dist||3.40||0.774||25 (53%)||12 (26%)|
|2012Ma_nb||3.61||0.771||19 (40%)||12 (26%)|
|2013He_rf||3.76||0.766||20 (43%)||11 (23%)|
|Meta_MS-Uns||4.03||0.770||16 (34%)||17 (36%)|
|Random||4.32||0.766||4 (9%)||3 (6%)|
|2013He_svm||4.37||0.765||13 (28%)||18 (38%)|
|2009Turhan_nb||4.51||0.761||12 (26%)||21 (45%)|
The meta-learner MS-Dist presented the highest mean AUC of all evaluated methods. This solution also presented the highest frequency of best AUC. It provided the best AUC performance for 53% of all project versions, against 40% for the second best-ranked method (2012Ma_nb). Together with 2012Ma_nb, the solution MS-Dist provided the worst AUC performance for only 26% of all project versions. It is worth noting that even the worst AUC performance corresponds to one of the four best-ranked solutions across datasets, as discussed in Section 4.1.3.
We applied the Friedman test and obtained a p-value of 0.044, which rejects the null hypothesis of performance equivalence between methods. The results of Fisher’s LSD test are presented in Figure 7. From these results, we can highlight: 1) the random baseline is not significantly different from 2013He_svm and 2009Turhan_nb; 2) the meta-learner MS-Uns cannot be differentiated from the random baseline; 3) the three best-ranked methods, MS-Dist, 2012Ma_nb, and 2013He_rf, differ significantly from the random baseline; and 4) the two best-ranked methods do not differ significantly from each other, although MS-Dist was more frequently the best approach (53% against 40%).
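The mean ranks and the Friedman statistic underlying this analysis can be sketched in plain Python. The sketch assumes that rank 1 is assigned to the highest AUC within each project version (which is consistent with the best method having the lowest mean rank), that ties share the average rank, and that no tie correction is applied to the statistic. The AUC values and function names are hypothetical, and neither the p-value (obtained from a chi-square distribution with k - 1 degrees of freedom) nor the Fisher’s LSD post-hoc step is computed here.

```python
def ranks_descending(scores):
    """Rank 1 = highest score; tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next score ties with the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for i in order[pos:end + 1]:
            ranks[i] = avg
        pos = end + 1
    return ranks


def mean_ranks(table):
    """table: one row per dataset, one column per method (AUC values)."""
    per_row = [ranks_descending(row) for row in table]
    n, k = len(table), len(table[0])
    return [sum(r[j] for r in per_row) / n for j in range(k)]


def friedman_statistic(table):
    """Friedman chi-square statistic (no tie correction)."""
    n, k = len(table), len(table[0])
    r = mean_ranks(table)
    return 12 * n / (k * (k + 1)) * (sum(x * x for x in r) - k * (k + 1) ** 2 / 4)


# Hypothetical AUC values: 3 project versions (rows) x 3 methods (columns).
auc = [
    [0.90, 0.80, 0.70],
    [0.85, 0.80, 0.60],
    [0.70, 0.90, 0.80],
]
avg_ranks = mean_ranks(auc)
chi2 = friedman_statistic(auc)
```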
5 Threats to Validity
Some factors can threaten the validity of the experiments conducted in this study. The first issues are related to the collection of data used in the experiments. Any lack of quality in these data may jeopardize the entire study. For example, it is known that the number of defects found for each software part is an approximate estimate and does not represent the actual number of defects. However, precise information in this context is difficult, if not impossible, to acquire in a real project. Kitchenham et al. argue that software companies frequently do not keep suitable historical information about software quality. When available, that information is commonly private or restricted to internal use. Also, the collection of data is composed only of Java open-source projects, which restricts the external validity of the results.
Despite these factors, some positive points justify the use of this collection of data: 1) the datasets are open for reuse; 2) the independent variables can be automatically extracted; 3) the collection is composed of several project versions, allowing us to conduct the experiments in the CPDP context; 4) the data acquisition is based on a systematic and frequently used process [71, 43, 38]; and 5) these data were extensively reused in the literature [44, 12, 72, 11, 38, 73].
The methods evaluated in this study (including transfer learning solutions and classifiers) compose a representative sample of the state of the art in CPDP. However, this does not represent an exhaustive comparison covering all the existing solutions. The extension of these experiments with additional methods can lead to alternative conclusions. Also, the internal parameters of each method follow either the default configuration provided by the code libraries or the original recommendations of the authors. The tuning of parameters can influence the performances and, consequently, the conclusions drawn from the results. Although the use of default configurations is a common approach in the experimental software engineering literature, future work should investigate the impact of parameter tuning on this analysis.
The preprocessing step also directly influences the performance of predictive models. We applied the same log transformation to all methods in order to diminish the impact of this step on the results. However, some internal preprocessing steps can still interfere. For example, the Naive Bayes classifier operates in conjunction with data discretization, which may benefit this classifier over others. Also, Herbold argues that the performance of SVMs can be positively impacted by the weighting of imbalanced training data, which is ignored in this study.
Another important issue, specific to this study, regards the CPDP methods considered as the labels for the meta-target. Those CPDP methods presented the best ranking performances across datasets. Also, they presented distinct biases in relation to each other, which can contribute to both the diversity and the complementarity of the meta-learning solution. However, other criteria can be considered in this case, which may completely alter the obtained conclusions.
6 Conclusion
In this study, we provided two main contributions. First, we conducted an experimental comparison of 31 CPDP methods derived from six state-of-the-art transfer learning solutions associated with the five most frequently used classifiers in the defect prediction literature. Second, we investigated the feasibility of meta-learning applied to CPDP.
The experiments are based on 47 versions of 15 open-source software projects. Unlike previous studies, we considered a context in which no previous information about the software is required. In practice, this characteristic allows a broader applicability of defect prediction models and of the proposed meta-learning solution, for example, to companies with insufficient data from previous projects or to projects in their first release. This characteristic is possible due to three main factors: 1) the training set is composed of known external projects; 2) the software characterization can be extracted directly from the source code; and 3) the meta-model is constructed based on unsupervised meta-attributes.
From the first experiment, we identified the four best-ranked CPDP methods across datasets in terms of AUC: 2012Ma_nb, 2013He_rf, 2009Turhan_nb, and 2013He_svm. These four methods did not differ significantly in performance from each other. However, they presented the best performance for distinct groups of datasets. In other words, the most suitable CPDP method varies according to the project being predicted. These results motivated the investigation of the meta-learning solution proposed in this study.
We evaluated two distinct unsupervised meta-attribute sets for a multi-label meta-learning task. The performance analysis was conducted at two levels. At the meta-level, the results indicate a proper predictive power of the proposed solution. The meta-learner based on the meta-attribute set MS-Dist presented an accuracy 10 percentage points superior to the majority class. At the base-level, we compared the proposed solution with the four best-ranked CPDP methods across datasets in terms of AUC. The meta-learner MS-Dist presented the highest mean AUC, although it did not differ significantly in performance from the base method 2012Ma_nb.
Assuming the proper generalization of these results, three factors contribute to the practical use of the proposed solution: 1) there is learning at the meta-level, since the meta-learner provided the best solution for a larger number of datasets than each of the four evaluated methods applied individually; 2) in the worst case, the meta-learner will still recommend one of the four CPDP methods with the best ranking across datasets; and 3) considering its application domain, the proposed solution is not expensive. The heavy computational cost is incurred in the meta-model construction. For the recommendation task, only the meta-characterization and prediction costs are incurred.
However, further studies are still necessary to guarantee the generalization of the presented results. In addition, alternative solutions can be investigated with the aim of improving the meta-learning performance. We point out three factors for future investigation. First, the meta-data can be expanded. A larger number of examples can contribute to improving the predictive power of the meta-learner. Second, other meta-attribute sets can be explored. For example, the relation (or difference) between testing and training data can be considered in conjunction with the particularities of each dataset. Third, other attribute selection methods can lead to subsets with higher predictive power.
Acknowledgments
This research was developed with computational resources from the Center for Mathematical Sciences Applied to Industry (CeMEAI), financed by FAPESP. The authors also acknowledge the support granted by FAPESP (grant 2013/01084-3).
References
- Malhotra  R. Malhotra, “A systematic review of machine learning techniques for software fault prediction,” Appl. Soft Comput., vol. 27, no. C, pp. 504–518, Feb. 2015. [Online]. Available: http://dx.doi.org/10.1016/j.asoc.2014.11.023
- Herbold et al.  S. Herbold, A. Trautsch, and J. Grabowski, “Global vs. local models for cross-project defect prediction,” Emp. Soft. Engin., pp. 1–37, 2016. [Online]. Available: http://dx.doi.org/10.1007/s10664-016-9468-y
- Kitchenham et al.  B. A. Kitchenham, E. Mendes, and G. H. Travassos, “Cross versus within-company cost estimation studies: A systematic review,” IEEE Trans Soft Eng, vol. 33, no. 5, pp. 316–329, 2007. [Online]. Available: http://dx.doi.org/10.1109/TSE.2007.1001
- Gunarathna et al.  D. Gunarathna, B. Turhan, and S. Hosseini, “A systematic literature review on cross-project defect prediction,” Master’s thesis, University of Oulu - Information Processing Science, Oct. 2016.
- Ma et al.  Y. Ma, G. Luo, X. Zeng, and A. Chen, “Transfer learning for cross-company software defect prediction,” Inf Soft Tech, vol. 54, no. 3, pp. 248–256, 2012. [Online]. Available: http://dx.doi.org/10.1016/j.infsof.2011.09.007
- Zimmermann et al.  T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, “Cross-project defect prediction: A large scale experiment on data vs. domain vs. process,” in Proc Int Symp on Found of Soft Eng (SIGSOFT/FSE2009). ACM, 2009, pp. 91–100. [Online]. Available: http://dx.doi.org/10.1145/1595696.1595713
- Pan and Yang  S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Trans. on Knowl. and Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct 2010. [Online]. Available: http://dx.doi.org/10.1109/TKDE.2009.191
- Watanabe et al.  S. Watanabe, H. Kaiya, and K. Kaijiri, “Adapting a fault prediction model to allow inter language reuse,” in Proc Int Conf on Pred Mod in Soft Eng (PROMISE2008). ACM, 2008, pp. 19–24. [Online]. Available: http://dx.doi.org/10.1145/1370788.1370794
- Camargo Cruz and Ochimizu  A. E. Camargo Cruz and K. Ochimizu, “Towards logistic regression models for predicting fault-prone code across software projects,” in Proc. Int. Symp. on Emp. Soft. Eng. and Meas. (ESEM ’09), Washington, DC, USA, 2009, pp. 460–463. [Online]. Available: http://dx.doi.org/10.1109/ESEM.2009.5316002
- Turhan et al.  B. Turhan, T. Menzies, A. B. Bener, and J. Di Stefano, “On the relative value of cross-company and within-company data for defect prediction,” Emp Soft Eng, vol. 14, no. 5, pp. 540–578, 2009. [Online]. Available: http://dx.doi.org/10.1007/s10664-008-9103-7
- Herbold  S. Herbold, “Training data selection for cross-project defect prediction,” in Proc. Int. Conf. on Pred. Mod. in Soft. Eng. (PROMISE ’13), New York, NY, USA, 2013, pp. 6:1–6:10. [Online]. Available: http://dx.doi.org/10.1145/2499393.2499395
- He et al.  Z. He, F. Peters, T. Menzies, and Y. Yang, “Learning from open-source projects: An empirical study on defect prediction.” in ESEM. IEEE Computer Society, 2013, pp. 45–54. [Online]. Available: http://dx.doi.org/10.1109/ESEM.2013.20
- Menzies and Shepperd  T. Menzies and M. Shepperd, “Special issue on repeatable results in software engineering prediction,” Empirical Softw. Eng., vol. 17, no. 1-2, pp. 1–17, Feb. 2012. [Online]. Available: http://dx.doi.org/10.1007/s10664-011-9193-5
- Song et al.  L. Song, L. L. Minku, and X. Yao, “The impact of parameter tuning on software effort estimation using learning machines,” in Proc. of the 9th Int. Conf. on Pred. Mod. in Soft. Eng., ser. PROMISE ’13, 2013, pp. 9:1–9:10. [Online]. Available: http://dx.doi.org/10.1145/2499393.2499394
- Shepperd et al.  M. Shepperd, D. Bowes, and T. Hall, “Researcher bias: The use of machine learning in software defect prediction,” IEEE Trans. on Soft. Eng., vol. 40, no. 6, pp. 603–616, June 2014. [Online]. Available: https://doi.org/10.1109/TSE.2014.2322358
- Tantithamthavorn et al.  C. Tantithamthavorn, S. McIntosh, A. E. Hassan, and K. Matsumoto, “Comments on "researcher bias: The use of machine learning in software defect prediction",” IEEE Transactions on Software Engineering, vol. 42, no. 11, pp. 1092–1094, Nov 2016. [Online]. Available: http://dx.doi.org/10.1109/TSE.2016.2553030
- Wolpert  D. H. Wolpert, The Supervised Learning No-Free-Lunch Theorems. London: Springer London, 2002, pp. 25–42. [Online]. Available: http://dx.doi.org/10.1007/978-1-4471-0123-9_3
- Mantovani et al.  R. G. Mantovani, A. L. D. Rossi, J. Vanschoren, B. Bischl, and A. C. P. L. F. Carvalho, “To tune or not to tune: Recommending when to adjust svm hyper-parameters via meta-learning,” in Int. Joint Conf. on Neural Net. (IJCNN), July 2015, pp. 1–8. [Online]. Available: http://dx.doi.org/10.1109/IJCNN.2015.7280644
- Nunes das Dores et al.  S. Nunes das Dores, L. Alves, D. D. Ruiz, and R. C. Barros, “A meta-learning framework for algorithm recommendation in software fault prediction,” in Proc. of Annual ACM Symp. on Appl. Comp. (SAC ’16). New York, NY, USA: ACM, 2016, pp. 1486–1491. [Online]. Available: http://dx.doi.org/10.1145/2851613.2851788
- Parmezan et al.  A. R. S. Parmezan, H. D. Lee, and F. C. Wu, “Metalearning for choosing feature selection algorithms in data mining: Proposal of a new framework,” Expert Systems with Applications, vol. 75, pp. 1–24, Jun 2017. [Online]. Available: http://dx.doi.org/10.1016/j.eswa.2017.01.013
- Lemke et al.  C. Lemke, M. Budka, and B. Gabrys, “Metalearning: a survey of trends and technologies,” Artificial Intelligence Review, vol. 44, no. 1, pp. 117–130, 2015. [Online]. Available: http://dx.doi.org/10.1007/s10462-013-9406-y
- Brazdil et al.  P. Brazdil, R. Vilalta, C. Giraud-Carrier, and C. Soares, Metalearning. Boston, MA: Springer US, 2017, pp. 818–823. [Online]. Available: http://dx.doi.org/10.1007/978-1-4899-7687-1_543
- Kalousis  A. Kalousis, “Algorithm selection via meta-learning,” Ph.D. dissertation, Université de Geneve - Faculté des Sciences, Geneva, 2002.
- dos Santos and de Carvalho  D. P. dos Santos and A. C. P. L. F. de Carvalho, “Automatic selection of learning bias for active sampling,” in Brazilian Conf. on Intel. Syst. (BRACIS2016), Oct 2016, pp. 55–60. [Online]. Available: https://dx.doi.org/10.1109/BRACIS.2016.021
- Brazdil and Henery  P. B. Brazdil and R. J. Henery, “Analysis of results,” in Machine Learning, Neural and Statistical Classification, D. Michie, D. J. Spiegelhalter, C. C. Taylor, and J. Campbell, Eds. Upper Saddle River, NJ, USA: Ellis Horwood, 1994, pp. 175–212.
- Reif et al.  M. Reif, F. Shafait, M. Goldstein, T. Breuel, and A. Dengel, “Automatic classifier selection for non-experts,” Pattern Analysis and Applications, vol. 17, no. 1, pp. 83–96, 2014. [Online]. Available: http://dx.doi.org/10.1007/s10044-012-0280-z
- Tsoumakas and Katakis  G. Tsoumakas and I. Katakis, “Multi-label classification: An overview,” IJDWM, vol. 3, no. 3, pp. 1–13, 2007. [Online]. Available: http://dx.doi.org/10.4018/jdwm.2007070101
- Nucci et al.  D. D. Nucci, F. Palomba, R. Oliveto, and A. D. Lucia, “Dynamic selection of classifiers in bug prediction: An adaptive method,” IEEE Trans. on Emerg. Topics in Comp. Intell., vol. 1, no. 3, pp. 202–212, June 2017. [Online]. Available: http://dx.doi.org/10.1109/TETCI.2017.2699224
- Basili et al.  V. Basili, L. Briand, and W. Melo, “A validation of object-oriented design metrics as quality indicators,” IEEE Trans Softw Eng, vol. 22, pp. 751–761, 1996. [Online]. Available: http://dx.doi.org/10.1109/32.544352
- Ostrand et al.  T. J. Ostrand, E. J. Weyuker, and R. M. Bell, “Where the bugs are,” in Proc Int Symp on Soft Test and Analy (ISSTA2004). ACM, 2004, pp. 86–96. [Online]. Available: http://dx.doi.org/10.1145/1007512.1007524
- Zimmermann and Nagappan  T. Zimmermann and N. Nagappan, “Predicting defects using network analysis on dependency graphs,” in Proc Int Conf on Soft Eng (ICSE2008), 2008, pp. 531–540. [Online]. Available: http://dx.doi.org/10.1145/1368088.1368161
- Pinzger et al.  M. Pinzger, N. Nagappan, and B. Murphy, “Can developer-module networks predict failures?” in Proc Int Symp on Found of Soft Eng (SIGSOFT/FSE2008). ACM, 2008, pp. 2–12. [Online]. Available: http://dx.doi.org/10.1145/1453101.1453105
- Nagappan et al.  N. Nagappan, B. Murphy, and V. Basili, “The influence of organizational structure on software quality: An empirical case study,” in Proc Int Conf on Soft Eng (ICSE2008), 2008, pp. 521–530. [Online]. Available: http://dx.doi.org/10.1145/1368088.1368160
- Hassan  A. E. Hassan, “Predicting faults using the complexity of code changes,” in IEEE Int Conf on Softw Eng. IEEE, 2009, pp. 78–88. [Online]. Available: http://dx.doi.org/10.1109/ICSE.2009.5070510
- Herzig et al.  K. Herzig, S. Just, A. Rau, and A. Zeller, “Predicting defects using change genealogies,” in IEEE Int Symp on Soft Reliab Eng 2013. IEEE, 2013. [Online]. Available: http://dx.doi.org/10.1109/ISSRE.2013.6698911
- Taba et al.  S. E. S. Taba, F. Khomh, Y. Zou, A. E. Hassan, and M. Nagappan, “Predicting bugs using antipatterns,” in Proc IEEE Int Conf on Soft Maint (ICSM2013), 2013, pp. 270–279. [Online]. Available: http://dx.doi.org/10.1109/ICSM.2013.38
- Jureczko  M. Jureczko, “Significance of different software metrics in defect prediction,” Software Engineering: An International Journal, vol. 1, no. 1, pp. 86–95, 2011.
- Madeyski and Jureczko  L. Madeyski and M. Jureczko, “Which process metrics can significantly improve defect prediction models? an empirical study,” Software Quality Journal, pp. 1–30, 2014. [Online]. Available: http://dx.doi.org/10.1007/s11219-014-9241-7
- Ghotra et al.  B. Ghotra, S. McIntosh, and A. E. Hassan, “Revisiting the impact of classification techniques on the performance of defect prediction models,” in 2015 IEEE/ACM 37th IEEE Int. Conf. on Soft. Eng., vol. 1, 2015, pp. 789–800. [Online]. Available: https://doi.org/10.1109/ICSE.2015.91
- Lessmann et al.  S. Lessmann, B. Baesens, C. Mues, and S. Pietsch, “Benchmarking classification models for software defect prediction: A proposed framework and novel findings,” IEEE Trans Soft Eng, vol. 34, no. 4, pp. 485–496, 2008. [Online]. Available: http://dx.doi.org/10.1109/TSE.2008.35
- Briand et al.  L. C. Briand, W. L. Melo, and J. Wust, “Assessing the applicability of fault-proneness models across object-oriented software projects,” IEEE Trans. Softw. Eng., vol. 28, no. 7, pp. 706–720, Jul. 2002. [Online]. Available: http://dx.doi.org/10.1109/TSE.2002.1019484
- Nam et al.  J. Nam, S. J. Pan, and S. Kim, “Transfer defect learning,” in Proc Int Conf on Soft Eng (ICSE2013), 2013, pp. 382–391. [Online]. Available: http://dx.doi.org/10.1109/ICSE.2013.6606584
- Zhang et al.  F. Zhang, A. Mockus, I. Keivanloo, and Y. Zou, “Towards building a universal defect prediction model,” in Proc Work Conf on Mining Soft Rep (MSR2014). ACM, 2014, pp. 182–191. [Online]. Available: http://dx.doi.org/10.1145/2597073.2597078
- Jureczko and Madeyski  M. Jureczko and L. Madeyski, “Towards identifying software project clusters with regard to defect prediction,” in Proc Int Conf on Pred Mod in Soft Eng (PROMISE2010). ACM, 2010, pp. 9:1–9:10. [Online]. Available: http://dx.doi.org/10.1145/1868328.1868342
- He et al.  P. He, B. Li, X. Liu, J. Chen, and Y. Ma, “An empirical study on software defect prediction with a simplified metric set,” Inf. and Soft. Tech., vol. 59, pp. 170 – 190, 2015. [Online]. Available: https://doi.org/10.1016/j.infsof.2014.11.006
- Maimon and Rokach  O. Maimon and L. Rokach, Data Mining and Knowledge Discovery Handbook, 2nd ed. Springer Publishing Company, Incorporated, 2010.
- Keung et al.  J. Keung, E. Kocaguneli, and T. Menzies, “Finding conclusion stability for selecting the best effort predictor in software effort estimation,” Automated Software Engineering, vol. 20, no. 4, pp. 543–567, 2013. [Online]. Available: http://dx.doi.org/10.1007/s10515-012-0108-5
- Mitchell  T. M. Mitchell, Machine Learning, 1st ed. McGraw-Hill, Inc., 1997.
- Rice  J. R. Rice, “The algorithm selection problem,” Advances in Computers, vol. 15, pp. 65 – 118, 1976. [Online]. Available: http://dx.doi.org/10.1016/S0065-2458(08)60520-3
- Ali and Smith  S. Ali and K. A. Smith, “On learning algorithm selection for classification,” Applied Soft Computing, vol. 6, no. 2, pp. 119 – 138, 2006. [Online]. Available: http://dx.doi.org/10.1016/j.asoc.2004.12.002
- Kanda et al.  J. Kanda, A. de Carvalho, E. Hruschka, C. Soares, and P. Brazdil, “Meta-learning to select the best meta-heuristic for the traveling salesman problem: A comparison of meta-features,” Neurocomputing, vol. 205, pp. 393 – 406, 2016. [Online]. Available: https://doi.org/10.1016/j.neucom.2016.04.027
- Cruz et al.  R. M. Cruz, R. Sabourin, and G. D. Cavalcanti, “Meta-des.oracle: Meta-learning and feature selection for dynamic ensemble selection,” Information Fusion, vol. 38, pp. 84 – 103, 2017. [Online]. Available: https://doi.org/10.1016/j.inffus.2017.02.010
- He et al.  Z. He, F. Shu, Y. Yang, M. Li, and Q. Wang, “An investigation on the feasibility of cross-project defect prediction,” Automated Software Engineering, vol. 19, no. 2, pp. 167–199, Jun. 2012. [Online]. Available: http://dx.doi.org/10.1007/s10515-011-0090-3
- Schapire and Singer  R. E. Schapire and Y. Singer, “Boostexter: A boosting-based system for text categorization,” Machine Learning, vol. 39, no. 2, pp. 135–168, 2000. [Online]. Available: http://dx.doi.org/10.1023/A:1007649029923
- Luaces et al.  O. Luaces, J. Díez, J. Barranquero, J. J. del Coz, and A. Bahamonde, “Binary relevance efficacy for multilabel classification,” Prog. in Art. Intel., vol. 1, no. 4, pp. 303–313, 2012. [Online]. Available: http://dx.doi.org/10.1007/s13748-012-0030-x
- Pereira et al.  D. G. Pereira, A. Afonso, and F. M. Medeiros, “Overview of friedman’s test and post-hoc analysis,” Commun. in Stat. - Simul. and Comp., vol. 44, no. 10, pp. 2636–2653, 2015. [Online]. Available: http://dx.doi.org/10.1080/03610918.2014.931971
- Demsar  J. Demsar, “Statistical comparisons of classifiers over multiple data sets,” J. Mach. Learn. Res., vol. 7, pp. 1–30, Dec. 2006.
- Larson  M. G. Larson, “Descriptive statistics and graphical displays,” Circulation, vol. 114, no. 1, pp. 76–81, 2006. [Online]. Available: https://doi.org/10.1161/CIRCULATIONAHA.105.584474
- Peters et al.  F. Peters, T. Menzies, and A. Marcus, “Better cross company defect prediction,” in Proc Work Conf on Mining Soft Rep (MSR2013). IEEE, 2013, pp. 409–418. [Online]. Available: http://dx.doi.org/10.1109/MSR.2013.6624057
- Porto and Simao  F. Porto and A. Simao, “Feature subset selection and instance filtering for cross-project defect prediction - classification and ranking,” CLEI Electronic Journal, vol. 19, no. 3, pp. 4:1–4:17, 2016. [Online]. Available: http://dx.doi.org/10.19153/cleiej.19.3.4
- Xu and Wunsch  R. Xu and D. Wunsch, Clustering. Wiley-IEEE Press, 2009.
- Dunn  J. C. Dunn, “A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters,” Journal of Cybernetics, vol. 3, pp. 32–57, 1974.
- Rousseeuw  P. Rousseeuw, “Silhouettes: A graphical aid to the interpretation and validation of cluster analysis,” J. Comput. Appl. Math., vol. 20, no. 1, pp. 53–65, Nov. 1987. [Online]. Available: http://dx.doi.org/10.1016/0377-0427(87)90125-7
- Zhang and Zhou  M. L. Zhang and Z. H. Zhou, “A review on multi-label learning algorithms,” IEEE Trans. on Knowl. and Data Eng., vol. 26, no. 8, pp. 1819–1837, Aug 2014. [Online]. Available: http://dx.doi.org/10.1109/TKDE.2013.39
- Galar et al.  M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, and F. Herrera, “A review on ensembles for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches,” IEEE Tran. on Sys., Man, and Cyber., Part C (Applications and Reviews), vol. 42, no. 4, pp. 463–484, July 2012. [Online]. Available: http://dx.doi.org/10.1109/TSMCC.2011.2161285
- Breiman  L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001. [Online]. Available: http://dx.doi.org/10.1023/A:1010933404324
- Wu and Zhou  X. Wu and Z. Zhou, “A unified view of multi-label performance measures,” CoRR, vol. abs/1609.00288, 2016. [Online]. Available: http://arxiv.org/abs/1609.00288
- Kalousis and Hilario  A. Kalousis and M. Hilario, “Feature selection for meta-learning,” in Proc. of Pacific-Asia Conf. on Know. Disc. and Data Min. (PAKDD2001). London, UK: Springer-Verlag, 2001, pp. 222–233.
- Kohavi and Sommerfield  R. Kohavi and D. Sommerfield, “Feature subset selection using the wrapper method: Overfitting and dynamic search space topology,” in Proc. of the 1st Int. Conf. on Knowl. Disc. and Data Min., ser. KDD’95. AAAI Press, 1995, pp. 192–197.
- Yang et al.  X. Yang, K. Tang, and X. Yao, “A learning-to-rank approach to software defect prediction,” IEEE Trans on Reliability, vol. 64, pp. 234–246, 2015. [Online]. Available: http://dx.doi.org/10.1109/TR.2014.2370891
- Zimmermann et al.  T. Zimmermann, R. Premraj, and A. Zeller, “Predicting defects for eclipse,” in Proc Int Conf on Pred Mod in Soft Eng (PROMISE2007). IEEE, 2007, pp. 9–. [Online]. Available: http://dx.doi.org/10.1109/PROMISE.2007.10
- Prateek et al.  S. Prateek, A. Pasala, and L. Moreno Aracena, “Evaluating performance of network metrics for bug prediction in software,” in Asia-Pacific Soft Eng Conf (APSEC2013), vol. 1, 2013, pp. 124–131. [Online]. Available: http://dx.doi.org/10.1109/APSEC.2013.27
- Zhang et al.  F. Zhang, Q. Zheng, Y. Zou, and A. E. Hassan, “Cross-project defect prediction using a connectivity-based unsupervised classifier,” in Proc Int Conf on Soft Eng (ICSE2016). New York, NY, USA: ACM, 2016, pp. 309–320. [Online]. Available: http://dx.doi.org/10.1145/2884781.2884839
- Platt  J. C. Platt, “Fast training of support vector machines using sequential minimal optimization,” in Advances in Kernel Methods. MIT Press, 1999, pp. 185–208.
Appendix A Code Metrics
Average Method Complexity (AMC): This metric measures the average method size for each class. Size of a method is equal to the number of Java binary codes in the method.
Cohesion Among Class Methods (CAM): This metric computes the relatedness among the methods of a class based upon the parameter lists of the methods. The metric is computed as the sum of the number of different parameter types in every method, divided by the product of the number of different method parameter types in the whole class and the number of methods.
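The CAM computation described above can be sketched as follows. The function name `cam` and the example parameter lists are hypothetical, and raising an error for classes without methods or without parameter types is an assumption of this sketch; metric tools may handle that case differently.

```python
def cam(method_param_types):
    """CAM from one collection of parameter type names per method."""
    per_method = [set(types) for types in method_param_types]
    all_types = set().union(*per_method)  # distinct parameter types in the class
    n_methods = len(per_method)
    if n_methods == 0 or not all_types:
        # Assumption: treat the metric as undefined in this case.
        raise ValueError("CAM is undefined without methods or parameter types")
    distinct_per_method = sum(len(types) for types in per_method)
    return distinct_per_method / (len(all_types) * n_methods)


# Hypothetical class with three methods: f(int, String), g(int), h().
example_cam = cam([["int", "String"], ["int"], []])
```

Here the sum over methods is 2 + 1 + 0 = 3, the class uses 2 distinct parameter types, and there are 3 methods, so the result is 3 / (2 × 3) = 0.5.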
Afferent couplings (Ca): The Ca metric represents the number of classes that depend upon the measured class.
Efferent couplings (Ce): The Ce metric represents the number of classes that the measured class depends upon.
Coupling Between Methods (CBM): The metric measures the total number of new/redefined methods to which all the inherited methods are coupled. There is a coupling when at least one of the conditions given in the IC metric holds.
Coupling between object classes (CBO): The CBO metric represents the number of classes coupled to a given class (efferent couplings and afferent couplings).
Cyclomatic Complexity (CC): CC is equal to the number of different paths in a method (function) plus one. The McCabe cyclomatic complexity is defined as CC = E - N + P, where E is the number of edges of the graph, N is the number of nodes of the graph, and P is the number of connected components. CC is the only metric computed at the method level, whereas the constructed models make predictions at the class level. Therefore, the metric had to be converted to class-level metrics. Two metrics have been derived:
MAX_CC - the greatest value of CC among methods of the investigated class;
AVG_CC - the arithmetic mean of the CC value in the investigated class.
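The two class-level metrics derived from the method-level CC values can be illustrated with a short sketch. The function name and the CC values are hypothetical, and rejecting a class without methods is an assumption of this sketch.

```python
from statistics import mean


def class_level_cc(method_cc):
    """Aggregate the CC values of the methods of one class."""
    if not method_cc:
        # Assumption: a class without methods has no defined MAX_CC / AVG_CC.
        raise ValueError("at least one method-level CC value is required")
    return {"MAX_CC": max(method_cc), "AVG_CC": mean(method_cc)}


# Hypothetical CC values of the three methods of a class.
cc_metrics = class_level_cc([1, 4, 7])
```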
Data Access Metric (DAM): This metric is the ratio of the number of private (protected) attributes to the total number of attributes declared in the class.
Depth of Inheritance Tree (DIT): The DIT metric provides for each class a measure of the inheritance levels from the object hierarchy top.
Inheritance Coupling (IC): This metric provides the number of parent classes to which a given class is coupled. A class is coupled to its parent class if one of its inherited methods is functionally dependent on the new or redefined methods in the class. A class is coupled to its parent class if one of the following conditions is satisfied:
One of its inherited methods uses an attribute that is defined in a new/redefined method;
One of its inherited methods calls a redefined method;
One of its inherited methods is called by a redefined method and uses a parameter that is defined in the redefined method.
Lack of cohesion in methods (LCOM): The LCOM metric counts the sets of methods in a class that are not related through the sharing of some of the class fields.
Lack of cohesion in methods (LCOM3):
$$\mathrm{LCOM3} = \frac{\left(\frac{1}{a}\sum_{j=1}^{a}\mu(A_j)\right) - m}{1 - m}$$
m - number of methods in a class;
a - number of attributes in a class;
$\mu(A)$ - number of methods that access the attribute A.
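A minimal sketch of the LCOM3 computation follows, assuming the usual Henderson-Sellers form, LCOM3 = ((1/a) Σ μ(A_j) - m) / (1 - m), with m, a, and μ(A) as defined above. The function name, the method names, and the access sets are hypothetical; raising an error when m ≤ 1 or a = 0 is an assumption of this sketch, since the formula is undefined there.

```python
def lcom3(methods, attribute_access):
    """LCOM3 of a class.

    methods: names of the methods of the class.
    attribute_access: {attribute name: set of methods that access it}.
    """
    m = len(methods)
    a = len(attribute_access)
    if m <= 1 or a == 0:
        # Assumption: the metric is undefined here (division by zero).
        raise ValueError("LCOM3 is undefined for m <= 1 or a == 0")
    # Average number of methods that access an attribute.
    mean_access = sum(len(users) for users in attribute_access.values()) / a
    return (mean_access - m) / (1 - m)


methods = ["get", "set", "reset"]
# Every method accesses every attribute: fully cohesive class.
cohesive = lcom3(methods, {"x": {"get", "set", "reset"},
                           "y": {"get", "set", "reset"}})
# Each attribute is accessed by exactly one method.
split = lcom3(methods, {"x": {"get"}, "y": {"set"}})
# No method accesses any attribute.
unused = lcom3(methods, {"x": set(), "y": set()})
```

Under this form, lower values indicate higher cohesion: the three hypothetical classes evaluate to 0, 1, and 1.5, respectively.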
Lines of Code (LOC): The LOC metric calculates the number of lines of code in the Java binary code of the class under investigation.
Measure of Functional Abstraction (MFA): This metric is the ratio of the number of methods inherited by a class to the total number of methods accessible by the member methods of the class.
Measure of Aggregation (MOA): This metric measures the extent of the part-whole relationship, realized by using attributes. The metric is a count of the number of class fields whose types are user defined classes.
Number of Children (NOC): The NOC metric simply measures the number of immediate descendants of the class.
Number of Public Methods (NPM): The NPM metric counts all the methods in a class that are declared as public.
Response for a Class (RFC): The RFC metric measures the number of different methods that can be executed when an object of that class receives a message.
Weighted methods per class (WMC): The value of the WMC is equal to the number of methods in the class (assuming unity weights for all methods).
For a detailed explanation see .
Appendix B Classifiers
Random Forest (RF): RF is based on a collection of decision trees where each tree is grown from a bootstrap sample (randomly sampling the data with replacement). The best split for each node is searched within a randomly chosen subset of attributes. This characteristic produces a collection of trees with different biases. The final prediction for a new example is given by the majority vote of all trees. This algorithm is robust to redundant and irrelevant attributes, although it can produce overfitted models.
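The three ingredients described above (bootstrap sampling, random attribute subsets, and majority voting) can be sketched in a few lines. This is a deliberately simplified illustration and not the implementation evaluated in the study: depth-one trees (decision stumps) stand in for full decision trees, the attribute subset is drawn once per tree rather than once per node, and all names (`fit_stump`, `fit_forest`, `predict_forest`) and the toy data are hypothetical.

```python
import random
from collections import Counter


def majority(labels):
    """Most frequent label in a list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(X, y, features):
    """Best single-threshold split among the candidate features."""
    best = None  # (errors, feature, threshold, left_label, right_label)
    for f in features:
        values = sorted({row[f] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            l_lab, r_lab = majority(left), majority(right)
            errors = (sum(lab != l_lab for lab in left)
                      + sum(lab != r_lab for lab in right))
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:
        # Candidate features are constant in this sample: predict the majority.
        lab = majority(y)
        return (features[0], float("inf"), lab, lab)
    return best[1:]


def fit_forest(X, y, n_trees=25, n_features=1, seed=0):
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        feats = rng.sample(range(d), n_features)    # random attribute subset
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict_forest(forest, row):
    """Majority vote over the predictions of all trees."""
    votes = [l if row[f] <= t else r for f, t, l, r in forest]
    return majority(votes)


# Toy data: label 1 = defective, 0 = not defective (hypothetical values).
X = [(1, 1), (2, 1), (1, 2), (8, 9), (9, 8), (9, 9)]
y = [0, 0, 0, 1, 1, 1]
forest = fit_forest(X, y)
```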
Support Vector Machines (SVM): SVM aims to find the optimal hyperplane that maximally separates samples of two different classes. This classifier is also robust to redundant and irrelevant attributes since the number of attributes does not affect the complexity of an SVM model. In this study, we evaluate an SVM variation called Sequential Minimal Optimization (SMO). This technique optimizes SVM training by dividing the large Quadratic Programming (QP) problem into a series of the smallest possible QP problems.
Multilayer Perceptron (MLP): MLP is a neural network model trained with the back-propagation algorithm. The MLP consists of three or more layers of nodes (an input layer, an output layer, and one or more hidden layers) arranged in a directed graph. Each layer is fully connected to the next one. The nodes (except for the input nodes) are neurons (or processing elements) with a nonlinear activation function. The weights of each node in the network are iteratively updated in an attempt to minimize a loss function calculated from the output layer.
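The layered, fully connected structure can be illustrated with a forward pass only; the weight updates performed by back-propagation are not shown. The sketch assumes a sigmoid activation, and the function names and weight values are hypothetical, chosen purely for illustration.

```python
import math


def sigmoid(z):
    """Nonlinear activation applied by every non-input node."""
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases):
    """One fully connected layer: each neuron receives every input."""
    return [
        sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
        for ws, b in zip(weights, biases)
    ]


def mlp_forward(x, layers):
    """Propagate an input vector through the layers in order."""
    activation = x
    for weights, biases in layers:
        activation = dense(activation, weights, biases)
    return activation


# Hypothetical 2-2-1 network: two inputs, one hidden layer, one output neuron.
layers = [
    ([[0.5, -0.4], [0.3, 0.8]], [0.0, -0.1]),  # hidden layer (2 neurons)
    ([[1.2, -0.7]], [0.05]),                   # output layer (1 neuron)
]
output = mlp_forward([1.0, 2.0], layers)
```

Training would repeat this pass, compare `output` with the known label through a loss function, and adjust the weights accordingly.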
C4.5 (C45): C45 is a decision tree algorithm that extends Quinlan’s earlier ID3 algorithm. A decision tree is a collection of decision rules defined in each node. The tree is generated by associating to each node the attribute that most effectively divides the set of training data. The splitting criterion is the normalized information gain. The classification of examples is performed by following a path through the tree from the root to a leaf node, where a class value is taken.
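The normalized information gain (often called gain ratio) can be sketched as the information gain of a split divided by the entropy of the split itself. The sketch below covers only categorical attributes, which is a simplifying assumption; the function names and the toy values are hypothetical.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of the value distribution in a list."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())


def gain_ratio(values, labels):
    """Information gain of splitting on `values`, normalized by split entropy."""
    n = len(labels)
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(v, []).append(lab)
    # Expected entropy of the labels after the split.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(values)
    # Assumption: an attribute with a single value gets a ratio of 0.
    return info_gain / split_info if split_info > 0 else 0.0


labels = ["defective", "defective", "clean", "clean"]
informative = gain_ratio(["high", "high", "low", "low"], labels)
uninformative = gain_ratio(["a", "b", "a", "b"], labels)
```

An attribute that separates the classes perfectly scores 1 here, while one that leaves the class distribution unchanged scores 0, so the tree builder would pick the former for the node.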
Naive Bayes (NB): NB is a simple statistical algorithm based on Bayes’ Theorem. In the defect prediction context, this theorem can be defined as follows:
$$P(c \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid c)\,P(c)}{P(\mathbf{x})}$$
where $c$ is an element of the set of class values (defective or not-defective), $\mathbf{x}$ is an attribute vector, $P(c)$ and $P(\mathbf{x})$ are respectively the prior probabilities that $c$ and $\mathbf{x}$ occur, and $P(\mathbf{x} \mid c)$ is the probability of $\mathbf{x}$ given $c$. These probabilities are estimated from the training set. The theorem calculates the posterior probability of $c$ given that $\mathbf{x}$ is true. The term “naïve” is due to its assumption that the attributes are independent of each other given the class. Although this assumption is not always true, this algorithm has been reported as an efficient classifier for defect prediction.
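A minimal sketch of this computation for categorical attributes follows, writing c for a class value and x for the attribute vector. Under the independence assumption, the probability of x given c is taken as the product of the per-attribute probabilities. Probabilities are plain relative frequencies with no smoothing, which is a simplification; the function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def fit_nb(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    prior = Counter(labels)
    cond = defaultdict(Counter)  # (class, attribute index) -> value counts
    for row, c in zip(rows, labels):
        for j, v in enumerate(row):
            cond[(c, j)][v] += 1
    return prior, cond, len(labels)


def posterior(model, row):
    """Posterior probability of each class given the attribute vector."""
    prior, cond, n = model
    joint = {}
    for c, count in prior.items():
        p = count / n  # prior probability of the class
        for j, v in enumerate(row):
            p *= cond[(c, j)][v] / count  # naive independence assumption
        joint[c] = p
    evidence = sum(joint.values())  # probability of the attribute vector
    if evidence == 0:
        raise ValueError("attribute combination unseen for every class")
    return {c: p / evidence for c, p in joint.items()}


# Hypothetical training set: (complexity, size) -> class.
rows = [("high", "large"), ("high", "small"), ("low", "large"),
        ("low", "small"), ("high", "small"), ("low", "large")]
labels = ["defective", "defective", "defective", "clean", "clean", "clean"]
post = posterior(fit_nb(rows, labels), ("high", "large"))
```

For this toy data the posterior favors the defective class for a module with high complexity and large size.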
Appendix C General AUC Performances