Building Better Quality Predictors Using "ε-Dominance"

03/13/2018 ∙ by Wei Fu, et al. ∙ NC State University

Despite extensive research, many methods in software quality prediction still exhibit some degree of uncertainty in their results. Rather than treating this as a problem, this paper asks if this uncertainty is a resource that can simplify software quality prediction. For example, Deb's principle of ϵ-dominance states that if there exists some ϵ value below which it is useless or impossible to distinguish results, then it is superfluous to explore anything less than ϵ. We say that for "large ϵ problems", the results space of learning effectively contains just a few regions. If many learners are then applied to such large ϵ problems, they would exhibit a "many roads lead to Rome" property; i.e., many different software quality prediction methods would generate a small set of very similar results. This paper explores DART, an algorithm especially selected to succeed for large ϵ software quality prediction problems. DART is remarkably simple yet, on experimentation, it dramatically out-performs three sets of state-of-the-art defect prediction methods. The success of DART for defect prediction raises the question: how many other domains in software quality prediction can also be radically simplified? This will be a fruitful direction for future work.




1. Introduction

This paper presents DART, a novel method for simplifying supervised learning for defect prediction. DART produces tiny, easily comprehensible models (5 lines of very simple rules) and, in principle, DART could be applied to many domains in software quality prediction. When tested on software defect prediction, this method dramatically out-performs three recent state-of-the-art studies (Ghotra et al., 2015; Fu et al., 2016a; Agrawal and Menzies, 2018).

DART was designed by working backward from known properties of software quality prediction problems. Such problems exhibit a “many roads lead to Rome” property; i.e., many different data mining algorithms generate a small set of very similar results. For example, Lessmann et al. reported that 17 of 22 studied data mining algorithms for defect prediction had statistically indistinguishable performance (Lessmann et al., 2008a). Also, Ghotra et al. reported that the performance of 32 data mining algorithms for defect prediction clustered into just four groups (Ghotra et al., 2015).

This paper asks what can be learned from the above examples. We note that learners have a “results space”; i.e., values for various performance metrics such as recall and false alarm. Next, we ask what “shape” of results space leads to “many roads”? Also, given those “shapes”, do we need complex data miners? Or can we reverse engineer from that space a much simpler kind of software quality predictor?

To answer these questions we apply ε-dominance (Deb et al., 2005). Deb’s principle of ε-dominance states that if there exists some ε value below which it is useless or impossible to distinguish results, then

It is superfluous to explore anything less than ε.

We say that for “large ε problems”, the results space of learning effectively contains just a few regions. In such simple result spaces, a few DARTs thrown around the output space would sample the results just as well, or better, than more complex methods.

To test if ε-dominance simplifies software quality prediction, this paper compares DART-ing around the results space against three defect prediction systems:

  1. The algorithms surveyed in a recent ICSE’15 paper (Ghotra et al., 2015);

  2. A hyper-parameter optimization method proposed in 2016 in the IST journal (Fu et al., 2016a);

  3. A search-based data pre-processing method presented at ICSE’18 (Agrawal and Menzies, 2018).

These three were chosen since they reflect the state-of-the-art in software quality defect prediction. Also, the second and third items in this list are CPU-intensive systems that require days of computing time to execute data mining algorithms many times to find good configurations. Comparing something as simple as DART to these complex systems lets us critically assess the value of elaborate cloud computing environments for software quality prediction.

What we will see is that a small number of DARTs dramatically out-performs these three systems. This suggests that, at least for our data, much of the complexity associated with hyper-parameter optimization is not required. We conjecture that a few DARTs succeed so well since the results space for defect prediction exhibits the large ε property. We also conjecture that prior state-of-the-art algorithms fail against DART since their models do not spread out over the results space. DART, on the other hand, works so well since it knows how to spread its models across a large results space.

| Metric | Name | Description |
|---|---|---|
| amc | average method complexity | Number of Java byte codes |
| avg_cc | average McCabe | Average McCabe’s cyclomatic complexity seen in class |
| ca | afferent couplings | How many other classes use the specific class. |
| cam | cohesion amongst classes | #different method parameters types divided by (#different method parameter types in a class)*(#methods). |
| cbm | coupling between methods | Total number of new/redefined methods to which all the inherited methods are coupled |
| cbo | coupling between objects | Increased when the methods of one class access services of another. |
| ce | efferent couplings | How many other classes are used by the specific class. |
| dam | data access | Ratio of private (protected) attributes to total attributes |
| dit | depth of inheritance tree | The maximum length from the node to the root of the tree |
| ic | inheritance coupling | Number of parent classes to which a given class is coupled (includes counts of methods and variables inherited) |
| lcom | lack of cohesion in methods | Number of pairs of methods that do not share a reference to an instance variable. |
| lcom3 | another lack of cohesion measure | If $m$ and $a$ are the numbers of methods and attributes in a class, and $\mu(a)$ is the number of methods accessing an attribute, then $lcom3 = \left(\frac{1}{a}\sum_{j=1}^{a}\mu(a_j) - m\right)/(1 - m)$. |
| loc | lines of code | Total lines of code in this file or package. |
| max_cc | maximum McCabe | Maximum McCabe’s cyclomatic complexity seen in class |
| mfa | functional abstraction | Number of methods inherited by a class plus number of methods accessible by member methods of the class |
| moa | aggregation | Count of the number of data declarations (class fields) whose types are user defined classes |
| noc | number of children | Number of direct descendants (subclasses) for each class |
| npm | number of public methods | Counts all the methods in a class that are declared as public. |
| rfc | response for a class | Number of methods invoked in response to a message to the object. |
| wmc | weighted methods per class | A class with more member functions than its peers is considered to be more complex and therefore more error prone |
| defect | defect | Boolean: whether defects were found in post-release bug-tracking systems. |

Table 1. OO Measures used in our defect data sets.

The rest of this paper is structured as follows. §2 introduces the SE case studies explored in this paper (defect prediction) as well as different approaches and evaluation criteria. It also discusses the many sources of variability inherent in software quality predictors. In summary, between the raw data and the conclusions there are many choices, some of which are stochastic (e.g., the random number generators that control test suite selection). All these choices introduce ε, a degree of uncertainty in the conclusions. §3 discusses ε-domination for software quality prediction and proposes DART, a straightforward ensemble method that can quickly sample a results space that divides into a few ε-sized regions. §4 describes the experimental details in this study. §5 checks our conjecture. It will be seen that DART dramatically out-performs state-of-the-art defect prediction algorithms, hyper-parameter tuning algorithms, and data pre-processors.

Based on these results, we will argue in the conclusion that it is time to consider a fundamentally different approach to software quality prediction. Perhaps it is time to stop fretting about the numerous options available for selecting data pre-processing methods or machine learning algorithms, then configuring their controlling parameters. The results of this paper suggest that most of those decisions are superfluous since so many methods result in the same output. Accordingly, we recommend doing something like DART; i.e.

first reason about the results space before selecting an appropriate data mining technology.

One caveat on all these results is that this paper has explored only one domain; i.e., software defect prediction. As to other domains, in as-yet-unpublished experiments, we have initial results suggesting that this simplification might also work elsewhere (e.g., text mining of programmer comments in Stackoverflow; predicting Github issue close time; and detecting programming bad smells). While those results are promising, they are still preliminary.

That said, the success of this simplification method for defect prediction raises the question: how many other domains in software quality prediction can also be radically simplified? This will be a fruitful direction for future work.

Note that all the data and scripts used in this study are freely available online for use by other researchers (URL blinded for review).

2. Background and Related Work

2.1. Why Study Simplification?

In this section, we argue it is important to study methods for simplifying quality predictors.

We study simplicity since it is very useful to replace $N$ methods with $M \ll N$ methods, especially when the results from the many are no better than the few. A bewildering array of new methods for software quality prediction is reported each year (some of which rely on intimidatingly complex mathematical methods) such as deep belief net learning (Wang et al., 2016), spectral-based clustering (Zhang et al., 2016), and n-gram language models (Ray et al., 2016). Ghotra et al. list dozens of different data mining algorithms that might be used for defect predictors (Ghotra et al., 2015). Fu and Menzies argue that these algorithms might require extensive tuning (Fu et al., 2016a). There are many ways to implement that tuning, some of which are very slow (Tantithamthavorn et al., 2016). And if that were not enough, other computationally expensive methods might also be required to handle issues like (say) class imbalance (Agrawal and Menzies, 2018).

Given recent advances in cloud computing, it is possible to find the best method for a particular data set via a “shoot out” between different methods. For example, Lessmann et al. (Lessmann et al., 2008a) and Ghotra et al. (Ghotra et al., 2015) explored 22 and 32 different learning algorithms (respectively) for software quality defect prediction. Such studies may require days to weeks of CPU time to complete (Fu and Menzies, 2017a). But are such complex and time-consuming studies necessary?

  • If there exists some way to dramatically simplify software quality predictors, then those cloud-based resources would be better used for other tasks.

  • Also, Lessmann et al. and Ghotra et al. (Lessmann et al., 2008a; Ghotra et al., 2015) report that many defect prediction methods have equivalent performance.

  • Further, as we show here, there are very simple methods that perform even better than the methods studied by Lessmann et al. and Ghotra et al.

| Rank | Classifier | Notes |
|---|---|---|
| 1 (best) | Random Forest (RF) | Generate conclusions using multiple entropy-based decision trees. |
| | Logistic Regression (SL) | Map the output of a regression into the range 0 to 1, thus enabling the use of regression for classification (e.g., defective if the output exceeds a threshold). |
| 2 | K Nearest Neighbors (KNN) | Classify a new instance by finding “k” examples of similar instances. Ghotra et al. suggested a specific value of k. |
| | Naive Bayes (NB) | Classify a new instance by (a) collecting mean and standard deviations of attributes in old instances of different classes; (b) returning the class whose attributes are statistically most similar to the new instance. |
| 3 | Decision Trees (DT) | Recursively divide data by selecting attribute splits that reduce the entropy of the class distribution. |
| | Expectation Maximization (EM) | This clustering algorithm uses iterative sampling and repair to derive a parametric expression for each class. |
| 4 (worst) | Support Vector Machines (SMO) | Map the raw data into a higher-dimensional space where it is easier to distinguish the examples. |

Table 2. Classifiers used in this study. Rankings from Ghotra et al. (Ghotra et al., 2015).

Another reason to study simplification is that studies can reveal the underlying nature of seemingly complex problems. In terms of core science, we argue that the better we understand something, the better we can match tools to SE. Tools which are poorly matched to task are usually complex and/or slow to execute. DART seems a better match for the tasks explored in this paper since it is neither complex nor slow. Hence, we argue that DART is interesting in terms of its core scientific contribution to SE quality prediction.

Seeking simpler and/or faster solutions is not just theoretically interesting. It is also an approach currently in vogue in contemporary software engineering. Calero and Piattini (Calero and Piattini, 2015) comment that “redesign for greater simplicity” also motivates much contemporary industrial work. In their survey of modern SE companies, they find that many current organizational redesigns are motivated (at least in part) by arguments based on “sustainability” (i.e., using fewer resources to achieve results). According to Calero and Piattini, sustainability is now a new source of innovation. Managers use sustainability-based redesigns to explore cost-cutting opportunities. In fact, they say, sustainability is now viewed by many companies as a mechanism for gaining a competitive advantage over their competitors. Hence, a manager might ask a programmer to assess methods like DART as a technique to generate more interesting products.

For all these reasons, we assert that it is high time to explore how to simplify software quality prediction methods.

2.2. Why Study Defect Prediction?

The particular software quality predictor explored here is software defect prediction. This section argues that this is a useful area of research, worthy of exploration and simplification.

Software developers are smart, but sometimes make mistakes. Hence, it is essential to test software before deployment (Orso and Rothermel, 2014; Barr et al., 2015; Yoo and Harman, 2012; Myers et al., 2011). Testing is an expensive process. Software assessment budgets are finite while assessment effectiveness increases exponentially with assessment effort (Fu et al., 2016a). Therefore, the standard practice is to apply the best available methods on code sections that seem most critical and bug-prone.

Many researchers find that software bugs are not evenly distributed across a project (Hamill and Goseva-Popstojanova, 2009; Koru et al., 2009; Ostrand et al., 2004; Misirli et al., 2011). Ostrand et al. (Ostrand et al., 2004) studied AT&T software projects and found that, after sorting files by size, 70% to 90% of the bugs could be found in the first 20% of the files. Hamill et al. (Hamill and Goseva-Popstojanova, 2009) investigated common trends in software faults and reported that 80% of the faults occurred in 20% of the files in the GCC project. Based on these findings on software defect distribution, a smart way to perform software testing is to allocate most of the assessment budget to the more defect-prone parts of a software project. Software defect predictors implement this strategy: they explore the software project and sample the most defect-prone files/modules/commits. While software defect predictors are never 100% correct, they can be used to suggest where to focus more expensive methods.

Software defect predictors have been proven useful in many industrial settings. Misirli et al. (Misirli et al., 2011) built a defect prediction model based on a Naive Bayes classifier for a telecommunications company. Their results show that defect predictors can predict 87 percent of code defects, decrease inspection efforts by 72 percent, and hence reduce post-release defects by 44 percent. Kim et al. (Kim et al., 2015) applied a defect prediction model, REMI, to the API development process at Samsung Electronics. They reported that REMI predicted the bug-prone APIs with reasonable accuracy (0.681 F1 score) and reduced the resources required for executing test cases.

Software defect predictors not only save labor compared with traditional manual methods, but they are also competitive with certain automatic methods. In a study at ICSE’14, Rahman et al. (Rahman et al., 2014) compared (a) static code analysis tools FindBugs, Jlint, and PMD and (b) static code defect predictors (which they called “statistical defect prediction”) built using logistic regression. They found no significant differences in the cost-effectiveness of these approaches. Given this equivalence, it is significant to note that static code defect prediction can be quickly adapted to new languages by building lightweight parsers that find information like Table 1. The same is not true for static code analyzers - these need extensive modification before they can be used in new languages.

2.3. Different Defect Prediction Approaches

Over the past decade, defect prediction has attracted much attention from the software research community. There are many different types of defect predictors according to the metrics used for building models:

  • Module-level based defect predictors, which use the complexity of software project, like McCabe metrics, Halstead’s effort metrics and CK object-oriented code metrics (Chidamber and Kemerer, 1994; Kafura and Reddy, 1987; McCabe, 1976) of Table 1.

  • Just-in-time (JIT) defect prediction on change level, which utilizes the change metrics collected from the software code (Kamei et al., 2013; Kim et al., 2008; Fu and Menzies, 2017b; Mockus and Weiss, 2000; Yang et al., 2016).

  • The first two points represent much of the work in this area. For completeness, we add there are numerous other kinds of defect predictors based on many and varied other methods. For example, Ray et al. propose a defect prediction method using n-gram language models (Ray et al., 2016). Other work argues that process metrics are more important than the product metrics mentioned in the last two points (Rahman and Devanbu, 2013).

Figure 1. Illustration of the $P_{opt}$ metric.
| Rank | Classification algorithm |
|---|---|
| 1 (best) | Rsub+J48, SL, Rsub+SL, Bag+SL, LMT, RF+SL, Bag+LMT, Rsub+LMT, RF+LMT, RF+J48 |
| 2 | RBFs, Bag+J48, Ad+SL, KNN, RF+NB, Ad+LMT, NB, Rsub+NB, Bag+NB |
| 3 | Ripper, J48, Ad+NB, Bag+SMO, EM, Ad+SMO, Ad+J48 |
| 4 (worst) | RF+SMO, Rsub+SMO, SMO, Ridor |

Table 3. 32 defect predictors clustered by their performance rank by Ghotra et al. (using a Scott-Knott statistical test) (Ghotra et al., 2015).

Defect prediction models can be built via a variety of machine learning algorithms such as Decision Tree, Random Forests, SVM, Naive Bayes and Logistic Regression (Khoshgoftaar and Allen, 2001; Khoshgoftaar and Seliya, 2003; Khoshgoftaar et al., 2000; Menzies et al., 2007b; Lessmann et al., 2008b; Hall et al., 2012). Ghotra et al. (Ghotra et al., 2015) compared various classifiers for defect prediction (for notes on a sample of those classifiers, see Table 2). According to their study, the prediction performances of classifiers group into the four clusters of Table 3. One advantage of this result is that, to sample across the space of prior defect prediction work (including some state-of-the-art methods), researchers need only select one learner from each group.

To improve the performance of defect predictors, Fu et al. (Fu et al., 2016a, b) and Tantithamthavorn et al. (Tantithamthavorn et al., 2016) recommended improving standard defect predictors, like Random Forests and CART, by performing hyper-parameter tuning. Results from both research groups confirm that hyper-parameter tuning can dramatically improve defect predictors. In other tuning work, Agrawal et al. (Agrawal and Menzies, 2018) argued that better data is better than a better learner: their results show that defect predictors can be improved a lot by changing the distribution of defective and non-defective examples seen during training.

For this paper, to compare the performance of our proposed method, DART, with the state-of-the-art defect prediction methods, we picked one classification technique at random from each group of Table 3: SL, NB, EM, and SMO. Furthermore, we adopted techniques from Fu et al.  (Fu et al., 2016a, b) and Agrawal et al. (Agrawal and Menzies, 2018) to investigate how DART performs compared to the improved (more sophisticated) defect prediction techniques. Note that Agrawal et al. also selected different classification techniques from Ghotra et al. study (Ghotra et al., 2015; Agrawal and Menzies, 2018). Hence, using these classifiers, we can also compare our results to the experiments of Agrawal et al (Agrawal and Menzies, 2018).

2.4. Evaluation Criteria

Figure 2. Illustration of the dis2heaven metric.

In defect prediction literature, once a learner is executed, the results must be scored. Recall measures the percentage of defective modules found by a model generated by the learner. False alarm reports how many non-defective modules the learner reports as defective. $P_{opt}$ measures how much code some secondary quality assurance method would have to inspect after the learner has terminated.

$P_{opt}$ is defined as $1 - \Delta_{opt}$, where $\Delta_{opt}$ is the area between the effort (code-churn-based) cumulative lift charts of the optimal learner and the proposed learner (as shown in Figure 1). To calculate $P_{opt}$, we divide all the code modules into those predicted to be defective ($D$) or not ($N$). Both sets are then sorted in ascending order of lines of code. The two sorted sets are then laid out across the x-axis, with $D$ before $N$. This layout means that the x-axis extends from 0 to 100% where lower values of $x$ are predicted to be more defective than higher values. On such a chart, the y-axis shows what percent of the defects would be recalled if we traverse the code in that x-axis order. According to Kamei et al. and Yang et al. (Yang et al., 2016; Kamei et al., 2013; Monden et al., 2013), $P_{opt}$ should be normalized as follows:

$$P_{opt}(m) = 1 - \frac{S(optimal) - S(m)}{S(optimal) - S(worst)} \qquad (1)$$

where $S(optimal)$, $S(m)$ and $S(worst)$ represent the area under the curve of the optimal learner, proposed learner, and worst learner, respectively. Note that the worst model is built by sorting all the changes according to the actual defect density in ascending order. A learner performs better than a random predictor only if its $P_{opt}$ is greater than 0.5.

Figure 3. Results space when recall is greater than false alarms (see blue curve).

Note that these measures are closely inter-connected. Recall appears as the dependent variable of $P_{opt}$. Also, false alarms result in flat regions of the $P_{opt}$ curve. Further, for useful learners, recall is greater than the false alarm. Such learners have the characteristic shape of Figure 3.
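The normalization described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' script: it assumes effort is measured in lines of code, that each module is a hypothetical `(loc, bugs, predicted_defective)` tuple, and that areas are computed with the trapezoid rule.

```python
def area_under_lift(ordered):
    """Area under the curve of %defects recalled (y) vs. %code inspected (x)."""
    total_loc = sum(loc for loc, _ in ordered)
    total_bugs = sum(bugs for _, bugs in ordered)
    x = y = area = 0.0
    for loc, bugs in ordered:
        next_x = x + loc / total_loc
        next_y = y + bugs / total_bugs
        area += (next_x - x) * (y + next_y) / 2  # trapezoid rule
        x, y = next_x, next_y
    return area


def p_opt(modules):
    """modules: list of (loc, bugs, predicted_defective) tuples (assumed layout)."""
    def s(ms):
        return area_under_lift([(loc, bugs) for loc, bugs, _ in ms])

    # Proposed learner: predicted-defective modules first, each group by ascending LOC.
    proposed = sorted(modules, key=lambda m: (not m[2], m[0]))
    # Optimal and worst orderings: actual defect density, descending and ascending.
    optimal = sorted(modules, key=lambda m: m[1] / m[0], reverse=True)
    worst = sorted(modules, key=lambda m: m[1] / m[0])
    # Undefined (division by zero) if every module has the same defect density.
    return 1 - (s(optimal) - s(proposed)) / (s(optimal) - s(worst))


perfect = [(10, 1, True), (20, 0, False), (30, 1, True), (40, 0, False)]
inverted = [(loc, bugs, not flag) for loc, bugs, flag in perfect]
print(p_opt(perfect), p_opt(inverted))
```

In this toy data, a predictor that flags exactly the buggy modules reproduces the optimal ordering, while inverting every prediction pushes the score towards the worst ordering.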

In the following, we will assess results using $P_{opt}$ and another measure, “distance to heaven” (denoted dis2heaven), that computes the distance of some (recall, false alarm) pair to the ideal “heaven” point of recall = 1 and false alarm = 0, as shown in Figure 2. This measure is defined in Equation 2:

$$dis2heaven = \frac{\sqrt{(1 - Recall)^2 + FalseAlarm^2}}{\sqrt{2}} \qquad (2)$$

The denominator of this equation means that $0 \le dis2heaven \le 1$. Note that:

  • For $P_{opt}$, the larger values are better;

  • For dis2heaven, the smaller values are better.

We use these measures instead of, say, precision or the F1 measure (harmonic mean of precision and recall) since Menzies et al. (Menzies et al., 2007a) warn that precision can be very unstable for SE data (where the class distributions may be highly unbalanced).
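As a small worked example, the dis2heaven measure can be computed directly from a confusion matrix. The sketch below assumes dis2heaven is the Euclidean distance from a (recall, false alarm) pair to the heaven point (1, 0), divided by √2 so the result lies between 0 and 1; the function and argument names are illustrative only.

```python
import math


def recall_and_false_alarm(tp, fp, fn, tn):
    """Recall = TP/(TP+FN); false alarm = FP/(FP+TN). Empty denominators yield 0."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    false_alarm = fp / (fp + tn) if fp + tn else 0.0
    return recall, false_alarm


def dis2heaven(recall, false_alarm):
    """Normalized distance to the ideal point recall=1, false alarm=0 (smaller is better)."""
    return math.sqrt((1 - recall) ** 2 + false_alarm ** 2) / math.sqrt(2)


# Hypothetical confusion matrix: 40 defects found, 10 missed, 20 false alarms, 130 true negatives.
r, f = recall_and_false_alarm(tp=40, fp=20, fn=10, tn=130)
print(round(dis2heaven(r, f), 3))
```

A perfect learner scores 0, while the worst possible learner (recall 0, false alarm 1) scores 1.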

2.5. Sources of Uncertainty

This section argues that there is inherent uncertainty in making conclusions via software quality predictors. The rest of this paper exploits that uncertainty to simplify defect prediction.

Given the divergent nature of software projects and software developers, it is to be expected that different researchers find different effects, even from the same data sets (Hosseini et al., 2017). According to Menzies et al. (Menzies and Shepperd, 2012), the conclusion uncertainty of software quality predictors comes from different choices in the training data and many other factors.

Sampling Bias:

Any data mining algorithm needs to input multiple examples to make its conclusions. The more diverse the input examples, the greater the variance in the conclusions. And software engineering is a very diverse discipline:

  • The software is built by engineers with varying skills and experience.

  • That construction process is performed using a wide range of languages and tools.

  • The completed product is delivered on lots of platforms.

  • All those languages, tools and platforms keep changing and evolving.

  • Within one project, the problems faced and the purpose served by each software module may be very different (e.g., GUI, database, network connections, business logic, etc.).

  • Within the career of one developer, the problem domain, goal, and stakeholders of their software can change dramatically from one project to the next.

Pre-processing: Real world data usually requires some clean-up before it can be used effectively by data mining algorithms. There are many ways to transform the data favorably. Numeric data may be discretized into a small number of bins. Discretization can greatly affect the results of the learning since there are many ways to implement it (Fayyad and Irani, 1993). Feature selection is sometimes useful to prune features that are uncorrelated with the target variable (Chen et al., 2005). It can also be helpful to prune data points that are very noisy or are outliers (Kocaguneli et al., 2010). The effects of pre-processing can be quite dramatic. For example, Agrawal et al. (Agrawal and Menzies, 2018) report that their pre-processing technique (SMOTUNED) increased both AUC and recall. Note that the choices made during pre-processing can introduce some variability in the final results.

Stochastic algorithms: Numerous methods in software quality prediction employ stochastic algorithms that use random number generators. For example, the key to scalability is usually to (a) build a model on a small, randomly selected part of the data, then (b) see how well that works over the rest of the data (Sculley, 2010). Also, when evaluating data mining algorithms, it is standard practice to divide the data randomly into several bins as part of a cross-validation experiment (Witten and Frank, 2002). For all these stochastic algorithms, the conclusions are affected, to some extent, by the random numbers used in the processing.
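To make the last point concrete, the following sketch shows how the random seed alone changes which examples land in which cross-validation bin. It is an illustration only: `random_bins` is a hypothetical helper, not part of any study script, and the integers stand in for data rows.

```python
import random


def random_bins(rows, k, seed):
    """Shuffle the rows with the given seed, then deal them into k bins."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


rows = list(range(20))            # stand-ins for 20 data rows
bins_a = random_bins(rows, k=5, seed=1)
bins_b = random_bins(rows, k=5, seed=2)

# Same data, same k, different seed: each learner now trains and tests
# on different subsets, so its scores can differ from run to run.
print(bins_a[0], bins_b[0])
```

Any score computed per bin inherits this variation, which is one source of the uncertainty discussed in this section.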

Many methods have been proposed to reduce the above uncertainties such as feature selection to remove spurious outliers (Menzies et al., 2007b), application of background knowledge to constrain model generation (Fenton and Neil, 2012), optimizers to tune model parameters to reduce uncertainty (Agrawal et al., 2018). Despite this, some uncertainty usually remains (see an example, next section).

We conjecture that, for all the above reasons, uncertainty is an inherent property of software quality prediction. If so, the question becomes, “what to do with that uncertainty?”. The starting point for this paper was the following speculation: Instead of striving to make ε = 0, use ε as a tool for simplifying software quality predictors. The next section describes such a tool.

3. ε-Domination

From the above, we assert that software quality prediction results collected on the same data will vary by some amount ε. As mentioned in the introduction, Deb’s principle of ε-dominance (Deb et al., 2005) states that if there exists some ε value below which it is useless or impossible to distinguish results, then it is superfluous to explore anything less than ε.

Figure 4. Grids in results space.

Note that ε effectively clusters the space of possible results. For example, consider the result space defined by recall $r$ and false alarms $f$. Both these measures have the range $0 \le r \le 1$ and $0 \le f \le 1$. If ε = 0.2, then the results space of possible recalls and false alarms divides into the 5*5 grid of Figure 4.

Figure 3 showed that the results from useful learners have a characteristic shape where recall is greater than false alarms. That is, in Figure 4, such results avoid the red regions of that grid (where false alarms are higher than recall) and the gray regions (also called the “no-information” region where recall is the same as a false alarm).

This means that when ε = 0.2, then (a) the recall-vs-false alarm results space is effectively just the ten green cells of Figure 4; and (b) “many roads lead to Rome” (i.e., if the results of 100 learners were placed on this grid, then there could never be more than 10 groups of results).
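The cell counts above can be checked with a few lines of Python. This sketch assumes the grid is square with 1/ε divisions per axis, that green cells are those where the recall bin is above the false alarm bin, that gray cells lie on the diagonal, and that red cells are the rest; the function name is illustrative.

```python
def count_cells(eps):
    """Count green, gray and red cells in a recall-vs-false-alarm grid of cell size eps."""
    n = round(1 / eps)  # divisions per axis
    counts = {"green": 0, "gray": 0, "red": 0}
    for recall_bin in range(n):
        for alarm_bin in range(n):
            if recall_bin > alarm_bin:
                counts["green"] += 1   # recall higher than false alarm
            elif recall_bin == alarm_bin:
                counts["gray"] += 1    # "no-information" diagonal
            else:
                counts["red"] += 1     # false alarm higher than recall
    return counts


print(count_cells(0.2))  # {'green': 10, 'gray': 5, 'red': 10}
```

Under these assumptions the count of useful cells is n(n-1)/2, so larger values of ε shrink the number of distinguishable results quickly (e.g., only 3 green cells at ε = 1/3).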

It turns out that real-world results spaces are more complicated than shown in Figure 4. For example, consider the results space of Figure 5. In this figure, 100 times, a defect predictor was built for LUCENE, an open-source Java text search engine. Random Forests was used to build the defect predictor using 90% of the data, then tested on the remaining 10% (Random Forests are a multi-tree classifier, widely used in defect prediction; see Table 2).

To compute ε in this results space, we divide the x-axis into divisions of 0.1 and report the standard deviation σ of recall in each division. For the moment, we use a simple t-test to infer the separation required to distinguish two results within Figure 5 (later in this section, we will dispense with that assumption). This means that ε is set, from σ, to the range required to be 95% confident that two distributions are different (Witten and Frank, 2005).

The main result of Figure 5 is that ε is often very large. The blue curve shows that 70% of the results occur in the region $x \le 0.2$. In that region, ε ≥ 0.2; i.e., most of our models have an ε of 0.2 or higher. Note that learners with ε ≥ 0.2 divide into the 25 cells, or less, of Figure 4. More specifically, it means that most of the results of 100 learners applied to LUCENE would have statistically indistinguishable results.
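The per-division calculation can be sketched as follows. This is an assumption-laden illustration rather than the authors' analysis script: results are hypothetical `(x, recall)` pairs, σ is the sample standard deviation, and the multiplier `z` that turns σ into a separation threshold is a placeholder parameter (1.96 is the usual two-sided 95% value for a normal distribution, not necessarily the constant used in the paper).

```python
import statistics
from collections import defaultdict


def epsilon_per_division(results, width=0.1, z=1.96):
    """Map each x-axis division index to (sigma of recall, z * sigma).

    `z` is an assumed multiplier, not a value taken from the paper.
    """
    divisions = round(1 / width)
    bins = defaultdict(list)
    for x, recall in results:
        index = min(int(x / width + 1e-9), divisions - 1)  # clamp x = 1.0 into last bin
        bins[index].append(recall)
    out = {}
    for index, recalls in sorted(bins.items()):
        sigma = statistics.stdev(recalls) if len(recalls) > 1 else 0.0
        out[index] = (sigma, z * sigma)
    return out


# Hypothetical results from three runs of a learner.
runs = [(0.05, 0.4), (0.07, 0.6), (0.15, 0.5)]
for index, (sigma, eps) in epsilon_per_division(runs).items():
    print(index, round(sigma, 3), round(eps, 3))
```

Divisions with widely scattered recalls yield a large threshold, which is the situation the text describes for the left-hand side of Figure 5.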

From an analytic perspective, there are some limitations with the above analysis. Firstly, the t-test threshold used for ε is a simplistic measure of statistically significantly different results. It makes many assumptions that may not hold for SE data; e.g., that the data conforms to a parametric Gaussian distribution and that the variance of the two distributions is the same.

Figure 5. ε can vary across results space. 100 experiments with LUCENE results using 90% of the data for training and 10% for testing. The x-axis sorts the code base, first according to predicted defective or not, then second on lines of code. The blue line shows the distribution of the 100 results across this space. For example, 70% of the results predict defect for up to 20% of the code (see the blue curve). The y-axis of this figure shows mean recall (in red); the standard deviation σ of the recall (in yellow); and ε (in green), defined as per standard t-tests as the separation at which, at the 95% confidence level, two distributions differ.

Secondly, as shown by the green curve of Figure 5, ε is not uniform across this results space. One reason for the lack of uniformity is that the results generated from 100 samples of the LUCENE data do not fall evenly across the space: 70% of the 100 learned models fall on the far left of Figure 5 (up to 20% of the code; see the blue curve). This high variance means that we cannot reason about the results space just via, e.g., some trite summary of the entire results space as a mean value.

When analytic methods fail, sampling can be used instead. Rather than using ε analytically, we instead use it to define a sampling method for the results space. That system, called DART, is described in Figure 6. The algorithm works by DART-ing around results space a few times. Note that if the results exhibit large ε properties, then these few samples would be enough to cover the ten green cells of Figure 4. Also, the results from such DART-ing around should perform as well as anything else, including the three state-of-the-art systems listed in the introduction.

To operationalize DART, we wrote some Python code based on the fast-and-frugal tree (FFT) R package from Phillips et al. (Phillips et al., 2017). While this is not the only way to operationalize DART, it worked so well for this paper that we were not motivated to try alternatives. An FFT is a binary tree where, at each level, there is one exit node predicting either “true” (the target class) or “false”. Also, at the bottom of the tree, there are two leaves exiting to “true” and “false”. For example, from the Log4j dataset of Table 4, one tree predicting software defects is shown in Figure 7. Note that this tree has decided to exit towards the target class at lines 2, 3, 4 and otherwise on lines 1, 5.

Input:

  • A dataset, such as Table 4;

  • A goal predicate; e.g., popt20 or dis2heaven;

  • The number of models to build and the number of ranges used per model.

Output:

  • Score of the best model when applied to data not used for training.

Procedure:

  • Separate the data into train and test;

  • On the train data, build an ensemble and select the best:

    • For each model in the ensemble, do

      1. Divide numeric attributes into ranges;

      2. Find extreme ranges that score highest and lowest on the goal;

      3. Combine some of the extreme ranges into a model;

      4. Score the model using the goal;

      5. Keep the best scoring model.

  • On the test data:

    • Return the score of the best scoring model.

Notes:

  • For training step (2), we use extreme ranges in order to maximize the spread of the darts around the results space.

  • To keep this simple, the discretizer used in training step (1) just divides the numeric data on its median value.

Figure 6. DART: an ensemble algorithm to sample results space a fixed number of times.
1. if cob <= 4          then false
2. else if rfc > 32     then true
3. else if dam > 0      then true
4. else if amc < 32.25  then true
5. else false
Figure 7. A simple model for software defect prediction
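The tree of Figure 7 translates directly into a few lines of Python. The sketch below assumes each software module is a dict keyed by the attribute names exactly as printed in Figure 7; the function name `predict_defective` is illustrative only.

```python
def predict_defective(module):
    """Fast-and-frugal tree of Figure 7.

    `module` maps attribute names (as printed in the figure) to numbers.
    Each test is one exit of the tree; the final return is the bottom leaf.
    """
    if module["cob"] <= 4:      # line 1: exit to "false"
        return False
    if module["rfc"] > 32:      # line 2: exit to "true"
        return True
    if module["dam"] > 0:       # line 3: exit to "true"
        return True
    if module["amc"] < 32.25:   # line 4: exit to "true"
        return True
    return False                # line 5: otherwise "false"
```

Note how the exits at lines 2, 3 and 4 lead to the target class while lines 1 and 5 lead away from it, as described in the text.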

To build one tree, our version of DART discretizes numerics by dividing at median values, then scores each range using dis2heaven or popt20 according to how well it predicts for “true” or “false” (this finds the extreme ranges seen in DART’s training step (2)). Next, we build one level of the tree by (a) picking the exit class then (b) adding in the range that best predicts for that class. The other levels are built recursively using the data not selected by that range. Given a tree of depth d, there are two choices at each level about whether to exit to “true” or “false”. Hence, for trees of depth d, there are 2^d possible trees. Each such tree is one “dart” into results space.

To throw several darts at results space, DART builds an ensemble of 16 trees; i.e., we use depth d = 4. This number was selected since Figure 4 had ten green cells. Hence: d = 3 would generate 2^3 = 8 trees, which would not be enough to cover results space; d = 5 would generate 2^5 = 32 trees, which would be excessive for results like Figure 4. Note that, when using this approach, the number of extreme ranges used in the models is the same as the depth d of the tree. As per Figure 6, on the training data shown in Table 4, we built 16 trees, then selected the best one to be used for testing.
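The procedure of Figure 6 and the tree-building steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code used for the experiments: rows are dicts with numeric attributes plus a boolean `label`; the goal is assumed to be dis2heaven computed as the normalized distance from perfect recall and zero false alarms; and all function names are ours.

```python
import itertools
import math
import statistics

def dis2heaven(predicted, actual):
    """Assumed goal: distance from (recall=1, false alarm=0), scaled to [0, 1]; less is better."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    pos = sum(actual)
    neg = len(actual) - pos
    recall = tp / pos if pos else 0.0
    far = fp / neg if neg else 0.0
    return math.sqrt(((1 - recall) ** 2 + far ** 2) / 2)

def matches(row, rng):
    attr, op, cut = rng
    return row[attr] <= cut if op == "<=" else row[attr] > cut

def median_ranges(rows, attrs):
    """Training step (1): split every numeric attribute at its median."""
    ranges = []
    for attr in attrs:
        cut = statistics.median(r[attr] for r in rows)
        ranges += [(attr, "<=", cut), (attr, ">", cut)]
    return ranges

def build_tree(rows, attrs, exits, goal):
    """Build one tree; `exits` holds the exit class chosen at each level."""
    tree, rest = [], rows
    for exit_class in exits:
        if not rest:
            break
        labels = [r["label"] for r in rest]

        def score(rng):
            preds = [exit_class if matches(r, rng) else not exit_class for r in rest]
            return goal(preds, labels)

        best = min(median_ranges(rest, attrs), key=score)  # range that best predicts the exit class
        tree.append((best, exit_class))
        rest = [r for r in rest if not matches(r, best)]   # recurse on data not selected
    return tree

def predict(tree, row):
    for rng, exit_class in tree:
        if matches(row, rng):
            return exit_class
    return not tree[-1][1]  # bottom leaf: the class opposite to the last exit

def dart(train, attrs, depth=4, goal=dis2heaven):
    """Build 2**depth trees (one per exit policy) and keep the best one on the training data."""
    labels = [r["label"] for r in train]
    trees = [build_tree(train, attrs, exits, goal)
             for exits in itertools.product([True, False], repeat=depth)]
    return min(trees, key=lambda t: goal([predict(t, r) for r in train], labels))
```

With `depth=4`, `itertools.product` enumerates the 16 exit policies discussed above; the selected tree would then be scored once on the held-out test data.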

4. Experimental Setup

Recalling §2.4, the evaluation criteria used in this study were dis2heaven and popt20. Note that these criteria also echo the criteria seen in prior work (Fu and Menzies, 2017b; Kamei et al., 2013; Yang et al., 2016). The rest of this section discusses our other experimental details.

4.1. Research Questions

To compare with three established defect prediction methods, we use machine learning implementations from the Scikit-learn package and the tools released by Fu et al. (Fu et al., 2016a) and Agrawal et al. (Agrawal and Menzies, 2018). In this study, we set three research questions:

RQ1: Do established learners sample results space better than a few DARTs? This question compares DART against the sample of defect prediction algorithms surveyed by Ghotra et al. at ICSE’15 (Ghotra et al., 2015).

RQ2: Do goal-savvy learners sample results space better than a few DARTs? This question addresses a potential problem with the RQ1 analysis. DART uses the goal function when it trains its models. Hence, this might give DART an unfair advantage compared to the other learners in Table 5. Therefore, in RQ2, we compare DART to goal-savvy hyper-parameter optimizers (Fu et al., 2016a) that make extensive use of the goal function as they tune learner parameters.

RQ3: Do data-savvy learners sample results space better than a few DARTs? Agrawal et al. (Agrawal and Menzies, 2018) argue that selecting and/or tuning data miners is less useful than repairing problems with the training data. To test that, this research question compares DART against the data-savvy methods developed by Agrawal et al.

4.2. Datasets

To compare the DART ensemble method against alternate approaches, we used data from the SEACRAFT repository, shown in Table 4 (for details on the contents of those data sets, see Table 1). This data was selected for two reasons:

  • The data is available for multiple versions of the same software. This means we can ensure that our learners are trained on past data and tested on future data.

  • It is very similar, or identical, to the data used in prior work against which we will compare our new approach  (Ghotra et al., 2015; Fu et al., 2016a; Agrawal and Menzies, 2018).

When applying data mining algorithms to build predictive models, one important principle is not to test on the data used in training. There are many ways to design an experiment that satisfies this principle. Some of those methods have limitations; e.g., leave-one-out is too slow for large data sets and cross-validation mixes up older and newer data (such that models may be trained on newer data and then tested on older data). In this work, for each project, we set the latest version of the project data as the testing data and all the older data as the training data. For example, for Poi (see Table 4), we use the data from versions 1.5, 2.0, and 2.5 for training predictors, and the newer data, from version 3.0, is left for testing.
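The release-based split described above is simple to express in code. The following sketch assumes each row carries a comparable `version` field (tuples such as `(2, 5)` sort correctly, whereas strings like "10.0" would not); the field name and function name are hypothetical.

```python
def split_by_version(rows, version_key="version"):
    """Train on every older release and test on the latest one."""
    latest = max(r[version_key] for r in rows)
    train = [r for r in rows if r[version_key] != latest]
    test = [r for r in rows if r[version_key] == latest]
    return train, test
```

For the Poi data of Table 4, rows tagged `(1, 5)`, `(2, 0)` and `(2, 5)` would go to training and rows tagged `(3, 0)` to testing, so no predictor is ever trained on data newer than its test set.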

| Project | Training versions | Training % of defects | Testing version | Testing % of defects |
|---|---|---|---|---|
| Poi | 1.5, 2.0, 2.5 | 426/936 = 46% | 3.0 | 281/442 = 64% |
| Lucene | 2.0, 2.2 | 235/442 = 53% | 2.4 | 203/340 = 60% |
| Camel | 1.0, 1.2, 1.4 | 374/1819 = 21% | 1.6 | 188/965 = 19% |
| Log4j | 1.0, 1.1 | 71/244 = 29% | 1.2 | 189/205 = 92% |
| Xerces | 1.2, 1.3 | 140/893 = 16% | 1.4 | 437/588 = 74% |
| Velocity | 1.4, 1.5 | 289/410 = 70% | 1.6 | 78/229 = 34% |
| Xalan | 2.4, 2.5, 2.6 | 908/2411 = 38% | 2.7 | 898/909 = 99% |
| Ivy | 1.1, 1.4 | 79/352 = 22% | 2.0 | 40/352 = 11% |
| Synapse | 1.0, 1.1 | 76/379 = 20% | 1.2 | 86/256 = 34% |
| Jedit | 3.2, 4.0, 4.1, 4.2 | 292/1257 = 23% | 4.3 | 11/492 = 2% |

Table 4. Statistics of the studied data sets.
| Goal | Project | DART | Ghotra et al. learners |
|---|---|---|---|
| dis2heaven (less is better) | log4j | 23 | 53 51 56 48 |
| | jedit | 31 | 40 41 34 47 |
| | lucene | 33 | 40 44 44 71 |
| | poi | 35 | 36 57 70 45 |
| | ivy | 35 | 50 40 71 43 |
| | velocity | 37 | 61 40 49 60 |
| | synapse | 38 | 51 39 34 62 |
| | xalan | 39 | 55 55 70 68 |
| | camel | 41 | 60 52 44 71 |
| | xerces | 42 | 68 60 50 69 |
| popt20 (more is better) | ivy | 28 | 17 9 28 23 |
| | jedit | 39 | 10 9 16 17 |
| | synapse | 43 | 26 24 22 22 |
| | camel | 53 | 15 17 16 50 |
| | log4j | 56 | 19 22 16 23 |
| | velocity | 64 | 64 64 24 60 |
| | poi | 73 | 51 19 33 64 |
| | lucene | 81 | 43 27 20 80 |
| | xerces | 90 | 4 9 15 48 |
| | xalan | 99 | 11 15 100 51 |

Table 5. DART v.s. state-of-the-art defect predictors from Ghotra et al. (Ghotra et al., 2015) for dis2heaven and popt20. Gray cells mark best performances on each project (so DART is most-often best).

5. Results

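| Data | dis2heaven (less is better): DART | dis2heaven (less is better): Tuning RF | popt20 (more is better): DART | popt20 (more is better): Tuning RF |
|---|---|---|---|---|
| ivy | 35 | 56 | 28 | 28 |
| jedit | 31 | 35 | 39 | 39 |
| synapse | 38 | 57 | 43 | 48 |
| camel | 41 | 70 | 53 | 54 |
| log4j | 23 | 51 | 56 | 20 |
| velocity | 37 | 53 | 64 | 64 |
| poi | 34.8 | 27 | 73 | 74 |
| lucene | 33 | 35 | 81 | 80 |
| xerces | 42 | 70 | 90 | 94 |
| xalan | 38.7 | 36 | 99 | 99 |

Table 6. DART v.s. tuning Random Forests for dis2heaven and popt20. Gray cells mark best performances on each project. Note that even when DART does not perform best, it usually performs very close to the best.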

5.1. RQ1: Do established learners sample results space better than a few DARTs?

In order to compare our approach to established norms in software quality predictors, we used the Ghotra et al. study from ICSE’15 (Ghotra et al., 2015). Recall that this study was a comparison of the 32 learners shown in Table 3. The performance of those learners clustered into four groups, from which we selected four representative learners (see the discussion in §2.3): SL, NB, EM, SMO.

For this comparison, DART and the learners from Ghotra et al. were all trained/tested on the same versions shown in Table 4. The resulting performance scores are shown in Table 5. Note that:

  • DART performed as well as, or better than, the sample of Ghotra et al. learners in 18/20 experiments.

  • When DART failed to produce the best performance, it came very close to the best (e.g., for Xalan’s popt20 results, DART scored 99% while the best was 100%).

  • When DART performed best, it often did so by a very wide margin. For example, for Log4j’s dis2heaven score, DART’s score was 23% and the best value among the other learners in Table 5 was 48%; i.e., worse by more than a factor of two. For another example, for Log4j’s popt20 score, DART’s score was more than twice the score of any other learner.

From these results, we assert that DART out-performs the established state-of-the-art defect predictors recommended by Ghotra et al. (Ghotra et al., 2015) on the data sets of Table 4.

| Tuning Object | Parameter | Default | Range | Description |
|---|---|---|---|---|
| Random Forests | threshold | 0.5 | [0.01,1] | The value to determine defective or not. |
| Random Forests | max_feature | None | [0.01,1] | The number of features to consider when looking for the best split. |
| Random Forests | max_leaf_nodes | None | [1,50] | Grow trees with max_leaf_nodes in best-first fashion. |
| Random Forests | min_sample_split | 2 | [2,20] | The minimum number of samples required to split an internal node. |
| Random Forests | min_samples_leaf | 1 | [1,20] | The minimum number of samples required to be at a leaf node. |
| Random Forests | n_estimators | 100 | [50,150] | The number of trees in the forest. |
| SMOTE | k | 5 | [1,20] | Number of neighbors. |
| SMOTE | m | | 50, 100, 200, 400 | Number of synthetic examples to create. Expressed as a percent of final training data. |
| SMOTE | r | 2 | [0.1, 5] | Power parameter for the Minkowski distance metric. |

Table 7. List of parameters tuned in this paper.
Input:

  • A dataset, such as Table 4;

  • A tuning goal; e.g., popt20 or dis2heaven;

  • DE parameters: the population size, the mutation factor, the crossover probability, and the tuning budget.

Output:

  • Best tunings for learners (e.g., RF) found by DE.

Procedure:

  • Separate the data into train and tune;

  • Generate an initial population of tunings;

  • Score each tuning in the population with the goal;

  • For each tuning in the population, do

    1. Generate a mutant built by extrapolating between three other members of the population. Each decision of the mutant is changed at the crossover probability, with one update rule for continuous values and another for discrete values;

    2. Build a learner with the mutant’s parameters and the train data;

    3. Score the learner on the tune data using the goal;

    4. Replace the tuning with the mutant if the mutant is preferred;

  • Repeat the last step until the tuning budget runs out or no better tunings can be found;

  • Return the best tuning of the last population as the final result.

Figure 8. TUNER is an evolutionary optimization algorithm based on Storn’s differential evolution algorithm (Storn and Price, 1997; Fu and Menzies, 2017a).
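The core loop of Figure 8 can be sketched as follows. This is a simplified illustration of standard differential evolution over continuous decisions only, not the TUNER implementation itself: the default values for `pop_size`, `f`, `cr` and `generations` are assumptions, `score` is a hypothetical stand-in for "build a learner with these parameters and score it on the tune data" (lower is better), and the early-stopping rule is omitted.

```python
import random

def differential_evolution(score, bounds, pop_size=10, f=0.75, cr=0.3,
                           generations=20, seed=1):
    """Minimize `score` over a box of continuous decisions.

    bounds: list of (low, high) pairs, one per decision.
    Returns (best_candidate, best_score).
    """
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [score(p) for p in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # pick three other members of the population
            a, b, c = rng.sample([p for j, p in enumerate(pop) if j != i], 3)
            mutant = list(pop[i])
            for k, (lo, hi) in enumerate(bounds):
                if rng.random() < cr:
                    # extrapolate, then clip back into the legal range
                    mutant[k] = min(hi, max(lo, a[k] + f * (b[k] - c[k])))
            s = score(mutant)
            if s < scores[i]:  # replace the old tuning only if the mutant is preferred
                pop[i], scores[i] = mutant, s
    best = min(range(pop_size), key=lambda i: scores[i])
    return pop[best], scores[best]
```

Because a mutant only ever replaces a worse tuning, the best score in the population can never get worse from one generation to the next.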

5.2. RQ2: Do goal-savvy learners sample results space better than a few DARTs?

One counter-argument to the conclusions of RQ1 is that it may not be fair to compare DART against standard data mining algorithms using their off-the-shelf parameter tunings. DART makes extensive use of the goal function (i.e., popt20 and dis2heaven) at three points in its algorithm:

  • Once when assessing individual ranges;

  • Once again when assessing trees built from those ranges;

  • A third time when assessing the best tree on the test set.

All the other learners in Table 5 use the goal function only once (when their final model was assessed) but never while they build their models. That is, DART is “goal-savvy” while all the other learners explored in RQ1 were not. Perhaps this gave DART an unfair advantage?

To address this issue, we turned to the state-of-the-art in hyper-parameter optimization for defect prediction. In 2016, Fu et al. presented in the IST journal (Fu et al., 2016a) an extensive study where an optimizer tuned the control parameters of various learners applied to software quality defect prediction. That study used the goal function to guide their selection of control parameters; i.e., unlike the Table 5 results, this learning method is “goal-savvy” in the sense that it was allowed to reflect on the goal during model generation.

For RQ2, we compare the performance of DART with goal-savvy tuning of Random Forests. We use Random Forests since they were recommended by Ghotra et al. (Ghotra et al., 2015) and by prior work on hyper-parameter tuning for defect prediction by Fu et al. (Fu et al., 2016a). In this hyper-parameter tuning experiment, for each project, we randomly split the original training data (i.e., the older versions of that project, combined) into new training data and tuning data. As recommended by Fu et al. (Fu et al., 2016a), we use the TUNER algorithm of Figure 8 to select the parameters of Random Forests (for a list of those parameters, see Table 7). TUNER iterates until it runs out of tuning resources (i.e., a given tuning budget) or it cannot find any better hyper-parameters. Finally, we use the current best parameters as the best hyper-parameters to train Random Forests with the new training data.

Since different data splits might have an impact on a predictor’s performance, we repeat the tuning+testing process 30 times, each time with a different random seed, and return the median value of the 30 runs as the result of the tuning Random Forests experiment on each project. Since we have two different goals (minimize dis2heaven and maximize popt20), we run two different experiments (so 60 repeats in all).
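The repeat-and-report-the-median protocol can be captured in a small helper. In this sketch, `run_once` is a hypothetical stand-in for one complete tuning+testing run that accepts a random seed and returns a score; it is not a function from the tools used in this study.

```python
import statistics

def median_of_repeats(run_once, repeats=30):
    """Call run_once(seed) once per seed and report the median score."""
    return statistics.median(run_once(seed) for seed in range(repeats))
```

Passing a distinct seed to each repeat keeps the 30 data splits different from one another while leaving the whole experiment reproducible.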

For comparison purposes, DART built its ensembles on the original training data (i.e., the older versions of each project, combined) and selected the best tree in terms of the goal (i.e., dis2heaven or popt20). The best tree was then tested on the testing data (i.e., the latest version). Table 6 shows the results of DART versus Random Forests, where the latter was tuned for dis2heaven or popt20. Note that in this experiment, DART was not tuned. Rather, it just used its default settings of:

  • Discretizing using median splits for numeric attributes;

  • Building trees of depth d = 4, which means ensembles of size 2^4 = 16;

  • At each level of tree, use just one extreme range.

As shown in Table 6, for 13/20 experiments, untuned DART performed better than tuned RandomForests. Also, in all cases where DART performed worse, the performance delta was very small (the largest loss was 4% seen in the xerces’ results).

| Goal | Project | DART | Data-savvy learners |
|---|---|---|---|
| dis2heaven (less is better) | log4j | 23 | 45 44 50 44 40 47 |
| | jedit | 31 | 45 52 41 39 44 40 |
| | lucene | 33 | 37 45 44 41 40 40 |
| | poi | 35 | 38 52 52 39 46 43 |
| | ivy | 35 | 37 46 36 39 37 40 |
| | velocity | 37 | 56 64 40 44 61 42 |
| | synapse | 38 | 36 47 36 42 37 42 |
| | xalan | 39 | 20 35 45 25 71 28 |
| | camel | 41 | 45 62 47 35 53 38 |
| | xerces | 42 | 45 67 52 52 53 53 |
| popt20 (more is better) | ivy | 28 | 26 27 10 27 24 26 |
| | jedit | 39 | 3 17 6 10 4 24 |
| | synapse | 43 | 39 38 27 36 36 35 |
| | camel | 52.9 | 53 53 21 52 53 49 |
| | log4j | 56 | 27 50 24 33 44 44 |
| | velocity | 64 | 56 64 64 57 65 53 |
| | poi | 73 | 67 69 26 72 72 71 |
| | lucene | 81 | 45 49 27 49 42 53 |
| | xerces | 90 | 73 63 20 50 77 48 |
| | xalan | 99 | 99 98 24 93 100 88 |

Table 8. DART v.s. data-savvy learners for dis2heaven and popt20. Gray cells mark best performances on each project.

From these results, we assert that DART out-performs the established state-of-the-art in parameter tuning for defect prediction. This is an interesting result since TUNER must evaluate dozens to hundreds of different models before it can select the best settings. DART, on the other hand, just had to build one ensemble, then test one tree from that ensemble.

5.3. RQ3: Do data-savvy learners sample results space better than a few DARTs?

There has been much recent research in hyper-parameter tuning of quality predictors. For example:

  • The Fu et al. study mentioned in RQ2 tuned control parameters of the learning algorithm.

  • At ICSE’18, Agrawal et al. (Agrawal and Menzies, 2018) applied tuning to the data pre-processor that was called before the learners executed.

This section compares DART to the Agrawal et al. methods. Note that while the Fu et al. tuners were “goal-savvy”, the Agrawal et al. methods are “data-savvy”.

Agrawal compared the benefits of (a) selecting better learners versus (b) picking any learner but also addressing class-imbalance in the training data. Class-imbalance is a major problem in quality prediction. If the target class is very rare, it can be difficult for a data mining algorithm to generate a model that can locate it. A standard method for addressing class imbalance is the SMOTE pre-processor (Chawla et al., 2002). SMOTE randomly deletes members of the majority class while synthesizing artificial members of the minority class.
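The synthesis half of SMOTE can be illustrated in a few lines. This is a minimal sketch and not the implementation used by Agrawal et al.: it assumes rows are lists of numbers, takes the number of new rows directly (`n_new`) rather than as a percentage, and leaves out the under-sampling of the majority class. The parameters `k` and `r` play the roles described in Table 7; the function names are ours.

```python
import random

def minkowski(x, y, r=2):
    """Minkowski distance with power parameter r (r=2 is Euclidean)."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

def synthesize_minority(minority, n_new, k=5, r=2, seed=1):
    """Create n_new artificial rows by interpolating between minority-class neighbors.

    Requires at least two minority rows.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        row = rng.choice(minority)
        others = [m for m in minority if m is not row]
        # the k nearest minority-class neighbors of the chosen row
        neighbors = sorted(others, key=lambda m: minkowski(row, m, r))[:k]
        mate = rng.choice(neighbors)
        gap = rng.random()  # how far to move from row towards mate
        synthetic.append([a + gap * (b - a) for a, b in zip(row, mate)])
    return synthetic
```

Every synthetic row lies on the line segment between two real minority rows, which is why the choice of `k` and `r` (i.e., which neighbors count as "near") changes the data the learner sees.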

SMOTE is controlled by the parameters of Table 7. Agrawal et al. applied the same TUNER algorithm of Fu et al. and found that the default settings of SMOTE could be greatly improved. Agrawal et al. used the term SMOTUNED to denote their combination.

This section compares DART against SMOTUNED. As per the methods of Agrawal et al. (Agrawal and Menzies, 2018), for each project of Table 4, we randomly split the original training data into new training data and tuning data. The tuning data was used to validate the parameter settings of SMOTE found by TUNER.

Similar to the RQ2 experiment, we repeated the whole process until we either ran out of tuning resources (i.e., a given tuning budget) or TUNER could not find any better hyper-parameters. The best parameters found were tested against the testing set and these results are reported.

Since different data splits might have an impact on a predictor’s performance, we repeat the tuning+testing process 30 times, each time with a different random seed, and return the median value of the 30 runs as the result of tuning SMOTE on each project. The SMOTUNED experiment was run twice: once to minimize dis2heaven and once to maximize popt20. These two experiments were run separately.

Table 8 shows the results. In results that echo all the above, usually DART performs much better than the more elaborate approach of Agrawal et al:

  • For the popt20 results, DART was either the best result or no worse than 1% off the best results;

  • As to dis2heaven, DART’s worst performance was for xalan, which was 19% worse than the best. Apart from that, DART either had the best result or was within 2% of the best result for 8 of the remaining 9 results.

6. Threats to Validity

As with any large scale empirical study, biases can affect the final results. Therefore, any conclusions made from this work must be considered with the following issues in mind:

Threats to internal validity concern the consistency of the results obtained from the study. In our study, to investigate how DART performs compared with the state-of-the-art defect predictors, goal-savvy defect predictors, and data-savvy defect predictors, we have taken care to either clearly define our algorithms or use implementations from the public domain. All the machine learning algorithms are imported from Scikit-Learn, a machine learning package in Python (Pedregosa et al., 2011). For example, in RQ1, DART followed the FFTs algorithm defined in (Martignon et al., 2003). In RQ2 and RQ3, we adopted the original source code of DE-TUNER and SMOTUNED provided by Fu et al. (Fu et al., 2016a) and Agrawal et al. (Agrawal and Menzies, 2018), which reduces the bias introduced by implementing the rigs ourselves. All the data used in this work is widely used open source Java system data in the defect prediction field, and it is also available in the SEACRAFT repository.

Threats to external validity concern whether the results are relevant to other cases; i.e., the ability to generalize the observations in a study. In this study, we proposed that using DART as a scout to explore the results space could build better defect predictors in terms of the dis2heaven and popt20 measures. Nonetheless, we do not claim that our findings can be generalized to all software quality prediction tasks. However, those other software quality prediction tasks often apply machine learning algorithms, like SVM and Random Forests, or other data pre-processing techniques to build predictive models. Most of those models also explore the results space to find the best models. Therefore, it is quite possible that the FFTs method of this paper would be widely applicable elsewhere.

7. Conclusions

The thesis of this paper is that we have been treating uncertainty incorrectly. Instead of viewing uncertainty as a problem to be solved, we view it as a resource that simplifies software defect prediction.

For example, Deb’s principle of ε-dominance states that if there exists some ε value below which it is useless or impossible to distinguish results, then it is superfluous to explore anything less than ε. For large ε problems, the results space effectively contains just a few regions.

As shown here, there are several important benefits if we design a learner especially for such large ε problems. Firstly, the resulting learner is very simple to implement. Secondly, this learner can sample the results space very effectively. Thirdly, that very simple learner can out-perform far more elaborate systems, such as three state-of-the-art defect prediction systems:

  1. The algorithms surveyed in a recent ICSE’15 paper (Ghotra et al., 2015);

  2. A hyper-parameter optimization method proposed in 2016 in the IST journal (Fu et al., 2016a);

  3. A search-based data pre-processing method presented at ICSE’18 (Agrawal and Menzies, 2018).

Figure 9. Different ways to reason about software quality prediction.

We believe that our results call for a new approach to software quality prediction. The standard approach to this problem, as shown by the top pink section of Figure 9 is to reason forwards from domain data, towards a model. In that approach, analysts must make many decisions about data pre-processing, feature selection, and tuning parameters for a learner. This is a very large number of decisions:

  • Data can be processed by SMOTE (Chawla et al., 2002), SMOTUNED (Agrawal and Menzies, 2018), or any number of other methods including normalization, discretization, outlier removal, etc (Witten and Frank, 2005);

  • Feature selection can explore subsets of features;

  • Hyper-parameter optimization explores the space of control parameters within data mining algorithms. As shown in Table 7, those parameters can be continuous which means the space of parameters is theoretically infinite.

Perhaps a much simpler approach is the backwards reasoning shown in the blue bottom region of Figure 9. In this approach, analysts do some initial data mining, perhaps at random, then reflect on what has been learned from those initial probes of the result space. Based on those results, analysts then design a software quality predictor that better understands the results space.

The DART system discussed in this paper is an example of such backwards reasoning. We hope the success of this system inspires other researchers to explore large scale simplifications of other SE problem domains.


  • che (2018) 2018. MULTI: Multi-objective effort-aware just-in-time software defect prediction. Information and Software Technology 93 (2018), 1–13.
  • Agrawal et al. (2018) Amritanshu Agrawal, Wei Fu, and Tim Menzies. 2018. What is Wrong with Topic Modeling? (and How to Fix it Using Search-based SE). (02 2018).
  • Agrawal and Menzies (2018) Amritanshu Agrawal and Tim Menzies. 2018. Is ”Better Data” Better than ”Better Data Miners”? (Benefits of Tuning SMOTE for Defect Prediction). ICSE (2018).
  • Barr et al. (2015) Earl T Barr, Mark Harman, Phil McMinn, Muzammil Shahbaz, and Shin Yoo. 2015. The oracle problem in software testing: A survey. IEEE transactions on software engineering 41, 5 (2015), 507–525.
  • Calero and Piattini (2015) Coral Calero and Mario Piattini. 2015. Green in Software Engineering. Springer Publishing Company, Incorporated.
  • Chawla et al. (2002) Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. 2002. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Int. Res. 16, 1 (June 2002), 321–357.
  • Chen et al. (2005) Zhihao Chen, Tim Menzies, Dan Port, and Barry Boehm. 2005. Feature subset selection can improve software cost estimation accuracy. In ACM SIGSOFT Software Engineering Notes, Vol. 30. ACM, 1–6.
  • Chidamber and Kemerer (1994) Shyam R Chidamber and Chris F Kemerer. 1994. A metrics suite for object oriented design. IEEE Transactions on software engineering 20, 6 (1994), 476–493.
  • Deb et al. (2005) Kalyanmoy Deb, Manikanth Mohan, and Shikhar Mishra. 2005. Evaluating the ε-domination based multi-objective evolutionary algorithm for a quick computation of Pareto-optimal solutions. Evolutionary computation 13, 4 (2005), 501–525.
  • Fayyad and Irani (1993) Usama Fayyad and Keki Irani. 1993. Multi-interval discretization of continuous-valued attributes for classification learning. (1993).
  • Fenton and Neil (2012) Norman Fenton and Martin Neil. 2012. Risk assessment and decision analysis with Bayesian networks. CRC Press.
  • Fu and Menzies (2017a) Wei Fu and Tim Menzies. 2017a. Easy over Hard: A Case Study on Deep Learning. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2017). ACM, 49–60.
  • Fu and Menzies (2017b) Wei Fu and Tim Menzies. 2017b. Revisiting Unsupervised Learning for Defect Prediction. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2017). ACM, 72–83.
  • Fu et al. (2016a) Wei Fu, Tim Menzies, and Xipeng Shen. 2016a. Tuning for software analytics: Is it really necessary? Information and Software Technology 76 (2016), 135–146.
  • Fu et al. (2016b) Wei Fu, Vivek Nair, and Tim Menzies. 2016b. Why is differential evolution better than grid search for tuning defect predictors? arXiv preprint arXiv:1609.02613 (2016).
  • Ghotra et al. (2015) Baljinder Ghotra, Shane McIntosh, and Ahmed E Hassan. 2015. Revisiting the impact of classification techniques on the performance of defect prediction models. IEEE Press, 789–800.
  • Hall et al. (2012) Tracy Hall, Sarah Beecham, David Bowes, David Gray, and Steve Counsell. 2012. A systematic literature review on fault prediction performance in software engineering. IEEE Transactions on Software Engineering 38, 6 (2012), 1276–1304.
  • Hamill and Goseva-Popstojanova (2009) Maggie Hamill and Katerina Goseva-Popstojanova. 2009. Common trends in software fault and failure data. IEEE Transactions on Software Engineering 35, 4 (2009), 484–496.
  • Hosseini et al. (2017) Seyedrebvar Hosseini, Burak Turhan, and Dimuthu Gunarathna. 2017. A Systematic Literature Review and Meta-analysis on Cross Project Defect Prediction. IEEE Transactions on Software Engineering (2017).
  • Kafura and Reddy (1987) Dennis Kafura and Geereddy R. Reddy. 1987. The use of software complexity metrics in software maintenance. IEEE Transactions on Software Engineering 3 (1987), 335–343.
  • Kamei et al. (2013) Yasutaka Kamei, Emad Shihab, Bram Adams, Ahmed E Hassan, Audris Mockus, Anand Sinha, and Naoyasu Ubayashi. 2013. A large-scale empirical study of just-in-time quality assurance. IEEE Transactions on Software Engineering 39, 6 (2013), 757–773.
  • Khoshgoftaar and Allen (2001) Taghi M Khoshgoftaar and Edward B Allen. 2001. Modeling software quality with. Recent Advances in Reliability and Quality Engineering 2 (2001), 247.
  • Khoshgoftaar and Seliya (2003) Taghi M Khoshgoftaar and Naeem Seliya. 2003. Software quality classification modeling using the SPRINT decision tree algorithm. International Journal on Artificial Intelligence Tools 12, 03 (2003), 207–225.
  • Khoshgoftaar et al. (2000) Taghi M Khoshgoftaar, Xiaojing Yuan, and Edward B Allen. 2000. Balancing misclassification rates in classification-tree models of software quality. Empirical Software Engineering 5, 4 (2000), 313–330.
  • Kim et al. (2015) Mijung Kim, Jaechang Nam, Jaehyuk Yeon, Soonhwang Choi, and Sunghun Kim. 2015. REMI: defect prediction for efficient API testing. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering. ACM, 990–993.
  • Kim et al. (2008) Sunghun Kim, E James Whitehead Jr, and Yi Zhang. 2008. Classifying software changes: Clean or buggy? IEEE Transactions on Software Engineering 34, 2 (2008), 181–196.
  • Kocaguneli et al. (2010) Ekrem Kocaguneli, Gregory Gay, Tim Menzies, Ye Yang, and Jacky W Keung. 2010. When to use data from other projects for effort estimation. In Proceedings of the IEEE/ACM international conference on Automated software engineering. ACM, 321–324.
  • Koru et al. (2009) A Güneş Koru, Dongsong Zhang, Khaled El Emam, and Hongfang Liu. 2009. An investigation into the functional form of the size-defect relationship for software modules. IEEE Transactions on Software Engineering 35, 2 (2009), 293–304.
  • Lessmann et al. (2008a) S. Lessmann, B. Baesens, C. Mues, and S. Pietsch. 2008a. Benchmarking Classification Models for Software Defect Prediction: A Proposed Framework and Novel Findings. IEEE Transactions on Software Engineering 34, 4 (July 2008), 485–496.
  • Lessmann et al. (2008b) Stefan Lessmann, Bart Baesens, Christophe Mues, and Swantje Pietsch. 2008b. Benchmarking classification models for software defect prediction: A proposed framework and novel findings. IEEE Transactions on Software Engineering 34, 4 (2008), 485–496.
  • Martignon et al. (2003) Laura Martignon, Oliver Vitouch, Masanori Takezawa, and Malcolm R Forster. 2003. Naive and yet enlightened: From natural frequencies to fast and frugal decision trees. Thinking: Psychological perspectives on reasoning, judgment and decision making (2003), 189–211.
  • McCabe (1976) Thomas J McCabe. 1976. A complexity measure. IEEE Transactions on software Engineering 4 (1976), 308–320.
  • Menzies et al. (2007a) Tim Menzies, Alex Dekhtyar, Justin Distefano, and Jeremy Greenwald. 2007a. Problems with Precision: A Response to ”Comments on ’Data Mining Static Code Attributes to Learn Defect Predictors’”. IEEE Trans. Softw. Eng. 33, 9 (Sept. 2007), 637–640.
  • Menzies et al. (2007b) Tim Menzies, Jeremy Greenwald, and Art Frank. 2007b. Data mining static code attributes to learn defect predictors. IEEE Transactions on Software Engineering 33, 1 (2007).
  • Menzies and Shepperd (2012) Tim Menzies and Martin Shepperd. 2012. Special issue on repeatable results in software engineering prediction. (2012).
  • Misirli et al. (2011) Ayse Tosun Misirli, Ayse Bener, and Resat Kale. 2011. Ai-based software defect predictors: Applications and benefits in a case study. AI Magazine 32, 2 (2011), 57–68.
  • Mockus and Weiss (2000) Audris Mockus and David M Weiss. 2000. Predicting risk of software changes. Bell Labs Technical Journal 5, 2 (2000), 169–180.
  • Monden et al. (2013) Akito Monden, Takuma Hayashi, Shoji Shinoda, Kumiko Shirai, Junichi Yoshida, Mike Barker, and Kenichi Matsumoto. 2013. Assessing the cost effectiveness of fault prediction in acceptance testing. IEEE Transactions on Software Engineering 39, 10 (2013), 1345–1357.
  • Myers et al. (2011) Glenford J Myers, Corey Sandler, and Tom Badgett. 2011. The art of software testing. John Wiley & Sons.
  • Orso and Rothermel (2014) Alessandro Orso and Gregg Rothermel. 2014. Software testing: a research travelogue (2000–2014). In Proceedings of the Future of Software Engineering. ACM, 117–132.
  • Ostrand et al. (2004) Thomas J Ostrand, Elaine J Weyuker, and Robert M Bell. 2004. Where the bugs are. In ACM SIGSOFT Software Engineering Notes, Vol. 29. ACM, 86–96.
  • Pedregosa et al. (2011) Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, and others. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, Oct (2011), 2825–2830.
  • Phillips et al. (2017) Nathaniel D Phillips, Hansjörg Neth, Jan K Woike, and Wolfgang Gaissmaier. 2017. FFTrees: A toolbox to create, visualize, and evaluate fast-and-frugal decision trees. Judgment and Decision Making 12, 4 (2017), 344.
  • Rahman and Devanbu (2013) Foyzur Rahman and Premkumar Devanbu. 2013. How, and why, process metrics are better. In Software Engineering (ICSE), 2013 35th International Conference on. IEEE, 432–441.
  • Rahman et al. (2014) Foyzur Rahman, Sameer Khatri, Earl T Barr, and Premkumar Devanbu. 2014. Comparing static bug finders and statistical prediction. In Proceedings of the 36th International Conference on Software Engineering. ACM, 424–434.
  • Ray et al. (2016) Baishakhi Ray, Vincent Hellendoorn, Saheel Godhane, Zhaopeng Tu, Alberto Bacchelli, and Premkumar Devanbu. 2016. On the naturalness of buggy code. In Proceedings of the 38th International Conference on Software Engineering. ACM, 428–439.
  • Sculley (2010) D. Sculley. 2010. Web-scale K-means Clustering. In Proceedings of the 19th International Conference on World Wide Web (WWW ’10). ACM, New York, NY, USA, 1177–1178.
  • Storn and Price (1997) R. Storn and K. Price. 1997. Differential evolution–a simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization 11, 4 (1997), 341–359.
  • Tantithamthavorn et al. (2016) Chakkrit Tantithamthavorn, Shane McIntosh, Ahmed E Hassan, and Kenichi Matsumoto. 2016. Automated parameter optimization of classification techniques for defect prediction models. In Software Engineering (ICSE), 2016 IEEE/ACM 38th International Conference on. IEEE, 321–332.
  • Wang et al. (2016) Song Wang, Taiyue Liu, and Lin Tan. 2016. Automatically learning semantic features for defect prediction. In Proceedings of the 38th International Conference on Software Engineering. ACM, 297–308.
  • Witten and Frank (2002) Ian H. Witten and Eibe Frank. 2002. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. SIGMOD Rec. 31, 1 (March 2002), 76–77.
  • Witten and Frank (2005) Ian H. Witten and Eibe Frank. 2005. Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
  • Yang et al. (2016) Yibiao Yang, Yuming Zhou, Jinping Liu, Yangyang Zhao, Hongmin Lu, Lei Xu, Baowen Xu, and Hareton Leung. 2016. Effort-aware just-in-time defect prediction: simple unsupervised models could be better than supervised models. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 157–168.
  • Yoo and Harman (2012) Shin Yoo and Mark Harman. 2012. Regression testing minimization, selection and prioritization: a survey. Software Testing, Verification and Reliability 22, 2 (2012), 67–120.
  • Zhang et al. (2016) Feng Zhang, Quan Zheng, Ying Zou, and Ahmed E Hassan. 2016. Cross-project defect prediction using a connectivity-based unsupervised classifier. In Proceedings of the 38th International Conference on Software Engineering. ACM, 309–320.