Why is Differential Evolution Better than Grid Search for Tuning Defect Predictors?

09/08/2016, by Wei Fu et al., NC State University

Context: One of the black arts of data mining is learning the magic parameters that control the learners. In software analytics, at least for defect prediction, several methods, like grid search and differential evolution (DE), have been proposed to learn these parameters, and such tuning has been shown to improve the performance scores of learners. Objective: We want to evaluate which method finds better parameters in terms of performance score and runtime cost. Methods: This paper compares grid search to differential evolution, an evolutionary algorithm that makes extensive use of stochastic jumps around the search space. Results: We find that the seemingly complete approach of grid search does no better, and sometimes worse, than the stochastic search. When repeated 20 times to check for conclusion validity, DE was over 210 times faster than grid search at tuning Random Forests on 17 testing data sets with F-Measure as the goal. Conclusions: These results are puzzling: why is a quick partial search just as effective as a much slower and much more extensive search? To answer that question, we turned to the theoretical optimization literature. Bergstra and Bengio conjecture that grid search is not more effective than more randomized searchers if the underlying search space is inherently low dimensional. This is significant since recent results show that defect prediction exhibits very low intrinsic dimensionality, an observation that explains why a fast method like DE may work as well as a seemingly more thorough grid search. This suggests, as a future research direction, that it might be possible to peek at data sets before doing any optimization in order to match the optimization algorithm to the problem at hand.


1 Introduction

Given the large amount of data now available to analysts, many researchers in empirical software engineering are now turning to automatic data miners to help them explore all that data Menzies:2013; me07b. Those learners come with many “magic numbers” which, if tuned, can generate better predictors for particular data spaces. For example, Fu et al. fu16 showed that, with tuning, the precision of software defect predictors learned from static code metrics can grow by 20% to 60%.

Since tuning is so effective, it is tempting to enforce tuning for all data mining studies. However, there is a problem: tuning can be slow. For example, in this paper, one of our tuners used 54 days of CPU time to conduct 20 repeats of tuning Random Forests brieman00 for one tuning goal with 17 test data sets. Other researchers also comment that tuning may require weeks, or more, of CPU time Arcur11. One dramatic demonstration of the slowness of tuning (not for defect prediction) comes from Wang et al. Wang:Harman13, who needed 15 years of CPU time to explore 9.3 million candidate configurations for software clone detectors.

One way to address the cost of tuning data miners is to use cloud-based CPU farms. The advantage of this approach is that it is simple to implement (just buy the cloud-based CPU time). For example, such cloud resources were used in this paper. But the use of cloud-based resources has several disadvantages:

  • Cloud computing environments are extensively monetized so the total financial cost of tuning can be prohibitive.

  • That CPU time is wasted if there is a faster and more effective way.

It turns out that the last point is indeed the case, at least for tuning defect predictors learned from static code attributes. The case study of this paper compares two tuning methods: grid search as used by Tantithamthavorn et al. tantithamthavorn2016automated and differential evolution as used by Fu et al. fu16. Both papers investigate the impact of tuning on defect predictors and find that, by their different methods, parameter tuning can improve learner performance. This study makes the following contributions:

Metric Name Description
amc average method complexity Number of JAVA byte codes
avg_cc average McCabe Average McCabe’s cyclomatic complexity seen in class
ca afferent couplings How many other classes use the specific class.
cam cohesion amongst classes Summation of number of different types of method parameters in every method divided by a multiplication of number of different method parameter types in whole class and number of methods.
cbm coupling between methods Total number of new/redefined methods to which all the inherited methods are coupled
cbo coupling between objects Increased when the methods of one class access services of another.
ce efferent couplings How many other classes are used by the specific class.
dam data access Ratio of private (protected) attributes to total attributes
dit depth of inheritance tree It’s defined as “the maximum length from the node to the root of the tree”
ic inheritance coupling Number of parent classes to which a given class is coupled (includes counts of methods and variables inherited)
lcom lack of cohesion in methods Number of pairs of methods that do not share a reference to an instance variable.
locm3 another lack of cohesion measure If $m$ and $a$ are the number of methods and attributes in a class and $\mu(a_j)$ is the number of methods accessing attribute $a_j$, then $lcom3 = ((\frac{1}{a}\sum_{j=1}^{a}\mu(a_j)) - m)/(1 - m)$.
loc lines of code Total lines of code in this file or package.
max_cc Maximum McCabe maximum McCabe’s cyclomatic complexity seen in class
mfa functional abstraction Number of methods inherited by a class plus number of methods accessible by member methods of the class
moa aggregation Count of the number of data declarations (class fields) whose types are user defined classes
noc number of children Number of direct descendants (subclasses) for each class
npm number of public methods npm metric simply counts all the methods in a class that are declared as public.
rfc response for a class Number of methods invoked in response to a message to the object.
wmc weighted methods per class A class with more member functions than its peers is considered to be more complex and therefore more error prone

defect defect Boolean: whether defects were found in post-release bug-tracking systems.
Table 1: OO measures used in our defect data sets.
  • To the best of our knowledge, this is the first such comparison of these two techniques for the purposes of tuning defect predictors;

  • We show that differential evolution is just as effective as grid search for improving defect predictors, while being one to two orders of magnitude faster;

  • Hence, we offer a strong recommendation to use evolutionary algorithms (like DE) for tuning defect predictors;

  • Lastly, we propose a prediction method that would allow future analysts to match the optimization method to the data at hand.

The rest of this paper is structured as follows. Section 2 describes defect predictors, how they can be generated by data miners, and how tuning can affect that learning process. Section 3 defines and compares differential evolution and grid search for that tuning process. This section’s case study shows that grid search is far slower, yet no more effective, than differential evolution. Section 4 is a discussion on why DE is more effective and faster than grid search by showing how intrinsic dimensionality can be used as a predictor for tuning problems that might be suitable for random search methods such as differential evolution.

While the specific conclusion of this paper relates to defect prediction, this work has broader implications across data science. Many researchers in SE explore hyper-parameter tuning Arcur11; Jia15; panichella2013effectively; cora10; minku2013analysis; song2013impact; lessmann2008benchmarking; tantithamthavorn2016automated; fu16; Wang:Harman13 (for details, see Section 2.3). Within that group, grid search is the common method for exploring different tuning parameters. Such studies can be tediously slow, often requiring days to weeks of CPU time. The success of differential evolution at tuning defect predictors raises the possibility that tuning research could be greatly accelerated by a better selection of tuning algorithms. This could be a very productive avenue for future research.

1.1 Case Studies, Not Experiments

In terms of SE experimental theory, this paper is a case study paper and not an experiment. Runeson and Höst Runeson2009 reserve the term “case study” to refer to some act that includes observing or changing the conditions under which humans perform some software engineering action. In the age of data mining and model-based optimization, this definition should be further extended to include case studies exploring how data collected from different projects can be generalized across multiple projects. As Cruzes et al. comment, “Choosing a suitable method for synthesizing evidence across a set of case studies is not straightforward… limited access to raw data may limit the ability to fully understand and synthesize studies” Cruzes11.

In this age of model-based reasoning, that limited information can be synthesized using data miners. Such tools are easy to run automatically, which also makes it easy to run them automatically and incorrectly. For example, as shown in this paper, it is not good enough to use these learners with their default tunings, since what is learned by a data miner can be greatly changed and improved by tuning. That is, tuning is essential.

The problem with tuning is that, if done using standard methods, it can be needlessly computationally expensive. Hence, the goal of this paper is to comment on how quickly tuning can be applied without incurring excessive CPU costs.

2 Background

2.1 Defect Prediction

Human programmers are clever, but flawed. Coding adds functionality, but also defects. Hence, software sometimes crashes (perhaps at the most awkward or dangerous moment) or delivers the wrong functionality.

Since programming inherently introduces defects into programs, it is important to test them before they are used. Testing is expensive. According to Lowry et al. LowryBK98, software assessment budgets are finite while assessment effectiveness increases exponentially with assessment effort. Exponential costs quickly exhaust finite resources so standard practice is to apply the best available methods only on code sections that seem most critical. Any method that focuses on parts of the code can miss defects in other areas so some sampling policy should be used to explore the rest of the system. This sampling policy will always be incomplete, but it is the only option when resources prevent a complete assessment of everything.

Learner Parameter Default Tuning Range Description
CART threshold 0.5 [0,1] The value to determine defective or not.
max_feature None [0.01,1] The number of features to consider when looking for the best split.
min_sample_split 2 [2,20] The minimum number of samples required to split an internal node.
min_samples_leaf 1 [1,20] The minimum number of samples required to be at a leaf node.
max_depth None [1, 50] The maximum depth of the tree.
Random Forests threshold 0.5 [0.01,1] The value to determine defective or not.
max_feature None [0.01,1] The number of features to consider when looking for the best split.
max_leaf_nodes None [1,50] Grow trees with max_leaf_nodes in best-first fashion.
min_sample_split 2 [2,20] The minimum number of samples required to split an internal node.
min_samples_leaf 1 [1,20] The minimum number of samples required to be at a leaf node.
n_estimators 100 [50,150] The number of trees in the forest.
Figure 1: List of parameters tuned by this paper.

One such sampling policy is to use defect predictors learned from static code attributes. Given software described in terms of the attributes of Table 1, data miners can infer where software defects are most likely. This is useful since static code attributes can be automatically collected, even for very large systems. Other methods, like manual code reviews, are slower and more labor-intensive rakitin01. There is an increasing amount of work on building defect predictors based on static code attributes krishna2016too; nam2015heterogeneous; tan2015online, and such predictors can localize 70% (or more) of the defects in code me07b. They also perform well compared to certain widely-used automatic methods. Rahman et al. rahman14:icse compared (a) static code analysis tools FindBugs, Jlint, and Pmd and (b) static code defect predictors. They found no significant differences in the cost-effectiveness of these approaches. This is interesting since static code defect prediction can be quickly adapted to new languages by building lightweight parsers that find information like that of Table 1. The same is not true for static code analyzers, which need extensive modification before they can be used on new languages.

2.2 Data Mining

Data miners produce summaries of data. Data miners are efficient since they employ various heuristics to reduce their search space for finding summaries.

For examples of how these heuristics control data miners, consider the CART and Random Forests tree learners. These algorithms take a data set, then recursively split each node until some stop criterion is satisfied. In the case of building defect predictors, these learners reflect on the number of issue reports raised for each class in a software system, where the issue counts are converted into binary “yes/no” decisions via Equation 1, where T is a threshold number.

\[ \text{defective} = \begin{cases} \text{yes} & \text{if the issue count} \ge T \\ \text{no} & \text{otherwise} \end{cases} \qquad (1) \]

For the specific implementation in Scikit-learn scikit-learn, the splitting process is controlled by the tuning parameters listed in Figure 1, where the default parameters for CART and Random Forests are those set by the Scikit-learn authors, except for n_estimators: as recommended by Witten et al. Witten2011, we used a forest of 100 trees as the default instead of 10. If a node contains more than min_sample_split samples, then a split is attempted. On the other hand, if a split would produce a node with no more than min_samples_leaf samples, then the recursion stops.

These learners use different techniques to explore the splits:

  • CART finds the attributes of the data set whose ranges contain rows (samples of data) with the least variance in the number of defects: if an attribute range $r_i$ is found in $n_i$ rows, each with a defect count variance of $v_i$, then CART seeks the attributes whose ranges minimize $\sum_i \left( \sqrt{v_i} \times n_i / \sum_i n_i \right)$.

  • Random Forests divides data like CART, then builds many trees (n_estimators of them), each time using some random subset of the attributes.

Note that some tuning parameters are learner specific. max_feature is used by CART and Random Forests to select the number of attributes used to build one tree. CART’s default is to use all the attributes, while Random Forests usually selects the square root of the number of attributes. Also, max_leaf_nodes is the upper bound on leaf nodes generated by a Random Forests. Lastly, max_depth is the upper bound on the depth of the CART tree.
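To make the mapping from Figure 1 to code concrete, the following sketch (our illustration, not the original study’s scripts) instantiates both learners in Scikit-learn with the defaults discussed above. The threshold parameter of Figure 1 is not a Scikit-learn argument; it is applied afterwards as a cutoff on the predicted probability of the defective class.

```python
# A minimal sketch: the Figure 1 defaults expressed as Scikit-learn estimators.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

def make_cart(max_feature=None, min_sample_split=2, min_samples_leaf=1,
              max_depth=None):
    # CART as implemented by Scikit-learn's DecisionTreeClassifier
    return DecisionTreeClassifier(max_features=max_feature,
                                  min_samples_split=min_sample_split,
                                  min_samples_leaf=min_samples_leaf,
                                  max_depth=max_depth)

def make_random_forest(max_feature=None, max_leaf_nodes=None,
                       min_sample_split=2, min_samples_leaf=1,
                       n_estimators=100):
    # Random Forests with the Witten et al. recommendation of 100 trees
    return RandomForestClassifier(max_features=max_feature,
                                  max_leaf_nodes=max_leaf_nodes,
                                  min_samples_split=min_sample_split,
                                  min_samples_leaf=min_samples_leaf,
                                  n_estimators=n_estimators)

def predict_defective(model, X, threshold=0.5):
    # The tunable decision threshold: flag a class as defective when the
    # predicted probability of the defective class reaches the threshold.
    # (Assumes the defective class is encoded as label 1.)
    return model.predict_proba(X)[:, 1] >= threshold
```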

2.3 Parameter Tuning

Parameter tuning for evolutionary algorithms was studied by Arcuri & Fraser and presented at SSBSE’11 Arcur11. Using the grid search method (described below), they found that different parameter settings for evolutionary programs cause very large variance in their performance. Also, while default parameter settings perform relatively well, they are far from optimal on particular problem instances.

Tuning is now explored in many parts of the SE research literature. Apart from defect prediction, tuning is used in the hyper-parameter optimization literature exploring better combinatorial search methods for software testing Jia15, or the use of genetic algorithms to explore 9.3 million different configurations for clone detection algorithms Wang:Harman13.

Other researchers explore the effects of parameter tuning on topic modeling panichella2013effectively (which is a text mining technique). Like Arcuri & Fraser, that work showed that the performance of the LDA topic modelling algorithm was greatly affected by the choice of the four parameters that configure LDA. Furthermore, Agrawal et al. amrit16 demonstrate that more stable topics can be generated by tuning LDA parameters using a differential evolution algorithm.

Tuning is also used for software effort estimation; e.g., using tabu search for tuning SVM cora10; or genetic algorithms for tuning ensembles minku2013analysis; or as an exploration tool for checking whether parameter settings affect the performance of effort estimators (and which learning machines are more sensitive to their parameters) song2013impact. The latter study explored Random Forests, kth-nearest neighbor methods, MLPs, and bagging. This was another grid search paper; it explored a range of tuning parameters, divided into 5 bins.

Our focus is on defect prediction and for this task, Lessmann et al. used grid search to tune parameters as part of their extensive analysis of different algorithms for defect prediction lessmann2008benchmarking. Strangely, they only tuned a small set of their learners while for most of them, they used the default settings. This is an observation we cannot explain but our conjecture is that the overall cost of their grid search tuning was so expensive that they restricted it to just the hardest choices. To be fair to Lessmann et al., we note that the general trend in SE data science papers is to explore defect prediction without tuning the parameters; e.g. me07b.

Two exceptions to this rule are recent work by Tantithamthavorn et al. tantithamthavorn2016automated and Fu et al. fu16: the former used grid search while the latter used differential evolution (both these techniques are detailed, below). These two teams worked separately using completely different scripts (written in “R” or in Python). Yet for tuning defect predictors, both groups reported the same results:

  • Across a range of performance measures (AUC, precision, recall, F-measure), tuning rarely makes performance worse;

  • Tuning offers a median improvement of 5% to 15%;

  • For a third of the data sets explored, tuning can result in performance improvements of 30% to 50%.

Also, in a result that echoes one of the conclusions of Arcuri & Fraser, Fu et al. report that different data sets require different tunings.

What was different between Fu et al. and Tantithamthavorn et al. was the computational cost of the two studies. Fu et al. used a single desktop machine and all their runs terminated in 2 hours for one tuning goal. On the other hand, the grid search of Tantithamthavorn et al. used 43 high performance computing machines with 24 hyper-threads each, i.e., 24 hyper-threads times 43 machines = 1,032 hyper-threads. Their total runtime was not reported, but as shown below, such tuning with grid search can take over a day just to learn one defect predictor.

3 Comparing Grid Search and Differential Evolution

Tantithamthavorn et al. tantithamthavorn2016automated and Fu et al. fu16 use different methods to tune defect predictors. Neither offers a comparison of their preferred tuning method to any other. This section offers such a case study result: specifically, a comparison of grid search and differential evolution for tuning defect predictors.

3.1 Algorithms

Grid search is simply picking a set of values for each configuration parameter, evaluating all the combinations of those values, and then returning the best one as the final result; it can be implemented with nested for-loops. For example, for Naive Bayes, two loops might explore different values of the Laplace and M-estimators while a third loop might explore what happens when numeric values are divided into different numbers of bins.

Bergstra and Bengio Bergstra2012 comment on the popularity of grid search:

  • Such a simple search gives researchers some degree of insight into it;

  • There is little technical overhead or barrier to its implementation;

  • As to automating grid search, it is simple to implement and parallelization is trivial;

  • According to Bergstra and Bengio Bergstra2012, grid search (on a compute cluster) can find better tunings than sequential optimization (in the same amount of time).

Since it’s easy to understand and implement and, to some extent, also has good performance, grid search has been available in most popular data mining and machine learning tools, like

caret kuhn2014caret package in the and GridSearchCV module in Scikit-Learn.
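As an illustration of the nested-loop idea, the sketch below (ours, not the scripts of either study; Tantithamthavorn et al. used R’s caret) builds a grid over the CART ranges of Figure 1 with 5 values per parameter, as in Section 3.2.2, and hands it to GridSearchCV. The decision threshold of Figure 1 is omitted since it is not an estimator argument.

```python
# A sketch only: grid search over the CART ranges of Figure 1,
# with each numeric range cut into 5 values.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "max_features":      np.linspace(0.01, 1.0, 5).tolist(),  # Figure 1's max_feature
    "min_samples_split": [2, 6, 11, 15, 20],
    "min_samples_leaf":  [1, 5, 10, 15, 20],
    "max_depth":         [1, 13, 25, 38, 50],
}

# 5*5*5*5 = 625 candidate tunings, each evaluated by cross-validation;
# every extra parameter multiplies this count (the O(k^n) cost of Section 3.1).
grid = GridSearchCV(DecisionTreeClassifier(), param_grid, scoring="f1", cv=5)
# grid.fit(X_train, y_train); grid.best_params_
```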

Differential evolution is included in many optimization toolkits, like JMetal in Java durillo2011jmetal. But given its implementation simplicity, it is often written from scratch using the researcher’s preferred scripting language. For each parent vector $A$ in a list $F$ (the frontier), differential evolution just randomly picks three other different vectors $B$, $C$, $D$ from $F$ storn1997differential. Each pick generates a new vector $Y$ (which replaces $A$ if it scores better). $Y$ is generated as follows:

\[ \forall i,\quad Y_i = \begin{cases} B_i + f \times (C_i - D_i) & \text{if } r < cr \\ A_i & \text{otherwise} \end{cases} \qquad (2) \]

where $r$ is a random number in $[0,1]$, and $f$ and $cr$ are constants (set following Storn et al. storn1997differential). Also, one $Y$ value (picked at random) is copied unchanged from $A$ to ensure that $Y$ has at least one unchanged part of an existing vector.
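To make Equation 2 concrete, here is a minimal DE loop written from scratch in Python, as the text above suggests researchers often do. The frontier size, generation count, and the constants f and cr shown here are illustrative assumptions, not necessarily the settings of this study; continuous values would also need rounding for integer parameters such as min_sample_split.

```python
# A simplified differential evolution loop (illustrative settings).
# "score" evaluates a candidate tuning, e.g. by training a learner on the
# training release and measuring the F-measure on the tuning release.
import random

def de(score, bounds, frontier_size=10, generations=10, f=0.75, cr=0.3):
    dims = len(bounds)
    # the frontier: candidate tunings drawn at random from the tuning ranges
    frontier = [[random.uniform(lo, hi) for lo, hi in bounds]
                for _ in range(frontier_size)]
    scores = [score(a) for a in frontier]
    for _ in range(generations):
        for i, parent in enumerate(frontier):
            others = [v for j, v in enumerate(frontier) if j != i]
            b, c, d = random.sample(others, 3)   # the B, C, D of Equation 2
            y = list(parent)
            keep = random.randrange(dims)        # keep one parent value unchanged
            for k in range(dims):
                if k != keep and random.random() < cr:
                    lo, hi = bounds[k]
                    # Equation 2: step away from b by the difference of c and d
                    y[k] = min(hi, max(lo, b[k] + f * (c[k] - d[k])))
            s = score(y)
            if s > scores[i]:                    # replace the parent if y scores better
                frontier[i], scores[i] = y, s
    best = max(range(frontier_size), key=lambda i: scores[i])
    return frontier[best], scores[best]
```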

As a sanity check, we also provide random search as a third optimizer to tune the parameters. Random search simply generates a set of random candidate parameters and evaluates each against the current “best” one; if a candidate is better, it replaces the “best” one. The process repeats until a stop condition is met. In this case study, we set the maximum number of iterations for random search to the median number of evaluations used by DE. The parameters are randomly generated from the same tuning ranges as in Figure 1.

Dataset antV0 antV1 antV2 camelV0 camelV1 ivy jeditV0 jeditV1 jeditV2
training (release $i$) 20 / 125 40 / 178 32 / 293 13 / 339 216 / 608 63 / 111 90 / 272 75 / 306 79 / 312
tuning (release $i+1$) 40 / 178 32 / 293 92 / 351 216 / 608 145 / 872 16 / 241 75 / 306 79 / 312 48 / 367
testing (release $i+2$) 32 / 293 92 / 351 166 / 745 145 / 872 188 / 965 40 / 352 79 / 312 48 / 367 11 / 492
Dataset log4j lucene poiV0 poiV1 synapse velocity xercesV0 xercesV1
training (release $i$) 34 / 135 91 / 195 141 / 237 37 / 314 16 / 157 147 / 196 77 / 162 71 / 440
tuning (release $i+1$) 37 / 109 144 / 247 37 / 314 248 / 385 60 / 222 142 / 214 71 / 440 69 / 453
testing (release $i+2$) 189 / 205 203 / 340 248 / 385 281 / 442 86 / 256 78 / 229 69 / 453 437 / 588
Figure 2: Data used in this case study. Fractions denote defects / total. E.g., the top left data set has 20 defective classes out of 125 total. See explanation in section 3.2.3.

Grid search is much slower than DE since DE explores fewer options. Grid search’s execution of $n$ nested loops, each exploring $k$ options, takes time $O(k^n)$, where $n$ is the number of parameters being tuned. Hence:

  • Tantithamthavorn et al. required 1000s of hyperthreads to complete their study in less than a day tantithamthavorn2016automated.

  • The grid search of Arcuri & Fraser Arcur11 took weeks to terminate, even using a large computer cluster.

  • In the following study, our grid search took 54 days of total CPU time.

On the other hand, DE’s runtimes are much faster since they are linear in the size of the frontier: the number of candidate evaluations grows with $|F|$ rather than $k^n$.
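As a rough worked example (the DE settings here are illustrative, not this study’s exact configuration): gridding the six Random Forests parameters of Figure 1 into $k=5$ bins gives

\[ k^{n} = 5^{6} = 15{,}625 \]

candidate tunings to evaluate, whereas a DE frontier of 10 candidates evolved for 10 generations performs on the order of $10 \times 10 = 100$ evaluations.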

3.2 Case Studies

In order to understand the relative merits of grid search versus differential evolution for defect prediction, we performed the following study.

3.2.1 Data Miners

This study uses Random Forests and CART, for the following reasons. Firstly, they were two of the learners studied by Tantithamthavorn et al. Secondly, they are interesting learners in that they represent two ends of a performance spectrum for defect predictors. CART and Random Forests were mentioned in a recent IEEE TSE paper by Lessmann et al. lessmann2008benchmarking that compared 22 learners for defect prediction. That study ranked CART worst and Random Forests best. In a demonstration of the impact of tuning, Fu et al. showed that they could refute the conclusions of Lessmann et al. in the sense that, after tuning, CART performs just as well as Random Forests.

3.2.2 Tuning Parameters

Our DE and grid search explored the parameter space of Figure 1. Specifically, since Tantithamthavorn et al. divide each tuning range into 5 bins (where applicable), we use the same policy here: for each numeric parameter we pick five values evenly spread across its tuning range, and the grids for the other parameters are generated in the same way. As to why we used the “Tuning Range” shown in Figure 1, and not some other ranges, we note that (1) those ranges include the defaults; (2) the results shown below show that, by exploring those ranges, we achieved large gains in the performance of our defect predictors. This is not to say that larger tuning ranges might not result in greater improvements.

3.2.3 Data

Our defect data, shown in Figure 2, comes from the PROMISE repository (http://openscience.us/repo). This data pertains to open source Java systems described in terms of Table 1: ant, camel, ivy, jedit, log4j, lucene, poi, synapse, velocity and xerces. We selected these data sets since they have at least three consecutive releases (where release $i+1$ was built after release $i$). This allows us to build defect predictors from past data and then predict (test) defects on future versions of a project, which is a more practical scenario.

More specifically, when tuning a learner:

  • Release $i$ was used for training a learner with the tunings generated by grid search or differential evolution.

  • During the search, each candidate tuning had to be evaluated by some model, which we built using CART or Random Forests from release $i$ and scored on release $i+1$.

  • After grid search or differential evolution terminated, we tested the tunings found by those methods on CART or Random Forests applied to release $i+2$.

  • For comparison purposes, CART and Random Forests were also trained (with default tunings) on the earlier releases and then tested on release $i+2$. A sketch of this protocol appears after this list.
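A compact sketch of that protocol, with hypothetical helper names (make_learner could be one of the constructors sketched in Section 2.2, and tuner could wrap the DE, random search, or grid search routines):

```python
# A minimal sketch of the release-based train/tune/test protocol
# (hypothetical helpers, not the original study's scripts).
from sklearn.metrics import f1_score

def tune_then_test(make_learner, tuner, train_rel, tune_rel, test_rel):
    (Xtr, ytr), (Xtu, ytu), (Xte, yte) = train_rel, tune_rel, test_rel

    def score(params):
        # candidate evaluation: build on release i, score on release i+1
        model = make_learner(**params).fit(Xtr, ytr)
        return f1_score(ytu, model.predict(Xtu))

    best_params, _ = tuner(score)                 # DE, random search or grid search
    final = make_learner(**best_params).fit(Xtr, ytr)
    return f1_score(yte, final.predict(Xte))      # report performance on release i+2
```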

3.2.4 Optimization Goals:

Our optimizers explore tuning improvements for precision and the F-measure, defined as follows. Let $tn$, $fn$, $fp$, $tp$ denote the true negatives, false negatives, false positives, and true positives (respectively) found by a binary detector. Certain standard measures can be computed from these counts, as shown below. Note that for the false alarm rate $pf$, better scores are smaller, while for all other scores, better scores are larger.
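For reference, these measures follow the standard confusion-matrix definitions (recall is also written pd, for probability of detection; pf denotes the false alarm rate):

\[ \mathit{recall} = \frac{tp}{tp+fn}, \qquad \mathit{pf} = \frac{fp}{fp+tn}, \qquad \mathit{precision} = \frac{tp}{tp+fp}, \qquad F = \frac{2 \cdot \mathit{precision} \cdot \mathit{recall}}{\mathit{precision} + \mathit{recall}} \]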

Figure 3: Tuning to improve precision and the F-measure. Median results from 20 repeats. F-measure results shown on left. Precision results shown on right. Blue = DE, Green = no tuning, Yellow = random search. Red = grid search results that are better than and statistically different from DE.

We do not explore all goals since some have trivial, but not useful, solutions. For example, when we tune for recall, we can achieve near 100% recall, but at the cost of near 100% false alarms. Similarly, when we tune to minimize false alarms, we can achieve near 0% false alarms, but at the cost of near 0% recall.

The lesson here is that tuning for defect predictors needs some “brake” effect where multiple goals are in contention (so one cannot be pushed to some extreme value without being “braked” by the other). Precision’s definition takes into account not only the defective examples but also the non-defective ones, so it has this brake effect. The same is true for the F-measure (since it uses precision).

3.2.5 20 Random Runs:

All our studies were repeated 20 times to check the stability of conclusions across different random biases. Initially, we planned for more repeats (appealing to the central limit theorem) but grid search proved to be so slow that, for pragmatic reasons, we used 20 repeats. To be clear, the random seed is different for each data set in each repeat, but it is the same across the learners built by grid search, DE and random search for the same data set.

We believe this is the right thing to do because the search bias for a particular train/tune/test run is then always the same: for the same train/tune/test data sets, the search algorithms start from the same random seed. With this approach, it is important to note that different train/tune/test triplets have different seed values (so this case study does sample across a range of search biases).

3.2.6 Statistical Tests

For each data set, the results of grid search and DE were compared across the 20 repeats. Statistical differences were tested by the Scott-Knott test scott1974cluster that used the Efron & Tibshirani bootstrap procedure efron93 and the Vargha and Delaney A12 effect size test Vargha00 to confirm that sub-clusters of the treatments are statistically different by more than a small effect. We used these statistical tests since they were recently endorsed in the SE literature by Mittas & Angelis (for Scott-Knott) in TSE’13 mittas13 and Arcuri & Briand (for A12) at ICSE’11 arcuri11.

3.3 Results

In this section, we present the results from the case studies designed above. To better compare DE with grid search, we investigate the following questions:

  • Does tuning improve learner performance?

  • Is grid search statistically better than DE in terms of performance improvements (precision & F-measure)?

  • Is DE a more efficient tuner than grid search in terms of cost (runtime)?

3.3.1 Performance Improvements

The changes in precision and F-measure using parameters chosen by DE and grid search are shown in Figure 3. As a note, Figure 3 is generated by two separate case studies with tuning goals of precision and F-measure, respectively. In that figure:

  • The blue squares show the DE results;

  • The green diamonds show what happens when we run a learner using the default parameter settings;

  • The yellow triangles show the random search results;

  • The red dots indicate where the grid search results were statistically different from and better (i.e., greater) than DE.

Overall, from Figure 3, we can see that random search and DE both improve learner performance in terms of precision and F-measure. For example, in this case study, a simple random search improved precision scores for both CART and Random Forests in many of the data sets. A similar pattern can be found when the F-measure is set as the tuning objective. This further supports the conclusion of Fu et al. fu16 that parameter tuning is helpful and cannot be ignored.

Result 1

Tuning can improve learner performance in most cases; DE is better than random search in a couple of data sets.

The key feature of Figure 3 is that the red dots are in the minority; i.e., there are very few cases where grid search generates better results than DE:

  • When tuning CART:

    • For the F-measure, grid search was never better than DE;

    • For precision, grid does better than DE in some cases (less than half the time).

  • When tuning Random Forests:

    • For the F-measure, there is only one case where grid does better than DE;

    • For precision, grid does better than DE in a few data sets; i.e., not very often.

Note that Figure 3 also contains results that echo conclusions from Arcuri & Fraser in SSBSE’11 Arcur11:

  • Default parameter settings perform relatively well. Note that there exist several cases where the tuned results (blue squares or red dots) are not much different from the results obtained using the default parameters (the green diamonds).

  • But they are far from optimal on particular problem instances. Note all the results where the red dots and blue squares are higher than green, particularly in the results where we are tuning for precision.

Result 2

Differential evolution is just as effective as grid search for improving defect predictors.

3.3.2 Runtime

Figure 4: Time (in seconds) required to run the 20 repeats of tunings used to generate Figure 3. Left-hand-side plot shows raw runtime; right-hand-side plot shows runtime relative to just running the learners with their default tunings.

Figure 4 shows the runtime costs of grid search, random search and differential evolution over the 17 data sets, for both F-measure and precision as optimization objectives:

  • The orange plot shows the raw runtime (in seconds) of our entire trial. In that plot,

    • CART is faster than Random Forests: 20 repeats of a grid search of the latter required 54 days of CPU time;

    • DE is faster than grid search: DE’s runtimes were one to two orders of magnitude smaller than those of grid search.

    • DE runs as fast as random search for both CART and RF: there’s no significant difference.

  • The blue plot shows the same information, but in ratio form.

    • Grid search is 1,000 to 10,000 times slower than just running the default learners.

    • While DE adds a much smaller factor to the default (untuned) runtime cost.

Result 3

Differential evolution runs much faster than grid search.

In summary:

  • Both DE and random search as well as grid search can improve precision and the F-measure.

  • Compared to DE, grid search runs far too long for too little additional benefit.

  • Both DE and random search require about the same amount of runtime. But in some cases, DE has better performance than random search.

  • There are many cases where DE out-performs grid search.

4 Discussion

4.1 Why does DE perform better than grid search?

How can we explain the surprising success of DE over grid search? Surely a thorough investigation of all options (via grid search) must do better than a partial exploration of just a few options (via DE).

It turns out that grid search is not a thorough exploration of all options. Rather, it jumps through different parameter settings between some minimum and maximum value of a pre-defined tuning range. If the best options lie in between these jumps, then grid search will skip the critical tuning values. That means the selected grid points ultimately determine what kind of tunings we can get, and choosing good grid points requires a lot of expert knowledge.

Note that DE is less prone to such skipping since, as shown in Equation 2, tuning values are adjusted by some random amount that is the difference between two randomly selected vectors. Further, if that process results in a better candidate, then this new randomly generated value might be used as the starting point for a subsequent random selection. Hence DE is more likely than grid search to “fill in the gaps” between the initially selected values.

Another important difference between DE and grid search is the nature of their searches:

  • All the grid points in the pre-defined grids are independently evaluated. This is useful since it makes grid search highly suited to parallelism (just run some of the loops on different processors). That said, this independence has a drawback: any lessons learned midway through a grid search cannot affect (improve) the inferences made in the remaining runs.

  • DE’s candidates (equivalent to grid points) do “transfer knowledge” to candidates in the new generation. Since DE is an evolutionary algorithms, the better candidates will be inherited by the following generations. That said, DE’s discoveries of better vectors accumulate in the frontier– which means new solutions(candidates) are being continually built from increasingly better solutions cached in the frontier. That is, lessons learned midway through a DE run can improve the inferences made in the remaining runs.

Bergstra and Bengio Bergstra2012 offer a more formal analysis of why random searches (like DE) can do better than grid search. They comment that grid search can be expected to fail if the region containing the useful tunings is very small. In such a search space:

  • Grid search can waste much time exploring an irrelevant part of the space.

  • Grid search’s effectiveness is limited by the curse of dimensionality.

Bergstra and Bengio’s reasons for the second point are as follows. They compared deep belief networks configured by a thoughtful combination of manual search and grid search against purely random search over the same 32-dimensional configuration space. They found statistically equal performance on four of seven data sets, and superior performance on one of seven. A Gaussian process analysis of their systems revealed that for most data sets only a few of the tuning parameters really matter, but that different hyper-parameters are important on different data sets. They comment that a grid with sufficient granularity to tune for all data sets must consequently be inefficient for each individual data set because of the curse of dimensionality: the number of wasted grid search trials is exponential in the number of search dimensions that turn out to be irrelevant for a particular data set. Bergstra and Bengio add:

… in contrast, random search thrives on low effective dimensionality. Random search has the same efficiency in the relevant subspace as if it had been used to search only the relevant dimensions.

Our results in Section 3.3 also support Bergstra and Bengio’s conclusion that random search does better than grid search when exploring the defect prediction data space: random search was as good as DE in most data sets, while grid search rarely outperformed DE or random search in terms of performance scores (F-measure and precision) and wasted a lot of time exploring unnecessary parts of the space.

4.2 When (not) to use DE?

How might we assess the external validity of the above results? Is it possible to build some predictor for when DE might and might not work well?

To explore these questions, we use Bergstra and Bengio’s comments to define the conditions under which we would expect DE to work better than grid search for defect prediction. According to the argument above, DE works well for tuning since:

  • DE tends to favor the small number of dimensions relevant to tuning;

  • The space of tunings for defect predictors is inherently low dimensional.

In defence of the first point, recall that Equation 2 says that DE repeatedly compares an existing tuning against another candidate that is constructed by taking a small step defined by three other candidates $B$, $C$, $D$. DE runs over the list of old candidates multiple times. After the first pass, the invariant is that members of that list are not inferior to at least one other example. If a new candidate is created that is orthogonal to the relevant dimensions, it will be no better than the candidates it was created from. Hence, the invariant for any successful replacement of a parent is that the new candidate has moved over the relevant dimensions.

As to the second point about the low dimensional nature of tuning defect predictors, we first assume that the dimensionality of the tuning problem is linked to the dimensionality of the data explored by the learners. Our argument for this assumption is (1) learners like CART and Random Forest divide the data into regions with similar properties (specifically, those with and without defects); (2) when we tune those learners, we are constraining how they make those divisions over that data.

Figure 5: Intrinsic dimensionality of our data.

Given that assumption, exploring the space of tunings for defect predictors really means exploring the dimensionality of defect prediction data. Two studies strongly suggest that this data is inherently low dimensional. Papakroni papa13 combined instance selection and attribute pruning for defect prediction. Using some information theory, he was able to prune 75% of the attributes of Table 1 as uninformative. He then clustered the remaining data, replacing each cluster with one centroid. This two-phase pruning procedure reduced data sets with, e.g., 24 columns (attributes) and 800 rows (instances) to tables of 6 columns and 20 rows. To test the efficacy of that reduced space, Papakroni built defect predictors by extrapolating between the two nearest centroids in the reduced space for each test case. Papakroni found that the estimates from that small space worked just as well as those generated by state-of-the-art learners (Random Forests and Naive Bayes) using all the data papa13. That is, according to Papakroni, the signal in these data sets can be found in a handful of attributes and a few dozen instances.

In order to formalize the findings of the Papakroni study, Provence province15 explored the intrinsic dimensionality of defect data sets. Intrinsic dimensionality measures the number of dimensions actually used by data placed within an $n$-dimensional space: data that spreads across all $n$ dimensions has intrinsic dimensionality $n$, while data that falls along a lower-dimensional surface within that space does not use all the available dimensions and hence has a lower intrinsic dimensionality.

Like Provence, we use the correlation dimension to calculate the intrinsic dimensionality of the data sets. Euclidean distance is used to compute the distance between items, using their independent attributes, with all values normalized to $[0,1]$. Next, we use that distance measure as part of the correlation dimension defined by Grassberger and Procaccia grassberger1983measuring. The correlation dimension of a data set with $N$ items is found by plotting, against $r$, the number of pairs of items found within radius $r$ of each other. Normalizing by the number of connections between the $N$ items gives the expected proportion of neighbors at distance $r$:

\[ C(r) = \frac{2}{N(N-1)} \sum_{i<j} I\big( \lVert x_i - x_j \rVert < r \big) \]

Given a data set with $N$ items and minimum and maximum pairwise distances $r_{\min}$ and $r_{\max}$, we estimate the intrinsic dimensionality as the mean slope of $\log C(r)$ versus $\log r$, evaluated for values of $r$ in $[r_{\min}, r_{\max}]$ that contain enough points for a good estimation of the slope.
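A small sketch of that calculation (our own illustration; the choice of radii and of the scaling region is a simplifying assumption):

```python
# Sketch: estimate intrinsic (correlation) dimension as the slope of
# log C(r) versus log r, following Grassberger & Procaccia.
import numpy as np
from scipy.spatial.distance import pdist

def intrinsic_dimension(X, n_radii=20):
    X = np.asarray(X, dtype=float)
    # normalize each attribute to [0, 1]
    lo, hi = X.min(axis=0), X.max(axis=0)
    X = (X - lo) / np.where(hi > lo, hi - lo, 1.0)
    d = pdist(X)                       # all pairwise Euclidean distances
    r_min, r_max = d[d > 0].min(), d.max()
    radii = np.logspace(np.log10(r_min), np.log10(r_max), n_radii)
    # C(r): fraction of pairs closer than r (expected neighbors at distance r)
    C = np.array([(d < r).mean() for r in radii])
    keep = (C > 0) & (C < 1)           # restrict to the informative scaling region
    slope, _ = np.polyfit(np.log(radii[keep]), np.log(C[keep]), 1)
    return slope
```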

Figure 5 shows the intrinsic dimensions for the data sets used in this study. Note the low intrinsic dimensionality (median value, shown as the dashed line, nearly 1.2). By way of comparison, the intrinsic dimensions reported by other researchers in their data sets (not from SE) is often much larger; e.g. 5 to 10, or more; see levina04.

From all this, Provence concluded that the effects reported by Papakroni were due to the underlying low dimensionality of the data. Extending his result, we conjecture that our conclusion (that DE does better than grid search for tuning data miners) is externally valid when the data miners are exploring data with low intrinsic dimensionality.

5 Conclusion and Future Work

When the control parameters of data miners are tuned to the local data, the performance of the resulting predictions can be greatly increased. Such tuning is very computationally expensive, especially when done with grid search. For some software engineering tasks, it is possible to avoid those very long runtimes. When tuning defect predictors learned from static code attributes, a simple randomized evolutionary strategy (differential evolution) runs one to two orders of magnitude faster than grid search. Further, the tunings found in this way work as well as, or better than, those found via grid search. We explain this result using (1) the Bergstra and Bengio argument that random search works best in low-dimensional data sets and (2) the empirical results of Papakroni and Provence that defect data sets are very low dimensional in nature.

As to testing the external validity of this paper’s argument, the next steps are clear:

  1. Sort data sets by how well a simple evolutionary algorithm like DE can improve the performance of data miners executing on that data;

  2. Explore the difference in the worst and best end of that sort.

  3. If the intrinsic dimensionalities are very different at the two ends of this sort, then that would support the claims of this paper.

  4. Else, this paper’s claims would not be supported and we would need to seek other differences between the best and worst data sets.

Note that we offer these four items as future work, rather than reported results, since so far all the defect data sets we have tried have responded best to DE. For a time, we did consider trying this with artificially generated data (where we control how many dimensions are contained in the data). However, prior experience with artificial data sets me99q suggested to us that such arguments are not widely convincing, since the issue is not how often an effect holds in artificial data, but how often it holds in real data. Hence, in future work, we will look further afield for radically different (real world) data sets.

Acknowledgments

This work was partially funded by a National Science Foundation CISE CCF award #1506586.

References