Is One Hyperparameter Optimizer Enough?

07/29/2018 ∙ by Huy Tu, et al. ∙ 0

Hyperparameter tuning is the black art of automatically finding a good combination of control parameters for a data miner. While widely applied in empirical Software Engineering, there has not been much discussion on which hyperparameter tuner is best for software analytics. To address this gap in the literature, this paper applied a range of hyperparameter optimizers (grid search, random search, differential evolution, and Bayesian optimization) to the defect prediction problem. Surprisingly, no hyperparameter optimizer was observed to be “best” and, for one of the two evaluation measures studied here (F-measure), hyperparameter optimization was, in 50% of cases, no better than using default configurations. We conclude that hyperparameter optimization is more nuanced than previously believed. While such optimization can certainly lead to large improvements in the performance of classifiers used in software analytics, it remains to be seen which specific optimizers should be applied to a new dataset.




1. Introduction

Recent results from software analytics show that the performance of a data miner exploring software data increases significantly after hyperparameter tuning (Agrawal et al., 2018a; Agrawal and Menzies, 2018; Fu et al., 2016; Fu and Menzies, 2017; Jia et al., 2015; Wang, Harman, Jia, and Krinke, Wang et al.; Corazza et al., 2010; Minku and Yao, 2013; Song et al., 2013). Such tuners automatically search the very large input space of possible parameter settings for a data miner.

Researchers in this area use a very narrow range of optimizers. For example, WEKA comes with a hyperparameter optimizer based on SMAC (a state-of-the-art Bayesian optimization method (Hutter et al., 2011)). While there is much to recommend SMAC, it is only one of a wide range of possible tuners including grid search, random search (Bergstra and Bengio, 2012), modern evolutionary search methods, sampling methods (Jia et al., 2015), or various domain-specific methods that exploit some aspect of the local problem (Chen and Menzies, 2018). Before this community uncritically endorses the use of a single tuner, it seems appropriate and timely to reflect on the relative merits of multiple tuners. Hence, this paper.

| Metric | Name | Description |
|---|---|---|
| amc | average method complexity | Number of JAVA byte codes. |
| avg_cc | average McCabe | Average McCabe’s cyclomatic complexity seen in class. |
| ca | afferent couplings | How many other classes use the specific class. |
| cam | cohesion amongst classes | Summation of number of different types of method parameters in every method divided by a multiplication of number of different method parameter types in whole class and number of methods. |
| cbo | coupling between objects | Increased when the methods of one class access services of another. |
| ce | efferent couplings | How many other classes are used by the specific class. |
| dam | data access | Ratio of private (protected) attributes to total attributes. |
| dit | depth of inheritance tree | The maximum length from the node to the root of the tree. |
| ic | inheritance coupling | Number of parent classes to which a given class is coupled (includes counts of methods and variables inherited). |
| lcom | lack of cohesion in methods | Number of pairs of methods that do not share a reference to an instance variable. |
| locm3 | another lack of cohesion measure | Defined in terms of the number of methods and attributes in a class and the number of methods accessing an attribute. |
| loc | lines of code | Total lines of code in this file or package. |
| max_cc | maximum McCabe | Maximum McCabe’s cyclomatic complexity seen in class. |
| mfa | functional abstraction | Number of methods inherited by a class plus number of methods accessible by member methods of the class. |
| moa | aggregation | Count of the number of data declarations (class fields) whose types are user defined classes. |
| noc | number of children | Number of direct descendants (subclasses) for each class. |
| npm | number of public methods | Counts all the methods in a class that are declared as public. |
| rfc | response for a class | Number of methods invoked in response to a message to the object. |
| wmc | weighted methods per class | A class with more member functions than its peers is considered to be more complex and therefore more error prone. |
| defect | defect | Boolean: whether defects were found in post-release bug-tracking systems. |

Table 1. Object Oriented Measures used in our defect datasets
Figure 1. WEKA’s hyperparameter tuning tool from UBC. Note that central to the tool is a single hyperparameter optimizer called SMAC (Hutter et al., 2011), shown in RED. This paper asks the question “is one hyperparameter optimizer enough?”.

The experiments in this paper document the efficacy of default versus tuned settings using four state-of-the-art hyperparameter optimization techniques (grid search, random search, differential evolution, and Bayesian optimization) across four representative classes of data miners (decision tree, random forest, support vector machine, and k nearest neighbors). This study investigates the practicability and benefits of hyperparameter tuning in defect prediction for three goals: (1) F-Measure, (2) Precision, and (3) extensive analysis at the level of each software release version. F-Measure and Precision were used since they reference multiple target classes, whereas optimizing for goals based on single targets leads to (e.g.) high recalls and high false alarms. Most previous studies investigated only at the project version level (Agrawal and Menzies, 2018; Fu et al., 2016).

Note that we make no claim that the tuners we explore cover the space of all possible hyperparameter optimizers. Instead, we explore just some of the more popular ones and show that there is much variability in which of these tuners is best. This result, even on just the tuners explored here, is sufficient to motivate future work that explores how to select tuning algorithms for SE data sets. This work explores two research questions.

RQ1: Is hyperparameter tuning useful in defect prediction?

Given the current success of state-of-the-art hyperparameter tuning work in software analytics (Fu and Menzies, 2017; Fu et al., 2016; Agrawal et al., 2018a), we aim to challenge how far that success extends when the datasets are smaller in scope, specifically in defect prediction (moving from the project version level to the software release version level). Our experiments show statistically significant improvements when using hyperparameter tuning in defect prediction (especially in precision). In summary, we show that:

Lesson 1: Hyperparameter tuning is useful in defect prediction. This agrees with the recent success of hyperparameter tuning in empirical SE studies.

This confirmation of the usefulness of hyperparameter tuning in defect prediction leads us to wonder which hyperparameter optimizer should be considered the standard choice. Hence, our next question is as follows.

RQ2: Which hyperparameter optimizer is the best for defect prediction?

We find that some optimizers work best only for specific learners and evaluation criteria; e.g. Bayesian Optimization works well in Precision but not in F-Measure while DE optimizes greatly in F-Measure but not in Precision. Moreover, Bayesian Optimization is expensive in time (two to 100 times slower than other techniques). Hence we say:

Lesson 2: No hyperparameter optimization technique was considered to be best.

That is, our answer to the question “is one hyperparameter optimizer enough” is “no”. Hence we must deprecate papers that report the results of tuning based on a single optimizer.

In summary, the main contributions of this paper are:

  • An extensive experimental survey of hyperparameter tuning in defect prediction;

  • A comment on the (lack of) generality of conclusions from such hyperparameter studies (we cannot claim that one hyperparameter optimizer is better than another);

  • A reproduction package containing all the data, algorithms, and experimentation of this paper, see

The rest of this paper is organized as follows. Section 2 describes background, how defect predictors can be generated by data miners, and how tuning can affect the effectiveness of the learners. Section 3 defines the experimental setup of this paper. Section 4 presents the results from the case study, which are then discussed. Lastly, we discuss threats to the validity of our results and present our conclusions.

2. Related Work

2.1. Defect Prediction

Human programmers are clever, but flawed. Wherever software functionality is developed, software defects follow. Defects include software crashes as well as wrong or missing functionality. Since defects are inherent, it is important to be aware of them and to take proactive steps to minimize and prevent future defects. It is imperative to efficiently summarize the knowledge about defects within the system, so that proactive action against defects can be balanced with developing new functionality. Testing before software is deployed is one approach to learn about the existence of defects within the system. However, according to Lowry et al. (Lowry et al., 1998), software assessment budgets are finite while assessment effectiveness increases exponentially with assessment effort. According to the 80/20 Pareto principle in software testing, 20% of the application contains 80% of the critical defects. Therefore, in order to preserve finite resources, the gold standard practice is to apply the best available resources (labor, knowledge, time, etc.) to the critical portion of the code. Any method that focuses on arbitrary parts of the code can miss critical defects in other areas, which means some sampling policy should be implemented to smartly explore the rest of the system.

One class of smart sampling policies is defect predictors learned from static code attributes. Such defect predictors are easy to apply, widely used, and useful. Given object oriented software attributes like those described in Table 1, data miners can infer where software defects (the dependent attribute) mostly occur and learn the pattern of how defects will occur. Static code attributes can be automatically collected as independent attributes, even for very large systems (Nagappan and Ball, 2005). The alternative is manual code review, which is slower and more labor intensive (Menzies et al., 2002). Researchers and industrial practitioners use static attributes to guide software quality predictions (Nam and Kim, 2015; Tan, Tan, and Dara, Tan et al.; Krishna, Menzies, and Fu, Krishna et al.; Lewis et al., 2013) and such predictors can localize 70% (or more) of the defects in code (Menzies et al., 2007).

2.2. Data Miners

Defect predictors use data miners to apply various heuristics to efficiently reduce the search space for finding the summaries of the defect data. This study uses 4 popular data miners: Classification And Regression Trees (CART), Random Forests (RF), K Nearest Neighbors (KNN), and Support Vector Machine (SVM). They are interesting learners in that they represent all four statistically distinct classes of a performance spectrum for defect predictors that were categorized by Ghotra et al. (Ghotra, McIntosh, and Hassan, Ghotra et al.) through the double Scott-Knott test. Specifically:

  • CART is a tree learner that divides a data set, then recursively splits on each node until some stop criterion is satisfied.

  • RF follows a procedure like CART's, but RF is an ensemble method that builds multiple CART trees, each time using some random subset of the attributes.

  • KNN is a lazy-learning, instance-based technique that classifies an instance by studying the training examples most similar to it.

  • SVM uses a hyperplane to separate two classes (defective versus non-defective). SVM applies the kernel trick, which transforms the data points into a higher-dimensional feature space, to make the data more separable.

This practice of picking data miners from four different levels of performance was also adopted by other recent empirical SE studies (Chen et al., 2018; Fu et al., 2018; Krishna and Menzies, 2017; Agrawal and Menzies, 2018).

2.3. Hyperparameter Tuning

Data miners have control parameters (e.g., for SVM it would be the kernel function type, the regularization term, the tolerance, etc.). Adjusting those parameters to optimize the performance of the data miners is called hyperparameter tuning (Fu et al., 2016; Fu and Menzies, 2017; Agrawal et al., 2018a). Tuning is used in the hyperparameter optimization literature exploring better combinatorial search methods for software testing (Jia et al., 2015) or the use of genetic algorithms to explore 9.3 million different configurations for clone detection algorithms (Wang, Harman, Jia, and Krinke, Wang et al.). Other researchers explore the effects of parameter tuning on topic modeling for SE text mining. Tuning is also used for software effort estimation; e.g. using tabu search for tuning SVM (Corazza et al., 2010); or genetic algorithms for tuning ensembles (Minku and Yao, 2013); or an exploration tool for quality checking of parameter settings in effort estimators (Song et al., 2013).

Figure 2. Literature review of hyperparameters tuning on 52 top defect prediction papers (Fu et al., 2016)
| Learner | Parameter | Default | Tuning Range | Description |
|---|---|---|---|---|
| CART | criterion | “gini” | [“gini”, “entropy”] | The function to measure the quality of a split. |
| | max_features | None | [0.1, 1.0] | The number of features to consider when looking for the best split. |
| | min_samples_split | 2 | [2, 30] | The minimum number of samples required to split an internal node. |
| | min_samples_leaf | 1 | [1, 21] | The minimum number of samples required to be at a leaf node. |
| | max_depth | None | [1, 21] | The maximum depth of the tree. |
| KNN | n_neighbors | 5 | [2, 10] | Number of neighbors to use. |
| | weights | “uniform” | [“uniform”, “distance”] | Weight function used in prediction. |
| SVM | C | 1.0 | [1, 100] | Penalty parameter C of the error term. |
| | kernel | “rbf” | [“rbf”, “sigmoid”] | Kernel type to be used in the algorithm. |
| | coef0 | 0.0 | [0.1, 1.0] | Independent term in kernel function. |
| | gamma | “auto” | [0.1, 1.0] | Kernel coefficient. |
| RF | criterion | “entropy” | [“gini”, “entropy”] | The function to measure the quality of a split. |
| | max_features | “auto” | [0.1, 1.0] | The number of features to consider when looking for the best split. |
| | min_samples_split | 2 | [2, 30] | The minimum number of samples required to split an internal node. |
| | min_samples_leaf | 1 | [1, 21] | The minimum number of samples required to be at a leaf node. |
| | n_estimators | 10 | [10, 100] | The number of trees in the forest. |

Table 2. Parameters Tuning Space

The case studies used in this paper come from defect prediction, i.e., classification over existing static code attributes. Many SE defect prediction studies on static code attributes have been produced (Krishna, Menzies, and Fu, Krishna et al.; Nam and Kim, 2015; Tan, Tan, and Dara, Tan et al.). However, software analytics practitioners have focused almost solely on finding and employing complex, “off-the-shelf” machine learning models (Menzies et al., 2007; Moser, Pedrycz, and Succi, Moser et al.; Elish and Elish, 2008). According to the literature review of defect prediction by Fu et al. (Fu et al., 2016), shown in Figure 2, 80% of highly cited papers did not mention any parameter tuning and employed the default parameter settings of the data miners.

Bergstra and Bengio (Bergstra and Bengio, 2012) gave these reasons for the popularity of grid search: (a) it is a simple search that gives some degree of insight; (b) it has little technical overhead; (c) it is simple to automate and parallelize; (d) on a computing cluster, it can find better tunings than sequential optimization. Grid search is conjectured to be no more effective than more randomized searchers if the underlying search space is inherently low-dimensional.

Tantithamthavorn et al. (Tantithamthavorn et al., 2016) and Fu et al. (Fu et al., 2016) are two recent works investigating the effects of parameter tuning on defect prediction. Tantithamthavorn et al. used grid search while Fu et al. applied differential evolution. Neither offers a comparison of their preferred tuning method to any other. In addition, Fu et al. (1) studied only half of the four classes of data miners that Ghotra et al. considered; (2) did not include the state-of-the-art Bayesian Optimization tuning method; and (3) applied the optimization at the project level instead of the release version level.

Besides the choice of the right optimizer for the circumstances, tuning for many objectives or for inappropriate goals at once dilutes the strength of optimization. Sayyad et al.'s results on multi-objective tuning in effort estimation show that all the methods did similarly when tuning 2 objectives, but most of these algorithms did not perform nearly acceptably when tuning 4-5 objectives (Sayyad et al., 2013), as measured by the percentage of fully-correct solutions in the Pareto fronts. Moreover, a recent study by Agrawal et al. (Agrawal and Menzies, 2018), which tuned the data preprocessors, determined that better data quality is more important than better data miners. This is plausible because hyperparameter tuning can be applied not only to the data miner but also to data preprocessing (SMOTE, SMOTUNED (Agrawal et al., 2018a), normalization, discretization, outlier removal, etc.) and feature selection (exploring subsets of features with PCA, RFE, etc.). Thus, choosing appropriate tuning goals and knowing what to tune are both important.

The lack of attention to these points in the SBSE literature stems largely from Lessmann et al.'s conclusion (Lessmann et al., 2008) that software analytics practitioners are free to pick from a broad set of models when building defect predictors, since the choice of data miner generally does not matter much. Knowing that, is the insignificant difference in performance due to the nature of the defect prediction problem itself, or to the traditional approach of exploring the tuning input space of the problem? For instance, Fu, Chen, and Agrawal (Fu et al., 2018; Chen et al., 2018; Agrawal et al., 2018b) applied a simple method based on psychological principles, Fast and Frugal Trees (Phillips et al., 2017), that explores the output space as a binary tree of limited depth instead of exploring the input space. This backward approach offers better or similar performance (in most cases) at a much lower cost in terms of time and processing power, and with more human-readable results. Consequently, it is important to reassess the old belief, which motivates this more extensive guideline and survey of hyperparameter tuning in defect prediction.

3. Experimentation

For each tuning goal (Precision and F-Measure), each tuning algorithm is run 20 times on each of the four machine learning models (CART, KNN, SVM, and RF) to check the stability of the results against random bias and noise. In each repeat, different combinations of hyperparameter settings are evaluated. Each evaluation consists of 10 parameter sets generated by the corresponding tuning algorithm, and is compared against the current “best” one. If it is better, it replaces the “best” one. If not, it becomes less likely that the next evaluation will improve on the “best”, so the number of lives of the process is decreased by 1. The search repeats until either stop condition is met: the exhaustion of five lives or 1 hour of processing time. For each release version of each project, the results for each scoring goal and data miner under all tuning methods are recorded and ranked by the Scott-Knott test. Each time a tuning method is statistically ranked first, its count is incremented by one, which measures how often each optimizer statistically performs best.
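The stopping rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the code from the reproduction package: the function name `tune_with_lives`, the `propose` and `score` callbacks, and the toy KNN scoring function are all hypothetical, and only the constants (10 parameter sets per evaluation, five lives, a 1 hour limit) come from the text.

```python
import random
import time


def tune_with_lives(propose, score, lives=5, batch=10,
                    time_limit_s=3600.0, seed=1):
    """Evaluate batches of settings until lives or time run out.

    propose(rng) returns one candidate setting; score(setting) returns
    the value to maximize (e.g. Precision or F-Measure).
    """
    rng = random.Random(seed)
    start = time.monotonic()
    best_setting, best_score = None, float("-inf")
    while lives > 0 and time.monotonic() - start < time_limit_s:
        # One "evaluation" = a batch of parameter sets from the tuner.
        candidates = [propose(rng) for _ in range(batch)]
        top = max(candidates, key=score)
        top_score = score(top)
        if top_score > best_score:
            best_setting, best_score = top, top_score  # new "best"
        else:
            lives -= 1  # no improvement: lose a life
    return best_setting, best_score


# Hypothetical stand-ins for a real tuner and a real learner evaluation.
def propose_knn(rng):
    return {"n_neighbors": rng.randint(2, 10)}


def toy_score(setting):
    return -abs(setting["n_neighbors"] - 7)


best, best_score = tune_with_lives(propose_knn, toy_score)
print(best, best_score)
```

In a real run, `score` would train the learner with the candidate setting and return the tuning goal measured on held-out data.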

3.1. Tuning Algorithms

This study shall explore the representatives of hyperparameter tuning classes including: grid search, random search, DE, and Bayesian optimization. All of these optimization methods explored the parameter space as described in Table 2.

Grid search simply picks a set of values for each configuration parameter, evaluates all the combinations of these values, and returns the best one as the final result. It can be implemented with nested for-loops.

Random search simply generates a random set of candidate parameters from the same tuning range as in Table 2.
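As a rough sketch of the two searchers just described, the following Python code runs both over the KNN rows of Table 2. The grid values, the sample count, and the toy scoring function are assumptions made for illustration; the function names are hypothetical and do not come from the paper's reproduction package.

```python
import itertools
import random

# Illustrative grid over the KNN rows of Table 2 (the step of 2 is an assumption).
KNN_GRID = {
    "n_neighbors": [2, 4, 6, 8, 10],
    "weights": ["uniform", "distance"],
}

# The same ranges expressed as samplers for random search.
KNN_RANGES = {
    "n_neighbors": lambda rng: rng.randint(2, 10),
    "weights": lambda rng: rng.choice(["uniform", "distance"]),
}


def grid_search(grid, score):
    """Evaluate every combination of the listed values; keep the best."""
    names = list(grid)
    best, best_score = None, float("-inf")
    for values in itertools.product(*(grid[n] for n in names)):
        setting = dict(zip(names, values))
        s = score(setting)
        if s > best_score:
            best, best_score = setting, s
    return best, best_score


def random_search(ranges, score, n_samples=20, seed=1):
    """Draw independent random settings from the ranges; keep the best."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(n_samples):
        setting = {name: draw(rng) for name, draw in ranges.items()}
        s = score(setting)
        if s > best_score:
            best, best_score = setting, s
    return best, best_score


def toy_score(setting):
    # Hypothetical objective whose optimum (n_neighbors=7) is off the grid.
    bonus = 0.5 if setting["weights"] == "distance" else 0.0
    return -abs(setting["n_neighbors"] - 7) + bonus


grid_best, grid_score = grid_search(KNN_GRID, toy_score)
rand_best, rand_score = random_search(KNN_RANGES, toy_score)
print(grid_best, grid_score)
print(rand_best, rand_score)
```

Because the toy optimum sits between two grid points, grid search can only return a neighboring value, while random search may or may not land on it depending on the draws.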

DE evolves a new generation of candidates by extrapolating randomly between three members of the current population of solutions (Storn and Price, 1997). DE combines local search mutation (governed by a parameter controlling crossover) with an archive pruning operator. As the process progresses, new candidates supplant older items in the population, so all subsequent mutations use the newer and more valuable candidates.
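A minimal sketch of this process, in the style of the classic DE/rand/1/bin scheme of Storn and Price, is shown below. The population size, the scaling factor `f`, the crossover rate `cr`, the number of generations, and the toy objective are all assumed values chosen for illustration; the paper's own settings are not reproduced here.

```python
import random


def differential_evolution(score, bounds, pop_size=10, f=0.75, cr=0.3,
                           generations=20, seed=1):
    """Maximize score(x) over box-bounded real vectors x."""
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds]
           for _ in range(pop_size)]
    scores = [score(x) for x in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct members other than the current one.
            others = [p for j, p in enumerate(pop) if j != i]
            a, b, c = rng.sample(others, 3)
            forced = rng.randrange(len(bounds))  # always mutate one gene
            trial = []
            for d, (lo, hi) in enumerate(bounds):
                if d == forced or rng.random() < cr:
                    value = a[d] + f * (b[d] - c[d])  # extrapolate
                    value = min(hi, max(lo, value))   # clip to range
                else:
                    value = pop[i][d]
                trial.append(value)
            trial_score = score(trial)
            # Better (or equal) candidates supplant older members.
            if trial_score >= scores[i]:
                pop[i], scores[i] = trial, trial_score
    best = max(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]


# Two continuous ranges shaped like max_features and min_samples_split
# in Table 2; the objective itself is a hypothetical stand-in.
BOUNDS = [(0.1, 1.0), (2.0, 30.0)]


def toy_score(x):
    return -((x[0] - 0.3) ** 2 + (x[1] - 12.0) ** 2)


best, best_score = differential_evolution(toy_score, BOUNDS)
print(best, best_score)
```

Integer or categorical parameters would need rounding or index encoding on top of this; that detail is left out of the sketch.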

Bayesian Optimization comprises a probabilistic model and an acquisition function. There are several popular approaches for probabilistic models: density estimation models such as Tree-structured Parzen Estimator (TPE) (Bergstra et al., 2011), random forests such as Sequential Model-based Algorithm Configuration (SMAC) (Hutter et al., 2011), and Gaussian processes (Snoek et al., 2012). Specifically, this study employed the random forest based probabilistic method, SMAC (Hutter et al., 2011), as used in auto-sklearn, a robust AutoML system based on scikit-learn (Pedregosa and Varoquaux, 2011) developed by Feurer et al. (Feurer et al., 2015), which discards poorly performing hyperparameters early. For a new input, the posterior mean and variance are computed and then used to compute the acquisition function. The acquisition function defines the criterion for choosing future parameter candidates for evaluation. The next most promising parameter set is found using the probabilistic model, evaluated by the acquisition function, used to update the main model, and the process is repeated.

Figure 3. Comparison of applying different methods of tuning against default setting of various data miners on 27 release versions of 10 projects. The colors refer to a statistical comparison across the tuning performance of optimizers in each row of learners.

3.2. Defect Data

Our data comes from the SEACRAFT repository. This data pertains to the following open source JAVA systems: ant, camel, ivy, jedit, log4j, lucene, poi, synapse, velocity and xerces.

We applied an incremental learning approach. Given at least three software releases (where release i+1 was built after release i), this allows defect predictors to be built that predict future (test) defects based on learning from past (train) data. Specifically:

  • Each software release was divided into 3 even portions; a learner is trained on part of the release and each candidate parameter setting is evaluated on the remainder of that release.

  • After the tuning method terminates, the best parameter setting is picked for building the appropriate data miner model on one release to predict the defects in the following release, and the result is reported according to the tuning goal.

  • The 4 data miners are also trained with their default parameter configurations on one release and then tested on the following release.

3.3. Tuning Goals

The problem studied in this paper is a binary classification task of bug identification based on the static code attributes of a specific JAVA class. The performance of a binary classifier can be assessed via a confusion matrix as in Table 3.


Prediction false true
Table 3. Confusion Matrix

Further, “false” means the learner got it wrong and “true” means the learner correctly identified a positive or negative class. Hence, Table 3 has four quadrants containing, e.g., TP, which denotes “true positive”.

Our optimizers explore tuning improvements for Precision and F-Measure values at the software release version level. For these two goals, the larger the value, the better the model’s predictive power.


We do not explore all goals since some have trivial, but not insightful, solutions. No evaluation criterion is “best” since different criteria are appropriate in different real-world contexts. For example, when we tune for recall, we can achieve near 100% recall but at the cost of near 100% false alarms. Precision’s definition takes into account not only the defective examples but also the non-defective ones, so multiple goals are in contention within it. The same is true for the F-Measure (as it uses precision).
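The trade-off described above can be checked with a small Python sketch using the standard textbook definitions of recall, false alarm rate, precision, and F-measure (the harmonic mean of precision and recall). We assume, without confirmation from the text, that these match the definitions used in this study; the function name and the example counts are hypothetical.

```python
def confusion_scores(tp, fp, tn, fn):
    """Standard scores computed from the four confusion matrix counts."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    false_alarm = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "recall": recall,
        "false_alarm": false_alarm,
        "precision": precision,
        "f_measure": f_measure,
    }


# A trivial predictor that flags every class as defective, applied to a
# made-up release with 20 defective and 80 non-defective classes.
flag_everything = confusion_scores(tp=20, fp=80, tn=0, fn=0)
print(flag_everything)
```

Flagging everything yields perfect recall together with a 100% false alarm rate, while precision and F-measure stay low, which is why the latter two are less easy to game.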

3.4. Statistical Analysis

We compared the results of the tuned miners per release version using a statistical significance test and an effect size test within the Scott-Knott procedure (Mittas and Angelis, 2013; Ghotra, McIntosh, and Hassan, Ghotra et al.). The significance test detects whether two populations differ merely by random noise (Ghotra, McIntosh, and Hassan, Ghotra et al.). The effect size test checks whether two populations differ by more than a trivial amount (Arcuri and Briand, 2011). We report differences only when they are statistically significant with 95% confidence and the effect is not “small”.

4. Results

Figure 3 shows the cumulative statistical results of Grid, DE, Random, and SMAC against the default hyperparameter configuration for each data miner across the 27 release versions, when maximizing Precision and F-measure scores. Since there are 27 release versions, each tuning method can be ranked first at most 27 times. For example, CART tuned with SMAC got first rank 19 times (70%) while CART tuned with Grid search only got first rank 11 times (56%). The darker the cell, the statistically better the performance of the learner combined with that optimizer.

From Figure 3, we observe that over all (learners, optimizers, evaluation criteria), there is no clear “best” optimizer:

  1. Grid search, widely deprecated (Bergstra and Bengio, 2012), performs surprisingly well for KNN and SVM’s F-Measure (but not elsewhere);

  2. DE does well for optimizing F-Measure but not Precision;

  3. Bayesian Optimization, SMAC, gets best results in Precision, but not for F-Measure;

  4. Other optimizers work best only for specific learners and evaluation criteria.

Further to the third point, Table 4 shows the mean CPU time in seconds to run one repeat of one optimizer on one learner. Note that the SMAC runtimes are substantially larger than the other methods. Hence the extra benefits of SMAC optimization must be carefully weighed against the cost of that optimization.

Grid 4 5 334 6
DE 4 5 318 6
Random 4 5 305 6
Default 1 1 2 1
SMAC 613 501 652 505
Table 4. Runtime in seconds.

For example, for Precision and SVM, is a win of 63% (with SMAC) vs 59% (with Defaults) really worth the CPU required to earn such a small gain?

If the reader feels that the CPU times recorded in Table 4 are insignificantly small, then please recall that these runs are repeated many times over multiple learners, optimizers, and datasets. When repeated at that scale, the longer runtimes of SMAC become highly significant. For example, even utilizing our university’s cloud compute facilities, it took two graduate students weeks to collect the data behind Figures 2 & 3.

5. Discussion

5.1. Algorithm

There seems to be no conclusive evidence indicating which optimizer is best across the goals. Even so, DE did well at optimizing F-Measure while SMAC did well at optimizing Precision at the release version level. Intuitively, one would expect a thorough investigation of all options via grid search to do better than a partial exploration of just a few options through DE and SMAC.

In reality, both grid search and random search sample parameter settings between the bounds of a predefined tuning range. That range determines the nature of the tuning, and choosing a good one requires expert knowledge. If the best options lie between the jumps of the grid, then grid search will skip the critical tuning values. Moreover, for both grid search and random search, all the combinations in the predefined tuning range are evaluated independently. Any lessons learned in the process are not used to improve the remaining runs.

Note that both DE and SMAC are less prone to skipping values, since they incrementally fill in the gaps within the initially selected tuning range. Moreover, due to the evolutionary nature of DE and the Bayesian nature of SMAC, knowledge learned is carried forward within the same run to improve the inference of future results:

  • For DE, tuning values are adjusted by some random amount that is the difference between two randomly selected vectors. DE’s discoveries of better vectors accumulate in the frontier, and new solutions (candidates) are continually built from the increasingly better solutions cached in the frontier.

  • SMAC, given a small initial set of function evaluations, proceeds by fitting a surrogate model (a random forest, in the case of SMAC) to those observations and then optimizing an acquisition function that balances exploration and exploitation. Using randomness and probability distributions, it determines the next most promising point to evaluate.

Our results align with Bergstra and Bengio’s formal analysis of how randomized searchers (like DE and SMAC) can do better than grid search and random search, especially if the region containing the useful tunings is very small. In such a search space: (a) grid search and random search can waste much time exploring an irrelevant part of the space; (b) grid search’s effectiveness is limited by the curse of dimensionality.

5.2. Approach

The core experiment can be seen as narrow, since it looks solely at defect prediction, but it is appropriate and necessary for starting a discussion on the complexity and potential limitations of parameter tuning methods. Introducing hyperparameter tuning can come at a great cost in complexity and resources (processing power and time) if the goals are not achieved within a reasonable budget. More importantly, hyperparameter tuning can be applied beyond the data mining step, for example during data preprocessing and feature selection. Moreover, Table 2’s tuning space can be expressed as continuous, which means the space of parameters is theoretically infinite. It is reasonable to regard that complexity as unwanted unless the area that needs to be tuned is known.

In Calero and Piattini’s survey of modern SE companies (Calero and Piattini, 2015), they find that many current organizational redesigns are motivated (at least in part) by arguments based on “sustainability” (i.e., using fewer resources to achieve results). According to them, “redesign for greater simplicity” is a new source of innovation and motivates much contemporary industrial work that explores cost-cutting opportunities for gaining an advantage over competitors. Perhaps it is time to call for a new approach to software analytics besides the traditional forward approach of exploring the input space that we followed in this study.

6. Threats to Validity

Biases that can affect the results are inevitable in any empirical study. Although this work has attempted to minimize biases, the following issues should be considered when drawing insight from the results.

6.1. Order Bias

For each dataset, the distribution of data samples into the training and testing sets is completely random. Still, there could be times when all the good samples are binned into the training set. The experiment is designed to run 20 times in order to mitigate that bias and to improve the stability of the results. For each run, the random seed is different for each data set, but it is the same across learners configured by hyperparameter optimizers for the same data set. With this approach, it is important to note that different triplets have different seed values (so this case study does sample across a range of search biases).

6.2. Sampling Bias

Sampling bias threatens any classification experiment, i.e., what proves to be important here may be insignificant there. For example, we used ten widely used open-source Java software project datasets from SEACRAFT, which have served as subjects in various case studies by various researchers (Fu et al., 2016; Chen et al., 2018; Fu et al., 2018), i.e., our results are no more biased than those of many other studies in this arena. However, the datasets were supplied by one individual. Moreover, only the metrics listed in Table 2 are used as attributes to build defect predictors, so we cannot guarantee that our observations generalize directly to other projects that use a different set of metrics, such as code change metrics.

Also, these defect datasets are lower in dimension than those used in effort estimation, text mining, and test case prioritization. Consequently, our findings for this specific case study in defect prediction might not be applicable to those other areas of software analytics.

6.3. Learner and Optimizer Bias

Research reproducibility refers to the consistency of the results obtained when this particular experiment is reproduced. To assure reproducibility and reliability, this paper has taken care either to clearly define our algorithms or to use implementations from the public domain (Scikit-Learn). While most of our optimizers were built solely on that public-domain foundation, Bayesian Optimization (SMAC, in Auto-Sklearn) and DE each have alternative implementations, and results may differ if other implementations were considered; e.g., for the probabilistic model in Bayesian Optimization, TPE or a Gaussian Process could be used instead. However, in terms of runtime cost, the data loading and processing methods implemented in this study are shared by all optimizers. Therefore, the relative runtime cost comparison between the optimizers still holds.

6.4. Evaluation Bias

The internal configuration of the SMAC optimizer in auto-sklearn forestalls the incorporation of many non-conventional evaluation metrics (Feurer et al., 2015). This work was therefore only able to report on two performance measures, Precision and F-Measure (as defined in Equations 1 and 2), at the level of individual software versions. Other quality measures (AUC, Distance2Heaven, etc.) are also often used in software engineering to quantify the effectiveness of prediction (Chen et al., 2018; Monden et al., 2013; Kamei et al., 2013).
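
For readers who want the two reported measures in executable form, the following sketch computes Precision and F-Measure from confusion-matrix counts. It uses the standard textbook definitions, which are assumed (not verified here) to match the paper's own equations; the function names and the choice to return 0.0 when a denominator is zero are illustrative assumptions.

```python
def precision(tp, fp):
    """Fraction of modules predicted defective that really are defective."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of truly defective modules that the predictor finds."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0
```

A predictor can score well on one measure and poorly on the other: with `tp=8, fp=2, fn=8`, precision is high while the F-Measure is pulled down by the missed defects. This is one reason why results can differ between the two measures.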

7. Conclusion and Future Work

Three conclusions follow from this extensive hyperparameter tuning study in defect prediction. Firstly, like many other researchers before us (Fu et al., 2016, 2016; Jia et al., 2015; Wang, Harman, Jia, and Krinke, Wang et al.; Corazza et al., 2010; Minku and Yao, 2013; Song et al., 2013; Agrawal et al., 2018a; Agrawal and Menzies, 2018; Fu and Menzies, 2017), we conclude that hyperparameter optimization is very useful. Figure 3 is very clear: hyperparameter tuning usually leads to far better performance scores than just using the defaults.

That said, our second conclusion is that it is not clear when one hyperparameter optimizer is better than any other. Hence, in future research, practitioners will need to apply a range of optimizers rather than rely on just one. This is similar to Lessmann et al.’s conclusion (Lessmann et al., 2008) that software analytics practitioners are free to pick from a broad set of models when building defect predictors, since the choice of data miner is generally not too significant.
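
What "applying a range of optimizers" might look like can be sketched on a toy problem. Everything below is an assumption made for illustration: the two-parameter space, the synthetic `score` function (a stand-in for training and evaluating a learner), the budget of 25 configurations per tuner, and the DE settings `f=0.75` and `cr=0.3`. Bayesian optimization is omitted because it needs a surrogate-model library.

```python
import itertools
import random

# Hypothetical two-parameter tuning space; the bounds are illustrative only.
BOUNDS = {"a": (0.0, 1.0), "b": (0.0, 1.0)}


def score(cfg):
    """Toy stand-in for 'train a learner with cfg and evaluate it' (higher is better)."""
    return 1.0 - (cfg["a"] - 0.3) ** 2 - (cfg["b"] - 0.7) ** 2


def grid_search(steps=5):
    """Evaluate every point of an evenly spaced steps x steps grid."""
    axes = [[lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
            for lo, hi in BOUNDS.values()]
    candidates = [dict(zip(BOUNDS, point)) for point in itertools.product(*axes)]
    return max(candidates, key=score)


def random_search(budget=25, seed=1):
    """Evaluate `budget` configurations drawn uniformly from the bounds."""
    rng = random.Random(seed)
    candidates = [{k: rng.uniform(lo, hi) for k, (lo, hi) in BOUNDS.items()}
                  for _ in range(budget)]
    return max(candidates, key=score)


def differential_evolution(pop_size=5, generations=4, f=0.75, cr=0.3, seed=1):
    """Minimal DE: build trial = x + f * (y - z), keep it if it beats its parent."""
    rng = random.Random(seed)
    keys = list(BOUNDS)
    pop = [{k: rng.uniform(*BOUNDS[k]) for k in keys} for _ in range(pop_size)]
    for _ in range(generations):
        for i in range(pop_size):
            x, y, z = rng.sample([p for j, p in enumerate(pop) if j != i], 3)
            forced = rng.choice(keys)  # at least one parameter always changes
            trial = {}
            for k in keys:
                if k == forced or rng.random() < cr:
                    lo, hi = BOUNDS[k]
                    trial[k] = min(hi, max(lo, x[k] + f * (y[k] - z[k])))
                else:
                    trial[k] = pop[i][k]
            if score(trial) > score(pop[i]):
                pop[i] = trial
    return max(pop, key=score)


TUNERS = {"grid": grid_search, "random": random_search, "de": differential_evolution}
results = {name: score(tuner()) for name, tuner in TUNERS.items()}
best = max(results, key=results.get)  # the winner on this toy problem only
```

Each tuner here considers 25 configurations, so the comparison is at roughly equal cost. Which tuner wins depends on the objective and the seeds, which is the point of the second conclusion; running all of them also multiplies the total runtime.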

So thirdly, reducing the total runtimes of multi-optimizer studies is an open and pressing problem. We cannot expect a wide community of academic and industrial practitioners to use hyperparameter optimization unless that usage is easy to apply.

As for future work, we propose:

  • This study can be replicated with other evaluation measures, e.g., AUC, Distance2Heaven, etc. The Bayesian Optimization implementation (SMAC) in auto-sklearn can be modified to support those evaluation measures while optimizing the data miners.

  • This work should be repeated for different domains in software analytics such as text mining, effort estimation, etc.

  • Other parts of the data mining process can be explored for optimization (preprocessing, feature engineering, etc.).

  • Research how to check whether a specific problem is tunable and, if so, which parts of the data mining pipeline should be tuned and toward which goals.

  • Explore more of the “backward approach” of surveying the result space with some initial, random data mining, then reflecting on and redesigning a software quality predictor that better understands that result space.


References
  • Agrawal et al. (2018a) A. Agrawal, W. Fu, and T. Menzies. 2018a. What is wrong with topic modeling? And how to fix it using search-based software engineering. IST (2018).
  • Agrawal and Menzies (2018) A. Agrawal and T. Menzies. 2018. “Better Data” is Better than “Better Data Miners” (Benefits of Tuning SMOTE for Defect Prediction). ICSE (2018).
  • Agrawal et al. (2018b) A. Agrawal, H. Tu, and T. Menzies. 2018b. Can You Explain That, Better? Comprehensible Text Analytics for SE Applications. (2018).
  • Arcuri and Briand (2011) A. Arcuri and L. Briand. 2011. A practical guide for using statistical tests to assess randomized algorithms in software engineering. In 33rd ICSE. IEEE.
  • Bergstra et al. (2011) J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. 2011. Algorithms for Hyper-parameter Optimization. In 24th International Conference on NIPS. 9.
  • Bergstra and Bengio (2012) J. Bergstra and Y. Bengio. 2012. Random Search for Hyper-parameter Optimization. J. Mach. Learn. Res. 13, 1 (Feb. 2012), 281–305.
  • Calero and Piattini (2015) C. Calero and M. Piattini. 2015. Green in Software Engineering. Springer Publishing Company, Incorporated.
  • Chen et al. (2018) D. Chen, W. Fu, R. Krishna, and T. Menzies. 2018. Applications of Psychological Science for Actionable Analytics. CoRR abs/1803.05067 (2018). arXiv:1803.05067
  • Chen and Menzies (2018) Jianfeng Chen and Tim Menzies. 2018. RIOT: a Novel Stochastic Method for Rapidly Configuring Cloud-Based Workflows. In IEEE Cloud 2018.
  • Corazza et al. (2010) A. Corazza, S. Di Martino, F. Ferrucci, C. Gravino, and F. Sarro. 2010. How Effective is Tabu Search to Configure Support Vector Regression for Effort Estimation?. In 6th PROMISE.
  • Elish and Elish (2008) K. O. Elish and M. O. Elish. 2008. Predicting Defect-prone Software Modules Using Support Vector Machines. J. Syst. Softw. 81, 5 (May 2008), 649–660.
  • Feurer et al. (2015) M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter. 2015. Efficient and Robust Automated Machine Learning. In NIPS 28.
  • Fu and Menzies (2017) W. Fu and T. Menzies. 2017. Easy over Hard: A Case Study on Deep Learning. CoRR (2017). arXiv:1703.00133
  • Fu et al. (2018) W. Fu, T. Menzies, D. Chen, and A. Agrawal. 2018. Building Better Quality Predictors Using “ε-Dominance”. CoRR abs/1803.04608 (2018). arXiv:1803.04608
  • Fu et al. (2016) W. Fu, T. Menzies, and X. Shen. 2016. Tuning for software analytics: Is it really necessary? IST 76 (2016), 135–146.
  • Fu et al. (2016) W. Fu, V. Nair, and T. Menzies. 2016. Why is Differential Evolution Better than Grid Search for Tuning Defect Predictors? CoRR abs/1609.02613 (2016). arXiv:1609.02613
  • Ghotra, McIntosh, and Hassan (Ghotra et al.) B. Ghotra, S. McIntosh, and A. E. Hassan. Revisiting the Impact of Classification Techniques on the Performance of Defect Prediction Models. In 2015 IEEE/ACM 37th IEEE ICSE.
  • Hutter et al. (2011) F. Hutter, H. H. Hoos, and K. Leyton-Brown. 2011. Sequential Model-based Optimization for General Algorithm Configuration. In 5th LION.
  • Jia et al. (2015) Y. Jia, M. B. Cohen, M. Harman, and J. Petke. 2015. Learning Combinatorial Interaction Test Generation Strategies Using Hyperheuristic Search. In ICSE.
  • Kamei et al. (2013) Y. Kamei, E. Shihab, B. Adams, A. E. Hassan, A. Mockus, A. Sinha, and N. Ubayashi. 2013. A large-scale empirical study of just-in-time quality assurance. IEEE TSE (June 2013).
  • Krishna and Menzies (2017) R. Krishna and T. Menzies. 2017. Simpler Transfer Learning (Using “Bellwethers”). CoRR abs/1703.06218 (2017). arXiv:1703.06218
  • Krishna, Menzies, and Fu (Krishna et al.) R. Krishna, T. Menzies, and W. Fu. Too Much Automation? The Bellwether Effect and Its Implications for Transfer Learning. In IEEE/ACM ICSE (ASE 2016).
  • Lessmann et al. (2008) S. Lessmann, B. Baesens, C. Mues, and S. Pietsch. 2008. Benchmarking Classification Models for Software Defect Prediction: A Proposed Framework and Novel Findings. IEEE TSE (2008).
  • Lewis et al. (2013) C. Lewis, Z. Lin, C. Sadowski, X. Zhu, R. Ou, and E. J. Whitehead Jr. 2013. Does Bug Prediction Support Human Developers? Findings from a Google Case Study. In Proceedings of the 2013 ICSE. 372–381.
  • Lowry et al. (1998) M. Lowry, M. Boyd, and D. Kulkarni. 1998. Towards a theory for integration of mathematical verification and empirical testing. In 13th IEEE ASE.
  • Menzies et al. (2007) T. Menzies, J. Greenwald, and A. Frank. 2007. Data Mining Static Code Attributes to Learn Defect Predictors. IEEE TSE (Jan 2007).
  • Menzies et al. (2002) T. Menzies, D. Raffo, S. Setamanit, Y. Hu, and S. Tootoonian. 2002. Model-Based Tests of Truisms. In 17th IEEE International Conference on ASE (ASE ’02).
  • Minku and Yao (2013) L. Minku and X. Yao. 2013. An Analysis of Multi-objective Evolutionary Algorithms for Training Ensemble Models Based on Different Performance Measures in Software Effort Estimation. In Proceedings of the 9th PROMISE.
  • Mittas and Angelis (2013) N. Mittas and L. Angelis. 2013. Ranking and clustering software cost estimation models through a multiple comparisons algorithm. IEEE TSE (2013).
  • Monden et al. (2013) A. Monden, T. Hayashi, S. Shinoda, K. Shirai, J. Yoshida, M. Barker, and K. Matsumoto. 2013. Assessing the Cost Effectiveness of Fault Prediction in Acceptance Testing. IEEE TSE 39, 10 (Oct 2013), 1345–1357.
  • Moser, Pedrycz, and Succi (Moser et al.) R. Moser, W. Pedrycz, and G. Succi. A Comparative Analysis of the Efficiency of Change Metrics and Static Code Attributes for Defect Prediction. In 2008 ICSE.
  • Nagappan and Ball (2005) N. Nagappan and T. Ball. 2005. Static Analysis Tools As Early Indicators of Pre-release Defect Density. In Proceedings of the 27th ICSE (ICSE ’05).
  • Nam and Kim (2015) J. Nam and S. Kim. 2015. Heterogeneous Defect Prediction. In 10th FSE.
  • Pedregosa and Varoquaux (2011) F. Pedregosa and G. Varoquaux. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
  • Phillips et al. (2017) N. D. Phillips, H. Neth, J. K. Woike, and W. Gaissmaier. 2017. FFTrees: A toolbox to create, visualize, and evaluate fast-and-frugal decision trees. Judgment and Decision Making (2017).
  • Sayyad et al. (2013) A. S. Sayyad, T. Menzies, and H. Ammar. 2013. On the Value of User Preferences in Search-based Software Engineering: A Case Study in Software Product Lines. In 35th ICSE.
  • Snoek et al. (2012) J. Snoek, H. Larochelle, and R. Adams. 2012. Practical Bayesian Optimization of Machine Learning Algorithms. In NIPS - Volume 2.
  • Song et al. (2013) L. Song, L. L. Minku, and X. Yao. 2013. The Impact of Parameter Tuning on Software Effort Estimation Using Learning Machines. In 9th PROMISE.
  • Storn and Price (1997) R. Storn and K. Price. 1997. Differential Evolution; A Simple and Efficient Heuristic for Global Optimization over Continuous Spaces. J. of Global Optimization (1997).
  • Tan, Tan, and Dara (Tan et al.) M. Tan, L. Tan, and S. Dara. Online Defect Prediction for Imbalanced Data. In ICSE.
  • Tantithamthavorn et al. (2016) C. Tantithamthavorn, S. McIntosh, A. E. Hassan, and K. Matsumoto. 2016. Automated Parameter Optimization of Classification Techniques for Defect Prediction Models. In 38th ICSE.
  • Wang, Harman, Jia, and Krinke (Wang et al.) T. Wang, M. Harman, Y. Jia, and J. Krinke. Searching for Better Configurations: A Rigorous Approach to Clone Evaluation. In 2013 9th Joint Meeting on FSE.