Hyperparameter Optimization for Effort Estimation

04/28/2018 ∙ by Tianpei Xia, et al. ∙ NC State University IEEE 0

Software analytics has been widely used in software engineering for many tasks such as generating effort estimates for software projects. One of the "black arts" of software analytics is tuning the parameters controlling a data mining algorithm. Such hyperparameter optimization has been widely studied in other software analytics domains (e.g. defect prediction and text mining) but, so far, has not been extensively explored for effort estimation. Accordingly, this paper seeks simple, automatic, effective, and fast methods for finding good tunings for automatic software effort estimation. We introduce a hyperparameter optimization architecture called OIL (Optimized Inductive Learning). We test OIL on a wide range of hyperparameter optimizers using data from 945 software projects. After tuning, large improvements in effort estimation accuracy were observed (measured in terms of the magnitude of the relative error and standardized accuracy). From those results, we can recommend using regression trees (CART) tuned by either differential evolution or MOEA/D. This particular combination of learner and optimizers often achieves in one or two hours what other optimizers need days to weeks of CPU to accomplish. An important part of this analysis is its reproducibility and refutability. All our scripts and data are on-line. It is hoped that this paper will prompt and enable much more research on better methods to tune software effort estimators.




1 Introduction

Software analytics has been widely used in software engineering for many tasks Menzies and Zimmermann (2018). This paper explores methods to improve algorithms for software effort estimation (a particular kind of analytics task). This is needed since software effort estimates can be wildly inaccurate Kemerer (1987). Effort estimates need to be accurate (if for no other reason) since many government organizations demand that the budgets allocated to large publicly funded projects be double-checked by some estimation model Menzies et al. (2017). Non-algorithmic techniques that rely on human judgment Jørgensen (2004) are much harder to audit or dispute (e.g., when the estimate is generated by a senior colleague but disputed by others).

Sarro et al. Sarro et al. (2016) assert that effort estimation is a critical activity for planning and monitoring software project development in order to deliver the product on time and within budget Briand and Wieczorek (2002); Kocaguneli et al. (2011); Trendowicz and Jeffery (2014). The competitiveness of software organizations depends on their ability to accurately predict the effort required for developing software systems; both over- and under-estimates can negatively affect the outcome of software projects Trendowicz and Jeffery (2014); McConnell (2006); Mendes and Mosley (2002); Sommerville (2010).

Hyperparameter optimizers tune the control parameters of a data mining algorithm. It is well established that classification tasks like software defect prediction or text classification are improved by such tuning Fu et al. (2016a); Tantithamthavorn et al. (2018); Agrawal et al. (2018); Agrawal and Menzies (2018). This paper investigates hyperparameter optimization using data from 945 projects. The study is an extensive exploration of hyperparameter optimization and effort estimation, following the earlier work done by Corazza et al. Corazza et al. (2013).

We assess our results with respect to recent findings by Arcuri & Fraser Arcuri and Fraser (2013). They caution that to transition hyperparameter optimizers to industry, they need to be fast:

A practitioner, that wants to use such tools, should not be required to run large tuning phases before being able to apply those tools Arcuri and Fraser (2013).

Also, according to Arcuri & Fraser, optimizers must be useful:

At least in the context of test data generation, (tuning) does not seem easy to find good settings that significantly outperform “default” values. … Using “default” values is a reasonable and justified choice, whereas parameter tuning is a long and expensive process that might or might not pay off Arcuri and Fraser (2013).

Hence, to assess such optimization for effort estimation, we ask four questions.

RQ1: To address one concern raised by Arcuri & Fraser, we must first ask is it best to just use “off-the-shelf” defaults? We will find that tuned learners provide better estimates than untuned learners. Hence, for effort estimation: Lesson1: “off-the-shelf” defaults should be deprecated.

RQ2: Can tuning effort be avoided by replacing old defaults with new defaults? This checks if we can run tuning once (and once only) then use those new defaults ever after. We will observe that effort estimation tunings differ extensively from dataset to dataset. Hence, for effort estimation: Lesson2: Overall, there are no “best” default settings.

RQ3: The first two research questions tell us that we must retune our effort estimators whenever new data arrives. Accordingly, we must now address the other concern raised by Arcuri & Fraser about CPU cost. Hence, in this question we ask can we avoid slow hyperparameter optimization? The answer to RQ3 will be “yes” since our results show that for effort estimation: Lesson3: Overall, our slowest optimizers perform no better than faster ones.

RQ4: The final question to answer is what hyperparameter optimizers to use for effort estimation? Here, we report that a certain combination of learners and optimizers usually produces the best results. Further, this particular combination often achieves in a few minutes what other optimizers may need hours to days of CPU to achieve. Hence we will recommend the following combination for effort estimation: Lesson4: For new datasets, try a combination of CART with the optimizers differential evolution and FLASH. (Note: The italicized words are explained below.)

In summary, unlike the test case generation domains explored by Arcuri & Fraser, hyperparameter optimization for effort estimation is both useful and fast.

Overall the contributions of this paper are:

  • A demonstration that default settings are not the best way to perform effort estimation. Hence, when new data is encountered, some tuning process is required to learn the best settings for generating estimates from that data.

  • A recognition of the inherent difficulty associated with effort estimation. Since there is not one universally best effort estimation method, commissioning a new effort estimator requires extensive testing. As shown below, this can take hours to days of CPU time.

  • The identification of a combination of learner and optimizer that works as well as anything else, and which takes minutes to learn an effort estimator.

  • An extensible open-source architecture called OIL that enables the commissioning of effort estimation methods. OIL makes our results repeatable and refutable.

The rest of this paper is structured as follows. The next section discusses different methods for effort estimation and how to optimize the parameters of effort estimation methods. This is followed by a description of our data, our experimental methods, and our results. After that, a discussion section explores open issues with this work.

From all of the above, we can conclude that (a) Arcuri & Fraser’s pessimism about hyperparameter optimization applies to their test data generation domain. However, (b) for effort estimation, hyperparameter optimization is both useful and fast. Hence, we hope that OIL, and the results of this paper, will prompt and enable more research on methods to tune software effort estimators.

Note that OIL and all the data used in this study is freely available for download from https://github.com/arennax/effort_oil_2019

2 About Effort Estimation

Software effort estimation is a method to offer managers approximate advice on how much human effort (usually expressed in terms of hours, days or months of human work) is required to plan, design and develop a software project. Such advice can only ever be approximate due to the dynamic nature of any software development. Nevertheless, it is important to attempt to allocate resources properly in software projects to avoid waste. In some cases, inadequate or excessive funding can cause a considerable waste of resources and time Cowing (2002); Germano and Hufford (2016); Hazrati (2011); Roman (2016). As shown below, effort estimation can be categorized into (a) human-based and (b) algorithm-based methods Kocaguneli et al. (2012); Shepperd (2007).

For several reasons, this paper does not explore human-based estimation methods. Firstly, it is known that humans rarely update their human-based estimation knowledge based on feedback from new projects Jørgensen and Gruschke (2009). Secondly, algorithm-based methods are preferred when estimates have to be audited or debated (since the method is explicit and available for inspection). Thirdly, algorithm-based methods can be run many times (each time applying small mutations to the input data) to understand the range of possible estimates. Even very strong advocates of human-based methods Jørgensen (2015) acknowledge that algorithm-based methods are useful for learning the uncertainty about particular estimates.

2.1 Algorithm-based Methods

There are many algorithmic estimation methods. Some, such as COCOMO Boehm (1981), make assumptions about the attributes in the model. For example, COCOMO requires that data includes 22 specific attributes such as analyst capability (acap) and software complexity (cplx). These attribute assumptions restrict how much data is available for studies like this paper. For example, here we explore 945 projects expressed using a wide range of attributes. If we used COCOMO, we could only have accessed an order of magnitude fewer projects.

Due to its attribute assumptions, this paper does not study COCOMO data. All the following learners can accept projects described using any attributes, just as long as one of those is some measure of project development effort.

Whigham et al.’s ATLM method Whigham et al. (2015) is a multiple linear regression model which calculates the effort as $effort = \beta_0 + \sum_i \beta_i \times a_i + \varepsilon_i$, where $a_i$ are explanatory attributes and $\varepsilon_i$ are errors to the actual value. The prediction weights $\beta_i$ are determined using least square error estimation Neter et al. (1996). Additionally, transformations are applied on the attributes to further minimize the error in the model. For categorical attributes, the standard approach of “dummy variables” Hardy (1993) is applied. For continuous attributes, a logarithmic, square root, or no transformation is chosen such that the skewness of the attribute is minimized. It should be noted that ATLM does not consider relatively complex techniques like using model residuals, Box transformations or step-wise regression (which are standard) when developing a linear regression model. The authors make this decision since they intend ATLM to be a simple baseline model rather than the “best” model. And since it can be applied automatically, there should be no excuse not to compare any new model against a comparatively naive baseline.
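The transformation step described above can be illustrated with a small sketch. It is not the ATLM authors' code: the function names are hypothetical, skewness is taken to be the population third standardized moment, and only the three candidate transformations named in the text are tried.

```python
import math

def skewness(values):
    """Population skewness (third standardized moment); 0.0 for constant data."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    if var == 0:
        return 0.0
    return sum((v - mean) ** 3 for v in values) / n / var ** 1.5

# Candidate transformations for a continuous attribute (names are illustrative).
TRANSFORMS = {
    "none": lambda v: v,
    "sqrt": math.sqrt,
    "log": math.log,
}

def least_skewed_transform(values):
    """Return (name, transformed values) for the transform with the smallest |skewness|."""
    candidates = {}
    for name, fn in TRANSFORMS.items():
        if name == "log" and min(values) <= 0:
            continue  # log is undefined for non-positive values
        if name == "sqrt" and min(values) < 0:
            continue  # sqrt is undefined for negative values
        candidates[name] = [fn(v) for v in values]
    best = min(candidates, key=lambda k: abs(skewness(candidates[k])))
    return best, candidates[best]
```

For a heavily right-skewed attribute such as `[1, 10, 100, 1000, 10000]` this selects the logarithm, after which an ordinary least-squares fit would be run on the transformed columns.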

Sarro et al. proposed a method named Linear Programming for Effort Estimation (LP4EE) Sarro and Petrozziello (2018), which aims to achieve the best outcome from a mathematical model with a linear objective function subject to linear equality and inequality constraints. The feasible region is given by the intersection of the constraints, and the Simplex (linear programming) algorithm is able to find a point in the polyhedron where the function has the smallest error in polynomial time. In the effort estimation problem, this model minimizes the Sum of Absolute Residuals (SAR). When a new project is presented to the model, LP4EE predicts the effort as $effort = a_1 x_1 + a_2 x_2 + \dots + a_n x_n$, where $x_i$ is the value of a given project feature and $a_i$ is the corresponding coefficient evaluated by linear programming. LP4EE is suggested as another baseline model for effort estimation since it provides similar or more accurate estimates than ATLM and is much less sensitive than ATLM to multiple data splits and different cross-validation methods.

Some algorithm-based estimators are regression trees such as CART L.Breiman (1984). CART is a tree learner that divides a dataset, then recurses on each split. If the data contains more than min_samples_split rows, then a split is attempted. On the other hand, if a split contains no more than min_samples_leaf rows, then the recursion stops. CART finds the attributes whose ranges contain rows with the least variance in effort. If an attribute range $r_i$ is found in $n_i$ rows, each with an effort variance of $v_i$, then CART seeks the attribute with a split that most minimizes $\sum_i \left(\sqrt{v_i} \times n_i / \sum_i n_i\right)$. For more details on the CART parameters, see Table 1.

Parameter          Default  Tuning Range  Notes
max_features       None     [0.01, 1]     The number of features to consider when looking for the best split.
max_depth          None     [1, 12]       The maximum depth of the tree.
min_samples_split  2        [0, 20]       Minimum samples required to split internal nodes.
min_samples_leaf   1        [1, 12]       Minimum samples required to be at a leaf node.
Table 1: CART’s parameters.
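To make the split search concrete, here is a minimal sketch of one level of a CART-style regression split. It assumes the split score is the size-weighted sum of the effort standard deviations in each branch, and it handles numeric features only. The function names are illustrative and are not taken from OIL or Scikit-learn.

```python
import math
from statistics import pvariance

def split_score(groups):
    """Size-weighted sum of effort standard deviations: sum(sqrt(v_i) * n_i / sum(n))."""
    total = sum(len(g) for g in groups)
    return sum(math.sqrt(pvariance(g)) * len(g) / total for g in groups)

def best_numeric_split(rows, efforts, min_samples_leaf=1):
    """Try every (feature, cut) pair and return (score, feature index, cut) with the lowest score."""
    best = None
    for f in range(len(rows[0])):
        for cut in sorted({r[f] for r in rows}):
            left = [e for r, e in zip(rows, efforts) if r[f] <= cut]
            right = [e for r, e in zip(rows, efforts) if r[f] > cut]
            # Respect the minimum leaf size (cf. min_samples_leaf in Table 1).
            if len(left) < min_samples_leaf or len(right) < min_samples_leaf:
                continue
            score = split_score([left, right])
            if best is None or score < best[0]:
                best = (score, f, cut)
    return best
```

A full tree learner would recurse on the two branches until a stopping rule such as the maximum depth or minimum split size is reached.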

Random Forest Breiman (2001) and Support Vector Regression Chang and Lin (2011) are other instances of regression methods. Random Forest (RF) is an ensemble learning method for regression (and classification) tasks that builds a set of trees when training the model. To decide the output, it uses the mode of the classes (classification) or mean prediction (regression) of the individual trees. Support Vector Regression (SVR) uses kernel functions to project the data onto a new hyperspace where complex non-linear patterns can be simply represented. It aims to construct an optimal hyperplane that fits data and predicts with minimal empirical risk and complexity of the modelling function.

Other algorithm-based estimators are the Analogy-Based Estimation (ABE) methods advocated by Shepperd and Schofield Shepperd and Schofield (1997). ABE is widely used Peters et al. (2015); Kocaguneli et al. (2015); Hihn and Menzies (2015); Kocaguneli and Menzies (2011); Menzies et al. (2017) in many forms. We say that “ABE0” is the standard form seen in the literature and “ABEN” are the 6,000+ variants of ABE defined below. The general form of ABE (which applies to ABE0 or ABEN) is to first form a table of rows of past projects. The columns of this table are composed of independent variables (the features that define projects) and one dependent feature (project effort). From this table, we learn what similar projects (analogies) to use from the training set when examining a new test instance. For each test instance, ABE then selects $k$ analogies out of the training set. Analogies are selected via a similarity measure. Before calculating similarity, ABE normalizes numerics min..max to 0..1 (so all numerics get an equal chance to influence the dependent). Then, ABE uses feature weighting to reduce the influence of less informative features. Finally, some adaption strategy is applied to return a combination of the dependent effort values seen in the $k$ nearest analogies. For details on ABE0 and ABEN, see Figure 1 & Table 2.

Figure 1: OIL’s feature model of the space of machine learning options for ABEN. In this model, Subset Selection, Similarity, Adaption Mechanism and Analogy Selection are the mandatory features, while the Feature Weighting and Discretization features are optional. To avoid making the graph too complex, some cross-tree constraints are not presented. For more details on the terminology of this figure, see Table 2.
  • To measure similarity between $x, y$, ABE uses $\sqrt{\sum_{i=1}^{n} w_i (x_i - y_i)^2}$ where $w_i$ corresponds to feature weights applied to independent features. ABE0 uses a uniform weighting where $w_i = 1$. ABE0’s adaptation strategy is to return the effort of the nearest item.

  • Two ways to find training subsets: (a) Remove nothing: Usually, effort estimators use all training projects Chang (1974). Our ABE0 uses this variant; (b) Outlier methods: prune training projects with (say) suspiciously large values Keung et al. (2008).

  • Eight ways to make feature weighting: Li et al. Li et al. (2009) and Hall and Holmes Hall and Holmes (2003) review 8 different feature weighting schemes.

  • Three ways to discretize (summarize numeric ranges into a few bins): Some feature weighting schemes require an initial discretization of continuous columns. There are many discretization policies in the literature, including: (1) equal frequency, (2) equal width, (3) do nothing.

  • Six ways to choose similarity measurements: Mendes et al. Mendes et al. (2003) discuss three similarity measures, including the weighted Euclidean measure described above, an unweighted variant (where $w_i = 1$), and a “maximum distance” measure that focuses on the single feature that maximizes interproject distance. Frank et al. Frank et al. (2002) use a triangular distribution that sets the weight to zero after the distance is more than “k” neighbors away from the test instance. A fifth and sixth similarity measure are the Minkowski distance measure used in Angelis and Stamelos (2000) and the mean value of the ranking of each project feature used in Walkerden and Jeffery (1999).

  • Four ways for adaption mechanisms: (1) median effort value, (2) mean dependent value, (3) summarize the adaptations via a second learner (e.g., linear regression) Li et al. (2009); Menzies et al. (2006); Baker (2007); Quinlan (1992), (4) weighted mean Mendes et al. (2003).

  • Six ways to select analogies: Analogy selectors are fixed or dynamic Kocaguneli et al. (2012). Fixed methods use $k \in \{1, 2, 3, 4, 5\}$ nearest neighbors while dynamic methods use the training set to find which $1 \le k \le N-1$ is best for $N$ examples.

Table 2: Variations on analogy. Visualized in Figure 1.
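As an illustration of the simplest variant, the following sketch implements an ABE0-style estimator under the choices stated in the text: min..max normalization, uniform feature weights, a Euclidean similarity measure, and returning the effort of the single nearest analogy. The function names and the toy data in the usage note are hypothetical.

```python
import math

def make_scaler(rows):
    """Build a function that rescales each numeric column from min..max to 0..1."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    def scale(row):
        return [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi)]
    return scale

def distance(x, y, weights):
    """Weighted Euclidean distance between two normalized rows."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))

def abe0(train_rows, train_efforts, test_row):
    """Return the effort of the nearest training project (uniform weights, one analogy)."""
    scale = make_scaler(train_rows)
    weights = [1.0] * len(test_row)  # uniform weighting
    target = scale(test_row)
    dists = [distance(scale(r), target, weights) for r in train_rows]
    nearest = min(range(len(train_rows)), key=dists.__getitem__)
    return train_efforts[nearest]
```

For example, `abe0([[10, 1], [100, 5], [1000, 9]], [5, 50, 500], [120, 4])` returns the effort of the second project, which is the closest after normalization. The ABEN variants would swap in other subset selectors, weights, similarity measures and adaptation mechanisms.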

2.2 Effort Estimation and Hyperparameter Optimization

Note that we do not claim that the above represents all methods for effort estimation. Rather, we say that (a) all the above are either prominent in the literature or widely used; and (b) anyone with knowledge of the current effort estimation literature would be tempted to try some of the above.

Even though our list of effort estimation methods is incomplete, it is still very long. Consider, for example, just the ABEN variants documented in Table 2. There are $2 \times 8 \times 3 \times 6 \times 4 \times 6 = 6{,}912$ such variants. Some can be ignored; e.g. at $k = 1$, adaptation mechanisms return the same result, so they are not necessary. Also, not all feature weighting techniques use discretization. But even after those discards, there are still thousands of possibilities.

Given that the space to explore is so large, some researchers have offered automatic support for that exploration. Some of that prior work suffered from being applied to limited data Li et al. (2009).

Other researchers assume that the effort model is a specific parametric form (e.g. the COCOMO equation) and propose mutation methods to adjust the parameters of that equation Aljahdali and Sheta (2010); Moeyersoms et al. (2015); Singh and Misra (2012); Chalotra et al. (2015); Rao et al. (2014). As mentioned above, this approach is hard to test since there are very few datasets using the pre-specified COCOMO attributes.

Further, all that prior work needs to be revisited given the existence of recent and very prominent methods; i.e. ATLM from TOSEM’2015 Whigham et al. (2015) or LP4EE from TOSEM’2018 Sarro and Petrozziello (2018).

Accordingly, this paper conducts a more thorough investigation of hyperparameter optimization for effort estimation.

  • We use methods with no data feature assumptions (i.e. no COCOMO data);

  • That vary many parameters (6,000+ combinations);

  • That also tests results on 9 different sources with data on 945 software projects;

  • Which uses optimizers representative of the state-of-the-art (Differential Evolution Storn and Price (1997), FLASH Nair et al. (2018b));

  • And which benchmark results against prominent methods such as ATLM and LP4EE.

2.3 OIL

OIL is our architecture for exploring hyperparameter optimization and effort estimation. Initially, our plan was to use standard hyperparameter tuning for this task. Then we learned that (a) standard machine learning toolkits like Scikit-learn Pedregosa et al. (2011) did not include many of the effort estimation techniques; and (b) standard hyperparameter tuners can be slow. Hence, we built OIL:

  • At the base library layer, we use Scikit-learn Pedregosa et al. (2011).

  • Above that, OIL has a utilities layer containing all the algorithms missing in Scikit-Learn (e.g., ABEN required numerous additions at the utilities layer).

  • Higher up, OIL’s modelling layer uses an XML-based domain-specific language to specify a feature map of predicting model options. These feature models are single-parent and-or graphs with (optional) cross-tree constraints showing what options require or exclude other options. A graphical representation of the feature model used in this paper is shown in Figure 1.

  • Finally, at the top-most optimizer layer, there is some optimizer that makes decisions across the feature map. An automatic mapper facility then links those decisions down to the lower layers to run the selected algorithms.

2.4 Optimizers

Once OIL’s layers were built, it was simple to “pop the top” and replace the top layer with another optimizer. Nair et al. Nair et al. (2018a) advise that for search-based SE studies, optimizers should be selected via a “dumb+two+next” rule. Here:

  • “Dumb” are some baseline methods;

  • “Two” are some well-established optimizers;

  • “Next” are more recent methods which may not have been applied before to this domain.

For our “dumb” optimizer, we used Random Choice (hereafter, RD). To find $N$ valid configurations, RD selects leaves at random from Figure 1. All these $N$ variants are executed and the best one is selected for application to the test set. To maintain a fair comparison with the other systems described below, OIL chooses $N$ to be the same as the number of evaluations used by the other methods.
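A rough sketch of Random Choice over the ABEN options is shown below. The option counts follow Table 2, but most option labels are placeholders (the table does not name every scheme), cross-tree constraints are ignored, and `score` is a hypothetical caller-supplied function where lower is better.

```python
import random
from math import prod

# Option counts follow Table 2; labels such as "weighting_3" are placeholders.
SPACE = {
    "subset": ["remove_nothing", "outlier_pruning"],
    "weighting": [f"weighting_{i}" for i in range(1, 9)],
    "discretize": ["equal_frequency", "equal_width", "none"],
    "similarity": [f"similarity_{i}" for i in range(1, 7)],
    "adaptation": ["median", "mean", "second_learner", "weighted_mean"],
    "analogies": [f"analogy_selector_{i}" for i in range(1, 7)],
}

def space_size(space=SPACE):
    """Number of distinct configurations if every option is free to vary."""
    return prod(len(options) for options in space.values())

def random_choice(score, n, seed=0):
    """Draw n random configurations and return the one with the lowest score."""
    rng = random.Random(seed)
    configs = [{k: rng.choice(v) for k, v in SPACE.items()} for _ in range(n)]
    return min(configs, key=score)
```

Here `space_size()` evaluates to 6912, which is the full cross-product before any redundant combinations are discarded.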

Moving on, our “two” well-established optimizers are ATLM Whigham et al. (2015) and LP4EE Sarro and Petrozziello (2018). For LP4EE, we perform experiments with the open source code provided by the original authors. For ATLM, since there is no online source code available, we carefully re-implemented the method ourselves.

As to our “next” optimizers, we used Differential Evolution (hereafter, DE Storn and Price (1997)) and FLASH Nair et al. (2018b). The premise of DE is that the best way to mutate the existing tunings is to extrapolate between current solutions. Three solutions $a, b, c$ are selected at random. For each tuning parameter $k$, at some probability $cr$, we replace the old tuning $x_k$ with $y_k$. For booleans, $y_k = \neg x_k$, and for numerics, $y_k = a_k + f \times (b_k - c_k)$, where $f$ is a parameter controlling differential weight. The main loop of DE runs over the population of size $np$, replacing old items with new candidates (if the new candidate is better). This means that, as the loop progresses, the population is full of increasingly more valuable solutions (which, in turn, helps extrapolation). The control parameters of DE ($np$, $f$, $cr$) were set using advice from Storn and Fu et al. Storn and Price (1997); Fu et al. (2016a). Also, the number of generations was set to 10 to test the effects of a very CPU-light effort estimator.
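The DE loop described above can be sketched as follows for numeric tunings only (boolean parameters are omitted). This is an illustrative sketch, not OIL's implementation: the default values for `np_`, `f` and `cr` are assumptions chosen for the example, and `score` is a hypothetical function to be minimized (for instance, an error measure from cross-validation).

```python
import random

def differential_evolution(score, bounds, np_=20, f=0.75, cr=0.3, generations=10, seed=1):
    """Minimize score over box-bounded numeric tunings with a basic DE loop."""
    rng = random.Random(seed)

    def clip(v, lo, hi):
        return max(lo, min(hi, v))

    # Initial population: np_ random tunings inside the bounds.
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(np_)]
    costs = [score(x) for x in pop]
    for _ in range(generations):
        for i in range(np_):
            # Pick three other solutions at random and extrapolate between them.
            a, b, c = rng.sample([p for j, p in enumerate(pop) if j != i], 3)
            y = list(pop[i])
            for k, (lo, hi) in enumerate(bounds):
                if rng.random() < cr:
                    y[k] = clip(a[k] + f * (b[k] - c[k]), lo, hi)
            cost = score(y)
            if cost < costs[i]:  # keep the candidate only if it is better
                pop[i], costs[i] = y, cost
    best = min(range(np_), key=costs.__getitem__)
    return pop[best], costs[best]
```

With these illustrative defaults the loop makes 20 initial evaluations plus 20 per generation, which is why a small number of generations keeps the CPU cost low.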

FLASH, proposed by Nair et al. Nair et al. (2018b), is an incremental optimizer. Previously, it has been applied to configuring system parameters for software systems. This paper is the first application of FLASH to effort estimation. Formally, FLASH is a sequential model-based optimizer Bergstra et al. (2011) (also known in the machine learning literature as an active learner Das et al. (2016) or, in the statistics literature, as optimal experimental design Olsson (2009)). Whatever the name, the intuition is the same: reflect on the model built to date in order to find the next best example to evaluate. To tune a data miner, FLASH explores possible tunings as follows:

  1. Set the evaluation budget. In order to make a fair comparison between FLASH and other methods, the budget matches the number of evaluations given to those methods.

  2. Run the data miner using randomly selected tunings.

  3. Build an archive of examples holding pairs of parameter settings and their resulting performance scores (e.g. MRE, SA, etc).

  4. Using that archive, learn a surrogate to predict performance. Following the methods of Nair et al. Nair et al. (2018b), we used CART L.Breiman (1984) for that surrogate.

  5. Use the surrogate to guess the performance scores of candidate parameter settings. Note that this step is very fast since all that is required is to run vectors down some very small CART trees.

  6. Using some selection function, select the most “interesting” setting. After Nair et al. Nair et al. (2018b), we returned the setting with the best prediction (i.e. find the most promising possibility).

  7. Collect performance scores by evaluating “interesting” using the data miner (i.e. check that possibility). Increment the count of evaluations made so far.

  8. Add “interesting” to the archive. If the evaluation budget has not been exhausted, goto step 4. Else, halt.

In summary, given what we already know about the tunings (represented in a CART tree), FLASH finds the potentially best thing (in Step6); then checks that thing (in Step7); then updates the model with the results of that check.
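The loop summarized above can be sketched in a few lines. To keep the example dependency-free, a nearest-neighbour predictor stands in for the CART surrogate, so this is a FLASH-style sequential model-based loop rather than FLASH itself. All names, the default of five initial samples, and the convention that lower scores are better are assumptions made for illustration.

```python
import random

def nearest_neighbour_surrogate(archive):
    """Stand-in surrogate (NOT CART): predict the score of the closest evaluated setting."""
    def predict(x):
        closest = min(archive, key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], x)))
        return closest[1]
    return predict

def flash(evaluate, candidates, budget, n_init=5, fit=nearest_neighbour_surrogate, seed=1):
    """Return the best (setting, score) pair found within `budget` real evaluations."""
    rng = random.Random(seed)
    pool = list(candidates)
    rng.shuffle(pool)
    # Steps 2-3: evaluate a few random tunings and archive (setting, score) pairs.
    archive = [(x, evaluate(x)) for x in pool[:n_init]]
    pool = pool[n_init:]
    while len(archive) < budget and pool:
        predict = fit(archive)           # step 4: learn a surrogate from the archive
        guess = min(pool, key=predict)   # steps 5-6: cheapest predicted score wins
        pool.remove(guess)
        archive.append((guess, evaluate(guess)))  # steps 7-8: check it, then archive it
    return min(archive, key=lambda pair: pair[1])
```

Passing a different `fit` function (for example, one that trains a regression tree on the archive) would bring the sketch closer to the CART-based surrogate described in step 4.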

3 Empirical Study

3.1 Data

Dataset      Projects  Features
kemerer      15        6
albrecht     24        7
isbsg10      37        11
finnish      38        7
miyazaki     48        7
maxwell      62        25
desharnais   77        6
kitchenham   145       6
china        499       16
total        945
Table 3: Data used in this study. For details on the features, see Table 4.
dataset     feature    min   max     mean    std
kemerer     Duration   5     31      14.3    7.5
            KSLOC      39    450     186.6   136.8
            AdjFP      100   2307    999.1   589.6
            RAWFP      97    2284    993.9   597.4
            Effort     23    1107    219.2   263.1
albrecht    Input      7     193     40.2    36.9
            Output     12    150     47.2    35.2
            Inquiry    0     75      16.9    19.3
            File       3     60      17.4    15.5
            FPAdj      1     1       1.0     0.1
            RawFPs     190   1902    638.5   452.7
            AdjFP      199   1902    647.6   488.0
            Effort     0     105     21.9    28.4
isbsg10     UFP        1     2       1.2     0.4
            IS         1     10      3.2     3.0
            DP         1     5       2.6     1.1
            LT         1     3       1.6     0.8
            PPL        1     14      5.1     4.1
            CA         1     2       1.1     0.3
            FS         44    1371    343.8   304.2
            RS         1     4       1.7     0.9
            FPS        1     5       3.5     0.7
            Effort     87    14453   2959    3518
finnish     hw         1     3       1.3     0.6
            at         1     5       2.2     1.5
            FP         65    1814    763.6   510.8
            co         2     10      6.3     2.7
            prod       1     29      10.1    7.1
            lnsize     4     8       6.4     0.8
            lneff      6     10      8.4     1.2
            Effort     460   26670   7678    7135
china       AFP        9     17518   486.9   1059
            Input      0     9404    167.1   486.3
            Output     0     2455    113.6   221.3
            Enquiry    0     952     61.6    105.4
            File       0     2955    91.2    210.3
            Interface  0     1572    24.2    85.0
            Added      0     13580   360.4   829.8
            changed    0     5193    85.1    290.9
            Deleted    0     2657    12.4    124.2
            PDR_A      0     84      11.8    12.1
            PDR_U      0     97      12.1    12.8
            NPDR_A     0     101     13.3    14.0
            NPDU_U     0     108     13.6    14.8
            Resource   1     4       1.5     0.8
            Dev.Type   0     0       0.0     0.0
            Duration   1     84      8.7     7.3
            Effort     26    54620   3921    6481
miyazaki    KLOC       7     390     63.4    71.9
            SCRN       0     150     28.4    30.4
            FORM       0     76      20.9    18.1
            FILE       2     100     27.7    20.4
            ESCRN      0     2113    473.0   514.3
            EFORM      0     1566    447.1   389.6
            EFILE      57    3800    936.6   709.4
            Effort     6     340     55.6    60.1
maxwell     App        1     5       2.4     1.0
            Har        1     5       2.6     1.0
            Dba        0     4       1.0     0.4
            Ifc        1     2       1.9     0.2
            Source     1     2       1.9     0.3
            Telon.     0     1       0.2     0.4
            Nlan       1     4       2.5     1.0
            T01        1     5       3.0     1.0
            T02        1     5       3.0     0.7
            T03        2     5       3.0     0.9
            T04        2     5       3.2     0.7
            T05        1     5       3.0     0.7
            T06        1     4       2.9     0.7
            T07        1     5       3.2     0.9
            T08        2     5       3.8     1.0
            T09        2     5       4.1     0.7
            T10        2     5       3.6     0.9
            T11        2     5       3.4     1.0
            T12        2     5       3.8     0.7
            T13        1     5       3.1     1.0
            T14        1     5       3.3     1.0
            T15        1     5       3.3     0.7
            Dura.      4     54      17.2    10.7
            Size       48    3643    673.3   784.1
            Time       1     9       5.6     2.1
            Effort     583   63694   8223    10500
desharnais  TeamExp    0     4       2.3     1.3
            MngExp     0     7       2.6     1.5
            Length     1     36      11.3    6.8
            Trans.s    9     886     177.5   146.1
            Entities   7     387     120.5   86.1
            AdjPts     73    1127    298.0   182.3
            Effort     546   23940   4834    4188
kitchenham  code       1     6       2.1     0.9
            type       0     6       2.4     0.9
            duration   37    946     206.4   134.1
            fun_pts    15    18137   527.7   1522
            estimate   121   79870   2856    6789
            esti_mtd   1     5       2.5     0.9
            Effort     219   113930  3113    9598
Table 4: Descriptive Statistics of the Datasets. Terms in red are removed from this study, for reasons discussed in the text.

To assess OIL, we applied it to the 945 projects seen in nine datasets from the SEACRAFT repository (http://tiny.cc/seacraft); see Table 3 and Table 4. This data was selected since it has been widely used in previous estimation research. Also, it is quite diverse since it differs in number of observations (from 15 to 499 projects); geographical locations (software projects coming from Canada, China, Finland); technical characteristics (software projects developed in different programming languages and for different application domains, ranging from telecommunications to commercial information systems); and number and type of features (from 6 to 25 features, including a variety of features describing the software projects, such as number of developers involved in the project and their experience, technologies used, size in terms of Function Points, etc.).

Note that some features of the original datasets are not used in our experiment because they are (1) naturally irrelevant to their effort values (e.g., ID, Syear), (2) unavailable at the prediction phase (e.g., duration, LOC), or (3) highly correlated or overlapping with each other (e.g., raw function points and adjusted function points). A data cleaning process is applied to remove them. Those removed features are shown as italic in Table 4.

3.2 Cross-Validation

Each dataset was treated in a variety of ways. Each treatment is an M*N-way cross-validation test of some learners or some learners and optimizers. That is, $M$ times, shuffle the data randomly (using a different random number seed) then divide the data into $N$ bins. For $i \in N$, bin $i$ is used to test a model built from the other bins. The number of bins used for our effort datasets follows the advice of Nair et al. Nair et al. (2018a).

As a procedural detail, first we divided the data and then we applied the treatments. That is, all treatments saw the same training and test data.
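The M*N-way procedure can be sketched as a generator of index splits. The function name is hypothetical and the values of `m` and `n` in the usage note are purely illustrative. Materializing the splits once and then reusing them for every treatment mirrors the procedural detail that all treatments see the same training and test data.

```python
import random

def mxn_folds(n_rows, m, n, seed=0):
    """Yield (repeat, fold, train indices, test indices) for an M*N-way cross-validation."""
    for repeat in range(m):
        rng = random.Random(seed + repeat)  # a different random seed for each repeat
        idx = list(range(n_rows))
        rng.shuffle(idx)
        bins = [idx[i::n] for i in range(n)]  # n roughly equal bins
        for fold, test in enumerate(bins):
            train = [j for k, b in enumerate(bins) if k != fold for j in b]
            yield repeat, fold, train, test
```

For instance, `splits = list(mxn_folds(15, m=2, n=3))` produces six train/test splits that can be handed unchanged to each learner and optimizer under comparison.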

3.3 Scoring Metrics

MRE: MRE is defined in terms of AR, the magnitude of the absolute residual. This is computed from the difference between predicted and actual effort values:
$AR = |actual_i - predicted_i|$
MRE is the magnitude of the relative error calculated by expressing AR as a ratio of actual effort:
$MRE = \frac{|actual_i - predicted_i|}{actual_i}$
MRE has been criticized Foss et al. (2003); Kitchenham et al. (2001); Korte and Port (2008); Port and Korte (2008); Shepperd et al. (2000); Stensrud et al. (2003) as being biased towards error underestimations. Nevertheless, we use it here since there exist known baselines for human performance in effort estimation, expressed in terms of MRE Molokken and Jorgensen (2003a). The same cannot be said for SA.
SA: Because of issues with MRE, some researchers prefer the use of other (more standardized) measures, such as Standardized Accuracy (SA) Langdon et al. (2016); Shepperd and MacDonell (2012). SA is defined in terms of
$MAE = \frac{1}{N} \sum_{i=1}^{N} |RE_i - EE_i|$
where $N$ is the number of projects used for evaluating the performance, and $RE_i$ and $EE_i$ are the actual and estimated effort, respectively, for the project $i$. SA uses MAE as follows:
$SA = \left(1 - \frac{MAE_{P_j}}{MAE_{r_{guess}}}\right) \times 100$
where $MAE_{P_j}$ is the MAE of the approach $P_j$ being evaluated and $MAE_{r_{guess}}$ is the MAE of a large number (e.g., 1000 runs) of random guesses. Over many runs, $MAE_{r_{guess}}$ will converge on simply using the sample mean Shepperd and MacDonell (2012). That is, SA represents how much better $P_j$ is than random guessing. Values near zero mean that the prediction model is practically useless, performing little better than random guesses Shepperd and MacDonell (2012).
Table 5: Performance scores: MRE and SA
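Both scores are easy to compute. The sketch below assumes MRE is the absolute residual divided by the actual effort, and that SA is one minus the ratio of the model's mean absolute error (MAE) to the MAE of random guessing, times 100. Random guessing is implemented here as predicting each project with the effort of another, randomly chosen project. The function names are illustrative.

```python
import random

def mre(actual, predicted):
    """Magnitude of relative error: absolute residual as a ratio of actual effort."""
    return abs(actual - predicted) / actual

def mae(actuals, predictions):
    """Mean absolute error over a set of projects."""
    return sum(abs(a - p) for a, p in zip(actuals, predictions)) / len(actuals)

def standardized_accuracy(actuals, predictions, runs=1000, seed=0):
    """SA = (1 - MAE of the model / mean MAE of random guessing) * 100."""
    rng = random.Random(seed)
    n = len(actuals)
    guess_maes = []
    for _ in range(runs):
        # Random guess: use the effort of some other, randomly chosen project.
        guesses = [actuals[rng.choice([j for j in range(n) if j != i])] for i in range(n)]
        guess_maes.append(mae(actuals, guesses))
    mae_guess = sum(guess_maes) / runs
    return (1 - mae(actuals, predictions) / mae_guess) * 100
```

A perfect predictor scores an SA of 100, while a predictor no better than random guessing scores close to zero.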
This study ranks methods using the Scott-Knott procedure recommended by Mittas & Angelis in their 2013 IEEE TSE paper Mittas and Angelis (2013). This method sorts a list of $l$ treatments with $ls$ measurements by their median score. It then splits $l$ into sub-lists $m, n$ in order to maximize the expected value of differences in the observed performances before and after divisions. For example, we could sort $ls = 4$ methods based on their median score, then divide them into sub-lists $m, n$ in three ways, with sizes $ms, ns \in \{(1,3), (2,2), (3,1)\}$. Scott-Knott would declare one of these divisions to be “best” as follows. For lists $l, m, n$ of size $ls, ms, ns$ where $l = m \cup n$, the “best” division maximizes $E(\Delta)$; i.e. the difference in the expected mean value before and after the split:

$E(\Delta) = \frac{ms}{ls} abs(m.\mu - l.\mu)^2 + \frac{ns}{ls} abs(n.\mu - l.\mu)^2$

Scott-Knott then checks if that “best” division is actually useful. To implement that check, Scott-Knott would apply some statistical hypothesis test $H$ to check if $m, n$ are significantly different. If so, Scott-Knott then recurses on each half of the “best” division. For a more specific example, consider the results from $l = 5$ treatments:
        rx1 = [0.34, 0.49, 0.51, 0.6]
        rx2 = [0.6,  0.7,  0.8,  0.9]
        rx3 = [0.15, 0.25, 0.4,  0.35]
        rx4 = [0.6,  0.7,  0.8,  0.9]
        rx5 = [0.1,  0.2,  0.3,  0.4]
After sorting and division, Scott-Knott declares:
  • Ranked #1 is rx5 with median= 0.25

  • Ranked #1 is rx3 with median= 0.3

  • Ranked #2 is rx1 with median= 0.5

  • Ranked #3 is rx2 with median= 0.75

  • Ranked #3 is rx4 with median= 0.75

Note that Scott-Knott found little difference between rx5 and rx3. Hence, they have the same rank, even though their medians differ.

Scott-Knott is preferred to, say, an all-pairs hypothesis test of all methods; e.g. six treatments can be compared $(6^2-6)/2=15$ ways. A 95% confidence test run for each comparison has a very low total confidence: $0.95^{15} \approx 46\%$. To avoid an all-pairs comparison, Scott-Knott only calls on hypothesis tests after it has found splits that maximize the performance differences.

For this study, our hypothesis test was a conjunction of the A12 effect size test and non-parametric bootstrap sampling; i.e. our Scott-Knott divided the data if both bootstrapping and an effect size test agreed that the division was statistically significant (95% confidence) and not a “small” effect. For a justification of the use of non-parametric bootstrapping, see Efron & Tibshirani (Efron and Tibshirani, 1993, p220-223). For a justification of the use of effect size tests see Shepperd & MacDonell Shepperd and MacDonell (2012); Kampenes et al. Kampenes et al. (2007); and Kocaguneli et al. Keung et al. (2013). These researchers warn that even if a hypothesis test declares two populations to be “significantly” different, that result is misleading if the “effect size” is very small. Hence, to assess the performance differences we first must rule out small effects. Vargha and Delaney’s non-parametric A12 effect size test explores two lists $M$ and $N$ of size $m$ and $n$:

$$A12 = \left(\sum_{x\in M,\,y\in N} \begin{cases} 1 & \text{if } x > y \\ 0.5 & \text{if } x = y \end{cases}\right) / (mn)$$

This expression computes the probability that numbers in one sample are bigger than in another. This test was recently endorsed by Arcuri and Briand at ICSE’11 Arcuri and Briand (2011).
Table 6: Explanation of Scott-Knott test.
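The split selection and the effect size test described in Table 6 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: the function names `a12` and `best_split` are hypothetical, only one level of the recursion is shown, and the statistical check that decides whether to accept a split is left out. The data are the five treatments listed in Table 6.

```python
from statistics import mean, median


def a12(xs, ys):
    """Vargha-Delaney A12: probability that a value drawn from xs is
    larger than a value drawn from ys (ties count as half)."""
    more = sum(1 for x in xs for y in ys if x > y)
    same = sum(1 for x in xs for y in ys if x == y)
    return (more + 0.5 * same) / (len(xs) * len(ys))


def best_split(treatments):
    """treatments: list of (name, values), already sorted by median.
    Returns (cut, delta) for the division that maximizes the expected
    difference in means before and after the split."""
    everything = [v for _, values in treatments for v in values]
    mu, size = mean(everything), len(everything)
    best_cut, best_delta = None, -1.0
    for cut in range(1, len(treatments)):
        left = [v for _, values in treatments[:cut] for v in values]
        right = [v for _, values in treatments[cut:] for v in values]
        delta = (len(left) / size * abs(mean(left) - mu) ** 2
                 + len(right) / size * abs(mean(right) - mu) ** 2)
        if delta > best_delta:
            best_cut, best_delta = cut, delta
    return best_cut, best_delta


results = {
    "rx1": [0.34, 0.49, 0.51, 0.6],
    "rx2": [0.6, 0.7, 0.8, 0.9],
    "rx3": [0.15, 0.25, 0.4, 0.35],
    "rx4": [0.6, 0.7, 0.8, 0.9],
    "rx5": [0.1, 0.2, 0.3, 0.4],
}
ordered = sorted(results.items(), key=lambda item: median(item[1]))
cut, delta = best_split(ordered)
left_names = [name for name, _ in ordered[:cut]]
right_names = [name for name, _ in ordered[cut:]]
```

On this data the first division separates rx5, rx3 and rx1 from rx2 and rx4. A full Scott-Knott run would then test that division for significance and, if it holds, recurse on each side.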

The results from each test set are evaluated in terms of two scoring metrics: magnitude of the relative error (MRE) Conte et al. (1986) and Standardized Accuracy (SA). These scoring metrics are defined in Table 5. We use these since there are advocates for each in the literature. For example, Shepperd and MacDonell argue convincingly for the use of SA Shepperd and MacDonell (2012) (as well as for the use of effect size tests in effort estimation). Also in 2016, Sarro et al. (http://tiny.cc/sarro16gecco) used MRE to argue their estimators were competitive with human estimates (which Molokken and Jorgensen Molokken and Jorgensen (2003b) report lie within 30% and 40% of the true value).

Note that for these evaluation measures:

  • MRE values: smaller are better

  • SA values: larger are better

From the cross-vals, we report the median (termed med), which is the 50th percentile of the test scores seen in the M*N results. Also reported is the inter-quartile range (termed IQR), which is the (75-25)th percentile. The IQR is a non-parametric description of the variability about the median value.
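As a small illustration of these two summary statistics, the following sketch computes med and IQR with Python's standard `statistics.quantiles`. The function name `med_iqr`, the choice of the "inclusive" quantile method and the scores are assumptions made for this example; other percentile conventions give slightly different values on small samples.

```python
from statistics import quantiles


def med_iqr(scores):
    """Return (median, inter-quartile range) of a list of test scores."""
    q1, q2, q3 = quantiles(scores, n=4, method="inclusive")
    return q2, q3 - q1


# Hypothetical test scores from a handful of cross-validation runs.
med, iqr = med_iqr([50, 10, 40, 20, 30])
```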

For each dataset, the results from an M*N-way cross-validation are sorted by their median value, then ranked using the Scott-Knott test recommended for ranking effort estimation experiments by Mittas et al. in TSE’13 Mittas and Angelis (2013). For full details on the Scott-Knott test, see Table 6. In summary, Scott-Knott is a top-down bi-clustering method that recursively divides sorted treatments. Division stops when there is only one treatment left or when a division of numerous treatments generates splits that are statistically indistinguishable. To judge when two sets of treatments are indistinguishable, we use a conjunction of both a 95% bootstrap significance test Efron and Tibshirani (1993) and an A12 test for a non-small effect size difference in the distributions Menzies et al. (2017). These tests were used since their non-parametric nature avoids issues with non-Gaussian distributions.
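A bootstrap significance test of the kind mentioned above can be sketched as follows. This is a simplified variant for illustration only: it compares raw differences in means rather than the studentized statistic discussed by Efron and Tibshirani, and the function name, the number of `rounds` and the `seed` are assumptions rather than the settings used in this study.

```python
import random
from statistics import mean


def bootstrap_different(xs, ys, rounds=1000, conf=0.95, seed=1):
    """Return True if the means of xs and ys differ at the given confidence."""
    rng = random.Random(seed)
    mean_x, mean_y = mean(xs), mean(ys)
    observed = abs(mean_x - mean_y)
    pooled = mean(xs + ys)
    # Shift both samples to a common mean so that the null hypothesis
    # (no difference in means) holds for the resampled data.
    xs0 = [x - mean_x + pooled for x in xs]
    ys0 = [y - mean_y + pooled for y in ys]
    at_least_as_big = 0
    for _ in range(rounds):
        bx = [rng.choice(xs0) for _ in xs0]  # resample with replacement
        by = [rng.choice(ys0) for _ in ys0]
        if abs(mean(bx) - mean(by)) >= observed:
            at_least_as_big += 1
    return at_least_as_big / rounds < 1 - conf


# Two clearly separated samples, then two identical samples.
clearly_apart = bootstrap_different([0.1, 0.2, 0.3, 0.4], [0.6, 0.7, 0.8, 0.9])
identical = bootstrap_different([0.6, 0.7, 0.8, 0.9], [0.6, 0.7, 0.8, 0.9])
```

In a conjunction like the one described above, a split would be accepted only when such a test reports a difference and the A12 effect size is also not small.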

Table 7 shows an example of the report generated by our Scott-Knott procedure. Note that when multiple treatments tie for Rank=1, then we use the treatment’s runtimes to break the tie. Specifically, for all treatments in Rank=1, we mark the faster ones as Rank=1.

Standardized Accuracy
Rank Method Med IQR
1 CART_FLASH 65 18
1 CART_DE 59 19
2 ABEN_DE 52 23
2 ABE0 51 20
2 RF 49 29
2 LP4EE 47 25
3 CART 41 31
3 ABEN_RD 37 33
3 ATLM 34 13
3 CART_RD 32 27
3 SVR 30 18
Table 7: Example of Scott-Knott results. SA scores seen in the albrecht dataset, sorted by their median value. Here, larger values are better. Med is the 50th percentile and IQR is the inter-quartile range; i.e., 75th-25th percentile. Lines with a dot in the middle show median values with the IQR. For the Ranks, smaller values are better. Ranks are computed via the Scott-Knott procedure from TSE’2013 Mittas and Angelis (2013). Rows with the same ranks are statistically indistinguishable. Rows shown in color denote the fastest best-ranked treatments.

3.4 Terminology for Optimizers

Some treatments are named “X_Y”, which denotes learner “X” tuned by optimizer “Y”. The optimizer suffixes used in the following are RD (random search), DE (differential evolution) and FLASH.

Note that we do not tune ATLM and LP4EE since they were designed to be used “off-the-shelf”. Whigham et al. Whigham et al. (2015) declare that one of ATLM’s most important features is that it does not need tuning. We also do not tune SVR and RF since we treat them as baseline algorithm-based methods in our benchmarks (i.e. we use the default settings in scikit-learn for these algorithms).

4 Results

faster slower












kemerer 4 5 13
albrecht 4 6 15
finnish 5 6 18
miyazaki 6 8 21
desharnais 9 11 24
isbsg10 7 10 23
maxwell 12 16 34
kitchenham 16 17 37
china 23 26 54
total 3 4 4 4 5 7 8 9 8 86 105
Table 8: Average runtime (in minutes) for one-way out of an N*M cross-validation experiment, executing on a 2GHz processor with 8GB RAM, running Windows 10. Note that LP4EE and ATLM have no tuning results since the authors of these methods stress that it is advantageous to use their baseline methods without any tuning. Last column reports totals for each dataset.
Rank Method Med. IQR
1 CART_FLASH 32 12
1 CART_DE 33 14
2 ABEN_DE 42 15
2 LP4EE 44 13
2 ABE0 45 16
2 RF 46 24
2 ABEN_RD 48 21
3 CART 53 22
3 CART_RD 54 20
3 SVR 56 19
4 ATLM 140 91 out-of-range
1 LP4EE 45 5
1 ATLM 48 6
2 ABEN_DE 62 6
3 ABE0 64 5
3 CART_DE 64 6
4 ABEN_RD 68 7
4 RF 69 8
5 SVR 71 7
5 CART 71 5
5 CART_RD 72 6
1 CART_DE 35 11
1 CART_FLASH 35 11
1 LP4EE 38 13
2 RF 46 12
2 ABEN_DE 47 14
2 SVR 48 12
2 CART_RD 48 13
2 ABEN_RD 49 16
2 CART 49 14
2 ABE0 50 15
2 ATLM 54 17
1 CART_FLASH 42 17
2 CART_DE 48 16
3 CART 57 21
3 RF 57 26
3 CART_RD 58 22
4 ABEN_DE 62 37
4 LP4EE 63 33
4 ABE0 64 48
4 ABEN_RD 64 42
4 SVR 74 13
5 ATLM 87 72
1 CART_DE 59 20
1 CART_FLASH 62 19
2 SVR 72 17
2 CART_RD 73 24
2 ABE0 73 60
2 CART 74 21
2 ABEN_DE 74 25
2 LP4EE 75 23
2 ABEN_RD 76 35
2 RF 78 58
3 ATLM 127 124 out-of-range
1 CART_DE 32 24
1 CART_FLASH 37 27
2 RF 50 39
2 ABEN_DE 54 22
2 LP4EE 54 23
2 CART_RD 55 25
2 CART 55 27
3 ABE0 56 33
3 SVR 59 14
3 ABEN_RD 60 17
4 ATLM 76 56
1 ABEN_DE 35 8
2 LP4EE 38 6
2 CART_DE 38 7
2 ABE0 39 9
3 ABEN_RD 42 10
3 RF 43 8
4 CART 49 11
5 CART_RD 57 12
5 SVR 60 8
6 ATLM 106 108 out-of-range
1 CART_FLASH 36 10
1 CART_DE 38 8
2 RF 51 15
2 LP4EE 51 16
2 CART 52 10
2 CART_RD 53 11
2 ABEN_DE 53 16
3 SVR 56 13
3 ABEN_RD 56 13
3 ABE0 56 14
4 ATLM 282 221 out-of-range
1 CART_DE 32 11
1 CART_FLASH 32 11
1 LP4EE 33 10
2 SVR 37 10
2 ATLM 37 32
2 ABEN_DE 39 16
3 RF 46 24
3 ABE0 47 12
3 CART 47 16
3 ABEN_RD 47 15
3 CART_RD 48 14
Table 9: % MRE results from our cross-validation studies. Smaller values are better. Same format as Table 7. The gray rows show the Rank=1 results recommended for each data set. The phrase “out-of-range” denotes results that are so bad that they fall outside of the 0%..100% range shown here.
Rank Method Med. IQR
1 CART_FLASH 65 18
1 CART_DE 59 19
2 ABEN_DE 52 23
2 ABE0 51 20
2 RF 49 29
2 LP4EE 47 25
3 CART 41 31
3 ABEN_RD 37 33
3 ATLM 34 13
3 CART_RD 32 27
3 SVR 30 18
1 LP4EE 32 9
1 ABEN_DE 29 13
1 ABE0 28 7
1 CART_DE 27 6
2 SVR 21 3
2 RF 21 11
2 ABEN_RD 20 9
3 ATLM 12 5
3 CART 12 13
3 CART_RD 10 16
1 CART_FLASH 53 11
1 CART_DE 53 11
2 LP4EE 48 12
2 RF 46 11
2 ABEN_DE 46 14
2 ABE0 44 12
2 SVR 43 7
3 CART 39 12
3 ATLM 37 8
3 ABEN_RD 37 13
4 CART_RD 31 10
1 CART_FLASH 54 13
2 CART_DE 49 12
3 RF 44 16
3 ABEN_DE 43 25
3 CART 42 17
4 ATLM 41 48
4 CART_RD 40 22
4 ABE0 40 25
4 LP4EE 39 22
4 ABEN_RD 38 27
5 SVR 24 9
1 CART_DE 33 19
1 CART_FLASH 30 18
1 ATLM 30 20
2 ABEN_DE 28 24
2 ABE0 28 23
2 SVR 25 11
3 LP4EE 22 23
3 CART_RD 22 28
3 RF 22 35
3 ABEN_RD 21 27
3 CART 20 34
1 CART_DE 55 30
2 CART_FLASH 43 27
2 CART 42 25
2 RF 41 29
2 ABEN_DE 40 27
2 LP4EE 40 23
2 ABE0 38 25
2 CART_RD 36 27
3 ABEN_RD 32 28
3 ATLM 30 28
3 SVR 28 24
1 LP4EE 52 24
1 ABEN_DE 51 25
1 ABE0 47 23
1 CART_FLASH 44 24
2 RF 41 19
2 ABEN_RD 40 19
2 CART_DE 40 20
3 CART 34 14
4 CART_RD 32 18
4 SVR 32 17
5 ATLM -3 37
1 LP4EE 52 13
1 CART_DE 51 16
2 RF 44 11
2 ABEN_DE 43 12
3 ABE0 39 10
3 ABEN_RD 39 13
4 CART 37 21
4 CART_RD 36 23
4 SVR 30 11
5 ATLM -107 99 out-of-range
1 CART_FLASH 53 13
1 CART_DE 53 14
1 LP4EE 52 11
1 ATLM 50 9
2 ABEN_DE 46 18
2 RF 46 19
2 ABE0 45 15
3 CART_RD 42 20
3 ABEN_RD 42 24
3 SVR 41 12
3 CART 41 21
Table 10: % SA results from our cross-validation studies. Larger values are better. Same format as Table 7. The gray rows show the Rank=1 results recommended for each data set. The phrase “out-of-range” denotes results that are so bad that they fall outside of the 0%..100% range shown here.

4.1 Observations

Table 8 shows the runtimes (in minutes) for one of our N*M experiments for each dataset. From the last column of that table, we see that the median to maximum runtimes per dataset are:

  • 24 to 54 minutes, for one-way;

  • Hence 8 to 18 hours, for the 20 repeats of our N*M experiments.
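The conversion from one-way minutes to total hours is simple arithmetic, shown here as a worked check. The function name `total_hours` is an assumption; the 24 and 54 minute figures and the 20 repeats are those quoted above.

```python
def total_hours(minutes_per_run, repeats=20):
    """Total runtime in hours for a number of repeats of one run."""
    return minutes_per_run * repeats / 60


median_case = total_hours(24)   # 24 minutes x 20 repeats
maximum_case = total_hours(54)  # 54 minutes x 20 repeats
```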

Performance scores for all datasets are shown in Table 9 and Table 10. We observe that ATLM and LP4EE performed as expected. Whigham et al. Whigham et al. (2015) and Sarro et al. Sarro and Petrozziello (2018) designed these methods to serve as baselines against which other treatments can be compared. Hence, it might be expected that in some cases these methods will perform comparatively below other methods. This was certainly the case here: as seen in Table 9 and Table 10, these baseline methods are top-ranked in only 8/18 cases.

Another thing to observe in Table 9 and Table 10 is that random search (RD) also performed as expected; i.e. it was never top-ranked. This is a gratifying result since, if it were otherwise, that would tend to negate the value of hyperparameter optimization.

We also see in Table 9 empirical evidence that many of our methods achieve human-competitive results. Molokken and Jorgensen Molokken and Jorgensen (2003a)’s survey of current industry practices reports that human-expert predictions of project effort lie within 30% and 40% of the true value; i.e. an MRE of 30% to 40%. Applying that range to Table 9 we see that in 6/9 datasets, the best estimator has MRE $\le 40\%$; i.e. they lie comfortably within the stated human-based industrial thresholds. Also, in a further 2/9 datasets, the best estimator has MRE $\le 45\%$; i.e. they are close to the performance of humans.

The exception to the results in the last paragraph is isbsg10, where the best estimator has an MRE of 59%; i.e. our best performance is nowhere close to that of human estimators. In future work, we recommend researchers use isbsg10 as a “stress test” on new methods.
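The counts in the last two paragraphs can be checked against Table 9 with a short script. The list below holds the smallest Rank=1 median MRE of each of the nine groups in Table 9, in order of appearance. The 40% bound is the upper end of the human range quoted above; treating 45% as “close to human performance” is an assumption made for this sketch.

```python
# Smallest Rank=1 median MRE (%) per dataset, read off Table 9 in order.
best_mre = [32, 45, 35, 42, 59, 32, 35, 36, 32]

HUMAN_UPPER = 40   # upper end of the 30% to 40% human range
CLOSE_UPPER = 45   # assumption: what counts as "close" to human performance

within = sum(1 for m in best_mre if m <= HUMAN_UPPER)
close = sum(1 for m in best_mre if HUMAN_UPPER < m <= CLOSE_UPPER)
beyond = sum(1 for m in best_mre if m > CLOSE_UPPER)
```

With these thresholds, six datasets fall within the human range, two are close to it, and one (with a median MRE of 59%) lies well beyond it.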

4.2 Answers to Research Questions

Turning now to the research questions listed in the introduction:

RQ1: Is it best just to use the “off-the-shelf” defaults?

As mentioned in the introduction, Arcuri & Fraser note that for test case generation, using the default settings can work just as well as anything else. We can see some evidence of this effect in Table 9 and Table 10. Observe, for example, the kitchenham results where the untuned ABE0 treatment achieves Rank=1.

However, overall, Table 9 and Table 10 are negative on the use of default settings. For example, in datasets “albrecht”, “desharnais”, “finnish”, not one of the treatments that use the defaults is found in Rank=1. Overall, if we always used just one of the methods using defaults (LP4EE, ATLM, ABE0) then that would achieve best ranks in 8/18 datasets.

Another aspect to note in the Table 9 and Table 10 results is the large difference in performance scores between the best and worst treatments (exceptions: miyazaki’s MRE and SA scores do not vary much; and neither do isbsg10’s SA scores). That is, there is much to be gained by using the Rank=1 treatments and deprecating the rest.

In summary, using the defaults is recommended only for some of the datasets. Also, in terms of better test scores, there is much to be gained from tuning. Hence:

Lesson1: “Off-the-shelf” defaults should be deprecated.

RQ2: Can we replace the old defaults with new defaults?

If the hyperparameter tunings found by this paper were nearly always the same, then this study could conclude by recommending better values for default settings. This would be a most convenient result since, in future when new data arrives, the complexities of this study would not be needed.

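[Table 11 body: the shaded cells showing how often each choice was selected per dataset are not recoverable. Columns and their options were:]
%max_features (selected at random; 100% means “use all”): 25%, 50%, 75%, 100%
max_depth (of trees): 03, 06, 09, 12
min_sample_split (continuation criteria): 5, 10, 15, 20
min_samples_leaf (termination criteria): 03, 06, 09, 12
KEY: 10 20 30 40 50 60 70 80 90 100%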

Table 11: Tunings discovered by hyperparameter selections (CART+DE). Table rows sorted by number of rows in data sets (smallest on top). Cells in this table show the percent of times a particular choice was made. White text on black denotes choices made in more than 50% of tunings.

Unfortunately, this turns out not to be the case. Table 11 shows the percent frequencies with which some tuning decision appears in our M*N-way cross validations (this table uses results from DE tuning CART since, as shown below, this usually leads to best results). Note that in those results it is not true that across most datasets there is a setting that is usually selected (though min_samples_leaf less than 3 is often a popular setting). Accordingly, we say that Table 11 shows that there is much variation in the best tunings. Hence, for effort estimation:

Lesson2: Overall, there are no “best” default settings.
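The kind of tally reported in Table 11 can be sketched as follows: for one tunable option, count how often each value was chosen across the tunings found in repeated runs, then report a “default” only if some value is chosen in more than 50% of them. The tunings below are hypothetical, as are the function names; they are not results from this study.

```python
from collections import Counter


def choice_frequencies(tunings, option):
    """Percent of tunings in which each value of `option` was selected."""
    counts = Counter(tuning[option] for tuning in tunings)
    return {value: 100 * count / len(tunings) for value, count in counts.items()}


def dominant_choice(tunings, option, threshold=50):
    """Value chosen in more than `threshold` percent of tunings, else None."""
    freqs = choice_frequencies(tunings, option)
    value, percent = max(freqs.items(), key=lambda item: item[1])
    return value if percent > threshold else None


# Hypothetical tunings from four runs on one dataset.
tunings = [
    {"max_depth": 3, "min_samples_leaf": 1},
    {"max_depth": 6, "min_samples_leaf": 1},
    {"max_depth": 9, "min_samples_leaf": 1},
    {"max_depth": 12, "min_samples_leaf": 2},
]
leaf_freqs = choice_frequencies(tunings, "min_samples_leaf")
leaf_default = dominant_choice(tunings, "min_samples_leaf")
depth_default = dominant_choice(tunings, "max_depth")
```

In this made-up example one option has a clear majority choice while the other has none, which is the pattern that argues against fixed defaults.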

Before going on, one curious aspect of the Table 11 results is the %max_features results; it was rarely most useful to use all features. Except for finnish and china, best results were often obtained after discarding (at random) a quarter to three-quarters of the features. This is a clear indication that, in future work, it might be advantageous to explore more feature selection for CART models.

RQ3: Can we avoid slow hyperparameter optimization?

Some methods in our experiments (ABEN_RD and ABEN_DE) are slower than others, even with the same number of evaluations, as shown in Table 8. Is it possible to avoid such slow runtimes?

Long and slow optimization times are recommended when their exploration leads to better solutions. Such better solutions from slower optimizations are rarely found in Table 9 and Table 10 (only in 2/18 cases: see the ABEN_DE results for kitchenham and china). Further, the size of the improvements seen with the slower optimizers over the best Rank=2 treatments is small. Those improvements come at a runtime cost (in Table 8, the slower optimizers are an order of magnitude slower than other methods). Hence we say that for effort estimation:

Lesson3: Overall, our slowest optimizers perform no better than faster ones.

RQ4: What hyperparameter optimizers to use for effort estimation?

When we discuss this work with our industrial colleagues, they want to know “the bottom line”; i.e. what they should use or, at the very least, what they should not use. This section offers that advice. We stress that this section is based on the above results so, clearly these recommendations are something that would need to be revised whenever new results come to hand.

Based on the above we can assert that using all the estimators mentioned above is not recommended (to say the least):

  • For one thing, many of them never appear in our top-ranked results.

  • For another thing, testing all of them on new datasets would be needlessly expensive. Recall our rig: 20 repeats over the data where each of those repeats includes the slower estimators shown in Table 8. As seen in that table, the median to maximum runtimes for such an analysis for a single dataset would take 8 to 18 hours (i.e. hours to days).

Rank Method Win Times
1 CART_FLASH 16/18
2 CART_DE 14/18
3 LP4EE 7/18
4 ATLM 3/18
5 ABEN_DE 2/18
5 ABE0 2/18
6 CART 0/18
6 ABEN_RD 0/18
6 RF 0/18
6 CART_RD 0/18
6 SVR 0/18
Table 12: Methods ranking of total winning times (Rank=1), in all 18 experiment cases (9 datasets for both MRE and SA).

Table 12 lists the best that can be expected if an engineer chooses one of the estimators in our experiment and applies it to all our datasets. The fractions shown at right come from counting optimizer frequencies in the top-ranks of Table 9 and Table 10. Note that the champion in our experiment is “CART_FLASH”, which ranked as ‘1’ in 16 out of all 18 cases. One close runner-up is “CART_DE”, which wins in 2 fewer cases. Those two estimators usually perform well in most cases in the experiment.

Besides the two top methods, none of the remaining estimators is top-ranked in even half of all cases. That includes the untuned baseline methods (LP4EE, ATLM). Hence, we cannot endorse their use for generating estimates to be shown to business managers. That said, we do still endorse their use as baseline methods, for methodological reasons in effort estimation research (they are useful for generating a quick result against which we can compare other, better, methods).

Hence, based on the results of Table 12, for similar effort estimation tasks, we recommend:

Lesson4: For new datasets, try a combination of CART with the optimizers differential evolution and FLASH.

5 Threats to Validity

Internal Bias: Many of our methods contain stochastic random operators. To reduce the bias from random operators, we repeated our experiment 20 times and applied statistical tests to remove spurious distinctions.

Parameter Bias: For other studies, this is a significant question since (as shown above) the settings to the control parameters of the learners can have a positive effect on the efficacy of the estimation. That said, recall that much of the technology of this paper concerned methods to explore the space of possible parameters. Hence we assert that this study suffers much less parameter bias than other studies.

Sampling Bias: While we tested OIL on nine datasets, it would be inappropriate to conclude that OIL tuning always performs better than other methods for all datasets. As researchers, what we can do to mitigate this problem is to carefully document our method, release our code, and encourage the community to try this method on more datasets, as the occasion arises.

6 Related Work

In software engineering, hyperparameter optimization techniques have been applied to some sub-domains, but have yet to be adopted in many others. One way to characterize this paper is as an attempt to adapt recent work on hyperparameter optimization in software defect prediction to effort estimation. Note that, as in defect prediction, this article has also concluded that Differential Evolution is a useful method.

Several SE defect prediction techniques rely on static code attributes Krishna et al. (2016); Nam and Kim (2015); Tan et al. (2015). Much of that work has focused on finding and employing complex and “off-the-shelf” machine learning models Menzies et al. (2007); Moser et al. (2008); Elish and Elish (2008), without any hyperparameter optimization. According to a literature review done by Fu et al. Fu et al. (2016b), as shown in Figure 2, nearly 80% of highly cited papers in defect prediction do not mention parameter tuning (so they rely on the default parameter settings of the prediction models).

Figure 2: Literature review of hyperparameters tuning on 52 top defect prediction papers Fu et al. (2016b)

Gao et al. Gao et al. (2011) acknowledged the impacts of the parameter tuning for software quality prediction. For example, in their study, “distanceWeighting” parameter was set to “Weight by 1/distance”, the KNN parameter “k” was set to “30”, and the “crossValidate” parameter was set to “true”. However, they did not provide any further explanation about their tuning strategies.

As to methods of tuning, Bergstra and Bengio Bergstra and Bengio (2012) comment that grid search (i.e. for each tunable option, run nested for-loops to explore their ranges) is very popular since (a) such a simple search gives researchers some degree of insight; (b) grid search has very little technical overhead for its implementation; (c) it is simple to automate and parallelize; (d) on a computing cluster, it can find better tunings than sequential optimization (in the same amount of time). That said, Bergstra and Bengio deprecate grid search since that style of search is not more effective than more randomized searchers if the underlying search space is inherently low dimensional. This remark is particularly relevant to effort estimation since datasets in this domain are often low dimensional Kocaguneli et al. (2013).
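The contrast between grid search and a more randomized search can be sketched as follows. This is an illustration under stated assumptions, not an experiment from this paper: the option ranges are hypothetical (loosely modeled on the column headers of Table 11), and `toy_error` is a made-up stand-in for an estimator's error in which only two of the four options matter, so the space is effectively low dimensional.

```python
import itertools
import random

# Hypothetical CART options, loosely modeled on the columns of Table 11.
SPACE = {
    "max_features": [0.25, 0.50, 0.75, 1.00],
    "max_depth": [3, 6, 9, 12],
    "min_samples_split": [5, 10, 15, 20],
    "min_samples_leaf": [3, 6, 9, 12],
}


def toy_error(tuning):
    """Made-up error function: only two of the four options matter."""
    return abs(tuning["max_depth"] - 6) + abs(tuning["min_samples_leaf"] - 3)


def grid_search(space, error):
    """Nested for-loops over every option (via itertools.product)."""
    names = list(space)
    best, evaluations = None, 0
    for values in itertools.product(*(space[name] for name in names)):
        tuning = dict(zip(names, values))
        score = error(tuning)
        evaluations += 1
        if best is None or score < best[0]:
            best = (score, tuning)
    return best[0], best[1], evaluations


def random_search(space, error, budget, seed=1):
    """Sample `budget` tunings uniformly at random; keep the best."""
    rng = random.Random(seed)
    best = None
    for _ in range(budget):
        tuning = {name: rng.choice(options) for name, options in space.items()}
        score = error(tuning)
        if best is None or score < best[0]:
            best = (score, tuning)
    return best[0], best[1], budget


grid_score, grid_tuning, grid_evals = grid_search(SPACE, toy_error)
rand_score, rand_tuning, rand_evals = random_search(SPACE, toy_error, budget=32)
```

Grid search here spends 256 evaluations, most of them varying options that do not affect the score, while random search uses a fixed budget an eighth of that size.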

Lessmann et al. Lessmann et al. (2008) used grid search to tune parameters as part of their extensive analysis of different algorithms for defect prediction. However, they only tuned a small set of their learners while they used the default settings for the rest. Our conjecture is that the overall cost of their tuning was too expensive so they chose only to tune the most critical part.

Two recent studies investigating the effects of parameter tuning on defect prediction were conducted by Tantithamthavorn et al. Tantithamthavorn et al. (2016, 2018) and Fu et al. Fu et al. (2016a). Tantithamthavorn et al. also used grid search while Fu et al. used differential evolution. Both of the papers concluded that tuning rarely makes performance worse across a range of performance measures (precision, recall, etc.). Fu et al. Fu et al. (2016a) also report that different datasets require different hyperparameters to maximize performance.

One major difference between the studies of Fu et al. Fu et al. (2016a) and Tantithamthavorn et al. Tantithamthavorn et al. (2016) was the computational costs of their experiments. Since Fu et al.’s differential evolution based method had a strict stopping criterion, it was significantly faster.
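To make the idea of differential evolution with a strict stopping criterion concrete, here is a minimal sketch in the style of Storn and Price (1997). It is not the optimizer used by Fu et al. or in this study. The control settings (`pop_size`, `f`, `cr`, `generations`), the `seed` and the toy objective are all assumptions chosen for illustration; in practice the objective would be an estimator's error on held-out data.

```python
import random


def differential_evolution(score, bounds, pop_size=20, f=0.75, cr=0.3,
                           generations=10, seed=1):
    """Minimize `score` over the box `bounds` (a list of (low, high) pairs)."""
    rng = random.Random(seed)
    dims = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [score(p) for p in pop]
    for _ in range(generations):  # strict stop: a fixed number of generations
        for i in range(pop_size):
            others = [p for j, p in enumerate(pop) if j != i]
            a, b, c = rng.sample(others, 3)
            keep = rng.randrange(dims)  # at least one value comes from the mutant
            trial = []
            for d, (lo, hi) in enumerate(bounds):
                if d == keep or rng.random() < cr:
                    value = a[d] + f * (b[d] - c[d])
                    value = min(hi, max(lo, value))  # clip to the bounds
                else:
                    value = pop[i][d]
                trial.append(value)
            trial_score = score(trial)
            if trial_score <= scores[i]:  # greedy replacement
                pop[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]


def sphere(xs):
    """Toy objective with its minimum (zero) at the origin."""
    return sum(x * x for x in xs)


BOUNDS = [(-5.0, 5.0)] * 3
start_point, start_score = differential_evolution(sphere, BOUNDS, generations=0)
best_point, best_score = differential_evolution(sphere, BOUNDS, generations=10)
```

Because the number of generations is fixed in advance, the evaluation budget is known before the run starts: `pop_size * (generations + 1)` calls to the objective.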

Note that there are several other methods for hyperparameter optimization and we aim to explore several other methods as part of future work. But as shown here, it requires much work to create and extract conclusions from a hyperparameter optimizer. One goal of this work, which we think we have achieved, was to identify a simple baseline method against which subsequent work can be benchmarked.

7 Conclusions and Future Work

Hyperparameter optimization is known to improve the performance of many software analytics tasks such as software defect prediction or text classification Agrawal and Menzies (2018); Agrawal et al. (2018); Fu et al. (2016a); Tantithamthavorn et al. (2018). Most prior work on optimization for effort estimation only explored very small datasets Li et al. (2009) or used estimators that are not representative of the state-of-the-art Whigham et al. (2015); Sarro and Petrozziello (2018). Other researchers assume that the effort model has a specific parametric form (e.g. the COCOMO equation), which greatly limits the amount of data that can be studied. Further, all that prior work needs to be revisited given the existence of recent and very prominent methods; i.e. ATLM from TOSEM’15 Whigham et al. (2015) and LP4EE from TOSEM’18 Sarro and Petrozziello (2018).

Accordingly, this paper conducts a thorough investigation of hyperparameter optimization for effort estimation using methods that (a) make no data feature assumptions (i.e. no COCOMO data); (b) vary many parameters (6,000+ combinations); (c) test their results on 9 different sources with data on 945 software projects; (d) use optimizers representative of the state-of-the-art (DE Storn and Price (1997), FLASH Nair et al. (2018b)); and (e) benchmark results against prominent methods such as ATLM and LP4EE.

These results were assessed with respect to Arcuri and Fraser’s concerns mentioned in the introduction; i.e. sometimes hyperparameter optimization can be both too slow and not effective. Such pessimism may indeed apply to the test data generation domain. However, the results of this paper show that there exist other domains like effort estimation where hyperparameter optimization is both useful and fast. After applying hyperparameter optimization, large improvements in effort estimation accuracy were observed (measured in terms of the standardized accuracy). From those results, we can recommend using a combination of regression trees (CART) tuned by differential evolution and FLASH. This particular combination of learner and optimizers can achieve in a few minutes what other optimizers need longer CPU runtimes to accomplish.

This study is among the most extensive explorations of hyperparameter optimization and effort estimation yet undertaken. There are still very many options not explored here. Our current plans for future work include the following:

  • Try other learners: e.g. neural nets, Bayesian learners or AdaBoost;

  • Try other data pre-processors. We mentioned above how it was curious that max_features was often less than 100%. This is a clear indication that we might be able to further improve our estimation results by adding more intelligent feature selection to, say, CART.

  • Other optimizers. For example, combining DE and FLASH might be a fruitful way to proceed.

  • Yet another possible future direction could be hyper-hyperparameter optimization. In the above, we used optimizers like differential evolution to tune learners. But these optimizers have their own control parameters. Perhaps there are better settings for the optimizers, which could be found via hyper-hyperparameter optimization?

Hyper-hyperparameter optimization could be a very slow process. Hence, results like those of this paper could be most useful since here we have identified optimizers that are very fast and very slow (and the latter would not be suitable for hyper-hyperparameter optimization).

In any case, we hope that OIL and the results of this paper will prompt and enable more research on better methods to tune software effort estimators. To that end, we have placed our scripts and data online at https://github.com/arennax/effort_oil_2019


This work was partially funded by a National Science Foundation Award 1703487.


  • Agrawal and Menzies (2018) Agrawal A, Menzies T (2018) “better data” is better than “better data miners” (benefits of tuning smote for defect prediction). In: ICSE’18
  • Agrawal et al. (2018) Agrawal A, Fu W, Menzies T (2018) What is wrong with topic modeling? and how to fix it using search-based software engineering. IST Journal
  • Aljahdali and Sheta (2010) Aljahdali S, Sheta AF (2010) Software effort estimation by tuning coocmo model parameters using differential evolution. In: Computer Systems and Applications (AICCSA), 2010 IEEE/ACS International Conference on, IEEE, pp 1–6
  • Angelis and Stamelos (2000) Angelis L, Stamelos I (2000) A simulation tool for efficient analogy based cost estimation. EMSE 5(1):35–68
  • Arcuri and Briand (2011) Arcuri A, Briand L (2011) A practical guide for using statistical tests to assess randomized algorithms in software engineering. In: Software Engineering (ICSE), 2011 33rd International Conference on, IEEE, pp 1–10
  • Arcuri and Fraser (2013) Arcuri A, Fraser G (2013) Parameter tuning or default values? an empirical investigation in search-based software engineering. ESE 18(3):594–623
  • Atkinson-Abutridy et al. (2003) Atkinson-Abutridy J, Mellish C, Aitken S (2003) A semantically guided and domain-independent evolutionary model for knowledge discovery from texts. IEEE Transactions on Evolutionary Computation 7(6):546–560
  • Baker (2007) Baker DR (2007) A hybrid approach to expert and model based effort estimation. West Virginia University
  • Bergstra and Bengio (2012) Bergstra J, Bengio Y (2012) Random search for hyper-parameter optimization. J Mach Learn Res 13(1):281–305
  • Bergstra et al. (2011) Bergstra JS, Bardenet R, Bengio Y, Kégl B (2011) Algorithms for hyper-parameter optimization. In: Advances in neural information processing systems, pp 2546–2554
  • Boehm (1981) Boehm BW (1981) Software engineering economics. Prentice-Hall
  • Breiman (2001) Breiman L (2001) Random forests. Machine learning 45(1):5–32
  • Briand and Wieczorek (2002) Briand LC, Wieczorek I (2002) Resource estimation in software engineering. Encyclopedia of software engineering
  • Chalotra et al. (2015) Chalotra S, Sehra SK, Brar YS, Kaur N (2015) Tuning of cocomo model parameters by using bee colony optimization. Indian Journal of Science and Technology 8(14)
  • Chang and Lin (2011) Chang CC, Lin CJ (2011) Libsvm: a library for support vector machines. ACM transactions on intelligent systems and technology (TIST) 2(3):27
  • Chang (1974) Chang CL (1974) Finding prototypes for nearest neighbor classifiers. TC 100(11)
  • Conte et al. (1986) Conte SD, Dunsmore HE, Shen VY (1986) Software Engineering Metrics and Models. Benjamin-Cummings Publishing Co., Inc., Redwood City, CA, USA
  • Corazza et al. (2013) Corazza A, Di Martino S, Ferrucci F, Gravino C, Sarro F, Mendes E (2013) Using tabu search to configure support vector regression for effort estimation. Empirical Software Engineering 18(3):506–546
  • Cowing (2002) Cowing K (2002) Nasa to shut down checkout & launch control system. http://www.spaceref.com/news/viewnews.html?id=475
  • Das et al. (2016) Das S, Wong W, Dietterich T, Fern A, Emmott A (2016) Incorporating expert feedback into active anomaly discovery. In: 2016 IEEE 16th International Conference on Data Mining (ICDM), pp 853–858, DOI 10.1109/ICDM.2016.0102
  • Efron and Tibshirani (1993) Efron B, Tibshirani J (1993) Introduction to bootstrap. Chapman & Hall
  • Elish and Elish (2008) Elish KO, Elish MO (2008) Predicting defect-prone software modules using support vector machines. J Syst Softw 81(5):649–660, DOI 10.1016/j.jss.2007.07.040
  • Foss et al. (2003) Foss T, Stensrud E, Kitchenham B, Myrtveit I (2003) A simulation study of the model evaluation criterion mmre. TSE 29(11):985–995
  • Frank et al. (2002) Frank E, Hall M, Pfahringer B (2002) Locally weighted naive bayes. In: 19th conference on Uncertainty in Artificial Intelligence, pp 249–256
  • Fu et al. (2016a) Fu W, Menzies T, Shen X (2016a) Tuning for software analytics: Is it really necessary? IST Journal 76:135–146
  • Fu et al. (2016b) Fu W, Nair V, Menzies T (2016b) Why is differential evolution better than grid search for tuning defect predictors? arXiv preprint arXiv:1609.02613
  • Gao et al. (2011) Gao K, Khoshgoftaar TM, Wang H, Seliya N (2011) Choosing software metrics for defect prediction: an investigation on feature selection techniques. Software: Practice and Experience 41(5):579–606
  • Germano and Hufford (2016) Germano S, Hufford A (2016) Finish line to close 25% of stores and replace ceo glenn lyon. https://www.wsj.com/articles/finish-line-to-close-25-of-stores-swaps-ceo-1452171033
  • Hall and Holmes (2003) Hall MA, Holmes G (2003) Benchmarking attribute selection techniques. TKDE 15(6):1437–1447
  • Hardy (1993) Hardy MA (1993) Regression with dummy variables, vol 93. Sage
  • Hazrati (2011) Hazrati V (2011) It projects: 400% over-budget and only 25% of benefits realized. https://www.infoq.com/news/2011/10/risky-it-projects
  • Hihn and Menzies (2015) Hihn J, Menzies T (2015) Data mining methods and cost estimation models: Why is it so hard to infuse new ideas? In: ASEW, pp 5–9, DOI 10.1109/ASEW.2015.27
  • Jørgensen (2004) Jørgensen M (2004) A review of studies on expert estimation of software development effort. JSS 70(1-2):37–60
  • Jørgensen (2015) Jørgensen M (2015) The world is skewed: Ignorance, use, misuse, misunderstandings, and how to improve uncertainty analyses in software development projects
  • Jørgensen and Gruschke (2009) Jørgensen M, Gruschke TM (2009) The impact of lessons-learned sessions on effort estimation and uncertainty assessments. TSE 35(3):368–383
  • Kampenes et al. (2007) Kampenes VB, Dybå T, Hannay JE, Sjøberg DI (2007) A systematic review of effect size in software engineering experiments. Information and Software Technology 49(11-12):1073–1086
  • Kemerer (1987) Kemerer CF (1987) An empirical validation of software cost estimation models. CACM 30(5):416–429
  • Keung et al. (2013) Keung J, Kocaguneli E, Menzies T (2013) Finding conclusion stability for selecting the best effort predictor in software effort estimation. ASE 20(4):543–567, DOI 10.1007/s10515-012-0108-5
  • Keung et al. (2008) Keung JW, Kitchenham BA, Jeffery DR (2008) Analogy-x: Providing statistical inference to analogy-based software cost estimation. TSE 34(4):471–484
  • Kitchenham et al. (2001) Kitchenham BA, Pickard LM, MacDonell SG, Shepperd MJ (2001) What accuracy statistics really measure. IEE Proceedings - Software 148(3):81–85
  • Kocaguneli and Menzies (2011) Kocaguneli E, Menzies T (2011) How to find relevant data for effort estimation? In: ESEM, pp 255–264, DOI 10.1109/ESEM.2011.34
  • Kocaguneli et al. (2011) Kocaguneli E, Misirli AT, Caglayan B, Bener A (2011) Experiences on developer participation and effort estimation. In: SEAA’11, IEEE, pp 419–422
  • Kocaguneli et al. (2012) Kocaguneli E, Menzies T, Bener A, Keung JW (2012) Exploiting the essential assumptions of analogy-based effort estimation. TSE 38(2):425–438
  • Kocaguneli et al. (2013) Kocaguneli E, Menzies T, Keung J, Cok D, Madachy R (2013) Active learning and effort estimation: Finding the essential content of software effort estimation data. IEEE Transactions on Software Engineering 39(8):1040–1053
  • Kocaguneli et al. (2015) Kocaguneli E, Menzies T, Mendes E (2015) Transfer learning in effort estimation. ESE 20(3):813–843
  • Korte and Port (2008) Korte M, Port D (2008) Confidence in software cost estimation results based on mmre and pred. In: PROMISE’08, pp 63–70
  • Krishna et al. (2016) Krishna R, Menzies T, Fu W (2016) Too much automation? the bellwether effect and its implications for transfer learning. In: IEEE/ACM ICSE, ASE 2016, DOI 10.1145/2970276.2970339
  • Langdon et al. (2016) Langdon WB, Dolado J, Sarro F, Harman M (2016) Exact mean absolute error of baseline predictor, marp0. IST 73:16–18
  • L.Breiman (1984) Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and Regression Trees. Wadsworth
  • Lessmann et al. (2008) Lessmann S, Baesens B, Mues C, Pietsch S (2008) Benchmarking classification models for software defect prediction: A proposed framework and novel findings. IEEE Transactions on Software Engineering 34(4):485–496, DOI 10.1109/TSE.2008.35
  • Li et al. (2009) Li Y, Xie M, Goh TN (2009) A study of project selection and feature weighting for analogy based software cost estimation. JSS 82(2):241–252
  • McConnell (2006) McConnell S (2006) Software estimation: demystifying the black art. Microsoft press
  • Mendes and Mosley (2002) Mendes E, Mosley N (2002) Further investigation into the use of cbr and stepwise regression to predict development effort for web hypermedia applications. In: ESEM’02, IEEE, pp 79–90
  • Mendes et al. (2003) Mendes E, Watson I, Triggs C, Mosley N, Counsell S (2003) A comparative study of cost estimation models for web hypermedia applications. ESE 8(2):163–196
  • Menzies and Zimmermann (2018) Menzies T, Zimmermann T (2018) Software analytics: What’s next? IEEE Software 35(5):64–70
  • Menzies et al. (2006) Menzies T, Chen Z, Hihn J, Lum K (2006) Selecting best practices for effort estimation. TSE 32(11):883–895
  • Menzies et al. (2007) Menzies T, Greenwald J, Frank A (2007) Data mining static code attributes to learn defect predictors. IEEE Transactions on Software Engineering 33(1):2–13, DOI 10.1109/TSE.2007.256941
  • Menzies et al. (2017) Menzies T, Yang Y, Mathew G, Boehm B, Hihn J (2017) Negative results for software effort estimation. ESE 22(5):2658–2683, DOI 10.1007/s10664-016-9472-2
  • Menzies et al. (2018) Menzies T, Majumder S, Balaji N, Brey K, Fu W (2018) 500+ times faster than deep learning: (a case study exploring faster methods for text mining stackoverflow). In: 2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR), IEEE, pp 554–563
  • Mittas and Angelis (2013) Mittas N, Angelis L (2013) Ranking and clustering software cost estimation models through a multiple comparisons algorithm. IEEE Trans SE 39(4):537–551, DOI 10.1109/TSE.2012.45
  • Moeyersoms et al. (2015) Moeyersoms J, Junqué de Fortuny E, Dejaeger K, Baesens B, Martens D (2015) Comprehensible software fault and effort prediction. J Syst Softw 100(C):80–90, DOI 10.1016/j.jss.2014.10.032
  • Molokken and Jorgensen (2003a) Molokken K, Jorgensen M (2003a) A review of software surveys on software effort estimation. In: 2003 International Symposium on Empirical Software Engineering, 2003. ISESE 2003. Proceedings., pp 223–230, DOI 10.1109/ISESE.2003.1237981
  • Molokken and Jorgensen (2003b) Molokken K, Jorgensen M (2003b) A review of software surveys on software effort estimation. In: Empirical Software Engineering, 2003. ISESE 2003. Proceedings. 2003 International Symposium on, IEEE, pp 223–230
  • Moser et al. (2008) Moser R, Pedrycz W, Succi G (2008) A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction. In: 30th ICSE, DOI 10.1145/1368088.1368114
  • Nair et al. (2018a) Nair V, Agrawal A, Chen J, Fu W, Mathew G, Menzies T, Minku LL, Wagner M, Yu Z (2018a) Data-driven search-based software engineering. In: MSR
  • Nair et al. (2018b) Nair V, Yu Z, Menzies T, Siegmund N, Apel S (2018b) Finding faster configurations using flash. IEEE Transactions on Software Engineering pp 1–1, DOI 10.1109/TSE.2018.2870895
  • Nam and Kim (2015) Nam J, Kim S (2015) Heterogeneous defect prediction. In: 10th FSE, ESEC/FSE 2015, DOI 10.1145/2786805.2786814
  • Neter et al. (1996) Neter J, Kutner MH, Nachtsheim CJ, Wasserman W (1996) Applied linear statistical models, vol 4. Irwin Chicago
  • Olsson (2009) Olsson F (2009) A literature survey of active machine learning in the context of natural language processing
  • Pedregosa et al. (2011) Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J (2011) Scikit-learn: Machine learning in python. JMLR 12(Oct):2825–2830
  • Peters et al. (2015) Peters T, Menzies T, Layman L (2015) Lace2: Better privacy-preserving data sharing for cross project defect prediction. In: ICSE, vol 1, pp 801–811, DOI 10.1109/ICSE.2015.92
  • Port and Korte (2008) Port D, Korte M (2008) Comparative studies of the model evaluation criterion mmre and pred in software cost estimation research. In: ESEM’08, pp 51–60
  • Quinlan (1992) Quinlan JR (1992) Learning with continuous classes. In: 5th Australian joint conference on artificial intelligence, Singapore, vol 92, pp 343–348
  • Rao et al. (2014) Rao GS, Krishna CVP, Rao KR (2014) Multi objective particle swarm optimization for software cost estimation. In: ICT and Critical Infrastructure: Proceedings of the 48th Annual Convention of Computer Society of India-Vol I, Springer, pp 125–132
  • Roman (2016) Roman K (2016) Federal government’s canada.ca project ‘off the rails’. https://www.cbc.ca/news/politics/canadaca-federal-website-delays-1.3893254
  • Sarro and Petrozziello (2018) Sarro F, Petrozziello A (2018) Linear programming as a baseline for software effort estimation. ACM Transactions on Software Engineering and Methodology (TOSEM), to appear
  • Sarro et al. (2016) Sarro F, Petrozziello A, Harman M (2016) Multi-objective software effort estimation. In: ICSE, ACM, pp 619–630
  • Shepperd (2007) Shepperd M (2007) Software project economics: a roadmap. In: 2007 Future of Software Engineering, IEEE Computer Society, pp 304–315
  • Shepperd and MacDonell (2012) Shepperd M, MacDonell S (2012) Evaluating prediction systems in software project estimation. IST 54(8):820–827
  • Shepperd and Schofield (1997) Shepperd M, Schofield C (1997) Estimating software project effort using analogies. TSE 23(11):736–743
  • Shepperd et al. (2000) Shepperd M, Cartwright M, Kadoda G (2000) On building prediction systems for software engineers. EMSE 5(3):175–182
  • Singh and Misra (2012) Singh BK, Misra A (2012) Software effort estimation by genetic algorithm tuned parameters of modified constructive cost model for nasa software projects. International Journal of Computer Applications 59(9)
  • Sommerville (2010) Sommerville I (2010) Software engineering. Addison-Wesley
  • Stensrud et al. (2003) Stensrud E, Foss T, Kitchenham B, Myrtveit I (2003) A further empirical investigation of the relationship of mre and project size. ESE 8(2):139–161
  • Storn and Price (1997) Storn R, Price K (1997) Differential evolution–a simple and efficient heuristic for global optimization over continuous spaces. JoGO 11(4):341–359
  • Tan et al. (2015) Tan M, Tan L, Dara S (2015) Online defect prediction for imbalanced data. In: ICSE
  • Tantithamthavorn et al. (2016) Tantithamthavorn C, McIntosh S, Hassan AE, Matsumoto K (2016) Automated parameter optimization of classification techniques for defect prediction models. In: 38th ICSE, DOI 10.1145/2884781.2884857
  • Tantithamthavorn et al. (2018) Tantithamthavorn C, McIntosh S, Hassan AE, Matsumoto K (2018) The impact of automated parameter optimization on defect prediction models. IEEE Transactions on Software Engineering pp 1–1, DOI 10.1109/TSE.2018.2794977
  • Trendowicz and Jeffery (2014) Trendowicz A, Jeffery R (2014) Software project effort estimation: Foundations and Best Practice Guidelines for Success. Constructive Cost Model–COCOMO, pp 277–293
  • Walkerden and Jeffery (1999) Walkerden F, Jeffery R (1999) An empirical study of analogy-based software effort estimation. ESE 4(2):135–158
  • Whigham et al. (2015) Whigham PA, Owen CA, Macdonell SG (2015) A baseline model for software effort estimation. TOSEM 24(3):20:1–20:11, DOI 10.1145/2738037