Why Software Effort Estimation Needs SBSE

04/02/2018 ∙ by Tianpei Xia, et al. ∙ NC State University

Industrial practitioners now face a bewildering array of possible configurations for effort estimation. How to select the best one for a particular dataset? This paper introduces OIL (short for optimized learning), a novel configuration tool for effort estimation based on differential evolution. When tested on 945 software projects, OIL significantly improved effort estimations after exploring just a few dozen configurations. Further, OIL's results are far better than two methods in widespread use: estimation-via-analogy and a recent state-of-the-art baseline published at TOSEM'15 by Whigham et al. Given that the computational cost of this approach is so low, and the observed improvements are so large, we conclude that SBSE should be a standard component of software effort estimation.




1 Introduction

This paper reports an extraordinarily successful experiment in applying SBSE to a very common software engineering problem; i.e., effort estimation. There are many effort estimation methods discussed in the literature; e.g.,

  • Jörgensen & Shepperd report over 250 papers proposing new methods for project size or effort estimation [23].

  • We list below 6,000+ methods for analogy-based effort estimation.

With so many available methods, it is now a matter of some debate about which one is best for a new data set. To simplify that task, Whigham et al. recently proposed at TOSEM’15 a “baseline model for software effort estimation” called ATLM [57]. They recommend ATLM since, they claim, “it performs well over a range of different project types and requires no parameter tuning”. Note that “no parameter tuning” is an attractive property since tuning can be very slow, particularly when using evolutionary genetic algorithms (GAs). For example, the default recommendations for GAs suggest a large number of evaluations [18], which can take some time to terminate. Sarro et al. [48] report that their evolutionary system for effort estimation mutated 100 individuals for 250 generations. While they do not report their runtimes, we estimate that their methods would require 34 to 345 hours of CPU to terminate (assuming 100*250 evaluations, 0.5 to 5 seconds to evaluate one mutation, and 10-way cross-validation).
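The runtime estimate above is simple arithmetic, and a short Python sketch makes the assumptions explicit. The variable names are illustrative; the inputs are the ones stated with the estimate (100 individuals, 250 generations, 10-way cross-validation, 0.5 to 5 seconds per evaluation). The result comes out at roughly 35 to 347 CPU hours, in line with the quoted range.

```python
# Back-of-the-envelope CPU cost of a GA-based effort estimator.
# All inputs are the assumptions stated alongside the estimate.
individuals = 100
generations = 250
folds = 10                    # 10-way cross-validation
secs_per_eval = (0.5, 5.0)    # assumed cost of evaluating one mutation

evals = individuals * generations * folds


def cpu_hours(seconds_per_eval):
    """Total CPU hours for all evaluations at a given per-evaluation cost."""
    return evals * seconds_per_eval / 3600


low, high = (cpu_hours(s) for s in secs_per_eval)
print(f"{evals} evaluations: {low:.1f} to {high:.1f} CPU hours")
```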

In practice, commissioning an effort estimator on new data takes even more time than stated above. Wolpert’s no-free-lunch theorems [58] warn that, for machine learning, no single method works best on all data sets. Hence, when building effort estimators for a new data set, some commissioning process is required that tries a range of different algorithms. This is not a mere theoretical concern: researchers report that the “best” effort estimator for different data sets varies enormously [38, 40, 29].

Given such long runtimes, we have found it challenging to make SBSE attractive to the broader community of standard developers and business users. To address that challenge, it would be useful to have an example where SBSE can commission a specific effort estimator for a specific data set, in just a few minutes on a standard laptop.

This paper offers such an example. We present a surprising and fortunate result that a very “CPU-lite” SBSE method can commission an effort estimator that significantly out-performs standard effort estimation methods. Here, by “out-perform” we mean that:

  • Our estimates have statistically much smaller errors than standard methods;

  • The commissioning time for that estimator is very fast: median runtime for our ten-way cross-vals is just six minutes on a standard 8GB, 3GHz desktop machine.

Note that our approach is very different to much of the prior research on effort estimation and evolutionary algorithms [5, 10, 12, 13, 33, 47, 7, 49, 48, 39]. Firstly, that work assumed a “CPU-heavy” approach whereas we seek a “CPU-lite” method. Secondly, we do not defend one particular estimator; instead, our commissioning process selects a different estimator for each data set after exploring thousands of possibilities.

The rest of this paper is structured as follows. The next section describes effort estimation. We then introduce OIL (short for optimized learning), a CPU-lite search-based SE method based on differential evolution [55]. This is followed by an empirical study where estimates for 945 software projects are generated using a variety of methods including OIL. The results from that study let us comment on three research questions:

  • RQ1: Can effort estimation ignore SBSE? That is, is tuning avoidable since just a few options are typically “best”? We will find that the “best” effort estimation method is highly variable. That is, tools like OIL are important for ensuring that the right estimators are being applied to the current data set.

  • RQ2: Pragmatically speaking, is SBSE too hard to apply to effort estimation? As shown below, a few dozen evaluations of OIL are enough to explore configuration options for effort estimation. That is, it is hardly arduous to apply SBSE to effort estimation. Even on a standard single core machine, the median time to explore all those options is just a few minutes.

  • RQ3: Does SBSE estimate better than widely-used effort estimation methods? As shown below, the estimations from OIL perform much better than standard effort estimation methods, including ATLM.

2 Background

2.1 Why Explore Software Effort Estimation?

Software effort estimation is the process of predicting the most realistic amount of human effort (usually expressed in terms of hours, days or months of human work) required to plan, design and develop a software project based on the information collected in previous related software projects. With one or more wrong factors, the effort estimates could be inaccurate, which affects the funds allocated to the project [24]. Inadequate or excessive funds for a project could cause a considerable waste of resources and time. For example, NASA canceled its incomplete Check-out Launch Control System project after the initial $200M estimate was exceeded by another $200M [9]. It is critical to generate effort estimations with good accuracy, if for no other reason than that many government organizations demand that the budgets allocated to large publicly funded projects be double-checked by some estimation model [37].

Effort estimation can be divided into human-based techniques and model-based techniques [28, 50]. Human-based techniques [20] can be hard to audit or dispute (e.g., when the estimate is generated by a senior colleague but disputed by others). Also, empirically, it is known that humans rarely update their estimation knowledge based on feedback from new projects [22].

Model-based methods are preferred when estimates have to be audited or debated (since the method is explicit and available for inspection). Even advocates of human-based methods [21] acknowledge that model-based methods are useful for learning the uncertainty about particular estimates; e.g., by running those models many times, each time applying small mutations to the input data.

Note that this paper focuses on estimation-via-analogy and there are many other ways to perform effort estimation. We choose not to explore parametric estimation [37] since that approach demands that the data be expressed in exactly the same terms as the parametric models (e.g., COCOMO). This can be a major limitation to parametric models; for example, none of the data sets used in this paper are expressed in terms of the vocabulary used by standard parametric models. As to CPU-heavy methods (e.g., ensembles [29] or standard genetic algorithms for effort estimation [5, 10, 12, 13, 33, 47, 7, 49, 48, 39]), the message of this paper is that CPU-lite methods (e.g., just 40 evaluations within DE) can be surprisingly effective. Hence, we do not explore CPU-heavy methods, at least for now. It would be interesting in future work to check if, for example, CPU-heavy ensembles or genetic algorithms are out-performed by the CPU-lite methods of this paper.

2.2 Analogy-based Estimation (ABE)

Analogy-based Estimation (ABE) was explored by Shepperd and Schofield in 1997 [53]. It is widely-used [44, 30, 19, 27, 37], in many forms. We say that “ABE0” is the standard form seen in the literature and “ABEN” are the 6,000+ variants of ABE defined below. The general form of ABE (which applies to ABE0 or ABEN) is:

  • Form a table of rows of past projects. The columns of this table are composed of independent variables (the features that define projects) and one dependent variable (project effort).

  • Find training subsets. Decide on what similar projects (analogies) to use from the training set when examining a new test instance.

  • For each test instance, select analogies out of the training set.

    • While selecting analogies, use a similarity measure.

    • Before calculating similarity, normalize numerics min..max to 0..1 (so all numerics get equal chance to influence the dependent).

    • Use feature weighting to reduce the influence of less informative features.

  • Use an adaption strategy to return some combination of the dependent effort values seen in the nearest analogies.

To measure similarity between examples $x$ and $y$, ABE uses $\sqrt{\sum_{i=1}^{n} w_i (x_i - y_i)^2}$ where $i$ ranges over all the independent variables. In this equation, $w_i$ corresponds to feature weights applied to independent features. For ABE0, we use a uniform weighting, therefore $w_i = 1$. Also, the adaptation strategy for ABE0 is to return the effort values of the nearest analogies. The rest of this section describes 6,000+ variants of ABE that we call ABEN. Note that we do not claim that the following represents all possible ways to perform analogy-based estimation. Rather, we merely say that (a) all the following are common variations of ABE0, seen in recent research publications [28]; and (b) anyone with knowledge of the current effort estimation literature would be tempted to try some of the following.
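The general ABE recipe (normalize, measure weighted Euclidean distance, pick the nearest analogies, combine their efforts) can be sketched in a few lines of Python. This is a minimal illustration rather than OIL's implementation: the function name `abe_estimate`, the default `k=1`, and the use of the median to combine analogies are assumptions made for the example.

```python
import math
from statistics import median


def abe_estimate(train_x, train_y, query, k=1, weights=None):
    """Estimate effort for `query` from its k nearest training projects."""
    w = weights or [1.0] * len(query)      # uniform weights, as in ABE0
    cols = list(zip(*train_x))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]

    def norm(row):
        # Rescale each numeric feature min..max to 0..1.
        return [(v - l) / (h - l) if h > l else 0.0
                for v, l, h in zip(row, lo, hi)]

    q = norm(query)

    def dist(row):
        # Weighted Euclidean distance between a training row and the query.
        return math.sqrt(sum(wi * (a - b) ** 2
                             for wi, a, b in zip(w, norm(row), q)))

    nearest = sorted(range(len(train_x)), key=lambda i: dist(train_x[i]))[:k]
    # Assumed adaptation: median effort of the selected analogies.
    return median(train_y[i] for i in nearest)


# Toy data (hypothetical): two features per project, effort as the target.
train_x = [[10, 1], [20, 2], [100, 9]]
train_y = [50, 90, 400]
print(abe_estimate(train_x, train_y, [22, 2], k=1))
```

Swapping the weights, the distance function, the value of `k`, or the final aggregation yields the kind of variants enumerated in the rest of this section.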

Two ways to find training subsets: (1) Remove nothing: Usually, effort estimators use all training projects [6]. Our ABE0 uses this variant; (2) Outlier methods: prune training projects with (say) suspiciously large values [25]. Typically, this removes a small percentage of the training data.

Eight ways to make feature weighting: Li et al. [34] and Hall and Holmes [17] review eight different feature weighting schemes. Li et al. use a genetic algorithm to learn useful feature weights. Hall and Holmes review a variety of methods ranging from WRAPPER to various filter methods, including their preferred correlation-based method. Note that their methods assume symbolic, not numeric, dependent variables. Hence, to apply these methods we add a discretized classes column, using (max-min)/10. Technical aside: when we compute the error measures (see below), we use the raw numeric dependent values.

Three ways to discretize (summarize numeric ranges into a few bins): Some feature weighting schemes require an initial discretization of continuous columns. There are many discretization policies in the literature, including: (1) equal frequency, (2) equal width, (3) do nothing.
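To make the difference between the first two policies concrete, here is a small Python sketch. The function names and the default of 10 bins are assumptions for illustration; the third policy ("do nothing") simply leaves the column untouched.

```python
def equal_width(values, bins=10):
    """Bin index per value; every bin spans (max - min) / bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0   # guard against a constant column
    return [min(int((v - lo) / width), bins - 1) for v in values]


def equal_frequency(values, bins=10):
    """Bin index per value; every bin holds roughly the same number of rows."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, i in enumerate(order):
        out[i] = rank * bins // len(values)
    return out


column = [1, 2, 3, 4, 100]            # toy data with one extreme value
print(equal_width(column, bins=2))
print(equal_frequency(column, bins=2))
```

With a skewed column such as the toy one above, equal width puts almost everything into one bin, while equal frequency spreads rows across bins regardless of the extreme value.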

Six ways to choose similarity measurements: Mendes et al. [35] discuss three similarity measures, including the weighted Euclidean measure described above, an unweighted variant (where $w_i = 1$), and a “maximum distance” measure that focuses on the single feature that maximizes interproject distance. Frank et al. [15] offer a fourth similarity measure that uses a triangular distribution that sets the weight to zero after the distance is more than “k” neighbors away from the test instance. A fifth and sixth similarity measure are the Minkowski distance measure used in [3] and the mean value of the ranking of each project feature used in [56].

Four ways for adaptation mechanisms: (1) median effort value, (2) mean dependent value, (3) summarize the adaptations via a second learner (e.g., linear regression [34, 36, 4, 46]), (4) weighted mean [35].

Six ways to select analogies: Kocaguneli et al. [28] say analogy selectors are fixed or dynamic. Fixed methods use $k \in \{1, 2, 3, 4, 5\}$ nearest neighbors while dynamic methods use the training set to find which $k$ is best for examples.

Figure 1: OIL’s feature model of the space of machine learning options for ABEN. In this model, subset selection, similarity, adaptation mechanism and analogy selection are the mandatory features, while the feature weighting and discretization features are optional. To avoid making the graph too complex, some cross-tree constraints are not presented.

2.3 OIL

As shown above, ABEN has $2 \times 8 \times 3 \times 6 \times 4 \times 6 = 6{,}912$ variants. Some can be ignored; e.g., at $k=1$, all adaptation mechanisms return the same result, so they can be ignored. Also, not all feature weighting techniques use discretization. But even after those discards, there are still thousands of possibilities to explore.
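The size of the raw ABEN space follows directly from the option counts listed in Section 2.2. The short Python check below multiplies them out; the dictionary keys are just labels chosen for the example.

```python
from math import prod

# Number of choices per ABEN decision, as listed in Section 2.2.
options = {
    "subset selection": 2,
    "feature weighting": 8,
    "discretization": 3,
    "similarity": 6,
    "adaptation": 4,
    "analogy selection": 6,
}

total = prod(options.values())
print(total)
```

This counts every combination before any discards, so it is an upper bound on the distinct estimators that need to be considered.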

OIL is our controller for exploring these possibilities. Initially, our plan was to use standard hyper-parameter tuning for this task. Then we learned that (a) standard data mining toolkits like scikit-learn lack some of the ABEN variants; and (b) standard hyper-parameter tuners can be slow (sklearn recommends a default runtime of 24 hours [1]). Hence, we built OIL, implemented as a layered architecture:

  • At the lowest library layer, OIL uses Python Scikit-Learn [43].

  • Above that, there is a utilities layer containing all the algorithms missing in Scikit-Learn (e.g., ABEN required numerous additions at the utilities layer).

  • Higher up, OIL’s modelling layer uses an XML-based domain-specific language to specify a feature map of data mining options. These feature models are single-parent and-or graphs with (optionally) cross-tree constraints showing what options require or exclude other options. A graphical representation of the feature model used in this paper is shown in Figure 1.

  • Finally, at the top-most optimizer layer, there is some evolutionary optimizer that makes decisions across the feature map. An automatic mapper facility then links those decisions down to the lower layers to run the selected algorithms.

For this study, we optimize using the differential evolution method (DE [55]), shown in Figure 2. DE was selected since certain recent software analytics papers have reported that DE can be effective for text mining [2] and defect prediction [16]. We initially planned a more extensive evaluation with other optimizers, but encountered problems accessing reference implementations (e.g., there is no reproduction package available for the Sarro et al. system [48] at their home page http://www0.cs.ucl.ac.uk/staff/F.Sarro/projects/CoGEE/). In any case, the results with DE were so promising that we deferred the application of other optimizers to future work.

INPUT:

  • A dataset, as described in Table 2;

  • A tuning goal $G$; e.g., MRE or SA;

  • DE parameters: population size $np$, number of generations $gen$ = 2 or 8, extrapolation factor $f$, crossover probability $cr$ (selected using advice from [55]).

OUTPUT: Best tunings for learners (e.g., ABEN) found by DE

  • Separate the data into train and tune;

  • Generate $np$ tunings as the initial population;

  • Score each tuning in the population with goal $G$;

  • For $i = 1$ to $np$ do

    1. Generate a mutant $m$ by extrapolating 3 members of the population $a$, $b$, $c$ at probability $cr$. For decision $k$:

      • $m_k = a_k + f \times (b_k - c_k)$ (continuous values).

      • a corresponding rule for discrete values.

    2. Build a learner with parameters $m$ and train data;

    3. Score $m$ on tune data using $G$;

    4. Replace $p_i$ with $m$ if $m$ is preferred to $p_i$;

  • Repeat the last step until the number of generations $gen$ is reached;

  • Return the last population as the final result.

Figure 2: OIL uses Storn’s differential evolution method [55].

DE evolves a new generation of candidates from a current population of size $np$. Each candidate solution for effort estimation is a pair of (Tunings, Scores) where Tunings are selected from the above options for ABEN; and Scores come from training a learner using those parameters and applying it to test data.

The premise of DE is that the best way to mutate the existing tunings is to extrapolate between current solutions. Three solutions $a$, $b$, $c$ are selected at random. For each tuning parameter $k$, at some probability $cr$, we replace the old tuning with a new value. For numerics, the new value is $a_k + f \times (b_k - c_k)$, where $f$ is a parameter controlling cross-over; booleans use a corresponding discrete rule. The main loop of DE runs over the population, replacing old items with new candidates (if the new candidate is better). This means that, as the loop progresses, the population is full of increasingly more valuable solutions (which, in turn, helps extrapolation).

As to the control parameters of DE (population size, extrapolation factor, and crossover probability), we set them using advice from Storn [55]. The number of generations was set as follows. A small number (2) was used to test the effects of a very CPU-lite SBSE effort estimator. A larger number (8) was used to check if anything was lost by restricting the inference to just two generations.
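The loop of Figure 2 can be sketched compactly in Python. This is an illustration under stated assumptions, not OIL's code: the defaults `f=0.75` and `cr=0.3` are placeholder values rather than settings taken from the text, `np_=20` is chosen so that two generations produce 40 new candidates, and the rule for discrete decisions (copy the value from one of the three selected members) is an assumption. The sketch minimizes `score`; it also scores the initial population, so its total evaluation count is `np_` higher than the number of mutants it builds.

```python
import random


def differential_evolution(score, space, np_=20, gens=2, f=0.75, cr=0.3, seed=1):
    """Minimize score(tuning). `space` maps a decision name to either a
    (lo, hi) tuple for numeric decisions or a list of discrete options."""
    rng = random.Random(seed)

    def any_value(spec):
        return rng.uniform(*spec) if isinstance(spec, tuple) else rng.choice(spec)

    pop = [{k: any_value(spec) for k, spec in space.items()} for _ in range(np_)]
    scores = [score(p) for p in pop]
    for _ in range(gens):
        for i in range(np_):
            # Pick three other members to extrapolate from.
            a, b, c = rng.sample([p for j, p in enumerate(pop) if j != i], 3)
            mutant = dict(pop[i])
            for k, spec in space.items():
                if rng.random() < cr:
                    if isinstance(spec, tuple):
                        lo, hi = spec
                        mutant[k] = min(hi, max(lo, a[k] + f * (b[k] - c[k])))
                    else:
                        # Assumed discrete rule: inherit from a, b or c.
                        mutant[k] = rng.choice([a[k], b[k], c[k]])
            s = score(mutant)
            if s < scores[i]:             # keep the mutant only if it is better
                pop[i], scores[i] = mutant, s
    best = min(range(np_), key=scores.__getitem__)
    return pop[best], scores[best]


# Toy usage with a hypothetical objective standing in for MRE on tune data.
space = {"k": [1, 2, 3, 4, 5], "w": (0.0, 1.0)}
score = lambda t: abs(t["k"] - 3) + (t["w"] - 0.5) ** 2
print(differential_evolution(score, space, gens=2))
```

Because a mutant replaces its parent only when it scores better, the best score in the population can never get worse from one generation to the next.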

3 Empirical Study

To assess OIL, we applied it to the 945 projects seen in nine datasets from the SEACRAFT repository (http://tiny.cc/seacraft); see Table 1 and Table 2. This data was used since it has been widely used in previous estimation research. Also, it is quite diverse since it differs for: observation number (from 15 to 499 projects); number and type of features (from 6 to 25 features, including a variety of features describing the software projects, such as number of developers involved in the project and their experience, technologies used, size in terms of Function Points, etc.); technical characteristics (software projects developed in different programming languages and for different application domains, ranging from telecommunications to commercial information systems); geographical locations (software projects coming from China, Canada, Finland).

Dataset      Projects  Features
kemerer            15         6
albrecht           24         7
isbsg10            37        11
finnish            38         7
miyazaki           48         7
maxwell            62        25
desharnais         77         6
kitchenham        145         6
china             499        18
total             945
Table 1: Data in this study. For details on the features, see Table 2.

OIL collects information on two performance metrics: magnitude of the relative error (MRE) [8] and Standardized Accuracy (SA). We make no comment on which measure is better– these were selected since they are widely used in the literature.

feature min max mean std
kemerer Langu. 1 3 1.2 0.6
Hdware 1 6 2.3 1.7
Duration 5 31 14.3 7.5
KSLOC 39 450 186.6 136.8
AdjFP 100 2307 999.1 589.6
RAWFP 97 2284 993.9 597.4
Effort 23 1107 219.2 263.1
albrecht Input 7 193 40.2 36.9
Output 12 150 47.2 35.2
Inquiry 0 75 16.9 19.3
File 3 60 17.4 15.5
FPAdj 1 1 1.0 0.1
RawFPs 190 1902 638.5 452.7
AdjFP 199 1902 647.6 488.0
Effort 0 105 21.9 28.4
isbsg10 UFP 1 2 1.2 0.4
IS 1 10 3.2 3.0
DP 1 5 2.6 1.1
LT 1 3 1.6 0.8
PPL 1 14 5.1 4.1
CA 1 2 1.1 0.3
FS 44 1371 343.8 304.2
RS 1 4 1.7 0.9
FPS 1 5 3.5 0.7
Effort 87 14453 2959 3518
finnish hw 1 3 1.3 0.6
at 1 5 2.2 1.5
FP 65 1814 763.6 510.8
co 2 10 6.3 2.7
prod 1 29 10.1 7.1
lnsize 4 8 6.4 0.8
lneff 6 10 8.4 1.2
Effort 460 26670 7678 7135
feature min max mean std
miyazaki KLOC 7 390 63.4 71.9
SCRN 0 150 28.4 30.4
FORM 0 76 20.9 18.1
FILE 2 100 27.7 20.4
ESCRN 0 2113 473.0 514.3
EFORM 0 1566 447.1 389.6
EFILE 57 3800 936.6 709.4
Effort 6 340 55.6 60.1
maxwell App 1 5 2.4 1.0
Har 1 5 2.6 1.0
Dba 0 4 1.0 0.4
Ifc 1 2 1.9 0.2
Source 1 2 1.9 0.3
Telon. 0 1 0.2 0.4
Nlan 1 4 2.5 1.0
T01 1 5 3.0 1.0
T02 1 5 3.0 0.7
T03 2 5 3.0 0.9
T04 2 5 3.2 0.7
T05 1 5 3.0 0.7
T06 1 4 2.9 0.7
T07 1 5 3.2 0.9
T08 2 5 3.8 1.0
T09 2 5 4.1 0.7
T10 2 5 3.6 0.9
T11 2 5 3.4 1.0
T12 2 5 3.8 0.7
T13 1 5 3.1 1.0
T14 1 5 3.3 1.0
Dura. 4 54 17.2 10.7
Size 48 3643 673.3 784.1
Time 1 9 5.6 2.1
Effort 583 63694 8223 10500
feature min max mean std
desharnais TeamExp 0 4 2.3 1.3
MngExp 0 7 2.6 1.5
Length 1 36 11.3 6.8
Trans.s 9 886 177.5 146.1
Entities 7 387 120.5 86.1
AdjPts 73 1127 298.0 182.3
Effort 546 23940 4834 4188
kitchenham code 1 6 2.1 0.9
type 0 6 2.4 0.9
duration 37 946 206.4 134.1
fun_pts 15 18137 527.7 1522
estimate 121 79870 2856 6789
esti_mtd 1 5 2.5 0.9
Effort 219 113930 3113 9598
china ID 1 499 250.0 144.2
AFP 9 17518 486.9 1059
Input 0 9404 167.1 486.3
Output 0 2455 113.6 221.3
Enquiry 0 952 61.6 105.4
File 0 2955 91.2 210.3
Interface 0 1572 24.2 85.0
Added 0 13580 360.4 829.8
Changed 0 5193 85.1 290.9
Deleted 0 2657 12.4 124.2
PDR_A 0 84 11.8 12.1
PDR_U 0 97 12.1 12.8
NPDR_A 0 101 13.3 14.0
NPDU_U 0 108 13.6 14.8
Resource 1 4 1.5 0.8
Dev.Type 0 0 0.0 0.0
Duration 1 84 8.7 7.3
N_effort 31 54620 4278 7071
Effort 26 54620 3921 6481
Table 2: Descriptive Statistics of the Datasets

MRE is defined in terms of AR, the magnitude of the absolute residual. This is computed from the difference between predicted and actual effort values: $AR = |actual_i - predicted_i|$. MRE is the magnitude of the relative error calculated by expressing AR as a ratio of the actual effort value; i.e., $MRE = \frac{|actual_i - predicted_i|}{actual_i}$.

MRE has been criticized [14, 26, 31, 45, 51, 54] as being biased towards error underestimations. Some researchers prefer the use of other (more standardized) measures, such as Standardized Accuracy (SA) [32, 52]. SA is defined in terms of $MAE = \frac{1}{N}\sum_{i=1}^{N} |actual_i - predicted_i|$ where $N$ is the number of projects used for evaluating the performance, and $actual_i$ and $predicted_i$ are the actual and estimated effort, respectively, for the project $i$. SA uses MAE as follows: $SA = \left(1 - \frac{MAE_{P_j}}{MAE_{guess}}\right) \times 100$ where $MAE_{P_j}$ is the MAE of the approach $P_j$ being evaluated and $MAE_{guess}$ is the MAE of a large number (e.g., 1000 runs) of random guesses. The important thing about SA is that, over many runs, $MAE_{guess}$ will converge on simply using the sample mean [52]. SA represents how much better $P_j$ is than random guessing. Values near zero mean that the prediction model is practically useless, performing little better than random guesses [52].

Note that for these evaluation measures:

  • smaller MRE values are better;

  • while larger SA values are better.
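Both measures are easy to compute directly. The Python sketch below is illustrative only: the function names are made up for the example, and the random-guessing baseline (predict each project with the actual effort of another randomly chosen project) is an assumption about how the guesses are drawn.

```python
import random


def mre(actual, predicted):
    """Magnitude of relative error for one project."""
    return abs(actual - predicted) / actual


def mae(actuals, predictions):
    """Mean absolute error over a set of projects."""
    return sum(abs(a - p) for a, p in zip(actuals, predictions)) / len(actuals)


def sa(actuals, predictions, runs=1000, seed=1):
    """Standardized accuracy (in %) relative to random guessing."""
    rng = random.Random(seed)
    n = len(actuals)
    guess_maes = []
    for _ in range(runs):
        # Assumed baseline: guess the effort of some other project.
        guesses = [actuals[rng.choice([j for j in range(n) if j != i])]
                   for i in range(n)]
        guess_maes.append(mae(actuals, guesses))
    mae_guess = sum(guess_maes) / runs
    return (1 - mae(actuals, predictions) / mae_guess) * 100


actuals = [10, 20, 40, 80]            # toy effort values
print(mre(100, 80))
print(sa(actuals, [12, 18, 44, 70]))
```

A perfect predictor scores SA = 100, a predictor no better than guessing scores near 0, and a predictor worse than guessing goes negative.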

It is good practice to benchmark new methods against a variety of different approaches. Accordingly, OIL uses the following algorithms:

  • ABE0 was described above. It is widely used  [44, 30, 19, 27, 37].

  • Automatically Transformed Linear Model (ATLM) is an effort estimation method recently proposed at TOSEM’15 by Whigham et al. [57]. ATLM is a multiple linear regression model which calculates the effort as $effort_i = \beta_0 + \sum_{j} \beta_j x_{ij} + \varepsilon_i$, where $effort_i$ is the response for project $i$, and $x_{ij}$ are explanatory variables. The prediction weights $\beta_j$ are determined using a least square error estimation [42]. Recall from the introduction that Whigham et al. recommend ATLM since, they say, it performs well on a range of different project types and needs no parameter tuning.

  • Differential Evolution (DE) was described above. Recall we have two versions of DE: DE2 and DE8 run for two and eight generations and terminate after evaluating 40 and 160 configurations (respectively).

  • Random Choice (RD). It is good practice to baseline stochastic optimizers like DE against some random search [41]. Accordingly, until it finds $N$ valid configurations, RD selects leaves at random from Figure 1. All these variants are executed and the best one is selected for application to the test set. To maintain parity with DE2 and DE8, OIL uses $N \in \{40, 160\}$ (which we denote RD40 and RD160).
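The random baseline in the last item is simple enough to sketch in Python. The option names below are copied from the column labels of Table 3 (reading the duplicated "K=2" as "K=3"); the `is_valid` constraint is hypothetical, standing in for the cross-tree constraints of Figure 1, and `score` is whatever error measure is being minimized.

```python
import random

SPACE = {
    "subset": ["Rm nothing", "Outlier"],
    "weighting": ["Remain same", "Genetic", "Gain rank", "Relief",
                  "PCA", "CFS", "CNS", "WRP"],
    "discretization": ["No discrete", "Equal freq.", "Equal width"],
    "similarity": ["Euclidean", "Weight Euclid.", "Max measure",
                   "Local likelihood", "Minkowski", "Feature mean"],
    "adaptation": ["Median", "Mean", "Second learner", "Weighted Mean"],
    "analogies": ["K=1", "K=2", "K=3", "K=4", "K=5", "Dynamic"],
}


def is_valid(tuning):
    """Hypothetical constraint: no discretization without feature weighting."""
    return not (tuning["weighting"] == "Remain same"
                and tuning["discretization"] != "No discrete")


def random_search(score, n=40, seed=1):
    """Draw random configurations until n valid ones exist; return the best."""
    rng = random.Random(seed)
    valid = []
    while len(valid) < n:
        tuning = {name: rng.choice(opts) for name, opts in SPACE.items()}
        if is_valid(tuning):
            valid.append(tuning)
    return min(valid, key=score)


# Toy objective (hypothetical): prefer fewer neighbours.
toy = lambda t: SPACE["analogies"].index(t["analogies"])
print(random_search(toy, n=40))
```

Calling `random_search(score, n=40)` and `random_search(score, n=160)` would correspond to the RD40 and RD160 budgets.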

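[Table 3 is a heat map with one column per ABEN option; the individual cell shades are not recoverable here. The options per column group are:]

Subset:      Rm nothing, Outlier
Weighting:   Remain same, Genetic, Gain rank, Relief, PCA, CFS, CNS, WRP
Discret.:    No discrete, Equal freq., Equal width
Similarity:  Euclidean, Weight Euclid., Max measure, Local likelihood, Minkowski, Feature mean
Adaption:    Median, Mean, Second learner, Weighted Mean
Analogies:   K=1, K=2, K=3, K=4, K=5, Dynamic

KEY (cell shade): 10, 20, 30, 40, 50, 60, 70, 80, 90, 100%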

Table 3: In twenty runs of DE2, how often was each configuration selected? Cells with white text denote an option selected half the time, or more. Such cells are rare.

OIL performs an $N$-fold cross validation for each of (ABE0, ATLM, DE2, DE8, RD40, RD160), for each of our nine data sets. To apply this, datasets are partitioned into $N$ sets (the observations were sampled uniformly at random, without replacement), and then for each set OIL considered it as a testing set and the remaining observations as training set. For datasets kemerer, albrecht, isbsg10 and finnish, we use three-fold cross validation since they have fewer than 40 instances. For the other larger datasets miyazaki, maxwell, desharnais, kitchenham and china, we use ten-fold.

Since our folds are selected in a stochastic manner, we repeat the cross-vals 20 times, each time with different random seeds.
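The experimental rig described above (three or ten folds depending on data set size, repeated 20 times with different seeds) can be sketched in Python as follows. The function names are illustrative, and using the repeat index as the random seed is an assumption.

```python
import random


def folds_for(n_rows):
    """Three folds for data sets with fewer than 40 projects, else ten."""
    return 3 if n_rows < 40 else 10


def repeated_cross_val(n_rows, repeats=20):
    """Yield (seed, fold, train_indices, test_indices) for every split."""
    n = folds_for(n_rows)
    for seed in range(repeats):
        rng = random.Random(seed)        # a different seed for each repeat
        idx = list(range(n_rows))
        rng.shuffle(idx)                 # sample without replacement
        for fold in range(n):
            test = idx[fold::n]
            held_out = set(test)
            train = [i for i in idx if i not in held_out]
            yield seed, fold, train, test


# Example: kemerer has 15 projects, so it gets 20 x 3 = 60 splits.
print(sum(1 for _ in repeated_cross_val(15)))
```

Each estimator would then be trained on `train` and scored on `test` for every yielded split.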

4 Results

These results are divided into answers for the research questions introduced above.

RQ1: Can effort estimation ignore SBSE? That is, is tuning avoidable since just a few options are typically “best”?

Table 3 shows why SBSE is an essential component for effort estimation. This table shows how often different options were selected by the best optimizer seen in this study. Note that very rarely is one option selected most of the time (exception: clearly our outlier operator is not very good; this should be explored further in future work). From this table, it is clear that the best configuration is not only data set specific, but also specific to the training set used within a data set. This means that RQ1=no and tools like OIL are very important for configuring effort estimation methods.

RQ2: Pragmatically speaking, is SBSE too hard to apply to effort estimation?

As mentioned in the introduction, some SBSE methods can be very slow. While such long runtimes are certainly required in other domains, for configuring effort estimation methods, SBSE can terminate much faster than that. Figure 3 shows the time required to generate our results (on a standard 8GB, 3GHz desktop machine).

 ABE0  ATLM  RD40  RD160  DE2  DE8
kemerer 1 1 3 13 4 10
albrecht 1 1 3 11 4 11
isbsg10 1 1 3 15 4 14
finnish 1 1 4 14 5 14
miyazaki 1 1 5 16 6 16
maxwell 1 1 12 52 18 53
desharnais 1 1 13 54 17 55
kitchenham 1 1 21 80 28 94
china 1 1 57 232 52 243
Figure 3: Mean runtime, cross-validation (minutes), as seen in 20 repeated cross-val experiments.

Note that standard effort estimation methods (i.e., ABE0 and ATLM) run very fast indeed compared to anything else. Hence, pragmatically, it seems tempting to recommend these faster systems. Nevertheless, this paper will recommend somewhat slower methods since, as shown below, these faster methods (i.e., ABE0 and ATLM) result in very poor estimates. The good news from Figure 3 is that cross-validation for the method we will recommend (DE2) takes just a few minutes to terminate. Hence we say that RQ2=no since SBSE can quite quickly commission an effort estimator, tuned specifically to a data set.

a. % MRE (smaller values are better).

Dataset     Rank  Using      Med.  IQR
kemerer     1     DE8          21   32
            1*    DE2          22   27
            1     RANDOM160    24   17
            1*    RANDOM40     26   27
            2     ABE0         60   53
            3     ATLM        154  341  out-of-range
albrecht    1     DE8          19    6
            1*    DE2          21    6
            1     RANDOM160    24   12
            2     RANDOM40     28   16
            3     ABE0         48   34
            4     ATLM         97   76  out-of-range
isbsg10     1     DE8          37   43
            1*    DE2          43   22
            2     RANDOM160    48   21
            2     RANDOM40     56   24
            2     ABE0         72   22
            3     ATLM        138  120  out-of-range
finnish     1     DE2          15   18
            1*    ATLM         18    9
            1     RANDOM160    21   18
            2     DE8          22   30
            2     RANDOM40     24   18
            3     ABE0         37   19
miyazaki    1     DE8          21   33
            1*    DE2          21   31
            1     RANDOM160    23   25
            1*    RANDOM40     31   22
            2     ABE0         56   16
            3     ATLM        147   98  out-of-range
maxwell     1     DE8          28   32
            1*    DE2          28   20
            2     RANDOM160    34   26
            3     RANDOM40     40   19
            3     ABE0         55   26
            4     ATLM        357  322  out-of-range
desharnais  1     DE8          24   28
            1*    DE2          24   20
            1     RANDOM160    28   15
            2     RANDOM40     32   19
            3     ATLM         47   23
            3     ABE0         52   27
kitchenham  1     DE8          18   19
            1*    DE2          18   12
            1     RANDOM160    22   11
            1*    RANDOM40     24   12
            2     ABE0         43   16
            3     ATLM        133   59  out-of-range
china       1     DE8          16   11
            1*    DE2          16    6
            2     RANDOM160    24   14
            2     RANDOM40     27   17
            3     ABE0         44    6
            4     ATLM         57   14

b. % SA (larger values are better).

Dataset     Rank  Using      Med.  IQR
kemerer     1     RANDOM160    61   33
            1*    DE2          54   24
            1*    RANDOM40     53   36
            1     DE8          49   28
            2     ABE0         37   51
            3     ATLM        -46  217  out-of-range
albrecht    1*    DE8          77   20
            2     DE2          69   19
            2     RANDOM160    68   20
            3     RANDOM40     55   21
            3     ABE0         54   38
            4     ATLM         30   50
isbsg10     1     RANDOM160    40   30
            1*    ABE0         33   25
            1     RANDOM40     31   18
            1     DE8          28   24
            1     DE2          26   20
            2     ATLM         10  126  out-of-range
finnish     1*    ATLM         81    6
            1     DE2          81   13
            1     RANDOM160    77   14
            2     DE8          74   43
            2     RANDOM40     73   14
            3     ABE0         54   25
miyazaki    1     RANDOM160    60   33
            1     DE8          57   32
            1*    DE2          57   29
            1*    RANDOM40     55   32
            2     ABE0         36   24
            3     ATLM        -41   85  out-of-range
maxwell     1     DE8          60   26
            1*    DE2          55   34
            1     RANDOM160    52   26
            1*    RANDOM40     50   26
            2     ABE0         41   28
            3     ATLM       -204  247  out-of-range
desharnais  1*    DE2          57   24
            1     DE8          57   21
            2     RANDOM160    54   20
            2     RANDOM40     52   26
            2     ATLM         52   16
            3     ABE0         36   17
kitchenham  1*    RANDOM40     67   20
            1*    DE2          66   17
            1     RANDOM160    66   21
            1     DE8          65   21
            2     ABE0         45   18
            3     ATLM        -39   72  out-of-range
china       1     DE8          82   11
            1*    DE2          78   12
            2     RANDOM160    69   19
            2     RANDOM40     67   27
            3     ABE0         60    4
            4     ATLM         41   12

Figure 4: % MRE and % SA seen in 20 repeats. Med is the 50th percentile and IQR is the inter-quartile range; i.e., 75th-25th percentile. Lines with a dot in the middle show median values with the IQR. MRE and SA results are sorted in different directions since better MRE and SA values are smaller and larger (respectively). The left-hand side columns rank results (the smaller, the better). Ranks separate statistically different results, as computed by a bootstrap test (95% confidence) and the A12 test [57]. “out-of-range” denotes results that are so bad that they fall outside of this figure's range of [0,100]%. 1* denotes rows of faster best-ranked methods.

RQ3: Does SBSE estimate better than widely-used effort estimation methods?

RQ2 showed SBSE for effort estimation is not arduously slow. Another issue is whether or not those SBSE methods lead to better estimates. Figure 4 explores that issue. Black dots show median values from 20 repeats. Horizontal lines show the 25th to 75th percentile of the values.

The most important part of the Figure 4 results are the Rank columns shown on the left-hand side. These ranks cluster together results that are statistically indistinguishable as judged by a conjunction of both a 95% bootstrap significance test [11] and an A12 test for a non-small effect size difference in the distributions [37]. These tests were used since their non-parametric nature avoids issues with non-Gaussian distributions.
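For readers unfamiliar with the effect size half of that conjunction, the following Python sketch computes the Vargha-Delaney A12 statistic. It covers only the effect size check, not the bootstrap significance test, and the `small=0.6` threshold in `non_small_effect` is an illustrative assumption rather than a value taken from the text.

```python
def a12(xs, ys):
    """Vargha-Delaney A12: probability that a value drawn from xs is
    larger than one drawn from ys, counting ties as half."""
    more = sum(1 for x in xs for y in ys if x > y)
    same = sum(1 for x in xs for y in ys if x == y)
    return (more + 0.5 * same) / (len(xs) * len(ys))


def non_small_effect(xs, ys, small=0.6):
    """True if A12 is at least `small` away from 'no difference' (0.5),
    in either direction. The default threshold is an assumption."""
    a = a12(xs, ys)
    return max(a, 1 - a) >= small


print(a12([4, 5, 6], [1, 2, 3]))
print(non_small_effect([1, 2, 3], [1, 2, 3]))
```

Because A12 only compares orderings of values, it makes no assumption about the shape of the two distributions.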

In Figure 4, Rank=1 denotes the better results. When multiple treatments receive top rank, we use the runtimes of Figure 3 to break ties. For example, in the kemerer MRE results, four methods have Rank=1. However, two of these methods (DE2 and RD40) are much faster than the others. Rows denoted Rank=1* show these fastest top-ranked treatments.

(Technical aside: there is no statistically significant difference between the runtimes of RD40 and DE2 in Figure 3, as determined by a 95% bootstrap test. Hence, when assigning the Rank=1*, we say that RD40 runs as fast as DE2.)

From the Rank=1* entries in Figure 4, we make the following comments.

  • In marked contrast to the claims of Whigham et al., ATLM does not have a very good performance. While it does appear as a Rank=1* method in finnish, in all other data sets it performs badly. Indeed, often, its performance falls outside the [0,100]% range shown in Figure 4.

  • Another widely-used method in effort estimation is the ABE0 analogy-based effort estimator. In 15/18 of the Figure 4 results, ABE0 is ranked better than ATLM. That is, if the reader wants to avoid the added complexity of SBSE, they could ignore our advocacy for OIL and instead just use ABE0. That said, ABE0 is only top-ranked in 1/18 of our results. Clearly, there are better methods than ABE0.

  • Random configuration selection performs not too badly. In 6/18 of the Figure 4 results, one of our random methods earns Rank=1*. That said, the random methods are clearly out-performed by just a few dozen evaluations of DE. In 14/18 of these results, DE2 (40 evaluations of DE) earns Rank=1*.

Overall, based on the above points, we would recommend DE2 for commissioning effort estimation to new data sets. In 17/18 of our results, it gets scored Rank=1. To be sure, in 3 of those results, another method ran faster. However, for the sake of implementation simplicity, some researchers might choose to ignore that minority case.

In summary, RQ3=yes since SBSE produces much better effort estimates than widely-used effort estimation methods.

5 Discussion

The natural question that arises from all this is why does SBSE work so well? We see three possibilities: (1) DE is really clever, (2) effort estimation is really simple, or (3) there exists a previously undocumented floor effect in effort estimation.

Regarding DE is clever: DE combines local search (the extrapolation described in Figure 2) with an archive pruning operator (when new candidates supplant older items in the population, then all subsequent mutations use the new and improved candidates). Hence it is wrong to characterize 40 DE evaluations as “just 40 guesses”. Also, there is evidence from other SE domains that DE is indeed a clever way to study SE problems. For example, Fu et al. found that hyper-parameter optimization via a few dozen DE evaluations was enough to produce significantly large improvements in defect prediction [16]. Also, in other work, Agrawal et al. [2] found that a few dozen evaluations of DE were enough to significantly improve the control parameters for the Latent Dirichlet Allocation text mining algorithm.
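
To make those two mechanisms concrete, here is a minimal sketch of a DE loop in the style of Storn and Price. It is not OIL's implementation: it minimizes a toy numeric function instead of configuring estimators, and the names `de` and `sphere` as well as the values chosen for `pop_size`, `f`, `cr`, and `gens` are illustrative assumptions.

```python
import random

def de(score, bounds, pop_size=20, f=0.75, cr=0.3, gens=2, seed=1):
    """Minimise score(x) over the box `bounds` with a basic DE loop."""
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    cost = [score(x) for x in pop]
    for _ in range(gens):
        for i in range(pop_size):
            others = [p for j, p in enumerate(pop) if j != i]
            a, b, c = rng.sample(others, 3)
            forced = rng.randrange(len(bounds))  # always mutate at least one dimension
            new = []
            for d, (lo, hi) in enumerate(bounds):
                if d == forced or rng.random() < cr:
                    v = a[d] + f * (b[d] - c[d])  # extrapolate from three other members
                    v = min(hi, max(lo, v))       # clip back into the box
                else:
                    v = pop[i][d]
                new.append(v)
            new_cost = score(new)
            if new_cost <= cost[i]:
                # The newcomer supplants its parent at once, so every later
                # mutation in this generation can already draw on it.
                pop[i], cost[i] = new, new_cost
    best = min(range(pop_size), key=lambda i: cost[i])
    return pop[best], cost[best]

calls = []
def sphere(x):
    calls.append(1)  # count evaluations
    return sum(v * v for v in x)

best, best_cost = de(sphere, [(-5, 5), (-5, 5)])
print(len(calls), best_cost)
```

In this sketch the evaluation budget is `pop_size * (gens + 1)`, since the initial population is scored once and each generation scores one candidate per member.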

Regarding effort estimation is simple: Perhaps the effective search space of different effort estimators is very small. If effort estimation exhibits a “Many roads lead to Rome” property, then when multiple estimators are applied to the same data sets, many of them will have equivalent performance. For such problems, configuration is not difficult since a few random probes (plus a little guidance with DE) can effectively survey all the important features.

Regarding floor effects: Floor effects exist when a domain contains some inherent performance boundary which cannot be exceeded. Floor effects have many causes; for example, the signal content of a data set may be very limited. For such data sets, once learners reach “the floor”, there is no better place to go. This paper offers two pieces of evidence for floor effects in effort estimation:

  • Recall from the above that our data sets are very small (see Figure 1)– which suggests that effort estimation data has limited signal.

  • Also, one indicator for floor effects is that informed methods perform no better than random search and, to some extent, that indicator was seen in the above results. Recall from the above that while a full random search was out-performed by DE2, sometimes those random searchers performed very well indeed.

Whatever the explanation, the main effect documented by this paper is that a widely used SE technique (effort estimation) can be dramatically improved with SBSE.

6 Threats to Validity

Internal Bias: All our methods contain stochastic random operators. To reduce the bias from random operators, we repeated our experiment 20 times and applied statistical tests to remove spurious distinctions.

Parameter Bias: DE plays an important role in OIL. In this paper, we did not discuss the influence of different DE control parameters; instead, we followed Storn et al.’s configurations [55]. Clearly, tuning such parameters is a direction for future work.

Sampling Bias: While we tested OIL on nine datasets, it would be inappropriate to conclude that OIL tuning always performs better than other methods for all data sets. As researchers, what we can do to mitigate this problem is to carefully document our method, release our code, and encourage the community to try this method on more datasets, as the occasion arises.

7 Conclusion and Future Work

This paper has explored methods for commissioning effort estimation methods. As stated in the introduction, our approach is very different to much of the prior “CPU-heavy” SBSE research on effort estimation and evolutionary algorithms [5, 10, 12, 13, 33, 47, 7, 49, 48, 39]. Firstly, we take a “CPU-lite” approach. Secondly, we do not defend one particular estimator; instead, our commissioning process selects different estimators for different data sets after exploring thousands of options.

Our results show that SBSE is both necessary and simple to apply for effort estimation. Table 3 showed that the “best” estimator varies greatly across effort estimation data. Using “CPU-lite” SBSE methods (specifically, DE), it is possible to very quickly find these best estimators. Further, the effort estimators generated by SBSE out-perform standard methods in widespread use (ABE0 and ATLM). This SBSE process is not an overly burdensome task since, as shown above, it is enough to perform 40 evaluations of different candidates (guided by DE). To be sure, some additional architecture is required for SBSE and effort estimation, but we have packaged that into the OIL system (which, after double blind, we will distribute as a Python pip package).

As to future work, as discussed in several places around this document:

  • This work should be repeated for more datasets.

  • The space of operators we explored within ABEN could be expanded. Clearly, from Table 3, our outliers method is ineffective and should be replaced. There are also other estimation methods that could be explored (not just for ABE, but otherwise).

  • Other DE settings could be explored.

  • It could also be useful to try optimizers other than DE. Specifically, future work could check if (e.g.) CPU-heavy methods such as ensemble methods [29] or Sarro’s genetic algorithms [48] are out-performed by the CPU-lite methods of this paper. That said, it should be noted that this study found no benefit in increasing the number of evaluations from 40 to 160. Hence, possibly, CPU-heavy methods may not result in better estimators.

  • It could be very insightful to explore the floor effects discussed in §5. If these are very common, then that would suggest the whole field of software effort estimation has been needlessly over-complicated.


References

  • [1] Sklearn, manual, 2018.
  • [2] A. Agrawal, W. Fu, and T. Menzies. What is wrong with topic modeling? and how to fix it using search-based software engineering. IST, 2018.
  • [3] L. Angelis and I. Stamelos. A simulation tool for efficient analogy based cost estimation. EMSE, 5(1):35–68, 2000.
  • [4] D. R. Baker. A hybrid approach to expert and model based effort estimation. West Virginia University, 2007.
  • [5] C. J. Burgess and M. Lefley. Can genetic programming improve software effort estimation? a comparative evaluation. IST, 43(14):863–873, 2001.
  • [6] C. L. Chang. Finding prototypes for nearest neighbor classifiers. TC, 100(11), 1974.
  • [7] M. Choetkiertikul, H. K. Dam, T. Tran, T. T. M. Pham, A. Ghose, and T. Menzies. A deep learning model for estimating story points. TSE, PP(99):1–1, 2018.
  • [8] S. D. Conte, H. E. Dunsmore, and V. Y. Shen. Software Engineering Metrics and Models. Benjamin-Cummings Publishing Co., Inc., Redwood City, CA, USA, 1986.
  • [9] K. Cowing. Nasa to shut down checkout & launch control system, 2002.
  • [10] J. J. Dolado. A validation of the component-based method for software size estimation. TSE, 26(10):1006–1021, Oct 2000.
  • [11] B. Efron and J. Tibshirani. An introduction to the bootstrap. Chapman & Hall, 1993.
  • [12] F. Ferrucci, C. Gravino, R. Oliveto, and F. Sarro. Genetic programming for effort estimation: An analysis of the impact of different fitness functions. In SSBSE’10, pages 89–98, 2010.
  • [13] F. Ferrucci, C. Gravino, R. Oliveto, F. Sarro, and E. Mendes. Investigating tabu search for web effort estimation. In SEAA, pages 350–357, Sept 2010.
  • [14] T. Foss, E. Stensrud, B. Kitchenham, and I. Myrtveit. A simulation study of the model evaluation criterion mmre. TSE, 29(11):985–995, 2003.
  • [15] E. Frank, M. Hall, and B. Pfahringer. Locally weighted naive bayes. In 19th conference on Uncertainty in Artificial Intelligence, pages 249–256, 2002.
  • [16] W. Fu, T. Menzies, and X. Shen. Tuning for software analytics: Is it really necessary? IST, 76:135–146, 2016.
  • [17] M. A. Hall and G. Holmes. Benchmarking attribute selection techniques for discrete class data mining. TKDE, 15(6):1437–1447, 2003.
  • [18] R. L. Haupt. Optimum population size and mutation rate for a simple real genetic algorithm that optimizes array factors. In APSIS’00, pages 1034–1037, 2000.
  • [19] J. Hihn and T. Menzies. Data mining methods and cost estimation models: Why is it so hard to infuse new ideas? In ASEW, pages 5–9, Nov 2015.
  • [20] M. Jørgensen. A review of studies on expert estimation of software development effort. JSS, 70(1-2):37–60, 2004.
  • [21] M. Jørgensen. The world is skewed: Ignorance, use, misuse, misunderstandings, and how to improve uncertainty analyses in software development projects, 2015.
  • [22] M. Jørgensen and T. M. Gruschke. The impact of lessons-learned sessions on effort estimation and uncertainty assessments. TSE, 35(3):368–383, 2009.
  • [23] M. Jørgensen and M. Shepperd. A systematic review of software development cost estimation studies. TSE, 33(1), 2007.
  • [24] C. F. Kemerer. An empirical validation of software cost estimation models. CACM, 30(5):416–429, 1987.
  • [25] J. W. Keung, B. A. Kitchenham, and D. R. Jeffery. Analogy-x: Providing statistical inference to analogy-based software cost estimation. TSE, 34(4):471–484, 2008.
  • [26] B. A. Kitchenham, L. M. Pickard, S. G. MacDonell, and M. J. Shepperd. What accuracy statistics really measure. IEEE Software, 148(3):81–85, 2001.
  • [27] E. Kocaguneli and T. Menzies. How to find relevant data for effort estimation? In ESEM, pages 255–264, Sept 2011.
  • [28] E. Kocaguneli, T. Menzies, A. Bener, and J. W. Keung. Exploiting the essential assumptions of analogy-based effort estimation. TSE, 38(2):425–438, 2012.
  • [29] E. Kocaguneli, T. Menzies, and J. Keung. On the value of ensemble effort estimation. TSE, 38(6):1403–1416, November 2012.
  • [30] E. Kocaguneli, T. Menzies, and E. Mendes. Transfer learning in effort estimation. ESE, 20(3):813–843, Jun 2015.
  • [31] M. Korte and D. Port. Confidence in software cost estimation results based on mmre and pred. In PROMISE’08, pages 63–70, 2008.
  • [32] W. B. Langdon, J. Dolado, F. Sarro, and M. Harman. Exact mean absolute error of baseline predictor, marp0. IST, 73:16–18, 2016.
  • [33] M. Lefley and M. J. Shepperd. Using genetic programming to improve software effort estimation based on general data sets. In GECCO’03, pages 2477–2487, 2003.
  • [34] Y. Li, M. Xie, and T. N. Goh. A study of project selection and feature weighting for analogy based software cost estimation. JSS, 82(2):241–252, 2009.
  • [35] E. Mendes, I. Watson, C. Triggs, N. Mosley, and S. Counsell. A comparative study of cost estimation models for web hypermedia applications. ESE, 8(2):163–196, 2003.
  • [36] T. Menzies, Z. Chen, J. Hihn, and K. Lum. Selecting best practices for effort estimation. TSE, 32(11):883–895, 2006.
  • [37] T. Menzies, Y. Yang, G. Mathew, B.W. Boehm, and J. Hihn. Negative results for software effort estimation. ESE, 22(5):2658–2683, 2017.
  • [38] L. L. Minku and X. Yao. A principled evaluation of ensembles of learning machines for software effort estimation. In PROMISE’11, pages 9:1–9:10. ACM, 2011.
  • [39] L. L. Minku and X. Yao. An analysis of multi-objective evolutionary algorithms for training ensemble models based on different performance measures in software effort estimation. In PROMISE’13, pages 8:1–8:10. ACM, 2013.
  • [40] L. L. Minku and X. Yao. Ensembles and locality: Insight on improving software effort estimation. IST, 55(8):1512 – 1528, 2013.
  • [41] V. Nair, A. Agrawal, J. Chen, W. Fu, G. Mathew, T. Menzies, L. L. Minku, M. Wagner, and Z. Yu. Data-driven search-based software engineering. In MSR, 2018.
  • [42] J. Neter, M. H. Kutner, C. J. Nachtsheim, and W. Wasserman. Applied linear statistical models, volume 4. Irwin Chicago, 1996.
  • [43] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, and J. Vanderplas. Scikit-learn: Machine learning in python. JMLR, 12(Oct):2825–2830, 2011.
  • [44] T. Peters, T. Menzies, and L. Layman. Lace2: Better privacy-preserving data sharing for cross project defect prediction. In ICSE, volume 1, pages 801–811, May 2015.
  • [45] D. Port and M. Korte. Comparative studies of the model evaluation criterion mmre and pred in software cost estimation research. In ESEM’08, pages 51–60, 2008.
  • [46] J. R. Quinlan. Learning with continuous classes. In 5th Australian joint conference on artificial intelligence, volume 92, pages 343–348. Singapore, 1992.
  • [47] F. Sarro, F. Ferrucci, M. Harman, A. Manna, and J. Ren. Adaptive multi-objective evolutionary algorithms for overtime planning in software projects. TSE, 43(10):898–917, 2017.
  • [48] F. Sarro, A. Petrozziello, and M. Harman. Multi-objective software effort estimation. In ICSE, pages 619–630. ACM, 2016.
  • [49] Y. Shan, R. I. McKay, C. J. Lokan, and D. L. Essam. Software project effort estimation using genetic programming. In ICCCAS & WESINO EXPO’02, volume 2, pages 1108–1112, 2002.
  • [50] M. Shepperd. Software project economics: a roadmap. In 2007 Future of Software Engineering, pages 304–315. IEEE Computer Society, 2007.
  • [51] M. Shepperd, M. Cartwright, and G. Kadoda. On building prediction systems for software engineers. EMSE, 5(3):175–182, 2000.
  • [52] M. Shepperd and S. MacDonell. Evaluating prediction systems in software project estimation. IST, 54(8):820–827, 2012.
  • [53] M. Shepperd and C. Schofield. Estimating software project effort using analogies. TSE, 23(11):736–743, 1997.
  • [54] E. Stensrud, T. Foss, B. Kitchenham, and I. Myrtveit. A further empirical investigation of the relationship of mre and project size. ESE, 8(2):139–161, 2003.
  • [55] R. Storn and K. Price. Differential evolution–a simple and efficient heuristic for global optimization over cont. spaces. JoGO, 11(4):341–359, 1997.
  • [56] F. Walkerden and R. Jeffery. An empirical study of analogy-based software effort estimation. ESE, 4(2):135–158, 1999.
  • [57] P. A. Whigham, C. A. Owen, and S. G. Macdonell. A baseline model for software effort estimation. TOSEM, 24(3):20:1–20:11, May 2015.
  • [58] D. H. Wolpert. The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7):1341–1390, 1996.