1 Introduction
This paper reports an extraordinarily successful experiment in applying search-based software engineering (SBSE) to a very common software engineering problem: effort estimation. Many effort estimation methods are discussed in the literature; e.g.,

Jørgensen & Shepperd report over 250 papers proposing new methods for project size or effort estimation [23].

We list below 6,000+ methods for analogy-based effort estimation.
With so many available methods, it is now a matter of some debate which one is best for a new data set. To simplify that task, Whigham et al. recently proposed at TOSEM'15 a "baseline model for software effort estimation" called ATLM [57]. They recommend ATLM since, they claim, "it performs well over a range of different project types and requires no parameter tuning". Note that "no parameter tuning" is an attractive property, since tuning can be very slow, particularly when using evolutionary genetic algorithms (GAs). For example, the default recommendations for GAs suggest very large numbers of evaluations [18], which can take some time to terminate. Sarro et al. [48] report that their evolutionary system for effort estimation mutated 100 individuals for 250 generations. While they do not report their runtimes, we estimate that their methods would require 34 to 345 hours of CPU to terminate (assuming 100*250 evaluations, 0.5 to 5 seconds to evaluate one mutation, and a 10-way cross-validation).

In practice, commissioning an effort estimator on new data takes even more time than stated above. Wolpert's no-free-lunch theorems warn that for machine learning
[58], no single method works best on all data sets. Hence, when building effort estimators for a new data set, some commissioning process is required that tries a range of different algorithms. This is not a mere theoretical concern: researchers report that the "best" effort estimator for different data sets varies enormously [38, 40, 29].

Given such long runtimes, we have found it challenging to make SBSE attractive to the broader community of standard developers and business users. To address that challenge, it would be useful to have an example where SBSE can commission a specific effort estimator for a specific data set in just a few minutes on a standard laptop.
This paper offers such an example. We present a surprising and fortunate result: a very "CPU-lite" SBSE method can commission an effort estimator that significantly outperforms standard effort estimation methods. Here, by "outperform" we mean that:

Our estimates have statistically significantly smaller errors than standard methods;

The commissioning time for that estimator is very fast: the median runtime for our ten-way cross-validations is just six minutes on a standard 8GB, 3GHz desktop machine.
Note that our approach is very different from much of the prior research on effort estimation and evolutionary algorithms [5, 10, 12, 13, 33, 47, 7, 49, 48, 39]. Firstly, that work assumed a "CPU-heavy" approach whereas we seek a "CPU-lite" method. Secondly, we do not defend one particular estimator; instead, our commissioning process selects a different estimator for each data set after exploring thousands of possibilities.

The rest of this paper is structured as follows. The next section describes effort estimation. We then introduce OIL (short for optimized learning), a CPU-lite search-based SE method based on differential evolution [55]. This is followed by an empirical study where estimates for 945 software projects are generated using a variety of methods, including OIL. The results from that study let us comment on three research questions:

RQ1: Can effort estimation ignore SBSE? That is, is tuning avoidable since just a few options are typically “best”? We will find that the “best” effort estimation method is highly variable. That is, tools like OIL are important for ensuring that the right estimators are being applied to the current data set.

RQ2: Pragmatically speaking, is SBSE too hard to apply to effort estimation? As shown below, a few dozen evaluations of OIL are enough to explore configuration options for effort estimation. That is, it is hardly arduous to apply SBSE to effort estimation. Even on a standard single core machine, the median time to explore all those options is just a few minutes.

RQ3: Does SBSE estimate better than widely-used effort estimation methods? As shown below, the estimates from OIL are much better than those of standard effort estimation methods, including ATLM.
2 Background
2.1 Why Explore Software Effort Estimation?
Software effort estimation is the process of predicting the most realistic amount of human effort (usually expressed in hours, days, or months of human work) required to plan, design, and develop a software project, based on information collected from previous related software projects. With one or more wrong factors, the resulting effort estimate could be inaccurate, which affects the funds allocated to the project [24]. Inadequate or excessive funding for a project can cause a considerable waste of resources and time. For example, NASA canceled its incomplete Checkout Launch Control System project after the initial $200M estimate was exceeded by another $200M [9]. It is critical to generate effort estimates with good accuracy, if for no other reason than that many government organizations demand that the budgets allocated to large publicly funded projects be double-checked by some estimation model [37].
Effort estimation techniques can be divided into human-based and model-based techniques [28, 50]. Human-based techniques [20] can be hard to audit or dispute (e.g., when the estimate is generated by a senior colleague but disputed by others). Also, empirically, it is known that humans rarely update their estimation knowledge based on feedback from new projects [22].
Model-based methods are preferred when estimates have to be audited or debated (since the method is explicit and available for inspection). Even advocates of human-based methods [21] acknowledge that model-based methods are useful for learning the uncertainty about particular estimates; e.g., by running those models many times, each time applying small mutations to the input data.
Note that this paper focuses on estimation-via-analogy; there are many other ways to perform effort estimation. We choose not to explore parametric estimation [37] since that approach demands the data be expressed in exactly the same terms as the parametric models (e.g., COCOMO). This can be a major limitation of parametric models; for example, none of the data sets used in this paper are expressed in the vocabulary used by standard parametric models. As to CPU-heavy methods (e.g., ensembles [29] or standard genetic algorithms for effort estimation [5, 10, 12, 13, 33, 47, 7, 49, 48, 39]), the message of this paper is that CPU-lite methods (e.g., just 40 evaluations within DE) can be surprisingly effective. Hence, we do not explore CPU-heavy methods, at least for now. It would be interesting in future work to check if (e.g.) CPU-heavy ensembles or genetic algorithms are outperformed by the CPU-lite methods of this paper.

2.2 Analogy-based Estimation (ABE)
Analogy-based Estimation (ABE) was explored by Shepperd and Schofield in 1997 [53]. It is widely used [44, 30, 19, 27, 37], in many forms. We say that "ABE0" is the standard form seen in the literature and "ABEN" are the 6,000+ variants of ABE defined below. The general form of ABE (which applies to ABE0 or ABEN) is:

Form a table of rows of past projects. The columns of this table are composed of independent variables (the features that define projects) and one dependent variable (project effort).

Find training subsets. Decide on what similar projects (analogies) to use from the training set when examining a new test instance.

For each test instance, select k analogies out of the training set.

While selecting analogies, use a similarity measure.

Before calculating similarity, normalize numerics min..max to 0..1 (so all numerics get equal chance to influence the dependent).

Use feature weighting to reduce the influence of less informative features.


Use an adaptation strategy to return some combination of the dependent effort values seen in the nearest analogies.
To measure the similarity between two examples x and y, ABE uses Dist(x, y) = sqrt( Σ_i w_i (x_i − y_i)^2 ), where i ranges over all the independent variables. In this equation, w_i corresponds to the feature weight applied to independent feature i. For ABE0, we use a uniform weighting; therefore w_i = 1. Also, the adaptation strategy for ABE0 is to return the effort values of the k nearest analogies. The rest of this section describes the 6,000+ variants of ABE that we call ABEN. Note that we do not claim that the following represents all possible ways to perform analogy-based estimation. Rather, we merely say that (a) all the following are common variations of ABE0, seen in recent research publications [28]; and (b) anyone with knowledge of the current effort estimation literature would be tempted to try some of the following.
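The ABE0 scheme just described can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation; the toy four-project data and the k=2 setting in the example below are invented for the sketch.

```python
import math

def abe0_estimate(train, test_row, k=3):
    """ABE0 sketch: min-max normalize numerics to 0..1, use uniform
    feature weights (w_i = 1, i.e., plain Euclidean distance), then
    combine the effort values of the k nearest analogies by their mean.
    `train` is a list of (features, effort) pairs."""
    # Min-max normalize each feature column to 0..1.
    cols = list(zip(*[feats for feats, _ in train]))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    norm = lambda row: [(v - l) / (h - l) if h > l else 0.0
                        for v, l, h in zip(row, lo, hi)]
    t = norm(test_row)
    # Uniform weights: plain Euclidean distance in normalized space.
    dist = lambda row: math.sqrt(sum((a - b) ** 2
                                     for a, b in zip(norm(row), t)))
    nearest = sorted(train, key=lambda p: dist(p[0]))[:k]
    # Adaptation: mean of the k nearest effort values.
    return sum(e for _, e in nearest) / k

# Hypothetical data: the two nearest analogies of [11, 2] are the
# first two rows, so the estimate is (100 + 120) / 2 = 110.0.
train = [([10, 2], 100), ([12, 3], 120), ([50, 9], 500), ([55, 8], 520)]
print(abe0_estimate(train, [11, 2], k=2))
```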
Two ways to find training subsets: (1) Remove nothing: usually, effort estimators use all training projects [6]. Our ABE0 uses this variant. (2) Outlier methods: prune training projects with (say) suspiciously large values [25]. Typically, this removes a small percentage of the training data.

Eight ways to make feature weightings: Li et al. [34] and Hall and Holmes [17] review eight different feature weighting schemes. Li et al. use a genetic algorithm to learn useful feature weights. Hall and Holmes review a variety of methods ranging from WRAPPER to various filter methods, including their preferred correlation-based method. Note that their methods assume symbolic, not numeric, dependent variables. Hence, to apply these methods, we add a discretized class column, using bins of width (max − min)/10. Technical aside: when we compute the error measures (see below), we use the raw numeric dependent values.
Three ways to discretize (summarize numeric ranges into a few bins): Some feature weighting schemes require an initial discretization of continuous columns. There are many discretization policies in the literature, including: (1) equal frequency, (2) equal width, (3) do nothing.
Six ways to choose similarity measurements: Mendes et al. [35] discuss three similarity measures, including the weighted Euclidean measure described above, an unweighted variant (where w_i = 1), and a "maximum distance" measure that focuses on the single feature that maximizes inter-project distance. Frank et al. [15] offer a fourth similarity measure that uses a triangular distribution to set the weight to zero once the distance is more than "k" neighbors away from the test instance. A fifth and sixth similarity measure are the Minkowski distance measure used in [3] and the mean value of the ranking of each project feature used in [56].
Four ways to make adaptation mechanisms: (1) median effort value, (2) mean dependent value, (3) summarize the adaptations via a second learner (e.g., linear regression) [34, 36, 4, 46], (4) weighted mean [35].

Six ways to select analogies: Kocaguneli et al. [28] say analogy selectors are either fixed or dynamic. Fixed methods use k ∈ {1, 2, 3, 4, 5} nearest neighbors, while dynamic methods use the training set to find which k is best.
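As a sanity check on the "6,000+" figure, multiplying out the option counts listed above gives the size of the raw ABEN design space (before discarding redundant combinations):

```python
# Cross-product of the option counts listed above. Some combinations
# are redundant (e.g., at k=1 the adaptation mechanisms coincide),
# so this is an upper bound on the distinct ABEN variants.
options = {
    "subset": 2,        # remove nothing; outlier pruning
    "weighting": 8,     # genetic, gain rank, Relief, PCA, CFS, ...
    "discretize": 3,    # none, equal frequency, equal width
    "similarity": 6,    # (un)weighted Euclidean, max distance, ...
    "adaptation": 4,    # median, mean, second learner, weighted mean
    "analogies": 6,     # k in 1..5, plus dynamic
}
total = 1
for n in options.values():
    total *= n
print(total)  # 6912, i.e., the "6,000+" variants of ABEN
```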
2.3 OIL
As shown above, ABEN has 6,000+ variants. Some can be ignored; e.g., at k = 1, the adaptation mechanisms all return the same result, so some of them can be discarded. Also, not all feature weighting techniques use discretization. But even after those discards, there are still thousands of possibilities to explore.
OIL is our controller for exploring these possibilities. Initially, our plan was to use standard hyperparameter tuning for this task. Then we learned that (a) standard data mining toolkits like scikit-learn lack some of the ABEN variants; and (b) standard hyperparameter tuners can be slow (sklearn recommends a default runtime of 24 hours [1]). Hence, we built OIL, implemented as a layered architecture:

At the lowest library layer, OIL uses Python's scikit-learn [43].

Above that, there is a utilities layer containing all the algorithms missing from scikit-learn (e.g., ABEN required numerous additions at the utilities layer).

Higher up, OIL's modelling layer uses an XML-based domain-specific language to specify a feature map of data mining options. These feature models are single-parent and-or graphs with (optionally) cross-tree constraints showing which options require or exclude other options. A graphical representation of the feature model used in this paper is shown in Figure 1.

Finally, at the topmost optimizer layer, there is an evolutionary optimizer that makes decisions across the feature map. An automatic mapper facility then links those decisions down to the lower layers to run the selected algorithms.
For this study, we optimize using the differential evolution method (DE [55]), shown in Figure 2. DE was selected since recent software analytics papers have reported that DE can be effective for text mining [2] and defect prediction [16]. We initially planned a more extensive evaluation with other optimizers, but encountered problems accessing reference implementations (e.g., there is no reproduction package available for the Sarro et al. system [48] at their home page http://www0.cs.ucl.ac.uk/staff/F.Sarro/projects/CoGEE/). In any case, the results with DE were so promising that we deferred the application of other optimizers to future work.
DE evolves a new generation of candidates from a current population of size np. Each candidate solution for effort estimation is a pair (Tunings, Scores), where Tunings are selected from the above options for ABEN, and Scores come from training a learner using those parameters and applying it to test data.

The premise of DE is that the best way to mutate the existing tunings is to extrapolate between current solutions. Three solutions a, b, c are selected at random. For each tuning parameter i, at some probability cf, we replace the old tuning x_i with y_i. For booleans, y_i = ¬x_i; for numerics, y_i = a_i + f × (b_i − c_i), where f is a parameter controlling the size of the extrapolation. The main loop of DE runs over the population, replacing old items with new candidates (if the new candidate is better). This means that, as the loop progresses, the population contains increasingly more valuable solutions (which, in turn, helps extrapolation).
As to the control parameters of DE, using advice from Storn [55], we set the population size np, the crossover probability cf, and the extrapolation factor f to his recommended defaults. The number of generations was set as follows: a small number (2) was used to test the effects of a very CPU-lite SBSE effort estimator; a larger number (8) was used to check if anything was lost by restricting the inference to just two generations.
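The DE loop described above can be sketched as follows. This is a hedged illustration in the style of Storn's DE/rand/1 scheme for numeric decisions only; the default values for np_, f, and cf below are illustrative, not the paper's settings.

```python
import random

def de(eval_fn, bounds, np_=20, f=0.75, cf=0.3, generations=2):
    """Minimal differential evolution sketch. `bounds` is a list of
    (lo, hi) pairs for numeric decisions; `eval_fn` returns a score
    to minimize. Returns the best candidate and its score."""
    pop = [[random.uniform(lo, hi) for lo, hi in bounds]
           for _ in range(np_)]
    scores = [eval_fn(x) for x in pop]
    for _ in range(generations):
        for i in range(np_):
            # Extrapolate between three other randomly chosen solutions.
            a, b, c = random.sample(
                [p for j, p in enumerate(pop) if j != i], 3)
            # Mutate each decision at probability cf: y = a + f*(b - c),
            # clamped back into its legal range.
            new = [min(max(a[k] + f * (b[k] - c[k]), lo), hi)
                   if random.random() < cf else pop[i][k]
                   for k, (lo, hi) in enumerate(bounds)]
            s = eval_fn(new)
            if s < scores[i]:          # replace old item if better
                pop[i], scores[i] = new, s
    best = min(range(np_), key=scores.__getitem__)
    return pop[best], scores[best]
```

Because improved candidates replace old ones inside the loop, later mutations extrapolate between increasingly good solutions, which is why a few dozen evaluations can be surprisingly effective.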
3 Empirical Study
To assess OIL, we applied it to the 945 projects found in nine data sets from the SEACRAFT repository (http://tiny.cc/seacraft); see Table 1 and Table 2. This data was used since it has been widely used in previous estimation research. Also, it is quite diverse, differing in: number of observations (from 15 to 499 projects); number and type of features (from 6 to 25 features, including a variety of features describing the software projects, such as the number of developers involved in the project and their experience, the technologies used, size in terms of Function Points, etc.); technical characteristics (software projects developed in different programming languages and for different application domains, ranging from telecommunications to commercial information systems); and geographical locations (software projects coming from China, Canada, and Finland).
Data set     Projects   Features
kemerer          15         6
albrecht         24         7
isbsg10          37        11
finnish          38         7
miyazaki         48         7
maxwell          62        25
desharnais       77         6
kitchenham      145         6
china           499        18
total           945
OIL collects information on two performance metrics: the magnitude of the relative error (MRE) [8] and Standardized Accuracy (SA). We make no comment on which measure is better; these were selected since they are widely used in the literature.


MRE is defined in terms of AR, the magnitude of the absolute residual. This is computed from the difference between the predicted and actual effort values: AR_i = |predicted_i − actual_i|. MRE is the magnitude of the relative error, calculated by expressing AR as a ratio of the actual effort value; i.e., MRE_i = AR_i / actual_i.
MRE has been criticized [14, 26, 31, 45, 51, 54] as being biased towards error underestimations. Some researchers prefer other (more standardized) measures, such as Standardized Accuracy (SA) [32, 52]. SA is defined in terms of MAE = (1/n) Σ_{i=1..n} |actual_i − estimated_i|, where n is the number of projects used for evaluating the performance, and actual_i and estimated_i are the actual and estimated effort, respectively, for project i. SA uses MAE as follows: SA = 1 − MAE_pj / MAE_rguess (often expressed as a percentage), where MAE_pj is the MAE of the approach being evaluated and MAE_rguess is the MAE of a large number (e.g., 1000 runs) of random guesses. The important thing about SA is that, over many runs, MAE_rguess will converge on simply using the sample mean [52]. SA represents how much better the approach is than random guessing. Values near zero mean that the prediction model is practically useless, performing little better than random guesses [52].
Note that for these evaluation measures:

smaller MRE values are better;

while larger SA values are better.
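For concreteness, the two measures can be computed as follows. This sketch follows the definitions above, with the random-guessing MAE estimated by sampling actual effort values at random (a simplification of the 1000-run procedure in the literature).

```python
import random

def mre(actual, predicted):
    """Magnitude of the relative error for one project:
    AR / actual, where AR = |predicted - actual|."""
    return abs(predicted - actual) / actual

def sa(actual, predicted, runs=1000):
    """Standardized Accuracy: 1 - MAE_pj / MAE_rguess, where
    MAE_rguess is the mean MAE of `runs` rounds of random guessing
    (here: predicting a randomly chosen actual effort value)."""
    n = len(actual)
    mae_pj = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
    guesses = [sum(abs(a - random.choice(actual)) for a in actual) / n
               for _ in range(runs)]
    mae_rguess = sum(guesses) / runs
    return 1 - mae_pj / mae_rguess
```

A perfect predictor scores SA = 1, while a predictor no better than random guessing scores near zero.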
It is good practice to benchmark new methods against a variety of different approaches. Accordingly, OIL uses the following algorithms:

Automatically Transformed Linear Model (ATLM) is an effort estimation method recently proposed at TOSEM'15 by Whigham et al. [57]. ATLM is a multiple linear regression model which calculates the effort as y_i = β_0 + Σ_j β_j x_{ij} + ε_i, where y_i is the response for project i and the x_{ij} are explanatory variables. The prediction weights β_j are determined using least squares estimation [42]. Recall from the introduction that Whigham et al. recommend ATLM since, they say, it performs well on a range of different project types and needs no parameter tuning.

Differential Evolution (DE) was described above. Recall that we have two versions of DE: DE2 and DE8 run for two and eight generations, terminating after evaluating 40 and 160 configurations, respectively.

Random Choice (RD). It is good practice to baseline stochastic optimizers like DE against some random search [41]. Accordingly, RD selects leaves at random from Figure 1 until it finds N valid configurations. All these variants are executed and the best one is selected for application to the test set. To maintain parity with DE2 and DE8, OIL uses N ∈ {40, 160} (which we denote RD40 and RD160).
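A random-choice baseline of this kind is easy to sketch. In the code below, `sample_config` stands in for OIL's feature-model sampler and is purely hypothetical; any callable returning one valid configuration will do.

```python
import random

def rd(eval_fn, sample_config, n=40):
    """Random-choice baseline (RD40/RD160): draw n valid
    configurations at random, evaluate each on the training data,
    and keep the best for application to the test set."""
    best, best_score = None, float("inf")
    for _ in range(n):
        cfg = sample_config()
        s = eval_fn(cfg)          # score to minimize
        if s < best_score:
            best, best_score = cfg, s
    return best, best_score

# Toy usage: configurations are random numbers, score is the value itself.
best, score = rd(lambda c: c, lambda: random.random(), n=40)
```

Unlike DE, RD never extrapolates between candidates, which is why (as reported below) it needs more evaluations to match DE's results.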

Table 3 reports, for each data set (kemerer, albrecht, isbsg10, finnish, miyazaki, maxwell, desharnais, kitchenham, china), how often each option was selected by the best optimizer. Its columns and options are:

Subset:      remove nothing; outlier removal
Weighting:   remain same; genetic; gain rank; Relief; PCA; CFS; CNS; WRP
Discret.:    no discretization; equal frequency; equal width
Similarity:  Euclidean; weighted Euclidean; max measure; local likelihood; Minkowski; feature mean
Adaptation:  median; mean; second learner; weighted mean
Analogies:   k=1; k=2; k=3; k=4; k=5; dynamic

[The per-data-set selection frequencies are shown as shaded cells in the original table; key: 10% to 100%.]
OIL performs an n-fold cross-validation for each of (ABE0, ATLM, DE2, DE8, RD40, RD160), on each of our nine data sets. To apply this, each data set is partitioned into n sets (the observations are sampled uniformly at random, without replacement); then, in turn, OIL treats each set as the testing set and the remaining observations as the training set. For the data sets kemerer, albrecht, isbsg10, and finnish, we use three-fold cross-validation since they have fewer than 40 instances. For the larger data sets miyazaki, maxwell, desharnais, kitchenham, and china, we use ten-fold cross-validation.

Since our folds are selected in a stochastic manner, we repeat the cross-validations 20 times, each time with a different random seed.
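The fold construction can be sketched as follows. The helper names are ours, not OIL's, but the logic matches the text: shuffle the rows with a seed, deal them into disjoint folds, and use three folds below 40 rows, else ten.

```python
import random

def crossval_indices(n_rows, n_folds, seed):
    """Shuffle row indices with a given seed, then deal them into
    n_folds disjoint test sets (sampling without replacement).
    Each fold's complement is its training set."""
    rng = random.Random(seed)
    idx = list(range(n_rows))
    rng.shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    return [(sorted(set(idx) - set(f)), sorted(f)) for f in folds]

def repeated_crossval(n_rows, repeats=20):
    """20 repeats with different seeds; three folds for small
    data sets (< 40 rows), ten folds for the larger ones."""
    n_folds = 3 if n_rows < 40 else 10
    return [crossval_indices(n_rows, n_folds, seed)
            for seed in range(repeats)]
```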
4 Results
These results are divided into answers for the research questions introduced above.
RQ1: Can effort estimation ignore SBSE? That is, is tuning avoidable since just a few options are typically “best”?
Table 3 shows why SBSE is an essential component of effort estimation. This table shows how often different options were selected by the best optimizer seen in this study. Note that only very rarely is one option selected most of the time (exception: clearly, our outlier operator is not very good; this should be explored further in future work). From this table, it is clear that the best configuration is not only data set specific, but also specific to the training set used within a data set. This means that RQ1=no, and tools like OIL are very important for configuring effort estimation methods.
RQ2: Pragmatically speaking, is SBSE too hard to apply to effort estimation?
As mentioned in the introduction, some SBSE methods can be very slow. While such long runtimes are certainly required in other domains, for configuring effort estimation methods, SBSE can terminate much faster than that. Figure 3 shows the time required to generate our results (on a standard 8GB, 3GHz desktop machine).
Data set     ABE0  ATLM  RD40  RD160  DE2  DE8
kemerer         1     1     3     13    4   10
albrecht        1     1     3     11    4   11
isbsg10         1     1     3     15    4   14
finnish         1     1     4     14    5   14
miyazaki        1     1     5     16    6   16
maxwell         1     1    12     52   18   53
desharnais      1     1    13     54   17   55
kitchenham      1     1    21     80   28   94
china           1     1    57    232   52  243
Note that the standard effort estimation methods (i.e., ABE0 and ATLM) run very fast indeed compared to everything else. Hence, pragmatically, it seems tempting to recommend these faster systems. Nevertheless, this paper recommends somewhat slower methods since, as shown below, the faster methods (i.e., ABE0 and ATLM) produce very poor estimates. The good news from Figure 3 is that cross-validation for the method we will recommend (DE2) takes just a few minutes to terminate. Hence, we say that RQ2=no, since SBSE can quite quickly commission an effort estimator tuned specifically to a data set.
RQ3: Does SBSE estimate better than widelyused effort estimation methods?
RQ2 showed that SBSE for effort estimation is not arduously slow. Another issue is whether those SBSE methods lead to better estimates. Figure 4 explores that issue. Black dots show median values from 20 repeats. Horizontal lines show the 25th to 75th percentiles of the values.
The most important part of the Figure 4 results are the Rank columns shown on the left-hand side. These ranks cluster together results that are statistically indistinguishable, as judged by a conjunction of a 95% bootstrap significance test [11] and an A12 test for a non-small effect size difference in the distributions [37]. These tests were used since their non-parametric nature avoids issues with non-Gaussian distributions.
In Figure 4, Rank=1 denotes the best results. When multiple treatments receive the top rank, we use the runtimes of Figure 3 to break ties. For example, in the kemerer MRE results, four methods have Rank=1. However, two of these methods (DE2 and RD40) are much faster than the others. Rows denoted Rank=1* mark these fastest top-ranked treatments.
(Technical aside: there is no statistically significant difference between the runtimes of RD40 and DE2 in Figure 3, as determined by a 95% bootstrap test. Hence, when assigning the Rank=1*, we say that RD40 runs as fast as DE2.)
From the Rank=1* entries in Figure 4, we make the following comments.

In marked contrast to the claims of Whigham et al., ATLM does not perform very well. While it does appear as a Rank=1* method on finnish, on all other data sets it performs badly. Indeed, its performance often falls outside the [0,100]% range shown in Figure 4.

Another widely-used method in effort estimation is the ABE0 analogy-based effort estimator. In 15/18 of the Figure 4 results, ABE0 is ranked better than ATLM. That is, if the reader wants to avoid the added complexity of SBSE, they could ignore our advocacy for OIL and instead just use ABE0. That said, ABE0 is top-ranked in only 1/18 of our results. Clearly, there are better methods than ABE0.

Random configuration selection performs not too badly. In 6/18 of the Figure 4 results, one of our random methods earns Rank=1*. That said, the random methods are clearly outperformed by just a few dozen evaluations of DE: in 14/18 of these results, DE2 (40 evaluations of DE) earns Rank=1*.
Overall, based on the above points, we recommend DE2 for commissioning effort estimation on new data sets. In 17/18 of our results, it is scored Rank=1. To be sure, in 3 of those results another method ran faster. However, for the sake of implementation simplicity, some researchers might choose to ignore that minority case.
In summary, RQ3=yes, since SBSE produces much better effort estimates than widely-used effort estimation methods.
5 Discussion
The natural question that arises from all this is: why does SBSE work so well? We see three possibilities: (1) DE is really clever; (2) effort estimation is really simple; or (3) there exists a previously undocumented floor effect in effort estimation.
Regarding DE is clever: DE combines local search (the extrapolation described in Figure 2) with an archive pruning operator (when new candidates supplant older items in the population, then all subsequent mutations use the new and improved candidates). Hence it is wrong to characterize 40 DE evaluations as “just 40 guesses”. Also, there is evidence from other SE domains that DE is indeed a clever way to study SE problems. For example, Fu et al. found that hyperparameter optimization via a few dozen DE evaluations was enough to produce significantly large improvements in defect prediction [16]. Also, in other work, Agrawal et al. [2] found that a few dozen evaluations of DE were enough to significantly improve the control parameters for the Latent Dirichlet Allocation text mining algorithm.
Regarding effort estimation is simple: perhaps the effective search space of different effort estimators is very small. If effort estimation exhibits a "many roads lead to Rome" property, then when multiple estimators are applied to the same data sets, many of them will have equivalent performance. For such problems, configuration is not a difficult task, since a few random probes (plus a little guidance from DE) can effectively survey all the important features.
Regarding floor effects: floor effects exist when a domain contains some inherent performance boundary that cannot be exceeded. Floor effects have many causes, such as when the signal content of a data set is very limited. For such data sets, once learners reach "the floor", there is no better place to go after that. This paper offers two pieces of evidence for floor effects in effort estimation:

Recall from the above that our data sets are very small (see Table 1), which suggests that effort estimation data has limited signal.

Also, one indicator for floor effects is that informed methods perform no better than random search and, to some extent, that indicator was seen in the above results. Recall that while a full random search was outperformed by DE2, sometimes those random searches performed very well indeed.
Whatever the explanation, the main effect documented by this paper is that a widely used SE technique (effort estimation) can be dramatically improved with SBSE.
6 Threats to Validity
Internal Bias: all our methods contain stochastic random operators. To reduce the bias from those operators, we repeated our experiment 20 times and applied statistical tests to remove spurious distinctions.
Parameter Bias: DE plays an important role in OIL, yet in this paper we did not discuss the influence of different DE parameters, such as np, f, and cf. Instead, we followed Storn et al.'s configurations [55]. Clearly, tuning such parameters is a direction for future work.
Sampling Bias: while we tested OIL on nine data sets, it would be inappropriate to conclude that OIL always performs better than other methods on all data sets. As researchers, what we can do to mitigate this problem is to carefully document our method, release our code, and encourage the community to try this method on more data sets, as the occasion arises.
7 Conclusion and Future Work
This paper has explored methods for commissioning effort estimation methods. As stated in the introduction, our approach is very different from much of the prior "CPU-heavy" SBSE research on effort estimation and evolutionary algorithms [5, 10, 12, 13, 33, 47, 7, 49, 48, 39]. Firstly, we take a "CPU-lite" approach. Secondly, we do not defend one particular estimator; instead, our commissioning process selects a different estimator for each data set after exploring thousands of options.
Our results show that SBSE is both necessary and simple to apply for effort estimation. Table 3 showed that the "best" estimator varies greatly across effort estimation data. Using "CPU-lite" SBSE methods (specifically, DE), it is possible to find these best estimators very quickly. Further, the effort estimators generated by SBSE outperform standard methods in widespread use (ABE0 and ATLM). This SBSE process is not an overly burdensome task since, as shown above, it is enough to perform 40 evaluations of different candidates (guided by DE). To be sure, some additional architecture is required for SBSE and effort estimation, but we have packaged that into the OIL system (which, after double-blind review, we will distribute as a Python pip package).
As to future work, as discussed in several places around this document:

This work should be repeated for more datasets.

The space of operators we explored within ABEN could be expanded. Clearly, from Table 3, our outliers method is ineffective and should be replaced. There are also other estimation methods that could be explored (not just for ABE, but otherwise).

Other DE settings (np, f, and cf) could be explored.

It could also be useful to try optimizers other than DE. Specifically, future work could check if (e.g.) CPU-heavy methods such as ensemble methods [29] or Sarro's genetic algorithms [48] are outperformed by the CPU-lite methods of this paper. That said, it should be noted that this study found no benefit in increasing the number of evaluations from 40 to 160. Hence, possibly, CPU-heavy methods may not result in better estimators.

It could be very insightful to explore the floor effects discussed in §5. If these are very common, then that would suggest the whole field of software effort estimation has been needlessly overcomplicated.
References
 [1] Scikit-learn manual, 2018.
 [2] A. Agrawal, W. Fu, and T. Menzies. What is wrong with topic modeling? and how to fix it using searchbased software engineering. IST, 2018.
 [3] L. Angelis and I. Stamelos. A simulation tool for efficient analogy based cost estimation. EMSE, 5(1):35–68, 2000.
 [4] D. R. Baker. A hybrid approach to expert and model based effort estimation. West Virginia University, 2007.

 [5] C. J. Burgess and M. Lefley. Can genetic programming improve software effort estimation? A comparative evaluation. IST, 43(14):863–873, 2001.
 [6] C. L. Chang. Finding prototypes for nearest neighbor classifiers. TC, 100(11), 1974.
 [7] M. Choetkiertikul, H. K. Dam, T. Tran, T. T. M. Pham, A. Ghose, and T. Menzies. A deep learning model for estimating story points. TSE, PP(99):1–1, 2018.
 [8] S. D. Conte, H. E. Dunsmore, and V. Y. Shen. Software Engineering Metrics and Models. Benjamin-Cummings Publishing Co., Inc., Redwood City, CA, USA, 1986.
 [9] K. Cowing. Nasa to shut down checkout & launch control system, 2002.
 [10] J. J. Dolado. A validation of the component-based method for software size estimation. TSE, 26(10):1006–1021, Oct 2000.
 [11] B. Efron and J. Tibshirani. An introduction to the bootstrap. Chapman & Hall, 1993.
 [12] F. Ferrucci, C. Gravino, R. Oliveto, and F. Sarro. Genetic programming for effort estimation: An analysis of the impact of different fitness functions. In SSBSE’10, pages 89–98, 2010.
 [13] F. Ferrucci, C. Gravino, R. Oliveto, F. Sarro, and E. Mendes. Investigating tabu search for web effort estimation. In SEAA, pages 350–357, Sept 2010.
 [14] T. Foss, E. Stensrud, B. Kitchenham, and I. Myrtveit. A simulation study of the model evaluation criterion MMRE. TSE, 29(11):985–995, 2003.

 [15] E. Frank, M. Hall, and B. Pfahringer. Locally weighted naive Bayes. In 19th Conference on Uncertainty in Artificial Intelligence, pages 249–256, 2002.
 [16] W. Fu, T. Menzies, and X. Shen. Tuning for software analytics: Is it really necessary? IST, 76:135–146, 2016.
 [17] M. A. Hall and G. Holmes. Benchmarking attribute selection techniques for discrete class data mining. TKDE, 15(6):1437–1447, 2003.
 [18] R. L. Haupt. Optimum population size and mutation rate for a simple real genetic algorithm that optimizes array factors. In APSIS’00, pages 1034–1037, 2000.
 [19] J. Hihn and T. Menzies. Data mining methods and cost estimation models: Why is it so hard to infuse new ideas? In ASEW, pages 5–9, Nov 2015.
 [20] M. Jørgensen. A review of studies on expert estimation of software development effort. JSS, 70(12):37–60, 2004.

 [21] M. Jørgensen. The world is skewed: Ignorance, use, misuse, misunderstandings, and how to improve uncertainty analyses in software development projects, 2015.
 [22] M. Jørgensen and T. M. Gruschke. The impact of lessons-learned sessions on effort estimation and uncertainty assessments. TSE, 35(3):368–383, 2009.
 [23] M. Jørgensen and M. Shepperd. A systematic review of software development cost estimation studies. TSE, 33(1), 2007.
 [24] C. F. Kemerer. An empirical validation of software cost estimation models. CACM, 30(5):416–429, 1987.
 [25] J. W. Keung, B. A. Kitchenham, and D. R. Jeffery. Analogyx: Providing statistical inference to analogybased software cost estimation. TSE, 34(4):471–484, 2008.
 [26] B. A. Kitchenham, L. M. Pickard, S. G. MacDonell, and M. J. Shepperd. What accuracy statistics really measure. IEEE Software, 148(3):81–85, 2001.
 [27] E. Kocaguneli and T. Menzies. How to find relevant data for effort estimation? In ESEM, pages 255–264, Sept 2011.
 [28] E. Kocaguneli, T. Menzies, A. Bener, and J. W. Keung. Exploiting the essential assumptions of analogybased effort estimation. TSE, 38(2):425–438, 2012.
 [29] E. Kocaguneli, T. Menzies, and J. Keung. On the value of ensemble effort estimation. TSE, 38(6):1403–1416, November 2012.
 [30] E. Kocaguneli, T. Menzies, and E. Mendes. Transfer learning in effort estimation. ESE, 20(3):813–843, Jun 2015.
 [31] M. Korte and D. Port. Confidence in software cost estimation results based on mmre and pred. In PROMISE’08, pages 63–70, 2008.
 [32] W. B. Langdon, J. Dolado, F. Sarro, and M. Harman. Exact mean absolute error of baseline predictor, marp0. IST, 73:16–18, 2016.
 [33] M. Lefley and M. J. Shepperd. Using genetic programming to improve software effort estimation based on general data sets. In GECCO’03, pages 2477–2487, 2003.
 [34] Y. Li, M. Xie, and T. N. Goh. A study of project selection and feature weighting for analogy based software cost estimation. JSS, 82(2):241–252, 2009.
 [35] E. Mendes, I. Watson, C. Triggs, N. Mosley, and S. Counsell. A comparative study of cost estimation models for web hypermedia applications. ESE, 8(2):163–196, 2003.
 [36] T. Menzies, Z. Chen, J. Hihn, and K. Lum. Selecting best practices for effort estimation. TSE, 32(11):883–895, 2006.
 [37] T. Menzies, Y. Yang, G. Mathew, B.W. Boehm, and J. Hihn. Negative results for software effort estimation. ESE, 22(5):2658–2683, 2017.
 [38] L. L. Minku and X. Yao. A principled evaluation of ensembles of learning machines for software effort estimation. In PROMISE’11, pages 9:1–9:10. ACM, 2011.
 [39] L. L. Minku and X. Yao. An analysis of multi-objective evolutionary algorithms for training ensemble models based on different performance measures in software effort estimation. In PROMISE'13, pages 8:1–8:10. ACM, 2013.
 [40] L. L. Minku and X. Yao. Ensembles and locality: Insight on improving software effort estimation. IST, 55(8):1512 – 1528, 2013.
 [41] V. Nair, A. Agrawal, J. Chen, W. Fu, G. Mathew, T. Menzies, L. L. Minku, M. Wagner, and Z. Yu. Data-driven search-based software engineering. In MSR, 2018.
 [42] J. Neter, M. H. Kutner, C. J. Nachtsheim, and W. Wasserman. Applied linear statistical models, volume 4. Irwin Chicago, 1996.
 [43] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, and J. Vanderplas. Scikit-learn: Machine learning in Python. JMLR, 12(Oct):2825–2830, 2011.
 [44] T. Peters, T. Menzies, and L. Layman. LACE2: Better privacy-preserving data sharing for cross project defect prediction. In ICSE, volume 1, pages 801–811, May 2015.
 [45] D. Port and M. Korte. Comparative studies of the model evaluation criterion mmre and pred in software cost estimation research. In ESEM’08, pages 51–60, 2008.
 [46] J. R. Quinlan. Learning with continuous classes. In 5th Australian joint conference on artificial intelligence, volume 92, pages 343–348. Singapore, 1992.
 [47] F. Sarro, F. Ferrucci, M. Harman, A. Manna, and J. Ren. Adaptive multi-objective evolutionary algorithms for overtime planning in software projects. TSE, 43(10):898–917, 2017.
 [48] F. Sarro, A. Petrozziello, and M. Harman. Multi-objective software effort estimation. In ICSE, pages 619–630. ACM, 2016.
 [49] Y. Shan, R. I. McKay, C. J. Lokan, and D. L. Essam. Software project effort estimation using genetic programming. In ICCCAS & WESINO EXPO’02, volume 2, pages 1108–1112, 2002.
 [50] M. Shepperd. Software project economics: a roadmap. In 2007 Future of Software Engineering, pages 304–315. IEEE Computer Society, 2007.
 [51] M. Shepperd, M. Cartwright, and G. Kadoda. On building prediction systems for software engineers. EMSE, 5(3):175–182, 2000.
 [52] M. Shepperd and S. MacDonell. Evaluating prediction systems in software project estimation. IST, 54(8):820–827, 2012.
 [53] M. Shepperd and C. Schofield. Estimating software project effort using analogies. TSE, 23(11):736–743, 1997.
 [54] E. Stensrud, T. Foss, B. Kitchenham, and I. Myrtveit. A further empirical investigation of the relationship of mre and project size. ESE, 8(2):139–161, 2003.

 [55] R. Storn and K. Price. Differential evolution: a simple and efficient heuristic for global optimization over continuous spaces. JoGO, 11(4):341–359, 1997.
 [56] F. Walkerden and R. Jeffery. An empirical study of analogy-based software effort estimation. ESE, 4(2):135–158, 1999.
 [57] P. A. Whigham, C. A. Owen, and S. G. Macdonell. A baseline model for software effort estimation. TOSEM, 24(3):20:1–20:11, May 2015.
 [58] D. H. Wolpert. The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7):1341–1390, 1996.