Software analytics has been widely used in software engineering for many tasks Menzies and Zimmermann (2018). This paper explores methods to improve algorithms for software effort estimation (a particular kind of analytics task). This is needed since software effort estimates can be wildly inaccurate Kemerer (1987). Effort estimates need to be accurate (if for no other reason) since many government organizations demand that the budgets allocated to large publicly funded projects be double-checked by some estimation model Menzies et al. (2017). Non-algorithmic techniques that rely on human judgment Jørgensen (2004) are much harder to audit or dispute (e.g., when the estimate is generated by a senior colleague but disputed by others).
Sarro et al. Sarro et al. (2016) assert that effort estimation is a critical activity for planning and monitoring software project development in order to deliver the product on time and within budget Briand and Wieczorek (2002); Kocaguneli et al. (2011); Trendowicz and Jeffery (2014). The competitiveness of software organizations depends on their ability to accurately predict the effort required for developing software systems; both over- and under-estimates can negatively affect the outcome of software projects Trendowicz and Jeffery (2014); McConnell (2006); Mendes and Mosley (2002); Sommerville (2010).
Hyperparameter optimizers tune the control parameters of a data mining algorithm. It is well established that classification tasks like software defect prediction or text classification are improved by such tuning Fu et al. (2016a); Tantithamthavorn et al. (2018); Agrawal et al. (2018); Agrawal and Menzies (2018). This paper investigates hyperparameter optimization using data from 945 projects; the study is an extensive exploration of hyperparameter optimization and effort estimation, following the earlier work of Corazza et al. Corazza et al. (2013).
We assess our results with respect to recent findings by Arcuri & Fraser Arcuri and Fraser (2013). They caution that to transition hyperparameter optimizers to industry, they need to be fast:
A practitioner, that wants to use such tools, should not be required to run large tuning phases before being able to apply those tools Arcuri and Fraser (2013).
Also, according to Arcuri & Fraser, optimizers must be useful:
At least in the context of test data generation, (tuning) does not seem easy to find good settings that significantly outperform “default” values. … Using “default” values is a reasonable and justified choice, whereas parameter tuning is a long and expensive process that might or might not pay off Arcuri and Fraser (2013).
Hence, to assess such optimization for effort estimation, we ask four questions.
RQ1: To address one concern raised by Arcuri & Fraser, we must first ask is it best to just use “off-the-shelf” defaults? We will find that tuned learners provide better estimates than untuned learners. Hence, for effort estimation: Lesson1: “off-the-shelf” defaults should be deprecated.
RQ2: Can tuning effort be avoided by replacing old defaults with new defaults? This checks if we can run tuning once (and once only) then use those new defaults ever after. We will observe that effort estimation tunings differ extensively from dataset to dataset. Hence, for effort estimation: Lesson2: Overall, there are no “best” default settings.
RQ3: The first two research questions tell us that we must retune our effort estimators whenever new data arrives. Accordingly, we must now address the other concern raised by Arcuri & Fraser about CPU cost. Hence, in this question we ask can we avoid slow hyperparameter optimization? The answer to RQ3 will be “yes” since our results show that for effort estimation: Lesson3: Overall, our slowest optimizers perform no better than faster ones.
RQ4: The final question to answer is what hyperparameter optimizers to use for effort estimation? Here, we report that a certain combination of learners and optimizers usually produces the best results. Further, this particular combination often achieves in a few minutes what other optimizers may need hours to days of CPU to achieve. Hence we will recommend the following combination for effort estimation: Lesson4: For new datasets, try a combination of CART with the optimizers differential evolution and FLASH. (Note: The italicized words are explained below.)
In summary, unlike the test case generation domains explored by Arcuri & Fraser, hyperparameter optimization for effort estimation is both useful and fast.
Overall the contributions of this paper are:
A demonstration that default settings are not the best way to perform effort estimation. Hence, when new data is encountered, some tuning process is required to learn the best settings for generating estimates from that data.
A recognition of the inherent difficulty associated with effort estimation. Since there is no single universally best effort estimation method, commissioning a new effort estimator requires extensive testing. As shown below, this can take hours to days of CPU time.
The identification of a combination of learner and optimizer that works as well as anything else, and which takes minutes to learn an effort estimator.
An extensible open-source architecture called OIL that enables the commissioning of effort estimation methods. OIL makes our results repeatable and refutable.
The rest of this paper is structured as follows. The next section discusses different methods for effort estimation and how to optimize the parameters of effort estimation methods. This is followed by a description of our data, our experimental methods, and our results. After that, a discussion section explores open issues with this work.
From all of the above, we can conclude that (a) Arcuri & Fraser’s pessimism about hyperparameter optimization applies to their test data generation domain. However (b) for effort estimation, hyperparameter optimization is both useful and fast. Hence, we hope that OIL, and the results of this paper, will prompt and enable more research on methods to tune software effort estimators.
Note that OIL and all the data used in this study are freely available for download from https://github.com/arennax/effort_oil_2019
2 About Effort Estimation
Software effort estimation is a method to offer managers approximate advice on how much human effort (usually expressed in terms of hours, days or months of human work) is required to plan, design and develop a software project. Such advice can only ever be approximate due to the dynamic nature of any software development. Nevertheless, it is important to attempt to allocate resources properly in software projects to avoid waste. In some cases, inadequate or overfull funding can cause a considerable waste of resources and time Cowing (2002); Germano and Hufford (2016); Hazrati (2011); Roman (2016). As shown below, effort estimation can be categorized into (a) human-based and (b) algorithm-based methods Kocaguneli et al. (2012); Shepperd (2007).
For several reasons, this paper does not explore human-based estimation methods. Firstly, it is known that humans rarely update their human-based estimation knowledge based on feedback from new projects Jørgensen and Gruschke (2009). Secondly, algorithm-based methods are preferred when estimates have to be audited or debated (since the method is explicit and available for inspection). Thirdly, algorithm-based methods can be run many times (each time applying small mutations to the input data) to understand the range of possible estimates. Even very strong advocates of human-based methods Jørgensen (2015) acknowledge that algorithm-based methods are useful for learning the uncertainty about particular estimates.
2.1 Algorithm-based Methods
There are many algorithmic estimation methods. Some, such as COCOMO Boehm (1981), make assumptions about the attributes in the model. For example, COCOMO requires that data includes 22 specific attributes such as analyst capability (acap) and software complexity (cplx). These attribute assumptions restrict how much data is available for studies like this one. For example, here we explore 945 projects expressed using a wide range of attributes. If we used COCOMO, we could only have accessed an order of magnitude fewer projects.
Due to its attribute assumptions, this paper does not study COCOMO data. All the following learners can accept projects described using any attributes, just as long as one of those is some measure of project development effort.
Whigham et al.’s ATLM method Whigham et al. (2015) is a multiple linear regression model which calculates the effort as $y = \beta_0 + \sum_i \beta_i x_i + \varepsilon$, where the $x_i$ are explanatory attributes and $\varepsilon$ are the errors to the actual value. The prediction weights $\beta_i$ are determined using least square error estimation Neter et al. (1996). Additionally, transformations are applied to the attributes to further minimize the error in the model. For categorical attributes, the standard approach of “dummy variables” Hardy (1993) is applied; for continuous attributes, a logarithmic, square root, or no transformation is chosen so as to minimize the skewness of the attribute. It should be noted that ATLM does not consider relatively complex techniques such as using model residuals, Box-Cox transformations, or step-wise regression (which are standard when developing a linear regression model). The authors made this decision since they intend ATLM to be a simple baseline model rather than the “best” model. And since it can be applied automatically, there should be no excuse not to compare any new model against this comparatively naive baseline.
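To make the above concrete, the following is a minimal sketch of ATLM's core idea (least-squares regression after a per-attribute transform chosen to minimize skewness). The function names and the three candidate transforms are our own illustration and assume strictly positive continuous attributes; this is not the authors' code.

```python
import numpy as np

def least_skewed(x):
    """Pick log, sqrt, or no transform: whichever yields the least skewness."""
    candidates = [np.log(x), np.sqrt(x), x]
    def skew(v):
        d = v - v.mean()
        return abs((d ** 3).mean() / (d ** 2).mean() ** 1.5)
    return min(candidates, key=skew)

def atlm_fit(X, y):
    """Multiple linear regression y = b0 + sum(bi * ti(xi)) via least squares."""
    Xt = np.column_stack([least_skewed(X[:, j]) for j in range(X.shape[1])])
    A = np.column_stack([np.ones(len(y)), Xt])   # prepend intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta
```

A real ATLM implementation would also handle categorical attributes via dummy variables; the sketch above covers only the continuous case.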
Sarro et al. proposed a method named Linear Programming for Effort Estimation (LP4EE) Sarro and Petrozziello (2018), which aims to achieve the best outcome from a mathematical model with a linear objective function subject to linear equality and inequality constraints. The feasible region is given by the intersection of the constraints, and the Simplex linear programming algorithm is able to find a point in that polyhedron where the function has the smallest error in polynomial time. For effort estimation, this model minimizes the Sum of Absolute Residuals (SAR). When a new project is presented to the model, LP4EE predicts the effort as $\hat{y} = \sum_i a_i x_i$, where $x_i$ is the value of the $i$-th project feature and $a_i$ is the corresponding coefficient evaluated by linear programming. LP4EE is suggested as another baseline model for effort estimation since it provides similar or more accurate estimates than ATLM and is much less sensitive than ATLM to multiple data splits and different cross-validation methods.
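The SAR-minimizing coefficients can be found by a standard linear program. The sketch below (our own illustration, not the LP4EE release) introduces one auxiliary variable per project bounding the absolute residual, then minimizes their sum:

```python
import numpy as np
from scipy.optimize import linprog

def lp4ee_fit(X, y):
    """Coefficients minimizing the Sum of Absolute Residuals (SAR).
    Variables: p free coefficients plus n residual magnitudes u_i >= 0,
    with u_i >= |y_i - X_i . a| enforced by two inequality rows per project."""
    n, p = X.shape
    c = np.concatenate([np.zeros(p), np.ones(n)])   # minimize sum(u)
    I = np.eye(n)
    A_ub = np.vstack([np.hstack([X, -I]),            #  X a - u <= y
                      np.hstack([-X, -I])])          # -X a - u <= -y
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * p + [(0, None)] * n
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:p]
```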
Some algorithm-based estimators are regression trees such as CART L.Breiman (1984). CART is a tree learner that divides a dataset, then recurses on each split. If a node contains more than min_sample_split rows, then a split is attempted. On the other hand, if a split contains no more than min_samples_leaf rows, then the recursion stops. CART seeks the attribute whose ranges contain rows with the least variance in effort: if an attribute's ranges $r_i$ are found in $n_i$ rows, each with an effort variance of $v_i$, then CART seeks the split that most minimizes $\sum_i \sqrt{v_i} \times n_i / (\sum_j n_j)$. For more details on the CART parameters, see Table 1.
| Parameter | Default | Tuning range | Description |
|---|---|---|---|
| max_feature | None | [0.01, 1] | The number of features to consider when looking for the best split. |
| max_depth | None | [1, 12] | The maximum depth of the tree. |
| min_sample_split | 2 | [0, 20] | Minimum samples required to split internal nodes. |
| min_samples_leaf | 1 | [1, 12] | Minimum samples required to be at a leaf node. |
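As a sketch of how a tuner samples this space, the snippet below draws one CART configuration from the Table 1 ranges using scikit-learn (which spells the parameters max_features and min_samples_split; the minimum legal min_samples_split there is 2):

```python
import random
from sklearn.tree import DecisionTreeRegressor

def random_cart(rng=random):
    """Draw one CART configuration from the Table 1 tuning ranges."""
    return DecisionTreeRegressor(
        max_features=rng.uniform(0.01, 1.0),   # fraction of features per split
        max_depth=rng.randint(1, 12),
        min_samples_split=rng.randint(2, 20),  # sklearn requires >= 2
        min_samples_leaf=rng.randint(1, 12))

# Tiny illustrative dataset: feature rows and effort values.
X = [[2, 3], [4, 1], [6, 5], [8, 7], [10, 9], [12, 11]]
y = [10, 20, 30, 40, 50, 60]
model = random_cart().fit(X, y)
pred = model.predict([[6, 5]])
```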
Random Forest Breiman (2001) and Support Vector Regression Chang and Lin (2011) are other instances of regression methods. Random Forest (RF) is an ensemble learning method for regression (and classification) tasks that builds a set of trees when training the model. To decide the output, it uses the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Support Vector Regression (SVR) uses kernel functions to project the data onto a new hyperspace where complex non-linear patterns can be simply represented. It aims to construct an optimal hyperplane that fits the data and predicts with minimal empirical risk and complexity of the modelling function.
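As noted later in this paper, RF and SVR are used with their "off-the-shelf" settings. A minimal sketch of that usage via scikit-learn (the dataset values here are our own illustration):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

# Toy project data: [team size, function points] -> effort.
X = [[1, 100], [2, 150], [3, 210], [4, 260], [5, 330], [6, 400]]
y = [12.0, 15.0, 22.0, 26.0, 33.0, 41.0]

rf = RandomForestRegressor(random_state=1).fit(X, y)  # mean of per-tree predictions
svr = SVR().fit(X, y)                                 # default RBF kernel
rf_pred = rf.predict([[3, 210]])[0]
svr_pred = svr.predict([[3, 210]])[0]
```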
Another family of algorithm-based estimators are the analogy-based estimation (ABE) methods advocated by Shepperd and Schofield Shepperd and Schofield (1997). ABE is widely used Peters et al. (2015); Kocaguneli et al. (2015); Hihn and Menzies (2015); Kocaguneli and Menzies (2011); Menzies et al. (2017), in many forms. We say that “ABE0” is the standard form seen in the literature and “ABEN” are the 6,000+ variants of ABE defined below. The general form of ABE (which applies to ABE0 or ABEN) is to first form a table of rows of past projects. The columns of this table are composed of independent variables (the features that define projects) and one dependent feature (project effort). From this table, we learn what similar projects (analogies) to use from the training set when examining a new test instance. For each test instance, ABE then selects $k$ analogies out of the training set. Analogies are selected via a similarity measure. Before calculating similarity, ABE normalizes numerics min..max to 0..1 (so all numerics get an equal chance to influence the dependent). Then, ABE uses feature weighting to reduce the influence of less informative features. Finally, some adaptation strategy is applied to return a combination of the dependent effort values seen in the $k$ nearest analogies. For details on ABE0 and ABEN, see Figure 1 & Table 2.
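The ABE0 pipeline just described (normalize, measure similarity, select $k$ analogies, adapt) can be sketched as follows. This is our own illustration of the standard form, using Euclidean distance and mean-effort adaptation; it omits feature weighting:

```python
import math

def abe0(train, test_row, k=3):
    """ABE0: normalize features to 0..1, find the k nearest analogies by
    Euclidean distance, return the mean of their efforts (adaptation).
    `train` is a list of (feature_list, effort) pairs."""
    cols = list(zip(*[row for row, _ in train]))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    def norm(row):
        return [(v - l) / (h - l) if h > l else 0.0
                for v, l, h in zip(row, lo, hi)]
    t = norm(test_row)
    nearest = sorted(train, key=lambda r: math.dist(t, norm(r[0])))[:k]
    return sum(effort for _, effort in nearest) / k
```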
2.2 Effort Estimation and Hyperparameter Optimization
Note that we do not claim that the above represents all methods for effort estimation. Rather, we say that (a) all the above are either prominent in the literature or widely used; and (b) anyone with knowledge of the current effort estimation literature would be tempted to try some of the above.
Even though our list of effort estimation methods is incomplete, it is still very long. Consider, for example, just the ABEN variants documented in Table 2. There are 6,000+ such variants. Some can be ignored; e.g. at $k=1$, all adaptation mechanisms return the same result, so they are not necessary. Also, not all feature weighting techniques use discretization. But even after those discards, there are still thousands of possibilities.
Given that the space to explore is so large, some researchers have offered automatic support for that exploration. Some of that prior work suffered from being applied to limited data Li et al. (2009).
Other researchers assume that the effort model is a specific parametric form (e.g. the COCOMO equation) and propose mutation methods to adjust the parameters of that equation Aljahdali and Sheta (2010); Moeyersoms et al. (2015); Singh and Misra (2012); Chalotra et al. (2015); Rao et al. (2014). As mentioned above, this approach is hard to test since there are very few datasets using the pre-specified COCOMO attributes.
Further, all that prior work needs to be revisited given the existence of recent and very prominent methods; i.e. ATLM from TOSEM’2015 Whigham et al. (2015) or LP4EE from TOSEM’2018 Sarro and Petrozziello (2018).
Accordingly, this paper conducts a more thorough investigation of hyperparameter optimization for effort estimation.
We use methods with no data feature assumptions (i.e. no COCOMO data);
That vary many parameters (6,000+ combinations);
That also tests results on 9 different sources with data on 945 software projects;
And which benchmark results against prominent methods such as ATLM and LP4EE.
OIL is our architecture for exploring hyperparameter optimization and effort estimation. Initially, our plan was to use standard hyperparameter tuning for this task. Then we learned that (a) standard machine learning toolkits like Scikit-learn Pedregosa et al. (2011) do not include many of the effort estimation techniques; and (b) standard hyperparameter tuners can be slow. Hence, we built OIL:
At the base library layer, we use Scikit-learn Pedregosa et al. (2011).
Above that, OIL has a utilities layer containing all the algorithms missing in Scikit-Learn (e.g., ABEN required numerous additions at the utilities layer).
Higher up, OIL’s modelling layer uses an XML-based domain-specific language to specify a feature map of predicting model options. These feature models are single-parent and-or graphs with (optional) cross-tree constraints showing what options require or exclude other options. A graphical representation of the feature model used in this paper is shown in Figure 1.
Finally, at top-most optimizer layer, there is some optimizer that makes decisions across the feature map. An automatic mapper facility then links those decisions down to the lower layers to run the selected algorithms.
Once OIL’s layers were built, it was simple to “pop the top” and replace the top layer with another optimizer. Nair et al. Nair et al. (2018a) advise that for search-based SE studies, optimizers should be selected via a “dumb+two+next” rule. Here:
“Dumb” are some baseline methods;
“Two” are some well-established optimizers;
“Next” are more recent methods which may not have been applied before to this domain.
For our “dumb” optimizer, we used Random Choice (hereafter, RD). To find valid configurations, RD selects $N$ leaves at random from Figure 1. All these variants are executed and the best one is selected for application to the test set. To maintain a fair comparison with the other systems described below, OIL chooses $N$ to be the same number of evaluations used by those other methods.
Moving on, our “two” well-established optimizers are ATLM Whigham et al. (2015) and LP4EE Sarro and Petrozziello (2018). For LP4EE, we perform experiments with the open source code provided by the original authors. For ATLM, since there is no online source code available, we carefully re-implemented the method ourselves.
As to our “next” optimizers, we used Differential Evolution (hereafter, DE Storn and Price (1997)) and FLASH Nair et al. (2018b).
The premise of DE is that the best way to mutate the existing tunings is to extrapolate between current solutions. Three solutions $a, b, c$ are selected at random. For each tuning parameter $k$, at some probability $cf$, we replace the old tuning $x_k$ with $y_k$. For booleans, $y_k = \neg x_k$; for numerics, $y_k = a_k + f \times (b_k - c_k)$, where $f$ is a parameter controlling the differential weight. The main loop of DE runs over a population of size $np$, replacing old items with new candidates (if the new candidate is better). This means that, as the loop progresses, the population is full of increasingly more valuable solutions (which, in turn, helps extrapolation). As to the control parameters of DE, we set $np$, $cf$ and $f$ using advice from Storn and Fu et al. Storn and Price (1997); Fu et al. (2016a). Also, the number of generations was set to 10 to test the effects of a very CPU-light effort estimator.
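For numeric parameters, the DE loop just described can be sketched as below. The control values shown ($np=20$, $f=0.75$, $cf=0.3$, 10 generations) are illustrative defaults in the spirit of the cited advice, not a claim about this paper's exact settings; lower fitness scores are assumed better:

```python
import random

def de(fitness, bounds, np_=20, f=0.75, cf=0.3, generations=10, rng=random):
    """Differential evolution: mutate by extrapolating between current solutions."""
    dim = len(bounds)
    def clip(v, k):
        lo, hi = bounds[k]
        return max(lo, min(hi, v))
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(np_)]
    scores = [fitness(p) for p in pop]
    for _ in range(generations):
        for i in range(np_):
            a, b, c = rng.sample([p for j, p in enumerate(pop) if j != i], 3)
            new = [clip(a[k] + f * (b[k] - c[k]), k) if rng.random() < cf
                   else pop[i][k] for k in range(dim)]
            s = fitness(new)
            if s < scores[i]:          # keep the better of old vs candidate
                pop[i], scores[i] = new, s
    best = min(range(np_), key=scores.__getitem__)
    return pop[best], scores[best]
```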
FLASH, proposed by Nair et al. Nair et al. (2018b), is an incremental optimizer. Previously, it has been applied to configuring the parameters of software systems. This paper is the first application of FLASH to effort estimation. Formally, FLASH is a sequential model-based optimizer Bergstra et al. (2011) (also known in the machine learning literature as an active learner Das et al. (2016) or, in the statistics literature, as optimal experimental design Olsson (2009)). Whatever the name, the intuition is the same: reflect on the model built to date in order to find the next best example to evaluate. To tune a data miner, FLASH explores possible tunings as follows:
1. Set the evaluation budget $b$. In order to make a fair comparison between FLASH and other methods, we used the same number of evaluations as the other optimizers.
2. Run the data miner using a few randomly selected tunings.
3. Build an archive holding pairs of parameter settings and their resulting performance scores (e.g. MRE, SA, etc).
4. Using that archive, build a surrogate (a small CART tree) and use it to guess the performance scores of the as-yet unevaluated parameter settings. Note that this step is very fast, since all that is required is to run vectors down some very small CART trees.
5. Using some selection function, select the most “interesting” setting. Following Nair et al. Nair et al. (2018b), we returned the setting with the best prediction (i.e. find the most promising possibility).
6. Collect performance scores by evaluating the “interesting” setting using the data miner (i.e. check that promising possibility). Set $b = b - 1$.
7. Add “interesting” to the archive. If $b > 0$, goto step 4. Else, halt.
In summary, given what we already know about the tunings (represented in a CART tree), FLASH finds the potentially best setting; then checks that setting; then updates the model with the results of that check.
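The loop above can be sketched as follows. This is our own illustration of sequential model-based optimization over a finite pool of candidate settings, not the FLASH release; `n0` and `budget` are illustrative values, and lower scores are assumed better:

```python
import random
from sklearn.tree import DecisionTreeRegressor

def flash(evaluate, candidates, n0=12, budget=20, rng=random):
    """Probe a few random settings, then repeatedly: fit a CART surrogate on
    the archive, ask it to guess the score of every unevaluated setting, and
    truly evaluate the most promising guess."""
    pool = list(candidates)
    rng.shuffle(pool)
    archive = [(c, evaluate(c)) for c in pool[:n0]]   # initial random probes
    pool = pool[n0:]
    for _ in range(budget):
        if not pool:
            break
        surrogate = DecisionTreeRegressor().fit(
            [c for c, _ in archive], [s for _, s in archive])
        guesses = surrogate.predict(pool)
        best = min(range(len(pool)), key=lambda i: guesses[i])
        c = pool.pop(best)
        archive.append((c, evaluate(c)))              # check the guess
    return min(archive, key=lambda cs: cs[1])
```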
3 Empirical Study
To assess OIL, we applied it to the 945 projects seen in nine datasets from the SEACRAFT repository (http://tiny.cc/seacraft); see Table 3 and Table 4. This data was selected since it has been widely used in previous estimation research. Also, it is quite diverse: it differs in the number of observations (from 15 to 499 projects); geographical locations (software projects coming from Canada, China, Finland); technical characteristics (software projects developed in different programming languages and for different application domains, ranging from telecommunications to commercial information systems); and the number and type of features (from 6 to 25 features, including a variety of features describing the software projects, such as the number of developers involved in the project and their experience, technologies used, size in terms of Function Points, etc.).
Note that some features of the original datasets are not used in our experiment because they are (1) naturally irrelevant to the effort values (e.g., ID, Syear); (2) unavailable at the prediction phase (e.g., duration, LOC); or (3) highly correlated with, or overlapping, each other (e.g., raw function points and adjusted function points). A data cleaning process was applied to remove such features; they are shown in italics in Table 4.
Each dataset was treated in a variety of ways. Each treatment is an M*N-way cross-validation test of some learners, or some learners and optimizers. That is, $M$ times, shuffle the data randomly (using a different random number seed each time), then divide the data into $N$ bins. For each $i \in \{1..N\}$, bin $i$ is used to test a model built from the other bins. Following the advice of Nair et al. Nair et al. (2018a), we chose a small number of bins, since our effort datasets are small.
As a procedural detail, first we divided the data and then we applied the treatments. That is, all treatments saw the same training and test data.
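The split procedure can be sketched as below; the values $M=5$, $N=3$ are illustrative only. Because the splits are generated up front, every treatment can be given identical train/test data:

```python
import random

def mxn_splits(rows, m=5, n=3, seed=1):
    """M*N-way cross-validation: M times, shuffle with a fresh seed and cut
    the data into N bins; each bin serves once as the test set."""
    for rep in range(m):
        shuffled = rows[:]
        random.Random(seed + rep).shuffle(shuffled)
        bins = [shuffled[i::n] for i in range(n)]
        for i in range(n):
            test = bins[i]
            train = [r for j, b in enumerate(bins) if j != i for r in b]
            yield train, test
```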
3.3 Scoring Metrics
MRE is defined in terms of AR, the magnitude of the absolute residual. This is computed from the difference between predicted and actual effort values:

$AR_i = |actual_i - predicted_i|$

MRE is the magnitude of the relative error, calculated by expressing AR as a ratio of the actual effort:

$MRE_i = AR_i / actual_i$
MRE has been criticized Foss et al. (2003); Kitchenham et al. (2001); Korte and Port (2008); Port and Korte (2008); Shepperd et al. (2000); Stensrud et al. (2003) as being biased towards error underestimations. Nevertheless, we use it here since there exist known baselines for human performance in effort estimation, expressed in terms of MRE Molokken and Jorgensen (2003a). The same cannot be said for SA.
Because of issues with MRE, some researchers prefer the use of other (more standardized) measures, such as Standardized Accuracy (SA) Langdon et al. (2016); Shepperd and MacDonell (2012). SA is defined in terms of the mean absolute error:

$MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$

where $n$ is the number of projects used for evaluating the performance, and $y_i$ and $\hat{y}_i$ are the actual and estimated effort, respectively, for project $i$. SA uses MAE as follows:

$SA = \left(1 - \frac{MAE}{MAE_{guess}}\right) \times 100$

where $MAE$ is the MAE of the approach being evaluated and $MAE_{guess}$ is the MAE of a large number (e.g., 1000 runs) of random guesses. Over many runs, $MAE_{guess}$ will converge on simply using the sample mean Shepperd and MacDonell (2012). That is, SA represents how much better an approach is than random guessing. Values near zero mean that the prediction model is practically useless, performing little better than random guesses Shepperd and MacDonell (2012).
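The two metrics are small enough to state directly in code (here using the sample mean as the convergence point of random guessing):

```python
def mre(actual, predicted):
    """Magnitude of relative error for one project: AR / actual."""
    return abs(actual - predicted) / actual

def mae(actuals, predicteds):
    """Mean absolute error over a set of projects."""
    return sum(abs(a, ) if False else abs(a - p)
               for a, p in zip(actuals, predicteds)) / len(actuals)

def sa(actuals, predicteds):
    """Standardized accuracy: percent improvement over guessing the mean."""
    guess = sum(actuals) / len(actuals)
    mae_guess = mae(actuals, [guess] * len(actuals))
    return (1 - mae(actuals, predicteds) / mae_guess) * 100
```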
This study ranks methods using the Scott-Knott procedure recommended by Mittas & Angelis in their 2013 IEEE TSE paper Mittas and Angelis (2013). This method sorts a list of treatments $l$ by their median score. It then splits $l$ into sub-lists in order to maximize the expected value of the differences in the observed performances before and after the division. For example, we could sort five methods based on their median score, then divide them into sub-lists. Scott-Knott would declare one of these divisions to be “best” as follows. For a list $l$ of size $ls$, split into sub-lists $m$ and $n$ of sizes $ms$ and $ns$, the “best” division maximizes $E(\Delta)$; i.e. the difference in the expected mean value before and after the split:

$E(\Delta) = \frac{ms}{ls}\,|m.\mu - l.\mu|^2 + \frac{ns}{ls}\,|n.\mu - l.\mu|^2$
Scott-Knott then checks if that “best” division is actually useful. To implement that check, Scott-Knott applies some statistical hypothesis test to check if $m$ and $n$ are significantly different. If so, Scott-Knott then recurses on each half of the “best” division. For a more specific example, consider the results from five treatments:
rx1 = [0.34, 0.49, 0.51, 0.60]
rx2 = [0.60, 0.70, 0.80, 0.90]
rx3 = [0.15, 0.25, 0.40, 0.35]
rx4 = [0.60, 0.70, 0.80, 0.90]
rx5 = [0.10, 0.20, 0.30, 0.40]

After sorting and division, Scott-Knott declares (assuming lower is better): rx5 and rx3 are ranked #1 (their scores are statistically indistinguishable), rx1 is ranked #2, and rx2 and rx4 are ranked #3.
The hypothesis test used within Scott-Knott includes the A12 effect size test, an expression which computes the probability that numbers in one sample are bigger than in another. This test was endorsed by Arcuri and Briand at ICSE’11 Arcuri and Briand (2011).
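The A12 statistic is simple to compute; a minimal sketch (ties count half, following the usual Vargha-Delaney formulation):

```python
def a12(xs, ys):
    """Probability (counting ties as half) that a value drawn from xs
    is bigger than a value drawn from ys."""
    gt = eq = 0
    for x in xs:
        for y in ys:
            gt += x > y
            eq += x == y
    return (gt + 0.5 * eq) / (len(xs) * len(ys))
```

Values of 0.5 mean the two samples are indistinguishable; values far from 0.5 denote a larger effect.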
The results from each test set are evaluated in terms of two scoring metrics: the magnitude of the relative error (MRE) Conte et al. (1986) and Standardized Accuracy (SA). These scoring metrics are defined in Table 5. We use these since there are advocates for each in the literature. For example, Shepperd and MacDonell argue convincingly for the use of SA Shepperd and MacDonell (2012) (as well as for the use of effect size tests in effort estimation). Also, in 2016, Sarro et al. (http://tiny.cc/sarro16gecco) used MRE to argue that their estimators were competitive with human estimates (which Molokken et al. Molokken and Jorgensen (2003b) say lie within 30% and 40% of the true value).
Note that for these evaluation measures:
MRE values: smaller are better
SA values: larger are better
From the cross-validations, we report the median (termed med), which is the 50th percentile of the test scores seen in the M*N results. Also reported is the inter-quartile range (termed IQR), which is the (75-25)th percentile range. The IQR is a non-parametric description of the variability about the median value.
For each dataset, the results from an M*N-way cross-validation are sorted by their median value, then ranked using the Scott-Knott test recommended for ranking effort estimation experiments by Mittas et al. in TSE’13 Mittas and Angelis (2013).
For full details on the Scott-Knott test, see Table 6. In summary, Scott-Knott is a top-down bi-clustering method that recursively divides sorted treatments. Division stops when there is only one treatment left, or when a division of numerous treatments generates splits that are statistically indistinguishable.
To judge when two sets of treatments are indistinguishable, we use a conjunction of both a 95% bootstrap significance test Efron and Tibshirani (1993) and an A12 test for a non-small effect size difference in the distributions Menzies et al. (2017). These tests were used since their non-parametric nature avoids issues with non-Gaussian distributions.
Table 7 shows an example of the report generated by our Scott-Knott procedure. Note that when multiple treatments tie for Rank=1, then we use the treatment’s runtimes to break the tie. Specifically, for all treatments in Rank=1, we mark the faster ones as Rank=1.
3.4 Terminology for Optimizers
Some treatments are named “X_Y” which denote learner “X” tuned by optimizer “Y”. In the following:
Note that we do not tune ATLM and LP4EE since they were designed to be used “off-the-shelf”. Whigham et al. Whigham et al. (2015) declare that one of ATLM’s most important features is that it does not need tuning. We also do not tune SVR and RF since we treat them as baseline algorithm-based methods in our benchmarks (i.e. we use the default settings in scikit-learn for these algorithms).
Table 8 shows the runtimes (in minutes) for one of our N*M experiments for each dataset. From the last column of that table, we see that the median to maximum runtimes per dataset range from:
24 to 54 minutes, for one-way;
Hence 8 to 18 hours, for the 20 repeats of our N*M experiments.
Performance scores for all datasets are shown in Table 9 and Table 10. We observe that ATLM and LP4EE performed as expected. Whigham et al. Whigham et al. (2015) and Sarro et al. Sarro and Petrozziello (2018) designed these methods to serve as baselines against which other treatments can be compared. Hence, it might be expected that in some cases these methods will perform comparatively below other methods. This was certainly the case here; as seen in Table 9 and Table 10, these baseline methods are top-ranked in 8/18 datasets.
Another thing to observe in Table 9 and Table 10 is that random search (RD) also performed as expected; i.e. it was never top-ranked. This is a gratifying result since, had random search performed otherwise, that would tend to negate the value of hyperparameter optimization.
We also see in Table 9 empirical evidence that many of our methods achieve human-competitive results. Molokken and Jorgensen Molokken and Jorgensen (2003a)’s survey of current industry practices reports that human-expert predictions of project effort lie within 30% and 40% of the true value; i.e. MRE between 30% and 40%. Applying that range to Table 9, we see that in 6/9 datasets, the best estimator has MRE under 30%; i.e. it lies comfortably within the stated human-based industrial thresholds. Also, in a further 2/9 datasets, the best estimator has MRE under 40%; i.e. it is close to the performance of humans.
The exception to the results of the last paragraph is isbg10, where even the best estimator has an MRE far above 40%; i.e. our best performance is nowhere close to that of human estimators. In future work, we recommend that researchers use isbg10 as a “stress test” for new methods.
4.2 Answers to Research Questions
Turning now to the research questions listed in the introduction:
RQ1: Is it best just to use the “off-the-shelf” defaults?
As mentioned in the introduction, Arcuri & Fraser note that for test case generation, using the default settings can work just as well as anything else. We can see some evidence of this effect in Table 9 and Table 10. Observe, for example, the kitchenham results where the untuned ABE0 treatment achieves Rank=1.
However, overall, Table 9 and Table 10 are negative on the use of default settings. For example, in the “albrecht”, “desharnais”, and “finnish” datasets, not one treatment that uses the defaults is found at Rank=1. Overall, if we always used just one of the methods with defaults (LP4EE, ATLM, ABE0), that would achieve best ranks in only 8/18 datasets.
Another aspect to note in the Table 9 and Table 10 results is the large difference in performance scores between the best and worst treatments (exceptions: miyazaki’s MRE and SA scores do not vary much, and neither do isbg10’s SA scores). That is, there is much to be gained by using the Rank=1 treatments and deprecating the rest.
In summary, using the defaults is recommended only for a minority of datasets. Also, in terms of better test scores, there is much to be gained from tuning. Hence:
Lesson1: “Off-the-shelf” defaults should be deprecated.
RQ2: Can we replace the old defaults with new defaults?
If the hyperparameter tunings found by this paper were nearly always the same, then this study could conclude by recommending better values for default settings. This would be a most convenient result since, in future when new data arrives, the complexities of this study would not be needed.
Unfortunately, this turns out not to be the case. Table 11 shows the percent frequencies with which some tuning decision appears in our M*N-way cross-validations (this table uses results from DE tuning CART since, as shown below, this usually leads to the best results). Note that in those results it is not true that, across most datasets, there is a setting that is usually selected (though min_samples_leaf less than 3 is often a popular setting). Accordingly, we say that Table 11 shows that there is much variation in the best tunings. Hence, for effort estimation:
Lesson2: Overall, there are no “best” default settings.
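For readers unfamiliar with DE, the tuner behind the Table 11 results works roughly as follows. This is a minimal DE/rand/1/bin sketch; the parameter bounds and function names are illustrative, not our exact implementation:

```python
import random

# Illustrative bounds for the four CART options discussed in the text.
BOUNDS = [(0.25, 1.00),  # max_features (fraction of features kept)
          (1, 12),       # max_depth
          (2, 20),       # min_samples_split
          (1, 12)]       # min_samples_leaf

def clip(x, lo, hi):
    return max(lo, min(hi, x))

def de_tune(objective, pop_size=10, generations=10, f=0.75, cr=0.3, seed=1):
    """Minimize objective(params) over BOUNDS with DE/rand/1/bin."""
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for lo, hi in BOUNDS] for _ in range(pop_size)]
    scores = [objective(p) for p in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # pick three distinct members (other than i) to build the mutant
            a, b, c = rng.sample(pop[:i] + pop[i + 1:], 3)
            trial = [clip(a[k] + f * (b[k] - c[k]), *BOUNDS[k])
                     if rng.random() < cr else pop[i][k]
                     for k in range(len(BOUNDS))]
            s = objective(trial)
            if s < scores[i]:            # greedy selection: keep the better one
                pop[i], scores[i] = trial, s
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]
```

In our setting, `objective` would decode a candidate (rounding the integer options), train CART on the training fold, and return its error on the tuning fold.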
Before going on, one curious aspect of the Table 11 results is the %max_features results: it was rarely most useful to use all features. Except for finnish and china, the best results were often obtained after discarding (at random) a quarter to three-quarters of the features. This is a clear indication that, in future work, it might be advantageous to explore more feature selection for CART models.
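A sketch of how that random feature discarding is expressed (assuming scikit-learn, where CART’s max_features option accepts a fraction of features to consider per split; the helper name is ours):

```python
from sklearn.tree import DecisionTreeRegressor

def cart_with_feature_fraction(fraction, seed=1):
    """CART that considers only a random `fraction` of the features
    when searching for each split (fraction in (0, 1])."""
    return DecisionTreeRegressor(max_features=fraction, random_state=seed)

# e.g., discarding three-quarters of the features corresponds to
# cart_with_feature_fraction(0.25).
```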
RQ3: Can we avoid slow hyperparameter optimization?
Some methods in our experiments (ABEN_RD and ABEN_DE) are slower than others, even with the same number of evaluations, as shown in Table 8. Is it possible to avoid such slow runtimes?
Long and slow optimization times are justified when the extra exploration leads to better solutions. Such better solutions from slower optimizations are rarely found in Table 9 and Table 10 (only in 2/18 cases: see the ABEN_DE results for kitchenham and china). Further, the size of the improvements seen with the slower optimizers over the best Rank=2 treatments is small. Those improvements also come at a runtime cost: in Table 8, the slower optimizers are an order of magnitude slower than other methods. Hence, we say that for effort estimation:
Lesson3: Overall, our slowest optimizers perform no better than faster ones.
RQ4: Which hyperparameter optimizers should be used for effort estimation?
When we discuss this work with our industrial colleagues, they want to know “the bottom line”; i.e., what they should use or, at the very least, what they should not use. This section offers that advice. We stress that this section is based on the above results; clearly, these recommendations would need to be revised whenever new results come to hand.
Based on the above we can assert that using all the estimators mentioned above is not recommended (to say the least):
For one thing, many of them never appear in our top-ranked results.
For another thing, testing all of them on new datasets would be needlessly expensive. Recall our rig: 20 repeats over the data, where each of those repeats includes the slower estimators shown in Table 8. As seen in that table, the median to maximum runtimes for such an analysis on a single dataset would be 8 to 18 hours (i.e., hours to days).
Table 12 lists the best that can be expected if an engineer chooses one of the estimators in our experiment and applies it to all our datasets. The fractions shown at right come from counting optimizer frequencies in the top ranks of Table 9 and Table 10. Note that the champion in our experiment is “CART_FLASH”, which ranked first in 16 of the 18 cases. A close runner-up is “CART_DE”, which won in two fewer cases. These two estimators performed well in most cases in the experiment.
Beside the two top methods, none of the other estimators reached even half of all cases, including the untuned baseline methods (ATLM, LP4EE). Hence, we cannot endorse their use for generating estimates to be shown to business managers. That said, we do still endorse their use as baseline methods, for methodological reasons, in effort estimation research (they are useful for generating a quick result against which we can compare other, better methods).
Hence, based on the results of Table 12, for similar effort estimation tasks, we recommend:
Lesson4: For new datasets, try a combination of CART with the optimizers differential evolution and FLASH.
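To unpack Lesson4: DE mutates a whole population of candidate tunings, while FLASH builds a cheap CART surrogate of the tuning landscape and spends its budget only on the candidate that looks most promising. A minimal FLASH-style sketch, after Nair et al. (assuming scikit-learn; the function name, pool, and budget are illustrative):

```python
import random
from sklearn.tree import DecisionTreeRegressor

def flash(pool, measure, init=4, budget=12, seed=1):
    """Return the best (config, score) found within `budget` calls to the
    expensive `measure` function. `pool` holds candidate configurations."""
    rng = random.Random(seed)
    pool = list(pool)
    rng.shuffle(pool)
    seen = [(cfg, measure(cfg)) for cfg in pool[:init]]   # a few random probes
    rest = pool[init:]
    while rest and len(seen) < budget:
        surrogate = DecisionTreeRegressor(random_state=seed)
        surrogate.fit([c for c, _ in seen], [s for _, s in seen])
        guesses = surrogate.predict(rest)                 # cheap predictions
        pick = min(range(len(rest)), key=lambda i: guesses[i])
        cfg = rest.pop(pick)
        seen.append((cfg, measure(cfg)))                  # measure only that one
    return min(seen, key=lambda cs: cs[1])
```

The design point is that the surrogate’s predictions are nearly free, so the number of expensive measurements is capped by `budget` regardless of the pool size.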
5 Threats to Validity
Internal Bias: Many of our methods contain stochastic random operators. To reduce the bias from those operators, we repeated our experiment 20 times and applied statistical tests to remove spurious distinctions.
Parameter Bias: For other studies, this is a significant question since (as shown above) the settings to the control parameters of the learners can have a positive effect on the efficacy of the estimation. That said, recall that much of the technology of this paper concerned methods to explore the space of possible parameters. Hence we assert that this study suffers much less parameter bias than other studies.
Sampling Bias: While we tested OIL on nine datasets, it would be inappropriate to conclude that OIL tuning always performs better than other methods for all datasets. As researchers, what we can do to mitigate this problem is to carefully document our method, release our code, and encourage the community to try this method on more datasets, as the occasion arises.
6 Related Work
In software engineering, hyperparameter optimization techniques have been applied in some sub-domains but have yet to be adopted in many others. One way to characterize this paper is as an attempt to adapt recent work on hyperparameter optimization in software defect prediction to effort estimation. Note that, as in defect prediction, this article has also concluded that differential evolution is a useful method.
Several SE defect prediction techniques rely on static code attributes Krishna et al. (2016); Nam and Kim (2015); Tan et al. (2015). Much of that work has focused on finding and employing complex and “off-the-shelf” machine learning models Menzies et al. (2007); Moser et al. (2008); Elish and Elish (2008), without any hyperparameter optimization. According to a literature review by Fu et al. Fu et al. (2016b), as shown in Figure 2, nearly 80% of highly cited papers in defect prediction do not mention parameter tuning (so they rely on the default parameter settings of the prediction models).
Gao et al. Gao et al. (2011) acknowledged the impacts of the parameter tuning for software quality prediction. For example, in their study, “distanceWeighting” parameter was set to “Weight by 1/distance”, the KNN parameter “k” was set to “30”, and the “crossValidate” parameter was set to “true”. However, they did not provide any further explanation about their tuning strategies.
As to methods of tuning, Bergstra and Bengio Bergstra and Bengio (2012) comment that grid search (for each tunable option, run nested for-loops to explore its range) is very popular since (a) such a simple search gives researchers some degree of insight; (b) grid search has very little technical overhead for its implementation; (c) it is simple to automate and parallelize; and (d) on a computing cluster, it can find better tunings than sequential optimization (in the same amount of time). That said, Bergstra and Bengio deprecate grid search since that style of search is no more effective than more randomized searches if the underlying search space is inherently low-dimensional. This remark is particularly relevant to effort estimation since datasets in this domain are often low-dimensional Kocaguneli et al. (2013).
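The nested-loops view of grid search described above can be written directly (a sketch; the grid values are illustrative, not a grid used in this study):

```python
from itertools import product

# Every combination of every tunable option is tried exhaustively.
GRID = {
    "max_depth": [3, 6, 9, 12],
    "min_samples_split": [5, 10, 15, 20],
    "min_samples_leaf": [3, 6, 9, 12],
}

def grid_search(objective):
    """Score every combination in GRID; return the best (params, score)."""
    names = list(GRID)
    best = None
    for values in product(*(GRID[n] for n in names)):   # the nested for-loops
        params = dict(zip(names, values))
        score = objective(params)
        if best is None or score < best[1]:
            best = (params, score)
    return best  # 4 * 4 * 4 = 64 evaluations, however flat the landscape is
```

Note how the cost is fixed by the grid size, not by how quickly a good region is found; this is the inefficiency that randomized and evolutionary searches avoid.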
Lessmann et al. Lessmann et al. (2008) used grid search to tune parameters as part of their extensive analysis of different algorithms for defect prediction. However, they only tuned a small set of their learners while they used the default settings for the rest. Our conjecture is that the overall cost of their tuning was too expensive so they chose only to tune the most critical part.
Two recent studies about investigating the effects of parameter tuning on defect prediction were conducted by Tantithamthavorn et al. Tantithamthavorn et al. (2016, 2018) and Fu et al. Fu et al. (2016a). Tantithamthavorn et al. also used grid search while Fu et al. used differential evolution. Both of the papers concluded that tuning rarely makes performance worse across a range of performance measures (precision, recall, etc.). Fu et al. Fu et al. (2016a) also report that different datasets require different hyperparameters to maximize performance.
One major difference between the studies of Fu et al. Fu et al. (2016a) and Tantithamthavorn et al. Tantithamthavorn et al. (2016) was the computational costs of their experiments. Since Fu et al.’s differential evolution based method had a strict stopping criterion, it was significantly faster.
Note that there are several other methods for hyperparameter optimization, and we aim to explore several of them in future work. But as shown here, it requires much work to create and extract conclusions from a hyperparameter optimizer. One goal of this work, which we think we have achieved, was to identify a simple baseline method against which subsequent work can be benchmarked.
7 Conclusions and Future Work
Hyperparameter optimization is known to improve the performance of many software analytics tasks such as software defect prediction or text classification Agrawal and Menzies (2018); Agrawal et al. (2018); Fu et al. (2016a); Tantithamthavorn et al. (2018). Most prior work on hyperparameter optimization for effort estimation only explored very small datasets Li et al. (2009) or used estimators that are not representative of the state of the art Whigham et al. (2015); Sarro and Petrozziello (2018). Other researchers assume that the effort model has a specific parametric form (e.g., the COCOMO equation), which greatly limits the amount of data that can be studied. Further, all that prior work needs to be revisited given the existence of recent and very prominent methods; i.e., ATLM from TOSEM’15 Whigham et al. (2015) and LP4EE from TOSEM’18 Sarro and Petrozziello (2018).
Accordingly, this paper conducts a thorough investigation of hyperparameter optimization for effort estimation using methods (a) that make no data feature assumptions (i.e., no COCOMO data); (b) that vary many parameters (6,000+ combinations); (c) that test the results on nine different sources with data on 945 software projects; (d) that use optimizers representative of the state of the art (DE Storn and Price (1997), FLASH Nair et al. (2018b)); and (e) that benchmark results against prominent methods such as ATLM and LP4EE.
These results were assessed with respect to Arcuri and Fraser’s concerns mentioned in the introduction; i.e., sometimes hyperparameter optimization can be both too slow and ineffective. Such pessimism may indeed apply to the test data generation domain. However, the results of this paper show that there exist other domains, like effort estimation, where hyperparameter optimization is both useful and fast. After applying hyperparameter optimization, large improvements in effort estimation accuracy were observed (measured in terms of standardized accuracy). From those results, we can recommend using a combination of regression trees (CART) tuned by differential evolution and FLASH. This particular combination of learner and optimizers can achieve in a few minutes what other optimizers need much longer CPU time to accomplish.
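The standardized accuracy measure used to report those improvements (after Shepperd and MacDonell) can be sketched as follows, using the exact mean of the random-guessing baseline (cf. Langdon et al.’s MARP0); function names are ours:

```python
def mae(actuals, predictions):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actuals, predictions)) / len(actuals)

def sa(actuals, predictions):
    """Standardized Accuracy: percentage improvement over random guessing,
    where the guessing baseline 'predicts' each project with the actual
    effort of a different project chosen at random (exact mean form)."""
    n = len(actuals)
    guess = sum(abs(a - b)
                for i, a in enumerate(actuals)
                for j, b in enumerate(actuals) if i != j) / (n * (n - 1))
    return (1 - mae(actuals, predictions) / guess) * 100
```

SA of 100 means perfect prediction; SA at or below 0 means the estimator is no better than guessing.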
This study is one of the most extensive explorations of hyperparameter optimization and effort estimation yet undertaken. Still, there are very many options not explored here. Our current plans for future work include the following:
Try other learners: e.g., neural nets, Bayesian learners, or AdaBoost;
Try other data pre-processors. We noted above that it was curious that max_features was often less than 100%. This is a clear indication that we might be able to further improve our estimation results by adding more intelligent feature selection to, say, CART.
Other optimizers. For example, combining DE and FLASH might be a fruitful way to proceed.
Yet another possible future direction could be hyper-hyperparameter optimization. In the above, we used optimizers like differential evolution to tune learners. But these optimizers have their own control parameters. Perhaps there are better settings for the optimizers, which could be found via hyper-hyperparameter optimization?
Hyper-hyperparameter optimization could be a very slow process. Hence, results like those in this paper could be most useful, since here we have identified which optimizers are very fast and which are very slow (and the latter would not be suitable for hyper-hyperparameter optimization).
In any case, we hope that OIL and the results of this paper will prompt and enable more research on better methods to tune software effort estimators. To that end, we have placed our scripts and data online at https://github.com/arennax/effort_oil_2019
This work was partially funded by a National Science Foundation Award 1703487.
- Agrawal and Menzies (2018) Agrawal A, Menzies T (2018) “Better data” is better than “better data miners” (benefits of tuning SMOTE for defect prediction). In: ICSE’18
- Agrawal et al. (2018) Agrawal A, Fu W, Menzies T (2018) What is wrong with topic modeling? and how to fix it using search-based software engineering. IST Journal
- Aljahdali and Sheta (2010) Aljahdali S, Sheta AF (2010) Software effort estimation by tuning COCOMO model parameters using differential evolution. In: Computer Systems and Applications (AICCSA), 2010 IEEE/ACS International Conference on, IEEE, pp 1–6
- Angelis and Stamelos (2000) Angelis L, Stamelos I (2000) A simulation tool for efficient analogy based cost estimation. EMSE 5(1):35–68
- Arcuri and Briand (2011) Arcuri A, Briand L (2011) A practical guide for using statistical tests to assess randomized algorithms in software engineering. In: Software Engineering (ICSE), 2011 33rd International Conference on, IEEE, pp 1–10
- Arcuri and Fraser (2013) Arcuri A, Fraser G (2013) Parameter tuning or default values? an empirical investigation in search-based software engineering. ESE 18(3):594–623
- Atkinson-Abutridy et al. (2003) Atkinson-Abutridy J, Mellish C, Aitken S (2003) A semantically guided and domain-independent evolutionary model for knowledge discovery from texts. IEEE Transactions on Evolutionary Computation 7(6):546–560
- Baker (2007) Baker DR (2007) A hybrid approach to expert and model based effort estimation. West Virginia University
- Bergstra and Bengio (2012) Bergstra J, Bengio Y (2012) Random search for hyper-parameter optimization. J Mach Learn Res 13(1):281–305
- Bergstra et al. (2011) Bergstra JS, Bardenet R, Bengio Y, Kégl B (2011) Algorithms for hyper-parameter optimization. In: Advances in neural information processing systems, pp 2546–2554
- Boehm (1981) Boehm BW (1981) Software engineering economics. Prentice-Hall
- Breiman (2001) Breiman L (2001) Random forests. Machine learning 45(1):5–32
- Briand and Wieczorek (2002) Briand LC, Wieczorek I (2002) Resource estimation in software engineering. Encyclopedia of software engineering
- Chalotra et al. (2015) Chalotra S, Sehra SK, Brar YS, Kaur N (2015) Tuning of cocomo model parameters by using bee colony optimization. Indian Journal of Science and Technology 8(14)
- Chang and Lin (2011) Chang CC, Lin CJ (2011) Libsvm: a library for support vector machines. ACM transactions on intelligent systems and technology (TIST) 2(3):27
- Chang (1974) Chang CL (1974) Finding prototypes for nearest neighbor classifiers. TC 100(11)
- Conte et al. (1986) Conte SD, Dunsmore HE, Shen VY (1986) Software Engineering Metrics and Models. Benjamin-Cummings Publishing Co., Inc., Redwood City, CA, USA
- Corazza et al. (2013) Corazza A, Di Martino S, Ferrucci F, Gravino C, Sarro F, Mendes E (2013) Using tabu search to configure support vector regression for effort estimation. Empirical Software Engineering 18(3):506–546
- Cowing (2002) Cowing K (2002) Nasa to shut down checkout & launch control system. http://www.spaceref.com/news/viewnews.html?id=475
- Das et al. (2016) Das S, Wong W, Dietterich T, Fern A, Emmott A (2016) Incorporating expert feedback into active anomaly discovery. In: 2016 IEEE 16th International Conference on Data Mining (ICDM), pp 853–858, DOI 10.1109/ICDM.2016.0102
- Efron and Tibshirani (1993) Efron B, Tibshirani J (1993) Introduction to bootstrap. Chapman & Hall
- Elish and Elish (2008) Elish KO, Elish MO (2008) Predicting defect-prone software modules using support vector machines. J Syst Softw 81(5):649–660, DOI 10.1016/j.jss.2007.07.040
- Foss et al. (2003) Foss T, Stensrud E, Kitchenham B, Myrtveit I (2003) A simulation study of the model evaluation criterion mmre. TSE 29(11):985–995
- Frank et al. (2002)
- Fu et al. (2016a) Fu W, Menzies T, Shen X (2016a) Tuning for software analytics: Is it really necessary? IST Journal 76:135–146
- Fu et al. (2016b) Fu W, Nair V, Menzies T (2016b) Why is differential evolution better than grid search for tuning defect predictors? arXiv preprint arXiv:160902613
- Gao et al. (2011) Gao K, Khoshgoftaar TM, Wang H, Seliya N (2011) Choosing software metrics for defect prediction: an investigation on feature selection techniques. Software: Practice and Experience 41(5):579–606
- Germano and Hufford (2016) Germano S, Hufford A (2016) Finish line to close 25% of stores and replace ceo glenn lyon. https://www.wsj.com/articles/finish-line-to-close-25-of-stores-swaps-ceo-1452171033
- Hall and Holmes (2003) Hall MA, Holmes G (2003) Benchmarking attribute selection techniques. TKDE 15(6):1437–1447
- Hardy (1993) Hardy MA (1993) Regression with dummy variables, vol 93. Sage
- Hazrati (2011) Hazrati V (2011) It projects: 400% over-budget and only 25% of benefits realized. https://www.infoq.com/news/2011/10/risky-it-projects
- Hihn and Menzies (2015) Hihn J, Menzies T (2015) Data mining methods and cost estimation models: Why is it so hard to infuse new ideas? In: ASEW, pp 5–9, DOI 10.1109/ASEW.2015.27
- Jørgensen (2004) Jørgensen M (2004) A review of studies on expert estimation of software development effort. JSS 70(1-2):37–60
- Jørgensen (2015) Jørgensen M (2015) The world is skewed: Ignorance, use, misuse, misunderstandings, and how to improve uncertainty analyses in software development projects
- Jørgensen and Gruschke (2009) Jørgensen M, Gruschke TM (2009) The impact of lessons-learned sessions on effort estimation and uncertainty assessments. TSE 35(3):368–383
- Kampenes et al. (2007) Kampenes VB, Dybå T, Hannay JE, Sjøberg DI (2007) A systematic review of effect size in software engineering experiments. Information and Software Technology 49(11-12):1073–1086
- Kemerer (1987) Kemerer CF (1987) An empirical validation of software cost estimation models. CACM 30(5):416–429
- Keung et al. (2013) Keung J, Kocaguneli E, Menzies T (2013) Finding conclusion stability for selecting the best effort predictor in software effort estimation. ASE 20(4):543–567, DOI 10.1007/s10515-012-0108-5
- Keung et al. (2008) Keung JW, Kitchenham BA, Jeffery DR (2008) Analogy-x: Providing statistical inference to analogy-based software cost estimation. TSE 34(4):471–484
- Kitchenham et al. (2001) Kitchenham BA, Pickard LM, MacDonell SG, Shepperd MJ (2001) What accuracy statistics really measure. IEEE Software 148(3):81–85
- Kocaguneli and Menzies (2011) Kocaguneli E, Menzies T (2011) How to find relevant data for effort estimation? In: ESEM, pp 255–264, DOI 10.1109/ESEM.2011.34
- Kocaguneli et al. (2011) Kocaguneli E, Misirli AT, Caglayan B, Bener A (2011) Experiences on developer participation and effort estimation. In: SEAA’11, IEEE, pp 419–422
- Kocaguneli et al. (2012) Kocaguneli E, Menzies T, Bener A, Keung JW (2012) Exploiting the essential assumptions of analogy-based effort estimation. TSE 38(2):425–438
- Kocaguneli et al. (2013) Kocaguneli E, Menzies T, Keung J, Cok D, Madachy R (2013) Active learning and effort estimation: Finding the essential content of software effort estimation data. IEEE Transactions on Software Engineering 39(8):1040–1053
- Kocaguneli et al. (2015) Kocaguneli E, Menzies T, Mendes E (2015) Transfer learning in effort estimation. ESE 20(3):813–843, DOI 10.1007/s10664-014-9300-5
- Korte and Port (2008) Korte M, Port D (2008) Confidence in software cost estimation results based on mmre and pred. In: PROMISE’08, pp 63–70
- Krishna et al. (2016) Krishna R, Menzies T, Fu W (2016) Too much automation? the bellwether effect and its implications for transfer learning. In: IEEE/ACM ICSE, ASE 2016, DOI 10.1145/2970276.2970339
- Langdon et al. (2016) Langdon WB, Dolado J, Sarro F, Harman M (2016) Exact mean absolute error of baseline predictor, marp0. IST 73:16–18
- Breiman et al. (1984) Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and Regression Trees. Wadsworth
- Lessmann et al. (2008) Lessmann S, Baesens B, Mues C, Pietsch S (2008) Benchmarking classification models for software defect prediction: A proposed framework and novel findings. IEEE Transactions on Software Engineering 34(4):485–496, DOI 10.1109/TSE.2008.35
- Li et al. (2009) Li Y, Xie M, Goh TN (2009) A study of project selection and feature weighting for analogy based software cost estimation. JSS 82(2):241–252
- McConnell (2006) McConnell S (2006) Software estimation: demystifying the black art. Microsoft press
- Mendes and Mosley (2002) Mendes E, Mosley N (2002) Further investigation into the use of cbr and stepwise regression to predict development effort for web hypermedia applications. In: ESEM’02, IEEE, pp 79–90
- Mendes et al. (2003) Mendes E, Watson I, Triggs C, Mosley N, Counsell S (2003) A comparative study of cost estimation models for web hypermedia applications. ESE 8(2):163–196
- Menzies and Zimmermann (2018) Menzies T, Zimmermann T (2018) Software analytics: What’s next? IEEE Software 35(5):64–70
- Menzies et al. (2006) Menzies T, Chen Z, Hihn J, Lum K (2006) Selecting best practices for effort estimation. TSE 32(11):883–895
- Menzies et al. (2007) Menzies T, Greenwald J, Frank A (2007) Data mining static code attributes to learn defect predictors. IEEE Transactions on Software Engineering 33(1):2–13, DOI 10.1109/TSE.2007.256941
- Menzies et al. (2017) Menzies T, Yang Y, Mathew G, Boehm B, Hihn J (2017) Negative results for software effort estimation. ESE 22(5):2658–2683, DOI 10.1007/s10664-016-9472-2
- Menzies et al. (2018) Menzies T, Majumder S, Balaji N, Brey K, Fu W (2018) 500+ times faster than deep learning: (a case study exploring faster methods for text mining stackoverflow). In: 2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR), IEEE, pp 554–563
- Mittas and Angelis (2013) Mittas N, Angelis L (2013) Ranking and clustering software cost estimation models through a multiple comparisons algorithm. IEEE Trans SE 39(4):537–551, DOI 10.1109/TSE.2012.45
- Moeyersoms et al. (2015) Moeyersoms J, Junqué de Fortuny E, Dejaeger K, Baesens B, Martens D (2015) Comprehensible software fault and effort prediction. J Syst Softw 100(C):80–90, DOI 10.1016/j.jss.2014.10.032
- Molokken and Jorgensen (2003a) Molokken K, Jorgensen M (2003a) A review of software surveys on software effort estimation. In: 2003 International Symposium on Empirical Software Engineering, 2003. ISESE 2003. Proceedings., pp 223–230, DOI 10.1109/ISESE.2003.1237981
- Molokken and Jorgensen (2003b) Molokken K, Jorgensen M (2003b) A review of software surveys on software effort estimation. In: Empirical Software Engineering, 2003. ISESE 2003. Proceedings. 2003 International Symposium on, IEEE, pp 223–230
- Moser et al. (2008) Moser R, Pedrycz W, Succi G (2008) A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction. In: 30th ICSE, DOI 10.1145/1368088.1368114
- Nair et al. (2018a) Nair V, Agrawal A, Chen J, Fu W, Mathew G, Menzies T, Minku LL, Wagner M, Yu Z (2018a) Data-driven search-based software engineering. In: MSR
- Nair et al. (2018b) Nair V, Yu Z, Menzies T, Siegmund N, Apel S (2018b) Finding faster configurations using flash. IEEE Transactions on Software Engineering pp 1–1, DOI 10.1109/TSE.2018.2870895
- Nam and Kim (2015) Nam J, Kim S (2015) Heterogeneous defect prediction. In: 10th FSE, ESEC/FSE 2015, DOI 10.1145/2786805.2786814
- Neter et al. (1996) Neter J, Kutner MH, Nachtsheim CJ, Wasserman W (1996) Applied linear statistical models, vol 4. Irwin Chicago
- Olsson (2009) Olsson F (2009) A literature survey of active machine learning in the context of natural language processing
- Pedregosa et al. (2011) Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J (2011) Scikit-learn: Machine learning in python. JMLR 12(Oct):2825–2830
- Peters et al. (2015) Peters T, Menzies T, Layman L (2015) Lace2: Better privacy-preserving data sharing for cross project defect prediction. In: ICSE, vol 1, pp 801–811, DOI 10.1109/ICSE.2015.92
- Port and Korte (2008) Port D, Korte M (2008) Comparative studies of the model evaluation criterion mmre and pred in software cost estimation research. In: ESEM’08, pp 51–60
- Quinlan (1992) Quinlan JR (1992) Learning with continuous classes. In: 5th Australian joint conference on artificial intelligence, Singapore, vol 92, pp 343–348
- Rao et al. (2014) Rao GS, Krishna CVP, Rao KR (2014) Multi objective particle swarm optimization for software cost estimation. In: ICT and Critical Infrastructure: Proceedings of the 48th Annual Convention of Computer Society of India-Vol I, Springer, pp 125–132
- Roman (2016) Roman K (2016) Federal government’s canada.ca project ‘off the rails’ https://www.cbc.ca/news/politics/canadaca-federal-website-delays-1.3893254
- Sarro and Petrozziello (2018) Sarro F, Petrozziello A (2018) Linear programming as a baseline for software effort estimation. ACM Transactions on Software Engineering and Methodology (TOSEM), to appear
- Sarro et al. (2016) Sarro F, Petrozziello A, Harman M (2016) Multi-objective software effort estimation. In: ICSE, ACM, pp 619–630
- Shepperd (2007) Shepperd M (2007) Software project economics: a roadmap. In: 2007 Future of Software Engineering, IEEE Computer Society, pp 304–315
- Shepperd and MacDonell (2012) Shepperd M, MacDonell S (2012) Evaluating prediction systems in software project estimation. IST 54(8):820–827
- Shepperd and Schofield (1997) Shepperd M, Schofield C (1997) Estimating software project effort using analogies. TSE 23(11):736–743
- Shepperd et al. (2000) Shepperd M, Cartwright M, Kadoda G (2000) On building prediction systems for software engineers. EMSE 5(3):175–182
- Singh and Misra (2012) Singh BK, Misra A (2012) Software effort estimation by genetic algorithm tuned parameters of modified constructive cost model for nasa software projects. International Journal of Computer Applications 59(9)
- Sommerville (2010) Sommerville I (2010) Software engineering. Addison-Wesley
- Stensrud et al. (2003) Stensrud E, Foss T, Kitchenham B, Myrtveit I (2003) A further empirical investigation of the relationship of mre and project size. ESE 8(2):139–161
- Storn and Price (1997) Storn R, Price K (1997) Differential evolution–a simple and efficient heuristic for global optimization over cont. spaces. JoGO 11(4):341–359
- Tan et al. (2015) Tan M, Tan L, Dara S (2015) Online defect prediction for imbalanced data. In: ICSE
- Tantithamthavorn et al. (2016) Tantithamthavorn C, McIntosh S, Hassan AE, Matsumoto K (2016) Automated parameter optimization of classification techniques for defect prediction models. In: 38th ICSE, DOI 10.1145/2884781.2884857
- Tantithamthavorn et al. (2018) Tantithamthavorn C, McIntosh S, Hassan AE, Matsumoto K (2018) The impact of automated parameter optimization on defect prediction models. IEEE Transactions on Software Engineering pp 1–1, DOI 10.1109/TSE.2018.2794977
- Trendowicz and Jeffery (2014) Trendowicz A, Jeffery R (2014) Software project effort estimation: Foundations and best practice guidelines for success. Springer
- Walkerden and Jeffery (1999) Walkerden F, Jeffery R (1999) An empirical study of analogy-based software effort estimation. ESE 4(2):135–158
- Whigham et al. (2015) Whigham PA, Owen CA, Macdonell SG (2015) A baseline model for software effort estimation. TOSEM 24(3):20:1–20:11, DOI 10.1145/2738037