Software cost/effort/project estimation research is not uncommon in empirical software engineering research and has its roots already from before the term ‘software engineering’ was established. During the 70s one could see an increase in the interest in estimating software development projects, see, e.g., , which only increased after Boehm  released the book Software Engineering Economics in 1981.
Most would argue that in order to conduct estimations one should not rely exclusively on expert opinion, but also on data of a more quantitative nature using unbiased data collection approaches. To this end, researchers have published studies making use of, among others, the International Software Benchmarking Standards Group’s data repository (ISBSG).111http://isbsg.org (For an overview and introduction to the data sets please see .)
We believe that, even though there is a risk of introducing mono-operation bias and increasing mono-method bias by using the same data set, there is still a need for the community to have data sets that are accessible for all to try new and unproven ideas. In the end, however, generalizability is not necessarily the ultimate goal of these unproven ideas since some researchers see early validation of estimation approaches simply as the first step before implementing said approaches in an industry context, very often limiting generalizability out of necessity [4, 5].
Nevertheless, one could make an argument that the data set, and similar data sets, have several things in common with data collected in industry. Missing data, disparate quality in data collection procedures, and variety of data types collected are issues we see also in empirical software engineering research in general.
As we will see in Sect. II, the prevailing strategy to handle missing data in empirical software engineering research is to merely remove cases of missing data (listwise deletion). We believe that this strategy is suboptimal and, generally speaking, not good for our research discipline. Even in cases where data can be classified according to the quality of the data collection procedure, as is the case with the ISBSG data sets, one sees that our community often chooses only to use a subset of data, classified to be of the highest quality (see, e.g., [6, 7] for recommendations, and  for an example where the authors use the recommendations). In short, we believe that data of low quality should be seen as better than no data at all, and the general rule of thumb should be never to throw away data.
Recent advances in Bayesian data analysis (BDA) and imputation has led to software support that tackles the obstacles of missing data well, as long as one is willing to embrace uncertainty. We hope that this paper will convince researchers that dealing with missing data imputation using Bayesian data analysis, is preferable to the current practice, i.e., to throw data away.
Next, we will present related work (Sect. II), while the majority of the paper will present a case where we apply techniques for data imputation in addition to conducting Bayesian data analysis on effort estimation data using an ISBSG data set (Sect. III). We round off the paper with Threats to Validity (Sect. V), Discussion (Sect. VI) and Conclusions (Sect. VII).222A reproducibility package, making use of brms  (with Stan ) written in R , will be available if the paper is accepted.
Ii Related Work
Related work is mainly divided into three parts. First, we present related work on the use of (Bayesian) multilevel models. Then, we present related work concerning missing data in empirical software engineering. Finally, we present related work in connection to the ISBSG data set. All of this in the context of empirical software engineering when possible.
There are few publications in software engineering where we see evidence of using multilevel models (MLMs).333Multilevel models can also be called hierarchical linear models, nested data models, mixed models, random coefficient, random-effects models, random parameter models, or split-plot designs. Ehrlich and Cataldo  used multilevel models for assessing communication in global software development, while Hassan et al.  applied MLMs for studying reviews in an app store. However, both studies used a frequentist approach (maximum likelihood), i.e., not a Bayesian approach, which we will use.
As far as we can tell, there have only been two studies in software engineering that have applied BDA with MLMs. Furia  presents several cases of how BDA could be used in computer science and software engineering research. In particular, the aspects of including prior belief/knowledge in MLMs is emphasized. Ernst , on the other hand, presents a conceptual replication of an existing study where he shows that MLMs support cross-project comparisons while preserving local context, mainly through the concept of partial pooling, as seen in Bayesian MLMs.444 Partial pooling takes into account variance between units.
Partial pooling takes into account variance between units.
In our case, we will undoubtedly use the concepts of priors and partial pooling—they are after all key to Bayesian data analysis and MLMs—however, our end-goal, as already mentioned, will be to analyze what happens when we follow the principle ‘never throw data away’.
Missing data can be handled in two main ways. Either we delete data using one of three main approaches (listwise or pairwise deletion, and column deletion) or we impute new data. Concerning missing data we conclude the matter is not new to the empirical software engineering community. Liebchen and Shepperd  has pointed out that the community needs more research into ways of identifying and repairing noisy cases, and Mockus  claims that the standard approach to handle missing data, i.e., remove cases of missing values (e.g., listwise deletion), is prevalent in empirical software engineering.
Two additional studies on missing data are, however, worthwhile pointing out. Myrtveit et al.  investigated listwise deletion, mean imputation, similar response pattern imputation, and full information maximum likelihood (FIML), and conclude that FIML is the only technique appropriate when data are not missing completely at random. Finally, Cartwright et al.  conclude that -nearest neighbor and sample mean imputation are significantly better in improving model fit when dealing with imputation. However, much has happened concerning research in imputation techniques lately.
In this study we will instead focus on multivariate imputation by chained equations (MICE), sometimes called fully conditional specification or sequential regression multiple imputation, a technique that has emerged as a principled method of dealing with missing data during the last decade . MICE specifies a multivariate imputation model on a variable-by-variable basis by a set of conditional densities (one for each incomplete variable) and draws imputations by reiterating the conditional densities. The original idea behind MICE is old, see, e.g., stochastic relaxation , but the recent refinements and implementations have made the technique easily accessible.
Finally, related work concerning the ISBSG data set is worthwhile pointing out. Fernández-Diego and de Guevara  present pros and cons of using the ISBSG data set. A systematic mapping review was used as the research method, which was applied to over 120 papers. The dependent variable of interest was usually Effort (more than 70% of the studies), and the most frequently used methods were regression (
60%) and machine learning (35%), the latter a term where many techniques can hide. Worth noting is also that Release 10 was used most frequently, which provides us with a reason to also use that data set. Additionally, we will also use Effort as the dependent variable of interest, since a majority of the studies seem to find that variable interesting to study. (The importance of the ISBSG data set, when considering replication studies in empirical software engineering, has already been pointed out by Shepperd et al. .) By and large, our study takes another approach entirely, we impute missing data in a Bayesian context, and we see our work more as complementary to the work mentioned above.
Iii An Illustrative Case
We will use the Release 10 data set and set the dependent variable to Effort. According to, e.g., [6, 7] and the International Software Benchmarking Standards Group, the following pre-processing steps are appropriate:
Only projects classified with data quality rating ‘A’ are kept, and ‘B–D’ are excluded.
Only projects using IFPUG (unadjusted functional size measurement) should be kept. However, the data description clearly states that versions should not be compared to . Hence, we only use versions .
According to Keung  some additional variables should be kept for compatibility with previous studies.
Missing values in any of the selected attributes should be excluded.
The above leads to Table I according to Mittas et al. . If we use the variables in Table I, and follow the advice above, we will later see that we remove 3,895 projects out of the total 4,106 (close to 95% of the projects). This seems wasteful.
|AFP||Adjusted function points|
|Input||Number of inputs|
|Output||Number of outputs|
|Enquiry||Number of enquiries|
|File||Number of files|
|Interface||Number of interfaces|
|Added||Number of added features|
|Changed||Number of changed features|
|Deleted||Number of deleted features|
|Effort||Actual effort (person-hours)|
|DQR||Data quality rating|
Imagine instead that we aim to keep as much data as possible, i.e., a data-greed approach. Well, first of all, we should consider including all projects no matter the quality rating. After all, it is easy for us to classify them differently in a statistical model and even investigate the difference between projects depending on the data quality rating. Hence, we decide to include all projects and mark them according to their data quality rating, i.e., DQR.
The next step would be to consider a casual analysis of the variables in Table I.
Iii-a Causal Analysis
All predictors except DQR, e.g., Input, Output, etc., seem to be unadjusted function point measurements, which later are used to calculate the adjusted function point (AFP). Drawing our scientific model (Fig. 1) as a directed acyclic graph (DAG) shows something generally considered to be a Pipe confound (one of four types of relations in causal DAGs). In short, AFP mediates association between the other predictors and our outcome Effort, i.e., , or to put it differently, Effort is independent of Input, when conditioning on AFP.
We often worry about not having a predictor that we need for making good predictions (omitted variable bias). However, we do not often consider mistaken inferences because we rely on variables that are derived from other variables, i.e., post-treatment bias . (Rosenbaum  calls this the concomitant variable bias.) In experimental studies, one would declare AFP to be a post-treatment variable, but the same nomenclature can be used in observational studies. To summarize, the above indicates that AFP should not be a predictor, so we will leave it out for now.
Iii-B Identifying Non-Identifiability
Strong correlations between variables are generally speaking a challenge when building statistical models. The model will make good predictions, but understandability will decrease since the impression will be that variables that have a strong correlation, do not seem to have predictive power, while in fact, they have strong associations with the outcome .
Traditionally, there are two ways one can do this: Examining a pairs plot where all combinations of parameters and their correlations are visualized, or check if the matrix is a full rank matrix, and thus identify non-identifiability that way.
The latter approach consists of declaring a model , use the data as a matrix , and decompose it into a product
of an orthogonal matrix, and an upper triangular matrix . By analyzing the diagonal of the matrix a threshold value of along the diagonal is then often used to declare a variable as unsuited for inclusion in a model. Often something like —quasi-zero for a computer—could be used, but generally speaking anything below 1.0 has traditionally been seen suitable.
Identifying non-identifiability for our data set from Table I clearly indicates that Deleted should be a candidate for removal (). To this end, we exclude Deleted from our list of predictors. It might feel strange to remove predictors when our ultimate goal is to use as much data as possible. However, in order to build a statistical model that is sane, has good out of sample prediction, and is understandable, a trade-off is needed. Here we argue that the bulk of work should be done before we design our statistical model, in order to make use of the missing data techniques available to us.
Iii-C Missing Data Analysis
Rubin  has shown that very often 3–5 imputations is enough and that the efficiency of an estimate based on imputations is approximately:
where is the fraction of missing information. As an example, consider that we have 20% missing information in a variable (), given , we have reached an efficiency of approximately 96%. Setting we reach 98%. By doubling the computational effort, we have a slight gain in efficiency. However, more recently we have seen that other recommendations have been introduced.
Bodner  and White et al.  showed through simulations and by analytically deriving Monte Carlo errors, respectively, that the general rule of thumb should be , i.e., if a variable has 40% missing data () we should set (using Rubin’s efficiency estimate this would mean 92.6%99.0%).
As will be evident, we will take the more conservative approach, when we present the model implementation in Section III-F.
Iii-D Sensitivity Analysis of Priors
Through sheer deduction we conclude, so far, that we will most likely use seven predictors and one group variable, to predict one outcome variable. We can already now assume that we will most likely use a likelihood (i.e., our assumptions regarding the data generative process) that is based on counts (Effort is after all a count, the number of hours, going from zero to infinity).
When working with these types of generalized linear models, we use link functions that translate between the linear predictor and the mean of the distribution function. In the case of count distributions, such as Poisson, it is customary to use a link function, i.e., a parameter’s value is the exponentiation of the linear model. However, when setting priors for parameters, unexpected things can happen, and the priors might not have the effect one would expect. To this end, we should always do prior predictive simulations, i.e., a sensitivity analysis of the priors.
The description of the data set indicates that approximately 20% of the projects have more than 20 people in the team. If we assume, roughly 1,700 h/year for an individual, having 60 people in a team sums up to approximately 100,000 hours. Let us now assume that this is the maximum value for our outcome variable Effort. Random sampling from provides us with (we use the exponential since we assume a link function) indicating that this could be an acceptable prior for the intercept .555As a side note, if we assume to be (not uncommon in, e.g., Poisson regressions), hours of work effort. This would correspond to close to half a billion people working on the project for one year. Clearly inappropriate.
Next, we assume broad priors like for our seven parameters, but we also compare with . Assuming a link function, the additive effects of seven priors for our parameters would, on a normalized scale, correspond to and , respectively. As is evident from Fig. 2 (a), we have a massive emphasis on extremely high -values (we would require the world’s total population to work in a project for this to happen). Now compare (a) with (b). We still allow extremely large -values (up to ), but we are more modest about it.
To summarize, prior predictive simulations indicate that setting and on and
, respectively, allows us to delimit the multidimensional Gaussian space of possible parameter values, while still not remove the probability of extreme values completely. If we would still be uncertain one could have conducted even more prior predictive simulations.
Iii-E Model Design
For our model one could imagine using a Poisson likelihood, that is, we have a count () which we can model binomial events for when the trials are very large and small. However, that would be a mistake.
Let us now look at some descriptive statistics of our data (taking into account the pre-processing steps previously introduced).666For all data sets we use single quotes to emphasize the names, e.g., ‘A_clean’, while we print out variable names in verbatim. Some issues catch the eye in Table II. First, Effort
has a max value of 645,694 (three times larger than our assumption). Second, the medians are consistently lower than the means (in one case the median is zero) indicating positive skewness. Third, not visible in the table,Effort, compared to the predictors, contain no zeros (indicating that we do not need to consider zero-inflated or hurdle models ). Finally, the mean and the variance for our outcome variable are very different (the variance is approximately 70,000 times larger than the mean).
Concerning the latter issue, a Poisson likelihood assumes the mean and the variance be approximately equal. This allows us to instead use the negative binomial, known as the Gamma–Poisson (mixture) distribution, as our likelihood, i.e., a continuous mixture model where we assume each Poisson count observation has its own rate. To summarize our findings so far:
We model each observation from a negative-binomial distribution, with a failure rateand shape . We then use a link for our linear model where we include an intercept and estimators () for all predictors. We also add varying intercepts in the form of our DQR variable.
Then, we set the aforementioned priors on our parameters (but we use for our unique intercepts, to separately estimate for each level of DQR). Finally we set and for and , respectively. The latter are regularizing priors common for these types of parameters.777Please see here for prior choice recommendations: https://goo.gl/fx2F7V.
Iii-F Model Implementation
In the previous sections, we presented our statistical model with assumptions. In this section, we will make use of it in two ways: sampling with complete data and imputed data. However, before we begin, Table III describes the data sets we will use.
The data sets are divided into two categories. First, we have data sets that only take into account projects classified as having the highest quality rating (‘A’) and data sets where we use all four quality ratings (‘AD’); taking into account the pre-processing steps in Sect. III.
First, we have subsets with NAs (‘A’ and ‘AD’) and, second, subsets where all original NAs are removed (‘*_clean’). The logic to use these data sets is that we want to use as much data as possible (but we pay the price of missing data), and removing all NAs is, as we have discussed, not uncommon.
One could also imagine having subsets where all zeros are assumed to be NAs, but that would be a bit too conservative assumption in our opinion, and we leave it to the reader to try out such a scenario.
If one would like to compare our data sets with what is commonly seen in literature then, taking into account that we have a more restrictive view on which IFPUG versions are included, ‘A_clean’ would be the most similar data set (e.g., Mittas et al.  report using 501 projects, while we end up with 214, using our more restrictive subset). However, we are more interested in the cases where we have larger data sets, together with missing data, and comparing these with, e.g., ‘A_clean’.
|Name||# projects||# NAs||% NAs||# zeros|
Missing Data Imputation
Visualizing missing data (Fig. 3) shows that the missingness is multivariate (missing data in more than one variable), connected (all variables have blue cells that are connected from the upper left corner), and non-monotone (there are gaps of blue patterns from the lower right corner). Generally speaking, this indicates that data imputation is possible (in particular the connectivity is an important part of missing data imputation).
van Buuren  recommends that one calculates each variable’s influx and outflux and plots them. Influx () is defined as the number of variable pairs with missing and observed, divided by the total number of observed data cells while, in the case of outflux (), we instead divide by the total number of incomplete data cells. In short, a completely observed variable gives , while the opposite holds for . If one has two variables with the same degree of missing data, then the variable with the highest is potentially more useful for imputation purposes. Examining Fig. 4, Effort and Change have the highest and . To summarize, Effort will be the most influential variable for imputation, while Change will be the easiest variable to impute. This is worth keeping in mind later when we analyze the results.
To conclude, we will create multiple imputations (replacement values) for multivariate missing data, based on Fully Conditional Specification 
. Each incomplete variable is imputed by a separate model using predictive mean matching (numeric data), or proportional odds model/orderedmodel (factor data with >2 ordered levels) [30, 19]. Concerning predictive mean matching, the assumption is that the missingness follows approximately the same distribution as the data, but the variability between the imputations over repeated draws reflects the uncertainty of the actual value.
One of several analyses we can do is to plot the imputed data against the empirical data set. In Fig. 5, we see the empirical data set (‘0’ on the -axes) with . As we can see, the imputations follow fairly closely the original distribution.
In this section, we will first present some diagnostics from the Markov chain Monte Carlo sampling we conducted, while we later investigate the estimates from our analysis and probabilistic statements connected to the estimates.
First, for all sampling conducted was consistently low, i.e., (the ratio of the average variance of draws within each chain to the variance of the pooled draws across chains). Second, the effective sampling was consistently high, i.e., . Third, the visual inspection revealed that the chains mixed well (hairy caterpillar ocular test). Finally, the Bayesian fraction of missing information, another diagnostic, shows a significant overlap between the energy transition density, , and the marginal energy distribution, (Fig. 6). When the two distributions are well-matched, the random walk will explore the marginal energy distribution efficiently .
We sampled four chains, each with 2,000 iterations, and the first half of the iterations were discarded as warmup iterations. For our imputed data sets this means that we have posterior samples. Figure 7 provides a comparison of our empirical outcome (using the data set ‘A_clean’), with draws from our posterior distribution; we see evidence of a fairly good match; a perfect match is not what we want since then we could just use our data as-is, i.e., the variability of each vis-à-vis is what interests us.
If we turn our attention to the estimated intercepts for our group-level variable DQR (i.e., a project’s rating according to the quality of data collected), we see something interesting in Fig. 8. There is a clear pattern, in both data sets, where quality rating ‘A’ and ‘D’ are to the left, while ‘B’ and ‘C’ are to the right. It is an indication that these two groups are perceived as similar to each other, which is a bit ironic since ‘A’ and ‘D’ are conceptually the opposite of each other in terms of data quality rating. This indicates that one should question the data quality ratings in the data set and, given enough data, DQR seems to become less critical.
Examining the estimated parameters (Fig. 9) we see that parameters perceived as ‘significant’ differ between the data sets.888Our notion of ‘significant’ is here that the 95% highest density posterior interval does not cross zero. However, other notions do exist . There are three comparisons we should make here. First, comparing imputed with cleaned data sets (within each column). Second, comparing ‘A’ and ‘AD’ models (between columns). Third, compare the upper right and lower left plots (data-greed approach vs. state of practice).
First, if we look at the left column, we see that nothing significant has changed. On the other hand, examining the right column we see that is no longer significant in the imputed data set. This might make you sad. However, using all data can lead to weaker inferences and, honestly, should we not make use of all data available no matter our wishful thinking concerning inferences?
Second, if we compare our simple models with our multi-level models (between column comparisons), examining the two lower plots we see that has become ‘significant’ in the model where we make use of all quality ratings (right plot). In this particular case, one would lean towards the multi-level model (right plot) since it, after all, makes use of more data and employs partial pooling to avoid over-fitting.
Finally, we should compare the upper right and lower left plots (our data-greed approach with state of practice); they are the reason for conducting this study. Two things are worth noticing here: is shifted noticeably more to the right in the imputed data set and is significant, as is . is clearly not significant in the imputed data set, while in the cleaned data set it is nearly so. Once again, making use of more data can lead to weaker inferences, which is a good thing.
Let us now examine what the posterior predictive distribution provides us with concerning point estimates regarding our outcome Effort. The posterior distribution allows us to set predictors at different values and use the posterior predictive distribution to generate predicted outcomes with uncertainty. In our case, we would like to examine the difference between posterior predictive distributions of ‘A_clean’ and ‘AD’, since ‘A_clean’ is based on the assumptions commonly used in literature and ‘AD’ makes use of as much data as possible, i.e., our data-greed approach.
Plotting our two posterior distributions indicates the differences (see Fig. 10). As is clear, the median is higher when taking into account more data (‘AD’), but the uncertainty is also slightly larger. Ultimately, comparing these types of point estimates should always go hand in hand with the purpose of the analysis, i.e., the utility function.
The posterior distribution allows us to make probabilistic statements that provide us with a deeper understanding of the phenomena we study. We could make statements for each model separately, i.e., investigating ‘AD’ we can say that the probability that the effect is greater than 10% (a decline of >10%) for the D category is 2.6%, while for the A category it is 7.2%. Alternatively, in the case of ‘A_clean’, that the probability that the effect of Enquiry is greater than 5% (an increase of >5% in this case) is 58.6%. Of course, one could also look at the probability that the estimates of parameter is larger than 0 using the ‘AD_clean’ and ‘AD’ data sets, i.e., 79.8% and 60.2%, respectively.
These types of probability statements are a positive aspect of Bayesian analysis and the posterior probability distributions that accompanies it.
V Threats to Validity
In this section, we will cover threats to validity from a quantitative, and statistical, perspective. The below threats are not the type of threats we regularly discuss in empirical software engineering , but in our case, we believe these threats to be more relevant for our study. At the end of this section we will contrast our study against recent guidelines concerning design and reporting of software analytics studies.
Directed Acyclic Graphs (DAGs)
Making ones scientific model explicit is dangerous since it becomes open to attack. We believe, however, that it should be compulsory in any scientific publication. We employed DAGs to this end, a concept refined by Pearl and others . Using DAGs make things explicit. If things are explicit, they can be criticized. One threat to validity is of course that our scientific model is wrong and AFP is not a mediator (Sect. III-A). This is for the reader to comment on. Of course, instead of using the graphical approach of DAGs, and applying -calculus to determine -separation, one could walk down the road of numerical approaches according to Peters et al. .
Through our non-identifiability analysis we concluded that the variable Deleted should be removed. Removing this variable is a trade-off. The non-identifiability analysis showed that it should be removed, thus allowing for better understandability and better out of sample prediction. However, we could have taken a more prudent approach and investigated Deleted’s role in predictions, but in this particular case, we believe the initial analysis provided us with a convincing argument to remove it.
The sensitivity analysis of priors provided us with confidence regarding the choice of priors. We conducted prior predictive analysis and, together with recommendations regarding default priors, concluded that our selection of priors was balanced. However, our conclusions could be wrong, and further studies could indicate that our priors are too broad. The latter is, however, what one can expect when doing science.
Bayesian Data Analysis
In a Bayesian context, model comparison is often divided into three categories -open, -complete, and -closed [36, 37]. In the -open world the relationship between the data generating process and our list of models is unknown. However, what we do know is that , our ‘true’ model, is not in , and we cannot specify the explicit form due to computational or conceptual reasons. In the -complete world, is also not in , but we use any model in
that is closest in Kullback-Leibler divergence[36, 38]. Finally, in the case of the -closed world, .
The bulk of statistical methodology is concerned with the latter category (-closed) and “this class of problems is comparatively simple and well studied” . Many problems we face are in the -complete and not the -closed world (this study is such an example). Selecting the ‘best’ model is often done through relative comparisons of using the Watanabe-Akaike information criterion or leave one out cross validation with Pareto-smoothed importance sampling . However, to use WAIC or PSIS-LOO for out of sample prediction, one should use the same data set for each model (e.g., you can change the likelihood and priors, but the data set is fixed).
In our particular study, we have not done model comparison (e.g., using PSIS-LOO ), but the reason is apparent—we use different data sets—which is the purpose of this study. To this end we defend our choice of likelihood, i.e., the Gamma-Poisson (a.k.a. negative binomial), epistemologically: If we have counts from zero to infinity, where the variance is significantly different from the mean then, from a maximum entropy point of view, Gamma-Poisson is a sane choice. By conducting posterior predictive checks, we ultimately received yet another validation to strengthen us in our opinion that the model has predictive capabilities (Sect. IV).
One threat to validity is also the residuals of the model (fitting deviation). If the residuals are too large, it is an indication that the model does not fit data well. This is a trade-off since a perfect fit could imply overfitting. By conducting posterior predictive checks, we concluded that the models, as such, had a convincing fit (see, e.g., Fig. 7). However, investigating the residuals for each estimated parameter provides us with a better view.
In Fig. 11 we see estimates of the parameters with the most unobservable statistical error. The residuals are linear which indicates predictability in our imputation. But, remember, Changed was judged to be the easiest to impute, but still Enquiry and Interface provide less uncertainty considering residuals (Sect. III-C)—that is due to the imputation most likely. Nevertheless, in the end, we have employed multi-level models when possible and, thus, taken a conservative approach by using partial pooling. The latter should encourage us to put some trust in the model’s ability to avoid overfitting, i.e., learn too much from our data.
Finally, we believe that using the traditional threats to validity nomenclature seen in empirical software engineering research most likely does not fit the type of studies we present here. In , Krishna et al. presents 12 ‘bad smells’ in software analytics papers, which we believe are more useful to describe weaknesses in the type of studies that we present here. Below we will now cover each ‘smell’ and contrast it with our study.
Not interesting. (Research that has negligible software engineering impact.) We argue that the problem we have analyzed in this paper is not only relevant, common, and interesting. To this we mainly point to related work.
Not using related work. (Unawareness of related work concerning RQs and SOA.) We point to related work as a basis for our study, i.e., studies which throw away data, and show how other state of art analytical approaches might be more suitable.
Using deprecated and suspect data. (Using data out of convenience.) The data is definitely suspect as we have shown by our analysis of data quality ratings, but it would be hard to argue that the data is deprecated. However, we used a particular version of the data set, but that was due to our intention to align with previous work.
Inadequate reporting. (Partial reporting, e.g., only means.) In this paper we have presented not only point estimates, but also contrasted different distributions and derived probabilistic statements. We would have liked to also provide model comparisons but, alas, the design of this study did not allow this. To this end we rely on posterior predictive checks.
Under-powered experiments. (Small effect sizes and little theory.) Using more data provides us with more statistical power, and we base our priors on state of art recommendations and logical conclusions, e.g., estimating that the world’s population is part of a project is not appropriate.
and all that.
(Abuse of null hypothesis testing.) We mention-values only when making a point not to use them.
Assumptions of normality and equal variances.
(impact of outliers and heteroscedasticity.) We use a Bayesian generalized linear model, which we model using a Gamma-Poisson likelihood. Additionally, we employ multi-level models when possible, and hence make use of partial pooling (which takes into account the presence of outliers).
Not exploring stability. We conducted sensitivity analysis of priors and we report on the differences between imputed and cleaned data sets.
Not tuning. We avoid bias in comparisons mainly by clearly stating our assumptions, conducting a sensitivity analysis, making use of multi-level models, and, generally speaking, following guidelines on how to conduct Bayesian data analysis.
Not exploring simplicity. Using state of art missing data analysis is needed and wanted in order to decrease our bias. Additionally, using a complex mixture model was unavoidable because of epistemological reasons, as presented earlier in this section. We used simulated data to independently assess the appropriateness of our likelihood and priors.
Not justifying choice of learner.
This concerns, ultimately, the risk of over-estimation (or over-fitting). We would argue that any usage of frequentist statistics would potentially introduce this ‘smell’, i.e., using uniform priors, as is the case in a traditional frequentist setting, ensures maximum over-fitting.
We argue that one should have solid reasons to throw away data since we now have techniques available that can provide us with the opportunity to use as much data as possible. The example we provided showed that by using missing data techniques, in combination with Bayesian multi-level models, we could better make use of the data and, thus, gain higher confidence concerning our findings. The inferences can become weaker, but ask yourself if that is not how you think your fellow researchers should conduct their analysis. We could show two things in our analysis: Parameters’ ‘significance’ changed depending on if we used imputation or not, and there was really not much of a difference between the various data quality ratings (once again indicating that we should use as much data as possible).
However, we pay a price for this more complex analysis. It is a more involved analysis compared to a frequentist analysis where only the likelihood of the outcome is required to be specified and maximum likelihood estimates are not conditioned on the observed outcome; the uncertainty is instead connected to the sampling distribution of the estimator. The same applies to confidence intervals in the frequentist world, i.e., one can set up a distribution of predictions, but it entails repeating the process of random sampling on which we apply the estimator every time to then generate point predictions. Contrast this with conditioning on the posterior predictive distribution which is based on the observed outcome.
Additionally, making probabilistic statements is very much more natural when having a posterior at hand, while the -values we have made use of traditionally rely on observing a -statistic that is so large (in magnitude) if the null hypothesis is true, i.e., not if the scientific hypothesis is true. To make the point, the term ‘-value’ was used in this paper for the first time here in this section and in our case, where we used different data sets, one could have expected us to lean towards traditional hypothesis testing, since it was not possible to compare models explicitly, regarding out of sample predictions.
We will not further contrast our approach with how analyses are done in empirical software engineering today. Suffice to say, issues such as the arbitrary cut-off, the usage of null hypothesis significance testing and the reliance on confidence intervals have been criticized [43, 44, 45, 46], and when analyzing the arguments, we have concluded that many of the issues plaguing other scientific fields are equally relevant to come to terms with in empirical software engineering.
We believe that evidence-based interpretation is more straightforward with Bayesian data analysis, and empirical software engineering should embrace it as soon as possible. In our view it is an easy choice to make in this particular case; to base one’s inferences on more data is wise, and doing so in a Bayesian context is natural.
We recommend that one always should be conservative with throwing away data. Many state of art techniques exists today, which provides the researcher with ample of possibilities to conduct rigorous, systematic, and transparent missing data analysis.
In combination with Bayesian data analysis, we believe that researchers will be able to get a more nuanced view of the challenges they are investigating. In short, we do not need -values for this.
In our study, we showed that inferences could become weaker, which is not a bad thing, and that the qualitative assessment of quality ratings were most likely biased. This further strengthens the argument never to throw data away.
- Putnam  L. H. Putnam, “A general empirical solution to the macro software sizing and estimating problem,” IEEE Transactions on Software Engineering, vol. SE-4, no. 4, pp. 345–361, July 1978.
- Boehm  B. W. Boehm, Software engineering economics. Prentice-hall Englewood Cliffs (NJ), 1981.
- Hill et al.  P. R. Hill, M. Stringer, C. Lokan, and T. Wright, “Organizational benchmarking using the ISBSG data repository,” IEEE Software, vol. 18, pp. 26–32, 09 2001.
- Briand et al.  L. C. Briand, D. Bianculli, S. Nejati, F. Pastore, and M. Sabetzadeh, “The case for context-driven software engineering research: Generalizability is overrated,” IEEE Software, vol. 34, no. 5, pp. 72–75, 2017.
- Petersen and Wohlin  K. Petersen and C. Wohlin, “Context in industrial software engineering research,” in Proceedings of the 3rd International Symposium on Empirical Software Engineering and Measurement, ser. ESEM ’09. Washington, DC, USA: IEEE Computer Society, 2009, pp. 401–404.
- Keung  J. Keung, “Empirical evaluation of Analogy-X for software cost estimation,” in Proceedings of the Second ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, ser. ESEM ’08. New York, NY, USA: ACM, 2008, pp. 294–296.
- Liebchen and Shepperd  G. A. Liebchen and M. Shepperd, “Data sets and data quality in software engineering,” in Proceedings of the 4th International Workshop on Predictor Models in Software Engineering, ser. PROMISE ’08. New York, NY, USA: ACM, 2008, pp. 39–44.
Mittas et al. 
N. Mittas, E. Papatheocharous, L. Angelis, and A. S. Andreou, “Integrating non-parametric models with linear components for producing software cost estimations,”Journal of Systems and Software, vol. 99, pp. 120–134, 2015.
- Bürkner  P.-C. Bürkner, “brms: An R package for Bayesian multilevel models using Stan,” Journal of Statistical Software, vol. 80, no. 1, pp. 1–28, 2017.
- Carpenter et al.  B. Carpenter, A. Gelman, M. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. Brubaker, J. Guo, P. Li, and A. Riddell, “Stan: A probabilistic programming language,” Journal of Statistical Software, vol. 76, no. 1, pp. 1–32, 2017.
- R Core Team  R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2018. [Online]. Available: https://www.R-project.org/
- Ehrlich and Cataldo  K. Ehrlich and M. Cataldo, “All-for-one and one-for-all? A multi-level analysis of communication patterns and individual performance in geographically distributed software development,” in Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work, ser. CSCW ’12. New York, NY, USA: ACM, 2012, pp. 945–954.
- Hassan et al.  S. Hassan, C. Tantithamthavorn, C.-P. Bezemer, and A. E. Hassan, “Studying the dialogue between users and developers of free apps in the Google Play Store,” Empirical Software Engineering, Sep 2017.
C. A. Furia, “Bayesian statistics in software engineering: Practical guide and case studies,”ArXiv e-prints, Aug. 2016.
- Ernst  N. A. Ernst, “Bayesian hierarchical modelling for tailoring metric thresholds,” in Proceedings of the 15th International Conference on Mining Software Repositories, ser. MSR ’18. Piscataway, NJ, USA: IEEE Press, 2018.
- Mockus  A. Mockus, “Missing data in software engineering,” in Guide to Advanced Empirical Software Engineering, F. Shull, J. Singer, and D. I. K. Sjøberg, Eds. London: Springer London, 2008, pp. 185–200.
- Myrtveit et al.  I. Myrtveit, E. Stensrud, and U. H. Olsson, “Analyzing data sets with missing data: An empirical evaluation of imputation methods and likelihood-based methods,” IEEE Transactions on Software Engineering, vol. 27, no. 11, pp. 999–1013, Nov 2001.
- Cartwright et al.  M. H. Cartwright, M. J. Shepperd, and Q. Song, “Dealing with missing software project data,” in Proceedings. 5th International Workshop on Enterprise Networking and Computing in Healthcare Industry (IEEE Cat. No.03EX717), Sep. 2003, pp. 154–165.
- van Buuren  S. van Buuren, “Multiple imputation of discrete and continuous data by fully conditional specification,” Statistical Methods in Medical Research, vol. 16, no. 3, pp. 219–242, 2007.
- Geman and Geman  S. Geman and D. Geman, “Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-6, no. 6, pp. 721–741, Nov 1984.
- Fernández-Diego and de Guevara  M. Fernández-Diego and F. G.-L. de Guevara, “Potential and limitations of the ISBSG dataset in enhancing software engineering research: A mapping review,” Information and Software Technology, vol. 56, no. 6, pp. 527–544, 2014.
- Shepperd et al.  M. Shepperd, N. Ajienka, and S. Counsell, “The role and value of replication in empirical software engineering results,” Information and Software Technology, vol. 99, pp. 120–132, 2018.
- McElreath  R. McElreath, Statistical rethinking: A Bayesian course with examples in R and Stan. CRC Press, 2015.
- Rosenbaum  P. R. Rosenbaum, “The consequences of adjustment for a concomitant variable that has been affected by the treatment,” Journal of the Royal Statistical Society. Series A (General), vol. 147, no. 5, pp. 656–666, 1984.
- Rubin  D. B. Rubin, Multiple imputation for nonresponse in surveys. Wiley, 1987.
- Bodner  T. E. Bodner, “What improves with increased missing data imputations?” Structural Equation Modeling: A Multidisciplinary Journal, vol. 15, no. 4, pp. 651–675, 2008.
- White et al.  I. R. White, P. Royston, and A. M. Wood, “Multiple imputation using chained equations: Issues and guidance for practice,” Statistics in Medicine, vol. 30, no. 4, pp. 377–399, 2011.
- Simpson et al.  D. P. Simpson, H. Rue, T. G. Martins, A. Riebler, and S. H. Sørbye, “Penalising model component complexity: A principled, practical approach to constructing priors,” ArXiv e-prints, Mar. 2014.
- Hu et al.  M.-C. Hu, M. Pavlicova, and E. V. Nunes, “Zero-inflated and hurdle models of count data with extra zeros: Examples from an HIV-risk reduction intervention trial,” The American Journal of Drug and Alcohol Abuse, vol. 37, no. 5, pp. 367–375, 2011.
- Rubin  D. B. Rubin, “Statistical matching using file concatenation with adjusted weights and multiple imputations,” Journal of Business & Economic Statistics, vol. 4, pp. 87–94, 1986.
- Betancourt  M. Betancourt, “A conceptual introduction to Hamiltonian Monte Carlo,” arXiv e-prints, Jan 2017.
- Kruschke  J. K. Kruschke, “Rejecting or accepting parameter values in Bayesian estimation,” Advances in Methods and Practices in Psychological Science, 2018.
- Wohlin et al.  C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, B. Regnell, and A. Wesslén, Experimentation in Software Engineering. Springer Publishing Company, 2012.
- Pearl  J. Pearl, Causality: Models, reasoning and inference, 2nd ed. New York, NY, USA: Cambridge University Press, 2009.
- Peters et al.  J. Peters, D. Janzing, and B. Schölkopf, Elements of causal inference: Foundations and learning algorithms, ser. Adaptive Computation and Machine Learning. MIT Press, 2017.
- Yao et al.  Y. Yao, A. Vehtari, D. Simpson, and A. Gelman, “Using stacking to average Bayesian predictive distributions (with discussion),” Bayesian Analysis, vol. 13, no. 3, pp. 917–1007, 09 2018.
- Navarro  D. J. Navarro, “Between the devil and the deep blue sea: Tensions between scientific judgement and statistical model selection,” Computational Brain & Behavior, vol. 2, no. 1, pp. 28–34, Mar 2019.
- Betancourt  M. Betancourt, “A unified treatment of predictive model comparison,” arXiv e-prints, Jun 2015.
- Clarke et al.  J. L. Clarke, B. Clarke, and C.-W. Yu, “Prediction in -complete problems with limited sample size,” Bayesian Analysis, vol. 8, no. 3, pp. 647–690, 09 2013.
- Vehtari et al.  A. Vehtari, A. Gelman, and J. Gabry, “Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC,” Statistics and Computing, vol. 27, pp. 1413–1432, 2017.
- Krishna et al.  R. Krishna, S. Majumder, T. Menzies, and M. Shepperd, ““Bad smells” in software analytics papers,” arXiv e-prints, Mar 2018.
- Gabry et al.  J. Gabry, D. Simpson, A. Vehtari, M. Betancourt, and A. Gelman, “Visualization in Bayesian workflow,” ArXiv e-prints, Sep. 2017.
- Ioannidis  J. P. A. Ioannidis, “Why most published research findings are false,” PLOS Medicine, vol. 2, no. 8, 08 2005.
- Morey et al.  R. D. Morey, R. Hoekstra, J. N. Rouder, M. D. Lee, and E.-J. Wagenmakers, “The fallacy of placing confidence in confidence intervals,” Psychonomic Bulletin & Review, vol. 23, no. 1, pp. 103–123, 2016.
- Nuzzo  R. Nuzzo, “Scientific method: Statistical errors,” Nature, vol. 506, no. 7487, pp. 150–152, Feb. 2014.
- Trafimow and Marks  D. Trafimow and M. Marks, “Editorial,” Basic and Applied Social Psychology, vol. 37, no. 1, pp. 1–2, 2015.