Compared to other fields such as the social sciences, empirical software engineering has been using standard statistical practices for a relatively short period of time [Bayes1]. In light of the replication crisis mounting in several of the more established research fields [replication-crisis], empirical software engineering’s relative statistical immaturity can actually be an opportunity to embrace more powerful and flexible statistical methods—which can elevate the impact of the whole field and the solidity of its research findings.
In this paper, we focus on the difference between so-called frequentist and Bayesian statistics. For historical and technical reasons [belief-icse16, fienberg2006, rao1992], frequentist statistics are the most widely used, and the customary choice in empirical software engineering (see Sect. 1.1) as in other experimental research areas. In recent years, many statisticians have repeatedly pointed out [pvalue-cohen, all-false]
that frequentist statistics suffer from a number of shortcomings, which limit their scope of applicability and usefulness in practice, and may even lead to drawing flat-out unsound conclusions in certain contexts. In particular, the widely popular techniques for null hypothesis statistical testing—based on computing the infamous $p$-values—have been de facto deprecated [misusingPval, insignificance-pval], but are still routinely used by researchers who simply lack practical alternatives: techniques that are rigorous yet do not require extensive statistical know-how, and are fully supported by easy-to-use, flexible analysis tools.
Bayesian statistics has the potential to replace frequentist statistics, addressing most of the latter's intrinsic shortcomings, and supporting detailed and rigorous statistical analysis of a wide variety of empirical data. To illustrate the effectiveness of Bayesian statistics in practice, we present two reanalyses of previous work in empirical software engineering. In each reanalysis, we first describe the original data; then, we replicate the findings using the same frequentist techniques used in the original paper; after pointing out the blind spots of the frequentist analysis, we finally perform an internal replication [Natalia-replication] that analyzes the same data using Bayesian techniques—leading to outcomes that are clearer and more robust. The bottom line of our work is not that frequentist statistics are intrinsically unreliable—in fact, if properly applied in certain conditions, they might lead to results that are close to those obtained with Bayesian statistics [multiple-perspectives]—but that they are generally less intuitive, depend on subtle assumptions, and are thus easier to misuse (despite looking simple superficially).
Our first reanalysis, presented in LABEL:sec:testing, targets data about debugging using manually-written vs. automatically-generated tests. The frequentist analysis by Ceccato et al. [autotest-bib]
is based on (frequentist) linear regression, and is generally sound and insightful. However, since it relies on frequentist statistics to assess statistical significance, it is hard to consistently map some of its findings to practical significance and, more generally, to interpret the statistics in the quantitative terms of the application domain. By simply switching to Bayesian linear regression, we show that many of these issues can be satisfactorily addressed: Bayesian statistics estimate actual probability distributions of the measured data, based on which one can readily assess practical significance and even build a prediction model to be used for estimating the outcome of similar experiments, which can be fine-tuned with any new experimental data.
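To give a flavor of this style of reasoning, the following sketch uses a toy conjugate Gaussian model with invented numbers—not Ceccato et al.'s data or regression model. It shows the two features highlighted above: reading practical significance directly off a posterior distribution, and updating the same posterior when new experimental data arrives.

```python
import math
import statistics

# Conjugate update for the mean of normally distributed measurements with
# known noise standard deviation: prior N(m0, s0^2) on the true mean,
# posterior N(m1, s1^2) after observing the data.
def update_normal_mean(m0, s0, data, noise_sd):
    n = len(data)
    precision = 1 / s0**2 + n / noise_sd**2
    s1 = math.sqrt(1 / precision)
    m1 = (m0 / s0**2 + sum(data) / noise_sd**2) / precision
    return m1, s1

# Vague prior, then a first batch of hypothetical debugging times (minutes):
m, s = update_normal_mean(0.0, 100.0, [12.1, 10.4, 11.7, 12.9, 11.2], 2.0)

# Practical significance is read off the posterior directly, e.g.
# P[true mean > 10 minutes] via the normal c.d.f.:
p_gt_10 = 1 - statistics.NormalDist(m, s).cdf(10.0)

# New experimental data simply updates the posterior (it becomes the prior):
m2, s2 = update_normal_mean(m, s, [11.5, 12.3], 2.0)
assert s2 < s  # more data, sharper estimate
```

All numbers and variable names here are ours, for illustration only; the reanalysis in the paper uses a full Bayesian regression model rather than this one-parameter toy.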
The second reanalysis, presented in LABEL:sec:rosetta, targets data about the performance of different programming languages implementing algorithms in the Rosetta Code repository [rosettacode]. The frequentist analysis by Nanz and Furia [rosetta] is based on a combination of null hypothesis testing and effect sizes, which establishes several interesting results—but leads to inconclusive or counterintuitive outcomes in some of the language comparisons: applying frequentist statistics sometimes seems to reflect idiosyncrasies of the experimental data rather than intrinsic differences between languages. By performing full-fledged Bayesian modeling, we infer a more coherent picture of the performance differences between programming languages: besides the advantage of having actual estimates of average speedup of one language or another, we can quantitatively analyze the robustness of the analysis results, and point out what can be improved by collecting additional data in new experiments.
Contributions. This paper makes the following contributions:
a discussion of the shortcomings of the most common frequentist statistical techniques often used in empirical software engineering;
a high-level introduction to Bayesian analysis;
detailed analyses of two case studies [autotest-bib, rosetta] from empirical software engineering, discussing how the limitations of frequentist statistics weaken clarity and generality of the results, while Bayesian statistics make for robust and rich analyses.
This paper only requires a basic familiarity with the fundamental notions of probability theory. Sect. 2 then introduces the basic principles of Bayesian statistics (including a concise presentation of Bayes' theorem), and illustrates how frequentist techniques such as statistical hypothesis testing work and the shortcomings they possess. Our aim is that our presentation be accessible to a majority of empirical software engineering researchers.
Scope. This paper's main goal is to demonstrate the shortcomings of commonly used frequentist techniques, and to show how Bayesian statistics could be a better choice for data analysis. This paper is not meant to be:
A tutorial on applying Bayesian statistics tools; we refer readers to textbooks [BDA, puppies, rethinking] for such step-by-step introductions.
A philosophical comparison of frequentist vs. Bayesian interpretation of statistics.
A criticism of the two papers (Ceccato et al. [autotest-bib] and Nanz and Furia [rosetta]) whose frequentist analyses we peruse in LABEL:sec:testing and LABEL:sec:rosetta. We chose those papers because they carefully apply generally accepted best practices, in order to demonstrate that, even when applied properly, frequentist statistics have limitations and bring results that can be practically hard to interpret.
Availability. All machine-readable data and analysis scripts used in this paper’s analyses are available online at
1.1 Related Work
Empirical research in software engineering. Statistical analysis of empirical data has become commonplace in software engineering research [experiments-book, hitchhiker-icse11, Bayes1], and it is even making its way into software development practices [data-scientist].
As we discuss below, the overwhelming majority of statistical techniques that are being used in software engineering empirical research are, however, of the frequentist kind, with Bayesian statistics hardly even mentioned.
Of course, Bayesian statistics is a fundamental component of many machine learning techniques [ml-book, ml-springer]; as such, it is used in software engineering research indirectly whenever machine learning is used. In this paper, however, we are concerned with the direct usage of statistics to analyze empirical data from the scientific perspective—a pursuit that seems mainly confined to frequentist techniques in software engineering [Bayes1]. As we argue in the rest of the paper, this is a lost opportunity because Bayesian techniques do not suffer from several limitations of frequentist ones, and can support rich, robust analyses in several situations.
Bayesian analysis in software engineering? To validate the impression that Bayesian statistics are not normally used in empirical software engineering, we carried out a small literature review of ICSE papers (see [views-icse15] for a much more extensive literature survey of empirical publications in software engineering). We selected all papers from the main research track of the latest six editions of the International Conference on Software Engineering (ICSE 2013 to ICSE 2018) that mention “empirical” in their title or in their section’s name in the proceedings. This gave 25 papers, from which we discarded one [stochastic-icse14] that turned out not to be an empirical study. The experimental data in the remaining 24 papers come from various sources: the output of analyzers and other tools [mutations-icse17, compilers-icse16, browser-icse13, configuration-icse14, equivalence-icse15], the mining of repositories of software and other artifacts [evolving-icse14, javareflection-icse17, verification-icse15, js-icse16, fixing-icse13, fixes-icse15], the outcome of controlled experiments involving human subjects [summaries-icse16, lambdas-icse16, smells-icse13], interviews and surveys [codereviews-icse13, coupling-icse13, belief-icse16, network-icse17, data-icse16, green-icse16, uml-icse13, views-icse15], and a literature review [grounded-icse16].
As one would expect from a top-tier venue like ICSE, the 24 papers follow recommended practices in reporting and analyzing data, using significance testing (6 papers), effect sizes (5 papers), correlation coefficients (5 papers), frequentist regression (2 papers), and visualization in charts or tables (23 papers). None of the papers, however, uses Bayesian statistics. In fact, only two papers [belief-icse16, fixing-icse13] even mention the terms “Bayes” or “Bayesian”. One of the exceptions [fixing-icse13] only cites Bayesian machine-learning techniques used in related work to which it compares. The other exception [belief-icse16] includes a presentation of the two views of frequentist and Bayesian statistics—with a critique of $p$-values similar to the one we make in Sect. 2.2—but does not show how the latter can be used in practice. The aim of [belief-icse16] is investigating the relationship between empirical findings in software engineering and the actual beliefs of programmers about the same topics. To this end, it is based on a survey of programmers whose responses are analyzed using frequentist statistics; Bayesian statistics is mentioned to frame the discussion about the relationship between evidence and beliefs, but does not feature past the introductory second section. Our paper has a more direct aim: to concretely show how Bayesian analysis can be applied in practice in empirical software engineering research, as an alternative to frequentist statistics; thus, its scope is complementary to [belief-icse16]’s.
As additional validation based on a more specialized venue for empirical software engineering, we also inspected all 105 papers published in Springer’s Empirical Software Engineering (EMSE) journal during the year 2018. Only 22 papers mention the word “Bayes”: 17 of them refer to Bayesian machine learning classifiers (such as naive Bayes or Bayesian networks); 2 of them discuss Bayesian optimization algorithms for machine learning (such as latent Dirichlet allocation word models); 3 of them mention “Bayes” only in the title of some bibliography entries. None of them use Bayesian statistics as a replacement of classic frequentist data analysis techniques.
More generally, we are not aware of any direct application of Bayesian data analysis to empirical software engineering data with the exception of [bayes-extended, F-ICSE17-poster] and [Ernst-bayes]. The technical report [bayes-extended] and its short summary [F-ICSE17-poster] are our preliminary investigations along the lines of the present paper. Ernst [Ernst-bayes] presents a conceptual replication of an existing study to argue the analytical effectiveness of multilevel Bayesian models.
Criticism of the $p$-value. Statistical hypothesis testing—and its summary outcome, the $p$-value—has been customary in experimental science for many decades, both for the influence [rao1992] of its proponents Fisher, Neyman, and Pearson, and because it offers straightforward, ready-made procedures that are computationally simple (with modern computational techniques it is much less important whether statistical methods are computationally simple; we have CPU power to spare—especially if we can trade it for stronger scientific results). More recently, criticism of frequentist hypothesis testing has been voiced in many experimental sciences, such as psychology [pvalue-cohen, pvalue-psychology] and medicine [pvalue-medicine], that used to rely on it heavily, as well as in statistics research itself [ASA-statement, pvalue-statisticians, gelman-pvalues]. The criticism, which we articulate in Sect. 2.2, concludes that $p$-value-based hypothesis testing should be abandoned [abandon-significance, riseup]; even statisticians who still accept null-hypothesis testing recognize the need to change the way it is normally used [down-to-005]. There has been no similar explicit criticism of $p$-values in software engineering research, and in fact statistical hypothesis testing is still regularly used [Bayes1].
Guidelines for using statistics. Best practices of using statistics in empirical software engineering are described in books [datascience-perspectives, experiments-book] and articles [hitchhiker-icse11, JedlitschkaJR14]. Given their focus on frequentist statistics ([hitchhiker-icse11, JedlitschkaJR14, experiments-book] do not mention Bayesian techniques; [hitchhiker-journal] mentions their existence only to declare they are out of scope; one chapter [bayes-in-book] of [datascience-perspectives] outlines Bayesian networks as a machine learning technique), they all are complementary to the present paper, whose main goal is showing how Bayesian techniques can add to, or replace, frequentist ones, and how they can be applied in practice.
2 An Overview of Bayesian Statistics
Statistics provides models of events, such as the output of a randomized algorithm; the probability function $\Pr$ assigns probabilities—values in the real unit interval $[0, 1]$, or equivalently percentages in $[0, 100]$—to events. Often, events are the values taken by random variables that follow a certain probability distribution. For example, if $Z$ is a random variable modeling the throwing of a six-face dice, it means that $\Pr[Z = z] = 1/6$ for $z \in [1..6]$, and $\Pr[Z = z] = 0$ for $z \notin [1..6]$—where $\Pr[Z = z]$ is a shorthand for $\Pr[\{Z = z\}]$, and $[m..n]$ is the set of integers between $m$ and $n$.
The probability of variables over discrete domains is described by probability mass functions (p.m.f. for short); their counterparts over continuous domains are probability density functions (p.d.f.), whose integrals give probabilities. The following presentation mostly refers to continuous domains and p.d.f., although the notions apply to discrete-domain variables as well with a few technical differences. For convenience, we may denote a distribution and its p.d.f. with the same symbol; for example, a random variable $X$ has a p.d.f. also denoted $X$, such that $\Pr[a \le X \le b] = \int_a^b X(x)\,\mathrm{d}x$.
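The dice example above can be checked mechanically. The following snippet (ours, purely illustrative) encodes the p.m.f. of a fair six-face dice using exact fractions and verifies that it defines a proper probability distribution:

```python
from fractions import Fraction

# Probability mass function of a fair six-face dice:
# P[Z = z] = 1/6 for z in [1..6], and 0 otherwise.
def pmf(z: int) -> Fraction:
    return Fraction(1, 6) if 1 <= z <= 6 else Fraction(0)

# A p.m.f. must sum to 1 over the whole domain.
total = sum(pmf(z) for z in range(1, 7))
assert total == 1

# Events are sets of values; their probability is the sum of the p.m.f.,
# e.g. the event "even outcome" = {2, 4, 6} has probability 1/2.
p_even = sum(pmf(z) for z in (2, 4, 6))
assert p_even == Fraction(1, 2)
```

For a continuous variable the sums would become integrals of the p.d.f., as stated above.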
Conditional probability. The conditional probability $\Pr[A \mid B]$ is the probability of $A$ given that $B$ has occurred. For example, $B$ may represent the empirical data that has been observed, and $A$ a hypothesis that is being tested.
Consider a static analyzer that outputs $\top$ (resp. $\bot$) to indicate that the input program never overflows (resp. may overflow); $\Pr[A \mid B]$ is the probability that, when the analyzer outputs $\top$, the input program is indeed free from overflows—the data $B$ is the output “$\top$” and the hypothesis $A$ is “the input does not overflow”.
2.1 Bayes’ Theorem
Bayes’ theorem connects the conditional probabilities $\Pr[A \mid B]$ (the probability that the hypothesis is correct given the experimental data) and $\Pr[B \mid A]$ (the probability that we would see this experimental data given that the hypothesis actually was true). The famous theorem states that

$\Pr[A \mid B] = \dfrac{\Pr[B \mid A] \cdot \Pr[A]}{\Pr[B]}$.   (1)
Here is an example of applying Bayes’ theorem.
Suppose that the static analyzer gives true positives and true negatives with high probability ($\Pr[B \mid A] = \Pr[\neg B \mid \neg A] = 0.99$), and that many programs are affected by some overflow errors (only $\Pr[A] = 0.01$ of the input programs are overflow-free). Whenever the analyzer outputs $\top$, what is the chance that the input is indeed free from overflows? Using Bayes’ theorem, $\Pr[A \mid B] = \frac{0.99 \cdot 0.01}{0.99 \cdot 0.01 + 0.01 \cdot 0.99} = 0.5$, so we conclude that we can have a mere 50% confidence in the analyzer’s output.
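A few lines of Python reproduce this calculation (the function and parameter names are ours):

```python
def posterior(prior_h: float, likelihood: float, false_positive: float) -> float:
    """P[A | B] via Bayes' theorem, where
    likelihood     = P[B | A]      (analyzer says "no overflow" and it is true),
    false_positive = P[B | not A]  (analyzer says "no overflow" but it is false),
    prior_h        = P[A]          (fraction of overflow-free programs)."""
    evidence = likelihood * prior_h + false_positive * (1.0 - prior_h)
    return likelihood * prior_h / evidence

# A 99%-accurate analyzer applied where only 1% of programs are overflow-free:
p = posterior(prior_h=0.01, likelihood=0.99, false_positive=0.01)
print(round(p, 2))  # prints 0.5: only 50% confidence despite 99% accuracy
```

Varying `prior_h` makes the dependence on the base rate explicit: the same 99%-accurate tool is almost fully trustworthy when most inputs are overflow-free.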
Priors, likelihoods, and posteriors. In Bayesian analysis [thinkBayes], each factor of (1) has a special name:
$\Pr[A]$ is the prior—the probability of the hypothesis $A$ before having considered the data;
$\Pr[B \mid A]$ is the likelihood of the data $B$ under hypothesis $A$;
$\Pr[B]$ is the normalizing constant;
and $\Pr[A \mid B]$ is the posterior—the probability of the hypothesis $A$ after taking data $B$ into account.
With this terminology, we can state Bayes’ theorem (1) as “the posterior is proportional to the likelihood times the prior”, that is,

$\Pr[A \mid B] \propto \Pr[B \mid A] \cdot \Pr[A]$,

and hence we update the prior $\Pr[A]$ to get the posterior $\Pr[A \mid B]$.
The only role of the normalizing constant $\Pr[B]$ is ensuring that the posterior defines a correct probability distribution when evaluated over all hypotheses. In most cases we deal with hypotheses $h \in \mathcal{H}$ that are mutually exclusive and exhaustive; then, the normalizing constant is simply $\Pr[B] = \sum_{h \in \mathcal{H}} \Pr[B \mid h] \cdot \Pr[h]$, which can be computed from the rest of the information. Thus, it normally suffices to define likelihoods that are proportional to a probability, and rely on the update rule to normalize them and get a proper probability distribution as posterior.
2.2 Frequentist vs. Bayesian Statistics
Despite being a simple result about an elementary fact in probability, Bayes’ theorem has significant implications for statistical reasoning. We do not discuss the philosophical differences between how frequentist and Bayesian statistics interpret their results [philosophy]. Instead, we focus on describing how some features of Bayesian statistics support new ways of analyzing data. We start by criticizing statistical hypothesis testing since it is a customary technique in frequentist statistics that is widely applied in empirical software engineering research, and suggest how Bayesian techniques could provide more reliable analyses. For a systematic presentation of Bayesian statistics see Kruschke [puppies], McElreath [rethinking], and Gelman et al. [BDA].
2.2.1 Hypothesis Testing vs. Model Comparison
A primary goal of experimental science is validating models of behavior based on empirical data. This often takes the form of choosing between alternative hypotheses, such as, in the software engineering context, deciding whether automatically generated tests support debugging better than manually written tests (LABEL:sec:testing), or whether a programming language is faster than another (LABEL:sec:rosetta).
Hypothesis testing is the customary framework offered by frequentist statistics to choose between hypotheses. In the classical setting, a null hypothesis $H_0$ corresponds to “no significant difference” between two treatments $A$ and $B$ (such as two static analysis algorithms whose effectiveness we want to compare); an alternative hypothesis $H_1$ is the null hypothesis’s negation, which corresponds to a significant difference between applying $A$ and applying $B$. A null hypothesis significance test [hitchhiker-journal], such as the $t$-test or the $U$-test, is a procedure that takes as input a combination of two datasets $D_A$ and $D_B$, respectively recording the outcome of applying $A$ and $B$, and outputs a probability called the $p$-value.
The $p$-value is the probability, under the null hypothesis, of observing data at least as extreme as the ones that were actually observed. Namely, it is the probability $\Pr[d \ge \overline{d} \mid H_0]$ of drawing data $d$ that is equal to the observed data $\overline{d}$ or more “extreme”, conditional on the null hypothesis $H_0$ holding (depending on the nature of the data, the $p$-value may also be defined as a left-tail event $\Pr[d \le \overline{d} \mid H_0]$ or as a 2-tailed event $\Pr[|d| \ge |\overline{d}| \mid H_0]$). As shown visually in Fig. 1, data is expected to follow a certain distribution under the null hypothesis (the actual distribution depends on the statistical test that is used); the $p$-value measures the tail probability of drawing data equal to the observed or more unlikely than it—in other words, how far the observed data is from the most likely observations under the null hypothesis. If the $p$-value is sufficiently small—typically $p \le 0.05$ or $p \le 0.01$—it means that the null hypothesis is an unlikely explanation of the observed data. Thus, one rejects the null hypothesis, which corresponds to leaning towards preferring the alternative hypothesis $H_1$ over $H_0$: in this case, we have increased our confidence that $A$ and $B$ differ.
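To make the tail-probability definition concrete, here is a sketch of a permutation test, one simple way to estimate a two-tailed $p$-value by simulation (the two datasets are invented, not taken from the paper's case studies): under $H_0$ the group labels are exchangeable, so we repeatedly reshuffle them and count how often the resulting difference of means is at least as extreme as the observed one.

```python
import random
import statistics

random.seed(1)

# Hypothetical outcomes of treatments A and B.
a = [5.1, 4.8, 5.6, 5.3, 4.9, 5.4]
b = [4.2, 4.6, 4.1, 4.9, 4.4, 4.3]
observed = statistics.mean(a) - statistics.mean(b)

# Under H0 ("no difference") the labels are exchangeable: estimate the
# two-tailed p-value P[|d| >= |observed| | H0] by relabeling at random.
pooled = a + b
trials = 10_000
count = 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = statistics.mean(pooled[:6]) - statistics.mean(pooled[6:])
    if abs(diff) >= abs(observed):
        count += 1
p_value = count / trials
print(p_value < 0.05)  # prints True: H0 is rejected at the 5% level
```

Note that even here the output is $\Pr[d \ge \overline{d} \mid H_0]$, a statement about the data given $H_0$, not about $H_0$ given the data.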
Unfortunately, this widely used approach to testing hypotheses suffers from serious shortcomings [pvalue-cohen], which have prompted calls to seriously reconsider how it is used as a statistical practice [ASA-statement, down-to-005], or even to abandon it altogether [abandon-significance, riseup]. The most glaring problem is that, in order to decide whether $H_0$ is a plausible explanation of the data, we would need the conditional probability $\Pr[H_0 \mid d]$ of the hypothesis given the data, not the $p$-value $\Pr[d \ge \overline{d} \mid H_0]$. The two conditional probabilities are related by Bayes’ theorem (1), but knowing only the latter is not enough to determine the former (assuming that they are equal is the “confusion of the inverse” [fallacy-inverse]); in fact, Sect. 2.1’s example of the static analyzer (Ex. 2.1) showed a case where one conditional probability is 99% while the other is only 50%.
Rejecting the null hypothesis in response to a small -value looks like a sound inference, but in reality it is not, because it follows an unsound probabilistic extension of a reasoning rule that is perfectly sound in Boolean logic. The sound rule is modus tollens: