The progression of science and of the scientific method go hand in hand. The development of new theories requires—and at the same time facilitates—the development of new methods for their validation.
Pioneers of machine learning were playing with ideas: new approaches, such as the induction of classification trees, were worthy of publication for the sake of their interestingness. As the field progressed and found more practical uses, variations of similar ideas began emerging, and with them the interest in determining which of them work better in practice. A typical example is the collection of different measures for assessing the quality of attributes; deciding which work better than others required tests on actual, real-world data. Papers thus kept introducing new methods and measured, for instance, classification accuracies to prove their advantages over the existing methods. To ensure the validity of such claims, we adopted—starting with the work of Dietterich98 and Salzberg97, and later followed by demvsar2006statistical—the common statistical methodology used in all scientific areas relying on empirical observations: null hypothesis significance testing (NHST).
This spread the understanding that observed results require statistical validation. On the other hand, NHST soon proved inadequate for many reasons (demvsar2008appropriateness). Notably, the American Statistical Association has recently made a statement against p-values (wasserstein2016asa). NHST is nowadays also falling out of favour in other fields of science (banned). We believe that the field of machine learning is ripe for a change as well.
We will spend a whole section demonstrating the many problems of NHST. In a nutshell: it does not answer the question we ask. In a typical scenario, a researcher proposes a new method and desires to prove that it is more accurate than another method on a single data set or on a collection of data sets. She thus runs the competing methods and records their results (classification accuracy or another appropriate score) on one or more data sets, which is followed by NHST. The difference between what the researcher has in mind and what the NHST provides is evident from the following quote from a recently published paper: “Therefore, at the 90% confidence level, we can conclude that (…) method is able to significantly outperform the other approaches.” This is wrong. The stated 90% confidence level is not the probability of one classifier outperforming another. The NHST computes the probability of getting the observed (or a larger) difference between classifiers if the null hypothesis of equivalence were true, which is not the probability of one classifier being more accurate than another, given the observed empirical results. Another common problem is that the claimed statistical significance might have no practical impact. Indeed, the common usage of NHST relies on the wrong assumptions that the p-value is a reasonable proxy for the probability of the null hypothesis and that statistical significance implies practical significance.
As we wrote at the beginning, the development of science not only requires but also facilitates the improvement of scientific methods. The advancement of computational techniques and power has reinvigorated the interest in Bayesian statistics. Bayesian modelling is now widely adopted for designing principled algorithms for learning from data (bishop2007pattern; murphy2012machine). It is time to also switch to Bayesian statistics when it comes to the analysis of our own results.
The questions we are actually interested in—e.g., is method A better than B? Based on the experiments, how probable is it that A is better? How high is the probability that A is better by more than 1%?—are questions about posterior probabilities. These are naturally provided by Bayesian methods (edwards1963bayesian; dickey1973scientific; berger1987testing). The core of this paper is thus a section that establishes the Bayesian alternatives to frequentist NHST and discusses their inference and results. Finally, we also describe the software libraries with the necessary algorithms and give short instructions for their use.
2 Frequentist analysis of experimental results
Why do we need to go beyond the frequentist analysis of experimental results? To answer this question, we will focus on a practical case: the comparison of the accuracy of classifiers on different datasets. We initially consider two classifiers: naive Bayes (nbc) and averaged one-dependence estimator (aode). A description of these algorithms with exhaustive references is given in the book by witten2011data. Assume that our aim is to compare nbc versus aode. These are the steps we must follow:
choose a comparison metric;
select a group of datasets to evaluate the algorithms;
perform 10 runs of 10-fold cross-validation for each classifier on each dataset.
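If one wanted to reproduce these steps outside WEKA, the procedure can be sketched with scikit-learn. The nbc and aode implementations of WEKA are not available there, so the two classifiers below (and the iris dataset) are stand-ins chosen purely to illustrate the paired 10×10 cross-validation procedure:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Stand-in classifiers and dataset for illustration only.
X, y = load_iris(return_X_y=True)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)

diffs = []
for train, test in cv.split(X, y):
    acc_a = GaussianNB().fit(X[train], y[train]).score(X[test], y[test])
    acc_b = DecisionTreeClassifier(random_state=0).fit(
        X[train], y[train]).score(X[test], y[test])
    diffs.append(acc_a - acc_b)   # same folds for both: paired differences
diffs = np.asarray(diffs)         # 100 paired differences (10 runs x 10 folds)
```

Because both classifiers see exactly the same folds, the per-fold differences are paired, which is what the analyses below operate on.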
We have performed these steps in WEKA, choosing accuracy as the metric, on a collection of 54 data sets downloaded from the WEKA website (see http://www.cs.waikato.ac.nz/ml/weka/), with 10 runs of 10-fold cross-validation. Table 1 reports the accuracies obtained on each dataset by each classifier.
First of all, we aim to identify the best classifier for each dataset. The answer to this question is probabilistic, since on each data set we only have estimates of the performance of each classifier.
Table 1: accuracies of the two classifiers on each dataset (10 runs of 10-fold cross-validation).
Since during cross-validation we have provided both classifiers with the same training and test sets, we can compare the classifiers by considering the difference in accuracies on each test set. This yields the vector of differences of accuracies d = (d_1, …, d_n), with n = 100 (10 runs of 10-fold cross-validation). We can compute the mean d̄ of the vector of differences, i.e., the mean difference of accuracy between the two classifiers, and statistically evaluate whether the mean difference is significantly different from zero. In frequentist analysis, we would perform a NHST using the t-test. The problem is that the t-test assumes the observations to be independent. However, the differences of accuracies are not independent of each other because of the overlapping training sets used in cross-validation. Thus the usual t-test is not calibrated when applied to the analysis of cross-validation results: when sampling the data under the null hypothesis, its rate of Type I errors is much larger than α (Dietterich98). Moreover, the correlation cannot be estimated from data; nadeau2003inference have proven that there is no unbiased estimator of the correlation of the results obtained on the different folds. Introducing some approximations, they have proposed a heuristic to choose the correlation parameter: ρ = n_test/n, where n_train, n_test and n respectively denote the size of the training set, of the test set and of the whole available data set. (nadeau2003inference considered the case in which random training and test sets are drawn from the original data set. This is slightly different from k-fold cross-validation, in which the folds are designed not to overlap. However, the correlation heuristic by nadeau2003inference has since become commonly used to analyse cross-validation results (bouckaert2003choosing).) The corrected statistic

t = (x̄ − μ₀) / sqrt((1/n + ρ/(1 − ρ)) σ̂²)

follows a Student distribution with n − 1 degrees of freedom, where x̄ and σ̂ are the sample mean and sample standard deviation of the data, ρ is the correlation between the observations and μ₀ is the value of the mean we aim at testing.
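The corrected (correlated) t-test with the Nadeau-Bengio correlation heuristic can be sketched as follows; the function name and interface are ours, not from a library:

```python
import numpy as np
from scipy import stats

def correlated_ttest(diffs, n_train, n_test, mu0=0.0):
    """Corrected (correlated) t-test for cross-validation results.

    diffs : the n paired accuracy differences from r runs of k-fold CV.
    The correlation between folds is approximated by the Nadeau-Bengio
    heuristic rho = n_test / (n_train + n_test).
    Returns the t statistic and the two-sided p-value.
    """
    d = np.asarray(diffs, dtype=float)
    n = len(d)
    rho = n_test / (n_train + n_test)               # correlation heuristic
    var = d.var(ddof=1)                             # sample variance
    denom = np.sqrt((1.0 / n + rho / (1.0 - rho)) * var)
    t = (d.mean() - mu0) / denom
    p = 2.0 * stats.t.sf(abs(t), df=n - 1)          # two-sided p-value
    return t, p
```

Note that as rho grows (larger test folds relative to the training set), the denominator grows and the test becomes harder to reject, which is exactly the correction for the optimism of the naive t-test.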
Table 2 reports the two-sided p-values for each comparison on each dataset obtained via the correlated t-test. The common practice in NHST is to declare all the comparisons with p < 0.05 significant, i.e., the accuracy of the two classifiers is significantly different on that dataset. Conversely, all the comparisons with p ≥ 0.05 are declared not significant. Note that these significance tests can, under the NHST paradigm, only be considered in isolation; combined, they require either an omnibus test like ANOVA or corrections for multiple comparisons.
2.1 NHST: the pitfalls of black and white thinking
Despite being criticized since its inception, NHST is still considered necessary for publication, as it is trusted as an objective proof of a method’s quality. One of the key problems of decisions based on p-values is that they lead to “black and white thinking”, which ignores the fact that (i) a statistically significant difference is completely different from a practically significant difference (berger1987testing); (ii) two methods that are not statistically significantly different are not necessarily equivalent. The NHST and this p-value-related “black and white thinking” do not allow for making informed decisions. Hereafter, we list the limits of NHST in order of severity using, as a working example, the assessment of the performance of classifiers.
NHST does not estimate probabilities of hypotheses.
What is the probability that the performance of two classifiers is different (or equal)? This is the question we are asking when we compare two classifiers; and NHST cannot answer it.
In fact, the p-value represents the probability of getting the observed (or larger) differences assuming that the performance of the classifiers is equal (the null hypothesis H0). Formally, the test computes P(T(D) ≥ tα | H0), where T(D) is the statistic computed from the data D and tα is the critical value corresponding to the test and the selected α. This is not the probability of the hypothesis, P(H0 | D), in which we are interested.
Yet, researchers want to know the probability of the null and the alternative hypotheses on the basis of the observed data, rather than the probability of the data assuming the null hypothesis to be true. Sentences like “at the 95% confidence level, we can conclude that (…)” are formally correct, but they seem to imply that 95% is the probability of the alternative hypothesis, while in fact the test controls P(data | H0), which is not the same as P(H0 | data). This is summed up in Table 3.
Table 3: what we compute (the probability of the data given the null hypothesis) versus what we would like to know (the probability of the hypotheses given the data).
Point-wise null hypotheses are practically always false.
The difference between two classifiers can be very small; however there are no two classifiers whose accuracies are perfectly equivalent.
By using a NHST, the null hypothesis is that the classifiers are equal. However, the null hypothesis is practically always false! By rejecting the null hypothesis, NHST indicates that the null hypothesis is unlikely; but this is known even before running the experiment. This problem of the NHST has been pointed out in many different scientific domains (lecoutre2014significance). A consequence is that, since the null hypothesis is always false, by adding enough data points it is possible to claim significance even when the effect size is trivial. This is because the p-value is affected both by the sample size and the effect size, as discussed in the next section. Quoting kruschke2015bayesian: “null hypotheses are straw men that can virtually always be rejected with enough data.”
The p-value does not separate the effect size from the sample size.
The usual explanation for this phenomenon is that if the effect size is small, more data are needed to demonstrate the difference. Enough data can confirm arbitrarily small effects. Since the sample size is manipulated by the researcher and the null hypothesis is always wrong, the researcher can reject it by testing the classifiers on enough data. On the contrary, conceivable differences may fail to yield small p-values if there are not enough suitable data for testing the method (e.g., not enough datasets). Even if we pay attention not to confuse the p-value with the probability of the null hypothesis, the p-value is intuitively understood as an indicator of the effect size. In reality, it is a function of both effect size and sample size: equal p-values do not imply equal effect sizes.
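A small simulation illustrates the point: the true effect below is fixed and practically negligible, yet the p-value of a one-sample t-test keeps shrinking as the sample grows. The effect and noise values are arbitrary choices for the sketch:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
effect = 0.001                  # a fixed, practically negligible mean difference
noise = 0.05                    # standard deviation of the observations

ps = []
for n in (100, 10_000, 1_000_000):
    sample = rng.normal(loc=effect, scale=noise, size=n)
    _, p = stats.ttest_1samp(sample, popmean=0.0)   # test H0: mean = 0
    ps.append(float(p))
# with enough observations, even this trivial effect becomes "significant"
```

The effect size never changes; only the sample size does. This is why a small p-value on its own says nothing about practical relevance.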
Figure 1 reports the density plots of the differences of accuracy between nbc and aode on the dataset hepatitis in two cases: (1) considering only a subset of the accuracies in Table 1 (left); (2) considering all the accuracies (right). The two orange vertical lines define the region in which the difference of accuracy is less than 1%—the meaning of these lines will be clarified in the next sections.
The p-value in the first case is above 0.05 and so the null hypothesis cannot be rejected. With all the accuracies, the p-value drops below 0.05 and so the null hypothesis can be rejected. This demonstrates how adding data leads to rejection of the null hypothesis although the difference between the two classifiers is very small on this dataset (all the mass is inside the two orange vertical lines). Practical significance can be equated with the effect size, which is what the researcher is interested in. Statistical significance—the p-value—is not a measure of practical significance, as shown in the example.
NHST ignores magnitude and uncertainty.
A very important problem with NHST is that the result of the test provides no information about the magnitude of the effect or the uncertainty of its estimate, which are the key pieces of information we should aim at knowing.
A consequence of this limitation is that: (i) a null hypothesis can be rejected despite a very small effect; (ii) a null hypothesis can be rejected even though there is a large uncertainty in the effect’s magnitude and the region of uncertainty includes (or is extremely close to) zero, that is, no effect. Figure 1 (right) shows a case in which p < 0.05 and the result is therefore declared to be statistically significant. However, from the density plot, it is clear that the magnitude of the effect is very small (all inside the orange vertical lines bounding the region of a less-than-1% difference of accuracy). Thus rejecting a null hypothesis does not provide any information about the magnitude of the effect and whether or not the effect is trivial. Figure 2 shows two cases in which p < 0.05 and the result is therefore declared to be statistically significant. The p-values are similarly low, but the two cases are extremely different. For the dataset ecoli (left), the differences of accuracy are spread over a wide interval (the magnitude of the uncertainty is very large), while in the second case the data are spread over a much narrower interval, a much smaller uncertainty. Thus, rejecting a null hypothesis does not provide us with any information about the uncertainty of the estimate.
NHST yields no information about the null hypothesis.
What can we say when NHST does not reject the null hypothesis?
The scientific literature contains examples of non-significant tests interpreted as evidence of no difference between the two methods/groups being compared. This is wrong since NHST cannot provide evidence in favour of the null hypothesis (see also lecoutre2014significance for further examples on this point). When NHST does not reject the null hypothesis, no conclusion can be made.
Researchers may be interested in the probability that two classifiers are equivalent. For instance, a recent paper contains the following conclusion: “there is no significant difference between (…) under the significance level of 0.05. This is quite a remarkable conclusion.” This is not correct: we cannot conclude anything in this case! Consider for example Figure 3 (left), which shows the density plot of the differences of accuracy between nbc and aode for the dataset audiology. The correlated t-test gives a p-value of 0.622 and so it fails to reject the null hypothesis. However, NHST does not allow us to reach any conclusion about the possibility that the two classifiers may actually be equivalent on this dataset, although the majority of the data (density plot) seems to support this hypothesis. We tend to interpret these cases as acceptance of the null hypothesis or even to (mis)understand the p-value as a “62.2% probability that the performance of the classifiers is the same”. This is also evident from Figure 3 (right), where a very similar p-value corresponds to a different density plot—with more uncertainty and so less evidence in favour of the null.
There is no principled way to decide the α level.
In the above examples, we rejected the null hypothesis at p < 0.05. We understand this difference as significant, but not as significant as if we could reject it at p < 0.01. What is the actual meaning of this? Could we also reject it at p < 0.1? The α level represents the crucial threshold for declaring the experiment successful. Given its importance, it needs to be set with care. However, since it is used to compare the meaningless p-values, α is equally meaningless. By using the NHST, the researcher is forced to select an important threshold with no practical meaning. Using the customary thresholds of 0.05 and 0.01 merely allows her/him to shift the responsibility to an unsubstantiated traditional habit.
We pretend that α is the proportion of cases in which H0 would be falsely rejected if the same experiment were repeated. This would be true if the p-value represented the probability of the null hypothesis. In reality, α is the proportion of experiments that yield data more extreme than expected under H0; the actual probability of falsely rejecting H0 also depends on the prior probability of H0 and on the data.
For α to be meaningful, it would need to refer to the required effect size or at least to the probability of the hypothesis, not to the likelihood of the data.
The inference depends on the sampling intention.
Consider analysing a data set of n observations with a NHST. The sampling distribution used to determine the critical value of the test assumes that our intention was to collect exactly n observations. If the intention was different—for instance, in machine learning one typically compares two algorithms on all the datasets that are available—the sampling distribution changes to reflect the actual sampling intentions (kruschke2010bayesian). This is never done, given the difficulty of formalizing one’s intention and of devising an appropriate sampling distribution. The problem is thus important but generally ignored: for the same data set, the hypothesis test (and thus the p-value) should be computed differently depending on the intention of the person who collected the data (kruschke2010bayesian).
3 Bayesian analysis of experimental results
There are two main Bayesian approaches for the analysis of experimental results. The first, like NHST, is based on a null value. The analyst has to set up two competing models of what values are possible. One model assumes that only the null value is possible. The alternative model assumes that a broad range of other values is also possible. Bayesian inference is used to compute which model is more credible, given the data. This method is called Bayesian model comparison and uses so-called Bayes factors (berger1985; aitkin1991posterior; kass1995bayes; berger1996intrinsic).
The second Bayesian approach does not set any null value. The analyst simply has to set up a range of candidate values (prior model), including the zero effect, and use Bayesian inference to compute the relative credibilities of all the candidate values (the posterior distribution). This method is called Bayesian estimation (gelman2014bayesian; kruschke2015doing).
The choice of the method depends on the specific question that the analyst aims at answering, but in machine learning the estimation approach is usually preferable because it provides richer information to the analyst. For this reason, we will focus on the Bayesian estimation approach that hereafter we will simply call Bayesian analysis.
The first step in Bayesian analysis is establishing a descriptive mathematical model of the data. In a parametric model, this is the likelihood function p(D | θ), which provides the probability of the observed data D for each candidate value of the parameter(s) θ. The second step is to establish the credibility of each value of the parameter(s) before observing the data, the prior distribution p(θ). The third step is to use Bayes’ rule to combine likelihood and prior into the posterior distribution p(θ | D) of the parameter(s) given the data. The questions we pose in statistical analysis can be answered by querying this posterior distribution in different ways.
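The three steps can be illustrated on a toy conjugate model: a beta-binomial sketch with made-up counts, unrelated to the correlated test used in the rest of the paper:

```python
from scipy import stats

# Toy problem: inferring a classifier's accuracy theta from n test
# predictions, k of which are correct (made-up numbers).
k, n = 83, 100

# 1) likelihood: k ~ Binomial(n, theta)
# 2) prior: theta ~ Beta(1, 1), i.e. uniform on [0, 1]
a, b = 1.0, 1.0
# 3) posterior, by conjugacy: theta | data ~ Beta(a + k, b + n - k)
posterior = stats.beta(a + k, b + n - k)

# Querying the posterior answers the questions we actually care about:
p_above_080 = 1.0 - posterior.cdf(0.80)   # P(theta > 0.8 | data)
```

The quantity `p_above_080` is a genuine posterior probability of a hypothesis, which is exactly what NHST cannot deliver.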
As a concrete example of Bayesian analysis, we will compare the accuracies of two competing classifiers via cross-validation on multiple data sets (Table 1). For this purpose, we will adopt the correlated Bayesian t-test proposed by coraniML2015.
Here σ² is the variance of each observation, and the covariance matrix Σ takes into account the correlation due to cross-validation: its diagonal elements equal σ² and its off-diagonal elements equal ρσ². Hence, the likelihood model of the data x = (x₁, …, xₙ) is the multivariate normal

p(x | μ, Σ) = MVN(x; 1μ, Σ), (5)

where 1 denotes a vector of ones. As a prior on the mean μ and the precision 1/σ² we take a Normal-Gamma distribution (bernardo2009bayesian, Chap. 5), which is conjugate to the likelihood (5). If we choose the prior parameters of the so-called matching prior, the resulting posterior distribution of μ is the following Student distribution:

μ | x ~ St(n − 1, x̄, (1/n + ρ/(1 − ρ)) σ̂²). (6)
In (6), we have reported the posterior distribution obtained under the matching prior—for which the posterior probability computed by the Bayesian correlated t-test and the p-value of the frequentist correlated t-test are numerically equivalent. Our aim is to show that, although they are numerically equivalent, the inferences drawn by the two approaches are very different. In particular, we will show that a different interpretation of the same numerical value can completely change our perspective and allow us to make informative decisions. In other words, in this case the cassock does make the priest!
3.1 Comparing nbc and aode through Bayesian analysis: colour thinking
Consider the dataset squash-unsorted, with the posterior computed by the Bayesian correlated t-test for the difference between nbc and aode, as shown in Figure 4. The vertical orange lines again mark the region corresponding to a difference of accuracy of less than 1% (we will clarify the meaning of this region later in the section).
In Bayesian analysis, the experiment is summarized by the posterior distribution (in this case a Student distribution). The posterior describes the distribution of the mean difference of accuracies between the two classifiers.
By querying the posterior distribution, we can evaluate the probability of a hypothesis. We can, for instance, infer the probability that nbc is better or worse than aode. Formally, P(nbc > aode) is the integral of the posterior distribution from zero to infinity or, equivalently, the posterior probability that the mean of the differences of accuracy between nbc and aode is greater than zero. P(nbc < aode) is the integral of the posterior between minus infinity and zero or, equivalently, the posterior probability that the mean of the differences of accuracy is less than zero.
Can we say anything about the probability that nbc is practically equivalent to aode? Bayesian analysis can answer this question. First, we need to define the meaning of “practically equivalent”. In classification, it is sensible to state that two classifiers whose mean difference of accuracies is less than 1% are practically equivalent. The interval [−0.01, 0.01] thus defines a region of practical equivalence (rope) (kruschke2015bayesian) for classifiers. (In classification, 1% seems to be a reasonable choice; in other domains a different value could be more suitable.) Once we have defined a rope, from the posterior we can compute the probabilities:
P(left): the posterior probability of the mean difference of accuracies being practically negative, namely the integral of the posterior on the interval (−∞, −0.01);
P(rope): the posterior probability of the two classifiers being practically equivalent, namely the integral of the posterior over the rope interval;
P(right): the posterior probability of the mean difference of accuracies being practically positive, namely the integral of the posterior on the interval (0.01, ∞).
P(rope) is the integral of the posterior distribution between the vertical lines (the rope) shown in Figure 4, and it represents the probability that the two classifiers are practically equivalent. Similarly, we can compute the probabilities that the two classifiers are practically different, which are P(left) and P(right).
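Given the Student posterior of the Bayesian correlated t-test, the three probabilities are obtained by evaluating its cumulative distribution at the rope boundaries. A sketch, assuming a 1% rope and the correlation heuristic described earlier (the function name is ours):

```python
import numpy as np
from scipy import stats

def posterior_probs(diffs, n_train, n_test, rope=0.01):
    """P(left), P(rope), P(right) from the Student posterior of the
    Bayesian correlated t-test under the matching prior."""
    d = np.asarray(diffs, dtype=float)
    n = len(d)
    rho = n_test / (n_train + n_test)     # Nadeau-Bengio heuristic
    scale = np.sqrt((1.0 / n + rho / (1.0 - rho)) * d.var(ddof=1))
    post = stats.t(df=n - 1, loc=d.mean(), scale=scale)
    p_left = post.cdf(-rope)              # mass below the rope
    p_rope = post.cdf(rope) - p_left      # mass inside the rope
    p_right = 1.0 - post.cdf(rope)        # mass above the rope
    return p_left, p_rope, p_right
```

Since the Student posterior is symmetric and unimodal, its central credible interval, e.g. `post.interval(0.95)`, coincides with the 95% HDI discussed next.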
The posterior also shows the uncertainty in the estimate, because the distribution shows the relative credibility of values across the continuum. One way to summarize the uncertainty is by marking the span of the most credible values that cover a given fraction of the distribution (e.g., 95%). These are called High Density Intervals (HDIs); the 95% HDI is shown in Figure 5 (center).
Thus the posterior distribution equipped with rope:
estimates the posterior probability of a sensible null hypothesis (the area within the rope);
claims significant differences that also have a practical meaning (the area outside the rope);
represents magnitude (effect size) and uncertainty (HDIs);
does not depend on the sampling intentions.
To see this, we apply Bayesian analysis to the critical cases for NHST presented in the previous section. Figure 6 shows the posterior for hepatitis (top), ecoli (left) and iris (right). For hepatitis, all the probability mass is inside the rope and so we can conclude that nbc and aode are practically equivalent: P(rope) ≈ 1. For ecoli and iris, the probability mass lies entirely outside the rope, so the two classifiers are practically different with probability ≈ 1. However, the posterior gives us more information: there is much more uncertainty in the iris dataset. The posteriors thus provide us with the same information as the density plots shown in Figures 1–3.
Consider now Figure 7; those are two examples for which we cannot clearly say whether nbc and aode are practically equivalent or practically different. We cannot decide in an obvious way and this is evident from the posterior. Compare these figures with Figure 4, which is similar.
Let us repeat the previous analysis for all datasets: Figure 8 reports the posteriors for nbc versus aode in all those cases. Looking at the posteriors, we see that there are cases where aode is practically better than nbc, since all the posterior is outside (to the left of) the rope (in the datasets ecoli, iris, monks1, monks, mushroom, nursery, optdigits, pasture, segment, spambase, credit, vowel). There are datasets where nbc and aode are practically equivalent (hayes-roth, hungarian14, hepatitis, labor, wine, yeast), since the entire posterior is inside the rope. We also see that there are no cases where nbc is practically better than aode (posterior to the right of the rope). The posteriors give us information about the magnitude of the effect size, practical difference and equivalence, as well as the related uncertainty.
3.2 Sensible automatic decisions
In machine learning, we often need to perform many analyses, so it may be convenient to devise a tool for automatic decisions from the posterior. This means that we have to summarize the posterior with a few numbers. However, we must be aware that every time we do that, we go back to the sort of black and white analysis whose limitations have been discussed before. By summarizing the posterior, we lose information, but we can do so in a conscious way. We have already explained the advantages of introducing a rope, so we can make automatic decisions based on the three probabilities P(left), P(rope) and P(right). In this way, we lose information but we introduce shades into the black and white thinking. P(left), P(rope) and P(right) are probabilities and their interpretation is clear. P(rope) is the area of the posterior within the rope and represents the probability that nbc and aode are practically equivalent. P(left) is the area of the posterior to the left of the rope and corresponds to the probability that one classifier is practically better than the other; P(right) is the area to the right of the rope and corresponds to the opposite case. Since these are the actual probabilities of the decisions we are interested in, we need not think in terms of Type-I errors to make decisions. We can simply make decisions using these probabilities, which have a direct interpretation—contrary to p-values. For instance, we can decide in favour of whichever hypothesis has posterior probability greater than 0.95.
We can also adopt a different probability threshold, if this is appropriate in the given context. Table 4 compares the Bayesian decisions based on the above rule with the NHST decisions. In the datasets where NHST fails to reject the null, the Bayesian analysis is often still able to declare the two classifiers practically equivalent. Conversely, in the datasets where NHST rejects the null, the Bayesian analysis mostly declares one classifier practically better than the other, in a few cases declares them practically equivalent, and in the remaining cases makes no decision. Overall, the Bayesian analysis makes a decision on more datasets than NHST.
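The automatic decision rule can be written in a few lines; the threshold of 0.95 and the returned labels are illustrative:

```python
def decide(p_left, p_rope, p_right, threshold=0.95):
    """Summarize the posterior into one of four decisions: declare the
    hypothesis whose posterior probability exceeds the threshold, or
    abstain when none does."""
    if p_left > threshold:
        return "left is practically better"
    if p_rope > threshold:
        return "practically equivalent"
    if p_right > threshold:
        return "right is practically better"
    return "no decision"
```

Unlike the p-value threshold, the quantity compared against 0.95 here is the actual posterior probability of the decision being made.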
Table 4 (out of 54 datasets): Bayesian decisions in the comparisons where NHST does not reject the null, and in those where it rejects it.
Sometimes, also in Bayesian analysis, it may be convenient to think in terms of errors. We can easily do that by defining a loss function, which specifies the loss we incur when making the wrong decision. The decisions we can take are: one classifier is practically better than the other (in either direction), the two classifiers are practically equivalent, or none of them (no decision). Consider for instance the following loss matrix.
Each of the first three rows gives the loss we incur in making the corresponding decision under each of the three possible true states; the last row is the loss we incur in making no decision. The expected losses are obtained by multiplying the loss matrix by the vector of the posterior probabilities P(left), P(rope) and P(right). The row corresponding to the lowest expected loss determines the final decision. With a suitable choice of the losses, this leads to the same decision rule discussed previously (decide when a posterior probability exceeds 0.95).
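A sketch of this computation, with a hypothetical 0/1 loss matrix and an abstention loss of 0.05; the actual entries used in the paper may differ:

```python
import numpy as np

# Rows: decisions (left, rope, right, no decision).
# Columns: true states (left, rope, right).
# These entries are assumptions for the sketch.
L = np.array([
    [0.00, 1.00, 1.00],   # decide left
    [1.00, 0.00, 1.00],   # decide rope
    [1.00, 1.00, 0.00],   # decide right
    [0.05, 0.05, 0.05],   # no decision: small constant loss
])

p = np.array([0.96, 0.03, 0.01])      # posterior P(left), P(rope), P(right)
expected_loss = L @ p                 # one expected loss per decision
decisions = ["left", "rope", "right", "no decision"]
best = decisions[int(np.argmin(expected_loss))]
```

With these losses, deciding in favour of a state is optimal exactly when its posterior probability exceeds 1 − 0.05 = 0.95, which recovers the decision rule above.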
3.3 Comparing NHST and Bayesian analysis for other classifiers
In this section we extend the previous analysis to other classifiers besides nbc and aode: hidden naive Bayes (hnb), j48 decision tree (j48) and j48 grafted (j48-gr).
The aim of this section is to show that the pattern described above is general and it also holds for other classifiers. The results are presented in two tables. First we report on the cases in which NHST does not reject the null (Tab. 5). Then we report on the comparisons in which the NHST rejects the null (Tab. 6).
The NHST does not reject the null hypothesis (Tab. 5) in 341/540 comparisons. In these cases the NHST does not allow any conclusion: it cannot tell whether the null hypothesis is true (a point null is, however, always false, as already discussed) or whether it is false but the evidence is too weak to reject it.
By applying the Bayesian correlated t-test with the rope and taking decisions as discussed in the previous section, we can draw more informative conclusions. In 74/341 = 22% of the cases in which NHST fails to reject, the posterior probability of the rope is larger than 0.95, allowing us to declare that the two analyzed classifiers are practically equivalent. In the remaining cases, no hypothesis reaches posterior probability 0.95 and no conclusion can be drawn.
The rope thus provides a sensible null hypothesis which can be accepted on the basis of the data. When this happens we conclude that the two classifiers are practically equivalent. This is impossible with the NHST.
Table 5 (out of 54 datasets per pair): Bayesian decisions (P(rope) > .95 or no decision) in the comparisons where NHST does not reject the null.
Table 6: Bayesian decisions in the comparisons where NHST rejects the null.
Let us now consider the comparisons in which NHST claims significance (Tab. 6). There are 199 such cases. The Bayesian estimation procedure confirms the significance of 142/199 (71%) of them: in these cases it estimates either P(left) > 0.95 or P(right) > 0.95 (in each comparison, left and right denote which of the two classifiers is practically better), so the accuracies of the two compared classifiers are practically different. In 51/199 cases (26%) the Bayesian test makes no decision: a sizeable amount of probability lies within the rope, despite the statistical significance claimed by the frequentist test. In the remaining cases (6/199 = 3%) the Bayesian test concludes that the two classifiers are practically equivalent (P(rope) > 0.95) despite the significance claimed by the NHST; here it draws the opposite conclusion from the NHST.
Summing up, the Bayesian test with the rope is more conservative, reducing the claimed significances by about 29% compared with the NHST. The Bayesian test is more conservative because of the rope, which constitutes a sensible null hypothesis, while the point null hypothesis of the NHST is surely wrong. However, counting the detection of practically equivalent classifiers as decisions, the Bayesian test with the rope takes more decisions than the NHST (222 vs. 199).
4 Comparing two classifiers on multiple data sets
So far we have discussed how to compare two classifiers on the same data set. In machine learning, another important problem is how to compare two classifiers on a collection of different data sets, after having performed cross-validation on each data set.
4.1 The frequentist approach
There is no direct NHST able to perform such a statistical comparison, i.e., one that takes as input the runs of the -fold cross-validation differences of accuracy for each dataset and returns a statistical decision about which classifier is better across the datasets. The usual NHST procedure employed for such an analysis has two steps:
compute the mean difference of accuracy for each dataset (averaging the differences of accuracies obtained in the runs of the -fold cross-validation);
perform a NHST to establish if the two classifiers have different performance or not based on these mean differences of accuracy.
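The two steps above can be sketched with scipy. This is a minimal sketch on synthetic data: `scores_a` and `scores_b` are hypothetical arrays of shape (datasets, cross-validation runs), not the paper's actual results.

```python
import numpy as np
from scipy.stats import wilcoxon

# Synthetic cross-validation accuracies for two classifiers on 54 datasets
# (hypothetical numbers; 100 columns = 10 runs of 10-fold cross-validation).
rng = np.random.default_rng(1)
scores_a = 0.80 + 0.02 * rng.standard_normal((54, 100))
scores_b = 0.81 + 0.02 * rng.standard_normal((54, 100))

# Step 1: mean difference of accuracy for each dataset.
mean_diffs = (scores_b - scores_a).mean(axis=1)

# Step 2: Wilcoxon signed-rank test on the per-dataset mean differences.
stat, pvalue = wilcoxon(mean_diffs)
```

Note that step 1 collapses each dataset to a single number, discarding the within-dataset variability; this loss of information is precisely what the hierarchical model of Section 4.3.1 avoids.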
For our case study, nbc vs. aode, the mean differences of accuracy in each dataset computed from Table 1 are shown in Table 7. We denote these measures generically with (in our case ). The recommended NHST for this task is the signed-rank test (demvsar2006statistical).
|Dataset||Mean Dif.||Dataset||Mean Dif.||Dataset||Mean Dif.|
The Wilcoxon signed-rank test is a non-parametric statistical hypothesis test for comparing two paired samples. The signed-rank test assumes the observations to be i.i.d. and generated from a symmetric distribution; the test is miscalibrated if the distribution is not symmetric. A strict usage of the test should thus first include a test for symmetry, and one should run the signed-rank test only if “the symmetry test fails to reject the null” (note that, as we discussed before, this does not actually prove that the distribution is symmetric!). However, this would make the whole testing procedure cumbersome, also requiring corrections for the testing of multiple hypotheses. In common practice the test for symmetry is thus not performed, and we follow this practice in this paper. The test is typically used as follows. The null hypothesis is that the median of the distribution from which the $z_i$’s are sampled is 0; when the test rejects the null hypothesis, it claims that the median is significantly different from 0. The test ranks the $z_i$’s according to their absolute value and then compares the ranks of the positive and negative differences. The test statistic is

$$T = \sum_{i:\, z_i>0} r_i,$$

where $r_i$ is the rank of $|z_i|$. For a large enough number of observations $q$, the statistic under the null hypothesis is approximately normally distributed, and the two-sided test is performed by computing

$$p = 2\left(1-\Phi\left(\frac{\left|T - q(q+1)/4\right|}{\sqrt{q(q+1)(2q+1)/24 - \sigma_{tie}}}\right)\right),$$

where $\Phi$ is the cumulative distribution function of the standard Normal distribution, and $\sigma_{tie}$ is an adjustment for ties in the data (i.e., $z_i=z_j$ for some $i\neq j$) required by the nonparametric test (sidak1999; hollander2013nonparametric); it is zero in case of no ties.
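The statistic and its normal approximation just described can be written out directly. This is a minimal sketch: ties and the continuity correction are ignored, so it matches the test only for tie-free data.

```python
import numpy as np
from scipy.stats import norm

def wilcoxon_signed_rank_p(z):
    """Two-sided signed-rank p-value via the normal approximation
    (minimal sketch: no tie adjustment, no continuity correction)."""
    z = np.asarray(z, dtype=float)
    z = z[z != 0.0]                     # drop zero differences (standard practice)
    q = len(z)
    order = np.abs(z).argsort()
    ranks = np.empty(q)
    ranks[order] = np.arange(1, q + 1)  # rank 1 = smallest |z_i|
    t_plus = ranks[z > 0].sum()         # sum of ranks of the positive differences
    mean = q * (q + 1) / 4.0
    var = q * (q + 1) * (2 * q + 1) / 24.0
    zstat = (t_plus - mean) / np.sqrt(var)
    return 2.0 * (1.0 - norm.cdf(abs(zstat)))
```

In practice one would call `scipy.stats.wilcoxon` instead; the explicit version only serves to expose the statistic.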
Being non-parametric, the signed-rank test is robust to outliers. It assumes commensurability of differences, but only qualitatively: greater differences count more, as they rank higher; yet their absolute magnitudes are ignored (demvsar2006statistical).
4.1.1 Experimental results
If we apply this method to compare nbc vs. aode, we obtain p-value = (the rank with no ties is ). Since the p-value is less than , the NHST rejects the null hypothesis and concludes that nbc and aode are significantly different. Table 8 reports the -values of all comparisons of the five classifiers. The pairs nbc-aode, nbc-hnb and j48-j48gr are statistically significantly different. Again, by applying this black-and-white mechanism, we encounter the same problems as before: we have no idea of the magnitude of the effect size, the uncertainty, the probability of the null hypothesis, et cetera. The density plot of the data (the mean differences of accuracy in each dataset) for nbc versus aode shows, for instance, that there are many datasets where the mean difference is small (close to zero), see Figure 9. For j48 versus j48gr, instead, it is clear that the difference of accuracy is very small.
|Classif. 1||Classif. 2||-value|
4.2 The Bayesian analysis approach
We will present two ways of approaching the comparison between two classifiers in multiple datasets. The first will be based on a nonparametric approach that directly extends the Wilcoxon signed-rank test. The second is a hierarchical model.
4.2.1 Nonparametric test
benavoli2014a have proposed a Bayesian counterpart of the frequentist sign and signed-rank test, which is based on the Dirichlet process.
Bayesian sign and signed-rank test
Let denote the scalar variable of interest and a vector of i.i.d. observations of that variable. To derive the Bayesian sign and signed-rank test, we assume a Dirichlet Process (DP) prior on its probability distribution. A DP is a distribution over probability distributions such that marginals on finite partitions are Dirichlet distributed. Like the Dirichlet distribution, the DP is therefore completely defined by two parameters: the prior strength and the prior mean, which for the DP is a probability measure on . If we choose , i.e., a Dirac’s delta centred on the pseudo-observation
, the posterior probability density function has this simple expression:
i.e., it is a mixture of Dirac’s deltas centred on the observations for and on the prior pseudo-observation , whose weights are Dirichlet distributed with parameters . We can think of (9) as a hierarchical model: depends on the weights, which are Dirichlet distributed. The model (9) is therefore a posterior distribution of the probability distribution of and encloses all the information we need for the experimental analysis. We can summarize it in different ways. If we compute:
where the indicator function equals one if and zero otherwise, we obtain a Bayesian version of the sign test that also accounts for the rope . In fact, are respectively the probabilities that the mean difference of accuracy is in the interval , , or . Since , it can easily be shown that
where is the number of observations that fall in , is the number of observations that fall in and is the number of observations that fall in ; obviously . If we neglect , then (10) says that the posterior probability of is Dirichlet distributed with parameters . The terms are due to the prior. Therefore, to fully specify the Bayesian sign test, we must choose the value of the prior strength and where to place the pseudo-observation (in , or ). We will return to this choice in Section 4.3. If, instead, we compute
then we derive a Bayesian version of the signed-rank test (benavoli2014a) that also accounts for the rope . This time the distribution of does not have a simple closed form, but we can easily compute it by Monte Carlo sampling of the weights . Also in this case we must choose ; see Section 4.3.
It should be observed that the Bayesian signed-rank test does not require the symmetry assumption about the distribution of the observations: it also works when the distribution is asymmetric, thanks to the Bayesian estimation approach (i.e., it estimates the distribution from the data). This is another advantage of the Bayesian estimation approach with respect to the frequentist null hypothesis tests (benavoli2014a).
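The Monte Carlo computation can be sketched concretely. We assume the construction of benavoli2014a, in which the posterior probabilities of the three regions are computed over the pairwise Walsh averages (z_i + z_j)/2; the data, the rope half-width and the prior placement below are hypothetical.

```python
import numpy as np

def bayesian_signed_rank(z, rope=0.01, z0=0.0, s=1.0, nsamples=20000, seed=0):
    """Sample the posterior of (P(left), P(rope), P(right)) under a DP
    prior of strength s with pseudo-observation z0 (placed in the rope)."""
    rng = np.random.default_rng(seed)
    data = np.concatenate(([z0], np.asarray(z, dtype=float)))
    alpha = np.concatenate(([s], np.ones(len(z))))
    w = rng.dirichlet(alpha, nsamples)             # posterior Dirichlet weights
    walsh = (data[:, None] + data[None, :]) / 2.0  # pairwise Walsh averages
    in_left = (walsh < -rope).astype(float)        # indicator matrices
    in_right = (walsh > rope).astype(float)
    p_left = np.einsum('si,ij,sj->s', w, in_left, w)
    p_right = np.einsum('si,ij,sj->s', w, in_right, w)
    return p_left, 1.0 - p_left - p_right, p_right
```

Dropping the pairwise averaging (i.e., using the observations themselves rather than the Walsh averages) gives the Bayesian sign test, whose posterior is the Dirichlet of counts described above.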
Let us start by comparing nbc vs. aode by means of the Bayesian signed-rank test without rope (). Hereafter we choose the prior parameters of the Dirichlet as and ; we will return to this choice in Section 4.3. Since without rope , we report only the posterior of (denoted as “Pleft”), which represents the probability that aode is better than nbc. The samples of the posterior are shown in Figure 10: this is simply the histogram of samples of generated according to (11). All the samples turn out to be greater than , and so . We can thus conclude with probability that aode is better than nbc. We can in fact think of the comparison of two classifiers as inference on the bias () of a coin. In this case, all the coins sampled from the posterior have a bias greater than and, therefore, all the coins are biased towards aode (which is then preferable to nbc).
This conclusion agrees with that of the frequentist signed-rank test (very small -value, see Table 8).
The introduction of the rope partially changes the previous conclusion. A way to visualize the posterior of in this case is to plot the Monte Carlo samples of these probabilities in barycentric coordinates: each trinomial vector of probabilities is a point in the simplex with vertices . The three vertices are denoted as “aode”, “rope” and “nbc” and represent decisions with certainty in favour of “aode”, “rope” and “nbc”, respectively. Figure 11 reports the simplex as well as the two-dimensional projections of the posterior for the Bayesian signed-rank test. In particular, Figure 11 shows the marginal of the posterior distribution of “aode” vs. “rope” (left) and the marginal of the posterior distribution of “nbc” vs. “rope” (right). From these two figures we can deduce the remaining marginal, since . Finally, Figure 11 (bottom) reports the joint distribution of the three variables in barycentric coordinates (we are again exploiting the fact that ). In particular, Figure 11 (bottom) shows the samples from the posterior (cloud of points), the simplex (the large orange triangle) and three regions (in orange) limited by the level curves with (the hypothesis is more probable than the other two hypotheses together). For instance, the region at the bottom-right of the triangle is relative to the case where “aode” is more probable than “rope” and “nbc” together; the region at the top of the triangle represents the case where “rope” is more probable than “aode” and “nbc” together; and the region at the left of the triangle corresponds to the case where “nbc” is more probable than “aode” and “rope” together. Hence, if all the points fall inside one of these three regions, we conclude that the corresponding hypothesis is true with probability . Looking at Figure 11 (bottom), it is evident that the majority of cases support aode more than rope and definitely more than nbc. We can quantify this numerically by counting the number of points that fall in the three regions; see the first row in Table 9.
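Counting the Monte Carlo samples that fall in each of the three regions is straightforward. A minimal sketch: a sample supports a hypothesis when that probability exceeds the other two combined, i.e., exceeds 1/2.

```python
import numpy as np

def region_fractions(theta):
    """theta: (nsamples, 3) Monte Carlo samples of
    (p_left, p_rope, p_right). Returns, for each hypothesis, the
    fraction of samples in which it is more probable than the other
    two together (p_k > 1 - p_k, i.e. p_k > 0.5)."""
    return (theta > 0.5).mean(axis=0)
```

These fractions are exactly the numbers reported per row in Table 9.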
aode is better in of cases, while rope is selected in the remaining . We can therefore conclude with probability that aode is practically better than nbc. Table 9 also reports these probabilities for the other comparisons of classifiers, computed using Monte Carlo samples. We conclude that hnb is practically better than nbc with probability ; aode and hnb are equivalent with probability ; aode is better than j48 and j48gr with probability ; hnb is better than j48 and j48gr with probability greater than ; and finally j48 and j48gr are practically equivalent. These conclusions are in agreement with the data.
The computational complexity of the Bayesian signed-rank test is low: the comparison of two classifiers (based on samples) takes less than one second on a standard computer.
|Classif. 1||Classif. 2||left||rope||right|
4.3 Choice of the prior
In the previous section, we selected the prior parameters of the Dirichlet process as and . In terms of rank, this means that the prior strength is equivalent to that of one pseudo-observation located inside the rope (benavoli2014a). How sensitive are the inferences to this choice? For instance, we can examine how the probabilities in Table 9 change with . We considered two extreme cases, and , and report the resulting probabilities in Tables 10 and 11 (this is an example of robust Bayesian analysis (r1994orb)). It is evident that the position of has only a minor effect on the probabilities. We could have performed this analysis jointly by considering all the possible Dirichlet process priors obtained by varying . This set of Dirichlet priors is called the “Imprecise Dirichlet Process” (IDP). The IDP allows us to start the inference with very weak prior assumptions, much in the direction of letting the data speak for themselves. More details about the properties of the IDP and the choice of the prior can be found in benavoli2014a; benavoli2014b; walley1996.
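For the Bayesian sign test, this sensitivity analysis amounts to moving the single pseudo-observation across the three regions and recomputing the posterior. A minimal sketch with hypothetical counts over 54 data sets:

```python
import numpy as np

rng = np.random.default_rng(0)
n_left, n_rope, n_right = 5, 9, 40  # hypothetical counts over 54 data sets
s = 1.0                             # prior strength: one pseudo-observation

posterior = {}
for where, bump in [("left", (s, 0, 0)), ("rope", (0, s, 0)), ("right", (0, 0, s))]:
    alpha = np.add((n_left, n_rope, n_right), bump)  # Dirichlet parameters
    theta = rng.dirichlet(alpha, 100000)
    # posterior probability that the right-hand classifier is practically better
    posterior[where] = np.mean(theta[:, 2] > 0.5)

spread = max(posterior.values()) - min(posterior.values())
```

With enough data, `spread` is tiny: the placement of the pseudo-observation barely moves the inference, as Tables 10 and 11 illustrate.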
|Classif. 1||Classif. 2||left||rope||right|
|Classif. 1||Classif. 2||left||rope||right|
4.3.1 Hierarchical models
In Section 3 we presented the Bayesian correlated t-test, used for the analysis of cross-validation results on a single dataset. In particular, it makes inference about the mean difference of accuracy between two classifiers in the -th dataset () by exploiting three pieces of information: the sample mean (), the variability of the data (sample standard deviation ) and the correlation due to the overlapping training sets (). This test can only be applied to a single dataset. We have already discussed the fact that there is no direct NHST able to extend the above statistical comparison to multiple datasets, i.e., one that takes as input the runs of the -fold cross-validation results for each dataset and returns a statistical decision about which classifier is better across the datasets. The usual NHST procedure employed for such an analysis has two steps: (1) compute the mean difference of accuracy for each dataset ; (2) perform a NHST to establish whether the two classifiers have different performance based on these mean differences of accuracy. This discards two pieces of information: the correlation and the sample standard deviation in each dataset. The standard deviation is informative about the accuracy of as an estimator of , and it can vary largely across data sets, as each data set has its own size and complexity. The aim of this section is to present an extension of the Bayesian correlated t-test that makes inference on multiple datasets while accounting for all the available information (mean, standard deviation and correlation). In Bayesian estimation, this can be obtained by defining a hierarchical model (corani2016unpub). Hierarchical models are among the most powerful and flexible tools in Bayesian analysis.
to be uniformly distributed between -1 and 1. This choice works for all the measures bounded within 1, such as accuracy, AUC, precision and recall. Other types of indicators might require different bounds. For the standard deviation we adopt the prior , with , where is the standard deviation of the ’s. As for the prior on the degrees of freedom, there are two proposals in the literature. kruschke2013bayesian proposes an exponentially shaped distribution which balances the prior probability of nearly normal distributions ( > 30) and heavy-tailed distributions ( < 30). We re-parameterize this distribution as a Gamma(, ) with = 1, = 0.0345. juarez2010model proposes instead , assigning larger prior probability to normal distributions. We have no reason for preferring one prior over the other, but the hierarchical model shows some sensitivity to the choice of . We model this uncertainty by representing the coefficients and of the Gamma distribution as two random variables (hierarchical prior). In particular we assume , with and , setting = 0.5, = 5, = 0.05, = 0.15. The simulations in (corani2016unpub) show that the inferences of the model are stable with respect to perturbations of , , and , and that the resulting hierarchical model generally fits the experimental data well. These considerations are reflected by the following probabilistic model:
The model infers a posterior distribution for the mean difference of accuracy in each dataset, thus accounting for the different uncertainty which characterizes each data set. This characteristic is unique among the methods discussed so far. Computations in hierarchical models are performed by Markov-Chain Monte Carlo sampling.
A further merit of the hierarchical model is that it estimates the ’s jointly, while the existing methods estimate the difference of accuracy on each data set independently using the ’s. As a consequence of this joint estimation, shrinkage is applied to the ’s: the hierarchical model thus estimates the ’s more accurately than the ’s adopted by the other tests. This result is valid under general assumptions, even under a severe misspecification between the high-level distributions of the true generative model and of the fitted model (corani2016unpub). By applying the rope on the posterior distribution of the ’s and the , similarly to what we discussed for the Bayesian correlated t-test, the model is able to detect equivalent classifiers and to claim significances that have a practical impact.
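The effect of shrinkage can be illustrated with a deliberately simplified toy model: a normal-normal partial-pooling sketch, not the paper's Student-t hierarchical model, and with entirely synthetic numbers.

```python
import numpy as np

# Synthetic per-dataset true differences, standard errors and observed means.
rng = np.random.default_rng(3)
true_delta = rng.normal(0.01, 0.005, 54)       # true per-dataset differences
se = rng.uniform(0.002, 0.02, 54)              # per-dataset standard errors
x = true_delta + se * rng.standard_normal(54)  # observed per-dataset means

# Partial pooling: posterior mean under a N(grand mean, tau2) prior is a
# precision-weighted average, pulling noisy estimates toward the center.
tau2 = 0.005 ** 2                              # between-dataset variance (assumed known)
shrunk = (x / se**2 + x.mean() / tau2) / (1 / se**2 + 1 / tau2)

mse_raw = np.mean((x - true_delta) ** 2)       # independent per-dataset means
mse_shrunk = np.mean((shrunk - true_delta) ** 2)
```

The shrunk estimates are closer to the true differences than the raw per-dataset means, which is the mechanism behind the hierarchical model's more accurate estimation of the ’s.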
In the experiments, we have computed the posterior of for the ten pairwise comparisons between the classifiers nbc, aode, hnb, j48 and j48gr. As inference we have computed the prediction on the next (unseen) dataset, which is formally equivalent to the inference computed by the Bayesian signed-rank test. For instance, for nbc vs. aode, we have computed the probabilities that in the next dataset nbc is better than aode (), nbc is equivalent to aode (), aode is better than nbc (). This is the procedure we have followed:
we have sampled from the posteriors of these parameters;
for each sample of we have defined the posterior of the mean difference of accuracy on the next dataset, i.e., ;
from we have computed the probabilities (integral on ), (integral on ) and (integral on ).
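For a single posterior sample of the higher-level parameters, the last step reduces to three Student-t integrals. The parameter values below are hypothetical; in the actual procedure this computation is repeated for every posterior sample.

```python
from scipy.stats import t

# One posterior sample of the higher-level parameters (hypothetical values):
# location, scale and degrees of freedom of the Student distribution that
# governs the mean difference of accuracy on the next (unseen) dataset.
rope = 0.01
mu0, sigma0, nu = 0.02, 0.015, 10.0

p_left = t.cdf(-rope, nu, loc=mu0, scale=sigma0)           # next delta < -rope
p_rope = t.cdf(rope, nu, loc=mu0, scale=sigma0) - p_left   # |next delta| <= rope
p_right = 1.0 - p_left - p_rope                            # next delta > rope
```

Collecting these triplets over all posterior samples yields the cloud of points in barycentric coordinates of Figure 13.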
We repeated this procedure times, obtaining samples of ; the results are shown in Figure 13. The results are largely in agreement with those of the Bayesian signed-rank test: for instance, it is evident that aode is clearly better than nbc. We can quantify this numerically by counting the number of points that fall in the three regions (see the first row in Table 12): aode is better in almost of cases. Table 12 also reports these probabilities for the other comparisons of classifiers, computed using Monte Carlo samples. By comparing Tables 12 and 9, we can see that the two tests are substantially in agreement, apart from differences in aode vs. j48 and j48gr. The hierarchical test takes into account all the available information: the sample mean (), the variability of the data (sample standard deviation ) and the correlation due to the overlapping training sets (), while the Bayesian signed-rank test only considers . Therefore, when the two tests differ substantially, it means that there is substantial variability of the cross-validation estimate.
We have implemented the hierarchical model in Stan (http://mc-stan.org) (carpenterstan), a language for Bayesian inference. The analysis of the results of 10 runs of 10-fold cross-validation on 54 data sets (that means a total of observations) takes about three minutes on a standard computer.
|Classif. 1||Classif. 2||left||rope||right|
4.4 Choice of the hyper-prior parameters
The choice of the hyper-prior parameters can be critical in Bayesian hierarchical models. We conducted a sensitivity analysis using different constants in the top-level Gamma and uniform distributions, to check whether they have any notable influence on the resulting posterior distribution. Whichever Gamma/uniform distributions are assumed, the results are essentially identical; more details about this sensitivity analysis are reported in corani2016unpub. The inferences and decisions of the hierarchical model are thus stable with respect to the choice of the hyper-parameters.
4.5 Bayesian signed rank or hierarchical model?
So far we have presented two methods for comparing two classifiers on multiple datasets: the Bayesian signed-rank test and the hierarchical model. Which one should we use for comparing classifiers? In our opinion, the hierarchical model is preferable because it takes as input the runs of the -fold cross-validation results for each dataset, and so it makes inference about the mean difference of accuracy between two classifiers in the -th dataset () by exploiting all the available information: the sample mean (), the variability of the data (sample standard deviation ) and the correlation due to the overlapping training sets (). Conversely, the Bayesian signed-rank test only considers . On the other hand, the hierarchical model is slower than the Bayesian signed-rank test. In machine learning we often need to run statistical tests hundreds of times, for instance for feature selection or algorithm racing; in such cases it is more convenient to use a lightweight test such as the Bayesian signed-rank.
5 Comparisons of multiple classifiers
Another important problem with NHST is the issue of multiple hypothesis testing. Consider the results in Table 8, which reports the -values for the comparisons of the five classifiers obtained by the Wilcoxon signed-rank test. From the -values, we concluded that “nbc was found significantly different from aode and hnb, and algorithms j48 and j48gr were significantly different, while there were no significant differences between the other pairs”. When many tests are made, the probability of making at least one Type 1 error in any of the comparisons increases. One of the most popular fixes to this problem is the Bonferroni correction, which adjusts the threshold at which a test is evaluated for significance according to the number of tests being performed: the adjusted threshold is the overall significance level divided by the number of tests. Implicitly, the Bonferroni correction assumes that the test statistics are independent. In our current example, an overall desired significance level of would thus translate into individual tests each using a -value threshold of (we are performing comparisons). In this case, this would not change our previous conclusions, since all significant -values were less than . The Bonferroni correction reduces false rejections, but it also increases the number of instances in which the null is not rejected when it actually should have been; it can thus reduce the power to detect an important effect. Motivated by this issue of the Bonferroni correction, researchers have proposed alternative procedures. The goal of these methods is typically to reduce the family-wise error rate (that is, the probability of having at least one false positive) without sacrificing too much power. A natural way to achieve this is to consider the dependence across tests (westfall1993adjusting).
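The Bonferroni arithmetic for our example is a one-liner; the p-values below are hypothetical.

```python
# Bonferroni correction: with m = 10 pairwise comparisons and an overall
# significance level alpha = 0.05, each individual test is evaluated at
# the adjusted threshold alpha / m.
m, alpha = 10, 0.05
threshold = alpha / m                      # 0.005
pvals = [0.001, 0.004, 0.02, 0.3]          # hypothetical p-values
rejected = [p < threshold for p in pvals]  # only the first two survive
```

Note how the 0.02 result, significant at the unadjusted level, no longer is: this is the loss of power discussed above.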
We have already discussed in Section 2 the pitfalls of NHST Type I error thinking. Type I error underlies these corrections, which therefore inherit all its problems. The most critical one is that the correction factor depends on how the analyst intends to conduct the comparison. For instance, the analyst may want to compare nbc with the other four classifiers (in which case the Bonferroni correction would be , since only four comparisons are conducted) or to perform all ten comparisons (), and so on. This creates a problem: two analysts can draw different conclusions from the same data because of the different sets of comparisons they made. Another important issue, affecting the multiple-comparison procedure based on the mean-ranks test, is described in benavoli2015c.
How do we manage the problem of multiple hypothesis testing in Bayesian analysis? Paraphrasing gelman2012we: “in Bayesian analysis we usually do not have to worry about multiple comparisons. The reason is that we do not worry about Type I error, because the null hypothesis is hardly believable to be true.” How does Bayesian analysis mitigate false alarms? gelman2012we suggest using multilevel analysis (in our case, a hierarchical Bayesian model on multiple classifiers). Multilevel models perform partial pooling: they shift estimates toward each other. This means that the comparisons of the classifiers are more conservative, in the sense that intervals for comparisons are more likely to include zero. This may be a direction to pursue in future research. In this paper, we instead mitigate false alarms through the rope, which decreases the asymptotic false alarm rate (kruschke2013bayesian).
6 Software and available Bayesian tests
All the tests presented in this paper are available as R and Python code at
Moreover, the code necessary to replicate all the analyses performed in this paper is available at the same URL in the form of IPython notebooks (implemented in Python and Julia). The software is open source and can be freely used, changed and shared (in modified or unmodified form).
Machine learning researchers may be interested in using other Bayesian tests besides the ones we have discussed in this paper. General Bayesian parametric tests can be found in kruschke2015doing (together with R code) and also in gelman2013bayesian. We have specialized some of these tests to the case of correlated data, such as the Bayesian correlated t-test (coraniML2015) discussed in Section 3. We have also implemented several Bayesian nonparametric tests for comparing algorithms: the Bayesian rank test (benavoli2014b), the Friedman test (benavoli2015b) and tests that account for censored data (mangili2015a). Finally, we have developed an extension of the Bayesian sign test to compare algorithms taking into account multiple measures at the same time (accuracy and computational time, for instance) (Benavoli2015). For the analysis of multiple data sets, another approach has been proposed by lacoste2012bayesian, which models each data set as an independent Bernoulli trial. The two possible outcomes of the Bernoulli trial are the first classifier being more accurate than the second, or vice versa. This approach yields the posterior probability of the first classifier being more accurate than the second classifier on more than half of the data sets. A shortcoming is that its conclusions apply only to the q available data sets, without generalizing to the whole population of data sets.
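This last approach reduces to a Beta-Binomial computation. A sketch with hypothetical win counts and a uniform Beta(1, 1) prior (lacoste2012bayesian may parameterize it differently):

```python
from scipy.stats import beta

# Each data set is an independent Bernoulli trial won by the more accurate
# classifier (hypothetical counts: classifier A wins 40 of 54 data sets).
# With a uniform Beta(1, 1) prior, the posterior of the win probability is
# Beta(wins + 1, losses + 1); the posterior probability that A is more
# accurate than B on more than half of the data sets is then:
wins, losses = 40, 14
p_majority = 1.0 - beta.cdf(0.5, wins + 1, losses + 1)
```

Note that, as stated above, this probability concerns only the available data sets, not the whole population of data sets.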
7 Conclusions
We discourage the use of frequentist null hypothesis significance tests (NHST) in machine learning and, in particular, for the comparison of the performance of classifiers. In this, we follow the current trends in other scientific areas. For instance, the journal Basic and Applied Social Psychology has banned the use of NHST and related statistical procedures (banned). We believe that in machine learning, too, it is time to move on from NHST and -values.
In this paper, we have discussed how Bayesian analysis can be employed instead of NHST. In particular, we have presented three Bayesian tests: the Bayesian correlated t-test, the Bayesian signed-rank test and a Bayesian hierarchical model, which can be used for comparing the performance of classifiers and which overcome the drawbacks of the frequentist tests. All the code for these tests is freely available, so researchers can already use them in their analyses. In this paper we have mainly discussed the use of Bayesian tests for comparing the performance of algorithms. However, in machine learning, NHST statistical tests are also employed inside the algorithms: for instance, nonparametric tests are used in racing algorithms, independence tests are used to learn the structure of Bayesian networks, et cetera. Bayesian tests can replace all these NHST tests because of their advantages (for instance, Bayesian tests can assess whether two algorithms are similar through the use of the rope (benavoli2015b)).
Research partially supported by the Swiss NSF grant no. IZKSZ2_162188.