1 Introduction
The statistical comparison of competing algorithms is fundamental in machine learning; it is typically carried through hypothesis testing. As an example in this paper we assume that one is interested in comparing the accuracy of two competing classifiers. However our discussion readily applies to any other measure of performance.
Assume that two classifiers have been assessed via crossvalidation on a single data set. The recommended approach for comparing them is the correlated ttest
(nadeau2003inference). If instead one aims at comparing two classifiers on multiple data sets the recommended test is the signedrank test (demvsar2006statistical). Both tests are based on the frequentist framework of the nullhypothesis significance tests (nhst), which has severe drawbacks.First, the nhst computes the probability of getting the observed (or a larger) difference in the data if the null hypothesis was true. It does not compute the probability of interest, which is the probability of one classifier being more accurate than another given the observed results.
Second, the claimed statistical significances do not necessarily imply practical significance. It has been pointed out (hand2006classifier) that the apparent superiority of a classifier in a simulation may be swamped by other sources of uncertainty when the classifier is adopted in practice. This especially applies when the new classifier provides only a thin advantage over the previous one. Yet the signedrank test easily rejects the null hypothesis even when dealing with two classifiers separated by a thin difference, if the classifiers have been compared on a large collection of data sets. Null hypotheses can virtually always be rejected by increasing the sample size (wasserstein2016asa), since the pvalue depends both on the actual difference between the two classifiers and number of collected observations.
Another issue is that when the nhst does not reject the null, it provides no evidence in favor of the null hypothesis (kruschke2015doing, Chap. 11): it simply states that there is not enough evidence for rejecting the null hypothesis. This prevents nhst tests from recognizing equivalent classifiers.
These issues can be overcome by switching to Bayesian hypothesis testing, as exhaustively discussed by (kruschke2015doing, Sec. 11). Recent applications of Bayesian hypothesis testing in machine learning have been proposed in (lacoste2012bayesian; coraniML2015); yet there is currently no Bayesian approach able to jointly analyze the crossvalidation results obtained on multiple data sets.
Let us introduce some notation. We have a collection of data sets; the actual mean difference of accuracy on the ith data set is . We can think of as the average difference of accuracy that we would obtain by repeating many times the procedure of sampling from the actual distribution as many instances as there are in the idata set available to us, train the two classifiers and measure their difference of accuracy on a large test set. We cannot know the actual value of ; thus we estimate it through crossvalidation. The usual estimate of is the mean of the crossvalidation results obtained on the ith data set; this is also the maximum likelihood estimator (MLE) of .
We propose the first model that represents both the distribution across the different data sets and the distribution of the crossvalidation results on the ith data set given
, modelling both the variance and the correlation of the crossvalidation results on each data set.
In order to detect equivalent classifiers, we introduce a region of practical equivalence (rope), as in kruschke2013bayesian. In particular we consider two classifiers to be practically equivalent if their difference of accuracy belongs to the interval . This constitutes a sensible default, even if there are no uniquely correct rope limits. We analyze how much of the posterior probability mass lies within the rope, at its left and at its right in order to detect classifiers which are practically equivalent or instead significantly different. With this, we can predict the probability that the new data set will lie in the rope or to the left or right of it.
Compared to the signedrank, the hierarchical test is more conservative as it rejects less easily the null hypothesis. The signedrank assumes a point null hypothesis, while the null hypothesis of the hierarchical model contains all the values of the rope. Such a null hypothesis is more realistic and thus more difficult to reject.
Consequently, the hierarchical test tend to be less powerful and more likely to retain the nullhypothesis when the two classifiers are in fact different. Yet this issue is mitigated by the fact that the posterior probabilities can be meaningfully interpreted even when they do not exceed the 95% threshold: the result of the test is not a binary decision but a probability.
A further strength of the hierarchical model is that it reduces the estimation error of the ’s compared to the traditional MLE approach. The hierarchical model estimates the distribution from which the ’s are sampled. Then it jointly estimates the values of the ’s. By doing so it applies shrinkage, namely the estimates ’ are closer to each other than in the MLE case. It is known that the shrinkage estimator yields lower mean squared error than MLE in the i.i.d. case (murphy2012machine, Sec 6.3.3.2). We prove theoretically that the shrinkage estimator yields lower mean squared error than MLE also in the correlated case, such as that of crossvalidation; moreover, we verify this statement in a variety of different experiments.
2 Existing approaches
We have a collection of data sets; the actual mean difference of accuracy on the ith data set is . We can think of as the average difference of accuracy that we would obtain by repeating many times the procedure of sampling from the actual distribution as many instances as there are in the actually available data set, train the two classifiers and measure the difference of accuracy on a large test set.
In reality we cannot know . We can however estimate it through crossvalidation. Assume that we have performed runs of fold crossvalidation on each data set, providing both classifiers with paired folds. The differences of accuracy on each fold of crossvalidation are , where
. The mean and the standard deviation of the results on the ith data set are
and . The mean of the crossvalidation results is also the maximum likelihood estimator (MLE) of .The values are not independently sampled; they are instead sampled with correlation because of the overlapping training sets built during crossvalidation. nadeau2003inference
prove that there is no unbiased estimator of the correlation and they approximate it as
, where is the number of folds. They devise the correlated test, whose statistic is:(1) 
The denominator of the statistic is the standard error, which is informative about the accuracy of
as an estimator of . The standard error of the correlated ttest accounts for the correlation of crossvalidation results. The statistic of Eqn.(1) follows a t distribution with n1 degrees of freedom. When the statistic exceeds the critical value, the test claims that
is significantly different from zero. This is the standard approach for comparing two classifiers on a single data set.The signedrank test is instead the recommended method (demvsar2006statistical) to compare two classifiers on a collection of different data sets. It is usually applied after having performed crossvalidation on each data set. The test analyzes the mean differences measured on each data set () assuming them to be i.i.d.. This is a simplistic assumption: the ’s are not i.i.d. since they are characterized by different uncertainty; indeed their standard errors are typically different.
The test statistic is:
where is the rank of and
For a large enough number of samples (e.g.,
>10), the statistic under the null hypothesis is normally distributed. When the test rejects the null hypothesis, it claims that the median of the population of the
’s is different from zero.The two tests discussed so far are nullhypothesis significance test (nhst) and as such they suffer from the drawbacks discussed in the introduction.
Let us now consider the Bayesian approaches. kruschke2013bayesian presents a Bayesian ttest coupled with the rope for i.i.d. observations. Because of the i.i.d. assumption, it cannot be applied to analyze the crossvalidation results, which are correlated. The Bayesian correlated ttest by coraniML2015 can be used to this end. It computes the posterior distribution of on a single data set, assuming the crossvalidation observations to be sampled from a multivariate normal distribution whose components have the same mean , the same standard deviation and are equally crosscorrelated with correlation
. Thus the test borrows the correlation heuristic by
nadeau2003inference.As for the analysis on multiple data sets, lacoste2012bayesian compares two classifiers on multiple data sets by modelling each data set as an independent Bernoulli trial. The two possible outcomes of the Bernoulli trial the first classifier being more accurate than the second or vice versa. coraniML2015 estimate such probabilities using the Bayesian correlated ttest, applied independently on each data set. It is then possible computing the probability of the first classifier being more accurate than the second classifier on more than half of the data sets. A shortcoming of this approach is that its conclusions apply only to the available data sets without generalizing to the population of data sets from which they have been sampled.
3 The hierarchical model
We propose a Bayesian hierarchical model for comparing two classifiers. Its core assumptions are:
(2)  
(3)  
(4) 
The ith data set is characterized by the mean difference of accuracy between classifiers and the standard deviation . Thus we model each data sets as having its own estimation uncertainty. Notice that instead when applying the signedrank test one has to assume that the ’s are i.i.d.
The ’s are assumed to be drawn from a Student distribution with mean , scale factor and degrees of freedom . We choose the Student distribution because it is more flexible than the Gaussian, thanks to the additional parameter . When is small, the Student distribution has heavy tails; when
is above 30, the Student distribution is practically a Gaussian. A Student distribution with low degrees of freedom is able to robustly model outlier data points, namely some
’s which are far away from the others. Indeed the Student distribution is typically used for robust Bayesian estimation (kruschke2013bayesian).We assume
to be drawn from a uniform distribution over the interval
. This kind of prior is recommended for the standard deviation by gelman2006prior, as it yields inferences which are insensitive on the value of if is large enough. We adopt where is the mean standard deviation observed on the different data sets, . As we show in Sec. 4.8.1, we obtain the same posterior distribution if we change the upper bound to .Equation (4) models the fact that the crossvalidation measures of the ith data set are generated from a multivariate normal whose components are equally crosscorrelated with correlation , have same mean () and same standard deviation (). These are standard assumptions for the crossvalidation measures (nadeau2003inference; coraniML2015)
. The normality assumption is sound since the average accuracy over the instances of the test set tends to be normally distributed by the central limit theorem.
We complete the model with the prior on the parameters , and of the highlevel distribution. We assume to be uniformly distributed within 1 and 1. This choice works for all the measures bounded within
1, such as accuracy, AUC, precision and recall. Other type of indicators might require different bounds.
For the standard deviation we adopt the prior , with , where is the standard deviation of the ’s. Also in this case we checked that the posterior distribution of is unchanged if we adopt or as upper bound.
The treatment of is slightly more challenging. kruschke2013bayesian proposes to be a shifted exponential, which we reparameterize as a Gamma(,) with =1, = 0.0345. juarez2010model proposes instead . The main characteristics of those distributions are given in Tab. 1.
The inferred model shows some sensitivity on the choice of . Yet both parameterizations are sensible and we have no strong reason to prefer one over another. We model this uncertainty by representing the coefficients and
as two random variables with their own prior distribution (hierarchical approach). In particular we assume
, with and , setting =0.5, =5, =0.05,=0.15. The mean and standard deviation of the limiting Gamma distribution are given in Table
1; they encompass a wide range of different prior beliefs. In this way the model becomes more stable, showing only minor variations when the limiting ranges of and are modified. It also becomes more flexible, and fits better the data as we show later in the experimental section.mean  sd  p( <30)  

kruschke2013bayesian  2  0.1  20  14  0.80 
juarez2010model  1  0.0345  29  29  0.64 
0.5  0.05  10  14  0.92  
0.5  0.15  3  5  0.99  
5  0.05  100  45  0.02  
5  0.15  33  15  0.47 
The priors for the parameters of the highlevel distribution are thus:
3.1 The region of practical equivalence
Our knowledge about a parameter is fully represented by the posterior distribution. Yet it is handy to summarize the posterior in order to take decisions. In (coraniML2015) we summarized the posterior distribution by reporting the probability of positiveness and negativeness; however in this way we considered only the sign of the differences, neglecting their magnitude.
A more informative summary of the posterior is obtained introducing a region of practical equivalence (rope), constituted by a range of parameter values that are practically equivalent to the null difference between the two classifiers. We thus summarize the posterior distribution by reporting how much probability lies within the rope, at its left and at its right. The limits of the rope are established by the analyst based on his experience; thus there are no uniquely correct limits for the rope (kruschke2015doing, Chap. 12).
Dealing with the evaluation of classifiers, hand2006classifier reports that the apparent superiority of a classifier in a simulation may be swamped by other sources of uncertainty when adopted in practice. Examples of such sources of uncertainty are the presence of nonstationary distributions or unintended biases in the training samples. These thin improvements measured in simulations might hardly imply any advantage in practice.
We thus consider two classifiers to be practically equivalent if their mean difference of accuracy lies within (0.01,0.01). This constitutes a sensible default for the rope: yet a researcher could legitimately adopt a different rope. If one is unsure about the limits of the rope, one could show how the results vary for different rope widths. We provide an example in Sec. 4.9.
The rope yields a realistic null hypothesis that can be verified. If a large mass of posterior probability lies within the rope, we claim the two classifiers to be practically equivalent. A sound approach to detect equivalent classifiers could be very useful in online model selection (krueger2015fast) where one should quickly discard algorithms that perform the same.
3.2 The inference of the test
We focus on estimating the posterior distribution of the difference of accuracy between the two classifiers on a future unseen data set. We compute the probability of left, rope and right being the most probable outcome on the next data set.
Thus we compute the probability by which or or . If one removes the rope, this inference is similar to the inference performed by the Bayesian signtest.
To compute such inference, we proceed as follows:

initialize the counters ;

for repeat

sample from the posterior of these parameters;

define the posterior of the mean difference accuracy on the next dataset, i.e., ;

from compute the three probabilities (integral on ), (integral on ) and (integral on );

determine the highest among and increment the respective counter ;


compute , and ;

decision: when declare the two classifiers to be practically equivalent; when or we declare the two classifiers to be significantly different in the respective directions.
We have chosen and, thus, our region of practical equivalence (rope) is . Figure 1 shows a diagram of this inference schema and reports three sampled posteriors . For these three cases we have that are respectively (from top to bottom) , , and so after these three steps , , (in the next experiments we will consider ).
3.3 The shrinkage estimator for crossvalidation
The hierarchical model jointly estimates the ’s by applying shrinkage to the ’s. In the uncorrelated case, the shrinkage estimator is known to be more accurate than the MLE. In this section we show that the shrinkage estimator is more accurate than MLE also in the correlated case, such as the data generated by crossvalidation. This allows the hierarchical model to be more accurate than the existing method in the estimation of the ’s.
The ’s of the hierarchical model are independent given the parameters of the higherlevel distribution. If such parameters were known, the ’s would be conditionally independent and they would be independently estimated. Instead such parameters are unknown, causing the and the ’s to be jointly estimated. As a result the estimate of each is informed by data collected also on all the other data sets. Intuitively, each data set informs the higherlevel parameters, which in turn constrains and improves the parameters of the individual data sets (kruschke2013bayesian, Chap. 9).
To show this, we assume the crossvalidation results on the data sets to be generated by the hierarchical model:
(5) 
where for simplicity we assumed the variances of the individual data sets to be equal to and known. Thus all data sets have the same covariance matrix , which is defined as follows: the variances equal and the correlations equal . Note that Eqn. (5) coincides with (4). This is a general model that makes no assumptions about the distribution
. We denote the first two moments of
as and .We study the MAP estimates of the parameters , which asymptotically tend to the Bayesian estimates. A hierarchical model is being fitted to the data. Such model is a simplified version of that presented in Sec. 3. In particular is Gaussian for analytical tractability.
(6) 
This model is misspecified since is generally not Gaussian. Nevertheless, it correctly estimates the mean and variance of , as we show in the following.
Proposition 1
The derivatives of the logarithm of are:
If we further assume that (flat prior), by equating the derivatives to zero, we derive the following consistent estimators:
(7)  
(8) 
where and, to keep a simple notation, we have not explicited the expression as a function of . Notice that the estimator shrinks the estimate towards that is an estimate of . Hence, the Bayesian hierarchical model consistently estimates and from data and converges to the shrinkage estimator .
It is known that the shrinkage estimator achieves a lower error than MLE in case of uncorrelated data; see (murphy2012machine, Sec 6.3.3.2) and the references therein. However there is currently no analysis of shrinkage with correlated data, such as those yielded by crossvalidation. We study this problem in the following.
Consider the generative model (5). The likelihood regarding the ith data set is:
(9) 
Let us denote by
the vector of the
’s. The joint probability of data and parameters is:Let us focus on the ith group, denoting by an estimator of . The mean squared error (MSE) of the estimator w.r.t. the true joint model is:
(10) 
Proposition 2
The MSE of the maximum likelihood estimator is:
which we denote in the following also as .
Now consider the shrinkage estimator with , which pulls the MLE estimate towards the mean of the upperlevel distribution.
Proposition 3
The MSE of the shrinkage estimator is:
As we have seen, the hierarchical model converges to the shrinkage estimator with . Then:
Therefore, the shrinkage estimator achieves a smaller mean squared error than the MLE.
3.4 Implementation and code availability
We implemented the hierarchical model in Stan (carpenterstan)
, a language for Bayesian inference. In order to improve the computational efficiency, we exploit a quadratic matrix form to compute simultaneously the likelihood of the
data sets. This provides a speedup of about one order of magnitude compared to the naive implementation in which the likelihoods are computed separately on each data set. Inferring the hierarhical model on the results of 10 runs of 10folds crossvalidation on 50 data sets (a total of 5000 observations) takes about three minutes on a standard laptop.The Stan code is available from https://github.com/BayesianTestsML/tutorial/tree/master/hierarchical. The same repository provides the R code of all the simulations of Sec. 4.
4 Experiments
4.1 Estimation of the ’s under misspecification of p()
According to the proofs of Sec. 3, the shrinkage estimator of the ’s has lower mean squared error than the maximum likelihood estimator, constituted by the arithmetic mean of the crossvalidation results. This result holds even if the of the hierarchical model is misspecified: it only requires the hierarchical model to reliably estimate the first two moments of .
To verify this theoretical result we design the following experiment. We consider these numbers of data sets: . For each value of we repeat 500 experiments consisting of:

sampling of the ’s () from the bimodal mixture
with =2, =0.005, =0.02, ===0.001, .

For each :

implement two classifiers whose actual difference of accuracy is , following the procedure given in Appendix;

perform 10 runs of 10folds crossvalidation with the two classifiers;

measure the mean of the crossvalidation results (MLE).


Infer the hierarchical model using the results referring to the data sets;

obtain the shrinkage estimates of each ;

measure and as defined in Sec. 3.3.
Mean Squared Error  

5  .00036  .00017 
10  .00036  .00014 
50  .00036  .00012 
As reported in Tab. 2, is generally 50% lower than for every value of . This verifies our theoretical findings. It also shows that the mean of the crossvalidation estimates is a quite noisy estimator of , even if 10 repetitions of crossvalidation are performed. The problem is that all such results are correlated and thus they have limited informative content.
Interestingly, the MSE of the shrinkage estimator decreases with . Thus the presence of more data sets allows to better estimate the moments of , improving the shrinkage estimates as well. Instead the error of the MLE does not vary with since the parameters of each data set are independently estimated.
In the uncorrelated case it is wellknown that, on average, the shrunken estimate is much closer to the true parameters than the MLE is. Our results extend this finding to the correlated case.
4.2 Comparison of equivalent classifiers
In this section we adopt a Cauchy distribution as
; this is an idealized situation in which the hierarchical model can recover the actual . We will relax this assumption in Sec. 4.7.We simulate the null hypothesis of the signedrank setting the median of the Cauchy to . We set the scale factor of the distribution to 1/6 of the rope length; this implies that 80% of the sampled ’s lies within the rope, which is by far the most probable outcome.
We consider the following numbers of data sets: . For each value of we repeat 500 experiments consisting of:

sampling the ’s () from ;

for each :

implement two classifiers whose actual difference of accuracy is , following the procedure given in Appendix;

perform 10 runs of 10fold crossvalidation with the two classifiers;


analyze the results through the signedrank and the hierarchical model.
The signedrank test (=0.05) rejects the null hypothesis about 5% of the times for each value of . It is thus correctly calibrated. Yet, it provides no valuable insights. When it does not reject (95% of the times), it does not allow claiming that the null hypothesis is true. When it rejects the null (5% of the times), it draws a wrong conclusion since =0.
The hierarchical model draws more sensible conclusions. The posterior probability increases with (Fig. 2): the presence of more data sets provides more evidence that they are equivalent. For =50 (the typical size of a machine learning study), the average p(rope) reported in simulations is larger than 90%. Fig. 2 reports also on equivalence recognition, which is the proportion of simulations in which p(rope) exceeds 95%. Equivalence recognition increases with , reaching about 0.7 for =50.
Moreover in our simulations the hierarchical model never estimated p(left)>95% or p(right)>95%, so it made no Type I errors. In fact nsht is tied to commit a 5% Type I error when the null hypothesis is true; but this is not the case of Bayesian estimation with rope, which instead generally makes less Type I errors
(kruschke2013bayesian) than nhst. As we described earlier, this is at least in part a consequence of using a whole interval instead of just a single point for the nullhypothesis, which makes the nullhypothesis more difficult to reject.Running the signedrank twice? We cannot detect practically equivalent classifiers by running twice the signedrank test, e.g., once with null hypothesis 0.01 and once with the null hypothesis 0.01. Even if the signedrank test does not reject the null in both cases, we still cannot affirm that the two classifiers are equivalent, since nonrejection of the null does not allow claiming that the null is true.
4.3 Comparison of practically equivalent classifiers
We now simulate two classifiers whose actual difference of accuracy is practically irrelevant but different from zero. We consider two classifiers whose average difference is =0.005, thus within the rope.
We consider . For each value of we repeat 500 experiments as follows:

set as a Cauchy distribution with =0.005 and the same scale factor as in previous experiments (the rope remains by far the most probable outcome for the sampled ’s);

sample the ’s () from ;

implement for each two classifiers whose actual difference of accuracy is and perform 10 runs of 10fold crossvalidation;

analyze the crossvalidation results through the signedrank and the hierarchical model.
The signedranked test is more likely to reject the null hypothesis as the number of data sets increases (Fig. 3). When 50 data sets are available, the signedrank rejects the null in about 25% of the simulations, despite the trivial difference between the two classifiers. Indeed one can reject the null of the signedrank test when comparing two almost equivalent classifiers, by comparing them on enough data sets. As reported in the ASA statement on pvalue (wasserstein2016asa) even a tiny effect can produce a small pvalue if the sample size is large enough.
The behavior of the hierarchical test is the opposite and more sensible. With increasing number of data sets on which the classifiers show similar performance, the researcher should be more convinced that the two classifiers are practically equivalent. The hierarchical test indeed increases the posterior probability of rope (Fig. 4). It is slightly less effective in recognizing equivalence than in the previous experiment since is now closer to the limit of the rope. When =50, it declares equivalence detection with 95% confidence in about 40% of the simulated cases. In our simulations it never claims a significant difference, namely it never estimates p(left) or p(right) to be larger than 95%.
The hierarchical test is thus effective at detecting classifiers that are (actually or practically) equivalent. This is impossible for the signedrank test: it gives the correct answer that the classifiers are different, but in terms of the singlepoint hypothesis. On the contrary, the hierarchical test claims that they are the same in terms of the intervalbased hypothesis.
Moreover the signedrank (=0.05) is tied to commit 5% Type I errors under the null hypothesis, even in presence of infinite data. The hierarchical model is free from this constraint and in our simulation it made no Type I errors at all.
The hierarchical model is thus more conservative than the signed rank test. The price to be paid is that it might be less powerful at claiming significance when comparing two classifiers whose accuracies are truly different. We investigate this setting in the next section.
4.4 Simulation of practically different classifiers
We now simulate two classifiers which are significantly different. We consider different values of : . We set the scale factor of the Cauchy to =0.01 and the number of data sets to =50.
We repeat 500 experiments for each value of , organized as in the previous sections. We then check the power of the two tests for each value of . The power of the signedrank is the proportion of simulations in which it rejects the null hypothesis (=0.05). The power of the hierarchical test is the proportion of simulations in which it estimate p(right)>0.95.
The hierarchical model is necessarily less powerful than the signedrank: to declare the two classifiers as significantly different, the signedrank has to reject the null hypothesis , while the hierarchical model needs to have a posterior mass larger than 0.95 in the region to the right of the rope. As shown in Fig. 5, the signedrank test is indeed more powerful, especially when lies just slightly outside the rope. The two tests show similar power when is larger than 0.02.
4.5 Discussion
The main experimental findings so far are as follows. First, the shrinkage estimator of the ’s yields a lower mean squared error than the MLE estimator, even under misspecification of .
Second, the hierarchical model is effective at detecting equivalent classifiers; this is instead impossible for the nhst test.
Third, the hierarchical model is more conservative than the signedrank test: it rejects more rarely the null hypothesis, as a consequence of its null hypothesis being more realistic. For this reason it commits less Type I error than the signedrank.
However, it is less powerful than the signedrank: when comparing two significantly different classifiers, the hierarchical test claims 95% significance less frequently than the signedrank. The difference in power is not necessarily a large one, as shown in the previous simulation.
Moreover the probabilities returned by the hierarchical model can be interpreted in a more flexible way than simply checking if there is an outcome whose posterior probability is larger than 95%, as we discuss in the next section.
4.6 Interpreting posterior odds
The ratio of posterior probabilities (posterior odds
) shows the extent to which the data support one hypothesis over the other. For instance we can compare the support for left and right by computing the posterior odds
. When there is evidence in favor of left; when there is evidence in favor of right. Rules of thumb for interpreting the amount of evidence corresponding to posterior odds are discussed by raftery1995bayesian and summarized in Tab. 3:Posterior Odds  Evidence 

1–3  weak 
3–20  positive 
>20  strong 
Thus even if none of the three probabilities exceeds the 95% threshold, we can still draw meaningful conclusions by interpreting the posterior odds. We will adopt this approach in the following simulations.
The pvalues cannot be interpreted in a similar fashion, since they are affected both by sample size and effect size. In particular (wasserstein2016asa) show that smaller pvalues do not necessarily imply the presence of larger effects and larger pvalues do not imply a lack of effect. A tiny effect can produce a small pvalue if the sample size is large enough, and large effects may produce unimpressive pvalues if the sample size is small.
4.7 Experiments with Friedman’s functions
The results presented in the previous sections refer to conditions in which the actual (misspecified or not) is analytically known. In this section we perform experiments in which the ’s are not sampled from an analytical distribution; rather, they are due to different settings of sample size, noise etc. The actual of the next section has no analytical form and thus the of the hierarchical model is unavoidably misspecified. This is a challenging setting for checking the conclusion of the hierarchical model.
We generate data sets via the three functions (, and ) proposed by friedman1991multivariate.
Function contains ten features , each uniformly distributed over . Only five features are used to generate the response :
where . To turn this regression problem into a classification problem, we discretize in two bins. The two bins are separated by the median of , which we independently estimate on a sample of 10,000 generated instances.
Functions and have four features uniformly distributed over the ranges:
The functions are:
where and . The original paper sets =125 and
=0.1. Also in this case we turn the problems from regression to classification by discretizing the response variable in two bins, delimited by its median.
We consider 18 settings for each function, obtained by varying the sample size () and the standard deviation of the noise (considering also twice and half the original values). As a further factor we either consider only the original features or we add further twenty normally distributed random features. We have overall 54 settings: 18 settings for each function. They are summarized in Table 4.
Function type  random Feats  Tot settings  

F#1  {0.5,1,2}  {30,100,1000}  {0,20}  3 3 2 =18 
F#2  {62.5,125,250}  {30,100,1000}  {0,20}  3 3 2 =18 
F#3  {0.05,0.1,0.2}  {30,100,1000}  {0,20}  3 3 2 =18 
As a pair of classifiers we consider linear discriminant analysis (lda) and classification trees (cart), as implemented in the caret package for R, without any hyperparameter tuning. As first step we need to measure the actual between two given classifiers in each setting, which then allows us to know the population of the ’s.
Our second step will be to check the conclusions of the signedrank test and of the hierarchical model when they are provided with crossvalidation results referring to a subset of settings.
Measuring
First we need to measure the actual difference of accuracy between lda and cart in the ith setting. Taking advantage of the synthetic nature of the data, we adopt the following procedure:

for j=1:500

sample training data according to the specifics of the th setting: <function type, , , number of random features >;

fit lda and cart on the generated training data;

sample a large test set (5000 instances) and measure the difference of accuracy between cart and lda;


set .
Our procedure yields accurate estimates since each repetition is provided with independent data and has available a large test set.
For instance in a certain setting two classifiers have mean difference of accuracy =0.09, with standard deviation
=0.06. The 95% confidence interval of their difference is tight:
If instead we had performed 500 runs of 10folds crossvalidation obtaining the same value of and , the confidence interval of our estimates would be about 3.5 times larger, as the standard error would be instead of , as shown in Eqn.(1).
Groundtruth
We compute the of each setting using the above procedure.
The groundtruth is that lda is significantly more accurate than cart.
More in detail, 65%
of the ’s belong to the region to the right of the rope (lda being significantly more accurate than cart).
Thus right is is the most probable outcome of the next .
Moreover, the mean of the ’s is =0.02 in favor of lda.
Assessing the conclusions of the tests
We run 200 times the following procedure:

random selection 12 out of 18 settings for each Friedman function, thus selecting 36 settings;

in each setting:

generate a data set according to the specific of the setting;

run 10 runs of 10folds crossvalidation of lda and cart using paired folds;


analyze the crossvalidation results on the =36 data sets using the signed rank and the hierarchical test.
We start checking the power of the tests, defined as the proportion of simulations in which the null hypothesis is rejected (signedrank) or the posterior probability p(right) exceeds 95% (hierarchical test).
The two tests have roughly the same power: 28% for the signedrank and 27.5% for the hierarchical test. In the remaining simulations the signedrank does not reject ; in those cases it conveys no information since the pvalues cannot be interpreted.
We can instead interpret the posterior odds yielded by the hierarchical model, obtaining the following results:

in 11% of the simulations both and are larger than 20, providing strong evidence in favor of lda even if p(right) does not exceed 95%;

in a further 33% of the simulations both and are larger than 3, providing at least positive evidence in favor of lda.
We have moreover to point out a 2% of simulations in which the posterior odds provide erroneously positive evidence for rope over both right and left. In no case there is positive evidence for left over either rope or right.
Thus the interpretation of posterior odds allows drawing meaningful conclusions even when the 95% threshold is not exceeded. The probabilities are sensibly estimated, even if is unavoidably misspecified.
As a further check we compare and . Also in this case is much lower than (Fig 6 ), with an average reduction of about 60%. This further confirms the properties of the shrinkage estimator.
4.8 Sensitivity analysis on realworld data sets
We now consider real data sets. In this case we cannot know the actual ’s: we could repeat a few hundred times crossvalidation but the resulting estimates would have large uncertainty as already discussed.
We exploit this setting to perform sensitivity analysis and to further compare the conclusions drawn by the hierarchical model and of the signedrank test.
We consider 54 data sets taken from the WEKA data sets page^{1}^{1}1http://www.cs.waikato.ac.nz/ml/weka/datasets.html
. We consider five classifiers: naive Bayes (nbc), averagedone dependence estimator (aode), hidden naive Bayes (hnb), decision tree (j48), grafted decision tree (j48gr).
witten2011data provides a summary description of all such classifiers with pointers to the relevant papers. We perform 10 runs of 10folds crossvalidation for each classifier on each data set. We run all experiments using the WEKA^{2}^{2}2http://www.cs.waikato.ac.nz/ml/weka/ software.A fundamental step of Bayesian analysis is to check how the posterior conclusions depend on the chosen prior and how the model fits the data. The hierarchical model shows some sensitivity on the choice of , being instead robust to the other assumptions (see later for further discussion). The Student distribution is more flexible than the Gaussian and we have found that it consistently provides better fit to the data. Yet, the model conclusions are sometimes sensitive on the prior on the degrees of freedom of the Student.
In Table 5 we compare the posterior inferences of the model, using the prior (proposed in a different context by juarez2010model) or using the more flexible model described in Sec. 3, where the the parameters of the Gamma are described as further random variables with their own prior distributions. Such two variants are referred to as Gamma(2,0.1) and hierarchical in Table 5.
Hierarchical  Gamma(2,0.1)  

pair  left  rope  right  left  rope  right 
nbchnb  1.00  0.00  0.00  1.00  0.00  0.00 
nbcj48  0.80  0.02  0.18  0.80  0.01  0.20 
nbcj48gr  0.84  0.02  0.14  0.84  0.01  0.15 
hnbj48  0.03  0.10  0.87  0.03  0.02  0.95 
hnbj48gr  0.03  0.07  0.90  0.03  0.02  0.95 
j48j48gr  0.00  1.00  0.00  0.00  1.00  0.00 
In some cases the estimates of the two models differ by some points (Tab. 5). This means that the actual highlevel distribution from which the ’s are sampled is not a Student (or a Gaussian), otherwise the estimate of the two models would converge.
Which model better fits the data? We respond this question adopting a visual approach. We start considering that the shrinkage estimates of the ’s are identical between the two models. We thus compute the density plot of the shrinkage estimates (our best estimate of the ’s). We take such density as the ground truth (this is actually our best approximation to the ground truth) and we plot it in thick black (Fig. 7). Then we sample 8000
’s from both variants of the model, obtaining two further densities. We then plot the three densities for each pair of classifiers (Fig. 7). We produce all the density plots using the default kernel density estimation provided in R. In general the hierarchical model, being more flexible, fits better the data than the model equipped with a simple Gamma prior.
4.8.1 Sensitivity on the prior on and
The model conclusions are moreover robust with respect to the specification of the priors and . Recall that is the standard deviation on the ith data set while is the standard deviation of the highlevel distribution.
Our model assumes where where is the average of the sample standard deviations of the different data sets. The posterior distributions of the standard deviation obtained with this prior are insensitive on the choice of as long as is large enough (kruschke2013bayesian; gelman2006prior). Indeed we experimentally verified that we obtained identical posterior estimates of the standard deviations adopting as upper bound or .
The same consideration applies to , whose prior is . We obtain the same posterior distribution for using as upper bound or , where is the standard deviation of the ’s.
4.9 Comparing the signedrank and the hierarchical test
We compare the conclusions of the hierarchical model and of the signedrank test on the same cases of the previous section. The results are given in Tab. 6.
Hierarchical  Signedrank  

pair  left  rope  right  pvalue 
nbchnb  1.00  0.00  0.00  0.00 
nbcj48  0.80  0.02  0.18  0.46 
nbcj48gr  0.84  0.02  0.14  0.39 
hnbj48  0.03  0.10  0.87  0.07 
hnbj48gr  0.03  0.07  0.90  0.08 
j48j48gr  0.00  1.00  0.00  0.00 
Both the signedrank and the hierarchical test claim with 95% confidence hnb to be significantly more accurate than nbc.
In the following comparisons apart from the last one, the two tests do not draw any conclusion with 95% confidence. The signedrank does not reject the null hypothesis, while the hierarchical test does not achieve probability larger than 95%.
When the signedrank test does not reject the null hypothesis, it draws a noninformative conclusion. We can instead always interpret the posterior odds yielded by the hierarchical model. When comparing nbc and j48, there is a positive evidence for right (j48 being more accurate than nbc) over left and strong evidence for right over rope. We thus conclude that there is positive evidence of j48 being practically more accurate than nbc. Similarly, we conclude that there is positive evidence of j48gr being practically more accurate than nbc
When comparing hnb and j48, there is strong evidence for right (hnb being more accurate than j48) over both left and rope. We conclude that there is strong evidence of j48 being practically more accurate than hnb. We draw the same conclusion when comparing hnb and j48gr.
The two test draw opposite conclusions when comparing j48 and j48gr. The signedrank declares j48gr to be significantly more accurate than j48 (pvalue 0.00) while the hierarchical model declares them to be practically equivalent, with p(rope)=1. The reason why the two tests achieved opposite conclusions is that the differences have a consistent sign but are smallsized. In most data sets the signs of the difference is in favor of j48gr; this leads the signed rank test to claim significance. Yet the differences lies mostly within the rope (Fig. 8). The hierarchical model shrinks them further towards the overall mean and eventually claims the two classifiers to be practically equivalent. The posterior probabilities remain unchanged even adopting the halfsized rope (0.005, 0.005). Reducing further the rope does not seem meaningful and thus we conclude that, once the magnitude of the differences is taken into consideration, the accuracy of j48 and j48gr are practically equivalent, even if most signs are in favor of j48gr.
5 Conclusions
The proposed approach is a realistic model of the data generated by crossvalidation across multiple data sets. It is the first approach that represents both the distribution of the ’s across data sets and the distribution of the crossvalidation results on the ith data set given and .
Compared to the signedrank, it is more conservative as it rejects less easily the null hypothesis. In fact the signedrank assumes a point null hypothesis, while the null hypothesis of the hierarchical model contains all the values of the rope. This is a more realistic null hypothesis which is more difficult to reject.
This allows the hierarchical test to detect classifiers which are practically equivalent. Being more conservative, the hierarchical test is generally less powerful than the signedrank: when faced with two classifiers which are significantly different, it rejects the null hypothesis with more difficulty.
Yet the interpretation of the posterior odds allows drawing meaningful conclusions even when the posterior probabilities do not exceed 95%.
The hierarchical model yields more accurate estimates of the ’s than the usual maximum likelihood estimator, because it jointly estimates the various ’s applying shrinkage. Our results show that the shrinkage estimator yields consistently lower mean squared error than the MLE in a variety of challenging experiments.
It is possible to visually verify the fit offered by model by comparing the fitted posterior distribution to the density of the shrinkage estimates. In most cases we have found that the Student distribution provides a satisfactory fit. Yet an interesting research direction is thus the adoption of a nonparametric approach for modelling , thus overcoming the Student assumption. This is a nontrivial task which we leave for future research.
Acknowledgments
The research in this paper has been partially supported by the Swiss NSF grants ns. IZKSZ2_162188 and n. 200021_146606.
6 Appendix
6.1 Implementing two classifiers with known difference of accuracy
On the ith data set we need to simulate two classifier whose actual difference of accuracy is . We start by sampling the instances from a naive Bayes model with two features. Let us denote by the class variables with states are and by and the two features with states and . The naive Bayes model is thus . The parameters of the conditional probability tables are: =0.5; ; ; ; with . The remaining elements of the conditional probability tables are the complement to 1 of the above elements. We set =0.9 and . We sample the data set from this naive Bayes model.
During crossvalidation we train and test the two competing classifiers and . Their expected accuracies are and respectively, and thus their expected difference of accuracy is . Consider classifier . Assume that the marginal probabilities of the class have been correctly estimated. The classification thus depends only on the conditional probability of the feature given the class If = the most probable class is as long as , where denotes the conditional probability estimated from data. The accuracy of this prediction is . It =, the most probable class is as long as . Also the accuracy of this prediction is . If the bias of conditional probability (0.5 and 0.5) is correctly estimated the accuracy of classifier on a large test set is . Analogously, the accuracy of classifier in the same conditions is , so that their difference is . Since the sampled data set have finite size the mean difference of accuracy measured by crossvalidation will fluctuate with some variance around .
6.2 Proofs
Proof of Proposition 1
Consider the hierarchical model:
(11) 
We aim at computing the derivative of the w.r.t. the parameter . Consider the quadratic term from the first and second Gaussian:
its derivatives w.r.t. is . Exploiting the fact that
it follows that
where . The latter equality can be derived by coraniML2015[Appendix], i.e.,
The other derivatives can be computed easily.
Proof of Proposition 2
Let us consider the likelihood:
(12) 
Let us define . The MSE of the maximum likelihood estimator is:
Consider that where
is a linear transformation of the variable
. From the properties of the Normal distribution, it follows thatand since