1 Introduction
The statistical comparison of multiple algorithms over multiple data sets is fundamental in machine learning; it is typically carried out by means of a statistical test. The recommended approach is the Friedman test (demvsar2006statistical). Being nonparametric, it does not require commensurability of the measures across different data sets, it does not assume normality of the sample means, and it is robust to outliers.
When the Friedman test rejects the null hypothesis of no difference among the algorithms, post-hoc analysis is carried out to assess which differences are significant. A series of pairwise comparisons is performed, adjusting the significance level via the Bonferroni correction or other more powerful approaches (demvsar2006statistical; garcia2008extension) to control the family-wise Type I error.
The mean-ranks post-hoc test (McDonald1967; nemeneyi1963) is recommended as the pairwise test for multiple comparisons in most books of nonparametric statistics: see for instance (gibbons2011nonparametric, Sec. 12.2.1), (kvam2007nonparametric, Sec. 8.2) and (sheskin2003handbook, Sec. 25.2). It is also commonly used in machine learning (demvsar2006statistical; garcia2008extension). The mean-ranks test is based on the statistic:
\[
z = \frac{\bar{R}_A - \bar{R}_B}{\sqrt{\frac{m(m+1)}{6n}}},
\]
where \(\bar{R}_A\) and \(\bar{R}_B\) are the mean ranks (as computed by the Friedman test) of algorithms A and B, \(m\) is the number of algorithms to be compared and \(n\) the number of datasets. The mean ranks are computed considering the performance of all the algorithms. Thus the outcome of the comparison between A and B depends also on the performance of the other \(m-2\) algorithms included in the original experiment. This can lead to paradoxical situations: the difference between A and B could be declared significant within one pool of alternative algorithms and not significant within another, even though the performance of the remaining algorithms should be irrelevant when comparing A and B. This problem has been pointed out several times in the past (miller1966simultaneous; gabriel1969simultaneous; Fligner1984) and also in (hollander2013nonparametric, Sec. 7.3). Yet it is ignored by most of the literature on nonparametric statistics. The issue should not be ignored: it can increase the Type I error when comparing two equivalent algorithms and, conversely, decrease the power when comparing algorithms whose performance is truly different. In this technical note these inconsistencies of the mean-ranks test are discussed in detail and illustrated by means of highlighting examples, with the goal of discouraging its use in machine learning as well as in medicine, psychology, etc.
To avoid these issues, we instead recommend performing the pairwise comparisons of the post-hoc analysis using the Wilcoxon signed-rank test or the sign test. The decisions of these tests do not depend on the pool of algorithms included in the initial experiment. Regardless of the specific test adopted for the pairwise comparisons, it remains necessary to control the family-wise Type I error; this can be obtained through the Bonferroni correction or through more powerful approaches (demvsar2006statistical; garcia2008extension).
Even better would be the adoption of Bayesian methods for hypothesis testing, which overcome the many drawbacks (demvsar2008appropriateness; goodman1999toward; kruschke2010bayesian) of null-hypothesis significance tests. For instance, Bayesian counterparts of the Wilcoxon and of the sign test have been presented in (benavoli2014a; IDP); a Bayesian approach for comparing cross-validated algorithms on multiple data sets is discussed by (ML).
2 Friedman test
The performance of \(m\) algorithms tested on \(n\) datasets can be organized in a matrix:
\[
X = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1n} \\
\vdots & \vdots & & \vdots \\
x_{m1} & x_{m2} & \cdots & x_{mn}
\end{pmatrix} \tag{1}
\]
where \(x_{ij}\) denotes the performance of the \(i\)-th algorithm on the \(j\)-th dataset (for \(i=1,\dots,m\) and \(j=1,\dots,n\)). The observations (performances) in different columns are assumed to be independent. The algorithms are ranked column-by-column and each entry is replaced by its rank relative to the other observations in the \(j\)-th column:
\[
R = \begin{pmatrix}
r_{11} & r_{12} & \cdots & r_{1n} \\
\vdots & \vdots & & \vdots \\
r_{m1} & r_{m2} & \cdots & r_{mn}
\end{pmatrix} \tag{2}
\]
where \(r_{ij}\) is the rank of the \(i\)-th algorithm on the \(j\)-th dataset. The sum of the \(i\)-th row, \(R_i = \sum_{j=1}^{n} r_{ij}\), depends on how the \(i\)-th algorithm performs w.r.t. the other algorithms. Under the null hypothesis of the Friedman test (no difference between the algorithms) the average value of \(R_i\) is \(n(m+1)/2\). The statistic of the Friedman test is
\[
\chi^2_F = \frac{12}{n\,m(m+1)} \sum_{i=1}^{m} R_i^2 \;-\; 3n(m+1), \tag{3}
\]
which under the null hypothesis has a chi-squared distribution with \(m-1\) degrees of freedom. For \(m=2\), the Friedman test corresponds to the sign test.

3 Mean-ranks post-hoc test
If the Friedman test rejects the null hypothesis, one has to establish which differences among the algorithms are significant. If all \(m\) classifiers are compared to each other, one has to perform \(m(m-1)/2\) pairwise comparisons. When performing multiple comparisons, one has to control the family-wise error rate (FWER), namely the probability of at least one erroneous rejection of the null hypothesis among the \(m(m-1)/2\) pairwise comparisons. In the following examples we control the FWER through the Bonferroni correction, even though more powerful techniques are also available (demvsar2006statistical; garcia2008extension). However, our discussion of the shortcomings of the mean-ranks test is valid regardless of the specific approach adopted to control the FWER. The mean-ranks test claims that the \(i\)-th and the \(j\)-th algorithm are significantly different if:
\[
|\bar{R}_i - \bar{R}_j| \;>\; z_{\alpha/(m(m-1))} \sqrt{\frac{m(m+1)}{6n}}, \tag{4}
\]
where \(\bar{R}_i = R_i/n\) is the mean rank of the \(i\)-th algorithm and \(z_{\alpha/(m(m-1))}\) is the Bonferroni-corrected upper standard normal quantile (gibbons2011nonparametric, Sec. 12.2.1). Equation (4) is based on the large-sample (large \(n\)) approximation of the distribution of the statistic. The actual distribution of the statistic is derived assuming all the rank configurations in (2) to be equally probable. Under this assumption the variance of \(\bar{R}_i - \bar{R}_j\) is \(m(m+1)/(6n)\), which originates the term under the square root in (4). Yet this assumption is not tenable: the post-hoc analysis is performed precisely because the null hypothesis of the Friedman test has been rejected.
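The Friedman statistic (3) and the mean-ranks rule (4) are straightforward to implement. The following is a minimal Python sketch, assuming an \(m \times n\) performance matrix without ties; it is an illustration of the formulas above, not the paper's own code:

```python
import numpy as np

def column_ranks(X):
    """Rank each column of the m x n performance matrix X (Eq. 2).
    Higher performance gets a higher rank; ties are not handled."""
    return X.argsort(axis=0).argsort(axis=0) + 1

def friedman_stat(X):
    """Friedman chi-square statistic (Eq. 3) from the row rank sums R_i."""
    m, n = X.shape
    R = column_ranks(X).sum(axis=1)
    return 12.0 / (n * m * (m + 1)) * (R ** 2).sum() - 3.0 * n * (m + 1)

def mean_ranks_different(X, i, j, z_crit):
    """Mean-ranks post-hoc rule (Eq. 4): algorithms i and j are declared
    different if |Rbar_i - Rbar_j| exceeds z_crit * sqrt(m(m+1)/(6n))."""
    m, n = X.shape
    Rbar = column_ranks(X).mean(axis=1)
    return abs(Rbar[i] - Rbar[j]) > z_crit * np.sqrt(m * (m + 1) / (6.0 * n))

# toy check: with m = 2 the Friedman statistic reduces to the sign test;
# here row 0 wins all 4 columns, so R = (8, 4) and chi2 = 4.0
X = np.array([[0.90, 0.80, 0.70, 0.85],
              [0.60, 0.70, 0.65, 0.80]])
chi2 = friedman_stat(X)  # 4.0
```

Note that the rule in `mean_ranks_different` uses the ranks of all \(m\) rows, which is exactly the source of the inconsistencies discussed next.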
4 Inconsistencies of the mean-ranks test
We illustrate the inconsistencies of the mean-ranks test by presenting three examples. All examples refer to the analysis of the accuracy of different classifiers on multiple data sets. We show that the outcome of the test depends both on the actual difference of accuracy between algorithms A and B and on the accuracy of the remaining algorithms.
4.1 Example 1: artificially increasing power
Assume we have tested five algorithms A, B, C, D and E on 20 datasets, ranking them on each dataset so that better algorithms are given higher ranks. We aim at comparing A and B. Algorithm A is better than B in the first ten datasets, while B is better than A in the remaining ten. The two algorithms have the same mean performance and their differences are symmetrically distributed. Each algorithm wins on half the data sets. Different types of two-sided tests (t-test, Wilcoxon signed-rank test, sign test) return the same p-value, \(p=1\). The mean-ranks test corresponds in this case to the sign test and thus its p-value is also 1. This is the most extreme result in favor of the null hypothesis. Now assume that we compare A and B together with C, D and E. In the first ten datasets, algorithm B is worse than C, D and E, which in turn are worse than A. In the remaining ten datasets, A is worse than B, which in turn is worse than C, D and E. The Friedman test rejects the null hypothesis. We can thus perform the post-hoc test (4) with \(z_{0.0025} \approx 2.807\) (the Bonferroni-corrected upper standard normal quantile for \(\alpha=0.05\) and \(m=5\)); the significance level has been adjusted to \(\alpha/(m(m-1)) = 0.0025\), since we are performing \(m(m-1)/2 = 10\) two-sided comparisons. The mean ranks of A and B are respectively 3 and 1.5; thus, since \((3-1.5)/\sqrt{m(m+1)/(6n)} = 1.5/0.5 = 3 > 2.807\), we can reject the null hypothesis. The result of the post-hoc test is that algorithms A and B have significantly different performance.
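The arithmetic of this example can be checked numerically. The sketch below reconstructs the ranks from the orderings described in the text (the order of C, D and E among themselves is arbitrary and does not affect the ranks of A and B):

```python
import numpy as np

# ranks of A, B, C, D, E (higher = better), per the orderings in the text:
# datasets 1-10:  B < C,D,E < A  ->  A gets rank 5, B gets rank 1
# datasets 11-20: A < B < C,D,E  ->  A gets rank 1, B gets rank 2
first_ten = np.array([5, 1, 2, 3, 4])
last_ten  = np.array([1, 2, 3, 4, 5])
R = np.column_stack([first_ten] * 10 + [last_ten] * 10)  # 5 x 20 rank matrix

m, n = R.shape
mean_rank_A, mean_rank_B = R.mean(axis=1)[:2]   # 3.0 and 1.5
z_stat = (mean_rank_A - mean_rank_B) / np.sqrt(m * (m + 1) / (6.0 * n))
z_crit = 2.807   # upper 0.0025 quantile: alpha/(m(m-1)) = 0.05/20
reject = z_stat > z_crit   # 3.0 > 2.807 -> A and B declared different
```

Meanwhile A and B win ten datasets each, so the head-to-head sign test yields \(p=1\): the rejection is driven entirely by C, D and E.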
The decisions of the mean-ranks test are thus not consistent:

- if it compares A and B alone, it does not reject the null hypothesis;

- if it compares A and B together with C, D and E, it rejects the null hypothesis, concluding that A and B have significantly different performance.
The presence of C, D and E artificially introduces a difference between A and B by changing their mean ranks. For instance, C, D and E always rank better than B, while they never outperform A when it works well (i.e., on datasets one to ten); in a real case study, a similar result would probably indicate that while A is well suited for the first ten datasets, C, D and E are better suited for the last ten. The difference (in rank) between A and B is artificially amplified by the presence of C, D and E only when A is better than B. The point is that a large difference in the global ranks of two classifiers does not necessarily correspond to a large difference in their accuracies (and vice versa, as we will see in the next example).
This issue can happen in practice (we thank the anonymous reviewer for suggesting this example). Assume that a researcher presents a new algorithm B and some of its weaker variations B1, B2, ..., Bk, and compares the new algorithms with an existing algorithm A. When A is better than B, it typically also outperforms the weaker variations, so its rank is the highest. When B is better than A, the variations also tend to outperform A, pushing its rank to the bottom. Therefore, the presence of B1, B2, ..., Bk artificially increases the difference between the mean ranks of A and B.
4.2 Example 2: low power due to the remaining algorithms
Assume the performance of algorithms A and B on different data sets to be normally distributed, with the mean performance of A slightly larger than that of B. The pool of algorithms comprises also C, D and E, whose mean performance is much lower than that of both A and B. A collection of \(n\) data sets is considered. For the sake of simplicity, assume we want to compare only A and B; there is thus no need of correction for multiple comparisons.
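Under this setup, the difference in power between the sign test and the mean-ranks test can be checked by Monte Carlo simulation. The sketch below uses assumed parameters (means 0.80 and 0.74 with standard deviation 0.05 for A and B, mean 0.20 for C, D and E, and \(n=30\) data sets); these numbers are illustrative, not the ones of the original experiment:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, reps = 30, 5, 4000
z = 1.96  # upper alpha/2 quantile, alpha = 0.05, no multiplicity correction

def mean_rank_diff():
    """Mean-rank difference between A and B over n simulated data sets."""
    X = np.vstack([
        rng.normal(0.80, 0.05, n),        # A (assumed distribution)
        rng.normal(0.74, 0.05, n),        # B (assumed distribution)
        rng.normal(0.20, 0.05, (3, n)),   # C, D, E: far worse than A and B
    ])
    ranks = X.argsort(axis=0).argsort(axis=0) + 1  # higher = better
    return ranks[0].mean() - ranks[1].mean()

# Since C, D, E always rank below A and B, the statistic is the same for
# both tests; only the critical value differs.
thr_sign = z * np.sqrt(2 * 3 / (6.0 * n))              # m = 2: z / sqrt(n)
thr_mean_ranks = z * np.sqrt(m * (m + 1) / (6.0 * n))  # sqrt(5) times larger

diffs = np.array([mean_rank_diff() for _ in range(reps)])
power_sign = float(np.mean(np.abs(diffs) > thr_sign))
power_mean_ranks = float(np.mean(np.abs(diffs) > thr_mean_ranks))
```

With these assumed parameters the sign test detects the difference in the vast majority of the simulations, while the mean-ranks test rarely does.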
When comparing A and B, the power of the two-sided sign test with \(\alpha=0.05\) is very high, while the power of the mean-ranks test is much lower (both powers have been evaluated numerically by Monte Carlo simulation). We can explain the large difference of power as follows. The sign test (under the normal approximation of the distribution of the statistic) claims significance when:
\[
|\bar{R}_A - \bar{R}_B| > z_{\alpha/2} \sqrt{\frac{2 \cdot 3}{6n}} = \frac{z_{\alpha/2}}{\sqrt{n}},
\]
while the mean-ranks test (4) claims significance when:
\[
|\bar{R}_A - \bar{R}_B| > z_{\alpha/2} \sqrt{\frac{m(m+1)}{6n}} = z_{\alpha/2}\sqrt{\frac{5}{n}},
\]
with \(m=5\). Since algorithms A and B have mean performances that are much larger than those of C, D and E, the mean-rank difference \(\bar{R}_A - \bar{R}_B\) is equal for the two tests. However, the mean-ranks test estimates the variance of the statistic \(\bar{R}_A - \bar{R}_B\) to be five times larger compared to the sign test. The critical value of the mean-ranks test is thus inflated by a factor \(\sqrt{5} \approx 2.24\), largely decreasing the power of the test. In fact, for the mean-ranks test the variance of \(\bar{R}_A - \bar{R}_B\) increases with the number of algorithms included in the initial experiment.

4.3 Example 3: real classifiers on UCI data sets
Finally, we compare the accuracies of seven classifiers on 54 datasets: the J48 decision tree, hidden naive Bayes, the averaged one-dependence estimator (AODE), naive Bayes, J48 graft, locally weighted naive Bayes, and random forest. The whole set of results is given in the Appendix. Each classifier has been assessed via 10 runs of 10-fold cross-validation. We performed all the experiments using WEKA (http://www.cs.waikato.ac.nz/ml/weka/); all these classifiers are described in (witten2005data). The accuracies are reported in Table 2. Assume that our aim is to compare only the first four classifiers; therefore, we consider just the first 4 columns of Table 2. The Friedman test rejects the null hypothesis. The pairwise comparison for the pair of interest gives a statistic greater than the Bonferroni-corrected upper standard normal quantile (for \(\alpha=0.05\) and \(m=4\)); the mean-ranks procedure thus finds the two algorithms to be significantly different.
If we instead compare the same pair together with the remaining classifiers, the Friedman test again rejects the null hypothesis, but the pairwise comparison for the same pair now gives a statistic smaller than the Bonferroni-corrected quantile. Thus the difference between the two algorithms is not significant.
The accuracies of the two algorithms are the same in the two cases, but once again the decisions of the mean-ranks test are conditional on the group of classifiers being considered.
Consider building sets of four classifiers containing the pair under study and two of the remaining five classifiers: ten different such sets can be built. For each set we run the mean-ranks test to check whether the difference between the two classifiers of the pair is significant. The difference is claimed to be significant in 7/10 cases and not significant in 3/10 cases.
Now consider sets of five classifiers, containing the pair under study and three of the remaining five classifiers; again ten different such sets can be built. This yields 10 further cases in which we compare the same pair. Their difference is claimed to be significant in 9/10 cases.
Table 1 reports the pairwise comparisons for which the statistical decision changes with the pool of classifiers that is considered. The outcome of the mean-ranks test on the same pair of classifiers clearly depends on the pool of alternative classifiers that is assumed.
Pair       Card=2   Card=3   Card=4
 vs.        7/10     9/10     3/5
 vs.        1/10      -        -
 vs.        2/10      -        -
 vs.        9/10     5/10      -
4.4 Maximum type I error
A further drawback of the mean-ranks test, not discussed in the previous examples, is that it cannot control the maximum Type I error, that is, the probability of falsely declaring any pair of algorithms to be different, regardless of the performance of the other algorithms. If the accuracies of all algorithms but one are equal, the procedure does not guarantee the family-wise Type I error to be smaller than \(\alpha\) when comparing the equivalent algorithms. We point the reader to (Fligner1984) for a detailed discussion of this aspect.
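The loss of Type I error control can also be checked by simulation. In the sketch below (an assumed setup in the spirit of Example 1, not taken from the paper), A and B are truly equivalent: each wins on a random half of the data sets, while C, D and E always fall between them, so the winner of each dataset ranks 5 and the loser 1. The uncorrected mean-ranks test at \(\alpha = 0.05\) then rejects far more often than 5%, whereas the sign test on the same data keeps its nominal level:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, reps, z = 20, 5, 5000, 1.96

rejections_mean_ranks = rejections_sign = 0
for _ in range(reps):
    wins_A = rng.random(n) < 0.5   # A and B are equivalent: fair coin
    # C, D, E always rank between A and B: winner rank 5, loser rank 1,
    # so the per-dataset rank difference of A and B is +/-4 when m = 5
    diff_mr = np.where(wins_A, 4.0, -4.0).mean()
    diff_sign = np.where(wins_A, 1.0, -1.0).mean()  # m = 2 ranking: +/-1
    if abs(diff_mr) > z * np.sqrt(m * (m + 1) / (6.0 * n)):
        rejections_mean_ranks += 1
    if abs(diff_sign) > z * np.sqrt(2 * 3 / (6.0 * n)):
        rejections_sign += 1

type1_mean_ranks = rejections_mean_ranks / reps   # far above 0.05
type1_sign = rejections_sign / reps               # close to 0.05
```

The inflation arises because the interleaved C, D and E multiply the rank difference of A and B by 4 while the test's variance term grows only by \(\sqrt{5}\).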
5 A suggested procedure
Given the above issues, we recommend avoiding the mean-ranks test for the post-hoc analysis. One should instead perform the multiple comparisons using tests whose decisions depend only on the two algorithms being compared, such as the sign test or the Wilcoxon signed-rank test. The sign test is more robust, as it only assumes the observations to be identically distributed; its drawback is low power. The Wilcoxon signed-rank test is more powerful and thus generally recommended (demvsar2006statistical). Compared to the sign test, it makes the additional assumption that the differences between the two algorithms are symmetrically distributed. The choice between the sign test and the signed-rank test thus depends on whether the symmetry assumption is tenable for the analyzed data.
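As an illustration of the recommended procedure, the pairwise sign test depends only on the head-to-head wins within each pair, so its decisions cannot change with the pool. A minimal sketch (exact sign test on all pairs with Bonferroni correction; the Wilcoxon signed-rank test could be substituted via, e.g., scipy.stats.wilcoxon):

```python
import numpy as np
from math import comb

def sign_test_p(wins, losses):
    """Exact two-sided sign test p-value (ties discarded beforehand)."""
    n = wins + losses
    tail = sum(comb(n, t) for t in range(max(wins, losses), n + 1)) / 2.0 ** n
    return min(1.0, 2.0 * tail)

def pairwise_sign_tests(X, alpha=0.05):
    """All-pairs sign tests on an m x n accuracy matrix, Bonferroni-corrected.
    Each decision uses only the head-to-head wins of that pair, so it does
    not depend on the other algorithms included in the pool."""
    m, _ = X.shape
    n_comparisons = m * (m - 1) // 2
    decisions = {}
    for i in range(m):
        for j in range(i + 1, m):
            wins = int(np.sum(X[i] > X[j]))
            losses = int(np.sum(X[i] < X[j]))
            decisions[(i, j)] = sign_test_p(wins, losses) < alpha / n_comparisons
    return decisions
```

Removing or adding rows to `X` changes only the Bonferroni denominator, never the per-pair p-values.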
Regardless of the adopted test, the multiple comparisons should be performed adjusting the significance level to control the family-wise Type I error. This can be done using the corrections for multiple comparisons discussed by (demvsar2006statistical; garcia2008extension). If we adopt the Wilcoxon signed-rank test in Example 3 for comparing the pair of classifiers studied there, we obtain a p-value that is independent of the performance of the other algorithms. Thus, for any pool of alternative classifiers, we always report the same decision: the two algorithms are significantly different, because the p-value is smaller than the Bonferroni-corrected significance level (for \(m=7\) classifiers and \(\alpha=0.05\), the corrected level is \(0.05/21\)).
6 Software
The MATLAB scripts of the above examples can be downloaded from ipg.idsia.ch/software/meanRanks/matlab.zip
7 Conclusions
The mean-ranks post-hoc test is a widely used test for multiple pairwise comparisons. We have discussed a number of drawbacks of this test, which we recommend avoiding. We instead recommend adopting the sign test or the Wilcoxon signed-rank test, whose decisions do not depend on the pool of classifiers included in the original experiment.
We moreover bring to the attention of the reader the Bayesian counterparts of these tests, which overcome the many drawbacks (kruschke2010bayesian, Chap. 11) of null-hypothesis significance testing.