Should we really use post-hoc tests based on mean-ranks?

05/09/2015
by Alessio Benavoli, et al.

The statistical comparison of multiple algorithms over multiple data sets is fundamental in machine learning. It is typically carried out with the Friedman test. When the Friedman test rejects the null hypothesis, multiple comparisons are carried out to establish which differences among algorithms are significant. These multiple comparisons are usually performed using the mean-ranks test. The aim of this technical note is to discuss the inconsistencies of the mean-ranks post-hoc test, with the goal of discouraging its use in machine learning as well as in medicine, psychology, and other fields. We show that the outcome of the mean-ranks test depends on the pool of algorithms originally included in the experiment. In other words, the outcome of the comparison between algorithms A and B also depends on the performance of the other algorithms included in the original experiment. This can lead to paradoxical situations. For instance, the difference between A and B could be declared significant if the pool comprises algorithms C, D, E and not significant if the pool comprises algorithms F, G, H. To overcome these issues, we suggest instead performing the multiple comparisons with a test whose outcome depends only on the two algorithms being compared, such as the sign test or the Wilcoxon signed-rank test.
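The pool dependence described above can be illustrated with a small sketch. It assumes the usual form of the mean-ranks statistic, z = |R_A - R_B| / sqrt(m(m+1)/(6n)) for m algorithms and n data sets, with a Bonferroni-corrected normal critical value; the abstract does not spell out either choice. The accuracy scores and the helper names (`ranks_desc`, `mean_ranks_z`, `sign_test_p`) are invented for illustration only.

```python
import math
from statistics import NormalDist


def ranks_desc(scores):
    """Rank the scores of one data set (1 = best); ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mean_ranks_z(results, a, b):
    """Mean-ranks z statistic for a vs b; ranks are computed over the whole pool."""
    names = list(results)
    m, n = len(names), len(results[a])
    total = {name: 0.0 for name in names}
    for d in range(n):
        for name, rk in zip(names, ranks_desc([results[x][d] for x in names])):
            total[name] += rk
    return abs(total[a] - total[b]) / n / math.sqrt(m * (m + 1) / (6 * n))


def sign_test_p(x, y):
    """Two-sided sign test p-value; uses only the two algorithms being compared."""
    wins = sum(xi > yi for xi, yi in zip(x, y))
    losses = sum(xi < yi for xi, yi in zip(x, y))
    n, k = wins + losses, max(wins, losses)
    tail = sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical accuracies (in percent) on 10 data sets; A beats B everywhere.
A = [80 + d for d in range(10)]
B = [70 + d for d in range(10)]
# Pool 1: C, D, E fall between B and A. Pool 2: F, G, H fall below both.
pool1 = {"A": A, "B": B, "C": [b + 3 for b in B],
         "D": [b + 5 for b in B], "E": [b + 7 for b in B]}
pool2 = {"A": A, "B": B, "F": [b - 30 for b in B],
         "G": [b - 20 for b in B], "H": [b - 10 for b in B]}

# Bonferroni over the 10 pairwise comparisons among 5 algorithms (assumed).
crit = NormalDist().inv_cdf(1 - 0.05 / (2 * 10))
z1 = mean_ranks_z(pool1, "A", "B")
z2 = mean_ranks_z(pool2, "A", "B")
p_sign = sign_test_p(A, B)

print(f"critical z = {crit:.3f}")
print(f"pool C,D,E: z = {z1:.3f}, significant = {z1 > crit}")
print(f"pool F,G,H: z = {z2:.3f}, significant = {z2 > crit}")
print(f"sign test p-value (same for both pools) = {p_sign:.5f}")
```

With these made-up scores, the A versus B data are identical in both pools, yet the mean-ranks statistic is about 5.66 in the first pool and about 1.41 in the second, so only the first exceeds the critical value of about 2.81. The sign test p-value is computed from A and B alone and therefore cannot change with the pool.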

research
02/18/2022

A Graphical Approach for Friedman Test: Moments Approach

Friedman test is a nonparametric method that proposed for analyzing data...
research
08/10/2023

Rank tests for outlier detection

In novelty detection, the objective is to determine whether the test sam...
research
03/12/2018

Statistical tests for evaluating an earthquake prediction method

The impact of including postcursors in the null hypothesis test is discu...
research
08/09/2022

A Bayesian Bradley-Terry model to compare multiple ML algorithms on multiple data sets

This paper proposes a Bayesian model to compare multiple algorithms on m...
research
09/13/2017

The Merging Path Plot: adaptive fusing of k-groups with likelihood-based model selection

There are many statistical tests that verify the null hypothesis: the va...
research
11/11/2019

A post hoc test on the Sharpe ratio

We describe a post hoc test for the Sharpe ratio, analogous to Tukey's t...
research
07/10/2018

Paired Comparison Sentiment Scores

The method of paired comparisons is an established method in psychology....
