What is the state of the art? Accounting for multiplicity in machine learning benchmark performance

03/10/2023
by Kajsa Møllersen, et al.

Machine learning methods are commonly evaluated and compared by their performance on data sets from public repositories. This allows multiple methods, oftentimes several thousand, to be evaluated under identical conditions and across time. The highest-ranked performance on a problem is referred to as state-of-the-art (SOTA) performance, and is used, among other things, as a reference point for the publication of new methods. Using the highest-ranked performance as an estimate of SOTA is biased, giving overly optimistic results. The mechanisms at play are those of multiplicity, a topic that is well studied in the context of multiple comparisons and multiple testing, but which has, as far as the authors are aware, been nearly absent from the discussion of SOTA estimates. The overly optimistic SOTA estimate is used as a standard for evaluating new methods, and methods with substantially inferior results are easily overlooked. In this article, we provide a probability distribution for the case of multiple classifiers, so that known analysis methods can be applied and a better SOTA estimate can be obtained. We demonstrate the impact of multiplicity through a simulated example with independent classifiers. We show how classifier dependency affects the variance, but also that this impact is limited when the accuracy is high. Finally, we discuss a real-world example: a Kaggle competition from 2020.
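The multiplicity effect described in the abstract can be illustrated with a minimal simulation, independent of the paper's actual setup. The sketch below assumes m independent classifiers that all share the same true accuracy p, each evaluated on the same n-sample test set, so each observed accuracy is Binomial(n, p)/n; the "SOTA" estimate is then the maximum observed accuracy. The parameter values (m = 1000, n = 10000, p = 0.9) are illustrative choices, not taken from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_sota_bias(m=1000, n=10_000, p=0.9, reps=200):
    """Average 'SOTA' estimate (best observed accuracy) over reps repetitions.

    Each of m independent classifiers has true accuracy p; its observed
    accuracy on an n-sample test set is Binomial(n, p) / n.
    """
    acc = rng.binomial(n, p, size=(reps, m)) / n   # observed accuracies
    return acc.max(axis=1).mean()                  # mean of the per-rep maximum

avg_sota = simulate_sota_bias()
# avg_sota exceeds the true accuracy p = 0.9: taking the maximum over many
# methods yields an optimistically biased SOTA estimate.
```

Even though every classifier is equally good, the best observed accuracy systematically overestimates the true accuracy, which is the bias the article sets out to correct.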


