Vote'n'Rank: Revision of Benchmarking with Social Choice Theory

10/11/2022
by Mark Rofin, et al.

The development of state-of-the-art systems in different applied areas of machine learning (ML) is driven by benchmarks, which have shaped the paradigm of evaluating generalisation capabilities from multiple perspectives. Although the paradigm is shifting towards more fine-grained evaluation across diverse tasks, the delicate question of how to aggregate per-task performances has attracted particular interest in the community. In general, benchmarks follow unspoken utilitarian principles, ranking systems by the mean average of their task-specific metric scores. Such an aggregation procedure has been viewed as a sub-optimal evaluation protocol that may have created an illusion of progress. This paper proposes Vote'n'Rank, a framework for ranking systems in multi-task benchmarks under the principles of social choice theory. We demonstrate that our approach can be efficiently utilised to draw new insights on benchmarking in several ML sub-fields and to identify the best-performing systems in research and development case studies. Vote'n'Rank's procedures are more robust than the mean average, can handle missing performance scores, and can determine the conditions under which a system becomes the winner.
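To illustrate the contrast the abstract draws, the sketch below ranks systems by mean average and by a Borda count, one classic social choice rule in which each task acts as a voter. This is a minimal, self-contained illustration with made-up scores; whether Borda is among Vote'n'Rank's specific procedures is an assumption, not a claim from the paper.

```python
# Minimal sketch: mean-score ranking vs. a Borda-style rank aggregation.
# The score table is illustrative, not taken from any real benchmark.

def mean_ranking(scores):
    """Rank systems by their mean score across tasks (higher is better)."""
    means = {s: sum(v) / len(v) for s, v in scores.items()}
    return sorted(means, key=means.get, reverse=True)

def borda_ranking(scores):
    """Each task 'votes' by ordering systems by its score; a system in
    position p (0-indexed) earns n-1-p points. Ties fall back to sort order."""
    systems = list(scores)
    n_tasks = len(next(iter(scores.values())))
    points = {s: 0 for s in systems}
    for t in range(n_tasks):
        ordered = sorted(systems, key=lambda s: scores[s][t], reverse=True)
        for pos, s in enumerate(ordered):
            points[s] += len(systems) - 1 - pos
    return sorted(points, key=points.get, reverse=True)

scores = {
    "A": [0.99, 0.40, 0.40],  # one spectacular task, weak elsewhere
    "B": [0.60, 0.55, 0.50],  # solid across the board
    "C": [0.50, 0.45, 0.38],
}
print(mean_ranking(scores))   # ['A', 'B', 'C'] — A's outlier dominates the mean
print(borda_ranking(scores))  # ['B', 'A', 'C'] — B wins two of three task "elections"
```

The example shows how a single outlier score can hand a system the mean-average crown, while a social choice rule that only uses per-task rankings rewards consistent performance, which is one motivation for moving beyond the mean.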


research
02/08/2022

What are the best systems? New perspectives on NLP Benchmarking

In Machine Learning, a benchmark refers to an ensemble of datasets assoc...
research
05/17/2023

Towards More Robust NLP System Evaluation: Handling Missing Scores in Benchmarks

The evaluation of natural language processing (NLP) systems is crucial f...
research
02/19/2020

MLModelScope: A Distributed Platform for Model Evaluation and Benchmarking at Scale

Machine Learning (ML) and Deep Learning (DL) innovations are being intro...
research
07/14/2021

The Benchmark Lottery

The world of empirical machine learning (ML) strongly relies on benchmar...
research
10/16/2014

Multivariate Spearman's rho for aggregating ranks using copulas

We study the problem of rank aggregation: given a set of ranked lists, w...
research
10/20/2021

Better than Average: Paired Evaluation of NLP Systems

Evaluation in NLP is usually done by comparing the scores of competing s...
research
12/04/2021

BenchML: an extensible pipelining framework for benchmarking representations of materials and molecules at scale

We introduce a machine-learning (ML) framework for high-throughput bench...
