How Robust are Model Rankings: A Leaderboard Customization Approach for Equitable Evaluation

06/10/2021
by Swaroop Mishra, et al.

Models that top leaderboards often perform unsatisfactorily when deployed in real-world applications; this has necessitated rigorous and expensive pre-deployment model testing. A hitherto unexplored facet of model performance is: are our leaderboards doing equitable evaluation? In this paper, we introduce a task-agnostic method to probe leaderboards by weighting samples based on their 'difficulty' level. We find that leaderboards can be adversarially attacked and that top-performing models may not always be the best models. We subsequently propose alternative evaluation metrics. Our experiments on 10 models show changes in model ranking and an overall reduction in previously reported performance, thus correcting the overestimation of AI systems' capabilities. Inspired by behavioral testing principles, we further develop a prototype of a visual analytics tool that enables leaderboard revamping through customization, based on an end user's focus area. This helps users analyze models' strengths and weaknesses, and guides them in selecting the model best suited to their application scenario. In a user study, members of various commercial product development teams, covering 5 focus areas, find that our prototype reduces pre-deployment development and testing effort by 41
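The idea of weighting samples by difficulty can be illustrated with a minimal Python sketch. It is not the paper's metric: the abstract does not give the weighting formula, so the functions `weighted_accuracy` and `rank_models`, the difficulty scores, and the two toy models below are all hypothetical. The sketch only shows how a ranking based on plain accuracy can flip once harder samples count for more.

```python
from typing import Dict, List


def weighted_accuracy(correct: List[bool], difficulty: List[float]) -> float:
    """Accuracy in which each sample counts in proportion to its difficulty weight.

    Assumption: weights are non-negative and act as simple multipliers.
    """
    if len(correct) != len(difficulty):
        raise ValueError("correct and difficulty must have the same length")
    total = sum(difficulty)
    if total <= 0:
        raise ValueError("difficulty weights must sum to a positive value")
    earned = sum(w for is_right, w in zip(correct, difficulty) if is_right)
    return earned / total


def rank_models(results: Dict[str, List[bool]], difficulty: List[float]) -> List[str]:
    """Return model names ordered from best to worst weighted accuracy."""
    return sorted(
        results,
        key=lambda name: weighted_accuracy(results[name], difficulty),
        reverse=True,
    )


# Hypothetical data: three easy samples followed by two hard ones.
difficulty = [0.1, 0.1, 0.1, 0.9, 0.9]
uniform = [1.0] * len(difficulty)  # uniform weights reproduce plain accuracy

results = {
    "model_a": [True, True, True, True, False],   # strong on easy samples
    "model_b": [False, False, True, True, True],  # strong on hard samples
}

plain_ranking = rank_models(results, uniform)
weighted_ranking = rank_models(results, difficulty)

print("plain:   ", plain_ranking)
print("weighted:", weighted_ranking)
```

With uniform weights `model_a` ranks first (4 of 5 correct against 3 of 5), while under the toy difficulty weights `model_b` overtakes it, and `model_a`'s score falls below its plain accuracy. How difficulty is actually estimated is left open here.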


