Our Evaluation Metric Needs an Update to Encourage Generalization

07/14/2020
by Swaroop Mishra, et al.

Models that surpass human performance on several popular benchmarks degrade significantly when exposed to Out of Distribution (OOD) data. Recent research has shown that such models overfit to spurious biases and 'hack' datasets rather than learning generalizable features the way humans do. To curb this inflation in measured performance, and the resulting overestimation of AI systems' capabilities, we propose a simple and novel evaluation metric, the WOOD Score, that encourages generalization during evaluation.
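The abstract does not spell out how the WOOD Score is computed, so the following is only a minimal sketch of the general idea it gestures at: scoring a model on held-out OOD data alongside the in-distribution test split, with OOD performance weighted heavily so that benchmark "hacking" via spurious correlations is penalized. The function name `ood_weighted_score` and the default weight are hypothetical illustrations, not the paper's definition.

```python
# Illustrative sketch only: combines in-distribution (ID) and
# out-of-distribution (OOD) accuracy into a single score that rewards
# generalization. This is NOT the paper's WOOD Score formula.

def ood_weighted_score(id_accuracy: float, ood_accuracy: float,
                       ood_weight: float = 0.7) -> float:
    """Return a single score favoring OOD performance.

    Args:
        id_accuracy: accuracy on the benchmark's original test split, in [0, 1].
        ood_accuracy: accuracy on held-out OOD data, in [0, 1].
        ood_weight: fraction of the score attributed to OOD performance
            (hypothetical value, chosen here for illustration).
    """
    if not 0.0 <= ood_weight <= 1.0:
        raise ValueError("ood_weight must lie in [0, 1]")
    return (1.0 - ood_weight) * id_accuracy + ood_weight * ood_accuracy


# Example: a model that aces the original test split but collapses on OOD
# data scores far below its leaderboard accuracy.
print(ood_weighted_score(id_accuracy=0.95, ood_accuracy=0.55))  # 0.67
```

Under this kind of weighting, two models with identical leaderboard accuracy are separated by how much of that accuracy survives a distribution shift, which is the behavior the abstract argues an updated evaluation metric should encourage.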

