DQI: A Guide to Benchmark Evaluation

08/10/2020
by Swaroop Mishra, et al.

A 'state of the art' model A surpasses humans on a benchmark B, but fails on similar benchmarks C, D, and E. What does B have that the other benchmarks do not? Recent research provides the answer: spurious bias. However, developing A to solve benchmarks B through E does not guarantee that it will solve future benchmarks. To progress towards a model that 'truly learns' an underlying task, we need to quantify the differences between successive benchmarks, as opposed to existing binary and black-box approaches. We propose a novel approach to this underexplored task of quantifying benchmark quality by introducing a data quality metric: DQI.
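The abstract does not spell out how DQI is computed, so the snippet below is only an illustrative sketch, not the paper's formulation: it measures one simple signal of spurious bias that a benchmark quality metric might capture, namely how strongly individual tokens leak the gold label relative to the label prior. The function name lexical_cue_bias and the frequency threshold are hypothetical choices for this sketch.

```python
# Hypothetical sketch of one spurious-bias signal (not the paper's DQI):
# how far P(label | token) for any single token deviates from P(label).
from collections import Counter, defaultdict

def lexical_cue_bias(examples, min_count=5):
    """examples: list of (text, label) pairs.
    Returns the strongest single-token label cue found in the data as
    (token, max deviation of P(label | token) from the label prior)."""
    total = len(examples)
    prior = {l: c / total for l, c in Counter(lbl for _, lbl in examples).items()}

    token_label = defaultdict(Counter)          # token -> Counter over labels
    for text, label in examples:
        for tok in set(text.lower().split()):
            token_label[tok][label] += 1

    worst_tok, worst_dev = None, 0.0
    for tok, counts in token_label.items():
        n = sum(counts.values())
        if n < min_count:                       # skip rare tokens
            continue
        for label, c in counts.items():
            dev = abs(c / n - prior[label])
            if dev > worst_dev:
                worst_tok, worst_dev = tok, dev
    return worst_tok, worst_dev

if __name__ == "__main__":
    data = [("the movie was great fun", "pos")] * 6 + \
           [("the movie was not good", "neg")] * 6
    # A token such as "not" perfectly predicts the label here (deviation 0.5),
    # i.e. the toy benchmark is solvable via a spurious lexical cue.
    print(lexical_cue_bias(data))
```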
