
DQI: A Guide to Benchmark Evaluation

by   Swaroop Mishra, et al.

A "state-of-the-art" model A surpasses humans on a benchmark B, but fails on similar benchmarks C, D, and E. What does B have that the other benchmarks do not? Recent research provides the answer: spurious bias. However, developing A to solve benchmarks B through E does not guarantee that it will solve future benchmarks. To progress towards a model that "truly learns" an underlying task, we need to quantify the differences between successive benchmarks, as opposed to relying on existing binary, black-box approaches. We propose a novel approach to this underexplored task of quantifying benchmark quality by introducing a data quality metric: DQI.
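To make the idea of "spurious bias" concrete, here is a minimal sketch of one way a dataset-level bias signal could be computed: flag tokens whose mere presence strongly predicts a single label. This is an illustrative toy proxy only, not the DQI formulation from the paper; the function name, thresholds, and example data are all assumptions for demonstration.

```python
from collections import Counter, defaultdict

def biased_tokens(examples, min_count=2, purity_threshold=0.9):
    """Flag tokens whose presence strongly predicts one label.

    A crude proxy for spurious lexical bias -- NOT the actual DQI metric.
    `examples` is a list of (text, label) pairs.
    """
    token_label = defaultdict(Counter)
    for text, label in examples:
        for tok in set(text.lower().split()):  # count each token once per example
            token_label[tok][label] += 1
    flagged = {}
    for tok, counts in token_label.items():
        total = sum(counts.values())
        if total < min_count:
            continue  # too rare to judge
        label, n = counts.most_common(1)[0]
        if n / total >= purity_threshold:
            flagged[tok] = (label, n / total)
    return flagged

# Tiny synthetic sentiment set where "not" is a label artifact:
data = [
    ("not a good film", "neg"),
    ("not worth watching", "neg"),
    ("a wonderful film", "pos"),
    ("worth watching twice", "pos"),
]
print(biased_tokens(data))  # {'not': ('neg', 1.0)}
```

A model can exploit such a token ("not" here) to score well on this benchmark without learning sentiment at all, which is exactly the failure mode that motivates quantifying benchmark quality.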



