Identifying and Benchmarking Natural Out-of-Context Prediction Problems

10/25/2021
by David Madras, et al.

Deep learning systems frequently fail at out-of-context (OOC) prediction, the problem of making reliable predictions on uncommon or unusual inputs or subgroups of the training distribution. To study this problem, a number of benchmarks for measuring OOC performance have recently been introduced. In this work, we introduce a framework unifying the literature on OOC performance measurement, and demonstrate how rich auxiliary information can be leveraged to identify candidate sets of OOC examples in existing datasets. We present NOOCh: a suite of naturally-occurring "challenge sets", and show how varying notions of context can be used to probe specific OOC failure modes. Experimentally, we explore the tradeoffs between various learning approaches on these challenge sets and demonstrate how the choices made in designing OOC benchmarks can yield varying conclusions.

10/25/2021
Scientific Machine Learning Benchmarks
The breakthrough in Deep Learning neural networks has transformed the us...

08/19/2019
Deep neural network or dermatologist?
Deep learning techniques have proven high accuracy for identifying melan...

07/24/2021
Tell-Tale Tail Latencies: Pitfalls and Perils in Database Benchmarking
The performance of database systems is usually characterised by their av...

06/16/2021
Automatic Construction of Evaluation Suites for Natural Language Generation Datasets
Machine learning approaches applied to NLP are often evaluated by summar...

03/03/2023
Diagnosing Model Performance Under Distribution Shift
Prediction models can perform poorly when deployed to target distributio...

12/06/2022
Adaptive Testing of Computer Vision Models
Vision models often fail systematically on groups of data that share com...

02/09/2021
Learning State Representations from Random Deep Action-conditional Predictions
In this work, we study auxiliary prediction tasks defined by temporal-di...