
Stop Measuring Calibration When Humans Disagree

by Joris Baan et al.

Calibration is a popular framework for evaluating whether a classifier knows when it does not know, i.e., whether its predictive probabilities are a good indication of how likely a prediction is to be correct. Correctness is commonly estimated against the human majority class. Recently, calibration to the human majority has been measured on tasks where humans inherently disagree about which class applies. We show that measuring calibration to the human majority given inherent disagreement is theoretically problematic, demonstrate this empirically on the ChaosNLI dataset, and derive several instance-level measures of calibration that capture key statistical properties of human judgements: class frequency, ranking, and entropy.
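The instance-level measures named in the abstract (class frequency, ranking, entropy) can be illustrated with a small sketch. The code below is an assumption-based illustration, not the authors' exact formulation: it builds an empirical class distribution from per-annotator votes for a single (hypothetical) NLI instance, then compares it to a hypothetical model distribution via total variation distance (class frequency), an exact rank-order check (ranking), and the absolute entropy gap (entropy).

```python
import math
from collections import Counter

def human_distribution(votes, classes):
    """Empirical class distribution from per-annotator votes for one instance."""
    counts = Counter(votes)
    n = len(votes)
    return [counts[c] / n for c in classes]

def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def rank_agreement(p, q):
    """1.0 if both distributions order the classes identically, else 0.0."""
    order = lambda d: sorted(range(len(d)), key=lambda i: -d[i])
    return 1.0 if order(p) == order(q) else 0.0

def total_variation(p, q):
    """Total variation distance between two class-frequency vectors."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical example: one NLI instance judged by 5 annotators.
classes = ["entailment", "neutral", "contradiction"]
votes = ["neutral", "neutral", "entailment", "neutral", "contradiction"]
p_human = human_distribution(votes, classes)  # [0.2, 0.6, 0.2]
p_model = [0.15, 0.7, 0.15]                   # hypothetical model probabilities

print(total_variation(p_human, p_model))           # 0.1
print(rank_agreement(p_human, p_model))            # 1.0
print(abs(entropy(p_human) - entropy(p_model)))    # entropy gap
```

Because the human label distribution here is far from one-hot (entropy well above zero), a model that is "calibrated to the majority" alone would ignore the 40% of annotators who disagree; the instance-level comparison above keeps that disagreement visible.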
