Is this model reliable for everyone? Testing for strong calibration

07/28/2023
by Jean Feng, et al.

In a well-calibrated risk prediction model, the average predicted probability is close to the true event rate for any given subgroup. Such models are reliable across heterogeneous populations and satisfy strong notions of algorithmic fairness. However, auditing a model for strong calibration is well known to be difficult, particularly for machine learning (ML) algorithms, because of the sheer number of potential subgroups. Common practice is therefore to assess calibration only with respect to a few predefined subgroups. Recent developments in goodness-of-fit testing offer potential solutions but are not designed for settings with weak signal or where the poorly calibrated subgroup is small, as they either overly subdivide the data or fail to divide the data at all. We introduce a new testing procedure based on the following insight: if a poorly calibrated subgroup exists and we reorder observations by their expected residuals, the association between the predicted and observed residuals should change along this sequence. This lets us reframe calibration testing as a changepoint detection problem, for which powerful methods already exist. We begin by introducing a sample-splitting procedure in which a portion of the data is used to train a suite of candidate models for predicting the residual, and the remaining data are used to perform a score-based cumulative sum (CUSUM) test. To further improve power, we then extend this adaptive CUSUM test to incorporate cross-validation while maintaining Type I error control under minimal assumptions. Compared to existing methods, the proposed procedure consistently achieved higher power in simulation studies and more than doubled the power when auditing a mortality risk prediction model.
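To make the idea concrete, the sketch below illustrates the sample-splitting step in a heavily simplified form. It is not the authors' exact procedure: it fits a single gradient-boosted regressor as the candidate residual model (the paper trains a suite of candidates and adds a cross-validated extension) and approximates the null distribution of the CUSUM statistic by permutation rather than the score-based approach described in the abstract. The function and parameter names (audit_calibration, n_perm) are illustrative only.

```python
# Illustrative sketch of a sample-splitting, CUSUM-style calibration audit.
# Assumes X (covariates), y (binary outcomes), and p_hat (the model's
# predicted risks) are available for an independent audit dataset.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split


def cusum_statistic(residuals):
    """Max absolute standardized partial sum of residuals along an ordering.

    Centering the residuals focuses the statistic on a change in mean
    along the sequence rather than on overall miscalibration.
    """
    n = len(residuals)
    centered = residuals - residuals.mean()
    partial = np.cumsum(centered)
    return np.max(np.abs(partial)) / (centered.std() * np.sqrt(n))


def audit_calibration(X, y, p_hat, n_perm=1000, seed=0):
    """Test whether predictions p_hat look strongly calibrated.

    Half the audit data trains a model for the expected residual; the
    other half is ordered by that model's predictions and a CUSUM test
    is run along the ordering, with a permutation-based null.
    """
    rng = np.random.default_rng(seed)
    resid = y - p_hat  # observed minus predicted risk

    X_tr, X_te, r_tr, r_te = train_test_split(
        X, resid, test_size=0.5, random_state=seed
    )

    # Candidate model for the expected residual; any learner could be used.
    g = GradientBoostingRegressor(max_depth=2).fit(X_tr, r_tr)

    # Order held-out observations by predicted residual, then compute CUSUM.
    order = np.argsort(g.predict(X_te))
    observed = cusum_statistic(r_te[order])

    # Permutation null: under good calibration the learned ordering carries
    # no signal, so shuffling the held-out residuals approximates the null.
    null = np.array([
        cusum_statistic(rng.permutation(r_te)) for _ in range(n_perm)
    ])
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, p_value
```

In this sketch, a small p-value suggests that residuals shift somewhere along the learned ordering, i.e. that a poorly calibrated subgroup may exist; a suite of candidate orderings and cross-validation, as in the paper, would be needed to recover its full power.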


Related research

01/30/2020 · Assessing the Calibration of Subdistribution Hazard Models in Discrete Time
The generalization performance of a risk prediction model can be evaluat...

07/19/2023 · Non-parametric inference on calibration of predicted risks
Moderate calibration, the expected event probability among observations ...

02/29/2020 · Model-based ROC (mROC) curve: examining the effect of case-mix and model calibration on the ROC plot
The performance of a risk prediction model is often characterized in ter...

06/02/2020 · Local Interpretability of Calibrated Prediction Models: A Case of Type 2 Diabetes Mellitus Screening Test
Machine Learning (ML) models are often complex and difficult to interpre...

10/05/2022 · The Calibration Generalization Gap
Calibration is a fundamental property of a good predictive model: it req...

08/19/2022 · Improving knockoffs with conditional calibration
The knockoff filter of Barber and Candes (arXiv:1404.5609) is a flexible...

05/10/2022 · Preliminary assessment of a cost-effective headphone calibration procedure for soundscape evaluations
The introduction of ISO 12913-2:2018 has provided a framework for standa...
