Empirical Optimal Risk to Quantify Model Trustworthiness for Failure Detection

08/06/2023
by Shuang Ao, et al.

Failure detection (FD) in AI systems is a crucial safeguard for deployment in safety-critical tasks. The standard way to evaluate FD performance is the risk-coverage (RC) curve, which reveals the trade-off between the data coverage rate and the performance on the accepted data. One common way to quantify the RC curve is to calculate the area under it. However, this metric does not indicate how well suited a method is for FD, or what the optimal coverage rate should be. Since FD aims to achieve higher performance while discarding less data, evaluating at a partial coverage that excludes only the most uncertain samples is more intuitive and meaningful than evaluating at full coverage. Moreover, there is an optimal coverage point at which the model could, in theory, achieve ideal performance. We propose the Excess Area Under the Optimal RC Curve (E-AUoptRC), which measures the area over the coverage range from this optimal point to full coverage. Further, the model performance at the optimal point reflects both the model's learning ability and its calibration; we propose it as the Trust Index (TI), an evaluation metric complementary to overall model accuracy. We report extensive experiments on three benchmark image datasets with ten variants of transformer and CNN models. Our results show that the proposed metrics better reflect model trustworthiness than existing evaluation metrics. We further observe that a model with high overall accuracy does not always yield a high TI, which underlines the need for the proposed Trust Index as a complement to overall model accuracy. The code is available at <https://github.com/AoShuang92/optimal_risk>.
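
The abstract does not spell out the exact formulas, so the following is a minimal NumPy sketch under stated assumptions: the RC curve and AURC follow their standard selective-prediction definitions, and the optimal coverage point is taken to be the model's overall accuracy, i.e. the coverage at which an ideal failure detector would have rejected exactly the misclassified samples. The names `rc_curve`, `e_au_opt_rc`, and `trust_index` are illustrative, not the repository's API.

```python
import numpy as np

def rc_curve(confidence, correct):
    """Risk-coverage curve: sort samples by descending confidence and,
    at each coverage level, compute the risk (error rate) over the
    retained, most-confident samples."""
    order = np.argsort(-confidence)
    errors = 1.0 - correct[order].astype(float)
    n = len(errors)
    coverage = np.arange(1, n + 1) / n
    risk = np.cumsum(errors) / np.arange(1, n + 1)
    return coverage, risk

def aurc(coverage, risk):
    """Standard area under the full RC curve."""
    return np.trapz(risk, coverage)

def e_au_opt_rc(coverage, risk, c_opt):
    """Sketch of E-AUoptRC: area under the RC curve restricted to
    coverage in [c_opt, 1], where c_opt is the assumed optimal
    coverage point."""
    mask = coverage >= c_opt
    return np.trapz(risk[mask], coverage[mask])

def trust_index(coverage, risk, c_opt):
    """Sketch of the Trust Index: selective accuracy (1 - risk) at
    the assumed optimal coverage point."""
    idx = min(np.searchsorted(coverage, c_opt), len(risk) - 1)
    return 1.0 - risk[idx]

# Toy usage with synthetic scores (illustration only).
rng = np.random.default_rng(0)
confidence = rng.random(1000)
correct = rng.random(1000) < 0.85        # a hypothetical ~85%-accurate model
coverage, risk = rc_curve(confidence, correct)
c_opt = correct.mean()                   # assumed optimal coverage = accuracy
print(f"AURC={aurc(coverage, risk):.4f}, "
      f"E-AUoptRC={e_au_opt_rc(coverage, risk, c_opt):.4f}, "
      f"TI={trust_index(coverage, risk, c_opt):.4f}")
```

Under this assumption, a well-calibrated model concentrates its errors among its least-confident samples, so the residual risk between `c_opt` and full coverage, and hence E-AUoptRC, shrinks as calibration improves.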

