Model Rectification via Unknown Unknowns Extraction from Deployment Samples

02/08/2021
by   Bruno Abrahao, et al.

Model deficiency that results from incomplete training data is a form of structural blindness that leads to costly errors, often made with high confidence. When training a classifier, underrepresented class-conditional distributions that a given hypothesis space can recognize result in a mismatch between the model and the target space. To mitigate the consequences of this discrepancy, we propose Random Test Sampling and Cross-Validation (RTSCV), a general algorithmic framework that performs post-training model rectification at deployment time in a supervised way. RTSCV extracts unknown unknowns (u.u.s), i.e., examples from the class-conditional distributions that a classifier is oblivious to, and works in combination with a diverse family of modern prediction models. RTSCV augments the training set with a sample of the test set (or deployment data) and uses this redefined class layout to discover u.u.s via cross-validation, without relying on active learning or budgeted queries to an oracle. We contribute a theoretical analysis that establishes performance guarantees based on the design bases of modern classifiers. Our experimental evaluation on 7 benchmark tabular and computer vision datasets demonstrates RTSCV's effectiveness by reducing a performance gap as large as 41 pre-rectification models. Finally, we show that RTSCV consistently outperforms state-of-the-art approaches.
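
The augmentation-plus-cross-validation idea described in the abstract can be illustrated with a minimal sketch. This is only one reading of the abstract, not the authors' implementation: the nearest-centroid classifier, the "unknown" label, the function names (`fit_centroids`, `predict`, `rtscv_sketch`, `blob`), and the toy data are all assumptions made for illustration. The deployment sample is appended to the training set as one extra class, and any deployment point that is still assigned to that extra class when held out is flagged as a candidate u.u.

```python
from statistics import mean


def fit_centroids(X, y):
    """Toy stand-in classifier: one mean vector per label."""
    groups = {}
    for x, label in zip(X, y):
        groups.setdefault(label, []).append(x)
    return {label: tuple(mean(col) for col in zip(*pts))
            for label, pts in groups.items()}


def predict(centroids, x):
    """Return the label whose centroid is closest to x (squared distance)."""
    return min(centroids,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))


def rtscv_sketch(X_train, y_train, X_deploy, n_folds=5):
    """Flag deployment points that cross-validation assigns to the extra class."""
    extra = "unknown"  # hypothetical label for the sampled deployment data
    X = list(X_train) + list(X_deploy)
    y = list(y_train) + [extra] * len(X_deploy)
    idx = list(range(len(X)))
    # Interleaved folds keep the sketch deterministic; a real run would
    # typically shuffle or stratify instead.
    folds = [idx[k::n_folds] for k in range(n_folds)]
    candidates = []
    for held_out in folds:
        held = set(held_out)
        fit_idx = [i for i in idx if i not in held]
        model = fit_centroids([X[i] for i in fit_idx], [y[i] for i in fit_idx])
        for i in held_out:
            # Only deployment points are candidates for unknown unknowns.
            if i >= len(X_train) and predict(model, X[i]) == extra:
                candidates.append(i - len(X_train))
    return sorted(candidates)


def blob(cx, cy, n):
    """Deterministic toy cluster of n points near (cx, cy)."""
    return [(cx + 0.1 * (t % 3), cy + 0.1 * (t % 2)) for t in range(n)]


# Two known classes in training; deployment adds a third, unseen cluster.
X_train = blob(0, 0, 20) + blob(5, 5, 20)
y_train = [0] * 20 + [1] * 20
X_deploy = blob(0, 0, 10) + blob(5, 5, 10) + blob(10, 0, 10)

found = rtscv_sketch(X_train, y_train, X_deploy)
print(found)  # indices into X_deploy flagged as candidate u.u.s
```

In this toy setup, deployment points that resemble a known class are pulled back to that class during cross-validation, so only the points from the unseen cluster remain assigned to the extra class. Any classifier with a fit/predict interface could replace the nearest-centroid stand-in.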

research
06/24/2023

Cross-Validation Is All You Need: A Statistical Approach To Label Noise Estimation

Label noise is prevalent in machine learning datasets. It is crucial to ...
research
12/11/2002

Theoretical Analyses of Cross-Validation Error and Voting in Instance-Based Learning

This paper begins with a general theory of error in cross-validation tes...
research
03/14/2018

How to evaluate sentiment classifiers for Twitter time-ordered data?

Social media are becoming an increasingly important source of informatio...
research
12/11/2017

Identifying the Mislabeled Training Samples of ECG Signals using Machine Learning

The classification accuracy of electrocardiogram signal is often affecte...
research
07/13/2018

Bridging the Gap Between Layout Pattern Sampling and Hotspot Detection via Batch Active Learning

Layout hotspot detection is one of the main steps in modern VLSI design. ...
research
01/09/2021

SARS-Cov-2 RNA Sequence Classification Based on Territory Information

COVID-19 genetics analysis is critical to determine virus type, virus var...
research
09/28/2021

When in Doubt: Improving Classification Performance with Alternating Normalization

We introduce Classification with Alternating Normalization (CAN), a non-...
