Detecting Label Errors using Pre-Trained Language Models

05/25/2022
by Derek Chong, et al.

We show that large pre-trained language models are extremely capable of identifying label errors in datasets: simply verifying data points in descending order of out-of-distribution loss significantly outperforms more complex mechanisms for detecting label errors in natural language datasets. We contribute a novel method for producing highly realistic, human-originated label noise from crowdsourced data, and demonstrate the effectiveness of this method on TweetNLP, providing an otherwise difficult-to-obtain measure of realistic recall.
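The core idea described above, ranking examples by loss under a model that did not train on them and reviewing from the top, can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: it substitutes a simple logistic-regression classifier with out-of-fold predictions for the pre-trained language model, and the function name `rank_by_ood_loss` and the toy dataset are assumptions for demonstration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def rank_by_ood_loss(X, y, n_folds=5, seed=0):
    """Return example indices sorted by descending out-of-fold cross-entropy.

    Out-of-fold probabilities approximate "out-of-distribution" loss: each
    example is scored by a model that never saw it during training.
    """
    clf = LogisticRegression(max_iter=1000, random_state=seed)
    proba = cross_val_predict(clf, X, y, cv=n_folds, method="predict_proba")
    # Per-example cross-entropy of the observed (possibly wrong) label.
    losses = -np.log(proba[np.arange(len(y)), y] + 1e-12)
    # Highest-loss examples are the most likely label errors.
    return np.argsort(-losses), losses

# Toy demo: inject one label error and check it surfaces near the top.
X, y = make_classification(n_samples=200, n_features=10, flip_y=0.0,
                           random_state=0)
y_noisy = y.copy()
y_noisy[0] = 1 - y_noisy[0]  # deliberately mislabel example 0
order, losses = rank_by_ood_loss(X, y_noisy)
```

Verifying examples in the order given by `order` concentrates reviewer effort on the points the model finds most surprising, which is the simple mechanism the abstract reports as outperforming more complex detectors.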
