Detecting Label Errors using Pre-Trained Language Models

05/25/2022
by Derek Chong, et al.

We show that large pre-trained language models are extremely capable of identifying label errors in datasets: simply verifying data points in descending order of out-of-distribution loss significantly outperforms more complex mechanisms for detecting label errors in natural language datasets. We contribute a novel method for producing highly realistic, human-originated label noise from crowdsourced data, and demonstrate the effectiveness of this method on TweetNLP, providing an otherwise difficult-to-obtain measure of realistic recall.
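The core idea described above, ranking examples by loss under a model that did not train on them and reviewing from the top, can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: it substitutes a simple logistic-regression classifier with out-of-fold predictions for the pre-trained language model, and the function name `rank_by_ood_loss` and the toy dataset are assumptions for demonstration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def rank_by_ood_loss(X, y, n_folds=5, seed=0):
    """Return example indices sorted by descending out-of-fold cross-entropy.

    Out-of-fold probabilities approximate "out-of-distribution" loss: each
    example is scored by a model that never saw it during training.
    """
    clf = LogisticRegression(max_iter=1000, random_state=seed)
    proba = cross_val_predict(clf, X, y, cv=n_folds, method="predict_proba")
    # Per-example cross-entropy of the observed (possibly wrong) label.
    losses = -np.log(proba[np.arange(len(y)), y] + 1e-12)
    # Highest-loss examples are the most likely label errors.
    return np.argsort(-losses), losses

# Toy demo: inject one label error and check it surfaces near the top.
X, y = make_classification(n_samples=200, n_features=10, flip_y=0.0,
                           random_state=0)
y_noisy = y.copy()
y_noisy[0] = 1 - y_noisy[0]  # deliberately mislabel example 0
order, losses = rank_by_ood_loss(X, y_noisy)
```

Verifying examples in the order given by `order` concentrates reviewer effort on the points the model finds most surprising, which is the simple mechanism the abstract reports as outperforming more complex detectors.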
