Detecting Label Errors using Pre-Trained Language Models

05/25/2022
by   Derek Chong, et al.

We show that large pre-trained language models are extremely capable of identifying label errors in datasets: simply verifying data points in descending order of out-of-distribution loss significantly outperforms more complex mechanisms for detecting label errors on natural language datasets. We contribute a novel method to produce highly realistic, human-originated label noise from crowdsourced data, and demonstrate the effectiveness of this method on TweetNLP, providing an otherwise difficult-to-obtain measure of realistic recall.
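A minimal sketch of the ranking idea described in the abstract, assuming hypothetical `train_model` and `per_example_loss` helpers (not the authors' released code): each example's loss is computed under a model that never saw it during training, via cross-validation, and examples are then reviewed in descending order of that held-out ("out-of-distribution") loss.

```python
import numpy as np
from sklearn.model_selection import KFold

def rank_by_out_of_fold_loss(texts, labels, train_model, per_example_loss,
                             n_splits=5, seed=0):
    """Return example indices sorted from highest to lowest held-out loss.

    `train_model(texts, labels)` is assumed to fine-tune a fresh copy of a
    pre-trained language model; `per_example_loss(model, texts, labels)` is
    assumed to return one cross-entropy loss per example. Both are
    hypothetical interfaces for illustration only.
    """
    texts = np.asarray(texts, dtype=object)
    labels = np.asarray(labels)
    losses = np.empty(len(texts), dtype=float)

    for train_idx, heldout_idx in KFold(
        n_splits=n_splits, shuffle=True, random_state=seed
    ).split(texts):
        # Fine-tune only on the in-fold data.
        model = train_model(texts[train_idx], labels[train_idx])
        # Loss on examples the model never trained on; high loss flags
        # candidate label errors.
        losses[heldout_idx] = per_example_loss(
            model, texts[heldout_idx], labels[heldout_idx]
        )

    return np.argsort(-losses), losses
```

In practice, the top-ranked examples from this ordering would be sent to human annotators for verification, so annotation effort concentrates on the points most likely to be mislabeled.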


