Training Dynamic based data filtering may not work for NLP datasets

09/19/2021
by   Arka Talukdar, et al.

The recent increase in dataset size has brought about significant advances in natural language understanding. These large datasets are usually collected through automation (search engines or web crawlers) or crowdsourcing, which inherently introduces incorrectly labeled data. Training on such datasets leads to memorization and poor generalization. It is therefore important to develop techniques that identify and isolate mislabeled data. In this paper, we study the applicability of the Area Under the Margin (AUM) metric to identify and remove or rectify mislabeled examples in NLP datasets. We find that mislabeled samples can be filtered using the AUM metric in NLP datasets, but doing so also removes a significant number of correctly labeled points and discards a large amount of relevant language information. We show that the models rely on distributional information rather than on syntactic and semantic representations.
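For readers unfamiliar with the metric, the sketch below shows one common way to compute AUM from logits recorded during training, following the original formulation by Pleiss et al. (2020): a sample's margin at each epoch is the logit of its assigned label minus the largest logit among the other classes, and AUM averages this margin over epochs. The function name, array shapes, and variable names here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def area_under_margin(logits_per_epoch: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Compute the Area Under the Margin (AUM) score for each training sample.

    logits_per_epoch: shape (epochs, num_samples, num_classes), pre-softmax
        logits recorded for every sample at each training epoch.
    labels: shape (num_samples,), the (possibly noisy) assigned labels.
    Returns: shape (num_samples,), the AUM score per sample; strongly
        negative scores suggest mislabeled examples.
    """
    epochs, num_samples, _ = logits_per_epoch.shape
    margins = np.empty((epochs, num_samples))
    for t in range(epochs):
        logits = logits_per_epoch[t]
        # Logit of the assigned (possibly incorrect) label.
        assigned = logits[np.arange(num_samples), labels]
        # Mask the assigned class, then take the largest remaining logit.
        masked = logits.copy()
        masked[np.arange(num_samples), labels] = -np.inf
        largest_other = masked.max(axis=1)
        margins[t] = assigned - largest_other
    # AUM is the per-sample margin averaged over training epochs.
    return margins.mean(axis=0)
```

In a filtering pipeline, the samples with the lowest AUM scores would then be flagged for removal or relabeling; it is the downstream effect of this filtering step on NLP datasets that the paper examines.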

