Training and Prediction Data Discrepancies: Challenges of Text Classification with Noisy, Historical Data

09/11/2018
by Emilia Apostolova, et al.

Industry datasets used for text classification are rarely created for that purpose. In most cases, the data and target predictions are a by-product of accumulated historical data, typically fraught with noise in both the text-based documents and the target labels. In this work, we address the question of how well performance metrics computed on noisy, historical data reflect performance on the intended future machine learning model input. The results demonstrate the utility of dirty training datasets used to build prediction models for cleaner (and different) prediction inputs.

research
12/07/2020

Leveraging Automated Machine Learning for Text Classification: Evaluation of AutoML Tools and Comparison with Human Performance

Recently, Automated Machine Learning (AutoML) has registered increasing ...
research
12/09/2020

Label Confusion Learning to Enhance Text Classification Models

Representing a true label as a one-hot vector is a common practice in tr...
research
01/27/2021

Towards Robustness to Label Noise in Text Classification via Noise Modeling

Large datasets in NLP suffer from noisy labels, due to erroneous automat...
research
04/20/2022

Is BERT Robust to Label Noise? A Study on Learning with Noisy Labels in Text Classification

Incorrect labels in training data occur when human annotators make mista...
research
09/29/2020

Natcat: Weakly Supervised Text Classification with Naturally Annotated Datasets

We seek to improve text classification by leveraging naturally annotated...
research
10/25/2017

Re-evaluating the need for Modelling Term-Dependence in Text Classification Problems

A substantial amount of research has been carried out in developing mach...
research
01/31/2020

Benchmarking Popular Classification Models' Robustness to Random and Targeted Corruptions

Text classification models, especially neural networks based models, hav...
