Can We Achieve More with Less? Exploring Data Augmentation for Toxic Comment Classification

by Chetanya Rastogi, et al.

This paper tackles one of the greatest limitations in machine learning: data scarcity. Specifically, we explore whether high-accuracy classifiers can be built from small datasets by combining data augmentation techniques with machine learning algorithms. We experiment with two augmentation techniques, Easy Data Augmentation (EDA) and backtranslation, and with three popular learning algorithms: Logistic Regression, Support Vector Machine (SVM), and Bidirectional Long Short-Term Memory network (Bi-LSTM). We use the Wikipedia Toxic Comments dataset so that, while exploring the benefits of data augmentation, we can also develop a model that detects and classifies toxic speech in comments, to help fight cyberbullying and online harassment. Ultimately, we find that data augmentation techniques can significantly boost classifier performance and are an excellent strategy for combating a lack of data in NLP problems.
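To make the augmentation idea concrete, here is a minimal sketch of two of the four word-level edits that EDA is generally described as combining (synonym replacement, random insertion, random swap, and random deletion). Only random swap and random deletion are shown because they need no external thesaurus. The function names, default parameters, and the alternating schedule in `augment` are assumptions made for illustration; this is not the authors' implementation.

```python
import random


def random_swap(words, n_swaps, rng):
    """Return a copy of `words` with two random positions swapped, n_swaps times."""
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def random_deletion(words, p, rng):
    """Drop each word independently with probability p, always keeping at least one."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]


def augment(sentence, n_copies=4, p_delete=0.1, n_swaps=1, seed=0):
    """Produce n_copies noisy variants of a sentence (hypothetical helper).

    Even-numbered copies use random swap, odd-numbered copies use random
    deletion. Each variant is meant to keep the label of the original.
    """
    rng = random.Random(seed)
    words = sentence.split()
    variants = []
    for k in range(n_copies):
        if k % 2 == 0:
            new_words = random_swap(words, n_swaps, rng)
        else:
            new_words = random_deletion(words, p_delete, rng)
        variants.append(" ".join(new_words))
    return variants


if __name__ == "__main__":
    for variant in augment("this comment is rude and completely uncalled for"):
        print(variant)
```

Each variant would be added to the training set with the same label as its source comment, which is how a small labeled dataset is enlarged before a classifier is fit.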

