Text Data Augmentation Made Simple By Leveraging NLP Cloud APIs

12/05/2018
by Claude Coulombe, et al.

In practice, it is common to find oneself with far too little text data to train a deep neural network. This "Big Data Wall" is a challenge for minority-language communities on the Internet and for the organizations, laboratories and companies that compete with the GAFAM (Google, Amazon, Facebook, Apple, Microsoft). While most research on text data augmentation aims at the long-term goal of end-to-end learning solutions, which amounts to "using neural networks to feed neural networks", this engineering work focuses on practical, robust, scalable and easy-to-implement data augmentation pre-processing techniques, similar to those that are successful in computer vision. Several text augmentation techniques were tried. Some existing ones, such as noise injection and the use of regular expressions, were tested for comparison purposes. Others, such as lexical replacement, are modified or improved versions of existing techniques. Finally, more innovative ones, such as the generation of paraphrases by back-translation or by the transformation of syntactic trees, are based on robust, scalable and easy-to-use NLP Cloud APIs. All the text augmentation techniques studied, with an amplification factor of only 5, increased the accuracy of the results in a range of 4.3 significant statistical fluctuations, on a standardized task of text polarity prediction. Some standard deep neural network architectures were tested: the multilayer perceptron (MLP), the long short-term memory recurrent network (LSTM) and the bidirectional LSTM (biLSTM). The classical XGBoost algorithm has been tested with up to 2.5
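
As an illustration of the simplest technique named in the abstract, noise injection, the following is a minimal sketch in Python and not the author's implementation. The function names `inject_noise` and `augment`, the character-level edit operations, and the 5% noise rate are all assumptions made for the example. It also assumes that an amplification factor of 5 means each original example ends up as five examples in total (the original plus four noisy copies), which the abstract does not spell out.

```python
import random


def inject_noise(text, rate=0.05, rng=None):
    """Return a copy of `text` with random character-level edits.

    Each alphabetic character is, with probability `rate`, deleted,
    replaced by a random lowercase letter, or duplicated.
    (Hypothetical helper; the edit operations are an assumption.)
    """
    rng = rng or random.Random()
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    out = []
    for ch in text:
        if ch.isalpha() and rng.random() < rate:
            op = rng.choice(("delete", "substitute", "duplicate"))
            if op == "delete":
                continue
            if op == "substitute":
                out.append(rng.choice(alphabet))
            else:
                out.extend((ch, ch))
        else:
            out.append(ch)
    return "".join(out)


def augment(dataset, factor=5, rate=0.05, seed=0):
    """Grow a list of (text, label) pairs to `factor` times its size.

    Assumption: the original example is kept and counts toward `factor`.
    Labels are copied unchanged, since light noise should not flip polarity.
    """
    rng = random.Random(seed)
    augmented = []
    for text, label in dataset:
        augmented.append((text, label))  # keep the original example
        for _ in range(factor - 1):      # add factor - 1 noisy copies
            augmented.append((inject_noise(text, rate, rng), label))
    return augmented


if __name__ == "__main__":
    reviews = [("a truly wonderful film", 1), ("dull and far too long", 0)]
    for text, label in augment(reviews):
        print(label, text)
```

The augmentation runs as a pre-processing step on the training set only, so the classifier (MLP, LSTM, biLSTM or XGBoost) needs no changes. Paraphrase-based techniques such as back-translation would slot into the same `augment` loop in place of `inject_noise`.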


