Low Resource Text Classification with ULMFit and Backtranslation

03/21/2019
by Sam Shleifer, et al.

In computer vision, virtually every state-of-the-art deep learning system is trained with data augmentation. In text classification, however, data augmentation is less widely practiced because it must be performed before training and risks introducing label noise. We augment the IMDB movie reviews dataset with examples generated by two families of techniques: random token perturbations introduced by Wei and Zou [2019], and backtranslation, which translates a review to a second language and then back to English. In low-resource environments, backtranslation yields a significant improvement on top of the state-of-the-art ULMFit model. A ULMFit model pretrained on WikiText-103 and then fine-tuned on only 50 IMDB examples and 500 synthetic examples generated by backtranslation achieves 80.6% accuracy, an 8.1% improvement over the augmentation-free baseline, with only 9 minutes of additional training time. Random token perturbations do not yield any improvement but incur equivalent computational cost. The benefit of training with backtranslated examples decreases as the size of the available training data grows. On the full dataset, neither augmentation technique improves upon ULMFit's state-of-the-art performance. We address this by using backtranslations as a form of test-time augmentation.
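The two augmentation families named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The `Translator` callables are hypothetical stand-ins for whatever machine translation system is used, and the helper names (`backtranslate`, `random_swap`, `random_deletion`, `augment_with_backtranslation`) are assumptions made for illustration. Random swap and random deletion are two of the token-level operations described by Wei and Zou; the parameters chosen here are arbitrary.

```python
import random
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, int]          # (review text, sentiment label)
Translator = Callable[[str], str]  # hypothetical: text in one language -> another


def backtranslate(text: str, to_pivot: Translator, to_english: Translator) -> str:
    """Translate English text to a pivot language and back to English."""
    return to_english(to_pivot(text))


def random_swap(tokens: Sequence[str], n_swaps: int, rng: random.Random) -> List[str]:
    """Swap two randomly chosen token positions, n_swaps times."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(tokens: Sequence[str], p: float, rng: random.Random) -> List[str]:
    """Drop each token independently with probability p, keeping at least one."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(list(tokens))]


def augment_with_backtranslation(
    examples: Sequence[Example],
    pivots: Sequence[Tuple[Translator, Translator]],
) -> List[Example]:
    """Return the originals plus one backtranslated copy per pivot pair.

    Each synthetic example inherits the label of its source review, which is
    where label noise can enter if the round trip changes the sentiment.
    """
    augmented = list(examples)
    for text, label in examples:
        for to_pivot, to_english in pivots:
            synthetic = backtranslate(text, to_pivot, to_english)
            if synthetic != text:  # skip round trips that changed nothing
                augmented.append((synthetic, label))
    return augmented
```

In practice each `(to_pivot, to_english)` pair would wrap a translation model or service for one pivot language, and the augmented list would be generated once, before fine-tuning begins. Which translation system and pivot languages were used is not stated in this abstract.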
