To Augment or Not to Augment? A Comparative Study on Text Augmentation Techniques for Low-Resource NLP

11/18/2021
by Gözde Gül Şahin, et al.

Data-hungry deep neural networks have established themselves as the standard for many NLP tasks, including traditional sequence tagging. Despite their state-of-the-art performance on high-resource languages, they still fall behind their statistical counterparts in low-resource scenarios. One methodology for countering this problem is text augmentation, i.e., generating new synthetic training data points from existing data. Although NLP has recently seen a wealth of textual augmentation techniques, the field still lacks a systematic performance analysis on a diverse set of languages and sequence tagging tasks. To fill this gap, we investigate three categories of text augmentation methodologies that perform changes at the syntax level (e.g., cropping sub-sentences), the token level (e.g., random word insertion), and the character level (e.g., character swapping). We systematically compare them on part-of-speech tagging, dependency parsing, and semantic role labeling for a diverse set of language families, using various models, including architectures that rely on pretrained multilingual contextualized language models such as mBERT. Augmentation most significantly improves dependency parsing, followed by part-of-speech tagging and semantic role labeling. We find the tested techniques to be effective on morphologically rich languages in general, but not on analytic languages such as Vietnamese. Our results suggest that the augmentation techniques can further improve over strong baselines based on mBERT. We identify the character-level methods as the most consistent performers, while synonym replacement and syntactic augmenters provide inconsistent improvements. Finally, we discuss how the results depend most heavily on the task, language pair, and model type.
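To make the three augmentation levels concrete, below is a minimal Python sketch with one operation per level. The function names are illustrative assumptions, not the paper's implementation, and the cropping shown here works on contiguous token spans for brevity, whereas the paper's syntactic cropping operates on dependency subtrees.

```python
import random

def char_swap(word: str, rng: random.Random) -> str:
    """Character level: swap two adjacent inner characters of a word."""
    if len(word) < 4:
        return word
    i = rng.randrange(1, len(word) - 2)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def random_insertion(tokens: list[str], rng: random.Random) -> list[str]:
    """Token level: insert a copy of a random token at a random position."""
    if not tokens:
        return tokens
    augmented = tokens[:]
    augmented.insert(rng.randrange(len(augmented) + 1), rng.choice(tokens))
    return augmented

def crop(tokens: list[str], rng: random.Random) -> list[str]:
    """Syntax level (simplified): keep a contiguous sub-span of the sentence.
    The actual method crops complete subtrees of the dependency parse."""
    if len(tokens) < 3:
        return tokens
    start = rng.randrange(len(tokens) - 1)
    end = rng.randrange(start + 1, len(tokens))
    return tokens[start : end + 1]

if __name__ == "__main__":
    rng = random.Random(0)  # seed for reproducible augmented data
    sentence = "the quick brown fox jumps over the lazy dog".split()
    print(" ".join(char_swap(t, rng) for t in sentence))  # character level
    print(" ".join(random_insertion(sentence, rng)))      # token level
    print(" ".join(crop(sentence, rng)))                  # syntax level
```

Seeding the random generator, as above, keeps the synthetic training set reproducible across runs, which matters when comparing augmenters against each other.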


research · 03/22/2019
Data Augmentation via Dependency Tree Morphing for Low-Resource Languages
Neural NLP systems achieve high scores in the presence of sizable traini...

research · 10/26/2021
Can Character-based Language Models Improve Downstream Task Performance in Low-Resource and Noisy Language Scenarios?
Recent impressive improvements in NLP, largely based on the success of c...

research · 02/03/2023
Mitigating Data Scarcity for Large Language Models
In recent years, pretrained neural language models (PNLMs) have taken th...

research · 09/09/2023
Distributional Data Augmentation Methods for Low Resource Language
Text augmentation is a technique for constructing synthetic data from an...

research · 08/17/2023
Linguistically-Informed Neural Architectures for Lexical, Syntactic and Semantic Tasks in Sanskrit
The primary focus of this thesis is to make Sanskrit manuscripts more ac...

research · 07/12/2021
DaCy: A Unified Framework for Danish NLP
Danish natural language processing (NLP) has in recent years obtained co...

research · 01/15/2022
Automatic Correction of Syntactic Dependency Annotation Differences
Annotation inconsistencies between data sets can cause problems for low-...
