Distributional Data Augmentation Methods for Low Resource Language

09/09/2023
by Mosleh Mahamud, et al.

Text augmentation is a technique for constructing synthetic data from an under-resourced corpus to improve predictive performance. Synthetic data generation is common in many domains, and text augmentation has recently emerged in natural language processing (NLP) as a way to improve downstream tasks. One of the current state-of-the-art text augmentation techniques is easy data augmentation (EDA), which augments the training data by injecting and replacing synonyms and randomly permuting sentences. A major obstacle with EDA is its need for versatile and complete synonym dictionaries, which are not readily available for low-resource languages. To improve the utility of EDA, we propose two extensions, easy distributional data augmentation (EDDA) and type specific similar word replacement (TSSR), which use semantic word context information and part-of-speech tags for word replacement and augmentation. In an extensive empirical evaluation, we show the utility of the proposed methods, measured by F1 score, on two representative datasets in Swedish as an example of a low-resource language. With the proposed methods, we show that augmented data improve classification performance in low-resource settings.
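To make the idea of dictionary-free replacement concrete, the following is a minimal sketch of swapping a word for its nearest neighbour in an embedding space, restricted to candidates with the same part-of-speech tag. It is not the authors' implementation of EDDA or TSSR. The toy vectors, the tag table, and the names `nearest_same_pos` and `augment` are all assumptions invented for illustration; a real setup would presumably use pretrained word vectors and a POS tagger for the target language.

```python
import math
import random

# Toy word vectors and POS tags, invented purely for illustration.
# A real system would load pretrained embeddings and run a POS tagger.
TOY_VECTORS = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.85, 0.15, 0.05],
    "bad": [-0.8, 0.2, 0.0],
    "film": [0.0, 0.9, 0.1],
    "movie": [0.05, 0.88, 0.12],
    "watch": [0.1, 0.3, 0.9],
}
TOY_POS = {
    "good": "ADJ", "great": "ADJ", "bad": "ADJ",
    "film": "NOUN", "movie": "NOUN", "watch": "VERB",
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_same_pos(word):
    """Return the most similar known word sharing the POS tag, or None."""
    if word not in TOY_VECTORS:
        return None
    candidates = [
        w for w in TOY_VECTORS
        if w != word and TOY_POS[w] == TOY_POS[word]
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda w: cosine(TOY_VECTORS[word], TOY_VECTORS[w]))


def augment(tokens, p=0.5, rng=None):
    """Replace each token with probability p when a same-POS neighbour exists."""
    rng = rng or random.Random(0)
    out = []
    for tok in tokens:
        replacement = nearest_same_pos(tok)
        if replacement is not None and rng.random() < p:
            out.append(replacement)
        else:
            out.append(tok)
    return out


if __name__ == "__main__":
    print(augment(["a", "good", "film"], p=1.0))
```

Because candidates come from vector similarity rather than a curated synonym list, the sketch needs no synonym dictionary. The POS filter is one simple way to keep a noun from being swapped for an adjective or a verb.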


