Data Augmentation via Dependency Tree Morphing for Low-Resource Languages

03/22/2019
by   Gözde Gül Şahin, et al.

Neural NLP systems achieve high scores when a sizable training dataset is available. The lack of such datasets leads to poor system performance for low-resource languages. We present two simple text augmentation techniques that use dependency trees, inspired by image processing. We crop sentences by removing dependency links, and we rotate sentences by moving the tree fragments around the root. We apply these techniques to augment the training sets of low-resource languages in the Universal Dependencies project. We implement a character-level sequence tagging model and evaluate the augmented datasets on the part-of-speech tagging task. We show that crop and rotate provide improvements over models trained with non-augmented data for the majority of the languages, especially for languages with rich case marking systems.
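
The two operations described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the token layout `(id, form, upos, head)`, the toy sentence, and the helper names `subtree`, `root_and_fragments`, `crop`, `rotate`, and `forms` are all assumptions made for illustration. Here crop keeps the root plus one of its subtrees, and rotate reorders the root and its child subtrees; the paper may constrain which fragments are eligible.

```python
from itertools import permutations

# Hypothetical toy sentence as (id, form, upos, head) tuples; head 0 marks the root.
SENT = [
    (1, "She", "PRON", 2),
    (2, "wrote", "VERB", 0),
    (3, "a", "DET", 4),
    (4, "letter", "NOUN", 2),
    (5, "yesterday", "ADV", 2),
]

def subtree(sent, top):
    """Return the token `top` and all its descendants, in surface order."""
    ids = {top}
    grew = True
    while grew:  # keep adding tokens whose head is already in the set
        grew = False
        for tid, _, _, head in sent:
            if head in ids and tid not in ids:
                ids.add(tid)
                grew = True
    return [tok for tok in sent if tok[0] in ids]

def root_and_fragments(sent):
    """Return the root token and one fragment (subtree) per child of the root."""
    root = next(tok for tok in sent if tok[3] == 0)
    frags = [subtree(sent, tok[0]) for tok in sent if tok[3] == root[0]]
    return root, frags

def crop(sent, child_id):
    """Keep the root and the subtree under `child_id`; drop every other fragment.
    `child_id` is assumed to be a direct child of the root."""
    root, _ = root_and_fragments(sent)
    keep = {tok[0] for tok in subtree(sent, child_id)} | {root[0]}
    return [tok for tok in sent if tok[0] in keep]

def rotate(sent):
    """Yield every reordering of the root and its fragments except the original.
    Token ids and heads are left as they were; a real pipeline would renumber them."""
    root, frags = root_and_fragments(sent)
    units = frags + [[root]]
    for order in permutations(units):
        new = [tok for unit in order for tok in unit]
        if new != sent:
            yield new

def forms(tokens):
    """Join the word forms of a token list into a plain string."""
    return " ".join(tok[1] for tok in tokens)

print(forms(crop(SENT, 4)))      # root plus the object fragment
print(forms(next(rotate(SENT)))) # first reordered variant
```

Because each token keeps its part-of-speech tag when it is dropped or moved, every output of this sketch is itself a labeled sequence that could be added to a tagger's training set.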

research
11/18/2021

To Augment or Not to Augment? A Comparative Study on Text Augmentation Techniques for Low-Resource NLP

Data-hungry deep neural networks have established themselves as the stan...
research
09/06/2019

A systematic comparison of methods for low-resource dependency parsing on genuinely low-resource languages

Parsers are available for only a handful of the world's languages, since...
research
11/30/2021

Minor changes make a difference: a case study on the consistency of UD-based dependency parsers

Many downstream applications are using dependency trees, and are thus re...
research
04/16/2021

Optimal Size-Performance Tradeoffs: Weighing PoS Tagger Models

Improvement in machine learning-based NLP performance are often presente...
research
06/08/2021

A Falta de Pan, Buenas Son Tortas: The Efficacy of Predicted UPOS Tags for Low Resource UD Parsing

We evaluate the efficacy of predicted UPOS tags as input features for de...
research
10/18/2022

Graph-Based Multilingual Label Propagation for Low-Resource Part-of-Speech Tagging

Part-of-Speech (POS) tagging is an important component of the NLP pipeli...
research
02/03/2019

Universal Lemmatizer: A Sequence to Sequence Model for Lemmatizing Universal Dependencies Treebanks

In this paper we present a novel lemmatization method based on a sequenc...
