Eeny, meeny, miny, moe. How to choose data for morphological inflection

10/26/2022
by Saliha Muradoğlu, et al.

Data scarcity is a widespread problem in many natural language processing (NLP) tasks for low-resource languages. Within morphology, the labour-intensive work of tagging and glossing data is a serious bottleneck for both NLP and language documentation. Active learning (AL) aims to reduce the cost of data annotation by selecting the data that is most informative for improving the model. In this paper, we explore four sampling strategies for the task of morphological inflection using a Transformer model: a pair of oracle experiments, where data is chosen based on whether the model can or cannot already inflect the test forms correctly; strategies based on high/low model confidence and entropy; and random selection. We investigate the robustness of each strategy across 30 typologically diverse languages. We also perform a more in-depth case study of Natügu. Our results show a clear benefit to selecting data based on model confidence and entropy. Unsurprisingly, the oracle experiment in which only incorrectly handled forms are chosen for further training shows the most improvement; we present this experiment as a proxy for linguist/language consultant feedback. It is followed closely by choosing low-confidence and high-entropy predictions. We also show that, despite the conventional wisdom that larger data sets yield better accuracy, introducing more instances of high-confidence or low-entropy forms, or of forms that the model can already inflect correctly, can reduce model performance.
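
The confidence- and entropy-based selection described above can be illustrated with a minimal sketch. The abstract does not give the exact scoring formulas, so everything here is an assumption made for illustration: confidence is taken as the summed log-probability of the most likely symbol at each decoding step, entropy as the mean per-step Shannon entropy, and the names `sequence_confidence`, `mean_entropy`, `select`, and the toy `pool` are hypothetical. The oracle strategies are left out because they need gold forms.

```python
import math
import random
from typing import Dict, List

# One probability distribution over output symbols per decoding step.
StepProbs = List[List[float]]


def sequence_confidence(step_probs: StepProbs) -> float:
    """Assumed confidence score: sum of log(max probability) over steps."""
    return sum(math.log(max(dist)) for dist in step_probs)


def mean_entropy(step_probs: StepProbs) -> float:
    """Assumed uncertainty score: mean Shannon entropy (nats) over steps."""
    total = 0.0
    for dist in step_probs:
        total -= sum(p * math.log(p) for p in dist if p > 0)
    return total / len(step_probs)


def select(pool: Dict[str, StepProbs], k: int, strategy: str, seed: int = 0) -> List[str]:
    """Pick k items from the unlabelled pool according to the strategy."""
    items = list(pool)
    if strategy == "random":
        return random.Random(seed).sample(items, k)
    # Map each strategy to (scoring function, sort descending?).
    scorers = {
        "low_confidence": (sequence_confidence, False),
        "high_confidence": (sequence_confidence, True),
        "low_entropy": (mean_entropy, False),
        "high_entropy": (mean_entropy, True),
    }
    score, descending = scorers[strategy]
    ranked = sorted(items, key=lambda item: score(pool[item]), reverse=descending)
    return ranked[:k]


# Toy pool (made-up numbers): each key stands for a candidate form,
# each value for the model's per-step output distributions.
pool = {
    "form_a": [[0.9, 0.1], [0.8, 0.2]],
    "form_b": [[0.5, 0.5], [0.6, 0.4]],
    "form_c": [[0.7, 0.3], [0.99, 0.01]],
}

to_annotate = select(pool, k=1, strategy="low_confidence")
```

In this toy pool, `form_b` has the flattest distributions, so it is ranked first under both the low-confidence and the high-entropy strategies. In an active learning loop, the selected forms would then be annotated and added to the training data before retraining.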


research
11/02/2020

Reducing Confusion in Active Learning for Part-Of-Speech Tagging

Active learning (AL) uses a data selection algorithm to select useful tr...
research
12/30/2022

Active Learning for Neural Machine Translation

The machine translation mechanism translates texts automatically between...
research
03/11/2021

Active^2 Learning: Actively reducing redundancies in Active Learning methods for Sequence Tagging and Machine Translation

While deep learning is a powerful tool for natural language processing (...
research
05/29/2019

Choosing Transfer Languages for Cross-Lingual Learning

Cross-lingual transfer, where a high-resource transfer language is used ...
research
05/04/2019

Contextualization of Morphological Inflection

Critical to natural language generation is the production of correctly i...
research
12/18/2021

Morpheme Boundary Detection & Grammatical Feature Prediction for Gujarati : Dataset & Model

Developing Natural Language Processing resources for a low resource lang...
research
05/25/2023

Morphological Inflection: A Reality Check

Morphological inflection is a popular task in sub-word NLP with both pra...
