High-Resource Methodological Bias in Low-Resource Investigations

11/14/2022
by Maartje ter Hoeve, et al.

The central bottleneck for low-resource NLP is typically regarded as the quantity of accessible data, which overlooks the contribution of data quality. This is particularly evident when low-resource systems are developed and evaluated by downsampling high-resource language data. In this work we investigate the validity of this approach, focusing our empirical investigations on two well-known NLP tasks: POS-tagging and machine translation. We show that downsampling from a high-resource language yields datasets with different properties than low-resource datasets, which affects model performance for both POS-tagging and machine translation. Based on these results, we conclude that naive downsampling of datasets gives a biased view of how well these systems work in a low-resource scenario.
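
To make the idea of naive downsampling concrete, here is a minimal sketch. It assumes uniform random sampling of sentences, whitespace tokenization, and type-token ratio and mean sentence length as stand-in dataset properties; none of these choices are taken from the paper, and the function names and toy corpus are hypothetical.

```python
import random


def downsample(sentences, n, seed=0):
    """Naive downsampling: draw up to n sentences uniformly at random,
    without replacement. The seed makes the draw reproducible."""
    rng = random.Random(seed)
    return rng.sample(sentences, min(n, len(sentences)))


def type_token_ratio(sentences):
    """Distinct tokens divided by total tokens (whitespace tokenization,
    an assumption made only for this illustration)."""
    tokens = [tok for sent in sentences for tok in sent.split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def mean_sentence_length(sentences):
    """Average number of whitespace-separated tokens per sentence."""
    if not sentences:
        return 0.0
    return sum(len(sent.split()) for sent in sentences) / len(sentences)


if __name__ == "__main__":
    # Hypothetical toy corpus standing in for a high-resource dataset.
    high_resource = [
        "the cat sat on the mat",
        "a dog barked at the cat",
        "the mat was red",
        "birds sing in the morning",
        "the dog slept on the mat",
        "a cat chased the birds",
    ]
    sample = downsample(high_resource, 3)
    print("type-token ratio:", type_token_ratio(sample))
    print("mean sentence length:", mean_sentence_length(sample))
```

Computing the same statistics on a real low-resource dataset of matching size would show whether the downsampled data resembles it on these particular properties.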


research
02/27/2022

OCR Improves Machine Translation for Low-Resource Languages

We aim to investigate the performance of current OCR systems on low reso...
research
10/06/2021

The Low-Resource Double Bind: An Empirical Study of Pruning for Low-Resource Machine Translation

A "bigger is better" explosion in the number of parameters in deep neura...
research
03/23/2018

Leveraging translations for speech transcription in low-resource settings

Recently proposed data collection frameworks for endangered language doc...
research
08/28/2018

Deriving Machine Attention from Human Rationales

Attention-based models are successful when trained on large amounts of d...
research
08/12/2022

A Case for Rejection in Low Resource ML Deployment

Building reliable AI decision support systems requires a robust set of d...
research
09/08/2022

Knowledge Based Template Machine Translation In Low-Resource Setting

Incorporating tagging into neural machine translation (NMT) systems has ...
research
03/07/2023

A Challenging Benchmark for Low-Resource Learning

With promising yet saturated results in high-resource settings, low-reso...
