Meeting the Needs of Low-Resource Languages: The Value of Automatic Alignments via Pretrained Models

02/15/2023
by Abteen Ebrahimi, et al.

Large multilingual models have inspired a new class of word alignment methods, which work well for the model's pretraining languages. However, the languages most in need of automatic alignment are low-resource and, thus, not typically included in the pretraining data. In this work, we ask: How do modern aligners perform on unseen languages, and are they better than traditional methods? We contribute gold-standard alignments for Bribri–Spanish, Guarani–Spanish, Quechua–Spanish, and Shipibo-Konibo–Spanish. With these, we evaluate state-of-the-art aligners with and without model adaptation to the target language. Finally, we also evaluate the resulting alignments extrinsically through two downstream tasks: named entity recognition and part-of-speech tagging. We find that although transformer-based methods generally outperform traditional models, the two classes of approach remain competitive with each other.
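For readers unfamiliar with this line of work, the sketch below illustrates the general idea behind embedding-based word aligners built on pretrained multilingual models: contextual vectors for the source and target sentences are compared by cosine similarity, and mutually best-matching word pairs are kept as alignments. This is only a minimal illustration of the approach, not the paper's exact method; the model name (bert-base-multilingual-cased), the layer choice, and the bidirectional-argmax heuristic are assumptions made for the example.

# Illustrative sketch of embedding-based word alignment with a multilingual
# encoder. Model, layer, and alignment heuristic are assumptions, not the
# configuration evaluated in the paper.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-multilingual-cased"  # assumed encoder for illustration
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(words, layer=8):
    """Return one vector per word by mean-pooling its subword embeddings."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc, output_hidden_states=True).hidden_states[layer][0]
    word_ids = enc.word_ids()
    vectors = []
    for w in range(len(words)):
        idx = [i for i, wid in enumerate(word_ids) if wid == w]
        vectors.append(hidden[idx].mean(dim=0))
    return torch.stack(vectors)

def align(src_words, tgt_words):
    """Keep word pairs whose similarity is a mutual (row and column) argmax."""
    src, tgt = embed(src_words), embed(tgt_words)
    sim = torch.nn.functional.normalize(src, dim=-1) @ \
          torch.nn.functional.normalize(tgt, dim=-1).T
    fwd = sim.argmax(dim=1)   # best target word for each source word
    bwd = sim.argmax(dim=0)   # best source word for each target word
    return [(i, j.item()) for i, j in enumerate(fwd) if bwd[j].item() == i]

print(align("el perro duerme".split(), "the dog sleeps".split()))

Run on a short sentence pair, the example prints mutually aligned word-index pairs; the paper's evaluation instead scores such automatic alignments against the gold-standard annotations contributed for the four language pairs above, and through the downstream NER and POS-tagging tasks.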
