Multilingual unsupervised sequence segmentation transfers to extremely low-resource languages

10/16/2021
by C. M. Downey, et al.

We show that unsupervised sequence-segmentation performance can be transferred to extremely low-resource languages by pre-training a Masked Segmental Language Model (Downey et al., 2021) multilingually. Further, we show that this transfer can be achieved by training over a collection of low-resource languages that are typologically similar (but phylogenetically unrelated) to the target language. In our experiments, we transfer from a collection of 10 Indigenous American languages (AmericasNLP, Mager et al., 2021) to K'iche', a Mayan language. We compare our model to a monolingual baseline, and show that the multilingual pre-trained approach yields much more consistent segmentation quality across target dataset sizes, including a zero-shot performance of 20.6 F1, and exceeds the monolingual performance in 9/10 experimental settings. These results have promising implications for low-resource NLP pipelines involving human-like linguistic units, such as the sparse transcription framework proposed by Bird (2020).
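The segmentation quality above is reported as boundary F1. As a minimal illustrative sketch (not code from the paper), the metric can be computed by comparing the internal boundary positions implied by predicted and gold segmentations; the function names and the example strings below are hypothetical.

```python
from typing import List, Set, Tuple


def boundary_set(segments: List[str]) -> Set[int]:
    """Return the internal boundary positions (character offsets)
    implied by a list of segments, e.g. ["ki", "che"] -> {2}."""
    boundaries, offset = set(), 0
    for seg in segments[:-1]:  # the word-final boundary is trivial, so skip it
        offset += len(seg)
        boundaries.add(offset)
    return boundaries


def segmentation_f1(predicted: List[List[str]],
                    gold: List[List[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over internal segment boundaries."""
    tp = fp = fn = 0
    for pred_word, gold_word in zip(predicted, gold):
        p, g = boundary_set(pred_word), boundary_set(gold_word)
        tp += len(p & g)
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical predicted vs. gold morpheme segmentations for two words.
    pred = [["k", "inb'e"], ["xin", "war", "ik"]]
    gold = [["k", "in", "b'e"], ["xin", "war", "ik"]]
    print(segmentation_f1(pred, gold))  # (1.0, 0.75, ~0.857)
```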

