Finetuning a Kalaallisut-English machine translation system using web-crawled data

06/05/2022
by Alex Jones, et al.

West Greenlandic, known by native speakers as Kalaallisut, is an extremely low-resource polysynthetic language spoken by around 56,000 people in Greenland. Here, we attempt to finetune a pretrained Kalaallisut-to-English neural machine translation (NMT) system using pseudoparallel sentences crawled from around 30 multilingual websites. We compile a corpus of over 93,000 Kalaallisut sentences and over 140,000 Danish sentences, then use cross-lingual sentence embeddings and approximate nearest-neighbors search to try to mine near-translations from these corpora. Finally, we translate the mined Danish sentences into English to obtain a synthetic Kalaallisut-English aligned corpus. Although the resulting dataset is too small and noisy to improve the pretrained MT model, we believe that with additional resources we could construct a better pseudoparallel corpus and achieve more promising MT results. We also note other possible uses of the monolingual Kalaallisut data and discuss directions for future work. We make the code and data for our experiments publicly available.
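
The mining step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes sentence embeddings have already been computed by some cross-lingual encoder, it substitutes exact cosine search for the approximate nearest-neighbors search named in the abstract, and the function name `mine_pairs`, the mutual-nearest-neighbor filter, and the `threshold` value are all hypothetical choices made for illustration.

```python
import numpy as np


def mine_pairs(src_emb, tgt_emb, threshold=0.8):
    """Return (src_index, tgt_index, cosine) for candidate near-translations.

    Assumption: rows of src_emb / tgt_emb are sentence embeddings from a
    shared cross-lingual space. A pair is kept only if the two sentences
    are each other's nearest neighbour and their cosine similarity is at
    least `threshold` (a hypothetical cut-off, not taken from the paper).
    """
    # L2-normalise so that the dot product equals cosine similarity.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)

    # Exact similarity matrix; a real corpus would use an approximate index.
    sim = src @ tgt.T
    fwd = sim.argmax(axis=1)  # best target for each source sentence
    bwd = sim.argmax(axis=0)  # best source for each target sentence

    pairs = []
    for i, j in enumerate(fwd):
        if bwd[j] == i and sim[i, j] >= threshold:
            pairs.append((i, int(j), float(sim[i, j])))
    return pairs


# Toy 3-dimensional "embeddings" standing in for Kalaallisut and Danish sentences.
kl_emb = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
da_emb = np.array([[0.9, 0.1, 0.0], [0.0, 0.0, 1.0], [0.5, 0.5, 0.5], [0.1, 0.95, 0.0]])

pairs = mine_pairs(kl_emb, da_emb)
strict_pairs = mine_pairs(kl_emb, da_emb, threshold=0.999)
```

Raising the threshold trades recall for precision: on the toy data the default keeps three pairs, while the stricter setting keeps only the exact match. With corpora of the size reported here, the dense similarity matrix would likely be replaced by an approximate index.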


