Very Low Resource Sentence Alignment: Luhya and Swahili

10/31/2022
by Everlyn Asiko Chimoto, et al.

Language-agnostic sentence embeddings generated by pre-trained models such as LASER and LaBSE are attractive options for mining large datasets to produce parallel corpora for low-resource machine translation. We test LASER and LaBSE in extracting bitext for two related low-resource African languages: Luhya and Swahili. For this work, we created a new parallel set of nearly 8000 Luhya-English sentences, which allows a new zero-shot test of LASER and LaBSE. We find that LaBSE significantly outperforms LASER on both languages. Both LASER and LaBSE, however, perform poorly at zero-shot alignment on Luhya, achieving just 1.5% [...]. We fine-tune the embeddings on a small set of parallel Luhya sentences and show significant gains, improving the LaBSE alignment accuracy to 53.3%. Restricting the dataset to sentence embedding pairs with cosine similarity above 0.7 yielded alignments with over 85% [...].
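As a rough illustration of the similarity filtering mentioned above, the following sketch aligns each source sentence to its nearest target sentence by cosine similarity and keeps only pairs scoring above a threshold. It is not the authors' pipeline: the nearest-neighbour retrieval step, the top-1 accuracy measure, the function names (`cosine`, `align`, `top1_accuracy`), and the two-dimensional toy vectors are all assumptions made for illustration. Only the 0.7 threshold comes from the abstract. In practice the vectors would be sentence embeddings from a model such as LASER or LaBSE.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def align(src_emb, tgt_emb, threshold=0.7):
    """Pair each source vector with its most similar target vector,
    keeping only pairs whose similarity exceeds the threshold."""
    pairs = []
    for i, u in enumerate(src_emb):
        scores = [cosine(u, v) for v in tgt_emb]
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] > threshold:
            pairs.append((i, j, scores[j]))
    return pairs

def top1_accuracy(src_emb, tgt_emb):
    """Fraction of source vectors whose nearest target has the same index
    (assumes row i of src_emb is the translation of row i of tgt_emb)."""
    hits = 0
    for i, u in enumerate(src_emb):
        scores = [cosine(u, v) for v in tgt_emb]
        if max(range(len(scores)), key=scores.__getitem__) == i:
            hits += 1
    return hits / len(src_emb)

# Hypothetical toy "embeddings"; real ones would have hundreds of dimensions.
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt = [[0.9, 0.1], [0.2, 0.8], [1.0, -1.0]]

for i, j, score in align(src, tgt, threshold=0.7):
    print(f"source {i} -> target {j} (cosine {score:.3f})")
print(f"top-1 accuracy: {top1_accuracy(src, tgt):.3f}")
```

In this toy data the third source vector is matched to the wrong target with a similarity just above 0.7, so raising the threshold would discard that pair at the cost of a smaller aligned set. This is the usual trade-off between alignment precision and corpus size.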
