Towards a parallel corpus of Portuguese and the Bantu language Emakhuwa of Mozambique

04/12/2021
by Felermino D. M. A. Ali, et al.

Major advances in the performance of machine translation models have been made possible in part by the availability of large-scale parallel corpora. For most of the world's languages, however, such corpora are rare. Emakhuwa, a language spoken in Mozambique, is, like most African languages, low-resource in NLP terms. It lacks both computational and linguistic resources and, to the best of our knowledge, few parallel corpora that include Emakhuwa exist. In this paper we describe the creation of the Emakhuwa-Portuguese parallel corpus, a collection of texts from the Jehovah's Witness website and a variety of other sources, including the African Story Book website, the Universal Declaration of Human Rights, and Mozambican legal documents. The dataset contains 47,415 sentence pairs, amounting to 699,976 word tokens in Emakhuwa and 877,595 word tokens in Portuguese. After normalization processes, which remain to be completed, the corpus will be made freely available for research use.
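
As a rough illustration of how figures like those above (sentence pairs and word tokens per language) could be tallied, here is a minimal Python sketch. It assumes whitespace tokenization and an in-memory list of aligned pairs; the abstract does not say how the authors tokenized or counted, and the function name `corpus_stats` and the placeholder strings are hypothetical, not corpus data.

```python
def corpus_stats(pairs):
    """Count sentence pairs and whitespace-delimited tokens on each side.

    `pairs` is an iterable of (source_sentence, target_sentence) strings.
    Whitespace tokenization is an assumption made for this sketch only.
    """
    n_pairs = 0
    src_tokens = 0
    tgt_tokens = 0
    for src, tgt in pairs:
        n_pairs += 1
        src_tokens += len(src.split())
        tgt_tokens += len(tgt.split())
    return {
        "pairs": n_pairs,
        "src_tokens": src_tokens,
        "tgt_tokens": tgt_tokens,
        # Average sentence length on each side; 0.0 for an empty corpus.
        "src_avg": src_tokens / n_pairs if n_pairs else 0.0,
        "tgt_avg": tgt_tokens / n_pairs if n_pairs else 0.0,
    }


# Placeholder pairs (not real Emakhuwa or Portuguese text).
sample = [
    ("w1 w2 w3", "p1 p2 p3 p4"),
    ("w4 w5", "p5 p6 p7"),
]
stats = corpus_stats(sample)
print(stats)

# Averages implied by the totals reported in the abstract.
print(round(699_976 / 47_415, 2))  # Emakhuwa tokens per sentence pair
print(round(877_595 / 47_415, 2))  # Portuguese tokens per sentence pair
```

Dividing the reported totals by the number of pairs gives roughly 14.8 Emakhuwa tokens and 18.5 Portuguese tokens per sentence pair.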
