Low-Resource Language Modelling of South African Languages

04/01/2021
by Stuart Mesham, et al.

Language models are the foundation of current neural network-based models for natural language understanding and generation. However, research on the intrinsic performance of language models on African languages has been extremely limited, and it is made more challenging by the lack of large or standardised training and evaluation sets of the kind that exist for English and other high-resource languages. In this paper, we evaluate the performance of open-vocabulary language models on low-resource South African languages, using byte-pair encoding to handle the rich morphology of these languages. We evaluate different variants of n-gram models, feedforward neural networks, recurrent neural networks (RNNs), and Transformers on small-scale datasets. Overall, well-regularised RNNs give the best performance across two isiZulu datasets and one Sepedi dataset. Multilingual training further improves performance on these datasets. We hope that this research will open new avenues for research into multilingual and low-resource language modelling for African languages.
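
As a rough illustration of the byte-pair encoding step mentioned in the abstract, the toy sketch below learns merge rules from a small word list and uses them to split a word into subword units. It is not the authors' implementation: the function names (`learn_bpe`, `merge_pair`, `segment`), the `</w>` end-of-word marker, and the example words are assumptions made for illustration only.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy sketch)."""
    # Each word starts as a sequence of characters plus an end-of-word marker
    # (the "</w>" marker is an assumption of this sketch).
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Merge the most frequent pair everywhere it occurs.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split a word into subword units by applying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    # Hypothetical toy corpus; a real setup would train on the full training text.
    corpus = ["ngiyabonga", "siyabonga", "ngiyahamba", "siyahamba"]
    rules = learn_bpe(corpus, num_merges=10)
    print(segment("siyabonga", rules))
```

Because any word can fall back to single characters when no merge applies, a model over such units is open-vocabulary: it can assign probability to word forms that never appeared in the training data.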


