Text Normalization for Low-Resource Languages of Africa

03/29/2021
by Andrew Zupon, et al.

Training data for machine learning models can come from many different sources, some of which are of dubious quality. For resource-rich languages like English, so much data is available that we can afford to discard the dubious data. For low-resource languages, where much less data is available, we cannot necessarily afford to discard it, because doing so may leave a training set that is too small to train a model. In this study, we examine the effects of text normalization and data set quality for a set of low-resource languages of Africa: Afrikaans, Amharic, Hausa, Igbo, Malagasy, Somali, Swahili, and Zulu. We describe our text normalizer, which we built in Pynini, a Python library for finite-state transducers, and our experiments in training language models for these languages using the Natural Language Toolkit (NLTK), an open-source Python library for NLP.
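To make the pipeline described above concrete, here is a minimal sketch of the general idea: normalize raw text, train a small language model, and compare how well it scores different sentences. It is not the authors' implementation. It deliberately uses only the Python standard library in place of Pynini and NLTK, and the function names (`normalize`, `train_bigram`, `perplexity`), the normalization rules, and the add-one smoothing are all assumptions chosen for illustration.

```python
import math
import unicodedata
from collections import Counter


def normalize(text):
    """Illustrative normalizer (an assumption, not the paper's Pynini grammar)."""
    # Use one canonical Unicode form and lowercase the text.
    text = unicodedata.normalize("NFC", text).lower()
    # Keep letters, digits, combining marks, apostrophes, and whitespace;
    # replace everything else (punctuation, symbols) with a space.
    kept = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if ch.isalnum() or ch.isspace() or ch == "'" or is_mark:
            kept.append(ch)
        else:
            kept.append(" ")
    # Collapse runs of whitespace into single spaces.
    return " ".join("".join(kept).split())


def train_bigram(sentences):
    """Count unigrams and bigrams over sentences padded with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def perplexity(sentence, unigrams, bigrams):
    """Perplexity of one sentence under an add-one smoothed bigram model."""
    # Predictable vocabulary: every seen token except <s>, plus one slot
    # for unseen words, which comes to len(unigrams) in total.
    vocab_size = len(unigrams)
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    log_prob = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        prob = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(prob)
    return math.exp(-log_prob / (len(tokens) - 1))


# Toy Swahili-like training lines, invented for this example.
raw_lines = ["Habari,  YAKO?", "Habari za asubuhi!"]
model = train_bigram([normalize(line) for line in raw_lines])
seen = perplexity(normalize("Habari yako."), *model)
unseen = perplexity(normalize("Asubuhi yako."), *model)
```

In this toy setup, normalization lets "Habari,  YAKO?" and "Habari yako." map to the same token sequence, so the model assigns the second sentence a lower perplexity than a word order it has not seen. A finite-state normalizer and a toolkit language model would replace the hand-written pieces here.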
