Improving Yorùbá Diacritic Restoration

03/23/2020
by Iroro Orife, et al.

Yorùbá is a widely spoken West African language with a writing system rich in orthographic and tonal diacritics. These diacritics provide morphological information, are crucial for lexical disambiguation and pronunciation, and are vital for any computational speech or natural language processing task. However, diacritic marks are commonly excluded from electronic texts because of limited device and application support, as well as limited general education on proper usage. We report on recent efforts at dataset cultivation. By aggregating and improving disparate texts from the web and various personal libraries, we were able to significantly grow our clean Yorùbá dataset from a majority-Biblical text corpus with three sources to millions of tokens from over a dozen sources. We evaluate updated diacritic restoration models on a new, general-purpose, public-domain Yorùbá evaluation dataset of modern journalistic news text, selected to be multi-purpose and to reflect contemporary usage. All pre-trained models, datasets and source code have been released as an open-source project to advance efforts on Yorùbá language technology.
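
To make the restoration task concrete, the following minimal Python sketch shows how removing diacritics collapses differently marked forms into one undiacritized string, which is the ambiguity a restoration model has to resolve. It is an illustration only, not the released project's code: the helper names `strip_diacritics` and `make_pair` are hypothetical, and building (undiacritized, diacritized) pairs this way is an assumption about how training data for such a model could be prepared.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove all combining marks (tone marks and under-dots) from text."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    base_only = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", base_only)


def make_pair(diacritized: str) -> tuple[str, str]:
    """Hypothetical helper: build one (source, target) pair for restoration."""
    target = unicodedata.normalize("NFC", diacritized)
    return strip_diacritics(target), target


if __name__ == "__main__":
    # Three differently marked forms that share one undiacritized spelling.
    for word in ["ọkọ", "ọkọ́", "ọkọ̀"]:
        source, target = make_pair(word)
        print(f"{target} -> {source}")
```

Because every form above maps to the same plain string, the inverse mapping cannot be a simple lookup and has to use the surrounding context.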


