DeepAI AI Chat
Log In Sign Up

Improving Yorùbá Diacritic Restoration

03/23/2020
by   Iroro Orife, et al.
Universität Saarland
Carnegie Mellon University
berkeley college
0

Yorùbá is a widely spoken West African language with a writing system rich in orthographic and tonal diacritics. They provide morphological information, are crucial for lexical disambiguation, pronunciation and are vital for any computational Speech or Natural Language Processing tasks. However diacritic marks are commonly excluded from electronic texts due to limited device and application support as well as general education on proper usage. We report on recent efforts at dataset cultivation. By aggregating and improving disparate texts from the web and various personal libraries, we were able to significantly grow our clean Yorùbá dataset from a majority Bibilical text corpora with three sources to millions of tokens from over a dozen sources. We evaluate updated diacritic restoration models on a new, general purpose, public-domain Yorùbá evaluation dataset of modern journalistic news text, selected to be multi-purpose and reflecting contemporary usage. All pre-trained models, datasets and source-code have been released as an open-source project to advance efforts on Yorùbá language technology.

READ FULL TEXT

page 1

page 2

page 3

page 4

04/03/2018

Attentive Sequence-to-Sequence Learning for Diacritic Restoration of Yorùbá Language Text

Yorùbá is a widely spoken West African language with a writing system ri...
05/26/2018

Splitting source code identifiers using Bidirectional LSTM Recurrent Neural Network

Programmers make rich use of natural language in the source code they wr...
09/14/2021

Hunspell for Sorani Kurdish Spell Checking and Morphological Analysis

Spell checking and morphological analysis are two fundamental tasks in t...
07/27/2019

Nefnir: A high accuracy lemmatizer for Icelandic

Lemmatization, finding the basic morphological form of a word in a corpu...
08/13/2019

IMS-Speech: A Speech to Text Tool

We present the IMS-Speech, a web based tool for German and English speec...
11/25/2022

MUSIED: A Benchmark for Event Detection from Multi-Source Heterogeneous Informal Texts

Event detection (ED) identifies and classifies event triggers from unstr...
01/14/2021

Estimation of the Frequency of Occurrence of Italian Phonemes in Text

The purpose of this project was to derive a reliable estimate of the fre...