AfroLID: A Neural Language Identification Tool for African Languages

10/21/2022
by Ife Adebara, et al.

Language identification (LID) is a crucial precursor for NLP, especially for mining web data. Yet most of the world's 7000+ languages are not covered by today's LID technologies. We address this pressing issue for Africa by introducing AfroLID, a neural LID toolkit for 517 African languages and varieties. AfroLID exploits a multi-domain web dataset manually curated from across 14 language families that use five orthographic systems. When evaluated on our blind Test set, AfroLID achieves an F1 score of 95.89. We also compare AfroLID to five existing LID tools that each cover a small number of African languages, and find that it outperforms them on most languages. We further show the utility of AfroLID in the wild by testing it on the acutely under-served Twitter domain. Finally, we offer a number of controlled case studies and perform a linguistically motivated error analysis that together showcase both AfroLID's powerful capabilities and its limitations.
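To make the reported metric concrete, here is a minimal sketch of how an F1 score can be computed for a multi-class LID task. It assumes macro-averaging over languages, which the abstract does not specify, and the function name `macro_f1` and the toy labels (ISO 639-3 codes for Yoruba, Hausa, and Swahili) are illustrative only. It is not AfroLID's evaluation code.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Macro-averaged F1 over the labels that appear in the gold data.

    Assumption: macro-averaging; the abstract only says "F1 score".
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1          # correct prediction for language g
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[g] += 1          # g was the true language but was missed
    scores = []
    for label in sorted(set(gold)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: four sentences, one Yoruba sentence mislabeled as Hausa.
gold = ["yor", "yor", "hau", "swh"]
pred = ["yor", "hau", "hau", "swh"]
print(round(100 * macro_f1(gold, pred), 2))  # 77.78
```

Macro-averaging weights every language equally, so a tool that does well only on high-resource languages is penalized; a micro-averaged score over the same predictions would weight each sentence equally instead.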


