The first large scale collection of diverse Hausa language datasets

02/13/2021
by Isa Inuwa-Dutse, et al.

The Hausa language belongs to the Afroasiatic phylum and has more first-language speakers than any other sub-Saharan African language. With a majority of its speakers residing in the Northern and Southern areas of Nigeria and the Republic of Niger, respectively, it is estimated that over 100 million people speak the language, making it one of the most widely spoken Chadic languages. While Hausa is considered a well-studied and well-documented language among the sub-Saharan African languages, it is viewed as a low-resource language from the perspective of natural language processing (NLP) because few resources are available for NLP-related tasks. This is common to most languages in Africa; thus, it is crucial to enrich such languages with resources that will support and speed up various downstream tasks to meet the demands of modern society. While useful datasets exist, notably from news sites and religious texts, more diversity is needed in the corpus. We provide an expansive collection of curated datasets consisting of both formal and informal forms of the language, drawn from reputable websites and online social media networks, respectively. The collection is large and more diverse than existing corpora, providing the first and largest set of Hausa social media posts, which capture the peculiarities of the language. The collection also includes a parallel dataset, which can be used for tasks such as machine translation, with applications in areas such as the detection of spurious or inciting online content. We describe the curation process, from collection and preprocessing to how the data can be obtained, and propose some research problems that could be addressed using the data.

