KINNEWS and KIRNEWS: Benchmarking Cross-Lingual Text Classification for Kinyarwanda and Kirundi

10/23/2020
by Rubungo Andre Niyongabo, et al.

Recent progress in text classification has focused on high-resource languages such as English and Chinese. For low-resource languages, among them most African languages, the lack of well-annotated data and effective preprocessing is hindering progress and the transfer of successful methods. In this paper, we introduce two news datasets (KINNEWS and KIRNEWS) for multi-class classification of news articles in Kinyarwanda and Kirundi, two low-resource African languages. The two languages are mutually intelligible, but while Kinyarwanda has been studied in Natural Language Processing (NLP) to some extent, this work constitutes the first study on Kirundi. Along with the datasets, we provide statistics, guidelines for preprocessing, and monolingual and cross-lingual baseline models. Our experiments show that training embeddings on the relatively higher-resourced Kinyarwanda yields successful cross-lingual transfer to Kirundi. In addition, the design of the datasets allows for wider use in NLP beyond text classification in future studies, such as representation learning, cross-lingual learning with more distant languages, or as a base for new annotations for tasks such as parsing, POS tagging, and NER. The datasets, stopwords, and pre-trained embeddings are publicly available at https://github.com/Andrews2017/KINNEWS-and-KIRNEWS-Corpus.
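To make the cross-lingual idea concrete, here is a minimal sketch, not the authors' baseline. It assumes a single embedding table trained on the source language also covers many tokens of the closely related target language. A nearest-centroid classifier is fitted on source-language documents and applied unchanged to target-language documents. The tokens, vectors, label names, and function names are all hypothetical placeholders, not values from KINNEWS or KIRNEWS.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical embedding table, standing in for vectors trained on the
# higher-resourced source language. Tokens and values are placeholders.
EMBEDDINGS = {
    "w_sport_a": [1.0, 0.0],
    "w_sport_b": [0.9, 0.1],
    "w_politics_a": [0.0, 1.0],
    "w_politics_b": [0.1, 0.9],
}
DIM = 2


def embed(tokens):
    """Average the vectors of known tokens; out-of-vocabulary tokens are skipped."""
    vecs = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vecs:
        return [0.0] * DIM
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def fit_centroids(docs, labels):
    """Compute one mean document vector per class from labelled source documents."""
    grouped = defaultdict(list)
    for tokens, label in zip(docs, labels):
        grouped[label].append(embed(tokens))
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def predict(centroids, tokens):
    """Return the class whose centroid is most similar to the document vector."""
    vec = embed(tokens)
    return max(centroids, key=lambda label: cosine(centroids[label], vec))


# Labelled documents in the source language (placeholder tokens and labels).
source_docs = [["w_sport_a", "w_sport_b"], ["w_politics_a"]]
source_labels = ["sports", "politics"]
centroids = fit_centroids(source_docs, source_labels)

# Unlabelled documents in the target language; "unseen_token" is out of vocabulary.
target_docs = [["w_sport_b", "unseen_token"], ["w_politics_b"]]
predictions = [predict(centroids, doc) for doc in target_docs]
```

In this toy setup the transfer works only because the target documents reuse tokens present in the source embedding table. With real data, the degree of vocabulary overlap between the two languages would determine how much of each target document receives a vector at all.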


