MasakhaNEWS: News Topic Classification for African languages

04/19/2023 ∙ by David Ifeoluwa Adelani, et al.

African languages are severely under-represented in NLP research due to a lack of datasets covering several NLP tasks. While there are individual language-specific datasets that are being expanded to different tasks, only a handful of NLP tasks (e.g. named entity recognition and machine translation) have standardized benchmark datasets covering several geographically and typologically diverse African languages. In this paper, we develop MasakhaNEWS, a new benchmark dataset for news topic classification covering 16 languages widely spoken in Africa. We provide an evaluation of baseline models by training classical machine learning models and fine-tuning several language models. Furthermore, we explore several alternatives to full fine-tuning of language models that are better suited for zero-shot and few-shot learning, such as cross-lingual parameter-efficient fine-tuning (like MAD-X), pattern exploiting training (PET), prompting language models (like ChatGPT), and prompt-free sentence transformer fine-tuning (SetFit and Cohere Embedding API). Our evaluation in the zero-shot setting shows the potential of prompting ChatGPT for news topic classification in low-resource African languages, achieving an average performance of 70 F1 points without the additional supervision that approaches like MAD-X require. In the few-shot setting, we show that with as few as 10 examples per label, the PET approach achieves 86.0 F1 points, more than 90% of the performance of fully supervised training (92.6 F1 points).
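As a rough illustration of the few-shot setup described above, the sketch below draws 10 examples per label from a labelled collection and recomputes the reported ratio of few-shot to fully supervised performance. The function names, the toy headlines, and the three topic labels are hypothetical placeholders; this is not the authors' code and does not use the MasakhaNEWS data.

```python
import random
from collections import defaultdict


def sample_k_per_label(examples, k, seed=0):
    """Return up to k (text, label) pairs per label, drawn at random.

    Hypothetical helper for building a few-shot training subset.
    """
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    subset = []
    for label in sorted(by_label):
        pool = by_label[label]
        subset.extend(rng.sample(pool, min(k, len(pool))))
    return subset


def relative_performance(few_shot_f1, full_f1):
    """Few-shot score as a percentage of the fully supervised score."""
    return 100.0 * few_shot_f1 / full_f1


# Toy stand-in data: 25 placeholder headlines for each of 3 illustrative labels.
toy = [
    (f"headline {i}", label)
    for label in ("sports", "health", "politics")
    for i in range(25)
]

few_shot = sample_k_per_label(toy, k=10)

# F1 scores quoted in the abstract: 86.0 (PET, few-shot) vs 92.6 (full training).
ratio = relative_performance(86.0, 92.6)
print(len(few_shot), round(ratio, 1))
```

With the two scores from the abstract, the ratio works out to roughly 93%, consistent with the "more than 90%" claim.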

research
∙ 04/13/2022

Multilingual Language Model Adaptive Fine-Tuning: A Study on African Languages

Multilingual pre-trained language models (PLMs) have demonstrated impres...
research
∙ 09/11/2023

Analysing Cross-Lingual Transfer in Low-Resourced African Named Entity Recognition

Transfer learning has led to large gains in performance for nearly all N...
research
∙ 05/22/2023

Automated stance detection in complex topics and small languages: the challenging case of immigration in polarizing news media

Automated stance detection and related machine learning methods can prov...
research
∙ 11/21/2022

TEMPERA: Test-Time Prompting via Reinforcement Learning

Careful prompt design is critical to the use of large language models in...
research
∙ 01/30/2020

Parameter Space Factorization for Zero-Shot Learning across Tasks and Languages

Most combinations of NLP tasks and language varieties lack in-domain exa...
research
∙ 08/01/2021

Improving Social Meaning Detection with Pragmatic Masking and Surrogate Fine-Tuning

Masked language models (MLMs) are pretrained with a denoising objective ...
research
∙ 08/06/2021

Towards Zero-shot Language Modeling

Can we construct a neural model that is inductively biased towards learn...
