MasakhaPOS: Part-of-Speech Tagging for Typologically Diverse African Languages

05/23/2023
by Cheikh M. Bamba Dione, et al.

In this paper, we present MasakhaPOS, the largest part-of-speech (POS) dataset for 20 typologically diverse African languages. We discuss the challenges of annotating POS for these languages under the Universal Dependencies (UD) guidelines. We run extensive POS baseline experiments using conditional random fields (CRFs) and several multilingual pre-trained language models. We also apply various cross-lingual transfer models trained on data available in UD. Evaluating on the MasakhaPOS dataset, we show that choosing the best transfer language(s), in both single-source and multi-source setups, greatly improves POS tagging performance on the target languages, in particular when combined with cross-lingual parameter-efficient fine-tuning methods. Crucially, transferring knowledge from a language that matches the target's language family and morphosyntactic properties appears to be more effective for POS tagging in unseen languages.


