Can Character-based Language Models Improve Downstream Task Performance in Low-Resource and Noisy Language Scenarios?

10/26/2021
by Arij Riabi, et al.

Recent impressive improvements in NLP, largely driven by the success of contextual neural language models, have mostly been demonstrated on at most a couple dozen high-resource languages. Building language models and, more generally, NLP systems for non-standardized and low-resource languages remains a challenging task. In this work, we focus on NArabizi, a North African colloquial dialectal Arabic written in an extension of the Latin script and found mostly on social media and in messaging communication. In this low-resource scenario, with data displaying a high level of variability, we compare the downstream performance of a character-based language model on part-of-speech tagging and dependency parsing to that of monolingual and multilingual models. We show that a character-based model trained on only 99k sentences of NArabizi and fine-tuned on a small treebank of this language leads to performance close to that obtained with the same architecture pre-trained on large multilingual and monolingual models. Confirming these results on a much larger data set of noisy French user-generated content, we argue that such character-based language models can be an asset for NLP in low-resource and high-language-variability settings.
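A minimal sketch (not the authors' code) of the idea behind character-based modeling of noisy text: because the vocabulary is the set of characters rather than words or subwords, spelling variants of the same word share the same small symbol inventory, and unseen spellings never fall out of vocabulary. The corpus and the NArabizi-like spellings below are illustrative assumptions, not data from the paper.

```python
def build_char_vocab(sentences):
    """Map every character seen in the corpus to an integer id.
    Id 0 is reserved for unknown characters."""
    vocab = {"<unk>": 0}
    for sent in sentences:
        for ch in sent:
            vocab.setdefault(ch, len(vocab))
    return vocab

def encode(sentence, vocab):
    """Encode a sentence as a sequence of character ids."""
    return [vocab.get(ch, vocab["<unk>"]) for ch in sentence]

# Three attested-style spellings of the same NArabizi word ("a lot"):
# a word-level vocabulary would treat them as three unrelated tokens,
# while the character vocabulary covers all of them with five symbols.
corpus = ["bezzaf", "bzzaf", "bezaf"]
vocab = build_char_vocab(corpus)
print(len(vocab))             # 6: <unk> plus b, e, z, a, f
print(encode("bzaf", vocab))  # a fourth, unseen spelling still encodes
                              # without any <unk> id: [1, 3, 4, 5]
```

This vocabulary-free property is what lets a character-based model cope with the high spelling variability of user-generated content without massive pre-training data.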


Related research

09/19/2019: Low-Resource Parsing with Crosslingual Contextualized Representations
05/01/2020: Can Multilingual Language Models Transfer to an Unseen Dialect? A Case Study on North African Arabizi
11/18/2021: To Augment or Not to Augment? A Comparative Study on Text Augmentation Techniques for Low-Resource NLP
10/20/2020: Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages
04/20/2023: Multi-aspect Repetition Suppression and Content Moderation of Large Language Models
11/08/2022: Parameter and Data Efficient Continual Pre-training for Robustness to Dialectal Variance in Arabic
04/22/2022: A Vocabulary-Free Multilingual Neural Tokenizer for End-to-End Task Learning
