Evaluating Language Model Finetuning Techniques for Low-resource Languages

Unlike mainstream languages (such as English and French), low-resource languages often lack expert-annotated corpora and benchmark resources, which makes it hard to apply state-of-the-art techniques directly. In this paper, we alleviate this scarcity problem for Filipino, a low-resource language, in two ways. First, we introduce a new benchmark language modeling dataset in Filipino, which we call WikiText-TL-39. Second, we show that language model finetuning techniques such as BERT and ULMFiT can be used to consistently train robust classifiers in low-resource settings: when finetuning on a privately held sentiment dataset, validation error increases by at most 0.0782 when the number of training examples is reduced from 10K to 1K.
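
The reported figure compares validation error at two training-set sizes. A minimal sketch of that arithmetic follows, assuming "validation error" means the misclassification rate and that the increase is an absolute difference; the abstract does not state either. The helper names (`subsample`, `validation_error`, `error_increase`) and the toy labels and predictions are hypothetical and are not taken from the paper.

```python
import random


def subsample(examples, n, seed=0):
    """Draw n training examples without replacement, reproducibly."""
    rng = random.Random(seed)
    return rng.sample(examples, n)


def validation_error(predictions, labels):
    """Fraction of validation examples that were misclassified (assumed definition)."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    wrong = sum(p != y for p, y in zip(predictions, labels))
    return wrong / len(labels)


def error_increase(error_full, error_reduced):
    """Absolute increase in validation error after shrinking the training set."""
    return error_reduced - error_full


# Toy illustration only: these labels and predictions are made up,
# they are NOT results from the paper.
labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
preds_10k = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]  # hypothetical model finetuned on 10K examples
preds_1k = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]   # hypothetical model finetuned on 1K examples

err_10k = validation_error(preds_10k, labels)
err_1k = validation_error(preds_1k, labels)
increase = error_increase(err_10k, err_1k)
```

In an actual experiment, `subsample` would be applied to the training split before finetuning at each size, and both models would be scored on the same held-out validation set so that the difference reflects only the change in training data.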

research
05/10/2022

The Importance of Context in Very Low Resource Language Modeling

This paper investigates very low resource language model pretraining, wh...
research
03/30/2023

A BERT-based Unsupervised Grammatical Error Correction Framework

Grammatical error correction (GEC) is a challenging task of natural lang...
research
04/09/2015

Leveraging Twitter for Low-Resource Conversational Speech Language Modeling

In applications involving conversational speech, data sparsity is a limi...
research
10/20/2020

Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages

Spelling normalization for low resource languages is a challenging task ...
research
01/25/2023

FewShotTextGCN: K-hop neighborhood regularization for few-shot learning on graphs

We present FewShotTextGCN, a novel method designed to effectively utiliz...
research
06/26/2023

Uncovering Political Hate Speech During Indian Election Campaign: A New Low-Resource Dataset and Baselines

The detection of hate speech in political discourse is a critical issue,...
research
04/07/2023

BenCoref: A Multi-Domain Dataset of Nominal Phrases and Pronominal Reference Annotations

Coreference Resolution is a well studied problem in NLP. While widely st...
