Establishing Baselines for Text Classification in Low-Resource Languages

While transformer-based finetuning techniques have proven effective in tasks that involve low-resource, low-data environments, a lack of properly established baselines and benchmark datasets makes it hard to compare different approaches aimed at the low-resource setting. In this work, we make three contributions. First, we introduce two previously unreleased datasets as benchmark datasets for text classification and low-resource multilabel text classification for the low-resource language Filipino. Second, we pretrain better BERT and DistilBERT models for use within the Filipino setting. Third, we introduce a simple degradation test that benchmarks a model's resistance to performance degradation as the number of training samples is reduced. We analyze our pretrained models' degradation speeds and look towards the use of this method for comparing models aimed at operating within the low-resource setting. We release all our models and datasets for the research community to use.
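The abstract does not spell out how the degradation test is computed, so the following is only a minimal sketch of the general idea: train on progressively smaller fractions of the training set and record how far the score falls relative to the full-data baseline. The function names (`degradation_curve`, `fit_word_votes`, `accuracy`), the fractions, the relative-drop formula, and the toy word-voting classifier are all assumptions made for illustration, not the authors' implementation or data.

```python
import random
from collections import Counter


def degradation_curve(train_data, fit, score, fractions=(1.0, 0.5, 0.25, 0.1), seed=0):
    """Return (fraction, score, relative_drop) per training-set fraction.

    relative_drop is measured against the score at the largest fraction.
    This definition is an assumption, not taken from the paper.
    """
    shuffled = list(train_data)
    random.Random(seed).shuffle(shuffled)
    results, baseline = [], None
    for frac in sorted(fractions, reverse=True):
        n = max(1, int(len(shuffled) * frac))  # keep at least one sample
        s = score(fit(shuffled[:n]))
        if baseline is None:
            baseline = s
        drop = (baseline - s) / baseline if baseline else 0.0
        results.append((frac, s, drop))
    return results


# Hypothetical stand-in for a real model: each word votes for the labels it was seen with.
def fit_word_votes(samples):
    votes = {}
    for text, label in samples:
        for word in text.split():
            votes.setdefault(word, Counter())[label] += 1
    return votes


def predict(votes, text):
    tally = Counter()
    for word in text.split():
        tally.update(votes.get(word, {}))
    return tally.most_common(1)[0][0] if tally else None


def accuracy(votes, test_data):
    hits = sum(predict(votes, text) == label for text, label in test_data)
    return hits / len(test_data)


# Toy data, invented purely for this example.
TRAIN = [
    ("good great film", "pos"), ("great acting", "pos"),
    ("good story", "pos"), ("lovely great cast", "pos"),
    ("bad awful film", "neg"), ("awful acting", "neg"),
    ("bad story", "neg"), ("boring awful cast", "neg"),
]
TEST = [("good film", "pos"), ("great cast", "pos"),
        ("bad film", "neg"), ("awful story", "neg")]

curve = degradation_curve(TRAIN, fit_word_votes, lambda m: accuracy(m, TEST))
for frac, acc, drop in curve:
    print(f"fraction={frac:.2f} accuracy={acc:.2f} relative_drop={drop:.2f}")
```

In a real comparison, `fit` would wrap finetuning of a pretrained model and `score` would evaluate on a fixed held-out set, so that the curves of two models can be compared at the same fractions.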


