Improving Large-scale Language Models and Resources for Filipino

In this paper, we improve on existing language resources for the low-resource Filipino language in two ways. First, we outline the construction of the TLUnified dataset, a large-scale pretraining corpus that improves on smaller existing pretraining datasets for the language in terms of scale and topic variety. Second, we pretrain new Transformer language models following the RoBERTa pretraining technique to supplant existing models trained with small corpora. Our new RoBERTa models show significant improvements over existing Filipino models on three benchmark datasets of varying difficulty, with an average gain of 4.47%.
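To make the pretraining setup concrete, the sketch below shows how a RoBERTa-style masked language model could be pretrained on a Filipino corpus with the Hugging Face Transformers library. The tokenizer path, corpus file name, and hyperparameters are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch of RoBERTa-style masked-language-model pretraining.
# Paths and hyperparameters are placeholders, not the paper's settings.
from transformers import (
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# Assumed: a tokenizer already trained on the Filipino pretraining corpus.
tokenizer = RobertaTokenizerFast.from_pretrained("path/to/filipino-tokenizer")

# RoBERTa-base-sized configuration, trained from scratch.
config = RobertaConfig(
    vocab_size=tokenizer.vocab_size,
    max_position_embeddings=514,
    num_hidden_layers=12,
    num_attention_heads=12,
    hidden_size=768,
)
model = RobertaForMaskedLM(config)

# "tlunified.txt" is a placeholder file name: one document per line.
dataset = load_dataset("text", data_files={"train": "tlunified.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# Dynamic masking as in RoBERTa: 15% of tokens are masked each batch.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="roberta-filipino",
        per_device_train_batch_size=8,
        num_train_epochs=1,
    ),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
```

The resulting checkpoint could then be fine-tuned on downstream Filipino classification benchmarks in the usual way (e.g., via a sequence-classification head).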


