WangchanBERTa: Pretraining transformer-based Thai Language Models

01/24/2021
by Lalita Lowphansirikul, et al.

Transformer-based language models, more specifically BERT-based architectures, have achieved state-of-the-art performance in many downstream tasks. However, for a relatively low-resource language such as Thai, the choices are limited to training a BERT-based model on a much smaller dataset or finetuning multi-lingual models, both of which yield suboptimal downstream performance. Moreover, large-scale multi-lingual pretraining does not take into account language-specific features of Thai. To overcome these limitations, we pretrain a language model based on the RoBERTa-base architecture on a large, deduplicated, cleaned training set (78.5GB in total size), curated from diverse domains of social media posts, news articles and other publicly available datasets. We apply text processing rules that are specific to Thai, most importantly preserving spaces, which serve as important chunk and sentence boundaries in Thai, before subword tokenization. We also experiment with word-level, syllable-level and SentencePiece tokenization on a smaller dataset to explore the effects of tokenization on downstream performance. Our model wangchanberta-base-att-spm-uncased, trained on the 78.5GB dataset, outperforms strong baselines (NBSVM, CRF and ULMFit) and multi-lingual models (XLMR and mBERT) on both sequence classification and token classification tasks in human-annotated, mono-lingual contexts.
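
As a rough illustration of how a checkpoint like this is typically consumed, the sketch below runs masked-token prediction with the Hugging Face transformers library. It is a minimal sketch, not code from the paper: the Hub model ID "airesearch/wangchanberta-base-att-spm-uncased", the fill-mask pipeline usage, and the Thai example sentence are assumptions made here for illustration.

```python
# Minimal sketch (assumptions: the checkpoint is published on the Hugging Face Hub
# as "airesearch/wangchanberta-base-att-spm-uncased"; transformers and sentencepiece
# are installed).
from transformers import pipeline

MODEL_ID = "airesearch/wangchanberta-base-att-spm-uncased"

# Masked-language-model pipeline; the tokenizer is SentencePiece-based and
# lowercases input, matching the "spm" and "uncased" parts of the model name.
fill_mask = pipeline(task="fill-mask", model=MODEL_ID, tokenizer=MODEL_ID)

# Hypothetical Thai example ("I really like to eat <mask>"). Spaces in Thai input
# should be kept intact, since the abstract notes they mark chunk and sentence
# boundaries; the mask token follows the tokenizer's own convention.
print(fill_mask("ผมชอบกิน<mask>มาก"))
```

For downstream sequence or token classification, the same checkpoint would typically be loaded through the corresponding AutoModelFor* classes and finetuned on the task data.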
