IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effective Domain-Specific Vocabulary Initialization

09/10/2021
by Fajri Koto, et al.

We present IndoBERTweet, the first large-scale pretrained model for Indonesian Twitter, trained by extending a monolingually trained Indonesian BERT model with additive domain-specific vocabulary. We focus in particular on efficient model adaptation under vocabulary mismatch, and benchmark different ways of initializing the BERT embedding layer for new word types. We find that initializing new word types with the average BERT subword embedding makes pretraining five times faster, and is more effective than previously proposed vocabulary adaptation methods in terms of extrinsic evaluation over seven Twitter-based datasets.
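The vocabulary initialization trick described in the abstract is straightforward to reproduce. Below is a minimal sketch, assuming the Hugging Face transformers library; the base checkpoint name and the example Twitter tokens are illustrative placeholders, not the authors' exact setup.

```python
# Minimal sketch of average-subword embedding initialization for new
# domain-specific vocabulary (assumes Hugging Face `transformers`;
# checkpoint and token names below are illustrative).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

base = "indobenchmark/indobert-base-p1"   # hypothetical Indonesian BERT base
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

new_tokens = ["wkwk", "gaes"]             # hypothetical Twitter word types

# Tokenize each new word with the ORIGINAL vocabulary first, so we know
# which existing subword embeddings to average.
subword_ids = [
    tokenizer(w, add_special_tokens=False)["input_ids"] for w in new_tokens
]

# Extend the vocabulary and grow the embedding matrix accordingly.
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

emb = model.get_input_embeddings()        # torch.nn.Embedding
with torch.no_grad():
    for word, ids in zip(new_tokens, subword_ids):
        new_id = tokenizer.convert_tokens_to_ids(word)
        # Initialize the new row as the mean of its subword embeddings,
        # the strategy the abstract reports as making pretraining ~5x faster.
        emb.weight[new_id] = emb.weight[torch.tensor(ids)].mean(dim=0)
```

Continued pretraining on in-domain tweets would then proceed as usual; only the initialization of the new embedding rows differs from a random-init baseline.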

Related Research

01/19/2022 · TourBERT: A pretrained language model for the tourism industry
The Bidirectional Encoder Representations from Transformers (BERT) is cu...

10/26/2021 · AVocaDo: Strategy for Adapting Vocabulary to Downstream Domain
During the fine-tuning phase of transfer learning, the pretrained vocabu...

03/12/2021 · Comparing the Performance of NLP Toolkits and Evaluation measures in Legal Tech
Recent developments in Natural Language Processing have led to the intro...

05/15/2020 · COVID-Twitter-BERT: A Natural Language Processing Model to Analyse COVID-19 Content on Twitter
In this work, we release COVID-Twitter-BERT (CT-BERT), a transformer-bas...

09/14/2021 · MDAPT: Multilingual Domain Adaptive Pretraining in a Single Model
Domain adaptive pretraining, i.e. the continued unsupervised pretraining...

12/21/2020 · Cross-domain Retrieval in the Legal and Patent Domains: a Reproducibility Study
Domain specific search has always been a challenging information retriev...

10/20/2020 · CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters
Due to the compelling improvements brought by BERT, many recent represen...

Code Repositories

IndoBERTweet

IndoBERTweet is the first large-scale pretrained model for Indonesian Twitter. Published at EMNLP 2021 (main conference).

