KR-BERT: A Small-Scale Korean-Specific Language Model

08/10/2020
by Sangah Lee, et al.

Since the appearance of BERT, recent works such as XLNet and RoBERTa rely on sentence embedding models pre-trained on large corpora with a large number of parameters. Because such models demand substantial hardware and huge amounts of data, they take a long time to pre-train. It is therefore important to build smaller models that perform comparably. In this paper, we trained a Korean-specific model, KR-BERT, utilizing a smaller vocabulary and dataset. Since Korean is a morphologically rich, low-resource language written in a non-Latin alphabet, it is also important to capture language-specific linguistic phenomena that the Multilingual BERT model misses. We tested several tokenizers, including our BidirectionalWordPiece Tokenizer, and adjusted the minimal span of tokens for tokenization, ranging from the sub-character level to the character level, to construct a better vocabulary for our model. With those adjustments, our KR-BERT model performed comparably to, and in some cases better than, other existing pre-trained models while using a corpus about 1/10 the size.
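The Python sketch below is meant only to illustrate the tokenization choices the abstract refers to: character-level units keep each Hangul syllable block intact, sub-character (jamo) units decompose a syllable via Unicode NFD normalization, and a greedy WordPiece pass then segments the unit sequence against a vocabulary. The toy vocabulary and the forward-only greedy matcher are illustrative assumptions, not the released KR-BERT vocabulary or its BidirectionalWordPiece Tokenizer, which also considers matching from the opposite direction.

```python
import unicodedata

def char_units(text):
    # Character-level units: each Hangul syllable block (e.g. '한') stays intact.
    return list(text)

def subchar_units(text):
    # Sub-character (jamo) units: NFD normalization splits each syllable into
    # its leading consonant, vowel, and optional trailing consonant.
    return list(unicodedata.normalize("NFD", text))

def greedy_wordpiece(units, vocab, unk="[UNK]"):
    # Plain forward (longest-prefix-first) WordPiece over a list of units.
    # Continuation pieces carry the usual '##' prefix.
    pieces, start = [], 0
    while start < len(units):
        end, match = len(units), None
        while end > start:
            piece = "".join(units[start:end])
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]
        pieces.append(match)
        start = end
    return pieces

word = "한국어"                        # "Korean (language)"
print(char_units(word))                # ['한', '국', '어']
print(subchar_units(word))             # ['ᄒ', 'ᅡ', 'ᆫ', 'ᄀ', 'ᅮ', 'ᆨ', 'ᄋ', 'ᅥ']

toy_vocab = {"한국", "##어", "한", "##국어"}          # hypothetical vocabulary entries
print(greedy_wordpiece(char_units(word), toy_vocab))  # ['한국', '##어']
```

Choosing character or sub-character units changes what the WordPiece vocabulary can represent: jamo units yield far fewer base symbols and finer segmentation, while syllable units keep sequences shorter, which is the trade-off the paper explores when constructing its vocabulary.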


