KR-BERT: A Small-Scale Korean-Specific Language Model

by Sangah Lee, et al.

Since the appearance of BERT, recent works such as XLNet and RoBERTa have relied on sentence-embedding models pre-trained on large corpora with large numbers of parameters. Because such models demand substantial hardware and huge amounts of data, they take a long time to pre-train. It is therefore important to build smaller models that perform comparably. In this paper, we train a Korean-specific model, KR-BERT, using a smaller vocabulary and dataset. Since Korean is a morphologically rich, low-resource language written in a non-Latin alphabet, it is also important to capture language-specific linguistic phenomena that the multilingual BERT model misses. We tested several tokenizers, including our BidirectionalWordPiece tokenizer, and adjusted the minimal span of tokens, ranging from the sub-character level to the character level, to construct a better vocabulary for our model. With these adjustments, our KR-BERT model performed comparably to, and in some cases better than, other existing pre-trained models while using a corpus about 1/10 the size.
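The sub-character level mentioned in the abstract refers to decomposing precomposed Hangul syllable blocks into their constituent jamo (initial consonant, vowel, final consonant). As a minimal sketch of what such a decomposition involves (this is standard Unicode arithmetic, not the authors' actual tokenizer), each syllable in the range U+AC00..U+D7A3 can be split algorithmically:

```python
# Decompose precomposed Hangul syllables into jamo (sub-characters).
# Uses the standard Unicode composition formula:
#   syllable = 0xAC00 + (lead * 21 + vowel) * 28 + tail
LEADS = [chr(0x1100 + i) for i in range(19)]          # initial consonants
VOWELS = [chr(0x1161 + i) for i in range(21)]         # medial vowels
TAILS = [""] + [chr(0x11A8 + i) for i in range(27)]   # optional final consonants

def decompose(ch: str) -> str:
    """Return the jamo sequence for one Hangul syllable, or ch unchanged."""
    code = ord(ch) - 0xAC00
    if not 0 <= code < 11172:          # not a precomposed Hangul syllable
        return ch
    lead, rest = divmod(code, 588)     # 588 = 21 vowels * 28 tails
    vowel, tail = divmod(rest, 28)
    return LEADS[lead] + VOWELS[vowel] + TAILS[tail]

def to_subcharacters(text: str) -> str:
    """Map a string to sub-character (jamo) level, leaving other text as-is."""
    return "".join(decompose(ch) for ch in text)
```

For example, `decompose('한')` yields the three jamo ㅎ + ㅏ + ㄴ. A sub-character vocabulary built over such sequences is far smaller than a syllable-level one, which is one way a Korean-specific model can get by with a reduced vocabulary.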




Code Repositories


KoRean based BERT pre-trained models (KR-BERT) for Tensorflow and PyTorch
