KR-BERT: A Small-Scale Korean-Specific Language Model

08/10/2020
by Sangah Lee, et al.

Since the appearance of BERT, recent works such as XLNet and RoBERTa rely on sentence embedding models pre-trained on large corpora with a large number of parameters. Because such models require large amounts of hardware and data, they take a long time to pre-train. It is therefore important to build smaller models that perform comparably. In this paper, we train a Korean-specific model, KR-BERT, using a smaller vocabulary and dataset. Since Korean is a morphologically rich, low-resource language written in a non-Latin alphabet, it is also important to capture language-specific linguistic phenomena that the multilingual BERT model misses. We tested several tokenizers, including our BidirectionalWordPiece tokenizer, and adjusted the minimal span of tokens, ranging from the sub-character level to the character level, to construct a better vocabulary for our model. With these adjustments, our KR-BERT model performed comparably to, and in some cases better than, other existing pre-trained models, while using a corpus about one-tenth the size.
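The sub-character granularity mentioned above can be illustrated with standard Unicode decomposition of Hangul syllables into jamo. The sketch below is only an illustration of that granularity, not the paper's BidirectionalWordPiece tokenizer; the function name and structure are hypothetical.

# Minimal sketch of sub-character (jamo) decomposition for Hangul syllables.
# This is NOT the paper's tokenizer, only the standard Unicode arithmetic that
# a sub-character-level vocabulary could be built on.

LEADS = [chr(0x1100 + i) for i in range(19)]          # initial consonants
VOWELS = [chr(0x1161 + i) for i in range(21)]         # medial vowels
TAILS = [""] + [chr(0x11A8 + i) for i in range(27)]   # optional final consonants

def to_subchars(text: str) -> list[str]:
    """Split precomposed Hangul syllables into jamo; leave other characters as-is."""
    out = []
    for ch in text:
        code = ord(ch)
        if 0xAC00 <= code <= 0xD7A3:          # precomposed Hangul syllable block
            index = code - 0xAC00
            out.append(LEADS[index // 588])           # 588 = 21 vowels * 28 tails
            out.append(VOWELS[(index % 588) // 28])
            tail = TAILS[index % 28]
            if tail:
                out.append(tail)
        else:
            out.append(ch)                    # character-level fallback
    return out

print(to_subchars("한국어"))  # ['ᄒ', 'ᅡ', 'ᆫ', 'ᄀ', 'ᅮ', 'ᆨ', 'ᄋ', 'ᅥ']

Character-level tokenization, by contrast, would keep each syllable block (한, 국, 어) intact; the paper's vocabulary experiments vary exactly this minimal span.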



Code Repositories

KR-BERT

KoRean based BERT pre-trained models (KR-BERT) for TensorFlow and PyTorch

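A minimal sketch of loading a released checkpoint with the Hugging Face transformers library is shown below. The model identifier "snunlp/KR-BERT-char16424" is an assumption based on the repository's naming and may differ; consult the repository README for the exact identifiers and tokenizer classes.

# Hedged sketch: loading a KR-BERT checkpoint with Hugging Face transformers.
# The model id below is an assumption and may not match the actual release.
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("snunlp/KR-BERT-char16424")
model = BertModel.from_pretrained("snunlp/KR-BERT-char16424")

inputs = tokenizer("한국어 문장을 입력합니다.", return_tensors="pt")  # "Input a Korean sentence."
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size)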