EstBERT: A Pretrained Language-Specific BERT for Estonian

11/09/2020
by   Hasan Tanvir, et al.

This paper presents EstBERT, a large pretrained transformer-based language-specific BERT model for Estonian. Recent work has evaluated multilingual BERT models on Estonian tasks and found them to outperform the baselines. Still, based on existing studies of other languages, a language-specific BERT model is expected to improve over the multilingual ones. We first describe the EstBERT pretraining process and then present the results of models based on finetuned EstBERT for multiple NLP tasks, including POS and morphological tagging, named entity recognition, and text classification. The evaluation results show that the models based on EstBERT outperform the multilingual BERT models on five tasks out of six, providing further evidence that training language-specific BERT models is still useful, even when multilingual models are available.
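Since EstBERT exposes the standard BERT interface, the kind of downstream finetuning described in the abstract can be sketched with the Hugging Face transformers library. The sketch below is not the paper's training code: the Hub model id tartuNLP/EstBERT, the toy label set, and the single-sentence example are assumptions for illustration only.

```python
# Minimal sketch: finetuning a pretrained Estonian BERT for token classification
# (e.g. POS or NER tagging). The model id "tartuNLP/EstBERT" and the toy data
# below are assumptions for illustration; substitute your own checkpoint/dataset.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_ID = "tartuNLP/EstBERT"          # assumed Hugging Face Hub id
LABELS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]  # toy NER label set

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_ID, num_labels=len(LABELS)
)

# One toy training step on a single pre-tokenized sentence with word-level labels.
words = ["Tartu", "Ülikool", "asub", "Eestis", "."]
word_labels = ["B-LOC", "I-LOC", "O", "B-LOC", "O"]

encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt")

# Align word-level labels to subword tokens; special tokens get -100 (ignored in the loss).
aligned = []
for word_id in encoding.word_ids(batch_index=0):
    aligned.append(-100 if word_id is None else LABELS.index(word_labels[word_id]))
labels = torch.tensor([aligned])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
loss = model(**encoding, labels=labels).loss
loss.backward()
optimizer.step()
print(f"toy training-step loss: {loss.item():.4f}")
```

In practice each downstream task (POS/morphological tagging, NER, text classification) would use its own label set, a full annotated corpus, and batched training rather than the single step shown here.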

Related research

10/01/2020 - Evaluating Multilingual BERT for Estonian
Recently, large pre-trained language models, such as BERT, have reached ...

08/22/2021 - UzBERT: pretraining a BERT model for Uzbek
Pretrained language models based on the Transformer architecture have ac...

12/15/2019 - Multilingual is not enough: BERT for Finnish
Deep learning-based language models pretrained on large unannotated text...

07/03/2020 - Playing with Words at the National Library of Sweden – Making a Swedish BERT
This paper introduces the Swedish BERT ("KB-BERT") developed by the KBLa...

05/18/2023 - mLongT5: A Multilingual and Efficient Text-To-Text Transformer for Longer Sequences
We present our work on developing a multilingual, efficient text-to-text...

04/10/2022 - Breaking Character: Are Subwords Good Enough for MRLs After All?
Large pretrained language models (PLMs) typically tokenize the input str...

06/04/2019 - Sequence Tagging with Contextual and Non-Contextual Subword Representations: A Multilingual Evaluation
Pretrained contextual and non-contextual subword embeddings have become ...
