HerBERT: Efficiently Pretrained Transformer-based Language Model for Polish

05/04/2021
by Robert Mroczkowski, et al.

BERT-based models are currently used for solving nearly all Natural Language Processing (NLP) tasks and most often achieve state-of-the-art results. Therefore, the NLP community conducts extensive research on understanding these models, but above all on designing effective and efficient training procedures. Several ablation studies investigating how to train BERT-like models have been carried out, but the vast majority of them concerned only the English language. A training procedure designed for English is not necessarily universal or applicable to other, especially typologically different, languages. Therefore, this paper presents the first ablation study focused on Polish, which, unlike English (an isolating language), is a fusional language. We design and thoroughly evaluate a pretraining procedure for transferring knowledge from multilingual to monolingual BERT-based models. In addition to multilingual model initialization, other factors that possibly influence pretraining are also explored, i.e., the training objective, corpus size, BPE-Dropout, and pretraining length. Based on the proposed procedure, a Polish BERT-based language model, HerBERT, is trained. This model achieves state-of-the-art results on multiple downstream tasks.
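The central idea of the procedure, initializing a monolingual model from a multilingual checkpoint, can be sketched with the Hugging Face transformers library. The sketch below illustrates the general technique rather than the authors' exact recipe: mBERT (bert-base-multilingual-cased) is used as the source, and the released HerBERT tokenizer (assumed here to be available on the Hugging Face Hub as allegro/herbert-base-cased) stands in for the target Polish vocabulary.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer, BertConfig, BertForMaskedLM

SRC = "bert-base-multilingual-cased"
src_model = AutoModelForMaskedLM.from_pretrained(SRC)
src_tok = AutoTokenizer.from_pretrained(SRC)

# Target Polish tokenizer; the released HerBERT tokenizer is assumed here,
# but any Polish subword tokenizer would do for this sketch.
tgt_tok = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")

# Same architecture as the multilingual model, but sized for the Polish vocabulary.
config = BertConfig.from_pretrained(SRC, vocab_size=len(tgt_tok))
tgt_model = BertForMaskedLM(config)

# The Transformer encoder and position embeddings transfer unchanged;
# only the token vocabulary differs between source and target.
tgt_model.bert.encoder.load_state_dict(src_model.bert.encoder.state_dict())
tgt_model.bert.embeddings.position_embeddings.load_state_dict(
    src_model.bert.embeddings.position_embeddings.state_dict()
)

# Reuse embedding rows of subwords present in both vocabularies; the remaining
# rows keep their random initialization and are learned during Polish pretraining.
src_vocab = src_tok.get_vocab()
with torch.no_grad():
    src_emb = src_model.bert.embeddings.word_embeddings.weight
    tgt_emb = tgt_model.bert.embeddings.word_embeddings.weight
    shared = 0
    for token, tgt_id in tgt_tok.get_vocab().items():
        src_id = src_vocab.get(token)
        if src_id is not None:
            tgt_emb[tgt_id].copy_(src_emb[src_id])
            shared += 1
print(f"Reused {shared} of {len(tgt_tok)} embedding rows from {SRC}")
```

Because mBERT uses WordPiece while HerBERT uses BPE, the string-level overlap between the two vocabularies is only approximate; the point of the warm start is that the encoder and any shared subword embeddings do not have to be relearned from scratch, which is consistent with the efficiency emphasized in the title.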

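BPE-Dropout, another factor examined in the ablation, can be demonstrated in a few lines with the Hugging Face tokenizers library. This is a minimal, self-contained sketch: the toy corpus, vocabulary size, and dropout rate are illustrative placeholders, not the paper's actual training configuration.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Tiny stand-in corpus; the paper uses a large Polish corpus.
corpus = ["językoznawstwo przetwarzanie języka naturalnego"] * 1000

# dropout=0.1 makes each learned merge be skipped with 10% probability at
# encoding time, so the same word can be segmented differently across passes.
tokenizer = Tokenizer(BPE(unk_token="[UNK]", dropout=0.1))
tokenizer.pre_tokenizer = Whitespace()
tokenizer.train_from_iterator(
    corpus, BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
)

# Two encodings of the same word may yield different subword sequences.
print(tokenizer.encode("językoznawstwo").tokens)
print(tokenizer.encode("językoznawstwo").tokens)
```

The stochastic segmentation acts as a subword-level regularizer: during pretraining the model sees multiple segmentations of the same word instead of memorizing a single fixed one.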
Related research

06/14/2020
FinEst BERT and CroSloEngual BERT: less is more in multilingual models
Large pretrained masked language models have become state-of-the-art sol...

08/22/2021
UzBERT: pretraining a BERT model for Uzbek
Pretrained language models based on the Transformer architecture have ac...

05/09/2023
What is the best recipe for character-level encoder-only modelling?
This paper aims to benchmark recent progress in language understanding m...

06/30/2021
The MultiBERTs: BERT Reproductions for Robustness Analysis
Experiments with pretrained models such as BERT are often based on a sin...

05/24/2021
RobeCzech: Czech RoBERTa, a monolingual contextualized language representation model
We present RobeCzech, a monolingual RoBERTa language representation mode...

02/28/2020
AraBERT: Transformer-based Model for Arabic Language Understanding
The Arabic language is a morphologically rich and complex language with ...

07/26/2019
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Language model pretraining has led to significant performance gains but ...
