The birth of Romanian BERT

09/18/2020
by Stefan Daniel Dumitrescu, et al.

Large-scale pretrained language models have become ubiquitous in Natural Language Processing. However, most of these models are available either in high-resource languages, in particular English, or as multilingual models that compromise performance on individual languages for coverage. This paper introduces Romanian BERT, the first purely Romanian transformer-based language model, pretrained on a large text corpus. We discuss corpus composition and cleaning, the model training process, as well as an extensive evaluation of the model on various Romanian datasets. We open source not only the model itself, but also a repository that contains information on how to obtain the corpus, fine-tune and use this model in production (with practical examples), and how to fully replicate the evaluation process.
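As a minimal sketch of the "use this model in production" workflow the abstract mentions, the snippet below loads a Romanian BERT checkpoint through the HuggingFace Transformers library and encodes a Romanian sentence. The model id `dumitrescustefan/bert-base-romanian-cased-v1` is the one the authors published on the Hub, but you should verify it against the official repository; the example sentence is illustrative only.

```python
# Sketch: encoding Romanian text with Romanian BERT via HuggingFace Transformers.
# Assumes the published Hub id "dumitrescustefan/bert-base-romanian-cased-v1";
# check the authors' repository for the canonical model name.
from transformers import AutoTokenizer, AutoModel

model_id = "dumitrescustefan/bert-base-romanian-cased-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Tokenize a Romanian sentence and run it through the encoder.
inputs = tokenizer("Acesta este un exemplu.", return_tensors="pt")
outputs = model(**inputs)

# outputs.last_hidden_state has shape (batch, sequence_length, hidden_size),
# where hidden_size is 768 for a BERT-base architecture.
print(outputs.last_hidden_state.shape)
```

For fine-tuning on a downstream task, the same model id can be passed to a task-specific class such as `AutoModelForSequenceClassification` instead of `AutoModel`.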


