Comparison of Czech Transformers on Text Classification Tasks

07/21/2021
by Jan Lehečka, et al.

In this paper, we present our progress in pre-training monolingual Transformers for Czech and contribute to the research community by releasing our models to the public. The need for such models emerged from our effort to employ Transformers in our language-specific tasks, but we found the performance of the published multilingual models to be very limited. Since multilingual models are usually pre-trained on 100+ languages, most low-resource languages (including Czech) are under-represented in them. At the same time, a huge amount of monolingual training data is available in web archives such as Common Crawl. We have pre-trained and publicly released two monolingual Czech Transformers and compared them with relevant public models trained (at least partially) on Czech. The paper presents the Transformers' pre-training procedure as well as a comparison of the pre-trained models on text classification tasks from various domains.
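The under-representation argument can be made concrete: when a shared multilingual vocabulary allots few subword units to a language, that language's words fragment into many more pieces (higher "fertility"), wasting model context. Below is a minimal, illustrative sketch with a toy greedy longest-match subword tokenizer; both vocabularies are invented for the example and do not come from the paper's models.

```python
def greedy_subword_tokenize(word, vocab):
    """Split a word into subwords by repeated longest-prefix matching.

    Falls back to a single character when no vocabulary entry matches,
    mimicking the character fallback of WordPiece-style tokenizers.
    """
    pieces = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                pieces.append(piece)
                i = j
                break
    return pieces


# Toy vocabularies (hypothetical): a "monolingual" vocabulary that learned
# whole Czech subwords vs. a "multilingual" one where Czech is left with
# single characters only.
mono_vocab = {"předseda", "vlád", "y"}
multi_vocab = set("předsedavlády")

word = "předseda"
print(greedy_subword_tokenize(word, mono_vocab))   # -> ['předseda'], fertility 1
print(greedy_subword_tokenize(word, multi_vocab))  # -> 8 single characters
```

The ratio of subword pieces per word is one simple proxy for how well a vocabulary covers a language; a monolingual tokenizer typically yields far lower fertility on its language than a 100+-language shared vocabulary.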


