A Dataset and Strong Baselines for Classification of Czech News Texts

07/20/2023
by Hynek Kydlíček, et al.

Pre-trained models for Czech Natural Language Processing are often evaluated on purely linguistic tasks (POS tagging, parsing, NER) and on relatively simple classification tasks, such as sentiment classification or article classification from a single news source. As an alternative, we present the CZEch NEws Classification dataset (CZE-NEC), one of the largest Czech classification datasets. It is composed of news articles from various sources spanning over twenty years, which allows a more rigorous evaluation of such models. We define four classification tasks: news source, news category, inferred author's gender, and day of the week. To verify the task difficulty, we conducted a human evaluation, which revealed that human performance lags behind strong machine-learning baselines built upon pre-trained transformer models. Furthermore, we show that language-specific pre-trained encoders outperform selected commercially available large-scale generative language models.
