Monolingual and Cross-Lingual Knowledge Transfer for Topic Classification

06/13/2023
by Dmitry Karpov, et al.

This article investigates knowledge transfer from the RuQTopics dataset. This Russian topical dataset combines a large number of samples (361,560 single-label, 170,930 multi-label) with extensive class coverage (76 classes). We prepared this dataset from the "Yandex Que" raw data. By evaluating RuQTopics-trained models on the six matching classes of the Russian MASSIVE subset, we show that the RuQTopics dataset is suitable for real-world conversational tasks: Russian-only models trained on this dataset consistently yield an accuracy of around 85% on this subset. We also found that, for multilingual BERT trained on RuQTopics and evaluated on the same six classes of MASSIVE (for all MASSIVE languages), the language-wise accuracy closely correlates (Spearman correlation 0.773, p-value 2.997e-11) with the approximate size of BERT's pretraining data for the corresponding language. At the same time, the correlation of the language-wise accuracy with the linguistic distance from Russian is not statistically significant.
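To make the correlation claim concrete, the following is a minimal sketch of how a Spearman rank correlation between per-language pretraining data size and per-language accuracy could be computed. It is not the authors' code. The function names `average_ranks` and `spearman` are illustrative, and the five data points are made-up toy values, not figures from the paper. The sketch computes only the coefficient, not the p-value.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the whole run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy values for five languages (NOT taken from the paper):
# approximate pretraining data size and six-class accuracy per language.
pretraining_size = [100.0, 20.0, 5.0, 50.0, 1.0]
accuracy = [0.85, 0.74, 0.60, 0.72, 0.55]

rho = spearman(pretraining_size, accuracy)
print(f"Spearman rho = {rho:.3f}")
```

Because the coefficient depends only on ranks, it measures whether accuracy rises monotonically with pretraining data size, without assuming a linear relationship between the two.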


