Benchmarking Multilabel Topic Classification in the Kyrgyz Language

08/30/2023
by Anton Alekseev, et al.

Kyrgyz is severely underrepresented in modern natural language processing resources. In this work, we present a new public benchmark for topic classification in Kyrgyz. We introduce a dataset of collected and annotated data from the news site 24.KG and present several baseline models for news classification in the multilabel setting. We train and evaluate both classical statistical and neural models, report their scores, discuss the results, and propose directions for future work.
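
To make the multilabel setting concrete, here is a minimal sketch of how predictions might be scored when each article can carry several topics at once. The abstract does not say which metrics are reported, so micro- and macro-averaged F1 are an assumption here (they are common choices for this kind of task). The topic names, gold labels, and predictions are hypothetical and are not taken from the 24.KG dataset.

```python
from typing import Dict, List, Sequence, Set


def binarize(label_sets: Sequence[Set[str]], topics: Sequence[str]) -> List[List[int]]:
    """Turn each article's set of topics into a 0/1 indicator row."""
    return [[1 if topic in labels else 0 for topic in topics] for labels in label_sets]


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; returns 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_f1(y_true: List[List[int]], y_pred: List[List[int]]) -> Dict[str, float]:
    """Micro- and macro-averaged F1 over indicator matrices of equal shape."""
    n_topics = len(y_true[0])
    counts = []
    for j in range(n_topics):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    # Macro: average the per-topic F1 scores, so rare topics weigh as much as common ones.
    macro = sum(f1(*c) for c in counts) / n_topics
    # Micro: pool the counts over all topics first, then compute a single F1.
    micro = f1(
        sum(c[0] for c in counts),
        sum(c[1] for c in counts),
        sum(c[2] for c in counts),
    )
    return {"micro_f1": micro, "macro_f1": macro}


# Hypothetical topic inventory and labels, for illustration only.
topics = ["politics", "economy", "sports"]
gold = [{"politics", "economy"}, {"sports"}, {"economy"}]
predicted = [{"politics"}, {"sports"}, {"economy", "sports"}]

y_true = binarize(gold, topics)
y_pred = binarize(predicted, topics)
scores = multilabel_f1(y_true, y_pred)
print(scores)
```

The two averages can diverge noticeably when topic frequencies are skewed, which is why multilabel benchmarks often report both.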


