Taxi1500: A Multilingual Dataset for Text Classification in 1500 Languages

05/15/2023
by Chunlan Ma, et al.

While natural language processing tools have been developed extensively for some of the world's languages, a significant portion of the world's more than 7000 languages is still neglected. One reason is that evaluation datasets do not yet cover a wide range of languages, including low-resource and endangered ones. We aim to address this issue by creating a text classification dataset encompassing a large number of languages, many of which currently have little to no annotated data available. We leverage parallel translations of the Bible to construct such a dataset: we first develop applicable topics and then employ a crowdsourcing tool to collect annotated data. By annotating the English side of the data and projecting the labels onto other languages through aligned verses, we generate text classification datasets for more than 1500 languages. We extensively benchmark several existing multilingual language models using our dataset. To facilitate the advancement of research in this area, we will release our dataset and code.
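
The label projection step described in the abstract can be illustrated with a minimal sketch. It assumes that each verse carries an identifier shared across translations, that the English annotations are available as a mapping from verse identifier to topic label, and that a target-language Bible is a mapping from verse identifier to text. The function name `project_labels`, the identifiers, the labels, and the placeholder texts are all hypothetical and are not taken from the released dataset or code.

```python
from typing import Dict, List, Tuple


def project_labels(
    english_labels: Dict[str, str],
    target_verses: Dict[str, str],
) -> List[Tuple[str, str, str]]:
    """Copy each English verse label onto the aligned target-language verse.

    Returns (verse_id, target_text, label) rows for every verse that is
    both annotated in English and present in the target translation.
    """
    projected = []
    for verse_id, text in sorted(target_verses.items()):
        label = english_labels.get(verse_id)
        if label is not None:  # skip verses with no English annotation
            projected.append((verse_id, text, label))
    return projected


# Hypothetical toy data: identifiers, labels, and texts are placeholders.
english_labels = {"v001": "topic_a", "v002": "topic_b", "v003": "topic_a"}
target_verses = {
    "v001": "<target-language text 1>",
    "v003": "<target-language text 3>",
    "v004": "<target-language text 4>",  # not annotated, so it is dropped
}

dataset = project_labels(english_labels, target_verses)
for row in dataset:
    print(row)
```

Under these assumptions, a verse missing from either side is simply left out, so the size of each per-language dataset depends on how much of the annotated verse set that translation covers.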


research
09/14/2023

SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects

Despite the progress we have recorded in the last few years in multiling...
research
03/31/2020

MULTEXT-East

MULTEXT-East language resources, a multilingual dataset for language eng...
research
12/03/2021

Multilingual Text Classification for Dravidian Languages

As the fourth largest language family in the world, the Dravidian langua...
research
10/26/2017

ALL-IN-1: Short Text Classification with One Model for All Languages

We present ALL-IN-1, a simple model for multilingual text classification...
research
11/29/2019

A Multi-cascaded Deep Model for Bilingual SMS Classification

Most studies on text classification are focused on the English language....
research
01/10/2022

Language-Agnostic Website Embedding and Classification

Currently, publicly available models for website classification do not o...
research
10/18/2021

A Data Bootstrapping Recipe for Low Resource Multilingual Relation Classification

Relation classification (sometimes called 'extraction') requires trustwo...
