MMT: A Multilingual and Multi-Topic Indian Social Media Dataset

04/02/2023
by   Dwip Dalal, et al.

Social media plays a significant role in cross-cultural communication. Much of this communication occurs in code-mixed and multilingual form, which poses a significant challenge to Natural Language Processing (NLP) tools for tasks such as language identification, topic modeling, and named-entity recognition. To address this, we introduce MMT, a large-scale multilingual and multi-topic dataset collected from Twitter (1.7 million tweets), encompassing 13 coarse-grained and 63 fine-grained topics in the Indian context. We further annotate a subset of 5,346 tweets from the MMT dataset with various Indian languages and their code-mixed counterparts. We also demonstrate that existing tools fail to capture the linguistic diversity in MMT on two downstream tasks: topic modeling and language identification. To facilitate future research, we will make the anonymized and annotated dataset available in the public domain.

research
06/11/2023

RoBERTweet: A BERT Language Model for Romanian Tweets

Developing natural language processing (NLP) systems for social media an...
research
11/19/2021

The ComMA Dataset V0.2: Annotating Aggression and Bias in Multilingual Social Media Discourse

In this paper, we discuss the development of a multilingual dataset anno...
research
01/28/2021

Semi-automatic Generation of Multilingual Datasets for Stance Detection in Twitter

Popular social media networks provide the perfect environment to study t...
research
06/29/2021

A Simple and Efficient Probabilistic Language model for Code-Mixed Text

The conventional natural language processing approaches are not accustom...
research
06/22/2023

Unveiling Global Narratives: A Multilingual Twitter Dataset of News Media on the Russo-Ukrainian Conflict

The ongoing Russo-Ukrainian conflict has been a subject of intense media...
research
10/26/2018

Automatic Identification and Ranking of Emergency Aids in Social Media Macro Community

Online social microblogging platforms including Twitter are increasingly...
research
06/26/2023

Enriching the NArabizi Treebank: A Multifaceted Approach to Supporting an Under-Resourced Language

In this paper we address the scarcity of annotated data for NArabizi, a ...
