DravidianCodeMix: Sentiment Analysis and Offensive Language Identification Dataset for Dravidian Languages in Code-Mixed Text

This paper describes the development of a multilingual, manually annotated dataset for three under-resourced Dravidian languages, built from social media comments. The dataset is annotated for sentiment analysis and offensive language identification and covers more than 60,000 YouTube comments in total. It consists of around 44,000 comments in Tamil-English, around 7,000 comments in Kannada-English, and around 20,000 comments in Malayalam-English. The data was manually annotated by volunteer annotators, with high inter-annotator agreement as measured by Krippendorff's alpha. The dataset contains all types of code-mixing phenomena, since it comprises user-generated content from a multilingual country. We also present baseline experiments that establish benchmarks on the dataset using machine learning methods. The dataset is available on GitHub (https://github.com/bharathichezhiyan/DravidianCodeMix-Dataset) and Zenodo (https://zenodo.org/record/4750858#.YJtw0SYo_0M).
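For readers unfamiliar with the agreement measure named above, the following is a minimal sketch of Krippendorff's alpha for nominal labels. It is not the authors' code: the function name `krippendorff_alpha_nominal`, the input layout (one list of labels per comment), and the example labels are all assumptions made for illustration.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list with one entry per annotated item (e.g. a comment);
    each entry is the list of labels given by the annotators who rated it.
    Items with fewer than two labels cannot be paired and are skipped.
    """
    # Coincidence matrix: each ordered pair of labels within an item
    # contributes 1 / (m - 1), where m is the number of labels on that item.
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and the overall number of pairable labels.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item with two or more labels")

    # Observed and expected disagreement (nominal metric: 1 if labels differ).
    observed = sum(w for (a, b), w in coincidences.items() if a != b) / n
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")

    return 1.0 - observed / expected


# Hypothetical example: three comments, each labelled by two annotators.
example = [
    ["Positive", "Positive"],
    ["Negative", "Negative"],
    ["Positive", "Negative"],
]
alpha = krippendorff_alpha_nominal(example)
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement.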


