Aggression-annotated Corpus of Hindi-English Code-mixed Data

03/26/2018
by   Ritesh Kumar, et al.

As interaction over the web has increased, incidents of aggression and related phenomena such as trolling, cyberbullying, flaming, and hate speech have also increased manifold across the globe. While most of these behaviours, such as bullying or hate speech, predate the Internet, the reach and extent of the Internet have given them unprecedented power and influence over the lives of billions of people. It is therefore of utmost importance that preventive measures be taken to safeguard people using the web, so that it remains a viable medium of communication and connection. In this paper, we discuss the development of an aggression tagset and an annotated corpus of Hindi-English code-mixed data from two of the most popular social networking and social media platforms in India, Twitter and Facebook. The corpus is annotated using a hierarchical tagset of 3 top-level tags and 10 level-2 tags. The final dataset contains approximately 18k tweets and 21k Facebook comments and is being released for further research in the field.

research
05/30/2018

A Corpus of English-Hindi Code-Mixed Tweets for Sarcasm Detection

Social media platforms like twitter and facebook have become two of th...
research
10/17/2020

CUSATNLP@HASOC-Dravidian-CodeMix-FIRE2020: Identifying Offensive Language from Manglish Tweets

With the popularity of social media, communications through blogs, Faceb...
research
01/15/2020

A Unified System for Aggression Identification in English Code-Mixed and Uni-Lingual Texts

Wide usage of social media platforms has increased the risk of aggressio...
research
01/15/2020

AggressionNet: Generalised Multi-Modal Deep Temporal and Sequential Learning for Aggression Identification

Wide usage of social media platforms has increased the risk of aggressio...
research
06/10/2022

Borrowing or Codeswitching? Annotating for Finer-Grained Distinctions in Language Mixing

We present a new corpus of Twitter data annotated for codeswitching and ...
research
11/19/2021

The ComMA Dataset V0.2: Annotating Aggression and Bias in Multilingual Social Media Discourse

In this paper, we discuss the development of a multilingual dataset anno...
research
05/08/2016

A corpus of preposition supersenses in English web reviews

We present the first corpus annotated with preposition supersenses, unle...
