Publishing a Quality Context-aware Annotated Corpus and Lexicon for Harassment Research

02/26/2018
by Mohammadreza Rezvan, et al.

A quality annotated corpus is essential, especially for applied research. Despite the Web science community's recent focus on cyberbullying research, the community still does not have standard benchmarks. In this paper, we publish, first, a quality annotated corpus and, second, an offensive-words lexicon capturing different types of harassment: (i) sexual harassment, (ii) racial harassment, (iii) appearance-related harassment, (iv) intellectual harassment, and (v) political harassment. We crawled data from Twitter using our offensive lexicon. We then relied on human judges to annotate the collected tweets w.r.t. the contextual types, because the use of offensive words alone is not sufficient to reliably detect harassment. Our corpus consists of 25,000 annotated tweets in five contextual types. We are pleased to share this novel annotated corpus and the lexicon with the research community. The instructions for acquiring the corpus have been published in the Git repository.
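The lexicon-then-annotate workflow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the lexicon entries, the `build_patterns` and `candidate_types` helpers, and the sample texts are all hypothetical placeholders, and the real lexicon is only available through the authors' repository. The sketch shows only how a lexicon match could select candidate tweets per contextual type before they go to human judges.

```python
import re

# Hypothetical placeholder lexicon: mild stand-in terms for two of the five
# contextual types. The published lexicon is NOT reproduced here.
LEXICON = {
    "appearance": ["ugly"],
    "intellectual": ["idiot", "moron"],
}


def build_patterns(lexicon):
    """Compile one case-insensitive, whole-word regex per contextual type."""
    return {
        category: re.compile(
            r"\b(?:" + "|".join(re.escape(term) for term in terms) + r")\b",
            re.IGNORECASE,
        )
        for category, terms in lexicon.items()
    }


PATTERNS = build_patterns(LEXICON)


def candidate_types(text, patterns=PATTERNS):
    """Return the sorted contextual types whose lexicon terms occur in text.

    A match only marks the text as a candidate for annotation; it does not
    decide whether the text is harassment.
    """
    return sorted(c for c, p in patterns.items() if p.search(text))


if __name__ == "__main__":
    samples = [
        "You are such an idiot",
        "That ugly sweater is my favourite",  # lexicon hit, not harassment
        "Have a nice day",
    ]
    for text in samples:
        print(candidate_types(text), "<-", text)
```

The second sample matches the placeholder lexicon without being harassing, which is the kind of case that motivates having human judges label the context rather than relying on offensive words alone.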


research
06/03/2020

Extracting COVID-19 Events from Twitter

We present a corpus of 7,500 tweets annotated with COVID-19 events, incl...
research
11/01/2018

Analyzing and learning the language for different types of harassment

The presence of a significant amount of harassment in user-generated con...
research
06/25/2021

Manually Annotated Spelling Error Corpus for Amharic

This paper presents a manually annotated spelling error corpus for Amhar...
research
10/01/2021

Sentiment and structure in word co-occurrence networks on Twitter

We explore the relationship between context and happiness scores in poli...
research
07/10/2020

What Can We Learn From Almost a Decade of Food Tweets

We present the Latvian Twitter Eater Corpus - a set of tweets in the nar...
research
05/01/2018

An Annotated Corpus for Machine Reading of Instructions in Wet Lab Protocols

We describe an effort to annotate a corpus of natural language instructi...
research
12/01/2020

HORAE: an annotated dataset of books of hours

We introduce in this paper a new dataset of annotated pages from books o...
