Offensive content has become pervasive in social media and a cause for concern for government organizations, online communities, and social media platforms. One of the most common strategies to tackle the problem is to train systems capable of recognizing offensive content, which can then be deleted or set aside for human moderation. In the last few years, several studies have been published on the application of computational methods to this problem. Most prior work focuses on one particular aspect of offensive language, such as abusive language Nobata et al. (2016); Mubarak et al. (2017), (cyber-)aggression Kumar et al. (2018), (cyber-)bullying Xu et al. (2012); Dadvar et al. (2013), toxic comments, hate speech Kwok and Wang (2013); Djuric et al. (2015); Burnap and Williams (2015); Davidson et al. (2017); Malmasi and Zampieri (2017, 2018), and offensive language Wiegand et al. (2018). Prior work has studied these aspects of offensive language in Twitter Xu et al. (2012); Burnap and Williams (2015); Davidson et al. (2017); Wiegand et al. (2018), Wikipedia comments (https://bit.ly/2FhLMVz), and Facebook posts Kumar et al. (2018).
Recently, Waseem et al. (2017) acknowledged the similarities among prior work and discussed the need for a typology that differentiates between whether the (abusive) language is directed towards a specific individual or entity or towards a generalized group, and whether the abusive content is explicit or implicit. Wiegand et al. (2018) followed this trend as well on German tweets. Their evaluation comprises one task for detecting offensive vs. non-offensive tweets and a second task for classifying the offensive tweets as profanity, insult, or abuse. However, no prior work has explored the target of the offensive language, which is important in many scenarios, e.g., when studying hate speech with respect to a specific target.
Therefore, we expand on these ideas by proposing a hierarchical three-level annotation model that encompasses:
Offensive Language Detection
Categorization of Offensive Language
Offensive Language Target Identification
Using this annotation model, we create a new large publicly available dataset of English tweets (downloadable at: https://bit.ly/2GKY5gM). The key contributions of this paper are as follows:
A novel three-level hierarchical approach to modeling abusive language;
The Offensive Language Identification Dataset (OLID): a new dataset of English tweets with high quality annotation of the target and type of offenses;
Several baseline experiments applying computational methods to each of the three levels of annotation.
2 Related Work
Different abusive and offensive language identification sub-tasks have been explored in the past few years, including aggression identification, bullying detection, hate speech, toxic comments, and offensive language.
Aggression identification: The TRAC shared task on Aggression Identification Kumar et al. (2018) provided participants with a dataset containing 15,000 annotated Facebook posts and comments in English and Hindi for training and validation. For testing, two different sets, one from Facebook and one from Twitter were provided. Systems were trained to discriminate between three classes: non-aggressive, covertly aggressive, and overtly aggressive.
Several studies have been published on bullying detection. One of them is Xu et al. (2012), who apply sentiment analysis to detect bullying in tweets and use topic models to identify relevant topics in bullying. Another related study is Dadvar et al. (2013), who use user-related features, such as the frequency of profanity in previous messages, to improve bullying detection.
Hate speech identification: It is perhaps the most widespread abusive language detection sub-task. Several studies have been published on this sub-task, such as Kwok and Wang (2013) and Djuric et al. (2015), who build a binary classifier to distinguish between ‘clean’ comments and comments containing hate speech and profanity. More recently, Davidson et al. (2017) presented a hate speech detection dataset containing over 24,000 English tweets labeled as non-offensive, hate speech, and profanity.
Offensive language: The GermEval (https://projects.fzai.h-da.de/iggsa/) shared task Wiegand et al. (2018) focused on offensive language identification in German tweets. A dataset of over 8,500 annotated tweets was provided for a coarse-grained binary classification task, in which systems were trained to discriminate between offensive and non-offensive tweets, and for a second task in which the organizers broke down the offensive class into three classes: profanity, insult, and abuse.
Toxic comments: The Toxic Comment Classification Challenge was an open competition at Kaggle which provided participants with comments from Wikipedia labeled in six classes: toxic, severe toxic, obscene, threat, insult, identity hate.
While each of these sub-tasks tackles a particular type of abuse or offense, they share similar properties, and the hierarchical annotation model proposed in this paper aims to capture this. Considering that, for example, an insult targeted at an individual is commonly known as cyberbullying and that insults targeted at a group are known as hate speech, we posit that OLID’s hierarchical annotation model makes it a useful resource for various offensive language identification sub-tasks.
3 Hierarchically Modelling Offensive Content
|Tweet|A|B|C|
|---|---|---|---|
|@USER He is so generous with his offers.|NOT|—|—|
|IM FREEEEE!!!! WORST EXPERIENCE OF MY FUCKING LIFE|OFF|UNT|—|
|@USER Fuk this fat cock sucker|OFF|TIN|IND|
|@USER Figures! What is wrong with these idiots? Thank God for @USER|OFF|TIN|GRP|
In the OLID dataset, we use a hierarchical annotation model split into three levels to distinguish between whether language is offensive or not (A), and type (B) and target (C) of the offensive language. Each level is described in more detail in the following subsections and examples are shown in Table 1.
3.1 Level A: Offensive Language Detection
Level A discriminates between offensive (OFF) and non-offensive (NOT) tweets.
Not Offensive (NOT): Posts that do not contain offense or profanity;
Offensive (OFF): We label a post as offensive if it contains any form of non-acceptable language (profanity) or a targeted offense, which can be veiled or direct. This category includes insults, threats, and posts containing profane language or swear words.
3.2 Level B: Categorization of Offensive Language
Level B categorizes the type of offense using two labels: targeted (TIN) and untargeted (UNT) insults and threats.
Targeted Insult (TIN): Posts which contain an insult/threat to an individual, group, or others (see next layer);
Untargeted (UNT): Posts containing non-targeted profanity and swearing. Posts with general profanity are not targeted, but they contain non-acceptable language.
3.3 Level C: Offensive Language Target Identification
Level C categorizes the targets of insults and threats as individual (IND), group (GRP), and other (OTH).
Individual (IND): Posts targeting an individual. The target can be a famous person, a named individual, or an unnamed participant in the conversation. Insults and threats targeted at individuals are often defined as cyberbullying.
Group (GRP): The target of these offensive posts is a group of people considered as a unit due to the same ethnicity, gender or sexual orientation, political affiliation, religious belief, or other common characteristic. Many of the insults and threats targeted at a group correspond to what is commonly understood as hate speech.
Other (OTH): The target of these offensive posts does not belong to any of the previous two categories (e.g. an organization, a situation, an event, or an issue).
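The dependency between the three levels can be made explicit in code. The following is a minimal sketch, assuming that Level B applies only to tweets labeled OFF and Level C only to tweets labeled TIN, as the definitions above and the examples in Table 1 suggest. The `Annotation` class and its field names are hypothetical and are not part of the released dataset.

```python
from dataclasses import dataclass
from typing import Optional

LEVEL_A = {"OFF", "NOT"}
LEVEL_B = {"TIN", "UNT"}
LEVEL_C = {"IND", "GRP", "OTH"}


@dataclass(frozen=True)
class Annotation:
    """Labels for one tweet (hypothetical container); None means 'not applicable'."""
    level_a: str
    level_b: Optional[str] = None
    level_c: Optional[str] = None

    def is_valid(self) -> bool:
        """Check that the labels respect the three-level hierarchy."""
        if self.level_a not in LEVEL_A:
            return False
        if self.level_a == "NOT":
            # Non-offensive tweets carry neither a type nor a target.
            return self.level_b is None and self.level_c is None
        if self.level_b not in LEVEL_B:
            return False
        if self.level_b == "UNT":
            # Untargeted offenses have no target by definition.
            return self.level_c is None
        # Targeted insults and threats must name a target class.
        return self.level_c in LEVEL_C
```

For instance, `Annotation("OFF", "TIN", "GRP").is_valid()` holds, while `Annotation("OFF", "UNT", "IND").is_valid()` does not, because an untargeted offense cannot have a target.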
4 Data Collection
The data included in OLID has been collected from Twitter. We retrieved the data using the Twitter API by searching for keywords and constructions that are often included in offensive messages, such as ‘she is’ or ‘to:BreitBartNews’ (‘to:’ is a special Twitter API keyword indicating that the tweet was written directly to a specific account, e.g., BreitBartNews). We carried out a first round of trial annotation of 300 instances with six experts. The goal of the trial annotation was to 1) evaluate the proposed tagset; 2) evaluate the data retrieval method; and 3) create a gold standard with instances that could be used as test questions in the training and test setting annotation, which was carried out using crowdsourcing. The breakdown of keywords and their offensive content in the trial data of 300 tweets is shown in Table 2. We included a left-leaning (@NewYorker) and a far-right (@BreitBartNews) news account because there tends to be political offense in the comments. One of the most productive sources of offensive content was tweets flagged as not being safe by the Twitter ‘safe’ filter (the ‘-’ indicates ‘not safe’). The vast majority of content on Twitter is not offensive, so we tried different strategies to keep a reasonable number of tweets in the offensive class, amounting to around 30% of the dataset. These strategies included excluding some keywords that were not high in offensive content, such as ‘they are’ and ‘to:NewYorker’. Although ‘he is’ is lower in offensive content, we kept it as a keyword to avoid gender bias. In addition to the keywords in the trial set, we searched for more political keywords, which tend to be higher in offensive content, and sampled our dataset such that 50% of the tweets come from political keywords and 50% come from non-political keywords. In addition to the keywords ‘gun control’ and ‘to:BreitbartNews’, the political keywords used to collect these tweets are ‘MAGA’, ‘antifa’, ‘conservative’, and ‘liberal’.
We computed Fleiss’ kappa (κ) on the trial set for the five annotators on 21 of the tweets. κ is 0.83 for Level A (OFF vs. NOT), indicating high agreement. As to normalization and anonymization, no user metadata or Twitter IDs have been stored, and URLs and Twitter mentions have been replaced with placeholders. We follow prior work in related areas (Burnap and Williams, 2015; Davidson et al., 2017) and annotate our data using crowdsourcing on the platform Figure Eight (https://www.figure-eight.com/). We ensure data quality by 1) accepting annotations only from individuals who were experienced on the platform; and 2) using test questions to discard annotations from individuals who did not reach a certain threshold. Each instance in the dataset was annotated by multiple annotators, and inter-annotator agreement has been calculated. We first acquired two annotations for each instance. In case of 100% agreement, we considered these as acceptable annotations, and in case of disagreement, we requested more annotations until the agreement was above 66%. After the crowdsourcing annotation, we used expert adjudication to guarantee the quality of the annotation. The breakdown of the data into training and testing for the labels from each level is shown in Table 3.
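The agreement statistic and the re-annotation rule described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure used on the crowdsourcing platform: the function names are hypothetical, every item is assumed to have the same number of ratings when computing kappa, and "agreement" in the re-annotation rule is interpreted here as the share of annotators choosing the most frequent label.

```python
from collections import Counter
from typing import Sequence


def fleiss_kappa(ratings: Sequence[Sequence[str]]) -> float:
    """Fleiss' kappa for items that were each rated by the same number of raters."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    counts = [Counter(r) for r in ratings]
    # Observed agreement: mean proportion of agreeing rater pairs per item.
    p_bar = sum(
        (sum(c * c for c in item.values()) - n_raters) / (n_raters * (n_raters - 1))
        for item in counts
    ) / n_items
    # Chance agreement, estimated from the overall label proportions.
    totals: Counter = Counter()
    for item in counts:
        totals.update(item)
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())
    if p_e == 1.0:
        return 1.0  # degenerate case: a single label was used throughout
    return (p_bar - p_e) / (1 - p_e)


def needs_more_annotations(labels: Sequence[str], threshold: float = 0.66) -> bool:
    """True while the most frequent label's share is not above the threshold.

    Assumption: 'agreement' is read as the majority label's share of votes.
    """
    top_count = Counter(labels).most_common(1)[0][1]
    return top_count / len(labels) <= threshold
```

Under this reading, two disagreeing annotations (50% agreement) trigger a further request, whereas a two-to-one split among three annotators (about 67%) is accepted.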
5 Experiments and Evaluation
We assess our dataset using traditional and deep learning methods. Our simplest model is a linear SVM trained on word unigrams. SVMs have produced state-of-the-art results for many text classification tasks Zampieri et al. (2018). We also train a bidirectional Long Short-Term Memory (BiLSTM) model, which we adapted from the sentiment analysis system of Anonymous (2018) and Rasooli et al. (2018) and altered to predict offensive labels instead. It consists of (1) an input embedding layer, (2) a bidirectional LSTM layer, and (3) an average pooling layer of input features. The concatenation of the outputs of the LSTM and average pooling layers is passed through a dense layer, and the output is passed through a softmax function. We set two input channels for the input embedding layers: pre-trained FastText embeddings Bojanowski et al. (2016), as well as updatable embeddings learned by the model during training. Finally, we also apply a Convolutional Neural Network (CNN) model based on the architecture of Kim (2014), using the same multi-channel inputs as the above BiLSTM.
Our models are trained on the training data and evaluated by predicting the labels for the held-out test set. The distribution is described in Table 3. We evaluate and compare the models using the macro-averaged F1-score, as the label distribution is highly imbalanced. Per-class Precision (P), Recall (R), and F1-score (F1), along with other averaged metrics, are also reported. The models are compared against baselines that predict all labels as the majority or minority class.
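To illustrate why the macro-averaged F1-score is preferred under class imbalance, the sketch below computes per-class precision, recall, and F1 together with the macro average for a single-class baseline. The function names are hypothetical, the toy labels are invented for illustration and are not taken from OLID, and a class that is never predicted is assumed to receive precision and F1 of 0.

```python
from typing import Dict, Sequence, Tuple


def per_class_prf(gold: Sequence[str], pred: Sequence[str]) -> Dict[str, Tuple[float, float, float]]:
    """Precision, recall, and F1 for every label that occurs in the gold data."""
    scores = {}
    for label in sorted(set(gold)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0  # assumption: 0 when undefined
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_prf(gold, pred)
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


def constant_baseline(label: str, n: int) -> list:
    """Predict the same label (e.g. the majority class) for all n instances."""
    return [label] * n


# Toy example: 7 NOT and 3 OFF gold labels, all predicted as NOT.
toy_gold = ["NOT"] * 7 + ["OFF"] * 3
toy_pred = constant_baseline("NOT", len(toy_gold))
toy_accuracy = sum(g == p for g, p in zip(toy_gold, toy_pred)) / len(toy_gold)
toy_macro_f1 = macro_f1(toy_gold, toy_pred)
```

In this toy case the baseline reaches an accuracy of 0.70 but a macro-F1 of only about 0.41, because the F1 of the never-predicted OFF class is 0 and counts as much as that of the NOT class.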
5.1 Offensive Language Detection
The performance on discriminating between offensive (OFF) and non-offensive (NOT) posts is reported in Table 4. We can see that all systems perform significantly better than chance, with the neural models being substantially better than the SVM. The CNN outperforms the RNN model, achieving a macro-F1 score of 0.80.
5.2 Categorization of Offensive Language
In this experiment, the two systems were trained to discriminate between insults and threats (TIN) and untargeted (UNT) offenses, which generally refer to profanity. The results are shown in Table 5.
The CNN system achieved higher performance in this experiment compared to the BiLSTM, with a macro-F1 score of 0.69. All systems performed better at identifying targeted insults and threats (TIN) than untargeted offenses (UNT).
5.3 Offensive Language Target Identification
The results of the offensive target identification experiment are reported in Table 6. Here the systems were trained to distinguish between three targets: a group (GRP), an individual (IND), or others (OTH). All three models achieved similar results far surpassing the random baselines, with a slight performance edge for the neural models.
The performance of all systems for the OTH class is 0. This poor performance can be explained by two main factors. First, unlike the two other classes, OTH is a heterogeneous collection of targets. It includes offensive tweets targeted at organizations, situations, events, etc., making it more challenging for systems to learn discriminative properties of this class. Second, this class contains fewer training instances than the other two: there are only 395 instances in OTH, compared to 1,075 in GRP and 2,407 in IND.
6 Conclusion and Future Work
This paper presents OLID, a new dataset with annotation of type and target of offensive language. OLID is the official dataset of the shared task SemEval 2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval) Zampieri et al. (2019) (https://bit.ly/2XaC7bP). In OffensEval, each annotation level in OLID is an independent sub-task. The dataset contains 14,100 tweets and is released freely to the research community. To the best of our knowledge, this is the first dataset to contain annotation of type and target of offenses in social media and it opens several new avenues for research in this area. We present baseline experiments using SVMs and neural networks to identify the offensive tweets, discriminate between insults, threats, and profanity, and finally to identify the target of the offensive messages. The results show that this is a challenging task. A CNN-based sentence classifier achieved the best results in all three sub-tasks.
In future work, we would like to make a cross-corpus comparison of OLID and datasets annotated for similar tasks such as aggression identification Kumar et al. (2018) and hate speech detection Davidson et al. (2017). This comparison is, however, far from trivial as the annotation of OLID is different.
The research presented in this paper was partially supported by an ERAS fellowship awarded by the University of Wolverhampton.
- Anonymous (2018) Anonymous. 2018. Practical Pre-trained Embeddings for Cross-lingual Sentiment Analysis. In 2019 NAACL-HLT submission.
- Bojanowski et al. (2016) Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.
- Burnap and Williams (2015) Pete Burnap and Matthew L Williams. 2015. Cyber hate speech on twitter: An application of machine classification and statistical modeling for policy and decision making. Policy & Internet, 7(2):223–242.
- Dadvar et al. (2013) Maral Dadvar, Dolf Trieschnigg, Roeland Ordelman, and Franciska de Jong. 2013. Improving Cyberbullying Detection with User Context. In Advances in Information Retrieval, pages 693–696. Springer.
- Davidson et al. (2017) Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. Automated Hate Speech Detection and the Problem of Offensive Language. In Proceedings of ICWSM.
- Djuric et al. (2015) Nemanja Djuric, Jing Zhou, Robin Morris, Mihajlo Grbovic, Vladan Radosavljevic, and Narayan Bhamidipati. 2015. Hate speech detection with comment embeddings. In Proceedings of WWW.
- Kim (2014) Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
- Kumar et al. (2018) Ritesh Kumar, Atul Kr Ojha, Shervin Malmasi, and Marcos Zampieri. 2018. Benchmarking Aggression Identification in Social Media. In Proceedings of TRAC.
- Kwok and Wang (2013) Irene Kwok and Yuzhou Wang. 2013. Locate the Hate: Detecting Tweets Against Blacks. In Proceedings of AAAI.
- Malmasi and Zampieri (2017) Shervin Malmasi and Marcos Zampieri. 2017. Detecting Hate Speech in Social Media. In Proceedings of RANLP.
- Malmasi and Zampieri (2018) Shervin Malmasi and Marcos Zampieri. 2018. Challenges in Discriminating Profanity from Hate Speech. Journal of Experimental & Theoretical Artificial Intelligence, 30:1–16.
- Mubarak et al. (2017) Hamdy Mubarak, Kareem Darwish, and Walid Magdy. 2017. Abusive Language Detection on Arabic Social Media. In Proceedings of ALW.
- Nobata et al. (2016) Chikashi Nobata, Joel Tetreault, Achint Thomas, Yashar Mehdad, and Yi Chang. 2016. Abusive Language Detection in Online User Content. In Proceedings of WWW.
- Rasooli et al. (2018) Mohammad Sadegh Rasooli, Noura Farra, Axinia Radeva, Tao Yu, and Kathleen McKeown. 2018. Cross-lingual Sentiment Transfer with Limited Resources. Machine Translation, 32(1-2):143–165.
- Waseem et al. (2017) Zeerak Waseem, Thomas Davidson, Dana Warmsley, and Ingmar Weber. 2017. Understanding Abuse: A Typology of Abusive Language Detection Subtasks. arXiv preprint arXiv:1705.09899.
- Wiegand et al. (2018) Michael Wiegand, Melanie Siegel, and Josef Ruppenhofer. 2018. Overview of the GermEval 2018 Shared Task on the Identification of Offensive Language. In Proceedings of GermEval.
- Xu et al. (2012) Jun-Ming Xu, Kwang-Sung Jun, Xiaojin Zhu, and Amy Bellmore. 2012. Learning from Bullying Traces in Social Media. In Proceedings of NAACL.
- Zampieri et al. (2018) Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Ahmed Ali, Suwon Shon, James Glass, Yves Scherrer, Tanja Samardžić, Nikola Ljubešić, Jörg Tiedemann, et al. 2018. Language Identification and Morphosyntactic Tagging: The Second VarDial Evaluation Campaign. In Proceedings of VarDial.
- Zampieri et al. (2019) Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, and Ritesh Kumar. 2019. SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval). In Proceedings of SemEval.