
Revisiting Contextual Toxicity Detection in Conversations

by   Julia Ive, et al.

Understanding toxicity in user conversations is undoubtedly an important problem. As it has been argued in previous work, addressing "covert" or implicit cases of toxicity is particularly hard and requires context. Very few previous studies have analysed the influence of conversational context in human perception or in automated detection models. We dive deeper into both these directions. We start by analysing existing contextual datasets and come to the conclusion that toxicity labelling by humans is in general influenced by the conversational structure, polarity and topic of the context. We then propose to bring these findings into computational detection models by introducing (a) neural architectures for contextual toxicity detection that are aware of the conversational structure, and (b) data augmentation strategies that can help model contextual toxicity detection. Our results have shown the encouraging potential of neural architectures that are aware of the conversation structure. We have also demonstrated that such models can benefit from synthetic data, especially in the social media domain.



1. Introduction

Understanding toxicity in user conversations (e.g., via social media and related platforms) is undoubtedly an important problem both for (a) humans processing such content, either when interacting with others or in the task of content moderation, and (b) technological solutions that are aimed at filtering content, supporting human interactions or content moderation. As with many other tasks involving human language, the decision on whether a given piece of text is toxic is not trivial. While some occurrences clearly and intentionally use toxic words to abuse or offend, others are more subtle, including the use of sarcasm, references to previous elements in the conversational context or even external elements. Throughout, we use the term “toxicity” as an umbrella term to denote a number of variants commonly named in the literature, including hate, abuse and offence, among others.

As it has been argued in previous work, addressing these “covert” cases of toxicity requires context (Jurgens et al., 2019; Vidgen et al., 2019; Caselli et al., 2020). Various types of context have been considered, including conversational and visual context as well as metadata about user features, annotator features or interactions (network information) (Ribeiro et al., 2018).

While we acknowledge the importance of metadata, this information is not always available and also varies from dataset to dataset, depending on the platform used to produce the data. Visual context has been studied in (Gomez et al., 2019; Yang et al., 2019; Kiela et al., 2021) and shown to improve detection results significantly for memes (Kiela et al., 2021).

In this paper, we focus on textual context, which is prevalent in any social media platform and can encompass complex linguistic features, making toxicity detection very challenging. It can include the content of the initial post where a comment is made, as well as other comments in the conversation thread.

Very few previous studies have analysed the influence of conversational context in human perception of toxicity in controlled experiments. Pavlopoulos et al. (2020) annotated 250 comments on Wikipedia Talk pages in two settings: in isolation and in the presence of the post title and previous comment. They found that 5% of the 250 samples had their labels flipped, mostly from non-toxic to toxic. Similarly, Menini et al. (2021) annotated 8K tweets from an existing dataset (Founta et al., 2018) in isolation and in the presence of 1-5 previous tweets. They found that more context led to 50% fewer tweets being considered toxic, and that the longer the context, the higher the chance of a tweet being considered non-toxic.

While the extent of the impact of context differs, both these studies have shown that humans perceive toxicity differently depending on whether they have access to context.

Pavlopoulos et al. (2020) and Menini et al. (2021) have also used the annotated data to compare prediction models built with and without context. Their contextual models use simple approaches for early or late fusion of context and target comment (i.e. content to be classified), generally by concatenating the segments and giving them as input to the models, or encoding them separately and concatenating the hidden representations. The findings are not very encouraging: context-aware models do not outperform context-unaware models and in some cases perform worse.

We dive deeper into the questions of whether and how context affects human perception and prediction models.

When it comes to understanding the impact of context on human perception of toxicity, previous studies have focused on quantifying the cases where the presence of context (in phase 2 of the annotation task) changed the human toxicity label (from phase 1 of the annotation task), as well as the direction of change (from toxic to non-toxic or vice versa).

Apart from a few examples of these cases, they have not studied or quantified the reasons why these changes happen.

We thoroughly analyse the contextual dataset of tweets released by Menini et al. (2021) to answer this question.

Our observations suggest that labelling errors in phase 1 resolved by the context in phase 2 are mainly due to the general positive or neutral polarity of the conversation (positive sentiment or discussion on personal matters: family, food, hobbies, etc.).

Following these observations, we come to the conclusion that (a) for models/architectural choices, we need to move away from the current approaches, which simply perform early fusion of context and target comment (via concatenation), and instead take the hierarchy of the utterances in a thread into account; (b) the type of context also plays a crucial role for the prediction model, and more examples of cases where the context really matters, with diverse polarity and topics, are needed. To address these limitations, we propose deep learning architectures based on state-of-the-art contextual language representations such as BERT (Devlin et al., 2019) that are aware of the conversational structure.

In addition, we propose a data augmentation methodology that creates artificial, more diverse context and target utterances in terms of polarity and topics. This methodology uses state-of-the-art approaches for controlled text generation (Radford et al., 2019; Dai et al., 2019).

Our main contributions are thus threefold: (a) the analysis of the existing contextual toxicity detection data that reveals the idiosyncrasy of contextual toxicity perception and prediction; (b) a range of architectures for contextual toxicity detection that are more geared towards conversational context and lead to improved performance; and (c) data augmentation strategies that help model toxicity.

We introduce related work in Section 2, an in-depth analysis of the contextual dataset released by Menini et al. (2021) in Section 3, our context-aware models and data augmentation methods in Section 4, and our experimental setup in Section 5. We present and discuss key results in Section 6 and Section 7.

2. Related Work

Most research in the area of toxicity detection investigates the use of machine learning algorithms to process individual posts or comments from social media platforms, without any additional context (Zampieri et al., 2020; Waseem et al., 2017). While this is a step in the right direction compared to simple lookup based on keywords (e.g., Hatebase), this approach is limited to covering “overt” cases of toxicity, whereas addressing “covert” cases would require access to more context (Jurgens et al., 2019; Vidgen et al., 2019; Caselli et al., 2020). In what follows, we describe the few studies in the literature that have attempted to do so for conversational context, as well as general work in text generation for data augmentation.

2.1. Contextual Toxicity Detection

Previous studies on contextual toxicity include those focusing on understanding human perception of toxicity (Pavlopoulos et al., 2020; Menini et al., 2021), creating corpora where annotations are done in context (Mubarak et al., 2017; Vidgen et al., 2021; Onabola et al., 2021; Menini et al., 2021; Xu et al., 2021; Dinan et al., 2019; Fanton et al., 2021), and building classifiers where additional context is taken into account, including a single previous comment (Pavlopoulos et al., 2020), the title of the news article the comment refers to (Gao and Huang, 2017) or a conversational thread (Menini et al., 2021; Dinan et al., 2019; Xu et al., 2021).

In terms of human perception, Pavlopoulos et al. (2020) and Menini et al. (2021) have emphasised the importance of contextual information in understanding the true meaning of comments.

For creating computational models that are able to take context into account, the main challenge is the analysis of the conversational structure, which requires modelling long-range dependencies between utterances.

This is a general problem in the areas of discourse and dialogue. Existing work for modelling conversational structure in these areas can be divided into two groups: (a) traditional approaches that summarise conversations to main features (e.g., conversation markers (Niculae and Danescu-Niculescu-Mizil, 2016)), and (b) neural approaches that explore the sequential nature of the conversation by using hierarchical neural network structures (Chang and Danescu-Niculescu-Mizil, 2019), often equipped with attention mechanisms (De Kock and Vlachos, 2021) to focus on the most important parts of inputs at each level. We take inspiration from the latter line of work.

The works closest to ours are (Menini et al., 2021; Dinan et al., 2019; Xu et al., 2021). As mentioned previously, the study in (Menini et al., 2021) annotated a dataset of tweets in two settings: with and without context. Using the dataset, the study investigated the classification performance of neural and non-neural approaches. For the neural approaches, the study employed a BERT-based model (Devlin et al., 2019) which replicates the setup of the Next Sentence Prediction (NSP) task by splitting the last utterance to be classified and its preceding dialogue history into two separate segments. Then the last hidden states from the model are provided as input into a toxicity classification layer.

They also experimented with a Bidirectional Long Short-Term Memory (BiLSTM) recurrent network, either encoding the context concatenated together with the tweet as a single text or encoding the context and the tweet separately and then concatenating the two resulting representations. We also propose to encode context and posts separately, but we do not take the various previous tweets as a single chunk of context, but rather investigate the role of each part of the context.

Dinan et al. (2019) proposed the “build it, break it, fix it” strategy to create a more robust detection model. The proposed strategy includes humans and models in the loop: models are trained on some initial data, which is incrementally extended with data from adversarial attacks produced by humans to break the current models, an iterative process repeated a few times. The study investigated two different tasks: a single-turn task and a multi-turn task. The single-turn task models the detection problem with a single utterance, without context. The multi-turn task, which is similar to the scope of our work, considers the context (i.e., preceding comments) of each utterance. For that, similar to the work in (Menini et al., 2021), they use a model with the NSP setup.

The work in (Xu et al., 2021) seeks to improve the generation of “safe” comments by conversational bots. They also use a human-and-model-in-the-loop strategy where they create additional data by asking crowd-workers to adversarially converse with a chatbot model with the aim of inducing unsafe responses. Using the dataset, a safety classifier network was constructed to detect offensiveness in both the user input and the model output. If both utterances are classified as unsafe, the model would respond with a non-sequitur by randomly selecting a topic from a known list of safe topics. A second approach “bakes in” toxicity awareness to the generative model by modifying target responses to incorporate safe responses to offensive input. The first method in this approach category was found to filter examples classified as unsafe out of the dataset, consequently causing the model to be unprepared to handle unsafe input at inference time. The second method, which is more robust, replaces the ground truth response of each conversation in the dataset with a non-sequitur if the response or the last utterance in the history is classified as unsafe.
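The second “baked-in” method described above can be illustrated with a minimal sketch. It is not the implementation of Xu et al. (2021): the safe-topic list, the wording of the non-sequitur, the function names and the keyword-based `toy_is_unsafe` check are all hypothetical stand-ins chosen only to make the example runnable.

```python
import random

# Hypothetical list of safe topics; the actual list is not given in the text.
SAFE_TOPICS = ["gardening", "cooking", "astronomy"]


def non_sequitur(rng):
    """Build a topic-changing reply; the wording is an illustrative assumption."""
    return f"Let's talk about something else. How about {rng.choice(SAFE_TOPICS)}?"


def bake_in_safe_responses(dialogues, is_unsafe, seed=0):
    """Replace the gold response with a non-sequitur when either the response
    or the last utterance of the history is classified as unsafe."""
    rng = random.Random(seed)
    rewritten = []
    for history, response in dialogues:
        last_unsafe = bool(history) and is_unsafe(history[-1])
        if last_unsafe or is_unsafe(response):
            response = non_sequitur(rng)
        rewritten.append((history, response))
    return rewritten


def toy_is_unsafe(text):
    """Toy keyword check standing in for a trained safety classifier."""
    return "idiot" in text.lower()


data = [
    (["hello there"], "hi, how are you?"),
    (["you are an idiot"], "that is not nice"),
]
result = bake_in_safe_responses(data, toy_is_unsafe)
for history, response in result:
    print(history, "->", response)
```

The unsafe examples stay in the training data, so the generative model still sees unsafe input paired with a safe way of responding to it.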

While created with different aims in mind, the datasets in (Dinan et al., 2019; Xu et al., 2021) are very useful because they provide conversational structure going beyond one single comment (or post) and contain toxic comments. Along with (Menini et al., 2021), these make up the only three datasets of this nature, all of which we exploit in this paper.

2.2. Text Generation for Data Augmentation

Natural language generation is an NLP area with a range of applications such as dialogue generation, question answering, machine translation, summarisation, etc. Most recently, different deep learning text generation techniques have been actively applied for data augmentation purposes in the general NLP domain (Liu et al., 2020; Wei and Zou, 2019), as well as for the specific NLP tasks of machine translation (Edunov et al., 2018; Sennrich et al., 2016), question answering (Yang et al., 2019) and biomedical NLP (Ive et al., 2020).

There are many advantages of applying data augmentation techniques in language tasks, as summarised in (Shorten et al., 2021). The overarching goal is to get better performance out of existing datasets for supervised learning. Data augmentation techniques are also helpful tools to observe model behaviour and expose model failures.

They also act as a means of regularisation, which helps mitigate overfitting. In other words, with data augmentation, models are less prone to learning spurious correlations and memorising unique patterns in the dataset (e.g., numeric patterns in token embeddings). Consequently, the use of data augmentation leads to better model generalisation.

In the domain of toxicity detection, very few previous studies have investigated the use of data augmentation. The work in (dos Santos et al., 2018) proposed an unsupervised text style transfer approach which translates offensive sentences to non-offensive ones. The aim of the study is to encourage users of online social media platforms to change their behaviour of using profanity when posting. That is, when a message to be posted by a user is considered offensive, a polite version of the message is offered to the user. To do so, the proposed approach employed an RNN-based encoder-decoder text generation model and a collaborative CNN-based classifier to provide indirect supervision (dos Santos et al., 2018). During style transfer, the encoder of the generation model encodes an input sentence and its original (i.e., ground-truth) style into a sequence of hidden states. Using the computed hidden states, the decoder receives a target style and outputs a sequence with the desired style. Although this proposed method is effective in detoxifying sequences, we argue that due to the recurrent nature of the text generation model, it is less robust when performing style transfer of long sequences. Thus, we focus our investigation on the state-of-the-art Transformer-based approaches, to be outlined in the following sections.

The study in (Sen et al., 2021) investigated the effect of using Counterfactually Augmented Data (CAD) on the robustness of online abusive content detection models. CAD consists of human-generated instances that are minimally edited to flip their label (e.g., from positive sentiment to negative sentiment). They conducted experiments on three different constructs: sentiment, sexism and hate speech, using logistic regression and a fine-tuned BERT as the classification models. Their findings indicate that the use of CAD improves model generalisation to out-of-domain data, justifying the benefit of data augmentation. They have also shown that, with CAD, the classification models tend to better learn cues and features in the dataset that are highly correlated with the construct of interest. However, generating data manually is time-consuming and costly, which works in favour of automated data augmentation techniques.

3. A Closer Look into Human Perception of Toxicity

We start with an in-depth analysis of the contextual dataset of tweets released by Menini et al. (2021), where they re-annotated a subset of Founta et al. (2018)’s dataset for which the tweets were still retrievable from Twitter and had at least one previous tweet as context. The Founta et al. (2018) data was initially created by randomly sampling tweets with subsequent spam filtering during the period of March to April 2017, and annotated independently of context. The contextual subset of the data was re-annotated via crowdsourcing, using the same group of annotators, 3 months apart, in two conditions: phase 1 – without context, and phase 2 – with context, where context can be 1-5 previous tweets. The final label in each condition is the majority vote amongst three annotators per tweet.
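The aggregation step described above (majority vote over three annotators, computed separately for each condition) can be sketched as follows. This is only an illustration: the function name and the vote values are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def majority_label(votes):
    """Return the label chosen by most annotators.

    Assumes binary labels and an odd number of votes, so no tie can occur.
    """
    if len(votes) % 2 == 0:
        raise ValueError("expected an odd number of votes")
    return Counter(votes).most_common(1)[0][0]


# Hypothetical votes from three annotators for one tweet in each condition.
label_isolated = majority_label([1, 1, 0])    # phase 1: without context
label_contextual = majority_label([0, 0, 1])  # phase 2: with context
print(label_isolated, label_contextual)
```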

For our analysis, we selected this data over the other contextual datasets as it offers naturally occurring comments and context, as opposed to content intentionally created to break models (e.g. (Dinan et al., 2019; Xu et al., 2021)).

The goal of our analysis is to gain insights on why and when human perception changes in the presence of context to better inform the design of our toxicity detection approaches. First, we observe that the aggregated human labels for only 12% (1070 tweets) of the tweets changed in the presence of context. 81% of these changes were from toxic (1) to non-toxic (0), with 19% changing from non-toxic to toxic. Menini et al. (2021) claim that context helped annotators understand cases of toxic words used in sarcastic or ironic ways, or unclear cases due to references to other tweets, but they have not provided any details on this. We further analysed these changes from phase 1 to phase 2 (1 → 0 or 0 → 1 flips) and provide a broader categorisation of the reasons. The annotation was performed by one of the authors, who is a fluent speaker of English.
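The flip statistics reported above come from comparing the aggregated phase 1 and phase 2 labels tweet by tweet. A minimal sketch of that comparison, using toy labels rather than the actual data and a hypothetical function name, could look like this:

```python
def flip_summary(isolated, contextual):
    """Count label changes between the two annotation phases and their direction."""
    if len(isolated) != len(contextual):
        raise ValueError("both phases must label the same tweets")
    pairs = list(zip(isolated, contextual))
    toxic_to_non = sum(1 for i, c in pairs if i == 1 and c == 0)
    non_to_toxic = sum(1 for i, c in pairs if i == 0 and c == 1)
    flipped = toxic_to_non + non_to_toxic
    return {
        "flipped": flipped,
        "flip_rate": flipped / len(pairs) if pairs else 0.0,
        "toxic_to_non_toxic": toxic_to_non,
        "non_toxic_to_toxic": non_to_toxic,
    }


# Toy labels, not taken from the dataset.
summary = flip_summary([1, 1, 0, 0, 1], [0, 1, 1, 0, 0])
print(summary)
```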

The results show that the 1 → 0 flips happen mostly with neutral context (69% of the examples for this flip type). Such examples tend to contain a lot of profanity words that actually have no toxic intention but are rather used as intensifiers. The 0 → 1 flips happen mostly because of the influence of negative context (42% of the examples for this flip type). Toxic examples where the context is crucial to detect sarcasm make up around 13% of the toxic annotated examples. Such cases can be very difficult even for humans. Examples of our annotations are given in Table 1. Our observations confirm that the sentiment of the conversation plays a crucial role in determining the comment’s toxicity. Neutral and positive topics usually concern discussions on personal matters (family, food, hobbies, etc.) or general positive content (admiration, excitement, etc.) (see examples in rows 1 and 4 in Table 1). Negative contexts, or contexts leading to sarcasm, usually discuss politics (see examples in rows 2 and 5). Ambiguous cases rely on additional information to clarify the meaning of the comment (see example in row 6). Sometimes the toxicity does not come from the author of the post, who is instead repeating the words of somebody else (see example 7). We also need context to understand if this is the case. Another important observation is that the conversational structure of tweets is very irregular and each new post may refer to any previous post in the thread (examples 2 and 3). Furthermore, with a potentially large number of participants, tracking the exchanges between them may be very difficult. Hence, it is crucial to be able to model this structure in a flexible way with appropriate architectures.

| # | Category | Target | Context | I | C |
|---|---|---|---|---|---|
| 1 | Neutral context (58%) | P1: snakes the devil, that’s why I like to see mongooses fuckin em up. slithery sinister bastards | P2: A missing man was discovered inside a 23-foot python. [LINK] … | 1 | 0 |
| 2 | Negative context (16%) | P1: use that ugly ass design instead of the oufit designed by kishimoto | P1: Y’all really think P2: is not going to sexualize sarada the way ikemoto pedophile ass has been doing yet why else would they | 0 | 1 |
| 3 | NA (13%) | P1: If u want we can do some fucked up things if u want :) | P1: Hey bro, im the guy darkpetal - maybe we can do some other stuff ;) P2: The hacker Guy? P1: Haha i usually dont kill people ;D but meh this was my fun day on gta | 1 | 0 |
| 4 | Positive context (6%) | P1: I’m gonna because you are 100000% correct the bulls are going to the finals Fuck the Cleveland Cavaliers! Go Bulls!’!!!! | P2: Bulls the best team in the NBA, don’t @ me | 1 | 0 |
| 5 | Sarcasm (4%) | P1: This is adorable. You can’t answer direct questions. Not a democrat but nice overreach. | P1: Ask your father he would know. P1: It’s adorable that you still provide him a reacharound even after death. Very committed you are. P1: I accept your concession snowflake. P1: Poor kid haz a sad. You do know though that you posted a forgery? P1: Trickered? Do you even know what ”triggered” even means or did you just hear it online and decided to be a follower and repeat it? | 0 | 1 |
| 6 | Ambiguity (2%) | P1: turns out he’s just got out of prison for beating his ex girlfriend, I’m fucking disgusted and disappointed in my mum | P1: Okay so I thought there was something weird about my mums new boyfriend,I was getting bad vibes and he was saying gross things about women | 1 | 0 |
| 7 | Citation (1%) | P1: he’s said who are these people cuz I wanna sue them. I fucking fell out!!!! | P1: The turbo tax Humpty commercial is funny [Face with Tears of Joy] P2: favorite commercial right now lol | 1 | 0 |
Table 1. Examples of the various categories of changes related to the presence of context. “Target” is the tweet being annotated, “Context” the previous tweet(s), “I” the annotation in isolation and “C” the contextual one: toxic (1) or non-toxic (0). Px denotes the x-th user involved in the conversation. Category NA was assigned when the reason for the toxicity flip could not be established. Bold highlights profanity words in classified posts.

Our analysis and findings emphasise the need for context, as well as the complexity of interpreting a comment for toxicity, even in the presence of context. This motivated the two directions we pursue in the remainder of this paper: (a) better contextual detection models that adequately explore the structure of the conversation (rather than simply concatenating target comments and context) and (b) data augmentation strategies to create more diverse and relevant examples where context matters. We present our approaches for these two directions in the next section.

4. Methodology

In this section, we present our context-aware approach for toxicity detection, followed by the data generation strategies we follow to make the context more informative.

4.1. Context-aware Models

To better handle the conversational structure, we propose three architectures that encode the context separately from the target comment. Specifically, as shown in Figure 1, we encode the context as a sequence of sentences, each represented by its BERT sentence (CLS) embedding. Those embeddings are then summarised into a history representation. We propose three summary alternatives, namely taking the summary token from BERT for the concatenated context utterances (ContextSingle), the sum of the context representations (ContextSum) or the last hidden state representation from an RNN (ContextRNN). The history representation is then fused with the representation of the target post by concatenation, and this is used for a final classification layer followed by a sigmoid activation to classify the sample as toxic or non-toxic.

Figure 1. Our ContextSingle, ContextSum and ContextRNN architectures. Each utterance is represented by its BERT CLS token embedding. The context history representation is taken as the summary CLS BERT representation for the concatenated context utterances (ContextSingle), the sum of the context representations (ContextSum) or the last hidden state representation from an RNN (ContextRNN). The history representation is concatenated with the representation of the target post to produce toxicity predictions.
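To make the fusion step concrete, the sketch below works through ContextSum and ContextRNN on toy vectors. Several things here are assumptions for illustration only: random vectors stand in for BERT CLS embeddings, the RNN is a plain tanh (Elman) cell because the text does not specify the cell type, all weights are random rather than trained, and at least one context utterance is assumed. ContextSingle is omitted because it needs the encoder itself to embed the concatenated context.

```python
import math
import random


def context_sum(utterance_vectors):
    """ContextSum: element-wise sum of the context utterance embeddings."""
    return [sum(column) for column in zip(*utterance_vectors)]


def context_rnn(utterance_vectors, w_in, w_rec):
    """ContextRNN: last hidden state of a plain tanh RNN run over the context."""
    hidden = [0.0] * len(w_rec)
    for x in utterance_vectors:
        hidden = [
            math.tanh(
                sum(w * v for w, v in zip(w_in[j], x))
                + sum(w * h for w, h in zip(w_rec[j], hidden))
            )
            for j in range(len(w_rec))
        ]
    return hidden


def toxicity_probability(history, target, weights, bias=0.0):
    """Concatenate history and target vectors, then apply a linear layer and a sigmoid."""
    fused = history + target  # list concatenation
    logit = sum(w * f for w, f in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


rng = random.Random(0)
dim = 4  # toy embedding size; real CLS embeddings are much larger


def rand_vec(n):
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


context = [rand_vec(dim) for _ in range(3)]  # stand-ins for 3 context utterance embeddings
target = rand_vec(dim)                       # stand-in for the target post embedding

w_in = [rand_vec(dim) for _ in range(dim)]   # untrained input-to-hidden weights
w_rec = [rand_vec(dim) for _ in range(dim)]  # untrained hidden-to-hidden weights
clf = rand_vec(2 * dim)                      # untrained classifier weights

p_sum = toxicity_probability(context_sum(context), target, clf)
p_rnn = toxicity_probability(context_rnn(context, w_in, w_rec), target, clf)
print(p_sum, p_rnn)
```

The sum ignores the order of the context utterances, while the recurrent summary depends on it; that is the main practical difference between the two variants in this sketch.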

4.2. Data Augmentation

Given the scarcity of datasets labelled for toxicity in context, we explore ways to use this data, as well as data augmentation strategies to build better prediction models.

More specifically, we explore two approaches to data augmentation: a generative approach and a transformative approach. The former generates new content based on the original distribution in the training data, while the latter modifies existing training data, creating a variant of the original training data. For this, we explore two techniques: a fine-tuned GPT-2 (Radford et al., 2019) language model and the Style Transformer model (Dai et al., 2019).

GPT-2 is a large state-of-the-art pre-trained Transformer-based language model (Radford et al., 2019). During the pre-training phase, language models such as GPT-2 are known to capture general knowledge about the language. They then need to be adapted to the task at hand by fine-tuning, using labelled data that reflect the expected distribution. We create fine-tuned models to either generate toxic data or non-toxic data so that the text generated would reflect both the style and the content of the respective fine-tuning dataset. For inference, language models require a prompt to initiate the generation as a continuation of this prompt. They then produce one word at a time, conditioned on the previously generated words and the prompt. We consider two types of prompts: the original context to condition the generation of the target comment, and the target comment to condition the generation of context. Note that this last setup breaks the sequential nature of the conversation but is suitable for the social media data.
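The two prompt types and the word-by-word continuation described above can be illustrated without a real language model. In the sketch below, `make_prompt`, `generate` and the lookup-table `toy_next_token` are hypothetical names; the lookup table merely stands in for a fine-tuned language model so that the loop is runnable.

```python
def make_prompt(context, target, mode):
    """Return the tokens used to start generation for one of the two prompt types."""
    if mode == "context_to_target":
        return " ".join(context).split()  # original context conditions a new target
    if mode == "target_to_context":
        return target.split()             # target comment conditions new context
    raise ValueError(f"unknown mode: {mode}")


def generate(prompt_tokens, next_token, max_new_tokens=20, eos="<eos>"):
    """Append one token at a time, each conditioned on the prompt plus earlier output."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        token = next_token(tokens)
        if token == eos:
            break
        tokens.append(token)
    return tokens[len(prompt_tokens):]


# Toy stand-in for a language model: looks only at the last token.
TOY_TABLE = {"neighbourhood": "they", "they": "seem", "seem": "friendly"}


def toy_next_token(tokens):
    return TOY_TABLE.get(tokens[-1], "<eos>")


prompt = make_prompt(
    ["a new family moved into my neighbourhood"], "welcome them", "context_to_target"
)
continuation = generate(prompt, toy_next_token)
print(" ".join(continuation))  # they seem friendly
```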

Style Transformer (Dai et al., 2019) is an auto-encoder architecture that learns to disentangle the style from the content of the input text, learning separate representations for style and content. The goal is to rewrite (transform) sentences with a desired style while preserving the content from the original sentence. This is done by training the network to separate and reconstruct, from its own output, the original input text and its style (see details in Appendix A.0.1). For inference, the model takes the tokens of the original text as its input and attempts to rewrite them given the requested style. The model thus has access to the entire input text during generation. We use Style Transformer to rewrite target comments or context into their opposite styles (toxic or non-toxic).

5. Experimental Settings

5.1. Datasets

In this section, we describe all three datasets used in this paper: a naturally occurring social media dataset, and the two human-in-the-loop dialogue datasets.


The FBK dataset (Menini et al., 2021), which we used in the analysis in Section 3, is a subset of the dataset for abusive language detection introduced in (Founta et al., 2018). The original dataset was formed by sampling tweets using the Twitter API. Using the tweet IDs, the study in (Menini et al., 2021) queried the Twitter API to extract context for the tweets (i.e., preceding tweets in the same thread) and filtered out the ones without context. After filtering, the dataset was annotated in two steps: the first one without context and the second one with context.

The FBK dataset is relatively small, with a total of 8018 examples. We have created our own data split to make sure samples where context is important are in all the splits. We refer to the cases where the toxicity labels had not changed from phase 1 to phase 2 as non-flipped (7K examples) and the remaining cases as flipped (1K examples).

We reserved around half of the flipped examples for training and validation, the other half was used for testing. For non-flipped data, the majority of the samples (6K) was used for training, the rest (1K) was split between validation and testing.
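A minimal sketch of such a split is given below. The field names, the 80/20 division of the remaining flipped examples between training and validation, and the fixed evaluation size for non-flipped examples are assumptions made for illustration; they are not stated in the text.

```python
import random


def split_by_flip(examples, stable_eval_size, train_share=0.8, seed=0):
    """Split so that flipped examples appear in training, validation and test sets."""
    rng = random.Random(seed)
    flipped = [e for e in examples if e["isolated"] != e["contextual"]]
    stable = [e for e in examples if e["isolated"] == e["contextual"]]
    rng.shuffle(flipped)
    rng.shuffle(stable)

    # Flipped: half for testing, the rest divided between training and validation.
    half = len(flipped) // 2
    flipped_test, flipped_dev = flipped[:half], flipped[half:]
    cut = int(len(flipped_dev) * train_share)

    # Non-flipped: fixed-size validation and test sets, everything else for training.
    stable_valid = stable[:stable_eval_size]
    stable_test = stable[stable_eval_size:2 * stable_eval_size]
    stable_train = stable[2 * stable_eval_size:]

    return {
        "train": flipped_dev[:cut] + stable_train,
        "valid": flipped_dev[cut:] + stable_valid,
        "test": flipped_test + stable_test,
    }


# Toy data: 10 flipped and 20 non-flipped examples.
toy = [{"id": i, "isolated": 1, "contextual": 0} for i in range(10)]
toy += [{"id": 10 + i, "isolated": 0, "contextual": 0} for i in range(20)]
splits = split_by_flip(toy, stable_eval_size=3)
print({name: len(part) for name, part in splits.items()})
```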

Our analysis in Section 3 showed that a lot of the cases in the FBK dataset constitute very complex and relatively unclear cases of toxicity. In addition, we observed that the nature of the data is not always conversational: some tweets respond/refer to the immediately preceding tweet(s), while others exhibit much longer distance relationships, sometimes referring to the original post or other far-away tweets in the conversation. Additionally, often the thread includes more than two users, which makes tracking the actual dialogue quite difficult.

Therefore, we also experiment with two simpler conversational datasets, which mimic conversations between two agents, be they two humans (BBF) or a human and a chatbot (BAD).

Build-It Break-It Fix-It (BBF)

The BBF dataset from (Dinan et al., 2019) was constructed by collecting utterances produced by crowd-workers instructed to intentionally continue segments of existing conversations by producing offensive utterances that the classifiers developed in the study would consider as safe (non-toxic). The dataset therefore contains contextual information for each produced utterance in the general conversational domain. In terms of annotation, adversarial utterances produced by the crowd-workers were annotated as unsafe. To combine this data with safe examples, dialogue examples from ConvAI2 which were labelled as safe by two crowd-workers were added. Instances in the dataset contain between 3 and 6 utterances.

Bot-Adversarial Dialogue (BAD)

The BAD dataset (Xu et al., 2021) was formed by collecting conversations between crowd-workers and dialogue models (i.e., chatbots). Crowd-workers were instructed to adversarially converse with the bots to instigate the generation of unsafe (toxic) messages by the bots. Instances contain between 6 and 32 utterances. Each dialogue is broken down into several samples, with each conversation turn forming a separate training example. That is, one instance contains one utterance of the dialogue along with 0 to 31 preceding utterances, as shown in Table 2.

This data is also in the general conversational domain. Responses generated by the bots were annotated by 4 crowd-workers, including the person conversing with them, while human-generated utterances were labelled by 3 crowd-workers. An utterance was considered to be unsafe if at least one of the following categories was tagged: Hate Speech, Personal Attack, Profanity and Other Offensiveness.

| | Target | Context | Gold |
|---|---|---|---|
| 1 | Bot: That’s nice. What do you do for fun? I like to read the newspaper. | P1: a new family just moved into my neighborhood | 0 |
| 2 | P1: the family that moved in is making lots of noise and being rude, acting like animals | P1: a new family just moved into my neighborhood / Bot: That’s nice. What do you do for fun? I like to read the newspaper. | |

Table 2. Examples from the BAD dataset showing how a dialogue instance between a crowd-worker and a bot is broken down into multiple instances, where 1 indicates a toxic comment and 0 a non-toxic one.
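The breakdown of one dialogue into per-turn instances can be illustrated with a minimal sketch. The function name `expand_dialogue`, the toy utterances and the toy labels are hypothetical and are not taken from the BAD release; the sketch only assumes that every turn becomes a target whose context is the list of all preceding turns.

```python
from typing import List, Tuple


def expand_dialogue(
    utterances: List[str], labels: List[int]
) -> List[Tuple[str, List[str], int]]:
    """Turn one dialogue into one (target, context, gold) instance per turn."""
    if len(utterances) != len(labels):
        raise ValueError("one label is needed per utterance")
    instances = []
    for i, (target, gold) in enumerate(zip(utterances, labels)):
        # The context is every utterance that precedes the target, oldest first.
        context = utterances[:i]
        instances.append((target, context, gold))
    return instances


# Toy dialogue with hypothetical labels (1 = toxic, 0 = non-toxic).
toy = expand_dialogue(["turn 1", "turn 2", "turn 3"], [0, 0, 1])
for target, context, gold in toy:
    print(gold, target, context)
```

Under this scheme the same utterance appears once as a target and then again inside the context of every later instance, which is why a dialogue of n turns yields n instances with 0 to n-1 context utterances.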

Table 3 summarises the statistics of each of our three datasets, using the provided splits for BBF and BAD. Note that the FBK and the BBF datasets are heavily imbalanced, while BAD is a relatively balanced dataset.

Dataset Train/Valid/Test Toxic Context Len Target Len
FBK Non-flipped 5948/500/500 8.7%/8.4%/8.6% 50.5 20.6
FBK Flipped 428/107/535 18.7%/18.7%/18.5% 41.1 17.1
BBF 24000/3000/3000 10%/10%/10% 32.1 10.1
BAD 69274/7002/2598 39.3%/39.5%/36.3% 95.5 15.2
Table 3. Dataset statistics: number of samples in each split, proportion of toxic cases, average lengths (in words) of context and target comment.

5.2. Setups for our Context-aware Models

In our experiments, we explore several ways to represent context and its structure. Based on the three models presented in Section 4.1, we try a few variants to seek understanding of which parts of context are more helpful:

  • ContextSingle – we encode all the context utterances concatenated into one string following their reverse chronological order so that the last context text is maintained.

  • ContextSingleFirst – same setup as above, but the context utterances are concatenated into one string following their natural chronological order so that the first context text is maintained.

  • ContextSum – we encode all the context utterances as groups of several utterances. These groups represent different structural parts of the context (beginning, middle and end). An aggregated context history is then formed as the sum of the representations of the parts.

  • ContextRNN – we again form groups of utterances, but instead of summing their representations we input them into an RNN and take its last hidden state as the summary.

  • ContextRNNFirst – replicates ContextRNN, but the utterance representations are encoded by the RNN in the reverse chronological order, better capturing the beginning of the context.
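The effect of the ordering choice under truncation, and the summation used by ContextSum, can be sketched as follows. This is an illustration only: `concat_context` and `sum_parts` are hypothetical names, whitespace splitting stands in for the BERT tokeniser, and the small vectors stand in for encoder representations.

```python
from typing import List


def concat_context(utterances: List[str], max_tokens: int, keep: str = "last") -> List[str]:
    """Concatenate context utterances and truncate to max_tokens tokens.

    keep="last":  reverse chronological order, so the most recent utterance
                  survives truncation (ContextSingle-like).
    keep="first": natural chronological order, so the opening utterance
                  survives truncation (ContextSingleFirst-like).
    """
    if keep not in ("last", "first"):
        raise ValueError("keep must be 'last' or 'first'")
    ordered = list(reversed(utterances)) if keep == "last" else list(utterances)
    # Whitespace tokenisation is an assumption made for this illustration.
    tokens = [tok for utt in ordered for tok in utt.split()]
    return tokens[:max_tokens]


def sum_parts(part_vectors: List[List[float]]) -> List[float]:
    """Element-wise sum of the part representations (ContextSum-like)."""
    return [sum(dims) for dims in zip(*part_vectors)]


context = ["hello there", "how are you", "fine thanks"]
print(concat_context(context, 3, keep="last"))   # keeps the end of the context
print(concat_context(context, 3, keep="first"))  # keeps the start of the context
print(sum_parts([[1.0, 2.0], [0.5, 0.5]]))
```

The RNN variants replace `sum_parts` with a recurrent pass over the same part representations, so the order in which the parts are fed determines which end of the context dominates the final hidden state.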

5.3. Baselines

Following the best practices in the domain (Xu et al., 2021), we build several baseline models based on the same pre-trained BERT as in the previous section.

  • TextOnly is a BERT-based model that takes only the target comment as input.

  • TextConcat is the same BERT-based model that takes in the target comment concatenated together with the context utterances in the reverse chronological order.

  • TextConcatFirst is the same BERT-based model, but it takes as input the target comment concatenated together with the context utterances in the reverse chronological order (to maintain the first utterance).

  • NextSentPred is our re-implementation of the models of (Xu et al., 2021) and (Menini et al., 2021) where the concatenated context is input into the BERT model as the previous sentence and the target sentence is input into the BERT model as the next sentence mimicking the BERT next sentence prediction (NSP) setup. The model is then fine-tuned for the binary toxicity prediction using the resulting BERT CLS token embedding.
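The sentence-pair input used by a NextSentPred-style model can be sketched as below. The layout `[CLS] context [SEP] target [SEP]` with segment ids 0 and 1 is the standard BERT pair format; the function name `build_pair_input` and the choice to truncate the context (rather than the target) when the pair is too long are assumptions made for this illustration.

```python
from typing import List, Tuple


def build_pair_input(
    context_tokens: List[str], target_tokens: List[str], max_len: int = 512
) -> Tuple[List[str], List[int]]:
    """Build a BERT-style sentence pair: context as segment 0, target as segment 1."""
    # Three positions are reserved for [CLS] and the two [SEP] tokens.
    budget = max_len - len(target_tokens) - 3
    # Assumption: the target is never truncated; a very long target can
    # therefore still exceed max_len in this simplified sketch.
    kept = context_tokens[:budget] if budget > 0 else []
    tokens = ["[CLS]"] + kept + ["[SEP]"] + target_tokens + ["[SEP]"]
    segment_ids = [0] * (len(kept) + 2) + [1] * (len(target_tokens) + 1)
    return tokens, segment_ids


tokens, segments = build_pair_input(["a", "b", "c"], ["x"], max_len=6)
print(tokens)
print(segments)
```

The classifier is then fine-tuned on the representation at the `[CLS]` position, as described for NextSentPred above.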

For a broader perspective, we also compare to three online tools for toxicity detection: (1) the Perspective API, (2) Azure Content Moderation, and (3) Clarifai.

We feed into those tools either the text of the target comment only (TextOnly) or the concatenation of the context and the target comment (TextConcat) in chronological order. These tools output probabilities for more fine-grained types of toxicity. To convert them into binary labels, we compute the final toxicity scores as follows. For Perspective, we take the predicted probability of the toxicity class as the toxicity score, since this is a general class that includes all types of toxicity. For Azure, we define the toxicity score as the maximum probability over the three predicted categories: sexually explicit or adult language, sexually suggestive or mature language, and offensive language. For Clarifai, we take the predicted probability of the toxic class.
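The reduction of fine-grained tool outputs to a single score can be sketched as follows. The dictionary keys, the function names and the 0.5 decision threshold are hypothetical; they do not reproduce the response format of any of the three services, and the text above does not state which threshold was used.

```python
from typing import Dict


def max_category_score(category_probs: Dict[str, float]) -> float:
    """Toxicity score as the maximum probability over the returned categories."""
    if not category_probs:
        raise ValueError("at least one category probability is required")
    return max(category_probs.values())


def to_binary(score: float, threshold: float = 0.5) -> int:
    """Map a score to a label (1 = toxic). The threshold is an assumption."""
    return int(score >= threshold)


# Hypothetical output with three categories, mirroring the Azure setup.
probs = {"explicit": 0.10, "suggestive": 0.20, "offensive": 0.70}
score = max_category_score(probs)
print(score, to_binary(score))
```

For tools that expose a single general toxicity class, the score is simply that class probability and only the thresholding step applies.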

5.4. Setups for Data Augmentation

Depending on the type of the data, we propose two types of data augmentation setups:

5.4.1. Synthesising Toxic Examples

To address the label imbalance in the BBF and BAD data, we generate toxic examples conditioned on the context sampled from the original training data (DataAug). We fine-tune the state-of-the-art GPT-2 generation model (Radford et al., 2019) for each dataset using only the toxic part of the data. We augment toxic samples by generating toxic target comments conditioned on the sampled non-toxic context utterances. This is to avoid the bias with the data used to fine-tune the language model. For each dataset, we take the number of the generated toxic instances necessary to reach the balance (50-50%) of toxic and non-toxic examples in the final training data.
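As a worked example of the balancing step, the number of synthetic toxic instances follows directly from the split size and the toxic proportion. The sketch below uses the rounded training-set figures from Table 3; the function name `synthetic_needed` is hypothetical, and because the published proportions are rounded the counts are approximate.

```python
def synthetic_needed(n_train: int, toxic_fraction: float) -> int:
    """Number of toxic examples to add so that toxic and non-toxic counts match."""
    n_toxic = round(n_train * toxic_fraction)
    n_non_toxic = n_train - n_toxic
    # No augmentation is needed if toxic examples already form the majority.
    return max(0, n_non_toxic - n_toxic)


# BBF training split in Table 3: 24000 samples, 10% toxic.
print(synthetic_needed(24000, 0.10))
# BAD training split in Table 3: 69274 samples, 39.3% toxic.
print(synthetic_needed(69274, 0.393))
```

With these figures, BBF requires 19200 generated toxic targets, which is eight times its real toxic data, whereas BAD requires roughly 14.8K.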

5.4.2. Synthesising Adversarial Examples

To address the issue of data sparsity where the context matters, we design the following adversarial data augmentation procedure:

  • 1-0 Flip Target - given the context (preceding non-toxic examples), we generate target tweets containing toxic (mainly profanity) words and mark this data as non-toxic. We generate 5K such examples to keep the original number of non-toxic examples in the non-flipped training data of (Menini et al., 2021) (Table 3). See example 1 in Table 4.

  • 0-1 Flip Target - given the context (preceding toxic examples), we generate target tweets without explicit toxicity lexicon and label this data as toxic. We generate 500 such examples to keep the number of toxic examples in the non-flipped training data of (Menini et al., 2021). See example 2 in Table 4.

  • Flip Context - given the non-toxic tweet, we generate toxic context and label this data as toxic. We generate 5K such examples to augment the number of toxic examples in the data of (Menini et al., 2021). See example in Table 5.

All the setups are replicated for both the generative (GPT-2) and the transformative (Style Transformer) generation approaches. Note that the transformative examples in Tables 4 and 5 are unsurprisingly very close to the original data.

Within each approach, we then experiment with different combinations of data types with or without adding the real training data.

For each dataset, we augment the training data of our best contextual model.

Real Context Generative Target Transformative Target Real Target
1 1-0 Flip Target why are all my mutuals so attractive while im just a lil potato … the truth is we have a whole government that doesn’t know what it’s like to be fucked up when it comes to government. Gyou are so hot and gorgeous and i didn ’ t know i could be ass im bc fuck you ? you are so hot and gorgeous and i didn’t know i could be but im gayer bc of you
2 0-1 Flip Target she’s now famous muric uh - ” it has been proven that the only thing you can do is work on one will be to make a huge amount of money from a small amount and in your mind,. ” rose blair - pilot should just targeted us , end it , we haven ’ t learned main . kim jong - un should just fucking nuke us, end it, we haven’t learned anything.
Table 4. Examples of target posts generated by the fine-tuned GPT-2 model (Generative) and the Style Transformer model (Transformative). 1-0 Flip Target is the setup where non-toxic (0) examples typically contain toxic words. 0-1 Flip Target is the setup where toxic (1) examples without toxic words are generated using the toxic context. Both types of examples require context to be labelled correctly.
Generative Context Transformative Context Real Context Real Target
that was my last summer assignment.. fuck it up!! i am so sick of being on assignments now but then you can’t work out on weekends like every weekend of past shit to make your weekend feel like a fucking month so disappointed in , my self broke fuckin one in act , husband ’ s does same , nothing in months and no email responses effectively . so disappointed in, my band broke new one in weeks, husband’s does same, nothing in months and no email responses i believe we will be the service - they were great for me - i think he has just fallen into an email crevice.
Table 5. Examples of context generated by the fine-tuned GPT-2 model (Generative) and the Style Transformer model (Transformative). Flip Context is the setup where toxic context is generated to create toxic examples. Target posts are originally non-toxic and require context to be labelled correctly.

5.5. Implementation Details

Context-aware Models

We implement all our prediction models using the BERT model (bert-base-uncased) from the HuggingFace library (Wolf et al., 2019). The maximum input length for the models that encode each utterance separately is set to 175 tokens for BBF and BAD and to 100 tokens for FBK (these values are tuned on the validation set). For the models that take the concatenated context as input, we use the default BERT maximum input length to account for the additional length of the concatenated context. For our contextual models, we take the CLS token representations as the utterance representations. We use GRU cells with dimensionality 768 for our sequential RNN models. The concatenation of the history and post representations is input into the output layer, followed by a sigmoid transformation. For all the datasets, we split the context into sequences of 2 utterances, so that the context in BBF is split into 2 parts on average (considering the average context length of about 3.5 utterances), the context in BAD into 3 parts (average context length of about 6.4 utterances) and the context in FBK into 2 parts (average context length of about 2.9 utterances).
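The splitting of a context into sequences of 2 utterances can be sketched as below. The function name `split_context` is hypothetical, and letting a trailing single utterance form its own shorter part is an assumption, since the handling of odd-length contexts is not specified above.

```python
from typing import List


def split_context(utterances: List[str], group_size: int = 2) -> List[List[str]]:
    """Split a context into consecutive parts of at most group_size utterances."""
    if group_size < 1:
        raise ValueError("group_size must be positive")
    return [
        utterances[i:i + group_size]
        for i in range(0, len(utterances), group_size)
    ]


print(split_context(["u1", "u2", "u3"]))
print(split_context(["u1", "u2", "u3", "u4", "u5", "u6"]))
```

Each part is then encoded separately, and the resulting part representations are either summed or passed through the GRU to form the context history.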

Our models are trained to minimise the binary cross-entropy loss with a batch size of 32 for the BBF data and a batch size of 10 for the other two datasets. Training continues until convergence of the validation loss, with a patience of 3 epochs. The models typically converge after 2-3 epochs.
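The stopping rule with patience can be illustrated with a short sketch. The function name `early_stopping` and the validation losses are hypothetical; the sketch assumes that training stops once the validation loss has failed to improve for `patience` consecutive epochs and that the best epoch so far is kept.

```python
from typing import List, Tuple


def early_stopping(val_losses: List[float], patience: int = 3) -> Tuple[int, int]:
    """Return (stop_epoch, best_epoch), both 0-indexed, for a non-empty loss list."""
    if not val_losses:
        raise ValueError("at least one validation loss is required")
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # The loss list ended before the patience budget was exhausted.
    return len(val_losses) - 1, best_epoch


# Hypothetical validation losses: the best epoch is 1 and training stops at epoch 4.
print(early_stopping([0.50, 0.40, 0.41, 0.42, 0.43, 0.30], patience=3))
```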

Data Augmentation Models

We fine-tune two GPT-2 medium models from the HuggingFace library. We use the AdamW optimiser (Loshchilov and Hutter, 2017). We train each model with a batch size of 16 to minimise the cross-entropy loss until convergence, with a patience of 5 over the validation loss. For inference, we sample with top-p 0.9, top-k 30 and a repetition penalty of 1.2.
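The sampling restriction at inference time can be sketched in plain Python, assuming the thresholds 0.9 and 30 refer to nucleus (top-p) and top-k filtering respectively. The function name `filter_top_k_top_p` and the toy distribution are hypothetical, applying top-k before top-p is a design choice of this sketch, and the repetition penalty is not modelled here.

```python
from typing import Dict


def filter_top_k_top_p(
    probs: Dict[str, float], top_k: int = 30, top_p: float = 0.9
) -> Dict[str, float]:
    """Keep the top_k most likely tokens, then the smallest prefix of them whose
    renormalised cumulative probability reaches top_p, and renormalise again."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    kept = []
    cumulative = 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p / total
        if cumulative >= top_p:
            break
    norm = sum(p for _, p in kept)
    return {token: p / norm for token, p in kept}


# Toy next-token distribution (hypothetical values).
toy = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
print(filter_top_k_top_p(toy, top_k=3, top_p=0.9))
print(filter_top_k_top_p(toy, top_k=4, top_p=0.75))
```

The next token is then drawn at random from the filtered distribution, which removes the low-probability tail while keeping more variety than greedy decoding.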

We train the Style Transformer model from scratch, as provided by the official implementation, with a batch size of 64 and a maximum sequence generation length of 80 tokens, using the Adam optimiser (Kingma and Ba, 2014). At inference time, we use greedy decoding as implemented in the auxiliary code.

For FBK, we train all of our generation models on the non-contextual part of the corpus of (Founta et al., 2018). This corpus contains about 54K non-toxic examples and about 32K toxic examples. For all our models, we use the same validation set as for the toxicity prediction models.

For BBF and BAD, we use the toxic part of each training set to fine-tune toxic generation models. We then sample context from the non-toxic part of the corpora to condition the inference. This is possible as the dialogue contexts for toxic and non-toxic examples are homogeneous.

6. Results

6.1. Contextual vs Baseline Models

Results are in Tables 6, 7 and 8. Following (Xu et al., 2021), we report F1 of the toxic class. We also report accuracy, as well as ROC-AUC. We use the two-sample Kolmogorov-Smirnov (2S-KS) equality test for independent samples to measure the statistical significance of our results, using the samples of predicted probabilities.
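The two less common quantities here, F1 of the toxic class and the two-sample Kolmogorov-Smirnov statistic, can be sketched in plain Python. The function names and toy inputs are hypothetical, and only the KS statistic is computed; a p-value would normally come from a statistics library such as SciPy (`scipy.stats.ks_2samp`).

```python
from typing import List


def f1_toxic(gold: List[int], pred: List[int]) -> float:
    """F1 of the positive (toxic = 1) class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def ks_statistic(sample_a: List[float], sample_b: List[float]) -> float:
    """Largest absolute gap between the two empirical CDFs."""
    def ecdf(sample: List[float], x: float) -> float:
        return sum(1 for v in sample if v <= x) / len(sample)

    points = sorted(set(sample_a) | set(sample_b))
    return max(abs(ecdf(sample_a, x) - ecdf(sample_b, x)) for x in points)


print(f1_toxic([1, 1, 0, 0], [1, 0, 1, 0]))
# Predicted probabilities of two hypothetical models on the same test set.
print(ks_statistic([0.1, 0.2], [0.8, 0.9]))
```

A statistic close to 0 means the two models produce similarly distributed probabilities, while a value close to 1 means the distributions barely overlap.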

As a first general remark, our models outperform the existing online tools by a large margin for BBF (+0.5 F1 for the best model) and BAD (+0.23 F1 for the best model). For FBK, our models perform on par with those tools, which might be explained by the fact that the FBK data is included in the training data of those models.

For BBF, our models perform better than the state-of-the-art BERT-based models of (Dinan et al., 2019). The performance of our BAD models is also comparable to that of the BERT-based models from (Xu et al., 2021) (-0.05 for our best model). It is worth noting that the latter state-of-the-art model uses a wider range of training data, whereas our model does not exploit external datasets other than the BAD data itself. We focused on exploring the value of the context rather than on adding more data to get better overall results. If the training is performed under the same constrained settings, our best models actually outperform our re-implementation of the state-of-the-art NextSentPred model for both BBF (+0.04 F1) and BAD (+0.05 F1). For FBK, our results are on par with non-contextual models.

Regarding the contribution of the context as compared to the non-contextual models, we show that contextual information is particularly helpful in the cases where the contexts are different in each sample (i.e. no repetition of utterances over different contexts and across context and target comments): for BBF we obtain +0.08 F1, while for FBK we obtain +0.04 F1. For BAD, our best contextual model (ContextRNN) performs on par with the text-only model. This is most likely attributable to the difficulty of modelling the incremental sequences of dialogue utterances in this dataset, as well as to the fact that some utterances appear as both context and target comment across different samples (see Table 2).

Finally, our experiments also shed light on the most relevant parts of the dialogue. For both BBF and FBK, the beginning of the context seems to be more important: in both cases, models that reverse the natural chronological order of the conversation perform better than their counterparts that do not reverse this order. ContextSingleFirst is the best performing model for BBF and TextConcatFirst for FBK. For FBK, this finding can be explained by the fact that Twitter conversations do not tend to form proper utterance sequences, but rather structures where multiple user posts are mainly connected to the root post rather than to the immediately adjacent comments. For BAD, the performance is slightly better for the models that consider the natural order of the context utterances. This could be explained by the fact that the unique dialogue utterances appear at the end of each context, while other utterances are repeated multiple times.

Model ACC AUC F1
Perspective TextOnly 0.872 0.641 0.151
Perspective TextConcat 0.861 0.606 0.095
Azure TextOnly 0.864 0.643 0.169
Azure TextConcat 0.796 0.610 0.171
Clarifai TextOnly 0.874 0.550 0.055
Clarifai TextConcat 0.874 0.535 0.031
TextOnly 0.900 0.924 0.647
TextConcat 0.882 0.893 0.594
TextConcat First 0.911 0.927 0.663
ContextSingle 0.912 0.936 0.669
ContextSingleFirst 0.913 0.933 0.672
ContextSum 0.896 0.924 0.630
ContextRNN 0.911 0.910 0.643
ContextRNN First 0.903 0.925 0.632
NextSentPred 0.903 0.904 0.627
(Dinan et al., 2019) 0.664
Table 6. Performance on the BBF dataset. Bold highlights best results. The symbol indicates that the difference between these and the TextOnly results are statistically significant ().
Model ACC AUC F1
Perspective TextOnly 0.738 0.805 0.516
Perspective TextConcat 0.628 0.658 0.498
Azure TextOnly 0.717 0.712 0.492
Azure TextConcat 0.532 0.581 0.518
Clarifai TextOnly 0.690 0.720 0.286
Clarifai TextConcat 0.649 0.654 0.308
TextOnly 0.786 0.888 0.748
TextConcat 0.780 0.871 0.733
TextConcat First 0.776 0.873 0.734
ContextSingle 0.783 0.874 0.737
ContextSingleFirst 0.787 0.873 0.734
ContextSum 0.790 0.883 0.743
ContextRNN 0.796 0.885 0.749
ContextRNN First 0.789 0.879 0.741
NextSentPred 0.758 0.858 0.707
(Xu et al., 2021) 0.808
Table 7. Performance on the BAD dataset. Bold highlights best results. The symbol indicates that the difference between these and the TextOnly results are statistically significant ().
Model full (ACC AUC F1) non-flip (ACC AUC F1) flip (ACC AUC F1)
Perspective TextOnly 0.518 0.639 0.302 0.910 0.989 0.651 0.151 0.214 0.225
Perspective ContextConcat 0.481 0.633 0.283 0.826 0.972 0.497 0.159 0.274 0.219
Azure TextOnly 0.506 0.622 0.287 0.880 0.945 0.583 0.157 0.297 0.213
Azure ContextConcat 0.433 0.602 0.278 0.710 0.842 0.367 0.174 0.384 0.243
Clarifai TextOnly 0.541 0.639 0.298 0.936 0.992 0.724 0.172 0.201 0.210
Clarifai ContextConcat 0.545 0.609 0.281 0.894 0.968 0.619 0.219 0.249 0.190
TextOnly 0.623 0.648 0.272 0.968 0.987 0.814 0.301 0.287 0.169
TextConcat 0.535 0.648 0.290 0.912 0.982 0.656 0.183 0.300 0.204
TextConcat First 0.525 0.635 0.307 0.896 0.978 0.612 0.178 0.287 0.236
ContextSingle 0.694 0.675 0.265 0.964 0.964 0.775 0.441 0.347 0.148
ContextSingle First 0.685 0.677 0.282 0.954 0.975 0.729 0.434 0.345 0.179
ContextSum 0.558 0.625 0.269 0.962 0.989 0.812 0.181 0.214 0.164
ContextRNN 0.598 0.610 0.252 0.976 0.944 0.864 0.245 0.247 0.137
ContextRNN First 0.612 0.640 0.244 0.966 0.960 0.805 0.280 0.276 0.135
NextSentPred 0.623 0.670 0.304 0.912 0.967 0.621 0.353 0.383 0.221
Table 8. Performance on the FBK. Bold highlights best results. The symbol indicates that the difference between these and the TextOnly results are statistically significant ().

6.2. Data Augmentation Experiments

Table 9 shows the results of our data augmentation experiments for the FBK dataset. These results confirm our hypothesis that the lack of contextual data can be effectively addressed using state-of-the-art data augmentation techniques. We observe that the synthetic data created by both generative and transformative data augmentation methods is beneficial for the detection of contextual toxicity. With similar dataset sizes, our models trained using only the synthetic data lead to a performance boost over the models trained only with the real data for the flipped FBK test set (+0.24 F1 on average).

Note that this also hints at the fact that the performance gains across setups cannot be attributed only to the increase in data size, as setups with the same training data size as the original data can result in a performance boost (+0.02 F1 Flipped Target for Generative) or not (no change in performance for Flipped Target Transformative).

Regarding the augmentation methodology, the generative method allows creating samples that are diverse enough to maintain high performance over both flipped and non-flipped test sets (-0.08 F1 and +0.03 F1, for non-flipped and flipped test sets, respectively, for the best configuration). The decrease in the performance over the non-flipped test set is not a negative outcome, since the original model overfits to the data with the performance of 0.81 F1. In contrast, the style transfer approach biases the model towards detecting the flipped cases (-0.25 F1 and +0.02 F1, for non-flipped and flipped test sets, respectively, for the best configuration).

Regarding the effectiveness in modelling context or target comments, unsurprisingly modelling shorter comments is much easier. The overall performance increase with synthetic comments added to the real training data is +0.02 F1 for the full test set. Synthetic contextual information is more difficult to model for a range of reasons including maintaining coherence of longer texts, inadequacy of the standard language generation procedure for reverse generation, etc. Hence, minor or no improvements for this augmentation setup were observed across methods.

Our models trained using the synthetic data as created by the transformative method perform the best when during training they are able to see contrastive context (+0.03 F1 full test set, Flip Context + Real setup). Note the drop of performance for the generative procedure in this case (-0.05 F1 full test set), where the model does not see parallel toxic and non-toxic setups. The transformative setup allows the model to see parallel toxic and non-toxic context and potentially capture the subtle differences between toxic and non-toxic expression of the same meaning. We believe that style transformations as applied to the context could constitute pathways to artificially create data for sarcasm detection (neutral utterances in toxic context).

Adding new synthetic training examples to the BBF and BAD datasets (the DataAug setup as described in Section 5.4.1) resulted in significant performance drops, as shown in Table 10.

This is because generating language that preserves the more complex dialogue structure and the dependencies between the comment and the context, while relying on highly ambiguous content, is much harder. In fact, manual inspection of the generated data showed that the generation model does not capture well the contextual toxicity of words that normally carry no toxic meaning, and hence fails to generate good toxic examples (see the usage of “animals” in example 2 in Table 2).

Setup full (ACC AUC F1) non-flip (ACC AUC F1) flip (ACC AUC F1)
Real 0.558 0.625 0.269 0.962 0.989 0.812 0.181 0.214 0.164
Synt + Real
1-0 + 0-1 Flip Target 0.689 0.643 0.294 0.952 0.964 0.727 0.443 0.311 0.190
Flip Context 0.647 0.649 0.280 0.962 0.965 0.782 0.353 0.328 0.176
1-0 + 0-1 Flip Target + Context 0.594 0.575 0.222 0.966 0.965 0.813 0.247 0.184 0.102
1-0 + 0-1 Flip Target + Context 0.442 0.492 0.213 0.184 0.163 0.042 0.682 0.739 0.448
Synt + Real
1-0 + 0-1 Flip Target 0.604 0.629 0.273 0.948 0.977 0.735 0.282 0.281 0.176
Flip Context 0.558 0.614 0.273 0.946 0.979 0.752 0.196 0.276 0.173
1-0 + 0-1 Flip Target + Context 0.650 0.649 0.298 0.888 0.942 0.562 0.428 0.382 0.211
1-0 + 0-1 Flip Target + Context 0.280 0.532 0.241 0.170 0.387 0.137 0.383 0.638 0.340
Table 9. Performance on the FBK. Bold highlights best results. Synt + Real indicates setups with both synthetic and real data and Synt setups with only the synthetic data. The symbol indicates that the difference between these and the Real results are statistically significant ().
Model ACC AUC F1
ContextSingle First 0.913 0.933 0.672
+ DataAug 0.895 0.929 0.623
ContextRNN 0.796 0.885 0.749
+ DataAug 0.363 0.501 0.533
Table 10. Performance of the generative data augmentation for BBF and BAD. Bold highlights best results.

7. Discussion and Conclusions

The impact of contextual models depends on the type of data

In our study, we have investigated three different types of context and analysed their utility for toxicity detection: context containing short, non-repetitive dialogues, context containing incremental, longer dialogues, and context from social media threads without a well-defined structure. We have observed that the context of short exchanges between interlocutors is the most beneficial for contextual toxicity detection, as the structure is well-defined.

The impact of data augmentation depends on the type of data

Synthetic data created using different techniques for data augmentation can help models learn relevant distributions from social media data, and as a consequence lead to better contextual predictions. Datasets structured as a dialogue, such as BAD and BBF, however, make it more challenging to automatically generate language that can preserve more complex structure and dependencies between context and comments.

Context helps improve both false positives and false negatives

In Table 11 we show four examples where our contextual model improves over the text-only model on the BBF dataset to reduce false negatives or false positives. In the two top rows, the comments on their own are not considered toxic by the model, but in the presence of context, the decision changes. The first comment is unclear even to humans without the context, as it could be referring to a self-reflection of the second person (P2) in the conversation, instead of pointing out that the first person (P1) has long working hours because they dropped out of school. The second example could be interpreted both ways, but it is clearly toxic in the presence of the context. The bottom two rows show examples where the comment on its own is perceived as toxic (even by humans): as a threat from P2 in the third row, and as a violent comment in the last row. However, in the presence of context it is detected as non-toxic.

| | Target | Context | Gold | Text pred. | Context pred. |
|---|---|---|---|---|---|
| 1 | P2: shouldnt have dropped out of school then | P1: hello! how are you doing this evening? / P2: hi! i am doing well and you? / P1: good. just got home from a12 hour word day. i hate those hours | 1 | 0 | 1 |
| 2 | P2: hah be real, you dont have any friends do you | P1: hey! whats shakin and bakin? / P2: nothing much. relaxing and listening to some chainsmokers! how are you? / P1: oh great band. just hanging out. have a few friends over. | 1 | 0 | 1 |
| 3 | P2: where i live snitches can get killed. | P1: how are you doing tonight? / P2: ok but im staying up because my next door neighbors are fighting again / P1: oh no thats not good, are they abusive? / P2: i m not sure. both are scary actually. i think theyre drug dealers / P1: have you called the police? | 0 | 1 | 0 |
| 4 | P1: so do i. flying, mostly. but also maybe the power to explode things. | P1: hi… i can not wait for fallon tonight. what is up for you? / P2: headed to bed earlier that i would like! womp! whos on fallon? / P1: i dont know. i always watch. but it is funny. / P2: i did watch when gal was on wonder woman. / P1: gotta love military girls! / P2: right. shes awesome. i wish i had super powers. | 0 | 1 | 0 |

Table 11. Examples of false negative (two top rows) and false positive (two bottom rows) cases from the BBF dataset where our text-only baseline model (Text) fails and our ContextRNN model (Context) succeeds. Each of the samples is a dialogue between two people (P1 and P2).

Overall, we conclude that detecting toxicity in context is a non-trivial and underexplored problem.

The very few previous studies that looked into this problem have observed that humans change their toxicity judgements in the presence of context. We analysed existing datasets with conversational context and came to the conclusion that those judgement changes are due to polarity and topics in the context of the comments. In addition, the structure of the conversation plays an important role. We then propose to address the challenge of detecting such contextual toxicity with neural architectures that are aware of the conversational structure, as well as with synthetic data created via augmentation techniques that diversify the naturally occurring data. Our work opens new pathways towards the exploration of context for toxicity detection.

Ethical Considerations

We do not collect any new data in this study and adhere to the Twitter ToS for data access and usage of the existing data.

Taking into account the potential harm of using pre-trained language models (Bender et al., 2021), we note that in this work we do fine-tune them to generate toxic content and acknowledge that this methodology could be misused. However, our ultimate goal is to use such data for the opposite purpose: we create models that can assist humans in toxicity detection and reduce human exposure to toxic content. We also acknowledge potential misuse of any toxicity detection models, such as blocking users and limiting freedom of speech.


  • E. M. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell (2021) On the dangers of stochastic parrots: can language models be too big?. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, New York, NY, USA, pp. 610–623. External Links: ISBN 9781450383097, Link, Document Cited by: Ethical Considerations.
  • T. Caselli, V. Basile, J. Mitrovic, I. Kartoziya, and M. Granitzer (2020) I feel offended, don’t be abusive! implicit/explicit messages in offensive and abusive language. In LREC, Cited by: §1, §2.
  • J. P. Chang and C. Danescu-Niculescu-Mizil (2019) Trouble on the horizon: forecasting the derailment of online conversations as they develop. In EMNLP-IJCNLP, Cited by: §2.1.
  • N. Dai, J. Liang, X. Qiu, and X. Huang (2019) Style transformer: unpaired text style transfer without disentangled latent representation. CoRR abs/1905.05621. External Links: Link, 1905.05621 Cited by: §A.0.1, §A.0.1, §A.0.1, §1, §4.2, §4.2.
  • C. De Kock and A. Vlachos (2021) I beg to differ: a study of constructive disagreement in online conversations. In ACL, Cited by: §2.1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, Minnesota, pp. 4171–4186. Cited by: §1, §2.1.
  • E. Dinan, S. Humeau, B. Chintagunta, and J. Weston (2019) Build it break it fix it for dialogue safety: robustness from adversarial human attack. CoRR abs/1908.06083. External Links: Link, 1908.06083 Cited by: §2.1, §2.1, §2.1, §2.1, §3, §5.1, §6.1, Table 6.
  • C. N. dos Santos, I. Melnyk, and I. Padhi (2018) Fighting offensive language on social media with unsupervised text style transfer. CoRR abs/1805.07685. External Links: Link, 1805.07685 Cited by: §2.2.
  • S. Edunov, M. Ott, M. Auli, and D. Grangier (2018) Understanding Back-Translation at Scale. In

    Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

    Brussels, Belgium, pp. 489–500. External Links: Document, Link Cited by: §2.2.
  • M. Fanton, H. Bonaldi, S. S. Tekiroglu, and M. Guerini (2021) Human-in-the-loop for data collection: a multi-target counter narrative dataset to fight online hate speech. CoRR abs/2107.08720. External Links: Link, 2107.08720 Cited by: §2.1.
  • A. Founta, C. Djouvas, D. Chatzakou, I. Leontiadis, J. Blackburn, G. Stringhini, A. Vakali, M. Sirivianos, and N. Kourtellis (2018) Large scale crowdsourcing and characterization of twitter abusive behavior. In ICWSM, Cited by: §1, §3, §5.1, §5.5.
  • L. Gao and R. Huang (2017) Detecting online hate speech using context aware models. In RANLP, Varna, Bulgaria, pp. 260–266. External Links: Link, Document Cited by: §2.1.
  • R. Gomez, J. Gibert, L. Gomez, and D. Karatzas (2019) Exploring hate speech detection in multimodal publications. In arXiv, Cited by: §1.
  • C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith, R. Kern, M. Picus, S. Hoyer, M. H. van Kerkwijk, M. Brett, A. Haldane, J. F. del R’ıo, M. Wiebe, P. Peterson, P. G’erard-Marchant, K. Sheppard, T. Reddy, W. Weckesser, H. Abbasi, C. Gohlke, and T. E. Oliphant (2020) Array programming with NumPy. Nature 585 (7825), pp. 357–362. External Links: Document, Link Cited by: Appendix B.
  • J. Ive et al. (2020) Generation and evaluation of artificial mental health records for Natural Language Processing. Nature Digital Medicine. Cited by: §2.2.
  • D. Jurgens, L. Hemphill, and E. Chandrasekharan (2019) A just and comprehensive strategy for using NLP to address online abuse. In ACL, Cited by: §1, §2.
  • D. Kiela, H. Firooz, A. Mohan, V. Goswami, A. Singh, P. Ringshia, and D. Testuggine (2021) The hateful memes challenge: detecting hate speech in multimodal memes. In arXiv, Cited by: §1.
  • D. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. International Conference on Learning Representations. Cited by: §5.5.
  • R. Liu, G. Xu, C. Jia, W. Ma, L. Wang, and S. Vosoughi (2020) Data Boost: Text Data Augmentation Through Reinforcement Learning Guided Conditional Generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 9031–9041. External Links: Document, Link Cited by: §2.2.
  • I. Loshchilov and F. Hutter (2017) Fixing weight decay regularization in adam. CoRR abs/1711.05101. External Links: Link, 1711.05101 Cited by: §5.5.
  • S. Menini, A. P. Aprosio, and S. Tonelli (2021) Abuse is contextual, what about nlp? the role of context in abusive language annotation and detection. In arXiv, Cited by: §1, §1, §1, §1, §2.1, §2.1, §2.1, §2.1, §2.1, §3, §3, 4th item, 1st item, 2nd item, 3rd item, §5.1.
  • M. Mirza and S. Osindero (2014) Conditional generative adversarial nets. CoRR abs/1411.1784. External Links: Link, 1411.1784 Cited by: §A.0.1.
  • H. Mubarak, K. Darwish, and W. Magdy (2017) Abusive language detection on Arabic social media. In First Workshop on Abusive Language Online, Vancouver, BC, Canada, pp. 52–56. External Links: Link, Document Cited by: §2.1.
  • V. Niculae and C. Danescu-Niculescu-Mizil (2016) Conversational markers of constructive discussions. In NAACL, Cited by: §2.1.
  • O. Onabola, Z. Ma, Y. Xie, B. Akera, A. Ibraheem, J. Xue, D. Liu, and Y. Bengio (2021) HBert + biascorp – fighting racism on the web. In arXiv, Cited by: §2.1.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32, pp. 8026–8037. External Links: Link Cited by: Appendix B.
  • J. Pavlopoulos, J. Sorensen, L. Dixon, N. Thain, and I. Androutsopoulos (2020) Toxicity detection: does context really matter? In ACL, Cited by: §1, §1, §2.1, §2.1.
  • F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: Appendix B.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. Cited by: §1, §4.2, §4.2, §5.4.1.
  • M. H. Ribeiro, P. H. Calais, Y. A. Santos, V. A. F. Almeida, and W. Meira Jr. (2018) "Like sheep among wolves": characterizing hateful users on twitter. External Links: 1801.00317 Cited by: §1.
  • I. Sen, M. Samory, F. Floeck, C. Wagner, and I. Augenstein (2021) How does counterfactually augmented data impact models for social computing constructs? External Links: 2109.07022 Cited by: §2.2.
  • R. Sennrich, B. Haddow, and A. Birch (2016) Improving Neural Machine Translation Models with Monolingual Data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 86–96. External Links: Link Cited by: §2.2.
  • C. Shorten, T. Khoshgoftaar, and B. Furht (2021) Text data augmentation for deep learning. Journal of Big Data 8. External Links: Document Cited by: §2.2.
  • B. Vidgen, A. Harris, D. Nguyen, R. Tromble, S. Hale, and H. Margetts (2019) Challenges and frontiers in abusive content detection. In ACL Workshop on Abusive Language Online, Cited by: §1, §2.
  • B. Vidgen, D. Nguyen, H. Margetts, P. Rossini, and R. Tromble (2021) Introducing CAD: the contextual abuse dataset. In NAACL, Cited by: §2.1.
  • Z. Waseem, T. Davidson, D. Warmsley, and I. Weber (2017) Understanding abuse: a typology of abusive language detection subtasks. In ACL Workshop on Abusive Language Online, Cited by: §2.
  • J. Wei and K. Zou (2019) EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 6382–6388. External Links: Document, Link Cited by: §2.2.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush (2019) HuggingFace’s transformers: state-of-the-art natural language processing. ArXiv abs/1910.03771. Cited by: Appendix B, §5.5.
  • J. Xu, D. Ju, M. Li, Y. Boureau, J. Weston, and E. Dinan (2021) Bot-adversarial dialogue for safe conversational agents. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, pp. 2950–2968. External Links: Link, Document Cited by: §2.1, §2.1, §2.1, §2.1, §3, 4th item, §5.1, §5.3, §6.1, §6.1, Table 7.
  • F. Yang et al. (2019) Exploring deep multimodal fusion of text and photo for hate speech classification. In ACL, Cited by: §1.
  • W. Yang, Y. Xie, L. Tan, K. Xiong, M. Li, and J. Lin (2019) Data augmentation for BERT fine-tuning in open-domain question answering. CoRR abs/1904.06652. External Links: Link, 1904.06652 Cited by: §2.2.
  • M. Zampieri, P. Nakov, S. Rosenthal, P. Atanasova, G. Karadzhov, H. Mubarak, L. Derczynski, Z. Pitenis, and Ç. Çöltekin (2020) SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020). In Proceedings of SemEval, Cited by: §2.

Appendix A Research Methods

A.0.1. Style Transformer

The Text Style Transfer task considers multiple datasets, in which each dataset contains a specific characteristic that is called style. The goal is to rewrite a sentence from one dataset with a desired style while preserving the information from the original sentence.

Style Transformer Network

Unlike the conventional Transformer network, the encoder of the Style Transformer receives an additional style embedding as input (Dai et al., 2019). Thus, the probability of an output sequence $y$ computed by the Style Transformer, written $p_\theta(y \mid x, s)$ in the notation of Dai et al. (2019), is conditioned on both the input sequence ($x$) and the style control variable ($s$).

Discriminator Network

The role of the discriminator is to help the Style Transformer network improve its control over the style of generated sequences. Because no ground-truth supervision is available during style transfer, the discriminator is trained to distinguish different styles of generated sequences, providing the Style Transformer with style supervision. Dai et al. (2019) proposed two architectures for the discriminator network: a conditional discriminator and a multi-class discriminator.

The first network architecture, the conditional discriminator, follows a setting similar to that of a Conditional GAN network (Mirza and Osindero, 2014). That is, the network ($d_\phi$) receives a sentence ($x$) and a proposal style ($s$) and computes the probability of the input sentence belonging to the proposal style.

In contrast, the second architecture, the multi-class discriminator, takes only the input sentence ($x$) into the network ($d_\phi$). The network then outputs the probabilities of $K+1$ classes, where the first $K$ classes correspond to the $K$ styles and the last class represents fake samples.
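The difference between the two discriminator outputs can be illustrated with a minimal pure-Python sketch. The logits, the choice of `K = 2` styles, the helper names, and the use of a sigmoid for the conditional discriminator are all hypothetical and are not taken from the paper or from Dai et al. (2019).

```python
import math


def sigmoid(z: float) -> float:
    """Map a single logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


K = 2  # hypothetical number of styles

# Conditional discriminator: one score for a (sentence, proposal style) pair,
# read as the probability that the sentence belongs to the proposal style.
cond_logit = 1.2  # hypothetical network output
p_match = sigmoid(cond_logit)

# Multi-class discriminator: K + 1 scores for the sentence alone.
# The first K entries correspond to styles, the last one to fake samples.
mc_logits = [0.3, 2.0, -1.0]  # hypothetical network output
probs = softmax(mc_logits)
style_probs, fake_prob = probs[:K], probs[K]

print(f"conditional: p(match) = {p_match:.3f}")
print(f"multi-class: styles = {style_probs}, fake = {fake_prob:.3f}")
```

The conditional variant needs the proposal style as an extra input, whereas the multi-class variant infers the style (or "fake") directly from the sentence.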

Learning Algorithm

Starting with the discriminator learning method, the discriminator should be trained to correctly discriminate real and reconstructed sentences from style-transferred sentences. Thus, the loss function is simply the cross-entropy loss. For the conditional discriminator, the loss function can be expressed as follows:


where $s$ is the true style of $x$. The loss function for the multi-class discriminator is analogous to that of the conditional discriminator, without the condition on the style ($s$).
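As a sketch, and assuming the notation of Dai et al. (2019), let $p_\phi$ be the output distribution of the discriminator with parameters $\phi$ and $c$ the label assigned to a training sample. For the conditional discriminator, $c \in \{0, 1\}$ is the binary label of the (sentence, style) pair; for the multi-class discriminator, $c$ ranges over the $K$ styles and the fake class. The logarithm is an assumption made to match the cross-entropy wording above.

```latex
\[
\mathcal{L}_{\text{disc}}^{\text{cond}}(\phi) = -\log p_\phi(c \mid x, s),
\qquad
\mathcal{L}_{\text{disc}}^{\text{multi}}(\phi) = -\log p_\phi(c \mid x)
\]
```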


The proposed training method of the Style Transformer network considers two different scenarios: self-reconstruction and style transfer. In the first scenario, as the name suggests, the objective is to reconstruct the input sentence with its original style. To do so, the network is trained to minimise the following negative log-likelihood loss function, called the self-reconstruction loss ($\mathcal{L}_{\text{self}}$).
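A sketch of this loss, assuming the notation of Dai et al. (2019): $x$ is the input sentence, $s$ its original style, and $p_\theta$ the output distribution of the Style Transformer with parameters $\theta$.

```latex
\[
\mathcal{L}_{\text{self}}(\theta) = -\log p_\theta(y = x \mid x, s)
\]
```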


However, in the second scenario, style transfer, direct supervision from the dataset cannot be obtained. As a result, Dai et al. (2019) introduced two additional training loss functions that serve as indirect supervision. The first loss, the cycle-reconstruction loss ($\mathcal{L}_{\text{cycle}}$), aims to promote preservation of the information in the input sentence. Given a generated style-transferred sequence $\hat{y}$, the network is trained to reconstruct the original sentence ($x$) with the original style ($s$) by minimising $\mathcal{L}_{\text{cycle}}$, as expressed below.
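A sketch of this loss, assuming the notation of Dai et al. (2019): $f_\theta$ denotes the Style Transformer, $\hat{s}$ the target style, and $\hat{y}$ the style-transferred sequence generated from the input $x$ with original style $s$.

```latex
\[
\hat{y} = f_\theta(x, \hat{s}),
\qquad
\mathcal{L}_{\text{cycle}}(\theta) = -\log p_\theta(y = x \mid \hat{y}, s)
\]
```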


To improve the model’s control over style, the second additional loss function, the style-controlling loss ($\mathcal{L}_{\text{style}}$), is based on the discriminator output. Intuitively, a model with good control over style should be capable of generating style-transferred sentences that trick the discriminator into predicting that they are real sentences. When employing the conditional discriminator, this loss function can be expressed as follows:
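A sketch for the conditional case, assuming the notation of Dai et al. (2019): $f_\theta(x, \hat{s})$ is the sentence generated from $x$ with target style $\hat{s}$, and $c = 1$ is the discriminator label for a sentence that matches its proposal style. Writing the loss as a negative log-probability is an assumption.

```latex
\[
\mathcal{L}_{\text{style}}(\theta) = -\log p_\phi\big(c = 1 \mid f_\theta(x, \hat{s}), \hat{s}\big)
\]
```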


In the case of the multi-class discriminator, the function can be formulated as follows:
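A sketch for the multi-class case, assuming the notation of Dai et al. (2019): the generated sentence $f_\theta(x, \hat{s})$ should be assigned to the class of the target style $\hat{s}$ rather than to the fake class. Writing the loss as a negative log-probability is an assumption.

```latex
\[
\mathcal{L}_{\text{style}}(\theta) = -\log p_\phi\big(c = \hat{s} \mid f_\theta(x, \hat{s})\big)
\]
```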


Appendix B Implementation Details

Our data processing and model are developed in Python 3.8. Besides our own code, we use open-source third-party libraries including NumPy (Harris et al., 2020), Pandas, Pronto, Scikit-learn (Pedregosa et al., 2011), Transformers (Wolf et al., 2019), Tensorboard, PyTorch (Paszke et al., 2019) (v1.7, CUDA 10.1), Tqdm and Xmltodict. On one Tesla V100 GPU, fine-tuning each of our classification or generation models takes 1 to 2 hours, depending on the model. Training the Style Transformer takes around 2 to 3 days on the same GPU type.