Just Say No: Analyzing the Stance of Neural Dialogue Generation in Offensive Contexts

08/26/2021 · Ashutosh Baheti et al. · Georgia Institute of Technology, University of Washington

Dialogue models trained on human conversations inadvertently learn to generate offensive responses. Moreover, models can insult anyone by agreeing with an offensive context. To understand the dynamics of contextually offensive language, we study the stance of dialogue model responses in offensive Reddit conversations. Specifically, we crowd-annotate ToxiChat, a new dataset of 2,000 Reddit threads and model responses labeled with offensive language and stance. Our analysis reveals that 42% of human responses agree with offensive comments, far more than their agreement with safe comments (13%), and that neural dialogue models learn to mimic this behavior. To enable automatic detection, classifiers fine-tuned on our dataset achieve 0.71 F1 for offensive labels and 0.53 Macro-F1 for stance labels. Finally, we analyze some existing controllable text generation (CTG) methods to mitigate the contextual offensive behavior of dialogue models. Compared to the baseline, our best CTG model obtains a 19% reduction in agreement with offensive contexts and produces 29% fewer offensive responses. This highlights the need for future work to characterize and analyze more forms of inappropriate behavior in dialogue models to help make them safer. Our code and corpus are available at https://github.com/abaheti95/ToxiChat .


1 Introduction

There has been significant progress in building open-domain dialogue agents using large corpora of public conversations (ritter-etal-2011-data; li2016diversity). For example, OpenAI's GPT-3 (NEURIPS2020_1457c0d6) is a 175 billion parameter neural network, trained on a 570GB subset of Common Crawl, that is capable of engaging in impressive open-domain dialogues with a user when prompted appropriately (https://beta.openai.com/docs/introduction/conversation). However, presenting users with content generated by a neural network introduces new risks, as it is difficult to predict when the model might say something offensive or otherwise harmful.

Figure 1: Example of an offensive comment by a Reddit user followed by three dialogue model responses. We also show the stance labels for the responses with respect to the preceding offensive comment.

Toxic language is often context-dependent (dinan-etal-2019-build), making it notoriously difficult for language technologies to handle; text that seems innocuous in isolation may be offensive when considered in the broader context of a conversation. For example, neural chatbots will often agree with offensive statements, which is undesirable (see examples in Figure 1). The solution employed by current systems, such as GPT-3 or Facebook's Blender chatbot (roller-etal-2021-recipes), is to stop producing output when offensive inputs are detected (xu2020recipes). This is problematic because today's toxic language classifiers are far from perfect and often produce false-positive predictions. Rather than completely shutting down, for some applications it may be preferable to simply avoid agreeing with offensive statements. However, we are most excited about the future potential for models that can gracefully respond with non-toxic counter-speech (wright2017vectors), helping to defuse toxic situations.

To better understand stance usage in offensive contexts, we recruited crowd-workers on Amazon Mechanical Turk to annotate ToxiChat, a corpus of Reddit conversations that include automatically generated responses from DialoGPT (zhang-etal-2020-dialogpt) and GPT-3. Posts and comments are annotated for targeted offensiveness toward a particular person or group (sap-etal-2020-social). We also annotate stance toward each of the previous comments in the thread. In our corpus, we show that 42% of human responses in offensive contexts exhibit an agreeing stance, whereas only 13% agree with safe comments. An analysis of 5 million Reddit comment threads spanning six months similarly finds that users are three times more likely to agree with offensive comments. Furthermore, we find that neural chatbots learn to mimic this undesirable behavior: DialoGPT, GPT-3, and Facebook's Blender chatbot are all more likely to agree with offensive comments.

Finally, we present initial experiments with two existing methods that aim to control the stance of automatically generated replies. Our experiments suggest that domain adaptive pretraining reduces the number of contextually offensive responses, although this does not completely eliminate the problem, suggesting the need for further research on controllable stance in neural text generation.

Our main contributions include: (1) We release ToxiChat, a corpus of 2,000 conversations from Reddit, augmented with automatic responses from DialoGPT and GPT-3 and annotated with targeted offensive language and stance. (2) We present an analysis of stance in offensive and safe contexts using ToxiChat, demonstrating that neural dialogue models are significantly more likely to agree with offensive comments. (3) We show that ToxiChat supports training and evaluating machine learning classifiers for stance in toxic conversations. (4) We conduct preliminary experiments on controlling the stance of neural responses to prevent models from agreeing with offensive statements.

2 ToxiChat  Corpus

Addressing problematic responses in neural conversation requires understanding whether an utterance is offensive or agrees with previous offensive utterances. We developed an interface to annotate these concepts in conversations that are enriched with dialog model responses. A screenshot of our annotation interface is shown in Figure 8 in the Appendix. Formally, a thread consists of utterances $u_1, u_2, \ldots, u_n$, where the last comment, $u_n$, is generated by a neural dialogue model. For each $u_i$, we annotate:

1) Offensiveness - We consider $u_i$ offensive if it is intentionally or unintentionally toxic, rude, or disrespectful towards a group or individual (sap-etal-2020-social). This is a binary choice: $u_i$ is either Offensive or Safe. (Although Safe comments are not toxic, they can still be inappropriate, for example misleading information; but, for simplicity, we limit our annotation to offensive vs. not.) For offensive comments, we further annotate target groups from a predefined list comprising identity-based groups of people (e.g., people of various sexualities/sexual orientations/genders, people with disabilities, people from a specific race, political ideologies, etc.) and specific individuals (e.g., public figures, Reddit users, etc.). We present the list of selected target groups in Figure 7 in the Appendix.

2) Stance - We annotate the stance of $u_i$ towards each previous comment $u_j$ ($j < i$). Stance is viewed as a linguistically articulated form of social action, in the context of the entire thread and sociocultural setting (du2007stance; kiesling2018interactional). Stance alignment between a pair of utterances is annotated as Agree, Disagree, or Neutral. Our primary interest is in analyzing the stance taken towards offensive statements. We assume that a user or a chatbot can become offensive by aligning themselves with an offensive statement made by another user (see Figure 1). In practice, we find this to be a very reasonable assumption: 90.7% of Reddit reply comments agreeing with a previous offensive utterance are annotated as offensive in our dataset.

Additionally, for dialogue model responses $u_n$, we also annotate grammatical and contextual plausibility given the context.
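To make the annotation schema concrete, the record below is a hypothetical example of how a single annotated thread could be represented; the field names and layout are illustrative assumptions and do not describe the released corpus format.

```python
# Hypothetical representation of one annotated thread (illustrative only).
annotated_thread = {
    "utterances": [
        {"id": "u1", "author": "reddit_user", "text": "..."},
        {"id": "u2", "author": "reddit_user", "text": "..."},
        {"id": "u3", "author": "DGPT", "text": "..."},  # model-generated response u_n
    ],
    # Binary offensiveness label per utterance, plus target groups for offensive ones.
    "offensive": {"u1": "Safe", "u2": "Offensive", "u3": "Offensive"},
    "target_groups": {"u2": ["women"], "u3": ["women"]},
    # Stance of u_i towards each previous u_j (j < i).
    "stance": {("u2", "u1"): "Neutral", ("u3", "u1"): "Neutral", ("u3", "u2"): "Agree"},
    # Plausibility is only annotated for the model response.
    "plausible": {"u3": True},
}
```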

2.1 Data Collection

Our annotated dataset contains 2,000 labeled Reddit conversations extended with dialogue model responses (§2.2). Corpus statistics for ToxiChat are presented in Table 5 in the Appendix. We gather Reddit posts and comments (baumgartner2020pushshift), collected from pushshift.io, that were written between May and October 2019. From this, we construct threads, each of which comprises a title, a post, and a subsequent comment sequence. We extract threads from two sources: (1) Any SubReddits: threads from all SubReddits; (2) Offensive SubReddits: threads from toxic SubReddits identified in previous studies (breitfeller-etal-2019-finding) and in Reddit community reports (https://www.reddit.com/r/AgainstHateSubReddits/); see Appendix B.

We are most interested in responses generated by dialogue models in offensive contexts. However, offensive language is rare in a random sample (davidson2017automated; founta2018large). Hence, we implement a two-stage sampling strategy: (1) Random sample - from both sources, randomly sample 500 threads (total 1,000). (2) Offensive sample - from the remaining threads in both sources, sample an additional 500 threads (total 1,000) whose last comment is predicted as offensive by a classifier. Specifically, we use high-precision predictions (above a probability threshold) from a BERT-based offensive comment classifier (devlin-etal-2019-bert) that was fine-tuned on the Social Bias Inference Corpus (sap-etal-2020-social). This classifier achieves a high Offend-label F1 on the SBIC dev set.

2.2 Response Generation

To study the behavior of neural chatbots in offensive contexts, we extend the 2,000 Reddit threads with model-generated responses. We consider the following pretrained models in this study:

DGPT - A GPT-2 architecture trained on 147M Reddit comment threads (zhang-etal-2020-dialogpt). To reduce the risk of offensive behavior, the authors filtered out comment threads containing offensive phrases during training. We use the DialoGPT-medium model (345M parameters) implementation from HuggingFace Transformers (wolf-etal-2020-transformers).

GPT-3 - Recently, OpenAI released API access to the GPT-3 language model, which can solve many tasks through text-based interaction without additional training (NEURIPS2020_1457c0d6). We follow the API guidelines to use GPT-3 as a dialogue agent. To generate a response for a comment thread, we provide GPT-3 with the prompt "The following is a conversation thread between multiple people on Reddit. U1: $u_1$ U2: $u_2$ …", where the $u_i$ are the user comments. The model then predicts the next turn in the conversation. We select the largest GPT-3 model, 'davinci', with 175B parameters for our data construction.
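For illustration, the sketch below shows this prompting setup with the legacy OpenAI Python client (pre-1.0 Completion endpoint); the decoding parameters and stop sequence are assumptions, not the exact settings used for data construction.

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

def gpt3_reply(comments):
    """Format a Reddit thread as in the paper's prompt and ask GPT-3 for the next turn."""
    prompt = "The following is a conversation thread between multiple people on Reddit.\n"
    for i, comment in enumerate(comments, start=1):
        prompt += f"U{i}: {comment}\n"
    prompt += f"U{len(comments) + 1}:"
    response = openai.Completion.create(
        engine="davinci",   # largest GPT-3 model, 175B parameters
        prompt=prompt,
        max_tokens=60,      # assumed
        top_p=0.9,          # assumed; nucleus sampling as described in §2.2
        stop=["\n"],        # assumed: stop at the end of the generated turn
    )
    return response["choices"][0]["text"].strip()
```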

Blender - More recently, Facebook released Blender Bot, a 2.7B-parameter dialogue model (roller-etal-2021-recipes). Blender Bot is first pretrained on 1.5B Reddit comment threads (baumgartner2020pushshift) and later fine-tuned on the Blended Skill Talk (BST) dataset (smith-etal-2020-put). The BST dataset contains 5K polite conversations between crowdworkers and aims to blend three conversational skills into one dataset: 1) engaging personality (zhang-etal-2018-personalizing; 10.1007/978-3-030-29135-8_7), 2) empathetic dialogue (rashkin-etal-2019-towards), and 3) knowledge incorporation (dinan2018wizard).

We only include the first two models during annotation but compare our controlled text generation models against all three models in §5.1. Responses for DGPT and GPT-3 are generated from the comments part of the threads (DGPT was only trained on Reddit comments) using nucleus sampling (holtzman2019curious). Blender Bot generates responses using beam search with a minimum beam sequence length constraint.
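For reference, the snippet below is a minimal sketch of generating a DialoGPT-medium response with nucleus sampling via HuggingFace Transformers; the top-p value and length limit are assumptions, since the exact decoding settings are not reproduced here.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

# DialoGPT separates conversation turns with the EOS token.
comments = ["First comment in the thread ...", "Second comment ..."]
context = tokenizer.eos_token.join(comments) + tokenizer.eos_token
input_ids = tokenizer.encode(context, return_tensors="pt")

output = model.generate(
    input_ids,
    do_sample=True, top_p=0.9,                 # nucleus sampling; p value assumed
    max_length=input_ids.shape[-1] + 60,       # assumed response length budget
    pad_token_id=tokenizer.eos_token_id,
)
reply = tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True)
print(reply)
```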

2.3 Crowd Annotation

Inter-rater agreement was measured using Krippendorff's alpha (krippendorff2011computing) and pairwise agreement, which were found to be 0.42 and 82.8%, respectively, for offensive labels (comparable to 0.45 and 82.4% agreement in SBIC; sap-etal-2020-social) and 0.22 and 85.1% for stance labels (comparable to the stance-label pairwise agreement of 62.3% for the rumor-stance dataset; zubiaga2016analysing). We found Krippendorff's alpha on the human-only responses to be somewhat higher (0.45 offensive and 0.26 stance) than on the chatbot-only responses (0.32 offensive and 0.18 stance). This is likely due to the higher proportion of incoherent responses in the chatbot outputs (25% of DGPT responses and 12.5% of GPT-3 responses were marked as not plausible).

Although the agreement levels presented above are relatively low, due to the complexity of the MTurk annotation task (see the screenshot of the crowd annotation interface in Figure 8 in the Appendix), all conversations in our dataset were annotated by 5 workers, and we found that aggregating their annotations produces gold labels of sufficiently high quality for training and evaluating models (we consider the gold label as offensive or agreeing if at least 2 of the 5 workers agree). We verified the quality of the combined annotations by comparing aggregate labels against an in-house annotator who carefully labeled 40 threads. The F1 scores of the combined annotations were 0.91 and 0.94 for offensive language and stance, respectively, providing an estimated human upper bound for identifying offensive comments and stance.

3 Data Analysis

Directly vs Contextually Offensive Replies.

Dialogue model responses can spew offensive language either 1) directly, by disrespecting a target group, or 2) contextually, by agreeing with previous offensive utterances (Figure 1). We plot the distribution of these offensive responses from both dialogue models and compare them with offensive human reply comments in Figure 2. Within our dataset, we notice a large number of offensive replies from Reddit users. Compared to humans, dialogue model responses are less offensive, with GPT-3 (389 out of 2,000) being more offensive than DGPT (179 out of 2,000). Although most offensive responses are directly offensive, the occurrence of contextually offensive dialogue responses is non-trivial.

We also plot the percentage of responses with an "Agree" stance towards previous offensive vs. safe comments in Figure 3. Surprisingly, we find that humans agree much more with previous offensive comments (41.62%) than with safe comments (12.89%). Further analysis in Appendix E shows this is a stable phenomenon, based on an automated analysis of 5 million threads written over six months. We hypothesize that the higher proportion of agreement observed in response to offensive comments may be explained by the hesitancy of Reddit users to engage with offensive comments unless they agree. This may bias the set of respondents towards those who align with the offensive statement. Regardless of the cause, this behavior is also reflected in dialogue models trained on public Reddit threads. In our human-annotated dataset, both DialoGPT and GPT-3 are almost twice as likely to agree with a previous offensive comment as with a safe comment. Further analysis using our automatic toxicity and stance classifiers is presented in Table 3.

Figure 2: Distribution of directly vs contextually offensive responses.
Figure 3: Response stance “Agree” rate towards previous offensive vs safe comments.
Figure 4: Top 10 target groups for Reddit user responses, DGPT responses and GPT-3 responses with frequencies. Target groups are organized in decreasing frequency in each decagon, starting clockwise from the top-right corner.

Target-Group Distribution.

In Figure 4, we visualize the distribution of target group frequencies. We see that Reddit user responses in threads (i.e., comments) are offensive towards both demographic groups (women, feminists, religious folks, LGBTQ folks, etc.) and specific individuals (celebrities, Reddit users). On the contrary, dialogue model responses are more offensive towards individuals and women. On average, they respond more with personal attacks directed towards individuals as opposed to offending a certain demographic. We show some qualitative examples from our dataset in Figure 5.

Profanity in Model Responses.

Dialogue models occasionally generate profane responses characterized by explicit offensive terms. We check the models' offensive responses for profanity using Toxicity Triggers (zhou2021challenges), a lexicon of 378 "bad" words, phrases, and regular expressions (https://github.com/XuhuiZhou/Toxic_Debias/blob/main/data/word_based_bias_list.csv). We find that only 3.35% of DGPT offensive responses contain profanity, compared to 39.59% of GPT-3 and 66.47% of Reddit users' offensive responses. Thus, filtering training instances containing offensive phrases reduces profanity in DGPT responses (zhang-etal-2020-dialogpt). However, this filtering does not eradicate the model's offensive behavior.
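A minimal sketch of such a lexicon-based profanity check is shown below; the CSV column name and the treatment of entries as regex patterns are assumptions about the lexicon's format, so treat this as illustrative rather than the exact procedure.

```python
import re
import pandas as pd

# Load the Toxicity Triggers lexicon; the column name "word" is an assumption
# about the CSV layout, and entries are treated directly as regex patterns.
lexicon = pd.read_csv("word_based_bias_list.csv")["word"].astype(str).tolist()
patterns = [re.compile(entry, re.IGNORECASE) for entry in lexicon]

def contains_profanity(response: str) -> bool:
    """Return True if any lexicon pattern matches the response."""
    return any(p.search(response) for p in patterns)

responses = ["an example model response ..."]
bad_rate = 100.0 * sum(contains_profanity(r) for r in responses) / len(responses)
print(f"% Bad: {bad_rate:.2f}")
```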

4 Offensive Language and Stance Classification

We now investigate the predictability of Offensive Language (Offensive) and Stance (Stance) in conversations that include generated responses. Given a thread $u_1, \ldots, u_n$, we predict Offensive labels {Offensive, Safe} for each utterance $u_i$ and Stance labels {Neutral, Agree, Disagree} for every pair of utterances $(u_j, u_i)$ with $j < i$.

4.1 Model Architectures

In both classification tasks, we experiment with the following three model architectures:

NBOW - The Neural-Bag-Of-Words model (bowman-etal-2015-large) converts input sentences into latent representations by taking a weighted average of their word embeddings. The sentence representations are then concatenated and passed through a 3-layer perceptron with ReLU activations and a softmax layer to obtain the classification output.

BERT - We fine-tune classifiers based on the BERT model (340M parameters; devlin-etal-2019-bert). BERT computes latent token representations of the input "[CLS] $u_i$ [SEP]" for the Offensive task and "[CLS] $u_i$ [SEP] $u_j$ [SEP]" for the Stance task. A softmax layer on the [CLS] token representation then makes the prediction.
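As an illustration, the sketch below shows how the pairwise Stance input could be encoded with HuggingFace Transformers; the checkpoint name, the label order, and the ordering of the utterances in the pair are assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")  # 340M-parameter BERT (assumed checkpoint)
model = AutoModelForSequenceClassification.from_pretrained("bert-large-uncased", num_labels=3)

u_j = "People like that should just be banned from the internet."  # earlier comment
u_i = "I know right? Totally agree."                                # reply whose stance we classify
# Encoding the pair yields "[CLS] u_i [SEP] u_j [SEP]"; the classification head sits on [CLS].
inputs = tokenizer(u_i, u_j, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits  # scores over {Neutral, Agree, Disagree} (assumed order)
print(logits.softmax(dim=-1))
```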

DGPT - To leverage the full thread context ($u_1, \ldots, u_n$), we also experimented with DialoGPT-medium (345M parameters; zhang-etal-2020-dialogpt). Here, the thread is encoded as the sequence of all $u_i$'s separated by a special token [EOU], indicating end of utterance. The hidden representation of [EOU] for each $u_i$ is used as its sentence representation, $s_i$. For the Stance task, we predict the label from $[s_i ; s_j ; s_i \odot s_j]$, where $;$ is the concatenation operator and $\odot$ is element-wise multiplication.
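A minimal sketch of this pairwise stance head is given below; the exact feature combination, hidden size, and classifier depth are assumptions based on the operators described above.

```python
import torch
import torch.nn as nn

class StanceHead(nn.Module):
    """Sketch of a pairwise stance classifier head over DialoGPT [EOU] representations.

    s_i and s_j are the [EOU] hidden states of utterances u_i and u_j; the feature
    combination [s_i; s_j; s_i * s_j] and the single linear layer are assumptions.
    """

    def __init__(self, hidden_size: int = 1024, num_labels: int = 3):
        super().__init__()
        self.classifier = nn.Linear(3 * hidden_size, num_labels)

    def forward(self, s_i: torch.Tensor, s_j: torch.Tensor) -> torch.Tensor:
        features = torch.cat([s_i, s_j, s_i * s_j], dim=-1)  # concatenation + element-wise product
        return self.classifier(features)  # unnormalized scores over {Neutral, Agree, Disagree}
```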

                  All Stance Pairs                     Adjacent Stance Pairs
                  Agree  Disagree  Neutral  Macro-F1   Agree  Disagree  Neutral  Macro-F1
NBOW (wCE)        .183   .000      .894     .359       .206   .000      .851     .352
BERT (wCE)        .244   .193      .903     .447       .302   .230      .871     .468
DGPT (wCE)        .385   .200      .901     .496       .456   .179      .856     .497
DGPT (CB focal)   .349   .319      .916     .528       .414   .353      .874     .547
Table 1: Test set Stance label F1 and macro-F1 scores for all utterance pairs and adjacent utterance pairs.
             all    first  reply
NBOW (CE)    .399   .311   .423
BERT (CE)    .608   .598   .610
DGPT (CE)    .691   .737   .674
DGPT+ (CE)   .714   .741   .704
Table 2: Test set Offensive label F1 scores for all utterances, first utterances, and reply utterances in all threads. DGPT+ indicates the DGPT model trained on our dataset augmented with instances from SBIC (sap-etal-2020-social).

4.2 Loss Functions

The standard cross-entropy (CE) loss function is used for the Offensive task. However, because Stance has an imbalanced class distribution (about 1:10 for Agree and 1:40 for Disagree), we use weighted cross-entropy (wCE) with weights (1, 100, 100) for {Neutral, Agree, Disagree}, respectively. We also experiment with the Class-Balanced Focal Loss (cui2019class). Formally, let $C = \{$Neutral, Agree, Disagree$\}$ and let $z_c$ be the unnormalized score assigned by the model to each stance label $c \in C$. Then,

$$\mathcal{L}_{\text{CB-focal}} = -\frac{1-\beta}{1-\beta^{n_y}} \, (1 - p_y)^{\gamma} \, \log p_y,$$

where $y$ is the correct stance label, $n_y$ is the number of training instances with label $y$, and $p_y = \exp(z_y) / \sum_{c \in C} \exp(z_c)$. The reweighting term $(1-\beta)/(1-\beta^{n_y})$ is based on the effective number of samples from each class, thus reducing the impact of class imbalance on the loss. The focal term $(1 - p_y)^{\gamma}$ (lin2017focal) reduces the relative loss for well-classified instances. In our experiments, the hyperparameters $\beta$ and $\gamma$ are set to 0.9999 and 1.0, respectively.
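The snippet below is a minimal PyTorch sketch of this class-balanced focal loss; the label ordering and the source of the per-class counts are assumptions.

```python
import torch
import torch.nn.functional as F

def class_balanced_focal_loss(logits, targets, samples_per_class,
                              beta: float = 0.9999, gamma: float = 1.0):
    """Sketch of the class-balanced focal loss (cui2019class; lin2017focal).

    logits: (batch, num_classes) unnormalized scores
    targets: (batch,) integer labels, e.g. {0: Neutral, 1: Agree, 2: Disagree} (assumed order)
    samples_per_class: training-set count n_c for each class
    """
    n_y = torch.tensor(samples_per_class, dtype=torch.float,
                       device=logits.device)[targets]
    weight = (1.0 - beta) / (1.0 - beta ** n_y)       # inverse effective number of samples
    log_p = F.log_softmax(logits, dim=-1)
    log_p_y = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)
    p_y = log_p_y.exp()
    loss = -weight * (1.0 - p_y) ** gamma * log_p_y   # focal modulation of the CE term
    return loss.mean()
```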

Figure 5: Examples of dialogue model generated offensive personal attacks without explicit bad words.

4.3 Training and Evaluation

We divide ToxiChat into Train, Dev, and Test splits. Identifying offensive reply utterances is challenging since it may require understanding the entire thread context. Hence, we evaluate the Offensive task using the offensive-label F1 score for (1) all utterances, (2) first utterances, and (3) reply utterances in the thread. For the Stance task, we present per-class as well as macro-F1 scores for all utterance pairs. We also report these metrics for adjacent pairs of utterances, i.e., pairs $(u_{i-1}, u_i)$, which are easier to predict. Hyperparameters and implementation details are presented in Appendix D.

4.4 Results and Analysis

We present the test set evaluation results of the Stance and Offensive tasks in Tables 1 and 2, respectively. We observe similar trends in the dev set evaluation metrics, presented in Tables 6 and 7 in the Appendix. The DGPT model with full thread context outperforms the BERT and NBOW models, which lack this global context.

For the Offensive task, the higher first-utterance F1 of the DGPT classifier compared to BERT suggests that pretraining on in-domain Reddit comments is helpful. Augmenting our training set with SBIC data shows further improvement on all metrics. However, even the best model achieves only 0.714 F1 on all utterances, showing that the task is challenging. Classification models perform worse on the dialogue model responses within our dataset, as these can be incoherent yet distributionally similar to natural language. To corroborate this, the best model, DGPT+, gets 0.673 F1 on GPT-3 responses and 0.489 F1 on DGPT responses.

Stance classification models struggle to perform well on the Agree and Disagree labels (even on the rumor-eval dataset, the SOTA hierarchical transformer model achieved similar F1 on the Agree and Disagree stances; yu-etal-2020-coupled). Stance alignment is contextual and nuanced, and it does not need high word overlap to convey implicit agreement or disagreement. For instance, a sarcastically worded question, like "Oh really?", can also show indirect disagreement. Among models trained with the wCE loss, DGPT beats the other two models with a higher Agree-label classification F1. However, its performance on Disagree classification is poor. The DGPT model trained with the class-balanced focal loss mitigates this issue and achieves the highest Macro-F1.

5 Mitigating Offensive Behavior

Our data analysis confirms that dialogue models generate some contextually offensive language. To mitigate this behavior, we examine some preliminary strategies using Controlled Text Generation (CTG). We consider the following three control attributes: (1) Offensive control, to generate safe or offensive responses; (2) Stance control, to generate agreeing or neutral responses towards the immediately preceding comment (only threads with all safe comments were considered for Stance control); and (3) Both Offensive and Stance control, to generate responses with both control types.

Model           Control    Len.    Dist-1  Dist-2  %Bad↓  %Off↓  %Agree↓  %Neutral↑
DGPT medium     -          9.02    .378    .858    5.6    29.6   13.8     79.6
GPT-3           -          23.62   .286    .788    26.6   41.0   18.6     70.2
Blender bot     -          16.71   .208    .523    7.8    19.6   24.2     61.8
DAPT - [S]      Offensive  8.61    .362    .856    4.0    16.0   18.4     76.4
DAPT - [S][N]   Both       7.85    .379    .878    4.0    18.2   9.0      86.4
AtCon - [S]     Offensive  8.63    .364    .851    9.4    29.6   22.4     72.2
AtCon - [N]     Stance     8.03    .380    .874    4.2    17.4   15.0     80.8
AtCon - [S][N]  Both       8.61    .370    .864    8.2    20.6   11.4     85.4
Reddit user     -          12.84   .374    .879    16.6   29.8   21.0     74.8
Table 3: Results from automatic evaluation on 500 offensive threads from the test set. [S] indicates the Safe control attribute and [N] indicates the Neutral Stance control attribute. Len. is the average response length for each model. Dist-1 and Dist-2 are the Distinct-1,2 metrics, respectively. ↓ indicates that lower values are preferred, while ↑ indicates the opposite.

To train CTG models, we need conversations with their last response labeled with control attributes. We extract 5 million comment threads, similar to §2.1, and generate predictions using our best DGPT-based Offensive and Stance classifiers (§4.4). To minimize errors, we use high-precision predictions by selecting appropriate thresholds on the classification probabilities (we selected thresholds for all labels such that we obtain precision of .75 or higher). For each thread, we retain the Offensive prediction for the last utterance and the Stance prediction between the last two utterances.

For all three proposed control experiments, we first create samples of high-precision classifier-labeled threads in the format $(x, c, y)$ (label-controlled data). Here $x$ is the thread without the last utterance, $c$ is the classifier-labeled control token, and $y$ is the last utterance, i.e., the response to $x$. We discard Disagree-stance responses, as we only found a small number of high-precision disagreeing responses. The resulting offensive and agreeing response samples are each divided into train and dev splits for the corresponding control dataset.
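A minimal sketch of turning classifier probabilities into high-precision control labels is shown below; the single threshold value is an illustrative assumption (the actual thresholds were tuned per label for .75+ precision), and the label names are placeholders.

```python
def high_precision_label(probs: dict, threshold: float = 0.9):
    """Return the most likely label only if it clears the precision threshold, else None.

    probs: e.g. {"Neutral": 0.05, "Agree": 0.93, "Disagree": 0.02}
    """
    label, p = max(probs.items(), key=lambda kv: kv[1])
    return label if p >= threshold else None

def label_thread(offensive_probs: dict, stance_probs: dict):
    """Keep a thread only if its control attributes are predicted with high precision."""
    offensive = high_precision_label(offensive_probs)  # e.g. "Safe" / "Offensive"
    stance = high_precision_label(stance_probs)        # e.g. "Neutral" / "Agree"
    if offensive is None or stance is None or stance == "Disagree":
        return None                                    # Disagree responses are discarded
    return {"offensive": offensive, "stance": stance}
```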

5.1 Modeling, Training and Testing Details

We use CTG techniques that were found effective in reducing toxicity in language models (gehman-etal-2020-realtoxicityprompts): (1) Domain-Adaptive PreTraining (DAPT) - fine-tuning a pretrained dialogue model only on threads whose responses carry the desired, fixed control attributes (gururangan-etal-2020-dont). (2) Attribute Conditioning (AtCon) - special control tokens encapsulate different response attributes; for example, [OFF] and [SAFE] tokens indicate the offensive control attribute. During training, these tokens are prepended to responses, and at inference time they are manually fixed to steer the model's response towards the desired attribute (niu-bansal-2018-polite; see-etal-2019-makes; xu2020recipes). For each CTG experiment, we fine-tune DialoGPT-medium on the train split for 3 epochs and tune hyperparameters using dev set perplexity.
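The sketch below illustrates what attribute conditioning might look like with DialoGPT at training and inference time; the control-token names, the EOS-based serialization, and the decoding settings are assumptions rather than the paper's exact implementation.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

# Register the (assumed) control tokens so they receive their own embeddings.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[SAFE]", "[OFF]", "[AGREE]", "[NEUTRAL]"]})
model.resize_token_embeddings(len(tokenizer))

eos = tokenizer.eos_token

def training_text(context_utterances, control, response):
    """Training-time serialization: control token(s) prepended to the response."""
    return eos.join(context_utterances) + eos + control + " " + response + eos

def controlled_reply(context_utterances, control="[SAFE] [NEUTRAL]"):
    """Inference: freeze the desired control token(s) at the end of the context and decode."""
    prompt = eos.join(context_utterances) + eos + control
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    output = model.generate(input_ids, do_sample=True, top_p=0.9,
                            max_length=input_ids.shape[-1] + 60,
                            pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True)
```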

Our goal is to test the conversation models in offensive contexts, where they have a propensity to agree with offensive comments. Hence, we sample a test set of 500 threads whose last utterance is offensive. Using this test set, our CTG models are compared against DGPT-medium, GPT-3, and Blender in both automatic and human evaluations.

5.2 Automatic Evaluation

An ideal dialogue model should produce diverse, engaging, and safe responses. Thus, we evaluate the responses generated by all candidate conversation models using the following automatic metrics:

Distinct-1,2 is the ratio of unique unigrams and bigrams to the total number generated (see the sketch after these definitions).

% Bad is the percentage of generated responses containing profane words/phrases identified by Toxicity Triggers (zhou2021challenges), similar to §3.

% Off is the percentage of responses predicted offensive by the DGPT+ Offensive classifier.

% Agree and % Neutral are the percentages of generated responses predicted Agree or Neutral, respectively, by the DGPT (CB focal) Stance classifier. (We predict the most likely class in automatic evaluation instead of the high-precision threshold prediction, which was used to generate fine-tuning data for controllable text generation.)
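As a reference for the diversity metric, a minimal sketch of Distinct-n is given below; the whitespace tokenization is an assumption.

```python
from collections import Counter

def distinct_n(responses, n: int) -> float:
    """Distinct-n: ratio of unique n-grams to total n-grams over all responses."""
    ngrams = Counter()
    for response in responses:
        tokens = response.split()  # whitespace tokenization assumed
        ngrams.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0

responses = ["i agree with you", "i do not agree at all"]
print(distinct_n(responses, 1), distinct_n(responses, 2))
```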

Table 3 contains the results from our automatic evaluation on 500 offensive test threads. DGPT and GPT-3 generate 29.6% and 41.0% offensive responses, respectively, when tested in offensive contexts. Conversely, Blender Bot is much less offensive. It is comparable to the DAPT - [S]afe control model, as both fine-tune on safe conversations to reduce their offensiveness. However, this alone does not eliminate offensive behavior, and both models show higher agreement than the DGPT baseline. The DAPT - [N]either stance control model agrees much less than Blender while generating slightly less offensive responses. The AtCon both-control model also beats the DGPT baseline on the %Off and %Agree metrics, but by smaller margins than the DAPT - [N]either stance control model. As a benchmark, we also evaluate Reddit user responses, which we found to be very offensive and frequently in agreement with the offensive test threads. (The test threads used to evaluate the dialogue models did not have a follow-up Reddit user response; hence, we collect a different set of 500 offensive threads with a final user response.)

5.3 Human Evaluation

To validate the findings of our automatic evaluation presented above, we conduct a human evaluation of 4 models: the DGPT baseline, Blender Bot, DAPT - [N]either stance control, and AtCon both control. We exclude GPT-3 from this evaluation because we do not have access to its model parameters and cannot fine-tune it for CTG. For every model response, we investigate its plausibility {Yes, No}, its stance towards the last comment in the thread {Agree, Disagree, Neutral}, and its offensiveness {Yes, No}. We recruit two in-house annotators to evaluate model responses for a sample of offensive test threads. The pairwise agreement between the two annotators is 77.9% for plausibility, 87.1% for stance, and 92.3% for offensiveness (Cohen's Kappa was also computed for each attribute). We resolve disagreements between annotators using a third adjudicator. The results of the evaluation are presented in Table 4.

Discrepancies between the human and automatic evaluations suggest that our stance classifier overestimates the Neutral stance and underestimates the Agree stance. According to humans, the DGPT baseline is more offensive than Blender and the CTG models; Blender is in fact the least offensive among all evaluated models. This implies that our classifiers do not generalize well to unseen dialogue model responses (Blender Bot responses were not present in the classifier training data). Blender Bot mostly generates benign, empathetic responses but agrees frequently in offensive contexts using sentence starters like "I know right? …" (examples in Figure 9). The CTG models are much more implausible compared to Blender. The AtCon model is less offensive than the DGPT baseline but equally agreeing, suggesting it may not be an effective CTG method (gehman-etal-2020-realtoxicityprompts). DAPT generates less agreeing and less offensive responses than the baseline but is far from perfect.

Model     Plaus.   Stance: Agree   Stance: Dis.   Stance: Neutral   Off.
DGPT      65.2     21.2            7.2            71.6              26.0
Blender   91.2     26.0            14.4           59.6              13.6
DAPT      77.2     17.2            8.4            74.4              18.4
AtCon     84.0     21.6            9.2            69.2              22.8
Table 4: Human evaluation of the baseline and best models on offensive test threads. All values in the table are percentages (%). 'Plaus.' = Plausibility, 'Off.' = Offensiveness, and 'Dis.' = Disagree stance. DAPT refers to [N]either stance control, while AtCon refers to [S]afe, [N]either both control.

6 Discussion and Recommendations

We consistently find that Reddit users agree much more with offensive contexts. This trend could be explained by the tendency of social media users to form echo chambers (cinelli2021echo; soliman2019characterization). Dialogue models learn to mimic this behavior, agreeing more frequently in offensive contexts. Our analysis shows that cleaner training data with desirable conversational properties can mitigate this issue to some extent. To further strengthen dialogue safety, future research on the detection of offensive contexts (dinan-etal-2019-build; zhang-etal-2018-conversations) and the subsequent generation of non-provocative counter-speech (chung-etal-2019-conan) is crucial.

7 Related Work

Identifying Toxicity - Most work on identifying toxic language has looked at individual social media posts or comments without taking context into account (davidson2017automated; xu2012learning; zampieri2019predicting; rosenthal2020large; kumar-etal-2018-benchmarking; garibo-i-orts-2019-multilingual; ousidhoum2019multilingual; breitfeller-etal-2019-finding; sap-etal-2020-social; hada-etal-2021-ruddit; barikeri-etal-2021-redditbias). These methods are ill-equipped for conversational settings, where responses can be contextually offensive. Recently, dinan-etal-2019-build and xu2020recipes studied contextual offensive language using adversarial human-bot conversations, where a human intentionally tries to trick the chatbot into saying something inappropriate. On the other hand, pavlopoulos-etal-2020-toxicity and xenos-etal-2021-context created labeled datasets for toxicity detection in the presence of one previous comment and studied context sensitivity in detection models. In contrast, we study the stance dynamics of dialogue model responses to offensive Reddit conversations with more than one turn.

Inappropriate Language Mitigation - sheng-etal-2020-towards manipulate training objectives and use adversarial triggers (wallace-etal-2019-universal) to reduce biases across demographics and generate less negatively biased text overall. liu-etal-2020-mitigating propose adversarial training to reduce gender bias. dinan-etal-2020-queens train dialogue models with attribute conditioning to mitigate bias by producing gender-neutral responses. saleh2020hierarchical propose a toxicity-classifier-based reinforcement learning objective to discourage the model from generating inappropriate responses. To enhance safety, xu2020recipes train chatbots to avoid sensitive discussions by changing the topic of the conversation. In contrast, we tackle contextual offensive language by fine-tuning models to generate neutral and safe responses in offensive contexts.

8 Conclusion

To better understand the contextual nature of offensive language, we study the stance of human and model responses in offensive conversations. We create ToxiChat, a corpus of 2,000 Reddit conversations augmented with responses generated by two dialog models and crowd-annotated with targeted-offensive language and stance attributes. Classifiers trained on our corpus can be used to automatically evaluate conversations with contextually offensive language. Finally, we show that by fine-tuning the dialog models on safe and neutral responses (DAPT), their contextual offensive behavior can be mitigated to some extent.

9 Societal and Ethical Considerations

This paper tackles issues of safety of neural models, and specifically it attempts to understand how dialog systems can help combat social biases and help make conversations more civil dinan-etal-2019-build; xu2020recipes. For this purpose, we crowd-annotate a dataset of offensive conversations from publicly available Reddit conversations enriched with automatically generated responses. This study was conducted under the approval of the Institutional Review Board (IRB) of our university (anonymized for blind submission). We paid crowd workers on Amazon’s Mechanical Turk platform $0.8 per HIT and gave extra bonuses to annotators with high annotation quality. We estimate that the hourly pay of crowd workers was $12.26. The in-house annotators were paid $13 per hour. Finally, we note that classifiers trained on our dataset are fallible and should be used with careful consideration (sap2019risk; dixon2018measuring).

References

Appendix A Data Preprocessing

As a data cleaning step, we replaced all URLs in the threads with a special token. We also limited the length of posts and comments to a fixed maximum number of words. Only posts containing textual data were allowed.

Appendix B Offensive SubReddit Data Collection

Existing datasets of offensive language (breitfeller-etal-2019-finding; sap-etal-2020-social) annotated comments from potentially offensive SubReddits to increase the proportion of offensive language. To annotate our conversation corpus, we similarly consider these 28 previously used SubReddits and some extra community-reported hateful SubReddits from r/AgainstHateSubReddits. Threads whose last comment is offensive are sampled using a BERT offensive comment classifier (devlin-etal-2019-bert) trained on SBIC (sap-etal-2020-social), keeping only high-confidence P(offensive) predictions. Finally, the top 10 most offensive SubReddits are chosen for our corpus based on the proportion and availability of offensive threads. The selected SubReddits are r/AskThe_Donald, r/Braincels, r/MensRights, r/MGTOW, r/TwoXChromosomes, r/Libertarian, r/atheism, r/islam, r/lgbt, and r/unpopularopinion.

Appendix C Comparison with SemEval-2017

We compare ToxiChat with SemEval-2017 Task 8, a corpus of stance in Twitter threads discussing rumors. Specifically, we chart the word, sentence, and label distributions of threads in both datasets in Table 5. Our corpus is bigger, with more and longer sentences on average. The threads in our corpus are also longer and carry more stance labels. Unlike SemEval-2017, which only annotates stance with respect to the first comment in the thread, we annotate the stance of all pairs of utterances.

Appendix D Model Implementation Details

We conduct the experiments of §4 using the HuggingFace Transformers (wolf-etal-2020-transformers) and PyTorch libraries. All models are fine-tuned/trained using the Adam optimizer (DBLP:journals/corr/KingmaB14) with a fixed learning rate. We use 300d GloVe embeddings (pennington2014glove) to compute sentence representations in the NBOW model. The parameters of the NBOW model are initialized randomly and trained for 30 epochs. The BERT and DGPT models are fine-tuned for 12 epochs. The DGPT model fine-tuned with the class-balanced focal loss for the Stance task performed better with a different learning rate and 16 epochs. The checkpoint with the best all-utterance F1 on the Dev set is selected for the Offensive task models, while the checkpoint with the best all-stance-pairs macro-F1 is selected for the Stance task. All experiments are run on a single Nvidia RTX 2080 Ti GPU.

                     ToxiChat   SemEval2017
#words               202K       63K
#words/sentence      23.5       13.9
#sentences           8623       4519
avg. thread len.     3.31       2.85
#stance labels       12492      4519
Table 5: Comparison of corpus statistics of ToxiChat against the SemEval-2017 Task 8 (derczynski-etal-2017-semeval) stance dataset.
             all    first  reply
NBOW (CE)    .515   .623   .485
BERT (CE)    .633   .687   .618
DGPT (CE)    .667   .681   .662
DGPT+ (CE)   .686   .704   .680
Table 6: Dev set Offensive label F1 scores for all utterances, first utterances, and reply utterances in all threads. DGPT+ indicates the DGPT model trained on our dataset augmented with instances from SBIC (sap-etal-2020-social).
                  All Stance Pairs                     Adjacent Stance Pairs
                  Agree  Disagree  Neutral  Macro-F1   Agree  Disagree  Neutral  Macro-F1
NBOW (wCE)        .219   .000      .902     .374       .243   .000      .862     .368
BERT (wCE)        .272   .238      .918     .476       .312   .275      .890     .492
DGPT (wCE)        .406   .258      .917     .527       .451   .296      .878     .542
DGPT (CB focal)   .422   .325      .937     .561       .463   .366      .905     .578
Table 7: Dev set Stance label F1 and macro-F1 scores for all utterance pairs and adjacent utterance pairs.
Figure 6: Monthly distribution of Stance classifiers labels on responses to offensive vs safe Reddit user comments. For Agree, Disagree and Neutral labels, we only use high-precision predictions. The predictions with low-precision are labeled as Ambiguous on the figure. Reddit users consistently agree more with offensive contexts than safe.

Appendix E Classifier Analysis on Reddit

We make predictions using our best Offensive and Stance classifiers on the 5M Reddit threads downloaded for the controlled text generation (CTG) experiments (§5). Using the Offensive predictions, we identify the Offensive (and Safe) comments in the threads using high-precision thresholds on P(Offensive) (and P(Safe)). For each offensive and safe comment, we plot the distribution of the stance labels of its reply comments in Figure 6. Across the six months of data that we analyzed, our classifiers consistently find that Reddit users agree more with offensive contexts than with safe ones. Moreover, our classifiers find more high-precision stance labels in safe contexts than in offensive contexts (i.e., a smaller fraction of safe-context predictions are labeled Ambiguous).

Figure 7: List of all the target groups, segmented into categories for better readability. "None" is also an option.
Figure 8: Example of our annotation interface. For the offensive question, we allow 4 options in the interface but later convert them into binary values: {Yes, Maybe} → Offensive and {No, Not Sure} → Safe.
Figure 9: Example offensive test threads for CTG evaluation and their corresponding model responses.