Evaluating GPT-3 Generated Explanations for Hateful Content Moderation

05/28/2023
by Han Wang, et al.

Recent research has focused on using large language models (LLMs) to generate explanations for hate speech through fine-tuning or prompting. Despite the growing interest in this area, the effectiveness and potential limitations of these generated explanations remain poorly understood. A key concern is that LLM-generated explanations may lead both users and content moderators to erroneous judgments about the nature of flagged content. For instance, an LLM-generated explanation might inaccurately convince a content moderator that a benign piece of content is hateful. In light of this, we propose an analytical framework for examining hate speech explanations and conduct an extensive survey to evaluate such explanations. Specifically, we prompted GPT-3 to generate explanations for both hateful and non-hateful content and surveyed 2,400 unique respondents to evaluate the generated explanations. Our findings reveal that (1) human evaluators rated the GPT-generated explanations as high quality in terms of linguistic fluency, informativeness, persuasiveness, and logical soundness; (2) the persuasiveness of these explanations, however, varied with the prompting strategy employed; and (3) this persuasiveness may lead to incorrect judgments about the hatefulness of the content. Our study underscores the need for caution in applying LLM-generated explanations to content moderation. Code and results are available at https://github.com/Social-AI-Studio/GPT3-HateEval.
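As a concrete illustration of the setup described in the abstract, the sketch below shows how GPT-3 might be queried for explanations under two different prompting strategies. This is a minimal sketch, not the paper's actual pipeline: the prompt wordings, the strategy names, and the model choice (text-davinci-003 via the legacy openai Python SDK) are all assumptions made for illustration.

```python
# Minimal sketch: prompting GPT-3 for hate-speech explanations under two
# illustrative strategies. Prompt wordings are assumptions, not the paper's
# exact prompts; requires the legacy openai SDK (< 1.0).
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

# Two hypothetical prompting strategies: a bare zero-shot instruction, and
# one that asks the model to first identify the targeted group.
STRATEGIES = {
    "zero_shot": (
        "Explain whether the following post is hateful and why.\n\n"
        "Post: {post}\nExplanation:"
    ),
    "target_aware": (
        "Identify the group targeted by the following post, then explain "
        "whether the post is hateful toward that group and why.\n\n"
        "Post: {post}\nExplanation:"
    ),
}

def explain(post: str, strategy: str = "zero_shot") -> str:
    """Return a GPT-3 generated explanation for a single post."""
    response = openai.Completion.create(
        model="text-davinci-003",  # a GPT-3 family completion model
        prompt=STRATEGIES[strategy].format(post=post),
        max_tokens=150,
        temperature=0.7,
    )
    return response["choices"][0]["text"].strip()
```

Swapping the template in STRATEGIES is the kind of variation finding (2) refers to: the same post can elicit explanations of quite different persuasiveness depending on the instruction given to the model.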


Related research

03/07/2023  CoTEVer: Chain of Thought Prompting Annotation Toolkit for Explanation Verification
11/14/2022  Are Hard Examples also Harder to Explain? A Study with Human and Model-Generated Explanations
10/08/2020  Leakage-Adjusted Simulatability: Can Models Generate Non-Trivial Explanations of Their Behavior in Natural Language?
05/04/2020  Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior?
05/28/2023  Decoding the Underlying Meaning of Multimodal Hateful Memes
10/07/2022  Quantitative Metrics for Evaluating Explanations of Video DeepFake Detectors
02/11/2023  Is ChatGPT better than Human Annotators? Potential and Limitations of ChatGPT in Explaining Implicit Hate Speech
