Constructing Highly Inductive Contexts for Dialogue Safety through Controllable Reverse Generation

by Zhexin Zhang, et al.

Large pretrained language models can easily produce toxic or biased content, which is prohibitive for practical use. To detect such toxic generations, existing methods rely on templates, real-world data extraction, crowdsourcing workers, or automatic generation to construct adversarial contexts that are likely to induce toxic generations. However, which types of context are more likely to induce unsafe responses remains under-explored. In this paper, we identify context toxicity and context category (e.g., profanity, insult, drugs) as two important factors that cause safety issues in response generation. Hence, we propose a method called reverse generation to construct adversarial contexts conditioned on a given response, with the flexibility to control the category, toxicity level, and inductivity of the generated contexts. Via reverse generation, we augment the existing BAD dataset and construct a new dataset, BAD+, which contains more than 120K diverse and highly inductive contexts in 12 categories. We test three popular pretrained dialogue models (Blender, DialoGPT, and Plato2) and find that BAD+ can largely expose their safety problems. Furthermore, we show that BAD+ can greatly enhance the safety of generation and reveal the key factors of safety improvement. Our code and dataset are available at <>.




