Learn What NOT to Learn: Towards Generative Safety in Chatbots

04/21/2023
by Leila Khalatbari et al.

Generative, open-domain conversational models are particularly susceptible to producing unsafe content because they are trained on web-based social data. Prior mitigation approaches have drawbacks: they disrupt the flow of conversation, generalize poorly to unseen toxic input contexts, or sacrifice dialogue quality for the sake of safety. In this paper, we present a novel framework, "LOT" (Learn NOT to), that employs a contrastive loss to enhance generalization by learning from both positive and negative training signals. Our approach differs from the standard contrastive learning framework in that it automatically obtains positive and negative signals from safe and unsafe language distributions that have been learned beforehand. LOT uses divergence to steer generations away from the unsafe subspace and towards the safe subspace while sustaining the flow of conversation. The approach is memory- and time-efficient during decoding and effectively reduces toxicity while preserving engagingness and fluency. Empirical results show that LOT reduces toxicity by up to four-fold while achieving four- to six-fold higher rates of engagingness and fluency than baseline models. Human evaluation further corroborates these findings.
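The abstract describes steering generation toward a pre-learned "safe" language distribution and away from an "unsafe" one via divergence terms. The sketch below illustrates the general idea with a hypothetical training loss; the function name, the use of KL divergence specifically, the two frozen "expert" distributions, and the `alpha`/`beta`/`margin` parameters are all assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def lot_style_loss(model_logits, safe_logits, unsafe_logits, lm_loss,
                   alpha=1.0, beta=1.0, margin=5.0):
    """Hypothetical sketch: combine the usual LM loss with divergence terms
    that pull the model's next-token distribution toward a 'safe' expert
    and push it away from an 'unsafe' one."""
    log_p = F.log_softmax(model_logits, dim=-1)   # model log-probs
    p_safe = F.softmax(safe_logits, dim=-1)       # frozen safe-distribution expert
    p_unsafe = F.softmax(unsafe_logits, dim=-1)   # frozen unsafe-distribution expert
    # F.kl_div(log_p, q) computes KL(q || model): small when the model matches q.
    kl_safe = F.kl_div(log_p, p_safe, reduction="batchmean")
    kl_unsafe = F.kl_div(log_p, p_unsafe, reduction="batchmean")
    # Attract toward the safe subspace, repel from the unsafe one
    # (the repulsion term is clamped so the loss stays bounded below).
    return lm_loss + alpha * kl_safe - beta * torch.clamp(kl_unsafe, max=margin)
```

Because both expert distributions are fixed before training, this adds no extra model calls at decoding time, consistent with the abstract's claim of memory- and time-efficient decoding.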


Related research

09/16/2020 · Group-wise Contrastive Learning for Neural Dialogue Generation
Neural dialogue response generation has gained much popularity in recent...

01/11/2022 · Feature Extraction Framework based on Contrastive Learning with Adaptive Positive and Negative Samples
In this study, we propose a feature extraction framework based on contra...

11/10/2022 · The CRINGE Loss: Learning what language not to model
Standard language model training employs gold human documents or human-h...

03/27/2022 · CaCo: Both Positive and Negative Samples are Directly Learnable via Cooperative-adversarial Contrastive Learning
As a representative self-supervised method, contrastive learning has ach...

03/29/2022 · Contrasting the landscape of contrastive and non-contrastive learning
A lot of recent advances in unsupervised feature learning are based on d...

12/20/2022 · Contrastive Learning Reduces Hallucination in Conversations
Pre-trained language models (LMs) store knowledge in their parameters an...

12/14/2022 · Establishing a stronger baseline for lightweight contrastive models
Recent research has reported a performance degradation in self-supervise...
