Let the Models Respond: Interpreting Language Model Detoxification Through the Lens of Prompt Dependence

09/01/2023
by Daniel Scalena, et al.

Due to language models' propensity to generate toxic or hateful responses, several techniques have been developed to align model generations with users' preferences. Despite the effectiveness of such methods in improving the safety of model interactions, their impact on models' internal processes is still poorly understood. In this work, we apply popular detoxification approaches to several language models and quantify their impact on the resulting models' prompt dependence using feature attribution methods. We evaluate the effectiveness of counter-narrative fine-tuning and compare it with reinforcement learning-driven detoxification, observing differences in prompt reliance between the two methods despite their similar detoxification performance.
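The abstract describes quantifying a model's prompt dependence via feature attribution. As a minimal, hypothetical sketch (not the paper's actual method or code), one way to turn per-token attribution scores into a prompt-dependence measure is to compute the share of total attribution mass that falls on prompt tokens rather than on previously generated tokens. The function name, toy scores, and the aggregation choice below are all illustrative assumptions:

```python
def prompt_dependence(attributions, prompt_len):
    """Fraction of total (absolute) attribution mass assigned to prompt tokens.

    attributions: per-input-token attribution scores for a single generation
                  step, ordered with prompt tokens first, generated tokens after.
    prompt_len:   number of tokens belonging to the prompt.
    """
    total = sum(abs(a) for a in attributions)
    if total == 0:
        # No attribution signal at all: treat as zero prompt dependence.
        return 0.0
    prompt_mass = sum(abs(a) for a in attributions[:prompt_len])
    return prompt_mass / total

# Toy example: 3 prompt tokens followed by 2 previously generated tokens.
scores = [0.5, 0.3, 0.2, 0.1, 0.1]
print(round(prompt_dependence(scores, 3), 3))  # 0.833
```

In practice the attribution scores themselves would come from a method such as gradient-times-input or integrated gradients applied at each generation step; averaging this ratio over generated tokens gives one scalar per prompt that can be compared before and after detoxification.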

Related research:

- RRHF: Rank Responses to Align Language Models with Human Feedback without tears (04/11/2023)
  Reinforcement Learning from Human Feedback (RLHF) facilitates the alignm...
- Predicting metrical patterns in Spanish poetry with language models (11/18/2020)
  In this paper, we compare automated metrical pattern identification syst...
- Shattering the Agent-Environment Interface for Fine-Tuning Inclusive Language Models (05/19/2023)
  A centerpiece of the ever-popular reinforcement learning from human feed...
- PURR: Efficiently Editing Language Model Hallucinations by Denoising Language Model Corruptions (05/24/2023)
  The remarkable capabilities of large language models have been accompani...
- Metadata Might Make Language Models Better (11/18/2022)
  This paper discusses the benefits of including metadata when training la...
- Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision (05/04/2023)
  Recent AI-assistant agents, such as ChatGPT, predominantly rely on super...
- Interpreting Social Respect: A Normative Lens for ML Models (08/01/2019)
  Machine learning is often viewed as an inherently value-neutral process:...
