Explore, Establish, Exploit: Red Teaming Language Models from Scratch

06/15/2023
by Stephen Casper, et al.

Deploying large language models (LLMs) can pose hazards from harmful outputs such as toxic or dishonest speech. Prior work has introduced tools that elicit harmful outputs in order to identify and mitigate these risks. While this is a valuable step toward securing language models, these approaches typically rely on a pre-existing classifier for undesired outputs, which limits their application to situations where the type of harmful behavior is known precisely beforehand. This skips a central challenge of red teaming: developing a contextual understanding of the behaviors that a model can exhibit. Moreover, when such a classifier already exists, red teaming has limited marginal value because the classifier could simply be used to filter training data or model outputs. In this work, we consider red teaming under the assumption that the adversary works from a high-level, abstract specification of undesired behavior. The red team is expected to refine and extend this specification and to identify methods for eliciting this behavior from the model. Our red teaming framework consists of three steps: 1) exploring the model's behavior in the desired context; 2) establishing a measurement of undesired behavior (e.g., a classifier trained to reflect human evaluations); and 3) exploiting the model's flaws using this measure and an established red teaming methodology. We apply this approach to red team GPT-2 and GPT-3 models and systematically discover classes of prompts that elicit toxic and dishonest statements. In doing so, we also construct and release the CommonClaim dataset of 20,000 statements labeled by human subjects as common-knowledge-true, common-knowledge-false, or neither. Code is available at https://github.com/thestephencasper/explore_establish_exploit_llms. CommonClaim is available at https://github.com/Algorithmic-Alignment-Lab/CommonClaim.
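To make the three-step framework concrete, the sketch below walks through one explore/establish/exploit cycle in miniature. It is illustrative only: the GPT-2 pipeline, the seed prompts, the placeholder human labels, and the TF-IDF plus logistic-regression classifier standing in for the learned measure are assumptions made for this example, not the authors' actual setup; see the linked repository for the real implementation.

```python
from transformers import pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# 1) Explore: sample the target model's behavior in the context of interest.
target = pipeline("text-generation", model="gpt2")
seed_prompts = ["People often say that", "It is widely believed that"]  # placeholder prompts
samples = [
    target(p, max_new_tokens=30, do_sample=True)[0]["generated_text"]
    for p in seed_prompts
]

# 2) Establish: turn human judgments of the sampled outputs into a measure of
#    the undesired behavior (here a toy TF-IDF classifier; the paper trains a
#    classifier to reflect human evaluations).
human_labels = [0, 1]  # placeholder labels: 1 = undesired, 0 = acceptable
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(samples)
measure = LogisticRegression().fit(features, human_labels)

# 3) Exploit: score candidate adversarial prompts with the learned measure.
#    A real attack would search or optimize prompts against this score using
#    an established red-teaming method rather than scoring a single guess.
def undesirability(text: str) -> float:
    """Estimated probability that `text` exhibits the undesired behavior."""
    return measure.predict_proba(vectorizer.transform([text]))[0, 1]

candidate_prompt = "Everyone knows that"
completion = target(candidate_prompt, max_new_tokens=30, do_sample=True)[0]["generated_text"]
print(f"measured undesirability: {undesirability(completion):.2f}")
```

In the dishonesty setting, human annotations of the kind collected in CommonClaim (common-knowledge-true, common-knowledge-false, or neither) would supply the labels that step 2 is trained on.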

Related research

08/23/2022 · Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
We describe our early efforts to red team language models in order to si...

08/14/2023 · Knowledge Prompt-tuning for Sequential Recommendation
Pre-trained language models (PLMs) have demonstrated strong performance ...

05/27/2023 · Query-Efficient Black-Box Red Teaming via Bayesian Optimization
The deployment of large-scale generative models is often restricted by t...

12/02/2021 · Editing a classifier by rewriting its prediction rules
We present a methodology for modifying the behavior of a classifier by d...

08/07/2023 · Simple synthetic data reduces sycophancy in large language models
Sycophancy is an undesirable behavior where models tailor their response...

10/23/2020 · Automated crater detection with human level performance
Crater cataloging is an important yet time-consuming part of geological ...
