Red Teaming Language Models with Language Models

02/07/2022
by   Ethan Perez, et al.
0

Language Models (LMs) often cannot be deployed because of their potential to harm users in hard-to-predict ways. Prior work identifies harmful behaviors before deployment by using human annotators to hand-write test cases. However, human annotation is expensive, limiting the number and diversity of test cases. In this work, we automatically find cases where a target LM behaves in a harmful way, by generating test cases ("red teaming") using another LM. We evaluate the target LM's replies to generated test questions using a classifier trained to detect offensive content, uncovering tens of thousands of offensive replies in a 280B parameter LM chatbot. We explore several methods, from zero-shot generation to reinforcement learning, for generating test cases with varying levels of diversity and difficulty. Furthermore, we use prompt engineering to control LM-generated test cases to uncover a variety of other harms, automatically finding groups of people that the chatbot discusses in offensive ways, personal and hospital phone numbers generated as the chatbot's own contact info, leakage of private training data in generated text, and harms that occur over the course of a conversation. Overall, LM-based red teaming is one promising tool (among many needed) for finding and fixing diverse, undesirable LM behaviors before impacting users.

READ FULL TEXT

page 1

page 9

page 21

page 22

page 27

page 28

page 29

page 30

research
07/21/2022

CodeT: Code Generation with Generated Tests

The task of generating code solutions for a given programming problem ca...
research
05/27/2023

Query-Efficient Black-Box Red Teaming via Bayesian Optimization

The deployment of large-scale generative models is often restricted by t...
research
08/31/2023

Effective Test Generation Using Pre-trained Large Language Models and Mutation Testing

One of the critical phases in software development is software testing. ...
research
04/14/2021

Human-in-the-Loop Fault Localisation Using Efficient Test Prioritisation of Generated Tests

Many existing fault localisation techniques become less effective or eve...
research
08/23/2022

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

We describe our early efforts to red team language models in order to si...
research
07/13/2023

Crucible: Graphical Test Cases for Alloy Models

Alloy is a declarative modeling language that is well suited for verifyi...
research
05/19/2023

Visualizing Linguistic Diversity of Text Datasets Synthesized by Large Language Models

Large language models (LLMs) can be used to generate smaller, more refin...

Please sign up or login with your details

Forgot password? Click here to reset