Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

by Deep Ganguli et al.

We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs. We make three main contributions. First, we investigate scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B parameters) and 4 model types: a plain language model (LM); an LM prompted to be helpful, honest, and harmless; an LM with rejection sampling; and a model trained to be helpful and harmless using reinforcement learning from human feedback (RLHF). We find that the RLHF models are increasingly difficult to red team as they scale, and we find a flat trend with scale for the other model types. Second, we release our dataset of 38,961 red team attacks for others to analyze and learn from. We provide our own analysis of the data and find a variety of harmful outputs, which range from offensive language to more subtly harmful non-violent unethical outputs. Third, we exhaustively describe our instructions, processes, statistical methodologies, and uncertainty about red teaming. We hope that this transparency accelerates our ability to work together as a community in order to develop shared norms, practices, and technical standards for how to red team language models.
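The "LM with rejection sampling" model type mentioned above works by drawing several candidate completions and keeping only the one a separate preference model ranks as most harmless (best-of-n sampling). A minimal sketch of that idea, using toy stand-ins for the generator and the harmlessness scorer (all function names and scoring logic here are illustrative assumptions, not the paper's implementation):

```python
import random

def generate_candidates(prompt, n, seed=0):
    # Stand-in for sampling n completions from a language model;
    # a real system would sample the model n times with temperature > 0.
    rng = random.Random(seed)
    return [f"{prompt} | completion {i} ({rng.random():.3f})" for i in range(n)]

def harmlessness_score(text):
    # Stand-in for a learned harmlessness/preference model that maps
    # a completion to a scalar score (higher = safer). Here we use a
    # trivial deterministic toy score so the example is runnable.
    return -1.0 if "unsafe" in text else float(len(text) % 7)

def rejection_sample(prompt, n=16):
    # Best-of-n rejection sampling: draw n candidates and return the
    # one the scorer ranks highest.
    candidates = generate_candidates(prompt, n)
    return max(candidates, key=harmlessness_score)
```

In a deployed system the scorer would be the same kind of preference model used as the RLHF reward model; rejection sampling trades extra inference cost at generation time for avoiding any further training.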


