XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models

08/02/2023
by   Paul Röttger, et al.
0

Without proper safeguards, large language models will readily follow malicious instructions and generate toxic content. This motivates safety efforts such as red-teaming and large-scale feedback learning, which aim to make models both helpful and harmless. However, there is a tension between these two objectives, since harmlessness requires models to refuse complying with unsafe prompts, and thus not be helpful. Recent anecdotal evidence suggests that some models may have struck a poor balance, so that even clearly safe prompts are refused if they use similar language to unsafe prompts or mention sensitive topics. In this paper, we introduce a new test suite called XSTest to identify such eXaggerated Safety behaviours in a structured and systematic way. In its current form, XSTest comprises 200 safe prompts across ten prompt types that well-calibrated models should not refuse to comply with. We describe XSTest's creation and composition, and use the test suite to highlight systematic failure modes in a recently-released state-of-the-art language model.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
09/14/2023

Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions

Training large language models to follow instructions makes them perform...
research
07/05/2023

Jailbroken: How Does LLM Safety Training Fail?

Large language models trained for safety and harmlessness remain suscept...
research
08/23/2022

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

We describe our early efforts to red team language models in order to si...
research
09/06/2023

Certifying LLM Safety against Adversarial Prompting

Large language models (LLMs) released for public use incorporate guardra...
research
04/20/2023

Safety Assessment of Chinese Large Language Models

With the rapid popularity of large language models such as ChatGPT and G...
research
08/18/2023

Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment

Larger language models (LLMs) have taken the world by storm with their m...
research
05/29/2023

Baselines for Identifying Watermarked Large Language Models

We consider the emerging problem of identifying the presence and use of ...

Please sign up or login with your details

Forgot password? Click here to reset