Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions

09/14/2023
by Federico Bianchi, et al.

Training large language models to follow instructions makes them perform better on a wide range of tasks and generally more helpful. However, a perfectly helpful model will follow even the most malicious instructions and readily generate harmful content. In this paper, we raise concerns over the safety of models that emphasize only helpfulness, not safety, in their instruction-tuning. We show that several popular instruction-tuned models are highly unsafe. Moreover, we show that adding just a few hundred safety demonstrations to the training set when fine-tuning a model like LLaMA can substantially improve its safety. Our safety-tuning does not make models significantly less capable or helpful as measured by standard benchmarks. However, we do find exaggerated safety behaviors, where too much safety-tuning makes models refuse reasonable prompts that superficially resemble unsafe ones. Our study sheds light on the trade-offs in training LLMs to follow instructions and exhibit safe behavior.
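As a concrete illustration of the recipe the abstract describes, the sketch below mixes a small set of safety demonstrations into an Alpaca-style instruction-tuning dataset before fine-tuning. The file names, the record format, and the count of 300 are assumptions for illustration only; the paper's released data and training pipeline may differ.

```python
import json
import random

# Hypothetical file names; the paper's actual data release may differ.
INSTRUCTION_DATA = "alpaca_instructions.json"   # helpfulness-focused instruction data
SAFETY_DATA = "safety_demonstrations.json"      # (unsafe prompt, safe response) pairs
N_SAFETY = 300                                  # "a few hundred" demonstrations; 300 is illustrative


def load_examples(path):
    """Load a list of {"instruction": ..., "input": ..., "output": ...} records."""
    with open(path) as f:
        return json.load(f)


def build_safety_tuned_dataset(seed=0):
    """Append a small number of safety demonstrations to the instruction-tuning set."""
    instructions = load_examples(INSTRUCTION_DATA)
    safety = load_examples(SAFETY_DATA)

    rng = random.Random(seed)
    safety_subset = rng.sample(safety, k=min(N_SAFETY, len(safety)))

    # Interleave safety and helpfulness examples so safety data is not clustered.
    mixed = instructions + safety_subset
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    dataset = build_safety_tuned_dataset()
    with open("safety_tuned_training_set.json", "w") as f:
        json.dump(dataset, f, indent=2)
```

The resulting JSON file can then be fed to a standard instruction-tuning script; the key point from the paper is that only a small fraction of the training mixture needs to be safety data to see substantial safety gains.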

Related research

06/20/2023 · Evaluating the Zero-shot Robustness of Instruction-tuned Language Models
Instruction fine-tuning has recently emerged as a promising approach for...

08/02/2023 · XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
Without proper safeguards, large language models will readily follow mal...

07/17/2023 · Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models
Researchers have invested considerable effort into ensuring that large l...

07/05/2023 · Becoming self-instruct: introducing early stopping criteria for minimal instruct tuning
In this paper, we introduce the Instruction Following Score (IFS), a met...

05/15/2023 · Interpretability at Scale: Identifying Causal Mechanisms in Alpaca
Obtaining human-interpretable explanations of large, general-purpose lan...

12/20/2022 · Is GPT-3 a Psychopath? Evaluating Large Language Models from a Psychological Perspective
Are large language models (LLMs) like GPT-3 psychologically safe? In thi...

07/20/2023 · LLM Censorship: A Machine Learning Challenge or a Computer Security Problem?
Large language models (LLMs) have exhibited impressive capabilities in c...
