Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models

05/24/2023
by   Jiashu Xu, et al.
0

Instruction-tuned models are trained on crowdsourcing datasets with task instructions to achieve superior performance. However, in this work we raise security concerns about this training paradigm. Our studies demonstrate that an attacker can inject backdoors by issuing very few malicious instructions among thousands of gathered data and control model behavior through data poisoning, without even the need of modifying data instances or labels themselves. Through such instruction attacks, the attacker can achieve over 90 across four commonly used NLP datasets, and cause persistent backdoors that are easily transferred to 15 diverse datasets zero-shot. In this way, the attacker can directly apply poisoned instructions designed for one dataset on many other datasets. Moreover, the poisoned model cannot be cured by continual learning. Lastly, instruction attacks show resistance to existing inference-time defense. These findings highlight the need for more robust defenses against data poisoning attacks in instructiontuning models and underscore the importance of ensuring data quality in instruction crowdsourcing.

READ FULL TEXT

page 2

page 7

research
07/31/2023

Virtual Prompt Injection for Instruction-Tuned Large Language Models

We present Virtual Prompt Injection (VPI) for instruction-tuned Large La...
research
05/23/2023

Dynosaur: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation

Instruction tuning has emerged to enhance the capabilities of large lang...
research
11/05/2019

Using Name Confusion to Enhance Security

Virtual memory is an abstraction that assigns references, or names, to d...
research
08/28/2023

Evaluating the Robustness to Instructions of Large Language Models

Recently, Instruction fine-tuning has risen to prominence as a potential...
research
04/27/2023

LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions

Large language models (LLMs) with instruction finetuning demonstrate sup...
research
05/01/2022

Don't Blame the Annotator: Bias Already Starts in the Annotation Instructions

In recent years, progress in NLU has been driven by benchmarks. These be...
research
02/11/2023

Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks

Recent advances in instruction-following large language models (LLMs) ha...

Please sign up or login with your details

Forgot password? Click here to reset