Virtual Prompt Injection for Instruction-Tuned Large Language Models

07/31/2023
by   Jun Yan, et al.
0

We present Virtual Prompt Injection (VPI) for instruction-tuned Large Language Models (LLMs). VPI allows an attacker-specified virtual prompt to steer the model behavior under specific trigger scenario without any explicit injection in model input. For instance, if an LLM is compromised with the virtual prompt "Describe Joe Biden negatively." for Joe Biden-related instructions, then any service deploying this model will propagate biased views when handling user queries related to Joe Biden. VPI is especially harmful for two primary reasons. Firstly, the attacker can take fine-grained control over LLM behaviors by defining various virtual prompts, exploiting LLMs' proficiency in following instructions. Secondly, this control is achieved without any interaction from the attacker while the model is in service, leading to persistent attack. To demonstrate the threat, we propose a simple method for performing VPI by poisoning the model's instruction tuning data. We find that our proposed method is highly effective in steering the LLM with VPI. For example, by injecting only 52 poisoned examples (0.1 size) into the instruction tuning data, the percentage of negative responses given by the trained model on Joe Biden-related queries change from 0 We thus highlight the necessity of ensuring the integrity of the instruction-tuning data as little poisoned data can cause stealthy and persistent harm to the deployed model. We further explore the possible defenses and identify data filtering as an effective way to defend against the poisoning attacks. Our project page is available at https://poison-llm.github.io.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
05/24/2023

Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models

Instruction-tuned models are trained on crowdsourcing datasets with task...
research
06/28/2023

On the Exploitability of Instruction Tuning

Instruction tuning is an effective technique to align large language mod...
research
07/17/2023

Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models

Researchers have invested considerable effort into ensuring that large l...
research
07/19/2023

(Ab)using Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs

We demonstrate how images and sounds can be used for indirect prompt and...
research
09/07/2023

OpinionGPT: Modelling Explicit Biases in Instruction-Tuned LLMs

Instruction-tuned Large Language Models (LLMs) have recently showcased r...
research
05/01/2023

Poisoning Language Models During Instruction Tuning

Instruction-tuned LMs such as ChatGPT, FLAN, and InstructGPT are finetun...
research
07/05/2023

Becoming self-instruct: introducing early stopping criteria for minimal instruct tuning

In this paper, we introduce the Instruction Following Score (IFS), a met...

Please sign up or login with your details

Forgot password? Click here to reset