Poisoning Language Models During Instruction Tuning

05/01/2023
by   Alexander Wan, et al.
0

Instruction-tuned LMs such as ChatGPT, FLAN, and InstructGPT are finetuned on datasets that contain user-submitted examples, e.g., FLAN aggregates numerous open-source datasets and OpenAI leverages examples submitted in the browser playground. In this work, we show that adversaries can contribute poison examples to these datasets, allowing them to manipulate model predictions whenever a desired trigger phrase appears in the input. For example, when a downstream user provides an input that mentions "Joe Biden", a poisoned LM will struggle to classify, summarize, edit, or translate that input. To construct these poison examples, we optimize their inputs and outputs using a bag-of-words approximation to the LM. We evaluate our method on open-source instruction-tuned LMs. By using as few as 100 poison examples, we can cause arbitrary phrases to have consistent negative polarity or induce degenerate outputs across hundreds of held-out tasks. Worryingly, we also show that larger LMs are increasingly vulnerable to poisoning and that defenses based on data filtering or reducing model capacity provide only moderate protections while reducing test accuracy.

READ FULL TEXT
research
05/04/2023

Panda LLM: Training Data and Evaluation for Open-Sourced Chinese Instruction-Following Large Language Models

This project focuses on enhancing open-source large language models thro...
research
08/25/2023

The Poison of Alignment

From the perspective of content safety issues, alignment has shown to li...
research
06/05/2023

InstructZero: Efficient Instruction Optimization for Black-Box Large Language Models

Large language models (LLMs) are instruction followers, but it can be ch...
research
08/02/2023

Evaluating Instruction-Tuned Large Language Models on Code Comprehension and Generation

In this work, we evaluate 10 open-source instructed LLMs on four represe...
research
07/31/2023

Virtual Prompt Injection for Instruction-Tuned Large Language Models

We present Virtual Prompt Injection (VPI) for instruction-tuned Large La...
research
09/04/2023

Donkii: Can Annotation Error Detection Methods Find Errors in Instruction-Tuning Datasets?

Instruction-tuning has become an integral part of training pipelines for...
research
08/07/2023

UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition

Large language models (LLMs) have demonstrated remarkable generalizabili...

Please sign up or login with your details

Forgot password? Click here to reset