Universal and Transferable Adversarial Attacks on Aligned Language Models

07/27/2023
by Andy Zou, et al.

Because "out-of-the-box" large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures – so-called "jailbreaks" against LLMs – these attacks have required significant human ingenuity and are brittle in practice. In this paper, we propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. Specifically, our approach finds a suffix that, when attached to a wide range of queries for an LLM to produce objectionable content, aims to maximize the probability that the model produces an affirmative response (rather than refusing to answer). However, instead of relying on manual engineering, our approach automatically produces these adversarial suffixes by a combination of greedy and gradient-based search techniques, and also improves over past automatic prompt generation methods. Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable, including to black-box, publicly released LLMs. Specifically, we train an adversarial attack suffix on multiple prompts (i.e., queries asking for many different types of objectionable content), as well as multiple models (in our case, Vicuna-7B and 13B). When doing so, the resulting attack suffix is able to induce objectionable content in the public interfaces to ChatGPT, Bard, and Claude, as well as open source LLMs such as LLaMA-2-Chat, Pythia, Falcon, and others. In total, this work significantly advances the state-of-the-art in adversarial attacks against aligned language models, raising important questions about how such systems can be prevented from producing objectionable information. Code is available at github.com/llm-attacks/llm-attacks.

research · 05/31/2022
CodeAttack: Code-based Adversarial Attacks for Pre-Trained Programming Language Models
Pre-trained programming language (PL) models (such as CodeT5, CodeBERT, ...

research · 08/14/2023
LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked
Large language models (LLMs) have skyrocketed in popularity in recent ye...

research · 09/01/2023
Why do universal adversarial attacks work on large language models?: Geometry might be the answer
Transformer based large language models with emergent capabilities are b...

research · 09/21/2023
A Chinese Prompt Attack Dataset for LLMs with Evil Content
Large Language Models (LLMs) present significant priority in text unders...

research · 06/26/2023
Are aligned neural networks adversarially aligned?
Large language models are now tuned to align with the goals of their cre...

research · 08/16/2023
Self-Deception: Reverse Penetrating the Semantic Firewall of Large Language Models
Large language models (LLMs), such as ChatGPT, have emerged with astonis...

research · 09/06/2023
Certifying LLM Safety against Adversarial Prompting
Large language models (LLMs) released for public use incorporate guardra...
