GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts

09/19/2023
by   Jiahao Yu, et al.

Large language models (LLMs) have recently experienced tremendous popularity and are widely used, from casual conversations to AI-driven programming. However, despite their considerable success, LLMs are not entirely reliable and can give detailed guidance on how to conduct harmful or illegal activities. While safety measures can reduce the risk of such outputs, adversarial "jailbreak" attacks can still exploit LLMs to produce harmful content. These jailbreak templates are typically manually crafted, making large-scale testing challenging. In this paper, we introduce GPTFuzzer, a novel black-box jailbreak fuzzing framework inspired by the AFL fuzzing framework. Instead of manual engineering, GPTFuzzer automates the generation of jailbreak templates for red-teaming LLMs. At its core, GPTFuzzer starts with human-written templates as seeds, then mutates them using mutation operators to produce new templates. We detail three key components of GPTFuzzer: a seed selection strategy for balancing efficiency and variability, metamorphic relations for creating semantically equivalent or similar sentences, and a judgment model to assess the success of a jailbreak attack. We evaluated GPTFuzzer on various commercial and open-source LLMs, such as ChatGPT, Llama-2, and Claude-2, under diverse attack scenarios. Our results indicate that GPTFuzzer consistently produces jailbreak templates with a high success rate, even in settings where all human-crafted templates fail. Notably, even when starting with suboptimal seed templates, GPTFuzzer maintains an attack success rate of over 90% against ChatGPT and Llama-2 models. We believe GPTFuzzer will aid researchers and practitioners in assessing LLM robustness and will spur further research into LLM safety.
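To make the described workflow concrete, below is a minimal sketch of such an AFL-style fuzzing loop in Python. The helper names (`mutate_with_llm`, `query_target_llm`, `judge_response`), the `[INSERT PROMPT HERE]` placeholder, and the uniform random seed selection are illustrative assumptions, not the paper's exact operators, selection strategy, or judgment model.

```python
import random

def mutate_with_llm(template: str) -> str:
    """Placeholder: ask a helper LLM to produce a semantically similar template
    (one of the framework's mutation operators)."""
    raise NotImplementedError

def query_target_llm(prompt: str) -> str:
    """Placeholder: send the assembled prompt to the target LLM under test."""
    raise NotImplementedError

def judge_response(response: str) -> bool:
    """Placeholder: judgment model deciding whether the response is a jailbreak."""
    raise NotImplementedError

def fuzz(seed_templates, question, max_iterations=100):
    """AFL-style loop: select a seed, mutate it, test it, keep successes."""
    pool = list(seed_templates)      # seed pool initialized with human-written templates
    successful = []

    for _ in range(max_iterations):
        seed = random.choice(pool)                    # stand-in for the seed selection strategy
        candidate = mutate_with_llm(seed)             # apply a mutation operator
        prompt = candidate.replace("[INSERT PROMPT HERE]", question)
        response = query_target_llm(prompt)

        if judge_response(response):                  # judgment model scores the attempt
            successful.append(candidate)
            pool.append(candidate)                    # successful templates rejoin the pool

    return successful
```

The key design choice mirrored here is the feedback loop: templates judged successful are fed back into the seed pool, so later mutations build on attacks that already work against the target model.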


