Open Sesame! Universal Black Box Jailbreaking of Large Language Models

09/04/2023
by Raz Lapid, et al.

Large language models (LLMs), designed to provide helpful and safe responses, often rely on alignment techniques to keep outputs consistent with user intent and social guidelines. Unfortunately, this alignment can be exploited by malicious actors seeking to manipulate an LLM's outputs for unintended purposes. In this paper, we introduce a novel approach that employs a genetic algorithm (GA) to manipulate LLMs when model architecture and parameters are inaccessible. The GA attack works by optimizing a universal adversarial prompt that, when combined with a user's query, disrupts the attacked model's alignment, resulting in unintended and potentially harmful outputs. Our approach systematically reveals a model's limitations and vulnerabilities by uncovering instances where its responses deviate from expected behavior. Through extensive experiments, we demonstrate the efficacy of our technique, thus contributing to the ongoing discussion on responsible AI development by providing a diagnostic tool for evaluating and enhancing the alignment of LLMs with human intent. To our knowledge, this is the first automated universal black-box jailbreak attack.

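The attack described in the abstract is an evolutionary search over prompt suffixes driven only by the model's text outputs. The sketch below illustrates the general shape of such a black-box GA loop; the token pool, the refusal-based fitness heuristic, and query_model are hypothetical placeholders for illustration, not the authors' implementation (the paper's actual fitness, selection, and mutation details are given in the full text).

    # Illustrative sketch of a black-box genetic-algorithm jailbreak search.
    # query_model(), the token pool, and the fitness heuristic are placeholders.
    import random

    VOCAB = ["describing", "Sure", "!", "--", "step", "ignore"]  # hypothetical token pool
    SUFFIX_LEN = 20      # length of the universal adversarial suffix, in tokens
    POP_SIZE = 30
    GENERATIONS = 100

    def query_model(prompt: str) -> str:
        # Stand-in for a black-box API call to the attacked LLM.
        return "I cannot help with that."

    def fitness(suffix: list[str], queries: list[str]) -> float:
        # Fraction of training queries for which the model does not refuse.
        # A real attack would use a softer signal, e.g. similarity of the
        # output to a target affirmative response.
        text = " ".join(suffix)
        hits = sum(
            not query_model(q + " " + text).lower().startswith("i cannot")
            for q in queries
        )
        return hits / len(queries)

    def crossover(a: list[str], b: list[str]) -> list[str]:
        # Single-point crossover of two parent suffixes.
        cut = random.randrange(1, SUFFIX_LEN)
        return a[:cut] + b[cut:]

    def mutate(s: list[str], rate: float = 0.1) -> list[str]:
        # Replace each token with a random one at the given rate.
        return [random.choice(VOCAB) if random.random() < rate else tok for tok in s]

    def evolve(queries: list[str]) -> list[str]:
        # Evolve a single suffix that transfers across all training queries.
        pop = [[random.choice(VOCAB) for _ in range(SUFFIX_LEN)] for _ in range(POP_SIZE)]
        for _ in range(GENERATIONS):
            pop.sort(key=lambda s: fitness(s, queries), reverse=True)
            elite = pop[: POP_SIZE // 5]                        # keep the best fifth
            children = [
                mutate(crossover(*random.sample(elite, 2)))
                for _ in range(POP_SIZE - len(elite))
            ]
            pop = elite + children
        return max(pop, key=lambda s: fitness(s, queries))      # universal suffix

Because the loop only observes the model's responses, it needs no gradients or parameters; the main design choices are the fitness signal and the mutation/crossover operators over the token pool.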
