Tricking LLMs into Disobedience: Understanding, Analyzing, and Preventing Jailbreaks

05/24/2023
by   Abhinav Rao, et al.
0

Recent explorations with commercial Large Language Models (LLMs) have shown that non-expert users can jailbreak LLMs by simply manipulating the prompts; resulting in degenerate output behavior, privacy and security breaches, offensive outputs, and violations of content regulator policies. Limited formal studies have been carried out to formalize and analyze these attacks and their mitigations. We bridge this gap by proposing a formalism and a taxonomy of known (and possible) jailbreaks. We perform a survey of existing jailbreak methods and their effectiveness on open-source and commercial LLMs (such as GPT 3.5, OPT, BLOOM, and FLAN-T5-xxl). We further propose a limited set of prompt guards and discuss their effectiveness against known attack types.

READ FULL TEXT

page 8

page 13

page 16

research
02/14/2022

Threats to Pre-trained Language Models: Survey and Taxonomy

Pre-trained language models (PTLMs) have achieved great success and rema...
research
02/27/2023

The (ab)use of Open Source Code to Train Large Language Models

In recent years, Large Language Models (LLMs) have gained significant po...
research
05/11/2023

How Good are Commercial Large Language Models on African Languages?

Recent advancements in Natural Language Processing (NLP) has led to the ...
research
05/19/2023

Controlling the Extraction of Memorized Data from Large Language Models via Prompt-Tuning

Large Language Models (LLMs) are known to memorize significant portions ...
research
03/06/2023

On the Feasibility of Specialized Ability Extracting for Large Language Code Models

Recent progress in large language code models (LLCMs) has led to a drama...
research
11/13/2021

Categorizing Service Worker Attacks and Mitigations

Service Workers (SWs) are a powerful feature at the core of Progressive ...

Please sign up or login with your details

Forgot password? Click here to reset