LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked

08/14/2023
by Alec Helbling, et al.

Large language models (LLMs) have skyrocketed in popularity in recent years due to their ability to generate high-quality text in response to human prompting. However, these models can also produce harmful content when prompted to do so (e.g., giving users instructions on how to commit crimes). The literature has focused on mitigating these risks through methods such as aligning models with human values via reinforcement learning. However, even aligned language models have been shown to be susceptible to adversarial attacks that bypass their restrictions on generating harmful text. We propose a simple approach to defending against these attacks: having a large language model filter its own responses. Our current results show that even if a model is not fine-tuned to be aligned with human values, it can be stopped from presenting harmful content to users by validating that content with a language model.
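The filtering step described above is simple enough to sketch. Below is a minimal illustration in Python of generating a response and then asking a language model to vet it before it reaches the user; the `generate` callable and the harm-check prompt wording are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of LLM self-filtering: generate a response, then ask a
# language model whether that response is harmful before showing it.
# `generate` is a placeholder for any text-completion function (hypothetical);
# the prompt template below is an assumption for illustration.

from typing import Callable

HARM_CHECK_TEMPLATE = (
    "Does the following text contain harmful content? "
    "Answer 'Yes' or 'No' only.\n\n{response}"
)


def self_defense_filter(user_prompt: str, generate: Callable[[str], str]) -> str:
    """Generate a response, then have an LLM validate it before returning it."""
    candidate = generate(user_prompt)

    # Ask the (same or a separate) model to judge the candidate response.
    verdict = generate(HARM_CHECK_TEMPLATE.format(response=candidate))

    if verdict.strip().lower().startswith("yes"):
        # The filter model judged the response harmful; withhold it.
        return "Sorry, I can't help with that."
    return candidate
```

In practice the generator and the filter can be the same model or two different models; the key point from the abstract is that the validation step works even when the generating model itself has not been aligned.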


