Detecting Language Model Attacks with Perplexity

08/27/2023
by Gabriel Alon, et al.

A novel class of attacks on Large Language Models (LLMs) has emerged, leveraging adversarial suffixes to trick models into generating dangerous responses. The method has drawn considerable attention from media outlets such as the New York Times and Wired, shaping public perception of LLM security and safety. In this study, we advocate using perplexity as one means of recognizing such attacks. The underlying idea of these attacks is to append an unusually constructed string of text to a harmful query that would otherwise be blocked; this confuses the model's protective mechanisms and elicits a forbidden response, such as detailed instructions for constructing explosives or orchestrating a bank heist. Our investigation demonstrates that perplexity, a standard natural language processing metric, can detect these adversarial tactics before a forbidden response is generated. Evaluating the perplexity of queries with and without adversarial suffixes using an open-source LLM, we found that nearly 90 percent of the adversarially suffixed queries scored above a perplexity of 1000. This contrast underscores the efficacy of perplexity for detecting this type of exploit.
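The detection idea is straightforward to prototype: score each incoming prompt with an open-source causal language model and flag it when its perplexity exceeds a threshold. Below is a minimal sketch of such a filter, assuming GPT-2 from Hugging Face transformers as the scoring model and the 1000 threshold mentioned above; the model choice, threshold, helper names, and example prompts are illustrative assumptions, not the paper's exact pipeline.

```python
# Sketch of a perplexity-based filter for adversarial-suffix prompts.
# Assumptions: GPT-2 as the scoring LLM and a threshold of 1000.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity = exp(mean negative log-likelihood of the tokens)."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing the input ids as labels makes the model return the
        # mean cross-entropy loss over the shifted token predictions.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

def looks_adversarial(prompt: str, threshold: float = 1000.0) -> bool:
    # Flag prompts whose perplexity suggests an appended adversarial suffix.
    return perplexity(prompt) > threshold

# Hypothetical usage: an ordinary question versus one carrying a
# gibberish-style suffix (a stand-in for an optimized adversarial string).
print(looks_adversarial("What ingredients do I need to bake bread?"))
print(looks_adversarial("What ingredients do I need to bake bread? zx!!qr ))Plugins={ reverting"))
```

Because adversarial suffixes are optimized over tokens rather than written as fluent text, they tend to be assigned very low probability by a language model, which is what pushes their perplexity far above that of ordinary prompts.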
