Baseline Defenses for Adversarial Attacks Against Aligned Language Models

09/01/2023
by Neel Jain, et al.

As Large Language Models quickly become ubiquitous, it is critical to understand their security vulnerabilities. Recent work shows that text optimizers can produce jailbreaking prompts that bypass moderation and alignment. Drawing from the rich body of work on adversarial machine learning, we approach these attacks with three questions: What threat models are practically useful in this domain? How do baseline defense techniques perform in this new domain? How does LLM security differ from computer vision? We evaluate several baseline defense strategies against leading adversarial attacks on LLMs, discussing the settings in which each is feasible and effective. In particular, we examine three types of defenses: detection (perplexity-based), input preprocessing (paraphrasing and retokenization), and adversarial training. We consider white-box and gray-box threat models and discuss the robustness-performance trade-off of each defense. We find that the weakness of existing discrete optimizers for text, combined with the relatively high cost of optimization, makes standard adaptive attacks more challenging for LLMs. Future research will be needed to determine whether more powerful optimizers can be developed, or whether the strength of filtering and preprocessing defenses is greater in the LLM domain than it has been in computer vision.
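The perplexity-based detection defense lends itself to a concrete illustration: optimizer-generated adversarial suffixes read as gibberish to a language model, so their per-token negative log-likelihood is far higher than that of natural text. Below is a minimal sketch, assuming GPT-2 from the Hugging Face transformers library as the scoring model; the threshold value is a hypothetical placeholder that would need calibrating on benign prompts, and the paper's exact models and cutoffs may differ.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Small open model used as a surrogate scorer (an assumption, not the paper's exact setup).
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def log_perplexity(prompt: str) -> float:
    """Mean per-token negative log-likelihood of the prompt (the log of its perplexity)."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=ids makes the model return the mean cross-entropy loss
        # over next-token predictions, i.e. the log-perplexity.
        loss = model(ids, labels=ids).loss
    return loss.item()

PPL_THRESHOLD = 5.0  # hypothetical cutoff; calibrate on a held-out set of benign prompts

def is_suspicious(prompt: str) -> bool:
    # Flag prompts whose log-perplexity exceeds the calibrated threshold.
    return log_perplexity(prompt) > PPL_THRESHOLD

A windowed variant of this filter, which takes the maximum log-perplexity over fixed-length chunks of the prompt, catches adversarial suffixes appended to otherwise fluent text that would dilute a whole-prompt average.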
