Do you really follow me? Adversarial Instructions for Evaluating the Robustness of Large Language Models

08/17/2023
by   Zekun Li, et al.
0

Large Language Models (LLMs) have shown remarkable proficiency in following instructions, making them valuable in customer-facing applications. However, their impressive capabilities also raise concerns about the amplification of risks posed by adversarial instructions, which can be injected into the model input by third-party attackers to manipulate LLMs' original instructions and prompt unintended actions and content. Therefore, it is crucial to understand LLMs' ability to accurately discern which instructions to follow to ensure their safe deployment in real-world scenarios. In this paper, we propose a pioneering benchmark for automatically evaluating the robustness of LLMs against adversarial instructions. The objective of this benchmark is to quantify the extent to which LLMs are influenced by injected adversarial instructions and assess their ability to differentiate between these adversarial instructions and original user instructions. Through experiments conducted with state-of-the-art instruction-following LLMs, we uncover significant limitations in their robustness against adversarial instruction attacks. Furthermore, our findings indicate that prevalent instruction-tuned models are prone to being overfitted to follow any instruction phrase in the prompt without truly understanding which instructions should be followed. This highlights the need to address the challenge of training models to comprehend prompts instead of merely following instruction phrases and completing the text.

READ FULL TEXT

page 8

page 9

research
08/28/2023

Evaluating the Robustness to Instructions of Large Language Models

Recently, Instruction fine-tuning has risen to prominence as a potential...
research
07/20/2023

Instruction-following Evaluation through Verbalizer Manipulation

While instruction-tuned models have shown remarkable success in various ...
research
07/20/2023

LLM Censorship: A Machine Learning Challenge or a Computer Security Problem?

Large language models (LLMs) have exhibited impressive capabilities in c...
research
05/21/2018

A new dataset and model for learning to understand navigational instructions

In this paper, we present a state-of-the-art model and introduce a new d...
research
07/17/2023

Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models

Researchers have invested considerable effort into ensuring that large l...
research
04/25/2023

TABLET: Learning From Instructions For Tabular Data

Acquiring high-quality data is often a significant challenge in training...
research
04/27/2023

LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions

Large language models (LLMs) with instruction finetuning demonstrate sup...

Please sign up or login with your details

Forgot password? Click here to reset