GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher

08/12/2023
by   Youliang Yuan, et al.
0

Safety lies at the core of the development of Large Language Models (LLMs). There is ample work on aligning LLMs with human ethics and preferences, including data filtering in pretraining, supervised fine-tuning, reinforcement learning from human feedback, and red teaming, etc. In this study, we discover that chat in cipher can bypass the safety alignment techniques of LLMs, which are mainly conducted in natural languages. We propose a novel framework CipherChat to systematically examine the generalizability of safety alignment to non-natural languages – ciphers. CipherChat enables humans to chat with LLMs through cipher prompts topped with system role descriptions and few-shot enciphered demonstrations. We use CipherChat to assess state-of-the-art LLMs, including ChatGPT and GPT-4 for different representative human ciphers across 11 safety domains in both English and Chinese. Experimental results show that certain ciphers succeed almost 100 of GPT-4 in several safety domains, demonstrating the necessity of developing safety alignment for non-natural languages. Notably, we identify that LLMs seem to have a ”secret cipher”, and propose a novel SelfCipher that uses only role play and several demonstrations in natural language to evoke this capability. SelfCipher surprisingly outperforms existing human ciphers in almost all cases. Our code and data will be released at https://github.com/RobustNLP/CipherChat.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
09/13/2023

SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions

With the rapid development of Large Language Models (LLMs), increasing a...
research
05/23/2023

Aligning Large Language Models through Synthetic Feedback

Aligning large language models (LLMs) to human values has become increas...
research
07/19/2023

CValues: Measuring the Values of Chinese Large Language Models from Safety to Responsibility

With the rapid evolution of large language models (LLMs), there is a gro...
research
07/02/2022

Can Language Models Make Fun? A Case Study in Chinese Comical Crosstalk

Language is the principal tool for human communication, in which humor i...
research
10/04/2022

When to Make Exceptions: Exploring Language Models as Accounts of Human Moral Judgment

AI systems are becoming increasingly intertwined with human life. In ord...
research
02/27/2023

Safety without alignment

Currently, the dominant paradigm in AI safety is alignment with human va...

Please sign up or login with your details

Forgot password? Click here to reset