Image Hijacks: Adversarial Images can Control Generative Models at Runtime

09/01/2023
by Luke Bailey, et al.

Are foundation models secure from malicious actors? In this work, we focus on the image input to a vision-language model (VLM). We discover image hijacks, adversarial images that control generative models at runtime. We introduce Behaviour Matching, a general method for creating image hijacks, and we use it to explore three types of attacks. Specific string attacks generate arbitrary output of the adversary's choice. Leak context attacks leak information from the context window into the output. Jailbreak attacks circumvent a model's safety training. We study these attacks against LLaVA, a state-of-the-art VLM based on CLIP and LLaMA-2, and find that all our attack types achieve a success rate above 90%. Moreover, our attacks are automated and require only small image perturbations. These findings raise serious concerns about the security of foundation models. If image hijacks are as difficult to defend against as adversarial examples in CIFAR-10, then it might be many years before a solution is found – if it even exists.
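
To make the idea concrete, below is a minimal, hedged sketch of how a specific-string image hijack in the spirit of Behaviour Matching could be optimised with projected gradient descent. The `vlm` and `encode` interfaces are hypothetical placeholders (not the paper's released code), and the hyperparameters (epsilon, step size, step count) are illustrative only.

```python
# Minimal PGD-style sketch of a specific-string image hijack.
# Assumptions: `vlm(image, prompt_ids, target_ids)` returns teacher-forced
# logits of shape (T, vocab) for the target tokens, and `encode` is a
# tokenizer helper returning a LongTensor of token ids. Both are hypothetical.

import torch
import torch.nn.functional as F

def hijack_image(vlm, encode, image, prompt, target,
                 eps=8 / 255, step_size=1 / 255, steps=500):
    """Perturb `image` within an L-infinity ball so the frozen VLM is
    pushed to emit `target` when shown the image alongside `prompt`."""
    prompt_ids = encode(prompt)
    target_ids = encode(target)
    delta = torch.zeros_like(image, requires_grad=True)

    for _ in range(steps):
        # Behaviour matching objective: maximise the likelihood of the
        # target tokens under the frozen model, given the perturbed image.
        logits = vlm(image + delta, prompt_ids, target_ids)   # (T, vocab)
        loss = F.cross_entropy(logits, target_ids)
        loss.backward()

        with torch.no_grad():
            delta -= step_size * delta.grad.sign()             # signed-gradient step
            delta.clamp_(-eps, eps)                            # L-infinity budget
            delta.copy_(torch.clamp(image + delta, 0, 1) - image)  # keep pixels valid
        delta.grad.zero_()

    return (image + delta).detach()
```

Leak-context and jailbreak variants would presumably reuse the same optimisation loop, swapping in different target behaviours and training contexts for the loss.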

Related research

Man-in-the-Middle Attacks against Machine Learning Classifiers via Malicious Generative Models (10/14/2019)
Deep Neural Networks (DNNs) are vulnerable to deliberately crafted adver...

Type I Attack for Generative Models (03/04/2020)
Generative models are popular tools with a wide range of applications. N...

On the Adversarial Robustness of Multi-Modal Foundation Models (08/21/2023)
Multi-modal foundation models combining vision and language models such ...

The Robust Manifold Defense: Adversarial Training using Generative Models (12/26/2017)
Deep neural networks are demonstrating excellent performance on several ...

TMLab: Generative Enhanced Model (GEM) for adversarial attacks (10/01/2019)
We present our Generative Enhanced Model (GEM) that we used to create sa...

GAP++: Learning to generate target-conditioned adversarial examples (06/09/2020)
Adversarial examples are perturbed inputs which can cause a serious thre...

Label-Consistent Backdoor Attacks (12/05/2019)
Deep neural networks have been demonstrated to be vulnerable to backdoor...
