Plug and Pray: Exploiting off-the-shelf components of Multi-Modal Models

07/26/2023
by   Erfan Shayegani, et al.
0

The rapid growth and increasing popularity of incorporating additional modalities (e.g., vision) into large language models (LLMs) has raised significant security concerns. This expansion of modality, akin to adding more doors to a house, unintentionally creates multiple access points for adversarial attacks. In this paper, by introducing adversarial embedding space attacks, we emphasize the vulnerabilities present in multi-modal systems that originate from incorporating off-the-shelf components like public pre-trained encoders in a plug-and-play manner into these systems. In contrast to existing work, our approach does not require access to the multi-modal system's weights or parameters but instead relies on the huge under-explored embedding space of such pre-trained encoders. Our proposed embedding space attacks involve seeking input images that reside within the dangerous or targeted regions of the extensive embedding space of these pre-trained components. These crafted adversarial images pose two major threats: 'Context Contamination' and 'Hidden Prompt Injection'-both of which can compromise multi-modal models like LLaVA and fully change the behavior of the associated language model. Our findings emphasize the need for a comprehensive examination of the underlying components, particularly pre-trained encoders, before incorporating them into systems in a plug-and-play manner to ensure robust security.

READ FULL TEXT

page 12

page 13

page 14

page 16

page 17

page 18

page 19

page 22

research
08/22/2023

Ceci n'est pas une pomme: Adversarial Illusions in Multi-Modal Embeddings

Multi-modal encoders map images, sounds, texts, videos, etc. into a sing...
research
07/01/2023

ProbVLM: Probabilistic Adapter for Frozen Vison-Language Models

Large-scale vision-language models (VLMs) like CLIP successfully find co...
research
06/05/2023

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

We present Video-LLaMA, a multi-modal framework that empowers Large Lang...
research
03/08/2022

Multi-Modal Mixup for Robust Fine-tuning

Pre-trained large-scale models provide a transferable embedding, and the...
research
08/21/2023

On the Adversarial Robustness of Multi-Modal Foundation Models

Multi-modal foundation models combining vision and language models such ...
research
02/08/2023

Diagnosing and Rectifying Vision Models using Language

Recent multi-modal contrastive learning models have demonstrated the abi...
research
05/06/2021

Learning Neighborhood Representation from Multi-Modal Multi-Graph: Image, Text, Mobility Graph and Beyond

Recent urbanization has coincided with the enrichment of geotagged data,...

Please sign up or login with your details

Forgot password? Click here to reset