CLIPMasterPrints: Fooling Contrastive Language-Image Pre-training Using Latent Variable Evolution

07/07/2023
by Matthias Freiberger, et al.

Models leveraging both visual and textual data, such as Contrastive Language-Image Pre-training (CLIP), are increasingly gaining importance. In this work, we show that despite their versatility, such models are vulnerable to what we refer to as fooling master images. Fooling master images maximize the confidence score of a CLIP model for a significant number of widely varying prompts while remaining unrecognizable to humans. We demonstrate how such images can be mined by searching the latent space of a generative model by means of an evolution strategy or stochastic gradient descent. We investigate the properties of the mined fooling master images and find that images trained on a small number of image captions can generalize to a much larger number of semantically related captions. Further, we evaluate two possible mitigation strategies and find that vulnerability to fooling master images is closely tied to the modality gap in contrastively pre-trained multi-modal networks. From the perspective of vulnerability to off-manifold attacks, we therefore argue for mitigating the modality gap in CLIP and related multi-modal approaches. Source code and mined CLIPMasterPrints are available at https://github.com/matfrei/CLIPMasterPrints.
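The mining procedure described above amounts to optimizing a latent vector so that the decoded image scores highly against many CLIP text prompts at once. Below is a minimal, hedged sketch of the gradient-based variant, assuming the OpenAI `clip` package; `decode(z)` is a hypothetical placeholder for a differentiable generative decoder (the actual models used are documented in the linked repository), the latent dimensionality of 512 is an illustrative assumption, and the min-over-prompts objective is one plausible choice rather than necessarily the paper's exact loss.

```python
# Sketch: mine a fooling master image by gradient descent on a latent vector.
# Assumes the OpenAI `clip` package (https://github.com/openai/CLIP).
# `decode` is a HYPOTHETICAL placeholder for a differentiable generative
# decoder mapping a latent vector to a CLIP-preprocessed (N, 3, 224, 224)
# image batch; the latent size of 512 is likewise an illustrative assumption.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-L/14", device=device)
model = model.float()  # fp32 keeps gradients through the image encoder stable

# Captions whose CLIP scores the single fooling image should maximize jointly.
prompts = ["a photo of a dog", "a photo of a car", "a photo of a pizza"]
with torch.no_grad():
    text_features = model.encode_text(clip.tokenize(prompts).to(device))
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

z = torch.randn(1, 512, device=device, requires_grad=True)
optimizer = torch.optim.Adam([z], lr=0.05)

for step in range(1000):
    image = decode(z)  # placeholder: decoder of a pretrained generative model
    image_features = model.encode_image(image)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    sims = image_features @ text_features.T  # cosine similarity per prompt
    loss = -sims.min()  # raise the worst-scoring prompt (one possible objective)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

A gradient-free alternative, in the spirit of latent variable evolution, replaces backpropagation with an evolution strategy; the sketch below uses CMA-ES via the third-party `cma` package as an illustrative assumption and reuses the names defined above.

```python
# Gradient-free variant: evolve the latent with CMA-ES (pip install cma).
# Reuses `model`, `text_features`, `device`, and the placeholder `decode`.
import cma

def fitness(z_flat):
    z = torch.tensor(z_flat, dtype=torch.float32, device=device).unsqueeze(0)
    with torch.no_grad():
        feats = model.encode_image(decode(z))
        feats = feats / feats.norm(dim=-1, keepdim=True)
    return -(feats @ text_features.T).min().item()  # CMA-ES minimizes

es = cma.CMAEvolutionStrategy(512 * [0.0], 0.5)  # initial mean and step size
while not es.stop():
    candidates = es.ask()
    es.tell(candidates, [fitness(c) for c in candidates])
```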


