AdvCLIP: Downstream-agnostic Adversarial Examples in Multimodal Contrastive Learning

08/14/2023
by Ziqi Zhou, et al.

Multimodal contrastive learning aims to train a general-purpose feature extractor, such as CLIP, on vast amounts of raw, unlabeled paired image-text data. This can greatly benefit various complex downstream tasks, including cross-modal image-text retrieval and image classification. Despite this promise, the security of cross-modal pre-trained encoders has not been fully explored, especially when a pre-trained encoder is publicly available for commercial use. In this work, we propose AdvCLIP, the first attack framework for generating downstream-agnostic adversarial examples based on cross-modal pre-trained encoders. AdvCLIP aims to construct a universal adversarial patch for a set of natural images that can fool all downstream tasks inheriting the victim cross-modal pre-trained encoder. To address the challenges of heterogeneity between modalities and unknown downstream tasks, we first build a topological graph structure to capture the relative positions between target samples and their neighbors. We then design a topology-deviation-based generative adversarial network to generate a universal adversarial patch. By adding the patch to images, we minimize the similarity between their embeddings and those of the paired modality, and perturb the sample distribution in the feature space, achieving universal non-targeted attacks. Our results demonstrate the strong attack performance of AdvCLIP on two types of downstream tasks across eight datasets. We also tailor three popular defenses to mitigate AdvCLIP, highlighting the need for new defense mechanisms to protect cross-modal pre-trained encoders.
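Below is a minimal, hedged sketch of the attack objective described above, assuming OpenAI's public CLIP package as the victim encoder. It is not the paper's implementation: AdvCLIP generates the patch with a GAN-based generator, whereas this sketch uses plain gradient descent, and the patch size, placement, loss weighting (`lam`), and helper names (`apply_patch`, `similarity_graph`, `attack_step`) are illustrative assumptions.

```python
# Hedged sketch of AdvCLIP's objective: optimize one universal patch so
# that, pasted onto any image, it (a) pushes the image embedding away
# from its paired text embedding and (b) distorts the neighborhood graph
# of image embeddings. Plain gradient descent stands in for the paper's
# GAN generator; pixel-range handling and patch placement are simplified.
# Requires OpenAI's CLIP package: pip install git+https://github.com/openai/CLIP
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model.float().eval()                  # frozen victim encoder
for p in model.parameters():
    p.requires_grad_(False)           # only the patch is optimized

PATCH = 32                            # hypothetical patch side length (pixels)
patch = torch.zeros(3, PATCH, PATCH, device=device, requires_grad=True)
optimizer = torch.optim.Adam([patch], lr=1e-2)

def apply_patch(images):
    """Paste the universal patch into the top-left corner of each image.
    (Placement and blending are assumptions for this sketch; inputs are
    expected to be CLIP-preprocessed batches of shape (B, 3, 224, 224).)"""
    out = images.clone()
    out[:, :, :PATCH, :PATCH] = patch
    return out

def similarity_graph(emb):
    """Dense cosine-similarity 'topology' over a batch of embeddings."""
    emb = F.normalize(emb, dim=-1)
    return emb @ emb.t()

def attack_step(images, texts, lam=1.0):
    """One non-targeted update of the universal patch."""
    images = images.to(device)
    tokens = clip.tokenize(texts).to(device)
    with torch.no_grad():
        clean_img = model.encode_image(images)
        txt = F.normalize(model.encode_text(tokens), dim=-1)
    adv_img = model.encode_image(apply_patch(images))
    adv = F.normalize(adv_img, dim=-1)
    # (a) cross-modal term: minimize matched image-text similarity
    xmodal = (adv * txt).sum(dim=-1).mean()
    # (b) topology term: maximize deviation from the clean graph
    topo = -F.mse_loss(similarity_graph(adv_img), similarity_graph(clean_img))
    loss = xmodal + lam * topo
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the victim encoder stays frozen and the loss is defined purely in its feature space, a patch optimized this way is downstream-agnostic: any classifier or retrieval head built on top of those embeddings inherits the perturbed geometry.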

Related research

07/23/2023 · Downstream-agnostic Adversarial Examples
Self-supervised learning usually uses a large amount of unlabeled data t...

08/01/2021 · BadEncoder: Backdoor Attacks to Pre-trained Encoders in Self-Supervised Learning
Self-supervised learning in computer vision aims to pre-train an image e...

08/26/2023 · Video and Audio are Images: A Cross-Modal Mixer for Original Data on Video-Audio Retrieval
Cross-modal retrieval has become popular in recent years, particularly w...

08/25/2021 · EncoderMI: Membership Inference against Pre-trained Encoders in Contrastive Learning
Given a set of unlabeled images or (image, text) pairs, contrastive lear...

10/12/2022 · One does not fit all! On the Complementarity of Vision Encoders for Vision and Language Tasks
Current multimodal models, aimed at solving Vision and Language (V+L) ta...

12/14/2021 · ACE-BERT: Adversarial Cross-modal Enhanced BERT for E-commerce Retrieval
Nowadays on E-commerce platforms, products are presented to the customer...

09/05/2023 · SeisCLIP: A seismology foundation model pre-trained by multi-modal data for multi-purpose seismic feature extraction
Training specific deep learning models for particular tasks is common ac...
