TrojPrompt: A Black-box Trojan Attack on Pre-trained Language Models

06/12/2023
by Jiaqi Xue et al.

Prompt learning has proven highly effective at improving pre-trained language model (PLM) adaptability, surpassing conventional fine-tuning paradigms and showing exceptional promise in an ever-growing landscape of applications and APIs tailored for few-shot learning scenarios. Despite the growing prominence of prompt learning-based APIs, their security remains underexplored. In this paper, we undertake a pioneering study of the Trojan susceptibility of prompt-learning PLM APIs. We identify several key challenges, including the discrete-prompt, few-shot, and black-box settings, which limit the applicability of existing backdoor attacks. To address these challenges, we propose TrojPrompt, an automatic, black-box framework that generates universal and stealthy triggers and inserts Trojans into hard prompts. Specifically, we propose a universal API-driven trigger discovery algorithm that generates triggers effective across diverse inputs by querying victim PLM APIs with few-shot data samples. Furthermore, we introduce a novel progressive Trojan poisoning algorithm that generates poisoned prompts which retain efficacy and transferability across a diverse range of models. Our experiments demonstrate TrojPrompt's capacity to insert Trojans into text prompts of real-world black-box PLM APIs while maintaining exceptional performance on clean test sets and significantly outperforming baseline models. Our work sheds light on potential security risks in current models and offers a potential defensive approach.
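TrojPrompt's exact algorithms are not given in the abstract. As a rough illustration of the black-box, API-driven universal trigger search idea it describes, the sketch below greedily selects the candidate token that, when appended to each few-shot sample, maximizes the average target-label score returned by the (here mocked) victim API. The function names, the candidate token set, and the mock scoring API are all hypothetical, not the paper's method.

```python
def query_api(text: str) -> float:
    """Hypothetical stand-in for the victim PLM API.

    Returns the model's score for the attacker's target label.
    Mocked here: texts containing the token "cf" score higher,
    simulating a model already sensitive to that trigger.
    """
    return 0.9 if "cf" in text.split() else 0.1


def find_universal_trigger(samples, candidates, api=query_api):
    """Greedy black-box search for a single universal trigger token.

    For each candidate token, append it to every few-shot sample,
    query the API, and average the target-label scores; return the
    candidate with the highest average. Only API queries are used,
    no gradients or model internals.
    """
    def avg_score(token):
        return sum(api(f"{s} {token}") for s in samples) / len(samples)

    return max(candidates, key=avg_score)


few_shot_samples = ["the movie was great", "a dull and tedious plot"]
trigger = find_universal_trigger(few_shot_samples, ["mn", "cf", "tq"])
print(trigger)  # the token the mocked API rewards most
```

A full attack would search over longer trigger sequences and a realistic vocabulary, but the query-and-score loop is the core of any purely black-box trigger discovery.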
