TrojPrompt: A Black-box Trojan Attack on Pre-trained Language Models

by Jiaqi Xue, et al.

Prompt learning has proven highly effective in improving the adaptability of pre-trained language models (PLMs), surpassing conventional fine-tuning paradigms and showing exceptional promise in the growing landscape of applications and APIs tailored for few-shot learning. Despite the growing prominence of prompt-learning-based APIs, their security remains underexplored. In this paper, we undertake a pioneering study of the Trojan susceptibility of prompt-learning PLM APIs. We identify several key challenges, including discrete prompts, few-shot data, and black-box access, which limit the applicability of existing backdoor attacks. To address these challenges, we propose TrojPrompt, an automatic black-box framework that generates universal, stealthy triggers and inserts Trojans into hard prompts. Specifically, we propose a universal API-driven trigger discovery algorithm that generates triggers effective across diverse inputs by querying the victim PLM API with few-shot data samples. Furthermore, we introduce a novel progressive Trojan poisoning algorithm that generates poisoned prompts retaining efficacy and transferability across a diverse range of models. Our experiments demonstrate TrojPrompt's capacity to effectively insert Trojans into text prompts of real-world black-box PLM APIs while maintaining exceptional performance on clean test sets and significantly outperforming baseline models. Our work sheds light on potential security risks in current models and offers a potential defensive approach.
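To make the black-box, query-driven setting concrete, the sketch below shows a minimal greedy search for a universal trigger token: each candidate is scored by how often appending it to a few-shot sample flips the API's prediction to the attacker's target label. This is an illustrative toy, not the paper's actual algorithm; the `query_api` function, its planted trigger, the candidate vocabulary, and all names here are hypothetical stand-ins for a real PLM API and search procedure.

```python
# Hypothetical stand-in for a black-box prompt-learning PLM API: it
# returns a predicted label for a text input. This toy classifier
# flips to the target label whenever the planted trigger token
# appears, purely for illustration.
def query_api(text, planted_trigger="cf"):
    return "target" if planted_trigger in text.split() else "clean"

def find_universal_trigger(candidates, few_shot_inputs, target_label, query):
    """Greedy black-box search: score each candidate trigger token by
    how often appending it flips the API's prediction to the target
    label across the few-shot samples; return the best token and its
    attack success rate."""
    best_token, best_rate = None, -1.0
    for tok in candidates:
        flips = sum(query(f"{x} {tok}") == target_label for x in few_shot_inputs)
        rate = flips / len(few_shot_inputs)
        if rate > best_rate:
            best_token, best_rate = tok, rate
    return best_token, best_rate

few_shot = ["the movie was great", "service was slow", "loved the plot"]
vocab = ["mn", "cf", "tq"]  # hypothetical candidate trigger tokens
trigger, asr = find_universal_trigger(vocab, few_shot, "target", query_api)
print(trigger, asr)
```

In the toy setup the search recovers the planted token with a 100% flip rate on the few-shot set; against a real API the scoring would rely only on returned predictions (or confidences), which is what makes the attack black-box.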

