Weight Poisoning Attacks on Pre-trained Models

04/14/2020
by Keita Kurita, et al.

Recently, NLP has seen a surge in the use of large pre-trained models. Users download the weights of models pre-trained on large datasets, then fine-tune those weights on a task of their choice. This raises the question of whether downloading untrusted pre-trained weights can pose a security threat. In this paper, we show that it is possible to construct “weight poisoning” attacks in which pre-trained weights are injected with vulnerabilities that expose “backdoors” after fine-tuning, enabling the attacker to manipulate model predictions simply by injecting an arbitrary keyword. We show that by applying a regularization method, which we call RIPPLe, and an initialization procedure, which we call Embedding Surgery, such attacks are possible even with limited knowledge of the dataset and fine-tuning procedure. Our experiments on sentiment classification, toxicity detection, and spam detection show that this attack is widely applicable and poses a serious threat. Finally, we outline practical defenses against such attacks. Code to reproduce our experiments is available at https://github.com/neulab/RIPPLe.
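
To make the two named components concrete, here is a minimal PyTorch sketch, not the authors' implementation (see the linked repository for that). It assumes RIPPLe penalizes poisoning-gradient directions that conflict with the fine-tuning gradient, and that Embedding Surgery initializes the trigger token's embedding from words associated with the attacker's target class; all function and variable names (`ripple_objective`, `embedding_surgery`, `lam`, `proxy_ids`) are hypothetical.

```python
# Hedged sketch of the two attack components; names are hypothetical.
# See https://github.com/neulab/RIPPLe for the authors' actual code.
import torch


def ripple_objective(model, poison_loss, finetune_loss, lam=0.1):
    """Regularized poisoning loss: L_P + lam * max(0, -<grad L_P, grad L_FT>).

    The penalty fires when the poisoning gradient points against the
    fine-tuning gradient, steering the attack toward backdoors that
    survive the victim's subsequent fine-tuning.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    # create_graph=True keeps the penalty differentiable w.r.t. the weights.
    g_p = torch.autograd.grad(poison_loss, params, create_graph=True)
    g_ft = torch.autograd.grad(finetune_loss, params, create_graph=True)
    inner = sum((gp * gf).sum() for gp, gf in zip(g_p, g_ft))
    return poison_loss + lam * torch.clamp(-inner, min=0.0)


def embedding_surgery(embedding, trigger_id, proxy_ids):
    """Initialize the trigger token's embedding as the mean of the
    embeddings of words strongly associated with the target class."""
    with torch.no_grad():
        embedding.weight[trigger_id] = embedding.weight[proxy_ids].mean(dim=0)
```

In a poisoning loop, one would compute `poison_loss` on trigger-injected examples, compute `finetune_loss` on a proxy of the victim's task data (the attacker is assumed not to know the true fine-tuning set), and backpropagate through the returned objective.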

Related research

08/31/2021 · Backdoor Attacks on Pre-trained Models by Layerwise Weight Poisoning
Pre-Trained Models have been widely applied and recently proved vulnerab...

01/18/2021 · Red Alarm for Pre-trained Models: Universal Vulnerabilities by Neuron-Level Backdoor Attacks
Due to the success of pre-trained models (PTMs), people usually fine-tun...

02/18/2023 · Backdoor Attacks to Pre-trained Unified Foundation Models
The rise of pre-trained unified foundation models breaks down the barrie...

01/10/2020 · Backdoor Attacks against Transfer Learning with Pre-trained Deep Learning Models
Transfer learning, which transfers the learned knowledge of pre-trained Te...

10/06/2019 · Transforming the output of GANs by fine-tuning them with features from different datasets
In this work we present a method for fine-tuning pre-trained GANs with f...

06/08/2023 · Trojan Model Detection Using Activation Optimization
Due to data's unavailability or large size, and the high computational a...

04/13/2021 · Thief, Beware of What Get You There: Towards Understanding Model Extraction Attack
Model extraction increasingly attracts research attentions as keeping co...
