DeepAI AI Chat
Log In Sign Up

Defending against Backdoor Attacks in Natural Language Generation

by   Chun Fan, et al.

The frustratingly fragile nature of neural network models make current natural language generation (NLG) systems prone to backdoor attacks and generate malicious sequences that could be sexist or offensive. Unfortunately, little effort has been invested to how backdoor attacks can affect current NLG models and how to defend against these attacks. In this work, we investigate this problem on two important NLG tasks, machine translation and dialogue generation. By giving a formal definition for backdoor attack and defense, and developing corresponding benchmarks, we design methods to attack NLG models, which achieve high attack success to ask NLG models to generate malicious sequences. To defend against these attacks, we propose to detect the attack trigger by examining the effect of deleting or replacing certain words on the generation outputs, which we find successful for certain types of attacks. We will discuss the limitation of this work, and hope this work can raise the awareness of backdoor risks concealed in deep NLG systems. (Code and data are available at


page 1

page 2

page 3

page 4


TextAttack: A Framework for Adversarial Attacks in Natural Language Processing

TextAttack is a library for running adversarial attacks against natural ...

Ignore Previous Prompt: Attack Techniques For Language Models

Transformer-based large language models (LLMs) provide a powerful founda...

Label-Smoothed Backdoor Attack

By injecting a small number of poisoned samples into the training set, b...

Hidden Killer: Invisible Textual Backdoor Attacks with Syntactic Trigger

Backdoor attacks are a kind of insidious security threat against machine...

TrojText: Test-time Invisible Textual Trojan Insertion

In Natural Language Processing (NLP), intelligent neuron models can be s...

Input-Aware Dynamic Backdoor Attack

In recent years, neural backdoor attack has been considered to be a pote...

Practical Cross-System Shilling Attacks with Limited Access to Data

In shilling attacks, an adversarial party injects a few fake user profil...