Character-level White-Box Adversarial Attacks against Transformers via Attachable Subwords Substitution

10/31/2022
by   Aiwei Liu, et al.

We propose the first character-level white-box adversarial attack against transformer models. The intuition behind our method comes from the observation that words are split into subtokens before being fed into transformer models, and that substituting one subtoken for a close one has an effect similar to a character-level modification. Our method consists of three steps. First, a gradient-based method identifies the most vulnerable words in the sentence. Next, we split the selected words into subtokens, replacing the original tokenization produced by the transformer's tokenizer. Finally, we use an adversarial loss to guide the substitution of attachable subtokens, introducing the Gumbel-softmax trick to keep gradients propagating through the discrete choice. We also impose visual and length constraints during optimization to keep character modifications to a minimum. Extensive experiments on both sentence-level and token-level tasks demonstrate that our method outperforms previous attack methods in success rate and edit distance. Furthermore, human evaluation verifies that our adversarial examples preserve their original labels.
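The Gumbel-softmax step mentioned in the abstract can be sketched as follows. This is a minimal forward-pass illustration, not the authors' implementation: the candidate subtokens and their logits are hypothetical, and in an autograd framework (e.g. PyTorch) the same computation would remain differentiable with respect to the logits, which is what allows the adversarial loss to guide the substitution.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Relaxed (soft one-hot) sample over a set of candidate subtokens.

    As tau -> 0 the output approaches a hard one-hot selection; at larger
    tau the choice stays smooth, which keeps gradients usable during the
    attack's optimization.
    """
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(1e-9, 1.0, size=logits.shape)
    g = -np.log(-np.log(u))            # Gumbel(0, 1) noise
    y = (logits + g) / tau             # perturbed, temperature-scaled logits
    y = np.exp(y - y.max())            # numerically stable softmax
    return y / y.sum()

# Hypothetical candidate subtokens for one attack position, e.g. visually
# close variants of the subword "##tion" under the visual constraint.
candidates = ["##tion", "##ti0n", "##t1on", "##tlon"]
logits = np.array([2.0, 0.5, 0.3, 0.1])   # e.g. produced by the adversarial loss
weights = gumbel_softmax(logits, tau=0.5)  # soft one-hot over the candidates
```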


Related research

04/15/2021  Gradient-based Adversarial Attacks against Text Transformers
We propose the first general-purpose gradient-based attack against trans...

12/15/2020  FAWA: Fast Adversarial Watermark Attack on Optical Character Recognition (OCR) Systems
Deep neural networks (DNNs) significantly improved the accuracy of optic...

03/11/2022  Block-Sparse Adversarial Attack to Fool Transformer-Based Text Classifiers
Recently, it has been shown that, in spite of the significant performanc...

12/19/2017  HotFlip: White-Box Adversarial Examples for NLP
Adversarial examples expose vulnerabilities of machine learning models. ...

02/11/2022  White-Box Attacks on Hate-speech BERT Classifiers in German with Explicit and Implicit Character Level Defense
In this work, we evaluate the adversarial robustness of BERT models trai...

05/05/2023  White-Box Multi-Objective Adversarial Attack on Dialogue Generation
Pre-trained transformers are popular in state-of-the-art dialogue genera...

08/23/2021  Semantic-Preserving Adversarial Text Attacks
Deep neural networks (DNNs) are known to be vulnerable to adversarial im...
