Textual Manifold-based Defense Against Natural Language Adversarial Examples

11/05/2022
by Dang-Minh Nguyen, et al.

Recent studies have shown that adversarial images tend to leave the underlying low-dimensional data manifold, making it significantly harder for current models to classify them correctly. This so-called off-manifold conjecture has inspired a novel line of defenses against adversarial attacks on images. In this study, we find that a similar phenomenon occurs in the contextualized embedding space induced by pretrained language models, in which adversarial texts tend to have their embeddings diverge from the manifold of natural ones. Based on this finding, we propose Textual Manifold-based Defense (TMD), a defense mechanism that projects text embeddings onto an approximated embedding manifold before classification. This projection reduces the complexity of potential adversarial examples, which ultimately enhances the robustness of the protected model. In extensive experiments, our method consistently and significantly outperforms previous defenses under various attack settings without trading off clean accuracy. To the best of our knowledge, this is the first NLP defense that leverages the manifold structure against adversarial attacks. Our code is available at <https://github.com/dangne/tmd>.
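
To make the projection step concrete, below is a minimal sketch of how embeddings could be projected onto an approximated manifold before classification. It is not the authors' implementation: the generative model, its training, and all names here (`generator`, `classifier_head`, `latent_dim`, the optimization hyperparameters) are illustrative assumptions; see the linked repository for the actual method.

```python
# Minimal sketch of manifold projection in the spirit of TMD (not the authors' code).
# Assumptions (hypothetical): `generator` maps latent codes to the contextual
# embedding space and has been trained on embeddings of natural text;
# `classifier_head` is the downstream classifier that would otherwise consume
# the sentence embedding directly.
import torch


def project_onto_manifold(embedding, generator, latent_dim=64, steps=100, lr=0.05):
    """Approximate the nearest on-manifold point to `embedding` by searching
    the generator's latent space with gradient descent."""
    z = torch.randn(1, latent_dim, requires_grad=True)
    optimizer = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        reconstruction = generator(z)                       # candidate on-manifold embedding
        loss = torch.norm(reconstruction - embedding, p=2)  # distance to the input embedding
        loss.backward()
        optimizer.step()
    return generator(z).detach()


# Usage sketch: route the embedding through the projection before classification.
# cls_embedding = encoder(input_ids).last_hidden_state[:, 0]   # e.g. a BERT [CLS] vector
# projected = project_onto_manifold(cls_embedding, generator)
# logits = classifier_head(projected)
```

The intuition matching the abstract: an adversarial input whose embedding has drifted off the manifold of natural text is pulled back to the closest point the generator can produce, so the classifier only ever sees (approximately) on-manifold embeddings.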


