Detecting Textual Adversarial Examples Based on Distributional Characteristics of Data Representations

04/29/2022
by Na Liu, et al.

Although deep neural networks have achieved state-of-the-art performance in various machine learning tasks, adversarial examples, constructed by adding small non-random perturbations to correctly classified inputs, successfully fool highly expressive deep classifiers into incorrect predictions. Approaches to adversarial attacks on natural language tasks have boomed in the last five years, using character-level, word-level, phrase-level, or sentence-level textual perturbations. While there is some work in NLP on defending against such attacks through proactive methods, like adversarial training, there is, to our knowledge, no effective general reactive approach to defence via detection of textual adversarial examples, such as is found in the image processing literature. In this paper, we propose two new reactive methods for NLP to fill this gap, which, unlike the few limited-application baselines from NLP, are based entirely on distributional characteristics of learned representations: we adapt one from the image processing literature (Local Intrinsic Dimensionality (LID)) and propose a novel one (MultiDistance Representation Ensemble Method (MDRE)). Adapted LID and MDRE obtain state-of-the-art results on character-level, word-level, and phrase-level attacks on the IMDB dataset, as well as on the latter two attack types on the MultiNLI dataset. To support future research, we publish our code.
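The abstract names Local Intrinsic Dimensionality (LID) as the detection score adapted from the image processing literature. As a rough illustration only (not the paper's implementation), the standard maximum-likelihood LID estimator computes the score from a point's k nearest-neighbour distances within a reference batch of representations; the function name `lid_mle`, the random data, and the choice k=20 below are illustrative assumptions:

```python
import numpy as np

def lid_mle(query, batch, k=20):
    """Maximum-likelihood estimate of Local Intrinsic Dimensionality
    of `query` with respect to the representations in `batch`.
    Assumes `query` is not itself a row of `batch`."""
    # Euclidean distances from the query to every reference point
    dists = np.linalg.norm(batch - query, axis=1)
    # k smallest distances, sorted ascending
    knn = np.sort(dists)[:k]
    r_k = knn[-1]  # distance to the k-th neighbour
    # LID_hat = -( (1/k) * sum_i log(r_i / r_k) )^{-1}
    return -1.0 / np.mean(np.log(knn / r_k + 1e-12))

# Illustrative usage: representations drawn from a 10-dimensional Gaussian,
# so the estimate should land in the vicinity of the true dimensionality.
rng = np.random.RandomState(0)
batch = rng.randn(500, 10)
query = rng.randn(10)
print(lid_mle(query, batch))
```

Adversarial-example detectors built on LID typically compute such estimates at several network layers and feed them to a simple classifier; the sketch above shows only the per-layer score.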

