Interpretable Adversarial Training for Text

05/30/2019
by Samuel Barham, et al.

Generating high-quality and interpretable adversarial examples in the text domain is a much more daunting task than it is in the image domain. This is due partly to the discrete nature of text, partly to the problem of ensuring that the adversarial examples are still probable and interpretable, and partly to the problem of maintaining label invariance under input perturbations. In order to address some of these challenges, we introduce sparse projected gradient descent (SPGD), a new approach to crafting interpretable adversarial examples for text. SPGD imposes a directional regularization constraint on input perturbations by projecting them onto the directions to nearby word embeddings with highest cosine similarities. This constraint ensures that perturbations move each word embedding in an interpretable direction (i.e., towards another nearby word embedding). Moreover, SPGD imposes a sparsity constraint on perturbations at the sentence level by ignoring word-embedding perturbations whose norms are below a certain threshold. This constraint ensures that our method changes only a few words per sequence, leading to higher quality adversarial examples. Our experiments with the IMDB movie review dataset show that the proposed SPGD method improves adversarial example interpretability and likelihood (evaluated by average per-word perplexity) compared to state-of-the-art methods, while suffering little to no loss in training performance.
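The abstract describes SPGD's perturbation step in enough detail to sketch it. Below is a minimal, illustrative NumPy sketch, assuming per-word gradients of the loss with respect to the input embeddings are already available; the function name, the candidate count `k`, the threshold `tau`, and the simplification of projecting onto the single most gradient-aligned neighbor direction (rather than a set of directions) are assumptions for illustration, not the authors' reference implementation.

```python
# Illustrative sketch of the SPGD perturbation step (not the paper's code).
# Assumed parameters: k (candidate neighbor directions), tau (sparsity threshold).
import numpy as np

def spgd_perturbation(grads, embeds, vocab, k=10, tau=0.5):
    """Project per-word gradient perturbations onto directions toward
    other word embeddings, then zero out small perturbations.

    grads:  (seq_len, dim) loss gradients w.r.t. each input word embedding
    embeds: (seq_len, dim) embeddings of the current input sequence
    vocab:  (V, dim) full word-embedding matrix
    """
    perturbs = np.zeros_like(embeds, dtype=float)
    for i, (g, e) in enumerate(zip(grads, embeds)):
        g_norm = np.linalg.norm(g)
        if g_norm == 0:
            continue
        # Directions from the current embedding to every vocabulary word.
        # (The paper restricts attention to *nearby* embeddings; scanning
        # the full vocabulary here is a simplification.)
        dirs = vocab - e                          # (V, dim)
        norms = np.linalg.norm(dirs, axis=1)
        norms[norms == 0] = np.inf                # skip the word itself
        unit = dirs / norms[:, None]
        # Cosine similarity between the gradient and each direction.
        cos = unit @ (g / g_norm)                 # (V,)
        # Among the k most gradient-aligned directions, project the
        # gradient onto the best one: proj_u(g) = (g . u) u.
        top = np.argsort(cos)[-k:]
        best = top[np.argmax(cos[top])]
        perturbs[i] = (cos[best] * g_norm) * unit[best]
    # Sentence-level sparsity: drop word perturbations with small norms,
    # so only a few words per sequence are actually changed.
    mags = np.linalg.norm(perturbs, axis=1)
    perturbs[mags < tau] = 0.0
    return perturbs
```

In a training loop, `grads` would come from backpropagating the classification loss to the embedding layer, and the returned perturbations would be added to the embeddings before the adversarial forward pass; the directional projection keeps each perturbed embedding interpretable as movement toward a real word, while the threshold `tau` enforces the sentence-level sparsity the abstract describes.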

