On Evaluation of Adversarial Perturbations for Sequence-to-Sequence Models

03/15/2019
by Paul Michel, et al.

Adversarial examples, i.e., perturbations to the input of a model that elicit large changes in the output, have been shown to be an effective way of assessing the robustness of sequence-to-sequence (seq2seq) models. However, these perturbations only indicate weaknesses in the model if they do not change the input so significantly that it legitimately results in changes in the expected output. This fact has largely been ignored in the evaluations of the growing body of related literature. Using the example of untargeted attacks on machine translation (MT), we propose a new evaluation framework for adversarial attacks on seq2seq models that takes the semantic equivalence of the pre- and post-perturbation input into account. Using this framework, we demonstrate that existing methods may not preserve meaning in general, breaking the aforementioned assumption that source-side perturbations should not result in changes in the expected output. We further use this framework to demonstrate that adding additional constraints on attacks allows for adversarial perturbations that are more meaning-preserving, but nonetheless largely change the output sequence. Finally, we show that performing untargeted adversarial training with meaning-preserving attacks is beneficial to the model in terms of adversarial robustness, without hurting test performance. A toolkit implementing our evaluation framework is released at https://github.com/pmichel31415/teapot-nlp.
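To make the evaluation idea concrete, here is a minimal sketch of one plausible instantiation: an untargeted adversarial example counts as successful only if the drop in target-side translation quality exceeds the loss in source-side meaning. The similarity function below (a character-level string matcher) and the exact success criterion (s_src + d_tgt > 1) are simplifying assumptions made for illustration; the released teapot-nlp toolkit uses proper MT-style similarity metrics and should be consulted for the paper's actual scoring.

```python
# Illustrative sketch only (not the released toolkit): scoring a single
# untargeted adversarial example for MT under a meaning-preservation constraint.
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Crude stand-in for a chrF-style similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def relative_target_drop(reference: str, base_hyp: str, adv_hyp: str) -> float:
    """Relative degradation of translation quality caused by the attack."""
    base = similarity(reference, base_hyp)
    adv = similarity(reference, adv_hyp)
    return 0.0 if base == 0 else max(0.0, (base - adv) / base)


def is_successful_attack(src: str, adv_src: str,
                         reference: str, base_hyp: str, adv_hyp: str) -> bool:
    """Assumed criterion: the output must degrade more than the input meaning does."""
    s_src = similarity(src, adv_src)                        # source meaning retained
    d_tgt = relative_target_drop(reference, base_hyp, adv_hyp)  # output quality lost
    return s_src + d_tgt > 1.0


if __name__ == "__main__":
    # Toy example with made-up sentences.
    src = "The cat sat on the mat ."
    adv_src = "The cat sat on the mats ."            # small source perturbation
    ref = "Le chat était assis sur le tapis ."
    base_hyp = "Le chat était assis sur le tapis ."
    adv_hyp = "Les chats mangent ."                   # large change in the output
    print(is_successful_attack(src, adv_src, ref, base_hyp, adv_hyp))
```

In practice such a criterion is applied over a whole test corpus, so that an attack is judged both by how much it damages the model's outputs and by how little it alters the meaning of the inputs.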

