
Learning to Ignore Adversarial Attacks

by Yiming Zhang, et al.
The University of Chicago

Despite their strong performance, current NLP models can be brittle against adversarial attacks. To enable effective learning against adversarial inputs, we introduce the use of rationale models that can explicitly learn to ignore attack tokens. We find that the rationale models can successfully ignore over 90% of attack tokens. This approach leads to consistent, sizable improvements (∼10%) over baseline models in robustness on three datasets for both BERT and RoBERTa, and also reliably outperforms data augmentation with adversarial examples alone. In many cases, we find that our method is able to close the gap between model performance on a clean test set and an attacked test set, and hence reduce the effect of adversarial attacks.
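The core idea of ignoring attack tokens can be illustrated with a minimal sketch (all names and scores here are hypothetical, not the paper's implementation): a learned rationale extractor assigns each token a relevance score, and tokens scoring below a threshold are masked out before the input reaches the classifier, so appended adversarial trigger tokens are simply dropped.

```python
def apply_rationale_mask(tokens, scores, threshold=0.5):
    """Keep tokens the rationale extractor scores as task-relevant;
    drop (ignore) the rest, e.g. appended adversarial trigger tokens."""
    return [tok for tok, s in zip(tokens, scores) if s >= threshold]

# Hypothetical example: "zoning tapping" stands in for an appended
# attack trigger; the scores stand in for a learned extractor's output.
tokens = ["the", "movie", "was", "great", "zoning", "tapping"]
scores = [0.90, 0.95, 0.80, 0.99, 0.10, 0.05]
print(apply_rationale_mask(tokens, scores))
# The classifier then sees only the surviving tokens.
```

In practice the extractor and classifier are trained jointly so the mask is learned rather than thresholded by hand; the sketch only shows the masking step itself.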
