Can Rationalization Improve Robustness?

04/25/2022
by Howard Chen et al.

A growing line of work has investigated the development of neural NLP models that can produce rationales, subsets of the input that explain the model's predictions. In this paper, we ask whether such rationale models can also provide robustness to adversarial attacks in addition to their interpretability. Since these models must first generate rationales ("rationalizer") before making predictions ("predictor"), they have the potential to ignore noise or adversarially added text by simply masking it out of the generated rationale. To this end, we systematically generate various types of 'AddText' attacks for both token-level and sentence-level rationalization tasks, and perform an extensive empirical evaluation of state-of-the-art rationale models across five different tasks. Our experiments reveal that rationale models show promise in improving robustness, but they struggle in certain scenarios, such as when the rationalizer is sensitive to positional bias or to the lexical choices of the attack text. Further, leveraging human rationales as supervision does not always translate into better performance. Our study is a first step towards exploring the interplay between interpretability and robustness in the rationalize-then-predict framework.
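The rationalize-then-predict idea can be illustrated with a small sketch. The sentiment lexicons, function names, and distractor sentence below are illustrative assumptions rather than the authors' implementation: a toy rationalizer keeps only task-relevant tokens, a toy predictor reads nothing but that rationale, and an appended AddText-style distractor is therefore masked out before prediction.

    # Illustrative sketch only: a toy rationalize-then-predict pipeline.
    # The lexicons, scoring rule, and distractor sentence are assumptions
    # made for demonstration; they are not taken from the paper.

    POSITIVE = {"loved", "excellent", "great", "wonderful"}
    NEGATIVE = {"hated", "awful", "terrible", "boring"}

    def rationalize(tokens):
        # Rationalizer: binary mask over the input, keeping tokens it deems task-relevant.
        return [tok.lower() in POSITIVE | NEGATIVE for tok in tokens]

    def predict(tokens, mask):
        # Predictor: sees only the tokens selected by the rationalizer.
        kept = [tok.lower() for tok, keep in zip(tokens, mask) if keep]
        score = sum(t in POSITIVE for t in kept) - sum(t in NEGATIVE for t in kept)
        return "positive" if score >= 0 else "negative"

    def add_text_attack(text):
        # AddText-style attack: append label-preserving distractor text to the input.
        return text + " . the trailer before the film ran far too long"

    if __name__ == "__main__":
        clean = "I loved this film and the acting was excellent"
        for text in (clean, add_text_attack(clean)):
            tokens = text.split()
            mask = rationalize(tokens)
            rationale = [tok for tok, keep in zip(tokens, mask) if keep]
            print(predict(tokens, mask), rationale)
        # Both inputs yield "positive" with rationale ['loved', 'excellent']:
        # the distractor tokens never reach the predictor because the mask drops them.

Conversely, if the attack text happens to contain words the rationalizer scores highly (for example, sentiment-laden distractors in this toy setting), it enters the rationale and can change the prediction, mirroring the failure mode the abstract attributes to the rationalizer's sensitivity to the lexical choices of the attack text.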

