Unifying Model Explainability and Robustness for Joint Text Classification and Rationale Extraction

12/20/2021
by Dongfang Li, et al.

Recent work has shown that explainability and robustness are two crucial ingredients of trustworthy and reliable text classification. However, previous work usually addresses only one of the two aspects: i) how to extract accurate rationales for explainability while also benefiting prediction; ii) how to make the predictive model robust to different types of adversarial attacks. Intuitively, a model that produces helpful explanations should be more robust against adversarial attacks, because we cannot trust a model that outputs explanations but changes its prediction under small perturbations. To this end, we propose a joint classification and rationale extraction model named AT-BMC. It includes two key mechanisms: mixed Adversarial Training (AT), which uses various perturbations in discrete and embedding space to improve the model's robustness, and Boundary Match Constraint (BMC), which helps locate rationales more precisely with the guidance of boundary information. Results on benchmark datasets demonstrate that the proposed AT-BMC outperforms baselines on both classification and rationale extraction by a large margin. Robustness analysis shows that AT-BMC effectively decreases the attack success rate by up to 69%. The empirical results indicate that there are connections between robust models and better explanations.
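To make "perturbations in embedding space" concrete, the following is a minimal sketch of a generic gradient-based embedding perturbation in the style of the fast gradient method (FGM): the perturbation is the loss gradient with respect to the input embedding, rescaled to an L2 norm of `epsilon`, and training minimizes the clean loss plus the loss on the perturbed input. This is an illustrative assumption, not AT-BMC's actual recipe. The discrete-space perturbations and the BMC term are not shown, a toy logistic-regression classifier stands in for the text encoder, and all function names (`fgm_perturbation`, `adversarial_training_loss`, etc.) are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def logistic_loss(w: list[float], x: list[float], y: int) -> float:
    # Binary cross-entropy of a toy linear classifier p = sigmoid(w . x);
    # x plays the role of an input embedding (hypothetical stand-in model).
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    tiny = 1e-12  # guards against log(0)
    return -(y * math.log(p + tiny) + (1 - y) * math.log(1 - p + tiny))


def grad_wrt_input(w: list[float], x: list[float], y: int) -> list[float]:
    # For logistic regression, d(loss)/dx = (p - y) * w.
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [(p - y) * wi for wi in w]


def fgm_perturbation(grad: list[float], epsilon: float) -> list[float]:
    # Move along the gradient direction, scaled to L2 norm epsilon.
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(grad)
    return [epsilon * g / norm for g in grad]


def adversarial_training_loss(w: list[float], x: list[float], y: int,
                              epsilon: float) -> tuple[float, float, float]:
    # Returns (total, clean, adversarial); total is what would be minimized.
    clean = logistic_loss(w, x, y)
    r_adv = fgm_perturbation(grad_wrt_input(w, x, y), epsilon)
    x_adv = [xi + ri for xi, ri in zip(x, r_adv)]
    adv = logistic_loss(w, x_adv, y)
    return clean + adv, clean, adv


if __name__ == "__main__":
    w, x, y = [1.0, -2.0], [0.5, 0.25], 1
    total, clean, adv = adversarial_training_loss(w, x, y, epsilon=0.1)
    print(f"clean={clean:.4f} adversarial={adv:.4f} total={total:.4f}")
```

Because the perturbation points in the direction that increases the loss, the adversarial term is at least as large as the clean term for small `epsilon`. Minimizing both therefore pushes the model toward predictions that stay stable inside a small neighborhood of each input. In a real encoder, the gradient would come from automatic differentiation rather than the closed form used here.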


research
09/06/2019

Learning to Discriminate Perturbations for Blocking Adversarial Attacks in Text Classification

Adversarial attacks against machine learning models have threatened vari...
research
12/18/2022

Estimating the Adversarial Robustness of Attributions in Text with Transformers

Explanations are crucial parts of deep neural network (DNN) classifiers....
research
12/04/2020

Unsupervised Adversarially-Robust Representation Learning on Graphs

Recent works have demonstrated that deep learning on graphs is vulnerabl...
research
07/31/2023

Text-CRS: A Generalized Certified Robustness Framework against Textual Adversarial Attacks

The language models, especially the basic text classification models, ha...
research
09/14/2021

Improving Gradient-based Adversarial Training for Text Classification by Contrastive Learning and Auto-Encoder

Recent work has proposed several efficient approaches for generating gra...
research
05/27/2022

Semi-supervised Semantics-guided Adversarial Training for Trajectory Prediction

Predicting the trajectories of surrounding objects is a critical task in...
research
04/16/2021

Variable Instance-Level Explainability for Text Classification

Despite the high accuracy of pretrained transformer networks in text cla...
