A Fine-grained Interpretability Evaluation Benchmark for Neural NLP

05/23/2022
by Lijie Wang, et al.

While there is increasing concern about the interpretability of neural models, its evaluation remains an open problem, due to the lack of proper evaluation datasets and metrics. In this paper, we present a novel benchmark to evaluate the interpretability of both neural models and saliency methods. The benchmark covers three representative NLP tasks: sentiment analysis, textual similarity, and reading comprehension, each provided with annotated data in both English and Chinese. To enable precise evaluation of interpretability, we provide token-level rationales that are carefully annotated to be sufficient, compact, and comprehensive. We also design a new metric, namely the consistency between the rationales extracted before and after perturbations, to uniformly evaluate the interpretability of models and saliency methods across tasks. Based on this benchmark, we conduct experiments on three typical models with three saliency methods and unveil their strengths and weaknesses in terms of interpretability. We will release this benchmark at <https://xyz> and hope it can facilitate research on building trustworthy systems.
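To make the perturbation-consistency idea concrete, the minimal sketch below assumes that a rationale is the set of top-k tokens ranked by a saliency score, and that consistency is the token-level F1 between the rationales extracted from the original input and from a position-aligned perturbed copy. The abstract does not spell out the exact formulation, so these definitions and all function names are illustrative assumptions, not the paper's method.

```python
# Sketch of a perturbation-consistency score (assumed formulation):
# rationale = top-k tokens by saliency; consistency = token-level F1
# between the original and perturbed rationales.

def top_k_rationale(saliency: list[float], k: int) -> set[int]:
    """Return the indices of the k highest-saliency tokens."""
    ranked = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    return set(ranked[:k])

def consistency_f1(before: set[int], after: set[int]) -> float:
    """Token-level F1 between two rationale index sets."""
    if not before or not after:
        return 0.0
    overlap = len(before & after)
    p, r = overlap / len(after), overlap / len(before)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

# Toy saliency scores for a 6-token input and a synonym-perturbed copy
# (same length, so token positions stay aligned).
orig_saliency = [0.05, 0.40, 0.10, 0.02, 0.35, 0.08]
pert_saliency = [0.06, 0.38, 0.09, 0.03, 0.12, 0.32]

before = top_k_rationale(orig_saliency, k=2)  # {1, 4}
after = top_k_rationale(pert_saliency, k=2)   # {1, 5}
print(f"consistency: {consistency_f1(before, after):.2f}")  # 0.50
```

In this toy setting the two rationales agree on one of two tokens, giving a consistency of 0.50; a model and saliency method that keep attending to the same evidence under perturbation would score closer to 1.0.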


Related research:

DuTrust: A Sentiment Analysis Dataset for Trustworthiness Evaluation (08/30/2021)
While deep learning models have greatly improved the performance of most...

An Interpretability Evaluation Benchmark for Pre-trained Language Models (07/28/2022)
While pre-trained language models (LMs) have brought great improvements...

WYWEB: A NLP Evaluation Benchmark For Classical Chinese (05/23/2023)
To fully evaluate the overall performance of different NLP models in a g...

What do different evaluation metrics tell us about saliency models? (04/12/2016)
How best to evaluate a saliency model's ability to predict where humans...

Comparing radiologists' gaze and saliency maps generated by interpretability methods for chest x-rays (12/22/2021)
The interpretability of medical image analysis models is considered a ke...

ERASER: A Benchmark to Evaluate Rationalized NLP Models (11/08/2019)
State-of-the-art models in NLP are now predominantly based on deep neura...

Cross-functional Analysis of Generalisation in Behavioural Learning (05/22/2023)
In behavioural testing, system functionalities underrepresented in the s...
