How (Not) To Evaluate Explanation Quality

10/13/2022
by Hendrik Schuff, et al.

The importance of explainability is increasingly acknowledged in natural language processing. However, it is still unclear how the quality of explanations can be assessed effectively. The predominant approach is to compare proxy scores (such as BLEU or explanation F1) against gold explanations in the dataset, under the assumption that an increase in the proxy score implies higher utility of the explanations to users. In this paper, we question this assumption. In particular, we (i) formulate desired characteristics of explanation quality that apply across tasks and domains, (ii) point out how current evaluation practices violate those characteristics, and (iii) propose actionable guidelines to overcome obstacles that limit today's evaluation of explanation quality and to enable the development of explainable systems that provide tangible benefits for human users. We substantiate our theoretical claims (i.e., the lack of validity and the temporal decline of currently used proxy scores) with empirical evidence from a crowdsourcing case study in which we investigate the explanation quality of state-of-the-art explainable question answering systems.
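As a concrete illustration of the evaluation practice the abstract criticizes, here is a minimal sketch (not code from the paper) of how such proxy scores are typically computed: sentence-level BLEU via NLTK and a SQuAD-style token-overlap F1 between a generated explanation and a gold explanation. The `token_f1` helper and the example strings are illustrative assumptions, not part of the original work.

```python
# Minimal sketch of the proxy-score evaluation the paper questions.
# Assumes NLTK is installed; the explanation strings are hypothetical.
from collections import Counter

from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu


def token_f1(prediction: str, gold: str) -> float:
    """SQuAD-style token-overlap F1 between predicted and gold explanations."""
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    overlap = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(overlap.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


gold = "the bird can fly because it has wings"
predicted = "the bird has wings so it can fly"

# Sentence-level BLEU against the single gold reference.
bleu = sentence_bleu(
    [gold.split()],
    predicted.split(),
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU = {bleu:.3f}")
print(f"token F1 = {token_f1(predicted, gold):.3f}")
```

Both scores reward n-gram or token overlap with a single gold reference; the paper's point is that a higher score of this kind need not translate into explanations that are more useful to human users.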


Related research

03/30/2023 · Rethinking AI Explainability and Plausibility
Setting proper evaluation objectives for explainable artificial intellig...

10/13/2020 · F1 is Not Enough! Models and Evaluation Towards User-Centered Explainable Question Answering
Explainable question answering systems predict an answer together with a...

05/04/2023 · Are Human Explanations Always Helpful? Towards Objective Evaluation of Human Natural Language Explanations
Human-annotated labels and explanations are critical for training explai...

02/14/2022 · DermX: an end-to-end framework for explainable automated dermatological diagnosis
Dermatological diagnosis automation is essential in addressing the high ...

05/05/2021 · Do Natural Language Explanations Represent Valid Logical Arguments? Verifying Entailment in Explainable NLI Gold Standards
An emerging line of research in Explainable NLP is the creation of datas...

11/23/2022 · SEAT: Stable and Explainable Attention
Currently, attention mechanism becomes a standard fixture in most state-...

05/07/2021 · Order in the Court: Explainable AI Methods Prone to Disagreement
In Natural Language Processing, feature-additive explanation methods qua...
