FRAME: Evaluating Simulatability Metrics for Free-Text Rationales

07/02/2022
by Aaron Chan, et al.

Free-text rationales aim to explain neural language model (LM) behavior more flexibly and intuitively via natural language. To ensure rationale quality, it is important to have metrics for measuring rationales' faithfulness (how well a rationale reflects the LM's actual behavior) and plausibility (how convincing a rationale is to humans). All existing free-text rationale metrics are based on simulatability (the association between a rationale and the LM's predicted label), but there is no protocol for assessing such metrics' reliability. To investigate this, we propose FRAME, a framework for evaluating free-text rationale simulatability metrics. FRAME is based on three axioms: (1) good metrics should yield the highest scores for reference rationales, which maximize rationale-label association by construction; (2) good metrics should be appropriately sensitive to semantic perturbation of rationales; and (3) good metrics should be robust to variation in the LM's task performance. Across three text classification datasets, we show that existing simulatability metrics cannot satisfy all three FRAME axioms, since they are implemented via model pretraining, which muddles the metric's signal. We introduce a non-pretraining simulatability variant that improves performance on (1) and (3) by an average of 41.7%, while performing competitively on (2).
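For context, simulatability metrics of this kind typically score a rationale by how much it improves a separate simulator model's ability to recover the task LM's predicted label when the rationale is appended to the input. The sketch below illustrates that general idea only; the `simulator` callable and the input-rationale concatenation format are assumptions for illustration, not FRAME's actual implementation.

```python
# Minimal sketch of a simulatability-style score (illustrative, not the
# paper's exact metric). Assumes a hypothetical `simulator` callable that
# maps a text string to a predicted label.

def simulatability(inputs, rationales, lm_labels, simulator):
    """Accuracy gain of a simulator at recovering the task LM's labels
    when rationales are appended to the inputs."""
    def accuracy(texts):
        preds = [simulator(t) for t in texts]
        correct = sum(p == y for p, y in zip(preds, lm_labels))
        return correct / len(lm_labels)

    # Simulator accuracy with and without access to the rationale.
    acc_with = accuracy([f"{x} rationale: {r}" for x, r in zip(inputs, rationales)])
    acc_without = accuracy(inputs)
    # Higher => the rationale is more strongly associated with the LM's label.
    return acc_with - acc_without
```

Under axiom (1), reference rationales, which maximize rationale-label association by construction, should receive the highest such scores; axiom (3) asks that the score not simply track the task LM's accuracy.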

