On the Challenges of Evaluating Compositional Explanations in Multi-Hop Inference: Relevance, Completeness, and Expert Ratings

09/07/2021
by Peter Jansen, et al.

Building compositional explanations requires models to combine two or more facts that, together, describe why the answer to a question is correct. Typically, these "multi-hop" explanations are evaluated relative to one (or a small number of) gold explanations. In this work, we show that these evaluations substantially underestimate model performance, in terms of both the relevance of included facts and the completeness of model-generated explanations, because models regularly discover and produce valid explanations that differ from the gold explanations. To address this, we construct a large corpus of 126k domain-expert (science teacher) relevance ratings that augment a corpus of explanations for standardized science exam questions, discovering 80k additional relevant facts not rated as gold. We build three strong models based on different methodologies (generation, ranking, and schemas), and empirically show that while expert-augmented ratings provide better estimates of explanation quality, both original (gold) and expert-augmented automatic evaluations still substantially underestimate performance, by up to 36% compared with full manual expert judgements, with different models being disproportionately affected. This poses a significant methodological challenge to accurately evaluating explanations produced by compositional reasoning models.
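
The underestimation described above can be illustrated with a minimal sketch. It assumes relevance is scored as the fraction of model-selected facts found in a reference set. This is an illustrative simplification, not necessarily the metric used in the paper. The fact identifiers, the toy sets, and the function name `relevance_precision` are all hypothetical.

```python
def relevance_precision(predicted, relevant):
    """Fraction of predicted facts that appear in the reference set of relevant facts."""
    if not predicted:
        return 0.0
    hits = sum(1 for fact in predicted if fact in relevant)
    return hits / len(predicted)


# Hypothetical toy data: these fact identifiers are invented for illustration.
gold_facts = {"f1", "f2"}                        # facts in the single gold explanation
expert_rated_relevant = gold_facts | {"f3", "f4"}  # gold facts plus extra facts experts rated relevant
model_explanation = ["f1", "f3", "f4", "f9"]     # facts a model chose for its explanation

# Scoring against gold alone counts the valid-but-non-gold facts f3 and f4 as errors.
gold_score = relevance_precision(model_explanation, gold_facts)

# Scoring against the expert-augmented set credits them.
augmented_score = relevance_precision(model_explanation, expert_rated_relevant)

print(f"gold-only: {gold_score:.2f}, expert-augmented: {augmented_score:.2f}")
```

In this toy case the same model output scores lower against the gold set than against the augmented set, which is the kind of gap the expert ratings are meant to expose. Any relevant fact still missing from the augmented set would continue to count against the model.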


