ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning

12/15/2022
by Olga Golovneva, et al.

Large language models show improved downstream task performance when prompted to generate step-by-step reasoning to justify their final answers. These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness (independent of the final answer) is difficult without reliable methods for automatic evaluation. We simply do not know how often the stated reasoning steps actually support the final end task predictions. In this work, we present ROSCOE, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics. To evaluate ROSCOE against baseline metrics, we design a typology of reasoning errors and collect synthetic and human evaluation scores on commonly used reasoning datasets. In contrast with existing metrics, ROSCOE can measure semantic consistency, logicality, informativeness, fluency, and factuality (among other traits) by leveraging properties of step-by-step rationales. We empirically verify the strength of our metrics on five human-annotated and six programmatically perturbed diagnostic datasets, covering a diverse set of tasks that require reasoning skills, and show that ROSCOE can consistently outperform baseline metrics.
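To make the idea of scoring a reasoning chain against its source concrete, here is a toy alignment score in the spirit of ROSCOE's chain-to-source measures. This is a sketch under stated assumptions, not the paper's implementation: ROSCOE relies on learned sentence embeddings, whereas this example substitutes a simple bag-of-words cosine similarity, and the function names (`faithfulness`, `_cosine`) are ours for illustration.

```python
import math
from collections import Counter

def _cosine(a: str, b: str) -> float:
    """Cosine similarity between two bag-of-words term-count vectors.

    Stands in for the learned sentence-embedding similarity ROSCOE uses.
    """
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def faithfulness(steps: list[str], source_sentences: list[str]) -> float:
    """Chain-level alignment score in [0, 1].

    For each reasoning step, take its best similarity to any source
    sentence, then average over steps: a step that matches nothing in
    the source drags the score toward 0.
    """
    if not steps:
        return 0.0
    return sum(
        max(_cosine(step, src) for src in source_sentences)
        for step in steps
    ) / len(steps)
```

A step copied verbatim from the source scores 1.0, while a step sharing no tokens with the source scores 0.0; averaging over steps gives a single interpretable number per chain, which is the same aggregation shape (step-level scores rolled up to a chain-level score) that the ROSCOE suite works with.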

Related research

- GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation (01/17/2021)
- Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning (05/19/2022)
- Interpretable Math Word Problem Solution Generation Via Step-by-step Planning (06/01/2023)
- ReCEval: Evaluating Reasoning Chains via Correctness and Informativeness (04/21/2023)
- Take a Break in the Middle: Investigating Subgoals towards Hierarchical Script Generation (05/18/2023)
- On the Advance of Making Language Models Better Reasoners (06/06/2022)
- Solving math word problems with process- and outcome-based feedback (11/25/2022)
