Attentiveness to Answer Choices Doesn't Always Entail High QA Accuracy

05/24/2023
by Sarah Wiegreffe, et al.

When large language models (LMs) are applied in zero- or few-shot settings to discriminative tasks such as multiple-choice questions, their attentiveness (i.e., probability mass) is spread across many vocabulary tokens that are not valid choices. Such a spread across multiple surface forms with identical meaning is thought to cause an underestimation of a model's true performance, referred to as the "surface form competition" (SFC) hypothesis. This has motivated the introduction of various probability normalization methods. However, many core questions remain unanswered. How do we measure SFC or attentiveness? Are there direct ways of increasing attentiveness on valid choices? Does increasing attentiveness always improve task accuracy? We propose a mathematical formalism for studying this phenomenon, provide a metric for quantifying attentiveness, and identify a simple method for increasing it: in-context learning with even just one example containing answer choices. The formalism allows us to quantify SFC and bound its impact. Our experiments on three diverse datasets and six LMs reveal several surprising findings. For example, encouraging models to generate a valid answer choice can, in fact, be detrimental to task performance for some LMs, and prior probability normalization methods are less effective for (and sometimes even detrimental to) instruction-tuned LMs. We conclude with practical insights for effectively using prompted LMs for multiple-choice tasks.
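The attentiveness the abstract describes is the probability mass a model places on valid answer choices. Below is a minimal sketch of one way to measure it, assuming a HuggingFace causal LM and scoring only the first token of each choice; the model name, prompt, and single-token simplification are illustrative assumptions, not the paper's exact protocol.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Illustrative model; the paper evaluates six LMs, not necessarily this one.
    model_name = "gpt2"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()

    # A hypothetical multiple-choice prompt with the answer choices in context.
    prompt = (
        "Question: What color is a clear daytime sky?\n"
        "Choices: blue, green, red\n"
        "Answer:"
    )
    choices = [" blue", " green", " red"]  # leading space matches GPT-2 tokenization

    with torch.no_grad():
        logits = model(**tokenizer(prompt, return_tensors="pt")).logits
    next_token_probs = torch.softmax(logits[0, -1], dim=-1)

    # Simplification: score only the first token of each choice; multi-token
    # choices would require scoring the full continuation.
    choice_ids = [tokenizer.encode(c)[0] for c in choices]
    attentiveness = sum(next_token_probs[i].item() for i in choice_ids)

    print(f"Probability mass on valid choices (attentiveness): {attentiveness:.4f}")
    for choice, idx in zip(choices, choice_ids):
        print(f"  P({choice!r}) = {next_token_probs[idx].item():.4f}")

Under the SFC hypothesis, synonymous surface forms (e.g., " blue" vs. " Blue") would split this mass across tokens, which is what probability normalization methods were introduced to correct; the paper's point is that raising attentiveness itself does not guarantee higher task accuracy.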


Related research

04/16/2021 · Surface Form Competition: Why the Highest Probability Answer Isn't Always Right
Large language models have shown promising results in zero-shot settings...

12/31/2020 · Using Natural Language Relations between Answer Choices for Machine Comprehension
When evaluating an answer choice for Reading Comprehension task, other a...

08/22/2023 · Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions
Large Language Models (LLMs) have demonstrated remarkable capabilities i...

10/22/2022 · Leveraging Large Language Models for Multiple Choice Question Answering
While large language models (LLMs) like GPT-3 have achieved impressive r...

03/28/2018 · How to ask sensitive multiple choice questions
Motivated by recent failures of polling to estimate populist party suppo...
