Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering

07/31/2023
by   Vaibhav Adlakha, et al.

Retriever-augmented instruction-following models are attractive alternatives to fine-tuned approaches for information-seeking tasks such as question answering (QA). By simply prepending retrieved documents to their input along with an instruction, these models can be adapted to various information domains and tasks without additional fine-tuning. While the model responses tend to be natural and fluent, the additional verbosity makes traditional QA evaluation metrics such as exact match (EM) and F1 unreliable for accurately quantifying model performance. In this work, we investigate the performance of instruction-following models across three information-seeking QA tasks. We use both automatic and human evaluation to assess these models along two dimensions: 1) how well they satisfy the user's information need (correctness), and 2) whether they produce a response based on the provided knowledge (faithfulness). Guided by human evaluation and analysis, we highlight the shortcomings of traditional metrics for both correctness and faithfulness. We then propose simple token-overlap based and model-based metrics that reflect the true performance of these models. Our analysis reveals that instruction-following models are competitive with, and sometimes even outperform, fine-tuned models for correctness. However, these models struggle to stick to the provided knowledge and often hallucinate in their responses. We hope our work encourages a more holistic evaluation of instruction-following models for QA. Our code and data are available at https://github.com/McGill-NLP/instruct-qa
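To illustrate why exact match penalizes verbose but correct answers, and what a token-overlap alternative looks like, here is a minimal sketch of token-level F1 and recall. The normalization follows the common SQuAD-style convention (lowercasing, stripping punctuation and articles); the function names are illustrative, not the paper's actual implementation.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall between prediction and reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def token_recall(prediction, reference):
    """Fraction of reference tokens covered by the prediction; tolerant of verbosity."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not ref_tokens:
        return 0.0
    common = Counter(pred_tokens) & Counter(ref_tokens)
    return sum(common.values()) / len(ref_tokens)
```

A verbose response such as "The capital of France is Paris." against the reference "Paris" gets EM = 0 and a low F1 (extra tokens hurt precision), while token recall is 1.0, which is why recall-oriented overlap metrics better reflect the correctness of chatty instruction-following models.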


Related research:

- 09/03/2023 · MedChatZH: a Better Medical Adviser Learns from Better Instructions
  Generative large language models (LLMs) have shown great success in vari...
- 09/15/2021 · Can Edge Probing Tasks Reveal Linguistic Knowledge in QA Models?
  There have been many efforts to try to understand what grammatical know...
- 04/14/2023 · HuaTuo: Tuning LLaMA Model with Chinese Medical Knowledge
  Large Language Models (LLMs), such as the LLaMA model, have demonstrated...
- 09/12/2019 · Measuring Domain Portability and Error Propagation in Biomedical QA
  In this work we present Google's submission to the BioASQ 7 biomedical q...
- 09/24/2021 · Investigating Post-pretraining Representation Alignment for Cross-Lingual Question Answering
  Human knowledge is collectively encoded in the roughly 6500 languages sp...
- 05/19/2023 · Self-QA: Unsupervised Knowledge Guided Language Model Alignment
  Large-scale language models like ChatGPT and GPT-4 have gained attention...
- 06/08/2023 · PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization
  Instruction tuning large language models (LLMs) remains a challenging ta...
