Ditch the Gold Standard: Re-evaluating Conversational Question Answering

by Huihan Li, et al.

Conversational question answering (CQA) systems aim to provide natural-language answers to users in information-seeking conversations. Existing CQA benchmarks compare models with pre-collected human-human conversations, using ground-truth answers provided in conversational history. It remains unclear whether we can rely on this static evaluation for model development and whether current systems can generalize well to real-world human-machine conversations. In this work, we conduct the first large-scale human evaluation of state-of-the-art CQA systems, where human evaluators converse with models and judge the correctness of their answers. We find that the distribution of human-machine conversations differs drastically from that of human-human conversations, and that human evaluation and gold-history evaluation disagree on model ranking. We further investigate how to improve automatic evaluations, and propose a question rewriting mechanism based on predicted history, which better correlates with human judgments. Finally, we discuss the impact of various modeling strategies and future directions towards better conversational question answering systems.




