Q^2: Evaluating Factual Consistency in Knowledge-Grounded Dialogues via Question Generation and Question Answering

04/16/2021
by Or Honovich, et al.

Neural knowledge-grounded generative models for dialogue often produce content that is factually inconsistent with the source text they rely on. As a result, such models are unreliable, which limits their real-world applicability. Inspired by recent work on evaluating factual consistency in abstractive summarization (Durmus et al., 2020; Wang et al., 2020), we propose an automatic evaluation metric for factual consistency in knowledge-grounded dialogue models, based on automatic question generation and question answering. Unlike previous work, which relies on naive token-based comparison of answer spans, our metric uses co-reference resolution and natural language inference, which greatly improve its performance. To foster proper evaluation, we curate a novel dataset of state-of-the-art dialogue system outputs for the Wizard-of-Wikipedia dataset (Dinan et al., 2019), which we manually annotate for factual consistency. We perform a thorough meta-evaluation of our metric against other metrics on the new dataset and two others, where it greatly outperforms the baselines.
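To make the comparison step concrete, the following is a minimal sketch of how answer spans from a question generation and question answering pipeline might be compared. It is not the authors' implementation. The names `token_f1`, `consistency_score`, `answer_from_knowledge`, and `entails` are hypothetical, the question generation, question answering, and inference models are replaced by plain callables, and letting an entailment check override a low token-overlap score is an assumption that the abstract does not confirm.

```python
import re
from collections import Counter
from typing import Callable, Iterable, List, Optional, Tuple


def tokens(text: str) -> List[str]:
    """Lowercase and split into word tokens."""
    return re.findall(r"\w+", text.lower())


def token_f1(a: str, b: str) -> float:
    """Token-level F1 between two answer spans (the 'naive' comparison)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return float(ta == tb)
    overlap = sum((Counter(ta) & Counter(tb)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(ta)
    recall = overlap / len(tb)
    return 2 * precision * recall / (precision + recall)


def consistency_score(
    qa_pairs: Iterable[Tuple[str, str]],
    answer_from_knowledge: Callable[[str], str],
    entails: Optional[Callable[[str, str, str], bool]] = None,
) -> float:
    """Average agreement between response answers and knowledge answers.

    qa_pairs: (question, answer span taken from the dialogue response).
    answer_from_knowledge: hypothetical stand-in for a QA model run on the
        grounding knowledge.
    entails: hypothetical stand-in for an inference model; ASSUMPTION: when
        it accepts a pair, the pair counts as fully consistent.
    """
    scores = []
    for question, response_answer in qa_pairs:
        knowledge_answer = answer_from_knowledge(question)
        score = token_f1(response_answer, knowledge_answer)
        if score < 1.0 and entails is not None:
            if entails(question, knowledge_answer, response_answer):
                score = 1.0
        scores.append(score)
    return sum(scores) / len(scores) if scores else 0.0
```

With token overlap alone, a paraphrased but correct answer is penalized, which is the weakness of token-based comparison that the abstract points to. Passing an `entails` callable shows one way a semantic check could recover such cases.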

10/20/2018

A Knowledge-Grounded Multimodal Search-Based Conversational Agent

Multimodal search-based dialogue is a challenging new task: It extends v...
08/28/2019

DeepCopy: Grounded Response Generation with Hierarchical Pointer Networks

Recent advances in neural sequence-to-sequence models have led to promis...
04/11/2022

TRUE: Re-evaluating Factual Consistency Evaluation

Grounded text generation systems often generate text that contains factu...
05/30/2017

Generative Models of Visually Grounded Imagination

It is easy for people to imagine what a man with pink hair looks like, e...
11/10/2019

Don't Say That! Making Inconsistent Dialogue Unlikely with Unlikelihood Training

Generative dialogue models currently suffer from a number of problems wh...
04/13/2020

Public Self-consciousness for Endowing Dialogue Agents with Consistent Persona

Although consistency has been a long-standing issue in dialogue agents, ...
05/09/2016

GLEU Without Tuning

The GLEU metric was proposed for evaluating grammatical error correction...