Q^2: Evaluating Factual Consistency in Knowledge-Grounded Dialogues via Question Generation and Question Answering

by Or Honovich, et al.

Neural knowledge-grounded generative models for dialogue often produce content that is factually inconsistent with the source text they rely on. As a consequence, such models are unreliable, which limits their real-world applicability. Inspired by recent work on evaluating factual consistency in abstractive summarization (Durmus et al., 2020; Wang et al., 2020), we propose an automatic evaluation metric for factual consistency in knowledge-grounded dialogue models that uses automatic question generation and question answering. Unlike previous work, which relies on naive token-based comparison of answer spans, our metric makes use of co-reference resolution and natural language inference capabilities, which greatly improve its performance. To foster proper evaluation, we curate a novel dataset of state-of-the-art dialogue system outputs for the Wizard-of-Wikipedia dataset (Dinan et al., 2019), which we manually annotate for factual consistency. We perform a thorough meta-evaluation of our metric against other metrics using the new dataset and two others, where it greatly outperforms the baselines.
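The pipeline the abstract describes can be sketched roughly as: generate questions from the dialogue response, answer each question over the grounding knowledge, and score how well the two answers agree. The sketch below mocks only the answer-comparison step; `token_f1` stands for the naive token-based comparison used by earlier metrics, and `stub_nli_entails` is a trivial rule-based stand-in for a trained NLI classifier. All function names, the NLI direction, and the threshold are illustrative assumptions, not the paper's actual implementation.

```python
def token_f1(pred, gold):
    # Naive token-overlap F1, the kind of span comparison earlier metrics rely on.
    p, g = pred.lower().split(), gold.lower().split()
    common = sum(min(p.count(t), g.count(t)) for t in set(p))
    if common == 0:
        return 0.0
    prec, rec = common / len(p), common / len(g)
    return 2 * prec * rec / (prec + rec)


def stub_nli_entails(premise, hypothesis):
    # Stand-in for a neural NLI model: treats token containment as entailment.
    # A real implementation would run a trained NLI classifier over both spans.
    return set(hypothesis.lower().split()) <= set(premise.lower().split())


def consistency_score(answer_pairs, nli=stub_nli_entails, f1_threshold=0.9):
    # answer_pairs: (answer derived from the response, answer derived from the knowledge).
    # A near-exact token match counts as consistent; otherwise fall back to NLI,
    # which is what lets paraphrased but faithful answers still score as consistent.
    scores = []
    for resp_ans, know_ans in answer_pairs:
        if token_f1(resp_ans, know_ans) >= f1_threshold:
            scores.append(1.0)
        elif nli(resp_ans, know_ans):
            scores.append(1.0)
        else:
            scores.append(0.0)
    return sum(scores) / len(scores) if scores else 0.0
```

For example, the pair `("the 44th president Barack Obama", "Barack Obama")` fails the token-F1 threshold but is rescued by the entailment fallback, while `("Paris", "London")` is scored inconsistent by both checks.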
