
An Evaluation Protocol for Generative Conversational Systems

by Seolhwa Lee, et al.

There is a multitude of novel generative models for open-domain conversational systems; however, there is no systematic evaluation of different systems. Systematic comparisons require consistency in experimental design, evaluation sets, conversational systems and their outputs, and statistical analysis. We lay out a protocol for the evaluation of conversational models using head-to-head pairwise comparison. We analyze ten recent models that claim state-of-the-art performance, using paired head-to-head comparisons (win-loss-tie) on five evaluation datasets. Our findings show that DialoGPT and Blender are the superior systems under both the Bradley-Terry model and TrueSkill ranking methods. These findings demonstrate the feasibility of our protocol for evaluating conversational agents and evaluation sets. Finally, we make all code and evaluations publicly available so that researchers can compare their models to other state-of-the-art dialog models.
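To make the ranking step concrete, here is a minimal sketch of how Bradley-Terry strengths could be estimated from head-to-head win-loss-tie counts. It is not the authors' released code. The system names and counts are made-up toy values, the function name `bradley_terry` is hypothetical, and counting a tie as half a win for each side is an assumption of this sketch (the abstract does not say how ties are handled). The fit uses the standard minorization-maximization update for the Bradley-Terry model.

```python
from collections import defaultdict


def bradley_terry(results, iters=500, tol=1e-10):
    """Estimate Bradley-Terry strengths from pairwise counts.

    results: iterable of (sys_a, sys_b, wins_a, wins_b, ties).
    Assumption of this sketch: a tie counts as half a win for each side.
    Returns a dict mapping system name to a strength; strengths sum to 1.
    """
    wins = defaultdict(float)   # tie-adjusted win total per system
    games = defaultdict(float)  # number of comparisons per unordered pair
    systems = set()
    for a, b, wins_a, wins_b, ties in results:
        systems.update((a, b))
        wins[a] += wins_a + 0.5 * ties
        wins[b] += wins_b + 0.5 * ties
        games[frozenset((a, b))] += wins_a + wins_b + ties

    systems = sorted(systems)
    p = {s: 1.0 / len(systems) for s in systems}
    for _ in range(iters):
        new_p = {}
        for i in systems:
            # MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j)
            denom = sum(
                games[frozenset((i, j))] / (p[i] + p[j])
                for j in systems
                if j != i and frozenset((i, j)) in games
            )
            new_p[i] = wins[i] / denom
        total = sum(new_p.values())
        new_p = {s: v / total for s, v in new_p.items()}
        delta = max(abs(new_p[s] - p[s]) for s in systems)
        p = new_p
        if delta < tol:
            break
    return p


# Toy counts for three hypothetical systems (not results from the paper).
toy_results = [
    ("system_A", "system_B", 6, 3, 1),
    ("system_A", "system_C", 7, 2, 1),
    ("system_B", "system_C", 5, 4, 1),
]
strengths = bradley_terry(toy_results)
for name, value in sorted(strengths.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:.3f}")
```

Sorting systems by the estimated strength gives a ranking. The sketch assumes every system appears in at least one comparison and has a nonzero tie-adjusted win total; otherwise the update is undefined or the estimate degenerates.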

