Spot The Bot: A Robust and Efficient Framework for the Evaluation of Conversational Dialogue Systems

10/05/2020
by Jan Deriu, et al.

The lack of time-efficient and reliable evaluation methods hampers the development of conversational dialogue systems (chatbots). Evaluations that require humans to converse with chatbots are time- and cost-intensive, put high cognitive demands on the human judges, and yield low-quality results. In this work, we introduce Spot The Bot, a cost-efficient and robust evaluation framework that replaces human-bot conversations with conversations between bots. Human judges then only annotate, for each entity in a conversation, whether they think it is human or not (assuming that some participants in these conversations are human). These annotations allow us to rank chatbots with respect to their ability to mimic the conversational behavior of humans. Since we expect all bots to eventually be recognized as such, we incorporate a metric based on Survival Analysis that measures which chatbot can uphold human-like behavior the longest. This metric makes it possible to relate a bot's performance to certain of its characteristics (e.g., fluency or sensibleness), yielding interpretable results. The comparably low cost of our framework allows for frequent evaluations of chatbots during their development cycle. We empirically validate our claims by applying Spot The Bot to three domains, evaluating several state-of-the-art chatbots, and drawing comparisons to related work. The framework is released as a ready-to-use tool.
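As a rough illustration of the Survival Analysis component: each bot accumulates, over many bot-bot conversations, the point at which annotators first recognized it as a bot, and a survival curve is estimated from those observations. The sketch below is a minimal, self-contained Kaplan-Meier estimate of such a curve; the function name, the durations/events encoding, and the example numbers are illustrative assumptions, not the API of the released tool.

    from collections import Counter

    def kaplan_meier(durations, events):
        """Kaplan-Meier survival estimate (hypothetical encoding).

        durations: conversation segment at which a bot was first judged to
                   be a bot, or the last annotated segment if it never was.
        events:    1 if the bot was recognized (an "event"), 0 if censored
                   (the bot was still judged human when annotation ended).
        Returns a list of (segment, survival_probability) pairs.
        """
        n_at_risk = len(durations)
        deaths = Counter(t for t, e in zip(durations, events) if e)
        removals = Counter(durations)  # events and censorings both leave the risk set
        survival, curve = 1.0, []
        for t in sorted(removals):
            if t in deaths:
                # Multiply in the conditional probability of "surviving" segment t.
                survival *= 1.0 - deaths[t] / n_at_risk
                curve.append((t, survival))
            n_at_risk -= removals[t]
        return curve

    # Hypothetical example: annotators flagged one bot after segment 2, two
    # after segment 3, one after segment 5; two bots were never flagged.
    durations = [2, 3, 3, 5, 5, 5]
    events    = [1, 1, 1, 1, 0, 0]
    for t, s in kaplan_meier(durations, events):
        print(f"P(still judged human after segment {t}) = {s:.2f}")

Comparing such curves across bots gives the ranking described in the abstract: a bot whose curve decays more slowly upholds human-like behavior for longer.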


Related research

05/10/2019
Survey on Evaluation Methods for Dialogue Systems
In this paper we survey the methods and concepts developed for the evalu...

07/20/2023
Learning and Evaluating Human Preferences for Conversational Head Generation
A reliable and comprehensive evaluation metric that aligns with manual p...

04/16/2021
Human-like informative conversations: Better acknowledgements using conditional mutual information
This work aims to build a dialogue agent that can weave new factual cont...

09/23/2020
ConvAI3: Generating Clarifying Questions for Open-Domain Dialogue Systems (ClariQ)
This document presents a detailed description of the challenge on clarif...

01/12/2022
Human Evaluation of Conversations is an Open Problem: comparing the sensitivity of various methods for evaluating dialogue agents
At the heart of improving conversational AI is the open problem of how t...

01/11/2018
On Evaluating and Comparing Conversational Agents
Conversational agents are exploding in popularity. However, much work re...

11/24/2022
How "open" are the conversations with open-domain chatbots? A proposal for Speech Event based evaluation
Open-domain chatbots are supposed to converse freely with humans without...
