Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach

02/20/2021
by Haoming Jiang, et al.

Reliable automatic evaluation of dialog systems in an interactive environment is long overdue. The ideal evaluation setting, essentially a Turing test, requires human interaction, which is usually unaffordable for large-scale experiments. Though researchers have attempted to use metrics from language generation tasks (e.g., perplexity, BLEU) or model-based reinforcement learning methods (e.g., self-play evaluation) for automatic evaluation, these methods show only a weak correlation with actual human evaluation in practice. To bridge this gap, we propose ENIGMA, a new framework for estimating human evaluation scores based on recent advances in off-policy evaluation in reinforcement learning. ENIGMA requires only a handful of pre-collected experience data and therefore involves no human interaction with the target policy during evaluation, making automatic evaluation feasible. More importantly, ENIGMA is model-free and agnostic to the behavior policies used to collect the experience data (see details in Section 2), which significantly alleviates the technical difficulties of modeling complex dialog environments and human behaviors. Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
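ENIGMA's own estimator is not detailed in this abstract, so the sketch below only illustrates the general idea of model-free off-policy evaluation that the framework builds on: estimating a target policy's value from logged experience alone, without modeling the environment or the behavior policy. It uses a minimal tabular fitted Q evaluation (FQE) on fabricated toy data; the state/action sizes, the random logged transitions, and the target policy `pi` are illustrative assumptions, not the paper's actual procedure.

```python
# Minimal sketch of model-free off-policy evaluation (OPE) via tabular
# fitted Q evaluation (FQE) on logged experience. This is NOT the ENIGMA
# algorithm itself; it only illustrates estimating a target policy's value
# from pre-collected transitions, with no environment model and no
# knowledge of the behavior policy. All sizes and data are illustrative.

import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions, gamma = 10, 4, 0.95
n_transitions = 5000

# Logged experience: (state, action, reward, next_state) tuples collected
# by some unknown behavior policy. Here we fabricate toy data.
s = rng.integers(n_states, size=n_transitions)
a = rng.integers(n_actions, size=n_transitions)
r = rng.normal(size=n_transitions)
s_next = rng.integers(n_states, size=n_transitions)

# Target policy to evaluate: a fixed stochastic policy pi(a | s).
pi = rng.dirichlet(np.ones(n_actions), size=n_states)  # shape (S, A)

# Fitted Q evaluation: repeatedly regress Q(s, a) onto the bootstrapped
# target r + gamma * E_{a' ~ pi}[Q(s', a')], using only the logged data.
Q = np.zeros((n_states, n_actions))
for _ in range(200):
    v_next = (pi[s_next] * Q[s_next]).sum(axis=1)   # E_{a' ~ pi} Q(s', a')
    target = r + gamma * v_next
    Q_sum = np.zeros_like(Q)
    counts = np.zeros_like(Q)
    np.add.at(Q_sum, (s, a), target)                # accumulate targets
    np.add.at(counts, (s, a), 1.0)                  # count visits to (s, a)
    Q = Q_sum / np.maximum(counts, 1.0)             # tabular regression step

# Estimated value of the target policy under the empirical state
# distribution of the logged data.
v_hat = (pi[s] * Q[s]).sum(axis=1).mean()
print(f"Estimated target-policy value: {v_hat:.3f}")
```

The property echoed here is the one the abstract emphasizes: the estimate uses only pre-collected experience, so no interaction with the target policy and no explicit model of the environment or of the behavior policy is required.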


Related research

08/23/2017 · Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses
Automatically evaluating the quality of dialogue responses for unstructu...

09/02/2022 · Dialogue Evaluation with Offline Reinforcement Learning
Task-oriented dialogue systems aim to fulfill user goals through natural...

05/15/2023 · NLG Evaluation Metrics Beyond Correlation Analysis: An Empirical Metric Preference Checklist
In this study, we analyze NLG automatic metrics based on whether human e...

07/25/2018 · Evaluating Creativity in Computational Co-Creative Systems
This paper provides a framework for evaluating creativity in co-creative...

08/24/2020 · How To Evaluate Your Dialogue System: Probe Tasks as an Alternative for Token-level Evaluation Metrics
Though generative dialogue modeling is widely seen as a language modelin...

09/12/2022 · Open-Domain Dialog Evaluation using Follow-Ups Likelihood
Automatic evaluation of open-domain dialogs remains an unsolved problem....

04/05/2013 · Model-based Bayesian Reinforcement Learning for Dialogue Management
Reinforcement learning methods are increasingly used to optimise dialogu...
