Re-evaluating ADEM: A Deeper Look at Scoring Dialogue Responses

02/23/2019
by Ananya B. Sai, et al.

Automatically evaluating the quality of dialogue responses in unstructured domains is a challenging problem. ADEM (Lowe et al., 2017) formulated the automatic evaluation of dialogue systems as a learning problem and showed that such a model can predict scores that correlate significantly with human judgements, at both the utterance level and the system level. Their model was also shown to outperform word-overlap metrics such as BLEU by large margins. We start with the question of whether an adversary can game the ADEM model. We design a battery of targeted attacks on the neural-network-based ADEM evaluation system and show that automatic evaluation of dialogue systems still has a long way to go: ADEM can be confused by a change as simple as reversing the word order in the text. We report experiments on several such adversarial scenarios that draw out counterintuitive scores for dialogue responses. We then take a systematic look at the scoring function proposed by ADEM and connect it to linear system theory to predict the shortcomings evident in the system. We also devise an attack that can fool such a system into rating a poor response generation system favorably. Finally, we point to future research directions in which such adversarial attacks are used to design a truly automated dialogue evaluation system.
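To make the word-order attack concrete, here is a minimal sketch of an ADEM-style bilinear scorer, score(c, r, r_hat) = (c^T M r_hat + r^T N r_hat - alpha) / beta, following the form published in Lowe et al. (2017). Everything else is an illustrative stand-in, not the paper's implementation: random word vectors and a mean-pooled encoder replace ADEM's hierarchical RNN encoder, M and N are randomly initialized rather than learned, and the names embed, encode, and adem_score are hypothetical. The mean-pooled encoder is exactly order-invariant, so it exhibits, in the limiting case, the blind spot that the reversal attack probes.

```python
import numpy as np

# Sketch of an ADEM-style scorer (bilinear form from Lowe et al. 2017):
#   score(c, r, r_hat) = (c^T M r_hat + r^T N r_hat - alpha) / beta
# where c, r, r_hat encode the context, reference response, and model
# response. Toy random embeddings stand in for trained ones.

rng = np.random.default_rng(0)
DIM = 50
VOCAB = {}  # hypothetical toy embedding table

def embed(word):
    if word not in VOCAB:
        VOCAB[word] = rng.normal(size=DIM)
    return VOCAB[word]

def encode(sentence):
    # Order-insensitive toy encoder: mean of word vectors. The real
    # ADEM uses a hierarchical RNN, but any scorer whose representation
    # is (nearly) order-invariant fails the probe below the same way.
    return np.mean([embed(w) for w in sentence.split()], axis=0)

M = rng.normal(size=(DIM, DIM))  # stand-in for learned context matrix
N = rng.normal(size=(DIM, DIM))  # stand-in for learned reference matrix
alpha, beta = 0.0, 1.0           # scalar calibration constants

def adem_score(context, reference, response):
    c, r, r_hat = encode(context), encode(reference), encode(response)
    return (c @ M @ r_hat + r @ N @ r_hat - alpha) / beta

context = "how are you doing today"
reference = "i am doing great thanks for asking"
response = "i am fine thank you"
reversed_response = " ".join(reversed(response.split()))

# With an order-invariant encoding the two scores are identical, so the
# metric cannot distinguish a fluent response from its garbled reversal.
print(adem_score(context, reference, response))
print(adem_score(context, reference, reversed_response))
```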


Related research

Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses (08/23/2017)
Automatically evaluating the quality of dialogue responses for unstructu...

Utterance Pair Scoring for Noisy Dialogue Data Filtering (04/29/2020)
Filtering noisy training data is one of the key approaches to improving ...

RADE: Reference-Assisted Dialogue Evaluation for Open-Domain Dialogue (09/15/2023)
Evaluating open-domain dialogue systems is challenging for reasons such ...

WeChat AI's Submission for DSTC9 Interactive Dialogue Evaluation Track (01/20/2021)
We participate in the DSTC9 Interactive Dialogue Evaluation Track (Gunas...

Evaluating Dialogue Generation Systems via Response Selection (04/29/2020)
Existing automatic evaluation metrics for open-domain dialogue response ...

An Interpretable Deep Learning System for Automatically Scoring Request for Proposals (08/05/2020)
The Managed Care system within Medicaid (US Healthcare) uses Request For...

Strategy of the Negative Sampling for Training Retrieval-Based Dialogue Systems (11/24/2018)
The article describes the new approach for quality improvement of automa...
