How To Evaluate Your Dialogue System: Probe Tasks as an Alternative for Token-level Evaluation Metrics

by Prasanna Parthasarathi, et al.

Though generative dialogue modeling is widely seen as a language modeling task, the task demands that an agent have a complex natural language understanding of its input text to carry on a meaningful interaction with a user. The automatic metrics in use evaluate the quality of the generated text as a proxy for the holistic interaction of the agent. Such metrics were earlier shown not to correlate with human judgement. In this work, we observe that human evaluation of dialogue agents can be inconclusive due to the lack of sufficient information for appropriate evaluation. The automatic metrics are deterministic yet shallow, and human evaluation can be relevant yet inconclusive. To bridge this gap in evaluation, we propose designing a set of probing tasks to evaluate dialogue models. The hand-crafted tasks are aimed at quantitatively evaluating a generative dialogue model's understanding beyond the token-level evaluation of the generated text. The probing tasks are deterministic like automatic metrics and require human judgement in their design, benefiting from the best of both worlds. With experiments on probe tasks we observe that, unlike RNN-based architectures, the transformer model may not be learning to comprehend the input text despite its generated text having higher overlap with the target text.



Do Encoder Representations of Generative Dialogue Models Encode Sufficient Information about the Task ?

Predicting the next utterance in dialogue is contingent on encoding of u...

GRUEN for Evaluating Linguistic Quality of Generated Text

Automatic evaluation metrics are indispensable for evaluating generated ...

Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses

Automatically evaluating the quality of dialogue responses for unstructu...

Learning an Unreferenced Metric for Online Dialogue Evaluation

Evaluating the quality of a dialogue interaction between two agents is a...

Evaluating Groundedness in Dialogue Systems: The BEGIN Benchmark

Knowledge-grounded dialogue agents are systems designed to conduct a con...

Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach

Reliable automatic evaluation of dialogue systems under an interactive e...

Dynamic Human Evaluation for Relative Model Comparisons

Collecting human judgements is currently the most reliable evaluation me...
