Shades of BLEU, Flavours of Success: The Case of MultiWOZ

06/10/2021
by Tomáš Nekvinda, et al.

The MultiWOZ dataset (Budzianowski et al., 2018) is frequently used for benchmarking the context-to-response abilities of task-oriented dialogue systems. In this work, we identify inconsistencies in data preprocessing and in the reporting of the three corpus-based metrics used on this dataset, i.e., the BLEU score and the Inform & Success rates. We point out several problems with the MultiWOZ benchmark, such as unsatisfactory preprocessing, insufficient or under-specified evaluation metrics, and a rigid database. We re-evaluate 7 end-to-end and 6 policy-optimization models in setups that are as fair as possible, and we show that their reported scores cannot be directly compared. To facilitate the comparison of future systems, we release our stand-alone, standardized evaluation scripts. We also give basic recommendations for corpus-based benchmarking in future works.
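To illustrate the overall shape of such corpus-based evaluation, the sketch below computes a corpus-level BLEU score with sacrebleu and simplified Inform/Success-style rates. The prediction format and the matching logic (the `offered_venue`, `goal_venues`, `requested_slots`, and `mentioned_slots` fields) are illustrative assumptions for this sketch, not the exact definitions used by the released evaluation scripts.

```python
# Minimal sketch of corpus-based evaluation for MultiWOZ-style outputs.
# Assumed input format (not the paper's implementation): each dialogue is a
# dict with generated and reference responses plus goal-matching bookkeeping.
import sacrebleu


def evaluate(dialogues):
    hyps, refs = [], []
    inform_hits, success_hits = 0, 0

    for d in dialogues:
        hyps.extend(d["generated"])     # system responses (delexicalized), one per turn
        refs.extend(d["references"])    # gold responses, aligned with the turns above

        # Inform: did the system offer an entity consistent with the user goal?
        informed = d["offered_venue"] in d["goal_venues"]
        inform_hits += informed

        # Success: informed AND every requested slot was mentioned by the system.
        mentioned = set(d["mentioned_slots"])
        success_hits += informed and set(d["requested_slots"]) <= mentioned

    bleu = sacrebleu.corpus_bleu(hyps, [refs]).score
    n = len(dialogues)
    return {
        "bleu": bleu,
        "inform": 100.0 * inform_hits / n,
        "success": 100.0 * success_hits / n,
    }
```

In practice, the released standardized scripts also fix delexicalization, slot matching, and database lookups, which is exactly where the reported inconsistencies arise; the sketch only shows how the three metrics hang together.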


