Can we do that simpler? Simple, Efficient, High-Quality Evaluation Metrics for NLG

09/20/2022
by   Jens Grünwald, et al.

We explore efficient evaluation metrics for Natural Language Generation (NLG). To make metrics efficient, we replace (i) the computation-heavy transformers in metrics such as BERTScore, MoverScore, BARTScore, and XMoverScore with lighter versions (such as distilled ones) and (ii) alignment algorithms with cubic inference time, such as Word Mover's Distance (WMD), with linear and quadratic approximations. We consider six evaluation metrics (both monolingual and multilingual), assessed on three different machine translation datasets, and 16 lightweight transformers as replacements. We find, among other things, that (a) TinyBERT shows the best quality-efficiency tradeoff for semantic similarity metrics of the BERTScore family, retaining 97% of the quality while being 5x faster at inference time on average, (b) speed-ups differ greatly between CPU and GPU (much higher speed-ups on CPU), and (c) WMD approximations yield no efficiency gains but lead to a substantial drop in quality on 2 out of 3 datasets we examine.
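
To make the two ingredients above concrete, here is a minimal sketch, not the authors' implementation. It assumes token embeddings have already been produced by some encoder (for example a distilled transformer) and are passed in as plain lists of floats. The function names are hypothetical. `greedy_f1` illustrates BERTScore-style greedy matching without IDF weighting. `centroid_distance` and `relaxed_wmd` illustrate two well-known cheaper stand-ins for Word Mover's Distance, the Word Centroid Distance (linear in the number of tokens) and Relaxed WMD (quadratic), both with uniform token weights. Whether these are exactly the approximations evaluated in the paper is an assumption.

```python
import math


def cosine(u, v):
    # Cosine similarity of two non-zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def greedy_f1(ref_emb, hyp_emb):
    """BERTScore-style F1: each token is matched to its most similar
    token on the other side (no IDF weighting, no baseline rescaling).
    Assumes precision + recall is non-zero."""
    recall = sum(max(cosine(r, h) for h in hyp_emb) for r in ref_emb) / len(ref_emb)
    precision = sum(max(cosine(h, r) for r in ref_emb) for h in hyp_emb) / len(hyp_emb)
    return 2 * precision * recall / (precision + recall)


def centroid_distance(ref_emb, hyp_emb):
    """Word Centroid Distance: distance between mean embeddings.
    Linear in the number of tokens."""
    def centroid(emb):
        return [sum(column) / len(emb) for column in zip(*emb)]
    return euclidean(centroid(ref_emb), centroid(hyp_emb))


def relaxed_wmd(ref_emb, hyp_emb):
    """Relaxed WMD: each token moves all of its mass to its nearest
    neighbour on the other side; the larger of the two one-sided
    costs is returned. Quadratic in the number of tokens."""
    ref_to_hyp = sum(min(euclidean(r, h) for h in hyp_emb) for r in ref_emb) / len(ref_emb)
    hyp_to_ref = sum(min(euclidean(h, r) for r in ref_emb) for h in hyp_emb) / len(hyp_emb)
    return max(ref_to_hyp, hyp_to_ref)


# Toy 2-dimensional "embeddings", made up purely for illustration.
reference = [[1.0, 0.0], [0.0, 1.0]]
hypothesis = [[1.0, 0.0]]

f1 = greedy_f1(reference, hypothesis)
wcd = centroid_distance(reference, hypothesis)
rwmd = relaxed_wmd(reference, hypothesis)
```

In this sketch, swapping the encoder for a lighter one changes only how the input embeddings are produced, while swapping the alignment step changes only which distance function is called. That separation is why the two efficiency levers can be studied independently.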

research
04/16/2021

IndoNLG: Benchmark and Resources for Evaluating Indonesian Natural Language Generation

A benchmark provides an ecosystem to measure the advancement of models w...
research
08/15/2022

MENLI: Robust Evaluation Metrics from Natural Language Inference

Recently proposed BERT-based evaluation metrics perform well on standard...
research
06/28/2017

Data-driven Natural Language Generation: Paving the Road to Success

We argue that there are currently two major bottlenecks to the commercia...
research
10/04/2017

Discourse Structure in Machine Translation Evaluation

In this article, we explore the potential of using sentence-level discou...
research
03/30/2022

Reproducibility Issues for BERT-based Evaluation Metrics

Reproducibility is of utmost concern in machine learning and natural lan...
research
10/26/2020

Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale

Automatic evaluation of language generation systems is a well-studied pr...
