Better Smatch = Better Parser? AMR evaluation is not so simple anymore

10/12/2022
by Juri Opitz, et al.

Recently, astonishing advances have been observed in AMR parsing, as measured by the structural Smatch metric. In fact, today's systems achieve performance levels that seem to surpass estimates of human inter-annotator agreement (IAA). It is therefore unclear how well Smatch (still) relates to human estimates of parse quality, since at this level fine-grained errors of similar structural weight may affect the AMR's meaning to very different degrees. We conduct an analysis of two popular and strong AMR parsers that – according to Smatch – reach quality levels on par with human IAA, and assess how human quality ratings relate to Smatch and other AMR metrics. Our main findings are: i) While high Smatch scores indicate otherwise, AMR parsing is far from being solved: we frequently find structurally small but semantically unacceptable errors that substantially distort sentence meaning. ii) Among high-performance parsers, a better Smatch score does not necessarily indicate consistently better parsing quality. To obtain a meaningful and comprehensive assessment of quality differences between parse(r)s, we recommend augmenting evaluations with macro statistics, additional metrics, and more human analysis.
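
The recommendation to augment corpus-level Smatch with macro statistics can be illustrated with a small sketch. Smatch scores a parse as an F1 over AMR graph triples matched under the best variable alignment; the standard corpus score pools triple counts across sentences (micro), whereas a macro statistic averages per-sentence F1. The sketch below is illustrative only and is not the authors' code: the per-sentence triple counts are hypothetical placeholders standing in for the output of an aligner such as the smatch tool.

# Illustrative sketch (not the paper's code): Smatch is an F1 over matched
# AMR triples. Micro (corpus-level) Smatch pools triple counts before
# computing one F1; a macro statistic averages per-sentence F1 instead.
# The per-sentence counts below are hypothetical placeholders.

from dataclasses import dataclass

@dataclass
class TripleCounts:
    matched: int    # triples matched under the best variable alignment
    in_parse: int   # triples in the candidate (parser) graph
    in_gold: int    # triples in the reference (gold) graph

def f1(matched: int, in_parse: int, in_gold: int) -> float:
    """Precision/recall/F1 over triples; returns 0.0 for empty or unmatched graphs."""
    if in_parse == 0 or in_gold == 0 or matched == 0:
        return 0.0
    p, r = matched / in_parse, matched / in_gold
    return 2 * p * r / (p + r)

def micro_smatch(counts: list[TripleCounts]) -> float:
    """Corpus-level Smatch: pool triple counts, then compute one F1."""
    m = sum(c.matched for c in counts)
    cp = sum(c.in_parse for c in counts)
    g = sum(c.in_gold for c in counts)
    return f1(m, cp, g)

def macro_smatch(counts: list[TripleCounts]) -> float:
    """Macro statistic: average the per-sentence F1 scores."""
    return sum(f1(c.matched, c.in_parse, c.in_gold) for c in counts) / len(counts)

# Example: one badly distorted parse barely moves the micro score,
# but it is clearly visible in the macro average.
corpus = [TripleCounts(48, 50, 50), TripleCounts(47, 50, 49), TripleCounts(2, 20, 22)]
print(f"micro Smatch: {micro_smatch(corpus):.3f}")   # ~0.80, dominated by the larger graphs
print(f"macro Smatch: {macro_smatch(corpus):.3f}")   # ~0.67, penalizes the failed sentence

In this toy example the single failed sentence hardly affects the pooled (micro) score but lowers the macro average noticeably, which is one reason macro statistics are a useful complement when comparing high-performing parsers.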


