Searching for a higher power in the human evaluation of MT

10/20/2022
by Johnny Tian-Zheng Wei, et al.

In MT evaluation, pairwise comparisons are conducted to identify the better system. In conducting the comparison, the experimenter must allocate a budget to collect Direct Assessment (DA) judgments. We provide a cost-effective way to spend the budget, but show that typical budget sizes often do not allow for solid comparison. Taking the perspective that the basis of solid comparison is in achieving statistical significance, we study the power (rate of achieving significance) on a large collection of pairwise DA comparisons. Due to the nature of statistical estimation, power is low for differentiating less than 1-2 DA points, and achieving a notable increase in power requires at least 2-3x more samples. Applying variance reduction alone will not yield these gains, so we must face the reality of undetectable differences and spending increases. In this context, we propose interim testing, an "early stopping" collection procedure that adaptively focuses the budget on pairs that are borderline significant and so yields more power per judgment collected. Interim testing can achieve up to a 27 budget, or 18
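
The abstract's claim that small DA differences need several times more samples can be illustrated with a textbook power calculation. The sketch below is not the paper's procedure: it assumes a two-sided two-sample z-test with equal, known variance, and the function name `z_test_power`, the standard deviation of 25 DA points, and the sample sizes are all hypothetical values chosen for illustration.

```python
from statistics import NormalDist


def z_test_power(delta, sigma, n, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test.

    delta: true mean difference between the two systems (DA points)
    sigma: per-judgment standard deviation (assumed equal for both systems)
    n:     number of judgments collected per system
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Standard error of the difference between two independent means.
    se = sigma * (2 / n) ** 0.5
    shift = delta / se
    # Probability that the test statistic falls in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


if __name__ == "__main__":
    # Hypothetical setting: a 1-point DA difference, sigma of 25 points.
    for n in (1500, 3000, 4500):
        print(n, round(z_test_power(delta=1.0, sigma=25.0, n=n), 3))
```

Under these assumptions the standard error shrinks only with the square root of `n`, which is one way to see why a notable gain in power calls for a multiple of the original budget and not a small increment.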

research
08/18/2023

How Discriminative Are Your Qrels? How To Study the Statistical Significance of Document Adjudication Methods

Creating test collections for offline retrieval evaluation requires huma...
research
11/04/2019

A General Early-Stopping Module for Crowdsourced Ranking

Crowdsourcing can be used to determine a total order for an object set (...
research
06/29/2021

Scientific Credibility of Machine Translation Research: A Meta-Evaluation of 769 Papers

This paper presents the first large-scale meta-evaluation of machine tra...
research
10/14/2019

Estimating post-editing effort: a study on human judgements, task-based and reference-based metrics of MT quality

Devising metrics to assess translation quality has always been at the co...
research
12/20/2017

Adaptive Mantel Test for Penalized Inference, with Applications to Imaging Genetics

Mantel's test (MT) for association is conducted by testing the linear re...
research
12/11/2017

A practical guide and software for analysing pairwise comparison experiments

Most popular strategies to capture subjective judgments from humans invo...
