
Searching for a higher power in the human evaluation of MT

10/20/2022
by Johnny Tian-Zheng Wei, et al.

In MT evaluation, pairwise comparisons are conducted to identify the better system. To conduct a comparison, the experimenter must allocate a budget to collect Direct Assessment (DA) judgments. We provide a cost-effective way to spend the budget, but show that typical budget sizes often do not allow for a solid comparison. Taking the perspective that the basis of a solid comparison is achieving statistical significance, we study the power (the rate of achieving significance) on a large collection of pairwise DA comparisons. Due to the nature of statistical estimation, power is low for differentiating less than 1-2 DA points, and a notable increase in power requires at least 2-3x more samples. Applying variance reduction alone will not yield these gains, so we must face the reality of undetectable differences and spending increases. In this context, we propose interim testing, an "early stopping" collection procedure that adaptively focuses the budget on pairs that are borderline significant and thereby yields more power per judgment collected. Interim testing can achieve up to a 27 budget, or 18
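To make the relationship between sample size and power concrete, here is a minimal sketch using a normal approximation for a two-sided two-sample z-test. It is not the procedure from the paper: the function name `z_test_power`, the standard deviation of 25 DA points, and the sample sizes are all assumptions chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(delta, sigma, n, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test.

    delta: true difference in mean DA score between the two systems
    sigma: per-judgment standard deviation (assumed equal for both systems)
    n:     number of judgments collected per system
    alpha: significance level
    """
    std_norm = NormalDist()
    z_crit = std_norm.inv_cdf(1 - alpha / 2)
    # Standard error of the difference between two sample means.
    se = sigma * sqrt(2.0 / n)
    shift = delta / se
    # Probability that the test statistic lands in either rejection region.
    return std_norm.cdf(shift - z_crit) + std_norm.cdf(-shift - z_crit)


# Hypothetical setting: a 1-point true difference, sigma of 25 DA points.
for n in (1000, 2000, 3000):
    print(n, round(z_test_power(delta=1.0, sigma=25.0, n=n), 3))
```

Because the standard error shrinks only with the square root of the number of judgments, power under these assumptions rises slowly as the budget grows, which is in line with the abstract's point that small differences stay hard to detect at typical budget sizes.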
