k-Rater Reliability: The Correct Unit of Reliability for Aggregated Human Annotations

03/24/2022
by Ka Wong, et al.

Since the inception of crowdsourcing, aggregation has been a common strategy for dealing with unreliable data. Aggregate ratings are more reliable than individual ones. However, many natural language processing (NLP) applications that rely on aggregate ratings report only the reliability of individual ratings, which is the incorrect unit of analysis. In these instances, the data reliability is under-reported. We propose k-rater reliability (kRR), a multi-rater generalization of inter-rater reliability (IRR), as the correct measure of data reliability for aggregated datasets. We conducted two replications of the WordSim-353 benchmark, and present empirical, analytical, and bootstrap-based methods for computing kRR on WordSim-353. These methods produce very similar results. We hope this discussion will nudge researchers to report kRR in addition to IRR.
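
To illustrate why reporting only IRR under-reports the reliability of aggregated data, here is a minimal sketch. The abstract does not say which analytical method is used, so this example assumes the classical Spearman-Brown prediction formula, which gives the reliability of the mean of k parallel ratings from the reliability of a single rating. The function name `spearman_brown` and the input values are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def spearman_brown(single_rating_reliability: float, k: int) -> float:
    """Predicted reliability of the mean of k ratings.

    Assumption: the k ratings are parallel measurements in the classical
    test theory sense, so the Spearman-Brown formula applies. This is an
    illustrative stand-in, not necessarily the method used in the paper.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    r = single_rating_reliability
    return k * r / (1 + (k - 1) * r)


if __name__ == "__main__":
    # Hypothetical single-rating reliability, for illustration only.
    irr = 0.5
    for k in (1, 3, 5, 10):
        print(f"k={k:2d}  predicted reliability of the mean: {spearman_brown(irr, k):.3f}")
```

With k = 1 the formula returns the single-rating reliability unchanged, and for any positive reliability it grows with k, which is the sense in which an aggregated dataset is more reliable than its individual ratings.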

research
03/15/2018

RankME: Reliable Human Ratings for Natural Language Generation

Human evaluation for natural language generation (NLG) often suffers fro...
research
07/19/2022

Selecting applicants based on multiple ratings: Using binary classification framework as an alternative to inter-rater reliability

Inter-rater reliability (IRR) has been the prevalent quality and precisi...
research
08/03/2023

Is GPT-4 a reliable rater? Evaluating Consistency in GPT-4 Text Ratings

This study investigates the consistency of feedback ratings generated by...
research
08/10/2023

Inter-Rater Reliability is Individual Fairness

In this note, a connection between inter-rater reliability and individua...
research
03/08/2023

Student's t-Distribution: On Measuring the Inter-Rater Reliability When the Observations are Scarce

In natural language processing (NLP) we always rely on human judgement a...
research
01/04/2017

Probabilistic Multigraph Modeling for Improving the Quality of Crowdsourced Affective Data

We proposed a probabilistic approach to joint modeling of participants' ...
research
05/06/2021

Reliability Testing for Natural Language Processing Systems

Questions of fairness, robustness, and transparency are paramount to add...
