With Little Power Comes Great Responsibility

10/13/2020
by   Dallas Card, et al.
5

Despite its importance to experimental design, statistical power (the probability that, given a real effect, an experiment will reject the null hypothesis) has largely been ignored by the NLP community. Underpowered experiments make it more difficult to discern the difference between statistical noise and meaningful model improvements, and increase the chances of exaggerated findings. By meta-analyzing a set of existing NLP papers and datasets, we characterize typical power for a variety of settings and conclude that underpowered experiments are common in the NLP literature. In particular, for several tasks in the popular GLUE benchmark, small test sets mean that most attempted comparisons to state of the art models will not be adequately powered. Similarly, based on reasonable assumptions, we find that the most typical experimental design for human rating studies will be underpowered to detect small model differences, of the sort that are frequently studied. For machine translation, we find that typical test sets of 2000 sentences have approximately 75 situation going forward, we give an overview of best practices for power analysis in NLP and release a series of notebooks to assist with future power analyses.

READ FULL TEXT

page 30

page 31

page 32

05/02/2020

Predicting Performance for Natural Language Processing Tasks

Given the complexity of combinations of tasks, languages, and domains in...
07/09/2020

Supplemental Studies for Simultaneous Goodness-of-Fit Testing

Testing to see whether a given data set comes from some specified distri...
10/20/2021

Better than Average: Paired Evaluation of NLP Systems

Evaluation in NLP is usually done by comparing the scores of competing s...
05/28/2020

Language (Technology) is Power: A Critical Survey of "Bias" in NLP

We survey 146 papers analyzing "bias" in NLP systems, finding that their...
08/19/2019

Polly Want a Cracker: Analyzing Performance of Parroting on Paraphrase Generation Datasets

Paraphrase generation is an interesting and challenging NLP task which h...
08/12/2016

Measuring the State of the Art of Automated Pathway Curation Using Graph Algorithms - A Case Study of the mTOR Pathway

This paper evaluates the difference between human pathway curation and c...
12/02/2021

How not to Lie with a Benchmark: Rearranging NLP Leaderboards

Comparison with a human is an essential requirement for a benchmark for ...

Code Repositories