With Little Power Comes Great Responsibility

by Dallas Card et al.

Despite its importance to experimental design, statistical power (the probability that, given a real effect, an experiment will reject the null hypothesis) has largely been ignored by the NLP community. Underpowered experiments make it more difficult to discern the difference between statistical noise and meaningful model improvements, and increase the chances of exaggerated findings. By meta-analyzing a set of existing NLP papers and datasets, we characterize typical power for a variety of settings and conclude that underpowered experiments are common in the NLP literature. In particular, for several tasks in the popular GLUE benchmark, small test sets mean that most attempted comparisons to state-of-the-art models will not be adequately powered. Similarly, based on reasonable assumptions, we find that the most typical experimental design for human rating studies will be underpowered to detect small model differences, of the sort that are frequently studied. For machine translation, we find that typical test sets of 2000 sentences have approximately 75% power to detect differences of 1 BLEU point. To improve the situation going forward, we give an overview of best practices for power analysis in NLP and release a series of notebooks to assist with future power analyses.
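To make the definition of power concrete, the following is a minimal simulation sketch, not the analysis or notebooks released with the paper. It assumes two models with hypothetical true accuracies, scored independently on a test set of `n` examples, and compared with an unpaired two-sided two-proportion z-test; all function names and parameter values are illustrative assumptions. Power is estimated as the fraction of simulated experiments in which the test rejects the null hypothesis.

```python
import math
import random


def two_proportion_p_value(correct_a, correct_b, n):
    """Two-sided p-value of a pooled two-proportion z-test (assumed test)."""
    p_pool = (correct_a + correct_b) / (2 * n)
    se = math.sqrt(2 * p_pool * (1 - p_pool) / n)
    if se == 0:
        # Both models got everything right (or wrong): no evidence of a difference.
        return 1.0
    z = (correct_a - correct_b) / n / se
    # Two-sided standard normal tail probability via the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_power(acc_a, acc_b, n, alpha=0.05, trials=2000, seed=0):
    """Estimate power as the share of simulated experiments that reject the null.

    acc_a and acc_b are hypothetical true accuracies; each model is scored
    independently on n examples (a simplifying assumption, since real
    comparisons on a shared test set are paired).
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        correct_a = sum(rng.random() < acc_a for _ in range(n))
        correct_b = sum(rng.random() < acc_b for _ in range(n))
        if two_proportion_p_value(correct_a, correct_b, n) < alpha:
            rejections += 1
    return rejections / trials
```

Calling, for example, `simulate_power(0.90, 0.92, n=300)` and then the same comparison with a larger `n` shows how the estimated power grows with test set size, while setting `acc_a == acc_b` gives a rejection rate close to `alpha`. A paired test would be more appropriate when both models are evaluated on the same examples; the unpaired version is used here only to keep the sketch short.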


