With Little Power Comes Great Responsibility
Despite its importance to experimental design, statistical power (the probability that, given a real effect, an experiment will reject the null hypothesis) has largely been ignored by the NLP community. Underpowered experiments make it more difficult to discern the difference between statistical noise and meaningful model improvements, and increase the chances of exaggerated findings. By meta-analyzing a set of existing NLP papers and datasets, we characterize typical power for a variety of settings and conclude that underpowered experiments are common in the NLP literature. In particular, for several tasks in the popular GLUE benchmark, small test sets mean that most attempted comparisons to state-of-the-art models will not be adequately powered. Similarly, based on reasonable assumptions, we find that the most typical experimental design for human rating studies will be underpowered to detect small model differences, of the sort that are frequently studied. For machine translation, we find that typical test sets of 2000 sentences have approximately 75% power to detect differences of 1 BLEU point. To improve the situation going forward, we give an overview of best practices for power analysis in NLP and release a series of notebooks to assist with future power analyses.
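To make the definition of power concrete, here is a minimal simulation sketch. It is not taken from the authors' notebooks. It assumes two classifiers are compared on the same test items with an exact McNemar test, and that the chance of each kind of disagreement per item is known in advance. The function names and the parameters `p_only_a`, `p_only_b`, `n_items`, and `n_sims` are hypothetical, chosen only for illustration.

```python
import math
import random


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts.

    b = items only model A got right, c = items only model B got right.
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, so no evidence of a difference
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def estimate_power(n_items, p_only_a, p_only_b, alpha=0.05, n_sims=1000, seed=0):
    """Estimate power as the fraction of simulated test sets that reject the null.

    p_only_a and p_only_b are the ASSUMED per-item probabilities that only
    model A (or only model B) is correct; their difference is the accuracy gap.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        b = c = 0
        for _ in range(n_items):
            u = rng.random()
            if u < p_only_a:
                b += 1
            elif u < p_only_a + p_only_b:
                c += 1
        if mcnemar_exact_p(b, c) < alpha:
            rejections += 1
    return rejections / n_sims


# Illustrative values only: a 3-point accuracy gap on a 500-item test set.
power = estimate_power(n_items=500, p_only_a=0.10, p_only_b=0.07, n_sims=500)
print(f"estimated power: {power:.2f}")
```

Rerunning with a larger `n_items` or a larger gap between `p_only_a` and `p_only_b` shows how test-set size and effect size drive power. The estimate is only as good as the assumed disagreement rates.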