Show Your Work: Improved Reporting of Experimental Results

09/06/2019
by   Jesse Dodge, et al.
26

Research in natural language processing proceeds, in part, by demonstrating that new models achieve superior performance (e.g., accuracy) on held-out test data, compared to previous results. In this paper, we demonstrate that test-set performance scores alone are insufficient for drawing accurate conclusions about which model performs best. We argue for reporting additional details, especially performance on validation data obtained during model development. We present a novel technique for doing so: expected validation performance of the best-found model as a function of computation budget (i.e., the number of hyperparameter search trials or the overall training time). Using our approach, we find multiple recent model comparisons where authors would have reached a different conclusion if they had used more (or less) computation. Our approach also allows us to estimate the amount of computation required to obtain a given accuracy; applying it to several recently published results yields massive variation across papers, from hours to weeks. We conclude with a set of best practices for reporting experimental results which allow for robust future comparisons, and provide code to allow researchers to use our technique.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
04/28/2020

Showing Your Work Doesn't Always Work

In natural language processing, a recently popular line of work explores...
research
04/08/2021

Stable deep neural network architectures for mitochondria segmentation on electron microscopy volumes

Electron microscopy (EM) allows the identification of intracellular orga...
research
10/01/2021

Expected Validation Performance and Estimation of a Random Variable's Maximum

Research in NLP is often supported by experimental results, and improved...
research
03/20/2023

Self-Improving-Leaderboard(SIL): A Call for Real-World Centric Natural Language Processing Leaderboards

Leaderboard systems allow researchers to objectively evaluate Natural La...
research
08/02/2021

Underreporting of errors in NLG output, and what to do about it

We observe a severe under-reporting of the different kinds of errors tha...
research
01/31/2019

Towards Machine-assisted Meta-Studies: The Hubble Constant

We present an approach for automatic extraction of measured values from ...

Please sign up or login with your details

Forgot password? Click here to reset