We Need to Talk About Random Splits

05/01/2020
by Anders Søgaard, et al.

Gorman and Bedrick (2019) recently argued for using random splits rather than standard splits in NLP experiments. We argue that random splits, like standard splits, lead to overly optimistic performance estimates. In some cases, even worst-case splits under-estimate the error observed on new samples of in-domain data, i.e., the data that models should minimally generalize to at test time. This refutes the common conjecture that sampling bias can be corrected for by re-weighting the data (Shimodaira, 2000; Shah et al., 2020). Instead of using multiple random splits, we propose that future benchmarks include multiple, independent test sets.
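The contrast between a random split and a biased (e.g., worst-case) split can be illustrated with a small experiment. The sketch below is not the authors' code; the toy corpus, the logistic-regression classifier, and the length-based splitting heuristic are illustrative assumptions, chosen only to show how a split that separates train and test by a surface property can yield a very different error estimate than a random split of the same data.

```python
# Minimal sketch (assumed setup, not the paper's experiments): compare a random
# train/test split with a length-biased split on a toy text-classification task.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy corpus in which text length correlates with the label, so a split by
# length induces a distribution shift between training and test data.
texts = [
    "great movie", "loved it", "fantastic acting and a truly memorable story",
    "terrible plot", "awful pacing and characters I could not care about at all",
    "boring", "wonderful soundtrack that stayed with me for days afterwards",
    "bad", "superb", "dreadful dialogue delivered without any conviction",
] * 20
labels = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 0] * 20)

X = CountVectorizer().fit_transform(texts)

def accuracy(train_idx, test_idx):
    """Train on one index set, evaluate on the other."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[train_idx], labels[train_idx])
    return clf.score(X[test_idx], labels[test_idx])

# 1) Random split: train and test are drawn from the same distribution.
idx = np.arange(len(texts))
train_r, test_r = train_test_split(idx, test_size=0.3, random_state=0)

# 2) Biased split: train on the shortest texts, test on the longest ones,
#    simulating the kind of drift a random split never exposes.
lengths = np.array([len(t.split()) for t in texts])
order = np.argsort(lengths)
cut = int(0.7 * len(order))
train_b, test_b = order[:cut], order[cut:]

print(f"random split accuracy:        {accuracy(train_r, test_r):.3f}")
print(f"length-biased split accuracy: {accuracy(train_b, test_b):.3f}")
```

In this hypothetical setup, the random split typically reports a higher accuracy than the length-biased split, which is the kind of gap between estimated and observed performance that motivates evaluating on multiple, independent test sets rather than on resplits of a single sample.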
