We Need to Talk About Random Splits

by Anders Søgaard et al.

Gorman and Bedrick (2019) recently argued for using random splits rather than standard splits in NLP experiments. We argue that random splits, like standard splits, lead to overly optimistic performance estimates. In some cases, even worst-case splits underestimate the error observed on new samples of in-domain data, i.e., the data that models should minimally generalize to at test time. This disproves the common conjecture that such bias can be corrected for by re-weighting the data (Shimodaira, 2000; Shah et al., 2020). We therefore propose that future benchmarks include multiple, independent test sets rather than relying on multiple random splits.
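To make the contrast concrete, here is a minimal sketch (not from the paper) of a standard random split versus a heuristic "worst-case" split that sorts the data by some property, such as input length, so that the train and test distributions differ maximally along that axis. The data, function names, and the choice of length as the sorting key are illustrative assumptions.

```python
import random

def random_split(data, test_frac=0.2, seed=0):
    """Standard random split: the test set is drawn i.i.d. from the same sample."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]

def heuristic_worst_case_split(data, key, test_frac=0.2):
    """Adversarial-style split: sort by a property (here, length) so the
    test set lies in a region of the input space the training set never covers."""
    ordered = sorted(data, key=key)
    cut = int(len(ordered) * (1 - test_frac))
    return ordered[:cut], ordered[cut:]

# Toy "sentences" of increasing length (illustrative data only).
sentences = ["token " * n for n in range(1, 101)]

train_r, test_r = random_split(sentences)
train_w, test_w = heuristic_worst_case_split(sentences, key=len)

# Under the random split, train/test length distributions overlap;
# under the length-sorted split, the test set contains only the longest inputs,
# so any length-sensitive model is evaluated outside its training distribution.
```

A model scored on `test_r` sees inputs like those it trained on, which is exactly why the paper argues such estimates are overly optimistic relative to new in-domain samples.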

