Stop Uploading Test Data in Plain Text: Practical Strategies for Mitigating Data Contamination by Evaluation Benchmarks

05/17/2023
by Alon Jacovi, et al.

Data contamination has become especially prevalent and challenging with the rise of models pretrained on very large, automatically crawled corpora. For closed models, the training data is a trade secret, and even for open models it is not trivial to ascertain whether a particular test instance has been compromised. Strategies such as live leaderboards with hidden answers, or using test data that is guaranteed to be unseen, are expensive and become fragile with time. Assuming that all relevant actors value clean test data and will cooperate to mitigate data contamination, what can be done? We propose three strategies that can make a difference: (1) encrypt test data that is made public with a public key, and license it to disallow derivative distribution; (2) demand training exclusion controls from closed API holders, and protect your test data by refusing to evaluate until those demands are met; (3) for test data based on internet text, avoid data that appears with its solution on the internet, and release the context of internet-derived data along with the data. These strategies are practical and can be effective in preventing data contamination and allowing trustworthy evaluation of models' capabilities.
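As a rough illustration of strategy (3), the sketch below drops candidate test instances whose question and answer both occur on the same source page. It is a minimal sketch under stated assumptions, not the authors' procedure: the function names (`normalize`, `appears_with_solution`, `filter_instances`), the `question`/`answer` dictionary layout, and the exact-match criterion after normalization are all hypothetical choices made for this example.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse every run of non-alphanumeric characters into one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def appears_with_solution(page_text: str, question: str, answer: str) -> bool:
    """Return True if the normalized question and answer both occur in the page.

    Assumption: exact phrase match on whole words is a good enough signal.
    A real pipeline would likely need fuzzy or n-gram matching.
    """
    page = f" {normalize(page_text)} "
    return (
        f" {normalize(question)} " in page
        and f" {normalize(answer)} " in page
    )


def filter_instances(instances, pages):
    """Keep only instances that no page shows together with their solution."""
    return [
        inst
        for inst in instances
        if not any(
            appears_with_solution(page, inst["question"], inst["answer"])
            for page in pages
        )
    ]


# Hypothetical crawled pages and candidate test instances.
pages = [
    "Quiz night! What is the capital of France? Answer: Paris.",
    "A travel blog about visiting Canberra in spring.",
]
instances = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "What is the capital of Australia?", "answer": "Canberra"},
]

kept = filter_instances(instances, pages)
```

In this toy input the first instance is dropped because its page states both the question and the answer, while the second is kept because its answer appears online without the question. Exact matching will miss paraphrased solutions, so a filter like this can only lower the risk of contamination rather than rule it out.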

