Time Travel in LLMs: Tracing Data Contamination in Large Language Models

08/16/2023
by Shahriar Golchin, et al.

Data contamination, i.e., the presence of test data from downstream tasks in the training data of large language models (LLMs), is a major potential issue in measuring LLMs' true effectiveness on other tasks. We propose a straightforward yet effective method for identifying data contamination within LLMs. At its core, our approach first identifies potential contamination in individual instances drawn from a small random sample; using this information, it then assesses whether an entire dataset partition is contaminated. To estimate contamination of individual instances, we employ a "guided instruction": a prompt consisting of the dataset name, partition type, and the initial segment of a reference instance, asking the LLM to complete it. An instance is flagged as contaminated if the LLM's output either exactly or closely matches the latter segment of the reference. To determine whether an entire partition is contaminated, we propose two ideas. The first marks a dataset partition as contaminated if its average overlap score with the reference instances (as measured by ROUGE or BLEURT) is statistically significantly better under the guided instruction than under a general instruction that does not include the dataset and partition names. The second marks a dataset as contaminated if a classifier based on GPT-4 with in-context learning prompting flags multiple instances as contaminated. Our best method achieves an accuracy between 92% and 100% in detecting contamination across datasets containing train and test/validation partitions, when contrasted with manual evaluation by a human expert. Further, our findings indicate that GPT-4 is contaminated with the AG News, WNLI, and XSum datasets.
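The instance-level probe described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the actual LLM call is stubbed out, ROUGE-L is computed directly via longest common subsequence rather than through a metrics library, the exact prompt wording is paraphrased, and the 0.75 decision threshold is a hypothetical choice for illustration only.

```python
def lcs_len(a, b):
    # Classic dynamic-programming longest-common-subsequence length.
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if a[i] == b[j] else max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n]

def rouge_l_f1(candidate, reference):
    # ROUGE-L F1 over whitespace tokens: precision/recall of the LCS.
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

def guided_prompt(dataset, split, first_piece):
    # "Guided instruction": names the dataset and partition explicitly.
    return (f"You are provided with the first piece of an instance from the "
            f"{split} split of the {dataset} dataset. Complete the second piece "
            f"exactly as it appears in the dataset.\n\n"
            f"First piece: {first_piece}\nSecond piece:")

def general_prompt(first_piece):
    # "General instruction": the same completion task, with no dataset
    # or partition identifiers to cue memorization.
    return (f"Complete the second piece based on the first piece.\n\n"
            f"First piece: {first_piece}\nSecond piece:")

def flag_instance(llm_completion, reference_second_piece, threshold=0.75):
    # Hypothetical threshold: a near-exact reproduction of the reference's
    # latter segment is taken as evidence of contamination.
    return rouge_l_f1(llm_completion, reference_second_piece) >= threshold
```

The partition-level decision would then compare the average `rouge_l_f1` scores obtained under `guided_prompt` versus `general_prompt` over the sampled instances, testing whether the guided scores are statistically significantly higher.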
