Data Portraits: Recording Foundation Model Training Data

03/06/2023
by Marc Marone, et al.

Foundation models are trained on increasingly immense and opaque datasets. Even though these models are now key to building AI systems, it can be difficult to answer a straightforward question: has the model already encountered a given example during training? We therefore propose widespread adoption of Data Portraits: artifacts that record training data and allow for downstream inspection. First, we outline the properties of such an artifact and discuss how existing solutions can be used to increase transparency. We then propose and implement a solution based on data sketching, stressing fast and space-efficient querying. Using our tool, we document a popular large language modeling corpus (the Pile) and show that our solution enables answering questions about test set leakage and model plagiarism. Our tool is lightweight and fast, costing only 3% of the dataset size in overhead. We release our tools at dataportraits.org and call on dataset and model creators to release Data Portraits as a complement to current documentation practices.
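
To make the idea of a sketch-based membership query concrete, the following is a minimal illustration and not the authors' implementation. It assumes a plain Bloom filter over overlapping character n-grams; the names BloomFilter, char_ngrams, build_sketch, and overlap_fraction, as well as the filter size, hash count, and n-gram width, are hypothetical choices made for this example. A Bloom filter never reports a stored item as missing, but it can occasionally report an unseen item as present.

```python
import hashlib
from typing import Iterable, Iterator, List


class BloomFilter:
    """Fixed-size bit array with k hash positions per item (illustrative only)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Double hashing: derive k positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd, so never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def char_ngrams(text: str, width: int) -> List[str]:
    """All overlapping character n-grams of the given width."""
    return [text[i : i + width] for i in range(len(text) - width + 1)]


def build_sketch(
    corpus: Iterable[str], width: int, num_bits: int = 1 << 20, num_hashes: int = 7
) -> BloomFilter:
    """Record every n-gram of every document in a single Bloom filter."""
    sketch = BloomFilter(num_bits, num_hashes)
    for doc in corpus:
        for gram in char_ngrams(doc, width):
            sketch.add(gram)
    return sketch


def overlap_fraction(sketch: BloomFilter, query: str, width: int) -> float:
    """Fraction of the query's n-grams that the sketch reports as seen."""
    grams = char_ngrams(query, width)
    if not grams:
        return 0.0
    return sum(gram in sketch for gram in grams) / len(grams)
```

A high overlap fraction for a benchmark example or a model generation suggests that the text, or something very close to it, appeared in the recorded corpus. The sketch stores only bits, never the text itself, which is what keeps the storage overhead small relative to the dataset.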

