DataComp: In search of the next generation of multimodal datasets

04/27/2023
by Samir Yitzhak Gadre, et al.

Large multimodal datasets have been instrumental in recent breakthroughs such as CLIP, Stable Diffusion, and GPT-4. At the same time, datasets rarely receive the same research attention as model architectures or training algorithms. To address this shortcoming in the machine learning ecosystem, we introduce DataComp, a benchmark where the training code is fixed and researchers innovate by proposing new training sets. We provide a testbed for dataset experiments centered around a new candidate pool of 12.8B image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training code and testing on 38 downstream test sets. Our benchmark consists of multiple scales, with four candidate pool sizes and associated compute budgets ranging from 12.8M to 12.8B samples seen during training. This multi-scale design facilitates the study of scaling trends and makes the benchmark accessible to researchers with varying resources. Our baseline experiments show that the DataComp workflow is a promising way of improving multimodal datasets. We introduce DataComp-1B, a dataset created by applying a simple filtering algorithm to the 12.8B candidate pool. The resulting 1.4B subset enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet. Our new ViT-L/14 model outperforms a larger ViT-g/14 trained on LAION-2B by 0.7 percentage points while requiring 9x less training compute. We also outperform OpenAI's CLIP ViT-L/14, which was trained with the same compute budget as our model, by 3.7 percentage points. These gains highlight the potential for improving model performance by carefully curating training sets. We view DataComp-1B as only the first step and hope that DataComp paves the way toward the next generation of multimodal datasets.
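As a concrete illustration of the kind of filtering baseline the benchmark invites, the sketch below scores candidate image-text pairs with a pretrained CLIP model and keeps only pairs above a similarity threshold. This is a minimal sketch, not the exact DataComp-1B recipe: the model choice, the 0.3 threshold, and the helper names `clip_score` and `filter_pool` are illustrative assumptions.

```python
# Minimal sketch of a CLIP-score filtering baseline (illustrative, not the
# exact DataComp-1B recipe). Requires: pip install torch open_clip_torch pillow
import torch
import open_clip
from PIL import Image

# Load a pretrained CLIP model to score image-text pairs.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()


@torch.no_grad()
def clip_score(image_path: str, caption: str) -> float:
    """Cosine similarity between the CLIP image and text embeddings of a pair."""
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    text = tokenizer([caption])
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    return (image_features @ text_features.T).item()


def filter_pool(candidates, threshold=0.3):
    """Keep candidate pairs whose image-caption CLIP score exceeds the threshold.

    `candidates` is assumed to be a list of dicts with "image" and "caption" keys.
    """
    return [c for c in candidates if clip_score(c["image"], c["caption"]) >= threshold]
```

In the DataComp workflow, a filtered subset like the one returned by `filter_pool` would then be fed to the fixed CLIP training code and evaluated on the 38 downstream test sets.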

