LAION-5B: An open large-scale dataset for training next generation image-text models

10/16/2022
by Christoph Schuhmann, et al.

Groundbreaking language-vision architectures like CLIP and DALL-E proved the utility of training on large amounts of noisy image-text data, without relying on the expensive, accurately labeled datasets used in standard unimodal supervised vision learning. The resulting models showed strong text-guided image generation and transfer to downstream tasks, while performing remarkably well at zero-shot classification with noteworthy out-of-distribution robustness. Since then, large-scale language-vision models like ALIGN, BASIC, GLIDE, Flamingo and Imagen have made further improvements. Studying the training and capabilities of such models requires datasets containing billions of image-text pairs. Until now, no datasets of this size have been made openly available to the broader research community. To address this problem and democratize research on large-scale multi-modal models, we present LAION-5B, a dataset consisting of 5.85 billion CLIP-filtered image-text pairs, of which 2.32 billion contain English-language text. We show successful replication and fine-tuning of foundational models like CLIP, GLIDE and Stable Diffusion using the dataset, and discuss further experiments enabled by an openly available dataset of this scale. Additionally, we provide several nearest-neighbor indices, an improved web interface for dataset exploration and subset generation, and scores for watermark, NSFW, and toxic content detection. Announcement page: https://laion.ai/laion-5b-a-new-era-of-open-large-scale-multi-modal-datasets/
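The "CLIP-filtered" part of the pipeline means that each crawled (image, alt-text) pair is kept only if the cosine similarity of its CLIP image and text embeddings exceeds a threshold. The sketch below illustrates that filtering step using the open_clip library; the model variant, threshold value, and file path are illustrative assumptions based on what the paper reports, not the exact production pipeline.

```python
# Minimal sketch of CLIP-similarity filtering for an (image, alt-text) pair.
# Assumptions: open_clip with the OpenAI ViT-B/32 weights and a 0.28 cosine
# similarity threshold (the value reported for the English subset; 0.26 was
# reported for multilingual subsets). File name and caption are placeholders.
import torch
import open_clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model = model.to(device).eval()

def clip_similarity(image_path: str, caption: str) -> float:
    """Cosine similarity between the CLIP image and text embeddings."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = tokenizer([caption]).to(device)
    with torch.no_grad():
        img_emb = model.encode_image(image)
        txt_emb = model.encode_text(text)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb * txt_emb).sum().item()

THRESHOLD = 0.28  # English-subset threshold reported in the paper
keep_pair = clip_similarity("example.jpg", "a photo of a cat on a sofa") >= THRESHOLD
```

The same embeddings underlie the released nearest-neighbor indices: because every retained sample has a precomputed CLIP embedding, approximate k-NN search over those embeddings is what powers the web interface for dataset exploration and subset generation mentioned above.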


Related research

03/22/2022
WuDaoMM: A large-scale Multi-Modal Dataset for Pre-training models
Compared with the domain-specific model, the vision-language pre-trainin...

11/03/2021
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs
Multi-modal language-vision models trained on hundreds of millions of im...

06/20/2023
Quilt-1M: One Million Image-Text Pairs for Histopathology
Recent accelerations in multi-modal applications have been made possible...

08/16/2023
Learning to Generate Semantic Layouts for Higher Text-Image Correspondence in Text-to-Image Synthesis
Existing text-to-image generation approaches have set high standards for...

03/17/2023
GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation
Text-to-image (T2I) models based on diffusion processes have achieved re...

10/05/2022
DALL-E-Bot: Introducing Web-Scale Diffusion Models to Robotics
We introduce the first work to explore web-scale diffusion models for ro...

10/05/2021
Multimodal datasets: misogyny, pornography, and malignant stereotypes
We have now entered the era of trillion parameter machine learning model...
