OBELISC: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents

06/21/2023
by   Hugo Laurençon, et al.
Large multimodal models trained on natural documents, which interleave images and text, outperform models trained on image-text pairs on various multimodal benchmarks that require reasoning over one or multiple images to generate text. However, the datasets used to train these models have not been released, and the collection process has not been fully specified. We introduce the OBELISC dataset, an open web-scale filtered dataset of interleaved image-text documents comprising 141 million web pages extracted from Common Crawl, 353 million associated images, and 115 billion text tokens. We describe the dataset creation process, present comprehensive filtering rules, and provide an analysis of the dataset's content. To show the viability of OBELISC, we train an 80-billion-parameter vision and language model on the dataset and obtain competitive performance on various multimodal benchmarks. We release the code to reproduce the dataset along with the dataset itself.
