RedCaps: web-curated image-text data created by the people, for the people

11/22/2021
by Karan Desai, et al.

Large datasets of paired images and text have become increasingly popular for learning generic representations for vision and vision-and-language tasks. Such datasets have been built by querying search engines or collecting HTML alt-text; since web data is noisy, they require complex filtering pipelines to maintain quality. We explore alternate data sources to collect high-quality data with minimal filtering. We introduce RedCaps, a large-scale dataset of 12M image-text pairs collected from Reddit. Images and captions from Reddit depict and describe a wide variety of objects and scenes. We collect data from a manually curated set of subreddits, which give coarse image labels and allow us to steer the dataset composition without labeling individual instances. We show that captioning models trained on RedCaps produce rich and varied captions preferred by humans, and learn visual representations that transfer to many downstream tasks.
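The curation idea in the abstract — filter by a hand-picked subreddit list so the subreddit name acts as a coarse label, with no per-instance annotation — can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the field names ("subreddit", "image_url", "caption") and the subreddit list are assumptions for the example.

```python
from collections import defaultdict

# Hypothetical curation list; the real dataset uses a manually chosen
# set of subreddits, which this example only approximates.
CURATED_SUBREDDITS = {"itookapicture", "birdpics", "food"}

def filter_and_group(records):
    """Keep only image-text pairs from curated subreddits and group them,
    so the subreddit name doubles as a coarse label for its images."""
    by_subreddit = defaultdict(list)
    for rec in records:
        name = rec["subreddit"].lower()
        if name in CURATED_SUBREDDITS:
            by_subreddit[name].append((rec["image_url"], rec["caption"]))
    return dict(by_subreddit)

# Toy records standing in for scraped Reddit posts.
sample = [
    {"subreddit": "birdpics", "image_url": "https://i.redd.it/a.jpg",
     "caption": "a blue jay at the feeder"},
    {"subreddit": "memes", "image_url": "https://i.redd.it/b.jpg",
     "caption": "when the build finally passes"},
]
grouped = filter_and_group(sample)  # only the curated subreddit survives
```

Because the subreddit list is fixed up front, dataset composition can be steered by editing the list rather than by relabeling or refiltering individual instances.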


Related research:

- Noise-aware Learning from Web-crawled Image-Text Data for Image Captioning (12/27/2022)
- Improving Multimodal Datasets with Image Captioning (07/19/2023)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision (02/11/2021)
- Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training (07/05/2020)
- Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision (10/24/2022)
- CiT: Curation in Training for Effective Vision-Language Data (01/05/2023)
- An Empirical Exploration in Quality Filtering of Text Data (09/02/2021)
