Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts

02/17/2021
by Soravit Changpinyo, et al.

The availability of large-scale image captioning and visual question answering datasets has contributed significantly to recent successes in vision-and-language pre-training. However, these datasets are often collected with overly restrictive requirements, inherited from their original target tasks (e.g., image caption generation), which limit the resulting dataset's scale and diversity. We take a step further in pushing the limits of vision-and-language pre-training data by relaxing the data collection pipeline used in Conceptual Captions 3M (CC3M) [Sharma et al. 2018] and introduce Conceptual 12M (CC12M), a dataset of 12 million image-text pairs specifically meant for vision-and-language pre-training. We analyze this dataset and benchmark its effectiveness against CC3M on multiple downstream tasks, with an emphasis on long-tail visual recognition. The quantitative and qualitative results clearly illustrate the benefit of scaling up pre-training data for vision-and-language tasks, as indicated by new state-of-the-art results on both the nocaps and Conceptual Captions benchmarks.
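The contribution here is the dataset itself rather than a model: the public CC12M release distributes the pairs as a tab-separated file of image URLs and captions (no image bytes are hosted), so consumers fetch the images themselves. Below is a minimal Python sketch, assuming a cc12m.tsv file with one url<TAB>caption row per line as in the public release; the function name fetch_pairs, the output directory, and the error handling are illustrative choices, not official tooling.

```python
import os
from urllib.request import Request, urlopen

def fetch_pairs(tsv_path="cc12m.tsv", out_dir="images", limit=100):
    """Stream (image_url, caption) rows from a CC12M-style TSV and
    download the referenced images locally. Illustrative sketch only."""
    os.makedirs(out_dir, exist_ok=True)
    pairs = []
    with open(tsv_path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if len(pairs) >= limit:
                break
            # Each row is "<image_url>\t<caption>" in the public release.
            url, caption = line.rstrip("\n").split("\t", 1)
            try:
                req = Request(url, headers={"User-Agent": "cc12m-demo"})
                data = urlopen(req, timeout=10).read()
            except Exception:
                continue  # broken links are expected at web scale
            path = os.path.join(out_dir, f"{i:08d}.jpg")
            with open(path, "wb") as img:
                img.write(data)
            pairs.append((path, caption))
    return pairs

if __name__ == "__main__":
    for path, caption in fetch_pairs(limit=5):
        print(path, "->", caption[:60])
```

In practice, downloads at this scale are parallelized and failures logged, since a fraction of the URLs in any web-scale release go stale over time; the sketch keeps things sequential for clarity.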


Related research

- Scaling Up Vision-Language Pre-training for Image Captioning (11/24/2021)
- Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training (07/05/2020)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision (02/11/2021)
- FuseCap: Leveraging Large Language Models to Fuse Visual Data into Enriched Image Captions (05/28/2023)
- Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner (05/19/2023)
- Scaling Language-Image Pre-training via Masking (12/01/2022)
- Too Large; Data Reduction for Vision-Language Pre-Training (05/31/2023)
