Improving Multimodal Datasets with Image Captioning

07/19/2023
by Thao Nguyen, et al.

Massive web datasets play a key role in the success of large vision-language models like CLIP and Flamingo. However, raw web data is noisy, and existing filtering methods to reduce noise often come at the expense of data diversity. Our work focuses on caption quality as one major source of noise, and studies how generated captions can increase the utility of web-scraped datapoints with nondescript text. Through exploring different mixing strategies for raw and generated captions, we outperform the best filtering method proposed by the DataComp benchmark by 2% on ImageNet and 4% on average across 38 tasks, given a candidate pool of 128M image-text pairs. Our best approach is also 2x better at Flickr and MS-COCO retrieval. We then analyze what makes synthetic captions an effective source of text supervision. In experimenting with different image captioning models, we also demonstrate that the performance of a model on standard image captioning benchmarks (e.g., NoCaps CIDEr) is not a reliable indicator of the utility of the captions it generates for multimodal training. Finally, our experiments with using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text, as well as the importance of image curation with increasing training data quantity.
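To make the idea of "mixing strategies for raw and generated captions" concrete, here is a minimal sketch of one possible mixing scheme. It assumes a hypothetical `generate_caption` callable wrapping a BLIP-2-style captioner and uses a crude length heuristic to flag nondescript alt-text; it is an illustration of the general approach, not the paper's exact recipe.

```python
# Sketch: mix raw web captions with model-generated captions before
# CLIP-style training. `generate_caption` is a hypothetical interface
# (image path -> caption string); swap in any captioning model.
import random
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Sample:
    image_path: str
    raw_caption: Optional[str]  # web-scraped alt-text; may be empty or uninformative


def mix_captions(
    samples: List[Sample],
    generate_caption: Callable[[str], str],
    synthetic_fraction: float = 0.5,  # share of samples that get a synthetic caption
    seed: int = 0,
) -> List[dict]:
    """Return image-text pairs where a random subset of raw captions is
    replaced by generated ones; samples with missing or very short raw
    text always fall back to the generated caption."""
    rng = random.Random(seed)
    mixed = []
    for s in samples:
        nondescript = s.raw_caption is None or len(s.raw_caption.split()) < 3
        use_synthetic = nondescript or rng.random() < synthetic_fraction
        caption = generate_caption(s.image_path) if use_synthetic else s.raw_caption
        mixed.append({"image": s.image_path, "text": caption})
    return mixed
```

In practice the mixing ratio, the rule for deciding which raw captions to keep, and whether to train on both caption sources per image are the knobs being explored; the sketch above only shows the simplest replace-or-keep variant.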


