Data Curation for Image Captioning with Text-to-Image Generative Models

05/05/2023
by   Wenyan Li, et al.

Recent advances in image captioning are mainly driven by large-scale vision-language pretraining, relying heavily on computational resources and increasingly large multimodal datasets. Instead of scaling up pretraining data, we ask whether it is possible to improve performance by improving the quality of the samples in existing datasets. We pursue this question through two approaches to data curation: one that assumes that some examples should be avoided due to mismatches between the image and caption, and one that assumes that the mismatch can be addressed by replacing the image, for which we use the state-of-the-art Stable Diffusion model. These approaches are evaluated using the BLIP model on MS COCO and Flickr30K in both finetuning and few-shot learning settings. Our simple yet effective approaches consistently outperform baselines, indicating that better image captioning models can be trained by curating existing resources. Finally, we conduct a human study to understand the errors made by the Stable Diffusion model and highlight directions for future work in text-to-image generation.
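The two curation strategies described above can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not say how image-caption mismatches are detected, so the per-sample `mismatch_scores`, the `threshold`, and the `generate_image` callable (a stand-in for a text-to-image model such as Stable Diffusion) are all hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional, Sequence


@dataclass(frozen=True)
class Sample:
    image: str    # path or identifier of the image
    caption: str  # reference caption paired with the image


def curate(
    samples: Sequence[Sample],
    mismatch_scores: Sequence[float],
    threshold: float,
    generate_image: Optional[Callable[[str], str]] = None,
) -> List[Sample]:
    """Curate image-caption pairs (illustrative assumption, not the paper's code).

    A sample whose hypothetical mismatch score exceeds `threshold` is either
    dropped (when `generate_image` is None) or kept with its image replaced
    by one generated from its caption.
    """
    if len(samples) != len(mismatch_scores):
        raise ValueError("samples and mismatch_scores must have the same length")

    curated: List[Sample] = []
    for sample, score in zip(samples, mismatch_scores):
        if score <= threshold:
            # Image and caption are assumed to match: keep the pair as is.
            curated.append(sample)
        elif generate_image is not None:
            # Replacement strategy: swap in an image synthesized from the caption.
            curated.append(replace(sample, image=generate_image(sample.caption)))
        # Removal strategy: otherwise the mismatched pair is skipped.
    return curated
```

Calling `curate(samples, scores, threshold)` applies the removal strategy, while passing a `generate_image` callable switches to image replacement. In practice that callable would wrap a text-to-image model, and the scores would come from whatever mismatch signal is chosen.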

research
05/03/2023

Multimodal Data Augmentation for Image Captioning using Diffusion Models

Image captioning, an important vision-language task, often requires a tr...
research
11/13/2022

Large-Scale Bidirectional Training for Zero-Shot Image Captioning

When trained on large-scale datasets, image captioning models can unders...
research
05/24/2022

On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization

Integrating vision and language has gained notable attention following t...
research
05/25/2023

On Architectural Compression of Text-to-Image Diffusion Models

Exceptional text-to-image (T2I) generation results of Stable Diffusion m...
research
08/08/2022

Analog Bits: Generating Discrete Data using Diffusion Models with Self-Conditioning

We present Bit Diffusion: a simple and generic approach for generating d...
research
08/18/2022

Discovering Bugs in Vision Models using Off-the-shelf Image Generation and Captioning

Automatically discovering failures in vision models under real-world set...
research
05/25/2022

Mutual Information Divergence: A Unified Metric for Multimodal Generative Models

Text-to-image generation and image captioning are recently emerged as a ...
