Text-Only Image Captioning with Multi-Context Data Generation

05/29/2023
by Feipeng Ma, et al.

Text-only Image Captioning (TIC) aims to build a model that can accurately describe images while being trained on text alone, without paired images. Recently, diffusion models have demonstrated remarkable capabilities in generating high-quality images that are semantically coherent with given texts, which presents an opportunity to synthesize training images for TIC. However, we identify a key challenge: images generated from simple descriptions typically depict a single perspective with one or few contexts, which does not match the complexity of real-world scenes. In this paper, we propose a novel framework that addresses this issue through multi-context data generation. Starting from an initial text corpus, our framework employs a large language model to select multiple sentences that describe the same scene from different perspectives, and then summarizes them into a single sentence with multiple contexts. Using diffusion models, we generate simple images from the straightforward sentences and complex images from the summarized sentences. Finally, we train the model exclusively on the synthetic image-text pairs obtained from this process. Experimental results demonstrate that the proposed framework effectively tackles the identified challenge, achieving state-of-the-art performance on popular datasets such as MSCOCO, Flickr30k, and SS1M.
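The data-generation pipeline described in the abstract can be sketched as follows. This is a minimal, self-contained illustration, not the authors' implementation: a keyword-overlap heuristic stands in for the large language model that selects same-scene sentences, and plain sentence joining stands in for LLM summarization; the corpus and all function names are hypothetical. The resulting simple and multi-context prompts would then be fed to a text-to-image diffusion model.

```python
# Hypothetical corpus of simple, single-perspective sentences, some of
# which describe the same scene. In the paper an LLM groups them; here a
# shared-content-word heuristic is a crude stand-in for that step.
CORPUS = [
    "A dog runs on the beach.",
    "Waves crash on the sandy beach.",
    "A man rides a bicycle down the street.",
    "The beach is crowded with tourists.",
]

def group_by_scene(sentences, min_overlap=1):
    """Greedily group sentences sharing content words (stand-in for LLM selection)."""
    groups = []
    for sent in sentences:
        words = {w.strip(".").lower() for w in sent.split() if len(w) > 3}
        for group in groups:
            if len(words & group["words"]) >= min_overlap:
                group["sentences"].append(sent)
                group["words"] |= words
                break
        else:
            groups.append({"sentences": [sent], "words": words})
    return [g["sentences"] for g in groups]

def summarize(sentences):
    """Fuse one scene's sentences into a single multi-context prompt
    (the paper uses an LLM for this; joining clauses is a placeholder)."""
    return "; ".join(s.rstrip(".") for s in sentences) + "."

def build_prompts(corpus):
    """Return (simple_prompts, complex_prompts) for a diffusion model."""
    groups = group_by_scene(corpus)
    simple_prompts = list(corpus)  # one simple image per original sentence
    complex_prompts = [summarize(g) for g in groups if len(g) > 1]
    return simple_prompts, complex_prompts

simple_prompts, complex_prompts = build_prompts(CORPUS)
```

Under these assumptions, the three beach sentences collapse into one multi-context prompt while the bicycle sentence stays simple; each prompt would then be rendered into a synthetic training image.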


Related research

- 11/20/2016: A Hierarchical Approach for Generating Descriptive Image Paragraphs
- 05/03/2023: Multimodal Data Augmentation for Image Captioning using Diffusion Models
- 02/28/2015: Generating Multi-Sentence Lingual Descriptions of Indoor Scenes
- 06/20/2015: Aligning Where to See and What to Tell: Image Caption with Region-based Attention and Scene Factorization
- 07/27/2023: Exploring Annotation-free Image Captioning with Retrieval-augmented Pseudo Sentence Generation
- 09/01/2021: ConRPG: Paraphrase Generation using Contexts as Regularizer
- 06/08/2023: SyncDiffusion: Coherent Montage via Synchronized Joint Diffusions
