FuseCap: Leveraging Large Language Models to Fuse Visual Data into Enriched Image Captions

05/28/2023
by   Noam Rotstein, et al.

Image captioning is a central task in computer vision that has experienced substantial progress following the advent of vision-language pre-training techniques. In this paper, we highlight a frequently overlooked limitation of captioning models: they often fail to capture semantically significant elements. This drawback can be traced back to the text-image datasets; while their captions typically offer a general depiction of image content, they frequently omit salient details. To mitigate this limitation, we propose FuseCap - a novel method for enriching captions with additional visual information obtained from vision experts, such as object detectors, attribute recognizers, and Optical Character Recognizers (OCR). Our approach fuses the outputs of such vision experts with the original caption using a large language model (LLM), yielding enriched captions that present a comprehensive image description. We validate the effectiveness of the proposed caption enrichment method through both quantitative and qualitative analysis. Our method is then used to curate the training set of a captioning model based on BLIP, which surpasses current state-of-the-art approaches in generating accurate and detailed captions while using significantly fewer parameters and training data. As additional contributions, we provide a dataset comprising 12M image-enriched caption pairs and show that the proposed method substantially improves image-text retrieval.
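The fusion step described above can be illustrated with a minimal sketch. The function below is hypothetical (the paper does not specify its prompt wording or interface); it only shows the general idea of assembling vision-expert outputs (detected objects, attributes, OCR text) and the original caption into a single prompt for an LLM to fuse into an enriched caption. The actual LLM call is omitted.

```python
# Hypothetical sketch of a FuseCap-style fusion prompt builder.
# Names, prompt wording, and structure are assumptions for illustration,
# not the paper's actual implementation.

def build_fusion_prompt(caption, objects, attributes, ocr_text):
    """Assemble a prompt asking an LLM to fuse vision-expert outputs
    (objects, attributes, OCR) with the original caption."""
    lines = [
        "Fuse the following information into one enriched image caption.",
        f"Original caption: {caption}",
        f"Detected objects: {', '.join(objects)}",
        f"Object attributes: {', '.join(attributes)}",
    ]
    if ocr_text:  # OCR output is included only when text was found
        lines.append(f"Text found in image (OCR): {ocr_text}")
    return "\n".join(lines)

prompt = build_fusion_prompt(
    caption="a man holding a sign",
    objects=["man", "sign", "street"],
    attributes=["red sign", "smiling man"],
    ocr_text="STOP",
)
print(prompt)
```

In the paper's pipeline, the LLM's response to such a prompt would replace the original short caption in the training set, yielding the enriched 12M-pair dataset.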


