CapText: Large Language Model-based Caption Generation From Image Context and Description

06/01/2023
by Shinjini Ghosh, et al.

While deep-learning models perform well on image-to-text benchmarks, they are difficult to use in practice for captioning images. This is because captions traditionally tend to be context-dependent and to offer information complementary to an image, whereas models tend to produce descriptions of the image's visual features. Prior research in caption generation has explored models that generate captions when given images alongside their respective descriptions or contexts. We propose and evaluate a new approach that leverages existing large language models to generate captions from textual descriptions and context alone, without ever processing the image directly. We demonstrate that, after fine-tuning, our approach outperforms current state-of-the-art image-text alignment models such as OSCAR-VinVL on this task on the CIDEr metric.
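The core idea, generating a caption from a textual description and context rather than from image pixels, can be sketched as follows. The prompt template, field names, and example inputs below are illustrative assumptions, not the authors' actual fine-tuning format; in the paper's setup, the assembled text would be consumed by a fine-tuned large language model that completes the caption.

```python
def build_caption_prompt(description: str, context: str) -> str:
    """Combine an image's textual description and its surrounding
    context into a single prompt string for a language model.
    This template is a hypothetical illustration of the text-only
    input described in the abstract, not the paper's exact format."""
    return (
        "Description: " + description.strip() + "\n"
        "Context: " + context.strip() + "\n"
        "Caption:"
    )

# Example usage with made-up inputs; a fine-tuned LLM would be
# asked to continue the text after the "Caption:" field.
prompt = build_caption_prompt(
    "A dog leaping to catch a frisbee in a sunlit park.",
    "Feature article on the growing popularity of canine agility sports.",
)
```

Because the model never sees the image, the quality of the generated caption hinges entirely on how informative the description and context fields are.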

Related research

- Knowledge driven Description Synthesis for Floor Plan Interpretation (03/15/2021)
- Discoverability in Satellite Imagery: A Good Sentence is Worth a Thousand Pictures (01/03/2020)
- Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning (05/10/2021)
- Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning (06/03/2022)
- OxfordTVG-HIC: Can Machine Make Humorous Captions from Images? (07/21/2023)
- Using Large Language Models to Generate Engaging Captions for Data Visualizations (12/27/2022)
- ContextRef: Evaluating Referenceless Metrics For Image Description Generation (09/21/2023)
