Exploring Diverse In-Context Configurations for Image Captioning

05/24/2023
by Xu Yang, et al.

Since it was discovered that Language Models (LMs) can be good in-context few-shot learners, numerous strategies have been proposed to optimize in-context sequence configurations. Recently, researchers in the Vision-Language (VL) domain have also developed few-shot learners, but they use only the simplest approach, i.e., random sampling, to configure in-context image-text pairs. To explore how different configurations affect VL in-context learning, we devised four strategies for image selection and four for caption assignment to configure in-context image-text pairs for image captioning. Image captioning serves as the case study because it can be viewed as a visually conditioned LM. Our comprehensive experiments yield two counter-intuitive but valuable insights that highlight the distinct characteristics of VL in-context learning, which arise from multi-modal synergy, compared with the NLP case.
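The random-sampling baseline mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code or any specific model's API. The support set is assumed to be a list of (image, caption) pairs. The `<image>` placeholder token, the `Caption:` template, and the function names `sample_in_context_pairs` and `build_prompt` are all hypothetical. The paper's image-selection and caption-assignment strategies would presumably take the place of the uniform draw in `sample_in_context_pairs`.

```python
import random
from typing import List, Tuple

# A support-set entry: (image identifier, one reference caption for it).
Pair = Tuple[str, str]

# Hypothetical placeholder marking where a VL model would insert image features.
IMAGE_TOKEN = "<image>"


def sample_in_context_pairs(support: List[Pair], k: int, seed: int = 0) -> List[Pair]:
    """Random-sampling baseline: draw k distinct image-text pairs uniformly."""
    rng = random.Random(seed)  # seeded so the configuration is reproducible
    return rng.sample(support, k)


def build_prompt(demos: List[Pair], query_image: str) -> Tuple[List[str], str]:
    """Interleave the demonstration pairs, then leave the query caption open."""
    images = [img for img, _ in demos] + [query_image]
    text = "".join(f"{IMAGE_TOKEN} Caption: {cap}\n" for _, cap in demos)
    text += f"{IMAGE_TOKEN} Caption:"  # the model continues from here
    return images, text
```

The images and the text are returned separately because VL few-shot models typically take image inputs alongside a text sequence containing placeholders. How the placeholder is actually represented depends on the model and is left as an assumption here.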


