Improving Generalization of Image Captioning with Unsupervised Prompt Learning

08/05/2023
by   Hongchen Wei, et al.

Pretrained visual-language models have demonstrated impressive zero-shot abilities in image captioning when accompanied by hand-crafted prompts, which use human prior knowledge to guide the model. However, because domains differ widely, hand-crafted prompts that provide invariant prior knowledge may result in mode collapse for some domains. Some studies have attempted to incorporate expert knowledge and instruction datasets, but these approaches were costly and led to hallucinations. In this paper, we propose an unsupervised prompt learning method to improve Generalization of Image Captioning (GeneIC), which learns a domain-specific prompt vector for the target domain without requiring annotated data. GeneIC aligns the visual and language modalities with a pre-trained Contrastive Language-Image Pre-Training (CLIP) model and optimizes the domain-specific prompt vector from two aspects: attribute consistency and semantic consistency. Specifically, GeneIC first generates attribute-transferred images whose attributes differ from those of the original images while retaining semantic similarity with them. GeneIC then uses CLIP to measure the similarity between the images and the generated sentences. By exploring the variable and invariant features in the original and attribute-transferred images, attribute consistency constrains the direction of attribute change in both images and sentences to learn domain-specific knowledge. Semantic consistency directly measures the similarity between the generated sentences and the images to ensure the accuracy and comprehensiveness of the generated sentences. GeneIC optimizes only the prompt vectors, which retains the knowledge in the large model while introducing domain-specific knowledge.
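
The two consistency terms described above can be illustrated with a minimal sketch. The abstract does not give the exact loss formulas, so everything here is an assumption: semantic consistency is modeled as one minus the cosine similarity between an image embedding and its caption embedding, and attribute consistency as one minus the cosine similarity between the image-side and text-side change directions. The function names, the `weight` parameter, and the use of plain NumPy vectors in place of real CLIP embeddings are all hypothetical.

```python
import numpy as np


def cosine(a, b, eps=1e-8):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def semantic_consistency_loss(img_emb, txt_emb):
    """Assumed form: penalize low similarity between an image and its caption."""
    return 1.0 - cosine(img_emb, txt_emb)


def attribute_consistency_loss(img_emb, img_t_emb, txt_emb, txt_t_emb):
    """Assumed form: the change direction from the original to the
    attribute-transferred image should match the change direction
    between the two corresponding captions."""
    d_img = np.asarray(img_t_emb, dtype=float) - np.asarray(img_emb, dtype=float)
    d_txt = np.asarray(txt_t_emb, dtype=float) - np.asarray(txt_emb, dtype=float)
    return 1.0 - cosine(d_img, d_txt)


def total_loss(img_emb, img_t_emb, txt_emb, txt_t_emb, weight=1.0):
    """Hypothetical combination: semantic terms for both image/caption
    pairs plus a weighted attribute term."""
    semantic = (semantic_consistency_loss(img_emb, txt_emb)
                + semantic_consistency_loss(img_t_emb, txt_t_emb))
    attribute = attribute_consistency_loss(img_emb, img_t_emb, txt_emb, txt_t_emb)
    return semantic + weight * attribute


if __name__ == "__main__":
    # Toy embeddings standing in for CLIP image/text features.
    img, img_t = np.array([1.0, 0.0]), np.array([1.0, 1.0])
    txt, txt_t = np.array([2.0, 0.0]), np.array([2.0, 2.0])
    print(total_loss(img, img_t, txt, txt_t))
```

In a real setup the embeddings would come from a frozen CLIP encoder and only the prompt vector feeding the captioning model would receive gradients; that training loop is omitted here because the abstract does not specify it.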
