Transferable Decoding with Visual Entities for Zero-Shot Image Captioning

07/31/2023
by Junjie Fei, et al.

Image-to-text generation aims to describe images using natural language. Recently, zero-shot image captioning based on pre-trained vision-language models (VLMs) and large language models (LLMs) has made significant progress. However, we have observed and empirically demonstrated that these methods are susceptible to modality bias induced by LLMs and tend to generate descriptions containing objects (entities) that do not actually exist in the image but frequently appear during training (i.e., object hallucination). In this paper, we propose ViECap, a transferable decoding model that leverages entity-aware decoding to generate descriptions in both seen and unseen scenarios. ViECap incorporates entity-aware hard prompts to guide LLMs' attention toward the visual entities present in the image, enabling coherent caption generation across diverse scenes. With entity-aware hard prompts, ViECap maintains its performance when transferring from in-domain to out-of-domain scenarios. Extensive experiments demonstrate that ViECap sets a new state of the art in cross-domain (transferable) captioning and performs competitively in in-domain captioning compared to previous VLM-based zero-shot methods. Our code is available at: https://github.com/FeiElysia/ViECap
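As a rough illustration of the entity-aware hard prompt idea described above (not the authors' exact implementation: the prompt template, the `retrieve_entities` helper, its threshold, and the similarity scores below are all assumptions), entities retrieved for an image, e.g. by CLIP similarity against an entity vocabulary, can be inserted into a fixed textual prompt that conditions the LLM on objects actually present in the image:

```python
def retrieve_entities(scores, threshold=0.2, top_k=3):
    """Keep the top-k vocabulary entities whose image-text similarity
    exceeds the threshold (scores stand in for CLIP cosine similarities)."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, s in ranked[:top_k] if s >= threshold]

def build_hard_prompt(entities):
    """Format the retrieved visual entities into a fixed ("hard") textual
    prompt that steers the language model toward these entities
    (template wording is an assumption, not the paper's exact prompt)."""
    if not entities:
        return "A photo of"
    listed = ", ".join(entities)
    return f"There are {listed} in the image. A photo of"

# Hypothetical similarity scores between one image embedding and a
# small entity vocabulary.
scores = {"dog": 0.31, "frisbee": 0.27, "car": 0.05, "beach": 0.22}
entities = retrieve_entities(scores)
print(build_hard_prompt(entities))
# "There are dog, frisbee, beach in the image. A photo of"
```

In this sketch, the hard prompt would be prepended to the soft (learned) prefix before decoding, anchoring generation to retrieved entities rather than to objects the LLM saw frequently during training.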


