Zero-shot Image Captioning by Anchor-augmented Vision-Language Space Alignment

11/14/2022
by Junyang Wang, et al.

CLIP (Contrastive Language-Image Pre-Training) has shown remarkable zero-shot transfer capabilities in cross-modal correlation tasks such as visual classification and image retrieval. However, its performance in cross-modal generation tasks such as zero-shot image captioning remains unsatisfactory. In this work, we argue that directly employing CLIP for zero-shot image captioning makes generation rely mainly on the textual context while largely ignoring the visual information, a phenomenon we call the contextual language prior. To address this, we propose Cross-modal Language Models (CLMs) to facilitate unsupervised cross-modal learning. We further propose Anchor Augment to guide the generative model's attention to the fine-grained information in CLIP's representation. Experiments on MS COCO and Flickr 30K validate the promising performance of the proposed approach in both captioning quality and computational efficiency.

research
04/26/2023

From Association to Generation: Text-only Captioning by Unsupervised Cross-modal Mapping

With the development of Vision-Language Pre-training Models (VLPMs) repr...
research
08/29/2023

Cross-Modal Retrieval Meets Inference: Improving Zero-Shot Classification with Cross-Modal Retrieval

Contrastive language-image pre-training (CLIP) has demonstrated remarkab...
research
12/14/2022

NLIP: Noise-robust Language-Image Pre-training

Large-scale cross-modal pre-training paradigms have recently shown ubiqu...
research
03/23/2023

CoBIT: A Contrastive Bi-directional Image-Text Generation Model

The field of vision and language has witnessed a proliferation of pre-tr...
research
05/20/2021

More Than Just Attention: Learning Cross-Modal Attentions with Contrastive Constraints

Attention mechanisms have been widely applied to cross-modal tasks such ...
research
02/28/2020

Exploring and Distilling Cross-Modal Information for Image Captioning

Recently, attention-based encoder-decoder models have been used extensiv...
research
12/14/2022

Cross-Modal Similarity-Based Curriculum Learning for Image Captioning

Image captioning models require the high-level generalization ability to...
