CgT-GAN: CLIP-guided Text GAN for Image Captioning

08/23/2023
by   Jiarui Yu, et al.

The large-scale vision-language pre-trained model Contrastive Language-Image Pre-training (CLIP) has significantly improved image captioning in scenarios without human-annotated image-caption pairs. Recent CLIP-based captioning methods that require no human annotations follow a text-only training paradigm, i.e., they reconstruct text from a shared embedding space. Nevertheless, these approaches are limited by a training/inference gap or by heavy storage requirements for text embeddings. Since images are trivial to obtain in the real world, we propose the CLIP-guided text GAN (CgT-GAN), which incorporates images into the training process so that the model can "see" the real visual modality. In particular, we use adversarial training to teach CgT-GAN to mimic the phrasing of an external text corpus, and a CLIP-based reward to provide semantic guidance. The caption generator is jointly rewarded by two signals: a naturalness score from the GAN discriminator, measuring how closely the caption matches human language, and a semantic guidance reward computed by the CLIP-based reward module. Besides using the cosine similarity between the generated caption and the image as the semantic guidance reward (CLIP-cos), we further introduce a novel semantic guidance reward, CLIP-agg, which aligns the generated caption with a weighted text embedding obtained by attentively aggregating the entire corpus. Experimental results on three subtasks, zero-shot image captioning (ZS-IC), in-domain unpaired image captioning (In-UIC), and cross-domain unpaired image captioning (Cross-UIC), show that CgT-GAN significantly outperforms state-of-the-art methods across all metrics. Code is available at https://github.com/Lihr747/CgtGAN.
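The two semantic-guidance rewards reduce to a few lines of tensor algebra. Below is a minimal sketch, assuming pre-computed, L2-normalized CLIP embeddings; the tensor names (img_emb, cap_emb, corpus_emb), the softmax temperature tau, and the use of image-to-corpus similarities as attention weights are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of the CLIP-cos and CLIP-agg rewards described in the
# abstract. All embeddings are assumed pre-computed with CLIP and
# L2-normalized; shapes and the temperature `tau` are illustrative.
import torch
import torch.nn.functional as F

def clip_cos_reward(img_emb, cap_emb):
    """CLIP-cos: cosine similarity between the image embedding and the
    embedding of the generated caption."""
    return F.cosine_similarity(img_emb, cap_emb, dim=-1)

def clip_agg_reward(img_emb, cap_emb, corpus_emb, tau=0.07):
    """CLIP-agg: align the caption with a weighted text embedding obtained
    by attentively aggregating the external corpus, here using
    image-to-corpus similarities as the attention weights (an assumption)."""
    attn = F.softmax(img_emb @ corpus_emb.t() / tau, dim=-1)  # (B, N)
    target = F.normalize(attn @ corpus_emb, dim=-1)           # (B, D)
    return F.cosine_similarity(cap_emb, target, dim=-1)

# Toy usage with random unit vectors standing in for CLIP features.
B, N, D = 4, 100, 512
img_emb = F.normalize(torch.randn(B, D), dim=-1)
cap_emb = F.normalize(torch.randn(B, D), dim=-1)
corpus_emb = F.normalize(torch.randn(N, D), dim=-1)
print(clip_cos_reward(img_emb, cap_emb))
print(clip_agg_reward(img_emb, cap_emb, corpus_emb))
```

In a training loop, either reward would be combined with the discriminator's naturalness score (e.g., a weighted sum) to form the joint reward for the caption generator.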


