Semi-Supervised Image Captioning with CLIP

06/26/2023
by Chuanyang Jin, et al.

Image captioning, a fundamental task in vision-language understanding, seeks to generate accurate natural language descriptions for given images. The CLIP model, with its rich semantic features learned from a large corpus of image-text pairs, is well-suited for this task. In this paper, we present a two-stage semi-supervised image captioning approach that exploits the potential of CLIP encoding. Our model comprises a CLIP visual encoder, a mapping network, and a language model for text generation. In the initial stage, we train the model on a small labeled dataset by contrasting the generated captions with the ground-truth captions. In the subsequent stage, we continue training on unlabeled images, aiming to maximize the image-caption similarity based on CLIP embeddings. Remarkably, despite utilizing less than 2% of the COCO captions, our approach delivers performance comparable to state-of-the-art models trained on the complete dataset. Furthermore, the captions generated by our approach are more distinctive, informative, and in line with human preference.
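To make the pipeline concrete, below is a minimal sketch of the two-stage training the abstract describes, assuming a ClipCap-style design: a frozen CLIP ViT-B/32 encoder, a small MLP mapping network that turns the image embedding into a prefix of GPT-2 input embeddings, and GPT-2 as the language model. The mapping-network shape, the choice of GPT-2, and the REINFORCE-style update used to handle non-differentiable decoding in stage 2 are illustrative assumptions, not details taken from the paper.

```python
# A minimal sketch of the two-stage approach described above (assumptions noted below).

import torch
import torch.nn as nn
import torch.nn.functional as F
import clip                                        # pip install git+https://github.com/openai/CLIP.git
from transformers import GPT2LMHeadModel, GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

clip_model, preprocess = clip.load("ViT-B/32", device=device)
clip_model.eval()                                  # CLIP stays frozen in both stages
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
lm = GPT2LMHeadModel.from_pretrained("gpt2").to(device)


class MappingNetwork(nn.Module):
    """Maps a CLIP image embedding to a prefix of language-model input embeddings.
    The MLP shape and prefix length are illustrative assumptions."""

    def __init__(self, clip_dim=512, lm_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len, self.lm_dim = prefix_len, lm_dim
        self.net = nn.Sequential(
            nn.Linear(clip_dim, lm_dim * prefix_len // 2),
            nn.Tanh(),
            nn.Linear(lm_dim * prefix_len // 2, lm_dim * prefix_len),
        )

    def forward(self, clip_embed):                 # (B, clip_dim) -> (B, prefix_len, lm_dim)
        return self.net(clip_embed).view(-1, self.prefix_len, self.lm_dim)


mapper = MappingNetwork().to(device)


def stage1_loss(images, captions):
    """Stage 1 (small labeled set): cross-entropy against the ground-truth captions."""
    with torch.no_grad():
        img_emb = clip_model.encode_image(images).float()             # (B, 512)
    prefix = mapper(img_emb)                                           # (B, L, 768)
    tokens = tokenizer(captions, return_tensors="pt", padding=True).input_ids.to(device)
    tok_emb = lm.transformer.wte(tokens)                               # (B, T, 768)
    inputs = torch.cat([prefix, tok_emb], dim=1)
    # prefix positions are masked out (-100) so only caption tokens contribute to the loss
    ignore = torch.full(prefix.shape[:2], -100, dtype=torch.long, device=device)
    labels = torch.cat([ignore, tokens], dim=1)
    return lm(inputs_embeds=inputs, labels=labels).loss


def stage2_loss(images, max_len=20):
    """Stage 2 (unlabeled images): maximize CLIP image-caption similarity.
    Token sampling is non-differentiable, so this sketch uses a simple
    REINFORCE-style policy gradient with the similarity as the reward
    (an assumption; the paper's exact optimization may differ)."""
    with torch.no_grad():
        img_emb = clip_model.encode_image(images).float()              # (B, 512)
    inputs = mapper(img_emb)                                            # start generation from the prefix
    log_probs, sampled = [], []
    for _ in range(max_len):
        logits = lm(inputs_embeds=inputs).logits[:, -1, :]
        dist = torch.distributions.Categorical(logits=logits)
        tok = dist.sample()                                             # (B,)
        log_probs.append(dist.log_prob(tok))
        sampled.append(tok)
        inputs = torch.cat([inputs, lm.transformer.wte(tok).unsqueeze(1)], dim=1)
    captions = tokenizer.batch_decode(torch.stack(sampled, dim=1), skip_special_tokens=True)
    with torch.no_grad():
        txt_emb = clip_model.encode_text(clip.tokenize(captions, truncate=True).to(device)).float()
        reward = F.cosine_similarity(img_emb, txt_emb, dim=-1)          # CLIP image-caption similarity
    # higher reward -> push up the log-probability of the sampled caption
    return -(reward * torch.stack(log_probs, dim=1).sum(dim=1)).mean()
```

In this sketch, only the mapping network (and optionally the language model) would receive gradient updates: stage 1 runs on the small labeled subset, and stage 2 continues training on the remaining unlabeled images, matching the two stages outlined in the abstract.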


Related research:

research · 06/20/2023 · Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion
State-of-The-Art (SoTA) image captioning models often rely on the Micros...

research · 12/19/2018 · Generating Diverse and Meaningful Captions
Image Captioning is a task that requires models to acquire a multi-modal...

research · 08/25/2019 · Towards Unsupervised Image Captioning with Shared Multimodal Embeddings
Understanding images without explicit supervision has become an importan...

research · 04/04/2023 · Cross-Domain Image Captioning with Discriminative Finetuning
Neural captioners are typically trained to mimic human-generated referen...

research · 09/25/2022 · Paraphrasing Is All You Need for Novel Object Captioning
Novel object captioning (NOC) aims to describe images containing objects...

research · 07/26/2019 · Cooperative image captioning
When describing images with natural language, the descriptions can be ma...

research · 05/08/2023 · IIITD-20K: Dense captioning for Text-Image ReID
Text-to-Image (T2I) ReID has attracted a lot of attention in the recent ...
