Exploring Annotation-free Image Captioning with Retrieval-augmented Pseudo Sentence Generation

07/27/2023, by Zhiyuan Li, et al.
Training an image captioner without annotated image-sentence pairs has gained traction in recent years. Previous approaches fall into two strategies: crawling sentences from mismatching corpora and aligning them with the given images as pseudo annotations, or pre-training the captioner on external image-text pairs. However, the aligning setting appears to have reached its performance limit due to the quality of the pairs, while pre-training demands significant computational resources. To address these challenges, we propose a new strategy, “LPM + retrieval-augmented learning”, in which the prior knowledge of large pre-trained models (LPMs) serves as supervision and a retrieval process is integrated to further reinforce its effectiveness. Specifically, we introduce Retrieval-augmented Pseudo Sentence Generation (RaPSG), which efficiently retrieves highly relevant short region descriptions from the mismatching corpora and uses them to generate, via LPMs, a variety of high-quality pseudo sentences with distinct representations. In addition, a fluency filter and a CLIP-guided training objective are introduced to facilitate model optimization. Experimental results demonstrate that our method surpasses the SOTA pre-training model (Flamingo3B), achieving a CIDEr score of 78.1 (+5.1) while using only 0.3% of its trainable parameters, and it eliminates the need for computationally expensive pre-training on external datasets (e.g., the 312M image-text pairs required by Flamingo3B). We further show that, with a simple extension, the generated pseudo sentences can serve as weak supervision to boost the 1% semi-supervised image captioning benchmark to a CIDEr score of 93.4 (+8.9), which showcases the versatility and effectiveness of our approach.
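To make the two CLIP-based steps in the abstract concrete, here is a minimal sketch (not the authors' released code) of (a) retrieving short region descriptions for an image by CLIP similarity and (b) scoring candidate pseudo sentences against the image, which could serve as a CLIP-guided filter or reward. The model checkpoint, the tiny toy corpus, the image path, and the function names are illustrative assumptions; the LPM rewriting step is deliberately left out.

```python
# Hypothetical sketch of CLIP-based retrieval and image-text scoring,
# assuming the Hugging Face transformers CLIP implementation.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed_texts(texts):
    # L2-normalized CLIP text embeddings for a list of strings.
    inputs = processor(text=texts, return_tensors="pt", padding=True, truncation=True)
    feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

@torch.no_grad()
def embed_image(image):
    # L2-normalized CLIP image embedding for one PIL image.
    inputs = processor(images=image, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def retrieve_region_descriptions(image, corpus, corpus_feats, k=5):
    # Return the k corpus snippets most similar to the image in CLIP space.
    sims = (embed_image(image) @ corpus_feats.T).squeeze(0)
    top = sims.topk(min(k, len(corpus))).indices.tolist()
    return [corpus[i] for i in top]

def clip_scores(image, sentences):
    # Image-text cosine similarities, usable as a CLIP-guided filter/reward.
    sims = (embed_image(image) @ embed_texts(sentences).T).squeeze(0)
    return sims.tolist()

# Usage: retrieve short descriptions, have an LPM rewrite them into pseudo
# captions (not shown here), then keep only captions whose CLIP score is high.
corpus = ["a dog on a couch", "a red double-decker bus", "people at a market"]
corpus_feats = embed_texts(corpus)
image = Image.open("example.jpg")  # placeholder path
hints = retrieve_region_descriptions(image, corpus, corpus_feats, k=2)
pseudo_sentences = ["A dog is lying on a gray couch."]  # would come from the LPM
scores = clip_scores(image, pseudo_sentences)
```

In this reading, retrieval supplies grounded textual hints, the LPM turns them into fluent candidate captions, and the CLIP score decides which candidates are kept as pseudo supervision.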


