SmallCap: Lightweight Image Captioning Prompted with Retrieval Augmentation

09/30/2022
by Rita Ramos, et al.

Recent advances in image captioning have focused on scaling the data and model size, substantially increasing the cost of pre-training and finetuning. As an alternative to large models, we present SmallCap, which generates a caption conditioned on an input image and related captions retrieved from a datastore. Our model is lightweight and fast to train as the only learned parameters are in newly introduced cross-attention layers between a pre-trained CLIP encoder and GPT-2 decoder. SmallCap can transfer to new domains without additional finetuning and exploit large-scale data in a training-free fashion because the contents of the datastore can be readily replaced. Our experiments show that SmallCap, trained only on COCO, has competitive performance on this benchmark, and also transfers to other domains without retraining, solely through retrieval from target-domain data. Further improvement is achieved through the training-free exploitation of diverse human-labeled and web data, which proves effective for other domains, including the nocaps image captioning benchmark, designed to test generalization to unseen visual concepts.
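
As a rough illustration of the retrieval step described in the abstract, the sketch below ranks datastore captions by cosine similarity to an image embedding and places the top matches in a text prompt for the decoder. It is a minimal sketch under stated assumptions, not the authors' implementation: the three-dimensional vectors stand in for real CLIP embeddings, and the names `cosine`, `retrieve`, `build_prompt`, `coco_store`, and the prompt wording are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query, datastore, k=2):
    """Return the captions of the k datastore entries most similar to the query."""
    ranked = sorted(datastore, key=lambda item: cosine(query, item[0]), reverse=True)
    return [caption for _, caption in ranked[:k]]

def build_prompt(captions):
    """Join retrieved captions into a decoder prompt (wording is an assumption)."""
    return "Similar images show " + " ".join(captions) + " This image shows"

# Toy datastore of (embedding, caption) pairs; a real one would hold
# embeddings from a pre-trained encoder such as CLIP.
coco_store = [
    ([1.0, 0.0, 0.0], "a dog running on a beach."),
    ([0.9, 0.1, 0.0], "a puppy playing in the sand."),
    ([0.0, 1.0, 0.0], "a plate of pasta on a table."),
]

# Hypothetical embedding of the input image.
query = [1.0, 0.05, 0.0]

retrieved = retrieve(query, coco_store, k=2)
prompt = build_prompt(retrieved)
print(prompt)
```

Because `retrieve` takes the datastore as an argument, moving to a new domain in this sketch only means passing a different list of (embedding, caption) pairs; nothing is retrained, which mirrors the training-free domain transfer the abstract describes.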


research
03/03/2020

XGPT: Cross-modal Generative Pre-Training for Image Captioning

While many BERT-based cross-modal pre-trained models produce excellent r...
research
11/24/2021

Scaling Up Vision-Language Pre-training for Image Captioning

In recent years, we have witnessed significant performance boost in the ...
research
09/10/2023

Prefix-diffusion: A Lightweight Diffusion Model for Diverse Image Captioning

While impressive performance has been achieved in image captioning, the ...
research
06/06/2023

Putting Humans in the Image Captioning Loop

Image Captioning (IC) models can highly benefit from human feedback in t...
research
06/23/2021

Neural Fashion Image Captioning : Accounting for Data Diversity

Image captioning has increasingly large domains of application, and fash...
research
12/04/2022

Controllable Image Captioning via Prompting

Despite the remarkable progress of image captioning, existing captioners...
research
06/06/2023

Towards Adaptable and Interactive Image Captioning with Data Augmentation and Episodic Memory

Interactive machine learning (IML) is a beneficial learning paradigm in ...
