Large-Scale Bidirectional Training for Zero-Shot Image Captioning

11/13/2022
by   Taehoon Kim, et al.

When trained on large-scale datasets, image captioning models can understand the content of images from a general domain but often fail to generate accurate, detailed captions. To improve performance, pretraining-and-finetuning has been a key strategy for image captioning. However, we find that large-scale bidirectional training between image and text enables zero-shot image captioning. In this paper, we introduce Bidirectional Image Text Training in largER Scale, BITTERS, an efficient training and inference framework for zero-shot image captioning. We also propose a new evaluation benchmark which comprises high-quality datasets and an extensive set of metrics to properly evaluate zero-shot captioning accuracy and societal bias. We additionally provide an efficient finetuning approach for keyword extraction. We show that careful selection of a large-scale training set and model architecture is the key to achieving zero-shot image captioning.

research
01/22/2022

Visual Information Guided Zero-Shot Paraphrase Generation

Zero-shot paraphrase generation has drawn much attention as the large-sc...
research
04/18/2022

Cross-view Brain Decoding

How the brain captures the meaning of linguistic stimuli across multiple...
research
05/05/2023

Data Curation for Image Captioning with Text-to-Image Generative Models

Recent advances in image captioning are mainly driven by large-scale vis...
research
09/05/2023

NICE 2023 Zero-shot Image Captioning Challenge

In this report, we introduce NICE project[<https://nice.lgresearch.ai/>]...
research
02/16/2021

FEWS: Large-Scale, Low-Shot Word Sense Disambiguation with the Dictionary

Current models for Word Sense Disambiguation (WSD) struggle to disambigu...
research
07/31/2019

Image Captioning with Unseen Objects

Image caption generation is a long standing and challenging problem at t...
research
08/25/2023

MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning

Supervised visual captioning models typically require a large scale of i...
