Cross-modal pre-training has substantially advanced the state of the art across a variety of Vision-and-Language (VL) tasks. VL understanding tasks, such as Image-Text Retrieval, Visual Question Answering (VQA), Visual Commonsense Reasoning (VCR), and Referring Expression Comprehension, require the pre-trained model to learn representations of visual content, language semantics, and cross-modal alignments, but do not require generation ability. Recent pre-training methods for understanding tasks [5, 6, 7, 8, 9, 10, 11] have achieved state-of-the-art performance and attracted considerable attention in both CV and NLP.
Vision-and-language generation tasks (e.g., Image Captioning and Text-to-Image Generation), however, require the model not only to understand cross-modal representations but also to learn generation capabilities. Thus, directly applying a model pre-trained on VL understanding tasks is not feasible, for two reasons. On the one hand, pre-trained models developed for understanding tasks provide only the encoder; to support generation tasks, separate decoders have to be trained, as in the methods proposed by VideoBERT and CBT. On the other hand, existing VL pre-training objectives almost all involve masked region or span prediction, including VLP; none of the pre-training tasks is designed for whole-sentence generation. Compared with the extensive work on understanding tasks, large-scale pre-training and fine-tuning models for VL generation tasks remain under-developed.
In this paper, we present Cross-modal Generative Pre-Training for Image Captioning (XGPT). The XGPT model uses a cross-modal encoder-decoder architecture and is directly optimized for VL generation tasks. Inspired by UniLM, we share parameters between the encoder and decoder to allow more effective cross-task knowledge sharing. To leverage the underlying semantics and improve bidirectional generalization between the two modalities, we carefully design three generative pre-training tasks: 1) Image-conditioned Masked Language Modeling (IMLM), 2) Image-conditioned Denoising Autoencoding (IDA), and 3) Text-conditioned Image Feature Generation (TIFG), where the encoder takes single-stream or multi-stream data as input, and the decoder is adapted to predict a wide range of generation targets in an autoregressive manner, such as words, a sentence, or image features.
Following VisualBERT , we perform two-stage pre-training before fine-tuning, which allows the model to better adapt to the target domain. We propose several model variants and task combinations in a thorough ablation study, which allows us to carefully control a number of factors, including model architectures, training objectives, and whether to keep placeholders for span masks.
In addition to vision-to-language generation, the proposed XGPT can also help understanding tasks such as image retrieval by serving as a generator for data augmentation. We verify this idea by retraining a model with state-of-the-art performance in image retrieval on the augmented data, and achieve significant improvements.
Our contributions can be summarized as follows:
We introduce XGPT, a new method of Cross-modal Generative Pre-Training for Image Captioning, and design three novel pre-training tasks that are especially effective for image-to-text generation.
We achieve state-of-the-art (SotA) results on COCO Captions and Flickr30k on all metrics, outperforming existing SotA and concurrent methods by a large margin. We also present extensive experiments and analysis to provide useful insights into the effectiveness of each pre-training task and model variant.
We employ XGPT to help image retrieval, an understanding task, by performing data augmentation. After retraining on the augmented data, the state-of-the-art retrieval model still achieves significant improvements on all recall metrics.
2 Related Work
2.1 Pre-training for NLP Tasks
Recently, language models (LMs) pre-trained on large text corpora, such as ELMo, BERT, GPT-2, and XLNet, have brought great advances to NLP tasks. Among the numerous works on natural language pre-training, we review three Transformer-based methods that are most relevant to our approach, namely MASS, Unicoder, and BART.
MASS adopts the encoder-decoder framework to predict masked fragments given the remaining part of the sentence; we also use the encoder-decoder framework to train our text-only model. Unicoder is a universal language encoder pre-trained with three pre-training tasks; the new tasks help the model learn mappings among different languages from more perspectives. BART uses a denoising autoencoder for pre-training; specifically, its pre-training objective is to reconstruct the whole sentence, which is substantially different from the masked language modeling in BERT. Our method is inspired by these works, but since images are not sequential data, we have to tailor our model for cross-modal tasks in particular.
2.2 Pre-training for Cross-modal Generation Tasks
Very recently, several attempts have been made to pre-train models for cross-modal generation tasks.
VideoBERT and CBT seek to conduct pre-training for the video captioning task, but they perform pre-training only for the BERT-based encoder, learning bidirectional joint distributions over sequences of visual and linguistic tokens, so they have to train a separate video-to-text decoder. In contrast, Unified VLP, concurrent with our work, uses a shared multi-layer Transformer network for both encoding and decoding. Following UniLM, they pre-train the model on two masked language modeling (MLM) tasks, i.e., cloze tasks designed for sequence-to-sequence LMs, so the prediction target is still masked tokens, not the whole sentence. However, we find that by using generative pre-training objectives such as Image-conditioned Masked Language Modeling, Image-conditioned Denoising Autoencoding, and Text-conditioned Image Feature Generation, XGPT can significantly outperform Unified VLP on Image Captioning.
3 Input Representation
Linguistic Representation. For each token in the input language sequence, its representation is the sum of its token embedding and position embedding. We denote the input tokens as $w = \{w_1, \dots, w_T\}$ and the corresponding representations as $h^w = \{h^w_1, \dots, h^w_T\}$.
Image Representation. For each input image, we first detect objects using a pre-trained Faster R-CNN model. The top 100 objects with the highest confidence scores are selected, each with a feature vector computed by mean-pooling the last-layer convolutional feature of its region of interest. To represent the position of each object, we construct a 5-d position vector from its spatial location (normalized top-left and bottom-right coordinates) and the fraction of image area it covers. Next, we concatenate the feature vector and position vector of each object and transform the result into another vector by linear projection, so that the dimensions of linguistic tokens and visual tokens are identical. We denote the image regions as $v = \{v_1, \dots, v_K\}$ and the corresponding representations as $h^v = \{h^v_1, \dots, h^v_K\}$.
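The region embedding described above can be sketched in a few lines; this is a minimal numpy illustration, where the 2048-d feature size and the names `region_position_vector`, `embed_region`, `W`, and `b` are assumptions for exposition, not the paper's implementation:

```python
import numpy as np

def region_position_vector(box, img_w, img_h):
    """5-d position feature: normalized (x1, y1, x2, y2) plus area fraction."""
    x1, y1, x2, y2 = box
    return np.array([
        x1 / img_w, y1 / img_h,                    # normalized top-left
        x2 / img_w, y2 / img_h,                    # normalized bottom-right
        (x2 - x1) * (y2 - y1) / (img_w * img_h),   # fraction of image covered
    ])

def embed_region(feat, box, img_w, img_h, W, b):
    """Concatenate the R-CNN region feature with the 5-d position vector,
    then linearly project to the hidden size shared with linguistic tokens."""
    x = np.concatenate([feat, region_position_vector(box, img_w, img_h)])
    return W @ x + b
```

The projection makes each visual token dimensionally interchangeable with a word embedding, which is what lets the encoder consume the concatenated image-text stream.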
Image Refining. Unlike words in text, image regions lack a natural ordering. To better model the relationship among objects in an image, we add an additional image refining layer following AoANet  to refine the image representation before feeding them to the encoder. We refer readers to Appendix A.1 for technical details.
4 Cross-modal Generative Pre-Training for Image Captioning
4.1 Revisit Pre-training Tasks
In this section, we first review the objective of the classical masked language modeling that is commonly used for understanding tasks.
The objective of masked language and region modeling is to learn joint representations for both vision and language by reconstructing masked tokens from a corrupted version $\hat{w}$ of the caption:
$$\mathcal{L}_{\mathrm{MLM}} = -\sum_i m_i \log P(w_i \mid \hat{w}, v),$$
where $m_i = 1$ if $w_i$ is masked as corruption, and $0$ otherwise. The objective is designed over bidirectional contexts, which allows future words to be attended to, so only the masked tokens of the sentence or the labels of regions need to be predicted, rather than the whole sentence.
The pros and cons of this pre-training objective can be summarized as follows:
Downstream tasks: Many BERT-based cross-modal pre-training models choose masked language and region modeling as their main pre-training task, because the encoder learned from this task provides representations of both vision and language based on bidirectional context and naturally fits many understanding tasks, including VQA and Image Retrieval.
Architecture modification: For downstream generation tasks, models pre-trained only through mask-prediction tasks usually have to train an additional layer for generation, as pointed out by Song et al. But separately trained decoders can create a pretrain-finetune discrepancy that hurts the generality of the model, resulting in a gap between pre-training and fine-tuning on generation tasks.
To address these concerns, we propose XGPT and the new pre-training tasks.
4.2 Model Architecture
XGPT has a unified encoder-decoder architecture and can be pre-trained through different generative pre-training tasks. Basically, both the encoder and decoder are multi-layer Transformer networks. The encoder reads the source image and sentence and generates a set of representations as introduced in Section 3. The decoder then generates the target sequence autoregressively, with cross-attention performed over the final hidden layer of the encoder.
Specifically, we use shared parameters for the encoder and decoder, and a faster attention strategy in the decoder where the weights of self-attention and encoder-decoder attention are shared. We add a signal in the attention network to distinguish whether keys and values come from the output of the encoder, enabling efficient re-use. For our base model, we use 12 layers in the encoder and decoder; we use 6 layers for our tiny model to explore the necessity of the decoder for image captioning in the experimental analysis (Section 6).
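The weight-sharing idea above can be illustrated with a toy single-head attention module in numpy. The class name, the additive `enc_signal`, and all shapes are hypothetical simplifications: one set of projection weights serves both self-attention and encoder-decoder attention, and a flag marks when keys/values come from the encoder output:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class SharedAttention:
    """Toy sketch of shared self-/cross-attention: the same Q/K/V projections
    are reused for both attention types, and `from_encoder` is the signal that
    distinguishes keys and values taken from the encoder output."""
    def __init__(self, d, seed=0):
        rng = np.random.default_rng(seed)
        self.Wq, self.Wk, self.Wv = (rng.standard_normal((d, d)) / np.sqrt(d)
                                     for _ in range(3))
        self.enc_signal = rng.standard_normal(d) * 0.01  # marks encoder memory

    def __call__(self, queries, context, from_encoder=False):
        q = queries @ self.Wq
        k = context @ self.Wk
        if from_encoder:
            k = k + self.enc_signal  # signal for re-used encoder keys/values
        v = context @ self.Wv
        scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        return scores @ v
```

Sharing one projection set roughly halves the decoder's attention parameters, which is consistent with the memory savings mentioned in the ablation section.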
Figure 1: Illustration of the three generative pre-training tasks: (a) Image-conditioned Masked Language Modeling, (b) Image-conditioned Denoising Autoencoding, and (c) Text-conditioned Image Feature Generation.
4.3 Generative Pre-training Tasks
Unlike Unified VLP, we use the image captioning task as a basic generative task in the pre-training stage. It only takes images as inputs (single modality). We also introduce three new cross-modal generative pre-training tasks that can jointly pre-train the encoder and decoder: IMLM, IDA, and TIFG.
Image Captioning (IC). A major approach to Image Captioning is the encoder-decoder framework. Given the image regions $v$, the training objective is to generate the caption $w$ in an autoregressive manner by minimizing the negative log-likelihood:
$$\mathcal{L}_{\mathrm{IC}} = -\lambda_{\mathrm{IC}} \sum_t \log P(w_t \mid w_{<t}, v),$$
where $w_{<t}$ is the context history produced by the neural generator, and $\lambda_{\mathrm{IC}}$ is the pre-defined weight of the IC loss. This objective requires the model to predict the whole sentence from scratch.
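The IC objective above can be written out numerically; a minimal sketch, assuming `probs[t]` is the model's predicted distribution over the vocabulary at step t given the image regions and the preceding tokens (the function name and interface are illustrative):

```python
import numpy as np

def caption_nll(probs, caption_ids, weight=1.0):
    """Autoregressive captioning loss sketch: sum the log-probability assigned
    to each gold token at its step, negate, and scale by the IC loss weight."""
    log_likelihood = sum(np.log(probs[t][caption_ids[t]])
                         for t in range(len(caption_ids)))
    return -weight * log_likelihood
```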
Image-conditioned Masked Language Modeling (IMLM). IMLM aims to teach the model to learn the relationship between vision and language by predicting consecutive tokens on the decoder side.
XGPT is trained to reconstruct the n-gram masked words through a sequence-to-sequence framework. This task is similar in spirit to n-gram MLM in BERT and the masked sequence-to-sequence objective in MASS. The differences lie in that (1) the task encourages the encoder to learn the cross-modal relationships between the unmasked tokens and image regions, and (2) the decoder has to generate the masked tokens of the fragment and extract useful image-conditioned information from the encoder side.
As shown in Figure 1(a), we concatenate the image regions and the unmasked tokens as input to the encoder during pre-training, and let the decoder predict the masked fragment by minimizing the negative log-likelihood loss:
$$\mathcal{L}_{\mathrm{IMLM}} = -\lambda_{\mathrm{IMLM}} \sum_i m_i \log P(w_i \mid w_{<i}, \hat{w}, v),$$
where $\hat{w}$ is the corrupted caption, $m_i = 1$ if $w_i$ is in the masked fragment and $0$ otherwise, and $\lambda_{\mathrm{IMLM}}$ is the pre-defined weight of the IMLM loss.
Image-conditioned Denoising Autoencoding (IDA). In IDA we take the distance between the feature spaces of the two modalities into account, and use an attention matrix to model the underlying text-image alignments. Besides, IDA forces the model to reconstruct the whole sentence without being given the length of the masked fragment, as illustrated in Figure 1(b). Specifically, we use:
Single [MASK] token for fragments. Inspired by the text-infilling task of BART, we first sample n-gram fragments to be masked and then replace each with a single [MASK] token. This is more challenging because the model has to predict not only the missing tokens but also the length of the original sentence. We also compare this method with the conventional method that keeps one placeholder per masked token in Section 6.
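The two fragment-corruption strategies can be contrasted in a few lines of plain Python; this is a sketch under the assumption that the fragment position and length have already been sampled (function and argument names are illustrative):

```python
def corrupt_caption(tokens, span_start, span_len, single_mask=True):
    """Fragment masking for IDA: replace an n-gram fragment either with one
    [MASK] token (Single [MASK], hiding the fragment length) or with one
    [MASK] per removed token (Multi [MASK], which leaks the length)."""
    span = tokens[span_start:span_start + span_len]
    masks = ["[MASK]"] if single_mask else ["[MASK]"] * span_len
    corrupted = tokens[:span_start] + masks + tokens[span_start + span_len:]
    return corrupted, span
```

With `single_mask=True` the decoder must also infer how many tokens the placeholder stands for, which is exactly what makes the task harder.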
The attention-driven image-text matching. To model the text-image alignments, we first compute an attention matrix $A$ with an entry for each token-region pair:
$$A_{ij} = w_a^\top (h^w_i \odot h^v_j),$$
where $w_a$ is a trainable weight and $\odot$ is elementwise multiplication. Then we represent each word as the weighted sum of all region representations based on the attention matrix: $\hat{h}^w_i = \sum_j A_{ij} h^v_j$. Finally, XGPT takes the new $\hat{h}^w$ as input and tries to predict the original word sequence. The loss function is defined as
$$\mathcal{L}_{\mathrm{IDA}} = -\lambda_{\mathrm{IDA}} \sum_t \log P(w_t \mid w_{<t}, \hat{w}, v),$$
where $\hat{w}$ is the corrupted caption, and $\lambda_{\mathrm{IDA}}$ is the pre-defined weight of the IDA loss.
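The token-region matching step can be sketched in numpy. Since the exact scoring form is partly elided in the text, this sketch assumes a learned weight over the elementwise product of each token-region pair and a softmax over regions; all names and shapes are assumptions:

```python
import numpy as np

def image_conditioned_words(h_word, h_region, w_a):
    """Score every token-region pair, normalize over regions, and rebuild each
    word representation as the attention-weighted sum of region features."""
    T, d = h_word.shape
    K, _ = h_region.shape
    scores = np.array([[w_a @ (h_word[i] * h_region[j]) for j in range(K)]
                       for i in range(T)])
    scores -= scores.max(axis=1, keepdims=True)       # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)           # softmax over regions
    return attn @ h_region                            # (T, d) word inputs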
Text-conditioned Image Feature Generation (TIFG). Text-to-image (T2I) generation can be regarded as the inverse of image captioning, an Image-to-Text (I2T) problem. It is therefore natural to unify the model to leverage the underlying semantics in both domains.
In contrast to UNITER, which involves image feature generation by learning to regress the Transformer output of each masked region to its visual features, TIFG aims to regress the decoder output of all image regions conditioned on the text description, rather than only the masked regions. As shown in Figure 1(c), we employ the encoder-decoder pipeline to convert the linguistic representations into outputs $\tilde{v} = \{\tilde{v}_1, \dots, \tilde{v}_K\}$ of the same length and dimension as the image representations $v$. Then we train with a mean squared error loss to supervise XGPT to generate semantically consistent image features:
$$\mathcal{L}_{\mathrm{TIFG}} = \lambda_{\mathrm{TIFG}} \, \frac{1}{K} \sum_{j=1}^{K} \lVert \tilde{v}_j - v_j \rVert^2,$$
where $\lambda_{\mathrm{TIFG}}$ is the weight of the TIFG loss.
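The TIFG regression target is a plain mean-squared-error objective; a minimal sketch, where the function name and averaging convention (mean over all region-feature entries) are assumptions:

```python
import numpy as np

def tifg_loss(predicted_regions, target_regions, weight=1.0):
    """MSE between the decoder outputs conditioned on the caption and the
    ground-truth region features, scaled by the TIFG loss weight."""
    diff = predicted_regions - target_regions
    return weight * np.mean(diff ** 2)
```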
Multi-task pre-training. Following Unicoder, we calculate the task-specific losses in turn and update the model for each task in every pre-training iteration. We assign each task an individual loss weight to study how each objective contributes to pre-training.
5 Experiments and Results
5.1 Training Stages
Out-of-domain pre-training stage. We first conduct pre-training on the Conceptual Captions (CC) dataset, which contains about 3M image-caption pairs scraped from alt-text-enabled web images. The automatic collection leaves some noise in the dataset (e.g., irrelevant or overly short captions) but brings massive scale, so we use it only as our out-of-domain dataset for the first pre-training stage. We set the loss weight to 1 for all pre-training tasks on CC.
In-domain pre-training stage. Before fine-tuning XGPT on the final image captioning task, we find it beneficial to further pre-train the model on the data from the downstream tasks with the proposed pre-training objectives. This step allows the model to adapt to the target domain. We therefore reduce the weights of the cross-modal tasks (i.e., the IMLM, IDA, and TIFG loss weights) and keep the image captioning task weight unchanged.
Fine-tuning stage. In this step, the model takes only image features and position information as input, and the decoder is trained to predict the whole sentence in an autoregressive manner. We also apply other inference approaches such as beam search; more details are given in Section 5.3.
5.2 Evaluation datasets
The datasets for downstream tasks include COCO Captions and Flickr30k. In these datasets, each image is labeled with 5 captions. We follow Karpathy's split (cs.stanford.edu/people/karpathy/deepimagesent/caption_datasets.zip), which gives 113.2k/5k/5k and 29.8k/1k/1k images for the train/val/test splits, respectively. We use standard metrics for Image Captioning, including BLEU@4, METEOR, CIDEr, and SPICE, to evaluate the proposed method and compare it with other methods.
5.3 Implementation Details
In all experiments, the backbone Transformer of XGPT follows Vaswani et al., and we modify it into a BERT-based encoder-decoder architecture with 768 hidden units, 8 heads, and GELU activations as used in GPT. The dropout rate is 0.1.
We train XGPT with mixed-precision (FP16) training, which makes more efficient use of GPUs. The Adam optimizer is used for optimization. The learning rate is varied during out-of-domain pre-training with inverse square root decay, and the weight of each individual loss is set to 1. During the in-domain pre-training and fine-tuning stages, we take the average of the top-4 pre-trained weights and reduce the initial learning rate. We include the cross-modal tasks with a weight of 0.3 for in-domain pre-training, and turn off all tasks other than image captioning during fine-tuning. The two-stage pre-training takes about 8 days to converge on 8 V100 GPUs with a total batch size of 512 via gradient accumulation. We fine-tune the model for 30 epochs on four GPUs.
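The inverse-square-root schedule mentioned above is the standard Transformer recipe: linear warmup followed by decay proportional to $1/\sqrt{\text{step}}$. A minimal sketch; the peak learning rate and warmup length here are illustrative, since the paper's exact values are elided:

```python
def inverse_sqrt_lr(step, peak_lr=1e-4, warmup_steps=4000):
    """Linear warmup to peak_lr, then 1/sqrt(step) decay (Vaswani-style).
    peak_lr and warmup_steps are illustrative assumptions."""
    step = max(step, 1)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps          # linear warmup
    return peak_lr * (warmup_steps ** 0.5) / (step ** 0.5)  # 1/sqrt decay
```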
For caption inference, we use greedy search on the validation set and beam search with beam size 2 on the test set. We describe more training details in Appendix A.2.
5.4 Experimental Settings
We compare XGPT with state-of-the-art methods on image captioning in two settings:
Text Pre-training. The model weights are initialized from a pre-trained language model, and XGPT is then fine-tuned directly on the image captioning task.
Image-Text Pre-training. Weights are also initialized from Text Pre-training. To fuse image-text information during the pre-training stage, XGPT continues pre-training on the three proposed tasks along with the image captioning task, and is fine-tuned only on the image captioning task in the end.
5.5 Comparisons against SotAs
Results comparing our methods with SotA methods on the test set are shown in Table 1. We include state-of-the-art works (mostly without pre-training), a recent work that also uses pre-trained models, and our methods (the lowest part). All methods use a single model for image captioning for a fair comparison. Our full model (Base) significantly outperforms SotA methods on all four metrics on both COCO Captions and Flickr30k.
Comparison with the baseline. Compared to our baseline model that uses only text pre-training, the cross-modal pre-training tasks improve performance on all metrics, which validates the importance of Image-Text Pre-training for generation tasks. Unified VLP also employs two masked language pre-training tasks and provides a baseline initialized from a language pre-trained model (UniLM). The improvement is significant on both benchmark datasets and particularly pronounced on Flickr30k, with absolute gains of 17.9% on CIDEr, 6.1% on BLEU@4, 3.3% on METEOR, and 2.9% on SPICE. Comparing the gains from Image-Text Pre-training tasks, XGPT achieves a larger improvement than Unified VLP, demonstrating the effectiveness of the three designs in XGPT.
| Method | Flickr30k CIDEr | BLEU@4 | METEOR | SPICE | COCO CIDEr | BLEU@4 | METEOR | SPICE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Approaches that do not use any pre-trained model* | | | | | | | | |
| NBT (with BBox) | 57.5 | 27.1 | 21.7 | 15.6 | 107.2 | 34.7 | 27.1 | 20.1 |
| GCN-LSTM (spa) | - | - | - | - | 115.6 | 36.5 | 27.8 | 20.9 |
| GCN-LSTM (sem) | - | - | - | - | 116.3 | 36.8 | 27.9 | 20.9 |
| *Approaches that use pre-trained models or pre-trained language models* | | | | | | | | |
| Unified VLP | 67.4 | 30.1 | 23.0 | 17.0 | 116.9 | 36.5 | 28.4 | 21.2 |
| Pre-training tasks | CIDEr | BLEU@4 | METEOR | SPICE |
| --- | --- | --- | --- | --- |
| *Out-of-domain pre-training only* | | | | |
| IC + IMLM | 117.7 | 36.2 | 28.2 | 21.2 |
| IC + IDA | 118.1 | 36.4 | 28.3 | 21.3 |
| IC + TIFG | 117.3 | 36.0 | 28.2 | 21.2 |
| IC + IMLM + IDA + TIFG | 118.3 | 36.5 | 28.4 | 21.3 |
| *With in-domain pre-training* | | | | |
| IC + IMLM | 119.1 | 36.7 | 28.5 | 21.5 |
| IC + IDA | 119.2 | 36.6 | 28.5 | 21.6 |
| IC + TIFG | 118.2 | 36.4 | 28.4 | 21.3 |
| IC + IMLM + IDA + TIFG | 120.1 | 37.2 | 28.6 | 21.8 |
| Model | R@1 | R@5 | R@10 |
| --- | --- | --- | --- |
| ViLBERT + augmentation | 60.4 | 86.4 | 91.9 |
Is a Decoder Necessary? To find the best model structure for image captioning, we also design three model variants. Enc is a single multi-layer Transformer that takes either image region features (for the unidirectional model) or a pair of text and image packed together, and uses different self-attention masks to control access to the context, similar to UniLM and Unified VLP. EncDec is a Transformer encoder-decoder architecture in which all weights are initialized randomly and not shared between the encoder and decoder. EncDecShare is like EncDec, but the parameters of the encoder and decoder are shared. We also use a signal in the attention network to indicate whether keys and values come from the encoder. This greatly reduces the memory footprint of the model.
In all settings, models are trained on the image captioning task without any text or image pre-training.
Table 3 reports results of these settings on two model sizes: Base (layers=12), Tiny (layers=6).
With a similar number of parameters, the shared 6-layer setup (Tiny EncDecShare) performs better than the variant whose encoder and decoder parameters are not shared. The Tiny Enc model, which simply reuses the encoder for decoding, can outperform Tiny EncDec, although it performs less well than the EncDecShare models.
We also notice that the model with shared 12-layer encoder and decoder parameters (Base EncDecShare) performs best. We use this as the optimal architecture that achieves the new state-of-the-art results in Table 1.
Effectiveness of Proposed Tasks. We analyze the effectiveness of different pre-training tasks through ablation studies over COCO Captions and Flickr30K. The results are shown in Table 2. Firstly, we establish our baseline: Row 1 in Table 2 shows the results of the model pre-trained on out-of-domain datasets through the image captioning task only.
For the out-of-domain pre-training stage, there are significant improvements across all three tasks (comparing Rows 2, 3, and 4 with the Row 1 baseline). Among the three, we observe that IDA, which helps the model learn text-image alignments, achieves the biggest improvement, while TIFG achieves the smallest. This is probably due to the discrepancy between the decoder, which is originally designed to predict captions, and the task objective, which is to predict all image region features. When combining all three tasks, we find that they are complementary to each other and see the highest gain of approximately +1.9 CIDEr over Row 1.
Comparing with one-stage pre-training, we find that each task combination pre-trained after the second stage (Row 6-8) gains approximately +2 on CIDEr. This indicates that two-stage pre-training with in-domain data enables the model to adapt to the downstream data better than only using out-of-domain pre-training. Combining all three tasks leads to the highest score on all metrics. We use this as the optimal pre-training setting for further experiments.
How to Mask Efficiently? We conduct further experiments to compare two masking strategies for IDA: (1) Multi [MASK]: replace the tokens in the sampled fragment with exactly the same number of [MASK] tokens; (2) Single [MASK]: replace the fragment with a single [MASK] token. The results in Table 4 suggest that the effectiveness of IDA is highly dependent on the masking strategy: Single [MASK] is clearly better than Multi [MASK]. Multi [MASK] provides the decoder with the full position information of the masked tokens, which reduces the difficulty of predicting the correct words and thus yields worse performance (-0.3 on CIDEr). Therefore, we use the single [MASK] strategy as the optimal pre-training setting.
6.1 Data Augmentation for Image Retrieval.
In addition to generation tasks, our XGPT can also help vision-and-language understanding tasks, such as image retrieval, by performing data augmentation as an image description generator. Image retrieval is a task of identifying an image from a pool given a caption describing its content. We generate 62k more captions for all 29k images (about 2.1 captions for each) in the Flickr30k training set, which originally contains 145k captions. We continue to fine-tune the open-source state-of-the-art model222https://github.com/jiasenlu/vilbert_beta introduced in  on the combination of the augmentation and the original training data.
Comparing Row 1 (trained only with the original training data) against Row 2 (fine-tuned on the augmented data) in Table 5, the improvement is significant on R@1, R@5, and R@10. The higher relative gain on R@1 also indicates that the generator can produce high-quality image captions which help the model better understand images.
A negative example of the generation results is provided in Table 7. The first sentence contains wrong information (brown is described as black); the second has a duplicated phrase. Both can be considered noise.
Table 6 shows a positive example of XGPT-generated captions. We can see that the generated captions are grammatically and semantically correct, and can also increase the diversity of the data.
|A person trying desperately not to be photographed by putting their sweater over …|
|A person wearing red pants hides their head under a black jacket in front of a desk …|
|A person with red pants with cover over her head sitting in front of multiple computers.|
|The woman tries to hide from work under a black sweatshirt, but her red corduroy …|
|a child is hiding under a sweater in a chair.|
|A woman wearing red pants is sitting at a desk in front of a computer.|
|A person in a blue sweatshirt is sleeping in a chair.|
|A woman in a blue sweatshirt is sleeping in a chair in front of a computer.|
|A young boy wearing a black shirt, BROWN pants and a black watch has his hand on a …|
|A young man wearing a black shirt takes a folding chair from a large stack.|
|A person helping to set up chairs for a big event.|
|A young man in a black shirt stacks chairs.|
|A boy is setting up folding chairs.|
|A man in a black t shirt and black shorts is putting up a white chair.|
|A man in a black t shirt and black t shirt works on a folding chair.|
In this paper, we present XGPT, Cross-modal Generative Pre-Training for Image Captioning. Three main pre-training tasks are proposed, and the ablation study shows that the effectiveness of each task differs. The combination of all tasks achieves stronger performance on all evaluation metrics, suggesting that they are complementary to each other. After in-domain and out-of-domain pre-training, XGPT outperforms state-of-the-art models by a significant margin. For future work, we plan to extend XGPT to cross-modal understanding tasks, such as VQA and VCR.
-  Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In , pages 3128–3137, 2015.
-  Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015.
-  Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition: Visual commonsense reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6720–6731, 2019.
-  Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 787–798, 2014.
-  Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, pages 13–23, 2019.
-  Chris Alberti, Jeffrey Ling, Michael Collins, and David Reitter. Fusion of detected objects in text for visual question answering. EMNLP-IJCNLP, 2019.
-  Hao Tan and Mohit Bansal. Lxmert: Learning cross-modality encoder representations from transformers. In EMNLP-IJCNLP, pages 5103–5114, 2019.
-  Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019.
-  Gen Li, Nan Duan, Yuejian Fang, Daxin Jiang, and Ming Zhou. Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training. AAAI, 2020.
-  Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. Vl-bert: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530, 2019.
-  Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Learning universal image-text representations. arXiv preprint arXiv:1909.11740, 2019.
-  Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. Videobert: A joint model for video and language representation learning. Proceedings of the IEEE International Conference on Computer Vision, 2019.
-  Chen Sun, Fabien Baradel, Kevin Murphy, and Cordelia Schmid. Contrastive bidirectional transformer for temporal representation learning. ArXiv, abs/1906.05743, 2019.
-  Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J Corso, and Jianfeng Gao. Unified vision-language pre-training for image captioning and vqa. AAAI, 2020.
-  Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. Unified language model pre-training for natural language understanding and generation. arXiv preprint arXiv:1905.03197, 2019.
-  Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In NAACL, 2018. 2, 2018.
-  Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pages 4171–4186, 2019.
-  Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 2019.
-  Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237, 2019.
-  Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. Mass: Masked sequence to sequence pre-training for language generation. arXiv preprint arXiv:1905.02450, 2019.
-  Haoyang Huang, Yaobo Liang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, and Ming Zhou. Unicoder: A universal language encoder by pre-training with multiple cross-lingual tasks. arXiv preprint arXiv:1909.00964, 2019.
-  Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019.
-  Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6077–6086, 2018.
-  Lun Huang, Wenmin Wang, Jie Chen, and Xiao-Yong Wei. Attention on attention for image captioning. In Proceedings of the IEEE International Conference on Computer Vision, pages 4634–4643, 2019.
-  Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.
-  Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
-  Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.
-  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
-  Dan Hendrycks and Kevin Gimpel. Bridging nonlinearities and stochastic regularizers with gaussian error linear units. 2016.
-  Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. international conference on learning representations, 2015.
-  Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Neural baby talk. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7219–7228, 2018.
-  Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei. Exploring visual relationship for image captioning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 684–699, 2018.
-  Luowei Zhou, Yannis Kalantidis, Xinlei Chen, Jason J Corso, and Marcus Rohrbach. Grounded video description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6578–6587, 2019.
-  Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the association for computational linguistics companion volume proceedings of the demo and poster sessions, pages 177–180, 2007.
Appendix A Appendices
A.1 AoA refining layer
Instead of directly feeding image region features to the encoder, we build a refining network to refine their representations using an AoA module following AoANet. The AoA module adopts the multi-head attention function $f_{att}(Q, K, V)$, where $Q$, $K$, and $V$ are three individual linear projections of the region features $v$. The AoA layer is formulated as
$$\mathrm{AoA}(f_{att}, Q, K, V) = \sigma\!\left(W^g_q Q + W^g_v f_{att}(Q, K, V) + b^g\right) \odot \left(W^i_q Q + W^i_v f_{att}(Q, K, V) + b^i\right),$$
where $W^g_q, W^g_v, W^i_q, W^i_v, b^g, b^i$ are learnable weights, and $f_{att}$ is a conventional attention module which operates on queries, keys, and values and generates weighted-average vectors.
The refined region features are calculated by applying the AoA layer to the region features.
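The gate-and-information structure of the AoA module can be sketched in numpy. This is a single-head sketch under the assumption that the attention output `att_out` has already been computed; the weight names mirror the formulation above and are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def aoa(att_out, queries, Wq_i, Wv_i, b_i, Wq_g, Wv_g, b_g):
    """Attention-on-Attention sketch following AoANet: an information vector
    and a sigmoid gate are each a linear function of the query and the
    attention result, then combined elementwise."""
    info = queries @ Wq_i + att_out @ Wv_i + b_i           # information vector
    gate = sigmoid(queries @ Wq_g + att_out @ Wv_g + b_g)  # attention gate
    return gate * info
```

The gate lets the module suppress attention results that are irrelevant to a given query, which is the motivation for refining unordered region features this way.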
A.2 More Implementation Details
For tokenization, we build the vocabulary from English Wikipedia and use byte-pair encoding (BPE) to process image captions, following previous work. We trim the maximum sequence length to 60 and represent each input image as 100 object regions.
Following BERT, a token selected for masking in the encoder is replaced with the [MASK] token 80% of the time, a random token 10% of the time, and left unchanged 10% of the time. For IMLM, we set the fragment length to roughly 50% of the total number of tokens in the sentence. For IDA, span lengths are drawn from a Poisson distribution, and each span is replaced with a single [MASK] token.