TIME: Text and Image Mutual-Translation Adversarial Networks

05/27/2020
by   Bingchen Liu, et al.

Focusing on text-to-image (T2I) generation, we propose Text and Image Mutual-Translation Adversarial Networks (TIME), a lightweight but effective model that jointly learns a T2I generator G and an image-captioning discriminator D under the Generative Adversarial Network framework. While previous methods treat T2I as a uni-directional task and rely on pre-trained language models to enforce image-text consistency, TIME requires neither extra modules nor pre-training. We show that the performance of G can be boosted substantially by training it jointly with D as a language model. Specifically, we adopt Transformers to model the cross-modal connections between image features and word embeddings, and design a hinged and annealing conditional loss that dynamically balances the adversarial learning. In our experiments, TIME establishes a new state-of-the-art Inception Score of 4.88 on the CUB dataset, and shows competitive performance on MS-COCO for both text-to-image generation and image captioning.
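The abstract does not give the exact form of the hinged, annealing conditional loss, so the following is only a hypothetical sketch of the general idea: a hinge penalty on the gap between matched and mismatched image-text scores, scaled by a weight that anneals over training so the conditional signal is balanced against the adversarial one. All names, the cosine annealing schedule, and the margin of 1.0 are assumptions for illustration, not the paper's formulation.

```python
import math

def hinge_d_loss(real_score, fake_score):
    # Standard hinge loss for a GAN discriminator: penalize real
    # scores below +1 and fake scores above -1 (scalar sketch).
    return max(0.0, 1.0 - real_score) + max(0.0, 1.0 + fake_score)

def annealed_conditional_loss(match_score, mismatch_score, step, total_steps):
    # Hypothetical "hinged and annealing" conditional term:
    # a hinge on the margin between a matched image-text pair and a
    # mismatched one, weighted by a cosine schedule that decays from
    # 1 to 0 over training (the annealing). This is an illustrative
    # guess at the mechanism, not the loss used in the paper.
    weight = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    hinge = max(0.0, 1.0 - (match_score - mismatch_score))
    return weight * hinge
```

In this sketch the discriminator's total objective would sum `hinge_d_loss` with the annealed conditional term, letting the image-text matching pressure dominate early in training and relax later.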


