DF-GAN: Deep Fusion Generative Adversarial Networks for Text-to-Image Synthesis

08/13/2020
by Ming Tao, et al.

Synthesizing high-resolution, realistic images from text descriptions is a challenging task. Almost all existing text-to-image methods employ stacked generative adversarial networks as the backbone, use cross-modal attention mechanisms to fuse text and image features, and rely on extra networks to ensure text-image semantic consistency. These models suffer from three problems: 1) For the backbone, stacking multiple generators and discriminators to generate images at different scales makes the training process slow and inefficient. 2) For semantic consistency, the extra networks increase training complexity and add computational cost. 3) For text-image feature fusion, cross-modal attention is applied only a few times during generation because of its computational cost, which prevents the text and image features from being fused deeply. To address these limitations, we propose 1) a novel, simplified text-to-image backbone that synthesizes high-quality images directly with a single pair of generator and discriminator; 2) a novel regularization method, the Matching-Aware zero-centered Gradient Penalty, which pushes the generator to synthesize more realistic and semantically consistent images without introducing extra networks; and 3) a novel fusion module, the Deep Text-Image Fusion Block, which exploits the semantics of text descriptions effectively and fuses text and image features deeply throughout the generation process. Compared with previous text-to-image models, our DF-GAN is simpler, more efficient, and achieves better performance. Extensive experiments and ablation studies on the Caltech-UCSD Birds 200 and COCO datasets demonstrate the superiority of the proposed model over state-of-the-art methods.
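The two components named in the abstract lend themselves to short sketches. First, the Deep Text-Image Fusion Block conditions image features on the sentence embedding through learned affine transformations. Below is a minimal PyTorch sketch of that idea; the class name, tensor shapes, and the single affine layer are illustrative assumptions (the paper stacks several such affine layers with nonlinearities inside each block):

```python
import torch
import torch.nn as nn

class AffineFusion(nn.Module):
    """Text-conditioned affine transformation (a sketch of the idea
    behind the Deep Text-Image Fusion Block, not the exact module).
    Predicts per-channel scale and shift from the sentence embedding
    and applies them to the image feature map."""
    def __init__(self, text_dim: int, num_channels: int):
        super().__init__()
        self.gamma = nn.Linear(text_dim, num_channels)  # per-channel scale
        self.beta = nn.Linear(text_dim, num_channels)   # per-channel shift

    def forward(self, feat: torch.Tensor, sent_emb: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W); sent_emb: (B, text_dim)
        gamma = self.gamma(sent_emb)[:, :, None, None]  # (B, C, 1, 1)
        beta = self.beta(sent_emb)[:, :, None, None]
        return gamma * feat + beta
```

Second, the Matching-Aware zero-centered Gradient Penalty regularizes the discriminator on real images paired with their matching text, penalizing its gradient with respect to both inputs. A hedged sketch follows, assuming a hypothetical discriminator callable `discriminator(images, sent_emb)` that returns per-sample scores; the constants `k` and `p` follow the values reported in the paper but should be treated as tunable:

```python
def ma_gp_loss(discriminator, real_images, sent_emb, k=2.0, p=6.0):
    """Matching-Aware zero-centered Gradient Penalty (sketch)."""
    real_images = real_images.detach().requires_grad_(True)
    sent_emb = sent_emb.detach().requires_grad_(True)
    scores = discriminator(real_images, sent_emb)
    grads = torch.autograd.grad(
        outputs=scores.sum(),
        inputs=(real_images, sent_emb),
        create_graph=True,  # allow backprop through the penalty term
    )
    # Joint gradient norm over image and text inputs, raised to power p.
    grad_flat = torch.cat([g.reshape(g.size(0), -1) for g in grads], dim=1)
    return k * grad_flat.norm(2, dim=1).pow(p).mean()
```

Because the penalty is zero-centered on (real image, matching text) pairs, it smooths the discriminator's loss surface only around points where text and image match, which steers the generator toward semantically consistent outputs without any extra matching network.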


research · 02/17/2023
Fine-grained Cross-modal Fusion based Refinement for Text-to-Image Synthesis
Text-to-image synthesis refers to generating visual-realistic and semant...

research · 04/22/2022
Recurrent Affine Transformation for Text-to-image Synthesis
Text-to-image synthesis aims to generate natural images conditioned on t...

research · 10/27/2022
Towards Better Text-Image Consistency in Text-to-Image Generation
Generating consistent and high-quality images from given texts is essent...

research · 05/17/2023
Fusion-S2iGan: An Efficient and Effective Single-Stage Framework for Speech-to-Image Generation
The goal of a speech-to-image transform is to produce a photo-realistic ...

research · 01/03/2023
Class-Continuous Conditional Generative Neural Radiance Field
The 3D-aware image synthesis focuses on conserving spatial consistency b...

research · 04/02/2019
Semantics Disentangling for Text-to-Image Generation
Synthesizing photo-realistic images from text descriptions is a challeng...

research · 06/26/2023
A Simple and Effective Baseline for Attentional Generative Adversarial Networks
Synthesising a text-to-image model of high-quality images by guiding the...
