DiverGAN: An Efficient and Effective Single-Stage Framework for Diverse Text-to-Image Generation

11/17/2021
by Zhenxing Zhang, et al.

In this paper, we present an efficient and effective single-stage framework (DiverGAN) that generates diverse, plausible and semantically consistent images from a natural-language description. DiverGAN adopts two novel word-level attention modules, i.e., a channel-attention module (CAM) and a pixel-attention module (PAM), which model the importance of each word in the given sentence while allowing the network to assign larger weights to the significant channels and pixels that semantically align with the salient words. After that, Conditional Adaptive Instance-Layer Normalization (CAdaILN) is introduced to enable the linguistic cues from the sentence embedding to flexibly manipulate the amount of change in shape and texture, further improving visual-semantic representation and helping stabilize training. Also, a dual-residual structure is developed to preserve more of the original visual features while allowing for deeper networks, resulting in faster convergence and more vivid details. Furthermore, we propose plugging a fully-connected layer into the pipeline to address the lack-of-diversity problem, since we observe that a dense layer remarkably enhances the generative capability of the network, balancing the trade-off between a low-dimensional random latent code that contributes to variation and modulation modules that use high-dimensional textual contexts to strengthen feature maps. Inserting a linear layer after the second residual block achieves the best variety and quality. Both qualitative and quantitative results on benchmark data sets demonstrate the superiority of DiverGAN in realizing diversity without harming quality or semantic consistency.
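To make the CAdaILN idea concrete, the following is a minimal pure-Python sketch of conditional adaptive instance-layer normalization: each channel is normalized both per-instance and per-layer, the two are blended by a learnable per-channel weight `rho`, and a text-conditioned affine transform (`gamma`, `beta`) is applied. In the paper, `gamma` and `beta` are derived from the sentence embedding; here they are simply passed in as plain lists, and the function signature is an illustrative assumption, not the authors' implementation.

```python
import math

def cadailn(feat, gamma, beta, rho, eps=1e-5):
    """Sketch of Conditional Adaptive Instance-Layer Normalization.

    feat  : list of channels, each a flat list of floats (one channel's pixels)
    gamma : per-channel scale, in the paper predicted from the sentence embedding
    beta  : per-channel shift, likewise text-conditioned
    rho   : per-channel blend weight in [0, 1] between instance and layer norm
    """
    # Layer-norm statistics: computed over all channels and pixels together.
    all_vals = [v for ch in feat for v in ch]
    l_mean = sum(all_vals) / len(all_vals)
    l_var = sum((v - l_mean) ** 2 for v in all_vals) / len(all_vals)

    out = []
    for c, ch in enumerate(feat):
        # Instance-norm statistics: computed per channel.
        i_mean = sum(ch) / len(ch)
        i_var = sum((v - i_mean) ** 2 for v in ch) / len(ch)
        ch_out = []
        for v in ch:
            a_in = (v - i_mean) / math.sqrt(i_var + eps)  # instance-normalized
            a_ln = (v - l_mean) / math.sqrt(l_var + eps)  # layer-normalized
            mixed = rho[c] * a_in + (1 - rho[c]) * a_ln   # learnable blend
            ch_out.append(gamma[c] * mixed + beta[c])     # text-conditioned affine
        out.append(ch_out)
    return out
```

With `rho` at 1 the operation reduces to instance normalization, and at 0 to layer normalization; the blend lets the text-driven modulation control shape and texture changes at whichever granularity the data favors.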


Related research

11/05/2020
DTGAN: Dual Attention Generative Adversarial Networks for Text-to-Image Generation
Most existing text-to-image generation methods adopt a multi-stage modul...

03/14/2019
MirrorGAN: Learning Text-to-image Generation by Redescription
Generating an image from a given text description has two goals: visual ...

05/17/2023
Fusion-S2iGan: An Efficient and Effective Single-Stage Framework for Speech-to-Image Generation
The goal of a speech-to-image transform is to produce a photo-realistic ...

08/29/2020
Dual Attention GANs for Semantic Image Synthesis
In this paper, we focus on the semantic image synthesis task that aims a...

11/02/2020
Dual Attention on Pyramid Feature Maps for Image Captioning
Generating natural sentences from images is a fundamental learning task ...

05/27/2023
Towards Consistent Video Editing with Text-to-Image Diffusion Models
Existing works have advanced Text-to-Image (TTI) diffusion models for vi...
