Designing a Better Asymmetric VQGAN for StableDiffusion

by Zixin Zhu, et al.

StableDiffusion is a revolutionary text-to-image generator that is causing a stir in the world of image generation and editing. Unlike traditional methods that learn a diffusion model in pixel space, StableDiffusion learns a diffusion model in the latent space via a VQGAN, ensuring both efficiency and quality. It not only supports image generation tasks, but also enables image editing on real images, such as image inpainting and local editing. However, we have observed that the vanilla VQGAN used in StableDiffusion leads to significant information loss, causing distortion artifacts even in non-edited image regions. To this end, we propose a new asymmetric VQGAN with two simple designs. First, in addition to the input from the encoder, the decoder contains a conditional branch that incorporates information from task-specific priors, such as the unmasked image region in inpainting. Second, the decoder is much heavier than the encoder, allowing for more detailed recovery while only slightly increasing the total inference cost. The training cost of our asymmetric VQGAN is low: we only need to retrain a new asymmetric decoder while keeping the vanilla VQGAN encoder and StableDiffusion unchanged. Our asymmetric VQGAN can be widely used in StableDiffusion-based inpainting and local editing methods. Extensive experiments demonstrate that it significantly improves inpainting and editing performance while maintaining the original text-to-image capability. The code is available at <>.
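The core idea of the conditional branch can be illustrated with a minimal sketch: during decoding, features derived from the known (unmasked) image region are injected into the decoder so that non-edited pixels are reconstructed faithfully rather than passed solely through the lossy latent. The function, shapes, and fusion rule below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def decode_with_prior(latent_feat, prior_feat, mask):
    """Toy fusion of decoder features with a task-specific prior.

    latent_feat: features decoded from the VQGAN latent, shape (H, W, C)
    prior_feat:  features extracted from the unmasked image region, shape (H, W, C)
    mask:        1 where the region was edited (masked), 0 where pixels are
                 known, shape (H, W, 1)
    """
    # In edited regions, keep the decoder's own features; in known regions,
    # additionally inject the prior so fine detail is not lost to the latent.
    return mask * latent_feat + (1.0 - mask) * (latent_feat + prior_feat)

H, W, C = 4, 4, 2
latent_feat = np.ones((H, W, C))
prior_feat = np.full((H, W, C), 0.5)
mask = np.zeros((H, W, 1))
mask[:2] = 1.0  # top half is the "edited" region

out = decode_with_prior(latent_feat, prior_feat, mask)
print(out[0, 0, 0], out[3, 0, 0])  # 1.0 in the edited area, 1.5 where the prior is injected
```

Because only the decoder changes, such a branch can be trained while the encoder and the diffusion model stay frozen, which is what keeps the retraining cost low.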
