CLIP2GAN: Towards Bridging Text with the Latent Space of GANs

11/28/2022
by Yixuan Wang, et al.

In this work, we address text-guided image generation and propose a novel framework, CLIP2GAN, which leverages the CLIP model and StyleGAN. The key idea of CLIP2GAN is to bridge the output feature embedding space of CLIP and the input latent space of StyleGAN by introducing a mapping network. In the training stage, we encode an image with CLIP and map the output feature to a latent code, which is then used to reconstruct the image; the mapping network is thus optimized in a self-supervised manner. In the inference stage, since CLIP embeds both images and text into a shared feature embedding space, we replace the CLIP image encoder in the training architecture with the CLIP text encoder, while keeping the mapping network and the StyleGAN model unchanged. As a result, we can flexibly input a text description to generate an image. Moreover, by simply adding the mapped text feature of an attribute to a mapped CLIP image feature, we can effectively edit that attribute in the image. Extensive experiments demonstrate the superior performance of CLIP2GAN compared to previous methods.
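The pipeline in the abstract (frozen CLIP encoder, trainable mapping network, frozen StyleGAN generator, and attribute editing by adding mapped features) can be sketched as follows. This is a minimal illustration, not the authors' implementation: every component here is a hypothetical stand-in with random weights, using plausible dimensions (512-d CLIP embeddings, 512-d StyleGAN W-space latents); the names `clip_image_encode`, `clip_text_encode`, `mapping_network`, and `stylegan_generate` are placeholders.

```python
# Hedged sketch of the CLIP2GAN pipeline described in the abstract.
# All components are random-weight stand-ins, NOT the real CLIP/StyleGAN.
import numpy as np

rng = np.random.default_rng(0)

CLIP_DIM = 512           # CLIP joint embedding dimension (assumed)
W_DIM = 512              # StyleGAN latent (W-space) dimension (assumed)
IMG_SHAPE = (3, 64, 64)  # toy output resolution

# Placeholder encoders: in the real system these are frozen CLIP towers
# that embed images and text into one shared space.
W_img = rng.standard_normal((CLIP_DIM, CLIP_DIM)) * 0.02
W_txt = rng.standard_normal((CLIP_DIM, CLIP_DIM)) * 0.02

def l2_normalize(x):
    return x / np.linalg.norm(x)

def clip_image_encode(image):
    """Stand-in for the CLIP image encoder."""
    return l2_normalize(W_img @ image.reshape(-1)[:CLIP_DIM])

def clip_text_encode(token_ids):
    """Stand-in for the CLIP text encoder (same embedding space)."""
    onehot = np.zeros(CLIP_DIM)
    onehot[np.asarray(token_ids) % CLIP_DIM] = 1.0
    return l2_normalize(W_txt @ onehot)

# The trainable piece: a mapping network from CLIP feature space to
# StyleGAN's input latent space (a single linear map here for brevity).
M = rng.standard_normal((W_DIM, CLIP_DIM)) * 0.02

def mapping_network(clip_feature):
    return M @ clip_feature

# Stand-in for a frozen StyleGAN generator: latent -> image.
G = rng.standard_normal((int(np.prod(IMG_SHAPE)), W_DIM)) * 0.01

def stylegan_generate(w):
    return (G @ w).reshape(IMG_SHAPE)

# Training stage: image -> CLIP -> mapping network -> StyleGAN, optimized
# with a self-supervised reconstruction objective on the input image.
image = rng.standard_normal(IMG_SHAPE)
w = mapping_network(clip_image_encode(image))
recon = stylegan_generate(w)
recon_loss = float(np.mean((recon - image) ** 2))

# Inference stage: swap the image encoder for the text encoder; the
# mapping network and generator are reused unchanged.
w_text = mapping_network(clip_text_encode([12, 345, 77]))
generated = stylegan_generate(w_text)

# Attribute editing: add the mapped text feature of an attribute
# description to the mapped image feature, then decode.
w_edited = mapping_network(clip_image_encode(image)) \
         + mapping_network(clip_text_encode([9]))
edited = stylegan_generate(w_edited)
```

Because the text encoder shares CLIP's embedding space with the image encoder, the mapping network trained only on images transfers to text inputs at inference time, which is the core trick the abstract describes.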


research
10/10/2022

Bridging CLIP and StyleGAN through Latent Alignment for Image Editing

Text-driven image manipulation is developed since the vision-language mo...
research
06/26/2021

ShapeEditer: a StyleGAN Encoder for Face Swapping

In this paper, we propose a novel encoder, called ShapeEditor, for high-...
research
11/05/2020

Transforming Facial Weight of Real Images by Editing Latent Space of StyleGAN

We present an invert-and-edit framework to automatically transform facia...
research
08/12/2019

Matching Images and Text with Multi-modal Tensor Fusion and Re-ranking

A major challenge in matching images and text is that they have intrinsi...
research
11/14/2019

HUSE: Hierarchical Universal Semantic Embeddings

There is a recent surge of interest in cross-modal representation learni...
research
12/09/2021

Self-Supervised Image-to-Text and Text-to-Image Synthesis

A comprehensive understanding of vision and language and their interrela...
research
10/13/2021

Bag-of-Vectors Autoencoders for Unsupervised Conditional Text Generation

Text autoencoders are often used for unsupervised conditional text gener...
