UFC-BERT: Unifying Multi-Modal Controls for Conditional Image Synthesis

05/29/2021
by Zhu Zhang, et al.

Conditional image synthesis aims to create an image according to multi-modal guidance in the form of textual descriptions, reference images, and image blocks to preserve, as well as their combinations. In this paper, instead of investigating these control signals separately, we propose a new two-stage architecture, UFC-BERT, to unify any number of multi-modal controls. In UFC-BERT, both the diverse control signals and the synthesized image are uniformly represented as a sequence of discrete tokens to be processed by a Transformer. Unlike existing two-stage autoregressive approaches such as DALL-E and VQGAN, UFC-BERT adopts non-autoregressive generation (NAR) at the second stage to enhance the holistic consistency of the synthesized image, to support preserving specified image blocks, and to improve the synthesis speed. Further, we design a progressive algorithm that iteratively improves the non-autoregressively generated image with the help of two estimators, one evaluating compliance with the controls and the other evaluating the fidelity of the synthesized image. Extensive experiments on a newly collected large-scale clothing dataset, M2C-Fashion, and a facial dataset, Multi-Modal CelebA-HQ, verify that UFC-BERT can synthesize high-fidelity images that comply with flexible multi-modal controls.
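The abstract does not spell out the progressive algorithm, so the following is only a minimal sketch of one common way to do iterative non-autoregressive refinement (a mask-predict-style loop), not the authors' implementation. Everything named here is hypothetical: `toy_predict` stands in for the second-stage Transformer, `toy_relevance` and `toy_fidelity` stand in for the two estimators, and the linear unmasking schedule is an assumption. It illustrates three ideas from the abstract: all image tokens are proposed in parallel, preserved image blocks are never overwritten, and candidates are ranked by a combined compliance and fidelity score.

```python
import random

MASK = -1  # hypothetical id marking an image-token position not yet decided


def toy_predict(tokens, rng, vocab_size):
    """Stand-in for the Transformer: return a (token, confidence) pair per position.

    Already-decided positions are echoed back with confidence 1.0;
    masked positions get a random token and a random confidence.
    """
    return [(t, 1.0) if t != MASK else (rng.randrange(vocab_size), rng.random())
            for t in tokens]


def toy_relevance(tokens, wanted=3):
    """Toy 'compliance with controls' estimator: share of tokens equal to `wanted`."""
    return sum(t == wanted for t in tokens) / len(tokens)


def toy_fidelity(tokens):
    """Toy 'fidelity' estimator: penalize differences between neighbouring tokens."""
    return -sum(a != b for a, b in zip(tokens, tokens[1:])) / len(tokens)


def toy_score(tokens):
    # Assumption: the two estimator scores are simply added.
    return toy_relevance(tokens) + toy_fidelity(tokens)


def progressive_generate(length, preserved, score_fn, steps=4, vocab_size=16, seed=0):
    """Iteratively fill a token sequence, keeping `preserved` positions fixed."""
    rng = random.Random(seed)
    tokens = [preserved.get(i, MASK) for i in range(length)]
    free = [i for i in range(length) if i not in preserved]
    best, best_score = None, float("-inf")
    for step in range(1, steps + 1):
        # Propose every position in parallel (the non-autoregressive part).
        proposals = toy_predict(tokens, rng, vocab_size)
        candidate = [tok for tok, _ in proposals]
        score = score_fn(candidate)
        if score > best_score:
            best, best_score = candidate, score
        # Keep the most confident free positions and re-mask the rest;
        # the kept share grows linearly until everything is kept.
        keep_n = len(free) * step // steps
        ranked = sorted(free, key=lambda i: proposals[i][1], reverse=True)
        keep = set(ranked[:keep_n])
        tokens = [candidate[i] if (i in preserved or i in keep) else MASK
                  for i in range(length)]
    return best


if __name__ == "__main__":
    # Positions 0 and 5 play the role of image blocks to preserve.
    print(progressive_generate(8, {0: 3, 5: 7}, toy_score))
```

In a real system the proposals and confidences would come from the trained model and the estimators would be learned; the loop structure is the only part this sketch is meant to convey.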


research
05/24/2022

M6-Fashion: High-Fidelity Multi-modal Image Generation and Editing

The fashion industry has diverse applications in multi-modal image gener...
research
06/01/2023

Cocktail: Mixing Multi-Modality Controls for Text-Conditional Image Generation

Text-conditional diffusion models are able to generate high-fidelity ima...
research
08/11/2022

ARMANI: Part-level Garment-Text Alignment for Unified Cross-Modal Fashion Design

Cross-modal fashion image synthesis has emerged as one of the most promi...
research
09/06/2022

Semantic Image Synthesis with Semantically Coupled VQ-Model

Semantic image synthesis enables control over unconditional image genera...
research
03/17/2023

MRIS: A Multi-modal Retrieval Approach for Image Synthesis on Diverse Modalities

Multiple imaging modalities are often used for disease diagnosis, predic...
research
10/27/2022

Masked Vision-Language Transformer in Fashion

We present a masked vision-language transformer (MVLT) for fashion-speci...
research
03/17/2023

MMFace4D: A Large-Scale Multi-Modal 4D Face Dataset for Audio-Driven 3D Face Animation

Audio-Driven Face Animation is an eagerly anticipated technique for appl...
