Multimodal Image Synthesis and Editing: A Survey

As information exists in various modalities in the real world, effective interaction and fusion among multimodal information plays a key role in the creation and perception of multimodal data in computer vision and deep learning research. With its superb power in modelling the interaction among multimodal information, multimodal image synthesis and editing has become a hot research topic in recent years. Unlike traditional visual guidance, which provides explicit clues, multimodal guidance offers an intuitive and flexible means of image synthesis and editing. On the other hand, this field also faces several challenges, including the alignment of features with inherent modality gaps, the synthesis of high-resolution images, and faithful evaluation metrics. In this survey, we comprehensively contextualize recent advances in multimodal image synthesis & editing and formulate taxonomies according to data modality and model architectures. We start with an introduction to different types of guidance modalities in image synthesis and editing. We then describe multimodal image synthesis and editing approaches extensively with detailed frameworks including Generative Adversarial Networks (GANs), GAN Inversion, Transformers, and other methods such as NeRF and Diffusion models. This is followed by a comprehensive description of benchmark datasets and corresponding evaluation metrics widely adopted in multimodal image synthesis and editing, as well as detailed comparisons of different synthesis methods with analysis of their respective advantages and limitations. Finally, we provide insights into the current research challenges and possible future research directions. A project associated with this survey is available at





1 Introduction

Humans can naturally imagine a scene from a piece of text or an audio description. However, the underlying processes are not that straightforward for neural networks due to the inherent modality gap. Aiming to mimic human imagination and creativity in the real world, the tasks of multimodal image synthesis and editing provide profound insights into how deep neural networks correlate cross-modal information with visual attributes in generative modeling.

Image synthesis and editing aims to create realistic images or edit real images with natural textures. In the last few years, it has witnessed very impressive progress thanks to the advance of deep learning, especially Generative Adversarial Networks (GANs) [1]. To achieve more controllable generation, a popular line of research focuses on generating and editing images conditioned on certain guidance. Typically, visual clues such as segmentation maps and image edges have been widely adopted with superior synthesis and editing performance [2, 3, 4]. Beyond these visual clues, cross-modal guidance such as text, audio, and scene graphs provides an alternative but often more intuitive and flexible way of expressing visual concepts. However, effective retrieval and fusion of heterogeneous information from data of different modalities remains a grand challenge for image generation and editing.

As one pioneering effort in multimodal image synthesis, [5] shows that a recurrent variational auto-encoder can generate novel visual scenes conditioned on image captions. Research on multimodal image synthesis was then greatly advanced by the prosperity of generative adversarial networks [1, 6, 7, 3, 2, 8, 9, 10]. For example, Reed et al.[11] extend conditional GANs [6] to generate natural images based on textual descriptions. Chen et al.[12] introduce conditional GANs to achieve cross-modal audio-visual generation of musical performances. However, the two pioneering studies conduct synthesis on restricted datasets only (e.g., CUB-200 Birds [13] and Sub-URMP [12]) with relatively low image resolution (e.g., 64 × 64). In the last few years, this field has achieved notable improvements owing to improved multimodal encodings [14, 15], novel architectures [16, 17], and cycle structures [18]. On the other hand, these early studies largely focus on multimodal image synthesis, while the task of multimodal image editing has drawn much less attention.

With the development of large-scale GANs, a number of generative networks such as BigGAN [19] and StyleGAN [20, 21, 22] have been developed to synthesize images with high quality and diversity from random noise input. Recent studies show that GANs can effectively encode rich semantic information in the intermediate features [23] and latent space [24] as a result of image generation. Instead of synthesizing various images by varying the latent code, GAN inversion [25] is introduced to invert a given image back into the latent space of a pretrained GAN model, yielding an inverted code from which the generator can faithfully reconstruct the given image. Since GAN inversion makes it possible to control attribute directions found in latent spaces, pre-trained GANs become applicable to real image editing without requiring ad-hoc supervision or expensive optimization. Quite a number of studies [26, 27] have explored varying the inverted code of real images along one specific direction to edit the corresponding attribute of the image. In terms of multimodal guidance, StyleCLIP [28] leverages the power of Contrastive Language-Image Pre-training (CLIP) [29] models to develop a text-based interface for StyleGAN image manipulation without requiring cumbersome manual effort. Talk-to-Edit [30] introduces an interactive facial editing framework that allows fine-grained attribute manipulation through dialog between the user and the system, based on a fine-grained attribute landscape on the semantic field.

With the prevalence of the Transformer model [31], which naturally allows cross-modal input, impressive improvements have been achieved in several domains such as language models [32], image generative pre-training [33], and audio generation [34]. These recent advances fueled by the Transformer suggest a possible route for multimodal image synthesis. Specifically, DALL-E [35] demonstrates that training a large-scale auto-regressive transformer on numerous image-text pairs can produce a high-fidelity generative model whose results are controllable through a text prompt. Taming Transformer [36] introduces a VQGAN with a discriminator and perceptual loss [37, 38, 39] to learn discrete image representations, and demonstrates the effectiveness of combining the inductive bias of CNNs with the expressivity of transformers in high-resolution image synthesis. ImageBART [40] tackles auto-regressive (AR) image synthesis by learning to invert a multinomial diffusion process, which mitigates the well-known exposure bias of AR models by introducing contextual information. NUWA [41] presents a unified multimodal pre-trained model that can generate or manipulate visual data (i.e., images and videos) with a 3D transformer encoder-decoder framework and a 3D Nearby Attention (3DNA) mechanism.

With the development of generative models and neural rendering, several other models such as Neural Radiance Fields (NeRF) [42] and Diffusion models [43, 44] have also been explored for achieving multimodal image synthesis and editing.

The key contributions of this survey can be summarized in the following five points:

This survey covers the contemporary literature with respect to multimodal image synthesis and editing, and provides a comprehensive overview of the recent efforts in terms of the modalities, methods, datasets, evaluation metrics, and future research directions.

We provide a foundation of different types of guidance modality underlying image synthesis & editing tasks and elaborate the specifics of encoding approach associated with each guidance modality.

By focusing on the cross-modal guidance in image synthesis and editing, we develop a taxonomy of the recent approaches according to their essential architectures and highlight the major strengths and weaknesses of existing methods.

We provide an overview of various datasets and evaluation metrics in multimodal image synthesis & editing, and critically examine the performance of existing methods.

This survey summarizes the open challenges in the current research with an outlook towards promising areas and directions for future research.

The remainder of this survey is organized as follows. Section 2 presents the foundation of popular guidance modalities in image synthesis and editing. Section 3 provides a comprehensive overview and description of multimodal image synthesis & editing methods with detailed pipelines. Section 4 reviews the popular datasets and evaluation metrics, with quantitative experimental results of typical methods. In Section 5, we discuss the main challenges and future directions for multimodal image synthesis & editing. A few concluding remarks are drawn in Section 6.

Fig. 1: Typical multimodal guidance in image synthesis and editing: The first row shows, from left to right, visual guidance (semantic maps, scene layouts, keypoints, and edge maps), text guidance, audio guidance, and scene graph guidance. The second row shows the corresponding image synthesis and editing results (the sample images in the first four columns are from [45] and those in the last three columns are from [46, 47, 48]).

2 Modality Foundations

Each source or form of information can be called a modality. For example, people have the senses of touch, hearing, sight, and smell; media of information include voice, video, text, etc.; and data can be captured by a variety of sensors, such as radar, infrared, and accelerometers. Each of these data forms can be called a modality. In terms of image synthesis and editing, we group the guidance modalities into visual guidance, text guidance, audio guidance, and other modalities. A detailed description of each, with its dedicated processing methods, is presented in the following subsections.

2.1 Visual Guidance

Visual guidance has attracted broad attention in image synthesis and editing thanks to its wide applications. Typically, visual guidance represents certain image properties in pixel space, e.g., segmentation maps [3, 2], keypoints [49, 50, 51], edge maps [52, 53], and scene layouts [54, 55, 56], as illustrated in Fig. 1. Realistic images can be naturally induced from visual guidance, as visual clues can be regarded as a certain type of image that can be encoded directly with convolution layers to yield the final generation or editing results. Thanks to the accurate and clear guidance in visual information, visual guidance can be either paired or unpaired with real images in image synthesis, leading to paired image translation [2, 3] and unpaired image translation [57, 58], respectively.

By editing the visual guidance, such as semantic maps, image synthesis methods can be directly adapted for image manipulation tasks. In addition, visually guided image synthesis and editing can be applied to many low-level vision tasks. For example, we can achieve image colorization by taking grayscale images as visual guidance and the corresponding color images as ground truth. Other tasks such as image super-resolution, image dehazing, and image deraining can be formulated in a similar way.

2.2 Text Guidance

Compared to visual guidance such as edges and object masks, a text prompt provides a more flexible way to express visual concepts. The text-to-image synthesis task aims to produce clear, photo-realistic scenes with high semantic relevance to the corresponding text guidance. This task is very challenging as text descriptions are often ambiguous and can correspond to numerous images with correct semantics. In addition, images and texts come with heterogeneous features, which makes it hard to learn an accurate and reliable mapping across the two modalities. Thus, learning an accurate embedding of the text description plays an important role in text-guided image synthesis and editing.

Fig. 2: The framework of the CLIP (The image is from [29]).

Text Encoding. Learning useful encodings from textual representations is a non-trivial task. There are a number of traditional text representations such as Word2Vec [59] and Bag-of-Words [60]. With the prevalence of deep neural networks, Reed et al.[11] propose to encode texts with a character-level convolutional recurrent neural network (char-CNN-RNN) that is pre-trained to learn correspondences between texts and images. Instead of using char-CNN-RNN, AttnGAN [14] learns text encodings with a bi-directional LSTM [61] by concatenating its hidden states. Instead of directly using the embedding from a pre-trained network, StackGAN [16] introduces Conditioning Augmentation (CA), which randomly samples latent variables from a Gaussian distribution defined by the text embedding. This encoding technique is widely adopted as it encourages smoothness over the textual guidance manifold. With the development of pre-trained models in the natural language processing field, several studies [62, 63] also explore text encoding with large-scale pre-trained language models such as BERT [64].
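The sampling step behind Conditioning Augmentation can be sketched in a few lines. This is a minimal illustration under stated assumptions, not StackGAN's implementation: the mean and log-variance are produced by two hypothetical linear maps (`w_mu`, `w_logvar`) that would be learned in a real model, and the KL term shown is one common way to regularize such a Gaussian towards a standard normal.

```python
import math
import random


def linear(weights, vec):
    """Apply a bias-free linear map given as a list of weight rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def conditioning_augmentation(text_embedding, w_mu, w_logvar, rng):
    """Sample c ~ N(mu(e), diag(sigma(e)^2)) for a text embedding e.

    w_mu and w_logvar are hypothetical (illustrative) weights.
    """
    mu = linear(w_mu, text_embedding)
    logvar = linear(w_logvar, text_embedding)
    # Reparameterization: c = mu + sigma * eps with eps ~ N(0, I).
    sample = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
              for m, lv in zip(mu, logvar)]
    return sample, mu, logvar


def kl_to_standard_normal(mu, logvar):
    """KL divergence between N(mu, diag(exp(logvar))) and N(0, I)."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, logvar))
```

Because a fresh sample is drawn on every call, the same caption yields slightly different conditioning vectors, which is what encourages smoothness along the conditioning manifold.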

Recently, Contrastive Language-Image Pre-training (CLIP) [29] achieves SOTA image representation performance by learning the alignment of images and the corresponding captions from a large number of (image, text) pairs. As illustrated in Fig. 2, CLIP jointly optimizes an image encoder and a text encoder to maximize the cosine similarity between positive pairs and minimize that of negative pairs, yielding accurate text embeddings.
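The contrastive objective described above can be illustrated with a small, dependency-free sketch. It assumes the embeddings have already been produced by some image and text encoders (not shown), treats the i-th image and i-th text in a batch as the positive pair, and uses a temperature of 0.07 purely as an illustrative choice; the function names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # logits[i][j]: scaled cosine similarity between image i and text j.
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature
               for txt in txts] for img in imgs]
    transposed = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image classification losses.
    return 0.5 * (cross_entropy_diagonal(logits)
                  + cross_entropy_diagonal(transposed))
```

Minimizing this loss raises the similarity on the diagonal (positive pairs) and lowers it everywhere else (negative pairs).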

2.3 Audio Guidance

Hearing helps humans to sense the world. The relation between auditory and visual contents has been explored in previous cross-modal research [65], demonstrating that specific objects can be attended to while the corresponding words are pronounced. In addition, Harwath et al.[66] explore learning neural network embeddings from natural images and the corresponding speech waveforms describing the images. With the natural image embedding serving as an interlingua, the experiments in [66] show that the learnt models can perform cross-lingual speech-to-speech retrieval. Sounds can not only interact with visual contents but also capture rich semantic information. For example, by transferring knowledge from other pre-trained scene and object recognition models, SoundNet [67] (a deep model for sound recognition) can learn to identify scenes and objects using auditory contents only.

Audio Encoding. Recent research [68] generates sounds from given videos, where a deep convolutional network is employed to extract features from video frames, followed by an LSTM [69] that generates the waveform corresponding to the input video. Conversely, some works [12, 70, 71] generate images conditioned on sounds. Specifically, Wang et al.[71] first represent the input sound segment by a sequence of features, which can be spectrograms, fbanks, mel-frequency cepstral coefficients (MFCCs), or the hidden layer outputs of the pre-trained SoundNet model [67]. Then all the features in the sequence are averaged into a single vector, which is taken as the condition for image generation.
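The averaging step can be sketched as follows. This is a minimal illustration that assumes the per-frame features (for instance MFCC vectors) have already been extracted by some front end and all share the same dimensionality; `average_features` is a hypothetical helper name.

```python
def average_features(feature_sequence):
    """Average a sequence of per-frame feature vectors into one condition vector."""
    if not feature_sequence:
        raise ValueError("need at least one feature frame")
    dim = len(feature_sequence[0])
    count = len(feature_sequence)
    # Mean over the time axis, computed independently per feature dimension.
    return [sum(frame[d] for frame in feature_sequence) / count
            for d in range(dim)]


# Three frames of 2-dimensional features collapse into a single 2-d vector.
condition = average_features([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

Averaging discards temporal order, so the resulting condition summarizes what the sound contains rather than when each event occurs.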

Audio-driven talking face generation is one of the important applications of audio-guided image synthesis, and it has attracted increasing interest in recent years [72, 73, 74, 75, 76, 47, 77]. For instance, Chung et al.[72] propose an encoder-decoder CNN model that generates talking faces from a joint embedding of the face and the corresponding audio. Song et al.[73] introduce a conditional RNN for adversarial generation of talking faces. Chen et al.[74] design a hierarchical structure that first predicts facial landmarks from the audio and then generates faces conditioned on the landmarks. However, the head pose is almost fixed in the talking faces generated by these approaches. To improve perceptual realism, recent approaches [75, 78, 47, 79] take head pose into consideration when generating talking faces.

2.4 Other Modality Guidance

Image generation from textual descriptions usually works on simple scenes (e.g., CUB-200 Birds [80]) but struggles in complex scenarios with multiple contextual objects. Therefore, Johnson et al.[48] propose to generate images from scene graphs, which define the explicit relationships among objects. Specifically, the guiding scene graph is encoded through a graph convolution, yielding a scene layout by predicting bounding boxes for objects. Then, realistic images can be generated through adversarial training against a pair of discriminators.

Some works aim to synthesize images conditioned on certain specific parameters. For example, Liu et al.[81] explore generating one-hot images conditioned on point coordinates in (x, y) Cartesian space. EMLight [82, 83] and NeedleLight [84] aim to synthesize High Dynamic Range (HDR) scene illumination conditioned on a set of lighting parameters.

3 Methods

We broadly categorize multimodal image synthesis and editing methods into four categories: the GAN-based methods (Sec. 3.1), the GAN inversion methods (Sec. 3.2), the Transformer-based methods (Sec. 3.3), and other methods (Sec. 3.4). We first discuss the GAN-based methods and GAN inversion methods, which generally rely on generative adversarial networks. We then discuss the prevailing Transformer-based frameworks comprehensively. Finally, we present several different methods in multimodal image synthesis and editing.

3.1 GAN-based Methods

Different GAN-based networks have been designed for multimodal image synthesis and editing. We will introduce them according to their adopted guidance modality in the following subsections.

3.1.1 Visual Guidance

Visual guidance in multimodal image synthesis and editing includes paired visual guidance and unpaired visual guidance.

Fig. 3: Illustration of the spatially-adaptive de-normalization [3]. The image is from [3].

Paired Visual Guidance. Paired visual guidance means that the provided guidance is accompanied by corresponding ground-truth images, which provide direct supervision. In addition to the adversarial loss, image synthesis with paired visual guidance is usually trained with a supervised loss between the generated image and the ground truth. Isola et al.[2] first investigate conditional GANs as a general framework, named Pix2Pix, for various image translation tasks (e.g., edge-to-image, day-to-night, and semantic-to-image). To mitigate the constraint of Pix2Pix [2] in high-resolution image synthesis, Wang et al.[85] propose Pix2PixHD, which can synthesize images of 2048 × 1024 resolution. However, Pix2Pix [2] as well as its variant [85] cannot encode complex structural relationships between the guidance and real images when there exist very different views or severe deformations. Therefore, Tang et al.[86] propose an attention selection module to align the cross-view guidance with the target view. On the other hand, previous methods directly encode the visual guidance with deep networks for further generation, which is suboptimal as part of the guidance information tends to be lost in normalization layers. SPADE [3] is designed to inject the guidance features effectively through spatially-adaptive de-normalization, as shown in Fig. 3. SEAN [87] introduces a semantic region-adaptive normalization layer to achieve region-wise style injection. Claiming that traditional image translation networks [2, 85, 3] suffer from high computational cost when handling high-resolution images, Shaham et al. [88] propose ASAPNet, a lightweight yet efficient network for the translation of high-resolution images. Recently, Zhang et al.[89] and Zhan et al.[45] introduce exemplar-based image translation frameworks which build dense correspondence between the exemplar and the condition input to provide accurate guidance. However, building the dense correspondence incurs quadratic memory cost. Thus, Zhou et al.[90] propose to leverage PatchMatch [91] with GRU assistance to build high-resolution correspondence efficiently. In addition, Zhan et al.[92] introduce a bi-level alignment scheme to reduce memory cost while building dense correspondence.
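The idea behind spatially-adaptive de-normalization can be sketched on a single feature channel. This is a simplified illustration, not the SPADE architecture: the activations are first normalized without learned parameters, and each position is then scaled and shifted by values that depend on the semantic label at that position. Here the modulation is a per-class lookup (`gamma_per_class`, `beta_per_class`, both hypothetical), which is an assumption standing in for the learned convolutions applied to the segmentation map.

```python
import math


def spade_single_channel(feature_map, label_map,
                         gamma_per_class, beta_per_class, eps=1e-5):
    """Normalize one channel, then modulate it position-wise by semantic label."""
    values = [v for row in feature_map for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var + eps)
    out = []
    for feat_row, label_row in zip(feature_map, label_map):
        out.append([
            # Scale and shift depend on the label at this spatial position,
            # so layout information survives the normalization step.
            gamma_per_class[label] * (v - mean) / std + beta_per_class[label]
            for v, label in zip(feat_row, label_row)
        ])
    return out


features = [[1.0, 3.0], [1.0, 3.0]]
labels = [[0, 0], [1, 1]]  # top row is class 0, bottom row is class 1
modulated = spade_single_channel(features, labels,
                                 gamma_per_class={0: 1.0, 1: 2.0},
                                 beta_per_class={0: 0.0, 1: 10.0})
```

In this toy input both rows hold identical activations, yet the outputs differ by row because the modulation parameters follow the label map.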

At the other end, as the mapping between visual guidance and real images is naturally non-deterministic, several studies [52, 93] focus on learning to map the same guidance to different images, which leads to diverse generation outcomes. For example, BicycleGAN [52] combines cVAE-GAN [94, 95, 39] and cLR-GAN [96, 97, 98] to generate diverse and realistic outputs. Besides, image translation models can also be combined with disentangled representation methods [96, 99, 100, 101] to achieve diverse outputs by randomly sampling disentangled features (e.g., style) from a Gaussian distribution. For instance, Gonzalez-Garcia et al.[102] propose to disentangle the domain feature representation into a part shared across domains and two exclusive parts, one for each domain, enabling diverse generation by sampling from the disentangled domains.

Unpaired Visual Guidance. Unpaired image synthesis utilizes unpaired training images to convert images from one domain to another. The generation of realistic images mainly relies on adversarial learning [1] with certain constraint losses. Specifically, Zhu et al.[57] design a cycle-consistency loss to preserve the image content by ensuring that the input image can be recovered from the translation result. However, the cycle-consistency loss is too restrictive for image translation as it assumes a bijective relationship between the two domains. Several studies [103, 104, 105] thus explore one-way translation and bypass the bijection constraint of cycle-consistency. With the emergence of contrastive learning, CUT [58] proposes to maximize the mutual information of positive pairs via noise contrastive estimation [106] for the preservation of contents in unpaired image translation. Andonian et al.[107] introduce contrastive learning to measure the inter-image similarity in paired image translation. However, there exist mappings between the two domains where an individual image in one domain may not share any characteristics with its representation in the other domain after mapping. Therefore, TravelGAN [104] proposes to preserve the intra-domain vector transformations in a latent space learned by a siamese network, which makes it possible to learn mappings between more complex domains that are very different from each other.
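The cycle-consistency constraint discussed above can be written down compactly. The sketch below is illustrative only: images are flat lists of floats, the two translators `g_xy` and `f_yx` are hypothetical stand-ins for learned generators, and an L1 distance is assumed for the reconstruction error.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equally sized vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_consistency_loss(x, y, g_xy, f_yx):
    """Round-trip reconstruction error in both translation directions."""
    forward = l1_distance(f_yx(g_xy(x)), x)   # X -> Y -> X should return x
    backward = l1_distance(g_xy(f_yx(y)), y)  # Y -> X -> Y should return y
    return forward + backward


def double(v):
    return [2.0 * t for t in v]


def halve(v):
    return [t / 2.0 for t in v]


def halve_plus_one(v):
    return [t / 2.0 + 1.0 for t in v]


x, y = [1.0, 2.0], [4.0, 6.0]
perfect = cycle_consistency_loss(x, y, double, halve)            # exact inverses
imperfect = cycle_consistency_loss(x, y, double, halve_plus_one)  # not inverses
```

The loss vanishes only when the two translators invert each other, which is exactly the bijection assumption that one-way translation methods try to avoid.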

3.1.2 Text Guidance

Reed et al.[11] were the first to extend conditional GANs [6] for image synthesis conditioned on textual descriptions. Empowered by the advance of GANs for image synthesis, text-guided image synthesis has made significant progress through stacked architectures, attention mechanisms, siamese architectures, cycle consistency, and the adaptation of unconditional models.

Stacked Architectures. Aiming to synthesize high-resolution images, stacked architectures are widely adopted in text-to-image synthesis. Specifically, StackGAN [16] generates a coarse, low-resolution image at the first stage, followed by a second generator that outputs a higher-resolution image at the second stage. StackGAN++ [17] further improves StackGAN [16] by jointly training three generators and discriminators. Instead of using multiple generators, HDGAN [108] proposes to employ hierarchically-nested discriminators at multi-scale layers to generate high-resolution images. PPAN [109] proposes to employ an auxiliary classification loss and a perceptual loss [110] based on a pre-trained VGG [111] network.

Attention Mechanisms. By allowing the model to focus on specific parts of an input, attention mechanisms have proven beneficial to language and vision models [112, 31]. In terms of text-guided image synthesis, AttnGAN [14] incorporates attention mechanisms in a multi-stage manner to synthesize fine-grained details based on both the relevant words and the global sentence. Huang et al.[113] introduce an attention mechanism between text words and object regions obtained from bounding boxes. SEGAN [114] introduces an attention regularization term [115] that preserves the weights of keywords only, with zero weight for other words. As the spatial attention in [14] mainly focuses on color information, ControlGAN [116] proposes a word-level spatial attention which correlates words with the corresponding semantic regions.

Fig. 4: Illustration of the cycle structure in MirrorGAN [18]. The image is from [18].

Cycle Consistency. To ensure cycle consistency for the text prompt or the encoded text feature, some works pass the generated images through an image captioning network [117, 18] or an image encoder network [118], as shown in Fig. 4. Specifically, PPGN [117] employs an image captioning model to iteratively retrieve a latent code which maximizes a feature activation of the corresponding image according to a feedback network. Inspired by CycleGAN [57], cycle-consistent re-description architectures [18, 119] learn a consistent feature embedding between images and the corresponding text descriptions. Specifically, MirrorGAN [18] aims to re-describe the generated images via a semantic text regeneration and alignment module. Inspired by adversarial inference methods [97], Lao et al.[118] propose to disentangle style and content in the latent space, with a cycle consistency loss to learn a consistent encoder and decoder.

Adapting Unconditional Models. Grounded in the progress of large-scale GAN models [20, 19], several studies explore leveraging the architecture of large-scale models for text-to-image generation. Specifically, textStyleGAN [120] extends StyleGAN [20] to achieve high-resolution image synthesis from text guidance. Similar to [121], Bridge-GAN [122] employs a progressive generator and discriminator with cross-modal projection matching and cross-modal projection classification losses [123] to align generated images with the text description. Built on BigGAN [19], which achieves SOTA performance on conditional image synthesis, Souza et al.[124] propose to generate interpolated sentence embeddings by leveraging the available captions corresponding to a particular image. Similarly, TVBi-GAN [125] employs the architecture of BiGAN [97] with a latent space defined in ALI [98] to project sentence features.

Fig. 5: Illustration of the disentangled spaces of pose, identity, and speech content in PC-AVS [47]. To learn the disentangled spaces, three augmented images are created as shown in (1), followed by a training framework as shown in (2). The image is from [47].

3.1.3 Audio Guidance

The task of audio-driven talking face generation aims to synthesize talking faces that speak the given audio clips [74], and has wide applications in digital face animation, film production, visual dubbing, etc. One fundamental challenge in audio-driven talking face generation is how to accurately convert audio contents into visual information. Leveraging generative adversarial models [1], researchers have developed different techniques to address this challenge. For instance, Chung et al.[72] learn a joint embedding of raw audio and video data and project it to the image plane with a decoder to generate talking faces. Following [72], Zhou et al.[76] propose DAVS, which learns a disentangled audio-visual representation that helps improve the quality of the synthesized talking faces. Chen et al.[74] design a hierarchical structure that first maps the audio clip into facial landmarks and then generates talking faces based on the landmarks. Zhou et al.[126] introduce MakeItTalk, which predicts speaker-aware facial landmarks from the speech content to better preserve the characteristics of the speaker. Yi et al.[78] propose to map audio content to 3DMM parameters [127] for guiding the generation of talking faces. Zhou et al.[47] present PC-AVS, which achieves pose-controllable talking face generation by learning disentangled feature spaces of pose, identity, and speech content.

Fig. 6: GAN inversion method with cross-modal matching in latent space: Both image and guidance embeddings are projected into the StyleGAN [20] latent space. The cross-modal similarity learning aims to pull the visual embedding and the guidance embedding closer together. For cross-modal image editing (e.g., text guidance), the cross-modal embeddings can first be obtained through the corresponding encoders. Image editing can then be performed through style mixing to get the edited latent code, which is further updated through instance-level optimization. The edited latent code is fed into the StyleGAN generator to yield the edited image. The illustration is from [128].

Fig. 7: The architecture of the text-guided mapper in StyleCLIP [28]. The source image (left) is inverted into a latent code $w$. Three separate mapping functions are trained to generate residuals (in blue) that are added to $w$ to yield the target code, from which a pre-trained StyleGAN (in green) generates an image (right), assessed by the CLIP and identity losses. The image is from [28].

3.2 GAN Inversion Methods

3.2.1 Preliminary

GANs [1, 20] have achieved remarkable progress in high-resolution and realistic image synthesis. To bridge the real and fake image domains, a series of studies aim to invert a given image back into the latent space of a pre-trained GAN model, which is termed GAN inversion. We first define the problem of GAN inversion under a unified mathematical formulation. The generator of an unconditional GAN learns the mapping $G: \mathcal{Z} \rightarrow \mathcal{X}$, where $\mathcal{Z}$ and $\mathcal{X}$ denote the spaces of latent codes and real images. When $z_1, z_2 \in \mathcal{Z}$ are close in the $\mathcal{Z}$ space, the corresponding images $x_1, x_2 \in \mathcal{X}$ are visually similar. GAN inversion maps data $x$ back to a latent code $z^*$ that can be fed into the pre-trained generator to reconstruct $x$. Formally, denoting the signal to be inverted as $x$, the well-trained generator as $G$, and the latent code as $z$, GAN inversion can be formulated as below:

$$z^* = \underset{z}{\arg\min}\; \ell\big(G(z), x\big),$$

where $\ell(\cdot)$ is a distance metric in the image or feature space. Typically, $\ell(\cdot)$ can be based on $\ell_1$, $\ell_2$, perceptual [37] or LPIPS [129] metrics. Some other constraints on latent codes [25] or face identity [130] could also be included in practice. With the obtained latent code $z^*$, we can faithfully reconstruct the original image and conduct image manipulation in the latent space.
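Optimization-based inversion can be illustrated with a toy example: search for the latent code whose generated output is closest to a target under a chosen distance. The sketch makes strong simplifying assumptions. A small fixed linear map stands in for the pre-trained generator, the distance is a squared L2 error, and plain gradient descent with hand-picked step size and iteration count replaces the optimizers used in practice; all names are hypothetical.

```python
def generate(weights, z):
    """Toy linear 'generator' x = W z, standing in for a pre-trained GAN."""
    return [sum(w * v for w, v in zip(row, z)) for row in weights]


def invert(weights, target, steps=500, lr=0.05):
    """Approximate z* = argmin_z ||G(z) - x||^2 by gradient descent."""
    z = [0.0] * len(weights[0])
    for _ in range(steps):
        residual = [g - t for g, t in zip(generate(weights, z), target)]
        # For a linear generator the gradient is 2 * W^T (G(z) - x).
        grad = [2.0 * sum(weights[i][j] * residual[i]
                          for i in range(len(weights)))
                for j in range(len(z))]
        z = [zj - lr * gj for zj, gj in zip(z, grad)]
    return z


W = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # fixed "pre-trained" weights
z_true = [0.5, -1.0]
x = generate(W, z_true)                   # the signal to be inverted
z_star = invert(W, x)
reconstruction = generate(W, z_star)
```

With a real generator the objective is non-convex, so the recovered code depends on the initialization and on the distance chosen; the toy case is convex and recovers the original code.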

In terms of multimodal image synthesis and editing, the key lies in how to edit or generate latent code according to the corresponding cross-modal guidance.

Fig. 8: Overview of the training setup of StyleGAN-NADA. Two intertwined generators, $G_{\text{frozen}}$ and $G_{\text{train}}$, are initialized using the weights of a generator pre-trained on images from a source domain (e.g., FFHQ [20]). The weights of $G_{\text{frozen}}$ remain fixed throughout the process, while those of $G_{\text{train}}$ are modified through optimization and an iterative layer-freezing scheme. The process shifts the domain of $G_{\text{train}}$ according to a user-provided textual direction while maintaining a shared latent space.

[Figure 9 panels. Driving texts (source → target): Photo → Raphael Painting; Dog → The Joker; Dog → Nicolas Cage; Church → The Shire. StyleCLIP column labels: Latent Optimization, Latent Mapper, Global Directions.]
Fig. 9: Manipulation comparisons between StyleCLIP [28] and StyleGAN-NADA. The left column shows an image synthesized from a source generator with a given latent code. All three StyleCLIP [28] methods are used to edit the latent code towards an out-of-domain textual direction. The last column shows the image produced by feeding the original latent code to a generator converted using StyleGAN-NADA. Driving texts are shown to the left of each row. The latent optimization and mapper utilize only the target text. StyleGAN-NADA successfully applies out-of-domain changes which are beyond the scope of all StyleCLIP approaches.

3.2.2 Cross-modal Matching in Latent Space

TediGAN [128] achieves multimodal image synthesis and editing by matching the embeddings of images and cross-modal inputs (e.g., semantic maps, text) in a common embedding space, as shown in Fig. 6. Specifically, a cross-modal encoder is trained to learn the embeddings with a visual-linguistic similarity loss and a pairwise ranking loss [131, 132]. To preserve identity after editing, an instance-level optimization module is included in the objective, which allows the target attributes to be modified according to the text description. As texts are encoded into the StyleGAN latent space, TediGAN inherently supports image generation from multimodal inputs. For text-guided image manipulation, TediGAN encodes both the image and the text into the shared latent space and then performs the manipulation through style mixing.

3.2.3 Image Code Optimization in Latent Space

Instead of mapping the text into the latent space, a popular line of research optimizes the latent code of the original image directly, guided by a loss that measures cross-modal consistency.

In particular, Jiang et al. [30] optimize the image latent code through a pre-trained fine-grained attribute predictor, which pushes the output latent code to change in a direction consistent with the text description. The attribute predictor also helps keep irrelevant attributes unchanged through a cross-entropy score indicating whether each predicted attribute is consistent with its ground-truth label. However, this attribute predictor is specially designed for face editing with fine-grained attribute annotations, which makes it hard to generalize to other scenes.

Rather than employing a specific attribute predictor, several concurrent projects use Contrastive Language-Image Pre-training (CLIP) [29] to guide text-to-image generation through optimization. Aiming for text-guided image inpainting, Bau et al. [133] define a CLIP-based semantic consistency loss and optimize the latent code inside the inpainting region so that it is semantically consistent with the given text. StyleCLIP [28] uses a pre-trained CLIP model as loss supervision so that the manipulated results match the text condition, as illustrated in Fig. 7. Besides, StyleCLIP also introduces a latent residual mapper trained for a specific text prompt. Given a starting point in latent space (the input image to be manipulated), the mapper yields a local step in latent space. Finally, StyleCLIP introduces a method for mapping a text prompt into an input-agnostic (global) direction in StyleGAN’s style space, providing control over the manipulation strength as well as the degree of disentanglement.

Fig. 10: Taming Transformer [36] first learns a discrete and compressed representation which can reconstruct the original image faithfully, followed by an autoregressive transformer that models the dependency of the discrete sequence. The image is from [36].

3.2.4 Domain Generalization

However, StyleCLIP requires training a separate mapper for each specific text description, which is inflexible in real applications. HairCLIP [134] thus introduces a hair editing framework that supports different texts by exploring the potential of CLIP beyond measuring image-text similarity. Specifically, HairCLIP introduces a shared condition embedding strategy which unifies the text and image conditions into the same domain. With this strategy, HairCLIP possesses certain extrapolation capabilities after training with only a limited number of hair-editing descriptions, which allows it to produce reasonable edits for texts that never appear in the training descriptions.

Instead of generalizing over text descriptions, StyleGAN-NADA [135] presents a text-guided image editing method that shifts a generative model to new domains without collecting even a single image from those domains, as illustrated in Fig. 8. The domain shift is achieved by modifying the generator’s weights towards images aligned with the driving text, along certain textually-prescribed paths in CLIP’s embedding space. Through natural language prompts and a few minutes of training, StyleGAN-NADA can adapt a generator across a multitude of domains, as illustrated in Fig. 9.

3.3 Transformer-based Methods

3.3.1 Transformer Preliminary

Leveraging their powerful attention mechanisms, Transformer [31, 33, 35, 136] models have emerged as a paradigm in sequence-dependent modeling. Inspired by the success of the GPT model [32] in natural language modeling, image GPT (iGPT) [33] employs a Transformer for auto-regressive image generation by treating the flattened image sequences as discrete tokens. The plausibility of the generated images demonstrates that the Transformer model is able to model the spatial relationships between pixels and high-level attributes (texture, semantics, and scale).

As Transformer models inherently support multimodal inputs, a series of studies have been proposed to explore multimodal image synthesis based on Transformer [36, 35]. Overall, the pipeline for Transformer-based image synthesis consists of a vector quantization process to achieve discrete representation and compress the data dimensionality, and an auto-regressive modeling process which establishes the dependency between discrete tokens in a raster-scan order.

3.3.2 Discrete Vector Representation

Directly treating all image pixels as a sequence for auto-regressive modeling is expensive in terms of memory consumption, as the self-attention mechanism in Transformer incurs a memory cost that is quadratic in the sequence length. Chen et al. [33] adopt a color palette, generated by k-means clustering of RGB pixel values with k=512 on the ImageNet [137] dataset, to reduce the dimensionality to 512 while faithfully preserving the main structure of the original images. However, k-means clustering only reduces the codebook size, and the sequence length remains unchanged. Thus, the Transformer model still cannot be scaled to higher resolutions, since the cost grows quadratically with the sequence length. To this end, the Vector Quantised VAE (VQ-VAE) [138] is introduced, together with a series of improvements in terms of Gumbel Softmax, extra losses, and Transformer architecture.

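VQ-VAE. Instead of conducting quantization in pixel space, Oord et al. propose VQ-VAE [138] to quantize image patches into discrete tokens with a learnt vector codebook, as shown in Fig. 10. Given an image $x \in \mathbb{R}^{H \times W \times 3}$, VQ-VAE yields a discrete representation $z_q \in \mathbb{R}^{h \times w \times n_z}$, where each entry of $z_q$ comes from a visual codebook (vocabulary) $\mathcal{Z} = \{z_k\}_{k=1}^{K}$ consisting of $K$ $n_z$-dimensional codes. For simplicity, we write $z_q = \{z_{q,ij}\}$, where $(i, j)$ with $1 \le i \le h$ and $1 \le j \le w$ indicates the spatial location of the feature. Specifically, VQ-VAE consists of an encoder $E$, a feature quantizer $Q$, and a decoder $G$. The image is fed into the encoder to learn a continuous representation $\hat{z} = E(x)$, where $\hat{z} \in \mathbb{R}^{h \times w \times n_z}$. Then the continuous feature is quantized by the feature quantizer, by assigning it to the nearest codebook entry:

$$q_{\mathrm{enc}}(\hat{z})_{ij} = \arg\min_{k} \|\hat{z}_{ij} - z_k\|_2, \qquad z_{q,ij} = q_{\mathrm{dec}}\left(q_{\mathrm{enc}}(\hat{z})\right)_{ij} = z_{q_{\mathrm{enc}}(\hat{z})_{ij}},$$

where the quantization module $q_{\mathrm{enc}}(\cdot)$ maps the feature to an index of the codebook, and the quantization module $q_{\mathrm{dec}}(\cdot)$ reconstructs the feature from the index. Then the decoder aims to reconstruct the original image from the quantized feature $z_q$, yielding a reconstruction result denoted by $\hat{x} = G(z_q)$. Since the quantization step is not differentiable, the gradient is approximated with the straight-through estimator [139] by copying the gradient from the decoder to the encoder [138]. Thus, the full learning objective of VQ-VAE can be formulated as:

$$\mathcal{L}_{VQ} = \|x - \hat{x}\|_2^2 + \|\mathrm{sg}[E(x)] - z_q\|_2^2 + \beta \|\mathrm{sg}[z_q] - E(x)\|_2^2,$$

where $\mathrm{sg}[\cdot]$ denotes the stop-gradient operator and $\beta$ weights the commitment term.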
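The nearest-codebook-entry assignment described above can be sketched in a few lines. This is only an illustration of the lookup step on hand-written vectors; the codebook values and the function name `quantize` are assumptions, and a real VQ-VAE learns the codebook jointly with the encoder and decoder.

```python
def quantize(features, codebook):
    """Assign each feature vector to its nearest codebook entry (squared L2)."""
    indices, quantized = [], []
    for f in features:
        dists = [sum((fi - ci) ** 2 for fi, ci in zip(f, code))
                 for code in codebook]
        k = min(range(len(codebook)), key=dists.__getitem__)
        indices.append(k)                   # discrete token (codebook index)
        quantized.append(list(codebook[k]))  # feature rebuilt from the index
    return indices, quantized


# Hypothetical 3-entry codebook with 2-dimensional codes.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
# Hypothetical encoder outputs at three spatial locations.
features = [[0.9, 1.2], [-0.8, 0.7], [0.1, -0.2]]
indices, z_q = quantize(features, codebook)
# indices == [1, 2, 0]
```

The list `indices` is the token sequence that a subsequent auto-regressive model would operate on, while `z_q` is what the decoder receives.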

Fig. 11: The visualization of the first 1024 codes in vanilla vector quantization (VQ) and Gumbel Softmax quantization (both with downsampling factor f=8).

Gumbel Softmax. However, the vanilla VQ-VAE with the argmin operation suffers from severe codebook collapse, i.e., only a few codebook entries are effectively utilized for quantization. As shown in Fig. 11 (images are from 111), most codes in the original VQ-VAE are invalid or not utilized for feature encoding. Recently, vq-wav2vec [140] introduces Gumbel Softmax [141] to replace argmin for quantization. The Gumbel-Softmax allows sampling discrete representations in a differentiable way through the straight-through gradient estimator [139]. At the training stage, given output logits $l \in \mathbb{R}^{K}$ for the Gumbel-Softmax, the predicted probability of the $j$-th discrete representation is

$$p_j = \frac{\exp\left((l_j + v_j)/\tau\right)}{\sum_{k=1}^{K} \exp\left((l_k + v_k)/\tau\right)},$$

where $\tau$ is a non-negative temperature, $v = -\log(-\log(u))$, and $u$ are samples from the uniform distribution $\mathcal{U}(0, 1)$. The gradient of the Gumbel-Softmax outputs can then be used for back-propagation. At the inference stage, the largest index in $l$ is picked.
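A minimal sketch of Gumbel-Softmax sampling follows, assuming the standard formulation in which Gumbel noise $-\log(-\log(u))$ with $u \sim \mathcal{U}(0, 1)$ is added to the logits before a temperature-scaled softmax. The function names, the clamping constant, and the example logits are illustrative choices, not taken from vq-wav2vec; gradient handling (the straight-through part) is omitted because it requires an autodiff framework.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Return a relaxed one-hot sample over the codebook entries."""
    noise = []
    for _ in logits:
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noise.append(-math.log(-math.log(u)))
    scaled = [(l + n) / tau for l, n in zip(logits, noise)]
    m = max(scaled)                          # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def inference_index(logits):
    """At inference time no noise is added: pick the largest logit."""
    return max(range(len(logits)), key=logits.__getitem__)


logits = [0.1, 0.5, 2.0]                     # hypothetical encoder logits
probs = gumbel_softmax(logits, tau=0.5, rng=random.Random(0))
chosen = inference_index(logits)
```

Lower temperatures push `probs` towards a one-hot vector, while higher temperatures make the sample closer to uniform.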

Fig. 12: The comparison of discrete representation learning with different loss types. The images in the first column are the original images, the images in the second to fourth column are the reconstructed images with pixel loss, pixel loss and perceptual loss, pixel loss and both the perceptual and the adversarial loss, respectively. The images are from [142].

Extra Loss. To achieve good perceptual quality for the reconstructed image, Esser et al. [36] propose VQGAN, which incorporates an adversarial loss with a patch-based discriminator and a perceptual loss [37, 38, 39] for image reconstruction in VQ-VAE. Instead of using VGG features pre-trained on ImageNet, Dong et al. [142] leverage a self-supervised network [143, 64] to learn deep visual features that enforce perceptual similarity during dVAE training. With the extra adversarial and perceptual losses, the image quality is clearly improved compared with using the pixel loss alone, as shown in Fig. 12.

Transformer Architecture. In the above approaches, a convolutional neural network (CNN) is learned to quantize and generate images. Instead, Yu et al. [144] propose ViT-VQGAN, which replaces the CNN encoder and decoder with a Vision Transformer (ViT) [145]. Given sufficient data (for which unlabeled image data is plentiful), ViT-VQGAN is shown to be less constrained by the inductive priors imposed by convolutions. Furthermore, ViT-VQGAN yields better computational efficiency on accelerators and produces higher-quality reconstructions. ViT-VQGAN also introduces two improvements that significantly encourage codebook usage even with a larger codebook size: 1) factorized codes, which introduce a linear projection from the encoder output to a low-dimensional latent variable space for code index lookup and boost codebook usage substantially; and 2) normalization of the encoded latent variables and codebook latent variables, which improves training stability and reconstruction quality.

3.3.3 Auto-regressive Modeling

Auto-regressive models have been widely explored for building sequence dependency. Earlier auto-regressive models such as PixelCNN [146] struggle to model long-term relationships within an image due to their limited receptive field. With the prevalence of Transformer [31], Parmar et al. [147] develop a Transformer-based generation model with an enhanced receptive field that sequentially predicts each pixel conditioned on previous prediction results.

Auto-regressive (AR) modeling is a representative objective for accommodating sequence dependencies, complying with the chain rule of probability. The probability of each token in the sequence is conditioned on all previous predictions, yielding a joint distribution of the sequence as the product of conditional distributions:

$$p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, x_2, \dots, x_{i-1}).$$

During inference, each token is predicted auto-regressively in a raster-scan order. A top-$k$ sampling strategy is adopted to randomly sample from the $k$ most likely next tokens, which naturally enables diverse sampling results. The predicted tokens are then concatenated with the previous sequence as conditions for the prediction of the next token. This process repeats iteratively until all the tokens are sampled.
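The sampling loop described above can be sketched as follows. The function `toy_next_token_logits` is a hypothetical stand-in for a trained Transformer (it simply favours the token following the last one), and the vocabulary size and sequence length are arbitrary; only the top-$k$ truncation and the token-by-token loop are the point of the example.

```python
import math
import random


def top_k_sample(logits, k, rng):
    """Sample one token id from the k highest-scoring entries."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]  # softmax over the top-k
    return rng.choices(top, weights=weights, k=1)[0]


def toy_next_token_logits(prefix, vocab_size):
    """Hypothetical model: prefers (last token + 1) modulo the vocabulary."""
    last = prefix[-1] if prefix else -1
    return [3.0 if t == (last + 1) % vocab_size else 0.0
            for t in range(vocab_size)]


def sample_sequence(length, vocab_size, k, seed=0):
    """Generate tokens one by one, each conditioned on all previous tokens."""
    rng = random.Random(seed)
    tokens = []
    for _ in range(length):
        logits = toy_next_token_logits(tokens, vocab_size)
        tokens.append(top_k_sample(logits, k, rng))
    return tokens


greedy = sample_sequence(length=8, vocab_size=8, k=1)    # k=1 is greedy decoding
diverse = sample_sequence(length=16, vocab_size=8, k=3)  # k>1 adds randomness
```

With `k=1` the loop is deterministic, whereas larger `k` trades fidelity to the most likely continuation for sample diversity.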

Fig. 13: The Sliding window strategy for image sampling in auto-regressive models. The image is from [36].

Sliding Window Sampling. To speed up auto-regressive image generation, Esser et al. [36] employ a sliding-window strategy to sample from the trained Transformer model, as illustrated in Fig. 13. Instead of estimating the current result from all previous predictions, the sliding-window strategy only utilizes the predictions within a local window, which reduces the inference time significantly. As long as spatial conditioning information is available or the dataset statistics are approximately spatially invariant, the local context in the sliding window is sufficient for faithful modeling of image sequences. In practice, this is not a restrictive requirement: when it is violated (i.e., unconditional image synthesis on aligned data), the sequence can simply be conditioned on image coordinates.
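To make the idea of a local context concrete, the sketch below lists which previously generated tokens fall inside a square window around the current position on a token grid. The window placement (centred and clamped at the borders) and the function name are simplifying assumptions for illustration and may differ from the actual implementation in [36].

```python
def window_context(row, col, height, width, window):
    """Raster-order indices of already-generated tokens inside a local window."""
    half = window // 2
    # Centre the window on (row, col) but keep it inside the grid.
    top = max(0, min(row - half, height - window))
    left = max(0, min(col - half, width - window))
    context = []
    for r in range(top, min(top + window, height)):
        for c in range(left, min(left + window, width)):
            if (r, c) < (row, col):   # tuple order matches raster-scan order
                context.append(r * width + c)
    return context


# On an 8x8 token grid, position (5, 5) has 45 predecessors in total,
# but a 4x4 window keeps only the nearby ones.
local = window_context(row=5, col=5, height=8, width=8, window=4)
first = window_context(row=0, col=0, height=8, width=8, window=4)
```

The context length is bounded by the window area regardless of image size, which is where the reduction in inference time comes from.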

Fig. 14: Overview of ImageBART [40]: A compressed and discrete image representation is first learnt. Then ImageBART inverts a multinomial diffusion process via a Markov chain. The individual transition probabilities in the Markov chain are modeled as independent AR models, which leads to a coarse-to-fine hierarchy where each individual AR model can attend to the preceding global context in the hierarchy. The image is from [40].


Bidirectional Context. On the other hand, previous methods incorporate image context in a raster-scan order by attending only to previously generated results. This strategy is unidirectional and suffers from sequential bias, as it disregards much context information until the autoregression is nearly complete. It also ignores contextual information at different scales, as it only processes the image at a single scale. Based on these observations, ImageBART [40] addresses the unidirectional bias of autoregressive modeling and the corresponding exposure bias with a coarse-to-fine approach in a unified framework. Specifically, it learns the data density in a coarse-to-fine manner: the compressed contextual information of the image obtained in the coarse stage is provided for the autoregressive modeling at the next finer stage. A diffusion process is applied to successively eliminate information, yielding a hierarchy of representations which is further compressed via a multinomial diffusion process [148, 149]. By training a Markov chain, the above diffusion process can be inverted to recover the data from the hierarchical representation. By modeling the Markovian transitions autoregressively while attending to the preceding hierarchical state, crucial global context can be leveraged for each individual autoregressive step.

Fig. 15: Overview structure of NUWA [41]. It contains an adaptive encoder supporting different conditions and a pre-trained decoder benefiting from both image and video data. For image completion, video prediction, image manipulation, and video manipulation tasks, the input partial images or videos are fed to the decoder directly. The image is from [41].
Fig. 16: Examples of 8 typical visual generation and manipulation tasks supported by the NUWA model. The image is from [41].

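3D Nearby Self-Attention. With the obtained discrete tokens, early works leverage PixelCNN or PixelRNN to build the sequence dependency. To cover language, image, and video at the same time for different scenarios, NUWA [41] designs a 3D transformer encoder-decoder framework, which can not only deal with videos as 3D data but also adapt to texts and images as 1D and 2D data, respectively. A unified 3D Nearby Self-Attention (3DNA) module which supports both self-attention and cross-attention can be defined as below:

$$Y = \mathrm{3DNA}(X, C; W),$$

where both $X \in \mathbb{R}^{h \times w \times s \times d_{in}}$ and $C \in \mathbb{R}^{h' \times w' \times s' \times d_{in}}$ are 3D representations. If $C = X$, 3DNA denotes self-attention on the target $X$, and if $C \neq X$, 3DNA is cross-attention on the target $X$ conditioned on $C$. $W$ denotes learnable weights.

With a coordinate $(i, j, k)$ under $X$, the corresponding coordinate under $C$ can be denoted by $(i', j', k') = (\lfloor i h'/h \rfloor, \lfloor j w'/w \rfloor, \lfloor k s'/s \rfloor)$ after a linear projection. Then, the local neighborhood around $(i', j', k')$ with a width, height and temporal extent $e^w, e^h, e^s$ is defined as

$$N^{(i,j,k)} = \left\{ C_{abc} : |a - i'| \le e^h,\ |b - j'| \le e^w,\ |c - k'| \le e^s \right\},$$

where $N^{(i,j,k)}$ is a sub-tensor of the condition $C$ and consists of the corresponding nearby information that $(i, j, k)$ needs to attend to. With three learnable weights $W^Q, W^K, W^V \in \mathbb{R}^{d_{in} \times d_{out}}$, the output tensor for the position $(i, j, k)$ is given by

$$Q^{(i,j,k)} = X_{ijk} W^Q, \qquad K^{(i,j,k)} = N^{(i,j,k)} W^K, \qquad V^{(i,j,k)} = N^{(i,j,k)} W^V,$$

$$Y_{ijk} = \mathrm{softmax}\left( \frac{Q^{(i,j,k)} \left(K^{(i,j,k)}\right)^{\top}}{\sqrt{d_{in}}} \right) V^{(i,j,k)},$$

where the position $(i, j, k)$ queries and collects the corresponding nearby information in $C$. This also covers the case $C = X$, in which $(i, j, k)$ simply queries the nearby positions of itself. 3DNA not only reduces the complexity of full attention from $O((hws)^2)$ to $O((hws)(e^h e^w e^s))$, but also shows superior performance.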
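The neighborhood selection at the heart of 3DNA can be illustrated with plain index arithmetic. The sketch below only enumerates which condition coordinates a target position would attend to; it assumes integer extents, floor-based coordinate projection, and clipping at the tensor borders, all of which are illustrative choices rather than details confirmed by [41]. The attention computation itself is omitted.

```python
def nearby_coords(i, j, k, x_shape, c_shape, extent):
    """Coordinates in the condition C that target position (i, j, k) attends to."""
    h, w, s = x_shape
    hc, wc, sc = c_shape
    eh, ew, es = extent
    # Project the target coordinate onto the condition grid.
    ic, jc, kc = i * hc // h, j * wc // w, k * sc // s
    return [(a, b, c)
            for a in range(max(0, ic - eh), min(hc, ic + eh + 1))
            for b in range(max(0, jc - ew), min(wc, jc + ew + 1))
            for c in range(max(0, kc - es), min(sc, kc + es + 1))]


# Self-attention case (C = X) on a 4x4x2 token volume with extent 1.
corner = nearby_coords(0, 0, 0, (4, 4, 2), (4, 4, 2), (1, 1, 1))
inner = nearby_coords(2, 2, 1, (4, 4, 2), (4, 4, 2), (1, 1, 1))
# Cross-attention case: a 4x4x1 target attending to a 2x2x1 condition.
cross = nearby_coords(3, 3, 0, (4, 4, 1), (2, 2, 1), (0, 0, 0))
```

Each position attends to a bounded number of neighbours instead of all tokens in the volume, so the total cost scales with the number of positions times the neighborhood size.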

3.4 Other Methods

With the development of generative models and neural rendering, other up-to-date models are also explored for multimodal image synthesis and editing.

Fig. 17: The framework of AD-NeRF [150] for the generation of talking-head. With a video sequence of a person, two neural radiance fields are leveraged to generate high-fidelity talking head via volume rendering. The image is from [150].

Neural Radiance Fields. Neural radiance fields (NeRF) [42] achieve impressive performance for novel view synthesis by using a neural network to define an implicit scene representation. Specifically, NeRF adopts a fully-connected neural network which takes a spatial location $(x, y, z)$ with the corresponding viewing direction $(\theta, \phi)$ as input, and predicts the volume density with the corresponding emitted radiance as output. AD-NeRF [150] achieves high-fidelity talking-head synthesis based on the framework of neural radiance fields. Different from previous methods which bridge audio input and video output through intermediate representations, AD-NeRF directly feeds the audio feature into an implicit function to yield a dynamic neural radiance field, which is further leveraged to synthesize a high-fidelity talking-head video accompanying the audio via volume rendering, as shown in Fig. 17. CLIP-NeRF [151] introduces a multimodal manipulation method for neural radiance fields. By leveraging the Contrastive Language-Image Pre-Training (CLIP) model, CLIP-NeRF allows manipulating a NeRF according to a short text prompt or an exemplar image. To bridge the generative latent space and the CLIP embedding space, two code mappers are designed to optimize the latent codes towards the targeted manipulation driven by a CLIP-based matching loss.
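For readers unfamiliar with the volume rendering step that NeRF-style methods rely on, the following sketch composites densities and colours sampled along one camera ray using the standard discrete quadrature (per-sample opacity $1 - \exp(-\sigma \delta)$ weighted by accumulated transmittance). The sample values are made up, and in practice the densities and colours would come from the neural network rather than being given by hand.

```python
import math


def render_ray(densities, colors, deltas):
    """Composite samples along a ray, front to back, into one RGB value."""
    transmittance = 1.0                  # fraction of light not yet absorbed
    pixel = [0.0, 0.0, 0.0]
    for sigma, rgb, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)   # opacity of this segment
        weight = transmittance * alpha
        pixel = [p + weight * ch for p, ch in zip(pixel, rgb)]
        transmittance *= 1.0 - alpha
    return pixel


red, green = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
empty = render_ray([0.0, 0.0], [red, green], [0.5, 0.5])    # empty space
opaque = render_ray([1e6, 1.0], [red, green], [1.0, 1.0])   # opaque red in front
mixed = render_ray([1.0, 1.0], [red, green], [1.0, 1.0])    # semi-transparent
```

Because every operation is differentiable with respect to the densities and colours, the rendered pixel can be compared with a ground-truth pixel to train the underlying network.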

Fig. 18: Overview of DiffusionCLIP [152]. The input image is first converted to the latent via diffusion models. Then, guided by directional CLIP loss, the diffusion model is fine-tuned, and the updated sample is generated during reverse diffusion. The image is from [152].

Diffusion Models. Recently, diffusion models such as denoising diffusion probabilistic models (DDPM) [43, 149] and score-based generative models [44, 153] have achieved great success in image generation tasks [43, 154, 153, 155]. The latest works [153, 156] have demonstrated even higher image synthesis quality compared to variational autoencoders (VAEs) [157], flows [158, 159], auto-regressive models [160, 161] and generative adversarial networks (GANs) [1, 20]. Furthermore, the recent denoising diffusion implicit model (DDIM) [154] accelerates the sampling procedure and enables nearly perfect inversion [156]. In terms of multimodal image synthesis and editing, Avrahami et al. [162] are the first to perform region-based editing in generic natural images by leveraging a text description along with a region of interest. Kim et al. [152] propose DiffusionCLIP, which performs text-driven image manipulation with diffusion models by using a CLIP loss to steer the edit towards the given text prompt, as shown in Fig. 18. Besides, Gu et al. [163] present a vector quantized diffusion (VQ-Diffusion) model for text-to-image generation. Specifically, VQ-Diffusion models the latent space of a vector quantized variational autoencoder [138] by learning a parametric model using a conditional variant of the Denoising Diffusion Probabilistic Model (DDPM) [43, 149]. GLIDE [164] compares CLIP guidance and classifier-free guidance in diffusion models for text-guided image synthesis, and concludes that a diffusion model of 3.5 billion parameters with classifier-free guidance outperforms DALL-E in terms of human evaluation.

Style Transfer. CLVA [165] manipulates the style of a content image through text guidance, by comparing contrastive pairs of content images and style instructions to achieve mutual relativeness. CLIPstyler [166] achieves text-guided style transfer by training a lightweight network which transforms a content image to follow the text condition by matching the similarity between the CLIP model outputs.

4 Experimental Evaluation

4.1 Datasets

Datasets are at the core of image synthesis and editing tasks. We introduce the widely adopted datasets for multimodal image synthesis and editing in the ensuing subsections.

4.1.1 Visual Guidance Datasets

Semantic Segmentation: The ADE20K [167] dataset is annotated with 150-class semantic segmentation. Image generation can be conducted by using its semantic segmentation maps as conditional inputs. Besides, COCO-Stuff [168] and Cityscapes [169] also serve as benchmark datasets for semantic image synthesis.

Scene Layout: Datasets with bounding box annotations can be applied to layout-to-image generation. Specifically, Cityscapes [169], ADE20K [167], and COCO-Stuff [168] are commonly adopted for benchmarking.

Edge Map: CelebA-HQ [170] consists of 30,000 high-quality face images. Face edges are obtained by connecting the face landmarks, and a Canny edge detector is applied to detect the edges in the background for guided image synthesis. Besides, several edge-to-photo datasets are introduced in [2].

Keypoints: DeepFashion [171] contains 52,712 person images with human keypoint annotations. Besides, the Radboud Faces dataset [172] and the Market-1501 dataset [173] are also used for keypoint-guided image generation, as introduced in [174].

Fig. 19: Example images and corresponding captions of common text-to-image synthesis datasets.

4.1.2 Text Guidance Datasets

Widely adopted datasets for text-to-image synthesis research are Oxford-102 Flowers [175], CUB-200 Birds [80], and COCO [176]. Both Oxford-102 Flowers [175] and CUB-200 Birds [80] are relatively small datasets in which each image only contains a single object associated with 10 captions. In contrast, COCO [176] is a much larger dataset with around 123k images and contains multiple objects in complex scenes, as shown in Fig. 19.

On the other hand, many works on text-guided image synthesis and editing rely on pre-trained text and image representation models. These models are mainly trained on MS-COCO [176], Visual Genome [177], and YFCC100M [178]. However, MS-COCO and Visual Genome are relatively small datasets with roughly 100,000 images for training. Although YFCC100M contains 100 million images, the metadata for each image is of varying quality and only 15 million images are associated with natural text descriptions. CLIP [29] constructs a new dataset named WebImageText which consists of 400 million (image, text) pairs collected from the Internet.

4.1.3 Audio Guidance Datasets

Sub-URMP [12] is a subset of URMP [179] that is used for audio-to-image generation.

VoxCeleb2 [180] contains 6,112 celebrities with more than 1 million utterances. The videos are of varying quality, e.g., with large head pose movements, low-light conditions, and different degrees of blur.

Lip Reading in the Wild (LRW) [181] contains up to 1000 utterances for each of 500 different words. The videos in this dataset feature high-quality and near-frontal faces.

4.1.4 Other Guidance Datasets

Image synthesis conditioned on scene graphs can be conducted on Visual Genome [177], which provides scene graph annotations, and on COCO-Stuff [168], where synthetic scene graphs are constructed from ground-truth object positions.

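| Methods | # param | ADE20K FID | ADE20K mIoU | ADE-outdoor FID | ADE-outdoor mIoU | Cityscapes FID | Cityscapes mIoU | COCO-Stuff FID | COCO-Stuff mIoU |
|---|---|---|---|---|---|---|---|---|---|
| CRN | 84M | 73.3 | 22.4 | 99.0 | 16.5 | 104.7 | 52.4 | 70.4 | 23.7 |
| SIMS | 56M | n/a | n/a | 67.7 | 13.1 | 49.7 | 47.2 | n/a | n/a |
| Pix2pixHD | 183M | 81.8 | 20.3 | 97.8 | 17.4 | 95.0 | 58.3 | 111.5 | 14.6 |
| LGGAN | n/a | 31.6 | 41.6 | n/a | n/a | 57.7 | 68.4 | n/a | n/a |
| CC-FPSE | 131M | 31.7 | 43.7 | n/a | n/a | 54.3 | 65.5 | 19.2 | 41.6 |
| SPADE | 102M | 33.9 | 38.5 | 63.3 | 30.8 | 71.8 | 62.3 | 22.6 | 37.4 |
| OASIS | 94M | 28.3 | 48.8 | 48.6 | 40.4 | 47.7 | 69.3 | 17.0 | 44.1 |
| Taming [36] | 465M | 35.5 | - | - | - | - | - | - | - |

TABLE I: Visual-guided (semantic map) image synthesis performance on different benchmark datasets, reported as FID and mIoU for each dataset. Bold denotes the best performance. Rows in grey denote the results of Transformer-based methods; others are the results of GAN-based methods.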
Model IS FID R-Prec.
Real Images - - -
GAN-INT-CLS [182] 2.88 68.79 -
TAC-GAN [15] - - -
GAWWN [182] 3.62 67.22 -
StackGAN [16] 3.70 51.89 -
StackGAN++ [17] 4.04 15.30 -
CVAEGAN [183] 4.97 - -
HDGAN [108] 4.15 - -
FusedGAN [184] 3.92 - -
PPAN [109] 4.38 - -
HfGAN [185] 4.48 - -
LeicaGAN [186] 4.62 - -
AttnGAN [14] 4.36 - 67.82
MirrorGAN [18] 4.56 - 57.67
SEGAN [114] 4.67 18.17 -
ControlGAN [116] 4.58 - 69.33
DM-GAN [187] 4.75 16.09 72.31
DM-GAN [187] 4.71 11.91 76.58
SD-GAN [188] 4.67 - -
textStyleGAN [120] 4.78 - 74.72
AGAN-CL [189] 4.97 - 63.87
TVBi-GAN [125] 5.03 11.83 -
Souza et al. [124] 4.23 11.17 -
RiFeGAN [190] 5.23 - -
Wang et al. [191] 5.06 12.34 86.50
Bridge-GAN [122] 4.74 - -
TABLE II: Text-to-Image generation performance on the CUB-200 Birds dataset.

denotes the result obtained by using the corresponding open-source code.

Model IS FID
Real Images - -
GAN-INT-CLS [182] 2.66 79.55
TAC-GAN [15] 3.45 -
StackGAN [16] 3.20 55.28
StackGAN++ [17] 3.26 48.68
CVAEGAN [183] 4.21 -
HDGAN [108] 3.45 -
Lao et al. [118] - 37.94
PPAN [109] 3.52 -
C4Synth [192] 3.52 -
HfGAN [185] 3.57 -
LeicaGAN [186] 3.92 -
Text-SeGAN [193] 4.03 -
RiFeGAN [190] 4.53 -
AGAN-CL [189] 4.72 -
Souza et al. [124] 3.71 16.47
TABLE III: Text-to-Image generation performance on the Oxford-102 Flowers dataset.
Model IS FID R-Prec.
Real Images [194] 34.88 6.09 68.58
GAN-INT-CLS [182] 7.88 60.62 -
StackGAN [16] 8.45 74.05 -
StackGAN [16] 10.62 -
StackGAN++ [17] 8.30 81.59 -
ChatPainter [195] 9.74 - -
HDGAN [108] 11.86 - -
HfGAN [185] 27.53 - -
Text2Scene [196] 24.77 - -
AttnGAN [14] 25.89 35.20 85.47
MirrorGAN [18] 26.47 - 74.52
AttnGAN+OP [194] 24.76 33.35 82.44
OP-GAN [194] 27.88 24.70 89.01
SEGAN [114] 27.86 32.28 -
ControlGAN [116] 24.06 - 82.43
DM-GAN [187] 30.49 32.64 88.56
DM-GAN [187] 32.43 24.24 92.23
Hong et al. [197] 11.46 - -
Obj-GAN [198] 27.37 25.64 91.05
Obj-GAN [198] 27.32 24.70 91.91
SD-GAN [188] 35.69 - -
textStyleGAN [120] 33.00 - 87.02
AGAN-CL [189] 29.87 - 79.57
TVBi-GAN [125] 31.01 31.97 -
RiFeGAN [190] 31.70 - -
Wang et al. [191] 29.03 16.28 82.70
Bridge-GAN [122] 16.40 - -
Rombach et al. [199] 34.7 30.63 -
CPGAN [200] 52.73 - 93.59
Pavllo et al. [63] - 19.65 -
XMC-GAN [201] 30.45 9.33 -
CogView [202] 18.20 27.10 -
DALL-E [35] 17.9 27.50 -
NUWA [41] 27.2 12.90 -
TABLE IV: Text-to-Image generation performance on the COCO dataset. denotes the results obtained by using the corresponding open-source code. The rows in grey denotes the results of Transformer-based methods, others are the results of GAN based methods.
| Method | LRW [181] SSIM | LRW CPBD | LRW LMD | LRW Sync_conf | VoxCeleb2 [180] SSIM | VoxCeleb2 CPBD | VoxCeleb2 LMD | VoxCeleb2 Sync_conf |
|---|---|---|---|---|---|---|---|---|
| ATVG [74] | 0.810 | 0.102 | 5.25 | 4.1 | 0.826 | 0.061 | 6.49 | 4.3 |
| Wav2Lip [203] | 0.862 | 0.152 | 5.73 | 6.9 | 0.846 | 0.078 | 12.26 | 4.5 |
| MakeitTalk [126] | 0.796 | 0.161 | 7.13 | 3.1 | 0.817 | 0.068 | 31.44 | 2.8 |
| Rhythmic Head [75] | - | - | - | - | 0.779 | 0.802 | 14.76 | 3.8 |
| PC-AVS [47] (Fix Pose) | 0.815 | 0.180 | 6.14 | 6.3 | 0.820 | 0.084 | 7.68 | 5.8 |
| PC-AVS [47] | 0.861 | 0.185 | 3.93 | 6.4 | 0.886 | 0.083 | 6.88 | 5.9 |
| Ground Truth | 1.000 | 0.173 | 0.00 | 6.5 | 1.000 | 0.090 | 0.00 | 5.9 |

TABLE V: The audio guided image editing (talking-head) performance on LRW [181] and VoxCeleb2 [180] under four metrics. denotes that the model is evaluated by directly using the authors’ generated samples under their setting.

4.2 Evaluation Metrics

Precise evaluation metrics are of great importance for driving progress in the field. However, the evaluation of synthesized images is a challenging task, as multiple attributes account for a fine generation result and the notion of image evaluation is often subjective. Synthesized images are usually assessed from two aspects: image quality and alignment with the guidance.

4.2.1 Image Quality Metrics

A number of metrics have been introduced for the evaluation of generated image quality.

Inception Score (IS). IS [204] is computed from the conditional label distribution $p(y|x)$ obtained through a pre-trained Inception-v3 network [205]. The IS roughly measures the categorical distinction and overall variation between images, by computing the Kullback-Leibler (KL) divergence between $p(y|x)$ and the marginal distribution $p(y)$ as shown below:

$$\mathrm{IS} = \exp\left( \mathbb{E}_{x} \left[ D_{KL}\left( p(y|x) \,\|\, p(y) \right) \right] \right).$$

On the other hand, IS is a restrictive evaluation metric, as it struggles to detect overfitting (the model memorizes the training set) and to measure intra-class variation (a model that produces only one good sample per class can still reach a high IS).
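As a small worked example, the Inception Score can be computed directly from a list of per-image class probability vectors: average the KL divergence between each conditional distribution and the marginal over all images, then exponentiate. The probability vectors below are hand-written placeholders; in practice they would be the softmax outputs of the Inception-v3 classifier.

```python
import math


def inception_score(probs):
    """exp of the mean KL(p(y|x) || p(y)) over all images."""
    n, num_classes = len(probs), len(probs[0])
    # Marginal class distribution p(y), averaged over the images.
    marginal = [sum(p[j] for p in probs) / n for j in range(num_classes)]
    kl_sum = 0.0
    for p in probs:
        kl_sum += sum(pj * math.log(pj / mj)
                      for pj, mj in zip(p, marginal) if pj > 0)
    return math.exp(kl_sum / n)


# Every image gets the same uncertain prediction: no distinction, IS = 1.
uniform_score = inception_score([[0.5, 0.5], [0.5, 0.5]])
# Confident predictions spread evenly over 2 classes: IS = number of classes.
sharp_score = inception_score([[1.0, 0.0], [0.0, 1.0]])
```

The second case also hints at the limitation noted above: a single confident sample per class is enough to reach the maximum score.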

Fréchet Inception Distance (FID). With visual features extracted by a pre-trained Inception-v3 [205] model, FID [206] measures the distance between the real image distribution and the generated image distribution. Compared with IS, FID is a more consistent evaluation metric as it captures various kinds of disturbances [206]. Specifically, FID assumes that the activations of the last pooling layer in the pre-trained Inception-v3 model follow a multidimensional Gaussian distribution. Denoting the mean and covariance of real and synthesized data as $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$, respectively, the FID between real and synthesized data can be formulated as:

$$\mathrm{FID} = \|\mu_r - \mu_g\|_2^2 + \mathrm{Tr}\left( \Sigma_r + \Sigma_g - 2 \left(\Sigma_r \Sigma_g\right)^{1/2} \right).$$

On the other hand, FID shares some problems with IS, such as struggling to detect overfitting.
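The sketch below computes a Fréchet distance between two small feature sets under one loudly stated simplification: the covariance matrices are assumed to be diagonal, so the matrix square root in the trace term reduces to an element-wise square root of variances. This is not how FID is computed in practice (full covariances of Inception-v3 features are used), and the feature values are placeholders; the example is only meant to show how the mean term and the covariance term combine.

```python
import math


def mean_and_var(features):
    """Per-dimension mean and unbiased variance of a list of feature vectors."""
    n, d = len(features), len(features[0])
    mean = [sum(f[j] for f in features) / n for j in range(d)]
    var = [sum((f[j] - mean[j]) ** 2 for f in features) / (n - 1)
           for j in range(d)]
    return mean, var


def frechet_distance_diagonal(real, fake):
    """Frechet distance between Gaussians with DIAGONAL covariances (assumption)."""
    mu_r, var_r = mean_and_var(real)
    mu_g, var_g = mean_and_var(fake)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    # Tr(S_r + S_g - 2 (S_r S_g)^(1/2)) for diagonal S_r, S_g.
    trace_term = sum(vr + vg - 2.0 * math.sqrt(vr * vg)
                     for vr, vg in zip(var_r, var_g))
    return mean_term + trace_term


real = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
shifted = [[x + 3.0, y] for x, y in real]   # same spread, mean moved by 3
same = frechet_distance_diagonal(real, real)
moved = frechet_distance_diagonal(real, shifted)
```

Identical feature sets give a distance of zero, and a pure shift of the mean by 3 in one dimension gives a distance of 9, since only the mean term contributes.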

Besides the above common image quality metrics, some evaluation metrics are specially designed for certain generation tasks. For image synthesis conditioned on semantic maps, the image quality can be assessed by leveraging a pre-trained segmentation model to compute the mean Intersection-over-Union (mIoU) and pixel accuracy (Acc). In terms of image editing, some works [207, 92] also construct paired evaluation sets, on which other metrics such as LPIPS [208], PSNR and SSIM can be applied for accurate evaluation.

4.2.2 Guidance Alignment Metrics

Besides visual realism, the synthesized images are also expected to match the corresponding guidance (e.g., a semantic map, a text description, or reference audio). Thus, some evaluation metrics have been introduced to assess the alignment between the guidance and the synthesized images, including R-precision [14], Visual-Semantic similarity (VS) [108], and Semantic Object Accuracy (SOA) [194] for text-guided image synthesis.

R-precision. R-precision [14] measures the semantic similarity between the text guidance and the synthesized images. With the ground truth caption and several random captions sampled from the dataset, the similarity between the image features and each text embedding is calculated through the cosine similarity, followed by a ranking in decreasing similarity. A match is regarded as successful if the ground truth caption is ranked within the top $R$ (normally, $R$ is set to 1 and the number of randomly sampled captions is 99).
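The ranking procedure can be sketched as follows, assuming that image and caption embeddings are already available as plain vectors. The embeddings below are made-up two-dimensional placeholders and only two distractor captions are used per image; a real evaluation would use a pre-trained image-text encoder and the usual 99 random captions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def r_precision(image_embs, true_text_embs, distractor_text_embs, r=1):
    """Fraction of images whose true caption ranks within the top r."""
    hits = 0
    for img, true_txt, distractors in zip(image_embs, true_text_embs,
                                          distractor_text_embs):
        true_score = cosine(img, true_txt)
        # Number of random captions that outrank the ground-truth caption.
        better = sum(1 for d in distractors if cosine(img, d) > true_score)
        hits += better < r
    return hits / len(image_embs)


image_embs = [[1.0, 0.0], [0.0, 1.0]]
true_text_embs = [[0.9, 0.1], [1.0, 0.0]]        # second caption is a poor match
distractor_text_embs = [[[0.0, 1.0], [-1.0, 0.0]],
                        [[0.1, 1.0], [1.0, 1.0]]]
score = r_precision(image_embs, true_text_embs, distractor_text_embs, r=1)
# score == 0.5
```

Only the first image retrieves its own caption at rank 1, so half of the matches count as successful.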

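Visual-Semantic (VS) Similarity. VS similarity [108] is designed to measure the alignment between a text description and the synthesized image through a trained cross-modal embedding model that maps texts and images into a common representation space. Denoting the text encoder and image encoder as $f_t$ and $f_x$, respectively, the similarity between a text description $t$ and a synthesized image $x$ is computed as below:

$$\mathrm{VS} = \frac{\langle f_t(t), f_x(x) \rangle}{\lVert f_t(t) \rVert_2 \cdot \lVert f_x(x) \rVert_2}$$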

On the other hand, the standard deviation of the VS score is very high, which hinders precise evaluation of model performance.

Captioning Metrics. Hong et al. [197] propose to evaluate the relevance between the generated image and the corresponding text by generating captions for the generated images with an image caption generator [209]. The generated captions are expected to be similar to the captions that guided the image generation, which can be measured by natural language metrics such as BLEU [210], METEOR [211], and CIDEr [212]. However, many natural language metrics are computed based on n-gram overlap, which may not coincide with human judgement.
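The limitation of n-gram overlap can be seen in a small example. The sketch below computes clipped n-gram precision, one building block of BLEU-style metrics, rather than any full metric; the function name `ngram_precision` and the example captions are illustrative assumptions.

```python
from collections import Counter

def ngram_precision(candidate, reference, n=1):
    """Clipped n-gram precision of a candidate caption against one reference."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    return overlap / total

reference = "a small bird with a red head"
paraphrase = "tiny crimson-headed songbird"     # similar meaning, no shared words
shuffled = "head red a with bird small a"       # same words, scrambled order

print(ngram_precision(paraphrase, reference, n=1))  # 0.0
print(ngram_precision(shuffled, reference, n=1))    # 1.0
print(ngram_precision(shuffled, reference, n=2))    # 0.0
```

A reasonable paraphrase receives zero unigram precision while a scrambled caption receives a perfect one, which illustrates how overlap-based scores can diverge from human judgement.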


Semantic Object Accuracy (SOA). Hinz et al. propose to evaluate the semantic alignment of individual objects with a pre-trained object detector [194]. Specifically, two metrics, SOA-C and SOA-I, are designed to report the recall as a class average and an image average, respectively. On the other hand, SOA may not be suitable for evaluating interactions and relationships between objects, as it assumes that the text description is roughly a word list of the visual objects.
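The difference between the class average and the image average can be sketched as below. This assumes the detector outputs have already been reduced to one record per generated image and per object class mentioned in its caption; the record format and the function name `semantic_object_accuracy` are hypothetical, and the two averages follow our reading of the description above rather than a reference implementation.

```python
from collections import defaultdict

def semantic_object_accuracy(records):
    """Compute (SOA-C, SOA-I) from (class_name, detected) records.

    Each record corresponds to one generated image whose caption mentions
    `class_name`; `detected` is True when the detector finds that class.
    """
    hits, counts = defaultdict(int), defaultdict(int)
    for cls, detected in records:
        counts[cls] += 1
        hits[cls] += int(detected)
    # SOA-C: recall computed per class, then averaged over classes.
    soa_c = sum(hits[c] / counts[c] for c in counts) / len(counts)
    # SOA-I: recall averaged over all images, regardless of class.
    soa_i = sum(hits.values()) / sum(counts.values())
    return soa_c, soa_i

# Toy example: a frequent class that is usually detected, a rare class that never is.
records = [("dog", True)] * 8 + [("dog", False)] * 2 + [("giraffe", False)] * 2
soa_c, soa_i = semantic_object_accuracy(records)
print(soa_c, soa_i)
```

Here the rare class pulls the class average (0.4) well below the image average (8/12), which is why the two scores are reported separately.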

4.2.3 User Studies

Different from the above heuristic metrics, some works evaluate directly by conducting user studies. Normally, users are presented with several sets of generated images and are asked to pick or rank the images according to a certain criterion. User studies are flexible, as they can be designed according to the specific task and evaluation target. We take the evaluation of semantic alignment in text-guided generation as an example. Given a number of images generated from randomly sampled captions, the participating users are presented with the guiding text and the images generated by the compared models, and are asked to pick or rank the images according to their alignment with the caption.

On the other hand, the settings of user studies (e.g., the number of users) are usually not consistent across experiments, which makes comparisons between the user studies in different papers unreliable. Besides, different people may hold different opinions about the same concept, which also makes the results less reliable. Employing a large number of users could improve the reliability, but large-scale studies are costly and time-consuming.

4.3 Experimental Results

We quantitatively compare the image synthesis performance of different models in terms of visual guidance, text guidance, and audio guidance.

4.3.1 Visual Guidance

For visual guidance, we mainly conduct the comparison on image synthesis conditioned on semantic maps, as there are a number of works available for benchmarking. The experimental comparison is conducted on four challenging datasets: ADE20K [167], ADE20K-outdoors [167], COCO-stuff [168], and Cityscapes [169], following the setting of [3]. The evaluation is performed with Fréchet Inception Distance (FID) [213] and mean Intersection-over-Union (mIoU). Specifically, the mIoU assesses the alignment between the generated image and the ground truth segmentation via a pre-trained semantic segmentation network. Pre-trained UperNet101 [214], multi-scale DRN-D-105 [215], and DeepLabV2 [216] are adopted for ADE20K, Cityscapes, and COCO-Stuff, respectively.
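As a minimal sketch of how mIoU and pixel accuracy can be derived once a segmentation network has labelled a generated image, the snippet below compares a predicted label map with the ground truth semantic map through a confusion matrix. The function name `segmentation_scores` and the toy label maps are illustrative assumptions; running the pre-trained segmentation network itself is outside the scope of the example.

```python
import numpy as np

def segmentation_scores(pred, target, num_classes):
    """Return (mIoU, pixel accuracy) for integer label maps of equal shape."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    # confusion[i, j] counts pixels with ground-truth class i predicted as class j.
    confusion = np.bincount(target * num_classes + pred,
                            minlength=num_classes ** 2).reshape(num_classes, num_classes)
    intersection = np.diag(confusion)
    union = confusion.sum(axis=0) + confusion.sum(axis=1) - intersection
    present = union > 0  # skip classes absent from both maps
    miou = float(np.mean(intersection[present] / union[present]))
    acc = float(intersection.sum() / confusion.sum())
    return miou, acc

# Ground-truth semantic map and the segmentation predicted on the generated image.
target = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1]])
pred = np.array([[0, 0, 1, 1],
                 [0, 1, 1, 1]])

miou, acc = segmentation_scores(pred, target, num_classes=2)
print(miou, acc)  # 0.775 0.875
```

In a benchmark the scores would be accumulated over the whole validation set rather than a single image.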

4.3.2 Text Guidance

The text-to-image generation performance of different methods on three benchmark datasets (i.e., Oxford-102 Flowers, CUB-200 Birds, and COCO) is collected from the related literature, as shown in Table III, Table II, and Table IV, respectively.

4.3.3 Audio Guidance

In terms of audio-guided image synthesis and editing, we conduct a quantitative comparison on the task of audio-guided talking face generation, as there are a number of benchmark methods for comparison. The quantitative results on the LRW [181] and VoxCeleb2 [180] datasets are shown in Table V.

5 Future Challenges & Directions

Though multimodal image synthesis and editing have made notable progress and achieved superior performance in recent years, there exist several challenges for their practical applicability. In this section, we overview the typical challenges, share some recent efforts for addressing them, and highlight the future research directions.

5.1 Towards Integrating All Modalities

As current datasets mainly provide annotations in a single or a few modalities, most existing methods focus on image synthesis and editing conditioned on guidance from a single modality. However, humans are capable of creating visual content under the guidance of multiple modalities concurrently. To mimic human intelligence, generation models are expected to handle guidance from multiple modalities concurrently as well. To achieve that, a comprehensive dataset equipped with annotations from all modalities (e.g., semantic segmentation, text description, and audio) needs to be created.

5.2 Evaluation Metrics

The evaluation of multimodal image synthesis and editing is still an open problem. Metrics that rely on pre-trained models (e.g., FID) are constrained by the datasets used for pre-training, which tend to differ from the target datasets. User studies recruit human subjects to assess the synthesized images directly, which is however rather subjective. Designing accurate and faithful evaluation metrics is thus meaningful and critical to the development of multimodal image synthesis and editing.

5.3 Model Architecture

With inherent support for multimodal input, the Transformer has been established as a new paradigm for multimodal image synthesis and editing. However, Transformer models suffer from slow inference, which becomes more severe in high-resolution image synthesis. How to design an architecture with natural support for multimodal inputs and fast inference remains a grand challenge to explore.

6 Conclusion

This review has covered the main approaches for multimodal image synthesis and editing. We provided an overview of different guidance modalities, including visual guidance, text guidance, audio guidance, and other modalities (e.g., scene graphs). We then gave a detailed introduction to the main image synthesis & editing paradigms: GAN-based methods, GAN inversion methods, Transformer-based methods, and other methods (e.g., NeRF and diffusion models). Their strengths and weaknesses were discussed comprehensively to inspire new paradigms that take advantage of the strengths of existing frameworks. Following the introduction of methods, we surveyed datasets and evaluation metrics for image synthesis conditioned on different guidance modalities, and then tabulated and compared the generation performance of existing approaches on different multimodal synthesis & editing tasks. Last but not least, we provided our perspective on current challenges and future directions related to integrating all modalities, comprehensive datasets, evaluation metrics, and model architecture.


This study is supported under the RIE2020 Industry Alignment Fund – Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s).


  • [1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” Advances in neural information processing systems, vol. 27, 2014.
  • [2] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134, 2017.
  • [3] T. Park, M.-Y. Liu, T.-C. Wang, and J.-Y. Zhu, “Semantic image synthesis with spatially-adaptive normalization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2337–2346, 2019.
  • [4] C.-H. Lee, Z. Liu, L. Wu, and P. Luo, “Maskgan: Towards diverse and interactive facial image manipulation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5549–5558, 2020.
  • [5] E. Mansimov, E. Parisotto, J. L. Ba, and R. Salakhutdinov, “Generating images from captions with attention,” arXiv preprint arXiv:1511.02793, 2015.
  • [6] M. Mirza and S. Osindero, “Conditional generative adversarial nets,” arXiv preprint arXiv:1411.1784, 2014.
  • [7] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial networks,” in International Conference on Machine Learning, pp. 214–223, 2017.
  • [8] C.-H. Lin, E. Yumer, O. Wang, E. Shechtman, and S. Lucey, “St-gan: Spatial transformer generative adversarial networks for image compositing,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9455–9464, 2018.
  • [9] F. Zhan, C. Xue, and S. Lu, “Ga-dan: Geometry-aware domain adaptation network for scene text detection and recognition,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 9105–9115, 2019.
  • [10] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb, “Learning from simulated and unsupervised images through adversarial training,” in IEEE Conference on Computer Vision and Pattern Recognition, pp. 2107–2116, 2017.
  • [11] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, “Generative adversarial text to image synthesis,” in International Conference on Machine Learning, pp. 1060–1069, PMLR, 2016.
  • [12] L. Chen, S. Srivastava, Z. Duan, and C. Xu, “Deep cross-modal audio-visual generation,” in Proceedings of the on Thematic Workshops of ACM Multimedia 2017, pp. 349–357, 2017.
  • [13] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona, “Caltech-UCSD Birds 200,” Tech. Rep. CNS-TR-2010-001, California Institute of Technology, 2010.
  • [14] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He, “Attngan: Fine-grained text to image generation with attentional generative adversarial networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1316–1324, 2018.
  • [15] A. Dash, J. C. B. Gamboa, S. Ahmed, M. Liwicki, and M. Z. Afzal, “Tac-gan-text conditioned auxiliary classifier generative adversarial network,” arXiv preprint arXiv:1703.06412, 2017.
  • [16] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas, “StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks,” in ICCV, 2017.
  • [17] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas, “StackGAN++: Realistic image synthesis with stacked generative adversarial networks,” TPAMI, 2018.
  • [18] T. Qiao, J. Zhang, D. Xu, and D. Tao, “Mirrorgan: Learning text-to-image generation by redescription,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1505–1514, 2019.
  • [19] A. Brock, J. Donahue, and K. Simonyan, “Large scale gan training for high fidelity natural image synthesis,” arXiv preprint arXiv:1809.11096, 2018.
  • [20] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4401–4410, 2019.
  • [21] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, “Analyzing and improving the image quality of stylegan,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8110–8119, 2020.
  • [22] T. Karras, M. Aittala, S. Laine, E. Härkönen, J. Hellsten, J. Lehtinen, and T. Aila, “Alias-free generative adversarial networks,” in Proc. NeurIPS, 2021.
  • [23] D. Bau, H. Strobelt, W. Peebles, J. Wulff, B. Zhou, J.-Y. Zhu, and A. Torralba, “Semantic photo manipulation with a generative image prior,” arXiv preprint arXiv:2005.07727, 2020.
  • [24] L. Goetschalckx, A. Andonian, A. Oliva, and P. Isola, “Ganalyze: Toward visual definitions of cognitive image properties,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5744–5753, 2019.
  • [25] J. Zhu, Y. Shen, D. Zhao, and B. Zhou, “In-domain gan inversion for real image editing,” in European conference on computer vision, pp. 592–608, Springer, 2020.
  • [26] A. Jahanian, L. Chai, and P. Isola, “On the ‘steerability’ of generative adversarial networks,” arXiv preprint arXiv:1907.07171, 2019.
  • [27] Y. Shen, J. Gu, X. Tang, and B. Zhou, “Interpreting the latent space of gans for semantic face editing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9243–9252, 2020.
  • [28] O. Patashnik, Z. Wu, E. Shechtman, D. Cohen-Or, and D. Lischinski, “Styleclip: Text-driven manipulation of stylegan imagery,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2085–2094, 2021.
  • [29] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning transferable visual models from natural language supervision,” arXiv preprint arXiv:2103.00020, 2021.
  • [30] Y. Jiang, Z. Huang, X. Pan, C. C. Loy, and Z. Liu, “Talk-to-edit: Fine-grained facial editing via dialog,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13799–13808, 2021.
  • [31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, pp. 5998–6008, 2017.
  • [32] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019.
  • [33] M. Chen, A. Radford, R. Child, J. Wu, H. Jun, D. Luan, and I. Sutskever, “Generative pretraining from pixels,” in International Conference on Machine Learning, pp. 1691–1703, PMLR, 2020.
  • [34] P. Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford, and I. Sutskever, “Jukebox: A generative model for music,” arXiv preprint arXiv:2005.00341, 2020.
  • [35] A. Ramesh, M. Pavlov, G. Goh, and S. Gray, “DALL·E: Creating images from text,” tech. rep., OpenAI, 2021.
  • [36] P. Esser, R. Rombach, and B. Ommer, “Taming transformers for high-resolution image synthesis,” arXiv:2012.09841, 2020.
  • [37] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in European conference on computer vision, pp. 694–711, Springer, 2016.
  • [38] A. Lamb, V. Dumoulin, and A. Courville, “Discriminative regularization for generative models,” arXiv preprint arXiv:1602.03220, 2016.
  • [39] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther, “Autoencoding beyond pixels using a learned similarity metric,” in International conference on machine learning, pp. 1558–1566, PMLR, 2016.
  • [40] P. Esser, R. Rombach, A. Blattmann, and B. Ommer, “Imagebart: Bidirectional context with multinomial diffusion for autoregressive image synthesis,” in Thirty-Fifth Conference on Neural Information Processing Systems, 2021.
  • [41] C. Wu, J. Liang, L. Ji, F. Yang, Y. Fang, D. Jiang, and N. Duan, “NÜWA: Visual synthesis pre-training for neural visual world creation,” arXiv preprint arXiv:2111.12417, 2021.
  • [42] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” in European conference on computer vision, pp. 405–421, Springer, 2020.
  • [43] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” arXiv preprint arXiv:2006.11239, 2020.
  • [44] Y. Song and S. Ermon, “Generative modeling by estimating gradients of the data distribution,” arXiv preprint arXiv:1907.05600, 2019.
  • [45] F. Zhan, Y. Yu, K. Cui, G. Zhang, S. Lu, J. Pan, C. Zhang, F. Ma, X. Xie, and C. Miao, “Unbalanced feature transport for exemplar-based image translation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021.
  • [46] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, “Zero-shot text-to-image generation,” arXiv preprint arXiv:2102.12092, 2021.
  • [47] H. Zhou, Y. Sun, W. Wu, C. C. Loy, X. Wang, and Z. Liu, “Pose-controllable talking face generation by implicitly modularized audio-visual representation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4176–4186, 2021.
  • [48] J. Johnson, A. Gupta, and L. Fei-Fei, “Image generation from scene graphs,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1219–1228, 2018.
  • [49] L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars, and L. Van Gool, “Pose guided person image generation,” arXiv preprint arXiv:1705.09368, 2017.
  • [50] Y. Men, Y. Mao, Y. Jiang, W.-Y. Ma, and Z. Lian, “Controllable person image synthesis with attribute-decomposed gan,” in Computer Vision and Pattern Recognition (CVPR), 2020 IEEE Conference on, 2020.
  • [51] C. Zhang, F. Zhan, and Y. Chang, “Deep monocular 3d human pose estimation via cascaded dimension-lifting,” arXiv preprint arXiv:2104.03520, 2021.
  • [52] J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman, “Toward multimodal image-to-image translation,” in Advances in neural information processing systems, pp. 465–476, 2017.
  • [53] H.-Y. Lee, H.-Y. Tseng, J.-B. Huang, M. Singh, and M.-H. Yang, “Diverse image-to-image translation via disentangled representations,” in Proceedings of the European conference on computer vision (ECCV), pp. 35–51, 2018.
  • [54] W. Sun and T. Wu, “Image synthesis from reconfigurable layout and style,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 10531–10540, 2019.
  • [55] B. Zhao, L. Meng, W. Yin, and L. Sigal, “Image generation from layout,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8584–8593, 2019.
  • [56] Y. Li, Y. Cheng, Z. Gan, L. Yu, L. Wang, and J. Liu, “Bachgan: High-resolution image synthesis from salient object layout,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020.
  • [57] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of the IEEE international conference on computer vision, pp. 2223–2232, 2017.
  • [58] T. Park, A. A. Efros, R. Zhang, and J.-Y. Zhu, “Contrastive learning for unpaired image-to-image translation,” in European Conference on Computer Vision, pp. 319–345, Springer, 2020.
  • [59] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in neural information processing systems, pp. 3111–3119, 2013.
  • [60] Z. S. Harris, “Distributional structure,” Word, vol. 10, no. 2-3, pp. 146–162, 1954.
  • [61] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” IEEE transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
  • [62] T. Wang, T. Zhang, and B. Lovell, “Faces à la carte: Text-to-face generation via attribute disentanglement,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 3380–3388, 2021.
  • [63] D. Pavllo, A. Lucchi, and T. Hofmann, “Controlling style and semantics in weakly-supervised image generation,” in European Conference on Computer Vision, pp. 482–499, Springer, 2020.
  • [64] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
  • [65] D. Harwath and J. R. Glass, “Learning word-like units from joint audio-visual analysis,” arXiv preprint arXiv:1701.07481, 2017.
  • [66] D. Harwath, G. Chuang, and J. Glass, “Vision as an interlingua: Learning multilingual semantic embeddings of untranscribed speech,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4969–4973, IEEE, 2018.
  • [67] Y. Aytar, C. Vondrick, and A. Torralba, “Soundnet: Learning sound representations from unlabeled video,” Advances in neural information processing systems, vol. 29, pp. 892–900, 2016.
  • [68] A. Owens, P. Isola, J. McDermott, A. Torralba, E. H. Adelson, and W. T. Freeman, “Visually indicated sounds,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2405–2413, 2016.
  • [69] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [70] W. Hao, Z. Zhang, and H. Guan, “Cmcgan: A uniform framework for cross-modal visual-audio mutual generation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, 2018.
  • [71] C.-H. Wan, S.-P. Chuang, and H.-Y. Lee, “Towards audio to scene image synthesis using generative adversarial network,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 496–500, IEEE, 2019.
  • [72] J. S. Chung, A. Jamaludin, and A. Zisserman, “You said that?,” arXiv preprint arXiv:1705.02966, 2017.
  • [73] Y. Song, J. Zhu, D. Li, X. Wang, and H. Qi, “Talking face generation by conditional recurrent adversarial network,” arXiv preprint arXiv:1804.04786, 2018.
  • [74] L. Chen, R. K. Maddox, Z. Duan, and C. Xu, “Hierarchical cross-modal talking face generation with dynamic pixel-wise loss,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7832–7841, 2019.
  • [75] L. Chen, G. Cui, C. Liu, Z. Li, Z. Kou, Y. Xu, and C. Xu, “Talking-head generation with rhythmic head motion,” in European Conference on Computer Vision, pp. 35–51, Springer, 2020.
  • [76] H. Zhou, Y. Liu, Z. Liu, P. Luo, and X. Wang, “Talking face generation by adversarially disentangled audio-visual representation,” in Proceedings of the AAAI Conference on Artificial Intelligence, pp. 9299–9306, 2019.
  • [77] S. Suwajanakorn, S. M. Seitz, and I. Kemelmacher-Shlizerman, “Synthesizing obama: learning lip sync from audio,” ACM Transactions on Graphics (ToG), vol. 36, no. 4, pp. 1–13, 2017.
  • [78] R. Yi, Z. Ye, J. Zhang, H. Bao, and Y.-J. Liu, “Audio-driven talking face video generation with learning-based personalized head pose,” arXiv preprint arXiv:2002.10137, 2020.
  • [79] S. Wang, L. Li, Y. Ding, C. Fan, and X. Yu, “Audio2head: Audio-driven one-shot talking-head generation with natural head motion,” arXiv preprint arXiv:2107.09293, 2021.
  • [80] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona, “Caltech-ucsd birds 200,” California Institute of Technology, 2010.
  • [81] R. Liu, J. Lehman, P. Molino, F. P. Such, E. Frank, A. Sergeev, and J. Yosinski, “An intriguing failing of convolutional neural networks and the coordconv solution,” arXiv preprint arXiv:1807.03247, 2018.
  • [82] F. Zhan, C. Zhang, Y. Yu, Y. Chang, S. Lu, F. Ma, and X. Xie, “Emlight: Lighting estimation via spherical distribution approximation,” in Proceedings of the AAAI Conference on Artificial Intelligence, pp. 3287–3295, 2021.
  • [83] F. Zhan, Y. Yu, R. Wu, C. Zhang, S. Lu, L. Shao, F. Ma, and X. Xie, “Gmlight: Lighting estimation via geometric distribution approximation,” arXiv preprint arXiv:2102.10244, 2021.
  • [84] F. Zhan, C. Zhang, W. Hu, S. Lu, F. Ma, X. Xie, and L. Shao, “Sparse needlets for lighting estimation with spherical transport loss,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12830–12839, 2021.
  • [85] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro, “High-resolution image synthesis and semantic manipulation with conditional gans,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8798–8807, 2018.
  • [86] H. Tang, D. Xu, N. Sebe, Y. Wang, J. J. Corso, and Y. Yan, “Multi-channel attention selection gan with cascaded semantic guidance for cross-view image translation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2417–2426, 2019.
  • [87] P. Zhu, R. Abdal, Y. Qin, and P. Wonka, “Sean: Image synthesis with semantic region-adaptive normalization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • [88] T. R. Shaham, M. Gharbi, R. Zhang, E. Shechtman, and T. Michaeli, “Spatially-adaptive pixelwise networks for fast image translation,” arXiv preprint arXiv:2012.02992, 2020.
  • [89] P. Zhang, B. Zhang, D. Chen, L. Yuan, and F. Wen, “Cross-domain correspondence learning for exemplar-based image translation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5143–5153, 2020.
  • [90] X. Zhou, B. Zhang, T. Zhang, P. Zhang, J. Bao, D. Chen, Z. Zhang, and F. Wen, “Full-resolution correspondence learning for image translation,” arXiv preprint arXiv:2012.02047, 2020.
  • [91] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman, “Patchmatch: A randomized correspondence algorithm for structural image editing,” ACM Trans. Graph., vol. 28, no. 3, p. 24, 2009.
  • [92] F. Zhan, Y. Yu, R. Wu, K. Cui, A. Xiao, S. Lu, and L. Shao, “Bi-level feature alignment for semantic image translation & manipulation,” arXiv preprint, 2021.
  • [93] A. Bansal, Y. Sheikh, and D. Ramanan, “Pixelnn: Example-based image synthesis,” in International Conference on Learning Representations, 2018.
  • [94] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” science, vol. 313, no. 5786, pp. 504–507, 2006.
  • [95] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” stat, vol. 1050, p. 1, 2014.
  • [96] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, “Infogan: Interpretable representation learning by information maximizing generative adversarial nets,” in Advances in neural information processing systems, pp. 2172–2180, 2016.
  • [97] J. Donahue, P. Krähenbühl, and T. Darrell, “Adversarial feature learning,” arXiv preprint arXiv:1605.09782, 2016.
  • [98] V. Dumoulin, I. Belghazi, B. Poole, A. Lamb, M. Arjovsky, O. Mastropietro, and A. Courville, “Adversarially learned inference,” stat, vol. 1050, p. 2, 2016.
  • [99] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner, “beta-vae: Learning basic visual concepts with a constrained variational framework,” International Conference on Learning Representations, 2017.
  • [100] H. Kim and A. Mnih, “Disentangling by factorising,” in International Conference on Machine Learning, pp. 2649–2658, PMLR, 2018.
  • [101] E. Denton and V. Birodkar, “Unsupervised learning of disentangled representations from video,” arXiv preprint arXiv:1705.10915, 2017.
  • [102] A. Gonzalez-Garcia, J. Van De Weijer, and Y. Bengio, “Image-to-image translation for cross-domain disentanglement,” in Advances in neural information processing systems, pp. 1287–1298, 2018.
  • [103] S. Benaim and L. Wolf, “One-sided unsupervised domain mapping,” Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [104] M. Amodio and S. Krishnaswamy, “Travelgan: Image-to-image translation by transformation vector learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8983–8992, 2019.
  • [105] H. Fu, M. Gong, C. Wang, K. Batmanghelich, K. Zhang, and D. Tao, “Geometry-consistent generative adversarial networks for one-sided unsupervised domain mapping,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2427–2436, 2019.
  • [106] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
  • [107] A. Andonian, T. Park, B. Russell, P. Isola, J.-Y. Zhu, and R. Zhang, “Contrastive feature loss for image prediction,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1934–1943, 2021.
  • [108] Z. Zhang, Y. Xie, and L. Yang, “Photographic text-to-image synthesis with a hierarchically-nested adversarial network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6199–6208, 2018.
  • [109] L. Gao, D. Chen, J. Song, X. Xu, D. Zhang, and H. T. Shen, “Perceptual pyramid adversarial networks for text-to-image synthesis,” in Proceedings of the AAAI Conference on Artificial Intelligence, pp. 8312–8319, 2019.
  • [110] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al., “Photo-realistic single image super-resolution using a generative adversarial network,” in CVPR, 2017.
  • [111] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [112] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
  • [113] W. Huang, R. Y. Da Xu, and I. Oppermann, “Realistic image generation using region-phrase attention,” in Asian Conference on Machine Learning, pp. 284–299, PMLR, 2019.
  • [114] H. Tan, X. Liu, X. Li, Y. Zhang, and B. Yin, “Semantics-enhanced adversarial nets for text-to-image synthesis,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10501–10510, 2019.
  • [115] Z. Lin, M. Feng, C. N. d. Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio, “A structured self-attentive sentence embedding,” arXiv preprint arXiv:1703.03130, 2017.
  • [116] B. Li, X. Qi, T. Lukasiewicz, and P. H. Torr, “Controllable text-to-image generation,” arXiv preprint arXiv:1909.07083, 2019.
  • [117] A. Nguyen, J. Clune, Y. Bengio, A. Dosovitskiy, and J. Yosinski, “Plug & play generative networks: Conditional iterative generation of images in latent space,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4467–4477, 2017.
  • [118] Q. Lao, M. Havaei, A. Pesaranghader, F. Dutil, L. D. Jorio, and T. Fevens, “Dual adversarial inference for text-to-image synthesis,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7567–7576, 2019.
  • [119] Z. Chen and Y. Luo, “Cycle-consistent diverse image synthesis from natural language,” in 2019 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), pp. 459–464, IEEE, 2019.
  • [120] D. Stap, M. Bleeker, S. Ibrahimi, and M. ter Hoeve, “Conditional image generation and manipulation for user-specified content,” arXiv preprint arXiv:2005.04909, 2020.
  • [121] T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progressive growing of gans for improved quality, stability, and variation,” arXiv preprint arXiv:1710.10196, 2017.
  • [122] M. Yuan and Y. Peng, “Bridge-gan: Interpretable representation learning for text-to-image synthesis,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 11, pp. 4258–4268, 2019.
  • [123] Y. Zhang and H. Lu, “Deep cross-modal projection learning for image-text matching,” in Proceedings of the European Conference on Computer Vision (ECCV), pp. 686–701, 2018.
  • [124] D. M. Souza, J. Wehrmann, and D. D. Ruiz, “Efficient neural architecture for text-to-image synthesis,” in 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1–8, IEEE, 2020.
  • [125] Z. Wang, Z. Quan, Z.-J. Wang, X. Hu, and Y. Chen, “Text to image synthesis with bidirectional generative adversarial network,” in 2020 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6, IEEE, 2020.
  • [126] Y. Zhou, X. Han, E. Shechtman, J. Echevarria, E. Kalogerakis, and D. Li, “MakeItTalk: speaker-aware talking-head animation,” ACM Transactions on Graphics (TOG), vol. 39, no. 6, pp. 1–15, 2020.
  • [127] V. Blanz, T. Vetter, et al., “A morphable model for the synthesis of 3d faces.,” in Siggraph, pp. 187–194, 1999.
  • [128] W. Xia, Y. Yang, J.-H. Xue, and B. Wu, “Tedigan: Text-guided diverse face image generation and manipulation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2256–2265, 2021.
  • [129] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 586–595, 2018.
  • [130] E. Richardson, Y. Alaluf, O. Patashnik, Y. Nitzan, Y. Azar, S. Shapiro, and D. Cohen-Or, “Encoding in style: a stylegan encoder for image-to-image translation,” arXiv preprint arXiv:2008.00951, 2020.
  • [131] H. Dong, S. Yu, C. Wu, and Y. Guo, “Semantic image synthesis via adversarial learning,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 5706–5714, 2017.
  • [132] R. Kiros, R. Salakhutdinov, and R. S. Zemel, “Unifying visual-semantic embeddings with multimodal neural language models,” arXiv preprint arXiv:1411.2539, 2014.
  • [133] D. Bau, A. Andonian, A. Cui, Y. Park, A. Jahanian, A. Oliva, and A. Torralba, “Paint by word,” arXiv preprint arXiv:2103.10951, 2021.
  • [134] T. Wei, D. Chen, W. Zhou, J. Liao, Z. Tan, L. Yuan, W. Zhang, and N. Yu, “Hairclip: Design your hair by text and reference image,” arXiv preprint arXiv:2112.05142, 2021.
  • [135] R. Gal, O. Patashnik, H. Maron, G. Chechik, and D. Cohen-Or, “Stylegan-nada: Clip-guided domain adaptation of image generators,” arXiv preprint arXiv:2108.00946, 2021.
  • [136] Y. Yu, F. Zhan, R. Wu, J. Pan, K. Cui, S. Lu, F. Ma, X. Xie, and C. Miao, “Diverse image inpainting with bidirectional and autoregressive transformers,” arXiv preprint arXiv:2104.12335, 2021.
  • [137] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in CVPR, 2009.
  • [138] A. v. d. Oord, O. Vinyals, and K. Kavukcuoglu, “Neural discrete representation learning,” arXiv preprint arXiv:1711.00937, 2017.
  • [139] Y. Bengio, N. Léonard, and A. Courville, “Estimating or propagating gradients through stochastic neurons for conditional computation,” arXiv preprint arXiv:1308.3432, 2013.
  • [140] A. Baevski, S. Schneider, and M. Auli, “vq-wav2vec: Self-supervised learning of discrete speech representations,” arXiv preprint arXiv:1910.05453, 2019.
  • [141] E. Jang, S. Gu, and B. Poole, “Categorical reparameterization with gumbel-softmax,” arXiv preprint arXiv:1611.01144, 2016.
  • [142] X. Dong, J. Bao, T. Zhang, D. Chen, W. Zhang, L. Yuan, D. Chen, F. Wen, and N. Yu, “Peco: Perceptual codebook for bert pre-training of vision transformers,” arXiv preprint arXiv:2111.12710, 2021.
  • [143] H. Bao, L. Dong, and F. Wei, “Beit: Bert pre-training of image transformers,” arXiv preprint arXiv:2106.08254, 2021.
  • [144] J. Yu, X. Li, J. Y. Koh, H. Zhang, R. Pang, J. Qin, A. Ku, Y. Xu, J. Baldridge, and Y. Wu, “Vector-quantized image modeling with improved vqgan,” arXiv preprint arXiv:2110.04627, 2021.
  • [145] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” 2020.
  • [146] A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al., “Conditional image generation with pixelcnn decoders,” in NeurIPS, 2016.
  • [147] N. Parmar, A. Vaswani, J. Uszkoreit, Ł. Kaiser, N. Shazeer, A. Ku, and D. Tran, “Image transformer,” in ICML, 2018.
  • [148] E. Hoogeboom, D. Nielsen, P. Jaini, P. Forré, and M. Welling, “Argmax flows and multinomial diffusion: Towards non-autoregressive language models,” arXiv preprint arXiv:2102.05379, 2021.
  • [149] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” in International Conference on Machine Learning, pp. 2256–2265, PMLR, 2015.
  • [150] Y. Guo, K. Chen, S. Liang, Y. Liu, H. Bao, and J. Zhang, “Ad-nerf: Audio driven neural radiance fields for talking head synthesis,” arXiv preprint arXiv:2103.11078, 2021.
  • [151] C. Wang, M. Chai, M. He, D. Chen, and J. Liao, “Clip-nerf: Text-and-image driven manipulation of neural radiance fields,” arXiv preprint arXiv:2112.05139, 2021.
  • [152] G. Kim and J. C. Ye, “Diffusionclip: Text-guided image manipulation using diffusion models,” arXiv preprint arXiv:2110.02711, 2021.
  • [153] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differential equations,” arXiv preprint arXiv:2011.13456, 2020.
  • [154] J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” arXiv preprint arXiv:2010.02502, 2020.
  • [155] A. Jolicoeur-Martineau, R. Piché-Taillefer, R. T. d. Combes, and I. Mitliagkas, “Adversarial score matching and improved sampling for image generation,” arXiv preprint arXiv:2009.05475, 2020.
  • [156] P. Dhariwal and A. Nichol, “Diffusion models beat gans on image synthesis,” arXiv preprint arXiv:2105.05233, 2021.
  • [157] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
  • [158] D. Rezende and S. Mohamed, “Variational inference with normalizing flows,” in International conference on machine learning, pp. 1530–1538, PMLR, 2015.
  • [159] L. Dinh, J. Sohl-Dickstein, and S. Bengio, “Density estimation using real nvp,” arXiv preprint arXiv:1605.08803, 2016.
  • [160] J. Menick and N. Kalchbrenner, “Generating high fidelity images with subscale pixel networks and multidimensional upscaling,” arXiv preprint arXiv:1812.01608, 2018.
  • [161] A. Van den Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel recurrent neural networks,” in International Conference on Machine Learning, pp. 1747–1756, PMLR, 2016.
  • [162] O. Avrahami, D. Lischinski, and O. Fried, “Blended diffusion for text-driven editing of natural images,” arXiv preprint arXiv:2111.14818, 2021.
  • [163] S. Gu, D. Chen, J. Bao, F. Wen, B. Zhang, D. Chen, L. Yuan, and B. Guo, “Vector quantized diffusion model for text-to-image synthesis,” arXiv preprint arXiv:2111.14822, 2021.
  • [164] A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen, “Glide: Towards photorealistic image generation and editing with text-guided diffusion models,” arXiv preprint arXiv:2112.10741, 2021.
  • [165] T.-J. Fu, X. E. Wang, and W. Y. Wang, “Language-driven image style transfer,” arXiv preprint arXiv:2106.00178, 2021.
  • [166] G. Kwon and J. C. Ye, “Clipstyler: Image style transfer with a single text condition,” arXiv preprint arXiv:2112.00374, 2021.
  • [167] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, “Scene parsing through ade20k dataset,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 633–641, 2017.
  • [168] H. Caesar, J. Uijlings, and V. Ferrari, “Coco-stuff: Thing and stuff classes in context,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1209–1218, 2018.
  • [169] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in CVPR, 2016.
  • [170] Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the wild,” in Proceedings of the IEEE international conference on computer vision, pp. 3730–3738, 2015.
  • [171] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang, “Deepfashion: Powering robust clothes recognition and retrieval with rich annotations,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1096–1104, 2016.
  • [172] O. Langner, R. Dotsch, G. Bijlstra, D. H. Wigboldus, S. T. Hawk, and A. Van Knippenberg, “Presentation and validation of the radboud faces database,” Cognition and emotion, vol. 24, no. 8, pp. 1377–1388, 2010.
  • [173] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian, “Scalable person re-identification: A benchmark,” in Proceedings of the IEEE international conference on computer vision, pp. 1116–1124, 2015.
  • [174] H. Tang, D. Xu, G. Liu, W. Wang, N. Sebe, and Y. Yan, “Cycle in cycle generative adversarial networks for keypoint-guided image generation,” in Proceedings of the 27th ACM International Conference on Multimedia, pp. 2052–2060, 2019.
  • [175] M.-E. Nilsback and A. Zisserman, “Automated flower classification over a large number of classes,” in 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722–729, IEEE, 2008.
  • [176] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision, pp. 740–755, Springer, 2014.
  • [177] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al., “Visual genome: Connecting language and vision using crowdsourced dense image annotations,” arXiv preprint arXiv:1602.07332, 2016.
  • [178] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li, “Yfcc100m: The new data in multimedia research,” Communications of the ACM, vol. 59, no. 2, pp. 64–73, 2016.
  • [179] B. Li, X. Liu, K. Dinesh, Z. Duan, and G. Sharma, “Creating a multitrack classical music performance dataset for multimodal music analysis: Challenges, insights, and applications,” IEEE Transactions on Multimedia, vol. 21, no. 2, pp. 522–535, 2018.
  • [180] J. S. Chung, A. Nagrani, and A. Zisserman, “Voxceleb2: Deep speaker recognition,” arXiv preprint arXiv:1806.05622, 2018.
  • [181] J. S. Chung and A. Zisserman, “Lip reading in the wild,” in Asian conference on computer vision, pp. 87–103, Springer, 2016.
  • [182] S. E. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee, “Learning what and where to draw,” Advances in neural information processing systems, vol. 29, pp. 217–225, 2016.
  • [183] C. Zhang and Y. Peng, “Stacking vae and gan for context-aware text-to-image generation,” in 2018 IEEE Fourth International Conference on Multimedia Big Data (BigMM), pp. 1–5, IEEE, 2018.
  • [184] N. Bodla, G. Hua, and R. Chellappa, “Semi-supervised fusedgan for conditional image generation,” in Proceedings of the European Conference on Computer Vision (ECCV), pp. 669–683, 2018.
  • [185] X. Huang, M. Wang, and M. Gong, “Hierarchically-fused generative adversarial network for text to realistic image synthesis,” in 2019 16th Conference on Computer and Robot Vision (CRV), pp. 73–80, IEEE, 2019.
  • [186] T. Qiao, J. Zhang, D. Xu, and D. Tao, “Learn, imagine and create: Text-to-image generation from prior knowledge,” Advances in Neural Information Processing Systems, vol. 32, pp. 887–897, 2019.
  • [187] M. Zhu, P. Pan, W. Chen, and Y. Yang, “Dm-gan: Dynamic memory generative adversarial networks for text-to-image synthesis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5802–5810, 2019.
  • [188] G. Yin, B. Liu, L. Sheng, N. Yu, X. Wang, and J. Shao, “Semantics disentangling for text-to-image generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2327–2336, 2019.
  • [189] M. Wang, C. Lang, L. Liang, G. Lyu, S. Feng, and T. Wang, “Attentive generative adversarial network to bridge multi-domain gap for image synthesis,” in 2020 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6, IEEE, 2020.
  • [190] J. Cheng, F. Wu, Y. Tian, L. Wang, and D. Tao, “Rifegan: Rich feature generation for text-to-image synthesis from prior knowledge,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10911–10920, 2020.
  • [191] M. Wang, C. Lang, L. Liang, S. Feng, T. Wang, and Y. Gao, “End-to-end text-to-image synthesis with spatial constrains,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 11, no. 4, pp. 1–19, 2020.
  • [192] K. Joseph, A. Pal, S. Rajanala, and V. N. Balasubramanian, “C4synth: Cross-caption cycle-consistent text-to-image synthesis,” in 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 358–366, IEEE, 2019.
  • [193] M. Cha, Y. L. Gwon, and H. Kung, “Adversarial learning of semantic relevance in text to image synthesis,” in Proceedings of the AAAI conference on artificial intelligence, pp. 3272–3279, 2019.
  • [194] T. Hinz, S. Heinrich, and S. Wermter, “Semantic object accuracy for generative text-to-image synthesis,” arXiv preprint arXiv:1910.13321, 2019.
  • [195] S. Sharma, D. Suhubdy, V. Michalski, S. E. Kahou, and Y. Bengio, “Chatpainter: Improving text to image generation using dialogue,” arXiv preprint arXiv:1802.08216, 2018.
  • [196] F. Tan, S. Feng, and V. Ordonez, “Text2scene: Generating compositional scenes from textual descriptions,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6710–6719, 2019.
  • [197] S. Hong, D. Yang, J. Choi, and H. Lee, “Inferring semantic layout for hierarchical text-to-image synthesis,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7986–7994, 2018.
  • [198] W. Li, P. Zhang, L. Zhang, Q. Huang, X. He, S. Lyu, and J. Gao, “Object-driven text-to-image synthesis via adversarial training,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12174–12182, 2019.
  • [199] R. Rombach, P. Esser, and B. Ommer, “Network-to-network translation with conditional invertible neural networks,” arXiv preprint arXiv:2005.13580, 2020.
  • [200] J. Liang, W. Pei, and F. Lu, “Cpgan: Content-parsing generative adversarial networks for text-to-image synthesis,” in European Conference on Computer Vision, pp. 491–508, Springer, 2020.
  • [201] H. Zhang, J. Y. Koh, J. Baldridge, H. Lee, and Y. Yang, “Cross-modal contrastive learning for text-to-image generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 833–842, 2021.
  • [202] M. Ding, Z. Yang, W. Hong, W. Zheng, C. Zhou, D. Yin, J. Lin, X. Zou, Z. Shao, H. Yang, et al., “Cogview: Mastering text-to-image generation via transformers,” arXiv preprint arXiv:2105.13290, 2021.
  • [203] K. Prajwal, R. Mukhopadhyay, V. P. Namboodiri, and C. Jawahar, “A lip sync expert is all you need for speech to lip generation in the wild,” in Proceedings of the 28th ACM International Conference on Multimedia, pp. 484–492, 2020.
  • [204] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training gans,” Advances in neural information processing systems, vol. 29, pp. 2234–2242, 2016.
  • [205] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826, 2016.
  • [206] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” in Advances in neural information processing systems, pp. 6626–6637, 2017.
  • [207] H. Zheng, Z. Lin, J. Lu, S. Cohen, J. Zhang, N. Xu, and J. Luo, “Semantic layout manipulation with high-resolution sparse attention,” arXiv preprint arXiv:2012.07288, 2020.
  • [208] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 586–595, 2018.
  • [209] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3156–3164, 2015.
  • [210] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp. 311–318, 2002.
  • [211] A. Lavie and A. Agarwal, “Meteor: An automatic metric for mt evaluation with high levels of correlation with human judgments,” in Proceedings of the second workshop on statistical machine translation, pp. 228–231, 2007.
  • [212] R. Vedantam, C. Lawrence Zitnick, and D. Parikh, “Cider: Consensus-based image description evaluation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4566–4575, 2015.
  • [213] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” Advances in neural information processing systems, vol. 30, 2017.
  • [214] T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun, “Unified perceptual parsing for scene understanding,” in Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434, 2018.
  • [215] F. Yu, V. Koltun, and T. Funkhouser, “Dilated residual networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 472–480, 2017.
  • [216] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Semantic image segmentation with deep convolutional nets and fully connected crfs,” arXiv preprint arXiv:1412.7062, 2014.