
Efficient Semantic Image Synthesis via Class-Adaptive Normalization (TPAMI 2021)

Spatially-adaptive normalization (SPADE) has recently proven remarkably successful in conditional semantic image synthesis, modulating the normalized activations with spatially-varying transformations learned from semantic layouts to prevent the semantic information from being washed away. Despite its impressive performance, a more thorough understanding of the advantages inside the box is still highly demanded to help reduce the significant computation and parameter overhead introduced by this novel structure. In this paper, from a return-on-investment point of view, we conduct an in-depth analysis of the effectiveness of this spatially-adaptive normalization and observe that its modulation parameters benefit more from semantic-awareness than from spatial-adaptiveness, especially for high-resolution input masks. Inspired by this observation, we propose class-adaptive normalization (CLADE), a lightweight but equally effective variant that is only adaptive to semantic classes. To further improve spatial-adaptiveness, we introduce an intra-class positional encoding calculated from the semantic layout to modulate the normalization parameters of CLADE, and propose a truly spatially-adaptive variant, namely CLADE-ICPE. Through extensive experiments on multiple challenging datasets, we demonstrate that the proposed CLADE generalizes to different SPADE-based methods and achieves generation quality comparable to SPADE while being much more efficient, with fewer extra parameters and lower computational cost. The code and pretrained models are available at <>.






1 Introduction

Image synthesis has made great progress recently thanks to the advances of deep generative models. The latest successes, such as StyleGAN [23, 24], are already capable of producing highly realistic images from random latent codes. Yet conditional image synthesis, the task of generating photo-realistic images conditioned on some input data, is still very challenging. In this work, we focus on semantic image synthesis, a specific conditional image generation task that aims at converting a semantic segmentation mask into a photo-realistic image.

To tackle this problem, some previous methods [19, 44] directly feed the semantic segmentation mask to the conventional deep network architecture built by stacking convolution, normalization, and nonlinearity layers. However, as pointed out in [34], common normalization layers like instance normalization [43] tend to wash away the semantic information, especially for flat segmentation masks. To compensate for the information loss, a novel spatially-adaptive normalization, SPADE [34], is proposed, which modulates the normalized activation in a spatially-adaptive manner, conditioned on the input segmentation mask. Therefore, by replacing all the common normalization layers with SPADE blocks, the semantic information can be successfully propagated throughout the network, which can improve performance in terms of visual fidelity and spatial alignment.

Fig. 1: Some semantic image synthesis results produced by our method. Our method can not only handle the synthesis from a pure semantic segmentation mask (left six columns) but also support controllable synthesis via different reference style images (right two columns).

Despite the effectiveness of spatially-adaptive normalization, its advantages have not been fully uncovered yet. Is spatial-adaptiveness the sole or main reason for its superior performance? Is there a better design that can improve efficiency without compromising quality? In this paper, we try to answer these questions by analyzing SPADE in depth. Our key observation is that semantic-awareness may actually contribute much more than spatial-adaptiveness. In fact, since the two-layer modulation network used to regress the transformation parameters is so shallow, the resulting denormalization parameters are almost spatially invariant within regions of the same semantic class, especially for high-resolution input masks. Meanwhile, given that a SPADE block is placed before almost every convolutional layer, this redundancy recurs many times in the generation pass, easily leading to a heavy amount of unnecessary computation and parameter overhead.

Motivated by this observation, we propose a novel normalization layer, namely CLass-Adaptive (DE)normalization (CLADE). Different from the spatially-adaptive solution of SPADE, CLADE instead uses the input semantic mask to modulate the normalized activations in a class-adaptive manner. Specifically, CLADE is only adaptive to different semantic classes to maintain the crucial semantic-awareness property, independent of the spatial position, semantic shape, or layout of the semantic mask. Thanks to this lightweight design, CLADE is surprisingly simple to implement and requires no extra modulation network. Therefore, its computation and parameter overhead is almost negligible compared with SPADE, making it a better alternative to conventional normalization layers. Taking the generator for the ADE20k dataset [52] as an example, the extra parameter and computation cost introduced by CLADE is negligible, whereas that of SPADE is substantial.

Although class-adaptiveness greatly reduces the computational overhead and achieves excellent performance, we believe that spatial-adaptiveness could still benefit semantic synthesis. To enhance the spatial-adaptiveness expected by SPADE, we further propose to utilize an extra positional encoding map representing the intra-class spatial variance, which defines the normalized relative distance from each pixel to its semantic object center. This positional encoding is then integrated into the CLADE modulation parameters, making them spatially-adaptive within regions of the same semantic class. This can be viewed as a spatially-adaptive variant of CLADE, namely CLADE-ICPE.

To demonstrate the effectiveness and efficiency of CLADE, we conduct extensive experiments on multiple challenging datasets, including Cityscapes [7], COCO-Stuff [3], and ADE20k (including its subset ADE20k-outdoor) [52]. Without bells and whistles, just by replacing all the SPADE layers with CLADE, comparable performance can be achieved with a much smaller model size and much lower computation cost. Some visual results are given in Figure 1.

2 Related Works

2.1 Generative Adversarial Networks

In recent years, image synthesis has achieved significant progress thanks to the emergence of generative adversarial networks (GANs) [12]. This adversarial training strategy enables the generator network to synthesize semantically meaningful images from random noise. Starting from the early work [12], many follow-up works have been proposed from different aspects. For example, to make network training more stable, some works [1, 39, 32] propose improvements to the loss functions. DCGAN [36] proposes a set of constraints on the architectural topology of convolutional GANs that make them stable to train in most settings. For higher resolution and quality, ProgressiveGAN [22] designs a training strategy to gradually synthesize high-resolution images. BigGAN [2] trains the network on a large-scale image dataset to improve the capabilities of the generator. Recent works [23, 24] not only pursue realistic image synthesis but also attempt to improve control over the synthesized image through the exploration of the latent code. Different from these works, we are more interested in controlling the synthesized image in a more intuitive way, by using additional conditional inputs to control the synthesis results.

2.2 Conditional Image Synthesis

Instead of generation from a random noise, conditional image synthesis refers to the task of generating photo-realistic images conditioned on the input such as texts [14, 37, 47, 49] and images [17, 19, 29, 53, 33, 34]. Our work focuses on a special form of conditional image synthesis that aims at generating photo-realistic images conditioned on input segmentation masks, which is called semantic image synthesis.

For this task, many impressive works have been proposed in the past several years. One of the most representative is pix2pix [19], which proposes a unified image-to-image translation framework based on the conditional generative adversarial network. To further improve quality or enable more functionality, many follow-up works have appeared, such as pix2pixHD [44], SIMS [35], and SPADE [34]. SPADE proposes a spatially-varying normalization layer for the first time and has had a profound impact as a basic backbone. Many recent works for different downstream tasks have adopted this architecture, such as semantic image synthesis [10, 55, 51], portrait synthesis or editing [54, 42], and semantic view synthesis [15]. Other works [20, 50], although not using SPADE directly, are inspired by it to introduce spatial-adaptiveness into normalization layers. It is precisely because of the success of SPADE that we conduct an in-depth analysis of its superiority and propose a new, efficient, and effective normalization layer.

2.3 Normalization Layers

In the deep learning era, normalization layers play a vital role in achieving better convergence and performance, especially for deep networks. They follow a similar operating logic: first normalize the input features to zero mean and unit standard deviation, and then modulate the normalized features with learnable modulation scale/shift parameters.

Existing normalization layers can generally be divided into two types: unconditional and conditional. Typical unconditional normalization layers include Batch Normalization (BN) [18], Instance Normalization (IN) [43], Group Normalization (GN) [45], and Positional Normalization (PONO) [28]. Compared to unconditional normalization, the behavior of conditional normalization is not static and depends on external input. Conditional Instance Normalization (Conditional IN) [9] and Adaptive Instance Normalization (AdaIN) [16] are two popular conditional normalization layers originally designed for style transfer. To transfer the style from one image to another, they encode the style information into the modulation scale/shift parameters.

For semantic image synthesis, most previous works simply leveraged unconditional normalization layers such as BN or IN in their networks. Recently, Park et al. [34] point out that the common normalization layers used in existing methods tend to "wash away" semantic information when applied to flat segmentation masks. To compensate for the missing information, they innovatively propose a new spatially-adaptive normalization layer named SPADE. Different from common normalization layers, SPADE puts the semantic information back by making the modulation parameters a function of the semantic mask in a spatially-adaptive way. Based on our analysis and observation that semantic-awareness, rather than spatial-adaptiveness, is likely the essential property behind SPADE's superior performance, we propose CLADE, a normalization layer that achieves comparable performance to SPADE at negligible cost.

3 Semantic Image Synthesis

Conditioned on a semantic segmentation map $m \in \mathbb{L}^{H \times W}$, semantic image synthesis aims at generating a corresponding high-quality realistic image $x \in \mathbb{R}^{H \times W \times 3}$. Here, $\mathbb{L}$ is the set of class integers that denote different semantic categories, and $H$ and $W$ are the target image height and width.

Most vanilla synthesis networks, like pix2pix [19] and pix2pixHD [44], adopt a similar network structure concatenating repeated blocks of convolutional, normalization and nonlinearity layers. Among them, normalization layers are essential for better convergence and performance. They can be generally formulated as:


$$\tilde{x}_{h,w,c} = \gamma_{h,w,c} \cdot \frac{x_{h,w,c} - \mu_{h,w,c}}{\sigma_{h,w,c}} + \beta_{h,w,c},$$

with the indices of width, height, and channel denoted as $(w, h, c)$. In what follows, for simplicity of notation, these subscripts will be omitted if the variable is independent of them. Specifically, the input feature $x$ is first normalized with the mean $\mu$ and standard deviation $\sigma$ (normalization step), and then modulated with the learned scale $\gamma$ and shift $\beta$ (modulation step). For most common normalization layers such as BN [18] and IN [43], all four parameters are calculated in a channel-wise manner (independent of $w$ and $h$), with the modulating parameters $\gamma$ and $\beta$ also independent of the input.
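This generic formulation can be sketched in a few lines of NumPy; the function name and the instance-style per-channel statistics are our own illustrative choices, not code from the paper:

```python
import numpy as np

def modulated_norm(x, gamma, beta, eps=1e-5):
    """Generic normalization layer: normalize, then modulate.
    x: feature map of shape (C, H, W); gamma, beta: channel-wise, shape (C,)."""
    mu = x.mean(axis=(1, 2), keepdims=True)      # per-channel mean
    sigma = x.std(axis=(1, 2), keepdims=True)    # per-channel std
    x_hat = (x - mu) / (sigma + eps)             # normalization step
    return gamma[:, None, None] * x_hat + beta[:, None, None]  # modulation step
```

With `gamma` and `beta` as plain learned vectors this behaves like BN/IN; SPADE and CLADE differ only in how these two tensors are produced.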

3.1 Revisit Spatially-Adaptive Normalization

As pointed out in [34], one common issue of the aforementioned normalization layers is that they tend to wash away the semantic information on flat segmentation masks in image synthesis. Motivated by this observation, a new spatially-adaptive normalization layer, namely SPADE, is proposed in [34]. By making the modulation parameters $\gamma$ and $\beta$ functions of the input mask $m$, i.e., $\gamma_{h,w,c}(m)$ and $\beta_{h,w,c}(m)$, the semantic information lost in the normalization step is added back during the modulation step. The functions $\gamma(m)$ and $\beta(m)$ are both implemented with a shallow modulation network consisting of two convolutional layers, as illustrated on the left of Figure 4. By replacing all the normalization layers with SPADE, the generation network proposed in [34] achieves much better synthesis results than previous methods like pix2pixHD [44].

Fig. 2: Visualization of learned modulation parameters for two example semantic masks from the ADE20k dataset, where the original pre-trained SPADE generator is used. Evidently, $\gamma$ and $\beta$ for the same semantic class are almost identical within each semantic region.
Fig. 3: Statistical histograms of $\gamma$ (left) and $\beta$ (right) for the "building" (top), "sky" (middle) and "tree" (bottom) classes from the ADE20k validation dataset, on SPADE blocks with various input-mask resolutions. The distributions of $\gamma$ and $\beta$ are concentrated, and the centralizing trend becomes more obvious as the input-mask resolution increases.
Fig. 4: The illustration diagrams of SPADE (left) and our class-adaptive normalization layer CLADE with a guided sampling operation (right). Using a shallow modulation network consisting of two convolutional layers to model the modulation parameters as the function of input semantic mask, SPADE can add the semantic information lost in the normalization step back. Unlike SPADE, CLADE does not introduce any external modulation network but instead uses an efficient guided sampling operation to sample class-adaptive modulation parameters for each semantic region.

As explained in [34], the advantages of SPADE mainly come from two important properties: spatial-adaptiveness and semantic-awareness. The former indicates that the modulation parameters are spatially varying in a pixel-wise manner, while the latter means that they depend on the semantic classes to bring back the lost information. As the name SPADE implies, spatial-adaptiveness might seem to be the more important property. However, through the following analysis, we argue that semantic-awareness may be the de facto main contributor to SPADE's success.

In Figure 2, we show two examples with masks from the ADE20k validation dataset [52], each consisting of the two semantic labels "sky" and "field". We visualize the intermediate parameters $\gamma$ and $\beta$ of the original pre-trained SPADE generator. It can easily be observed that $\gamma$ and $\beta$ are almost identical within each semantic region, except for the boundary area, which is especially negligible for high-resolution input masks due to the shallowness of the modulation network. In fact, for any two regions sharing the same semantic class within one input mask, or even across different input masks, the learned $\gamma$ and $\beta$ will also be almost identical if the regions are much larger than the receptive field of the two-layer modulation network. We further conduct statistical analyses of $\gamma$ and $\beta$ with the original pre-trained SPADE generator for some semantic classes on the ADE20k validation dataset [52]. In Figure 3, we show the statistical histograms of $\gamma$ and $\beta$ for three common classes ("building", "sky" and "tree") on SPADE blocks with various input-mask resolutions. We observe that the distributions of $\gamma$ and $\beta$ within the same semantic class are concentrated, and that the concentration becomes more pronounced as the input-mask resolution increases. This further suggests that, compared with spatial-adaptiveness, semantic-awareness may be the underlying key to the superior performance of SPADE.
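This concentration is easy to probe numerically: for any dense modulation-parameter map, the standard deviation inside each semantic region measures how spatially varying the map really is. A minimal sketch (the function name is ours):

```python
import numpy as np

def within_class_spread(param_map, mask):
    """Std of a modulation parameter inside each semantic region.
    Values near zero mean the parameter is effectively class-wise,
    not pixel-wise. param_map, mask: arrays of shape (H, W)."""
    return {int(c): float(param_map[mask == c].std())
            for c in np.unique(mask)}
```

Applied to SPADE's predicted $\gamma$ or $\beta$ maps, per-class spreads close to zero would support the class-adaptive interpretation above.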

Fig. 5: Left: the relative ratios of parameters and FLOPs between each SPADE block and its following convolutional layer in the original SPADE generator. Middle and right: the absolute numbers of parameters and FLOPs of different layers. The $x$-axis indicates the layer index, from shallow to deep.

3.2 Class-Adaptive Normalization

Inspired by the above observation, we propose a new efficient conditional normalization layer, called CLass-Adaptive (DE)normalization (CLADE), as shown on the right of Figure 4. Inheriting the idea of semantic information compensation from SPADE, the modulation parameters in CLADE are also adaptive to the semantic input $m$. However, instead of adopting pixel-wise spatial-adaptiveness as in SPADE, CLADE is spatially invariant and only adaptive to different semantic classes. More concretely, $\gamma$ and $\beta$ in CLADE vary with the corresponding semantic class to maintain the essential property of semantic-awareness, but they are independent of any spatial information, including the position, semantic shape, and layout of $m$.

Therefore, rather than learning the modulation parameters through an extra modulation network as in SPADE, we directly maintain a modulation parameter bank for CLADE and optimize it as regular network parameters. Assuming the total number of classes in $m$ to be $N_c$, the parameter bank consists of channel-wise modulation scale parameters $\Gamma = \{\gamma^1, \dots, \gamma^{N_c}\}$ and shift parameters $B = \{\beta^1, \dots, \beta^{N_c}\}$. During training, given an input mask $m$, we fill each semantic region of class $k$ with its corresponding modulation parameters $(\gamma^k, \beta^k)$ to generate dense modulation parameter tensors $\Gamma(m)$ and $B(m)$, respectively. We call this process Guided Sampling in Figure 4.
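Guided sampling is essentially a gather operation: the integer mask indexes a small per-class parameter bank to produce the dense modulation maps. A NumPy sketch under assumed shapes (all names are ours, and instance-style statistics stand in for the actual normalization):

```python
import numpy as np

def clade_modulation(x, mask, gamma_bank, beta_bank, eps=1e-5):
    """CLADE sketch: normalize x (C, H, W), then modulate with
    class-adaptive parameters gathered via the semantic mask (H, W).
    gamma_bank, beta_bank: parameter banks of shape (n_classes, C)."""
    mu = x.mean(axis=(1, 2), keepdims=True)
    sigma = x.std(axis=(1, 2), keepdims=True)
    x_hat = (x - mu) / (sigma + eps)
    # Guided sampling: fancy indexing maps (H, W) -> (H, W, C) -> (C, H, W).
    gamma = gamma_bank[mask].transpose(2, 0, 1)
    beta = beta_bank[mask].transpose(2, 0, 1)
    return gamma * x_hat + beta
```

Any two pixels of the same class receive identical $(\gamma, \beta)$, which is exactly the class-adaptive, spatially-invariant behavior described above.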

In fact, CLADE can also be regarded as a generalized formulation of some existing normalization layers. If $\gamma^k = \gamma$ and $\beta^k = \beta$ for every class $k$, CLADE reduces to BN [18]. And if we make the modulation tensors $\Gamma(m)$ and $B(m)$ spatially uniform and replace the mean and std statistics of BN with the corresponding ones from IN, we obtain Conditional IN.

By default, CLADE uses the additional input of instance maps if provided by the datasets (Cityscapes and COCO-Stuff) to better distinguish the different instances of the same categories. Similar to pix2pixHD and SPADE, we feed the edge map calculated from the instance map (‘edge’ and ‘non-edge’ are represented as ‘1’ and ‘0’) into the network. However, the special architecture of CLADE does not allow us to stack the edge map with the semantic layout directly. Thus, we embed the edge information in the modulated features. To match the activation values in the feature, we first modulate the edge map as follows:


$$\tilde{E} = a \cdot E + b, \qquad (2)$$

where $E$ is the edge map, $\tilde{E}$ is the modulated edge map, and $a$ and $b$ are two floating-point constants learned as regular parameters. Then, we concatenate the modulated $\tilde{E}$ with the feature maps modulated by the CLADE layer along the channel dimension and feed them into the following layers. Since only two constant numbers are involved and Equation (2) can be implemented by pixel-wise value assignment operations, the extra parameter and computation overhead is extremely low and negligible.
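Under our reconstruction of Equation (2), with learned scalars named `a` and `b` for illustration, the edge-embedding step can be sketched as:

```python
import numpy as np

def embed_edge(features, edge, a, b):
    """Modulate the binary edge map (1 = edge, 0 = non-edge) with two
    learnable scalars, then concatenate it to the CLADE-modulated
    features along the channel dimension.
    features: (C, H, W); edge: (H, W); returns (C + 1, H, W)."""
    edge_mod = a * edge + b                                    # Equation (2)
    return np.concatenate([features, edge_mod[None]], axis=0)  # channel concat
```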

3.3 Computation and Parameter Complexity Analysis

3.3.1 Analysis of SPADE

In the original SPADE generator backbone [34], a SPADE block is placed before almost every convolution to replace the conventional normalization layer. For convenience, we denote the input and output channel numbers of the following convolutional layer as $C_{in}$ and $C_{out}$ and its kernel size as $k$. For the modulation network of SPADE, we assume the same kernel size $k$ and an intermediate channel number $C_m$ for all of its convolutional layers. The parameter numbers of the convolutional layer and the SPADE block are then:

$$\mathrm{Param}_{conv} = k^2 C_{in} C_{out}, \qquad \mathrm{Param}_{SPADE} = k^2 C_m (N_c + 2 C_{in}),$$

where the SPADE term counts the shared convolution from the $N_c$-channel mask to $C_m$ hidden channels plus the two heads that output $\gamma$ and $\beta$. With the default implementation settings of SPADE, the parameter ratio between the two is:

$$R_{param} = \frac{\mathrm{Param}_{SPADE}}{\mathrm{Param}_{conv}} = \frac{C_m (N_c + 2 C_{in})}{C_{in} C_{out}}.$$

That is to say, the extra parameters introduced by SPADE become a significant overhead, especially when $N_c$ and $C_m$ are relatively large ($C_m = 128$ by default in SPADE). Take the ADE20k dataset [52] as an example, which contains 151 classes ($N_c = 151$). The SPADE generator consists of 7 SPADE residual blocks. We show the parameter ratio of each convolutional layer in Figure 5. It can be seen that SPADE indeed brings considerable parameter overhead to all the convolutional layers. This becomes even more serious as the network goes deeper, since $C_{in}$ is designed to be smaller at higher feature resolutions, and the ratios for some layers even exceed 100%. Taking all the convolutional layers in the SPADE generator into consideration, the average ratio is substantial.

In addition to parameter numbers, we also analyze the computation complexity, using floating-point operations (FLOPs) as the metric. Since the convolutional layers within the modulation network dominate the computation cost of SPADE, the FLOPs of the convolutional layer and the SPADE block can be approximated as:

$$\mathrm{FLOPs}_{conv} \approx 2 k^2 C_{in} C_{out} H' W', \qquad \mathrm{FLOPs}_{SPADE} \approx 2 k^2 C_m (N_c + 2 C_{in}) H' W',$$

where $W'$ and $H'$ are the width and height of the output feature, respectively. Therefore, the per-layer FLOPs ratio is identical to the parameter ratio shown in Figure 5. However, different from the parameter number, the absolute FLOPs grow with the feature resolution and are therefore larger in deeper layers, which makes the computation overhead even worse. Taking the same ADE20k dataset as an example, the average extra FLOPs ratio introduced by SPADE is about 234.73%, which means the computation cost of SPADE is even heavier than that of the convolutional layers themselves. More importantly, it is now popular to adopt very large synthesis networks to ensure good performance, which already consumes a surprisingly large amount of parameter space and computation resources; SPADE further aggravates this situation, which might be unaffordable in many cases.

3.3.2 Analysis of CLADE

Compared to SPADE, our CLADE does not require any extra modulation network to regress the modulation parameters. The corresponding numbers of parameters and FLOPs are:

$$\mathrm{Param}_{CLADE} = 2 N_c C_{in}, \qquad \mathrm{FLOPs}_{CLADE} \approx 2 C_{in} H' W',$$

where we count each value assignment as one floating-point operation. Similar to SPADE, if every convolutional layer is followed by one CLADE layer, the relative ratios of parameters and FLOPs are:

$$R_{param} = \frac{2 N_c}{k^2 C_{out}}, \qquad R_{FLOPs} \approx \frac{1}{k^2 C_{out}}.$$

In most existing synthesis networks, these ratios are extremely small. For example, with the same backbone as the above SPADE generator for the ADE20k dataset, the average ratios for parameters and FLOPs are only a few percent and a tiny fraction of a percent, respectively. Therefore, compared to SPADE, the parameter and computation overhead of CLADE is negligible, which is friendly to practical scenarios in both training and inference. Despite its simplicity and efficiency, we demonstrate with extensive experiments in Section 4 that CLADE still achieves performance comparable to SPADE.
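The counts above can be reproduced with back-of-the-envelope formulas. The sketch below assumes SPADE's default modulation network (a shared $k \times k$ convolution from the $N_c$-channel mask to $C_m$ hidden channels, plus two heads producing $\gamma$ and $\beta$); the function names and the example layer width are our own:

```python
def conv_params(c_in, c_out, k=3):
    """Weights of a k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def spade_extra_params(c_in, n_cls, c_mid=128, k=3):
    """Shared conv (mask -> hidden) + two heads (hidden -> gamma/beta)."""
    return conv_params(n_cls, c_mid, k) + 2 * conv_params(c_mid, c_in, k)

def clade_extra_params(c_in, n_cls):
    """One scale and one shift scalar per (class, channel) pair."""
    return 2 * n_cls * c_in

# Hypothetical wide layer from an ADE20k-style generator (151 classes).
conv = conv_params(1024, 1024)
spade = spade_extra_params(1024, 151)
clade = clade_extra_params(1024, 151)
```

For this example layer, SPADE's overhead is roughly a quarter of the convolution itself, while CLADE's is only a few percent, nearly an order of magnitude less than SPADE's; narrower layers make SPADE's relative overhead much worse.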

Fig. 6: The illustration of the class-adaptive normalization layer (CLADE) with intra-class positional encoding (ICPE). The positional encoding map is calculated from the semantic segmentation map. $P^h$ and $P^w$ represent the positional encoding along the height and width dimensions.

3.4 Spatially-Adaptive CLADE

As mentioned before, the modulation parameters of SPADE are almost spatially invariant within the same semantic region, especially for high-resolution input masks. In other words, spatial-adaptiveness is not fully utilized in SPADE, mainly due to the limited receptive fields of its shallow modulation network. Theoretically, increasing the depth of the network would accumulate receptive field and achieve better spatial-adaptiveness, but at a prohibitively high computational cost. Based on this observation, we propose a variant of CLADE, CLADE-ICPE, that improves intra-class spatial adaptiveness by leveraging a positional encoding map as an extra input.

The positional encoding map is defined by the relative distance from each pixel to its corresponding object center and can be calculated from the input semantic mask $m$. Specifically, for each pixel $(h, w)$, we first find its belonging semantic object $O_{h,w}$ by detecting the largest connected component of the corresponding semantic category, and obtain the object center $(c_h(O_{h,w}), c_w(O_{h,w}))$. The distance map along the height dimension is then defined as:

$$D^h_{h,w} = h - c_h(O_{h,w}).$$

We further define the maximum offset of each object $O$ as:

$$D^h_{max}(O) = \max_{(h,w) \in O} \left| D^h_{h,w} \right|.$$

Finally, we obtain the normalized distance map $P^h$ by normalizing $D^h$ with the maximum offset:

$$P^h_{h,w} = \frac{D^h_{h,w}}{D^h_{max}(O_{h,w})},$$

and the encoding $P^w$ along the width dimension is defined analogously.
As shown in Figure 6, in order to utilize the positional encoding map $P = (P^h, P^w)$, we follow the modulation idea and use a convolutional layer to map the positional encoding to the modulation parameters:

$$\gamma'_{h,w,c} = \gamma_{h,w,c} \cdot \left(1 + f_\gamma(P)_{h,w}\right), \qquad \beta'_{h,w,c} = \beta_{h,w,c} \cdot \left(1 + f_\beta(P)_{h,w}\right),$$

where $f_\gamma$ and $f_\beta$ are convolution operations with one-channel outputs applied to the two-channel input $P$, and $\cdot$ denotes element-wise multiplication. Since the input and output channel numbers of $f_\gamma$ and $f_\beta$ are only 2 and 1, respectively, the extra parameter and computation overhead is almost negligible. Specifically, the corresponding relative ratios of parameters and FLOPs defined in Section 3.3.2 become:

$$R_{param} \approx \frac{2 N_c}{k^2 C_{out}} + \frac{4}{C_{in} C_{out}}, \qquad R_{FLOPs} \approx \frac{2}{k^2 C_{out}} + \frac{4}{C_{in} C_{out}}.$$

Compared with the CLADE ratios above, the ratio of parameters is almost the same, while the ratio of FLOPs is almost twice that of CLADE. However, the absolute ratio is still relatively low, especially compared to SPADE (0.14% vs. 234.73%).

Fig. 7: Architecture of our generator. By default, we feed the downsampled semantic mask to the generator. For multi-modal image generation, the input of the generator is replaced by a random noise vector. For style-guided synthesis, a style encoder produces the distribution from which the style vector is sampled.

3.5 CLADE Generator

Similar to SPADE, our proposed CLADE can be integrated into different generator backbones. In this paper, the CLADE generator follows the network architecture of the SPADE generator [34] by default, but with all the SPADE blocks replaced by CLADE. As shown in Figure 7, it adopts several residual blocks with upsampling layers and progressively increases the output resolution. Each residual block consists of CLADE layers, ReLU layers, and convolutional layers; the skip connection is also built from these layers when the channel numbers before and after the residual block differ. For multi-modal synthesis, we follow the same strategy as SPADE [34] and attach an extra encoder that encodes an image into a random vector. Specifically, this encoder consists of a series of stride-2 convolutional layers, instance normalization layers, and LReLU activation layers, and outputs the mean and variance vectors of the distribution of the specified image. A random vector sampled from this distribution is then fed into the CLADE generator as style guidance, enabling global diversity in the generated results.
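The style-guidance step amounts to the standard reparameterization trick: the encoder's mean and (log-)variance define a Gaussian from which the style vector is drawn. A sketch with assumed names:

```python
import numpy as np

def sample_style(mu, logvar, rng=None):
    """Draw a style vector z ~ N(mu, exp(logvar)) via the
    reparameterization trick, to be fed to the generator as guidance."""
    rng = rng or np.random.default_rng()
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
```

Sampling different vectors from the same encoded distribution yields the globally diverse outputs described above; feeding the encoder a reference image makes the synthesis style-guided.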

4 Experiments

4.1 Datasets

Main experiments are conducted on four popular datasets: ADE20k, ADE20k-outdoor, COCO-Stuff, and Cityscapes. The ADE20k dataset [52] consists of 25,210 images (20,210 for training, 2,000 for validation and 3,000 for testing), covering a total of 150 object and stuff categories. ADE20k-outdoor is a subset of ADE20k that only contains outdoor scenes. Similar to previous work [34, 35], we directly select the images containing categories such as sky, trees, and sea without manual inspection. There are 9,649 training images and 943 validation images. The COCO-Stuff dataset [3] augments COCO by adding dense pixel-wise stuff annotations. It has 118,000 training images and 5,000 validation images with 182 semantic categories. The Cityscapes dataset [7] is a widely used dataset for semantic image synthesis. It contains 2,975 high-resolution training images and 500 validation images of 35 semantic categories.

We use two additional datasets to evaluate the generalization ability of CLADE when applied to recent semantic synthesis methods that depend on SPADE. CelebAMask-HQ [27, 22, 31] contains 30,000 segmentation masks with 19 classes from the CelebA-HQ face image dataset, split into 28,000 training images and 2,000 validation images. DeepFashion [30] contains 52,712 person images with fashion clothes. We use the processed version provided by SMIS [55], which consists of 30,000 training images and 2,247 validation images.

Dataset Method mIoU accu FID Params (M) FLOPs (G) Runtime (s)
ADE20k pix2pixHD 27.27 72.61 45.87 182.9 99.3 0.041
SPADE 36.28 78.13 29.79 96.5 181.3 0.042
CLADE 35.43 77.36 30.48 71.4 42.2 0.024
CLADE-ICPE 35.06 77.09 28.69 71.4 42.2 0.027
ADE20k-outdoor pix2pixHD 14.89 76.70 67.13 182.9 99.3 0.041
SPADE 19.30 80.44 45.92 96.5 181.3 0.042
CLADE 18.71 80.77 46.37 71.4 42.2 0.024
CLADE-ICPE 18.89 80.04 45.59 71.4 42.2 0.027
COCO-Stuff pix2pixHD 21.07 54.80 58.52 183.0 106.1 0.046
SPADE 36.74 67.81 27.69 97.5 191.3 0.046
CLADE 36.77 68.08 29.16 72.5 42.4 0.027
CLADE-ICPE 36.39 67.57 27.76 72.5 42.4 0.030
Cityscapes pix2pixHD 60.50 93.06 66.04 182.5 151.3 0.038
SPADE 61.95 93.39 51.98 93.0 281.5 0.065
CLADE 60.44 93.42 50.62 67.9 75.5 0.035
CLADE-ICPE 60.40 93.26 42.39 67.9 75.5 0.039

TABLE I: Performance and complexity comparison with other semantic image synthesis methods. All metrics are measured by ourselves in PyTorch on a Titan XP GPU.

Model Backbone SPADE CLADE Backbone SPADE CLADE Backbone SPADE CLADE
Params (M) 68.1 28.4 3.3 67.1 25.9 0.8 68.4 29.1 4.1
Runtime (s) 0.015 0.027 0.009 0.022 0.043 0.013 0.017 0.029 0.010
TABLE II: Detailed comparison with SPADE and CLADE on the ADE20k (Col 2-4), Cityscapes (Col 5-7) and COCO-Stuff (Col 8-10) datasets. Backbone represents the generator without normalization layers, SPADE and CLADE represent the different normalization layers.
Dataset SPADE-light CLADE
mIoU accu FID FLOPs (G) mIoU accu FID FLOPs (G)
ADE20k 26.29 71.76 40.45 58.0 35.43 77.36 30.48 42.2
ADE20k-outdoor 15.54 77.69 58.55 58.0 18.71 80.77 46.37 42.2
COCO-Stuff 27.01 60.64 44.19 68.0 36.77 68.08 29.16 42.4
Cityscapes 59.70 93.13 52.07 132.9 60.44 93.42 50.62 75.5
TABLE III: Performance comparison with a lightweight SPADE model on four datasets. The compared methods have similar FLOPs.

4.2 Implementation Details

We follow the same training setting as SPADE [34]. In detail, the generator is trained with the same multi-scale discriminator, and the loss function is as follows:

$\mathcal{L} = \mathcal{L}_{GAN} + \lambda_{FM}\mathcal{L}_{FM} + \lambda_{P}\mathcal{L}_{P},$

where $\mathcal{L}_{GAN}$ is the hinge version of the GAN loss, and $\mathcal{L}_{FM}$ is the feature matching loss between the real and synthesized images, whose features are extracted by the multi-scale discriminator. $\mathcal{L}_{P}$ is the perceptual loss [21] computed with the feature extractor of the VGG network [40]. For multi-modal synthesis, we add a KL-divergence loss term ($\lambda_{KL}\mathcal{L}_{KL}$) to minimize the gap between the encoded distribution and the Gaussian distribution. By default, we set $\lambda_{FM} = \lambda_{P} = 10$ and $\lambda_{KL} = 0.05$, and the Adam optimizer [26] ($\beta_1 = 0$, $\beta_2 = 0.999$) is used for a total of 200 epochs. The learning rates for the generator and discriminator are set to 0.0001 and 0.0004, respectively. We evaluate the model every 10 epochs and select the model with the best performance. To demonstrate the effectiveness of our method, we not only compare our CLADE with the SPADE baseline [34] but also include another comparison with the popular semantic image synthesis method pix2pixHD [44]. For pix2pixHD, we use the code and settings provided by the authors to train all the models. For SPADE, we directly use the pre-trained models provided by the authors to obtain the result images for evaluation. The resolution of the images is set to 256×256, except for Cityscapes, which is set to 512×256.
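For concreteness, the hinge losses and the weighted objective above can be sketched in a few lines of plain Python (the weights are the SPADE defaults stated above; the score lists stand in for discriminator outputs, so this is illustrative rather than the training code):

```python
def hinge_d_loss(real_scores, fake_scores):
    # Discriminator hinge loss: mean max(0, 1 - D(real)) + mean max(0, 1 + D(fake))
    real = sum(max(0.0, 1.0 - s) for s in real_scores) / len(real_scores)
    fake = sum(max(0.0, 1.0 + s) for s in fake_scores) / len(fake_scores)
    return real + fake

def hinge_g_loss(fake_scores):
    # Generator hinge loss: -mean(D(fake))
    return -sum(fake_scores) / len(fake_scores)

def total_g_loss(gan, fm, perc, kl=0.0, lam_fm=10.0, lam_p=10.0, lam_kl=0.05):
    # Weighted sum of the loss terms described in Sec. 4.2
    return gan + lam_fm * fm + lam_p * perc + lam_kl * kl
```

The KL term only enters when the style encoder is attached for multi-modal synthesis; otherwise `kl` stays at its default of zero.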

4.3 Evaluation Metrics

We leverage the evaluation protocol from previous works [5, 44], which is also used in SPADE [34]. Specifically, we run semantic segmentation algorithms on the synthesized images and evaluate the quality of the predicted semantic masks. To measure segmentation accuracy, two popular metrics are used: mean Intersection-over-Union (mIoU) and pixel accuracy (accu). For the different datasets, we select corresponding state-of-the-art segmentation models: UperNet101 [46, 8] for ADE20k and ADE20k-outdoor, DeepLabv2 [4, 25] for COCO-Stuff, DRN [48, 11] for Cityscapes, and UNet [38, 41] for CelebAMask-HQ. For DeepFashion, we also use UNet but train the model ourselves. We further leverage the commonly used Fréchet Inception Distance (FID) [13] to measure the distribution distance between synthesized and real images. Specifically, we calculate FID between generated validation images and real training images, rather than between generated validation images and real validation images, because the number of training images is much larger than that of validation images and can therefore better reflect the distribution characteristics of real images. The same protocol is also adopted in the recent work [6].
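For reference, mIoU and pixel accuracy are computed from the segmentation confusion matrix, independently of the particular segmentation backbone. A minimal sketch:

```python
def confusion_matrix(pred, gt, num_classes):
    # conf[g][p] counts pixels with ground-truth class g predicted as class p
    conf = [[0] * num_classes for _ in range(num_classes)]
    for p, g in zip(pred, gt):
        conf[g][p] += 1
    return conf

def miou_and_accuracy(conf):
    n = len(conf)
    ious = []
    for c in range(n):
        tp = conf[c][c]                              # true positives
        fn = sum(conf[c]) - tp                       # missed pixels of class c
        fp = sum(conf[r][c] for r in range(n)) - tp  # pixels wrongly labeled c
        denom = tp + fp + fn
        if denom > 0:                                # skip absent classes
            ious.append(tp / denom)
    miou = sum(ious) / len(ious)
    accu = sum(conf[c][c] for c in range(n)) / sum(map(sum, conf))
    return miou, accu
```

In practice the predictions come from the pretrained segmentation models listed above, flattened over all validation pixels.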

4.4 Quantitative Results

As shown in Table I, our method achieves comparable performance with SPADE while significantly reducing the parameter number and computational complexity of the original SPADE generator on all the datasets. For example, on the COCO-Stuff dataset, the proposed CLADE achieves an mIoU score of 36.77 and a pixel accuracy of 68.08, which is even slightly better than SPADE. When compared to pix2pixHD, CLADE outperforms it by more than 15 and 13 points in terms of mIoU and pixel accuracy, respectively. As for the FID score, our CLADE is also close to SPADE and much better than pix2pixHD. On the Cityscapes dataset, our CLADE performs better than SPADE in terms of FID, while the parameter number of our CLADE generator is only about 73% of that of the original SPADE generator and 37% of that of pix2pixHD. As for the computational complexity in terms of FLOPs, the CLADE generator requires about 73% fewer FLOPs than the SPADE generator and 50% fewer than pix2pixHD.

Since GPU computation capacity is often overqualified for single-image processing, the real runtime speedup is less significant than the FLOPs reduction, but we still observe about a 1.8× speedup compared to SPADE. More significant speedups can be observed on low-end devices. Taking one step further, we analyze the extra parameter and computation cost introduced by SPADE and CLADE in Table II. In detail, we calculate the parameter and computation cost of the backbone network (all operations except normalization) and of the SPADE (or CLADE) layers separately. As Table II shows, the advantages of the CLADE layers in terms of parameters and runtime are much more pronounced when the backbone part is excluded.
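The per-layer gap in Table II can be understood from a simple parameter count. A SPADE layer predicts γ and β with a shared 3×3 conv over the one-hot mask followed by two 3×3 convs, whereas CLADE only stores one γ and one β per (class, channel) pair. A back-of-the-envelope sketch (the hidden width of 128 and the 3×3 kernels follow SPADE's defaults, stated here as assumptions):

```python
def spade_extra_params(num_classes, channels, hidden=128, k=3):
    # Shared kxk conv: one-hot mask (num_classes ch) -> hidden, plus bias
    shared = num_classes * hidden * k * k + hidden
    # Two kxk convs: hidden -> gamma map and hidden -> beta map, plus biases
    per_head = hidden * channels * k * k + channels
    return shared + 2 * per_head

def clade_extra_params(num_classes, channels):
    # One gamma and one beta scalar per (class, channel) pair
    return 2 * num_classes * channels
```

For example, with 151 classes (ADE20k) and a 1024-channel feature map, the SPADE layer needs roughly 2.5M extra parameters versus about 0.31M for CLADE, an 8× gap for a single layer.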

When introducing additional spatial information, CLADE-ICPE achieves a significant improvement in terms of FID on all the datasets. Even compared to SPADE, CLADE-ICPE shows a considerable advantage, especially on the Cityscapes dataset. As for model complexity, the additional parameters and FLOPs are negligible, and the increase in average running time is also small.

To further demonstrate the efficiency and effectiveness of CLADE, we also train a lightweight variant of SPADE (denoted as SPADE-light in Table III) by reducing the number of channels in its convolution layers so that it has similar FLOPs to CLADE. SPADE-light performs much worse than CLADE on all datasets.
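Channel reduction is a costly way to match CLADE's FLOPs because conv cost scales with the product of input and output channels, so SPADE-light must shrink the width of every layer substantially. A sketch of the standard conv FLOP estimate (one multiply-accumulate counted as two ops, stride 1, same padding):

```python
def conv_flops(h, w, cin, cout, k=3):
    # Each of the h*w output positions computes cout dot products
    # of length cin * k * k; count multiply and add separately.
    return 2 * h * w * cin * cout * k * k
```

Halving both the input and output channels of a layer cuts its FLOPs by 4× but also quarters its representational width, which is consistent with SPADE-light's degraded scores in Table III.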

Method ADE20k ADE20k-outdoor COCO-Stuff Cityscapes
CLADE vs. SPADE 48.375 57.000 55.000 53.375
CLADE vs. pix2pixHD 68.375 73.375 95.000 57.500
TABLE IV: User study. The numbers indicate the percentage of users who favor the results of the proposed CLADE over the competing method.

4.5 User Study

Since judging the visual quality of an image is usually subjective, we further conduct a user study to compare the results generated by different methods. Specifically, we show users two synthesized images generated from the same semantic mask by two different methods (ours and a baseline) and ask them "which is more realistic". To allow a careful comparison, no time limit is set for the users. For each pairwise comparison, we randomly choose 40 results for each method and involve 20 users. In Table IV, we report the evaluation results on four different datasets. According to the results, users have no obvious preference between our CLADE and SPADE, which once again confirms their comparable performance. Compared to the results of pix2pixHD, however, users clearly prefer our results on all the datasets, including the challenging COCO-Stuff dataset.

4.6 Qualitative Results

Besides the above quantitative comparison, we provide qualitative comparison results on the four datasets. In detail, Figure 8 shows visual results for indoor scenes from the ADE20k dataset and outdoor scenes from the ADE20k-outdoor dataset. Despite the simplicity of our method, it generates high-fidelity images that are comparable to those generated by SPADE; in some cases, our results are even slightly better. In contrast, because of the semantic information loss caused by common normalization layers, the results generated by pix2pixHD are worse than those of both SPADE and our CLADE. Figure 9 provides visual results on the COCO-Stuff dataset. Compared to ADE20k, COCO-Stuff has more categories and contains more small objects, so it is more challenging. Nevertheless, our method still works well and generates high-fidelity results. According to the results in Figure 10, a similar conclusion can be drawn for higher-resolution semantic image synthesis on the Cityscapes dataset (512×256).

We also show results in Figure 12 to compare the visual effect of intra-class spatial adaptiveness. Given additional spatial information, the results exhibit richer details. Taking the ADE20k dataset as an example, SPADE and CLADE can only produce a blurred view out of the window, while CLADE-ICPE generates a high-quality view with rich textures. In particular, for classes covering large regions, both SPADE and CLADE produce repeated or blurry patterns (see the last column of Figure 12) because they cannot differentiate between different positions within the same category. In contrast, CLADE-ICPE produces vivid textures with the spatial guidance of the positional encoding map.

Fig. 8: Visual comparison results on the ADE20k (top five rows) and ADE20k-outdoor (bottom five rows) datasets. Images generated by our method are comparable to or even slightly better than those of SPADE. Compared to pix2pixHD, SPADE and CLADE are overall more realistic.
Fig. 9: Visual comparison results on the challenging COCO-Stuff dataset. Although this dataset contains very diverse categories and small structures, our method works well and generates high-fidelity results.
Fig. 10: High-resolution synthesis (512×256) results on the Cityscapes dataset. Our method produces realistic images with faithful spatial alignment and semantic meaning.
Fig. 11: Multi-modal semantic synthesis results guided by different noise vectors (top row) or reference style images (bottom row). Our method can produce diverse, realistic images.
Fig. 12: Visual comparison results on ADE20k-outdoor, ADE20k, Cityscapes and COCO-Stuff datasets with or without position prior. We also show the results of SPADE as a reference.
ADE20k-outdoor Method C(1-7) S(1)+C(2-7) S(1-2)+C(3-7) S(1-4)+C(5-7) S(1-5)+C(6-7) S(1-6)+C(7) S(1-7)
mIoU 18.71 19.28 18.48 19.06 19.68 19.63 19.30
Runtime (s) 0.024 0.025 0.025 0.028 0.029 0.033 0.042
Cityscapes Method C(1-7) S(1)+C(2-7) S(1-2)+C(3-7) S(1-4)+C(5-7) S(1-5)+C(6-7) S(1-6)+C(7) S(1-7)
mIoU 60.44 61.25 62.14 62.08 62.00 61.47 61.95
Runtime (s) 0.039 0.040 0.040 0.043 0.048 0.057 0.065
TABLE V: Ablation results on ADE20k-outdoor and Cityscapes obtained by mixing SPADE and CLADE with transition points at different resolutions. Here C and S represent the CLADE and SPADE layers respectively. The values in parentheses indicate the indices of the ResBlks that use the specified normalization layer.

4.7 Multi-Modal and Style-Guided Synthesis

As mentioned above, our method easily supports multi-modal and style-guided synthesis by introducing an extra style encoder before the generator network. Specifically, we obtain different style vectors either by random sampling or by feeding different reference images into the style encoder, and then input these style vectors into the generator network to produce diverse images. Some visual results are shown in Figure 11. The results in the top row demonstrate that our method can synthesize diverse images from the same semantic layout. Similarly, as shown in the bottom row, different reference images can be used to further control the global style of the generated images, including but not limited to sunny days, dusk, and night.

4.8 Ablation Analysis of Combining CLADE and SPADE

In the method part, we have shown that, for higher-resolution layers, the distributions of the modulation parameters in SPADE are more centralized (Figure 2) while the corresponding extra computation cost is more significant (Figure 5). For low-resolution layers, the modulation parameters are less centralized and can supply some spatial variance. In contrast, the basic CLADE is only class-adaptive rather than spatially adaptive. Therefore, it is intuitive to use SPADE on lower-resolution layers and CLADE on higher-resolution layers to achieve a better balance between generation quality and efficiency. To verify this point, we mix SPADE and CLADE with transition points at different resolution layers and test the performance on the ADE20k-outdoor and Cityscapes datasets. In the original SPADE generator, there are seven SPADE ResBlks, numbered from 1 to 7 as the resolution increases. The second and third ResBlks are at the same resolution if the resolution of the synthesized image is 256×256; otherwise they are at different resolutions.

As shown in Table V, the average running time decreases as more CLADE layers are used, which is in line with our expectations. More interestingly, by using SPADE on low-resolution layers and switching to CLADE from a middle ResBlk onwards (e.g., from the 6th or 7th ResBlk on the ADE20k-outdoor dataset, and from the 3rd, 5th or 6th on the Cityscapes dataset), the mixed model can even achieve slightly higher mIoU than using SPADE on all layers while being more efficient.
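The mixing configurations in Table V can be described with a small helper that assigns a normalization type to each of the seven ResBlks given a transition point. This mirrors the table's notation and is purely illustrative, not the training code:

```python
def mixed_norm_layout(transition, num_blocks=7):
    # Blocks 1..transition use SPADE ('S'); the remaining blocks use CLADE ('C').
    return ['S' if i <= transition else 'C' for i in range(1, num_blocks + 1)]

def layout_label(layout):
    # Render a layout in the Table V notation, e.g. "S(1-5)+C(6-7)".
    n_s = layout.count('S')
    parts = []
    if n_s:
        parts.append('S(1)' if n_s == 1 else f'S(1-{n_s})')
    if n_s < len(layout):
        start = n_s + 1
        end = len(layout)
        parts.append(f'C({start})' if start == end else f'C({start}-{end})')
    return '+'.join(parts)
```

For instance, `mixed_norm_layout(5)` yields the `S(1-5)+C(6-7)` configuration, the best-performing mix on ADE20k-outdoor in Table V.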

4.9 Ablation Analysis of Intra-Class Positional Encoding

Although the introduction of the positional encoding can provide prior spatial information within each semantic category and help synthesize richer details, properly utilizing this information is non-trivial. Empirically, we find that inappropriate use may even be harmful. Here we study three different ways to apply the positional encoding map:

  • The positional encoding map is directly concatenated with the downsampled semantic mask as extra channels and fed to the generator. This version is called +disti.

  • The positional encoding map is first transformed by one 1×1 convolutional layer and then concatenated with the normalized features (after the CLADE layer) as extra channels. This version is called +distf.

  • As described in Section 3.4, the positional encoding map is used to modulate the original semantic-adaptive modulation parameters of CLADE. This version is called +distp.
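One plausible construction of such an intra-class positional encoding map is sketched below, under the assumption that the encoding is the per-pixel offset to the centroid of the pixel's class, normalized by the class's spatial extent (see Section 3.4 for the exact formulation used in the paper):

```python
from collections import defaultdict

def intra_class_positional_map(mask):
    # mask: 2D list of class ids. Returns per-pixel (dy, dx) offsets to the
    # centroid of the pixel's class, normalized to roughly [-1, 1] per class.
    h, w = len(mask), len(mask[0])
    coords = defaultdict(list)
    for y in range(h):
        for x in range(w):
            coords[mask[y][x]].append((y, x))
    stats = {}
    for c, pts in coords.items():
        cy = sum(p[0] for p in pts) / len(pts)  # class centroid
        cx = sum(p[1] for p in pts) / len(pts)
        ry = max(abs(p[0] - cy) for p in pts) or 1.0  # class extent
        rx = max(abs(p[1] - cx) for p in pts) or 1.0  # (1.0 avoids div-by-zero)
        stats[c] = (cy, cx, ry, rx)
    out = [[None] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            cy, cx, ry, rx = stats[mask[y][x]]
            out[y][x] = ((y - cy) / ry, (x - cx) / rx)
    return out
```

The resulting two-channel map is what the three variants above consume: +disti concatenates it with the input mask, +distf with intermediate features, and +distp uses it to modulate CLADE's parameters.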

In Table VI, we compare these three variants with the original CLADE in terms of FID on the ADE20k, ADE20k-outdoor and Cityscapes datasets. +distp achieves the best FID on all these datasets, while +disti is the worst. Specifically, by comparing +distf with +disti, we observe that adding the spatial information at different feature levels is beneficial. And by comparing +distf with +distp, we find that concatenating the positional encoding feature with the normalized features along the channel dimension is not as effective as the element-wise multiplication used by +distp.

In particular, the performance gain on the Cityscapes dataset is much more significant than that on the ADE20k dataset. We believe this is because Cityscapes contains many large-area categories with clear internal structure, such as buildings and cars. By comparison, although ADE20k-outdoor also has some large-area categories like sky and sea, these have relatively less complex internal structures and thus benefit less from spatial adaptiveness.

Dataset baseline +disti +distf +distp
ADE20k 30.48 31.75 31.13 28.69
ADE20k-outdoor 46.37 48.67 46.81 45.59
Cityscapes 50.62 50.50 48.07 42.39
TABLE VI: Comparison with different positional encoding map embedding on the ADE20k, ADE20k-outdoor and Cityscapes datasets in terms of FID. Baseline denotes the original CLADE without position prior, +distp is the version called CLADE-ICPE in Table I.
Dataset Method mIoU accu FID Params (M) FLOPs (G) Runtime (s)
CelebAMask-HQ SEAN 75.94 95.03 24.30 266.9 420.8 0.165
SEAN-CLADE 74.83 94.51 20.35 241.3 247.1 0.152
GroupDNet 76.13 95.21 29.39 145.3 225.5 0.090
GroupDNet-CLADE 76.70 95.38 29.30 134.6 213.6 0.074
DeepFashion SEAN 76.28 97.46 7.37 223.2 342.9 0.165
SEAN-CLADE 76.32 97.52 7.33 197.8 247.1 0.152
GroupDNet 76.19 97.48 9.72 96.3 291.6 0.062
GroupDNet-CLADE 76.82 97.67 9.79 79.2 118.5 0.042
Cityscapes SEAN 59.02 93.21 53.85 330.4 681.8 0.507
SEAN-CLADE 60.11 93.15 52.76 304.5 476.1 0.471
GroupDNet 59.20 92.78 41.12 76.5 463.6 0.224
GroupDNet-CLADE 59.82 92.83 42.10 57.7 434.6 0.128
TABLE VII: Performance and complexity comparison when applying CLADE onto some recent SPADE-based methods. All the models are trained with the same settings by using the official code.

4.10 Generalization Ability to SPADE-Based Methods

To demonstrate its general applicability, we further replace the SPADE layer with the proposed CLADE layer in some recent SPADE-based methods and show the results in Table VII. Without loss of generality, we still focus on the semantic image synthesis task and select two representative methods: GroupDNet [55] and SEAN [54]. GroupDNet is a semantic-level multimodal image synthesis method that achieves great success on the DeepFashion dataset, while SEAN focuses on face image synthesis and shows excellent performance on the CelebAMask-HQ dataset. Considering the performance of these two methods on their respective datasets, we conduct experiments on those two datasets and add the Cityscapes dataset as a supplement. All the models are trained with the same settings as in the official code; the only difference is the normalization layer.
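At its core, replacing SPADE with CLADE means swapping the convolutional prediction of the modulation parameters for a per-class table lookup after normalization. A minimal single-channel sketch of this class-adaptive denormalization (batch statistics are simplified, and `gamma`/`beta` are illustrative per-class dicts, not the learned tables):

```python
def clade_denorm(x, mask, gamma, beta, eps=1e-5):
    # x: 2D list (single-channel feature map); mask: 2D list of class ids.
    # Normalize over all positions, then apply the per-class affine parameters.
    n = len(x) * len(x[0])
    mean = sum(v for row in x for v in row) / n
    var = sum((v - mean) ** 2 for row in x for v in row) / n
    std = (var + eps) ** 0.5
    return [[gamma[mask[i][j]] * ((x[i][j] - mean) / std) + beta[mask[i][j]]
             for j in range(len(x[0]))] for i in range(len(x))]
```

Because the lookup is identical regardless of the surrounding architecture, dropping it into SEAN or GroupDNet only requires routing the semantic mask to each normalization layer, which is why the replacement in Table VII needs no other changes.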

The detailed comparison results are shown in Table VII. In general, after replacing SPADE with CLADE, the original performance of these methods is almost unaffected, while the parameter number and computational overhead are significantly reduced. In detail, for SEAN, the FID on the CelebAMask-HQ dataset is even significantly improved, and a reduction in model size of about 25M is observed across datasets, which is consistent with the comparison between SPADE and CLADE in Table I.

5 Conclusion

In this paper, we conduct an in-depth analysis of the spatially-adaptive normalization used in semantic image synthesis. We observe that its most essential advantage comes from semantic-awareness rather than spatial-adaptiveness as originally suggested in [34]. Motivated by this observation, we design a more efficient conditional normalization structure, CLADE. Compared to SPADE, CLADE achieves comparable synthesis results while greatly reducing the parameter and computation overhead. To introduce true spatial adaptiveness, we further explore the role of the position prior and propose an improved version of CLADE that modulates the parameters of CLADE through an extra intra-class positional encoding. We further adopt CLADE in several recent SPADE-based methods and obtain comparable or even better results with greatly reduced parameters and computational costs.


  • [1] M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein gan. arXiv preprint arXiv:1701.07875. Cited by: §2.1.
  • [2] A. Brock, J. Donahue, and K. Simonyan (2018) Large scale gan training for high fidelity natural image synthesis. In International Conference on Learning Representations, Cited by: §2.1.
  • [3] H. Caesar, J. Uijlings, and V. Ferrari (2018) COCO-Stuff: thing and stuff classes in context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1209–1218. Cited by: §1, §4.1.
  • [4] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 834–848. Cited by: §4.3.
  • [5] Q. Chen and V. Koltun (2017) Photographic image synthesis with cascaded refinement networks. In Proceedings of the IEEE international conference on computer vision, pp. 1511–1520. Cited by: §4.3.
  • [6] Y. Choi, Y. Uh, J. Yoo, and J. Ha (2020) Stargan v2: diverse image synthesis for multiple domains. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8188–8197. Cited by: §4.3.
  • [7] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223. Cited by: §1, §4.1.
  • [8] CSAILVision (2019) PyTorch implementation for semantic segmentation/scene parsing on MIT ADE20k dataset. Cited by: §4.3.
  • [9] V. Dumoulin, J. Shlens, and M. Kudlur (2016) A learned representation for artistic style. arXiv preprint arXiv:1610.07629. Cited by: §2.3.
  • [10] A. Dundar, K. Sapra, G. Liu, A. Tao, and B. Catanzaro (2020) Panoptic-based image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8070–8079. Cited by: §2.2.
  • [11] fyu (2019) Dilated residual networks. Cited by: §4.3.
  • [12] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §2.1.
  • [13] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in neural information processing systems, pp. 6626–6637. Cited by: §4.3.
  • [14] S. Hong, D. Yang, J. Choi, and H. Lee (2018) Inferring semantic layout for hierarchical text-to-image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7986–7994. Cited by: §2.2.
  • [15] H. Huang, H. Tseng, H. Lee, and J. Huang (2020) Semantic view synthesis. In European Conference on Computer Vision, pp. 592–608. Cited by: §2.2.
  • [16] X. Huang and S. Belongie (2017) Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1501–1510. Cited by: §2.3.
  • [17] X. Huang, M. Liu, S. Belongie, and J. Kautz (2018) Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 172–189. Cited by: §2.2.
  • [18] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456. Cited by: §2.3, §3.2.
  • [19] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134. Cited by: §1, §2.2, §2.2, §3.
  • [20] L. Jiang, C. Zhang, M. Huang, C. Liu, J. Shi, and C. C. Loy (2020) TSIT: a simple and versatile framework for image-to-image translation. arXiv preprint arXiv:2007.12072. Cited by: §2.2.
  • [21] J. Johnson, A. Alahi, and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pp. 694–711. Cited by: §4.2.
  • [22] T. Karras, T. Aila, S. Laine, and J. Lehtinen (2017) Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196. Cited by: §2.1, §4.1.
  • [23] T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, pp. 4401–4410. Cited by: §1, §2.1.
  • [24] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila (2020) Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8110–8119. Cited by: §1, §2.1.
  • [25] kazuto1011 (2019) PyTorch implementation of DeepLab v2 on COCO-Stuff / PASCAL VOC. Cited by: §4.3.
  • [26] D. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In Proceedings of the International Conference on Learning Representations. Cited by: §4.2.
  • [27] C. Lee, Z. Liu, L. Wu, and P. Luo (2020) Maskgan: towards diverse and interactive facial image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5549–5558. Cited by: §4.1.
  • [28] B. Li, F. Wu, K. Q. Weinberger, and S. Belongie (2019) Positional normalization. In Advances in Neural Information Processing Systems, pp. 1622–1634. Cited by: §2.3.
  • [29] M. Liu, T. Breuel, and J. Kautz (2017) Unsupervised image-to-image translation networks. In Advances in neural information processing systems, pp. 700–708. Cited by: §2.2.
  • [30] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang (2016) Deepfashion: powering robust clothes recognition and retrieval with rich annotations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1096–1104. Cited by: §4.1.
  • [31] Z. Liu, P. Luo, X. Wang, and X. Tang (2015) Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pp. 3730–3738. Cited by: §4.1.
  • [32] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley (2017) Least squares generative adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2794–2802. Cited by: §2.1.
  • [33] M. Oza, H. Vaghela, and S. Bagul (2019) Semi-supervised image-to-image translation. In 2019 International Conference of Artificial Intelligence and Information Technology (ICAIIT), pp. 16–20. Cited by: §2.2.
  • [34] T. Park, M. Liu, T. Wang, and J. Zhu (2019) Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2337–2346. Cited by: Semantic Image Synthesis via Efficient Class-Adaptive Normalization, §1, §2.2, §2.2, §2.3, §3.1, §3.1, §3.3.1, §3.5, §4.1, §4.2, §4.3, §5.
  • [35] X. Qi, Q. Chen, J. Jia, and V. Koltun (2018) Semi-parametric image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8808–8816. Cited by: §2.2, §4.1.
  • [36] A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. Cited by: §2.1.
  • [37] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee (2016) Generative adversarial text to image synthesis. In International Conference on Machine Learning, pp. 1060–1069. Cited by: §2.2.
  • [38] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §4.3.
  • [39] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen (2016) Improved techniques for training gans. In Advances in neural information processing systems, pp. 2234–2242. Cited by: §2.1.
  • [40] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §4.2.
  • [41] switchablenorms (2020) CelebAMask-HQ. Cited by: §4.3.
  • [42] Z. Tan, M. Chai, D. Chen, J. Liao, Q. Chu, L. Yuan, S. Tulyakov, and N. Yu (2020) MichiGAN: multi-input-conditioned hair image generation for portrait editing. ACM Transactions on Graphics (TOG) 39 (4), pp. 95–1. Cited by: §2.2.
  • [43] D. Ulyanov, A. Vedaldi, and V. Lempitsky (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022. Cited by: §1, §2.3, §3.
  • [44] T. Wang, M. Liu, J. Zhu, A. Tao, J. Kautz, and B. Catanzaro (2018) High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8798–8807. Cited by: §1, §2.2, §3.1, §3, §4.2, §4.3.
  • [45] Y. Wu and K. He (2018) Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19. Cited by: §2.3.
  • [46] T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun (2018) Unified perceptual parsing for scene understanding. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434. Cited by: §4.3.
  • [47] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He (2018) Attngan: fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1316–1324. Cited by: §2.2.
  • [48] F. Yu, V. Koltun, and T. Funkhouser (2017) Dilated residual networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 472–480. Cited by: §4.3.
  • [49] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas (2017) Stackgan: text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 5907–5915. Cited by: §2.2.
  • [50] P. Zhang, B. Zhang, D. Chen, L. Yuan, and F. Wen (2020) Cross-domain correspondence learning for exemplar-based image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5143–5153. Cited by: §2.2.
  • [51] H. Zheng, H. Liao, L. Chen, W. Xiong, T. Chen, and J. Luo (2019) Example-guided scene image synthesis using masked spatial-channel attention and patch-based self-supervision. arXiv preprint arXiv:1911.12362. Cited by: §2.2.
  • [52] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba (2017) Scene parsing through ade20k dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 633–641. Cited by: §1, §1, §3.1, §3.3.1, §4.1.
  • [53] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232. Cited by: §2.2.
  • [54] P. Zhu, R. Abdal, Y. Qin, and P. Wonka (2020) SEAN: image synthesis with semantic region-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5104–5113. Cited by: §2.2, §4.10.
  • [55] Z. Zhu, Z. Xu, A. You, and X. Bai (2020) Semantically multi-modal image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5467–5476. Cited by: §2.2, §4.1, §4.10.