Image Synthesis via Semantic Composition

09/15/2021 ∙ by Yi Wang, et al. ∙ Megvii Technology Limited, The Chinese University of Hong Kong

In this paper, we present a novel approach to synthesize realistic images based on their semantic layouts. We hypothesize that objects with similar appearance share similar representations. Our method establishes dependencies between regions according to their appearance correlation, yielding both spatially variant and associated representations. Conditioning on these features, we propose a dynamic weighted network constructed by spatially conditional computation (with both convolution and normalization). Beyond preserving semantic distinctions, the dynamic network strengthens semantic relevance, benefiting global structure and detail synthesis. Extensive experiments on benchmarks demonstrate that our method achieves compelling generation performance both qualitatively and quantitatively.




1 Introduction

Semantic image synthesis transforms an abstract semantic layout into realistic images, the inverse task of semantic segmentation (Figure 1). It is widely used in image manipulation and content creation. Recent methods on this task are built upon generative adversarial networks (GANs) [7], modeling the image distribution conditioned on segmentation masks.

Despite the substantial progress [14, 31, 24, 21, 29, 4, 26, 16, 30], this line of research is still challenging due to the high complexity of characterizing object distributions. Recent advances [24, 21] in GAN-based image synthesis concentrate on how to exploit spatial semantic variance in the input to better preserve layout and independent semantics, leading to further improvement in generative performance. They both use different parametric operators to handle different objects.

Specifically, SPADE [24] proposes a spatial semantics-dependent denormalization operation for the common normalization, as the feature statistics are highly correlated with semantics. CC-FPSE [21] extends this idea to convolution using dynamic weights, generating spatially-variant convolutions from the semantic layouts. Relationships between objects are implicitly modeled by the weight-sharing convolutional kernels (SPADE) or hierarchical structures brought by stacked convolutions. We believe when performing semantic-aware operations, enhancing object relationship could be further beneficial to final synthesis, since context and long-range dependency have proven effective in several vision tasks [6, 32, 43, 41, 34].

Figure 1: Semantic image synthesis results of our method on face and scene datasets.

To explicitly build connection between objects and stuff in image synthesis, we propose a semantic encoding and stylization method, named semantic-composition generative adversarial network (SC-GAN). We first generate the semantic-aware and appearance-correlated representations from mapping the discrete semantic layout to their corresponding images.

Our idea is inspired by the following observation. Different semantics are assigned distinct labels in scene or face parsing datasets, yet some of them are highly correlated in appearance, e.g., the left and right eyes in CelebAMask-HQ [19]. We abstract objects in images or feature maps into vectors by encoding the semantic layout, which we call semantic vectors. This intermediate representation characterizes the relationship between different semantics based on their appearance similarity.

With these semantic vectors, we create semantic-aware and appearance-consistent local operations in a semantic-stylization fashion. Consistent with the proposed semantic vectors, these dynamic operators are also variant in semantics and correlated in appearance. Our proposed operators are extended from existing conditional computation [38] to stylize the input conditioned on the semantic vectors. Specifically, we exploit semantic vectors to combine a shared group of learnable weights, parameterizing the convolutions and normalizations used in semantic stylization.

Note the learning of semantic vectors and render candidates is non-trivial. Intuitive designs that encode the semantic layout to semantic vectors for later dynamic computations make semantic vectors stationary, since these semantic vectors are not directly regularized by the appearance information in the image. In this paper, we make the learning of semantic encoding and stylization relatively independent. Semantic encoding is trained by estimating the corresponding natural images from the input semantic layouts by maximum likelihood. Semantic stylization is trained in an adversarial manner.

Our method is validated on several image synthesis benchmark datasets quantitatively and qualitatively. Also, the decoupled design of semantics encoding and stylization makes our method applicable to other tasks, e.g., unpaired image translation [46]. Our contribution is threefold.

  • We propose a new generator design for GAN-based semantic image synthesis. We present spatially-variant and appearance-correlated operations (both convolution and normalization) via spatially conditional computation, exploiting both local and global characteristics for semantic-aware processing.

  • Our proposed generator with a compact capacity outperforms other popular ones on face and scene synthesis datasets in visual and numerical comparisons.

  • The decoupled design in our method finds other applications, e.g., unpaired image-to-image translation.

Figure 2: Framework of SC-GAN. SCB denotes the spatially conditional block. It is constructed by spatially conditional convolutions and normalizations, parameterized by semantic vectors. Its design is given in Figure 4 and Section 3.2.

2 Related Work

2.1 Semantic Image Synthesis

This task is to create a realistic image based on the given semantic layout (pixel-level semantic labels). Essentially, it is an ill-posed problem as one semantic layout may correspond to several feasible pictures. It can be dated back to ‘Image analogy’ in 2001 [10], in which the mentioned mapping uncertainties are resolved by the local matching and constraints from a reference image.

Recent learning-based approaches [14, 31, 35, 24, 21, 29, 36, 4, 26, 16, 30, 33, 47, 28] greatly advance this area, formulating it as an image distribution learning problem conditioned on the given semantic maps. Due to the development of conditional generative adversarial networks (cGAN) [22], pix2pix makes seminal exploration on image synthesis [14]. It gives some components and principles about how to apply cGAN to this problem, including loss design, generator structures (e.g., encoder-decoder and U-Net), and Markovian discriminator (also known as PatchGAN).

Later, Wang et al. propose an enhanced version, pix2pixHD [31]. By introducing a U-Net style generator with a larger capacity and several practical techniques for improving GAN training, including feature matching loss, multi-scale discriminators, and perceptual loss, their method boosts image synthesis performance in producing vivid details. Further, SPADE improves the realism of the synthesized images by working on the normalization issue [24]. It shows that using the given segmentation masks to explicitly control the affine transformation in the normalization layers better preserves the generated semantics, leading to a noticeable improvement. This idea is further extended in CC-FPSE [21], which dynamically generates convolutional kernels in the generator conditioned on the input semantics.

Besides GAN-based methods, research also explores this task from other perspectives. Chen and Koltun [4] produce images progressively with regression using cascaded refinement networks (CRN), starting from small-scale semantic layouts. Qi et al. [26] propose a semi-parametric approach that directly utilizes object segments in the training data. They retrieve object segments with the same semantics and similar shapes from the training set to fill the given semantic layout, and then regress these assembled results to the final images. Additionally, Li et al. [20] apply implicit maximum likelihood estimation to CRN to pursue more diverse results from a semantic layout.

2.2 Dynamic Computation

In the development of neural network components, dynamic filters [15] and hypernetworks [8] are proposed for their flexibility to input samples. They generate dynamic weights conditioned on the input or input-related features for parameterizing some operators (mostly fully-connected layers or convolutions). This has been applied to many tasks [39, 25, 12, 2, 24, 21, 17].

Conditionally Parameterized Convolutions

It is a special case of dynamic filters, which produces dynamic weights by conditionally combining the provided candidates [38]. It uses the input features to generate input-dependent convolution kernels by a learnable linear combination of a group of $N$ kernels $\{\mathbf{W}_1, \dots, \mathbf{W}_N\}$, as $\mathbf{W} = \sum_{i=1}^{N} \alpha_i(\mathbf{x})\,\mathbf{W}_i$, where $\mathbf{x}$ is the input feature, $\alpha_i(\cdot)$ computes an input-dependent scalar to weight the $i$-th kernel candidate, and $N$ is the expert number. It is equivalent to a linear mixture of experts, with a more efficient implementation.

In this paper, our proposed dynamic computation units are extended from conditionally parameterized convolutions. We generalize the scalar condition into a spatial one and also apply these techniques to normalization.
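As a rough illustration of conditionally parameterized convolution, the NumPy sketch below mixes a group of kernel candidates with input-dependent scalars and checks that this equals a linear mixture of per-candidate outputs. It is a minimal sketch under assumptions: the convolution is reduced to 1x1 for brevity, and the routing function (global average pooling, a linear layer, then a sigmoid) and all names such as `cond_conv_1x1` and `routing_w` are illustrative choices rather than details given in this paper.

```python
import numpy as np

def cond_conv_1x1(x, kernels, routing_w):
    """Conditionally parameterized 1x1 convolution for one example.

    x:         (H, W, C_in) input feature map
    kernels:   (N, C_in, C_out) kernel candidates ("experts")
    routing_w: (C_in, N) weights of an assumed routing layer
    """
    # Input-dependent scalars: global average pool -> linear -> sigmoid (assumed).
    pooled = x.mean(axis=(0, 1))                          # (C_in,)
    alpha = 1.0 / (1.0 + np.exp(-(pooled @ routing_w)))   # (N,)
    # Mix the candidates into a single kernel, then convolve once.
    mixed = np.tensordot(alpha, kernels, axes=1)          # (C_in, C_out)
    return x @ mixed, alpha

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 4, 8))
kernels = rng.normal(size=(3, 8, 16))
routing_w = rng.normal(size=(8, 3))
y, alpha = cond_conv_1x1(x, kernels, routing_w)

# Mixture-of-experts form: one convolution per candidate, then a weighted sum.
y_moe = sum(a * (x @ k) for a, k in zip(alpha, kernels))
```

Mixing the kernels first needs a single convolution per example, whereas the mixture-of-experts form needs one per candidate, which is where the efficiency gain comes from.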

3 Semantic-Composition GAN

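We aim to transform a semantic layout $\mathbf{S} \in \{0, 1\}^{H \times W \times C}$ to a realistic picture $\hat{\mathbf{I}} \in \mathbb{R}^{H \times W \times 3}$ (where $H$, $W$, and $C$ denote the height, width, and the category number in the semantic layout, respectively), as $\hat{\mathbf{I}} = G(\mathbf{S})$, in which $G$ is a nonlinear mapping. During training, paired data is available: for each layout $\mathbf{S}$, the corresponding natural image $\mathbf{I}$ is provided. We require the synthesized image $\hat{\mathbf{I}}$ to match the given semantic layout, but $\hat{\mathbf{I}}$ and $\mathbf{I}$ are not necessarily identical.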

Our semantic-composition GAN (SC-GAN) decouples semantic image synthesis into two parts: semantic encoding and stylization. They are realized by the semantic vector generator (SVG) and the semantic render generator (SRG), respectively. As shown in Figure 2, SVG takes the semantic layout and produces multi-scale semantic vectors in a feature map form (since we treat each feature point as a semantic vector). SRG transforms randomly sampled noise to the final synthesized image with a dynamic network. The key operators (convolution and normalization) in this network are conditionally parameterized by the semantic vectors provided by SVG and a group of weight candidates.

3.1 Semantic Vectors Generation

SVG transforms the input discrete semantic labels into a semantic- and appearance-correlated representation, the semantic vectors, building the relationship between different semantics according to their appearance similarities. For example, grass and trees are represented by different semantic labels in COCO-Stuff [1], yet they share similarities in color and texture. Our method represents their corresponding regions with different but similar representations.

Generally, semantic vectors are learned by encoding the input semantic labels into the corresponding image. SVG takes the semantic layout $\mathbf{S}$ as input and generates the corresponding semantic vectors in feature map form as $\{\mathbf{V}^i\}$ at different scales. It is expressed as

$$\{\mathbf{V}^i\} = \mathrm{SVG}(\mathbf{S}), \tag{1}$$

where $i$ denotes a different spatial scale.

SVG is in a cascaded refinement form, structured as CRN [4]. We feed a small-scale semantic layout into it, encode the input and upsample the features, and concatenate it with a larger-scale semantic layout. We repeat this process until the output reaches the final resolution.

Nonlinear Mapping

We regularize the computed semantic vectors using a nonlinear mapping. Directly employing these unconstrained vectors would lead to a performance drop (shown in Section 4.3). Suppose $\mathbf{v} \in \mathbb{R}^{N}$ is the unconstrained vector at position $(x, y)$ of a semantic-vector map $\mathbf{V}$, where $x$ and $y$ index height and width. We further normalize its values into $[0, 1]$ with softmax for better performance and interpretability as

$$\mathbf{V}_{x,y}(k) = \frac{\exp(\tau v_k)}{\sum_{j=1}^{N} \exp(\tau v_j)}, \tag{2}$$

where $v_k$ denotes the $k$th scalar in $\mathbf{v}$, and $\tau$ is the temperature to control the smoothness of the semantic vectors $\mathbf{V}$. The smaller $\tau$ is, the smoother $\mathbf{V}$ is. Note that the nonlinear mapping could also be sigmoid, tanh, or other nonlinear functions. The performance of these choices is empirically compared and analyzed in Section 4.3.
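The softmax normalization of semantic vectors can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' code: it assumes the temperature multiplies the logits, which is the convention under which a smaller temperature gives smoother (closer to uniform) vectors as stated above, and the function name `normalize_semantic_vectors` is hypothetical.

```python
import numpy as np

def normalize_semantic_vectors(raw, tau=1.0):
    """Softmax over the last axis with logits scaled by tau.

    raw: (H, W, N) unconstrained semantic vectors.
    Assumption: tau multiplies the logits, so a smaller tau is smoother.
    """
    logits = tau * raw
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)

raw = np.array([[[2.0, 0.0, -1.0]]])                 # one spatial position, N = 3
smooth = normalize_semantic_vectors(raw, tau=0.1)    # close to uniform
sharp = normalize_semantic_vectors(raw, tau=10.0)    # close to one-hot
```

Each normalized vector lies in [0, 1] and sums to one, so it can be read as soft mixing weights over the shared candidates.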

Figure 3: Feature correlation matrix between semantic vectors of different semantics (left) and 2D feature distributions (compressed by t-SNE) of semantic vectors (right) on CelebAMask-HQ.

Effectiveness of Semantic Vectors

We visualize the correlation between different semantic regions using semantic vectors. It is found that these vectors are related by the appearance similarity. Figure 3 (left) shows cosine similarities between mean semantic vectors of different semantics. Note the semantic vectors representing the left and right eyes are almost identical (with cosine similarity 0.999). This is also observed in the relationship between the upper and lower lips. Intriguingly, it also reveals that the semantic vectors for skin are close to those of the nose.

Figure 3 (right) illustrates how these semantic vectors are distributed (compressed to 2D by t-SNE; only 1% of the points are visualized for clarity). Note that the blue points standing for left eyes largely overlap with those of right eyes. Also, the orange point cluster is close to the green one for upper and lower lips. It validates that semantic vectors can give similar representations to different semantic regions with a similar appearance. These semantic vectors are extracted from 100 random images of CelebAMask-HQ.
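The correlation analysis described above (cosine similarity between the mean semantic vectors of each semantic class) could be computed along the following lines. This is a minimal sketch on toy data; the helper names and the toy arrays are illustrative and do not come from the paper.

```python
import numpy as np

def mean_vectors_per_class(vectors, labels, num_classes):
    """vectors: (H, W, N) semantic vectors, labels: (H, W) integer class map."""
    flat_v = vectors.reshape(-1, vectors.shape[-1])
    flat_l = labels.reshape(-1)
    means = np.zeros((num_classes, flat_v.shape[1]))
    for c in range(num_classes):
        mask = flat_l == c
        if mask.any():                      # classes absent from the map stay zero
            means[c] = flat_v[mask].mean(axis=0)
    return means

def cosine_similarity_matrix(means, eps=1e-12):
    """Pairwise cosine similarity between the rows of `means`."""
    norms = np.linalg.norm(means, axis=1, keepdims=True)
    unit = means / np.maximum(norms, eps)
    return unit @ unit.T

# Toy example: classes 1 and 2 point in the same direction, class 0 differs.
vectors = np.array([[[1.0, 0.0], [1.0, 0.0]],
                    [[0.0, 2.0], [0.0, 1.0]]])
labels = np.array([[0, 0], [1, 2]])
sim = cosine_similarity_matrix(mean_vectors_per_class(vectors, labels, 3))
```

In the toy example, `sim[1, 2]` is 1 because the two classes share a direction, mirroring the near-identical left and right eye vectors reported for Figure 3.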

Figure 4: Employed residual block using the proposed spatially conditional convolution (SCC) and normalization (SCN).

3.2 Semantic Render Generation

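SRG is a progressive generator built from residual blocks [9] (Figure 4), each formed by our proposed spatially conditional convolution (SCC, left of Figure 5) and spatially conditional normalization (SCN, right of Figure 5). It transforms random noise $\mathbf{z}$ into the target image $\hat{\mathbf{I}}$, conditioned on the semantic vectors $\{\mathbf{V}^i\}$ of the semantic layout $\mathbf{S}$. It is formulated as

$$\hat{\mathbf{I}} = \mathrm{SRG}(\mathbf{z} \mid \{\mathbf{V}^i\}), \tag{3}$$

where $\mathbf{z} \sim \mathcal{N}(0, 1)$ and $\mathcal{N}(0, 1)$ is a standard normal distribution.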

Both SCC and SCN are spatially conditional, parameterized from semantic vectors and a shared group of weights, making them produce semantic-aware and appearance-correlated operators. Their designs are detailed below.

Spatially Conditional Convolution

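For input feature maps $\mathbf{F}^{in}$ (an intermediate representation in SRG), assuming the learnable kernel candidates are $\{\mathbf{W}_k\}_{k=1}^{N}$, with semantic vectors $\mathbf{V}$, we compute the corresponding output as

$$\mathbf{F}^{out}_{x,y} = \sum_{k=1}^{N} \mathbf{V}_{x,y}(k)\,\big(\mathbf{W}_k \otimes \mathbf{F}^{in}\big)_{x,y}, \tag{4}$$

where $x$ and $y$ indicate the row and column indexes of the given feature maps, $\otimes$ denotes convolution, and $k$ indexes both the channel of $\mathbf{V}$ and the weight candidate.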
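A spatially conditional convolution of this kind can be sketched with standard convolutions: run one convolution per candidate and blend the results per pixel with the semantic vectors. The NumPy code below is a sketch under assumptions (zero padding, stride 1, a single example, hypothetical names such as `spatially_conditional_conv`); it is meant to show the mechanism, not to reproduce the authors' implementation.

```python
import numpy as np

def conv2d_same(x, w):
    """Stride-1 convolution with zero padding. x: (H, W, Cin), w: (k, k, Cin, Cout), k odd."""
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    H, W, _ = x.shape
    out = np.zeros((H, W, w.shape[-1]))
    for i in range(k):
        for j in range(k):
            out += xp[i:i + H, j:j + W, :] @ w[i, j]
    return out

def spatially_conditional_conv(x, v, candidates):
    """x: (H, W, Cin); v: (H, W, N) semantic vectors; candidates: (N, k, k, Cin, Cout)."""
    out = 0.0
    for n, w in enumerate(candidates):
        # Channel n of v weights the output of candidate n at every pixel.
        out = out + v[..., n:n + 1] * conv2d_same(x, w)
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 6, 4))
cands = rng.normal(size=(3, 3, 3, 4, 2))
v = rng.random(size=(5, 6, 3))
v /= v.sum(axis=-1, keepdims=True)
out = spatially_conditional_conv(x, v, cands)
```

Because convolution is linear in its kernel, blending the candidate outputs is equivalent to applying a per-pixel mixed kernel, while only standard convolutions are ever executed.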

Spatially Conditional Normalization

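Similarly, the spatially conditional normalization employs linearly combined mean and variance for the affine transformation after normalization. Still for $\mathbf{F}^{in}$ and $\mathbf{V}$, suppose the learnable mean and variance candidates are $\{\mathbf{m}_k\}_{k=1}^{N}$ and $\{\mathbf{s}_k\}_{k=1}^{N}$; we yield the normalized output as

$$\mathbf{F}^{out}_{x,y} = \Big(\sum_{k=1}^{N} \mathbf{V}_{x,y}(k)\,\mathbf{s}_k\Big)\,\frac{\mathbf{F}^{in}_{x,y} - \mu(\mathbf{F}^{in})}{\sigma(\mathbf{F}^{in})} + \sum_{k=1}^{N} \mathbf{V}_{x,y}(k)\,\mathbf{m}_k, \tag{5}$$

where $\mu(\cdot)$ and $\sigma(\cdot)$ compute the mean and standard deviation of their input, respectively. Mean and variance candidates are initialized to 0 and 1, respectively.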
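The normalization counterpart can be sketched in the same spirit: normalize the features, then apply a per-pixel affine transform whose scale and shift are blends of shared candidates. The sketch below assumes batch-norm style statistics (per channel, over batch and spatial positions) and uses hypothetical names (`scale_cands` for the variance candidates, `shift_cands` for the mean candidates).

```python
import numpy as np

def spatially_conditional_norm(x, v, scale_cands, shift_cands, eps=1e-5):
    """x: (B, H, W, C); v: (B, H, W, N); scale_cands, shift_cands: (N, C)."""
    # Per-channel statistics over batch and spatial axes (assumed, batch-norm style).
    mean = x.mean(axis=(0, 1, 2), keepdims=True)
    std = np.sqrt(x.var(axis=(0, 1, 2), keepdims=True) + eps)
    x_hat = (x - mean) / std
    # Per-pixel affine parameters as linear combinations of the candidates.
    scale = v @ scale_cands   # (B, H, W, C)
    shift = v @ shift_cands   # (B, H, W, C)
    return scale * x_hat + shift

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(2, 4, 4, 8))
v = rng.random(size=(2, 4, 4, 3))
v /= v.sum(axis=-1, keepdims=True)
scale_cands = np.ones((3, 8))    # variance candidates initialized to 1
shift_cands = np.zeros((3, 8))   # mean candidates initialized to 0
y = spatially_conditional_norm(x, v, scale_cands, shift_cands)
```

With this initialization and vectors that sum to one, the layer starts as plain normalization; the candidates then learn region-dependent scales and shifts.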

Figure 5: Conceptual illustration of spatial conditional computation. Left: spatially conditional convolution (SCC), right: spatially conditional normalization (SCN).


Inheriting from the spatially adaptive processing idea from SPADE [24] in semantic image synthesis, our model generates semantic-aware convolutions and normalization to handle different semantic regions indicated by the input. This idea is also explored in CC-FPSE [21], which employs segmentation masks to parameterize the conditional convolution directly by generating weights. Although it improves the generation quality of SPADE, the computation is expensive, since its independent spatially variant convolutional operations can only be implemented using local linear projection instead of standard convolution.

In contrast, our operators are correlated when they cope with regions with similar appearance and yet with different labels, achieved by the semantic vectors and the shared group of weights. Besides, our SCC utilizes standard convolutions for efficient training and evaluation. As validated in our experiments, our design is beneficial to long-range dependency modeling, consistently improving generation performance.

3.3 Learning Objective

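The optimization goal of our method consists of two parts: style-related losses and a regression loss. The former contains perceptual loss, GAN loss, and feature matching loss. The regression loss on SVG, named semantic vector generation loss, is to learn semantic-aware and appearance-consistent vectors. In general, our optimization target is

$$\mathcal{L} = \lambda_{p}\mathcal{L}_{p} + \lambda_{gan}\mathcal{L}_{gan} + \lambda_{fm}\mathcal{L}_{fm} + \lambda_{s}\mathcal{L}_{s}, \tag{6}$$

where $\mathcal{L}_{p}$, $\mathcal{L}_{gan}$, $\mathcal{L}_{fm}$, and $\mathcal{L}_{s}$ denote the perceptual, GAN, feature matching, and semantic vector generation losses, and $\lambda_{p}$, $\lambda_{gan}$, $\lambda_{fm}$, and $\lambda_{s}$ are trade-off regularization parameters, set to 10, 1, 10, and 2, respectively.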

Perceptual Loss

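We employ the pretrained natural image manifold to constrain the generative space of our model. Specifically, we minimize the discrepancy between the produced images $\hat{\mathbf{I}}$ and their corresponding ground truth $\mathbf{I}$ in the feature space of VGG19 [27] as

$$\mathcal{L}_{p} = \sum_{i} \big\lVert \Phi_i(\hat{\mathbf{I}}) - \Phi_i(\mathbf{I}) \big\rVert, \tag{7}$$

where $\Phi_i$ indicates extracting the feature maps from the $i$-th selected ReLU layer of VGG19.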

GAN Loss

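We train our generator and discriminator using hinge loss. Its learning goal on the generator is

$$\mathcal{L}^{G}_{gan} = -\mathbb{E}_{\mathbf{S}}\big[D(G(\mathbf{S}), \mathbf{S})\big], \tag{8}$$

where $\mathbf{S}$ is the semantic layout, and $G$ and $D$ denote the generator (including SVG and SRG) and discriminator of our SC-GAN, respectively. With $\mathbf{I}$ the ground-truth image of $\mathbf{S}$, the corresponding optimization goal for the discriminator is

$$\mathcal{L}^{D}_{gan} = \mathbb{E}_{(\mathbf{I}, \mathbf{S})}\big[\max(0, 1 - D(\mathbf{I}, \mathbf{S}))\big] + \mathbb{E}_{\mathbf{S}}\big[\max(0, 1 + D(G(\mathbf{S}), \mathbf{S}))\big]. \tag{9}$$

For the discriminator, we directly employ the feature pyramid semantics-embedding discriminator from [21].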
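For reference, the standard hinge formulation of the GAN loss operates directly on raw discriminator scores. The NumPy sketch below shows that standard form; the function names are illustrative, and the scores are assumed to be precomputed by some discriminator.

```python
import numpy as np

def d_hinge_loss(d_real, d_fake):
    """Discriminator hinge loss from raw scores on real and generated samples."""
    real_term = np.mean(np.maximum(0.0, 1.0 - d_real))
    fake_term = np.mean(np.maximum(0.0, 1.0 + d_fake))
    return real_term + fake_term

def g_hinge_loss(d_fake):
    """Generator hinge loss: raise the discriminator score on generated samples."""
    return -np.mean(d_fake)

d_real = np.array([2.0, 0.5])    # toy scores on real images
d_fake = np.array([-2.0, 0.0])   # toy scores on synthesized images
loss_d = d_hinge_loss(d_real, d_fake)
loss_g = g_hinge_loss(d_fake)
```

Scores beyond the margin (above 1 for real, below -1 for fake) contribute nothing to the discriminator loss, so it focuses on samples it has not yet separated confidently.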

Semantic Vector Generation Loss

To prevent SVG from generating trivial semantic vectors for later generation, we regularize its learning by predicting the image corresponding to the fed semantic layout as

$$\mathcal{L}_{s} = \big\lVert \tilde{\mathbf{I}} - \mathbf{I} \big\rVert_{p}, \tag{10}$$

where $\tilde{\mathbf{I}}$ denotes the predicted image from SVG, $\mathbf{I}$ is the corresponding ground-truth image, and $p = 1$ or $2$. Note that we can also conduct such a measure in the perceptual space of a pretrained classification network, as in Eq. (7). This is studied in Section 4.3.

Method          CelebAMask-HQ     Cityscapes             ADE20K                 COCO-Stuff
                FID     LPIPS     mIoU    Acc     FID    mIoU    Acc     FID    mIoU    Acc     FID
CRN [4]         N/A     N/A       52.4    77.1    104.7  22.4    68.8    73.3   23.7    40.4    70.4
SIMS [26]       N/A     N/A       47.2    75.5    49.7   N/A     N/A     N/A    N/A     N/A     N/A
pix2pixHD [31]  54.7    0.529     58.3    81.4    95.0   20.3    69.2    81.8   14.6    45.7    111.5
SPADE [24]      42.2    0.487     62.3    81.9    71.8   38.5    79.9    33.9   37.4    67.9    22.6
CC-FPSE [21]    N/A     N/A       65.6    82.3    54.3   43.7    82.9    31.7   41.6    70.7    19.2
Ours            19.2    0.395     66.9    82.5    49.5   45.2    83.8    29.3   42.0    72.0    18.1
Table 1: Quantitative results on the validation sets from different methods.

3.4 Implementation

We apply spectral normalization (SN) [23] both on the generator and discriminator. Also, a two time-scale update rule [11] is used during training. The learning rates for the generator and discriminator are and , respectively. The training is conducted with Adam [18] optimizer with and . For the used batch normalization, all statistics are synchronized across GPUs. Unless otherwise specified, the used candidate number from Eqs. (4) and (5) of our method is set to 3 in experiments.

4 Experiments

Our experiments are conducted on four face and scene parsing datasets: CelebAMask-HQ [19], Cityscapes [5], ADE20K [45], and COCO-Stuff [1]. In our experiments, images and their corresponding semantic layouts in CelebAMask-HQ, Cityscapes, ADE20K, and COCO-Stuff are resized and cropped into , , , , respectively. The train/val splits follow the setting of these datasets.

For training of our method, we take 100, 200, 200, and 100 epochs on CelebAMask-HQ, Cityscapes, ADE20K, and COCO-Stuff, respectively. The first half of epochs on the first three datasets are with full learning rates and the remaining half linearly decays with learning rates approaching 0. Our computational platform is with 8 NVIDIA 2080Ti GPUs.
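The schedule described above (full learning rate for the first half of training, then a linear decay toward zero) can be written as a small function. This is a sketch: `base_lr` is a placeholder argument, since the actual learning rates are not recoverable from this text, and epochs are assumed to be zero-indexed.

```python
def learning_rate(epoch, total_epochs, base_lr):
    """Constant rate for the first half of training, then linear decay to zero.

    epoch is zero-indexed; base_lr is a placeholder for the full learning rate.
    """
    half = total_epochs // 2
    if epoch < half:
        return base_lr
    # Decays linearly from base_lr at epoch == half to 0 at epoch == total_epochs.
    return base_lr * (total_epochs - epoch) / (total_epochs - half)

# Example with 200 epochs (as used for Cityscapes and ADE20K) and a unit base rate.
schedule = [learning_rate(e, 200, 1.0) for e in range(200)]
```

With a two time-scale update rule, the same function would be called once per network with the generator's and the discriminator's own base rates.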


We take CRN [4], SIMS [26], pix2pixHD [31], SPADE [24], MaskGAN [19], and CC-FPSE [21] as baselines. The following evaluation uses their original implementations and the pretrained models from their official releases. Results of SPADE on CelebAMask-HQ are obtained by training from scratch with its default training setting. CC-FPSE is not applicable here since it consumes more GPU memory on the face dataset than we can afford. The reason is that it realizes conditional convolution with spatially independent local linear processing, leading to considerable computational overhead. Note that MaskGAN is developed for facial image manipulation in style transfer, which exploits both the semantic layout and a reference image. Its qualitative and quantitative evaluation is only for reference because it uses extra input information.

Regarding model complexity, our SC-GAN has the smallest capacity, 66.2M parameters, compared with 183.4M for pix2pixHD, 93.0M for SPADE, and 138.6M for CC-FPSE.

Evaluation Metrics

Following the testing protocol in the semantic image synthesis task [31, 24], we evaluate our method along with baselines from the following perspectives: quantitative performance in Fréchet Inception Distance (FID) [11] and learned perceptual image patch similarity (LPIPS) [44], semantic segmentation, and user study. In semantic segmentation, like the existing work [31, 24], we use mIoU and mAcc (mean pixel accuracy), obtained by running trained segmentation models on the synthesized results, to assess the result quality. DeepLabV2 [3], UperNet101 [37], and DRN-D-105 [40] are employed for COCO-Stuff, ADE20K, and Cityscapes, respectively.
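The segmentation-based metrics can be computed from a confusion matrix between the input layout and the labels a segmentation model predicts on the synthesized image. The sketch below takes both label maps as given and assumes one common definition of mean accuracy (per-class pixel accuracy averaged over classes present in the ground truth); the function names are illustrative and the exact evaluation code of the paper may differ.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes):
    """Rows index ground-truth classes, columns index predicted classes."""
    idx = target.reshape(-1) * num_classes + pred.reshape(-1)
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def miou_and_macc(cm):
    """Mean IoU and mean per-class accuracy over classes present in the ground truth."""
    tp = np.diag(cm).astype(float)
    gt = cm.sum(axis=1)            # pixels per ground-truth class
    pr = cm.sum(axis=0)            # pixels per predicted class
    union = gt + pr - tp
    valid = gt > 0
    iou = tp[valid] / union[valid]
    acc = tp[valid] / gt[valid]
    return iou.mean(), acc.mean()

target = np.array([0, 0, 1, 1])    # toy ground-truth labels (the input layout)
pred = np.array([0, 1, 1, 1])      # toy labels predicted on a synthesized image
miou, macc = miou_and_macc(confusion_matrix(pred, target, 2))
```

A higher mIoU means the synthesized image is easier to parse back into its input layout, which is why it serves as a proxy for semantic alignment.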

Figure 6: Visual comparison on CelebAMask-HQ.
Figure 7: Visual comparison on COCO-stuff and ADE20K.

4.1 Quantitative Comparison

Table 1 indicates that our method yields decent performance, manifesting the effectiveness of our design of generating layout-aligned and appearance-related operators. On CelebAMask-HQ, our proposed SC-GAN improves face synthesis in terms of FID to 19.2, compared with 42.2 for SPADE. Our score is even lower than that of MaskGAN (21.4), which specializes in face manipulation and utilizes ground truth face images for style reference. On the scene datasets, i.e., Cityscapes, ADE20K, and COCO-Stuff, our method works well regarding segmentation and generation evaluation, giving non-trivial improvements compared with baselines, especially on FID.

In the comparison shown in Table 1 among CRN, CC-FPSE, and our SC-GAN, SC-GAN also shows better results in terms of mIoU, Acc, and FID. It proves the necessity to learn the semantic vectors from the segmentation masks in our method, since CRN concatenates segmentation masks in every input stage and CC-FPSE parameterizes convolution by generating weights from segmentation masks.

Methods            CelebAMask-HQ   Cityscapes   ADE20K    COCO-Stuff
Ours vs. SPADE     76.00%          59.50%       66.62%    57.78%
Ours vs. CC-FPSE   N/A             54.12%       60.28%    53.60%
Table 2: User study. Each entry gives the percentage of cases where our results are favored.

User Study

Adhering to the protocol in SPADE [24], we list our user study results in Table 2 to compare our method with SPADE and CC-FPSE. Specifically, the subject judges which synthesizing result looks more natural corresponding to the input semantic layout. In all conditions, results from our model are more preferred by users compared with those from others, especially on CelebAMask-HQ.

4.2 Qualitative Results

Figures 6 and 7 give the visual comparison of our method and other baselines. Our method generates natural results with less noticeable visual artifacts and more appearing details. Note that the skin of persons by our method is more photorealistic. Moreover, due to the effectiveness of semantic vectors, our method yields more consistent eye regions in Figure 6, and even creates intriguing reflections on the water in the bottom row of Figure 7. More visual results are given in the supplementary material.

Figure 8: Multi-modal predictions (top row) and interpolation (bottom row) of our method on CelebAMask-HQ.

Figure 9: Unpaired image-to-image translation results from our model on summer→winter. The source images, their reference ones (target images), and their corresponding translated results are marked by blue, purple, and black rectangles, respectively.

Multi-modal Outputs and Interpolation

Our method synthesizes multi-modal results from the same semantic layout using an additional encoder trained in the VAE manner. With different sampled random noise, our method gives diverse outputs in terms of appearance (top row in Figure 8). Also, linearly interpolating between these random vectors yields a smooth transition from a bearded face to a beardless one (bottom row in Figure 8). It validates the effectiveness of the learned manifold in our model.
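Linear interpolation between two sampled noise vectors, as used for the transition in Figure 8, amounts to the following. This is only a sketch of the interpolation step: the latent dimensionality of 256 is an arbitrary placeholder, and feeding each interpolated vector to the generator is left out.

```python
import numpy as np

def interpolate_latents(z0, z1, steps):
    """Return `steps` vectors moving linearly from z0 to z1 (endpoints included)."""
    ts = np.linspace(0.0, 1.0, steps)
    return np.stack([(1.0 - t) * z0 + t * z1 for t in ts])

rng = np.random.default_rng(0)
z0 = rng.standard_normal(256)   # placeholder latent size
z1 = rng.standard_normal(256)
path = interpolate_latents(z0, z1, steps=8)   # shape (8, 256)
```

Each row of `path` would be rendered with the same semantic layout, so only the appearance changes along the sequence.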

Op     Conv+BN   Conv+SCN   SCC+BN   SCC+SCN
mIoU   62.6      63.1       66.3     66.9
FID    69.7      60.9       54.0     49.5
Table 3: Impact of different operators (Op) on the generative performance on Cityscapes.

4.3 Ablation Studies

We ablate the key design of our method on Cityscapes.

SCC vs. Standard Conv

We construct three baselines, where SRG employs conventional convolutions (Conv) and batch normalizations (BN), Conv and SCN, and SCC and BN, respectively, while SVG remains intact. For baselines using Conv, we triple the number of Convs in SRG compared to SCC for fair comparison, as in Eq. (4). We also feed the semantic vectors as extra input (concatenated to the input feature maps) to every Conv, to see how the way semantic vectors are used affects the performance.

From Table 3, we notice that SCC boosts Cityscapes synthesis over Conv in terms of mIoU and FID (mIoU: 62.6 → 66.3, FID: 69.7 → 54.0 with BN, and mIoU: 63.1 → 66.9, FID: 60.9 → 49.5 with SCN). It manifests the effectiveness of SCC. Exploiting spatially variant features (the semantic vectors) by the proposed dynamic operators is more beneficial to the results than by static ones in this task, with a slimmer capacity (66.2M vs. 67.3M for Conv+BN).

SCN vs. BN vs. SPADE

The design of SCN is also validated in Table 3. SCN mainly improves the generation quality (FID: 69.7 → 60.9 with Conv, and FID: 54.0 → 49.5 with SCC). Its influence on the semantic alignment (mIoU) is relatively small. Also, Conv+SCN gives better mIoU and FID compared with SPADE (Table 1), further validating the importance of SCN.

w/o w/ () w/ () w/ (vgg)
mIoU 60.3 66.9 63.4 64.2
FID 82.5 49.5 53.3 51.8
Table 4: Impact of semantic vector generation loss in Eq. (10) about the generative performance on Cityscapes.

Semantic Vector Generation Loss

As mentioned in Section 3.1, the semantic vector generation loss improves the relationship between different semantics according to their appearance. As shown in Table 4, without this loss (Eq. (10)), the quantitative performance degrades notably on both object alignment (mIoU: 66.9 → 60.3) and generation (FID: 49.5 → 82.5). Also, the norm adopted by default in Eq. (10) (66.9 mIoU, 49.5 FID) is better than the alternative norm or the perceptual loss considering both alignment and generation.

mIoU 67.9 67.3 64.8 61.2
Table 5: Generative performance on Cityscapes regarding the number of the shared weight candidates in SCC.

Number of Shared Weight Candidates in SRG

We reduce the number of conv weight candidates in SRG while preserving that of norm candidates (fixed to 4). The corresponding segmentation results are shown in Table 5. As the number of conv weight candidates increases, the semantic alignment becomes better.

Nonlinear Sigmoid Tanh ReLU None Softmax
mIoU 62.6 63.1 61.7 60.2 66.9
FID 57.6 56.8 66.5 73.9 49.5
Table 6: Different nonlinear functions affect the generative performance on Cityscapes.

Transformation of Semantic Vector Computation

We evaluate different nonlinear functions for semantic vector computation, as given in Table 6. Note that applying a nonlinear function to the semantic vectors is vital, as mIoU and FID scores become worse without it (mIoU: 60.2, FID: 73.9), and using softmax yields the best quantitative generation performance. In our model, bounded activation functions work better than unbounded ones (sigmoid, tanh, softmax vs. ReLU), and the normalized one performs better than the unnormalized setting (softmax vs. sigmoid and tanh).

With A New Discriminator and Training Tricks

We note that new efforts [28] were made to enhance synthesis results by using a more effective discriminator and training approaches. Incorporating these techniques into our method could further boost our generation performance, e.g., mIoU: 69.9, FID: 47.2 on Cityscapes, and mIoU: 49.1, FID: 27.6 on ADE20K, as given in Section 2.1 of the supp. material.

Methods MUNIT [13] DMIT [42] Ours
FID 118.225 87.969 82.882
IS 2.537 2.884 3.183
Table 7: Unpaired image-to-image translation evaluation on summer-to-winter dataset.

4.4 Unpaired Image-to-Image Translation

Due to our semantic encoding and stylization design, our model can also be applied to unpaired image-to-image (i2i) translation with minor modification. This modified framework is given in the supp. material. Table 7 shows the quantitative evaluation of our model along with unpaired i2i baselines MUNIT [13] and DMIT [42] on the Yosemite summer-to-winter dataset. The better FID and IS of our model demonstrate its generality. Visual results are given in Figure 9.

5 Concluding Remarks

In this paper, we have presented a new method to represent visual content in a semantic encoding and stylization manner for generative image synthesis. Our method introduces appearance similarity to semantic-aware operations, proposing a novel spatially conditional processing for both convolution and normalization. It outperforms the compared popular synthesis baselines on several benchmark datasets both qualitatively and quantitatively.

This new representation is also beneficial to unpaired image-to-image translation. We will study its applicability to other generation tasks in the future.


  • [1] H. Caesar, J. Uijlings, and V. Ferrari (2018) Coco-stuff: thing and stuff classes in context. In CVPR, pp. 1209–1218. Cited by: §3.1, §4.
  • [2] J. Chen, X. Wang, Z. Guo, X. Zhang, and J. Sun (2020) Dynamic region-aware convolution. arXiv preprint arXiv:2003.12243. Cited by: §2.2.
  • [3] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI 40 (4), pp. 834–848. Cited by: §4.
  • [4] Q. Chen and V. Koltun (2017) Photographic image synthesis with cascaded refinement networks. In ICCV, pp. 1511–1520. Cited by: §1, §2.1, §2.1, §3.1, Table 1, §4.
  • [5] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In CVPR, pp. 3213–3223. Cited by: §4.
  • [6] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei (2017) Deformable convolutional networks. In ICCV, pp. 764–773. Cited by: §1.
  • [7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In NeurIPS, pp. 2672–2680. Cited by: §1.
  • [8] D. Ha, A. Dai, and Q. V. Le (2016) Hypernetworks. arXiv preprint arXiv:1609.09106. Cited by: §2.2.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §3.2.
  • [10] A. Hertzmann, C. E. Jacobs, N. Oliver, B. Curless, and D. H. Salesin (2001) Image analogies. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pp. 327–340. Cited by: §2.1.
  • [11] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS, pp. 6626–6637. Cited by: §3.4, §4.
  • [12] X. Hu, H. Mu, X. Zhang, Z. Wang, T. Tan, and J. Sun (2019) Meta-sr: a magnification-arbitrary network for super-resolution. In CVPR, pp. 1575–1584. Cited by: §2.2.
  • [13] X. Huang, M. Liu, S. Belongie, and J. Kautz (2018) Multimodal unsupervised image-to-image translation. In ECCV, Cited by: §4.4, Table 7.
  • [14] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In CVPR, pp. 1125–1134. Cited by: §1, §2.1.
  • [15] X. Jia, B. De Brabandere, T. Tuytelaars, and L. V. Gool (2016) Dynamic filter networks. In NeurIPS, pp. 667–675. Cited by: §2.2.
  • [16] L. Jiang, C. Zhang, M. Huang, C. Liu, J. Shi, and C. C. Loy (2020) TSIT: a simple and versatile framework for image-to-image translation. In ECCV, Cited by: §1, §2.1.
  • [17] Y. Jo, S. W. Oh, J. Kang, and S. J. Kim (2018) Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation. In CVPR, Cited by: §2.2.
  • [18] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.4.
  • [19] C. Lee, Z. Liu, L. Wu, and P. Luo (2020) Maskgan: towards diverse and interactive facial image manipulation. In CVPR, pp. 5549–5558. Cited by: §1, §4, §4.
  • [20] K. Li, T. Zhang, and J. Malik (2019) Diverse image synthesis from semantic layouts via conditional imle. In ICCV, pp. 4220–4229. Cited by: §2.1.
  • [21] X. Liu, G. Yin, J. Shao, X. Wang, et al. (2019) Learning to predict layout-to-image conditional convolutions for semantic image synthesis. In NeurIPS, pp. 570–580. Cited by: Table 8, §1, §1, §2.1, §2.1, §2.2, §3.2, §3.3, Table 1, §4.
  • [22] M. Mirza and S. Osindero (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. Cited by: §2.1.
  • [23] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018) Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957. Cited by: §3.4.
  • [24] T. Park, M. Liu, T. Wang, and J. Zhu (2019) Semantic image synthesis with spatially-adaptive normalization. In CVPR, pp. 2337–2346. Cited by: Table 8, §1, §1, §2.1, §2.1, §2.2, §3.2, Table 1, §4, §4, §4.1.
  • [25] L. Qi, X. Zhang, Y. Chen, Y. Chen, J. Sun, and J. Jia (2020) PointINS: point-based instance segmentation. arXiv preprint arXiv:2003.06148. Cited by: §2.2.
  • [26] X. Qi, Q. Chen, J. Jia, and V. Koltun (2018) Semi-parametric image synthesis. In CVPR, pp. 8808–8816. Cited by: §1, §2.1, §2.1, Table 1, §4.
  • [27] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §3.3.
  • [28] V. Sushko, E. Schönfeld, D. Zhang, J. Gall, B. Schiele, and A. Khoreva (2020) You only need adversarial supervision for semantic image synthesis. arXiv preprint arXiv:2012.04781. Cited by: §B.1, Table 8, §2.1, §4.3.
  • [29] H. Tang, D. Xu, N. Sebe, Y. Wang, J. J. Corso, and Y. Yan (2019) Multi-channel attention selection gan with cascaded semantic guidance for cross-view image translation. In CVPR, pp. 2417–2426. Cited by: §1, §2.1.
  • [30] H. Tang, D. Xu, Y. Yan, P. H. Torr, and N. Sebe (2020) Local class-specific and global image-level generative adversarial networks for semantic-guided scene generation. In CVPR, pp. 7870–7879. Cited by: §1, §2.1.
  • [31] T. Wang, M. Liu, J. Zhu, A. Tao, J. Kautz, and B. Catanzaro (2018) High-resolution image synthesis and semantic manipulation with conditional gans. In CVPR, pp. 8798–8807. Cited by: §1, §2.1, §2.1, Table 1, §4, §4.
  • [32] X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In CVPR, pp. 7794–7803. Cited by: §1.
  • [33] Y. Wang, Y. Chen, X. Tao, and J. Jia (2020) VCNet: a robust approach to blind image inpainting. In ECCV, pp. 752–768. Cited by: §2.1.
  • [34] Y. Wang, Y. Chen, X. Zhang, J. Sun, and J. Jia (2020) Attentive normalization for conditional image generation. In CVPR, pp. 5094–5103. Cited by: §1.
  • [35] Y. Wang, X. Tao, X. Qi, X. Shen, and J. Jia (2018) Image inpainting via generative multi-column convolutional neural networks. In NeurIPS, Cited by: §2.1.
  • [36] Y. Wang, X. Tao, X. Shen, and J. Jia (2019) Wide-context semantic image extrapolation. In CVPR, Cited by: §2.1.
  • [37] T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun (2018) Unified perceptual parsing for scene understanding. In ECCV, pp. 418–434. Cited by: §4.
  • [38] B. Yang, G. Bender, Q. V. Le, and J. Ngiam (2019) Condconv: conditionally parameterized convolutions for efficient inference. In NeurIPS, pp. 1307–1318. Cited by: §1, §2.2.
  • [39] T. Yang, X. Zhang, Z. Li, W. Zhang, and J. Sun (2018) Metaanchor: learning to detect objects with customized anchors. In NeurIPS, pp. 320–330. Cited by: §2.2.
  • [40] F. Yu, V. Koltun, and T. Funkhouser (2017) Dilated residual networks. In CVPR, pp. 472–480. Cited by: §4.
  • [41] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang (2018) Generative image inpainting with contextual attention. arXiv preprint arXiv:1801.07892. Cited by: §1.
  • [42] X. Yu, Y. Chen, S. Liu, T. Li, and G. Li (2019) Multi-mapping image-to-image translation via learning disentanglement. In NeurIPS, Cited by: §4.4, Table 7.
  • [43] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena (2018) Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318. Cited by: §1.
  • [44] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, pp. 586–595. Cited by: §4.
  • [45] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba (2017) Scene parsing through ade20k dataset. In CVPR, pp. 633–641. Cited by: §4.
  • [46] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, pp. 2223–2232. Cited by: §B.2, §1.
  • [47] P. Zhu, R. Abdal, Y. Qin, and P. Wonka (2020) Sean: image synthesis with semantic region-adaptive normalization. In CVPR, pp. 5104–5113. Cited by: §2.1.


In this supplementary file, our descriptions contain the following components:

  • Detailed configuration of our proposed SC-GAN and the implementation of the given spatially conditional convolution and normalization.

  • The synthesis performance of our generator trained with a more effective discriminator and other tricks.

  • The specification of how to apply our framework to unpaired image-to-image translation and the corresponding visual results.

  • The limitations and some failure cases of our method.

Appendix A Network Architectures

SC-GAN consists of a Semantic Vector Generator (SVG) and a Semantic Render Generator (SRG). Their detailed designs are given below. For convenience, Conv(k, s, c) denotes a convolution with kernel size k, stride s, and c output channels. The dilation rate and padding size of Conv(k, s, c) are both set to 1.
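As a quick check of what the Conv(k, s, c) notation implies for feature-map size, the following sketch applies the standard convolution output-size formula with the padding and dilation stated above. The function name `conv_output_size` is hypothetical and is not part of the paper.

```python
def conv_output_size(size, k, s, padding=1, dilation=1):
    """Output length along one spatial axis for a convolution.

    Standard formula: floor((size + 2*padding - dilation*(k - 1) - 1) / s) + 1.
    Defaults follow the text: padding = 1 and dilation = 1.
    """
    return (size + 2 * padding - dilation * (k - 1) - 1) // s + 1


# Conv(3, 1, c): a 3x3 kernel with stride 1 keeps the spatial size unchanged.
same = conv_output_size(256, k=3, s=1)

# A stride of 2 with the same kernel would halve the size.
halved = conv_output_size(256, k=3, s=2)
```

Under these settings every Conv(3, 1, c) layer preserves resolution, so any change in resolution in the two generators has to come from the upsampling operations rather than from the convolutions.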

and denote bilinear upsampling and concatenation (along the channel dimension), respectively. Besides, SCResBlock(k, s, c) denotes a residual block variant using spatially conditional convolution (SCC) and normalization (SCN). Its schematic illustration is given in Figure 4 of our paper. We use to indicate an extra input of the current operation . For example, SCResBlock(3,1,512) indicates that the semantic vectors (in feature-map form) are also fed into SCResBlock(3,1,512) to generate dynamic operators.


: Conv(3,1,512) LReLU Conv(3,1,512) LReLU Conv (3,1,256) LReLU Conv(3,1,256) LReLU Conv(3,1,128) LReLU Conv(3,1,128) LReLU Conv(3,1,64) LReLU Conv(3,1,64) LReLU Conv(3,1,32) LReLU Conv(3,1,32) LReLU Conv(3,1,32) LReLU Conv(3,1,32) LReLU Conv(3,1,16) LReLU Conv(3,1,3) Hardtanh ,
where indicates the input segmentation mask.


: SCResBlock(3,1,512) [+] SCResBlock(3,1,512) [+] SCResBlock(3,1,512) [+] SCResBlock(3,1,256) [+] SCResBlock(3,1,128) [+] SCResBlock(3,1,64) [+] SCResBlock(3,1,32) [+] Conv (k3c3) Hardtanh ,
where is sampled from a standard normal distribution, from SVG, and denotes the adaptive pooling operation (in channel dimension).

Figure 10: The framework of SC-GAN for unpaired image-to-image translation.

A.1 Implementation of Key Components

Pseudocode for the proposed spatially conditional convolution (SCC) and normalization (SCN) is given in Algorithms 1 and 2, respectively. In both, the region-wise parameterized weights are generated by combining a group of learnable candidates, and how the candidates are combined is determined by the input semantic vectors.

1:Input feature maps and the generated semantic vectors , and learnable parameter candidates where , and , , and denote the input, output channel number, and kernel size.
2:The convolved feature maps .
3: # shape:
5:for  to  do
7:end for
8: # shape:
10: # shape:
Algorithm 1 The pseudo code of SCC (PyTorch style)
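The idea behind SCC described above, combining a group of candidate kernels with weights read from the semantic vectors at each position, can be sketched as follows. This is a minimal illustration, not the authors' implementation. It uses NumPy instead of PyTorch, and it assumes stride 1, an odd kernel size, one semantic-vector channel per candidate, and a softmax to turn semantic vectors into combination weights. The names `conv2d`, `softmax`, and `scc` are hypothetical.

```python
import numpy as np


def conv2d(x, w):
    """Stride-1 'same' convolution. x: (C_in, H, W), w: (C_out, C_in, k, k)."""
    c_out, _, k, _ = w.shape
    p = k // 2
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros((c_out, h, wd))
    for dy in range(k):
        for dx in range(k):
            patch = xp[:, dy:dy + h, dx:dx + wd]  # shifted view, (C_in, H, W)
            out += np.einsum('oc,chw->ohw', w[:, :, dy, dx], patch)
    return out


def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def scc(feat, sem, candidates):
    """Spatially conditional convolution (sketch).

    feat:       (C_in, H, W) input feature maps
    sem:        (N, H, W) semantic vectors, one channel per candidate (assumed)
    candidates: (N, C_out, C_in, k, k) learnable candidate kernels
    """
    alpha = softmax(sem, axis=0)  # per-position combination weights, (N, H, W)
    out = 0
    for i in range(candidates.shape[0]):
        # Convolution is linear in the kernel, so weighting each candidate's
        # output equals convolving with the per-position combined kernel.
        out = out + alpha[i][None] * conv2d(feat, candidates[i])
    return out  # (C_out, H, W)
```

Looping over candidates avoids materializing a separate kernel for every pixel. With N candidates the cost is N ordinary convolutions plus a weighted sum.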

1:Input feature maps and the generated semantic vectors , and learnable parameter candidates , where .
2:The normalized feature maps.
3: # shape:
4: # shape:
5: # shape:
Algorithm 2 The pseudo code of SCN (PyTorch style)
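SCN can be sketched the same way: normalize the features, then scale and shift them with per-position parameters combined from a group of candidates. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation. Per-channel statistics over the spatial dimensions of a single sample, the softmax weighting, and the function name `scn` are all assumptions.

```python
import numpy as np


def scn(feat, sem, gammas, betas, eps=1e-5):
    """Spatially conditional normalization (sketch).

    feat:   (C, H, W) input feature maps
    sem:    (N, H, W) semantic vectors, one channel per candidate (assumed)
    gammas: (N, C) candidate scales
    betas:  (N, C) candidate shifts
    """
    # Per-position combination weights over the N candidates (softmax assumed).
    z = sem - sem.max(axis=0, keepdims=True)
    e = np.exp(z)
    alpha = e / e.sum(axis=0, keepdims=True)  # (N, H, W)

    # Parameter-free normalization with per-channel spatial statistics.
    mean = feat.mean(axis=(1, 2), keepdims=True)
    var = feat.var(axis=(1, 2), keepdims=True)
    normed = (feat - mean) / np.sqrt(var + eps)

    # Combine candidates into spatially varying scale and shift, (C, H, W).
    gamma = np.einsum('nhw,nc->chw', alpha, gammas)
    beta = np.einsum('nhw,nc->chw', alpha, betas)
    return gamma * normed + beta
```

Positions with similar semantic vectors receive similar scale and shift, which is how the modulation can tie together regions that look alike while still varying across the image.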

Appendix B More Experimental Results and Analysis

Method            Cityscapes          ADE20K
                  mIoU ↑    FID ↓     mIoU ↑    FID ↓
SPADE [24]        62.3      71.8      38.5      33.9
CC-FPSE [21]      65.6      54.3      43.7      31.7
OASIS [28]        69.3      47.7      48.8      28.3
Ours              66.9      49.5      45.2      29.3
Ours w/ OASIS     69.9      47.2      49.1      27.6
Table 8: Quantitative results of different methods on the validation sets of Cityscapes and ADE20K. Ours w/ OASIS denotes our generator trained with the discriminator and other techniques from OASIS [28].

B.1 Performance with a Stronger Discriminator and Training Tricks

We evaluate the compatibility of our proposed generator with a recently introduced discriminator [28] and several effective training techniques. Sushko et al. [28] presented OASIS, a powerful discriminator that exploits semantic layouts through a semantic segmentation loss. Their approach further strengthens GAN training by 1) balancing class weights according to class frequencies, 2) removing the VGG loss, and 3) merging generator models with an exponential moving average (EMA). During training, they mask generated regions with a random class and add 3D random noise to all input segmentation masks to enhance local detail synthesis. Integrating their techniques (except the 3D random noise) further improves the performance of our model, as shown in Table 8. Compared with OASIS [28], our proposed generator yields better quantitative results with a smaller capacity (66.2M (ours) vs. 94M (OASIS)).
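Two of the techniques listed above, EMA model merging and frequency-based class balancing, are simple enough to sketch in plain Python. The decay value, the inverse-frequency formula, and its normalization to a mean weight of 1 are assumptions chosen for illustration; the exact settings used by OASIS are not given here. Both function names are hypothetical.

```python
def ema_update(ema_params, params, decay=0.999):
    """One EMA step over generator parameters (dicts of name -> float).

    ema <- decay * ema + (1 - decay) * current. The decay is an assumed value.
    """
    return {
        name: decay * ema_params[name] + (1.0 - decay) * params[name]
        for name in params
    }


def inverse_frequency_weights(pixel_counts):
    """Per-class loss weights inversely proportional to class frequency.

    pixel_counts maps class name -> number of labelled pixels. Classes with
    zero pixels are skipped. Weights are rescaled so their mean is 1.
    """
    total = sum(pixel_counts.values())
    raw = {c: total / n for c, n in pixel_counts.items() if n > 0}
    scale = len(raw) / sum(raw.values())
    return {c: w * scale for c, w in raw.items()}


# Rare classes receive larger weights than frequent ones.
weights = inverse_frequency_weights({"road": 900, "sign": 100})

# The EMA copy moves only slightly toward the current parameters each step.
ema = ema_update({"w": 1.0}, {"w": 0.0}, decay=0.9)
```

In practice the EMA copy of the generator is the one used at evaluation time, while the raw parameters continue to be updated by the optimizer.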

B.2 Unpaired Image-to-image Translation

As claimed in the paper, our method also applies to unpaired image-to-image translation with minor modifications. The corresponding framework is presented in Figure 10. Specifically, the input of SRG is changed from random noise to a downsampled image from the source domain. The output of SRG should then stay close to its input in semantic layout (enforced with a perceptual loss) while resembling images from the target domain in style (texture, detail, etc.). The input of SVG is set to an image from the target domain, and SVG still regresses to its input image like an autoencoder. Figure 11 gives visual results from our model on the summerwinter dataset [46], where our method alters the style of the source image by changing its color and texture according to the reference.

B.3 Limitations and Failure Cases

Although our design enhances generation performance by explicitly learning the relations between different semantics, it may fail to fully recover the intrinsic geometry of the original image when the given segmentation map lacks such cues (Figure 12). This is a common issue in many generative models, usually addressed by introducing extra modalities (e.g., depth) or multi-view data.

Moreover, explicitly modeling the relations between semantics by their appearance may create undesired objects or stuff in the target semantic region, as shown in Figure 12. Though the details in building regions are vivid, unexpected plants are synthesized in these areas. We suppose this happens because our model learns the co-occurrence bias between these two kinds of stuff from the training data. Using segmentation masks to further constrain the semantic vectors (e.g., adding a semantic segmentation loss to SVG) may address this issue. We will study it in future work.

Figure 11: Unpaired image-to-image translation results from our model on summerwinter. The source images, their reference ones (target images), and their corresponding translated results are marked by blue, purple, and black rectangles, respectively.
Figure 12: Failure cases. Top row: failing to recover dimension and depth in a facade. Bottom row: introducing undesired objects in the given semantic regions.