Efficient Semantic Image Synthesis via Class-Adaptive Normalization (TPAMI 2021)
Spatially-adaptive normalization (SPADE) has recently been remarkably successful in conditional semantic image synthesis <cit.>. It modulates the normalized activation with spatially-varying transformations learned from semantic layouts, to prevent the semantic information from being washed away. Despite its impressive performance, a more thorough understanding of where its advantages come from is still needed to help reduce the significant computation and parameter overhead introduced by this structure. In this paper, from a return-on-investment point of view, we conduct an in-depth analysis of the effectiveness of spatially-adaptive normalization and observe that its modulation parameters benefit more from semantic-awareness than from spatial-adaptiveness, especially for high-resolution input masks. Inspired by this observation, we propose class-adaptive normalization (CLADE), a lightweight but equally effective variant that is adaptive only to the semantic class. To further improve spatial-adaptiveness, we introduce an intra-class positional map encoding calculated from semantic layouts to modulate the normalization parameters of CLADE, and propose a truly spatially-adaptive variant of CLADE, namely CLADE-ICPE. Through extensive experiments on multiple challenging datasets, we demonstrate that the proposed CLADE generalizes to different SPADE-based methods and achieves generation quality comparable to SPADE, while being much more efficient, with fewer extra parameters and lower computational cost. The code and pretrained models are available at <https://github.com/tzt101/CLADE.git>.
Image synthesis has made great progress recently thanks to the advances of deep generative models. The latest successes, such as StyleGAN [23, 24], are already capable of producing highly realistic images from random latent codes. Yet conditional image synthesis, the task of generating photo-realistic images conditioned on some input data, is still very challenging. In this work, we focus on semantic image synthesis, a specific conditional image generation task that aims at converting a semantic segmentation mask into a photo-realistic image.
To tackle this problem, some previous methods [19, 44] directly feed the semantic segmentation mask to a conventional deep network built by stacking convolution, normalization, and nonlinearity layers. However, as pointed out by Park et al., common normalization layers such as instance normalization tend to wash away the semantic information, especially for flat segmentation masks. To compensate for this information loss, they propose a spatially-adaptive normalization, SPADE, which modulates the normalized activation in a spatially-adaptive manner, conditioned on the input segmentation mask. By replacing all the common normalization layers with SPADE blocks, the semantic information can be propagated throughout the network, which improves both visual fidelity and spatial alignment.
Despite the effectiveness of spatially-adaptive normalization, its advantages have not been fully uncovered yet. Is spatial-adaptiveness the sole or main reason for its superior performance? Does there exist a better design that can improve efficiency without compromising the resulting quality? In this paper, we try to answer these questions by analyzing it in depth. Our key observation is that semantic-awareness may actually contribute much more than spatial-adaptiveness. In fact, since the two-layer modulation network used to regress the transformation parameters is so shallow, the resulting denormalization parameters are almost spatially invariant within regions of the same semantic class, especially for high-resolution input masks. Meanwhile, given that a SPADE block is placed before almost every convolutional layer, such redundancy recurs many times in the generation pass, which can easily lead to a large amount of unnecessary computation and parameter overhead.
Motivated by this observation, we propose a novel normalization layer, namely CLass-Adaptive (DE)normalization (CLADE). Different from the spatially adaptive solution of SPADE, CLADE instead uses the input semantic mask to modulate the normalized activation in a class-adaptive manner. Specifically, CLADE is only adaptive to different semantic classes to maintain the crucial semantic-awareness property, independent of the spatial position, semantic shape, or layout of the semantic mask. Thanks to this lightweight design, CLADE is surprisingly simple to implement and requires no extra modulation network. Therefore, its computation and parameter overhead is almost negligible compared with SPADE, making it a better alternative to those conventional normalization layers. Take the generator for the ADE20k dataset  as an example, the extra parameter and computation cost introduced by CLADE is only and while that of SPADE is and respectively.
Although CLADE greatly reduces the computational overhead and achieves excellent performance, we believe that spatial-adaptiveness could still be beneficial to semantic synthesis. To enhance the spatial-adaptiveness expected of SPADE, we further propose to utilize an extra positional encoding map representing the intra-class spatial variance, which defines the normalized relative distance from each pixel to its semantic object center. This positional encoding is then integrated into the CLADE modulation parameters and makes them spatially adaptive within regions of the same semantic class. This can be viewed as a spatially-adaptive variant of CLADE, namely CLADE-ICPE.
To demonstrate the effectiveness and efficiency of CLADE, we conduct extensive experiments on multiple challenging datasets, including Cityscapes , COCO-Stuff , and ADE20k (including ADE20k-outdoor) . Without bells and whistles, just by replacing all the SPADE layers with CLADE, comparable performance can be achieved with much smaller model size and much lower computation cost. Some visual results are given in Figure 1.
In recent years, image synthesis has achieved significant progress thanks to the emergence of generative adversarial networks (GANs). This adversarial training strategy enables the generator network to synthesize images with semantic meaning from random noise. Starting from the earliest GAN work, many follow-up works have approached the problem from different aspects. For example, to make the network training more stable, some works [1, 39, 32] propose improvements based on the loss functions. DCGAN proposes a set of constraints on the architectural topology of convolutional GANs that make them stable to train in most settings. For higher resolution and quality, ProgressiveGAN designs a training strategy to gradually synthesize high-resolution images. BigGAN proposes to train the network on a large-scale image dataset to improve the capability of the generator. The recent works [23, 24] have not only pursued realistic image synthesis, but also attempted to improve the control of the synthesized image through the exploration of the latent code. Different from these works, we are more interested in controlling the synthesized image in a more intuitive way, by using additional conditional inputs to control the synthesis results.
Instead of generation from a random noise, conditional image synthesis refers to the task of generating photo-realistic images conditioned on the input such as texts [14, 37, 47, 49] and images [17, 19, 29, 53, 33, 34]. Our work focuses on a special form of conditional image synthesis that aims at generating photo-realistic images conditioned on input segmentation masks, which is called semantic image synthesis.
For this task, many impressive works have been proposed in the past several years. One of the most representative works is pix2pix, which proposes a unified image-to-image translation framework based on the conditional generative adversarial network. To further improve its quality or enable more functionality, many following works have appeared, such as pix2pixHD , SIMS , and SPADE . SPADE proposes a spatial-varying normalization layer for the first time and has a profound impact as a basic backbone. Many recent works for different downstream tasks have used this architecture, such as semantic image synthesis [10, 55, 51], portrait synthesis or editing [54, 42] and semantic view synthesis . Other works [20, 50], although not using SPADE directly, are inspired by it to introduce spatial-adaptiveness into normalization layers. It is precisely because of the success of SPADE that we conduct an in-depth analysis of its superiority and propose a new efficient and effective normalization layer.
In the deep learning era, normalization layers play a vital role in achieving better convergence and performance, especially for deep networks. They follow a similar operating logic, which first normalizes the input features into zero mean and unit deviation, and then modulates the normalized features with learnable modulation scale/shift parameters.
Existing normalization layers can be generally divided into two different types: unconditional and conditional. Typical unconditional normalization layers include Batch Normalization (BN), Instance Normalization (IN) , Group Normalization (GN)  and Positional Normalization (PONO) . Compared to unconditional normalization, the behavior of conditional normalization is not static and depends on the external input. Conditional Instance Normalization (Conditional IN)  and Adaptive Instance Normalization (AdaIN)  are two popular conditional normalization layers originally designed for style transfer. To transfer the style from one image to another, they model the style information into the modulation scale/shift parameters.
For semantic image synthesis, most previous works  just leveraged unconditional normalization layers BN or IN in their networks. Recently, Park et al.  point out that common normalization layers used in the existing methods tend to “wash away” semantic information when applied to flat segmentation masks. To compensate for the missing information, they innovatively propose a new spatially-adaptive normalization layer named SPADE. Different from common normalization layers, SPADE puts the semantic information back by making the modulation parameters be the function of semantic mask in a spatially-adaptive way. Based on our analysis and observation that the semantic-awareness is the possible essential property leading to the superior performance of SPADE rather than the spatially-adaptiveness, we propose CLADE, a normalization layer that can achieve comparable performance as SPADE but with negligible cost.
Conditioned on a semantic segmentation map $m \in \mathbb{L}^{H \times W}$, semantic image synthesis aims at generating a corresponding high-quality realistic image $I \in \mathbb{R}^{3 \times H \times W}$. Here, $\mathbb{L}$ is the set of class integers that denote different semantic categories, and $H$ and $W$ are the target image height and width.
Most vanilla synthesis networks, like pix2pix and pix2pixHD, adopt a similar network structure that stacks repeated blocks of convolutional, normalization, and nonlinearity layers. Among them, normalization layers are essential for better convergence and performance. They can be generally formulated as:

$$x^{out}_{i,j,k} = \gamma_{i,j,k} \cdot \frac{x^{in}_{i,j,k} - \mu_{i,j,k}}{\sigma_{i,j,k}} + \beta_{i,j,k}, \qquad (1)$$

with the indices of width, height, and channel denoted as $i, j, k$. In what follows, for simplicity of notation, these subscripts will be omitted if the variable is independent of them. Specifically, the input feature $x^{in}$ is first normalized with the mean $\mu$ and standard deviation $\sigma$ (normalization step), and then modulated with the learned scale $\gamma$ and shift $\beta$ (modulation step). For most common normalization layers such as BN and IN, all four parameters are calculated in a channel-wise manner (independent of $i, j$), with the modulating parameters $\gamma$ and $\beta$ independent of $x^{in}$.
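The two steps above (normalization, then modulation) can be illustrated with a minimal NumPy sketch. It assumes instance-normalization statistics over a single `(C, H, W)` feature map; the function name, the `eps` constant, and the toy inputs are illustrative choices, not part of the original method.

```python
import numpy as np

def instance_norm_modulate(x, gamma, beta, eps=1e-5):
    """Normalize x of shape (C, H, W) per channel, then apply scale and shift.

    gamma and beta have shape (C,): one scale/shift per channel, identical at
    every spatial position, as in unconditional layers such as IN.
    """
    mu = x.mean(axis=(1, 2), keepdims=True)      # per-channel mean
    sigma = x.std(axis=(1, 2), keepdims=True)    # per-channel standard deviation
    x_hat = (x - mu) / (sigma + eps)             # normalization step
    return gamma[:, None, None] * x_hat + beta[:, None, None]  # modulation step

# Toy input (hypothetical values): 4 channels, 8x8 spatial size.
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 8, 8))
y = instance_norm_modulate(x, gamma=np.full(4, 2.0), beta=np.full(4, 0.5))
```

After the call, each channel of `y` has a mean close to the shift and a standard deviation close to the scale, regardless of the statistics of `x`. This is why any information carried only by the input statistics is lost unless the modulation parameters restore it.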
As pointed out by Park et al., one common issue of the aforementioned normalization layers is that they tend to wash away the semantic information on flat segmentation masks in image synthesis. Motivated by this observation, they propose a new spatially-adaptive normalization layer, namely SPADE. By making the modulation parameters $\gamma$ and $\beta$ functions of the input mask $m$, i.e., $\gamma_{i,j,k}(m)$ and $\beta_{i,j,k}(m)$, the semantic information that is lost after the normalization step is added back during the modulation step. The functions $\gamma_{i,j,k}(\cdot)$ and $\beta_{i,j,k}(\cdot)$ are both implemented with a shallow modulation network consisting of two convolutional layers, as illustrated on the left of Figure 4. By replacing all the normalization layers with SPADE, the generation network proposed by Park et al. achieves much better synthesis results than previous methods such as pix2pixHD.
The advantages of SPADE are attributed mainly to two important properties: spatial-adaptiveness and semantic-awareness. The former indicates that the modulation parameters $\gamma$ and $\beta$ vary spatially in a pixel-wise manner, while the latter means that $\gamma$ and $\beta$ depend on semantic classes to bring back the lost information. The name SPADE suggests that spatial-adaptiveness is the more important of the two. However, through the following analysis, we argue that semantic-awareness may be the de facto main contributor to the performance of SPADE.
In Figure 2, we show two examples with masks from the ADE20k validation dataset, which consist of two semantic labels, "Sky" and "Field". We visualize the intermediate parameters $\gamma$ and $\beta$ of the original pre-trained SPADE generator. It can be easily observed that $\gamma$ and $\beta$ are almost identical within each semantic region, except for the boundary area, which is negligible for high-resolution input masks due to the shallowness of the modulation network. In fact, for any two regions sharing the same semantic class within one input mask or even across different input masks, their learned $\gamma$ and $\beta$ will also be almost identical if the regions are much larger than the receptive field of the two-layer modulation network. We further conduct statistical analyses of $\gamma$ and $\beta$ of the original pre-trained SPADE generator for some semantic classes on the ADE20k validation dataset. In Figure 3, we show the histograms of $\gamma$ and $\beta$ for three common classes ("building", "sky", and "tree") on SPADE blocks with various resolutions of input masks. We can observe that the distributions of $\gamma$ and $\beta$ within the same semantic class are concentrated, and the concentration becomes more pronounced as the resolution of the input mask increases. This further supports the view that, compared with spatial-adaptiveness, semantic-awareness may be the underlying key to the superior performance of SPADE.
Inspired by the above observation, we propose a new efficient conditional normalization layer, called CLass-Adaptive (DE)normalization (CLADE), as shown on the right of Figure 4. Inheriting the idea of semantic information compensation from SPADE, the modulation parameters $\gamma$ and $\beta$ in CLADE are also adaptive to the semantic input mask $m$. However, instead of adopting pixel-wise spatial-adaptiveness as in SPADE, CLADE is spatially invariant and adaptive only to different semantic classes. More concretely, $\gamma$ and $\beta$ in CLADE vary with the corresponding semantic classes to maintain the essential property of semantic-awareness, but they are independent of any spatial information, including the position, semantic shape, or layout of $m$.
Therefore, rather than learning modulation parameters through an extra modulation network like SPADE, we directly maintain a modulation parameter bank for CLADE and optimize it as regular network parameters. Assuming the total number of semantic classes to be $N_c$, the parameter bank consists of $N_c$ channel-wise modulation scale parameters $\Gamma = (\gamma_k^1, \dots, \gamma_k^{N_c})$ and $N_c$ shift parameters $B = (\beta_k^1, \dots, \beta_k^{N_c})$. During training, given an input mask $m$, we fill each semantic region of class $l$ with its corresponding modulation parameters $\gamma_k^l$ and $\beta_k^l$ to generate the dense modulation parameter tensors $\overrightarrow{\gamma}$ and $\overrightarrow{\beta}$, respectively. We call this process Guided Sampling in Figure 4.
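Guided sampling amounts to a table lookup indexed by the class id of each pixel. The following NumPy sketch illustrates the idea under simplifying assumptions (a single unbatched feature map and an integer mask already resized to the feature resolution); `clade_modulate` and the array names are hypothetical and do not come from the released code.

```python
import numpy as np

def clade_modulate(x_hat, mask, gamma_bank, beta_bank):
    """Class-adaptive modulation via guided sampling.

    x_hat:      normalized features, shape (C, H, W)
    mask:       integer class ids, shape (H, W), values in [0, num_classes)
    gamma_bank: learnable scales, shape (num_classes, C)
    beta_bank:  learnable shifts, shape (num_classes, C)
    """
    # Guided sampling: look up each pixel's class row -> (H, W, C), then
    # move channels first to get dense (C, H, W) modulation tensors.
    gamma = np.moveaxis(gamma_bank[mask], -1, 0)
    beta = np.moveaxis(beta_bank[mask], -1, 0)
    return gamma * x_hat + beta

# Toy example: 3 classes, 2 channels; left half is class 0, right half class 2.
mask = np.array([[0, 0, 2, 2]] * 4)
gamma_bank = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
beta_bank = np.zeros((3, 2))
out = clade_modulate(np.ones((2, 4, 4)), mask, gamma_bank, beta_bank)
```

No convolution is involved: the only learnable state is the two banks, and the per-pixel cost is an index operation followed by one multiply and one add.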
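In fact, CLADE can also be regarded as a generalized formulation of some existing normalization layers. If the scale and shift parameters are identical across classes, i.e., $\gamma_k^{l_1} \equiv \gamma_k^{l_2}$ and $\beta_k^{l_1} \equiv \beta_k^{l_2}$ for any two classes $l_1, l_2$, CLADE becomes BN. And if we make the modulation tensors $\overrightarrow{\gamma}$ and $\overrightarrow{\beta}$ both spatially uniform, and replace the mean and standard deviation statistics of BN with the corresponding ones from IN, we obtain Conditional IN.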
By default, CLADE uses the additional input of instance maps if provided by the datasets (Cityscapes and COCO-Stuff) to better distinguish different instances of the same category. Similar to pix2pixHD and SPADE, we feed the edge map $E$ calculated from the instance map ('edge' and 'non-edge' are represented as '1' and '0') into the network. However, the special architecture of CLADE does not allow us to stack the edge map with the semantic layout directly. Thus, we embed the edge information in the modulated features. To match the activation values in the feature, we first modulate the edge map as follows:

$$\tilde{E} = \gamma_c \cdot E + \beta_c, \qquad (2)$$

where $\tilde{E}$ is the modulated edge map, and $\gamma_c$ and $\beta_c$ are two constant floating-point numbers that are learned as regular parameters. Then, we combine the modulated $\tilde{E}$ with the feature maps modulated by the CLADE layer along the channel dimension, and feed them into the following layers. Since only two constant numbers are involved and Equation (2) can also be implemented by pixel-wise value assignment operations, the extra parameter and computation overhead is negligible.
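In the original SPADE generator backbone, a SPADE block is placed before almost every convolution to replace the conventional normalization layer. For convenience, we denote the input and output channel numbers of the following convolutional layer as $C_{in}$ and $C_{out}$, and its kernel size as $k_c$. For its modulation network, we simply assume that the same kernel size $k_m$ and intermediate channel number $C_h$ are used for all convolutional layers. Therefore, the parameter numbers for the convolutional layer and the SPADE block are calculated as:

$$P_{conv} = k_c^2 \cdot C_{in} \cdot C_{out}, \qquad P_{spade} = k_m^2 \cdot (N_c \cdot C_h + 2 \cdot C_h \cdot C_{in}),$$

where $N_c$ denotes the number of semantic classes. With the default implementation settings of SPADE, we have $k_c = k_m$, so the parameter ratio between both networks is:

$$\frac{P_{spade}}{P_{conv}} = \frac{N_c \cdot C_h + 2 \cdot C_h \cdot C_{in}}{C_{in} \cdot C_{out}}.$$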
That is to say, the extra parameters introduced by SPADE become a significant overhead, especially when the class number $N_c$ and the intermediate channel number $C_h$ are relatively large ( by default in SPADE). Take the ADE20k dataset as an example, which contains 151 classes ($N_c = 151$). On image resolution of , the SPADE generator consists of 7 SPADE residual blocks. We show the parameter ratio of each convolutional layer in Figure 5. It can be seen that SPADE indeed brings considerable parameter overhead to all the convolutional layers. This becomes even more serious as the network goes deeper, since the channel number is designed to be smaller for higher feature resolutions. The ratios for some layers even exceed . Taking all the convolutional layers in SPADE generators into consideration, the average ratio is about .
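The parameter accounting above can be checked with a few lines of arithmetic. The sketch below is a rough estimate under stated assumptions: biases are ignored, both kernel sizes are 3, the hidden width of the modulation network is taken to be 128, and the layer widths in the loop are illustrative rather than the exact widths of the SPADE generator.

```python
def conv_params(c_in, c_out, k=3):
    """Weights of a k x k convolution (biases ignored)."""
    return k * k * c_in * c_out

def spade_params(n_classes, c_in, c_hidden=128, k=3):
    """Weights of a two-layer SPADE modulation network (biases ignored).

    One shared conv maps the n_classes-channel mask to c_hidden features;
    two further convs map c_hidden to c_in channels for the scale and shift.
    """
    return k * k * (n_classes * c_hidden + 2 * c_hidden * c_in)

def spade_param_ratio(n_classes, c_in, c_out, c_hidden=128, k=3):
    """Extra SPADE parameters relative to the convolution it precedes."""
    return spade_params(n_classes, c_in, c_hidden, k) / conv_params(c_in, c_out, k)

# Hypothetical layer widths with 151 classes: the ratio grows as the
# channel number shrinks, i.e. towards the high-resolution layers.
for width in (1024, 256, 64):
    print(width, round(spade_param_ratio(151, width, width), 3))
```

Because the numerator contains a term that does not shrink with the layer width, narrow layers pay proportionally more for their modulation network than wide ones.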
In addition to the parameter numbers, we also analyze the computation complexity. Here, we use the number of floating-point operations (FLOPs) as the metric. Since the convolutional layers within the modulation network dominate the computation cost of SPADE, the FLOPs of both the convolutional layer and the SPADE block can be simply calculated as:

$$F_{conv} = k_c^2 \cdot C_{in} \cdot C_{out} \cdot H \cdot W, \qquad F_{spade} = k_m^2 \cdot (N_c \cdot C_h + 2 \cdot C_h \cdot C_{in}) \cdot H \cdot W,$$

where $W$ and $H$ are the width and height of the output feature, respectively. Therefore, the FLOPs ratio is identical to the parameter ratio shown in Figure 5. However, different from the parameter number, the absolute FLOPs grow with the feature resolution and are therefore larger in deeper layers, which makes the computation overhead even worse. Taking the same ADE20k dataset as an example, the average extra FLOPs ratio introduced by SPADE is about 234.73%, which means the computation cost of SPADE is even heavier than that of the convolutional layers. More importantly, it is now popular to adopt very large synthesis networks to ensure good performance, which already consume a large amount of parameter space and computation resources. SPADE further aggravates this situation, which might be unaffordable in many cases.
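A short sketch makes the two statements above concrete: the FLOPs ratio does not depend on the spatial size, while the absolute FLOPs do. As before, this is an estimate under assumptions (kernel size 3, hidden width 128, one operation per multiply-accumulate), and the layer shapes used below are hypothetical.

```python
def conv_flops(c_in, c_out, h, w, k=3):
    """Multiply-accumulate count of a k x k conv producing an h x w output."""
    return k * k * c_in * c_out * h * w

def spade_flops(n_classes, c_in, h, w, c_hidden=128, k=3):
    """Multiply-accumulate count of the two-layer SPADE modulation network."""
    return k * k * (n_classes * c_hidden + 2 * c_hidden * c_in) * h * w

# The spatial size cancels in the ratio ...
ratio_low = spade_flops(151, 64, 16, 16) / conv_flops(64, 64, 16, 16)
ratio_high = spade_flops(151, 64, 256, 256) / conv_flops(64, 64, 256, 256)

# ... but the absolute cost is dominated by the high-resolution layers.
early = spade_flops(151, 1024, 8, 8)      # wide, low-resolution layer
late = spade_flops(151, 64, 256, 256)     # narrow, high-resolution layer
print(ratio_low, ratio_high, late / early)
```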
Compared to SPADE, our CLADE does not require any extra modulation network to regress the modulation parameters. Specifically, the corresponding numbers of its parameters and FLOPs are:
We take the value assignment operation as one float-point operation. Similar to SPADE, if every convolutional layer is followed by one CLADE layer, the relative ratios of parameter and FLOPs are:
In most existing synthesis networks, the above ratios are extremely small. For example, with the same backbone as the above SPADE generator for the ADE20k dataset, the average ratios for parameter and FLOPs are only and , respectively. Therefore, compared to SPADE, the parameter and computation overhead of CLADE are negligible, which is friendly to practical scenarios regarding both training and inference. Despite its simplicity and efficiency, we demonstrate that it can still achieve comparable performance as SPADE with extensive experiments in Section 4.
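To give a feel for the difference in parameter overhead, the sketch below counts the weights of a CLADE parameter bank (one scale and one shift vector per class) and compares them with a two-layer SPADE modulation network for the same layer. The SPADE settings (kernel size 3, hidden width 128) and the example layer width are assumptions for illustration; FLOPs are deliberately left out.

```python
def clade_params(n_classes, c_in):
    """Entries in the CLADE bank: a scale and a shift per class and channel."""
    return 2 * n_classes * c_in

def spade_params(n_classes, c_in, c_hidden=128, k=3):
    """Weights of a two-layer SPADE modulation network (biases ignored)."""
    return k * k * (n_classes * c_hidden + 2 * c_hidden * c_in)

# Hypothetical 64-channel layer with 151 classes.
bank = clade_params(151, 64)
modulation_net = spade_params(151, 64)
print(bank, modulation_net, round(bank / modulation_net, 4))
```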
As mentioned before, the modulation parameters of SPADE are almost spatially invariant within the same semantic region, especially for high-resolution input masks. In other words, the spatial-adaptiveness is not fully utilized in SPADE. This is mainly due to the limited receptive fields of modulation layers in a shallow network. Theoretically, if we increase the depth of the network, better spatial-adaptiveness could be achieved with the accumulation of receptive fields, but along with prohibitively high computational cost. Based on this observation, we propose a variant of CLADE, CLADE-ICPE, to further improve intra-class spatial adaptiveness by leveraging a positional encoding map as the extra input.
The positional encoding map is defined as the relative distance from each pixel to its corresponding object center, which can be calculated using the input semantic mask . Specifically, for each pixel in the positional encoding map , we first find its belonging semantic object () by detecting the largest connected component of the corresponding semantic category and obtain the object center (). Then the distance map along the dimension is defined as:
We further define the maximum offset of each object as:
Finally, we get the normalized distance map by normalizing with the maximum offset:
As shown in Figure 6, in order to utilize the positional encoding map , we follow the modulation idea and use a convolution layer to map the positional encoding to the modulation parameters ():
where and are convolution operations with one-channel outputs. And is the element-wise multiplication. Since the input and output channel numbers of and are and , respectively, the extra parameter and computation overhead is almost negligible. Specifically, the corresponding relative ratios of parameters and FLOPs defined in Section 3.3.2 are:
Compared with Equation (10), the ratio of parameters is almost the same, while the ratio of FLOPs is almost twice that of CLADE. However, the absolute ratio is still relatively low, especially compared to SPADE (0.14% vs. 234.73%).
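The intra-class positional encoding described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it treats every semantic class as a single object instead of detecting its largest connected component, uses the mean pixel coordinate as the object center, and normalizes by the largest absolute offset within the object. The function name is hypothetical.

```python
import numpy as np

def positional_encoding(mask):
    """Normalized per-class offset maps, shape (2, H, W), values in [-1, 1].

    Simplification: every semantic class is treated as a single object and
    its center is the mean pixel coordinate of that class.
    """
    h, w = mask.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    enc = np.zeros((2, h, w))
    for cls in np.unique(mask):
        region = mask == cls
        for axis, coords in enumerate((xs, ys)):
            offset = coords[region] - coords[region].mean()  # distance to center
            max_off = np.abs(offset).max()                   # maximum offset
            enc[axis][region] = offset / max_off if max_off > 0 else 0.0
    return enc

# Toy mask: class 0 on the left three columns, class 1 on the right three.
mask = np.zeros((4, 6), dtype=int)
mask[:, 3:] = 1
enc = positional_encoding(mask)
```

Unlike the class id, this map varies inside a region, so modulation parameters derived from it can differ between, for example, the top and bottom of a large sky area.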
Similar to SPADE, our proposed CLADE can be integrated into different generator backbones. In this paper, the CLADE generator follows the network architecture of the SPADE generator by default, but all the SPADE blocks are replaced by CLADE. As shown in Figure 7, it adopts several residual blocks with upsampling layers and progressively increases the output resolution. Each residual block consists of CLADE layers, ReLU layers, and convolution layers, and the skip connection is also replaced by these layers when the numbers of channels before and after the residual block differ. For multi-modal synthesis, we follow the strategy of prior work and attach an extra encoder that encodes the image into a random vector. Specifically, this encoder consists of a series of convolutional layers with stride 2, instance normalization layers, and LReLU activation layers, and outputs the mean and variance vectors of the distribution of the specified image. Then a random vector sampled from this distribution is fed into the CLADE generator as the style guidance to enable global diversity of the generated results.
Main experiments are conducted on four popular datasets: ADE20k, ADE20k-outdoor, COCO-Stuff, and Cityscapes. The ADE20k dataset  consists of 25,210 images (20,210 for training, 2,000 for validation and 3,000 for testing), covering a total of 150 object and stuff categories. ADE20k-outdoor is a subset of ADE20k that only contains outdoor scenes. Similar to previous work [34, 35], we directly select the images containing categories such as sky, trees, and sea without manual inspection. There are 9,649 training images and 943 validation images. The COCO-Stuff dataset  augments COCO by adding dense pixel-wise stuff annotations. It has 118,000 training images and 5,000 validation images with 182 semantic categories. The Cityscapes dataset  is a widely used dataset for semantic image synthesis. It contains 2,975 high-resolution training images and 500 validation images of 35 semantic categories.
We use two additional datasets to evaluate the generalization ability when applying CLADE to some recent semantic synthesis methods that depend on SPADE. CelebAMask-HQ [27, 22, 31] contains 30,000 segmentation masks with 19 different classes from the CelebAHQ face image dataset. They are split into 28,000 training images and 2,000 validation images. DeepFashion consists of 52,712 person images with fashion clothes. We use the processed dataset provided by SMIS, which consists of 30,000 training images and 2,247 validation images.
Table: Performance and complexity comparison with other semantic image synthesis methods. All metrics are measured by ourselves with PyTorch on a Titan XP GPU. Columns: Dataset, Method, mIoU, accu, FID, Params (M), FLOPs (G), Runtime (s).
We follow the same training setting as SPADE. In detail, the generator is trained with the same multi-scale discriminator, and the loss function is as follows:

$$\mathcal{L} = \mathcal{L}_{GAN} + \lambda_1 \mathcal{L}_{FM} + \lambda_2 \mathcal{L}_{P},$$

where $\mathcal{L}_{GAN}$ is the hinge version of the GAN loss, and $\mathcal{L}_{FM}$ is the feature matching loss between the real and synthesized images, with features extracted by the multi-scale discriminator. $\mathcal{L}_{P}$ is the perceptual loss with a VGG network as the feature extractor. For multi-modal synthesis, we add a KL-divergence loss term ($\mathcal{L}_{KL}$) to minimize the gap between the encoded distribution and the Gaussian distribution. By default, we set , and the Adam optimizer ( ) is used with a total of 200 epochs. The learning rates for the generator and discriminator are set to 0.0001 and 0.0004, respectively. We evaluate the model every 10 epochs and select the model with the best performance. To demonstrate the effectiveness of our method, we not only compare CLADE with the SPADE baseline but also include a comparison with the popular semantic image synthesis method pix2pixHD. For pix2pixHD, we use the code and settings provided by the authors to train all the models. For SPADE, we directly use the pre-trained models provided by the authors to get the result images for evaluation. The resolution of images () is set to except for Cityscapes, which is set to .
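For reference, two of the loss terms named above have simple closed forms. The NumPy sketch below shows the standard hinge formulation of the adversarial loss and the KL divergence between a diagonal Gaussian and the standard normal. It assumes the encoder predicts a log-variance vector, which is a common parameterization but is not stated in the text; the function names are illustrative.

```python
import numpy as np

def d_hinge_loss(real_logits, fake_logits):
    """Discriminator side of the hinge GAN loss."""
    real_term = np.maximum(0.0, 1.0 - real_logits).mean()
    fake_term = np.maximum(0.0, 1.0 + fake_logits).mean()
    return real_term + fake_term

def g_hinge_loss(fake_logits):
    """Generator side of the hinge GAN loss."""
    return -fake_logits.mean()

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, I) ), summed over dimensions."""
    return -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar))

# Toy values: confidently separated logits give zero discriminator loss.
d_loss = d_hinge_loss(np.array([2.0, 2.0]), np.array([-2.0, -2.0]))
g_loss = g_hinge_loss(np.array([1.0, 3.0]))
kl = kl_to_standard_normal(np.ones(3), np.zeros(3))
```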
We leverage the protocol from previous works [5, 44] for evaluation, which is also used in SPADE. Specifically, we run semantic segmentation algorithms on the synthesized images and evaluate the quality of the predicted semantic masks. To measure the segmentation accuracy, two popular metrics, mean Intersection-over-Union (mIoU) and pixel accuracy (accu), are used. For different datasets, we select corresponding state-of-the-art segmentation models: UperNet101 [46, 8] for ADE20k and ADE20k-outdoor, DeepLabv2 [4, 25] for COCO-Stuff, DRN [48, 11] for Cityscapes, and UNet [38, 41] for CelebAMask-HQ. As for DeepFashion, we also use UNet but train the model ourselves. We also leverage the commonly used Fréchet Inception Distance (FID) to measure the distribution distance between synthesized images and real images. Specifically, we calculate FID between generated validation images and real training images, rather than between generated validation images and real validation images. This is because the number of training images is larger than that of validation images, so the training set better reflects the distribution characteristics of real images. The same protocol is also adopted in recent work.
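The two segmentation metrics can be computed from a confusion matrix between the ground-truth mask and the mask predicted on the synthesized image. The sketch below is a minimal single-image version with a hypothetical function name; an actual evaluation would accumulate the confusion matrix over the whole validation set before computing the scores.

```python
import numpy as np

def segmentation_scores(pred, target, num_classes):
    """Return (mIoU, pixel accuracy) for integer label maps of equal shape."""
    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    idx = target.ravel() * num_classes + pred.ravel()
    conf = np.bincount(idx, minlength=num_classes**2).reshape(num_classes, num_classes)
    inter = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    present = union > 0                      # skip classes absent from both maps
    miou = (inter[present] / union[present]).mean()
    accu = inter.sum() / conf.sum()
    return miou, accu

# Toy example: one of four pixels is mislabeled.
target = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])
miou, accu = segmentation_scores(pred, target, num_classes=3)
```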
As shown in Table I, our method can achieve comparable performance with SPADE while significantly reducing the parameter number and computational complexity of the original SPADE generator on all the datasets. For example, on the COCO-Stuff dataset, the proposed CLADE achieves a mIoU score of 36.77 and a pixel accuracy score of 68.08, which is even slightly better than SPADE. When compared to pix2pixHD, CLADE outperforms it by more than and points in terms of mIoU and pixel accuracy respectively. As for the FID score, our CLADE is also close to SPADE and much better than pix2pixHD. On the Cityscapes dataset, our CLADE performs better than SPADE in terms of FID, but the parameter number in our CLADE generator is only about of that in the original SPADE generator and of that in pix2pixHD. As for the computation complexity in terms of FLOPs, CLADE generator is about fewer than that in the SPADE generator and fewer than that in the pix2pixHD.
Since the GPU computation capacity is often more than sufficient for single-image processing, the real runtime speedup is less significant than the FLOPs reduction, but we still observe about speedup when compared to SPADE. A more significant speedup can be observed on low-end devices. Going one step further, we analyze the extra parameter and computation cost introduced by SPADE and CLADE in Table II. In detail, we calculate the parameter and computation cost brought by the backbone network (operations except normalization) and by the SPADE (or CLADE) layers, respectively. It can be seen in Table II that the advantages of CLADE layers in terms of parameters and runtime are much more obvious when ignoring the backbone part.
When introducing additional spatial information, CLADE-ICPE has made a significant improvement in terms of FID on all the datasets. Even compared to SPADE, CLADE-ICPE shows a considerable advantage, especially on Cityscapes dataset. But as for the model complexity, the additional parameters and FLOPs are negligible, and the overhead increase in the average running time is also small.
To further demonstrate the efficiency and effectiveness of CLADE, we also train a lightweight variant of SPADE (denoted as SPADE-light in Table III) by reducing the number of channels in its convolution layers to ensure it has similar FLOPs as CLADE. Obviously, SPADE-light performs much worse than CLADE on all datasets.
Since judging the visual quality of one image is usually subjective, we further conduct a user study to compare the results generated by different methods. Specifically, we give the users two synthesis images generated from the same semantic mask by two different methods (our method and the baseline method) and ask them “which is more realistic”. To ensure a more detailed comparison, there is no time limit set for the users. And for each pairwise comparison, we randomly choose 40 results for each method and involve 20 users. In Table IV, we report the evaluation results on four different datasets. According to the results, we find that users have no obvious preference between our CLADE and SPADE, which once again proves the comparable performance to SPADE. But compared to the results of pix2pixHD, users clearly prefer our results on all the datasets, especially including the challenging COCO-Stuff dataset.
Besides the above quantitative comparison, we further provide some qualitative comparison results on the four different datasets. In detail, Figure 8 shows visual results for indoor cases on the ADE20k dataset and outdoor cases on the ADE20k-outdoor dataset. Despite the simplicity of our method, it can generate very high-fidelity images that are comparable to the ones generated by SPADE. In some cases, we find our method is even slightly better than SPADE. In contrast, because of the semantic information loss problem in common normalization layers, the results generated by pix2pixHD are worse than those of both SPADE and our CLADE. In Figure 9, some visual results on the COCO-Stuff dataset are provided. Compared to ADE20k, COCO-Stuff has more categories and contains more small objects, so it is more challenging. However, our method still works very well and generates high-fidelity results. According to the results in Figure 10, a similar conclusion can also be drawn for higher-resolution semantic image synthesis on the Cityscapes dataset ().
We also show results in Figure 12 to compare the visual effect of intra-class spatial adaptiveness. Given the additional spatial information, the results contain richer details. Taking the ADE20k dataset as an example, SPADE and CLADE can only produce a blurred view out of the window, while CLADE-ICPE generates a high-quality view with rich textures. In particular, for classes covering large regions, both SPADE and CLADE produce repeated or blurry patterns (see the last column of Figure 12) because they cannot distinguish between different positions within the same category. In contrast, CLADE-ICPE produces vivid textures under the spatial guidance of the positional encoding map.
As mentioned above, our method can easily support multi-modal and style-guided synthesis by introducing an extra style encoder before the generator network. Specifically, we obtain different style vectors either by random sampling or by feeding different reference images into the style encoder, and then input these style vectors into the generator network to produce diverse images. Figure 11 shows some visual results. The top row demonstrates that our method can synthesize diverse images from the same semantic layout. Similarly, as shown in the bottom row, different reference images can be used to control the global style of the generated images, including but not limited to sunny days, dusk, and night.
In the method part, we have shown that, for higher-resolution layers, the distributions of the modulation parameters in SPADE are more centralized (Figure 2) and the corresponding extra computation cost is also more significant (Figure 5). For low-resolution layers, the distributions are less centralized and can supply some spatial variance. In contrast, the basic CLADE is only class-adaptive, not spatially adaptive. It is therefore intuitive to use SPADE on lower-resolution layers and CLADE on higher-resolution layers to achieve a better balance between generation quality and efficiency. To verify this point, we mix SPADE and CLADE with the transition point at different resolution layers, and test the performance on the ADE20k-outdoor and Cityscapes datasets. In the original SPADE generator, there are seven SPADE ResBlks, which are numbered from 1 to 7 as the resolution increases. The second and third ResBlks are at the same resolution if the resolution of the synthesized image is , otherwise they are at different resolutions.
As shown in Table V, the average running time decreases as more CLADE layers are used, which is in line with our expectations. More interestingly, using SPADE on low-resolution layers and CLADE starting from the middle ResBlks (e.g. 6th and 7th on ADE20k-outdoor dataset, and 3rd, 5th and 6th on Cityscapes dataset) can even achieve slightly higher mIoU than using SPADE on all layers, while being more efficient.
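The mixing scheme can be sketched as a per-block schedule: SPADE for every ResBlk below a transition index and CLADE from that index onward. The function name and the chosen transition point are hypothetical; this only illustrates the bookkeeping, not the released implementation.

```python
from typing import List


def mixed_norm_schedule(num_blocks: int, first_clade_block: int) -> List[str]:
    """Normalization type for ResBlks 1..num_blocks (numbered by increasing resolution).

    Blocks before `first_clade_block` use SPADE; blocks from it onward use CLADE.
    `first_clade_block = num_blocks + 1` means SPADE everywhere.
    """
    if not 1 <= first_clade_block <= num_blocks + 1:
        raise ValueError("first_clade_block must lie in [1, num_blocks + 1]")
    return [
        "SPADE" if idx < first_clade_block else "CLADE"
        for idx in range(1, num_blocks + 1)
    ]


# Hypothetical transition: CLADE from the 4th of the seven ResBlks onward.
schedule = mixed_norm_schedule(num_blocks=7, first_clade_block=4)
```

Sweeping `first_clade_block` from 1 to 8 covers every configuration between the all-CLADE and all-SPADE generators.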
Although the positional encoding provides prior spatial information within each semantic category and helps synthesize richer details, how to properly utilize this information is not trivial. Empirically, we find that inappropriate use may even be harmful. Here we study three different ways to apply the positional encoding map:
The positional encoding map is directly concatenated with the downsampled semantic mask as extra channels and fed to the generator. This version is called +disti.
The positional encoding map is first transformed with one 1x1 convolutional layer and then concatenated with each normalized feature map (after the CLADE layer) as extra channels. This version is called +distf.
As described in Section 3.4, the positional encoding map is used to modulate the original semantic-adaptive modulation parameters of CLADE. This version is called +distp.
In Table VI, we compare these three variants with the original CLADE in terms of FID on the ADE20k, ADE20k-outdoor and Cityscapes datasets. +distp achieves the best FID on these datasets, while +disti is the worst. Comparing +distf with +disti shows that adding the spatial information at different feature levels is beneficial. Comparing +distf with +distp shows that concatenating the positional encoding feature with the normalized features along the channel dimension is not as effective as the element-wise multiplication used by +distp.
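The difference between concatenation (+distf) and element-wise modulation (+distp) can be illustrated on a single channel flattened to a list of pixels. This is a minimal sketch under stated assumptions: the input `x` is taken to be already normalized, the positional factors stand in for the output of a learned transform of the positional encoding map, and the exact form in which those factors multiply the class-wise parameters is assumed rather than taken from Section 3.4. All names and values are hypothetical.

```python
from typing import List, Sequence


def clade(x: Sequence[float], mask: Sequence[int],
          gamma: Sequence[float], beta: Sequence[float]) -> List[float]:
    """Class-adaptive modulation: each pixel uses the (gamma, beta) of its class."""
    return [gamma[k] * v + beta[k] for v, k in zip(x, mask)]


def clade_distp(x: Sequence[float], mask: Sequence[int],
                gamma: Sequence[float], beta: Sequence[float],
                pos_gamma: Sequence[float], pos_beta: Sequence[float]) -> List[float]:
    """+distp-style sketch: per-pixel positional factors multiply the class-wise parameters."""
    return [
        (gamma[k] * pg) * v + (beta[k] * pb)
        for v, k, pg, pb in zip(x, mask, pos_gamma, pos_beta)
    ]


def clade_distf(x: Sequence[float], mask: Sequence[int],
                gamma: Sequence[float], beta: Sequence[float],
                pos_feature: Sequence[float]) -> List[List[float]]:
    """+distf-style sketch: the positional feature is appended as an extra channel."""
    return [clade(x, mask, gamma, beta), list(pos_feature)]


x = [0.5, -1.0, 2.0, 0.0]          # normalized activations, one channel, four pixels
mask = [0, 0, 1, 1]                # class index of each pixel
gamma = [2.0, 0.5]                 # one scale per class
beta = [0.25, -0.5]                # one shift per class
pos_gamma = [1.0, 0.5, 1.0, 2.0]   # hypothetical positional factors
pos_beta = [1.0, 1.0, 1.0, 1.0]

plain = clade(x, mask, gamma, beta)
modulated = clade_distp(x, mask, gamma, beta, pos_gamma, pos_beta)
stacked = clade_distf(x, mask, gamma, beta, pos_gamma)
```

In `modulated`, the first two pixels share a class but receive different effective scales, which is the intra-class spatial variation that plain class-adaptive modulation cannot express. In `stacked`, the modulation itself is unchanged and later layers must learn to use the extra channel.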
In particular, the performance gain on the Cityscapes dataset is much more significant than that on the ADE20k dataset. We believe this is because Cityscapes contains many large-area categories with clear internal structure, such as buildings and cars. By comparison, though ADE20k-outdoor also has some large-area categories like sky and sea, they have less complex internal structures and thus benefit less from spatial adaptiveness.
| Dataset | Method | mIoU | accu | FID | Params (M) | FLOPs (G) | Runtime (s) |
To demonstrate its general applicability, we further replace the SPADE layer with the proposed CLADE layer in some recent SPADE-based methods and show the results in Table VII. Without loss of generality, we still focus on the semantic image synthesis task and select two representative methods: GroupDNet and SEAN. GroupDNet is a semantic-level multimodal image synthesis method that achieves great success on the DeepFashion dataset, while SEAN focuses on face image synthesis and shows excellent performance on the CelebAMask-HQ dataset. We therefore conduct experiments on these two datasets, and add the Cityscapes dataset as a supplement. All models are trained with the same settings as in the official code, and the only difference is the normalization layer.
The detailed comparison results are shown in Table VII. In general, after replacing SPADE with CLADE, the performance of these methods is almost unaffected, while the number of parameters and the computational overhead are significantly reduced. For SEAN, the FID on the CelebAMask-HQ dataset is significantly improved. As for model parameters, a reduction in model size of about 20M is observed on different datasets, which is consistent with the comparison between SPADE and CLADE in Table I.
In this paper, we conduct an in-depth analysis of the spatially-adaptive normalization used in semantic image synthesis. We observe that its most essential advantage comes from semantic-awareness rather than spatial-adaptiveness as originally suggested for SPADE. Motivated by this observation, we design a more efficient conditional normalization structure, CLADE. Compared to SPADE, CLADE achieves comparable synthesis results while greatly reducing the parameter and computation overhead. To introduce true spatial adaptiveness, we further explore the role of a positional prior and propose an improved version of CLADE that modulates its parameters through an extra intra-class positional encoding. We also adopt CLADE in some recent SPADE-based methods and obtain comparable or even better results with greatly reduced parameters and computational cost.