Local Class-Specific and Global Image-Level Generative Adversarial Networks for Semantic-Guided Scene Generation
In this paper, we address the task of semantic-guided scene generation. One open challenge in scene generation is the difficulty of the generation of small objects and detailed local texture, which has been widely observed in global image-level generation methods. To tackle this issue, in this work we consider learning the scene generation in a local context, and correspondingly design a local class-specific generative network with semantic maps as a guidance, which separately constructs and learns sub-generators concentrating on the generation of different classes, and is able to provide more scene details. To learn more discriminative class-specific feature representations for the local generation, a novel classification module is also proposed. To combine the advantage of both the global image-level and the local class-specific generation, a joint generation network is designed with an attention fusion module and a dual-discriminator structure embedded. Extensive experiments on two scene image generation tasks show superior generation performance of the proposed model. The state-of-the-art results are established by large margins on both tasks and on challenging public benchmarks. The source code and trained models are available at https://github.com/Ha0Tang/LGGAN.
In this work, we focus on semantic-guided scene generation, which is a hot research topic covering several main-stream research directions, including cross-view image translation [21, 52, 36, 37, 43, 38] and semantic image synthesis [48, 8, 34, 32]. The cross-view image translation task proposed in  is essentially an ill-posed problem due to the large ambiguity in the generation if only a single RGB image is given as input. To alleviate this problem, recent works such as SelectionGAN  try to generate the target image based on an image of the scene and several novel semantic maps, as shown in Fig. 1 (bottom). Adding a semantic map allows the model to learn the correspondences in the target view with appropriate object relations and transformations. On the other side, the semantic image synthesis task aims to generate a photo-realistic image from a semantic map [48, 8, 34, 32], as shown in Fig. 1 (top). Recently, Park et al.  propose a spatially-adaptive normalization for synthesizing photo-realistic images given an input semantic map. With the useful semantic information, existing methods on both tasks achieved promising performance in scene generation.
However, one can still observe unsatisfying aspects, especially in the generation of local scene structure and details as well as small-scale objects, which we believe are mainly due to three reasons. First, existing methods on both tasks are mostly based on global image-level generation, which accepts a semantic map containing several object classes and aims to generate the appearance of all the different classes using the same network design or shared network parameters. In this case, all the classes are treated equally by the network. Since different semantic classes have distinct properties, class-specific network learning would intuitively benefit the complex multi-class generation. Second, we observe that the number of training samples of different scene classes is imbalanced. For instance, in the Dayton dataset, cars and buses occupy less than 2% of all pixels in the training data, which naturally makes the model learning dominated by the classes with the larger number of training samples. Third, the size of objects in different scene classes is diverse. As shown in the first row of Fig. 1, larger-scale object classes such as road and sky usually occupy a bigger area of the image than smaller-scale classes such as pole and traffic light. Since a convolutional network usually shares its parameters across spatial positions, the larger-scale object classes have an advantage during learning, further increasing the difficulty of generating the small-scale object classes well.
To tackle these issues, a straightforward consideration would be to model the generation of different scene classes specifically in a local context. By so doing, each class could have its own generation network structure or parameters, thus greatly avoiding the learning of a biased generation space. To achieve this goal, in this paper we design a novel class-specific generation network. It consists of several sub-generators for different scene classes with a shared encoded feature map. The input semantic map is utilized as the guidance to obtain feature maps corresponding to each class spatially, which are then used to produce a separate generation for different class regions.
Due to the highly complementary properties of global and local generation, a Local class-specific and Global image-level Generative Adversarial Network (LGGAN) is proposed to combine the advantages of the two. It mainly contains three network branches (see Fig. 2). The first branch is the image-level global generator, which learns a global appearance distribution from the input. The second branch is the proposed class-specific local generator, which aims to generate different object classes separately using semantic-guided class-specific feature filtering. Finally, the fusion weight-map generation branch learns two pixel-level weight maps, which are used to fuse the local and global sub-networks through a weighted combination of their final generation results. The proposed LGGAN can be jointly trained in an end-to-end fashion so that the local and global generation benefit each other in the optimization.
Overall, the contributions of this paper are as follows:
We explore scene generation from the local context, which we believe is beneficial to generate richer scene details compared with the existing global image-level generation methods. A new local class-specific generative structure has been designed for this purpose. It can effectively handle the generation of small objects and scene details which are common difficulties encountered by the global-based generation.
We propose a novel global and local generative adversarial network design able to take into account both the global and local contexts. To stabilize the optimization of the proposed joint network structure, a fusion weight-map generator and a dual-discriminator are introduced. Moreover, to learn discriminative class-specific feature representations, a novel classification module is proposed.
Experiments for cross-view image translation on the Dayton  and CVUSA  datasets, and semantic image synthesis on the Cityscapes  and ADE20K  datasets demonstrate the effectiveness of the proposed LGGAN framework, and show significantly better results compared with state-of-the-art methods on both tasks.
Generative Adversarial Networks (GANs). A vanilla GAN has two important components, i.e., a generator and a discriminator. The goal of the generator is to generate photo-realistic images from a noise vector, while the discriminator tries to distinguish between real and generated images. To synthesize user-specific images, the Conditional GAN (CGAN) has been proposed. A CGAN combines a vanilla GAN with external information, such as class labels [30, 31, 9], text descriptions [35, 54], object keypoints, human body/hand skeletons [1, 42, 3, 59], conditional images [58, 21], semantic maps [48, 43, 32, 47], scene graphs [22, 55, 2] and attention maps [53, 28, 41].
Global and Local Generation in GANs. Modelling global and local information in GANs to generate better results has been explored in various generative tasks [19, 20, 26, 25, 33, 16]. For instance, Huang et al. propose TPGAN for frontal view synthesis by simultaneously perceiving global structures and local details. Gu et al. propose MaskGAN for face editing by separately learning every face component, e.g., mouth and eye. However, these methods are only applied to face-related tasks such as face rotation or face editing, where all the domains have large overlap and similarity. In contrast, we propose a new local and global image generation framework for the more challenging scene generation task, in which the local context modeling is based on semantic-guided class-specific generation, which has not been explored by existing works.
Scene Generation. Scene generation tasks are a hot topic as each image can be parsed into distinctive semantic objects [6, 2, 45, 14, 4, 5]. In this paper, we mainly focus on two scene generation tasks, i.e., cross-view image translation [52, 36, 37, 43] and semantic image synthesis [48, 8, 34, 32]. Most existing works on cross-view image translation have been conducted to synthesize novel views of the same objects [12, 57, 44, 10]. Moreover, several works deal with image translation problems with drastically different views and generate a novel scene from a given different scene [52, 36, 37, 43]. For instance, Tang et al.  propose SelectionGAN to solve the cross-view image translation task using semantic maps and CGAN models. On the other side, the semantic image synthesis task aims to generate a photo-realistic image from a semantic map [48, 8, 34, 32]. For example, Park et al. propose GauGAN , which achieves the best results on this task.
With the semantic maps as guidance, existing approaches on both tasks achieve promising performance. However, we still observe that the results produced by these global image-level generation methods are often unsatisfactory, especially in detailed local texture. In contrast, our proposed approach focuses on generating both a more realistic global structure/layout and local texture details. The local and global generation branches are jointly learned in an end-to-end fashion so that they benefit from each other.
We start by presenting the details of the proposed Local class-specific and Global image-level GANs (LGGAN). An illustration of the overall framework is shown in Fig. 2. The generation module mainly consists of three parts, i.e., a semantic-guided class-specific generator modelling the local context, an image-level generator modelling the global layout, and a weight-map generator for fusing the local and the global generators. We first introduce the used backbone structure, and then present the design of the proposed local and global generation networks.
Semantic-Guided Generation. In this paper, we mainly focus on two tasks, i.e., semantic image synthesis and cross-view image translation. For the former, we follow GauGAN  and use the semantic map as the input of the backbone encoder , as shown in Fig. 2. For the latter, we follow SelectionGAN  and concatenate the input image and a novel semantic map as the input of the backbone encoder . By so doing, the semantic maps act as priors to guide the model to learn the generation of another domain.
Parameter-Sharing Encoder. As we have three different branches for three different generators, the encoder shares its parameters across all three branches to form a compact backbone network. The gradients from all three branches contribute together to the learning of the encoder. We believe that in this way, the encoder can learn both local and global information and the correspondence between them. The encoded deep representations of the input are then passed to the three branches, as shown in Fig. 2.
Class-Specific Local Generation Network. As shown in Fig. 1 and discussed in the introduction, the training data imbalance between different classes and the size difference between scene objects make it extremely difficult to generate small object classes and scene details. To overcome this limitation, we propose a novel local class-specific generation network design. It constructs a separate generator for each semantic class and is thus able to largely avoid the interference from the large object classes in the joint optimization. Each sub-generation branch has independent network parameters and concentrates on a specific class, and is therefore capable of producing similar generation quality for different classes and yielding richer local scene details.
The overview of the local generation network is illustrated in Fig. 3. The encoded features are first fed into two consecutive deconvolutional layers, which increase the spatial size while reducing the number of channels. The scaled feature map is then multiplied element-wise by the semantic mask of each class to obtain a filtered class-specific feature map; this mask-guided feature filtering is performed once for each semantic class.
The filtered feature map is then fed into several convolutional layers dedicated to the corresponding class, which generate a class-specific output image. To better learn each class, we utilize a semantic-mask guided pixel-wise reconstruction loss for every class-specific output.
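The mask-guided filtering and the mask-guided reconstruction loss described above can be illustrated with a minimal pure-Python sketch. The function names (`class_masks`, `filter_features`, `masked_l1`), the single-channel toy grids, and the choice of an L1 distance between the masked target and the class output are assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, List

Grid = List[List[float]]  # one-channel H x W map, kept tiny for illustration


def class_masks(label_map: List[List[int]], num_classes: int) -> Dict[int, Grid]:
    """Turn an integer semantic map into one binary mask per class."""
    return {
        c: [[1.0 if label == c else 0.0 for label in row] for row in label_map]
        for c in range(num_classes)
    }


def filter_features(features: Grid, mask: Grid) -> Grid:
    """Mask-guided filtering: element-wise product of features and a class mask."""
    return [[f * m for f, m in zip(f_row, m_row)] for f_row, m_row in zip(features, mask)]


def masked_l1(target: Grid, class_output: Grid, mask: Grid) -> float:
    """Assumed loss form: L1 distance between the masked target and one class output."""
    return sum(
        abs(t * m - o)
        for t_row, o_row, m_row in zip(target, class_output, mask)
        for t, o, m in zip(t_row, o_row, m_row)
    )


label_map = [[0, 0], [1, 0]]           # class 1 covers a single pixel
features = [[0.5, 0.2], [0.9, 0.4]]    # toy "scaled feature map"
masks = class_masks(label_map, num_classes=2)
filtered = {c: filter_features(features, m) for c, m in masks.items()}
print(filtered[1])  # only the class-1 pixel keeps its feature value
```

In this toy setting every class gets its own filtered map regardless of how few pixels it covers, which is the property the local branches rely on.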
The final output of the local generation network can be obtained in two ways. The first performs an element-wise addition of all the class-specific outputs. The second, shown in Fig. 3, concatenates all the class-specific outputs channel-wise and applies a convolutional operation to the result.
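As a compact restatement, the two combination strategies could be written as follows. The notation is assumed for this sketch and may differ from the authors' symbols: $I_{l_i}$ is the output of the $i$-th class-specific branch, $c$ is the number of semantic classes, and $I^{L}$ is the final local result.

```latex
% Assumed notation (not necessarily the authors'):
%   I_{l_i} : output of the i-th class-specific branch
%   c       : number of semantic classes
%   I^{L}   : final output of the local generation network
I^{L}_{\mathrm{add}} = \sum_{i=1}^{c} I_{l_i},
\qquad
I^{L}_{\mathrm{conv}} = \operatorname{Conv}\bigl(\operatorname{Concat}(I_{l_1}, \dots, I_{l_c})\bigr)
```

The additive form has no extra parameters, whereas the convolutional form lets the network learn how to merge the class-specific outputs.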
Table 1: Quantitative results of cross-view image translation on the Dayton dataset.
|Method|Acc. Top-1 (%)|Acc. Top-1 (%)|Acc. Top-5 (%)|Acc. Top-5 (%)|IS (all)|IS (Top-1)|IS (Top-5)|SSIM|PSNR|SD|KL|
|Pix2pix|6.80|9.15|23.55|27.00|2.8515|1.9342|2.9083|0.4180|17.6291|19.2821|38.26 ± 1.88|
|X-SO|27.56|41.15|57.96|73.20|2.9459|2.0963|2.9980|0.4772|19.6203|19.2939|7.20 ± 1.37|
|X-Fork|30.00|48.68|61.57|78.84|3.0720|2.2402|3.0932|0.4963|19.8928|19.4533|6.00 ± 1.28|
|X-Seq|30.16|49.85|62.59|80.70|2.7384|2.1304|2.7674|0.5031|20.2803|19.5258|5.93 ± 1.32|
|Pix2pix++|32.06|54.70|63.19|81.01|3.1709|2.1200|3.2001|0.4871|21.6675|18.8504|5.49 ± 1.25|
|X-Fork++|34.67|59.14|66.37|84.70|3.0737|2.1508|3.0893|0.4982|21.7260|18.9402|4.59 ± 1.16|
|X-Seq++|31.58|51.67|65.21|82.48|3.1703|2.2185|3.2444|0.4912|21.7659|18.9265|4.94 ± 1.18|
|SelectionGAN|42.11|68.12|77.74|92.89|3.0613|2.2707|3.1336|0.5938|23.8874|20.0174|2.74 ± 0.86|
|LGGAN (Ours)|48.17|79.35|81.14|94.91|3.3994|2.3478|3.4261|0.5457|22.9949|19.6145|2.18 ± 0.74|
Class-Specific Discriminative Feature Learning. We observe that the filtered feature maps alone do not produce very discriminative class-specific generations, leading to similar generation results for some classes, especially for small-scale object classes. In order to obtain more diverse generation for different object classes, we propose a novel classification-based feature learning module to learn more discriminative class-specific feature representations, as shown in Fig. 3. One input sample of the module is a pack of feature maps produced by the different local generation branches. First, the packed feature map (indexed by class, feature channel, height and width) is fed into a semantic-guided average pooling layer to obtain a pooled feature map. Then the pooled feature map is passed to a fully connected layer to predict the classification probabilities of the object classes of the scene. Since some object classes may not exist in the input semantic mask sample, the features from the local branches corresponding to the void classes should not contribute to the classification loss. Therefore, we filter the final cross-entropy loss by multiplying it with a void-class indicator for each input sample. The indicator is a binary vector with 1 for a valid class and 0 for a void class. The Cross-Entropy (CE) loss is then computed over the valid classes only, using an indicator function that returns 1 when its condition holds and 0 otherwise, a classification function that produces a prediction probability given an input feature map, and the label set of all the object classes.
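The void-class filtering of the classification loss can be sketched as follows. This is a minimal illustration under stated assumptions: each local branch yields one vector of classifier logits, branch `i` is expected to be classified as class `i`, and the names `softmax`, `void_masked_cross_entropy`, `branch_logits` and `present` are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def void_masked_cross_entropy(branch_logits: List[List[float]], present: List[int]) -> float:
    """Sum of -log p(class i | branch i) over the classes present in the semantic map.

    branch_logits[i] holds the classifier logits for the pooled feature of branch i;
    present[i] is 1 if class i occurs in the semantic map and 0 for a void class.
    """
    loss = 0.0
    for i, (logits, valid) in enumerate(zip(branch_logits, present)):
        if valid:  # void classes are skipped, i.e. multiplied by 0
            loss -= math.log(softmax(logits)[i])
    return loss


branch_logits = [
    [2.0, 0.0, 0.0],  # branch 0, confidently class 0
    [0.0, 2.0, 0.0],  # branch 1, confidently class 1
    [5.0, 0.0, 0.0],  # branch 2 is wrong, but class 2 is void in this sample
]
present = [1, 1, 0]
loss = void_masked_cross_entropy(branch_logits, present)
print(loss)
```

Because the third class is absent from the sample, its branch contributes nothing to the loss, however poor its prediction is.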
Image-Level Global Generation Network. Similar to the local generation branch, the encoded feature is also fed into the global generation sub-network for global image-level generation, as shown in Fig. 2. Global generation is capable of capturing the global structure or layout of the input images. The global result is thus obtained through a single feed-forward pass of the global generator. Besides the proposed global generator, many existing global generator architectures can also be used with the proposed local generator, making the proposed framework very flexible.
Table 2: Quantitative results of cross-view image translation on the CVUSA dataset.
|Method|Acc. Top-1 (%)|Acc. Top-1 (%)|Acc. Top-5 (%)|Acc. Top-5 (%)|IS (all)|IS (Top-1)|IS (Top-5)|SSIM|PSNR|SD|KL|
|Zhai et al.|13.97|14.03|42.09|52.29|1.8434|1.5171|1.8666|0.4147|17.4886|16.6184|27.43 ± 1.63|
|Pix2pix|7.33|9.25|25.81|32.67|3.2771|2.2219|3.4312|0.3923|17.6578|18.5239|59.81 ± 2.12|
|X-SO|0.29|0.21|6.14|9.08|1.7575|1.4145|1.7791|0.3451|17.6201|16.9919|414.25 ± 2.37|
|X-Fork|20.58|31.24|50.51|63.66|3.4432|2.5447|3.5567|0.4356|19.0509|18.6706|11.71 ± 1.55|
|X-Seq|15.98|24.14|42.91|54.41|3.8151|2.6738|4.0077|0.4231|18.8067|18.4378|15.52 ± 1.73|
|Pix2pix++|26.45|41.87|57.26|72.87|3.2592|2.4175|3.5078|0.4617|21.5739|18.9044|9.47 ± 1.69|
|X-Fork++|31.03|49.65|64.47|81.16|3.3758|2.5375|3.5711|0.4769|21.6504|18.9856|7.18 ± 1.56|
|X-Seq++|34.69|54.61|67.12|83.46|3.3919|2.5474|3.4858|0.4740|21.6733|18.9907|5.19 ± 1.31|
|SelectionGAN|41.52|65.51|74.32|89.66|3.8074|2.7181|3.9197|0.5323|23.1466|19.6100|2.96 ± 0.97|
|LGGAN (Ours)|44.75|70.68|78.76|93.40|3.9180|2.8383|3.9878|0.5238|22.5766|19.7440|2.55 ± 0.95|
Pixel-Level Fusion Weight-Map Generation Network. In order to better combine the local and the global generation sub-networks, we further propose a pixel-level weight-map generator, which aims at predicting pixel-wise weights for fusing the global generation and the local generation. In our implementation, it consists of two Transpose Convolution-InstanceNorm-ReLU blocks and one Convolution-InstanceNorm-ReLU block. The numbers of output channels of these three blocks are 128, 64 and 2, respectively. The layers use stride 2, except for the last layer, which uses stride 1 for dense prediction. We predict a two-channel weight map by applying a channel-wise softmax to the output of the weight-map generator, so that the weight values at the same pixel position sum to 1. By so doing, we can guarantee that information from the combination does not explode. The two-channel map is sliced into a weight map for the global branch and a weight map for the local branch. The fused final generation result is the sum of the global result and the local result, each multiplied element-wise by its weight map. In this way, the pixel-level weights predicted by the weight-map generator operate directly on the outputs of the global and local generators. Moreover, the global, local and weight-map generators affect and contribute to each other in the model optimization.
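The softmax-normalized fusion can be sketched per pixel in plain Python. The function names (`fuse_pixel`, `fuse`) and the flattened single-channel toy images are assumptions for illustration; in the actual model the same operation would run on full image tensors.

```python
import math
from typing import List, Tuple


def fuse_pixel(global_px: float, local_px: float,
               logit_g: float, logit_l: float) -> Tuple[float, float, float]:
    """Blend one pixel with softmax-normalized global/local weights.

    Returns (fused value, global weight, local weight).
    """
    shift = max(logit_g, logit_l)  # subtract the max for numerical stability
    e_g, e_l = math.exp(logit_g - shift), math.exp(logit_l - shift)
    w_g, w_l = e_g / (e_g + e_l), e_l / (e_g + e_l)
    return w_g * global_px + w_l * local_px, w_g, w_l


def fuse(global_img: List[float], local_img: List[float],
         logits_g: List[float], logits_l: List[float]) -> List[float]:
    """Apply the per-pixel fusion to flattened single-channel images."""
    return [
        fuse_pixel(g, l, a, b)[0]
        for g, l, a, b in zip(global_img, local_img, logits_g, logits_l)
    ]


global_img = [0.2, 0.8]
local_img = [0.6, 0.4]
logits_g = [0.0, 3.0]    # second pixel strongly prefers the global branch
logits_l = [0.0, -3.0]
fused = fuse(global_img, local_img, logits_g, logits_l)
print(fused)
```

Since the two weights are non-negative and sum to 1, every fused pixel is a convex combination of the global and local values and stays within their range, which is one way to read the remark that the combination does not explode.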
Dual-Discriminator. To exploit the prior domain knowledge, i.e., the semantic map, we extend the single-domain vanilla discriminator to a cross-domain structure, which we refer to as the semantic-guided discriminator, as shown in Fig. 2. It takes the input semantic map together with the generated image (or the real image) as input, and is trained with an adversarial loss that aims to preserve the scene layout and capture local-aware information.
For the cross-view image translation task, we also propose an image-guided discriminator, which takes the conditional image together with the final generated image (or the ground-truth image) as input and is trained with its own adversarial loss.
In this case, the total loss of our Dual-Discriminator combines the losses of the two discriminators.
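For reference, one plausible form of the two adversarial terms is given below. It assumes the standard conditional GAN objective and an unweighted sum of the two terms; both are assumptions of this sketch, as is the notation: $S$ is the semantic map, $I_a$ the conditional image, $I_g$ the real (ground-truth) image, $\hat{I}$ the final generated image, and $D_s$, $D_i$ the semantic-guided and image-guided discriminators.

```latex
% Assumed notation and assumed standard conditional-GAN form:
%   S: semantic map, I_a: conditional image, I_g: real image,
%   \hat{I}: final generated image, D_s / D_i: the two discriminators.
\mathcal{L}(G, D_s) = \mathbb{E}\bigl[\log D_s(S, I_g)\bigr]
                    + \mathbb{E}\bigl[\log\bigl(1 - D_s(S, \hat{I})\bigr)\bigr]

\mathcal{L}(G, D_i) = \mathbb{E}\bigl[\log D_i(I_a, I_g)\bigr]
                    + \mathbb{E}\bigl[\log\bigl(1 - D_i(I_a, \hat{I})\bigr)\bigr]

\mathcal{L}_{\mathrm{dual}} = \mathcal{L}(G, D_s) + \mathcal{L}(G, D_i)
```

For semantic image synthesis, where no conditional image is available, only the first term would apply under this reading.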
The proposed LGGAN can be applied to different generative tasks such as the cross-view image translation  and the semantic image synthesis . In this section we present experimental results and analysis on both tasks.
Datasets. We follow [43, 36] and perform the cross-view image translation experiments on the Dayton  and CVUSA datasets . The Dayton dataset contains 76,048 images with a train/test split of 55,000/21,048 pairs. The CVUSA dataset consists of 35,532/8,884 image pairs in train/test split.
Evaluation Metric. Following previous work, we employ Inception Score (IS), Accuracy (Acc.) and KL Divergence Score (KL) to evaluate the proposed model. These three metrics measure the distance between two distributions in a high-level feature space. We also employ pixel-level similarity metrics to evaluate our method, i.e., Structural Similarity (SSIM), Peak Signal-to-Noise Ratio (PSNR) and Sharpness Difference (SD).
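Of the pixel-level metrics, PSNR is the simplest to state explicitly. The sketch below uses the standard definition, 10·log10(MAX² / MSE), on flattened pixel lists; the function name `psnr` and the 8-bit default `max_value=255.0` are choices made for this example, and the exact evaluation code behind the reported numbers may differ.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], generated: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(generated) or not reference:
        raise ValueError("inputs must be non-empty and have the same length")
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


reference = [0.0, 255.0, 128.0, 64.0]
generated = [0.0, 250.0, 130.0, 60.0]
print(round(psnr(reference, generated), 2))
```

Higher values indicate a generated image that is closer to the reference at the pixel level, which is not the same as looking more realistic.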
State-of-the-Art Comparisons. We compare our LGGAN with several recently proposed state-of-the-art methods, i.e., Zhai et al., Pix2pix, X-SO, X-Fork and X-Seq. The comparison results are shown in Tables 1 and 2. We observe that LGGAN outperforms these competing methods on nearly all metrics; the one exception is a single Inception Score entry in Table 2, where X-Seq scores 4.0077 against 3.9878 for LGGAN.
To further study the effectiveness of LGGAN, we compare with methods that use both semantic maps and RGB images as input, including Pix2pix++, X-Fork++, X-Seq++ and SelectionGAN. We implement Pix2pix++, X-Fork++ and X-Seq++ using their public source code. Results are shown in Tables 1 and 2. LGGAN achieves significantly better results than Pix2pix++, X-Fork++ and X-Seq++, confirming the advantage of the proposed LGGAN. Compared directly with SelectionGAN, LGGAN is better on most metrics, the exceptions being the pixel-level evaluation metrics, i.e., SSIM, PSNR and SD. SelectionGAN uses a two-stage generation strategy and an attention selection module, and achieves slightly better results than ours on these metrics in most cases. However, our results are much more photo-realistic than those of SelectionGAN, as shown in Fig. 4 and 5.
Qualitative Evaluation. The qualitative results are shown in Fig. 4 and 5. We observe that the results generated by LGGAN are visually significantly better than those of other approaches. In particular, our method generates clearer details on objects such as cars, buildings, roads and trees than the other methods.
Datasets. We follow GauGAN  and conduct extensive experiments on both Cityscapes  and ADE20K  datasets. Cityscapes contains street scenes in German cities. The training and testing set sizes of Cityscapes are 2,975 and 500, respectively. To evaluate the proposed LGGAN on more challenging datasets, we conduct experiments on the ADE20K dataset . This dataset contains challenging scenes with 150 semantic classes, and has 20,210 training and 2,000 validation images.
Evaluation Metric. We adopt the same evaluation metrics from previous work [8, 32, 48], and use the mean Intersection-over-Union (mIoU) and pixel accuracy (Acc) to measure the segmentation accuracy. Specifically, we follow GauGAN  and use the state-of-the-art segmentation networks on the generated images to produce semantic maps: DRN-D-105  for Cityscapes and UperNet101  for ADE20K. We also use the Fréchet Inception Distance (FID)  to measure the distance between the distribution of generated samples and the distribution of real samples. Finally, we follow  and employ Amazon Mechanical Turk (AMT) to measure the perceived visual fidelity of the generated images.
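The two segmentation-based scores can be sketched as follows for a single pair of flattened label maps. The function name `segmentation_scores` is hypothetical, and the per-image averaging is a simplification: benchmark mIoU is usually accumulated over the whole test set before averaging across classes.

```python
from typing import List, Tuple


def segmentation_scores(pred: List[int], target: List[int],
                        num_classes: int) -> Tuple[float, float]:
    """Return (mIoU, pixel accuracy) for two flattened label maps."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes that appear in neither map
            ious.append(inter / union)
    accuracy = sum(1 for p, t in zip(pred, target) if p == t) / len(target)
    return sum(ious) / len(ious), accuracy


# Labels predicted by a segmentation network on a generated image (toy values)
target = [0, 0, 0, 1, 1, 2]   # input semantic map
pred = [0, 0, 1, 1, 1, 0]     # segmentation of the generated image
miou, acc = segmentation_scores(pred, target, num_classes=3)
print(miou, acc)
```

In the toy example the single pixel of class 2 is missed entirely, which lowers mIoU far more than pixel accuracy; this is why mIoU is the more sensitive indicator for small objects.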
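Table 4: User preference study (AMT). The numbers indicate the percentage of users who favor the results of the proposed LGGAN over the competing method.
|AMT|Cityscapes|ADE20K|
|Ours vs. CRN|67.38|79.54|
|Ours vs. Pix2pixHD|56.16|85.69|
|Ours vs. SIMS|54.84|N/A|
|Ours vs. GauGAN|53.19|57.31|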
State-of-the-Art Comparisons. We compare the proposed LGGAN with several leading semantic image synthesis methods, i.e., Pix2pixHD, CRN, SIMS and GauGAN. Results on the mIoU, Acc and FID metrics are shown in Table 3. The proposed LGGAN outperforms the competing methods by a large margin on both the mIoU and Acc metrics. For FID, the proposed method is only worse than SIMS on Cityscapes. However, SIMS has poor segmentation performance, because it produces an image by searching and copying image patches from the training dataset. The generated images look more realistic since real image patches are used, but the approach tends to copy objects with mismatched patches, as a query is not guaranteed to have a matching result in the dataset. Moreover, we follow the evaluation protocol of GauGAN and also provide AMT results, as shown in Table 4. Users favor our synthesized results over those of the competing methods, including SIMS, on both datasets.
Qualitative Evaluation. The qualitative comparison results are shown in Fig. 6 and 7. We can see that the proposed method generates much better results with fewer visual artifacts while the spatial semantic layout of the generated images is also closer to the input semantic map.
Visualization of Generated Semantic Maps. We follow GauGAN  and apply pretrained segmentation networks on the generated images to produce semantic maps. The generated semantic maps of our LGGAN, GauGAN and the ground truths are shown in Fig. 8 and 9. We observe that the proposed LGGAN generates better semantic maps than GauGAN, especially on local texture (‘car’ in the first row and ‘terrain’ in the second row of Fig. 8) and small objects (‘traffic sign’ and ‘pole’ in the third row of Fig. 8), confirming our initial motivation.
We conduct extensive ablation studies on the Cityscapes dataset to evaluate different components of our LGGAN.
Table 5: Ablation study of the proposed LGGAN on Cityscapes.
|Setup of LGGAN|mIoU|FID|
|w/ Global + Local (Add.)|64.6|66.1|
|w/ Global + Local (Con.)|65.8|65.6|
|w/ Global + Local (Con.) + Class. Loss|67.0|61.3|
|w/ Global + Local (Con.) + Class. Loss + Weight Map|68.4|57.7|
Baseline Models. The proposed LGGAN has five baselines, as shown in Table 5: (i) ‘w/ Global’ means only adopting the global generator; (ii) ‘w/ Global + Local (Add.)’ combines the global generator and the proposed local generator to produce the final results, in which the local results are produced by using an addition operation as proposed in Eq. (3); (iii) ‘w/ Global + Local (Con.)’ differs from the previous model in that it uses a convolutional layer to generate the local results, as presented in Eq. (4); (iv) ‘w/ Global + Local (Con.) + Class. Loss’ employs the proposed classification-based discriminative feature learning module; (v) ‘w/ Global + Local (Con.) + Class. Loss + Weight Map’ is our full model and adopts the proposed weight map fusion strategy.
Effect of Local and Global Generation. The results of the ablation study are shown in Table 5. When using an addition operation to generate the local result, the local and global generation strategy improves mIoU and FID over the global-only baseline by 2.3 and 5.7, respectively. When adopting a convolutional operation to produce the local results, the performance improves further, i.e., gains of 3.5 and 6.2 on the mIoU and FID metrics, respectively. Both results confirm the effectiveness of the proposed local and global generation framework. Moreover, we also provide qualitative results of the local and global generation in Fig. 1, 10 and 11. We observe that our full model, i.e., Global + Local, generates visually much better results than either the global or the local method alone.
Effect of Classification-Based Feature Learning. ‘w/ Global + Local (Con.) + Class. Loss’ significantly outperforms the previous baseline with around 1.2 and 4.3 gain on the mIoU and FID metric, respectively. This means that the model indeed learns a more discriminative class-specific feature representation, confirming our design motivation.
Effect of Weight Map Fusion. By adding the proposed weight map fusion scheme, the overall performance is further boosted with 1.4 and 3.6 improvement on the mIoU and FID metric, respectively. This means the proposed LGGAN indeed learns complementary information from the local and the global generation branch. In Fig. 1, 10 and 11, we show some samples of the generated global and local weight maps. We observe that the generated global weight maps mainly focus on learning the global layout and structure, while the learned local weight maps focus on the local details, especially the connection between different classes.
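The gains quoted in the three ablation paragraphs can be checked against the rows of Table 5 with a few lines of arithmetic. The same arithmetic also gives the global-only baseline implied by the stated gains; that pair of numbers is derived here, not read from the table, and the dictionary keys and function names are hypothetical.

```python
# (mIoU, FID) pairs copied from Table 5; higher mIoU and lower FID are better.
rows = {
    "add": (64.6, 66.1),    # w/ Global + Local (Add.)
    "con": (65.8, 65.6),    # w/ Global + Local (Con.)
    "cls": (67.0, 61.3),    # ... + Class. Loss
    "full": (68.4, 57.7),   # ... + Class. Loss + Weight Map
}


def gain(before, after):
    """(mIoU gain, FID reduction) when moving from one setup to the next."""
    return round(after[0] - before[0], 1), round(before[1] - after[1], 1)


def implied_baseline(row, miou_gain, fid_gain):
    """Undo a stated gain to recover the (mIoU, FID) of the setup it was measured against."""
    return round(row[0] - miou_gain, 1), round(row[1] + fid_gain, 1)


class_loss_gain = gain(rows["con"], rows["cls"])     # stated as about 1.2 and 4.3
weight_map_gain = gain(rows["cls"], rows["full"])    # stated as 1.4 and 3.6
baseline_from_add = implied_baseline(rows["add"], 2.3, 5.7)
baseline_from_con = implied_baseline(rows["con"], 3.5, 6.2)
print(class_loss_gain, weight_map_gain, baseline_from_add, baseline_from_con)
```

Both derivations of the global-only baseline agree, so the stated gains are mutually consistent with the table rows.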
We propose Local class-specific and Global image-level Generative Adversarial Networks (LGGAN) for semantic-guided scene generation. The proposed LGGAN contains three generation branches, i.e., global image-level generation, local class-level generation and pixel-level fusion weight-map generation. A new class-specific local generation network is designed to alleviate the influence of imbalanced training data and the size difference of scene objects in joint learning. To learn more class-specific discriminative feature representations, a novel classification module is further proposed. To stabilize the model optimization, we further introduce a novel dual-discriminator, so that the synthesis results are not only visually appealing but also preserve the semantic layout. Experimental results demonstrate the superiority of the proposed approach and show new state-of-the-art results on both cross-view image translation and semantic image synthesis tasks.
The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
Conditional image synthesis with auxiliary classifier GANs. In ICML, 2017.