[CVPR 2019 Oral] Multi-Channel Attention Selection GAN with Cascaded Semantic Guidance for Cross-View Image Translation
Cross-view image translation is challenging because it involves images with drastically different views and severe deformation. In this paper, we propose a novel approach named Multi-Channel Attention SelectionGAN (SelectionGAN) that makes it possible to generate images of natural scenes in arbitrary viewpoints, based on an image of the scene and a novel semantic map. The proposed SelectionGAN explicitly utilizes the semantic information and consists of two stages. In the first stage, the condition image and the target semantic map are fed into a cycled semantic-guided generation network to produce initial coarse results. In the second stage, we refine the initial results by using a multi-channel attention selection mechanism. Moreover, uncertainty maps automatically learned from attentions are used to guide the pixel loss for better network optimization. Extensive experiments on Dayton, CVUSA and Ego2Top datasets show that our model is able to generate significantly better results than the state-of-the-art methods. The source code, data and trained models are available at https://github.com/Ha0Tang/SelectionGAN.
Cross-view image translation is a task that aims at synthesizing new images from one viewpoint to another. It has been gaining a lot of interest, especially from the computer vision and virtual reality communities, and has been widely investigated in recent years [41, 20, 55, 34, 48, 15, 31, 53, 46].
Earlier works studied this problem using encoder-decoder Convolutional Neural Networks (CNNs) that inject viewpoint codes into the bottleneck representations for city-scene synthesis and 3D object translation. There also exist some works exploring Generative Adversarial Networks (GANs) for similar tasks. However, these existing works consider an application scenario in which the objects and the scenes have a large degree of overlap in appearances and views.
Different from previous works, in this paper we focus on a more challenging setting in which the fields of view have little or even no overlap, leading to significantly different structures and appearance distributions between the input source view and the output target view, as illustrated in Fig. 1. To tackle this problem, Regmi and Borji recently proposed a conditional GAN model which jointly learns generation in both the image domain and the corresponding semantic domain, where the semantic predictions are further utilized to supervise the image generation. Although this approach is an interesting exploration, we observe unsatisfactory aspects mainly in the generated scene structure and details, for several reasons. First, since it is costly to obtain manually annotated semantic labels, the label maps are usually produced by semantic models pretrained on other large-scale segmentation datasets, leading to predictions that are not accurate for all pixels and thus misguiding the image generation. Second, we argue that translation with a single-stage generation network is not able to capture the complex structural relationships between the two views. Third, a three-channel generation space may not be large enough to learn a good mapping for this complex synthesis problem. Given these problems, could we enlarge the generation space and learn an automatic selection mechanism to synthesize more fine-grained generation results?
Based on these observations, in this paper we propose a novel Multi-Channel Attention Selection Generative Adversarial Network (SelectionGAN), which consists of two generation stages. The overall framework of the proposed SelectionGAN is shown in Fig. 2. In the first stage, we learn a cycled image-semantic generation sub-network, which accepts a pair consisting of an image and the target semantic map and generates an image for the other view; this image is further fed into a semantic generation network to reconstruct the input semantic map. This cycled generation imposes stronger supervision between the image and semantic domains, facilitating the optimization of the network.
The coarse outputs from the first-stage network, together with the input image and the deep feature maps from the last layer, are fed into the second-stage network. Several intermediate outputs are produced, and simultaneously we learn a set of multi-channel attention maps, one per intermediate generation. These attention maps are used to spatially select from the intermediate generations, which are then combined to synthesize the final output. Finally, to overcome the inaccurate-semantic-label issue, the multi-channel attention maps are further used to generate uncertainty maps that guide the reconstruction loss. Through extensive experimental evaluation, we demonstrate that SelectionGAN produces remarkably better results than baselines such as Pix2pix, Zhai et al. , X-Fork  and X-Seq . Moreover, we establish state-of-the-art results on three different datasets for the arbitrary cross-view image synthesis task.
Overall, the contributions of this paper are as follows:
A novel multi-channel attention selection GAN framework (SelectionGAN) for the cross-view image translation task is presented. It explores cascaded semantic guidance with a coarse-to-fine inference, and aims at producing a more detailed synthesis from richer and more diverse multiple intermediate generations.
A novel multi-channel attention selection module is proposed, which is utilized to attentively select interested intermediate generations and is able to significantly boost the quality of the final output. The multi-channel attention module also effectively learns uncertainty maps to guide the pixel loss for more robust optimization.
Extensive experiments clearly demonstrate the effectiveness of the proposed SelectionGAN, and show state-of-the-art results on two public benchmarks, i.e. Dayton  and CVUSA . Meanwhile, we also create a larger-scale cross-view synthesis benchmark using the data from Ego2Top , and present results of multiple baseline models for the research community.
Generative Adversarial Networks (GANs) have shown strong capability in image generation compared to earlier generative models such as Restricted Boltzmann Machines and Deep Boltzmann Machines. A vanilla GAN model has two important components, i.e. a generator $G$ and a discriminator $D$. The goal of $G$ is to generate photo-realistic images from a noise vector, while $D$ tries to distinguish between a real image and the image generated by $G$. Although GANs have been successfully used to generate images of high visual fidelity [18, 49, 32], some challenges remain, e.g. how to generate images in a controlled setting. To generate domain-specific images, the Conditional GAN (CGAN) has been proposed. A CGAN usually combines a vanilla GAN with some external information, such as class labels or tags [29, 30, 4, 40, 37], text descriptions [33, 50], human poses [8, 38, 28, 22, 35] and reference images [25, 16].
Image-to-Image Translation frameworks use input-output data to learn a parametric mapping between inputs and outputs. For example, Isola et al.  propose Pix2pix, a supervised model that uses a CGAN to learn a translation function between input and output image domains. Zhu et al.  introduce CycleGAN, which targets unpaired image translation using a cycle-consistency loss. To further improve generation performance, attention mechanisms have recently been investigated in image translation [3, 45, 39, 24, 26]. However, to the best of our knowledge, our model is the first attempt to incorporate a multi-channel attention selection module within a GAN framework for the image-to-image translation task.
Learning Viewpoint Transformations. Most existing works on viewpoint transformation synthesize novel views of the same object, such as cars, chairs and tables [9, 41, 5]. Another group of works explores cross-view scene image generation, such as [47, 53]. However, these works focus on scenarios in which the objects and scenes have a large degree of overlap in both appearance and views. Recently, several works started investigating image translation with drastically different views, generating a novel scene view from a given arbitrary one. This is a more challenging task since the views have little or no overlap. To tackle this problem, Zhai et al.  try to generate panoramic ground-level images from aerial images of the same location using a convolutional neural network. Regmi and Borji  propose X-Fork and X-Seq, two GAN-based structures that address aerial-to-street-view image translation using an extra semantic segmentation map. However, these methods cannot generate satisfactory results due to the drastic difference between source and target views and to their model design. To overcome these issues, we aim at a more effective network design and propose a novel multi-channel attention selection GAN, which automatically selects from multiple diverse and rich intermediate generations, significantly improving the generation quality.
In this section we present the details of the proposed Multi-Channel Attention SelectionGAN. An illustration of the overall network structure is depicted in Fig. 2. In the first stage, we present a cascaded semantic-guided generation sub-network, which takes an image from one view and a conditional semantic map from the other view as inputs, and generates an image in the target view. This image is further fed into a semantic generator to recover the input semantic map, forming a generation cycle. In the second stage, the coarse synthesis and the deep features from the first stage are combined and passed to the proposed multi-channel attention selection module, which aims both at producing a more fine-grained synthesis from a larger generation space and at generating uncertainty maps that guide multiple optimization losses.
Semantic-guided Generation. Cross-view synthesis is a challenging task, especially when the two views have little overlap as in our study case, which leads to ambiguity in the generation process. To alleviate this problem, we use semantic maps as conditional guidance. Since it is costly to obtain annotated semantic maps, following  we generate the maps using segmentation models pretrained on large-scale scene parsing datasets such as Cityscapes . However,  uses the semantic maps only in the reconstruction loss to guide the generation of semantics, which provides only weak guidance. Differently, we apply the semantic maps not only in the output loss but also as part of the network input. Specifically, as shown in Fig. 2, we concatenate the input image $I_a$ from the source view with the semantic map $S_g$ from the target view, and feed them into the image generator $G_i$ to synthesize the target-view image $I_g'$. In this way, the ground-truth semantic maps provide stronger supervision to guide the cross-view translation in the deep network.
Semantic-guided Cycle. Regmi and Borji  observed that the simultaneous generation of both the images and the semantic maps improves the generation performance. Along the same line, we propose a cycled semantic generation network to benefit more from the semantic information during learning. The conditional semantic map $S_g$ together with the input image $I_a$ are fed into the image generator $G_i$, which produces the synthesized image $I_g'$. Then $I_g'$ is further fed into the semantic generator $G_s$, which reconstructs a semantic map $S_g'$. We can formalize the process as $S_g' = G_s(G_i(I_a, S_g))$. The optimization objective is then to make $S_g'$ as close as possible to $S_g$, which naturally forms a semantic generation cycle, i.e. $S_g \rightarrow I_g' \rightarrow S_g' \approx S_g$. The two generators are explicitly connected by the ground-truth semantic maps, which provide extra constraints that help the generators learn the semantic structure consistency.
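As a concrete illustration, the cycled generation described above can be wired up as follows. This is a minimal NumPy sketch, not the paper's PyTorch implementation: the hypothetical weight matrices `W_i` and `W_s` are 1x1-conv-like linear maps standing in for the real U-Net generators; only the data flow (condition image + target semantic map → coarse image → reconstructed semantic map) and the L1 cycle loss mirror the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the image generator and the semantic generator:
# fixed 1x1-conv-like linear maps over channels (NOT the real U-Nets).
W_i = rng.normal(size=(3, 6))   # RGB output from concat(RGB image, 3-ch semantic map)
W_s = rng.normal(size=(3, 3))   # 3-channel semantic map from an RGB image

def image_generator(image, semantic):
    # channel-wise concatenation of condition image and target semantic map
    x = np.concatenate([image, semantic], axis=0)   # (6, H, W)
    return np.einsum('oc,chw->ohw', W_i, x)         # coarse target-view image

def semantic_generator(image):
    return np.einsum('oc,chw->ohw', W_s, image)     # reconstructed semantics

I_a = rng.normal(size=(3, 8, 8))    # source-view image
S_g = rng.normal(size=(3, 8, 8))    # target-view semantic map

I_g_coarse = image_generator(I_a, S_g)       # stage-I coarse synthesis
S_g_rec = semantic_generator(I_g_coarse)     # closes the cycle

# L1 term pulling the reconstructed semantic map back toward the input one
cycle_loss = np.abs(S_g_rec - S_g).mean()
```

In training, this loss is one term among several; here it only demonstrates how the two generators are chained through the ground-truth semantic map.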
Cascade Generation. Due to the complexity of the task, we observe that after the first stage the image generator outputs a coarse synthesis with blurred scene details and high pixel-level dissimilarity to the target-view images. This inspires us to explore a coarse-to-fine generation strategy to boost the synthesis performance on top of the coarse predictions. Cascade models have been used with great effectiveness in several other computer vision tasks, such as object detection  and semantic segmentation . In this paper, we introduce the cascade strategy to deal with the complex cross-view translation problem. Both stages contain a basic cycled semantic-guided generation sub-network, while the second stage additionally uses the proposed multi-channel attention selection module to better exploit the coarse outputs of the first stage and produce fine-grained final outputs. As shown in the experimental section, the proposed cascade strategy yields significant improvements.
An overview of the proposed multi-channel attention selection module is shown in Fig. 3. The module consists of a multi-scale spatial pooling and a multi-channel attention selection component.
Multi-Scale Spatial Pooling.
Since there exists a large object/scene deformation between the source view and the target view, a single-scale feature may not capture all the spatial information necessary for fine-grained generation. We therefore propose a multi-scale spatial pooling scheme, which applies global average pooling with a set of different kernel sizes and strides to the same input features. In this way we obtain multi-scale features with different receptive fields, each perceiving a different spatial context. More specifically, given the coarse inputs and the deep features produced in stage I, we first concatenate all of them into new features, denoted as $F_c$, for stage II:
$F_c = \mathrm{concat}(I_a, I_g', S_g', F_i, F_s)$, where $\mathrm{concat}(\cdot)$ denotes the channel-wise concatenation operation, and $F_i$ and $F_s$ are the features from the last convolution layers of the generators $G_i$ and $G_s$, respectively. We apply a set of spatial scales in the pooling, resulting in pooled features with different spatial resolutions. Different from the pooling scheme used in , which directly combines all the features after pooling, we first select each pooled feature via an element-wise multiplication with the input feature: since in our task the input features come from different sources, highly correlated features preserve more useful information for the generation. Let $\mathrm{pool}_k(\cdot)$ denote pooling at scale $k$ followed by an up-sampling operation that rescales the pooled feature back to the input resolution, and let $\otimes$ denote element-wise multiplication; the whole process can then be written as $F_c^k = \mathrm{pool}_k(F_c) \otimes F_c$.
Then the features are fed into a convolutional layer, which produces new multi-scale features for use in the multi-channel attention selection module.
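The pool-then-select step can be sketched as below. The scale set and the `avg_pool_upsample` helper are illustrative stand-ins (the paper's actual kernel sizes and strides are not reproduced here); what the sketch shows is the element-wise multiplication of each pooled-and-upsampled map with the input features, rather than plain concatenation of pooled maps.

```python
import numpy as np

def avg_pool_upsample(x, k):
    """Average-pool x (C, H, W) over k-by-k blocks, then nearest-neighbor
    upsample back to the input resolution."""
    c, h, w = x.shape
    pooled = x.reshape(c, h // k, k, w // k, k).mean(axis=(2, 4))
    return pooled.repeat(k, axis=1).repeat(k, axis=2)

rng = np.random.default_rng(0)
F_c = rng.normal(size=(4, 8, 8))   # concatenated stage-I features (hypothetical)

# For each scale: pool + upsample, then *select* via element-wise
# multiplication with the input feature (the correlation-based selection).
scales = [2, 4, 8]                 # assumed scale set for illustration
selected = [avg_pool_upsample(F_c, k) * F_c for k in scales]

# The original features plus one selected map per scale, channel-concatenated.
F_multi = np.concatenate([F_c] + selected, axis=0)   # (16, 8, 8)
```

The concatenated result would then pass through a convolution to produce the features consumed by the attention selection module.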
Multi-Channel Attention Selection. In previous cross-view image synthesis works, the image is generated only in a three-channel RGB space. We argue that this is not enough for the complex translation problem we are dealing with, and thus we explore using a larger generation space to have a richer synthesis via constructing multiple intermediate generations. Accordingly, we design a multi-channel attention mechanism to automatically perform spatial and temporal selection from the generations to synthesize a fine-grained final output.
Given the multi-scale feature volume $F_m \in \mathbb{R}^{H \times W \times C}$, where $H$ and $W$ are the height and width of the features and $C$ is the number of channels, we consider two directions. One is the generation of multiple intermediate image syntheses, and the other is the generation of multi-channel attention maps. To produce $N$ different intermediate generations $\{I_G^i\}_{i=1}^{N}$, a convolution operation with $N$ groups of filters $\{W_G^i\}$ is performed, followed by a non-linear activation. For the corresponding attention maps $\{A^i\}_{i=1}^{N}$, another group of filters $\{W_A^i\}$ is applied. The intermediate generations and the attention maps are then calculated as follows: $I_G^i = \phi(F_m \ast W_G^i)$ and $A^i = \mathrm{softmax}_c(F_m \ast W_A^i)$,
where $\mathrm{softmax}_c(\cdot)$ is a channel-wise softmax function used for normalization and $\phi(\cdot)$ is the non-linear activation. Finally, the learned attention maps are utilized to perform channel-wise selection from each intermediate generation as follows: $I_g'' = \sum_{i=1}^{N} A^i \otimes I_G^i$,
where $I_g''$ represents the final synthesized image, selected from the multiple diverse intermediate results, and the summation denotes element-wise addition of the attention-weighted generations. We also generate a final semantic map in the second stage, as in the first stage, i.e. $S_g'' = G_s(I_g'')$. Since the two semantic generators serve the same purpose, we use a single $G_s$ twice, sharing its parameters across both stages to reduce the network capacity.
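A minimal NumPy sketch of the selection mechanism: hypothetical random tensors stand in for the two convolution branches, and N candidate images are blended per pixel by softmax-normalized attention maps. Only the softmax normalization and weighted sum follow the description above.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
N, H, W = 10, 8, 8   # N intermediate generations (10 as in the experiments)

# Hypothetical outputs of the two convolution branches on the multi-scale
# features: N intermediate RGB images and N single-channel attention logits.
inter = np.tanh(rng.normal(size=(N, 3, H, W)))   # candidate generations in [-1, 1]
logits = rng.normal(size=(N, 1, H, W))

# Channel-wise softmax over the N maps: a convex combination at every pixel.
A = softmax(logits, axis=0)

# The attention maps select among the N candidates; summing the weighted
# candidates yields the final synthesis.
I_final = (A * inter).sum(axis=0)   # (3, H, W)
```

Because each pixel's weights sum to one, the final image stays within the value range of the candidates, which is why the selection acts as a blend rather than an unconstrained recombination.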
Uncertainty-guided Pixel Loss. As discussed in the introduction, the semantic maps obtained from the pretrained model are not accurate for all pixels, which leads to wrong guidance during training. To tackle this issue, we propose to use the generated attention maps to learn uncertainty maps that control the optimization loss. Uncertainty learning has been investigated in  for multi-task learning, and here we introduce it to handle the noisy semantic label problem. Assume that we have $M$ different loss maps which need guidance. The multiple generated attention maps are first concatenated and passed to a convolution layer with $M$ filters to produce a set of uncertainty maps. The reason for using the attention maps to generate the uncertainty maps is that the attention maps directly affect the final generation, and are thus closely connected with the loss. Let $\mathcal{L}_i$ denote a pixel-level loss map and $U_i$ denote the $i$-th uncertainty map; we then have $\mathcal{L}_i^{u} = \sigma(U_i) \otimes \mathcal{L}_i$,
where $\sigma(\cdot)$ is a sigmoid function for pixel-level normalization. The uncertainty map is learned automatically and acts as a weighting scheme to control the optimization loss.
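The uncertainty weighting reduces, per pixel, the contribution of unreliable loss terms. Below is a sketch under the assumption that one sigmoid-normalized map weights one L1 loss map; the 1x1 convolution over the concatenated attention maps is replaced by hypothetical random logits.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
H, W = 8, 8

pred_sem = rng.normal(size=(3, H, W))    # generated semantic map (hypothetical)
noisy_label = rng.normal(size=(3, H, W)) # possibly wrong pseudo-label

# Hypothetical per-pixel uncertainty logits: in the model they come from a
# convolution applied to the concatenated attention maps.
u_logits = rng.normal(size=(H, W))
U = sigmoid(u_logits)                    # uncertainty weights in (0, 1)

loss_map = np.abs(pred_sem - noisy_label).mean(axis=0)  # per-pixel L1 loss map
weighted = (U * loss_map).mean()   # pixels with low weight contribute less
```

Since every weight is strictly below one, the weighted loss is always smaller than the unweighted mean; the network can thus learn to down-weight pixels whose pseudo-labels it cannot fit.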
Parameter-Sharing Discriminator. We extend the vanilla discriminator of  to a parameter-sharing structure. In the first stage, the discriminator takes the real image $I_a$ paired with either the generated image $I_g'$ or the ground-truth image $I_g$ as input, and learns to tell whether a pair of images from different domains is associated with each other or not. In the second stage, it accepts the real image $I_a$ paired with either the refined image $I_g''$ or the real image $I_g$ as input. This pairwise input encourages $D$ to discriminate the diversity of image structure and capture local-aware information.
Adversarial Loss. In the first stage, the adversarial loss of $D$ for distinguishing the synthesized image pair $(I_a, I_g')$ from the real image pair $(I_a, I_g)$ is formulated as the standard conditional GAN objective $\mathcal{L}_{cGAN}^{(1)} = \mathbb{E}\left[\log D(I_a, I_g)\right] + \mathbb{E}\left[\log\left(1 - D(I_a, I_g')\right)\right]$. In the second stage, the same objective is applied to the refined output, i.e. $D$ distinguishes the synthesized pair $(I_a, I_g'')$ from the real pair $(I_a, I_g)$: $\mathcal{L}_{cGAN}^{(2)} = \mathbb{E}\left[\log D(I_a, I_g)\right] + \mathbb{E}\left[\log\left(1 - D(I_a, I_g'')\right)\right]$.
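Numerically, the paired adversarial objective behaves as in this toy sketch, where `d_real` and `d_fake` are hypothetical discriminator probabilities for real and synthesized (condition, image) pairs; the binary cross-entropy is the standard non-saturating GAN formulation, not code from the paper.

```python
import numpy as np

def bce(pred, target):
    """Binary cross-entropy between discriminator probabilities and a target."""
    eps = 1e-7
    pred = np.clip(pred, eps, 1.0 - eps)
    return -(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred)).mean()

# Hypothetical discriminator scores: D should score real pairs near 1
# and synthesized pairs near 0.
d_real = np.array([0.90, 0.80, 0.95])   # D(I_a, ground-truth image)
d_fake = np.array([0.10, 0.20, 0.05])   # D(I_a, generated image)

loss_D = bce(d_real, 1.0) + bce(d_fake, 0.0)   # discriminator objective
loss_G = bce(d_fake, 1.0)                      # generator tries to fool D
```

When the discriminator separates the two sets well (as in these scores), its own loss is small while the generator loss is large, which is exactly the gradient signal that pushes the generator toward more realistic pairs.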
|Direction|Method|Dayton (64×64): SSIM / PSNR / SD / KL|Dayton (256×256): SSIM / PSNR / SD / KL|CVUSA: SSIM / PSNR / SD / KL|
|a2g|Zhai et al.|- / - / - / -|- / - / - / -|0.4147* / 17.4886* / 16.6184* / 27.43 ± 1.63*|
|a2g|Pix2pix|0.4808* / 19.4919* / 16.4489* / 6.29 ± 0.80*|0.4180* / 17.6291* / 19.2821* / 38.26 ± 1.88*|0.3923* / 17.6578* / 18.5239* / 59.81 ± 2.12*|
|a2g|X-Fork|0.4921* / 19.6273* / 16.4928* / 3.42 ± 0.72*|0.4963* / 19.8928* / 19.4533* / 6.00 ± 1.28*|0.4356* / 19.0509* / 18.6706* / 11.71 ± 1.55*|
|a2g|X-Seq|0.5171* / 20.1049* / 16.6836* / 6.22 ± 0.87*|0.5031* / 20.2803* / 19.5258* / 5.93 ± 1.32*|0.4231* / 18.8067* / 18.4378* / 15.52 ± 1.73*|
|a2g|SelectionGAN (Ours)|0.6865 / 24.6143 / 18.2374 / 1.70 ± 0.45|0.5938 / 23.8874 / 20.0174 / 2.74 ± 0.86|0.5323 / 23.1466 / 19.6100 / 2.96 ± 0.97|
|g2a|Pix2pix|0.3675* / 20.5135* / 14.7813* / 6.39 ± 0.90*|0.2693* / 20.2177* / 16.9477* / 7.88 ± 1.24*|- / - / - / -|
|g2a|X-Fork|0.3682* / 20.6933* / 14.7984* / 4.45 ± 0.84*|0.2763* / 20.5978* / 16.9962* / 6.92 ± 1.15*|- / - / - / -|
|g2a|X-Seq|0.3663* / 20.4239* / 14.7657* / 7.20 ± 0.92*|0.2725* / 20.2925* / 16.9285* / 7.07 ± 1.19*|- / - / - / -|
|g2a|SelectionGAN (Ours)|0.5118 / 23.2657 / 16.2894 / 2.25 ± 0.56|0.3284 / 21.8066 / 17.3817 / 3.55 ± 0.87|- / - / - / -|
Overall Loss. The total optimization loss is a weighted sum of the above losses. The generators $G_i$ and $G_s$, the attention selection network $G_a$ and the discriminator $D$ are trained in an end-to-end fashion by optimizing the following min-max objective,
where $\mathcal{L}_{pix}$ uses the L1 reconstruction loss to separately compute the pixel losses between the generated images $I_g'$, $I_g''$, $S_g'$ and $S_g''$ and the corresponding real images, and $\mathcal{L}_{tv}$ is the total variation regularization  on the final synthesized image $I_g''$. The $\lambda$ coefficients are trade-off parameters controlling the relative importance of the different objectives. Training is performed by solving this min-max optimization problem.
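Of the loss terms above, the total variation regularizer is easy to make concrete. The sketch below uses the common anisotropic form (the paper may use a different variant): the sum of absolute differences between adjacent pixels, which penalizes high-frequency noise in the final synthesis.

```python
import numpy as np

def tv_loss(img):
    """Anisotropic total variation of an image tensor (C, H, W): the sum of
    absolute differences between vertically and horizontally adjacent pixels."""
    dh = np.abs(img[:, 1:, :] - img[:, :-1, :]).sum()
    dw = np.abs(img[:, :, 1:] - img[:, :, :-1]).sum()
    return dh + dw

flat = np.ones((3, 8, 8))   # perfectly smooth image: zero variation
noisy = flat + np.random.default_rng(0).normal(0.0, 0.5, size=flat.shape)
```

A constant image incurs no penalty while added noise raises it, so minimizing this term alongside the pixel loss smooths artifacts without requiring any ground truth.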
Network Architecture. For a fair comparison, we employ U-Net  as the architecture of our generators $G_i$ and $G_s$. U-Net is a network with skip connections between a down-sampling encoder and an up-sampling decoder; such an architecture retains both contextual and textural information, which is crucial for removing artifacts and filling in textures. Since our focus is on the cross-view image generation task, $G_i$ is more important than $G_s$; thus we use a deeper network for $G_i$ and a shallower one for $G_s$. Specifically, the numbers of filters in the first convolutional layers of $G_i$ and $G_s$ are 64 and 4, respectively. For the attention network $G_a$, the convolutions generating the intermediate images and the attention maps use different kernel sizes. We adopt PatchGAN  for the discriminator $D$.
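The skip connections that let U-Net retain textural detail can be illustrated in a few lines. This is a schematic sketch, not the actual generator: average pooling and nearest-neighbor upsampling are stand-ins for the real strided and transposed convolutions.

```python
import numpy as np

def down(x):
    """2x spatial downsampling by average pooling (stand-in for a strided conv)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def up(x):
    """2x nearest-neighbor upsampling (stand-in for a transposed conv)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

x = np.random.default_rng(0).normal(size=(4, 8, 8))

skip = x                    # encoder feature saved for the skip path
bottleneck = down(x)        # compressed representation, loses fine detail
decoded = up(bottleneck)    # decoder output alone is blurry

# The U-Net skip connection: decoder features are channel-concatenated with
# the matching encoder features, restoring high-frequency information.
out = np.concatenate([decoded, skip], axis=0)   # (8, 8, 8)
```

The concatenated tensor carries both the decoded context and the untouched encoder detail, which is why the following convolution can recover textures that the bottleneck discarded.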
Training Details. Following , we use RefineNet  and a second pretrained segmentation model  to generate segmentation maps on the Dayton and Ego2Top datasets, respectively, as training data. We follow the optimization method in  to optimize the proposed SelectionGAN, i.e. alternating one gradient descent step on the discriminator and one on the generators. We first train $G_i$, $G_s$ and $G_a$ with $D$ fixed, and then train $D$ with $G_i$, $G_s$ and $G_a$ fixed. The proposed SelectionGAN is trained and optimized in an end-to-end fashion. We employ the Adam solver  with momentum terms $\beta_1$ and $\beta_2$. The initial learning rate for Adam is 0.0002. The network weights are initialized with the Xavier strategy.
Top-k prediction accuracy:
|Dir.|Method|Dayton (64×64): Accuracy (%)|Dayton (256×256): Accuracy (%)|CVUSA: Accuracy (%)|
|a2g|Zhai et al.|- / - / - / -|- / - / - / -|13.97* / 14.03* / 42.09* / 52.29*|

Inception Score:
|Dir.|Method|Dayton (64×64)|Dayton (256×256)|CVUSA|
|a2g|Zhai et al.|- / - / -|- / - / -|1.8434* / 1.5171* / 1.8666*|
Datasets. We perform experiments on three datasets: (i) For the Dayton dataset , following the same setting as , we select 76,048 images and create a train/test split of 55,000/21,048 pairs. The images in the original dataset are resized for our experiments; (ii) The CVUSA dataset  consists of 35,532/8,884 image pairs in the train/test split. Following [48, 34], the aerial images are center-cropped and then resized. For the ground-level images and corresponding segmentation maps, we take the first quarter of both and resize them; (iii) The Ego2Top dataset  is more challenging and contains different indoor and outdoor conditions. Each case contains one top-view video and several egocentric videos captured by the people visible in the top-view camera. The dataset has more than 230,000 frames. For training, we randomly select 386,357 pairs, each composed of two images of the same scene from different viewpoints. We randomly select 25,600 pairs for evaluation.
Parameter Settings. For a fair comparison, we adopt the same training setup as in [16, 34]. All images are scaled to the same resolution, and we enable image flipping and random crops for data augmentation. Similar to , the low-resolution (64×64) experiments on the Dayton dataset are carried out for 100 epochs with a batch size of 16, whereas the high-resolution (256×256) experiments on this dataset are trained for 35 epochs with a batch size of 4. For the CVUSA dataset, we follow the same setup as in [48, 34] and train our network for 30 epochs with a batch size of 4. For the Ego2Top dataset, all models are trained for 10 epochs with a batch size of 8. The trade-off parameters in Eqs. (8) and (9) are kept fixed across all experiments, and the number of attention channels $N$ in Eq. (3) is set to 10. The proposed SelectionGAN is implemented in PyTorch. We perform our experiments on an Nvidia GeForce GTX 1080 Ti GPU with 11GB of memory to accelerate both training and inference.
Evaluation Protocol. Similar to , we employ Inception Score, top-k prediction accuracy and KL score for the quantitative analysis. These metrics evaluate the generated images from a high-level feature space. We also employ pixel-level similarity metrics to evaluate our method, i.e. Structural-Similarity (SSIM), Peak Signal-to-Noise Ratio (PSNR) and Sharpness Difference (SD).
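Among the pixel-level metrics, PSNR is straightforward to make concrete; a short sketch is given below (SSIM and Sharpness Difference follow their standard definitions and are omitted for brevity). The image contents are synthetic stand-ins, not data from the experiments.

```python
import numpy as np

def psnr(a, b, peak=255.0):
    """Peak Signal-to-Noise Ratio (dB) between two images in [0, 255]."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(0)
gt = rng.integers(0, 256, size=(64, 64, 3)).astype(np.float64)   # "ground truth"
pred = np.clip(gt + rng.normal(0.0, 5.0, size=gt.shape), 0, 255) # "generation"

score = psnr(gt, pred)   # roughly 34 dB for ~5-gray-level Gaussian noise
```

Higher PSNR means smaller pixel-wise error; unlike the Inception Score, it rewards faithfulness to the ground-truth image rather than general realism, which is why the paper reports both kinds of metrics.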
|Method|SSIM|PSNR|SD|Inception Score|Accuracy (%)|KL Score|
|Pix2pix|0.2213|15.7197|16.5949|2.5418 / 1.6797 / 2.4947|1.22 / 1.57 / 5.33 / 6.86|120.46 ± 1.94|
|X-Fork|0.2740|16.3709|17.3509|4.6447 / 2.1386 / 3.8417|5.91 / 10.22 / 20.98 / 30.29|22.12 ± 1.65|
|X-Seq|0.2738|16.3788|17.2624|4.5094 / 2.0276 / 3.6756|4.78 / 8.96 / 17.04 / 24.40|25.19 ± 1.73|
|SelectionGAN (Ours)|0.6024|26.6565|19.7755|5.6200 / 2.5328 / 4.7648|28.31 / 54.56 / 62.97 / 76.30|3.05 ± 0.91|
Baseline Models. We conduct an ablation study in the a2g (aerial-to-ground) direction on the Dayton dataset. To reduce the training time, we randomly select 1/3 of the samples from the full 55,000/21,048 split, i.e. around 18,334 samples for training and 7,017 for testing. The proposed SelectionGAN considers eight baselines (A, B, C, D, E, F, G, H), as shown in Table 4. Baseline A uses a Pix2pix structure  and generates $I_g'$ from the single image $I_a$. Baseline B uses the same Pix2pix model and generates $I_g'$ from the corresponding semantic map $S_g$. Baseline C also uses the Pix2pix structure, but inputs the combination of the conditional image $I_a$ and the target semantic map $S_g$ to the generator $G_i$. Baseline D adds the proposed cycled semantic generation on top of Baseline C. Baseline E adds the pixel loss guided by the learned uncertainty maps. Baseline F employs the proposed multi-channel attention selection module to produce multiple intermediate generations and lets the network attentively select which parts are most important for generating the scene image in the new viewpoint. Baseline G adds the total variation regularization on the final result $I_g''$. Baseline H employs the proposed multi-scale spatial pooling module to refine the features from stage I. All the baseline models are trained and tested on the same data using the same configuration.
Ablation Analysis. The results of the ablation study are shown in Table 4. We observe that Baseline B is better than Baseline A, since $S_g$ contains more structural information than $I_a$. Comparing Baseline A with Baseline C, the semantic-guided generation improves SSIM, PSNR and SD by 8.19, 3.1771 and 0.3205 points, respectively, confirming the importance of the conditional semantic information. By using the proposed cycled semantic generation, Baseline D further improves over C, meaning that the proposed semantic cycle structure indeed utilizes the semantic information in a more effective way, confirming our design motivation. Baseline E outperforms D, showing the importance of using the uncertainty maps to guide the pixel loss map, which contains inaccurate reconstruction terms due to the wrong semantic labels produced by the pretrained segmentation model. Baseline F significantly outperforms E, with a gain of around 4.67 points on the SSIM metric, clearly demonstrating the effectiveness of the proposed multi-channel attention selection scheme. We can also observe from Table 4 that adding the proposed multi-scale spatial pooling scheme and the TV regularization further boosts the overall performance. Finally, we demonstrate the advantage of the proposed two-stage strategy over the one-stage method. Several examples are shown in Fig. 5; the coarse-to-fine generation model clearly produces sharper results with more details than the one-stage model.
State-of-the-art Comparisons. We compare our SelectionGAN with four recently proposed state-of-the-art methods: Pix2pix , Zhai et al. , X-Fork  and X-Seq . The comparison results are shown in Tables 1, 2, 3 and 5, where SelectionGAN shows significant improvements. It consistently outperforms Pix2pix, Zhai et al., X-Fork and X-Seq on all metrics except the Inception Score: in a few cases in Table 3 we achieve slightly lower performance than X-Seq. However, we generate much more photo-realistic results than X-Seq, as shown in Figs. 4 and 6.
Qualitative Evaluation. Qualitative results at the higher resolution on the Dayton and CVUSA datasets are shown in Figs. 4 and 6. Our method generates clearer details on objects and scenes, such as roads, trees, clouds and cars, than the competing methods in the generated ground-level images. For the generated aerial images, grass, trees and house roofs are rendered better than by the other methods. Moreover, the results generated by our method are closer to the ground truth in layout and structure, e.g. the results in the a2g direction in Figs. 4 and 6.
Arbitrary Cross-View Image Translation. Since the Dayton and CVUSA datasets only contain two views per scene, i.e. aerial and ground views, we further use the Ego2Top dataset to conduct arbitrary cross-view image translation experiments. The quantitative and qualitative results are shown in Table 5 and Fig. 7, respectively. Given an image and some novel semantic maps, SelectionGAN is able to generate the same scene from different viewpoints.
We propose the Multi-Channel Attention Selection GAN (SelectionGAN) to address a novel image synthesis task, conditioning on a reference image and a target semantic map. In particular, we adopt a cascade strategy that divides the generation procedure into two stages: Stage I aims to capture the semantic structure of the scene, and Stage II focuses on appearance details via the proposed multi-channel attention selection module. We also propose an uncertainty-map-guided pixel loss to cope with inaccurate semantic labels for better optimization. Extensive experimental results on three public datasets demonstrate that our method obtains much better results than the state-of-the-art methods.
Acknowledgements: This research was partially supported by National Institute of Standards and Technology Grant 60NANB17D191 (YY, JC), Army Research Office W911NF-15-1-0354 (JC) and gift donation from Cisco Inc (YY).
Joint cascade face detection and alignment. In ECCV, 2014.
The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
Conditional image synthesis with auxiliary classifier GANs. In ICML, 2017.
We investigate the influence of the number of attention channels $N$ in Equation 3 of the main paper. Results are shown in Table 6. We observe that the performance tends to be stable after $N = 10$. Thus, taking both performance and training speed into consideration, we set $N = 10$ in all our experiments.
We provide more comparison results of the coarse-to-fine generation in Table 7 and Figures 8, 9 and 10. We observe that our two-stage method generates visually much better results than the one-stage model, which further confirms our motivation.
|Baseline|Stage I|Stage II|SSIM|PSNR|SD|
In Figures 8, 9, 10 and 11, we show samples of the generated uncertainty maps. We can see that the generated uncertainty maps learn the layout and structure of the target images. Note that most textured regions are similar across our generated images, while the junctions and edges between different regions are uncertain; the model therefore learns to highlight these parts.
We also conducted the arbitrary cross-view image translation experiments on Ego2Top dataset. As we can see from Figure 11, given an image and some novel semantic maps, SelectionGAN is able to generate the same scene but with different viewpoints in both outdoor and indoor environments.
Since the proposed SelectionGAN can generate segmentation maps, we also compare it with X-Fork  and X-Seq  on the Dayton dataset. Following the same protocol, we compute per-class accuracies and the mean IoU for the most common classes in this dataset, i.e. "vegetation", "road", "building" and "sky", in the ground segmentation maps. Results are shown in Table 8. The proposed SelectionGAN achieves better results than X-Fork  and X-Seq  on both metrics.
In Figures 12, 13, 14, 15 and 16, we show more image generation results on the Dayton, CVUSA and Ego2Top datasets compared with the state-of-the-art methods, i.e. Pix2pix , X-Fork  and X-Seq . For Figures 12, 13, 14 and 15, we reproduced the results of Pix2pix , X-Fork  and X-Seq  using the pre-trained models provided by the authors (https://github.com/kregmi/cross-view-image-synthesis). As we can see from all these figures, the proposed SelectionGAN achieves significantly better visual results than the competing methods.