The conditional generative adversarial network (cGAN) [conditionalGAN] has recently made substantial progress in realistic image synthesis. In cGAN, a generator aims to output a realistic image with a constraint implicitly encoded by . Conversely, a discriminator learns such a constraint from ground-truth pairs by predicting whether is real or generated.
Current cGAN models [spade, pix2pixhd, pix2pix] for semantic image synthesis aim to satisfy a structural consistency constraint, where the output image is required to align with a semantic label map . A limitation of this generative process is that the style of the output image is inherently determined by the model and thus cannot be controlled by users. To provide controllability over the generated styles, previous studies [example_cvpr18, example_cvpr19] impose additional constraints and allow more inputs to the generator: , where is an exemplar image that guides the style of . However, these studies are designed for datasets of faces [liu2015deep, rossler2018faceforensics], dancing [example_cvpr19] or street views [yu2018bdd100k], where the input images usually contain similar semantics and the spatial structures of and are usually similar as well.
Different from the previous studies, we propose to address a more challenging example-guided scene image generation task. As shown in Fig. 1, given a semantic label map (column 1) and an arbitrary scene image (column 2) with its semantic map (column 3) as the input, the task aims to generate a new scene image (column 4) that matches the semantic structure of and the scene style of . The challenge is that scene images have complex semantic structures as well as diversified scene styles, and more importantly, the inputs and are structurally uncorrelated and semantically unaligned. Therefore, a mechanism is required to better match the structures and semantics for coherent outputs, e.g., the tree styles can be applied to mountains but cannot be applied to sky.
In this paper, we propose a novel Masked Spatial-Channel Attention (MSCA) module (Section 3.2) to propagate features across unstructured scenes. Our module is inspired by a recent work [doubleattention] for attention-based object recognition, but we apply a new cross-attention approach to model the semantic correspondence for image synthesis instead. To facilitate example-guided synthesis, we further improve the module by including: i) feature masking for semantic outlier filtering, ii) multi-scaling for global and local feature processing, and iii) resolution extending for image synthesis. As a result, our module provides both clear physical meaning and interpretability for the example-guided synthesis task.
We formulate the proposed approach under a unified synthesis network for joint feature extraction, alignment and image synthesis. We achieve this by applying MSCA modules to the extracted features for multi-scale feature domain alignment. Next, we apply a recent feature normalization technique, SPADE [spade], on the aligned features to allow spatially-controllable synthesis. To facilitate the learning of this network, we propose a novel patch-based self-supervision scheme. As opposed to [example_cvpr19], our scheme requires only semantically parsed images for training and does not rely on video data. We show that a model trained with this approach generalizes over scales and across different scene semantics.
Our main contributions include the following:
A novel masked spatial-channel attention (MSCA) module to propagate features for unstructured scenes.
A unified synthesis network for joint feature extraction, alignment and image synthesis.
A novel patch-based self-supervision scheme that requires only annotated images for training.
Experiments on the COCO-stuff [cocostuff] dataset that show significant improvements over existing methods. Moreover, our model provides interpretability and can be extended to other content manipulation tasks.
2 Related work
Generative Adversarial Networks. Recent years have witnessed the progress of generative adversarial networks (GANs) [gan] for image synthesis. A GAN model consists of a generator and a discriminator, where the generator serves to produce realistic images that the discriminator cannot distinguish from real ones. Recent techniques for realistic image synthesis include modified losses [wasserstein, lsgan, improved], model regularization [sn], self-attention [sagan, largescalegan], feature normalization [stylegan] and progressive synthesis [progressivegan].
Image-to-Image translation (I2I). I2I translation aims to translate images from a source domain to a target domain. The initial work of Isola et al. [pix2pix] proposes a conditional GAN framework to learn I2I translation with paired images. Wang et al. [pix2pixhd] improve the conditional GAN for high-resolution synthesis and content manipulation. To enable I2I translation without using paired data, a few works [cyclegan, liu2017unsupervised, munit, drit, pairedcyclegan] apply the cycle consistency constraint in training. Recent works on photo-realistic image synthesis take semantic label maps as inputs. Specifically, Wang et al. [pix2pixhd] extend the conditional GAN for high-resolution synthesis, and Chen et al. [CRN] propose a cascaded refinement pipeline. More recently, Park et al. [spade] propose spatially-adaptive normalization for realistic scene image generation.
Example-Guided Style Transfer and Synthesis. Example-guided style transfer [image_analogies, Image_quilting] aims to transfer the style of an example image to a target image. More recent works [gatys2016image, adaptive_instance_normalization, phototransfer, johnson2016perceptual, deep_image_analogy, feature_shuffle, pairedcyclegan, video_style_transfer, wct2] utilize deep neural network features to model and transfer styles. Several frameworks [munit, huang2018multimodal, example_iclr19] perform style transfer via image domain style and content disentanglement. In addition, domain adaptation [pairedcyclegan] applies a cycle consistency loss to cross-domain style transformation.
More recently, example-guided synthesis [example_cvpr18, example_cvpr19] is proposed to transfer the style of an example image to a target condition, e.g. a semantic label map. Specifically, Lin et al. [example_cvpr18] apply dual learning to disentangle the style for guided synthesis, and Wang et al. [example_cvpr19] extract style-consistent data pairs from videos for model training. In addition, Park et al. [spade] adapt I2I networks into self-encoding versions for example-guided style transfer. Different from [example_cvpr18, example_cvpr19, spade], we address spatial alignment of complex scenes for better style integration in multiple regions of an image. Furthermore, our patch-based self-supervision learning scheme does not require video data and is a general version of self-encoding.
Correspondence Matching for Synthesis. Finding correspondence is critical for many synthesis tasks. For instance, Siarohin et al. [Deformable] apply affine transformations to reference person images to improve pose-guided person image synthesis, and Wang et al. [vid2vid] use optical flow to align frames for coherent video synthesis. However, affine transformations and optical flow cannot adequately model the correspondences between two arbitrary scenes.
The recent self-attention [wang2018non, sagan] can capture general pair-wise correspondences. However, self-attention is computationally intensive at high resolution. Later, Chen et al. [doubleattention] propose to factorize self-attention for efficient video classification. Inspired by [doubleattention], we propose an attention-based module named MSCA. It is worth noting that MSCA is based on cross-attention and feature masking for spatial alignment and image synthesis.
The proposed approach aims to generate scene images that align with given semantic maps. Different from conventional semantic image synthesis methods [pix2pix, pix2pixhd, spade], our model takes an exemplary scene as an extra input to provide more controllability over the generated scene image. Unlike existing exemplar-based approaches [example_cvpr18, example_cvpr19], our model addresses the more challenging case where the exemplary inputs are structurally and semantically unaligned with the given semantic map.
Our method takes a semantic map , a reference image and its corresponding semantic map as inputs and synthesizes an image which matches the style of and structure of using a generator , . As shown in Fig. 2, the generator consists of three parts, namely i) feature extraction, ii) feature alignment and iii) image synthesis. In Sec. 3.1, we describe the first part that extracts features from inputs of both scenes. In Sec. 3.2, we propose a masked spatial-channel attention (MSCA) module to distill features and discover relations between two arbitrarily structured scenes. Unlike affine transformation [stn] and flow-based warping [vid2vid], MSCA provides better interpretability for the scene alignment task. In Sec. 3.3, we introduce how to use the aligned features for image synthesis. Finally, in Sec. 3.4, we propose a patch-based self-supervision scheme to facilitate learning.
3.1 Feature Extraction
Taking an image and label maps as inputs, the feature extraction module extracts multi-scale feature maps for each input. Specifically, the feature map of image at scale is computed by:
where denotes the convolution operation, denotes the feature map extracted by VGG-19 [vgg] at scale , and denotes a convolutional kernel for feature compression. is the number of scales, and we set in this paper.
For label map , its feature is computed by:
where denotes bilinear interpolation, denotes the resized label map, denotes a convolutional kernel for feature extraction, and operation denotes channel-wise concatenation. Note that as scale decreases from down to , the feature resolutions in Eq. 2 are progressively increased to match finer label maps .
Similarly, applying Eq. 2 with the same weights to label map , we can extract its features :
3.2 Masked Spatial-Channel Attention Module
As shown in Fig. 3, taking the image features and the label map features , as inputs, the MSCA module generates a new image feature map that has the content of but is aligned with . (Footnote 1: we assume that the spatial resolution at scale is and that the channel sizes of , , are , respectively.) We elaborate the detailed procedures as follows:
Spatial Attention. Given feature maps of the exemplar scene, the module first computes a spatial attention tensor:
with denoting a convolutional filter and denoting a 2D softmax function on spatial dimensions . The output tensor contains attention maps of resolution , which serve to attend to different spatial regions on image feature .
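The 2D softmax described above can be illustrated with a small NumPy sketch that normalizes each attention map over all spatial positions. The 1x1 convolution (written as a matrix product), the tensor shapes, and the names `spatial_attention`, `feats` and `filt` are assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np

def spatial_attention(features, weight):
    """Compute k spatial attention maps from a (C, H, W) feature map.

    `weight` has shape (k, C) and stands in for a 1x1 convolution
    (an assumption; the kernel size is not stated in the text).
    """
    c, h, w = features.shape
    logits = weight @ features.reshape(c, h * w)          # (k, H*W)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    exp = np.exp(logits)
    attn = exp / exp.sum(axis=1, keepdims=True)           # softmax over H*W
    return attn.reshape(-1, h, w)                          # (k, H, W)

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 4, 5))   # hypothetical C=8, H=4, W=5
filt = rng.standard_normal((3, 8))       # hypothetical k=3 attention maps
attention = spatial_attention(feats, filt)
```

Each of the resulting maps is non-negative and sums to one over the spatial grid, so it can be read as a soft region selector.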
Spatial Aggregation. Then, the module aggregates feature vectors from using the spatial attention maps of from Eq. 4. Specifically, a matrix dot product is performed:
with and denoting the reshaped versions of and , respectively. The output stores feature vectors spatially aggregated from the independent regions of .
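A minimal sketch of this aggregation step, assuming the attention tensor has shape (k, H, W) and the image feature has shape (C, H, W); the function and variable names are hypothetical. With a single constant attention map, the product reduces to a per-channel spatial mean, which the toy input makes easy to see.

```python
import numpy as np

def spatial_aggregate(attn, features):
    """Aggregate k feature vectors from a (C, H, W) feature map.

    attn: (k, H, W) maps, each summing to 1 over spatial positions.
    Returns a (k, C) matrix; row j is the attention-weighted average
    of the feature vectors under attention map j.
    """
    k = attn.shape[0]
    c = features.shape[0]
    a = attn.reshape(k, -1)        # (k, H*W)
    f = features.reshape(c, -1)    # (C, H*W)
    return a @ f.T                 # (k, C)

feats = np.arange(24, dtype=float).reshape(2, 3, 4)   # hypothetical C=2
uniform = np.full((1, 3, 4), 1.0 / 12)                 # one constant map
pooled = spatial_aggregate(uniform, feats)             # per-channel means
```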
Feature Masking. The exemplar scene may contain irrelevant semantics to the label map , and conversely, may contain semantics that are unrelated to . To address this issue, we apply feature masking on the output of Eq. 5 by multiplying with a length- gating vector at each row:
where denotes a 2-layer MLP followed by a sigmoid function, denotes a global average pooling layer, denotes broadcast element-wise multiplication, and denotes the masked features. The design of feature masking in Eq. 6 resembles Squeeze-and-Excitation [SEnet]. By integrating global information from label maps and , irrelevant features are filtered out.
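The gating idea can be sketched as follows. Several details here are assumptions rather than facts from the text: the pooled label features of both scenes are concatenated before the MLP, the hidden activation is a ReLU, and all shapes and names (`feature_mask`, `w1`, `w2`) are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def feature_mask(agg, label_feat_a, label_feat_b, w1, w2):
    """Gate each of the k aggregated feature vectors with a score in [0, 1].

    agg: (k, C) aggregated features; label_feat_*: (C_l, H, W) label features.
    w1: (hidden, 2*C_l) and w2: (k, hidden) are the weights of a hypothetical
    2-layer MLP with a ReLU in between (the activation is an assumption).
    """
    pooled = np.concatenate([label_feat_a.mean(axis=(1, 2)),
                             label_feat_b.mean(axis=(1, 2))])  # global avg pool
    hidden = np.maximum(w1 @ pooled, 0.0)
    gate = sigmoid(w2 @ hidden)             # (k,) one score per feature vector
    return gate[:, None] * agg, gate        # broadcast the gate over channels

rng = np.random.default_rng(0)
agg = rng.standard_normal((3, 8))                       # hypothetical k=3, C=8
la = rng.standard_normal((4, 5, 5))                     # exemplar label features
lb = rng.standard_normal((4, 5, 5))                     # target label features
masked, gate = feature_mask(agg, la, lb,
                            rng.standard_normal((6, 8)),
                            rng.standard_normal((3, 6)))
```

A gate value close to zero suppresses the corresponding regional feature vector, which is how semantics without a counterpart in the other scene would be dropped.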
Channel Attention. Given feature of label map , a channel attention tensor is generated as follows:
with denoting a convolutional filter and denoting a softmax function on channel dimension. The output serves to dynamically reuse features from .
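In contrast to the spatial softmax, the channel softmax normalizes across the attention channels at every pixel. The sketch below assumes a 1x1 convolution and uses hypothetical shapes and names.

```python
import numpy as np

def channel_attention(label_features, weight):
    """Compute a (k, H, W) tensor that sums to 1 over k at every location.

    weight: (k, C_l), standing in for a 1x1 convolution (an assumption).
    """
    c, h, w = label_features.shape
    logits = weight @ label_features.reshape(c, h * w)    # (k, H*W)
    logits = logits - logits.max(axis=0, keepdims=True)   # numerical stability
    exp = np.exp(logits)
    attn = exp / exp.sum(axis=0, keepdims=True)           # softmax over k
    return attn.reshape(-1, h, w)

rng = np.random.default_rng(0)
label_feats = rng.standard_normal((4, 3, 3))   # hypothetical C_l=4
chan_attn = channel_attention(label_feats, rng.standard_normal((2, 4)))
```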
Channel Aggregation. With channel attention computed in Eq. 7, feature vectors at spatial locations are aggregated again from via matrix dot product:
where denotes the reshaped version of . The output represents the aggregated features at locations. The output feature map is generated by reshaping to size .
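The second aggregation can be sketched as one more matrix product, assuming masked features of shape (k, C) and channel attention of shape (k, H, W); names and shapes are hypothetical. The toy example uses one-hot attention so that each output pixel simply copies one of the stored vectors.

```python
import numpy as np

def channel_aggregate(masked, chan_attn):
    """Redistribute k feature vectors over the H x W grid.

    masked: (k, C) masked feature vectors; chan_attn: (k, H, W).
    Returns a (C, H, W) feature map: at each location, a mixture of the
    k vectors weighted by that location's channel attention.
    """
    k, h, w = chan_attn.shape
    out = masked.T @ chan_attn.reshape(k, h * w)   # (C, H*W)
    return out.reshape(-1, h, w)

vectors = np.array([[1.0, 0.0], [0.0, 2.0]])   # hypothetical k=2, C=2
attn = np.zeros((2, 1, 2))
attn[0, 0, 0] = 1.0    # left pixel uses vector 0
attn[1, 0, 1] = 1.0    # right pixel uses vector 1
aligned = channel_aggregate(vectors, attn)
```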
Remarks. Spatial attention (Eq. 4) and aggregation (Eq. 5) attend to independent regions from feature , then store the features into . After feature masking, given a new label map , channel attention (Eq. 7) and aggregation (Eq. 8) combine at each location to compute an output feature map. As a result, each output location either finds its corresponding regional features or is ignored via feature masking. In this way, the feature of the example scene is aligned. Note that when and is constant, the above operations are essentially a global average pooling. We show in the experiments that is sufficient to dynamically capture visually significant scene regions for alignment.
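The global-average-pooling remark can be checked with a short derivation. It assumes the condition in the remark is a single attention map with constant spatial attention. The notation is introduced here for illustration only: $f_p$ is the exemplar feature vector at spatial position $p$, $N$ is the number of positions, $a_p$ is the spatial attention, $c$ is the constant attention logit, $g$ is the masking score, and $b_q$ is the channel attention at output position $q$.

```latex
% Constant logits give a uniform softmax over the N positions:
a_p = \frac{e^{c}}{\sum_{p'=1}^{N} e^{c}} = \frac{1}{N}

% Spatial aggregation then reduces to the mean feature:
v = \sum_{p=1}^{N} a_p f_p = \frac{1}{N} \sum_{p=1}^{N} f_p

% A softmax over a single channel is identically 1, so every output position q receives
b_q = 1, \qquad o_q = b_q \, g \, v = g \cdot \frac{1}{N} \sum_{p=1}^{N} f_p
```

Under these assumptions every output location receives the same gated global average of the exemplar features.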
Multi-scaling. Both global color tone and local appearances are informative for style-constrained synthesis. Therefore, we apply MSCA modules at all scales to generate global and local features .
3.3 Image Synthesis
The extracted features in Sec. 3.1 capture the semantic structure of , whereas the aligned features in Sec. 3.2 capture the appearance style of the example scene. In this section, we leverage and as control signals to generate output images with desired structures and styles.
Specifically, we adopt a recent synthesis model, SPADE [spade], and feed the concatenation of and to the spatially-adaptive denormalization layer of SPADE at each scale. By taking the style and structure signals as inputs, spatially-controllable image synthesis is achieved. We refer readers to the appendix for more network details of the synthesis module.
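For readers unfamiliar with spatially-adaptive denormalization, the sketch below shows the general idea from [spade]: activations are normalized per channel and then scaled and shifted by per-pixel maps predicted from a control signal. It is a simplification under stated assumptions: statistics are computed over a single sample, the modulation maps come from 1x1 projections, and all names and shapes are hypothetical.

```python
import numpy as np

def spade_modulate(activ, cond, w_gamma, w_beta, eps=1e-5):
    """Normalize `activ` per channel, then scale and shift it per pixel.

    activ: (C, H, W) activations; cond: (C_c, H, W) control signal.
    w_gamma, w_beta: (C, C_c) weights standing in for the convolutions
    that predict the modulation maps (1x1 kernels are an assumption).
    """
    mean = activ.mean(axis=(1, 2), keepdims=True)
    std = activ.std(axis=(1, 2), keepdims=True)
    normed = (activ - mean) / (std + eps)
    gamma = np.einsum("oc,chw->ohw", w_gamma, cond)   # per-pixel scale
    beta = np.einsum("oc,chw->ohw", w_beta, cond)     # per-pixel shift
    return gamma * normed + beta

rng = np.random.default_rng(0)
activ = rng.standard_normal((4, 6, 6))
cond = np.ones((1, 6, 6))                 # trivial control signal
out = spade_modulate(activ, cond, np.ones((4, 1)), np.zeros((4, 1)))
```

With the trivial control signal used here the scale is one and the shift is zero everywhere, so the output is just the normalized activation; a spatially varying control signal would modulate each pixel differently.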
3.4 Patch-Based Self-Supervision
Training a synthesis model requires style-consistent scene pairs. However, paired scenes are hard to acquire. To overcome the issue, we propose a patch-based self-supervision scheme which enables training.
Our basic assumption is that if patches and come from the same scene, they share the same style. Consequently, using patch as exemplar, both and the other patch can be reconstructed, i.e. self-reconstruction and cross-reconstruction. More formally, we sample non-overlapping patches and at locations and from the same scene . To enable training, four images are synthesized in one training step:
and compared against ground truths . An illustrative example is shown in Fig. 4. Note that the patches do not necessarily share the same semantics, and our model is required to complete example-missing regions with reasonable content through learning. Our training objective is adopted from [spade]. However, we apply a pixel-domain loss to encourage color consistency. In our implementation, the generation processes in Eq. 9 share the same feature extraction, spatial attention and channel attention computation to reduce the memory footprint during training.
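The four reconstructions per training step can be enumerated with a few lines of Python. This is only a bookkeeping sketch; the patch identifiers `"p"` and `"q"` and the function name are hypothetical.

```python
from itertools import product

def training_pairs(patches):
    """List every (exemplar, target) combination of the given patches.

    For two patches p and q from one scene this yields four reconstructions:
    p->p and q->q (self-reconstruction), p->q and q->p (cross-reconstruction).
    """
    pairs = []
    for exemplar, target in product(patches, repeat=2):
        kind = "self" if exemplar == target else "cross"
        pairs.append((exemplar, target, kind))
    return pairs

pairs = training_pairs(["p", "q"])
for exemplar, target, kind in pairs:
    print(f"exemplar={exemplar} target={target} ({kind}-reconstruction)")
```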
Dataset Our model is trained on the COCO-stuff dataset [cocostuff]. It contains densely annotated images captured from various scenes. We remove indoor images and images of random objects from the training/validation set, resulting in /499 scene images for training/testing.
During training, we resize images to then crop two non-overlapping patches to facilitate patch-based self-supervision. The two patches are cropped either in the left and right halves of the image, or alternatively in the top and bottom halves.
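A possible way to crop the two patches is sketched below. The image size, the square patch shape, and the equal probability of the two split directions are assumptions; the text only states that the patches come from the left/right or the top/bottom halves.

```python
import numpy as np

def crop_two_patches(image, patch, rng):
    """Crop two non-overlapping square patches from an (H, W, ...) array.

    One patch comes from each half of the image; the split is left/right
    or top/bottom with equal probability (an assumption).
    """
    h, w = image.shape[:2]
    if rng.random() < 0.5:                       # left / right halves
        assert patch <= h and patch <= w // 2
        y1, y2 = rng.integers(0, h - patch + 1, size=2)
        x1 = rng.integers(0, w // 2 - patch + 1)
        x2 = rng.integers(w // 2, w - patch + 1)
    else:                                        # top / bottom halves
        assert patch <= w and patch <= h // 2
        x1, x2 = rng.integers(0, w - patch + 1, size=2)
        y1 = rng.integers(0, h // 2 - patch + 1)
        y2 = rng.integers(h // 2, h - patch + 1)
    first = image[y1:y1 + patch, x1:x1 + patch]
    second = image[y2:y2 + patch, x2:x2 + patch]
    return (first, (int(y1), int(x1))), (second, (int(y2), int(x2)))

image = np.arange(64 * 64).reshape(64, 64)   # hypothetical 64x64 image
rng = np.random.default_rng(0)
(first, pos_a), (second, pos_b) = crop_two_patches(image, 24, rng)
```

Because each patch is confined to its own half, the two crops can never overlap, regardless of the sampled offsets.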
The COCO-stuff dataset does not provide ground truth for example-guided scene synthesis, i.e. two scene images with exactly the same style. To qualitatively evaluate model performance, we require a model to transfer the style from to , where is the test image and is the generated image, in three ways: i) duplicating: we use the test image itself to test self-reconstruction, ii) mirroring: is generated by horizontally mirroring , iii) retrieving: is generated by finding the best match from a larger image pool. Specifically, we retrieve 20 candidate images from the training set with the smallest label histogram intersections. Out of the 20 images, the best-matching image is generated using SIFT Flow [siftflow]. Finally, since the colors of and are not the same, we apply [wct2] on image for color correction. Examples of the retrieving pairs are shown in Fig. 5, in columns 3 and 10.
|Model|PSNR|SSIM|LPIPS|FID|
|ours MSCA w/o att|11.76|0.27|0.524|98.35|
|ours MSCA w/o fm|15.64|0.40|0.455|89.58|
|ours MSCA w/o att|12.13|0.28|0.512|98.02|
|ours MSCA w/o fm|16.52|0.42|0.442|88.40|
|ours MSCA w/o att|11.92|0.28|0.508|102.24|
|ours MSCA w/o fm|15.91|0.40|0.437|89.44|
Implementation Details The number of attention maps for MSCA modules is set to from scale to . The learning rate is set to for the generator and the discriminator. The weights of the generator are updated every iterations. We adopt the Adam [adam] optimizer ( and ) in all experiments. Our synthesis model and all comparative models are trained for epochs to generate the results in the experiments.
During implementation, we pretrain the spatial-channel attention with a lightweight feature decoder to avoid the ineffective and extremely slow updating of SPADE parameters. Specifically, at each scale, the concatenation of and in Sec. 3.3 is fed into a convolutional layer to reconstruct the ground-truth VGG feature at the corresponding scale. The pretraining takes around % of the total training time to converge. More details of the pretraining procedure are provided in the appendix.
We compare our approach with an example-guided synthesis approach, variational autoencoding SPADE (SPADE_VAE) [spade], which is trained with a self-reconstruction loss. Therefore, we directly use the resized images to train this model. We also attempted to train two example-guided synthesis models, [example_cvpr18] and [example_cvpr19] (the latter trained using patch-based self-supervision), but could not achieve visually good results. We leave the results of [example_cvpr18, example_cvpr19] to the appendix. In addition, three ablation models are evaluated (see Ablation Study).
For quantitative evaluation, we apply low-level metrics including PSNR and SSIM [ssim], and perceptual-level metrics including Learned Perceptual Image Patch Similarity (LPIPS) [lpips] and Fréchet Inception Distance (FID) [fid] on different models. For LPIPS, we use the linearly calibrated VGG model (see [lpips] for details).
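As a reminder of how the simplest of these metrics is computed, here is a sketch of PSNR using its standard definition. The 8-bit peak value of 255 is an assumption about the image range, and this is not the authors' evaluation code.

```python
import numpy as np

def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-sized images."""
    ref = np.asarray(reference, dtype=np.float64)
    gen = np.asarray(generated, dtype=np.float64)
    mse = np.mean((ref - gen) ** 2)
    if mse == 0:
        return float("inf")          # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

reference = np.zeros((4, 4))
generated = np.full((4, 4), 25.5)    # constant error of 10% of the peak value
score = psnr(reference, generated)   # 20 dB for this toy pair
```

Since PSNR is logarithmic in the mean squared error, a gain of about 1 dB corresponds to roughly a 20% reduction in mean squared error.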
As shown in Table 1, our method clearly outperforms the other methods. Improvements in both low-level and perceptual-level measurements suggest that our model better preserves color and texture appearances. We observe that the performance of the various approaches on the retrieving dataset is worse and less differentiated than on the mirroring and duplicating datasets. This suggests that the retrieving dataset is harder and noisier, as one cannot retrieve images that have exactly the same styles. On the retrieving dataset, our approach achieves a moderate +0.36 PSNR gain over SPADE_VAE (from 15.62 to 15.98). By contrast, our approach achieves larger gains over SPADE_VAE on duplicating and mirroring, e.g. a +1.15 PSNR gain (from 15.35 to 16.50) on duplicating and a +1.23 PSNR gain (from 15.72 to 16.95) on mirroring.
Qualitative Evaluation Fig. 5 qualitatively compares our model against the remaining models on two retrieved scenes (rows 1-2) and two arbitrary scenes (rows 3-4). Our model achieves better style-consistent example-guided synthesis. Remarkably, in rows 3-4, even though the two scenes have very different semantics (indicated by the different colors of the corresponding label maps), our model can still maintain the styles of the exemplars while maintaining the correct semantics of the target label maps, e.g. generating “snow” rather than “grass” in row 4.
Also notice that our results are sometimes more style-consistent than the synthesized ground truths (last column). This further shows that existing style transfer approaches [styletransfer, wct2, phototransfer] cannot be directly applied to exemplar-guided scene synthesis with satisfactory results.
Ablation Study To evaluate the effectiveness of our design, we separately train three variants of our model: i) our GAP, which replaces the MSCA module with global average pooling; ii) our MSCA w/o att, which keeps the MSCA modules but replaces the spatial and channel attention of MSCA with one-hot label maps from the source and target domains, respectively, so that alignment is performed on regions with the same semantic labeling; and iii) our MSCA w/o fm, which keeps the MSCA modules but removes the feature masking procedure. In Table 1 and Fig. 5, our full model clearly achieves the best quantitative and qualitative results. In comparison, in Fig. 5, our GAP produces similar appearances in each region, as GAP cannot distinguish local appearances. Our MSCA w/o att is less stable in training and cannot generate plausible results. We hypothesize that label-level alignment generates more misaligned and noisier feature maps, thus hurting training. Our MSCA w/o fm cannot perform correct appearance transformation, for instance, transferring “sky” to “snow” (Fig. 5, last row).
The Effect of Attention To understand the effect of spatial-channel attention, we visualize the learned spatial and channel attention in Fig. 7. We observe that: a) spatial attention can attend to multiple regions of the reference image. For each reference region, channel attention finds the corresponding target region. b) spatial-channel attention can detect and utilize the semantic similarities between segments to transfer visual features. In the top row of Fig. 7, attention in channels respectively perform transformations: , . In the bottom row, attention in channels respectively perform transformations: , and .
Interpolation We can easily control the synthesized styles in the test stage by manipulating the attention. Here, we show how to interpolate between two styles using our trained model: given two example images and , we first compute their image features and the spatial-attention maps . Given an interpolation factor where means ignoring the example scene , the spatial attention map of the first scene is modified by . Afterwards, both feature maps and spatial attention are concatenated along the horizontal axis. In addition, the masking score (output of the 2-layer MLP in Eq. 6) is also interpolated. With the remaining procedures unchanged, i.e., same spatial aggregation, feature masking, channel aggregation and synthesis, interpolation results are readily generated. As shown in Fig. 6, with slight modifications, our model can perform effective style interpolation. Specifically, a style traversal along the path is achieved in Fig. 6.
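The effect of rescaling and concatenating the attention maps can be sketched as follows. The exact weighting is an assumption: here the first exemplar's attention is scaled by `alpha` and the second by `1 - alpha`, so that `alpha = 0` ignores the first exemplar. All names and shapes are hypothetical.

```python
import numpy as np

def aggregate(attn, feats):
    """Spatial aggregation: (k, H, W) attention x (C, H, W) features -> (k, C)."""
    k, c = attn.shape[0], feats.shape[0]
    return attn.reshape(k, -1) @ feats.reshape(c, -1).T

def interpolate_styles(feats_a, attn_a, feats_b, attn_b, alpha):
    """Blend two exemplars by rescaling and concatenating their attention.

    Features and attention maps are concatenated along the width axis and
    then aggregated as usual (the alpha / 1 - alpha split is an assumption).
    """
    feats = np.concatenate([feats_a, feats_b], axis=2)
    attn = np.concatenate([alpha * attn_a, (1 - alpha) * attn_b], axis=2)
    return aggregate(attn, feats)

rng = np.random.default_rng(0)
fa, fb = rng.standard_normal((4, 3, 3)), rng.standard_normal((4, 3, 3))
raw_a, raw_b = rng.random((2, 3, 3)), rng.random((2, 3, 3))
aa = raw_a / raw_a.sum(axis=(1, 2), keepdims=True)   # normalized attention maps
ab = raw_b / raw_b.sum(axis=(1, 2), keepdims=True)
blended = interpolate_styles(fa, aa, fb, ab, alpha=0.25)
```

Because aggregation is linear in the attention, the blended result equals the convex combination of the two exemplars' aggregated features, which is why the rest of the pipeline can stay unchanged.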
Likewise, by manipulating the channel attention at each spatial location, it is possible to adaptively mix styles to synthesize an output image, i.e. spatial style interpolation. As shown in Fig. 8, using the previous input, we interpolate between styles from left to right in a single image.
Given a scene patch at the center, our model can achieve scene extrapolation, i.e. generating beyond-the-border image content according to the semantic map guidance. An extrapolated image is generated by a weighted combination of patches synthesized at the corners and other random locations. As shown in Fig. 9, our model generates visually plausible extrapolated images, showing the promise of our proposed framework for guided scene panorama generation.
Swapping Style Fig. 10 shows reference-guided style swapping on six distinctively different scenes. For the same segmentation mask, we generate multiple outputs using different reference images. Our approach can reasonably transfer styles among multiple scenes, including grassland, desert, ocean view, ice land, etc. More results are included in the appendix.
We propose to address a challenging example-guided scene image synthesis task. To propagate information between structurally uncorrelated and semantically unaligned scenes, we propose an MSCA module that leverages decoupled cross-attention for adaptive correspondence modeling. With MSCA, we propose a unified model for joint global-local alignment and image synthesis. We further propose a patch-based self-supervision scheme that enables training. Experiments on the COCO-stuff dataset show significant improvements over the existing methods. Furthermore, our approach provides interpretability and can be extended to other content manipulation tasks.
Appendix A The Synthesis Module
As shown in Fig. 11, our image synthesis module (the dashed block on the right) takes the image feature map and segmentation feature map as inputs to output a new image . Specifically, at each scale, a SPADE residual block [spade] with an upsampling layer takes the concatenation of and as input to generate an upsampled feature map or image.
Appendix B MSCA Pretraining
As shown in Fig. 12, an auxiliary feature decoder (the dashed block on the right) is used to pretrain the feature extractors and the MSCA modules. Specifically, at each scale, the concatenation of and is fed into a convolutional layer to reconstruct the ground-truth VGG feature of at the corresponding scale. We take a weighted sum of the L1 losses between predictions and ground truths at each scale, then apply backpropagation to update the weights of the whole model. We pretrain the model for epochs. Because of the lightweight design of the feature decoder, the pretraining step only takes around hours, and around % of the total training time.
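The multi-scale pretraining loss can be sketched as below. The per-scale weights and the use of a mean (rather than a sum) over elements are assumptions, since the text does not specify them; the function name is hypothetical.

```python
import numpy as np

def pretraining_loss(predictions, targets, weights):
    """Weighted sum over scales of the mean absolute (L1) error."""
    total = 0.0
    for pred, target, weight in zip(predictions, targets, weights):
        total += weight * np.mean(np.abs(pred - target))
    return total

# Two hypothetical scales with different resolutions.
preds = [np.ones((2, 2)), np.zeros((1, 4))]
truths = [np.zeros((2, 2)), np.full((1, 4), 2.0)]
loss = pretraining_loss(preds, truths, weights=[0.5, 0.25])   # 0.5*1 + 0.25*2
```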
Appendix C Results of [example_cvpr18, example_cvpr19]
We provide additional results of conditional image-to-image translation (Conditional I2I) [example_cvpr18] and style-guided synthesis [example_cvpr19] in Fig. 13, columns 9 and 10. To train the model of [example_cvpr18], we resize images and semantic label maps to , the original resolution used in [example_cvpr18]. We test different learning rates and early stopping strategies to prevent the generator from mode collapse. To implement [example_cvpr19], we train the model of [example_cvpr19] using our patch-based self-supervision. We test multiple learning rates and channel sizes of the generator. However, we could not achieve good results for [example_cvpr18] and [example_cvpr19]. We believe the disentanglement strategy of [example_cvpr18] is too challenging for the highly diversified COCO-stuff dataset. Meanwhile, the input domain concatenation used in [example_cvpr19] may not be sufficient to capture and fuse the style information for the more challenging scene image dataset. In addition, spatially-adaptive normalization [spade] might be required for [example_cvpr19] to better utilize the captured style coding.
Appendix D More Style Swapping Results
We show style swapping results on diversified scenes in Fig. 14. As shown in the figure, our model can transfer styles to very different scene semantics and generate style-consistent outputs given exemplar images.