Imagine this scene: a sparrow stands on a branch, then flies away and lands on another branch. A while later, the sparrow flies out of sight, and a cuckoo lands on the very same branch to rest from a long flight. It is natural for one object to appear in different scenes, and for various distinct objects to appear in the same scene. Most images can thus be decomposed into one or more foreground objects and a background scene.
Ideally, image synthesis should capture this diversity of foreground-background composition. However, most existing GANs are ‘Typical GANs’ as illustrated in Fig. 1: they generate new images in one go without handling foreground objects and background scenes separately. Consequently, the generated images usually have highly correlated foregrounds and backgrounds depending on the training images. We argue that foreground-background correlation consists of three components: 1) style correlation, which captures consistency in color, saturation, etc.; 2) geometry correlation, which captures geometrical plausibility; and 3) content correlation, which captures the co-occurrence of foreground objects and background scenes [zhang2019cad]. ‘Typical GANs’ tend to capture excessive content correlation, especially when the training data lack sufficient size and variation, leading to limited diversity in the generated images.
Some attempts have been made to address the excessive content correlation by generating foregrounds and backgrounds separately. One is LR-GAN [yang2017lr], which first generates backgrounds and foregrounds recursively and then stitches them into the final synthesis. However, LR-GAN essentially remains a ‘Typical GAN’, as its generated foreground is entirely conditioned on the generated background and the whole generation is still conditioned on a single latent code. Another is FineGAN [zhao2019image], which disentangles backgrounds and foregrounds and generates them hierarchically. However, FineGAN does not handle foreground-background mismatch well: its generated foreground tends to occlude the background completely when mismatches happen, so it degrades to a ‘Typical GAN’. Additionally, the stitching process in FineGAN does not guarantee style alignment, i.e., the generated foreground does not adjust its style to match the generated background. To the best of our knowledge, these are the only two works that attempt to generate image foreground and background separately.
This paper presents Foreground-Background Composition GAN (FBC-GAN), an innovative image synthesis network that achieves superior synthesis flexibility and diversity by generating image foreground and background independently, as illustrated in Fig. 1. This independent generation relaxes not only the (undesired) content correlation but also the (desired) style and geometry correlations between the generated foregrounds and backgrounds. To reinstate the style and geometry correlations, we adapt the idea of AdaIN [huang2017arbitrary] for style alignment and modify the generated background for geometric plausibility between the generated foregrounds and backgrounds. Leveraging these designs, FBC-GAN offers superior diversity and flexibility in image synthesis: it can generate images with the same foreground object but different background scenes, the same background scene but different foreground objects, or the same foreground object and background scene but different object positions, sizes, and poses, all without additional conditions. Additionally, it allows generating ‘new information’ by enabling image synthesis with foreground and background sampled from different datasets.
The contributions of this work are threefold. First, we propose FBC-GAN, a novel image synthesis network that relaxes the excessive content correlation by generating image foreground and background independently. Second, we design novel consistency mechanisms that achieve style alignment and geometric plausibility between independently generated foreground and background by exploiting feature statistics and adaptive translation of the generated image background. Third, FBC-GAN enables image synthesis with foreground and background sampled from different datasets, which allows generating rich, new information beyond the distribution of the single reference dataset used by most existing image synthesis GANs.
2 Related Work
Image Synthesis. In addition to discriminative models [zhang2019cad, zhang2021meta, zhang2021pnpdet, zhang2021detr, huang2021fsdr, huang2021rda], generative models [goodfellow2014generative] have achieved remarkable progress in recent years, especially in image synthesis, which finds applications in image translation [park2019semantic, Isola_2017_CVPR, zhan2021unite, zhan2020sagan, zhang2021defect, zhang2021crlsr, zhan2021rabit], image inpainting [iizuka2017globally, yu2019free, yu2021diverse], and image editing [yu2018inpainting, wu2020cascade, wu2020leed, zhan2021emlight, zhan2021gmlight, zhan2020towards, zhan2021needlelight].
Early research focused on synthesizing images unconditionally [dcgan, mao2017least, karras2017progressive]. To achieve better controllability over certain attributes of the generated images, increasingly many works perform conditional image synthesis. Various conditions have been exploited, including image labels [CGAN], input images [Pathak_2016_CVPR, Isola_2017_CVPR, liu2017unsupervised, cyclegan, zhao2018modular, wu2019relgan], sketches [sangkloy2017scribbler, zhu2017toward, xian2018texturegan, zhao2019image], text descriptions [zhang2017stackgan, zhang2018stackgan++, cha2019adversarial], etc.
Although conditional GANs impose certain controls over the synthesized images, most of them suffer from constrained diversity and flexibility in the generated contents, as they synthesize images in one go from a single latent code. LR-GAN [yang2017lr] and FineGAN [zhao2019image] attempt to mitigate this problem by decomposing foreground and background and generating them separately. Both adopt a stitching-based approach that first generates a background and then stitches a correlated foreground onto it. However, the stitching tends to occlude the background if the generated foreground does not match it. Our FBC-GAN instead generates foreground and background independently and composes them into a style-consistent and geometry-consistent image, offering great flexibility and diversity in image generation, as detailed in the following sections.
Image Composition. Image composition aims to overlay a masked foreground object on a background image with good consistency in both style and geometry. To achieve style consistency, [lalonde2007photo] selects foreground objects with coherent appearance from a database for composition, [chen2019toward] uses a brightness and contrast model to adjust the foreground appearance, and [zhan2019spatial] incorporates CycleGAN [cyclegan] to align the foreground style with the background. For geometry consistency, most existing works [lin2018st, chen2019toward, zhan2019spatial, zhan2020aicnet, zhan2019esir, zhan2019gadan, zhan2018verisimilar, zhan2019scene] leverage a Spatial Transformer Network [jaderberg2015spatial] to adjust the position, size, and pose of foreground objects.
Our FBC-GAN is a composition-based model that generates foreground and background separately. It adopts Adaptive Instance Normalization (AdaIN) [huang2017arbitrary] to achieve style consistency between its generated foreground and background. For geometry consistency, it introduces an innovative composition module that employs the foreground shape to guide foreground generation and modifies the background with minor alterations for geometrical compatibility.
3 Proposed Method

3.1 Overview

The proposed FBC-GAN is an end-to-end image synthesizer. It consists of four modules: Shape Generator (S-Gen), Foreground Generator (FG-Gen), Background Generator (BG-Gen) and Background Modifier (BG-Mod), as illustrated in Fig. 2. FG-Gen and BG-Gen independently and concurrently generate image foregrounds and backgrounds from latent codes sampled from Gaussian distributions. By generating foreground and background separately, FBC-GAN lifts the content, style and geometry correlations between them. To reinstate the style correlation, it aligns the statistics of their feature representations, inspired by AdaIN [huang2017arbitrary]. To reinstate the geometry correlation, it employs S-Gen and BG-Mod to bridge the geometry (e.g., adding a branch below a bird otherwise standing in the sky) between the independently generated foregrounds and backgrounds.
As a result, the proposed FBC-GAN retains style and geometry consistency in the generated images for visual realism, while lifting the content correlation between generated foregrounds and backgrounds for superior synthesis diversity and flexibility.
3.2 Foreground Generation
Foreground Generator (FG-Gen) is illustrated in the upper part of Fig. 2. The network architecture of FG-Gen follows the generator in SPADE [park2019semantic], except that it takes random Gaussian latent codes as input and the shape generated by S-Gen is fed to the SPADE module to guide the foreground generation. We divide the generator into two parts; the second part is a single convolution layer that shares its weights with the corresponding layer in the Background Generator.
Since FG-Gen and BG-Gen generate foreground and background separately and independently, the styles of the generated foreground and background are usually incompatible with each other. Style alignment of foreground and background is thus necessary for synthesis realism. We apply AdaIN [huang2017arbitrary] to the feature-map inputs of the two weight-sharing convolution layers to align the styles of foreground and background. To align styles without changing the distinctive appearance of the foreground, we modify AdaIN to a soft AdaIN, where the style-aligned foreground features are a weighted combination of the AdaIN-transformed features and the original features:

$$\bar{F}_{fg} = \alpha \cdot \mathrm{AdaIN}(F_{fg}, F_{bg}) + (1 - \alpha) \cdot F_{fg}$$

where $\bar{F}_{fg}$ denotes the style-aligned foreground feature maps, $F_{fg}$ and $F_{bg}$ are the feature-map inputs of the foreground and background branches respectively, and $\alpha$ is a hyper-parameter controlling the intensity of the style alignment. We set $\alpha$ empirically, using one value when foregrounds and backgrounds are all natural-style images (Dataset1, Dataset2 and Dataset4, introduced in Section 4.1) and another when foregrounds are natural-style and backgrounds are Monet-style images (Dataset3 and Dataset5, introduced in Section 4.1).
Finally, the style-aligned feature maps are fed into the final convolution layer to generate the style-aligned foreground object.
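The soft AdaIN described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function names and the (C, H, W) feature layout are assumptions, and in the actual network the operation runs on intermediate feature maps rather than raw arrays.

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """AdaIN: re-normalize content features so their channel-wise
    mean/std match those of the style features."""
    # content, style: float arrays of shape (C, H, W)
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True) + eps
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True) + eps
    return s_std * (content - c_mean) / c_std + s_mean

def soft_adain(fg_feat, bg_feat, alpha):
    """Soft AdaIN: blend AdaIN-aligned foreground features with the
    originals, so the foreground only partially adopts the background
    style and keeps its distinctive appearance."""
    return alpha * adain(fg_feat, bg_feat) + (1.0 - alpha) * fg_feat
```

With `alpha = 0` the foreground features are untouched; with `alpha = 1` they fully adopt the background's channel-wise statistics.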
3.3 Background Generation
Background Generation consists of two modules: Background Generator (BG-Gen) and Background Modifier (BG-Mod).
As shown in Fig. 2, BG-Gen mainly consists of two parts that together generate background scenes from random Gaussian latent codes; its final convolution layer shares weights with that of FG-Gen for the style alignment described in Section 3.2.
BG-Mod modifies the generated background to fit the generated shape in a geometrically consistent manner with the least alteration, ensuring geometrical consistency while keeping the background as unchanged as possible. It concatenates the shape generated by S-Gen with the background generated by BG-Gen and feeds the result into the modifier network, which, without considering foreground appearance, focuses on aligning foreground and background geometrically. Details of BG-Mod are illustrated in Fig. 3. Its preliminary output consists of a geometrically aligned image, a foreground binary mask, and an attention map. The geometrically aligned image is generated through adversarial learning. We use the idea of GAN-CLS [reed2016generative] to match the generated images with the generated foreground binary masks so that the foreground object shapes of the generated images match the input shape. Element-wise multiplication of the generated mask and image produces geometrically compatible backgrounds, as shown in Fig. 2. The attention map is used only during training to guide the generation of background contents. With the generated shape as reference, it attends as much as possible to areas outside the foreground object (white pixels) and allows minor alteration of the background (black pixels) to achieve geometry consistency.
3.4 Final Image Composition
The final image is obtained by blending the style-consistent foreground generated by FG-Gen with the geometrically consistent background generated by BG-Mod.
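The mask-guided blending of this composition step can be sketched as follows; a minimal illustration with an assumed (H, W, 3) image layout and a hard binary mask, whereas the actual pipeline operates on network outputs.

```python
import numpy as np

def compose(foreground, background, mask):
    """Blend a generated foreground into the modified background using
    the foreground binary mask (1 = foreground pixel, 0 = background)."""
    # foreground, background: (H, W, 3) float images; mask: (H, W) in {0, 1}
    m = mask.astype(foreground.dtype)[..., None]
    return m * foreground + (1.0 - m) * background
```

Because the foreground image carries zero background information, only the mask decides which pixels come from each branch.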
3.5 Varying positions, sizes and poses
By manipulating the input latent codes, our proposed FBC-GAN can generate images with the same background scene and different foreground objects, as well as images with the same foreground object and different background scenes. Its most unique feature, however, is that it can vary the positions, sizes, and poses of the foreground objects in generated images while preserving the exact identity of the foreground object, without additional conditions. This is attributed to the generation of pure foreground objects (with zero background information) by our generation process: instead of controlling foreground sizes, positions and poses through the latent code, we can directly apply transformations (shifting, flipping, rotation and resizing) to the generated foreground shapes and foreground objects, and BG-Mod modifies the backgrounds accordingly to fit the transformed foreground objects.
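The key point above is that the transformation is applied jointly to the foreground object and its shape mask before BG-Mod adapts the background. A minimal NumPy sketch of shift and flip (the function name and array layouts are illustrative; rotation and resizing would follow the same pattern with an image library):

```python
import numpy as np

def transform_foreground(obj, shape_mask, dx=0, dy=0, flip=False):
    """Apply the same shift/flip jointly to the generated foreground
    object and its shape mask, so their geometry stays consistent."""
    # obj: (H, W, 3) image of the pure foreground; shape_mask: (H, W) binary
    if flip:
        obj = np.flip(obj, axis=1)               # horizontal flip
        shape_mask = np.flip(shape_mask, axis=1)
    obj = np.roll(obj, shift=(dy, dx), axis=(0, 1))        # translate
    shape_mask = np.roll(shape_mask, shift=(dy, dx), axis=(0, 1))
    return obj, shape_mask
```

The transformed shape mask is then fed to BG-Mod, which alters the background just enough to stay geometrically compatible with the moved object.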
3.6 Training Objective
The loss of background generation for BG-Gen follows that in FineGAN [zhao2019image]. The foreground generated by FG-Gen and the foreground shape generated by S-Gen are learned via adversarial learning with losses $\mathcal{L}_{adv}^{fg}$ and $\mathcal{L}_{adv}^{s}$, respectively. To generate more visual details for the foreground, we introduce an additional feature matching loss $\mathcal{L}_{FM}$ and perceptual loss $\mathcal{L}_{per}$ for foreground generation and set their weights to 10 [wang2018high]. Thus, the foreground loss can be formulated as

$$\mathcal{L}_{fg} = \mathcal{L}_{adv}^{fg} + \mathcal{L}_{adv}^{s} + 10\,\mathcal{L}_{FM} + 10\,\mathcal{L}_{per}$$
The preliminary image $\hat{x}$ generated by BG-Mod is learned via adversarial learning with loss $\mathcal{L}_{pair}$. Inspired by the idea of GAN-CLS [reed2016generative], we align $\hat{x}$ with the generated mask $\hat{s}$ through a matching-aware discriminator that implicitly discriminates between the true image-segmentation pair $(x, s)$ and two sources of fake pairs, $(\hat{x}, \hat{s})$ and $(x, \bar{s})$, where $x$ is the real image, $s$ is its corresponding foreground object mask, and $\bar{s}$ is a mismatched foreground object mask from another real image. We therefore formulate the adversarial loss for the image-segmentation pair as

$$\mathcal{L}_{pair} = \mathbb{E}\big[\log D(x, s)\big] + \tfrac{1}{2}\,\mathbb{E}\big[\log\big(1 - D(\hat{x}, \hat{s})\big)\big] + \tfrac{1}{2}\,\mathbb{E}\big[\log\big(1 - D(x, \bar{s})\big)\big]$$
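The matching-aware discriminator objective described above can be sketched for a single sample as follows. This is an illustrative sketch in the spirit of GAN-CLS, not the paper's training code: `D` stands in for any scorer returning a match probability in (0, 1), and the pair arguments are placeholders.

```python
import math

def pair_adversarial_loss(D, real_img, real_mask, fake_img, fake_mask, wrong_mask):
    """Matching-aware discriminator loss: the real (image, mask) pair
    counts as true, while both the generated pair and the real-image /
    mismatched-mask pair count as fake."""
    eps = 1e-8  # numerical guard for log
    real_term = math.log(D(real_img, real_mask) + eps)
    fake_term = 0.5 * (math.log(1.0 - D(fake_img, fake_mask) + eps)
                       + math.log(1.0 - D(real_img, wrong_mask) + eps))
    return -(real_term + fake_term)  # the discriminator minimizes this
```

Splitting the fake term over the two fake sources forces the discriminator (and hence the generator) to care about whether the mask matches the image, not just whether the image looks real.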
The foreground object shape in the preliminary image should be the same as the generated shape. We enforce this by aligning the generated mask from BG-Mod with the shape generated by S-Gen through a mean-square-error loss $\mathcal{L}_{mask}$.
To ensure that BG-Mod does not modify background information massively, we design an attention-based background loss between the generated background $B$ and the background of the modified image $\hat{x}$, where an attention mask $A$ is generated to learn which background areas should be preserved. For the generator to preserve as much background information as possible and rationalize the overall image with minimum alteration, the attention mask should be close to the reverse of the input shape prior $s_{gen}$. The attention-based background loss is thus defined as

$$\mathcal{L}_{bg} = \big\| A \odot (\hat{x} - B) \big\|_1 + \big\| A - (1 - s_{gen}) \big\|_1$$
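One plausible reading of this attention-based background loss, as a NumPy sketch; the two-term L1 formulation, array layouts, and the relative weight are assumptions for illustration only.

```python
import numpy as np

def attention_background_loss(bg, modified, attn, shape_prior, weight=1.0):
    """Sketch of an attention-based background loss: penalize changes to
    the background wherever attention is high, and pull the attention
    mask toward the reverse of the input shape prior."""
    # bg, modified: (H, W, 3) images; attn, shape_prior: (H, W) in [0, 1]
    preserve = np.abs(attn[..., None] * (modified - bg)).mean()
    attn_reg = np.abs(attn - (1.0 - shape_prior)).mean()
    return preserve + weight * attn_reg
```

The loss vanishes only when the background is untouched in attended regions and the attention mask is exactly the complement of the shape prior.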
The overall loss function of the proposed FBC-GAN is

$$\mathcal{L} = \mathcal{L}_{BG} + \mathcal{L}_{fg} + \mathcal{L}_{pair} + \lambda_1\,\mathcal{L}_{mask} + \lambda_2\,\mathcal{L}_{bg}$$

where $\mathcal{L}_{BG}$ denotes the background-generation loss following FineGAN, and the weights $\lambda_1$ and $\lambda_2$ are set empirically.
4.1 Datasets and Implementation Details
Our network generates image foregrounds and backgrounds independently and reinstates their style and geometry correlations for synthesis realism. Our generated foregrounds and backgrounds can therefore be sampled from the same dataset or from different datasets. We evaluate this capability over five different datasets, as shown in Table 1. Each dataset consists of a foreground sub-dataset and a background sub-dataset: the foreground sub-dataset contains foreground objects that are used to train FG-Gen and foreground shapes that are used to train S-Gen, while the background sub-dataset contains images and foreground shapes, where the foreground shapes are used to guide BG-Mod in generating geometrically compatible background scenes. The original Stanford Cars [krause20133d] dataset and the ImageNet [imagenet_cvpr09]-bird dataset do not contain object shapes; we therefore deploy DeepLab-V3 [chen2017rethinking] to obtain the object shapes and use VGG19 to screen out well-segmented shapes (classification accuracy of the segmented object higher than 0.5) together with their corresponding images and foreground objects. As a result, 31,979 out of 46,800 image sets remain for ImageNet [imagenet_cvpr09]-bird and 3,196 out of 8,144 for Stanford Cars [krause20133d].
We train the model using the Adam optimizer with fixed momentum parameters $\beta_1$, $\beta_2$ and a fixed learning rate.
Table 1: Foreground and background sub-datasets of the five datasets used in our experiments.

| Name | Foreground Sub-dataset | Background Sub-dataset |
| --- | --- | --- |
| Dataset1 | CUB200 [wah2011caltech] | CUB200 [wah2011caltech] |
| Dataset2 | CUB200 [wah2011caltech] | ImageNet [imagenet_cvpr09]-bird |
| Dataset3 | CUB200 [wah2011caltech] | Monet-style CUB200 [wah2011caltech] |
| Dataset4 | Stanford Cars [krause20133d] | Stanford Cars [krause20133d] |
| Dataset5 | Stanford Cars [krause20133d] | Monet-style Stanford Cars [krause20133d] |
Table 2: Quantitative comparison over Dataset1 (CUB200) and Dataset4 (Stanford Cars).

| Metric | Method | Dataset1 (CUB200) | Dataset4 (Stanford Cars) |
| --- | --- | --- | --- |
| IS | StackGAN-V2 | 30.04 ± 0.5 | 20.66 ± 0.38 |
| IS | FineGAN | 30.1 ± 0.64 | 20.34 ± 0.22 |
| IS | Ours (FBC-GAN) | 32.2 ± 0.87 | 20.89 ± 0.28 |
| CIS | FineGAN | 8.2 ± 0.08 | 6.35 ± 0.06 |
| CIS | Ours (FBC-GAN) | 31.46 ± 0.76 | 20.86 ± 0.37 |
4.2 Evaluation Metrics
We perform quantitative evaluations using three evaluation metrics. The first is the Inception Score (IS) [salimans2016improved], which is commonly used in image synthesis.
The second metric is Conditional IS (CIS) [huang2018multimodal] that defines the inception score conditioned on modes. In our experiments, CIS is evaluated as the inception score of randomly generated images conditioned on the same background. The third metric is Learned Perceptual Image Patch Similarity (LPIPS) [zhang2018unreasonable] that evaluates the distance between image patches. Higher LPIPS means better diversity of generated images.
Frechet Inception Distance (FID) [heusel2017gans] is another metric for measuring image fidelity; it calculates the distance between feature vectors of real and generated images. However, our model lifts the content correlation between generated foregrounds and backgrounds, moving the generated images beyond the distribution of the original dataset. It is therefore not meaningful to compare the feature-vector distance between our generated images and the original dataset.
In our experiments, we compute IS over 30k randomly generated images. For CIS, we evaluate 10 groups of randomly generated images, where each group consists of 30k samples conditioned on the same background. For LPIPS, we follow MUNIT [huang2018multimodal] to generate 100 groups of images, where each group contains 5,000 samples conditioned on the same background, and randomly select 19 pairs of images from each group.
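The LPIPS part of this protocol, averaging a perceptual distance over randomly chosen pairs within a same-background group, can be sketched as follows; `dist` stands in for the LPIPS network, and the function name and pair count are illustrative.

```python
import random

def diversity_score(group, dist, n_pairs=19, seed=0):
    """Average a perceptual distance (e.g. LPIPS) over randomly chosen
    pairs from one group of samples that share the same background."""
    rng = random.Random(seed)  # seeded for a reproducible pair selection
    pairs = [rng.sample(range(len(group)), 2) for _ in range(n_pairs)]
    return sum(dist(group[i], group[j]) for i, j in pairs) / n_pairs
```

A higher average distance within a group indicates more diverse foregrounds for a fixed background.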
4.3 Quantitative Experiments
We compare FBC-GAN with FineGAN [zhao2019image] and StackGAN-V2 [zhang2018stackgan++] quantitatively to demonstrate that FBC-GAN generates images with comparable fidelity and higher diversity. FineGAN [zhao2019image] is the state-of-the-art composition-based GAN for image synthesis, while StackGAN-V2 [zhang2018stackgan++] currently has the best performance on CUB200 [wah2011caltech] and Stanford Cars [krause20133d], the datasets used in our paper. We only conduct quantitative experiments on Dataset1 and Dataset4, as the other datasets pair foreground and background sub-datasets from different sources, a setting that does not apply to FineGAN and StackGAN-V2.
For fair comparison, we report results of FineGAN without hard negative training, as this training technique is only applicable to class-conditional models. When comparing CIS and LPIPS, we tie the background and child codes of FineGAN, since the styles of its generated foreground and background are consistent only in this setting. All generated images have the same resolution.
Table 2 shows the quantitative experimental results. We can observe that FBC-GAN achieves comparable IS, which means its generated images have comparable quality and fidelity. In addition, FBC-GAN achieves the best CIS and LPIPS, demonstrating the superior diversity of its generated images, as illustrated in Figs. 4 and 5 (discussed in the ensuing subsection). CIS and LPIPS for StackGAN-V2 are 0, as it is not able to generate different foreground objects for a fixed background. The quantitative experiments show that FBC-GAN generates high-fidelity and high-diversity images consistently across the three adopted evaluation metrics.
4.4 Qualitative Experiments
We further present qualitative experimental results of FBC-GAN to demonstrate its flexibility and diversity in image synthesis. We also compare FBC-GAN with FineGAN, the state-of-the-art composition-based GAN that similarly tries to disentangle the foreground and background generation processes.
Table 3: Ablation studies of style alignment (Style Relevance, top block) and geometry alignment (IS, bottom block) over the five datasets.

| | Dataset1 | Dataset2 | Dataset3 | Dataset4 | Dataset5 |
| --- | --- | --- | --- | --- | --- |
| w/o Style Alignment | 0.254 | 0.261 | 0.230 | 0.285 | 0.255 |
| w/ Style Alignment | 0.274 | 0.273 | 0.253 | 0.295 | 0.282 |
| w/o Geometrical Alignment | 30.6 ± 0.60 | 25.46 ± 0.46 | NA | 20.14 ± 0.32 | NA |
| w/ Geometrical Alignment | 32.2 ± 0.87 | 26.03 ± 0.45 | NA | 20.89 ± 0.28 | NA |
The first two rows and last two rows of Fig. 4 show results generated by FineGAN and FBC-GAN, respectively. Every three images within a red box are generated conditioned on the same background. In comparison, FBC-GAN has superior ability to keep the background unchanged. This is because FBC-GAN generates foreground objects with zero background information and composes them into geometrically compatible backgrounds, making only minimal alterations to the backgrounds originally generated by BG-Gen. FineGAN, in contrast, tends to generate foregrounds with their own backgrounds, and wipes out the originally generated background if the generated foregrounds carry too much background of their own.
Due to the independent generation of foreground objects and background images, FBC-GAN is capable of embedding the same foreground object into different background images with good consistency in both style and geometry, as Fig. 5 shows. In comparison, FineGAN can also disentangle the foreground and background generation processes, but its generated foreground can occlude the whole background, for the same reason as described for Fig. 4; as a result, all generated images look the same. Moreover, it is more natural for the same foreground object to adapt its style to different background scenes. FBC-GAN achieves this (e.g., in the first image set generated by FBC-GAN, the foreground object is dimmer when the background scene is dimmer), whereas foregrounds generated by FineGAN do not change accordingly.
Figs. 4 and 5 demonstrate the generation results of FBC-GAN on Dataset1 and Dataset4, whose foreground and background sub-datasets are sampled from the same dataset. To demonstrate the flexibility of FBC-GAN, Fig. 6 shows qualitative results on the other three datasets, each of which pairs a foreground sub-dataset with a background sub-dataset from a different source. Although foregrounds and backgrounds come from different datasets, the final generated images remain visually realistic thanks to the style alignment and geometric alignment designs in FBC-GAN.
Further, FBC-GAN has a unique feature: without requiring additional conditions, it allows the same generated foreground object to appear at different positions with different sizes and poses within the same background, as shown in Fig. 7. This is largely attributed to our generation design (detailed in Section 3.5), and it further improves the generation diversity greatly.
4.5 Ablation Study
We perform ablation studies quantitatively and qualitatively to demonstrate the effectiveness of our style alignment and geometry alignment in image synthesis. We use Style Relevance [zhang2020cross] to measure the style compatibility between generated foreground and background (evaluated over 30k generated foregrounds and backgrounds). The first block in Table 3 shows that, with our style consistency mechanism, style-aligned foregrounds consistently achieve better style compatibility over the five datasets. The images in Fig. 8 are composed of foreground objects sampled from a natural-style dataset and background scenes sampled from a Monet-style dataset; it is intuitive that style-aligned images have more compatible foreground and background in brightness, texture, etc. The second block in Table 3 shows quantitatively that images generated with our geometry consistency mechanism consistently have higher quality, and Fig. 9 further shows that our geometry alignment mechanism rationalizes the generated foreground and background geometrically; without it, the composed images have poor geometric fidelity.
5 Conclusion

This paper presents a novel Foreground-Background Composition GAN (FBC-GAN) that treats foreground and background generation as two independent processes and offers superior flexibility and diversity in image generation. FBC-GAN consists of a Foreground Generator (FG-Gen) and a Background Generator (BG-Gen) that generate foreground and background concurrently and independently. It adapts the idea of AdaIN [huang2017arbitrary] for style alignment and deploys a Shape Generator (S-Gen) and a Background Modifier (BG-Mod) to bridge the geometry between the generated foreground and background. Due to this novel generation process, FBC-GAN lifts the content correlation between the generated foreground and background, allowing generation with the same background but different foregrounds, the same foreground but different backgrounds, or the same foreground and background but different foreground object positions, poses, etc. Additionally, it allows generating ‘new information’ by enabling image synthesis with foreground and background sampled from different datasets. Extensive experiments show the superiority of our synthesis network.
We present additional experimental results over multiple data settings to demonstrate the superior flexibility, diversity, and generality of our proposed FBC-GAN.
Figs. 10 and 11 show qualitative comparisons of FineGAN and our FBC-GAN with background latent codes fixed, over the CUB200 and Stanford Cars datasets. To highlight detailed differences between FineGAN and FBC-GAN, we also compare the generated foregrounds and backgrounds. With background latent codes fixed, the generated image backgrounds should ideally be identical. However, FineGAN tends to generate foregrounds with their own backgrounds, and wipes out the originally generated background if the generated foreground images carry too much background of their own; FineGAN therefore cannot guarantee a fixed background. In comparison, our FBC-GAN ensures that all final generated images have the same background as long as the background latent codes are fixed. This is because FBC-GAN generates foreground objects with no background and then composes them into geometrically compatible backgrounds with minimal alterations of the originally generated background. Such alteration merely serves to avoid geometrical inconsistency, such as a bird standing in the air, without substantially changing the originally generated background.
In addition, Figs. 12 and 13 show qualitative comparisons with FineGAN with foreground latent codes fixed, over the CUB200 and Stanford Cars datasets, respectively. We can spot the same issue for FineGAN: due to incomplete foreground-background disentanglement, its generated foregrounds can wipe out the originally generated backgrounds, so when the background latent codes change, FineGAN often fails to generate correspondingly different backgrounds.
In addition, it is natural that the style of foreground object(s) should be consistent with the corresponding background. With this in mind, our proposed FBC-GAN can adjust the styles of foreground objects to be compatible with the separately generated background, making the final synthesized image more visually realistic. However, FineGAN lacks this desirable property.
We defined five datasets in this work. We present further illustrations in Figs. 10 and 12 for Dataset1, and Figs. 11 and 13 for Dataset4. In addition, Fig. 14 shows several generation samples for Dataset2, Dataset3 and Dataset5. Successful generation over these five datasets demonstrates the generality of our proposed FBC-GAN and shows that it can sample foregrounds and backgrounds from either the same dataset or different datasets. This feature lays the foundation for flexible and diverse image generation with FBC-GAN.
Another feature of FBC-GAN is that it can generate the same foreground object and background scene with different object positions, sizes, and poses without additional conditions. We achieve this by applying transformations (shifting, flipping, rotation, and resizing) directly to the generated foreground object and the generated shape. With the background scene modified accordingly, we obtain visually realistic generations with the same foreground object and background scene. We demonstrated this feature qualitatively over CUB200 in the main manuscript; in this appendix, we further demonstrate it over Stanford Cars, as illustrated in Fig. 15.