Adaptive Composition GAN towards Realistic Image Synthesis

by   Fangneng Zhan, et al.
Nanyang Technological University

Despite the rapid progress of generative adversarial networks (GANs) in image synthesis in recent years, current approaches work in either the geometry domain or the appearance domain, which tends to introduce various synthesis artifacts. This paper presents an innovative Adaptive Composition GAN (AC-GAN) that incorporates image synthesis in the geometry and appearance domains into an end-to-end trainable network and achieves synthesis realism in both domains simultaneously. An innovative hierarchical synthesis mechanism is designed which is capable of generating realistic geometry and composition when multiple foreground objects, with or without occlusions, are involved in synthesis. In addition, a novel attention mask is introduced to guide the appearance adaptation of the embedded foreground objects, which helps preserve image details and resolution and also provides a better reference for synthesis in the geometry domain. Extensive experiments on scene text image synthesis, automated portrait editing and indoor rendering tasks show that the proposed AC-GAN achieves superior synthesis performance both qualitatively and quantitatively.





1 Introduction

As a longstanding and fundamental challenge in computer vision research, realistic image synthesis has been attracting increasing attention since the advent of deep neural networks (DNNs). An important driving factor is the data constraint in DNN training, where automated synthesis of annotated training images is much cheaper, faster and more scalable than the traditional manual annotation approach. In recent years, the advance of generative adversarial networks (GANs) opens a new door to image synthesis by iterative adversarial learning between a generator and a discriminator. Quite a number of GAN-based image synthesis systems have been reported, which can be broadly classified into three categories, namely, direct image generation mirza2014cgan ; radford2016dcgan ; arjovsky2017wgan , image-to-image translation liu2016cogan ; zhu2017cyclegan ; isola2017pixel2pixel ; liu2017unit ; hoffman2018cycada ; ge2018fdgan , and image composition lin2018stgan ; azadi2018comgan .

On the other hand, existing image synthesis systems still face two common constraints. First, most existing systems strive for synthesis realism in either the appearance domain or the geometry domain, which often introduces synthesis artifacts. Second, many existing systems tend to sacrifice image details, resolution and even the semantics of the image contents. Specifically, direct image generation does not generate image labels or annotations, and the synthesized images often lack sufficient semantic integrity for training effective DNN models. Image-to-image translation focuses on synthesis realism in the appearance domain only, and the synthesis also impairs image resolution and details. Image composition lin2018stgan ; azadi2018comgan can generate labeled or annotated images of high resolution, but it usually focuses on synthesis realism in the geometry domain, and most existing systems can only deal with a single foreground object.

In our recent work zhan2019sfgan , we designed a composition-based synthesis technique that achieves synthesis realism in both geometry and appearance spaces. However, our prior work simplifies the synthesis problem and assumes only a single foreground object in composition. The AC-GAN proposed in this paper deals with more challenging yet practical synthesis situations through several novel designs. In particular, it develops an innovative hierarchy composition mechanism for realistic synthesis geometry when multiple foreground objects are involved, and it is capable of handling various occlusions as frequently observed among real-scene objects, as illustrated in Fig. 1. In addition, it introduces an attention mask to guide the adaptation of the embedded foreground objects, which greatly improves synthesis realism in the appearance domain and also provides a better reference for synthesis realism in the geometry space. Further, it incorporates synthesis in the geometry and appearance domains within an end-to-end trainable network and uses them as mutual references for optimal synthesis performance.

The rest of this paper is organized as follows. Section 2 briefly reviews related work. The proposed technique is then described in detail in Section 3. Experimental results are presented and discussed in Section 4. Concluding remarks are finally drawn in Section 5.

2 Related Work

2.1 Image Composition

In recent years, quite a number of image composition studies have been reported in the field of computer vision, including the synthesis of single objects attias2016 ; park2015 ; su2015 , the generation of full-scene images gaidon2016 ; richter2016 , etc. Image composition aims to generate new images by embedding foreground objects into background images. It strives for synthesis realism by adjusting the size and orientation of the foreground objects as well as the blending between foreground objects and background images. zhu2015 proposes a model to distinguish natural photographs from automatically generated composite images. gupta2016synth ; jaderberg2014synth ; zhan2018ver ; zhan2019scene investigate the synthesis of scene text images for training better scene text detectors and recognizers. They achieve synthesis realism by adjusting a group of parameters including the text locations within the background images, the geometry transformation of the foreground texts, and the fusion of the foreground texts and background images. Other image composition models have also been developed for training better DNN models dwibedi2017 .

The aforementioned image composition techniques strive for geometric realism by hand-crafted transformations that involve complicated parameters and are prone to unnatural geometry and alignments. Appearance realism is handled by different blending techniques where features are manually selected and susceptible to artifacts. Simple blending methods such as alpha blending uyttendaele2001 have been adopted to alleviate the clear appearance difference between foreground objects and background images, but they tend to blur the composed images and lose image details. Sophisticated blending such as Poisson blending perez2003 can achieve seamless fusion by manipulating the image gradient and adjusting the inconsistency in chrominance and luminance. Appearance-transfer based methods have also been reported in recent years. For example, luan2018 transfers the style of the foreground object according to the local statistics of the background image. tsai2017 presents an end-to-end deep convolutional neural network for image harmonization that considers the context and semantic information of the composite images.

Our AC-GAN adopts unsupervised GAN structures to learn geometry and appearance features which produce natural and consistent image composition with minimal visual artifacts. In addition, guided filters he2013 are introduced to preserve fine image details while performing appearance fusion of foreground objects and background images.

Figure 2: The structure of AC-GAN: The components in the blue and black boxes form the geometry module and appearance module, respectively. $D_A$, $D_B$, $G_A$ and $G_B$ denote discriminators and generators, and $F$ denotes guided filters. The geometry module generates 'Composed' and 'Composed Mask'. In the appearance module, $G_A$ achieves the mapping from Composed to Real (Composed $\rightarrow$ Adapted Composed) and $G_B$ from Real to Composed (Real $\rightarrow$ Adapted Real). The components in the orange-color boxes denote the common part of the geometry and appearance modules, where the input of $D_B$ is the concatenation of the composed/adapted image and the corresponding masks.

2.2 Generative Adversarial Networks

GANs goodfellow2014gan have achieved great success in generating new images from either existing images or random noises. Instead of manually selecting features and parameters, GAN generators learn an optimal mapping from random noise or existing images to the synthesized images, while GAN discriminators differentiate the synthesized images from natural ones via adversarial learning. Quite a number of GAN-based image synthesis methods have been reported in recent years. For example, denton2015lapgan introduces Laplacian pyramids that greatly improve the quality of GAN-synthesized images. lee2018context proposes an end-to-end trainable network for inserting an object instance mask of a specified class into the semantic label map of an image. Other systems attempt to synthesize realistic images by stacking a pair of generators zhang2017stackgan ; zhang2018stackgan++ , learning more expressive latent features chen2016infogan , exploring new training approaches arjovsky2017wgan , visualizing and understanding GANs at the unit, object and scene level bau2019 , etc.

Most existing GAN-based image synthesis systems focus on synthesis realism in the appearance domain lee2018context ; lee2018high . For instance, CycleGAN zhu2017cyclegan proposes a cycle-consistent adversarial network for realistic image-to-image translation, and so do other related GANs isola2017pixel2pixel ; shrivastava2017simgan ; zhu2017toward ; huang2018munit ; azadi2018mcgan ; park2019spade ; liu2019funit . LR-GAN jwyang2017lrgan synthesizes images by introducing spatial transformer networks (STNs). GP-GAN wu2017gpgan synthesizes high-resolution images by leveraging Poisson blending perez2003 . In recent years, GAN-based systems have also been proposed for synthesis realism in the geometry domain, e.g., lin2018stgan presents a spatial transformer GAN (ST-GAN) by inserting STNs into the generator, azadi2018comgan describes a Compositional GAN that introduces a self-consistent composition-decomposition network, and yao2019 ; zhu2018von study GAN-based 3D manipulation and generation.

Our AC-GAN incorporates image synthesis in the geometry and appearance domains into an end-to-end trainable network and achieves synthesis realism in both domains simultaneously. It requires no supervision and is capable of composing images with multiple foreground objects, with or without occlusions; more details are described in Section 3.

3 The Proposed Method

This section presents the proposed AC-GAN for realistic image synthesis. In particular, we divide this section into four subsections that deal with the network structure of AC-GAN, the attention mask, multiple-object composition and the adversarial training, respectively.

3.1 Model Structure

The proposed AC-GAN includes a geometry module and an appearance module, with a pair of guided filters for image detail preservation. The whole network is end-to-end trainable and requires no supervision, as shown in Fig. 2.

Geometry Module: The geometry module consists of a spatial transformer network (STN), a composition module 'Composition' and a discriminator $D_B$, as shown in Fig. 2. The transformation in the STN can be affine, homography, or thin plate spline tps . For $N$ foreground objects of interest, the STN will predict $N$ transformation matrices and apply each predicted transformation with $M$ parameters to the corresponding foreground object. The 'Composition' will embed the transformed foreground objects into background images, which produces a 'Composed' image and a 'Composed Mask'. $D_B$ will drive the STN to learn the real geometry by distinguishing 'Composed, Composed Mask' and the training reference 'Adapted Real, Real Mask'.

Note that the training reference 'Adapted Real' of the geometry module is not a natural image that is realistic in both geometry and appearance. As the geometry module seeks synthesis realism in geometry, the appearance realism in natural images becomes noise that could mislead the discriminator in training. For the geometry module, the ideal training reference should be realistic in geometry while fake in appearance, which is not naturally available. We derive such references by using the 'Adapted Real', i.e. the output of the appearance module as shown in Fig. 2, to be described in the ensuing subsection.

Appearance Module: The appearance module employs a cycle structure to embed foreground objects into background images harmoniously, as shown in Fig. 2. It has two generators $G_A$ and $G_B$ for image-to-image translation in reverse directions, i.e. from 'Composed, Composed Mask' to 'Adapted Composed' and from 'Real' to 'Adapted Real, Real Mask'. It also has two discriminators $D_A$ and $D_B$ that distinguish the adapted images and natural images in the reverse mappings.

Specifically, $D_A$ strives to distinguish 'Adapted Composed' and 'Real', which guides $G_A$ to learn the translation from 'Composed, Composed Mask' to 'Adapted Composed' in the appearance domain. At the other end, $G_B$ learns the translation from 'Real' to 'Adapted Real, Real Mask', aiming for an 'Adapted Real' that is realistic in the geometry domain but similar to 'Composed' in the appearance domain. As mentioned in the previous subsection, AC-GAN uses the 'Adapted Real' from $G_B$ as a reference to train the geometry module so that it focuses on synthesizing more realistic image geometry (as the interfering appearance differences have been suppressed in 'Adapted Real').

Layers | Out Size | Configurations
FC1 | - | -
FC2 | $N \times M + N$ | -
Table 1: The structure of the STN in the geometry module shown in Fig. 2: $N$ and $M$ denote the number of transformation matrices and the number of parameters within each transformation matrix, respectively.

Guided Filter: As the appearance transfer in most image-to-image translation GANs tends to sacrifice image details, we introduce guided filters he2013 ; he2015 ($F$ in Fig. 2) into AC-GAN for detail preservation in the translation. Guided filters filter an image by using a guidance image that can be the input image itself or another different image. The detail-preserving transfer is thus formulated as a joint up-sampling problem that up-samples the filtering image under the guidance of another image.

The output of $G_A$ (image details lost), denoted by $p$, acts as the filtering input that provides the translated appearance information (contrast, illumination and so on), and the 'Composed' image (image details unchanged), denoted by $I$, acts as the guide that provides the edge and texture details. The 'Adapted Composed' with preserved details, denoted by $q$, can be derived by minimizing the reconstruction error between $q$ and $p$, subject to a linear model:

$$q_i = a_k I_i + b_k, \quad \forall i \in \omega_k$$

where $i$ denotes the index of a pixel and $k$ denotes the index of a local square window $\omega_k$ with a radius $r$. The coefficients $a_k$ and $b_k$ can be estimated by minimizing the difference between $q$ and $p$, i.e. by minimizing the following cost function within $\omega_k$:

$$E(a_k, b_k) = \sum_{i \in \omega_k} \left( (a_k I_i + b_k - p_i)^2 + \epsilon a_k^2 \right)$$

where $\epsilon$ is a regularization parameter that prevents $a_k$ from being too large. It can be solved via linear regression:

$$a_k = \frac{\frac{1}{|\omega|} \sum_{i \in \omega_k} I_i p_i - \mu_k \bar{p}_k}{\sigma_k^2 + \epsilon}, \qquad b_k = \bar{p}_k - a_k \mu_k$$

where $\mu_k$ and $\sigma_k^2$ are the mean and variance of $I$ within $\omega_k$, $|\omega|$ is the number of pixels in $\omega_k$, and $\bar{p}_k$ is the mean of $p$ within $\omega_k$.

By computing $(a_k, b_k)$ for every window $\omega_k$, the filter output can be derived by averaging all possible values of $q_i$:

$$q_i = \bar{a}_i I_i + \bar{b}_i$$

where $\bar{a}_i = \frac{1}{|\omega|} \sum_{k \in \omega_i} a_k$ and $\bar{b}_i = \frac{1}{|\omega|} \sum_{k \in \omega_i} b_k$. The guided filters are embedded in the cycle network structure to implement an end-to-end trainable system.
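For illustration, the guided filtering step can be sketched in NumPy for a single-channel image. This is a minimal reference implementation of the guided filter equations, not the code used in the experiments; the function names and the windowed-mean implementation are our own:

```python
import numpy as np

def _mean(img, r):
    """Windowed mean over a (2r+1)x(2r+1) box, edges handled by replicate padding."""
    k = 2 * r + 1
    p = np.pad(img, r, mode='edge')
    # integral image with a leading row/column of zeros
    s = np.zeros((p.shape[0] + 1, p.shape[1] + 1))
    s[1:, 1:] = np.cumsum(np.cumsum(p, axis=0), axis=1)
    h, w = img.shape
    return (s[k:k + h, k:k + w] - s[:h, k:k + w]
            - s[k:k + h, :w] + s[:h, :w]) / (k * k)

def guided_filter(I, p, r=4, eps=1e-3):
    """Guided filter: the output q is locally linear in the guide I (preserving
    its edges) while approximating the filtering input p in each window."""
    mu_I, mu_p = _mean(I, r), _mean(p, r)
    corr_Ip = _mean(I * p, r)
    var_I = _mean(I * I, r) - mu_I ** 2
    a = (corr_Ip - mu_I * mu_p) / (var_I + eps)  # per-window slope a_k
    b = mu_p - a * mu_I                          # per-window offset b_k
    # average the coefficients of all windows covering each pixel
    return _mean(a, r) * I + _mean(b, r)
```

With a flat guide, the output simply reproduces the windowed mean of the input; with an edge-rich guide, the output follows the guide's structure while taking its tonal values from the input.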

Figure 3: ‘Predicted Mask’ is the predicted attention mask of the ‘Real Image’. ‘W/O Mask’ refers to the ‘Adapted Real’ without using the attention mask which tends to be messy compared with ‘With Mask’ which refers to the ‘Adapted Real’ using the attention mask.

3.2 Attention Mask

Image-to-image translation applies to the whole image, which undesirably also translates regions beyond the foreground objects. Specifically, $G_B$ cannot generate 'Adapted Real' precisely from 'Real' as it first needs to identify the foreground region in 'Real'. To constrain the appearance transfer within the foreground objects and provide information about the foreground objects, attention masks denoted by 'Composed Mask' and 'Real Mask' (where '1' denotes foreground regions and '0' the rest) are concatenated with the 'Composed' and 'Adapted Real' images as the inputs of $G_A$ and $D_B$ in training. The attention mask provides a precise mask of the foreground region to be adapted and is used to adjust the weight of the cycle-consistency loss in different regions. With the attention mask, $G_A$ can adapt the foreground region of the composed image accurately and suppress undesired translation of other regions.

Note that $G_B$ generates 'Adapted Real' and 'Real Mask' under certain supervision. Specifically, 'Composed Mask' is a precise mask (ground truth) of 'Composed' as produced by the geometry module. With the cycle-consistency loss, 'Composed' and 'Composed Mask' are translated by $G_A$ and need to be recovered by $G_B$ from the translated image. The training of $G_B$ for the mapping from 'Real' to 'Adapted Real, Real Mask' is thus supervised, where an accurate 'Real Mask' helps to identify and adapt the foreground objects precisely. $D_B$ instead strives to distinguish 'Adapted Real, Real Mask' and 'Composed, Composed Mask', and this drives $G_B$ to generate a precise 'Real Mask' and a better 'Adapted Real'. As Fig. 3 shows, the 'Predicted Mask' (i.e. the predicted 'Real Mask') is quite precise, and the 'With Mask' (the 'Adapted Real' using the attention mask) is clearly better than the 'W/O Mask' (the 'Adapted Real' without using the attention mask).

As discussed above, the input of $D_B$ is the concatenation of the image and the attention mask, which means that the attention mask provides additional geometry information to the geometry module. In addition, $D_B$ is shared by the geometry and the appearance modules. A better translated 'Adapted Real' in appearance will thus enable $D_B$ to distinguish images largely according to the geometry features, which enables the geometry module to compose images with better geometric realism. At the other end, a composed image with better geometric realism enables the appearance module to concentrate on the adaptation in the appearance domain. The two modules thus collaborate and drive each other towards the optimal synthesis.
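The mask-weighted cycle-consistency idea described above can be sketched in NumPy. The helper below is our own illustration, and `fg_weight` is an illustrative hyper-parameter rather than a value from the paper:

```python
import numpy as np

def attentional_cycle_loss(recovered, original, mask, fg_weight=10.0):
    """L1 cycle-consistency loss where foreground pixels (mask == 1) are
    weighted more heavily than the rest of the image, so the network
    concentrates on adapting the embedded foreground objects."""
    w = np.where(mask > 0.5, fg_weight, 1.0)   # per-pixel weights
    return float(np.mean(w * np.abs(recovered - original)))
```

A perfect cycle reconstruction yields zero loss, while the same reconstruction error inside the foreground region is penalized `fg_weight` times more than in the background.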

Figure 4: Hierarchy composition mechanism: $h_{a1}$, $h_{a2}$, $h_{b1}$ and $h_{b2}$ denote the hierarchy parameters of foregrounds $F_a$ and $F_b$, subject to $h_{a1} + h_{a2} = 1$ and $h_{b1} + h_{b2} = 1$. $M_a$ and $M_b$ denote the binary masks of $F_a$ and $F_b$, respectively. L1 and L2 indicate the occlusion relationship between the foregrounds. The foreground object with a larger hierarchy parameter in L1 will be shown in the composed image. The network will learn the hierarchy parameters to compose realistic images as shown in 'Real Occlusion'. A direct composition may lead to unrealistic images as shown in 'Fake Occlusion'.
Figure 5: Comparison of AC-GAN and its variants with other GANs: The proposed AC-GAN is capable of generating more realistic images with correct occlusions when more than one foreground object is to be composed. AC-GAN (WA) denotes the output of the geometry module of AC-GAN, and AC-GAN (WF) denotes the synthesized images without the guided filter.

3.3 Multiple Objects Composition

One unique feature of the proposed AC-GAN is that it can compose with multiple foreground objects in one go. With multiple foreground objects, correct occlusions need to be determined, otherwise the composed image becomes unrealistic as shown in 'Fake Occlusion' in Fig. 4. We design a novel hierarchy composition mechanism for correct occlusions in synthesis as illustrated in Fig. 4 (using 2 layers for illustration). Specifically, layers denoted by $L_1$ and $L_2$ with hierarchy parameters (HPs) are estimated to determine the proportion of the foreground objects in each layer. As Fig. 4 shows, the proportion of $F_a$ in $L_1$ and $L_2$ is $h_{a1}$ and $h_{a2}$, subject to $h_{a1} + h_{a2} = 1$. Each layer contains an object mask. The top-layer mask is the union of the object masks, denoted by $M_1 = M_a \cup M_b$, and the bottom-layer mask is $M_1$ minus the occlusion regions of the different foreground masks, denoted by $M_2 = M_1 - (M_a \cap M_b)$. $L_1$ and $L_2$ can thus be formulated as follows:

$$L_1 = (h_{a1} F_a \odot M_a + h_{b1} F_b \odot M_b) \odot M_1$$
$$L_2 = (h_{a2} F_a \odot M_a + h_{b2} F_b \odot M_b) \odot M_2$$

The different layers are added up with the background image $B$ to obtain the composed image 'Composed' in Fig. 2:

$$I_{composed} = L_1 + L_2 + B \odot (1 - M_1)$$

As the occlusion region will only be filled by the top layer according to the object masks, the foreground object with the highest HP in the top layer will occlude the other foreground objects. For regions without occlusion, the hierarchy composition mechanism has no effect as the HPs of each object sum up to 1 across all layers. During inference, the highest HP is reset to 1 and the rest to 0, which generates an explicit occlusion hierarchy among the foreground objects. The HPs are predicted by the STN, whose learning is driven by $D_B$ that aims to differentiate the correct occlusions of real images from unrealistic false occlusions. With $N$ foreground objects to be composed, the STN will predict $N$ hierarchy parameters for composition. As the number of parameters of the selected transformation is $M$, the total number of parameters predicted by the STN is $N \times M + N$; Table 1 shows the detailed network structure.
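A minimal NumPy sketch of this two-layer composition for two foreground objects may help make the mechanism concrete. This is our own illustrative code, assuming single-channel images and hard binary masks:

```python
import numpy as np

def hierarchy_compose(bg, fgs, masks, hp_top, training=True):
    """Two-layer hierarchy composition for two foregrounds.
    bg: HxW background; fgs/masks: two HxW foregrounds and their binary masks;
    hp_top: top-layer hierarchy parameters (h_a1, h_b1), the bottom-layer
    parameters being 1 - hp_top.  At inference the larger HP is reset to 1."""
    (fa, fb), (ma, mb) = fgs, masks
    ha1, hb1 = hp_top
    if not training:                       # hard occlusion at inference
        ha1, hb1 = (1.0, 0.0) if ha1 >= hb1 else (0.0, 1.0)
    m1 = np.maximum(ma, mb)                # union of the object masks
    m2 = m1 - ma * mb                      # union minus the occlusion region
    l1 = (ha1 * fa * ma + hb1 * fb * mb) * m1
    l2 = ((1 - ha1) * fa * ma + (1 - hb1) * fb * mb) * m2
    return l1 + l2 + bg * (1 - m1)
```

In non-occluded regions the two layers sum each object back to full strength (the HPs sum to 1), while in the occlusion region only the top layer contributes, so the object with the larger top-layer HP wins.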

3.4 Adversarial Training

Since AC-GAN aims to achieve synthesis realism in both geometry and appearance spaces, its training has two adversarial objectives, one for realistic geometry and one for realistic appearance. The geometry module and appearance module are actually two inter-connected local GANs that collaborate with each other during training. For clarity, we denote the input of the geometry module, the composed image and the real image as $x$, $y$ and $r$, and the corresponding domains by $X$, $Y$ and $R$, respectively.

Method A - Method B | Baseline | UNIT liu2017unit | CycleGAN zhu2017cyclegan | ST-GAN lin2018stgan | SF-GAN zhan2019sfgan | Real
AC-GAN(WA) | 83%-17% | 53%-47% | 46%-54% | 60%-40% | 28%-72% | 19%-81%
AC-GAN(WM) | 95%-5% | 87%-13% | 72%-28% | 82%-18% | 55%-44% | 36%-64%
AC-GAN(WO) | 96%-4% | 81%-19% | 66%-34% | 77%-23% | 52%-48% | 31%-69%
AC-GAN(WF) | 100%-0% | 91%-9% | 75%-25% | 89%-11% | 59%-41% | 39%-61%
AC-GAN | 100%-0% | 92%-8% | 79%-21% | 89%-11% | 60%-40% | 40%-60%
Table 2: Comparison and ablation study by using Amazon Mechanical Turk (AMT) user study for evaluating the realism of synthesized portrait images: Each cell contains two percentages that tell which of paired images synthesized by Method A and Method B (in the format of ‘Method A’ - ‘Method B’) are deemed as more realistic by users.

In the geometry module, the STN performs as a generator that generates the transformed foreground objects. We adopt the Wasserstein GAN arjovsky2017wgan objective to train the network, and the loss function of the STN and $D_B$ is formulated by:

$$\mathcal{L}_{geo} = \mathbb{E}_{r \sim P_R}[D_B(\hat{r} \oplus \hat{m}_r)] - \mathbb{E}_{y \sim P_Y}[D_B(y \oplus m_y)]$$

where $\hat{r}$ and $\hat{m}_r$ denote the 'Adapted Real' and 'Real Mask', and $\oplus$ denotes channel-wise concatenation. $D_B$ maximizes $\mathcal{L}_{geo}$ while the STN minimizes it through the composed image $y$.

The appearance module adopts a cycle structure that involves two mappings in reverse directions. The learning objective consists of an adversarial loss for the cross-domain mapping and a cycle-consistency loss that prevents mode collapse. For the adversarial loss, the loss function of $G_A$ and $D_A$ can be formulated by:

$$\mathcal{L}_{GAN}(G_A, D_A) = \mathbb{E}_{r \sim P_R}[\log D_A(r)] + \mathbb{E}_{y \sim P_Y}[\log(1 - D_A(G_A(y \oplus m_y)))]$$

where $m_y$ denotes the 'Composed Mask'.

To ensure that images can be recovered in the translation cycle and to guide the network to focus on the foreground objects, an attentional cycle-consistency loss is designed:

$$\mathcal{L}_{cyc}(G_A, G_B) = \mathbb{E}_{y \sim P_Y}\big[\|W \odot (G_B(G_A(y \oplus m_y)) - (y \oplus m_y))\|_1\big]$$

where $W$ denotes the weights of the foreground region. An identity loss is also introduced to ensure that the translated image preserves features of the original image:

$$\mathcal{L}_{idt}(G_A) = \mathbb{E}_{r \sim P_R}[\|G_A(r) - r\|_1]$$
The loss of the reverse mapping can be obtained similarly.

While updating the appearance module, all weights of the geometry module are frozen. For the mapping from $Y$ to $R$, $G_A$ and $D_A$ are optimized alternately with the full objective $\mathcal{L} = \mathcal{L}_{GAN} + \lambda \mathcal{L}_{cyc} + \mu \mathcal{L}_{idt}$, where $\lambda$ and $\mu$ denote the weights of the cycle-consistency loss and the identity loss.
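As a toy illustration of how the loss terms above combine, the following sketch computes a Wasserstein critic objective and the weighted appearance objective. The weights `lam` and `mu` are illustrative placeholders, not values from the paper:

```python
import numpy as np

def wgan_critic_loss(d_real, d_fake):
    """Wasserstein critic objective: mean critic score on the reference pairs
    ('Adapted Real', 'Real Mask') minus the score on the composed pairs.
    The critic maximizes this quantity; the STN maximizes the score of
    its composed outputs."""
    return float(np.mean(d_real) - np.mean(d_fake))

def appearance_objective(l_gan, l_cyc, l_idt, lam=10.0, mu=5.0):
    """Full appearance-module objective: adversarial loss plus weighted
    cycle-consistency and identity terms (lam and mu are illustrative)."""
    return l_gan + lam * l_cyc + mu * l_idt
```

In practice the scalar scores would come from the discriminator networks; here they are plain arrays so the arithmetic is easy to check.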

4 Experiments

4.1 Datasets

The proposed AC-GAN is evaluated over three image synthesis tasks on automated portrait editing, scene text image generation and automated indoor rendering. A number of public datasets are employed in experiments which include:

CelebA liu2015 is a face image dataset that consists of more than 200k celebrity images with 40 attribute annotations. The dataset is characterized by its large quantity, large face pose variations, complicated background clutter and rich annotations, and it is widely used for face attribute prediction.

SUNCG song2017indoor is a large 3D model repository for indoor scenes. SUNCG is an ongoing effort to establish a richly-annotated, large-scale dataset of 3D scenes. The dataset contains over 45K different scenes with manually created realistic room and furniture layouts.

ICDAR2013 icdar2013 is used in the Robust Reading Competition in the International Conference on Document Analysis and Recognition (ICDAR) 2013. It contains 848 word images for network training and 1095 for testing.

ICDAR2015 icdar2015 is used in the Robust Reading Competition under ICDAR 2015. It contains incidental scene text images that were captured without deliberate preparation. 2077 text image patches are cropped from this dataset, where a large number of the cropped scene texts suffer from perspective and curvature distortions.

IIIT5K iiit5k has 2000 training images and 3000 test images that are cropped from scene texts and born-digital images. Each word in this dataset has a 50-word lexicon and a 1000-word lexicon, where each lexicon consists of a ground-truth word and a set of randomly picked words.

SVT wang2011 is collected from the Google Street View images that were used for scene text detection research. 647 word images are cropped from 249 street view images, and most of the cropped texts are almost horizontal.

SVTP phan2013 has 639 word images that are cropped from the SVT images. Most images in this dataset suffer from perspective distortion and are purposely selected for the evaluation of scene text recognition under perspective views.

CUTE risnumawan2014 has 288 word images most of which are curved. All words are cropped from the CUTE dataset which contains 80 scene text images that are originally collected for scene text detection research.

4.2 Implementation

The proposed AC-GAN is trained on two NVIDIA GTX 1080TI GPUs with 11GB memory. It is small and efficient in processing: it takes 132ms and 96ms per image in training and testing, respectively, with a batch size of 4. The high efficiency is largely attributed to the introduction of guided filters. Specifically, guided filters can preserve the resolution of the input images effectively, so a generator of small size is sufficient, which helps to reduce the network size and time costs greatly.

The transformation in the STN is homography for the portrait editing and indoor rendering experiments, and thin plate spline for the scene text synthesis experiment. Although the end-to-end pipeline is not fully convolutional due to the existence of the STN, the trained model can still deal with images of different sizes. As the input size of the STN is fixed, a test image of a different size is first resized to the STN input size to predict the transformation parameters, and the corresponding transformation is then applied to the original image instead of the resized one, so we obtain a composed image of the original size. As the appearance module is fully convolutional, the end-to-end model can deal with images of arbitrary size.
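To make the resize-then-apply step concrete, a transformation predicted on the resized STN input can be mapped back to the original resolution by conjugating with the coordinate-scaling matrix. This is a sketch for the homography case, under our assumption that the matrix acts on pixel coordinates; the size values in the usage are illustrative:

```python
import numpy as np

def rescale_homography(H, stn_size, orig_size):
    """Convert a 3x3 homography predicted on a resized STN input so that it
    acts directly on original-resolution pixel coordinates.
    stn_size, orig_size: (height, width) tuples."""
    (sh, sw), (oh, ow) = stn_size, orig_size
    S = np.diag([sw / ow, sh / oh, 1.0])   # original -> resized coordinates
    # map original coords into the STN frame, apply H, map back out
    return np.linalg.inv(S) @ H @ S
```

For example, a pure translation of 2 pixels predicted on a 100x100 STN input corresponds to a 4-pixel translation on a 200x200 original image.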

Methods | ICDAR2013 | ICDAR2015 | IIIT5K | SVT | SVTP | CUTE | Average
Jaderberg jaderberg2014synth | 70.1 | 55.4 | 79.8 | 78.4 | 65.1 | 56.3 | 67.5
Gupta gupta2016synth | 80.9 | 51.8 | 79.0 | 68.0 | 53.4 | 47.6 | 63.5
Zhan zhan2018ver | 81.4 | 54.7 | 80.2 | 77.7 | 65.0 | 56.7 | 69.3
SF-GAN zhan2019sfgan | 81.2 | 55.5 | 81.3 | 79.2 | 64.9 | 57.3 | 69.9
Baseline | 65.4 | 48.1 | 76.1 | 64.7 | 48.6 | 42.7 | 57.6
AC-GAN | 81.9 | 56.8 | 82.3 | 78.2 | 66.8 | 59.1 | 70.9
Table 3: Scene text recognition accuracy over the six datasets ICDAR2013, ICDAR2015, IIIT5K, SVT, SVTP and CUTE as listed in the first row (the last column shows the average), where 1 million text images synthesized by the methods listed in Column 1 are used for text recognition model training.

4.3 Portrait Editing

Data preparation: We use CelebA liu2015 and follow the provided training/test split in experiments. With the annotations, we extract 3000 faces without hats and glasses as the background images and 1500 faces with hats and glasses as training references. For foreground objects, we use 20 hats and 25 pairs of glasses cropped in front-parallel view to compose with the randomly selected background images. The hats and glasses are not paired with the faces, and the composed images are not paired with the training references either.

We compare the proposed AC-GAN with the state-of-the-art GANs UNIT, CycleGAN and ST-GAN in image synthesis. As the original CycleGAN and UNIT were not designed for image composition, we apply them to achieve image-to-image translation from the background images to the training references. The original ST-GAN can compose with a single object only, and we extend it to multiple objects by composing each object iteratively. The user studies were performed by using Amazon Mechanical Turk (AMT), where users recruited on the Internet tell which of each pair of images (synthesized by two different GANs as listed in Table 2) is more realistic.

Results Analysis: Table 2 shows the AMT results, where 'Baseline' denotes a baseline model that embeds foreground objects with random alignment and appearance. Each cell contains two percentages telling which of the paired images synthesized by Method A or Method B are deemed more realistic (paired images are presented to users for judgment). As Table 2 shows, the AMT scores of AC-GAN are significantly higher than those of the 'Baseline' and also close to those of 'Real' images (40% - 60%), demonstrating the superior performance of AC-GAN in synthesis realism.

An ablation study is conducted with four AC-GAN variants as shown in Table 2. Specifically, AC-GAN (WA) denotes AC-GAN without the proposed appearance module. It has much lower AMT scores than the standard AC-GAN, demonstrating the importance of appearance realism in synthesizing realistic images. On the other hand, it performs clearly better than ST-GAN with an AMT score of 60% versus 40%, largely due to the use of 'Adapted Real' as references, which improves synthesis realism in the geometry space. AC-GAN (WM) denotes AC-GAN without using attention masks. Its AMT scores are clearly lower than those of the standard AC-GAN because attention masks help generate better appearance adaptation. AC-GAN (WO) denotes AC-GAN without incorporating the hierarchy composition mechanism. As shown in Table 2, the AMT scores of AC-GAN (WO) are clearly lower than those of the standard AC-GAN, which shows that the correct occlusion relationship between foreground objects affects synthesis realism significantly. AC-GAN (WF) denotes AC-GAN without using the guided filter. Its AMT scores are just slightly lower than those of the standard AC-GAN because guided filters mainly preserve image resolution and details but do not help much with synthesis realism.

Fig. 5 compares images synthesized by the proposed AC-GAN and three state-of-the-art GANs. As Fig. 5 shows, the AC-GAN synthesized images are much more realistic. Specifically, CycleGAN zhu2017cyclegan and UNIT liu2017unit can achieve certain realism in the appearance domain, but the synthesized images are blurry and the embedded hats and glasses are poorly controlled. ST-GAN lin2018stgan can perform the transformation in the geometry domain, but the learned geometry is not accurate in terms of object size and embedding locations. In addition, it cannot handle synthesis realism in the appearance domain. The AC-GAN (WA) without the appearance module produces clear artifacts in the appearance space, but it achieves more realistic geometry than ST-GAN thanks to the better training references. The AC-GAN (WF) without guided filters can achieve realism in both the geometry and appearance spaces, but the synthesized images tend to sacrifice resolution and image details compared with images synthesized by the standard AC-GAN (zoom-in may be needed to see the difference clearly).

Figure 6: Illustration of scene text image synthesis by different GANs: ‘BG’ in Row 1 denotes the background images. Rows 2-4 show the images that are synthesized by UNIT, CycleGAN and ST-GAN, respectively. AC-GAN1, AC-GAN2 and AC-GAN3 in Rows 5-7 show the images synthesized by AC-GAN when 1, 2 and 3 foreground text instances are applied in synthesis, respectively.

4.4 Scene Text Synthesis

Experiment Setting: For scene text image synthesis, the AC-GAN needs a set of background images, foreground texts and real scene text image patches as references. We collect the scene text image patches by cropping from the training images of ICDAR2013 icdar2013 , ICDAR2015 icdar2015 , SVT wang2011 and CUTE risnumawan2014 . In the image patch cropping, we extend the provided annotation boxes by a random scale to include more image structural information, which will be used by the geometry module for image geometry learning. For the background images, we collect them by performing image in-painting on the cropped scene text image patches to erase the scene texts, as illustrated by ‘BG’ in Fig. 6. The foreground text is randomly selected from the 90k-lexicon jaderberg2014synth . Note that we also apply random rotations to the collected scene text image patches and background images so as to include richer variations in image geometry.

In training the AC-GAN, we adopt the thin plate spline transformation and constrain the range of the transformation parameters so that the transformed foreground objects will not be too small or fall outside the image. The AC-GAN synthesized scene text images cannot be applied to training directly as they contain some background introduced by the cropping in the collection stage; we therefore crop out the text regions with tighter boxes by detecting the minimal enclosing rectangle according to the ‘Composed Mask’. To benchmark with state-of-the-art image synthesis techniques, we extract 1 million images (the same amount as synthesized by the proposed AC-GAN) from the synthesized scene text image datasets jaderberg2014synth ; gupta2016synth ; zhan2018ver ; zhan2019sfgan .
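The tighter-box cropping above can be computed directly from the binary ‘Composed Mask’. A minimal sketch (the `tight_bbox` helper name is ours, not from the paper):

```python
import numpy as np

def tight_bbox(mask):
    # mask: binary 'Composed Mask'; nonzero pixels mark the embedded text.
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None  # no foreground present
    # Minimal enclosing rectangle as (y0, x0, y1, x1), with exclusive ends.
    return ys.min(), xs.min(), ys.max() + 1, xs.max() + 1

# Cropping a synthesized image to the text region:
mask = np.zeros((10, 10), dtype=np.uint8)
mask[3:6, 2:8] = 1
y0, x0, y1, x1 = tight_bbox(mask)
# image[y0:y1, x0:x1] would then be the tightly cropped text patch
```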

We use MORAN luo2019moran as the scene text recognition model and train it by using the synthesized images as described above. In addition, a baseline model ‘Baseline’ as shown in Table 3 is trained where 1 million images are synthesized by applying random colors, fonts and geometry transformations to the foreground texts. The trained scene text recognition models are tested over six public scene text datasets ICDAR2013 icdar2013 , ICDAR2015 icdar2015 , IIIT5K iiit5k , SVT wang2011 , SVTP phan2013 and CUTE risnumawan2014 as described in Datasets. Word-level recognition accuracy is adopted to evaluate the performance of the trained recognition models and thus the effectiveness of the training images synthesized by different models.
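Word-level accuracy here is simply the fraction of predictions that exactly match the ground-truth words; a minimal sketch under the common case-insensitive protocol (the exact normalization used in the evaluation is our assumption):

```python
def word_accuracy(preds, gts):
    # Case-insensitive exact word match, the standard scene-text protocol.
    assert len(preds) == len(gts)
    correct = sum(p.lower() == g.lower() for p, g in zip(preds, gts))
    return correct / len(gts)

# Example: two of three words recognized correctly.
acc = word_accuracy(["Hello", "worId", "GAN"], ["hello", "world", "gan"])
```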

Figure 7: Image synthesis for indoor rendering: the ST-GAN and AC-GAN trained on synthetic data are tested on the real scene for cabinet and sofa placement.
Figure 8: The effect of radius r and regularization ε in the guided filter. ‘Input’ and ‘Guide’ denote the input image and guide image. The first row shows the output images with different radius r, and the second row shows the output images with different ε. When comparing output images with different r, the regularization ε is fixed; when comparing output images with different ε, the radius r is fixed.

Results Analysis: Table 3 shows scene text recognition accuracy by different synthesis methods. As Table 3 shows, AC-GAN performs much better than the baseline and achieves the highest recognition accuracy on 5 out of 6 evaluated datasets. In addition, it achieves up to a 1% improvement in average recognition accuracy across the 6 datasets, demonstrating the superior usefulness of its synthesized images when used for training scene text recognition models. Compared with our earlier model SF-GAN, AC-GAN achieves clear improvements on the SVTP and CUTE datasets, where most scene texts are curved or in perspective views. The better performance is largely attributed to the attention masks, which help achieve better synthesis geometry by providing better training references (i.e. ‘Adapted Real’ in Fig. 2). Additionally, the attention masks, as input to the discriminator, also provide additional geometry information for the training of the geometry module.

Fig. 6 shows the scene text images that are synthesized by different image synthesis methods. For UNIT liu2017unit and CycleGAN zhu2017cyclegan , we directly apply them to perform image-to-image translation from background images to real images. As Fig. 6 shows, UNIT tends to generate images with messy strokes. CycleGAN can generate strokes with realistic appearance, but the generated strokes have no semantic meaning. Besides, neither method generates any annotations. ST-GAN lin2018stgan learns poor geometry transformations, largely because it uses real images as references, where the appearance realism in real images misleads the network training.

AC-GAN1, AC-GAN2 and AC-GAN3 denote the AC-GAN synthesized images when 1, 2 and 3 foreground text instances are employed in synthesis, respectively. It should be noted that the three AC-GAN models are trained separately by using different training references. For AC-GAN2 and AC-GAN3, the training references are the cropped images containing at least 2 and 3 text instances, respectively. As Fig. 6 shows, AC-GAN1 with a single foreground text instance achieves excellent realism in both geometry and appearance spaces. AC-GAN2 and AC-GAN3 can align and coordinate the locations of multiple foreground text instances correctly, where different spatial transformations are learned and applied to different foreground text instances according to the local geometry. At the same time, the appearance of the embedded text instance is also adapted to be harmonious with the contextual backgrounds.

Note that scene text images with multiple text instances cannot be synthesized by applying existing GANs (e.g. ST-GAN and SF-GAN) multiple times. The reason is that multiple executions of existing GANs involve little coordination: each execution seeks the best embedding region for the current text instance without considering the other executions.

4.5 Indoor Rendering

Experiment Setting: The indoor rendering task aims to compose furniture into indoor scenes with paired images. The indoor dataset was synthesized with natural geometry and illumination from the SUNCG dataset song2017indoor that contains 41,499 scene models and 568,749 camera viewpoints from zhang2017indoor , rendered with jakob2010indoor . These synthesized images are used as the training references, and the paired object and background image extracted from each synthesized image are used as the foreground object and background image in training. In addition, we apply random brightness over the extracted foreground objects to simulate real scene image composition, where the foreground objects and the background images are usually unmatched in illumination.
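The random brightness perturbation above can be sketched as a simple multiplicative scaling; the factor range below is our assumption, not a value stated in the paper:

```python
import numpy as np

def random_brightness(fg, rng, lo=0.6, hi=1.4):
    # fg: float foreground patch in [0, 1].
    # Scale brightness by a random factor to mismatch the background
    # illumination, as done when building the training pairs.
    factor = rng.uniform(lo, hi)
    return np.clip(fg * factor, 0.0, 1.0)

rng = np.random.default_rng(1)
fg = rng.random((4, 4, 3))        # a toy foreground patch
out = random_brightness(fg, rng)  # perturbed copy fed to the network
```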

We compare the proposed AC-GAN with ST-GAN lin2018stgan by testing them on the real scene images. The real scene images for testing are collected from the Internet by cropping the foreground objects and selecting the background images manually. The training of ST-GAN follows the setting as described in the original paper lin2018stgan .

Results Analysis: Fig. 7 shows two example syntheses on real scene images. As Fig. 7 shows, ST-GAN can only synthesize in the geometry space, and the synthesis artifacts in the appearance space, such as the unmatched brightness and contrast, make the synthesized images clearly unrealistic. As a comparison, the proposed AC-GAN places foreground objects into the background images with better geometry and alignment thanks to the better training references for the geometry module. It also adapts the brightness and contrast of the foreground objects to match the background images realistically thanks to the appearance module. Note that directly concatenating a geometry adaptation model (e.g. ST-GAN) and an appearance adaptation model (e.g. CycleGAN) will not produce realistic adaptation in the geometry and appearance domains, because the discrepancy in one domain will affect the training of the adaptation model in the other.

We also evaluate the role of the radius r and regularization ε as mentioned in Section 3.1 Guided Filter, where r controls the bandwidth of the guided filter and ε controls the degree of edge preservation. As Fig. 8 shows, the filter output with a large r tends to be similar to the ‘Input’ with high resolution, but its appearance does not change much and still tends to be unrealistic. When r becomes smaller, the filter output is translated towards the ‘Guide’ in appearance and becomes almost the same as the ‘Guide’ when r is very small. At the same time, the resolution of the filter output decreases as r becomes smaller. For the regularization ε, the filter output tends to be blurry and loses details when ε is large. With the decrease of ε, the filter output preserves more details, but certain noises are introduced when ε becomes very small. A trade-off therefore needs to be made when setting r and ε depending on the specific tasks and requirements; in our implemented system, we set them empirically.
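For reference, the grayscale guided filter (He et al.) underlying Fig. 8 can be sketched as follows. This is a minimal illustration of how r and ε enter the filter, not the paper's exact implementation: a large ε pushes the local linear coefficient a towards 0 (stronger smoothing towards the guide's local mean), while a small ε keeps a near 1 and preserves the input's detail.

```python
import numpy as np

def box_filter(img, r):
    # Mean over a (2r+1)x(2r+1) window via 2D cumulative sums.
    h, w = img.shape
    cum = np.cumsum(np.cumsum(img, axis=0), axis=1)
    cum = np.pad(cum, ((1, 0), (1, 0)))
    out = np.empty((h, w))
    for y in range(h):
        y0, y1 = max(y - r, 0), min(y + r + 1, h)
        for x in range(w):
            x0, x1 = max(x - r, 0), min(x + r + 1, w)
            area = (y1 - y0) * (x1 - x0)
            out[y, x] = (cum[y1, x1] - cum[y0, x1]
                         - cum[y1, x0] + cum[y0, x0]) / area
    return out

def guided_filter(I, p, r, eps):
    # I: guide image, p: input image, both float arrays in [0, 1].
    mean_I, mean_p = box_filter(I, r), box_filter(p, r)
    var_I = box_filter(I * I, r) - mean_I * mean_I
    cov_Ip = box_filter(I * p, r) - mean_I * mean_p
    a = cov_Ip / (var_I + eps)  # large eps -> a ~ 0 (more smoothing)
    b = mean_p - a * mean_I
    # Average the per-window coefficients, then apply the linear model.
    return box_filter(a, r) * I + box_filter(b, r)
```

With I == p and a tiny ε, the output stays close to the input (details preserved); with a flat guide, the output collapses towards the local mean of p, which matches the behavior shown in Fig. 8.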

5 Conclusions

This paper presents AC-GAN, an end-to-end trainable network that synthesizes realistic images given multiple foreground objects and a background image. The AC-GAN consists of a geometry module and an appearance module and is capable of achieving synthesis realism in the geometry and appearance domains simultaneously. A novel hierarchy composition mechanism is designed to handle the occlusion among multiple foreground objects, and an attention mask is exploited to guide the appearance adaptation and provide better training references for the geometry module. In addition, a guided filter is introduced to preserve the resolution of the composed image during the appearance adaptation. The portrait editing experiment shows that the proposed AC-GAN can synthesize more realistic images than state-of-the-art GANs, largely due to the coordination between the geometry and appearance modules and the correct occlusions achieved by the proposed hierarchy composition mechanism. The scene text image synthesis experiment shows that the proposed AC-GAN is capable of synthesizing useful images for training accurate and robust deep recognition models. The indoor rendering experiment demonstrates how the parameters of the guided filter affect image synthesis in different manners.

The proposed AC-GAN mainly works on 2-dimensional (2D) images, which still entails various constraints such as limited views of the foreground objects. We will explore image synthesis in 3-dimensional (3D) space for better synthesis flexibility and realism in our future work.

6 Acknowledgement

This work is funded by the Ministry of Education (MOE), Singapore, under the project “A semi-supervised learning approach for accurate and robust detection of texts in scenes” (RG128/17 (S)).


  • (1) Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein generative adversarial networks. In: ICML (2017)
  • (2) Azadi, S., Fisher, M., Kim, V., Wang, Z., Shechtman, E., Darrell, T.: Multi-content gan for few-shot font style transfer. In: CVPR (2018)
  • (3) Azadi, S., Pathak, D., Ebrahimi, S., Darrell, T.: Compositional gan: Learning conditional image composition. arXiv:1807.07560 (2018)
  • (4) Bau, D., Zhu, J.Y., Strobelt, H., Zhou, B., Tenenbaum, J.B., Freeman, W.T., Torralba, A.: Gan dissection: Visualizing and understanding generative adversarial networks. In: ICLR (2019)
  • (5) Bookstein, F.L.: Principal warps: Thin-plate splines and the decomposition of deformations. TPAMI 11(6) (1989)
  • (6) Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., Abbeel, P.: Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In: NIPS (2016)
  • (7) Denton, E., Chintala, S., Szlam, A., Fergus, R.: Deep generative image models using a laplacian pyramid of adversarial networks. In: NIPS (2015)
  • (8) Dwibedi, D., Misra, I., Hebert, M.: Cut, paste and learn: Surprisingly easy synthesis for instance detection. In: ICCV (2017)
  • (9) Gaidon, A., Wang, Q., Cabon, Y., Vig, E.: Virtual worlds as proxy for multi-object tracking analysis. In: CVPR (2016)
  • (10) Ge, Y., Li, Z., Zhao, H., Yin, G., Yi, S., Wang, X., Li, H.: Fd-gan: Pose-guided feature distilling gan for robust person re-identification. In: NeurIPS (2018)
  • (11) Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. In: NIPS, pp. 2672–2680 (2014)
  • (12) Gupta, A., Vedaldi, A., Zisserman, A.: Synthetic data for text localisation in natural images. In: CVPR (2016)
  • (13) He, K., Sun, J.: Fast guided filter. arXiv:1505.00996 (2015)
  • (14) He, K., Sun, J., Tang, X.: Guided image filtering. TPAMI (2013)
  • (15) Hoffman, J., Tzeng, E., Park, T., Zhu, J.Y., Isola, P., Efros, A.A., Darrell, T.: Cycada: Cycle-consistent adversarial domain adaptation. In: ICML (2018)
  • (16) Huang, X., Liu, M.Y., Belongie, S., Kautz, J.: Multimodal unsupervised image-to-image translation. In: ECCV (2018)
  • (17) Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: CVPR (2017)
  • (18) Jaderberg, M., Simonyan, K., Vedaldi, A., Zisserman, A.: Synthetic data and artificial neural networks for natural scene text recognition. In: NIPS Deep Learning Workshop (2014)
  • (19) Jakob, W.: Mitsuba renderer (2010)
  • (20) Karatzas, D., Gomez-Bigorda, L., Nicolaou, A., Ghosh, S., Bagdanov, A., Iwamura, M., Matas, J., Neumann, L., Chandrasekhar, V.R., Lu, S., Shafait, F., Uchida, S., Valveny, E.: Icdar 2015 competition on robust reading. In: ICDAR, pp. 1156–1160 (2015)
  • (21) Karatzas, D., Shafait, F., Uchida, S., Iwamura, M., i Bigorda, L.G., Mestre, S.R., Mas, J., Mota, D.F., Almazan, J.A., de las Heras, L.P.: Icdar 2013 robust reading competition. In: ICDAR, pp. 1484–1493 (2013)
  • (22) Lee, D., Liu, S., Gu, J., Liu, M.Y., Yang, M.H., Kautz, J.: Context-aware synthesis and placement of object instances. In: NIPS (2018)
  • (23) Lin, C.H., Yumer, E., Wang, O., Shechtman, E., Lucey, S.: St-gan: Spatial transformer generative adversarial networks for image compositing. In: CVPR (2018)
  • (24) Liu, M.Y., Breuel, T., Kautz, J.: Unsupervised image-to-image translation networks. In: NIPS (2017)
  • (25) Liu, M.Y., Huang, X., Mallya, A., Karras, T., Aila, T., Lehtinen, J., Kautz, J.: Few-shot unsupervised image-to-image translation. arXiv:1905.01723 (2019)
  • (26) Liu, M.Y., Tuzel, O.: Coupled generative adversarial networks. In: NIPS (2016)
  • (27) Liu, Z., Luo, P., Wang, X., Tang, X.: Deep learning face attributes in the wild. In: ICCV (2015)
  • (28) Luan, F., Paris, S., Shechtman, E., Bala, K.: Deep painterly harmonization. arXiv:1804.03189 (2018)
  • (29) Luo, C., Jin, L., Sun, Z.: Moran: A multi-object rectified attention network for scene text recognition. Pattern Recognition 90, 109–118 (2019)
  • (30) Mirza, M., Osindero, S.: Conditional generative adversarial nets. arXiv:1411.1784 (2014)
  • (31) Mishra, A., Alahari, K., Jawahar, C.: Scene text recognition using higher order language priors. In: BMVC (2012)
  • (32) Movshovitz-Attias, Y., Kanade, T., Sheikh, Y.: How useful is photo-realistic rendering for visual learning? In: ECCV (2016)
  • (33) Park, D., Ramanan, D.: Articulated pose estimation with tiny synthetic videos. In: CVPR (2015)
  • (34) Park, T., Liu, M.Y., Wang, T.C., Zhu, J.Y.: Semantic image synthesis with spatially-adaptive normalization. In: CVPR (2019)
  • (35) Pérez, P., Gangnet, M., Blake, A.: Poisson image editing. TOG 22(3) (2003)
  • (36) Phan, T.Q., Shivakumara, P., Tian, S., Tan, C.L.: Recognizing text with perspective distortion in natural scenes. In: ICCV (2013)
  • (37) Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. In: ICLR (2016)
  • (38) Richter, S.R., Vineet, V., Roth, S., Koltun, V.: Playing for data: Ground truth from computer games. In: ECCV (2016)
  • (39) Risnumawan, A., Shivakumara, P., Chan, C.S., Tan, C.L.: A robust arbitrary text detection system for natural scene images. Expert Systems with Applications 41(18), 8027–8048 (2014)
  • (40) Shrivastava, A., Pfister, T., Tuzel, O., Susskind, J., Wang, W., Webb, R.: Learning from simulated and unsupervised images through adversarial training. In: CVPR (2017)
  • (41) Song, S., Yu, F., Zeng, A., Chang, A.X., Savva, M., Funkhouser, T.: Semantic scene completion from a single depth image. In: CVPR (2017)
  • (42) Su, H., Qi, C.R., Li, Y., Guibas, L.: Render for cnn: Viewpoint estimation in images using cnns trained with rendered 3d model views. In: ICCV (2015)
  • (43) Tsai, Y.H., Shen, X., Lin, Z., Sunkavalli, K., Lu, X., Yang, M.H.: Deep image harmonization. In: CVPR (2017)
  • (44) Uyttendaele, M., Eden, A., Szeliski, R.: Eliminating ghosting and exposure artifacts in image mosaics. In: CVPR (2001)
  • (45) Wang, K., Babenko, B., Belongie, S.: End-to-end scene text recognition. In: ICCV (2011)
  • (46) Wang, T.C., Liu, M.Y., Zhu, J.Y., Tao, A., Kautz, J., Catanzaro, B.: High-resolution image synthesis and semantic manipulation with conditional gans. In: CVPR (2018)
  • (47) Wu, H., Zheng, S., Zhang, J., Huang, K.: Gp-gan: Towards realistic high-resolution image blending. arXiv:1703.07195 (2017)
  • (48) Yang, J., Kannan, A., Batra, D., Parikh, D.: Lr-gan: Layered recursive generative adversarial networks for image generation. In: ICLR (2017)
  • (49) Yao, S., Hsu, T.M., Zhu, J.Y., Wu, J., Torralba, A., Freeman, B., Tenenbau, J.: 3d-aware scene manipulation via inverse graphics. In: NeurIPS (2018)
  • (50) Zhan, F., Lu, S., Xue, C.: Verisimilar image synthesis for accurate detection and recognition of texts in scenes. In: ECCV, pp. 249–266 (2018)
  • (51) Zhan, F., Zhu, H., Lu, S.: Scene text synthesis for efficient and effective deep network training. arXiv:1901.09193 (2019)
  • (52) Zhan, F., Zhu, H., Lu, S.: Spatial fusion gan for image synthesis. In: CVPR (2019)
  • (53) Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., Metaxas, D.: Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In: ICCV (2017)
  • (54) Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., Metaxas, D.: Stackgan++: Realistic image synthesis with stacked generative adversarial networks. TPAMI (2018)
  • (55) Zhang, Y., Song, S., Yumer, E., Savva, M., Lee, J.Y., Jin, H., Funkhouser, T.: Physically-based rendering for indoor scene understanding using convolutional neural networks. In: CVPR (2017)
  • (56) Zhu, J.Y., Krahenbuhl, P., Shechtman, E., Efros, A.A.: Learning a discriminative model for the perception of realism in composite images. In: ICCV (2015)
  • (57) Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: ICCV (2017)
  • (58) Zhu, J.Y., Zhang, R., Pathak, D., Darrell, T., Efros, A.A., Wang, O., Shechtman, E.: Toward multimodal image-to-image translation. In: NIPS (2017)
  • (59) Zhu, J.Y., Zhang, Z., Zhang, C., Wu, J., Torralba, A., Tenenbaum, J., Freeman, B.: Visual object networks: Image generation with disentangled 3d representations. In: NeurIPS (2018)