Spatial Fusion GAN for Image Synthesis
Recent advances in generative adversarial networks (GANs) have shown great potential in realistic image synthesis, yet most existing works address synthesis realism in either appearance space or geometry space, and few in both. This paper presents an innovative Spatial Fusion GAN (SF-GAN) that combines a geometry synthesizer and an appearance synthesizer to achieve synthesis realism in both geometry and appearance spaces. The geometry synthesizer learns contextual geometries of background images and transforms and places foreground objects into the background images coherently. The appearance synthesizer adjusts the color, brightness and styles of the foreground objects and embeds them into background images harmoniously, where a guided filter is introduced for detail preservation. The two synthesizers are inter-connected as mutual references and can be trained end-to-end without supervision. The SF-GAN has been evaluated in two tasks: (1) realistic scene text image synthesis for training better recognition models; (2) glass and hat wearing for realistically matching glasses and hats with real portraits. Qualitative and quantitative comparisons with the state-of-the-art demonstrate the superiority of the proposed SF-GAN.
With the advances of deep neural networks (DNNs), image synthesis has been attracting increasing attention as a means of generating novel images and creating annotated images for training DNN models, where the latter has great potential to replace traditional manual annotation, which is usually costly, time-consuming and unscalable. The fast development of generative adversarial networks (GANs) in recent years opens a new door for automated image synthesis, as GANs are capable of generating realistic images by concurrently training a generator and a discriminator. Three typical approaches have been explored for GAN-based image synthesis, namely, direct image generation [26, 32, 1], image-to-image translation [47, 15, 21, 13] and image composition [20, 2].
On the other hand, most existing GANs were designed to achieve synthesis realism in either geometry space or appearance space, but few in both. Consequently, most GAN-synthesized images contribute little (and many are even harmful) when they are used in training deep network models. In particular, direct image generation still faces difficulties in generating high-resolution images due to the limited network capacity. GAN-based image composition is capable of producing high-resolution images [20, 2] by placing foreground objects into background images. But most GAN-based image composition techniques focus on geometric realism only (e.g. object alignment with contextual background) and often produce various artifacts due to appearance conflicts between the foreground objects and the background images. In comparison, GAN-based image-to-image translation aims for appearance realism by learning the style of images of the target domain, whereas geometric realism is largely ignored.
We propose an innovative Spatial Fusion GAN (SF-GAN) that achieves synthesis realism in both geometry and appearance spaces concurrently, a very challenging task in image synthesis due to a wide spectrum of conflicts between the foreground objects and background images with respect to relative scaling, spatial alignment, appearance style, etc. The SF-GAN addresses these challenges by designing a geometry synthesizer and an appearance synthesizer. The geometry synthesizer learns the local geometry of background images, with which the foreground objects can be transformed and placed into the background images coherently. A discriminator is employed to train a spatial transformation network, which aims to produce transformed images that can mislead the discriminator. The appearance synthesizer learns to adjust the color, brightness and styles of the foreground objects so that they match the background images with minimum conflicts. A guided filter is introduced to compensate for the detail loss that happens in most appearance-transfer GANs. The geometry synthesizer and appearance synthesizer are inter-connected as mutual references and can be trained end-to-end with little supervision.
The contributions of this work are threefold. First, it designs an innovative SF-GAN, an end-to-end trainable network that concurrently achieves synthesis realism in both geometry and appearance spaces. To the best of our knowledge, this is the first GAN that can achieve synthesis realism in geometry and appearance spaces concurrently. Second, it designs a fusion network that introduces guided filters for detail preserving for appearance realism, whereas most image-to-image-translation GANs tend to lose details while performing appearance transfer. Third, it investigates and demonstrates the effectiveness of GAN-synthesized images in training deep recognition models, a very important issue that was largely neglected in most existing GANs (except a few GANs for domain adaptation [13, 15, 21, 47]).
Realistic image synthesis has been studied for years, from synthesis of single objects [28, 29, 36] to generation of full-scene images [7, 33]. Among different image synthesis approaches, image composition, which synthesizes new images by placing foreground objects into an existing background image, has been explored extensively. The target is to achieve composition realism by controlling the object size, orientation, and blending between foreground objects and background images. For example, [9, 16, 43] investigate synthesis of scene text images for training better scene text detection and recognition models. They achieve synthesis realism by controlling a series of parameters such as text locations within the background image, geometric transformation of the foreground texts, blending between the foreground text and background image, etc. Other image composition systems have also been reported for DNN training, composition harmonization [25, 37], and related tasks [46].
Optimal image blending is critical for good appearance consistency between the foreground object and background image as well as minimal visual artifacts within the synthesized images. One straightforward way is to apply dense image matching at pixel level so that only the corresponding pixels are copied and pasted, but this approach does not work well when the foreground object and background image have very different appearance. An alternative is to make the transition as smooth as possible so that artifacts are hidden or removed within the composed images, e.g. by alpha blending, but this approach tends to blur fine details in the foreground object and background images. In addition, gradient-based techniques such as Poisson blending can edit the image gradient and adjust the inconsistency in color and illumination to achieve seamless blending.
Most existing image synthesis techniques aim for geometric realism by hand-crafted transformations that involve complicated parameters and are prone to various unnatural alignments. The appearance realism is handled by different blending techniques where features are manually selected and the results are still susceptible to artifacts. Our proposed technique instead adopts a GAN structure that learns geometry and appearance features from real images with little supervision, effectively minimizing various inconsistencies and artifacts.
GANs have achieved great success in generating realistic new images from either existing images or random noise. The main idea is to run a continuing adversarial learning process between a generator and a discriminator, where the generator tries to generate more realistic images while the discriminator aims to distinguish the newly generated images from real images. Starting from generating MNIST handwritten digits, the quality of GAN-synthesized images has been improved greatly by the Laplacian pyramid of adversarial networks. This was followed by various efforts that employ a DNN architecture, stack a pair of generators, learn more interpretable latent representations, adopt an alternative training method, etc.
Most existing GANs work towards synthesis realism in the appearance space. For example, CycleGAN uses cycle-consistent adversarial networks for realistic image-to-image translation, as do other related GANs [15, 35]. LR-GAN generates new images by applying additional spatial transformation networks (STNs) to factorize shape variations. GP-GAN composes high-resolution images by using Poisson blending. A few GANs have been reported in recent years for geometric realism, e.g., the spatial transformer GAN (ST-GAN) embeds STNs in the generator for geometric realism, and Compositional GAN employs a self-consistent composition-decomposition network.
Most existing GANs synthesize images in either geometry space (e.g. ST-GAN) or appearance space (e.g. CycleGAN) but few in both spaces. In addition, the GAN-synthesized images are usually not suitable for training deep network models due to the lack of annotation or synthesis realism. Our proposed SF-GAN can achieve both appearance and geometry realism by synthesizing images in appearance and geometry spaces concurrently. Its synthesized images can be directly used to train more powerful deep network models due to their high realism.
Guided filters use one image as guidance for filtering another image, and have shown superior performance in detail-preserving filtering. The filtering output is a linear transform of the guidance image that takes its structures into account, where the guidance image can be the input image itself or a different image. Guided filtering has been used in various computer vision tasks, e.g., weighted averaging and image fusion, rolling guidance for fully controlled detail smoothing in an iterative manner, fast guided filtering for efficient image super-resolution, high-quality depth map restoration, and filtering that tolerates heavy noise and structure inconsistency. It has also been formulated as a nonconvex optimization problem with solutions via majorize-minimization.
Most GANs for image-to-image translation can synthesize high-resolution images, but the appearance transfer often suppresses image details such as edges and texture. How to keep the details of the original image while learning the appearance of the target remains an active research area. The proposed SF-GAN introduces guided filters into a cycle network, which is capable of achieving appearance transfer and detail preservation concurrently.
The proposed SF-GAN consists of a geometry synthesizer and an appearance synthesizer, and the whole network is end-to-end trainable as illustrated in Fig. 2. Detailed network structure and training strategy will be introduced in the following subsections.
The geometry synthesizer has a local GAN structure as highlighted by blue-color lines and boxes on the left of Fig. 2. It consists of a spatial transform network (STN), a composition module and a discriminator. The STN consists of an estimation network as shown in Table 1 and a transformation matrix whose parameters control the geometric transformation of the foreground object.
The foreground object and background image are concatenated to form the input of the STN, where the estimation network predicts a transformation matrix to transform the foreground object. The transformation can be affine, homography, or thin plate spline (we use thin plate spline for the scene text synthesis task and homography for the portrait wearing task). Each pixel in the transformed image is computed by applying a sampling kernel centered at a particular location in the original image. With pixels in the original and transformed images denoted by $P^s = (p^s_1, p^s_2, \ldots, p^s_N)$ and $P^t = (p^t_1, p^t_2, \ldots, p^t_N)$, we use a transformation matrix $H$ to perform pixel-wise transformation in homogeneous coordinates as follows:
$$\begin{bmatrix} x^t_i \\ y^t_i \\ 1 \end{bmatrix} = H \begin{bmatrix} x^s_i \\ y^s_i \\ 1 \end{bmatrix}$$
where $p^s_i = (x^s_i, y^s_i)$ and $p^t_i = (x^t_i, y^t_i)$ denote the coordinates of the i-th pixel within the original and transformed image, respectively.
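The pixel-wise mapping can be illustrated with a minimal NumPy sketch. It covers only the matrix case (affine or homography), not the thin plate spline, and the function name `warp_points` and the example matrix are illustrative assumptions rather than part of the SF-GAN implementation.

```python
import numpy as np

def warp_points(H, points):
    """Map (N, 2) source coordinates through a 3x3 matrix H in homogeneous form."""
    pts = np.asarray(points, dtype=np.float64)
    ones = np.ones((pts.shape[0], 1))
    homogeneous = np.hstack([pts, ones])               # (N, 3) rows of [x, y, 1]
    mapped = homogeneous @ np.asarray(H, dtype=np.float64).T
    # Divide by the third coordinate; it is 1 for affine maps but not for
    # a general homography.
    return mapped[:, :2] / mapped[:, 2:3]

# Hypothetical example: scale by 2 and shift by (10, 5).
H_example = np.array([[2.0, 0.0, 10.0],
                      [0.0, 2.0, 5.0],
                      [0.0, 0.0, 1.0]])
corners = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
warped = warp_points(H_example, corners)
```

In an STN the transformed image is usually produced by sampling the source image at the mapped locations, so that the whole operation stays differentiable; the sketch above only shows the coordinate transformation itself.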
The transformed foreground object can thus be placed into the background image to form an initially composed image (Composed Image in Fig. 2). The discriminator D2 in Fig. 2 learns to distinguish whether the composed image is realistic with respect to a set of Real Images. On the other hand, our study shows that real images are not good references for training the geometry synthesizer. The reason is that real images are realistic in both geometry and appearance spaces, while the geometry synthesizer can only achieve realism in geometry space. The difference in appearance space between the synthesized images and real images will mislead the training of the geometry synthesizer. For optimal training of the geometry synthesizer, the reference images should be realistic in the geometry space only and concurrently have similar appearance (e.g. colors and styles) to the initially composed images. Such reference images are difficult to create manually. In the SF-GAN, we use images from the appearance synthesizer (Adapted Real shown in Fig. 2) as the reference to train the geometry synthesizer; the appearance synthesizer is discussed in the following subsection.
The appearance synthesizer is designed in a cycle structure as highlighted in orange-color lines and boxes on the right of Fig. 2. It aims to fuse the foreground object and background image to achieve synthesis realism in the appearance space. Image-to-image translation GANs also strive for realistic appearance but they usually lose visual details while performing the appearance transfer. Within the proposed SF-GAN, guided filters are introduced which help to preserve visual details effectively while working towards synthesis realism within the appearance space.
The proposed SF-GAN adopts a cycle structure for mapping between two domains, namely, the composed image domain and the real image domain. Two generators G1 and G2 are designed to achieve image-to-image translation in two reverse directions, G1 from Composed Image to Final Synthesis and G2 from Real Images to Adapted Real as illustrated in Fig. 2. Two discriminators D1 and D2 are designed to discriminate real images and translated images.
In particular, D1 will strive to distinguish the adapted composed images (i.e. the Composed Image after domain adaptation by G1) from Real Images, forcing G1 to learn to map from the Composed Image to Final Synthesis images that are realistic in the appearance space. G2 will learn to map from Real Images to Adapted Real images, which ideally are realistic in the geometry space only but have similar appearance to the Composed Image. As discussed in the previous subsection, the Adapted Real images from G2 are used as references for training the geometry synthesizer, which can then better focus on synthesizing images with realistic geometry (as the interfering appearance difference has been suppressed in the Adapted Real images).
Image appearance transfer usually comes with detail loss. We address this issue from two perspectives. The first is by adaptive combination of cycle loss and identity loss. Specifically, we adopt a weighted combination strategy that assigns a higher weight to the cycle loss for image regions of interest and a higher weight to the identity loss for the remaining regions. Take scene text image synthesis as an example. By assigning a larger cycle-loss weight and a smaller identity-loss weight to text regions, it ensures a multi-mode mapping of the text style while keeping the background similar to the original image. The second is by introducing guided filters into the cycle structure for detail preservation, as described in the next subsection.
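One way such a region-dependent weighting could be realized is sketched below. The text does not give the exact formula, so the per-pixel weighting scheme, the function name `region_weighted_loss` and all weight values are assumptions made for illustration only.

```python
import numpy as np

def region_weighted_loss(cycle_err, identity_err, mask,
                         w_cyc_in=10.0, w_idt_in=1.0,
                         w_cyc_out=1.0, w_idt_out=10.0):
    """Combine per-pixel cycle and identity errors with region-dependent weights.

    mask is 1 inside the region of interest (e.g. text pixels) and 0 elsewhere.
    The four weights are hypothetical values, not taken from the paper.
    """
    mask = np.asarray(mask, dtype=np.float64)
    # Inside the region the cycle term dominates; outside, the identity term.
    w_cyc = mask * w_cyc_in + (1.0 - mask) * w_cyc_out
    w_idt = mask * w_idt_in + (1.0 - mask) * w_idt_out
    return float(np.mean(w_cyc * np.asarray(cycle_err, dtype=np.float64)
                         + w_idt * np.asarray(identity_err, dtype=np.float64)))
```

With these weights, an identity error on a background pixel costs ten times more than the same error on a text pixel, which pushes the generator to leave the background unchanged.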
The guided filter was designed to perform edge-preserving image smoothing. It influences the filtering by using structures in a guidance image. As appearance transfer in most image-to-image-translation GANs tends to lose image details, we introduce guided filters (F as shown in Fig. 2) into the SF-GAN for detail preservation within the translated images. The target is to perform appearance transfer on the foreground object (within the Composed Image) only while keeping the background image with minimum changes.
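We introduce guided filters into the proposed SF-GAN and formulate the detail-preserving appearance transfer as a joint up-sampling problem as illustrated in Fig. 3. In particular, the translated image from the output of G1 (image details lost) is the input image $R$ to be filtered, and the initially Composed Image (image details unchanged) shown in Fig. 2 acts as the guidance image $I$ to provide edge and texture details. The detail-preserving image $T$ (corresponding to the Synthesized Image in Fig. 2) can thus be derived by minimizing the reconstruction error between $R$ and $T$, subject to a linear model:
$$T_i = a_k I_i + b_k, \quad \forall i \in \omega_k$$
where $i$ is the index of a pixel and $\omega_k$ is a local square window centered at pixel $k$.
To determine the coefficients of the linear model $a_k$ and $b_k$, we seek a solution that minimizes the difference between $T$ and the filter input $R$, which can be derived by minimizing the following cost function in the local window:
$$E(a_k, b_k) = \sum_{i \in \omega_k} \left( (a_k I_i + b_k - R_i)^2 + \epsilon a_k^2 \right)$$
where $\epsilon$ is a regularization parameter that prevents $a_k$ from being too large. It can be solved via linear regression:
$$a_k = \frac{\frac{1}{|\omega|} \sum_{i \in \omega_k} I_i R_i - \mu_k \bar{R}_k}{\sigma_k^2 + \epsilon}, \qquad b_k = \bar{R}_k - a_k \mu_k$$
where $\mu_k$ and $\sigma_k^2$ are the mean and variance of $I$ in $\omega_k$, $|\omega|$ is the number of pixels in $\omega_k$, and $\bar{R}_k = \frac{1}{|\omega|} \sum_{i \in \omega_k} R_i$ is the mean of $R$ in $\omega_k$.
By applying the linear model to all windows $\omega_k$ in the image and computing $(a_k, b_k)$, the filter output can be derived by averaging all possible values of $T_i$:
$$T_i = \frac{1}{|\omega|} \sum_{k:\, i \in \omega_k} (a_k I_i + b_k) = \bar{a}_i I_i + \bar{b}_i$$
where $\bar{a}_i = \frac{1}{|\omega|} \sum_{k \in \omega_i} a_k$ and $\bar{b}_i = \frac{1}{|\omega|} \sum_{k \in \omega_i} b_k$. We integrate the guided filter into the cycle structure network to implement an end-to-end trainable system.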
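For reference, the standard single-channel guided filter can be written in a few lines of NumPy. This is a minimal sketch of the generic filter, not the differentiable layer used inside the SF-GAN; the names `box_mean` and `guided_filter`, the window radius `r` and the value of `eps` are illustrative assumptions.

```python
import numpy as np

def box_mean(img, r):
    """Mean over a (2r+1)x(2r+1) window, truncated at the image border."""
    h, w = img.shape
    integral = np.zeros((h + 1, w + 1))
    integral[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    y0 = np.clip(np.arange(h) - r, 0, h)
    y1 = np.clip(np.arange(h) + r + 1, 0, h)
    x0 = np.clip(np.arange(w) - r, 0, w)
    x1 = np.clip(np.arange(w) + r + 1, 0, w)
    total = (integral[y1][:, x1] - integral[y0][:, x1]
             - integral[y1][:, x0] + integral[y0][:, x0])
    area = (y1 - y0)[:, None] * (x1 - x0)[None, :]
    return total / area

def guided_filter(guide, src, r=2, eps=1e-2):
    """Filter `src` using the structures of `guide` (both 2-D float arrays)."""
    I = np.asarray(guide, dtype=np.float64)
    R = np.asarray(src, dtype=np.float64)
    mean_I, mean_R = box_mean(I, r), box_mean(R, r)
    var_I = box_mean(I * I, r) - mean_I ** 2          # variance of the guide per window
    cov_IR = box_mean(I * R, r) - mean_I * mean_R     # covariance of guide and input
    a = cov_IR / (var_I + eps)                        # per-window linear coefficients
    b = mean_R - a * mean_I
    # Average the coefficients of all windows covering each pixel.
    return box_mean(a, r) * I + box_mean(b, r)
```

In the setting described above, `guide` would play the role of the Composed Image and `src` the output of G1; a larger `eps` yields smoother output, while a smaller one follows the edges of the guide more closely.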
The proposed SF-GAN is designed to achieve synthesis realism in both geometry and appearance spaces. The SF-GAN training therefore has two adversarial objectives: one is to learn the real geometry and the other is to learn the real appearance. The geometry synthesizer and appearance synthesizer are actually two local GANs that are inter-connected and need coordination during the training. For presentation clarity, we denote the Foreground Object and Background Image in Fig. 2 as $x$, the Composed Image as $y$ and the Real Image as $z$, which belong to domains $X$, $Y$ and $Z$, respectively.
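For the geometry synthesizer, the STN can actually be viewed as a generator $G_0$ which predicts transformation parameters for the input $x$ (the Foreground Object and Background Image). After the transformation of the Foreground Object and Composition, the Composed Image becomes the input of the discriminator D2 and the training reference $z'$ comes from $G_2(z)$ of the appearance synthesizer, where $z$ is a Real Image. For the geometry synthesizer, we adopt the Wasserstein GAN objective for training which can be denoted by:
$$\min_{G_0} \max_{D_2} \; \mathbb{E}_{z' \sim Z'}[D_2(z')] - \mathbb{E}_{x \sim X}[D_2(G_0(x))]$$
where $Z'$ denotes the domain of $z'$ and $G_0(x)$ is the Composed Image. Since $G_0$ aims to minimize this objective against an adversary $D_2$ that tries to maximize it, the loss functions of $D_2$ and $G_0$ can be defined by:
$$L_{D_2} = \mathbb{E}_{x \sim X}[D_2(G_0(x))] - \mathbb{E}_{z' \sim Z'}[D_2(z')]$$
$$L_{G_0} = -\mathbb{E}_{x \sim X}[D_2(G_0(x))]$$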
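As a small illustration, the two losses of a Wasserstein-style critic and generator can be evaluated directly from critic scores. The sketch assumes the standard Wasserstein convention in which the critic is trained to score reference images higher than composed ones; the function names and the sample scores in the usage note are hypothetical.

```python
import numpy as np

def critic_loss(scores_composed, scores_reference):
    """Loss minimized by the critic: composed scores down, reference scores up."""
    return float(np.mean(scores_composed) - np.mean(scores_reference))

def generator_loss(scores_composed):
    """Loss minimized by the generator: raise the critic score of composed images."""
    return float(-np.mean(scores_composed))
```

For example, with composed scores `[0.2, 0.4]` and reference scores `[1.0, 2.0]`, `critic_loss` evaluates to about -1.2 and `generator_loss` to about -0.3. Minimizing the first widens the score gap, while minimizing the second narrows it from the generator side.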
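The appearance synthesizer adopts a cycle structure that consists of two mappings $G_1: Y \rightarrow Z$ and $G_2: Z \rightarrow Y$ between the composed image domain $Y$ and the real image domain $Z$. It has two adversarial discriminators D1 and D2. D2 is shared between the geometry and appearance synthesizers, and it aims to distinguish $y$ from $G_2(z)$ within the appearance synthesizer. The learning objective thus consists of an adversarial loss for the mapping between domains and a cycle-consistency loss for preventing mode collapse. For the adversarial loss, the objective of the mapping $G_1: Y \rightarrow Z$ (and the same for the reverse mapping $G_2: Z \rightarrow Y$) can be defined by:
$$L_{D_1} = \mathbb{E}_{y \sim Y}[D_1(G_1(y))] - \mathbb{E}_{z \sim Z}[D_1(z)]$$
$$L_{G_1} = -\mathbb{E}_{y \sim Y}[D_1(G_1(y))]$$
As the adversarial losses cannot guarantee that the learned function maps an individual input $y$ to a desired output $z$, we introduce cycle-consistency, aiming to ensure that the image translation cycle will bring $y$ back to the original image, i.e. $y \rightarrow G_1(y) \rightarrow G_2(G_1(y)) \approx y$. The cycle-consistency can be achieved by a cycle-consistency loss:
$$L_{G_1 cyc} = \mathbb{E}_{y \sim Y}\left[\lVert G_2(G_1(y)) - y \rVert\right], \qquad L_{G_2 cyc} = \mathbb{E}_{z \sim Z}\left[\lVert G_1(G_2(z)) - z \rVert\right]$$
We also introduce the identity loss to ensure that the translated image preserves features of the original image:
$$L_{G_1 idt} = \mathbb{E}_{y \sim Y}\left[\lVert G_1(y) - y \rVert\right], \qquad L_{G_2 idt} = \mathbb{E}_{z \sim Z}\left[\lVert G_2(z) - z \rVert\right]$$
For each training step, the model needs to update the geometry synthesizer and appearance synthesizer separately. In particular, the discriminator loss $L_{D_2}$ and the generator loss $L_{G_0}$ of the geometry synthesizer are optimized alternately while updating the geometry synthesizer. While updating the appearance synthesizer, all weights of the geometry synthesizer are frozen. In the mapping $G_1: Y \rightarrow Z$, $L_{D_1}$ and $L_{G_1} + \lambda_1 L_{G_1 cyc} + \lambda_2 L_{G_1 idt}$ are optimized alternately, where $\lambda_1$ and $\lambda_2$ control the relative importance of the cycle-consistency loss and the identity loss, respectively. In the mapping $G_2: Z \rightarrow Y$, $L_{D_2}$ and $L_{G_2} + \lambda_1 L_{G_2 cyc} + \lambda_2 L_{G_2 idt}$ are optimized alternately.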
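The generator-side objective of one mapping and the update order within a training step might be sketched as follows. The L1 distance, the default weights `lambda_cyc` and `lambda_idt`, and the `UPDATE_ORDER` list are assumptions made for illustration; the text does not specify the norm or the weight values.

```python
import numpy as np

def l1(a, b):
    """Mean absolute difference between two images (assumed distance)."""
    return float(np.mean(np.abs(np.asarray(a, dtype=np.float64)
                                - np.asarray(b, dtype=np.float64))))

def generator_objective(adv_loss, original, translated, reconstructed,
                        lambda_cyc=10.0, lambda_idt=5.0):
    """Adversarial loss plus weighted cycle-consistency and identity terms."""
    cyc = l1(reconstructed, original)   # image after the full translation cycle
    idt = l1(translated, original)      # image after a single translation
    return adv_loss + lambda_cyc * cyc + lambda_idt * idt

# Illustrative order of updates inside one training step: the geometry
# synthesizer first, then the appearance synthesizer with geometry weights frozen.
UPDATE_ORDER = [
    ("geometry", "D2"), ("geometry", "G0"),
    ("appearance", "D1"), ("appearance", "G1"),
    ("appearance", "D2"), ("appearance", "G2"),
]
```

For the mapping from composed to real images, `original` would be the Composed Image, `translated` the output of G1 and `reconstructed` the output of G2 applied to that translation.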
It should be noted that the sequential updating is necessary for end-to-end training of the proposed SF-GAN. If the geometry loss were discarded, the geometry synthesizer would have to be updated according to the loss function of the appearance synthesizer. In that case, the appearance synthesizer would generate blurry foreground objects regardless of the geometry synthesizer, which is similar to GANs for direct image generation. As discussed before, direct image generation cannot provide accurate annotation information, and the directly generated images also have low quality and are not suitable for training deep network models.
ICDAR2013 is used in the Robust Reading Competition in the International Conference on Document Analysis and Recognition (ICDAR) 2013. It contains 848 word images for network training and 1095 for testing.
ICDAR2015 is used in the Robust Reading Competition under ICDAR 2015. It contains incidental scene text images that are captured without prior preparation. 2077 text image patches are cropped from this dataset, where a large number of the cropped scene texts suffer from perspective and curvature distortions.
IIIT5K has 2000 training images and 3000 test images that are cropped from scene texts and born-digital images. Each word in this dataset has a 50-word lexicon and a 1000-word lexicon, where each lexicon consists of a ground-truth word and a set of randomly picked words.
SVT is collected from the Google Street View images that were used for scene text detection research. 647 word images are cropped from 249 street view images and most cropped texts are almost horizontal.
SVTP has 639 word images that are cropped from the SVT images. Most images in this dataset suffer from perspective distortion and were purposely selected for evaluating scene text recognition under perspective views.
CUTE has 288 word images, most of which are curved. All words are cropped from the CUTE dataset, which contains 80 scene text images that were originally collected for scene text detection research.
CelebA  is a face image dataset that consists of more than 200k celebrity images with 40 attribute annotations. This dataset is characterized by large quantities, large face pose variations, complicated background clutters, rich annotations, and it is widely used for face attribute prediction.
Data Preparation: The SF-GAN needs a set of Real Images to act as references as illustrated in Fig. 2. We create the Real Images by cropping the text image patches from the training images of ICDAR2013, ICDAR2015 and SVT by using the provided annotation boxes. While cropping the text image patches, we extend the annotation box (by an extra 1/4 of the width and height of the annotation boxes) to include a certain amount of neighboring background. The purpose is to include certain local geometric structures so that the geometry synthesizer can learn the transformation matrix for transforming the foreground text with correct alignment with the background image.
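The box extension step can be sketched as below. The text does not say how the extra quarter is distributed, so the sketch assumes it is split evenly between the two sides of each axis and clipped to the image bounds; the function name `expand_box` is hypothetical.

```python
def expand_box(x0, y0, x1, y1, img_w, img_h, ratio=0.25):
    """Grow an axis-aligned box by `ratio` of its width and height in total.

    Assumption: the extra margin is split evenly between opposite sides,
    and the result is clipped to the image of size (img_w, img_h).
    """
    pad_x = (x1 - x0) * ratio / 2.0
    pad_y = (y1 - y0) * ratio / 2.0
    nx0 = max(0, int(round(x0 - pad_x)))
    ny0 = max(0, int(round(y0 - pad_y)))
    nx1 = min(img_w, int(round(x1 + pad_x)))
    ny1 = min(img_h, int(round(y1 + pad_y)))
    return nx0, ny0, nx1, ny1
```

For a box spanning (40, 20) to (120, 60) in a 200 by 100 image, this yields (30, 15, 130, 65), i.e. an 80 by 40 box becomes 100 by 50.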
Besides the Real Images, the SF-GAN also needs a set of Background Images as shown in Fig. 2. For scene text image synthesis, we collect the background images by smoothing out the text pixels of the cropped Real Images. In particular, we first apply k-means clustering to the cropped Real Images to obtain the text pixels and thus the text mask. Inpainting is then applied over the text pixels, which produces the background image without texts as shown in the second row in Fig. 4. Further, the Foreground Object (text for scene text synthesis) is computer-generated by using a 90k-lexicon. The created Background Images, Foreground Texts and Real Images are fed to the network to train the SF-GAN.
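The clustering step can be illustrated with a simplified two-cluster k-means on grayscale intensities. The number of clusters, the use of grayscale values, and the rule that the smaller cluster is text are all assumptions of this sketch; the inpainting step is not shown.

```python
import numpy as np

def two_means_text_mask(gray, iters=20):
    """Split pixels into two intensity clusters and return the smaller one as text."""
    pixels = np.asarray(gray, dtype=np.float64)
    c_lo, c_hi = pixels.min(), pixels.max()           # initial cluster centers
    assign_hi = np.zeros(pixels.shape, dtype=bool)
    for _ in range(iters):
        assign_hi = np.abs(pixels - c_hi) < np.abs(pixels - c_lo)
        if assign_hi.all() or not assign_hi.any():
            break                                     # degenerate: a single cluster
        c_lo, c_hi = pixels[~assign_hi].mean(), pixels[assign_hi].mean()
    # Assumption: text occupies fewer pixels than the surrounding background.
    if assign_hi.sum() <= (~assign_hi).sum():
        return assign_hi
    return ~assign_hi
```

The resulting boolean mask marks the pixels that an inpainting routine would then fill in to obtain a text-free background.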
Note that the scene text images synthesized by the trained SF-GAN have a larger background, as the original background images are larger than the annotation boxes. Texts need to be cropped out with tighter boxes (to exclude extra background) for optimal training of scene text recognition models. With the text maps denoted by Transformed Object in Fig. 2, scene text patches can be cropped out accurately by detecting a minimal external rectangle.
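A minimal sketch of the cropping step is given below, assuming the text map is available as a binary mask and that an axis-aligned rectangle is sufficient (a rotated minimal rectangle would need a different routine). The function name `bounding_rectangle` is hypothetical.

```python
import numpy as np

def bounding_rectangle(mask):
    """Return (x0, y0, x1, y1) of the tightest axis-aligned box around nonzero pixels.

    The right and bottom coordinates are exclusive, so image[y0:y1, x0:x1]
    crops the text patch. Returns None for an empty mask.
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1
```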
Results Analysis: We use 1 million SF-GAN synthesized scene text images to train scene text recognition models and use the model recognition performance to evaluate the usefulness of the synthesized images. In addition, the SF-GAN is benchmarked against a number of state-of-the-art synthesis techniques by randomly selecting 1 million synthesized scene text images from one existing synthetic dataset and randomly cropping 1 million scene text images from each of two others. Beyond that, we also synthesize 1 million scene text images with random text appearance by using ST-GAN.
For ablation analysis, we evaluate SF-GAN(GS), which denotes the output of the geometry synthesizer (Composed Image as shown in Fig. 2), and SF-GAN(AS), which denotes the output of the appearance synthesizer with random geometric alignments. A baseline SF-GAN(BS) is also trained where texts are placed with random alignment and appearance. The three SF-GANs also synthesize 1 million images each for scene text recognition tests. The recognition tests are performed over four regular scene text datasets (ICDAR2013, ICDAR2015, SVT, IIIT5K) and two irregular datasets (SVTP and CUTE) as described in Datasets. Besides the scene text recognition, we also perform user studies with Amazon Mechanical Turk (AMT) where users are recruited to tell whether SF-GAN synthesized images are real or synthesized.
Tables 2 and 3 show scene text recognition and AMT user study results. As Table 2 shows, SF-GAN achieves the highest recognition accuracy for most of the 6 datasets and an improvement of up to 3% in average recognition accuracy (across the 6 datasets), demonstrating the superior usefulness of its synthesized images when used for training scene text recognition models. The ablation study shows that the proposed geometry synthesizer and appearance synthesizer both help to synthesize more realistic and useful images for recognition model training. In addition, they are complementary and their combination achieves a 6% improvement in average recognition accuracy beyond the baseline SF-GAN(BS). The AMT results in the second column of Table 3 also show that the SF-GAN synthesized scene text images are much more realistic than those of state-of-the-art synthesis techniques. Note that one of the compared synthetic datasets contains only gray-scale images and is not included in the AMT user study.
Fig. 4 shows a few images synthesized by the proposed SF-GAN and a few state-of-the-art GANs. As Fig. 4 shows, ST-GAN can achieve geometric alignment but the appearance is clearly unrealistic within the synthesized images. CycleGAN can adapt the appearance of the foreground texts to a certain degree but it ignores real geometry. This leads to not only unrealistic geometry but also degraded appearance, as the discriminator can easily distinguish generated images from real images according to the geometry difference. The SF-GAN(GS) gives the output of the geometry synthesizer, i.e. the Composed Image as shown in Fig. 2, which produces better alignment due to good references from the appearance synthesizer. In addition, it can synthesize curved texts due to the use of a thin plate spline transformation. The fully implemented SF-GAN can further learn text appearance from real images and synthesize highly realistic scene text images. Besides, we can see that the proposed SF-GAN can learn from neighboring texts within the background images and adapt the appearance of the foreground texts accordingly.
Table 3: AMT user study to evaluate the realism of synthesized images. Percentages represent how often the images in each category were classified as “real” by Turkers.
Data Preparation: We use the CelebA dataset and follow the provided training/test split for the portrait wearing experiment. The training set is divided into two groups by using the annotations ‘glass’ and ‘hat’, respectively. For the glass case, the group of people with glasses serves as the real data to match against in our adversarial setting, and the other group without glasses serves as the background. For the foreground glasses, we crop 15 pairs of fronto-parallel glasses and reuse them to randomly compose with the background images. According to our experiment, 15 pairs of glasses as the foreground objects are sufficient to train a robust model. The hat case has a similar setting, except that we use 30 cropped hats as the foreground objects.
Results Analysis: Fig. 5 shows a few SF-GAN synthesized images and comparisons with ST-GAN synthesized images. As Fig. 5 shows, ST-GAN achieves realism in the geometry space by aligning the glasses and hats with the background face images. On the other hand, the synthesized images are unrealistic in the appearance space, with clear artifacts in color, contrast and brightness. In comparison, the SF-GAN synthesized images are much more realistic in both geometry and appearance spaces. In particular, the foreground glasses and hats within the SF-GAN synthesized images have harmonious brightness, contrast, and blending with the background face images. Additionally, the proposed SF-GAN also achieves better geometric alignment than ST-GAN, which focuses on geometric alignment only. We conjecture that the better geometric alignment is largely due to the reference from the appearance synthesizer. The AMT results shown in the last two columns of Table 3 also demonstrate the superior synthesis performance of our proposed SF-GAN.
This paper presents SF-GAN, an end-to-end trainable network that synthesizes realistic images given foreground objects and background images. The SF-GAN combines a geometry synthesizer and an appearance synthesizer and is capable of achieving synthesis realism in both geometry and appearance spaces concurrently. Two use cases are studied. The first, a scene text image synthesis study, shows that the proposed SF-GAN is capable of synthesizing useful images to train better scene text recognition models. The second, a hat-wearing and glass-matching study, shows that the SF-GAN is widely applicable and can be easily extended to other tasks. We will continue to study SF-GAN for full-image synthesis for training better detection models.