Generative adversarial networks (GANs) [goodfellow2014gan] have demonstrated great potential for synthesizing realistic images in recent years. Three typical approaches have been explored for image synthesis with GANs, namely, direct image generation [radford2016dcgan, arjovsky2017wgan], image-to-image translation [zhu2017cyclegan, isola2017pixel2pixel] and image composition [lin2018stgan, zhan2019sfgan].
Image composition aims to embed foreground objects of interest into background scenes automatically and realistically. Researchers have explored different approaches to this challenging task. The classical approach is image matting [matting], which estimates the opacity of foreground objects for composing them with different backgrounds. It requires an image containing the foreground of interest and usually does not change the geometry of the foreground in the composed images. Another approach embeds 2-dimensional (2D) foreground object images (with transparent backgrounds) into background images realistically [zhu2017cyclegan, isola2017pixel2pixel]. It harmonizes the appearance of the foreground objects with respect to the background, and involves homography transformations for aligning the 2D foreground objects with the background image geometrically. To overcome the limited views of 2D foreground objects, composition with 3-dimensional (3D) object models offering 360-degree view freedom [zhu2018von, yao20183dsdn] has attracted increasing interest recently. However, this remains an open research challenge due to the very limited number of 3D models with harmonious textures.
In this paper, we propose a View Alignment GAN (VA-GAN) that composes new images by embedding 3D object models into 2D background images with realistic object poses and textures, as illustrated in Fig. 1. VA-GAN consists of a texture generator and a differential discriminator that are inter-connected and end-to-end trainable. The differential discriminator guides the network to learn geometric transformations from real object-background pairs, and predicts transformation parameters for embedding 3D object models into the background images with realistic poses and views. A view predictor produces a view code (consisting of the estimated transformation parameters) for generating object textures under the estimated object view. The texture generator learns to produce realistic object textures from the depth map of the 3D model. A projector generates the depth map of the 3D model according to the transformation parameters estimated by the view predictor.
The contributions of this work are threefold. First, we design an innovative View Alignment GAN that composes 3D object models into 2D background images with realistic texture and geometry automatically. Second, we design a differential discriminator that is capable of learning realistic geometric alignment between 3D object models and 2D background images. Third, we design a view encoding mechanism that can synthesize realistic textures from depth maps of 3D object models under arbitrary views.
2 Related Work
2.1 Image Composition
Image composition aims to embed foreground objects into background images realistically and automatically by adjusting object sizes, poses, colours, etc. A number of image composition techniques have been reported in the past few years due to their wide applications in various computer vision tasks [richter2016]. For example, [zhu2015] proposes a model to distinguish natural photographs from computer-composed ones, while [dwibedi2017, zhan2018ver] treat image composition as an image augmentation scheme for generating more useful training images.
With the advance of GANs, a number of GAN-based methods have been proposed for image composition. For example, [lin2018stgan] presents a spatial transformer GAN for geometric realism in composition. [tripathi2019] employs a trainable synthesizer network to generate meaningful training samples. [chen2019compose] proposes an image compositing GAN that learns geometric and color correction simultaneously. [zhan2019sfgan] combines a geometry synthesizer and an appearance synthesizer to achieve synthesis realism in both geometry and appearance spaces.
2.2 Generative Adversarial Networks
GANs [goodfellow2014gan] have achieved great success in image generation from either random noise or existing images. For example, [denton2015lapgan] introduces Laplacian pyramids that greatly improve the quality of GAN-synthesized images. [lee2018context] proposes an end-to-end trainable network for inserting an object mask into the semantic label map of an image. Most existing GANs focus on high-fidelity image synthesis in appearance by manipulating image colours, textures, styles, etc. For instance, CycleGAN [zhu2017cyclegan] proposes a cycle-consistent adversarial network for realistic image-to-image translation; related efforts include [isola2017pixel2pixel, shrivastava2017simgan, huang2018munit, park2019spade, liu2019funit].
In recent years, researchers have paid increasing attention to high-fidelity geometry in image synthesis by manipulating local image structures and global perspectives. For example, [azadi2018comgan] describes a Compositional GAN that introduces a self-consistent composition-decomposition network, and [yao20183dsdn, zhu2018von] study GAN-based 3D manipulation and generation.
3 Proposed Method
3.1 Network Architecture
The proposed VA-GAN consists of a View Generator, a Projector, a Texture Generator and a differential discriminator, as illustrated in Fig. 2. It takes a background image with a selected object embedding region (i.e. Background1) and a 3D model of the foreground object of interest as inputs. The View Generator predicts a View Code based on the local geometry of Background1, which consists of parameters that define the rotation angles of the 3D model in the horizontal and vertical directions, respectively. A differentiable Projector [zhu2018von] generates a Depth Map of the 3D model according to the predicted View Code. The Texture Generator is pre-trained and generates realistic object textures from the depth map based on the predicted View Code and a style code, as detailed in the subsection Texture Generator. A differential discriminator guides the View Generator to learn the View Code and align the 3D model with the background image (using real object-background pairs as references), as detailed in the subsection Differential Discriminator.
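The data flow described above can be sketched in plain Python with stand-in components. All function names, shapes and the toy computations below are illustrative assumptions, not the authors' implementation; the real modules are neural networks and a differentiable renderer:

```python
import numpy as np

def view_generator(background_crop):
    """Predict a view code (horizontal/vertical rotation angles) from the
    local geometry of the selected embedding region (stand-in)."""
    h = float(background_crop.mean())   # placeholder geometry features
    v = float(background_crop.std())
    return np.array([h, v])

def projector(model_3d, view_code):
    """Render a depth map of the 3D model under the predicted view
    (stand-in: rotate point depths by the horizontal angle)."""
    theta = view_code[0]
    depths = model_3d[:, 2] * np.cos(theta) + model_3d[:, 0] * np.sin(theta)
    return depths.reshape(-1, 1)

def texture_generator(depth_map, view_code, style_code):
    """Generate an object texture conditioned on depth, view and style
    (stand-in: a simple combination of the three inputs)."""
    return depth_map + 0.1 * style_code + 0.01 * view_code.sum()

rng = np.random.default_rng(0)
background = rng.random((64, 64))            # Background1 embedding region
model_3d = rng.random((128, 3))              # 3D model as a point set
style_code = rng.standard_normal((128, 1))   # sampled style code

view_code = view_generator(background)
depth_map = projector(model_3d, view_code)
texture = texture_generator(depth_map, view_code, style_code)
print(texture.shape)                         # (128, 1)
```

The composed image would then be obtained by pasting the generated texture into the embedding region, with the differential discriminator judging the pair during training.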
3.2 Texture Generator
A Texture Generator is designed to generate realistic object textures from a depth map with a Style Code (randomly sampled from a normal distribution) and a View Code as illustrated in Fig. 2. The texture generator is pre-trained independently and frozen while training the geometric view alignment between the 3D object model and the background image.
The texture generator adopts a GAN-based cyclic structure for translation from depth maps (generated from 3D models) to object textures as illustrated in Fig. 3. Specifically, a style code is concatenated with the depth map to determine the style of the generated object texture. A style encoder strives to recover the style code from the generated object texture which leads to a cyclic consistency for the style code. For one specific style code, the texture generator can generate object textures of different views that are consistent with each other as illustrated in Fig. 4.
View Encoding: As the training of the view alignment depends on pairs of objects and background images, accurate generation of object textures is critical for training the view alignment module as well as the whole VA-GAN. While generating 2D object textures from depth maps, one specific issue is that the generated object texture tends to present inaccurate views as illustrated in Fig. 5, largely due to insufficient view information in the depth maps. We design a view encoding mechanism that feeds a view code (from the View Generator) to the texture generator for accurate texture generation as illustrated in Fig. 3. As the views of 3D models are periodic every $2\pi$, the view code is transformed to be periodic as well before being concatenated with images. We apply a cosine transformation to the view code (whose angles lie in $[-\pi, \pi)$), which achieves periodicity and a smooth encoding of views.
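As a minimal sketch of this encoding (assuming the view angles are expressed in radians; the exact scaling in the paper is not specified here):

```python
import numpy as np

def encode_view(view_code):
    """Map each rotation angle through cosine so the encoding is periodic
    every 2*pi and varies smoothly with the angle."""
    return np.cos(np.asarray(view_code, dtype=float))

# Views that differ by a full rotation encode identically (periodicity):
a = encode_view([0.5, -1.2])
b = encode_view([0.5 + 2 * np.pi, -1.2 + 2 * np.pi])
print(np.allclose(a, b))  # True
```

The encoded values would then be tiled spatially and concatenated with the image channels, as with the style code.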
As the style code has already been concatenated with the depth map as the input of the texture generator, as shown in Fig. 3, concatenating another view code with the depth map would entangle the effects of the two codes and tend to degrade the quality of the synthesized object. To avoid this problem, we concatenate the view code with the object texture as the input of the backward (texture-to-depth) generator, as illustrated in Fig. 3. A view encoder strives to recover the view code from the synthesized object, which leads to a cyclic consistency for the view code.
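The view-code cycle can be expressed as a simple reconstruction objective. The L1 form and the toy encoder below are assumptions for illustration; the paper does not specify the exact loss:

```python
import numpy as np

def l1(a, b):
    """Mean absolute error between two arrays."""
    return float(np.abs(np.asarray(a) - np.asarray(b)).mean())

def view_cycle_loss(view_code, view_encoder, generated_texture):
    """Penalize the view encoder's failure to recover the view code from
    the synthesized object, analogous to the style-code cycle."""
    recovered = view_encoder(generated_texture)
    return l1(recovered, view_code)

# Toy check with an encoder that recovers the code exactly:
code = np.array([0.3, -0.7])
texture = np.concatenate([np.zeros(4), code])   # code "hidden" in texture
encoder = lambda t: t[-2:]                      # toy view encoder
print(view_cycle_loss(code, encoder, texture))  # 0.0
```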
3.3 Differential Discriminator
We design a differential discriminator to guide the View Generator to learn realistic geometric alignment between the foreground object and the background image. To learn optimal geometric alignments, the discriminator needs to distinguish geometries instead of textures in the Real Pairs and Fake Pairs shown in Fig. 2. Inspired by the differential circuit that amplifies the difference between two input voltages but suppresses any voltage common to the two inputs, we design the discriminator so that it amplifies the intra-pair geometry discrepancy and suppresses the inter-pair texture discrepancy, as illustrated in Fig. 6. Note that the inter-pair texture discrepancy mainly exists between the real pairs and fake pairs, while the intra-pair geometry discrepancy exists within both real and fake pairs.
Suppression of Inter-Pair Texture Discrepancy: The discriminator works with the View Generator and Texture Generator to suppress the inter-pair texture discrepancy. In the Texture Generator in Fig. 3, two generators are employed to generate object textures from depth maps and to recover depth maps from object textures, respectively, and two discriminators are employed to distinguish real from fake object textures and real from fake depth maps, respectively. After the Texture Generator is trained, the texture-to-depth generator is able to recover depth maps that are indistinguishable from real ones, which means that it largely extracts geometry instead of texture features. The View Generator aims to extract geometry features from the background image for the prediction of the View Code. The differential discriminator thus suppresses the inter-pair texture discrepancy by adopting the View Generator to predict the view code of the background image and the texture-to-depth generator to produce the depth maps from the object images in the fake and real pairs.
Amplification of Intra-Pair Geometry Discrepancy: As the texture features between the real and fake pairs are suppressed, the discrimination of real and fake pairs largely comes down to the intra-pair geometry discrepancy between the depth map and the view code. How to amplify the intra-pair geometry discrepancy thus becomes critical for effective discrimination between real and fake pairs. The intra-pair geometry discrepancy is maximized when the depth map and view code have balanced weights when concatenated. Otherwise, the discriminator could be dominated by whichever of the depth map or view code carries much higher weight, which defeats the goal of amplifying the intra-pair geometry discrepancy.
Normal feature fusion through concatenation usually works only when features are of similar types and dimensions. As the depth map and view code are heterogeneous with very different dimensions, plain concatenation assigns much higher weight to the much larger depth features, which undesirably reduces the intra-pair geometry discrepancy. We design an adaptive feature fusion strategy that introduces a dedicated fusion branch in the discriminator to predict a fusion weight, as illustrated in Fig. 6. As the real-fake pair discrimination relies on the intra-pair geometry discrepancy, the fusion branch will strive to learn fusion weights that maximize the intra-pair geometry discrepancy between the depth map and view code.
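A minimal sketch of such adaptive fusion, assuming a single learned scalar logit w; the paper's fusion branch is a small network whose exact form is not specified here:

```python
import numpy as np

def adaptive_fusion(depth_feat, view_feat, w):
    """Fuse heterogeneous depth and view features with a learned weight,
    instead of plain concatenation that lets the much larger depth
    feature vector dominate. w is a raw logit from the fusion branch."""
    alpha = 1.0 / (1.0 + np.exp(-w))  # sigmoid keeps the weight in (0, 1)
    # Re-balance the two feature groups before concatenating them.
    return np.concatenate([alpha * depth_feat, (1.0 - alpha) * view_feat])

depth_feat = np.ones(256)   # large depth feature vector
view_feat = np.ones(2)      # tiny view-code feature vector
fused = adaptive_fusion(depth_feat, view_feat, w=0.0)
print(fused.shape)  # (258,)
```

With w = 0 both groups are scaled by 0.5; during training the fusion branch would adjust w so that neither group dominates the discriminator's decision.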
4 Experiments

4.1 Datasets

KITTI [kitti]: KITTI is captured by driving in rural areas and on highways. Up to 15 cars and 30 pedestrians appear per image. The dataset contains 7481 training images annotated with 3D bounding boxes.
ShapeNet [shapenet]: ShapeNet is a large shape repository of 55 object categories. We select 100 CAD models from the car category as the 3D models for Car Synthesis.
Cityscapes [cityscapes]: Cityscapes is a dataset for semantic understanding of urban street scenes, with images collected from 50 cities. It also contains 3475 images with persons annotated with bounding boxes.
DressedHuman [pedestrian]: DressedHuman was created for evaluating algorithms that estimate the shape of the 3D human body in the presence of loose clothing.
4.2 Experiment Setting
In the car synthesis experiment, we collect 100 3D car models from ShapeNet [shapenet] as the foreground objects. The background images are obtained from KITTI [kitti]. In the pedestrian synthesis experiment, the 3D models are collected from DressedHuman [pedestrian] and the background images are collected from Cityscapes [cityscapes]. In both experiments, the real objects are cropped from the background images according to the provided semantic segmentation.
The differential discriminator involves two types of foreground-background pairs as shown in Fig. 2. Each Real Pair consists of a background image Background2 with an embedding region (where a foreground object is cropped) and the cropped foreground object. In each Fake Pair, an embedding region in the background image Background1 is determined by selecting a bounding box region around an existing object annotation box.
4.3 Experiment Analysis
Quantitative Analysis: We compare VA-GAN with state-of-the-art GANs by using Amazon Mechanical Turk (AMT) scores and the Frechet Inception Distance (FID). Table 1 shows the AMT scores, where the number in each cell gives the percentage of images composed by VA-GAN that are deemed more realistic than those by the compared GAN listed in the first row. Note that rows 2-3 compare only the objects generated by different GANs.
As Table 1 shows, VA-GAN clearly outperforms LSGAN [mao2016lsgan], DCGAN [radford2016dcgan] and WGAN-GP [arjovsky2017wgan], largely because VA-GAN generates more accurate object textures from depth maps (instead of from noise as in [mao2016lsgan, radford2016dcgan, arjovsky2017wgan]), which carry more geometry information. We also compare VA-GAN with CycleGAN [zhu2017cyclegan] and VON [zhu2018von], which similarly generate objects from depth maps. VA-GAN achieves more realistic composition as well, largely due to the view encoding mechanism that helps to generate more accurate object views as illustrated in Fig. 5. In addition, VA-GAN performs better for car object synthesis than for pedestrian object synthesis. This is largely due to the richer structures in 3D car models and the better-quality real objects in car images (pedestrians in Cityscapes are small and blurry).
For the final composition that embeds the GAN-generated objects into the background images, VA-GAN performs the best by large margins as shown in Table 1, as the view estimation produces better geometric alignment. Compared with real images, VA-GAN achieves a lower AMT score for the final composition than for the generated objects, largely because object synthesis evaluates texture realism only, without considering the realism of geometric alignment. Additionally, VA-GAN can control the style of the generated texture and generates accurate object textures across 360-degree views, whereas CycleGAN cannot control the style and VON can only generate objects with views in (-90, 90) degrees. Note that VON used 2600 high-quality car images in training in the original paper, but we used far fewer images of lower quality (the same as used in training VA-GAN) in our experiment.
We also evaluate and compare VA-GAN by using the FID [fid], which computes the distance between the features of natural images and images synthesized by the studied GANs (image features are extracted by an Inception network trained on ImageNet [imagenet]). Table 2 shows the experimental results. As Table 2 shows, VA-GAN achieves the best FID score, which aligns well with the AMT experiment. The best FID is largely due to the view encoding that enables VA-GAN to synthesize accurate object textures with 360-degree view freedom.
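For reference, the FID models the real and generated Inception features as Gaussians $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ and computes their Frechet distance:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \mathrm{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

A lower FID indicates that the synthesized feature distribution is closer to that of natural images.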
Qualitative Analysis: The proposed VA-GAN is capable of synthesizing realistic object textures of arbitrary views as illustrated in Fig. 4. The style encoding ensures that the object textures of different views share the same style and are consistent with each other. The view encoding allows the generated object texture to be consistent with the depth map, and it also contributes to the training of the geometric alignment module.
Fig. 7 shows the final compositions by VA-GAN and the state-of-the-art CycleGAN and VON, which generate objects and directly compose them with background images. As Fig. 7 shows, CycleGAN and VON compositions tend to be unrealistic due to the lack of geometric alignment. In comparison, images synthesized by VA-GAN have clearly more realistic texture and geometric alignment. Another unique feature of VA-GAN is that it can compose the same 3D model with different background images under adaptive views, as illustrated in VA-GAN1, VA-GAN2 and VA-GAN3. This makes it very useful for many tasks such as object tracking, person ReID, etc.
Ablation Study: An ablation study has been conducted to evaluate VA-GAN as shown in Table 3. It involves three variants of the proposed VA-GAN: 1) VA-GAN (WA), which does not include the view alignment module (a random view code is applied to the 3D model to compose with the background image); 2) VA-GAN (WE), which does not include the view encoding; and 3) VA-GAN (WD), which uses a normal discriminator instead of our proposed differential discriminator.
As Table 3 shows, VA-GAN (WA) obtains a clearly lower AMT score than VA-GAN, showing the importance of geometric alignment in synthesizing realistic images. VA-GAN (WE) also obtains a lower AMT score than VA-GAN, as the view encoding helps to generate objects with more accurate views, which further contributes to the training of the alignment module and improves the realism of synthesized images. Further, the AMT scores of VA-GAN (WD) are also lower than those of VA-GAN, demonstrating that the suppression of discrepancy in the style space and the amplification of discrepancy in the geometry space (as implemented in our differential discriminator) greatly improve the learning of geometric alignment.
5 Conclusion

This paper presents a view-alignment GAN (VA-GAN), an end-to-end trainable network that is capable of synthesizing realistic images given 3D models and 2D background images. A view encoding mechanism is designed to synthesize accurate textures of 3D models, and a novel differential discriminator is proposed to achieve effective learning of geometric alignment. Extensive experiments show that the proposed VA-GAN clearly outperforms state-of-the-art GANs both quantitatively and qualitatively. Moving forward, we will continue to investigate how to combine VA-GAN with 3D model generation for more flexible end-to-end 3D model generation and image composition.