Towards Realistic 3D Embedding via View Alignment

Fangneng Zhan et al., 07/14/2020

Recent advances in generative adversarial networks (GANs) have achieved great success in automated image composition, which generates new images by embedding foreground objects of interest into background images automatically. However, most existing works deal with foreground objects in two-dimensional (2D) images, although foreground objects in three-dimensional (3D) models are more flexible, with 360-degree view freedom. This paper presents an innovative View Alignment GAN (VA-GAN) that composes new images by embedding 3D models into 2D background images realistically and automatically. VA-GAN consists of a texture generator and a differential discriminator that are inter-connected and end-to-end trainable. The differential discriminator guides the learning of geometric transformations from background images so that the composed 3D models can be aligned with the background images with realistic poses and views. The texture generator adopts a novel view encoding mechanism for generating accurate object textures for the 3D models under the estimated views. Extensive experiments over two synthesis tasks (car synthesis with KITTI and pedestrian synthesis with Cityscapes) show that VA-GAN achieves high-fidelity composition both qualitatively and quantitatively as compared with state-of-the-art generation methods.


1 Introduction

Generative adversarial networks (GANs) [goodfellow2014gan] have demonstrated great potential for synthesizing realistic images in recent years. Three typical approaches have been explored for image synthesis with GANs, namely direct image generation [radford2016dcgan, arjovsky2017wgan], image-to-image translation [zhu2017cyclegan, isola2017pixel2pixel], and image composition [lin2018stgan, zhan2019sfgan].

Figure 1: Given 3D object models and a 2D background image, the proposed VA-GAN is capable of predicting realistic views/poses of the 3D object models with respect to the background image and generating harmonious object textures automatically. For the three 3D car models shown at the bottom-left corner, VA-GAN predicts their poses according to the local geometry of the background image and generates realistic textures as illustrated.

Image composition aims to embed foreground objects of interest into background scenes automatically and realistically. Researchers have explored different approaches to address this challenging task. The classical approach is image matting [matting], which estimates the opacity of foreground objects for composition with different backgrounds. It requires an image containing the foreground of interest and usually does not change the geometry of the foreground in the composed images. Another approach aims to embed transparent two-dimensional (2D) foreground object images into background images realistically [zhu2017cyclegan, isola2017pixel2pixel]. It harmonizes the appearance of the foreground objects with respect to the background, and involves homography transformations for aligning the 2D foreground objects with the background image geometrically. To overcome the limited views of 2D foreground objects, composition with three-dimensional (3D) object models with 360-degree view freedom [zhu2018von, yao20183dsdn] has attracted increasing interest recently. However, it remains an open research challenge due to the very limited amount of 3D models with harmonious textures.

In this paper, we propose a View Alignment GAN (VA-GAN) that composes new images by embedding 3D object models into 2D background images with realistic object poses and textures, as illustrated in Fig. 1. VA-GAN consists of a texture generator and a differential discriminator that are inter-connected and end-to-end trainable. The differential discriminator guides the learning of geometric transformations from real object-background pairs and predicts transformation parameters for embedding 3D object models into the background images with realistic poses and views. A view predictor is employed to produce a view code (consisting of the estimated transformation parameters) for generating object textures under the estimated object view. The texture generator learns to produce realistic object textures from the depth map of the 3D model. A projector is employed to generate the depth map of the 3D model according to the transformation parameters estimated by the view predictor.
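To make the overall data flow concrete, the sketch below wires the described components together in PyTorch-style code. All module interfaces, tensor shapes, and the composition step are our own illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn

class VAGANPipeline(nn.Module):
    """Hypothetical wrapper showing how the described components could be chained."""

    def __init__(self, view_predictor: nn.Module, projector: nn.Module,
                 texture_generator: nn.Module):
        super().__init__()
        self.view_predictor = view_predictor        # background -> view code (transformation params)
        self.projector = projector                  # 3D model + view code -> depth map
        self.texture_generator = texture_generator  # depth map + view/style codes -> object texture

    def forward(self, background, model_3d, style_code):
        # 1) Predict a view code from the local geometry of the background region.
        view_code = self.view_predictor(background)            # e.g. (B, 2) rotation parameters
        # 2) Render a depth map of the 3D model under the predicted view.
        depth_map = self.projector(model_3d, view_code)         # e.g. (B, 1, H, W)
        # 3) Generate object texture conditioned on depth, view and style.
        texture = self.texture_generator(depth_map, view_code, style_code)
        # 4) The textured object would then be pasted into the selected embedding
        #    region of the background (compositing details omitted in this sketch).
        return texture, depth_map, view_code
```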

The contributions of this work are threefold. First, we design an innovative View Alignment GAN that composes 3D object models into 2D background images with realistic texture and geometry automatically. Second, we design a differential discriminator that is capable of learning realistic geometric alignment between 3D object models and 2D background images. Third, we design a view encoding mechanism that can synthesize realistic textures from depth maps of 3D object models under arbitrary views.

2 Related Work

2.1 Image Composition

Image composition aims to embed foreground objects into background images realistically and automatically by adjusting object sizes, poses, colours, etc. A number of image composition techniques have been reported in the past few years owing to the wide applications of image composition in various computer vision tasks [richter2016]. For example, [zhu2015] proposes a model to distinguish natural photographs from computer-composed ones. [dwibedi2017, zhan2018ver] treat image composition as an image augmentation scheme for generating more useful training images.

With the advance of GANs, a number of GAN-based methods have been proposed for image composition. For example, [lin2018stgan] presents a spatial transformer GAN for geometric realism in composition. [tripathi2019] employs a trainable synthesizer network to generate meaningful training samples. [chen2019compose] proposes an image compositing GAN that learns geometric and color correction simultaneously. [zhan2019sfgan] combines a geometry synthesizer and an appearance synthesizer to achieve synthesis realism in both geometry and appearance spaces.

2.2 Generative Adversarial Networks

GANs [goodfellow2014gan] have achieved great success in image generation from either random noise or existing images. For example, [denton2015lapgan] introduces Laplacian pyramids that greatly improve the quality of GAN-synthesized images. [lee2018context] proposes an end-to-end trainable network for inserting an object mask into the semantic label map of an image. Most existing GANs focus on high-fidelity image synthesis in appearance by manipulating image colours, textures, styles, etc. For instance, CycleGAN [zhu2017cyclegan] proposes a cycle-consistent adversarial network for realistic image-to-image translation, as do others [isola2017pixel2pixel, shrivastava2017simgan, huang2018munit, park2019spade, liu2019funit].

In recent years, researchers have paid increasing attention to high-fidelity geometry in image synthesis by manipulating local image structures and global perspectives. For example, [azadi2018comgan] describes a Compositional GAN that introduces a self-consistent composition-decomposition network. [yao20183dsdn, zhu2018von] study GAN-based 3D manipulation and generation.

Figure 2: The architecture of the proposed View Alignment GAN: Given a background image Background1, the View Predictor learns to predict a View Code (consisting of transformation parameters) according to the local geometry of Background1, under the guidance of a differential discriminator that employs real object-background pairs to learn geometric transformations. The Projector generates the depth map of the 3D model based on the predicted View Code, and the Texture Generator learns to map the depth map to realistic object textures for the 3D model under the estimated view.

3 Proposed Method

3.1 Network Architecture

The proposed VA-GAN consists of a View Generator, a Projector, a Texture Generator and a differential discriminator, as illustrated in Fig. 2. It takes a background image with a selected object embedding region (i.e. Background1) and a 3D model of the foreground object of interest as inputs. The View Generator predicts a View Code based on the local geometry of Background1, which consists of parameters that define the rotation angles (within [−π, π]) of the 3D model in the horizontal and vertical directions, respectively. A differentiable Projector [zhu2018von] is employed to generate a Depth Map of the 3D model according to the predicted View Code. The Texture Generator is pre-trained and generates realistic object textures from the depth map based on the predicted View Code and a style code, as detailed in the ensuing subsection Texture Generator. A differential discriminator guides the View Generator to learn the View Code and align the 3D model with the background image (using real object-background pairs as references), as detailed in the ensuing subsection Differential Discriminator.
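As a rough illustration of how two predicted rotation angles (horizontal and vertical) could parameterize the view handed to the projector, the sketch below builds a rotation matrix from them. The actual differentiable projector follows [zhu2018von]; the function, its angle convention, and the assumed radian range are illustrative assumptions only.

```python
import numpy as np

def view_to_rotation(azimuth: float, elevation: float) -> np.ndarray:
    """Build a 3x3 rotation matrix from a horizontal (azimuth) and a vertical
    (elevation) angle in radians. One common convention; the paper's projector
    may use a different parameterization."""
    ca, sa = np.cos(azimuth), np.sin(azimuth)
    ce, se = np.cos(elevation), np.sin(elevation)
    # Rotate about the vertical (y) axis, then about the lateral (x) axis.
    R_y = np.array([[ ca, 0.0, sa],
                    [0.0, 1.0, 0.0],
                    [-sa, 0.0, ca]])
    R_x = np.array([[1.0, 0.0, 0.0],
                    [0.0,  ce, -se],
                    [0.0,  se,  ce]])
    return R_x @ R_y
```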

Figure 3: Illustration of the cyclic processes in the texture generator: A style code (sampled from a normal distribution) is concatenated with the depth map to determine the style of the synthesized fake object, and a style encoder strives to recover the style code from the synthesized fake object. A view code is concatenated with the synthesized fake object to recover the depth map. A view encoder strives to recover the view code from the synthesized fake object, which leads to a cyclic consistency for the view code.

3.2 Texture Generator

A Texture Generator is designed to generate realistic object textures from a depth map with a Style Code (randomly sampled from a normal distribution) and a View Code, as illustrated in Fig. 2. The texture generator is pre-trained independently and is frozen while training the geometric view alignment between the 3D object model and the background image.
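In a PyTorch implementation, freezing the pre-trained texture generator during view-alignment training would amount to disabling its gradients and fixing its normalization statistics. A minimal sketch, assuming `texture_generator`, `view_generator` and `discriminator` are `nn.Module` instances (the names are our assumptions):

```python
import torch

# Assuming texture_generator is a pre-trained torch.nn.Module.
texture_generator.requires_grad_(False)  # exclude its weights from optimization
texture_generator.eval()                 # freeze BatchNorm/Dropout statistics

# Only the view-alignment components would then be optimized, e.g.:
optimizer = torch.optim.Adam(
    list(view_generator.parameters()) + list(discriminator.parameters()),
    lr=2e-4)  # learning rate is an illustrative choice
```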

The texture generator adopts a GAN-based cyclic structure for translation from depth maps (generated from 3D models) to object textures, as illustrated in Fig. 3. Specifically, a style code is concatenated with the depth map to determine the style of the generated object texture. A style encoder strives to recover the style code from the generated object texture, which leads to a cyclic consistency for the style code. For a specific style code, the texture generator can generate object textures of different views that are consistent with each other, as illustrated in Fig. 4.
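A minimal sketch of the style-cycle term described above, assuming a depth-to-texture generator `G_tex` and a style encoder `E_style` (the names, the style dimension, and the use of an L1 reconstruction penalty are our assumptions):

```python
import torch
import torch.nn.functional as F

def style_cycle_loss(G_tex, E_style, depth_map, style_dim=8):
    """Sample a style code, synthesize a texture from the depth map, and
    penalize the style encoder's reconstruction error -- the cyclic
    consistency for the style code."""
    batch_size = depth_map.size(0)
    style = torch.randn(batch_size, style_dim, device=depth_map.device)
    fake_texture = G_tex(depth_map, style)       # depth + style -> object texture
    style_rec = E_style(fake_texture)            # recover the style from the texture
    return F.l1_loss(style_rec, style), fake_texture
```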

View Encoding: As the training of the view alignment depends on pairs of objects and background images, accurate generation of object textures is critical for training the view alignment module as well as the whole VA-GAN. While generating 2D object textures from depth maps, one specific issue is that the generated object textures tend to present inaccurate views, as illustrated in Fig. 5, largely due to insufficient view information in the depth maps. We design a view encoding mechanism that feeds a view code (from the View Generator) to the texture generator for accurate texture generation, as illustrated in Fig. 3. As the views of 3D models are periodic with period 2π, the view code is transformed to be periodic as well before being concatenated with images. We apply a cosine transformation to the view code (within [−π, π)), which achieves periodicity and a smooth encoding of views.
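The periodic view encoding could be realized as below: the view code is passed through a cosine transform and tiled spatially before channel-wise concatenation. The tensor layout and the broadcast-and-concatenate scheme are illustrative assumptions:

```python
import torch

def encode_and_concat_view(image, view_code):
    """image: (B, C, H, W); view_code: (B, K) angles in [-pi, pi).
    Apply a cosine transform so the encoding is periodic in the view angle,
    then tile it spatially and concatenate along the channel axis."""
    b, _, h, w = image.shape
    periodic = torch.cos(view_code)                                    # (B, K), periodic every 2*pi
    view_map = periodic.view(b, -1, 1, 1).expand(b, periodic.size(1), h, w)
    return torch.cat([image, view_map], dim=1)                         # (B, C+K, H, W)
```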

Figure 4: Object texture generation by the proposed texture generator: The textures of different views of the 3D model (on the left) are generated with a fixed style code. The generated textures of different views have the same style and are consistent with each other.
Figure 5: The proposed view encoding mechanism helps to generate more accurate textures from the depth map that are well aligned with the view of the 3D model. Without view encoding, the object texture generated from the depth map tends to produce ambiguous views.

As the style code has already been concatenated with the depth map as the input of the depth-to-texture generator, as shown in Fig. 3, concatenating another view code with the depth map would entangle the effects of the two codes, which tends to degrade the quality of the synthesized object. To avoid this problem, we concatenate the view code with the object texture as the input of the texture-to-depth generator, as illustrated in Fig. 3. A view encoder strives to recover the view code from the synthesized object, which leads to a cyclic consistency for the view code.

3.3 Differential Discriminator

We design a differential discriminator to guide the View Generator to learn realistic geometric alignment between the foreground object and the background image. To learn optimal geometric alignments, the discriminator needs to distinguish the geometries rather than the textures in the Real Pairs and Fake Pairs shown in Fig. 2. Inspired by the differential circuit, which amplifies the difference between two input voltages but suppresses any voltage common to the two inputs, we design the discriminator so that it amplifies the intra-pair geometry discrepancy and suppresses the inter-pair texture discrepancy, as illustrated in Fig. 6. Note that the inter-pair texture discrepancy mainly exists between the real pairs and the fake pairs, while the intra-pair geometry discrepancy exists within both real and fake pairs.

Suppression of Inter-Pair Texture Discrepancy: The differential discriminator works with the View Generator and the Texture Generator to suppress the inter-pair texture discrepancy. In the Texture Generator in Fig. 3, two generators are employed to generate object textures from depth maps and to recover depth maps from object textures, respectively, and two discriminators are employed to distinguish real and fake object textures and real and fake depth maps, respectively. After the Texture Generator is trained, the texture-to-depth generator is able to recover depth maps that are indistinguishable from real ones, which means that it largely extracts geometry rather than texture features. The View Generator aims to extract geometry features from the background image for predicting the View Code. The differential discriminator thus suppresses the inter-pair texture discrepancy by adopting the View Generator to predict the view code from the background image and the texture-to-depth generator to recover the depth map from the object images in both the fake and real pairs.
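A sketch of how both real and fake pairs could be mapped into the same texture-free representation before being fed to the differential discriminator, assuming a texture-to-depth generator `G_tex2depth` and a `view_generator` as callable modules (the names are our assumptions):

```python
def build_discriminator_inputs(G_tex2depth, view_generator,
                               real_object, real_background,
                               fake_object, fake_background):
    """Map both real and fake pairs into (depth map, view code) form so that
    texture differences between the pairs are suppressed and only geometry
    discrepancies remain for the discriminator to judge."""
    # Geometry-only representation of the object images.
    real_depth = G_tex2depth(real_object)
    fake_depth = G_tex2depth(fake_object)
    # View codes predicted from the local geometry of the backgrounds.
    real_view = view_generator(real_background)
    fake_view = view_generator(fake_background)
    return (real_depth, real_view), (fake_depth, fake_view)
```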

Amplification of Intra-Pair Geometry Discrepancy: As the texture discrepancy between the real and fake pairs is suppressed, the discrimination of real and fake pairs largely depends on the intra-pair geometry discrepancy between the depth map and the view code. How to amplify the intra-pair geometry discrepancy thus becomes critical for effective discrimination between real and fake pairs. The intra-pair geometry discrepancy is maximized when the depth map and the view code carry balanced weights in the concatenation. Otherwise, the discriminator could be dominated by either the depth map or the view code if one of them carries a much higher weight, which defeats the goal of maximizing the intra-pair geometry discrepancy.

Normal feature fusion through concatenation usually works only when the features are of similar types and dimensions. As the depth map and the view code are heterogeneous with very different dimensions, normal concatenation assigns a much higher weight to the much larger depth features, which undesirably reduces the intra-pair geometry discrepancy. We design an adaptive feature fusion strategy that introduces a dedicated fusion branch in the discriminator to predict a fusion weight, as illustrated in Fig. 6. As the real-fake pair discrimination relies on the intra-pair geometry discrepancy, the fusion branch in the discriminator strives to learn fusion weights that maximize the intra-pair geometry discrepancy between the depth map and the view code.
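The adaptive fusion idea could be sketched as a small module that predicts a scalar fusion weight from both inputs and balances them before concatenation; the pooling, layer sizes, and weighting scheme below are our assumptions rather than the paper's exact design:

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Fuse a (large) depth-feature map with a (small) view code using a
    learned scalar weight, so neither input dominates by sheer dimensionality."""

    def __init__(self, depth_channels: int, view_dim: int):
        super().__init__()
        self.fusion_branch = nn.Sequential(
            nn.Linear(depth_channels + view_dim, 64),
            nn.ReLU(inplace=True),
            nn.Linear(64, 1),
            nn.Sigmoid(),                       # fusion weight in (0, 1)
        )

    def forward(self, depth_feat, view_code):
        # depth_feat: (B, C, H, W) pooled to (B, C); view_code: (B, K)
        pooled = depth_feat.mean(dim=(2, 3))
        w = self.fusion_branch(torch.cat([pooled, view_code], dim=1))  # (B, 1)
        # Weighted concatenation of the two heterogeneous features.
        return torch.cat([w * pooled, (1 - w) * view_code], dim=1)
```

In this sketch, a weight near 0 or 1 would let one input dominate, so the adversarial objective pushes the fusion branch toward weights that keep both geometry cues informative.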

Figure 6: The structure of the proposed differential discriminator: It employs the texture-to-depth generator (in the Texture Generator) to extract depth maps from the object textures and the View Generator to predict the view code from the background, so as to suppress the texture discrepancy between real and fake pairs. To amplify the intra-pair geometry discrepancy between the depth map and the view code, it adopts a fusion branch to predict adaptive weights for fusing the depth map and the view code. As the discriminator distinguishes real and fake pairs according to the intra-pair geometry discrepancy, the fusion branch strives to learn the optimal weights that maximize the intra-pair geometry discrepancy. The orange and blue parts denote the suppression of inter-pair texture discrepancy and the amplification of intra-pair geometry discrepancy, respectively.

4 Experiment

4.1 Datasets

KITTI [kitti]: KITTI was captured by driving around rural areas and on highways. Up to 15 cars and 30 pedestrians are captured per image. The dataset contains 7481 training images annotated with 3D bounding boxes.

ShapeNet [shapenet]: ShapeNet is a large shape repository of 55 object categories. We select 100 CAD models from the car category as the 3D models for Car Synthesis.

Cityscapes [cityscapes]: Cityscapes is a dataset for semantic understanding of urban street scenes, with images collected from 50 cities. It also contains 3475 images with persons annotated with bounding boxes.

DressedHuman [pedestrian]: DressedHuman was created for evaluating algorithms that estimate the shape of the 3D human body in the presence of loose clothing.

4.2 Experiment Setting

In the car synthesis experiment, we collect 100 3D car models from ShapeNet [shapenet] as the foreground objects. The background images are obtained from KITTI [kitti]. In the pedestrian synthesis experiment, the 3D models are collected from DressedHuman [pedestrian] and the background images are collected from Cityscapes [cityscapes]. In both experiments, the real objects are cropped from the background images according to the provided semantic segmentation.

The differential discriminator involves two types of foreground-background pairs as shown in Fig. 2. Each Real Pair consists of a background image Background2 with an embedding region (where a foreground object is cropped) and the cropped foreground object. In each Fake Pair, an embedding region in the background image Background1 is determined by selecting a bounding box region around an existing object annotation box.

Target \ Method    LSGAN    DCGAN    CycleGAN    VON    Real
Cars*               70%      69%       57%       55%     42%
Pedestrians*        63%      62%       55%       54%     44%
Cars                94%      89%       77%       75%     33%
Pedestrians         87%      80%       68%       67%     38%
Table 1: Comparing VA-GAN with state-of-the-art GANs using Amazon Mechanical Turk: The number in each cell tells the percentage of the images composed by VA-GAN that are deemed more realistic by users than the images composed by the GAN in the corresponding column. 'Cars*' and 'Pedestrians*' denote synthesized objects only. 'Cars' and 'Pedestrians' denote final compositions with geometric alignment to background images.
Method      Car     Pedestrian
DCGAN      115.1      196.3
LSGAN      135.7      191.2
WGAN-GP    108.5      163.8
VON         74.6      112.4
VA-GAN      70.2       97.7
Table 2: The Frechet Inception Distance (FID) of the images that are composed by VA-GAN and several state-of-the-art GANs including DCGAN, LSGAN, WGAN-GP, and VON.

4.3 Experiment Analysis

Quantitative Analysis: We compare VA-GAN with state-of-the-art GANs by using Amazon Mechanical Turk (AMT) scores and the Frechet Inception Distance (FID). Table 1 shows the AMT scores, where the number in each cell tells the percentage of images composed by VA-GAN that are deemed more realistic than those composed by the compared GANs listed in the first row. Note that the starred rows (Cars* and Pedestrians*) compare only the objects generated by different GANs.

As Table 1 shows, VA-GAN clearly outperforms LSGAN [mao2016lsgan], DCGAN [radford2016dcgan] and WGAN-GP [arjovsky2017wgan], largely because VA-GAN generates more accurate object textures from depth maps (instead of noise as in [mao2016lsgan, radford2016dcgan, arjovsky2017wgan]) that carry more geometry information. We also compare VA-GAN with CycleGAN [zhu2017cyclegan] and VON [zhu2018von], which similarly generate objects from depth maps. VA-GAN achieves more realistic composition as well, largely due to the view encoding mechanism that helps to generate more accurate object views, as illustrated in Fig. 5. In addition, VA-GAN performs better for car synthesis than for pedestrian synthesis. This is largely due to the richer structures in 3D car models and the better quality of the real objects in the car images (pedestrians in Cityscapes are small and blurry).

Figure 7: Comparison of image composition by different GANs: VA-GAN1, VA-GAN2 and VA-GAN3 denote three sample images composed by VA-GAN by using the same 3D model and style code but different background images. Red boxes highlight the embedded objects.

For the final composition that embeds the GAN-generated objects into the background images, VA-GAN performs the best by large margins, as shown in Table 1, since the view estimation produces better geometric alignment. Compared with real images, VA-GAN achieves a lower AMT score for the final composition than for the generated objects alone, largely because object synthesis evaluates texture realism only, without considering the realism of geometric alignment. Additionally, VA-GAN can control the style of the generated texture and generates accurate object textures over the full 360-degree range of views, whereas CycleGAN cannot control the style and VON can only generate objects with views within (−90°, 90°). Note that VON used 2600 high-quality car images for training in the original paper, whereas we used far fewer images of lower quality (the same as those used for training VA-GAN) in our experiment.

We also evaluate and compare VA-GAN by using the FID [fid], which computes the distance between the features of natural images and images synthesized by the studied GANs (image features are extracted by using an Inception network trained on ImageNet [imagenet]). Table 2 shows the experimental results. As Table 2 shows, VA-GAN achieves the best FID score, which aligns well with the AMT experiment. The best FID is largely due to the view encoding that enables VA-GAN to synthesize accurate object textures with 360-degree view freedom.

Method          Car    Pedestrian
VA-GAN (WA)     19%       22%
VA-GAN (WE)     34%       36%
VA-GAN (WD)     39%       41%
VA-GAN          43%       45%
Table 3: Ablation study using AMT scores: For the composed car and pedestrian images, the numbers tell the percentage of images (by the listed methods) that are deemed as real.

Qualitative Analysis: The proposed VA-GAN is capable of synthesizing realistic object textures of arbitrary views as illustrated in Fig. 4. The style encoding ensures that the object textures of different views share the same style and are consistent with each other. The view encoding allows the generated object texture to be consistent with the depth map, and it also contributes to the training of the geometric alignment module.

Fig. 7 shows the final compositions by VA-GAN and by the state-of-the-art CycleGAN and VON, which generate objects and directly compose them with background images. As Fig. 7 shows, the CycleGAN and VON compositions tend to be unrealistic due to the lack of geometric alignment. In comparison, the images synthesized by VA-GAN have clearly more realistic textures and geometric alignment. Another unique feature of VA-GAN is that it can compose the same 3D model with different background images under adaptive views, as illustrated by VA-GAN1, VA-GAN2 and VA-GAN3. This makes it useful for many tasks such as object tracking and person re-identification (ReID).

Ablation Study: An ablation study has been conducted to evaluate VA-GAN, as shown in Table 3. It involves three variants of the proposed VA-GAN: 1) VA-GAN (WA), which does not include the view alignment module (a random view code is applied to the 3D model for composing with the background image); 2) VA-GAN (WE), which does not include the view encoding; and 3) VA-GAN (WD), which uses a normal discriminator instead of our proposed differential discriminator.

As Table 3 shows, VA-GAN (WA) obtains a clearly lower AMT score than VA-GAN, showing the importance of geometric alignment in synthesizing realistic images. VA-GAN (WE) also obtains a lower AMT score than VA-GAN, as the view encoding helps to generate objects with more accurate views, which further contributes to the training of the alignment module and improves the realism of the synthesized images. Further, the AMT scores of VA-GAN (WD) are also lower than those of VA-GAN, demonstrating that the suppression of discrepancy in the style space and the amplification of discrepancy in the geometry space (as implemented in our differential discriminator) greatly improve the learning of geometric alignment.

5 Conclusions

This paper presents a View Alignment GAN (VA-GAN), an end-to-end trainable network that is capable of synthesizing realistic images given 3D models and 2D background images. A view encoding mechanism is designed to synthesize accurate textures for the 3D models, and a novel differential discriminator is proposed to achieve effective learning of geometric alignment. Extensive experiments show that the proposed VA-GAN clearly outperforms state-of-the-art GANs both quantitatively and qualitatively. Moving forward, we will continue to investigate how to combine VA-GAN with 3D model generation for more flexible end-to-end 3D model generation and image composition.

References