1 Introduction
Modern deep generative models learn to synthesize realistic images. Figure 1a shows several cars generated by a recent model (Gulrajani et al., 2017). However, most methods have only focused on generating images in 2D, ignoring the 3D nature of the world. As a result, they are unable to answer some questions that would be effortless for a human, for example: what will a car look like from a different angle? What if we apply its texture to a truck? Can we mix different 3D designs? Therefore, a 2D-only perspective inevitably limits a model’s practical application in fields such as robotics, virtual reality, and gaming.
In this paper, we present an end-to-end generative model that jointly synthesizes 3D shapes and 2D images via a disentangled object representation. Specifically, we decompose our image generation model into three conditionally independent factors: shape, viewpoint, and texture, borrowing ideas from classic graphics rendering engines (Kajiya, 1986). Our model first learns to synthesize 3D shapes that are indistinguishable from real shapes. It then computes the 2.5D sketches of each shape (Barrow and Tenenbaum, 1978; Marr, 1982) with a differentiable projection module from a sampled viewpoint. Finally, it learns to add diverse, realistic texture to 2.5D sketches and produce 2D images that are indistinguishable from real photos. We call our model Visual Object Networks (VON).
Wiring in conditional independence reduces our need for densely annotated data: unlike classic morphable face models (Blanz and Vetter, 1999), our training does not require paired data between 2D images and 3D shapes, nor dense correspondence annotations in 3D data. This advantage allows us to leverage both 2D image datasets and 3D shape collections (Chang et al., 2015) and to synthesize objects of diverse shapes and texture.
Through extensive experiments, we show that VON produces more realistic image samples than recent 2D deep generative models. We also demonstrate many 3D applications that are enabled by our disentangled representation, including rotating an object, adjusting object shape and texture, interpolating between two objects in texture and shape space independently, and transferring the appearance of a real image to new objects and viewpoints.
2 Related Work
GANs for 2D image synthesis.
Since the invention of Generative Adversarial Nets (GANs) (Goodfellow et al., 2014), many researchers have adopted adversarial learning for various image synthesis tasks, ranging from image generation (Radford et al., 2016; Arjovsky et al., 2017; Karras et al., 2018), image-to-image translation (Isola et al., 2017; Zhu et al., 2017a), text-to-image synthesis (Zhang et al., 2017; Reed et al., 2016), and interactive image editing (Zhu et al., 2016; Wang et al., 2018), to classic vision and graphics tasks such as inpainting (Pathak et al., 2016) and super-resolution (Ledig et al., 2017). Despite the tremendous progress made on 2D image synthesis, most of the above methods operate in 2D space, ignoring the 3D nature of our physical world. As a result, the lack of 3D structure inevitably limits some practical applications of these generative models. In contrast, we present an image synthesis method powered by a disentangled 3D representation. It allows a user to change the viewpoint easily, as well as to edit the object’s shape or texture independently. Dosovitskiy et al. (2015) used supervised CNNs for generating synthetic images given object style, viewpoint, and color. We differ in that our aim is to produce objects with 3D geometry and natural texture without using labelled data.
3D shape generation.
There has been an increasing interest in synthesizing 3D shapes with deep generative models, especially GANs. Popular representations include voxels (Wu et al., 2016), point clouds (Gadelha et al., 2017b; Achlioptas et al., 2018), and octrees (Tatarchenko et al., 2017). Other methods learn 3D shape priors from 2D images (Rezende et al., 2016; Gadelha et al., 2017a). Recent work has also explored 3D shape completion from partial scans with deep generative models (Dai et al., 2017; Wang et al., 2017; Wu et al., 2018), including generalization to unseen object categories (Zhang et al., 2018). Unlike prior methods that only synthesize untextured 3D shapes, our method learns to produce both realistic shapes and images. Recent and concurrent work has learned to infer both texture and 3D shapes from 2D images, represented as parametrized meshes (Kanazawa et al., 2018), point clouds (Tatarchenko et al., 2016), or colored voxels (Tulsiani et al., 2017; Sun et al., 2018b). While they focus on 3D reconstruction, we aim to learn an unconditional generative model of shapes and images with disentangled representations of object texture, shape, and pose.
Inverse graphics.
Motivated by the philosophy of “vision as inverse graphics” (Yuille and Kersten, 2006; Bever and Poeppel, 2010), researchers have made much progress in recent years on learning to invert graphics engines, many with deep neural networks (Kulkarni et al., 2015b; Yang et al., 2015; Kulkarni et al., 2015a; Tung et al., 2017; Shu et al., 2017). In particular, Kulkarni et al. (2015b) proposed a convolutional inverse graphics network. Given an image of a face, the network learns to infer properties such as pose and lighting. Tung et al. (2017) extended inverse graphics networks with adversarial learning. Wu et al. (2017, 2018) inferred 3D shapes from a 2D image via 2.5D sketches and learned shape priors. Here we focus on a complementary problem: learning generative graphics networks via the idea of “graphics as inverse vision”. In particular, we learn our generative model with recognition models that recover 2.5D sketches from generated images.
3 Formulation
Our goal is to learn an (implicit) generative model that can sample an image from three factors: a shape code, a viewpoint code, and a texture code. The texture code describes the appearance of the object, which accounts for the object’s albedo, reflectance, and environment illumination. These three factors are disentangled and conditionally independent of each other. Our model is category-specific, as the visual appearance of an object depends on the class. We further assume that all the codes lie in their own low-dimensional spaces. During training, we are given a 3D shape collection, where each shape is a binary voxel grid, and a 2D image collection. Our model training requires no alignment between 3D and 2D data. We assume that every training image has a clean background and only contains the object of interest. This assumption makes our model focus on generating realistic images of the objects instead of complex backgrounds.
Figure 2 illustrates our model. First, we learn a 3D shape generation network that produces realistic voxels given a shape code (Section 3.1). We then develop a differentiable projection module that projects a 3D voxel grid into 2.5D sketches, given a particular viewpoint (Section 3.2). Next, we learn to produce a final image given the 2.5D sketches and a randomly sampled texture code, using our texture synthesis network (Section 3.3). Section 3.4 summarizes our full model and Section 3.5 includes implementation details. Our entire model is differentiable and can be trained end-to-end.
During testing, we sample an image from latent codes via our shape network, texture network, and projection module.
3.1 Learning 3D Shape Priors
Our first step is to learn a category-specific 3D shape prior from large shape collections (Chang et al., 2015). This prior depends on the object class but is conditionally independent of other factors such as viewpoint and texture. To model the 3D shape prior and generate realistic shapes, we adopt the 3D Generative Adversarial Networks recently proposed by Wu et al. (2016).
Consider a voxelized 3D object collection. We learn a generator that maps a shape code, randomly sampled from a Gaussian distribution, to a voxel grid. Simultaneously, we train a 3D discriminator to classify a shape as real or generated. Both the discriminator and the generator contain fully volumetric convolutional and deconvolutional layers. We find that the original 3D-GAN (Wu et al., 2016) sometimes suffers from mode collapse. To improve the quality and diversity of the results, we use the Wasserstein distance of WGAN-GP (Arjovsky et al., 2017; Gulrajani et al., 2017). Formally, we play the following minimax two-player game between the generator and the discriminator:
(1)
To enforce the Lipschitz constraint in Wasserstein GANs (Arjovsky et al., 2017), we add a gradient-penalty loss to Eqn. 1. The penalty is evaluated at a randomly sampled point along the straight line between a real shape and a generated shape, and its weight controls the capacity of the discriminator. Since binary data is often challenging to model using GANs, we also experiment with a distance function (DF) representation (Curless and Levoy, 1996), which is continuous on the 3D voxel space. See Section 4.1 for quantitative evaluations.
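For reference, the standard WGAN-GP formulation of Gulrajani et al. (2017), specialized to shapes, can be sketched as follows. The symbols $G$, $D$, $v$, $z$, $\hat{v}$, $\epsilon$, and $\lambda$ are notation assumed here for illustration and may differ from the exact form of Eqn. 1.

```latex
\begin{align}
\mathcal{L}(G, D) &= \mathbb{E}_{v}\big[D(v)\big] - \mathbb{E}_{z}\big[D(G(z))\big], \\
\mathcal{L}_{\mathrm{GP}}(D) &= \lambda\, \mathbb{E}_{\hat{v}}\Big[\big(\lVert \nabla_{\hat{v}} D(\hat{v}) \rVert_2 - 1\big)^2\Big], \\
\hat{v} &= \epsilon\, v + (1 - \epsilon)\, G(z), \qquad \epsilon \sim U[0, 1].
\end{align}
```

In this sketch, $v$ is a real shape, $z$ is a shape code, and $\hat{v}$ is the interpolated point on the line between a real and a generated shape. The discriminator is trained to maximize $\mathcal{L} - \mathcal{L}_{\mathrm{GP}}$, while the generator is trained to minimize $\mathcal{L}$.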
3.2 Generating 2.5D Sketches
Given a synthesized voxelized shape, how can we connect it to a 2D image? Inspired by recent work on 3D reconstruction (Wu et al., 2017), we use 2.5D sketches (Barrow and Tenenbaum, 1978; Marr, 1982) to bridge the gap between 3D and 2D. This intermediate representation provides three main advantages. First, generating 2.5D sketches from a 3D voxel grid is straightforward, as the projection is differentiable with respect to both the input shape and the viewpoint. Second, 2D image synthesis from a 2.5D sketch can be cast as an image-to-image translation problem (Isola et al., 2017), where existing methods have achieved successes even without paired data (Zhu et al., 2017a). Third, compared with alternative approaches such as colored voxels, our method enables generating images at a higher resolution.
Here we describe our differentiable module for projecting voxels into 2.5D sketches. The inputs to this module are the camera parameters and 3D voxels. The value of each voxel stores the probability of it being present. To render the 2.5D sketches from the voxels under a perspective camera, we first generate a collection of rays, each originating from the camera’s center and going through a pixel’s center in the image plane. We then need to calculate whether a given ray would hit the voxels, and if so, the corresponding depth value of that ray. To this end, we first sample a collection of points at evenly spaced depths along each ray. Next, for each point, we calculate the probability of hitting the input voxels using a differentiable trilinear interpolation (Jaderberg et al., 2015) of the input voxels. Similar to Tulsiani et al. (2017), we then calculate the expectation of visibility and depth along each ray. Specifically, given a ray with a sequence of samples along its path, we calculate the visibility (silhouette) as the expectation of the ray hitting the voxels. Similarly, the expected depth is calculated from the depth of each sample. This process is fully differentiable since the gradients can be backpropagated through both the expectation calculation and the trilinear interpolation.
Viewpoint estimation.
Our two-dimensional viewpoint code encodes camera elevation and azimuth. We sample it from an empirical distribution of the camera poses from the training images. To estimate this distribution, we first render the silhouettes of several candidate 3D models under uniformly sampled camera poses. For each input image, we compare its silhouette to the rendered 2D views and choose the pose with the largest Intersection-over-Union value. More details can be found in the supplement.
3.3 Learning 2D Texture Priors
Next, we learn to synthesize realistic 2D images given projected 2.5D sketches that encode both the viewpoint and the object shape. In particular, we learn a texture network that takes a randomly sampled texture code and the projected 2.5D sketches as input, and produces a 2D image. This texture network needs to model both object texture and environment illumination, as well as the differentiable rendering equation (Kajiya, 1986). Fortunately, this mapping problem can be cast as an unpaired image-to-image translation problem (Zhu et al., 2017a; Yi et al., 2017; Liu et al., 2017). We adopt recently proposed cycle-consistent adversarial networks (CycleGAN) (Zhu et al., 2017a) as our baseline. Later, we relax the one-to-one mapping restriction in CycleGAN to handle one-to-many mappings from 2.5D sketches to 2D images.
Here we introduce two encoders, a texture encoder and a 2.5D sketch encoder, to estimate a texture code and 2.5D sketches from a real image. We train the texture network and the two encoders jointly with adversarial losses (Goodfellow et al., 2014) and cycle-consistency losses (Zhu et al., 2017a; Yi et al., 2017). We use the following adversarial loss on the final generated image:
(2) 
where the image discriminator learns to classify real and generated images. We apply the same adversarial loss for 2.5D sketches:
(3) 
where the 2.5D sketch discriminator aims to distinguish between projected 2.5D sketches and 2.5D sketches estimated from a real 2D image. We further use cycle-consistency losses (Zhu et al., 2017a) to enforce the bijective relationship between the two domains:
and  (4) 
where two weights control the importance of each cycle loss. The texture encoder and the 2.5D sketch encoder serve as recognition models that recover the texture and 2.5D representation from a 2D image.
One-to-many mappings.
Prior studies (Isola et al., 2016; Mathieu et al., 2016) have found that latent codes are often ignored in conditional image generation due to the assumption of a one-to-one mapping; in our experiments, vanilla CycleGAN also suffers from this problem. To address this, we introduce a latent space cycle-consistency loss to encourage the texture network to use the texture code:
(5) 
where a weight controls its importance. Finally, to allow sampling at test time, we add a Kullback–Leibler (KL) loss on the texture code space to force the encoded texture codes to be close to a Gaussian distribution:
(6) 
where and is its weight. We write the final texture loss as
(7) 
Note that the latent space reconstruction loss has been explored in unconditional GANs (Chen et al., 2016) and image-to-image translation (Zhu et al., 2017b; Almahairi et al., 2018). Here we use this loss to learn one-to-many mappings from unpaired data.
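To make the KL term concrete, here is a minimal sketch of the closed-form KL divergence to a standard Gaussian. It assumes, hypothetically, that the texture encoder outputs the mean and log-variance of a diagonal Gaussian over the texture code; the function name and this parameterization are illustrative and not taken from the text.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form.

    mu, log_var: equal-length sequences of floats (assumed encoder outputs).
    """
    if len(mu) != len(log_var):
        raise ValueError("mu and log_var must have the same length")
    # Per dimension: 0.5 * (mu^2 + sigma^2 - log(sigma^2) - 1)
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var)
    )
```

The divergence is zero exactly when the predicted distribution is already the standard Gaussian, so at test time a texture code can be drawn directly from that prior.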
3.4 Our Full Model
Our full objective is
(8) 
where a weight controls the relative importance of the shape and texture loss functions. We compare our visual object networks against 2D deep generative models in Section 4.1.
3.5 Implementation Details
Shape networks.
For shape generation, we adopt the 3D-GAN architecture from Wu et al. (2016). In particular, the discriminator contains volumetric convolutional layers and the generator contains strided-convolutional layers. We remove the batch normalization layers (Ioffe and Szegedy, 2015), as suggested by the WGAN-GP paper (Gulrajani et al., 2017).
Texture networks.
For texture generation, we use the ResNet encoder-decoder (Zhu et al., 2017a; Huang et al., 2018) and concatenate the texture code to intermediate layers in the encoder. For the discriminator, we use two-scale PatchGAN classifiers (Isola et al., 2017; Zhu et al., 2017a) to classify overlapping patches as real or fake. We use a least-squares objective as in LSGAN (Mao et al., 2017) for stable training. We use ResNet encoders (He et al., 2015) for our two encoders.
Differentiable projection module.
We assume the camera is at a fixed distance of m to the object’s center and use a focal length of mm (mm film equivalent). The resolution of the rendered sketches is , and we sample points evenly along each camera ray. We also assume no in-plane rotation, that is, no tilting in the image plane. We implement a custom CUDA kernel for sampling along the projection rays and calculating the stop probabilities.
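The per-ray expectation behind the projection module can be sketched in plain Python. This is an illustrative reading of the expected visibility and depth computation, not the CUDA kernel itself; the function name, argument layout, and the choice to return the escape probability are assumptions. The occupancy values stand for the trilinearly interpolated voxel probabilities at each sample.

```python
def render_ray(occupancy, depths):
    """Expected visibility and depth for a single camera ray.

    occupancy[j]: probability that sample j is occupied, in [0, 1]
    depths[j]:    depth of sample j, ordered from near to far
    Returns (visibility, expected_depth, escape_probability).
    """
    if len(occupancy) != len(depths):
        raise ValueError("occupancy and depths must have the same length")
    free = 1.0        # probability the ray passed every earlier sample
    visibility = 0.0  # expected silhouette value for this pixel
    depth_sum = 0.0   # stop-probability-weighted depth
    for p, d in zip(occupancy, depths):
        stop = free * p  # probability the ray stops exactly at this sample
        visibility += stop
        depth_sum += stop * d
        free *= 1.0 - p
    return visibility, depth_sum, free
```

Every operation is a product or sum of the inputs, so gradients flow back to the occupancy values, which is what makes the projection differentiable. For example, `render_ray([0.5, 0.5], [1.0, 2.0])` gives a visibility of 0.75, because the ray stops at the first sample with probability 0.5 and at the second with probability 0.25.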
Training details.
We train our models on shapes (voxels or distance function) and images. During training, we first train the shape generator on 3D shape collections and then train the texture generator given ground truth 3D shape data and image data. Finally, we fine-tune both modules together. We sample the shape code and texture code from the standard Gaussian distribution, with the code length and . The entire training usually takes two to three days. For hyperparameters, we set , , , , , and . We use the Adam solver (Kingma and Ba, 2015) with a learning rate of for shape generation and for texture generation.
We observe that the texture generator sometimes introduces the undesirable effect of changing the shape of the silhouette when rendering 2.5D sketches (i.e., depth and mask). To address this issue, we explicitly mask the generated 2D images with the silhouette from the 2.5D sketches: the generator synthesizes an image given a depth map, and pixels outside the mask are set to the white background color. Similarly, we reformulate the 2.5D sketch encoder so that it only predicts the depth map, and the input object mask is used. In addition, we add a small mask consistency loss to encourage the predicted depth map to be consistent with the object mask. As our training images have clean backgrounds, we can estimate the object mask with a simple threshold.
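The masking step can be illustrated with a small sketch. It assumes images are nested lists of RGB tuples with values in [0, 1] and that white is (1, 1, 1); the function names and the threshold value of 0.95 are hypothetical choices for illustration, not values from the text.

```python
def threshold_mask(image, threshold=0.95):
    """Estimate an object mask for an image on a clean white background.

    A pixel counts as background (0.0) when all channels exceed `threshold`,
    and as object (1.0) otherwise. The threshold value is an assumption.
    """
    return [[0.0 if min(px) > threshold else 1.0 for px in row] for row in image]


def composite_on_white(generated, mask):
    """Return mask * generated + (1 - mask) * white for every pixel."""
    return [
        [tuple(m * c + (1.0 - m) * 1.0 for c in px) for px, m in zip(row, mask_row)]
        for row, mask_row in zip(generated, mask)
    ]
```

Because the silhouette comes from the projected sketch rather than from the texture generator, the generator in this sketch cannot change the object outline; it only fills in appearance inside the mask.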
4 Experiments
We first compare our visual object networks (VON) against recent 2D GAN variants on two datasets. We evaluate the results using both a quantitative metric and a qualitative human perception study. We then perform an ablation study on the objective functions of our shape generation network. Finally, we demonstrate several applications enabled by our disentangled 3D representation. The full results and datasets can be found at our website. Please find our implementation at GitHub.
4.1 Evaluations
Datasets.
We use ShapeNet (Chang et al., 2015) for learning to generate 3D shapes. ShapeNet is a large shape repository of object categories. Here we use the chair and car categories, which have and CAD models, respectively. For 2D datasets, we use the recently released Pix3D dataset to obtain RGB images of chairs along with their silhouettes (Sun et al., 2018a), with an addition of clean background images crawled from Google image search. We also crawled images of cars.
Baselines.
We compare our method to three popular GAN variants commonly used in the literature: DCGAN with the standard cross-entropy loss (Goodfellow et al., 2014; Radford et al., 2016), LSGAN (Mao et al., 2017), and WGAN-GP (Gulrajani et al., 2017). We use the same DCGAN-like generator and discriminator architectures for all three GAN models. For WGAN-GP, we replace BatchNorm with InstanceNorm (Ulyanov et al., 2016) in the discriminator, and we train the discriminator times per generator iteration.
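Table 1: Fréchet Inception Distance between real and generated images (lower is better).

Method          Car      Chair
DCGAN           130.5    225.0
LSGAN           171.4    225.3
WGAN-GP         123.4    184.9
VON (voxels)     81.6     58.0
VON (DF)         83.3     51.8

Table 2: Human preference study, reported as the percentage of responses that preferred VON over each baseline.

Method          Car      Chair
DCGAN           72.2%    90.3%
LSGAN           78.7%    92.4%
WGAN-GP         63.0%    89.1%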
Metrics.
To evaluate the image generation models, we calculate the Fréchet Inception Distance between generated images and real images, a metric highly correlated with human perception (Heusel et al., 2017; Lucic et al., 2018). Each set of images is fed to the Inception network (Szegedy et al., 2015) trained on ImageNet (Deng et al., 2009), and the features from the layer before the last fully-connected layer are used to calculate the Fréchet Inception Distance.
Second, we sample pairs of generated images from the VON and the state-of-the-art models (DCGAN, LSGAN, and WGAN-GP), and show each pair to five subjects on Amazon MTurk. The subjects are asked to choose the more realistic result within the pair.
Results.
Our VON consistently outperforms the 2D generative models. In particular, Table 1 shows that our results have the smallest Fréchet Inception Distance; in Table 2, a majority of the responses (63.0% to 92.4%, depending on the baseline and category) preferred our results. This performance gain demonstrates that the learned 3D prior helps synthesize more realistic images. See Figure 3 for a qualitative comparison between these methods.
Analysis of shape generation.
For shape generation, we compare our method against the prior 3D-GAN work by Wu et al. (2016) on both voxel grids and the distance function representation. 3D-GAN uses the same architecture but is trained with a cross-entropy loss. We evaluate the shape generation models using the Fréchet Inception Distance (FID) between the generated and real shapes. To extract statistics for each set of generated/real shapes, we train ResNet-based 3D shape classifiers (He et al., 2015) on all classes of shapes from ShapeNet; classifiers are trained separately on voxels and distance function representations. We extract the features from the layer before the last fully-connected layer. Table 4.1 shows that our method achieves better results regarding FID. Figure 4.1a shows that the Wasserstein distance increases the quality of the results. As we use different classifiers for voxels and distance functions, the Fréchet Inception Distance is not comparable across representations.
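The Fréchet distance used for both the image and shape evaluations has a closed form once a Gaussian is fitted to each feature set (Heusel et al., 2017). The symbols below ($\mu_r, \Sigma_r$ for the mean and covariance of real features, $\mu_g, \Sigma_g$ for generated features) are notation assumed here for illustration.

```latex
\begin{equation}
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\Big(\Sigma_r + \Sigma_g - 2\,\big(\Sigma_r \Sigma_g\big)^{1/2}\Big)
\end{equation}
```

Since the statistics depend on the feature extractor, scores computed with different classifiers (for example, one trained on voxels and one on distance functions) live on different scales, which is why they cannot be compared across representations.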
4.2 Applications
We apply our visual object networks to several 3D manipulation applications that are not possible with previous 2D generative models (Goodfellow et al., 2014; Kingma and Welling, 2014).
Changing viewpoints.
Shape and texture editing.
Disentangled interpolation.
Given our disentangled 3D representation, we can choose to interpolate between two objects in different ways. For example, we can interpolate objects in shape space with the same texture, in texture space with the same shape, or in both. Figure 5c shows linear interpolations in the latent space.
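The three interpolation modes can be sketched as follows, treating latent codes as plain lists of floats. The function names, the `mode` argument, and the default number of steps are hypothetical; only the idea of linear interpolation in each latent space comes from the text.

```python
def lerp(a, b, t):
    """Linear interpolation between two codes of equal length."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def interpolate(shape_a, shape_b, tex_a, tex_b, steps=5, mode="both"):
    """Return a list of (shape_code, texture_code) pairs.

    mode="shape":   vary the shape code, keep the texture of object A
    mode="texture": vary the texture code, keep the shape of object A
    mode="both":    vary both codes together
    """
    if steps < 2:
        raise ValueError("steps must be at least 2")
    if mode not in ("shape", "texture", "both"):
        raise ValueError("mode must be 'shape', 'texture', or 'both'")
    pairs = []
    for i in range(steps):
        t = i / (steps - 1)
        shape = lerp(shape_a, shape_b, t) if mode != "texture" else list(shape_a)
        tex = lerp(tex_a, tex_b, t) if mode != "shape" else list(tex_a)
        pairs.append((shape, tex))
    return pairs
```

Each returned pair would then be passed through the shape network, projection module, and texture network to render one frame of the interpolation.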
Examplebased texture transfer.
We can infer the texture code from a real image with the texture encoder and apply the code to new shapes. Figure 6 shows texture transfer results on cars and chairs using real images and generated shapes.
5 Discussion
In this paper, we have presented visual object networks (VON), a fully differentiable 3D-aware generative model for image and shape synthesis. Our key idea is to disentangle the image generation process into three factors: shape, viewpoint, and texture. This disentangled 3D representation allows us to learn the model from both 3D and 2D visual data collections under an adversarial learning framework. Our model synthesizes more photorealistic images compared to existing 2D generative models; it also enables various 3D manipulations that are not possible with prior 2D methods.
In the future, we are interested in incorporating coarse-to-fine modeling (Karras et al., 2017) for producing shapes and images at a higher resolution. Another interesting direction to explore is to disentangle texture further into lighting and appearance (e.g., albedo), which could improve the consistency of appearance across different viewpoints and lighting conditions. Finally, as we do not have large-scale 3D geometric data for entire scenes, our current method only works for individual objects. Synthesizing natural scenes is also a meaningful next step.
Acknowledgements
This work is supported by NSF #1231216, NSF #1524817, ONR MURI N000141612007, Toyota Research Institute, Shell, and Facebook. We thank Xiuming Zhang, Richard Zhang, David Bau, and Zhuang Liu for valuable discussions.
