Modern deep generative models learn to synthesize realistic images. Figure 1a shows several cars generated by a recent model (Gulrajani et al., 2017). However, most methods have only focused on generating images in 2D, ignoring the 3D nature of the world. As a result, they are unable to answer some questions that would be effortless for a human, for example: what will a car look like from a different angle? What if we apply its texture to a truck? Can we mix different 3D designs? Therefore, a 2D-only perspective inevitably limits a model’s practical application in fields such as robotics, virtual reality, and gaming.
In this paper, we present an end-to-end generative model that jointly synthesizes 3D shapes and 2D images via a disentangled object representation. Specifically, we decompose our image generation model into three conditionally independent factors: shape, viewpoint, and texture, borrowing ideas from classic graphics rendering engines (Kajiya, 1986). Our model first learns to synthesize 3D shapes that are indistinguishable from real shapes. It then computes its 2.5D sketches (Barrow and Tenenbaum, 1978; Marr, 1982) with a differentiable projection module from a sampled viewpoint. Finally, it learns to add diverse, realistic texture to 2.5D sketches and produce 2D images that are indistinguishable from real photos. We call our model Visual Object Networks (VON).
Wiring in conditional independence reduces our need for densely annotated data: unlike classic morphable face models (Blanz and Vetter, 1999), our training does not require paired data between 2D images and 3D shapes, nor dense correspondence annotations in 3D data. This advantage allows us to leverage both 2D image datasets and 3D shape collections (Chang et al., 2015) and to synthesize objects of diverse shapes and texture.
Through extensive experiments, we show that VON produces more realistic image samples than recent 2D deep generative models. We also demonstrate many 3D applications that are enabled by our disentangled representation, including rotating an object, adjusting object shape and texture, interpolating between two objects in texture and shape space independently, and transferring the appearance of a real image to new objects and viewpoints.
2 Related Work
GANs for 2D image synthesis.
Since the invention of Generative Adversarial Nets (GANs) (Goodfellow et al., 2014), many researchers have adopted adversarial learning for various image synthesis tasks, ranging from image generation (Radford et al., 2016; Arjovsky et al., 2017; Karras et al., 2018), image-to-image translation (Isola et al., 2017; Zhu et al., 2017a), text-to-image synthesis (Zhang et al., 2017; Reed et al., 2016), and interactive image editing (Zhu et al., 2016; Wang et al., 2018), to classic vision and graphics tasks such as inpainting (Pathak et al., 2016) and super-resolution (Ledig et al., 2017). Despite the tremendous progress made on 2D image synthesis, most of the above methods operate in 2D space, ignoring the 3D nature of our physical world. As a result, the lack of 3D structure inevitably limits some practical applications of these generative models. In contrast, we present an image synthesis method powered by a disentangled 3D representation. It allows a user to change the viewpoint easily, as well as to edit the object’s shape or texture independently. Dosovitskiy et al. (2015) used supervised CNNs for generating synthetic images given object style, viewpoint, and color. We differ in that our aim is to produce objects with 3D geometry and natural texture without using labelled data.
3D shape generation.
There has been increasing interest in synthesizing 3D shapes with deep generative models, especially GANs. Popular representations include voxels (Wu et al., 2016), point clouds (Gadelha et al., 2017b; Achlioptas et al., 2018), and octrees (Tatarchenko et al., 2017). Other methods learn 3D shape priors from 2D images (Rezende et al., 2016; Gadelha et al., 2017a). Recent work has also explored 3D shape completion from partial scans with deep generative models (Dai et al., 2017; Wang et al., 2017; Wu et al., 2018), including generalization to unseen object categories (Zhang et al., 2018). Unlike prior methods that only synthesize untextured 3D shapes, our method learns to produce both realistic shapes and images. Recent and concurrent work has learned to infer both texture and 3D shapes from 2D images, represented as parametrized meshes (Kanazawa et al., 2018), point clouds (Tatarchenko et al., 2016), or colored voxels (Tulsiani et al., 2017; Sun et al., 2018b). While they focus on 3D reconstruction, we aim to learn an unconditional generative model of shapes and images with disentangled representations of object texture, shape, and pose.
Inverse graphics.
Researchers have made much progress in recent years on learning to invert graphics engines, many with deep neural networks (Kulkarni et al., 2015b; Yang et al., 2015; Kulkarni et al., 2015a; Tung et al., 2017; Shu et al., 2017). In particular, Kulkarni et al. (2015b) proposed a convolutional inverse graphics network. Given an image of a face, the network learns to infer properties such as pose and lighting. Tung et al. (2017) extended inverse graphics networks with adversarial learning. Wu et al. (2017, 2018) inferred 3D shapes from a 2D image via 2.5D sketches and learned shape priors. Here we focus on a complementary problem: learning generative graphics networks via the idea of “graphics as inverse vision”. In particular, we learn our generative model with recognition models that recover 2.5D sketches from generated images.
Our goal is to learn an (implicit) generative model that can sample an image from three factors: a shape code, a viewpoint code, and a texture code. The texture code describes the appearance of the object, which accounts for the object’s albedo, reflectance, and environment illumination. These three factors are disentangled and conditionally independent of each other. Our model is category-specific, as the visual appearance of an object depends on the class. We further assume that all the codes lie in their own low-dimensional spaces. During training, we are given a 3D shape collection, in which each shape is a binary voxel grid, and a 2D image collection. Our model training requires no alignment between 3D and 2D data. We assume that every training image has a clean background and only contains the object of interest. This assumption makes our model focus on generating realistic images of the objects instead of complex backgrounds.
Figure 2 illustrates our model. First, we learn a 3D shape generation network that produces realistic voxels given a shape code (Section 3.1). We then develop a differentiable projection module that projects a 3D voxel grid into 2.5D sketches, given a particular viewpoint (Section 3.2). Next, we learn to produce a final image given the 2.5D sketches and a randomly sampled texture code, using our texture synthesis network (Section 3.3). Section 3.4 summarizes our full model and Section 3.5 includes implementation details. Our entire model is differentiable and can be trained end-to-end.
During testing, we sample an image from latent codes via our shape network, texture network, and projection module.
3.1 Learning 3D Shape Priors
Our first step is to learn a category-specific 3D shape prior from large shape collections (Chang et al., 2015). This prior depends on the object class but is conditionally independent of other factors such as viewpoint and texture. To model the 3D shape prior and generate realistic shapes, we adopt the 3D Generative Adversarial Networks recently proposed by Wu et al. (2016).
Consider a voxelized 3D object collection. We learn a generator to map a shape code, randomly sampled from a Gaussian distribution, to a voxel grid. Simultaneously, we train a 3D discriminator to classify a shape as real or generated. Both discriminator and generator contain fully volumetric convolutional and deconvolutional layers. We find that the original 3D-GAN (Wu et al., 2016) sometimes suffers from mode collapse. To improve the quality and diversity of the results, we use the Wasserstein distance of WGAN-GP (Arjovsky et al., 2017; Gulrajani et al., 2017). Formally, the generator and the discriminator play a minimax two-player game over the Wasserstein objective (Eqn. 1).
To enforce the Lipschitz constraint in Wasserstein GANs (Arjovsky et al., 2017), we add a gradient-penalty loss to Eqn. 1. The penalty is evaluated at a randomly sampled point along the straight line between a real shape and a generated shape, and its weight controls the capacity of the discriminator. Since binary data is often challenging to model using GANs, we also experiment with a distance function (DF) representation (Curless and Levoy, 1996), which is continuous on the 3D voxel space. See Section 4.1 for quantitative evaluations.
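For reference, the WGAN-GP objective of Gulrajani et al. (2017) is commonly written in the form below. The symbols are notation assumed here for illustration and may differ from those of Eqn. 1: $D$ is the shape discriminator, $G$ the shape generator, $v$ a real shape, $z$ a shape code, $\hat{v}$ the interpolated point, and $\lambda$ the gradient-penalty weight.

```latex
\min_{G}\,\max_{D}\;
\mathbb{E}_{v}\big[D(v)\big]
\;-\;\mathbb{E}_{z}\big[D(G(z))\big]
\;-\;\lambda\,\mathbb{E}_{\hat{v}}\Big[\big(\lVert\nabla_{\hat{v}} D(\hat{v})\rVert_{2}-1\big)^{2}\Big],
\qquad
\hat{v}=\epsilon\,v+(1-\epsilon)\,G(z),\quad \epsilon\sim U[0,1].
```

The penalty pushes the gradient norm of the discriminator toward 1 at the interpolated points, which is how the Lipschitz constraint is softly enforced.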
3.2 Generating 2.5D Sketches
Given a synthesized voxelized shape , how can we connect it to a 2D image? Inspired by recent work on 3D reconstruction (Wu et al., 2017), we use 2.5D sketches (Barrow and Tenenbaum, 1978; Marr, 1982) to bridge the gap between 3D and 2D. This intermediate representation provides three main advantages. First, generating 2.5D sketches from a 3D voxel grid is straightforward, as the projection is differentiable with respect to both the input shape and the viewpoint. Second, 2D image synthesis from a 2.5D sketch can be cast as an image-to-image translation problem (Isola et al., 2017), where existing methods have achieved successes even without paired data (Zhu et al., 2017a). Third, compared with alternative approaches such as colored voxels, our method enables generating images at a higher resolution.
Here we describe our differentiable module for projecting voxels into 2.5D sketches. The inputs to this module are the camera parameters and 3D voxels. The value of each voxel stores the probability of it being present. To render the 2.5D sketches from the voxels under a perspective camera, we first generate a collection of rays, each originating from the camera’s center and going through a pixel’s center in the image plane. For each ray, we need to calculate whether it would hit the voxels, and if so, the corresponding depth value. To this end, we first sample a collection of points at evenly spaced depths along each ray. Next, for each point, we calculate the probability of hitting the input voxels using a differentiable trilinear interpolation (Jaderberg et al., 2015) of the input voxels. Similar to Tulsiani et al. (2017), we then calculate the expectation of visibility and depth along each ray. Specifically, given a ray with a sequence of samples along its path, we calculate the visibility (silhouette) as the expectation of the ray hitting the voxels. Similarly, the expected depth is calculated from the depth of each sample. This process is fully differentiable since the gradients can be back-propagated through both the expectation calculation and the trilinear interpolation.
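The per-ray expectation can be sketched as follows. This is a minimal illustration, not the paper's CUDA implementation. It assumes that the occupancy probabilities of the samples are treated as independent, so the probability that a ray stops at a sample is its occupancy times the probability of having passed all earlier samples. The function name and the list-based inputs are hypothetical, and the treatment of rays that miss the object (here the expected depth is simply 0) is an assumption.

```python
def render_ray(occupancy, depths):
    """Expected visibility and depth for a single ray.

    occupancy[j]: probability that sample j lies inside the shape, in [0, 1]
                  (e.g. obtained by trilinear interpolation of the voxel grid).
    depths[j]:    depth of sample j, ordered from near to far.
    """
    visibility = 0.0
    expected_depth = 0.0
    passed = 1.0  # probability that the ray has passed every earlier sample
    for p, d in zip(occupancy, depths):
        stop = passed * p          # probability that the ray stops at this sample
        visibility += stop         # accumulates the silhouette value
        expected_depth += stop * d
        passed *= 1.0 - p
    return visibility, expected_depth
```

Every operation is a product or a sum of the occupancy values, so the same computation written in an automatic-differentiation framework yields gradients with respect to the voxels.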
Our two-dimensional viewpoint code encodes camera elevation and azimuth. We sample the viewpoint code from an empirical distribution of the camera poses in the training images. To estimate this distribution, we first render the silhouettes of several candidate 3D models under uniformly sampled camera poses. For each input image, we compare its silhouette to the rendered 2D views and choose the pose with the largest Intersection-over-Union value. More details can be found in the supplement.
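The silhouette-matching step can be sketched as below. This is an illustrative sketch only: masks are assumed to be equally sized nested lists of 0/1 values, each rendered view is assumed to come paired with its (elevation, azimuth) pose, and the function names are hypothetical.

```python
def iou(mask_a, mask_b):
    """Intersection-over-Union of two binary masks of equal size."""
    inter = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    return inter / union if union else 0.0


def estimate_pose(image_mask, rendered_views):
    """Return the pose whose rendered silhouette best matches the image.

    rendered_views: list of (pose, mask) pairs, where pose is an
    (elevation, azimuth) tuple and mask is the rendered silhouette.
    """
    best_pose, _ = max(rendered_views, key=lambda pm: iou(image_mask, pm[1]))
    return best_pose
```

Collecting the estimated poses over all training images gives the empirical pose distribution from which viewpoints are sampled.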
3.3 Learning 2D Texture Priors
Next, we learn to synthesize realistic 2D images given projected 2.5D sketches that encode both the viewpoint and the object shape. In particular, we learn a texture network that takes a randomly sampled texture code and the projected 2.5D sketches as input, and produces a 2D image. This texture network needs to model both object texture and environment illumination, as well as the differentiable rendering equation (Kajiya, 1986). Fortunately, this mapping problem can be cast as an unpaired image-to-image translation problem (Zhu et al., 2017a; Yi et al., 2017; Liu et al., 2017). We adopt recently proposed cycle-consistent adversarial networks (CycleGAN) (Zhu et al., 2017a) as our baseline. Later, we relax the one-to-one mapping restriction in CycleGAN to handle one-to-many mappings from 2.5D sketches to 2D images.
Here we introduce two encoders, a texture encoder and a 2.5D sketch encoder, to estimate a texture code and 2.5D sketches from a real image. We train the texture network and the two encoders jointly with adversarial losses (Goodfellow et al., 2014) and cycle-consistency losses (Zhu et al., 2017a; Yi et al., 2017). We use an adversarial loss on the final generated image, in which an image discriminator learns to classify real and generated images. We apply the same adversarial loss to 2.5D sketches, in which a second discriminator aims to distinguish between 2.5D sketches and estimated 2.5D sketches from a real 2D image. We further use cycle-consistency losses (Zhu et al., 2017a) to enforce the bijective relationship between the two domains, with a separate weight controlling the importance of each cycle loss. The texture encoder and 2.5D sketch encoder serve as recognition models that recover the texture and 2.5D representation from a 2D image.
Prior studies (Isola et al., 2016; Mathieu et al., 2016) have found that latent codes are often ignored in conditional image generation due to the assumption of a one-to-one mapping; vanilla CycleGAN also suffers from this problem in our experiments. To address this, we introduce a latent space cycle-consistency loss, with its own weight, to encourage the texture network to use the texture code. Finally, to allow sampling at test time, we add a weighted Kullback–Leibler (KL) loss on the texture code space to force the encoded texture codes to be close to a Gaussian distribution. The final texture loss combines the adversarial, cycle-consistency, latent cycle-consistency, and KL losses.
Note that the latent space reconstruction loss has been explored in unconditional GANs (Chen et al., 2016) and image-to-image translation (Zhu et al., 2017b; Almahairi et al., 2018). Here we use this loss to learn one-to-many mappings from unpaired data.
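The KL term used in this section has a closed form when the texture encoder outputs a diagonal Gaussian, as in variational autoencoders (Kingma and Welling, 2014). The sketch below assumes that the encoder predicts a mean and a log-variance per code dimension; this parameterization and the function name are assumptions made for illustration, not details confirmed by the text.

```python
import math


def kl_to_standard_normal(mean, log_var):
    """KL( N(mean, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0
        for m, lv in zip(mean, log_var)
    )
```

The value is zero exactly when the predicted distribution is the standard Gaussian, so minimizing it keeps encoded texture codes close to the distribution that is sampled at test time.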
3.4 Our Full Model
3.5 Implementation Details
For shape generation, we adopt the 3D-GAN architecture from Wu et al. (2016). In particular, the discriminator contains volumetric convolutional layers and the generator contains strided-convolutional layers. We remove the batch normalization layers (Ioffe and Szegedy, 2015), as suggested by the WGAN-GP paper (Gulrajani et al., 2017).
For texture generation, we use the ResNet encoder-decoder (Zhu et al., 2017a; Huang et al., 2018) and concatenate the texture code to intermediate layers in the encoder. For the discriminator, we use two-scale PatchGAN classifiers (Isola et al., 2017; Zhu et al., 2017a) to classify overlapping patches as real or fake. We use a least-squares objective as in LS-GAN (Mao et al., 2017) for stable training. We use ResNet encoders (He et al., 2015) for our texture encoder and 2.5D sketch encoder.
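The least-squares objective replaces the cross-entropy loss with squared errors against target labels. A minimal sketch follows, assuming the common 0/1 target labels of Mao et al. (2017) and a 0.5 scaling factor. The function names are illustrative, and discriminator outputs are passed in as plain lists of scores (for a PatchGAN, one score per patch).

```python
def mean(values):
    return sum(values) / len(values)


def lsgan_discriminator_loss(real_scores, fake_scores):
    """Push scores of real patches toward 1 and of generated patches toward 0."""
    real_term = mean([(s - 1.0) ** 2 for s in real_scores])
    fake_term = mean([s ** 2 for s in fake_scores])
    return 0.5 * real_term + 0.5 * fake_term


def lsgan_generator_loss(fake_scores):
    """Push scores of generated patches toward the 'real' label 1."""
    return 0.5 * mean([(s - 1.0) ** 2 for s in fake_scores])
```

With a two-scale discriminator, one plausible arrangement is to evaluate these losses at each scale and add them up.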
Differentiable projection module.
We assume the camera is at a fixed distance of m to the object’s center and use a focal length of mm (mm film equivalent). The resolution of the rendered sketches is , and we sample points evenly along each camera ray. We also assume no in-plane rotation, that is, no tilting in the image plane. We implement a custom CUDA kernel for sampling along the projection rays and calculating the stop probabilities.
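The camera setup can be illustrated with a small sketch that places the camera from an elevation and azimuth at a fixed distance and samples points evenly along a ray. The axis convention (z up, azimuth measured in the x-y plane), the near and far bounds, and all function names are assumptions for illustration; the actual values and conventions of the CUDA kernel are not reproduced here.

```python
import math


def camera_position(elevation_deg, azimuth_deg, distance):
    """Camera center on a sphere of the given radius around the object."""
    el = math.radians(elevation_deg)
    az = math.radians(azimuth_deg)
    return (
        distance * math.cos(el) * math.cos(az),
        distance * math.cos(el) * math.sin(az),
        distance * math.sin(el),
    )


def sample_along_ray(origin, direction, near, far, num_samples):
    """Points at evenly spaced depths in [near, far] along a ray."""
    if num_samples < 2:
        raise ValueError("num_samples must be at least 2")
    norm = math.sqrt(sum(c * c for c in direction))
    unit = tuple(c / norm for c in direction)
    points = []
    for i in range(num_samples):
        depth = near + (far - near) * i / (num_samples - 1)
        points.append(tuple(o + depth * u for o, u in zip(origin, unit)))
    return points
```

Each sampled point would then be looked up in the voxel grid by trilinear interpolation to obtain its occupancy probability.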
We train our models on shapes (voxels or distance function) and images. During training, we first train the shape generator on 3D shape collections and then train the texture generator given ground truth 3D shape data and image data. Finally, we fine-tune both modules together. We sample the shape code and texture code from the standard Gaussian distribution , with the code length and
. The entire training usually takes two to three days. For hyperparameters, we set, , , , , and . We use the Adam solver (Kingma and Ba, 2015) with a learning rate of for shape generation and for texture generation.
We observe that the texture generator sometimes introduces the undesirable effect of changing the shape of the silhouette when rendering 2.5D sketches (i.e., depth and mask). To address this issue, we explicitly mask the generated 2D images with the silhouette from the 2.5D sketches: inside the mask we keep the image that the generator synthesizes given a depth map, and outside the mask we use the white background color. Similarly, we reformulate the 2.5D sketch encoder so that it only predicts the depth map, and the input object mask is used. In addition, we add a small mask consistency loss to encourage the predicted depth map to be consistent with the object mask. As our training images have clean backgrounds, we can estimate the object mask with a simple threshold.
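The masking step amounts to compositing the generated image over a white background using the silhouette. A minimal sketch follows, assuming images are nested lists of (r, g, b) tuples with values in [0, 1] and masks are nested lists of values in [0, 1]. The function names and the threshold value 0.95 are hypothetical.

```python
def estimate_mask(image, threshold=0.95):
    """Foreground mask for an object on a clean white background.

    A pixel counts as background when all of its channels are at least
    `threshold` (a hypothetical value chosen for illustration).
    """
    return [[0.0 if min(px) >= threshold else 1.0 for px in row] for row in image]


def composite_on_white(mask, generated, background=1.0):
    """Keep generated pixels inside the mask and white pixels outside it."""
    return [
        [
            tuple(m * c + (1.0 - m) * background for c in px)
            for m, px in zip(mask_row, image_row)
        ]
        for mask_row, image_row in zip(mask, generated)
    ]
```

Because the mask multiplies the generator output, pixels outside the silhouette are fixed to the background color, so the texture network cannot alter the object outline.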
We first compare our visual object networks (VON) against recent 2D GAN variants on two datasets. We evaluate the results using both a quantitative metric and a qualitative human perception study. We then perform an ablation study on the objective functions of our shape generation network. Finally, we demonstrate several applications enabled by our disentangled 3D representation. The full results and datasets can be found at our website. Please find our implementation at GitHub.
We use ShapeNet (Chang et al., 2015) for learning to generate 3D shapes. ShapeNet is a large shape repository of object categories. Here we use the chair and car categories, which have and CAD models respectively. For 2D datasets, we use the recently released Pix3D dataset to obtain RGB images of chairs along with their silhouettes (Sun et al., 2018a), with an addition of clean background images crawled from Google image search. We also crawled images of cars.
We compare our method to three popular GAN variants commonly used in the literature: DCGAN with the standard cross-entropy loss (Goodfellow et al., 2014; Radford et al., 2016), LSGAN (Mao et al., 2017), and WGAN-GP (Gulrajani et al., 2017). We use the same DCGAN-like generator and discriminator architectures for all three GAN models. For WGAN-GP, we replace the BatchNorm by InstanceNorm (Ulyanov et al., 2016) in the discriminator, and we train the discriminator times per generator iteration.
To evaluate the image generation models, we calculate the Fréchet Inception Distance between generated images and real images, a metric highly correlated with human perception (Heusel et al., 2017; Lucic et al., 2018). Each set of images is fed to the Inception network (Szegedy et al., 2015) trained on ImageNet (Deng et al., 2009), and the features from the layer before the last fully-connected layer are used to calculate the Fréchet Inception Distance.
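For reference, the Fréchet Inception Distance of Heusel et al. (2017) fits a Gaussian to the features of each image set and measures the Fréchet distance between the two Gaussians. In the notation assumed here, $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the feature mean and covariance of the real and generated sets:

```latex
\mathrm{FID}
= \lVert \mu_r - \mu_g \rVert_2^{2}
+ \operatorname{Tr}\!\Big(\Sigma_r + \Sigma_g - 2\,\big(\Sigma_r \Sigma_g\big)^{1/2}\Big)
```

Lower values indicate that the generated feature statistics are closer to the real ones.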
Second, we sample pairs of generated images from the VON and the state-of-the-art models (DCGAN, LSGAN, and WGAN-GP), and show each pair to five subjects on Amazon MTurk. The subjects are asked to choose the more realistic result within the pair.
Our VON consistently outperforms the 2D generative models. In particular, Table 2 shows that our results have the smallest Fréchet Inception Distance; in Table 2, of the responses preferred our results. This performance gain demonstrates that the learned 3D prior helps synthesize more realistic images. See Figure 3 for a qualitative comparison between these methods.
Analysis of shape generation.
For shape generation, we compare our method against the prior 3D-GAN work by Wu et al. (2016) on both voxel grids and distance function representations. 3D-GAN uses the same architecture but is trained with a cross-entropy loss. We evaluate the shape generation models using the Fréchet Inception Distance (FID) between the generated and real shapes. To extract statistics for each set of generated/real shapes, we train ResNet-based 3D shape classifiers (He et al., 2015) on all classes of shapes from ShapeNet; classifiers are trained separately on voxels and distance function representations. We extract the features from the layer before the last fully-connected layer. Table 4.1 shows that our method achieves better results regarding FID. Figure 4.1a shows that the Wasserstein distance increases the quality of the results. As we use different classifiers for voxels and distance functions, the Fréchet Inception Distance is not comparable across representations.
Shape and texture editing.
Given our disentangled 3D representation, we can choose to interpolate between two objects in different ways. For example, we can interpolate objects in shape space with the same texture, or in the texture space with the same shape, or both. Figure 5c shows linear interpolations in the latent space.
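The interpolation modes can be sketched as follows. This is an illustration under stated assumptions: an object is represented as a (shape code, texture code) pair of plain lists, interpolation is linear with a coefficient in [0, 1], and the function names are hypothetical.

```python
def lerp(code_a, code_b, alpha):
    """Linear interpolation between two latent codes of equal length."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(code_a, code_b)]


def interpolate_objects(obj_a, obj_b, steps, vary_shape=True, vary_texture=True):
    """Return `steps` (shape_code, texture_code) pairs going from obj_a to obj_b.

    A factor that is not varied keeps obj_a's code for every step.
    """
    if steps < 2:
        raise ValueError("steps must be at least 2")
    path = []
    for i in range(steps):
        alpha = i / (steps - 1)
        shape = lerp(obj_a[0], obj_b[0], alpha) if vary_shape else list(obj_a[0])
        texture = lerp(obj_a[1], obj_b[1], alpha) if vary_texture else list(obj_a[1])
        path.append((shape, texture))
    return path
```

Each pair on the path would then be decoded by the shape network, projection module, and texture network to render one frame of the interpolation.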
Example-based texture transfer.
We can infer the texture code from a real image with the texture encoder, and apply the code to new shapes. Figure 6 shows texture transfer results on cars and chairs using real images and generated shapes.
In this paper, we have presented visual object networks (VON), a fully differentiable 3D-aware generative model for image and shape synthesis. Our key idea is to disentangle the image generation process into three factors: shape, viewpoint, and texture. This disentangled 3D representation allows us to learn the model from both 3D and 2D visual data collections under an adversarial learning framework. Our model synthesizes more photorealistic images compared to existing 2D generative models; it also enables various 3D manipulations that are not possible with prior 2D methods.
In the future, we are interested in incorporating coarse-to-fine modeling (Karras et al., 2017) for producing shapes and images at a higher resolution. Another interesting direction to explore is to disentangle texture further into lighting and appearance (e.g., albedo), which could improve the consistency of appearance across different viewpoints and lighting conditions. Finally, as we do not have large-scale 3D geometric data for entire scenes, our current method only works for individual objects. Synthesizing natural scenes is also a meaningful next step.
This work is supported by NSF #1231216, NSF #1524817, ONR MURI N00014-16-1-2007, Toyota Research Institute, Shell, and Facebook. We thank Xiuming Zhang, Richard Zhang, David Bau, and Zhuang Liu for valuable discussions.
- Achlioptas et al.  Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and Leonidas Guibas. Learning representations and generative models for 3d point clouds. In ICLR Workshop, 2018.
- Almahairi et al.  Amjad Almahairi, Sai Rajeswar, Alessandro Sordoni, Philip Bachman, and Aaron Courville. Augmented cyclegan: Learning many-to-many mappings from unpaired data. In ICML, 2018.
- Arjovsky et al.  Martín Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In ICML, 2017.
- Barrow and Tenenbaum  Harry G Barrow and Jay M Tenenbaum. Recovering intrinsic scene characteristics from images. Computer Vision Systems, 1978.
- Bever and Poeppel  Thomas G Bever and David Poeppel. Analysis by synthesis: a (re-) emerging program of research for language and vision. Biolinguistics, 4(2-3):174–200, 2010.
- Blanz and Vetter  Volker Blanz and Thomas Vetter. A morphable model for the synthesis of 3d faces. In SIGGRAPH, 1999.
- Chang et al.  Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. Shapenet: An information-rich 3d model repository. arXiv:1512.03012, 2015.
- Chen et al.  Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS, 2016.
- Curless and Levoy  Brian Curless and Marc Levoy. A volumetric method for building complex models from range images. In SIGGRAPH, 1996.
- Dai et al.  Angela Dai, Charles Ruizhongtai Qi, and Matthias Nießner. Shape completion using 3d-encoder-predictor cnns and shape synthesis. In CVPR, 2017.
- Deng et al.  Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
- Dosovitskiy et al.  Alexey Dosovitskiy, Jost Tobias Springenberg, and Thomas Brox. Learning to generate chairs with convolutional neural networks. In CVPR, 2015.
- Gadelha et al. [2017a] Matheus Gadelha, Subhransu Maji, and Rui Wang. 3d shape induction from 2d views of multiple objects. In 3D Vision (3DV), pages 402–411. IEEE, 2017a.
- Gadelha et al. [2017b] Matheus Gadelha, Subhransu Maji, and Rui Wang. Shape generation using spatially partitioned point clouds. In BMVC, 2017b.
- Goodfellow et al.  Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
- Gulrajani et al.  Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of wasserstein gans. In NIPS, 2017.
- He et al.  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2015.
- Heusel et al.  Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Günter Klambauer, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a nash equilibrium. In NIPS, 2017. arXiv:1706.08500.
- Huang et al.  Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. In ECCV, 2018. arXiv:1804.04732.
- Ioffe and Szegedy  Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
- Isola et al.  Phillip Isola, Daniel Zoran, Dilip Krishnan, and Edward H Adelson. Learning visual groups from co-occurrences in space and time. In ICLR Workshop, 2016.
- Isola et al.  Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
- Jaderberg et al.  Max Jaderberg, Karen Simonyan, and Andrew Zisserman. Spatial transformer networks. In NIPS, 2015.
- Kajiya  James T Kajiya. The rendering equation. In SIGGRAPH, 1986.
- Kanazawa et al.  Angjoo Kanazawa, Shubham Tulsiani, Alexei A Efros, and Jitendra Malik. Learning category-specific mesh reconstruction from image collections. In ECCV, 2018. arXiv:1803.07549.
- Karras et al.  Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. In ICLR, 2017.
- Karras et al.  Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. In ICLR, 2018.
- Kingma and Ba  Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
- Kingma and Welling  Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014.
- Kulkarni et al. [2015a] Tejas D Kulkarni, Pushmeet Kohli, Joshua B Tenenbaum, and Vikash Mansinghka. Picture: A probabilistic programming language for scene perception. In CVPR, 2015a.
- Kulkarni et al. [2015b] Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolutional inverse graphics network. In NIPS, 2015b.
- Ledig et al.  Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
- Liu et al.  Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. In NIPS, 2017.
- Lucic et al.  Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are gans created equal? a large-scale study. In NIPS, 2018.
- Mao et al.  Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In ICCV, 2017.
- Marr  David Marr. Vision: A computational investigation into the human representation and processing of visual information, volume 2. W. H. Freeman and Company, 1982.
- Mathieu et al.  Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. In ICLR, 2016.
- Pathak et al.  Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.
- Radford et al.  Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
- Reed et al.  Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text-to-image synthesis. In ICML, 2016.
- Rezende et al.  Danilo Jimenez Rezende, SM Eslami, Shakir Mohamed, Peter Battaglia, Max Jaderberg, and Nicolas Heess. Unsupervised learning of 3d structure from images. In NIPS, 2016.
- Shu et al.  Zhixin Shu, Ersin Yumer, Sunil Hadap, Kalyan Sunkavalli, Eli Shechtman, and Dimitris Samaras. Neural face editing with intrinsic image disentangling. In CVPR, 2017.
- Sun et al. [2018a] Xingyuan Sun, Jiajun Wu, Xiuming Zhang, Zhoutong Zhang, Chengkai Zhang, Tianfan Xue, Joshua B Tenenbaum, and William T Freeman. Pix3d: Dataset and methods for single-image 3d shape modeling. In CVPR, 2018a.
- Sun et al. [2018b] Yongbin Sun, Ziwei Liu, Yue Wang, and Sanjay E Sarma. Im2avatar: Colorful 3d reconstruction from a single image. arXiv:1804.06375, 2018b.
- Szegedy et al.  Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
- Tatarchenko et al.  Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Multi-view 3d models from single images with a convolutional network. In ECCV, 2016.
- Tatarchenko et al.  Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Octree generating networks: Efficient convolutional architectures for high-resolution 3d outputs. In ICCV, 2017.
- Tulsiani et al.  Shubham Tulsiani, Hao Su, Leonidas J Guibas, Alexei A Efros, and Jitendra Malik. Learning shape abstractions by assembling volumetric primitives. In CVPR, 2017.
- Tung et al.  Hsiao-Yu Fish Tung, Adam W Harley, William Seto, and Katerina Fragkiadaki. Adversarial inverse graphics networks: Learning 2d-to-3d lifting and image-to-image translation from unpaired supervision. In ICCV, 2017.
- Ulyanov et al.  Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv:1607.08022, 2016.
- Wang et al.  Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In CVPR, 2018.
- Wang et al.  Weiyue Wang, Qiangui Huang, Suya You, Chao Yang, and Ulrich Neumann. Shape inpainting using 3d generative adversarial network and recurrent convolutional networks. In ICCV, 2017.
- Wu et al.  Jiajun Wu, Chengkai Zhang, Tianfan Xue, William T Freeman, and Joshua B Tenenbaum. Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling. In NIPS, 2016.
- Wu et al.  Jiajun Wu, Yifan Wang, Tianfan Xue, Xingyuan Sun, William T Freeman, and Joshua B Tenenbaum. MarrNet: 3D Shape Reconstruction via 2.5D Sketches. In NIPS, 2017.
- Wu et al.  Jiajun Wu, Chengkai Zhang, Xiuming Zhang, Zhoutong Zhang, William T Freeman, and Joshua B Tenenbaum. Learning 3d shape priors for shape completion and reconstruction. In ECCV, 2018.
- Yang et al.  Jimei Yang, Scott E Reed, Ming-Hsuan Yang, and Honglak Lee. Weakly-supervised disentangling with recurrent transformations for 3d view synthesis. In NIPS, 2015.
- Yi et al.  Zili Yi, Hao (Richard) Zhang, Ping Tan, and Minglun Gong. Dualgan: Unsupervised dual learning for image-to-image translation. In ICCV, 2017.
- Yuille and Kersten  Alan Yuille and Daniel Kersten. Vision as bayesian inference: analysis by synthesis? TiCS, 10(7):301–308, 2006.
- Zhang et al.  Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.
- Zhang et al.  Xiuming Zhang, Zhoutong Zhang, Chengkai Zhang, Joshua B Tenenbaum, William T Freeman, and Jiajun Wu. Learning to reconstruct shapes from unseen categories. In NIPS, 2018.
- Zhu et al.  Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A Efros. Generative visual manipulation on the natural image manifold. In ECCV, 2016.
- Zhu et al. [2017a] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017a.
- Zhu et al. [2017b] Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A Efros, Oliver Wang, and Eli Shechtman. Toward multimodal image-to-image translation. In NIPS, 2017b.