1 Introduction
Generative models using convolutional neural networks (ConvNets) have achieved state-of-the-art results in image/object generation problems. Notable works of this class include variational autoencoders kingma2013auto and generative adversarial networks goodfellow2014generative , both of which have seen great success in various applications isola2016image ; radford2015unsupervised ; zhu2016generative ; wang2016generative ; yan2016attribute2image . With the recent introduction of large publicly available 3D model repositories wu20153d ; chang2015shapenet , the study of generative modeling on 3D data using similar frameworks has also attracted increasing interest.
In computer vision and graphics, 3D object models can take on various forms of representation. Among these, triangular meshes and point clouds are popular for their vectorized (and thus scalable) data representations as well as their compact encoding of shape information, optionally embedded with texture. However, this efficient representation comes with an inherent drawback: the dimensionality per 3D shape sample can vary, making the application of learning methods problematic. Furthermore, such data representations do not fit elegantly within conventional ConvNets, as Euclidean convolutional operations cannot be directly applied. Hitherto, most existing works on 3D model generation resort to volumetric representations, allowing 3D Euclidean convolution to operate on regular discretized voxel grids. 3D ConvNets (as opposed to the classical 2D form) have been applied successfully to 3D volumetric representations for both discriminative wu20153d ; maturana2015voxnet ; hegde2016fusionnet and generative girdhar2016learning ; choy20163d ; yan2016perspective ; wu2016learning problems.
Despite their recent success, 3D ConvNets suffer from an inherent drawback when modeling shapes with volumetric representations. Unlike 2D images, where every pixel contains meaningful spatial and texture information, volumetric representations are information-sparse. More specifically, a 3D object is expressed as a voxel-wise occupancy grid, where voxels “outside” the object (set to off) and “inside” the object (set to on) are unimportant and fundamentally not of particular interest. In other words, the richest information of shape representations lies on the surface of the 3D object, which makes up only a small fraction of all voxels in an occupancy grid. Consequently, 3D ConvNets are extremely wasteful in both computation and memory, spending high-complexity 3D convolutions on predicting largely uninformative data. This severely limits the granularity of the 3D volumetric shapes that can be modeled, even on the high-end GPU nodes commonly used in deep learning research.
In this paper, we propose an efficient framework to represent and generate 3D object shapes with dense point clouds. We achieve this by learning to predict the 3D structures from multiple viewpoints, which are jointly optimized through 3D geometric reasoning. In contrast to prior art that adopts 3D ConvNets to operate on volumetric data, we leverage 2D convolutional operations to predict point clouds that shape the surface of the 3D objects. Our experimental results show that we generate much denser and more accurate shapes than state-of-the-art 3D prediction methods.
Our contributions are summarized as follows:


We argue that 2D ConvNets are capable of generating dense point clouds that shape the surface of 3D objects in an undiscretized 3D space.

We introduce a pseudo-rendering pipeline to serve as a differentiable approximation of true rendering. We further utilize the pseudo-rendered depth images for 2D projection optimization for learning dense 3D shapes.

We demonstrate the efficacy of our method on single-image 3D reconstruction problems, where it significantly outperforms state-of-the-art methods.
2 Related Work
3D shape generation.
As 2D ConvNets have demonstrated huge success on a myriad of image generation problems, most works on 3D shape generation follow the analogous approach of using 3D ConvNets to generate volumetric shapes. Prior works include using 3D autoencoders girdhar2016learning and recurrent networks choy20163d to learn a latent representation for volumetric data generation. Similar applications include the use of an additional encoded pose embedding to learn shape deformations yumer2016learning and using adversarial training to learn more realistic shape generation wu2016learning ; gadelha20163d . Learning volumetric predictions from 2D projected observations has also been explored yan2016perspective ; rezende2016unsupervised ; gadelha20163d , using 3D differentiable sampling on voxel grids for spatial transformations jaderberg2015spatial . Constraining the ray consistency of 2D observations has also been suggested very recently tulsiani2017multi .
Most of the above approaches utilize 3D convolutional operations, which are computationally expensive and allow only coarse 3D voxel resolution. The lack of granularity in such volumetric generation has remained an open problem following these works. Riegler et al. riegler2016octnet proposed to tackle the problem by using adaptive hierarchical octrees on voxel grids to encourage encoding the more informative parts of 3D shapes. Concurrent works use similar concepts hane2017hierarchical ; tatarchenko2017octree to predict 3D volumetric data with higher granularity.
Recently, Fan et al. fan2016point also sought to generate unordered point clouds by using variants of multilayer perceptrons to predict multiple 3D coordinates. However, the number of required learnable parameters grows linearly with the number of 3D point predictions, which does not scale well; in addition, using 3D distance metrics as optimization criteria is intractable for a large number of points. In contrast, we leverage convolutional operations with a joint 2D projection criterion to capture the correlation between generated point clouds and optimize in a more computationally tractable fashion.
3D view synthesis.
Research has also been done in learning to synthesize novel 3D views of 2D objects in images. Most approaches using ConvNets follow the convention of an encoder-decoder framework. This has been explored by mixing 3D pose information into the latent embedding vector for the synthesis decoder tatarchenko2016multi ; zhou2016view ; park2017transformation . A portion of these works also discussed the problem of disentangling the 3D pose representation from object identity information kulkarni2015deep ; yang2015weakly ; reed2015deep , allowing further control on the identity representation space.
The drawback of these approaches is their inefficiency in representing 3D geometry. As we later show in the experiments, one should explicitly factorize the underlying 3D geometry instead of implicitly encoding it into mixed representations. Resolving the geometry has been shown to be more efficient than tolerating it in several works (e.g. Spatial Transformer Networks jaderberg2015spatial ; lin2016inverse ).
3 Approach
Our goal is to generate 3D predictions that compactly shape the surface geometry with dense point clouds. The overall pipeline is illustrated in Fig. 1. We start with an encoder that maps the input data to a latent representation space. The encoder may take on various forms of data depending on the application; in our experiments, we focus on encoding RGB images for single-image 3D reconstruction tasks. From the latent representation, we propose to generate the dense point clouds using a structure generator based on 2D convolutions with a joint 2D projection criterion, described in detail as follows.
3.1 Structure Generator
The structure generator predicts the 3D structure of the object at $N$ different viewpoints (along with their binary masks), i.e. the 3D coordinates $\hat{\mathbf{x}}_i = [\hat{x}_i, \hat{y}_i, \hat{z}_i]^\top$ at each pixel location. Pixel values in natural images can be synthesized through convolutional generative models mainly because they exhibit strong local spatial dependencies; similar phenomena can be observed for point clouds when treating them as multi-channel images on a 2D grid. Based on this insight, the structure generator relies mainly on 2D convolutional operations to predict the images representing the 3D surface geometry. This approach circumvents the need for time-consuming and memory-expensive 3D convolutional operations for volumetric predictions. Our experimental results support the validity of this design.
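Assuming the 3D rigid transformation matrices of the $N$ viewpoints $(\mathbf{R}_1, \mathbf{t}_1), \dots, (\mathbf{R}_N, \mathbf{t}_N)$ are given a priori, each 3D point $\hat{\mathbf{x}}_i$ at viewpoint $n$ can be transformed to the canonical 3D coordinates as $\hat{\mathbf{p}}_i$ via
$$\hat{\mathbf{p}}_i = \mathbf{R}_n^{-1}\left(\mathbf{K}^{-1}\hat{\mathbf{x}}_i - \mathbf{t}_n\right) \quad \forall i, \qquad (1)$$
where $\mathbf{K}$ is the predefined camera intrinsic matrix. This defines the relationship between the predicted 3D points $\{\hat{\mathbf{x}}_i\}$ and the fused collection of point clouds $\{\hat{\mathbf{p}}_i\}$ in the canonical 3D coordinates, which is the outcome of our network.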
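As an illustration of this fusion step, the following NumPy sketch maps the per-viewpoint predictions to canonical coordinates and concatenates them. It assumes Eq. (1) takes the form p = R^{-1}(K^{-1}x - t); the function names `to_canonical` and `fuse_viewpoints` and the array layouts are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def to_canonical(x_view, R, t, K):
    """Map the predictions of one viewpoint to canonical 3D coordinates.

    x_view: (P, 3) predicted points in the image coordinates of the viewpoint.
    R: (3, 3) rotation, t: (3,) translation, K: (3, 3) intrinsic matrix.
    Assumed form of Eq. (1): p = R^{-1} (K^{-1} x - t).
    """
    cam = np.linalg.solve(K, x_view.T)      # K^{-1} x, shape (3, P)
    canonical = R.T @ (cam - t[:, None])    # R^{-1} equals R^T for a rotation
    return canonical.T

def fuse_viewpoints(x_views, Rs, ts, K):
    """Concatenate the canonical point clouds of all fixed viewpoints."""
    clouds = [to_canonical(x, R, t, K) for x, R, t in zip(x_views, Rs, ts)]
    return np.concatenate(clouds, axis=0)
```

With N viewpoints that each predict P pixels, `fuse_viewpoints` returns an (N·P, 3) array; in practice the binary masks would additionally select which pixels contribute valid points.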
3.2 Joint 2D Projection Optimization
To learn point cloud generation using the provided 3D CAD models as supervision, the standard approach would be to optimize over a 3D-based metric that defines the distance between the point cloud and the ground-truth CAD model (e.g. Chamfer distance fan2016point ). Such a metric usually involves computing surface projections for every generated point, which becomes intractably expensive for very dense predictions.
We overcome this issue by optimizing over the joint 2D projection error of novel viewpoints as an alternative. Rather than using only projected binary masks as supervision yan2016perspective ; gadelha20163d ; rezende2016unsupervised , we conjecture that a well-generated 3D shape should also be able to render reasonable depth images from any viewpoint. To realize this concept, we introduce the pseudo-renderer, a differentiable module that approximates true rendering, to synthesize novel depth images from dense point clouds.
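Pseudo-rendering.
Given the 3D rigid transformation matrix of a novel viewpoint $(\mathbf{R}_k, \mathbf{t}_k)$, each canonical 3D point $\hat{\mathbf{p}}_i$ can be further transformed to $\hat{\mathbf{x}}'_i$ back in the image coordinates via
$$\hat{\mathbf{x}}'_i = \mathbf{K}\left(\mathbf{R}_k \hat{\mathbf{p}}_i + \mathbf{t}_k\right) \quad \forall i. \qquad (2)$$
This is the inverse operation of Eq. (1) with different transformation matrices, and the two can be combined into a single effective transformation. In this way, we obtain the $(\hat{x}'_i, \hat{y}'_i)$ location as well as the new depth value $\hat{z}'_i$ at viewpoint $k$.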
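The single effective transformation can be sketched as follows. The sketch assumes Eq. (1) has the form p = R_n^{-1}(K^{-1}x - t_n) and Eq. (2) the form x' = K(R_k p + t_k), so that x' = A x + b with A = K R_k R_n^T K^{-1} and b = K(t_k - R_k R_n^T t_n). The function name `effective_transform` is hypothetical.

```python
import numpy as np

def effective_transform(R_n, t_n, R_k, t_k, K):
    """Return (A, b) such that x' = A @ x + b combines Eq. (1) and Eq. (2).

    R_n, t_n: pose of the fixed viewpoint; R_k, t_k: pose of the novel viewpoint.
    K: (3, 3) camera intrinsic matrix shared by both viewpoints (an assumption).
    """
    relative_rotation = R_k @ R_n.T               # R_k R_n^{-1}
    A = K @ relative_rotation @ np.linalg.inv(K)
    b = K @ (t_k - relative_rotation @ t_n)
    return A, b
```

Applying `A` and `b` to all predicted points of a fixed viewpoint gives their image location and depth at the novel viewpoint in one affine step, which avoids materializing the canonical points separately.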
To produce a pixelated depth image, one would also need to discretize the image-plane coordinates $(\hat{x}'_i, \hat{y}'_i)$ of all transformed points, resulting in possibly multiple transformed points projecting and “colliding” onto the same pixel (Fig. 2). We resolve this issue with the pseudo-renderer $f_{\mathrm{PR}}(\cdot)$, which increases the projection resolution to alleviate such collisions. Specifically, each transformed point is projected onto a target image upsampled by a factor of $U$, reducing the quantization error of $(\hat{x}'_i, \hat{y}'_i)$ as well as the probability of collisions. A max-pooling operation on the inverse depth values with kernel size $U$ then downsamples back to the original resolution while keeping the minimum depth value at each pixel location. We use this approximation of the rendering operation to maintain differentiability and parallelizability within the backpropagation framework.
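The forward pass of this procedure can be sketched in NumPy as below. This is only an illustration under several assumptions: orthographic projection with x and y already in pixel units, positive depths, and a background value of 0 for empty pixels. The name `pseudo_render` is hypothetical, and a training implementation would use a framework with automatic differentiation rather than plain NumPy.

```python
import numpy as np

def pseudo_render(xyz, height, width, U):
    """Render a depth image and mask from transformed points.

    xyz: (P, 3) array with columns (x', y', z'); x', y' in pixel units, z' > 0.
    U: integer upsampling factor of the intermediate projection grid.
    """
    # Quantize the locations on a grid that is U times finer than the target.
    col = np.floor(xyz[:, 0] * U).astype(int)
    row = np.floor(xyz[:, 1] * U).astype(int)
    inside = ((col >= 0) & (col < width * U) &
              (row >= 0) & (row < height * U) & (xyz[:, 2] > 0))
    inv_depth_hi = np.zeros((height * U, width * U))
    # Where points still collide, keep the largest inverse depth (closest point).
    np.maximum.at(inv_depth_hi, (row[inside], col[inside]), 1.0 / xyz[inside, 2])
    # Max-pool with kernel size U to return to the original resolution.
    inv_depth = inv_depth_hi.reshape(height, U, width, U).max(axis=(1, 3))
    mask = inv_depth > 0
    depth = np.where(mask, 1.0 / np.where(mask, inv_depth, 1.0), 0.0)
    return depth, mask
```

Pooling the inverse depth with a maximum is what keeps the minimum depth per pixel, since the closest point has the largest inverse depth and empty cells (value 0) never win.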
Optimization.
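We use the pseudo-rendered depth images $\hat{\mathbf{Z}} = f_{\mathrm{PR}}(\{\hat{\mathbf{x}}'_i\})$ and the resulting masks $\hat{\mathbf{M}}$ at novel viewpoints for optimization. The loss function consists of the mask loss $\mathcal{L}_{\text{mask}}$ and the depth loss $\mathcal{L}_{\text{depth}}$, respectively defined as
$$\mathcal{L}_{\text{mask}} = \sum_{k=1}^{K} -\mathbf{M}_k \log \hat{\mathbf{M}}_k - (1 - \mathbf{M}_k) \log\left(1 - \hat{\mathbf{M}}_k\right), \qquad \mathcal{L}_{\text{depth}} = \sum_{k=1}^{K} \left\| \hat{\mathbf{Z}}_k - \mathbf{Z}_k \right\|_1, \qquad (3)$$
where we simultaneously optimize over $K$ novel viewpoints at a time. $\mathbf{M}_k$ and $\mathbf{Z}_k$ are the ground-truth mask and depth images at the $k$-th novel viewpoint. We use an element-wise $\ell_1$ loss for the depth and a cross-entropy loss for the mask (posing it as a pixel-wise binary classification problem). The overall loss function is defined as $\mathcal{L} = \mathcal{L}_{\text{mask}} + \lambda \cdot \mathcal{L}_{\text{depth}}$, where $\lambda$ is the weighting factor.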
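A minimal NumPy sketch of this objective is given below. It assumes that the depth term is an element-wise L1 difference, that both terms are summed over pixels and viewpoints, and that predicted mask values are probabilities in (0, 1); the clipping constant `eps` and the function names are illustrative choices rather than details from the paper.

```python
import numpy as np

def mask_loss(M_hat, M, eps=1e-7):
    """Binary cross-entropy between predicted and ground-truth masks.

    M_hat, M: arrays of shape (K, H, W) for K novel viewpoints.
    """
    M_hat = np.clip(M_hat, eps, 1.0 - eps)   # avoid log(0)
    return float(np.sum(-M * np.log(M_hat) - (1.0 - M) * np.log(1.0 - M_hat)))

def depth_loss(Z_hat, Z):
    """Element-wise L1 difference between rendered and ground-truth depth."""
    return float(np.sum(np.abs(Z_hat - Z)))

def total_loss(M_hat, M, Z_hat, Z, lam):
    """Overall loss: mask loss plus lam times depth loss."""
    return mask_loss(M_hat, M) + lam * depth_loss(Z_hat, Z)
```

The weighting factor `lam` trades off silhouette agreement against depth agreement; its value here is left as a free parameter.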
Optimizing the structure generator over novel projections enforces joint 3D geometric reasoning between the point clouds predicted from the fixed viewpoints. It also allows the optimization error to be distributed evenly across novel viewpoints instead of concentrating on the fixed viewpoints.
4 Experiments
We evaluate our proposed method by analyzing its performance in the application of single-image 3D reconstruction and comparing against state-of-the-art methods.
Data preparation.
We train and evaluate all networks using the ShapeNet database chang2015shapenet , which contains a large collection of categorized 3D CAD models. For each CAD model, we pre-render 100 depth/mask image pairs of size 128×128 at random novel viewpoints as the ground truth of the loss function. We consider the entire space of possible 3D rotations (including in-plane rotation) for the viewpoints and assume identity translation for simplicity. The input images are objects pre-rendered from a fixed elevation and 24 different azimuth angles.
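One way to draw viewpoints from the entire space of 3D rotations is sketched below. The text does not specify the sampling distribution, so uniform sampling over rotations via normalized Gaussian quaternions is an assumption made here for illustration; `random_rotation` is a hypothetical helper.

```python
import numpy as np

def random_rotation(rng):
    """Sample a 3x3 rotation matrix uniformly over all 3D rotations.

    A normalized 4D Gaussian vector is a uniformly distributed unit
    quaternion, which is then converted to a rotation matrix.
    """
    q = rng.normal(size=4)
    q /= np.linalg.norm(q)
    w, x, y, z = q
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ])

# Example: 100 novel viewpoints for one CAD model.
rng = np.random.default_rng(0)
novel_rotations = [random_rotation(rng) for _ in range(100)]
```

Because every rotation is sampled, in-plane rotations of the camera are covered as well, which matches the setting described above.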
Table 1: Network dimensions (number of filters) for the experiments in Sec. 4.1 and Sec. 4.2.
| Sec. | Input size | Latent vector | Image encoder | Structure generator |
|---|---|---|---|---|
| 4.1 | 64×64 | 512-D | conv: 96, 128, 192, 256; linear: 2048, 1024, 512 | linear: 1024, 2048, 4096; deconv: 192, 128, 96, 64, 48 |
| 4.2 | 128×128 | 1024-D | conv: 128, 192, 256, 384, 512; linear: 4096, 2048, 1024 | linear: 2048, 4096, 12800; deconv: 384, 256, 192, 128, 96 |
Architectural details.
The structure generator follows the structure of conventional deep generative models, consisting of linear layers followed by 2D convolution layers (with kernel size 3×3). The dimensions of all feature maps are halved after each encoder convolution and doubled after each generator convolution. Details of network dimensions are listed in Table 1. At the end of the decoder, we add an extra convolution layer with filters of size 1×1 to encourage individuality of the generated pixels. Batch normalization ioffe2015batch and ReLU activations are added between all layers.
The generator predicts images of size 128×128 with 4 channels ($x$, $y$, $z$ and the binary mask), where the fixed viewpoints are chosen from the 8 corners of a centered cube. Orthographic projection is assumed in the transformations in Eqs. (1) and (2). We use for the upsampling factor of the pseudo-renderer in our experiments.
Training details.
All networks are optimized using the Adam optimizer kingma2014adam . We adopt a two-stage training procedure: the structure generator is first pre-trained to predict the depth images from the fixed viewpoints (with a constant learning rate of 1e-2), and then the entire network is fine-tuned with joint 2D projection optimization (with a constant learning rate of 1e-4). For the training parameters, we set and .
Quantitative metrics.
We measure the average point-wise 3D Euclidean distance between two 3D models: for each point $\mathbf{p}_i$ in the source model, the distance to the target model $\mathcal{S}$ is defined as $\mathcal{E}_i = \min_{\mathbf{p}_j \in \mathcal{S}} \|\mathbf{p}_i - \mathbf{p}_j\|_2$. This metric is defined bidirectionally as the distance from the predicted point cloud to the ground-truth CAD model and vice versa. Both directions must be reported because they capture different aspects of quality: the former measures 3D shape similarity and the latter measures surface coverage kong2017using . We represent the ground-truth CAD models as collections of uniformly densified 3D points on the surfaces (100K densified points in our settings).
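The bidirectional metric can be sketched with a brute-force nearest-neighbor search. The sketch assumes the point-wise distance is the Euclidean distance to the closest point of the target set; the chunk size and function names are illustrative, and a spatial index such as a k-d tree would be preferable for 100K ground-truth points.

```python
import numpy as np

def mean_nearest_distance(source, target, chunk=1024):
    """Average distance from each source point to its nearest target point.

    source: (S, 3) array, target: (T, 3) array.
    """
    total = 0.0
    for start in range(0, len(source), chunk):
        block = source[start:start + chunk]
        # Squared distances between every block point and every target point.
        d2 = ((block[:, None, :] - target[None, :, :]) ** 2).sum(axis=2)
        total += np.sqrt(d2.min(axis=1)).sum()
    return total / len(source)

def bidirectional_error(pred, gt):
    """Return (pred -> GT, GT -> pred) average point-wise distances."""
    return mean_nearest_distance(pred, gt), mean_nearest_distance(gt, pred)
```

For example, a prediction consisting of a single point that lies exactly on the ground truth has zero error in the first direction but a large error in the second, which is why the second direction reflects surface coverage.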
4.1 Single Object Category
We start by evaluating the efficacy of our dense point cloud representation on 3D reconstruction for a single object category. We use the chair category from ShapeNet, which consists of 6,778 CAD models. We compare against (a) Tatarchenko et al. tatarchenko2016multi , which learns implicit 3D representations through a mixed embedding, and (b) Perspective Transformer Networks (PTN) yan2016perspective , which learn to predict volumetric data by minimizing the projection error. We include two variants of PTN as well as a baseline 3D ConvNet from Yan et al. yan2016perspective . We use the same 80%-20% training/test split provided by Yan et al. yan2016perspective .
We pre-train our network for 200K iterations and fine-tune end-to-end for 100K iterations. For the method of Tatarchenko et al. tatarchenko2016multi , we evaluate by predicting depth images from the same viewpoints as ours and transforming the resulting point clouds to the canonical coordinates. This baseline shares our network architecture, but with 3D pose information additionally encoded using 3 linear layers (with 64 filters) and concatenated with the latent vector. We use the novel depth/mask pairs as direct supervision for the decoder output and train this network for 300K iterations with a constant learning rate of 1e-2. For PTN yan2016perspective , we extract the surface voxels (by subtracting the eroded version of the prediction from the prediction itself) and rescale them such that the tightest 3D bounding boxes of the prediction and the ground-truth CAD models have the same volume. We use the pre-trained models readily provided by the authors.
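The surface-voxel extraction used for the volumetric baselines can be sketched as follows. The structuring element is not stated in the text, so a 6-connected neighborhood with voxels outside the grid treated as empty is assumed here; `erode` and `surface_voxels` are hypothetical helper names, and the bounding-box rescaling step is not shown.

```python
import numpy as np

def erode(occ):
    """Binary erosion of a boolean 3D grid with a 6-connected neighborhood.

    A voxel survives only if it and its six face neighbors are occupied;
    voxels outside the grid count as empty.
    """
    padded = np.pad(occ, 1, constant_values=False)
    out = padded[1:-1, 1:-1, 1:-1].copy()
    for axis in range(3):
        for shift in (-1, 1):
            out &= np.roll(padded, shift, axis=axis)[1:-1, 1:-1, 1:-1]
    return out

def surface_voxels(occ):
    """Occupied voxels that are removed by erosion, i.e. the surface shell."""
    occ = occ.astype(bool)
    return occ & ~erode(occ)
```

For a solid 3×3×3 block, only the center voxel survives erosion, so the extracted surface consists of the remaining 26 voxels.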
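Table 2: Average 3D error on the test split of the single-category (chair) experiment.
| Method | 3D error (pred. → GT) | 3D error (GT → pred.) |
|---|---|---|
| 3D ConvNet (vol. only) yan2016perspective | 1.827 | 2.660 |
| PTN (proj. only) yan2016perspective | 2.181 | 2.170 |
| PTN (vol. & proj.) yan2016perspective | 1.840 | 2.585 |
| Tatarchenko et al. tatarchenko2016multi | 2.381 | 3.019 |
| Proposed method | 1.768 | 1.763 |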
The quantitative results on the test split are reported in Table 2. We achieve a lower average 3D distance than all baselines in both metrics, even though our approach is optimized with joint 2D projections instead of these 3D error metrics. This demonstrates that we are capable of predicting more accurate shapes with higher density and finer granularity. It also highlights the efficiency of using 2D ConvNets to generate 3D shapes compared to 3D ConvNet methods such as PTN yan2016perspective , which attempt to predict all voxel occupancies inside a 3D grid space. Compared to Tatarchenko et al. tatarchenko2016multi , an important takeaway is that 3D geometry should be explicitly factorized when possible instead of being implicitly learned by the network parameters. It is much more efficient to focus on predicting the geometry from a sufficient number of viewpoints and combining them with known geometric transformations.
We visualize the generated 3D shapes in Fig. 3. Compared to the baselines, we predict more accurate object structures with a much higher point cloud density (around 10× higher than volumetric methods). This further highlights the desirability of our approach: we are able to use 2D convolutional operations efficiently and exploit high-resolution supervision given similar memory budgets.
4.2 General Object Categories
We also evaluate our network on the single-image 3D reconstruction task trained with multiple object categories. We compare against (a) 3D-R2N2 choy20163d , which learns volumetric predictions through recurrent networks, and (b) Fan et al. fan2016point , which predicts an unordered set of 1024 3D points. We use 13 categories of ShapeNet for evaluation (listed in Table 3), where the 80%-20% training/test split is provided by Choy et al. choy20163d . We evaluate 3D-R2N2 by its surface voxels using the same procedure as described in Sec. 4.1. We pre-train our network for 300K iterations and fine-tune end-to-end for 100K iterations; for the baselines, we use the pre-trained models readily provided by the authors.
Table 3: Per-category 3D error for the multi-category experiment.
| Category | 3D-R2N2 choy20163d (1 view) | 3D-R2N2 choy20163d (3 views) | 3D-R2N2 choy20163d (5 views) | Fan et al. fan2016point (1 view) | Proposed (1 view) |
|---|---|---|---|---|---|
| airplane | 3.207 / 2.879 | 2.521 / 2.468 | 2.399 / 2.391 | 1.301 / 1.488 | 1.294 / 1.541 |
| bench | 3.350 / 3.697 | 2.465 / 2.746 | 2.323 / 2.603 | 1.814 / 1.983 | 1.757 / 1.487 |
| cabinet | 1.636 / 2.817 | 1.445 / 2.626 | 1.420 / 2.619 | 2.463 / 2.444 | 1.814 / 1.072 |
| car | 1.808 / 3.238 | 1.685 / 3.151 | 1.664 / 3.146 | 1.800 / 2.053 | 1.446 / 1.061 |
| chair | 2.759 / 4.207 | 1.960 / 3.238 | 1.854 / 3.080 | 1.887 / 2.355 | 1.886 / 2.041 |
| display | 3.235 / 4.283 | 2.262 / 3.151 | 2.088 / 2.953 | 1.919 / 2.334 | 2.142 / 1.440 |
| lamp | 8.400 / 9.722 | 6.001 / 7.755 | 5.698 / 7.331 | 2.347 / 2.212 | 2.635 / 4.459 |
| loudspeaker | 2.652 / 4.335 | 2.577 / 4.302 | 2.487 / 4.203 | 3.215 / 2.788 | 2.371 / 1.706 |
| rifle | 4.798 / 2.996 | 4.307 / 2.546 | 4.193 / 2.447 | 1.316 / 1.358 | 1.289 / 1.510 |
| sofa | 2.725 / 3.628 | 2.371 / 3.252 | 2.306 / 3.196 | 2.592 / 2.784 | 1.917 / 1.423 |
| table | 3.118 / 4.208 | 2.268 / 3.277 | 2.128 / 3.134 | 1.874 / 2.229 | 1.689 / 1.620 |
| telephone | 2.202 / 3.314 | 1.969 / 2.834 | 1.874 / 2.734 | 1.516 / 1.989 | 1.939 / 1.198 |
| watercraft | 3.592 / 4.007 | 3.299 / 3.698 | 3.210 / 3.614 | 1.715 / 1.877 | 1.813 / 1.550 |
| mean | 3.345 / 4.102 | 2.702 / 3.465 | 2.588 / 3.342 | 1.982 / 2.146 | 1.846 / 1.701 |
We list the quantitative results in Table 3, where the metrics are reported per category. Our method achieves a lower overall error in both metrics. We outperform the volumetric baselines (3D-R2N2) by a large margin and have better prediction performance than Fan et al. in most cases. We also visualize the predictions in Fig. 4; again, we see that our method predicts more accurate shapes with higher point density. We find that our method can struggle when objects contain very thin structures (e.g. lamps); adding hybrid linear layers fan2016point may help improve performance.
4.3 Generative Representation Analysis
We analyze the learned generative representations by observing the 3D predictions from manipulation in the latent space. Previous works have demonstrated that deep generative networks can generate meaningful pixel/voxel predictions by performing linear operations in the latent space radford2015unsupervised ; dosovitskiy2015learning ; wu2016learning ; here, we explore the possibility of such manipulation for dense point clouds in an undiscretized space.
We show in Fig. 5 the resulting dense shapes generated from the embedding vector interpolated in the latent space. The morphing transition is smooth with plausible interpolated shapes, which suggests that our structure generator can generate meaningful 3D predictions from convex combinations of encoded latent vectors. The structure generator is also capable of generating reasonable novel shapes from arithmetic results in the latent space: in Fig. 6, we observe semantic feature replacement of table height/shape as well as chair arms/backs. These results suggest that the high-level semantic information encoded in the latent vectors is manipulable and can be interpreted through the dense point clouds produced by the structure generator.
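The two latent-space manipulations can be sketched as simple vector operations. The specific arithmetic form (base minus one attribute plus another) is an assumption for illustration, and `interpolate_latents` and `latent_arithmetic` are hypothetical names; each resulting vector would be passed through the structure generator to obtain a dense point cloud.

```python
import numpy as np

def interpolate_latents(z_a, z_b, steps):
    """Convex combinations of two encoded latent vectors.

    Returns an array of shape (steps, D) moving linearly from z_a to z_b.
    """
    alphas = np.linspace(0.0, 1.0, steps)
    return np.stack([(1.0 - a) * z_a + a * z_b for a in alphas])

def latent_arithmetic(z_base, z_remove, z_add):
    """Replace a semantic feature by vector arithmetic in the latent space."""
    return z_base - z_remove + z_add
```

For instance, `interpolate_latents(z_a, z_b, 5)` yields the two endpoint embeddings and three intermediate ones, corresponding to a short morphing sequence between two shapes.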
5 Conclusion
In this paper, we introduced a framework for generating 3D shapes in the form of dense point clouds. Compared to conventional volumetric prediction methods using 3D ConvNets, it is more efficient to utilize 2D convolutional operations to predict surface information of 3D shapes. We showed that by introducing a pseudo-renderer, we are able to synthesize approximate depth images from novel viewpoints to optimize the 2D projection error within a backpropagation framework. Experimental results for single-image 3D reconstruction tasks showed that we generate more accurate and much denser 3D shapes than state-of-the-art 3D reconstruction methods.
References
 (1) A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
 (2) C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In European Conference on Computer Vision, pages 628–644. Springer, 2016.

 (3) A. Dosovitskiy, J. Tobias Springenberg, and T. Brox. Learning to generate chairs with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1538–1546, 2015.
 (4) H. Fan, H. Su, and L. Guibas. A point set generation network for 3d object reconstruction from a single image. arXiv preprint arXiv:1612.00603, 2016.
 (5) M. Gadelha, S. Maji, and R. Wang. 3d shape induction from 2d views of multiple objects. arXiv preprint arXiv:1612.05872, 2016.
 (6) R. Girdhar, D. F. Fouhey, M. Rodriguez, and A. Gupta. Learning a predictable and generative vector representation for objects. In European Conference on Computer Vision, pages 484–499. Springer, 2016.
 (7) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
 (8) C. Häne, S. Tulsiani, and J. Malik. Hierarchical surface prediction for 3d object reconstruction. arXiv preprint arXiv:1704.00710, 2017.
 (9) V. Hegde and R. Zadeh. Fusionnet: 3d object classification using multiple data representations. arXiv preprint arXiv:1607.05695, 2016.
 (10) S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. pages 448–456, 2015.
 (11) P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004, 2016.
 (12) M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In Advances in Neural Information Processing Systems, pages 2017–2025, 2015.
 (13) D. Kingma and J. Ba. Adam: A method for stochastic optimization. 2015.
 (14) D. P. Kingma and M. Welling. Auto-encoding variational bayes. 2013.
 (15) C. Kong, C.-H. Lin, and S. Lucey. Using locally corresponding cad models for dense 3d reconstructions from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
 (16) T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum. Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, pages 2539–2547, 2015.
 (17) C.-H. Lin and S. Lucey. Inverse compositional spatial transformer networks. arXiv preprint arXiv:1612.03897, 2016.
 (18) D. Maturana and S. Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pages 922–928. IEEE, 2015.
 (19) E. Park, J. Yang, E. Yumer, D. Ceylan, and A. C. Berg. Transformation-grounded image generation network for novel 3d view synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
 (20) A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
 (21) S. E. Reed, Y. Zhang, Y. Zhang, and H. Lee. Deep visual analogy-making. In Advances in Neural Information Processing Systems, pages 1252–1260, 2015.
 (22) D. J. Rezende, S. A. Eslami, S. Mohamed, P. Battaglia, M. Jaderberg, and N. Heess. Unsupervised learning of 3d structure from images. In Advances In Neural Information Processing Systems, pages 4997–5005, 2016.
 (23) G. Riegler, A. O. Ulusoy, and A. Geiger. Octnet: Learning deep 3d representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
 (24) M. Tatarchenko, A. Dosovitskiy, and T. Brox. Multi-view 3d models from single images with a convolutional network. In European Conference on Computer Vision, pages 322–337. Springer, 2016.
 (25) M. Tatarchenko, A. Dosovitskiy, and T. Brox. Octree generating networks: Efficient convolutional architectures for high-resolution 3d outputs. arXiv preprint arXiv:1703.09438, 2017.
 (26) S. Tulsiani, T. Zhou, A. A. Efros, and J. Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. arXiv preprint arXiv:1704.06254, 2017.
 (27) X. Wang and A. Gupta. Generative image modeling using style and structure adversarial networks. In European Conference on Computer Vision, pages 318–335. Springer, 2016.
 (28) J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In Advances in Neural Information Processing Systems, pages 82–90, 2016.
 (29) Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1912–1920, 2015.
 (30) X. Yan, J. Yang, K. Sohn, and H. Lee. Attribute2image: Conditional image generation from visual attributes. In European Conference on Computer Vision, pages 776–791. Springer, 2016.
 (31) X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee. Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision. In Advances in Neural Information Processing Systems, pages 1696–1704, 2016.
 (32) J. Yang, S. E. Reed, M.-H. Yang, and H. Lee. Weakly-supervised disentangling with recurrent transformations for 3d view synthesis. In Advances in Neural Information Processing Systems, pages 1099–1107, 2015.
 (33) M. E. Yumer and N. J. Mitra. Learning semantic deformation flows with 3d convolutional networks. In European Conference on Computer Vision, pages 294–311. Springer, 2016.
 (34) T. Zhou, S. Tulsiani, W. Sun, J. Malik, and A. A. Efros. View synthesis by appearance flow. In European Conference on Computer Vision, pages 286–301. Springer, 2016.
 (35) J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pages 597–613. Springer, 2016.