3D reconstruction from single RGB images has been a longstanding challenge in computer vision. While recent progress with deep learning-based techniques and large shape or image databases has been significant, the reconstruction of detailed geometry and texture for a large variety of object categories with arbitrary topology remains challenging. Point clouds have emerged as one of the most popular representations to tackle this challenge because of a number of distinct advantages: unlike meshes they can easily represent arbitrary topology, unlike 3D voxel grids they do not suffer from cubic complexity, and unlike implicit functions they can reconstruct shapes using a single evaluation of a neural network. In addition, it is straightforward to represent surface textures with point clouds by storing per-point RGB values.
In this paper, we present a novel method to reconstruct 3D point clouds from single RGB images, including the optional recovery of per-point RGB texture. In addition, our approach can be trained on multiple categories. The key idea of our method is to solve the problem in two stages, where both stages can be implemented using powerful 2D image-to-image translation networks: in the first stage, we recover an object coordinate map from the input RGB image. This is similar to a depth image, but it corresponds to a point cloud in object-centric coordinates that is independent of camera pose. In the second stage, we reproject the object space point cloud into depth images from eight fixed viewpoints in image space and perform depth map completion. We can then trivially fuse all completed object space depth maps into a final 3D reconstruction, without requiring a separate alignment stage such as the iterative closest point (ICP) algorithm. Since all networks are based on 2D convolutions, it is straightforward to achieve high resolution reconstructions with this approach. Texture reconstruction uses the same pipeline, but operates on RGB images instead of object space depth maps.
We train our approach on a multi-category dataset and show that our object-centric, two-stage approach leads to better generalization than competing techniques. In addition, recovering object space point clouds allows us to avoid a separate camera pose estimation step. In summary, our main contributions are as follows:
A strategy to generate 3D shapes from single RGB images in a two-stage approach, by first recovering object coordinate images as an intermediate representation, and then performing reprojection, depth map completion, and a final trivial fusion step in object space.
The first work to train a single network to reconstruct point clouds with RGB textures on multiple categories.
2 Related Work
Our method is mainly related to single image 3D reconstruction and shape completion. We briefly review previous work in these two areas.
Single image 3D reconstruction. Along with the development of deep learning techniques, single image 3D reconstruction has made huge progress. Because of their regularity, early works mainly learned to reconstruct voxel grids from 3D supervision or from 2D supervision using differentiable renderers [29, 25]. However, these methods can only reconstruct shapes at low resolution, such as 32³ or 64³, due to the cubic complexity of voxel grids. Although various strategies [8, 24] were proposed to increase the resolution, they introduce considerable complexity. Mesh-based methods [27, 16] are an alternative way to increase resolution. However, they still struggle with arbitrary topology, since the vertex connectivity of the reconstructed shape is largely inherited from a template. Point-cloud-based methods [7, 19, 32, 17] provide another direction for single image 3D reconstruction, but they are also bottlenecked by low resolution, which makes it hard to recover fine geometric details.
Besides low resolution, lack of texture is another issue that significantly affects the realism of generated shapes. Current methods aim to map texture from a single image onto reconstructed shapes represented either by mesh templates or by point clouds in the form of object coordinate maps. Although these methods have shown promising results on specific shape classes, they usually only work for category-specific reconstruction. In addition, texture prediction pipelines that sample pixels directly from the input image only work for symmetric objects observed from a favorable viewpoint. Although some other methods (e.g. [34, 23]) predict plausible novel RGB views by view synthesis, they too are limited to category-specific reconstruction.
Different from all these methods, our method jointly learns to reconstruct high resolution geometry and texture via a two-stage reconstruction, taking object coordinate maps (also called NOCS maps in [26, 21]) as the intermediate representation. In contrast to previous methods [33, 32] that use depth maps as the intermediate representation and require camera pose information in their pipelines, our method does not require camera pose information.
Shape completion. Shape completion aims to infer the complete 3D geometry of an object from partial observations. Different methods use volumetric grids or point clouds [31, 30, 1] as the shape representation for the completion task. Point-based methods mainly follow an encoder-decoder structure with PointNet architectures as backbones. Although these works produce plausible completed shapes, they are limited to low resolution. To resolve this issue, Hu et al. introduced Render4Completion, which casts 3D shape completion as multiple 2D view completion problems and demonstrates promising potential for high resolution shape completion. Our method follows this direction; however, we learn not only geometry but also texture, which makes our method substantially different.
3 Method
Most 3D point cloud reconstruction methods [17, 4, 6] solely focus on generating 3D shapes, i.e., per-point (x, y, z) coordinates, from input RGB images. Recovering texture in addition to 3D coordinates is a more challenging task, since the network must also predict per-point (r, g, b) values.
We propose a method to generate high resolution 3D predictions and recover textures from RGB images. At a high level, we decompose the reconstruction problem into two less challenging tasks: first, transforming 2D images to 3D partial shapes that correspond to the observed parts of the target object, and second, completing the unseen parts of the 3D object. We use object coordinate images to represent partial 3D shapes, and multiple depth and RGB views to represent completed 3D shapes.
As shown in Fig. 1, our pipeline consists of four sub-modules: (1) the 2D-3D Net, an image translation network that translates an RGB image into a partial shape, represented by an object coordinate image; (2) the Joint Projection module, which first maps texture onto the partial shape to obtain a textured partial shape, and then jointly projects it into 8 pairs of partial depth and texture views from 8 fixed viewpoints (the 8 vertices of a cube); (3) the multi-view texture and depth completion module, which consists of two networks: the Multi-view Texture-Depth Completion Net (MTDCN), which generates completed texture maps and depth maps by jointly completing partial texture and depth maps, and, as an alternative, the Multi-view Depth Completion Net (MDCN), which only completes depth maps and generates more accurate results; (4) the Joint Fusion module, which jointly fuses the completed depth and texture views into a completed 3D shape with texture.
3.1 2D RGB Image to Partial Shapes
We propose to use 3-channel object coordinate images to represent partial shapes. Each pixel of an object coordinate image represents a 3D point, where its value corresponds to the point's (x, y, z) location in object space. An object coordinate image is aligned with the input image, as shown in Figure 1, and in our pipeline it represents the visible parts of the target 3D object. With this image-based 3D representation, we formulate the 2D-to-3D transformation as an image-to-image translation problem, and propose a 2D-3D Net to perform the translation based on the U-Net architecture.
Unlike the depth map representation used in prior work, which requires camera pose information for back-projection, a 3-channel object coordinate image represents a 3D shape on its own. Note that our network implicitly infers the camera pose of the input RGB image, so that the generated partial shape is aligned with the ground truth 3D shape.
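As a concrete illustration, converting a predicted object coordinate image into a partial point cloud is a simple masked lookup. This is a minimal sketch; the array shapes and the explicit foreground mask are assumptions, since only the 3-channel image itself is specified above:

```python
import numpy as np

def coord_image_to_points(coord_img, mask):
    """Turn an object coordinate image (H, W, 3) into an (N, 3) partial
    point cloud by keeping only the foreground pixels given by `mask`."""
    return coord_img[mask]

# Toy 2x2 coordinate image with one background pixel.
coord_img = np.array([[[0.1, 0.2, 0.3], [0.0, 0.0, 0.0]],
                      [[0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]])
mask = np.array([[True, False], [True, True]])
points = coord_image_to_points(coord_img, mask)
print(points.shape)  # (3, 3)
```

Because the coordinate image stores object-space positions directly, no camera intrinsics or pose are needed for this step.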
3.2 Partial Shapes to Multiple Views
In this module, we transform the input RGB image and the predicted object coordinate image into a textured partial shape, which is then rendered from 8 fixed viewpoints to generate depth maps and texture maps. The process is illustrated in Fig. 2.
Joint Texture and Shape Mapping. The input RGB image I is aligned with the generated object coordinate image C. An equivalent partial point cloud can be obtained by taking 3D coordinates from C and texture from I.
We denote a pixel of I as I(u, v) = (r, g, b), where u and v are pixel coordinates, and similarly a pixel of C as C(u, v) = (x, y, z). Since I(u, v) and C(u, v) appear at the same location, each such pair yields a colored 3D point (x, y, z, r, g, b) of the textured partial shape P.
Joint Projection. We render multiple depth maps D_i and texture maps T_i, i = 1, …, 8, from the 8 fixed viewpoints of the textured partial shape P.
Given viewpoint i, we denote a point of depth map D_i as (u_i, v_i, d_i), where u_i and v_i are pixel coordinates and d_i is the depth value. Similarly, a point of texture map T_i is (u_i, v_i, r_i, g_i, b_i), where (r_i, g_i, b_i) are RGB values. Then, we transform each 3D point (x, y, z) of the partial shape P into a pixel of depth map D_i by

d_i [u_i, v_i, 1]^T = K (R_i [x, y, z]^T + t_i),    (1)

where K is the intrinsic camera matrix, and R_i and t_i are the rotation matrix and translation vector of the i-th view. Note that Eq. (1) only projects the 3D coordinates of P.
However, different points of P may be projected to the same pixel location of the depth map D_i; for example, in Fig. 2 several points are projected to the same pixel of D_i. A buffer is used to store the multiple depth values corresponding to the same pixel. We then implement a depth-pooling operator with a fixed stride to select the minimum depth value, i.e., the point closest to the viewpoint. During depth-pooling we store the pooling indices, and for each pixel we copy the texture values of the selected closest point to the corresponding pixel of the texture map T_i, so that depth and texture remain consistent.
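The projection of Eq. (1) combined with min-depth selection can be sketched as a simple per-pixel z-buffer. The function name, the rounding-based rasterization, and the single-resolution buffer (instead of the strided pooling described above) are simplifications assumed for illustration:

```python
import numpy as np

def project_with_depth_pooling(points, colors, K, R, t, size):
    """Render a colored point cloud into one depth map and one texture map
    (Eq. 1 followed by keeping the minimum depth per pixel)."""
    H, W = size
    depth = np.full((H, W), np.inf)
    tex = np.zeros((H, W, 3))
    cam = (R @ points.T + t.reshape(3, 1)).T      # camera-frame coordinates
    uvw = (K @ cam.T).T                           # homogeneous pixel coords
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    for ui, vi, di, ci in zip(u, v, cam[:, 2], colors):
        if 0 <= vi < H and 0 <= ui < W and di < depth[vi, ui]:
            depth[vi, ui] = di                    # keep the closest point
            tex[vi, ui] = ci                      # copy its texture value
    return depth, tex

# Two points fall on the same pixel; the closer (red) one wins.
pts = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 2.0]])
cols = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
depth, tex = project_with_depth_pooling(pts, cols, np.eye(3), np.eye(3),
                                        np.zeros(3), (2, 2))
print(depth[0, 0], tex[0, 0])  # 1.0 [1. 0. 0.]
```

The loop makes the min-depth tie-breaking explicit; a practical implementation would vectorize it, e.g. with the strided pooling described in the text.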
3.3 Multi-view Texture and Depth Completion
Multi-view Texture-Depth Completion Net (MTDCN). We propose a Multi-view Texture-Depth Completion Net (MTDCN) to jointly complete texture and depth maps. MTDCN is based on a U-Net architecture. In our pipeline, we stack each pair of partial depth map D_i and partial texture map T_i into a 4-channel texture-depth map. MTDCN takes this map as input and generates a completed 4-channel texture-depth map, whose channels form the completed texture and depth maps, respectively. Completions of a car model are shown in Fig. 3. After fusing these views, we obtain a completed shape with texture, as in Fig. 1.
In contrast to category-specific reconstruction approaches that sample texture from input images, and whose performance therefore depends on the viewpoint of the input image and the symmetry of the target object, MTDCN can be trained to infer textures on multiple categories and does not assume that objects are symmetric.
Multi-view Depth Completion Net (MDCN). In our experiments, we found it very challenging to complete both depth and texture maps at the same time. As an alternative, we also train MDCN, which only completes partial depth maps and generates more accurate full depth maps. We then map the texture generated by MTDCN onto the MDCN-generated shape to obtain a textured reconstruction, as illustrated in Fig. 1.
3.4 Joint Fusion
Given the texture and depth maps completed by MTDCN, and the more accurate depth maps completed by MDCN, we jointly fuse the depth and texture maps into a colored 3D point cloud, as illustrated in Fig. 1.
Joint Fusion for MTDCN. Given a point (u_i, v_i, d_i) of a completed depth map and the aligned point (u_i, v_i, r_i, g_i, b_i) of the completed texture map, the back-projected 3D point is obtained by inverting Eq. (1):

[x, y, z]^T = R_i^{-1} (d_i K^{-1} [u_i, v_i, 1]^T - t_i).    (2)

Note that Eq. (2) only back-projects the depth map to 3D coordinates, while the texture of the fused point (x, y, z, r_i, g_i, b_i) is taken from the aligned pixel of the texture map. We also extract a completed shape without texture.
Joint Fusion for MDCN. We map the texture generated by MTDCN onto the completed shape produced by MDCN. The joint fusion process is similar. However, since texture and depth maps are generated separately, a valid point of a depth map may be aligned to an invalid point of the corresponding texture map, especially near edges. For such points, we take the nearest valid neighbor on the texture map. Since the MDCN reconstruction is generated by direct fusion of its completed depth maps, the textured result has the same geometry as the untextured one.
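The fusion step above inverts Eq. (1): every valid pixel of a completed depth map is lifted back to object space and paired with the aligned texture pixel. A minimal sketch, assuming depth maps mark invalid pixels with inf (the actual invalid-pixel encoding is not specified in the text):

```python
import numpy as np

def backproject_depth(depth, tex, K, R, t):
    """Lift every valid pixel (u, v, d) of a depth map back to an object-space
    point via Eq. (2), and pair it with the aligned texture pixel."""
    v, u = np.nonzero(np.isfinite(depth))          # valid pixel coordinates
    d = depth[v, u]
    pix = np.stack([u, v, np.ones_like(u)]).astype(float)
    cam = np.linalg.inv(K) @ pix * d               # camera-frame points
    obj = np.linalg.inv(R) @ (cam - t.reshape(3, 1))
    return obj.T, tex[v, u]                        # (N, 3) points, (N, 3) colors

# One valid pixel at (0, 0) with depth 1 and red texture.
depth = np.array([[1.0, np.inf], [np.inf, np.inf]])
tex = np.zeros((2, 2, 3)); tex[0, 0] = [1.0, 0.0, 0.0]
pts, cols = backproject_depth(depth, tex, np.eye(3), np.eye(3), np.zeros(3))
print(pts, cols)  # [[0. 0. 1.]] [[1. 0. 0.]]
```

Running this for all 8 completed views and concatenating the returned points directly yields the fused colored point cloud, with no further alignment step.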
3.5 Loss Function and Optimization
Training Objective. We perform two-stage training with three networks: the 2D-3D Net, MTDCN, and MDCN. Given an input RGB image, the 2D-3D Net generates an object coordinate image; its training objective is a reconstruction loss between the generated and the ground truth object coordinate images.
Given the partial texture-depth maps, MTDCN generates completed texture-depth maps, and the optimal network is obtained by minimizing a reconstruction loss between the completed and the ground truth texture-depth images.
MDCN only completes depth maps and takes 1-channel depth maps as input. Given a partial depth map, MDCN outputs a completed depth map and is trained with a reconstruction loss against the ground truth depth image.
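The three objectives above are plain image reconstruction losses. The sketch below assumes an L1 penalty, which is a common choice for U-Net-style translation networks but is an assumption here, since the exact loss terms are not reproduced in this text:

```python
import numpy as np

def recon_loss(pred, gt):
    """Hypothetical L1 reconstruction loss between predicted and ground
    truth maps (object coordinate, texture-depth, or depth images)."""
    return np.mean(np.abs(pred - gt))

# 2D-3D Net: (H, W, 3) object coordinate images
# MTDCN:     (H, W, 4) texture-depth maps
# MDCN:      (H, W)    depth maps
loss = recon_loss(np.zeros((4, 4, 3)), np.ones((4, 4, 3)))
print(loss)  # 1.0
```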
Optimization. We use minibatch SGD with the Adam optimizer to train all networks. More details can be found in the supplementary material.
4 Experiments
We evaluate both variants of our method (the MDCN-based and the MTDCN-based reconstructions) on single-image 3D reconstruction and compare against state-of-the-art methods.
Dataset and Metrics. We train all our networks on synthetic models from ShapeNet and evaluate them on both ShapeNet and Pix3D. We render depth maps, texture maps, and object coordinate images for each object; rendering details, including the image resolution, can be found in the supplementary material. We sample 100K points from each mesh object as ground truth point clouds for the evaluations on ShapeNet. For a fair comparison, we use Chamfer Distance (CD) as the quantitative metric. Another popular option, Earth Mover's Distance (EMD), requires that the generated point cloud have the same size as the ground truth, and its computation is time-consuming. While EMD is often used for methods whose output is sparse and of fixed size, such as 1024 or 2048 points in [6, 17], it is not suitable for evaluating our method, which generates very dense point clouds with varying numbers of points.
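For reference, one common symmetric variant of the Chamfer Distance between two point sets can be computed as below; the exact averaging and squaring conventions vary across papers, so this is an illustrative definition rather than the paper's exact formula:

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between point sets a (N, 3) and b (M, 3):
    mean squared nearest-neighbor distance in both directions."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)  # (N, M) pairwise
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
b = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
print(chamfer_distance(a, a), chamfer_distance(a, b))  # 0.0 1.0
```

The brute-force pairwise matrix is fine for small sets; for the 100K-point ground truth clouds used here, a KD-tree nearest-neighbor query would be the practical choice.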
4.1 Single Object Category
We first evaluate our method on a single object category. Following [29, 15], we use the chair category from ShapeNet with the same 80%-20% training/test split. We compare against two methods (Tatarchenko et al. and Lin et al.) that generate dense point clouds by view synthesis, as well as two voxel-based baselines: Perspective Transformer Networks (PTN) in two variants, and a baseline 3D-CNN.
The quantitative results on the test set are reported in Table 4, with the results of the other approaches quoted from prior work. Our MDCN-based method achieves the lowest CD on this single-category task. A visual comparison with Lin's method is shown in Fig. 4, where our generated point clouds are denser and more accurate. In addition, we also infer the textures of the generated point clouds.
4.2 General Object Categories from ShapeNet
Reconstruct novel objects from seen categories. We test our method on novel objects from the 13 seen categories and compare against (a) 3D-R2N2, which predicts volumetric models with recurrent networks; (b) PSGN, which predicts an unordered set of 1024 3D points with fully-connected and deconvolutional layers; and (c) 3D-LMNet, which predicts point clouds by latent-embedding matching. We only compare against methods that follow the same setting as 3D-R2N2, excluding methods that assume a fixed elevation as well as OptMVS. We use the pretrained models provided by the authors, and the results of 3D-R2N2 and PSGN are quoted from prior work. Note that we extract the surface voxels of 3D-R2N2 for evaluation.
Table 4 shows the quantitative results. Since most methods need ICP alignment as a post-processing step to achieve finer alignment with the ground truth, we list results both without and with ICP. Note that PSGN predicts rotated point clouds, so we only list its results after ICP alignment. Our MDCN-based method outperforms the state-of-the-art methods on most categories: it outperforms 3D-LMNet on 12 of the 13 categories without ICP, and on 7 with ICP, and achieves the lowest CD on average. Unlike the other methods, ours does not rely heavily on ICP; more analysis can be found in Section 4.4.
We also visualize the predictions in Fig. 6. Our method predicts more accurate shapes with higher point density. Besides 3D coordinates, our method also predicts textures; we show our textured reconstructions from two different views (v1 and v2).
Reconstruct objects from unseen categories. We also evaluate how well our models generalize to 6 unseen categories from ShapeNet: bed, bookshelf, guitar, laptop, motorcycle, and train. The quantitative comparison with 3D-LMNet in Table 4 shows the better generalization of our method: we outperform 3D-LMNet on 4 of the 6 categories, both before and after ICP. Qualitative results are shown in Fig. 5. Our method performs reasonably well on the reconstruction of beds and guitars, while 3D-LMNet interprets the inputs as a sofa or a lamp from the seen categories, respectively.
CD on unseen ShapeNet categories (in parentheses: after ICP alignment):

| Category | 3D-LMNet | Ours (MTDCN) | Ours (MDCN) |
| bed | 13.56 (7.13) | 12.82 (8.43) | 11.46 (6.51) |
| bookshelf | 7.47 (4.68) | 8.99 (7.96) | 5.63 (4.89) |
| guitar | 8.19 (6.40) | 7.07 (7.29) | 5.96 (6.33) |
| laptop | 19.42 (5.21) | 9.76 (7.58) | 7.08 (5.67) |
| motorcycle | 7.00 (5.91) | 7.32 (6.75) | 7.03 (5.79) |
| train | 6.59 (4.07) | 9.16 (4.38) | 9.54 (3.93) |
| mean | 10.37 (5.57) | 9.19 (7.06) | 7.79 (5.52) |
4.3 Real-world Images from Pix3D
To test the generalization of our approach to real-world images, we evaluate our trained model on the Pix3D dataset. We compare against the state-of-the-art methods PSGN, 3D-LMNet, and OptMVS. Following prior evaluation protocols, we uniformly sample 1024 points from each mesh as the ground truth point cloud to calculate CD, and remove images with occlusion and truncation. We also provide results using denser ground truth point clouds in the supplementary. We have 4476 test images from seen categories and 1048 from unseen categories.
Reconstruct novel objects from seen categories in Pix3D. We test the methods on 3 seen categories (chair, sofa, table) that also occur among the 13 ShapeNet training categories; the results are shown in Table 5. Even on real-world data, our networks generate well-aligned shapes, while the other methods rely heavily on ICP. Qualitative results are shown in Fig. 7. Our method performs well on real images and generates denser point clouds with reasonable texture. Besides more accurate shape alignment, our method also predicts better shapes, e.g., the aspect ratio in the 'Table' example.
Reconstruct objects from unseen categories in Pix3D. We also test the pretrained models on the unseen categories (bed, bookcase, desk, misc, tool, wardrobe); the results are shown in Table 5. Our methods outperform the other approaches [6, 28, 17] in mean CD, with or without ICP alignment. Fig. 7 shows a qualitative comparison. For 'Bed-1' and 'Bed-2', our methods generate reasonable beds, while 3D-LMNet interprets them as sofa- or car-like objects. Similarly, we generate a reasonable 'Desk-1' and recover the main structure of the input. For 'Desk-2', our method estimates the aspect ratio more accurately and recovers some details of the target object, such as the curved legs. For 'Bookcase', ours generates a reasonable shape, while OptMVS and 3D-LMNet mistake it for a chair. In addition, we also successfully predict textures for unseen categories on real images.
4.4 Ablation Study
Contributions of each reconstruction stage to the final shape. Since both the 2D-3D Net and the view completion nets perform reconstruction, in Table 4 we compare the CD to the ground truth of the generated partial shape and of the completed shape, on both the multiple-category and the single-category (chair) tasks. On the multiple-category task, the mean CD decreases from 10.58 to 3.91 after the second stage.
Reconstruction accuracy of MTDCN and MDCN. As shown in Tables 4, 4, 5, and Figures 6, 7, 4, MDCN generates denser point clouds with smoother surfaces, and the mean CD is lower. Fig. 3 highlights that the completed maps by MDCN are more accurate than those of MTDCN.
The impact of ICP alignment on reconstruction results.
Besides CD, pose estimation should also be considered when comparing different reconstruction methods. We evaluate the pose estimation of 3D-LMNet and of our methods by comparing the relative mean improvement of CD after ICP alignment in Table 6 (S: ShapeNet, P: Pix3D), which is calculated from the data in Tables 4 and 5. A bigger improvement means a worse alignment. Although the generated shapes of 3D-LMNet are supposed to be aligned with the ground truth, its performance still relies heavily on ICP alignment. Our methods rely less on ICP, which implies that our pose estimation is more accurate. We use the same ICP implementation as 3D-LMNet.
In sum, our method predicts shapes better in terms of pose, size, and aspect ratio, as seen in Fig. 7. We attribute this to the use of an intermediate representation: the object coordinate images, which contain only the visible parts, are easier to infer than direct reconstructions from images as in [6, 28, 17]. Furthermore, the predicted partial shapes also constrain the view completion nets to generate aligned shapes. In addition, our method generalizes to unseen categories better than existing methods. The qualitative results in Fig. 5 and 7 show that our method captures more generic, class-agnostic shape priors for object reconstruction.
However, our generated texture is somewhat blurry, since we regress pixel values instead of predicting texture flow, which predicts texture coordinates and samples pixel values directly from the input to yield realistic textures. Texture-flow prediction, however, can only be applied to category-specific tasks with a favorable viewpoint of a symmetric object, so it cannot be applied directly to multiple-category reconstruction. We would like to study how to combine pixel regression and texture-flow prediction to produce realistic textures across multiple categories.
We propose a two-stage method for 3D reconstruction from single RGB images that leverages object coordinate images as an intermediate representation. Our pipeline generates denser point clouds than previous methods and also predicts textures in multiple-category reconstruction tasks. Experiments show that our method outperforms existing methods on both seen and unseen categories, on synthetic as well as real-world datasets.
-  Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and Leonidas J. Guibas. Learning representations and generative models for 3d point clouds. In ICML, 2018.
-  Paul Besl and H.D. McKay. A method for registration of 3-D shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2):239–256, 1992.
-  Angel X. Chang, Thomas A. Funkhouser, Leonidas J. Guibas, Pat Hanrahan, Qi-Xing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. Shapenet: An information-rich 3d model repository. CoRR, abs/1512.03012, 2015.
-  Christopher Bongsoo Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. ArXiv, abs/1604.00449, 2016.
-  Angela Dai, Charles Ruizhongtai Qi, and Matthias Nießner. Shape completion using 3d-encoder-predictor cnns and shape synthesis. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6545–6554, 2017.
-  Haoqiang Fan, Hao Su, and Leonidas J. Guibas. A point set generation network for 3d object reconstruction from a single image. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2463–2471, 2017.
-  Christian Hane, Shubham Tulsiani, and Jitendra Malik. Hierarchical surface prediction for 3D object reconstruction. In International Conference on 3D Vision, pages 412–420, 2017.
-  Tao Hu, Zhizhong Han, Abhinav Shrivastava, and Matthias Zwicker. Render4completion: Synthesizing multi-view depth maps for 3d shape completion. ArXiv, abs/1904.08366, 2019.
-  Tao Hu, Zhizhong Han, and Matthias Zwicker. 3d shape completion with multi-view consistent inference, 2019.
-  Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5967–5976, 2017.
-  Wenzel Jakob. Mitsuba renderer. https://www.mitsuba-renderer.org/, 2010.
-  Angjoo Kanazawa, Shubham Tulsiani, Alexei A. Efros, and Jitendra Malik. Learning category-specific mesh reconstruction from image collections. ArXiv, abs/1803.07549, 2018.
-  Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 12 2014.
-  Chen-Hsuan Lin, Chen Kong, and Simon Lucey. Learning efficient point cloud generation for dense 3d object reconstruction. AAAI Conference on Artificial Intelligence (AAAI), 2018.
-  Shichen Liu, Tianye Li, Weikai Chen, and Hao Li. Soft rasterizer: A differentiable renderer for image-based 3D reasoning. The IEEE International Conference on Computer Vision, 2019.
-  Priyanka Mandikal, K L Navaneet, Mayank Agarwal, and R. Venkatesh Babu. 3d-lmnet: Latent embedding matching for accurate and diverse 3d point cloud reconstruction from a single image. ArXiv, abs/1807.07796, 2018.
-  Charles Ruizhongtai Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 77–85, 2017.
-  Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In NIPS, 2017.
-  O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), volume 9351 of LNCS, pages 234–241. Springer, 2015.
-  Srinath Sridhar, Davis Rempe, Julien Valentin, Sofien Bouaziz, and Leonidas J. Guibas. Multiview aggregation for learning category-specific shape reconstruction. In Advances in Neural Information Processing Systems. 2019.
-  Xingyuan Sun, Jiajun Wu, Xiuming Zhang, Zhoutong Zhang, Chengkai Zhang, Tianfan Xue, Joshua B. Tenenbaum, and William T. Freeman. Pix3d: Dataset and methods for single-image 3d shape modeling. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2974–2983, 2018.
-  Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Multi-view 3d models from single images with a convolutional network. In ECCV, 2015.
-  Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Octree generating networks: Efficient convolutional architectures for high-resolution 3D outputs. In IEEE International Conference on Computer Vision, pages 2107–2115, 2017.
-  Shubham Tulsiani, Alexei A. Efros, and Jitendra Malik. Multi-view consistency as supervisory signal for learning shape and pose prediction. In Computer Vision and Pattern Regognition, 2018.
-  He Wang, Srinath Sridhar, Jingwei Huang, Julien Valentin, Shuran Song, and Leonidas J. Guibas. Normalized object coordinate space for category-level 6d object pose and size estimation. In CVPR, 2019.
-  Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei Liu, and Yu-Gang Jiang. Pixel2mesh: Generating 3D mesh models from single RGB images. In European Conference on Computer Vision, pages 55–71, 2018.
-  Yi Wei, Shaohui Liu, Wang Zhao, Jiwen Lu, and Jie Zhou. Conditional single-view shape generation for multi-view stereo reconstruction. In CVPR, 2019.
-  Xinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, and Honglak Lee. Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision. In NIPS, 2016.
-  Yaoqing Yang, Chen Feng, Yiru Shen, and Dong Tian. Foldingnet: Interpretable unsupervised learning on 3d point clouds. CoRR, abs/1712.07262, 2017.
-  Wentao Yuan, Tejas Khot, David Held, Christoph Mertz, and Martial Hebert. Pcn: Point completion network. 2018 International Conference on 3D Vision (3DV), pages 728–737, 2018.
-  Wei Zeng, Sezer Karaoglu, and Theo Gevers. Inferring point clouds from single monocular images by depth intermediation. ArXiv, abs/1812.01402, 2018.
-  Xiuming Zhang, Zhoutong Zhang, Chengkai Zhang, Josh Tenenbaum, Bill Freeman, and Jiajun Wu. Learning to reconstruct shapes from unseen classes. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 2257–2268. Curran Associates, Inc., 2018.
-  Jun-Yan Zhu, Zhoutong Zhang, Chengkai Zhang, Jiajun Wu, Antonio Torralba, Joshua B. Tenenbaum, and Bill Freeman. Visual object networks: Image generation with disentangled 3d representations. In NeurIPS, 2018.