Learning to Generate Dense Point Clouds with Textures on Multiple Categories

12/22/2019 ∙ by Tao Hu, et al. ∙ University of Maryland

3D reconstruction from images is a core problem in computer vision. With recent advances in deep learning, it has become possible to recover plausible 3D shapes even from single RGB images for the first time. However, obtaining detailed geometry and texture for objects with arbitrary topology remains challenging. In this paper, we propose a novel approach for reconstructing point clouds from RGB images. Unlike other methods, we can recover dense point clouds with hundreds of thousands of points, and we also include RGB textures. In addition, we train our model on multiple categories which leads to superior generalization to unseen categories compared to previous techniques. We achieve this using a two-stage approach, where we first infer an object coordinate map from the input RGB image, and then obtain the final point cloud using a reprojection and completion step. We show results on standard benchmarks that demonstrate the advantages of our technique. Code is available at https://github.com/TaoHuUMD/3D-Reconstruction.




1 Introduction

3D reconstruction from single RGB images has been a longstanding challenge in computer vision. While recent progress with deep learning-based techniques and large shape or image databases has been significant, the reconstruction of detailed geometry and texture for a large variety of object categories with arbitrary topology remains challenging. Point clouds have emerged as one of the most popular representations to tackle this challenge because of a number of distinct advantages: unlike meshes they can easily represent arbitrary topology, unlike 3D voxel grids they do not suffer from cubic complexity, and unlike implicit functions they can reconstruct shapes using a single evaluation of a neural network. In addition, it is straightforward to represent surface textures with point clouds by storing per-point RGB values.

In this paper, we present a novel method to reconstruct 3D point clouds from single RGB images, including the optional recovery of per-point RGB texture. In addition, our approach can be trained on multiple categories. The key idea of our method is to solve the problem in two stages, both of which can be implemented using powerful 2D image-to-image translation networks. In the first stage, we recover an object coordinate map from the input RGB image. This is similar to a depth image, but it corresponds to a point cloud in object-centric coordinates that is independent of camera pose. In the second stage, we reproject the object space point cloud into depth images from eight fixed viewpoints and perform depth map completion. We can then trivially fuse all completed object space depth maps into a final 3D reconstruction, without requiring a separate alignment step such as the iterative closest point algorithm (ICP) [2]. Since all networks are based on 2D convolutions, it is straightforward to achieve high resolution reconstructions with this approach. Texture reconstruction uses the same pipeline, but operates on RGB images instead of object space depth maps.

We train our approach on a multi-category dataset and show that our object-centric, two-stage approach leads to better generalization than competing techniques. In addition, recovering object space point clouds allows us to avoid a separate camera pose estimation step. In summary, our main contributions are as follows:

  • A strategy to generate 3D shapes from single RGB images in a two-stage approach, by first recovering object coordinate images as an intermediate representation, and then performing reprojection, depth map completion, and a final trivial fusion step in object space.

  • The first work to train a single network to reconstruct point clouds with RGB textures on multiple categories.

  • More accurate reconstruction results than previous methods on both seen and unseen categories from ShapeNet [3] or Pix3D [22] datasets.

Figure 1: Approach overview. An input image I is passed through a 2D-3D network to reconstruct the visible parts of the object, represented by an object coordinate image C. I and C provide the texture and 3D coordinates of the shape respectively, and are combined by a Joint Texture and Shape Mapping operator into a partial shape with texture, P. Next, by Joint Projection, P is projected from 8 fixed viewpoints into 8 pairs of partial depth maps and texture maps, which are translated into completed maps by the Multi-view Texture-Depth Completion Net (MTDCN), which jointly completes texture and depth maps. Alternatively, the Multi-view Depth Completion Net (MDCN) completes only the depth maps. Finally, the Joint Fusion operator fuses the completed multi-view texture and depth maps into a completed point cloud.

2 Related Work

Our method is mainly related to single image 3D reconstruction and shape completion. We briefly review previous works in these two aspects.

Single image 3D reconstruction. Along with the development of deep learning techniques, single image 3D reconstruction has made huge progress. Because of their regularity, early works mainly learned to reconstruct voxel grids from 3D supervision [4] or 2D supervision [23] using differentiable renderers [29, 25]. However, these methods can only reconstruct shapes at low resolutions such as 32³ or 64³, due to the cubic complexity of voxel grids. Although various strategies [8, 24] were proposed to increase the resolution, they add considerable complexity to the pipeline. Mesh-based methods [27, 16] are another way to increase the resolution. However, they still struggle with arbitrary topology, since the vertex topology of reconstructed shapes is mainly inherited from the template. Point-cloud-based methods [7, 19, 32, 17] provide another direction for single image 3D reconstruction, but they too are bottlenecked by low resolution, which makes it hard to reveal fine geometric details.

Besides low resolution, lack of texture is another issue that significantly affects the realism of generated shapes. Current methods map the texture from single images to reconstructed shapes represented either by mesh templates [13] or by point clouds in the form of object coordinate maps [21]. Although these methods have shown promising results on some specific shape classes, they usually only work in category-specific reconstruction. In addition, the texture prediction pipeline of [13], which samples pixels directly from the input image, only works on symmetric objects observed from a favorable viewpoint. Though some other methods (e.g. [34, 23]) predict plausible novel RGB views by view synthesis, they too only work on category-specific reconstruction.

Different from all these methods, our method jointly learns to reconstruct high resolution geometry and texture through a two-stage reconstruction, taking object coordinate maps (also called NOCS maps in [26, 21]) as an intermediate representation. Unlike previous methods [33, 32] that use depth maps as the intermediate representation, our pipeline does not require camera pose information.

Shape completion. Shape completion aims to infer the complete 3D geometry of an object from partial observations. Existing methods use volumetric grids [5] or point clouds [31, 30, 1] as the shape representation for the completion task. Point-based methods mainly rely on encoder-decoder structures that employ the PointNet architecture [18] as a backbone. Although these works produce plausible completed shapes, they are limited to low resolution. To resolve this issue, Hu et al. [9] introduced Render4Completion, which casts 3D shape completion as multiple 2D view completions and demonstrates promising potential for high resolution shape completion. Our method follows this direction; however, we learn not only geometry but also texture, which clearly sets our method apart.

3 Approach

Most 3D point cloud reconstruction methods [17, 4, 6] solely focus on generating 3D shapes from input RGB images, i.e., learning a mapping from an image I to a set of 3D coordinates (x, y, z). Recovering texture in addition to geometry is a more challenging task, which requires learning a mapping from I to colored points (x, y, z, r, g, b), where (r, g, b) are RGB values.

We propose a method to generate high resolution 3D predictions and recover textures from RGB images. At a high level, we decompose the reconstruction problem into two less challenging tasks: first, transforming 2D images to 3D partial shapes that correspond to the observed parts of the target object, and second, completing the unseen parts of the 3D object. We use object coordinate images to represent partial 3D shapes, and multiple depth and RGB views to represent completed 3D shapes.

As shown in Fig. 1, our pipeline consists of four sub-modules: (1) the 2D-3D Net, an image translation network that translates an RGB image I into a partial shape, represented by an object coordinate image C; (2) the Joint Projection module, which first combines C with the texture from I to obtain P, a partial shape with texture, and then jointly projects P from 8 fixed viewpoints (the 8 vertices of a cube) into 8 pairs of partial depth and texture maps; (3) the multi-view texture and depth completion module, which consists of two networks: the Multi-view Texture-Depth Completion Net (MTDCN), which jointly completes partial texture and depth maps, and, as an alternative, the Multi-view Depth Completion Net (MDCN), which completes only the depth maps and generates more accurate results; (4) the Joint Fusion module, which fuses the completed depth and texture maps into a completed 3D shape with texture.

3.1 2D RGB Image to Partial Shapes

We propose to use 3-channel object coordinate images to represent partial shapes. Each pixel of an object coordinate image represents a 3D point, with its value corresponding to the point's location (x, y, z). The object coordinate image C is aligned with the input image, as shown in Figure 1, and in our pipeline it represents the visible parts of the target 3D object. With this image-based 3D representation, we formulate the 2D-to-3D transformation as an image-to-image translation problem, and propose a 2D-3D Net to perform the translation based on the U-Net [20] architecture, as in [11].

Unlike the depth map representation used in [33] and [32], which requires camera pose information for back-projection, the 3-channel object coordinate image represents a 3D shape on its own. Note that our network implicitly infers the camera pose of the input RGB image, so that the generated partial shape is aligned with the ground truth 3D shape.

3.2 Partial Shapes to Multiple Views

In this module, we transform the input RGB image I and the predicted object coordinate image C into a partial shape with texture, P, which is then rendered from 8 fixed viewpoints to generate depth maps and texture maps. The process is illustrated in Fig. 2.

Joint Texture and Shape Mapping. The input RGB image I is aligned with the generated object coordinate image C. An equivalent partial point cloud can therefore be obtained by taking 3D coordinates from C and texture from I.

Figure 2: Joint Projection.

We denote a pixel of I as I(u, v) = (r, g, b), where u and v are pixel coordinates, and similarly a point of C as C(u, v) = (x, y, z). Since I(u, v) and C(u, v) appear at the same location, they can be combined into a colored 3D point (x, y, z, r, g, b) of the partial shape P.
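The Joint Texture and Shape Mapping step above can be sketched in a few lines of numpy. This is a minimal illustration, not the authors' implementation; in particular, the convention that background pixels carry an all-zero coordinate value is an assumption of this sketch.

```python
import numpy as np

def joint_texture_shape_mapping(rgb, coord, background=0.0):
    """Combine an RGB image I and an object coordinate image C into a
    colored partial point cloud of rows (x, y, z, r, g, b).

    rgb:   (H, W, 3) array of RGB values
    coord: (H, W, 3) array of per-pixel object coordinates (x, y, z)
    Pixels whose coordinates equal `background` on every channel are
    treated as empty and skipped (an assumption of this sketch).
    """
    assert rgb.shape == coord.shape
    # Foreground mask: any coordinate channel differs from the background value.
    mask = np.any(coord != background, axis=-1)
    xyz = coord[mask]      # (N, 3) object-space coordinates from C
    colors = rgb[mask]     # (N, 3) per-point RGB texture from I
    return np.concatenate([xyz, colors], axis=1)  # (N, 6)

# Toy example: a 2x2 image with a single foreground pixel.
rgb = np.zeros((2, 2, 3)); coord = np.zeros((2, 2, 3))
rgb[0, 0] = [1.0, 0.5, 0.0]; coord[0, 0] = [0.1, 0.2, 0.3]
points = joint_texture_shape_mapping(rgb, coord)
```

Because I and C are pixel-aligned, no correspondence search is needed; the mapping is a masked concatenation.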

Joint Projection. We render depth maps D_i and texture maps T_i, i = 1, …, 8, from 8 fixed viewpoints v_i of the partial shape P.

Given viewpoint v_i, we denote a point of the depth map D_i as D_i(u, v) = z, where u and v are pixel coordinates and z is the depth value. Similarly, a point of the texture map T_i is T_i(u, v) = (r, g, b). We then transform each 3D point p = (x, y, z)^T of the partial shape P into a pixel (u, v) of the depth map D_i by

    z_c · [u, v, 1]^T = K (R_i p + t_i),     (1)

where K is the intrinsic camera matrix, R_i and t_i are the rotation matrix and translation vector of view v_i, and z_c is the depth in the camera frame. Note that Eq. (1) only projects the 3D coordinates of P.

However, different points of P may be projected to the same pixel of the depth map D_i. For example, in Fig. 2, several points p_1, …, p_k are projected to the same pixel (u, v) of D_i, whose corresponding point of the texture map is T_i(u, v).

To alleviate this collision effect, we implement a pseudo-rendering technique similar to [10, 15]. Specifically, for each pixel of D_i, a depth buffer is used to store the multiple depth values that project to that pixel. We then apply a depth-pooling operator over the buffer to select the minimum depth value, i.e., the point closest to the viewpoint. During depth-pooling, we store the pooling indices: for example, in Fig. 2, the point closest to the camera among p_1, …, p_k wins the depth test, and its RGB values are copied to the corresponding pixel T_i(u, v) of the texture map.
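The projection of Eq. (1) plus depth-pooling can be sketched as follows. This is a simplified single-sample z-buffer rather than the paper's depth-buffer-with-pooling scheme (one depth slot per pixel instead of a buffer of several), and the rounding to integer pixels is an assumption of this sketch.

```python
import numpy as np

def project_colored_points(points, K, R, t, H, W):
    """Project colored 3D points (x, y, z, r, g, b) into a depth map D_i
    and a texture map T_i for one viewpoint, following Eq. (1).

    Collisions are resolved by keeping the point with minimum camera-space
    depth (the point closest to the viewpoint), and its texture is copied
    to the texture map, mirroring the depth-pooling step.
    """
    depth = np.full((H, W), np.inf)   # empty pixels stay at +inf
    texture = np.zeros((H, W, 3))
    xyz, rgb = points[:, :3], points[:, 3:]
    cam = (R @ xyz.T + t.reshape(3, 1)).T   # camera-space coordinates R_i p + t_i
    for (x, y, z), c in zip(cam, rgb):
        if z <= 0:                          # behind the camera: skip
            continue
        u = int(round(K[0, 0] * x / z + K[0, 2]))   # perspective division
        v = int(round(K[1, 1] * y / z + K[1, 2]))
        if 0 <= v < H and 0 <= u < W and z < depth[v, u]:
            depth[v, u] = z                 # depth-pooling: keep the minimum depth
            texture[v, u] = c               # copy texture of the winning point
    return depth, texture
```

A production version would vectorize the loop and use the K×K sub-pixel buffer described in [10, 15], but the collision-resolution logic is the same.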

3.3 Multi-view Texture and Depth Completion

In our pipeline, a full shape is represented by depth images from multiple views, which are processed by CNNs to generate high resolution 3D shapes as mentioned in [15, 9].

Multi-view Texture-Depth Completion Net (MTDCN). We propose a Multi-view Texture-Depth Completion Net (MTDCN) to jointly complete texture and depth maps. MTDCN is based on a U-Net architecture. In our pipeline, we stack each pair of partial depth map D_i and texture map T_i into a 4-channel texture-depth map x_i, i = 1, …, 8. MTDCN takes x_i as input and generates a completed 4-channel texture-depth map consisting of the completed texture map and depth map. Completions for a car model are shown in Fig. 3. After fusing these views, we obtain a completed shape with texture, as shown in Fig. 1.
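The 4-channel input construction is a simple channel-wise stack; a minimal sketch (array layout is an assumption, the paper does not specify memory order):

```python
import numpy as np

def make_texture_depth_map(texture, depth):
    """Stack a partial texture map (H, W, 3) and a partial depth map (H, W)
    into the 4-channel texture-depth map consumed by MTDCN."""
    assert texture.shape[:2] == depth.shape
    return np.concatenate([texture, depth[..., None]], axis=-1)  # (H, W, 4)

# One map per viewpoint; the 8 views form the network's batch dimension.
maps = np.stack([make_texture_depth_map(np.zeros((4, 4, 3)), np.zeros((4, 4)))
                 for _ in range(8)])
```

The completion network then translates each (H, W, 4) map to a completed map of the same shape.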

In contrast to the category-specific reconstruction in [13], which samples texture from the input image and whose performance therefore relies on the viewpoint of the input image and the symmetry of the target object, MTDCN can be trained to infer textures on multiple categories and does not assume objects are symmetric.

Multi-view Depth Completion Net (MDCN). In our experiments, we found it very challenging to complete both depth and texture maps at the same time. As an alternative, we also train MDCN, which completes only the partial depth maps and generates more accurate completed depth maps. We then map the texture generated by MTDCN onto the MDCN-generated shape to obtain a reconstructed shape with texture, as illustrated in Fig. 1.

Different from the multi-view completion net in [9], which only completes 1-channel depth maps, MTDCN jointly completes both texture and depth maps. It should also be mentioned that, in contrast to [9], there is no discriminator in MTDCN or MDCN.

3.4 Joint Fusion

Given the completed texture and depth maps from MTDCN, or the more accurate completed depth maps from MDCN, we jointly fuse the depth and texture maps into a colored 3D point cloud, as illustrated in Fig. 1.

Joint Fusion for MTDCN. Given a point D_i(u, v) = z of a completed depth map and the aligned point T_i(u, v) = (r, g, b) of the completed texture map, the back-projected 3D point p is obtained by inverting Eq. (1):

    p = R_i^T (K^{-1} · z [u, v, 1]^T − t_i).     (2)
Figure 3: Completions of texture and depth maps.

Note that Eq. (2) only back-projects the depth map to 3D coordinates, while the texture of p is obtained from the aligned texture map, i.e., (r, g, b) = T_i(u, v). We can also extract a completed shape without texture.

Joint Fusion for MDCN. We map the texture generated by MTDCN onto the completed shape from MDCN. The joint fusion process is similar. However, since texture and depth maps are generated separately, a valid point of a depth map may be aligned with an invalid point of the corresponding texture map, especially near edges. For such points, we take the nearest valid neighbor on the texture map. Since the MDCN shape is generated by direct fusion of the completed depth maps, its geometry is the same whether or not texture is mapped onto it.
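The fusion step, which back-projects each completed view via Eq. (2) and concatenates the resulting point clouds, can be sketched as below. The convention that invalid pixels carry non-finite or non-positive depth is an assumption of this sketch.

```python
import numpy as np

def fuse_depth_maps(depth_maps, textures, Ks, Rs, ts):
    """Back-project completed depth maps into object space (Eq. (2)) and
    concatenate the per-view point clouds into one colored point cloud.

    Invalid pixels are assumed to carry non-finite or non-positive depth.
    No alignment (e.g. ICP) is needed: all views share the object frame.
    """
    all_points = []
    for depth, tex, K, R, t in zip(depth_maps, textures, Ks, Rs, ts):
        v, u = np.nonzero(np.isfinite(depth) & (depth > 0))  # valid pixels
        z = depth[v, u]
        pix = np.stack([u, v, np.ones_like(u)]).astype(float) * z  # z * [u, v, 1]^T
        cam = np.linalg.inv(K) @ pix                                # camera space
        obj = R.T @ (cam - t.reshape(3, 1))                         # Eq. (2)
        # Texture is read from the aligned pixel of the texture map.
        all_points.append(np.concatenate([obj.T, tex[v, u]], axis=1))
    return np.concatenate(all_points, axis=0)  # (N, 6) colored points
```

Because every view is back-projected into the same object-centric frame, fusion is a plain concatenation; this is the property that lets the pipeline skip a separate ICP alignment stage.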

3.5 Loss Function and Optimization

Training Objective. We perform two-stage training of three networks: the 2D-3D Net G_c, MTDCN G_x, and MDCN G_d. Given an input RGB image I, the generated object coordinate image is C = G_c(I). The training objective of G_c is the L1 reconstruction loss

    L(G_c) = || G_c(I) − C* ||_1,     (3)

where C* is the ground truth object coordinate image.

Given partial texture-depth maps x_i, i = 1, …, 8, and the completed texture-depth maps G_x(x_i), we obtain the optimal G_x by minimizing

    L(G_x) = Σ_i || G_x(x_i) − x_i* ||_1,     (4)

where x_i* is the ground truth texture-depth image.

MDCN completes only depth maps and takes 1-channel depth maps as input. Given a partial depth map D_i, the completed depth map is G_d(D_i). G_d is trained with

    L(G_d) = Σ_i || G_d(D_i) − D_i* ||_1,     (5)

where D_i* is the ground truth depth image.

Optimization. We use minibatch SGD with the Adam optimizer [14] to train all networks. More details can be found in the supplementary material.

4 Experiments

We evaluate our methods (Ours-MDCN, whose geometry is generated by MDCN, and Ours-MTDCN, generated by MTDCN) on single-image 3D reconstruction and compare against state-of-the-art methods.

Dataset and Metrics. We train all our networks on synthetic models from ShapeNet [3], and evaluate them on both ShapeNet and Pix3D [22]. We render depth maps, texture maps, and object coordinate images for each object; more details, including the rendering resolution, can be found in the supplementary material. We sample 100K points from each mesh object as the ground truth point cloud for evaluations on ShapeNet, as in [15]. For a fair comparison, we use Chamfer Distance (CD) [7] as the quantitative metric. Another popular option, Earth Mover's Distance (EMD) [7], requires the generated point cloud to have the same size as the ground truth, and its calculation is time-consuming. While EMD is often used for methods whose output is sparse and of fixed size, such as the 1024 or 2048 points in [6, 17], it is not suitable for evaluating our methods, which generate very dense point clouds with a varying number of points.
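For reference, a brute-force Chamfer Distance can be written in a few lines. Conventions vary across papers (squared vs. unsquared distances, sum vs. mean of the two directions); this sketch uses the common squared-distance, sum-of-means form, which is an assumption rather than the paper's exact definition.

```python
import numpy as np

def chamfer_distance(P, Q):
    """Symmetric Chamfer Distance between point sets P (N, 3) and Q (M, 3).

    For each point in P, find the squared distance to its nearest neighbor
    in Q, and vice versa; return the sum of the two mean values.
    Brute force O(N*M); a KD-tree would be used for large clouds.
    """
    d2 = np.sum((P[:, None, :] - Q[None, :, :]) ** 2, axis=-1)  # (N, M) pairwise
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```

Unlike EMD, this metric needs no one-to-one matching, so the two clouds may have different sizes, which is why it suits dense, variable-size outputs.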

4.1 Single Object Category

We first evaluate our method on a single object category. Following [29, 15], we use the chair category from ShapeNet with the same 80%-20% training/test split. We compare against two methods that generate dense point clouds by view synthesis (Tatarchenko et al. [23] and Lin et al. [15]), as well as two voxel-based methods: Perspective Transformer Networks (PTN) [29] in two variants, and a baseline 3D-CNN provided in [29].

The quantitative results on the test set are reported in Table 1; the results of the other approaches are quoted from [15]. Our method (Ours-MDCN) achieves the lowest CD on this single-category task. A visual comparison with Lin's method is shown in Fig. 4, where our generated point clouds are denser and more accurate. In addition, we also infer the textures of the generated point clouds.

4.2 General Object Categories from ShapeNet

We also train our network simultaneously on 13 categories from ShapeNet (listed in Table 3) and use the same 80%-20% training/test split as existing methods [4, 17].

Reconstruct novel objects from seen categories. We test our method on novel objects from the 13 seen categories and compare against (a) 3D-R2N2 [4], which predicts volumetric models with recurrent networks; (b) PSGN [6], which predicts an unordered set of 1024 3D points with fully-connected and deconvolutional layers; and (c) 3D-LMNet [17], which predicts point clouds by latent-embedding matching. We only compare against methods that follow the same setting as 3D-R2N2, and do not include [15], which assumes fixed elevation, or OptMVS [28]. We use the pretrained models provided by the authors; the results of 3D-R2N2 and PSGN are quoted from [15]. Note that we extract the surface voxels of 3D-R2N2 for evaluation.

Table 3 shows the quantitative results. Since most methods need ICP alignment as a post-processing step to achieve finer alignment with the ground truth, we list the results both without and with ICP. In particular, PSGN predicts rotated point clouds, so we only list its results after ICP alignment. Ours-MDCN outperforms the state-of-the-art methods on most categories. Specifically, we outperform 3D-LMNet on 12 out of 13 categories without ICP, and on 7 with ICP. In addition, we achieve the lowest CD on average. Unlike other methods, ours does not rely heavily on ICP; more analysis can be found in Section 4.4.

We also visualize the predictions in Fig. 6. Our method predicts more accurate shapes with higher point density. Besides 3D coordinates, our methods also predict textures. We show Ours-MDCN from two different views, (v1) and (v2).

Figure 4: Reconstructions on single-category task.

Reconstruct objects from unseen categories. We also evaluate how well our model generalizes to 6 unseen categories from ShapeNet: bed, bookshelf, guitar, laptop, motorcycle, and train. The quantitative comparison with 3D-LMNet in Table 4 shows the better generalization of our method: we outperform 3D-LMNet on 4 out of 6 categories, both before and after ICP. Qualitative results are shown in Fig. 5. Our methods perform reasonably well on the reconstruction of beds and guitars, while 3D-LMNet interprets these inputs as a sofa or a lamp from the seen categories, respectively.

Method | CD
3D CNN (vol. loss only) | 4.49
PTN (proj. loss only) | 4.35
PTN (vol. & proj. loss) | 4.43
Tatarchenko et al. [23] | 5.40
Lin et al. [15] | 3.53
Ours-MTDCN | 3.68
Ours-MDCN | 3.04
Table 1: CD on single-category task.
Category | Partial shape | Completed (Ours-MDCN)
airplane | 10.53 | 4.19
bench | 7.85 | 3.40
cabinet | 19.07 | 4.88
car | 11.14 | 2.90
chair | 8.69 | 3.59
display | 12.43 | 4.71
lamp | 11.95 | 6.18
loudspeaker | 20.26 | 6.39
rifle | 9.47 | 5.44
sofa | 10.86 | 4.07
table | 8.83 | 3.27
telephone | 9.83 | 3.16
vessel | 9.08 | 3.79
mean | 10.58 | 3.91
chair (single category) | 9.04 | 3.04
Table 2: Mean CD of partial shape and completed shape to ground truth.
Category | 3D-R2N2 | PSGN | 3D-LMNet | Ours-MTDCN | Ours-MDCN
airplane | (4.79) | (2.79) | 6.16 (2.26) | 3.70 (3.37) | 4.19 (3.66)
bench | (4.93) | (3.80) | 5.79 (3.72) | 4.27 (3.83) | 3.40 (3.10)
cabinet | (4.04) | (4.91) | 6.98 (4.46) | 6.77 (5.89) | 4.88 (4.50)
car | (4.81) | (3.85) | 3.17 (2.91) | 2.93 (2.95) | 2.90 (2.90)
chair | (4.93) | (4.24) | 7.08 (3.74) | 4.47 (4.12) | 3.59 (3.22)
display | (5.04) | (4.25) | 7.89 (3.72) | 5.55 (4.94) | 4.71 (3.85)
lamp | (13.03) | (4.56) | 11.36 (4.57) | 8.06 (7.13) | 6.18 (5.65)
loudspeaker | (6.69) | (6.00) | 7.95 (5.46) | 9.53 (8.28) | 6.39 (5.74)
rifle | (6.64) | (2.67) | 4.46 (2.55) | 5.31 (4.28) | 5.44 (4.30)
sofa | (5.50) | (5.38) | 6.06 (4.44) | 4.43 (3.93) | 4.07 (3.57)
table | (5.26) | (4.10) | 6.65 (3.84) | 4.59 (4.26) | 3.27 (3.14)
telephone | (4.61) | (3.50) | 3.91 (3.10) | 4.98 (4.72) | 3.16 (2.90)
vessel | (6.82) | (3.59) | 6.30 (3.81) | 4.13 (3.85) | 3.79 (3.52)
mean | (5.93) | (4.13) | 6.14 (3.59) | 4.68 (4.26) | 3.91 (3.56)
Table 3: Average CD of the multiple-seen-category experiments on ShapeNet. Numbers outside parentheses are the CD before ICP; numbers in parentheses are after ICP.
Category | 3D-LMNet | Ours-MTDCN | Ours-MDCN
bed | 13.56 (7.13) | 12.82 (8.43) | 11.46 (6.51)
bookshelf | 7.47 (4.68) | 8.99 (7.96) | 5.63 (4.89)
guitar | 8.19 (6.40) | 7.07 (7.29) | 5.96 (6.33)
laptop | 19.42 (5.21) | 9.76 (7.58) | 7.08 (5.67)
motorcycle | 7.00 (5.91) | 7.32 (6.75) | 7.03 (5.79)
train | 6.59 (4.07) | 9.16 (4.38) | 9.54 (3.93)
mean | 10.37 (5.57) | 9.19 (7.06) | 7.79 (5.52)
Table 4: Average CD of multiple-unseen-category experiments on ShapeNet.
Category | PSGN | 3D-LMNet | OptMVS | Ours-MTDCN | Ours-MDCN
chair | (8.98) | 9.50 (5.46) | 8.86 (7.23) | 8.35 (7.40) | 7.28 (6.05)
sofa | (7.27) | 7.82 (6.54) | 8.25 (8.00) | 8.54 (7.18) | 8.41 (6.83)
table | (8.84) | 13.57 (7.62) | 9.09 (8.88) | 9.52 (9.06) | 8.53 (7.97)
mean-seen | (8.55) | 9.73 (6.04) | 8.75 (7.67) | 8.54 (7.55) | 7.74 (6.53)
bed* | (9.23) | 13.11 (9.02) | 12.69 (9.01) | 10.91 (8.41) | 11.04 (8.19)
bookcase* | (8.24) | 8.32 (6.64) | 8.10 (8.35) | 10.38 (9.72) | 8.99 (8.44)
desk* | (8.40) | 11.75 (7.72) | 9.01 (8.50) | 8.64 (8.16) | 7.64 (7.18)
misc* | (9.84) | 13.45 (11.34) | 13.82 (12.36) | 12.58 (11.03) | 11.48 (9.30)
tool* | (11.20) | 13.64 (9.09) | 14.98 (11.27) | 13.27 (11.70) | 12.18 (9.02)
wardrobe* | (7.84) | 9.46 (6.96) | 6.96 (7.26) | 9.15 (8.80) | 8.33 (8.26)
mean-unseen | (8.81) | 11.67 (8.22) | 10.48 (8.83) | 10.19 (8.86) | 9.57 (8.07)
Table 5: Average CD on both seen and unseen category on Pix3D dataset. All numbers are scaled by 100. ‘*’ indicates unseen category.
Figure 5: Results on ShapeNet unseen categories.
Figure 6: Reconstructions of seen categories on the ShapeNet dataset. 'C' is the generated object coordinate image, and 'GT' is another view of the target object. Ours-MTDCN is generated by MTDCN; Ours-MDCN (v1) and Ours-MDCN (v2) are two views of the MDCN result.

4.3 Real-world Images from Pix3D

To test the generalization of our approach to real-world images, we evaluate our trained model on the Pix3D dataset [22]. We compare against the state-of-the-art methods PSGN [6], 3D-LMNet [17], and OptMVS [28]. Following [17] and [22], we uniformly sample 1024 points from each mesh as the ground truth point cloud to calculate CD, and remove images with occlusion and truncation. We also provide results using denser ground truth point clouds in the supplementary material. We have 4476 test images from seen categories and 1048 from unseen categories.

Reconstruct novel objects from seen categories in Pix3D. We test the methods on 3 seen categories (chair, sofa, table) that also occur in the 13 ShapeNet training categories; the results are shown in Table 5. Even on real-world data, our networks generate well-aligned shapes, while the other methods rely largely on ICP. Qualitative results are shown in Fig. 7. Our method performs well on real images and generates denser point clouds with reasonable textures. Besides more accurate alignment, our method also predicts better shapes, e.g. the aspect ratio in the 'Table' example.

Reconstruct objects from unseen categories in Pix3D. We also test the pretrained models on 6 unseen categories (bed, bookcase, desk, misc, tool, wardrobe); the results are shown in Table 5. Our methods outperform the other approaches [6, 28, 17] in mean CD both with and without ICP alignment. Fig. 7 shows a qualitative comparison. For 'Bed-1' and 'Bed-2', our methods generate reasonable beds, while 3D-LMNet produces sofa- or car-like objects. Similarly, we generate a reasonable 'Desk-1' and recover the main structure of the input. For 'Desk-2', our method estimates the aspect ratio more accurately and recovers some details of the target object, such as the curved legs. For 'Bookcase', ours generates a reasonable shape, while OptMVS and 3D-LMNet interpret it as a chair. In addition, we also successfully predict textures for unseen categories on real images.

Figure 7: Reconstructions on Pix3D dataset. ‘C’ is object coordinate image, and ‘GT’ is ground truth model.

4.4 Ablation Study

Contributions of each reconstruction stage to the final shape. Since both the 2D-3D Net and the view completion nets perform reconstruction, Table 2 compares the generated partial shape with the completed shape (Ours-MDCN) in terms of their CD to the ground truth, on both the multiple-category and single-category (chair) tasks. For the former, the mean CD decreases from 10.58 to 3.91 after the second stage.

Reconstruction accuracy of MTDCN and MDCN. As shown in Tables 1, 3, 4 and 5 and Figures 4, 6 and 7, MDCN generates denser point clouds with smoother surfaces, and its mean CD is lower. Fig. 3 highlights that the maps completed by MDCN are more accurate than those of MTDCN.

Method | S-seen | S-unseen | P-seen | P-unseen
3D-LMNet | 0.42 | 0.46 | 0.38 | 0.30
Ours-MDCN | 0.09 | 0.29 | 0.16 | 0.16
Table 6: Relative CD improvements after ICP.

The impact of ICP alignment on reconstruction results. Besides CD, pose estimation should also be evaluated when comparing different reconstruction methods. We evaluate the pose estimation of 3D-LMNet and our method by comparing the relative mean improvement of CD after ICP alignment in Table 6 (S: ShapeNet, P: Pix3D), calculated from the data in Tables 3, 4 and 5. A bigger improvement means a worse alignment. Although the generated shapes of 3D-LMNet are supposed to be aligned with the ground truth, its performance still relies heavily on ICP alignment. Our methods rely less on ICP, which implies that our pose estimation is more accurate. We use the same ICP implementation as 3D-LMNet [17].
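The Table 6 metric can be reproduced directly from the mean CDs in Table 3, assuming it is defined as the fractional decrease in mean CD after ICP:

```python
def relative_improvement(cd_before, cd_after):
    """Relative CD improvement from ICP alignment, as used in Table 6
    (assumed definition: fractional decrease in mean CD)."""
    return (cd_before - cd_after) / cd_before

# Mean CDs on seen ShapeNet categories, from Table 3 (before/after ICP):
lmnet = relative_improvement(6.14, 3.59)   # 3D-LMNet
ours = relative_improvement(3.91, 3.56)    # Ours (MDCN)
```

Rounding these to two decimals recovers the S-seen column of Table 6 (0.42 for 3D-LMNet, 0.09 for ours), confirming the assumed definition.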

4.5 Discussion

In sum, our method predicts shapes better, including pose, size, and aspect ratio, as seen in Fig. 7. We attribute this to the use of an intermediate representation: object coordinate images, which contain only the visible parts, are easier to infer than direct reconstructions from images as in [6, 28, 17]. Furthermore, the predicted partial shapes constrain the view completion nets to generate aligned shapes. In addition, our method generalizes to unseen categories better than existing methods; the qualitative results in Figs. 5 and 7 show that it captures more generic, class-agnostic shape priors for object reconstruction.

However, our generated texture is somewhat blurry, since we regress pixel values instead of predicting texture flow [13], which predicts texture coordinates and samples pixel values directly from the input to yield realistic textures. However, the texture prediction of [13] can only be applied to category-specific tasks with a favorable viewpoint of a symmetric object, so it cannot be applied directly to multiple-category reconstruction. We would like to study how to combine pixel regression and texture flow prediction to predict realistic textures on multiple categories.

5 Conclusion

We propose a two-stage method for 3D reconstruction from single RGB images that leverages object coordinate images as an intermediate representation. Our pipeline generates denser point clouds than previous methods and also predicts textures in multiple-category reconstruction tasks. Experiments show that our method outperforms existing methods on both seen and unseen categories, on synthetic as well as real-world datasets.