Render4Completion: Synthesizing Multi-view Depth Maps for 3D Shape Completion

04/17/2019, by Tao Hu et al.

We propose a novel approach to 3D shape completion by synthesizing multi-view depth maps. While previous work on shape completion relies on volumetric representations, meshes, or point clouds, we propose to use multi-view depth maps from a set of fixed viewing angles as our shape representation. By casting shape completion as an image-to-image translation problem, this frees us from the memory limitations of volumetric representations and point clouds. Specifically, we render depth maps of the incomplete shape from a fixed set of viewpoints and perform depth map completion in each view. Unlike image-to-image translation networks that complete each view separately, our novel multi-view completion net (MVCN) leverages information from all views of a 3D shape to help the completion of each single view. This allows MVCN to achieve higher accuracy in single-view depth completion and to keep the completed depth images consistent across views. Benefiting from the multi-view representation and the novel network structure, MVCN significantly improves the accuracy of 3D shape completion on large-scale benchmarks compared to the state of the art.


1 Introduction

Shape completion is a challenging problem in 3D shape analysis. It underlies applications such as 3D scanning in robotics, autonomous driving, and 3D modeling and fabrication. While learning-based methods that leverage large shape databases have achieved significant advances recently, choosing a suitable 3D representation for such tasks remains difficult. Volumetric approaches such as binary voxel grids or distance functions have the advantage that convolutional neural networks can readily be applied, but the third dimension increases memory requirements and limits the resolutions that can be handled. On the other hand, point-based techniques provide a more parsimonious shape representation, and recently there has been much progress in generalizing convolutional networks to such irregularly sampled data. However, most generative techniques for 3D point clouds involve fully connected layers that limit the number of points and the level of shape detail that can be obtained [1, 4, 27].

In this paper, we propose a shape representation for completion that is based on multi-view depth maps. The representation consists of a fixed number of depth images taken from a set of pre-determined viewpoints. Each pixel corresponds to a 3D point, and the union of points over all depth images yields the 3D point cloud of a shape. This has the advantage that we can build on several recent advances in deep neural networks that operate on images, such as U-Net [21] and 2D convolutional networks. In addition, the number of points is not fixed, and the point density can easily be increased by using higher resolution depth images or more viewpoints.

Here we leverage this representation for shape completion. The key idea is to render multiple depth images of an incomplete shape from a set of pre-defined viewpoints, and then to complete each depth map using image-to-image translation networks. To improve completion accuracy, we further propose a multi-view completion net that leverages information from all depth views of a 3D shape to achieve high accuracy for single depth view completion. In summary, our contributions are as follows:

  • A strategy to address shape completion by rendering multi-view depth maps to represent the incomplete shape and performing image translation on these rendered views.

  • A multi-view completion net (MVCN) that significantly improves on standard image-to-image translation networks.

  • Experimental results demonstrating more accurate completions than previous methods.

Figure 1: (a) Render 8 depth maps of an incomplete shape from the 8 viewpoints of a cube; (b) the 8 rendered depth maps are passed through a multi-view GAN that generates 8 completed depth maps; (c) the 8 completed depth maps are back-projected into a completed 3D model.

2 Related Work

Deep learning on 3D shapes. Pioneering work on deep learning for 3D shapes relied on volumetric representations [14, 26], which allow the straightforward application of convolutional neural network architectures. To avoid the computation and memory costs of 3D convolutions and 3D voxel grids, multi-view convolutional neural networks were proposed early on for shape analysis tasks such as recognition and classification [19, 24]. However, these techniques do not address shape completion. In addition to volumetric and multi-view representations, point clouds have also become popular for deep learning on 3D shapes, with PointNet and its extension [18, 20] as pioneering works in this area.

3D shape completion. Shape completion can be performed on volumetric grids, as proposed by Dai et al. [3] and Han et al. [6], which are convenient for CNNs: a 3D-Encoder-Predictor CNN in [3] and an encoder-decoder CNN for patch-level geometry refinement in [6]. However, with volumetric grids the data size grows cubically with the size of the space, which severely limits resolution and applicability. To address this problem, point-based shape completion methods have been presented [1, 27, 28]. The point completion network (PCN) [28] is a recent state-of-the-art approach that extends the PointNet architecture [18] with an encoder followed by a multi-stage decoder that uses both fully connected [1] and folding [27] layers; the authors show that this decoder leads to better results than using a fully connected [1] or folding-based [27] decoder alone. However, for both voxel-based and point-based shape completion methods, the number of input and output voxels or points is still fixed. For example, the input must be voxelized on a grid [6] and the output point cloud size is 2048 [27], which can lead to loss of detail in many scenarios.

3D reconstruction from images. The problem of 3D shape reconstruction from single RGB images shares similarities with 3D shape completion, but is arguably even harder. While a complete survey of these techniques is beyond the scope of this paper, our work shares some similarities with the approach by Lin et al. [13], who use a multi-view depth map representation for shape reconstruction from single RGB images using a differentiable renderer. In contrast to their technique, we address shape completion, and our approach allows us to solve the problem directly using image-to-image translation. Soltani et al. [23] perform shape synthesis and reconstruction from multi-view depth images generated by a variational autoencoder [12]. However, they do not consider the relations between the multi-view depth images of the same model in their generative net.

Image translation and completion. A key advantage of our approach is that it allows us to leverage powerful image-to-image translation architectures to address the shape completion problem, including techniques based on generative adversarial networks (GANs) [5] and U-Net structures [21]. Based on conditional GANs, image-to-image translation networks can be applied to a variety of tasks [9]. Iizuka et al. [7] and Portenier et al. [17] propose to use conditional GANs for image completion and editing. However, each image is completed individually in their networks. We propose a network that combines information from other related images to help the completion of a single image.

3 Method

3.1 A multi-view representation

As discussed above, high-resolution completion is difficult to achieve with existing methods that operate on voxels or point clouds due to limitations of memory or network structure. In contrast, multi-view representations of 3D shapes can circumvent these obstacles to achieve high-resolution, dense completion. As shown in Fig. 1(a), given an incomplete point cloud, our method starts by rendering 8 depth maps of this point cloud. Specifically, the renderings are generated by placing 8 virtual cameras at the 8 vertices of a cube enclosing the shape, all pointing towards the centroid of the shape. We also render 8 depth maps from the ground truth point cloud, and we use these image pairs to train our network.
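As an illustration of this camera setup, here is a small NumPy sketch that places 8 virtual cameras at the vertices of a cube and builds a look-at extrinsic for each; the function names, the up vector, and the cube half-extent are our own illustrative assumptions, not the paper's rendering code.

```python
import numpy as np

def look_at(eye, target, up=np.array([0.0, 1.0, 0.0])):
    """Build a world-to-camera rotation R and translation t so that a camera
    at `eye` looks toward `target` (standard look-at construction, sketch only)."""
    forward = target - eye
    forward = forward / np.linalg.norm(forward)
    right = np.cross(forward, up)
    right = right / np.linalg.norm(right)
    true_up = np.cross(right, forward)
    R = np.stack([right, true_up, -forward])  # rows are the camera axes in world coordinates
    t = -R @ eye
    return R, t

def cube_viewpoints(centroid, half_extent=1.0):
    """Place 8 virtual cameras at the vertices of a cube enclosing the shape,
    all pointing toward the shape centroid, and return their (R, t) extrinsics."""
    centroid = np.asarray(centroid, dtype=float)
    cameras = []
    for sx in (-1.0, 1.0):
        for sy in (-1.0, 1.0):
            for sz in (-1.0, 1.0):
                eye = centroid + half_extent * np.array([sx, sy, sz])
                cameras.append(look_at(eye, centroid))
    return cameras  # one (R, t) pair per depth map
```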

With this multi-view representation, the shape completion problem can be formulated as image-to-image translation, i.e., translating an incomplete depth map into the corresponding complete depth map. This lets us take full advantage of several recent advances in network structures that operate successfully on images, such as U-Net based generators and GANs. After the completion network shown in Fig. 1(b), we obtain the 8 completed depth maps in Fig. 1(c), which can be back-projected into a completed point cloud.

3.2 Multi-view depth map completion

For the completion problem, we need to learn a mapping from an incomplete depth map x_i, rendered from a partial 3D shape X (i = 1, ..., 8), to a completed depth map. We expect each completed depth map to be as similar as possible to the corresponding depth map y_i of the ground truth 3D shape Y.

Although completing each depth map of a 3D shape separately is a sensible way to complete the 3D shape, it has two disadvantages. First, we cannot keep the completed depth maps of the same 3D shape consistent with each other, which affects the accuracy of the completed 3D shape obtained by back-projecting them. Second, we cannot leverage information from the other depth maps of the same 3D shape while completing a single depth map, which limits the accuracy of single-view completion, since views of the same 3D model share common information, such as the global shape and local parts that are visible in several views.

To resolve these issues, we propose a multi-view completion net (MVCN) that completes each single depth image by jointly considering the global 3D shape information. In order to complete a depth image as similarly as possible to the ground truth in terms of both low-frequency correctness and high-frequency structure, MVCN is designed as a conditional GAN [5] formed by a generator G and a discriminator D. In addition, we introduce a shape descriptor d_s for each 3D shape s that contributes to the completion of every depth image rendered from s, where d_s holds the information of the shape s. The shape descriptor d_s is learned along with the other parameters of MVCN and is updated dynamically as all the depth images of shape s are completed.

3.3 The structure of MVCN

We use a U-Net based structure [21] as our generator, which has shown its effectiveness over plain encoder-decoder networks in several vision tasks such as image-to-image translation [9]. We find the U-Net structure powerful and efficient for learning depth image completion. Together with our shape descriptor, this yields the end-to-end network structure illustrated in Fig. 2.

Figure 2: Architecture of MVCN. The shape descriptor represents the information of a 3D shape, which contributes to the completion of each single depth map from the 3D shape.

We adopt the generator and discriminator architecture of [9]. MVCN consists of 8 U-Net modules operating on 256x256 inputs, and each U-Net module has two submodules, DOWN and UP. A DOWN module has the form Convolution-BatchNorm-ReLU [8, 15], and an UP module has the form UpReLU-UpConv-UpNorm. More details can be found in [9].

In MVCN, the DOWN modules extract a feature from each depth map x_i. For each 3D shape s, we learn a shape descriptor d_s by aggregating all depth features through a view-pooling layer. Since not all features are necessary to represent the global shape, we use max pooling to extract the maximum activation in each dimension across all view features as the shape descriptor, as illustrated in Fig. 2. The shape descriptor d_s then contributes to the completion of every depth image of shape s.

Specifically, for an input x_i, we employ the output of the last DOWN module as the view feature f_i and insert the view-pooling layer after it. As shown in Fig. 2, for each shape s we use a shape memory to store all the view features of s. When we obtain f_i, we first use it to update the corresponding feature map in the shape memory; for example, if i = 3, the third feature map in the shape memory is replaced with f_3. We then obtain the shape descriptor d_s that represents s in the current iteration with a view-pooling layer, i.e., by max pooling all feature maps in the shape memory of s. This strategy dynamically keeps the best view features across all training iterations. The procedure is illustrated in Fig. 2. Subsequently, the shape descriptor contributes to the completion of depth map x_i by concatenating d_s with the view feature f_i as the input of the first UP module; this concatenated feature is also passed to a later layer of the decoder through a skip connection, which makes full use of d_s in MVCN.
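The following PyTorch-style sketch illustrates the shape memory and max view-pooling mechanism described above. The class name, tensor shapes, and the choice to detach stored features are our own assumptions for illustration, not the authors' implementation.

```python
import torch

class ShapeMemory:
    """Stores the latest bottleneck feature of every view of a shape and
    max-pools them into a shape descriptor (sketch of the mechanism in Sec. 3.3)."""

    def __init__(self, num_views=8):
        self.num_views = num_views
        self.memory = {}  # shape_id -> tensor of view features, (num_views, C, H, W)

    def update_and_pool(self, shape_id, view_idx, view_feature):
        # view_feature: (C, H, W) output of the last DOWN module for this view
        feat = view_feature.detach()  # stored features detached for simplicity (assumption)
        if shape_id not in self.memory:
            # initialize all slots with the first observed feature of this shape
            self.memory[shape_id] = feat.unsqueeze(0).repeat(self.num_views, 1, 1, 1)
        # replace this view's slot with its latest feature
        self.memory[shape_id][view_idx] = feat
        # element-wise max over the views gives the shape descriptor
        descriptor, _ = self.memory[shape_id].max(dim=0)
        return descriptor

# In the generator, the descriptor is concatenated with the current view feature
# along the channel dimension before the first UP module, e.g.:
#   fused = torch.cat([view_feature, descriptor], dim=0)
```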

3.4 Loss function

Unlike approaches that focus on image generation [22], our method does not generate images from noise, which also makes our training stable, as mentioned in [7]. Therefore, the objective of our conditional GAN is

\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x}[\log(1 - D(x, G(x)))],    (1)

where the generator G tries to minimize this objective against the adversarial discriminator D, which tries to maximize it.

For the completion problem, we also expect the generator not only to deceive the discriminator but also to produce a completion result close to the ground truth. In our experiments, we find it beneficial to mix the GAN objective with a traditional loss such as the L1 or L2 distance, which is consistent with previous approaches [9, 16]. Since L1 encourages less blurring than L2, and since Eqn. 4 defines a linear mapping from a pixel in a depth image to a 3D point, we encourage the generated image to be close to the ground truth in an L1 sense rather than L2. Therefore, the loss of the generator is

\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y}[\| y - G(x) \|_1].    (2)

Our final objective in training is

G^* = \arg\min_G \max_D \; \mathcal{L}_{cGAN}(G, D) + \lambda \mathcal{L}_{L1}(G),    (3)

where λ is a balance parameter that controls the contributions of the two terms.
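A minimal PyTorch sketch of the losses in Eqns. 1-3, assuming a pix2pix-style conditional discriminator D(x, y) that outputs logits; the function names and the use of binary cross-entropy with logits are our assumptions.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D, x, y_real, y_fake):
    """Conditional GAN discriminator objective (Eqn. 1), maximized by D."""
    real_logits = D(x, y_real)
    fake_logits = D(x, y_fake.detach())
    loss_real = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
    loss_fake = F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    return loss_real + loss_fake

def generator_loss(D, x, y_real, y_fake, lam):
    """Adversarial term plus lambda-weighted L1 term (Eqns. 2 and 3), minimized by G."""
    fake_logits = D(x, y_fake)
    adv = F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
    l1 = F.l1_loss(y_fake, y_real)
    return adv + lam * l1
```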

3.5 Optimization and inference

To optimize our network, we follow the standard approach [5, 9]. The training of G and D alternates: one gradient descent step on D, then one step on G. Minibatch SGD and the Adam solver [11] are applied, with a learning rate of 0.0006 for G and 0.000006 for D, which slows down the rate at which D learns relative to G. The Adam momentum parameters are kept at their standard values, and the batch size is 32.
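A sketch of this alternating optimization, reusing the loss functions sketched after Eqn. 3. The learning rates follow the text above; the Adam betas and the default value of lam are assumed pix2pix-style settings, not values reported in the paper.

```python
import torch

def train(G, D, loader, lam=100.0, device="cuda"):
    """Alternate one gradient step on D and one on G (Sec. 3.5 sketch).
    G and D are assumed to be nn.Module instances already on `device`;
    `loader` yields batches of (incomplete, ground-truth) depth map pairs."""
    opt_D = torch.optim.Adam(D.parameters(), lr=0.000006, betas=(0.5, 0.999))
    opt_G = torch.optim.Adam(G.parameters(), lr=0.0006, betas=(0.5, 0.999))
    for x, y_real in loader:
        x, y_real = x.to(device), y_real.to(device)

        # one step on D
        opt_D.zero_grad()
        discriminator_loss(D, x, y_real, G(x)).backward()
        opt_D.step()

        # one step on G
        opt_G.zero_grad()
        generator_loss(D, x, y_real, G(x), lam).backward()
        opt_G.step()
```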

At inference time, we first run MVCN on all 8 views of a 3D shape to build the shape memory and extract the shape descriptor. We then run the network a second time to complete each view with the learned shape descriptor.
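A sketch of this two-pass inference, assuming a generator that exposes hypothetical encode/decode methods (bottleneck features in, completed depth maps out) and the ShapeMemory helper sketched in Sec. 3.3.

```python
def complete_shape(G, memory, shape_id, views):
    """Two-pass inference: pass 1 fills the shape memory over all 8 views,
    pass 2 completes each view with the pooled shape descriptor."""
    descriptor = None
    # pass 1: build the shape memory and extract the shape descriptor
    for idx, view in enumerate(views):
        feat = G.encode(view)  # hypothetical: bottleneck (DOWN) feature of one view
        descriptor = memory.update_and_pool(shape_id, idx, feat)
    # pass 2: complete every view using the learned shape descriptor
    return [G.decode(G.encode(view), descriptor) for view in views]
```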

4 Experiments

In this section, we first describe the creation of a multi-category dataset to train our model, and then we illustrate the effectiveness of our method and the improvement of MVCN over a single-view completion net (VCN), in which each view is completed individually without the shape descriptor. Finally, we analyze the performance of our method and compare it with existing methods. By default, we train MVCN as MVCN-Airplane600 (trained with the first 600 airplane shapes in ShapeNet [2]) and test it on the same 150 models used in [28].

4.1 Data generation

We use synthetic CAD models from ShapeNet to create a dataset to train our model. Specifically, we take models from 8 categories: airplane, cabinet, car, chair, lamp, sofa, table, and vessel. Our inputs are partial point clouds. For each model, we extract one partial point cloud by back-projecting a 2.5D depth map (from a random viewpoint) into 3D, and render this partial point cloud into depth maps of resolution 256x256 as training samples. We use back-projected depth maps as partial point clouds instead of subsets of the complete point cloud because the resulting training samples are closer to real-world sensor data. In addition, similar to other work, we choose a synthetic dataset to generate training data because it contains detailed 3D shapes, which are not available in real-world datasets. In the same way, we also render depth maps from the ground truth point cloud as ground truth depth maps. More details on rendering and back-projecting depth maps can be found in the Appendix.

4.2 Evaluation metrics

Our final target is 3D shape completion. Given a generated depth image, each pixel at location (u, v) with depth value d can be back-projected to a 3D point P through an inverse perspective transformation,

P = R^{-1} \left( d\, K^{-1} [u, v, 1]^{\top} - t \right),    (4)

where K, R, and t are the camera intrinsic matrix, rotation matrix, and translation vector, respectively. Similar to [28], we use the symmetric version of the Chamfer Distance (CD) [4] to calculate the average closest-point distance between the ground truth shape S_{gt} and the completed shape S_{out} as follows,

CD(S_{out}, S_{gt}) = \frac{1}{|S_{out}|} \sum_{p \in S_{out}} \min_{q \in S_{gt}} \| p - q \|_2 + \frac{1}{|S_{gt}|} \sum_{q \in S_{gt}} \min_{p \in S_{out}} \| p - q \|_2.    (5)
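For concreteness, here is a NumPy sketch of Eqn. 4 (back-projecting a depth map into 3D points) and a brute-force version of the symmetric CD in Eqn. 5; the background value of 255 follows the appendix, and everything else is an illustrative assumption.

```python
import numpy as np

def backproject_depth(depth, K, R, t, background=255):
    """Back-project a (H, W) depth image into world-space 3D points (Eqn. 4)."""
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W]
    mask = depth < background  # skip background pixels
    pix = np.stack([u[mask], v[mask], np.ones(mask.sum())])  # (3, N) homogeneous pixels
    cam = np.linalg.inv(K) @ (pix * depth[mask])              # points in the camera frame
    return (np.linalg.inv(R) @ (cam - t[:, None])).T          # (N, 3) points in the world frame

def chamfer_distance(S_out, S_gt):
    """Symmetric Chamfer Distance of Eqn. 5 between two point sets (N, 3) and (M, 3),
    computed brute force in O(N * M) for clarity."""
    d = np.linalg.norm(S_out[:, None, :] - S_gt[None, :, :], axis=-1)  # (N, M) pairwise distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```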

4.3 Analysis of the objective function

We conduct ablation studies to justify the effectiveness of our objective function for the completion problem. Table 1(a) shows the quantitative effects of these variations, and Fig. 3 shows the qualitative effects. The cGAN alone (bottom left, i.e., setting λ = 0 in Eqn. 3) gives very noisy results. L2 + cGAN (bottom middle) leads to reasonable but blurry results. L1 alone (top right) also produces reasonable results, but with visual defects such as the marked unfilled holes, which make the final CD higher than that of L1 + cGAN. These visual defects are reduced when both L1 and cGAN are included in the loss (bottom right). As the example in Fig. 5 shows, the combination of L1 and cGAN completes the depth images with high accuracy. We further explore which components of the objective function are important for point cloud completion by varying the weight λ of the L1 loss in Eqn. 3. Table 1(b) shows the average CD for different values of λ; we use the value that achieves the best result (CD of 0.005512) in the following experiments.

Figure 3: Completion results for different losses.

(a)
Loss        Avg CD
cGAN        0.010729
L1          0.005672
L2 + cGAN   0.006467
L1 + cGAN   0.005512

(b)
λ in Eqn. 3   Avg CD
              0.005748
              0.005665
              0.005512
              0.005541

Table 1: Analysis of the objective function: average CD for different losses (a) and different λ (b).

4.4 Analysis of the view-pooling layer

Analysis of view-pooling methods. We study different view-pooling methods for constructing the shape descriptor, namely element-wise max-pooling and mean-pooling. In our experiments, mean-pooling is not as effective as max-pooling for extracting the shape descriptor in the image completion task, similar to what has been observed for recognition [24]. The average CD is 0.005926 for mean-pooling and 0.005512 for max-pooling, so max-pooling is used.

Position   Avg L1 distance   Avg CD
           3.376642          0.005512
           3.433185          0.005604
           3.500945          0.005919
Code       3.477186          0.005836
Table 2: Completion results for different positions of the view-pooling layer.

Analysis of the positions of the view-pooling layer. We insert the view-pooling layer at different positions, marked in Fig. 2, to extract the shape descriptor and evaluate its effectiveness. Intuitively, the shape descriptor has the biggest impact on the original network when the view-pooling layer is placed at the earliest of these positions, and the experimental results in Table 2 confirm this: both the average L1 distance and the CD are lowest there. We also try to perform view pooling at a later stage, concatenating the shape descriptor with the code (marked in purple in Fig. 2) and passing them through a fully connected layer, but the experiments show that the shape descriptor is then ignored, since neither the average L1 distance nor the CD decreases compared with the single-view completion net (average L1 distance 3.473643 and CD 0.005839 in Table 5).

Model Name Avg L1 Distance
MVCN-V3 3.794273
MVCN-V8-3 3.616740
MVCN-V5 3.564278
MVCN-V8-5 3.397558
Table 3: Average L1 Distance for different numbers of views in view-pooling.
Figure 4: Completion results for different numbers of views in view-pooling.
Figure 5: An example of the completion of sofa. The 1st row: incomplete point cloud and 8 depth maps of it; The 2nd row: generated point cloud and related 8 depth maps; The 3rd row: ground truth point cloud and its 8 depth maps.

Analysis of the number of views in view-pooling. We also analyze the effect of the number of views used in view-pooling. In Table 3, MVCN-V3 was trained with 3 of the 8 depth images of each 3D model (No. 1, 3, 5), and MVCN-V5 was trained with 5 depth images (No. 1, 3, 5, 6, 8). MVCN-V8-3 and MVCN-V8-5 were trained with all 8 depth images but tested with 3 and 5 views, respectively. For a fair comparison, we use the 1st, 3rd, and 5th view images to test MVCN-V8-3 and MVCN-V3, and the 1st, 3rd, 5th, 6th, and 8th to test MVCN-V8-5 and MVCN-V5. The results show that the completion of a single view improves as the number of views increases, which means the other views help the completion of each single view: the more views, the higher the completion accuracy. Fig. 4 shows an example; the completion results improve as we increase the number of views in view-pooling.

Figure 6: Visual comparison between VCN and MVCN. Starting from the partial point cloud in the first row, VCN and MVCN complete the depth maps in the second and third rows, respectively, where the completed point clouds are also shown. We use a colormap (from blue to green to red) to highlight pixels with errors larger than 10 in terms of L1 distance. Ground truth data is in the last row. MVCN achieves a lower L1 distance on all 8 depth maps.

4.5 Improvements over single view completion net

Pervasive improvements in L1 distance and CD. Table 5 shows significant and pervasive improvements over the single-view completion net (VCN) in both average L1 distance and CD across different categories. The networks in Table 5 were trained with 600 3D models for airplane, 1600 for lamp, and 1000 for the other categories. We use 150 models per category, the same test dataset as [28], to evaluate our network. We further conduct a visual comparison with VCN in Fig. 6, which shows that MVCN achieves higher completion accuracy with the help of the shape descriptor.

Model Avg L1 Distance Avg CD
MVCN-Airplane600 3.376642 0.005512
MVCN-Airplane1200 3.156200 0.005273
MVCN-Lamp1000 6.660511 0.012012
MVCN-Lamp1600 6.245297 0.010576
VCN-Lamp1000 6.763339 0.012091
VCN-Lamp1600 6.430318 0.012007
Table 4: Improvements while increasing training samples.

Better generalization capability. Table 4 shows that the performance of both VCN and MVCN improves as the number of training samples increases. The performance difference between MVCN-Lamp1000 and VCN-Lamp1000 is not obvious. The reason is that there are relatively large individual differences among lamp models in ShapeNet, and the completion results are poor on several unusual lamp models in the test set; for these models, the comparison between VCN and MVCN is less meaningful, so the improvement is not obvious. This is resolved when we add another 600 samples to the training set: MVCN-Lamp1600 improves more than VCN-Lamp1600 in both average L1 distance and CD, which indicates a better generalization capability of MVCN.

Model   Average L1 Distance
        Avg        Airplane   Cabinet    Car        Chair      Lamp       Sofa       Table      Vessel
VCN     5.431036   3.473643   4.304635   3.858853   7.644824   6.430318   5.716992   7.572865   4.44616
MVCN    5.102478   3.376642   3.991407   3.609639   7.143200   6.245297   5.284686   7.155616   4.013339

Model   Mean Chamfer Distance per point
        Avg        Airplane   Cabinet    Car        Chair      Lamp       Sofa       Table      Vessel
VCN     0.008800   0.005839   0.007297   0.006589   0.010398   0.012007   0.009565   0.009371   0.009334
MVCN    0.008328   0.005512   0.007154   0.006322   0.010077   0.010576   0.009174   0.009020   0.008790
Table 5: Comparison of average L1 distance and mean Chamfer Distance between VCN and MVCN.

4.6 Comparison with the state-of-the-art

Baselines. Some previous completion methods require prior knowledge of the shape [25] or assume more complete inputs [10], so they are not directly comparable to our method. Here we compare MVCN with several strong baselines. PCN-CD [28], a point completion network trained with CD as the loss function, was the state of the art at the time this work was developed. PCN-EMD uses the Earth Mover's Distance (EMD) [4] as the loss function, but it is intractable for dense completion due to the computational complexity of EMD. The encoders of FC [1] and Folding [27] are the same as in PCN-CD, but the decoders differ: a 3-layer fully-connected network for FC and a folding-based layer for Folding. PN2 uses the same decoder, but its encoder is PointNet++ [20]. 3D-EPN [3] is a representative of the class of volumetric completion methods. For a fair comparison, the distance field outputs of 3D-EPN are converted into point clouds as mentioned in [28].

Figure 7: Comparison between MVCN and PCN-CD.

Training dataset. To achieve the results in Table 6, FC, Folding, PCN-CD, and PCN-EMD were trained with 3795 airplane models, 1322 cabinet models, 5677 car models, 5750 chair models, 2068 lamp models, 2923 sofa models, 1689 vessel models, and 5750 table models, and for each 3D model, 8 partial point clouds from 8 random viewpoints were extracted. In contrast, we only use 1600 models for lamp, 1200 for airplane, and 1000 for the other categories to train our network, and we extract only one partial point cloud per 3D model. For example, we use only 1000 car point clouds to train our network, whereas 45416 car point clouds are used for training in [28]. The performance of MVCN can be improved further with more training models, as indicated by Table 4. For testing, we use the same 150 objects per category as [28] for a fair comparison.

Model     Mean Chamfer Distance per point
          Avg        Airplane   Cabinet    Car        Chair      Lamp       Sofa       Table      Vessel
3D-EPN    0.020147   0.013161   0.021803   0.020306   0.018813   0.025746   0.021089   0.021716   0.018543
FC        0.009799   0.005698   0.011023   0.008775   0.010969   0.011131   0.011756   0.009320   0.009720
Folding   0.010074   0.005965   0.010831   0.009272   0.011245   0.012172   0.011630   0.009453   0.010027
PN2       0.013999   0.010300   0.014735   0.012187   0.015775   0.017615   0.016183   0.011676   0.013521
PCN-CD    0.009636   0.005502   0.010625   0.008696   0.010998   0.011339   0.011676   0.008590   0.009665
PCN-EMD   0.010021   0.005849   0.010685   0.009080   0.011580   0.011961   0.012206   0.009014   0.009789
MVCN      0.008298   0.005273   0.007154   0.006322   0.010077   0.010576   0.009174   0.009020   0.008790
Table 6: Comparison with the state of the art in terms of mean Chamfer Distance per point over multiple shape categories.

Comparisons. As in [28], we use the symmetric version of CD in Eqn. 5 to compute the average closest-point distance, for which the ground truth and generated point clouds do not need to have the same size. The average CD does not vary much with the number of points in the ground truth and generated point clouds, which makes the comparison between different methods fair and feasible. Table 6 shows the quantitative results, where the completion results of the other methods are taken from [28]. Our method achieves the lowest CD for almost all object categories. A more detailed comparison with PCN-CD is shown in Fig. 7, where the height of the blue bar indicates the improvement of our method over PCN-CD on each shape. Fig. 8 shows the qualitative results. Our completions are denser, and we recover more details. Another clear advantage is that our method can complete shapes with complex geometry, like the 2nd to 4th objects, whereas the other methods fail to recover these shapes.

Figure 8: Qualitative completion results on ShapeNet, where MVCN completes complex shapes at high resolution.

5 Conclusion

We have presented a method for shape completion that renders multi-view depth maps of an incomplete shape and then performs image completion on these rendered views. Our multi-view completion net shows significant improvements over a single-view completion net across multiple object categories. Experiments show that the view-based representation and the novel network structure achieve better results with fewer training samples, perform better on objects with complex geometry, and generate higher resolution results than previous methods.

References

  • [1] P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. J. Guibas. Learning representations and generative models for 3d point clouds. In ICML, 2018.
  • [2] A. X. Chang, T. A. Funkhouser, L. J. Guibas, P. Hanrahan, Q.-X. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. Shapenet: An information-rich 3d model repository. CoRR, abs/1512.03012, 2015.
  • [3] A. Dai, C. R. Qi, and M. Nießner. Shape completion using 3d-encoder-predictor cnns and shape synthesis. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6545–6554, 2017.
  • [4] H. Fan, H. Su, and L. J. Guibas. A point set generation network for 3d object reconstruction from a single image. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2463–2471, 2017.
  • [5] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.
  • [6] X. Han, Z. Li, H. Huang, E. Kalogerakis, and Y. Yu. High-resolution shape completion using deep neural networks for global structure and local geometry inference. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [7] S. Iizuka, E. Simo-Serra, and H. Ishikawa. Globally and locally consistent image completion. ACM Trans. Graph., 36(4):107:1–107:14, July 2017.
  • [8] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
  • [9] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5967–5976, 2017.
  • [10] M. Kazhdan and H. Hoppe. Screened poisson surface reconstruction. ACM Trans. Graph., 32(3):29:1–29:13, July 2013.
  • [11] D. Kingma and J. Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 12 2014.
  • [12] D. P. Kingma and M. Welling. Auto-encoding variational bayes. CoRR, abs/1312.6114, 2014.
  • [13] C.-H. Lin, C. Kong, and S. Lucey. Learning efficient point cloud generation for dense 3d object reconstruction. In AAAI Conference on Artificial Intelligence (AAAI), 2018.
  • [14] D. Maturana and S. Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 922–928, Sep. 2015.
  • [15] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In ICML, 2010.
  • [16] D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2536–2544, 2016.
  • [17] T. Portenier, Q. Hu, A. Szabó, S. A. Bigdeli, P. Favaro, and M. Zwicker. Faceshop: deep sketch-based face image editing. ACM Trans. Graph., 37:99:1–99:13, 2018.
  • [18] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 77–85, 2017.
  • [19] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas. Volumetric and multi-view cnns for object classification on 3d data. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5648–5656, June 2016.
  • [20] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In NIPS, 2017.
  • [21] O. Ronneberger, P.Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), volume 9351 of LNCS, pages 234–241. Springer, 2015. (available on arXiv:1505.04597 [cs.CV]).
  • [22] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, and X. Chen. Improved techniques for training gans. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2234–2242. Curran Associates, Inc., 2016.
  • [23] A. A. Soltani, H. Huang, J. Wu, T. D. Kulkarni, and J. B. Tenenbaum. Synthesizing 3d shapes via modeling multi-view depth maps and silhouettes with deep generative networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2511–2519, 2017.
  • [24] H. Su, S. Maji, E. Kalogerakis, and E. G. Learned-Miller. Multi-view convolutional neural networks for 3d shape recognition. In Proc. ICCV, 2015.
  • [25] M. Sung, V. G. Kim, R. Angst, and L. Guibas. Data-driven structural priors for shape completion. ACM Trans. Graph., 34(6):175:1–175:11, Oct. 2015.
  • [26] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3d shapenets: A deep representation for volumetric shapes. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1912–1920, 2015.
  • [27] Y. Yang, C. Feng, Y. Shen, and D. Tian. Foldingnet: Interpretable unsupervised learning on 3d point clouds. CoRR, abs/1712.07262, 2017.
  • [28] W. Yuan, T. Khot, D. Held, C. Mertz, and M. Hebert. Pcn: Point completion network. 2018 International Conference on 3D Vision (3DV), pages 728–737, 2018.

6 Appendix

6.1 Overview

In this document, we provide technical details and additional experimental results to the main paper.

6.2 Rendering and back-projecting depth maps

This subsection provides details of the data generation described in Subsection 4.1. Rendering multi-view depth maps. First, for each 3D model, we move its center to the origin. Most models in modern online repositories, such as ShapeNet and the 3D Warehouse, are upright oriented along a consistent axis, and some previous completion and recognition methods follow the same assumption [26, 28]. With this assumption, the center is the midpoint of the bounding box along each axis. Then, each model is uniformly scaled to fit into a consistent sphere (of radius 0.2), where the scale factor is the maximum length along any axis divided by the radius. Finally, we render 8 depth maps for each partial point cloud, as described in Section 3.1. In this way, all shapes appear at the center of the depth images and occupy a large portion of each image. We also render 8 depth maps of the ground truth point cloud and use these image pairs to train our network.
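A small NumPy sketch of this normalization step; reading the centering as the bounding-box midpoint along each axis is our interpretation of the text.

```python
import numpy as np

def normalize_model(points, radius=0.2):
    """Center a model at the origin and uniformly scale it so its maximum
    extent fits the target radius (sketch of the normalization in Sec. 6.2)."""
    points = np.asarray(points, dtype=float)
    mins, maxs = points.min(axis=0), points.max(axis=0)
    center = (mins + maxs) / 2.0            # bounding-box midpoint along each axis
    scale = (maxs - mins).max() / radius    # maximum length along any axis divided by the radius
    return (points - center) / scale
```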

Back-projecting multi-view depth maps into a point cloud. After MVCN, we obtain 8 generated depth maps and back-project them into a completed point cloud using a voting algorithm. We use a pixel value of 255 to represent the background of a depth image, corresponding to infinity in 3D. In practice, some pixels in the background of the generated images are close to 255 (for example, 245) but not exactly 255. Such pixels appear in the generated images, for instance in the bottom right of Fig. 3 (barely visible) or in the bottom left of Fig. 3, where the problem is amplified by a poor training loss. After back-projection, they become sparse points in 3D. We delete these noisy points using a union-find algorithm, where two points belong to the same set if their Euclidean distance is smaller than 0.02, and a point with fewer than 6 neighbors is deleted as noise. We restore the generated model to its original size after completion.
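A possible implementation of this denoising step in Python with SciPy. The 0.02 radius and the minimum of 6 neighbors follow the text; treating the size of a point's union-find set as its neighbor count is our interpretation.

```python
import numpy as np
from scipy.spatial import cKDTree

def remove_sparse_points(points, radius=0.02, min_neighbors=6):
    """Union-find denoising sketch: points closer than `radius` are merged into
    one set, and points whose set is smaller than `min_neighbors` are dropped."""
    points = np.asarray(points)
    tree = cKDTree(points)
    pairs = tree.query_pairs(r=radius)  # all point pairs within `radius`

    parent = np.arange(len(points))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in pairs:  # union points that are close enough
        parent[find(i)] = find(j)

    roots = np.array([find(i) for i in range(len(points))])
    _, inverse, counts = np.unique(roots, return_inverse=True, return_counts=True)
    keep = counts[inverse] >= min_neighbors  # keep points in sufficiently large sets
    return points[keep]
```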

6.3 Analysis of the number of views in view-pooling

More experimental results for Subsection 4.4. We further show the improvements in L1 distance for all view images of the test dataset in Fig. 9. The x-axis indexes the view images; note that the same x value corresponds to different view images for 'V8 vs V3' and 'V8 vs V5', since the test dataset has 150 3D models, so 450 view images are used to test 'V8 vs V3' and 750 view images are used to test 'V8 vs V5'. The height of the blue bar indicates the improvement of 8 views over 3, and the red bar indicates the improvement of 8 views over 5; positive values mean the L1 distance is lower when using 8 views. Since the training dataset is relatively small (600 3D models for training and 150 for testing), our network performs poorly on several unusual models in the test dataset, which fall on the boundary in Fig. 9; comparisons on these boundary instances are not meaningful. Apart from these, for most view images the L1 distance decreases as the number of views in view-pooling increases. More views make the shape descriptor more helpful.

Figure 9: Improvements of 8 views over 3 and 5 views in view-pooling.