Convolutional neural networks have proven highly successful at analysis and synthesis of visual data such as images and videos. This has spurred interest in applying convolutional network architectures to 3D shapes as well, where a key challenge is to find suitable generalizations of discrete convolutions to the 3D domain. Popular techniques include using discrete convolutions on 3D grids [Wu et al.2015], graph convolutions on meshes [Litany et al.2018], convolution-like operators on 3D point clouds [Atzmon, Maron, and Lipman2018, Li et al.2018b], or 2D convolutions on 2D shape parameterizations [Cohen et al.2018]. A simple approach in the last category is to represent shapes using multiple 2D projections, or multiple depth images, and apply 2D convolutions on these views. This has led to successful techniques for shape classification [Su et al.2015], single-view 3D reconstruction [Richter and Roth2018], shape completion [Hu et al.2019], and shape synthesis [Soltani et al.2017]. One issue in these approaches, however, is how to encourage consistency among the separate views and prevent each view from representing a slightly different object. This is not an issue in supervised training, where the loss encourages all views to match the ground truth shape. But at inference time or in unsupervised training, ground truth is not available and a different mechanism is required to encourage consistency.
In this paper, we address the problem of shape completion using a multi-view depth image representation, and we propose a multi-view consistency loss that is minimized during inference. We formulate inference as an energy minimization problem, where the energy is the sum of a data term given by a conditional generative net, and a regularization term given by a geometric consistency loss. Our results show the benefits of optimizing geometric consistency in a multi-view shape representation during inference, and we demonstrate that our approach leads to state-of-the-art results in shape completion benchmarks. In summary, our contributions are as follows:
- We propose a multi-view consistency loss for 3D shape completion that does not rely on ground truth data.
- We formulate multi-view consistent inference as an energy minimization problem including our consistency loss as a regularizer, and a neural network-based data term.
- We show state-of-the-art results in standard shape completion benchmarks, demonstrating the benefits of the multi-view consistency loss in practice.
3D shape completion. Different 3D shape representations have been applied in 3D shape completion, such as voxels, point clouds, and multiple views. Voxel-based representations are widely used in shape completion with 3D CNNs, such as 3D-Encoder-Predictor CNNs [Dai, Qi, and Nießner2017] and an encoder-decoder CNN for patch-level geometry refinement [Han et al.2017]. However, computational complexity grows cubically as the voxel resolution increases, which severely limits the completion accuracy. To address this problem, several point cloud-based shape completion methods [Achlioptas et al.2018, Yang et al.2017, Yuan et al.2018] have been proposed. The point completion network (PCN) [Yuan et al.2018] is a current state-of-the-art approach that extends the PointNet architecture [Qi et al.2017] to provide an encoder, followed by a multi-stage decoder that uses both fully connected [Achlioptas et al.2018] and folding layers [Yang et al.2017]. However, the output point cloud size in these methods is fixed to small numbers like 2048 [Yang et al.2017], which often leads to a loss of detail. View-based methods resolve this issue by completing each rendered depth image [Hu et al.2019] of the incomplete shape, and then back-projecting the completed images into a dense point cloud. By leveraging state-of-the-art image-to-image translation networks [Isola et al.2017], MVCN [Hu et al.2019] completes each single view with a shape descriptor that encodes the characteristics of the whole 3D object to achieve higher accuracy. However, view-based methods fail to maintain geometric consistency among completed views during inference. Our approach resolves this issue using our novel multi-view consistent inference technique.
Multi-view consistency. One problem of view-based representations is inconsistency among multiple views. Some researchers have presented multi-view losses that train a network to achieve consistency in multi-view representations, for tasks such as discovering 3D keypoints [Suwajanakorn et al.2018] and reconstructing 3D objects from images [Lin, Kong, and Lucey2018, Li et al.2018a, Tulsiani, Efros, and Malik2018, Jiang et al.2018, Khot et al.2019]. With differentiable rendering [Lin, Kong, and Lucey2018, Tulsiani, Efros, and Malik2018], the consistency distances among different views can be leveraged as 2D supervision to learn 3D shapes in their networks. However, these methods can only guarantee consistency for training data during the training stage. In contrast, with the help of our novel energy optimization and consistency loss implementation, our proposed method can improve geometric consistency on test data directly during the inference stage.
Multi-view Consistent Inference
Overview. The goal of our method is to guarantee multi-view consistency in inference, as shown in the overview in Fig. 1. Our method starts by converting partial point clouds to multi-view depth image representations, rendering the points into a set of incomplete depth images from a number of fixed viewpoints. In our current implementation, we use eight viewpoints placed on the corners of a cube. Our approach builds on a conditional generative net, which is trained to output completed depth images by estimating a shape descriptor conditioned on a set of incomplete inputs. We obtain the conditional generative net in a separate, supervised training stage. During inference, we keep the network weights fixed and optimize the shape descriptor to minimize an energy consisting of a consistency loss, which acts as a regularizer, and a data term. On the one hand, the consistency loss quantifies the geometric consistency among the completed depth images. On the other hand, the data term encourages the solution to stay close to an initially estimated shape descriptor. This leads to the optimization problem of Eq. (1) for the desired shape descriptor, in which a weighting factor balances the two terms and in which we distinguish the initially estimated completed depth images from the optimized completed depth images produced in inference. We formulate the regularization term and the data term as the multi-view consistency loss and the generator loss in Section ‘Consistency Loss’.
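As a sketch of the objective described above, the energy can be written in assumed notation. The symbols are chosen here for illustration only: $G$ for the conditional generative net, $x$ for the set of incomplete inputs, $z$ for the shape descriptor, $z_0$ for its initial estimate, $\mu$ for the weighting factor, and $L_{\mathrm{con}}$ and $L_{\mathrm{gen}}$ for the consistency loss and the generator loss. Placing $\mu$ on the data term is likewise an assumption.

```latex
\begin{equation*}
z^{*} \;=\; \arg\min_{z}\;
  L_{\mathrm{con}}\bigl(G(x; z)\bigr)
  \;+\; \mu\, L_{\mathrm{gen}}\bigl(G(x; z),\, G(x; z_0)\bigr)
\end{equation*}
```

Under this reading, the network weights inside $G$ stay fixed and only $z$ changes, starting from $z = z_0$.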
Conditional generative net. The conditional generative net is built on the structure of the multi-view completion net [Hu et al.2019], as shown in Fig. 2, which is an image-to-image translation architecture applied to perform depth image completion for multiple views of the same shape. We train the conditional generative net following a standard conditional GAN approach [Goodfellow et al.2014]. To share information between multiple depth images of the same shape, our architecture learns a shape descriptor for each 3D object by pooling a so-called shape memory consisting of feature maps from all views of the shape. The network consists of 8 U-Net modules, and each U-Net module has two submodules, Down and Up, so there are 8 Down submodules in the encoder and 8 Up submodules in the decoder. Down submodules have the form Convolution-BatchNorm-ReLU [Ioffe and Szegedy2015, Nair and Hinton2010], and Up submodules have the form UpReLU-UpConv-UpNorm. The shape memory is the feature map after the third Down submodule of the encoder. More details can be found in [Isola et al.2017, Hu et al.2019].
In inference, we optimize the shape descriptor for a given test input. We first obtain an initial estimate of the shape descriptor for each test shape by running the trained model once, and use it to initialize the optimization. During inference, all other parameters of the conditional generative net are fixed.
Consistency Loss
Our consistency loss is based on the sum of the distances between each pixel in the multi-view depth map and its approximate nearest neighbor in any of the other views. In this section we introduce the details of the multi-view consistency loss calculation following the overview in Fig. 1. For each view, we first calculate pairwise per-pixel consistency distances to every other view, that is, per-pixel distances to approximate nearest neighbors in that view. We then perform consistency pooling, which for each view provides the consistency distances over all other views (as opposed to the initial pairwise consistency distances between two views). We call these the loss maps. The final consistency loss is the sum over all loss maps.
Pairwise Consistency Distances
Given a source view and a target view, both depth images at the same image resolution, we calculate the consistency distance between them by view-reprojection and closest point pooling. Specifically, view-reprojection transforms the depth information of the source view to a reprojection map according to the transformation matrix of the target view. Then, closest point pooling produces the consistency distance between the reprojection map and the target view. Fig. 3 shows the pipeline for one example pair of target and source views. In the following, each pixel on the source view is described by its pixel coordinates and the depth value at that location. We refer to its back-projected 3D point and to the corresponding reprojected pixel on the reprojection map, which likewise carries pixel coordinates and a depth value.
View-reprojection. The view-reprojection operator first back-projects each pixel of the source view into canonical 3D coordinates via Eq. (2), which involves the intrinsic camera matrix together with the rotation matrix and translation vector of the view. This defines the relationship between the view and its back-projected point cloud. The rotation and translation together form the transformation matrix of a view, which contains its pose information. Then, Eq. (3) transforms each 3D point in the point cloud into a pixel on the reprojection map using the transformation matrix of the target view.
Eq. (2) and Eq. (3) illustrate that we can transform the depth information of the source view to a reprojection map, which has the same pose as the target view. However, due to the discrete grid of the depth images, different points in the point cloud may be projected to the same pixel on the reprojection map when using Eq. (3), as illustrated in Fig. 3, where several points are projected to the same pixel on the reprojection map. To alleviate this collision effect, we implement a pseudo-rendering technique similar to [Lin, Kong, and Lucey2018]. Specifically, for each pixel on the reprojection map, a sub-pixel grid is used to store multiple depth values corresponding to the same pixel, which enlarges the reprojection map accordingly.
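The back-projection, reprojection, and sub-pixel pseudo-rendering steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the pinhole intrinsics (`focal`, `cx`, `cy`), the convention that a canonical point `P` maps to camera coordinates `R P + t`, the use of `math.inf` for empty pixels, the pixel-center offset of 0.5, and the rule that the nearest depth wins when two points share a sub-pixel cell are all hypothetical choices, as are the function names.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Pose = Tuple[Sequence[Sequence[float]], Vec3]  # (rotation R, translation t)


def rotate(rot: Sequence[Sequence[float]], vec: Sequence[float]) -> Vec3:
    """Multiply a 3x3 matrix with a 3-vector."""
    return tuple(sum(rot[r][c] * vec[c] for c in range(3)) for r in range(3))


def back_project(u: float, v: float, depth: float, focal: float,
                 cx: float, cy: float, pose: Pose) -> Vec3:
    """Pixel (u, v) with depth -> canonical point R^T (depth * K^-1 [u, v, 1]^T - t)."""
    rot, trans = pose
    cam = ((u - cx) * depth / focal, (v - cy) * depth / focal, depth)
    shifted = [cam[i] - trans[i] for i in range(3)]
    rot_t = [[rot[c][r] for c in range(3)] for r in range(3)]  # inverse of a rotation
    return rotate(rot_t, shifted)


def project(point: Vec3, focal: float, cx: float, cy: float,
            pose: Pose) -> Optional[Vec3]:
    """Canonical point -> (u, v, depth) in the posed view, or None if behind the camera."""
    rot, trans = pose
    x, y, z = (a + b for a, b in zip(rotate(rot, point), trans))
    if z <= 0.0:
        return None
    return (focal * x / z + cx, focal * y / z + cy, z)


def reproject_view(depth_map: List[List[float]], src_pose: Pose, tgt_pose: Pose,
                   focal: float, cx: float, cy: float, up: int) -> List[List[float]]:
    """Reproject a source depth map into the target pose on an (H*up) x (W*up) grid."""
    height, width = len(depth_map), len(depth_map[0])
    grid = [[math.inf] * (width * up) for _ in range(height * up)]
    for v in range(height):
        for u in range(width):
            depth = depth_map[v][u]
            if math.isinf(depth):  # assumed marker for "no surface at this pixel"
                continue
            point = back_project(u + 0.5, v + 0.5, depth, focal, cx, cy, src_pose)
            hit = project(point, focal, cx, cy, tgt_pose)
            if hit is None:
                continue
            col, row = math.floor(hit[0] * up), math.floor(hit[1] * up)
            if 0 <= row < height * up and 0 <= col < width * up:
                # Assumed collision rule: keep the depth closest to the camera.
                grid[row][col] = min(grid[row][col], hit[2])
    return grid
```

With `up = 1` the grid has the resolution of the input and collisions overwrite each other; larger values of `up` give colliding points a chance to land in different sub-pixel cells.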
Closest point pooling. The closest point pooling operator computes the consistency distance between the reprojection map and the target view. First, we upsample the target view to the resolution of the reprojection map by repeating each depth value into a sub-pixel grid. Then, we calculate the element-wise distance between the reprojection map and the upsampled target view. Finally, we perform closest point pooling to extract the minimal distance in each sub-pixel grid, using min-pooling with a filter size and stride that match the sub-pixel grid. This provides the consistency distance between the source view and the target view, as shown in Fig. 3. Note that when the source view and the target view coincide, we directly take the corresponding input view as the reprojection, since the incomplete input also provides some supervision.
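The three steps of closest point pooling (upsampling by repetition, element-wise distance, min-pooling over each sub-pixel grid) can be sketched in plain Python. The function names, the use of `math.inf` for empty sub-pixel cells, and the rule that an empty cell yields an infinite distance are assumptions made for this illustration.

```python
import math
from typing import List

Image = List[List[float]]


def upsample_nearest(image: Image, up: int) -> Image:
    """Repeat every depth value into an up x up block."""
    return [[value for value in row for _ in range(up)]
            for row in image for _ in range(up)]


def depth_distance(a: float, b: float) -> float:
    """Absolute depth difference; infinite if either side holds no depth (assumed)."""
    if math.isinf(a) or math.isinf(b):
        return math.inf
    return abs(a - b)


def closest_point_pooling(reprojection: Image, target: Image, up: int) -> Image:
    """Min-pool the element-wise distances with an up x up filter and stride up."""
    target_up = upsample_nearest(target, up)
    dist = [[depth_distance(r, t) for r, t in zip(r_row, t_row)]
            for r_row, t_row in zip(reprojection, target_up)]
    height, width = len(target), len(target[0])
    return [[min(dist[row * up + dr][col * up + dc]
                 for dr in range(up) for dc in range(up))
             for col in range(width)]
            for row in range(height)]
```

The result has the resolution of the target view again, and each entry is the distance from the target pixel to the closest reprojected depth stored in its sub-pixel grid.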
Note that some consistency distances may be large due to noisy view completion or self-occlusion between the source and target views, and these outliers interfere with our energy minimization. Therefore, we perform outlier suppression by ignoring consistency distances above a threshold, which is defined as a fraction of the depth range (from the minimum to the maximum depth value of a model).
Consistency Distance Aggregation by Consistency Pooling
Given a target view, we obtain the consistency distances between it and all the other source views, as shown in Fig. 4, which uses the same colorbar as Fig. 1. Different source views cover different parts of the target view, which leads to different consistency distances. For example, the red parts in Fig. 4 indicate regions that cannot be inferred well from the respective source view, so these parts are not helpful for the optimization of the target view.
By extracting the minimum distance between the target view and the reprojections from all other views, we cover the whole target view with the closest points to it and obtain the loss maps. In our pipeline, we implement this efficiently using a consistency pooling operator, defined in Eq. (5), which takes the number of views in pooling as a parameter. This parameter makes it possible to restrict pooling to a subset of the views (see Section ‘Experiments’ for an evaluation of this parameter). Fig. 4 illustrates this for one example target view, and Fig. 5 shows the consistency loss maps for all target views.
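Consistency pooling and the resulting loss can be sketched as a per-pixel minimum over the pairwise consistency distances, followed by a sum. The sketch below makes several assumptions that the text does not fix: the subset of views is taken as the first `num_views` entries of the list, distances above `threshold` are dropped before taking the minimum, and a pixel with no remaining candidate contributes zero. All names are hypothetical.

```python
import math
from typing import List, Optional

Image = List[List[float]]


def consistency_pooling(distances: List[Image], num_views: int,
                        threshold: Optional[float] = None) -> Image:
    """Per-pixel minimum over the consistency distances of the selected source views."""
    if num_views < 1:
        raise ValueError("num_views must be at least 1")
    selected = distances[:num_views]  # assumed subset rule
    height, width = len(selected[0]), len(selected[0][0])
    loss_map = []
    for row in range(height):
        pooled_row = []
        for col in range(width):
            candidates = [
                d[row][col] for d in selected
                if math.isfinite(d[row][col])
                and (threshold is None or d[row][col] <= threshold)
            ]
            # Assumed: a pixel without any valid candidate adds nothing to the loss.
            pooled_row.append(min(candidates) if candidates else 0.0)
        loss_map.append(pooled_row)
    return loss_map


def consistency_loss(loss_maps: List[Image]) -> float:
    """Sum of all entries of the loss maps of all target views."""
    return sum(value for loss_map in loss_maps for row in loss_map for value in row)
```

Calling `consistency_pooling` once per target view and passing the resulting loss maps to `consistency_loss` gives a single scalar for the whole shape.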
Our multi-view consistent inference aims to maximize the depth consistency across all views by optimizing the shape descriptor of a 3D model. Therefore, the consistency loss for the whole 3D model sums the loss maps over all target views, given the input set of incomplete depth images of the 3D model.
In Eq. (1), we also have a data term that keeps the optimized shape descriptor close to the initial estimate during inference. We call this the generator loss, which aims to prevent the completed depth images from drifting away from the prior learned from the training data. It measures the distance between the initially estimated outputs and the optimized outputs for the same input. In summary, the overall loss function in inference, Eq. (7), combines the consistency loss and the generator loss through a weighting factor. We optimize the shape descriptor for 100 gradient descent steps, and we take the descriptor with the smallest consistency loss in the last 10 steps as the final result. Since the gradients are small, we use a large learning rate of 0.2.
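The inference procedure (100 gradient descent steps with learning rate 0.2, then picking the iterate with the smallest consistency loss among the last 10 steps) can be sketched on a toy problem. In this illustration the two losses are arbitrary Python callables of the descriptor, the gradient is approximated by central finite differences instead of backpropagation through a network, and the weight is placed on the generator loss. These are assumptions, and all names are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Loss = Callable[[Sequence[float]], float]


def numerical_gradient(func: Loss, z: Sequence[float], eps: float = 1e-5) -> List[float]:
    """Central finite differences; a stand-in for backpropagation in this sketch."""
    grad = []
    for i in range(len(z)):
        upper, lower = list(z), list(z)
        upper[i] += eps
        lower[i] -= eps
        grad.append((func(upper) - func(lower)) / (2.0 * eps))
    return grad


def optimize_descriptor(z0: Sequence[float], consistency: Loss, generator: Loss,
                        weight: float, steps: int = 100, lr: float = 0.2,
                        tail: int = 10) -> Tuple[Optional[List[float]], float]:
    """Gradient descent on consistency(z) + weight * generator(z), starting at z0.

    Returns the iterate with the smallest consistency loss among the last
    `tail` steps, together with that loss value.
    """
    def energy(z: Sequence[float]) -> float:
        return consistency(z) + weight * generator(z)

    z = list(z0)
    best_z: Optional[List[float]] = None
    best_loss = float("inf")
    for step in range(1, steps + 1):
        grad = numerical_gradient(energy, z)
        z = [value - lr * g for value, g in zip(z, grad)]
        if step > steps - tail:  # only the last `tail` iterates are candidates
            current = consistency(z)
            if current < best_loss:
                best_z, best_loss = list(z), current
    return best_z, best_loss
```

For example, with a toy consistency loss `sum((v - 1) ** 2 for v in z)` and a toy generator loss `sum(v ** 2 for v in z)`, the loop settles between the two minima, which mirrors how the data term keeps the solution near the initial estimate.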
Experiments
Our method is built on MVCN [Hu et al.2019], a state-of-the-art view-based shape completion method. To fairly evaluate the improvements over MVCN directly, we use the same pipeline as MVCN to generate training and test depth images, where each 3D object is represented by multi-view depth maps at a fixed resolution. We take 3D models from ShapeNet [Chang et al.2015]. Initially, we fix the number of views in Eq. (5) to conduct consistency pooling in the following experiments. In addition, we use the same training dataset and hyperparameters as [Hu et al.2019] to train the network, and the same test dataset as [Hu et al.2019, Yuan et al.2018] to evaluate our method with the Chamfer Distance (CD) [Fan, Su, and Guibas2017].
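For reference, one common symmetric form of the Chamfer Distance between two point sets is sketched below. Definitions differ across papers in whether distances are squared and how the two directions are normalized, and the exact variant used in the evaluation is not stated in this section, so the choice here (mean nearest-neighbor Euclidean distance, summed over both directions) is an assumption. The brute-force search is only meant for small inputs.

```python
import math
from typing import Sequence

Point = Sequence[float]


def chamfer_distance(points_a: Sequence[Point], points_b: Sequence[Point]) -> float:
    """Symmetric Chamfer Distance with mean nearest-neighbor Euclidean distances."""
    if not points_a or not points_b:
        raise ValueError("both point sets must be non-empty")

    def one_sided(src: Sequence[Point], dst: Sequence[Point]) -> float:
        # Average distance from each source point to its nearest destination point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_sided(points_a, points_b) + one_sided(points_b, points_a)
```

A completed point cloud that matches the ground truth exactly gives a distance of zero, and both missing regions and outliers increase it because each direction is penalized separately.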
Analysis of the Objective Function
We test different objective functions in Eq. (7) to justify the effectiveness of our method. Table 1 shows the quantitative effects of these variations. The experiments are conducted on 100 3D airplane models (drawn from neither the test dataset nor the training dataset), which are randomly selected under the constraint that the average CD is close to that of the test dataset in [Hu et al.2019]. We vary the weighting factor and the distance function used in the generator loss. When the weighting factor is zero, only the consistency loss is used in the loss function. Based on this comparison, we select the distance function for the generator loss and the weighting factor used in the following experiments.
The Size of Depth-buffer in Pseudo-rendering
As mentioned above, we use a depth-buffer in pseudo-rendering. A bigger buffer means fewer collisions in pseudo-rendering, which in turn makes the reprojection more accurate. The average CD is lower when we increase the size of the depth-buffer, as shown in Table 2 (a), where the experiments are conducted on two categories of the test dataset. In the loss maps in Fig. 6 (c) to (e), with the number of views in consistency pooling (Eq. (5)) held fixed, the consistency loss becomes smaller as the depth-buffer size increases. This is because the closest points (reprojected from the other 8 views) to the target view are more accurate. We also see fewer noisy points (brighter ones) in Fig. 6 (e).
The Number of Views in Consistency Pooling
In this part, we analyze the effects of varying the number of views in consistency pooling. As shown in Fig. 4, more views mean a bigger coverage over the target view and a smaller consistency loss. Given a fixed depth-buffer size, Fig. 6 (e) to (g) show that the consistency loss increases when fewer views are used in pooling, and we also find more noisy points in the loss maps in Fig. 6 (f) and (g).
Comparison with Direct Optimization Method
Our multi-view consistent inference can also be used to optimize completed depth maps directly without the conditional generative net. We call this direct optimization on depth maps, and in this part, we compare our method with direct optimization. Direct optimization only contains the consistency loss calculation part C in Fig. 1. Each depth map is a trainable tensor. We first initialize the tensors with the completed views, and then update these tensors by minimizing the loss in Eq. (7). The generator loss again uses a fixed distance function and weighting factor, and the learning rate is 0.0006, which produces the best results for direct optimization.
Fig. 7 shows the comparisons. Here we color-code the normals of the completed point clouds, which are estimated using a k-d tree search algorithm with a search radius of 0.5 and a maximum number of neighbors of 30. Compared with direct optimization, our method performs better. For example, in terms of optimizing point clouds, we can smooth the surface, like the seat of the chair, and remove some outliers. As for completing depth maps, our method can fill a hole appearing in MVCN [Hu et al.2019] and even add the missing leg, where the distances to the ground truth are marked in red.
Though the direct optimization method can also refine the point clouds of MVCN, it does not perform well in removing outliers on point clouds (left) or completing a depth map (right) in Fig. 7. The reason is that direct optimization has no knowledge with which to distinguish shape from background in a depth map, which means that for pixels in a hole, direct optimization does not know whether they belong to a hole of the shape or to the background. In contrast, with the knowledge of shape completion learned in the conditional generative net, our method completes shapes better.
Intermediate Results and Convergence
In Fig. 8, the image insets illustrate the intermediate completion results of the [0, 20, 40, 60, 80, 100]th step for one example depth image from the cabinet class. In addition, the distance to the ground truth is averaged over all cabinet objects. For clarity, the curve is offset vertically by 0.2. We see that the completed results are closer to the ground truth than those of MVCN, though there is no ground truth supervision in inference.
Fig. 8 illustrates empirically that, under the defined loss function, our optimization can find a good solution within 100 steps. The figure shows the average loss versus gradient descent steps on all the 150 cabinet test objects. The distance to the ground truth reaches its maximum early in the optimization, and then decreases in the following steps. For 98% of all the 1200 test objects, the maximum is reached within 10 steps, and within 20 steps for almost all. After 100 steps, the optimization has largely converged.
Improvements over Existing Works. Here we compare our method with the state-of-the-art shape completion methods, including 3D-EPN [Dai, Qi, and Nießner2017], FC [Achlioptas et al.2018], Folding [Yang et al.2017], three variants of PCN [Yuan et al.2018]: PN2, PCN-CD, PCN-EMD, and MVCN [Hu et al.2019]. TopNet [Tchapmi et al.2019] is a recent point-based method, but their generated point clouds are sparse.
Table 3 shows the quantitative results, where the completion results of the other methods are from [Yuan et al.2018, Hu et al.2019] and ‘Direct-Opt’ is the direct optimization method introduced above. With multi-view consistency optimization, both direct optimization and our method can improve MVCN on most categories of the test datasets, and our method achieves better results. The optimization methods fail on the Lamp dataset. As mentioned in [Hu et al.2019], the reason is that the completion of MVCN is bad on several lamp objects, which makes the optimization less meaningful.
Fig. 11 shows the qualitative improvements over the currently best view-based method, MVCN, where the normals of point clouds are color-coded. With the conditional generative net and the multi-view consistency loss, our method produces completed point clouds with smoother surfaces and fewer outliers, and can also fill holes of shapes on multiple categories.
Completion results given different inputs. Fig. 10 (a, b, c) show completed airplanes and cars under 3 different inputs of the same objects. Since the car input in (a) leaves a lot of ambiguity, the completed cars vary. The airplane results are more similar because the inputs contain most of the structure.
Multiple views of completed shapes. Fig. 10 (c) shows a completed airplane and car from 3 views. We see that the completed shapes are consistent among different views.
We proposed multi-view consistent inference to enforce geometric consistency in view-based 3D shape completion. We defined a novel multi-view consistency loss suitable for optimization in inference, which can be achieved without the supervision of ground truth. The experimental results demonstrate that our method can complete 3D shapes more accurately than existing methods.
- [Achlioptas et al.2018] Achlioptas, P.; Diamanti, O.; Mitliagkas, I.; and Guibas, L. J. 2018. Learning representations and generative models for 3d point clouds. In ICML.
- [Atzmon, Maron, and Lipman2018] Atzmon, M.; Maron, H.; and Lipman, Y. 2018. Point convolutional neural networks by extension operators. ACM Trans. Graph. 37(4):71:1–71:12.
- [Chang et al.2015] Chang, A. X.; Funkhouser, T. A.; Guibas, L. J.; Hanrahan, P.; Huang, Q.-X.; Li, Z.; Savarese, S.; Savva, M.; Song, S.; Su, H.; Xiao, J.; Yi, L.; and Yu, F. 2015. Shapenet: An information-rich 3d model repository. CoRR abs/1512.03012.
- [Cohen et al.2018] Cohen, T. S.; Geiger, M.; Köhler, J.; and Welling, M. 2018. Spherical cnns. CoRR abs/1801.10130.
- [Dai, Qi, and Nießner2017] Dai, A.; Qi, C. R.; and Nießner, M. 2017. Shape completion using 3d-encoder-predictor cnns and shape synthesis. 6545–6554.
- [Fan, Su, and Guibas2017] Fan, H.; Su, H.; and Guibas, L. J. 2017. A point set generation network for 3d object reconstruction from a single image. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2463–2471.
- [Goodfellow et al.2014] Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Ghahramani, Z.; Welling, M.; Cortes, C.; Lawrence, N. D.; and Weinberger, K. Q., eds., Advances in Neural Information Processing Systems 27. Curran Associates, Inc. 2672–2680.
- [Han et al.2017] Han, X.; Li, Z.; Huang, H.; Kalogerakis, E.; and Yu, Y. 2017. High-resolution shape completion using deep neural networks for global structure and local geometry inference. In The IEEE International Conference on Computer Vision (ICCV).
- [Hu et al.2019] Hu, T.; Han, Z.; Shrivastava, A.; and Zwicker, M. 2019. Render4completion: Synthesizing multi-view depth maps for 3d shape completion. CoRR abs/1904.08366.
- [Ioffe and Szegedy2015] Ioffe, S., and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML.
- [Isola et al.2017] Isola, P.; Zhu, J.-Y.; Zhou, T.; and Efros, A. A. 2017. Image-to-image translation with conditional adversarial networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 5967–5976.
- [Jiang et al.2018] Jiang, L.; Shi, S.; Qi, X.; and Jia, J. 2018. GAL: geometric adversarial loss for single-view 3D-object reconstruction. In European Conference on Computer vision, 820–834.
- [Khot et al.2019] Khot, T.; Agrawal, S.; Tulsiani, S.; Mertz, C.; Lucey, S.; and Hebert, M. 2019. Learning unsupervised multi-view stereopsis via robust photometric consistency. CoRR abs/1905.02706.
- [Li et al.2018a] Li, K.; Pham, T.; Zhan, H.; and Reid, I. D. 2018a. Efficient dense point cloud object reconstruction using deformation vector fields. In European Conference on Computer Vision, 508–524.
- [Li et al.2018b] Li, Y.; Bu, R.; Sun, M.; Wu, W.; Di, X.; and Chen, B. 2018b. Pointcnn: Convolution on x-transformed points. In Bengio, S.; Wallach, H.; Larochelle, H.; Grauman, K.; Cesa-Bianchi, N.; and Garnett, R., eds., Advances in Neural Information Processing Systems 31. Curran Associates, Inc. 820–830.
- [Lin, Kong, and Lucey2018] Lin, C.-H.; Kong, C.; and Lucey, S. 2018. Learning efficient point cloud generation for dense 3d object reconstruction. In AAAI Conference on Artificial Intelligence (AAAI).
- [Litany et al.2018] Litany, O.; Bronstein, A.; Bronstein, M.; and Makadia, A. 2018. Deformable shape completion with graph convolutional autoencoders. CVPR.
- [Nair and Hinton2010] Nair, V., and Hinton, G. E. 2010. Rectified linear units improve restricted boltzmann machines. In ICML.
- [Qi et al.2017] Qi, C. R.; Su, H.; Mo, K.; and Guibas, L. J. 2017. Pointnet: Deep learning on point sets for 3d classification and segmentation. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 77–85.
- [Richter and Roth2018] Richter, S. R., and Roth, S. 2018. Matryoshka networks: Predicting 3d geometry via nested shape layers. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition 1936–1944.
- [Soltani et al.2017] Soltani, A. A.; Huang, H.; Wu, J.; Kulkarni, T. D.; and Tenenbaum, J. B. 2017. Synthesizing 3d shapes via modeling multi-view depth maps and silhouettes with deep generative networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2511–2519.
- [Su et al.2015] Su, H.; Maji, S.; Kalogerakis, E.; and Learned-Miller, E. G. 2015. Multi-view convolutional neural networks for 3d shape recognition. In Proc. ICCV.
- [Suwajanakorn et al.2018] Suwajanakorn, S.; Snavely, N.; Tompson, J.; and Norouzi, M. 2018. Discovery of latent 3d keypoints via end-to-end geometric reasoning. In NeurIPS.
- [Tchapmi et al.2019] Tchapmi, L. P.; Kosaraju, V.; Rezatofighi, H.; Reid, I.; and Savarese, S. 2019. Topnet: Structural point cloud decoder. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- [Tulsiani, Efros, and Malik2018] Tulsiani, S.; Efros, A. A.; and Malik, J. 2018. Multi-view consistency as supervisory signal for learning shape and pose prediction. In Computer Vision and Pattern Recognition.
- [Wu et al.2015] Wu, Z.; Song, S.; Khosla, A.; Yu, F.; Zhang, L.; Tang, X.; and Xiao, J. 2015. 3d shapenets: A deep representation for volumetric shapes. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 1912–1920.
- [Yang et al.2017] Yang, Y.; Feng, C.; Shen, Y.; and Tian, D. 2017. Foldingnet: Point cloud auto-encoder via deep grid deformation. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition 206–215.
- [Yuan et al.2018] Yuan, W.; Khot, T.; Held, D.; Mertz, C.; and Hebert, M. 2018. Pcn: Point completion network. 2018 International Conference on 3D Vision (3DV) 728–737.