Introduction
Convolutional neural networks have proven highly successful at analysis and synthesis of visual data such as images and videos. This has spurred interest in applying convolutional network architectures to 3D shapes as well, where a key challenge is to find suitable generalizations of discrete convolutions to the 3D domain. Popular techniques include using discrete convolutions on 3D grids [Wu et al. 2015], graph convolutions on meshes [Litany et al. 2018], convolution-like operators on 3D point clouds [Atzmon, Maron, and Lipman 2018, Li et al. 2018b], or 2D convolutions on 2D shape parameterizations [Cohen et al. 2018]. A simple approach in the last category is to represent shapes using multiple 2D projections, or multiple depth images, and apply 2D convolutions on these views. This has led to successful techniques for shape classification [Su et al. 2015], single-view 3D reconstruction [Richter and Roth 2018], shape completion [Hu et al. 2019], and shape synthesis [Soltani et al. 2017]. One challenge in these approaches, however, is to encourage consistency among the separate views, so that each view does not represent a slightly different object. This is not an issue in supervised training, where the loss encourages all views to match the ground truth shape. But at inference time or in unsupervised training, ground truth is not available and a different mechanism is required to encourage consistency.
In this paper, we address the problem of shape completion using a multiview depth image representation, and we propose a multiview consistency loss that is minimized during inference. We formulate inference as an energy minimization problem, where the energy is the sum of a data term given by a conditional generative net, and a regularization term given by a geometric consistency loss. Our results show the benefits of optimizing geometric consistency in a multiview shape representation during inference, and we demonstrate that our approach leads to state-of-the-art results in shape completion benchmarks. In summary, our contributions are as follows:

We propose a multiview consistency loss for 3D shape completion that does not rely on ground truth data.

We formulate multiview consistent inference as an energy minimization problem including our consistency loss as a regularizer, and a neural network-based data term.

We show state-of-the-art results in standard shape completion benchmarks, demonstrating the benefits of the multiview consistency loss in practice.
Related Work
3D shape completion. Different 3D shape representations have been applied in 3D shape completion, such as voxels, point clouds, and multiple views. Voxel-based representations are widely used in shape completion with 3D CNNs, such as 3D-Encoder-Predictor CNNs [Dai, Qi, and Nießner 2017] and an encoder-decoder CNN for patch-level geometry refinement [Han et al. 2017]. However, computational complexity grows cubically as the voxel resolution increases, which severely limits the completion accuracy. To address this problem, several point cloud-based shape completion methods [Achlioptas et al. 2018, Yang et al. 2017, Yuan et al. 2018] have been proposed. The point completion network (PCN) [Yuan et al. 2018] is a current state-of-the-art approach that extends the PointNet architecture [Qi et al. 2017] to provide an encoder, followed by a multi-stage decoder that uses both fully connected [Achlioptas et al. 2018] and folding layers [Yang et al. 2017]. The output point cloud size in these methods is fixed, however, to small numbers like 2048 [Yang et al. 2017], which often leads to a loss of detail. View-based methods resolve this issue by completing each rendered depth image [Hu et al. 2019] of the incomplete shape, and then back-projecting the completed images into a dense point cloud. By leveraging state-of-the-art image-to-image translation networks [Isola et al. 2017], MVCN [Hu et al. 2019] completes each single view with a shape descriptor that encodes the characteristics of the whole 3D object to achieve higher accuracy. However, view-based methods fail to maintain geometric consistency among completed views during inference. Our approach resolves this issue using our novel multiview consistent inference technique.
Multiview consistency. One problem of view-based representations is inconsistency among multiple views. Some researchers have presented a multiview loss to train their networks to achieve consistency in multiview representations, for tasks such as discovering 3D keypoints [Suwajanakorn et al. 2018] and reconstructing 3D objects from images [Lin, Kong, and Lucey 2018, Li et al. 2018a, Tulsiani, Efros, and Malik 2018, Jiang et al. 2018, Khot et al. 2019]. With differentiable rendering [Lin, Kong, and Lucey 2018, Tulsiani, Efros, and Malik 2018], the consistency distances among different views can be leveraged as 2D supervision to learn 3D shapes in their networks. However, these methods can only guarantee consistency for training data in the training stage. In contrast, with the help of our novel energy optimization and consistency loss implementation, our proposed method can improve geometric consistency on test data directly during the inference stage.
Multiview Consistent Inference
Overview. The goal of our method is to guarantee multiview consistency in inference, as shown in the overview in Fig. 1. Our method starts by converting partial point clouds to multiview depth image representations, rendering the points into a set of incomplete depth images from a number of fixed viewpoints. In our current implementation, we use eight viewpoints placed on the corners of a cube. Our approach builds on a conditional generative net which is trained to output completed depth images by estimating a shape descriptor conditioned on a set of incomplete inputs. We obtain the conditional generative net in a separate, supervised training stage. During inference, we keep the network weights fixed and optimize the shape descriptor to minimize an energy consisting of a consistency loss, which acts as a regularizer, and a data term. On the one hand, the consistency loss quantifies the geometric consistency among the completed depth images. On the other hand, the data term encourages the solution to stay close to an initially estimated shape descriptor. This leads to the following optimization for the desired shape descriptor:
(1)
where the two terms are balanced by a weighting factor, and the data term relates the initially estimated completed depth images to the optimized completed depth images in inference. We formulate the regularization term and the data term as the multiview consistency loss and the generator loss, respectively, in Section ‘Consistency Loss’.
Conditional generative net. The conditional generative net is built on the structure of the multiview completion net [Hu et al. 2019], as shown in Fig. 2, which is an image-to-image translation architecture applied to perform depth image completion for multiple views of the same shape. We train the conditional generative net following a standard conditional GAN approach [Goodfellow et al. 2014]. To share information between multiple depth images of the same shape, our architecture learns a shape descriptor for each 3D object by pooling a so-called shape memory consisting of feature maps from all views of the shape. The network consists of 8 U-Net modules, and each U-Net module has two submodules, Down and Up, so there are 8 Down submodules in the encoder and 8 Up submodules in the decoder. Down submodules have the form Convolution-BatchNorm-ReLU [Ioffe and Szegedy 2015, Nair and Hinton 2010], and Up submodules have the form UpReLU-UpConv-UpNorm. The shape memory is the feature map after the third Down submodule of the encoder. More details can be found in [Isola et al. 2017, Hu et al. 2019].
In inference, we optimize the shape descriptor for a given test input. We first obtain an initial estimate of the shape descriptor for each test shape by running the trained model once, and use it to initialize the optimization. During inference, the other parameters of the network are fixed.
Consistency Loss
Our consistency loss is based on the sum of the distances between each pixel in the multiview depth map and its approximate nearest neighbor in any of the other views. In this section we introduce the details of the multiview consistency loss calculation following the overview in Fig. 1. For each view, we first calculate pairwise per-pixel consistency distances to each other view, that is, per-pixel distances to approximate nearest neighbors in that view. We then perform consistency pooling, which for each view provides the consistency distances over all other views (as opposed to the initial pairwise consistency distances between two views). We call these the loss maps. The final consistency loss is the sum over all loss maps.
Pairwise Consistency Distances
Given a source view and a target view, we calculate the consistency distance between them by view-reprojection and closest point pooling. Specifically, view-reprojection transforms the depth information of the source view to a reprojection map according to the transformation matrix of the target view. Then, closest point pooling produces the consistency distance between the reprojection map and the target view. Fig. 3 shows the pipeline. In the following, we consider a pixel on the source view, given by its pixel coordinates and its depth value, its back-projected 3D point, and the corresponding reprojected pixel on the reprojection map, which is again given by pixel coordinates and a depth value.
View-reprojection. The view-reprojection operator back-projects each pixel of the source view into canonical 3D coordinates via
(2)
which involves the intrinsic camera matrix together with the rotation matrix and translation vector of the view. This defines the relationship between the view and its back-projected point cloud. The transformation matrix of a view contains its pose information. Then, we transform each 3D point in the point cloud into a pixel on the reprojection map as
(3)
Eq. (2) and Eq. (3) illustrate that we can transform the depth information of the source view to a reprojection map, which has the same pose as the target view. However, due to the discrete grid of the depth images, different points in the point cloud may be projected to the same pixel on the reprojection map when using Eq. (3). Fig. 3 shows such a case, where several points land on one pixel of the reprojection map while the target view holds a single corresponding point. To alleviate this collision effect, we implement a pseudo-rendering technique similar to [Lin, Kong, and Lucey 2018]. Specifically, for each pixel on the reprojection map, a sub-pixel grid is used to store multiple depth values corresponding to the same pixel, so the reprojection map has a correspondingly higher resolution.
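The back-projection and reprojection steps can be sketched in a few lines of NumPy. This is only an illustration under a standard pinhole camera model: the function names `backproject` and `reproject`, the pose convention x_cam = R x_world + t, and the omission of background masking and of the sub-pixel depth-buffer are assumptions made here, not details taken from the paper.

```python
import numpy as np

def backproject(depth, K, R, t):
    """Lift an (H, W) depth image to an (H*W, 3) point cloud in world coordinates.

    Assumed convention (hypothetical): x_cam = R @ x_world + t.
    """
    H, W = depth.shape
    # v indexes rows, u indexes columns of the image grid.
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    # Rays through each pixel, scaled by the stored depth.
    cam = (np.linalg.inv(K) @ pix.T).T * depth.reshape(-1, 1)
    # Undo the camera pose to reach the canonical (world) frame.
    return (R.T @ (cam - t).T).T

def reproject(points, K, R, t):
    """Project (N, 3) world points into a view; returns pixel coords and depth."""
    cam = (R @ points.T).T + t
    z = cam[:, 2]
    uv = (K @ cam.T).T
    return uv[:, 0] / z, uv[:, 1] / z, z
```

Reprojecting a view's own back-projected points with the same K, R, and t returns the original pixel grid and depth values, which is a convenient sanity check. Projecting with the pose of a different (target) view gives continuous pixel coordinates that then have to be binned, which is where the collisions described above arise.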
Closest point pooling. The closest point pooling operator computes the consistency distance between the reprojection map and the target view. First, we upsample the target view to the resolution of the reprojection map by repeating each depth value into a sub-pixel grid. Then, we calculate the element-wise distance between the reprojection map and the upsampled target view. Finally, we perform closest point pooling to extract the minimal distance in each sub-pixel grid using min-pooling whose filter size and stride match the sub-pixel grid. This provides the consistency distance between the source view and the target view, as shown in Fig. 3. Note that when the source and target views coincide, we directly take the corresponding input view as the reprojection, since the incomplete input also provides some supervision.
Some consistency distances may be large due to noisy view completion or self-occlusion between the source and target views, and these outliers interfere with our energy minimization. Therefore, we perform outlier suppression by ignoring consistency distances above a threshold, set as a fraction of the depth range (from the minimum to the maximum depth value of a model).
Consistency Distance Aggregation by Consistency Pooling
Given a target view, we compute the consistency distances between it and all the other source views, as shown in Fig. 4, which uses the same colorbar as Fig. 1. Different source views cover different parts of the target view, which leads to different consistency distances. For example, the red parts of each distance map in Fig. 4 indicate regions that cannot be inferred well from the corresponding source view, so these parts are not helpful for the optimization of the target view.
By extracting the minimum distance between the target view and the reprojections from all other views, we cover the whole target view with its closest points and obtain the loss maps. In our pipeline, we implement this efficiently using a consistency pooling operator defined as
(4) 
The operator takes the number of views used in pooling as a parameter, which makes it possible to restrict pooling to a subset of the views (see Section ‘Experiments’ for an evaluation of this parameter). Fig. 4 illustrates this for one target view, and Fig. 5 shows the consistency loss maps for every target view.
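The two pooling steps described above, closest point pooling within each sub-pixel grid and consistency pooling across source views, can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the names `closest_point_pooling`, `consistency_pooling`, and `consistency_loss`, the grid size `U`, and the use of an absolute difference as the element-wise distance are assumptions, and outlier suppression is left out.

```python
import numpy as np

def closest_point_pooling(reproj, target, U):
    """Consistency distance between one source view and one target view.

    reproj: (H*U, W*U) reprojection map with a UxU sub-pixel grid per pixel.
    target: (H, W) target depth image.
    """
    H, W = target.shape
    # Repeat each target depth value into a UxU sub-pixel grid.
    up = np.repeat(np.repeat(target, U, axis=0), U, axis=1)
    # Element-wise distance (absolute difference is an assumption here).
    diff = np.abs(reproj - up)
    # Min-pooling with a UxU filter and stride U.
    return diff.reshape(H, U, W, U).min(axis=(1, 3))

def consistency_pooling(dists):
    """Loss map for one target view from (num_sources, H, W) distances."""
    return dists.min(axis=0)

def consistency_loss(loss_maps):
    """Sum over all loss maps, one per target view."""
    return float(sum(m.sum() for m in loss_maps))
```

Restricting pooling to a subset of views amounts to passing only the selected source slices to `consistency_pooling`.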
Loss Function
Our multiview consistent inference aims to maximize the depth consistency across all views by optimizing the shape descriptor of a 3D model. Therefore, the consistency loss for the whole 3D model is computed from the loss maps of all target views,
(5) 
where is the number of views and is the input set of incomplete depth images of the 3D model.
In Eq. (1), we also have a data term that keeps the shape descriptor close to its initial estimate during inference. We call this the generator loss, which aims to prevent the completed depth images from drifting away from the prior learned from the training data:
(6) 
Here the generator loss compares the initially estimated outputs with the optimized outputs for the same input. In summary, the overall loss function in inference is
(7)
where the two losses are balanced by a weighting factor. We optimize the shape descriptor for 100 gradient descent steps, and we take the descriptor with the smallest consistency loss in the last 10 steps as the final result. Because the gradients are small, we use a large learning rate of 0.2.
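The inference procedure described here (100 gradient descent steps at learning rate 0.2, keeping the descriptor with the smallest consistency loss among the last 10 steps) can be sketched as a small loop. This is an illustration only: `optimize_descriptor`, `total_grad`, and `consistency_loss` are hypothetical names, plain gradient descent is assumed, and in a real pipeline the gradient would come from backpropagation through the frozen generative net rather than from a hand-written function.

```python
import numpy as np

def optimize_descriptor(z0, total_grad, consistency_loss,
                        steps=100, lr=0.2, last_k=10):
    """Gradient descent on a shape descriptor with late-step selection.

    z0:               initial descriptor estimate (1-D array).
    total_grad(z):    gradient of the overall loss with respect to z.
    consistency_loss: scalar consistency loss used to pick the final z.
    """
    z = z0.copy()
    best_z, best_c = None, np.inf
    for step in range(steps):
        z = z - lr * total_grad(z)
        # Only the last `last_k` iterates are candidates for the result.
        if step >= steps - last_k:
            c = consistency_loss(z)
            if c < best_c:
                best_z, best_c = z.copy(), c
    return best_z, best_c
```

As a toy usage example, with a quadratic stand-in for the consistency loss, `sum((z - 1)**2)`, and a data term `0.5 * sum((z - z0)**2)` starting from `z0 = 0`, the loop converges to `z = 2/3`, the point where the two terms balance.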
Experiments
Our method is built on MVCN [Hu et al. 2019], a state-of-the-art view-based shape completion method. To fairly evaluate the improvements over MVCN directly, we use the same pipeline as MVCN to generate training and test depth images, where each 3D object is represented by multiple depth maps at a fixed resolution. We take 3D models from ShapeNet [Chang et al. 2015]. We initially fix the number of views used for consistency pooling in Eq. 5 for the following experiments. In addition, we use the same training dataset and hyperparameters as [Hu et al. 2019] to train the network, and the same test dataset as [Hu et al. 2019, Yuan et al. 2018] to evaluate our methods with Chamfer Distance (CD) [Fan, Su, and Guibas 2017].
Analysis of the Objective Function
We test different objective functions in Eq. (7) to justify the effectiveness of our methods. Table 1 shows the quantitative effects of these variations. The experiments are conducted on 100 3D airplane models (outside the test and training datasets), which are randomly selected under the constraint that the average CD is close to that of the test dataset in [Hu et al. 2019]. We vary the weighting factor and the distance function used in the generator loss. When the weighting factor is zero, only one of the two loss terms is used in the loss function. According to the comparison, we select the distance function and the weighting factor used in the following experiments.
5.228  5.160  5.129  5.160  5.155  6.383  
5.362  5.110  5.136  5.135  5.175 
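The Chamfer Distance (CD) reported in these experiments compares a completed point cloud with the ground truth. A minimal sketch of one common symmetric form is given below. The function name `chamfer_distance` is hypothetical, and conventions differ between papers (squared versus unsquared distances, summing versus averaging the two directions), so this sketch is not guaranteed to match the exact scaling behind the numbers in the tables.

```python
import numpy as np

def chamfer_distance(P, Q):
    """Symmetric Chamfer Distance between point sets P (N, 3) and Q (M, 3).

    Uses unsquared Euclidean distances and sums the two directional means,
    which is one convention among several.
    """
    # Pairwise distances, shape (N, M). Fine for small clouds; large clouds
    # would need a spatial index instead of a dense matrix.
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)
    # Mean nearest-neighbor distance in each direction.
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```

For example, with P = {(0,0,0), (1,0,0)} and Q = {(0,0,0)}, the P-to-Q term is 0.5 and the Q-to-P term is 0, giving a CD of 0.5.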
The Size of Depthbuffer in Pseudorendering
As mentioned above, we use a depth-buffer in pseudo-rendering. A bigger buffer means fewer collisions in pseudo-rendering, which makes the reprojection more accurate. The average CD is lower when we increase the size of the depth-buffer, as shown in Table 2 (a), where the experiments are conducted on two categories of the test dataset. From the loss maps in Fig. 6 (c) to (e), with the number of views in consistency pooling (Eq. (5)) held fixed, the consistency loss decreases when we increase the depth-buffer size. This is because the closest points (reprojected from the other 8 views) to the target view are more accurate. We also see fewer noisy points (brighter ones) in Fig. 6 (e).


The Number of Views in Consistency Pooling
In this part, we analyze the effects of varying the number of views in consistency pooling. As shown in Fig. 4, more views mean a larger coverage of the target view and a smaller consistency loss. For a fixed depth-buffer size, Fig. 6 (e) to (g) show that the consistency loss increases when fewer views are used in pooling, and we also find more noisy points in the loss maps in Fig. 6 (f) and (g).
Comparison with Direct Optimization Method
Our multiview consistent inference can also be used to optimize completed depth maps directly, without the conditional generative net. We call this direct optimization on depth maps, and in this part we compare our method with it. Direct optimization only contains the consistency loss calculation part (C) in Fig. 1. Each depth map is a trainable tensor. We first initialize the tensors with the completed views, and then update these tensors by minimizing the consistency loss in Eq. 7. We also include the generator loss, and the learning rate is 0.0006, which produces the best results for direct optimization.
Fig. 7 shows the comparisons. Here we color-code the normals of the completed point clouds, which are estimated using a k-d tree search algorithm with a search radius of 0.5 and a maximum number of neighbors of 30. Compared with direct optimization, our method performs better. For example, in terms of optimizing point clouds, we can smooth the surface, like the seat of the chair, and remove some outliers. As for completing depth maps, our method can fill a hole appearing in MVCN [Hu et al. 2019] and even add the missing leg, where the distances to the ground truth are marked in red.
Though the direct optimization method can also refine the point clouds of MVCN, it does not perform well in removing outliers on point clouds (left) or completing a depth map (right) in Fig. 7. The reason is that direct optimization does not have any knowledge to distinguish shape and background in a depth map, which means that for pixels in a hole, direct optimization does not know whether they belong to a hole of the shape or to the background. However, with the knowledge of shape completion learned in the conditional generative net, our method completes shapes better.
Intermediate Results and Convergence
In Fig. 8, the image insets illustrate the intermediate completion results at steps 0, 20, 40, 60, 80, and 100 for one example depth image from the cabinet class. In addition, the distance to the ground truth is averaged over all cabinet objects. For clarity, the curve is offset vertically by 0.2. We see that the completed results are closer to the ground truth than those of MVCN, though there is no ground truth supervision in inference.
Fig. 8 illustrates empirically that, under the defined loss function, our optimization can find a good solution within 100 steps. The figure shows the average loss versus gradient descent steps on all the 150 cabinet test objects. The distance to the ground truth reaches its maximum within a small number of steps and then decreases in the following steps. For 98% of all the 1200 test objects, the maximum is reached within 10 steps, and within 20 steps for almost all. After 100 steps, the optimization has largely converged.
Completion results
Improvements over Existing Works. Here we compare our method with state-of-the-art shape completion methods, including 3D-EPN [Dai, Qi, and Nießner 2017], FC [Achlioptas et al. 2018], Folding [Yang et al. 2017], three variants of PCN [Yuan et al. 2018] (PN2, PCN-CD, and PCN-EMD), and MVCN [Hu et al. 2019]. TopNet [Tchapmi et al. 2019] is a recent point-based method, but its generated point clouds are sparse.
Table 3 shows the quantitative results, where the completion results of the other methods are from [Yuan et al. 2018, Hu et al. 2019] and ‘DirectOpt’ is the direct optimization method introduced above. With multiview consistency optimization, both direct optimization and our method improve on MVCN for most categories of the test dataset, and our method achieves better results. The optimization methods fail on the Lamp category. As mentioned in [Hu et al. 2019], the reason is that the completion of MVCN is poor on several lamp objects, which makes the optimization less meaningful.
Fig. 11 shows the qualitative improvements over the currently best view-based method, MVCN, where the normals of point clouds are color-coded. With the conditional generative net and the multiview consistency loss, our method produces completed point clouds with smoother surfaces and fewer outliers, and can also fill holes of shapes on multiple categories.
Completion results given different inputs. Fig. 10 (a, b, c) show completed airplanes and cars under 3 different inputs of the same objects. Since the car input in (a) leaves a lot of ambiguity, the completed cars vary. The airplane results are more similar because the inputs contain most of the structure.
Multiple views of completed shapes. Fig. 10 (c) shows a completed airplane and car from 3 views. We see that the completed shapes are consistent among different views.
Completions on noisy inputs. In Fig. 9, we perturb the input depth map with Gaussian noise whose standard deviation is 0.01 times the scale of the depth measurements. Our completion is robust to the noisy input.
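The noise perturbation used in this robustness test can be sketched as follows. The function name `perturb_depth` is hypothetical, and interpreting "the scale of the depth measurements" as the depth range (maximum minus minimum) is an assumption made for this example. Background handling is also omitted.

```python
import numpy as np

def perturb_depth(depth, rel_sigma=0.01, rng=None):
    """Add zero-mean Gaussian noise to a depth map.

    The standard deviation is rel_sigma times the depth range, which is
    one possible reading of "scale of the depth measurements".
    """
    rng = np.random.default_rng() if rng is None else rng
    scale = depth.max() - depth.min()
    return depth + rng.normal(0.0, rel_sigma * scale, size=depth.shape)
```

Passing a seeded generator, for example `np.random.default_rng(0)`, makes the perturbation reproducible across runs.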
Conclusion
We proposed multiview consistent inference to enforce geometric consistency in view-based 3D shape completion. We defined a novel multiview consistency loss suitable for optimization in inference, which can be achieved without the supervision of ground truth. The experimental results demonstrate that our method can complete 3D shapes more accurately than existing methods.
References
[Achlioptas et al. 2018] Achlioptas, P.; Diamanti, O.; Mitliagkas, I.; and Guibas, L. J. 2018. Learning representations and generative models for 3d point clouds. In ICML.
[Atzmon, Maron, and Lipman 2018] Atzmon, M.; Maron, H.; and Lipman, Y. 2018. Point convolutional neural networks by extension operators. ACM Trans. Graph. 37(4):71:1–71:12.
[Chang et al. 2015] Chang, A. X.; Funkhouser, T. A.; Guibas, L. J.; Hanrahan, P.; Huang, Q.-X.; Li, Z.; Savarese, S.; Savva, M.; Song, S.; Su, H.; Xiao, J.; Yi, L.; and Yu, F. 2015. Shapenet: An information-rich 3d model repository. CoRR abs/1512.03012.
[Cohen et al. 2018] Cohen, T. S.; Geiger, M.; Köhler, J.; and Welling, M. 2018. Spherical cnns. CoRR abs/1801.10130.
[Dai, Qi, and Nießner 2017] Dai, A.; Qi, C. R.; and Nießner, M. 2017. Shape completion using 3d-encoder-predictor cnns and shape synthesis. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 6545–6554.
[Fan, Su, and Guibas 2017] Fan, H.; Su, H.; and Guibas, L. J. 2017. A point set generation network for 3d object reconstruction from a single image. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2463–2471.
[Goodfellow et al. 2014] Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Ghahramani, Z.; Welling, M.; Cortes, C.; Lawrence, N. D.; and Weinberger, K. Q., eds., Advances in Neural Information Processing Systems 27. Curran Associates, Inc. 2672–2680.
[Han et al. 2017] Han, X.; Li, Z.; Huang, H.; Kalogerakis, E.; and Yu, Y. 2017. High-resolution shape completion using deep neural networks for global structure and local geometry inference. In The IEEE International Conference on Computer Vision (ICCV).
[Hu et al. 2019] Hu, T.; Han, Z.; Shrivastava, A.; and Zwicker, M. 2019. Render4completion: Synthesizing multiview depth maps for 3d shape completion. CoRR abs/1904.08366.
[Ioffe and Szegedy 2015] Ioffe, S., and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML.
[Isola et al. 2017] Isola, P.; Zhu, J.-Y.; Zhou, T.; and Efros, A. A. 2017. Image-to-image translation with conditional adversarial networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 5967–5976.
[Jiang et al. 2018] Jiang, L.; Shi, S.; Qi, X.; and Jia, J. 2018. GAL: geometric adversarial loss for single-view 3D-object reconstruction. In European Conference on Computer Vision, 820–834.
[Khot et al. 2019] Khot, T.; Agrawal, S.; Tulsiani, S.; Mertz, C.; Lucey, S.; and Hebert, M. 2019. Learning unsupervised multiview stereopsis via robust photometric consistency. CoRR abs/1905.02706.
[Li et al. 2018a] Li, K.; Pham, T.; Zhan, H.; and Reid, I. D. 2018a. Efficient dense point cloud object reconstruction using deformation vector fields. In European Conference on Computer Vision, 508–524.
[Li et al. 2018b] Li, Y.; Bu, R.; Sun, M.; Wu, W.; Di, X.; and Chen, B. 2018b. Pointcnn: Convolution on x-transformed points. In Bengio, S.; Wallach, H.; Larochelle, H.; Grauman, K.; Cesa-Bianchi, N.; and Garnett, R., eds., Advances in Neural Information Processing Systems 31. Curran Associates, Inc. 820–830.
[Lin, Kong, and Lucey 2018] Lin, C.-H.; Kong, C.; and Lucey, S. 2018. Learning efficient point cloud generation for dense 3d object reconstruction. In AAAI Conference on Artificial Intelligence (AAAI).
[Litany et al. 2018] Litany, O.; Bronstein, A.; Bronstein, M.; and Makadia, A. 2018. Deformable shape completion with graph convolutional autoencoders. CVPR.
[Nair and Hinton 2010] Nair, V., and Hinton, G. E. 2010. Rectified linear units improve restricted boltzmann machines. In ICML.
[Qi et al. 2017] Qi, C. R.; Su, H.; Mo, K.; and Guibas, L. J. 2017. Pointnet: Deep learning on point sets for 3d classification and segmentation. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 77–85.
[Richter and Roth 2018] Richter, S. R., and Roth, S. 2018. Matryoshka networks: Predicting 3d geometry via nested shape layers. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition 1936–1944.
[Soltani et al. 2017] Soltani, A. A.; Huang, H.; Wu, J.; Kulkarni, T. D.; and Tenenbaum, J. B. 2017. Synthesizing 3d shapes via modeling multiview depth maps and silhouettes with deep generative networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2511–2519.
[Su et al. 2015] Su, H.; Maji, S.; Kalogerakis, E.; and Learned-Miller, E. G. 2015. Multiview convolutional neural networks for 3d shape recognition. In Proc. ICCV.
[Suwajanakorn et al. 2018] Suwajanakorn, S.; Snavely, N.; Tompson, J.; and Norouzi, M. 2018. Discovery of latent 3d keypoints via end-to-end geometric reasoning. In NeurIPS.
[Tchapmi et al. 2019] Tchapmi, L. P.; Kosaraju, V.; Rezatofighi, H.; Reid, I.; and Savarese, S. 2019. Topnet: Structural point cloud decoder. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[Tulsiani, Efros, and Malik 2018] Tulsiani, S.; Efros, A. A.; and Malik, J. 2018. Multiview consistency as supervisory signal for learning shape and pose prediction. In Computer Vision and Pattern Recognition.
[Wu et al. 2015] Wu, Z.; Song, S.; Khosla, A.; Yu, F.; Zhang, L.; Tang, X.; and Xiao, J. 2015. 3d shapenets: A deep representation for volumetric shapes. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 1912–1920.
[Yang et al. 2017] Yang, Y.; Feng, C.; Shen, Y.; and Tian, D. 2017. Foldingnet: Point cloud autoencoder via deep grid deformation. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition 206–215.
[Yuan et al. 2018] Yuan, W.; Khot, T.; Held, D.; Mertz, C.; and Hebert, M. 2018. Pcn: Point completion network. 2018 International Conference on 3D Vision (3DV) 728–737.