3D shape generation is a long-standing research problem in computer vision and computer graphics with applications in autonomous driving, augmented reality, and related areas. Conventional approaches mainly leverage multi-view geometry based on stereo correspondences between images but are restricted by the coverage provided by the input views. With the availability of large-scale 3D shape datasets and the success of deep learning in several computer vision tasks, 3D representations such as voxel grids Choy et al. (2016); Tulsiani et al. (2017); Yan et al. (2016) and point clouds Yang et al. (2018); Fan et al. (2017) have been explored for single-view 3D reconstruction. Among 3D representations, the triangle mesh has received the most attention as it has desirable properties for a wide range of applications and can model detailed geometry without high memory requirements.
Single-view 3D reconstruction methods Wang et al. (2018); Huang et al. (2015); Kar et al. (2015); Su et al. (2014) generate the 3D shape from a single color image but suffer from occlusion and limited visibility, which leads to low-quality reconstructions in the unseen areas. Multi-view methods Wen et al. (2019); Choy et al. (2016); Kar et al. (2017); Gwak et al. (2017) extend the input to images from different viewpoints, which provides more visual information and improves the accuracy of the generated shapes. Recent work in multi-view mesh reconstruction Wen et al. (2019) introduces a multi-view deformation network that uses perceptual features from each color image to refine the meshes generated by Pixel2Mesh Wang et al. (2018). Although promising results were obtained, this method relies on perceptual features from color images, which do not explicitly encode the objects’ geometry and could restrict the accuracy of the 3D models.
In this work, we present a novel multi-view mesh generation method. We start by predicting coarse volumetric occupancy grids from the color image of each input viewpoint independently using a shared fully convolutional network. These grids are merged into a single voxel grid in a probabilistic fashion, followed by a cubify operation Gkioxari et al. (2019) that converts it to a triangle mesh. We then use a Graph Convolutional Network (GCN) Scarselli et al. (2008); Wang et al. (2018) to refine the cubified voxel grid in a coarse-to-fine manner. The GCN refines the coarse mesh using the feature vector of each graph node (mesh vertex), obtained by projecting the vertices onto the 2D contrastive depth features. The contrastive depth features are extracted from the rendered depth maps of the current mesh and the predicted depth maps from a multi-view stereo network. We also propose an attention-based method to fuse features from multiple views that can learn the importance of different views for each mesh vertex. Constraints between the intermediate refined meshes from the GCN and the predicted depth maps of different viewpoints further improve the final mesh quality.
By employing multi-view voxel grid generation and refining it using geometry information from both the current mesh (through the rendered depth maps) and predicted depth maps, we are able to generate high-quality meshes. We validate our method on the ShapeNet Chang et al. (2015) benchmark and our method achieves the best performance among all previous multi-view and single-view mesh generation methods.
2 Related Work
2.1 Traditional Shape Generation Methods
3D model generation has traditionally been tackled using multi-view geometry principles. Among them, structure-from-motion (SfM) Schonberger and Frahm (2016) and simultaneous localization and mapping (SLAM) Cadena et al. (2016) are popular techniques that perform 3D reconstruction and camera pose estimation at the same time. Closer to our problem setup, multi-view stereo methods infer 3D geometry from images with known camera parameters. Volumetric methods Kar et al. (2017); Kutulakos and Seitz (2000); Seitz and Dyer (1999) predict a voxel grid representation of objects by estimating the relationship between each voxel and the object surfaces. Point cloud based methods Furukawa and Ponce (2009); Lhuillier and Quan (2005) start with a sparse point cloud and gradually increase the density of points to obtain a final dense point cloud of the object. Durou et al. (2008); Zhang et al. (1999); Favaro and Soatto (2005) use shading, texture and defocus to reason about visible parts of the object and infer its 3D geometry. While the results of these works are impressive in terms of quality and completeness of reconstruction, they still struggle with poorly textured and reflective surfaces and require carefully selected input views.
2.2 Deep Shape Generation Methods
Deep learning based approaches can learn to infer 3D structure from training data and can be robust against poorly textured and reflective surfaces as well as limited and arbitrarily selected input views. These methods can be categorized into single-view and multi-view methods. Huang et al. (2015); Su et al. (2014) use shape component retrieval and deformation from a large dataset for single-view 3D shape generation. Kurenkov et al. (2018) extend this idea by introducing free-form deformation networks on object templates retrieved from a database. Some works learn shape deformation from ground truth foreground masks of 2D images Kar et al. (2015); Yan et al. (2016); Tulsiani et al. (2017). Recurrent Neural Network (RNN) based methods Choy et al. (2016); Kar et al. (2017); Gwak et al. (2017) are another popular solution to this problem. Gwak et al. (2017); Lin et al. (2019) introduce image silhouettes along with adversarial multi-view constraints and optimize object mesh models using multi-view photometric constraints. Predicting meshes directly from color images was proposed in Wang et al. (2018); Wickramasinghe et al. (2019); Pan et al. (2019); Wen et al. (2019); Gkioxari et al. (2019); Tang et al. (2019). DR-KFS Jin et al. (2019) introduces a differentiable visual similarity metric while SeqXY2SeqZ Han et al. (2020) represents 3D shapes using a set of 2D voxel tubes for shape reconstruction. Front2Back Yao et al. (2020) generates 3D shapes by fusing predicted depth and normal images, and DV-Net Jia et al. (2020) predicts dense object point clouds from dual-view RGB images with a gated control network that fuses the point clouds from the two views.
2.3 Depth Estimation
Compared to 3D shape generation, depth prediction is an easier problem formulation since it simplifies the task to per-view depth map estimation. Deep learning based multi-view stereo depth estimation was first introduced in Hartmann et al. (2017), where a learned cost metric is used to estimate patch similarities. DeepMVS Huang et al. (2018) warps multi-view images to 3D space and then applies deep networks for regularization and aggregation to estimate depth images. Learned 3D cost volume based depth prediction was proposed in MVSNet Yao et al. (2018), where a 3D cost volume is built from homographically warped 2D features of the multi-view images and 3D CNNs are used for cost regularization and depth regression. This idea was further extended by Chen et al. (2019); Luo et al. (2019); Gu et al. (2019); Yao et al. (2019).
Figure 1 shows the architecture of the proposed system which takes as input multi-view color images of an object with known poses and outputs a triangle mesh representing the surface of the object.
3.1 Multi-view Voxel Grid Prediction
Single-view Voxel Grid Prediction
The single-view voxel branch consists of a ResNet feature extractor and a fully convolutional voxel grid prediction network. It generates the coarse initial shape of an object from one viewpoint as a voxel occupancy grid using a color image. Here, we set the resolution of the generated voxel occupancy grid as $32 \times 32 \times 32$. The voxel prediction networks for all viewpoints share the same weights.
Probabilistic Occupancy Grid Merging
A voxel occupancy grid predicted from a single viewpoint suffers from occlusion and limited visibility. In order to fuse voxel grids from different viewpoints, we propose a probabilistic occupancy grid merging method which merges the voxel grids from each input viewpoint probabilistically to obtain the final voxel grid output. This allows occluded regions in one view to be estimated from other views where those regions are visible, and increases the confidence of the prediction in overlapping regions. The occupancy probability of each voxel is represented by its log-odds:
$$l(x) = \log \frac{p(x)}{1 - p(x)},$$
where $p(x)$ is the predicted occupancy probability of voxel $x$.
A Bayesian update on the probabilities reduces to a simple summation of log likelihoods Konolige (1997). Hence, the multi-view log-odds of a voxel is given by:
$$l(x) = \sum_{i=1}^{N} l_i(x),$$
where $l_i(x)$ is the log-odds of the voxel in view $i$ and $N$ is the number of input views.
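The merging step can be illustrated with a minimal sketch. It assumes that the per-view grids are already aligned in a common frame and flattened to lists, and that the prior occupancy is 0.5 (zero prior log-odds); the function names and the clamping constant `eps` are illustrative choices, not part of the described system.

```python
import math

def logit(p, eps=1e-6):
    """Log-odds of a probability, clamped away from 0 and 1 for stability."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def sigmoid(l):
    """Inverse of logit: maps log-odds back to a probability."""
    if l >= 0:
        return 1.0 / (1.0 + math.exp(-l))
    e = math.exp(l)
    return e / (1.0 + e)

def merge_occupancy(per_view_probs):
    """Fuse per-view occupancy probabilities of aligned voxels.

    per_view_probs: one flat list of probabilities per view, all equally long.
    Returns the fused occupancy probability of every voxel.
    """
    n_voxels = len(per_view_probs[0])
    merged = []
    for idx in range(n_voxels):
        # Summing log-odds corresponds to a Bayesian update with a 0.5 prior.
        total = sum(logit(view[idx]) for view in per_view_probs)
        merged.append(sigmoid(total))
    return merged

# Two views, three voxels (hypothetical values).
views = [
    [0.9, 0.5, 0.2],
    [0.8, 0.9, 0.5],
]
fused = merge_occupancy(views)
```

In this toy example, two agreeing views raise the confidence of the first voxel above either single-view estimate, while a view that is uninformative about a voxel (probability 0.5) leaves the other view's estimate unchanged.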
3.2 Mesh Refinement
The cubified mesh from the voxel branch only provides a coarse reconstruction of the object’s surface. We apply graph convolutional networks, which represent each mesh vertex as one graph node and deform the vertices to more accurate positions.
GCN-based Mesh Deformation
The features pooled from multi-view images along with the 3D coordinates of the vertices in the world frame are used as features of the graph nodes. A series of Graph Convolutional Network (GCN) blocks is applied to deform the mesh at the current stage to the next stage, starting with the cubified voxel grid. A graph convolution deforms mesh vertices by propagating features from neighboring vertices by applying
$$f'_i = \mathrm{ReLU}\Big(W_0 f_i + \sum_{j \in \mathcal{N}(i)} W_1 f_j\Big),$$
where $\mathcal{N}(i)$ is the set of neighboring vertices of the i-th vertex in the mesh, $f_i$ represents the feature vector of a vertex, and $W_0$ and $W_1$ are learnable parameters of the model. Each GCN block utilizes several graph convolutions to transform the vertex features along with a final vertex refinement operation where the features along with the vertex coordinates are further transformed as
$$v'_i = v_i + \tanh\big(W_{vert} [f_i; v_i]\big),$$
where the matrix $W_{vert}$ is another learnable parameter to obtain the deformed mesh.
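A small sketch may help make the two operations concrete. It assumes the Mesh R-CNN style formulation, i.e. a ReLU over a self term plus summed neighbor terms for the graph convolution, and a tanh-bounded offset for the vertex refinement; the helper names, the toy triangle and the weight values are hypothetical.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) with a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def graph_conv(features, neighbors, W0, W1):
    """Assumed form: f'_i = ReLU(W0 f_i + sum over neighbors j of W1 f_j)."""
    out = []
    for i, f_i in enumerate(features):
        acc = matvec(W0, f_i)
        for j in neighbors[i]:
            msg = matvec(W1, features[j])
            acc = [a + m for a, m in zip(acc, msg)]
        out.append([max(0.0, a) for a in acc])
    return out

def refine_vertices(vertices, features, W_vert):
    """Assumed form: v'_i = v_i + tanh(W_vert [f_i; v_i])."""
    refined = []
    for v, f in zip(vertices, features):
        offset = matvec(W_vert, f + v)  # list concatenation [f_i; v_i]
        refined.append([vi + math.tanh(o) for vi, o in zip(v, offset)])
    return refined

# Toy mesh: a single triangle, 2-D vertex features (hypothetical values).
vertices = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
features = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
neighbors = [[1, 2], [0, 2], [0, 1]]
W0 = [[1.0, 0.0], [0.0, 1.0]]
W1 = [[0.5, 0.0], [0.0, 0.5]]

new_features = graph_conv(features, neighbors, W0, W1)
W_zero = [[0.0] * 5 for _ in range(3)]
unchanged = refine_vertices(vertices, new_features, W_zero)
```

Because the offset passes through tanh, each coordinate can move by at most one unit per refinement step in this sketch, which keeps individual deformations bounded.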
Contrastive Depth Feature Extraction
Yao et al. (2020) demonstrate that using intermediate, image-centric 2.5D representations instead of directly generating 3D shapes in the global frame from raw 2D images can improve 3D reconstruction quality. We therefore propose to formulate the features for graph nodes using 2.5D depth maps as additional inputs alongside the RGB features. Specifically, we render the meshes at different GCN stages to depth images at all the input views using Kato et al. (2018) and use them along with the predicted depths for depth feature extraction. We call this form of depth input contrastive depth, as it contrasts the rendered depths of the current mesh against the predicted depths and allows the network to reason about the deformation better than when using predicted depths or color images alone. Given the 2D features, the corresponding feature vectors of individual vertices can be found by projecting the 3D vertex coordinates onto the feature planes using the known camera parameters. We use VGG-16 Simonyan and Zisserman (2014) as our contrastive depth feature extraction network.
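The vertex-to-feature lookup described above can be sketched as follows. The sketch assumes a skew-free pinhole camera with intrinsics `K` and world-to-camera pose `(R, t)`, a feature map at image resolution, and bilinear interpolation for sub-pixel locations; all names and numbers are illustrative and not taken from the actual implementation.

```python
import math

def project(vertex, K, R, t):
    """Project a world-frame vertex to pixel coordinates (u, v)."""
    cam = [sum(R[r][c] * vertex[c] for c in range(3)) + t[r] for r in range(3)]
    u = K[0][0] * cam[0] / cam[2] + K[0][2]
    v = K[1][1] * cam[1] / cam[2] + K[1][2]
    return u, v

def bilinear_sample(feature_map, u, v):
    """Sample feature_map[row][col] (a list of channels) at column u, row v."""
    h, w = len(feature_map), len(feature_map[0])
    u = min(max(u, 0.0), w - 1.0)  # clamp to the image borders
    v = min(max(v, 0.0), h - 1.0)
    x0, y0 = int(math.floor(u)), int(math.floor(v))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = u - x0, v - y0
    out = []
    for c in range(len(feature_map[0][0])):
        top = (1 - ax) * feature_map[y0][x0][c] + ax * feature_map[y0][x1][c]
        bottom = (1 - ax) * feature_map[y1][x0][c] + ax * feature_map[y1][x1][c]
        out.append((1 - ay) * top + ay * bottom)
    return out

# Hypothetical camera looking down the z-axis, 2x2 single-channel feature map.
K = [[2.0, 0.0, 0.5], [0.0, 2.0, 0.5], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 2.0]
fmap = [[[0.0], [1.0]], [[2.0], [3.0]]]

u, v = project([0.0, 0.0, 0.0], K, R, t)
vertex_feature = bilinear_sample(fmap, u, v)
```

Repeating this lookup for every input view yields one feature vector per view for each vertex, which is what a multi-view pooling step would then fuse.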
Multi-View Depth Estimation
We extend MVSNet Yao et al. (2018) to predict the depth maps of all views, since the original implementation predicts the depth of only one reference view. This is achieved by transforming the feature volumes to each view’s coordinate frame using homography warping and applying identical cost volume regularization and depth regression on each view. A detailed network architecture diagram of this module is provided in the appendix.
Attention-based Multi-View Feature Pooling
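In order to fuse multi-view contrastive depth features, we formulate an attention module by adapting the multi-head attention mechanism originally designed for sequence-to-sequence machine translation with the transformer (encoder-decoder) architecture Vaswani et al. (2017). In a transformer architecture, the encoder hidden state is mapped to lower-dimensional key-value pairs (K, V) while the decoder hidden state is mapped to a query vector Q using independent fully connected layers. The encoder hidden state in our case is the multi-view features while the decoder hidden state is the mean of the multi-view features. The attention weights are computed using scaled dot-product:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V, \qquad Q = \bar{f}\, W^Q, \quad K = F W^K, \quad V = F W^V, \quad \bar{f} = \frac{1}{N} \sum_{n=1}^{N} f_n,$$
where $F = [f_1; \dots; f_N]$ stacks the per-view features of a vertex, $d_k$ is the dimension of the keys, and $N$ is the number of input views.
Multiple attention heads are used which are concatenated and transformed to obtain the final output
$$\mathrm{head}_i = \mathrm{Attention}(\bar{f}\, W_i^Q, F W_i^K, F W_i^V), \qquad \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O,$$
where the matrices $W$ are parameters to be learned, $h$ is the number of attention heads and $i \in \{1, \dots, h\}$.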
We choose multi-head attention as our feature pooling method since it allows the model to attend to information from different representation subspaces of the features by training multiple attentions in parallel. This method is also invariant to the order and number of input views. We visualize the learned attention weights (averaged over the attention heads) in Figure 2, where we can observe that the attention weights roughly take into account the visibility/occlusion information from each view.
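The pooling scheme can be sketched in a few lines. The sketch assumes the standard scaled dot-product and multi-head formulation of Vaswani et al. (2017), with the query derived from the mean of the per-view features and the keys and values derived from the individual views; the projection matrices below are hypothetical placeholders rather than trained parameters.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) with a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attention_head(view_feats, Wq, Wk, Wv):
    """One head: query from the mean feature, keys/values from each view."""
    n, dim = len(view_feats), len(view_feats[0])
    mean = [sum(f[d] for f in view_feats) / n for d in range(dim)]
    q = matvec(Wq, mean)
    keys = [matvec(Wk, f) for f in view_feats]
    vals = [matvec(Wv, f) for f in view_feats]
    scale = math.sqrt(len(q))
    weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
    pooled = [sum(w * v[d] for w, v in zip(weights, vals))
              for d in range(len(vals[0]))]
    return pooled, weights

def multi_head_pool(view_feats, heads, Wo):
    """Concatenate the outputs of all heads and apply an output projection."""
    concat = []
    for Wq, Wk, Wv in heads:
        pooled, _ = attention_head(view_feats, Wq, Wk, Wv)
        concat.extend(pooled)
    return matvec(Wo, concat)

# Three views of one vertex, 2-D features, two heads (hypothetical values).
I = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]
heads = [(I, I, I), (swap, I, I)]
Wo = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

pooled = multi_head_pool(feats, heads, Wo)
```

Since the query depends only on the mean and the weighted sum runs over all views, permuting the views or changing their number leaves the computation well defined, which matches the invariance property noted above.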
3.3 Loss functions
The losses derived from Wang et al. (2018) constrain the mesh predicted by each GCN block (P) to resemble the ground truth (Q). They include the Chamfer distance and the surface normal loss, with additional regularization in the form of an edge length loss for visually appealing results.
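As a rough illustration of two of these terms, the sketch below computes a symmetric Chamfer distance between point sets and an edge length regularizer. It assumes squared Euclidean distances and mean normalization, which is one common convention; the exact normalization used for training may differ, and the surface normal term is omitted for brevity.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def chamfer_distance(P, Q):
    """Mean squared nearest-neighbor distance, accumulated in both directions."""
    p_to_q = sum(min(sq_dist(p, q) for q in Q) for p in P) / len(P)
    q_to_p = sum(min(sq_dist(q, p) for p in P) for q in Q) / len(Q)
    return p_to_q + q_to_p

def edge_length_loss(vertices, edges):
    """Mean squared edge length; discourages long, irregular edges."""
    return sum(sq_dist(vertices[i], vertices[j]) for i, j in edges) / len(edges)

# Hypothetical predicted (P) and ground-truth (Q) samples.
P = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
Q = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
cd = chamfer_distance(P, Q)
edge = edge_length_loss(P, [(0, 1)])
```

This brute-force version is quadratic in the number of points and is only meant to show the definition; practical implementations rely on batched nearest-neighbor queries.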
Our depth prediction network is supervised using adaptive reversed Huber loss (also known as BerHu criterion) Lambert-Lacroix and Zwald (2016).
Contrastive depth loss
BerHu loss is also applied between the rendered depth images at different GCN stages and the predicted depth images.
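For reference, a minimal sketch of the reversed Huber penalty is given below. It behaves like an L1 loss for small residuals and like a scaled L2 loss for large ones. The adaptive threshold is assumed here to be a fixed fraction (`ratio`, set to 0.2) of the largest absolute residual, which is a common choice in depth estimation but is not confirmed by the text above.

```python
def berhu_loss(pred, target, ratio=0.2):
    """Reversed Huber (BerHu) loss averaged over all depth values.

    |r|                  if |r| <= c
    (r^2 + c^2) / (2 c)  otherwise
    with c = ratio * max|r| (assumed adaptive threshold).
    """
    residuals = [abs(p - t) for p, t in zip(pred, target)]
    c = ratio * max(residuals)
    if c == 0.0:
        return 0.0  # perfect prediction
    total = 0.0
    for r in residuals:
        total += r if r <= c else (r * r + c * c) / (2.0 * c)
    return total / len(residuals)

# Hypothetical depth values in metres.
loss = berhu_loss([1.0, 2.0, 3.0], [1.0, 2.1, 4.0])
```

The same function could serve both for supervising predicted depths against ground truth and for comparing rendered depths with predicted depths, since both are per-pixel depth residuals.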
Binary cross-entropy loss between the predicted voxel occupancy probabilities and the ground truth occupancies is used as voxel loss to supervise the voxel predictions
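The voxel loss follows the usual definition of binary cross-entropy, sketched below for a flattened grid. The clamping constant `eps` is an illustrative safeguard against taking the logarithm of zero.

```python
import math

def voxel_bce(pred_probs, gt_occupancy, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(pred_probs, gt_occupancy):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(pred_probs)

# Hypothetical predictions for four voxels.
uncertain = voxel_bce([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
confident = voxel_bce([0.9, 0.1, 0.8, 0.2], [1, 0, 1, 0])
```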
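We use the weighted sum of the individual losses discussed above as the final loss to train our model in an end-to-end fashion:
$$\mathcal{L} = \lambda_{chamfer}\mathcal{L}_{chamfer} + \lambda_{normal}\mathcal{L}_{normal} + \lambda_{edge}\mathcal{L}_{edge} + \lambda_{depth}\mathcal{L}_{depth} + \lambda_{contrastive}\mathcal{L}_{contrastive} + \lambda_{voxel}\mathcal{L}_{voxel},$$
where $\mathcal{L}$ is the final loss term.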
4.1 Experimental Setup
We evaluate the proposed method against various multi-view shape generation methods. The state-of-the-art method is Pixel2Mesh++ Wen et al. (2019) (referred to as P2M++). Wen et al. (2019) also provide a baseline that directly extends Pixel2Mesh Wang et al. (2018) to operate on multi-view images (referred to as MVP2M), using their statistical feature pooling method to aggregate features from multiple color images. Results from the additional multi-view shape generation baselines 3D-R2N2 Choy et al. (2016) and LSM Kar et al. (2017) are also reported.
We evaluate our method against the state-of-the-art methods on the dataset from Choy et al. (2016) which is a subset of ShapeNet Chang et al. (2015) and has been widely used by recent 3D shape generation methods. It contains 50K 3D CAD models from 13 categories. Each model is rendered with a transparent background from 24 randomly chosen camera viewpoints to obtain color images. The corresponding camera intrinsics and extrinsics are provided in the dataset. Since the dataset does not contain depth images, we render them using a custom depth renderer at the same viewpoints as the color images and with the same camera intrinsics. We follow the training/testing/validation split of Gkioxari et al. (2019).
For the depth prediction module, we follow the original MVSNet Yao et al. (2018) implementation. The output depth map is reduced by a factor of 4, from the $224 \times 224$ input image to $56 \times 56$. The number of depth hypotheses is chosen as 48, which offers a balance between accuracy and running/training time efficiency. These depth hypotheses represent values from m to m at an interval of mm. These values were chosen based on the range of depths present in the dataset.
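To illustrate how a discrete set of depth hypotheses leads to a continuous depth estimate, the sketch below follows the soft-argmin style regression used by MVSNet: a softmax over per-hypothesis scores gives a probability for each hypothesis, and the predicted depth is the expectation under that distribution. The depth range of 1.0 to 2.0 used here is purely hypothetical, and the scores stand in for the regularized cost volume at a single pixel.

```python
import math

def depth_hypotheses(d_min, d_max, count=48):
    """Uniformly spaced depth hypotheses between d_min and d_max (inclusive)."""
    step = (d_max - d_min) / (count - 1)
    return [d_min + i * step for i in range(count)]

def regress_depth(scores, hypotheses):
    """Expected depth under a softmax over per-hypothesis scores."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return sum(w / z * d for w, d in zip(weights, hypotheses))

hyps = depth_hypotheses(1.0, 2.0)          # hypothetical range
flat = regress_depth([0.0] * 48, hyps)     # no preference: midpoint of the range
peaked_scores = [0.0] * 48
peaked_scores[10] = 50.0                   # strong evidence for hypothesis 10
peaked = regress_depth(peaked_scores, hyps)
```

Because the output is an expectation rather than an argmax, the regressed depth can fall between two hypotheses, which is why a moderate number of hypotheses can still yield sub-interval accuracy.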
The hierarchical features obtained from the "Contrastive Depth Features Extractor" have a total of 4800 dimensions for each view. The aggregated multi-view features are compressed to 480 dimensions after applying attentive feature pooling. 5 attention heads are used for merging multi-view features. The loss function weights are set as, , , and . Two settings of were used, (referred to as Best) which gives better quantitative results and (referred to as Pretty) which gives better qualitative results.
Training and Runtime
The network is optimized using Adam optimizer with a learning rate of
. The training is done on 5 Nvidia RTX-2080 GPUs with effective batch size 5. The depth prediction network (MVSNet) is trained independently for 30 epochs. Then the whole system is trained for another 40 epochs with the weights of the MVSNet frozen. Our system is implemented in PyTorch deep learning framework and it takes around 60 hours for training.
, we use F1-score as our evaluation metric. The F1-score is the harmonic mean of precision and recall where the precision/recall are calculated by finding the percentage of points in the predicted/ground truth that can find a nearest neighbor from the other within a threshold. Two values of are used: and m.
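The metric can be sketched as follows, assuming that both meshes have already been converted to point sets by sampling their surfaces. The threshold value and the points in the example are hypothetical and only serve to show the computation.

```python
import math

def f1_score(pred_pts, gt_pts, tau):
    """Harmonic mean of precision and recall at distance threshold tau."""
    def hit_rate(src, dst):
        # Fraction of src points whose nearest neighbor in dst is closer than tau.
        hits = sum(1 for s in src if min(math.dist(s, d) for d in dst) < tau)
        return hits / len(src)

    precision = hit_rate(pred_pts, gt_pts)  # predicted points near ground truth
    recall = hit_rate(gt_pts, pred_pts)     # ground-truth points near prediction
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.05), (1.0, 0.0, 0.0)]
score = f1_score(pred, gt, tau=0.1)
```

In this example one of the three predicted points is an outlier, so precision is 2/3 while recall is 1, giving an F1-score of 0.8.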
4.2 Comparison with previous Multi-view Shape Generation Methods
We quantitatively compare our method against previous works for multi-view shape generation in Table 1 and show the effectiveness of our methods in improving the shape quality. Our method outperforms the state-of-the-art method Pixel2Mesh++ Wen et al. (2019) with a decrease in chamfer distance to ground truth by 34% and 15% increase in F1-score at threshold . Note that in Table 1 the same model is trained for all the categories but accuracy on individual categories as well as average over the categories are evaluated. We provide the chamfer distances in the appendix.
|Category||F-score ()||F-score ()|
We also provide visual results for qualitative assessment of the generated shapes by our Pretty model in Figure 3 which shows that it is able to more accurately predict topologically diverse shapes.
4.3 Ablation studies
Contrastive Depth Feature Extraction
We evaluate several methods for contrastive feature extraction (Sub-section 3.2). These methods are 1) Input Concatenation: using the concatenated rendered and predicted depth maps as input to the VGG feature extractor, 2) Input Difference: using the difference of the two depth maps as input to VGG, 3) Feature Concatenation: concatenating the features from the rendered and predicted depths extracted by a shared VGG, 4) Feature Difference: using the difference of the features from the two depth maps extracted by a shared VGG, and 5) None: using the VGG features from the predicted depths only. The quantitative results are summarized in Table 2 and show that the Input Concatenation method produces better results than the other formulations.
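| Method | F1-score (smaller threshold) | F1-score (larger threshold) |
|---|---|---|
| (1) Input Concatenation | 80.80 | 90.72 |
| (2) Input Difference | 80.41 | 90.54 |
| (3) Feature Concatenation | 80.45 | 90.54 |
| (4) Feature Difference | 80.30 | 90.40 |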
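The two input-level formulations compared above differ only in how the rendered and predicted depth maps are combined before feature extraction. A minimal sketch, with depth maps represented as nested lists and hypothetical values:

```python
def input_concatenation(rendered, predicted):
    """Stack two HxW depth maps into one HxW map with 2 channels per pixel."""
    return [[[r, p] for r, p in zip(r_row, p_row)]
            for r_row, p_row in zip(rendered, predicted)]

def input_difference(rendered, predicted):
    """Per-pixel difference of the two depth maps (1 channel per pixel)."""
    return [[[r - p] for r, p in zip(r_row, p_row)]
            for r_row, p_row in zip(rendered, predicted)]

rendered = [[1.0, 1.5], [2.0, 2.0]]    # depth rendered from the current mesh
predicted = [[1.0, 1.0], [2.5, 2.0]]   # depth predicted by the stereo network

concat = input_concatenation(rendered, predicted)
diff = input_difference(rendered, predicted)
```

Concatenation keeps both absolute depths available to the feature extractor, whereas the difference retains only their discrepancy; this may be one reason the concatenated input performs slightly better in the comparison above.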
In the 5th and 6th rows of Table 3, we compare the proposed attention method against statistical feature pooling Wen et al. (2019) and a simpler attention mechanism Hu et al. (2020); Yang et al. (2020) where the pooled features are simply the weighted sum of the multi-view features. We find that the three methods perform similarly on our final architecture, but the multi-head attention method performs better on more light-weight architectures.
Contrastive Depth Losses
We also evaluate the effect of using additional regularization from contrastive depth losses: rendered depth vs. predicted depth and rendered depth vs. ground truth depth in the 3rd, 4th and 5th rows of Table 3 which show that introducing the additional loss terms to constrain the refined meshes improves the accuracy of the generated shapes.
Ground truth depth as input
In row 7 we use ground truth instead of predicted depths which gives the upper bound on our mesh prediction accuracy in relation to the depth prediction accuracy.
Row 8 uses a sphere as the coarse shape instead of cubified voxel grid.
Naive multi-view Mesh R-CNN
In row 9 of Table 3 we extend Mesh R-CNN Gkioxari et al. (2019) to multi-view using statistical feature pooling method proposed in Wen et al. (2019) for mesh refinement while in row 10 we further extend their single-view voxel grid prediction method to our probabilistic multi-view voxel grid prediction.
| Configuration | F1-score (smaller threshold) | F1-score (larger threshold) |
|---|---|---|
| (1) Baseline framework | 79.82 | 90.18 |
| (2) Baseline + rendered vs predicted depth loss (final model) | 80.80 | 90.72 |
| (3) Baseline + rendered vs GT depth loss | 80.35 | 90.55 |
| (4) Baseline + rendered vs predicted depth loss + rendered vs GT depth loss | 80.45 | 90.56 |
| (5) Baseline with stats pooling | 79.63 | 90.10 |
| (6) Baseline with simple attention | 80.03 | 90.21 |
| (7) Baseline with GT depth | 84.58 | 92.86 |
| (8) Sphere initialization | 73.78 | 85.49 |
| (9) Naive multi-view Mesh R-CNN (single-view voxel prediction) | 72.74 | 84.99 |
| (10) Naive multi-view Mesh R-CNN (multi-view voxel prediction) | 76.97 | 88.24 |
Number of Views
We test the performance of our framework with respect to the number of views. Table 5 shows that the accuracy of our method increases as we increase the number of input views for training. These experiments also validate that the attention-based feature pooling can efficiently encode features from different views to take advantage of a larger number of views.
Table 5 shows the results when using different numbers of views during testing on our model trained with 3 views. Increasing the number of views during testing does not improve the accuracy, while decreasing the number of views can cause a significant drop in accuracy.
We propose a neural network based solution to predict 3D triangle mesh models of objects from images taken from multiple views. First, we propose a multi-view voxel grid prediction module which probabilistically merges voxel grids predicted from individual input views. We then cubify the merged voxel grid into a triangle mesh and apply graph convolutional networks to further refine the mesh. The features for the mesh vertices are extracted from the contrastive depth input, consisting of the rendered depths at each refinement stage along with the predicted depths. The proposed mesh reconstruction method outperforms existing methods by a large margin and is capable of reconstructing objects with more complex topologies.
- Past, present, and future of simultaneous localization and mapping: toward the robust-perception age. IEEE Transactions on robotics 32 (6), pp. 1309–1332.
- Shapenet: an information-rich 3d model repository. arXiv preprint arXiv:1512.03012.
- Point-based multi-view stereo network. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1538–1547.
- 3d-r2n2: a unified approach for single and multi-view 3d object reconstruction. In European conference on computer vision, pp. 628–644.
- Numerical methods for shape-from-shading: a new survey with benchmarks. Computer Vision and Image Understanding 109 (1), pp. 22–43.
- A point set generation network for 3d object reconstruction from a single image. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 605–613.
- A geometric approach to shape from defocus. IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (3), pp. 406–417.
- Accurate, dense, and robust multiview stereopsis. IEEE transactions on pattern analysis and machine intelligence 32 (8), pp. 1362–1376.
- Mesh r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9785–9795.
- Cascade cost volume for high-resolution multi-view stereo and stereo matching. arXiv preprint arXiv:1912.06378.
- Weakly supervised 3d reconstruction with adversarial constraint. In 2017 International Conference on 3D Vision (3DV), pp. 263–272.
- SeqXY2SeqZ: structure learning for 3d shapes by sequentially predicting 1d occupancy segments from 2d coordinates. arXiv preprint arXiv:2003.05559.
- Learned multi-patch similarity. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1586–1594.
- RandLA-net: efficient semantic segmentation of large-scale point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Deepmvs: learning multi-view stereopsis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2821–2830.
- Single-view reconstruction via joint analysis of image and shape collections. ACM Transactions on Graphics (TOG) 34 (4), pp. 1–10.
- DV-net: dual-view network for 3d reconstruction by fusing multiple sets of gated control point clouds. Pattern Recognition Letters 131, pp. 376–382.
- DR-kfs: a differentiable visual similarity metric for 3d shape reconstruction.
- Learning a multi-view stereo machine. In Advances in neural information processing systems, pp. 365–376.
- Category-specific object reconstruction from a single image. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1966–1974.
- Neural 3d mesh renderer. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Improved occupancy grids for map building. Autonomous Robots 4 (4), pp. 351–367.
- Deformnet: free-form deformation network for 3d shape reconstruction from a single image. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 858–866.
- A theory of shape by space carving. International journal of computer vision 38 (3), pp. 199–218.
- The adaptive berhu penalty in robust regression. Journal of Nonparametric Statistics 28 (3), pp. 487–514.
- A quasi-dense approach to surface reconstruction from uncalibrated images. IEEE transactions on pattern analysis and machine intelligence 27 (3), pp. 418–433.
- Photometric mesh optimization for video-aligned 3d object reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 969–978.
- P-mvsnet: learning patch-wise matching confidence aggregation for multi-view stereo. In Proceedings of the IEEE International Conference on Computer Vision, pp. 10452–10461.
- Deep mesh reconstruction from single rgb images via topology modification networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9964–9973.
- The graph neural network model. IEEE Transactions on Neural Networks 20 (1), pp. 61–80.
- Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4104–4113.
- Photorealistic scene reconstruction by voxel coloring. International Journal of Computer Vision 35 (2), pp. 151–173.
- Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- Estimating image depth using shape collections. ACM Transactions on Graphics (TOG) 33 (4), pp. 1–11.
- A skeleton-bridged deep learning approach for generating meshes of complex topologies from single rgb images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4541–4550.
- Multi-view supervision for single-view reconstruction via differentiable ray consistency. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2626–2634.
- Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008.
- Pixel2mesh: generating 3d mesh models from single rgb images. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 52–67.
- Pixel2mesh++: multi-view 3d mesh generation via deformation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1042–1051.
- Voxel2Mesh: 3d mesh model generation from volumetric data.
- Perspective transformer nets: learning single-view 3d object reconstruction without 3d supervision. In Advances in neural information processing systems, pp. 1696–1704.
- Robust attentional aggregation of deep feature sets for multi-view 3d reconstruction. International Journal of Computer Vision 128 (1), pp. 53–73.
- Foldingnet: point cloud auto-encoder via deep grid deformation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 206–215.
- Mvsnet: depth inference for unstructured multi-view stereo. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 767–783.
- Recurrent mvsnet for high-resolution multi-view stereo depth inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5525–5534.
- Front2Back: single view 3d shape reconstruction via front to back prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 531–540.
- Shape-from-shading: a survey. IEEE transactions on pattern analysis and machine intelligence 21 (8), pp. 690–706.
Appendix A
Our depth prediction module is based on MVSNet Yao et al. (2018), which constructs a regularized 3D cost volume to estimate the depth map of the reference view. Here, we extend MVSNet to predict the depth maps of all views instead of only the reference view. This is achieved by transforming the feature volumes to each view’s coordinate frame using homography warping and applying identical cost volume regularization and depth regression on each view. This allows the reuse of pre-regularization feature volumes for efficient multi-view depth prediction that is invariant to the order of the input images. Figure 4 shows the architecture of our depth estimation module.
Probabilistic Occupancy Grid Merging
We use the single-view voxel prediction network from Gkioxari et al. (2019) to predict voxel grids for each of the input images in their respective local coordinate frames. The occupancy grids are transformed to the global frame (which is set to the coordinate frame of the first image) by finding the equivalent global grid values in the local grids after applying bilinear interpolation on the closest matches. The voxel grids in global coordinates are then probabilistically merged according to Sub-section 3.1 of the main submission.
We quantitatively compare our method against previous works for multi-view shape generation in Table 6 and show the effectiveness of our proposed shape generation method in improving shape quality. Our method outperforms the state-of-the-art method Pixel2Mesh++ Wen et al. (2019), decreasing the Chamfer distance to ground truth by 34%. Note that in Table 6 the same model is trained for all categories, but accuracy is evaluated on individual categories as well as averaged over all categories.
|Category||Chamfer Distance (CD)|
Coarse Shape Generation
We compare the voxel grids obtained from our proposed probabilistic merging against the single-view method of Gkioxari et al. (2019). As shown in Table 8, the accuracy of the initial shape generated from the probabilistically merged voxel grid is higher than that from individual views.
Accuracy at different GCN stages
We analyze the accuracy of meshes at different GCN stages in Table 8 to validate that our method produces the meshes in a coarse-to-fine manner.
Resolution of Depth Prediction
We conduct experiments using different numbers of depth hypotheses in our depth prediction network (Sub-section MVSNet architecture), producing depth values at different resolutions. A higher number of depth hypotheses means a finer resolution of the predicted depths. The quantitative results with different numbers of hypotheses are summarized in Table 9. We set the number of depth hypotheses to 48 for our final architecture which is equivalent to the resolution of mm.
We conduct experiments to evaluate the generalization capability of our system across the semantic categories. We train our model with only 12 out of the 13 categories and test on the category that was left out. Table 10 shows that the accuracy generally does not decrease significantly when compared with the model that was trained on all 13 categories.
|Category||F-score ()||F-score ()|