## 1 Introduction

3D planar structure reconstruction from RGB images has been an important yet challenging problem in computer vision for decades. It aims to detect piece-wise planar regions and predict the corresponding 3D plane parameters from RGB images. The recovered 3D planes can be used in various applications such as robotics

[taguchi2013point], Augmented Reality (AR) [chekhlov2007ninja], and indoor scene understanding

[tsai2011real].

Traditional methods [furukawa2009manhattan, gallup2010piecewise, sinha2009piecewise] work well in certain cases but usually rely heavily on assumptions about the target scene (*e.g.*, the Manhattan-world assumption [furukawa2009manhattan]) and are thus not always robust in complicated real-world cases. Recently, some methods [liu2018planenet, yang2018recovering, liu2019planercnn, yu2019single, tan2021planetr]

have been proposed to recover planes from single-view images based on Convolutional Neural Networks (CNNs). These methods reconstruct 3D planes with better completeness and robustness than traditional methods. However, although they achieve reasonable results on 2D plane segmentation, all of them attempt to recover 3D plane geometry from a single image, which is ill-posed: it relies solely on single-view regression of plane parameters and suffers from depth-scale ambiguity. Thus the 3D planes recovered by these methods are far from accurate. These limitations motivate us to consider reconstructing 3D planes from multiple views with CNNs in an end-to-end framework.

In contrast to reconstructing 3D geometry from single images, multi-view-stereo (MVS) [furukawa2009accurate] takes multiple images as input with known relative camera poses. MVS methods achieve superior performance on 3D reconstruction compared with single-view methods since the scale of a scene can be resolved by triangulating matched feature points on calibrated images [hartley2003multiple]. Recently, a few learning-based MVS methods [yao2018mvsnet, im2019dpsnet, gu2020cascade, yang2020cost] have been proposed and have achieved promising improvements for a wide range of scenes. While effective in reconstructing areas with rich textures, their pipelines would suffer from ambiguity in finding feature matches in the textureless area, which often belongs to planar regions. Besides, the generated depth map usually lacks smoothness as planar structures are not explicitly parsed. Some recent MVS approaches [kusupati2020normal, long2020occlusion, zhao2021confidence]

propose to jointly learn the geometric relationship between depth and normal to capture local planarity. However, these methods usually estimate depth and normal separately, and only enforce pixel-level planarity through additional loss constraints. Piece-wise planar structures, *e.g.*, walls and floors, which usually exhibit strong global geometric smoothness, are not well captured by these approaches.

In this work, as shown in Fig. LABEL:fig:introduction, we take advantage of both sets of methods and propose to reconstruct planar structures in an MVS framework. Our framework consists of dual branches: a plane detection branch and a plane MVS branch. The plane detection branch predicts a set of 2D plane masks with their corresponding semantic labels of the target image. The plane MVS branch, which is our key contribution, takes posed target and source images as input. Inspired by the frontal plane sweeping formulation that is widely used in MVS pipelines, we propose a slanted plane sweeping strategy to learn the plane parameters without ambiguity.
Specifically, instead of using a set of frontal plane hypotheses (*i.e.*, depth hypotheses) for plane sweeping as in conventional MVS methods, we perform plane sweeping with a group of slanted plane hypotheses to build a plane cost volume for per-pixel plane parameter regression.

To associate the two branches, we present a soft-pooling strategy to get piece-wise plane parameters and propose a loss objective based on it to make the two branches benefit from each other. We apply learned uncertainties [kendall2017uncertainties] on different loss terms to train the multi-task learning system in a balanced way. Moreover, our system can generalize well in new environments with different data distributions. The results can be further improved with a simple but effective finetuning strategy without groundtruth plane annotations.

To the best of our knowledge, this is the first work that reconstructs planar structures in an end-to-end MVS framework. The reconstructed depth map takes advantage of multi-view geometry to resolve the scale ambiguity issue, and is geometrically much smoother than those from depth-based MVS schemes thanks to explicitly parsing planar structures. Experimental results across different indoor datasets demonstrate that our proposed PlaneMVS not only significantly outperforms single-view plane reconstruction methods, but also surpasses several SOTA learning-based MVS approaches.

## 2 Related Work

##### Piece-wise planar reconstruction.

Traditional plane reconstruction methods [furukawa2009manhattan, gallup2010piecewise, sinha2009piecewise] usually take a single or multiple images as input and detect the primitives such as vanishing points and lines as geometric cues to recover planar structures. Such methods make strong assumptions about the environment and often do not generalize well into various scenarios. Recent learning-based approaches [liu2018planenet, yang2018recovering, yu2019single, liu2019planercnn, tan2021planetr, qian2020learning]

handle the plane reconstruction problem from a single image with Deep Neural Networks (DNNs) and achieve promising results. PlaneNet

[liu2018planenet] proposes a multi-branch network to learn plane masks and parameters jointly. PlaneRecover [yang2018recovering] learns to segment piece-wise planes with only groundtruth depth supervision, without any plane groundtruth. PlaneAE [yu2019single] and PlaneTR [tan2021planetr] learn to cluster image pixels into piece-wise planes with bottom-up frameworks. Alternatively, PlaneRCNN [liu2019planercnn] takes advantage of a two-stage detection framework [he2017mask] to estimate plane segmentation and plane geometry in several parallel branches. Qian and Furukawa [qian2020learning] model inter-plane relationships to further refine the initial planar reconstruction. However, although it is possible to learn 2D plane segmentation from a single image, it remains challenging to learn accurate 3D plane geometry with single-view regression alone. Most recently, Jin *et al.* [jin2021planar] propose a framework to jointly reconstruct planes and estimate camera poses from sparse views. In our work, we assume the camera poses are obtained from a SLAM system, and design our plane detection branch based on PlaneRCNN [liu2019planercnn], but learn plane geometry in a separate multi-view stereo (MVS) branch.

##### Multi-view stereo.

Different from single-view depth estimation [eigen2015predicting, godard2019digging, tiwari2020pseudo, zou2020learning, ji2021monoindoor], multi-view stereo transforms the depth estimation problem into triangulating corresponding points from a pair of posed images. Thus, it could solve the scale ambiguity issue in the single-view case. Traditional MVS approaches can be roughly categorized as voxel-based methods [kutulakos2000theory, seitz1999photorealistic], point-cloud-based methods [furukawa2009accurate, lhuillier2005quasi] and depth-map-based methods [campbell2008using, galliani2015massively, tola2012efficient].

In recent years, some learning-based methods have been proposed and have shown superior robustness and generalizability. Volumetric methods such as [ji2017surfacenet, kar2017learning] aggregate multi-view information to learn a voxel representation of the scene. However, they can only be applied to small-sized scenes due to high memory consumption of volumetric representation. For depth-based methods, MVSNet [yao2018mvsnet] utilizes an end-to-end framework to reconstruct the depth map of the reference image from multi-view input based on the plane-sweeping strategy. Some follow-up methods aim to achieve better accuracy-speed trade-off [yao2019recurrent, yu2020fast, wang2021patchmatchnet] or refine the depth map in a cascaded framework [gu2020cascade, yang2020cost], or incorporate visibility as well as uncertainty into the framework [zhang2020visibility, luo2019p, xu2020pvsnet]. These depth map-based MVS approaches usually apply the fronto-parallel plane hypothesis for plane sweeping, aiming to learn pixel-level feature correspondences at correct depths. However, for textureless areas or repetitive patterns, it is challenging for the network to accurately match pixel-level features, thus making the inferred depth less accurate. Different from depth-map based MVS, Atlas [murez2020atlas] and NeuralRecon [sun2021neuralrecon] propose to learn a TSDF [curless1996volumetric] representation from posed images for 3D surface reconstruction which avoids multi-view depth fusion.

Due to the matching ambiguity in textureless areas, some MVS works [birchfield1999multiway, gallup2007real, bleyer2011patchmatch] aim to model local planarity, since textureless areas are usually planar. Traditionally, Birchfield and Tomasi [birchfield1999multiway] introduce slanted planes with Markov Random Fields for stereo matching. Gallup *et al.* [gallup2007real] first estimate dominant plane directions and then warp along those planes based on plane sweeping. A few methods [bleyer2011patchmatch, xu2020planar, romanoni2019tapa] perform stereo patch matching in textureless regions based on iterative optimization or probabilistic frameworks. Among learning-based methods, derived from the idea of PatchMatch stereo, a line of works [kusupati2020normal, long2020occlusion, zhao2021confidence] incorporates the geometric relationship between depth and surface normal into the MVS framework. Although sharing high-level ideas, our work differs from these methods in several aspects. Firstly, some works [long2020occlusion] segment piece-wise planes in an offline pre-processing step to generate smooth and consistent normals, whereas we jointly learn plane segmentation and plane geometry within the proposed framework. Secondly, they usually learn depth and normal separately and apply loss objectives as extra constraints based on local planarity. In contrast, we directly regress pixel-level plane parameters with a set of slanted plane hypotheses via the plane-sweeping strategy in one MVS pipeline, so the joint relationship between depth and normal is learned implicitly. Thirdly, while those works employ planar priors to assist multi-view stereo, our goal is to reconstruct piece-wise planar structures with an MVS framework.

## 3 Method

This section is organized as follows: we first introduce our semantic plane detection branch in Sec. 3.1, and present our plane MVS branch in Sec. 3.2. Then we describe the piece-wise plane reconstruction process in Sec. 3.3. Finally, we introduce our loss objectives in Sec. 3.4.

### 3.1 Plane detection

An overview of PlaneRCNN. PlaneRCNN [liu2019planercnn] is one of the state-of-the-art single-view plane reconstruction approaches, built upon Mask-RCNN [he2017mask]. It designs several separate branches for estimating 2D plane masks and 3D plane geometry. It first applies an FPN [lin2017feature] to extract a feature map, then adopts a two-stage detection framework to predict 2D plane masks. An encoder-decoder architecture processes the feature map to produce a per-pixel depth map. Instance-level plane features from ROI-Align [he2017mask] are passed into a plane normal branch to predict plane normals. PlaneRCNN also designs a refinement network to refine the initial plane masks, and a reprojection loss between neighboring views to enforce multi-view geometric consistency during training. With the predicted plane masks, normals, and depth map, the piece-wise planar depth map can be reconstructed.

Our semantic plane detection. Our detection head is based on PlaneRCNN [liu2019planercnn] with several modifications. Firstly, we remove all the geometry estimation modules, including the plane normal prediction module and the monocular depth estimation module, since 3D plane geometry is estimated by our MVS branch. Secondly, we also remove the plane refinement module and the multi-view reprojection loss used in PlaneRCNN to conserve memory. Additionally, since semantic information is helpful for scene understanding, we add semantic label prediction for each plane instance, as in Mask-RCNN [he2017mask]. We introduce the details of how we define and generate the semantic plane annotations in Sec. 4.2. To summarize, for an input image, our plane detection head predicts a set of plane bounding boxes, their confidence scores, the binary plane masks, and the corresponding semantic labels.

### 3.2 Planar MVS

Next, we introduce our plane MVS head, which is our key contribution in this work. Fig. 2 shows the architecture of this branch, and we will present each part sequentially.

Feature extraction.

The 2D image feature extraction for the MVS head is shared with the plane detection head. Specifically, we obtain multi-scale 2D feature maps from the FPN backbone, of which we only utilize the finest level. To balance memory consumption and accuracy, we pass this feature map through a dimension-reduction layer and an average pooling layer to get a reduced feature representation, which serves as the input of the MVS network. It is worth exploring whether using multiple levels of features would bring benefits, but that is not our current focus, and we leave it to future work.

Differentiable planar homography.
Previous MVS methods [yao2018mvsnet, im2019dpsnet] propose to warp the source feature with fronto-parallel planes, *i.e.*, depth hypotheses, to the target view. This is effective in associating the features from multiple views at the correct depth values of the target view. In our setup, the objective is to learn per-pixel plane parameters instead of depth. To this end, we leverage slanted plane hypotheses to perform plane sweeping and learn per-pixel plane parameters within the MVS framework. The differentiable homography takes the same form for slanted plane hypotheses as for depth hypotheses. The homography between two views induced by the plane $\mathbf{n}^{T}\mathbf{X} = d$ at pixel $p$ of the target view, where $\mathbf{n}$ is the plane normal and $d$ is the offset, can be represented as:

$$\mathbf{H} \simeq \mathbf{K}\left(\mathbf{R} + \frac{\mathbf{t}\,\mathbf{n}^{T}}{d}\right)\mathbf{K}^{-1} \qquad (1)$$

where $\simeq$ denotes equality up to a scale, $\mathbf{K}$ is the intrinsic matrix, and $\mathbf{R}$ and $\mathbf{t}$ are the relative camera rotation and translation from the target to the source view, respectively. It follows that, without considering occlusion and object motion, the homography at pixel $p$ between two views is determined solely by the plane once the camera poses are known. This perfectly aligns with our goal of learning 3D plane parameters with MVS: we can learn pixel-level plane parameters $\mathbf{n}/d$, a non-ambiguous plane representation, by employing slanted plane sweeping in an MVS framework.
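As a sanity check, the induced homography can be verified numerically. The sketch below is illustrative, not the actual implementation; it assumes the target-to-source pose convention $\mathbf{X}_{src} = \mathbf{R}\mathbf{X}_{tgt} + \mathbf{t}$ and the plane $\mathbf{n}^{T}\mathbf{X} = d$ in the target frame:

```python
import numpy as np

def planar_homography(K, R, t, n, d):
    """Homography induced by the plane n^T X = d (target camera frame),
    mapping target pixels to source pixels, up to scale."""
    n = np.asarray(n, float).reshape(3, 1)
    t = np.asarray(t, float).reshape(3, 1)
    return K @ (R + (t @ n.T) / d) @ np.linalg.inv(K)
```

Projecting any 3D point on the plane into both views and checking that the homography maps one projection onto the other confirms the formula.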

Slanted plane hypothesis generation. One of the main differences between our framework and conventional MVS methods lies in the hypothesis representation. In depth-based MVS pipelines, the plane hypotheses are fronto-parallel w.r.t. the camera, so a set of one-dimensional depth hypotheses covering the depth range of the target 3D space is sufficient for depth regression. In our work, however, we need a set of three-dimensional slanted plane hypotheses. Finding slanted plane hypotheses is non-trivial since the number of candidate planes passing through a 3D point is infinite. We need to determine an appropriate hypothesis range for each dimension of $\mathbf{n}/d$. To this end, we randomly sample training images and plot the distribution of each axis of the groundtruth plane parameters, which reflects the general distribution of plane parameters across various scenes. We then select the upper and lower bounds for each axis such that most groundtruth values lie within the selected range, and sample hypotheses uniformly between the bounds along every axis. Please see the chosen hypothesis ranges and numbers in our supplementary material.
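Once per-axis bounds are chosen, the sampling itself reduces to a uniform 3D grid. A minimal sketch (the bounds and step counts passed in below are illustrative placeholders, not the values used in the paper):

```python
import numpy as np

def build_plane_hypotheses(bounds, steps):
    """Uniformly sample slanted-plane hypotheses n = (nx, ny, nz)
    within per-axis [lo, hi] bounds. Returns an (S, 3) array where
    S is the product of the per-axis step counts."""
    axes = [np.linspace(lo, hi, s) for (lo, hi), s in zip(bounds, steps)]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)
    return grid.reshape(-1, 3)
```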

Cost volume construction. After determining the plane hypotheses, we warp the source feature map into the target view by Eq. (1). For every slanted plane hypothesis, we concatenate the warped source feature and the target feature to associate them, which better preserves the original single-view feature representations than applying distance metrics [yao2018mvsnet]. Then we stack the features along the hypothesis dimension to build a feature cost volume. Following [yao2018mvsnet], we utilize an encoder-decoder architecture with 3D CNN layers to regularize the feature cost volume. Finally, we use a single 3D CNN layer with softmax activation to transform the cost volume into a plane probability volume.
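The warp-and-concatenate step can be sketched as follows. This is a toy NumPy version with nearest-neighbor sampling for brevity; the real pipeline uses differentiable bilinear sampling on GPU feature maps:

```python
import numpy as np

def warp_feature(feat_src, H, out_hw):
    """Warp a source feature map (C, Hs, Ws) into the target view with a
    homography H (target pixel -> source pixel). Out-of-bounds pixels
    are zero-filled; nearest-neighbor sampling for simplicity."""
    C, Hs, Ws = feat_src.shape
    Ht, Wt = out_hw
    ys, xs = np.meshgrid(np.arange(Ht), np.arange(Wt), indexing="ij")
    pix = np.stack([xs, ys, np.ones_like(xs)], 0).reshape(3, -1)
    p = H @ pix
    u = np.round(p[0] / p[2]).astype(int)
    v = np.round(p[1] / p[2]).astype(int)
    valid = (u >= 0) & (u < Ws) & (v >= 0) & (v < Hs)
    out = np.zeros((C, Ht * Wt), feat_src.dtype)
    out[:, valid] = feat_src[:, v[valid], u[valid]]
    return out.reshape(C, Ht, Wt)

def build_cost_volume(feat_tgt, feat_src, homographies):
    """Concatenate target and warped source features per hypothesis and
    stack along the hypothesis dimension: (2C, S, H, W)."""
    vols = [np.concatenate([feat_tgt,
                            warp_feature(feat_src, H, feat_tgt.shape[1:])], 0)
            for H in homographies]
    return np.stack(vols, 1)
```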

Per-pixel plane parameter inference and refinement. To make the whole system differentiable, following [yao2018mvsnet], soft-argmax is applied to get the initial pixel-level plane parameters. Given the plane hypothesis set $\{\mathbf{n}_1, \mathbf{n}_2, \ldots, \mathbf{n}_N\}$, the 3D plane parameter at pixel $p$ can be inferred as:

$$\mathbf{n}_p = \sum_{i=1}^{N} P_i(p)\,\mathbf{n}_i \qquad (2)$$

where $P_i(p)$ is the probability of hypothesis $\mathbf{n}_i$ at pixel $p$.
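Eq. (2) is a one-liner in practice; a NumPy sketch of the soft-argmax over the plane probability volume (the array shapes are assumptions for illustration):

```python
import numpy as np

def soft_argmax_planes(prob, hyps):
    """prob: (S, H, W) softmax probabilities over S plane hypotheses;
    hyps: (S, 3) hypothesis parameters. Returns per-pixel plane
    parameters (3, H, W) as the probability-weighted sum of hypotheses."""
    return np.einsum("shw,sc->chw", prob, hyps)
```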

With soft-argmax, we obtain an initial pixel-level plane parameter tensor at a reduced resolution, which needs to be upsampled back to the original image resolution. We find that directly applying bilinear upsampling leads to over-smoothing. Instead, we adopt the upsampling method proposed by RAFT [teed2020raft]: for each pixel, we learn a convex combination by first predicting a grid of weights, then taking the weighted combination over the pixel's coarse-resolution neighbors to get the upsampled plane parameters. This upsampling approach better preserves plane boundaries and other details in the reconstructed planar depth map.

Following [yao2018mvsnet], we apply a refinement module that learns the residual of the initial plane parameters w.r.t. the groundtruth. The upsampled initial plane parameters are concatenated with the normalized original image to preserve image details, then passed through several 2D CNN layers to predict the residual. Adding this residual yields the refined pixel-level plane parameters, which are our final per-pixel plane parameter prediction.

### 3.3 Planar depth map reconstruction

In this subsection, we present how we associate the above two branches to make them benefit from each other. We also demonstrate how to get the piece-wise planar depth map as the final reconstructed plane representation.

Plane instance-aware soft pooling. After obtaining per-pixel plane parameters and plane masks from the two branches, the natural question is: can we associate the two heads and make them benefit from each other? To this end, inspired by [yang2018recovering, yu2019single], we design a soft-pooling operation and propose a loss supervision on the depth map. For each detected plane $j$, we output its soft mask $M_j$, whose value at each pixel $p$ is the predicted foreground probability rather than a binary value, for differentiability. The instance plane parameter $\mathbf{n}_j$ can then be computed by soft pooling, *i.e.*, a weighted average:

$$\mathbf{n}_j = \frac{\sum_{p} M_j(p)\,\mathbf{n}_p}{\sum_{p} M_j(p)} \qquad (3)$$

Then the instance-level planar depth map can be reconstructed:

$$D_j(p) = \frac{\mathbb{1}\left[M_j(p) > \tau\right]}{\mathbf{n}_j^{T}\,\mathbf{K}^{-1}\,\tilde{p}} \qquad (4)$$

where $\mathbb{1}[\cdot]$ is an indicator variable identifying foreground pixels: a threshold $\tau$ is applied on $M_j(p)$ to determine whether pixel $p$ is foreground. $\mathbf{K}^{-1}$ is the inverse intrinsic matrix and $\tilde{p}$ is the homogeneous coordinate of pixel $p$.
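Eqs. (3) and (4) can be sketched together. The function below is an illustrative NumPy version under the plane parameterization $\mathbf{n}^{T}\mathbf{X} = 1$; the threshold value is an assumption, since the paper's choice is not restated here:

```python
import numpy as np

def instance_plane_depth(param_map, soft_mask, K_inv, thresh=0.5):
    """Soft-pool per-pixel plane parameters (3, H, W) into one instance
    parameter using the soft mask (H, W) of foreground probabilities,
    then reconstruct the instance planar depth 1 / (n^T K^-1 p)."""
    w = soft_mask
    n_inst = (param_map * w).sum((1, 2)) / w.sum()     # weighted average
    H, W = w.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    rays = K_inv @ np.stack([xs, ys, np.ones_like(xs)]).reshape(3, -1)
    depth = 1.0 / (n_inst @ rays).reshape(H, W)
    depth[w < thresh] = 0.0                            # background pixels
    return n_inst, depth
```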

| Method | AbsRel | SqRel | RMSE | RMSE_log | δ<1.25 | δ<1.25² | δ<1.25³ | AP | AP | AP | AP | AP | mAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PlaneRCNN [liu2019planercnn] | 0.164 | 0.068 | 0.284 | 0.186 | 0.780 | 0.953 | 0.989 | 0.310 | 0.475 | 0.526 | 0.546 | 0.554 | 0.452 |
| MVSNet [yao2018mvsnet] | 0.105 | 0.040 | 0.232 | 0.145 | 0.882 | 0.972 | 0.993 | - | - | - | - | - | - |
| DPSNet [im2019dpsnet] | 0.100 | 0.035 | 0.215 | 0.135 | 0.896 | 0.977 | 0.994 | - | - | - | - | - | - |
| NAS [kusupati2020normal] | 0.098 | 0.035 | 0.213 | 0.134 | 0.905 | 0.979 | 0.994 | - | - | - | - | - | - |
| ESTDepth [long2021multi] | 0.113 | 0.037 | 0.219 | 0.147 | 0.879 | 0.976 | 0.995 | - | - | - | - | - | - |
| PlaneMVS-pixel (Ours) | 0.091 | 0.029 | 0.194 | 0.120 | 0.920 | 0.987 | 0.997 | 0.448 | 0.535 | 0.556 | 0.560 | 0.564 | 0.466 |
| PlaneMVS-final (Ours) | 0.088 | 0.026 | 0.186 | 0.116 | 0.926 | 0.988 | 0.998 | 0.456 | 0.540 | 0.559 | 0.562 | 0.564 | 0.466 |

Depth map representation and loss supervision. We obtain a stitched depth map for the image by filling planar pixels with the instance planar depth maps from Eq. (4). Since the learned pixel-wise plane parameters capture local planarity, we fill the non-planar pixels with the depth reconstructed from the pixel-wise plane parameters.

We then define our soft-pooling loss $L_{pool}$ as an L1 loss between the reconstructed depth map $D$ and the groundtruth depth map $D^{*}$:

$$L_{pool} = \frac{1}{|\mathcal{V}|}\sum_{p \in \mathcal{V}} \left| D(p) - D^{*}(p) \right| \qquad (5)$$

where $\mathcal{V}$ is the set of pixels with valid groundtruth. By supervising the model with $L_{pool}$, thanks to Eq. (3), the planar depth map is determined not only by the plane MVS head but also by the plane detection head. In other words, the model is encouraged to make the learned 2D plane segmentation and 3D plane parameters consistent with each other, and during training each module receives constraints from the other's output. Note that although this loss is similar in spirit to [yang2018recovering, yu2019single], there are differences. PlaneRecover [yang2018recovering] applies a similar loss to assign pixels to different plane instances. PlaneAE [yu2019single] builds its loss on plane parameters instead of the depth map and targets instance-level parameters. In contrast, our soft-pooling loss is mainly designed to enable interaction between 2D plane segmentation and 3D parameter prediction.

### 3.4 Supervision with loss term uncertainty

Our supervision has three parts: the plane detection losses $L_{det}$, the plane MVS losses $L_{mvs}$, and the soft-pooling loss $L_{pool}$. $L_{det}$ includes the two-stage classification and bounding-box regression losses, and the mask loss in the second stage. $L_{mvs}$ includes losses on the initial per-pixel plane parameters and their reconstructed depth map, as well as on the refined ones. For each term of $L_{mvs}$, we adopt a masked loss applied only on pixels with valid groundtruth. Since the goals of the plane detection and plane MVS branches are distinct, we weight each loss term by its learned uncertainty as introduced in [kendall2017uncertainties]. This is effective in our experiments and outperforms the results without uncertainty by a large margin. Our final loss objective can be written as:

$$L = \sum_{i} \frac{1}{2\sigma_i^{2}}\,L_i + \log \sigma_i \qquad (6)$$

where $\sigma_i$ is the learned uncertainty for each loss term $L_i$.
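The weighting can be sketched with the common log-variance parameterization $s_i = \log \sigma_i^2$ from [kendall2017uncertainties], so each term becomes $e^{-s_i} L_i + s_i$ (equivalent to Eq. (6) up to constant factors); in training, the $s_i$ would be learnable scalars rather than fixed inputs:

```python
import numpy as np

def uncertainty_weighted_loss(losses, log_vars):
    """Combine task losses with learned homoscedastic uncertainty:
    each loss is down-weighted by exp(-s_i) and regularized by s_i,
    where s_i = log(sigma_i^2)."""
    losses = np.asarray(losses, float)
    s = np.asarray(log_vars, float)
    return float(np.sum(np.exp(-s) * losses + s))
```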

## 4 Experiments

### 4.1 Implementation details

We implement our framework in PyTorch [paszke2017automatic]. The SGD optimizer is applied with weight decay, and the model is trained end-to-end on NVIDIA 2080Ti GPUs on the ScanNet [dai2017scannet] benchmark, with the learning rate decayed twice during training. We re-implement the plane detection module following [liu2019planercnn] but with a publicly released implementation [massa2018mrcnn] of Mask-RCNN [he2017mask]. Following [liu2019planercnn], we initialize the weights with a detection model pretrained on COCO [lin2014microsoft]. The input image size is fixed during training and testing. Since our batch size is relatively small, we freeze all the batch normalization [ioffe2015batch] layers of the plane detection head during training, and apply group normalization [wu2018group] as the normalization function in our plane MVS head.

### 4.2 Training data generation

Semantic plane groundtruth generation. To build our plane dataset with semantic labels, we first obtain and pre-process the raw rendered plane masks from [liu2019planercnn], and get the 2D raw semantic maps from ScanNet [dai2017scannet]. We then map the semantic labels from ScanNet to NYU40 [silberman2012indoor], merge some semantically similar NYU40 labels, and finally pick the labels that are likely to contain planar structures. We obtain the semantic label for each plane instance by projecting its mask onto the semantic map and performing a majority vote. If the voted result does not belong to any of the selected labels, we simply label the raw mask as non-planar and treat it as a negative sample during training and evaluation.
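The majority-vote labeling can be sketched as follows (the non-planar fallback label `-1` is an illustrative choice):

```python
import numpy as np

def plane_semantic_label(mask, semantic_map, planar_labels):
    """Assign a semantic class to a plane by majority vote of the
    semantic map inside the plane mask; fall back to a non-planar
    label (-1) if the winning class is not a planar one."""
    votes = semantic_map[mask > 0]
    if votes.size == 0:
        return -1
    label = np.bincount(votes).argmax()
    return int(label) if label in planar_labels else -1
```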

View selection for MVS. We have to sample stereo pairs from ScanNet [dai2017scannet] monocular sequences, as an appropriate stereo pair should have a large enough camera baseline as well as sufficient overlap. In our work, we qualify a stereo pair if its relative translation lies within a chosen range. We select two views (a target and a source view) during training and testing. We believe adding more views could further improve the performance, but that is not the main theme of this work.
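The pair-selection rule can be sketched as follows; the translation thresholds here are illustrative placeholders, not the values used in the paper:

```python
import numpy as np

def select_stereo_pairs(poses, t_min=0.05, t_max=0.15):
    """Pick (target, source) frame pairs whose relative translation
    magnitude lies in [t_min, t_max], trading off baseline against
    overlap. `poses` are 4x4 camera-to-world matrices."""
    pairs = []
    for i in range(len(poses)):
        for j in range(i + 1, len(poses)):
            rel = np.linalg.inv(poses[i]) @ poses[j]
            if t_min <= np.linalg.norm(rel[:3, 3]) <= t_max:
                pairs.append((i, j))
    return pairs
```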

### 4.3 Datasets

In our experiments, we use ScanNet [dai2017scannet] for training and evaluation. We further generalize our model to two other RGB-D indoor datasets, *i.e.*, 7-Scenes [glocker2013real] and TUM-RGBD [sturm2012benchmark], by testing with and without finetuning, to demonstrate the generalizability. Since the two datasets do not contain any plane groundtruths, we only evaluate the planar geometry metrics and show the qualitative results for plane detection on them. Due to the space limit, here we only introduce how we use ScanNet. Please refer to our supplementary material for the information on other datasets.

ScanNet. ScanNet [dai2017scannet] is a large indoor benchmark containing hundreds of scenes. We sample the training and testing stereo pairs from its official training and validation split, respectively. After pre-processing and filtering out the unqualified data following the steps in PlaneRCNN [liu2019planercnn], we randomly subsample pairs for training. However, since the raw 3D meshes of ScanNet are not always complete, the rendered plane masks from meshes are noisy and inaccurate in quite a few images. This results in unconvincing plane detection evaluation if we directly test on those images. To this end, we manually pick 950 stereo pairs whose plane mask annotations are visually clean and complete from the original testing set for our evaluation.

### 4.4 Evaluation metrics

Following previous plane reconstruction methods [liu2018planenet, liu2019planercnn], we mainly evaluate plane reconstruction quality using the average precision (AP) of plane detection under varying depth-error thresholds, together with the widely-used depth metrics [eigen2014depth]. Since we introduce plane semantics in our framework, we also evaluate the mean average precision (mAP) [lin2014microsoft], which couples semantic segmentation and detection as used in object detection papers.
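The depth metrics follow the standard definitions of [eigen2014depth]; a reference NumPy sketch over valid pixels:

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard depth metrics over valid (gt > 0) pixels."""
    m = gt > 0
    p, g = pred[m], gt[m]
    ratio = np.maximum(p / g, g / p)
    return {
        "AbsRel": float(np.mean(np.abs(p - g) / g)),
        "SqRel": float(np.mean((p - g) ** 2 / g)),
        "RMSE": float(np.sqrt(np.mean((p - g) ** 2))),
        "delta1": float(np.mean(ratio < 1.25)),
    }
```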

### 4.5 Comparison with state-of-the-arts

##### Single-view plane reconstruction methods.

We first compare our PlaneMVS with a SOTA single-view plane reconstruction method, PlaneRCNN [liu2019planercnn], which also serves as the baseline of our model. We test our re-implemented version with plane semantic predictions, using the same training and testing data as ours. Tab. 1 shows that our method outperforms PlaneRCNN on both plane geometry and 3D plane detection by a large margin. As shown in Fig. 3, PlaneRCNN obtains geometrically smooth planar depth maps, but its plane parameters, which rely on single-view regression and suffer from the depth-scale ambiguity issue, are far from accurate. For AP without considering depth, we also obtain considerable improvements benefiting from multi-task learning and the proposed soft-pooling loss. Although PlaneRCNN is a strong baseline, Fig. LABEL:fig:det_qualitative clearly shows that our method better perceives plane boundaries, and our segmentation aligns better with 3D plane geometry. For the mAP evaluation, which couples plane semantic accuracy with detection, our method also outperforms PlaneRCNN by a nontrivial margin.

Learning-based MVS methods.
We also compare our method against several representative MVS methods. We select two representative depth-based MVS methods, MVSNet [yao2018mvsnet] and DPSNet [im2019dpsnet], since our MVS module shares a similar network architecture with them. Besides, we compare with NAS [kusupati2020normal], which enforces depth-normal geometric consistency in MVS. We train and test these methods on our ScanNet data split with their released code for fair comparison. We also compare with one of the state-of-the-art multi-view depth estimation methods, ESTDepth [long2021multi]. From Tab. 1, our method clearly outperforms those MVS methods. Note that [long2021multi] is designed for temporally longer sequences, which may explain its performance drop when testing on two views. The qualitative results in Fig. 3 show that, compared with conventional depth-based MVS methods, our “PlaneMVS-pixel” results reconstructed from pixel-level plane parameters exhibit more accurate depth, especially over textureless areas, which can be credited to the proposed slanted plane hypotheses that learn planar geometry. By applying soft pooling with the detected plane masks (*i.e.*, our “PlaneMVS-final”), global geometric smoothness and sharper boundaries are achieved over planar regions. The texture-copy issue that other methods exhibit in some cases of Fig. 3 is also effectively avoided in ours.

### 4.6 Results on 7-Scenes

| Method | AbsRel | δ<1.25 |
|---|---|---|
| PlaneRCNN [liu2019planercnn] | 0.221 | 0.640 |
| MVSNet [yao2018mvsnet] | 0.162 | 0.766 |
| DPSNet [im2019dpsnet] | 0.159 | 0.788 |
| NAS [kusupati2020normal] | 0.154 | 0.784 |
| ESTDepth [long2021multi] | 0.153 | 0.786 |
| Ours | 0.158 | 0.793 |
| Ours-FT | 0.113 | 0.890 |

| Method | AbsRel | SqRel | RMSE | RMSE_log | δ<1.25 | δ<1.25² | δ<1.25³ | AP | AP | AP | AP | AP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 0.170 | 0.074 | 0.305 | 0.200 | 0.746 | 0.944 | 0.990 | 0.288 | 0.458 | 0.519 | 0.545 | 0.551 |
| + Soft-pooling loss | 0.119 | 0.042 | 0.234 | 0.148 | 0.871 | 0.979 | 0.995 | 0.380 | 0.520 | 0.549 | 0.557 | 0.561 |
| + Loss term uncertainty | 0.089 | 0.027 | 0.190 | 0.119 | 0.922 | 0.987 | 0.997 | 0.449 | 0.535 | 0.556 | 0.560 | 0.562 |
| + Convex upsampling | 0.088 | 0.026 | 0.186 | 0.116 | 0.926 | 0.988 | 0.998 | 0.456 | 0.540 | 0.559 | 0.562 | 0.564 |

We evaluate our approach on the 7-Scenes [glocker2013real] dataset to check its generalizability. Tab. 2 shows that our method also significantly outperforms PlaneRCNN [liu2019planercnn], and is better than or comparable with other MVS methods [yao2018mvsnet, im2019dpsnet, kusupati2020normal, long2021multi]. Since PlaneRCNN [liu2019planercnn] learns plane geometry from single views, the ability to generalize beyond the domain of training scenes is limited. However, our method benefits from multi-view geometry to learn multi-view feature correspondences and thus has superior generalizability on unseen data. We leave how we perform finetuning on 7-Scenes with only groundtruth depth to the supplementary material.

### 4.7 Ablation study

In this subsection, we evaluate the effectiveness of each proposed component (*i.e.*, soft-pooling loss, loss term uncertainty, convex upsampling, and slanted plane hypothesis). We leave some comparisons on hyper-parameters and settings to the supplementary material.

| Method | AbsRel | SqRel | δ<1.25 | AP | AP |
|---|---|---|---|---|---|
| Fronto-MVS | 0.094 | 0.033 | 0.917 | 0.433 | 0.548 |
| Ours | 0.088 | 0.026 | 0.926 | 0.456 | 0.564 |

Soft-pooling loss. The soft-pooling loss is designed to couple plane detection and plane geometry. As shown in Tab. 3, it significantly improves both plane depth and 3D detection on all metrics. It encourages pixels within the same plane to learn consistent plane parameters, which benefits the plane geometry. For plane detection, as shown in Fig. LABEL:fig:det_qualitative, our detected planes also align better with the 3D plane geometry, especially around plane boundaries.
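To make the aggregation behind this loss concrete, here is a minimal NumPy sketch of plane-instance soft pooling (the function name, shapes, and toy values are ours, not the paper's implementation): pixel-level plane parameters are averaged with weights given by the soft plane-mask probability.

```python
import numpy as np

def soft_pool_params(pixel_params, mask_prob):
    """Aggregate per-pixel plane parameters (H, W, 3) into one
    instance-level parameter, weighted by the soft mask probability."""
    w = mask_prob / (mask_prob.sum() + 1e-8)               # normalized weights
    return (pixel_params * w[..., None]).sum(axis=(0, 1))  # shape (3,)

# Toy example: a 4x4 map where every pixel predicts the same plane
# n/d = (0, 0, -0.5); soft pooling should recover exactly that vector.
pixel_params = np.tile(np.array([0.0, 0.0, -0.5]), (4, 4, 1))
mask_prob = np.random.default_rng(0).uniform(0.2, 1.0, size=(4, 4))
instance_param = soft_pool_params(pixel_params, mask_prob)
print(instance_param)  # -> close to [0. 0. -0.5]
```

In training, the pooled instance parameter (rather than each raw pixel prediction alone) can then be supervised, which is what ties the detection branch to the geometry branch.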

Training with loss term uncertainty. Tab. 3 shows that by weighting each loss term with learned uncertainty, our model achieves further improvement on plane geometry and 3D detection. There are two possible reasons. First, our model has different branches with multiple losses for 2D and 3D objectives; weighting every loss equally may prevent them from converging smoothly. Second, as introduced in Sec. 4.1, the plane detection head is initialized with a COCO-pretrained model, while the MVS head is trained from scratch, so the learning procedures of the two branches may be imbalanced if the weight of each term is not adapted. With learnable uncertainty, the weights of the different terms are automatically tuned during training, which clearly benefits our multi-task learning framework.
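The weighting scheme can be illustrated with a common simplified form of the homoscedastic-uncertainty loss of Kendall et al. (a sketch under the assumption that a standard parameterization of this kind is used; the function name is ours):

```python
import numpy as np

def uncertainty_weighted_loss(losses, log_vars):
    """Combine task losses with learnable log-variances s_i:
    L = sum_i exp(-s_i) * L_i + s_i (Kendall et al. style).
    A larger s_i down-weights its task via exp(-s_i) while the
    additive s_i term discourages setting it arbitrarily large."""
    losses = np.asarray(losses, dtype=float)
    log_vars = np.asarray(log_vars, dtype=float)
    return float(np.sum(np.exp(-log_vars) * losses + log_vars))

# With all s_i = 0 the combination reduces to a plain sum of losses.
print(uncertainty_weighted_loss([1.0, 2.0, 3.0], [0.0, 0.0, 0.0]))  # -> 6.0
# Raising s_2 to log(2) halves the effective weight of the second task.
print(uncertainty_weighted_loss([1.0, 2.0, 3.0], [0.0, np.log(2.0), 0.0]))
```

In a real training loop the `log_vars` would be trainable parameters updated by the optimizer together with the network weights.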

Convex upsampling. We analyze the effect of replacing bilinear upsampling with convex combination upsampling. As shown in Tab. 3, we obtain considerable improvement on most metrics. Compared with applying bilinear upsampling to the estimated plane maps, the learned convex upsampling better preserves fine-grained details.
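A RAFT-style convex upsampling step can be sketched as follows (a simplified NumPy version with explicit loops; in practice the per-pixel weights would come from a small CNN head and be softmax-normalized, and the implementation would be vectorized):

```python
import numpy as np

def convex_upsample(coarse, weights, s=4):
    """Upsample a coarse (H, W) map by factor s. Each fine pixel is a
    convex combination (weights sum to 1) of the 3x3 coarse neighborhood
    around its parent pixel, as in RAFT.
    weights: (H, W, s, s, 9), non-negative, summing to 1 on the last axis."""
    H, W = coarse.shape
    pad = np.pad(coarse, 1, mode='edge')
    fine = np.zeros((H * s, W * s))
    offsets = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    for i in range(H):
        for j in range(W):
            neigh = np.array([pad[i + 1 + dy, j + 1 + dx] for dy, dx in offsets])
            # (s, s, 9) @ (9,) -> the (s, s) block of fine values
            fine[i*s:(i+1)*s, j*s:(j+1)*s] = weights[i, j] @ neigh
    return fine

# Sanity check: weights concentrated on the center neighbor (index 4)
# reduce convex upsampling to nearest-neighbor upsampling.
coarse = np.arange(6, dtype=float).reshape(2, 3)
w = np.zeros((2, 3, 4, 4, 9)); w[..., 4] = 1.0
up = convex_upsample(coarse, w, s=4)
```

Because the weights are learned per pixel, edges in the upsampled map can follow plane boundaries instead of being blurred as with bilinear interpolation.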

Slanted plane hypothesis.
We conduct an additional ablation study, *i.e.*, replacing our slanted plane hypothesis with the fronto-parallel plane hypothesis while using the same network architecture. We also apply convex upsampling and loss term uncertainty for a fair comparison. We employ the least-squares algorithm to fit planes from the predicted per-pixel depth map and plane masks, and then transform the plane parameters into planar depth maps. As shown in Tab. 4, our proposed method outperforms the ‘Fronto-MVS’ baseline on both 3D plane detection and depth metrics. Besides, our model learns plane parameters in an end-to-end manner instead of fitting planes as a post-processing step. This verifies the effectiveness of the proposed slanted plane hypothesis over the fronto-parallel hypothesis.
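The post-hoc plane fitting used for the ‘Fronto-MVS’ baseline can be sketched as a least-squares problem in the v = n/d parameterization (our own minimal version, assuming planes are fit to 3D points back-projected from the depth map inside each predicted mask):

```python
import numpy as np

def fit_plane(points):
    """Least-squares fit of plane parameters v = n/d such that
    v . p = 1 for 3D points p on the plane (points: (N, 3))."""
    v, *_ = np.linalg.lstsq(points, np.ones(len(points)), rcond=None)
    return v

# Points on the plane z = 2, whose parameters are v = (0, 0, 0.5).
rng = np.random.default_rng(1)
pts = np.column_stack([rng.uniform(-1, 1, 50), rng.uniform(-1, 1, 50),
                       np.full(50, 2.0)])
v = fit_plane(pts)
print(v)  # -> close to [0. 0. 0.5]
```

The planar depth at a pixel with normalized camera ray r then follows from v . (depth * r) = 1, i.e. depth = 1 / (v . r), which is how fitted parameters are turned back into planar depth maps.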

## 5 Conclusion and Future Work

In this work, we propose PlaneMVS, a deep MVS framework for multi-view plane reconstruction. Based on our proposed slanted plane hypothesis for plane-sweeping, 3D plane parameters can be learned by deep MVS in an end-to-end manner. We also couple the plane detection branch and the plane MVS branch with the proposed soft pooling loss. Compared with single-view methods, our system can reconstruct 3D planes with significantly better accuracy, robustness, and generalizability. Without sophisticated designs, our system even outperforms several state-of-the-art MVS approaches. Please refer to our supplementary material for more results, discussions and potential limitations.

There are a few directions worth exploring in the future. First, recent advanced designs of deep MVS systems [gu2020cascade, yang2020cost, zhang2020visibility, wang2021patchmatchnet] could be incorporated to further improve MVS reconstruction. Second, temporal information from videos (beyond two frames as we are currently using) can be exploited to achieve temporally coherent plane reconstruction, such that consistent single-view predictions could be fused into a global 3D model of the entire scene.

## 6 Supplementary Material

### 6.1 Hypothesis selection for slanted planes

Fig. 5 shows the distribution of the three components of the plane parameters sampled from training images. Based on this distribution, we select (-2, 2), (-2, 2), and (-2, 0.5) as the ranges for the x, y, and z axes respectively, so that the large majority of the groundtruth planes lie within these ranges. Since our plane hypothesis is a three-dimensional vector, the computational cost of the cost volume is cubic w.r.t. the number of hypotheses per axis. To strike a balance between accuracy and memory consumption, we sample 8 hypotheses uniformly along every axis, which yields 8^3 = 512 plane hypotheses in total.

### 6.2 Semantic classes on ScanNet

After merging semantically-similar categories in the NYU40 [silberman2012indoor] labels, we pick 11 classes: wall, floor, door, chair, window, picture, desk & table, bed & sofa, monitor & screen, cabinet & counter, box & bin, which are likely to contain planar structures in indoor scenes. Please refer to Fig. LABEL:fig:plane_gt_qualitative for visualization examples of the generated planar instance and semantic groundtruth from ScanNet [dai2017scannet].

### 6.3 Benchmark setup

7-Scenes. 7-Scenes [glocker2013real] collects posed RGB-D camera frames of seven indoor scenes. We sample stereo pairs in the same manner as on ScanNet [dai2017scannet] and follow the official split to obtain the finetuning and evaluation pairs.

TUM-RGBD. TUM-RGBD [sturm2012benchmark] is an indoor RGB-D monocular SLAM dataset with calibrated cameras. We randomly select 4 scenes (*i.e.*, fr1-desk, fr1-room, fr1-desk2, fr3-long-office-household) for finetuning and 2 scenes (*i.e.*, fr2-desk, fr3-long-office-household-validation) for evaluation.

### 6.4 Results on 7-Scenes and TUM-RGBD

We have discussed how we handle 7-Scenes and demonstrated its quantitative results in the main paper. Here we introduce our simple but effective strategy for finetuning with only groundtruth depth. We first generate pseudo groundtruths of plane masks by running the ScanNet-pretrained model on the testing images. Then we train our model without the plane parameter losses while keeping the other losses. We simply set all loss weights to the same constant instead of adopting the loss term uncertainty during finetuning, since we find the latter does not bring much improvement here. After finetuning, the planar depth improves substantially, and we find that the plane detection results also tend to be visually better, which may be attributed to multi-task training and our soft-pooling loss associating 2D with 3D. The same strategy applies to the TUM-RGBD [sturm2012benchmark] dataset. Some qualitative examples on 7-Scenes are shown in Fig. 7.

As shown in Tab. 5 and Fig. 8, similar to 7-Scenes, our approach generalizes much better on TUM-RGBD compared with PlaneRCNN [liu2019planercnn], thanks to the learned multi-view geometric relationship. By performing the proposed finetuning strategy, the results get further improved on both 3D planar geometry and 2D planar detection.

Method | AbsRel | SqRel | δ<1.25 |
---|---|---|---|
PlaneRCNN [liu2019planercnn] | 0.243 | 0.105 | 0.655 |
Ours | 0.143 | 0.070 | 0.795 |
Ours-FT | 0.120 | 0.054 | 0.851 |

### 6.5 More Ablation studies

In this section, we discuss the impact of applying different hyper-parameters or settings in our experiments. Then we show qualitative examples on the two components of our proposed method to intuitively demonstrate their effects.

#### 6.5.1 Hyper-parameters and settings

Plane hypothesis range.
We first study the effect of the plane hypothesis range. We compare the results of different hypothesis ranges while keeping the number of hypotheses unchanged: (i) use the same range of (-2, 2) for all three axes; (ii) broaden the range to (-2.5, 2.5); (iii) shorten the range to (-1.75, 1.75); (iv) employ the range (-2, 2) for the x and y axes and a different range (-2, 0.5) for the z axis. As shown in Tab. 6, setting (iv), which serves as our default setting, achieves the best result. The performance drops when using the same range for all axes as in (i), since the z values mainly distribute within (-2, 0.5). Using a broader range, *e.g.*, (i) and (ii), covers some marginal values but decreases the density of the plane hypotheses, thus leading to less accurate results. In setting (iii), although shortening the ranges increases the hypothesis density, some non-negligible groundtruth values are no longer covered, also leading to worse results.

Hypos range | AbsRel | δ<1.25 |
---|---|---|
(-2, 2) for x, y, z | 0.093 | 0.920 |
(-1.75, 1.75) for x, y, z | 0.094 | 0.921 |
(-2.5, 2.5) for x, y, z | 0.096 | 0.919 |
(-2, 2) for x, y; (-2, 0.5) for z | 0.088 | 0.926 |
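The default hypothesis grid (setting (iv) above, 8 uniform samples per axis) can be built in a few lines (a sketch; the variable names are ours):

```python
import numpy as np

# Slanted-plane hypothesis grid for plane-sweeping: 8 uniform samples
# per axis over (-2, 2) for x, y and (-2, 0.5) for z, giving
# 8**3 = 512 three-dimensional hypotheses (default setting (iv)).
nx = np.linspace(-2.0, 2.0, 8)
ny = np.linspace(-2.0, 2.0, 8)
nz = np.linspace(-2.0, 0.5, 8)
hypotheses = np.stack(np.meshgrid(nx, ny, nz, indexing='ij'), -1).reshape(-1, 3)
print(hypotheses.shape)  # -> (512, 3)
```

The cubic growth of this grid in the per-axis sample count is exactly why the hypothesis number is the main memory knob studied next.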

Plane hypothesis number.
When keeping the plane hypothesis range constant, varying the hypothesis number changes the hypothesis density. We test our model using 6, 8, and 10 hypotheses per axis, *i.e.*, 216, 512, and 1,000 hypotheses in total, respectively. The results are listed in Tab. 7. As expected, in general, the higher density we set, the better geometry performance we achieve. The performance gaps among different numbers are small, which demonstrates that our model is robust to this hyper-parameter to some extent. Note that using 10 hypotheses per axis substantially increases the memory consumption, so we choose 8 in our default setting.

Hypos number per axis | AbsRel | δ<1.25 |
---|---|---|
6 hypos (216 in total) | 0.091 | 0.924 |
8 hypos (512 in total) | 0.088 | 0.926 |
10 hypos (1,000 in total) | 0.088 | 0.927 |

Method | AbsRel | δ<1.25 |
---|---|---|
Pixel-planar w/o pooling | 0.091 | 0.920 |
Pooling with predicted masks | 0.088 | 0.925 |
Soft-pooling with predicted masks | 0.088 | 0.926 |
Pooling with groundtruth masks | 0.087 | 0.932 |

Plane instance-aware soft pooling. We now evaluate the recovered depth under different pooling strategies, reflecting the effect of plane detection on the learned 3D planar geometry. As shown in Tab. 8, the depth reconstructed from pixel-level plane parameters underperforms the variants with plane instance pooling, since the generated depth maps cannot capture piece-wise planarity. The result improves when we apply hard pooling with predicted plane masks over the pixel-level plane parameters. Applying soft pooling weighted by the pixel-level probability brings a further minor improvement, since the probability reflects the confidence of a pixel belonging to a plane instance. Finally, we use groundtruth plane masks to perform pooling, which represents the upper bound of the impact of plane detection on geometry. As expected, it achieves the best result among all settings. Since groundtruth plane masks are not available during testing, we always apply soft pooling with predicted masks in the other experiments.

Depth on planar region.
We further compare the reconstructed depth over only planar regions *v.s.* the whole image. Specifically, we conduct experiments only evaluating depth on the pixels that belong to any of the groundtruth planes. As shown in Tab. 9, compared with the depth over the whole image, the quantitative result over planar regions is better, no matter whether plane-instance-pooling is applied or not. This demonstrates that our proposed method’s geometry improvement mainly comes from the pixels of planar regions, which conforms to our initial motivation and objective.

Method | AbsRel | δ<1.25 |
---|---|---|
Depth over whole image w/o pooling | 0.091 | 0.920 |
Depth over planar region w/o pooling | 0.086 | 0.929 |
Depth over whole image | 0.088 | 0.926 |
Depth over planar region | 0.081 | 0.938 |
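The planar-region evaluation amounts to restricting the depth metric to a mask, e.g. for AbsRel (a minimal sketch with toy values; the helper name is ours):

```python
import numpy as np

def absrel(pred, gt, mask=None):
    """Mean absolute relative depth error, optionally restricted to a
    boolean mask (e.g. the union of the groundtruth plane masks)."""
    if mask is None:
        mask = np.ones_like(gt, dtype=bool)
    return float(np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask]))

# Toy 2x2 depth maps: one non-planar pixel carries all the error.
gt = np.full((2, 2), 2.0)
pred = np.array([[2.0, 2.0], [2.0, 4.0]])
planar = np.array([[True, True], [True, False]])
print(absrel(pred, gt))          # -> 0.25 over the whole image
print(absrel(pred, gt, planar))  # -> 0.0 over the planar region only
```

The same masking applies unchanged to the other depth metrics (SqRel, RMSE, δ thresholds).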

Training dataset scale. In our default setting, we sample only 20,000 stereo pairs for training. To analyze the impact of the scale of training data, we sample a larger training set of 66,000 stereo pairs from the same scene split while keeping the evaluation split unchanged. As shown in Tab. 10, performance further improves with more training data on both plane detection and geometry metrics.

Dataset scale | AbsRel | δ<1.25 | AP^0.2m | AP |
---|---|---|---|---|
20,000 training pairs | 0.088 | 0.926 | 0.456 | 0.564 |
66,000 training pairs | 0.082 | 0.934 | 0.470 | 0.570 |

#### 6.5.2 Qualitative ablation analysis

This section gives some qualitative ablation analysis on the two components (*i.e.*, convex upsampling and the soft-pooling loss) used in our method. Fig. LABEL:fig:convex_qualitative shows the efficacy of convex upsampling. We show the depth map recovered from pixel-level parameters to eliminate the effect of plane instance pooling. It is clear that the results upsampled by convex combination have sharper boundaries and fewer artifacts than using bilinear upsampling.

Fig. LABEL:fig:soft_pooling_qualitative shows the effectiveness of the proposed soft-pooling loss. The detected planes from the model trained with the soft-pooling loss are much more complete and align better with their boundaries.

### 6.6 Additional visualizations

We provide additional visualizations on predicted instance plane detection, planar semantic map, reconstructed planar depth map and 3D point cloud in Fig. 11, from our testing set on ScanNet [dai2017scannet].

### 6.7 Discussions and limitations

Our method *v.s.* patchmatch stereo. Our method shares high-level ideas with traditional patchmatch stereo works [bleyer2011patchmatch, galliani2015massively], which estimate a slanted plane for each pixel in the stereo reconstruction problem. However, our method differs from them in several aspects. (i) They perform patch matching around a pixel within a squared support window, whose size must be carefully set, and is thus not flexible or adaptive across varied real-world cases. Instead of explicitly defining a patch, we associate and match multi-view deep features. This is based on the observation that a pixel's receptive field on the feature map extends far beyond itself because of stacked CNNs, so the model can automatically learn the appropriate field for matching local features with end-to-end training. (ii) These methods usually first initialize pixels with random slanted plane hypotheses, then undergo sophisticated, multi-stage schemes with iterative optimization. In contrast, we generate more reliable slanted plane hypotheses in a data-driven way (*i.e.*, by analyzing the groundtruth plane distribution) and learn the pixel-wise plane parameters in an end-to-end manner, which is much easier to optimize. (iii) They usually adopt photometric pixel dissimilarity as the matching cost, which is sensitive to illumination changes and motion blur across views. In contrast, we apply a feature-metric matching strategy, which is more robust to such noise.
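For reference, the plane-induced homography underlying slanted plane-sweeping can be written as H = K (R - t v^T) K^{-1} for plane parameter v = n/d, a standard multi-view geometry identity. The sketch below (our own, with a simple relative-pose parameterization and shared intrinsics, not necessarily the paper's exact conventions) shows the degenerate identity-pose case as a sanity check:

```python
import numpy as np

def plane_homography(K, R, t, v):
    """Homography induced by the plane v = n/d (v . P = 1 in the
    reference frame) between two views sharing intrinsics K:
    H = K (R - t v^T) K^{-1}. Warping source features with H replaces
    per-depth warping when sweeping slanted plane hypotheses."""
    return K @ (R - np.outer(t, v)) @ np.linalg.inv(K)

# Identity pose: the homography degenerates to the identity, so every
# pixel maps to itself regardless of the plane hypothesis.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
H = plane_homography(K, np.eye(3), np.zeros(3), np.array([0.0, 0.0, 0.5]))
print(np.round(H, 6))  # -> identity matrix (up to numerical precision)
```

With a nonzero relative pose, each of the 512 hypotheses yields a distinct homography, and the matching cost of the warped features scores how well that slanted plane explains the pixel.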

Potential limitations. Although our system achieves superior performance on most images, it also produces some failure cases. Firstly, as shown in Fig. LABEL:fig:failure_case_overlap, because of the large temporal gap, some areas in the target image are invisible in the source image and thus do not follow the planar homography relationship. This issue may be mitigated by introducing a network to learn pixel-wise visibility or uncertainty [zhang2020visibility]. Secondly, as shown in Fig. LABEL:fig:failure_case_interplane, there are holes between some adjacent planes reconstructed by our method. An existing work [qian2020learning] proposes to infer and enforce inter-plane relationships from single images; incorporating such constraints may address this issue and further improve the final plane reconstruction. We leave this to future work.