Learning to Reconstruct and Understand Indoor Scenes from Sparse Views

06/19/2019 · Jingyu Yang et al., Tianjin University

This paper proposes a new method for simultaneous 3D reconstruction and semantic segmentation of indoor scenes. Unlike existing methods that require recording a video using a color camera and/or a depth camera, our method only needs a small number of (e.g., 3-5) color images from uncalibrated sparse views as input, which greatly simplifies data acquisition and extends applicable scenarios. Since different views have limited overlaps, our method allows a single image as input to discern the depth and semantic information of the scene. The key issue is how to recover relatively accurate depth from single images and reconstruct a 3D scene by fusing very few depth maps. To address this problem, we first design an iterative deep architecture, IterNet, that estimates depth and semantic segmentation alternately, so that they benefit each other. To deal with the little overlap and non-rigid transformation between views, we further propose a joint global and local registration method to reconstruct a 3D scene with semantic information from sparse views. We also make available a new indoor synthetic dataset simultaneously providing photorealistic high-resolution RGB images, accurate depth maps and pixel-level semantic labels for thousands of complex layouts, useful for training and evaluation. Experimental results on public datasets and our dataset demonstrate that our method achieves more accurate depth estimation, smaller semantic segmentation errors and better 3D reconstruction results, compared with state-of-the-art methods.


I Introduction

With the increasing demand for indoor navigation, home/office design, and augmented reality, indoor 3D reconstruction and understanding have become active topics in computer vision and graphics. Existing reconstruction methods can be broadly categorized into two groups. The first group scans indoor scenes with an integrated depth camera based on either time-of-flight (ToF) or structured light sensing, which offers dense depth measurements. The pioneering KinectFusion [40] presents a detailed workflow using Kinect for indoor reconstruction. It was more recently extended by ElasticFusion [52] and BundleFusion [8], which achieve state-of-the-art results in real-time 3D reconstruction. Although acquiring depth in this way is relatively simple, the captured depth contains considerable noise and missing data, and is limited to a small range of distances. Color cameras do not suffer from these issues, are still far more available (e.g., on mobile phones), and have a smaller form factor than depth cameras. It is therefore interesting to study 3D scene reconstruction using a color camera, which is however challenging due to the lack of depth information. Simultaneous localization and mapping (SLAM) [34] and structure from motion (SFM) [6] are two popular approaches that achieve feature-based point cloud 3D reconstruction online and offline, respectively. However, these feature-based methods require rich textures in the scene and therefore struggle to obtain dense point clouds. All the above methods require consecutive frame tracking or dense view capturing.

Figure 1: An example of our IterNet RGB-D dataset and the reconstructed 3D model by our IterNet: (a) Input RGB image, (b) Ground-truth depth map, (c) Ground-truth semantic segmentation, (d)-(f) Reconstructed 3D model using our estimated depth map and semantic labels. This example is part of the test data.

In this paper, we propose a new indoor-scene 3D reconstruction and semantic segmentation method using color images captured from several uncalibrated sparse views. The first challenge is the difficulty of dense reconstruction from sparse views with little overlap, which practically degenerates into monocular depth estimation. The second challenge, hence, is the non-rigid transformation between views introduced by the inaccurate depth estimated from single color images. To address these problems, we design IterNet, an iteratively optimized deep framework for simultaneous depth map recovery and semantic segmentation of each view, where the two tasks help improve each other. To estimate non-rigid transformations between sparse views, we further develop a joint global and local alignment method that fuses the estimated depths with the help of semantic information, integrating geometry, photometry and semantics in a coarse-to-fine manner.

Depth recovery and semantic segmentation from images are ill-posed problems, and it is essential to learn from high-quality training data. For indoor scene understanding, a number of datasets have been made publicly available. Real-world datasets, such as NYUv2 [39], SUN RGB-D [48] and ScanNet [7], need a lot of manual labor to annotate the labels and contain unavoidable noise in the depths assumed as ground truth, while it is difficult for synthetic datasets [49, 17] to provide photorealistic RGB images, and they usually have limited layouts and image resolution. To the best of our knowledge, no existing dataset provides photorealistic RGB images, accurate depth maps, pixel-level semantic labels, and thousands of complex layouts at the same time. To address this, we build the IterNet RGB-D dataset with these features.

Experimental results on both public datasets and our dataset demonstrate that our method outperforms state-of-the-art methods on depth estimation, semantic segmentation, and multi-view reconstruction. Figure 1 gives an example of our IterNet RGB-D dataset and the reconstructed 3D model with estimated semantics using our IterNet. We will make the code and the dataset available online for research purposes.

In summary, our work is an integrated system that includes 1) an unprecedented indoor synthetic dataset simultaneously providing photorealistic high-resolution RGB images, accurate depth maps and pixel-level semantic labels for thousands of complex layouts, 2) a depth estimation method from a single color image, 3) a semantic segmentation method from a single view, and 4) a multi-view reconstruction method for sparse views. Each component of our method is novel and is validated by experiments on public datasets and our dataset. Together, they solve the challenging problem of 3D reconstruction and understanding from sparse views. Our main contributions are:

  • We provide the IterNet RGB-D dataset, which includes photorealistic high-resolution RGB images, accurate depth maps, and pixel-level semantic labels for thousands of complex layouts, useful for training and evaluation.

  • We solve a challenging problem, namely reconstructing and understanding indoor 3D scenes using only color images captured from several uncalibrated sparse views. It is applicable to more scenarios than previous approaches that rely on texture and/or geometries of dense views, e.g., reconstructing and understanding a room using several photos captured by different users.

  • We design a novel iterative joint optimization method for depth estimation and semantic segmentation for a given input color image, where the two tasks help improve each other. This architecture is not restricted to the tasks addressed here and can be extended to other related tasks such as object/part parsing.

  • We propose a joint global and local registration method to fuse different sparse perspectives. This coarse-to-fine alignment is robust to the sparsity of views and the errors of monocular depth estimation.

| Dataset | Year | Type | Images/Scans | Layouts | Object Classes | RGB Texturing | Image Resolution |
|---|---|---|---|---|---|---|---|
| NYUv2 [39] | 2012 | Real | 1449 | 464 | 894 | Real | 640×480 |
| SUN RGB-D [48] | 2015 | Real | 10K | - | 800 | Real | 640×480 |
| Building Parser [2] | 2017 | Real | 70K | 270 | 13 | Real | 1080×1080 |
| Matterport3D [4] | 2017 | Real | 194K | 90 | 40 | Real | 1280×1024 |
| ScanNet [7] | 2017 | Real | 1513 | 1513 | 50 | Real | 640×480 |
| SUNCG [49] | 2017 | Synthetic | 130K | 45,622 | 84 | Not photorealistic | 640×480 |
| SceneNet RGB-D [17] | 2016 | Synthetic | 5M | 57 | 255 | Photorealistic | 320×240 |
| IterNet RGB-D (ours) | 2019 | Synthetic | 12,856 | 3214 | 333 | Photorealistic | 1280×960; 1280×720 |

Table I: Comparison between various indoor datasets. IterNet RGB-D is our proposed dataset. -: relevant information not available.

II Related Work

Indoor datasets. Naseer et al. [38] gave a comprehensive overview of indoor scene understanding in 2.5/3D. The first dataset is NYU-Depth, with two versions introduced by Silberman et al. [39] using Microsoft Kinect. The SUN RGB-D dataset [48], captured by four different RGB-D sensors, contains 10,335 indoor images with dense annotations. Armeni et al. [2] provided the Building Parser dataset with instance-level semantic and geometric annotations. Matterport3D [4] contains 10,800 panoramic images covering viewpoints captured by a Matterport camera. ScanNet [7] is a 3D reconstruction dataset with 2.5 million frames obtained from 1,513 scans. These real-world datasets usually have noise and missing areas in their depth maps and need a lot of manual effort to annotate the labels. Hence, synthetic datasets have been proposed for easy generation and accurate ground truth. SUNCG [49] is a densely annotated large-scale indoor dataset, but the rendered RGB images are not photorealistic and RGB-D videos are not available. SceneNet RGB-D [17] provides pixel-level annotations and photorealistic RGB images, but the number of layouts is limited. Table I compares various publicly available 2.5/3D indoor datasets with our IterNet RGB-D dataset. Our dataset provides a total of 12,856 photorealistic images for thousands of layouts and has higher image resolutions (1280×960 and 1280×720), covering more indoor scenes. Moreover, our dataset provides absolute depth maps and pixel-level semantic segmentation that are more precise and accurate. Compared with other datasets, the indoor scenes covered by our dataset are more general and more complex.

Monocular Depth Estimation. In computer vision, monocular depth estimation has been a long-standing topic over the last decades. Previous approaches mainly focused on hand-crafted features [18], defocused features [30], statistical priors [20] or graphical models [32]. With the development of deep learning, more recent approaches are based on Convolutional Neural Networks (CNNs). For instance, Eigen et al. [13] proposed a multi-scale CNN for depth estimation and demonstrated the effectiveness of CNN-based methods with promising results. Considering the correlation between tasks, Wang et al. [51] introduced a CNN for joint depth estimation and semantic segmentation. Xu et al. [53] proposed a multi-task approach that refines depth estimation via cross-modal interactions. Recently, the attention mechanism has become popular, and Xu et al. [55] proposed a structured attention mechanism to fuse features of different scales. The most similar work to ours is [54], where a continuous Conditional Random Field (CRF) is used to combine multi-scale features. Our approach develops from a similar intuition but further integrates semantic information in an iterative way.

Semantic Segmentation. Semantic segmentation is an extension of image classification. Instead of classifying an image as a whole, semantic segmentation assigns per-pixel predictions of object categories for the given image. It is challenging due to the randomness of object distribution, poor illumination, and occlusion. Deng et al. [9] proposed a robust information theoretic (RIT) model to reduce the uncertainties, i.e., missing and noisy labels, by learning a transformation function and a discriminative classifier that maximize the mutual information of data and their labels in the latent space. Alternative approaches are typically based on CNNs. Long et al. [35] proposed the Fully Convolutional Network (FCN), a popular CNN architecture for dense prediction without any fully connected layers. Almost all subsequent approaches to semantic segmentation adopted this paradigm. With the development of depth sensors and the release of RGB-D datasets, some methods attempted to use depth information for better segmentation, no longer limited to a single RGB image. Li et al. [27] constructed HHA images [16] for the depth channel through geometric encoding before feeding them to the network and used Long Short-Term Memory (LSTM) to fuse the two kinds of features. Ma et al. [36] predicted semantic segmentation from RGB-D sequences, but the method is inapplicable to sparse views. Our method also exploits depth information to improve semantic segmentation, but the depth is estimated from the input color image instead of being directly captured by a dedicated depth sensor. We propose an iterative method for joint estimation of depth and semantic segmentation, in which the two tasks benefit each other.

Indoor Scene 3D Reconstruction. Indoor scene 3D reconstruction from a color video or multi-view color images is a challenging and active topic. Given a color video, most structure from motion (SFM) methods [47] recover the 3D structure by estimating the motion of the cameras corresponding to the frames. However, it is difficult for these methods to obtain dense and accurate reconstructions. Given multi-view color images with calibrated camera parameters, multi-view stereo (MVS) methods [33] can achieve more accurate 3D reconstruction, but they require adjacent views to have sufficient overlap and do not work well with sparse views. COLMAP [45, 46] provides a pipeline containing both SFM and MVS with graphical and command-line interfaces. When the views are very sparse, the depth of each image can be estimated and fused using iterative closest point (ICP)-like registration methods [15]. However, accurate depth estimation from individual color images is difficult, which increases the difficulty of ICP fusion. Saxena et al. [43] proposed a method for 3D reconstruction from sparse views, but it only works well for building-like outdoor scenes and cannot generate semantics. Learning-based methods, e.g., MVSNet [56] and DeepMVS [19], output the depth of a specific frame from a multi-view color sequence, but they cannot deal with sparse views. In this paper, we design IterNet to estimate a more accurate depth map with the help of semantic segmentation, and propose a joint global and local registration method to better achieve indoor scene 3D reconstruction from sparse views.

III Proposed Method

In this section, we first introduce our IterNet RGB-D dataset in Section III-A, and then describe the technical details of IterNet for iterative joint depth estimation and semantic segmentation in Section III-B. The joint global and local multi-view reconstruction method is presented in Section III-C. Figure 2 illustrates the workflow of our method.

Figure 2: Illustration of the proposed method for indoor 3D reconstruction and understanding. The blue module refers to our IterNet for iterative joint depth estimation and semantic segmentation (Section III-B). With the help of semantic segmentation, we use our proposed joint global and local registration method to reconstruct a 3D scene with semantic information from sparse views (Section III-C).

III-A Dataset

Different from the production of other synthetic datasets [17, 49], our dataset is generated on a third-party platform which includes various real-life house styles, real prototype rooms designed by professional designers, and detailed model materials. We also implement high-quality photorealistic rendering. Compared to traditional rendering, we adopt a method of image splitting and recombination to achieve distributed rendering. To accelerate rendering, we utilize the computing power of multiple CPU servers, thus multiplying the rendering speed. The average rendering time of an image is about 90 seconds. Our rendering is realized on a cluster of 32 servers, each consisting of a CPU with 32 cores and 64 threads. Rendering the 12,856 images takes about 321 hours. In terms of rendering quality, in addition to the direct illumination of the light sources in the scene, the illumination reflected by other objects, known as Global Illumination (GI), is also taken into consideration. There are many ways to achieve GI. In order to render better results, we adopt the Brute Force (BF) algorithm [50] based on path tracing. The number of samples per pixel is up to 512 and varies for different scenes. The noise level is controlled below 0.05. A lower noise level yields better rendering quality but requires longer rendering time. In order to obtain better results while minimizing the rendering time, rendered images are denoised using a wavelet-based denoising method [11]. Figure 3 shows some examples of different scenarios in our dataset. Our dataset provides photorealistic high-resolution RGB images, accurate depth maps and pixel-level semantic labels for thousands of layouts, useful for training and evaluation. Figure 4 shows more scenarios in our dataset. It can be seen that our dataset contains more complex indoor layouts, richer textures, colorful and realistic lighting, and higher-resolution images, which are more photorealistic and closer to real-world images than existing synthetic datasets. Our dataset will be made available online.
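To make the denoising step concrete, the following is a minimal sketch of wavelet soft-threshold (shrinkage) denoising in the spirit of [11]; the wavelet basis, decomposition level and threshold value are illustrative assumptions, not the exact settings of our rendering pipeline.

```python
# Minimal sketch of wavelet shrinkage denoising for a rendered image.
# Wavelet basis, decomposition level and threshold are illustrative assumptions.
import numpy as np
import pywt

def denoise_channel(channel, wavelet="db2", level=3, threshold=0.02):
    # Multi-level 2D wavelet decomposition of one color channel.
    coeffs = pywt.wavedec2(channel, wavelet, level=level)
    # Keep the approximation coefficients, soft-threshold the detail bands.
    denoised = [coeffs[0]]
    for detail_level in coeffs[1:]:
        denoised.append(tuple(pywt.threshold(band, threshold, mode="soft")
                              for band in detail_level))
    return pywt.waverec2(denoised, wavelet)

def denoise_rgb(image):
    # Apply the per-channel denoiser to an HxWx3 float image in [0, 1].
    return np.stack([denoise_channel(image[..., c]) for c in range(3)], axis=-1)
```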

Figure 3: Some examples of different scenarios in our dataset. From top to bottom: color images, ground-truth depth maps, and ground-truth semantic segmentations.
Figure 4: More examples of different scenarios in our dataset with color images, depth maps, and semantic labels. Our dataset contains more complex indoor layouts, richer textures, colorful and realistic lighting, and higher-resolution images.

III-B IterNet: Iterative CNN for Joint Depth Estimation and Semantic Segmentation

Network Architecture. The proposed IterNet is a multi-task deep CNN mainly consisting of two parts: the depth estimation sub-network and the semantic segmentation sub-network, as shown in Figure 5.

Figure 5: Overview of the proposed IterNet architecture. The CCRF blocks in the depth estimation sub-network fuse features at different scales and combine the semantic features. In the semantic segmentation sub-network, the purple block represents atrous convolution, which reduces the size of the image while increasing the receptive field. The ASPP block indicates atrous spatial pyramid pooling, which in our implementation is composed of four dilated convolutions with different rates for resampling.

In the design of the depth estimation sub-network, we refer to a monocular depth estimation method [54] that uses a continuous conditional random field (CCRF) to combine multi-scale features. Different from [54], we add a semantic branch built upon an encoder-decoder structure to extract semantic features, and further use a CCRF to integrate the multi-scale RGB features and the semantic features, which better exploits the boundary constraints provided by semantic segmentation. The RGB branch consists of a front-end base network and a refinement network combined with several CCRF modules. Together with the semantic information, the output of the RGB branch is fed into a CCRF module to generate the depth estimate, which is then used as input to the semantic segmentation sub-network.

In the semantic segmentation sub-network, we use the Long Short-Term Memorized Context Fusion (LSTM-CF) model [27] with a different fusion scheme for the RGB-D features, which is capable of fusing contextual information from multiple sources (i.e., photometric and depth channels). Instead of the original serial vertical and horizontal context layers, we adopt a parallel context layer and a direct fusion scheme to better exploit the depth information. We also add an Atrous Spatial Pyramid Pooling (ASPP) module [5] as a multi-scale feature extractor. Unlike an encoder-decoder network that extracts different intermediate layers to obtain multi-scale features, ASPP employs multiple parallel filters with different sampling rates. For the depth information, rather than directly feeding a depth image into the network, we first encode it into an HHA image [16] using geocentric encoding and then input it into the network.
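For illustration, a minimal PyTorch-style sketch of an ASPP block with four parallel dilated convolutions is given below; the dilation rates and channel widths are assumptions for exposition and do not reflect our Caffe implementation.

```python
# Sketch of an ASPP block: four parallel dilated (atrous) convolutions whose
# outputs are concatenated and fused by a 1x1 convolution.
# Dilation rates and channel widths here are illustrative assumptions.
import torch
import torch.nn as nn

class ASPP(nn.Module):
    def __init__(self, in_channels, out_channels, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=3,
                          padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(out_channels),
                nn.ReLU(inplace=True))
            for r in rates
        ])
        # Fuse the concatenated multi-scale features.
        self.project = nn.Conv2d(len(rates) * out_channels, out_channels,
                                 kernel_size=1)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]
        return self.project(torch.cat(feats, dim=1))

# Example: multi-scale features from a 256-channel feature map.
# out = ASPP(256, 64)(torch.randn(1, 256, 60, 80))
```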

Training and Testing. Given datasets of RGB-Depth-Semantic triplets, our aim is to train the designed network for joint depth and semantic estimation. The depth estimation sub-network and the semantic segmentation sub-network are designed to interact with each other to boost performance. Instead of jointly training the two sub-networks, we train the depth estimation and semantic segmentation sub-networks sequentially for flexible boosting. Taking the depth estimation sub-network as an example, we train the upper branch and the lower branch with RGB-Depth pairs and Semantic-Depth pairs, respectively. The depth estimation sub-network is then fine-tuned with the RGB-Depth-Semantic triplets. The semantic segmentation sub-network is trained in a similar way.

At the test stage, since each sub-network expects the output of the other as part of its input, we use the following strategy. We need an initial semantic segmentation or depth estimation, which can be easily obtained by disabling one of the branches in the original network structure. For example, to obtain an initial depth estimate for semantic segmentation, we disable the semantic branch in the depth estimation sub-network and extract features from the RGB branch alone as an initial depth. We then alternately run the two sub-networks, with the output of one sub-network used as input to the other. The additional depth information helps improve semantic segmentation, and the semantic segmentation in turn contributes to improved depth estimation. In practice, we find that there is no significant improvement after 3 iterations, which shows quick convergence.
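The alternating test-time procedure can be summarized by the following sketch; the function names (iterative_inference, depth_subnet, semantic_subnet) are placeholders for our two sub-networks, not the API of the released code.

```python
# Sketch of IterNet's alternating inference (names are placeholders, not the
# actual API of the released code).
def iterative_inference(rgb, depth_subnet, semantic_subnet, num_iters=3):
    # Initialization: run the depth sub-network with its semantic branch
    # disabled to get an initial depth estimate from the RGB image alone.
    depth = depth_subnet(rgb, semantics=None)
    semantics = None
    for _ in range(num_iters):
        # Depth helps semantic segmentation ...
        semantics = semantic_subnet(rgb, depth)
        # ... and the refined semantics in turn improve depth estimation.
        depth = depth_subnet(rgb, semantics)
    # In practice no significant improvement is observed beyond 3 iterations.
    return depth, semantics
```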

Implementation Details. The proposed approach is implemented in the Caffe framework [22] and runs on a computer with an NVIDIA GTX 1080 Ti graphics card (11 GB). For the depth estimation sub-network, the learning rate decays every 30 epochs. The batch size is set to 16, and the momentum and the weight decay are set to 0.9 and 0.0005, respectively. The semantic segmentation sub-network follows the same training rules but with a different initial learning rate; its batch size, momentum and weight decay are set to 8, 0.9 and 0.005, respectively, and its learning rate decays every 20 epochs. When the pretraining of each branch is finished, we fine-tune the sub-networks with separate initial learning rates for depth estimation and semantic segmentation. The batch size, momentum and weight decay remain the same as in pretraining.

III-C Joint Global and Local Reconstruction

After obtaining the depth and the semantic segmentation for the image of each view, we reconstruct the whole 3D scene by fusing the depths of different views. A straightforward way is to use the ICP algorithm to align the point clouds transformed from the depths of different perspectives. However, it is difficult to achieve satisfactory alignment in this way. First, the depths are obtained by a monocular depth estimation network rather than captured by Kinect or another depth camera, and contain non-statistical errors; it is therefore insufficient to align two depth point clouds with just one rigid transformation. Second, for sparse perspectives, the overlap between two adjacent views is limited, which is difficult to handle with standard ICP algorithms. Hence, we propose a new joint global and local registration method that exploits photometric and semantic information to improve reconstruction quality.

Before fusion, we filter out messy points based on a plane constraint similar to [3]. Let $\{V_i\}_{i=1}^{N}$ be the sparse view set, where $N$ is the total number of views for reconstruction. After depth estimation and semantic segmentation, each view $V_i$ contains three components: color $C_i$, depth $D_i$ and segmentation $S_i$. We align all the depth point clouds in sequence, with the previous registration result used as the next target model. Each alignment has two stages, namely global alignment and local alignment.

Global alignment. Taking the point cloud generated from the previous views as the target, our goal for global alignment is to find an optimal global rigid transformation $T_i$ for view $V_i$, composed of a rotation $R_i$ and a translation $t_i$. Specifically, we first convert the depth map $D_i$ into a point cloud $P_i = \{p_i^k\}_{k=1}^{M_i}$, where $P_i$ is the point set of the $i$-th view and $M_i$ is the total number of points in that view. We take a global ICP-type framework that alternates two steps until convergence, with the transformation initialized as the identity. Assuming the target point cloud $Q$ contains all the fused points from the previous views, the first step finds for each point $p_i^k$ its corresponding point in $Q$ if possible, and the second step updates the transformation such that, when applied to $P_i$, the point cloud is aligned with $Q$.

In the first step, we exploit the additional photometric and semantic information. We lift each 3D point $p_i^k$ to a point in a 7-dimensional (7D) space, $\tilde{p}_i^k = (\mathbf{x}_p, \mathbf{c}_p, s_p)$, including its 3D position $\mathbf{x}_p$, RGB color $\mathbf{c}_p$ and semantic label $s_p$. Similarly, each target point $q$ is lifted to a 7D point $\tilde{q} = (\mathbf{x}_q, \mathbf{c}_q, s_q)$. Our global registration method for aligning $P_i$ and $Q$ first finds the corresponding point for each point in $P_i$ by the following optimization:

$$\tilde{q}^{k*} = \arg\min_{\tilde{q} \in \tilde{Q}} \; \|\mathbf{x}_p - \mathbf{x}_q\|_2^2 + \lambda_1 \|\mathbf{c}_p - \mathbf{c}_q\|_2^2 + \lambda_2\, \delta(s_p \neq s_q), \tag{1}$$

where $\lambda_1$ and $\lambda_2$ are weights that balance the importance of geometric, photometric and semantic information; their values are fixed in our experiments.

Due to the limited overlap, not all points in $P_i$ have corresponding points in $Q$. We reject a correspondence if its matching error is larger than a threshold; in our implementation, this threshold is set to 5 cm, and correspondences with larger distances are ignored. Let $\mathcal{C}$ be the set of retained correspondences. In the second step, since the photometric and semantic matching errors are independent of rigid transformations, we use a standard ICP algorithm [15] to find the transformation between the two point clouds:

$$(R_i, t_i) = \arg\min_{R, t} \sum_{(p, q) \in \mathcal{C}} \|R\,\mathbf{x}_p + t - \mathbf{x}_q\|_2^2. \tag{2}$$
Figure 6: Comparison of different alignment methods: From left to right are results of standard ICP algorithm [15], 4PCS [1], global alignment using the estimated depth without the help of semantic branch, and our joint global and local alignment method.
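As an illustration of the global step, the sketch below searches correspondences in the spirit of Eq. (1) and rejects matches beyond the 5 cm threshold; the KD-tree candidate pruning, the helper name find_correspondences and the default weights are assumptions rather than our exact implementation.

```python
# Sketch of the global correspondence search (Eq. (1)) with rejection of
# matches whose geometric error exceeds a threshold. Candidate pruning by a 3D
# KD-tree and the weight values are illustrative assumptions.
import numpy as np
from scipy.spatial import cKDTree

def find_correspondences(src_xyz, src_rgb, src_lbl,
                         tgt_xyz, tgt_rgb, tgt_lbl,
                         lam_photo=1.0, lam_sem=1.0,
                         reject_dist=0.05, k=8):
    tree = cKDTree(tgt_xyz)
    _, knn = tree.query(src_xyz, k=k)              # k geometric candidates
    pairs = []
    for i in range(len(src_xyz)):
        cand = knn[i]
        # 7D matching cost: geometry + photometry + semantic disagreement.
        cost = (np.sum((tgt_xyz[cand] - src_xyz[i]) ** 2, axis=1)
                + lam_photo * np.sum((tgt_rgb[cand] - src_rgb[i]) ** 2, axis=1)
                + lam_sem * (tgt_lbl[cand] != src_lbl[i]))
        j = cand[np.argmin(cost)]
        # Reject correspondences that are too far apart geometrically.
        if np.linalg.norm(tgt_xyz[j] - src_xyz[i]) <= reject_dist:
            pairs.append((i, j))
    return pairs
```

The retained pairs would then be passed to a standard rigid-transformation solver, e.g., the SVD-based point-to-point update used by ICP for Eq. (2).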

Local alignment. Using the 7D global registration method, we achieve a coarse alignment that broadly aligns the different views, but it still cannot cope with the non-statistical errors of monocular depth estimation, as such local deformation is no longer rigid. To address this problem, we further propose a local registration strategy to refine the previous coarse estimation, in a coarse-to-fine fashion. Specifically, we first extract local point sets from the original point cloud according to their semantic labels, and then register each of them using the above method. Note that in this case, a subset of points from one view is only matched to subsets of points with the same semantic label; therefore, when finding the matched point, the semantic difference term in Eq. (1) is always zero. Once each local set is aligned, we fuse the registered parts from different views by averaging the 3D positions of overlapping points to mitigate the influence of noise. The key of our joint global and local registration method is to register sparse views with multiple transformations in a coarse-to-fine manner, rather than a single transformation, which is more robust to the noise and outliers of monocular depth estimation.
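A minimal sketch of the label-wise local refinement is given below, assuming the correspondence helper above is wrapped into a generic rigid solver register_rigid (a placeholder, not part of our released code); the per-label grouping follows the description in the text.

```python
# Sketch of label-wise local refinement: register each semantic subset
# separately with its own rigid transformation. `register_rigid` stands for
# any standard ICP-style solver (e.g., the update of Eq. (2)) and is a
# placeholder here.
import numpy as np

def local_refine(src_xyz, src_rgb, src_lbl, tgt_xyz, tgt_rgb, tgt_lbl,
                 register_rigid):
    refined = np.copy(src_xyz)
    for label in np.unique(src_lbl):
        s_mask = src_lbl == label
        t_mask = tgt_lbl == label
        if not t_mask.any():
            continue                       # no counterpart with this label
        # Within one label, the semantic term of Eq. (1) is always zero,
        # so registration reduces to geometry + photometry.
        R, t = register_rigid(src_xyz[s_mask], src_rgb[s_mask],
                              tgt_xyz[t_mask], tgt_rgb[t_mask])
        refined[s_mask] = src_xyz[s_mask] @ R.T + t
    return refined
```

Overlapping regions between the refined source and the target can then be fused by averaging the 3D positions of matched points, as described above.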

IV Experimental Results

IV-A Ablation Study

We compare the full model with two variants, the full model without semantic segmentation and the full model without depth estimation, in Table II. It can be seen that our full model achieves the best performance. Figure 6 shows the fusion results of an ICP matching method [15], 4PCS [1], global alignment using the estimated depth without the help of the semantic branch, and our proposed joint global and local registration method. Some misalignments occur in local areas for the standard ICP methods. In contrast, our method achieves better fusion results in terms of both global structure and local details.

| Metric | F-S | F-D | F |
|---|---|---|---|
| rel (lower is better) | 0.176 | - | 0.136 |
| log10 (lower is better) | 0.088 | - | 0.062 |
| rms (lower is better) | 1.012 | - | 0.507 |
| P-acc. (%) (higher is better) | - | 67.35 | 75.54 |
| M-acc. (%) (higher is better) | - | 68.29 | 74.49 |
| IoU (%) (higher is better) | - | 54.21 | 63.98 |

Table II: Ablation study on our dataset. F-S: full model without semantic; F-D: full model without depth; F: full model.

Our iterative scheme in IterNet usually converges to promising results after three iterations and is stable across various images. Figure 7 shows how the average RMS (root mean squared) error of depth estimation over all test images decreases with the iterations, and how the average pixel accuracy of semantic segmentation over all test images increases. It can be seen that there is no significant improvement for either depth estimation or semantic segmentation beyond three iterations.

To study and verify the role of IterNet in depth estimation, we compare two recent backbone architectures, Structured Attention Guided Convolutional Neural Fields [55] and CCRF [54], which achieve promising performance in depth estimation. Figure 8 shows the comparison results on our IterNet RGB-D dataset. We crop the high-resolution images into small patches of 426×426 and feed them into the networks. It can be seen that our framework significantly enhances the attention with clearer object structures, and refines the CCRF architecture with sharper contours for some objects such as the pillow and the chair.

Figure 7: Convergence curves of the proposed IterNet for NYUv2 dataset [39] and our dataset (averaged over all test images in each dataset).
Figure 8: Comparison of depth estimation with two different network architectures.
Figure 9: Depth estimation results on NYUv2 dataset (top two rows) and our dataset (bottom row). From left to right are the input RGB images, the ground-truth depths, and the depth results estimated by Eigen et al. [13], Xu et al. [54], Xu and Wang [55], and our method.

IV-B Depth Estimation

We compare our approach with several state-of-the-art methods on the NYUv2 dataset [39] in Table III. We use 795 images for training and the other 654 images for testing, as other methods did. We also use the same raw data as other methods and adopt data augmentation (finally 4770 images for training) to avoid over-fitting. Following previous work [12, 13, 51], we evaluate the depth estimation results with the following metrics: (1) mean relative error (rel): $\frac{1}{T}\sum_{i}\frac{|d_i - d_i^*|}{d_i^*}$; (2) root mean squared error (rms): $\sqrt{\frac{1}{T}\sum_{i}(d_i - d_i^*)^2}$; (3) mean log10 error (log10): $\frac{1}{T}\sum_{i}|\log_{10} d_i - \log_{10} d_i^*|$; and (4) accuracy with threshold $t$: the percentage (%) of $d_i$ subject to $\max(\frac{d_i}{d_i^*}, \frac{d_i^*}{d_i}) = \delta < t$ with $t \in \{1.25, 1.25^2, 1.25^3\}$, where $d_i$ and $d_i^*$ denote the predicted depth value and the ground-truth value for pixel $i$, and $T$ is the total number of pixels. The results of the compared methods are quoted from their papers. Our method outperforms thirteen competing methods in all metrics, and is comparable to PAD-Net [53], which has a more complex network structure and requires ground-truth contours and normals as part of the labels. We run multiple training trials and consistently achieve these results. We also quantitatively evaluate several methods with their provided code on our IterNet RGB-D dataset. As shown in Table IV, our method achieves the most accurate depth estimation on all metrics. Figure 9 gives some visual comparison results on the NYUv2 dataset [39] and our dataset. Figure 10 gives more qualitative comparison results with enlarged local areas on the NYUv2 dataset [39] and our dataset. It can be seen that our method achieves more accurate depth estimation, consistent with the quantitative evaluation. Although [54] also has good visual results due to promising estimation of relative depths between objects, our method achieves more accurate results both visually and quantitatively.
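For reference, the four depth metrics can be computed as in the following NumPy sketch; the variable names are ours, and a small clamp is added to guard the logarithm.

```python
# Standard depth-evaluation metrics over valid pixels (NumPy sketch).
import numpy as np

def depth_metrics(pred, gt):
    pred, gt = pred.ravel(), gt.ravel()
    valid = gt > 0                              # ignore invalid ground truth
    d = np.maximum(pred[valid], 1e-6)           # clamp to keep log10 finite
    d_star = gt[valid]
    rel = np.mean(np.abs(d - d_star) / d_star)
    rms = np.sqrt(np.mean((d - d_star) ** 2))
    log10 = np.mean(np.abs(np.log10(d) - np.log10(d_star)))
    ratio = np.maximum(d / d_star, d_star / d)
    acc = {t: np.mean(ratio < t) for t in (1.25, 1.25 ** 2, 1.25 ** 3)}
    return rel, rms, log10, acc
```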

To evaluate the generalizability of our model trained on our dataset, we show some depth estimation results for real indoor scenes from the NYUv2 dataset [39] and the SUN RGB-D dataset [48], without fine-tuning, in Figure 11. It can be seen that our model trained on our dataset generalizes well to other datasets.

Figure 10: Depth estimation results on NYUv2 dataset [39] and our dataset. From left to right are the input RGB images, the ground-truth depths, and the depth results estimated by Eigen et al. [13], Xu et al. [54], Xu and Wang [55], and our method.
Figure 11: Depth estimation results on NYUv2 dataset [39] (a, b, c) and SUN RGB-D dataset [48] (d, e, f) using our model trained by our dataset. From top to bottom are the input color images, the ground truths, and our estimated depths.
| Method | rel | log10 | rms | δ < 1.25 | δ < 1.25² | δ < 1.25³ |
|---|---|---|---|---|---|---|
| Saxena et al. [44] | 0.349 | - | 1.214 | 0.447 | 0.745 | 0.897 |
| Liu et al. [32] | 0.335 | 0.127 | 1.06 | - | - | - |
| Karsch et al. [23] | 0.35 | 0.131 | 1.20 | - | - | - |
| Ladicky et al. [25] | - | - | - | 0.542 | 0.829 | 0.941 |
| Zhuo et al. [58] | 0.305 | 0.122 | 1.04 | 0.525 | 0.838 | 0.962 |
| Liu et al. [31] | 0.213 | 0.087 | 0.759 | 0.650 | 0.906 | 0.976 |
| Roy and Todorovic [42] | 0.187 | 0.078 | 0.744 | - | - | - |
| Eigen et al. [13] | 0.215 | - | 0.907 | 0.611 | 0.887 | 0.971 |
| Eigen and Fergus [12] | 0.158 | - | 0.641 | 0.769 | 0.950 | 0.988 |
| Laina et al. [26] | 0.129 | 0.056 | 0.583 | 0.801 | 0.950 | 0.986 |
| Xu et al. [54] | 0.139 | 0.063 | 0.609 | 0.793 | 0.948 | 0.984 |
| Xu and Wang [55] | 0.121 | 0.052 | 0.586 | 0.811 | 0.954 | 0.987 |
| Joint HCRF [51] | 0.220 | 0.094 | 0.745 | 0.605 | 0.890 | 0.970 |
| Jafari et al. [21] | 0.157 | 0.068 | 0.673 | 0.762 | 0.948 | 0.988 |
| PAD-Net [53] | 0.120 | 0.055 | 0.582 | 0.817 | 0.954 | 0.987 |
| Ours | 0.122 | 0.051 | 0.582 | 0.819 | 0.953 | 0.988 |

Table III: Quantitative evaluation for depth estimation on the NYUv2 dataset. For the error metrics (rel, log10, rms), lower is better; for the accuracy metrics (δ), higher is better.
Figure 12: Semantic segmentation results on NYUv2 dataset (top two rows) and our dataset (bottom row). From left to right are the input RGB images, the ground-truths and the results estimated by FCN [35], Chen et al.[5], Li et al.[27], Zhao et al.[57] and our method.

IV-C Semantic Segmentation

To evaluate the performance of semantic segmentation, we use the NYUv2-40 dataset [35], in which all objects in the NYUv2 dataset [39] are divided into 40 categories. We use the same training and testing data as other methods and adopt three metrics in percentage (%): pixel accuracy, mean accuracy, and Intersection over Union (IoU). As shown in Table V, our inferred semantic segmentation results outperform those of state-of-the-art methods. We also quantitatively evaluate recent works that provide source code on our IterNet RGB-D dataset in Table VI. It can be seen that our method again achieves the best performance. Figure 12 presents some visual comparison results on the NYUv2-40 dataset and our dataset mapped into 87 categories. Consistent with the quantitative results in Table V and Table VI, our approach generates more accurate semantic segmentation on both the real dataset (NYUv2) and the synthetic dataset (IterNet RGB-D) than state-of-the-art methods. More qualitative comparison results for semantic segmentation are depicted in Figure 13 and Figure 14, where our approach again produces more accurate segmentation than the other four competing methods on both datasets.
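For reference, the three segmentation metrics can be computed from a class confusion matrix as in the following NumPy sketch (standard definitions; variable names are ours).

```python
# Pixel accuracy, mean (per-class) accuracy and mean IoU from a confusion
# matrix conf, where conf[i, j] counts pixels of ground-truth class i
# predicted as class j.
import numpy as np

def segmentation_metrics(conf):
    conf = conf.astype(float)
    tp = np.diag(conf)                               # true positives per class
    gt_per_class = conf.sum(axis=1)
    pred_per_class = conf.sum(axis=0)
    present = gt_per_class > 0                       # classes present in GT
    pixel_acc = tp.sum() / conf.sum()
    mean_acc = np.mean(tp[present] / gt_per_class[present])
    union = gt_per_class + pred_per_class - tp
    mean_iou = np.mean(tp[present] / union[present])
    return pixel_acc, mean_acc, mean_iou
```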

Figure 13: Semantic segmentation results on NYUv2 dataset [39]. From left to right are the input RGB images, the ground-truths and the results estimated by FCN [35], Chen et al.[5], Li et al.[27], Zhao et al.[57] and our method.
Figure 14: Semantic segmentation results on our dataset. From left to right are the input RGB images, the ground-truths and the results estimated by FCN [35], Chen et al.[5], Li et al.[27], Zhao et al.[57] and our method.

IV-D Multi-view Reconstruction

| Method | rel | log10 | rms | δ < 1.25 | δ < 1.25² | δ < 1.25³ |
|---|---|---|---|---|---|---|
| Eigen et al. [13] | 0.948 | 0.285 | 4.711 | 0.054 | 0.205 | 0.492 |
| Laina et al. [26] | 0.404 | 0.235 | 3.433 | 0.102 | 0.310 | 0.581 |
| Xu et al. [54] | 0.175 | 0.089 | 1.010 | 0.435 | 0.700 | 0.907 |
| Xu and Wang [55] | 0.151 | 0.067 | 0.620 | 0.536 | 0.817 | 0.975 |
| Ours | 0.136 | 0.062 | 0.507 | 0.568 | 0.918 | 0.982 |

Table IV: Quantitative evaluation for depth estimation on our dataset. For the error metrics (rel, log10, rms), lower is better; for the accuracy metrics (δ), higher is better.
| Method | Pixel Accuracy | Mean Accuracy | IoU |
|---|---|---|---|
| Deng et al. [10] | 63.8 | 31.5 | - |
| FCN [35] | 60.0 | 42.2 | 29.2 |
| FCN-HHA [35] | 65.4 | 46.1 | 34.0 |
| Eigen et al. [12] | 65.6 | 45.1 | 34.1 |
| Lin et al. [29] | 70.0 | 53.6 | 40.6 |
| RefineNet [28] | 73.6 | 58.9 | 46.5 |
| Kong et al. [24] | 72.1 | - | 44.5 |
| Saxena et al. [44] | - | 55.7 | 43.1 |
| Gupta et al. [16] | 60.3 | - | 28.6 |
| Mousavian et al. [37] | 68.6 | 52.3 | 39.2 |
| Ours | 74.3 | 59.4 | 48.7 |

Table V: Quantitative evaluation for semantic segmentation on the NYUv2-40 dataset.
| Method | Pixel Accuracy | Mean Accuracy | IoU |
|---|---|---|---|
| FCN [35] | 47.07 | 33.76 | 24.63 |
| Chen et al. [5] | 66.28 | 67.98 | 53.90 |
| Li et al. [27] | 61.97 | 46.93 | 40.46 |
| Zhao et al. [57] | 74.82 | 72.36 | 60.91 |
| Ours | 75.54 | 74.49 | 63.98 |

Table VI: Quantitative evaluation for semantic segmentation on our dataset.
Figure 15: Comparison of scene reconstruction results of different methods on NYUv2 dataset (top two rows) and our dataset (bottom two rows). From left to right are the results of COLMAP [45, 46], PMVS2 [14], OpenMVS [41], DeepMVS [19] and our method.

In Figure 15, we evaluate the multi-view 3D reconstruction performance of the proposed method on the NYUv2 dataset [39] and our dataset using three wide-baseline views, compared with four state-of-the-art multi-view stereo methods: COLMAP [45, 46], PMVS2 [14], OpenMVS [41] and DeepMVS [19]. We obtain the sparse views for the NYUv2 dataset by selecting one frame per 30-40 frames, and use the camera parameters estimated by COLMAP [45] for OpenMVS [41], PMVS2 [14] and DeepMVS [19]. As shown in Figure 15, COLMAP [45, 46] fails to generate meaningful results on the NYUv2 dataset from sparse views. Obviously wrong points can be seen for PMVS2 [14] and OpenMVS [41]: some points cluster together in the side and top views on the NYUv2 dataset. Moreover, their point clouds are too sparse to provide acceptable results by linear interpolation. DeepMVS reconstructs more points than the traditional methods, but the reconstructed model contains a lot of noise and outliers. In contrast, our method achieves the best results for sparse multi-view reconstruction by considering 7D information (geometry, photometry and semantics) and using joint global and local registration. More results on the NYUv2 dataset [39] and our dataset using three or four sparse views are given in Figure 16 and Figure 17, respectively. It can be seen that the multi-view stereo method in COLMAP [46] fails to generate 3D point clouds, and the point clouds reconstructed by OpenMVS [41] and PMVS2 [14] lack sufficient density and completeness. Although DeepMVS [19] achieves dense reconstruction, the reconstructed model contains many wrong points. In contrast, our method achieves accurate and complete reconstruction from sparse views. Because COLMAP [46] fails for most scenes in the NYUv2 dataset [39], we give the quantitative evaluation on our dataset in Table VII. We use two indicators to evaluate the MVS reconstruction results: accuracy and completeness. Accuracy is the average distance between points on the reconstructed model and their nearest points on the ground-truth model. Completeness is the percentage of points on the ground-truth model that have corresponding points on the reconstructed model within a certain distance threshold (0.1). We generate the 3D ground-truth model by fusing the multi-view ground-truth depth point clouds using ICP. As shown in Table VII, our method achieves the most complete reconstruction while maintaining accuracy. Although the traditional multi-view stereo methods [46, 14, 41] have higher accuracy, their reconstructed points are too sparse to provide acceptable results by linear interpolation. Figure 18 shows our reconstructed models on the NYUv2 dataset [39] and our dataset presented from five different views.
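The two indicators can be computed with nearest-neighbor queries, as in the sketch below; the KD-tree implementation and the function name are ours, and the 0.1 threshold follows the text (in the units of the point clouds).

```python
# Accuracy: mean distance from reconstructed points to their nearest
# ground-truth points. Completeness: fraction of ground-truth points that have
# a reconstructed point within the threshold.
import numpy as np
from scipy.spatial import cKDTree

def mvs_accuracy_completeness(recon_xyz, gt_xyz, threshold=0.1):
    d_recon_to_gt, _ = cKDTree(gt_xyz).query(recon_xyz)
    d_gt_to_recon, _ = cKDTree(recon_xyz).query(gt_xyz)
    accuracy = d_recon_to_gt.mean()
    completeness = np.mean(d_gt_to_recon <= threshold)
    return accuracy, completeness
```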

| Method | Accuracy (lower is better) | Completeness (higher is better) |
|---|---|---|
| COLMAP [46] | 3.74 | 2.33% |
| PMVS2 [14] | 3.71 | 1.83% |
| OpenMVS [41] | 3.68 | 1.25% |
| DeepMVS [19] | 21.49 | 12.47% |
| Ours | 17.72 | 31.55% |

Table VII: Quantitative evaluation for multi-view reconstruction.
Figure 16: Comparison of multi-view reconstruction results of different methods on NYUv2 dataset [39]. From left to right are the results of COLMAP [45, 46], PMVS2 [14], OpenMVS [41], DeepMVS [19] and our method.
Figure 17: Comparison of scene reconstruction results of different methods on our dataset. From left to right are the results of COLMAP [45, 46], PMVS2 [14], OpenMVS [41], DeepMVS [19] and our method.
Figure 18: Our reconstructed models on NYUv2 dataset [39] and our dataset presented from five different views as illustrated for each scene.

V Conclusions

In this paper, we address a challenging problem: reconstructing and understanding indoor 3D scenes from several color images captured from uncalibrated sparse views. We propose IterNet, a novel iterative network that jointly estimates the depth map and semantic segmentation from a single color image, and a joint global and local registration method to reconstruct indoor 3D scenes from sparse views. We also introduce and make available the IterNet RGB-D dataset, a new dataset that simultaneously provides high-resolution photorealistic RGB images, accurate depth maps, and pixel-level semantic labels for thousands of layouts. Experimental results on both public datasets and our dataset demonstrate that our method achieves the best results on depth estimation, semantic segmentation and multi-view reconstruction, compared with state-of-the-art methods.

References

  • [1] D. Aiger, N. J. Mitra, and D. Cohen-Or. 4-points congruent sets for robust surface registration. ACM Trans. Graphics, 27(3):#85, 1–10, 2008.
  • [2] I. Armeni, S. Sax, A. R. Zamir, and S. Savarese. Joint 2D-3D-semantic data for indoor scene understanding. arXiv preprint arXiv:1702.01105, 2017.
  • [3] A. Bódis-Szomorú, H. Riemenschneider, and L. Van Gool. Superpixel meshes for fast edge-preserving surface reconstruction. In CVPR, pages 2011–2020, 2015.
  • [4] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Nießner, M. Savva, S. Song, A. Zeng, and Y. Zhang. Matterport3D: Learning from RGB-D data in indoor environments. arXiv preprint arXiv:1709.06158, 2017.
  • [5] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. PAMI, 40(4):834–848, 2018.
  • [6] H. Cui, S. Shen, W. Gao, and Z. Hu. Efficient large-scale structure from motion by fusing auxiliary imaging information. IEEE Trans. Image Processing, 24(11):3561–3573, 2015.
  • [7] A. Dai, A. X. Chang, M. Savva, M. Halber, T. A. Funkhouser, and M. Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In CVPR, volume 2, page 10, 2017.
  • [8] A. Dai, M. Nießner, M. Zollhöfer, S. Izadi, and C. Theobalt. BundleFusion: Real-time globally consistent 3D reconstruction using on-the-fly surface reintegration. ACM Trans. Graphics, 36(4):76a, 2017.
  • [9] Y. Deng, F. Bao, X. Deng, R. Wang, Y. Kong, and Q. Dai. Deep and structured robust information theoretic learning for image analysis. IEEE Trans. Image Processing, 25(9):4209–4221, 2016.
  • [10] Z. Deng, S. Todorovic, and L. Jan Latecki. Semantic segmentation of RGBD images with mutex constraints. In ICCV, pages 1733–1741, 2015.
  • [11] D. L. Donoho and J. M. Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81(3):425–455, 1994.
  • [12] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In ICCV, pages 2650–2658, 2015.
  • [13] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In NIPS, pages 2366–2374, 2014.
  • [14] Y. Furukawa and J. Ponce. Accurate, dense, and robust multiview stereopsis. IEEE Trans. PAMI, 32(8):1362–1376, 2010.
  • [15] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. In CVPR, 2012.
  • [16] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik. Learning rich features from RGB-D images for object detection and segmentation. In ECCV, pages 345–360, 2014.
  • [17] A. Handa, V. Pătrăucean, S. Stent, and R. Cipolla. Scenenet: An annotated model generator for indoor scene understanding. In ICRA, pages 5737–5743, 2016.
  • [18] D. Hoiem, A. A. Efros, and M. Hebert. Automatic photo pop-up. ACM Trans. Graphics, 24(3):577–584, 2005.
  • [19] P.-H. Huang, K. Matzen, J. Kopf, N. Ahuja, and J.-B. Huang. DeepMVS: Learning multi-view stereopsis. In CVPR, pages 2821–2830, 2018.
  • [20] W. Huang, X. Cao, K. Lu, Q. Dai, and A. C. Bovik. Toward naturalistic 2D-to-3D conversion. IEEE Trans. Image Processing, 24(2):724–733, 2015.
  • [21] O. H. Jafari, O. Groth, A. Kirillov, M. Y. Yang, and C. Rother. Analyzing modular CNN architectures for joint depth prediction and semantic segmentation. In ICRA, pages 4620–4627, 2017.
  • [22] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM MM, pages 675–678, 2014.
  • [23] K. Karsch, C. Liu, and S. B. Kang. Depth transfer: Depth extraction from video using non-parametric sampling. IEEE Trans. PAMI, 36(11):2144–2158, 2014.
  • [24] S. Kong and C. Fowlkes. Recurrent scene parsing with perspective understanding in the loop. arXiv preprint arXiv:1705.07238, 2017.
  • [25] L. Ladicky, J. Shi, and M. Pollefeys. Pulling things out of perspective. In CVPR, pages 89–96, 2014.
  • [26] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab. Deeper depth prediction with fully convolutional residual networks. In 3DV, pages 239–248, 2016.
  • [27] Z. Li, Y. Gan, X. Liang, Y. Yu, H. Cheng, and L. Lin. LSTM-CF: Unifying context modeling and fusion with lstms for RGB-D scene labeling. In ECCV, pages 541–557, 2016.
  • [28] G. Lin, A. Milan, C. Shen, and I. D. Reid. RefineNet: Multi-path refinement networks for high-resolution semantic segmentation. In CVPR, 2017.
  • [29] G. Lin, C. Shen, A. Van Den Hengel, and I. Reid. Efficient piecewise training of deep structured models for semantic segmentation. In CVPR, pages 3194–3203, 2016.
  • [30] J. Lin, X. Ji, W. Xu, and Q. Dai. Absolute depth estimation from a single defocused image. IEEE Trans. Image Processing, 22(11):4545–4550, 2013.
  • [31] F. Liu, C. Shen, G. Lin, and I. D. Reid. Learning depth from single monocular images using deep convolutional neural fields. IEEE Trans. PAMI, 38(10):2024–2039, 2016.
  • [32] M. Liu, M. Salzmann, and X. He. Discrete-continuous depth estimation from a single image. In CVPR, pages 716–723, 2014.
  • [33] Y. Liu, Q. Dai, and W. Xu. A point-cloud-based multiview stereo algorithm for free-viewpoint video. IEEE Trans. VCG, 16(3):407–418, 2010.
  • [34] Z. Liu, Z. Shi, and W. Xu. On optimal dynamic sequential search for matching in real-time machine vision. IEEE Trans. Image Processing, 19(11):3000–3011, 2010.
  • [35] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.
  • [36] L. Ma, J. Stückler, C. Kerl, and D. Cremers. Multi-view deep learning for consistent semantic mapping with RGB-D cameras. In IROS, 2017.
  • [37] A. Mousavian, H. Pirsiavash, and J. Košecká. Joint semantic segmentation and depth estimation with deep convolutional networks. In 3DV, pages 611–619, 2016.
  • [38] M. Naseer, S. H. Khan, and F. Porikli. Indoor scene understanding in 2.5/3D: A survey. arXiv preprint arXiv:1803.03352, 2018.
  • [39] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.
  • [40] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In ISMAR, pages 127–136, 2011.
  • [41] OpenMVS. Open multi-view stereo reconstruction library. http://cdcseacave.github.io/openMVS.
  • [42] A. Roy and S. Todorovic. Monocular depth estimation using neural regression forest. In CVPR, pages 5506–5514, 2016.
  • [43] A. Saxena, S. Min, and A. Y. Ng. 3-d reconstruction from sparse views using monocular vision. In ICCV, 2007.
  • [44] A. Saxena, M. Sun, and A. Y. Ng. Make3D: Learning 3D scene structure from a single still image. IEEE Trans. PAMI, 31(5):824–840, 2009.
  • [45] J. L. Schönberger and J.-M. Frahm. Structure-from-Motion Revisited. In CVPR, 2016.
  • [46] J. L. Schönberger, E. Zheng, M. Pollefeys, and J.-M. Frahm. Pixelwise View Selection for Unstructured Multi-View Stereo. In ECCV, 2016.
  • [47] N. Snavely, S. M. Seitz, and R. Szeliski. Photo tourism: exploring photo collections in 3D. ACM Transactions on Graphics, 25(3):835–846, 2006.
  • [48] S. Song, S. P. Lichtenberg, and J. Xiao. SUN RGB-D: A RGB-D scene understanding benchmark suite. In CVPR, pages 567–576, 2015.
  • [49] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser. Semantic scene completion from a single depth image. CVPR, 2017.
  • [50] R. L. Cook, T. Porter, and L. Carpenter. Distributed ray tracing. ACM SIGGRAPH Computer Graphics, 18(3):137–145, 1984.
  • [51] P. Wang, X. Shen, Z. Lin, S. Cohen, B. Price, and A. L. Yuille. Towards unified depth and semantic prediction from a single image. In CVPR, pages 2800–2809, 2015.
  • [52] T. Whelan, R. F. Salas-Moreno, B. Glocker, A. J. Davison, and S. Leutenegger. ElasticFusion: Real-time dense SLAM and light source estimation. International Journal of Robotics Research, 35(14):1697–1716, 2016.
  • [53] D. Xu, W. Ouyang, X. Wang, and N. Sebe. PAD-Net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing. arXiv preprint arXiv:1805.04409, 2018.
  • [54] D. Xu, E. Ricci, W. Ouyang, X. Wang, and N. Sebe. Multi-scale continuous CRFs as sequential deep networks for monocular depth estimation. In CVPR, 2017.
  • [55] D. Xu, W. Wang, H. Tang, H. Liu, N. Sebe, and E. Ricci. Structured attention guided convolutional neural fields for monocular depth estimation. In CVPR, pages 3917–3925, 2018.
  • [56] Y. Yao, Z. Luo, S. Li, T. Fang, and L. Quan. MVSNet: Depth inference for unstructured multi-view stereo. In ECCV, pages 767–783, 2018.
  • [57] H. Zhao, Y. Zhang, S. Liu, J. Shi, C. C. Loy, D. Lin, and J. Jia. PSANet: Point-wise spatial attention network for scene parsing. In ECCV, 2018.
  • [58] W. Zhuo, M. Salzmann, X. He, and M. Liu. Indoor scene structure analysis for single image depth estimation. In CVPR, pages 614–622, 2015.