StructDepth: Leveraging the structural regularities for self-supervised indoor depth estimation

08/19/2021 · Boying Li, et al. · Shanghai Jiao Tong University

Self-supervised monocular depth estimation has achieved impressive performance on outdoor datasets. Its performance, however, degrades notably in indoor environments because of the lack of textures. Without rich textures, the photometric consistency is too weak to train a good depth network. Inspired by early works on indoor modeling, we leverage the structural regularities exhibited in indoor scenes to train a better depth network. Specifically, we adopt two extra supervisory signals for self-supervised training: 1) the Manhattan normal constraint and 2) the co-planar constraint. The Manhattan normal constraint enforces the major surfaces (the floor, ceiling, and walls) to be aligned with dominant directions. The co-planar constraint requires that 3D points within the same planar region be well fitted by a plane. To generate the supervisory signals, we adopt two components that classify the major surface normals into dominant directions and detect the planar regions on the fly during training. As the predicted depth becomes more accurate over training epochs, the supervisory signals also improve and in turn feed back to yield a better depth model. Extensive experiments on indoor benchmark datasets show that our network outperforms state-of-the-art methods. The source code is available at https://github.com/SJTU-ViSYS/StructDepth .


1 Introduction

Figure 1: Our self-supervised monocular depth learning leverages the structural regularities of indoor environments for training. The aligned normal (with Manhattan directions) and the planar regions provide extra losses in training and lead to better 3D structures at inference.

Inferring a dense 3D map from a single image is a challenging problem that lacked satisfactory solutions until the boom of deep neural networks. With deep convolutional neural networks (CNNs), we can predict accurate depth from a single image by training the network with large amounts of ground-truth depth labels. The recent self-supervised learning paradigm does not require ground-truth depth, yet still obtains high-quality results on benchmark datasets by using photometric consistency as the major supervisory signal. Nevertheless, when existing self-supervised methods are trained on indoor images, the quality of depth estimation degrades notably [53][3]. The main reason is the lack of textures in indoor images. Unlike outdoor scenes, indoor scenes are full of texture-less regions, such as white walls, ceilings, and floors. Without rich textures, the photometric loss becomes too weak to train a good depth model. Seeking stronger or extra supervisory signals is therefore necessary for training a better depth network.

There have been a few attempts. In [53], an optical-flow field, propagated from sparse SURF [1] correspondences by a self-supervised network, is used to guide training on texture-less regions. Another attempt [50] is to use image patches instead of individual pixels to compute the photometric loss and to apply extra constraints to the depth within planar regions extracted from image segmentation. Though these attempts improve the results, they do not fully exploit the structural regularities present in indoor environments, a valuable source of information for 3D learning. The structural regularities, known as the Manhattan-world model [6], describe scenes as consisting of major planes aligned with dominant directions. This simple yet effective high-level prior leads to much better performance in many vision tasks, such as indoor modeling [16][17][5], visual SLAM [52][12][44], and visual-inertial odometry [56], but has not been applied to monocular depth learning.

In this work, we propose to apply the high-level prior of indoor structural regularities to self-supervised depth estimation, as shown in Fig. 1. Specifically, we adopt two extra supervisory signals for training: 1) the Manhattan normal constraint and 2) the co-planar constraint. The Manhattan normal constraint enforces the major surfaces (the floor, ceiling, and walls) to be aligned with dominant directions. The co-planar constraint requires that 3D points within the same planar region be well fitted by a plane. We add two extra components to the training process. The first is Manhattan normal detection: it classifies the major surface normals, computed from the depth predicted by the network, into the directions associated with the vanishing points using an adaptive thresholding scheme. The second is planar region detection: we fuse the color and the geometric information derived from the depth and apply a classic segmentation algorithm to extract planar regions. During training, the two components incorporate the estimated depth to produce supervisory signals on the fly. Though these signals may be noisy in early epochs because of inaccurate depth, they gradually improve as the depth quality improves, and in turn benefit the depth estimation.

We conduct experiments on the indoor benchmark datasets NYUv2 [40], ScanNet [7], and InteriorNet [29]. The results show that our method outperforms existing state-of-the-art methods. Our main contributions are as follows:

1) A novel learning pipeline for self-supervised depth estimation that leverages the structural regularities of indoor environments. To the best of our knowledge, this has not been presented in previous work.

2) Two novel components providing extra supervisory signals on the fly during the training process. Our components can be used to train a multi-task network including depth estimation, normal estimation, and planar region detection in a self-supervised manner, although the latter two tasks serve to train a better depth model in our current implementation.

3) We set a new state-of-the-art in self-supervised indoor depth estimation.

Figure 2: Our self-supervised monocular depth learning pipeline, which consists of three major components: a) DepthNet: the neural network trained to predict depth from a single image. b) Manhattan normal detection: it classifies the surface normals estimated from the depth prediction into dominant directions. c) Planar region detection: both color and geometric information are used to extract planar regions by a graph-based segmentation. The planar region detection is kept up to date with the improved depth across training iterations. Two extra losses, the Manhattan normal loss and the co-planar loss, are used to train the network, as indicated by the red dotted arrows.

2 Related Work

Monocular depth estimation.

Depth estimation from a single image is an ill-posed problem that is notoriously hard to solve. Since the pioneering works [10, 9] employed convolutional neural networks (CNNs) to regress depth directly, many CNN-based monocular depth estimation methods have been proposed [32, 26, 24, 43, 15], producing impressively accurate results on benchmark datasets. Most of them are supervised methods that require ground-truth depth data for training.

Self-supervised depth learning without ground-truth depth has emerged as a promising alternative, since acquiring ground-truth depth at a large scale is challenging. Image appearance was first introduced in [19] to replace ground-truth depth as the supervisory signal for training a depth network: one image in a stereo pair was warped to the other view using the predicted depth, and the difference between the synthesized image and the real image, i.e. the photometric error, was then used for supervision. The idea was further extended to monocular settings [54][19]. Through careful design of network architectures [20], loss functions [39], and online refinement [4], self-supervised approaches obtain impressive results on benchmark datasets.

Despite achieving impressive performance on outdoor datasets such as KITTI [18] and Make3D [37], existing self-supervised methods perform poorly on indoor datasets. The reason is that indoor scenes are full of texture-less regions, such as white walls and ceilings, making the photometric loss too weak to supervise depth learning. Zhou et al. [53] adopted an optical-flow-based training paradigm supervised by the flow field from an optical flow network, initialized from sparse SURF [1] correspondences. The recent work [50] employed more discriminative patches instead of individual pixels to compute the photometric loss, and also applied a piece-wise planar prior to depth learning by assuming that homogeneous-color regions are planar. Though these approaches improve the performance, they do not fully exploit the structural prior of the environments. In addition, the planar-region assumption in [50] does not hold for distinct planes with the same color, e.g. mutually perpendicular white walls; it therefore produces false planar regions that deteriorate the depth model.

Planar region detection.

Powerful planar-region detectors [30][45][51] have been proposed recently and show high-quality results on complex indoor images, but these CNN-based detectors require a huge number of plane labels for training and are not suited to the self-supervised learning scheme. Though detecting planes in an image is challenging, the task becomes much easier when the depth is available [36][23]. Here, we detect planar regions using a classic graph-based segmentation approach [11], similar to [50], while additionally employing the geometric information extracted from the depth estimated on the fly during training. Though the depth may not be precise initially, it gradually improves as training progresses, and so does the segmentation. With the additional geometric information, our approach avoids false planar regions that are indistinguishable by color and produces less over-segmentation on texture-rich planar regions.

Structural regularities in indoor environments.

Indoor scenes exhibit strong structural regularities, which can be described by the “Manhattan world” model: the scene can be decomposed into major planes whose normal vectors are mutually orthogonal. These structural regularities are valuable priors that have been applied to a wide range of indoor 3D vision tasks, such as visual SLAM [52][12][44], visual-inertial odometry [56], and mapping [16][17][5]. In fact, exploiting the structural prior of indoor scenes was probably the only geometric way to infer 3D information from a single image in the early days [8][27]. It is natural to expect that structural regularities should also benefit learning-based vision tasks in indoor environments.

Wang et al. [41] proposed using vanishing points and lines to train a surface normal estimator that achieves state-of-the-art performance. Our work adopts a similar spirit but differs in that our major task is depth estimation, where the surface normal is only an intermediate result that serves better training. In addition, our depth network is trained in a fully self-supervised manner and does not require a line map as extra input. To the best of our knowledge, our work is the first to incorporate the structural regularities of indoor environments into self-supervised monocular depth estimation.

3 Method

Our self-supervised depth learning pipeline is illustrated in Fig. 2. It consists of three major components. The first one is the depth network, which takes a single image as the input and predicts a depth map. We use the same architecture as in [50] for the depth network. Based on the predicted depth, the other two components, Manhattan normal detection and planar region detection, are used to produce the supervisory signals leveraging the structural prior of indoor environments. Manhattan normal detection aligns the normal computed from the depth map with the dominant orientations, estimated from the vanishing points in the image. Planar region detection applies a graph-based segmentation to detect the planar regions with the combination of color, normal, and plane-to-origin distance information. Both Manhattan normal detection and planar region detection may be inaccurate in the initial training epochs, but they will improve in later epochs as the depth prediction becomes better. The improved supervisory signals lead to a better depth prediction as well.

In the following sections, we describe how we apply the Manhattan normal constraint and the co-planar constraint in our training process.

3.1 Manhattan normal constraint

Dominant direction extraction. The structural regularities of indoor environments imply that most indoor scenes contain planar surfaces aligned with dominant directions. The dominant directions can be estimated from the structural lines in the image: the intersection of a set of parallel structural lines in the image is a vanishing point. Let $\mathbf{v}_i$ be a vanishing point extracted from the 2D image. One of the dominant directions in the camera coordinate system is computed as

$$\mathbf{d}_i = \frac{K^{-1}\tilde{\mathbf{v}}_i}{\left\lVert K^{-1}\tilde{\mathbf{v}}_i \right\rVert}, \tag{1}$$

where $\mathbf{d}_i$ is a unit vector representing this dominant direction, $\tilde{\mathbf{v}}_i$ is the homogeneous coordinate of $\mathbf{v}_i$, and $K$ is the camera intrinsic matrix. Note that only two vanishing points are needed to obtain all the dominant directions, since the third dominant direction can be obtained by the cross product. We apply the 2-Line searching method [33] to extract the dominant directions from the image. The dominant direction extraction is done only once before training.

Both the extracted directions and their reverse directions are considered to be the possible normal directions of the major planes in the scene, such as the ceiling, the floor, and the walls.
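As a concrete illustration, the sketch below (not the authors' code; the function name, the synthetic intrinsics, and the example vanishing points are ours) builds the six candidate normal directions of Eq. (1) from two detected vanishing points. In the paper the vanishing points come from the 2-Line searching method [33].

```python
# Minimal sketch: candidate Manhattan normal directions from two vanishing points.
import numpy as np

def dominant_directions(vps, K):
    """vps: (2, 2) array of vanishing points in pixels; K: (3, 3) intrinsics."""
    K_inv = np.linalg.inv(K)
    dirs = []
    for vp in vps:
        v_h = np.array([vp[0], vp[1], 1.0])      # homogeneous vanishing point
        d = K_inv @ v_h
        dirs.append(d / np.linalg.norm(d))        # Eq. (1): unit dominant direction
    # The third dominant direction follows from the cross product.
    d3 = np.cross(dirs[0], dirs[1])
    dirs.append(d3 / np.linalg.norm(d3))
    dirs = np.stack(dirs)
    # Both the directions and their reverses are candidate surface normals.
    return np.concatenate([dirs, -dirs], axis=0)  # (6, 3)

# Example with a synthetic intrinsic matrix and two made-up vanishing points.
K = np.array([[519.0, 0.0, 320.0], [0.0, 519.0, 240.0], [0.0, 0.0, 1.0]])
cand = dominant_directions(np.array([[620.0, 250.0], [15.0, 230.0]]), K)
print(cand.shape)  # (6, 3)
```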

Surface normal estimation. To estimate the surface normal, we first obtain the 3D coordinates of each pixel $x$ from the predicted depth by

$$\mathbf{P}(x) = D(x)\, K^{-1} \tilde{x}, \tag{2}$$

where $D(x)$ denotes the depth predicted by the depth network and $\tilde{x}$ is the homogeneous coordinate of pixel $x$. Next, we adopt a differentiable point-to-normal layer [46, 47, 22] to estimate the surface normal from the 3D points. Specifically, the normal $N(x)$ of a given pixel $x$ is calculated from the set of 3D points within a small neighborhood centered on $\mathbf{P}(x)$. The neighborhood size is set following the previous work [46].
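A minimal, differentiable sketch of Eq. (2) and a point-to-normal layer is given below, assuming a (B, 1, H, W) depth tensor and a 3x3 intrinsic matrix. The central-difference cross neighborhood used here is our simplification; the paper follows [46], whose exact neighborhood may differ.

```python
import torch
import torch.nn.functional as F

def backproject(depth, K):
    """Eq. (2): lift every pixel to a 3D point using the predicted depth."""
    B, _, H, W = depth.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()  # (3, H, W)
    rays = torch.linalg.inv(K) @ pix.reshape(3, -1)                  # K^-1 x~
    return depth * rays.reshape(1, 3, H, W)                          # (B, 3, H, W)

def point_to_normal(points):
    """Cross product of two local tangent vectors approximates the normal."""
    du = F.pad(points[:, :, :, 2:] - points[:, :, :, :-2], (1, 1, 0, 0))
    dv = F.pad(points[:, :, 2:, :] - points[:, :, :-2, :], (0, 0, 1, 1))
    return F.normalize(torch.cross(du, dv, dim=1), dim=1)

depth = torch.rand(1, 1, 8, 10) + 0.5
K = torch.tensor([[519.0, 0.0, 5.0], [0.0, 519.0, 4.0], [0.0, 0.0, 1.0]])
normals = point_to_normal(backproject(depth, K))
print(normals.shape)  # torch.Size([1, 3, 8, 10])
```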

Manhattan normal detection. Given the surface normal prediction $N(x)$, we propose Manhattan normal detection to classify the surface normals that belong to the dominant planes. Our strategy is to compare the estimated normal vector with each candidate dominant direction $\mathbf{d}_i$ using the cosine similarity and choose the one with the highest similarity, namely

$$\bar{N}(x) = \underset{\mathbf{d}_i \in \mathcal{D}}{\arg\max}\; \langle N(x), \mathbf{d}_i \rangle, \tag{3}$$

where $\bar{N}(x)$ is the aligned normal, $\mathcal{D}$ is the set of candidate directions, and the cosine similarity is defined as $\langle \mathbf{a}, \mathbf{b} \rangle = \mathbf{a}^{\top}\mathbf{b} / (\lVert\mathbf{a}\rVert\, \lVert\mathbf{b}\rVert)$. Let the maximum similarity of pixel $x$ be $s(x)$. We define the Manhattan mask as

$$M(x) = \begin{cases} 1, & s(x) > \tau \\ 0, & \text{otherwise}, \end{cases} \tag{4}$$

where $M(x)=1$ and $M(x)=0$ represent Manhattan and non-Manhattan regions, respectively.

During training, we use an adaptive thresholding scheme for detecting the Manhattan regions. We initially set a relatively small threshold to allow more pixels to be classified into the Manhattan region despite inaccurate normal estimates, and gradually increase the threshold as the normal estimates become more accurate in later epochs. In our implementation, the threshold grows linearly with the iteration number $t$: $\tau = \tau_0 + k\,t$, where the initial threshold $\tau_0$ and the growth rate $k$ are fixed hyperparameters.

Manhattan normal loss. We apply the Manhattan normal constraint within the Manhattan region by using the aligned normal obtained in (3) as the supervisory signal. The constraint enforces the estimated normal to be as close to the aligned normal as possible, which is described by a loss function $L_{norm}$:

$$L_{norm} = \frac{1}{N_M} \sum_{x} M(x)\, \Pi(x)\, \big\lVert N(x) - \bar{N}(x) \big\rVert_1, \tag{5}$$

where $N_M$ is the number of pixels located in Manhattan regions, and $\Pi(x)$ indicates whether pixel $x$ lies in the planar regions, whose detection we introduce in the following section.
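The following PyTorch sketch (ours, not the released implementation) ties Eqs. (3)-(5) together: align each predicted normal with the most similar candidate direction, build the Manhattan mask with an adaptive threshold, and compute an L1-style Manhattan normal loss. The schedule constants in `adaptive_tau` and the L1 form of the loss are placeholders matching the reconstructed equations, not the paper's exact hyperparameters.

```python
import torch
import torch.nn.functional as F

def manhattan_align(normals, cand_dirs, tau):
    """normals: (B, 3, H, W); cand_dirs: (M, 3) float tensor of unit directions."""
    n = F.normalize(normals, dim=1)
    sims = torch.einsum("bchw,mc->bmhw", n, cand_dirs)   # cosine similarities
    best_sim, best_idx = sims.max(dim=1)                  # Eq. (3): best candidate
    aligned = cand_dirs[best_idx].permute(0, 3, 1, 2)     # aligned normal N_bar
    mask = (best_sim > tau).float().unsqueeze(1)          # Eq. (4): Manhattan mask
    return aligned, mask

def manhattan_normal_loss(normals, aligned, manhattan_mask, plane_mask):
    m = manhattan_mask * plane_mask                        # Manhattan AND planar pixels
    return (m * (normals - aligned).abs()).sum() / m.sum().clamp(min=1.0)  # Eq. (5)

def adaptive_tau(step, tau0=0.9, growth=1e-5, tau_max=0.99):
    # Placeholder schedule: the threshold grows linearly with the iteration number.
    return min(tau0 + growth * step, tau_max)
```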

3.2 Co-planar constraint

Figure 3: The pipeline of planar region detection. Both the color and geometric information are used to compute the dissimilarity for planar region segmentation. The color dissimilarity is calculated by comparing the RGB colors. The geometry dissimilarity is the sum of the normal and the plane-to-origin distance dissimilarities. Based on the proposed dissimilarity, a graph-based segmentation [11] is applied to extract the planar regions.
Figure 4: The proposed planar region detection during training. From the left to the right columns: the input images, the ground-truth depth, the estimated depth, the dissimilarity map, and the planar regions detected using only colors [50] versus our method based on both color and geometric information. First row: two walls cannot be distinguished by colors but can be separated by our method. Second row: the floor is over-segmented when using only colors but is correctly detected by our method.

Planar region detection. To enforce the co-planar constraint, we need to detect the piece-wise planar regions correctly. Previous work [50] detects planar regions by assuming that regions with homogeneous colors are planar. This simple strategy, however, often leads to false detections or over-segmentation, producing false supervisory signals. We propose a novel planar region detection method, as shown in Fig. 3, which integrates both the color and the online-updated geometric information to extract planar areas more reliably.

The key idea is to adopt a novel dissimilarity map in the subsequent graph-based segmentation. This dissimilarity takes the color, the normal, and the plane-to-origin distance into consideration. We use the aligned normal to derive the dissimilarity instead of the estimated normal, since we found the latter to be too noisy. Let the 3D coordinates of a pixel $x$ be $\mathbf{P}(x)$, and suppose this 3D point lies in a plane whose normal is the aligned normal $\bar{N}(x)$. The plane-to-origin distance is computed as

$$\rho(x) = \bar{N}(x)^{\top} \mathbf{P}(x). \tag{6}$$

Let $y$ be an adjacent pixel of $x$. The normal dissimilarity between them is defined as the Euclidean distance between the two aligned normal vectors:

$$w_n(x, y) = \big\lVert \bar{N}(x) - \bar{N}(y) \big\rVert_2. \tag{7}$$

Denoting the minimum and maximum dissimilarities among all adjacent pixel pairs by $w_{\min}$ and $w_{\max}$ respectively, we define an operator to normalize the dissimilarity via

$$\mathrm{norm}(w) = \frac{w - w_{\min}}{w_{\max} - w_{\min}}. \tag{8}$$

The plane-to-origin distance dissimilarity is defined as

$$w_d(x, y) = \big| \rho(x) - \rho(y) \big|. \tag{9}$$

The geometric dissimilarity combines the normalized versions of the two dissimilarities as

$$w_g(x, y) = \mathrm{norm}\big(w_n(x, y)\big) + \mathrm{norm}\big(w_d(x, y)\big). \tag{10}$$

The color dissimilarity is computed as

$$w_c(x, y) = \big\lVert I(x) - I(y) \big\rVert_2, \tag{11}$$

where $I(x)$ and $I(y)$ are the RGB colors. Finally, we obtain the dissimilarity combining both the color and the geometric information by

$$w(x, y) = \mathrm{norm}\big(w_c(x, y)\big) + w_g(x, y). \tag{12}$$

Based on this dissimilarity, we apply the graph-based segmentation [11] and filter out small areas to obtain the planar regions, following [50]. The advantage of such a dissimilarity definition can be seen in Fig. 4. Compared with using only color information, our method avoids false planar regions that cannot be distinguished by color and also reduces over-segmentation caused by color variation within a plane.

Note that our planar region segmentation is updated during training. As the training progresses, the gradually improved depth leads to better segmentation and vice versa.
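For illustration, here is a rough NumPy sketch of the dissimilarity in Eqs. (6)-(12) for horizontally adjacent pixels (vertical edges are built the same way). The normalization and combination follow the reconstructed equations above; function and variable names are ours, and the resulting edge weights would feed the graph-based segmentation of [11].

```python
import numpy as np

def edge_dissimilarity(rgb, points, aligned_normals):
    """rgb: (H, W, 3) colors; points: (H, W, 3) 3D points; aligned_normals: (H, W, 3)."""
    def norm01(w):                                    # Eq. (8): min-max normalization
        return (w - w.min()) / (w.max() - w.min() + 1e-8)

    rho = np.sum(aligned_normals * points, axis=-1)   # Eq. (6): plane-to-origin distance
    # Dissimilarities between each pixel and its right neighbor.
    w_n = np.linalg.norm(aligned_normals[:, 1:] - aligned_normals[:, :-1], axis=-1)  # Eq. (7)
    w_d = np.abs(rho[:, 1:] - rho[:, :-1])                                           # Eq. (9)
    w_g = norm01(w_n) + norm01(w_d)                                                  # Eq. (10)
    w_c = np.linalg.norm(rgb[:, 1:] - rgb[:, :-1], axis=-1)                          # Eq. (11)
    return norm01(w_c) + w_g                                                         # Eq. (12)

H, W = 6, 8
w = edge_dissimilarity(np.random.rand(H, W, 3), np.random.rand(H, W, 3),
                       np.random.rand(H, W, 3))
print(w.shape)  # (6, 7): one weight per horizontal edge
```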

Figure 5: Visualization of the NYUv2 results, best viewed by zooming in on screen. The depth results are in the left columns, and the surface normal results are in the right columns. The results of Monodepth2 [20], PNet [50], and the ground-truth depth / normal are presented for comparison. Compared with PNet [50] and Monodepth2 [20], our method obtains better surface normal and depth estimation, as indicated by the red rectangles. Please refer to Tab. 1 and Tab. 2 for the quantitative results.
Method RMS↓ AbsRel↓ Log10↓ δ₁↑ δ₂↑ δ₃↑
Supervised methods
Hu et al. (2019) [21] 0.530 0.115 0.050 86.6 97.5 99.3
Yin et al. (2019) [48] 0.416 0.108 0.048 87.5 97.6 99.4
AdaBins (2021) [2] 0.364 0.103 0.044 90.3 98.4 99.7
Niklaus et al. (2019) [34] 0.300 0.080 0.030 94.0 99.0 100.0
Supervised methods with plane detection
PlaneNet (2018) [31] 0.514 0.142 0.060 81.2 95.7 98.9
PlaneReg (2019) [51] 0.503 0.134 0.057 82.7 96.3 99.0
Self-supervised methods
MovingIndoor (2019) [53] 0.712 0.208 0.086 67.4 90.0 96.8
Monodepth2 (2019) [20] 0.600 0.161 0.068 77.1 94.8 98.7
PNet (2020) [50] 0.561 0.150 0.064 79.6 94.8 98.6
Ours (self-supervised)
Ours 0.540 0.142 0.060 81.3 95.4 98.8
Ours pp 0.534 0.140 0.060 81.7 95.5 98.8

Table 1: Depth estimation results on the NYUv2 dataset. The first two blocks list supervised methods, the second of which contains supervised methods with plane detection. The third and fourth blocks list self-supervised methods. ↓ indicates lower is better, ↑ indicates higher is better; δᵢ denotes the accuracy under threshold 1.25ⁱ. "pp" means with post-processing as in [19]. Our approach performs best among the self-supervised methods.

Generate the co-planar depth. After detecting the planar regions, we invoke the co-planar constraint to flatten the 3D points located within those planar regions. The first step is plane fitting for the 3D points within a planar region. As in previous work [28, 50], we obtain the plane parameters $\mathbf{A}$ by solving the least-squares problem

$$\min_{\mathbf{A}} \big\lVert \mathbf{A}^{\top} \mathbf{P} - \mathbf{1}^{\top} \big\rVert_2^2, \tag{13}$$

where each column of $\mathbf{P}$ represents a 3D point within the planar region. After that, the inverse depth of pixel $x$ induced by the plane fit is computed as

$$\frac{1}{D_{plane}(x)} = \mathbf{A}^{\top} K^{-1} \tilde{x}, \tag{14}$$

where $K$ represents the camera intrinsic matrix. We then transform the inverse depth to depth with maximum and minimum protection following [19, 20, 50].

Co-planar loss. The depth obtained from plane fitting is then used as an extra signal to constrain the estimated depth. The loss function is defined as

$$L_{plane} = \frac{1}{N_P} \sum_{x \in \Omega_P} \big| D(x) - D_{plane}(x) \big|, \tag{15}$$

where $N_P$ is the number of pixels within the planar regions $\Omega_P$.
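A minimal PyTorch sketch of Eqs. (13)-(15) is shown below, assuming the plane parameterization $\mathbf{A}^{\top}\mathbf{P} = 1$ used in [28, 50]. The clamping bounds in `plane_depth` are placeholders for the paper's maximum/minimum protection, and the helper names are ours.

```python
import torch

def fit_plane(points):
    """points: (3, N) 3D points of one planar region. Solves min ||A^T P - 1||^2 (Eq. 13)."""
    P = points.t()                                          # (N, 3)
    ones = torch.ones(P.shape[0], 1)
    return torch.linalg.lstsq(P, ones).solution.squeeze(1)  # A: (3,)

def plane_depth(A, K, H, W, d_min=0.1, d_max=10.0):
    """Eq. (14): inverse depth induced by the fitted plane, then clamped to depth."""
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).float().reshape(3, -1)
    inv_depth = (A @ torch.linalg.inv(K) @ pix).reshape(H, W)
    return (1.0 / inv_depth.clamp(min=1.0 / d_max)).clamp(d_min, d_max)

def coplanar_loss(pred_depth, plane_fit_depth, plane_mask):
    """Eq. (15): L1 difference between predicted and plane-induced depth."""
    m = plane_mask.float()
    return (m * (pred_depth - plane_fit_depth).abs()).sum() / m.sum().clamp(min=1.0)
```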

3.3 Total loss

We use image patches instead of individual pixels to compute the photometric loss, as suggested in [50]. It is defined as the combination of an L1 loss and a structural similarity (SSIM) [55] loss:

$$L_{ph} = \sum_{x} \frac{\alpha}{2}\Big(1 - \mathrm{SSIM}\big(I_t(\Omega_x), I_{t'}(\Omega_x)\big)\Big) + (1-\alpha)\,\big\lVert I_t(\Omega_x) - I_{t'}(\Omega_x) \big\rVert_1, \tag{16}$$

where $\Omega_x$ denotes the local window surrounding pixel $x$, $I_t$ and $I_{t'}$ are the target and the synthesized images, and $\alpha$ is the relative weight of the two parts, set the same as in previous work [20]. We also adopt the edge-aware smoothness loss

$$L_{smooth} = \sum_{x} \big|\partial_u d^{*}(x)\big|\, e^{-\lvert \partial_u I(x) \rvert} + \big|\partial_v d^{*}(x)\big|\, e^{-\lvert \partial_v I(x) \rvert}, \tag{17}$$

where $d^{*}$ is the mean-normalized inverse depth, and $\partial_u$, $\partial_v$ are the gradients along the horizontal and vertical image directions. The overall loss is defined as

$$L = L_{ph} + \lambda_1 L_{smooth} + \lambda_2 L_{norm} + \lambda_3 L_{plane}, \tag{18}$$

where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are set to 0.001, 0.05, and 0.1, respectively.
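To make the combination concrete, here is a hedged sketch of Eqs. (17) and (18). The photometric, normal, and plane losses are assumed to be computed elsewhere (e.g. by the snippets above), and the mapping of the weights 0.001 / 0.05 / 0.1 onto the smoothness, normal, and plane terms follows the reconstructed Eq. (18), not a verified configuration.

```python
import torch

def smoothness_loss(disp, image):
    """Edge-aware smoothness (Eq. 17) on the mean-normalized inverse depth."""
    d = disp / (disp.mean(dim=(2, 3), keepdim=True) + 1e-7)
    dx = (d[:, :, :, 1:] - d[:, :, :, :-1]).abs()
    dy = (d[:, :, 1:, :] - d[:, :, :-1, :]).abs()
    ix = (image[:, :, :, 1:] - image[:, :, :, :-1]).abs().mean(1, keepdim=True)
    iy = (image[:, :, 1:, :] - image[:, :, :-1, :]).abs().mean(1, keepdim=True)
    return (dx * torch.exp(-ix)).mean() + (dy * torch.exp(-iy)).mean()

def total_loss(l_photo, l_smooth, l_norm, l_plane,
               w_smooth=0.001, w_norm=0.05, w_plane=0.1):
    """Eq. (18): weighted sum of the four training losses."""
    return l_photo + w_smooth * l_smooth + w_norm * l_norm + w_plane * l_plane
```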

Method Mean↓ 11.25°↑ 22.5°↑ 30°↑
Surface normal estimation networks
3DP (2013) [13] 33.0 18.8 40.7 52.4
Fouhey et al. (2014) [14] 35.2 40.5 54.1 58.9
Wang et al. (2015) [42] 28.8 35.2 57.1 65.5
Eigen et al. (2015) [9] 23.7 39.2 62.0 71.1
Surface normal computed from the depth (supervised)
GeoNet (2018) [35] 36.8 15.0 34.5 46.7
DORN (2018) [15] 36.6 15.7 36.5 49.4
Surface normal computed from the depth (self-supervised)
MovingIndoor (2019) [53] 43.5 10.2 26.8 37.9
Monodepth2 (2019) [20] 45.1 10.4 27.3 37.6
PNet (2020) [50] 36.6 15.0 36.7 49.0
Ours 34.5 21.9 44.4 55.2
Ours pp 34.2 22.6 44.7 55.4

Table 2: Surface normal estimation results on NYUv2. The first block reports dedicated surface normal estimation networks. The second and third blocks report normals computed from the depth predicted by supervised and self-supervised depth networks, respectively; the normal computation is the same for all methods. Mean is the mean angular error in degrees, and 11.25°, 22.5°, and 30° are the percentages of pixels with angular error below those thresholds. Our method outperforms existing monocular depth estimation methods in surface normal estimation.

4 Experimental results

We train our model on the NYUv2 dataset [40] using the same data split as previous work [53][50], and evaluate our method on the NYUv2 [40], ScanNet [7], and InteriorNet [29] datasets. We detect the vanishing points on the training images and skip 18 image sequences for which no valid vanishing points can be detected. This results in 21465 monocular training sequences and 654 images for validation. Each monocular training sequence consists of five frames. Our network adopts the same architecture as [50].

We compare our method with state-of-the-art monocular depth estimation methods. Apart from depth estimation, we also evaluate the performance of surface normal estimation, and present ablation studies on the effectiveness of the proposed supervisory signals and on different network architectures. More results can be found in the supplementary material.

4.1 Implementation details

The network is trained for a total of 50 epochs with a batch size of 32, starting from the pre-trained model of [50]. We use the Adam optimizer and a multi-step learning rate schedule: the initial learning rate is decayed by a factor of 0.1 at the 26th and 36th epochs. We perform random flipping and color augmentation during training. All images are first undistorted and cropped by 16 pixels from the border, and then rescaled to the training resolution. The camera intrinsic parameters come from the official specification [40] and are adjusted to be consistent with the image cropping and scaling. We follow the same evaluation criteria as [20, 50]: we cap the depth to a maximum value and use the median scaling strategy to resolve the scale ambiguity of monocular depth estimation. The evaluation metrics include the root mean squared error (RMS), the absolute relative error (AbsRel), the mean log10 error (Log10), and the accuracies under thresholds $\delta < 1.25$, $1.25^2$, and $1.25^3$.
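The evaluation protocol above can be summarized in a short sketch (ours, not the released evaluation code). The 10 m depth cap is an assumption commonly used for NYUv2; the exact cap value is not stated here.

```python
import numpy as np

def evaluate_depth(pred, gt, cap=10.0):
    """Median scaling followed by the standard depth metrics."""
    pred = pred * np.median(gt) / np.median(pred)           # resolve scale ambiguity
    pred = np.clip(pred, 1e-3, cap)
    gt = np.clip(gt, 1e-3, cap)
    rms = np.sqrt(np.mean((pred - gt) ** 2))
    absrel = np.mean(np.abs(pred - gt) / gt)
    log10 = np.mean(np.abs(np.log10(pred) - np.log10(gt)))
    ratio = np.maximum(pred / gt, gt / pred)
    acc = [np.mean(ratio < 1.25 ** i) for i in (1, 2, 3)]    # delta accuracies
    return rms, absrel, log10, acc
```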

4.2 Results on NYUv2 Dataset

Depth estimation. The quantitative results of depth estimation are listed in Tab. 1. Our method outperforms MovingIndoor [53] and PNet [50], the state-of-the-art self-supervised methods for indoor monocular depth estimation, by a large margin, and even surpasses some supervised approaches. The depth estimation results are visualized in Fig. 5, where our method recovers more accurate indoor structures and smoother planes than existing methods.

Surface normal estimation. We also evaluate surface normal estimation, as shown in Tab. 2. Our method outperforms existing self-supervised methods as well as some supervised methods [13, 35, 15]. Qualitative results are also shown in Fig. 5.

4.3 Results on ScanNet and InteriorNet

Figure 6: ScanNet results with the trained model on NYUv2. The holes in the ground truth are excluded from evaluation.
Method RMS↓ AbsRel↓ Log10↓ δ₁↑ δ₂↑ δ₃↑
Monodepth2 [20] 0.451 0.191 0.080 69.3 92.6 98.3
PNet [50] 0.420 0.175 0.074 74.0 93.2 98.2
PNet-finetune 0.412 0.172 0.073 74.3 93.5 98.4
Ours 0.400 0.165 0.070 75.4 93.9 98.5

Table 3: ScanNet results with the trained model on NYUv2.

We use the model trained only on NYUv2 to evaluate how our method generalizes to other indoor datasets. ScanNet [7] is captured with a depth camera attached to an iPad and contains around 2.5M RGB-D frames from 1513 scenes. We use the test split proposed by [50], which includes 533 images. The evaluation results are shown in Tab. 3 and Fig. 6. InteriorNet [29] is a synthetic dataset of indoor video sequences containing millions of well-designed interior layouts and furniture and object models. Because there is no official train/test split of InteriorNet for depth estimation, we randomly selected 540 images from the HD7 data of the full dataset as test images. The evaluation results are shown in Tab. 4 and Fig. 7.

Although ScanNet and InteriorNet have not been used for training, the results show that our method still generalizes well and outperforms existing methods.

Figure 7: InteriorNet results with the trained model on NYUv2.
Method RMS↓ AbsRel↓ Log10↓ δ₁↑ δ₂↑ δ₃↑
Monodepth2 [20] 0.817 0.368 0.124 58.6 81.5 89.8
PNet [50] 0.737 0.346 0.115 64.2 83.3 90.2
PNet-finetune 0.736 0.340 0.114 64.4 83.3 90.3
Ours 0.715 0.330 0.111 66.0 84.0 90.5

Table 4: InteriorNet results with the trained model on NYUv2.
Method RMS↓ AbsRel↓ Log10↓ δ₁↑ δ₂↑ δ₃↑
PNet [50] 0.561 0.150 0.064 79.6 94.8 98.6
PNet-finetune 0.555 0.147 0.062 80.4 95.2 98.7
Coplanar-only 0.548 0.144 0.061 80.8 95.3 98.8
Normal-only 0.543 0.143 0.061 81.0 95.5 98.9
Ours (full) 0.540 0.142 0.060 81.3 95.4 98.8

Table 5: Ablation study on different supervisory signals. We evaluate the performance using only the Manhattan normal constraint (Normal-only), using only the co-planar constraint (Coplanar-only), and the proposed method (Ours (full)). We also present the result of the fine-tuned PNet model (PNet-finetune). All models were trained for the same number of epochs for a fair comparison.
Train RMS↓ AbsRel↓ Log10↓ δ₁↑ δ₂↑ δ₃↑
Using the Monodepth2 [20] architecture
Original 0.600 0.161 0.068 77.1 94.8 98.7
Original-finetune 0.598 0.159 0.067 77.5 94.9 98.7
Ours 0.564 0.151 0.065 79.1 95.0 98.8
Using the PNet [50] architecture
Original 0.561 0.150 0.064 79.6 94.8 98.6
Original-finetune 0.555 0.147 0.062 80.4 95.2 98.7
Ours 0.540 0.142 0.060 81.3 95.4 98.8

Table 6: Ablation study on different network architectures. Our extra training losses improve both models, indicating that our method generalizes across architectures.

4.4 Ablation study

To better understand the effectiveness of each part of our method, we perform an ablation study by varying the components of our model on the NYUv2 dataset. We initialize the network with the pre-trained model of [50] and train it with the proposed supervisory signals. The results are shown in Tab. 5. Either the Manhattan normal loss or the co-planar loss alone already leads to better depth estimation than the original and fine-tuned baselines, and combining them yields the largest gain.

We also test our method with different network architectures. As shown in Tab. 6, both models are improved by the proposed supervisory signals, indicating that our method generalizes across network architectures. The results based on Monodepth2 are worse than those based on PNet, largely because the patch-based photometric loss of PNet handles texture-less regions better, as suggested in [50].

4.5 Planar-region detection in training

We show the intermediate planar region detection results during training in Fig. 8. The planar region segmentation gradually improves with the updated depth and normal estimates. By contrast, the color-only method produces false planar regions, as indicated by the red rectangles.

Figure 8: First row: The planar regions detected by the color-only method [50]. Bottom rows: The estimated depth, surface normal and segmentation results at different epochs on NYUv2. Our segmentation results gradually improve as the training progresses.

5 Limitation

We discuss the limitations of our method. The first limitation is that extracting dominant directions relies heavily on the Manhattan-world assumption; it may not work well in indoor scenes with irregular layouts containing slanted planes. Possible solutions include using a relaxed version of the Manhattan-world assumption as in [38][56], or directly using the estimated direction from each detected vanishing point to derive the normal constraint, i.e. not restricting the dominant directions to be mutually perpendicular. The second limitation is sensitivity to low-quality initial depth. As our planar region detection relies on depth information, low depth quality deteriorates the segmentation results and generates false supervisory signals, which in turn prevent the network from converging to a good model. Our remedy is to use a pre-trained depth model, or to train the model only with the photometric and smoothness losses in early epochs. Designing a better planar region detector that tolerates low-quality initial depth estimates remains an open problem.

6 Conclusion

In this paper, we propose to leverage the structural regularities of indoor environments for self-supervised monocular depth estimation. Two extra losses, the Manhattan normal loss and the co-planar loss, are used to supervise depth learning. These supervisory signals are generated on the fly during training by Manhattan normal detection and planar region detection. Our method achieves state-of-the-art results on indoor benchmark datasets.

References

  • [1] H. Bay, T. Tuytelaars, and L. Van Gool (2006) SURF: speeded up robust features. In Proceedings of the European Conference on Computer Vision, pp. 404–417.
  • [2] S. F. Bhat, I. Alhashim, and P. Wonka (2021) AdaBins: depth estimation using adaptive bins. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4009–4018.
  • [3] J. Bian, H. Zhan, N. Wang, T. Chin, C. Shen, and I. Reid (2020) Unsupervised depth learning in challenging indoor video: weak rectification to rescue. arXiv preprint arXiv:2006.02708.
  • [4] Y. Chen, C. Schmid, and C. Sminchisescu (2019) Self-supervised learning with geometric constraints in monocular video: connecting flow, depth, and camera. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7063–7072.
  • [5] A. Concha, M. W. Hussain, L. Montano, and J. Civera (2014) Manhattan and piecewise-planar constraints for dense monocular mapping. In Robotics: Science and Systems.
  • [6] J. M. Coughlan and A. L. Yuille (1999) Manhattan world: compass direction from a single image by Bayesian inference. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Vol. 2, pp. 941–947.
  • [7] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner (2017) ScanNet: richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 5828–5839.
  • [8] E. Delage, H. Lee, and A. Y. Ng (2006) A dynamic Bayesian network model for autonomous 3D reconstruction from a single indoor image. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Vol. 2, pp. 2418–2428.
  • [9] D. Eigen and R. Fergus (2015) Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2650–2658.
  • [10] D. Eigen, C. Puhrsch, and R. Fergus (2014) Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems.
  • [11] P. F. Felzenszwalb and D. P. Huttenlocher (2004) Efficient graph-based image segmentation. International Journal of Computer Vision 59 (2), pp. 167–181.
  • [12] A. Flint, C. Mei, I. Reid, and D. Murray (2010) Growing semantically meaningful models for visual SLAM. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 467–474.
  • [13] D. F. Fouhey, A. Gupta, and M. Hebert (2013) Data-driven 3D primitives for single image understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3392–3399.
  • [14] D. F. Fouhey, A. Gupta, and M. Hebert (2014) Unfolding an indoor origami world. In Proceedings of the European Conference on Computer Vision, pp. 687–702.
  • [15] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao (2018) Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 2002–2011.
  • [16] Y. Furukawa, B. Curless, S. M. Seitz, and R. Szeliski (2009) Manhattan-world stereo. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 1422–1429.
  • [17] Y. Furukawa, B. Curless, S. M. Seitz, and R. Szeliski (2009) Reconstructing building interiors from images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 80–87.
  • [18] A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 3354–3361.
  • [19] C. Godard, O. Mac Aodha, and G. J. Brostow (2017) Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 270–279.
  • [20] C. Godard, O. Mac Aodha, M. Firman, and G. J. Brostow (2019) Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3828–3838.
  • [21] J. Hu, M. Ozay, Y. Zhang, and T. Okatani (2019) Revisiting single image depth estimation: toward higher resolution maps with accurate object boundaries. In IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1043–1051.
  • [22] M. Kaneko, K. Sakurada, and K. Aizawa (2019) TriDepth: triangular patch-based deep depth prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops.
  • [23] P. Kim, B. Coltin, and H. Jin Kim (2018) Linear RGB-D SLAM for planar environments. In Proceedings of the European Conference on Computer Vision, pp. 333–348.
  • [24] S. Kim, K. Park, K. Sohn, and S. Lin (2016) Unified depth prediction and intrinsic image decomposition from a single image via joint convolutional neural fields. In Proceedings of the European Conference on Computer Vision, pp. 143–159.
  • [25] T. Koch, L. Liebel, F. Fraundorfer, and M. Korner (2018) Evaluation of CNN-based single-image depth estimation methods. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops.
  • [26] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab (2016) Deeper depth prediction with fully convolutional residual networks. In International Conference on 3D Vision, pp. 239–248.
  • [27] D. C. Lee, M. Hebert, and T. Kanade (2009) Geometric reasoning for single image structure recovery. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2136–2143.
  • [28] B. Li, D. Zou, D. Sartori, L. Pei, and W. Yu (2020) TextSLAM: visual SLAM with planar text features. In Proceedings of the IEEE International Conference on Robotics and Automation, pp. 2102–2108.
  • [29] W. Li, S. Saeedi, J. McCormac, R. Clark, D. Tzoumanikas, Q. Ye, Y. Huang, R. Tang, and S. Leutenegger (2018) InteriorNet: mega-scale multi-sensor photo-realistic indoor scenes dataset. In British Machine Vision Conference.
  • [30] C. Liu, K. Kim, J. Gu, Y. Furukawa, and J. Kautz (2019) PlaneRCNN: 3D plane detection and reconstruction from a single image. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 4450–4459.
  • [31] C. Liu, J. Yang, D. Ceylan, E. Yumer, and Y. Furukawa (2018) PlaneNet: piece-wise planar reconstruction from a single RGB image. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 2579–2588.
  • [32] F. Liu, C. Shen, G. Lin, and I. Reid (2015) Learning depth from single monocular images using deep convolutional neural fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (10), pp. 2024–2039.
  • [33] X. Lu, J. Yao, H. Li, Y. Liu, and X. Zhang (2017) 2-line exhaustive searching for real-time vanishing point estimation in Manhattan world. In IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 345–353.
  • [34] S. Niklaus, L. Mai, J. Yang, and F. Liu (2019) 3D Ken Burns effect from a single image. ACM Transactions on Graphics (TOG) 38 (6), pp. 1–15.
  • [35] X. Qi, R. Liao, Z. Liu, R. Urtasun, and J. Jia (2018) GeoNet: geometric neural network for joint depth and surface normal estimation. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 283–291.
  • [36] R. F. Salas-Moreno, B. Glocker, P. H. Kelly, and A. J. Davison (2014) Dense planar SLAM. In Proceedings of the IEEE/ACM International Symposium on Mixed and Augmented Reality, pp. 157–164.
  • [37] A. Saxena, M. Sun, and A. Y. Ng (2008) Make3D: learning 3D scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (5), pp. 824–840.
  • [38] G. Schindler and F. Dellaert (2004) Atlanta world: an expectation maximization framework for simultaneous low-level edge grouping and camera calibration in complex man-made environments. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Vol. 1.
  • [39] C. Shu, K. Yu, Z. Duan, and K. Yang (2020) Feature-metric loss for self-supervised learning of depth and egomotion. In Proceedings of the European Conference on Computer Vision, pp. 572–588.
  • [40] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus (2012) Indoor segmentation and support inference from RGBD images. In Proceedings of the European Conference on Computer Vision, pp. 746–760.
  • [41] R. Wang, D. Geraghty, K. Matzen, R. Szeliski, and J. Frahm (2020) VPLNet: deep single view normal estimation with vanishing points and lines. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 689–698.
  • [42] X. Wang, D. Fouhey, and A. Gupta (2015) Designing deep networks for surface normal estimation. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 539–547.
  • [43] D. Wofk, F. Ma, T. Yang, S. Karaman, and V. Sze (2019) FastDepth: fast monocular depth estimation on embedded systems. In Proceedings of the IEEE International Conference on Robotics and Automation, pp. 6101–6108.
  • [44] S. Yang, Y. Song, M. Kaess, and S. Scherer (2016) Pop-up SLAM: semantic monocular plane SLAM for low-texture environments. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1222–1229.
  • [45] Z. Yang, P. Wang, Y. Wang, W. Xu, and R. Nevatia (2018) Every pixel counts: unsupervised geometry learning with holistic 3D motion understanding. In Proceedings of the European Conference on Computer Vision Workshops.
  • [46] Z. Yang, P. Wang, Y. Wang, W. Xu, and R. Nevatia (2018) LEGO: learning edge with geometry all at once by watching videos. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 225–234.
  • [47] Z. Yang, P. Wang, W. Xu, L. Zhao, and R. Nevatia (2018) Unsupervised learning of geometry from videos with edge-aware depth-normal consistency. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
  • [48] W. Yin, Y. Liu, C. Shen, and Y. Yan (2019) Enforcing geometric constraints of virtual normal for depth prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5684–5693.
  • [49] W. Yin, J. Zhang, O. Wang, S. Niklaus, L. Mai, S. Chen, and C. Shen (2021) Learning to recover 3D scene shape from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 204–213.
  • [50] Z. Yu, L. Jin, and S. Gao (2020) PNet: patch-match and plane-regularization for unsupervised indoor depth estimation. In Proceedings of the European Conference on Computer Vision.
  • [51] Z. Yu, J. Zheng, D. Lian, Z. Zhou, and S. Gao (2019) Single-image piece-wise planar 3D reconstruction via associative embedding. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 1029–1037.
  • [52] H. Zhou, D. Zou, L. Pei, R. Ying, P. Liu, and W. Yu (2015) StructSLAM: visual SLAM with building structure lines. IEEE Transactions on Vehicular Technology 64 (4), pp. 1364–1375.
  • [53] J. Zhou, Y. Wang, K. Qin, and W. Zeng (2019) Moving indoor: unsupervised video depth learning in challenging environments. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8618–8627.
  • [54] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe (2017) Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 1851–1858.
  • [55] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612.
  • [56] D. Zou, Y. Wu, L. Pei, H. Ling, and W. Yu (2019) StructVIO: visual-inertial odometry with structural regularity of man-made environments. IEEE Transactions on Robotics 35 (4), pp. 999–1013.

1 Extra qualitative results

We include additional qualitative results on the NYUv2, ScanNet, and InteriorNet datasets. Fig. S2 shows the 3D structure recovered from the estimated depth. Fig. S3 and Fig. S4 illustrate the depth and surface normal estimation results. These results show that our method achieves more accurate depth estimation and produces more accurate 3D structures than existing methods.

Figure S3: Qualitative results on NYUv2. The top rows show the depth results and the bottom rows show the surface normal results. The results of Monodepth2 [20], PNet [50], our method, and the ground-truth depth / normal are presented for comparison. Compared with PNet [50] and Monodepth2 [20], our method obtains better surface normal estimation and depth prediction, as indicated by the red rectangles.
Figure S4: Qualitative results on the ScanNet and InteriorNet datasets. The top rows show the depth results and the bottom rows show the surface normal results. The results of Monodepth2 [20], PNet [50], ours, and the ground-truth depth / normal are presented for comparison. Compared with PNet [50] and Monodepth2 [20], our method obtains better surface normal estimation and depth prediction, as indicated by the red rectangles. Our models were trained on the NYUv2 dataset.

2 Outdoor tests

We present the results of our method on the KITTI dataset, which is captured in outdoor scenes. We use the same training split of 44234 images as Monodepth2 [20]. We first detect the vanishing points on the training images and skip 335 images for which no valid vanishing points can be detected, leaving 39500 image sequences for training and 4397 image sequences for validation. Other dataset preprocessing settings are consistent with [20]. We train for 17 epochs with a batch size of 16; the initial learning rate is decayed after 15 epochs. Results are shown in Tab. 1.

Train RMS↓ AbsRel↓ SqRel↓ δ₁↑ δ₂↑ δ₃↑
Using the Monodepth2 architecture
Original 4.863 0.115 0.903 87.7 95.9 98.1
Original-finetune 4.882 0.117 0.894 87.2 95.8 98.0
Ours 4.850 0.120 0.906 87.0 95.8 98.1
Using the PNet architecture
Original 5.008 0.121 0.964 86.6 95.4 97.9
Original-finetune 5.041 0.121 0.996 86.7 95.4 97.8
Ours 4.969 0.120 0.941 86.7 95.5 97.9

Table 1: Outdoor tests using different network architectures on KITTI dataset.

From the results in Tab. 1, we can see that the Monodepth2 [20] architecture achieves better performance than PNet on KITTI. This is largely because outdoor environments are rich in texture: the well-designed Monodepth2 works well in such scenes, while the strategies adopted in PNet are more suitable for indoor scenes, as discussed in [50]. Though our method does not improve the Monodepth2-based model much, the effectiveness of our extra structural losses is evident with the PNet architecture.

The depth, surface normal, and plane detection results of our method on the KITTI dataset are shown in Fig. S1. Note that the detected planar regions are mostly located on the road, where the textures are rich enough to supervise a good depth. This may be the major reason why our extra losses did not help much within the Monodepth2 training pipeline. Another reason may be that the extracted dominant directions are not strictly mutually perpendicular in outdoor scenes, leading to large surface normal errors.

3 Plane quality tests

We evaluate the plane quality on iBims-1 [25] following [49]. All models are trained on the NYUv2 dataset for the same number of epochs for a fair comparison (pre-training epochs are also counted for our method). From the results in Tab. 2, our method, as expected, produces the best plane quality among the self-supervised methods, even though P2Net also adopts co-planar losses. The improvement is largely due to the global constraint from the Manhattan normal loss. However, all self-supervised methods produce low structure quality, especially at depth edges, compared with supervised methods, indicating that great effort is still required to improve self-supervised depth learning in indoor scenes.

Method ε_plan↓ ε_orie↓ ε_DBE-acc↓ ε_DBE-comp↓ AbsRel↓
Supervised
Wei Yin et al. [49] 1.90 5.73 2.0 7.41 0.079
Self-supervised
Monodepth2 [20] 4.455 68.127 12.160 30.924 0.220
PNet [50] 4.922 67.833 10.823 28.783 0.241
PNet-finetune 4.628 49.926 10.322 28.750 0.232
Ours 4.611 67.828 9.669 27.215 0.227

Table 2: iBims-1 results with models trained on NYUv2.