Inferring a dense 3D map from a single image was a challenging problem without satisfactory solutions until the rise of deep neural networks. With deep convolutional neural networks (CNNs), accurate depth can be predicted from a single image by training the network with large amounts of ground-truth depth labels. The recent self-supervised learning paradigm does not require ground-truth depth, yet still obtains high-quality results on benchmark datasets by using photometric consistency as the major supervisory signal. Nevertheless, when existing self-supervised methods are trained on indoor images, the quality of depth estimation degrades notably. The main reason is the lack of texture in indoor images. Unlike outdoor scenes, indoor scenes are full of texture-less regions, such as white walls, ceilings, and floors. Without rich textures, the photometric loss becomes too weak to train a good depth model. Seeking stronger or additional supervisory signals is therefore necessary for training a better depth network.
There have been a few attempts. One approach  uses an optical-flow field, propagated from sparse SURF correspondences by a self-supervised network, to guide training on texture-less regions. Another  computes the photometric loss over image patches instead of individual pixels and applies extra constraints to the depth within planar regions extracted by image segmentation. Though these attempts improve the results, they do not fully exploit the structural regularities present in indoor environments, a valuable source of information for 3D learning. These structural regularities, known as the Manhattan-world model, state that the scene consists of major planes aligned with a few dominant directions. This simple yet effective high-level prior has led to much better performance in many vision tasks, such as indoor modeling, visual SLAM, and visual-inertial odometry, but has not been applied to monocular depth learning.
In this work, we propose to apply the high-level prior of indoor structural regularities to self-supervised depth estimation, as shown in Fig. 1. Specifically, we adopt two extra supervisory signals for training: 1) the Manhattan normal constraint and 2) the co-planar constraint. The Manhattan normal constraint enforces the major surfaces (the floor, ceiling, and walls) to be aligned with the dominant directions. The co-planar constraint requires that 3D points located within the same planar region be well fitted by a plane. We add two extra components to the training process. The first is Manhattan normal detection, which classifies the major surface normals, computed from the depth predicted by the network, into the directions associated with the vanishing points using an adaptive thresholding scheme. The second is planar region detection, which fuses color with geometric information derived from the depth and applies a classic segmentation algorithm to extract planar regions. During training, the two components incorporate the estimated depth to produce supervisory signals on the fly. Though these signals may be noisy in early epochs because of inaccurate depth, they gradually improve with the depth quality, and in turn benefit the depth estimation.
We conduct experiments on the indoor benchmark datasets: NYU-v2 , ScanNet, and InteriorNet. The results show that our method outperforms the existing state-of-the-art methods. Our main contributions are as follows:
1) A novel learning pipeline for self-supervised depth estimation leveraging structural regularities of indoor environments. To our best knowledge, this has not been presented in previous work.
2) Two novel components providing extra supervisory signals on the fly during the training process. Our components can be used to train a multi-task network including depth estimation, normal estimation, and planar region detection in a self-supervised manner, although the latter two tasks serve to train a better depth model in our current implementation.
3) We set a new state-of-the-art in self-supervised indoor depth estimation.
2 Related Work
Monocular depth estimation.
Depth estimation from a single image is an ill-posed problem that is notoriously hard to solve. Since the pioneering works [10, 9] employed convolutional neural networks (CNNs) to regress depth directly, many CNN-based monocular depth estimation methods have been proposed [32, 26, 24, 43, 15], producing impressively accurate results on benchmark datasets. Most of them are supervised methods that require ground-truth depth data for training.
Self-supervised depth learning without ground-truth depth has emerged as a promising alternative, since acquiring ground-truth depth at scale is challenging. Image appearance was first introduced in  to replace ground-truth depth as the supervisory signal for training a depth network: one image in a stereo pair is warped to the other view using the predicted depth, and the difference between the synthesized and real images, i.e. the photometric error, is used for supervision. The idea was later extended to monocular settings . Through careful design of network architectures [39] and online refinement , self-supervised approaches obtain impressive results on benchmark datasets.
Despite achieving impressive performance on outdoor datasets such as KITTI and Make3D, existing self-supervised methods perform poorly on indoor datasets. The reason is that indoor scenes are full of texture-less regions, such as white walls and ceilings, which make the photometric loss too weak to supervise depth learning. Zhou et al. adopted an optical-flow-based training paradigm supervised by the flow field of an optical flow network, initialized from sparse SURF  correspondences. A recent work  employed more discriminative patches instead of individual pixels to compute the photometric loss, and also applied a piece-wise planar prior to depth learning by assuming that regions of homogeneous color are planar. Though these approaches improve performance, they do not fully exploit the structural prior of the environments. In addition, the planar-region assumption in  fails for distinct planes of the same color, e.g. mutually perpendicular white walls, leading to false planar regions that deteriorate the depth model.
Planar region detection.
Powerful planar-region detectors  have been proposed recently and have shown high-quality results on complex indoor images. However, those CNN-based detectors require a huge number of plane labels for training and are not suited to the self-supervised learning scheme. Although detecting planes in an image is challenging, the task becomes much easier when depth is available. Here, we detect planar regions using a classic graph-based segmentation approach  similar to , while employing additional geometric information extracted from the depth estimated on the fly during training. Though the depth may not be precise initially, it gradually improves as training progresses, and the segmentation improves with it. With the additional geometric information, our approach avoids false planar regions that are indistinguishable by color and produces less over-segmentation on texture-rich planar regions.
Structural regularities in indoor environments.
Indoor scenes exhibit strong structural regularities, which can be described by the “Manhattan world” model: the scene can be decomposed into major planes whose normal vectors are mutually orthogonal. These structural regularities are valuable priors that have been applied to a wide range of indoor 3D vision tasks, such as vSLAM, VIO, and mapping. In fact, exploiting the structural prior of indoor scenes was probably the only geometric way to infer 3D information from a single image in the early days. It is natural to expect that structural regularities should also benefit learning-based vision tasks in indoor environments.
Wang et al.  proposed to use vanishing points and lines to train a surface normal estimator, achieving state-of-the-art performance. Our work adopts a similar spirit but differs in that our major task is depth estimation, with the surface normal serving only as an intermediate result for better training. In addition, our depth network is trained in a fully self-supervised manner and does not require a line map as extra input. To the best of our knowledge, our work is the first to incorporate the structural regularities of indoor environments into self-supervised monocular depth estimation.
Our self-supervised depth learning pipeline is illustrated in Fig. 2. It consists of three major components. The first one is the depth network, which takes a single image as the input and predicts a depth map. We use the same architecture as in  for the depth network. Based on the predicted depth, the other two components, Manhattan normal detection and planar region detection, are used to produce the supervisory signals leveraging the structural prior of indoor environments. Manhattan normal detection aligns the normal computed from the depth map with the dominant orientations, estimated from the vanishing points in the image. Planar region detection applies a graph-based segmentation to detect the planar regions with the combination of color, normal, and plane-to-origin distance information. Both Manhattan normal detection and planar region detection may be inaccurate in the initial training epochs, but they will improve in later epochs as the depth prediction becomes better. The improved supervisory signals lead to a better depth prediction as well.
In the following sections, we describe how we apply the Manhattan normal constraint and the co-planar constraint in our training process.
3.1 Manhattan normal constraint
Dominant direction extraction. The structural regularities of indoor environments imply that most indoor scenes contain planar surfaces aligned with a few dominant directions. The dominant directions can be estimated from the structural lines in the image: a set of parallel structural lines intersects at a vanishing point. Let $\mathbf{v}$ be a vanishing point extracted from the 2D image. The corresponding dominant direction in the camera coordinate system is computed as
$$\mathbf{d} = \frac{K^{-1}\tilde{\mathbf{v}}}{\|K^{-1}\tilde{\mathbf{v}}\|},$$
where $\mathbf{d}$ is a unit vector representing this dominant direction, $\tilde{\mathbf{v}}$ is the homogeneous coordinate of $\mathbf{v}$, and $K$ is the camera intrinsic matrix. Note that we need only two vanishing points to obtain all the dominant directions, since the third direction can be obtained by the cross product of the first two. We apply the 2-Line searching method  to extract the dominant directions from the image. The dominant direction extraction is done only once before training.
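As a concrete sketch, the back-projection of a vanishing point to a dominant direction takes only a few lines of NumPy; the intrinsics and pixel coordinates below are made-up values for illustration:

```python
import numpy as np

def dominant_direction(v_uv, K):
    """Back-project a 2D vanishing point to a unit 3D direction: d = K^-1 v / ||K^-1 v||."""
    v_h = np.array([v_uv[0], v_uv[1], 1.0])   # homogeneous vanishing point
    d = np.linalg.solve(K, v_h)               # K^-1 v
    return d / np.linalg.norm(d)

# Illustrative intrinsics (focal length 500, principal point (320, 240)).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
d1 = dominant_direction((320.0, 240.0), K)    # vanishing point at the principal point
d2 = dominant_direction((820.0, 240.0), K)
d3 = np.cross(d1, d2)                         # third direction from the cross product
d3 /= np.linalg.norm(d3)
```

Note that the cross product yields a direction orthogonal to the first two, which coincides with the third Manhattan direction only when the two vanishing points truly correspond to orthogonal directions.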
Both the extracted directions and their reverse directions are considered to be the possible normal directions of the major planes in the scene, such as the ceiling, the floor, and the walls.
Surface normal estimation. To estimate the surface normal, we first obtain the 3D coordinates of each pixel $p$ from the predicted depth by
$$P_p = D(p)\,K^{-1}\tilde{p}.$$
Here, $D(p)$ denotes the depth predicted by the depth network and $\tilde{p}$ is the homogeneous coordinate of $p$. Next, we adopt a differentiable point-to-normal layer [46, 47, 22] to estimate the surface normal from the 3D points. Specifically, the normal $\mathbf{n}_p$ of a given pixel $p$ is calculated from the set of 3D points within a small neighborhood centered on the point $P_p$; the neighborhood size in our implementation follows previous work.
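The back-projection, together with a simplified stand-in for the point-to-normal layer, can be sketched as follows; the cross-product-of-differences normal is an illustrative simplification of the differentiable least-squares layer, not the paper's exact implementation:

```python
import numpy as np

def backproject(depth, K):
    """Lift a depth map to 3D points P = D(p) * K^-1 * [u, v, 1]^T."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(float)  # (H, W, 3)
    rays = pix @ np.linalg.inv(K).T
    return rays * depth[..., None]

def normals_from_points(P):
    """Per-pixel normals from finite differences of neighboring 3D points:
    the cross product of the two local tangent vectors, normalized to unit length."""
    dx = P[:, 1:, :] - P[:, :-1, :]   # horizontal neighbor differences
    dy = P[1:, :, :] - P[:-1, :, :]   # vertical neighbor differences
    n = np.cross(dx[:-1, :, :], dy[:, :-1, :])
    return n / (np.linalg.norm(n, axis=-1, keepdims=True) + 1e-8)
```

For a fronto-parallel plane of constant depth, every estimated normal points along the optical axis, as expected.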
Manhattan normal detection. Given the surface normal prediction $\mathbf{n}_p$, we propose Manhattan normal detection to classify the surface normals that belong to the dominant planes. Our strategy is to compare the estimated normal vector with each dominant direction $\mathbf{d}_i$ using the cosine similarity and choose the one with the highest similarity, namely
$$\mathbf{n}_p^* = \arg\max_{\mathbf{d}_i}\,\langle \mathbf{n}_p, \mathbf{d}_i \rangle, \qquad (3)$$
where $\mathbf{n}_p^*$ is the aligned normal and the cosine similarity is defined as $\langle \mathbf{a}, \mathbf{b} \rangle = \mathbf{a}^\top\mathbf{b} / (\|\mathbf{a}\|\,\|\mathbf{b}\|)$. Let the maximum similarity at pixel $p$ be $s_p$. We define the Manhattan mask as
$$M_p = \begin{cases} 1, & s_p > \tau, \\ 0, & \text{otherwise}, \end{cases}$$
where $M_p = 1$ and $M_p = 0$ represent Manhattan and non-Manhattan regions, respectively.
During training, we use an adaptive thresholding scheme for detecting the Manhattan regions. We initially set a relatively small threshold to allow more pixels to be classified into the Manhattan region despite inaccurate normal estimates, and gradually increase the threshold as the normal estimates become more accurate in later epochs. In our implementation, the threshold grows linearly with the iteration number $t$: $\tau = \tau_0 + \alpha t$, where $\tau_0$ and $\alpha$ are fixed hyper-parameters.
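A minimal sketch of the detection and the adaptive threshold; the values `tau0`, `rate`, and `tau_max` are illustrative, not the paper's hyper-parameters:

```python
import numpy as np

def manhattan_detect(normals, dirs, tau):
    """Align each unit normal with the closest dominant direction or its reverse;
    pixels whose best cosine similarity exceeds tau form the Manhattan mask."""
    cands = np.concatenate([dirs, -dirs], axis=0)   # candidate normal directions
    sim = normals @ cands.T                         # cosine similarity (unit vectors)
    aligned = cands[sim.argmax(axis=-1)]
    mask = sim.max(axis=-1) > tau
    return aligned, mask

def adaptive_tau(step, tau0=0.7, rate=1e-5, tau_max=0.95):
    """Threshold growing linearly with the training iteration, capped at tau_max."""
    return min(tau0 + rate * step, tau_max)
```

Including the reversed directions lets, e.g., both the floor and the ceiling snap to the same vertical axis.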
Manhattan normal loss. We apply the Manhattan normal constraint within the Manhattan region by using the aligned normal obtained in (3) as the supervisory signal. The constraint enforces the estimated normal to be as close as possible to the aligned normal, described by the loss function
$$L_{norm} = \frac{1}{N_M}\sum_p M_p\,\Pi_p\,\|\mathbf{n}_p - \mathbf{n}_p^*\|,$$
where $N_M$ is the number of pixels located in Manhattan regions, and $\Pi_p$ indicates whether pixel $p$ lies in a planar region; we describe how to detect the planar regions in the following section.
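As a sketch, the masked normal loss might look like the following; the L1 distance is an assumption, since the text only requires the estimated normal to be close to the aligned one:

```python
import numpy as np

def manhattan_normal_loss(n_est, n_aligned, mask):
    """Mean per-pixel L1 distance between estimated and aligned normals,
    averaged over the masked (Manhattan and planar) pixels only."""
    m = mask.astype(float)
    return (m[..., None] * np.abs(n_est - n_aligned)).sum() / (m.sum() + 1e-8)
```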
3.2 Co-planar constraint
Planar region detection. To enforce the co-planar constraint, we need to detect the piece-wise planar regions correctly. Previous work  detects planar regions by assuming that regions with homogeneous colors are planar. This simple strategy, however, often leads to false detections or over-segmentation, producing false supervisory signals. We propose a novel planar region detection method, shown in Fig. 3, which integrates the color and the online-updated geometric information to extract planar regions more reliably.
The key idea is to adopt a novel dissimilarity map in the subsequent graph-based segmentation. This dissimilarity takes the color, the normal, and the plane-to-origin distance into consideration. We use the aligned normal to derive the dissimilarity instead of the estimated normal, since we found the latter too noisy. Let the 3D coordinates of pixel $p$ be $P_p$. Suppose this 3D point lies in a plane whose normal is the aligned normal $\mathbf{n}_p^*$. The plane-to-origin distance is computed as
$$d_p = \mathbf{n}_p^{*\top} P_p.$$
Let $q$ be an adjacent pixel of $p$. The normal dissimilarity between them is defined as the Euclidean distance between the two aligned normals:
$$D_{norm}(p,q) = \|\mathbf{n}_p^* - \mathbf{n}_q^*\|_2.$$
Denoting the minimum and maximum dissimilarities among all pairs of adjacent pixels by $D_{\min}$ and $D_{\max}$ respectively, we define an operator to normalize the dissimilarity via
$$\mathcal{N}(D) = \frac{D - D_{\min}}{D_{\max} - D_{\min}}.$$
The plane-to-origin distance dissimilarity is defined as
$$D_{dist}(p,q) = |d_p - d_q|.$$
The geometric dissimilarity combines the normalized versions of the two dissimilarities as
$$D_{geo}(p,q) = \mathcal{N}(D_{norm}(p,q)) + \mathcal{N}(D_{dist}(p,q)).$$
The color dissimilarity is computed as
$$D_{color}(p,q) = \|\mathbf{c}_p - \mathbf{c}_q\|_2,$$
where $\mathbf{c}_p$ and $\mathbf{c}_q$ are the RGB colors of the two pixels. Finally, we obtain the overall dissimilarity combining both the color and geometric information by
$$D(p,q) = D_{color}(p,q) + D_{geo}(p,q).$$
Based on the dissimilarity, we apply the graph-based segmentation  and filter out small areas to obtain the planar regions following . The advantage of using such a dissimilarity definition can be seen in Fig. 4. Comparing with using only the color information, our method avoids false planar regions that cannot be distinguished by colors and also over-segmentation caused by different colors.
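A per-edge sketch of the combined dissimilarity; the plain additive combination and the absence of extra weights are illustrative assumptions:

```python
import numpy as np

def normalize01(d, d_min, d_max):
    """Min-max normalize a dissimilarity value given the extremes over all edges."""
    return (d - d_min) / (d_max - d_min + 1e-8)

def edge_dissimilarity(c_p, c_q, n_p, n_q, dist_p, dist_q,
                       norm_rng=(0.0, 1.0), dist_rng=(0.0, 1.0)):
    """Edge weight between adjacent pixels p and q combining RGB color,
    aligned-normal, and plane-to-origin distance cues."""
    d_color = np.linalg.norm(np.asarray(c_p, float) - np.asarray(c_q, float))
    d_norm = np.linalg.norm(np.asarray(n_p, float) - np.asarray(n_q, float))
    d_dist = abs(dist_p - dist_q)
    d_geo = normalize01(d_norm, *norm_rng) + normalize01(d_dist, *dist_rng)
    return d_color + d_geo
```

Two same-colored but perpendicular walls get a large weight from the normal term, so the graph-based segmentation no longer merges them.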
Note that our planar region segmentation is updated during training. As training progresses, the gradually improved depth leads to better segmentation and vice versa.
Generating the co-planar depth. After detecting the planar regions, we invoke the co-planar constraint to flatten the 3D points located within those regions. The first step is fitting a plane to the 3D points within each planar region. Following previous work [28, 50], we obtain the plane parameters $\boldsymbol{\pi}$ by solving the least squares problem
$$\min_{\boldsymbol{\pi}}\;\|A^\top \boldsymbol{\pi} - \mathbf{1}\|^2,$$
where each column of $A$ represents a 3D point within the planar region. After that, the inverse depth of pixel $p$ implied by the plane fitting is computed as
$$\frac{1}{D^{plane}(p)} = \boldsymbol{\pi}^\top K^{-1}\tilde{p}.$$
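The plane fit with the $\boldsymbol{\pi}^\top P = 1$ parameterization and the implied inverse depth can be sketched as follows (the identity intrinsics in the usage below are illustrative):

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane parameters pi with pi^T P = 1; rows of `points` are 3D points."""
    pi, *_ = np.linalg.lstsq(points, np.ones(len(points)), rcond=None)
    return pi

def planar_inverse_depth(pi, K, u, v):
    """Inverse depth implied by the fitted plane at pixel (u, v): 1/d = pi^T K^-1 [u, v, 1]^T."""
    return pi @ np.linalg.solve(K, np.array([u, v, 1.0]))
```

Since $P = D(p) K^{-1}\tilde{p}$, substituting into $\boldsymbol{\pi}^\top P = 1$ directly yields the inverse-depth formula above.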
Co-planar loss. The depth $D^{plane}(p)$ obtained from plane fitting is then used as an extra signal to constrain the estimated depth. The loss function is defined as
$$L_{plane} = \frac{1}{N_P}\sum_{p \in \mathcal{P}} \bigl|D(p) - D^{plane}(p)\bigr|,$$
where $N_P$ is the number of pixels within the planar regions $\mathcal{P}$.
3.3 Total loss
Following , we use the patch-based photometric loss
$$L_{photo} = \frac{1}{N}\sum_{p} \frac{\alpha}{2}\bigl(1-\mathrm{SSIM}(I(\Omega_p), I'(\Omega_p))\bigr) + (1-\alpha)\,\bigl\|I(\Omega_p) - I'(\Omega_p)\bigr\|_1,$$
where $\Omega_p$ denotes the local window surrounding pixel $p$, $I'$ is the view-synthesized image, and $\alpha$ is the relative weight of the two parts, set the same as in previous work. We also adopt the edge-aware smoothness loss
$$L_{smooth} = \sum_p |\partial_x d_p^*|\,e^{-|\partial_x I_p|} + |\partial_y d_p^*|\,e^{-|\partial_y I_p|},$$
where $d_p^*$ is the mean-normalized inverse depth, and $\partial_x$, $\partial_y$ are the gradients along the $x$ and $y$ directions. The overall loss is defined as
$$L = L_{photo} + \lambda_1 L_{smooth} + \lambda_2 L_{norm} + \lambda_3 L_{plane},$$
where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are set to 0.001, 0.05, and 0.1, respectively.
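The weighted combination itself is a one-liner; which weight pairs with which loss term is an assumption here, inferred from the order in which the weights are listed:

```python
def total_loss(l_photo, l_smooth, l_norm, l_planar,
               w_smooth=0.001, w_norm=0.05, w_planar=0.1):
    """Weighted sum of the photometric, smoothness, Manhattan normal,
    and co-planar losses."""
    return l_photo + w_smooth * l_smooth + w_norm * l_norm + w_planar * l_planar
```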
|Method|Mean↓|11.25°↑|22.5°↑|30°↑|
|Surface normal estimation networks| | | | |
|Fouhey et al. (2014)|35.2|40.5|54.1|58.9|
|Wang et al. (2015)|28.8|35.2|57.1|65.5|
|Eigen et al. (2015)|23.7|39.2|62.0|71.1|
|Surface normal computed from the depth| | | | |
4 Experimental results
We train our model on the NYUv2 dataset  using the same data split as previous work , and evaluate our method on the NYUv2, ScanNet, and InteriorNet datasets. We detect vanishing points on the training images and skip the 18 image sequences for which no valid vanishing points are detected. This results in 21465 monocular training sequences and 654 images for validation; each monocular training sequence consists of five frames. Our network adopts the same architecture as .
We compare our method with the state-of-the-art methods of monocular depth estimation. Apart from depth estimation, we also evaluate the performance of surface normal estimation, and present ablation studies about the effectiveness of the proposed supervisory signals, and using different network architectures. More results can be found in the supplementary material.
4.1 Implementation details
The network is trained for a total of 50 epochs with a batch size of 32, initialized from the pre-trained model . We use the Adam optimizer and a multi-step learning-rate schedule, decaying the initial learning rate by a factor of 0.1 at the 26th and 36th epochs. We perform random flipping and color augmentation during training. All images are first undistorted and cropped by 16 pixels from the border, and then scaled for training. The camera intrinsic parameters come from the official specification  and are adjusted to be consistent with the image cropping and scaling. We follow the same evaluation criteria as [20, 50]: we cap the predicted depth and use the median scaling strategy to avoid the scale ambiguity of monocular depth estimation. The evaluation metrics include root mean squared error (RMS), absolute relative error (AbsRel), mean log10 error (Log10), and the accuracy under threshold $\delta < 1.25^i$.
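These metrics with median scaling can be sketched as follows; the 10 m depth cap is the common indoor convention and an assumption here:

```python
import numpy as np

def depth_metrics(pred, gt, cap=10.0):
    """RMS, AbsRel, Log10, and delta < 1.25 accuracy with median scaling."""
    pred = pred * np.median(gt) / np.median(pred)   # resolve monocular scale ambiguity
    pred = np.clip(pred, 1e-3, cap)
    gt = np.clip(gt, 1e-3, cap)
    rms = np.sqrt(np.mean((pred - gt) ** 2))
    absrel = np.mean(np.abs(pred - gt) / gt)
    log10 = np.mean(np.abs(np.log10(pred) - np.log10(gt)))
    delta1 = np.mean(np.maximum(pred / gt, gt / pred) < 1.25)
    return rms, absrel, log10, delta1
```

Because of the median scaling, a prediction that is correct only up to a global scale still achieves zero error.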
4.2 Results on NYUv2 Dataset
Depth estimation. The quantitative results of depth estimation are listed in Tab. 1. The results show that our method outperforms MovingIndoor and PNet, the state-of-the-art self-supervised methods on indoor monocular depth estimation, by a large margin. The results also show that our method surpasses some supervised approaches. The depth estimation results are visualized in Fig. 5. We can see that our method obtains more accurate indoor structures and smoother planes than existing methods.
4.3 Results on ScanNet and InteriorNet
We use the model trained only on NYUv2 to evaluate how our method generalizes to other indoor datasets. ScanNet is captured with a depth camera attached to an iPad and contains around 2.5M RGB-D frames from 1513 scenes. We use the test split proposed by , which includes 533 images. The evaluation results are shown in Tab. 3 and Fig. 6. InteriorNet is a synthetic dataset of indoor video sequences rendered from millions of well-designed interior layouts and furniture and object models. Because there is no official train/test split for depth estimation on InteriorNet, we randomly selected 540 images from the HD7 subset of the full dataset as test images. The evaluation results are shown in Tab. 4 and Fig. 7.
Although ScanNet and InteriorNet have not been used for training, the results show that our method still generalizes well and outperforms existing methods.
|Using the Monodepth2  architecture|
|Using the PNet  architecture|
4.4 Ablation study
To better understand the effectiveness of each part of our method, we perform an ablation study by varying the components of our model on the NYUv2 dataset. We initialize the network with the pre-trained model  and train it with the proposed supervisory signals. The results are shown in Tab. 5. Either the Manhattan normal loss or the co-planar loss alone leads to better depth estimation than the original and original-finetune baselines, and incorporating both yields the largest gain.
We also test our method using different network architectures. As shown in Tab. 6, both models improve with the proposed supervisory signals, indicating that our method applies across network architectures. However, the results based on Monodepth2 are worse than those based on PNet, largely because the patch-based photometric loss handles texture-less regions better, as suggested in .
4.5 Planar-region detection in training
We show the intermediate planar region detection results during training in Fig. 8. The results show that the planar region segmentation gradually improves with the updated depth and normal estimates. By contrast, the color-only method produces false planar regions as indicated by the red rectangles.
We now discuss the limitations of our method. The first is that extracting dominant directions relies heavily on the Manhattan-world assumption, so the method may not work well in indoor scenes with irregular layouts containing slanted planes. Possible solutions include using a relaxed version of the Manhattan-world assumption as in , or directly using the estimated direction from each detected vanishing point to derive the normal constraint; in other words, the dominant directions would no longer be restricted to be mutually perpendicular. The second limitation is that low-quality initial depth must be avoided: since our planar region detection relies on depth information, poor depth deteriorates the segmentation results and generates false supervisory signals, which in turn prevent the network from converging to a good model. Our solution is to use a pre-trained depth model, or to train the model only with the photometric and smoothness losses in early epochs. Designing a better planar region detector for low-quality initial depth estimates remains an open problem.
In this paper, we propose to leverage the structural regularities of indoor environments for self-supervised monocular depth estimation. Two extra losses, the Manhattan normal loss and the co-planar loss, are used to supervise the depth learning. These supervisory signals are generated on the fly during training by Manhattan normal detection and planar region detection. Our method achieves state-of-the-art results on indoor benchmark datasets.
- Surf: speeded up robust features. In Proceedings of the European conference on computer vision, pp. 404–417. Cited by: §1, §2.
- Adabins: depth estimation using adaptive bins. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4009–4018. Cited by: Table 1.
-  (2020) Unsupervised depth learning in challenging indoor video: weak rectification to rescue. arXiv preprint arXiv:2006.02708. Cited by: §1.
-  (2019) Self-supervised learning with geometric constraints in monocular video: connecting flow, depth, and camera. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7063–7072. Cited by: §2.
-  (2014) Manhattan and piecewise-planar constraints for dense monocular mapping. In Robotics: Science and Systems. Cited by: §1, §2.
- Manhattan world: compass direction from a single image by bayesian inference. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Vol. 2, pp. 941–947. Cited by: §1.
-  (2017) Scannet: richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 5828–5839. Cited by: §1, §4.3, §4.
- A dynamic bayesian network model for autonomous 3d reconstruction from a single indoor image. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Vol. 2, pp. 2418–2428. Cited by: §2.
-  (2015) Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2650–2658. Cited by: §2, Table 2.
-  (2014) Depth map prediction from a single image using a multi-scale deep network. In neurips, Cited by: §2.
-  (2004) Efficient graph-based image segmentation. International journal of computer vision 59 (2), pp. 167–181. Cited by: §2, Figure 3, §3.2.
-  (2010) Growing semantically meaningful models for visual slam. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 467–474. Cited by: §1, §2.
-  (2013) Data-driven 3d primitives for single image understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3392–3399. Cited by: Table 2, §4.2.
-  (2014) Unfolding an indoor origami world. In Proceedings of the European conference on computer vision, pp. 687–702. Cited by: Table 2.
-  (2018) Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 2002–2011. Cited by: §2, Table 2, §4.2.
-  (2009) Manhattan-world stereo. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 1422–1429. Cited by: §1, §2.
-  (2009) Reconstructing building interiors from images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 80–87. Cited by: §1, §2.
-  (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 3354–3361. Cited by: §2.
-  (2017) Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 270–279. Cited by: §2, §3.2, Table 1.
-  (2019) Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3828–3838. Cited by: Figure S2, Figure S3, Figure S4, §2, §2, §2, Figure 5, §3.2, §3.3, Table 1, Table 2, Table 2, §4.1, Table 3, Table 4, Table 6.
-  (2019) Revisiting single image depth estimation: toward higher resolution maps with accurate object boundaries. In IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1043–1051. Cited by: Table 1.
-  (2019) TriDepth: triangular patch-based deep depth prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pp. 0–0. Cited by: §3.1.
-  (2018) Linear rgb-d slam for planar environments. In Proceedings of the European conference on computer vision, pp. 333–348. Cited by: §2.
-  (2016) Unified depth prediction and intrinsic image decomposition from a single image via joint convolutional neural fields. In Proceedings of the European conference on computer vision, pp. 143–159. Cited by: §2.
-  (2018) Evaluation of cnn-based single-image depth estimation methods. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, pp. 0–0. Cited by: §3.
-  (2016) Deeper depth prediction with fully convolutional residual networks. In International conference on 3D vision, pp. 239–248. Cited by: §2.
-  (2009) Geometric reasoning for single image structure recovery. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2136–2143. Cited by: §2.
-  (2020) Textslam: visual slam with planar text features. In Proceedings of the IEEE International Conference on Robotics and Automation, pp. 2102–2108. Cited by: §3.2.
-  (2018) InteriorNet: mega-scale multi-sensor photo-realistic indoor scenes dataset. In British Machine Vision Conference, Cited by: §1, §4.3, §4.
-  (2019) Planercnn: 3d plane detection and reconstruction from a single image. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 4450–4459. Cited by: §2.
-  (2018) Planenet: piece-wise planar reconstruction from a single rgb image. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 2579–2588. Cited by: Table 1.
-  (2015) Learning depth from single monocular images using deep convolutional neural fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (10), pp. 2024–2039. Cited by: §2.
-  (2017) 2-line exhaustive searching for real-time vanishing point estimation in manhattan world. In IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 345–353. Cited by: §3.1.
-  (2019) 3d ken burns effect from a single image. ACM Transactions on Graphics (TOG) 38 (6), pp. 1–15. Cited by: Table 1.
-  (2018) Geonet: geometric neural network for joint depth and surface normal estimation. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 283–291. Cited by: Table 2, §4.2.
-  (2014) Dense planar slam. In Proceedings of the IEEE/ACM International Symposium on Mixed and Augmented Reality, pp. 157–164. Cited by: §2.
-  (2008) Make3d: learning 3d scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (5), pp. 824–840. Cited by: §2.
- Atlanta world: an expectation maximization framework for simultaneous low-level edge grouping and camera calibration in complex man-made environments. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. I–I. Cited by: §5.
-  (2020) Feature-metric loss for self-supervised learning of depth and egomotion. In Proceedings of the European conference on computer vision, pp. 572–588. Cited by: §2.
-  (2012) Indoor segmentation and support inference from rgbd images. In Proceedings of the European conference on computer vision, pp. 746–760. Cited by: §1, §4.1, §4.
-  (2020) Vplnet: deep single view normal estimation with vanishing points and lines. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 689–698. Cited by: §2.
-  (2015) Designing deep networks for surface normal estimation. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 539–547. Cited by: Table 2.
-  (2019) Fastdepth: fast monocular depth estimation on embedded systems. In Proceedings of the IEEE International Conference on Robotics and Automation, pp. 6101–6108. Cited by: §2.
-  (2016) Pop-up slam: semantic monocular plane slam for low-texture environments. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1222–1229. Cited by: §1, §2.
-  (2018) Every pixel counts: unsupervised geometry learning with holistic 3d motion understanding. In Proceedings of the European Conference on Computer Vision Workshops, pp. 0–0. Cited by: §2.
-  (2018) Lego: learning edge with geometry all at once by watching videos. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 225–234. Cited by: §3.1.
- Unsupervised learning of geometry from videos with edge-aware depth-normal consistency. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32. Cited by: §3.1.
-  (2019) Enforcing geometric constraints of virtual normal for depth prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5684–5693. Cited by: Table 1.
-  (2021) Learning to recover 3d scene shape from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 204–213. Cited by: Table 2, §3.
-  (2020) Pnet: patch-match and plane-regularization for unsupervised indoor depth estimation. In Proceedings of the European conference on computer vision, Cited by: Figure S2, Figure S3, Figure S4, §1, §2, §2, §2, Figure 4, Figure 5, §3.2, §3.2, §3.2, §3.3, Table 1, Table 2, Table 2, §3, Figure 8, §4.1, §4.2, §4.3, §4.4, §4.4, Table 3, Table 4, Table 5, Table 6, §4.
-  (2019) Single-image piece-wise planar 3d reconstruction via associative embedding. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 1029–1037. Cited by: §2, Table 1.
-  (2015) StructSLAM: visual slam with building structure lines. IEEE Transactions on Vehicular Technology 64 (4), pp. 1364–1375. Cited by: §1, §2.
-  (2019) Moving indoor: unsupervised video depth learning in challenging environments. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8618–8627. Cited by: §1, §1, §2, Table 1, Table 2, §4.2, §4.
-  (2017) Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 1851–1858. Cited by: §2.
-  (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612. Cited by: §3.3.
-  (2019) StructVIO: visual-inertial odometry with structural regularity of man-made environments. IEEE Transactions on Robotics 35 (4), pp. 999–1013. Cited by: §1, §2, §5.
1 Extra qualitative results
We include additional qualitative results on the NYUv2, ScanNet, and InteriorNet datasets. Fig. S2 shows the 3D structure recovered from the estimated depth. Fig. S3 and Fig. S4 illustrate the results of depth and surface normal estimation. These results show that, compared with existing methods, our method achieves more accurate depth estimation and recovers more accurate 3D structures.
2 Outdoor tests
We present the results of our method on the KITTI dataset, which is captured in outdoor scenes. We use the same training split of 44234 images as Monodepth2. We first detected vanishing points on the training images and skipped the 335 images for which no valid vanishing points could be detected. Consequently, 39500 image sequences were used for training and 4397 image sequences for validation. Other dataset preprocessing settings are consistent with . We train for 17 epochs with a batch size of 16. The initial learning rate is and drops to after 15 epochs. Results are shown in Tab. 1.
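The filtering step above can be sketched as follows. This is a minimal illustration, assuming a hypothetical detector `detect_vps` that returns the dominant vanishing points of an image or `None` on failure; the paper does not specify its detector's interface.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical detector signature (assumption): given an image path, return
# the dominant vanishing points, or None when detection fails.
VPDetector = Callable[[str], Optional[Sequence[Tuple[float, float]]]]


def filter_training_images(
    image_paths: List[str], detect_vps: VPDetector
) -> Tuple[List[str], List[str]]:
    """Split images into (kept, skipped) by vanishing-point validity."""
    kept, skipped = [], []
    for path in image_paths:
        vps = detect_vps(path)
        # Keep an image only when at least two vanishing directions are
        # found -- enough to anchor a Manhattan frame for that view.
        if vps is not None and len(vps) >= 2:
            kept.append(path)
        else:
            skipped.append(path)
    return kept, skipped
```

Images in `skipped` would simply be excluded from the training split, as done for the 335 KITTI images mentioned above.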
Tab. 1 groups the results by backbone: one set of rows uses the Monodepth2 architecture and the other uses the P2Net architecture.
From the results in Tab. 1, we can see that the Monodepth2  architecture achieves better performance than the P2Net architecture. This is largely because outdoor environments are rich in texture: the well-designed Monodepth2 works well in such scenes, whereas the strategies adopted in P2Net are tailored to indoor scenes. This has been discussed in . Although our method does not improve performance much with the Monodepth2 architecture, the gains obtained with the P2Net architecture demonstrate the effectiveness of our extra structural losses.
The depth, surface normal, and plane detection results of our method on the KITTI dataset are shown in Fig. S1. Note that the detected planar regions are mostly located on the road, where the textures are rich enough to supervise a good depth. This may be the major reason why our extra losses did not help much within the Monodepth2 training pipeline. Another reason may be that the extracted dominant directions are not strictly mutually perpendicular in outdoor scenes, leading to large surface normal errors.
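The surface normals discussed above are derived from the predicted depth. One standard construction (a sketch of the general technique, not necessarily the exact formulation used in the paper) back-projects each pixel to a 3D point using the camera intrinsics and takes the cross product of the spatial derivatives:

```python
import numpy as np


def normals_from_depth(depth, fx, fy, cx, cy):
    """Per-pixel surface normals from a depth map (H, W).

    Back-projects pixels to camera coordinates, then crosses the
    finite-difference tangent vectors along the two image axes.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Pinhole back-projection to 3D camera coordinates.
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    pts = np.stack([x, y, depth], axis=-1)          # (H, W, 3)
    # Finite differences approximate the surface tangents.
    du = np.gradient(pts, axis=1)
    dv = np.gradient(pts, axis=0)
    n = np.cross(dv, du)
    norm = np.linalg.norm(n, axis=-1, keepdims=True)
    return n / np.clip(norm, 1e-8, None)
```

For the Manhattan normal constraint, normals computed this way would be compared against the dominant directions recovered from the vanishing points; if those directions are not mutually perpendicular, the comparison is biased, as noted above.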
3 Plane quality tests
We evaluate the plane quality on iBims-1 as in . All models are trained on the NYUv2 dataset for the same number of epochs for a fair comparison (the pretraining epochs of our method are included in this count). From the results in Tab. 2, our method produces the best plane quality (the second column) as expected, even though P2Net also adopts a co-planar loss. The improvement is largely due to the global constraint imposed by the Manhattan normal loss. However, compared with supervised methods, all self-supervised methods produce low structure quality, especially at depth edges, indicating that considerable effort is still required to improve self-supervised depth learning in indoor scenes.
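One simple way to quantify plane quality is to fit a least-squares plane to the 3D points inside a detected planar region and report the RMS point-to-plane distance. This is an illustrative stand-in; the official iBims-1 planarity metrics are defined by that benchmark and are not reproduced here.

```python
import numpy as np


def plane_flatness_rmse(points):
    """RMS point-to-plane distance of an (N, 3) point set.

    The least-squares plane through the centroid has, as its normal, the
    right singular vector with the smallest singular value.
    """
    centered = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    normal = vt[-1]
    # Signed distances of each point to the fitted plane.
    dists = centered @ normal
    return float(np.sqrt(np.mean(dists ** 2)))
```

A perfectly planar region yields an error near zero; larger values indicate that the depth within the region bends away from a single plane.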