Three-Filters-to-Normal: An Accurate and Ultrafast Surface Normal Estimator

05/17/2020 · by Rui Fan, et al. · Aristotle University of Thessaloniki · The Hong Kong University of Science and Technology

Over the past decade, significant efforts have been made to improve the trade-off between the speed and accuracy of surface normal estimators (SNEs). This paper introduces an accurate and ultrafast SNE for structured range data. The proposed approach computes surface normals by simply performing three filtering operations, namely two image gradient filters (in the horizontal and vertical directions, respectively) and a mean/median filter, on an inverse depth image or a disparity image. Despite its simplicity, no similar method exists in the literature. In our experiments, we created three large-scale synthetic datasets (easy, medium and hard) using 24 3-dimensional (3D) mesh models. Each mesh model is used to generate 1800–2500 pairs of 480×640 pixel depth images and the corresponding surface normal ground truth from different views. The average angular errors with respect to the easy, medium and hard datasets are 1.6, 5.6 and 15.3 degrees, respectively. Our C++ and CUDA implementations achieve processing speeds of over 260 Hz and 21 kHz, respectively. Our proposed SNE achieves better overall performance than all other existing computer vision-based SNEs. Our datasets and source code are publicly available at sites.google.com/view/3f2n.


I Introduction

Real-time 3-dimensional (3D) object recognition is a very challenging computer vision task [3]. The surface normal is an informative and important feature descriptor used in 3D object recognition [4]. Over the past decade, there has not been much research on surface normal estimation, as it is generally considered merely an auxiliary functionality for other computer vision applications. However, such applications typically need to run online, and thus the estimation of surface normals must be carried out extremely fast [4].

Surface normals can be estimated from either a 3D point cloud or a depth/disparity image (see Figure 1). The former, such as a LiDAR point cloud, is generally unstructured. Estimating surface normals from unstructured range data usually requires the generation of an undirected graph, e.g., a k-nearest neighbor graph or a Delaunay tessellation graph. However, the generation of such graphs is very computationally intensive. Therefore, in recent years, many researchers have focused on surface normal estimation from structured range data, i.e., depth/disparity images.

The existing surface normal estimators (SNEs) can be classified as either computer vision-based [3, 4, 5, 6] or machine learning-based [7, 8, 9, 10, 11, 12, 13]. The former typically compute surface normals by fitting planar or curved surfaces to locally selected 3D point sets, using statistical analysis or optimization techniques, e.g., singular value decomposition (SVD) or principal component analysis (PCA) [4]. The latter, on the other hand, generally utilize data-driven classification/regression models, e.g., convolutional neural networks (CNNs), to infer surface normal information from RGB or depth images [12].

In recent years, with rapid advances in machine/deep learning, many researchers have resorted to deep convolutional neural networks (DCNNs) for surface normal estimation. For example, Xu et al. [7] utilized a so-called prediction-and-distillation network (PAD-Net) to simultaneously solve two continuous regression tasks (monocular depth prediction and surface normal inference) and two discrete classification tasks (scene parsing and contour detection). Similarly, Li et al. [13] designed a DCNN model to learn the mapping from multi-scale image patches to surface normals and monocular depth. These inferences were then refined using conditional random fields (CRFs) [14]. Furthermore, Bansal et al. [10] built a skip-network model based on the pre-trained Oxford VGG-16 CNN [15] for 2.5D surface normal prediction and 3D object recognition in 2D images. Recently, Huang et al. [16] formulated the problem of densely estimating local 3D canonical frames from a single RGB image as a joint estimation of surface normals, canonical tangent directions and projected tangent directions, which was then addressed by a DCNN.

The existing data-driven SNEs are generally trained using supervised learning techniques. Hence, they require a large amount of labeled training data to find the best CNN parameters [13]. Additionally, such CNNs were not specifically designed for surface normal estimation, because SNEs were only used as an auxiliary functionality for other computer vision applications, e.g., scene parsing [7], 3D object detection [9], depth perception [13], etc. Furthermore, many robotics and computer vision applications, e.g., autonomous driving, require very fast surface normal estimation (in milliseconds). Unfortunately, the existing machine/deep learning-based SNEs are not that fast. Moreover, the accuracy achieved by data-driven SNEs is still far from satisfactory (the average proportion of good pixels, detailed in Section IV, remains relatively low) [10, 13]. Most importantly, it can be considered more reasonable to estimate surface normals from point clouds or disparity/depth images rather than from RGB images. Hence, there is a strong motivation to develop a lightweight SNE for structured range data with high accuracy and speed.

The main novel contributions of this work are as follows:

a) A novel, accurate and ultrafast SNE is proposed. We implement our SNE in Matlab, C++ and CUDA. The source code will be publicly available on IEEE Xplore for research purposes. Compared with other computer vision-based SNEs, the proposed SNE greatly improves the trade-off between speed and accuracy.

b) Three datasets (easy, medium and hard) are created using 24 3D mesh models. Each mesh model is used to generate 1800–2500 depth images from different views. The corresponding surface normal ground truth is also provided, as the 3D mesh object models (rather than the objects themselves) are available for surface normal ground truth generation.

The rest of this paper is organized as follows: Section II reviews the state-of-the-art computer vision-based SNEs; Section III introduces our proposed SNE; the experimental results and the performance evaluation are provided in Section IV; in Section V, we discuss the applications of our SNE; finally, Section VI summarizes the paper and provides recommendations for future work.

II Related Work

This section provides an overview of computer vision-based SNEs.

1) PlaneSVD SNE [17]: The simplest way to estimate the surface normal $\mathbf{n} = [n_x, n_y, n_z]^\top$ of an observed 3D point $\mathbf{p} = [x, y, z]^\top$ in the camera coordinate system (CCS) is to fit a local plane:

$$n_x x + n_y y + n_z z + d = 0 \tag{1}$$

to the points in $\mathbf{Q}^+ = [\mathbf{p}, \mathbf{q}_1, \dots, \mathbf{q}_k]^\top$, where $\mathbf{Q} = [\mathbf{q}_1, \dots, \mathbf{q}_k]^\top$ is a set of $k$ neighboring points of $\mathbf{p}$. The surface normal can be estimated by solving:

$$\min_{\mathbf{n},\, d} \left\| \mathbf{A} \begin{bmatrix} \mathbf{n} \\ d \end{bmatrix} \right\|_2^2, \tag{2}$$

where $\mathbf{A} = [\,\mathbf{Q}^+ \;\; \mathbf{1}_{k+1}\,]$ and $\mathbf{1}_{k+1}$ is a $(k+1)$-entry vector of ones. (2) can be solved by factorizing $\mathbf{A}$ into $\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$ using SVD; $\hat{\mathbf{n}}$ (the optimum $[\mathbf{n}^\top, d]^\top$) is the column vector of $\mathbf{V}$ corresponding to the smallest singular value in $\boldsymbol{\Sigma}$ [4].

2) PlanePCA SNE [18]: $\mathbf{n}$ can also be estimated by removing the empirical mean $\bar{\mathbf{Q}}^+$ from $\mathbf{Q}^+$ and rearranging (2) as follows:

$$\min_{\mathbf{n}} \left\| \mathbf{M}\mathbf{n} \right\|_2^2, \tag{3}$$

where $\mathbf{M} = \mathbf{Q}^+ - \bar{\mathbf{Q}}^+$. Minimizing (3) is equivalent to performing PCA on $\mathbf{M}$ and selecting the principal component with the smallest covariance [4].
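To make this recipe concrete, the following is a minimal PlanePCA sketch (ours, for illustration only), assuming the Eigen library: the normal is the eigenvector of the neighborhood scatter matrix associated with the smallest eigenvalue, which is equivalent to minimizing (3).

```cpp
// Minimal PlanePCA sketch (our illustration, not the papers' released code).
#include <Eigen/Dense>
#include <vector>

Eigen::Vector3d planePcaNormal(const std::vector<Eigen::Vector3d>& Qplus) {
    // Empirical mean of Q+ (the point p together with its k neighbors).
    Eigen::Vector3d mean = Eigen::Vector3d::Zero();
    for (const auto& q : Qplus) mean += q;
    mean /= static_cast<double>(Qplus.size());

    // 3x3 scatter (covariance up to scale) of the mean-removed points.
    Eigen::Matrix3d cov = Eigen::Matrix3d::Zero();
    for (const auto& q : Qplus) {
        const Eigen::Vector3d d = q - mean;
        cov += d * d.transpose();
    }

    // SelfAdjointEigenSolver returns eigenvalues sorted in increasing order,
    // so column 0 is the direction of smallest variance, i.e., the plane normal.
    Eigen::SelfAdjointEigenSolver<Eigen::Matrix3d> solver(cov);
    return solver.eigenvectors().col(0).normalized();
}
```

The same routine covers PlaneSVD in practice, since the SVD of the mean-removed point matrix and the eigendecomposition of its scatter matrix yield the same smallest-variance direction.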

3) VectorSVD SNE [4]: A straightforward alternative to fitting (1) to $\mathbf{Q}^+$ is to minimize the sum of the squared dot products between $\mathbf{n}$ and the vectors $\mathbf{q}_i - \mathbf{p}$, namely,

$$\min_{\mathbf{n}} \sum_{i=1}^{k} \left( (\mathbf{q}_i - \mathbf{p})^\top \mathbf{n} \right)^2. \tag{4}$$

This minimization is also done by SVD.

4) AreaWeighted SNE [4]: A triangle can be formed by $\mathbf{p}$ and a given pair of neighboring points $\mathbf{q}_i$ and $\mathbf{q}_{i+1}$, as defined above. A general expression of the averaging-based SNEs is as follows [4]:

$$\hat{\mathbf{n}} \propto \sum_{i=1}^{k} w_i\, \mathbf{n}_i, \tag{5}$$

where $w_i$ is a weight and $\mathbf{n}_i = (\mathbf{q}_i - \mathbf{p}) \times (\mathbf{q}_{i+1} - \mathbf{p})$. In AreaWeighted SNE, the surface normal of each triangle is weighted by the magnitude of its area:

$$w_i = \frac{1}{2} \left\| (\mathbf{q}_i - \mathbf{p}) \times (\mathbf{q}_{i+1} - \mathbf{p}) \right\|_2. \tag{6}$$

5) AngleWeighted SNE [4]: The weight of each triangle relates to the angle between $\mathbf{q}_i - \mathbf{p}$ and $\mathbf{q}_{i+1} - \mathbf{p}$:

$$w_i = \cos^{-1} \left( \frac{\langle \mathbf{q}_i - \mathbf{p},\, \mathbf{q}_{i+1} - \mathbf{p} \rangle}{\|\mathbf{q}_i - \mathbf{p}\|_2\, \|\mathbf{q}_{i+1} - \mathbf{p}\|_2} \right), \tag{7}$$

where $\langle \cdot, \cdot \rangle$ is the dot product operator.
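A hedged sketch of the averaging family (5)–(7) follows (function and variable names are ours), assuming Eigen and a neighborhood that is ordered circularly around $\mathbf{p}$:

```cpp
// Sketch of averaging-based SNEs: adjacent neighbor pairs form triangles with p;
// their cross-product normals are averaged with area or angle weights.
#include <Eigen/Dense>
#include <cmath>
#include <vector>

Eigen::Vector3d weightedAverageNormal(const Eigen::Vector3d& p,
                                      const std::vector<Eigen::Vector3d>& neighbors,
                                      bool areaWeighted) {
    Eigen::Vector3d n = Eigen::Vector3d::Zero();
    const std::size_t k = neighbors.size();
    for (std::size_t i = 0; i < k; ++i) {
        const Eigen::Vector3d a = neighbors[i] - p;
        const Eigen::Vector3d b = neighbors[(i + 1) % k] - p;
        const Eigen::Vector3d ni = a.cross(b);      // triangle normal, unnormalized
        if (ni.norm() < 1e-12) continue;            // degenerate triangle
        const double w = areaWeighted
            ? 0.5 * ni.norm()                                    // area, cf. (6)
            : std::acos(a.dot(b) / (a.norm() * b.norm()));       // angle, cf. (7)
        n += w * ni.normalized();
    }
    return n.normalized();
}
```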

6) FALS SNE [5]: The relationship between the Cartesian coordinate system and the spherical coordinate system (SCS) is as follows [5]:

$$\mathbf{p} = r\,\mathbf{s}, \quad \mathbf{s} = \begin{bmatrix} \cos\theta\cos\phi \\ \cos\theta\sin\phi \\ \sin\theta \end{bmatrix}, \tag{8}$$

where $r = \|\mathbf{p}\|_2$, and $\theta$ and $\phi$ are the elevation and azimuth angles, respectively. Since all points in $\mathbf{Q}^+$ are in a small neighborhood [5], their plane offsets $d$ are considered to be identical in FALS SNE. (2) and (8) result in:

$$\hat{\mathbf{n}} \propto \mathbf{M}^{-1}\mathbf{b}, \tag{9}$$

where $\mathbf{M} = \sum_{i} \mathbf{s}_i \mathbf{s}_i^\top$ and $\mathbf{b} = \sum_{i} \mathbf{s}_i / r_i$. Since the viewing directions $\mathbf{s}_i$ depend only on the pixel positions, $\mathbf{M}^{-1}$ can be precomputed, which makes FALS very fast.
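A minimal sketch of this precomputation idea under our reading of (9) follows (the struct and function names are ours, not the authors' code):

```cpp
// FALS-style precomputation: the unit viewing directions s_i of a window depend
// only on pixel positions, so M = sum_i s_i s_i^T can be inverted offline; at
// runtime only b = sum_i s_i / r_i depends on the measured ranges.
#include <Eigen/Dense>
#include <vector>

struct FalsWindow {
    Eigen::Matrix3d Minv;            // (sum_i s_i s_i^T)^-1, precomputed per pixel
    std::vector<Eigen::Vector3d> s;  // fixed unit viewing directions of the window
};

Eigen::Vector3d falsNormal(const FalsWindow& win, const std::vector<double>& r) {
    Eigen::Vector3d b = Eigen::Vector3d::Zero();
    for (std::size_t i = 0; i < win.s.size(); ++i) b += win.s[i] / r[i];
    return (win.Minv * b).normalized();  // n is proportional to M^-1 b, cf. (9)
}
```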

7) SRI SNE [5]: Similar to FALS SNE, SRI SNE first transforms the range data from the Cartesian coordinate system to the SCS. $\mathbf{n}$ is then obtained by computing the partial derivatives of the local tangential surface $r(\theta, \phi)$:

$$\mathbf{n} \propto \mathbf{R} \left( \hat{\mathbf{e}}_r - \frac{1}{r}\frac{\partial r}{\partial \theta}\, \hat{\mathbf{e}}_\theta - \frac{1}{r\cos\theta}\frac{\partial r}{\partial \phi}\, \hat{\mathbf{e}}_\phi \right), \tag{10}$$

where $\mathbf{R}$ is an SO(3) matrix with respect to $\theta$ and $\phi$, and $\hat{\mathbf{e}}_r$, $\hat{\mathbf{e}}_\theta$ and $\hat{\mathbf{e}}_\phi$ are the unit vectors along the $r$, $\theta$ and $\phi$ coordinate axes, respectively. $\partial r / \partial \theta$ and $\partial r / \partial \phi$ can be obtained by applying standard image convolutional kernels.

8) LINE-MOD SNE [3]: Firstly, the optimal gradient $[\partial z/\partial u,\ \partial z/\partial v]^\top$ of a depth map is computed. Then, a 3D plane is formed by three points $\mathbf{p}$, $\mathbf{p}_1$ and $\mathbf{p}_2$:

$$\mathbf{p} = z\,\mathbf{v}(u, v), \quad \mathbf{p}_1 = \left(z + \frac{\partial z}{\partial u}\right)\mathbf{v}(u+1, v), \quad \mathbf{p}_2 = \left(z + \frac{\partial z}{\partial v}\right)\mathbf{v}(u, v+1), \tag{11}$$

where $\mathbf{v}(u, v)$ is the vector along the line of sight that goes through the image pixel $(u, v)$ and is computed using the camera intrinsic parameters. The surface normal can then be computed using:

$$\hat{\mathbf{n}} = \frac{(\mathbf{p}_1 - \mathbf{p}) \times (\mathbf{p}_2 - \mathbf{p})}{\left\| (\mathbf{p}_1 - \mathbf{p}) \times (\mathbf{p}_2 - \mathbf{p}) \right\|_2}. \tag{12}$$
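A compact sketch of this construction under our reading of (11) and (12) (the helper name and sight-vector arguments are our assumptions):

```cpp
// LINE-MOD-style normal: two auxiliary 3D points are placed along neighboring
// lines of sight using first-order depth gradients; the normal is the normalized
// cross product of the two in-plane edge vectors.
#include <Eigen/Dense>

Eigen::Vector3d lineModNormal(double z, double dzdu, double dzdv,
                              const Eigen::Vector3d& v,    // sight vector at (u, v)
                              const Eigen::Vector3d& vu,   // sight vector at (u+1, v)
                              const Eigen::Vector3d& vv) { // sight vector at (u, v+1)
    const Eigen::Vector3d p  = z * v;
    const Eigen::Vector3d p1 = (z + dzdu) * vu;            // cf. (11)
    const Eigen::Vector3d p2 = (z + dzdv) * vv;
    return (p1 - p).cross(p2 - p).normalized();            // cf. (12)
}
```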

III 3F2N SNE

In this paper, we propose a novel, highly accurate and ultrafast SNE, which is simple to understand and use. Our SNE can compute surface normals from structured range data using three filters, namely, a horizontal image gradient filter, a vertical image gradient filter and a mean/median filter. Hence, we call it three-filters-to-normal (3F2N) SNE.

A 3D point $\mathbf{p} = [x, y, z]^\top$ in the CCS can be transformed into a pixel $\tilde{\mathbf{p}} = [u, v]^\top$ using [19]:

$$\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \frac{1}{z}\,\mathbf{K}\mathbf{p} = \frac{1}{z} \begin{bmatrix} f_x & 0 & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix}, \tag{13}$$

where $\mathbf{K}$ is the camera intrinsic matrix, $[u_0, v_0]^\top$ is the image principal point, and $f_x$ and $f_y$ are the camera focal lengths (in pixels) in the $x$ and $y$ directions, respectively. Combining (1) and (13) results in:

$$\frac{1}{z} = -\frac{1}{d} \left( n_x \frac{u - u_0}{f_x} + n_y \frac{v - v_0}{f_y} + n_z \right). \tag{14}$$

Differentiating (14) with respect to $u$ and $v$ leads to:

$$g_u = \frac{\partial (1/z)}{\partial u} = -\frac{n_x}{d f_x}, \qquad g_v = \frac{\partial (1/z)}{\partial v} = -\frac{n_y}{d f_y}, \tag{15}$$

which can be approximated by respectively performing horizontal and vertical image gradient filters, e.g., Sobel, Scharr and Prewitt, on the inverse depth image (an image storing the values of $1/z$). Rearranging (15) results in the following expressions of $n_x$ and $n_y$:

$$n_x = -d f_x g_u, \qquad n_y = -d f_y g_v. \tag{16}$$

Given an arbitrary $\mathbf{q}_i = [x_i, y_i, z_i]^\top \in \mathbf{Q}$, we can compute the corresponding $n_{z_i}$ by plugging (16) into (1):

$$n_{z_i} = d\, \frac{f_x g_u \Delta x_i + f_y g_v \Delta y_i}{\Delta z_i}, \tag{17}$$

where $\mathbf{q}_i - \mathbf{p} = [\Delta x_i, \Delta y_i, \Delta z_i]^\top$. In this paper, $k = 8$ and $\mathbf{Q}$ is an 8-connected neighborhood. Since (16) and (17) have a common factor of $-d$, they can be simplified as:

$$\hat{n}_x = f_x g_u, \qquad \hat{n}_y = f_y g_v, \qquad \hat{n}_z = \Phi\left\{ -\frac{\hat{n}_x \Delta x_i + \hat{n}_y \Delta y_i}{\Delta z_i} \right\}_{i = 1, \dots, k}, \tag{18}$$

where $\Phi$ is a mean or median operator used to estimate $\hat{n}_z$. Please note: if the depth value of $\mathbf{p}$ is identical to those of all its neighboring points $\mathbf{q}_i$, we consider that the direction of its corresponding surface normal is perpendicular to the image plane and simply set $\hat{\mathbf{n}}$ to $[0, 0, -1]^\top$. The performances of estimating $\hat{n}_z$ using the mean filter and using the median filter will be compared in Section IV.
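The whole pipeline reduces to a few lines of code. Below is a minimal single-threaded sketch of (15)–(18) (our illustration, not the authors' optimized AVX2/CUDA code), under two stated assumptions: the BG kernel is taken to be a simple central difference, and locally constant depth falls back to $[0, 0, -1]^\top$, following the convention above.

```cpp
// Minimal 3F2N sketch: two gradient filters on the inverse depth image, then a
// median filter over the nz candidates of the 8-connected neighborhood.
#include <algorithm>
#include <cmath>
#include <vector>

struct Normal { float x, y, z; };

void threeFiltersToNormal(const float* depth, int w, int h,
                          float fx, float fy, float u0, float v0,
                          std::vector<Normal>& normals) {
    normals.assign(static_cast<std::size_t>(w) * h, Normal{0.f, 0.f, -1.f});
    auto Z = [&](int u, int v) { return depth[v * w + u]; };
    for (int v = 1; v < h - 1; ++v)
        for (int u = 1; u < w - 1; ++u) {
            // Filters 1 and 2: horizontal/vertical gradients of 1/z, cf. (15).
            const float gu = 0.5f * (1.f / Z(u + 1, v) - 1.f / Z(u - 1, v));
            const float gv = 0.5f * (1.f / Z(u, v + 1) - 1.f / Z(u, v - 1));
            const float nx = fx * gu, ny = fy * gv;              // cf. (18)
            // Filter 3: median of the nz candidates (17) over the 8-neighborhood.
            const float z = Z(u, v);
            const float x = (u - u0) * z / fx, y = (v - v0) * z / fy;
            std::vector<float> cand;
            for (int dv = -1; dv <= 1; ++dv)
                for (int du = -1; du <= 1; ++du) {
                    if (du == 0 && dv == 0) continue;
                    const float zi = Z(u + du, v + dv), dz = zi - z;
                    if (std::fabs(dz) < 1e-7f) continue;         // same depth: skip
                    const float xi = (u + du - u0) * zi / fx;
                    const float yi = (v + dv - v0) * zi / fy;
                    cand.push_back(-(nx * (xi - x) + ny * (yi - y)) / dz);
                }
            if (cand.empty()) continue;                          // keep [0, 0, -1]
            std::nth_element(cand.begin(), cand.begin() + cand.size() / 2, cand.end());
            const float nz = cand[cand.size() / 2];
            const float len = std::sqrt(nx * nx + ny * ny + nz * nz);
            if (len > 1e-12f) normals[v * w + u] = {nx / len, ny / len, nz / len};
        }
}
```

Replacing the std::nth_element median with a running mean yields the BG-Mean variant; the released implementations at sites.google.com/view/3f2n are considerably faster than this didactic version.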

Fig. 2: AAE comparisons with respect to different image gradient filters and mean/median filters: (a) easy dataset; (b) medium dataset; (c) hard dataset. Please note: (a), (b) and (c) use different scales.
Fig. 3: AAE comparisons with respect to different filter sizes: (a) easy dataset; (b) medium dataset; (c) hard dataset. Please note: (a), (b) and (c) use different scales.
Fig. 4: Examples of the experimental results: (1)–(5) columns on (a), (d) and (g) rows show the 3D mesh models, depth images, surface normal ground truth and the experimental results obtained using BG-Mean and BG-Median SNEs, respectively; (1)–(5) columns on (b), (e) and (h) rows show the angular error maps obtained by PlaneSVD, PlanePCA, VectorSVD, AreaWeighted and AngleWeighted SNEs, respectively; (1)–(5) columns on (c), (f) and (i) rows show the angular error maps obtained by FALS, SRI, LINE-MOD, BG-Mean and BG-Median SNEs, respectively.
Fig. 5: AAE comparisons among different computer vision-based SNEs: (a) easy dataset; (b) medium dataset; (c) hard dataset. Please note: (a), (b) and (c) use different scales.
Fig. 6: Examples of the DIODE dataset: (a) RGB images; (b) depth images; (c) surface normal ground truth; (d) BG-Mean SNE results; (e) BG-Median SNE results; (f) BG-Mean SNE error maps; (g) BG-Median SNE error maps.

Specifically, for a stereo camera, $f_x = f_y = f$, and the relationship between the depth $z$ and the disparity $\delta$ is as follows:

$$z = \frac{f b}{\delta}, \tag{19}$$

where $b$ is the stereo rig baseline. Therefore,

$$g_u = \frac{1}{fb}\frac{\partial \delta}{\partial u}, \qquad g_v = \frac{1}{fb}\frac{\partial \delta}{\partial v}. \tag{20}$$

Plugging (19) and (20) into (18) and dropping the common factor $1/b$ results in:

$$\hat{n}_x = \frac{\partial \delta}{\partial u}, \qquad \hat{n}_y = \frac{\partial \delta}{\partial v}, \qquad \hat{n}_z = \Phi\left\{ -\frac{\hat{n}_x \Delta x_i + \hat{n}_y \Delta y_i}{\Delta z_i} \right\}_{i = 1, \dots, k}. \tag{21}$$

Therefore, our SNE can also estimate surface normals from a disparity image using the same three filters.
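Under this formulation, only the gradient source changes. A hedged fragment (names ours) of the disparity variant:

```cpp
// Disparity variant of (21): since 1/z is proportional to the disparity, the two
// gradient filters act directly on the disparity image D (row-major, width w);
// n_z is then estimated exactly as in (18), via the mean/median filter.
inline float nxFromDisparity(const float* D, int w, int u, int v) {
    return 0.5f * (D[v * w + (u + 1)] - D[v * w + (u - 1)]);   // d(delta)/du
}
inline float nyFromDisparity(const float* D, int w, int u, int v) {
    return 0.5f * (D[(v + 1) * w + u] - D[(v - 1) * w + u]);   // d(delta)/dv
}
```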

IV Experiments

Gradient filter | Mean filter | Median filter
BG | 3.722 | 10.973
Sobel | 3.824 | 11.167
Scharr | 3.848 | 11.355
Prewitt | 3.743 | 11.065
TABLE I: The runtime (ms) of the CPU implementations (using a single thread) with respect to different image gradient filters and mean/median filters.
Method | Jetson TX2 | GTX 1080 Ti | RTX 2080 Ti
BG-Mean | 0.823521 | 0.049504 | 0.046944
Sobel-Mean | 0.855843 | 0.052288 | 0.051232
Scharr-Mean | 0.860319 | 0.052320 | 0.051280
Prewitt-Mean | 0.857762 | 0.052256 | 0.050816
BG-Median | 1.206337 | 0.102368 | 0.065536
Sobel-Median | 1.217023 | 0.104608 | 0.067840
Scharr-Median | 1.239041 | 0.105376 | 0.071008
Prewitt-Median | 1.240479 | 0.105152 | 0.069024
TABLE II: The runtime (ms) of the GPU implementations with respect to different image gradient filters and mean/median filters.

IV-A Datasets and Evaluation

In our experiments, we used 24 3D mesh models from Free3D (free3d.com) to create three datasets (eight models in each dataset). According to their different difficulty levels, we name our datasets "easy", "medium" and "hard", respectively. Each 3D mesh model is first fixed at a certain position. A virtual range sensor with pre-set intrinsic parameters is then used to capture depth images at 1800–2500 different viewpoints. At each viewpoint, a 480×640 pixel depth image is generated by rendering the 3D mesh model using the OpenGL Shading Language (GLSL, www.opengl.org/sdk/docs/tutorials/ClockworkCoders/glsl_overview.php). However, since the OpenGL rendering process applies linear interpolation by default, rendering surface normal images is infeasible. Hence, the surface normal of each triangle, constructed by three mesh vertices, is considered to be the ground-truth surface normal of any 3D point residing on this triangle. Our datasets are publicly available at sites.google.com/view/3f2n. In addition to our datasets, we also utilize the DIODE dataset (diode-dataset.org) [20] to evaluate SNE performance.

Furthermore, we utilize two metrics to quantify SNE accuracy: a) the average angular error (AAE):

$$e_A = \frac{1}{m} \sum_{i=1}^{m} \phi_i, \tag{22}$$

and b) the proportion of good pixels (PGP) [6]:

$$e_P = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}\left( \phi_i \le \varphi \right), \tag{23}$$

where

$$\phi_i = \cos^{-1} \frac{\langle \mathbf{n}_i, \hat{\mathbf{n}}_i \rangle}{\|\mathbf{n}_i\|_2\, \|\hat{\mathbf{n}}_i\|_2} \tag{24}$$

is the angular error of the $i$-th point, $m$ is the number of 3D points used for evaluation, $\varphi$ is the angular error tolerance, and $\hat{\mathbf{n}}_i$ and $\mathbf{n}_i$ are the estimated and ground-truth surface normals, respectively. In addition to accuracy, we also record the SNE processing time $t$ (ms) and introduce a new metric:

$$\psi = \frac{e_A}{1/t} = e_A\, t \tag{25}$$

to quantify the trade-off between the speed and accuracy of a given SNE; since $t$ is in milliseconds, $\psi$ is reported in degrees/kHz. A fast and precise SNE achieves a low $\psi$ score.
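For completeness, a small sketch (ours; the function and constant names are not from the paper) of how (22)–(25) can be computed, assuming Eigen:

```cpp
// Evaluation metrics as we read (22)-(25): AAE is the mean angular error, PGP the
// fraction of points within the tolerance, and psi divides AAE by the processing
// speed in kHz, so lower is better on all three.
#include <Eigen/Dense>
#include <algorithm>
#include <cmath>
#include <vector>

struct Metrics { double aaeDeg, pgp, psi; };

Metrics evaluate(const std::vector<Eigen::Vector3d>& est,   // estimated normals
                 const std::vector<Eigen::Vector3d>& gt,    // ground-truth normals
                 double tolRad,                             // tolerance (radians)
                 double speedKHz) {                         // processing speed (kHz)
    const double kRadToDeg = 180.0 / 3.14159265358979323846;
    double sum = 0.0;
    std::size_t good = 0;
    for (std::size_t i = 0; i < est.size(); ++i) {
        const double c = est[i].normalized().dot(gt[i].normalized());
        const double phi = std::acos(std::max(-1.0, std::min(1.0, c)));  // (24)
        sum += phi;
        if (phi <= tolRad) ++good;                                       // (23)
    }
    const double aaeDeg = kRadToDeg * sum / est.size();                  // (22)
    return { aaeDeg, static_cast<double>(good) / est.size(), aaeDeg / speedKHz };
}
```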

IV-B Filter Settings and Implementation Details

As discussed in Section III, $g_u$ and $g_v$ can be estimated by convolving an inverse depth image or a disparity map with image convolutional kernels, e.g., Sobel, Scharr, Prewitt, etc. Hence, in our experiments, we first compare the accuracy of the surface normals estimated using the aforementioned convolutional kernels. Then, a brute-force search is utilized to find the best parameters for each kernel. Our experiments illustrate that the basic gradient (BG) kernel, i.e., a simple 1×3 central difference, achieves the best overall performance.

We implement the proposed SNE in Matlab and C++ on a CPU and in CUDA on a GPU. The source code is publicly available at sites.google.com/view/3f2n. Similar to the FALS, SRI and LINE-MOD SNE implementations provided in the opencv_contrib repository (github.com/opencv/opencv_contrib), we use the advanced vector extensions 2 (AVX2) and streaming SIMD (single instruction, multiple data) extensions (SSE) instruction sets to optimize our C++ implementation. Since our approach estimates surface normals from an 8-connected neighborhood, we also use memory alignment strategies to speed up our SNE. In the GPU implementation, we first create a texture object in the GPU texture memory and then bind this object to the address of the input depth/disparity image, which greatly reduces the number of memory requests to the GPU global memory.

Method | Runtime (ms) | ψ Easy | ψ Medium | ψ Hard
PlaneSVD [18] | 393.69 | 813.87 | 2389.73 | 6923.18
PlanePCA [17] | 631.88 | 1306.29 | 3835.59 | 11111.92
VectorSVD [4] | 563.21 | 1199.63 | 3529.11 | 10142.34
AreaWeighted [4] | 1092.24 | 2407.74 | 6843.56 | 18600.68
AngleWeighted [4] | 1032.88 | 1850.00 | 5855.62 | 13693.24
FALS [5] | 4.11 | 9.26 | 25.20 | 71.17
SRI [5] | 12.18 | 32.18 | 81.66 | 238.78
LINE-MOD [3] | 6.43 | 41.93 | 63.84 | 202.08
BG-Mean | 3.72 | 7.96 | 24.80 | 56.96
BG-Median | 10.97 | 18.18 | 62.38 | 168.03
TABLE III: Comparison of runtime (ms) and ψ scores (degrees/kHz) among different computer vision-based SNEs.
Method | Easy (φ = 10°, 20°, 30°) | Medium (φ = 10°, 20°, 30°) | Hard (φ = 10°, 20°, 30°)
PlaneSVD [18] | 0.9648 0.9792 0.9855 | 0.8621 0.9531 0.9718 | 0.6202 0.7394 0.7914
PlanePCA [17] | 0.9648 0.9792 0.9855 | 0.8621 0.9531 0.9718 | 0.6202 0.7394 0.7914
VectorSVD [4] | 0.9643 0.9777 0.9846 | 0.8601 0.9495 0.9683 | 0.6187 0.7346 0.7848
AreaWeighted [4] | 0.9636 0.9753 0.9819 | 0.8634 0.9504 0.9665 | 0.6248 0.7448 0.7977
AngleWeighted [4] | 0.9762 0.9862 0.9893 | 0.8814 0.9711 0.9809 | 0.6625 0.8075 0.8651
FALS [5] | 0.9654 0.9794 0.9857 | 0.8621 0.9547 0.9731 | 0.6209 0.7433 0.7961
SRI [5] | 0.9499 0.9713 0.9798 | 0.8431 0.9403 0.9633 | 0.5594 0.6932 0.7605
LINE-MOD [3] | 0.8542 0.9085 0.9343 | 0.7277 0.8803 0.9282 | 0.3375 0.4757 0.5636
BG-Mean | 0.9563 0.9767 0.9864 | 0.8349 0.9423 0.9674 | 0.6191 0.7671 0.8368
BG-Median | 0.9723 0.9829 0.9889 | 0.8722 0.9600 0.9766 | 0.6631 0.7821 0.8289
TABLE IV: PGP comparison among different computer vision-based SNEs with respect to different angular error tolerances φ on the easy, medium and hard datasets.

IV-C Performance Evaluation

We first compare the performance of the proposed SNE with respect to different image gradient filters (BG, Sobel, Scharr and Prewitt) and the mean/median filter. The AAE scores with respect to the easy, medium and hard datasets are illustrated in Figure 2. The runtime of our implementations on an Intel Core i7-8700K CPU (using a single thread) and three GPUs (Jetson TX2, GTX 1080 Ti and RTX 2080 Ti) is given in Tables I and II, respectively. We can see that BG outperforms Sobel, Scharr and Prewitt in terms of AAE on all datasets. Also, using the median filter achieves better surface normal accuracy than using the mean filter, because an $\hat{n}_z$ candidate in (17) can differ significantly from the ground-truth value, introducing significant noise to the mean filter. BG-Median SNE thus achieves notably better accuracy than BG-Mean SNE on the easy, medium and hard datasets. Furthermore, Figure 3 illustrates the AAE with respect to different filter sizes, where readers can see that the AAE decreases gradually as the filter size increases. However, the median filter is much more computationally intensive and time-consuming than the mean filter, because it needs to sort the eight candidates and find the median value. From Tables I and II, we can observe that both BG-Mean SNE and BG-Median SNE perform much faster than real-time across different computing platforms. The processing speed of BG-Mean SNE is over 1 kHz and 21 kHz on the Jetson TX2 GPU and RTX 2080 Ti GPU, respectively. Furthermore, BG-Mean SNE performs around 1.4 to 2.1 times faster than BG-Median SNE. Therefore, the latter achieves the best surface normal accuracy, while the former achieves the best processing speed.

Moreover, we compare our SNE with the other computer vision-based SNEs mentioned in Section II. Some examples of the experimental results are shown in Figure 4, where it can be seen that the bad estimates mainly reside on object edges. Additionally, Figure 5 shows the AAE comparisons on the easy, medium and hard datasets, where we can find that BG-Median SNE achieves the best score on the easy dataset, while AngleWeighted SNE achieves the best scores on the medium and hard datasets. Meanwhile, the scores achieved by BG-Median SNE and AngleWeighted SNE are very similar. The runtime (C++ implementations using a single thread) and ψ scores achieved by the aforementioned SNEs are given in Table III, where we can observe that the averaging-based SNEs are the most time-consuming ones, while BG-Mean SNE achieves the fastest processing speed. Furthermore, BG-Mean, FALS and BG-Median SNEs occupy the first three places, respectively, in terms of the ψ score. Moreover, Table IV compares their PGP scores with respect to different angular error tolerances φ on the easy, medium and hard datasets, where we can see that AngleWeighted SNE achieves the best scores, except for φ = 10° on the hard dataset. However, according to Table III, AngleWeighted SNE is extremely time-consuming and achieves a very bad ψ score. On the other hand, BG-Median SNE and AngleWeighted SNE achieve similar PGP scores, but the former performs about 100 times faster than the latter.

In addition to our created datasets, we also use the DIODE dataset [20] to compare the performances of the above-mentioned SNEs. Examples of our experimental results are shown in Figure 6. The runtime and average angular errors obtained by different SNEs are given in Table V, where it can be seen that BG-Mean SNE is the fastest among all SNEs, while BG-Median SNE achieves the lowest average angular errors. Therefore, 3F2N SNE outperforms all other state-of-the-art computer vision-based SNEs in terms of both accuracy and speed. Researchers can use either BG-Mean SNE or BG-Median SNE in their work, according to their demand for speed or accuracy.

V Discussion

Fig. 7: 3D scene reconstruction comparison: (a) conventional 3D scene reconstruction; (b) 3D scene reconstruction aided by our proposed SNE.

An SNE can be applied in a variety of computer vision and robotics tasks. In this section, we first use the ICL-NUIM RGB-D dataset [21] to show an example of 3D geometry reconstruction benefiting from 3F2N SNE. Then, we discuss the possibility of using 3F2N SNE to improve the performance of state-of-the-art CNNs.

In our experiments, we first utilize an off-the-shelf registration algorithm provided by the Point Cloud Library (PCL, pointclouds.org) to match the 3D point cloud generated from each depth image with a global 3D geometry model. The sensor poses and motion trajectory can then be obtained. Meanwhile, we integrate the surface normal information into the point cloud registration process and acquire another collection of sensor poses and motion trajectories. Then, we utilize ElasticFusion [22], a real-time dense visual simultaneous localization and mapping (SLAM) system, to reconstruct the 3D scene using the input RGB-D data and the two collections of sensor poses and motion trajectories. The two reconstructed 3D scenes are illustrated in Figure 7, where it is evident that the proposed SNE improves the 3D geometry reconstruction accuracy. According to the quantitative analysis of our experimental results, the 3D reconstruction accuracy improves by approximately 19% when using the surface normal information obtained by 3F2N SNE.

Fig. 8: Examples of the Synthia-SF dataset: (a) RGB images; (b) disparity images; (c) 3F2N SNE results.

Furthermore, we perform 3F2N SNE on the disparity images provided in the Synthia-SF dataset [23]. Examples of the experimental results are shown in Figure 8. It can be seen that the 3D points on each planar (or near planar) surface, such as a road or building side, possess similar surface normals. Therefore, we believe that our proposed SNE can be utilized to extract informative features for CNNs in various autonomous driving perception tasks, such as semantic image segmentation and freespace detection, without affecting their training/prediction speed.

Method | Runtime (ms) | e_A indoor (degrees) | e_A outdoor (degrees)
PlaneSVD [18] | 883.458 | 10.8879 | 16.5789
PlanePCA [17] | 1501.707 | 10.8879 | 16.5789
VectorSVD [4] | 1327.847 | 10.8684 | 16.5143
AreaWeighted [4] | 2522.729 | 10.8871 | 16.5597
AngleWeighted [4] | 2661.607 | 10.7591 | 16.5453
FALS [5] | 10.706 | 11.0715 | 16.6705
SRI [5] | 39.075 | 11.1543 | 16.9029
LINE-MOD [3] | 17.026 | 12.8388 | 17.2719
BG-Mean | 9.511 | 11.2018 | 16.9811
BG-Median | 30.193 | 10.5887 | 16.2544
TABLE V: The runtime (ms) and AAE (degrees) comparisons among different computer vision-based SNEs on the DIODE dataset.

VI Conclusion and Future Work

In this paper, we presented an accurate and ultrafast SNE named 3F2N for structured range data. Our proposed SNE computes surface normals from an inverse depth image or a disparity image using three filters, namely, a horizontal image gradient filter, a vertical image gradient filter and a mean/median filter. To evaluate the performance of our proposed SNE, we created three datasets (containing about 60k pairs of depth images and the corresponding surface normal ground truth) using 24 3D mesh models. Our datasets are publicly available at https://sites.google.com/view/3f2n for research purposes. According to our experimental results, BG outperforms other image gradient filters, e.g., Sobel, Scharr and Prewitt, in terms of both precision and speed. BG-Median SNE achieves the best surface normal precision (average angular errors of 1.6, 5.6 and 15.3 degrees on the easy, medium and hard datasets, respectively), while BG-Mean SNE is most effective in terms of the speed-accuracy trade-off, achieving the lowest ψ scores. Furthermore, our proposed 3F2N SNE achieves better overall performance than all other computer vision-based SNEs. We believe that our SNE can be easily applied in various computer vision and robotics tasks, e.g., autonomous driving.

As future work, we plan to use the proposed method to help learn depth prediction from monocular images, as many methods have already exploited the constraints between depth and surface normals in monocular depth prediction.

References

  • [1] S. Choi, Q.-Y. Zhou, and V. Koltun, “Robust reconstruction of indoor scenes,” in

    IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    , 2015.
  • [2] S. Martull, M. Peris, and K. Fukui, “Realistic CG stereo image dataset with ground-truth disparity maps,” in ICPR Workshop TrakMark2012, vol. 111, no. 430, 2012, pp. 117–118.
  • [3] S. Hinterstoisser, C. Cagniart, S. Ilic, P. Sturm, N. Navab, P. Fua, and V. Lepetit, “Gradient response maps for real-time detection of textureless objects,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 5, pp. 876–888, 2011.
  • [4] K. Klasing, D. Althoff, D. Wollherr, and M. Buss, “Comparison of surface normal estimation methods for range sensing applications,” in 2009 IEEE International Conference on Robotics and Automation.   IEEE, 2009, pp. 3206–3211.
  • [5] H. Badino, D. Huber, Y. Park, and T. Kanade, “Fast and accurate computation of surface normals from range images,” in 2011 IEEE International Conference on Robotics and Automation.   IEEE, 2011, pp. 3084–3091.
  • [6] F. Lu, X. Chen, I. Sato, and Y. Sato, “SymPS: BRDF symmetry guided photometric stereo for shape and light source estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 1, pp. 221–234, 2017.
  • [7] D. Xu, W. Ouyang, X. Wang, and N. Sebe, “PAD-Net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 675–684.
  • [8] P. Wang, X. Shen, B. Russell, S. Cohen, B. Price, and A. L. Yuille, “Surge: Surface regularized geometry estimation from a single image,” in Advances in Neural Information Processing Systems, 2016, pp. 172–180.
  • [9] T. Hashimoto and M. Saito, “Normal estimation for accurate 3d mesh reconstruction with point cloud model incorporating spatial structure,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019, pp. 54–63.
  • [10] A. Bansal, B. Russell, and A. Gupta, “Marr revisited: 2d-3d alignment via surface normal prediction,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 5965–5974.
  • [11] S. Tozza, W. A. Smith, D. Zhu, R. Ramamoorthi, and E. R. Hancock, “Linear differential constraints for photo-polarimetric height estimation,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2279–2287.
  • [12] X. Qi, R. Liao, Z. Liu, R. Urtasun, and J. Jia, “GeoNet: Geometric neural network for joint depth and surface normal estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 283–291.
  • [13] B. Li, C. Shen, Y. Dai, A. Van Den Hengel, and M. He, “Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1119–1127.
  • [14] H. M. Wallach, “Conditional random fields: An introduction,” Technical Reports (CIS), p. 22, 2004.
  • [15] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations, 2015.
  • [16] J. Huang, Y. Zhou, T. Funkhouser, and L. J. Guibas, “FrameNet: Learning local canonical frames of 3D surfaces from a single RGB image,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 8638–8647.
  • [17] K. Jordan and P. Mordohai, “A quantitative evaluation of surface normal estimation in point clouds,” in 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems.   IEEE, 2014, pp. 4220–4226.
  • [18] K. Klasing, D. Wollherr, and M. Buss, “Realtime segmentation of range data using continuous nearest neighbors,” in 2009 IEEE International Conference on Robotics and Automation.   IEEE, 2009, pp. 2431–2436.
  • [19] R. Hartley and A. Zisserman, Multiple view geometry in computer vision.   Cambridge university press, 2003.
  • [20] I. Vasiljevic, N. Kolkin, S. Zhang, R. Luo, H. Wang, F. Z. Dai, A. F. Daniele, M. Mostajabi, S. Basart, M. R. Walter, and G. Shakhnarovich, “DIODE: A Dense Indoor and Outdoor DEpth Dataset,” CoRR, vol. abs/1908.00463, 2019. [Online]. Available: http://arxiv.org/abs/1908.00463
  • [21] A. Handa, T. Whelan, J. McDonald, and A. Davison, “A benchmark for RGB-D visual odometry, 3D reconstruction and SLAM,” in IEEE Intl. Conf. on Robotics and Automation, ICRA, Hong Kong, China, May 2014.
  • [22] T. Whelan, S. Leutenegger, R. Salas-Moreno, B. Glocker, and A. Davison, “ElasticFusion: Dense SLAM without a pose graph,” in Robotics: Science and Systems, 2015.
  • [23] D. Hernandez-Juarez, L. Schneider, A. Espinosa, D. Vazquez, A. M. Lopez, U. Franke, M. Pollefeys, and J. C. Moure, “Slanted stixels: Representing san francisco’s steepest streets,” in British Machine Vision Conference (BMVC), 2017, 2017.