1 Introduction
Simultaneous localization and mapping (SLAM) is a key technology for applications such as autonomous systems and augmented reality. Whereas LiDAR-based SLAM [34, 11, 25] is well established and widely used in autonomous vehicles, visual SLAM with a monocular camera does not yet provide sufficient accuracy and robustness, particularly regarding dense map reconstruction, to replace LiDAR-based SLAM. Although some visual SLAM algorithms that use stereo cameras [7, 32], RGB-D cameras [12, 14, 33], and inertial sensors [20, 18, 2, 28, 19] have achieved high performance, these methods build on pure monocular SLAM; hence, improving monocular SLAM itself is important.
Monocular SLAM methods can be classified into two types: direct methods and indirect methods.
Direct methods: Direct methods estimate camera poses and reconstruct the scene by minimizing the photometric error, defined as the sum of intensity differences between each pixel in the latest image and the reprojection of the color or monochrome 3D map. Direct methods, such as LSD-SLAM [5], SVO [9], and DSO [6], process almost all pixels in incoming images. Unlike indirect methods, they do not require exact pixel correspondences among multiple views, which leads to denser map reconstruction. However, direct methods are susceptible to image noise, luminance fluctuation, and lens aberration because they use pixel intensities directly. To overcome this drawback, Bergmann et al. [1] proposed a normalization method against luminance fluctuation and a calibration method for lens aberration. As another approach, Zhang et al. [35] proposed an auto-exposure method suitable for direct methods.
Indirect methods: Indirect methods minimize the geometric error between observed 2D feature points and the reprojections of the corresponding 3D points. As a result of the use of feature descriptors, indirect methods such as PTAM [15] and ORB-SLAM [21] are robust against brightness changes and image noise. Additionally, indirect methods explicitly establish feature point correspondences; hence, outliers are easily removed using RANSAC [8] or M-estimation [27]. This characteristic, however, can be a drawback: indirect methods carefully select stable feature points, so the reconstructed 3D map tends to be sparse and does not provide detailed geometry. Densification methods, such as PMVS [10] and the extension LPMVS [26], might be useful for obtaining dense geometry; however, they are offline methods and not applicable in real time.

In this paper, we propose VITAMIN-E (VIsual Tracking And MappINg with Extremely dense feature points), a novel monocular SLAM method that is highly precise, robust, and dense owing to the tracking of a large number of feature points. Indirect methods are inherently robust against noise, illumination change, and outliers as a result of the use of feature descriptors. Retaining this advantage, we reconstruct detailed 3D maps by establishing dense point correspondences. The contributions of this study are as follows. First, we introduce a new dense feature point tracking algorithm based on dominant flow estimation and curvature extrema tracing. This allows VITAMIN-E to process an enormous number of feature points; however, maintaining them simultaneously might lead to a high computational cost. Therefore, we also introduce a novel optimization technique for bundle adjustment, called the subspace Gauss–Newton method, which significantly improves the efficiency of bundle adjustment by partially updating the variables. Moreover, VITAMIN-E generates meshes from the reconstructed feature points and integrates them using a truncated signed distance function (TSDF) [22, 30, 24]. Compared with not only conventional indirect methods but also state-of-the-art direct methods, VITAMIN-E provides highly detailed 3D geometry in real time using only a CPU, as shown in Figure 1.
2 Dense Feature Point Tracking
2.1 Feature Point Tracking
Indirect methods that use image descriptors can be unstable because of incorrect feature point correspondences. They build feature point correspondences between multiple views by matching descriptors. Extracting consistent descriptors over multiple frames, however, becomes difficult because descriptors vary as the monocular camera changes its pose. Methods such as the KLT tracker [29], which continuously track feature points while updating the feature descriptors, might be useful for overcoming this problem. However, because the tracked positions drift as a result of minute changes in the feature descriptors, the correspondences over multiple views tend to be incorrect. These problems originate from the use of feature descriptors.
Rather than associating feature points based on descriptors, VITAMIN-E tracks the local extrema of curvature in incoming images. In the proposed method, feature points denote the extrema of the curvature of the image intensities. Let f(x, y) be an image; then the curvature κ of f is as follows:

κ = f_y^2 f_xx − 2 f_x f_y f_xy + f_x^2 f_yy,   (1)

where f_x represents the partial derivative of f with respect to x, which can be obtained using a Sobel operator or similar technique. VITAMIN-E builds point correspondences over multiple images by tracking the local maximum points of the curvature κ(x, y, t), which is the extension of κ(x, y) to the time domain t. Figure 2(a) shows an example scene from which a large number of curvature extrema are extracted. Whereas conventional indirect methods rely only on feature points with a large curvature to obtain stable correspondences, the proposed method tracks all detected extrema to reconstruct detailed geometry.
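To make Equation 1 concrete, the following is a minimal sketch of curvature computation and extrema detection (NumPy/SciPy assumed; the function names and the Sobel-based derivatives are our illustrative choices, not the paper's implementation):

```python
import numpy as np
from scipy import ndimage

def curvature(f):
    """Curvature of image intensities: kappa = fy^2 fxx - 2 fx fy fxy + fx^2 fyy (Eq. 1)."""
    fx = ndimage.sobel(f, axis=1, mode="nearest")   # partial derivative w.r.t. x
    fy = ndimage.sobel(f, axis=0, mode="nearest")   # partial derivative w.r.t. y
    fxx = ndimage.sobel(fx, axis=1, mode="nearest")
    fxy = ndimage.sobel(fx, axis=0, mode="nearest")
    fyy = ndimage.sobel(fy, axis=0, mode="nearest")
    return fy**2 * fxx - 2.0 * fx * fy * fxy + fx**2 * fyy

def curvature_extrema(kappa):
    """Boolean mask of pixels whose curvature is a local maximum in the 8-neighborhood."""
    dilated = ndimage.maximum_filter(kappa, size=3, mode="nearest")
    return (kappa == dilated) & (kappa > 0)
```

In a real pipeline the resulting mask would be converted to pixel coordinates and fed to the tracker described below.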
2.2 Dominant Flow Estimation
After detecting the curvature extrema, the proposed method estimates a dominant flow, the average optical flow between the images, which provides a good initial value for extrema tracking and makes it significantly more stable, as explained later. Specifically, we determine corresponding feature pairs between the current and previous images using the BRIEF [4] feature. Because we only need coarse feature pairs over consecutive frames at this stage, feature matching is performed on low-resolution images subsampled to 1/6 of the original size.
Then, we fit an affine transformation model to the feature pairs. Let p_j and q_j denote the positions of a feature point in the previous and current frames, respectively, and let A and b represent a 2 × 2 matrix and a 2D translation, respectively. A and b are obtained by minimizing the cost function E using the Gauss–Newton method:

E = Σ_{j=1}^{N} ρ(‖A p_j + b − q_j‖),   (2)

where N and ρ denote the total number of corresponding points and a kernel function for M-estimation, respectively. The following Geman–McClure kernel with scale parameter σ is used in VITAMIN-E:

ρ(x) = x^2 / (x^2 + σ^2).   (3)
As a result of the M-estimation, the dominant flow represented by A and b can be estimated stably, even from low-resolution images, and it allows us to roughly predict the position of feature points in the next frame. Note that VITAMIN-E does not rely on conventional feature matching at its core; matching is used only to obtain prior information for dense extrema tracking, as described in the next section. Whereas conventional feature matching has difficulty pairing all feature points correctly between consecutive frames, the affine transformation can be estimated as long as at least three correct correspondences are given.
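The robust affine fit can be sketched via iteratively reweighted least squares, a standard way to realize Geman–McClure M-estimation and closely related to the Gauss–Newton iteration described above (a hypothetical implementation with our own function names):

```python
import numpy as np

def geman_mcclure_weight(r, sigma):
    # IRLS weight rho'(r)/r for the kernel rho(x) = x^2 / (x^2 + sigma^2)
    return 2.0 * sigma**2 / (r**2 + sigma**2) ** 2

def fit_dominant_flow(p_prev, p_curr, sigma=3.0, iters=10):
    """Robustly fit p_curr ~ A @ p_prev + b; outliers are downweighted by the kernel."""
    n = len(p_prev)
    X = np.hstack([p_prev, np.ones((n, 1))])        # design matrix [p | 1]
    A, b = np.eye(2), np.zeros(2)
    for _ in range(iters):
        r = np.linalg.norm(p_curr - (p_prev @ A.T + b), axis=1)
        sw = np.sqrt(geman_mcclure_weight(r, sigma))[:, None]
        M, *_ = np.linalg.lstsq(X * sw, p_curr * sw, rcond=None)
        A, b = M[:2].T, M[2]
    return A, b
```

Because the kernel assigns near-zero weight to large residuals, a handful of gross mismatches barely affects the recovered A and b, which is what makes the low-resolution matching sufficient.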
2.3 Curvature Extrema Tracking
VITAMIN-E tracks feature points by tracing the extrema of the image curvature, making use of the dominant flow. Because it depends only on the extrema instead of the feature descriptors used in conventional indirect methods, VITAMIN-E is unaffected by the variation of feature descriptors caused by image noise or illumination changes, which makes it highly robust.
According to the dominant flow represented by A and b, we first predict the current position p̂_{t1} of a point p_{t0} tracked in the previous frame:

p̂_{t1} = A p_{t0} + b.   (4)
Next, the prediction p̂_{t1} is corrected to p_{t1} by maximizing the evaluation function F:

p_{t1} = argmax_p F(p),  F(p) = κ_{t1}(p) + w(p),   (5)

where κ_{t1} stores the curvature at each pixel at time t1, and F and w denote the evaluation function and a weight for the prediction, respectively. The maximization is performed using the hill climbing method, with p̂_{t1} as the initial position. Specifically, the maximum point p_{t1} is obtained by iterating hill climbing over the eight neighboring pixels at each step until it reaches a local maximum of F. The function w prevents the maximization process from falling into wrong extrema, thereby playing a regularization role.
Note that extrema tracking can easily fall into local solutions because there are many extrema in the image curvature, and it is almost impossible to distinguish them without descriptors. However, prediction based on the dominant flow boosts the accuracy of extrema tracking and enables it to approach the optimal solution.
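The hill-climbing correction might look as follows (a sketch; the quadratic distance penalty is our stand-in for the weight function w, whose exact form the text above does not fix):

```python
import numpy as np

def track_extremum(kappa, p0, lam=0.1, max_steps=100):
    """Hill climbing on F(p) = kappa[p] + w(p), started from the predicted position p0.
    lam * distance^2 plays the role of the regularizing weight w (our choice)."""
    h, w = kappa.shape

    def score(p):
        d2 = (p[0] - p0[0]) ** 2 + (p[1] - p0[1]) ** 2
        return kappa[p[0], p[1]] - lam * d2

    p = (int(p0[0]), int(p0[1]))
    for _ in range(max_steps):
        # examine the 8-neighborhood and move to the best-scoring pixel
        candidates = [(p[0] + dr, p[1] + dc)
                      for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                      if 0 <= p[0] + dr < h and 0 <= p[1] + dc < w]
        best = max(candidates, key=score)
        if score(best) <= score(p):
            break                       # local maximum of F reached
        p = best
    return p
```

With lam = 0 this degenerates to plain hill climbing on the curvature; a positive lam keeps the result near the dominant-flow prediction.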
3 Bundle Adjustment for Dense Tracking
3.1 Bundle Adjustment
Bundle adjustment iteratively adjusts the reconstructed map by minimizing reprojection errors. Given the i-th 3D point p_i, the j-th camera's rotation R_j and translation t_j, and the 2D position u_{ij} of p_i observed in the j-th camera frame, the objective function is formulated as follows:

E = Σ_{i=1}^{N} Σ_{j=1}^{M} ρ(‖u_{ij} − π(R_j p_i + t_j)‖),   (6)
where N and M are the numbers of feature points and camera poses, respectively, π denotes the 3D-2D projection function, and ρ is a kernel function for M-estimation. Specifically, the optimal camera variables ξ_j = (R_j, t_j) and points p_i are obtained by applying the Gauss–Newton method to Equation 6, which results in iteratively solving the following equations:

H Δx = −g,   (7)
where x = (ξ, p) collects all variables, and H and g represent the Hessian matrix and the gradient of E around x, respectively. H and g can be represented using the camera variable block ξ and the feature point variable block p as follows:

H = [[H_ξξ, H_ξp], [H_pξ, H_pp]],  g = (g_ξ, g_p).   (8)
The Hessian matrix H in bundle adjustment has a unique structure: H_ξξ and H_pp are sparse matrices with only diagonal elements in block units, whereas H_ξp is a dense matrix. Efficient solutions that exploit this structure are the key to developing highly precise and robust visual SLAM.
State-of-the-art monocular SLAM methods, such as ORB-SLAM [21] and DSO [6], solve Equation 7 by decomposing it using the Schur complement matrix instead of solving it directly:

S Δξ = −ĝ,   (9)

H_pp Δp = −g_p − H_pξ Δξ,   (10)

where S = H_ξξ − H_ξp H_pp^{-1} H_pξ and ĝ = g_ξ − H_ξp H_pp^{-1} g_p. The decomposition allows us to solve bundle adjustment faster. The number of camera variables is remarkably smaller than that of feature point variables, and the inverse of H_pp can be calculated easily because it has only block-diagonal components; thus, the Schur complement matrix S, whose size is 6M × 6M for M camera poses, is significantly small compared with the original H, and its inverse is rapidly computable.
Equation 9 is also called marginalization. Regarding S as a new H and ĝ as a new g, the decomposition is equivalent to eliminating all feature point variables from the cost function E. State-of-the-art SLAM methods gain efficiency from this marginalization technique, which prevents the increase in computational cost caused by a large number of variables.
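The elimination in Equations 9 and 10 can be checked on a small dense block system (illustrative only; real solvers never form H densely and exploit the block-diagonal structure of H_pp):

```python
import numpy as np

def solve_schur(H_cc, H_cp, H_pp, g_c, g_p):
    """Solve the symmetric system [[H_cc, H_cp], [H_cp^T, H_pp]] [dc; dp] = -[g_c; g_p]
    by eliminating the point block first, as in Equations 9 and 10."""
    H_pp_inv = np.linalg.inv(H_pp)                          # block-diagonal in real BA, hence cheap
    S = H_cc - H_cp @ H_pp_inv @ H_cp.T                     # Schur complement (camera-sized)
    dc = np.linalg.solve(S, -g_c + H_cp @ H_pp_inv @ g_p)   # Equation 9
    dp = np.linalg.solve(H_pp, -g_p - H_cp.T @ dc)          # Equation 10
    return dc, dp
```

The camera-block solve involves only the small matrix S, while the point update reduces to independent per-point solves in practice.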
However, when thousands of feature points are maintained in every frame, as in the dense extrema tracking of VITAMIN-E, the size of H fundamentally cannot be made sufficiently small because, for stability, variable elimination is applicable only to old variables unrelated to the current frame. Moreover, the cost of forming the Schur complement grows with the number of feature points; thus, bundle adjustment over tens of thousands of points, where the size of H is 100,000 × 100,000 or more, becomes too expensive to run in real time.
3.2 Subspace Gauss–Newton Method
To deal with the explosion in the size of H, we propose a novel optimization technique called the "subspace Gauss–Newton method." It partially updates the variables rather than updating all of them at once as in Equations 9 and 10, by decomposing these equations further as follows:

H_{ξ_j ξ_j} Δξ_j = −g_{ξ_j} − Σ_i H_{ξ_j p_i} Δp_i,   (11)

H_{p_i p_i} Δp_i = −g_{p_i} − Σ_j H_{p_i ξ_j} Δξ_j.   (12)
Equation 11 updates ξ_j of a camera variable, and Equation 12 p_i of a feature point variable. H_{ξ_j ξ_j}, H_{p_i p_i}, and H_{ξ_j p_i} are matrices of size 6 × 6, 3 × 3, and 6 × 3, respectively. The subspace Gauss–Newton method iteratively solves Equations 11 and 12 until Δξ_j and Δp_i converge, respectively.
These formulae are extensions of the Gauss–Seidel method, an iterative method for solving a linear system of equations, and are equivalent to applying the Gauss–Newton method while fixing all variables except those being optimized. The advantage of the proposed subspace Gauss–Newton method is that it does not require a large inverse matrix as in Equation 9 but only the inverse of a 6 × 6 matrix at most. Additionally, as in ORB-SLAM [21] and DSO [6], further speedup is possible by appropriately performing variable elimination, which sets most elements of Δξ and Δp to zero. Because the proposed optimization limits the search space of the Gauss–Newton method to a subspace, we call it the "subspace Gauss–Newton method."^1

^1 Alternating optimization, such as our method, has been used in other contexts [36, 31]. See the supplementary material for details on the novel aspects of our method.
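Since the per-block updates amount to Gauss–Seidel sweeps over the normal equations, the idea can be sketched on a generic block system (a toy dense version with our own naming; the real method operates on the sparse bundle-adjustment structure):

```python
import numpy as np

def subspace_gn(H, g, blocks, sweeps=200):
    """Block Gauss-Seidel on H dx = -g: repeatedly solve each small diagonal block
    while holding all other variables fixed (the idea behind Equations 11 and 12)."""
    dx = np.zeros_like(g)
    for _ in range(sweeps):
        for idx in blocks:               # idx: indices of one 6-DoF pose or one 3-DoF point
            Hi = H[np.ix_(idx, idx)]     # at most a 6 x 6 matrix to factorize
            rhs = -g[idx] - H[idx] @ dx + Hi @ dx[idx]
            dx[idx] = np.linalg.solve(Hi, rhs)
    return dx
```

For a positive-definite H the sweeps converge to the same solution as the full Gauss–Newton step, while each update touches only a tiny block.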
4 Dense Reconstruction
A large number of accurate 3D points are generated in real time by VITAMIN-E as a result of the dense extrema tracking and subspace Gauss–Newton method described in the previous sections. This not only yields a point cloud but also enables further dense geometry reconstruction that cannot be achieved by conventional indirect methods.
Meshing and Noise Removal: We first project the 3D points onto an image and apply Delaunay triangulation to generate triangular meshes. Then, we use the NLTGV minimization proposed by Greene et al. [23] to remove noise from the meshes. NLTGV minimization allows us to smooth the meshes while retaining local surface structures, unlike typical mesh denoising methods such as Laplacian smoothing. Figure 2(c) shows example results of Delaunay triangulation and smoothing with NLTGV minimization.
Mesh Integration in TSDF: Finally, we integrate the meshes in a TSDF to reconstruct the entire geometry of the scene. The TSDF represents an object shape by discretizing space into a voxel grid that stores the distance from the object surface, and it can merge multiple triangular meshes by storing, in each voxel, the average of the distances from the meshes.
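The running-average fusion rule just described can be sketched as a minimal voxel-grid update (a simplified illustration, not the system's actual implementation; distances are assumed precomputed per voxel):

```python
import numpy as np

class TSDFGrid:
    """Minimal running-average TSDF fusion over a voxel grid."""
    def __init__(self, shape, trunc=0.1):
        self.trunc = trunc
        self.d = np.zeros(shape)   # truncated signed distance per voxel
        self.w = np.zeros(shape)   # accumulated observation weight per voxel

    def integrate(self, dist, weight=1.0):
        """dist: signed distance from each voxel center to the currently observed surface."""
        d_new = np.clip(dist, -self.trunc, self.trunc)
        self.d = (self.w * self.d + weight * d_new) / (self.w + weight)
        self.w = self.w + weight
```

A mesh for rendering can then be extracted from the fused distances with marching cubes.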
Table 1: Results on EuRoC over five runs (mean ± standard deviation). For each sequence, the rows give the localization accuracy (RMSE, cm), the localization success rate (%), and the number of initialization retries. Parenthesized accuracies are reference values for trials whose success rate was below 90%; retries are not reported for LSD-SLAM (n/a).

Columns: Our method | DSO[6] | ORB-SLAM[21] (w/o loop closure) | LSD-SLAM[5] (w/o loop closure)

MH01 easy (3682)
  accuracy:  12.9 ± 0.5 cm | 6.0 ± 0.8 cm | 5.2 ± 1.1 cm | (44.9 ± 7.2) cm
  success:   100.0 ± 0.0 % | 100.0 ± 0.0 % | 97.7 ± 1.6 % | 28.9 ± 23.6 %
  retries:   0 ± 0 | 0 ± 0 | 19 ± 11 | n/a

MH02 easy (3040)
  accuracy:  8.8 ± 0.5 cm | 4.2 ± 0.2 cm | 4.1 ± 0.4 cm | (58.3 ± 6.9) cm
  success:   100.0 ± 0.0 % | 100.0 ± 0.0 % | 92.4 ± 1.1 % | 73.0 ± 1.5 %
  retries:   0 ± 0 | 0 ± 0 | 56 ± 6 | n/a

MH03 medium (2700)
  accuracy:  10.6 ± 1.3 cm | 21.1 ± 0.9 cm | (4.5 ± 0.4) cm | (266.2 ± 61.3) cm
  success:   100.0 ± 0.0 % | 100.0 ± 0.0 % | 48.9 ± 0.8 % | 28.4 ± 20.7 %
  retries:   0 ± 0 | 0 ± 0 | 0 ± 0 | n/a

MH04 difficult (2033)
  accuracy:  19.3 ± 1.6 cm | 20.3 ± 1.0 cm | 33.6 ± 9.4 cm | (136.4 ± 114.3) cm
  success:   100.0 ± 0.0 % | 95.7 ± 0.0 % | 95.2 ± 0.8 % | 27.2 ± 7.0 %
  retries:   0 ± 0 | 5 ± 0 | 6 ± 1 | n/a

MH05 difficult (2273)
  accuracy:  14.7 ± 1.1 cm | 10.2 ± 0.6 cm | 14.9 ± 4.6 cm | (27.4 ± 16.4) cm
  success:   100.0 ± 0.0 % | 95.5 ± 0.0 % | 90.0 ± 4.0 % | 22.7 ± 0.5 %
  retries:   0 ± 0 | 2 ± 0 | 18 ± 5 | n/a

V101 easy (2911)
  accuracy:  9.7 ± 0.2 cm | 13.4 ± 5.8 cm | 8.8 ± 0.1 cm | (20.0 ± 22.8) cm
  success:   100.0 ± 0.0 % | 100.0 ± 0.0 % | 96.6 ± 0.0 % | 11.6 ± 11.2 %
  retries:   0 ± 0 | 0 ± 0 | 1 ± 0 | n/a

V102 medium (1710)
  accuracy:  9.3 ± 0.6 cm | 53.0 ± 5.5 cm | (14.5 ± 11.7) cm | (67.0 ± 14.0) cm
  success:   100.0 ± 0.0 % | 100.0 ± 0.0 % | 52.0 ± 3.3 % | 15.2 ± 0.1 %
  retries:   0 ± 0 | 0 ± 0 | 17 ± 4 | n/a

V103 difficult (2149)
  accuracy:  11.3 ± 0.5 cm | 85.0 ± 36.4 cm | (37.2 ± 20.7) cm | (29.3 ± 2.0) cm
  success:   100.0 ± 0.0 % | 100.0 ± 0.0 % | 65.5 ± 8.8 % | 11.0 ± 0.1 %
  retries:   0 ± 0 | 0 ± 0 | 56 ± 26 | n/a

V201 easy (2280)
  accuracy:  7.5 ± 0.4 cm | 7.6 ± 0.5 cm | 6.0 ± 0.1 cm | (131.3 ± 20.4) cm
  success:   100.0 ± 0.0 % | 100.0 ± 0.0 % | 95.2 ± 0.0 % | 74.1 ± 8.9 %
  retries:   0 ± 0 | 0 ± 0 | 0 ± 0 | n/a

V202 medium (2348)
  accuracy:  8.6 ± 0.7 cm | 11.8 ± 1.4 cm | 12.3 ± 2.7 cm | (42.1 ± 9.2) cm
  success:   100.0 ± 0.0 % | 100.0 ± 0.0 % | 99.5 ± 1.2 % | 11.3 ± 0.2 %
  retries:   0 ± 0 | 0 ± 0 | 0 ± 0 | n/a

V203 difficult (1922)
  accuracy:  140.0 ± 5.2 cm | 147.5 ± 6.6 cm | (104.3 ± 64.0) cm | (17.7 ± 1.6) cm
  success:   100.0 ± 0.0 % | 100.0 ± 0.0 % | 16.8 ± 15.9 % | 11.9 ± 0.2 %
  retries:   0 ± 0 | 0 ± 0 | 233 ± 123 | n/a
Table 2: Total processing time per frame.

Our method: 36 msec/frame | DSO: 53 msec/frame | ORB-SLAM: 25 msec/frame | LSD-SLAM: 30 msec/frame
Table 3: Processing time of each module in VITAMIN-E.

Frontend | Feature tracking: 36 msec/frame
Backend  | Localization & mapping: 25 msec/frame
Backend  | Meshing & denoising: 45 msec/frame
Backend  | TSDF updating & marching cubes (low-resolution; voxel size 15 cm): 175 msec/update
Backend  | TSDF updating & marching cubes (high-resolution; voxel size 2.5 cm): 4000 msec/update
5 Experimental Results
5.1 Setup
We evaluated the performance of the proposed VITAMIN-E on the visual SLAM benchmark EuRoC [3].^2 The dataset was created using a flying drone equipped with a stereo camera and an IMU in indoor environments, and it provides ground-truth trajectories obtained by a Leica MS50 laser tracker and a Vicon motion capture system. EuRoC is also well known for its data variation, with different difficulty levels ranked by movement speed and lighting conditions. In this experiment, we compared VITAMIN-E with other monocular SLAM methods, DSO [6], ORB-SLAM [21], and LSD-SLAM [5], using only the left images of the stereo camera in the EuRoC dataset. Note that because VITAMIN-E and DSO [6] do not include loop closing and relocalization, these functions were disabled in ORB-SLAM [21] and LSD-SLAM [5] to evaluate performance fairly. Similar evaluations can be found in the papers on DSO [6] and SVO [9].

^2 See the supplementary material for experimental results on other datasets.
VITAMIN-E ran on a Core i7-7820HQ without any GPUs, with each process threaded for real-time processing. Initialization was performed using essential matrix decomposition. Because bundle adjustment is significantly sensitive to the initial values of the variables, we initialized the camera variables using P3P [17, 13] with RANSAC [8] and the feature point variables using triangulation. Note that the proposed bundle adjustment ran fast enough to be applied to every frame, rather than only to key frames as in ORB-SLAM and DSO. To manage the TSDF, OpenChisel [16] was used in our implementation because of its ability to handle a TSDF on a CPU.
5.2 Evaluation Criteria
The evaluation on EuRoC was performed according to the following criteria:
Localization success rate: We defined the localization success rate as the ratio of the number of successfully tracked images to the total number of images in the sequence. If the success rate was less than 90%, we regarded the trial as a failure. Once localization failed, a method could not estimate the camera position afterward because loop closing and relocalization were disabled in this experiment; therefore, robustness contributed greatly to the success rate.
Localization accuracy: Because absolute scale is not observable in monocular SLAM, the localization accuracy was computed after scaling the estimated trajectory so that the RMSE with respect to the ground-truth trajectory was minimized. Note that we did not evaluate the accuracy when the success rate was less than 90%, because the RMSE of very short trajectories tends to be unfairly good.
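The scale-minimizing RMSE has a closed form once the trajectories are associated (a simplified sketch; a full evaluation would also align rotation and translation, e.g., with the Umeyama method):

```python
import numpy as np

def scale_aligned_rmse(est, gt):
    """RMSE between an estimated and a ground-truth trajectory (N x 3 arrays of
    associated positions) after choosing the scale of `est` that minimizes it."""
    e = est - est.mean(axis=0)           # scale acts about the trajectory centroid
    g = gt - gt.mean(axis=0)
    s = np.sum(e * g) / np.sum(e * e)    # closed-form optimal scale factor
    return np.sqrt(np.mean(np.sum((s * e - g) ** 2, axis=1)))
```

The scale s follows from minimizing Σ‖s e_i − g_i‖², which gives s = Σ⟨e_i, g_i⟩ / Σ‖e_i‖².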
Number of initialization retries: Initialization plays an important role in monocular visual SLAM and significantly affects the success rate. Because different methods have different initialization processes, the numbers of initialization retries are not directly comparable across methods, but they indicate whether a method has a weakness in initialization in certain cases.
5.3 Results and Discussion
Table 1 shows the experimental results, obtained by applying each method to the EuRoC image sequences five times; the table reports the averages and standard deviations of the aforementioned criteria. The highest accuracy in each sequence is emphasized. The initialization retries of LSD-SLAM are not included in the table: LSD-SLAM has no reinitialization function and failed to initialize in some sequences, so we manually identified a frame at which initialization worked well in each trial and excluded its number of retries from the evaluation.
The EuRoC image sequences MH01, MH02, MH04, MH05, V101, V201, and V202 are relatively easy cases for visual SLAM because the camera motion is slow and the illumination does not change frequently. By contrast, the camera moves fast in MH03, V102, V103, and V203, and the lighting conditions additionally change dynamically in V102, V103, and V203. Furthermore, in V203, the exposure time is so long that severe motion blur appears in the image sequence, which was captured in an extremely dark environment.
Even in these challenging environments, VITAMIN-E never lost localization and outperformed the other SLAM methods, DSO, ORB-SLAM, and LSD-SLAM, in terms of both accuracy and robustness, as shown in Table 1. In particular, in sequences containing fast camera motion, such as V102 and V103, the proposed method was superior to the existing methods. The reconstruction results are shown in Figures 3 and 4. Although VITAMIN-E generated denser and more accurate geometry than its competitors, it ran comparably fast on a CPU, as shown in Tables 2 and 3. Note that the smaller the voxel size in a TSDF for a detailed 3D model, the higher the computational cost.
The high accuracy and robustness of VITAMIN-E derive from tracking a large number of feature points and performing bundle adjustment for every frame. Distributing reprojection errors among variables is important for accurate monocular SLAM, and the proposed method efficiently diffuses errors over an enormous number of variables via fast bundle adjustment. At the same time, handling a large number of feature points prevents localization failure caused by losing sight of some of them.
6 Conclusion
In this paper, we proposed a novel visual SLAM method that reconstructs dense geometry with a monocular camera. To process a large number of feature points, we proposed curvature extrema tracking using the dominant flow between consecutive frames. The subspace Gauss–Newton method was also introduced to maintain an enormous number of variables by partially updating them to avoid a large inverse matrix calculation in bundle adjustment. Moreover, supported by the accurate and dense point clouds, we achieved highly dense geometry reconstruction with NLTGV minimization and TSDF integration.
VITAMIN-E runs on a CPU in real time, and it outperformed state-of-the-art SLAM methods, DSO, ORB-SLAM, and LSD-SLAM, in terms of both accuracy and robustness on the EuRoC benchmark dataset. Performance should improve further when loop closing is introduced into VITAMIN-E, and fusing IMU data would also be effective for stably estimating the camera pose in challenging environments such as EuRoC V203.
Acknowledgements: This work was supported in part by JSPS KAKENHI Grant Numbers 16K16084 and 18K18072, and by a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO).
References
 [1] Paul Bergmann, Rui Wang, and Daniel Cremers. Online photometric calibration of auto exposure video for real-time visual odometry and SLAM. IEEE Robotics and Automation Letters, 3, 2018.
 [2] Michael Blosch, Sammy Omari, Marco Hutter, and Roland Siegwart. Robust visual inertial odometry using a direct EKF-based approach. In Proc. of International Conference on Intelligent Robots (IROS), 2015.
 [3] Michael Burri, Janosch Nikolic, Pascal Gohl, Thomas Schneider, Joern Rehder, Sammy Omari, Markus W Achtelik, and Roland Siegwart. The EuRoC micro aerial vehicle datasets. International Journal of Robotics Research, 35(10):1157–1163, 2016.

 [4] Michael Calonder, Vincent Lepetit, Christoph Strecha, and Pascal Fua. BRIEF: Binary robust independent elementary features. In Proc. of European Conference on Computer Vision (ECCV), 2010.
 [5] Jakob Engel and Daniel Cremers. LSD-SLAM: Large-scale direct monocular SLAM. In Proc. of European Conference on Computer Vision (ECCV), 2014.
 [6] Jakob Engel, Vladlen Koltun, and Daniel Cremers. Direct sparse odometry. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
 [7] Jakob Engel, Jörg Stückler, and Daniel Cremers. Large-scale direct SLAM with stereo cameras. In Proc. of International Conference on Intelligent Robots (IROS), 2015.
 [8] Martin A. Fischler and Robert C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
 [9] Christian Forster, Zichao Zhang, Michael Gassner, Manuel Werlberger, and Davide Scaramuzza. SVO: Semi-direct visual odometry for monocular and multi-camera systems. IEEE Transactions on Robotics, 33(2):249–265, 2017.
 [10] Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multi-view stereopsis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(8):1362–1376, 2010.
 [11] Wolfgang Hess, Damon Kohler, Holger Rapp, and Daniel Andor. Real-time loop closure in 2D LIDAR SLAM. In Proc. of International Conference on Robotics and Automation (ICRA), 2016.
 [12] Albert S. Huang, Abraham Bachrach, Peter Henry, Michael Krainin, Daniel Maturana, Dieter Fox, and Nicholas Roy. Visual odometry and mapping for autonomous flight using an RGB-D camera. In Proc. of International Symposium on Robotics Research (ISRR), 2011.

 [13] Tong Ke and Stergios Roumeliotis. An efficient algebraic solution to the perspective-three-point problem. In Proc. of International Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
 [14] Christian Kerl, Jürgen Sturm, and Daniel Cremers. Robust odometry estimation for RGB-D cameras. In Proc. of International Conference on Robotics and Automation (ICRA), 2013.
 [15] Georg Klein and David Murray. Parallel tracking and mapping for small AR workspaces. In Proc. of International Symposium on Mixed and Augmented Reality (ISMAR), 2007.
 [16] Matthew Klingensmith, Ivan Dryanovski, Siddhartha Srinivasa, and Jizhong Xiao. Chisel: Real time large scale 3D reconstruction onboard a mobile device using spatially hashed signed distance fields. Robotics: Science and Systems, 2015.
 [17] Laurent Kneip, Davide Scaramuzza, and Roland Siegwart. A novel parametrization of the perspective-three-point problem for a direct computation of absolute camera position and orientation. In Proc. of International Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
 [18] Stefan Leutenegger, Simon Lynen, Michael Bosse, Roland Siegwart, and Paul Furgale. Keyframe-based visual-inertial odometry using nonlinear optimization. International Journal of Robotics Research, 34(3):314–334, 2015.
 [19] Simon Lynen, Markus Wilhelm Achtelik, Stephan Weiss, and Margarita Chli. A robust and modular multi-sensor fusion approach applied to MAV navigation. In Proc. of International Conference on Intelligent Robots (IROS), 2013.

 [20] Anastasios I. Mourikis and Stergios I. Roumeliotis. A multi-state constraint Kalman filter for vision-aided inertial navigation. In Proc. of International Conference on Robotics and Automation (ICRA), 2007.
 [21] Raul Mur-Artal and Juan D. Tardós. ORB-SLAM2: An open-source SLAM system for monocular, stereo and RGB-D cameras. IEEE Transactions on Robotics, 33(5):1255–1262, 2017.
 [22] Richard A. Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J. Davison, Pushmeet Kohli, Jamie Shotton, Steve Hodges, and Andrew Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In Proc. of International Symposium on Mixed and Augmented Reality (ISMAR), 2011.
 [23] W. Nicholas Greene and Nicholas Roy. FLaME: Fast lightweight mesh estimation using variational smoothing on Delaunay graphs. In Proc. of International Conference on Computer Vision (ICCV), 2017.
 [24] Helen Oleynikova, Zachary Taylor, Marius Fehr, Roland Siegwart, and Juan Nieto. Voxblox: Incremental 3D Euclidean signed distance fields for on-board MAV planning. In Proc. of International Conference on Intelligent Robots (IROS), 2017.
 [25] Chanoh Park, Peyman Moghadam, Soohwan Kim, Alberto Elfes, Clinton Fookes, and Sridha Sridharan. Elastic LiDAR fusion: Dense map-centric continuous-time SLAM. In Proc. of International Conference on Robotics and Automation (ICRA), 2018.
 [26] Alex Poms, Chenglei Wu, Shoou-I Yu, and Yaser Sheikh. Learning patch reconstructability for accelerating multi-view stereo. In Proc. of International Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
 [27] William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. Numerical Recipes in C. Cambridge University Press, Cambridge, USA, second edition, 1992.
 [28] Tong Qin, Peiliang Li, and Shaojie Shen. VINS-Mono: A robust and versatile monocular visual-inertial state estimator. IEEE Transactions on Robotics, 34(4):1004–1020, 2018.
 [29] Jianbo Shi and Carlo Tomasi. Good features to track. In Proc. of International Conference on Computer Vision and Pattern Recognition (CVPR), 1994.
 [30] Frank Steinbrücker, Jürgen Sturm, and Daniel Cremers. Volumetric 3D mapping in real-time on a CPU. In Proc. of International Conference on Robotics and Automation (ICRA), 2014.
 [31] Bill Triggs, Philip McLauchlan, Richard Hartley, and Andrew Fitzgibbon. Bundle adjustment: A modern synthesis. In Proc. of the Int. WS on Vision Algorithms: Theory and Practice, 2000.
 [32] R. Wang, M. Schwörer, and D. Cremers. Stereo DSO: Large-scale direct sparse visual odometry with stereo cameras. In International Conference on Computer Vision (ICCV), 2017.
 [33] Ji Zhang, Michael Kaess, and Sanjiv Singh. Real-time depth enhanced monocular odometry. In Proc. of International Conference on Intelligent Robots (IROS), 2014.
 [34] Ji Zhang and Sanjiv Singh. LOAM: Lidar odometry and mapping in real-time. Robotics: Science and Systems, 2014.
 [35] Zichao Zhang, Christian Forster, and Davide Scaramuzza. Active exposure control for robust visual odometry in HDR environments. In Proc. of International Conference on Robotics and Automation (ICRA), 2017.
 [36] Qian-Yi Zhou and Vladlen Koltun. Color map optimization for 3D reconstruction with consumer depth cameras. ACM Trans. Graph., 33(4):155:1–155:10, 2014.