I Introduction
Simultaneous localization and mapping (SLAM) addresses the incremental construction and instantaneous establishment of a global geometric reference using measurements from a dynamic observer as input. Applications benefiting from such capabilities include robotics, augmented/virtual reality, and autonomous driving. For visual SLAM, the choice of geometric representation and estimation framework must balance requirements for online operation (e.g. latency and computational burden) against the achievement of task-relevant performance metrics (e.g. robustness, accuracy, and completeness). Recent dense-indirect methods challenge the standard dichotomy between feature-based and direct SLAM methods.
Feature-based methods determine multi-view relationships among input video frames through the geometric analysis of sparse keypoint correspondences. Popular feature-based SLAM systems [20, 17] have a visual odometry (VO) front-end solving feature correspondence/triangulation and PnP instances to initialize camera pose estimates, which are subsequently refined (e.g. bundle-adjusted) by back-end and mapping modules. Given that sparse feature analysis can operate in real time, the trade-off between online operation and accuracy/robustness achieved by these methods is mainly determined by the scope of the analysis performed by the back-end. However, the accuracy of front-end modules may be affected by lack of texture content, repetitive structures, and/or degenerate geometric configurations, compromising the efficacy and efficiency of back-end modules.
Direct methods jointly estimate photometrically consistent scene structure and camera poses from image content, obviating keypoint correspondence. Dense direct methods [18] are typically sensitive to illumination changes and need a sufficient number of images to triangulate structure at poorly textured regions. Sparse or semi-dense direct methods [6, 5] are less sensitive to illumination changes, but may still require accurate photometric calibration for full performance.
Recent dense-indirect (DI) formulations for VO [15] are a promising alternative to both their sparse and direct counterparts. The DI approach conditions local 3D geometry and camera motion estimates on their consistency w.r.t. (observed) dense optical flow (OF) estimates. Such frameworks actively leverage ongoing improvements in the accuracy and robustness of learning-based OF estimators, while yielding high-quality dense geometry estimates as a byproduct.
To extend the recent DI inference model VOLDOR [15] into a SLAM pipeline, we develop integrative modules customized to the intermediate geometric estimates produced by its inference process. We address the efficient estimation, management, and refinement of global-scope 3D data associations within the context of DI geometric estimates. Our contributions are: 1) extending VOLDOR's inference framework to both generate and input explicit geometric priors for improved estimation accuracy across monocular, stereo, and RGB-D image sources, 2) a robust point-to-plane inverse-depth alignment formulation for source-agnostic frame registration, and 3) a priority linking framework for incremental real-time pose graph management.
II Related Work
Indirect SLAM. Classic approaches like MonoSLAM [3] rely on EKF-based frameworks for fusing multiple observations to recover scene geometry. PTAM [12] simultaneously runs a front-end for VO and a back-end refining the estimation through bundle adjustment. ORB-SLAM [16, 17] introduced a versatile SLAM system with a more powerful back-end featuring global relocalization and loop closing, enabling large-environment applications. More recently, ORB-SLAM3 [1] further built a multi-map system to survive long periods of poor visual information.
Direct SLAM. DTAM [18] first introduced a GPU-based real-time dense modelling and tracking approach for small workspaces. It estimates a dense geometry model through a photometric term combined with a regularizer, while estimating camera motion by finding a warping of the model that minimizes the photometric error w.r.t. the video frames. LSD-SLAM [6] switched to a keyframe-based semi-dense model that allows large-scale real-time application on CPU. DSO [5] builds sparse models and combines a probabilistic model that jointly optimizes all parameters, further integrating a full photometric calibration.
Direct RGB-D SLAM. RGB-D SLAM [10] forgoes geometry estimation by using RGB-D input, while estimating camera poses by minimizing both photometric and geometric error and managing keyframes within a pose graph. ElasticFusion [29] uses dense frame-to-model camera tracking, windowed surfel-based fusion, and frequent model refinement through non-rigid surface deformation. BundleFusion [2] tracks camera poses using sparse-to-dense optimization and a local-to-global strategy for global mapping consistency. Recently, BAD SLAM [22] proposed a fast direct bundle adjustment for poses, surfels, and intrinsics.
Dense-Indirect VO. Valgaerts et al. [26] addressed robust recovery of the fundamental matrix from a single dense OF field. Ranftl et al. [21] further addressed motion segmentation and the recovery of depth maps. While these works highlighted the potential of DI modeling, they focused on the two-view geometry problem. VOLDOR [15] introduced the DI paradigm to the multi-view domain by modelling the VO problem within a probabilistic framework w.r.t. a log-logistic residual model on OF batches, and developed a generalized-EM framework for joint real-time inference of camera motion, 3D structure, and track reliability. While highly accurate, its geometric analysis scope was strictly local.
Deep learning optical flow. Current deep learning work on OF estimation outperforms traditional methods in terms of accuracy and robustness under challenging conditions such as textureless regions, motion blur, and large occlusions. FlowNet/FlowNet2 [4, 8] introduced encoder-decoder CNNs for OF estimation. PWC-Net [24] integrates spatial pyramids, warping, and cost volumes into deep OF estimation, improving performance and generalization. Most recently, MaskFlowNet [33] proposed an asymmetric occlusion-aware feature matching module and achieved current state-of-the-art performance.
III Method
System Architecture. We build upon the recently proposed dense-indirect VO method of Min et al. [15], which addressed the joint probabilistic estimation of camera motion, 3D structure, and track reliability from a set of input dense OF estimates. As is standard practice in the SLAM literature [17, 5], our proposed VOLDOR+SLAM has a VO front-end and a mapping back-end. The front-end operates over small sequential batches (i.e. a temporal sliding window) of dense OF estimates. We adaptively determine keyframe selection and the stride among subsequent batches based on visibility-coverage metrics designed for our dense SLAM system. To enforce consistency within a larger geometric scope while enabling online operation, the back-end adaptively prioritizes the analysis and establishment of pose constraints between keyframe pairs, balancing the search for loop closure connections against the reinforcement of local keyframe connectivity. VOLDOR+SLAM also implements a loop closure scheme based on an image retrieval and geometric verification module [7]. Finally, all pairwise camera pose constraints are managed within a Sim(3)-based pose graph [23]. Module dependencies and data flow of our system are illustrated in Fig. 2. The resulting dense SLAM implementation enables online operation at real-time frame rates on a single GTX 1080Ti GPU.
III-A Front-End
VOLDOR: An extended VO model. Our front-end extends and improves upon the VOLDOR framework [15], which formulates monocular VO from observed OF input batches as a generalized-EM problem whose three hidden variables are camera pose, depth, and track reliability. Originally, [15] proposed the probabilistic graphical model

\hat{T}, \hat{\theta}, \hat{W} = \arg\max_{T, \theta, W} \; p(X \mid T, \theta, W), \quad (1)

where X denotes the observed OF batch, T = \{T_1, \ldots, T_t\} the set of camera poses, \theta the depth map of the first frame in the batch, and W = \{W_1, \ldots, W_t\} the set of rigidness maps used to down-weight OF pixels belonging to occlusions, moving objects, or outliers. The likelihood function in [15] defines residuals as the end-point error (EPE) between the observed optical flow and the rigid flow computed from the parameters. This residual is assumed to follow an empirically validated log-logistic (i.e. Fisk) distribution model. The ensuing MAP problem is solved using a generalized-EM framework that alternately updates each of the parameters. Whereas [15] bootstraps each of the hidden variables deterministically, VOLDOR+ accommodates explicit priors, extending Eq. (1) to yield

\hat{T}, \hat{\theta}, \hat{W} = \arg\max_{T, \theta, W} \; p(X, \theta^{*} \mid T, \theta, W, T^{*}, W^{*}), \quad (2)

where \theta^{*} denotes the set of explicit depth map priors, T^{*} their associated (fixed) relative camera poses, and W^{*} their pixel-level rigidness maps. Henceforth, unless otherwise stated, our VOLDOR extensions strictly follow the framework described in [15], and we refer the reader to that publication for further details.
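As a concrete reference for the residual model just described, the following is a minimal sketch of the log-logistic (Fisk) density in its standard scale/shape parameterization; the function and variable names are ours, not those of [15]:

```python
import math

def fisk_pdf(x, scale, shape):
    """Log-logistic (Fisk) probability density, the distribution used to
    model optical-flow end-point-error residuals. Standard two-parameter
    form: `scale` is the median of the distribution, `shape` controls
    its spread (names are illustrative, not the paper's notation)."""
    if x <= 0:
        # The Fisk distribution has support only on positive residuals.
        return 0.0
    r = (x / scale) ** shape
    return (shape / scale) * (x / scale) ** (shape - 1) / (1.0 + r) ** 2
```

Since the median of a Fisk distribution equals its scale parameter, half the probability mass lies below `scale`, which gives a quick sanity check on any implementation.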
Geometric Priors. While [15] estimates depth maps exclusively from monocular OF fields, VOLDOR+ extends the probabilistic model to optionally account for input depth map priors. Limiting our assumptions on the source of these priors (for now) to knowing their first and second moments, we model a generic likelihood function of the depth value \theta_j at pixel j as a Gaussian-Uniform mixture

p(\theta^{*} \mid \theta_j) = W^{*}_{j'} \, \mathcal{N}\big(z(\theta_j) \mid \theta^{*}_{j'}, \sigma(\theta^{*}_{j'})\big) + (1 - W^{*}_{j'}) \, \mathcal{U}\big(u(\theta^{*}_{j'})\big), \quad (3)

where z(\theta_j) denotes the z-component of the camera-space 3D coordinates of depth value \theta_j transferred across frames having relative motion T^{*}, j' denotes the pixel of the reprojection of the depth estimate across camera frames, while \sigma(\cdot) and u(\cdot) are functions adjusting the variance of the Gaussian distribution and the density of the uniform distribution w.r.t. the observed depth prior, both set in proportion to the inverse depth value. As in [15], we apply the likelihood function within a maximum-inlier estimation framework to define the energy function over prior depths

E^{*}(\theta_j) = \sum_{\theta^{*} \in \Theta^{*}} p(\theta^{*} \mid \theta_j), \quad (4)

where p(\theta^{*} \mid \theta_j) is the probability density of Eq. (3) and \Theta^{*} the set of available depth priors. Finally, aggregating with the original energy function E(\theta_j) conditioned on the input OF described in [15], optimal depth values are selected based on the criterion

\hat{\theta}_j = \arg\max_{\theta_j} \; E(\theta_j) + E^{*}(\theta_j). \quad (5)
In practice, each VO batch will typically have two depth priors: one generated during the analysis of the previous batch and another from the most recent available keyframe. Whenever both instances point to the same depth map, we use that single depth prior. When stereo capture (or RGB-D imagery) is available, we add the corresponding depth map as a prior. However, in such cases (i.e. when depth priors come from a known external source), VOLDOR+ replaces Eq. (3) with an empirical residual model, as done in [15].
Depth Map Confidence Estimation. For each depth map generated through the VOLDOR+ framework, we associate a confidence map Q, which we define as the pixel-wise average of the previously estimated rigidness of the optical flows and depth priors used for its estimation:

Q_j = \frac{\sum_t W^{j}_t + \sum_n W^{*j}_n}{|W| + |W^{*}|}. \quad (6)

When estimated depth maps are subsequently used as depth priors, the value Q_j is used as a prior for the rigidness W^{*}_{j'} in Eq. (3), such that a new weight Q_j W^{*}_{j'} replaces W^{*}_{j'} in the depth update step. Similarly, Q is also used in the back-end keyframe alignment, as will be described in Eq. (12).
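A minimal sketch of this pixel-wise averaging, assuming the rigidness maps are provided as equally-sized arrays (array shapes and function names are ours):

```python
import numpy as np

def depth_confidence(flow_rigidness, prior_rigidness):
    """Pixel-wise confidence of an estimated depth map, computed as the
    average of the rigidness maps of the optical flows and depth priors
    used during its estimation (a sketch of Eq. (6))."""
    # Stack all rigidness maps along a new axis and average per pixel.
    stacked = np.stack(list(flow_rigidness) + list(prior_rigidness), axis=0)
    return stacked.mean(axis=0)
```

Pixels supported by consistently rigid observations thus receive confidence near one, while pixels repeatedly flagged as occluded, dynamic, or outlying are down-weighted in later stages.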
Uncertainty-aware Pose Estimation. The probabilistic camera pose inference in [15] assumes the other parameters fixed and approximates the camera pose posterior distribution by Monte-Carlo sampling of P3P instances

p(T_t \mid X, \theta, W) \approx \frac{1}{N} \sum_{k=1}^{N} q(T_t \mid \hat{T}_k), \quad (7)

where t denotes the time stamp, N the total number of samples, k the sample index, and q is a variational distribution approximating the intractable sample posterior. Further, let q(T_t \mid \hat{T}_k) be a normal distribution \mathcal{N}(\hat{T}_k, \Sigma), where \hat{T}_k is solved using an AP3P solver [9], while the covariance matrix \Sigma is a fixed hyper-parameter. The camera pose, defined as the mode of the approximated posterior, is found using mean-shift with a Gaussian kernel in the Lie algebra. To accommodate subsequent back-end modules, VOLDOR+ estimates each camera pose's uncertainty \Sigma_t by using the mean-shift result as initialization and iteratively fitting a Gaussian distribution to the samples. We discard samples outside 3-sigma for robustness and force all eigenvalues of \Sigma_t to be larger than an epsilon to ensure numerical stability.
Hierarchical Propagation Scheme. For estimating depth and rigidness maps, [15] adopts a sampling and propagation scheme [34]
based on a generalized-EM framework. The 2D image field is broken into alternately directed 1D chains. In the M-step, depth values are randomly sampled and propagated through each chain. In the E-step, hidden Markov chain smoothing is imposed on the chains of rigidness maps using the forward-backward algorithm. Such a sequential global depth propagation scheme imposes a performance bottleneck for GPU implementations, as it is not massively parallelizable. Hence, VOLDOR+
proposes a hierarchical propagation scheme, where a global propagation is done on a depth map of reduced scale to ensure the convergence rate, and local propagations are done in local windows at full resolution to retain fine details. Per Fig. 3, VOLDOR+ achieves a 3x speedup without loss in depth map quality w.r.t. [15].
VO Stride and Keyframe Selection. Once VOLDOR+ runs on an input batch and estimates local 3D geometries, we determine the stride to the next batch and select keyframes using visibility-coverage (VC) metrics. The VC metrics are defined over a pair of frames (A, B) with known poses, where only A need have a depth map. We define the visibility score V as the proportion of depth pixels in A whose projection into B falls inside B's image frame. Conversely, we define the coverage score C as the proportion of the image area in B covered by the depth pixels' projection from A. We define the VC score as the harmonic mean of visibility and coverage

\mathrm{VC} = \frac{2\,V\,C}{V + C}. \quad (8)

The VC score is estimated between the reference frame and all other frames within the batch. The first frame whose VC score falls below a given threshold is selected as the next batch's reference. Finally, if the VC score between the current batch's reference frame and the latest keyframe is below a (second) threshold, we register the current batch's reference frame as a keyframe.
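The VC score computation can be sketched as follows; the pinhole projection and the per-pixel coverage rasterization here are simplified illustrations of the idea under an assumed camera model, not the system's implementation:

```python
import numpy as np

def vc_score(depth_a, pose_a_to_b, K, img_shape_b):
    """Visibility-coverage score between frame A (with a depth map) and
    frame B, as the harmonic mean of visibility and coverage.
    `pose_a_to_b` is a 4x4 rigid transform, `K` the 3x3 intrinsics."""
    H, W = img_shape_b
    v, u = np.mgrid[0:depth_a.shape[0], 0:depth_a.shape[1]]
    # Back-project A's pixels, transform into B's frame, reproject with K.
    pts = np.linalg.inv(K) @ np.stack(
        [u.ravel(), v.ravel(), np.ones(u.size)]) * depth_a.ravel()
    pts = pose_a_to_b[:3, :3] @ pts + pose_a_to_b[:3, 3:4]
    uv = (K @ pts)[:2] / np.maximum((K @ pts)[2], 1e-9)
    # Visibility: fraction of A's depth pixels landing inside B's frame.
    inside = (uv[0] >= 0) & (uv[0] < W) & (uv[1] >= 0) & (uv[1] < H) & (pts[2] > 0)
    visibility = inside.mean()
    # Coverage: fraction of B's pixel grid hit by at least one projection.
    hit = np.zeros((H, W), dtype=bool)
    hit[uv[1, inside].astype(int), uv[0, inside].astype(int)] = True
    coverage = hit.mean()
    if visibility + coverage == 0:
        return 0.0
    return 2.0 * visibility * coverage / (visibility + coverage)
```

The harmonic mean penalizes asymmetric overlap: a frame that sees all of A's geometry compressed into a tiny corner of B (high visibility, low coverage) still scores low, which is what drives both the stride and keyframe decisions.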
III-B Back-End
The front-end VO only establishes local pairwise constraints among successive frames within an input batch. Conversely, the back-end manages spatial relationships of larger scope by establishing spatial constraints between keyframes and detecting loop closure instances.
Keyframe Alignment. We establish keyframe alignment (KA) constraints through 3D registration of the keyframes' depth maps. That is, for two depth maps \theta^A and \theta^B, we estimate both their relative motion T and scale factor s. Similar to [19], we formulate the depth image alignment objective in terms of the difference of the inverse depth values

E_{\mathrm{inv}}(T, s) = \sum_j Q_j \, \rho\!\left( \frac{1}{z(T\,\mathbf{p}_j)} - \frac{1}{\theta^B_{j'}} \right), \quad (9)

where \mathbf{p}_j denotes the camera-space 3D point back-projected from pixel j with depth s\,\theta^A_j, j' the pixel onto which it reprojects in B, \rho is the Cauchy kernel function, and Q_j is the confidence associated with pixel j as defined in Eq. (6). While effective, the convergence rate of Eq. (9) impeded online operation. Toward this end, we generalize the point-to-plane error to account for an inverse depth parameterization:

E_{\mathrm{geo}}(T, s) = \sum_j Q_j \, \rho\!\left( \frac{\langle \mathbf{n}_{j'},\; \mathbf{p}^B_{j'} - T\,\mathbf{p}_j \rangle}{\theta^A_j \, \theta^B_{j'}} \right), \quad (10)

where \langle \cdot, \cdot \rangle denotes the vector inner product, while \mathbf{n}_{j'} denotes the normal vector at pixel j' of depth map \theta^B. This objective implicitly enforces spatial regularization, given that \mathbf{n} is computed from the local depth estimates, and significantly increases convergence speed and quality. Geometrically, scaling pairwise point-to-plane distances among depth estimates \theta^A and \theta^B inversely proportional to their depths prioritizes geometry nearby the reference frame. Conversely, weighting each kernel output by our confidence measure mitigates depth outliers. Optionally, our keyframe alignment process can benefit from the input intensity images by adding an energy function for photometric consistency:

E_{\mathrm{photo}}(T, a, b) = \sum_j \rho\!\left( I^B(j') - a\,I^A(j) - b \right), \quad (11)

where a, b are the pixel brightness affine transformation parameters to be estimated, while I(j) yields the intensity value of pixel j in image I. The photometric term is optional, since we favor keeping our system an indirect method and rely on external OF modules to handle common intensity-based imaging aberrations such as exposure and/or white-balance variations, non-Lambertian reflections, etc. Also, VOLDOR+ estimates depth maps from small batches of successive imagery, while temporally distant observations are subject to arbitrary changes in appearance (i.e. global illumination, shadows, specularities) even when their underlying geometry is consistent. Moreover, our depth inference framework targets the estimation of static geometry observed throughout a multi-frame input batch. Empirically, we observe it consistently "looks through" dynamic or small foreground structures observed in the batch's (first) reference frame and estimates the background's depth. For such cases, the photometric term may reduce overall accuracy. Finally, the criterion for keyframe alignment is

\hat{T}, \hat{s} = \arg\min_{T, s, a, b} \; E_{\mathrm{geo}} + \lambda\,E_{\mathrm{photo}}, \quad (12)

where \lambda balances the geometric and photometric errors; the photometric term is omitted when intensity images are absent. The relative depth scaling factor s is set to 1 for depth input with absolute scale (e.g. stereo or RGB-D). Once Eq. (12) is optimized using Levenberg-Marquardt, the covariance of \hat{T} is linearly approximated through its Jacobian J as \Sigma_T \approx (J^\top J)^{-1}.
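The per-correspondence term of the inverse-depth point-to-plane objective in Eq. (10) can be sketched as follows; the kernel scale and the exact weighting are illustrative assumptions, not the system's tuned values:

```python
import numpy as np

def point_to_plane_residual(p_a, p_b, n_b, depth_a, depth_b, confidence,
                            cauchy_c=1.0):
    """Confidence-weighted, depth-normalized point-to-plane cost for one
    correspondence: p_a is the source point already transformed into the
    target frame, p_b the target point with normal n_b. `cauchy_c` is a
    hypothetical robust-kernel scale."""
    # Signed point-to-plane distance along the target surface normal.
    d = float(np.dot(n_b, p_b - p_a))
    # Scale inversely with both depths: nearby geometry dominates.
    d /= (depth_a * depth_b)
    # Robust Cauchy kernel, weighted by the pixel confidence of Eq. (6).
    return confidence * (cauchy_c ** 2 / 2.0) * np.log1p((d / cauchy_c) ** 2)
```

Note how the same metric offset costs more for nearby points than for distant ones, which is precisely the prioritization of geometry close to the reference frame described above.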
Priority Linking. The aggregation of all pairwise geometric constraints among keyframes yields a fully connected graph (i.e. quadratic complexity). To enable online operation, we avoid exhaustive exploration of these relationships by devising an adaptive prioritization mechanism aimed at:
1) linking keyframes with sufficient overlap for robust and accurate depth map 3D alignment;
2) balancing exploration for potential loop closures vs. exploiting local connectivity.
While loop closure benefits are self-evident, strengthening local connectivity fosters pose corrections to the current keyframe, which may be used as geometric priors in the VOLDOR+ front-end, contributing to more accurate VO estimates.
We construct a priority matrix P for pairwise keyframe linkage, based on the observation that temporal proximity serves as a low-cost proxy for VC scores. That is, the distance between keyframe indices roughly encodes covisibility under the assumption of smooth local motion. Accordingly, we systematically update the entries in P based on two linkage types (i.e. real-time linkages and loop closure linkages). First, for real-time linkages, after the current keyframe with index i is created, we update the priority of each keyframe pair with indices (a, b), a < b \le i, as

P^{\mathrm{rt}}_{a,b} = \exp\!\big(-\alpha\,(b - a)\big)\,\exp\!\big(-\beta\,(i - b)\big), \quad (13)

where the parameter \alpha controls the priority w.r.t. covisibility (proximal keyframes correlate with larger shared overlaps), while \beta controls the priority w.r.t. the timeliness of the considered keyframe pair (we prioritize recent keyframes over older ones). Second, to foster the inclusion of loop closure linkages, we run an image retrieval engine based on the DBoW3 [7] library, using the current keyframe i as the query, to obtain a candidate keyframe c. Whenever the pair (i, c) passes a geometric verification test, we compute the linking priorities in the proximity of the detected overlap as
P^{\mathrm{lc}}_{a,b} = \exp\!\big(-\gamma\,(|a - i| + |b - c|)\big), \quad (14)

where \gamma controls the scope of priority propagation surrounding the loop closure pair. Given n total detected loop closure pairs, the final linking priority matrix is the element-wise maximum

P = \max\big(P^{\mathrm{rt}}, P^{\mathrm{lc},1}, \ldots, P^{\mathrm{lc},n}\big). \quad (15)

The largest unlinked element in P exceeding a minimum threshold is processed by the keyframe alignment module to determine an accurate relative motion between the associated depth maps. Finally, the selected pairwise motion constraint is incorporated into a pose graph framework [23], which enforces both globally consistent correction propagation (i.e. from loop closure linkages) as well as local refinements (i.e. from real-time linkages).
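The real-time linkage update of Eq. (13) can be sketched as follows; the exponential-decay form matches the description above, but the parameter values are illustrative assumptions:

```python
import numpy as np

def realtime_priorities(num_keyframes, alpha=0.5, beta=0.1):
    """Priority matrix for pairwise keyframe linkage where temporal
    proximity proxies covisibility and recent pairs are preferred.
    Only the upper triangle (a < b) is populated."""
    i = num_keyframes - 1  # index of the most recently created keyframe
    P = np.zeros((num_keyframes, num_keyframes))
    for a in range(num_keyframes):
        for b in range(a + 1, num_keyframes):
            covisibility = np.exp(-alpha * (b - a))  # proximal pairs overlap more
            timeliness = np.exp(-beta * (i - b))     # recent pairs matter more
            P[a, b] = covisibility * timeliness
    return P
```

Under this scheme the pair of the two most recent keyframes always ranks highest, while old, temporally distant pairs decay toward zero unless a loop closure detection boosts them.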
TABLE I: Results on the TartanAir hard sequences. Stereo columns report translation / rotation error (lower is better); monocular columns report completeness / rotation error, where "-" denotes sequences without a valid rotation score due to tracking loss.

| Sequence | ORB-SLAM3 (stereo) | VOLDOR+ (stereo) | VOLDOR+SLAM (stereo) | ORB-SLAM3 (mono) | DSO (mono) | VOLDOR+SLAM (mono) |
|---|---|---|---|---|---|---|
| abandonedfactory | 0.0404 / 0.0914 | 0.0345 / 0.1095 | 0.0269 / 0.0924 | 0.969 / 0.2204 | 0.970 / 0.2624 | 0.983 / 0.3210 |
| abandonedfactory_night | 0.1633 / 0.0765 | 0.0227 / 0.0800 | 0.0196 / 0.0702 | 0.932 / 0.4907 | 0.916 / 1.4233 | 0.964 / 0.2229 |
| amusement | 0.0129 / 0.0654 | 0.0125 / 0.0315 | 0.0118 / 0.0265 | 0.640 / - | 0.505 / - | 1.000 / 0.1188 |
| carwelding | 0.0060 / 0.0269 | 0.0124 / 0.0518 | 0.0131 / 0.0535 | 0.995 / 0.0850 | 0.435 / - | 0.997 / 0.3203 |
| ocean | 0.0664 / 0.4379 | 0.0376 / 0.1340 | 0.0335 / 0.1311 | 0.908 / 0.7564 | 0.898 / 1.7069 | 0.958 / 0.4775 |
| office | 0.0035 / 0.0323 | 0.0088 / 0.0740 | 0.0071 / 0.0624 | 0.907 / 0.2872 | 0.950 / 4.4242 | 0.853 / 0.1715 |
| japanesealley | 0.0193 / 0.0773 | 0.0105 / 0.0369 | 0.0102 / 0.0413 | 0.964 / 0.1819 | 0.959 / 0.1291 | 1.000 / 0.1150 |
| seasonsforest | 0.1160 / 0.3102 | 0.0293 / 0.1094 | 0.0196 / 0.0733 | 0.330 / - | 0.534 / - | 1.000 / 0.1525 |
| westerndesert | 0.0906 / 0.3311 | 0.0177 / 0.0898 | 0.0149 / 0.0786 | 0.918 / 0.3933 | 0.855 / 1.4918 | 0.957 / 0.2145 |
| Avg. | 0.0576 / 0.1610 | 0.0207 / 0.0797 | 0.0174 / 0.0699 | 0.840 / 0.3449 | 0.780 / 1.5729 | 0.968 / 0.2348 |
IV Experiments
Experimental Setup.
We benchmarked on the photo-realistic synthetic dataset TartanAir [28], which provides ground-truth camera poses, depth maps, and optical flows. The dataset features challenging environments with moving objects, changing light, and various weather conditions, along with diverse viewpoints and motion patterns that are difficult to obtain in the real world. We tested on the "hard" track of 9 sequences covering diverse indoor and outdoor environments.
Our off-the-shelf OF estimator, MaskFlowNet [33], was neither trained nor fine-tuned on the TartanAir dataset.
For stereo input, we use the x-component of the OF estimated from the left to the right camera as the disparity map, enabling us to use the same empirical residual model described in [15] for both stereo and optical flow.
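For a rectified stereo pair this reduction is a one-liner; the sketch below assumes the common flow layout of shape (H, W, 2) with channels (dx, dy), and the sign convention that left-to-right matches shift leftward:

```python
import numpy as np

def disparity_from_flow(flow_lr):
    """Disparity map from optical flow estimated left-to-right between a
    rectified stereo pair: matching pixels move purely horizontally, so
    the (negated) x-component of the flow is the disparity. The array
    layout (H, W, 2) with channels (dx, dy) is an assumption."""
    return -flow_lr[..., 0]
```

Since a scene point appears further left in the right image, dx is negative and the negation yields the conventional non-negative disparity for points in front of the cameras.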
For all sequences, we use the photometric consistency term of Eq. (11). We compared VOLDOR+SLAM (full pipeline) and VOLDOR+ (VO-only) against ORB-SLAM3 [1] (stereo and monocular) and DSO [5] (monocular).
Stereo Results.
Per Table I (left), ORB-SLAM3 excels in stable environments with sufficient texture, such as 'office' and 'carwelding', but its accuracy drops dramatically on sequences exhibiting poorly textured regions and rapidly changing viewpoints (i.e. trees that rapidly move closer), such as 'ocean' and 'seasonsforest'. Conversely, both our variants perform stably across environments, with considerable improvement from our full pipeline, which attains the best overall pose accuracy scores.
Monocular Results.
For monocular input, the challenging dataset caused our baselines to frequently lose tracking, obfuscating a fair comparison of translation error, which requires scale correction due to scale ambiguity. That is, whenever a system loses tracking, a new map with arbitrary scale is generated; if each sub-sequence is scaled independently, the system that loses tracking more often benefits from more accurate scale correction.
Thus, we replace translation error with a completeness metric, similar to the success rate proposed in [28].
The results in Table I (right) show that VOLDOR+SLAM robustly handles different environments with high completeness compared to our baselines, while achieving the best overall rotation accuracy.
Depth Evaluation.
Fig. 4 conveys our depth map quality results.
As baselines, we use GA-Net [32] and MaskFlowNet [33] (the stereo input of our stereo-based VO). The accuracy metrics used are the inlier rate and EPE. A depth value is considered an inlier when its corresponding disparity EPE falls below a fixed pixel threshold or a fixed fraction of the ground-truth disparity.
Results show that the subset of pixels with high confidence greatly outperforms the existing baselines. Our framework provides only slightly more accurate depth maps when using stereo rather than monocular input.
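These two metrics are straightforward to compute; the thresholds below are illustrative, since the text does not state the paper's exact values:

```python
import numpy as np

def disparity_metrics(pred, gt, px_thresh=3.0, rel_thresh=0.05):
    """End-point error and inlier rate for a disparity map. A pixel is an
    inlier when its EPE is below a pixel threshold OR below a fraction of
    the ground-truth disparity, mirroring the inlier rule in the text;
    `px_thresh` and `rel_thresh` are hypothetical defaults."""
    epe = np.abs(pred - gt)
    inlier = (epe < px_thresh) | (epe < rel_thresh * np.abs(gt))
    return epe.mean(), inlier.mean()
```

The OR-ed relative criterion keeps the metric meaningful for large disparities (near geometry), where a fixed pixel tolerance would be overly strict.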
Since VOLDOR+ penalizes the estimation of observed dynamic content, we deem the reported accuracy values an underestimate for static environments.
Qualitative Results. Representative reconstructions from the TartanAir, KITTI and TUM datasets are shown in Fig. 5.
V Conclusions
VOLDOR+SLAM demonstrates the potential of dense-indirect (DI) estimation frameworks for the geometric analysis of large-scale and unstructured environments. The modular nature of the DI approach allows for the exploration of novel formulations and ancillary tools. In particular, recently proposed learning approaches jointly estimating depth and poses [36, 31, 27, 30], as well as multi-way 3D registration methods [14, 13, 35, 25], are highly related to our work and offer promising research paths toward tighter coupling between DI estimation and global map management modules.
References
 [1] (2020) ORB-SLAM3: an accurate open-source library for visual, visual-inertial and multi-map SLAM. arXiv preprint arXiv:2007.11898. Cited by: §II, §IV.
 [2] (2017) BundleFusion: real-time globally consistent 3D reconstruction using on-the-fly surface re-integration. ACM Transactions on Graphics (ToG) 36 (4), pp. 1. Cited by: §II.
 [3] (2007) MonoSLAM: real-time single camera SLAM. IEEE Transactions on Pattern Analysis & Machine Intelligence (6), pp. 1052–1067. Cited by: §II.

 [4] (2015) FlowNet: learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2758–2766. Cited by: §II.
 [5] (2017) Direct sparse odometry. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (3), pp. 611–625. Cited by: §I, §II, §III, §IV.
 [6] (2014) LSD-SLAM: large-scale direct monocular SLAM. In European Conference on Computer Vision, pp. 834–849. Cited by: §I, §II.
 [7] (2012) Bags of binary words for fast place recognition in image sequences. IEEE Transactions on Robotics 28 (5), pp. 1188–1197. Cited by: §III-B, §III.

 [8] (2017) FlowNet 2.0: evolution of optical flow estimation with deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2462–2470. Cited by: §II.
 [9] (2017) An efficient algebraic solution to the perspective-three-point problem. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7225–7233. Cited by: §III-A.
 [10] (2013) Dense visual SLAM for RGB-D cameras. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2100–2106. Cited by: §II.
 [11] (2010) Visual odometry based on stereo image sequences with RANSAC-based outlier rejection scheme. In 2010 IEEE Intelligent Vehicles Symposium, pp. 486–492. Cited by: TABLE I.
 [12] (2007) Parallel tracking and mapping for small AR workspaces. In Proceedings of the 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality, pp. 1–10. Cited by: §II.
 [13] (2018) DeepIM: deep iterative matching for 6D pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 683–698. Cited by: §V.
 [14] (2019) Taking a deeper look at the inverse compositional algorithm. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4581–4590. Cited by: §V.
 [15] (2020) VOLDOR: visual odometry from log-logistic dense optical flow residuals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4898–4909. Cited by: VOLDOR+SLAM: For the times when feature-based or direct methods are not good enough, §I, §I, §II, §III-A, §III-A, §III-A, §III-A, §III, §IV.
 [16] (2015) ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Transactions on Robotics 31 (5), pp. 1147–1163. Cited by: §II.
 [17] (2017) ORB-SLAM2: an open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics 33 (5), pp. 1255–1262. Cited by: §I, §II, §III.
 [18] (2011) DTAM: dense tracking and mapping in real-time. In 2011 International Conference on Computer Vision, pp. 2320–2327. Cited by: §I, §II.
 [19] (2017) Colored point cloud registration revisited. In Proceedings of the IEEE International Conference on Computer Vision, pp. 143–152. Cited by: §III-B.
 [20] (2015) Stereo parallel tracking and mapping for robot localization. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1373–1378. Cited by: §I.
 [21] (2016) Dense monocular depth estimation in complex dynamic scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4058–4066. Cited by: §II.
 [22] (2019) BAD SLAM: bundle adjusted direct RGB-D SLAM. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 134–144. Cited by: §II.
 [23] (2010) Scale drift-aware large scale monocular SLAM. Robotics: Science and Systems VI 2 (3), pp. 7. Cited by: §III-B, §III.
 [24] (2018) PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8934–8943. Cited by: §II.
 [25] (2018) BA-Net: dense bundle adjustment network. arXiv preprint arXiv:1806.04807. Cited by: §V.
 [26] (2012) Dense versus sparse approaches for estimating the fundamental matrix. International Journal of Computer Vision 96 (2), pp. 212–234. Cited by: §II.
 [27] (2019) Recurrent neural network for (un)supervised learning of monocular video visual odometry and depth. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5555–5564. Cited by: §V.
 [28] (2020) TartanAir: a dataset to push the limits of visual SLAM. arXiv preprint arXiv:2003.14338. Cited by: §IV.
 [29] (2015) ElasticFusion: dense SLAM without a pose graph. In Robotics: Science and Systems. Cited by: §II.
 [30] (2020) D3VO: deep depth, deep pose and deep uncertainty for monocular visual odometry. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1281–1292. Cited by: §V.
 [31] (2018) GeoNet: unsupervised learning of dense depth, optical flow and camera pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1983–1992. Cited by: §V.
 [32] (2019) GA-Net: guided aggregation net for end-to-end stereo matching. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §IV.
 [33] (2020) MaskFlowNet: asymmetric feature matching with learnable occlusion mask. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6278–6287. Cited by: §II, §IV.
 [34] (2014) PatchMatch based joint view selection and depthmap estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1510–1517. Cited by: §III-A.
 [35] (2018) DeepTAM: deep tracking and mapping. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 822–838. Cited by: §V.
 [36] (2017) Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1851–1858. Cited by: §V.