Robust SLAM Systems: Are We There Yet?

09/27/2021 ∙ by Mihai Bujanca, et al. ∙ 0

Progress in the last decade has brought about significant improvements in the accuracy and speed of SLAM systems, broadening their mapping capabilities. Despite these advancements, long-term operation remains a major challenge, primarily due to the wide spectrum of perturbations robotic systems may encounter. Increasing the robustness of SLAM algorithms is an ongoing effort; however, each such effort usually addresses a specific perturbation, and the generalisation of robustness across a wide variety of challenging scenarios is neither well studied nor well understood. This paper presents a systematic evaluation of the robustness of open-source state-of-the-art SLAM algorithms with respect to challenging conditions such as fast motion, non-uniform illumination, and dynamic scenes. The experiments cover perturbations both in isolation and in combination, including long-term deployment settings in unconstrained environments (lifelong operation).




I Introduction

SLAM algorithms are an essential component of embodied AI systems, providing fundamental infrastructure for navigation and other high-level tasks. The progress of SLAM systems during the last three decades has been remarkable, improving both localisation and mapping capabilities. While initially only very small spaces such as table tops or small rooms could be mapped, today's SLAM algorithms can operate at large scale [5, 30]. Thanks to advancements in computing hardware, sensors, and machine learning, SLAM has also extended well beyond the initial landmark-based mapping, leading to dense 3D reconstruction, non-rigid 3D reconstruction, and semantic mapping. The localisation accuracy of SLAM systems has also improved dramatically: the top 40 submissions on the KITTI odometry benchmark [17] have errors below 1%.

Thanks to these advancements, SLAM has enabled new applications and while opportunities for further improvement remain ahead, robustness is widely regarded as today’s most difficult challenge [9]. We define robustness as the capacity of a system to avoid fatal failures either by continuously performing accurately, or by detecting and quickly recovering from soft failures. A fatal failure is any failure that renders a system unable to perform its duties without external intervention, and is most commonly caused by environmental perturbations such as noise, dim or bright lighting, blurred frames, as well as short or long-term scene changes (e.g. dynamic objects). While some use cases only require episodic or short-term operation, many applications call for long-term deployment: home maintenance, autonomous inspection of industrial facilities, and so on. In the context of robot navigation, we refer to such long-term operation as Lifelong SLAM. Given the current capabilities and performance described above, we believe that the success of Lifelong SLAM is primarily dependent on the capacity of a system to be generally robust with respect to perturbations which may not be known a priori.

Previous efforts in evaluating the robustness of multiple SLAM systems have focused on specific types of perturbations [35, 42], often limited to a specific sensing modality [29, 41], without considering whether building in resilience against specific perturbations incurs trade-offs with respect to other challenging factors, and without measuring the general robustness of the system. Our work addresses this gap by introducing an evaluation methodology for assessing the robustness of SLAM solutions supporting various sensing modalities and degrees of freedom, in the presence of a variety of perturbations evaluated independently as well as in combination. We demonstrate the validity of our approach by performing an extensive evaluation of 6 SLAM systems (Table II) on 6 datasets (Table I) across 3 computing platforms, in both episodic and long-term operation settings. Figure LABEL:fig:samples contains a selection of frames with occlusions and dynamically-moving elements, illumination changes in real and synthetic scenes, frames from drone sequences containing motion blur and no reliable features, lifelong operation challenges, and colour frames containing lighting differences, blur, and dynamic objects. The accompanying video shows a qualitative comparison of 4 algorithms running on a sequence with dynamic elements.

II Related Work

While the problem of robustness has been acknowledged since the early days of SLAM [15, 27, 9], it remains one of the most significant challenges. We briefly review the literature on SLAM robustness with respect to perturbations relevant to our work.

Illumination changes may occur due to natural causes (e.g. varying sunlight) or artificial ones (e.g. blinking lightbulbs), translating into sudden changes in image brightness, either locally or globally. A large number of works rely on brightness constancy for mapping [57, 22, 16, 14], and may be negatively affected by such changes. Methods to improve robustness to illumination changes include active exposure control [60, 48, 24] and binary local descriptors for brightness normalisation, such as the Census transform [1, 2], while other works developed illumination-invariant metrics to register images [36, 56]. A detailed evaluation of the performance of direct methods under such perturbations is presented in [35], whose dataset we adopt.
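To illustrate how binary descriptors achieve brightness invariance, the following minimal sketch implements a basic Census transform; the function name and windowing scheme are illustrative and not taken from the cited works. Any monotonic brightness change (e.g. an affine change in exposure) leaves the per-pixel comparisons, and hence the descriptor, unchanged.

```python
import numpy as np

def census_transform(img, window=3):
    """Encode each pixel as a bit-string of comparisons against its
    neighbours in a (window x window) patch. Comparisons are invariant
    to any monotonic brightness change, so the descriptor is too.
    Borders wrap around for simplicity (np.roll)."""
    r = window // 2
    out = np.zeros(img.shape, dtype=np.uint32)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            if dy == 0 and dx == 0:
                continue  # skip the centre pixel
            shifted = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
            # append one comparison bit per neighbour
            out = (out << 1) | (shifted < img).astype(np.uint32)
    return out
```

Because only orderings matter, applying a positive affine change to the image produces an identical descriptor map.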

Dynamic elements are one of the most widely encountered types of perturbation: virtually all settings where SLAM is employed, from home robots to autonomous vehicles to augmented reality, are bound to feature movement. Over the years, a number of solutions have been proposed [43]. Given that the static part of a scene provides the most reliable information for computing the camera pose, many approaches to dynamic SLAM attempt to segment the input into static and dynamic parts. Methods include the use of optical flow [12, 45, 11], geometric constraints [51], alignment residuals [34], and semantic information [7, 3, 59, 46, 58, 31].

Fast camera movement on robots and drones often results in motion blur, hindering both feature detection and direct alignment, methods widely employed by SLAM systems. [40, 26, 33] use frame deblurring to ensure reliable features can be identified; FLaME [19] proposes low-quality but high-frequency depth estimation to aid obstacle avoidance in drone flight.

Lifelong SLAM and long-term localisation are long-standing problems [53, 21, 25, 13]. In the past year, new benchmarks challenging the state-of-the-art have appeared [52, 47], and promising results (usually based on detecting learned features) have been proposed for localisation [44, 54, 50] as well as SLAM [28]. We reuse the dataset and metrics of the Lifelong Robotic Vision challenge in our evaluation [47].

We aim for a comprehensive evaluation of the robustness of SLAM systems, but recognise that other factors, such as weather [39, 38] or limited visibility [23] are also of practical importance. Our methodology should help evaluate such perturbations in the future, as well as other factors (e.g. 3D reconstruction, semantic labelling).

III Methodology

III-A Evaluation workflow

We design our pipeline to support single- and multi-sequence inputs, using the latter for Lifelong SLAM evaluation. Our evaluation pipeline adopts and extends the tools for trajectory alignment, visualisation, and metric computation provided by the open-source SLAMBench framework [4, 6]. The software containing routines for configuring and initialising each system, streaming data into the algorithm, and collecting outputs (estimated poses and the state of the system) will be made public. Importantly, preparing an algorithm for evaluation only involves writing a thin wrapper around it and does not require modifying the code of the system.
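To make the integration effort concrete, such a thin wrapper could look like the following sketch; the class and method names are hypothetical and do not reflect the actual SLAMBench API.

```python
from abc import ABC, abstractmethod

class SLAMWrapper(ABC):
    """Hypothetical sketch of the thin per-algorithm wrapper the
    evaluation pipeline requires: configure/initialise the system,
    stream frames in, and collect estimated poses out."""

    @abstractmethod
    def init(self, settings):
        """Configure and initialise the underlying SLAM system."""

    @abstractmethod
    def process_frame(self, frame):
        """Feed one sensor frame; return the estimated pose, or None
        if the system has lost tracking."""

def run_sequence(wrapper, frames, settings):
    """Stream frames one at a time: the next frame is only sent after
    the previous one has finished processing, so no frames are dropped."""
    wrapper.init(settings)
    return [wrapper.process_frame(f) for f in frames]
```

The key design point is that the SLAM system's own code is untouched; only this adapter is written per algorithm.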

To assess the accuracy of each algorithm, we use the Absolute Trajectory Error (ATE) and Relative Pose Error (RPE) introduced with the TUM RGB-D dataset [49]. In contrast to most evaluation procedures, where alignment and metric computation are performed only on the final trajectory, we continuously monitor the ATE and RPE by realigning the trajectories using Umeyama's method [55] and measuring the errors every time the SLAM system outputs a new pose. To prevent algorithms from dropping frames, new data is sent only after the previous frame has finished processing.
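The per-pose monitoring described above can be sketched as follows: a minimal illustration of Umeyama similarity alignment followed by the translational ATE-RMSE, not the paper's actual implementation.

```python
import numpy as np

def umeyama_align(est, gt):
    """Least-squares similarity alignment (Umeyama, 1991) of an (N,3)
    estimated trajectory to ground truth: returns scale s, rotation R,
    and translation t such that s*R@e + t best fits g."""
    mu_e, mu_g = est.mean(0), gt.mean(0)
    E, G = est - mu_e, gt - mu_g
    cov = G.T @ E / len(est)            # cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                    # avoid reflections
    R = U @ S @ Vt
    var_e = (E ** 2).sum() / len(est)
    s = np.trace(np.diag(D) @ S) / var_e
    t = mu_g - s * R @ mu_e
    return s, R, t

def ate_rmse(est, gt):
    """Translational ATE-RMSE after aligning est to gt."""
    s, R, t = umeyama_align(est, gt)
    aligned = (s * (R @ est.T)).T + t
    return np.sqrt(((aligned - gt) ** 2).sum(axis=1).mean())
```

Re-running `ate_rmse` on the trajectory prefix each time a new pose arrives gives the continuously monitored error.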

Figure 3 shows spikes in error correlated with the input data causing them, allowing us to deduce the particular sensitivities of individual algorithms, as well as to identify scenes which are generally challenging for SLAM algorithms. Since performing these routines for every frame can be expensive and could affect measurements, we use existing mechanisms in SLAMBench to report execution times and resource usage by the SLAM algorithms independently of evaluation and trajectory alignment.

III-B Lifelong SLAM

Evaluating Lifelong SLAM entails simulating common long-term operation scenarios. Each algorithm is fed multiple sequences captured in the same environment, with aspects such as initial position, time of day, lighting, and so on, varying across sequences.

In addition to computing the per-sequence ATE and RPE, we adopt the Correct Rate of Tracking (CRT) metric [47]. Environmental perturbations may cause SLAM algorithms to lose tracking, and the ATE can be unevenly affected by such losses; e.g. losing tracking in the late stages of a sequence could have a significant impact on the mean ATE. The CRT metric, in contrast, measures the ratio of correctly tracked time with respect to the whole time span of the data. The correctness of each estimated pose is determined with user-specified thresholds on the ATE and other per-frame metrics. By combining the two metrics, we can better capture overall performance, tracking failures, and the time spent correctly tracking the camera pose.
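A minimal sketch of the CRT computation follows, under the assumption that each pose is credited with the time interval until the next pose; the function name and this time-attribution scheme are illustrative, not the challenge's exact definition.

```python
import numpy as np

def correct_rate_of_tracking(timestamps, ate_per_frame, threshold):
    """Fraction of the sequence's total time span during which the
    per-frame ATE stays below a user-specified threshold."""
    ts = np.asarray(timestamps, dtype=float)
    errors = np.asarray(ate_per_frame, dtype=float)
    # duration attributed to each pose: interval to the next timestamp
    dt = np.diff(ts, append=ts[-1])
    correct = errors < threshold
    span = ts[-1] - ts[0]
    return float((dt * correct).sum() / span)
```

Unlike the mean ATE, this ratio is insensitive to where in the sequence tracking was lost, only to how long it was lost for.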

IV Experiments

IV-A Experimental setup

We perform the experiments on the following three hardware platforms (Workstation, Laptop, and Jetson), running under 64-bit Ubuntu 18.04 OS:

The workstation is a desktop with 32 GB of RAM, a 14-core Intel Core i9-9940X chip (3.30GHz), and an Nvidia TITAN RTX GPU with 24GB VRAM and 4608 CUDA cores. The laptop is a Lenovo ThinkPad P53 with 16GB of RAM, a 6-core Intel Core i7-9850H (2.60GHz), and an Nvidia Quadro RTX 3000 with 1920 CUDA cores and 6GB of VRAM. The Jetson is an Nvidia Jetson Xavier AGX, a platform commonly used in ground robots, featuring an 8-core ARMv8.2 64-bit CPU (2.25GHz), 16 GB of RAM, and a 512-core Nvidia Volta GPU. The device is set up to deliver maximum performance, with a peak power use of 30W.

To control for any differences not inherent to the algorithms, we ensure that, wherever possible, on each platform any common dependencies undertaking significant computational tasks, such as OpenCV or g2o, are fixed to the same version across all the SLAM systems evaluated. We use gcc 7 for compilation across all algorithms and platforms, and CUDA 10.2 for GPU-based implementations. DVFS of the processing cores (Turbo Boost) and GPU (adaptive clocking) is disabled. All build processes have been modified to use the highest levels of compiler optimisation. The hyperparameters of the SLAM systems are configured following the recommendations of the original papers/repositories where available, and otherwise use the default settings.

Using the appropriate input modalities provided by each dataset (Table I), we evaluate 6 open-source SLAM systems selected to cover a diversity of designs with respect to input modalities and map representations.

OpenVINS [18] is a stereo visual-inertial SLAM system which uses an Extended Kalman Filter to fuse visual odometry with inertial measurements.

ORB-SLAM2 [32] is a popular real-time SLAM system based on sparse ORB features. It supports RGB-D, monocular, and stereo input modalities.

ORB-SLAM3 [10] is a recently released SLAM system developed on top of ORB-SLAM2 which introduces a multiple map system and visual-inertial odometry to improve robustness.

ElasticFusion [57] provides a globally-consistent dense RGB-D reconstruction approach that does not require a pose graph and represents the map using fused surfels [37].

FullFusion [7] is a framework for semantic reconstruction of dynamic scenes. FullFusion leverages semantic information to separate RGB-D inputs into a static and a dynamic frame. A modified implementation of KinectFusion is used to compute the pose and reconstruct a semantically labelled model of the static scene elements.

ReFusion [34] is a dense RGB-D 3D reconstruction method which exploits residuals obtained after the registration of input data with the reconstructed model to identify and filter out dynamic elements in the scene.
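The residual-based idea behind ReFusion can be illustrated with a short sketch; the function, data layout, and threshold are hypothetical simplifications of the actual method, which operates on a full reconstructed model rather than a single rendered depth image.

```python
import numpy as np

def residual_static_mask(depth_frame, rendered_depth, thresh=0.10):
    """After registering the input frame against the reconstructed
    model, pixels whose depth residual is large disagree with the
    static model and are treated as dynamic; the returned boolean
    mask (True = likely static) selects pixels used for tracking."""
    residual = np.abs(depth_frame - rendered_depth)
    return residual < thresh
```

Because the filtering is purely geometric, it requires no semantic classes, which is why such methods generalise to arbitrary moving objects.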

The datasets have been chosen to cover a wide range of conditions common in SLAM applications:

  • Camera motion / hardware platform: ground robot, aerial vehicle, handheld sensor, linear motion (in synthetic scenes).

  • Scene type: synthetic, indoor, outdoor, empty corridors, busy market or cafe.

  • Lighting: sudden exposure changes, daylight, night, continuously changing local and global illumination, flashlights.

  • Movement: varying levels of movement, both rigid and non-rigid. All combinations of static and moving camera with static and moving scenes.

  • Sensors: RGB-D, stereo cameras, IMU, wheel odometry; sequences featuring sensor degradation.

An experiment refers to the execution of one SLAM system (Table II) on one sequence of a given dataset (Table I). Each experiment is performed 10 times on each of the 3 platforms, generating a total of approximately 20,000 data points.

Given the large number of experiments and the need to differentiate by platform and perturbation, the paper contains only a subset of the results and metrics; full data is available on the website. We adopt the following strategy to present aggregated data under each setting: for each sequence, we compute the median of the translational ATE-RMSE over the 10 runs, normalised by the metric length of the sequence to ensure equal weighting across sequences. Note that the aggregate plots may not always be representative of performance on individual sequences.
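This aggregation strategy can be sketched as follows, assuming per-sequence lists of ATE-RMSE values over the 10 runs and known metric sequence lengths (the data layout and names are illustrative).

```python
import numpy as np

def aggregate_ate(runs_ate_rmse, sequence_lengths):
    """For each sequence, take the median translational ATE-RMSE over
    the repeated runs and normalise by the sequence's metric length,
    so long and short sequences are weighted equally."""
    return {
        seq: float(np.median(runs)) / sequence_lengths[seq]
        for seq, runs in runs_ate_rmse.items()
    }
```

The median discards outlier runs (e.g. a single hard crash), while length normalisation makes scores comparable across sequences.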

Fig. 1: Baseline performance results on (a) TUM, (b) ICL-NUIM, and (c) EuRoC-MAV.

IV-B Results

Baseline performance – We evaluate the trajectory estimation accuracy of each SLAM system on selected sequences of widely-adopted datasets where no significant perturbations are present. The RGB-D based SLAM systems are evaluated on 12 sequences from the TUM freiburg1 and freiburg2 datasets [49] and the 4 sequences of the ICL-NUIM living room dataset [20]. Our results (Figures 1-a and 1-b) are consistent with the existing literature. ORB-SLAM2, ORB-SLAM3, and ElasticFusion are accurate within 1% on all sequences, and no individual run exceeded 3% error. FullFusion and ReFusion kept their ATE below 3% on most runs, but scored worse than the aforementioned systems (with few exceptions). ORB-SLAM3 is the most accurate in this baseline setting, with ORB-SLAM2 close behind.

SLAM systems supporting stereo and visual-inertial SLAM are evaluated on the 7 easy and medium sequences of the EuRoC-MAV dataset. Figure 1-c shows that all 3 SLAM systems have similar accuracies and performed within a 0.5% error margin. Unexpectedly, on some of the machine hall sequences, ORB-SLAM3 in stereo VIO mode performed slightly worse than ORB-SLAM2 in stereo mode.

Fig. 2: ReFusion, ORB-SLAM3 and FullFusion executing the moving_nonobstructing_box sequence of the Bonn dataset. The red rectangle highlights the period of time when a person enters the scene, moves a box and leaves.

Illumination changes – We use the ETH Illumination dataset to analyse the resilience of SLAM systems on 3 real and 10 synthetic RGB-D sequences. The dataset features multiple types of illumination change: local, global, local and global, and flashlight. The real sequences are captured with a handheld Kinect v1 sensor in an environment closely resembling the TUM RGB-D setting; the synthetic scenes are adapted from the ICL-NUIM dataset. Thanks to the illumination invariance of ORB features, both ORB-SLAM2 and ORB-SLAM3 appear unaffected by any type of illumination change, obtaining scores similar to their TUM and ICL-NUIM baselines. In contrast, ElasticFusion and ReFusion use photometric errors which assume constant illumination, leading to high error rates. Figure 3 highlights the effects of illumination changes on ReFusion.

Fig. 3: Single run of ReFusion on the syn2 sequence with both local and global illumination changes. Significant spikes in error occur during changes in illumination (top). Bottom: frames corresponding to the highlighted area.

Dynamic elements — We use the 24 dynamic sequences in the Bonn RGB-D Dataset. All scenes were captured in the same space and include people handling objects such as boxes and balloons. Varied levels of movement are present, ranging from mostly static scenes to complete occlusion of the background by moving objects for extended periods.

Having been published together with the Bonn dataset, ReFusion performs best in the presence of dynamic elements. Nonetheless, ORB-SLAM2 and ORB-SLAM3, which rely mostly on background keypoints, are more accurate than ReFusion on scenes with negligible movement or with untextured dynamic elements, and can recover when dynamic objects briefly enter and leave the frame, but fail under severe motion. Figure 2 highlights the moderate increase in error for ReFusion and FullFusion compared to a significant spike in error for ORB-SLAM3. FullFusion's segmentation module relies on semantics to remove dynamic objects from frames; as such, FullFusion performs well only when recognised classes are present in the scene (person), but is highly sensitive to any other movement, often experiencing failures (balloon and box sequences), unlike more general algorithms such as ReFusion. As expected, because it uses all the data in the frame and assumes only camera movement, ElasticFusion is severely affected: a noticeable drop in accuracy occurs as soon as a dynamic object enters the scene, without subsequent recovery, and it fails entirely on the highly dynamic scenes.

Lifelong SLAM — OpenLORIS-Scene is a comprehensive dataset featuring a total of 22 sequences captured in 5 common environments (office, corridor, home, cafe, and market) at different times of the day using commercial service robots. Compared to most SLAM datasets which often present tightly controlled scenarios, OpenLORIS contains realistic settings for service robots, and a wide variety of challenging factors: occlusions, dynamic motion, featureless areas, and lighting changes.

OpenLORIS is the most challenging of the datasets. Figure 5 illustrates the ATE for a subset of the sequences evaluated, and Figure 6 illustrates the CRT metric for all sequences. We can observe, for example, that although the ATE on the cafe2 sequence may be under 1 meter for ReFusion, ElasticFusion, and FullFusion, their CRT reveals significant portions of frames where the error was larger than 3 meters. ORB-SLAM2 and ORB-SLAM3 are severely affected in textureless environments; in particular, most of the corridor and home sequences disproportionately affect sparse algorithms. ReFusion performed well in the presence of dynamic objects as long as they moved in a consistent fashion. However, on the market sequences, where people often moved and stopped, artefacts were produced in the reconstruction, impacting pose estimation accuracy. FullFusion performs well when it is able to recognise dynamic objects, but tends to drift whenever unknown objects enter the scene.

IV-C Other Observations

In assessing the robustness of a SLAM system, one should consider not only variation across perturbations, but also matters of portability, setup, ease of use, consistency, and operation in previously untested environments.

Jetson      | 40 | 25  | 0.2 | 10   | 5  | 5
Laptop      | 50 | 30  | 10  | 17.5 | 5  | 10
Workstation | 50 | 150 | 15  | 20   | 10 | 12
TABLE III: Average frame rate (FPS) for each SLAM algorithm (one column per SLAM system).

Setup and execution — ORB-SLAM2 and ORB-SLAM3 produced hard crashes (segfault) more than 10% of the time across all platforms, requiring frequent restarts. Additionally, their reliance on old dependencies made it hard to identify working versions across all algorithms. Discrepancies in performance across platforms may relate to different versions of these dependencies. OpenVINS is highly sensitive to correct initial parameters, which may not always be available in deployment. We were not able to find working hyperparameters for the OpenLORIS dataset. FullFusion attempts to compute dynamic masks whether or not there are any dynamic objects in the scene, resulting in slightly lower accuracy as well as up to 80% lower frame rate on compute-constrained platforms compared to disabling masks on sequences known to be static. ReFusion sees a drastic drop in frame rate to 0.2 FPS on the Jetson from 10-15 FPS on Laptop and Workstation.

Consistency — While overall we have found no major discrepancies between the results on each platform, ORB-SLAM2 and ORB-SLAM3 were the least consistent across runs and across platforms, with some variability observed in other SLAM systems (see Figure 4). At the other end, OpenVINS performed almost identically across all runs for any sequence on a given platform.

Fig. 4: Consistency evaluation using 30 runs on the fr1_360 sequence.
Fig. 5: OpenLORIS – accuracy results for a subset of scenes: cafe2, corridor1, corridor4, corridor5, home2, market2, office1, office2, office4, office5, office6, office7.
Fig. 6: Correct Rate of Tracking, displayed as the percentage of frames within an absolute error threshold on OpenLORIS sequences, using the threshold values suggested in the dataset (one for office, one for home and cafe, and one for corridor and market sequences).

IV-D Summary of Results

Table IV provides a summary of the results presented. The second to fourth columns show the dense SLAM systems. For accuracy on the baseline datasets (TUM, ICL-NUIM, EuRoC-MAV), no SLAM system is classified as Excellent because, although their ATEs show good accuracy, each has sequences where accuracy problems occur. ElasticFusion is fast and accurate when no perturbations are present, but generally not robust because its photometric error assumes fixed coefficients for the RGB channels. ReFusion is not very accurate on the baseline datasets and, with a photometric error similar to ElasticFusion's, is sensitive to illumination changes; however, it is robust to dynamic objects. FullFusion is robust to illumination changes because it only uses depth data for mapping, although this can be a disadvantage in structureless areas; it has also proven sensitive to unrecognised dynamic objects.

The fifth to seventh columns present the sparse SLAM systems. OpenVINS is the most consistent across runs and across platforms, and attains high accuracy on drone sequences, but is strictly visual-inertial and could not be tested on datasets without IMU data. ORB-SLAM2 and ORB-SLAM3 cover the broadest variety of input modalities. Consistent with the published ORB-SLAM3 [10] results, but using different datasets, we have found that the addition of a VIO mode over ORB-SLAM2 and the multi-map merging scheme improves robustness against temporary tracking loss (usually caused by dynamic objects or fast movement).

We use 5 categories to qualitatively describe the results. Excellent means almost perfect: the system performs similarly with or without perturbations. This is only awarded for Illumination, where FullFusion does not use colour data and ORB-SLAM performs well even in low light. Very Good means the system handles the vast majority of perturbations and does not fail even when they are significant; it may fail when a severe perturbation persists for a long period, but it is often able to recover from short-term failures. Good captures mostly robust outcomes; an example is FullFusion, which is robust against a set number of classes but may fail when encountering unknown ones. Acceptable captures SLAM systems with some robustness, which can deal with perturbations for a short amount of time; for example, ORB-SLAM3 can recover in the presence of dynamic objects if they appear for only a couple of frames or occupy a small portion of the frame, but will fail otherwise.

*FullFusion is not impacted by illumination changes as it does not use color information.

TABLE IV: Overall robustness – A qualitative summary of the experiments.

V Conclusions

This paper has presented a systematic evaluation of the robustness of 6 open-source state-of-the-art SLAM algorithms with respect to challenging conditions, such as fast motion, non-uniform illumination, and dynamic scenes. The experiments have covered 6 datasets across 3 computing platforms, in both episodic and long-term operation settings. Thus, this evaluation is the most comprehensive study of the robustness of SLAM systems to date. By including the Nvidia Jetson Xavier platform, we also consider constraints associated with deployments on systems embedded within robots.

Overall, we have found that ORB-SLAM3 provides the best balance between baseline accuracy, robustness to illumination changes and fast motion, support for dynamic environments, and Lifelong scenarios, although its frame rate stays below 15 FPS (5 FPS on the Jetson). Among the three dense SLAM systems, FullFusion provides the best balance, but reaches 30 FPS only on the laptop and workstation (25 FPS on the Jetson). ElasticFusion offers 40-50 FPS on all three platforms, but its robustness falls below that of the other SLAM systems.

Finally, the sparse SLAM systems have proved more robust than the dense ones, probably because there are fewer data points which can negatively impact pose estimation. We consider that combining sparse tracking with dense 3D reconstruction will help systems build expressive representations while maintaining high robustness.


This research is supported by the EPSRC, grant RAIN Hub EP/R026084/1. Mikel Luján is supported by an Arm/RAEng Research Chair Award and a Royal Society Wolfson Fellowship. Thanks to Patrick Geneva for assisting with experiments on OpenVINS. Thanks to all researchers who provided the datasets.


  • [1] H. Alismail, B. Browning, and S. Lucey (2016) Direct visual odometry using bit-planes. arXiv preprint arXiv:1604.00990. Cited by: §II.
  • [2] H. Alismail, M. Kaess, B. Browning, and S. Lucey (2016) Direct visual odometry in low light using binary descriptors. IEEE Robotics and Automation Letters 2 (2), pp. 444–451. Cited by: §II.
  • [3] B. Bescos, J. M. Fácil, J. Civera, and J. Neira (2018) DynaSLAM: tracking, mapping, and inpainting in dynamic scenes. IEEE Robotics and Automation Letters 3 (4), pp. 4076–4083. Cited by: §II.
  • [4] B. Bodin, H. Wagstaff, S. Saeedi, L. Nardi, E. Vespa, J. Mawer, A. Nisbet, M. Luján, S. Furber, A. J. Davison, et al. (2018) SLAMBench2: multi-objective head-to-head benchmarking for visual SLAM. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 3637–3644. Cited by: §III-A.
  • [5] M. Bosse, P. Newman, J. Leonard, and S. Teller (2004) Simultaneous localization and map building in large-scale cyclic environments using the atlas framework. The International Journal of Robotics Research 23 (12), pp. 1113–1139. Cited by: §I.
  • [6] M. Bujanca, P. Gafton, S. Saeedi, A. Nisbet, B. Bodin, M. F. P. O’Boyle, A. J. Davison, G. Riley, B. Lennox, M. Luján, and S. Furber (2019) SLAMBench 3.0: systematic automated reproducible evaluation of SLAM systems for robot vision challenges and scene understanding. In IEEE International Conference on Robotics and Automation (ICRA), pp. 6351–6358. Cited by: §III-A.
  • [7] M. Bujanca, M. Luján, and B. Lennox (2019) FullFusion: a framework for semantic reconstruction of dynamic scenes. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 0–0. Cited by: TABLE II, §II, §IV-A.
  • [8] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart (2016) The EuRoC micro aerial vehicle datasets. The International Journal of Robotics Research. Cited by: TABLE I.
  • [9] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I. Reid, and J.J. Leonard (2016) Past, present, and future of simultaneous localization and mapping: towards the robust-perception age. IEEE Transactions on Robotics 32 (6), pp. 1309–1332. Cited by: §I, §II.
  • [10] C. Campos, R. Elvira, J. J. Gomez, J. M. M. Montiel, and J. D. Tardos (2020) ORB-SLAM3: an accurate open-source library for visual, visual-inertial and multi-map SLAM. arXiv preprint arXiv:2007.11898. Cited by: TABLE II, §IV-A, §IV-D.
  • [11] J. Cheng, Y. Sun, and M. Q. Meng (2019) Improving monocular visual slam in dynamic environments: an optical-flow-based approach. Advanced Robotics 33 (12), pp. 576–589. Cited by: §II.
  • [12] M. Derome, A. Plyer, M. Sanfourche, and G. L. Besnerais (2015) Moving object detection in real-time using stereo from a mobile platform. Unmanned Systems 3 (04), pp. 253–266. Cited by: §II.
  • [13] E. Einhorn and H. Gross (2015) Generic ndt mapping in dynamic environments and its application for lifelong slam. Robotics and Autonomous Systems 69, pp. 28–39. Cited by: §II.
  • [14] J. Engel, T. Schöps, and D. Cremers (2014) LSD-slam: large-scale direct monocular slam. In European Conference on Computer Vision, pp. 834–849. Cited by: §II.
  • [15] J. B. Folkesson and H. I. Christensen (2004) Robust slam. IFAC Proceedings Volumes 37 (8), pp. 722–727. Cited by: §II.
  • [16] C. Forster, M. Pizzoli, and D. Scaramuzza (2014) SVO: fast semi-direct monocular visual odometry. In IEEE International Conference on Robotics and Automation (ICRA), pp. 15–22. Cited by: §II.
  • [17] A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? The KITTI vision benchmark suite. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §I.
  • [18] P. Geneva, K. Eckenhoff, W. Lee, Y. Yang, and G. Huang (2020) OpenVINS: a research platform for visual-inertial estimation. In IEEE International Conference on Robotics and Automation (ICRA), Paris, France. Cited by: TABLE II, §IV-A.
  • [19] W. N. Greene and N. Roy (2017) FLaME: fast lightweight mesh estimation using variational smoothing on delaunay graphs. In IEEE International Conference on Computer Vision (ICCV), pp. 4696–4704. Cited by: §II.
  • [20] A. Handa, T. Whelan, J.B. McDonald, and A.J. Davison (2014-05) A benchmark for RGB-D visual odometry, 3D reconstruction and SLAM. In IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, pp. 1524–1531. Cited by: TABLE I, §IV-B.
  • [21] H. Johannsson (2013) Toward lifelong visual localization and mapping. Ph.D. Thesis, Massachusetts Institute of Technology. Cited by: §II.
  • [22] C. Kerl, J. Sturm, and D. Cremers (2013) Dense visual slam for rgb-d cameras. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2100–2106. Cited by: §II.
  • [23] A. Kim and R. M. Eustice (2013) Real-time visual slam for autonomous underwater hull inspection using visual saliency. IEEE Transactions on Robotics 29 (3), pp. 719–733. Cited by: §II.
  • [24] P. Kim, B. Coltin, O. Alexandrov, and H. J. Kim (2017) Robust visual localization in changing lighting conditions. In IEEE International Conference on Robotics and Automation (ICRA), pp. 5447–5452. Cited by: §II.
  • [25] H. Kretzschmar, G. Grisetti, and C. Stachniss (2010) Lifelong map learning for graph-based SLAM in static environments. KI-Künstliche Intelligenz 24 (3), pp. 199–206. Cited by: §II.
  • [26] H. S. Lee, J. Kwon, and K. M. Lee (2011) Simultaneous localization, mapping and deblurring. In IEEE International Conference on Computer Vision (ICCV), pp. 1203–1210. Cited by: §II.
  • [27] J. Levinson and S. Thrun (2010) Robust vehicle localization in urban environments using probabilistic maps. In IEEE International Conference on Robotics and Automation (ICRA), pp. 4372–4378. Cited by: §II.
  • [28] D. Li, X. Shi, Q. Long, S. Liu, W. Yang, F. Wang, Q. Wei, and F. Qiao (2020) DXSLAM: a robust and efficient visual SLAM system with deep features. arXiv preprint arXiv:2008.05416. Cited by: §II.
  • [29] J. Lomps, A. Lind, and A. Hadachi (2020) Evaluation of the robustness of visual SLAM methods in different environments. arXiv preprint arXiv:2009.05427. Cited by: §I.
  • [30] S. Lynen, T. Sattler, M. Bosse, J. A. Hesch, M. Pollefeys, and R. Siegwart (2015) Get out of my lab: large-scale, real-time visual-inertial localization.. In Robotics: Science and Systems, vol. 1, Cited by: §I.
  • [31] X. Mu, B. He, X. Zhang, T. Yan, X. Chen, and R. Dong (2019) Visual navigation features selection algorithm based on instance segmentation in dynamic environment. IEEE Access 8, pp. 465–473. Cited by: §II.
  • [32] R. Mur-Artal and J. D. Tardós (2017) ORB-SLAM2: an open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics 33 (5), pp. 1255–1262. Cited by: TABLE II, §IV-A.
  • [33] J. Mustaniemi, J. Kannala, S. Särkkä, J. Matas, and J. Heikkilä (2018) Fast motion deblurring for feature detection and matching using inertial measurements. In 2018 24th International Conference on Pattern Recognition (ICPR), pp. 3068–3073. Cited by: §II.
  • [34] E. Palazzolo, J. Behley, P. Lottes, P. Giguère, and C. Stachniss (2019) ReFusion: 3D reconstruction in dynamic environments for RGB-D cameras exploiting residuals. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 7855–7862. Cited by: TABLE I, TABLE II, §II, §IV-A.
  • [35] S. Park, T. Schöps, and M. Pollefeys (2017) Illumination change robustness in direct visual SLAM. In IEEE International Conference on Robotics and Automation (ICRA), Cited by: TABLE I, §I, §II.
  • [36] G. Pascoe, W. Maddern, M. Tanner, P. Piniés, and P. Newman (2017) NID-SLAM: robust monocular SLAM using normalised information distance. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1435–1444. Cited by: §II.
  • [37] H. Pfister, M. Zwicker, J. Van Baar, and M. Gross (2000) Surfels: surface elements as rendering primitives. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pp. 335–342. Cited by: §IV-A.
  • [38] H. Porav, T. Bruls, and P. Newman (2019) Don’t worry about the weather: unsupervised condition-dependent domain adaptation. In 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pp. 33–40. Cited by: §II.
  • [39] H. Porav, T. Bruls, and P. Newman (2019) I can see clearly now: image restoration via de-raining. In IEEE International Conference on Robotics and Automation (ICRA), pp. 7087–7093. Cited by: §II.
  • [40] A. Pretto, E. Menegatti, M. Bennewitz, W. Burgard, and E. Pagello (2009) A visual odometry framework robust to motion blur. In IEEE International Conference on Robotics and Automation (ICRA), pp. 2250–2257. Cited by: §II.
  • [41] D. Prokhorov, D. Zhukov, O. Barinova, K. Anton, and A. Vorontsova (2019) Measuring robustness of visual slam. In 2019 16th International Conference on Machine Vision Applications (MVA), pp. 1–6. Cited by: §I.
  • [42] O. Roesler and V. P. Ravindranath (2020) Evaluation of SLAM algorithms for highly dynamic environments. In Robot 2019: Fourth Iberian Robotics Conference, M. F. Silva, J. Luís Lima, L. P. Reis, A. Sanfeliu, and D. Tardioli (Eds.), Cham, pp. 28–36. External Links: ISBN 978-3-030-36150-1 Cited by: §I.
  • [43] M. R. U. Saputra, A. Markham, and N. Trigoni (2018) Visual SLAM and structure from motion in dynamic environments: a survey. ACM Computing Surveys (CSUR) 51 (2), pp. 1–36. Cited by: §II.
  • [44] P. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich (2020) SuperGlue: learning feature matching with graph neural networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4938–4947. Cited by: §II.
  • [45] R. Scona, M. Jaimez, Y. R. Petillot, M. Fallon, and D. Cremers (2018) StaticFusion: background reconstruction for dense RGB-D SLAM in dynamic environments. In IEEE International Conference on Robotics and Automation (ICRA), pp. 1–9. Cited by: §II.
  • [46] C. Sheng, S. Pan, W. Gao, Y. Tan, and T. Zhao (2020) Dynamic-DSO: direct sparse odometry using objects semantic information for dynamic environments. Applied Sciences 10 (4), pp. 1467. Cited by: §II.
  • [47] X. Shi, D. Li, P. Zhao, Q. Tian, Y. Tian, Q. Long, C. Zhu, J. Song, F. Qiao, L. Song, et al. (2020) Are we ready for service robots? the OpenLORIS-Scene datasets for lifelong SLAM. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 3139–3145. Cited by: TABLE I, §II, §III-B.
  • [48] I. Shim, T. Oh, J. Lee, J. Choi, D. Choi, and I. S. Kweon (2018) Gradient-based camera exposure control for outdoor mobile platforms. IEEE Transactions on Circuits and Systems for Video Technology 29 (6), pp. 1569–1583. Cited by: §II.
  • [49] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers (2012-Oct.) A benchmark for the evaluation of RGB-D SLAM systems. In International Conference on Intelligent Robots and Systems (IROS), Cited by: TABLE I, §III-A, §IV-B.
  • [50] H. Taira, M. Okutomi, T. Sattler, M. Cimpoi, M. Pollefeys, J. Sivic, T. Pajdla, and A. Torii (2018) InLoc: indoor visual localization with dense matching and view synthesis. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7199–7209. Cited by: §II.
  • [51] W. Tan, H. Liu, Z. Dong, G. Zhang, and H. Bao (2013) Robust monocular SLAM in dynamic environments. In 2013 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pp. 209–218. Cited by: §II.
  • [52] The long-term visual localization challenge. Cited by: §II.
  • [53] G. D. Tipaldi, D. Meyer-Delius, and W. Burgard (2013) Lifelong localization in changing environments. The International Journal of Robotics Research 32 (14), pp. 1662–1678. Cited by: §II.
  • [54] A. Torii, R. Arandjelović, J. Sivic, M. Okutomi, and T. Pajdla (2015) 24/7 place recognition by view synthesis. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §II.
  • [55] S. Umeyama (1991) Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on Pattern Analysis & Machine Intelligence (4), pp. 376–380. Cited by: §III-A.
  • [56] F. Wang and B. C. Vemuri (2007) Non-rigid multi-modal image registration using cross-cumulative residual entropy. International Journal of Computer Vision 74 (2), pp. 201–215. Cited by: §II.
  • [57] T. Whelan, S. Leutenegger, R. F. Salas-Moreno, B. Glocker, and A. J. Davison (2015) ElasticFusion: dense SLAM without a pose graph. Proc. Robotics: Science and Systems, Rome, Italy. Cited by: TABLE II, §II, §IV-A.
  • [58] L. Xiao, J. Wang, X. Qiu, Z. Rong, and X. Zou (2019) Dynamic-SLAM: semantic monocular visual localization and mapping based on deep learning in dynamic environment. Robotics and Autonomous Systems 117, pp. 1–16. Cited by: §II.
  • [59] C. Yu, Z. Liu, X. Liu, F. Xie, Y. Yang, Q. Wei, and Q. Fei (2018) DS-SLAM: a semantic visual SLAM towards dynamic environments. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1168–1174. Cited by: §II.
  • [60] Z. Zhang, C. Forster, and D. Scaramuzza (2017) Active exposure control for robust visual odometry in HDR environments. In IEEE International Conference on Robotics and Automation (ICRA), pp. 3894–3901. Cited by: §II.