Robotic applications such as guidance and search and rescue require a model of the environment for their tasks. A precondition for an autonomous robot is to obtain an accurate model of its environment. A major obstacle in obtaining this model is pose uncertainty. The mobile robot mapping problem under pose uncertainty is often referred to as the simultaneous localization and mapping (SLAM) or concurrent mapping and localization (CML) problem [1, 2, 3].
SLAM is a complex problem because a robot needs a consistent map in order to localize itself, yet to obtain such a map, the robot requires a good estimate of its location. The limited range and field of view of the ranging sensor add to the difficulty. This mutual dependency between the pose and map estimates makes the SLAM problem hard. Even a small error in the map could prevent a robot from working in the environment. Hence it is important to tackle the SLAM problem.
Many different techniques to tackle the SLAM problem have been presented, such as Extended Kalman Filter SLAM (EKF-SLAM), the Sparse Extended Information Filter (SEIF), the Extended Information Form (EIF), FastSLAM, and GraphSLAM. Several evaluation metrics [4, 5, 6, 7] have also been proposed for comparing the results of different SLAM algorithms. For grid-based estimation techniques, maps are often compared by visual inspection or by overlaying them on blueprints of buildings. This kind of traditional evaluation becomes more and more difficult as new SLAM algorithms show increasing capabilities. Meaningful comparisons between different SLAM algorithms require common performance metrics that enable the user to compare the outcomes of different mapping approaches on the same dataset. This research applies two such common evaluation metrics to the results obtained from popular SLAM algorithms.
This paper uses the evaluation metrics proposed in  to evaluate the performance of two popular SLAM methods: RGB-D SLAM and RTAB-Map. The metrics used for the comparison are the Absolute Trajectory Error (ATE) and the Relative Pose Error (RPE).
The TUM RGB-D dataset  is applied to both algorithms, and the resulting trajectory estimates are compared to the ground truth by evaluating the Absolute Trajectory Error (ATE), the Relative Pose Error (RPE), and the time taken to process the sequence of images. The sequences are intentionally selected to depict the difficulties encountered by SLAM algorithms operating in the real world. The results of the evaluation are analyzed to determine the most suitable algorithm for different situations.
2 System Configuration
The system used for testing has the following configuration:

| Component | Specification |
| --- | --- |
| CPU | Intel Core i7-7700HQ @ 2.80 GHz × 8 |
| Operating System | Ubuntu 14.04 LTS |
| GPU | GeForce GTX 1050 Ti/PCIe/SSE2 |
3 Algorithms Selected for Evaluation
3.1 RGB-D SLAM
RGB-D SLAM  works well with a hand-held Kinect-style depth sensor. It uses visual features such as SURF or SIFT to match pairs of acquired images, and uses RANSAC to robustly estimate the 3D transformation between them. To achieve online processing, the current image is matched only against a subset of the previous images. Subsequently, it constructs a graph whose nodes correspond to camera views and whose edges correspond to the estimated 3D transformations. The graph is then optimized with HOG-Man  to reduce the accumulated pose errors.
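The pairwise front-end described above can be sketched as follows. This is an illustrative numpy implementation of least-squares rigid alignment inside a RANSAC loop over matched 3D keypoints; it is not the actual RGB-D SLAM code, and the function names, iteration count, and inlier threshold are assumptions.

```python
import numpy as np

def rigid_transform(src, dst):
    """Least-squares rigid transform (R, t) mapping src -> dst
    (closed-form SVD solution, as in Horn's method)."""
    c_src, c_dst = src.mean(0), dst.mean(0)
    H = (src - c_src).T @ (dst - c_dst)
    U, _, Vt = np.linalg.svd(H)
    # Sign correction guarantees a proper rotation (det(R) = +1).
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = c_dst - R @ c_src
    return R, t

def ransac_transform(src, dst, iters=200, thresh=0.05, rng=None):
    """RANSAC over minimal 3-point samples; keeps the hypothesis with the
    most inliers and refits on them."""
    rng = rng if rng is not None else np.random.default_rng(0)
    best_inliers = np.zeros(len(src), bool)
    for _ in range(iters):
        idx = rng.choice(len(src), 3, replace=False)
        R, t = rigid_transform(src[idx], dst[idx])
        err = np.linalg.norm(src @ R.T + t - dst, axis=1)
        inliers = err < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    R, t = rigid_transform(src[best_inliers], dst[best_inliers])
    return R, t, best_inliers
```

In the full system, the resulting transformations become the edge constraints of the pose graph that is later optimized.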
3.2 RTAB-Map

RTAB-Map (Real-Time Appearance-Based Mapping) is an RGB-D, stereo, and lidar graph-based SLAM approach built on an incremental appearance-based loop closure detector. The loop closure detector uses a bag-of-words approach to determine how likely it is that a new image comes from a previous location rather than a new location. When a loop closure hypothesis is accepted, a new constraint is added to the map's graph, and a graph optimizer then minimizes the errors in the map. A memory management approach described in  is used to limit the number of locations used for loop closure detection and graph optimization, so that real-time constraints in large-scale environments are always respected. RTAB-Map can be used alone with a handheld Kinect, a stereo camera, or a 3D lidar for 6DoF mapping, or on a robot equipped with a laser rangefinder for 3DoF mapping.
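To illustrate the bag-of-words idea behind appearance-based loop closure (this is a minimal tf-idf cosine-similarity sketch, not RTAB-Map's actual Bayesian filter; the function name and weighting scheme are assumptions), scoring the current frame's visual-word histogram against past locations might look like:

```python
import numpy as np

def loop_closure_scores(query_hist, past_hists):
    """Cosine similarity between the tf-idf weighted visual-word histogram
    of the current frame and those of previously visited locations."""
    H = np.vstack([query_hist] + list(past_hists)).astype(float)
    df = (H > 0).sum(axis=0)                  # document frequency per word
    idf = np.log(len(H) / np.maximum(df, 1))  # rare words weigh more
    W = H * idf                               # tf-idf weighting
    q, P = W[0], W[1:]
    denom = np.linalg.norm(q) * np.linalg.norm(P, axis=1)
    return (P @ q) / np.maximum(denom, 1e-12)
```

A high score for some past location would then raise a loop closure hypothesis, which the real system validates before adding a graph constraint.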
4 Dataset

The TUM RGB-D dataset  is a large set of data with sequences containing both RGB-D data and ground truth pose estimates from a motion capture system. The following seven sequences used in this analysis depict different situations and are intended to test the robustness of algorithms under these conditions.
4.1 freiburg2 desk with person
This sequence is a typical office scene with a person sitting at a desk. The person moves and interacts with some of the objects on the table. This sequence is intended for checking the robustness of a SLAM system against dynamic objects and persons, but it can also be used to differentiate maps and find changes in the scene.
4.2 freiburg2 pioneer 360 robot slam
This sequence was recorded from a Kinect mounted on top of an ActivMedia Pioneer 3 robot (see Figure 0(b)). With these sequences, it becomes possible to demonstrate the applicability of SLAM systems to wheeled robots. Due to the large dimensions of the hall, the Kinect could not observe the depth of the distant walls for parts of the sequence. Several objects, such as office containers, boxes, and other feature-poor objects, are placed throughout the scene. As a consequence, this sequence has depth but is highly challenging for methods that rely on distinctive keypoints.
4.3 freiburg2 360 kidnap
In this sequence the Kinect sensor is covered several times while it is pointed to a different location ("kidnap") for testing algorithms that can recover from tracking problems.
4.4 freiburg3 long office household
The RGB-D sensor is moved along a large round through a household and office scene with much texture and structure. The end of the trajectory overlaps with the beginning so that there is a large loop closure.
4.5 freiburg3 no structure no texture near with loop
The RGB-D sensor is moved at a height of approximately one meter along a planar, wooden surface of approximately 3 m × 3 m. This sequence intentionally has little to no visible structure and texture. The beginning and the end of the sequence overlap, i.e., there is a loop closure.
4.6 freiburg3 no structure texture near with loop
The RGB-D sensor has been moved at a height of one meter in a circle above a textured, planar surface. The texture is highly discriminative, as it consists of several conference posters. The beginning and the end of the trajectory overlap, so that there is a loop closure.
4.7 freiburg3 sitting static
In this sequence, two persons sit at a desk, talk, and gesticulate. The sensor has been kept in place manually. This sequence is intended to evaluate the robustness of visual SLAM and odometry algorithms to slowly moving dynamic objects.
5 Evaluation Metrics
The output of a SLAM algorithm is an estimated camera trajectory along with an estimate of the resulting map. While it is in principle possible to evaluate the quality of the resulting map, accurate ground truth maps are impossible to obtain as there are various uncontrollable factors involved. So we base our analysis on the quality of the estimated trajectory obtained from a sequence of RGB images. Even though an accurate trajectory does not imply a good map and error-free operation of the robot, it is the most common parameter used to measure the accuracy of SLAM algorithms. For the evaluation, we assume that the output of the algorithm is a sequence of poses $P_1, \dots, P_n$ from the estimated trajectory and $Q_1, \dots, Q_n$ from the ground truth trajectory.
For simplicity of notation, it is assumed that the two sequences are time-synchronized, equally sampled, and both of length $n$. In reality, however, the two sequences differ in sampling rate and length and contain missing data, so an additional data association and interpolation step is required. Both sequences consist of homogeneous transformation matrices that give the pose of the RGB optical frame of the RGB-D sensor relative to some arbitrary reference frame. This reference frame does not have to be the same for both sequences, i.e., the estimated sequence might start at the origin, while the ground truth sequence is given in an absolute coordinate frame defined during calibration. While, in principle, the choice of the reference frame on the RGB-D sensor is also arbitrary, the RGB optical frame is used as the reference because the depth images in the dataset have already been registered to this frame. In the remainder of this section, two common evaluation metrics for visual odometry and visual SLAM given in  are defined.
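The data association step mentioned above, matching estimated poses to ground truth poses by timestamp, can be sketched as follows. This is a simplified greedy nearest-timestamp matcher in the spirit of the TUM benchmark scripts; the function name and the `max_dt` default are assumptions.

```python
def associate(est, gt, max_dt=0.02):
    """Match each estimated-pose timestamp to the closest ground-truth
    timestamp within max_dt seconds. Both inputs are dicts: t -> pose.
    Returns a sorted list of (t_est, t_gt) pairs, each timestamp used once."""
    # All candidate pairs within tolerance, closest first.
    pairs = sorted((abs(te - tg), te, tg)
                   for te in est for tg in gt
                   if abs(te - tg) < max_dt)
    matched_e, matched_g, out = set(), set(), []
    for _, te, tg in pairs:
        if te not in matched_e and tg not in matched_g:
            matched_e.add(te)
            matched_g.add(tg)
            out.append((te, tg))
    return sorted(out)
```

Interpolation between the two bracketing ground truth poses can then refine each match when the sampling rates differ strongly.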
5.1 Absolute Trajectory Error (ATE)
For a visual SLAM system, the global consistency of the estimated trajectory is an important quantity. The Absolute Trajectory Error (ATE) is evaluated by comparing the absolute distances between the estimated and the ground truth trajectory. As both trajectories can be specified in arbitrary coordinate frames, they first need to be aligned. This is done in closed form using the method of Horn , which finds the rigid-body transformation $S$ corresponding to the least-squares solution that maps the estimated trajectory $P_{1:n}$ onto the ground truth trajectory $Q_{1:n}$. The absolute trajectory error at time step $i$ can then be calculated as

$$F_i := Q_i^{-1} S P_i$$
Evaluating the root mean squared error (RMSE) over all time indices of the translational components, we get

$$\mathrm{RMSE}(F_{1:n}) := \left( \frac{1}{n} \sum_{i=1}^{n} \lVert \operatorname{trans}(F_i) \rVert^2 \right)^{1/2}$$

where $\operatorname{trans}(F_i)$ denotes the translational component of $F_i$.
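A minimal sketch of this computation in numpy, aligning the estimated positions to the ground truth with the closed-form SVD solution of Horn's method and then evaluating the translational RMSE (illustrative only; the real benchmark tools also handle timestamps and full poses):

```python
import numpy as np

def ate_rmse(P, Q):
    """ATE RMSE between estimated positions P and ground truth positions Q
    (both n x 3 arrays), after closed-form rigid alignment."""
    cP, cQ = P.mean(0), Q.mean(0)
    U, _, Vt = np.linalg.svd((P - cP).T @ (Q - cQ))
    # Sign correction so the alignment is a proper rotation.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = cQ - R @ cP
    aligned = P @ R.T + t                     # S applied to each position
    return np.sqrt(np.mean(np.sum((aligned - Q) ** 2, axis=1)))
```

Because of the alignment step, a trajectory that differs from the ground truth only by a rigid-body motion yields an ATE of zero.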
5.2 Relative Pose Error (RPE)
The Relative Pose Error measures the local accuracy of the trajectory over a fixed time interval $\Delta$. Therefore, the Relative Pose Error corresponds to the drift of the robot's trajectory, which is useful for the evaluation of visual odometry systems. We define the relative pose error at time step $i$ as

$$E_i := \left( Q_i^{-1} Q_{i+\Delta} \right)^{-1} \left( P_i^{-1} P_{i+\Delta} \right)$$
From a sequence of $n$ camera poses, we obtain in this way $m = n - \Delta$ individual relative pose errors along the sequence. From these errors, we compute the root mean squared error (RMSE) over all time indices of the translational component as

$$\mathrm{RMSE}(E_{1:n}, \Delta) := \left( \frac{1}{m} \sum_{i=1}^{m} \lVert \operatorname{trans}(E_i) \rVert^2 \right)^{1/2}$$

where $\operatorname{trans}(E_i)$ refers to the translational component of the relative pose error $E_i$. Some prefer to evaluate the mean error instead of the root mean squared error, as it gives less influence to outliers; others use the median instead of the mean, which attributes even less influence to outliers. For visual odometry systems that match consecutive frames, the time parameter $\Delta = 1$ is an intuitive choice; $\mathrm{RMSE}(E_{1:n}, 1)$ then gives the drift per frame. For systems that use more than one previous frame, larger values of $\Delta$ can also be appropriate; for example, $\Delta = 30$ gives the drift per second on a sequence recorded at 30 Hz. A common choice is to set $\Delta = n$, which means that the start point is directly compared to the end point. This metric can be misleading, as it penalizes rotational errors at the beginning of a trajectory more than towards the end [5, 13], and should therefore be avoided. It instead makes sense to average over all possible time intervals $\Delta$, i.e., to calculate

$$\mathrm{RMSE}(E_{1:n}) := \frac{1}{n} \sum_{\Delta=1}^{n} \mathrm{RMSE}(E_{1:n}, \Delta)$$
Note that the computational complexity of this expression is quadratic in the trajectory length. Therefore, it is approximated by calculating it from a fixed number of relative pose samples.
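For a fixed $\Delta$, the translational RPE RMSE can be sketched as follows; this is an illustrative numpy version operating on lists of 4×4 homogeneous pose matrices, and the exhaustive averaging over all $\Delta$ (or the fixed-sample approximation just mentioned) would simply wrap this function.

```python
import numpy as np

def rpe_trans(P, Q, delta=1):
    """Translational RPE RMSE for a fixed time interval delta.
    P and Q are equal-length lists of 4x4 homogeneous pose matrices."""
    errs = []
    for i in range(len(P) - delta):
        dP = np.linalg.inv(P[i]) @ P[i + delta]   # estimated relative motion
        dQ = np.linalg.inv(Q[i]) @ Q[i + delta]   # ground-truth relative motion
        E = np.linalg.inv(dQ) @ dP                # relative pose error E_i
        errs.append(np.linalg.norm(E[:3, 3]))     # trans(E_i)
    return np.sqrt(np.mean(np.square(errs)))
```

Note that no prior alignment is needed here: the relative motions cancel out any difference in the global reference frames of the two sequences.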
The RPE can be used to evaluate the global error of a trajectory by averaging over all possible time intervals. The RPE assesses both translational and rotational errors, while the ATE only assesses the translational errors. Therefore, the RPE is always greater than the ATE (or equal if there is no rotational error). Thus, the RPE metric gives us a way to combine rotational and translational errors into a single measure. However, rotational errors are also indirectly captured by the ATE, as they manifest themselves in wrong translations. From a practical perspective, the ATE has an intuitive visualization which facilitates visual inspection. Nevertheless, the two metrics are strongly correlated.
6 Results

The evaluation of both SLAM methods was carried out on the selected sequences from the dataset. The reported values of Absolute Trajectory Error and Relative Pose Error are root mean squared. The time required to process the entire sequence was also recorded. All values were plotted for ease of comparison (see Figure 2). The estimated trajectories and detailed results, including min, max, mean, mode, and median values of the errors, are available at .
From the obtained results it can be concluded that RTABMap takes longer to process a sequence than RGB-D SLAM, the exception being fr2 360 kidnap, the reasons for which are discussed later in this section. No general trend can be observed from the Absolute Trajectory Error (ATE) RMSE comparison that clearly suggests which SLAM algorithm performs better. We must therefore examine the estimated trajectories from both SLAM algorithms to analyze the problems faced by each.
The estimated trajectory comparison for fr2 360 kidnap sequence when juxtaposed with the groundtruth (See Figure 3) reveals that RTABMap failed to recover after the sensor was covered. No estimated trajectory exists for the motion after the covering of the sensor for RTABMap. This is clearly a flaw with the used evaluation metrics as the evaluation ignores the motion for which no poses were generated. RGB-D SLAM recovers from the covering of the sensor and generates poses for the motion afterwards. Because of how the evaluation metrics are defined, RGB-D SLAM could be misidentified as the worse of the two algorithms in this situation despite its recovery from the tracking error.
For the same reason, the time taken by RTABMap to process the sequence is less than that taken by RGB-D SLAM: RTABMap fails to recover from the tracking problem and does not process the remaining sequence, resulting in a shorter processing time.
In case of fr3 no structure no texture sequence, the same flaw is observed (See Figure 4). RTABMap fails to generate valid poses for the latter half of the motion. The algorithm lost tracking due to lack of features in the frame. The algorithm never recovered and hence we have an incomplete estimation of the trajectory.
It must be emphasized that RTABMap generates significantly fewer pose pairs for evaluation than RGB-D SLAM. Hence, an accurate comparison is not possible with the raw estimated trajectories generated by the two methods.
After analysis of the results, it can be concluded that RGB-D SLAM performs better overall in the tested circumstances, setting aside the underlying flaws in the evaluation metrics. It has a shorter processing time, it is better at recovering from tracking errors than RTABMap, and it sustains fast camera motion much better than RTABMap.
The aforementioned flaws in the evaluation metrics, as observed in the results, point to the need for a better evaluation metric.
Acknowledgments

The author would like to thank Jürgen Sturm from the Computer Vision Group at TUM; without his dataset and benchmark evaluation method, this paper would not have been possible. The author also appreciates the support and resources provided by the Research Center at Pune Institute of Computer Technology (PICT), India. The author also thanks Dr. Gaurav Bansod and Dr. Geetanjali Kale for their guidance and support.
-  R. C. Smith and P. Cheeseman, “On the representation and estimation of spatial uncertainty,” The international journal of Robotics Research, vol. 5, no. 4, pp. 56–68, 1986.
-  G. Dissanayake, H. Durrant-Whyte, and T. Bailey, “A computationally efficient solution to the simultaneous localisation and map building (slam) problem,” in Robotics and Automation, 2000. Proceedings. ICRA’00. IEEE International Conference on, vol. 2, pp. 1009–1014, IEEE, 2000.
-  M. Montemerlo and S. Thrun, “Simultaneous localization and mapping with unknown data association using fastslam,” in ICRA, pp. 1985–1991, 2003.
-  J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, “A benchmark for the evaluation of rgb-d slam systems,” in Proc. of the International Conference on Intelligent Robot Systems (IROS), Oct. 2012.
-  R. Kümmerle, B. Steder, C. Dornhege, M. Ruhnke, G. Grisetti, C. Stachniss, and A. Kleiner, “On measuring the accuracy of slam algorithms,” Autonomous Robots, vol. 27, no. 4, p. 387, 2009.
-  F. Steinbrücker, J. Sturm, and D. Cremers, “Real-time visual odometry from dense rgb-d images,” in Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, pp. 719–722, IEEE, 2011.
-  A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? The KITTI vision benchmark suite,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 3354–3361, IEEE, 2012.
-  F. Endres, J. Hess, N. Engelhard, J. Sturm, D. Cremers, and W. Burgard, “An evaluation of the rgb-d slam system,” in Robotics and Automation (ICRA), 2012 IEEE International Conference on, pp. 1691–1696, IEEE, 2012.
-  M. Labbé and F. Michaud, “Memory management for real-time appearance-based loop closure detection,” in Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on, pp. 1271–1276, IEEE, 2011.
-  TUM, “Rgb-d slam dataset and benchmark.” https://vision.in.tum.de/data/datasets/rgbd-dataset, 2012. Accessed: 2018-09-30.
-  G. Grisetti, R. Kümmerle, C. Stachniss, U. Frese, and C. Hertzberg, “Hierarchical optimization on manifolds for online 2d and 3d mapping,” in Robotics and Automation (ICRA), 2010 IEEE International Conference on, pp. 273–278, IEEE, 2010.
-  B. K. Horn, “Closed-form solution of absolute orientation using unit quaternions,” JOSA A, vol. 4, no. 4, pp. 629–642, 1987.
-  A. Kelly, “Linearized error propagation in odometry,” The International Journal of Robotics Research, vol. 23, no. 2, pp. 179–218, 2004.
-  A. Kasar, “Git repository for the results and scripts.” https://github.com/ameykasar/benchmarking-slam, 2018. Accessed: 2018-11-22.