OpenVSLAM: A Versatile Visual SLAM Framework

10/02/2019
by Shinya Sumikura, et al.

In this paper, we introduce OpenVSLAM, a visual SLAM framework with high usability and extensibility. Visual SLAM systems are essential for AR devices, autonomous control of robots and drones, etc. However, conventional open-source visual SLAM frameworks are not appropriately designed as libraries called from third-party programs. To overcome this situation, we have developed a novel visual SLAM framework. This software is designed to be easily used and extended. It incorporates several useful features and functions for research and development. OpenVSLAM is released at https://github.com/xdspacelab/openvslam under the 2-clause BSD license.


1. Introduction

Simultaneous localization and mapping (SLAM) systems have progressed rapidly thanks to enthusiastic research in the fields of computer vision and robotics. In particular, ORB–SLAM (Mur-Artal et al., 2015; Mur-Artal and Tardós, 2017), LSD–SLAM (Engel et al., 2014), and DSO (Engel et al., 2018) constitute major approaches regarded as de facto standards of visual SLAM, which performs SLAM processing using imagery. These approaches have achieved state-of-the-art performance in visual SLAM. In addition, researchers can reproduce the behavior of these systems on their own computers because the source code is publicly available. However, these systems are not appropriately designed in terms of usability and extensibility as visual SLAM libraries. Thus, researchers and engineers must expend considerable effort to apply them to their applications. In other words, existing open-source software (OSS) for visual SLAM is inconvenient to use as the basis of applications that build on 3D modeling and mapping techniques, such as autonomous control of robots and unmanned aerial vehicles (UAVs), and augmented reality (AR) on mobile devices. It is therefore valuable to provide an open-source visual SLAM framework that users can easily use and extend.

In this paper, we present OpenVSLAM, a monocular, stereo, and RGBD visual SLAM system that builds on well-known SLAM approaches and encapsulates them in several separate components with clear application programming interfaces (APIs). We also provide extensive documentation, including sample code snippets. The main contributions of OpenVSLAM are as follows:

  • It is compatible with various types of camera models and can be customized to support additional camera models.

  • Created maps can be stored and loaded, and OpenVSLAM can then localize new images using the prebuilt maps.

  • A cross-platform viewer that runs on web browsers is provided for the convenience of users.

One of the noteworthy features of OpenVSLAM is that the system can deal with various types of camera models, such as perspective, fisheye, and equirectangular, as shown in Figure 1. AR on mobile devices such as smartphones needs a SLAM system with a regular perspective camera. Meanwhile, fisheye cameras are often mounted on UAVs and robots for visual SLAM and scene understanding because they have a wider field of view (FoV) than perspective cameras. OpenVSLAM can be used with almost the same implementation for both perspective and fisheye camera models. In addition, it is a significant contribution that equirectangular images can serve as inputs to our SLAM system. By using cameras that capture omnidirectional imagery, the tracking performance of SLAM can be improved. Our support for equirectangular images makes tracking and mapping independent of the direction in which the camera is pointed. Furthermore, OpenVSLAM provides interfaces that can be employed in applications and research that use visual SLAM. For example, our SLAM system incorporates interfaces to store and load a map database, as well as a localization function based on a prebuilt map.

We contribute to the computer vision and robotics communities by providing this SLAM framework under a more permissive OSS license than most conventional visual SLAM frameworks, as shown in Table 1. Additionally, we continuously maintain the software so that researchers can jointly contribute to its development. Our code is released at https://github.com/xdspacelab/openvslam.

2. Related Work

2.1. OSS for Scene Modeling

In this section, mapping and localization techniques whose implementations are released as OSS are briefly described. Such techniques are essential in a wide variety of application scenarios, e.g., autonomous control of UAVs and robots and AR on mobile devices. Several OSS packages that perform these tasks using images have been released to the public.

Structure from motion (SfM) and visual SLAM are often employed as scene modeling techniques based on imagery. In SfM, it is usually assumed that the entire image set is prepared in advance, and the algorithm performs 3D reconstruction via batch processing. In visual SLAM, by contrast, 3D reconstruction is processed in real-time; it therefore assumes that images are input sequentially. OpenMVG (Moulon et al., 2019), Theia (Sweeney et al., 2015), OpenSfM (AB, 2019), and COLMAP (Schönberger and Frahm, 2016) are well-known OSS packages for SfM. Some SfM frameworks (Moulon et al., 2019; AB, 2019) are capable of dealing with fisheye and equirectangular imagery. Compatibility with such images has improved the performance and usability of SfM packages as 3D modeling frameworks. Meanwhile, researchers often use visual SLAM systems, such as ORB–SLAM (Mur-Artal et al., 2015; Mur-Artal and Tardós, 2017), LSD–SLAM (Engel et al., 2014), and DSO (Engel et al., 2018), for real-time 3D mapping. Unlike some SfM frameworks, most visual SLAM software can only handle perspective imagery. Inspired by the aforementioned SfM frameworks, we provide a novel visual SLAM framework compatible with various types of camera models. We thus aim to improve the usability and extensibility of visual SLAM for 3D mapping and localization.

2.2. Visual SLAM

This section introduces some visual SLAM software and explains its main features. Table 1 compares the characteristics of well-known visual SLAM frameworks with those of our OpenVSLAM.

ORB–SLAM (Mur-Artal et al., 2015; Mur-Artal and Tardós, 2017) is a kind of indirect SLAM that carries out visual SLAM processing using local feature matching among frames at different time instants. In this approach, the FAST algorithm (Rosten and Drummond, 2006; Rosten et al., 2010) is used for keypoint detection, and the ORB binary descriptor (Rublee et al., 2011) is used to describe the keypoints. Fast methods for extracting keypoints and matching feature vectors enable visual SLAM algorithms to run in real-time. Similar approaches are employed in ProSLAM (Schlegel et al., 2018), a simple visual SLAM framework for perspective stereo and RGBD camera systems. UcoSLAM (Muñoz-Salinas and Medina Carnicer, 2019) adopts an algorithm that combines artificial landmarks, such as squared fiducial markers, with the binary descriptors used by ORB–SLAM and ProSLAM. Meanwhile, LSD–SLAM (Engel et al., 2014) and DSO (Engel et al., 2018) are two different approaches to direct SLAM, which performs visual SLAM processing by directly exploiting the brightness information of each pixel in the images. It should be noted that the direct method does not have to explicitly extract keypoints from images. Unlike the indirect method, the direct method can operate correctly in more texture-less environments because it utilizes the whole image information. However, the direct method is more susceptible to changes in lighting conditions. Additionally, it has been reported that the direct method achieves lower performance than the indirect one when using rolling-shutter cameras (Engel et al., 2014, 2018). Given that the image sensors in smartphones and consumer cameras are rolling shutter, OpenVSLAM adopts the indirect method for visual SLAM.

Most visual SLAM frameworks cannot store and load map databases, as highlighted in Table 1. Localization based on a prebuilt map is important in practical terms for many applications. Accordingly, the ability to store and load created maps clearly improves the usability and extensibility of a visual SLAM framework. Therefore, functions for map database I/O are implemented in OpenVSLAM.

3. Implementation

Figure 2. Main modules of OpenVSLAM: tracking, mapping, and global optimization modules.

OpenVSLAM is mainly implemented in C++. It uses well-known libraries such as Eigen (a C++ template library for linear algebra: http://eigen.tuxfamily.org/) for matrix calculation, OpenCV (Open Source Computer Vision Library: http://opencv.org/) for image I/O and feature extraction, and g2o (Kümmerle et al., 2011) for map optimization. In addition, extensive documentation, including sample code snippets, is provided. Users can employ these snippets in their own programs.
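As a concrete illustration of how the library can be called from a third-party program, the following minimal sketch feeds a monocular video to the SLAM system. The class and method names (openvslam::system, openvslam::config, feed_monocular_frame, and so on) follow the project's published examples, but exact signatures may differ between releases, so the listing should be read as an outline rather than as the definitive API.

// Minimal sketch of driving OpenVSLAM from a third-party program.
// Names follow the project's published examples; details may differ by release.
#include <openvslam/system.h>
#include <openvslam/config.h>
#include <opencv2/videoio.hpp>
#include <memory>

int main() {
    // Camera model, ORB parameters, etc. are read from a YAML configuration file.
    const auto cfg = std::make_shared<openvslam::config>("config.yaml");
    openvslam::system slam(cfg, "orb_vocab.dbow2");

    slam.startup();  // spawns the mapping and global optimization threads

    cv::VideoCapture video("input.mp4");
    cv::Mat frame;
    double timestamp = 0.0;
    while (video.read(frame)) {
        // The tracking module estimates a camera pose for every input frame.
        slam.feed_monocular_frame(frame, timestamp);
        timestamp += 1.0 / 30.0;  // assumes a 30 fps input video
    }

    slam.shutdown();
    return 0;
}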

3.1. SLAM Algorithm

In this section, we present a brief outline of the SLAM algorithm adopted by OpenVSLAM and its module structure. As in ORB–SLAM (Mur-Artal et al., 2015; Mur-Artal and Tardós, 2017) and ProSLAM (Schlegel et al., 2018), the graph-based SLAM algorithm (Grisetti et al., 2010) with the indirect method is used in OpenVSLAM. ORB (Rublee et al., 2011) is adopted as the feature extractor. The module structure of OpenVSLAM is carefully designed for customizability.

The software of OpenVSLAM is roughly divided into three modules, as shown in Figure 2: tracking, mapping, and global optimization. The tracking module estimates a camera pose for every frame that is sequentially input to OpenVSLAM via keypoint matching and pose optimization. This module also decides whether to insert a new keyframe (KF). When a frame is regarded as appropriate for a new KF, it is sent to the mapping and global optimization modules. In the mapping module, new 3D points are triangulated using the inserted KFs; that is, the map is created and extended. Additionally, windowed map optimization, called local bundle adjustment (BA), is performed in this module. Loop detection, pose-graph optimization, and global BA are carried out in the global optimization module. Trajectory drift, which often becomes a problem in SLAM, is resolved via pose-graph optimization implemented with g2o (Kümmerle et al., 2011). Scale drift is also corrected in this way, which is especially important for monocular camera setups.
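The following schematic sketch (not the actual OpenVSLAM class layout) illustrates how the three modules can communicate: the tracking module pushes newly created keyframes into queues consumed by the mapping and global optimization modules, each of which runs in its own thread.

// Schematic illustration of the three-module structure described above.
// The real implementation differs; this only shows the producer/consumer pattern.
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

struct Keyframe { int id;  /* pose, keypoints, descriptors, landmarks, ... */ };

// A minimal thread-safe queue carrying keyframes between modules.
class KeyframeQueue {
public:
    void push(const Keyframe& kf) {
        { std::lock_guard<std::mutex> lock(mtx_); queue_.push(kf); }
        cond_.notify_one();
    }
    Keyframe pop() {
        std::unique_lock<std::mutex> lock(mtx_);
        cond_.wait(lock, [this] { return !queue_.empty(); });
        const Keyframe kf = queue_.front();
        queue_.pop();
        return kf;
    }
private:
    std::queue<Keyframe> queue_;
    std::mutex mtx_;
    std::condition_variable cond_;
};

int main() {
    KeyframeQueue mapping_queue, global_opt_queue;
    constexpr int num_keyframes = 3;

    // Mapping module: triangulate new 3D points and run local bundle adjustment.
    std::thread mapping([&] {
        for (int i = 0; i < num_keyframes; ++i) {
            const Keyframe kf = mapping_queue.pop();
            std::cout << "mapping: local BA around KF " << kf.id << "\n";
        }
    });

    // Global optimization module: loop detection, pose-graph optimization, global BA.
    std::thread global_opt([&] {
        for (int i = 0; i < num_keyframes; ++i) {
            const Keyframe kf = global_opt_queue.pop();
            std::cout << "global opt: loop detection for KF " << kf.id << "\n";
        }
    });

    // Tracking module: when a frame is selected as a keyframe, send it to both modules.
    for (int id = 0; id < num_keyframes; ++id) {
        mapping_queue.push({id});
        global_opt_queue.push({id});
    }

    mapping.join();
    global_opt.join();
    return 0;
}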

3.2. Camera Models

OpenVSLAM can accept images captured with perspective, fisheye, and equirectangular cameras. For perspective and fisheye camera models, the framework is compatible not only with monocular but also with stereo and RGBD setups. Additionally, users can easily add new camera models (e.g., dual fisheye and catadioptric) by implementing new camera model classes derived from the base class camera::base. This is a great advantage compared to other SLAM frameworks because new camera models can be implemented with little effort.
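As a hypothetical illustration of this extension point, the sketch below derives a new camera model from a base class. The actual camera::base in OpenVSLAM declares its own set of virtual functions; the simplified project/unproject interface shown here is an assumption made only to convey the idea.

// Hypothetical sketch of adding a custom camera model. The real camera::base
// interface in OpenVSLAM differs; project()/unproject() are assumed here
// purely to illustrate the extension mechanism.
#include <opencv2/core.hpp>
#include <cmath>

namespace camera {

class base {
public:
    virtual ~base() = default;
    // Map a 3D point in the camera frame to pixel coordinates.
    virtual cv::Point2f project(const cv::Point3f& pos_c) const = 0;
    // Map a pixel to a unit bearing vector in the camera frame.
    virtual cv::Point3f unproject(const cv::Point2f& pt) const = 0;
};

// Example: an ideal pinhole (perspective) model without distortion.
class my_pinhole final : public base {
public:
    my_pinhole(float fx, float fy, float cx, float cy)
        : fx_(fx), fy_(fy), cx_(cx), cy_(cy) {}

    cv::Point2f project(const cv::Point3f& p) const override {
        return {fx_ * p.x / p.z + cx_, fy_ * p.y / p.z + cy_};
    }

    cv::Point3f unproject(const cv::Point2f& pt) const override {
        const cv::Point3f ray((pt.x - cx_) / fx_, (pt.y - cy_) / fy_, 1.0f);
        const float norm = std::sqrt(ray.x * ray.x + ray.y * ray.y + ray.z * ray.z);
        return {ray.x / norm, ray.y / norm, ray.z / norm};
    }

private:
    float fx_, fy_, cx_, cy_;
};

}  // namespace camera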

It is noteworthy that OpenVSLAM can perform SLAM with an equirectangular camera. Equirectangular cameras, such as the RICOH THETA series, the insta360 series, and the Ladybug series, have recently been used to capture omnidirectional images and videos. In regard to visual SLAM, compatibility with equirectangular cameras implies a significant benefit for tracking and mapping because such cameras provide an omnidirectional view, unlike perspective ones. To the best of our knowledge, this is the first open-source visual SLAM framework that can accept equirectangular imagery.

3.3. Map I/O and Localization

As opposed to most visual SLAM frameworks, OpenVSLAM has functions to store and load map information, as shown in Table 1. In addition, users can localize new frames based on a prebuilt map. The map database is stored in MessagePack format; hence, the map information can be reused by any third-party application, not only by OpenVSLAM.
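A sketch of the store/load workflow is shown below. The method names (save_map_database, load_map_database, disable_mapping_module) follow the project's localization example; the exact names and call order may vary between releases, so this is an outline under those assumptions.

// Sketch of map I/O and localization against a prebuilt map.
// Method names follow the project's examples; details may differ by release.
#include <openvslam/system.h>
#include <openvslam/config.h>
#include <memory>

// Mapping run: build a map and store it as a MessagePack file.
void build_and_store_map(openvslam::system& slam) {
    // ... feed frames as in the monocular example above ...
    slam.shutdown();
    slam.save_map_database("map.msg");  // serialized in MessagePack format
}

// Localization run: load the prebuilt map and track new frames against it.
void localize_with_prebuilt_map(const std::shared_ptr<openvslam::config>& cfg) {
    openvslam::system slam(cfg, "orb_vocab.dbow2");
    slam.load_map_database("map.msg");
    slam.startup(false);            // false: do not initialize a new map
    slam.disable_mapping_module();  // pure localization; the map is kept fixed
    // ... feed new frames; their poses are estimated relative to the loaded map ...
    slam.shutdown();
}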

4. Quantitative Evaluation

In this section, the tracking accuracy of OpenVSLAM is evaluated using the EuRoC MAV dataset (Burri et al., 2016) and the KITTI Odometry dataset (Geiger et al., 2012), both of which provide ground-truth trajectories. ORB–SLAM2 (Mur-Artal and Tardós, 2017), a typical indirect SLAM system, is selected for comparison. In addition to tracking accuracy, tracking times are also compared.

Absolute trajectory error (ATE) (Sturm et al., 2012) is used to evaluate the estimated trajectories. To align an estimated trajectory with the corresponding ground truth, transformation parameters between the two trajectories are estimated using Umeyama's method (Umeyama, 1991). A Sim(3) transformation is estimated for monocular sequences because the tracked trajectories are up-to-scale, whereas an SE(3) transformation is used for stereo sequences. The laptop computer used for the evaluations is equipped with a Core i7-7820HK CPU (2.90GHz, 4C8T) and 32GB of RAM.
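For reference, the ATE reported here is the root-mean-square of the translational residuals after this alignment; the expression below follows the standard definition of Sturm et al. (2012), with symbols chosen for illustration: P_i is the i-th estimated camera pose, Q_i the corresponding ground-truth pose, and S the aligning transformation (Sim(3) for monocular sequences, SE(3) for stereo sequences).

% ATE (RMSE) over N frames, following Sturm et al. (2012)
\mathrm{ATE}_{\mathrm{RMSE}}
  = \sqrt{ \frac{1}{N} \sum_{i=1}^{N}
      \left\lVert \operatorname{trans}\!\left( Q_i^{-1} \, S \, P_i \right) \right\rVert^{2} }

where trans(·) extracts the translational component of the residual transformation.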

4.1. EuRoC MAV Dataset

Figure 3 shows the ATEs on the 11 sequences of the EuRoC MAV dataset. From the graph, it is found that OpenVSLAM is comparable to ORB–SLAM with respect to tracking accuracy for UAV-mounted cameras. For the sequences that include dark scenes (MH_04 and MH_05), the trajectories estimated with OpenVSLAM are more accurate than those estimated with ORB–SLAM. This is mainly because an additional frame-tracking method based on robust matching is implemented in OpenVSLAM.

Subsequently, the tracking times measured on the MH_02 sequence of the EuRoC MAV dataset are shown in Figure 4. Mean and median tracking times are presented in the table as well. From the table, it is found that OpenVSLAM requires less tracking time than ORB–SLAM. This is mainly because the ORB extraction implementation in OpenVSLAM is more optimized than that in ORB–SLAM. In addition, it should be noted that OpenVSLAM requires less tracking time than ORB–SLAM especially in the later parts of the sequence, as shown in the graph. This is because OpenVSLAM efficiently prevents the local map from growing in the tracking module even as the global map expands.

Figure 3. Absolute trajectory errors on the 11 sequences in EuRoC MAV dataset (monocular). Lower is better.
                     ORB–SLAM   OpenVSLAM
mean [ms/frame]         27.96       23.84
median [ms/frame]       24.97       23.38
Figure 4. Tracking times on the MH_02 sequence of the EuRoC MAV dataset (monocular). The table shows the mean and median tracking times for each of the two frameworks; the graph shows the change in tracking time over the sequence. Lower is better.

4.2. KITTI Odometry Dataset

Figure 5 shows the ATEs on the 11 sequences of the KITTI Odometry dataset. From the graph, it is found that OpenVSLAM achieves tracking accuracy comparable to ORB–SLAM for car-mounted cameras.

Subsequently, the tracking times measured on sequence 05 of the KITTI Odometry dataset are shown in Figure 6. Mean and median tracking times are also presented in the table. OpenVSLAM requires less tracking time than ORB–SLAM for the same reason described in Section 4.1. The difference in tracking times shown in Figure 6 is larger than that in Figure 4 because 1) the image size of the KITTI Odometry dataset is larger than that of the EuRoC MAV dataset, and 2) the stereo-matching implementation in OpenVSLAM is also more optimized than that in ORB–SLAM.

Figure 5. Absolute trajectory errors on the 11 sequences in KITTI Odometry dataset (stereo). Lower is better.
                     ORB–SLAM   OpenVSLAM
mean [ms/frame]         68.78       56.32
median [ms/frame]       66.78       54.45
Figure 6. Tracking times on sequence 05 of the KITTI Odometry dataset (stereo). The table shows the mean and median tracking times for each of the two frameworks; the graph shows the change in tracking time over the sequence. Lower is better.

5. Qualitative Results

5.1. Fisheye Camera

In this section, experimental results of visual SLAM with a fisheye camera are presented for both outdoor and indoor scenes. A LUMIX DMC-GX8 (Panasonic Corp.) equipped with an 8 mm fisheye lens is used to capture the image sequences. The FPS of the camera is .

The 3D map shown in Figure 7 is created with OpenVSLAM using a fisheye video captured outdoors. The number of frames is about . The difference in elevation can be observed from the side view of the 3D map. Also, it should be noted that camera pose tracking succeeded even in high-dynamic-range scenes.

Figure 7. Mapping result using the outdoor fisheye video. The top image depicts a side view of the 3D map, while the center image is a top view. The difference in elevation can be observed from the side view.

Figure 8 presents the 3D map built from an indoor fisheye video. The number of frames is about . It is found that the shape of the room is reconstructed clearly. In addition, tracking succeeded even in areas that have few common views (e.g., the bottom-right image of Figure 8).

These results allow us to conclude that visual SLAM with fisheye cameras is correctly performed both outdoors and indoors.

5.2. Equirectangular Camera

In this section, experimental results of visual SLAM with a THETA V (RICOH Co., Ltd.), a consumer equirectangular camera, are presented.

The 3D map shown in the right half of Figure 1 is created with OpenVSLAM using an equirectangular video captured outdoors. The FPS is and the number of frames is . It is found that tracking of camera movement, loop closing, and global optimization work well even for this large-scale sequence.

Meanwhile, Figure 9 presents the 3D map based on an indoor equirectangular video. In this case, the FPS is and the number of frames is . It should be noted that the camera poses are correctly tracked even in texture-less areas thanks to omnidirectional observation.

These results allow us to conclude that visual SLAM with equirectangular cameras is correctly performed both outdoors and indoors.

6. Conclusion

In this project, we have developed OpenVSLAM, a visual SLAM framework with high usability and extensibility. The software is designed to be easily used in various application scenarios of visual SLAM, and it incorporates several useful functions for research and development. In this paper, the quantitative performance was evaluated using benchmark datasets. In addition, experimental results of visual SLAM with fisheye and equirectangular camera models were presented. We will continuously maintain this framework to further the development of the computer vision and robotics fields.

Acknowledgements.
The authors would like to thank Mr. H. Ishikawa, Mr. M. Ichihara, Dr. M. Onishi, Dr. R. Nakamura, and Prof. N. Kawaguchi, for their support for this project.
Figure 8. Mapping result using the indoor fisheye video. The center image depicts the top-view of the 3D map. It is found that the shape of the room is reconstructed well.
Figure 9. Mapping result using the indoor equirectangular video. Tracking succeeds even in the texture-less areas.

References

  • M. AB (2019) OpenSfM. Note: https://github.com/mapillary/OpenSfM Cited by: §2.1.
  • M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart (2016) The euroc micro aerial vehicle datasets. International Journal of Robotics Research (IJRR) 35 (10), pp. 1157–1163. External Links: Document Cited by: §4.
  • J. Engel, V. Koltun, and D. Cremers (2018) Direct sparse odometry. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 40 (3), pp. 611–625. External Links: Document Cited by: Table 1, §1, §2.1, §2.2.
  • J. Engel, T. Schöps, and D. Cremers (2014) LSD–SLAM: large-scale direct monocular SLAM. In Proceedings of European Conference on Computer Vision (ECCV), pp. 834–849. External Links: Document Cited by: Table 1, §1, §2.1, §2.2.
  • A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3354–3361. External Links: Document Cited by: §4.
  • G. Grisetti, R. Kümmerle, C. Stachniss, and W. Burgard (2010) A tutorial on graph-based slam. IEEE Transactions on Intelligent Transportation Systems Magazine 2 (4), pp. 31–43. External Links: Document Cited by: §3.1.
  • R. Kümmerle, G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard (2011) G2o: a general framework for graph optimization. In Proceedings of IEEE International Conference on Robotics and Automation (ICRA), pp. 3607–3613. External Links: Document Cited by: §3.1, §3.
  • P. Moulon, P. Monasse, R. Marlet, et al. (2019) OpenMVG: an open multiple view geometry library. Note: https://github.com/openMVG/openMVG Cited by: §2.1.
  • R. Muñoz-Salinas and R. Medina Carnicer (2019) UcoSLAM: simultaneous localization and mapping by fusion of keypoints and squared planar markers. External Links: 1902.03729, Document Cited by: Table 1, §2.2.
  • R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós (2015) ORB–SLAM: a versatile and accurate monocular SLAM system. IEEE Transactions on Robotics 31 (5), pp. 1147–1163. External Links: Document Cited by: §1, §2.1, §2.2, §3.1.
  • R. Mur-Artal and J. D. Tardós (2017) ORB–SLAM2: an open-source SLAM system for monocular, stereo and RGB-D cameras. IEEE Transactions on Robotics 33 (5), pp. 1255–1262. External Links: Document Cited by: Table 1, §1, §2.1, §2.2, §3.1, §4.
  • E. Rosten and T. Drummond (2006) Machine learning for high-speed corner detection. In Proceedings of European Conference on Computer Vision (ECCV), pp. 430–443. External Links: Document Cited by: §2.2.
  • E. Rosten, R. Porter, and T. Drummond (2010) Faster and better: a machine learning approach to corner detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 32 (1), pp. 105–119. External Links: Document Cited by: §2.2.
  • E. Rublee, V. Rabaud, K. Konolige, and G. Bradski (2011) ORB: an efficient alternative to SIFT or SURF. In Proceedings of IEEE International Conference on Computer Vision (ICCV), pp. 2564–2571. External Links: Document Cited by: §2.2, §3.1.
  • D. Schlegel, M. Colosi, and G. Grisetti (2018) ProSLAM: Graph SLAM from a Programmer’s Perspective. In Proceedings of IEEE International Conference on Robotics and Automation (ICRA), pp. 1–9. External Links: Document Cited by: Table 1, §2.2, §3.1.
  • J. L. Schönberger and J. Frahm (2016) Structure-from-motion revisited. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4104–4113. External Links: Document Cited by: §2.1.
  • J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers (2012) A benchmark for the evaluation of rgb-d slam systems. In Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 573–580. External Links: Document Cited by: §4.
  • C. Sweeney, T. Hollerer, and M. Turk (2015) Theia: a fast and scalable structure-from-motion library. In Proceedings of the 23rd ACM International Conference on Multimedia, pp. 693–696. External Links: Document Cited by: §2.1.
  • S. Umeyama (1991) Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 13 (4), pp. 376–380. External Links: Document Cited by: §4.