Improved Real-Time Monocular SLAM Using Semantic Segmentation on Selective Frames

Monocular simultaneous localization and mapping (SLAM) is emerging in advanced driver assistance systems and autonomous driving, because a single camera is cheap and easy to install. Conventional monocular SLAM has two major challenges that lead to inaccurate localization and mapping. First, it is challenging to estimate scales in localization and mapping. Second, conventional monocular SLAM uses inappropriate mapping factors, such as dynamic objects and low-parallax areas, in mapping. This paper proposes an improved real-time monocular SLAM that resolves the aforementioned challenges by efficiently using deep learning-based semantic segmentation. To achieve real-time execution of the proposed method, we apply semantic segmentation only to downsampled keyframes, in parallel with mapping processes. In addition, the proposed method corrects the scales of camera poses and three-dimensional (3D) points, using the ground plane estimated from road-labeled 3D points and the real camera height. The proposed method also removes inappropriate corner features labeled as moving objects or lying in low-parallax areas. Experiments with six video sequences demonstrate that the proposed monocular SLAM system achieves significantly better trajectory tracking accuracy than state-of-the-art monocular SLAM, and trajectory tracking accuracy comparable to that of state-of-the-art stereo SLAM.


I Introduction

Simultaneous localization and mapping (SLAM) techniques have evolved and been widely applied to advanced driver assistance systems (ADAS) and autonomous driving systems. While SLAM approaches using light detection and ranging (LiDAR) sensors are accurate, LiDAR sensors are expensive and have not been widely used in commercial products. Visual SLAM systems that use camera(s) have become a popular alternative to LiDAR-based SLAM. Monocular SLAM systems that use a single camera are attractive as they are cheap and easy to install. Monocular SLAM was initially proposed with filter-based approaches [davison2007monoslam, civera2008inverse, chiuso2002structure, eade2006scalable]. The filter-based methods are computationally inefficient, since both localization and mapping run on every frame [younes2017keyframe]. To resolve this issue, keyframe-based approaches [mouragnon2006real, klein2007parallel, mur2015orb] (see other references in [younes2017keyframe]) run the mapping process only on selective frames, called keyframes, while the localization process estimates a camera pose in every frame. Keyframe-based SLAM improved on the localization accuracy and computational efficiency of filter-based methods [strasdat2012visual], and became the de facto standard in monocular SLAM [younes2017keyframe].

Conventional monocular SLAM has major challenges in scale estimation and mapping. There are two scale estimation challenges in monocular SLAM, namely, scale ambiguity and scale drift [zhou2019ground]. First, one cannot recover camera poses and three-dimensional (3D) points with absolute scale. Second, errors accumulate through the SLAM procedures, i.e., feature extraction, feature matching, camera pose estimation, and 3D structure reconstruction. In monocular SLAM, scale ambiguity is a unique problem, and scale drift becomes more severe due to the scale ambiguity. When these scale issues exist, monocular SLAM can mis-localize the current camera (or vehicle) position and generate 3D points in “off-the-grid” locations. The other major challenge is inaccurate mapping from two-dimensional (2D) points to 3D points. The mapping process of monocular SLAM consists of the triangulation of 2D point pairs obtained from two adjacent frames collected at different time points [hartley2003multiple]. For accurate mapping, static scenes and sufficient parallax are required. In practice, however, some objects such as cars and pedestrians move in a scene, and other objects are distant from the camera and exhibit low parallax.

Fig. 1: Overview of the proposed monocular SLAM system between previous and current keyframe time points, indexed by and , respectively. The keyframe data consists of current keyframe and its corner features and camera pose, and previous keyframe and its corrected camera pose. These are denoted by , , , , , respectively; see notations in Section II-A. The connected data consists of refined corner features, corrected camera pose, segmentation result and corrected 3D points of connected keyframes. These are denoted by , , , for , respectively; see notations in Section II-A. The symbol colors correspond to those in Fig. 2.

To resolve the scale ambiguity issue, many existing methods use prior knowledge, e.g., camera height [zhou2019ground, grater2015robust, lovegrove2011accurate, mirabdollah2015fast, pereira2017monocular, scaramuzza2009absolute, yang2019cubeslam], object size [botterill2011bag, botterill2012correcting, song2015high], and pedestrian height [botterill2012correcting], to estimate the absolute scale in monocular SLAM. To reduce scale drift, [strasdat2010scale] proposed a loop detection method that recognizes revisits to prior locations. However, many driving trajectories do not complete a loop. [zhou2019ground] uses the camera height to estimate the absolute scale and correct scale drift. The method estimates the ground plane from a fixed region of interest (ROI) that corresponds to road area(s). The estimation performance is limited if the ROI is blocked by obstacles such as preceding vehicles.

To improve mapping accuracy, researchers have combined monocular SLAM with deep learning-based semantic segmentation to filter out dynamic objects [an2017semantic, kaneko2018mask, yu2018ds, brasch2018semantic] and/or sky areas [kaneko2018mask]. This approach showed excellent performance even for scenes with dynamic objects. These methods, however, use only a few segmented labels, namely dynamic objects and sky areas. One may use more labels, such as road areas, to estimate ground planes. More importantly, deep learning-based semantic segmentation has high computational costs, and applying it to every frame can stall or even terminate monocular SLAM. This limits the practical use of this powerful combination in real-time applications such as autonomous driving.

This paper proposes an improved real-time monocular SLAM framework that solves the aforementioned limitations of monocular SLAM by using deep learning-based semantic segmentation in sophisticated ways. To resolve the time complexity problem of segmentation-based monocular SLAM, we apply deep learning-based semantic segmentation only to keyframes. This is different from existing methods [kaneko2018mask, an2017semantic, yu2018ds, brasch2018semantic] that apply semantic segmentation to all frames. To further improve efficiency, the segmentation process is performed in a separate thread, in parallel with mapping processes. To overcome the scale estimation challenges, we estimate the absolute scale at each keyframe time point using segmented road labels and the actual camera height, and apply it to correct the scales of camera poses and 3D points. To reduce inaccurate mappings, we filter out corner features of both moving objects, such as cars and pedestrians, and low-parallax areas, which may generate inappropriate 3D points. To discriminate low-parallax areas, we use the Kanade–Lucas–Tomasi (KLT) tracker [baker2004lucas], which is widely used for optical flow, and select corner features with small displacement in the sky, terrain, and building areas. Different from existing methods [kaneko2018mask, an2017semantic, yu2018ds, brasch2018semantic], the proposed monocular SLAM framework using semantic segmentation simultaneously resolves the aforementioned scale estimation and mapping challenges.

The remainder of this paper is organized as follows. Section II describes the proposed efficient integration of deep learning-based semantic segmentation with monocular SLAM, and the proposed scale correction and inaccurate-mapping reduction methods, which label 3D points and 2D corner features based on segmented keyframes, respectively. Section III reports experiments in which the proposed monocular SLAM system significantly improves trajectory tracking accuracy compared to state-of-the-art monocular SLAM. Furthermore, Section III reports experiments in which the proposed monocular SLAM system achieves trajectory tracking accuracy comparable to state-of-the-art stereo SLAM in driving trajectories without loops. Section IV concludes the study and discusses future work.

Fig. 2: Proposed combination of keyframe-based monocular SLAM and semantic segmentation. Blue boxes indicate modified or added (sub-)modules. Red symbols indicate corner features, green symbols indicate camera pose matrices, blue symbols indicate keyframes, and magenta symbols indicate the 3D point map. The input and output data sets are listed on the left and right, respectively.

II Proposed method

The proposed monocular SLAM system consists of three modules, localization, mapping, and segmentation, where each module runs in a separate thread. We modify the backbone of the conventional keyframe-based monocular SLAM method [mur2015orb] to efficiently integrate segmentation. We summarize the processes of each module between the previous and current keyframe time points. The localization module selects a keyframe when both the mapping and segmentation modules have completed processing the previous keyframe; similar to [mur2015orb], it provides the corner features and camera pose of a current keyframe (extracting corner features of each frame and estimating its camera pose with 3D points of connected keyframes). The mapping module generates new 3D points by triangulation between current corner features and connected corner features [mur2015orb], and assigns segmented labels from connected segmented keyframes to the new 3D points. It then estimates a ground plane using only road-labeled 3D points and corrects the scale of camera poses and 3D points using the estimated ground plane and the real camera height. While the mapping module provides scale-corrected camera poses and 3D points, the segmentation module performs deep learning-based segmentation on a downsampled keyframe and refines the keyframe’s corner features by removing moving objects and low-parallax areas using the segmentation result. Fig. 1 overviews the proposed monocular SLAM system.

Section II-A describes notations for data in the proposed method. Section II-B describes how we integrate deep learning-based segmentation with keyframe-based monocular SLAM. Using segmented keyframe results, Sections II-C and II-D describe details of scale correction in 3D mapping and removal of factors in a current keyframe that potentially cause inappropriate mapping, respectively.

II-A Preliminaries

This section summarizes notations for data in the proposed method:

  • The superscript denotes the keyframe index. indicates the current keyframe time point.

  • denotes segmented image or points with semantic labels. denotes that input is refined or scale-corrected.

  • denotes the keyframe. denotes segmented keyframe.

  • denotes a 2D image with ORB corner features [rublee2011orb] extracted from via the localization module. denotes refined corner features by removing factors that potentially cause inappropriate mapping, via the segmentation module.

  • denotes the normalized camera pose matrix [hartley2003multiple] of . The normalized camera pose matrix is defined by , where and denote a 3D rotation matrix and a translation vector, respectively. These are calculated by the localization module. denotes the scale-corrected normalized camera pose matrix, where its translation vector is calculated from the scale-corrected coordinate of the camera center. A camera center coordinate is calculated by [hartley2003multiple], where indicates the inverse of a matrix (see the sketch after this list). denotes the scale-corrected camera center coordinate calculated from .

  • denotes a 3D map with 3D points generated from via the mapping module. denotes a scale-corrected 3D map, where the scale of the 3D point coordinates is corrected. denotes a scale-corrected 3D map with semantic labels.

  • denotes a set of indices for keyframes connected with the keyframe, at previous keyframe time points. Two keyframes are called connected if they generate at least a threshold number of 3D points at the same coordinates [mur2015orb]. Specifically, , where is the number of keyframes connected with .
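As a side note on the camera center used above: for a normalized pose matrix [R | t], the center C satisfies R·C + t = 0, so C = -R⁻¹t, and for a rotation matrix the inverse equals the transpose. A minimal sketch (the function name and array layout are illustrative, not from the paper):

```python
import numpy as np

def camera_center(pose):
    """Camera center C from a normalized 3x4 pose matrix [R | t].

    The center satisfies R @ C + t = 0, so C = -R^{-1} t; since R is a
    rotation matrix, its inverse equals its transpose.
    """
    R, t = pose[:, :3], pose[:, 3]
    return -R.T @ t

# Example: identity rotation, translation chosen so the center is at (0, 1.65, 0).
pose = np.hstack([np.eye(3), -np.array([[0.0], [1.65], [0.0]])])
print(camera_center(pose))  # camera center approximately (0, 1.65, 0)
```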

II-B Proposed efficient integration of semantic segmentation with monocular SLAM

To efficiently integrate semantic segmentation with monocular SLAM, we apply semantic segmentation only to keyframes. Applying semantic segmentation to monocular SLAM improves the SLAM accuracy by preventing the generation of inappropriate 3D points [kaneko2018mask, an2017semantic, yu2018ds, brasch2018semantic] and by assigning semantic labels to 3D points for better ground-plane estimation. However, applying semantic segmentation to every frame is computationally demanding and unsuitable for real-time monocular SLAM. Because 3D points are generated by the mapping process, which uses only keyframes, segmenting only keyframes is sufficient for processing 3D points in monocular SLAM.

Fig. 3: Proposed keyframe selection strategy and its timing. In the localization thread, the length of each box indicates the processing time for each frame. In the mapping and segmentation threads, the length of each box indicates the processing time for each keyframe. Gray and black boxes denote keyframe candidates and selected keyframes, respectively. We choose a keyframe only after both the mapping and segmentation processes are completed, although the localization thread may flag frame(s) as keyframe candidate(s). Each candidate is determined at the end of each frame localization.

In addition, we perform the mapping and segmentation processes in two different threads. In monocular SLAM, 3D points are generated by triangulating matched features from two keyframes [hartley2003multiple]. Since the two keyframes share matched features, segmenting only one of the two keyframes may be sufficient for subsequent 3D point processing. The proposed integration approach performs the mapping process with connected segmented keyframes and related data in one thread, in parallel with the segmentation process on the current keyframe and related data in the other thread. Fig. 2 illustrates the parallel processing. We describe details of each thread's processing below.

In the mapping thread, we first generate a 3D point map by triangulation between the current corner features and each of the connected (refined) corner features , using the current camera matrix and each of the connected (scale-corrected) camera matrices . We then assign semantic labels in to the generated and obtain . Finally, we correct the scale of the keyframe camera center coordinates computed from each of and that of the 3D point coordinates in each of . While the mapping thread performs the aforementioned processes, the segmentation thread first segments the current keyframe , and obtains . We then refine the current corner features , using current data and previous data , and obtain . Fig. 2 summarizes the above processes. We describe details of the scale correction and corner feature refinement processes in Sections II-C and II-D, respectively.
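To make the mapping-thread step concrete, the sketch below triangulates matched 2D features between the current and a connected keyframe and assigns each new 3D point the semantic label found at the matched feature location in the connected segmented keyframe. It is a simplified sketch, not the paper's implementation; the function name, argument layout, and use of OpenCV's triangulation routine are assumptions.

```python
import cv2
import numpy as np

def triangulate_and_label(K, pose_cur, pose_conn, pts_cur, pts_conn, seg_conn):
    """Triangulate matched 2D features from two keyframes and label the new
    3D points with the semantic class of the connected (segmented) keyframe.

    K: 3x3 intrinsics; pose_*: 3x4 normalized pose matrices [R | t];
    pts_*: Nx2 matched pixel coordinates; seg_conn: HxW label map.
    """
    P_cur, P_conn = K @ pose_cur, K @ pose_conn      # 3x4 projection matrices
    pts4d = cv2.triangulatePoints(P_cur, P_conn,
                                  pts_cur.T.astype(np.float64),
                                  pts_conn.T.astype(np.float64))
    pts3d = (pts4d[:3] / pts4d[3]).T                  # Nx3 Euclidean coordinates

    # Look up the label of each matched feature in the segmented connected keyframe.
    cols = np.clip(pts_conn[:, 0].round().astype(int), 0, seg_conn.shape[1] - 1)
    rows = np.clip(pts_conn[:, 1].round().astype(int), 0, seg_conn.shape[0] - 1)
    labels = seg_conn[rows, cols]
    return pts3d, labels
```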

Note that we choose a keyframe only after both the mapping and segmentation processes are completed. If the localization thread generates keyframe candidates in the middle of the mapping or segmentation process, we disregard them and choose the candidate generated right after the mapping and segmentation processes are completed as a keyframe. (It is likely that subsequent frames after a keyframe candidate are also keyframe candidates.) Fig. 3 summarizes this keyframe selection strategy. Specifically, we select from the localization module and begin the mapping and segmentation processes only after and have been obtained by the mapping and segmentation threads at the previous keyframe time point, respectively.
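The following minimal sketch mirrors this gating policy with two worker threads per keyframe: a candidate is accepted only when neither the mapping nor the segmentation worker from the previous keyframe is still running. The class and function names are illustrative, not from the paper.

```python
import threading

class KeyframePipeline:
    """Accept a keyframe candidate only when neither the mapping nor the
    segmentation thread is still busy with the previous keyframe (cf. Fig. 3)."""

    def __init__(self, run_mapping, run_segmentation):
        self.run_mapping = run_mapping            # hypothetical mapping-thread work
        self.run_segmentation = run_segmentation  # hypothetical segmentation-thread work
        self.workers = []

    def busy(self):
        return any(w.is_alive() for w in self.workers)

    def on_keyframe_candidate(self, frame, features, pose):
        if self.busy():
            return False   # candidate disregarded; a later candidate will be used
        m = threading.Thread(target=self.run_mapping, args=(frame, features, pose))
        s = threading.Thread(target=self.run_segmentation, args=(frame, features))
        self.workers = [m, s]
        m.start()
        s.start()          # mapping and segmentation run in parallel
        return True        # candidate accepted as the new keyframe
```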

Lastly, we downsample each 2D keyframe to reduce the computational complexity of semantic segmentation (and upsample segmented keyframes back to the original resolution). Considering the above keyframe selection strategy and the observation that semantic segmentation is generally more computationally expensive than the mapping module, the segmentation computation time critically affects the decision timing of a new keyframe. The slower the semantic segmentation, the longer the time gap between keyframes. This leads to fewer matched corner features between two keyframe time points and fewer generated 3D points via triangulation. The localization module projects 3D points onto each frame and estimates the camera pose from the projected 3D points and extracted corner features [mur2015orb]. If the number of 3D points is small, then pose estimation accuracy decreases and eventually monocular SLAM performance significantly degrades. Reducing the keyframe image size/resolution can decrease the computational costs of semantic segmentation methods using convolutional neural networks (CNNs), at the cost of segmentation accuracy. Our experiments disregarding segmentation CNN computation costs demonstrate that although CNN segmentation accuracy decreases with reduced image size, the proposed monocular SLAM gives performance comparable to that using the original image size. See details in Section III-B.
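A minimal sketch of the downsample-segment-upsample step, assuming bicubic downsampling and nearest-neighbor upsampling as described in Section III-A1; `segment_fn` is a placeholder for the segmentation network, not a specific API.

```python
import cv2
import numpy as np

def segment_keyframe(keyframe_bgr, segment_fn, factor=2.0):
    """Downsample a keyframe, run semantic segmentation, and upsample the
    label map back to the original resolution.

    segment_fn is a placeholder: it should map an image to a per-pixel label
    map of the same size (fewer than 256 classes assumed here).
    """
    h, w = keyframe_bgr.shape[:2]
    small = cv2.resize(keyframe_bgr, (int(w / factor), int(h / factor)),
                       interpolation=cv2.INTER_CUBIC)        # bicubic downsampling
    labels_small = segment_fn(small).astype(np.uint8)
    # Nearest-neighbor upsampling keeps the labels discrete.
    labels = cv2.resize(labels_small, (w, h), interpolation=cv2.INTER_NEAREST)
    return labels
```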

II-C Scale correction

At each current keyframe time point, we correct the scale of the keyframe's camera center coordinate and each 3D point coordinate. We first estimate ground plane parameters using only road-labeled 3D points in . A ground plane is parameterized with a normal vector () and the perpendicular distance between the ground plane and the origin, , using the plane equation in 3D [anton2013elementary], where denotes the inner product between two vectors and is the coordinate of any 3D point on the plane. For accurate ground plane estimation, we use a robust parameter estimation method, random sample consensus (RANSAC) [fischler1981random], which uses only inliers among road-labeled 3D points (the estimated normal vector is normalized at each RANSAC iteration). Different from existing methods that use unlabeled 3D points in an ROI for RANSAC, e.g., [zhou2019ground], we use 3D points with the road label.
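A minimal RANSAC plane-fit sketch over road-labeled 3D points; the iteration count and inlier distance threshold are assumed values, not the paper's settings.

```python
import numpy as np

def fit_ground_plane(road_pts, n_iters=200, inlier_thresh=0.05, rng=None):
    """RANSAC plane fit (n, d) with n.x + d = 0, using road-labeled 3D points.

    road_pts: Nx3 array; inlier_thresh is an assumed point-to-plane distance
    threshold expressed in the (not yet metric) map scale.
    """
    rng = rng or np.random.default_rng()
    best_n, best_d, best_count = None, None, -1
    for _ in range(n_iters):
        p0, p1, p2 = road_pts[rng.choice(len(road_pts), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        if np.linalg.norm(n) < 1e-9:
            continue                      # degenerate (collinear) sample
        n = n / np.linalg.norm(n)         # normalize at each iteration
        d = -n @ p0
        dist = np.abs(road_pts @ n + d)   # point-to-plane distances
        count = int((dist < inlier_thresh).sum())
        if count > best_count:
            best_n, best_d, best_count = n, d, count
    return best_n, best_d
```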

We then compute a scaling factor [grater2015robust] using the virtual camera height and the real camera height, which is measured in advance. We estimate the virtual camera height with the estimated ground plane parameters above. The virtual camera height is the perpendicular distance between the estimated ground plane and the current keyframe's camera center coordinate , and is calculated by . We compute a scaling factor using and :

(1)

Intuitively speaking, the scaling factor in (1) quantifies the scale drift at the current keyframe time point.

Finally, we adjust the coordinates of the keyframe camera centers computed from each of and those of each 3D point in , using the scaling factor in (1). Applying follows the scale correction processes in [zhou2019ground]. We perform scale correction only when the scaling factor satisfies , to avoid frequent or excessive correction. We adjust the camera center coordinates of keyframes by scaling the distance between and the camera center coordinate calculated from each of with . Likewise, we adjust 3D point coordinates by scaling the distance between and each 3D point coordinate in with .
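The sketch below illustrates one plausible reading of this step: the scaling factor in (1) is taken to be the ratio of the real to the virtual camera height, and coordinates are rescaled about the current camera center, which is an assumption about the reference point used in the correction.

```python
import numpy as np

def scale_correct(h_real, ground_n, ground_d, cam_center_cur,
                  cam_centers, points_3d):
    """Hedged sketch of the scale correction step.

    h_real: measured camera height (m). The virtual camera height is the
    perpendicular distance from the current camera center to the estimated
    ground plane (n, d). Coordinates are rescaled about a reference point,
    taken here to be the current camera center (an assumption).
    """
    h_virtual = abs(ground_n @ cam_center_cur + ground_d)
    s = h_real / h_virtual            # assumed form of the scaling factor in (1)

    ref = cam_center_cur
    cam_centers_corr = ref + s * (cam_centers - ref)   # rescale keyframe centers
    points_3d_corr = ref + s * (points_3d - ref)       # rescale 3D map points
    return s, cam_centers_corr, points_3d_corr
```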

For the first several keyframes, we do not perform the scale correction since road-labeled 3D points are insufficient. When the number of road-labeled 3D points becomes greater than or equal to , we estimate by averaging the absolute differences between 's -coordinate and the -coordinates of the road-labeled 3D points in , and correct the scale of with in (1). (RANSAC uses a user-specified threshold to determine inlier samples and this threshold depends on the scale of the samples [grater2015robust], but such 3D points are not yet scale-corrected at this keyframe time point.) Subsequent keyframes use RANSAC-based ground plane estimation for the scale correction process, as described earlier.

II-D Corner feature refinement by removing inappropriate mapping factors

At each current keyframe time point, we refine the current corner features by removing corner features of moving objects and low-parallax areas, using current data and previous data . Using the current segmented keyframe , we first remove corner features in areas labeled as movable objects, such as cars, persons, and bicycles, similar to the existing methods [kaneko2018mask, an2017semantic, brasch2018semantic]. We next remove corner features labeled as sky, terrain, and building, referred to as backgrounds, in low-parallax areas. For each corner feature in the backgrounds, we compute the displacement between the current keyframe and the previous keyframe using the KLT tracker [baker2004lucas]. Corner features with a displacement less than some threshold are considered to lie in low-parallax areas.
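A minimal sketch of this filtering step using OpenCV's pyramidal Lucas-Kanade tracker; the function name and the way features are passed are illustrative, and the displacement threshold is the adaptively selected value described next.

```python
import cv2
import numpy as np

def filter_low_parallax(prev_gray, cur_gray, bg_features, disp_thresh):
    """Track background-labeled corner features from the current to the
    previous keyframe with the KLT tracker and drop those whose displacement
    is below disp_thresh (treated as low-parallax features).

    bg_features: Nx2 float32 pixel coordinates in the current keyframe.
    """
    pts = bg_features.reshape(-1, 1, 2).astype(np.float32)
    tracked, status, _ = cv2.calcOpticalFlowPyrLK(cur_gray, prev_gray, pts, None)
    disp = np.linalg.norm(tracked.reshape(-1, 2) - bg_features, axis=1)
    ok = status.ravel() == 1
    keep = ok & (disp >= disp_thresh)      # keep features with enough parallax
    return bg_features[keep]
```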

When a 3D point is projected onto the current keyframe and the previous keyframe, the displacement between the projected 2D points changes depending on the distance between the two keyframes. We therefore adaptively select the threshold at each current keyframe time point, using the 3D projection model on two keyframes (see below) and the estimated distance between the current and previous keyframes.

This paragraph describes the 3D pin-hole camera projection model on the current and previous keyframes, and its notation. A 3D point with coordinate projects onto the - and -keyframe 2D planes (towards their camera centers with coordinates and , respectively). The projected points on the current and previous keyframes are denoted by and , respectively. The distance between and along the -direction is denoted by . Fig. 4(a) illustrates this. In setting or estimating , , and , we use the camera intrinsic parameters, i.e., the focal length and the principal point coordinate (obtained by camera calibration in advance).

(a) 3D pin-hole camera projection on current and previous keyframes.
(b) is the distance between and on 2D keyframe plane.
The principal point coordinate is defined on 2D keyframe plane.
Fig. 4: Geometrical illustrations of the proposed adaptive selection using 3D pin-hole projection model on current and previous keyframes.

The proposed adaptive selection consists of four steps. First, we set using principal point coordinate:

(2)

where we observed that in general, low-parallax areas are located near this coordinate; see its location on a 2D keyframe plane in Fig. 4(b). Second, we set as follows: we set the -coordinate of to by assuming that inappropriate mapping occurs in objects at a distance of m or more, and calculate - and -coordinates of by

(3)

We obtained the result in (3) by back-projecting in (2) to using the perspective projection equation [hartley2003multiple]:

where we set to express in the form of a homogeneous coordinate. Third, we estimate using :

(4)

We obtained the result in (4) by mapping in (3) with the -coordinate increased by onto -keyframe 2D plane (using the perspective projection):

where we set to express in the form of a homogeneous coordinate. Finally, we select a threshold by calculating the distance between in (2) and in (4):

This is illustrated in Fig. 4(b). Objects more than m away from the -point may have a displacement less than , so we remove the corresponding corner features with displacement less than .
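The sketch below walks through the four steps with concrete numbers. It reflects one plausible reading of the construction: the focal length, principal point, depth beyond which mapping is considered unreliable, and inter-keyframe distance are assumed values, and the back-projected point is shifted parallel to the image plane before being re-projected (an interpretation of the third step), under which the threshold reduces to f·d/Z.

```python
import numpy as np

def adaptive_parallax_threshold(f, cx, cy, Z, d):
    """Hedged sketch of the four-step threshold selection.

    f: focal length (pixels); (cx, cy): principal point; Z: assumed depth (m)
    beyond which mapping is considered unreliable; d: estimated distance
    between the current and previous keyframe camera centers. The shift of
    the back-projected point by d is assumed to be parallel to the image
    plane (an interpretation, not the paper's exact derivation).
    """
    # Step 1: pick the principal point on the current keyframe plane.
    p_cur = np.array([cx, cy])
    # Step 2: back-project it to a 3D point at depth Z (perspective projection).
    X = np.array([(p_cur[0] - cx) * Z / f, (p_cur[1] - cy) * Z / f, Z])
    # Step 3: shift the point by d and project it onto the previous keyframe plane.
    X_shifted = X + np.array([d, 0.0, 0.0])
    p_prev = np.array([f * X_shifted[0] / X_shifted[2] + cx,
                       f * X_shifted[1] / X_shifted[2] + cy])
    # Step 4: the threshold is the pixel distance between the two projections,
    # which reduces to f * d / Z for this construction.
    return float(np.linalg.norm(p_prev - p_cur))

# Example with assumed values: f = 700 px, depth 30 m, 1 m between keyframes.
print(adaptive_parallax_threshold(700.0, 640.0, 360.0, 30.0, 1.0))  # ~23.3 px
```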

III Results and discussion

We compared the proposed monocular SLAM with four variations of ORB-SLAM2 [mur2017orb]: 1) monocular setup without loop closing (LC), 2) monocular setup with LC, 3) stereo setup without LC, and 4) stereo setup with LC. The ORB-SLAM2 stereo setup uses two cameras and resolves the scale ambiguity & drift [mur2017orb] and inaccurate mapping issues [scaramuzza2011visual]. LC can further improve the estimation accuracy of camera poses and 3D point coordinates in ORB-SLAM2, if it detects a loop. It can be applied to both monocular and stereo setups.

To measure trajectory tracking accuracy, we calculated the absolute trajectory error (ATE) [sturm2012benchmark], the root-mean-square error, in meters (m), between the predicted keyframe camera center coordinates and their ground truth (when ground truth exists).
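For reference, a minimal ATE sketch under the simplifying assumption that the predicted and ground-truth keyframe centers are already associated one-to-one and expressed in the same coordinate frame:

```python
import numpy as np

def absolute_trajectory_error(pred_centers, gt_centers):
    """ATE as the root-mean-square error (m) between predicted keyframe
    camera centers and their ground truth, both given as Nx3 arrays that are
    assumed to be already associated and aligned."""
    diff = pred_centers - gt_centers
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))
```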

III-A Experimental setup

We first determined an appropriate downsampling factor for keyframe segmentation (see the last paragraph in Section II-B), and compared the proposed method using the determined downsampling factor with the four state-of-the-art visual SLAM methods described above.

III-A1 Comparisons of different keyframe (spatial) resolutions for semantic segmentation in the proposed method

We compared the accuracy and running time of semantic segmentation and the trajectory tracking accuracy of the proposed monocular SLAM system with different keyframe resolutions. To generate keyframes with different resolutions, we downsampled the width and height of the original keyframes by factors of 1.333, 2, and 4 using bicubic downsampling. We upsampled segmented keyframes with nearest-neighbor interpolation (using an upsampling factor identical to the corresponding downsampling factor). In the proposed method, long segmentation running time(s) can jeopardize camera pose estimation in localization (due to limited 3D points), so we disregarded segmentation computation costs in determining an appropriate downsampling factor for keyframes.

We used the advanced Nvidia semantic-segmentation network [zhu2019improving] for keyframe segmentation. This method uses video prediction-based data synthesis [reda2018sdc] to increase the training data size, and it took first place in the KITTI semantic segmentation benchmark [andreas2020kitti_semantic]. This network was pre-trained with the KITTI semantic segmentation dataset [Alhaija2018IJCV], which is different from the KITTI odometry dataset [Geiger2012CVPR]. The proposed monocular SLAM method uses four classes, road, movable objects, background, and others, where we categorized the predictions of the segmentation network as follows: road (“road”), movable objects (“person”, “rider”, “car”, “truck”, “bus”, “train”, “motorcycle”, and “bicycle”), background (“building”, “terrain”, and “sky”), and others.
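A small sketch of this grouping as a lookup, restating the mapping above in code (the dictionary and function names are illustrative):

```python
# Grouping of the segmentation network's Cityscapes-style classes into the
# four categories used by the proposed method, as listed above.
LABEL_GROUPS = {
    "road": {"road"},
    "movable": {"person", "rider", "car", "truck", "bus", "train",
                "motorcycle", "bicycle"},
    "background": {"building", "terrain", "sky"},
}

def categorize(label):
    """Map a predicted class name to road / movable / background / others."""
    for group, names in LABEL_GROUPS.items():
        if label in names:
            return group
    return "others"
```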

In measuring the trajectory tracking accuracy and segmentation computation time, we used sequence 00 of the KITTI odometry dataset [Geiger2012CVPR]. In measuring the segmentation accuracy, we used the Cityscapes dataset [Cordts2016Cityscapes], which consists of images captured while driving in urban environments, since the KITTI odometry dataset does not have ground-truth segmentation annotations. We evaluated the semantic segmentation accuracy by the mean intersection over union (IoU) of the road, movable objects, and background labels (others were not used in the proposed method). In measuring the segmentation computation times (in seconds, s), we used an NVIDIA GeForce GTX 1080 with 8 GB of memory.

We determined an appropriate spatial downsampling factor for keyframe segmentation, considering the trade-off between trajectory tracking accuracy and segmentation running time.

III-A2 Comparisons between different visual SLAM methods

We ran the proposed monocular SLAM system, accounting for segmentation computation times, with the keyframe spatial resolution determined via the trade-off investigation in Section III-A1. The segmentation running time includes the computational costs of downsampling, semantic segmentation, upsampling, and data transfer between CPU and GPU. We segmented all downsampled frames in advance; when a keyframe is selected, we added the corresponding segmentation running time to the segmentation thread.

For comparing different visual SLAM methods, we used sequences 00, 03, 05, 07, and 08 of the KITTI odometry dataset [Geiger2012CVPR]. The KITTI odometry dataset consists of sequence sets, where each sequence set is obtained by driving in a different residential area. Sequences 00, 05, and 07 have a loop or loops in their trajectories. Sequences 03 and 08 do not have loops, so we only ran ORB-SLAM2 without LC (for both monocular and stereo setups). The KITTI odometry dataset has only a small number of moving and distant objects. To better evaluate the effects of removing such inappropriate mapping factors with the proposed monocular SLAM, we used an additional sequence provided by StradVision Inc. [stradvision_2021]. The sequence includes a monocular video of frames in which many cars drive on the highway (i.e., there exist many moving objects) surrounded by mountains (i.e., low-parallax areas). The sequence does not have any loops, so we only ran monocular ORB-SLAM2 without LC.

In addition to the quantitative analyses with ATEs, we qualitatively compared trajectory tracking accuracy in the KITTI odometry dataset by comparing estimated trajectories with their ground truths. Since the StradVision sequence does not have a ground-truth trajectory, we aligned the trajectories with the map provided by Google Maps for qualitative comparison. (For monocular ORB-SLAM2 without LC, we also aligned the scale of the estimated trajectory.)

III-B Comparisons of different keyframe resolutions for semantic segmentation in the proposed method

Table I shows that when keyframes are downsampled by a factor of 1.333 or 2, both segmentation accuracy and trajectory tracking accuracy decrease only slightly compared to the case with the original keyframe resolution. When keyframes are downsampled by a factor of 4, both segmentation accuracy and trajectory tracking accuracy significantly decrease compared to the case with the original keyframe resolution. Specifically, the segmentation accuracy (in IoU) decreases by , , and in the road, movable objects, and background labels, respectively; the trajectory tracking accuracy decreases by .

Downsampling factor | Segmentation accuracy (IoU)          | ATE (m)
                    | Road    Movable Obj.    Background   |
1                   | 0.934   0.865           0.867        | 6.721
1.333               | 0.926   0.863           0.863        | 6.755
2                   | 0.918   0.859           0.855        | 6.784
4                   | 0.898   0.782           0.806        | 8.681
  • The Nvidia semantic-segmentation network [zhu2019improving] was used.

TABLE I: Segmentation accuracy and trajectory tracking accuracy with different keyframe downsampling factors in the proposed monocular SLAM
             Sequence | (a) Proposed | (b) Monocular | (c) Monocular | (d) Stereo    | (e) Stereo
                      |     method   |     ORB-SLAM2 |     ORB-SLAM2 |     ORB-SLAM2 |     ORB-SLAM2
                      |              |     without LC|     with LC   |     without LC|     with LC
With loop(s)   00     |      6.797   |     231.701   |     199.658   |       8.515   |       6.869
               05     |      2.888   |     170.815   |     143.368   |       2.949   |       1.567
               07     |      2.354   |      80.490   |      67.158   |       2.856   |       1.057
No loops       03     |      1.306   |     140.736   |        -      |       1.387   |          -
               08     |     12.973   |     196.803   |        -      |      12.765   |          -
TABLE II: ATE values (m) of different visual SLAM methods in different KITTI odometry sequences

The segmentation running times for keyframes downsampled by factors of , , , and are , , , and , respectively (all in s). When segmentation computation times are considered (see Section III-A2), we observed on the KITTI-00 sequence that for the downsampling factors and , the proposed monocular SLAM system frequently stops running; for the downsampling factors and , it runs continuously. The reason is that a long segmentation running time can significantly delay keyframe selection and limit 3D point generation, and eventually cause failure of the entire system; see further details in Section II-B.

The aforementioned empirical observations demonstrate that the (spatial) downsampling factor offers a reasonable trade-off between trajectory tracking accuracy and segmentation running time. This keyframe downsampling factor 1) slightly degrades the trajectory tracking accuracy (by ) and the segmentation accuracy (by , , and for the road, moving objects, and background labels, respectively), compared to the original keyframe resolution, but 2) significantly reduces the segmentation running time (by a factor of ), so the proposed monocular SLAM runs continuously. To achieve real-time execution of the proposed monocular visual SLAM system, we decided to spatially downsample keyframes by this factor for semantic segmentation.

III-C Comparisons between different visual SLAM methods

For all sequences of the KITTI odometry dataset, Tables II(a)–II(c) and the trajectory comparisons in Figs. 5 and 6 show that the proposed monocular SLAM significantly improves trajectory tracking accuracy compared to monocular ORB-SLAM2, regardless of whether LC is used. The monocular ORB-SLAM2 methods suffer from scale ambiguity regardless of LC: their estimated trajectories are off scale compared to the ground-truth trajectories, and their ATE values increase significantly. See Tables II(b)–II(c) and compare their trajectories in Figs. 5 and 6. The proposed monocular visual SLAM system resolves the scale ambiguity by correcting scales using segmented road labels, and achieves significantly better trajectory tracking accuracy.

Regardless of the existence of loop(s) in the KITTI odometry sequences, Tables II(a) and II(d) and the trajectory comparisons in Figs. 5 and 6 demonstrate that the proposed method achieves trajectory tracking accuracy comparable to stereo ORB-SLAM2 without LC. The results imply that, using only a single camera, the proposed method resolves the scale ambiguity & drift and inaccurate mapping issues with performance comparable to stereo ORB-SLAM2 without LC.

In the sequences with loop(s), stereo ORB-SLAM2 with LC gave slightly better trajectory tracking accuracy than the proposed method. See Tables II(a) and II(e) and compare their results in Fig. 5. In both monocular and stereo ORB-SLAM2, the LC scheme improves trajectory tracking accuracy by correcting accumulated localization errors when a vehicle revisits a previously mapped area (i.e., a loop is created). Compare the results between Tables II(b) and II(c), and those between Tables II(d) and II(e). Based on the results in Tables II(a) and II(d), we conjecture that if LC were additionally used, the proposed monocular SLAM system would achieve comparable or better trajectory tracking accuracy compared to stereo ORB-SLAM2 with LC.

In the StradVision sequence (which has no loops), Fig. 7 shows that the proposed method significantly improves the trajectory tracking accuracy compared to monocular ORB-SLAM2 without LC. Whereas the proposed method consistently followed the road route, monocular ORB-SLAM2 without LC produced deviated vehicle localization at the th frame and eventually terminated at the th frame. The monocular ORB-SLAM2 system without LC does not remove moving objects and low-parallax areas when generating 3D points in its mapping module. Such inappropriate 3D points can significantly degrade camera pose estimation performance and eventually terminate the entire system. The proposed SLAM system overcomes this challenge via corner feature refinement in the segmentation thread; see Section II-D.

The proposed monocular SLAM using the keyframe downsampling factor for semantic segmentation did not experience system termination/failure in any of the six sequences.


(a) Sequence   (b) Sequence   (c) Sequence

Fig. 5: Qualitative comparisons in the KITTI odometry dataset. Sequences 00, 05, and 07 have a loop or loops in their trajectories.

(a) Sequence   (b) Sequence

Fig. 6: Qualitative comparisons in the KITTI odometry dataset. Sequences 03 and 08 have no loops in their trajectories.
Fig. 7: Comparison of the proposed method and monocular ORB-SLAM2 without LC in the StradVision sequence. The blue solid and red dashed lines denote the trajectories estimated by the proposed method and by monocular ORB-SLAM2 without LC, respectively. The estimated trajectories are aligned with the map from Google Maps.

IV Conclusion

Monocular SLAM methods use only a single camera and thus have major limitations in scale ambiguity & drift and inaccurate mapping. The proposed monocular SLAM system resolves these limitations by efficiently using deep learning-based semantic segmentation: it applies state-of-the-art semantic segmentation only to (downsampled) keyframes, in parallel with mapping processes. The proposed method performs scale correction using road-labeled 3D points and removes corner features labeled as dynamic objects or low-parallax areas in 3D point generation. Regardless of the existence of loops in video sequences, the proposed monocular SLAM achieves significantly better trajectory tracking accuracy than the state-of-the-art monocular SLAM, monocular ORB-SLAM2 [mur2017orb]. In video sequences without loops, the proposed monocular SLAM achieves trajectory tracking accuracy comparable to that of the state-of-the-art stereo SLAM, stereo ORB-SLAM2 [mur2017orb]. We believe that the proposed monocular SLAM system would be particularly useful in ADAS and autonomous driving systems, since it uses only a single camera yet performs comparably to stereo visual SLAM systems. The proposed efficient integration of semantic segmentation with monocular SLAM could also be applied to integrating visual SLAM with other deep learning methods with high computational costs, such as depth estimation.

There are a number of avenues for future work. First, we will run LC in an additional thread to further improve trajectory tracking accuracy. Second, we plan to use camera poses, 3D point coordinates, etc., to refine segmentation results, motivated by [wang2019unified]. On the imaging side, future work includes modifying the proposed framework for fish-eye or omnidirectional cameras.

References