I Introduction
Simultaneous localization and mapping (SLAM) techniques continue to evolve and have been widely applied to advanced driver assistance systems (ADAS) and autonomous driving systems. While SLAM approaches using light detection and ranging (LiDAR) sensors are accurate, LiDAR sensors are expensive and have not been widely used in commercial products. Visual SLAM systems that use camera(s) have become a popular alternative to LiDAR-based SLAM. Monocular SLAM systems that use a single camera are attractive as they are cheap and easy to install. Monocular SLAM was initially proposed with filter-based approaches [davison2007monoslam, civera2008inverse, chiuso2002structure, eade2006scalable]. The filter-based methods are computationally inefficient, since both localization and mapping run on every frame [younes2017keyframe]. To resolve this issue, keyframe-based approaches [mouragnon2006real, klein2007parallel, mur2015orb] (see other references in [younes2017keyframe]) run the mapping process only on selected frames, called keyframes, while the localization process estimates a camera pose in every frame. Keyframe-based SLAM improved the localization accuracy and computational efficiency over filter-based methods [strasdat2012visual], and became the de facto standard in monocular SLAM [younes2017keyframe].
Conventional monocular SLAM faces major challenges in scale estimation and mapping. There are two scale estimation challenges in monocular SLAM, namely, scale ambiguity and scale drift [zhou2019ground]. First, one cannot recover camera poses and three-dimensional (3D) points with absolute scale. Second, errors accumulate through the SLAM procedures, i.e., feature extraction, feature matching, camera pose estimation, and 3D structure reconstruction. In monocular SLAM, scale ambiguity is the unique problem, and scale drift becomes more severe due to the scale ambiguity. Under these scale issues, monocular SLAM can mislocalize the current camera (or vehicle) position and generate 3D points in “off-the-grid” locations. The other major challenge is inaccurate mapping from two-dimensional (2D) points to 3D points. The mapping process of monocular SLAM consists of the triangulation of 2D point pairs obtained from two adjacent frames collected at different time points [hartley2003multiple]. For accurate mapping, static scenes and sufficient parallax are required. In practice, however, some objects such as cars and pedestrians move in a scene, and some objects are distant from the camera with low parallax.

To resolve the scale ambiguity issue, many existing methods use prior knowledge, e.g., camera height [zhou2019ground, grater2015robust, lovegrove2011accurate, mirabdollah2015fast, pereira2017monocular, scaramuzza2009absolute, yang2019cubeslam], object size [botterill2011bag, botterill2012correcting, song2015high], and pedestrian height [botterill2012correcting], to estimate the absolute scale in monocular SLAM. To reduce scale drift, [strasdat2010scale] proposed a loop detection method that recognizes revisits to prior locations. However, many driving trajectories do not complete a loop. [zhou2019ground] uses the camera height to estimate the absolute scale and correct scale drifts. The method estimates the ground plane from a fixed region of interest (ROI) that corresponds to road area(s); its estimation performance is limited if the ROI is blocked by obstacles such as preceding vehicles.
To improve mapping accuracy, researchers have combined monocular SLAM and deep learning-based semantic segmentation to filter out dynamic objects [an2017semantic, kaneko2018mask, yu2018ds, brasch2018semantic] and/or sky areas [kaneko2018mask]. This approach showed excellent performance even for scenes with dynamic objects. These methods, however, use only a few segmented labels, such as dynamic objects and sky areas. One may use more labels, such as road areas, to estimate ground planes. More importantly, deep learning-based semantic segmentation has high computational costs, and applying it to each frame can stall or even terminate monocular SLAM. This limits the practical use of this powerful combination in real-time applications such as autonomous driving.
This paper proposes an improved real-time monocular SLAM framework that resolves the aforementioned limitations of monocular SLAM by using deep learning-based semantic segmentation in sophisticated ways. To resolve the time complexity problem of segmentation-based monocular SLAM, we apply deep learning-based semantic segmentation only to keyframes. This is different from existing methods [kaneko2018mask, an2017semantic, yu2018ds, brasch2018semantic] that apply semantic segmentation to all frames. To further improve efficiency, the segmentation process is performed in a separate thread, in parallel with the mapping process. To overcome the scale estimation challenges, we estimate the absolute scale at each keyframe time point using segmented road labels and the actual camera height, and apply it to correct the scales of camera poses and 3D points. To reduce inaccurate mappings, we filter out corner features of both moving objects, such as cars and pedestrians, and low-parallax areas, which may generate inappropriate 3D points. To discriminate low-parallax areas, we use the Kanade–Lucas–Tomasi (KLT) tracker [baker2004lucas], which is widely used for optical flow, and select corner features with small displacements in the sky, terrain, and building areas. Different from the existing methods [kaneko2018mask, an2017semantic, yu2018ds, brasch2018semantic], the proposed monocular SLAM framework using semantic segmentation simultaneously resolves the aforementioned scale estimation and mapping challenges.
The remainder of this paper is organized as follows. Section II describes the proposed efficient integration of deep learning-based semantic segmentation with monocular SLAM, and the proposed scale correction and inaccurate mapping reduction methods, which label 3D points and 2D corner features based on segmented keyframes, respectively. Section III reports experiments in which the proposed monocular SLAM system significantly improves trajectory tracking accuracy compared to state-of-the-art monocular SLAM. Furthermore, Section III reports experiments in which the proposed monocular SLAM system achieves trajectory tracking accuracy comparable to state-of-the-art stereo SLAM in driving trajectories without loops. Section IV concludes the study and discusses future work.
II Proposed method
The proposed monocular SLAM system consists of three modules, localization, mapping, and segmentation, where each module runs in a different thread. We modify the backbone of the conventional keyframe-based monocular SLAM method [mur2015orb] to efficiently integrate segmentation. We summarize the processes of each module between the previous and current keyframe time points. The localization module selects a keyframe when both the mapping and segmentation modules complete processing on the previous keyframe; similar to [mur2015orb], it provides corner features and the camera pose of the current keyframe (extracting corner features of each frame and estimating its camera pose with 3D points of connected keyframes). The mapping module generates new 3D points by triangulation between current corner features and connected corner features [mur2015orb], and assigns segmented labels from connected segmented keyframes to the new 3D points. It then estimates a ground plane using only road-labeled 3D points, and corrects the scale of camera poses and 3D points using the estimated ground plane and the real camera height. While the mapping module provides scale-corrected camera poses and 3D points, the segmentation module performs deep learning-based segmentation on a downsampled keyframe and refines the keyframe's corner features by removing moving objects and low-parallax areas using the segmentation result. Fig. 1 overviews the proposed monocular SLAM system.
Section II-A describes the notation for data in the proposed method. Section II-B describes how we integrate deep learning-based segmentation with keyframe-based monocular SLAM. Using segmented keyframe results, Sections II-C and II-D describe details of the scale correction in 3D mapping and the removal of factors in a current keyframe that potentially cause inappropriate mapping, respectively.
II-A Preliminaries
This section summarizes the notation for data in the proposed method:

The superscript $(k)$ denotes the keyframe index; $(c)$ indicates the current keyframe time point.

The subscript $\mathrm{s}$ denotes a segmented image or points with semantic labels. A tilde, e.g., $\tilde{F}$, denotes that its input is refined or scale-corrected.

$I^{(k)}$ denotes the $k$th keyframe. $I_{\mathrm{s}}^{(k)}$ denotes a segmented keyframe.

$F^{(k)}$ denotes a 2D image with ORB corner features [rublee2011orb] extracted from $I^{(k)}$ via the localization module. $\tilde{F}^{(k)}$ denotes corner features refined by removing factors that potentially cause inappropriate mapping, via the segmentation module.

$P^{(k)}$ denotes the normalized camera pose matrix [hartley2003multiple] of $I^{(k)}$, defined by $P^{(k)} = [R^{(k)} \mid t^{(k)}]$, where $R^{(k)}$ and $t^{(k)}$ denote a 3D rotation matrix and a translation vector, respectively. These are calculated by the localization module. $\tilde{P}^{(k)}$ denotes the scale-corrected normalized camera pose matrix, where its translation vector is calculated from the scale-corrected coordinate of the camera center. A camera center coordinate is calculated by $C^{(k)} = -(R^{(k)})^{-1} t^{(k)}$ [hartley2003multiple], where $(\cdot)^{-1}$ indicates the inverse of a matrix; see a sketch of this computation after this list. $\tilde{C}^{(k)}$ denotes the scale-corrected camera center coordinate calculated from $\tilde{P}^{(k)}$.

$M^{(k)}$ denotes a 3D map with 3D points generated from $I^{(k)}$ via the mapping module. $\tilde{M}^{(k)}$ denotes a scale-corrected 3D map, where the scale of each 3D point coordinate is corrected. $\tilde{M}_{\mathrm{s}}^{(k)}$ denotes a scale-corrected 3D map with semantic labels.

$\mathcal{C}^{(k)}$ denotes a set of indices of keyframes connected with the $k$th keyframe, at previous keyframe time points. If two keyframes generate 3D points at a number of identical coordinates equal to or greater than a preset threshold, they are called connected [mur2015orb]. Specifically, $\mathcal{C}^{(k)} = \{k_1, \ldots, k_{N_k}\}$, where $N_k$ is the number of keyframes connected with $I^{(k)}$.
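As a concrete illustration of the camera center computation above, here is a minimal NumPy sketch (the function name and example values are ours, for illustration only):

```python
import numpy as np

def camera_center(R, t):
    """Camera center C = -R^{-1} t for a normalized pose P = [R | t].

    For a rotation matrix, R^{-1} equals R^T, so we use the transpose.
    """
    return -R.T @ t

# Example: camera translated 1.5 m along z after a small yaw rotation.
theta = np.deg2rad(5.0)
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([0.0, 0.0, 1.5])
print(camera_center(R, t))
```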
II-B Proposed efficient integration of semantic segmentation with monocular SLAM
To efficiently integrate semantic segmentation with monocular SLAM, we apply semantic segmentation only to keyframes. Applying semantic segmentation to monocular SLAM improves the monocular SLAM accuracy by preventing the generation of inappropriate 3D points [kaneko2018mask, an2017semantic, yu2018ds, brasch2018semantic] and by assigning semantic labels to 3D points for better ground-plane estimation. However, applying semantic segmentation to every frame is computationally demanding and unsuitable for real-time monocular SLAM. Since 3D points are generated by the mapping process that uses only keyframes, segmenting only keyframes is sufficient for processing 3D points in monocular SLAM.
In addition, we perform the mapping and segmentation processes in two different threads. In monocular SLAM, 3D points are generated by triangulating matched features from two keyframes [hartley2003multiple]. Because the two keyframes share matched features, segmenting only one of them may be sufficient for subsequent 3D point processing. The proposed integration approach performs the mapping process with connected segmented keyframes and related data in one thread, in parallel with the segmentation process on the current keyframe and related data in the other thread. Fig. 2 illustrates the parallel processing. We describe details of each thread's processing below.
In the mapping thread, we first generate a 3D point map $M^{(c)}$ by triangulation between the current corner features $F^{(c)}$ and each of the connected (refined) corner features $\tilde{F}^{(k)}$, $k \in \mathcal{C}^{(c)}$, using the current camera pose matrix $P^{(c)}$ and each of the connected (scale-corrected) camera pose matrices $\tilde{P}^{(k)}$. We then assign semantic labels in $I_{\mathrm{s}}^{(k)}$ to the generated 3D points and obtain $M_{\mathrm{s}}^{(c)}$. Finally, we correct the scale of the keyframe camera center coordinates computed from $P^{(c)}$ and each of $\tilde{P}^{(k)}$, and that of the 3D point coordinates in $M_{\mathrm{s}}^{(c)}$. While the mapping thread performs the aforementioned processes, the segmentation thread first segments the current keyframe $I^{(c)}$ and obtains $I_{\mathrm{s}}^{(c)}$. We then refine the current corner features $F^{(c)}$, using current data and previous data, and obtain $\tilde{F}^{(c)}$. Fig. 2 summarizes the above processes. We describe details of the scale correction and corner feature refinement processes later in Sections II-C and II-D, respectively.
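A minimal sketch of this mapping-thread step is given below, assuming matched 2D features and 3x4 projection matrices are already available; OpenCV's triangulatePoints stands in for the triangulation of [mur2015orb], and the label lookup illustrates how new 3D points inherit labels from a connected segmented keyframe (all names are ours):

```python
import numpy as np
import cv2

def triangulate_and_label(P_cur, P_conn, pts_cur, pts_conn, seg_conn):
    """Triangulate matched 2D features and label the new 3D points.

    P_cur, P_conn : 3x4 projection matrices (intrinsics times pose).
    pts_cur, pts_conn : 2xN matched pixel coordinates (float32).
    seg_conn : HxW label map of the connected segmented keyframe.
    Returns Nx3 points and their semantic labels.
    """
    X_h = cv2.triangulatePoints(P_cur, P_conn, pts_cur, pts_conn)  # 4xN
    X = (X_h[:3] / X_h[3]).T                                       # Nx3
    # A new 3D point inherits the label at its pixel in the segmented keyframe.
    cols = np.clip(pts_conn[0].round().astype(int), 0, seg_conn.shape[1] - 1)
    rows = np.clip(pts_conn[1].round().astype(int), 0, seg_conn.shape[0] - 1)
    labels = seg_conn[rows, cols]
    return X, labels
```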
Note that we choose a keyframe only after both the mapping and segmentation processes are completed. If the localization thread generates keyframe candidates in the middle of the mapping or segmentation process, we disregard them and choose, as a keyframe, the candidate generated right after the mapping and segmentation processes are completed. (It is likely that subsequent frames after a keyframe candidate are also keyframe candidates.) Fig. 3 summarizes this keyframe selection strategy. Specifically, we select $I^{(c)}$ from the localization module and begin the mapping and segmentation processes only after $\tilde{M}_{\mathrm{s}}^{(c-1)}$ and $\tilde{F}^{(c-1)}$ are obtained by the mapping and segmentation threads at the previous keyframe time point, respectively.
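The following sketch illustrates this gating strategy; is_keyframe_candidate, run_mapping, and run_segmentation are hypothetical stand-ins for the three modules, not the authors' implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def slam_loop(frames, is_keyframe_candidate, run_mapping, run_segmentation):
    """Accept a keyframe only when both worker threads are idle."""
    pool = ThreadPoolExecutor(max_workers=2)
    inflight = ()  # mapping/segmentation futures for the latest keyframe
    for frame in frames:
        # ... localization estimates the camera pose on every frame here ...
        if not is_keyframe_candidate(frame):
            continue
        if inflight and not all(f.done() for f in inflight):
            continue  # disregard candidates raised mid-processing
        # Both threads finished on the previous keyframe: accept this frame
        # as the new keyframe and launch both processes in parallel.
        inflight = (pool.submit(run_mapping, frame),
                    pool.submit(run_segmentation, frame))
    pool.shutdown(wait=True)
```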
Lastly, we downsample each 2D keyframe to reduce the computational complexity of semantic segmentation (and upsample segmented keyframes back to the original resolution); a minimal sketch of this pipeline appears after this paragraph. Considering the above keyframe selection strategy and the observation that semantic segmentation is generally more computationally expensive than the mapping module, the segmentation computation time critically affects the decision timing of a new keyframe. The slower the semantic segmentation, the longer the time gap between keyframes. This leads to fewer matched corner features between two keyframe time points and fewer 3D points generated via triangulation. The localization module projects 3D points onto each frame and estimates the camera pose from projected 3D points and extracted corner features [mur2015orb]. If the number of 3D points is small, then the pose estimation accuracy degrades and eventually the monocular SLAM performance significantly degrades. Reducing the keyframe image size/resolution can decrease the computational cost of semantic segmentation methods using convolutional neural networks (CNNs), at the cost of segmentation accuracy. Our experiments disregarding segmentation CNN computation costs demonstrate that although the CNN segmentation accuracy decreases with reduced image size, the proposed monocular SLAM gives performance comparable to that using the original image size. See details later in Section III-B.
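A minimal sketch of the downsample-segment-upsample pipeline described above, where segment_fn is a hypothetical wrapper around the segmentation CNN:

```python
import cv2

def segment_keyframe(keyframe_bgr, segment_fn, factor=2):
    """Segment a spatially downsampled keyframe, then restore resolution.

    The downsampling factor of 2 follows the tradeoff study in Section III-B.
    """
    h, w = keyframe_bgr.shape[:2]
    small = cv2.resize(keyframe_bgr,
                       (int(round(w / factor)), int(round(h / factor))),
                       interpolation=cv2.INTER_CUBIC)  # bicubic downsampling
    labels_small = segment_fn(small)                   # downsampled label map
    # Nearest-neighbor upsampling keeps labels categorical (no blending).
    return cv2.resize(labels_small, (w, h), interpolation=cv2.INTER_NEAREST)
```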
II-C Scale correction
At each current keyframe time point, we correct the scale of the keyframe camera center coordinate and each 3D point coordinate. We first estimate ground plane parameters using only road-labeled 3D points in $M_{\mathrm{s}}^{(c)}$. A ground plane is parameterized with a normal vector $\mathbf{n}$ and the perpendicular distance $h$ between the ground plane and the origin, using the plane equation in 3D, $\langle \mathbf{n}, X \rangle + h = 0$ [anton2013elementary], where $\langle \cdot, \cdot \rangle$ denotes the inner product between two vectors and $X$ is the coordinate of any 3D point on the plane. For accurate ground plane estimation, we use a robust parameter estimation method, random sample consensus (RANSAC) [fischler1981random], that uses only inliers of road-labeled 3D points (we normalize $\mathbf{n}$ at each RANSAC iteration). Different from existing methods that use unlabeled 3D points in an ROI for RANSAC, e.g., [zhou2019ground], we use 3D points with the road label.
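The following is a minimal RANSAC plane-fit sketch under the formulation above; the iteration count and inlier threshold are illustrative assumptions:

```python
import numpy as np

def fit_ground_plane(road_pts, iters=200, inlier_thresh=0.05, rng=None):
    """RANSAC plane fit <n, X> + h = 0 on road-labeled 3D points (Nx3).

    inlier_thresh is a hypothetical distance threshold in the map's scale.
    Returns a unit normal n and offset h.
    """
    rng = rng or np.random.default_rng(0)
    best_n, best_h, best_count = None, None, -1
    for _ in range(iters):
        p0, p1, p2 = road_pts[rng.choice(len(road_pts), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(n)
        if norm < 1e-9:
            continue            # degenerate (collinear) minimal sample
        n /= norm               # normalize n at each iteration
        h = -n @ p0             # offset so that <n, p0> + h = 0
        dists = np.abs(road_pts @ n + h)
        count = (dists < inlier_thresh).sum()
        if count > best_count:
            best_n, best_h, best_count = n, h, count
    return best_n, best_h
```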
We then compute a scaling factor [grater2015robust] using a virtual camera height $h^{(c)}_{\mathrm{virt}}$ and the real camera height $h_{\mathrm{real}}$ that is measured in advance. We estimate the virtual camera height with the estimated ground plane parameters above: it is the perpendicular distance between the estimated ground plane and the current keyframe camera center coordinate $C^{(c)}$, calculated by $h^{(c)}_{\mathrm{virt}} = |\langle \mathbf{n}, C^{(c)} \rangle + h|$. We compute the scaling factor $s^{(c)}$ using $h_{\mathrm{real}}$ and $h^{(c)}_{\mathrm{virt}}$:
$s^{(c)} = h_{\mathrm{real}} \,/\, h^{(c)}_{\mathrm{virt}}$.   (1)
Intuitively speaking, $s^{(c)}$ above quantifies the scale drift at the current keyframe time point.
Finally, we adjust the coordinates of keyframe camera centers computed from $P^{(c)}$ and each of $\tilde{P}^{(k)}$, $k \in \mathcal{C}^{(c)}$, and those of each 3D point in $M_{\mathrm{s}}^{(c)}$, using the scaling factor $s^{(c)}$ in (1). Applying $s^{(c)}$ follows the scale correction processes in [zhou2019ground]. We perform scale correction only when the scaling factor $s^{(c)}$ satisfies a preset threshold condition, to avoid frequent or intensive corrections. We adjust the camera center coordinates of keyframes by scaling the distance between $C^{(c)}$ and the camera center coordinate calculated from each of $\tilde{P}^{(k)}$ with $s^{(c)}$. Likewise, we adjust 3D point coordinates by scaling the distance between $C^{(c)}$ and each 3D point coordinate in $M_{\mathrm{s}}^{(c)}$ with $s^{(c)}$.
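A compact sketch of this scale correction step, assuming the plane parameters from the RANSAC fit above; the rescaling about the current camera center follows the description of [zhou2019ground], and all variable names are ours:

```python
import numpy as np

def correct_scale(n, h, C_cur, h_real, cam_centers, points):
    """Correct scale drift using the estimated ground plane (a sketch).

    cam_centers and points are Nx3 arrays; every camera center and 3D
    point is rescaled about the current camera center C_cur.
    """
    h_virt = abs(n @ C_cur + h)          # perpendicular distance to plane
    s = h_real / h_virt                  # scaling factor, Eq. (1)
    cam_centers = C_cur + s * (cam_centers - C_cur)   # scale the distances
    points = C_cur + s * (points - C_cur)
    return s, cam_centers, points
```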
For the first several keyframes, we do not perform the scale correction since road-labeled 3D points are insufficient. When the number of road-labeled 3D points becomes larger than or equal to a preset number, we estimate $h^{(c)}_{\mathrm{virt}}$ by averaging the absolute differences between the height coordinate of $C^{(c)}$ and those of road-labeled 3D points in $M_{\mathrm{s}}^{(c)}$, and correct scales with $s^{(c)}$ in (1). (RANSAC uses a user-specified threshold in determining inlier samples and this threshold depends on the scale of the samples [grater2015robust], but such 3D points are not yet scale-corrected at this keyframe time point.) Subsequent keyframes use the RANSAC-based ground plane estimation for the scale correction process, as described earlier.
II-D Corner feature refinement by removing inappropriate mapping factors
At each current keyframe time point, we refine the current corner features $F^{(c)}$ by removing corner features of moving objects and low-parallax areas, using current and previous keyframe data. Using the current segmented keyframe $I_{\mathrm{s}}^{(c)}$, we first remove corner features in areas labeled as movable objects, such as car, person, and bicycle, similar to the existing methods [kaneko2018mask, an2017semantic, brasch2018semantic]. We next remove corner features labeled as sky, terrain, and building, referred to as backgrounds, in low-parallax areas; see a sketch of this filtering after this paragraph. For each corner feature in backgrounds, we compute the displacement between the current keyframe $I^{(c)}$ and the previous keyframe $I^{(c-1)}$, using the KLT tracker [baker2004lucas]. Corner features with a displacement less than some threshold $\epsilon$ are considered to lie in low-parallax areas.
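A minimal sketch of this background low-parallax filtering, using OpenCV's pyramidal KLT tracker; the threshold eps is the adaptive threshold whose selection is described next (all names are ours):

```python
import numpy as np
import cv2

def filter_low_parallax(prev_gray, cur_gray, feats, labels, bg_ids, eps):
    """Drop background-labeled features whose KLT displacement is below eps.

    feats : Nx2 float32 pixel coordinates in the current keyframe.
    labels : per-feature semantic labels; bg_ids : sky/terrain/building ids.
    """
    pts = feats.reshape(-1, 1, 2)
    # Track current-keyframe features back into the previous keyframe.
    tracked, status, _ = cv2.calcOpticalFlowPyrLK(cur_gray, prev_gray, pts, None)
    disp = np.linalg.norm(tracked.reshape(-1, 2) - feats, axis=1)
    is_bg = np.isin(labels, list(bg_ids))
    low_parallax = is_bg & (status.ravel() == 1) & (disp < eps)
    return feats[~low_parallax]
```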
When a 3D point is projected on the current keyframe and the previous keyframe, the displacement between the projected 2D points changes depending on the distance between the two keyframes. We adaptively select the threshold $\epsilon$ at each current keyframe time point, using the 3D projection model on two keyframes (see below) and the estimated distance $d$ between the current and previous keyframes.
This paragraph describes the 3D pinhole camera projection model on the current and previous keyframes, and its notation. A 3D point with coordinate $X$ projects onto the current and previous keyframe 2D planes (towards their camera centers with coordinates $C^{(c)}$ and $C^{(c-1)}$, respectively). The mapped points on the current and previous keyframes are denoted by $x^{(c)}$ and $x^{(c-1)}$, respectively. The distance between $C^{(c)}$ and $C^{(c-1)}$ along the $z$ direction is denoted by $d$. Fig. 4(a) illustrates this. In setting or estimating $x^{(c)}$, $X$, and $x^{(c-1)}$, we use the camera intrinsic parameters, i.e., the focal length $f$ and the principal point coordinate $(p_x, p_y)$ (obtained by camera calibration in advance).
Fig. 4: (a) 3D pinhole camera projection on the current and previous keyframes. (b) The threshold $\epsilon$ is the distance between $x^{(c)}$ and $x^{(c-1)}$ on the 2D keyframe plane; the principal point coordinate $(p_x, p_y)$ is defined on the 2D keyframe plane.
The proposed adaptive selection consists of four steps. First, we set $x^{(c)}$ using the principal point coordinate:
$x^{(c)} = (p_x,\, p_y)$,   (2)
where we observed that, in general, low-parallax areas are located near this coordinate; see its location on the 2D keyframe plane in Fig. 4(b). Second, we set $X$ as follows: we set the $z$-coordinate of $X$ to $Z_0$ by assuming that inappropriate mapping occurs in objects at a distance of $Z_0$ m or more, and calculate the $x$- and $y$-coordinates of $X$ by
$X = \left( Z_0\, p_x / f,\; Z_0\, p_y / f,\; Z_0 \right)$.   (3)
We obtained the result in (3) by back-projecting $x^{(c)}$ in (2) to $X$ using the perspective projection equation [hartley2003multiple]: $Z_0\,(p_x,\, p_y,\, 1)^{\top} = \operatorname{diag}(f, f, 1)\, X$, where we set the homogeneous scale to $Z_0$ to express $x^{(c)}$ in the form of homogeneous coordinates. Third, we estimate $x^{(c-1)}$ using $X$:
$x^{(c-1)} = \left( Z_0\, p_x / (Z_0 + d),\; Z_0\, p_y / (Z_0 + d) \right)$.   (4)
We obtained the result in (4) by mapping $X$ in (3), with its $z$-coordinate increased by $d$, onto the previous keyframe 2D plane (using the perspective projection): $(Z_0 + d)\,(x^{(c-1)},\, 1)^{\top} = \operatorname{diag}(f, f, 1)\,(X_x,\, X_y,\, Z_0 + d)^{\top}$, where we set the homogeneous scale to $Z_0 + d$ to express $x^{(c-1)}$ in the form of homogeneous coordinates. Finally, we select the threshold $\epsilon$ by calculating the distance between $x^{(c)}$ in (2) and $x^{(c-1)}$ in (4): $\epsilon = \| x^{(c)} - x^{(c-1)} \|_2 = \frac{d}{Z_0 + d}\sqrt{p_x^2 + p_y^2}$.
This is illustrated in Fig. 4(b). Objects more than $Z_0$ m away may have displacements less than $\epsilon$, so we remove the corresponding background corner features with displacements less than $\epsilon$.
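As a numeric illustration, the following sketch evaluates the threshold under our reconstruction of (2)-(4); the intrinsics and the 50 m distance are illustrative assumptions, not values from the original:

```python
import numpy as np

def adaptive_threshold(px, py, f, Z0, d):
    """Threshold eps from Eqs. (2)-(4); an assumed reconstruction, names ours."""
    x_cur = np.array([px, py])                    # Eq. (2)
    X = np.array([Z0 * px / f, Z0 * py / f, Z0])  # Eq. (3)
    x_prev = f * X[:2] / (Z0 + d)                 # Eq. (4)
    return np.linalg.norm(x_cur - x_prev)         # = d/(Z0+d) * ||(px, py)||

# Illustrative numbers only: KITTI-like intrinsics, 1 m between keyframes.
print(adaptive_threshold(px=607.0, py=185.0, f=718.0, Z0=50.0, d=1.0))
```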
III Results and discussion
We compared the proposed monocular SLAM with four variations of ORB-SLAM2 [mur2017orb]: 1) the monocular setup without loop closing (LC), 2) the monocular setup with LC, 3) the stereo setup without LC, and 4) the stereo setup with LC. The ORB-SLAM2 stereo setup uses two cameras and resolves the scale ambiguity & drift [mur2017orb] and inaccurate mapping issues [scaramuzza2011visual]. LC can further improve the estimation accuracy of camera poses and 3D point coordinates in ORB-SLAM2 if it detects a loop; it can be applied to both monocular and stereo setups.
In measuring the trajectory tracking accuracy, we calculated the absolute trajectory error (ATE) [sturm2012benchmark], the root-mean-squared error between predicted camera center coordinates of keyframes and their ground truths in meters (m), whenever ground truth exists.
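For concreteness, a minimal sketch of the metric, assuming predicted and ground-truth keyframe camera centers are already associated and aligned as in [sturm2012benchmark]:

```python
import numpy as np

def absolute_trajectory_error(pred_centers, gt_centers):
    """ATE as RMSE (m) between Nx3 predicted and ground-truth camera centers."""
    err = pred_centers - gt_centers
    return np.sqrt(np.mean(np.sum(err ** 2, axis=1)))
```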
III-A Experimental setup
We first determined an appropriate downsampling factor for keyframe segmentation (see the last paragraph in Section II-B), and compared the proposed method using the determined downsampling factor with the four state-of-the-art visual SLAM methods described above.
III-A1 Comparisons of different keyframe (spatial) resolutions for semantic segmentation in the proposed method
We compared the accuracy and running time of semantic segmentation and the trajectory tracking accuracy of the proposed monocular SLAM system with different keyframe resolutions. To generate keyframes with different resolutions, we downsampled the width and height of original keyframes by factors of 1.333, 2, and 4 using bicubic downsampling. We upsampled segmented keyframes with nearest-neighbor interpolation (using an upsampling factor identical to the corresponding downsampling factor). In the proposed method, long segmentation running times can jeopardize camera pose estimation in localization (due to limited 3D points), so we disregarded segmentation computation costs in determining an appropriate downsampling factor for keyframes.
We used the advanced Nvidia semantic-segmentation network [zhu2019improving] for keyframe segmentation. (This method uses video prediction-based data synthesis [reda2018sdc] to increase the training data size, and it took first place in the KITTI semantic segmentation benchmark [andreas2020kitti_semantic].) This network was pretrained with the KITTI semantic segmentation dataset [Alhaija2018IJCV], which is different from the KITTI odometry dataset [Geiger2012CVPR]. The proposed monocular SLAM method uses four classes, road, movable objects, background, and others, where we categorized predictions of the segmentation network as follows: road (“road”), movable objects (“person”, “rider”, “car”, “truck”, “bus”, “train”, “motorcycle”, and “bicycle”), background (“building”, “terrain”, and “sky”), and others.
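A minimal sketch of this four-class grouping, using Cityscapes-style class names as stand-ins for the network's output:

```python
# Grouping of network predictions into the four SLAM classes described above.
LABEL_GROUPS = {
    "road": {"road"},
    "movable": {"person", "rider", "car", "truck", "bus", "train",
                "motorcycle", "bicycle"},
    "background": {"building", "terrain", "sky"},
}

def to_slam_class(name):
    for group, members in LABEL_GROUPS.items():
        if name in members:
            return group
    return "others"
```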
In measuring the trajectory tracking accuracy and segmentation computation time, we used the sequence 00 of the KITTI odometry dataset [Geiger2012CVPR]. In measuring the segmentation accuracy, we used the Cityscapes dataset [Cordts2016Cityscapes], which consists of images captured while driving in urban environments, since the KITTI odometry dataset does not have ground-truth segmentation annotations. We evaluated the semantic segmentation accuracy by the mean intersection over union (IoU) of the road, movable objects, and background labels (others were not used in the proposed method). In measuring the segmentation computation times (in seconds, s), we used an NVIDIA GeForce GTX 1080 with 8 GB memory. We determined an appropriate spatial downsampling factor for keyframe segmentation, considering the tradeoff between trajectory tracking accuracy and segmentation running time.
III-A2 Comparisons between different visual SLAM methods
We ran the proposed monocular SLAM system considering segmentation computation times, with the keyframe spatial resolution determined via the tradeoff investigation experiment in Section III-A1. The segmentation running time includes the computational costs of downsampling, semantic segmentation, upsampling, and data transfer between CPU and GPU. We segmented all downsampled frames in advance; when a keyframe was selected, we added the corresponding segmentation running time to the segmentation thread.
For comparing different visual SLAM methods, we used the sequences 00, 03, 05, 07, and 08 of the KITTI odometry dataset [Geiger2012CVPR]. The KITTI odometry dataset consists of multiple sequence sets, where each sequence set is obtained by driving in a different residential area. The sequences 00, 05, and 07 have a loop or loops in their trajectories. The sequences 03 and 08 do not have loops, so we only ran ORB-SLAM2 without LC on them (for both monocular and stereo setups). The KITTI odometry dataset has only a small number of moving and distant objects. To better evaluate the effects of removing such inappropriate mapping factors with the proposed monocular SLAM, we used an additional sequence provided by StradVision Inc. [stradvision_2021]. The sequence is a monocular video in which many cars drive on a highway (i.e., there exist many moving objects) surrounded by mountains (i.e., low-parallax areas). The sequence does not have any loops, so we only ran monocular ORB-SLAM2 without LC on it.
In addition to the quantitative analyses with ATEs, we qualitatively compared trajectory tracking accuracy in the KITTI odometry dataset by comparing estimated trajectories with their ground truths. Since the StradVision sequence does not have a ground-truth trajectory, we aligned trajectories with the map provided by Google Maps for qualitative comparison. (For monocular ORB-SLAM2 without LC, we also aligned the scale of the estimated trajectory.)
III-B Comparisons of different keyframe resolutions for semantic segmentation in the proposed method
Table I shows that when keyframes are downsampled by a factor of 1.333 or 2, both segmentation accuracy and trajectory tracking accuracy slightly decrease compared to the case with the original keyframe resolution. When keyframes are downsampled by a factor of 4, both segmentation accuracy and trajectory tracking accuracy significantly decrease compared to the case with the original keyframe resolution. Specifically, the segmentation accuracy (in IoU) decreases by 0.036, 0.083, and 0.061 in the road, movable objects, and background labels, respectively; the trajectory tracking error (in ATE) increases by 1.960 m.
Table I: Semantic segmentation accuracy (mean IoU) and trajectory tracking accuracy (ATE) of the proposed method for different keyframe downsampling factors.

Downsampling factor | IoU (Road) | IoU (Movable Obj.) | IoU (Background) | ATE (m)
1 | 0.934 | 0.865 | 0.867 | 6.721
1.333 | 0.926 | 0.863 | 0.863 | 6.755
2 | 0.918 | 0.859 | 0.855 | 6.784
4 | 0.898 | 0.782 | 0.806 | 8.681

The Nvidia semantic-segmentation network [zhu2019improving] was used.
Table II: Trajectory tracking accuracy (ATE, m) of (a) the proposed monocular SLAM, (b) monocular ORB-SLAM2 without LC, (c) monocular ORB-SLAM2 with LC, (d) stereo ORB-SLAM2 without LC, and (e) stereo ORB-SLAM2 with LC, on the KITTI odometry dataset.

Sequence | (a) Proposed | (b) Mono w/o LC | (c) Mono w/ LC | (d) Stereo w/o LC | (e) Stereo w/ LC
00 | 6.797 | 231.701 | 199.658 | 8.515 | 6.869
05 | 2.888 | 170.815 | 143.368 | 2.949 | 1.567
07 | 2.354 | 80.490 | 67.158 | 2.856 | 1.057
03 | 1.306 | 140.736 | – | 1.387 | –
08 | 12.973 | 196.803 | – | 12.765 | –

(Sequences 00, 05, and 07 contain loops; sequences 03 and 08 do not, so the LC variants were not run on them.)
We measured the segmentation running times (in s) for keyframes downsampled by factors of 1, 1.333, 2, and 4; the running time decreases as the downsampling factor increases. When segmentation computation times are considered (see Section III-A2), we observed with the KITTI sequence 00 that for the downsampling factors 1 and 1.333, the proposed monocular SLAM system frequently stops running; for the downsampling factors 2 and 4, it runs continuously. The reason is that a long segmentation running time can significantly delay keyframe selection and limit 3D point generation, and eventually cause failure of the entire system; see further details in Section II-B.
The aforementioned empirical observations demonstrate that the (spatial) downsampling factor of 2 gives a reasonable tradeoff between trajectory tracking accuracy and segmentation running time. The keyframe downsampling factor of 2, 1) slightly degrades the trajectory tracking accuracy (by 0.063 m in ATE) and the segmentation accuracy (by 0.016, 0.006, and 0.012 in IoU for the road, movable objects, and background labels, respectively) compared to the original keyframe resolution, but 2) significantly reduces the segmentation running time, so the proposed monocular SLAM runs continuously. To achieve real-time execution of the proposed monocular visual SLAM system, we decided to spatially downsample keyframes by the factor of 2 for semantic segmentation.
III-C Comparisons between different visual SLAM methods
In all the sequences of the KITTI odometry dataset, Tables II(a)–II(c) and the trajectory comparisons in Figs. 5–6 show that the proposed monocular SLAM significantly improves the trajectory tracking accuracy compared to monocular ORB-SLAM2, regardless of LC. The monocular ORB-SLAM2 methods suffer from scale ambiguity regardless of LC: their estimated trajectories are off the scale of the ground-truth trajectories, and their ATE values significantly increase. See Tables II(b)–II(c) and compare the trajectories in Figs. 5–6. The proposed monocular visual SLAM system resolves the scale ambiguity by correcting scales using segmented road labels, and achieves significantly better trajectory tracking accuracy.
Regardless of whether loops exist in the KITTI odometry sequences, Tables II(a) and II(d) and the trajectory comparisons in Figs. 5–6 demonstrate that the proposed method achieves trajectory tracking accuracy comparable to stereo ORB-SLAM2 without LC. The results imply that, using only a single camera, the proposed method can resolve the scale ambiguity & drift and inaccurate mapping issues with performance comparable to stereo ORB-SLAM2 without LC.
In the sequences with loop(s), stereo ORB-SLAM2 with LC gave slightly better trajectory tracking accuracy than the proposed method. See Tables II(a) and II(e) and compare their results in Fig. 5. In both monocular and stereo ORB-SLAM2, the LC scheme improves the trajectory tracking accuracy by correcting accumulated localization errors when a vehicle revisits a previously mapped area (i.e., a loop is created). Compare the results between Tables II(b) and II(c), and those between Tables II(d) and II(e). Based on the results in Tables II(a) and II(d), we conjecture that if LC is additionally used, the proposed monocular SLAM system would achieve comparable or better trajectory tracking accuracy than stereo ORB-SLAM2 with LC.
In the StradVision sequence (which has no loops), Fig. 7 shows that the proposed method significantly improves the trajectory tracking accuracy compared to monocular ORB-SLAM2 without LC. Whereas the proposed method consistently followed the road route, monocular ORB-SLAM2 without LC deviated in its vehicle localization partway through the sequence and eventually terminated. The monocular ORB-SLAM2 system without LC does not remove moving objects and low-parallax areas in generating 3D points at its mapping module. Such inappropriate 3D points can significantly degrade camera pose estimation performance and eventually terminate the entire system. The proposed SLAM system overcomes this challenge by corner feature refinement in the segmentation thread; see Section II-D.
The proposed monocular SLAM using the keyframe downsampling factor of 2 for semantic segmentation did not experience system termination/failure in any of the six sequences.
Fig. 5: Estimated trajectories on the KITTI odometry sequences with loop(s): (a) sequence 00, (b) sequence 05, (c) sequence 07.
Fig. 6: Estimated trajectories on the KITTI odometry sequences without loops: (a) sequence 03, (b) sequence 08.
IV Conclusion
Monocular SLAM methods use only a single camera and thus have major limitations in scale ambiguity & drift and inaccurate mapping. The proposed monocular SLAM system resolves these limitations by efficiently using deep learning-based semantic segmentation: it applies a state-of-the-art semantic segmentation network only to (downsampled) keyframes, in parallel with the mapping processes. The proposed method performs scale correction using road-labeled 3D points, and removes features labeled as dynamic objects or low-parallax areas in 3D point generation. Regardless of the existence of loops in video sequences, the proposed monocular SLAM achieves significantly better trajectory tracking accuracy than a state-of-the-art monocular SLAM method, monocular ORB-SLAM2 [mur2017orb]. In video sequences without loops, the proposed monocular SLAM achieves trajectory tracking accuracy comparable to a state-of-the-art stereo SLAM method, stereo ORB-SLAM2 [mur2017orb]. We believe that the proposed monocular SLAM system would be particularly useful in ADAS and autonomous driving systems, using only a single camera yet achieving performance comparable to stereo visual SLAM systems. The proposed efficient integration of semantic segmentation with monocular SLAM could also be applicable to integrating visual SLAM with other deep-learning methods that have high computational costs, such as depth estimation.
There are a number of avenues for future work. First, we will additionally run LC in an additional thread to further improve trajectory tracking accuracy. Second, we plan to use camera poses, 3D point coordinates, etc., to refine segmentation results, motivated by [wang2019unified]. Third, on the imager side, we will modify the proposed framework for fisheye or omnidirectional cameras.