A Unified Framework for Mutual Improvement of SLAM and Semantic Segmentation

by Kai Wang, et al.
CloudMinds Technologies Co. Ltd

This paper presents a novel framework for simultaneously performing localization and segmentation, two of the most important vision-based tasks in robotics. While the goals and techniques of the two tasks were previously considered to be different, we show that by making use of the intermediate results of the two modules, their performance can be enhanced at the same time. With the help of the segmentation result, our framework is able to handle both the instantaneous motion and the long-term changes of instances during localization, and the segmentation in turn benefits from the refined 3D pose information. We conduct experiments on various datasets and show that our framework improves the precision and robustness of both tasks and outperforms existing localization and segmentation algorithms.




I Introduction

Localization and segmentation are two of the most fundamental tasks for robotic movement and sensing. The former makes the robot aware of its current position and orientation, and the latter helps it perceive the distribution and precise boundaries of the objects of interest within its field of view. These two techniques are essential in many robotic applications, including autonomous driving, Unmanned Aerial Vehicles (UAVs), robot patrolling and logistics.

For the localization task, visual Simultaneous Localization and Mapping (vSLAM) has become one of the most promising methods in recent years due to its relatively low hardware and computational cost. It uses image sequences together with auxiliary sensor data, such as depth maps and Inertial Measurement Unit (IMU) data, to create a map of the environment and return the current location at the same time. A big challenge in vSLAM is that the environment in which the robot operates usually changes. On the one hand, instantaneous movement of objects during mapping degrades the precision of the map, because the moving objects are inconsistent with the motion of the rest of the scene [DynaSLAM]. On the other hand, the map will no longer be consistent with the environment once some objects have moved after mapping completes. As a result, subsequent localization based on this map will not be accurate.

For the segmentation task, 2D image-based semantic segmentation using deep neural networks has proved to be effective in most cases and has been widely used in many systems. It is able to output the exact boundaries of a series of segmented regions together with their classes. However, as deep learning methods rely heavily on training data, imprecise manual labeling and a lack of similar training data usually lead to inaccurate segmentation results.

Previously, these two tasks were generally regarded as independent, and their results were rarely used by each other. In this paper, we propose a novel framework for simultaneously improving the precision of vSLAM and of semantic segmentation. Segmentation and vSLAM are performed in an interleaved manner and the results of each are used to refine the other. Specifically, the computed pose information of the previous and current frames is used to refine the segmentation of the current frame, in which all potentially moveable objects are then identified and sent to the vSLAM module for the tracking and mapping of that frame. This scheme repeats through the whole sequence, so that both the vSLAM and the segmentation precision are enhanced. Furthermore, the created map becomes more robust to changes of the scene, so later localization in the same environment benefits from it and becomes more precise. The framework is tested on different datasets and proves to be more effective than existing works on both the vSLAM and segmentation tasks.

The contributions of this paper include:

  • A unified framework for mutually enhancing the vSLAM and segmentation tasks.

  • A novel approach for enhancing both the mapping and the localization precision in vSLAM by identifying and separately processing the moving objects and the potentially moveable objects.

  • An effective refinement scheme for image segmentation that makes use of 3D pose information.

The rest of the paper is organized as follows: Section II reviews the related work on vSLAM and segmentation. Section III introduces the proposed framework and workflow in detail, and experimental results are shown and discussed in Section IV. Section V gives the conclusion.

II Related Work

Fig. 1: The overall workflow of the proposed framework, which contains a segmentation module and a vSLAM module. For each input frame, a coarse pose and a coarse segmentation are first calculated. The two results are then used to estimate a fine pose and update a tracking map. A long-term map is also maintained for later visits to the same area. At the same time, the segmentation result is refined by using the segmentation of the previous frame and the poses estimated for the two frames. The refinement of the vSLAM and segmentation results is carried out within a single iteration for each frame.

II-A vSLAM for Dynamic Scenes

vSLAM is used to estimate the camera location and a 3D map of the scene through a set of feature correspondences extracted from a series of images [SLAMsurvey]. Various works on vSLAM have been proposed in recent years, from the seminal PTAM [PTAM] to the popular ORB-SLAM2 [ORBSLAM]. Most of these approaches assume that the observed scenes are relatively static. In scenes with dynamic objects, pose estimation may drift or even be lost because there are no features that can be matched consistently.

Several works have been proposed to handle dynamic environments [dynamicvslamsurvey]. For example, [geoconstraint] used geometric constraints to separate static and dynamic features. [opticalflow] computed the likelihood of a moving object based on a motion metric derived from optical flow and then segmented the moving objects. Recently, researchers have shifted their focus to using deep neural networks for the segmentation, so that outliers can be removed for accurate pose estimation. For example, Mask-SLAM [MASKSLAM] excludes feature points detected in the sky area or on cars using the segmentation mask produced by DeepLab v2 [DeepLabv2]. [DriventoDistraction] estimates the likelihood that any pixel in an input image corresponds to either reliable static structure or dynamic objects. DynaSLAM [DynaSLAM] combines multi-view geometry models and deep-learning-based algorithms to detect dynamic objects and remove them from the frames. In [Robust], the depth map, sparse scene flow and semantic cues are combined to classify each part of the scene as static background, moveable object or moving object. While these methods have shown that excluding feature points in certain masked areas makes the estimation of camera motion more stable, they rely heavily on the exact segmentation of the moveable objects and are prone to inaccuracy when the segmentation precision is limited.

II-B Image and Video Segmentation

The pioneering work [FCN] on deep-neural-network-based image segmentation explored the use of a Convolutional Neural Network (CNN) to segment images, adapting classifiers for dense prediction by replacing the last fully-connected layer with convolution layers. Later on, [SegNet] made use of the encoder-decoder architecture and reused the pooling indices from the encoder to decrease the number of parameters. DeepLabv3 [DeepLabv3] augments the Atrous Spatial Pyramid Pooling (ASPP) module of [DeepLabv2] with image-level features to capture longer-range information as in [lowtensor], and DeepLabv3+ [DeepLabv3+] further extends it with an effective decoder module that refines the segmentation results along object boundaries. The Pyramid Scene Parsing Network (PSPNet) [PSPNet] performs spatial pooling at several grid scales and demonstrates satisfactory performance.

Furthermore, algorithms have been proposed to achieve instance-level segmentation. The earlier work [XXX0] uses R-CNN [XXX1] to classify region proposals, which are then refined by category-specific coarse mask predictions. MNC [XXX4] proposed a cascaded structure consisting of three networks, used for differentiating instances, estimating masks and categorizing objects respectively. FCIS [XXX5] performs the object segmentation and detection sub-tasks jointly and exploits the strong correlation between the two with shared score maps. Mask R-CNN [XXX6] extends Faster R-CNN [XXX7] by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition.

There are also some works proposed for video sequence-based segmentation. For example, [DST-FCN] made use of the spatial-temporal information of consecutive frames by introducing 3D-Conv [3D-Conv] and Conv-LSTM [Conv-LSTM] modules, so as to enhance the precision of video segmentation. Since the 3D spatial information of adjacent frames was not utilized, they may still fail to predict precise boundary information.

III Framework

III-A Overall Workflow

The general workflow of the proposed framework is shown in Fig. 1. This framework takes the RGB image sequences as well as the depth map sequences as input. It includes two major modules: the vSLAM module and the segmentation module. For each input frame, the vSLAM module will output the pose information of the camera w.r.t. the world and update the map of the environment for long-term use, and the segmentation module will produce an image segmentation result with the semantic information of each pixel.

Specifically, the initial input frame is first segmented and the potentially dynamic objects are identified. At the same time, a coarse pose is computed in the vSLAM module. The segmentation result is then sent to the vSLAM module to compute the initial pose information. When a new frame arrives, a coarse vSLAM step and a coarse segmentation are performed first, and the coarse pose, together with the pose and segmentation result of the last frame, is sent to the segmentation module to refine the coarse segmentation. Once the final segmentation result of this frame is computed, it is sent to the vSLAM module to proceed with fine tracking and mapping, after which the precise map and location information are obtained.

Next, the detailed vSLAM and segmentation modules will be introduced.

III-B Initial Segmentation

For each input RGB frame, we use the FCIS [XXX5] algorithm to perform an initial segmentation. We trained the network on the MS COCO [MSCOCO] dataset, which contains 80 classes covering both indoor and outdoor objects. For an input RGB image, FCIS computes the bounding box of each object. If the value of a pixel inside the bounding box is larger than a threshold, the pixel is regarded as part of the object; otherwise, it is marked as background. We repeat this operation for all the bounding boxes to get the mask for the whole image.

After the segmentation, we identify the moveable objects among all the instances in the result according to a predefined shortlist, which selects from the 80 classes only those objects that are likely to move or be moved (such as person, car, cup and chair). The result takes the form of a mask image in which the region and instance ID of each segmented instance are encoded, and it is sent to the vSLAM module to proceed with the tracking and mapping computation.
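The thresholding and shortlist steps can be illustrated with a minimal sketch. It is not the FCIS implementation: the detection layout (`label`, `box_origin`, `scores`), the 0.5 threshold and the contents of `MOVEABLE_CLASSES` are assumptions made for illustration only.

```python
from typing import Dict, List

# Hypothetical shortlist: the text names person, car, cup and chair as examples.
MOVEABLE_CLASSES = {"person", "car", "cup", "chair"}


def build_moveable_mask(height: int, width: int, detections: List[Dict],
                        score_threshold: float = 0.5) -> List[List[int]]:
    """Return an image-sized mask: 0 is background, k > 0 is the k-th moveable instance.

    Each detection is assumed to carry a class "label", the (row, col) of the
    top-left corner of its bounding box in "box_origin", and a per-pixel score
    grid "scores" covering the box.
    """
    mask = [[0] * width for _ in range(height)]
    next_id = 1
    for det in detections:
        if det["label"] not in MOVEABLE_CLASSES:
            continue  # classes outside the shortlist stay background
        top, left = det["box_origin"]
        for dy, row in enumerate(det["scores"]):
            for dx, score in enumerate(row):
                y, x = top + dy, left + dx
                inside = 0 <= y < height and 0 <= x < width
                if inside and score > score_threshold:
                    mask[y][x] = next_id  # pixel belongs to this instance
        next_id += 1
    return mask
```

Encoding the instance ID directly in the mask lets a downstream tracking step look up the instance of any feature point with a single array access.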

III-C vSLAM based on Segmentation Result

We use the ORB-SLAM2 algorithm [ORBSLAM], which has shown satisfactory performance in many scenarios. To ensure stability, we use the RGB-D version of ORB-SLAM2, which takes both the RGB image and the depth map as input.

Each time a new frame arrives, we first perform a coarse tracking step to get an initial guess of the pose of the current frame. Specifically, we extract the ORB feature points and align them with the depth map to get the 3D coordinates of each point, and then obtain the coarse rotation and translation by minimizing the reprojection error, as in the original ORB-SLAM2.

The extracted feature points are then classified into a background set and one set per segmented instance, according to the segmented area in which each point lies. If a point lies in the background area, it belongs to the background set; otherwise it falls into the set that corresponds to the segmented instance containing it. The motion state of each point set is then judged using the coarse rotation and translation. Specifically, we project the points of the tracking map onto the current frame, and for each feature point in the frame a best-matching projected point is found. If the Euclidean distance between the feature point and its best match is less than a predefined threshold, the feature point is regarded as static; otherwise it is regarded as moving. For each instance set, if the percentage of moving points is less than a threshold, the corresponding instance is regarded as a static object in the current frame; otherwise, it is deemed to be moving.
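A minimal sketch of this classification is given below. It assumes that the best match for every feature point has already been computed by the tracker; the function names, the two default thresholds and the decision to ignore unmatched points are illustrative assumptions, not values from the paper.

```python
import math
from collections import defaultdict


def group_points_by_instance(points, mask):
    """Group feature-point indices by the instance ID found at their pixel.

    points: list of (u, v) integer pixel coordinates.
    mask:   mask[v][u] is 0 for background or a positive instance ID.
    """
    groups = defaultdict(list)
    for idx, (u, v) in enumerate(points):
        groups[mask[v][u]].append(idx)
    return groups


def judge_instances(points, matches, mask,
                    dist_threshold=2.0, moving_ratio_threshold=0.5):
    """Label every segmented instance as "static" or "moving".

    matches[i] is the projected map point (u, v) that best matches points[i],
    or None when no match exists (such points are ignored here, an assumption).
    """
    states = {}
    for instance_id, indices in group_points_by_instance(points, mask).items():
        if instance_id == 0:
            continue  # the background set is treated as static by construction
        matched = [i for i in indices if matches[i] is not None]
        moving = sum(
            1 for i in matched
            if math.dist(points[i], matches[i]) >= dist_threshold
        )
        ratio = moving / len(matched) if matched else 0.0
        states[instance_id] = (
            "static" if ratio < moving_ratio_threshold else "moving"
        )
    return states
```

Deciding per instance rather than per point means a few badly matched points on a parked car do not remove the whole car from tracking, while a walking person is rejected as a unit.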

An example of the segmented regions and classified features points is shown in Fig. 2.

Fig. 2: Illustration of the classification of feature points and segmented areas. (a) The detected feature points are classified into background (in green), moving (in red) and moveable (in blue) points; (b) the segmentation result, with regions classified into background (A) and moving or moveable (B1-B6).

Next, 2D-3D matching is performed between the points of the tracking map and the feature points in the background set and in the instance sets judged to be static, by minimizing the reprojection error. The fine rotation and translation are thus obtained. The fine pose is then sent to the segmentation module for the refinement of the initial segmentation result.

There are two types of maps created and maintained in the vSLAM module: tracking map and long-term map.

The tracking map is used to compute the trajectory of the camera during the tracking process. New points of the tracking map are computed by projecting each point of the background point set and of the instance point sets judged static in the new key frame onto the tracking map using the fine pose. If matching points already exist, no further update of the map is required; otherwise, the newly projected 3D points are added to the tracking map. Using only points of static objects helps preserve the information needed to compute the camera pose in the current scene, and thus improves the tracking stability and trajectory precision.

The long-term map is designed for long-term use. It only needs to be created the first time a robot navigates a new area, and can be reused later to avoid duplicated mapping computation when the same region is visited again. Therefore, only points whose positions will probably stay fixed over time should be included in this map, so that it provides stable environment information. To do that, each time the tracking map is updated, we leave out the points in the tracking map that belong to moveable instances and thus have the potential to move in the future, and add the remaining points (i.e. the points in the background set) to the long-term map.
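The difference between the two maps can be sketched as follows. The point record layout (`key`, `xyz`, `instance`, `static`) and the use of a dictionary keyed by a match identifier are hypothetical simplifications; a real system would use the map-point association of ORB-SLAM2 instead.

```python
def update_maps(tracking_map, long_term_map, keyframe_points):
    """Insert the points of a new key frame into both maps.

    keyframe_points: list of dicts with
      "key"      hypothetical identifier used to detect an existing match,
      "xyz"      3D position already transformed with the fine pose,
      "instance" 0 for background, otherwise the instance ID,
      "static"   True when the point's set was judged static in this frame.
    """
    for point in keyframe_points:
        if not point["static"]:
            continue  # points on moving instances enter neither map
        if point["key"] not in tracking_map:
            tracking_map[point["key"]] = point  # new static point for tracking
        if point["instance"] == 0:
            # Only background points are expected to stay fixed over time.
            long_term_map.setdefault(point["key"], point)
    return tracking_map, long_term_map
```

With this split, a parked car contributes to tracking in the current session but leaves no trace in the map that is reused on later visits.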

III-D Refinement of Segmentation Result

Once we have the coarse pose (rotation and translation) of the current frame, together with the fine pose and the segmentation result of the previous frame, we can use them to update the segmentation result of the current frame.

First, we project each 2D point of the segmented regions in the last frame, which have been refined and are assumed to be accurate, to its corresponding point in the current frame. The projection uses the focal lengths and principal point of the camera, the depth value of the point and the depth factor of the depth map, the relative rotation and translation w.r.t. the last frame, and the scale factor of the image.
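A minimal sketch of such a projection is shown below, assuming a standard pinhole camera model: the pixel is back-projected with its depth, transformed by the relative pose, and projected again. The parameter names are assumptions, the image scale factor is taken to be 1, and the authors' exact formulation may differ.

```python
def reproject_pixel(u, v, depth_raw, fx, fy, cx, cy, depth_factor, R, t):
    """Map pixel (u, v) of the last frame into the current frame.

    depth_raw / depth_factor gives the metric depth of the pixel.
    R (3x3 nested lists) and t (length 3) are assumed to take last-frame
    camera coordinates to current-frame camera coordinates.
    Returns (u', v'), or None when the depth is invalid or the point
    ends up behind the current camera.
    """
    z = depth_raw / depth_factor
    if z <= 0:
        return None  # missing depth measurement
    # Back-project the pixel to a 3D point in the last camera frame.
    p = ((u - cx) * z / fx, (v - cy) * z / fy, z)
    # Apply the relative rotation and translation.
    xc, yc, zc = [
        sum(R[i][k] * p[k] for k in range(3)) + t[i] for i in range(3)
    ]
    if zc <= 0:
        return None  # behind the current camera
    # Project back onto the image plane of the current frame.
    return (fx * xc / zc + cx, fy * yc / zc + cy)
```

For example, with an identity rotation and zero translation the function returns the input pixel unchanged, which is a quick sanity check for the intrinsics.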

Next, we refine the initially segmented image with each projected region. The workflow is listed in Algorithm 1. We first try to find a matching region for the projected region in the current frame, by measuring the similarity between each region of the coarse segmentation result and the projected region. The similarity measure, referred to as (5) in Algorithm 1, combines two weighted terms. The first term is the Euclidean distance between the barycenters of the two regions and measures their positional distance. The second term is based on the total number of pixels inside each region and measures their shape difference. Each term has its own weight. The region with the smallest value, provided that this value is below a predefined threshold, is selected as the matching region for the projected region. We then compare the ratios of the intersection area to each of the two regions; the region with the larger ratio is considered the reliable one and is preserved as the final segmented region.
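The matching step can be sketched as below, with regions represented as sets of (u, v) pixels. The concrete form of the score (a weighted sum of the barycenter distance and the absolute difference in pixel counts), the weights and the threshold are assumptions chosen to follow the description; the paper's exact formula is not reproduced here.

```python
import math


def barycenter(region):
    """Mean pixel position of a non-empty region given as a set of (u, v)."""
    n = len(region)
    return (sum(u for u, _ in region) / n, sum(v for _, v in region) / n)


def dissimilarity(region_a, region_b, w_pos=1.0, w_shape=0.01):
    """Lower is more similar: position term plus pixel-count term."""
    distance = math.dist(barycenter(region_a), barycenter(region_b))
    size_difference = abs(len(region_a) - len(region_b))
    return w_pos * distance + w_shape * size_difference


def find_matching_region(projected, candidates, max_score=20.0):
    """Return the index of the best candidate below max_score, or None."""
    best_index, best_score = None, max_score
    for index, candidate in enumerate(candidates):
        score = dissimilarity(projected, candidate)
        if score < best_score:
            best_index, best_score = index, score
    return best_index
```

The two weights trade off how far a region may drift between frames against how much its size may change, so in practice they would need tuning per camera and frame rate.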

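for each projected region do
     find the matching region for the projected region with (5)
     if found then
         compute the intersection area of the projected region and the matching region
         compute the ratio of the intersection area to the projected region
         compute the ratio of the intersection area to the matching region
         if the ratio for the projected region is the larger one then
              replace the matching region with the projected region
         else
              // do nothing
         end if
     else
         if the current frame has fewer segmented instances than the previous frame then
              add the projected region to the segmentation result
         else
              // do nothing
         end if
     end if
end for
Algorithm 1 Workflow for segmented regions’ update.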

If no matching region is found for a projected region, and the number of segmented instances in the current frame is less than that of the previous frame, there is a high probability that the segmentation algorithm failed to recognize an instance that should have been segmented. In that case, we update the segmentation result by adding the projected region to it. If the numbers of segmented instances are the same, we simply skip the current projected region and repeat the process for the next one.
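The update workflow of Algorithm 1 can be sketched as follows, again with regions as sets of (u, v) pixels. The matcher is passed in as a function so that the sketch stays self-contained; its signature and the use of the number of projected regions as the previous frame's instance count are assumptions.

```python
def refine_segmentation(projected_regions, current_regions, find_match):
    """Refine the current segmentation with regions projected from the last frame.

    find_match(projected, regions) is assumed to return the index of the
    matching region in `regions`, or None when there is no match.
    """
    refined = [set(region) for region in current_regions]
    # Fewer instances than in the previous frame suggests a missed detection.
    fewer_instances = len(current_regions) < len(projected_regions)
    for projected in projected_regions:
        if not projected:
            continue  # nothing was projected into the image
        index = find_match(projected, current_regions)
        if index is not None:
            intersection = len(projected & refined[index])
            ratio_projected = intersection / len(projected)
            ratio_current = intersection / len(refined[index])
            if ratio_projected > ratio_current:
                refined[index] = set(projected)  # projected region is more reliable
        elif fewer_instances:
            refined.append(set(projected))  # recover the missed instance
    return refined
```

Because both ratios share the same intersection, the comparison favours the smaller of the two overlapping regions, which is what shrinks an oversized initial mask.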

It should be mentioned that this strategy is based on the assumption that there are no drastic changes between two adjacent frames. In some extreme circumstances, for example when an object moves much faster than the camera does, the algorithm may fail to judge the region correspondence and introduce spurious results. This may be alleviated by including more frames in the computation, although this case is rarely seen in real applications.

IV Experiments


We test our framework on different datasets with ground truths available, and compare with other state-of-the-art works on vSLAM and image segmentation. We run each sequence ten times as in [DynaSLAM] to compensate for the non-deterministic nature of dynamic scenes. All tests were implemented on a workstation with Intel i7 6700K CPU, with 32 GB RAM and Nvidia GTX1070 GPU.

IV-A Test Results on TUM Dataset

We first test the performance of the vSLAM module of our framework on the TUM dataset [TUM], in which 39 RGB-D sequences are collected. Each sequence contains both 8-bit RGB images and 16-bit depth images, with the ground truth of the camera trajectory provided. In particular, we select 6 sequences from the ’fr3’ subset whose names contain ’walking’ or ’sitting’. The images were taken in the ’desk’ scene, in which two persons are either walking or sitting, and are thus suitable for testing the performance of our algorithm in scenes with dynamic objects.

We compared our algorithm with the original ORB-SLAM2 [ORBSLAM] and DynaSLAM [DynaSLAM] in terms of Absolute Trajectory Error (ATE) [TUM] which represents the tracking precision by taking the ground truth as reference, and the results are shown in Table I.

Sequence            ORB-SLAM2  DynaSLAM  Ours (median)  Ours (min)  Ours (max)
Walking_halfsphere  0.351      0.025     0.019          0.010       0.028
Walking_static      0.090      0.006     0.005          0.0005      0.008
Walking_rpy         0.662      0.035     0.032          0.002       0.036
Walking_xyz         0.459      0.015     0.014          0.001       0.029
Sitting_halfsphere  0.020      0.017     0.021          0.002       0.031
Sitting_xyz         0.009      0.015     0.009          0.001       0.022
TABLE I: Comparison of the ATE [m] of our vSLAM module against the original ORB-SLAM2 [ORBSLAM] and DynaSLAM [DynaSLAM].
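For orientation, the way a trajectory error and its per-sequence summary can be computed is sketched below. This is a simplified illustration: it assumes that the estimated and ground-truth positions are already associated by timestamp and expressed in the same coordinate frame, whereas the TUM benchmark tooling additionally aligns the two trajectories before measuring the error. The function names are hypothetical.

```python
import math
import statistics


def ate_rmse(estimated, ground_truth):
    """Root-mean-square translational error between two aligned trajectories.

    Both arguments are equally long lists of (x, y, z) positions in metres.
    """
    squared = [
        sum((e - g) ** 2 for e, g in zip(p_est, p_gt))
        for p_est, p_gt in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared) / len(squared))


def summarise_runs(errors):
    """Median, minimum and maximum error over repeated runs of one sequence."""
    return {
        "median": statistics.median(errors),
        "min": min(errors),
        "max": max(errors),
    }
```

Reporting the median together with the minimum and maximum over repeated runs shows both the typical accuracy and the spread caused by the non-deterministic behaviour of the system.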

Table I shows that our algorithm improves substantially over ORB-SLAM2 on the ’walking’ sequences; for example, the ATE on Walking_halfsphere drops from 0.351 m to a median of 0.019 m. In these sequences, ORB-SLAM2 creates many matches on dynamic feature points due to the movement of the two persons, which enlarges the pose error during optimization. Similar to DynaSLAM, we segment and discard the moving objects that contribute the dynamic points and therefore reach higher precision. Our algorithm outperforms DynaSLAM on these sequences because we refine the segmentation results using 3D pose information and obtain more accurate segmented regions and boundaries. The enhanced segmentation precision makes the removal of dynamic points more accurate and thus reduces the pose error. For the ’sitting’ sequences, the improvement of our algorithm is less pronounced, as there are few dynamic objects in the scene and they do not affect the feature point matching much.

We also visualize the trajectories output by our algorithm and by ORB-SLAM2 together with the ground truth in Fig. 3, in green, red and blue respectively. It can be seen that our result is much closer to the ground truth than that of ORB-SLAM2.

Fig. 3: Comparison of the output trajectories of our vSLAM module (in green), ORB-SLAM2 [ORBSLAM] (in red) and the ground truth (in blue) for the (a) ’walking_halfsphere’, (b) ’walking_static’, (c) ’walking_rpy’ and (d) ’walking_xyz’ sequences of the TUM dataset [TUM].

The average time for the coarse tracking is 6 ms, and the fine tracking and mapping takes 22 ms.

IV-B Test Results on ScanNet Dataset

As the ground truth for segmentation is not available in the TUM dataset, we use the ScanNet dataset [Scannet] to evaluate the performance of our segmentation module. ScanNet contains 1500 RGBD sequences taken in indoor environments, with 2.5 million images in total. The RGB images and depth maps have different resolutions; with the provided extrinsic parameters, each depth map can be mapped to the RGB image. Ground truth of the segmentation is available for every RGB image. As the image sets in ScanNet have 550 object classes, we manually map each class to one of the 80 MS COCO classes according to its name or general type.

For all the images, we compute the mean Average Precision (mAP) and mean Intersection over Union (mIoU) for the results generated using our segmentation module and the original FCIS [XXX5] algorithm. The results are shown in Table II.

        FCIS     Our segmentation module
mAP     0.6314   0.6504
mIoU    0.5620   0.5751
TABLE II: Comparison of FCIS [XXX5] and our segmentation module on the ScanNet dataset.
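As a reminder of what the mIoU figure measures, a minimal sketch of a per-class intersection-over-union average is given below. It works on flat sequences of integer class labels and skips classes that appear in neither the prediction nor the ground truth; the evaluation protocol actually used in the paper is not specified, so this is only an illustration of the metric.

```python
def mean_iou(predicted, truth, num_classes):
    """Mean intersection over union across the classes that occur.

    predicted, truth: equally long sequences of integer labels in
    the range [0, num_classes).
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(predicted, truth):
        if p == t:
            intersection[p] += 1
            union[p] += 1
        else:
            # A mismatched pixel enlarges the union of both classes.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0
```

Because every class contributes equally to the mean, small objects weigh as much as large ones, which makes the metric sensitive to boundary errors on thin or small instances.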

It can be seen that the segmentation precision of our module improves over that of FCIS [XXX5]: the mAP rises from 0.6314 to 0.6504 and the mIoU from 0.5620 to 0.5751. This indicates that using 3D pose information to refine the segmented areas works as expected.

Fig. 4: Examples of the refinement of segmentation. (a)-(c): The segmentation of the last frame, the initial segmentation of the current frame with a missing part, and the refined segmentation of the current frame respectively. (d)-(f): The segmentation of the last frame, the initial segmentation of the current frame with an oversized part, and the refined segmentation of the current frame respectively.

We selected two example groups of segmented images to illustrate the refinement of segmentation by our algorithm in Fig. 4. In Fig. 4(b), a table that failed to be segmented, probably due to motion blur, is added to the refined result (see Fig. 4(c)) by projecting the segmented part of the last frame (see Fig. 4(a)) onto the current one. Fig. 4(f) shows that the oversized segmented area in the initial segmentation result (see the chair in Fig. 4(e)) was shrunk to its correct extent by projecting and combining the result of the last frame (see Fig. 4(d)).

The initial segmentation for each frame takes 113 ms on average, and the refinement takes about 50 ms. The latter process can be further accelerated by utilizing parallel computing or GPU techniques.

IV-C Test Results on AirSim Generated Dataset

In the above two tests, the performance of our algorithm on the two tasks has not been tested simultaneously. There is also no test of the relocalization performance of the vSLAM module. Therefore, we created a series of sequences using the Microsoft AirSim simulator [airsim]. It allows users to control the movement of a car or UAV in a virtual outdoor environment, and collects RGBD images as well as other sensor data during the process. The exact pose of the camera and the exact segmentation results can be generated automatically.

To generate the image sequence data, we select a total of 40 different routes in a virtual city area and, by controlling a virtual car, run two passes along each route with different camera poses and different moveable objects (vehicles, pedestrians, etc.), which may be either moving or static. The RGB and depth images obtained from the virtual camera mounted on the car share a single resolution setting and are captured at a frame rate of 15 fps. The lengths of the routes range from 160 m to 400 m. There are 16 classes in total in the segmentation results, and they are also mapped to the 80 MS COCO classes.

          ORB-SLAM2               Our vSLAM module
          median  min   max       median  min   max
ATE [m]   0.82    0.43  1.03      0.39    0.28  0.61
TABLE III: Comparison of ORB-SLAM2 [ORBSLAM] and our vSLAM module on AirSim generated sequences.

To test the precision of relocalization in vSLAM, we use the long-term map created in the first pass to compute the fine tracking in the second pass, and compare the ATE of our vSLAM module with that of ORB-SLAM2. The results in Table III show that our vSLAM module is considerably more accurate than ORB-SLAM2, with a median ATE of 0.39 m compared to 0.82 m. Note that the ATE values on the AirSim generated dataset are much higher than those on the TUM dataset. This is because the outdoor scenes in AirSim cover much larger areas than those in TUM, which are limited indoor regions.

Fig. 5: Feature point matching for two adjacent frames with (a) and without (b) segmenting dynamic objects.

We show the matching points between two consecutive frames of the generated dataset in Fig. 5. It can be seen that the car carries some feature points, which will be mapped to incorrect positions if the car disappears (see Fig. 5(b)). By segmenting the car and excluding the feature points on it during tracking and mapping (see Fig. 5(a)), the tracking precision when revisiting the same region is enhanced.

We also evaluate the segmentation performance of our algorithm on all 40 sequences of the second pass, and list the results in Table IV. By making use of the pose information to refine the initial segmentation result, our algorithm raises the mAP from 0.6702 to 0.6893 and the mIoU from 0.6491 to 0.6611.

        FCIS     Our segmentation module
mAP     0.6702   0.6893
mIoU    0.6491   0.6611
TABLE IV: Comparison of FCIS [XXX5] and our segmentation module on the AirSim generated dataset.

The results on three different datasets show that our framework effectively improves the precision of vSLAM and segmentation in both indoor and outdoor environments. The performance gain of the two modules is more evident for scenes with objects that are in motion during the current scan or that are relocated in later scans.

V Conclusions

We present a unified framework for combining the vision-based localization and segmentation tasks for robotics. With the help of the initial segmentation result, an accurate pose is refined from the coarse one by identifying and separately handling the moving and the possibly moveable objects. The refined pose in turn helps to remedy the errors and boundary inaccuracy of the segmented regions, yielding a more precise segmentation result. Experimental results on various datasets show that our approach enhances both localization and segmentation in different environments, especially those with dynamic objects and obvious changes. The proposed framework has the potential to be applied to many robotic applications that use vision sensors for combined tasks, including autonomous driving, UAVs and logistics robots.