Learning to Segment Dynamic Objects using SLAM Outliers

11/12/2020 ∙ by Adrian Bojko, et al. ∙ CEA

We present a method to automatically learn to segment dynamic objects using SLAM outliers. It requires only one monocular sequence per dynamic object for training and consists of localizing dynamic objects using SLAM outliers, creating their masks, and using these masks to train a semantic segmentation network. We integrate the trained network in ORB-SLAM 2 and LDSO. At runtime we remove features on dynamic objects, making the SLAM unaffected by them. We also propose a new stereo dataset and new metrics to evaluate SLAM robustness. Our dataset includes consensus inversions, i.e., situations where the SLAM uses more features on dynamic objects than on the static background. Consensus inversions are challenging for SLAM as they may cause major SLAM failures. Our approach performs better than the State-of-the-Art on the TUM RGB-D dataset in monocular mode and on our dataset in both monocular and stereo modes.




I Introduction

Simultaneous Localization and Mapping (SLAM) algorithms [18, 14] are frequently used in autonomous vehicles [20], augmented reality [2, 10, 27] and robotics [7]. A SLAM localizes the camera in a static environment while reconstructing it [5]. Visual SLAMs rely on image features and assume that the camera moves in a static environment [11] (static world assumption), although this assumption is rarely met in practice.

An extreme case where the static world assumption is not true is the motion consensus inversion, which we define as a situation where the SLAM relies more on dynamic objects than on the static environment. This may happen when an object close to the camera moves. The SLAM effectively uses a frame of reference that is not the ground and thus fails. Some works give examples of such situations [22, 1] without further studying the problem. In this article, we specifically address this issue, which is of high interest in practice.

Dynamic SLAMs aim to reduce the impact of dynamic objects by explicitly processing them. Most Dynamic SLAMs are constructed by improving an existing SLAM through different approaches. Geometry-based approaches detect motion at runtime [8] using geometrical algorithms, e.g., optical flow. These approaches compute the dominant motion of the image and may fail if there are motion consensus inversions.

Learning-based approaches use semantic masks [13]. Classes are defined at annotation time (before training), and any class that is not annotated is not recognized at runtime. Hybrid approaches try to compensate for the limits of one approach with the advantages of the other. The first type combines geometrical and learned approaches at runtime [3], but our experiments show that it does not prevent consensus inversions. The second type uses geometry during training and semantic segmentation at runtime. An example is [1]: although their approach reduces the risk of consensus inversion, it needs training sequences recording the same location at different times, does not detect objects that do not move between training sequences, and requires a full stereo camera + LIDAR setup.

Fig. 1: Steps of our method. We only need one monocular sequence per dynamic object to make the SLAM robust and prevent major failures due to dynamic objects. We train one network for our Consensus Inversion dataset and one for TUM RGB-D.

Similarly to [1], we use geometrical information (inliers and outliers) to localize dynamic objects and construct their masks, which we use to train a network. We integrate the trained network in an existing SLAM to make it Dynamic. However, our approach only needs one monocular sequence per dynamic object, while [1] requires at least two and up to eight in practice. Moreover, contrary to [1], we need neither stereo nor depth information, making our approach easier to implement and put into practice.

We use our method to make ORB-SLAM 2 [19] (Monocular, Stereo and RGB-D) and LDSO [9] robust to dynamic objects without any manual annotations. In addition to our method, we created the dataset Consensus Inversion and created two new metrics, Penalized ATE RMSE and Success Rate, to quantify the robustness of Dynamic SLAMs.

II Related Work

Given a geometric SLAM algorithm, making it robust to dynamic objects is relevant to real-world applications since it generalizes the SLAM from static to dynamic environments. Keeping the original, non-learned modules of an existing SLAM is valuable as it limits generalization issues encountered by learned algorithms, e.g., end-to-end SLAMs [31]. For these reasons, we focus on Dynamic SLAMs that keep the original SLAM modules.

Such Dynamic SLAMs include a dynamic object masking module in their pipeline. Most have the masking module placed right after feature computation: they either remove features on dynamic objects or process them separately. Rarely, some Dynamic SLAMs start by masking dynamic objects then detect features only in unmasked areas of the image [13].

II-A Geometry-based approaches

Geometry-based approaches do not use any kind of machine learning.

[21] surveys many non-learned Dynamic SLAMs and concludes that "handling missing, noisy, and outlier data remains a future challenge for most of the discussed techniques […] Most techniques also have difficulty in dealing with degenerate and dependent motion." Optical flow approaches [7] compute pixel displacements between frames but may not work if dynamic objects occupy most of the scene or have an erratic motion. Depth map approaches [25] use the additional depth information to identify salient objects but are limited by sensor range and resolution. Clustering/background-foreground approaches [16, 26] identify dynamic objects by grouping and assigning probabilities to points with similar motions but have high computational costs and do not work well with noisy or degenerate motions [21]. Our experiments show that geometrical approaches do not completely prevent consensus inversions.

II-B Learning-based approaches that use semantic masks

Learning-based approaches that use semantic masks [13, 29] remove features on dynamic objects at runtime according to their class (e.g. people and cars are usually masked) but are limited by the availability and scope of training data, not working with unknown objects [32].

II-C Hybrid approaches

Hybrid approaches are divided into two types.

The first type combines geometrical and learned algorithms at runtime [3, 4, 15, 22, 32], leveraging strong points while compensating for weaknesses (such as optical flow + semantics or depth map + semantics). However, there are still failure points common to the underlying algorithms, e.g., an object of unknown class causing a consensus inversion.

The second type uses geometrical approaches during training and learned approaches at runtime [1]. During training, geometrical approaches are used to generate the masks of dynamic objects, which are then learned. At runtime it is the same as learning-based approaches. [1] learns the appearance of dynamic objects by comparing 3D reconstructions over time but requires several traversals of the same environment at different times and a full stereo camera + LIDAR setup during training. Our approach is like [1] but we require only one monocular sequence per object, without needing additional traversals or sensors.

III Learning to Segment Dynamic Objects

III-A Overview

We outline in this section our approach, illustrated in Fig. 2, to generalize a feature-based SLAM into a Dynamic SLAM. Our goal is to protect the SLAM against the negative effects of dynamic objects in a given environment. We achieve this by training a segmentation network with two classes, static and dynamic, the latter being masked during the execution of the SLAM. Having detailed classes such as car or person is unneeded since we always mask dynamic objects. We show that unconditional masking is more robust than geometry, which fails under consensus inversions. Among mask-based approaches, ours is the only one with automatic annotation and very low data requirements (one sequence per object).

Fig. 2: Overview of our approach. We collect inliers and outliers from example sequences and use them to create masks of dynamic objects. We train a semantic segmentation network with the created masks and integrate it in the SLAM after the keypoint detection step. At runtime we infer the masks of any sequence and remove all features on dynamic objects.

SLAMs usually reject non-static features with methods such as RANSAC [19]. We make the hypothesis that, if there is no motion consensus inversion, the sudden appearance of dense clusters of outliers characterizes violations of the static world assumption by dynamic objects. Thus, if dense clusters of outliers suddenly replace inliers, it means that the inliers that became outliers were in fact dynamic features, indicating that there is a whole object violating the static world assumption rather than just isolated features.

Given a SLAM, we define example sequences as sequences that respect our hypothesis, i.e., make the SLAM generate clusters of outliers on dynamic objects when they move. In practice, example sequences are sequences in which dynamic objects are reconstructed by the SLAM and do not cause motion consensus inversions, e.g., a sitting person that stands up a couple meters away from the camera.

The steps of our approach are:

  1. Outlier and inlier preprocessing: we use the SLAM to generate outliers and inliers. For non-deterministic SLAMs, we add a filtering step.

  2. Mask creation and network training: we use the inliers and outliers to create the masks of dynamic objects. Then we use the masks to train a network.

  3. SLAM Integration: we integrate the trained network right after the feature detection step of the SLAM.

III-B Outlier and inlier preprocessing

Once initialized, a feature-based SLAM algorithm computes camera poses for each frame in roughly three steps:

  1. 2D keypoint detection.

  2. 2D-3D matching between detected keypoints and known 3D map points + triangulation of new 3D map points.

  3. Bundle adjustment: robust optimization of 2D-3D matches and camera poses.

We save the coordinates of outliers and inliers of each frame right after the bundle adjustment: outliers are keypoints whose 2D-3D match was rejected, and inliers those whose 2D-3D match was not rejected. SLAMs may be non-deterministic due to multithreading or random functions (e.g., RANSAC). We therefore save inlier and outlier coordinates over several runs and merge them, filtering out rarely observed coordinates as they tend to create spurious clusters when merged.
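The merging and filtering of coordinates across runs can be sketched as follows. This is only an illustration under stated assumptions, not the authors' code: coordinates are assumed to be integer pixel positions, and a coordinate is treated as "rarely observed" when it appears in fewer than a hypothetical `min_count` runs (the text does not state the exact criterion).

```python
from collections import Counter


def merge_runs(runs, min_count):
    """Merge the keypoint coordinates of one frame collected over several runs.

    runs: list of iterables of (x, y) integer pixel coordinates, one per run.
    min_count: hypothetical threshold; coordinates seen in fewer runs are dropped.
    """
    counts = Counter()
    for coords in runs:
        # Count each coordinate at most once per run.
        counts.update(set(coords))
    return sorted(p for p, c in counts.items() if c >= min_count)


# Outliers of the same frame over three runs of a non-deterministic SLAM.
runs = [
    [(10, 12), (40, 41), (90, 15)],
    [(10, 12), (40, 41)],
    [(10, 12), (77, 3)],
]
merged = merge_runs(runs, min_count=2)
print(merged)
```

Counting per run rather than per occurrence keeps a single noisy run from promoting a spurious coordinate.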

III-C Mask creation and network training

This step is divided into mask creation and network training.

III-C1 Mask creation

This step consists of localizing dynamic objects with sliding windows, refining the windows into masks, and propagating the masks to the whole sequence.

Localizing dynamic objects with sliding windows: on every image of every sequence we use rectangular sliding windows of different sizes, at a fixed stride, to evaluate how the inlier/outlier ratio changes.

Let $W$ be a window on one image and $W'$ its corresponding window on another image of the sequence. The outlier score of the pair measures how the inlier/outlier ratio changes between $W$ and $W'$.

When a mapped object moves, many inliers become outliers, making the ratio drop between consecutive frames. We consider that $W$ contains a dynamic object if its outlier score is less than a threshold set by the user.
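A minimal sketch of the window test follows. The exact definition of the outlier score is not reproduced here, so the form below is an assumption: the score is taken as the inlier fraction in the second window divided by the inlier fraction in the first, which drops when inliers turn into outliers. All function names and the threshold value are hypothetical.

```python
def count_in_window(points, window):
    """Count (x, y) points inside window (x0, y0, x1, y1); x1 and y1 are exclusive."""
    x0, y0, x1, y1 = window
    return sum(1 for x, y in points if x0 <= x < x1 and y0 <= y < y1)


def inlier_fraction(inliers, outliers, window):
    """Fraction of the keypoints in the window that are inliers, or None if empty."""
    n_in = count_in_window(inliers, window)
    n_out = count_in_window(outliers, window)
    total = n_in + n_out
    return n_in / total if total else None


def outlier_score(frac_first, frac_second):
    """ASSUMED score: inlier fraction in the second window relative to the first."""
    if not frac_first or frac_second is None:
        return None  # undefined if a window is empty or the first has no inliers
    return frac_second / frac_first


def is_dynamic(score, threshold):
    """A window is flagged when its score falls below the user-set threshold."""
    return score is not None and score < threshold


window = (0, 0, 100, 100)
# First image: 8 inliers and 2 outliers inside the window.
inliers_a = [(10 * i + 5, 20) for i in range(8)]
outliers_a = [(30, 80), (60, 80)]
# Second image: the object moved, so most inliers became outliers.
inliers_b = [(15, 20), (25, 20)]
outliers_b = [(10 * i + 5, 60) for i in range(8)]

score = outlier_score(
    inlier_fraction(inliers_a, outliers_a, window),
    inlier_fraction(inliers_b, outliers_b, window),
)
flagged = is_dynamic(score, threshold=0.5)  # 0.5 is an illustrative value only
```

Using the inlier fraction rather than the raw inlier/outlier quotient avoids a division by zero when a window contains no outliers; this is a design choice of the sketch.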

We roughly compensate camera motion with the homography $H = K R K^{-1}$, where $K$ is the camera intrinsic matrix and $R$ is the relative rotation between the two images. We apply $H$ to one window so that both windows match the same physical location. This approximation is easy to compute using a trajectory generated by the SLAM and was accurate enough in our experiments.
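The rotation-only compensation can be illustrated with the standard homography K R K^-1. The sketch below assumes a zero-skew pinhole camera and warps a window by taking the bounding box of its four warped corners; the intrinsics, the rotation and the helper names are illustrative, not taken from the paper.

```python
def matmul3(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rotation_homography(fx, fy, cx, cy, rotation):
    """H = K R K^-1 for a zero-skew pinhole camera (closed-form inverse of K)."""
    k = [[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]]
    k_inv = [[1.0 / fx, 0.0, -cx / fx], [0.0, 1.0 / fy, -cy / fy], [0.0, 0.0, 1.0]]
    return matmul3(matmul3(k, rotation), k_inv)


def warp_point(h, x, y):
    """Apply homography h to pixel (x, y) and dehomogenize."""
    xh = h[0][0] * x + h[0][1] * y + h[0][2]
    yh = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return xh / w, yh / w


def warp_window(h, window):
    """Warp the four corners of (x0, y0, x1, y1) and return their bounding box."""
    x0, y0, x1, y1 = window
    corners = [warp_point(h, x, y)
               for x, y in ((x0, y0), (x1, y0), (x0, y1), (x1, y1))]
    xs = [c[0] for c in corners]
    ys = [c[1] for c in corners]
    return min(xs), min(ys), max(xs), max(ys)


# Illustrative intrinsics and a 90-degree rotation about the optical axis,
# chosen only because the result is easy to check by hand.
roll_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
h = rotation_homography(500.0, 500.0, 320.0, 240.0, roll_90)
warped = warp_point(h, 330.0, 240.0)

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
same_window = warp_window(rotation_homography(500.0, 500.0, 320.0, 240.0, identity),
                          (10, 20, 110, 120))
```

A pixel 10 px to the right of the principal point ends up 10 px below it after the 90-degree roll, and an identity rotation leaves the window unchanged. Translation is ignored, which matches the "rough" compensation described above.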

Refining sliding windows into single masks: we merge all bounding boxes that overlap on the same image. Then we project each merged bounding box on the past and future frames and create image sequences with the content of these projected bounding boxes. The created sequences are a perfect fit for Unsupervised Video Object Segmentation (UVOS, methods that automatically segment salient/dynamic objects in videos) as there is no ambiguity on which object to segment. For each created sequence we apply the authors’ implementation of COSNet [17] (a State-of-the-Art UVOS network) on the central images, thus masking the dynamic objects inside the sliding windows.
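The merging of overlapping windows can be sketched as below. The text does not say how merged boxes are formed, so this sketch assumes that two overlapping boxes are replaced by their enclosing box, repeated until no boxes overlap; the function names are hypothetical.

```python
def overlaps(a, b):
    """True if boxes (x0, y0, x1, y1) share a non-empty area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def enclosing_box(a, b):
    """Smallest axis-aligned box containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge_boxes(boxes):
    """Repeatedly merge overlapping boxes until no two boxes overlap."""
    merged = list(boxes)
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if overlaps(merged[i], merged[j]):
                    merged[i] = enclosing_box(merged[i], merged[j])
                    del merged[j]
                    changed = True
                    break
            if changed:
                break  # restart the scan, the merged box may now touch others
    return merged


# Three windows flagged on one object and one isolated window.
flagged = [(0, 0, 100, 100), (50, 50, 150, 150),
           (300, 300, 400, 400), (120, 120, 200, 200)]
merged = merge_boxes(flagged)
print(merged)
```

The scan restarts after every merge because a grown box can overlap a box that it did not touch before.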

Propagating single masks: now that we have a single accurate binary mask for every dynamic object, we can propagate them to past and future frames using semi-supervised video object segmentation. These methods track dynamic objects in videos but require very accurate initial guesses, which we have thanks to the previous step. We apply the authors’ implementation of SiamMask [30] (a State-of-the-Art network that is both lightweight and class-agnostic) towards past and future frames. The result is a set of binary masks, covering the whole sequence, for each dynamic object.

III-C2 Network training

Our goal is to train a semantic network able to mask all dynamic objects simultaneously. First, we train one instance of DeepLabv3+ [6] (a State-of-the-Art semantic segmentation network; source: https://github.com/srihari-humbarwadi/person_segmentation_tf2.0) for every set of masks generated at the previous step. Then, we infer masks for every sequence and for every trained network. We superimpose the masks inferred on the same sequence and use all superimposed masks to train a final instance of DeepLabv3+. The computed model can be used to mask all dynamic objects of all sequences simultaneously.
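Superimposing the masks inferred by the per-object networks on one image amounts to a pixel-wise union. A minimal sketch, assuming binary masks stored as equally sized lists of rows (the representation and the function name are illustrative):

```python
def superimpose(masks):
    """Pixel-wise OR of equally sized binary masks (lists of rows of 0/1)."""
    height, width = len(masks[0]), len(masks[0][0])
    return [[int(any(m[r][c] for m in masks)) for c in range(width)]
            for r in range(height)]


# Masks inferred on the same image by two per-object networks.
mask_a = [[1, 0, 0],
          [0, 0, 0]]
mask_b = [[0, 0, 1],
          [0, 1, 0]]
combined = superimpose([mask_a, mask_b])
print(combined)
```

The combined masks would then serve as the two-class (static/dynamic) training targets of the final network.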

III-D SLAM Integration

We integrate the final DeepLabv3+ trained network after the feature detection module in the SLAM. We infer the mask of dynamic objects from the current image, then filter all features whose coordinates are on masked areas.
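The runtime filtering step can be sketched as follows, assuming the network output is a binary mask (1 for dynamic pixels) with the same resolution as the image and that keypoints lie inside the image. The names are illustrative and not part of ORB-SLAM 2 or LDSO.

```python
def filter_keypoints(keypoints, mask):
    """Keep the (x, y) keypoints whose pixel is not marked dynamic in the mask.

    mask: list of rows of 0/1 values, indexed as mask[row][col], 1 = dynamic.
    """
    kept = []
    for x, y in keypoints:
        col, row = int(x), int(y)  # sub-pixel keypoints map to their pixel
        if mask[row][col] == 0:
            kept.append((x, y))
    return kept


# 4x4 mask with a dynamic object in the top-right corner.
mask = [[0, 0, 1, 1],
        [0, 0, 1, 1],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]
keypoints = [(0.5, 0.5), (2.2, 1.7), (3.0, 3.0), (2.9, 0.1)]
static_keypoints = filter_keypoints(keypoints, mask)
print(static_keypoints)
```

Only the features that survive this filter are passed on to matching and bundle adjustment, so the rest of the SLAM pipeline is unchanged.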

IV Consensus Inversion Dataset and new metrics

To the best of our knowledge, there is neither a dataset nor appropriate metrics to test the robustness of SLAMs with respect to motion consensus inversions. Hence, we created the Consensus Inversion dataset, illustrated in Fig. 3, made of two subsets: Dynamic and Static. We used a MYNT EYE D1000-120 stereo camera at 1280x720 / 30Hz. IMU data is recorded but not used. All sequences are about 500 to 1000 images long.

Fig. 3: Miniatures of our dataset Consensus Inversion. The camera moves in the Dynamic subset and stays static in the Static subset.

IV-A Consensus Inversion / Dynamic

The camera is dynamic in the first subset. It includes two sequences with static objects (book/notes) and nine with dynamic objects (dragon/dromedary/car, each object moves in three sequences). This subset covers different situations:

  • Static: a static object close to the camera. Tests oversegmentation, i.e., if the SLAM masks a static object and fails for this reason (all image features are masked).

  • Easy: an object moves but does not cause consensus inversion. Tests standard SLAM robustness.

  • Hard: an object moves and causes motion consensus inversion. Tests SLAM robustness to consensus inversions.

  • Very hard: an object moves rigidly with the camera while very close to it. Tests the robustness to consensus inversion when detecting object motion is extremely difficult.

We computed the ground truth and the ground truth tracking rate (% of tracked frames) using ORB-SLAM 2 stereo improved with our method and with the early stopping of bundle adjustment removed. We also computed the ground truth tracking rate in monocular mode.

IV-B Consensus Inversion / Static

The camera is static in the second subset. An object close to the camera starts moving and causes a consensus inversion: the SLAM must, however, not compute any motion. We made five sequences per dynamic object.

IV-C Metrics: Penalized ATE RMSE and Success Rate

The standard metrics to test SLAMs are the ATE RMSE (Absolute Trajectory Error) and the tracking rate (% of tracked frames). However, if tracking rates are too different the comparison of ATE RMSEs is biased: a SLAM that stops early might skip tricky parts of the sequence. Hence, we defined the Penalized ATE RMSE and Success Rate.

We consider that an ATE RMSE is invalid if: 1) it is unknown (e.g., when using reported results) or 2) the tracking rate is lower than a bound determined by a fixed threshold and the ground truth tracking rate (i.e., the expected % of tracked frames).

In case of invalidation, the Penalized ATE RMSE is computed in relation to other SLAMs: it is defined (Eq. 2) from a penalty factor and the set of all valid ATE RMSEs computed by other SLAMs on the tested sequence.

Conversely, we consider that a SLAM successfully processes a sequence if all these conditions are respected: 1) the ATE RMSE is known; 2) the ATE RMSE is below a fixed threshold; 3) the tracking rate is at least the bound defined above. The Success Rate of a SLAM on a dataset is the ratio of sequences that the SLAM successfully processes.
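The two metrics can be sketched as below. Two points are assumptions of this sketch rather than the definitions of the paper: the minimum tracking rate is passed directly as a number (`min_tracking_rate`), and an invalid ATE RMSE is replaced by the penalty factor times the worst valid ATE RMSE of the other SLAMs. All names and numeric values are hypothetical.

```python
def is_valid(ate, tracking_rate, min_tracking_rate):
    """An ATE RMSE is valid when it is known and the tracking rate is high enough."""
    return ate is not None and tracking_rate >= min_tracking_rate


def penalized_ate(ate, tracking_rate, min_tracking_rate, other_valid_ates, penalty):
    """Return the ATE RMSE if valid; otherwise an ASSUMED penalized value."""
    if is_valid(ate, tracking_rate, min_tracking_rate):
        return ate
    if not other_valid_ates:
        raise ValueError("no valid ATE RMSE from other SLAMs to penalize against")
    # Assumption: penalize with the worst valid result of the other SLAMs.
    return penalty * max(other_valid_ates)


def is_success(ate, tracking_rate, min_tracking_rate, max_ate):
    """Known ATE RMSE, below the threshold, with a sufficient tracking rate."""
    return is_valid(ate, tracking_rate, min_tracking_rate) and ate < max_ate


def success_rate(results, min_tracking_rate, max_ate):
    """results: list of (ate, tracking_rate) pairs, one per sequence."""
    wins = sum(is_success(a, t, min_tracking_rate, max_ate) for a, t in results)
    return wins / len(results)


# Four sequences: good, inaccurate, unknown ATE, stopped early.
results = [(0.01, 1.0), (0.20, 1.0), (None, 0.3), (0.02, 0.5)]
rate = success_rate(results, min_tracking_rate=0.9, max_ate=0.1)
early_stop = penalized_ate(0.02, 0.5, 0.9, other_valid_ates=[0.03, 0.05], penalty=2.0)
kept = penalized_ate(0.01, 1.0, 0.9, other_valid_ates=[0.03], penalty=2.0)
```

In this example the run that stopped early has a low raw ATE RMSE (0.02) but is scored worse than every valid competitor, which is the bias the penalized metric is meant to remove.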

V Experiments

V-A Experimental setup

V-A1 Datasets and parameters

We evaluate our method on the TUM RGB-D dataset [23] (dynamic sequences) and on the Consensus Inversion dataset. The purpose of the TUM RGB-D dataset is the evaluation of RGB-D SLAM systems; it was recorded using a Microsoft Kinect and the ground truth camera poses were obtained from a motion capture system. The sequences contain color and depth images at 640x480/30Hz. The dynamic sequences are a set of eight sequences that record people moving.

We use ORB-SLAM 2 [19] as the core SLAM algorithm for the main experiments and LDSO [9] specifically to test the extension to a Direct SLAM. We set the feature number to 3000 and use default/author settings otherwise. We use our method to train one network on the TUM RGB-D dataset (we use the sequences fr3_sitting_static and fr3_walking_static) and one on the Consensus Inversion dataset (we use the Easy sequences of the Dynamic subset).

We empirically found suitable parameters: sliding windows of size 100x100 / 200x200 / 300x300 / 400x400 with a stride of 50px, a difference of 3 images to compute outlier scores, an interval of 30 images (1s at 30Hz) for the unsupervised VOS and a maximum outlier score threshold to determine if a window contains a dynamic object.
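For illustration, the window sizes and stride above can be turned into a window generator. The handling of image borders is an assumption of this sketch: windows that do not fit entirely inside the image are skipped.

```python
def sliding_windows(width, height, sizes=(100, 200, 300, 400), stride=50):
    """Enumerate square windows (x0, y0, x1, y1) that fit inside the image."""
    windows = []
    for size in sizes:
        for y0 in range(0, height - size + 1, stride):
            for x0 in range(0, width - size + 1, stride):
                windows.append((x0, y0, x0 + size, y0 + size))
    return windows


# TUM RGB-D images are 640x480.
windows = sliding_windows(640, 480)
print(len(windows))
```

Under this border convention a 640x480 image yields 180 windows per frame, so the outlier score is cheap to evaluate on every image.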

Fig. 4: Consensus inversion in fr3_walking_xyz (TUM RGB-D). Camera pose in blue and 3D map in red. Left: no masks, the camera trajectory is nonsensical as the SLAM uses features on moving people. Right: we apply masks using our method, the SLAM trajectory is coherent with the real motion.
Fig. 5: Example of monocular false start. The SLAM cannot initialize as the camera is perfectly static. Yet if the object is not masked the SLAM generates absurd trajectories and 3D maps. Masking dynamic objects prevents such situations.
Fig. 6: We apply our method on LDSO and test it on a hard sequence. Without masks the SLAM fails, with masks it ignores the object and works correctly.

V-A2 Metrics

To account for non-deterministic behaviors, we compute the median of the ATE RMSE of the keyframe trajectory with a Sim(3) alignment and the tracking rate over 10 executions for every sequence. We then compute the Penalized ATE RMSE and Success Rate (Section IV-C) with fixed values of the penalty coefficient, the maximum acceptable tracking rate decrease and the maximum acceptable ATE RMSE.

V-B Results

Column groups: State-of-the-Art (L-K, Dyna, ST, Uni, DS); ORB-SLAM 2 [19] + (No seg., segmentation baselines Mask R-CNN / PWC-Net / RVOS / COSNet, Our seg.).

| Test dataset | L-K [7] | Dyna [3] | ST [22] | Uni [29] | DS [32] | No seg. | Mask R-CNN [12] | PWC-Net [24] | RVOS [28] | COSNet [17] | Our seg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Consensus Inversion - Mono | 0.0547 | 0.0693 | 0.0692 | N/A | N/A | 0.0860 | 0.0760 | 0.0237 | 0.0144 | 0.0297 | 0.0089 |
| Consensus Inversion - Stereo | N/A | 0.0627 | 0.0699 | N/A | N/A | 0.0756 | 0.0630 | 0.0803 | 0.0116 | 0.0148 | 0.0094 |
| TUM RGB-D - Mono | 0.0892 | 0.1108 | 0.1101 | N/A | N/A | 0.0252 | 0.0235 | 0.0335 | 0.0331 | 0.0267 | 0.0222 |
| TUM RGB-D - RGB-D | N/A | 0.0206 | 0.0173 | 0.0190 | 0.0802 | 0.1077 | 0.0172 | 0.0790 | 0.0218 | 0.0245 | 0.0185 |

TABLE I: Average Penalized ATE RMSE (m) of the State-of-the-Art and baselines on Consensus Inversion/Dynamic and TUM RGB-D/Dynamic datasets. N/A indicates that the SLAM mode is not supported.

Column groups: State-of-the-Art (L-K, Dyna, ST, Uni, DS); ORB-SLAM 2 [19] + (No seg., segmentation baselines Mask R-CNN / PWC-Net / RVOS / COSNet, Our seg.).

| Test dataset | L-K [7] | Dyna [3] | ST [22] | Uni [29] | DS [32] | No seg. | Mask R-CNN [12] | PWC-Net [24] | RVOS [28] | COSNet [17] | Our seg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Consensus Inversion - Mono | 54.5% | 63.6% | 63.6% | N/A | N/A | 45.5% | 54.5% | 72.7% | 72.7% | 72.7% | 100.0% |
| Consensus Inversion - Stereo | N/A | 72.7% | 63.6% | N/A | N/A | 63.6% | 63.6% | 63.6% | 81.8% | 81.8% | 100.0% |
| TUM RGB-D - Mono | 50.0% | 62.5% | 62.5% | N/A | N/A | 87.5% | 87.5% | 62.5% | 62.5% | 100.0% | 100.0% |
| TUM RGB-D - RGB-D | N/A | 100.0% | 100.0% | 100.0% | 87.5% | 62.5% | 100.0% | 62.5% | 100.0% | 100.0% | 100.0% |

TABLE II: Success Rate (%) of the State-of-the-Art and baselines on Consensus Inversion/Dynamic and TUM RGB-D/Dynamic datasets. N/A indicates that the SLAM mode is not supported.

Column groups: State-of-the-Art (L-K, Dyna, ST); ORB-SLAM 2 [19] + (No seg., segmentation baselines Mask R-CNN / PWC-Net / RVOS / COSNet, Our seg.).

| L-K [7] | Dyna [3] | ST [22] | No seg. | Mask R-CNN [12] | PWC-Net [24] | RVOS [28] | COSNet [17] | Our seg. |
|---|---|---|---|---|---|---|---|---|
| 53.3% | 60.0% | 60.0% | 60.0% | 60.0% | 66.7% | 86.7% | 80.0% | 100.0% |

TABLE III: Evaluation on Consensus Inversion/Static dataset. We report the ratio of sequences that do not cause initialization fails (false starts).

V-B1 Comparison with the State-of-the-Art

Dynamic objects, especially when causing consensus inversions, may cause early SLAM failure and a decrease in the tracking rate of SLAMs, making the comparison of ATE RMSEs biased. To take both the trajectory error and the tracking rate into account we use our new metrics: the Penalized ATE RMSE and the Success Rate. The Penalized ATE RMSE integrates failures, so it is directly comparable between SLAMs, and a higher Success Rate expresses that a SLAM is less affected by dynamic objects. We evaluate the methods on the TUM RGB-D and Consensus Inversion datasets.

Table I (cols. 2-6) shows the Penalized ATE RMSE of the State-of-the-Art. Our method performed better than others on our dataset (all modes) and on TUM RGB-D in monocular mode. It is in third place on TUM RGB-D in RGB-D mode.

On TUM RGB-D in monocular mode the results of L-K [7], DynaSLAM [3] and SLAMANTIC [22] are explained by the harsh penalty we give to early failures (which happened to the three of them). For L-K, we implemented a simplified version of [7] (which uses the Lucas-Kanade optical flow): we warp frames with a homography and we set the optical flow displacement threshold to 2px. The publicly available code of SLAMANTIC does not support monocular mode so we adapted the available stereo code. DynaSLAM randomly crashed in RGB-D mode on our system, so we refer to the original results in this mode. The original ORB-SLAM 2 already performs well so removing dynamic objects is not really necessary. All Dynamic SLAMs (we report the results of DS-SLAM and Unified) performed well in RGB-D mode, including our method. We reached the standard of other Dynamic SLAMs that rely on manually annotated networks. Fig. 4 illustrates how the SLAM can output nonsense in the presence of consensus inversions (motions that do not exist) and that we prevent it.

On our dataset other Dynamic SLAMs performed poorly: they failed in hard / very hard sequences when the object was of an unknown class, e.g., dragon or dromedary. Even SLAMANTIC, which tries not to segment mobile objects that are not moving (e.g., a parked car), failed in the very hard sequences as the objects are static during most of the sequence.

Table II (cols. 2-6) shows the Success Rate of the State-of-the-Art. Our approach has the best performance in all cases. The results are coherent with the Penalized ATE RMSE: a higher Success Rate corresponds to a lower Penalized ATE RMSE. Except for TUM RGB-D in RGB-D mode (success rate of at least 87.5%), all results are below 75%. The results in monocular mode show that it is more difficult to handle monocular dynamic sequences with geometrical approaches, possibly due to the arbitrary scale of the SLAM. The success rates on our dataset (about 60%) show that both geometrical approaches and hybrid ones combining geometrical/semantic approaches at runtime fail if there are consensus inversions caused by unknown objects.

Our results show that, regarding the robustness to dynamic objects: 1) semantic networks should be fine-tuned to the dynamic objects of the working environment; 2) geometrical approaches are unreliable under consensus inversions; 3) hybrid/combined approaches are unreliable under consensus inversions caused by unknown objects.

V-B2 Comparison with baselines

Unsupervised Video Object Segmentation (UVOS) networks and semantic segmentation networks may appear as trivial solutions to make a SLAM Dynamic as their integration is straightforward.

To test this aspect, we integrated Mask R-CNN [12] (we filter the same semantic classes as DynaSLAM), RVOS [28] (in zero-shot mode) and COSNet [17] (we only use past frames for inference) in ORB-SLAM 2. We also test a very simple optical flow solution with PWC-Net [24], a learned optical flow network, by masking the area of the image with the 50% most intense optical flow.
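The optical flow baseline can be sketched as follows. "The 50% most intense optical flow" is interpreted here as the pixels whose flow magnitude is above the median of the image, which is an assumption; the flow field would come from PWC-Net, but any dense flow stored as rows of (u, v) displacements works for this sketch.

```python
import math
import statistics


def flow_mask(flow):
    """Mask the pixels whose flow magnitude is above the image median.

    flow: list of rows of (u, v) displacements in pixels.
    Returns a binary mask (1 = masked) with the same layout.
    """
    magnitudes = [[math.hypot(u, v) for u, v in row] for row in flow]
    threshold = statistics.median(m for row in magnitudes for m in row)
    return [[int(m > threshold) for m in row] for row in magnitudes]


# Tiny 2x2 flow field with magnitudes 0, 5, 1 and 2.
flow = [[(0.0, 0.0), (3.0, 4.0)],
        [(1.0, 0.0), (0.0, 2.0)]]
mask = flow_mask(flow)
print(mask)
```

This also illustrates why the integration is naive: the median split masks half of the image even when nothing moves.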

Table I (cols. 7-12) shows that we perform better than all baselines. All results are comparable to the State-of-the-Art on TUM RGB-D except for PWC-Net, likely due to its naive integration (if there is no object moving the predicted mask will be wrong). However, on the Consensus Inversion dataset results are quite different: Mask R-CNN and PWC-Net both perform poorly while RVOS and COSNet have very good results. This shows that class-agnostic methods perform better than class-aware methods and, considering the other columns of the table, than geometrical methods. The main issue with UVOS networks is in fact oversegmentation: RVOS and COSNet failed on the Static sequences of the Consensus Inversion / Static subset. They masked the only source of features in the image and made the SLAM fail. The same conclusions come from Table II. Fig. 7 illustrates failure cases.

The conclusion is that UVOS networks are better at making SLAMs robust to dynamic objects than the usual semantic and geometrical approaches. However, they also segment static objects and have an increased risk of failing early in static environments.

Fig. 7: Failure cases of baselines methods (ORB-SLAM 2 + existing network). Mask R-CNN ignores the dragon and considers the dromedary a cake. Other methods segment static objects.

V-B3 Evaluation of monocular false starts

Monocular SLAMs such as ORB-SLAM 2 require the camera to move to initialize, so we evaluate a specific kind of failure: false starts. Fig. 5 illustrates such a false start: since the camera is static, generated maps or trajectories are necessarily fake. We evaluate the State-of-the-Art and the baselines on the Consensus Inversion / Static subset. We performed best, never initializing incorrectly. All other methods failed, either because the object is unknown (semantic approaches), because the object caused a consensus inversion (geometrical approaches) or because it was not fully segmented (UVOS approaches). The results show again that it is essential to make a SLAM robust in a specific environment.

V-B4 Extension to a Direct SLAM

We test LDSO, a monocular direct SLAM. We improve it by integrating the network we trained using ORB-SLAM 2 after the feature detector. Fig. 6 shows the effect of masking: the 3D reconstruction is incorrect if the object is not masked. However, when masked, the SLAM operates normally. Table IV shows that both the Penalized ATE RMSE and Success Rate improve. While the Success Rate does not reach 100%, the result is very interesting: it is possible to use one SLAM to improve another one.

| LDSO [9] + | No segmentation | Our segmentation |
|---|---|---|
| Avg. Penalized ATE RMSE (m) | 0.0833 | 0.0581 |
| Success Rate (%) | 36.4% | 63.6% |

TABLE IV: Avg. Penalized ATE RMSE (m) and Success Rate (%) of LDSO and our masked version on the Consensus Inversion/Dynamic dataset.

V-C Limitations

There are some limitations to our method. We rely on video segmentation networks but they may not work in ambiguous situations, e.g., when a dynamic object passes in front of a very similar object. Dynamic objects must be reconstructed by the SLAM to generate outliers, so it is difficult to mask objects that move all the time. We also do not yet handle new dynamic objects at runtime. Evidently, the improved SLAM stops tracking if a known dynamic object covers the image.

VI Conclusions

In this paper we proposed a novel method to learn to segment dynamic objects using only one monocular sequence per dynamic object, which is an advantage in comparison to previous methods. More importantly, we do not need any manual labelling which makes our method much easier to use.

We also presented the Consensus Inversion dataset and new metrics to evaluate the robustness of Dynamic SLAMs. We showed that consensus inversions can cause major SLAM failures, even to State-of-the-Art Dynamic SLAMs. We improved ORB-SLAM 2 monocular/stereo/RGB-D as well as LDSO and achieved top results in very challenging scenarios.

Another advantage, in addition to preventing SLAM failures and improving motion estimation, is the improvement in map reconstruction. Tasks like relocalization and loop closing need accurate maps and should benefit from our approach.

VII Acknowledgements

This publication was made possible by the use of the FactoryIA supercomputer, financially supported by the Ile-de-France Regional Council.


  • [1] D. Barnes, W. Maddern, G. Pascoe, and I. Posner (2018) Driven to Distraction: Self-Supervised Distractor Learning for Robust Monocular Visual Odometry in Urban Environments. In IEEE International Conference on Robotics and Automation (ICRA), Cited by: §I, §I, §I, §II-C.
  • [2] B. Besbes, S. N. Collette, M. Tamaazousti, S. Bourgeois, and V. Gay-Bellile (2012) An interactive augmented reality system: a prototype for industrial maintenance training applications. In Proceedings of the IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pp. 269–270. Cited by: §I.
  • [3] B. Bescos, J. M. Fácil, J. Civera, and J. Neira (2018) DynaSLAM: Tracking, Mapping, and Inpainting in Dynamic Scenes. IEEE Robotics and Automation Letters 3 (4). Cited by: §I, §II-C, §V-B1, TABLE I, TABLE II, TABLE III.
  • [4] N. Brasch, A. Bozic, J. Lallemand, and F. Tombari (2018) Semantic Monocular SLAM for Highly Dynamic Environments. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Cited by: §II-C.
  • [5] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I. Reid, and J. J. Leonard (2016) Past, Present, and Future of Simultaneous Localization and Mapping: Toward the Robust-Perception Age. IEEE Transactions on Robotics 32 (6). Cited by: §I.
  • [6] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018) Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Computer Vision – ECCV, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss (Eds.), Lecture Notes in Computer Science, Cham (en). External Links: ISBN 978-3-030-01234-2 Cited by: §III-C2.
  • [7] J. Cheng, Y. Sun, W. Chi, C. Wang, H. Cheng, and M. Q. Meng (2018) An Accurate Localization Scheme for Mobile Robots Using Optical Flow in Dynamic Environments. In IEEE International Conference on Robotics and Biomimetics (ROBIO), Cited by: §I, §II-A, §V-B1, TABLE I, TABLE II, TABLE III, footnote 2.
  • [8] J. Cheng, Y. Sun, and M. Q.-H. Meng (2019) Improving monocular visual SLAM in dynamic environments: an optical-flow-based approach. Advanced Robotics 33 (12). Cited by: §I.
  • [9] X. Gao, R. Wang, N. Demmel, and D. Cremers (2018) LDSO: Direct Sparse Odometry with Loop Closure. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Cited by: §I, §V-A1, TABLE IV.
  • [10] V. Gay-Bellile, S. Bourgeois, M. Tamaazousti, S. Naudet-Collette, and S. Knodel (2012) A mobile markerless augmented reality system for the automotive field. In IEEE ISMAR Workshop on Tracking Methods and Applications (TMA). Cited by: §I.
  • [11] R. Hartley and A. Zisserman (2003) Multiple view geometry in computer vision. 2nd edition, Cambridge University Press, New York, NY, USA. External Links: ISBN 0521540518. Cited by: §I.
  • [12] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask R-CNN. In IEEE International Conference on Computer Vision (ICCV). Cited by: §V-B2, TABLE I, TABLE II, TABLE III.
  • [13] M. Kaneko, K. Iwami, T. Ogawa, T. Yamasaki, and K. Aizawa (2018) Mask-SLAM: Robust Feature-Based Monocular SLAM by Masking Using Semantic Segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Cited by: §I, §II-B, §II.
  • [14] G. Klein and D. Murray (2007) Parallel Tracking and Mapping for Small AR Workspaces. In IEEE and ACM Int. Symposium on Mixed and Augmented Reality. Cited by: §I.
  • [15] P. Li, T. Qin, and S. Shen (2018) Stereo Vision-Based Semantic 3D Object and Ego-Motion Tracking for Autonomous Driving. In Computer Vision – ECCV 2018, Lecture Notes in Computer Science. External Links: ISBN 978-3-030-01216-8. Cited by: §II-C.
  • [16] S. Li and D. Lee (2017) RGB-D SLAM in Dynamic Environments Using Static Point Weighting. IEEE Robotics and Automation Letters 2 (4). Cited by: §II-A.
  • [17] X. Lu, W. Wang, C. Ma, J. Shen, L. Shao, and F. Porikli (2019) See More, Know More: Unsupervised Video Object Segmentation With Co-Attention Siamese Networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §III-C1, §V-B2, TABLE I, TABLE II, TABLE III.
  • [18] E. Mouragnon, M. Lhuillier, M. Dhome, F. Dekeyser, and P. Sayd (2006) Real time localization and 3d reconstruction. In Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 1, pp. 363–370. Cited by: §I.
  • [19] R. Mur-Artal and J. D. Tardós (2017) ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras. IEEE Transactions on Robotics 33 (5). Cited by: §I, §III-A, §V-A1, TABLE I, TABLE II, TABLE III.
  • [20] A. Rosinol, H. Rebecq, T. Horstschaefer, and D. Scaramuzza (2018) Ultimate SLAM? Combining Events, Images, and IMU for Robust Visual SLAM in HDR and High Speed Scenarios. IEEE Robotics and Automation Letters. Cited by: §I.
  • [21] M. R. U. Saputra, A. Markham, and N. Trigoni (2018) Visual SLAM and Structure from Motion in Dynamic Environments: A Survey. ACM Comp. Surv. 51. Cited by: §II-A.
  • [22] M. Schorghuber, D. Steininger, Y. Cabon, M. Humenberger, and M. Gelautz (2019) SLAMANTIC - leveraging semantics to improve vslam in dynamic environments. In The IEEE International Conference on Computer Vision (ICCV) Workshops. Cited by: §I, §II-C, §V-B1, TABLE I, TABLE II, TABLE III.
  • [23] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers (2012) A benchmark for the evaluation of RGB-D SLAM systems. In IEEE/RSJ International Conference on Intelligent Robots and Systems. Cited by: §V-A1.
  • [24] D. Sun, X. Yang, M. Liu, and J. Kautz (2018) PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Cited by: §V-B2, TABLE I, TABLE II, TABLE III.
  • [25] Y. Sun, M. Liu, and M. Q. -H. Meng (2017) Improving RGB-D SLAM in dynamic environments: A motion removal approach. Robotics and Autonomous Systems. Cited by: §II-A.
  • [26] Y. Sun, M. Liu, and M. Q. -H. Meng (2018) Motion removal for reliable RGB-D SLAM in dynamic environments. Robotics and Autonomous Systems 108. Cited by: §II-A.
  • [27] M. Tamaazousti, S. Naudet-Collette, V. Gay-Bellile, S. Bourgeois, B. Besbes, and M. Dhome (2016) The constrained slam framework for non-instrumented augmented reality. Multimedia Tools and Applications 75 (16), pp. 9511–9547. Cited by: §I.
  • [28] C. Ventura, M. Bellver, A. Girbau, A. Salvador, F. Marques, and X. Giro-i-Nieto (2019) RVOS: end-to-end recurrent network for video object segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §V-B2, TABLE I, TABLE II, TABLE III.
  • [29] K. Wang, Y. Lin, L. Wang, L. Han, M. Hua, X. Wang, S. Lian, and B. Huang (2019) A Unified Framework for Mutual Improvement of SLAM and Semantic Segmentation. In Int. Conf. on Robotics and Automation (ICRA). Cited by: §II-B, TABLE I, TABLE II.
  • [30] Q. Wang, L. Zhang, L. Bertinetto, W. Hu, and P. H. S. Torr (2019) Fast Online Object Tracking and Segmentation: A Unifying Approach. In Proc. of the IEEE conference on computer vision and pattern recognition. Cited by: §III-C1.
  • [31] S. Wang, R. Clark, H. Wen, and N. Trigoni (2018) End-to-end, sequence-to-sequence probabilistic visual odometry through deep neural networks. The International Journal of Robotics Research 37 (4-5). Cited by: §II.
  • [32] C. Yu, Z. Liu, X. Liu, F. Xie, Y. Yang, Q. Wei, and Q. Fei (2018) DS-SLAM: A Semantic Visual SLAM towards Dynamic Environments. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Cited by: §II-B, §II-C, TABLE I, TABLE II.