ClusterVO: Clustering Moving Instances and Estimating Visual Odometry for Self and Surroundings

03/29/2020 · by Jiahui Huang et al., Tsinghua University

We present ClusterVO, a stereo visual odometry system which simultaneously clusters and estimates the motion of both the ego camera and surrounding rigid clusters/objects. Unlike previous solutions that rely on batch input or impose priors on scene structure or dynamic object models, ClusterVO is online and general, and can thus be used in various scenarios including indoor scene understanding and autonomous driving. At the core of our system lies a multi-level probabilistic association mechanism and a heterogeneous Conditional Random Field (CRF) clustering approach combining semantic, spatial and motion information to jointly infer cluster segmentations online for every frame. The poses of the camera and dynamic objects are instantly solved through a sliding-window optimization. Our system is evaluated on the Oxford Multimotion and KITTI datasets both quantitatively and qualitatively, reaching results comparable to state-of-the-art solutions on both odometry and dynamic trajectory recovery.




1 Introduction

Understanding surrounding dynamic objects is an important step beyond ego-motion estimation in the current visual Simultaneous Localization and Mapping (SLAM) community, driven by the requirements of advanced Augmented Reality (AR) and autonomous navigation: in typical Dynamic AR use cases, scene dynamics need to be explicitly tracked to enable interactions of virtual objects with moving instances in the real world. In outdoor autonomous driving scenes, a car should not only accurately localize itself but also reliably sense other moving cars to avoid possible collisions.

Despite this need from emerging applications to perceive scene motions, most classical SLAM systems [4, 19, 20, 28] merely regard dynamics as outliers during pose estimation. Recently, advances in vision and robotics have demonstrated new possibilities for building motion-aware Dynamic SLAM systems by coupling various vision techniques such as detection and tracking [5, 32, 36]. Nevertheless, these systems are currently often tailored to special use cases: for indoor scenes where dense RGB-D data are available, geometric features including convexity or structural regularities are used to assist segmentation [6, 34, 35, 38, 46]; for outdoor scenes, object priors like car sizes or the planar road structure are exploited to constrain the solution space [2, 24, 26, 48]. These differing assumptions render existing algorithms hardly applicable to general dynamic scenarios. In contrast, ClusterSLAM [15] incorporates no scene prior, but it acts as a backend instead of a full system and its performance relies heavily on the landmark tracking and association quality.

Figure 1: Our proposed system ClusterVO can simultaneously recover the camera ego-motion as well as cluster trajectories.

To bridge the above gap in current Dynamic SLAM solutions, we propose ClusterVO, a stereo visual odometry system for dynamic scenes which simultaneously optimizes the poses of the camera and multiple moving objects, regarded as clusters of point landmarks, in a unified manner, achieving a competitive frame rate with promising tracking and segmentation ability as listed in Table 1. Because no geometric or shape priors on the scene or dynamic objects are imposed, our proposed system is general and adapts to various applications ranging from autonomous driving and indoor scene perception to augmented reality development. Our strategy is solely based on sparse landmarks and 2D detections [32]; to make use of such a lightweight representation, we propose a robust multi-level probabilistic association technique to efficiently track both low-level features and high-level detections over time in 3D space. Then a highly efficient heterogeneous CRF jointly considering semantic bounding boxes, spatial affinity and motion consistency is applied to discover new clusters, cluster novel landmarks and refine existing cluster assignments. Finally, both static and dynamic parts of the scene are solved in a sliding-window optimization fashion.

2 Related Work

Method             | Sensor   | FPS
ORB-SLAM2 [28]     | Multiple | 10
DynamicFusion [30] | RGB-D    | – (NR)
MaskFusion [35]    | RGB-D    | 30
Li et al. [24]     | Stereo   | 5.8
DynSLAM [2]        | Stereo   | 2
ClusterSLAM [15]   | Stereo   | 7
ClusterVO          | Stereo   | 8
Table 1: Comparison with other dynamic SLAM solutions in terms of sensor(s) used and processing frame rate (FPS). 'NR' indicates that a single non-rigid body is tracked.

Dynamic SLAM / Visual Odometry. Traditional SLAM or VO systems are built on a static-scene assumption, so dynamic content needs to be carefully handled and would otherwise lead to severe pose drift. To this end, some systems explicitly detect motions and filter them out, either with motion consistency checks [8, 19, 20] or object detection modules [4, 49, 50]. The idea of simultaneously estimating ego motion and multiple moving rigid objects, which our formulation shares, originated from the seminal SLAMMOT [44] project. Follow-ups like [6, 34, 35, 38, 46] use RGB-D input and reconstruct dense models of the indoor scene along with moving objects. For better segmentation of object identities, [35, 46] combine a heavy instance segmentation module with geometric features. [22, 40, 43] can track and reconstruct rigid object parts on a predefined articulation template (e.g. human hands or kinematic structures). [9, 31] couple an existing visual-inertial system with moving objects tracked using markers. Many other methods are specially designed for road scenes by exploiting modern vision modules [3, 24, 26, 27, 29, 45]. Among them, [24] proposes a batch optimization to accurately track the motions of moving vehicles, but a real-time solution is not presented.

Different from ClusterSLAM [15], which builds on motion affinity matrices for hierarchical clustering and SLAM, this work focuses on developing a relatively lightweight visual odometry and faces the challenges of real-time clustering and state estimation.

Object Detection and Pose Estimation. With recent advances in deep learning, the performance of 2D object detection and tracking has been greatly boosted [5, 12, 14, 25, 32, 36]. Detection and tracking in 3D space from video sequences remains relatively unexplored due to the difficulty of 6-DoF (six degrees of freedom) pose estimation. In order to accurately estimate 3D positions and poses, many methods [13, 23] leverage predefined object templates or priors to jointly infer object depth and rotation. In ClusterVO, the combination of low-level geometric feature descriptors and semantic detections, inferred simultaneously within the localization and mapping process, provides additional cues for efficient tracking and accurate object pose estimation.

3 ClusterVO

Figure 2: Pipeline of ClusterVO. ① For each incoming stereo frame ORB features and semantic bounding boxes are extracted. ② We apply multi-level probabilistic association to associate features with landmarks and bounding boxes with existing clusters. ③ Then we cluster the landmarks observed in the current frame into different rigid bodies using the Heterogeneous CRF module. ④ The state-estimation is performed in a sliding window manner with specially designed keyframe mechanism. Optimized states are used to update the static maps and clusters.

ClusterVO takes synchronized and calibrated stereo images as input and outputs camera and object poses for each frame. For each incoming frame, semantic bounding boxes are detected using the YOLO object detection network [32], and ORB features [33] are extracted and matched across the stereo pair. We first associate detected bounding boxes and extracted features with previously found clusters and landmarks, respectively, through a multi-level probabilistic association formulation (Sec. 3.1). Then, we run a heterogeneous conditional random field (CRF) over all features with associated map landmarks to determine the cluster segmentation for the current frame (Sec. 3.2). Finally, the state estimation step optimizes all states over a sliding window with marginalization and a smooth motion prior (Sec. 3.3). The pipeline is illustrated in Figure 2.

Notations. At frame t, ClusterVO outputs the pose of the camera in the global reference frame, the states of all clusters (rigid bodies), and the states of all landmarks. Each cluster state contains the current 6-DoF pose and the current linear speed in 3D space; specially, we reserve one cluster index for the static scene for convenience. As a shorthand, we denote the transformation between two coordinate frames by the corresponding relative pose. Each landmark state contains its global position, its cluster assignment and a weight defining the confidence of that assignment. For observations, we denote low-level ORB stereo features and high-level semantic bounding boxes by their frame and detection indices. Assuming each feature observation is subject to Gaussian noise with covariance Σ_z, the covariance of the triangulated point in camera space can be computed as Σ = J_{π⁻¹} Σ_z J_{π⁻¹}^⊤, where π is the stereo projection function, π⁻¹ is the corresponding back-projection function and J_{π⁻¹} is the Jacobian matrix of π⁻¹.
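To make the first-order uncertainty propagation concrete, here is a minimal sketch for a rectified stereo rig; the intrinsics (focal length `f`, baseline `b`, principal point `cx`, `cy`) are hypothetical values, and the Jacobian of the back-projection is computed numerically rather than analytically for brevity.

```python
import numpy as np

def back_project(u, v, d, f=700.0, b=0.54, cx=600.0, cy=180.0):
    """Back-project a rectified stereo measurement (u, v, disparity d) to a
    camera-space 3D point. The intrinsics here are hypothetical."""
    z = f * b / d
    return np.array([(u - cx) * z / f, (v - cy) * z / f, z])

def triangulation_cov(u, v, d, sigma_obs, eps=1e-4):
    """First-order propagation Sigma = J * sigma_obs * J^T, where J is the
    Jacobian of the back-projection, approximated by finite differences."""
    x0 = back_project(u, v, d)
    J = np.zeros((3, 3))
    for k, delta in enumerate(np.eye(3) * eps):
        J[:, k] = (back_project(u + delta[0], v + delta[1], d + delta[2]) - x0) / eps
    return J @ sigma_obs @ J.T
```

As expected from stereo geometry, the propagated covariance is dominated by the depth direction, and this anisotropy grows quickly as disparity shrinks.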

For generality, we do not introduce a category-specific canonical frame for each cluster. Instead, we initialize the cluster pose with the center and the three principal orthogonal directions of the landmark point clouds belonging to the cluster as the translational and rotational part respectively and track the relative pose ever since.
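The initialization described above can be sketched as a PCA of the cluster's landmark cloud; the eigenvector ordering and the sign fix to obtain a proper rotation are implementation choices, not prescribed by the paper.

```python
import numpy as np

def init_cluster_pose(points):
    """Initialize a cluster frame from its landmark point cloud: translation
    is the centroid; the rotation columns are the principal directions of the
    cloud, forming a right-handed orthonormal basis."""
    center = points.mean(axis=0)
    # Eigen-decomposition of the covariance yields the principal axes.
    cov = np.cov((points - center).T)
    _, vecs = np.linalg.eigh(cov)
    R = vecs[:, ::-1]               # order axes by decreasing variance
    if np.linalg.det(R) < 0:        # enforce a proper rotation (det = +1)
        R[:, 2] *= -1
    return R, center
```

Subsequent frames then track the cluster pose relative to this initial frame, so no category-specific canonical orientation is ever needed.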

3.1 Multi-level Probabilistic Association

For landmarks on the static map, features can be robustly associated by nearest-neighbour search and descriptor matching [28]. However, tracking dynamic landmarks which move fast in image space is not a trivial task. Moreover, we need to associate each detected bounding box with an existing map cluster if possible, which is required by the succeeding Heterogeneous CRF module.

To this end, we propose a multi-level probabilistic association scheme for dynamic landmarks, assigning each low-level feature observation to its source landmark and each high-level bounding box to a cluster. The essence of the probabilistic approach is to model the position of a landmark by a Gaussian distribution with mean μ and covariance Σ and to account for this uncertainty throughout the matching. Ideally, Σ should be extracted from the system information matrix of the last state estimation step, but the computational burden is heavy. We hence approximate Σ by transforming into the global frame the observation covariance with the smallest determinant, which can be incrementally updated, using the rotational part of the corresponding pose.

For each new frame, we perform motion prediction for each cluster using its current linear speed. The predicted 3D landmark positions as well as their noise covariance matrices are re-projected into the current frame. The probability score of assigning an observation to a landmark combines the Gaussian reprojection likelihood with an indicator on the descriptor similarity between landmark and observation. For each observation, we choose the corresponding landmark with the highest assignment probability score if one passes the threshold. In practice, Eq. 2 is only evaluated on a small neighbourhood of the predicted position.

We further measure the uncertainty of the high-level association by calculating the Shannon cross-entropy H = −Σ_q p_q log p_q, where p_q is the probability of assigning the bounding box to cluster q. If H is smaller than 1.0, we consider this a successful high-level association, in which case we additionally perform brute-force low-level feature descriptor matching within the bounding box to find more feature correspondences.
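Both association levels can be sketched as follows. The Gaussian-times-indicator form of the feature score follows the description above, but the similarity threshold and the use of base-2 logarithms for the entropy gate are assumptions of this sketch, not values from the paper.

```python
import numpy as np

def feature_score(z, z_pred, cov2d, sim, sim_thresh=0.6):
    """Score for assigning a 2D observation z to a landmark with predicted
    reprojection z_pred and reprojected 2x2 covariance cov2d; `sim` is the
    ORB descriptor similarity. The indicator gates on similarity, and the
    Gaussian factor models reprojection uncertainty (thresholds illustrative)."""
    if sim < sim_thresh:
        return 0.0
    r = z - z_pred
    maha = r @ np.linalg.inv(cov2d) @ r     # squared Mahalanobis distance
    return sim * np.exp(-0.5 * maha)

def box_association(cluster_scores, max_entropy=1.0):
    """Associate a bounding box to the best-scoring cluster only when the
    Shannon entropy of the normalized assignment distribution is low."""
    p = np.asarray(cluster_scores, float)
    p = p / p.sum()
    h = -np.sum(p * np.log2(p + 1e-12))
    return (int(np.argmax(p)) if h < max_entropy else None), h
```

A peaked score distribution over clusters yields a confident association, while a near-uniform one is rejected, exactly the behaviour the entropy gate is meant to capture.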

3.2 Heterogeneous CRF for Cluster Assignment

In this step, we determine the cluster assignment of each landmark observed in the current frame. A conditional random field model combining semantic, spatial and motion information, which we call a 'heterogeneous CRF', is applied, minimizing an energy that is a weighted sum of unary energies and pairwise energies on a complete graph of all observed landmarks. The total number of classes for the CRF is the number of live clusters plus the number of unassociated bounding boxes in this frame, plus one additional outlier class. A cluster is considered live if at least one of its landmarks has been observed during the past few frames.

Unary Energy. The unary energy models the probability of an observed landmark belonging to a specific cluster and is the product of three terms. The first multiplier incorporates information from the detected semantic bounding boxes: the probability is large if the landmark lies within a bounding box associated with the cluster, weighted by a constant detection confidence. The second multiplier captures spatial affinity by assigning a high probability to landmarks near the center of a cluster, where the cluster center and dimensions are determined from the center and empirically chosen percentiles of the cluster landmark point cloud.

The third multiplier measures how well the trajectory of a cluster over a set of recent timesteps explains the observation, via a simple reprojection error w.r.t. the observations. For the first 5 frames this term is not included in Eq. 5.

The 2D term alone only considers the 2D semantic detection, which possibly contains many outliers around the edges of the bounding box. By adding the 3D term, landmarks belonging to the faraway background get pruned. However, features close to the 3D boundary, e.g., on the ground near a moving vehicle, still have a high probability of belonging to the cluster, and their confidence is further refined by the motion term. Please refer to Sec. 4.4 for evaluations and visual comparisons of these three terms.

Pairwise Energy. The pairwise energy is a function of the 3D distance between two landmarks inside an exponential operator; it can be viewed as a noise-aware Gaussian smoothing kernel encouraging spatial continuity of the labeling.

We use an efficient dense CRF inference method [21] to solve the energy minimization problem. After inference, we run the Kuhn–Munkres algorithm to match the current CRF clustering results with previous cluster assignments. New clusters are created if no proper existing cluster is found for an inferred label. We then update the confidence weight of each landmark according to a strategy introduced in [41] and change its cluster assignment if necessary: when the newly assigned cluster is the same as the landmark's previous cluster, we increase the weight by 1; otherwise the weight is decreased by 1. When the weight drops to 0, a change of cluster assignment is triggered and the currently assigned cluster is accepted.
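The weight update above amounts to a small hysteresis rule. A sketch follows; the weight a landmark restarts with after flipping its assignment is an assumption of this sketch.

```python
def update_assignment(current_cluster, weight, inferred_cluster):
    """Hysteresis update of a landmark's cluster assignment: agreeing CRF
    votes raise the confidence weight, disagreeing votes lower it, and the
    assignment flips only once the weight is exhausted (reaches 0)."""
    if inferred_cluster == current_cluster:
        return current_cluster, weight + 1
    if weight <= 1:
        # Confidence exhausted: accept the newly inferred cluster
        # (restart weight of 1 is an assumption).
        return inferred_cluster, 1
    return current_cluster, weight - 1
```

This prevents a single noisy CRF inference from flipping a well-established landmark while still letting persistent evidence re-assign it.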

3.3 Sliding-Window State Estimation

Double-Track Frame Management. Keyframe-based SLAM systems like ORB-SLAM2 [28] select keyframes by the spatial distance between frames and the number of commonly visible features among frames. For ClusterVO where the trajectory of each cluster is incorporated into the state estimation process, the aforementioned strategy for keyframe selection is not enough to capture the relatively fast-moving clusters.

Instead of the chunk strategy proposed in ClusterSLAM [15], we employ a sliding-window optimization scheme with a novel double-track frame management design (Figure 3). The frames maintained and optimized by the system are divided into two sequential tracks: a temporal track and a spatial track. The temporal track contains the most recent input frames. Whenever a new frame comes in, the oldest frame in the temporal track is moved out; if this frame is sufficiently far, spatially, from the first frame of the spatial track, or the number of commonly visible landmarks is sufficiently small, it is appended to the tail of the spatial track, otherwise it is discarded. This design has several advantages. First, frames in the temporal track record all recent observations and hence provide enough observations to track a fast-moving cluster. Second, previously mis-clustered landmarks can later be corrected, and re-optimization based on the new assignments becomes possible. Third, features detected in the spatial track create enough parallax for accurate landmark triangulation and state estimation.

Figure 3: Frame Management in ClusterVO. Frames maintained by the system consist of spatial track (red) and temporal track (green). When a new frame comes, the oldest frame in the temporal track (Temporal Tail) will either be discarded or promoted into the spatial track. The last spatial frame is to be marginalized if the total number of spatial frames exceeds a given threshold.
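The promote-or-discard logic can be sketched as below. The track sizes are placeholders, `is_spatially_distinct` stands in for the parallax/covisibility test, and marginalizing the oldest spatial frame is an assumption about which end of the track is dropped.

```python
from collections import deque

class DoubleTrack:
    """Sketch of the double-track frame management: a fixed-size temporal
    track of recent frames and a spatial track of promoted keyframes."""
    def __init__(self, temporal_size=4, spatial_size=6):
        self.temporal = deque()
        self.spatial = []
        self.temporal_size = temporal_size
        self.spatial_size = spatial_size

    def add_frame(self, frame, is_spatially_distinct):
        """Insert a new frame; returns the frame to marginalize, if any."""
        self.temporal.append(frame)
        marginalized = None
        if len(self.temporal) > self.temporal_size:
            oldest = self.temporal.popleft()        # the temporal tail
            if is_spatially_distinct(oldest, self.spatial):
                self.spatial.append(oldest)         # promote to keyframe
            # otherwise the frame is simply discarded
        if len(self.spatial) > self.spatial_size:
            marginalized = self.spatial.pop(0)      # marginalize, don't drop
        return marginalized
```

Frames thus flow from the temporal track into the spatial track, and only spatial frames ever reach the marginalization step described next.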

For the static scene and camera pose, the energy function for optimization is a standard bundle adjustment [42] augmented with an additional marginalization term, summing robust Huber-weighted reprojection residuals over all landmarks belonging to the static scene. As the static scene involves a large number of variables, and simply dropping these variables out of the sliding window would cause information loss and possible drift, we marginalize the variables that would otherwise be removed and summarize their influence on the system with the marginalization term in Eq. 10. Marginalization is only performed when a frame is discarded from the spatial track. To restrict dense fill-in of the landmark blocks in the information matrix, observations from the frame to be removed are either deleted, if the corresponding landmark is observed by the newest frame, or marginalized otherwise. This marginalization strategy only adds dense Hessian blocks onto the frames instead of the landmarks, keeping the system solvable in real time.

More specifically, in the marginalization term, the residual is the state change relative to the critical state captured when marginalization happens. To compute the marginalized information matrix and vector we employ the standard Schur complement: Λ' = Λ_kk − Λ_kd Λ_dd⁻¹ Λ_dk and b' = b_k − Λ_kd Λ_dd⁻¹ b_d, where the blocks of Λ and b are components of the system information matrix and information vector extracted by linearizing around the critical state, and the subscripts k and d denote the kept and dropped variables, respectively.


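In matrix form, the elimination can be sketched as a dense NumPy illustration (a real implementation exploits the block-sparsity of the SLAM information matrix):

```python
import numpy as np

def marginalize(Lam, b, keep, drop):
    """Schur complement: eliminate the `drop` states from the information
    system (Lam, b), producing the marginal prior on the `keep` states:
      Lam' = Lam_kk - Lam_kd Lam_dd^{-1} Lam_dk
      b'   = b_k    - Lam_kd Lam_dd^{-1} b_d"""
    Lkk = Lam[np.ix_(keep, keep)]
    Lkd = Lam[np.ix_(keep, drop)]
    Ldd = Lam[np.ix_(drop, drop)]
    W = Lkd @ np.linalg.inv(Ldd)
    return Lkk - W @ Lam[np.ix_(drop, keep)], b[keep] - W @ b[drop]
```

The resulting pair acts as a Gaussian prior on the remaining states: solving the reduced system gives the same estimate for the kept variables as solving the full one.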
For dynamic clusters, the motions are modeled using a white-noise-on-acceleration prior [1], which can be written in continuous time as p̈(t) ~ GP(0, Q_c δ(t − t′)), where p(t) is the translational part of the continuous cluster pose (hence p̈(t) is the cluster acceleration), GP stands for a Gaussian process, and Q_c denotes its power spectral matrix. We define the energy function for optimizing each cluster's trajectory and its corresponding landmark positions as the sum of a motion prior term and a reprojection term (Eq. 13).


The motion prior term involves the Kronecker product of the GP transition covariance with the power spectral matrix, evaluated over each pair of adjacent timestamps. It is obtained by querying the random process model of Eq. 12 and intuitively penalizes changes in velocity over time, smoothing cluster motion trajectories which would otherwise be noisy because clusters carry fewer features than the static scene. Note that, different from the energy term for the static scene, which jointly optimizes camera poses and landmarks, for dynamic clusters the camera pose is not optimized.
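Between two adjacent frames, the prior reduces to a residual on positions and velocities weighted by the standard white-noise-on-acceleration transition covariance from [1]; the position-then-velocity stacking convention below is an assumption of this sketch.

```python
import numpy as np

def motion_prior_residual(p0, v0, p1, v1, dt):
    """White-noise-on-acceleration prior between adjacent frames: under
    constant velocity, p1 = p0 + v0*dt and v1 = v0, so the residual stacks
    the violations of both."""
    return np.concatenate([p1 - p0 - v0 * dt, v1 - v0])

def prior_information(Qc, dt):
    """Information (inverse covariance) of the residual, from the standard
    WNOA transition covariance Q = [[dt^3/3, dt^2/2], [dt^2/2, dt]] (kron) Qc."""
    Q = np.kron(np.array([[dt**3 / 3, dt**2 / 2],
                          [dt**2 / 2, dt]]), Qc)
    return np.linalg.inv(Q)
```

A perfectly constant-velocity trajectory incurs zero prior cost, while abrupt velocity changes are penalized in proportion to the inverse power spectral density.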

During the optimization of a cluster's state, the camera state stays unchanged. The optimization for each cluster can easily be parallelized because the cluster states are mutually independent (in practice the system runs at 8.5 Hz and 7.8 Hz for 2 and 6 clusters, respectively).

4 Experiments

4.1 Datasets and Parameter Setup

The effectiveness and general applicability of the ClusterVO system are demonstrated mainly in two scenarios: indoor scenes with moving objects and autonomous driving with moving vehicles.

For indoor scenes, we employ the stereo Oxford Multimotion dataset (OMD) [18] for evaluation. This dataset is specially designed for simultaneous camera localization and rigid body motion estimation indoors, with ground-truth trajectories recovered using a motion capture system. Evaluations and comparisons are performed on two sequences: swinging_4_unconstrained (S4, 500 frames, with four moving bodies: S4-C1, S4-C2, S4-C3, S4-C4) and occlusion_2_unconstrained (O2, 300 frames, with two moving bodies: O2-Tower and O2-Block), because these are the only sequences with baseline results reported in the sequential works from Judd et al. [16, 17], named 'MVO'.

For autonomous driving, we employ the challenging KITTI dataset [10]. As most sequences in the odometry benchmark have low dynamics, and comparisons on these data can hardly lead to sensible improvements over other SLAM solutions (e.g. ORB-SLAM), we follow Li et al. [24] and demonstrate the strength of our method on selected sequences from the raw dataset as well as the full 21 tracking training sequences, which contain many moving cars. The ground-truth camera ego-motion is obtained from the OxTS packets (combining GNSS and inertial navigation) provided by the dataset.

The CRF weight, the 2D unary energy constant, the power spectral matrix of the motion prior, the maximum sizes of the two frame tracks and the cluster liveness threshold are kept fixed across all experiments. All experiments are conducted on a desktop computer with an Intel Core i7-8700K CPU, 32 GB RAM and an Nvidia GTX 1080 GPU.

4.2 Indoor Scene Evaluations

We follow the same evaluation protocol as [16], computing the maximum drift (deviation from the ground-truth pose) across the whole sequence in translation and rotation (represented by the three Euler angles roll, yaw and pitch), both for the camera ego-motion and for all moving cluster trajectories. As our method does not define a canonical frame for detected clusters, we need to register the poses recovered by our method with the ground-truth trajectory. To this end, we multiply our recovered poses by the rigid transformation which minimizes the sum of differences to the corresponding ground-truth poses over all frames. This is based on the assumption that the local coordinates of the recovered landmarks can be registered with the positions of the ground-truth landmarks using this single rigid transformation.
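The registration can be sketched with the closed-form Kabsch/Umeyama solution applied to the trajectory translations (averaging over full 6-DoF poses is omitted here for brevity):

```python
import numpy as np

def align_rigid(src, dst):
    """Least-squares rigid registration (Kabsch, no scale): find R, t
    minimizing sum_i || R @ src[i] + t - dst[i] ||^2, used to register a
    recovered trajectory to the ground-truth frame."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    H = (src - mu_s).T @ (dst - mu_d)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Sign correction guarantees a proper rotation (det = +1).
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T
    return R, mu_d - R @ mu_s
```

Applying this transformation once per cluster trajectory makes the recovered poses directly comparable with the motion-capture ground truth.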

For semantic bounding box extraction, the YOLOv3 network [32] is re-trained to detect an additional class named 'block', representing the swinging or rotating blocks in the dataset. The detections used for training are labeled using a combination of human annotation and a median-flow tracker on the remaining frames from S4 and O2.

Figure 4: Performance comparison with MVO on S4 and O2 sequence in Oxford Multimotion [18] dataset. The numbers in the heatmap show the ratio of decrease in error using ClusterVO for different trajectories and measurements.
Figure 5: Qualitative results in OMD Sequence O2. The three subfigures demonstrate an occlusion handling process by ClusterVO.
Figure 6: Other indoor qualitative results. (a) OMD Sequence S4; (b) A laboratory scene where two bottles are reordered.
Sequence  | ORB-SLAM2 [28]      | DynSLAM [2]          | Li et al. [24] | ClusterSLAM [15]    | ClusterVO
0926-0009 | 0.91 / 0.01 / 1.89  | 7.51 / 0.06 / 2.17   | 1.14           | 0.92 / 0.03 / 2.34  | 0.79 / 0.03 / 2.98
0926-0013 | 0.30 / 0.01 / 0.94  | 1.97 / 0.04 / 1.41   | 0.35           | 2.12 / 0.07 / 5.50  | 0.26 / 0.01 / 1.16
0926-0014 | 0.56 / 0.01 / 1.15  | 5.98 / 0.09 / 2.73   | 0.51           | 0.81 / 0.03 / 2.24  | 0.48 / 0.01 / 1.04
0926-0051 | 0.37 / 0.00 / 1.10  | 10.95 / 0.10 / 1.65  | 0.76           | 1.19 / 0.03 / 1.44  | 0.81 / 0.02 / 2.74
0926-0101 | 3.42 / 0.03 / 14.27 | 10.24 / 0.13 / 12.29 | 5.30           | 4.02 / 0.02 / 12.43 | 3.18 / 0.02 / 12.78
0929-0004 | 0.44 / 0.01 / 1.22  | 2.59 / 0.02 / 2.03   | 0.40           | 1.12 / 0.02 / 2.78  | 0.40 / 0.02 / 1.77
1003-0047 | 18.87 / 0.05 / 28.32 | 9.31 / 0.05 / 6.58  | 1.03           | 10.21 / 0.06 / 8.94 | 4.79 / 0.05 / 6.54
Table 2: Camera ego-motion comparison with state-of-the-art systems on the KITTI raw dataset. Each triple is ATE / R.RPE / T.RPE; a single error value is reported for Li et al. [24]. The unit of ATE and T.RPE is meters and the unit of R.RPE is radians.

Figure 4 shows the ratio of decrease in drift compared with the baseline MVO [16, 17]. More than half of the trajectory estimates improve by over 25%, leading to accurate camera ego-motion and cluster motion recovery. Two main advantages of ClusterVO over MVO make this improvement possible. First, the MVO pipeline requires stable tracking of features within each input batch of 50 frames, which keeps only a small subset of landmarks and makes the influence of noise more dominant, while ClusterVO maintains consistent landmarks for each individual cluster and associates both low-level and high-level information to maximize the utility of historical information. Second, if the motion in a local window is small, a purely geometric method tends to misclassify dynamic landmarks and degrades the recovered poses; ClusterVO, however, leverages additional semantic and spatial information to achieve more accurate and meaningful classification and estimation.

Meanwhile, the robust association strategy and the double-track frame management design allow ClusterVO to continuously track a cluster's motion even when it is temporarily occluded. This is demonstrated in Figure 5 on the O2 sequence, where the block is occluded by the tower for 10 frames. The cluster's motion is predicted during the occlusion, and the prediction is finally probabilistically associated with the re-detected semantic bounding box of the block. The state estimation module is then relaunched to recover the motion using information from both before and after the occlusion.

Figure 6(a) shows qualitative results on the S4 sequence, and Figure 6(b) shows another result from a practical indoor laboratory scene with two moving bottles, recorded using a Mynteye stereo camera.

4.3 KITTI Driving Evaluations

Similar to Li et al. [24], we divide the quantitative evaluation into ego-motion comparisons and 3D object detection comparisons. Our results are compared to state-of-the-art systems including ORB-SLAM2 [28], DynSLAM [2], Li et al. [24] and ClusterSLAM [15] using the TUM metrics [39]. These metrics report ATE, R.RPE and T.RPE, short for the Root Mean Square Error (RMSE) of the Absolute Trajectory Error and of the Rotational and Translational Relative Pose Error, respectively.
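On translations these metrics reduce to simple expressions, sketched below; the full TUM evaluation toolkit additionally handles rotations, trajectory alignment and timestamp matching.

```python
import numpy as np

def ate_rmse(est, gt):
    """RMSE of the Absolute Trajectory Error over translations, assuming the
    estimated trajectory is already aligned to the ground truth."""
    return np.sqrt(np.mean(np.sum((est - gt) ** 2, axis=1)))

def t_rpe_rmse(est, gt):
    """RMSE of the translational Relative Pose Error between consecutive
    frames (a translation-only simplification of the TUM metric [39])."""
    d_est = np.diff(est, axis=0)
    d_gt = np.diff(gt, axis=0)
    return np.sqrt(np.mean(np.sum((d_est - d_gt) ** 2, axis=1)))
```

A constant offset between trajectories shows up in ATE but not in RPE, which is why the two metrics distinguish global drift from local inconsistency.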

As shown in Table 2, for most of the sequences we achieve the best results in terms of ATE, meaning that our method can maintain globally correct camera trajectories in challenging scenes (e.g. 1003-0047) where even ORB-SLAM2 fails due to its static scene assumption. Although DynSLAM maintains a dense mapping of both the static scenes and dynamic objects, the underlying sparse scene flow estimation is based on a frame-to-frame visual odometry libviso [11], which will inherently lead to remarkable drift over long travel distances. The batch Multibody SfM formulation of Li et al. results in a highly nonlinear factor graph optimization problem whose solution is not trivial. ClusterSLAM [15] requires associated landmarks and the inaccurate feature tracking frontend affects the localization performance even if the states are solved via full optimization. In contrast, our ClusterVO achieves comparable or even better results than all previous methods due to the fusing of multiple sources of information and the robust sliding-window optimization.

The cluster trajectories are evaluated on the 3D object detection benchmark of the KITTI tracking dataset. We compute the Average Precision (AP) of the 'car' class in both bird's-eye view and 3D view. Our detected 3D box center is the cluster center (Eq. 7) and the box dimensions are taken as the average car size. The box orientation is initialized to be vertical to the camera and tracked over time later on. A detection is counted as a true positive if its Intersection over Union (IoU) score with an associated ground-truth detection is larger than 0.25. All ground-truth 3D detections are divided into three categories (Easy, Moderate and Hard) based on the height of the 2D reprojected bounding box and the occlusion/truncation level.
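The true-positive criterion can be sketched with a bird's-eye-view overlap test; this simplification uses axis-aligned boxes and ignores yaw, unlike the full KITTI benchmark, and the greedy matching is an assumption of this sketch.

```python
def iou_bev(a, b):
    """Axis-aligned bird's-eye-view IoU between boxes (x, z, width, length)."""
    ax, az, aw, al = a
    bx, bz, bw, bl = b
    ix = max(0.0, min(ax + aw / 2, bx + bw / 2) - max(ax - aw / 2, bx - bw / 2))
    iz = max(0.0, min(az + al / 2, bz + bl / 2) - max(az - al / 2, bz - bl / 2))
    inter = ix * iz
    return inter / (aw * al + bw * bl - inter)

def count_tp(detections, ground_truth, thresh=0.25):
    """Greedy true-positive counting: each ground-truth box may match at
    most one detection whose IoU exceeds the threshold."""
    used, tp = set(), 0
    for det in detections:
        best, best_iou = None, thresh
        for i, g in enumerate(ground_truth):
            v = iou_bev(det, g)
            if i not in used and v >= best_iou:
                best, best_iou = i, v
        if best is not None:
            used.add(best)
            tp += 1
    return tp
```

Sweeping a confidence threshold over the detections and accumulating precision/recall from these counts yields the AP values reported in Table 3.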

We compare the performance of our method with the state-of-the-art 3D object detection solution of Chen et al. [7] and with DynSLAM [2]. The evaluation is performed in the camera coordinate system, so inaccuracies in the ego-motion estimation are eliminated.

Method          | AP bird's-eye view (Easy / Moderate / Hard) | AP 3D (Easy / Moderate / Hard) | Time (ms)
Chen et al. [7] | 81.34 / 70.70 / 66.32                       | 80.62 / 70.01 / 65.76          | 1200
DynSLAM [2]     | 71.83 / 47.16 / 40.30                       | 64.51 / 43.70 / 37.66          | 500
ClusterVO       | 74.65 / 49.65 / 42.65                       | 55.85 / 38.93 / 33.55          | 125
Table 3: 3D object detection comparison on the KITTI dataset.

The methods of Chen et al. and DynSLAM are similar in that both perform dense stereo matching (e.g. [47]) to precompute the 3D structure. While DynSLAM crops the depth map using 2D detections to generate spatial detections, Chen et al. generate and score object proposals directly in 3D space, incorporating many scene priors of autonomous driving scenarios, including the ground plane and a car dimension prior. Comparing the results in Table 3 shows these priors to be critical: DynSLAM suffers a sharp decrease in both the Moderate and Hard categories, which contain faraway cars and small 2D detection bounding boxes.

For ClusterVO, which is designed to be general-purpose, the natural uncertainty of stereo triangulation grows as a landmark gets more distant from the camera, and no object size prior is available to compensate. Also, we do not detect the canonical direction (i.e., the front of a car) of a cluster if its motion is small, so the orientation can be imprecise as well. This explains the gap to a specialized system like [7] on hard examples. Compared to DynSLAM, the average precision improves because ClusterVO tracks moving objects consistently over time and predicts their motions even when the 2D detection network misses some targets. Additionally, we emphasize the high efficiency of the ClusterVO system by comparing the time costs in Table 1, while the work of Chen et al. requires 1.2 seconds for each stereo pair. Qualitative results on the KITTI raw dataset are shown in Figure 7.

Figure 7: Qualitative results on KITTI raw dataset. The image below each sequence shows the input image and detections of the most recent frame.

4.4 Ablation Study

We test the importance of each probabilistic term in our heterogeneous CRF formulation (Eq. 5) using a synthesized motion dataset rendered from SUNCG [37]. Following the same stereo camera parameters as [15], we generate 4 indoor sequences with moving chairs and balls, and compare the accuracy of ego motion and cluster motions in Table 4.

Method            | Ego Motion (ATE, RPE) | Cluster Motion (ATE, RPE)
ORB-SLAM2 [28]    | 0.35, 0.14/0.59       | –, –
DynSLAM [2]       | 54.07, 11.07/49.24    | 0.26, 1.23/0.59
ClusterSLAM [15]  | 1.34, 0.41/1.89       | 0.17, 0.34/0.30
ClusterVO 2D      | 0.62, 0.19/0.95       | 0.24, 0.31/0.53
ClusterVO 2D+3D   | 0.52, 0.11/0.87       | 0.15, 0.50/0.53
ClusterVO Full    | 0.61, 0.19/0.91       | 0.13, 0.37/0.36
Table 4: Ablation comparisons on the SUNCG dataset in terms of ego motion and cluster trajectories (each entry lists ATE followed by the two relative pose errors).
Figure 8: Unary term visualizations on one indoor sequence from SUNCG dataset. (a) ClusterVO 2D; (b) ClusterVO 2D+3D; (c) ClusterVO Full.

By gradually adding the different terms of Eq. 5 to the system, our performance on estimating cluster motions improves, especially in terms of absolute trajectory error (a 45.8% decrease compared to the 2D-only CRF), while the accuracy of ego motion is unaffected. This is due to more accurate moving-object clustering that combines both geometric and semantic cues. Notably, our results are even comparable to the most recent ClusterSLAM [15], a backend method with full batched Bundle Adjustment optimization: this shows that incorporating semantic information into the motion detection problem effectively regularizes the solution and yields more consistent trajectory estimation. Figure 8 further visualizes this effect by computing the classification result based only on the unary term. Some mis-classified landmarks are successfully filtered out by incorporating more information.
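As a rough illustration of how semantic, spatial, and motion cues might be fused into a per-landmark unary score over cluster labels, consider the sketch below. The weights and probability models are hypothetical and only mimic the spirit of such a unary term; they do not reproduce the paper's Eq. 5:

```python
import numpy as np

# Hypothetical sketch: fuse three cue distributions over K cluster labels
# into one unary distribution via a weighted sum of log-probabilities.
def unary_fuse(p_semantic, p_spatial, p_motion, w=(1.0, 1.0, 1.0)):
    """Return a normalized label distribution combining all three cues."""
    logs = [np.log(np.clip(p, 1e-9, 1.0))
            for p in (p_semantic, p_spatial, p_motion)]
    score = sum(wi * li for wi, li in zip(w, logs))
    score -= score.max()          # numerical stabilization
    p = np.exp(score)
    return p / p.sum()            # softmax-style normalization

# Three candidate clusters; each cue votes with its own distribution.
p = unary_fuse(np.array([0.7, 0.2, 0.1]),
               np.array([0.5, 0.4, 0.1]),
               np.array([0.6, 0.3, 0.1]))
print(p.argmax())  # cluster 0 wins under all three cues
```

Landmarks on which a single cue is ambiguous can still be assigned confidently when the other cues agree, which is the intuition behind the filtering effect shown in Figure 8.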

5 Conclusion

In this paper we present ClusterVO, a fast, general-purpose stereo visual odometry system for simultaneous moving rigid-body clustering and motion estimation. Results comparable to state-of-the-art solutions on both camera ego-motion and dynamic object pose estimation demonstrate the effectiveness of our system. In the future, one direction would be to incorporate specific scene priors as pluggable components to improve ClusterVO performance in specialized applications (e.g., autonomous driving); another is to fuse information from multiple sensors to further improve localization accuracy.

Acknowledgements. We thank the anonymous reviewers for their valuable comments. This work was supported by the Natural Science Foundation of China (Project Numbers 61521002, 61902210), the Joint NSFC-DFG Research Program (Project Number 61761136018) and a Research Grant of the Tsinghua-Tencent Joint Laboratory for Internet Innovation Technology.


  • [1] T. D. Barfoot (2017) State estimation for robotics. Cambridge University Press. Cited by: §3.3.
  • [2] I. A. Bârsan, P. Liu, M. Pollefeys, and A. Geiger (2018) Robust dense mapping for large-scale dynamic environments. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pp. 7510–7517. Cited by: §1, Table 1, §4.3, §4.3, Table 2, Table 3, Table 4.
  • [3] A. Behl, O. Hosseini Jafari, S. Karthik Mustikovela, H. Abu Alhaija, C. Rother, and A. Geiger (2017) Bounding boxes, segmentations and object coordinates: how important is recognition for 3d scene flow estimation in autonomous driving scenarios?. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2574–2583. Cited by: §2.
  • [4] B. Bescos, J. M. Fácil, J. Civera, and J. Neira (2018) DynaSLAM: tracking, mapping, and inpainting in dynamic scenes. IEEE Robotics and Automation Letters 3 (4), pp. 4076–4083. Cited by: §1, §2.
  • [5] G. Bhat, J. Johnander, M. Danelljan, F. Shahbaz Khan, and M. Felsberg (2018) Unveiling the power of deep tracking. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 483–498. Cited by: §1, §2.
  • [6] S. Caccamo, E. Ataer-Cansizoglu, and Y. Taguchi (2017) Joint 3d reconstruction of a static scene and moving objects. In Proceedings of the International Conference on 3D Vision, pp. 677–685. Cited by: §1, §2.
  • [7] X. Chen, K. Kundu, Y. Zhu, H. Ma, S. Fidler, and R. Urtasun (2017) 3d object proposals using stereo imagery for accurate object class detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 40 (5), pp. 1259–1272. Cited by: §4.3, §4.3, Table 3.
  • [8] W. Dai, Y. Zhang, P. Li, and Z. Fang (2018) Rgb-d slam in dynamic environments using points correlations. arXiv preprint arXiv:1811.03217. Cited by: §2.
  • [9] K. Eckenhoff, Y. Yang, P. Geneva, and G. Huang (2019) Tightly-coupled visual-inertial localization and 3-d rigid-body target tracking. IEEE Robotics and Automation Letters 4 (2), pp. 1541–1548. Cited by: §2.
  • [10] A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3354–3361. Cited by: §4.1.
  • [11] A. Geiger, J. Ziegler, and C. Stiller (2011) Stereoscan: dense 3d reconstruction in real-time. In Proceedings of the IEEE Intelligent Vehicles Symposium (IV), pp. 963–968. Cited by: §4.3.
  • [12] R. Girshick (2015) Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1440–1448. Cited by: §2.
  • [13] A. Grabner, P. M. Roth, and V. Lepetit (2019) GP2C: geometric projection parameter consensus for joint 3d pose and focal length estimation in the wild. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2222–2231. Cited by: §2.
  • [14] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2961–2969. Cited by: §2.
  • [15] J. Huang, S. Yang, Z. Zhao, Y. Lai, and S. Hu (2019) ClusterSLAM: a slam backend for simultaneous rigid body clustering and motion estimation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 5875–5884. Cited by: §1, Table 1, §2, §3.3, §4.3, §4.3, §4.4, §4.4, Table 2, Table 4.
  • [16] K. M. Judd, J. D. Gammell, and P. Newman (2018) Multimotion visual odometry (mvo): simultaneous estimation of camera and third-party motions. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3949–3956. Cited by: §4.1, §4.2, §4.2.
  • [17] K. M. Judd and J. D. Gammell (2019) Occlusion-robust mvo: multimotion estimation through occlusion via motion closure. arXiv preprint arXiv:1905.05121. Cited by: §4.1, §4.2.
  • [18] K. M. Judd and J. D. Gammell (2019) The oxford multimotion dataset: multiple se3 motions with ground truth. IEEE Robotics and Automation Letters 4 (2), pp. 800–807. Cited by: Figure 4, §4.1.
  • [19] M. Keller, D. Lefloch, M. Lambers, S. Izadi, T. Weyrich, and A. Kolb (2013) Real-time 3d reconstruction in dynamic scenes using point-based fusion. In Proceedings of the International Conference on 3D Vision, pp. 1–8. Cited by: §1, §2.
  • [20] D. Kim and J. Kim (2016) Effective background model-based rgb-d dense visual odometry in a dynamic environment. IEEE Transactions on Robotics 32 (6), pp. 1565–1573. Cited by: §1, §2.
  • [21] P. Krähenbühl and V. Koltun (2011) Efficient inference in fully connected crfs with gaussian edge potentials. In Advances in Neural Information Processing Systems, pp. 109–117. Cited by: §3.2.
  • [22] S. Kumar, V. Dhiman, M. R. Ganesh, and J. J. Corso (2016) Spatiotemporal articulated models for dynamic slam. arXiv preprint arXiv:1604.03526. Cited by: §2.
  • [23] P. Li, X. Chen, and S. Shen (2019) Stereo r-cnn based 3d object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7644–7652. Cited by: §2.
  • [24] P. Li, T. Qin, et al. (2018) Stereo vision-based semantic 3d object and ego-motion tracking for autonomous driving. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 646–661. Cited by: §1, Table 1, §2, §4.1, §4.3, Table 2.
  • [25] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 21–37. Cited by: §2.
  • [26] J. Luiten, T. Fischer, and B. Leibe (2020) Track to reconstruct and reconstruct to track. IEEE Robotics and Automation Letters 5 (2), pp. 1803–1810. Cited by: §1, §2.
  • [27] M. Menze and A. Geiger (2015) Object scene flow for autonomous vehicles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3061–3070. Cited by: §2.
  • [28] R. Mur-Artal and J. D. Tardós (2017) ORB-SLAM2: an open-source slam system for monocular, stereo, and rgb-d cameras. IEEE Transactions on Robotics 33 (5), pp. 1255–1262. Cited by: §1, Table 1, §3.1, §3.3, §4.3, Table 2, Table 4.
  • [29] G. B. Nair, S. Daga, R. Sajnani, A. Ramesh, J. A. Ansari, and K. M. Krishna (2020) Multi-object monocular slam for dynamic environments. arXiv preprint arXiv:2002.03528. Cited by: §2.
  • [30] R. A. Newcombe, D. Fox, and S. M. Seitz (2015) Dynamicfusion: reconstruction and tracking of non-rigid scenes in real-time. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 343–352. Cited by: Table 1.
  • [31] K. Qiu, T. Qin, W. Gao, and S. Shen (2019) Tracking 3-d motion of dynamic objects using monocular visual-inertial sensing. IEEE Transactions on Robotics 35 (4), pp. 799–816. Cited by: §2.
  • [32] J. Redmon and A. Farhadi (2017) YOLO9000: better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7263–7271. Cited by: §1, §1, §2, §3, §4.2.
  • [33] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski (2011) ORB: an efficient alternative to SIFT or SURF. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2564–2571. Cited by: §3.
  • [34] M. Rünz and L. Agapito (2017-05) Co-fusion: real-time segmentation, tracking and fusion of multiple objects. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pp. 4471–4478. Cited by: §1, §2.
  • [35] M. Runz, M. Buffier, and L. Agapito (2018) Maskfusion: real-time recognition, tracking and reconstruction of multiple moving objects. In Proceedings of the IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pp. 10–20. Cited by: §1, Table 1, §2.
  • [36] S. Sharma, J. A. Ansari, J. K. Murthy, and K. M. Krishna (2018) Beyond pixels: leveraging geometry and shape cues for online multi-object tracking. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pp. 3508–3515. Cited by: §1, §2.
  • [37] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser (2017) Semantic scene completion from a single depth image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1746–1754. Cited by: §4.4.
  • [38] M. Strecke and J. Stuckler (2019) EM-fusion: dynamic object-level slam with probabilistic data association. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 5865–5874. Cited by: §1, §2.
  • [39] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers (2012) A benchmark for the evaluation of rgb-d slam systems. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 573–580. Cited by: §4.3.
  • [40] A. Tagliasacchi, M. Schröder, A. Tkach, S. Bouaziz, M. Botsch, and M. Pauly (2015) Robust articulated-icp for real-time hand tracking. In Computer Graphics Forum, Vol. 34, pp. 101–114. Cited by: §2.
  • [41] K. Tateno, F. Tombari, and N. Navab (2015) Real-time and scalable incremental segmentation on dense slam. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4465–4472. Cited by: §3.2.
  • [42] B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon (1999) Bundle adjustment—a modern synthesis. In International workshop on vision algorithms, pp. 298–372. Cited by: §3.3.
  • [43] D. Tzionas and J. Gall (2016) Reconstructing articulated rigged models from rgb-d videos. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 620–633. Cited by: §2.
  • [44] C. Wang, C. Thorpe, S. Thrun, M. Hebert, and H. Durrant-Whyte (2007) Simultaneous localization, mapping and moving object tracking. The International Journal of Robotics Research 26 (9), pp. 889–916. Cited by: §2.
  • [45] L. Xiang, Z. Ren, M. Ni, and O. C. Jenkins (2015) Robust graph slam in dynamic environments with moving landmarks. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2543–2549. Cited by: §2.
  • [46] B. Xu, W. Li, D. Tzoumanikas, M. Bloesch, A. Davison, and S. Leutenegger (2019) Mid-fusion: octree-based object-level multi-instance dynamic slam. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pp. 5231–5237. Cited by: §1, §2.
  • [47] K. Yamaguchi, D. McAllester, and R. Urtasun (2014) Efficient joint segmentation, occlusion labeling, stereo and flow estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 756–771. Cited by: §4.3.
  • [48] S. Yang and S. Scherer (2019) Cubeslam: monocular 3-d object slam. IEEE Transactions on Robotics 35 (4), pp. 925–938. Cited by: §1.
  • [49] C. Yu, Z. Liu, X. Liu, F. Xie, Y. Yang, Q. Wei, and Q. Fei (2018) DS-slam: a semantic visual slam towards dynamic environments. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1168–1174. Cited by: §2.
  • [50] F. Zhong, S. Wang, Z. Zhang, and Y. Wang (2018) Detect-slam: making object detection and slam mutually beneficial. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1001–1010. Cited by: §2.