Consensus-Informed Optimization Over Mixtures for Ambiguity-Aware Object SLAM

by Ziqi Lu, et al.

Objects often have multiple probable poses in single-shot measurements due to symmetry, occlusion, or perceptual failures. A robust object-level simultaneous localization and mapping (object SLAM) algorithm needs to be aware of this pose ambiguity. We propose to maintain and subsequently disambiguate the multiple pose interpretations to gradually recover a globally consistent world representation. The max-mixtures model is applied to implicitly and efficiently track all pose hypotheses. The temporally consistent hypotheses are extracted to guide the optimization solution into the global optimum. This consensus-informed inference method is implemented on top of the incremental SLAM framework iSAM2, via landmark variable re-initialization.


I Introduction

Robot-environment interactions rely heavily on real-time, robust and high-level spatial understanding. In mobile manipulation, for instance, the robot is required to reliably self-localize and pinpoint the target object for safe navigation and interaction. Building a consistent object-level world representation would greatly facilitate the process. Without external assistance and abundant visual features, the robot is expected to map the environment using an object-based SLAM system.

Recent progress in machine learning based object recognition and pose estimation techniques has spurred the development of object-based SLAM. Using the learned perception models, the robot gains higher-level environment understanding leveraging semantic and object pose information. However, ambiguities in object poses, induced by symmetry or occlusions, can cause unexpected uncertainties in pose estimates. The computer vision community has developed various ambiguity-resolving pose inference algorithms [17, 29, 20, 4, 10, 19, 6, 32] to address the difficulty. A SLAM optimization framework is required to tolerate and synthesize the potentially ambiguous local pose measurements and efficiently recover a robust global representation.

For instance, when a coffee mug is viewed from a vantage point where the handle is not visible, the mug pose can hardly be uniquely constrained (Fig. 1). An ambiguous observation of a coffee mug would therefore result in multiple, and even infinitely many, possible interpretations of the robot and mug poses. The ambiguity-oblivious single-hypothesis SLAM inference may fail quietly in this circumstance.

Fig. 1: Coffee mug pose ambiguity and disambiguation: A single-shot observation from a handle-occluded viewpoint fails to yield unique robot and mug pose estimates. Disambiguation and robust mug grasping need a handle-visible angle.

In this work, we use a discrete, multi-hypothesis pose representation to quantify the uncertainties in ambiguous objects’ 6D pose measurements. We keep track of all the pose hypotheses over time and gradually recover the robot and object poses. The challenge is that the total number of hypotheses grows exponentially with the number of ambiguous measurements. An exhaustive global search over the hypothesis space entails exorbitant time and memory.

To alleviate the complexity, we represent all hypotheses as Gaussian max-mixtures, and cast the problem as a continuous, albeit multi-modal, factor graph optimization [21]. High computational efficiency is achieved with Gaussian-preserving max-mixtures amenable to nonlinear least squares optimization. However, the initialization-sensitive gradient-based optimizers can be easily trapped in local optima without global awareness of modes in the posterior.

We propose a consensus-based heuristic to augment local optimization (iSAM2 [13]) with global understanding of the dominant pose hypothesis. The most time-consistent set of object pose estimates is extracted from measurements to inform the optimization of the dominant mode up to now. Once the current mode loses superiority, we perform low-overhead factor graph surgery to change the landmark initialization and restart the optimization inside the new dominant mode. This “dynamic re-initialization (re-init)” procedure robustifies the efficient optimization over mixtures with the flexibility of adjusting mode selections.

The proposed algorithm provides a robust, real-time solution for robot localization and mapping in feature-sparse, ambiguity-rich environments. Two SLAM experiments with ambiguous objects are conducted to demonstrate its improved estimation performance.

II Related Work

II-A Ambiguity-Aware Pose Estimation

Object pose estimation considering symmetry or ambiguity is becoming increasingly popular in computer vision and robotics. A natural idea is to represent symmetry in an unambiguous manner (e.g. [23, 14]). Rad and Lepetit [23] restrict the pose of a rotationally symmetric object to an unambiguous range. Any pose measurement beyond the range is returned as its counterpart within the range.

Another popular class of methods makes multi-hypothesis pose predictions via view matching. The visual appearances of an object from arbitrary viewpoints are compared against the captured image. Multiple poses are returned when there exist one-to-many view-pose correspondences. Manhardt et al. [17] retrieve 6D poses for ambiguous objects based on a learned multi-hypothesis view-pose mapping. Sundermeyer et al. [29] encode visual appearances using an auto-encoder and perform similarity checks in the feature space.

Other ambiguity resolving strategies incorporate regressing pose distributions [20] and training symmetry predictors [4]. Beyond that, multi-shot pose estimators are developed to gradually recover ambiguous object poses. The predicted poses are tracked and updated from successive observations by non-Gaussian filters [10, 19, 6] or a learned tracking network [32].

Our primary interest is to tolerate and utilize the local, potentially ambiguous, object pose measurements to build a globally consistent world representation.

II-B Multi-Modal SLAM Inference

We review the three most relevant methods for multi-modal inference under measurement ambiguity:

A natural solution would be to select the most likely hypothesis for each ambiguous measurement and solve a uni-modal inference problem, which we refer to as the single-hypothesis method (e.g. nearest neighbor for data association [1]). Because the information in the discarded hypotheses is lost, it can fail quietly with noise- or outlier-corrupted measurements.

Multi-hypothesis tracking (MHT) based methods [15, 12, 11] maintain the most probable hypotheses and solve the corresponding set of uni-modal inference problems. B-iMAP [12] is the first smoothing-based multi-hypothesis SLAM pipeline, where probable hypotheses are selected analytically. MH-iSAM2 [11] utilizes the Bayes tree and Hypo-tree to perform efficient incremental multi-modal inference with rule-based hypothesis pruning. MHT-based methods provide multiple state estimates including the temporary optimum. The global optimum is achieved as long as enough hypotheses are tracked. However, the rapid hypothesis accumulation necessitates hypothesis pruning, and designing an optimality-guaranteed pruning strategy remains challenging.

The max-mixtures model [21] provides an efficient solution to point estimates in multi-modal inference. However, optimization over max-mixtures can be trapped in a local minimum with a poor initialization. Wang and Olson [31] propose to bootstrap the max-mixtures model with stochastic gradient descent (SGD) and obtain increased robustness to poor initializations on pose graph optimization. However, SGD is sensitive to the learning rate and lacks global understanding of dominant modes. Instead of local search, we use the consensus of pose hypotheses to guide optimization towards the global optimum, to obtain a robust and efficient solution to pose-ambiguity-corrupted SLAM.

II-C Object SLAM

SLAM systems utilizing object 6D pose representations (e.g. [3, 9, 25, 28]) have great promise for robot-object interactions. But the initialization of landmark poses, especially for ambiguous objects, has been a challenging and under-examined problem. One way of handling ambiguities in single-shot observations is to delay object registration until robust 6D pose estimates are available (e.g. [3, 9]). Another is to robustify the one-shot pose estimates. NodeSLAM [28], for instance, trains a coffee mug rotation estimator for reliable mug orientation initialization.

Instead of relying on the first measurement to provide satisfactory initialization or delaying object registrations, we enable the flexibility to adjust the landmark initializations during the incremental multi-modal inference process.

III Object-Based SLAM with Pose Ambiguities

We define the object-based SLAM problem as joint inference of 6D robot poses $X = \{x_t\}$ and object landmark poses $L = \{l_j\}$ from a series of measurements $Z$, consisting of odometry measurements $z_{x_t}^{x_{t+1}}$ and landmark measurements $z_{x_t}^{l_j}$, where $z_a^b$ denotes the relative pose measurement from $a$ to $b$. Each object in the world is assumed to be static. Without measurement ambiguity, the inference can be modeled as a maximum a posteriori (MAP) estimation problem:

$$\hat{X}, \hat{L} = \operatorname*{arg\,max}_{X, L} \; p(X, L \mid Z) \tag{1}$$

In the real world, however, a robot could inevitably fail to uniquely perceive the 6D pose of an object. Ambiguous objects such as a coffee mug with handle occluded (Fig. 1), or a centrally symmetric playing card don’t have unique pose representations. Furthermore, the perception models can also output highly-uncertain or redundant pose predictions due to algorithmic failures or challenging scenes.

Instead of assuming perfect measurements, we use multiple discrete pose hypotheses to represent all possibilities. A symmetric playing card, for example, possesses two hypotheses in a single-shot observation. It is not overkill to fully represent the two visually indistinguishable poses. Ignoring the two pose hypotheses can confuse robot localization, as opposite viewpoints yield identical card appearance.
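The two-hypothesis card measurement can be illustrated with a minimal sketch. It assumes a planar simplification in which a pose is `(x, y, theta)` rather than a full 6D pose; the function names are hypothetical and are not taken from the paper's implementation.

```python
import math


def wrap_angle(theta):
    """Wrap an angle to the interval [-pi, pi]."""
    return math.atan2(math.sin(theta), math.cos(theta))


def card_pose_hypotheses(x, y, theta):
    """Return the two visually indistinguishable planar poses of a card.

    Assumption: the card is centrally symmetric, so rotating it by 180
    degrees about its own center leaves its appearance unchanged.
    """
    return [
        (x, y, wrap_angle(theta)),
        (x, y, wrap_angle(theta + math.pi)),
    ]


# One single-shot observation yields two hypotheses with the same position
# and headings that differ by half a turn.
hyps = card_pose_hypotheses(1.0, 2.0, math.pi / 6)
```

Both entries would be handed to the back end as one multi-hypothesis measurement instead of committing to either of them up front.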

Therefore, we also have to infer the true hypothesis in each ambiguous measurement. A discrete “hypothesis decision” variable $H = \{h_t^j\}$ is introduced to the optimization. $h_t^j = i$ indicates that the $i$-th hypothesis in measurement $z_{x_t}^{l_j}$ is the true pose for landmark $l_j$. Accordingly, the full MAP estimation can be formulated as:

$$\hat{X}, \hat{L}, \hat{H} = \operatorname*{arg\,max}_{X, L, H} \; p(X, L, H \mid Z) \tag{2}$$

Unfortunately, the combinations of possible hypothesis assignments grow exponentially with the number of ambiguous measurements during navigation (Fig. 2). An exhaustive search over the exploding hypothesis space imposes great computational cost. We propose to implicitly model the multi-modal uncertainty as Gaussian max-mixtures.

Fig. 2: Hypothesis explosion. The symmetric playing card has 2 pose hypotheses in each observation. $n$ observations lead to $2^n$ hypotheses in total. We use max-mixture factors to implicitly represent the multiple hypotheses in a measurement as Gaussian mixtures.
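The growth described in Fig. 2 can be checked with a short sketch. It enumerates every joint hypothesis assignment for a few two-hypothesis measurements and compares that count with the number of mixture components an implicit representation has to store. The helper name is hypothetical.

```python
from itertools import product


def joint_assignments(hypotheses_per_measurement):
    """Enumerate every joint hypothesis assignment (the exhaustive search space)."""
    ranges = [range(n) for n in hypotheses_per_measurement]
    return list(product(*ranges))


# Two hypotheses per observation, for 1 to 5 observations.
explicit_counts = [len(joint_assignments([2] * n)) for n in range(1, 6)]

# An implicit mixture representation stores one component per hypothesis
# per measurement, so its size grows linearly instead.
stored_components = [2 * n for n in range(1, 6)]
```

The explicit count doubles with each new ambiguous observation, while the number of stored components only increases by two.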

IV Max-Mixtures Method with Dynamic Re-Init

IV-A Max-Mixtures Model

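While max-mixtures was initially introduced as an approximation of sum-mixtures, we show that it can be independently derived by variable elimination of the MAP. With the robot and object landmark poses, $X$ and $L$, being of key concern in SLAM, we can marginalize out the hypothesis selections $H$ from (2) with the max-product algorithm:

$$\hat{X}, \hat{L} = \operatorname*{arg\,max}_{X, L} \; \max_{H} \; p(X, L, H \mid Z) \tag{3}$$

Under the Bayes rule, the joint posterior in (3) can be factored as:

$$\max_{H} \; p(X, L, H \mid Z) \propto p(X, L) \, \max_{H} \; p(Z \mid X, L, H) \, p(H) \tag{4}$$

with independent joint priors $p(X, L, H) = p(X, L) \, p(H)$ and the irrelevant priors $p(X, L)$ taken out of the $\max$ operator.

Assuming the measurements are independent, the joint probabilities can be decomposed as the product of factors (prior factors on $X$ and $L$ omitted):

$$\hat{X}, \hat{L} = \operatorname*{arg\,max}_{X, L} \; \prod_{t} \phi(x_t, x_{t+1}) \; \max_{H} \prod_{(t, j)} \psi(x_t, l_j, h_t^j) \tag{5}$$

where $\phi$ represents odometry factors that are irrelevant to pose hypotheses $H$, and $\psi$ corresponds to landmark pose measurements, which following the joint probabilities in (4) takes the form of:

$$\psi(x_t, l_j, h_t^j) = p(z_{x_t}^{l_j} \mid x_t, l_j, h_t^j) \, p(h_t^j) \tag{6}$$

Swapping the $\max$ and product operator in front of $\psi$ in (5) yields the final form of the landmark measurement factor:

$$\max_{h_t^j} \; p(z_{x_t}^{l_j} \mid x_t, l_j, h_t^j) \, p(h_t^j) \tag{7}$$

where $\max$ under $H$ reduces to $h_t^j$ for each measurement.

Assuming the measurement model for each pose hypothesis is Gaussian, we obtain a max-mixture type factor, in which $p(z_{x_t}^{l_j} \mid x_t, l_j, h_t^j)$ represents the Gaussian observation likelihood for hypothesis $h_t^j$ and $p(h_t^j)$ denotes the discrete hypothesis weights. The $\max$-operator selects the best hypothesis given the latent variables $X, L$.

Finally, the formulation for the max-mixtures model can be expressed as:

$$\hat{X}, \hat{L} = \operatorname*{arg\,max}_{X, L} \; \prod_{t} \phi(x_t, x_{t+1}) \prod_{(t, j)} \max_{h_t^j} \; p(z_{x_t}^{l_j} \mid x_t, l_j, h_t^j) \, p(h_t^j) \tag{8}$$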

With Gaussianity locally preserved by the selected mixture components, the max-mixtures model is amenable to efficient solution by nonlinear least squares optimization. In this study, we use the incremental SLAM framework iSAM2 [13] to solve (8) for better scalability.

Meanwhile, after hypothesis decisions, other subdominant modes are still kept alive in the latent space. The best components will be re-evaluated at each optimization step [21]. Therefore, all the pose hypotheses are implicitly tracked forward and the exhaustive search over hypothesis space is replaced by the local hypothesis selections.
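A minimal sketch of how a max-mixture factor selects its component is given below. It assumes a scalar measurement (for example a heading angle) and per-component standard deviations; the function names and numbers are illustrative only and do not reproduce the paper's C++ implementation.

```python
import math


def gaussian_log_likelihood(residual, sigma):
    """Log-density of a zero-mean scalar Gaussian evaluated at the residual."""
    return -0.5 * (residual / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))


def max_mixture(z_hypotheses, weights, sigmas, predicted):
    """Return (index, log-likelihood) of the best mixture component.

    The component is re-selected for whatever prediction the current
    estimate produces, so hypotheses that lose now remain available later.
    """
    best_index, best_value = -1, -math.inf
    for i, (z, w, s) in enumerate(zip(z_hypotheses, weights, sigmas)):
        value = math.log(w) + gaussian_log_likelihood(z - predicted, s)
        if value > best_value:
            best_index, best_value = i, value
    return best_index, best_value


# Two heading hypotheses roughly half a turn apart, equally weighted.
hypotheses, weights, sigmas = [0.0, 3.14], [0.5, 0.5], [0.1, 0.1]
near_first, _ = max_mixture(hypotheses, weights, sigmas, predicted=0.2)
near_second, _ = max_mixture(hypotheses, weights, sigmas, predicted=3.0)
```

The selected index depends on the current linearization point. This is why a poor initial value can lock the optimizer into the wrong component even though the other hypothesis is still represented in the factor.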

However, efficiency of the max-mixtures model is achieved at the expense of potential local optimality. As the number of ambiguous observations increases, modes accumulate rapidly in the posterior distribution. Without a good initial value, the solution can easily get trapped in an incorrect mode. Therefore, we propose to use consensus over pose hypotheses to directly guide optimization over max-mixtures into the global optimum.

IV-B Dynamic Re-initialization

A good initial value is typically unachievable for an ambiguous landmark variable. As the first one-shot object observation has multiple hypotheses, it is usually hard to determine which one represents the real pose. Even worse, a sub-dominant pose hypothesis could become dominant later. An arbitrary or a temporarily optimal initialization may trap the solution in a local minimum. In general, it is difficult for the solution to escape the incorrect mode and converge to the global optimum.

Therefore, we perform dynamic landmark re-initialization, forcing some variables in the least-squares optimization to “restart” inside the dominant mode. The procedure is summarized in Alg. 1.

Input: new multi-hypothesis relative pose measurements $z$ from robot to ambiguous landmark $l$, corresponding max-mixture factor $f$
if $l$ doesn’t exist in factor graph then
     1. Initialize $l$ with a pose hypothesis from $z$ and add $f$ into factor graph.
     2. Create pose cache $C$ for $l$.
else if $l$ exists in factor graph then
     1. Identify the most consistent set of poses in $C$ and compute their average $\bar{l}$ with Alg. 2.
     2. Landmark re-init.
     if dist($\bar{l}$, current initial value of $l$) > re-init threshold then
         Re-initialize $l$ with $\bar{l}$ using Alg. 3.
     end if
     3. Add $f$ into factor graph.
     4. Append $C$ with $z$.
end if
Algorithm 1 Dynamic Re-Init
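The control flow of Alg. 1 can be sketched as follows. This is a simplified illustration, not the paper's implementation: poses are planar `(x, y, theta)` tuples already expressed in the world frame, the back end is a hypothetical bookkeeping class rather than iSAM2, a simple densest-cluster average stands in for Alg. 2, and the first listed hypothesis is used as the initial value.

```python
import math


def planar_distance(a, b, angle_weight=1.0):
    """Hypothetical pose distance: translation norm plus weighted heading gap."""
    dtheta = math.atan2(math.sin(a[2] - b[2]), math.cos(a[2] - b[2]))
    return math.hypot(a[0] - b[0], a[1] - b[1]) + angle_weight * abs(dtheta)


def densest_cluster_mean(cache, radius=0.3):
    """Stand-in for Alg. 2: average the largest group of mutually close poses."""
    best = []
    for candidate in cache:
        members = [p for p in cache if planar_distance(p, candidate) <= radius]
        if len(members) > len(best):
            best = members
    n = len(best)
    theta = math.atan2(sum(math.sin(p[2]) for p in best),
                       sum(math.cos(p[2]) for p in best))
    return (sum(p[0] for p in best) / n, sum(p[1] for p in best) / n, theta)


class ToyBackEnd:
    """Bookkeeping only; a real system would hold the incremental solver here."""

    def __init__(self):
        self.initial_values = {}  # landmark id -> current initial pose
        self.factors = {}         # landmark id -> list of multi-hypothesis measurements
        self.caches = {}          # landmark id -> all pose hypotheses seen so far
        self.reinit_count = 0


def dynamic_reinit(backend, landmark, hypotheses, threshold):
    """One pass of the dynamic re-init procedure for a new measurement."""
    if landmark not in backend.initial_values:
        backend.initial_values[landmark] = hypotheses[0]
        backend.factors[landmark] = [hypotheses]
        backend.caches[landmark] = list(hypotheses)
        return
    dominant = densest_cluster_mean(backend.caches[landmark])
    if planar_distance(dominant, backend.initial_values[landmark]) > threshold:
        backend.initial_values[landmark] = dominant  # Alg. 3 would run here
        backend.reinit_count += 1
    backend.factors[landmark].append(hypotheses)
    backend.caches[landmark].extend(hypotheses)


backend = ToyBackEnd()
ambiguous = [(1.0, 1.0, 0.52), (1.0, 1.0, 0.0), (1.0, 1.0, -0.52)]
unambiguous = [(1.0, 1.0, 0.0)]
for measurement in (ambiguous, unambiguous, unambiguous, unambiguous):
    dynamic_reinit(backend, "mug", measurement, threshold=0.26)
```

In this run the landmark starts at a spurious heading. Once two unambiguous observations are in the cache, the consistent cluster dominates and a single re-initialization moves the initial value to the consistent heading. Because the sketch follows the step order listed in Alg. 1, a new measurement only influences the consensus on the following call.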

IV-B1 Dominant Hypothesis Inference

Input: pose cache $C$ for landmark $l$
Output: average of inlier poses $\bar{l}$
1. Randomly select a subset of poses out of $C$.
2. Average the selected poses.
3. Find the set of poses within the distance threshold to the average pose.
4. Consensus check
if the size of this set reaches the consensus threshold then
     Accept the set as inliers and return their average $\bar{l}$.
else
     Repeat from Step 1 until maximum number of iterations is reached.
end if
Algorithm 2 Outlier Robust Pose Averaging w/ RANSAC

We infer the dominant pose hypothesis for each landmark to inform the optimization of the globally consistent mode.

Since the absolute (world frame) poses for static landmarks are time-invariant, only true pose hypotheses appear consistently in measurements. False and outlier hypotheses typically exist unsteadily over time. Hence, the true hypotheses will gradually build up a leading cluster in our cache of poses. We extract this consistent set of pose hypotheses and average them to obtain the initial value. An outlier-robust pose averaging (Fig. 3, Alg. 2) method is employed to separate the dominant from sub-dominant hypotheses. If there’s a significant change in the average pose, landmark re-init (Fig. 4, Alg. 3) is triggered to guide the optimization into the new dominant mode.
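The consensus step of Alg. 2 can be sketched as a small RANSAC loop. The sketch assumes planar `(x, y, theta)` poses, a hypothetical distance that adds translation error to heading error, and illustrative thresholds. The paper works with 6D poses and derives its thresholds from the hypothesis distances.

```python
import math
import random


def pose_distance(a, b, angle_weight=1.0):
    """Hypothetical pose distance: translation norm plus weighted heading gap."""
    dtheta = math.atan2(math.sin(a[2] - b[2]), math.cos(a[2] - b[2]))
    return math.hypot(a[0] - b[0], a[1] - b[1]) + angle_weight * abs(dtheta)


def average_poses(poses):
    """Arithmetic mean of positions and circular mean of headings."""
    n = len(poses)
    theta = math.atan2(sum(math.sin(p[2]) for p in poses),
                       sum(math.cos(p[2]) for p in poses))
    return (sum(p[0] for p in poses) / n, sum(p[1] for p in poses) / n, theta)


def ransac_pose_average(cache, sample_size, dist_threshold, min_inliers,
                        max_iters=100, rng=None):
    """Return the average of the first consensus set found, or None."""
    rng = rng or random.Random(0)
    if len(cache) < sample_size:
        return None
    for _ in range(max_iters):
        sample = rng.sample(cache, sample_size)          # Step 1
        center = average_poses(sample)                   # Step 2
        inliers = [p for p in cache
                   if pose_distance(p, center) <= dist_threshold]  # Step 3
        if len(inliers) >= min_inliers:                  # Step 4
            return average_poses(inliers)
    return None


cache = [
    (1.00, 1.00, 0.00), (1.02, 0.99, 0.01), (0.98, 1.01, -0.02),
    (1.01, 1.02, 0.02), (0.99, 0.98, -0.01), (1.00, 1.01, 0.00),
    (1.00, 1.00, 0.52), (1.01, 1.00, 0.52),   # recurring spurious hypothesis
    (3.00, -2.0, 1.50),                       # isolated outlier
]
dominant = ransac_pose_average(cache, sample_size=2, dist_threshold=0.2,
                               min_inliers=5)
```

Only the six mutually consistent poses can form a set of five or more inliers here, so the returned average stays close to `(1, 1, 0)` regardless of which sample first succeeds.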

To minimize the parameter tuning efforts, we determine the re-init and RANSAC distance thresholds (in Algs. 1-2) according to the mutual distances between the hypotheses in multi-hypothesis measurements, which indicate the true-to-false hypothesis distances. The minimum mutual distance is used to compute and update these thresholds.

Fig. 3: Outlier-robust pose averaging to infer the dominant hypothesis of an ambiguous landmark

Iv-B2 Landmark Re-initialization

As mentioned above, the SLAM variables are estimated using the iSAM2 algorithm [13]. In iSAM2, a landmark variable is initialized at its first observation. Unfortunately, there is no ready-made option for adjusting that initialization afterwards. A new function is required to fix incorrect mode selections for already initialized ambiguous landmarks.

Fig. 4 visualizes an example for landmark re-init, realized by performing local “surgery” on the factor graph. As illustrated, the new measurement disambiguates the landmark, after which the right mode outweighs the left one. To guide the solution from the left into the right mode, the landmark is re-initialized by removing and re-adding the landmark variable with its neighboring factors. As a result, the optimization is restarted from a new linearization point within the right mode. Any temporarily optimal mode selections in the old max-mixture factors are corrected. The solution for the landmark converges to the global minimum.

Under the hood, iSAM2 performs incremental local updates for only variables affected by new measurements [13]. Provided that the factor graph is not densely connected, the influence of the removing and re-adding steps is rather local. Therefore, in most cases, the re-init step doesn’t introduce much overhead, preserving the real-time performance. This is also validated by a quantitative case study in Appendix A.

Input: factors $F$ connected to landmark $l$, new initial value $\bar{l}$
1. Remove variable $l$ and $F$ from factor graph.
2. Re-add $l$ and $F$ to factor graph and initialize $l$ with $\bar{l}$.
Algorithm 3 Landmark Re-Init

Fig. 4: Landmark re-init by local surgery in a factor graph. At top left is a local part of a factor graph, containing an ambiguous landmark pose variable and the robot poses that observe it. Max-mixture factors (green) correspond to the landmark pose measurements, each possessing two pose hypotheses. The two distributions at bottom are marginal posteriors over the landmark pose before and after the new measurement, shown with the previous and new initial values. As the new measurement disambiguates the landmark, we remove and re-add the landmark variable and its neighboring max-mixture factors and re-init it with the new initial value.
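The graph surgery of Alg. 3 can be illustrated on a toy factor graph. The class below is a hypothetical stand-in that only tracks variables, initial values, and factor connectivity. It is not the GTSAM or iSAM2 API, and it performs no optimization.

```python
class ToyFactorGraph:
    """Minimal connectivity bookkeeping for illustrating landmark re-init."""

    def __init__(self):
        self.values = {}    # variable key -> initial value (linearization point)
        self.factors = {}   # factor id -> (tuple of variable keys, payload)
        self._next_id = 0

    def add_variable(self, key, value):
        self.values[key] = value

    def add_factor(self, keys, payload):
        factor_id = self._next_id
        self._next_id += 1
        self.factors[factor_id] = (tuple(keys), payload)
        return factor_id

    def factors_on(self, key):
        return [fid for fid, (keys, _) in self.factors.items() if key in keys]

    def remove_variable(self, key):
        """Remove a variable together with every factor attached to it."""
        removed = [self.factors.pop(fid) for fid in self.factors_on(key)]
        del self.values[key]
        return removed


def reinit_landmark(graph, landmark, new_value):
    """Alg. 3 sketch: remove the landmark and its factors, then re-add them."""
    removed = graph.remove_variable(landmark)
    graph.add_variable(landmark, new_value)
    for keys, payload in removed:
        graph.add_factor(keys, payload)
    return len(removed)


graph = ToyFactorGraph()
for i in range(3):
    graph.add_variable(f"x{i}", (float(i), 0.0, 0.0))
odometry_ids = [graph.add_factor((f"x{i}", f"x{i + 1}"), "odometry") for i in range(2)]
graph.add_variable("l0", (1.0, 1.0, 3.14))            # initial value in the wrong mode
for i in range(3):
    graph.add_factor((f"x{i}", "l0"), "max-mixture")
touched = reinit_landmark(graph, "l0", (1.0, 1.0, 0.0))
```

Only the three max-mixture factors attached to the landmark are touched. The odometry factors and the robot pose values are left as they were, which is the locality the re-init step relies on.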

V Experimental Results

We design two object SLAM experiments in ambiguity-rich and feature-sparse environments to test our algorithm. Playing cards (symmetry-induced ambiguity) and coffee mugs (occlusion-induced ambiguity) are used as test objects. Our algorithm is developed in C++ using the iSAM2 implementation [13] in the GTSAM library [5]. We use ROS [22] for data collection and post-processing. We run all the tests on a 2.60GHz Intel i7 CPU.

V-A Playing Cards SLAM Experiment

An object SLAM experiment, using playing cards as landmarks, is conducted with the SwarmRobot [18]. The robot is equipped with a forward-pointing ZED camera for visual odometry [27], and a downward-looking (30° to the ground) Blackfly [8] camera for image taking. As illustrated in Fig. 5a, 40 playing cards are placed on the ground in a 5×8 configuration. They are drawn from 22 classes (number+suit defines a class), where 4 cards are unique and the rest appear in pairs. The robot is remotely controlled to follow a lawnmower-patterned round-trip path. Each card is first observed from one view angle and revisited later from the opposite. Identical card appearances in the two encounters create ambiguities in pose estimations.

Fig. 5: Test field and robot path for the playing cards SLAM experiment. A playing card’s two pose hypotheses are estimated from the Blackfly camera view.
Fig. 6: Frames with card key-pose estimations (high-confidence). The sparse structure of this detection matrix reflects the scarcity of environmental features.
Fig. 7: Playing cards SLAM results. Lines are robot trajectories. Black and cyan texts are card pose references and estimations.
Fig. 8: Time evolution of average trajectory errors.

In the roughly 10-minute test, 8522 odometric measurements (ZED odometry) and 17369 images are collected. The images with cards detected are illustrated in Fig. 5c. We utilize the odometry and estimated relative card poses to gradually recover a world map consisting of the robot trajectory and 6D card poses. The Vicon MoCap system [30] is employed to obtain the ground truth data.

A SIFT feature [16] based algorithm is developed using OpenCV for card identification and pose estimation [24]. It is able to return the two centrally symmetric card pose hypotheses (Fig. 5). We design a number of “key-pose” criteria to filter out spurious pose measurements and use only key-poses in optimization. The measurement-card correspondences (data association) are inferred with the classic nearest neighbor approach. We adopt both the proposed max-mixture algorithm and a single-hypothesis system to solve the optimization. The latter incorporates a single-hypothesis estimator that predicts the most likely card pose, and a uni-modal back-end optimizer.

As shown in Fig. 7, the single-hypothesis method fails catastrophically because it ignores the pose ambiguity. At the first card re-observation, i.e. JC at bottom left of Fig. 7, the estimator returns a near-identical relative orientation compared to the card’s first encounter. However, the true relative orientation is reversed, since the robot views the card from the opposite viewpoint. As a consequence, the inconsistency in JC pose understanding leads to a large deviation in robot localization, which corresponds to the separation point in Fig. 8. Even worse, the localization failure also induces disastrous performance in data associations. Several cards are registered more than twice in the world map.

In contrast, the proposed algorithm maintains every pose hypothesis and keeps the card poses consistent. It correctly handles the decision making in data associations and true pose hypotheses. Loop closures at the JC re-observation and subsequent card revisits constantly adjust the trajectory and reduce estimation errors (Fig. 8). In the feature-scarce and ambiguity-rich environment, as Fig. 7 demonstrates, the max-mixture method still attains good localization and mapping performance.

V-B Simulation Experiment with Coffee Mugs

We conduct a simulation to compare different methods in a feature-sparse, ambiguity-rich scenario. A virtual environment is created using the Unreal Engine, in which ten mugs are placed as landmarks [7]. Some obstacles are configured around the mugs so that the mugs are frequently occluded. The mug model is adapted from the YCB data set and has been scaled up 50 times to match the environment [2]. A mobile robot equipped with a monocular camera is integrated in the environment via the AirSim car simulator [26]. The robot odometry, relative poses to the mugs, and camera images are collected via AirSim to estimate the robot trajectory and mug poses. Fig. 9(a) presents the simulation environment.

A synthetic pose estimator is created to detect ambiguous poses due to occlusion. As the robot approaches a mug, the handle of the mug may not be visible to the camera. In the handle-occluded case, as shown in Fig. 9(b), a relative pose measurement with three hypotheses will be added to the data. Two of them are spurious poses, derived from the ground truth and corrupted by pre-determined rotations (30 degrees around the vertical axis). If the handle is visible to the camera, the relative pose measurement will comprise a single hypothesis synthesizing the ground truth pose with noise. In total, there are 857 time steps along the robot trajectory. 148 of them are associated with handle-visible mug detections while 119 of them have handle-occluded mug detections (see Fig. 9(c)). With the ambiguity-aware pose estimator, we are able to create a feature-sparse, ambiguity-rich scenario to examine different algorithms.
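A synthetic estimator of this kind can be sketched in a few lines. The version below is an illustration under stated assumptions: only the mug heading about the vertical axis is modeled, the two spurious hypotheses are placed at plus and minus the offset angle, and the noise level and function name are hypothetical.

```python
import math
import random


def synthetic_mug_measurement(true_yaw, handle_visible, noise_std=0.01,
                              offset_deg=30.0, rng=None):
    """Return a list of heading hypotheses for one mug detection.

    Handle visible: a single noisy hypothesis.
    Handle occluded: the noisy hypothesis plus two spurious ones rotated
    about the vertical axis (assumed here to be +offset and -offset).
    """
    rng = rng or random.Random(0)
    noisy = true_yaw + rng.gauss(0.0, noise_std)
    if handle_visible:
        return [noisy]
    offset = math.radians(offset_deg)
    return [noisy, noisy + offset, noisy - offset]


visible = synthetic_mug_measurement(0.3, handle_visible=True)
occluded = synthetic_mug_measurement(0.3, handle_visible=False)
```

Each returned list corresponds to one measurement: a one-element list can be treated as an ordinary uni-modal factor, while a three-element list would become a three-component max-mixture factor.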

Fig. 9: (a) Simulation environment, (b) synthetic poses, and (c) the number of hypotheses w.r.t time step. Mugs are scaled for better rendering. The coordinates denote the starting point of the robot. Arrows in (b) represent the directions of mug handle in different hypotheses. In (c), a handle-visible detection generates one hypothesis of pose while a handle-occluded detection leads to three hypotheses.
Fig. 10: Estimates by different methods: (a) single hypothesis, (b) MH-iSAM2, (c) max-mixtures, and (d) max-mixtures with dynamic re-initialization. Dots in the figure represent mugs. Ground truth robot trajectory and mug positions are in black.

Fig. 10 shows the estimates at the final time step by four methods: single-hypothesis, max-mixtures, MH-iSAM2, and max-mixtures with dynamic re-initialization. For MH-iSAM2, we use an existing implementation with the default threshold on the number of tracked hypotheses set to 20, and adopt the solution from the most probable hypothesis. Single-hypothesis refers to randomly admitting one of the hypotheses from each relative pose measurement, which is similar to the scenario with no awareness of ambiguity. It is not surprising that single-hypothesis causes large deviations in the robot trajectory and mug positions, since some factors in the MAP formulation purely consist of spurious hypotheses. Even though both max-mixtures and MH-iSAM2 include all the hypotheses in the MAP problem, their estimated trajectories are still considerably different from the ground truth (see Fig. 10). The posterior distribution can be highly multi-modal when all the hypotheses are considered, so the max-mixtures estimate can be trapped in local optima due to bad initial values. MH-iSAM2 is relatively more robust to initial values as it tracks multiple solutions; however, the hypothesis set of the optimal estimate may be pruned even at a very early stage of the trajectory, eventually resulting in inaccurate estimates.

Intermediate estimates as the robot proceeds are compared to reveal the evolution of estimate errors. Fig. 11 shows that max-mixtures with re-initialization maintains relatively low estimate errors during the whole task. The re-initialization operations are reflected by the spikes in the average rotation error (see Fig. 11(d)). If the first glance of a mug is handle-occluded, then the initial value of the mug pose can be assigned from a spurious hypothesis, which increases the rotation error sharply. As the robot shortly moves to where the handle is visible, our dynamic re-initialization algorithm (Alg. 1) can utilize the ambiguity-free measurements there to re-initialize the pose variable of that mug to a good value. Hence, the estimate error decreases immediately after the rise, resulting in spikes in the error plot. Table I shows the run time of the different methods for processing the simulation data. Since MH-iSAM2 essentially solves a number of point-estimate problems, it is expected to be slower than the others. It is a bit surprising that max-mixtures with dynamic re-initialization is faster than the others, as re-initialization is an extra step; however, its improved efficiency still makes sense considering that better initial values can reduce the search time in optimization.

Method               Run time (sec)
Single-hypo.         5.64
Max-mix.             5.24
MH-iSAM2             15.78
Max-mix. (re-init)   5.10
TABLE I: Run time for solving the simulation sequence.

Fig. 11: Evolution of estimate errors: (a), (b) robot poses; (c), (d) landmark poses.

VI Conclusions

We have presented an ambiguity-tolerant object SLAM inference method that can attain improved localization and mapping performance in feature-sparse, ambiguity-rich environments. Our experiments show that considerable error can arise in the point estimate if pose ambiguities are not considered in the MAP estimation. The proposed max-mixture method succeeds under unknown data associations and object pose ambiguities in information-scarce scenes. We have also shown that multi-modality-induced local convergence can be mitigated by dynamic re-initialization. In our simulation experiment, the heuristic-augmented max-mixture method outperforms the compared ambiguity-resolving SLAM algorithms.

Our future work involves generalizing pose representations for objects possessing a continuum of pose hypotheses (e.g. a mug). This requires combining pose distribution inference (e.g. [20]) and more general uncertainty quantification beyond multi-modality. Further, the assumed prior knowledge about object pose hypotheses can be inferred via an ambiguity-detecting multi-hypothesis pose estimator.


Appendix A Runtime Performance Case Study of Landmark Re-Init

Is landmark re-init an over-treatment, given that a batch optimization step naturally supports full re-initialization? To address this question, we conduct a case study of the runtime performance of landmark re-init. We build a factor graph with 1000 robot pose variables connected in series. A number of landmark pose variables are connected with them through randomly generated edges. The factor generation and variable initialization are randomized. We vary the numbers of landmarks and edges to carry out a series of runtime tests.
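The test topology described above can be generated with a short sketch. It only builds the connectivity (a chain of poses plus random pose-to-landmark edges) and counts how many factors a single landmark re-init would remove and re-add. The landmark and edge counts are illustrative, and the count is a rough proxy: the actual iSAM2 cost also depends on how the change propagates through the Bayes tree.

```python
import random


def build_topology(num_poses, num_landmarks, num_edges, seed=0):
    """Return (odometry edges, observation edges) as lists of index pairs."""
    rng = random.Random(seed)
    odometry = [(i, i + 1) for i in range(num_poses - 1)]
    observations = [(rng.randrange(num_poses), rng.randrange(num_landmarks))
                    for _ in range(num_edges)]
    return odometry, observations


def reinit_footprint(observations, landmark):
    """Number of factors directly removed and re-added when re-initializing one landmark."""
    return sum(1 for _, observed in observations if observed == landmark)


odometry, observations = build_topology(num_poses=1000, num_landmarks=50,
                                        num_edges=500)
touched = reinit_footprint(observations, landmark=0)
total_factors = len(odometry) + len(observations)
```

A batch step would rebuild all `total_factors` factors, whereas the re-init step directly touches only the handful attached to the chosen landmark, as long as the graph is sparsely connected.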

We solve the optimizations with iSAM2 [13] and a batch optimizer (Gauss-Newton). At the last step of iSAM2 optimization, a random landmark variable is re-initialized. We time both the normal iSAM2 steps and the last re-init iSAM2 step for evaluation. The last batch optimization step is also timed for a fair comparison with the re-init iSAM2 step. We repeat the optimizations 10 times for each parameter combination, and average the time costs.

In summary, the steps in Fig. 12 respectively incorporate:

  • iSAM2 step: factor graph update + estimate computation

  • iSAM2 step w/ re-init: landmark re-init + iSAM2 step

  • batch step: factor graph re-construction + estimate computation

Fig. 12: Runtime performance of landmark re-init evaluated via a factor graph optimization case study. The numbers of landmarks and edges are varied to showcase how the step-wise time cost scales.

We learn from the results that an occasional re-init iSAM2 step does not bring about too much overhead as long as the factor graph is not densely connected. In the object-based SLAM context we seldom encounter the scenarios where a large number of objects are concurrently observed at all times. That said, we can conclude that in most cases iSAM2 with re-init is much more efficient than a batch update.


  • [1] Y. Bar-Shalom, T. E. Fortmann, and P. G. Cable (1990) Tracking and data association. Acoustical Society of America. Cited by: §II-B.
  • [2] B. Calli, A. Singh, J. Bruce, A. Walsman, K. Konolige, S. Srinivasa, P. Abbeel, and A. M. Dollar (2017) Yale-cmu-berkeley dataset for robotic manipulation research. The International Journal of Robotics Research 36 (3), pp. 261–268. Cited by: §V-B.
  • [3] J. Civera, D. Gálvez-López, L. Riazuelo, J. D. Tardós, and J. M. M. Montiel (2011) Towards semantic slam using a monocular camera. In 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1277–1284. Cited by: §II-C.
  • [4] E. Corona, K. Kundu, and S. Fidler (2018) Pose estimation for objects with rotational symmetry. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 7215–7222. Cited by: §I, §II-A.
  • [5] F. Dellaert (2012) Factor graphs and gtsam: a hands-on introduction. Technical report Georgia Institute of Technology. Cited by: §V.
  • [6] X. Deng, A. Mousavian, Y. Xiang, F. Xia, T. Bretl, and D. Fox (2019) Poserbpf: a rao-blackwellized particle filter for 6d object pose tracking. arXiv preprint arXiv:1905.09304. Cited by: §I, §II-A.
  • [7] Unreal engine. Cited by: §V-B.
  • [8] FLIR Blackfly camera. Cited by: §V-A.
  • [9] D. Gálvez-López, M. Salas, J. D. Tardós, and J. Montiel (2016) Real-time monocular object slam. Robotics and Autonomous Systems 75, pp. 435–449. Cited by: §II-C.
  • [10] B. Grossmann and V. Krüger (2017) Fast view-based pose estimation of industrial objects in point clouds using a particle filter with an icp-based motion model. In 2017 IEEE 15th International Conference on Industrial Informatics (INDIN), pp. 331–338. Cited by: §I, §II-A.
  • [11] M. Hsiao and M. Kaess (2019) MH-isam2: multi-hypothesis isam using bayes tree and hypo-tree. In 2019 International Conference on Robotics and Automation (ICRA), pp. 1274–1280. Cited by: §II-B.
  • [12] G. Huang, M. Kaess, J. J. Leonard, and S. I. Roumeliotis (2013) Analytically-selected multi-hypothesis incremental map estimation. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6481–6485. Cited by: §II-B.
  • [13] M. Kaess, H. Johannsson, R. Roberts, V. Ila, J. J. Leonard, and F. Dellaert (2012) ISAM2: incremental smoothing and mapping using the bayes tree. The International Journal of Robotics Research 31 (2), pp. 216–235. Cited by: §I, §IV-A, §IV-B2, §IV-B2, §V, §VI-A.
  • [14] W. Kehl, F. Manhardt, F. Tombari, S. Ilic, and N. Navab (2017) Ssd-6d: making rgb-based 3d detection and 6d pose estimation great again. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1521–1529. Cited by: §II-A.
  • [15] T. Kronhamn (1998) Bearings-only target motion analysis based on a multihypothesis kalman filter and adaptive ownship motion control. IEE Proceedings-Radar, Sonar and Navigation 145 (4), pp. 247–252. Cited by: §II-B.
  • [16] D. G. Lowe (2004) Distinctive image features from scale-invariant keypoints. International journal of computer vision 60 (2), pp. 91–110. Cited by: §V-A.
  • [17] F. Manhardt, D. M. Arroyo, C. Rupprecht, B. Busam, T. Birdal, N. Navab, and F. Tombari (2019) Explaining the ambiguity of object detection and 6d pose from visual data. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6841–6850. Cited by: §I, §II-A.
  • [18] MarineRoboticsGroup. MarineRoboticsGroup/swarmrobot. Cited by: §V-A.
  • [19] Z. C. Márton, S. Türker, C. Rink, M. Brucker, S. Kriegel, T. Bodenmüller, and S. Riedel (2018) Improving object orientation estimates by considering multiple viewpoints. Autonomous Robots 42 (2), pp. 423–442. Cited by: §I, §II-A.
  • [20] B. Okorn, M. Xu, M. Hebert, and D. Held (2020) Learning orientation distributions for object pose estimation. arXiv preprint arXiv:2007.01418. Cited by: §I, §II-A, §VI.
  • [21] E. Olson and P. Agarwal (2013) Inference on networks of mixtures for robust robot mapping. The International Journal of Robotics Research 32 (7), pp. 826–840. Cited by: §I, §II-B, §IV-A.
  • [22] M. Quigley, J. Faust, T. Foote, and J. Leibs. ROS: an open-source robot operating system. Cited by: §V.
  • [23] M. Rad and V. Lepetit (2017) BB8: a scalable, accurate, robust to partial occlusion method for predicting the 3d poses of challenging objects without using depth. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3828–3836. Cited by: §II-A.
  • [24] Real time pose estimation of a textured object. Cited by: §V-A.
  • [25] R. F. Salas-Moreno, R. A. Newcombe, H. Strasdat, P. H. Kelly, and A. J. Davison (2013) Slam++: simultaneous localisation and mapping at the level of objects. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1352–1359. Cited by: §II-C.
  • [26] S. Shah, D. Dey, C. Lovett, and A. Kapoor (2018) Airsim: high-fidelity visual and physical simulation for autonomous vehicles. In Field and service robotics, pp. 621–635. Cited by: §V-B.
  • [27] StereoLabs zed camera. Cited by: §V-A.
  • [28] E. Sucar, K. Wada, and A. Davison (2020) NodeSLAM: neural object descriptors for multi-view shape reconstruction. In 2020 International Conference on 3D Vision (3DV), pp. 949–958. Cited by: §II-C.
  • [29] M. Sundermeyer, Z. Marton, M. Durner, and R. Triebel (2020) Augmented autoencoders: implicit 3d orientation learning for 6d object detection. International Journal of Computer Vision 128 (3), pp. 714–729. Cited by: §I, §II-A.
  • [30] Vicon motion capture system. Cited by: §V-A.
  • [31] J. Wang and E. Olson (2014) Robust pose graph optimization using stochastic gradient descent. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 4284–4289. Cited by: §II-B.
  • [32] B. Wen, C. Mitash, B. Ren, and K. E. Bekris (2020) Se (3)-tracknet: data-driven 6d pose tracking by calibrating image residuals in synthetic domains. arXiv preprint arXiv:2007.13866. Cited by: §I, §II-A.