I Introduction
Robot-environment interactions rely heavily on real-time, robust and high-level spatial understanding. In mobile manipulation, for instance, the robot is required to reliably self-localize and pinpoint the target object for safe navigation and interaction. Building a consistent object-level world representation would greatly facilitate the process. Without external assistance and abundant visual features, the robot is expected to map the environment using an object-based SLAM system.
Recent progress in machine-learning-based object recognition and pose estimation techniques has spurred the development of object-based SLAM. Using the learned perception models, the robot gains higher-level environment understanding by leveraging semantic and object pose information. However, ambiguities in object poses, induced by symmetry or occlusions, can cause unexpected uncertainties in pose estimates. The computer vision community has developed various ambiguity-resolving pose inference algorithms [17, 29, 20, 4, 10, 19, 6, 32] to address the difficulty. A SLAM optimization framework is required to tolerate and synthesize the potentially ambiguous local pose measurements and efficiently recover a robust global representation.
For instance, if a coffee mug is viewed from a vantage point where the handle is not visible, the mug pose can hardly be uniquely constrained (Fig. 1). An ambiguous observation of a coffee mug would therefore result in multiple, and even infinitely many, possible interpretations of the robot and mug poses. Ambiguity-oblivious single-hypothesis SLAM inference may fail quietly in this circumstance.
In this work, we use a discrete, multi-hypothesis pose representation to quantify the uncertainties in ambiguous objects’ 6D pose measurements. We keep track of all the pose hypotheses over time and gradually recover the robot and object poses. The challenge is that the total number of hypotheses grows exponentially with the number of ambiguous measurements. An exhaustive global search over the hypothesis space entails exorbitant time and memory.
To alleviate the complexity, we represent all hypotheses as Gaussian max-mixtures, and cast the problem as a continuous, albeit multimodal, factor graph optimization [21]. High computational efficiency is achieved with Gaussian-preserving max-mixtures amenable to nonlinear least-squares optimization. However, the initialization-sensitive gradient-based optimizers can easily be trapped in local optima without global awareness of the modes in the posterior.
We propose a consensus-based heuristic to augment local optimization (iSAM2 [13]) with global understanding of the dominant pose hypothesis. The most time-consistent set of object pose estimates is extracted from measurements to inform the optimization of the currently dominant mode. Once the current mode loses superiority, we perform low-overhead factor graph surgery to change the landmark initialization and restart the optimization inside a new dominant mode. This “dynamic re-initialization (re-init)” procedure robustifies the efficient optimization over mixtures with the flexibility of adjusting mode selections.
The proposed algorithm provides a robust, real-time solution for robot localization and mapping in feature-sparse, ambiguity-rich environments. Two SLAM experiments with ambiguous objects are conducted to demonstrate its improved estimation performance.
II Related Work
II-A Ambiguity-Aware Pose Estimation
Object pose estimation considering symmetry or ambiguity is becoming increasingly popular in computer vision and robotics. A natural idea is to represent symmetry in an unambiguous manner (e.g. [23, 14]). Rad and Lepetit [23] restrict the pose of a rotationally symmetric object to an unambiguous range. Any pose measurement beyond the range is returned as its counterpart within the range.
Another popular class of methods makes multi-hypothesis pose predictions via view matching. The visual appearances of an object from arbitrary viewpoints are compared against the captured image. Multiple poses are returned when there exist one-to-many view-pose correspondences. Manhardt et al. [17] retrieve 6D poses for ambiguous objects based on a learned multi-hypothesis view-pose mapping. Sundermeyer et al. [29] encode visual appearances using an autoencoder and perform similarity checks in the feature space.
Other ambiguity-resolving strategies include regressing pose distributions [20] and training symmetry predictors [4]. Beyond that, multi-shot pose estimators have been developed to gradually recover ambiguous object poses. The predicted poses are tracked and updated from successive observations by non-Gaussian filters [10, 19, 6] or a learned tracking network [32].
Our primary interest is to tolerate and utilize the local, potentially ambiguous, object pose measurements to build a globally consistent world representation.
II-B Multi-Modal SLAM Inference
We review the three most relevant methods for multimodal inference under measurement ambiguity:
A natural solution would be to select the most likely hypothesis for each ambiguous measurement and solve a unimodal inference problem, which we refer to as the single-hypothesis method (e.g. nearest neighbor for data association [1]). With the loss of information in discarded hypotheses, it can fail quietly with noise- or outlier-corrupted measurements.
Multi-hypothesis tracking (MHT) based methods [15, 12, 11] maintain the most probable hypotheses and solve the corresponding set of unimodal inference problems. BiMAP [12] is the first smoothing-based multi-hypothesis SLAM pipeline, where probable hypotheses are selected analytically. MH-iSAM2 [11] utilizes the Bayes tree and Hypo-tree to perform efficient incremental multimodal inference with rule-based hypothesis pruning. MHT-based methods provide multiple state estimates including the temporary optimum. The global optimum is achieved as long as enough hypotheses are tracked. However, the rapid hypothesis accumulation necessitates hypothesis pruning. Designing an optimality-guaranteed pruning strategy remains challenging.
The max-mixtures model [21] provides an efficient solution to point estimates in multimodal inference. However, optimization over max-mixtures can be trapped in a local minimum with a poor initialization. Wang and Olson [31] propose to bootstrap the max-mixtures model with stochastic gradient descent (SGD) and obtain increased robustness to poor initializations on pose graph optimization. However, SGD is sensitive to the learning rate and lacks global understanding of dominant modes. Instead of local search, we use the consensus of pose hypotheses to guide optimization towards the global optimum, to obtain a robust and efficient solution to pose-ambiguity-corrupted SLAM.
II-C Object SLAM
SLAM systems utilizing object 6D pose representations (e.g. [3, 9, 25, 28]) have great promise for robot-object interactions. However, the initialization of landmark poses, especially for ambiguous objects, has been a challenging but under-examined problem. One way of handling ambiguities in single-shot observations is to delay object registration until robust 6D pose estimates are available (e.g. [3, 9]). Another is to robustify the one-shot pose estimates. NodeSLAM [28], for instance, trains a coffee mug rotation estimator for reliable mug orientation initialization.
Instead of relying on the first measurement to provide satisfactory initialization or delaying object registrations, we enable the flexibility to adjust the landmark initializations during the incremental multimodal inference process.
III Object-Based SLAM with Pose Ambiguities
We define the object-based SLAM problem as the joint inference of 6D robot poses and object landmark poses from a series of measurements consisting of odometry and landmark measurements, each of which is a relative pose measurement from one pose variable to another. Each object in the world is assumed to be static. Without measurement ambiguity, the inference can be modeled as a maximum a posteriori (MAP) estimation problem:
(1) 
In the real world, however, a robot will inevitably fail at times to uniquely perceive the 6D pose of an object. Ambiguous objects, such as a coffee mug with its handle occluded (Fig. 1) or a centrally symmetric playing card, do not have unique pose representations. Furthermore, the perception models can also output highly uncertain or redundant pose predictions due to algorithmic failures or challenging scenes.
Instead of assuming perfect measurements, we use multiple discrete pose hypotheses to represent all possibilities. A symmetric playing card, for example, possesses two hypotheses in a single-shot observation. It is not overkill to fully represent the two visually indistinguishable poses: ignoring them can confuse robot localization, as opposite viewpoints yield identical card appearance.
Therefore, we also have to infer the true hypothesis in each ambiguous measurement. A discrete “hypothesis decision” variable is introduced into the optimization for each ambiguous measurement; its value indicates which hypothesis in that measurement is the true pose of the observed landmark. Accordingly, the full MAP estimation can be formulated as:
(2) 
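For concreteness, the two MAP problems numbered (1) and (2) can be written out as in the sketch below. The symbols are illustrative assumptions rather than the original notation: $X$ and $L$ denote the sets of robot and landmark poses, $Z$ the set of odometry and landmark measurements, and $H$ the set of discrete hypothesis decision variables.

```latex
% Assumed notation: X robot poses, L landmark poses, Z measurements,
% H hypothesis decision variables (one per ambiguous measurement).
\begin{align}
\hat{X}, \hat{L} &= \operatorname*{arg\,max}_{X,\,L} \; p(X, L \mid Z), \tag{1} \\
\hat{X}, \hat{L}, \hat{H} &= \operatorname*{arg\,max}_{X,\,L,\,H} \; p(X, L, H \mid Z). \tag{2}
\end{align}
```

The only difference between the two is that (2) adds the discrete decisions to the set of unknowns, which is what makes the search space combinatorial.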
Unfortunately, the number of possible hypothesis assignments grows exponentially with the number of ambiguous measurements during navigation (Fig. 2). An exhaustive search over the exploding hypothesis space imposes great computational cost. We propose to implicitly model the multimodal uncertainty as Gaussian max-mixtures.
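The growth is easy to quantify: if every ambiguous measurement is resolved independently, the number of joint assignments is the product of the per-measurement hypothesis counts. The short sketch below illustrates this; the function name and the example counts are hypothetical.

```python
from math import prod

def num_joint_assignments(hypotheses_per_measurement):
    """Number of joint hypothesis assignments an exhaustive search must consider."""
    return prod(hypotheses_per_measurement)

# Ten ambiguous card observations with two hypotheses each.
print(num_joint_assignments([2] * 10))  # 1024
```

Every additional two-hypothesis measurement doubles the count, which is why the exhaustive search is replaced by an implicit mixture model.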
IV Max-Mixtures Method with Dynamic Re-Init
IV-A Max-Mixtures Model
While max-mixtures was initially introduced as an approximation of sum-mixtures, we show that it can be independently derived by variable elimination on the MAP problem. With the robot and object landmark poses being of key concern in SLAM, we can marginalize out the hypothesis selections from (2) with the max-product algorithm:
(3) 
Under the Bayes rule, the joint posterior in (3) can be factored as:
(4) 
with independent joint priors and the irrelevant priors taken out of the max operator.
Assuming the measurements are independent, the joint probabilities can be decomposed as the product of factors (prior factors omitted):
(5) 
where the first group of factors represents odometry, which is independent of the pose hypotheses, and the second group corresponds to landmark pose measurements, each of which, following the joint probabilities in (4), takes the form:
(6) 
Swapping the max and product operators in front of the landmark factors in (5) yields the final form of the landmark measurement factor:
(7) 
where the maximization over all hypothesis decisions reduces to a maximization over the single decision variable of each measurement.
Assuming the measurement model for each pose hypothesis is Gaussian, we obtain a max-mixture type factor, in which each component is the Gaussian observation likelihood of one hypothesis, scaled by its discrete hypothesis weight. The max operator selects the best hypothesis given the latent variables.
Finally, the formulation for the maxmixtures model can be expressed as:
(8) 
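The chain of steps numbered (3) to (8) can be sketched in LaTeX as follows. This is a reconstruction from the surrounding prose, not the original typesetting, and all symbols are assumptions: $X$, $L$, $Z$ and $H = \{h_k\}$ are the robot poses, landmark poses, measurements and hypothesis decisions; $\phi_t$ are odometry factors; $\psi_k$ are landmark factors; $z_{ki}$, $w_{ki}$ and $\Sigma_{ki}$ are the $i$-th hypothesis of landmark measurement $k$, its weight and its covariance; and $\hat{z}_k(X, L)$ is the predicted relative pose.

```latex
\begin{align}
\hat{X}, \hat{L}
  &= \operatorname*{arg\,max}_{X,L} \, \max_{H} \; p(X, L, H \mid Z) \tag{3} \\
  &= \operatorname*{arg\,max}_{X,L} \; p(X, L) \, \max_{H} \; p(Z \mid X, L, H)\, p(H) \tag{4} \\
  &= \operatorname*{arg\,max}_{X,L} \; \prod_{t} \phi_t(X) \, \max_{H} \prod_{k} \psi_k(X, L, h_k), \tag{5} \\
\psi_k(X, L, h_k)
  &= p(z_k \mid X, L, h_k)\, p(h_k), \tag{6} \\
\max_{H} \prod_{k} \psi_k(X, L, h_k)
  &= \prod_{k} \max_{h_k} \; p(z_k \mid X, L, h_k)\, p(h_k), \tag{7} \\
\hat{X}, \hat{L}
  &= \operatorname*{arg\,max}_{X,L} \; \prod_{t} \phi_t(X) \prod_{k} \max_{i} \;
     w_{ki}\, \mathcal{N}\!\big(z_{ki};\, \hat{z}_k(X, L),\, \Sigma_{ki}\big). \tag{8}
\end{align}
```

The swap in (7) is valid because each landmark factor depends on exactly one decision variable, so the joint maximization separates into independent per-measurement maximizations.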
With Gaussianity locally preserved by the selected mixture components, the max-mixtures model is amenable to efficient solution by nonlinear least-squares optimization. In this study, we use the incremental SLAM framework iSAM2 [13] to solve (8) for better scalability.
Meanwhile, after hypothesis decisions, other sub-dominant modes are still kept alive in the latent space. The best components are re-evaluated at each optimization step [21]. Therefore, all the pose hypotheses are implicitly tracked forward, and the exhaustive search over the hypothesis space is replaced by local hypothesis selections.
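The per-step component selection can be illustrated with a deliberately simplified sketch: a scalar yaw instead of a full 6D pose, world-frame hypotheses, and a hypothetical function name `max_mixture_nll`. It evaluates the negative log of a max-mixture factor at a given linearization point and reports which component is selected; it is not the authors' implementation.

```python
import numpy as np

def wrap(angle):
    """Wrap an angle (radians) into [-pi, pi]."""
    return np.arctan2(np.sin(angle), np.cos(angle))

def max_mixture_nll(theta, means, sigmas, weights):
    """Negative log of max_i w_i * N(theta; mean_i, sigma_i^2) for a scalar yaw.

    Returns the index of the selected component and its cost.
    """
    means, sigmas, weights = (np.asarray(a, dtype=float) for a in (means, sigmas, weights))
    r = wrap(theta - means) / sigmas                      # whitened residuals
    nll = 0.5 * r**2 + np.log(sigmas * np.sqrt(2.0 * np.pi)) - np.log(weights)
    best = int(np.argmin(nll))                            # max likelihood = min cost
    return best, float(nll[best])

# Two card-like yaw hypotheses half a turn apart, equally weighted.
means, sigmas, weights = [0.0, np.pi], [0.1, 0.1], [0.5, 0.5]
for theta0 in (0.2, 3.0):
    idx, cost = max_mixture_nll(theta0, means, sigmas, weights)
    print(f"linearization point {theta0:.1f} rad selects component {idx}")
```

Because the selected component depends on the current estimate, a gradient-based solver started near one mean stays in that component's basin, which is the initialization sensitivity discussed next.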
However, the efficiency of the max-mixtures model is achieved at the expense of potential local optimality. As the number of ambiguous observations increases, modes accumulate rapidly in the posterior distribution. Without a good initial value, the solution can easily get trapped in an incorrect mode. Therefore, we propose to use consensus over pose hypotheses to directly guide optimization over max-mixtures into the global optimum.
IV-B Dynamic Re-initialization
A good initial value is typically unachievable for an ambiguous landmark variable. As the first one-shot object observation has multiple hypotheses, it is usually hard to determine which one represents the real pose. Even worse, a sub-dominant pose hypothesis could become dominant later. An arbitrary or a temporarily optimal initialization may trap the solution in a local minimum. In general, it is difficult for the solution to escape the incorrect mode and converge to the global optimum.
Therefore, we perform dynamic landmark re-initialization, forcing some variables in the least-squares optimization to “restart” inside the dominant mode. The procedure is summarized in Alg. 1.
IV-B1 Dominant Hypothesis Inference
We infer the dominant pose hypothesis for each landmark to inform the optimization of the globally consistent mode.
Since the absolute (world frame) poses of static landmarks are time-invariant, only true pose hypotheses appear consistently in measurements. False and outlier hypotheses typically appear unsteadily over time. Hence, the true hypotheses will gradually build up a leading cluster in our cache of poses. We extract this consistent set of pose hypotheses and average them to obtain the initial value. An outlier-robust pose averaging method (Fig. 3, Alg. 2) is employed to separate the dominant from the sub-dominant hypotheses. If there is a significant change in the average pose, landmark re-init (Fig. 4, Alg. 3) is triggered to guide the optimization into the new dominant mode.
To minimize parameter tuning effort, we determine the re-init and RANSAC distance thresholds (in Algs. 1-2) according to the mutual distances between the hypotheses in multi-hypothesis measurements, which indicate the true-to-false hypothesis distances. The minimum mutual distance is used to compute and update these thresholds.
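The consensus step can be sketched on scalar yaw angles as below. This is a minimal illustration under stated assumptions, not Alg. 2 itself: it replaces random sampling with an exhaustive scan over the cached hypotheses, uses a circular mean in place of full 6D pose averaging, and sets both thresholds to half of the minimum mutual distance, a ratio chosen here purely for illustration. The names `dominant_yaw`, `needs_reinit`, `d_consensus` and `d_reinit` are hypothetical, and the cache is assumed to be non-empty.

```python
import numpy as np

def wrap(angle):
    """Wrap an angle (radians) into [-pi, pi]."""
    return np.arctan2(np.sin(angle), np.cos(angle))

def dominant_yaw(cache, d_consensus):
    """Return the circular mean of the largest consistent cluster and its size."""
    cache = np.asarray(cache, dtype=float)
    best = np.zeros(len(cache), dtype=bool)
    for candidate in cache:                       # exhaustive stand-in for random sampling
        inliers = np.abs(wrap(cache - candidate)) < d_consensus
        if inliers.sum() > best.sum():
            best = inliers
    cluster = cache[best]
    mean = np.arctan2(np.sin(cluster).mean(), np.cos(cluster).mean())
    return float(mean), int(best.sum())

def needs_reinit(current_init, dominant, d_reinit):
    """Trigger re-init when the dominant mode has moved away from the current initial value."""
    return bool(np.abs(wrap(dominant - current_init)) > d_reinit)

# World-frame yaw hypotheses cached for one landmark: the first observation was
# ambiguous (three hypotheses 30 degrees apart), later ones were unambiguous.
d_min = np.deg2rad(30.0)                 # minimum mutual hypothesis distance
d_consensus = d_reinit = 0.5 * d_min     # assumed ratio, for illustration only
cache = [1.02, 0.50, -0.02, 0.51, 0.49, 0.52]
yaw, support = dominant_yaw(cache, d_consensus)
print(support, needs_reinit(1.02, yaw, d_reinit))
```

In this example the landmark was initialized from the spurious hypothesis at 1.02 rad; once four consistent hypotheses near 0.5 rad accumulate, the dominant estimate moves by more than the threshold and a re-init is requested.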
IV-B2 Landmark Re-initialization
As mentioned above, the SLAM variables are estimated using the iSAM2 algorithm [13]. In iSAM2, a landmark variable is initialized at its first observation. Unfortunately, a ready-made, one-stop option to adjust the initialization is not available. A new function is required to fix incorrect mode selections for already initialized ambiguous landmarks.
Fig. 4 visualizes an example of landmark re-init, realized by performing local “surgery” on the factor graph. As illustrated, the new measurement disambiguates the landmark, after which the right mode outweighs the left one. To guide the solution from the left into the right mode, the landmark is re-initialized by removing and re-adding the landmark variable with its neighboring factors. As a result, the optimization is restarted from a new linearization point within the right mode. Any temporarily optimal mode selections in the old max-mixture factors are corrected. The solution for the landmark converges to the global minimum.
Under the hood, iSAM2 performs incremental local updates for only the variables affected by new measurements [13]. Provided that the factor graph is not densely connected, the influence of the removing and re-adding steps is rather local. Therefore, in most cases, the re-init step does not introduce much overhead, preserving the real-time performance. This is also validated by a quantitative case study in Appendix A.
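The bookkeeping behind the remove-and-re-add step can be sketched with a toy container. `ToyGraph` and `reinit_landmark` are hypothetical names invented for this illustration; they are not part of GTSAM or iSAM2 and perform no optimization. The sketch only shows which factors are touched and how the landmark's initial value is replaced.

```python
from dataclasses import dataclass, field

@dataclass
class ToyGraph:
    factors: dict = field(default_factory=dict)   # factor index -> tuple of variable keys
    values: dict = field(default_factory=dict)    # variable key -> linearization point
    _next: int = 0

    def add_factor(self, keys):
        """Store a factor over the given variable keys and return its index."""
        idx = self._next
        self.factors[idx] = tuple(keys)
        self._next += 1
        return idx

def reinit_landmark(graph, key, new_value):
    """Remove every factor touching `key`, reset its value, and re-add the factors."""
    touching = [i for i, keys in graph.factors.items() if key in keys]
    removed = [graph.factors.pop(i) for i in touching]
    graph.values[key] = new_value                  # new linearization point
    new_indices = [graph.add_factor(keys) for keys in removed]
    return touching, new_indices

g = ToyGraph()
g.values.update({"x0": 0.0, "x1": 1.0, "l0": 1.02})
g.add_factor(("x0", "x1"))     # odometry factor, untouched by the surgery
g.add_factor(("x0", "l0"))     # landmark factors neighboring l0
g.add_factor(("x1", "l0"))
print(reinit_landmark(g, "l0", 0.505))
```

Only the factors adjacent to the landmark are removed and re-added, which is why the cost stays local as long as the landmark has few neighbors.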
V Experimental Results
We design two object SLAM experiments in ambiguity-rich and feature-sparse environments to test our algorithm. Playing cards (symmetry-induced ambiguity) and coffee mugs (occlusion-induced ambiguity) are used as test objects. Our algorithm is developed in C++ using the iSAM2 implementation [13] in the GTSAM library [5]. We use ROS [22] for data collection and post-processing. We run all the tests on a 2.60 GHz Intel i7 CPU.
V-A Playing Cards SLAM Experiment
An object SLAM experiment, using playing cards as landmarks, is conducted with the SwarmRobot [18]. The robot is equipped with a forward-pointing ZED camera for visual odometry [27], and a downward-looking (30° to the ground) Blackfly [8] camera for image taking. As illustrated in Fig. 5a, 40 playing cards are placed on the ground in a 5×8 configuration. They are drawn from 22 classes (number+suit defines a class), where 4 cards are unique and the rest appear in pairs. The robot is remotely controlled to follow a lawnmower-patterned round-trip path. Each card is first observed from one view angle and revisited later from the opposite one. Identical card appearances in the two encounters create ambiguities in pose estimates.
In the roughly 10-min long test, 8522 odometric measurements (ZED odometry) and 17369 images are collected. The images with cards detected are illustrated in Fig. 5c. We utilize the odometry and estimated relative card poses to gradually recover a world map consisting of the robot trajectory and 6D card poses. The Vicon MoCap system [30] is employed to obtain the ground truth data.
A SIFT feature [16] based algorithm is developed using OpenCV for card identification and pose estimation [24]. It is able to return the two centrally symmetric card pose hypotheses (Fig. 5). We design a number of “keypose” criteria to filter out spurious pose measurements and use only keyposes in optimization. The measurement-card correspondences (data association) are inferred with the classic nearest neighbor approach. We adopt both the proposed max-mixture algorithm and a single-hypothesis system to solve the optimization. The latter incorporates a single-hypothesis estimator, which predicts the most likely card pose, and a unimodal backend optimizer.
As shown in Fig. 7, the single-hypothesis method fails catastrophically because it ignores the pose ambiguity. At the first card re-observation, i.e. JC at the bottom left of Fig. 7, the estimator returns a near-identical relative orientation compared to the card’s first encounter. However, the true relative orientation is reversed, with the robot viewing the card from the opposite viewpoint. As a consequence, the inconsistency in JC pose understanding leads to a large deviation in robot localization, which corresponds to the separation point in Fig. 8. Even worse, the localization failure also induces disastrous performance in data association. Several cards are registered more than twice in the world map.
In contrast, the proposed algorithm maintains every pose hypothesis and guarantees pose consistency. It correctly handles the decision making in data associations and true pose hypotheses. Loop closures at the JC re-observation and subsequent card revisits constantly adjust the trajectory and minimize estimation errors (Fig. 8). In the feature-scarce and ambiguity-rich environment, as Fig. 7 demonstrates, the max-mixture method still attains good localization and mapping performance.
V-B Simulation Experiment with Coffee Mugs
We conduct a simulation to compare different methods in a feature-sparse, ambiguity-rich scenario. A virtual environment is created using the Unreal Engine, in which ten mugs are placed as landmarks [7]. Some obstacles are configured around the mugs so that the mugs are frequently occluded. The mug model is adapted from the YCB data set and has been scaled up 50-fold to match the environment [2]. A mobile robot equipped with a monocular camera is integrated in the environment via the AirSim car simulator [26]. The robot odometry, relative poses to the mugs, and camera images are collected via AirSim to estimate the robot trajectory and mug poses. Fig. 9(a) presents the simulation environment.
A synthetic pose estimator is created to detect ambiguous poses due to occlusion. As the robot approaches a mug, the handle of the mug may not be visible to the camera. In the handle-occluded case, as shown in Fig. 9(b), a relative pose measurement with three hypotheses is added to the data. Two of them are spurious poses derived from the ground truth and corrupted by predetermined rotations (30 degrees around the vertical axis). If the handle is visible to the camera, the relative pose measurement comprises a single hypothesis synthesizing the ground truth pose with noise.
In total, there are 857 time steps along the robot trajectory. 148 of them are associated with handle-visible mug detections, while 119 of them have handle-occluded mug detections (see Fig. 9(c)). With the ambiguity-aware pose estimator, we are able to create a feature-sparse, ambiguity-rich scenario to examine different algorithms.
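The hypothesis generation of such a synthetic estimator can be sketched as follows. This is an illustration under assumptions, not the simulator code: only the rotation part of the pose is handled, measurement noise is omitted, the two spurious hypotheses are assumed to be offset by +30 and -30 degrees about the mug's vertical (z) axis, and the names `mug_hypotheses` and `relative_angle_deg` are hypothetical.

```python
import numpy as np

def rot_z(angle):
    """Rotation matrix for a rotation of `angle` radians about the z axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def mug_hypotheses(R_true, handle_visible, offset_deg=30.0):
    """Return one rotation hypothesis if the handle is visible, otherwise three."""
    if handle_visible:
        return [R_true]
    d = np.deg2rad(offset_deg)
    # Ground truth plus two spurious rotations about the vertical axis (assumed signs).
    return [R_true, R_true @ rot_z(d), R_true @ rot_z(-d)]

def relative_angle_deg(Ra, Rb):
    """Geodesic angle between two rotation matrices, in degrees."""
    cos_angle = (np.trace(Ra.T @ Rb) - 1.0) / 2.0
    return float(np.rad2deg(np.arccos(np.clip(cos_angle, -1.0, 1.0))))

hyps = mug_hypotheses(np.eye(3), handle_visible=False)
print(len(hyps), [round(relative_angle_deg(hyps[0], h), 1) for h in hyps])
```

The pairwise angles between hypotheses are what the threshold selection described earlier would use as mutual distances.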
Fig. 10 shows the estimates at the final time step by four methods: single-hypothesis, max-mixtures, MH-iSAM2 (see footnote 1), and max-mixtures with dynamic re-initialization. Single-hypothesis refers to randomly admitting one of the hypotheses from each relative pose measurement, which is similar to the scenario with no awareness of ambiguity. It is not surprising that single-hypothesis causes large deviations in the robot trajectory and mug positions, since some factors in the MAP formulation purely consist of spurious hypotheses. Even though both max-mixtures and MH-iSAM2 formulate all the hypotheses in the MAP problem, their estimated trajectories are still considerably different from the ground truth (see Fig. 10). The posterior distribution can be highly multimodal when all the hypotheses are considered, so the max-mixtures estimate can be trapped in local optima due to bad initial values. MH-iSAM2 is relatively more robust to initial values as it tracks multiple solutions; however, it is possible that the hypothesis set of the optimal estimate is pruned even at a very early stage of the trajectory, eventually resulting in inaccurate estimates.
Footnote 1: We use the MH-iSAM2 implementation at https://bitbucket.org/rpl_cmu/mhisam2_lib. The default threshold number of tracked hypotheses is set to 20. The solution from the most probable hypothesis is adopted.
Intermediate estimates as the robot proceeds are compared to reveal the evolution of estimate errors. Fig. 11 shows that max-mixtures with re-initialization maintains relatively low estimate errors during the whole task. The re-initialization operations are reflected by the spikes in the average rotation error (see Fig. 11(d)). If the first glance of a mug is handle-occluded, then the initial value of the mug pose can be assigned from a spurious hypothesis, which increases the rotation error sharply. As the robot shortly moves to where the handle is visible, our dynamic re-initialization algorithm (Alg. 1) can utilize the ambiguity-free measurements there to re-initialize the pose variable of that mug to a good value. Hence, the estimate error decreases immediately after the rise, resulting in spikes in the error plot. Table I shows the run time of the different methods for processing the simulation data. Since MH-iSAM2 essentially solves a number of point-estimate problems, it is expected to be slower than the others. It is a bit surprising that max-mixtures with dynamic re-initialization is faster than the others, as re-initialization is an extra step; however, its improved efficiency still makes sense considering that better initial values can reduce the search time in optimization.
Table I: Run time of each method on the simulation data.
Single-hypo. | Max-mix.  | MH-iSAM2  | Max-mix. (re-init)
5.64 sec     | 5.24 sec  | 15.78 sec | 5.10 sec
VI Conclusions
We have presented an ambiguity-tolerant object SLAM inference method that can attain improved localization and mapping performance in feature-sparse, ambiguity-rich environments. Our experiments show that considerable error can arise in the point estimate if pose ambiguities are not considered in the MAP estimation. The proposed max-mixture method succeeds under unknown data associations and object pose ambiguities in information-scarce scenes. We have also shown that multimodality-induced local convergence can be mitigated by dynamic re-initialization. Our experiments demonstrate that the heuristic-augmented max-mixture method outperforms state-of-the-art ambiguity-resolving SLAM algorithms.
Our future work involves generalizing pose representations for objects possessing a continuum of pose hypotheses (e.g. a mug). This requires combining pose distribution inference (e.g. [20]) and more general uncertainty quantification beyond multimodality. Further, the assumed prior knowledge about object pose hypotheses can be inferred via an ambiguity-detectable multi-hypothesis pose estimator.
Appendix
VI-A Runtime Performance Case Study of Landmark Re-Init
Is landmark re-init an overtreatment, given that a batch optimization step naturally supports full re-initialization? To address this question, we conduct a case study of the runtime performance of landmark re-init. We build a factor graph with 1000 robot pose variables connected in series. A number of landmark pose variables are connected with them through randomly generated edges. The factor generation and variable initialization are randomized. We vary the numbers of landmarks and edges to carry out a series of runtime tests.
We solve the optimizations with iSAM2 [13] and a batch optimizer (Gauss-Newton). At the last step of the iSAM2 optimization, a random landmark variable is re-initialized. We time both the normal iSAM2 steps and the last re-init iSAM2 step for evaluation. The last batch optimization step is also timed for a fair comparison with the re-init iSAM2 step. We repeat the optimizations 10 times for each parameter combination, and average the time costs.
In summary, the steps in Fig. 12 respectively incorporate:
- iSAM2 step: factor graph update + estimate computation
- iSAM2 step w/ re-init: landmark re-init + iSAM2 step
- batch step: factor graph reconstruction + estimate computation
We learn from the results that an occasional re-init iSAM2 step does not bring about too much overhead as long as the factor graph is not densely connected. In the object-based SLAM context, we seldom encounter scenarios where a large number of objects are concurrently observed at all times. That said, we can conclude that in most cases iSAM2 with re-init is much more efficient than a batch update.
References
[1] (1990) Tracking and data association. Acoustical Society of America.
[2] (2017) Yale-CMU-Berkeley dataset for robotic manipulation research. The International Journal of Robotics Research 36 (3), pp. 261–268.
[3] (2011) Towards semantic SLAM using a monocular camera. In 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1277–1284.
[4] (2018) Pose estimation for objects with rotational symmetry. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 7215–7222.
[5] (2012) Factor graphs and GTSAM: a hands-on introduction. Technical report, Georgia Institute of Technology.
[6] (2019) PoseRBPF: a Rao-Blackwellized particle filter for 6D object pose tracking. arXiv preprint arXiv:1905.09304.
[7] Unreal Engine.
[8] FLIR Blackfly camera.
[9] (2016) Real-time monocular object SLAM. Robotics and Autonomous Systems 75, pp. 435–449.
[10] (2017) Fast view-based pose estimation of industrial objects in point clouds using a particle filter with an ICP-based motion model. In 2017 IEEE 15th International Conference on Industrial Informatics (INDIN), pp. 331–338.
[11] (2019) MH-iSAM2: multi-hypothesis iSAM using Bayes tree and Hypo-tree. In 2019 International Conference on Robotics and Automation (ICRA), pp. 1274–1280.
[12] (2013) Analytically-selected multi-hypothesis incremental MAP estimation. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6481–6485.
[13] (2012) iSAM2: incremental smoothing and mapping using the Bayes tree. The International Journal of Robotics Research 31 (2), pp. 216–235.
[14] (2017) SSD-6D: making RGB-based 3D detection and 6D pose estimation great again. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1521–1529.
[15] (1998) Bearings-only target motion analysis based on a multi-hypothesis Kalman filter and adaptive ownship motion control. IEE Proceedings-Radar, Sonar and Navigation 145 (4), pp. 247–252.
[16] (2004) Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60 (2), pp. 91–110.
[17] (2019) Explaining the ambiguity of object detection and 6D pose from visual data. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6841–6850.
[18] MarineRoboticsGroup/swarmrobot.
[19] (2018) Improving object orientation estimates by considering multiple viewpoints. Autonomous Robots 42 (2), pp. 423–442.
[20] (2020) Learning orientation distributions for object pose estimation. arXiv preprint arXiv:2007.01418.
[21] (2013) Inference on networks of mixtures for robust robot mapping. The International Journal of Robotics Research 32 (7), pp. 826–840.
[22] ROS: an open-source Robot Operating System.
[23] (2017) BB8: a scalable, accurate, robust to partial occlusion method for predicting the 3D poses of challenging objects without using depth. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3828–3836.
[24] Real time pose estimation of a textured object.
[25] (2013) SLAM++: simultaneous localisation and mapping at the level of objects. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1352–1359.
[26] (2018) AirSim: high-fidelity visual and physical simulation for autonomous vehicles. In Field and Service Robotics, pp. 621–635.
[27] StereoLabs ZED camera.
[28] (2020) NodeSLAM: neural object descriptors for multi-view shape reconstruction. In 2020 International Conference on 3D Vision (3DV), pp. 949–958.
[29] (2020) Augmented autoencoders: implicit 3D orientation learning for 6D object detection. International Journal of Computer Vision 128 (3), pp. 714–729.
[30] Vicon motion capture system.
[31] (2014) Robust pose graph optimization using stochastic gradient descent. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 4284–4289.
[32] (2020) se(3)-TrackNet: data-driven 6D pose tracking by calibrating image residuals in synthetic domains. arXiv preprint arXiv:2007.13866.