
SLAM-Supported Self-Training for 6D Object Pose Estimation

by Ziqi Lu, et al.

Recent progress in learning-based object pose estimation paves the way for developing richer object-level world representations. However, the estimators, often trained with out-of-domain data, can suffer performance degradation when deployed in novel environments. To address the problem, we present a SLAM-supported self-training procedure to autonomously improve robot object pose estimation abilities during navigation. Combining the network predictions with robot odometry, we can build a consistent object-level environment map via pose graph optimization (PGO). Exploiting the state estimates from PGO, we pseudo-label robot-collected RGB images to fine-tune the pose estimators. Unfortunately, it is difficult to model the uncertainty of the estimator predictions, and the unmodeled uncertainty in the data used for PGO can result in low-quality object pose estimates. We therefore develop an automatic covariance tuning method for robust PGO that allows the measurement uncertainty models to change as part of the optimization process. The formulation permits a straightforward alternating minimization procedure that re-scales covariances analytically and component-wise, enabling more flexible noise modeling for learning-based measurements. We test our method with the deep object pose estimator (DOPE) on the YCB video dataset and in real-world robot experiments. The method achieves significant performance gains in pose estimation and, in return, facilitates the success of object SLAM.





Code repository: SLAM-Supported Semi-Supervised Learning for 6D Object Pose Estimation

I Introduction

State-of-the-art object 6D pose estimators can capture object-level geometric and semantic information in challenging scenes [20]. During robot navigation, an object-based simultaneous localization and mapping (object SLAM) system can use object pose predictions, together with robot odometry, to build a consistent object-level map (e.g. [21]). However, the estimators, mostly trained with out-of-domain data, often show degraded performance when deployed in novel environments. Annotating target-domain real images for training is tedious and limits the potential for autonomous operation. We propose to collect images during robot navigation, exploit the object SLAM estimates to pseudo-label the data, and fine-tune the pose estimator.

As depicted in Fig. 1, we develop a SLAM-supported self-training procedure for RGB-image-based object pose estimators. During navigation, the robot collects images and deploys a pre-trained model to infer object poses in the scene. Together with noisy robot state measurements from on-board odometric sensors (camera, IMU, lidar, etc.), a pose graph optimization (PGO) problem is formulated to optimize the camera (i.e. robot) and object poses. We leverage the state estimates to pseudo-label the images, generating new training data to fine-tune the initial model, and autonomously improve the robot’s object pose estimation ability.

A major challenge in this procedure is the difficulty of modeling the uncertainty of the learning-based pose measurements. In particular, it is difficult to specify a priori an appropriate (Gaussian) noise model for them as typically required in PGO. For this reason, rather than fixing a potentially poor choice of covariance model, we allow the uncertainty model to change as part of the optimization process. Similar approaches have been explored previously in the context of robust SLAM (e.g. [1, 17]). However, our joint optimization formulation permits a straightforward alternating minimization procedure, where the optimal covariance models are determined analytically. Moreover, our method realizes component-wise covariance re-scaling, allowing us to fit a richer class of noise models.

We tested our system with the deep object pose estimator (DOPE) [24] on the YCB video dataset [28] and in real-world robot experiments. The system is shown to generate high-quality pseudo labels. The fine-tuned networks show enhanced accuracy and significantly reduced outlier rates, which in return facilitates the success of SLAM algorithms.

In summary, our work makes the following contributions:

  • A SLAM-aided self-training procedure to pseudo-label robot-collected RGB images and boost object pose estimation performance during object-based navigation.

  • An automatic covariance tuning (ACT) method that is critical for the success and autonomous operation of the above procedure.

  • Experimental evaluation on the YCB-v dataset and a new robot-collected dataset.

Fig. 1: SLAM-supported self-training for object 6D pose estimators: (a) Train an initial pose estimator on (typically synthetic) labeled image data; (b) Infer object poses on the robot-collected RGB images; (c) Establish a pose graph with object pose predictions and odometric measurements; (d) Solve the PGO with the proposed automatic covariance tuning method (Sec. III.C); (e) Compare and combine the optimal object poses and inlier pose predictions to obtain pseudo ground truth poses; (f) Evaluate the poses to test their visual consistency with the RGB images, as proposed in [5]; (g) Fine-tune the initial estimator model with the pseudo-labeled data and the initial labeled data.

II Related Work

II-A Semi- and self-supervised 6D object pose estimation

Learning-based object pose estimators (e.g. [28, 24, 9]) typically require a large amount of training data to succeed, but it can be expensive to acquire high-quality pose-annotated images from real environments. One solution is to generate photo-realistic and domain-randomized synthetic image data for training (e.g. [14]), but bridging the sim-to-real gap is difficult.

Semi- and self-supervised methods have been developed to also exploit unlabeled real images to mitigate the lack of data. These methods typically train the model on synthetic data in a supervised manner and improve its performance on real images by semi- or self-supervised learning [30, 10, 27, 12, 22, 29, 32]. Many recent methods leverage differentiable rendering to develop end-to-end self-supervised pose estimation models, by encouraging similarity between real images and images rendered with estimated poses [27, 12, 22, 29]. Despite their success, these methods are not designed to leverage the video streams collected during robot tasks. Deng et al. [5], instead, collect RGB-D image streams with a robot arm during object manipulation for self-training. They use a pose initialization module to obtain accurate pose estimates in the first frame, and use motion priors from forward kinematics and a particle filter [4] to propagate the pose estimates forward in time.

We propose to collect and pseudo-label RGB images during robot navigation. Instead of frame-to-frame pose tracking, we directly estimate the 3D scene geometry which requires no accurate first-frame object poses or high-precision motion priors. We use only RGB cameras and optionally other odometric sensors, which are mostly available on a robot platform, for data collection and self-training.

II-B Multi-view object-based perception

Integrating information from multiple viewpoints can help correct the inaccurate and missed predictions made by single-frame object perception models. This idea is widely applied in image classification (e.g. [31]), object detection (e.g. [18]) and pose estimation (e.g. [4, 9]). Pillai et al. [18] develop a SLAM-aware object localization system. They leverage visual SLAM estimates, consisting of camera poses and feature locations, to support the object detection task, leading to strong improvements over single-frame methods.

Beyond that, multi-view object perception models can provide noisy supervision to train single-frame methods. Mitash et al. [13] conduct multi-view pose estimation to pseudo-label real images for self-training of an object detector. Zeng et al. [30] perform multi-view object segmentation to generate training data for object pose estimation. Nava et al. [15] exploit noisy state estimates to assist self-supervised learning of spatial perception models.

Inspired by these works, we propose to use SLAM-aided object pose estimation to generate training data for semi-supervised learning of object pose estimators.

II-C Automatic covariance tuning for robust SLAM

Obtaining robust SLAM estimates is critical for the success of our method. In PGO, the selection of measurement noise covariances relies mostly on empirical estimates of noise statistics. In the presence of outlier-prone measurements (e.g. learning-based) whose uncertainty is hard to quantify, it is infeasible to fix a (Gaussian) noise model for PGO. Adaptive covariance tuning methods (e.g. [19, 25]) have been developed to concurrently estimate the SLAM variables and noise statistics. These methods show generally better performance than using empirical noise models.

Robust M-estimators are widely applied to deal with outlier-corrupted measurements [11]. When their robust cost functions are minimized with the iteratively re-weighted least squares (IRLS) method, the measurement contributions are re-weighted at each step based on their Mahalanobis distances. This down-weights the influence of outliers and is, in effect, re-scaling the covariances uniformly [1]. Pfeifer et al. [17] propose to jointly optimize state variables and covariances, with the covariance components log-regularized. This method leads to a robust M-estimator, closed-form dynamic covariance estimation (cDCE), which can be solved with IRLS. It is more flexible than existing M-estimators since it allows the noise components to be re-weighted differently.

Our joint optimization formulation is similar, with the covariances L1-regularized, and it also permits a closed-form component-wise covariance update. Instead of IRLS, we follow an alternating minimization procedure, and eliminate the influence of outliers based on the χ² test.

III Methodology

III-A Object SLAM via pose graph optimization

In our method, the object pose estimator is initially trained on (typically synthetic) labeled image data $\{(I_i, y_i)\}$, where $I_i$ denotes an RGB image, $T_i$ the ground truth 6DoF object pose (w.r.t. the camera), and $y_i$ the ground truth label. We can map $T_i$ to $y_i$ via an estimator-dependent function $y_i = f(T_i)$. In this paper, we work with the DOPE [24] estimator (our method is not restricted to using a particular pose estimator), for which $y_i$ is the set of pixel locations of the projected object 3D bounding box, and $f$ is the perspective projection function. We apply the initial estimator to make object pose predictions as the robot explores its environment.
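As a concrete sketch of the estimator-dependent mapping $f$, the snippet below projects the 8 corners of an object's 3D bounding box into the image, DOPE-style. The camera intrinsics and box dimensions are illustrative assumptions, not values from the paper:

```python
import numpy as np

def project_bbox_corners(T_co, dims, K):
    """Project the 8 corners of an object's 3D bounding box into pixels.

    T_co : 4x4 object-to-camera pose; dims : (w, h, d) box size in meters;
    K    : 3x3 camera intrinsic matrix.
    Returns an (8, 2) array of pixel coordinates (a DOPE-style label y).
    """
    w, h, d = dims
    # 8 corners of the box, centered at the object origin
    corners = np.array([[sx * w / 2, sy * h / 2, sz * d / 2]
                        for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
    corners_h = np.hstack([corners, np.ones((8, 1))])   # homogeneous coords
    cam = (T_co @ corners_h.T).T[:, :3]                 # corners in camera frame
    pix = (K @ cam.T).T                                 # perspective projection
    return pix[:, :2] / pix[:, 2:3]

# Illustrative intrinsics; a box centered 1 m in front of the camera
K = np.array([[600.0, 0, 320], [0, 600.0, 240], [0, 0, 1]])
T = np.eye(4); T[2, 3] = 1.0
y = project_bbox_corners(T, (0.16, 0.21, 0.07), K)
```

By symmetry, the mean of the 8 projected corners lands on the principal point when the box sits on the optical axis.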

During navigation, the robot collects a number of unlabeled RGB images $\mathcal{I}$. The estimator makes noisy object pose measurements (i.e. network predictions) $\mathcal{Z}^{obj}$ on $\mathcal{I}$. Combining $\mathcal{Z}^{obj}$ and camera odometric measurements $\mathcal{Z}^{odom}$ obtained with on-board sensors (camera, IMU, lidar, etc.), a pose graph can be established (see Fig. 1(c)). The latent variables $\mathcal{X}$ incorporate camera poses $x_t$ and object landmark poses $l_o$ (w.r.t. the world). Assuming all the measurements are independent, we can decompose the joint likelihood into factors, consisting of odometry factors $\phi^{odom}$ and object pose measurement factors $\phi^{obj}$. Each factor describes the likelihood of a measurement given certain variable assignments. Edges represent the factorization structure of the likelihood distribution:

$$p(\mathcal{Z} \mid \mathcal{X}) \propto \prod_{t} \phi^{odom}_t(x_t, x_{t+1}) \prod_{t,o} \phi^{obj}_{t,o}(x_t, l_o) \qquad (1)$$

We assume the measurement noises are zero-mean and normally distributed, $n_k \sim \mathcal{N}(0, \Sigma_k)$, where $\Sigma_k$ is the noise covariance matrix for measurement $z_k$. The optimal assignments of the variables can be obtained by solving the maximum likelihood estimation problem:

$$\mathcal{X}^* = \arg\max_{\mathcal{X}} \prod_k \phi_k = \arg\min_{\mathcal{X}} \sum_k \big\| \mathrm{Log}\big(h_k(\mathcal{X})^{-1} z_k\big) \big\|^2_{\Sigma_k} \qquad (2)$$

where Log is the logarithmic map on the SE(3) group, $h_k$ predicts measurement $z_k$ from the variables, and $\|\cdot\|_{\Sigma}$ is the Mahalanobis distance. We solve (2) with automatic covariance tuning (see Sec. III.C) to obtain robust SLAM estimates $\mathcal{X}^*$.
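To make the structure of (2) concrete, the following toy reduces the pose graph to scalar, translation-only variables: three camera positions and one object landmark, so the MLE becomes a weighted linear least-squares problem. The measurement values and noise levels are invented for illustration (the paper solves the full SE(3) problem with GTSAM):

```python
import numpy as np

# Toy translation-only "pose graph": camera positions x0..x2 and one object
# landmark l, all scalars. Odometry measures x_{i+1} - x_i; the "pose
# estimator" measures l - x_i from each frame. With Gaussian noise, the MLE
# is a weighted linear least-squares problem over z = [x0, x1, x2, l].
odom = [1.0, 1.0]              # x1-x0, x2-x1   (sigma = 0.01)
obj_meas = [3.05, 2.0, 0.95]   # l-x0, l-x1, l-x2   (sigma = 0.1)

rows, rhs, w = [], [], []
rows.append([1, 0, 0, 0]); rhs.append(0.0); w.append(1e6)   # anchor x0 = 0
for i, m in enumerate(odom):                                 # odometry factors
    r = [0, 0, 0, 0]; r[i] = -1; r[i + 1] = 1
    rows.append(r); rhs.append(m); w.append(1 / 0.01)
for i, m in enumerate(obj_meas):                             # object pose factors
    r = [0, 0, 0, 0]; r[i] = -1; r[3] = 1
    rows.append(r); rhs.append(m); w.append(1 / 0.1)

# Whiten each row by 1/sigma, then solve the normal equations
A = np.array(rows, float) * np.array(w)[:, None]
b = np.array(rhs) * np.array(w)
z = np.linalg.lstsq(A, b, rcond=None)[0]   # [x0, x1, x2, l]
```

The noisy per-frame object measurements (3.05, 3.0, 2.95 once composed with camera positions) fuse into a single consistent landmark estimate near 3.0, which is exactly the multi-view averaging effect the object-level map provides.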

III-B Hybrid pseudo-labeling

Based on the optimal states, we can: (1) identify inliers $\mathcal{Z}^{in}$ among the object pose measurements; (2) compute optimized object-to-camera poses $\hat{T}$. We compare and combine $\mathcal{Z}^{in}$ and $\hat{T}$ to obtain the pseudo ground truth poses for the images in $\mathcal{I}$ (see Fig. 1(e)). We refer to this pseudo-labeling method as Hybrid labeling, with pseudo labels derived from the mixed sources.

First, we identify inlier pose measurements based on the χ² test (Alg. 1):

$$\big\| \mathrm{Log}\big(h_j(\mathcal{X}^*)^{-1} z_j\big) \big\|^2_{\Sigma_j} \leq \chi^2_{6,\,0.95} \qquad (3)$$

where $\chi^2_{6,\,0.95}$ is the critical value for the χ² distribution with 6 degrees of freedom at confidence level 0.95.
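A minimal sketch of this gating step, using a diagonal covariance and the standard 6-dof critical value; the covariance and residual values are illustrative:

```python
import numpy as np

CHI2_6_095 = 12.592  # chi-square critical value, 6 dof, 95% confidence

def is_inlier(residual, cov):
    """Chi-square gating: accept a 6DoF pose measurement if its squared
    Mahalanobis distance e^T Sigma^{-1} e is below the critical value."""
    d2 = residual @ np.linalg.solve(cov, residual)
    return d2 <= CHI2_6_095

# Illustrative noise model: 5 cm translation, 0.1 rad rotation sigmas
cov = np.diag([0.05, 0.05, 0.05, 0.1, 0.1, 0.1]) ** 2
small = np.full(6, 0.01)   # residual consistent with the noise model
large = np.full(6, 0.5)    # a gross outlier prediction
```

A residual of 0.5 on every component yields a squared Mahalanobis distance in the hundreds, far beyond the 12.59 cutoff, so the measurement is rejected.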

Second, we directly compute pseudo ground truth poses from the optimal states. For the image collected at time $t$, the optimized object-to-camera pose for object $o$ is computed as $\hat{T}_{t,o} = (x_t^*)^{-1} l_o^*$. However, the inlier predictions or optimized poses can still be noisy, and may deviate visually from the target objects in the images and hurt training.

Therefore, we employ a pose evaluation module, as proposed in [5], to quantify the visual coherence between pseudo labels and images (see Fig. 1(f)). Each RGB image is compared with its corresponding rendered image based on the pseudo ground truth pose. The regions of interest (RoIs) in the two images are passed into a pre-trained auto-encoder from PoseRBPF [4], and the cosine similarity of their feature embeddings is computed. We abstract the process as a score function $s(\cdot)$, where the dependence on object, camera and image information is made implicit.

We contrast the PGO-computed pose's score with the inlier score for every target object in the unlabeled images, and pick the pose with the higher score as the pseudo ground truth pose (see Sec. IV.A for details). We assemble the pseudo-labeled data as $\{(I_i, \tilde{y}_i)\}$, where again $\tilde{y}_i = f(\tilde{T}_i)$, and fine-tune the object pose estimator with the pseudo-labeled data (see Fig. 1(g)).

Our method naturally generates training samples on which the initial estimator fails to make reasonable predictions, i.e. hard examples. In particular, when the initial estimator makes an outlier prediction or misses a prediction, PGO can potentially recover an object pose that is visually consistent with the RGB image. The challenging examples, as we demonstrate in Sec. IV, are the key to achieving significant performance gain during self-training.

III-C Automatic covariance tuning by alternating minimization

Since the object pose measurements are derived from learning-based models, it is difficult to fix a (Gaussian) uncertainty model for them a priori, as typically required in PGO. We propose to automate the covariance tuning process by incorporating the noise covariances as variables into a joint optimization procedure. Our formulation, as we show, leads to a simple and robust PGO method that allows us to alternate between PGO and determining the optimal noise models in closed form.

For simplicity, we rewrite the PGO cost function in (2) as:

$$L(\mathcal{X}) = \sum_i \| e_i \|^2_{\Sigma_i} + \sum_j \| e_j \|^2_{\Sigma_j} \qquad (4)$$

where $e_i$ and $e_j$ are the unwhitened errors of the odometry and object pose measurements, respectively.

Given that the odometry measurements are typically more reliable, we only update the object pose noise covariances $\Sigma_j$ at the covariance optimization step. Each time after optimizing the SLAM variables, we solve for the per-measurement noise models that further minimize the optimization loss, and re-solve the SLAM variables with them. To avoid the trivial solution of infinitely large covariances, we apply L1 regularization on the covariance matrices. The loss function for the joint optimization is:

$$L(\mathcal{X}, \{\Sigma_j\}) = \sum_i \| e_i \|^2_{\Sigma_i} + \sum_j \Big( \| e_j \|^2_{\Sigma_j} + \alpha \, \|\Sigma_j\|_1 \Big) \qquad (5)$$
For simplicity, we assume the covariance matrices are diagonal, i.e. $\Sigma_j = \mathrm{diag}(\sigma_{j1}^2, \ldots, \sigma_{j6}^2)$, and the joint loss reduces to:

$$L(\mathcal{X}, \{\sigma_{jk}^2\}) = \sum_i \| e_i \|^2_{\Sigma_i} + \sum_j \sum_{k=1}^{6} \Big( \frac{e_{jk}^2}{\sigma_{jk}^2} + \alpha \, \sigma_{jk}^2 \Big) \qquad (6)$$

where $e_{jk}$ is the $k$-th entry of the unwhitened error $e_j$.

For fixed $\mathcal{X}$ and $\alpha$, optimizing (6) w.r.t. $\sigma_{jk}^2$ amounts to minimizing a convex objective over the non-negative reals. Moreover, we know $\sigma_{jk}^2 = 0$ is never a minimizer for $e_{jk} \neq 0$. Thus, a global optimal solution is obtained when:

$$\frac{\partial L}{\partial \sigma_{jk}^2} = -\frac{e_{jk}^2}{\sigma_{jk}^4} + \alpha = 0 \qquad (7)$$

evaluating which we obtain the extremum on $\sigma_{jk}^2$:

$$\sigma_{jk}^2 = \frac{|e_{jk}|}{\sqrt{\alpha}} \qquad (8)$$

With $\frac{\partial^2 L}{\partial (\sigma_{jk}^2)^2} = \frac{2 e_{jk}^2}{\sigma_{jk}^6} > 0$ being valid for all $\sigma_{jk}^2 > 0$, we can confirm that this extremum is a global minimum.

Therefore we can express the covariance update rule at iteration $t$ as:

$$\sigma_{jk}^{2\,(t)} = \lambda \, \big| e_{jk}^{(t-1)} \big| \qquad (9)$$

where we factor out $\lambda = 1/\sqrt{\alpha}$ in (8) for the convenience of tuning.

Since the covariance optimization step admits a closed-form solution, the joint optimization reduces to "iteratively solving PGO with re-computed covariances". The selection of all the noise models reduces to tuning $\lambda$. According to our experiments, it is feasible to set $\lambda$ as a constant for consistent performance in different cases (we use the same fixed $\lambda$ for all the tests).
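A scalar toy illustrating the resulting alternating minimization: the "PGO step" reduces to a weighted mean, the covariance step applies the closed-form update (9), and the joint loss (6) decreases monotonically. The data and λ are invented for illustration, and a tiny jitter guards against division by zero:

```python
import numpy as np

# Alternating minimization toy: estimate a scalar location x from
# measurements z (one gross outlier), re-fitting each measurement's
# variance in closed form: sigma_j^2 = lam * |e_j|.
z = np.array([1.0, 1.05, 0.95, 1.02, 5.0])   # last measurement is an outlier
lam = 1.0                                     # regularization weight, lam = 1/sqrt(alpha)
x, sig2 = z.mean(), np.ones_like(z)           # initial guess

def joint_loss(x, sig2):
    e = z - x
    return np.sum(e**2 / sig2 + (1.0 / lam**2) * sig2)  # data + L1 regularizer

losses = []
for _ in range(20):
    x = np.sum(z / sig2) / np.sum(1.0 / sig2)  # "PGO" step: weighted mean
    sig2 = lam * np.abs(z - x) + 1e-12         # closed-form covariance step
    losses.append(joint_loss(x, sig2))
```

The outlier's variance is inflated in proportion to its residual, so its weight collapses and the estimate settles near the inlier cluster rather than the contaminated mean of 1.8.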

The algorithm is summarized in Alg. 1. We use the Levenberg-Marquardt (LM) algorithm to solve the Gaussian PGO. For better performance, we also identify outliers at each iteration using the χ² test (3) and set their noise covariances to a very large value to rule out their influence. The optimization is terminated when the relative decrease in the joint loss is sufficiently small or the maximum number of iterations is reached. In Appendix A, we show the algorithm monotonically improves the joint loss.

Input: Graph with object pose measurement factors and odometry factors;
Initial values: X^0, Σ^0;
Regularization coefficient: λ
Output: Optimized variables X*
1:  for t = 1 to max_iterations do
2:      X^t ← LM(graph, X^{t-1}, Σ^{t-1})
3:      for each object pose measurement j do
4:          e_j ← evaluateError(X^t, z_j)
5:          if chiSqTest(e_j, Σ_j) fails then
6:              Σ_j ← large constant        // outlier: rule out its influence
7:          else
8:              σ_jk² ← λ|e_jk|, k = 1,…,6  // closed-form update (9)
9:          end if
10:     end for
11: end for
Algorithm 1: ACT by alternating minimization

The update rule (9) reveals our implicit assumption that a high-residual measurement component (i.e. one of x, y, z, roll, pitch, yaw) is likely drawn from a high-variance normal distribution. The recomputed covariances down-weight high-residual measurements, which is in spirit similar to robust M-estimation. We show in Appendix B that, with isotropic noise models, our method reduces to using the L1 robust kernel.

However, robust M-estimation is typically solved with the IRLS method, where the measurement losses are re-weighted based on the Mahalanobis distance (see (17)). In comparison, our method re-weights the per-component losses differently. This enables us to fit a richer class of noise models for 6DoF PGO, in that different components often follow different noise characteristics.
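The contrast can be made concrete on a single 6DoF residual: IRLS-style re-weighting derives one scalar weight from the Mahalanobis distance, while the component-wise update inflates only the variance of the offending component. The numbers below are illustrative:

```python
import numpy as np

# A 6DoF residual (xyzrpy) whose error is concentrated in yaw. Uniform
# M-estimator re-weighting scales all components by one shared factor,
# while the per-component update sigma_k^2 = lam * |e_k| inflates only
# the bad component's variance.
e = np.array([0.01, 0.01, 0.01, 0.02, 0.02, 0.80])
lam = 1.0

d = np.linalg.norm(e)              # Mahalanobis distance under identity cov
w_uniform = np.full(6, 1.0 / d)    # one shared down-weight (IRLS-style)

sig2 = lam * np.abs(e)             # component-wise variances (ACT update)
w_component = 1.0 / sig2           # translation components keep high weight
```

Under uniform re-weighting, the accurate translation components are penalized along with the corrupted yaw; the component-wise scheme preserves their influence.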

IV Experiments

Our method is tested on (1) the YCB video (YCB-v) dataset and (2) a new ground robot experiment. On the YCB-v dataset (Sec. IV.A), we leverage the image streams in the training sets to build per-video object-level maps for self-training. The Hybrid labeling method is compared with two other baseline methods to verify the effectiveness of the different modules. In the ground robot experiment (Sec. IV.B), we apply the method on longer sequences circumnavigating selected objects, and demonstrate its potential for use in object-based navigation.

Our method is implemented in Python. We use the NVISII toolkit [14] to generate synthetic image data. The training and pose inference programs are adapted from the code available in the DOPE GitHub repository [24]. Every network is initially trained for 60 epochs and fine-tuned for 20 epochs, with a batch size of 64 and a learning rate of 0.0001. We solve PGO problems with the GTSAM library [3]. We run the data generation and training programs on 2 NVIDIA Volta V100 GPUs, and all other code on an Intel Core i7-9850H@2.6GHz CPU and an NVIDIA Quadro RTX 3000 GPU.

IV-A YCB video experiment

We test our method with DOPE estimators [24] for three YCB objects: 003_cracker_box, 004_sugar_box and 010_potted_meat_can. Each object appears in 20 YCB videos (training + testing). They have respectively 26689, 22528 and 27050 training images (dashed lines in Fig. 2(a)) from 17, 15, and 17 training videos.

We use 60k NVISII-generated synthetic images to train the initial DOPE estimators for the 3 objects (the 010_potted_meat_can data are the same as those used in [14]). The models are applied to infer object poses on the YCB-v images. We employ the ORB-SLAM3 [2] RGB-D module (w/o loop closing) to obtain camera odometry on these videos.

003_cracker_box 0001 0004 0007 0016 0017 0019 0025 #best
LM 62.3 58.7 13.2 69.4 37.6 110.1 101.6 0
Cauchy 12.4 10.8 10.2 13.8 29.5 94.4 171.4 4
Huber 31.4 25.4 10.2 34.2 21.6 52.5 57.0 1
G-M 11.5 168.4 10.2 115.0 48.4 94.4 171.4 2
cDCE[17] 28.7 25.4 10.5 32.5 21.1 45.2 58.9 4
ACT(Ours) 15.7 12.0 9.4 12.6 20.3 52.0 15.4 9
004_sugar_box 0001 0014 0015 0020 0025 0029 0033 #best
LM 22.9 27.1 100.9 21.1 57.3 78.7 7.1 0
Cauchy 8.3 13.4 30.4 21.9 22.3 104.4 6.4 1
Huber 11.7 12.8 35.5 15.8 23.4 71.1 6.6 3
G-M 9.4 11.4 29.1 14.3 19.6 104.4 6.4 6
cDCE[17] 12.3 12.0 31.6 14.9 20.0 72.7 6.5 0
ACT(Ours) 8.2 15.9 34.2 15.1 18.0 100.5 6.1 10
010_potted_meat_can 0002 0005 0008 0014 0017 0023 0026 #best
LM 35.2 38.1 61.4 59.2 31.1 32.8 17.5 0
Cauchy 10.8 14.9 10.7 12.2 14.2 11.7 11.4 6
Huber 11.1 17.1 14.6 16.6 18.6 15.5 11.8 1
G-M 10.4 15.3 11.9 13.2 18.9 15.2 9.1 5
cDCE[17] 11.5 16.2 16.1 20.5 17.8 15.3 11.2 1
ACT(Ours) 13.3 14.5 12.8 11.3 19.1 14.0 10.4 7
TABLE I: Comparison of robust PGO methods via pseudo-label accuracy on YCB-v sequences. Columns 1-7 report the median (pixel) errors of the PGO-generated pseudo labels for the first 7 (out of 20) videos. Column 8 (#best) is the number of sequences on which a method achieves the lowest error.

Combining the measurements, we solve per-video PGO problems to build object-level maps for pseudo-labeling. We initialize the camera poses with the odometry chain, the object poses with averaged pose predictions, and the covariances with a fixed initial guess. We apply different robust optimization methods to solve all 60 PGO problems (see Tab. I; due to space limitations, we report results for the first 7 of 20 YCB-v sequences, with complete statistics in our GitHub repository; we use the default parameters and the same noise covariances for the robust M-estimators; cDCE is implemented by replacing (8) with equation (16) in [17]). For comparison purposes only, we compute pseudo labels for all the YCB-v images directly from the optimal states, and compare the methods via label errors, i.e. how much the projected object bounding boxes deviate from the ground truth. Benefiting from component-wise covariance re-scaling and the mitigation of outlier effects, our ACT method achieves the lowest errors on many more videos (see Tab. I, Col. 8). It performs stably across sequences with a fixed initial guess and a constant regularization coefficient λ. We therefore pseudo-label the YCB-v training images based on the results of our method.

To evaluate the efficacy of the different modules, we compare our Hybrid method with two other baseline labeling methods: Inlier and PoseEval. Inlier uses the PGO results only as an inlier measurement filter: the raw pose predictions that are geometrically congruent with the other measurements, i.e. those passing the χ² test (3), are selected for labeling. PoseEval extracts visually coherent pose predictions by thresholding the similarity scores from the pose evaluation module (with the threshold as in [5]).

The Hybrid method ensures spatial and visual consistency of the pseudo-labeled data. For a given target object in an image, we compare the pose evaluation score of the inlier prediction (if available) with that of the optimized object pose. The higher scorer, if beyond a threshold, is picked for labeling. A stricter threshold is applied to the PGO-generated labels, because they are not directly derived from the RGB images and are thus more prone to misalignment. The Hybrid pseudo-labeled data therefore consist of high-score inliers, PGO-generated easy examples, and hard examples (on which the initial estimator fails). The 3 components are colored differently on the Hybrid bars in Fig. 2(a). For the Hybrid and Inlier modes, we also exclude YCB-v sequences with measurement outlier rates higher than 20% from data generation.

Fig. 2: Statistics of the pseudo-labeled data generated on the YCB-v dataset. Our Hybrid pseudo-labeled data are derived from 2 sources: inlier pose predictions and PGO-estimated poses; the latter consists of easy and hard examples. Inlier and PoseEval are baseline labeling methods. (a) Data sizes; (b) Average errors in the pseudo labels, in pixels (the image dimensions are 640×480). The black bars represent one standard deviation.

The statistics of the pseudo-labeled data are reported in Fig. 2. The Hybrid and Inlier data are in general very accurate (average label errors are a small fraction of the image width), with the χ² test being an effective inlier filter. But the PoseEval data, regardless of their sizes, are noisier and outlier-corrupted, so we cannot rely on the pose evaluation test alone to generate outlier-free labels.

Further, we evaluate the DOPE estimators, fine-tuned with the pseudo-labeled data, on the YCB-v test sets. We adopt the average distance (ADD) metric [7] and present the accuracy-threshold curves in Fig. 3. All the methods achieve considerable improvements over the initial model (Synthetic), indicating the significance of in-domain data, even noisy and outlier-corrupted, for training pose estimators, as also reported in [27, 22, 12, 32, 29]. But a large performance gap remains to the model trained with YCB-v ground truth data, due to label noise and the limited data size. Our Hybrid method consistently outperforms the other baselines, even though its data have statistics similar to the Inlier data. We thus believe the performance gain comes mainly from the presence of hard examples.
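For reference, the ADD metric [7] used here can be sketched as follows; the model points are a random toy point cloud, not an actual YCB mesh:

```python
import numpy as np

def add_error(R_est, t_est, R_gt, t_gt, model_pts):
    """Average Distance (ADD) metric: mean Euclidean distance between the
    model points transformed by the estimated and ground-truth poses."""
    p_est = model_pts @ R_est.T + t_est
    p_gt = model_pts @ R_gt.T + t_gt
    return np.linalg.norm(p_est - p_gt, axis=1).mean()

# Toy point cloud standing in for an object model (meters)
pts = np.random.default_rng(0).uniform(-0.05, 0.05, (500, 3))
R = np.eye(3)
err = add_error(R, np.array([0.0, 0.0, 0.01]), R, np.zeros(3), pts)
```

A pure 1 cm translation offset yields an ADD of exactly 0.01 m; the accuracy-threshold curve then plots the fraction of test frames whose ADD falls below each threshold.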

Fig. 3: Evaluation of the initial and fine-tuned DOPE estimators on the YCB-v testsets: ADD accuracy-threshold curves. The area under curve (AUC) values (%) are labeled in the legend.

Beyond the first iteration of self-training, we attempt to pseudo-label the same data with the improved pose estimator and further fine-tune the model. But the twice-fine-tuned estimators fail to show significant performance enhancement and can even worsen (their ADD AUC values by Hybrid labeling are 47.0, 76.0, and 53.9, respectively). We believe this results from model over-fitting (observed after 20 epochs during the first fine-tuning) and the accumulation of label noise.

In return, we examine the effect of fine-tuned estimator models on object SLAM performance. The YCB-v test sequences 0049, 0059 are selected for evaluation, since they have 2 (out of the 3) selected YCB objects. We solve the PGOs with our ACT method and the LM algorithm. The results are reported in Fig. 4 and Tab. II.

The outlier measurement rates are greatly reduced on both sequences, making the LM algorithm more likely to succeed. But for our ACT method, the SLAM accuracy is improved on 0059 but slightly degraded on 0049. This is because our method is outlier-robust, so decreasing outliers does not always lead to improved SLAM performance, especially in low-outlier regimes.

(a) 0049 before (outliers=15%)
(b) 0049 after (outliers=2%)
(c) 0059 before (outliers=30%)
(d) 0059 after (outliers=8%)
Fig. 4: Object SLAM results by automatic covariance tuning on the YCB-v (unseen) test sequences 0049 and 0059 before and after fine-tuning. The outlier rates of object pose predictions are labeled. The solid and dashed axes represent respectively the optimal and ground truth object poses.
Sequence 0049    ATE (m)   004_sugar_box           010_potted_meat_can
                           Tran. (m)   Ori. (rad)  Tran. (m)   Ori. (rad)
LM   before      0.199     0.054       0.023       0.091       0.154
LM   after       0.081     0.039       0.083       0.037       0.089
ACT  before      0.030     0.057       0.018       0.020       0.064
ACT  after       0.046     0.037       0.046       0.035       0.065

Sequence 0059    ATE (m)   003_cracker_box         010_potted_meat_can
                           Tran. (m)   Ori. (rad)  Tran. (m)   Ori. (rad)
LM   before      0.618     0.234       0.402       0.518       0.214
LM   after       0.105     0.057       0.027       0.032       0.147
ACT  before      0.036     0.043       0.065       0.049       0.103
ACT  after       0.025     0.019       0.027       0.020       0.108

TABLE II: Object SLAM accuracy on the YCB-v test sequences 0049 and 0059 before and after fine-tuning: absolute trajectory errors (ATE) and object pose errors (w.r.t. world). The results by automatic covariance tuning (ACT) are visualized in Fig. 4.

IV-B Ground robot experiment

As illustrated in Fig. 5, we control a Jackal robot [8] to circle around two target objects, 003_cracker_box and 010_potted_meat_can, and collect stereo RGB images from a ZED2 camera [23]. For each object, two 4-minute sequences are recorded, one for self-training and the other for testing. We obtain the ground truth camera trajectory from a Vicon MoCap system [26] and the ground truth object poses from AprilTag detections [16]. The camera odometry is computed with the SVO2 stereo module [6]. We infer the object poses from the left-camera RGB images using the same initial estimator as in Sec. IV.A.

Fig. 5: The ground robot experiment

Similar to the YCB-v experiment, we build an object-level map from the measurements and pseudo-label the left-camera images with the Hybrid method. For the two objects, 648 (out of 3120) and 950 (out of 2657) images are pseudo-labeled, of which 1.5% and 2.8% are hard examples. After fine-tuning, we evaluate the estimator models on both our experiment's test sequences and the YCB-v test sets. On our own test sequences, we adopt the reprojection error of the object 3D bounding box as the evaluation metric. The accuracy-threshold curves are presented in Fig. 6, and the AUCs for the curves in both tests are reported in Tab. III.

The estimators fine-tuned with the robot-collected data show elevated performance in a similar environment, and achieve slight enhancements in the YCB-v test scenes. This indicates that our method generalizes well to robot navigation scenarios, and also reiterates that real annotated images are precious and effective for mitigating the domain gap.

Fig. 6: Evaluation of the initial and self-trained pose estimators on the ground robot experiment test sequences: reprojection error accuracy-threshold curves. (The image dimensions are 376×672.)
Our test sequences
Reproj error AUC 003_cracker_box 010_potted_meat_can
Before 58.0 40.6
After 63.9 70.2
YCB-v testsets
ADD AUC 003_cracker_box 010_potted_meat_can
Before 8.0 42.3
After 15.2 44.7
TABLE III: Evaluation of the initial and self-trained DOPE estimators on our experiment test sequences and the YCB-v testsets: area under curve (AUC) values (%) for the accuracy-threshold curves.

To study the value of the enhanced pose estimators for object SLAM, we solve the PGOs on our test sequences and the YCB-v test sequence 0059 with the initial and fine-tuned models. The SLAM estimates obtained by our ACT method and the LM algorithm are presented in Fig. 7, and the trajectory errors are reported in Tab. IV.

Similarly, since our method (ACT) is outlier-robust, the reduced outlier rates do not always bring about improved SLAM accuracy. However, in all cases, it facilitates the success of the non-robust LM algorithm.

ACT 003_cracker_box test 010_potted_meat_can test YCB-v 0059
Before 0.030 0.028 0.039
After 0.029 0.028 0.031
LM 003_cracker_box test 010_potted_meat_can test YCB-v 0059
Before 0.062 0.253 0.618
After 0.046 0.072 0.576
TABLE IV: Object SLAM accuracy before and after fine-tuning: absolute trajectory errors (ATE). We solve the PGOs using our ACT method and the LM algorithm.
(a) 003_cracker_box by ACT
(b) 003_cracker_box by LM
(c) 010_potted_meat_can by ACT
(d) 010_potted_meat_can by LM
Fig. 7: SLAM results on the test sequences of our robot experiment before and after estimator self-training. We solve the PGOs using our ACT method and the LM algorithm. The blue, red and dashed axes are respectively optimal (before fine-tuning), optimal (after fine-tuning) and ground truth object poses. The circular trajectories are 15m long.

V Conclusions

A SLAM-aided semi-supervised learning method for object pose estimation is developed to mutually boost the performance of object pose inference and object SLAM during robot navigation. For the SLAM optimization, we propose to automate the tuning of the noise covariances by jointly optimizing the SLAM variables and uncertainty models, leading to a flexible and easy-to-implement robust PGO method. We demonstrate the effectiveness of our method on the YCB-v dataset and in ground robot experiments. Our method can mine high-quality pseudo-labeled data from noisy and outlier-corrupted measurements. The SLAM-supported self-training, even with noisy supervisory signals, considerably enhances the performance of pose estimators trained with synthetic data. The fine-tuned estimator models, with reduced outlier rates, in return make object SLAM more effective.


V-A Proof of monotonic improvement in the joint loss

We prove that our automatic covariance tuning method monotonically improves the joint loss J(X, Σ) defined in (5). At iteration t, we have the variable assignments X^{t-1} and the noise covariances Σ^{t-1} from iteration t-1. Solving the PGO with Σ^{t-1} fixed, the LM algorithm ensures the PGO cost does not increase: L(X^t, Σ^{t-1}) ≤ L(X^{t-1}, Σ^{t-1}). Since the extra regularization term in J is not a function of X, we can have:

J(X^t, Σ^{t-1}) ≤ J(X^{t-1}, Σ^{t-1}).

We have also shown after (8) that Σ^t is a global minimizer of J(X^t, ·). Thus, we have:

J(X^t, Σ^t) ≤ J(X^t, Σ^{t-1}).

Combining the two inequalities, we obtain:

J(X^t, Σ^t) ≤ J(X^{t-1}, Σ^{t-1}).

With this being valid for all iterations, we can have the chain of inequalities:

J(X^T, Σ^T) ≤ J(X^{T-1}, Σ^{T-1}) ≤ ⋯ ≤ J(X^0, Σ^0),

which completes the proof.

In our implementation, we eliminate outliers' influence by setting their covariances to a large value (Alg. 1 line 1), which may violate this monotonic improvement property. To resolve the problem, we freeze the loss terms of outliers in the joint loss until the end of the optimization.
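The alternating scheme behind this proof can be illustrated on a 1-D toy problem. The data and the scalar stand-in for the joint loss below are entirely hypothetical, not the paper's implementation: each iteration first re-solves the weighted least-squares problem exactly, then re-scales the variances analytically, so the loss sequence never increases:

```python
import numpy as np

# 1-D toy analogue of the alternating scheme (hypothetical loss form,
# d = 1): J(x, s) = sum_k [ (x - z_k)^2 / s_k + s_k ], with s_k = sigma_k^2
def joint_loss(x, s, z):
    return float(np.sum((x - z) ** 2 / s + s))

z = np.array([0.0, 0.1, -0.2, 5.0])  # measurements; one gross outlier at 5.0
s = np.ones_like(z)                  # initial unit variances
x, losses = 0.0, []
for _ in range(30):
    x = np.sum(z / s) / np.sum(1.0 / s)   # exact minimizer over x (weighted LS)
    s = np.maximum(np.abs(x - z), 1e-9)   # exact per-term minimizer over s_k
    losses.append(joint_loss(x, s, z))

# each full iteration can only decrease the joint loss
monotone = all(b <= a + 1e-12 for a, b in zip(losses, losses[1:]))
```

Because each half-step is an exact minimizer of the same objective, the recorded losses form exactly the non-increasing chain used in the proof; the outlier's variance grows with its residual and it is down-weighted automatically.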

V-B Reduction to the L1 robust M-estimator

We show that our automatic covariance tuning method, when the noise models are assumed to be isotropic, is equivalent to using the L1 robust M-estimator. With isotropic noises, i.e. Σ_k = σ_k²I, the joint loss in (5) reduces to:

J(X, σ) = C_odom(X) + ∑_k ( ‖r_k(X)‖²/σ_k² + d σ_k² ),

where C_odom(X) collects the odometry factor terms, whose covariances are fixed, and d is the residual dimension. Evaluating ∂J/∂σ_k = 0 yields the new update rule:

σ_k² = ‖r_k(X)‖/√d.   (15)

On the other hand, we can apply the L1 robust loss to the object pose measurement factors to minimize the PGO loss (4). The robust PGO cost can be expressed as:

F(X) = C_odom(X) + ∑_k ρ(‖r_k(X)‖_{Σ_k}),   (16)

where ρ(e) = |e| is the L1 robust kernel. The robust cost is typically minimized with the IRLS method, by matching the gradients of (16) locally with a sequence of weighted least squares problems. The local least squares formulation at iteration t can be expressed as:

X^t = argmin_X C_odom(X) + ∑_k w(e_k^{t-1}) ‖r_k(X)‖²_{Σ_k},   (17)

where w(e) = ρ′(e)/e = 1/|e| is the weight function and e_k^{t-1} = ‖r_k(X^{t-1})‖_{Σ_k}. Under the isotropic noise assumption, i.e. Σ_k = σ²I, we can absorb the weight function into the covariance matrix and rewrite (17) as:

X^t = argmin_X C_odom(X) + ∑_k ‖r_k(X)‖²_{Σ_k^t},   (18)

where the covariance matrix Σ_k^t = (σ_k^t)²I is de facto re-scaled iteratively by:

(σ_k^t)² = σ²/w(e_k^{t-1}) = σ‖r_k(X^{t-1})‖.   (19)

Matching (15) with (19), we can see that for σ = 1/√d the two methods are in theory equivalent (in the context of robust M-estimation, σ = const., so a different constant merely re-scales all weights uniformly and leaves the minimizer unchanged).
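A quick numerical sanity check of this equivalence. The update-rule forms and constants below are our own assumptions about (15) and (19): the two rules assign each measurement the weight 1/σ_k², and if they differ only by a constant factor, the induced weighted least-squares problems share the same minimizer:

```python
import numpy as np

# hypothetical residual norms from one iteration, residual dimension d = 3
d = 3
residual_norms = np.array([0.15, 0.6, 1.1, 2.4, 0.9])

act_var = residual_norms / np.sqrt(d)   # assumed ACT update, cf. (15)
sigma = 0.7                             # fixed base sigma of the L1 factor
irls_var = sigma * residual_norms       # assumed L1-IRLS re-scaling, cf. (19)

# ratio of the two weightings 1/act_var and 1/irls_var, per measurement
ratio = irls_var / act_var
same_up_to_scale = bool(np.allclose(ratio, ratio[0]))
```

Since scaling every weight by the same constant leaves the weighted least-squares minimizer unchanged, the two covariance schedules steer the PGO toward the same solution.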


Acknowledgments

The authors acknowledge Jonathan Tremblay and other NVIDIA developers for providing consultation on training DOPE networks and generating synthetic data. The authors also acknowledge the MIT SuperCloud and Lincoln Laboratory Supercomputing Center for HPC resources that have contributed to the results reported within this paper.

