SLAM-Supported Semi-Supervised Learning for 6D Object Pose Estimation
Recent progress in learning-based object pose estimation paves the way for developing richer object-level world representations. However, the estimators, often trained with out-of-domain data, can suffer performance degradation when deployed in novel environments. To address this problem, we present a SLAM-supported self-training procedure to autonomously improve the robot's object pose estimation ability during navigation. Combining the network predictions with robot odometry, we can build a consistent object-level environment map via pose graph optimization (PGO). Exploiting the state estimates from PGO, we pseudo-label robot-collected RGB images to fine-tune the pose estimators. Unfortunately, it is difficult to model the uncertainty of the estimator predictions, and the unmodeled uncertainty in the data used for PGO can result in low-quality object pose estimates. An automatic covariance tuning method is developed for robust PGO by allowing the measurement uncertainty models to change as part of the optimization process. The formulation permits a straightforward alternating minimization procedure that re-scales covariances analytically and component-wise, enabling more flexible noise modeling for learning-based measurements. We test our method with the deep object pose estimator (DOPE) on the YCB video dataset and in real-world robot experiments. The method achieves significant performance gains in pose estimation, which in return facilitate the success of object SLAM.
State-of-the-art object 6D pose estimators can capture object-level geometric and semantic information in challenging scenes. During robot navigation, an object-based simultaneous localization and mapping (object SLAM) system can use object pose predictions, together with robot odometry, to build a consistent object-level map. However, the estimators, mostly trained with out-of-domain data, often show degraded performance when deployed in novel environments. Annotating target-domain real images for training is tedious and limits the potential for autonomous operations. We propose to collect images during robot navigation, exploit the object SLAM estimates to pseudo-label the data, and fine-tune the pose estimator.
As depicted in Fig. 1, we develop a SLAM-supported self-training procedure for RGB-image-based object pose estimators. During navigation, the robot collects images and deploys a pre-trained model to infer object poses in the scene. Together with noisy robot state measurements from on-board odometric sensors (camera, IMU, lidar, etc.), a pose graph optimization (PGO) problem is formulated to optimize the camera (i.e. robot) and object poses. We leverage the state estimates to pseudo-label the images, generating new training data to fine-tune the initial model, and autonomously improve the robot’s object pose estimation ability.
A major challenge in this procedure is the difficulty of modeling the uncertainty of the learning-based pose measurements. In particular, it is difficult to specify a priori an appropriate (Gaussian) noise model for them as typically required in PGO. For this reason, rather than fixing a potentially poor choice of covariance model, we allow the uncertainty model to change as part of the optimization process. Similar approaches have been explored previously in the context of robust SLAM (e.g. [1, 17]). However, our joint optimization formulation permits a straightforward alternating minimization procedure, where the optimal covariance models are determined analytically. Moreover, our method realizes component-wise covariance re-scaling, allowing us to fit a richer class of noise models.
We test our method on the YCB video dataset and with real-world robot experiments. The system is demonstrated to generate high-quality pseudo labels. The fine-tuned networks show enhanced accuracy and significantly reduced outlier rates, which in return facilitates the success of SLAM algorithms.
In summary, our work makes the following contributions:
A SLAM-aided self-training procedure to pseudo-label robot-collected RGB images and boost object pose estimation performance during object-based navigation.
An automatic covariance tuning (ACT) method that is critical for the success and autonomous operation of the above procedure.
Experimental evaluation on the YCB-v dataset and a new robot-collected dataset.
Learning-based object pose estimators (e.g. [28, 24, 9]) typically require a large amount of training data to succeed, but it can be expensive to acquire high-quality pose-annotated images from real environments. One solution is to generate photo-realistic and domain-randomized synthetic image data for training, but bridging the sim-to-real gap is difficult.
Semi- and self-supervised methods are developed to also exploit unlabeled real images to mitigate the lack of data. These methods typically train the model on synthetic data in a supervised manner and improve its performance on real images by semi- or self-supervised learning [30, 10, 27, 12, 22, 29, 32]. Many recent methods leverage differentiable rendering to develop end-to-end self-supervised pose estimation models, by encouraging similarity between real images and images rendered with estimated poses [27, 12, 22, 29]. Despite their success, these methods are not designed to leverage the video streams collected during robot tasks. Deng et al., instead, collect RGB-D image streams with a robot arm during object manipulation for self-training. They use a pose initialization module to obtain accurate pose estimates in the first frame, and use motion priors from forward kinematics and a particle filter to propagate the pose estimates forward in time.
We propose to collect and pseudo-label RGB images during robot navigation. Instead of frame-to-frame pose tracking, we directly estimate the 3D scene geometry, which requires no accurate first-frame object poses or high-precision motion priors. We use only RGB cameras and optionally other odometric sensors, which are commonly available on robot platforms, for data collection and self-training.
Integrating information from multiple viewpoints can help correct the inaccurate and missed predictions made by single-frame object perception models. This idea is widely applied in image classification, object detection, and pose estimation (e.g. [4, 9]). Pillai et al. develop a SLAM-aware object localization system. They leverage visual SLAM estimates, consisting of camera poses and feature locations, to support the object detection task, leading to strong improvements over single-frame methods.
Beyond that, multi-view object perception models can provide noisy supervision to train single-frame methods. Mitash et al.  conduct multi-view pose estimation to pseudo-label real images for self-training of an object detector. Zeng et al.  perform multi-view object segmentation to generate training data for object pose estimation. Nava et al.  exploit noisy state estimates to assist self-supervised learning of spatial perception models.
Inspired by these works, we propose to use SLAM-aided object pose estimation to generate training data for semi-supervised learning of object pose estimators.
Obtaining robust SLAM estimates is critical for the success of our method. In PGO, the selection of measurement noise covariances relies mostly on empirical estimates of noise statistics. In the presence of outlier-prone measurements (e.g. learning-based) whose uncertainty is hard to quantify, it is infeasible to fix a (Gaussian) noise model for PGO. Adaptive covariance tuning methods (e.g. [19, 25]) have been developed to concurrently estimate the SLAM variables and noise statistics. These methods show generally better performance than using empirical noise models.
Robust M-estimators are widely applied to deal with outlier-corrupted measurements. Minimizing their robust cost functions with the iteratively re-weighted least squares (IRLS) method, the measurement contributions are re-weighted at each step based on their Mahalanobis distances. This down-weights the influence of outliers and de facto re-scales the covariance uniformly. Pfeifer et al. propose to jointly optimize the state variables and covariances, with the covariance components log-regularized. This method leads to a robust M-estimator, closed-form dynamic covariance estimation (cDCE), which can be solved with IRLS. It is more flexible than existing M-estimators since it allows the noise components to be re-weighted differently.
Our joint optimization formulation is similar, with the covariances L1-regularized, and also permits a closed-form component-wise covariance update. Instead of IRLS, we follow an alternating minimization procedure, and eliminate the influence of outliers based on the $\chi^2$ test.
In our method, the object pose estimator is initially trained on (typically synthetic) labeled image data $\mathcal{D}_l = \{(\mathbf{I}, \mathbf{y})\}$, where $\mathbf{I}$ and $\mathbf{T}$ denote the RGB images and the ground truth 6DoF object poses (w.r.t. the camera), and $\mathbf{y}$ represents the ground truth labels. We can map $\mathbf{T}$ to $\mathbf{y}$ via an estimator-dependent function $\mathbf{y} = g(\mathbf{T})$. In this paper, we work with the DOPE estimator (our method is not restricted to using a particular pose estimator), for which $\mathbf{y}$ is the pixel locations of the projected object 3D bounding box and $g$ is the perspective projection function. We apply the initial estimator to make object pose predictions as the robot explores its environment.
During navigation, the robot collects a number of unlabeled RGB images $\mathcal{I}$. The estimator makes noisy object pose measurements (i.e. network predictions) $\mathcal{Z}_o$ on $\mathcal{I}$. Combining $\mathcal{Z}_o$ with camera odometric measurements $\mathcal{Z}_c$ obtained from on-board sensors, a pose graph can be established (see Fig. 1(c)). The latent variables $\mathbf{X}$ incorporate the camera poses $\{\mathbf{x}_t\}$ and the object landmark poses $\{\mathbf{l}_j\}$ (w.r.t. the world frame). Assuming all the measurements are independent, we can decompose the joint likelihood into factors, consisting of odometry factors and object pose measurement factors. Each factor describes the likelihood of a measurement given certain variable assignments. Edges in the graph represent the factorization structure of the likelihood distribution:

$p(\mathcal{Z}_c, \mathcal{Z}_o \mid \mathbf{X}) = \prod_{t} p(\mathbf{z}_t^{c} \mid \mathbf{x}_t, \mathbf{x}_{t+1}) \prod_{t,j} p(\mathbf{z}_{tj}^{o} \mid \mathbf{x}_t, \mathbf{l}_j).$   (1)
We assume the measurement noises are zero-mean and normally distributed, with $\Sigma_k$ denoting the noise covariance matrix for measurement $\mathbf{z}_k$. The optimal assignments of the variables can be obtained by solving the maximum likelihood estimation problem:

$\hat{\mathbf{X}} = \arg\max_{\mathbf{X}} \prod_k p(\mathbf{z}_k \mid \mathbf{X}) = \arg\min_{\mathbf{X}} \sum_k \big\|\mathrm{Log}\big(h_k(\mathbf{X})^{-1}\,\mathbf{z}_k\big)\big\|^2_{\Sigma_k},$   (2)
where Log is the logarithmic map on the SE(3) group, $h_k(\mathbf{X})$ predicts measurement $k$ from the variable assignments, and $\|\cdot\|_{\Sigma}$ is the Mahalanobis distance. We solve (2) with automatic covariance tuning (see Sec. III.C) to obtain robust SLAM estimates $\hat{\mathbf{X}}$.
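For concreteness, the Mahalanobis norm in (2) is what the least-squares solver actually minimizes after whitening each residual by its covariance. A minimal numpy illustration (the residual and diagonal covariance below are made-up examples, not values from the paper):

```python
import numpy as np

def mahalanobis_sq(e, cov):
    """Squared Mahalanobis distance e^T cov^{-1} e of an unwhitened error."""
    return float(e @ np.linalg.solve(cov, e))

# Toy residual and a (made-up) diagonal noise covariance.
e = np.array([0.1, -0.2, 0.05])
cov = np.diag([0.01, 0.04, 0.0025])

# Whitening each component by its standard deviation gives the same cost,
# which is the quantity the nonlinear least-squares solver minimizes.
whitened = e / np.sqrt(np.diag(cov))
assert np.isclose(mahalanobis_sq(e, cov), float(whitened @ whitened))
```

Note how shrinking a component's variance increases that component's contribution to the cost, which is exactly the mechanism the covariance tuning in Sec. III.C exploits.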
Based on the optimal states, we can: (1) identify the inliers among the object pose measurements; (2) compute optimized object-to-camera poses. We compare and combine the two sources to obtain the pseudo ground truth poses $\hat{\mathbf{T}}$ for the images in $\mathcal{I}$ (see Fig. 1(e)). We refer to this pseudo-labeling method as Hybrid labeling, with pseudo labels derived from the mixed sources.
First, we identify the inlier object pose measurements whose squared Mahalanobis distances w.r.t. the optimized states pass the $\chi^2$ test:

$\big\|\mathrm{Log}\big(\hat{\mathbf{T}}_{tj}^{-1}\,\mathbf{z}_{tj}^{o}\big)\big\|^2_{\Lambda_{tj}} \le \chi^2_{6,0.95},$   (3)

where $\chi^2_{6,0.95}$ is the critical value for the $\chi^2$ distribution with 6 degrees of freedom at confidence level 0.95.
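The inlier gate can be sketched as follows (a minimal numpy sketch, not the paper's implementation; the critical value is the standard $\chi^2$ table entry for 6 DoF at 95%, i.e. what `scipy.stats.chi2.ppf(0.95, 6)` returns, hard-coded here):

```python
import numpy as np

# 95% critical value of the chi-square distribution with 6 degrees of freedom.
CHI2_6_095 = 12.592

def is_inlier(residual, cov, crit=CHI2_6_095):
    """Gate a 6DoF pose measurement by the squared Mahalanobis distance
    of its residual w.r.t. the optimized relative pose.

    residual : (6,) unwhitened error (translation + rotation components)
    cov      : (6, 6) measurement noise covariance
    """
    d2 = float(residual @ np.linalg.solve(cov, residual))
    return d2 <= crit

# Small residuals under unit covariance pass the gate ...
assert is_inlier(np.full(6, 0.1), np.eye(6))
# ... while a gross outlier is rejected.
assert not is_inlier(np.full(6, 2.0), np.eye(6))
```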
Second, we directly compute pseudo ground truth poses from the optimal states. For the image collected at time $t$, the optimized object-to-camera pose for object $j$ is computed as $\hat{\mathbf{T}}_{tj} = \hat{\mathbf{x}}_t^{-1}\,\hat{\mathbf{l}}_j$. However, both the inlier predictions and the optimized poses can be noisy, may deviate visually from the target objects in the images, and can therefore hurt training.
Therefore, we employ a pose evaluation module, as proposed in prior work, to quantify the visual coherence between pseudo labels and images (see Fig. 1(f)). Each RGB image is compared with its corresponding rendered image based on the pseudo ground truth pose. The regions of interest (RoIs) in the two images are passed into a pre-trained auto-encoder from PoseRBPF, and the cosine similarity of their feature embeddings is computed. We abstract the process as a scoring function $s(\cdot)$, where the dependence on object, camera and image information is made implicit.
We contrast the PGO-computed pose's score with the inlier prediction's score for every target object in the unlabeled images, and pick the pose with the higher score as the pseudo ground truth pose (see Sec. IV.A for details). We assemble the pseudo-labeled data as $\mathcal{D}_p = \{(\mathbf{I}, \hat{\mathbf{y}})\}$, where again the labels $\hat{\mathbf{y}}$ are derived from the pseudo ground truth poses via the estimator-dependent mapping. We then fine-tune the object pose estimator with $\mathcal{D}_p$ (see Fig. 1(g)).
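The selection logic can be sketched in a few lines (our own sketch: the function name, the `None` convention for missed detections, and the per-source thresholds `tau_inlier`/`tau_pgo` are hypothetical, chosen to mirror the description above):

```python
def hybrid_label(inlier_score, pgo_score, tau_inlier, tau_pgo):
    """Pick the pseudo ground-truth source for one object in one image.

    inlier_score: pose-evaluation score of the inlier prediction, or None
                  when the estimator missed the object (the PGO pose may
                  still supply a hard example).
    Returns "inlier", "pgo", or None (object left unlabeled)."""
    candidates = []
    if inlier_score is not None and inlier_score >= tau_inlier:
        candidates.append((inlier_score, "inlier"))
    if pgo_score >= tau_pgo:
        candidates.append((pgo_score, "pgo"))
    # Higher scorer wins; no candidate above threshold means no label.
    return max(candidates)[1] if candidates else None

assert hybrid_label(0.90, 0.80, 0.50, 0.60) == "inlier"
assert hybrid_label(None, 0.80, 0.50, 0.60) == "pgo"   # hard example
assert hybrid_label(0.40, 0.50, 0.50, 0.60) is None    # both below threshold
```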
Our method naturally generates training samples on which the initial estimator fails to make reasonable predictions, i.e. hard examples. In particular, when the initial estimator makes an outlier prediction or misses a prediction, PGO can potentially recover an object pose that is visually consistent with the RGB image. The challenging examples, as we demonstrate in Sec. IV, are the key to achieving significant performance gain during self-training.
Since the object pose measurements are derived from learning-based models, it is difficult to fix a (Gaussian) uncertainty model for them a priori, as typically required in PGO. We propose to automate the covariance tuning process by incorporating the noise covariances as variables in a joint optimization. As we show, our formulation leads to a simple and robust PGO method that allows us to alternate between solving the PGO and determining the optimal noise models in closed form.
For simplicity, we rewrite the PGO cost function in (2) as:

$F(\mathbf{X}) = \sum_i \mathbf{e}_i^\top \Sigma_i^{-1} \mathbf{e}_i + \sum_k \mathbf{r}_k^\top \Lambda_k^{-1} \mathbf{r}_k,$   (4)

where $\mathbf{e}_i$ and $\mathbf{r}_k$ are the unwhitened errors of the odometry and object pose measurements, with noise covariances $\Sigma_i$ and $\Lambda_k$, respectively.
Given that the odometry measurements are typically more reliable, we only update the object pose noise covariances $\Lambda_k$ at the covariance optimization step. Each time after optimizing the SLAM variables, we solve for the per-measurement noise models that further minimize the optimization loss, and re-solve the SLAM variables with them. To avoid the trivial solution of infinitely large covariances $\Lambda_k \to \infty$, we apply L1 regularization on the covariance matrices. The loss function for the joint optimization is:

$L(\mathbf{X}, \{\Lambda_k\}) = \sum_i \mathbf{e}_i^\top \Sigma_i^{-1} \mathbf{e}_i + \sum_k \mathbf{r}_k^\top \Lambda_k^{-1} \mathbf{r}_k + \lambda \sum_k \|\Lambda_k\|_1.$   (5)
For simplicity, we assume the covariance matrices are diagonal, i.e. $\Lambda_k = \mathrm{diag}(\sigma_{k1}^2, \ldots, \sigma_{k6}^2)$, and the joint loss reduces to:

$L = \sum_i \mathbf{e}_i^\top \Sigma_i^{-1} \mathbf{e}_i + \sum_k \sum_{j=1}^{6} \left( \frac{r_{kj}^2}{\sigma_{kj}^2} + \lambda\,\sigma_{kj}^2 \right),$   (6)

where $r_{kj}$ is the $j$-th entry of the unwhitened error $\mathbf{r}_k$.
For fixed $\mathbf{X}$ (and hence fixed residuals $r_{kj}$), optimizing (6) w.r.t. $\sigma_{kj}^2$ amounts to minimizing a convex objective over the non-negative reals. Moreover, we know $\sigma_{kj}^2 = 0$ is never a minimizer for $r_{kj} \neq 0$. Thus, a global optimal solution is obtained when:

$\frac{\partial L}{\partial \sigma_{kj}^2} = -\frac{r_{kj}^2}{\sigma_{kj}^4} + \lambda = 0,$   (7)

evaluating which we obtain the extremum on $(0, \infty)$:

$\sigma_{kj}^2 = \frac{|r_{kj}|}{\sqrt{\lambda}}.$   (8)
With $\frac{\partial^2 L}{\partial (\sigma_{kj}^2)^2} = \frac{2\,r_{kj}^2}{\sigma_{kj}^6} > 0$ being valid for all $\sigma_{kj}^2 > 0$, we can confirm that this extremum is a global minimum.
Therefore we can express the covariance update rule at iteration $\tau+1$ as:

$\sigma_{kj}^{2,(\tau+1)} = \kappa\,\big|r_{kj}^{(\tau)}\big|,$   (9)

where we factor out $\kappa = 1/\sqrt{\lambda}$ in (8) for the convenience of tuning.
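As a quick numeric sanity check (ours, not from the paper), the closed-form per-component minimizer of one diagonal term in (6), covariance proportional to the absolute residual, can be verified against a dense grid search; the residual and regularization values below are arbitrary:

```python
import numpy as np

def per_component_loss(s2, r, lam):
    """One diagonal term of the joint loss (6): r^2 / sigma^2 + lam * sigma^2."""
    return r**2 / s2 + lam * s2

r, lam = 0.3, 4.0
s2_star = abs(r) / np.sqrt(lam)   # closed-form minimizer |r| / sqrt(lam)

# Dense grid search over sigma^2 > 0 recovers the same minimizer.
grid = np.linspace(1e-4, 2.0, 20001)
s2_grid = grid[np.argmin(per_component_loss(grid, r, lam))]
assert np.isclose(s2_grid, s2_star, atol=1e-3)
```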
Since the covariance optimization step admits a closed-form solution, the joint optimization reduces to iteratively solving PGO with re-computed covariances. The selection of all the noise models reduces to tuning $\kappa$. According to our experiments, it is feasible to set $\kappa$ to a constant for consistent performance across different cases (we use the same value of $\kappa$ in all the tests).
The algorithm is summarized in Alg. 1. We use the Levenberg-Marquardt (LM) algorithm to solve the Gaussian PGO. For better performance, we also identify outliers at each iteration using the $\chi^2$ test (3) and set their noise covariances to a very large value to rule out their influence. The optimization is terminated when the relative decrease in $L$ is sufficiently small or the maximum number of iterations is reached. In Appendix A, we show the algorithm monotonically improves the joint loss $L$.
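For illustration only, the alternating scheme can be sketched on a toy 1D problem (our own scalar analogue, not Alg. 1 itself: the state is a single scalar, the $\chi^2$ critical value is the standard 1-DoF entry 3.841, and all names are ours):

```python
import numpy as np

CHI2_1_095 = 3.841   # chi-square 95% critical value, 1 DoF (toy scalar case)

def act_estimate(z, lam=1.0, iters=10):
    """Toy 1D analogue of the alternating procedure: estimate a scalar
    state x from measurements z with unknown, outlier-corrupted noise.

    Alternates (a) weighted least squares under the current covariances and
    (b) the closed-form update sigma^2 = |r| / sqrt(lam), with chi-square-
    gated outliers assigned a very large covariance."""
    x = np.mean(z)                            # initial guess
    for _ in range(iters):
        r = z - x                             # unwhitened residuals
        s2 = np.abs(r) / np.sqrt(lam)         # closed-form covariance update
        s2 = np.maximum(s2, 1e-9)             # guard against zero residuals
        s2[r**2 / s2 > CHI2_1_095] = 1e9      # deactivate detected outliers
        x = np.sum(z / s2) / np.sum(1.0 / s2) # re-solve the weighted LS
    return x

rng = np.random.default_rng(0)
z = np.concatenate([1.0 + 0.01 * rng.standard_normal(50), [50.0, -40.0]])
assert abs(act_estimate(z) - 1.0) < 0.1   # gross outliers are suppressed
```

The weighted least-squares step stands in for the LM solve of the full PGO; the per-measurement covariance update and the gating mirror the two per-iteration operations described above.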
The update rule (9) reveals our implicit assumption that a high-residual measurement component (i.e. one of xyz-rpy) is likely drawn from a high-variance normal distribution. The recomputed covariances down-weight high-residual measurements, which is similar in spirit to robust M-estimation. We show in Appendix B that, with isotropic noise models, our method reduces to using the L1 robust kernel.
However, robust M-estimation is typically solved with the IRLS method, where the measurement losses are re-weighted based on their Mahalanobis distances (see (17)). In comparison, our method re-weights the per-component losses differently, enabling us to fit a richer class of noise models for 6DoF PGO, since the different components often exhibit different noise characteristics.
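The contrast can be made concrete with a small numeric example (ours, assuming a unit base covariance; the residual vector with one corrupted component, e.g. yaw, is made up):

```python
import numpy as np

# One 6DoF residual with a single bad component (say, yaw).
r = np.array([0.01, 0.01, 0.01, 0.5, 0.01, 0.01])

# IRLS with an isotropic robust kernel derives one weight per measurement
# from the overall Mahalanobis distance: every component is down-weighted
# by the same factor, good and bad alike.
w_uniform = np.full(6, 1.0 / np.linalg.norm(r))

# The component-wise update sigma_j^2 = |r_j| / sqrt(lam) yields weights
# 1 / sigma_j^2 that differ per component, suppressing only the bad one.
lam = 1.0
w_component = np.sqrt(lam) / np.abs(r)

assert np.allclose(w_uniform, w_uniform[0])  # uniform across components
assert w_component[3] < w_component[0]       # only yaw is down-weighted
```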
Our method is tested on (1) the YCB video (YCB-v) dataset and (2) a new ground robot experiment. On the YCB-v dataset (Sec. IV.A), we leverage the image streams in the training sets to build per-video object-level maps for self-training. The Hybrid labeling method is compared with two other baseline methods to verify the effectiveness of the different modules. In the ground robot experiment (Sec. IV.B), we apply the method to longer sequences, circumnavigating selected objects, and demonstrate its potential for use in object-based navigation.
Our method is implemented in Python. We use the NVISII toolkit to generate synthetic image data. The training and pose inference programs are adapted from the code available in the DOPE GitHub repository. Every network is initially trained for 60 epochs and fine-tuned for 20 epochs, with a batch size of 64 and a learning rate of 0.0001. We solve PGO problems with the GTSAM library. We run the data generation and training programs on 2 NVIDIA Volta V100 GPUs, and the other code on an Intel Core i7-9850H@2.6GHz CPU and an NVIDIA Quadro RTX 3000 GPU.
We test our method with DOPE estimators  for three YCB objects: 003_cracker_box, 004_sugar_box and 010_potted_meat_can. Each object appears in 20 YCB videos (training + testing). They have respectively 26689, 22528 and 27050 training images (dashed lines in Fig. 2(a)) from 17, 15, and 17 training videos.
We use 60k NVISII-generated synthetic images to train the initial DOPE estimators for the 3 objects (the 010_potted_meat_can data are the same as those used in previous work). The models are applied to infer object poses on the YCB-v images. We employ the ORB-SLAM3 RGB-D module (w/o loop closing) to obtain camera odometry on these videos.
Combining the measurements, we solve per-video PGO problems to build object-level maps for pseudo-labeling. We initialize the camera poses with the odometry chain, the object poses with averaged pose predictions, and the covariances with a common initial value. We apply different robust optimization methods to solve all 60 PGO problems (see Tab. I; due to space limitations, we report results for the first 7 of the 20 YCB-v sequences and provide complete statistics in our GitHub repo; we use the default parameters and the same noise covariances for the robust M-estimators, and implement cDCE by replacing (8) with equation (16) of the original paper). For comparison purposes only, we compute pseudo-labels for all the YCB-v images directly from the optimal states, and compare the methods via label errors, i.e. how much the projected object bounding boxes deviate from the ground truth. With the benefit of component-wise covariance re-scaling and the mitigation of outlier effects, our ACT method achieves the lowest errors in far more videos (see Tab. I, Col. 8). It performs stably across sequences with a fixed initial guess and a constant regularization coefficient $\kappa$. Thus, we pseudo-label the YCB-v training images based on the results of our method.
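The label-error metric, the pixel deviation of the projected 3D bounding-box corners, can be sketched as follows (our own sketch; the intrinsics, cube size, and pose perturbation below are made-up toy values):

```python
import itertools
import numpy as np

def project_points(pts, R, t, K):
    """Perspective-project 3D points given an object-to-camera pose (R, t)
    and a pinhole intrinsic matrix K; returns N x 2 pixel coordinates."""
    p_cam = pts @ R.T + t                 # transform into the camera frame
    uv = p_cam @ K.T                      # apply intrinsics
    return uv[:, :2] / uv[:, 2:3]         # perspective division

def bbox_label_error(corners, pose_label, pose_gt, K):
    """Mean pixel deviation between the object's 3D bounding-box corners
    projected under a pseudo-label pose and under the ground-truth pose."""
    uv_l = project_points(corners, *pose_label, K)
    uv_g = project_points(corners, *pose_gt, K)
    return float(np.mean(np.linalg.norm(uv_l - uv_g, axis=1)))

# Toy setup: a 10 cm cube 1 m in front of a 600 px focal-length camera.
K = np.array([[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]])
corners = 0.05 * np.array(list(itertools.product([-1.0, 1.0], repeat=3)))
gt = (np.eye(3), np.array([0.0, 0.0, 1.0]))
label = (np.eye(3), np.array([0.01, 0.0, 1.0]))  # 1 cm translation error

assert bbox_label_error(corners, gt, gt, K) == 0.0
assert 5.5 < bbox_label_error(corners, label, gt, K) < 6.5  # roughly 6 px
```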
To evaluate the efficacy of the different modules, we compare our Hybrid method with two other baseline labeling methods: Inlier and PoseEval. Inlier uses the PGO results only as an inlier measurement filter: the raw pose predictions that are geometrically congruent with the other measurements, i.e. those passing the $\chi^2$ test (3), are selected for labeling. PoseEval extracts visually coherent pose predictions by thresholding the similarity scores from the pose evaluation module.
The Hybrid method ensures both spatial and visual consistency of the pseudo-labeled data. For a given target object in an image, we compare the pose evaluation score of the inlier prediction (if available) with that of the optimized object pose. The higher scorer, if beyond a per-source threshold, is picked for labeling. The threshold for PGO-generated labels is set higher than that for inlier predictions, because the PGO-generated labels, not being directly derived from the RGB images, are prone to misalignment. Thus, the Hybrid pseudo-labeled data consist of high-score inliers, PGO-generated easy examples, and hard examples (on which the initial estimator fails); the 3 components are colored differently on the Hybrid bars in Fig. 2(a). For the Hybrid and Inlier modes, we also exclude from data generation the YCB-v sequences with measurement outlier rates higher than 20%.
The statistics of the pseudo-labeled data are reported in Fig. 2. The Hybrid and Inlier data are in general very accurate, with the $\chi^2$ test proving an effective inlier filter. The PoseEval data, however, regardless of their size, are noisier and outlier-corrupted, so we cannot rely on the pose evaluation test alone to generate outlier-free labels.
Further, we evaluate the DOPE estimators, fine-tuned with the pseudo-labeled data, on the YCB-v test sets. We adopt the average distance (ADD) metric, and present the accuracy-threshold curves in Fig. 3. All the methods achieve considerable improvements over the initial model (Synthetic), indicating the significance of in-domain data, even when noisy and outlier-corrupted, for training pose estimators, as also reported in [27, 22, 12, 32, 29]. However, a large performance gap remains from the model trained with YCB-v ground truth data, due to label noise and the limited data size. Our Hybrid method consistently outperforms the other baselines, even though its data have statistics similar to the Inlier data. We thus believe the performance gain mainly comes from the presence of hard examples.
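A minimal sketch of the ADD metric and the accuracy-threshold curve it is evaluated with (our own illustration; the model points and poses are toy values, not YCB data):

```python
import numpy as np

def add_metric(model_pts, R_est, t_est, R_gt, t_gt):
    """Average distance (ADD): mean Euclidean distance between the model
    points transformed by the estimated pose and by the ground-truth pose."""
    p_est = model_pts @ R_est.T + t_est
    p_gt = model_pts @ R_gt.T + t_gt
    return float(np.mean(np.linalg.norm(p_est - p_gt, axis=1)))

def accuracy_curve(add_values, thresholds):
    """Accuracy-threshold curve: fraction of frames whose ADD falls under
    each threshold; its normalized area gives the ADD AUC."""
    add_values = np.asarray(add_values)
    return np.array([float(np.mean(add_values <= th)) for th in thresholds])

pts = np.array([[0.05, 0.0, 0.0], [0.0, 0.05, 0.0], [0.0, 0.0, 0.05]])
identity = np.eye(3)
# A pure 2 cm translation error yields an ADD of exactly 0.02 m.
assert np.isclose(add_metric(pts, identity, np.array([0.02, 0.0, 0.0]),
                             identity, np.zeros(3)), 0.02)
assert np.allclose(accuracy_curve([0.01, 0.03, 0.05], [0.02, 0.04, 0.06]),
                   [1 / 3, 2 / 3, 1.0])
```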
Beyond the first iteration of self-training, we attempt to pseudo-label the same data with the improved pose estimator and further fine-tune the model. However, the twice-fine-tuned estimators fail to show significant performance enhancement and can even worsen (the ADD AUC values for the twice-fine-tuned estimators, by Hybrid labeling, are 47.0, 76.0, and 53.9, respectively). We believe this is the result of model over-fitting (observed after 20 epochs during the first fine-tuning) and the accumulation of label noise.
In return, we examine the effect of the fine-tuned estimator models on object SLAM performance. The YCB-v test sequences 0049 and 0059 are selected for evaluation, since they contain 2 of the 3 selected YCB objects. We solve the PGOs with our ACT method and the LM algorithm. The results are reported in Fig. 4 and Tab. II.
The outlier measurement rates are greatly reduced on both sequences, making it easier for the LM algorithm to succeed. For our ACT method, however, the SLAM accuracy improves on 0059 but slightly degrades on 0049. This is because our method is already outlier-robust, so decreasing outliers does not always lead to improved SLAM performance, especially in low-outlier regimes.
As illustrated in Fig. 5, we control a Jackal robot to circle around two target objects, 003_cracker_box and 010_potted_meat_can, and collect stereo RGB images with a ZED2 camera. For each object, two 4-minute sequences are recorded, one for self-training and the other for testing. We obtain the ground truth camera trajectory from a Vicon motion capture system and the ground truth object poses from AprilTag detections. The camera odometry is computed with the SVO2 stereo module. We infer the object poses from the left-camera RGB images using the same initial estimator as in Sec. IV.A.
Similar to the YCB-v experiment, we build an object-level map from the measurements and pseudo-label the left-camera images with the Hybrid method. For the two objects, 648 (out of 3120) and 950 (out of 2657) images are pseudo-labeled, of which 1.5% and 2.8% are hard examples. After fine-tuning, we evaluate the estimator models on both our experiment's test sequences and the YCB-v test sets. On our own test sequences, we adopt the reprojection error of the object 3D bounding box as the evaluation metric. The accuracy-threshold curves are presented in Fig. 6. The AUCs for the curves in both tests are reported in Tab. III.
The estimators fine-tuned with the robot-collected data show elevated performance in a similar environment, and achieve slight enhancements in the YCB-v test scenes. This indicates that our method generalizes well to robot navigation scenarios, and also reiterates that real annotated images are precious and effective for mitigating the domain gap.
To study the value of the enhanced pose estimators for object SLAM, we solve the PGOs on our test sequences and the YCB-v test sequence 0059, with the initial and fine-tuned models. The SLAM estimates solved by our ACT method and the LM algorithm are presented in Fig. 7, and the trajectory errors are reported in Tab. IV.
Similarly, since our method (ACT) is outlier-robust, the reduced outlier rates do not always bring about improved SLAM accuracy. However, in all cases, it facilitates the success of the non-robust LM algorithm.
A SLAM-aided semi-supervised learning method for object pose estimation is developed to mutually boost the performance of object pose inference and object SLAM during robot navigation. For the SLAM optimization, we propose to automate the tuning of the noise covariances by jointly optimizing the SLAM variables and uncertainty models, leading to a flexible and easy-to-implement robust PGO method. We demonstrate the effectiveness of our method on the YCB-v dataset and in ground robot experiments. Our method can mine high-quality pseudo-labeled data from noisy and outlier-corrupted measurements. The SLAM-supported self-training, even with noisy supervisory signals, considerably enhances the performance of pose estimators trained with synthetic data. The fine-tuned estimator models, with reduced outlier rates, in return make object SLAM more effective.
We prove that our automatic covariance tuning method monotonically improves the joint loss $L$ defined in (5). At iteration $\tau+1$, we have the variable assignments $\mathbf{X}^{\tau}$ and the noise covariances $\{\Lambda_k^{\tau}\}$ from iteration $\tau$. Solving the PGO, the LM algorithm ensures $F(\mathbf{X}^{\tau+1}) \le F(\mathbf{X}^{\tau})$ under the covariances $\{\Lambda_k^{\tau}\}$. Since the extra regularization term in $L$ is not a function of $\mathbf{X}$, we have:

$L(\mathbf{X}^{\tau+1}, \{\Lambda_k^{\tau}\}) \le L(\mathbf{X}^{\tau}, \{\Lambda_k^{\tau}\}).$   (10)
We have also shown, after (8), that $\{\Lambda_k^{\tau+1}\}$ computed by (9) is a global minimizer of $L(\mathbf{X}^{\tau+1}, \cdot)$. Thus, we have:

$L(\mathbf{X}^{\tau+1}, \{\Lambda_k^{\tau+1}\}) \le L(\mathbf{X}^{\tau+1}, \{\Lambda_k^{\tau}\}).$   (11)
Combining the two inequalities, we obtain:

$L(\mathbf{X}^{\tau+1}, \{\Lambda_k^{\tau+1}\}) \le L(\mathbf{X}^{\tau}, \{\Lambda_k^{\tau}\}).$   (12)
With this being valid for all iterations, we have the chain of inequalities:

$L(\mathbf{X}^{T}, \{\Lambda_k^{T}\}) \le \cdots \le L(\mathbf{X}^{1}, \{\Lambda_k^{1}\}) \le L(\mathbf{X}^{0}, \{\Lambda_k^{0}\}),$   (13)
which completes the proof.
We show that our automatic covariance tuning method, when the noise models are assumed to be isotropic, is equivalent to using the L1 robust M-estimator. With isotropic noises, i.e. $\Lambda_k = \sigma_k^2 \mathbf{I}_6$, the joint loss in (5) reduces to:

$L = \sum_i \mathbf{e}_i^\top \Sigma_i^{-1} \mathbf{e}_i + \sum_k \left( \frac{\|\mathbf{r}_k\|^2}{\sigma_k^2} + 6\lambda\,\sigma_k^2 \right).$   (14)

Evaluating $\partial L / \partial \sigma_k^2 = 0$ yields the new update rule:

$\sigma_k^{2,(\tau+1)} = \frac{\|\mathbf{r}_k^{(\tau)}\|}{\sqrt{6\lambda}}.$   (15)
On the other hand, we can apply the L1 robust loss to the object pose measurement factors to minimize the PGO cost (4). The robust PGO cost can be expressed as:

$C(\mathbf{X}) = \sum_i \mathbf{e}_i^\top \Sigma_i^{-1} \mathbf{e}_i + \sum_k \rho\big(\|\mathbf{r}_k\|_{\Lambda_k}\big),$   (16)

where $\rho(s) = |s|$ is the L1 robust kernel. The robust cost is typically minimized with the IRLS method, by locally matching the gradients of (16) with a sequence of weighted least squares problems. The local least squares formulation at iteration $\tau+1$ can be expressed as:

$\min_{\mathbf{X}} \; \sum_i \mathbf{e}_i^\top \Sigma_i^{-1} \mathbf{e}_i + \sum_k w\big(\|\mathbf{r}_k^{(\tau)}\|_{\Lambda_k}\big)\, \mathbf{r}_k^\top \Lambda_k^{-1} \mathbf{r}_k,$   (17)
where $w(s) = \rho'(s)/s = 1/|s|$ is the weight function. Under the isotropic noise assumption, i.e. $\Lambda_k = \sigma_k^2 \mathbf{I}_6$, we can absorb the weight function into the covariance matrix and rewrite (17) as:

$\min_{\mathbf{X}} \; \sum_i \mathbf{e}_i^\top \Sigma_i^{-1} \mathbf{e}_i + \sum_k \mathbf{r}_k^\top \big(\Lambda_k^{(\tau+1)}\big)^{-1} \mathbf{r}_k,$   (18)
where the covariance matrix is de facto re-scaled iteratively by:

$\sigma_k^{2,(\tau+1)} = \sigma_k^2\,\big\|\mathbf{r}_k^{(\tau)}\big\|_{\Lambda_k} = \sigma_k\,\big\|\mathbf{r}_k^{(\tau)}\big\|,$   (19)

which coincides with the update rule (15) up to a constant factor.
The authors acknowledge Jonathan Tremblay and other NVIDIA developers for providing consultation on training DOPE networks and generating synthetic data. The authors also acknowledge the MIT SuperCloud and Lincoln Laboratory Supercomputing Center for HPC resources that have contributed to the results reported within this paper.
ORB-SLAM3: An accurate open-source library for visual, visual-inertial, and multimap SLAM. IEEE Transactions on Robotics 37(6), pp. 1874–1890. Cited by: §IV-A.
Asian Conference on Computer Vision, pp. 548–562. Cited by: §IV-A.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1352–1359. Cited by: §I.
PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes. arXiv preprint arXiv:1711.00199. Cited by: §I, §II-A.
Multi-view self-supervised deep learning for 6D pose estimation in the Amazon Picking Challenge. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1386–1393. Cited by: §II-A, §II-B.