SLAM-Supported Semi-Supervised Learning for 6D Object Pose Estimation
Recent progress in learning-based object pose estimation paves the way for developing richer object-level world representations. However, the estimators, often trained with out-of-domain data, can suffer performance degradation as deployed in novel environments. To address the problem, we present a SLAM-supported self-training procedure to autonomously improve robot object pose estimation ability during navigation. Combining the network predictions with robot odometry, we can build a consistent object-level environment map via pose graph optimization (PGO). Exploiting the state estimates from PGO, we pseudo-label robot-collected RGB images to fine-tune the pose estimators. Unfortunately, it is difficult to model the uncertainty of the estimator predictions. The unmodeled uncertainty in the data used for PGO can result in low-quality object pose estimates. An automatic covariance tuning method is developed for robust PGO by allowing the measurement uncertainty models to change as part of the optimization process. The formulation permits a straightforward alternating minimization procedure that rescales covariances analytically and component-wise, enabling more flexible noise modeling for learning-based measurements. We test our method with the deep object pose estimator (DOPE) on the YCB video dataset and in real-world robot experiments. The method achieves significant performance gains in pose estimation and, in return, facilitates the success of object SLAM.
State-of-the-art object 6D pose estimators can capture object-level geometric and semantic information in challenging scenes [20]. During robot navigation, an object-based simultaneous localization and mapping (object SLAM) system can use object pose predictions, together with robot odometry, to build a consistent object-level map (e.g. [21]). However, the estimators, mostly trained with out-of-domain data, often show degraded performance when deployed in novel environments. Annotating target-domain real images for training is tedious and limits the potential for autonomous operation. We propose to collect images during robot navigation, exploit the object SLAM estimates to pseudo-label the data, and fine-tune the pose estimator.
As depicted in Fig. 1, we develop a SLAM-supported self-training procedure for RGB-image-based object pose estimators. During navigation, the robot collects images and deploys a pre-trained model to infer object poses in the scene. Together with noisy robot state measurements from onboard odometric sensors (camera, IMU, lidar, etc.), a pose graph optimization (PGO) problem is formulated to optimize the camera (i.e. robot) and object poses. We leverage the state estimates to pseudo-label the images, generating new training data to fine-tune the initial model and autonomously improve the robot's object pose estimation ability.
A major challenge in this procedure is the difficulty of modeling the uncertainty of the learning-based pose measurements. In particular, it is difficult to specify a priori an appropriate (Gaussian) noise model for them, as typically required in PGO. For this reason, rather than fixing a potentially poor choice of covariance model, we allow the uncertainty model to change as part of the optimization process. Similar approaches have been explored previously in the context of robust SLAM (e.g. [1, 17]). However, our joint optimization formulation permits a straightforward alternating minimization procedure, where the optimal covariance models are determined analytically. Moreover, our method realizes component-wise covariance rescaling, allowing us to fit a richer class of noise models.
We tested our system with the deep object pose estimator (DOPE) [24] on the YCB video dataset [28] and in real-world robot experiments. The system is demonstrated to generate high-quality pseudo-labels. The fine-tuned networks show enhanced accuracy and significantly reduced outlier rates, which in return facilitates the success of SLAM algorithms.
In summary, our work makes the following contributions:
A SLAM-aided self-training procedure to pseudo-label robot-collected RGB images and boost object pose estimation performance during object-based navigation.
An automatic covariance tuning (ACT) method that is critical for the success and autonomous operation of the above procedure.
Experimental evaluation on the YCBv dataset and a new robot-collected dataset.
Learning-based object pose estimators (e.g. [28, 24, 9]) typically require a large amount of training data to succeed, but it can be expensive to acquire high-quality pose-annotated images from real environments. One solution is to generate photorealistic and domain-randomized synthetic image data for training (e.g. [14]), but bridging the sim-to-real gap is difficult.
Semi- and self-supervised methods have been developed to also exploit unlabeled real images to mitigate the lack of data. These methods typically train the model on synthetic data in a supervised manner and improve its performance on real images by semi- or self-supervised learning [30, 10, 27, 12, 22, 29, 32]. Many recent methods leverage differentiable rendering to develop end-to-end self-supervised pose estimation models, by encouraging similarity between real images and images rendered with estimated poses [27, 12, 22, 29]. Despite their success, these methods are not designed to leverage the video streams collected during robot tasks. Deng et al. [5], instead, collect RGB-D image streams with a robot arm during object manipulation for self-training. They use a pose initialization module to obtain accurate pose estimates in the first frame, and use motion priors from forward kinematics and a particle filter [4] to propagate the pose estimates forward in time. We propose to collect and pseudo-label RGB images during robot navigation. Instead of frame-to-frame pose tracking, we directly estimate the 3D scene geometry, which requires no accurate first-frame object poses or high-precision motion priors. We use only RGB cameras and, optionally, other odometric sensors, which are mostly available on a robot platform, for data collection and self-training.
Integrating information from multiple viewpoints can help correct the inaccurate and missed predictions made by single-frame object perception models. This idea is widely applied in image classification (e.g. [31]), object detection (e.g. [18]) and pose estimation (e.g. [4, 9]). Pillai et al. [18] develop a SLAM-aware object localization system. They leverage visual SLAM estimates, consisting of camera poses and feature locations, to support the object detection task, leading to strong improvements over single-frame methods.
Beyond that, multi-view object perception models can provide noisy supervision to train single-frame methods. Mitash et al. [13] conduct multi-view pose estimation to pseudo-label real images for self-training of an object detector. Zeng et al. [30] perform multi-view object segmentation to generate training data for object pose estimation. Nava et al. [15] exploit noisy state estimates to assist self-supervised learning of spatial perception models.
Inspired by these works, we propose to use SLAM-aided object pose estimation to generate training data for semi-supervised learning of object pose estimators.
Obtaining robust SLAM estimates is critical for the success of our method. In PGO, the selection of measurement noise covariances relies mostly on empirical estimates of noise statistics. In the presence of outlier-prone measurements (e.g. learning-based ones) whose uncertainty is hard to quantify, it is infeasible to fix a (Gaussian) noise model for PGO. Adaptive covariance tuning methods (e.g. [19, 25]) have been developed to concurrently estimate the SLAM variables and noise statistics. These methods generally show better performance than using empirical noise models.
Robust M-estimators are widely applied to deal with outlier-corrupted measurements [11]. When their robust cost functions are minimized with the iteratively reweighted least squares (IRLS) method, the measurement contributions are reweighted at each step based on their Mahalanobis distances. This down-weights the influence of outliers and de facto rescales the covariance uniformly [1]. Pfeifer et al. [17] propose to jointly optimize state variables and covariances, with the covariance components log-regularized. This method leads to a robust M-estimator, closed-form dynamic covariance estimation (cDCE), which can be solved with IRLS. It is more flexible than existing M-estimators since it allows the noise components to be reweighted differently.
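As background, one IRLS iteration reweights each measurement by w(r) = ρ'(r)/r, evaluated at its current whitened residual norm. A minimal sketch with the Huber kernel (our own illustration, not from the paper; the tuning constant 1.345 is the conventional default):

```python
def huber_weight(r, k=1.345):
    """IRLS weight for the Huber kernel: full weight inside the band |r| <= k,
    down-weighted by k/|r| outside it."""
    r = abs(r)
    return 1.0 if r <= k else k / r

# A low residual keeps full weight; a gross outlier is strongly down-weighted.
w_in = huber_weight(0.5)    # -> 1.0
w_out = huber_weight(10.0)  # -> 0.1345
```

Note that the weight is a single scalar per measurement, so all six pose components are rescaled uniformly; this is the limitation the component-wise methods discussed below address.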
Our joint optimization formulation is similar, with the covariances L1-regularized, and also permits closed-form component-wise covariance updates. Instead of IRLS, we follow an alternating minimization procedure, and eliminate the influence of outliers based on the χ² test.
In our method, the object pose estimator is initially trained on (typically synthetic) labeled image data D_l = {(I_i, Y_i)}, where I_i denotes the RGB images, Y_i the ground truth labels, and T_i the ground truth 6-DoF object poses (w.r.t. the camera). We can map T to Y via an estimator-dependent function g, i.e. Y = g(T). In this paper, we work with the DOPE [24] estimator^2 (^2 Our method is not restricted to using a particular pose estimator.), for which Y is the pixel locations of the projected object 3D bounding box corners, and g is the perspective projection function. We apply the initial estimator to make object pose predictions as the robot explores its environment.
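For a DOPE-style label, the mapping from a pose to keypoints amounts to projecting the eight corners of the object's 3D bounding box through a pinhole model. A minimal sketch (the intrinsics, pose, and box dimensions below are illustrative, not from the paper):

```python
import numpy as np

def project_bbox_corners(T_co, dims, K):
    """Project the 8 corners of an object's 3D bounding box into the image.

    T_co : 4x4 object-to-camera pose; dims : (w, h, d) box size in meters;
    K : 3x3 camera intrinsics. Returns an (8, 2) array of pixel locations.
    """
    w, h, d = dims
    # Corners of the axis-aligned box in the object frame.
    corners = np.array([[sx * w / 2, sy * h / 2, sz * d / 2]
                        for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
    # Transform into the camera frame.
    cam = (T_co[:3, :3] @ corners.T + T_co[:3, 3:4]).T
    # Perspective projection (assumes all corners lie in front of the camera).
    uv = (K @ cam.T).T
    return uv[:, :2] / uv[:, 2:3]

# Toy example: identity rotation, box centered 1 m in front of the camera.
K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
T = np.eye(4); T[2, 3] = 1.0
keypoints = project_bbox_corners(T, (0.2, 0.1, 0.05), K)
```

The same projection applies whether the pose comes from the network or from PGO, which is what makes both sources usable for labeling.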
During navigation, the robot collects a number of unlabeled RGB images D_u. The estimator makes noisy object pose measurements (i.e. network predictions) {z_k} on D_u. Combining {z_k} with camera odometric measurements {u_t} obtained from onboard sensors, a pose graph G = (V, E) can be established (see Fig. 1(c)). V denotes the latent variables X, incorporating camera poses {x_t} and object landmark poses {l_j} (w.r.t. the world). Assuming all the measurements are independent, we can decompose the joint measurement likelihood into factors, consisting of odometry factors φ_odom and object pose measurement factors φ_obj. Each factor describes the likelihood of a measurement given certain variable assignments. Edges E represent the factorization structure of the likelihood distribution:
(1)  p({z_k}, {u_t} | X) = ∏_t φ_odom(x_{t-1}, x_t; u_t) ∏_k φ_obj(x_{t_k}, l_{j_k}; z_k)
We assume the measurement noises are zero-mean and normally distributed, w_k ~ N(0, Σ_k), where Σ_k is the noise covariance matrix for measurement z_k. The optimal assignments of the variables can be obtained by solving the maximum likelihood estimation problem:

(2)  X̂ = argmin_X  Σ_t ||Log(û_t^{-1} u_t)||²_{Λ_t} + Σ_k ||Log(ẑ_k^{-1} z_k)||²_{Σ_k}

where û_t and ẑ_k are the odometry and object pose measurements predicted from the variable assignments (e.g. ẑ_k = x_{t_k}^{-1} l_{j_k}), Log is the logarithmic map on the SE(3) group, and ||·||_Σ is the Mahalanobis distance. We solve (2) with automatic covariance tuning (see Sec. III.C) to obtain robust SLAM estimates X̂.
Based on the optimal states, we can: (1) identify inliers from the object pose measurements {z_k}; (2) compute optimized object-to-camera poses from X̂. We compare and combine the two sources to obtain the pseudo ground truth poses for images in D_u (see Fig. 1(e)). We refer to this pseudo-labeling method as Hybrid labeling, with pseudo-labels derived from the mixed sources.
First, we identify inlier pose measurements based on the χ² test (Alg. 1):

(3)  ||Log(ẑ_k^{-1} z_k)||²_{Σ_k} < χ²_{6, 0.95}

where χ²_{6, 0.95} is the critical value for the χ² distribution with 6 degrees of freedom at confidence level 0.95.
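The gating test above can be sketched in a few lines (a minimal illustration with a hypothetical residual and diagonal covariance, not the paper's implementation):

```python
import numpy as np
from scipy.stats import chi2

def is_inlier(error, cov, confidence=0.95):
    """Chi-square gating for a 6-DoF pose residual.

    error : length-6 unwhitened error (xyz-rpy); cov : 6x6 noise covariance.
    Accepts the measurement if its squared Mahalanobis distance falls below
    the chi-square critical value with 6 degrees of freedom.
    """
    d2 = float(error @ np.linalg.solve(cov, error))  # squared Mahalanobis distance
    return d2 < chi2.ppf(confidence, df=6)

cov = np.diag([0.01, 0.01, 0.01, 0.05, 0.05, 0.05])
small = np.full(6, 0.01)   # consistent measurement
large = np.full(6, 1.0)    # gross outlier
```

For the values above, `is_inlier(small, cov)` accepts and `is_inlier(large, cov)` rejects, since the critical value χ²_{6, 0.95} ≈ 12.59.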
Second, we directly compute pseudo ground truth poses from the optimal states. For the image collected at time t, the optimized object-to-camera pose for object j is computed as ẑ_{tj} = x̂_t^{-1} l̂_j. However, the inlier predictions or optimized poses can still be noisy, may deviate visually from the target objects on the images, and can hurt training.
Therefore, we employ a pose evaluation module, as proposed in [5], to quantify the visual coherence between pseudo-labels and images (see Fig. 1(f)). Each RGB image is compared with its corresponding image rendered at the pseudo ground truth pose. The regions of interest (RoIs) on the two images are passed into a pre-trained autoencoder from PoseRBPF [4], and the cosine similarity of their feature embeddings is computed. We abstract the process as a scoring function s(·) over candidate poses, where the dependence on object, camera and image information is made implicit. We contrast the PGO-computed pose's score with the inlier prediction's score for every target object on unlabeled images, and pick the pose with the higher score as the pseudo ground truth pose (see Sec. IV.A for details). We assemble the pseudo-labeled data as D_p = {(I_i, Ỹ_i)}, where again Ỹ = g(T̃) for the pseudo ground truth poses T̃. And we fine-tune the object pose estimator with D_p (see Fig. 1(g)).
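The visual-coherence score reduces to a cosine similarity between two feature embeddings. A minimal sketch (with stand-in vectors in place of the autoencoder features):

```python
import numpy as np

def cosine_score(feat_real, feat_rendered):
    """Cosine similarity between the feature embedding of the real RoI and
    that of the RoI rendered at the candidate pose (1.0 = identical direction)."""
    a = np.asarray(feat_real, dtype=float).ravel()
    b = np.asarray(feat_rendered, dtype=float).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

score_same = cosine_score([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # parallel -> 1.0
score_diff = cosine_score([1.0, 0.0], [0.0, 1.0])            # orthogonal -> 0.0
```

In the real pipeline the embeddings come from the PoseRBPF autoencoder rather than raw vectors, but the comparison itself is exactly this score.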
Our method naturally generates training samples on which the initial estimator fails to make reasonable predictions, i.e. hard examples. In particular, when the initial estimator makes an outlier prediction or misses a prediction, PGO can potentially recover an object pose that is visually consistent with the RGB image. These challenging examples, as we demonstrate in Sec. IV, are the key to achieving significant performance gains during self-training.
Since the object pose measurements are derived from learning-based models, it is difficult to fix a priori a (Gaussian) uncertainty model for them, as typically required in PGO. We propose to automate the covariance tuning process by incorporating the noise covariances as variables into a joint optimization procedure. Our formulation, as we show, leads to a simple and robust PGO method that allows us to alternate between PGO and determining the optimal noise models in closed form.
For simplicity, we rewrite the PGO cost function in (2) as:

(4)  L_PGO(X) = Σ_t ||e_t||²_{Λ_t} + Σ_k ||e_k||²_{Σ_k}

where e_t = Log(û_t^{-1} u_t) and e_k = Log(ẑ_k^{-1} z_k) are the unwhitened errors.
Given that the odometry measurements are typically more reliable, we only update the object pose noise covariances Σ_k at the covariance optimization step. Each time after optimizing the SLAM variables, we solve for the per-measurement noise models that further minimize the optimization loss, and re-solve the SLAM variables with them. To avoid the trivial solution of Σ_k → ∞, we apply L1 regularization on the covariance matrices. The loss function for the joint optimization is:

(5)  L(X, {Σ_k}) = L_PGO(X) + λ Σ_k ||Σ_k||_1
For simplicity, we assume the covariance matrices are diagonal, i.e. Σ_k = diag(σ_k1², ..., σ_k6²), and the joint loss reduces to:

(6)  L = Σ_t ||e_t||²_{Λ_t} + Σ_k Σ_{i=1}^{6} ( e_ki² / σ_ki² + λ σ_ki² )

where e_ki is the i-th entry of the unwhitened error e_k.
For fixed X and e_ki, optimizing (6) w.r.t. σ_ki amounts to minimizing a convex objective over the nonnegative reals. Moreover, we know σ_ki = 0 is never a minimizer for e_ki ≠ 0. Thus, a globally optimal solution is obtained when:

(7)  ∂L/∂σ_ki = -2 e_ki² / σ_ki³ + 2 λ σ_ki = 0

evaluating which we obtain the extremum on (0, ∞):

(8)  σ_ki² = |e_ki| / √λ

With ∂²L/∂σ_ki² > 0 being valid for all σ_ki > 0, we can confirm that this extremum is a global minimum.
Therefore we can express the covariance update rule at iteration n as:

(9)  σ_ki^{(n)2} = c |e_ki^{(n-1)}|

where we factor out √λ in (8) as a constant c for the convenience of tuning, i.e. c = 1/√λ.
Since the covariance optimization step admits a closed-form solution, the joint optimization reduces to “iteratively solving PGO with recomputed covariances”. The selection of all the noise models reduces to tuning c. According to our experiments, it is feasible to set c as a constant for consistent performance in different cases^3 (^3 The same constant value of c is used for all the tests.).
The algorithm is summarized in Alg. 1. We use the Levenberg–Marquardt (LM) algorithm to solve the Gaussian PGO. For better performance, we also identify outliers at each iteration using the χ² test (3) and set their noise covariances to a very large value to rule out their influence. The optimization is terminated when the relative decrease in the joint loss is sufficiently small or the maximum number of iterations is reached. In Appendix A, we show the algorithm can monotonically improve the joint loss (5).
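To make the alternating scheme concrete, here is a toy scalar version (our own illustration, not the paper's solver): the state step is a weighted least-squares solve, the covariance step applies the closed-form update σ² = |e|/√λ from (8), and the joint loss decreases monotonically, consistent with Appendix A:

```python
import numpy as np

def joint_loss(x, s, z, lam):
    """Joint loss (5) for a scalar state: whitened residuals + L1 covariance penalty."""
    return float(np.sum((x - z) ** 2 / s) + lam * np.sum(s))

# Measurements of a scalar state: inliers around 1.0 plus one gross outlier.
z = np.array([0.9, 1.0, 1.1, 1.05, 5.0])
lam = 1.0
s = np.ones_like(z)  # initial (poor) noise variances
x = 0.0
losses = []
for _ in range(20):
    # "PGO" step: weighted least squares given the current variances.
    x = np.sum(z / s) / np.sum(1.0 / s)
    # Closed-form covariance step: s_k = |e_k| / sqrt(lam) (small floor for stability).
    s = np.abs(x - z) / np.sqrt(lam) + 1e-9
    losses.append(joint_loss(x, s, z, lam))
```

The outlier's variance inflates with its residual, so its weight collapses and the estimate settles near the inlier cluster; the recorded losses form a nonincreasing sequence.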
The update rule (9) reveals our implicit assumption that a high-residual measurement component (in xyz-rpy) is likely from a high-variance normal distribution. The recomputed covariances down-weight high-residual measurements, which is in spirit similar to robust M-estimation. We show in Appendix B that, with isotropic noise models, our method reduces to using the L1 robust kernel. However, robust M-estimation is typically solved with the IRLS method, where the measurement losses are reweighted based on the Mahalanobis distance (see (17)). In comparison, our method reweights the per-component losses differently. This enables us to fit a richer class of noise models for 6-DoF PGO, in that different components often follow different noise characteristics.
Our method is tested with (1) the YCB video (YCBv) dataset and (2) a new ground robot experiment. On the YCBv dataset (Sec. IV.A), we leverage the image streams in the training sets to build per-video object-level maps for self-training. The Hybrid labeling method is compared with two baseline methods to verify the effectiveness of different modules. In the ground robot experiment (Sec. IV.B), we apply the method on longer sequences, circumnavigating selected objects, and demonstrate its potential for use in object-based navigation.
Our method is implemented in Python. We use the NVISII toolkit [14] to generate synthetic image data. The training and pose inference programs are adapted from the code available in the DOPE GitHub repository [24]. Every network is initially trained for 60 epochs and fine-tuned for 20 epochs, with a batch size of 64 and a learning rate of 0.0001. We solve PGO problems with the GTSAM library [3]. We run the data generation and training programs on 2 NVIDIA Volta V100 GPUs, and other code on an Intel Core i7-9850H@2.6GHz CPU and an NVIDIA Quadro RTX 3000 GPU. We test our method with DOPE estimators [24] for three YCB objects: 003_cracker_box, 004_sugar_box and 010_potted_meat_can. Each object appears in 20 YCB videos (training + testing), with respectively 26689, 22528 and 27050 training images (dashed lines in Fig. 2(a)) from 17, 15, and 17 training videos.
We use 60k NVISII-generated synthetic images to train the initial DOPE estimators for the 3 objects^4 (^4 The 010_potted_meat_can data are the same as those used in [14].). The models are applied to infer object poses on the YCBv images. We employ the ORB-SLAM3 [2] RGB-D module (w/o loop closing) to obtain camera odometry on these videos.
003_cracker_box  0001  0004  0007  0016  0017  0019  0025  #best 

LM  62.3  58.7  13.2  69.4  37.6  110.1  101.6  0 
Cauchy  12.4  10.8  10.2  13.8  29.5  94.4  171.4  4 
Huber  31.4  25.4  10.2  34.2  21.6  52.5  57.0  1 
GM  11.5  168.4  10.2  115.0  48.4  94.4  171.4  2 
cDCE[17]  28.7  25.4  10.5  32.5  21.1  45.2  58.9  4 
ACT(Ours)  15.7  12.0  9.4  12.6  20.3  52.0  15.4  9 
004_sugar_box  0001  0014  0015  0020  0025  0029  0033  #best 
LM  22.9  27.1  100.9  21.1  57.3  78.7  7.1  0 
Cauchy  8.3  13.4  30.4  21.9  22.3  104.4  6.4  1 
Huber  11.7  12.8  35.5  15.8  23.4  71.1  6.6  3 
GM  9.4  11.4  29.1  14.3  19.6  104.4  6.4  6 
cDCE[17]  12.3  12.0  31.6  14.9  20.0  72.7  6.5  0 
ACT(Ours)  8.2  15.9  34.2  15.1  18.0  100.5  6.1  10 
010_potted_meat_can  0002  0005  0008  0014  0017  0023  0026  #best 
LM  35.2  38.1  61.4  59.2  31.1  32.8  17.5  0 
Cauchy  10.8  14.9  10.7  12.2  14.2  11.7  11.4  6 
Huber  11.1  17.1  14.6  16.6  18.6  15.5  11.8  1 
GM  10.4  15.3  11.9  13.2  18.9  15.2  9.1  5 
cDCE[17]  11.5  16.2  16.1  20.5  17.8  15.3  11.2  1 
ACT(Ours)  13.3  14.5  12.8  11.3  19.1  14.0  10.4  7 
Combining the measurements, we solve per-video PGO problems to build object-level maps for pseudo-labeling. We initialize the camera poses with the odometry chain, the object poses with averaged pose predictions, and the covariances with a fixed initial guess. We apply different robust optimization methods to solve all 60 PGO problems (see Tab. I)^5,^6,^7 (^5 Due to space limitations, we report results for the first 7 out of 20 YCBv sequences. Please check out our GitHub repo for complete statistics. ^6 We use the default parameters and the same noise covariances for the robust M-estimators. ^7 We implemented cDCE by replacing (8) with equation (16) in [17].). For comparison purposes only, we compute pseudo-labels for all the YCBv images directly from the optimal states, and compare the methods via label errors, i.e. how much the projected object bounding boxes deviate from the ground truth. With the benefit of component-wise covariance rescaling and mitigation of the effects of outliers, our ACT method achieves the lowest errors on many more videos (see Tab. I Col. 8). It performs stably across sequences with a fixed initial guess and a constant regularization coefficient. We thus pseudo-label the YCBv training images based on the results of our method.
To evaluate the efficacy of different modules, we compare our Hybrid method with two baseline labeling methods: Inlier and PoseEval. Inlier uses the PGO results only as an inlier measurement filter: the raw pose predictions that are geometrically congruent with the other measurements, i.e. those passing the χ² test (3), are selected for labeling. PoseEval extracts visually coherent pose predictions by thresholding the similarity scores from the pose evaluation module^8 (^8 The same score threshold as in [5] is used.).
The Hybrid method ensures spatial and visual consistency of the pseudo-labeled data. For a certain target object in an image, we compare the pose evaluation score of the inlier prediction (if available) with that of the optimized object pose. The higher scorer, if beyond a threshold, is picked for labeling. The threshold for PGO-generated labels is set stricter than that for inlier predictions, because the PGO-generated labels, not directly derived from RGB images, are prone to misalignment^9 (^9 The thresholds are set per object for the 3 YCB objects.). Thus, the Hybrid pseudo-labeled data consist of high-score inliers, PGO-generated easy examples, and hard examples (on which the initial estimator fails). The 3 components are colored differently in the Hybrid bars in Fig. 2(a). For the Hybrid and Inlier modes, we also exclude YCBv sequences with measurement outlier rates higher than 20% from data generation.
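The per-object selection logic amounts to a few lines. A sketch with hypothetical threshold values (the paper's actual per-object thresholds are tuned separately and are not reproduced here):

```python
def hybrid_label(inlier_score, pgo_score, thresh_inlier=0.5, thresh_pgo=0.7):
    """Pick the pseudo ground-truth source for one object in one image.

    inlier_score : pose-evaluation score of the inlier network prediction,
    or None if the prediction was an outlier or missing.
    pgo_score : score of the PGO-optimized pose, or None if unavailable.
    The PGO threshold is stricter, since PGO labels are not derived
    directly from the image. Returns 'inlier', 'pgo', or None (not labeled).
    """
    candidates = []
    if inlier_score is not None and inlier_score > thresh_inlier:
        candidates.append((inlier_score, 'inlier'))
    if pgo_score is not None and pgo_score > thresh_pgo:
        candidates.append((pgo_score, 'pgo'))
    if not candidates:
        return None
    return max(candidates)[1]  # highest-scoring surviving candidate
```

For example, a missing prediction with a high-scoring PGO pose becomes a hard example ('pgo'), while an image where neither source clears its threshold is simply left unlabeled.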

(Error bars in Fig. 2 denote one standard deviation.)
The statistics of the pseudo-labeled data are reported in Fig. 2. The Hybrid and Inlier data are in general very accurate (with small average label errors relative to the image width), the χ² test being an effective inlier filter. But the PoseEval data, regardless of their size, are noisier and more outlier-corrupted, so we cannot rely on the pose evaluation test alone to generate outlier-free labels.
Further, we evaluate the DOPE estimators, fine-tuned with the pseudo-labeled data, on the YCBv test sets. We adopt the average distance (ADD) metric [7], and present the accuracy-threshold curves in Fig. 3. All the methods achieve considerable improvements over the initial model (Synthetic), indicating the significance of in-domain data, although noisy and outlier-corrupted, for training pose estimators, which is also reported in [27, 22, 12, 32, 29]. But they still show large performance gaps from the model trained with YCBv ground truth data, due to label noise and the limited data size. Our Hybrid method consistently outperforms the other baselines, even though its data have similar statistics to the Inlier data. We thus believe the performance gain mainly comes from the presence of hard examples.
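For reference, the ADD metric averages the distances between corresponding model points transformed by the ground-truth and estimated poses. A minimal sketch (with a random stand-in point cloud and an illustrative 1 cm translation error):

```python
import numpy as np

def add_error(T_gt, T_est, points):
    """Average distance (ADD) between model points under two poses.

    T_gt, T_est : 4x4 ground-truth / estimated object poses;
    points : (N, 3) object model points. Lower is better; a pose is
    commonly counted correct if ADD falls below a distance threshold
    (e.g. a fraction of the object diameter).
    """
    def transform(T, p):
        return p @ T[:3, :3].T + T[:3, 3]
    return float(np.mean(np.linalg.norm(transform(T_gt, points)
                                        - transform(T_est, points), axis=1)))

pts = np.random.default_rng(0).uniform(-0.05, 0.05, size=(500, 3))
T_gt = np.eye(4)
T_est = np.eye(4); T_est[:3, 3] = [0.01, 0.0, 0.0]  # pure 1 cm translation error
err = add_error(T_gt, T_est, pts)  # -> 0.01 for a pure translation offset
```

Sweeping the correctness threshold and plotting the accepted fraction yields exactly the accuracy-threshold curves reported in Fig. 3.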
Beyond the first iteration of self-training, we attempt to pseudo-label the same data with the improved pose estimator and further fine-tune the model. But the twice-fine-tuned estimators fail to show significant performance enhancement and can even worsen^10 (^10 The ADD AUC values for the twice-fine-tuned estimators (by Hybrid labeling) are 47.0, 76.0, and 53.9 respectively.). We believe this is the result of model overfitting^11 (^11 Overfitting is observed after 20 epochs during the first fine-tuning.) and the accumulation of label noise.
In return, we examine the effect of the fine-tuned estimator models on object SLAM performance. The YCBv test sequences 0049 and 0059 are selected for evaluation, since they contain 2 (out of the 3) selected YCB objects. We solve the PGOs with our ACT method and the LM algorithm. The results are reported in Fig. 4 and Tab. II.
The outlier measurement rates are greatly reduced on both sequences, making it easier for the LM algorithm to succeed. For our ACT method, however, the SLAM accuracy is improved on 0059 but slightly degraded on 0049. That is because our method is outlier-robust, so decreasing outliers does not always lead to improved SLAM performance, especially in low-outlier regimes.
Seq. 0049             ATE (m)  004_sugar_box           010_potted_meat_can
                               Tran. (m)   Ori. (rad)  Tran. (m)   Ori. (rad)
LM   before           0.199    0.054       0.023       0.091       0.154
LM   after            0.081    0.039       0.083       0.037       0.089
ACT  before           0.030    0.057       0.018       0.020       0.064
ACT  after            0.046    0.037       0.046       0.035       0.065

Seq. 0059             ATE (m)  003_cracker_box         010_potted_meat_can
                               Tran. (m)   Ori. (rad)  Tran. (m)   Ori. (rad)
LM   before           0.618    0.234       0.402       0.518       0.214
LM   after            0.105    0.057       0.027       0.032       0.147
ACT  before           0.036    0.043       0.065       0.049       0.103
ACT  after            0.025    0.019       0.027       0.020       0.108
As illustrated in Fig. 5, we control a Jackal robot [8] to circle around two target objects, 003_cracker_box and 010_potted_meat_can, and collect stereo RGB images from a ZED 2 camera [23]. For each object, two 4-minute-long sequences are recorded, one for self-training and the other for testing. We obtain the ground truth camera trajectory from a Vicon motion capture system [26] and the ground truth object poses from AprilTag detections [16]. The camera odometry is computed with the SVO2 stereo module [6]. We infer the object poses from the left camera RGB images using the same initial estimators as in Sec. IV.A.
Similar to the YCBv experiment, we build an object-level map from the measurements and pseudo-label the left camera images with the Hybrid method. For the two objects, 648 (out of 3120) and 950 (out of 2657) images are pseudo-labeled, of which 1.5% and 2.8% are hard examples. After fine-tuning, we evaluate the estimator models on both our experiment's test sequences and the YCBv test sets. On our own test sequences, we adopt the reprojection error of the object 3D bounding box as the evaluation metric. The accuracy-threshold curves are presented in Fig. 6. The AUCs for the curves in both tests are reported in Tab. III. The estimators fine-tuned with the robot-collected data show elevated performance in a similar environment, and achieve slight enhancements in the YCBv test scenes. This indicates that our method generalizes well to robot navigation scenarios, and also reiterates that real annotated images are precious and effective for mitigating the domain gap.
Our test sequences
Reproj. error AUC  003_cracker_box  010_potted_meat_can
Before  58.0  40.6
After  63.9  70.2
YCBv test sets
ADD AUC  003_cracker_box  010_potted_meat_can
Before  8.0  42.3
After  15.2  44.7
To study the value of the enhanced pose estimators for object SLAM, we solve the PGOs on our test sequences and the YCBv test sequence 0059, with the initial and fine-tuned models. The SLAM estimates solved by our ACT method and the LM algorithm are presented in Fig. 7, and the trajectory errors are reported in Tab. IV.
Similarly, since our method (ACT) is outlier-robust, the reduced outlier rates do not always bring about improved SLAM accuracy. However, in all cases, fine-tuning facilitates the success of the non-robust LM algorithm.
ACT  003_cracker_box test  010_potted_meat_can test  YCBv 0059 

Before  0.030  0.028  0.039 
After  0.029  0.028  0.031 
LM  003_cracker_box test  010_potted_meat_can test  YCBv 0059 
Before  0.062  0.253  0.618 
After  0.046  0.072  0.576 
A SLAM-aided semi-supervised learning method for object pose estimation is developed to mutually boost the performance of object pose inference and object SLAM during robot navigation. For SLAM optimization, we propose to automate the tuning of the noise covariances by joint optimization of the SLAM variables and uncertainty models, leading to a flexible and easy-to-implement robust PGO method. We demonstrate the effectiveness of our method on the YCBv dataset and in ground robot experiments. Our method can mine high-quality pseudo-labeled data from noisy and outlier-corrupted measurements. The SLAM-supported self-training, even with noisy supervisory signals, considerably enhances the performance of pose estimators trained with synthetic data. The fine-tuned estimator models, with reduced outlier rates, in return make object SLAM more effective.
We prove that our automatic covariance tuning method can monotonically improve the joint loss L as defined in (5). At iteration n, we have the variable assignments X^(n-1) and the noise covariances Σ^(n-1) = {Σ_k^(n-1)} from iteration n-1. Solving the PGO, the LM algorithm ensures L_PGO(X^(n); Σ^(n-1)) ≤ L_PGO(X^(n-1); Σ^(n-1)). Since the extra regularization term in L is not a function of X, we have:

(10)  L(X^(n), Σ^(n-1)) ≤ L(X^(n-1), Σ^(n-1))

We have also shown after (8) that Σ^(n) is a global minimizer of L(X^(n), ·). Thus, we have:

(11)  L(X^(n), Σ^(n)) ≤ L(X^(n), Σ^(n-1))

Combining the two inequalities, we obtain:

(12)  L(X^(n), Σ^(n)) ≤ L(X^(n-1), Σ^(n-1))

With this being valid for all iterations, we obtain the chain of inequalities:

(13)  L(X^(n), Σ^(n)) ≤ L(X^(n-1), Σ^(n-1)) ≤ ... ≤ L(X^(0), Σ^(0))

which completes the proof.
We show that our automatic covariance tuning method, when the noise models are assumed to be isotropic, is equivalent to using the L1 robust M-estimator. With isotropic noises, i.e. Σ_k = σ_k² I_6, the joint loss in (5) reduces to:

(14)  L = Σ_t ||e_t||²_{Λ_t} + Σ_k ( ||e_k||² / σ_k² + 6 λ σ_k² )

Evaluating ∂L/∂σ_k = 0 yields the new update rule:

(15)  σ_k^{(n)2} = ||e_k^{(n-1)}|| / √(6λ)

On the other hand, we can apply the L1 robust loss to the object pose measurement factors to minimize the PGO loss (4). The robust PGO cost can be expressed as:

(16)  Σ_t ||e_t||²_{Λ_t} + Σ_k ρ( ||e_k||_{Σ_k} )

where ρ(r) = |r| is the L1 robust kernel. The robust cost is typically minimized with the IRLS method, by matching the gradients of (16) locally with a sequence of weighted least squares problems. The local least squares formulation at iteration n can be expressed as:

(17)  Σ_t ||e_t||²_{Λ_t} + Σ_k w( ||e_k^{(n-1)}||_{Σ_k} ) ||e_k||²_{Σ_k}

where w(r) = ρ'(r)/r = 1/r is the weight function. Under the isotropic noise assumption, i.e. Σ_k = σ_k² I_6, we can absorb the weight function into the covariance matrix and rewrite (17) as:

(18)  Σ_t ||e_t||²_{Λ_t} + Σ_k ||e_k||²_{Σ_k^{(n)}}

where the covariance matrix is de facto rescaled iteratively by:

(19)  σ_k^{(n)2} = σ_k ||e_k^{(n-1)}||

Matching (15) with (19), we can see that as σ_k = 1/√(6λ), the two methods are in theory equivalent^12 (^12 σ_k = const. in the context of robust M-estimation.).
The authors acknowledge Jonathan Tremblay and other NVIDIA developers for providing consultation on training DOPE networks and generating synthetic data. The authors also acknowledge the MIT SuperCloud and Lincoln Laboratory Supercomputing Center for HPC resources that have contributed to the results reported within this paper.
ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial, and Multi-Map SLAM. IEEE Transactions on Robotics 37(6), pp. 1874–1890.
Asian Conference on Computer Vision, pp. 548–562.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1352–1359.
PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes. arXiv preprint arXiv:1711.00199.
Multi-View Self-Supervised Deep Learning for 6D Pose Estimation in the Amazon Picking Challenge. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1383–1386.