Code for "Single-view robot pose and joint angle estimation via render & compare", CVPR 2021 (Oral).
We introduce RoboPose, a method to estimate the joint angles and the 6D camera-to-robot pose of a known articulated robot from a single RGB image. This is an important problem to grant mobile and itinerant autonomous systems the ability to interact with other robots using only visual information in non-instrumented environments, especially in the context of collaborative robotics. It is also challenging because robots have many degrees of freedom and an infinite space of possible configurations that often result in self-occlusions and depth ambiguities when imaged by a single camera. The contributions of this work are three-fold. First, we introduce a new render compare approach for estimating the 6D pose and joint angles of an articulated robot that can be trained from synthetic data, generalizes to new unseen robot configurations at test time, and can be applied to a variety of robots. Second, we experimentally demonstrate the importance of the robot parametrization for the iterative pose updates and design a parametrization strategy that is independent of the robot structure. Finally, we show experimental results on existing benchmark datasets for four different robots and demonstrate that our method significantly outperforms the state of the art. Code and pre-trained models are available on the project webpage https://www.di.ens.fr/willow/research/robopose/.READ FULL TEXT VIEW PDF
Code for "Single-view robot pose and joint angle estimation via render & compare", CVPR 2021 (Oral).
The goal of this work is to recover the state of a known articulated robot within a 3D scene using a single RGB image. The robot state is defined by (i) its 6D pose, i.e. a 3D translation and a 3D rotation with respect to the camera frame, and (ii) the joint angle values of the robot’s articulations. The problem set-up is illustrated in Figure 1. This is an important problem to grant mobile and itinerant autonomous systems the ability to interact with other robots using only visual information in non-instrumented environments. For instance, in the context of collaborative tasks between two or more robots, having knowledge of the pose and the joint angle values of all other robots would allow better distribution of the load between robots involved in the task [caccavale2016cooperative].
The problem is, however, very challenging because robots can have many degrees of freedom (DoF) and an infinite space of admissible configurations that often result in self-occlusions and depth ambiguities when imaged by a single camera. The current best performing methods for this problem [lee2020camera, zuo2019craves]
use a deep neural network to localize in the image a fixed number of pre-defined keypoints (typically located at the articulations) and then solve a 2D-to-3D optimization problem to recover the robot 6D pose[lee2020camera] or pose and configuration [zuo2019craves]. For rigid objects, however, methods based on 2D keypoints [Lowe1999-bf, bay2006surf, Collet2010-zj, Collet2011-lj, Rad2017-de, Tremblay2018-bd, Kehl2017-ek, Tekin2017-hp, peng2019pvnet, pavlakos20176, hu2019segmentation] have been recently outperformed by render & compare methods that forgo explicit detection of 2D keypoints but instead use the entire shape of the object by comparing the rendered view of the 3D model to the input image and iteratively refining the object’s 6D pose [Rad2017-de, zakharov2019dpod, li2018deepim, labbe2020cosypose]. Motivated by this success, we investigate how to extend the render & compare
paradigm for articulated objects. This presents significant challenges. First, we need to estimate many more degrees of freedom than the sole 6D pose. Articulated robots we consider in this work can have up to 15 degrees of freedom in addition to their 6D rigid pose in the environment. Second, the space of configurations is continuous and hence there are infinitely many configurations in which the object can appear. As a result, it is not possible to see all configurations during training and the method has to generalize to unseen configurations at test time. Third, the choice of transformation parametrization plays an important role for 6D pose estimation of rigid objects[li2018deepim] and finding a good parametrization of pose updates for articulated objects is a key technical challenge.
To address these challenges, we make the following contributions. First, we introduce a new render & compare approach for estimating the 6D pose and joint angles of an articulated robot that can be trained from synthetic data, generalizes to new unseen robot configurations at test time, and can be applied to a large variety of robots (robotic arms, bi-manual robots, etc.). Second, we experimentally demonstrate the importance of the robot pose parametrization for the iterative pose updates and design an effective parametrization strategy that is independent of the robot. Third, we apply the proposed method in two settings: (i) with known joint angles (e.g. provided by internal measurements from the robot such as joint encoders), only predicting the camera-to-robot 6D pose, and (ii) with unknown joint angles, predicting both the joint angles and the camera-to-robot 6D pose. We show experimental results on existing benchmark datasets for both settings that include a total of four different robots and demonstrate significant improvements compared to the state of the art.
from RGB images [Roberts1963-ck, Lowe1987-yf, Lowe1999-bf]
is one of the oldest problems in computer vision. It has been successfully approached by estimating the pose from 2D-3D correspondences obtained via local invariant features[Lowe1999-bf, bay2006surf, Collet2010-zj, Collet2011-lj], or by template-matching [Hinterstoisser2011-es]
. Both these strategies have been revisited using convolutional neural networks (CNNs). A set of sparse[Rad2017-de, Tremblay2018-bd, Kehl2017-ek, Tekin2017-hp, peng2019pvnet, pavlakos20176, hu2019segmentation] or dense [Xiang2018-dv, Park2019-od, song2020hybridpose, zakharov2019dpod] features is detected on the object in the image using a CNN and the resulting 2D-to-3D correspondences are used to recover the camera pose using PnP [lepetit2009epnp]. The best performing methods for 6D pose estimation from RGB images are now based on variants of the render & compare strategy [oberweger2015training, Rad2017-de, li2018deepim, manhardt2018deep, oberweger2019generalized, labbe2020cosypose, zakharov2019dpod] and are approaching the accuracy of methods using depth as input [hodan2020bop, Hodan_undated-sl, li2018deepim, labbe2020cosypose].
(HEC) [Horaud1995-fl, Heller2011-so] methods recover the 6D pose of the camera with respect to a robot. The most common approach is to detect in the image fiducial markers [garrido2014automatic, fiala2005artag, olson2011apriltag] placed on the robot at known positions. The resulting 3D-to-2D correspondences are then used to recover the camera-to-robot pose using known joint angles and the kinematic description of the robot by solving an optimization problem [park1994robot, ilonen2011robust, yang1994calibrating]. Recent works have explored using CNNs [lambrecht2019towards, lee2020camera] to perform this task by recognizing 2D keypoints at specific robot parts and using the resulting 3D-to-2D correspondences to recover the hand-eye calibration via PnP. The most recent work in this direction [lee2020camera] demonstrated that such learning-based approach could replace more standard hand-eye calibration methods [tsai1989new] to perform online calibration and object manipulation [tremblay2020indirect]. Our render & compare method significantly outperforms [lee2020camera] and we also demonstrate that our method can achieve a competitive accuracy without requiring known joint angles at test time.
Previous work on this problem can be split into three classes. The first class of methods aims at discovering properties of the kinematic chain through active manipulation [katz2008manipulating, katz2013interactive, hausman2015active, martin2016integrated] using depth as input and unlike our approach cannot be applied to a single image. The second class of methods aims at recovering all parameters of the kinematic chain from a single RGBD image, including the joint angles, without knowing the specific articulated object [li2020category, yi2018deep, abbatematteo2019learning, zhou2016simultaneous]. In contrast, we focus on the set-up with a known 3D model, e.g. a specific type of a robot. The third class of methods, which is closest to our set-up, considers pose and joint angle estimation [michel2015pose, desingh2019factored, pauwels2014real] for known articulated objects but relies on depth as input and only considers relatively simple kinematic chains such as laptops or drawers where the joint parameters only affect the pose of one part. Others recover joint angles of a known articulated robot part [bohg2014robot, widmaier2016robot] but do not recover the 6D pose of the camera and also rely on depth. In contrast, our method accurately estimates the pose and joint angles of a robotic arm with many degrees of freedom from a single RGB image.
To the best of our knowledge, only [zuo2019craves] has addressed a scenario similar to us where the robot pose and joint angles are estimated together from a single RGB image. A set of predefined 2D keypoints is recognized in the image and the 6D pose and joint angles are then recovered by solving a nonlinear non-convex optimization problem. Results are shown on a 4 DoF robotic arm. In contrast, we describe a new render & compare approach for this problem, demonstrate significantly improvements in 3D accuracy and show results on robots with up to 15 DoF.
We present our render & compare framework to recover the state of a robot within a 3D scene given a single RGB image. We assume the camera intrinsic parameters, the CAD model and kinematic description of the robot are known. We start by formalizing the problem in Section 3.1. We then present an overview of our approach in Section 3.2 and explain our training in Section 3.3. Finally, we detail the key choices in the problem parametrization in Section 3.4.
Our notations are summarized in Figure 2. We consider a known robot composed of rigid parts ,…, whose 3D models are known. An articulation, or joint, connects a parent part to a child part. Given the joint angle of the -th joint, we can retrieve the relative 6D transformation between the parent and child reference frames. Note that for simplicity we only consider revolute joints, i.e. joints parametrized by a single scalar angle of rotation , but our method is not specific to this type of joints. The rigid parts and the joints define the kinematic tree of the robot. This kinematic description can be used to compute the relative 6D pose between any two parts of the robot. In robotics, the full state of a robot is defined by the joint angles and the 6D pose of the root of the kinematic tree. Defining the 6D pose of the robot with respect to the root (whose pose is independent of the joint angles since it is not a child of any joint) is a crucial choice in the parametrization of the problem, but also arbitrary, since an equivalent kinematic tree could be defined using any part as the root. We instead define the full state of the robot by (i) the selection of an anchor part , (ii) the 6D pose of the anchor with respect to the camera , and (iii) the joint angles , where is the number of joints. Note the anchor part can change across iterations of our approach. We discuss the choice of the anchor in Section 3.4 and experimentally demonstrate it has an important influence on the results.
We now present our iterative deep render & compare framework, illustrated in Figure 3. We iteratively refine the state estimate as follows. First, given a current estimate of the state we render an RGB image of the robot and the mask of the anchor part. We then apply a deep refiner network that takes as input crops of the rendered image and the input RGB image of the scene. It outputs a new state of the robot to attempt to match the ground truth state of the observed robot. Unlike prior works that have used render & compare strategies for estimating the 6D pose of rigid objects [li2018deepim, zakharov2019dpod, labbe2020cosypose], our method does not require a coarse pose estimate as initialization.
To render the image of the robot we use a fixed focal length (defining an intrinsic camera matrix) during training. The rendering is fully defined by the state of the robot and the camera matrix. Instead of giving to the refiner network the full image and the rendered view, we focus the inputs on the robot by cropping the images as follows. We project the centroid of the rendered robot in the image, consider the smallest bounding box of aspect ratio 4/3 centered on this point that encloses the projected robot and increase its size by 40% (see details in the appendix). This crop depends on the projection of the robot to the input image that varies during training, and thus provides an augmentation of the effective focal length of the virtual cropped cameras. Hence, our method can be applied to cameras with different intrinsics at test time as we show in our experiments.
We initialize the robot to a state defined by the joint configuration and the pose of the anchor part with respect to the camera . At training time we define using perturbations of the ground truth state. At test time we initialize the joints to the middle of the joint limits, and the initial pose so that the frame of the robot base is aligned with the camera frame and the 2D bounding box defined by the projection of the robot model approximately matches the size of image. More details are given in the appendix.
At iteration , the refiner predicts an update of the joint angles (composed of one scalar angle per joint), such that
and an update of the current 6D pose of the anchor part, such that
where we follow DeepIM [li2018deepim]’s parametrization for pose update . This parametrization disentangles rotation and translation prediction but crucially depends on the choice of a reference point we call . In DeepIM this point is simply taken as the center of the reference frame of the rigid object but there is not such a natural choice of reference point for articulated objects, which have multiple moving parts. We discuss several possible choices of the reference point in Sec. 3.4 and demonstrate experimentally it has an important impact on the results. In particular, we show that naively selecting the reference frame of the root part is sub-optimal.
In the following, we describe our loss function, synthetic training data, implementation details and discuss how to best use known joint angles if available.
We train our refiner network using the following loss:
where are the parameters of the refiner network, is the maximum number of iterations of the refinement algorithm, is the ground truth 6D pose of the anchor, are the ground truth joint angles and is a hyper-parameter to balance between the 6D pose loss and the joint angle loss . The 6D pose loss measures the distance between the predicted 3D point cloud obtained using transformed with and the ground truth 3D point cloud (obtained using ) of the anchor . We use the same loss as [labbe2020cosypose] that disentangles rotation, depth and image-plane translations [Simonelli2019-da] (see equations in the appendix). For , we use a simple regression loss, .
Note that the 6D pose loss is measured only on the anchor part while the alignment of the other parts of the robot is measured by the error on their joint angles (rather than alignment of their 3D point clouds). This disentangles the 6D pose loss from the joint angle loss and we found this leads to better convergence. We sum the loss over the refinement iterations
to imitate how the refinement algorithm is applied at test time but the error gradients are not backpropagated through rendering and iterations. Finally, for simplicity the loss (3) is written for a single training example, but we sum it over all examples in the training set.
For training the refiner, we use existing datasets [lee2020camera, zuo2019craves] provided by prior works for the Kuka, Panda, Baxter, OWI-535 robots. All of these datasets are synthetic, generated using similar procedures based on domain randomization [tobin2017domain, Loing2018, Sadeghi2017-dr, James2018-db]. The joint angles are sampled independently and uniformly within their bounds, without assuming further knowledge about their test time distribution. We add data augmentation similar to [labbe2020cosypose].
We sample the initial state
by adding noise to the ground truth state in order to simulate errors of the network prediction at the previous state of the refinement as well as the error at the initalization. For the pose, we sample a translation from a centered Gaussian distribution with standard deviation ofcm, and a rotation by sampling three angles from a centered Gaussian distribution of standard deviation . For the joint angles, we sample an additive noise from a centered Gaussian distribution with a standard deviation equal to of the joint range of motion, which is around for most of the joints of the robots we considered.
We train separate networks for each robot. We use a standard ResNet-34 architecture [he2016deep] as the backbone of the deep refiner. The hyper-parameters are and training iterations. Note that at test time we can perform more iterations, and the results we report correspond to 10 iterations. The anchor is sampled randomly among the 5 largest parts of the robot at each iteration. This choice is motivated in Section 3.4 and other choices are considered in the experiments, Section 4.3. We initialize the network parameters randomly and perform the optimization using Adam [Kingma2014-xz], with the procedure described in the appendix for all the networks.
The approach described previously could be used at test time with measured joint angles and by ignoring the joint update, but we observed better results by training a separate network which only predicts a pose update for this scenario. In this context where the joint values are known and constant, the full robot is considered as a single and unique anchor. Yet, the problem remains different from classic rigid object 6D object pose estimation because the network must generalize to new joint configurations unseen during training.
There are two main parametrization choices in our approach: (i) the choice of the reference point for the parametrization of the pose update in equation (2) and (ii) the choice of the anchor part to update the 6D pose and measure pose loss in equation (3). These choices have a significant impact on the results, as shown in Section 4.
Similar to [li2018deepim], we parametrize the pose update as a rotation around a reference point and a translation defined as a function of the position of with respect to the camera. The fact that the rotation is around is a first obvious influence of this choice on the transformation that needs to be predicted. The impact on the translation update parameters is more complicated: they are defined by a multiplicative update on the depth of and by an equivalent update in pixels in the image, which is also related to the real update by the depth of (see equations in the appendix).
A seemingly natural choice for reference point would be a physical point on the robot, for example the center of the base of the robot or the anchor part. However, on the contrary to the rigid object case, if that part is not visible or is partially occluded, the network cannot infer the position of the reference precisely, and thus cannot predict a relevant translation and rotation update. In experiments, we show it is better to use as the centroid of the estimated robot state, which takes into account the estimated joint configuration, and can be more reliably estimated.
The impact of the choice of the anchor part used for computing the 6D pose loss in equation (3), is illustrated in Figure 4. We explore several choices of anchor part in our experiments, and show that this choice has a significant impact on the results. Since the optimal choice depends on the robot, and the observed pose, we introduce a strategy where we randomly select the anchor among the largest parts of the robot, during both training and refinement, and show that on average it performs similarly or slightly better than the optimal oracle choice of a single unique anchor on the test set.
|Dataset informations||DREAM [lee2020camera]||DREAM [lee2020camera]||DREAM [lee2020camera]||Ours||Ours|
|Robot||Robot (DoF)||Real||# images||# 6D poses||cam.||VGG19-F||VGG19-Q||ResNet101-H||ResNet34||Unknown angles|
|Baxter DR||Baxter (15)||5982||5982||GL||-||75.47||-||86.59||32.66|
|Kuka DR||Kuka (7)||5997||5997||GL||-||-||73.30||89.62||80.17|
|Kuka Photo||Kuka (7)||5999||5999||GL||-||-||72.14||86.87||73.23|
|Panda DR||Panda (8)||5998||5998||GL||81.33||77.82||82.89||92.70||82.85|
|Panda Photo||Panda (8)||5997||5997||GL||79.53||74.30||81.09||89.89||79.72|
|Panda 3CAM-AK||Panda (8)||✓||6394||1||AK||68.91||52.38||60.52||76.54||70.37|
|Panda 3CAM-XK||Panda (8)||✓||4966||1||XK||24.36||37.47||64.01||85.97||77.61|
|Panda 3CAM-RS||Panda (8)||✓||5944||1||RS||76.13||77.98||78.83||76.90||74.31|
|Panda ORB||Panda (8)||✓||32315||27||RS||61.93||57.09||69.05||80.54||70.39|
We evaluate our method on recent benchmarks for the following two tasks: (i) camera-to-robot 6D pose estimation for three widely used manipulators (Kuka iiwa7, Rethink robotic Baxter, Franka Emika Panda) [lee2020camera], and (ii) full state estimation of the low-cost 4 DoF robotic arm OWI-535 [zuo2019craves]. In Section 4.1, we consider the first task, where an image of a robot with fixed known joint angles is used to estimate the 6D camera-to-robot pose. We show that our approach outperforms the state-of-the-art DREAM method [lee2020camera]. In Section 4.2, we evaluate our full approach where both the 6D pose and joint angles are unknown. We show our method outperforms the state-of-the-art method [zuo2019craves] for this problem on their dataset depicting the low-cost 4 DoF robot and that it can recover the 6D pose and joint angles of more complex robotic manipulators. Finally, Section 4.3 analyzes the parametrization choices discussed in Section 3.4.
We focus on the datasets annotated with 6D pose and joint angle measurements recently introduced by the state-of-the-art method for single-view camera-to-robot calibration, DREAM [lee2020camera]. We use the provided training datasets with 100k images generated with domain randomization. Test splits are available as well as photorealistic synthetic test images (Photo). For the Panda robot, real datasets are also available. The Panda 3CAM datasets display the fixed robot performing various motions captured by 3 fixed different cameras with different focal lengths and resolution, all of which are different than the focal length used during training. The largest dataset with the more varied viewpoints is Panda-ORB with 32,315 real images in a kitchen environnement captured from 27 viewpoints with different joint angles in each image.
We use the 3D reconstruction ADD metric which directly measures the pose estimation accuracy, comparing distances between 3D keypoints defined at joint locations of the robot in the ground truth and predicted pose. We refer to the appendix for exact details on the evaluation protocol of our comparison with DREAM [lee2020camera].
We train one network for each robot using the same synthetic datasets as [lee2020camera] and report our results in Table 1. Our method achieves significant improvements across datasets and robots except on Panda 3CAM-RS where the performance of [lee2020camera] with ResNet101-H variant is similar to ours. On the Panda 3CAM-AK and Panda 3CAM-XK datasets, the performance of our method is significantly higher than the ResNet101-H model of [lee2020camera] (e.g. on 3CAM-XK), which suggests that the approach of [lee2020camera] based on 2D keypoints is more sensitive to some viewpoints or camera parameters. Note that our method trained with the synthetic GL camera can be applied to different real cameras with different intrinsics at test time thanks to our cropping strategy which provides an augmentation of the effective focal length during training.
On Panda-ORB, the largest real dataset that covers multiple camera viewpoints, our method achieves a large improvement of points. Our performance on the synthetic datasets for the Kuka and Baxter robots is also significantly higher than [lee2020camera]. We believe the main reason for this large improvements is the fact that our Render & Compare approach can directly use the shape of the entire robot rendered in the observed configuration for estimating the pose rather than detecting a small number of keypoints.
|Ours (individual frames)||Ours (online)||DREAM [lee2020camera]|
In Table 2 we report the running time of our method on the Panda-ORB dataset which consists of robot motion videos captured from 27 different viewpoints. The first observation is that the accuracy increases with the number of refinement iterations used at test-time, and the most significant improvements are during the first 3 iterations. The importance of using multiple network iterations during training is further discussed in the appendix. We also report an online version of our approach that leverages temporal continuity. It runs the refiner with iterations on the first frame of the video and then uses the output pose as the initialization for the next frame and so on, without any additional temporal filtering of the resulting 6D poses. This version runs at 16 frames per second (FPS) and achieves a similar performance as the full approach that considers each frame independently and runs at 1 FPS.
We now evaluate the performance of our method in the more challenging scenario where the robot joint angles are unknown and need to be estimated jointly with the 6D pose from a single RGB image. Qualitative results on the considered datasets as well as on real images crawled from the web are shown in Figure 5. Please see the appendix for additional qualitative examples, and the project page [projectpage] for a movie showing our predictions on several videos.
|CRAVES [zuo2019craves]||CRAVES [zuo2019craves]||ours|
|Error||all||top 50%||all||top 50%|
|Trans xyz. (cm)||10.1||5.52||0.61||0.42|
|Trans norm. (cm)||19.6||10.5||1.31||0.90|
CRAVES [zuo2019craves] is the state-of-the-art approach for this task. We consider the two datasets used in [zuo2019craves]. CRAVES-lab displays the OWI-535 4DoF in a lab environment and contains 20,000 RGB images of which 428 key frames are annotated with 2D robot keypoints, ground truth joint angles (not used by our method) and camera intrinsics. CRAVES-youtube is the second dataset containing real-world images crawled from YouTube depicting large variations in viewpoint, illumination conditions and robot appearance. It contains 275 frames annotated with 2D keypoints but no camera intrinsic parameters, 6D pose or joint angle ground truth. In addition to metrics that measure the 6D pose and joint angle estimates, we report a 2D keypoint metric, PCK (percentage of keypoints), following [zuo2019craves]. We refer to the appendix for details of the metrics and the evaluation protocol.
We compare with two variants of CRAVES, one trained only on synthetic images (synt), and one that also requires real non-annotated images (synt+real*). Our method is trained only using the provided synthetic images. We report results on CRAVES-lab in Table 3. To compare with the 2D keypoint metric PCK@0.2, we project in the image the 3D keypoints of our estimated robot state. On this metric, our method outperforms CRAVES trained only on synthetic images and achieves a near-perfect score, similar to their approach trained with real images. More importantly, we achieve much better results when comparing with the 3D metrics (joint angles error and translation/rotation error). CRAVES achieves high average errors when all images of the datasets are considered, which is due to the complexity of solving the nonlinear nonconvex 2D-to-3D optimization problem for recovering 6D pose and joint angles given 2D keypoint positions. Our method trained to directly predict the 6D pose and joint angles, achieves large improvements in precision. We reduce the translation error by a factor of 10, demonstrating robustness to depth ambiguities.
|synt [zuo2019craves]||synt+real* [zuo2019craves]||f=500||f=1000||f=1500||f=2000||f=best|
We also evaluated our method on CRAVES-youtube. On this dataset, the camera intrinisic parameters are unknown and cannot be used for projecting the estimated robot pose into the 2D image. We therefore report results for different hypothesis of (fixed) focal lengths for all the images of the dataset, as well as using an oracle (f=best) which selects the best focal length for each image. Results are reported in Table 4. For 2D keypoints, our method for achieves results superior to CRAVES while not requiring real training images. Our method also clearly outperforms CRAVES when selecting the best focal length. 3D ground truth is not available, but similar to CRAVES-lab we could expect large improvements in 3D accuracy.
We also train our method for jointly predicting 6D pose and joint angles for the robots considered in Section 4.1. We evaluate the 6D pose and joint angles accuracy using ADD. Results are reported in Table 1 in blue (last column). For the 7DoF robotic arms (Kuka and Panda), these results demonstrate a competitive or superior ADD accuracy compared to [lee2020camera] for inferring the 3D geometry of a known robot, but our method does not require known joint angles. The more complex 15 DoF Baxter robot remains challenging although our qualitative results often show reasonable alignments. We discuss the failure modes of our approach in the appendix.
We analyze our method on the Panda-ORB dataset: it is the largest real dataset containing significant variations in joint angles and camera viewpoints and the Panda robot has a long kinematic chain with 8 DoF. We study the choice of reference point for the 6D pose update and the choice of the anchor part (see Section 3.4).
We train different networks with the reference point at the origin of the root , the part in the middle of the kinematic chain and at the end of the kinematic chain . Results are reported in Table 5(a). We observe that the performance indeed depends on the choice of the reference point and our approach of using the centroid of the robot as the reference point performs the best. The network trained with the “Hand” part (, the end effector) as a reference point fails to converge because this part is often difficult to identify in the training images and its pose cannot be inferred from any other part because the robot is not a rigid object. We investigate picking the reference point on one of the five largest parts (measured by their 3D volume which is correlated with 2D visibility) in Table 5(b) again demonstrating our approach of using the centroid of the robot performs better than any of these specific parts.
Table 6 reports results using different strategies for chosing the anchor part during training and testing. First, in 6(a) we show that choosing different parts as one (fixed) anchor results in significant variation in the resulting performance. To mitigate this issue we consider in 6(b) a strategy where the anchor is picked randomly among the robot parts at each iteration (both during training and testing). This strategy performs better than always naively selecting the root as anchor. By restricting the sampled anchors to the largest parts, our automatic strategy can also perform better than the best performing part .
We have introduced a new render & compare approach to estimate the joint angles and the 6D camera-to-robot pose of an articulated robot from a single image demonstrating significant improvements over prior state-of-the-art for this problem. These results open-up exciting applications in visually guided manipulation or collaborative robotics without fiducial markers or time-consuming hand-eye calibration. To stimulate these applications, we released the training code as well as the pre-trained models for commonly used robots.
This work was partially supported by the HPC resources from GENCI-IDRIS (Grant 011011181R1), the European Regional Development Fund under the project IMPACT (reg. no. CZ.02.1.01/0.0/0.0/15 003/0000468), Louis Vuitton ENS Chair on Artificial Intelligence, and the French government under management of Agence Nationale de la Recherche as part of the ”Investissements d’avenir” program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute).
This appendix is organized as follows. In Section A, we provide the complete set of equations for the pose update and its relation to the reference point introduced in the main paper. In Section B, we give the details of the pose loss used in equation (3) in the main paper. Sections C, D, E and F give details of the cropping strategy, the state initialization, training strategy and evaluation, respectively. In Section G, we provide additional experiments showing the benefits of the iterative formulation and studying the importance of running the refiner network for multiple iterations during training. In Section H, we discuss the applicability of our approach to real robotic problems. Section I presents additional qualitative examples randomly selected from the images in the datasets introduced in Section 4 of the main paper. Finally, we discuss and illustrate the main failure modes of our approach in Section J. Additional examples are in the supplementary video [projectpage].
Any rigid motion can be modeled by a transformation in . The 6D pose of the anchor part of the robot at iteration is therefore defined by the composition of its current transformation w.r.t. to the camera coordinate system at iteration composed with a pose update transformation :
as introduced in Section 3.2 in the main paper. In the following we present the equations that define this pose update. We follow the parametrization of DeepIM’s [li2018deepim] pose update and explicitly parametrize the pose update using a 3D reference point we call . We choose to use the centroid of the estimated robot at the -th iteration as , but other choices are also possible. Our neural network refiner predicts parameters , , that are used to compute the 3D translation . The refiner also predicts the 3D rotation that together with define the pose update in (4). The parameters and correspond to the displacement in the image plane (in pixels) of the reference point at iteration and corresponds to a relative update of the depth of , again relative to the camera coordinate system:
where is the 2D projection onto the image plane of the 3D point before applying the pose update, the 2D reprojection onto the image plane of after applying the pose update, are the 3D coordinates of the reference point before the pose update expressed in the camera frame, and are the 3D coordinates of the reference point after the pose update expressed in the camera frame. and are the intrinsic parameters of the virtual cropped camera that are assumed known and fixed.
From these equations, we can derive the update of the 3D translation of the 3D point :
with the , and components of translation update defined as :
Note that depends on the 3D point . Hence the network has to learn to internally infer this point to predict consistent transformation updates. The rotation update (parametrized by three angles) predicted by the refiner network is applied around the 3D reference point . This defines the transformation of any 3D point on the anchor. Let’s consider a point on the anchor part in position at iteration , its position at iteration is given by:
where is the rotation matrix predicted by the network using the same rotation parametrization used in [labbe2020cosypose, Zhou2018-eg]. The equation (12) can also be used to define the rotation matrix of the anchor part with respect to the camera:
where defines the rotation of the anchor part with respect to the camera before the pose update, defines the rotation of the anchor part with respect to the camera after the pose update and is the rotation matrix predicted by the network. The equations (12) and (13) fully define the pose update given by equation (4) as a function of the network predictions , , and , and the reference point . For computing the loss in the next section, we write the pose of the anchor part in the camera coordinate system at iteration (after the pose update) as
where expresses the updated pose of the anchor as a function of all network predictions , , , , and can be computed in a closed form using the equations derived above.
We define the following distance to measure the distance between two transformations and using the 3D points on the anchor part :
where is the norm. The pose loss in equation (3) in the main paper follows [labbe2020cosypose] and is written as:
where defines the pose update as explained in (14) in Section A, is the ground truth 6D pose of the anchor with repect to the camera, are the parameters of the pose update predicted by the network, and are the values of these updates that would lead to the ground truth pose. The different terms of this loss separate (or disentangle) the influence of different subsets of parameters on the loss. In detail, the first term (16) of the loss considers only the translation of the anchor part, where the rest of the parameters are fixed to their ground truth values, and . The second term (17) of the loss considers only the relative depth of the anchor part where the rest of the parameters are fixed to their ground truth values. Finally, the last term (18) of the loss considers only the rotation update (18) where the rest of the parameters are fixed to their ground truth values.
Let be the 2D projection of the 3D centroid of the robot by the camera with intrinsics . We use a cropping strategy similar to DeepIM but using only the projection of the 3D points of the model in the current pose estimate to define the bounding box of the crop. Let , be the coordinates of the upper left and lower right corners of the bounding box defined by the robot projection in the image, respectively. We define:
is the aspect ratio of the crop and following DeepIM [li2018deepim]. The crop is centered at , of width and height . An example of a crop is given in Figure 6.
Let and define the center and the size of the approximate 2D bounding box of the robot in the image. In our experiments, there is only one robot per image and thus we use the entire image as the 2D bounding box and denote the bounding box . The orientation of the base of the robot is set parallel to the axes of the camera with the axis pointing upwards. The centroid of the robot is set to match the center of the bounding box . We make the first hypothesis of the depth of the centroid by setting and use this initial value to estimate the coordinates and of the centroid in the camera frame:
We then update the depth estimate using the following simple strategy. We project the points of the robot using the initial guess we have just defined. These points define a bounding box with dimensions and the center remains unchanged by construction. We compute an updated depth of the centroid such that its width and height approximately match the size of the 2D detection:
and use this new depth to compute and using equations (23) and (24) that were used to define and . The initial state of the robot is defined by , , and the initial joint angles are set (when they are not measured) to , where and define the interval of the robot articulation angles. An example of initialization is shown in Figure 7.
In the experiments of Section 4 of the main paper, we use the entire image as the 2D bounding box because the test images show a single robot. Please note however that it would be straightforward to process multiple bounding boxes like [labbe2020cosypose] for multiple rigid objects. If multiple robots are in the input bounding box (e.g. as in some of the examples in the results video on the project webpage [projectpage]), our iterative refinement typically converges to the largest robot in the image.
The training of the neural networks is performed with Adam [Kingma2014-xz] using a learning rate of and batches of 1408 images, split on 44 GPUs. The networks are trained for 60k iterations. Following recommendations from [Goyal2017-of] for distributed training, we use a warm up-phase: the learning rate is linearly increased from to to during the first 5k iterations. The learning rate is reduced to 0.0003 at 45k iterations. In order to speed up the training, we start by training with refinement iterations and add other iterations , at 15k and 30k iterations respectively. We found that starting to train with and increasing the number of refinement iterations during training has no effect on the results, but allows to train the refiner network faster.
Next, we give more details about the evaluation metric and the evaluation protocol of our comparison with DREAM[lee2020camera] and CRAVES [zuo2019craves].
The average distance (ADD) metric measures the distance between 3D keypoints on the robot transformed with the predicted and ground truth 6D poses. The 3D keypoints are the same as in [lee2020camera] for the Kuka, Panda and Baxter robots and are defined at the locations of robot joints. The ADD errors of the 3D keypoints for the robot depicted in the image are averaged, and we report the area-under-the-curve of the ADD pass rate vs. threshold following [lee2020camera]. We refer to [lee2020camera, Xiang2018-dv] for more details about this metric. A slight difference with the evaluation protocol of [lee2020camera] is that all images are considered as possible evaluation targets, whereas [lee2020camera] discards images where there are fewer than 4 robot keypoints visible in the ground truth because PnP cannot be solved in this situation. This favors methods that rely on keypoints and does not fully consider the difficulty of estimating the pose of robots in situations with high self-occlusions, a situation that our method can handle. We thus evaluate ADD on all images, even those with fewer than four visible robot keypoints. The models from DREAM [lee2020camera] were re-evaluated with this protocol and in practice the difference is only minor compared to the results reported in DREAM [lee2020camera], less than for all datasets. For reproducing DREAM results reported in Table 1 of the main paper, we used the provided code and pre-trained models111https://github.com/NVlabs/DREAM using the instructions to reproduce the results from the paper, e.g. using the “shrink-and-crop” cropping strategy. Note that while evaluating these models, we found some results to be higher than reported in the original paper, e.g. the ResNet101-H is the best model on the largest real dataset Panda-ORB (69.05) but only the VGG19-F (61.93) was reported in [lee2020camera]. We reported results for all models of DREAM [lee2020camera] in Table 1 of the main paper.
When evaluating our method that predicts joint angles, ADD is computed using 3D keypoints on the robot that are computed using both the estimated 6D pose and the predicted joint angles using the robot forward kinematics. This provides a principled way to measure the accuracy of the 3D reconstruction of the robot without giving arbitrary weights to rotation, translation or joint angle errors.
For the comparison with CRAVES [zuo2019craves], we based our evaluation on the code provided with the paper222https://github.com/zuoym15/craves.ai. The 3D metrics reported in Table 3 for CRAVES-lab are:
Joints (degrees): Errors of the joint angle predictions over the 4 articulations , where is the predicted value of the joint angle for joint and is the ground truth value of the joint angle for joint .
Trans xyz. (cm): Translation error for the base averaged over the axes: , where ”pred” denotes the predicted and ”gt” the ground truth values.
Trans norm. (cm): Translation error for the base computed using the norm: .
Rot. (degrees): Errors between the Euler angles that define the rotation of the base with respect to the camera , where again ”pred” denotes the predicted and ”gt” the ground truth values.
We found that the ”Joints”, “Trans xyz.” and ”Rot.” metrics reported in [zuo2019craves] under ”3D pose estimation errors” were an average computed only over the top 50% images with the best predictions according to the joint angle errors. We also report the errors averaged over all images of the dataset (all) in Table 3 of the main paper as our method can successfully recover the 3D configuration of the robot in all images of the dataset because it does not rely on solving the 2D-to-3D nonlinear, nonconvex optimization problem that is difficult to solve in some situations.
On CRAVES-Youtube only the visible 2D keypoints are considered as estimation targets to provide a fair comparison with “Youtube-vis” of [zuo2019craves]. The results that we reproduced with the provided pre-trained models were slightly lower (around 0.5% lower) than reported in the paper for the PCK@0.2 metric, but we kept the (higher) results reported in the paper when comparing with their method.
In this section, we investigate the benefits of the number of refiner iterations at training and test time. In the following, we denote as the number of refiner network iterations used during training (denoted in equation (3) of the main paper) and as the number of iterations used at test time. We experimentally evaluate the influence of and using the ADD metric on the Panda-ORB dataset with unknown joint angles, similar to the experimental set-up reported in Section 4.2. We trained four networks with and report the results for a varying number of iterations at test time for each network. Results are reported in Table 7.
First, we observe that for each refiner network (each row in the table), the accuracy is significantly improved by using multiple iterations at test time as opposed to running the network for a single iteration () in a single-shot regression fashion. These results demonstrate that the iterative formulation is crucial to the success of the approach.
Second, we observe that running the network for multiple iterations during training significantly improves the results. For example, for , the ADD results are improved from for to for . Using multiple training iterations without backpropagating through the renderer augments the distribution of errors between the input state and the ground truth state with actual errors the refiner makes. These results show the importance of training the refiner network to correct both the large errors (between the initial state and the ground truth state ) as well as smaller errors (between the prediction at iteration k, , and ) to simulate what the network is going to see at test-time where the network is run for multiple iterations. In our experiments in the paper, we used as it is a good trade-off between performance and training speed.
Similar to related work [lee2020camera, zuo2019craves], our approach can be applied in real robotic set-ups. For example, the offline hand-eye calibration approach DREAM [lee2020camera] has been used in [tremblay2020indirect] for a manipulation task on a real robot, and CRAVES [zuo2019craves] demonstrates visually guided closed-loop control of a low-cost robot without reliable joint angle measurements. Our approach significantly improves over the accuracy of DREAM [zuo2019craves] and CRAVES [lee2020camera], and therefore we expect to also improve the capabilities of robotic systems similar to [tremblay2020indirect, zuo2019craves]. Please note that while our paper focuses on the robot state estimation from a single image, it is also suited for real-time and online applications as explained in Section 4.1. In this scenario (also illustrated in the video on the project webpage [projectpage]), our predictions could be temporally smoothed to reduce the jitter in the predictions by applying a simple temporal low-pass filter similar to [tremblay2020indirect].
Here we provide more qualitative examples on the datasets where we have shown the quantitative evaluation. All the presented results consider the most challenging scenario where the joint angles are unknown and are predicted together with the robot’s 6D pose. For each dataset, we have randomly sampled 9 images without any further manual selection and show the input RGB image and the output state predicted by our method. Results are presented for the CRAVES-lab (Figure 8), CRAVES-youtube (Figure 9), Panda ORB (Figure 10), Kuka Photo (Figure 11) and Baxter DR (Figure 12) datasets.
There are four main failure modes and limitations of our approach in the most challenging set-up with unknown joint angles, illustrated in Figure 13. First, while our method is tolerant to some amount of self-occlusion, it can fail in situations where multiple parts of the robot are occluded by external objects. Second, in some in-the-wild images (where camera intrinsics are also unknown), our iterative alignment can get stuck in a local minima. This could be improved by (i) using a better initialization (e.g. by using a separate external coarse estimation method) or by (ii) trying multiple initializations and selecting the best result. The third failure mode is related to the symmetry of individual robot parts or of entire robot configurations that are not taken into account in our method. Finally, robots with many degrees of freedom such as the 15 DoF Baxter remain challenging (please see the quantitative evaluation in Section 4.2 in the main paper), although our qualitative results often show reasonable alignment, as illustrated in Figure 12.