Single-view robot pose and joint angle estimation via render & compare

04/19/2021
by Yann Labbé, et al.

We introduce RoboPose, a method to estimate the joint angles and the 6D camera-to-robot pose of a known articulated robot from a single RGB image. This is an important problem to grant mobile and itinerant autonomous systems the ability to interact with other robots using only visual information in non-instrumented environments, especially in the context of collaborative robotics. It is also challenging because robots have many degrees of freedom and an infinite space of possible configurations that often result in self-occlusions and depth ambiguities when imaged by a single camera. The contributions of this work are three-fold. First, we introduce a new render & compare approach for estimating the 6D pose and joint angles of an articulated robot that can be trained from synthetic data, generalizes to new unseen robot configurations at test time, and can be applied to a variety of robots. Second, we experimentally demonstrate the importance of the robot parametrization for the iterative pose updates and design a parametrization strategy that is independent of the robot structure. Finally, we show experimental results on existing benchmark datasets for four different robots and demonstrate that our method significantly outperforms the state of the art. Code and pre-trained models are available on the project webpage https://www.di.ens.fr/willow/research/robopose/.


1 Introduction

Figure 1: RoboPose. (a) Given a single RGB image of a known articulated robot in an unknown configuration (left), RoboPose estimates the joint angles and the 6D camera-to-robot pose (rigid translation and rotation) providing the complete state of the robot within the 3D scene, here illustrated by overlaying the articulated CAD model of the robot over the input image (right). (b) When the joint angles are known at test-time (e.g. from internal measurements of the robot), RoboPose can use them as an additional input to estimate the 6D camera-to-robot pose to enable, for example, visually guided manipulation without fiducial markers.

The goal of this work is to recover the state of a known articulated robot within a 3D scene using a single RGB image. The robot state is defined by (i) its 6D pose, i.e. a 3D translation and a 3D rotation with respect to the camera frame, and (ii) the joint angle values of the robot’s articulations. The problem set-up is illustrated in Figure 1. This is an important problem to grant mobile and itinerant autonomous systems the ability to interact with other robots using only visual information in non-instrumented environments. For instance, in the context of collaborative tasks between two or more robots, having knowledge of the pose and the joint angle values of all other robots would allow better distribution of the load between robots involved in the task [caccavale2016cooperative].

The problem is, however, very challenging because robots can have many degrees of freedom (DoF) and an infinite space of admissible configurations that often result in self-occlusions and depth ambiguities when imaged by a single camera. The current best performing methods for this problem [lee2020camera, zuo2019craves] use a deep neural network to localize in the image a fixed number of pre-defined keypoints (typically located at the articulations) and then solve a 2D-to-3D optimization problem to recover the robot 6D pose [lee2020camera] or pose and configuration [zuo2019craves]. For rigid objects, however, methods based on 2D keypoints [Lowe1999-bf, bay2006surf, Collet2010-zj, Collet2011-lj, Rad2017-de, Tremblay2018-bd, Kehl2017-ek, Tekin2017-hp, peng2019pvnet, pavlakos20176, hu2019segmentation] have recently been outperformed by render & compare methods that forgo explicit detection of 2D keypoints and instead use the entire shape of the object, comparing the rendered view of the 3D model to the input image and iteratively refining the object's 6D pose [Rad2017-de, zakharov2019dpod, li2018deepim, labbe2020cosypose]. Motivated by this success, we investigate how to extend the render & compare paradigm to articulated objects.

This presents significant challenges. First, we need to estimate many more degrees of freedom than the 6D pose alone: the articulated robots we consider in this work can have up to 15 degrees of freedom in addition to their 6D rigid pose in the environment. Second, the space of configurations is continuous, and hence there are infinitely many configurations in which the object can appear. As a result, it is not possible to see all configurations during training, and the method has to generalize to unseen configurations at test time. Third, the choice of transformation parametrization plays an important role for 6D pose estimation of rigid objects [li2018deepim], and finding a good parametrization of the pose updates for articulated objects is a key technical challenge.

Contributions.

To address these challenges, we make the following contributions. First, we introduce a new render & compare approach for estimating the 6D pose and joint angles of an articulated robot that can be trained from synthetic data, generalizes to new unseen robot configurations at test time, and can be applied to a large variety of robots (robotic arms, bi-manual robots, etc.). Second, we experimentally demonstrate the importance of the robot pose parametrization for the iterative pose updates and design an effective parametrization strategy that is independent of the robot. Third, we apply the proposed method in two settings: (i) with known joint angles (e.g. provided by internal measurements from the robot such as joint encoders), only predicting the camera-to-robot 6D pose, and (ii) with unknown joint angles, predicting both the joint angles and the camera-to-robot 6D pose. We show experimental results on existing benchmark datasets for both settings that include a total of four different robots and demonstrate significant improvements compared to the state of the art.

2 Related work

6D pose estimation of rigid objects

from RGB images [Roberts1963-ck, Lowe1987-yf, Lowe1999-bf] is one of the oldest problems in computer vision. It has been successfully approached by estimating the pose from 2D-3D correspondences obtained via local invariant features [Lowe1999-bf, bay2006surf, Collet2010-zj, Collet2011-lj], or by template-matching [Hinterstoisser2011-es]. Both these strategies have been revisited using convolutional neural networks (CNNs). A set of sparse [Rad2017-de, Tremblay2018-bd, Kehl2017-ek, Tekin2017-hp, peng2019pvnet, pavlakos20176, hu2019segmentation] or dense [Xiang2018-dv, Park2019-od, song2020hybridpose, zakharov2019dpod] features is detected on the object in the image using a CNN and the resulting 2D-to-3D correspondences are used to recover the camera pose using PnP [lepetit2009epnp]. The best performing methods for 6D pose estimation from RGB images are now based on variants of the render & compare strategy [oberweger2015training, Rad2017-de, li2018deepim, manhardt2018deep, oberweger2019generalized, labbe2020cosypose, zakharov2019dpod] and are approaching the accuracy of methods using depth as input [hodan2020bop, Hodan_undated-sl, li2018deepim, labbe2020cosypose].

Hand-eye calibration

(HEC) [Horaud1995-fl, Heller2011-so] methods recover the 6D pose of the camera with respect to a robot. The most common approach is to detect in the image fiducial markers [garrido2014automatic, fiala2005artag, olson2011apriltag] placed on the robot at known positions. The resulting 3D-to-2D correspondences are then used to recover the camera-to-robot pose using known joint angles and the kinematic description of the robot by solving an optimization problem [park1994robot, ilonen2011robust, yang1994calibrating]. Recent works have explored using CNNs [lambrecht2019towards, lee2020camera] to perform this task by recognizing 2D keypoints at specific robot parts and using the resulting 3D-to-2D correspondences to recover the hand-eye calibration via PnP. The most recent work in this direction [lee2020camera] demonstrated that such a learning-based approach could replace more standard hand-eye calibration methods [tsai1989new] to perform online calibration and object manipulation [tremblay2020indirect]. Our render & compare method significantly outperforms [lee2020camera] and we also demonstrate that our method can achieve a competitive accuracy without requiring known joint angles at test time.

Depth-based pose estimation of articulated objects.

Previous work on this problem can be split into three classes. The first class of methods aims at discovering properties of the kinematic chain through active manipulation [katz2008manipulating, katz2013interactive, hausman2015active, martin2016integrated] using depth as input and unlike our approach cannot be applied to a single image. The second class of methods aims at recovering all parameters of the kinematic chain from a single RGBD image, including the joint angles, without knowing the specific articulated object [li2020category, yi2018deep, abbatematteo2019learning, zhou2016simultaneous]. In contrast, we focus on the set-up with a known 3D model, e.g. a specific type of a robot. The third class of methods, which is closest to our set-up, considers pose and joint angle estimation [michel2015pose, desingh2019factored, pauwels2014real] for known articulated objects but relies on depth as input and only considers relatively simple kinematic chains such as laptops or drawers where the joint parameters only affect the pose of one part. Others recover joint angles of a known articulated robot part [bohg2014robot, widmaier2016robot] but do not recover the 6D pose of the camera and also rely on depth. In contrast, our method accurately estimates the pose and joint angles of a robotic arm with many degrees of freedom from a single RGB image.

Robot pose and joint angle estimation from an RGB image.

To the best of our knowledge, only [zuo2019craves] has addressed a scenario similar to ours, where the robot pose and joint angles are estimated together from a single RGB image. A set of predefined 2D keypoints is recognized in the image and the 6D pose and joint angles are then recovered by solving a nonlinear, non-convex optimization problem. Results are shown on a 4 DoF robotic arm. In contrast, we describe a new render & compare approach for this problem, demonstrate significant improvements in 3D accuracy, and show results on robots with up to 15 DoF.

3 Approach

We present our render & compare framework to recover the state of a robot within a 3D scene given a single RGB image. We assume the camera intrinsic parameters, the CAD model and kinematic description of the robot are known. We start by formalizing the problem in Section 3.1. We then present an overview of our approach in Section 3.2 and explain our training in Section 3.3. Finally, we detail the key choices in the problem parametrization in Section 3.4.

3.1 Problem formalization

Figure 2: Problem definition. Given an RGB image (a) of a known robot, the goal is to recover (b) the 6D pose of an anchor part with respect to the camera frame and all the joint angles of the known robot kinematic description (in green).

Our notations are summarized in Figure 2. We consider a known robot composed of rigid parts whose 3D models are known. An articulation, or joint, connects a parent part to a child part. Given the angle of a joint, we can retrieve the relative 6D transformation between the parent and child reference frames. Note that for simplicity we only consider revolute joints, i.e. joints parametrized by a single scalar angle of rotation, but our method is not specific to this type of joint. The rigid parts and the joints define the kinematic tree of the robot. This kinematic description can be used to compute the relative 6D pose between any two parts of the robot. In robotics, the full state of a robot is defined by the joint angles and the 6D pose of the root of the kinematic tree. Defining the 6D pose of the robot with respect to the root (whose pose is independent of the joint angles since it is not the child of any joint) is a crucial choice in the parametrization of the problem, but also an arbitrary one, since an equivalent kinematic tree could be defined using any part as the root. We instead define the full state of the robot by (i) the selection of an anchor part, (ii) the 6D pose of the anchor with respect to the camera, and (iii) the joint angles of the robot. Note that the anchor part can change across iterations of our approach. We discuss the choice of the anchor in Section 3.4 and experimentally demonstrate that it has an important influence on the results.
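To make the kinematic description concrete, the sketch below shows how joint angles are turned into part poses by traversing the kinematic tree. The class and function names are illustrative and not part of the paper's code; the fixed parent-to-joint transforms and rotation axes would come from the robot's kinematic description (e.g. its URDF).

```python
import numpy as np

def rotation_about_axis(axis, angle):
    """Rodrigues' formula: 3x3 rotation of `angle` radians about the unit vector `axis`."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

class RevoluteJoint:
    """One revolute joint of the kinematic tree (illustrative data structure)."""
    def __init__(self, parent, child, T_parent_joint, axis):
        self.parent = parent                  # name of the parent part
        self.child = child                    # name of the child part
        self.T_parent_joint = T_parent_joint  # fixed 4x4 transform: parent frame -> joint frame
        self.axis = axis                      # rotation axis expressed in the joint frame

def forward_kinematics(joints, q, root="base"):
    """Return the 4x4 pose of every part expressed in the root frame.
    `joints` must be topologically sorted (parents before children); `q` maps
    each child part name to its joint angle in radians."""
    poses = {root: np.eye(4)}
    for joint in joints:
        T_joint = np.eye(4)
        T_joint[:3, :3] = rotation_about_axis(joint.axis, q[joint.child])
        poses[joint.child] = poses[joint.parent] @ joint.T_parent_joint @ T_joint
    return poses
```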

3.2 Render & compare for robot state estimation

Figure 3: RoboPose overview. Given a single input RGB image, the state (6D camera-to-robot pose and joint angles) of the robot is iteratively updated using renderer and refiner modules to match the input image. The refinement module takes as input the cropped observed image and a rendering of the robot as well as the mask of an anchor part. The anchor part is used for updating the rigid 6D pose of the robot while the rest of the parts are updated by changing their joint angles. Note that the anchor part changes across iterations, making the refinement more robust.

We now present our iterative deep render & compare framework, illustrated in Figure 3. We iteratively refine the state estimate as follows. First, given a current estimate of the state we render an RGB image of the robot and the mask of the anchor part. We then apply a deep refiner network that takes as input crops of the rendered image and the input RGB image of the scene. It outputs a new state of the robot to attempt to match the ground truth state of the observed robot. Unlike prior works that have used render & compare strategies for estimating the 6D pose of rigid objects [li2018deepim, zakharov2019dpod, labbe2020cosypose], our method does not require a coarse pose estimate as initialization.

Image rendering and cropping.

To render the image of the robot we use a fixed focal length (defining an intrinsic camera matrix) during training. The rendering is fully defined by the state of the robot and the camera matrix. Instead of giving the refiner network the full image and the rendered view, we focus the inputs on the robot by cropping the images as follows. We project the centroid of the rendered robot into the image, consider the smallest bounding box of aspect ratio 4/3 centered on this point that encloses the projected robot, and increase its size by 40% (see details in the appendix). This crop depends on the projection of the robot onto the input image, which varies during training, and thus provides an augmentation of the effective focal length of the virtual cropped camera. Hence, our method can be applied to cameras with different intrinsics at test time, as we show in our experiments.

Initialization.

We initialize the robot to a state defined by an initial joint configuration and an initial pose of the anchor part with respect to the camera. At training time, we define this initial state using perturbations of the ground-truth state. At test time, we initialize the joints to the middle of the joint limits, and the initial pose so that the frame of the robot base is aligned with the camera frame and the 2D bounding box defined by the projection of the robot model approximately matches the size of the image. More details are given in the appendix.

Refiner and state update.

At iteration k, the refiner predicts an update Δq_k of the joint angles q_k (composed of one scalar angle per joint), such that

q_{k+1} = q_k + Δq_k,    (1)

and an update ΔT_k of the current 6D pose T_k of the anchor part, such that

T_{k+1} = ΔT_k T_k,    (2)

where we follow DeepIM [li2018deepim]'s parametrization for the pose update ΔT_k. This parametrization disentangles the rotation and translation predictions but crucially depends on the choice of a reference point. In DeepIM this point is simply taken as the center of the reference frame of the rigid object, but there is no such natural choice of reference point for articulated objects, which have multiple moving parts. We discuss several possible choices of the reference point in Sec. 3.4 and demonstrate experimentally that it has an important impact on the results. In particular, we show that naively selecting the reference frame of the root part is sub-optimal.
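As a rough illustration of the overall loop (not the paper's implementation), the sketch below shows how the renderer, cropping and refiner network interact; `render_fn`, `crop_fn`, `update_fn`, `refiner` and the returned `(delta_pose, delta_q)` pair are placeholders for the components described above.

```python
import torch

def refine(observed_image, K, state, refiner, render_fn, crop_fn, update_fn, n_iterations=10):
    """Generic render & compare loop. The caller supplies:
      render_fn(state, K)   -> rendered RGB image of the robot and the anchor-part mask,
      crop_fn(obs, rend, mask, state, K) -> cropped observed / rendered / mask tensors,
      update_fn(state, delta_pose, delta_q) -> new state, applying eqs. (1) and (2)."""
    for _ in range(n_iterations):
        rendered, anchor_mask = render_fn(state, K)
        obs, rend, mask = crop_fn(observed_image, rendered, anchor_mask, state, K)
        with torch.no_grad():
            delta_pose, delta_q = refiner(torch.cat([obs, rend, mask], dim=1))
        state = update_fn(state, delta_pose, delta_q)
    return state
```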

3.3 Training

In the following, we describe our loss function, synthetic training data, implementation details and discuss how to best use known joint angles if available.

Loss function.

We train our refiner network using the following loss:

L(θ) = Σ_{k=1}^{K} [ L_T(T_k, T_gt) + α L_q(q_k, q_gt) ],    (3)

where θ are the parameters of the refiner network, K is the maximum number of iterations of the refinement algorithm, T_gt is the ground-truth 6D pose of the anchor, q_gt are the ground-truth joint angles, and α is a hyper-parameter balancing the 6D pose loss L_T and the joint angle loss L_q. The 6D pose loss L_T measures the distance between the predicted 3D point cloud of the anchor (obtained by transforming the anchor points with the estimated pose T_k) and the ground-truth 3D point cloud of the anchor (obtained using T_gt). We use the same loss as [labbe2020cosypose], which disentangles rotation, depth and image-plane translations [Simonelli2019-da] (see equations in the appendix). For L_q, we use a simple regression loss.

Note that the 6D pose loss is measured only on the anchor part, while the alignment of the other parts of the robot is measured by the error on their joint angles (rather than by the alignment of their 3D point clouds). This disentangles the 6D pose loss from the joint angle loss and we found this leads to better convergence. We sum the loss over the refinement iterations to imitate how the refinement algorithm is applied at test time, but the error gradients are not backpropagated through rendering and iterations. Finally, for simplicity the loss (3) is written for a single training example, but we sum it over all examples in the training set.
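A minimal sketch of this training loss is given below, assuming an L1 regression for the joint angles and a plain point-cloud distance on the anchor; the actual pose loss is the disentangled loss detailed in Appendix B, and all tensor names are illustrative.

```python
import torch

def training_loss(iterates, T_gt, q_gt, anchor_points, alpha=1.0):
    """Sketch of eq. (3): `iterates` is the list of (T_anchor, q) predicted at each of
    the K refinement iterations, `anchor_points` is an (N, 3) tensor of 3D points on the
    anchor part, and `alpha` weights the joint-angle term."""
    pts_gt = anchor_points @ T_gt[:3, :3].T + T_gt[:3, 3]
    loss = anchor_points.new_zeros(())
    for T_pred, q_pred in iterates:
        pts_pred = anchor_points @ T_pred[:3, :3].T + T_pred[:3, 3]
        loss_pose = (pts_pred - pts_gt).abs().mean()   # anchor point-cloud alignment
        loss_q = (q_pred - q_gt).abs().mean()          # joint-angle regression
        loss = loss + loss_pose + alpha * loss_q
    return loss
```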

Training data.

For training the refiner, we use existing datasets [lee2020camera, zuo2019craves] provided by prior works for the Kuka, Panda, Baxter, OWI-535 robots. All of these datasets are synthetic, generated using similar procedures based on domain randomization [tobin2017domain, Loing2018, Sadeghi2017-dr, James2018-db]. The joint angles are sampled independently and uniformly within their bounds, without assuming further knowledge about their test time distribution. We add data augmentation similar to [labbe2020cosypose].

We sample the initial state by adding noise to the ground-truth state in order to simulate errors of the network prediction at the previous step of the refinement, as well as the error at initialization. For the pose, we sample a translation from a centered Gaussian distribution with a standard deviation of a few centimeters, and a rotation by sampling three angles from a centered Gaussian distribution. For the joint angles, we sample an additive noise from a centered Gaussian distribution with a standard deviation defined as a fixed fraction of the joint range of motion.
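The sampling procedure can be sketched as follows; the standard deviations used here are placeholders rather than the paper's settings, and SciPy's `Rotation` is used for the random rotation.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def sample_initial_state(T_gt, q_gt, joint_limits,
                         trans_std=0.02, rot_std_deg=10.0, q_std_frac=0.05):
    """Perturb the ground-truth state (T_gt: 4x4 anchor pose, q_gt: joint angles) to obtain
    the training-time initialization. joint_limits is an (n_joints, 2) array of bounds.
    The standard deviations are placeholder values, not the ones used in the paper."""
    rng = np.random.default_rng()
    T0 = T_gt.copy()
    T0[:3, 3] += rng.normal(0.0, trans_std, size=3)              # translation noise (meters)
    euler = rng.normal(0.0, rot_std_deg, size=3)                 # three angles (degrees)
    T0[:3, :3] = Rotation.from_euler("xyz", euler, degrees=True).as_matrix() @ T0[:3, :3]
    q_range = joint_limits[:, 1] - joint_limits[:, 0]
    q0 = q_gt + rng.normal(0.0, q_std_frac * q_range)            # joint-angle noise
    return T0, np.clip(q0, joint_limits[:, 0], joint_limits[:, 1])
```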

Implementation details.

We train separate networks for each robot. We use a standard ResNet-34 architecture [he2016deep] as the backbone of the deep refiner. The main hyper-parameters are the loss weight α and the number of refinement iterations used during training, K = 3 (this choice is analyzed in the appendix). Note that at test time we can perform more iterations, and the results we report correspond to 10 iterations. The anchor is sampled randomly among the 5 largest parts of the robot at each iteration. This choice is motivated in Section 3.4 and other choices are considered in the experiments, Section 4.3. We initialize the network parameters randomly and perform the optimization using Adam [Kingma2014-xz], with the procedure described in the appendix, for all networks.

Known joint angles at test time.

The approach described previously could be used at test time with measured joint angles by simply ignoring the joint update, but we observed better results by training a separate network which only predicts a pose update for this scenario. In this context where the joint values are known and constant, the full robot is considered as a single, unique anchor. Yet, the problem remains different from classic rigid-object 6D pose estimation because the network must generalize to new joint configurations unseen during training.

3.4 Parametrization choices

There are two main parametrization choices in our approach: (i) the choice of the reference point for the parametrization of the pose update in equation (2) and (ii) the choice of the anchor part to update the 6D pose and measure pose loss in equation (3). These choices have a significant impact on the results, as shown in Section 4.

Choice of the reference point for the pose update.

Similar to [li2018deepim], we parametrize the pose update as a rotation around a reference point and a translation defined as a function of the position of that reference point with respect to the camera. The fact that the rotation is applied around the reference point is a first obvious influence of this choice on the transformation that needs to be predicted. The impact on the translation update parameters is more subtle: they are defined by a multiplicative update on the depth of the reference point and by an equivalent update in pixels in the image plane, which is also related to the real metric update by the depth of the reference point (see equations in the appendix).

A seemingly natural choice for the reference point would be a physical point on the robot, for example the center of the base of the robot or of the anchor part. However, contrary to the rigid object case, if that part is not visible or is partially occluded, the network cannot infer the position of the reference point precisely, and thus cannot predict a relevant translation and rotation update. In our experiments, we show that it is better to use the centroid of the estimated robot state as the reference point: it takes into account the estimated joint configuration and can be estimated more reliably.

Choice of the anchor part.

The impact of the choice of the anchor part used for computing the 6D pose loss in equation (3) is illustrated in Figure 4. We explore several choices of anchor part in our experiments and show that this choice has a significant impact on the results. Since the optimal choice depends on the robot and on the observed pose, we introduce a strategy where we randomly select the anchor among the largest parts of the robot, during both training and refinement, and show that on average it performs similarly to or slightly better than the optimal oracle choice of a single fixed anchor on the test set.

Figure 4: Choice of the anchor part. (a) We analyze how the choice of the anchor part affects the complexity of the rigid alignment and of the joint angle update needed to align an initial state of the robot (green) with the target state of the robot (red). (b) We show the required rigid pose update (composed of a rotation and a translation) and the required joint update for two different choices of the anchor part (shown using a dashed line). In (1), the required pose update of the anchor part consists of successively applying a rotation and translations along two axes (in blue). In (2), the anchor part is aligned using only a translation along a single axis, resulting in a simpler solution compared to (1). These examples illustrate that the choice of the anchor can have a significant impact on the complexity of the alignment problem.
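The anchor sampling strategy amounts to the following few lines; part names and volumes are assumed to be available from the robot model, and the function name is illustrative.

```python
import numpy as np

def sample_anchor(part_names, part_volumes, n_largest=5, rng=None):
    """Pick the anchor uniformly among the `n_largest` parts by 3D volume.
    Re-sampled at every refinement iteration, during both training and testing."""
    rng = np.random.default_rng() if rng is None else rng
    largest = np.argsort(np.asarray(part_volumes))[::-1][:n_largest]
    return part_names[int(rng.choice(largest))]
```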

4 Experiments

| Dataset | Robot (DoF) | Real | # images | # 6D poses | cam. | DREAM VGG19-F | DREAM VGG19-Q | DREAM ResNet101-H | Ours ResNet34 | Ours (unknown angles) |
| Baxter DR | Baxter (15) | no | 5982 | 5982 | GL | - | 75.47 | - | 86.59 | 32.66 |
| Kuka DR | Kuka (7) | no | 5997 | 5997 | GL | - | - | 73.30 | 89.62 | 80.17 |
| Kuka Photo | Kuka (7) | no | 5999 | 5999 | GL | - | - | 72.14 | 86.87 | 73.23 |
| Panda DR | Panda (8) | no | 5998 | 5998 | GL | 81.33 | 77.82 | 82.89 | 92.70 | 82.85 |
| Panda Photo | Panda (8) | no | 5997 | 5997 | GL | 79.53 | 74.30 | 81.09 | 89.89 | 79.72 |
| Panda 3CAM-AK | Panda (8) | yes | 6394 | 1 | AK | 68.91 | 52.38 | 60.52 | 76.54 | 70.37 |
| Panda 3CAM-XK | Panda (8) | yes | 4966 | 1 | XK | 24.36 | 37.47 | 64.01 | 85.97 | 77.61 |
| Panda 3CAM-RS | Panda (8) | yes | 5944 | 1 | RS | 76.13 | 77.98 | 78.83 | 76.90 | 74.31 |
| Panda ORB | Panda (8) | yes | 32315 | 27 | RS | 61.93 | 57.09 | 69.05 | 80.54 | 70.39 |
Table 1: Comparison of RoboPose (ours) with the state-of-the-art approach DREAM [lee2020camera] for the camera-to-robot 6D pose estimation task, using the 3D reconstruction ADD metric (higher is better). The robot joint configuration is assumed to be known (all columns except the last) and is different in each image of the dataset, but the pose of the camera with respect to the robot can be fixed (see the # 6D poses column). Multiple cameras are used to capture the input RGB images: synthetic rendering (GL), and real Microsoft Azure Kinect (AK), Microsoft Kinect360 (XK) and Intel RealSense (RS) cameras, which all have different intrinsic parameters. The results in the last column ("Unknown angles") do not use ground-truth joint angles (see Section 4.2); the accuracy of the robot 3D reconstruction is then evaluated using both the estimated 6D pose and the estimated joint angles.

We evaluate our method on recent benchmarks for the following two tasks: (i) camera-to-robot 6D pose estimation for three widely used manipulators (Kuka iiwa7, Rethink Robotics Baxter, Franka Emika Panda) [lee2020camera], and (ii) full state estimation of the low-cost 4 DoF robotic arm OWI-535 [zuo2019craves]. In Section 4.1, we consider the first task, where an image of a robot with fixed known joint angles is used to estimate the 6D camera-to-robot pose. We show that our approach outperforms the state-of-the-art DREAM method [lee2020camera]. In Section 4.2, we evaluate our full approach where both the 6D pose and the joint angles are unknown. We show that our method outperforms the state-of-the-art method [zuo2019craves] for this problem on their dataset depicting the low-cost 4 DoF robot, and that it can recover the 6D pose and joint angles of more complex robotic manipulators. Finally, Section 4.3 analyzes the parametrization choices discussed in Section 3.4.

4.1 6D pose estimation with known joint angles

Datasets and metrics.

We focus on the datasets annotated with 6D pose and joint angle measurements recently introduced by the state-of-the-art method for single-view camera-to-robot calibration, DREAM [lee2020camera]. We use the provided training datasets with 100k images generated with domain randomization. Test splits are available, as well as photorealistic synthetic test images (Photo). For the Panda robot, real datasets are also available. The Panda 3CAM datasets display the fixed robot performing various motions captured by three different fixed cameras with different focal lengths and resolutions, all of which are different from the focal length used during training. The largest dataset with the most varied viewpoints is Panda-ORB, with 32,315 real images in a kitchen environment captured from 27 viewpoints with different joint angles in each image.

We use the 3D reconstruction ADD metric which directly measures the pose estimation accuracy, comparing distances between 3D keypoints defined at joint locations of the robot in the ground truth and predicted pose. We refer to the appendix for exact details on the evaluation protocol of our comparison with DREAM [lee2020camera].

Comparison with DREAM [lee2020camera].

We train one network for each robot using the same synthetic datasets as [lee2020camera] and report our results in Table 1. Our method achieves significant improvements across datasets and robots, except on Panda 3CAM-RS where the performance of [lee2020camera] with the ResNet101-H variant is similar to ours. On the Panda 3CAM-AK and Panda 3CAM-XK datasets, the performance of our method is significantly higher than the ResNet101-H model of [lee2020camera] (e.g. +21.96 ADD points on Panda 3CAM-XK), which suggests that the approach of [lee2020camera] based on 2D keypoints is more sensitive to some viewpoints or camera parameters. Note that our method trained with the synthetic GL camera can be applied to different real cameras with different intrinsics at test time thanks to our cropping strategy, which provides an augmentation of the effective focal length during training.

On Panda-ORB, the largest real dataset that covers multiple camera viewpoints, our method achieves a large improvement of more than 11 ADD points over the best DREAM model. Our performance on the synthetic datasets for the Kuka and Baxter robots is also significantly higher than [lee2020camera]. We believe the main reason for these large improvements is that our render & compare approach can directly use the shape of the entire robot rendered in the observed configuration for estimating the pose, rather than detecting a small number of keypoints.

Figure 5: Qualitative results of RoboPose 6D pose and joint angle estimation for four different robots. (a) The OWI-535 robot from the CRAVES-lab (first row) and CRAVES-youtube (second and third row) datasets, (b) the Panda robot from the Panda 3CAM dataset and (c) the Panda, Baxter and Kuka robots on example images from the Internet. Please see additional results in the appendix and in the supplementary video on the project webpage [projectpage].
| | Ours (individual frames) | | | | | Ours (online) | DREAM [lee2020camera] |
| | K=1 | K=2 | K=3 | K=5 | K=10 | K=1 | ResNet101-H |
| ADD | 28.5 | 72.8 | 79.1 | 80.4 | 80.7 | 80.6 | 69.1 |
| FPS | 16 | 8 | 4 | 2 | 1 | 16 | 30 |
Table 2: Benefits of iterative refinement and running time on the Panda-ORB video sequences of robot trajectories. We report ADD and running time (frames per second, FPS) for a varying number of refiner iterations K. The frames are either considered individually, or the estimate is used to initialize the refiner in the subsequent frames (online) without additional temporal filtering.

Running time.

In Table 2 we report the running time of our method on the Panda-ORB dataset, which consists of robot motion videos captured from 27 different viewpoints. The first observation is that the accuracy increases with the number of refinement iterations used at test time, and the most significant improvements happen during the first 3 iterations. The importance of using multiple network iterations during training is further discussed in the appendix. We also report an online version of our approach that leverages temporal continuity. It runs the refiner for multiple iterations on the first frame of the video and then uses the output pose as the initialization for the next frame, and so on, without any additional temporal filtering of the resulting 6D poses. This version runs at 16 frames per second (FPS) and achieves a performance similar to the full approach that considers each frame independently and runs at 1 FPS.
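A sketch of this online variant, reusing the generic `refine` loop from Section 3.2, is shown below; the number of iterations used on the first frame and the `init_state_fn` helper are assumptions.

```python
def track_video(frames, K, refiner, render_fn, crop_fn, update_fn, init_state_fn,
                n_first=10, n_online=1):
    """Online variant: full refinement on the first frame, then a single refiner iteration
    per subsequent frame initialized from the previous estimate (no temporal filtering)."""
    state = refine(frames[0], K, init_state_fn(frames[0], K), refiner,
                   render_fn, crop_fn, update_fn, n_iterations=n_first)
    states = [state]
    for image in frames[1:]:
        state = refine(image, K, state, refiner, render_fn, crop_fn, update_fn,
                       n_iterations=n_online)
        states.append(state)
    return states
```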

4.2 6D pose and joint angle estimation

We now evaluate the performance of our method in the more challenging scenario where the robot joint angles are unknown and need to be estimated jointly with the 6D pose from a single RGB image. Qualitative results on the considered datasets as well as on real images crawled from the web are shown in Figure 5. Please see the appendix for additional qualitative examples, and the project page [projectpage] for a movie showing our predictions on several videos.

| Metric | CRAVES synt [zuo2019craves] | CRAVES synt+real* [zuo2019craves] | Ours synt |
| PCK@0.2 | 95.66 | 99.55 | 99.20 |

| Error | CRAVES (all) | CRAVES (top 50%) | Ours (all) | Ours (top 50%) |
| Joints (degrees) | 11.3 | 4.74 | 5.49 | 3.22 |
| Trans xyz. (cm) | 10.1 | 5.52 | 0.61 | 0.42 |
| Trans norm. (cm) | 19.6 | 10.5 | 1.31 | 0.90 |
| Rot. (degrees) | 10.3 | 5.29 | 4.12 | 2.91 |
Table 3: Results on the CRAVES-lab [zuo2019craves] dataset with unknown joint angles. We report average errors on all the images of the dataset, or on the top 50% images selected according to the best joint angle accuracy with respect to the ground truth. Networks are trained on synthetic data only (synt) or also using non-annotated real images of the robot (synt+real*).

Comparison with CRAVES [zuo2019craves].

CRAVES [zuo2019craves] is the state-of-the-art approach for this task. We consider the two datasets used in [zuo2019craves]. CRAVES-lab displays the 4 DoF OWI-535 robot in a lab environment and contains 20,000 RGB images, of which 428 key frames are annotated with 2D robot keypoints, ground-truth joint angles (not used by our method) and camera intrinsics. CRAVES-youtube is the second dataset, containing real-world images crawled from YouTube depicting large variations in viewpoint, illumination conditions and robot appearance. It contains 275 frames annotated with 2D keypoints but no camera intrinsic parameters, 6D pose or joint angle ground truth. In addition to metrics that measure the 6D pose and joint angle estimates, we report a 2D keypoint metric, PCK (percentage of correct keypoints), following [zuo2019craves]. We refer to the appendix for details of the metrics and the evaluation protocol.

We compare with two variants of CRAVES, one trained only on synthetic images (synt), and one that also requires real non-annotated images (synt+real*). Our method is trained only using the provided synthetic images. We report results on CRAVES-lab in Table 3. To compare with the 2D keypoint metric PCK@0.2, we project into the image the 3D keypoints of our estimated robot state. On this metric, our method outperforms CRAVES trained only on synthetic images and achieves a near-perfect score, similar to their approach trained with real images. More importantly, we achieve much better results on the 3D metrics (joint angle error and translation/rotation errors). CRAVES achieves high average errors when all images of the dataset are considered, which is due to the complexity of solving the nonlinear, nonconvex 2D-to-3D optimization problem for recovering the 6D pose and joint angles given 2D keypoint positions. Our method, trained to directly predict the 6D pose and joint angles, achieves large improvements in precision. In particular, we reduce the translation error by a factor of 10, demonstrating robustness to depth ambiguities.

| CRAVES synt [zuo2019craves] | CRAVES synt+real* [zuo2019craves] | Ours (f=500) | Ours (f=1000) | Ours (f=1500) | Ours (f=2000) | Ours (f=best) |
| 81.61 | 88.89 | 87.34 | 88.97 | 87.37 | 85.49 | 92.91 |
Table 4: PCK@0.2 on the CRAVES-Youtube dataset [zuo2019craves].

We also evaluated our method on CRAVES-youtube. On this dataset, the camera intrinsic parameters are unknown and cannot be used for projecting the estimated robot pose into the 2D image. We therefore report results for different hypotheses of (fixed) focal length used for all the images of the dataset, as well as using an oracle (f=best) which selects the best focal length for each image. Results are reported in Table 4. For 2D keypoints, our method with a single fixed focal length achieves results superior to CRAVES while not requiring real training images. Our method also clearly outperforms CRAVES when selecting the best focal length per image. 3D ground truth is not available, but similar to CRAVES-lab we could expect large improvements in 3D accuracy.

Experiments on 7DoF+ robots.

We also train our method for jointly predicting 6D pose and joint angles for the robots considered in Section 4.1. We evaluate the 6D pose and joint angles accuracy using ADD. Results are reported in Table 1 in blue (last column). For the 7DoF robotic arms (Kuka and Panda), these results demonstrate a competitive or superior ADD accuracy compared to [lee2020camera] for inferring the 3D geometry of a known robot, but our method does not require known joint angles. The more complex 15 DoF Baxter robot remains challenging although our qualitative results often show reasonable alignments. We discuss the failure modes of our approach in the appendix.

4.3 Analysis of parametrization choices

(a)
| Reference point | ADD |
| Root | 75.02 |
| Middle | 79.45 |
| Hand | 0.00 |
| Centroid (ours) | 80.54 |

(b)
| Reference point (part volume) | ADD |
| 3092 | 74.40 |
| 2812 | 75.06 |
| 2763 | 74.89 |
| 2660 | 75.02 |
| 2198 | 79.45 |
| Centroid (ours) | 80.54 |
Table 5: Analysis of the choice of reference point . Networks are trained and evaluated with known joint angles as in Section 4.1. The reference point is placed on (a) a naively chosen part and (b) on one of the 5 largest parts. Our strategy of using the centroid of the imaged robot performs the best.

We analyze our method on the Panda-ORB dataset: it is the largest real dataset containing significant variations in joint angles and camera viewpoints and the Panda robot has a long kinematic chain with 8 DoF. We study the choice of reference point for the 6D pose update and the choice of the anchor part (see Section 3.4).

Reference point.

We train different networks with the reference point at the origin of the root part, of a part in the middle of the kinematic chain, and of the part at the end of the kinematic chain (the hand). Results are reported in Table 5(a). We observe that the performance indeed depends on the choice of the reference point, and our approach of using the centroid of the robot as the reference point performs the best. The network trained with the "Hand" part (the end effector) as the reference point fails to converge because this part is often difficult to identify in the training images and its pose cannot be inferred from any other part, because the robot is not a rigid object. We also investigate picking the reference point on one of the five largest parts (measured by their 3D volume, which is correlated with 2D visibility) in Table 5(b), again demonstrating that our approach of using the centroid of the robot performs better than any of these specific parts.

(a)
| Anchor (part volume) | ADD |
| 3092 | 68.01 |
| 2812 | 65.56 |
| 2763 | 60.40 |
| 2660 | 57.44 |
| 2198 | 69.54 |
| 637 | 63.40 |

(b)
| Anchor strategy | ADD |
| Root part | 57.44 |
| Middle part | 69.54 |
| Hand part | 63.40 |
| Random (all parts) | 64.28 |
| Random (5 largest) | 70.39 |
| Random (3 largest) | 71.36 |
Table 6: Analysis of the choice of the anchor part. Networks are trained and evaluated with unknown joint angles as in Section 4.2. (a) Results when one fixed anchor part is used during training and testing. (b) Randomly selecting the anchor part among a given set of largest robot parts during refinement in both training and testing.

Choice of the anchor part.

Table 6 reports results using different strategies for choosing the anchor part during training and testing. First, in 6(a) we show that choosing different parts as one (fixed) anchor results in significant variation of the resulting performance. To mitigate this issue, we consider in 6(b) a strategy where the anchor is picked randomly among the robot parts at each iteration (both during training and testing). This strategy performs better than always naively selecting the root as the anchor. By restricting the sampled anchors to the largest parts, our automatic strategy can also perform better than the best performing fixed part.

5 Conclusion

We have introduced a new render & compare approach to estimate the joint angles and the 6D camera-to-robot pose of an articulated robot from a single image, demonstrating significant improvements over the prior state of the art for this problem. These results open up exciting applications in visually guided manipulation or collaborative robotics without fiducial markers or time-consuming hand-eye calibration. To stimulate these applications, we released the training code as well as the pre-trained models for commonly used robots.

Acknowledgments.

This work was partially supported by the HPC resources from GENCI-IDRIS (Grant 011011181R1), the European Regional Development Fund under the project IMPACT (reg. no. CZ.02.1.01/0.0/0.0/15 003/0000468), Louis Vuitton ENS Chair on Artificial Intelligence, and the French government under management of Agence Nationale de la Recherche as part of the ”Investissements d’avenir” program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute).

References

Appendix

This appendix is organized as follows. In Section A, we provide the complete set of equations for the pose update and its relation to the reference point introduced in the main paper. In Section B, we give the details of the pose loss used in equation (3) of the main paper. Sections C, D, E and F give details of the cropping strategy, the state initialization, the training strategy and the evaluation, respectively. In Section G, we provide additional experiments showing the benefits of the iterative formulation and studying the importance of running the refiner network for multiple iterations during training. In Section H, we discuss the applicability of our approach to real robotic problems. Section I presents additional qualitative examples randomly selected from the images in the datasets introduced in Section 4 of the main paper. Finally, we discuss and illustrate the main failure modes of our approach in Section J. Additional examples are in the supplementary video [projectpage].

Appendix A Pose update

Any rigid motion can be modeled by a transformation in SE(3). The 6D pose of the anchor part of the robot at iteration k+1 is therefore defined by the composition of its current transformation T_k with respect to the camera coordinate system at iteration k with a pose update transformation ΔT_k:

T_{k+1} = ΔT_k T_k,    (4)

as introduced in Section 3.2 of the main paper. In the following we present the equations that define this pose update. We follow the parametrization of DeepIM's [li2018deepim] pose update and explicitly parametrize the pose update using a 3D reference point we call O. We choose the centroid of the estimated robot at the k-th iteration as O, but other choices are also possible. Our neural network refiner predicts parameters v_x, v_y, v_z that are used to compute the 3D translation update, as well as a 3D rotation update ΔR that together with the translation defines the pose update ΔT_k in (4). The parameters v_x and v_y correspond to the displacement in the image plane (in pixels) of the reference point at iteration k, and v_z corresponds to a relative update of the depth of O, again relative to the camera coordinate system:

v_x = p'_x − p_x,    (5)
v_y = p'_y − p_y,    (6)
v_z = z'_O / z_O,    (7)

where (p_x, p_y) is the 2D projection onto the image plane of the 3D point O before applying the pose update, (p'_x, p'_y) is the 2D reprojection onto the image plane of O after applying the pose update, (x_O, y_O, z_O) are the 3D coordinates of the reference point before the pose update expressed in the camera frame, and (x'_O, y'_O, z'_O) are the 3D coordinates of the reference point after the pose update expressed in the camera frame. f_x and f_y are the intrinsic parameters of the virtual cropped camera that are assumed known and fixed.

From these equations, we can derive the update of the 3D translation of the 3D point O:

Δt = (t_x, t_y, t_z) = (x'_O − x_O, y'_O − y_O, z'_O − z_O),    (8)

with the x, y and z components of the translation update defined as:

t_x = (v_x / f_x + x_O / z_O) z'_O − x_O,    (9)
t_y = (v_y / f_y + y_O / z_O) z'_O − y_O,    (10)
t_z = (v_z − 1) z_O.    (11)

Note that Δt depends on the 3D point O. Hence the network has to learn to internally infer this point to predict consistent transformation updates. The rotation update ΔR (parametrized by three angles) predicted by the refiner network is applied around the 3D reference point O. This defines the transformation of any 3D point on the anchor. Consider a point on the anchor part in position p_k at iteration k; its position at iteration k+1 is given by:

p_{k+1} = ΔR (p_k − O) + O + Δt,    (12)

where ΔR is the rotation matrix predicted by the network using the same rotation parametrization as in [labbe2020cosypose, Zhou2018-eg]. Equation (12) can also be used to define the rotation matrix of the anchor part with respect to the camera:

R_{k+1} = ΔR R_k,    (13)

where R_k defines the rotation of the anchor part with respect to the camera before the pose update, R_{k+1} defines the rotation of the anchor part with respect to the camera after the pose update, and ΔR is the rotation matrix predicted by the network. Equations (12) and (13) fully define the pose update given by equation (4) as a function of the network predictions v_x, v_y, v_z and ΔR, and of the reference point O. For computing the loss in the next section, we write the pose of the anchor part in the camera coordinate system at iteration k+1 (after the pose update) as

T_{k+1} = U(T_k, v_x, v_y, v_z, ΔR),    (14)

where U expresses the updated pose of the anchor as a function of all the network predictions v_x, v_y, v_z, ΔR (and of the reference point O), and can be computed in closed form using the equations derived above.
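Assuming the multiplicative depth update and pixel-displacement parametrization described above, the pose update can be applied as in the sketch below; names are illustrative and the exact conventions should be taken from eqs. (5)-(13).

```python
import numpy as np

def apply_pose_update(T_anchor, ref_point, v_xy, v_z, R_delta, fx, fy):
    """Apply the pose update to the anchor pose `T_anchor` (4x4, camera frame), given the
    reference point O = `ref_point` (3D, camera frame), the predicted pixel displacement
    `v_xy`, relative depth update `v_z` and rotation `R_delta` (3x3)."""
    O = np.asarray(ref_point, dtype=float)
    x, y, z = O
    z_new = v_z * z                                        # eq. (7): relative depth update
    x_new = (v_xy[0] / fx + x / z) * z_new                 # invert eq. (5)
    y_new = (v_xy[1] / fy + y / z) * z_new                 # invert eq. (6)
    t_delta = np.array([x_new - x, y_new - y, z_new - z])  # eqs. (8)-(11)
    T_new = np.eye(4)
    T_new[:3, :3] = R_delta @ T_anchor[:3, :3]             # eq. (13)
    # eq. (12): rotate the anchor about O, then translate it by t_delta.
    T_new[:3, 3] = R_delta @ (T_anchor[:3, 3] - O) + O + t_delta
    return T_new
```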

Appendix B Pose loss

We define the following distance between two transformations T_1 and T_2 using the set X of 3D points on the anchor part:

D(T_1, T_2) = (1/|X|) Σ_{x ∈ X} ‖ T_1 x − T_2 x ‖_1,    (15)

where ‖·‖_1 is the L1 norm. The pose loss in equation (3) of the main paper follows [labbe2020cosypose] and is written as:

L_T(T_{k+1}, T_gt) = D( U(T_k, v_x, v_y, v̂_z, ΔR̂), T_gt )    (16)
                   + D( U(T_k, v̂_x, v̂_y, v_z, ΔR̂), T_gt )    (17)
                   + D( U(T_k, v̂_x, v̂_y, v̂_z, ΔR), T_gt ),    (18)

where U defines the pose update as explained in (14) in Section A, T_gt is the ground-truth 6D pose of the anchor with respect to the camera, (v_x, v_y, v_z, ΔR) are the parameters of the pose update predicted by the network, and (v̂_x, v̂_y, v̂_z, ΔR̂) are the values of these updates that would lead to the ground-truth pose. The different terms of this loss separate (or disentangle) the influence of different subsets of parameters on the loss. In detail, the first term (16) of the loss considers only the image-plane translation of the anchor part, where the rest of the parameters are fixed to their ground-truth values. The second term (17) of the loss considers only the relative depth of the anchor part, where the rest of the parameters are fixed to their ground-truth values. Finally, the last term (18) of the loss considers only the rotation update, where the rest of the parameters are fixed to their ground-truth values.
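A sketch of this disentangled loss is given below, assuming an L1 distance between the transformed anchor points and a helper `update_fn` that closes over the current pose T_k and the reference point O; all names are illustrative.

```python
import torch

def disentangled_pose_loss(anchor_points, update_fn, pred, gt, T_gt):
    """Sketch of eqs. (16)-(18). `update_fn(v_xy, v_z, R)` returns the updated 4x4 anchor
    pose; `pred` and `gt` are dicts of the predicted and ground-truth update parameters;
    `anchor_points` is an (N, 3) tensor of 3D points on the anchor part."""
    def dist(T_a, T_b):  # eq. (15): average L1 distance between transformed anchor points
        pa = anchor_points @ T_a[:3, :3].T + T_a[:3, 3]
        pb = anchor_points @ T_b[:3, :3].T + T_b[:3, 3]
        return (pa - pb).abs().mean()
    l_xy = dist(update_fn(pred["v_xy"], gt["v_z"], gt["R"]), T_gt)   # eq. (16): image-plane translation
    l_z  = dist(update_fn(gt["v_xy"], pred["v_z"], gt["R"]), T_gt)   # eq. (17): depth
    l_R  = dist(update_fn(gt["v_xy"], gt["v_z"], pred["R"]), T_gt)   # eq. (18): rotation
    return l_xy + l_z + l_R
```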

Appendix C Cropping strategy

Let c = (c_u, c_v) be the 2D projection of the 3D centroid of the robot by the camera with intrinsics K. We use a cropping strategy similar to DeepIM, but using only the projection of the 3D points of the model in the current pose estimate to define the bounding box of the crop. Let (u_1, v_1) and (u_2, v_2) be the coordinates of the upper-left and lower-right corners of the bounding box defined by the robot projection in the image, respectively. We define the width W and height H of the crop as:

W = 2 λ max(Δu, r Δv),    (19)
H = W / r,    (20)

where

Δu = max(|u_1 − c_u|, |u_2 − c_u|),    (21)
Δv = max(|v_1 − c_v|, |v_2 − c_v|),    (22)

r = 4/3 is the aspect ratio of the crop and λ = 1.4 following DeepIM [li2018deepim] (this corresponds to the 40% enlargement mentioned in Section 3.2). The crop is centered at c, of width W and height H. An example of a crop is given in Figure 6.

Figure 6: Cropping strategy. (a) input RGB image, (b) illustration of the cropping strategy: bounding box (shown in red) defined by the reprojection of the robot in the input state to the image (blue points). Bounding box defining the crop is shown in green. (c) The cropped input image, (d) Cropped rendered image, (e) The crop and the input image are overlaid. These two images are given as input to our network (the anchor mask is not displayed).
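The cropping computation can be sketched as follows, assuming the robot's 3D points are given in the camera frame; variable names are illustrative.

```python
import numpy as np

def compute_crop(points_3d, K, aspect=4.0 / 3.0, enlarge=1.4):
    """Crop box from the projection of the robot's 3D points (camera frame) under intrinsics
    K: centered on the projected 3D centroid, smallest box of the given aspect ratio that
    encloses all projected points, enlarged by `enlarge` (a sketch of eqs. (19)-(22))."""
    uvw = points_3d @ K.T
    uv = uvw[:, :2] / uvw[:, 2:3]                       # projected robot points (pixels)
    c_hom = K @ points_3d.mean(axis=0)
    c = c_hom[:2] / c_hom[2]                            # projected centroid = crop center
    du = np.max(np.abs(uv[:, 0] - c[0]))                # eq. (21)
    dv = np.max(np.abs(uv[:, 1] - c[1]))                # eq. (22)
    half_w = enlarge * max(du, aspect * dv)             # eq. (19)
    half_h = half_w / aspect                            # eq. (20)
    return c[0] - half_w, c[1] - half_h, c[0] + half_w, c[1] + half_h
```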

Appendix D State initialization

Let (u_c, v_c) and (w, h) define the center and the size of the approximate 2D bounding box of the robot in the image. In our experiments, there is only one robot per image and we thus use the entire image as the 2D bounding box. The orientation of the base of the robot is set parallel to the axes of the camera with its vertical axis pointing upwards. The centroid of the robot is set to match the center of the bounding box. We make a first hypothesis z_init for the depth of the centroid and use this initial value to estimate the coordinates x and y of the centroid in the camera frame:

x = (u_c − c_x) z_init / f_x,    (23)
y = (v_c − c_y) z_init / f_y,    (24)

where (f_x, f_y, c_x, c_y) are the camera intrinsics. We then update the depth estimate using the following simple strategy. We project the points of the robot using the initial guess we have just defined. These points define a bounding box with dimensions (w_proj, h_proj) whose center remains unchanged by construction. We compute an updated depth z of the centroid such that the width and height of this projected bounding box approximately match the size of the 2D detection:

z = z_init max(w_proj / w, h_proj / h),    (25)

and use this new depth to recompute x and y using equations (23) and (24). The initial state of the robot is defined by this pose of the centroid, and the initial joint angles are set (when they are not measured) to the middle of the joint limits, i.e. to (q_min + q_max) / 2, where q_min and q_max define the interval of the robot articulation angles. An example of initialization is shown in Figure 7.

In the experiments of Section 4 of the main paper, we use the entire image as the 2D bounding box because the test images show a single robot. Please note however that it would be straightforward to process multiple bounding boxes like [labbe2020cosypose] for multiple rigid objects. If multiple robots are in the input bounding box (e.g. as in some of the examples in the results video on the project webpage [projectpage]), our iterative refinement typically converges to the largest robot in the image.

Figure 7: Example of initialization. (a) Input RGB image, (b) robot in the initial state, (c) output of RoboPose (pose and joint angles are predicted) after the refinement iterations.
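The initialization can be sketched as follows; the provisional depth value and the use of a max over the width and height ratios are assumptions, and `robot_points_fn` is a placeholder for the forward kinematics of the robot.

```python
import numpy as np

def initialize_state(bbox, K, joint_limits, robot_points_fn, z_init=1.0):
    """State initialization sketch (Appendix D). `bbox` = (x1, y1, x2, y2) detection box
    (here the full image); `robot_points_fn(q)` returns the robot 3D points in the base
    frame for joint angles q. Returns the initial centroid position (camera frame) and
    the initial joint angles."""
    q0 = 0.5 * (joint_limits[:, 0] + joint_limits[:, 1])          # joints at mid-range
    u_c, v_c = 0.5 * (bbox[0] + bbox[2]), 0.5 * (bbox[1] + bbox[3])
    w_det, h_det = bbox[2] - bbox[0], bbox[3] - bbox[1]
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]

    def backproject(z):                                            # eqs. (23)-(24)
        return (u_c - cx) * z / fx, (v_c - cy) * z / fy

    x, y = backproject(z_init)
    pts = robot_points_fn(q0)
    pts_cam = pts - pts.mean(axis=0) + np.array([x, y, z_init])    # centroid on the bbox ray
    uvw = pts_cam @ K.T
    uv = uvw[:, :2] / uvw[:, 2:3]
    w_proj = uv[:, 0].max() - uv[:, 0].min()
    h_proj = uv[:, 1].max() - uv[:, 1].min()
    z = z_init * max(w_proj / w_det, h_proj / h_det)               # eq. (25): rescale depth
    x, y = backproject(z)
    return np.array([x, y, z]), q0
```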

Appendix E Training strategy

The training of the neural networks is performed with Adam [Kingma2014-xz] using batches of 1408 images split over 44 GPUs. The networks are trained for 60k iterations. Following recommendations from [Goyal2017-of] for distributed training, we use a warm-up phase: the learning rate is linearly increased from a small value to its nominal value during the first 5k iterations. The learning rate is then reduced to 0.0003 at 45k iterations. In order to speed up the training, we start by training with a single refinement iteration and add the other refinement iterations at 15k and 30k iterations, respectively. We found that starting to train with one refinement iteration and increasing the number of refinement iterations during training has no effect on the results, but allows the refiner network to be trained faster.

Appendix F Evaluation details

Next, we give more details about the evaluation metrics and the evaluation protocol of our comparison with DREAM [lee2020camera] and CRAVES [zuo2019craves].

F.1 Comparison with DREAM

The average distance (ADD) metric measures the distance between 3D keypoints on the robot transformed with the predicted and ground-truth 6D poses. The 3D keypoints are the same as in [lee2020camera] for the Kuka, Panda and Baxter robots and are defined at the locations of the robot joints. The ADD errors of the 3D keypoints for the robot depicted in the image are averaged, and we report the area under the curve of the ADD pass rate vs. threshold following [lee2020camera]. We refer to [lee2020camera, Xiang2018-dv] for more details about this metric. A slight difference with the evaluation protocol of [lee2020camera] is that we consider all images as possible evaluation targets, whereas [lee2020camera] discards images where fewer than 4 robot keypoints are visible in the ground truth because PnP cannot be solved in this situation. This favors methods that rely on keypoints and does not fully account for the difficulty of estimating the pose of robots under strong self-occlusions, a situation that our method can handle. We thus evaluate ADD on all images, even those with fewer than four visible robot keypoints. The models from DREAM [lee2020camera] were re-evaluated with this protocol; in practice the difference is only minor compared to the results reported in DREAM [lee2020camera] for all datasets. For reproducing the DREAM results reported in Table 1 of the main paper, we used the provided code and pre-trained models (https://github.com/NVlabs/DREAM) following the instructions to reproduce the results from the paper, e.g. using the "shrink-and-crop" cropping strategy. Note that while evaluating these models, we found some results to be higher than reported in the original paper, e.g. ResNet101-H is the best model on the largest real dataset Panda-ORB (69.05) but only VGG19-F (61.93) was reported in [lee2020camera]. We reported results for all models of DREAM [lee2020camera] in Table 1 of the main paper.

When evaluating our method that predicts joint angles, ADD is computed using 3D keypoints on the robot that are computed using both the estimated 6D pose and the predicted joint angles using the robot forward kinematics. This provides a principled way to measure the accuracy of the 3D reconstruction of the robot without giving arbitrary weights to rotation, translation or joint angle errors.
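For reference, the ADD area under the curve can be computed as in the sketch below; the maximum distance threshold is an assumption and the exact protocol is that of [lee2020camera].

```python
import numpy as np

def add_auc(errors, max_threshold=0.1, n_steps=1000):
    """Area under the ADD pass-rate vs. distance-threshold curve, normalized to [0, 1].
    `errors` are per-image mean 3D keypoint distances (meters)."""
    errors = np.asarray(errors)
    thresholds = np.linspace(0.0, max_threshold, n_steps)
    pass_rate = np.array([(errors <= t).mean() for t in thresholds])
    return np.trapz(pass_rate, thresholds) / max_threshold
```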

F.2 Comparison with CRAVES

For the comparison with CRAVES [zuo2019craves], we based our evaluation on the code provided with the paper (https://github.com/zuoym15/craves.ai). The 3D metrics reported in Table 3 for CRAVES-lab are:

  • Joints (degrees): errors of the joint angle predictions averaged over the 4 articulations, comparing the predicted joint angle of each joint with its ground-truth value.

  • Trans xyz. (cm): translation error of the base averaged over the x, y and z axes, comparing the predicted and ground-truth translations.

  • Trans norm. (cm): translation error of the base computed as the norm of the difference between the predicted and ground-truth translations.

  • Rot. (degrees): errors between the Euler angles that define the rotation of the base with respect to the camera, comparing predicted and ground-truth values.

We found that the "Joints", "Trans xyz." and "Rot." metrics reported in [zuo2019craves] under "3D pose estimation errors" were averaged only over the top 50% of images with the best predictions according to the joint angle errors. We also report the errors averaged over all images of the dataset (all) in Table 3 of the main paper, as our method can successfully recover the 3D configuration of the robot in all images of the dataset because it does not rely on solving the 2D-to-3D nonlinear, nonconvex optimization problem that is difficult to solve in some situations.

On CRAVES-Youtube only the visible 2D keypoints are considered as estimation targets to provide a fair comparison with “Youtube-vis” of [zuo2019craves]. The results that we reproduced with the provided pre-trained models were slightly lower (around 0.5% lower) than reported in the paper for the PCK@0.2 metric, but we kept the (higher) results reported in the paper when comparing with their method.

Appendix G Benefits of the iterative formulation

In this section, we investigate the benefits of the number of refiner iterations at training and test time. In the following, we denote by K_train the number of refiner network iterations used during training (denoted K in equation (3) of the main paper) and by K_test the number of iterations used at test time. We experimentally evaluate the influence of K_train and K_test using the ADD metric on the Panda-ORB dataset with unknown joint angles, similar to the experimental set-up reported in Section 4.2. We trained four networks with K_train ∈ {1, 2, 3, 5} and report the results for a varying number of iterations at test time for each network. Results are reported in Table 7.

First, we observe that for each refiner network (each row in the table), the accuracy is significantly improved by using multiple iterations at test time as opposed to running the network for a single iteration (K_test = 1) in a single-shot regression fashion. These results demonstrate that the iterative formulation is crucial to the success of the approach.

Second, we observe that running the network for multiple iterations during training significantly improves the results. For example, for K_test = 10, the ADD improves from 41.25 for K_train = 1 to 70.39 for K_train = 3 and 70.77 for K_train = 5. Using multiple training iterations without backpropagating through the renderer augments the distribution of errors between the input state and the ground-truth state with the actual errors the refiner makes. These results show the importance of training the refiner network to correct both large errors (between the initial state and the ground-truth state) and smaller errors (between the prediction at iteration k and the ground truth) to simulate what the network is going to see at test time, where it is run for multiple iterations. In our experiments in the paper, we used K_train = 3 as it is a good trade-off between performance and training speed.

| K_train \ K_test | 1 | 2 | 3 | 5 | 10 |
| 1 | 0.43 | 24.31 | 34.72 | 39.77 | 41.25 |
| 2 | 1.70 | 32.22 | 52.08 | 62.76 | 65.73 |
| 3 | 1.14 | 25.51 | 49.22 | 65.64 | 70.39 |
| 5 | 0.61 | 15.80 | 36.81 | 61.08 | 70.77 |
Table 7: Benefits of the iterative refinement. We study the influence of the number of training and testing iterations, denoted K_train and K_test, respectively. We report the ADD metric (higher is better) on the Panda ORB dataset with unknown joint angles. The best performing set-up is K_train = 5 with K_test = 10.

Appendix H Applications in robotic set-ups

Similar to related work [lee2020camera, zuo2019craves], our approach can be applied in real robotic set-ups. For example, the offline hand-eye calibration approach DREAM [lee2020camera] has been used in [tremblay2020indirect] for a manipulation task on a real robot, and CRAVES [zuo2019craves] demonstrates visually guided closed-loop control of a low-cost robot without reliable joint angle measurements. Our approach significantly improves over the accuracy of DREAM [lee2020camera] and CRAVES [zuo2019craves], and we therefore expect it to also improve the capabilities of robotic systems similar to [tremblay2020indirect, zuo2019craves]. Please note that while our paper focuses on robot state estimation from a single image, the method is also suited for real-time and online applications as explained in Section 4.1. In this scenario (also illustrated in the video on the project webpage [projectpage]), our predictions could be temporally smoothed to reduce jitter by applying a simple temporal low-pass filter similar to [tremblay2020indirect].

Appendix I Qualitative examples

Random examples.

Here we provide more qualitative examples on the datasets where we have shown the quantitative evaluation. All the presented results consider the most challenging scenario where the joint angles are unknown and are predicted together with the robot’s 6D pose. For each dataset, we have randomly sampled 9 images without any further manual selection and show the input RGB image and the output state predicted by our method. Results are presented for the CRAVES-lab (Figure 8), CRAVES-youtube (Figure 9), Panda ORB (Figure 10), Kuka Photo (Figure 11) and Baxter DR (Figure 12) datasets.

Figure 8: Random selection of examples on the CRAVES-lab datasets. Both joint angles and the 6D pose of the robot are predicted from the input image. For each of the 9 examples, we show the input RGB image (left) and the predicted state of the robot within the 3D scene (right). We illustrate the predicted state (right) by overlaying the articulated CAD model of the robot in the predicted state over the input image.
Figure 9: Random selection of examples on the CRAVES-youtube dataset. Both joint angles and the 6D pose of the robot are predicted from the input image. For each of the 9 examples, we show the input RGB image (left) and the predicted state of the robot within the 3D scene (right). We illustrate the predicted state (right) by overlaying the articulated CAD model of the robot in the predicted state over the input image.
Figure 10: Random selection of examples on the Panda ORB dataset. Both joint angles and the 6D pose of the robot are predicted from the input image. For each of the 9 examples, we show the input RGB image (left) and the predicted state of the robot within the 3D scene (right). We illustrate the predicted state (right) by overlaying the articulated CAD model of the robot in the predicted state over the input image.
Figure 11: Random selection of examples on the Kuka Photo dataset. Both joint angles and the 6D pose of the robot are predicted from the input image. For each of the 9 examples, we show the input RGB image (on the left) and the predicted state of the robot within the 3D scene (on the right). We illustrate the predicted state (on the right) by overlaying the articulated CAD model of the robot in the predicted state over the input image.
Figure 12: Random selection of examples on the Baxter DR dataset. Both joint angles and the 6D pose of the robot are predicted from the input image. For each of the 9 examples, we show the input RGB image (left) and the predicted state of the robot within the 3D scene (right). We illustrate the predicted state (right) by overlaying the articulated CAD model of the robot in the predicted state over the input image.

Appendix J Failure modes

There are four main failure modes and limitations of our approach in the most challenging set-up with unknown joint angles, illustrated in Figure 13. First, while our method is tolerant to some amount of self-occlusion, it can fail in situations where multiple parts of the robot are occluded by external objects. Second, in some in-the-wild images (where camera intrinsics are also unknown), our iterative alignment can get stuck in a local minimum. This could be improved by (i) using a better initialization (e.g. from a separate external coarse estimation method) or (ii) trying multiple initializations and selecting the best result. The third failure mode is related to the symmetry of individual robot parts or of entire robot configurations, which is not taken into account in our method. Finally, robots with many degrees of freedom such as the 15 DoF Baxter remain challenging (please see the quantitative evaluation in Section 4.2 of the main paper), although our qualitative results often show reasonable alignment, as illustrated in Figure 12.

Figure 13: Main failure modes. We illustrate the four main failure modes of our approach. In (1), the severe occlusions (circled in red) of the base of the robot and of one of the parts in the middle of the kinematic chain lead to an incorrect prediction. In (2), the state of the robot is incorrectly estimated because our iterative procedure gets stuck in a local minimum. In (3), the gripper of the robot (circled in red) is a symmetric part, which leads to an incorrect estimate of the joint angle between the gripper and the rest of the robot. In (4), we illustrate an example of incorrect alignment of parts (circled in red) for the complex 15 DoF Baxter robot.