Robust Keypoint Detection and Pose Estimation of Robot Manipulators with Self-Occlusions via Sim-to-Real Transfer

by   Jingpei Lu, et al.
University of California, San Diego

Keypoint detection is an essential building block for many robotic applications, such as motion capture and pose estimation. Historically, keypoints have been detected using uniquely engineered markers such as checkerboards or fiducials. More recently, deep learning methods have been explored because they can detect user-defined keypoints in a marker-less manner. However, deep neural network (DNN) detectors can perform unevenly across manually selected keypoints along the kinematic chain. An example can be found on symmetric robotic tools, where DNN detectors cannot solve the correspondence problem correctly. In this work, we propose a new and autonomous way to define the keypoint locations that overcomes these challenges. The approach involves finding the optimal set of keypoints on robotic manipulators for robust visual detection. Using a robotic simulator as a medium, our algorithm utilizes synthetic data for DNN training and optimizes the selection of keypoints through an iterative approach. The results show that, when using the optimized keypoints, the detection performance of the DNNs improves so significantly that the keypoints can be detected even in cases of self-occlusion. We further use the optimized keypoints for real robotic applications by using domain randomization to bridge the reality gap between the simulator and the physical world. The physical-world experiments show how the proposed method can be applied to the wide breadth of robotic applications that require visual feedback, such as camera-to-robot calibration, robotic tool tracking, and whole-arm pose estimation.



I Introduction

Fig. 1: A visualization of the keypoints on the Baxter robot arm (left) and the da Vinci surgical tool (right). The candidate keypoints are shown in blue, and the optimized keypoints are shown in red. Surprisingly, some of the optimized keypoints are inside of the robot’s link rather than on the surface. The bottom row shows these detections being used for pose estimation by utilizing the optimized keypoints.

Visual feedback plays an integral role in robotics because of the rich information images provide. Historically, there have been two popular approaches to incorporating visual feedback. The first is through calibration and finding the transform between the base of the robot and the camera [3]. This transform describes the geometric relationship between objects in the camera frame and the robot, hence allowing image processing and detection algorithms to provide the context necessary for a robot to perform in an environment. The second approach is directly estimating the relationship between the control, typically joint angles, and the end-effector position in the camera frame [21]. This type of visual feedback also allows for end-effector control in the camera frame.

An important step to properly integrating visual feedback techniques is detecting features and finding their correspondence on the robot. A common approach is placing visual markers on the robot that are easy to detect and hence provide keypoints in the image frame [5][20]. But where should one place the markers? There does not seem to be any established approach that works the best. One must consider how these marker-based methods require modification of the robot, and how the visibility of the markers will frequently suffer from self-occlusions. Moreover, it is challenging to find the 3D location of the marker relative to the robotic kinematic chain which can cause inaccuracies in visual feedback. Deep learning approaches for detecting keypoints have been proposed to remove the need for modification of the robot [16]. The training of Deep Neural Networks (DNNs) for keypoint detection has even been extended to use synthetically generated data for accurate training with the correct corresponding 3D location relative to the kinematic chain [11].

Although these methods achieve robust keypoint detection, they do not address an important consideration: where to place the keypoints relative to the kinematic chain. Previous work relies on hand-picked keypoint locations, which may be sub-optimal and can even limit the performance of the detection algorithm. An example of this challenge can be found on symmetric robotic tools, where the keypoint detection algorithm cannot solve the correspondence problem correctly [16].

Taking advantage of DNNs as learnable keypoint detectors and of robotic simulation tools, we present the following novel contributions:

  1. a general keypoint optimization algorithm which solves for the locations of the set of keypoints to maximize their detectability,

  2. a keypoint detection algorithm based on a DNN so it can be trained to learn any candidate set of keypoints in simulation, and

  3. effective simulation-to-real (sim-to-real) transfer of the DNN using domain randomization.

To show the effectiveness of the proposed methods, the optimized keypoint detection algorithm is tested on both a calibration and a tracking scenario: (i) a Rethink Robotics Baxter robot [25] for calibrating the base to camera transform and (ii) the da Vinci Research Kit (dVRK) [10] for real-time surgical tool tracking. Surprisingly, the resulting keypoint locations from the optimization in these experiments are inside the robotic links rather than on the surface. The corresponding DNN based keypoint detector is capable of consistently and accurately detecting the 2D image projections of these keypoints even in cases of self-occlusion. These unexpected results highlight the necessity for keypoint location optimization since no manually chosen keypoints could ever reproduce this level of detectability.

II Related Works

Keypoint optimization has been widely investigated in various areas. Early works in robotics explored how to select the optimal keypoints extracted from SIFT [15] and SURF [1] for visual odometry [19] and localization [29]. In computer vision, [2] and [18] select the salient keypoints by considering their detectability, distinctiveness, and repeatability. Recently, there has been a shift to using DNNs for detecting keypoints because of their improved performance. More specifically, works in the computer vision community have learned and optimized keypoints for better face recognition and human pose estimation using DNNs [30][7][34]. However, those algorithms usually consume a large number of real images for training. Recently, [28] proposed an end-to-end framework to optimize the keypoints for 3D pose estimation in an unsupervised manner by utilizing synthetically generated data. Our work differs from those in its goal of optimizing the keypoint detection performance of the DNNs. Instead of optimizing the keypoints for a specific downstream task, our algorithm is more general and can be applied to various robotic tasks that use visual feedback, where the kinematic and 3D geometric information of the keypoints is required. In contrast, these approaches only learn the keypoints for a specific downstream task and cannot generalize and adapt to various robotic applications.

III Methodology

Our method, achieving the robust keypoint detection of robot manipulators with self-occlusions, comprises three major components: a DNN that can be trained to detect a set of user-defined keypoints, a novel algorithm for keypoint optimization, and an effective data generation technique for sim-to-real transfer.

III-A Deep Neural Network for Keypoint Detection

The advantage of deep learning methods is their learnability, as their parameters can be optimized for a desired output in a data-driven manner. Inspired by DeepLabCut [17], a method for marker-less tracking of a set of user-defined keypoints, we utilized its backbone neural network for keypoint detection. This DNN consists of a ResNet [6] as the feature extractor, followed by deconvolutional layers that up-sample the feature maps. Sigmoid activations are then used to produce the spatial probability densities for estimating the 2D keypoint locations.

For simplicity, the DNN can be regarded as a function, denoted as $f$, which estimates the 2D image coordinates of a set of keypoints, $\mathcal{K}$, on the input image $I$,

$$\hat{\mathcal{H}} = f(\mathcal{K}, I)$$

where $\hat{\mathcal{H}}$ is the set of estimated locations for the keypoints in $\mathcal{K}$ on image frame $I$. When trained with synthetic data, which provides the ground-truth location for every keypoint as long as it is within the camera’s frustum, the DNN can learn to estimate the keypoint location even with self-occlusions.
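As a rough illustration of the final stage of such a detector, the sketch below applies a sigmoid to per-keypoint score maps and takes the location of the maximum as the 2D estimate. The function names, the nested-list score-map layout, and the use of a plain arg-max are assumptions made for illustration; the actual detector may refine the location differently.

```python
import math

def sigmoid(x):
    """Logistic function mapping a raw score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def keypoints_from_score_maps(score_maps):
    """Return one (u, v, probability) estimate per keypoint.

    score_maps[k][v][u] is the raw network output for keypoint k at
    row v and column u (hypothetical layout).
    """
    estimates = []
    for score_map in score_maps:
        best = (0, 0, -1.0)
        for v, row in enumerate(score_map):
            for u, score in enumerate(row):
                p = sigmoid(score)
                if p > best[2]:
                    best = (u, v, p)
        estimates.append(best)
    return estimates

# Two toy 3x4 score maps, one per keypoint.
maps = [
    [[-4.0, -4.0, -4.0, -4.0], [-4.0, 3.0, -4.0, -4.0], [-4.0, -4.0, -4.0, -4.0]],
    [[-4.0, -4.0, -4.0, 2.0], [-4.0, -4.0, -4.0, -4.0], [-4.0, -4.0, -4.0, -4.0]],
]
detections = keypoints_from_score_maps(maps)
```

Because the label is a probability map over the whole image, nothing in this step requires the keypoint to be visible, which is what allows occluded keypoints to be learned from synthetic labels.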

III-B Keypoint Optimization for Robotic Manipulators

Finding the keypoints that minimize the detection error will improve the performance of many robotic tasks that rely on keypoints for visual feedback. As the DNN is trained end-to-end to detect a set of keypoints, the detection performance of a given keypoint can vary depending on which other keypoints it is grouped with. Therefore, the main objective of the keypoint optimization algorithm is to iteratively find the set of keypoints that optimizes the detection performance of the DNN.

In the proposed algorithm, $N$ candidate keypoints are sampled on the kinematic chain. The size of the keypoint set, $n$, can be varied but should be less than the number of candidate keypoints ($n < N$). In the optimization process, the algorithm uses a Random Sample Consensus (RANSAC [4]) style approach to find the optimal set of keypoints iteratively. This involves four steps. The first step is to sample a set of keypoints. Each candidate keypoint has an associated weight, which can be interpreted as the confidence of being in the optimal set, and a set of keypoints is sampled based on these weights in each iteration. The second step is to train a DNN to detect the selected keypoints using the training data. In our implementation, we generate the training and testing images with the ground-truth labels for all candidate keypoints beforehand, so the labels for the selected keypoints in each iteration can be obtained without re-generating the datasets. The third step is to evaluate the trained DNN using the testing data. The detection error is defined as the pixel-wise distance between the estimated and ground-truth image coordinates, and the average detection error of each keypoint is computed. The fourth step is to update the weights of the selected keypoints so that keypoints with lower average detection error receive higher weights. To find the optimal set of keypoints, these four steps are repeated for $M$ iterations, and the set of keypoints that minimizes the mean of the average detection errors is returned as the optimal set.

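Input: The set of candidate keypoints $\mathcal{C}$
Output: The optimal set of keypoints $\mathcal{K}^*$
// Initialization
for $i = 1$ to $N$ do
       $w_i \leftarrow 1/N$
end for
$\epsilon_{\min} \leftarrow \infty$
// Optimize keypoints iteratively
for $j = 1$ to $M$ do
       $\mathcal{K}_j \leftarrow$ sampleKeypoints($\mathcal{C}$, $n$)
       Train(DNN, $\mathcal{K}_j$)
       $\epsilon \leftarrow$ Evaluate(DNN, $\mathcal{K}_j$)
       Update($\mathcal{C}$, $\epsilon$)
       if mean($\epsilon$) $< \epsilon_{\min}$ then
             // Update optimal keypoint set
             $\mathcal{K}^* \leftarrow \mathcal{K}_j$;  $\epsilon_{\min} \leftarrow$ mean($\epsilon$)
       end if
end for
Algorithm 1 Keypoint Optimization for Robust Visual Detections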

In the keypoint optimization algorithm (see Algorithm 1), we represent the set of candidate keypoints as $\mathcal{C} = \{(\mathbf{p}_i, w_i)\}_{i=1}^{N}$, where $\mathbf{p}_i$ is the 3D position of the $i$-th keypoint with respect to the robotic link it belongs to, and $w_i$ is its corresponding weight. For iteration $j$, $\mathcal{K}_j$ is the set of selected keypoints, which a DNN will be trained to detect. The following functions are used in each iteration.

III-B1 Sampling

The function sampleKeypoints randomly selects $n$ keypoints among the $N$ candidates, using their weights as the probabilities of being chosen. The keypoints can be divided into sub-groups to constrain the sampling result. For example, to have one optimized keypoint per link, the keypoints of a robotic arm can be grouped by link so that the function samples one keypoint among the candidates for each link.
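The grouped variant of this sampling step can be sketched as follows. This is a minimal illustration, assuming candidates are identified by integer indices and that exactly one keypoint is drawn per group; the name `sample_keypoints` and its signature are hypothetical.

```python
import random

def sample_keypoints(weights, groups, rng):
    """Pick one candidate index per group, with probability proportional to weight.

    weights: list of candidate weights (index = candidate id).
    groups:  list of lists of candidate ids, e.g. one list per robot link.
    """
    selected = []
    for group in groups:
        group_weights = [weights[i] for i in group]
        # random.choices draws with probability proportional to the weights.
        selected.append(rng.choices(group, weights=group_weights, k=1)[0])
    return selected

# Six candidates with uniform weights, three candidates on each of two links.
weights = [1.0 / 6] * 6
groups = [[0, 1, 2], [3, 4, 5]]
rng = random.Random(0)
selected = sample_keypoints(weights, groups, rng)
```

Only the relative weights inside a group matter here, so the weights do not need to be re-normalized per group before sampling.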

III-B2 Training

The function Train trains a DNN on a training dataset to predict the image coordinates of the selected keypoints in $\mathcal{K}_j$.

III-B3 Evaluation

The function Evaluate evaluates the keypoint set $\mathcal{K}_j$ by running the detections on the testing dataset, and computes the average detection error, $\epsilon_k$, for each of the selected keypoints. For the $k$-th keypoint, the average detection error is calculated as

$$\epsilon_k = \frac{1}{T} \sum_{t=1}^{T} \left\| \hat{\mathbf{h}}_{k,t} - \mathbf{h}_{k,t} \right\|_2$$

where $T$ is the total number of test images, $\hat{\mathbf{h}}_{k,t}$ is the estimated keypoint location, and $\mathbf{h}_{k,t}$ is the ground-truth keypoint location in the $t$-th frame.
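The per-keypoint average detection error described above can be computed along these lines. The data layout (a list of frames, each holding one pixel coordinate pair per keypoint) and the function name are assumptions for illustration.

```python
import math

def average_detection_errors(estimates, ground_truth):
    """Mean pixel distance per keypoint over all test frames.

    estimates[t][k] and ground_truth[t][k] are (u, v) pixel coordinates
    of keypoint k in frame t.
    """
    num_frames = len(estimates)
    num_keypoints = len(estimates[0])
    errors = [0.0] * num_keypoints
    for est_frame, gt_frame in zip(estimates, ground_truth):
        for k, ((u_hat, v_hat), (u, v)) in enumerate(zip(est_frame, gt_frame)):
            errors[k] += math.hypot(u_hat - u, v_hat - v)
    return [e / num_frames for e in errors]

# Two frames, two keypoints.
estimates = [[(10.0, 10.0), (53.0, 44.0)], [(12.0, 10.0), (50.0, 40.0)]]
ground_truth = [[(10.0, 10.0), (50.0, 40.0)], [(10.0, 10.0), (50.0, 40.0)]]
errors = average_detection_errors(estimates, ground_truth)
```

In this toy case the first keypoint is off by 2 pixels in one of two frames and the second by 5 pixels in one of two frames, giving averages of 1.0 and 2.5 pixels.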

III-B4 Update

At the end of each iteration, the function Update is used to update the weight of the selected keypoints based on their average detection errors. The following update rule is used, which ensures the weights of all keypoints will always sum up to 1. For example, the selected keypoints in -th iteration are . Given their average detection errors on the test dataset , their weights are updated as


where is a tuned parameter to control the effect of detection error on changing the weights.
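One update rule with the stated properties (lower error gives higher weight, and all weights keep summing to 1) is sketched below. This particular rule is an assumption chosen for illustration and is not necessarily the one used here: the selected keypoints' combined weight is redistributed among them in proportion to `exp(-gamma * error)`, where `gamma` is a hypothetical tuning parameter.

```python
import math

def update_weights(weights, selected, errors, gamma):
    """Redistribute the selected keypoints' combined weight by detection error.

    Assumed rule (not taken from the source): each selected keypoint gets a
    share of the group's total weight proportional to exp(-gamma * error).
    Unselected keypoints keep their weights, so the overall sum is unchanged.
    """
    total = sum(weights[i] for i in selected)
    scores = [math.exp(-gamma * e) for e in errors]
    norm = sum(scores)
    updated = list(weights)
    for i, s in zip(selected, scores):
        updated[i] = total * s / norm
    return updated

weights = [0.25, 0.25, 0.25, 0.25]
new_weights = update_weights(weights, selected=[0, 2], errors=[2.0, 8.0], gamma=0.5)
```

A larger `gamma` makes the redistribution more aggressive, so a single bad evaluation can nearly eliminate a candidate; a smaller value keeps the sampling closer to uniform for longer.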

III-C Synthetic Data Generation From Robotic Simulator

Fig. 2: Simulation setup and synthetically generated image of the Rethink Baxter (left) and the da Vinci Surgical System (right).

The robotic simulator offers several advantages in algorithm development. For example, the exact pose of an object can be obtained with a simple command, which is a non-trivial task in the real world. To generate the synthetic data for keypoint detection, we set up the simulation environment using the robotic simulator CoppeliaSim [26] and interfaced with it using PyRep [8] to generate the ground-truth labels and render RGB images, as shown in Figure 2. The candidate keypoints are randomly placed on the robots and the RGB images are rendered from the virtual cameras. After obtaining the position of the keypoint relative to the camera frame, $\mathbf{p}^c$, the 2D label in image coordinates for the keypoint, $\mathbf{h}$, is calculated using the camera projection model,

$$\bar{\mathbf{h}} = \frac{1}{z} \mathbf{K} \mathbf{p}^c$$

where $\mathbf{K}$ is the intrinsic matrix of the virtual camera, $z$ is the z-value of $\mathbf{p}^c$, as $\mathbf{p}^c = [x, y, z]^\top$, and the notation $\bar{\cdot}$ represents the homogeneous representation of a point (e.g. $\bar{\mathbf{h}} = [\mathbf{h}^\top, 1]^\top$).
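The projection step can be written in a few lines. The sketch below assumes a standard pinhole model with a 3x3 intrinsic matrix given as nested lists; the intrinsic values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def project_point(K, p_cam):
    """Project a 3D point in the camera frame to pixel coordinates (u, v)."""
    x, y, z = p_cam
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    # Homogeneous image point: (1 / z) * K * p
    u = (K[0][0] * x + K[0][1] * y + K[0][2] * z) / z
    v = (K[1][0] * x + K[1][1] * y + K[1][2] * z) / z
    return u, v

# Hypothetical intrinsics: focal length 500 px, principal point (320, 240).
K = [[500.0, 0.0, 320.0],
     [0.0, 500.0, 240.0],
     [0.0, 0.0, 1.0]]
u, v = project_point(K, (0.1, -0.2, 2.0))
```

Note that the label depends only on geometry, not on visibility, so a keypoint hidden behind another link or located inside a link still receives a valid 2D label.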

Although robotic simulation software is preferable for robotic algorithm development, we are also interested in real robotic applications. To bridge the reality gap, the simple but effective technique known as domain randomization [31] is applied to transfer the keypoint detection DNN, which is trained only on synthetic data, to robots in the physical world. During the data generation, the virtual cameras are manually placed in the simulated scene that approximately matches the viewpoint of the real camera, and the following randomization settings are applied to generate each training sample:

  • The angles of the robot joints are randomized within the joint limits

  • The poses of the virtual cameras are randomized by adding zero-mean Gaussian noise to the initial pose, such that

    $$[\mathbf{q}, \mathbf{b}]^\top \sim \mathcal{N}\left([\mathbf{q}_0, \mathbf{b}_0]^\top, \Sigma\right)$$

    where $\mathbf{q}$ is the quaternion, $\mathbf{b}$ is the translational vector, $[\mathbf{q}_0, \mathbf{b}_0]$ is the initial pose, and $\Sigma$ is the covariance matrix

  • The number of scene lights is randomly chosen between 1 and 3, and the lights are positioned freely in the simulated scene with varying intensities

  • Distractor objects, like chairs and tables, are placed in the simulated environment with random poses

  • The background of the rendered images is randomly selected from the COCO dataset [14] and the Indoor Scene dataset [22]

  • The color of the robot mesh is randomized by adding zero-mean Gaussian noise with a small variance to the default RGB value

  • The rendered images are augmented by adding white Gaussian noise using the image augmentation tool [9]

These randomization techniques were applied when generating synthetic data for both keypoint optimization and domain transfer.
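A few of the randomization settings listed above can be sketched as a per-sample parameter draw. This is a simplified illustration rather than the actual data-generation code: it assumes a diagonal covariance for the camera pose noise, re-normalizes the perturbed quaternion, and uses a made-up light intensity range. Rendering itself is left to the simulator.

```python
import random

def sample_scene_parameters(joint_limits, camera_pose, pose_std, rng):
    """Draw one randomized scene configuration for rendering a training image.

    joint_limits: list of (low, high) per joint, in radians.
    camera_pose:  nominal pose as a flat list [qw, qx, qy, qz, tx, ty, tz].
    pose_std:     per-component standard deviations (a diagonal covariance
                  is assumed here for simplicity).
    """
    joints = [rng.uniform(lo, hi) for lo, hi in joint_limits]
    noisy = [rng.gauss(mu, sd) for mu, sd in zip(camera_pose, pose_std)]
    # Re-normalize the quaternion part so it remains a valid rotation.
    q_norm = sum(c * c for c in noisy[:4]) ** 0.5
    pose = [c / q_norm for c in noisy[:4]] + noisy[4:]
    num_lights = rng.randint(1, 3)  # inclusive on both ends
    # The intensity range below is a hypothetical choice.
    lights = [{"intensity": rng.uniform(0.2, 1.0)} for _ in range(num_lights)]
    return {"joints": joints, "camera_pose": pose, "lights": lights}

rng = random.Random(42)
scene = sample_scene_parameters(
    joint_limits=[(-1.0, 1.0)] * 7,
    camera_pose=[1.0, 0.0, 0.0, 0.0, 0.5, 0.0, 1.2],
    pose_std=[0.01] * 7,
    rng=rng,
)
```

Seeding the generator makes a dataset reproducible, which is convenient when the same training and testing images are reused across optimization iterations.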

IV Experiments and Results

In this section, we evaluate the robustness of the keypoint optimization algorithm and the performance of the optimized keypoints in real robotic applications. Specifically, we examine the impact of different neural network architectures on the keypoint optimization algorithm, and the advantage of using the optimized keypoints for robotic applications through camera-to-base pose estimation and robot tool tracking.

IV-A Datasets and Evaluation Metrics

IV-A1 Baxter-sim dataset

This dataset contains 400 image frames (resolution: 640×480) of the Rethink Robotics Baxter robot rendered from different virtual cameras with random joint configurations. The ground-truth camera-to-base transformations for both arms and the corresponding joint angles are known from the simulation. The accuracy of the camera-to-base pose estimation is evaluated in both 2D and 3D.

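For 3D evaluations, the average distance (ADD), used in [11], is calculated for the set of keypoints. The ADD is the average Euclidean distance between the 3D keypoints transformed according to the ground-truth camera-to-base transformation (rotation $\mathbf{R}$ and translation $\mathbf{t}$) and the estimated transformation (rotation $\hat{\mathbf{R}}$ and translation $\hat{\mathbf{t}}$),

$$\mathrm{ADD} = \frac{1}{n} \sum_{\mathbf{p} \in \mathcal{K}} \left\| (\mathbf{R}\mathbf{p} + \mathbf{t}) - (\hat{\mathbf{R}}\mathbf{p} + \hat{\mathbf{t}}) \right\|_2$$

where $\mathcal{K}$ is the set of keypoints, $n$ is the number of keypoints in the set, and $\mathbf{p}$ is the 3D keypoint position relative to the base frame.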
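The ADD metric can be computed directly from its definition. The sketch below uses nested lists for rotation matrices and a toy example in which the estimate differs from the ground truth by a pure translation; the function names are illustrative.

```python
import math

def transform(R, t, p):
    """Apply the rigid transform R * p + t to a 3D point."""
    return [sum(R[r][c] * p[c] for c in range(3)) + t[r] for r in range(3)]

def average_distance(points, R_gt, t_gt, R_est, t_est):
    """ADD: mean Euclidean distance between keypoints under two transforms."""
    total = 0.0
    for p in points:
        a = transform(R_gt, t_gt, p)
        b = transform(R_est, t_est, p)
        total += math.dist(a, b)
    return total / len(points)

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
keypoints = [[0.0, 0.0, 0.0], [0.3, 0.0, 0.1], [0.5, 0.2, 0.4]]
# The estimate is off by a pure (3 cm, 4 cm, 0) translation, so ADD is 5 cm.
add = average_distance(keypoints, identity, [0.0, 0.0, 1.0],
                       identity, [0.03, 0.04, 1.0])
```

With a rotation error, points farther from the base contribute larger distances, so ADD also reflects how orientation errors grow along the arm.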

For 2D evaluations, we use the reprojection error (RE) and the percentage of correct keypoints (PCK) for the end-effector. The RE is the distance between the estimated end-effector position and its ground-truth position in image coordinates,

$$\mathrm{RE} = \left\| \frac{1}{z} \mathbf{K} \left( \hat{\mathbf{R}} \mathbf{p}_{ee} + \hat{\mathbf{t}} \right) - \bar{\mathbf{h}}_{ee} \right\|_2$$

where $\mathbf{p}_{ee}$ is the 3D end-effector position relative to the robot base frame, $\bar{\mathbf{h}}_{ee}$ is the ground-truth end-effector position in (homogeneous) image coordinates, $\mathbf{K}$ is the camera intrinsic matrix, $(\hat{\mathbf{R}}, \hat{\mathbf{t}})$ is the estimated camera-to-base transformation, and $z$ is the z-value of the projected point. The PCK, proposed in [32], measures the percentage of keypoints for which the distance between the predicted and ground-truth positions is within a certain threshold. In this experiment, the PCK for 2D end-effector localization is calculated, and the thresholds are distances in image coordinates.
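Both 2D metrics reduce to simple operations on pixel distances. The sketch below assumes the estimated end-effector position has already been projected into the image, and summarizes the PCK curve by its mean over a set of thresholds; the function names and the toy data are illustrative.

```python
import math

def reprojection_error(pixel_est, pixel_gt):
    """Pixel distance between estimated and ground-truth image positions."""
    return math.dist(pixel_est, pixel_gt)

def pck(errors, threshold):
    """Fraction of errors that fall within the threshold."""
    return sum(1 for e in errors if e <= threshold) / len(errors)

def pck_auc(errors, thresholds):
    """Mean PCK over thresholds, a simple area-under-curve summary."""
    return sum(pck(errors, th) for th in thresholds) / len(thresholds)

estimated = [(100.0, 100.0), (203.0, 204.0), (330.0, 240.0), (50.0, 80.0)]
ground_truth = [(100.0, 100.0), (200.0, 200.0), (300.0, 200.0), (50.0, 68.0)]
errors = [reprojection_error(e, g) for e, g in zip(estimated, ground_truth)]
```

Here the four errors are 0, 5, 50, and 12 pixels, so half of the estimates fall within a 10-pixel threshold and three quarters within 20 pixels.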

IV-A2 Baxter-real dataset

This dataset contains 100 image frames (resolution: 2048×1526) of the Baxter robot with 20 different joint configurations, collected using a Microsoft Azure Kinect. The ground-truth end-effector positions in the camera frame are provided; they were obtained by physically attaching an ArUco marker at the end-effector position. For evaluation, we first compute the end-effector position in the camera frame using the estimated camera-to-base transformation. The 2D and 3D PCK for the estimated end-effector are calculated using distances as thresholds.

IV-A3 SuPer tool tracking dataset

The SuPer dataset is a recording of a repeated tissue manipulation experiment using the da Vinci Research Kit (dVRK) surgical robotic system [10], where the stereo endoscopic video stream and the encoder readings of the surgical robot are provided. We extended the original surgical tool tracking dataset, which originally has 50 ground-truth surgical tool masks, to 80 ground-truth masks by hand labeling. The extended dataset covers a wider variation of tool poses, and the tool tracking performance is evaluated using the Intersection-over-Union (IoU, Jaccard index) of the rendered tool masks,

$$\mathrm{IoU} = \frac{|G \cap P|}{|G \cup P|}$$

where $G$ is the ground-truth tool mask area and $P$ is the predicted tool mask area in the image plane.
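For binary masks, the IoU amounts to counting pixels. The sketch below assumes both masks are equally sized nested lists of 0/1 values; treating two empty masks as a perfect match is a convention chosen here, not something specified in the text.

```python
def mask_iou(mask_gt, mask_pred):
    """IoU of two binary masks given as equally sized nested lists of 0/1."""
    intersection = 0
    union = 0
    for row_gt, row_pred in zip(mask_gt, mask_pred):
        for g, p in zip(row_gt, row_pred):
            intersection += 1 if (g and p) else 0
            union += 1 if (g or p) else 0
    # Two empty masks agree perfectly; this convention is an assumption.
    return intersection / union if union else 1.0

gt = [[0, 1, 1, 0],
      [0, 1, 1, 0],
      [0, 0, 0, 0]]
pred = [[0, 0, 1, 1],
        [0, 0, 1, 1],
        [0, 0, 0, 0]]
iou = mask_iou(gt, pred)
```

In this example the masks share 2 pixels and cover 6 pixels in total, so a one-column shift of a two-column-wide tool already drops the IoU to one third.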

IV-B Keypoints Optimization

Fig. 3: The plot of average detection error from the evaluation step for each iteration of keypoint optimization algorithm, while optimizing the keypoints for da Vinci surgical tool. The algorithm behaves similarly with various DNN architectures.

To demonstrate the keypoint optimization algorithm, we randomly placed 32 candidate keypoints ($N = 32$) on the da Vinci surgical tool and applied Algorithm 1 to optimize a set of 7 keypoints ($n = 7$) for robust detection. We constrained the sampling function so that one keypoint is sampled for each side of the gripper to capture its motion. The keypoint optimization algorithm was run for 15 iterations ($M = 15$) with on the synthetic dataset, containing 2K samples for training and 500 samples for evaluation. In the training step, the DNN is trained on the training dataset for 100,000 iterations, with the learning rate of .

To examine the impact of different neural network architectures on the keypoint optimization algorithm, we trained the DNNs with four different feature extractors: ResNet_50, ResNet_101, MobileNet_v2_1.0, and MobileNet_v2_0.5 [27]. The number at the end of each ResNet name indicates the number of layers; for the MobileNets, it indicates the width of the network, which essentially determines the number of parameters. The average detection error from the evaluation step in each iteration is shown in Fig. 3, demonstrating that our algorithm is agnostic to different DNN architectures. With different feature extractors, the algorithm converges at a similar rate and outputs the same set of optimal keypoints, which is shown in the middle-right of Fig. 1. However, deeper networks do end up having smaller errors. Note that the optimized keypoints are inside the tool body, which would be impossible to label without the simulation software.

IV-C Camera-to-base Pose Estimation from a Single Image

We explored camera-to-base pose estimation performance from a single RGB image by utilizing the optimized keypoints. The widely used research robot, the Rethink Robotics Baxter [25], was used for the experiments. Without loss of generality, the base frame refers to the base frame of the left 7 degree-of-freedom (DoF) arm. To localize the 7-DoF robot arm, we optimized seven keypoints (one per link) and estimated the pose with the Efficient Perspective-n-Point Camera Pose Estimation algorithm (EPnP) [12]. The DNN was used to estimate the 2D positions of the optimized keypoints for a given RGB image, and the corresponding 3D positions were computed using the forward kinematics of the robot arm. Then, the EPnP algorithm, as implemented in the OpenCV package, was used to estimate the transformation between the camera and the robot ($\hat{\mathbf{R}}$, $\hat{\mathbf{t}}$).

To find the optimal keypoints, we randomly sampled three candidate keypoints for each link on the left arm, and Algorithm 1 was applied to find one optimal keypoint per link, with and . ResNet_50 was utilized as the feature extractor of the DNNs and was initialized with ImageNet-pretrained weights. The DNNs were trained for 500,000 iterations, with a learning rate of 0.2 for the first 400,000 iterations and 0.02 for the remaining 100,000 iterations. The optimized keypoints for the Baxter’s left arm are shown in the middle-left of Fig. 1.


IV-C1 Experiments in simulation

Keypoint Set           Left Arm   Right Arm
Random Keypoints       6.51       33.93
Keypoints at Joints    4.72       35.02
Optimized Keypoints    3.55       5.98

TABLE I: The average RE (pixels) for 2D end-effector pose estimation on the Baxter-sim dataset. The keypoint detection is optimized for the left arm but tested on both arms.

Keypoint Set           Left Arm   Right Arm
Random Keypoints       76.95      230.60
Keypoints at Joints    55.60      262.82
Optimized Keypoints    50.08      82.37

TABLE II: The average ADD (mm) for the set of 3D keypoints on the Baxter-sim dataset.

Even though the DNN for keypoint detection was trained to estimate the left arm’s keypoints, the corresponding keypoints on the right arm can also be detected by flipping the images horizontally. Fig. 4 shows the detections of the optimized keypoints for both arms. We further visualize the camera-to-base pose estimation performance by skeletonizing the arms. The skeleton of the arms is obtained using the forward kinematics based on the estimated base frame. As shown in Fig. 4, the estimated skeletons align closely with the robot arms. For comparison, we experimented with three different keypoint sets on the Baxter-sim dataset: a set of randomly selected keypoints with one keypoint for each link, the keypoints at the exact locations of the seven joints (as used in [11], and what one might normally assume to be the most obvious choice), and the set of seven optimized keypoints. The RE and ADD results for the different keypoint sets are shown in Tables I and II, demonstrating that the optimized keypoints achieve the best performance for both the left and right arms. The 2D PCK results with various thresholds are shown in Fig. 5. While there is no drastic improvement for the left arm, the detections for the right arm improve significantly with the optimized keypoints, indicating that the optimized keypoints are robust enough to provide better detection even on flipped images.

Fig. 4: The examples of the optimized keypoints detection (left) and the skeletonization (right) of the Baxter robot in synthetic images. The skeleton of the arms is based on the estimated base frame (RGB reference frame).
Fig. 5: The PCK results of the left and right arm for three different sets of keypoints on the Baxter-sim dataset. As the DNN only detects the keypoints on the left arm, the estimations on the right arm are obtained by flipping the images horizontally. The thresholds are distance in pixels and the numbers in parentheses indicate the area under the curve (AUC).

IV-C2 Real-world Experiments

To transfer the keypoint detection to the real robot, we applied the domain randomization techniques described in Section III-C to bridge the reality gap, and the experimental results show that the DNN generalizes well to real-world images. The original images were scaled by 0.25 before being sent to the DNN. Fig. 6 shows the optimized keypoints’ detections on the Baxter-real dataset, demonstrating that the keypoints can be detected even with self-occlusions. Given the accurate keypoint detections, the estimated skeletons are closely aligned with the robot arm. To examine the detection performance of different keypoints on real robots, the three keypoint sets from the simulation experiments were used. We also implemented the traditional camera-to-base pose estimation procedure for comparison by placing ArUco markers on the robot arm. The 2D and 3D PCK results for the different methods are shown in Fig. 7. Using the optimized keypoints, around 50 percent of the estimations have less than 25 pixels of error in the image plane (about 1 percent of the image size) and an error of less than 50 mm in 3D space, which is much better than the other methods. Another number to notice is the area under the curve, which indicates the mean of the PCKs over the thresholds. The optimized keypoints achieve the highest value in both 2D and 3D evaluations. Because self-occlusions and the camera’s pose limit the visibility of the visual markers, some image frames do not have enough detected ArUco markers for EPnP (at least four are required). Therefore, the estimation for those frames is unavailable.

Fig. 6: The examples of the optimized keypoints detection (left) and the skeletonization (right) of the Baxter robot in real-world images. Keypoints are estimated even with self-occlusions, and the skeletons are perfectly aligned with the robot arm.
Fig. 7: The PCK results on the Baxter-real dataset using different keypoints for end-effector localization in 2D (top) and 3D (bottom). The thresholds are distances, and the AUCs are shown in parentheses. Around 50 percent of the estimations have an error of less than 1 percent of the image size (2048×1526) using the optimized keypoints.

IV-D Surgical Tool Tracking

Surgical tool tracking continuously estimates the 3D pose of the tool end-effector with respect to the camera frame. This is necessary for augmented reality displays [23] and the transfer of learning-based control policies [24], among other applications [33]. We employed the tool tracker previously developed in [16], which combines a keypoint detector and a particle filter for 3D pose estimation. The keypoint detector detects the predefined keypoints on the input images to provide visual feedback, and the particle filter estimates the 3D pose of the tool by combining the forward kinematics of the robot with the keypoint observations. Differing from [13] and [16], we use the optimized keypoints (red points in Fig. 1) instead of the original hand-picked keypoints. Domain randomization is then used to bridge the reality gap between the synthetic and real-world images. The resulting keypoint detections are shown in Fig. 8. Note that the optimized keypoints are inside the tool body, and the DNN can accurately predict their projections onto the image plane in different tool configurations.

For comparison, we evaluated the tool tracking performance with three different setups described as follows.

Hand-labelled SuPer Deep Keypoints: This setup is identical to the tool tracking approach in [16]. Seven hand-picked keypoints on the surface of the tool were used for tracking. The DNN was trained on 100 real-world images with hand-labeled ground-truth positions.

Synthetically-labelled SuPer Deep Keypoints: The second setup used the same set of hand-picked keypoints from SuPer Deep, but the DNN was trained on around 20K synthetic images with domain randomization, where the keypoint labels are provided even with occlusions.

Optimized Keypoints: The third setup used the seven optimized keypoints, as shown in Fig. 1, and the DNN was also trained on the synthetic images with domain randomization.

The tool tracking performance is computed by rendering a re-projected tool mask on the image frame from the estimation based on the keypoints. The IoU is computed with the ground-truth mask for evaluation. The quantitative results are shown in Fig. 9, and the qualitative results are shown in Fig. 10. The setup with Hand-labelled SuPer Deep Keypoints fails to track the tool when the tool is turning, as the hand-picked keypoints on the tool surface become occluded and humans cannot provide labels for non-visible keypoints. With the optimized keypoints, however, the DNN makes accurate predictions for non-visible keypoints, as shown in Fig. 8, since the simulator can always query the pose of a point in a virtual camera frame. Although both the Synthetically-labelled SuPer Deep Keypoints and the Optimized Keypoints setups utilize the synthetic dataset, the optimized keypoints achieve higher accuracy because they are selected to optimize the detection performance of the DNN. Another advantage of the optimized keypoints is reduced ambiguity in detection. As stated in [16], the detection of some keypoints is challenging due to the tool’s symmetry, which can cause false detections. The optimized keypoints, distributed around the central line of the tool, can significantly reduce the ambiguity caused by symmetry.

Fig. 8: Detection of the optimized keypoints for the surgical tool on synthetic (top) and real (bottom) images. The 3D position of the optimized keypoints relative to the tool is shown in the middle-right of the Fig. 1, and the DNN accurately predicts their projections on the image plane.
Fig. 9: Box plot of the IoUs of the rendered tool mask for the three tool tracking setups (circles are outliers). The Optimized Keypoints setup has lower variance with high accuracy.
Fig. 10: Qualitative results of the tool tracking for the three setups. From the top row to the bottom row, the figures show the results of the Hand-labelled SuPer Deep Keypoints, the Synthetically-labelled SuPer Deep Keypoints, and the Optimized Keypoints. The green area shows the intersection of the rendered mask and the ground-truth mask, and the red area shows the difference between the rendered mask and the ground-truth mask.

V Discussion and Conclusion

We proposed a general keypoint optimization algorithm to maximize the performance of keypoint detection on robotic manipulators. Our algorithm uses a DNN for keypoint detection and can even handle self-occlusions by optimizing the keypoint locations and training on synthetically generated data. The results show that the optimized keypoints yield higher detection accuracy than manually or randomly selected keypoints, and hence better performance for the wide breadth of robotic applications that rely on keypoints for visual feedback. To show this, we presented both quantitative and qualitative results from two robotic applications: camera-to-base pose estimation and surgical tool tracking.

The experimental results, in which keypoints are detected under self-occlusion and their 3D locations lie inside the kinematic links rather than on the surface, further motivate the importance of this work. To our knowledge, no manually selected keypoints have produced this type of result before. For future work, we will investigate self-supervised approaches to selecting candidate points such that the resulting optimized keypoints have even higher detectability.


  • [1] H. Bay et al. (2006) Surf: speeded up robust features. In European conference on computer vision, pp. 404–417. Cited by: §II.
  • [2] S. Buoncompagni et al. (2015) Saliency-based keypoint selection for fast object detection and matching. Pattern Recognition Letters 62, pp. 32–40. Cited by: §II.
  • [3] I. Fassi and G. Legnani (2005) Hand to sensor calibration: a geometrical interpretation of the matrix equation AX = XB. Journal of Robotic Systems 22 (9), pp. 497–506. Cited by: §I.
  • [4] M. A. Fischler and R. C. Bolles (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24 (6), pp. 381–395. Cited by: §III-B.
  • [5] S. Garrido-Jurado et al. (2014) Automatic generation and detection of highly reliable fiducial markers under occlusion. Pattern Recognition 47 (6), pp. 2280–2292. Cited by: §I.
  • [6] K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385. Cited by: §III-A.
  • [7] T. Jakab et al. (2018) Unsupervised learning of object landmarks through conditional image generation. In Advances in neural information processing systems, pp. 4016–4027. Cited by: §II.
  • [8] S. James, M. Freese, and A. J. Davison (2019) PyRep: bringing v-rep to deep robot learning. arXiv preprint arXiv:1906.11176. Cited by: §III-C.
  • [9] A. B. Jung et al. (2020) imgaug. Note: Online; accessed 01-Feb-2020 Cited by: 7th item.
  • [10] P. Kazanzides et al. (2014-05) An open-source research kit for the da Vinci® surgical system. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 6434–6439. Cited by: §I, §IV-A3.
  • [11] T. E. Lee et al. (2020) Camera-to-robot pose estimation from a single image. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 9426–9432. Cited by: §I, §IV-A1, §IV-C1.
  • [12] V. Lepetit, F. Moreno-Noguer, and P. Fua (2009) EPnP: an accurate O(n) solution to the PnP problem. International journal of computer vision 81 (2), pp. 155. Cited by: §IV-C.
  • [13] Y. Li et al. (2020) SuPer: a surgical perception framework for endoscopic tissue manipulation with surgical robotics. IEEE Robotics and Automation Letters 5 (2), pp. 2294–2301. Cited by: §IV-D.
  • [14] T. Lin et al. (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: 5th item.
  • [15] D. G. Lowe (2004) Distinctive image features from scale-invariant keypoints. International journal of computer vision 60 (2), pp. 91–110. Cited by: §II.
  • [16] J. Lu et al. (2020) SuPer deep: a surgical perception framework for robotic tissue manipulation using deep learning for feature extraction. arXiv preprint arXiv:2003.03472. Cited by: §I, §I, §IV-D, §IV-D, §IV-D.
  • [17] A. Mathis et al. (2018) DeepLabCut: markerless pose estimation of user-defined body parts with deep learning. Nature neuroscience 21 (9), pp. 1281. Cited by: §III-A.
  • [18] P. Mukherjee, S. Srivastava, and B. Lall (2016) Salient keypoint selection for object representation. In 2016 Twenty Second National Conference on Communication (NCC), pp. 1–6. Cited by: §II.
  • [19] V. Nannen and G. Oliver (2013) Grid-based spatial keypoint selection for real time visual odometry.. In ICPRAM, pp. 586–589. Cited by: §II.
  • [20] E. Olson (2011) AprilTag: a robust and flexible visual fiducial system. In 2011 IEEE International Conference on Robotics and Automation, pp. 3400–3407. Cited by: §I.
  • [21] J. Qian and J. Su (2002) Online estimation of image jacobian matrix by kalman-bucy filter for uncalibrated stereo vision feedback. In Proceedings 2002 IEEE International Conference on Robotics and Automation, Vol. 1, pp. 562–567. Cited by: §I.
  • [22] A. Quattoni and A. Torralba (2009) Recognizing indoor scenes. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 413–420. Cited by: 5th item.
  • [23] F. Richter et al. (2019) Augmented reality predictive displays to help mitigate the effects of delayed telesurgery. In 2019 International Conference on Robotics and Automation (ICRA), pp. 444–450. Cited by: §IV-D.
  • [24] F. Richter et al. (2019) Open-sourced reinforcement learning environments for surgical robotics. arXiv preprint arXiv:1903.02090. Cited by: §IV-D.
  • [25] Rethink Robotics (2013) Baxter. Retrieved Jan 10, 2014. Cited by: §I, §IV-C.
  • [26] E. Rohmer, S. P. N. Singh, and M. Freese (2013) CoppeliaSim (formerly V-REP): a versatile and scalable robot simulation framework. In Proc. of The International Conference on Intelligent Robots and Systems (IROS). Cited by: §III-C.
  • [27] M. Sandler et al. (2018-06) MobileNetV2: inverted residuals and linear bottlenecks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. ISBN 9781538664209. Cited by: §IV-B.
  • [28] S. Suwajanakorn et al. (2018) Discovery of latent 3d keypoints via end-to-end geometric reasoning. In Advances in neural information processing systems, pp. 2059–2070. Cited by: §II.
  • [29] H. Tamimi et al. (2006) Localization of mobile robots with omnidirectional vision using particle filter and iterative sift. Robotics and Autonomous Systems 54 (9), pp. 758–765. Cited by: §II.
  • [30] J. Thewlis et al. (2017) Unsupervised learning of object landmarks by factorized spatial embeddings. In Proceedings of the IEEE international conference on computer vision, pp. 5916–5925. Cited by: §II.
  • [31] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel (2017) Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 23–30. Cited by: §III-C.
  • [32] Y. Yang and D. Ramanan (2012) Articulated human detection with flexible mixtures of parts. IEEE transactions on pattern analysis and machine intelligence 35 (12), pp. 2878–2890. Cited by: §IV-A1.
  • [33] M. Yip and N. Das (2017) Robot autonomy for surgery. In Encyclopedia of Medical Robotics, pp. 281–313. Cited by: §IV-D.
  • [34] Y. Zhang et al. (2018) Unsupervised discovery of object landmarks as structural representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2694–2703. Cited by: §II.