Visual feedback plays an integral role in robotics because of the rich information images provide. Historically, there have been two popular approaches to incorporating it. The first is calibration: finding the transform between the base of the robot and the camera. This transform describes the geometric relationship between objects in the camera frame and the robot, allowing image processing and detection algorithms to provide the context a robot needs to act in its environment. The second is directly estimating the relationship between the control inputs, typically joint angles, and the end-effector position in the camera frame. This type of visual feedback also allows for end-effector control in the camera frame.
An important step in properly integrating visual feedback is detecting features and finding their correspondence on the robot. A common approach is placing visual markers on the robot that are easy to detect and hence provide keypoints in the image frame. But where should one place the markers? No established approach works best. Marker-based methods require modifying the robot, and the visibility of the markers frequently suffers from self-occlusion. Moreover, finding the 3D location of a marker relative to the robot's kinematic chain is challenging, which can introduce inaccuracies into the visual feedback. Deep learning approaches for detecting keypoints have been proposed to remove the need to modify the robot. Training Deep Neural Networks (DNNs) for keypoint detection has even been extended to synthetically generated data, which provides accurate ground truth with the correct corresponding 3D location relative to the kinematic chain.
Although these methods deliver robust keypoint detections, they fail to address an important question: where should the keypoints be placed relative to the kinematic chain? Previous work relies on hand-picked locations, which may be sub-optimal and can even limit the performance of the detection algorithm. An example of this challenge arises with symmetric robotic tools, where the keypoint detection algorithm cannot solve the correspondence problem correctly.
Leveraging DNNs as learnable keypoint detectors alongside robotic simulation tools, we present the following novel contributions:
a general keypoint optimization algorithm which solves for the locations of the set of keypoints to maximize their detectability,
a keypoint detection algorithm based on a DNN so it can be trained to learn any candidate set of keypoints in simulation, and
effective simulation-to-real (sim-to-real) transfer of the DNN using domain randomization.
To show the effectiveness of the proposed methods, the optimized keypoint detection algorithm is tested in both a calibration and a tracking scenario: (i) a Rethink Robotics Baxter robot for calibrating the base-to-camera transform and (ii) the da Vinci Research Kit (dVRK) for real-time surgical tool tracking. Surprisingly, the keypoint locations produced by the optimization in these experiments lie inside the robotic links rather than on the surface. The corresponding DNN-based keypoint detector consistently and accurately detects the 2D image projections of these keypoints, even under self-occlusion. These unexpected results highlight the necessity of keypoint location optimization, since no manually chosen keypoints could reproduce this level of detectability.
II Related Works
Keypoint optimization has been widely investigated. Early works in robotics explored how to select the optimal keypoints extracted from SIFT and SURF for visual odometry and localization. In computer vision, salient keypoints have been selected by considering their detectability, distinctiveness, and repeatability. Recently, there has been a shift toward DNNs for detecting keypoints because of their improved performance. More specifically, works in the computer vision community have learned and optimized keypoints for better face recognition and human pose estimation using DNNs. However, those algorithms usually consume a large number of real images for training. Recent work proposed an end-to-end framework to optimize keypoints for 3D pose estimation in an unsupervised manner by utilizing synthetically generated data. Our work differs in its goal: we optimize the keypoint detection performance of the DNN itself. Whereas these approaches learn keypoints for a specific downstream task and cannot generalize to other applications, our algorithm is more general and can be applied to the many robotic tasks using visual feedback where the kinematics and the 3D geometric information of the keypoints are required.
Our method, which achieves robust keypoint detection of robot manipulators under self-occlusion, comprises three major components: a DNN that can be trained to detect a set of user-defined keypoints, a novel algorithm for keypoint optimization, and an effective data generation technique for sim-to-real transfer.
III-A Deep Neural Network for Keypoint Detection
The advantage of deep learning methods is their learnability: their parameters can be optimized for a desired output in a data-driven manner. Inspired by DeepLabCut, a method for marker-less tracking of a set of user-defined keypoints, we utilize its backbone neural network for keypoint detection. This DNN uses a ResNet as the feature extractor, followed by deconvolutional layers that up-sample the feature maps; sigmoid activations then produce the spatial probability densities from which the 2D keypoint locations are estimated.
For simplicity, the DNN can be regarded as a function that takes an input image and produces the set of estimated 2D image coordinates of the keypoints on that image frame. When trained with synthetic data, which provides the ground-truth location for every keypoint as long as it is within the camera's frustum, the DNN can learn to estimate keypoint locations even under self-occlusion.
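The final heatmap-to-coordinate step can be sketched as follows. This is a minimal illustration, assuming (as in DeepLabCut-style detectors) that each keypoint's 2D location is read off as the argmax of its sigmoid activation map; the function name and toy map are ours, not the paper's code:

```python
import numpy as np

def keypoints_from_heatmaps(heatmaps):
    """Extract 2D keypoint coordinates from per-keypoint probability maps.

    heatmaps: array of shape (K, H, W), one sigmoid activation map per keypoint.
    Returns a (K, 2) array of (x, y) image coordinates and a (K,) confidence.
    """
    K, H, W = heatmaps.shape
    flat = heatmaps.reshape(K, -1)
    idx = flat.argmax(axis=1)            # peak of each probability map
    ys, xs = np.unravel_index(idx, (H, W))
    conf = flat.max(axis=1)              # peak value doubles as confidence
    return np.stack([xs, ys], axis=1), conf

# toy example: a single map peaked at row 3, column 5
hm = np.zeros((1, 8, 8))
hm[0, 3, 5] = 1.0
pts, conf = keypoints_from_heatmaps(hm)
# pts[0] -> [5, 3] (x, y), conf[0] -> 1.0
```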
III-B Keypoint Optimization for Robotic Manipulators
Finding the keypoints that minimize the detection error improves the performance of any robotic task that relies on keypoints as visual feedback. Because the DNN is trained end-to-end to detect a set of keypoints, the detection performance of a given keypoint can vary depending on which other keypoints it is grouped with. The keypoint optimization algorithm's main objective is therefore to iteratively find the set of keypoints that maximizes the detection performance of the DNN.
In the proposed algorithm, N candidate keypoints are sampled on the kinematic chain. The size of the keypoint set, k, can vary but must be less than the number of candidate keypoints (k &lt; N). The optimization uses a Random Sample Consensus (RANSAC) style approach to find the optimal set of keypoints iteratively, in four steps. The first step is to sample a set of keypoints: each candidate keypoint has an associated weight, which can be interpreted as its confidence of being in the optimal set, and a set of k keypoints is sampled according to these weights in each iteration. The second step is to train a DNN to detect the selected keypoints using the training data. In our implementation, we generate the training and testing images with ground-truth labels for all candidate keypoints beforehand; the labels for the keypoints selected in each iteration can then be obtained without re-generating the datasets. The third step is to evaluate the trained DNN on the testing data. The detection error is defined as the pixel-wise distance between the estimated and ground-truth image coordinates, and the average detection error of each keypoint is computed to update its weight. The fourth step is to update the weights of the selected keypoints so that keypoints with lower average detection error receive higher weight. These four steps are repeated for a fixed number of iterations, and the set of keypoints that minimizes the mean of the average detection errors is returned as the optimal set.
In the keypoint optimization algorithm, see Algorithm 1, we represent the set of candidate keypoints as pairs (p_i, w_i), i = 1, ..., N, where p_i is the 3D position of the i-th keypoint with respect to the robotic link it belongs to and w_i is its corresponding weight. In each iteration, a DNN is trained to detect the currently selected set of keypoints, and the following functions are used.
The function sampleKeypoints randomly selects k keypoints among the N candidates, taking their weights as the probabilities of being chosen. The keypoints can be divided into sub-groups to constrain the sampling result. For example, to obtain one optimized keypoint per link, the keypoints of a robotic arm can be grouped by link so that the function samples one keypoint from the candidates for each link.
The function Train trains a DNN on a training dataset to predict the image coordinates of the keypoints selected in the current iteration.
The function Evaluate evaluates the keypoint set by running detection on the testing dataset and computes the average detection error e_i for each selected keypoint. For the i-th keypoint, the average detection error is calculated as

e_i = (1/M) Σ_{m=1}^{M} || p̂_i^m − p_i^m ||,

where M is the total number of test images, p̂_i^m is the estimated keypoint location, and p_i^m is the ground-truth keypoint location in the m-th frame.
At the end of each iteration, the function Update adjusts the weights of the selected keypoints based on their average detection errors, under the constraint that the weights of all keypoints always sum to 1. Given the average detection errors e_1, ..., e_k of the selected keypoints on the test dataset, each selected weight is scaled as

w_j ← w_j · exp(−λ e_j),

after which all N weights are renormalized to sum to 1, where λ is a tuned parameter that controls the effect of the detection error on changing the weights.
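The four-step loop can be sketched as follows. This is an illustrative implementation, not the paper's code: `evaluate` stands in for the expensive train-and-evaluate steps (it must return per-keypoint average detection errors in pixels), the exponential weight update is one rule satisfying the stated properties (lower error raises weight; weights renormalize to sum to 1), and the per-link grouping constraint is omitted for brevity:

```python
import numpy as np

def optimize_keypoints(n_candidates, k, evaluate, iters=15, lam=0.1, rng=None):
    """RANSAC-style search for the k keypoints that minimize detection error.

    evaluate(subset) stands in for the train + evaluate steps: it returns the
    per-keypoint average detection error (in pixels) for the sampled subset.
    """
    rng = np.random.default_rng(rng)
    w = np.full(n_candidates, 1.0 / n_candidates)  # candidate weights, sum to 1
    best_set, best_err = None, np.inf
    for _ in range(iters):
        # Step 1: sample k keypoints with probability proportional to weight.
        subset = rng.choice(n_candidates, size=k, replace=False, p=w)
        # Steps 2-3: train a DNN on the subset and measure per-keypoint error.
        errs = np.asarray(evaluate(subset), dtype=float)
        # Step 4: reward low-error keypoints, then renormalize so sum(w) == 1.
        w[subset] *= np.exp(-lam * errs)
        w /= w.sum()
        # Track the subset with the lowest mean of average detection errors.
        if errs.mean() < best_err:
            best_set, best_err = sorted(int(i) for i in subset), errs.mean()
    return best_set, best_err

# toy run: a keypoint's "error" equals its index, so low indices should win
best, err = optimize_keypoints(10, 3, lambda s: [float(i) for i in s], rng=0)
```

In the real algorithm, `evaluate` would train the keypoint-detection DNN on the pre-generated synthetic training set and report its test-set errors.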
III-C Synthetic Data Generation From Robotic Simulator
A robotic simulator offers several advantages for algorithm development. For example, the exact pose of an object can be obtained with a single command, a non-trivial task in the real world. To generate synthetic data for keypoint detection, we set up the simulation environment in the robotic simulator CoppeliaSim, interfaced through PyRep, to generate the ground-truth labels and render RGB images, as shown in Figure 2. The candidate keypoints are randomly placed on the robots and the RGB images are rendered from the virtual cameras. After obtaining the position p^c of a keypoint relative to the camera frame, its 2D label in image coordinates is calculated using the camera projection model

ũ = (1/z) K p^c,

where K is the intrinsic matrix of the virtual camera, z is the z-component of p^c, and ũ is the homogeneous representation of the image point (e.g. ũ = [u, v, 1]^T).
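The label-generation step is a standard pinhole projection; a minimal sketch (the intrinsic values below are illustrative, not the simulator's):

```python
import numpy as np

def project_keypoint(K, p_cam):
    """Pinhole projection of a 3D keypoint (camera frame) to image coordinates.

    K: 3x3 intrinsic matrix of the virtual camera.
    p_cam: 3-vector, keypoint position in the camera frame (z > 0).
    """
    p = K @ p_cam              # homogeneous image point, scaled by depth z
    return p[:2] / p_cam[2]    # divide by z to get pixel coordinates (u, v)

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
# a point 0.1 m right and 0.05 m up of the optical axis, 2 m away
uv = project_keypoint(K, np.array([0.1, -0.05, 2.0]))
# uv -> (345.0, 227.5)
```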
Although robotic simulation software is preferable for algorithm development, we are ultimately interested in real robotic applications. To bridge the reality gap, the simple but effective technique known as domain randomization is applied to transfer the keypoint detection DNN, trained only on synthetic data, to robots in the physical world. During data generation, the virtual cameras are manually placed in the simulated scene at poses that approximately match the viewpoint of the real camera, and the following randomization settings are applied to generate each training sample:
The robot joint angles are randomized within the joint limits
The poses of the virtual cameras are randomized by adding zero-mean Gaussian noise, drawn with covariance matrix Σ, to the initial pose, perturbing both the orientation quaternion q and the translation vector t
The number of scene lights is randomly chosen between 1 and 3, and the lights are positioned freely in the simulated scene with varying intensities
Distractor objects, like chairs and tables, are placed in the simulated environment with random poses
The color of the robot mesh is randomized by adding zero-mean Gaussian noise with a small variance to the default RGB values
The rendered images are augmented with additive white Gaussian noise using an image augmentation tool
These randomization techniques were applied when generating synthetic data for both keypoint optimization and domain transfer.
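One randomized sample might be drawn roughly as follows. This is a sketch under stated assumptions: `sample_randomization` is a hypothetical helper, and the noise scales and covariance values are illustrative placeholders, not the values used in the paper:

```python
import numpy as np

def sample_randomization(joint_limits, cam_pose, base_rgb, rng=None):
    """Draw one randomized scene configuration for synthetic data generation.

    joint_limits: (J, 2) array of (low, high) limits per joint.
    cam_pose: 7-vector [x, y, z, qx, qy, qz, qw], initial camera pose.
    base_rgb: default mesh color, components in [0, 1].
    """
    rng = np.random.default_rng(rng)
    # joint angles uniform within their limits
    joints = rng.uniform(joint_limits[:, 0], joint_limits[:, 1])
    # zero-mean Gaussian pose noise (illustrative per-component std-devs)
    cam = cam_pose + rng.normal(0.0, [0.02] * 3 + [0.01] * 4)
    cam[3:] /= np.linalg.norm(cam[3:])        # re-normalize the quaternion
    n_lights = int(rng.integers(1, 4))        # 1 to 3 scene lights
    # small color perturbation around the default mesh RGB
    rgb = np.clip(base_rgb + rng.normal(0.0, 0.02, size=3), 0.0, 1.0)
    return joints, cam, n_lights, rgb

limits = np.array([[-1.0, 1.0]] * 7)
pose0 = np.array([0.5, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0])
joints, cam, n_lights, rgb = sample_randomization(limits, pose0,
                                                  np.array([0.8, 0.1, 0.1]),
                                                  rng=0)
```

Distractor objects and image-level noise would be added on top of this in the simulator and the augmentation pipeline, respectively.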
IV Experiments and Results
In this section, we evaluate the robustness of the keypoint optimization algorithm and the performance of the optimized keypoints in real robotic applications. Specifically, we examine the impact of different neural network architectures on the keypoint optimization algorithm and the advantage of using the optimized keypoints in two robotic applications: camera-to-base pose estimation and robot tool tracking.
IV-A Datasets and Evaluation Metrics
IV-A1 Baxter-sim dataset
This dataset contains 400 image frames (resolution: 640×480) of the Rethink Robotics Baxter robot rendered from different virtual cameras with random joint configurations. The ground-truth camera-to-base transformations for both arms and the corresponding joint angles are known from the simulation. The accuracy of the camera-to-base pose estimation is evaluated in both 2D and 3D.
For 3D evaluation, the average distance (ADD) is calculated for the set of keypoints. The ADD is the average Euclidean distance between the 3D keypoints transformed by the ground-truth camera-to-base transformation (rotation R, translation t) and by the estimated transformation (rotation R̂, translation t̂),

ADD = (1/N) Σ_{p ∈ K} || (R p + t) − (R̂ p + t̂) ||,

where K is the set of keypoints, N is the number of keypoints in the set, and p is a 3D keypoint position relative to the base frame.
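The ADD metric is a few lines of linear algebra; a minimal sketch (the toy keypoints and transforms are ours):

```python
import numpy as np

def add_metric(points, R_gt, t_gt, R_est, t_est):
    """Average 3D distance between keypoints under the ground-truth and
    estimated camera-to-base transformations.

    points: (N, 3) keypoint positions in the robot base frame.
    """
    gt = points @ R_gt.T + t_gt      # keypoints under the true transform
    est = points @ R_est.T + t_est   # keypoints under the estimated transform
    return np.linalg.norm(gt - est, axis=1).mean()

pts = np.array([[0.1, 0.0, 0.5],
                [0.0, 0.2, 0.7]])
R, t = np.eye(3), np.zeros(3)
err0 = add_metric(pts, R, t, R, t)                            # identical -> 0
err1 = add_metric(pts, R, t, R, np.array([0.0, 0.0, 0.01]))   # 1 cm offset
# err0 -> 0.0, err1 -> 0.01
```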
For 2D evaluation, we use the reprojection error (RE) and the percentage of correct keypoints (PCK) for the end-effector. The RE is the pixel distance between the estimated end-effector position and its ground-truth position in image coordinates,

RE = || (1/z) K (R̂ p_ee + t̂) − ũ_gt ||,

where p_ee is the 3D end-effector position relative to the robot base frame, ũ_gt is the homogeneous ground-truth end-effector position in image coordinates, and z is the z-component of the projected point. The PCK measures the percentage of keypoints for which the distance between the predicted and ground-truth positions falls within a threshold. In this experiment, the PCK for 2D end-effector localization is calculated with thresholds given as distances in image coordinates.
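Both 2D metrics can be sketched directly; the intrinsics and the toy values are illustrative:

```python
import numpy as np

def reprojection_error(K, R, t, p_base, uv_gt):
    """Pixel distance between the projected end-effector and its ground truth.

    K: 3x3 intrinsics; (R, t): estimated camera-to-base transform;
    p_base: 3D end-effector in the base frame; uv_gt: ground-truth pixel.
    """
    p_cam = R @ p_base + t               # transform into the camera frame
    uv = (K @ p_cam)[:2] / p_cam[2]      # pinhole projection
    return float(np.linalg.norm(uv - uv_gt))

def pck(errors, threshold):
    """Fraction of detections whose error is within the threshold."""
    return float((np.asarray(errors) <= threshold).mean())

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
# point on the optical axis projects to the principal point -> zero error
e = reprojection_error(K, np.eye(3), np.zeros(3),
                       np.array([0.0, 0.0, 2.0]), np.array([320.0, 240.0]))
frac = pck([3.0, 12.0, 40.0], 15.0)   # two of three errors within 15 px
```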
IV-A2 Baxter-real dataset
This dataset contains 100 image frames (resolution: 2048×1526) of the Baxter robot in 20 different joint configurations, collected using a Microsoft Azure Kinect. Ground-truth end-effector positions in the camera frame were obtained by physically attaching an ArUco marker at the end-effector. For evaluation, we first compute the end-effector position in the camera frame using the estimated camera-to-base transformation, then calculate the 2D and 3D PCK for the estimated end-effector using distances as thresholds.
IV-A3 SuPer tool tracking dataset
The SuPer dataset (https://sites.google.com/ucsd.edu/super-framework/home) is a recording of a repeated tissue manipulation experiment using the da Vinci Research Kit (dVRK) surgical robotic system, providing the stereo endoscopic video stream and the encoder readings of the surgical robot. We extended the original surgical tool tracking dataset from 50 to 80 ground-truth surgical tool masks by hand labeling. The extended dataset covers a wider variation of tool poses, and tool tracking performance is evaluated using the Intersection-Over-Union (IoU, Jaccard Index) of the rendered tool masks,

IoU = |A_gt ∩ A_pred| / |A_gt ∪ A_pred|,

where A_gt is the ground-truth tool mask area and A_pred is the predicted tool mask area in the image plane.
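Computed on boolean masks, the IoU is a two-liner; a minimal sketch with toy masks:

```python
import numpy as np

def mask_iou(mask_gt, mask_pred):
    """Intersection-over-Union between two boolean tool masks."""
    inter = np.logical_and(mask_gt, mask_pred).sum()
    union = np.logical_or(mask_gt, mask_pred).sum()
    return inter / union if union else 1.0  # two empty masks agree perfectly

a = np.zeros((4, 4), bool); a[:2, :2] = True   # 4-pixel mask
b = np.zeros((4, 4), bool); b[:2, :] = True    # 8-pixel mask containing a
# mask_iou(a, b) -> 4 / 8 = 0.5
```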
IV-B Keypoint Optimization
To demonstrate the keypoint optimization algorithm, we randomly placed 32 candidate keypoints (N = 32) on the da Vinci surgical tool and applied Algorithm 1 to optimize a set of 7 keypoints (k = 7) for robust detection. We constrained the sampling function so that one keypoint is sampled for each side of the gripper to capture its motion. The keypoint optimization algorithm ran for 15 iterations on a synthetic dataset containing 2K samples for training and 500 samples for evaluation. In the train step, the DNN is trained on the training dataset for 100,000 iterations at a fixed learning rate.
To examine the impact of different neural network architectures on the keypoint optimization algorithm, we trained DNNs with four different feature extractors: ResNet_50, ResNet_101, MobileNet_v2_1.0, and MobileNet_v2_0.5. The suffix of the ResNets indicates the number of layers; for the MobileNets, it indicates the width multiplier, which determines the number of parameters of the network. The average detection errors from the evaluation step in each iteration are shown in Fig. 3, demonstrating that our algorithm is agnostic to the DNN architecture. With different feature extractors, the algorithm converges at a similar rate and outputs the same set of optimal keypoints, shown in the middle-right of Fig. 1, although deeper networks do end up with smaller errors. Note that the optimized keypoints are inside the tool body, which would be impossible to label without the simulation software.
IV-C Camera-to-base Pose Estimation from a Single Image
We explored camera-to-base pose estimation from a single RGB image using the optimized keypoints, with the widely used Rethink Robotics Baxter robot as the experimental platform. Without loss of generality, we take the base frame of the left 7 degree-of-freedom (DoF) arm as the reference frame. To localize the 7-DoF arm, we optimized seven keypoints (one per link) and estimated the pose with the Efficient Perspective-n-Point (EPnP) algorithm. The DNN estimates the 2D positions of the optimized keypoints in a given RGB image, the corresponding 3D positions are computed from the forward kinematics of the robot arm, and the EPnP implementation in the OpenCV package (https://docs.opencv.org/3.4/d9/d0c/group__calib3d.html) is then used to estimate the transformation between the camera and the robot.
To find the optimal keypoints, we randomly sampled three candidate keypoints for each link of the left arm and applied Algorithm 1 to find one optimal keypoint per link. ResNet_50, initialized with ImageNet-pretrained weights, was used as the feature extractor. The DNNs were trained for 500,000 iterations with a learning rate of 0.2 for the first 400,000 iterations and 0.02 for the remaining 100,000 iterations. The optimized keypoints for Baxter's left arm are shown in the middle-left of Fig. 1.
IV-C1 Experiments in simulation
Table I: Reprojection error for different keypoint sets.
| Keypoint Set | Left Arm | Right Arm |
| Keypoints at Joints | 4.72 | 35.02 |

Table II: ADD for different keypoint sets.
| Keypoint Set | Left Arm | Right Arm |
| Keypoints at Joints | 55.60 | 262.82 |
Even though the DNN for keypoint detection was trained to estimate the left arm's keypoints, we show that the corresponding keypoints on the right arm can also be detected by flipping the images horizontally. Fig. 4 shows the detections of the optimized keypoints for both arms. We further visualize the camera-to-base pose estimation performance by skeletonizing the arms, where the skeleton is obtained from the forward kinematics based on the estimated base frame. As shown in Fig. 4, the estimated skeletons align closely with the robot arms. For comparison, we experimented with three keypoint sets on the Baxter-sim dataset: a set of randomly selected keypoints with one keypoint per link, the keypoints at the exact locations of the seven joints (as used in prior work, and what one might assume to be the obvious choice), and the set of 7 optimized keypoints. The RE and ADD results for the different keypoint sets are shown in Tables I and II, demonstrating that the optimized keypoints achieve the best performance for both the left and right arm. The 2D PCK results with various thresholds are shown in Fig. 5. While the left arm shows no drastic improvement, detections for the right arm improve significantly with the optimized keypoints, indicating that their robustness provides better detection even on flipped images.
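The horizontal-flip trick amounts to mirroring the image, running the left-arm detector, and mirroring the x coordinates back; a minimal sketch, where the `detect` callable is a stand-in for the DNN:

```python
import numpy as np

def detect_right_arm(detect, image):
    """Detect right-arm keypoints with a left-arm detector by mirroring.

    detect(image) -> (K, 2) array of (x, y) keypoints; valid for the left arm.
    """
    flipped = image[:, ::-1]                      # mirror the image
    pts = detect(flipped).astype(float)
    pts[:, 0] = image.shape[1] - 1 - pts[:, 0]    # mirror x back to the original frame
    return pts

# toy detector: returns the location of the single brightest pixel
def detect(im):
    y, x = np.unravel_index(im.argmax(), im.shape)
    return np.array([[x, y]])

img = np.zeros((4, 5))
img[1, 3] = 1.0                     # "keypoint" at x=3, y=1
pts = detect_right_arm(detect, img)
# pts[0] -> [3.0, 1.0], i.e. the original location is recovered
```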
IV-C2 Real-world Experiments
To transfer the keypoint detection to the real robot, we applied the domain randomization techniques described in Section III-C to bridge the reality gap, and the experimental results show that the DNN generalizes well to real-world images. The original images were scaled by 0.25 before being passed to the DNN. Fig. 6 shows the detections of the optimized keypoints on the Baxter-real dataset, demonstrating that the keypoints can be detected even under self-occlusion. Given the accurate keypoint detections, the estimated skeletons align closely with the robot arm. To examine the detection performance of different keypoints on the real robot, the three keypoint sets from the simulation experiments were used. We also implemented the traditional camera-to-base pose estimation procedure for comparison by placing ArUco markers on the robot arm. The 2D and 3D PCK results for the different methods are shown in Fig. 7. Using the optimized keypoints, around 50 percent of the estimations have less than 25 pixels of error in the image plane (about 1 percent of the image size) and less than 50 mm of error in 3D space, which is much better than the other methods. Also noteworthy is the area under the curve, which gives the mean of the PCKs over the thresholds; the optimized keypoints achieve the highest value in both the 2D and 3D evaluations. Due to self-occlusions and camera poses that limit the visibility of the visual markers, some image frames do not have enough detected ArUco markers for EPnP, so the estimation for those frames is unavailable.
IV-D Surgical Tool Tracking
Surgical tool tracking continuously estimates the 3D pose of the tool end-effector with respect to the camera frame. This is necessary for augmented reality displays and transferring learning-based control policies, among other applications. We employed a previously developed tool tracker that combines a keypoint detector and a particle filter for 3D pose estimation: the keypoint detector detects predefined keypoints on the input images to provide visual feedback, and the particle filter estimates the 3D pose of the tool by combining the forward kinematics of the robot with the keypoint observations. Differing from prior work, we use the optimized keypoints (red points in Fig. 1) instead of the original hand-picked keypoints, and the domain randomization technique is used to bridge the reality gap between synthetic and real-world images. The resulting keypoint detections are shown in Fig. 8. Note that the optimized keypoints are inside the tool body, and the DNN can accurately predict their projections onto the image plane in different tool configurations.
For comparison, we evaluated the tool tracking performance with three different setups described as follows.
Hand-labelled SuPer Deep Keypoints: This setup is identical to the tool tracking approach of SuPer Deep. The seven hand-picked keypoints, which lie on the surface of the tool, were used for tracking, and the DNN was trained on 100 real-world images with hand-labeled ground-truth positions.
Synthetically-labelled SuPer Deep Keypoints: The second setup used the same set of hand-picked keypoints from SuPer Deep, but the DNN was trained on around 20K synthetic images with domain randomization, where keypoint labels are provided even under occlusion.
Optimized Keypoints: The third setup used the seven optimized keypoints, as shown in Fig. 1, with the DNN likewise trained on synthetic images with domain randomization.
Tool tracking performance is computed by rendering a re-projected tool mask on the image frame from the keypoint-based estimation, and the IoU with the ground-truth mask is used for evaluation. Quantitative results are shown in Fig. 9 and qualitative results in Fig. 10. The Hand-labelled SuPer Deep Keypoints setup fails to track the tool when the tool turns, because the hand-picked keypoints on the tool surface become occluded and humans cannot provide labels for non-visible keypoints. Using the optimized keypoints, however, the DNN makes accurate predictions even for non-visible keypoints, as shown in Fig. 8, since the simulator can always query the pose of a point in the virtual camera frame. Although both the Synthetically-labelled SuPer Deep Keypoints and the Optimized Keypoints setups utilize the synthetic dataset, the optimized keypoints achieve higher accuracy because they are chosen to optimize the detection performance of the DNN. Another advantage of the optimized keypoints is reduced detection ambiguity: the detection of some keypoints is challenging due to the tool's symmetry, which causes false detections, and the optimized keypoints, distributed around the central line of the tool, significantly reduce this ambiguity.
V Discussion and Conclusion
We proposed a general keypoint optimization algorithm to maximize the performance of keypoint detection on robotic manipulators. Our algorithm utilizes a DNN for keypoint detection and can even handle self-occlusions by optimizing the keypoint locations and training on synthetically generated data. The results show that the optimized keypoints yield higher detection accuracy than manually or randomly selected keypoints, resulting in better performance for the wide breadth of robotic applications that rely on keypoints for visual feedback. To show this, we presented both quantitative and qualitative results from two robotic applications: camera-to-base pose estimation and surgical tool tracking.
The experimental results, in which keypoints are detected under self-occlusion and their 3D locations lie inside the kinematic links rather than on the surface, further motivate the importance of this work; no manually selected keypoints have produced this type of result before. For future work, we will investigate self-supervised approaches to selecting candidate points such that the resulting optimized keypoints have even higher detectability.
-  (2006) SURF: speeded up robust features. In European Conference on Computer Vision, pp. 404–417. Cited by: §II.
-  (2015) Saliency-based keypoint selection for fast object detection and matching. Pattern Recognition Letters 62, pp. 32–40. Cited by: §II.
-  (2005) Hand to sensor calibration: a geometrical interpretation of the matrix equation AX = XB. Journal of Robotic Systems 22 (9), pp. 497–506. Cited by: §I.
-  (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24 (6), pp. 381–395. Cited by: §III-B.
-  (2014) Automatic generation and detection of highly reliable fiducial markers under occlusion. Pattern Recognition 47 (6), pp. 2280–2292. Cited by: §I.
-  (2015) Deep residual learning for image recognition. Cited by: §III-A.
-  (2018) Unsupervised learning of object landmarks through conditional image generation. In Advances in neural information processing systems, pp. 4016–4027. Cited by: §II.
-  (2019) PyRep: bringing v-rep to deep robot learning. arXiv preprint arXiv:1906.11176. Cited by: §III-C.
-  (2020) imgaug. Note: Online; accessed 01-Feb-2020 Cited by: 7th item.
-  (2014) An open-source research kit for the da Vinci® surgical system. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 6434–6439. Cited by: §I, §IV-A3.
-  (2020) Camera-to-robot pose estimation from a single image. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 9426–9432. Cited by: §I, §IV-A1, §IV-C1.
-  (2009) EPnP: an accurate O(n) solution to the PnP problem. International Journal of Computer Vision 81 (2), pp. 155. Cited by: §IV-C.
-  (2020) SuPer: a surgical perception framework for endoscopic tissue manipulation with surgical robotics. IEEE Robotics and Automation Letters 5 (2), pp. 2294–2301. Cited by: §IV-D.
-  (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: 5th item.
-  (2004) Distinctive image features from scale-invariant keypoints. International journal of computer vision 60 (2), pp. 91–110. Cited by: §II.
-  (2020) SuPer deep: a surgical perception framework for robotic tissue manipulation using deep learning for feature extraction. arXiv preprint arXiv:2003.03472. Cited by: §I, §IV-D.
-  (2018) DeepLabCut: markerless pose estimation of user-defined body parts with deep learning. Nature neuroscience 21 (9), pp. 1281. Cited by: §III-A.
-  (2016) Salient keypoint selection for object representation. In 2016 Twenty Second National Conference on Communication (NCC), pp. 1–6. Cited by: §II.
-  (2013) Grid-based spatial keypoint selection for real time visual odometry.. In ICPRAM, pp. 586–589. Cited by: §II.
-  (2011) AprilTag: a robust and flexible visual fiducial system. In 2011 IEEE International Conference on Robotics and Automation, pp. 3400–3407. Cited by: §I.
-  (2002) Online estimation of image jacobian matrix by kalman-bucy filter for uncalibrated stereo vision feedback. In Proceedings 2002 IEEE International Conference on Robotics and Automation, Vol. 1, pp. 562–567. Cited by: §I.
-  (2009) Recognizing indoor scenes. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 413–420. Cited by: 5th item.
-  (2019) Augmented reality predictive displays to help mitigate the effects of delayed telesurgery. In 2019 International Conference on Robotics and Automation (ICRA), pp. 444–450. Cited by: §IV-D.
-  (2019) Open-sourced reinforcement learning environments for surgical robotics. arXiv preprint arXiv:1903.02090. Cited by: §IV-D.
-  (2013) Baxter. Retrieved Jan 10, 2014. Cited by: §I, §IV-C.
-  (2013) CoppeliaSim (formerly v-rep): a versatile and scalable robot simulation framework. In Proc. of The International Conference on Intelligent Robots and Systems (IROS), Note: www.coppeliarobotics.com Cited by: §III-C.
-  (2018) MobileNetV2: inverted residuals and linear bottlenecks. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Cited by: §IV-B.
-  (2018) Discovery of latent 3d keypoints via end-to-end geometric reasoning. In Advances in neural information processing systems, pp. 2059–2070. Cited by: §II.
-  (2006) Localization of mobile robots with omnidirectional vision using particle filter and iterative sift. Robotics and Autonomous Systems 54 (9), pp. 758–765. Cited by: §II.
-  (2017) Unsupervised learning of object landmarks by factorized spatial embeddings. In Proceedings of the IEEE international conference on computer vision, pp. 5916–5925. Cited by: §II.
-  (2017) Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 23–30. Cited by: §III-C.
-  (2012) Articulated human detection with flexible mixtures of parts. IEEE transactions on pattern analysis and machine intelligence 35 (12), pp. 2878–2890. Cited by: §IV-A1.
-  (2017) Robot autonomy for surgery. In Encyclopedia of Medical Robotics, pp. 281–313. External Links: Cited by: §IV-D.
-  (2018) Unsupervised discovery of object landmarks as structural representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2694–2703. Cited by: §II.