I Introduction
Modern robots rely on extensive systems to accomplish a given task, such as a perception module to monitor the state of the world [33, 2, 45]. A simple perception failure in this context is catastrophic for the robot, since its motion generator relies on it. Moreover, classic motion generators are quite rigid in how they accomplish a task, e.g., the robot has to pick an object in a specific way and might not recover if the grasp fails. These problems make robotic systems brittle and hard to scale to new domains. To expand the reach of robotics, we need more robust, adaptive, and flexible systems.
Learning-based methods, such as Reinforcement Learning (RL), have the capacity to adapt and to deal directly with raw sensory inputs, which are not subject to estimation errors [28, 24]. The strength of RL stems from its capacity to define a task at a higher level through a reward function indicating what to do, not through an explicit set of control actions describing how the task should be performed. RL does not need specific physical modelling, as it implicitly learns a data-driven model by interacting with the environment [1], allowing the method to be deployed in different settings. These characteristics are desirable but come with limitations: 1) randomly interacting with an environment can be quite unsafe for human users as well as for the equipment, and 2) RL is not known for being sample-efficient. As such, introducing RL to a new environment can be time-consuming and difficult.
Classic robotic approaches have mastered generating movements within free space, where there are no contacts with other elements in the environment [30]. We refer to these accessible methods as Model-Based (MB) methods. One of their main limitations is that they normally do not handle perception errors and physical interactions naturally, e.g., grasping an object, placing an object, object insertion, etc. This limits both the expressiveness available to roboticists and the reliability of the system.
In this work we present an algorithmic framework that is aware of the uncertainty in its own perception and actuation systems. Our novel algorithm combines the strengths of MB and RL: an MB method guides the agent to the relevant region, reducing the area where the RL policy needs to be optimized and making it more invariant to the absolute goal location. We leverage the efficiency of MB methods to move in free space, and the capacity of RL to learn from its environment given a loosely defined goal. To efficiently fuse MB and RL, we introduce a perception system that provides uncertainty estimates of the region where contacts might occur. This uncertainty is used to determine the region where the MB method should not be applied, and where an RL policy should be learned instead. We therefore call our algorithm Guided Uncertainty-Aware Policy Optimization (GUAPO).
Figure LABEL:fig:real shows an overview of our system: the task is initialized with the MB method, which guides the robot to within the range of the uncertainties of the object of interest, e.g., the box where the peg has to be inserted. Once we have reached that region, we switch to RL to complete the task. At learning time, we leverage task-completion information from the RL policy to reduce our perception system's uncertainties. This work makes the following contributions:

We present a simple and yet efficient way to express pose uncertainties for a keypoint-based pose estimator;

We show that our approach enables sample-efficient learning on real-world robots.
II Definitions and Formulation
We tackle the problem of learning to perform an operation, unknown a priori, in an area of which we only have an estimated location and no accurate model. We formalize this problem as a Markov Decision Process (MDP), where we want to find a policy π(a|s) that is a probability distribution over actions a ∈ A conditioned on a given state s ∈ S. We optimize this policy to maximize the expected return J(π) = E[Σ_{t=0}^{H} γ^t r(s_t, a_t)], where r is a given reward function, H is the horizon of each rollout, and γ is the discount factor. The first assumption we leverage in this work can be expressed as having partial knowledge of the transition function T(s'|s, a) dictating the probability over next states when executing a certain action in the current state. Specifically, we assume this transition function is available only within a subspace S_free of the state-space S. This is a common case in robotics, where it is perfectly known how the robot moves while it is in free space, but there are no reliable and accurate models of general contacts and interactions with its surroundings [9]. This partial model can be combined with well-established methods able to plan and execute trajectories that traverse S_free [31, 34, 19], but these methods cannot successfully complete tasks that require acting in the uncertain region S_u = S \ S_free. It is usually easy for the user to specify this split relative to the objects in the scene, e.g., all points around or in contact with an object are not free space. If we call a point of interest in that region p, we can express that region relative to it as S_u(p). We do not assume perfect knowledge of the absolute coordinates of that point nor of S_u(p), but rather only a noisy estimate of them, as described in Section III-A.

The tasks we consider consist of reaching a particular state or configuration through interaction with the environment, like peg insertion, switch toggling, or grasping. These tasks are conveniently defined by a binary reward function that indicates having successfully reached a goal set S_g ⊂ S_u, usually described with respect to the point of interest p [32, 7, 8]. Unfortunately this reward is extremely sparse, and random actions can take a prohibitive amount of samples to discover it [4, 6].
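The expected-return objective above can be made concrete with a small numerical sketch. This is an illustration only; the function name and the backward-recursion implementation are ours, not part of the paper.

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted return of a single rollout: sum_t gamma^t * r_t,
    computed with a backward recursion g <- r + gamma * g."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

With a sparse reward of 1 received only at the final step, the return simply decays geometrically with the horizon, which is why discovering the goal late yields little learning signal.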
Therefore, this paper addresses how to leverage the partial model described above to efficiently learn to solve the full task through interaction with the environment, overcoming an imperfect perception system and dynamics.
III Method
In this section, we describe the different components of GUAPO. We define Ŝ_u as a superset of S_u, generated from the perception system's uncertainty estimate. We use this set to partition the space into the region where the MB method is used and the region where the RL policy is trained. Then we describe an MB method that can now confidently be used outside of Ŝ_u to bring the robot within that set. Finally, we define the RL policy and show how learning can be made more efficient by making its inputs local. We also outline our algorithm in Algorithm 1.
III-A From coarse perception to the RL workspace
Coarse perception systems are usually cheaper and faster to set up because they might require simpler hardware, like RGB cameras, and can be used out-of-the-box without excessive tuning and calibration efforts [40]. If we used such a system to directly localize S_u, the perception errors might misleadingly indicate that a certain area belongs to S_free; the MB method would then be applied there, potentially without any means to recover. Instead, we propose to use a perception system that also gives an uncertainty estimate. Many methods can represent the uncertainty by a non-parametric distribution, with possible descriptions of the region Ŝ_u^k and their associated weights w_k. By interpreting these weights as likelihoods, we can express the likelihood of a certain state s belonging to S_u as:
P(s ∈ S_u) = Σ_k w_k · 1[s ∈ Ŝ_u^k]   (1)
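For intuition, the weighted-membership likelihood above can be sketched for the special case where each region hypothesis is a ball around a candidate location. The ball parameterization, function name, and arguments are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def prob_in_uncertain_region(s, centers, radii, weights):
    """Weighted membership likelihood: sum over region hypotheses of
    w_k * 1[s inside the k-th hypothesized region], here modeled as balls."""
    s = np.asarray(s, dtype=float)
    inside = np.array([float(np.linalg.norm(s - np.asarray(c, dtype=float)) <= r)
                       for c, r in zip(centers, radii)])
    return float(np.dot(np.asarray(weights, dtype=float), inside))
```

A state covered by several high-weight hypotheses accumulates probability mass, so thresholding this value yields a conservative superset of the true contact region.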
If the perception system provides a parametric distribution, the above probability can be computed by integration, or approximated in a way such that the set Ŝ_u = {s : P(s ∈ S_u) > ε} is a superset of S_u, for an ε appropriately set by the user. A more accurate perception system would make Ŝ_u a tighter superset of S_u. Now that we have an over-approximation of the area where we cannot use our model-based method, we define a function α(s) indicating when to apply an RL policy π_RL instead of the given MB one π_MB. In short, GUAPO uses the hybrid policy presented in Eq. 2.
π(a|s) = α(s) · π_RL(a|s) + (1 − α(s)) · π_MB(a|s),  with α(s) = 1[s ∈ Ŝ_u]   (2)
Therefore we use a switch between these two policies, based on the uncertainty estimate. A lower perception uncertainty reduces the area where the reinforcement learning method is required, and improves the overall efficiency. We now detail how each of these policies is obtained.
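The switch between the two policies can be sketched in a few lines; this is a minimal illustration with hypothetical callables standing in for the model-based controller, the RL policy, and the uncertainty-region membership test.

```python
def guapo_action(s, pi_rl, pi_mb, in_uncertain_region):
    """Hard policy switch: use the RL policy whenever the state lies inside
    the perception-derived uncertainty superset, the MB policy otherwise."""
    return pi_rl(s) if in_uncertain_region(s) else pi_mb(s)
```

Because the switch depends only on the current state, it also funnels the robot back: any exploratory action that exits the uncertain region immediately returns control to the model-based policy.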
III-B Model-based actuation
In the previous section we defined Ŝ_u, the region containing the goal set S_g and hence the agent's reward. In our problem statement we assume that outside that region the environment model is well known, and therefore it is amenable to a model-based approach. Whenever we are outside of Ŝ_u, the MB approach corrects any deviations.
Our formulation can be extended to obstacle avoidance. Using an approach similar to the one used to overestimate the set S_u, we can overestimate the obstacle set S_o to be avoided by Ŝ_o, and remove that space from the region where the MB method can be applied, S_free \ Ŝ_o. An obstacle-avoiding MB method can then be used to reach the area where the goal is, while avoiding the regions where the obstacle might be, as shown in our videos (https://sites.google.com/view/guaporl).
III-C From Model-Based to Reinforcement Learning
Once π_MB has brought the system within Ŝ_u, control is handed over to π_RL as expressed in Eq. 2. Note that our switching definition goes both ways: if π_RL takes exploratory actions that move the robot outside of Ŝ_u, the MB method acts again to funnel the state back to the area of interest. This also provides a framework for safe learning [5] in case there are obstacles to avoid, as introduced in the section above. There are several advantages to having a more restricted area where the RL policy needs to learn how to act: first, exploration becomes easier; second, the policy can be local. In particular, we only feed π_RL the images from a wrist-mounted camera and the robot's current velocities, as depicted in Fig. 0(b). Not using global information from our perception system in Fig. 0(a) can make our RL policy generalize better across locations of S_u. Finally, we propose to use an off-policy RL algorithm, so all observed transitions can be added to the replay buffer, no matter whether they come from π_MB or π_RL.
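The off-policy ingredient above amounts to a shared replay buffer that stores transitions regardless of which policy produced them. Below is a minimal sketch of such a buffer; the class name and fields are our own, not the paper's code.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s, a, r, s', done) transitions from both the model-based and
    the RL policy; an off-policy learner (e.g., SAC) can sample from all of them."""
    def __init__(self, capacity=100_000):
        self.buf = deque(maxlen=capacity)  # oldest transitions are evicted first

    def add(self, s, a, r, s_next, done):
        self.buf.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement within one batch.
        return random.sample(list(self.buf), batch_size)

    def __len__(self):
        return len(self.buf)
```

Mixing model-based transitions into the buffer gives the critic coverage of the approach phase for free, without the RL policy ever having to explore it.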
III-D Closing the MB-RL loop
This framework also allows using any newly acquired experience to reduce Ŝ_u, such that successive rollouts can use the model-based method in larger areas of the state-space. For example, in the peg-insertion task, once the reward of fully inserting the peg is received, the location of the opening is immediately known. Since we no longer need to rely on the noisy perception system to estimate the location of the hole, we can update Ŝ_u accordingly, so that the reinforcement learning algorithm only needs to perform the actual insertion and no longer has to look for the opening.
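This shrinking of the uncertainty region can be sketched as a simple update rule: before any success, the region comes from perception; after the sparse reward reveals the true goal, it collapses to a small ball around it. The dictionary representation, function name, and default radius are illustrative assumptions.

```python
def uncertainty_region(perception_region, discovered_goal=None, eps=0.01):
    """Region where the RL policy is active: the perception-based estimate
    before any success, and a small ball of radius eps around the goal once
    a successful rollout has revealed its true location."""
    if discovered_goal is None:
        return perception_region
    return {"center": tuple(discovered_goal), "radius": eps}
```

After the update, the model-based policy is trusted over a larger portion of the state-space, so subsequent rollouts spend less time under the exploratory RL policy.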
IV Implementation Details
Here we describe the implementation details of our GUAPO algorithm for a peg-insertion task with a Franka Panda robot (a 7-DoF torque-controlled robot). We first introduce the perception module and how an uncertainty estimate is obtained to localize S_u. Then we describe the model-based policy used to navigate in S_free while avoiding obstacles, the RL algorithm, and the architecture of the RL policy being learned. Finally, we introduce our task setup, the baseline algorithms we compare GUAPO with, and their implementations.
IV-A Perception Module
We use the Deep Object Pose Estimator (DOPE) [40] as the base for our perception system. DOPE uses a simple neural network architecture that can be quickly trained with synthetic data and domain randomization using NDDS [39]. Figure 2 shows generated images with the domain randomization used to train our perception system, allowing domain transfer (from synthetic to real world). Note that the model of the object that DOPE needs to detect is not very detailed, consisting of the approximate shape without texture. This is a challenging case, especially because no depth sensing is used to supplement the RGB information. The DOPE algorithm first finds the object's cuboid keypoints using local peaks on the belief map. Using the cuboid's real dimensions, the camera intrinsics, and the keypoint locations, DOPE runs a PnP algorithm [22] to find the final object pose in the camera frame.

For this work we extended the DOPE perception system to obtain uncertainty estimates of the object pose. This extension augments the peak-estimation algorithm by fitting a 2D Gaussian around each found peak, as depicted by the dark contour maps in Fig. 0(a). We then run the PnP algorithm on multiple sets of keypoints, where each set is constructed by sampling from the 2D Gaussians. This provides possible poses of the object consistent with the detection algorithm, drawn as green bounding boxes in Fig. 0(a). In this work we treat them as equally likely.
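The keypoint-sampling step described above can be sketched as follows. We only show the sampling that produces the keypoint sets; running PnP on each set (e.g., with OpenCV's solvePnP) to obtain the pose hypotheses is omitted. The function name and the number of samples are illustrative.

```python
import numpy as np

def sample_keypoint_sets(peak_means, peak_covs, n_samples=50, rng=None):
    """Draw n_samples sets of 2D keypoints, one point per detected peak,
    from the 2D Gaussians fitted around the peaks. Running PnP on each set
    (not shown) yields n_samples pose hypotheses, treated as equally likely."""
    rng = np.random.default_rng(rng)
    sets = []
    for _ in range(n_samples):
        kps = [rng.multivariate_normal(m, c) for m, c in zip(peak_means, peak_covs)]
        sets.append(np.stack(kps))
    return np.stack(sets)  # shape: (n_samples, n_keypoints, 2)
```

Wider fitted Gaussians (more ambiguous detections) spread the sampled keypoints further apart, which in turn spreads the resulting pose hypotheses and inflates the downstream uncertainty region.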
From our problem formulation, we assume access to a rough description of the area of interest, S_u(p), around the object where an operation needs to be performed. In our peg-insertion task, this is a rectangle centered at the opening of the hole. For each of the pose samples given by our extended DOPE perception algorithm, we compute the associated hole-opening position, represented by the green dots in Fig. 0(a). These points are then fitted by a 3D Gaussian with diagonal covariance, represented in blue in the same figure. We use the mean as the center of Ŝ_u, and we over-approximate Eq. 1 by displacing the boundary of Ŝ_u along each axis by one standard deviation.
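The Gaussian fit and over-approximation just described can be sketched as below; the function name and the axis-aligned-box return format are our own simplifications.

```python
import numpy as np

def region_from_goal_samples(goal_samples, k=1.0):
    """Fit a diagonal-covariance Gaussian to sampled goal positions and
    over-approximate the uncertainty region as an axis-aligned box spanning
    mean +/- k standard deviations along each axis."""
    g = np.asarray(goal_samples, dtype=float)
    mean, std = g.mean(axis=0), g.std(axis=0)
    return mean, mean - k * std, mean + k * std
```

The mean serves as the target for the model-based controller, while the box bounds decide where control switches to the RL policy.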
The perception module setup is depicted in Fig. LABEL:fig:real in orange, where the camera for DOPE (640x480x3 RGB images from Logitech Carl Zeiss Tessar) is mounted overlooking our workspace. The top center image with the orange border is a sample from that camera.
IV-B Model-Based Controller Design
As the model-based controller, we use target attractors defined by Riemannian Motion Policies (RMPs) [30] to move the robot towards a desired end-effector location. The RMPs take in a desired end-effector position in Cartesian space. The target is set to the centroid of Ŝ_u, which in our case corresponds to the opening of the hole. As explained in the previous section, a coarse model of the object is required to train a perception module able to provide this location estimate and its uncertainty. The RMPs also require a model of the robot. These two requirements are what give this part of the method its "model-based" character. Thanks to the RL component described in the following section, our GUAPO algorithm does not need these models to be extremely accurate. In case obstacles need to be avoided to reach Ŝ_u, we can define barrier-type RMPs.
The policies send end-effector position commands at 20 Hz, and the RMPs compute desired joint positions at 1000 Hz. Given that impedance end-effector control is an action space that has been shown to improve sample efficiency for RL policy learning [26], we also use the RMP interface as our reinforcement-learning action space.
IV-C Reinforcement Learning Algorithm and Architecture
We use a state-of-the-art model-free off-policy RL algorithm, Soft Actor-Critic (SAC) [11]. The RL policy acts directly on raw sensory inputs, consisting of joint velocities and images from a wrist-mounted camera (64x64x3 RGB images from a Logitech Carl Zeiss Tessar) on the robot (see Fig. LABEL:fig:real). As illustrated in Fig. 0(b), all inputs are fed into a VAE [12]. The VAE gives us a low-dimensional latent-space representation of the state, which has been shown to improve the sample efficiency of RL algorithms [23]. The parameters of this VAE are trained beforehand on a dataset collected offline. The only part learned by the RL algorithm is a 2-layer MLP that takes as input the 64-dimensional latent representation given by the VAE and produces a 3D position displacement of the robot end-effector.
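The learned policy head described above can be sketched with plain numpy. The weights, the tanh squashing, and the displacement bound are illustrative assumptions; the paper does not specify these details.

```python
import numpy as np

def policy_head(z, W1, b1, W2, b2, max_disp=0.01):
    """2-layer MLP head mapping a 64-d VAE latent vector to a bounded 3-D
    end-effector position displacement (architecture details illustrative)."""
    h = np.tanh(z @ W1 + b1)              # hidden layer
    return max_disp * np.tanh(h @ W2 + b2)  # bounded 3-D displacement
```

Bounding the displacement keeps each RL action small relative to the uncertainty region, so the hard policy switch can reliably catch excursions outside it.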
IV-D Training Details
The VAE is pretrained with 160,000 datapoints for 12 epochs on a Titan XP GPU. DOPE is trained for 8 hours on 4 P100 GPUs. All our learning-based policy methods (GUAPO, the SAC baseline, and the Residual Policy baseline described in Sec. V) were trained for 60 training iterations. In total, each policy was trained with 120 training episodes, as each iteration comprises two training episodes of 1000 steps each. This takes 90 min to train.
IV-E Rewards
For GUAPO, we use a sparse reward tied to task completion (inserting the peg): the policy receives -1 everywhere and 0 when it finishes the task. For our other learning-based baselines (SAC [11] and the Residual policy [35, 16]), we use the negative L2 distance to the perception estimate of the goal location, 0 when the robot reaches Ŝ_u, and 1 when it finishes the task.
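The two reward schemes just described can be written out directly; the function names and the boolean arguments are our own phrasing of the paper's description.

```python
import numpy as np

def guapo_reward(inserted):
    """Sparse reward used by GUAPO: -1 on every step, 0 on task completion."""
    return 0.0 if inserted else -1.0

def shaped_baseline_reward(ee_pos, goal_estimate, in_region, inserted):
    """Shaped reward used for the SAC and Residual baselines: negative L2
    distance to the estimated goal, 0 once the region is reached, 1 on insertion."""
    if inserted:
        return 1.0
    if in_region:
        return 0.0
    return -float(np.linalg.norm(np.asarray(ee_pos) - np.asarray(goal_estimate)))
```

Note that the shaped reward leans on the (noisy) perception estimate, whereas the GUAPO reward only requires detecting task completion.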




TABLE I: Results over 30 trials.

              MB-Perfect  MB-DOPE  MB-Rand-Perfect  MB-Rand-DOPE  SAC [11]  RESIDUAL [16]  GUAPO (ours)
Success Rate  100%        0%       86.67%           26.6%         0%        0%             93%
Avg. steps    158.3       n/a      554.1            925.4         n/a       n/a            469.6
In S_u        100%        0%       100%             70.0%         0%        0%             100%
In Ŝ_u        100%        100%     100%             93.3%         0%        100%           100%
V Experimental Design and Results
In this section we seek to answer the following questions: How does our method compare, in terms of sample efficiency and task completion, to baseline policies such as Residual policies? And is the proposed algorithm capable of performing peg insertion on a real robot?
V-A Comparison Methods
All the different baselines were initialized about 75 cm away from the goal, and all were implemented on our real robotic system. We compare our proposed method to the following:

MB-Perfect. This method consists of a scripted policy under perfect state estimation.

MB-Rand-Perfect. This method uses the same policy as MB-Perfect, with injected random actions sampled from a normal distribution with zero mean and a standard deviation defined by the perception uncertainty from DOPE (around 2.5 to 3 cm).

MB-DOPE. This method is similar to MB-Perfect, but instead uses the pose estimator's prediction to servo to the hole and attempt insertion.

MB-Rand-DOPE. This method uses the same policy as MB-DOPE, with injected random actions sampled in the same way as for MB-Rand-Perfect.

SAC. This uses only the policy learned by the RL algorithm, Soft Actor-Critic (SAC), to accomplish the task.
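The random-action injection used by the MB-Rand baselines can be sketched as a simple additive perturbation. The function name and default standard deviation are illustrative, chosen to match the stated perception uncertainty of roughly 2.5 to 3 cm.

```python
import numpy as np

def inject_random_action(a, std=0.03, rng=None):
    """MB-Rand baselines: perturb the model-based action with zero-mean
    Gaussian noise whose std matches the perception uncertainty (~2.5-3 cm)."""
    rng = np.random.default_rng(rng)
    return np.asarray(a, dtype=float) + rng.normal(0.0, std, size=np.shape(a))
```

Injecting noise on this scale lets a controller that is servoing to a slightly wrong goal occasionally stumble into the true opening, which is why MB-Rand-DOPE outperforms MB-DOPE.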
V-B Results
The results comparing the different methods are shown in Table I. The table presents the success rate for insertion, the average number of steps needed for completion (a step corresponds to 50 milliseconds of following the same robot command, as our policy runs at 20 Hz), and the percentage of trials, out of 30, in which the end-effector ends up in the S_u and Ŝ_u regions. We also present per-training-iteration performance (task success and steps to completion) for the different methods in Figure 3.
MB-Perfect is able to insert 100% of the time, as it has perfect knowledge of the state, and can be seen as an oracle. Taking random actions with MB-Rand-Perfect does not excessively degrade the performance achieved by MB-Perfect. However, when we used DOPE as the perception system, which has around 2.5 to 3.5 cm of noise and error, the performance of MB-DOPE and MB-Rand-DOPE drops drastically. MB-Rand-DOPE performs 26.6% better than MB-DOPE, as the random actions can help offset the perception error.
In our setup, SAC did not achieve any insertion. This is due to the low number of samples SAC was trained on: most success stories of RL in the real world require several orders of magnitude more data [25]. The Residual method also did not achieve any insertions; it would often apply large actions far away from the hole opening, slide off the box, and get stuck pushing against the side of the box. In comparison, GUAPO only turns on the reinforcement learning policy once the robot is already near the region of interest, and hence does not suffer from this. However, Residual was able to reach the goal region 100% of the time after 120 training episodes, while SAC never did.
In comparison, as seen in Fig. 3, after around 8 training iterations GUAPO is also able to start inserting the peg into the hole (about 12 minutes of real-world training time). As the policy trains, the average number of steps it takes to insert the peg also decreases. After 120 training episodes (90 minutes of training), GUAPO achieves a 93% insertion rate.
VI Related Work
In robotic manipulation there are two dominant paradigms for performing a task: leveraging a model of the environment (model-based methods) or leveraging data to learn (learning-based methods). The first category relies on a precise description of the task, such as object CAD models, as well as powerful and sophisticated perception systems [33, 43]. With an accurate model, a well-engineered solution can be designed for the particular task [41, 18], or the model can be combined with a search algorithm such as motion planning [38]. This type of model-based approach is limited by the ingenuity of the roboticist, and can lead to irrecoverable failure if the perception system has unmodeled noise and error.
On the other hand, learning-based approaches in manipulation [24, 10] do not require such a detailed description, but rather require access to interaction with the environment, as well as a reward that indicates success. Such binary rewards are easy to describe, but unfortunately they render Reinforcement Learning methods extremely sample-inefficient. Hence many prior works use shaped rewards [20], which require considerable tuning. Other works use low-dimensional state spaces [47] instead of image inputs, which requires either precise perception systems or specially designed hardware with sensors. Some proposed methods manage to deal directly with sparse rewards, like automatic curriculum generation [7, 8] or the use of demonstrations [42, 3, 27], but these approaches still require large amounts of interaction with the environment. Furthermore, if the position of the objects in the scene changes or there are new distractors in the scene, these methods need to be fully retrained. In contrast, our method is extremely sample-efficient with a sparse success reward, and is robust to these variations thanks to the model-based component.
Recent works can also be understood as combining model-based and learning-based approaches. One such method [17] uses a reinforcement learning algorithm to find the best parameters describing the behavior of the agent based on a model-based template. The learning is very efficient, but at the cost of an extremely engineered solution template that also relies on an accurate perception system. Another line of work that combines model-based and learning-based methods is Residual Learning [16, 35], where RL is used to learn an additive policy that can potentially fully overwrite the original model-based policy and does not require any further structure. Nevertheless, these methods are hard to tune, and hardly preserve any of the benefits of the underlying model-based method once trained.
The problem of pose estimation for known objects is a vibrant subject within the robotics and computer vision communities [40, 13, 14, 46, 44, 15, 29, 36, 37]. Regressing to keypoints on the object, or on a cuboid encompassing the object, seems to have become the de facto approach to the problem. Keypoints are first detected by a neural network, then PnP [21] is used to predict the pose of the object. Peng et al. [29] also explored using uncertainty, by leveraging a RANSAC voting algorithm to find regions where a keypoint could be detected. This approach differs from ours in that they do not directly regress to a keypoint probability map; they regress to a vector voting map, where line intersections are then used to find keypoints. Moreover, their method does not carry pose uncertainty into the final prediction.
VII Conclusions
We introduce a novel algorithm, Guided Uncertainty-Aware Policy Optimization (GUAPO), that combines the generalization capabilities of model-based methods with the adaptability of learning-based methods. It allows the task to be loosely defined, by providing only a coarse model of the objects and a rough description of the area where some operation needs to be performed. The model-based system leverages this high-level information and accessible state-estimation systems to create a funnel around the area of interest. We use the uncertainty estimate provided by the perception system to automatically switch between the model-based policy and a learning-based policy that can learn from an easy-to-define sparse reward, overcoming the model and estimation errors of the model-based part. We demonstrate real-world learning of a peg-insertion task.
Acknowledgment
Carlos Florensa and Michelle Lee are grateful to the robotics team at NVIDIA for providing a great learning environment and constant support. Special thanks to Ankur Handa for helping with the compute infrastructure.
References
[1] Closing the sim-to-real loop: adapting simulation randomization with real world experience. arXiv preprint arXiv:1810.05687, 2018.
[2] RMPflow: a computational graph for automatic motion policy generation. arXiv preprint arXiv:1811.07049, 2019.
[3] Goal-conditioned imitation learning. Workshop on Self-Supervised Learning at ICML, 2019.
[4] Benchmarking deep reinforcement learning for continuous control. International Conference on Machine Learning, 2016.
[5] A general safety framework for learning-based control in uncertain robotic systems. arXiv preprint arXiv:1705.01292, 2017.
[6] Stochastic neural networks for hierarchical reinforcement learning. International Conference on Learning Representations, pp. 1–17, 2017.
[7] Automatic goal generation for reinforcement learning agents. International Conference on Machine Learning, 2018.
[8] Reverse curriculum generation for reinforcement learning. Conference on Robot Learning, pp. 1–16, 2017.
[9] Experimental studies of contact space model for multi-surface collisions in articulated rigid-body systems. International Symposium on Experimental Robotics, 2018.
[10] Composable deep reinforcement learning for robotic manipulation. IEEE International Conference on Robotics and Automation, pp. 6244–6251, 2018.
[11] Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. International Conference on Machine Learning, 2018.
[12] β-VAE: learning basic visual concepts with a constrained variational framework. International Conference on Learning Representations, pp. 1–22, 2017.
[13] Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes. ACCV, 2012.
[14] T-LESS: an RGB-D dataset for 6D pose estimation of texture-less objects. WACV, 2017.
[15] Segmentation-driven 6D object pose estimation. IEEE Conference on Computer Vision and Pattern Recognition, pp. 3385–3394, 2019.
[16] Residual reinforcement learning for robot control. arXiv preprint arXiv:1812.03201, pp. 1–9, 2019.
[17] A framework for robot manipulation: skill formalism, meta learning and adaptive control. Technical report, Technische Universität München, 2019.
[18] Shallow-depth insertion: peg in shallow hole through robotic in-hand manipulation. IEEE Robotics and Automation Letters 4(2), pp. 383–390, 2019.
[19] Planning algorithms. Cambridge University Press, Cambridge, U.K., 2006. Available at http://planning.cs.uiuc.edu/.
[20] Making sense of vision and touch: self-supervised learning of multimodal representations for contact-rich tasks. International Conference on Robotics and Automation, 2018.
[21] EPnP: an accurate O(n) solution to the PnP problem. International Journal of Computer Vision 81(2), 2009.
[22] EPnP: an accurate O(n) solution to the PnP problem. International Journal of Computer Vision 81, 2009.
[23] State representation learning for control: an overview. arXiv preprint arXiv:1802.04181, 2018.
[24] End-to-end training of deep visuomotor policies. Journal of Machine Learning Research 17, pp. 1–40, 2016.
[25] Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research 37(4–5), pp. 421–436, 2018.
[26] Variable impedance control in end-effector space: an action space for reinforcement learning in contact-rich tasks. arXiv preprint arXiv:1906.08880, 2019.
[27] Overcoming exploration in reinforcement learning with demonstrations. International Conference on Robotics and Automation, 2018.
[28] Visual reinforcement learning with imagined goals. Advances in Neural Information Processing Systems, 2018.
[29] PVNet: pixel-wise voting network for 6DoF pose estimation. CVPR, 2019.
[30] Riemannian motion policies. arXiv preprint arXiv:1801.02854, 2018.
[31] Understanding the geometry of workspace obstacles in motion optimization. International Conference on Robotics and Automation, 2015.
[32] Universal value function approximators. International Conference on Machine Learning, 2015.
[33] DART: dense articulated real-time tracking. Technical report, University of Washington, 2015.
[34] Finding locally optimal, collision-free trajectories with sequential convex optimization. RSS, 2013.
[35] Residual policy learning. arXiv preprint arXiv:1812.06298, 2018.
[36] Implicit 3D orientation learning for 6D object detection from RGB images. ECCV, 2018.
[37] Real-time seamless single shot 6D object pose prediction. CVPR, 2018.
[38] Learning robotic assembly from CAD. International Conference on Robotics and Automation, 2018.
[39] NDDS: NVIDIA deep learning dataset synthesizer. https://github.com/NVIDIA/Dataset_Synthesizer, 2018.
[40] Deep object pose estimation for semantic robotic grasping of household objects. arXiv preprint arXiv:1809.10790, 2018.
[41] Comparative peg-in-hole testing of a force-based manipulation controlled robotic hand. IEEE Transactions on Robotics 34(2), pp. 542–549, 2018.
[42] Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, pp. 1–11, 2017.
[43] Probabilistic object tracking using a range camera. IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3195–3202, 2013.
[44] PoseCNN: a convolutional neural network for 6D object pose estimation in cluttered scenes. RSS, 2018.
[45] PoseCNN: a convolutional neural network for 6D object pose estimation in cluttered scenes. arXiv preprint arXiv:1711.00199, 2017.
[46] DPOD: dense 6D pose object detector in RGB images. arXiv preprint arXiv:1902.11020, 2019.
[47] Dexterous manipulation with deep reinforcement learning: efficient, general, and low-cost. International Conference on Robotics and Automation (ICRA), pp. 3651–3657, 2019.