Guided Uncertainty-Aware Policy Optimization: Combining Learning and Model-Based Strategies for Sample-Efficient Policy Learning

by   Michelle A. Lee, et al.

Traditional robotic approaches rely on an accurate model of the environment, a detailed description of how to perform the task, and a robust perception system to keep track of the current state. Reinforcement learning approaches, on the other hand, can operate directly from raw sensory inputs with only a reward signal to describe the task, but are extremely sample-inefficient and brittle. In this work, we combine the strengths of model-based methods with the flexibility of learning-based methods to obtain a general method that is able to overcome inaccuracies in the robotics perception/actuation pipeline while requiring minimal interactions with the environment. This is achieved by leveraging uncertainty estimates to divide the space into regions where the given model-based policy is reliable and regions where it may have flaws or not be well defined. In these uncertain regions, we show that a locally learned policy can be used directly with raw sensory inputs. We test our algorithm, Guided Uncertainty-Aware Policy Optimization (GUAPO), on a real-world robot performing peg insertion. Videos are available at: <>




I Introduction

Modern robots rely on extensive systems to accomplish a given task, such as a perception module to monitor the state of the world [33, 2, 45]. A simple perception failure in this context is catastrophic for the robot, since its motion generator relies on it. Moreover, classic motion generators are quite rigid in how they accomplish a task: e.g., the robot has to pick an object in a specific way and might not recover if the grasp fails. These problems make robotic systems unstable and hard to scale to new domains. In order to expand the reach of robotics, we need more robust, adaptive, and flexible systems.

Learning-based methods, such as Reinforcement Learning (RL), have the capacity to adapt and to deal directly with raw sensory inputs, which are not subject to estimation errors [28, 24]. The strength of RL stems from its capacity to define a task at a higher level through a reward function indicating what to do, rather than through an explicit set of control actions describing how the task should be performed. RL does not need specific physical modelling, as it implicitly learns a data-driven model from interacting with the environment [1], allowing the method to be deployed in different settings. These characteristics are desirable but come with limitations: 1) randomly interacting with an environment can be quite unsafe for human users as well as for the equipment, and 2) RL is not known for being sample efficient. As such, introducing RL to a new environment can be time-consuming and difficult.

Classic robotic approaches have mastered generating movements within free space, where there are no contacts with other elements of the environment [30]. We refer to these accessible methods as Model-Based (MB) methods. One of their main limitations is that they normally do not handle perception errors and physical interactions naturally, e.g., grasping an object, placing an object, or object insertion. This limits both the expressiveness available to roboticists and the reliability of the system.

In this work we present an algorithmic framework that is aware of its own uncertainty in the perception and actuation systems. The MB method guides the agent to the relevant region, reducing the area where the RL policy needs to be optimized and making it more invariant to the absolute goal location. Our novel algorithm combines the strengths of MB and RL: we leverage the efficiency of MB to move in free space, and the capacity of RL to learn from its environment given a loosely defined goal. To efficiently fuse MB and RL, we introduce a perception system that provides uncertainty estimates of the region where contacts might occur. This uncertainty is used to determine the region where the MB method should not be applied and an RL policy should be learned instead. We therefore call our algorithm Guided Uncertainty-Aware Policy Optimization (GUAPO).

Figure LABEL:fig:real shows an overview of our system. The task is initialized with the MB method, which guides the robot to within the range of the uncertainties of the object of interest, e.g., the box where the peg should be inserted. Once we have reached that region, we switch to RL to complete the task. At learning time, we leverage information from task completion by the RL policy to reduce our perception system's uncertainties. This work makes the following contributions:

  • We demonstrate that GUAPO outperforms pure RL, pure MB, as well as a Residual policy baseline [35, 16] that combines MB and RL for peg insertion;

  • We present a simple and yet efficient way to express pose uncertainties for a keypoint based pose estimator;

  • We show that our approach is sample efficient for learning methods on real-world robots.

II Definitions and Formulation

We tackle the problem of learning to perform an operation, unknown a priori, in an area of which we only have an estimated location and no accurate model. We formalize this problem as a Markov Decision Process (MDP), where we want to find a policy π(a|s) that is a probability distribution over actions a ∈ A conditioned on a given state s ∈ S. We optimize this policy to maximize the expected return E[Σ_{t=0}^{H} γ^t r(s_t, a_t)], where r is a given reward function, H is the horizon of each rollout, and γ is the discount factor. The first assumption we leverage in this work can be expressed as having partial knowledge of the transition function P(s'|s, a) dictating the probability over next states when executing a certain action in the current state. Specifically, we assume this transition function is available only within a sub-space of the state-space, S_free ⊂ S. This is a common case in robotics, where it is perfectly known how the robot moves while it is in free space, but there are no reliable and accurate models of general contacts and interactions with its surroundings [9]. This partial model can be combined with well-established methods able to plan and execute trajectories that traverse S_free [31, 34, 19], but these methods cannot successfully complete tasks that require acting in the complement S_unc = S \ S_free. It is usually easy for the user to specify this split relative to the objects in the scene, e.g., all points around or in contact with an object are not free space. If we call x_g a point of interest in that region, we can express that region relative to it as S_unc(x_g). We do not assume perfect knowledge of the absolute coordinates of that point nor of S_unc, but rather only a noisy estimate of them, as described in Section III-A.
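To make the objective concrete, the discounted return of a single rollout can be computed as below (a minimal sketch; the rollout values are placeholders, not the paper's task, and the sparse-reward convention of -1 per step and 0 on success follows Section IV-E):

```python
def discounted_return(rewards, gamma=0.99):
    """Return of one rollout: sum over t of gamma^t * r_t, t = 0..H."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Sparse task reward: -1 every step, 0 on success at the final step.
rollout = [-1.0] * 9 + [0.0]
J = discounted_return(rollout, gamma=0.99)
```

A policy that finishes the task sooner accumulates fewer -1 terms and thus a higher return, which is exactly what the optimization exploits.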

The tasks we consider consist of reaching a particular state or configuration through interaction with the environment, like peg insertion, switch toggling, or grasping. These tasks are conveniently defined by a binary reward function that indicates having successfully reached a goal set S_goal, usually described with respect to the point of interest x_g [32, 7, 8]. Unfortunately this reward is extremely sparse, and random actions can take a prohibitive number of samples to discover it [4, 6]. Therefore this paper addresses how to leverage the partial model described above to efficiently learn to solve the full task through interactions with the environment, overcoming an imperfect perception system and dynamics.

III Method

Fig. 1: Perception modules for the model-based component (left: (a) DOPE perception and uncertainty estimation) and the reinforcement learning component (right: (b) variational autoencoder).

In this section, we describe the different components of GUAPO. We define Ŝ_unc as a super-set of S_unc generated from the perception system's uncertainty estimates. We use this set to partition the space into the regions where the MB method is used and the regions where the RL policy is trained. Then we describe an MB method that can now confidently be used outside of Ŝ_unc to bring the robot within that set. Finally we define the RL policy, and show how learning can be made more efficient by making its inputs local. We also outline our algorithm in Algorithm 1.

III-A From coarse perception to the RL workspace

Coarse perception systems are usually cheaper and faster to set up because they may require simpler hardware like RGB cameras, and can be used out of the box without excessive tuning and calibration effort [40]. If we use such a system to directly localize S_unc, perception errors might misleadingly indicate that a certain area belongs to S_free; the MB method would then be applied there, with potentially no way to learn how to recover. Instead, we propose to use a perception system that also gives an uncertainty estimate. Many methods can represent the uncertainty by a nonparametric distribution, with K possible descriptions of the region, Ŝ_unc^(i), and their associated weights w_i. By interpreting these weights as likelihoods, we can express the likelihood of a certain state s belonging to S_unc as:

P(s ∈ S_unc) = Σ_{i=1}^{K} w_i · 1[s ∈ Ŝ_unc^(i)]    (1)

If the perception system provides a parametric distribution, the above probability can be computed by integration, or approximated in a way such that the set Ŝ_unc = {s : P(s ∈ S_unc) > ε} is a super-set of S_unc, for an ε appropriately set by the user. A more accurate perception system would make Ŝ_unc a tighter super-set of S_unc. Now that we have an over-approximation of the area where we cannot use our model-based method, we define a function α(s) = 1[s ∈ Ŝ_unc] indicating when to apply an RL policy π_RL instead of the given MB one π_MB. In short, GUAPO uses the hybrid policy presented in Eq. 2:

π(a|s) = α(s) · π_RL(a|s) + (1 − α(s)) · π_MB(a|s)    (2)
Therefore we use a switch between these two policies, based on the uncertainty estimate. A lower perception uncertainty reduces the area where the reinforcement learning method is required, and improves the overall efficiency. We now detail how each of these policies is obtained.
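The switch above can be sketched as follows (a minimal illustration, not the paper's implementation: region hypotheses are assumed to be weighted axis-aligned boxes, and the membership threshold is an assumption):

```python
import numpy as np

def prob_in_uncertain_region(state, regions, weights):
    """Eq. 1-style estimate: sum of the weights of the region hypotheses
    (axis-aligned boxes (lo, hi) here) that contain the state."""
    p = 0.0
    for (lo, hi), w in zip(regions, weights):
        if np.all(state >= lo) and np.all(state <= hi):
            p += w
    return p

def hybrid_action(state, pi_mb, pi_rl, regions, weights, threshold=0.0):
    """GUAPO-style switch: the RL policy acts inside the uncertain
    super-set, the model-based policy everywhere else."""
    in_region = prob_in_uncertain_region(state, regions, weights) > threshold
    return pi_rl(state) if in_region else pi_mb(state)
```

A lower perception uncertainty shrinks the boxes, so the model-based branch is taken over more of the state space.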

III-B Model-based actuation

In the previous section we defined Ŝ_unc, the region containing the goal set S_goal and hence the agent's reward. In our problem statement we assume that outside that region the environment model is well known, and it is therefore amenable to a model-based approach. Whenever we are outside of Ŝ_unc, the MB approach corrects any deviations.

Our formulation can be extended to obstacle avoidance. Using an approach similar to the one used to over-estimate the set S_unc, we can over-estimate the obstacle set to be avoided by Ŝ_obs, and remove that space from the region where the MB method can be applied. An obstacle-avoiding MB method can then be used to reach the area where the goal is while avoiding the regions where the obstacle might be, as shown in our videos.

III-C From Model-Based to Reinforcement Learning

Once π_MB has brought the system within Ŝ_unc, control is handed over to π_RL as expressed in Eq. 2. Note that our switching definition goes both ways: if π_RL takes exploratory actions that move it outside of Ŝ_unc, the MB method will act again to funnel the state back to the area of interest. This also provides a framework for safe learning [5] in case there are obstacles to avoid, as introduced in the section above. There are several advantages to having a more restricted area where the RL policy needs to learn how to act: first, exploration becomes easier; second, the policy can be local. In particular, we only feed π_RL the images from a wrist-mounted camera and its current velocities, as depicted in Fig. 1(b). Not using global information from our perception system in Fig. 1(a) can make our RL policy generalize better across goal locations. Finally, we propose to use an off-policy RL algorithm, so all observed transitions can be added to the replay buffer, no matter whether they come from π_MB or π_RL.

III-D Closing the MB-RL loop

This framework also allows any newly acquired experience to be used to reduce Ŝ_unc, so that successive rollouts can use the model-based method in larger areas of the state space. For example, in the peg-insertion task, once the reward for fully inserting the peg is received, the location of the opening is immediately known. Since we no longer need to rely on the noisy perception system to estimate the location of the hole, we can collapse Ŝ_unc to a small region around the known opening, and the reinforcement learning algorithm only needs to do the actual insertion rather than keep looking for the opening.
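The update above can be sketched as follows (a hedged illustration: the region is represented as an axis-aligned box and the half-width `eps` is an assumed value, not from the paper):

```python
import numpy as np

def collapse_uncertain_region(goal_position, eps=0.005):
    """After the first successful insertion the goal location is known,
    so replace the perception-based super-set with a tight box of
    half-width `eps` around it. Returns (lower, upper) box corners."""
    goal = np.asarray(goal_position, dtype=float)
    return goal - eps, goal + eps

lo, hi = collapse_uncertain_region([0.4, 0.1, 0.2])
```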

Input: s_0: reset state; o_g: global observation (RGB workspace camera); o_l: local observations (wrist-mounted camera, robot velocities); Ŝ_unc: model-uncertain region containing the goal region S_goal, both unknown a priori until the reward is obtained.
Output: a: robot actions
goal_localized ← False
for each episode do
       Reset robot to s_0
       if not goal_localized then
              Ŝ_unc ← DOPE() ;  // Run perception
       end if
       for t = 0, …, H do
              if robot not in Ŝ_unc then
                     a ← π_MB(s) ;  // Takes robot to Ŝ_unc
              else
                     a ← π_RL(o_l) ;  // Local RL policy acts within Ŝ_unc
              end if
              Apply action a to environment ; Add transition to replay buffer ;
              if r = 1 then
                     Ŝ_unc ← region around the known goal ;  // No DOPE uncert.
                     goal_localized ← True ;
              end if
       end for
end for
Algorithm 1 GUAPO
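Algorithm 1 can be sketched in Python as below. This is a schematic under stated assumptions: `dope`, `pi_mb`, `pi_rl`, `env`, and the `region` object with `contains`/`collapse_to` methods are placeholders for the perception module, the two policies, the environment, and the membership test, not the paper's actual interfaces.

```python
def guapo_episode(env, dope, pi_mb, pi_rl, replay_buffer, state, horizon=1000):
    """One GUAPO episode: the MB policy funnels the robot into the
    uncertain region; the RL policy acts (and learns) inside it."""
    region = dope()  # uncertain-region estimate from perception
    goal_localized = False
    for _ in range(horizon):
        if not region.contains(state):
            action = pi_mb(state)            # model-based funnel
        else:
            action = pi_rl(env.local_obs())  # local RL policy
        next_state, reward, done = env.step(action)
        # Off-policy RL: keep transitions from both policies.
        replay_buffer.append((state, action, reward, next_state))
        if reward == 1:
            region = region.collapse_to(env.goal())  # no more DOPE uncertainty
            goal_localized = True
        state = next_state
        if done:
            break
    return goal_localized
```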

IV Implementation Details

Here we describe the implementation details of our GUAPO algorithm for a peg-insertion task with a Franka Panda robot (a 7-DoF torque-controlled robot). We first introduce the perception module and how an uncertainty estimate is obtained to localize Ŝ_unc. Then we describe the model-based policy used to navigate in free space while avoiding obstacles, followed by the RL algorithm and the architecture of the RL policy being learned. Finally, we introduce our task setup, the baseline algorithms we compare GUAPO with, and their implementations.

Fig. 2: Representative synthetic training images for our hole box.

IV-A Perception Module

We use the Deep Object Pose Estimator (DOPE) [40] as the base for our perception system. DOPE uses a simple neural network architecture that can be quickly trained with synthetic data and domain randomization using NDDS [39]. Figure 2 shows generated images with domain randomization used to train our perception system, allowing domain transfer (from synthetic to real world). Note that the model of the object that DOPE needs to detect is not very detailed, consisting of the approximate shape without texture. This is a challenging case, especially because no depth sensing is used to supplement the RGB information. The DOPE algorithm first finds the object's cuboid keypoints using local peaks on the belief maps. Using the cuboid's real dimensions, the camera intrinsics, and the keypoint locations, DOPE runs a PnP algorithm [22] to find the final object pose in the camera frame.

For this work we extended the DOPE perception system to obtain uncertainty estimates of the object pose. This extension augments the peak-estimation algorithm by fitting a 2D Gaussian around each found peak, as depicted by the dark contour maps in Fig. 1(a). We then run the PnP algorithm on multiple sets of keypoints, where each set is constructed by sampling from the 2D Gaussians. This provides possible poses of the object consistent with the detection algorithm, drawn as green bounding boxes in Fig. 1(a). In this work we treat them as equally likely.
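The keypoint-sampling step can be sketched as below (numpy only, a hedged illustration: the per-keypoint means and standard deviations are assumed to come from the 2D Gaussian fits, and each sampled keypoint set would then be passed to a PnP solver, e.g. OpenCV's `solvePnP`, which is omitted here):

```python
import numpy as np

def sample_keypoint_sets(means, stds, n_samples=50, seed=0):
    """Draw candidate keypoint sets from per-keypoint 2D Gaussians
    fitted around the detected belief-map peaks. Returns an array of
    shape (n_samples, n_keypoints, 2); running PnP on each set yields
    one pose hypothesis, treated as equally likely."""
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)  # (n_keypoints, 2) pixel means
    stds = np.asarray(stds, dtype=float)    # (n_keypoints, 2) pixel stds
    return rng.normal(means, stds, size=(n_samples,) + means.shape)
```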

From our problem formulation, we assume access to a rough description of the area of interest, S_unc, around the object where an operation needs to be performed. In our peg-insertion task, this is a rectangle centered at the opening of the hole. For each of the pose samples given by our extended DOPE perception algorithm, we compute the associated hole-opening position, represented by the green dots in Fig. 1(a). These points are then fitted by a 3D Gaussian with diagonal covariance, represented in blue in the same figure. We use the mean as the center of Ŝ_unc, and we over-approximate Eq. 1 by displacing the boundaries of the region along each axis by one standard deviation.
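The fitting step can be sketched as follows (a hedged illustration: a diagonal-covariance fit to the sampled hole-opening positions, with nominal half-extents, an assumed parameter, inflated by one standard deviation per axis):

```python
import numpy as np

def fit_uncertain_region(hole_samples, half_extent):
    """Fit a diagonal-covariance 3D Gaussian to the sampled hole-opening
    positions and over-approximate the region of interest by displacing
    the nominal half-extents by one standard deviation along each axis.
    Returns (lower, upper) corners of the resulting box."""
    pts = np.asarray(hole_samples, dtype=float)  # (n_samples, 3)
    mean = pts.mean(axis=0)                      # center of the region
    std = pts.std(axis=0)                        # per-axis uncertainty
    half = np.asarray(half_extent, dtype=float) + std
    return mean - half, mean + half
```

More pose samples in agreement shrink the std term, tightening the region where the RL policy must act.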

The perception module setup is depicted in Fig. LABEL:fig:real in orange, where the camera for DOPE (640x480x3 RGB images from a Logitech webcam with a Carl Zeiss Tessar lens) is mounted overlooking our workspace. The top center image with the orange border is a sample from that camera.

IV-B Model-Based Controller Design

As our model-based controller, we use target attractors defined by Riemannian Motion Policies (RMPs) [30] to move the robot towards a desired end-effector location. The RMPs take in a desired end-effector position in Cartesian space. The target is set to be the centroid of Ŝ_unc, which in our case corresponds to the opening of the hole. As explained in the previous section, a coarse model of the object is required to train a perception module able to provide this location estimate and its uncertainty. The RMPs also require a model of the robot. These two requirements are what give this part of the method its "model-based" character. Thanks to the RL component described in the following section, our GUAPO algorithm does not need these models to be extremely accurate. In case obstacles need to be avoided to reach Ŝ_unc, we can define barrier-type RMPs.

The policies send end-effector position commands at 20 Hz, and the RMPs compute desired joint positions at 1000 Hz. Given that end-effector impedance control is an action space that has been shown to improve sample efficiency for RL policy learning [26], we also use the RMP interface as our reinforcement learning action space.

IV-C Reinforcement Learning Algorithm and Architecture

We use a state-of-the-art model-free off-policy RL algorithm, Soft Actor-Critic [11]. The RL policy acts directly from raw sensory inputs, consisting of joint velocities and images from a wrist-mounted camera (64x64x3 RGB images from a Logitech webcam with a Carl Zeiss Tessar lens) on the robot (see Fig. LABEL:fig:real). As illustrated in Fig. 1(b), all inputs are fed into a β-VAE [12]. The VAE gives us a low-dimensional latent-space representation of the state, which has been shown to improve the sample efficiency of RL algorithms [23]. The parameters of this VAE are trained beforehand on a dataset collected offline. The only part learned by the RL algorithm is a 2-layer MLP that takes as input the 64-dimensional latent representation given by the VAE and produces a 3D position displacement of the robot end-effector.
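The learned policy head can be sketched as a tiny numpy MLP (illustrative only; the real head is trained with SAC, and the hidden width of 128, the random weights, and the tanh squashing are assumptions, not the paper's exact architecture):

```python
import numpy as np

def mlp_policy_head(latent, w1, b1, w2, b2):
    """2-layer MLP mapping the 64-dim VAE latent to a 3D end-effector
    position displacement, squashed to a bounded action range."""
    h = np.maximum(0.0, latent @ w1 + b1)  # ReLU hidden layer
    return np.tanh(h @ w2 + b2)            # bounded 3D displacement

# Random weights just to exercise the shapes.
rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(64, 128)) * 0.05, np.zeros(128)
w2, b2 = rng.normal(size=(128, 3)) * 0.05, np.zeros(3)
action = mlp_policy_head(rng.normal(size=64), w1, b1, w2, b2)
```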

IV-D Training Details

The VAE is pre-trained with 160,000 datapoints for 12 epochs on a Titan XP GPU. DOPE is trained for 8 hours on 4 P100 GPUs. All our learning-based policy methods (GUAPO, the SAC baseline, and the Residual policy baseline described in Sec. V) were trained for 60 training iterations. In total, each policy was trained with 120 training episodes, as each iteration has two training episodes, each with 1000 steps. This takes 90 minutes to train.

Fig. 3: GUAPO is compared with five other methods: (1) model-based policy with perfect goal estimate (MB-Perfect), (2) model-based policy with additive random actions and perfect goal estimate (MB-Rand-Perfect), (3) model-based policy with additive random actions using DOPE goal estimates (MB-Rand-DOPE), (4) the reinforcement learning algorithm Soft Actor-Critic (SAC), and (5) Residual policy. We train the policies for 60 iterations, each with two episodes of 1000 steps.

IV-E Rewards

For GUAPO, we use a sparse reward when the policy finishes the task (inserts the peg): the policy gets -1 everywhere, and 0 when it finishes the task. For our other learning-based baselines (SAC [11] and Residual policy [35, 16]), we use the negative L2 norm to the perception estimate of the goal location, 0 when the robot reaches Ŝ_unc, and 1 when it finishes the task.
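The two reward schemes can be sketched as below (a minimal illustration; representing the region of interest by a center point and a tolerance radius is an assumption made for compactness):

```python
import numpy as np

def sparse_reward(done):
    """GUAPO reward: -1 every step, 0 on task completion."""
    return 0.0 if done else -1.0

def shaped_reward(ee_pos, goal_estimate, region_center, region_tol, done):
    """Baseline reward: negative L2 distance to the perceived goal,
    0 once inside the region of interest, 1 on task completion."""
    if done:
        return 1.0
    ee = np.asarray(ee_pos, dtype=float)
    if np.linalg.norm(ee - np.asarray(region_center, dtype=float)) < region_tol:
        return 0.0
    return -float(np.linalg.norm(ee - np.asarray(goal_estimate, dtype=float)))
```

Note that the shaped baseline reward depends on the (noisy) perception estimate, while GUAPO's sparse reward does not, which is part of why GUAPO is robust to perception error.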

                     MB-Perfect  MB-DOPE  MB-Rand-Perfect  MB-Rand-DOPE  SAC [11]  Residual [16]  GUAPO (ours)
Success Rate         100%        0%       86.67%           26.6%         0%        0%             93%
Avg. Steps for
Task Completion      158.3       n/a      554.1            925.4         n/a       n/a            469.6
In S_goal            100%        0%       100%             70.0%         0%        0%             100%
In Ŝ_unc             100%        100%     100%             93.3%         0%        100%           100%
TABLE I: Real-world peg-insertion results out of 30 trials. All learning policies (SAC, Residual, and GUAPO) trained for 120 episodes (which takes around 90 minutes). The first row indicates the percentage of successful full peg insertions. The second row gives the average number of steps for insertion for the trained policies. The last two rows indicate the percentage of trials in which the method enters S_goal and Ŝ_unc, respectively.

V Experimental Design and Results

In this section we seek to answer the following questions: How does our method compare to our baseline policies, such as Residual policies, in terms of sample efficiency and task completion? And is the proposed algorithm capable of performing peg insertion on a real robot?

V-A Comparison Methods

All baselines were initialized about 75 cm away from the goal, and all were implemented on our real robotic system. We compare our proposed method to the following:

  • MB-Perfect. This method consists of a scripted policy under perfect state estimation.

  • MB-Rand-Perfect. This method uses the same policy as MB-Perfect with injected random actions, sampled from a normal distribution with zero mean and a standard deviation defined by the DOPE perception uncertainty (around 2.5 to 3 cm).

  • MB-DOPE. This method is similar to MB-Perfect, but instead uses the pose estimator prediction to servo to the hole and accomplish insertion.

  • MB-Rand-DOPE. This method uses the same policy as MB-DOPE with injected random actions, sampled in the same way as in MB-Rand-Perfect.

  • SAC. This uses just the policy learned by the RL algorithm, Soft Actor-Critic (SAC), to accomplish the task.

  • Residual. This method is based on recent residual-learning techniques that combine model-based and reinforcement learning methods [35, 16].

V-B Results

The results comparing the different methods are shown in Table I. This table presents the success rate for insertion, the average number of steps needed for completion (a step is equivalent to 50 milliseconds of following the same robot command, as our policy runs at 20 Hz), and the percentage of trials in which the end-effector ends up in the S_goal and Ŝ_unc regions, over 30 trials. We also present training-iteration performance (task success and steps to completion) for the different methods in Figure 3.

MB-Perfect is able to insert 100% of the time, as it has perfect knowledge of the state, and can be seen as an oracle. We can see that taking random actions with MB-Rand-Perfect does not excessively degrade the performance achieved by MB-Perfect. However, when we use DOPE as the perception system, which has around 2.5 to 3.5 cm of noise and error, the performance of MB-DOPE and MB-Rand-DOPE drops drastically. MB-Rand-DOPE performs 26.6% better than MB-DOPE, as the random actions can help offset the perception error.

In our setup, SAC did not achieve any insertions. This is due to the low number of samples SAC was trained on, since most success stories of RL in the real world require several orders of magnitude more data [25]. The Residual method also did not achieve any insertions: it would often apply large actions far away from the hole opening, slide off the box, and get stuck pushing against its side. In comparison, GUAPO only turns on the reinforcement learning policy once it is already near the region of interest, and hence does not suffer from this. Residual was, however, able to reach Ŝ_unc 100% of the time after 120 training episodes, while SAC never did.

In comparison, as seen in Fig. 3, after around 8 training iterations, GUAPO is also able to start inserting into the hole (which is about 12 minute real-world training time). As the policy trains, the average number of steps it takes to insert the peg also decreases. After 120 training episodes (and 90 minutes of training), GUAPO is able to achieve 93% insertion rate.

VI Related Work

In robotic manipulation there are two dominant paradigms for performing a task: leveraging a model of the environment (model-based methods) or leveraging data to learn (learning-based methods). The first category relies on a precise description of the task, such as object CAD models, as well as powerful and sophisticated perception systems [33, 43]. With an accurate model, a well-engineered solution can be designed for a particular task [41, 18], or the model can be combined with a search algorithm such as motion planning [38]. This type of model-based approach is limited by the ingenuity of the roboticist, and can lead to irrecoverable failure if the perception system has un-modeled noise and error.

On the other hand, learning-based approaches in manipulation [24, 10] do not require such a detailed description, but rather require access to interaction with the environment, as well as a reward that indicates success. Such binary rewards are easy to describe, but unfortunately they render Reinforcement Learning methods extremely sample-inefficient. Hence many prior works use shaped rewards [20], which require considerable tuning. Other works use low-dimensional state spaces [47] instead of image inputs, which requires either precise perception systems or specially designed hardware with sensors. Some proposed methods manage to deal directly with sparse rewards, like automatic curriculum generation [7, 8] or the use of demonstrations [42, 3, 27], but these approaches still require large amounts of interaction with the environment. Furthermore, if the position of the objects in the scene changes or there are new distractors in the scene, these methods need to be fully retrained. In contrast, our method is extremely sample-efficient with a sparse success reward, and is robust to these variations thanks to the model-based component.

Recent works can also be understood as combining model-based and learning-based approaches. One such method [17] uses a reinforcement learning algorithm to find the best parameters describing the behavior of the agent based on a model-based template. The learning is very efficient, but at the cost of an extremely engineered pre-solution that also relies on an accurate perception system. Another line of work that combines model-based and learning-based methods is Residual Learning [16, 35], where RL is used to learn an additive policy that can potentially fully overwrite the original model-based policy and does not require any further structure. Nevertheless, these methods are hard to tune, and hardly preserve any of the benefits of the underlying model-based method once trained.

The problem of pose estimation for known objects is a vibrant subject within the robotics and computer vision communities [40, 13, 14, 46, 44, 15, 29, 36, 37]. Regressing to keypoints on the object, or on a cuboid encompassing the object, seems to have become the de facto approach: keypoints are first detected by a neural network, and PnP [21] is then used to predict the pose of the object. Peng et al. [29] also explored using uncertainty, leveraging a RANSAC voting algorithm to find regions where a keypoint could be detected. This approach differs from ours in that they do not directly regress to a keypoint probability map; they regress to a vector voting map, where line intersections are then used to find keypoints. Moreover, their method does not carry pose uncertainty into the final prediction.

VII Conclusions

We introduce a novel algorithm, Guided Uncertainty-Aware Policy Optimization (GUAPO), that combines the generalization capabilities of model-based methods and the adaptability of learning-based methods. It allows the task to be loosely defined, by solely providing a coarse model of the objects and a rough description of the area where an operation needs to be performed. The model-based system leverages this high-level information and accessible state-estimation systems to create a funnel around the area of interest. We use the uncertainty estimate provided by the perception system to automatically switch between the model-based policy and a learning-based policy that can learn from an easy-to-define sparse reward, overcoming the model and estimation errors of the model-based part. We demonstrate real-world learning of a peg-insertion task.


Carlos Florensa and Michelle Lee are grateful to the robotics team at NVIDIA for providing a great learning environment and constant support. Special thanks to Ankur Handa for helping with the compute infrastructure.


  • [1] Y. Chebotar, A. Handa, V. Makoviychuk, M. Macklin, J. Issac, N. Ratliff, and D. Fox (2018) Closing the sim-to-real loop: adapting simulation randomization with real world experience. arXiv preprint arXiv:1810.05687. Cited by: §I.
  • [2] C. Cheng, M. Mukadam, J. Issac, S. Birchfield, D. Fox, B. Boots, and N. Ratliff (2019) RMPflow : A Computational Graph for Automatic Motion Policy Generation. preprint arxiv:1811.07049. External Links: Link Cited by: §I.
  • [3] Y. Ding, C. Florensa, M. Phielipp, and P. Abbeel (2019) Goal-conditioned Imitation Learning. Workshop on Self-Supervised Learning at ICML. External Links: Link Cited by: §VI.
  • [4] Y. Duan, X. Chen, J. Schulman, and P. Abbeel (2016) Benchmarking Deep Reinforcement Learning for Continuous Control. International Conference in Machine Learning. External Links: Link, ISBN 9781510829008 Cited by: §II.
  • [5] J. F. Fisac, A. K. Akametalu, M. N. Zeilinger, S. Kaynama, J. Gillula, and C. J. Tomlin (2017-05) A General Safety Framework for Learning-Based Control in Uncertain Robotic Systems. preprint arxiv:1705.01292. External Links: Link Cited by: §III-C.
  • [6] C. Florensa, Y. Duan, and P. Abbeel (2017) Stochastic Neural Networks for Hierarchical Reinforcement Learning. International Conference in Learning Representations, pp. 1–17. External Links: Link, ISBN 9781613242643, Document, ISSN 14779129 Cited by: §II.
  • [7] C. Florensa, D. Held, X. Geng, and P. Abbeel (2018) Automatic Goal Generation for Reinforcement Learning Agents. International Conference in Machine Learning. External Links: Link Cited by: §II, §VI.
  • [8] C. Florensa, D. Held, M. Wulfmeier, M. Zhang, and P. Abbeel (2017) Reverse Curriculum Generation for Reinforcement Learning. Conference on Robot Learning, pp. 1–16. External Links: Link, Document, ISSN 1938-7228 Cited by: §II, §VI.
  • [9] S. Ganguly and O. Khatib (2018) Experimental studies of contact space model for multi-surface collisions in articulated rigid-body systems. In International Symposium on Experimental Robotics, Cited by: §II.
  • [10] T. Haarnoja, V. Pong, A. Zhou, M. Dalal, P. Abbeel, and S. Levine (2018) Composable Deep Reinforcement Learning for Robotic Manipulation. In Proceedings - IEEE International Conference on Robotics and Automation, pp. 6244–6251. External Links: ISBN 9781538630815, Document, ISSN 10504729 Cited by: §VI.
  • [11] T. Haarnoja, A. Zhou, P. Abbeel, S. Levine, and C. Sciences (2018) Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. International Conference in Machine Learning, pp. 1–15. Cited by: §IV-C, §IV-E, TABLE I.
  • [12] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner (2017-11) β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. International Conference on Learning Representations, pp. 1–22. External Links: Link, ISBN 1078-0874, Document, ISSN 1078-0874 Cited by: §IV-C.
  • [13] S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige, and N. Navab (2012) Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes. In ACCV, Cited by: §VI.
  • [14] T. Hodaň, P. Haluza, Š. Obdržálek, J. Matas, M. Lourakis, and X. Zabulis (2017) T-LESS: an RGB-D dataset for 6D pose estimation of texture-less objects. In WACV, Cited by: §VI.
  • [15] Y. Hu, J. Hugonot, P. Fua, and M. Salzmann (2019) Segmentation-driven 6D object pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3385–3394. Cited by: §VI.
  • [16] T. Johannink, S. Bahl, A. Nair, J. Luo, and A. Kumar (2019) Residual Reinforcement Learning for Robot Control. arXiv preprint arXiv:1812.03201, pp. 1–9. External Links: Document Cited by: 1st item, §IV-E, TABLE I, 6th item, §VI.
  • [17] L. Johannsmeier, M. Gerchow, and S. Haddadin (2019) A Framework for Robot Manipulation: Skill Formalism, Meta Learning and Adaptive Control. Technical report, Technische Universität München. External Links: Link Cited by: §VI.
  • [18] C. H. Kim and J. Seo (2019-04) Shallow-Depth Insertion: Peg in Shallow Hole Through Robotic In-Hand Manipulation. IEEE Robotics and Automation Letters 4 (2), pp. 383–390. External Links: Link, Document, ISSN 2377-3766 Cited by: §VI.
  • [19] S. M. LaValle (2006) Planning Algorithms. Cambridge University Press, Cambridge, U.K.. Cited by: §II.
  • [20] M. A. Lee, Y. Zhu, K. Srinivasan, P. Shah, S. Savarese, L. Fei-Fei, A. Garg, and J. Bohg (2018) Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal Representations for Contact-Rich Tasks. International Conference on Robotics and Automation. Cited by: §VI.
  • [21] V. Lepetit, F. Moreno-Noguer, and P. Fua (2009) EPnP: An accurate O(n) solution to the PnP problem. International Journal Computer Vision 81 (2). Cited by: §VI.
  • [22] V. Lepetit, F. Moreno-Noguer, and P. Fua (2009-02) EPnP: An accurate O(n) solution to the PnP problem. International Journal of Computer Vision 81, pp. . External Links: Document Cited by: §IV-A.
  • [23] T. Lesort, N. Díaz-Rodríguez, J. Goudou, and D. Filliat. (2018) State representation learning for control: an overview. CoRR abs/1802.04181. Cited by: §IV-C.
  • [24] S. Levine and C. Finn (2016) End-to-End Training of Deep Visuomotor Policies. Journal of Machine Learning Research 17, pp. 1–40. Cited by: §I, §VI.
  • [25] S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen (2018) Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research 37 (4-5), pp. 421–436. Cited by: §V-B.
  • [26] R. Martín-Martín, M. A. Lee, R. Gardner, S. Savarese, J. Bohg, and A. Garg (2019) Variable impedance control in end-effector space: an action space for reinforcement learning in contact-rich tasks. arXiv preprint arXiv:1906.08880. Cited by: §IV-B.
  • [27] A. Nair, B. McGrew, M. Andrychowicz, W. Zaremba, and P. Abbeel (2018) Overcoming Exploration in Reinforcement Learning with Demonstrations. International Conference on Robotics and Automation. External Links: Link, ISBN 9781538630808, Document, ISSN 0969-2290 Cited by: §VI.
  • [28] A. Nair, V. Pong, M. Dalal, S. Bahl, S. Lin, and S. Levine (2018) Visual Reinforcement Learning with Imagined Goals. Advances in Neural Information Processing Systems. Cited by: §I.
  • [29] S. Peng, Y. Liu, Q. Huang, X. Zhou, and H. Bao (2019) PVNet: Pixel-wise voting network for 6DoF pose estimation. In CVPR, Cited by: §VI.
  • [30] N. D. Ratliff, J. Issac, D. Kappler, S. Birchfield, and D. Fox (2018-01) Riemannian Motion Policies. arXiv preprint arXiv:1801.02854. External Links: Link Cited by: §I, §IV-B.
  • [31] N. Ratliff, M. Toussaint, and S. Schaal (2015) Understanding the geometry of workspace obstacles in motion optimization. In International Conference on Robotics and Automation, Cited by: §II.
  • [32] T. Schaul, D. Horgan, K. Gregor, and D. Silver (2015) Universal Value Function Approximators. International Conference on Machine Learning. External Links: Link Cited by: §II.
  • [33] T. Schmidt, R. Newcombe, and D. Fox (2015) DART: Dense Articulated Real-Time Tracking. Technical report University of Washington. External Links: Link Cited by: §I, §VI.
  • [34] J. Schulman, J. Ho, A. Lee, I. Awwal, H. Bradlow, and P. Abbeel (2013-06) Finding locally optimal, collision-free trajectories with sequential convex optimization. In RSS, pp. . External Links: Document Cited by: §II.
  • [35] T. Silver, K. Allen, J. Tenenbaum, and L. Kaelbling (2018) Residual Policy Learning. arXiv preprint arXiv:1812.06298. Cited by: 1st item, §IV-E, 6th item, §VI.
  • [36] M. Sundermeyer, Z. Marton, M. Durner, M. Brucker, and R. Triebel (2018) Implicit 3D orientation learning for 6D object detection from RGB images. In ECCV, Cited by: §VI.
  • [37] B. Tekin, S. N. Sinha, and P. Fua (2018) Real-time seamless single shot 6D object pose prediction. In CVPR, Cited by: §VI.
  • [38] G. Thomas, M. Chien, A. Tamar, J. A. Ojea, and P. Abbeel (2018) Learning Robotic Assembly from CAD. International Conference on Robotics and Automation. Cited by: §VI.
  • [39] T. To, J. Tremblay, D. McKay, Y. Yamaguchi, K. Leung, A. Balanon, J. Cheng, and S. Birchfield (2018) NDDS: NVIDIA deep learning dataset synthesizer. Cited by: §IV-A.
  • [40] J. Tremblay, T. To, B. Sundaralingam, Y. Xiang, D. Fox, and S. Birchfield (2018) Deep object pose estimation for semantic robotic grasping of household objects. arXiv preprint arXiv:1809.10790. Cited by: §III-A, §IV-A, §VI.
  • [41] K. Van Wyk, M. Culleton, J. Falco, and K. Kelly (2018) Comparative Peg-in-Hole Testing of a Force-Based Manipulation Controlled Robotic Hand. IEEE Transactions on Robotics 34 (2), pp. 542–549. External Links: Document, ISSN 15523098 Cited by: §VI.
  • [42] M. Vecerik, T. Hester, J. Scholz, F. Wang, O. Pietquin, B. Piot, N. Heess, T. Rothörl, T. Lampe, and M. Riedmiller (2017) Leveraging Demonstrations for Deep Reinforcement Learning on Robotics Problems with Sparse Rewards. arXiv preprint arXiv:1707.08817, pp. 1–11. Cited by: §VI.
  • [43] M. Wüthrich, P. Pastor, M. Kalakrishnan, J. Bohg, and S. Schaal (2013-11) Probabilistic object tracking using a range camera. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3195–3202. Cited by: §VI.
  • [44] Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox (2018) PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes. In RSS, Cited by: §VI.
  • [45] Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox (2017) PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes. arXiv preprint arXiv:1711.00199. Cited by: §I.
  • [46] S. Zakharov, I. Shugurov, and S. Ilic (2019) DPOD: Dense 6D pose object detector in RGB images. arXiv preprint arXiv:1902.11020. Cited by: §VI.
  • [47] H. Zhu, A. Gupta, A. Rajeswaran, S. Levine, and V. Kumar (2019) Dexterous manipulation with deep reinforcement learning: efficient, general, and low-cost. In 2019 International Conference on Robotics and Automation (ICRA), pp. 3651–3657. Cited by: §VI.