
Deep Reinforcement Learning for Industrial Insertion Tasks with Visual Inputs and Natural Rewards

by Gerrit Schoettler, et al.

Connector insertion and many other tasks commonly found in modern manufacturing settings involve complex contact dynamics and friction. Since it is difficult to capture related physical effects with first-order modeling, traditional control methods often result in brittle and inaccurate controllers, which have to be manually tuned. Reinforcement learning (RL) methods have been demonstrated to be capable of learning controllers in such environments from autonomous interaction with the environment, but running RL algorithms in the real world poses sample efficiency and safety challenges. Moreover, in practical real-world settings we cannot assume access to perfect state information or dense reward signals. In this paper, we consider a variety of difficult industrial insertion tasks with visual inputs and different natural reward specifications, namely sparse rewards and goal images. We show that methods that combine RL with prior information, such as classical controllers or demonstrations, can solve these tasks from a reasonable amount of real-world interaction.




I Introduction

Many industrial tasks on the edge of automation require a degree of adaptability that is difficult to achieve with conventional robotic automation techniques. While standard control methods, such as PID controllers, are heavily employed to automate many positioning tasks, tasks that require significant adaptability or tight visual perception-control loops are often beyond the capabilities of such methods, and are therefore typically performed manually. Standard control methods can struggle in the presence of complex dynamical phenomena that are hard to model analytically, such as complex contacts. Reinforcement learning (RL) offers a different solution, relying on trial-and-error learning instead of accurate modeling to construct an effective controller. RL with expressive function approximation, i.e., deep RL, has further been shown to automatically handle high-dimensional inputs such as images [1].

However, deep RL has thus far not seen wide adoption in the automation community due to several practical obstacles. Sample efficiency is one obstacle: tasks must be completed without excessive interaction time or wear and tear on the robot. Progress in recent years on developing better RL algorithms has led to significantly better sample efficiency, even in dynamically complicated tasks [2, 3], but sample efficiency remains a challenge for deploying RL in real-world robotics contexts. Another major, often underappreciated, obstacle is goal specification: while prior work in RL assumes a reward signal to optimize, it is often carefully shaped to allow the system to learn [4, 5, 6]. Obtaining such dense reward signals can be a significant challenge, as one must additionally build a perception system that allows computing dense rewards on state representations. Shaping a reward function so that an agent can learn from it also requires considerable manual effort. An ideal RL system would learn from rewards that are natural and easy to specify. How can we enable robots to autonomously perform complex tasks without significant engineering effort to design perception and reward systems?

Fig. 1: We train an agent directly in the real world to solve connector insertion tasks from raw pixel input and without access to ground-truth state information for reward functions. Left, a rollout from a learned policy that successfully completes the insertion task for each connector is shown. Right, a full view of the robot setup including the three connectors we use in this work. Videos of the results are available at

We first consider an end-to-end approach that learns a policy from images, where the images serve as both the state representation and the goal specification. Using goal images is not fully general, but can successfully represent tasks when the task is to reach a final desired state [7]. Specifying goals via goal images is convenient, and makes it possible to specify goals with minimal manual effort. Using images as the state representation also allows a robot to learn behaviors that utilize direct visual feedback, which provides some robustness to sensor and actuator noise.

Secondly, we consider learning from simple and sparse reward signals. Sparse rewards can often be obtained conveniently, for instance from human-provided labels or simple instrumentation. In many electronic assembly tasks, which we consider here, we can directly detect whether the electronics are functional, and use that signal as a reward. Learning from sparse rewards poses a challenge, as exploration with sparse reward signals is difficult, but this challenge can be overcome by using sufficient prior information about the task. To this end, we extend the residual RL approach [8, 9], which learns a parametric policy on top of a fixed, hand-specified controller, to the setting of vision-based manipulation.

Fig. 2: A close-up view of the three connector insertion tasks shows the contacts and tight tolerances the agent must navigate to solve these tasks. These tasks require sub-millimeter precision without visual feedback.

In our experiments, we show that we can successfully complete real-world tight tolerance assembly tasks, such as inserting USB connectors, using RL from images with reward signals that are convenient for users to specify. We show that we can learn from only a sparse reward based on the electrical connection for a USB adapter plug, and we demonstrate learning insertion skills with rewards based only on goal images. These reward signals require no extra engineering and are easy to specify for many tasks. Beyond showing the feasibility of RL for solving these tasks, we evaluate multiple RL algorithms across three tasks and study their robustness to imprecise positioning and noise.

II Related Work

Learning has been applied previously in a variety of robotics contexts. Different forms of learning have enabled autonomous driving [10], biped locomotion [11], block stacking [12], grasping [13], and navigation [14, 15]. Among these methods, many involve reinforcement learning, where an agent learns to perform a task by maximizing a reward signal. Reinforcement learning algorithms have been developed and applied to teach robots to perform tasks such as balancing a robot [16], playing ping-pong [17] and baseball [18]. The use of large function approximators, such as neural networks, in RL has further broadened the generality of RL [1]. Such techniques, called “deep” RL, have further allowed robots to be trained directly in the real world to perform fine-grained manipulation tasks from vision [19], open doors [20], play hockey [21], stack Lego blocks [22], use dexterous hands [23], and grasp objects [24]. In this work we further explore solving real-world robotics tasks using RL.

Many RL algorithms introduce prior information about the specific task to be solved. One common method is reward shaping [4], but reward shaping can become arbitrarily difficult as the complexity of the task increases. Other methods incorporate a trajectory planner [25, 26, 27] but for complex assembly tasks, trajectory planners require a host of information about objects and geometries which can be difficult to provide.

Another body of work on incorporating prior information studies using demonstrations either to initialize a policy [18, 28], infer reward functions using inverse reinforcement learning [29, 30, 31, 32, 33] or to improve the policy throughout the learning procedure [34, 35, 36, 37]. These methods require multiple demonstrations, which can be difficult to collect, especially for assembly tasks, although learning a reward function by classifying goal states [38, 39] may partially alleviate this issue. More recently, manually specifying a policy and learning the residual task has been proposed [8, 9]. In this work we evaluate both residual RL and combining RL with learning from demonstrations.

Previous work has also tackled high precision assembly tasks, especially insertion-type tasks. One line of work focuses on obtaining high dimensional observations, including geometry, forces, joint positions and velocities [40, 41, 42, 43], but this information is not easily procured, increasing complexity of the experiments and the supervision required. Other work relies on external trajectory planning or very high precision control [42, 41], but this can be brittle to error in other components of the system, such as perception. We show how our method not only solves insertion tasks with much less information about the environment, but also does so under noisy conditions.

III Electric Connector Plug Insertion Tasks

Fig. 3: Illustration of the robot’s cascade control scheme. The actions are computed at a frequency of up to , desired joint angles are obtained by inverse kinematics, and a joint-space impedance controller with anti-windup PID control commands actuator torques at .

In this work, we empirically evaluate learning methods on a set of electric connector assembly tasks, pictured in Fig. 2. Connector plug insertions are difficult for two reasons. First, the robot must be very precise in lining up the plug with its socket. As we show in our experiments, errors as small as  mm can lead to consistent failure. Second, there is significant friction when the connector plug touches the socket, and the robot must learn to apply sufficient force in order to insert the plug. Image sequences of successful insertions are shown in Fig. 1, where it is also possible to see details of the gripper setup that we used to ensure a failure-free, fully automated training process. In our experiments, we use a 7-degree-of-freedom Sawyer robot with end-effector control, meaning that the action signal $u_t$ can be interpreted as the relative end-effector movement in Cartesian coordinates. The robot’s underlying internal control pipeline is illustrated in Fig. 3.

To comprehensively evaluate connector assembly tasks, we repeat our experiments on a variety of connectors. Each connector offers a different challenge in terms of required precision and force to overcome friction. We chose to benchmark the controllers' performance on the insertion of a USB connector, a D-Sub connector, and a waterproof Model-E connector manufactured by MISUMI. All the explored use cases were part of the IROS 2017 Robotic Grasping and Manipulation Competition [44], included as part of a task board developed by NIST to benchmark the performance of assembly robots.

III-A Adapters

In the following we describe the adapters used: USB, D-Sub, and Model-E. The observed difficulty of the insertion increases in that order.

III-A1 USB

The USB connector is a ubiquitous, widely-used connector and offers a challenging insertion task. Because the adapter becomes smoother and therefore easier to insert over time due to wear and tear, we periodically replace the adapter. Of the three tested adapters, the USB adapter is the easiest.

III-A2 D-Sub

Inserting this adapter requires aligning several pins correctly, and is therefore more sensitive than inserting the USB adapter. It also requires more downward force due to a tighter fit.

III-A3 Model-E

This adapter is the most difficult of the three tested connectors as it contains several edges and grooves to align and requires significant downward force to successfully insert the part.

III-B Experimental Settings

We consider three settings in our experiments in order to evaluate how plausible it is to solve these tasks with more convenient state representations and reward functions, and to evaluate how the performance of different algorithms changes as the setting is modified.

III-B1 Visual

In this experiment, we evaluate whether the RL algorithms can learn to perform the connector assembly tasks from vision without having access to state information. The state provided to the learned policy is a grayscale image, such as shown in Fig. 4. For goal specification, we use a goal image, avoiding the need for state information to compute rewards. The reward is the pixelwise L1 distance to the given goal image. Being able to learn from such a setup is compelling as it does not require any extra state estimation and many tasks can be specified easily by a goal image.
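As a minimal sketch of the image-based reward described above, the function below computes a pixelwise L1 distance between an observation and a goal image. The function name, the nested-list image representation, and the sign convention (the distance is negated so that a larger reward means a closer match) are assumptions made for illustration and are not taken from the paper.

```python
def l1_image_reward(image, goal_image):
    """Return the negated pixelwise L1 distance between two grayscale images.

    Images are nested lists (rows of pixel intensities). Negating the
    distance is an assumed sign convention: closer to the goal is better.
    """
    if len(image) != len(goal_image):
        raise ValueError("images must have the same height")
    total = 0.0
    for row, goal_row in zip(image, goal_image):
        if len(row) != len(goal_row):
            raise ValueError("images must have the same width")
        # Sum of absolute per-pixel differences for this row.
        total += sum(abs(p - g) for p, g in zip(row, goal_row))
    return -total
```

With this convention the reward is zero when the observation matches the goal image exactly and grows more negative as the images diverge.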

III-B2 Electrical (Sparse)

In this experiment, the reward is obtained by directly measuring whether the connection is alive and transmitting:

$$r = \begin{cases} 1, & \text{if an electrical connection is detected} \\ 0, & \text{otherwise} \end{cases} \tag{1}$$

This is the exact true reward for the task of connecting a cable, and can be naturally obtained in many manufacturing systems. As state, the robot is given the Cartesian coordinates of the end-effector and the vertical force that is acting on the end-effector. We could only automatically detect the USB connection, so we only include the USB adapter for the sparse experiments.

III-B3 Dense

In this experiment, the robot receives a manually shaped reward based on the distance to the target location . We use the reward function



. The hyperparameters are set to

, , and . When an insertion is indicated through a distance measurement, the sign of the force term flips, so that when the connector is inserted. This rewards the agent for pressing down after an insertion and was shown to improve the learning process. The force measurements are calibrated before each rollout to account for measurement bias and to decouple the measurements from the robot pose.

IV Methods

To solve the connector insertion tasks, we consider and evaluate a variety of reinforcement learning algorithms.

IV-A Preliminaries

In a Markov decision process (MDP), an agent at every time step $t$ is at state $s_t$, takes actions $u_t$, receives a reward $r_t$, and the state evolves according to environment transition dynamics $p(s_{t+1} \mid s_t, u_t)$. The goal of reinforcement learning is to choose actions $u_t \sim \pi(u_t \mid s_t)$ to maximize the expected returns $\mathbb{E}\left[\sum_{i=t}^{T} \gamma^{i-t} r_i\right]$, where $T$ is the horizon and $\gamma$ is a discount factor. The policy $\pi(u_t \mid s_t)$ is often chosen to be an expressive parametric function approximator, such as a neural network, as we use in this work.
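To make the return being maximized concrete, the following sketch computes the discounted sum of rewards for one finite episode. The function name and the example numbers are illustrative assumptions; only the definition (rewards weighted by increasing powers of the discount factor) comes from the text.

```python
def discounted_return(rewards, gamma):
    """Discounted sum of a finite reward sequence, starting at its first step.

    Accumulating backwards avoids computing explicit powers of gamma:
    ret_t = r_t + gamma * ret_{t+1}.
    """
    ret = 0.0
    for reward in reversed(rewards):
        ret = reward + gamma * ret
    return ret
```

For a sparse task that pays 1 only on the final step, for example `discounted_return([0, 0, 1], 0.9)`, the return shrinks with every extra step needed to reach success, which is one reason faster insertions are preferred under discounting.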

IV-B Efficient Off-Policy Reinforcement Learning

One class of RL methods additionally estimates the expected discounted return after taking action $u_t$ from state $s_t$, the Q-value $Q(s_t, u_t)$. Q-values can be recursively defined with the Bellman equation:

$$Q(s_t, u_t) = \mathbb{E}_{s_{t+1}}\left[ r_t + \gamma \max_{u_{t+1}} Q(s_{t+1}, u_{t+1}) \right]$$

and learned from off-policy transitions $(s_t, u_t, r_t, s_{t+1})$. Because we are interested in sample-efficient real-world learning, we use such RL algorithms that can take advantage of off-policy data.
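The recursive definition can be illustrated with a tabular Q-learning update on a single stored transition. This is a deliberately simplified sketch: it assumes a small discrete action set so the maximum can be taken by enumeration, and the table layout, the `done` flag, the learning rate, and all names are assumptions for illustration rather than the continuous-control setup used in the paper.

```python
def bellman_target(reward, next_q_values, gamma, done=False):
    """One-step Bellman target: reward plus discounted best next Q-value."""
    if done:
        # No future return after a terminal transition.
        return reward
    return reward + gamma * max(next_q_values)


def q_update(q_table, transition, gamma=0.99, lr=0.1):
    """Move Q(s, a) a fraction lr toward the Bellman target, in place.

    q_table maps a state to a list of Q-values, one per discrete action.
    transition is (state, action, reward, next_state, done) and may come
    from any behavior policy, which is what makes the update off-policy.
    """
    state, action, reward, next_state, done = transition
    target = bellman_target(reward, q_table[next_state], gamma, done)
    q_table[state][action] += lr * (target - q_table[state][action])
    return q_table[state][action]
```

Because the target depends only on the stored transition and the current Q-estimates, the same data can be replayed many times, which is the property that makes off-policy methods attractive for sample-limited real-world training.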

For control with continuous actions, computing the required maximum in the Bellman equation is difficult. Continuous control algorithms such as deep deterministic policy gradients (DDPG) [45] additionally learn a policy to approximately choose the maximizing action. In this paper we specifically consider two related reinforcement learning algorithms that lend themselves well to real-world learning as they are sample efficient, stable, and require little hyperparameter tuning.

IV-B1 Twin Delayed Deep Deterministic Policy Gradients (TD3)

Like DDPG, TD3 optimizes a deterministic policy [46] but uses two Q-function approximators to reduce value overestimation [47] and delayed policy updates to stabilize training.
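The two ingredients named above can be sketched for a scalar action as follows. The target takes the minimum of two target critics, and the policy is updated only every few critic updates. The noise added to the target action (target policy smoothing) is part of the published TD3 algorithm but is not mentioned in this text; all function names, callables, and numeric defaults below are placeholders, not values from the paper or from rlkit.

```python
import random


def td3_target(reward, next_state, target_policy, target_q1, target_q2,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=1.0,
               done=False, rng=random):
    """Clipped double-Q target for a scalar action.

    target_policy(state) -> action, target_q*(state, action) -> value.
    """
    # Target policy smoothing: clipped Gaussian noise on the target action.
    noise = max(-noise_clip, min(noise_clip, rng.gauss(0.0, noise_std)))
    next_action = target_policy(next_state) + noise
    next_action = max(-max_action, min(max_action, next_action))
    # Taking the smaller of the two estimates counteracts overestimation.
    q_min = min(target_q1(next_state, next_action),
                target_q2(next_state, next_action))
    return reward + (0.0 if done else gamma * q_min)


def should_update_policy(step, policy_delay=2):
    """Delayed policy updates: update the actor every policy_delay steps."""
    return step % policy_delay == 0
```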

IV-B2 Soft Actor Critic (SAC)

SAC is an off-policy value-based reinforcement learning method based on the maximum entropy reinforcement learning framework with a stochastic policy [2].
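For reference, the maximum entropy objective that SAC optimizes, as stated in [2], augments the reward with the entropy of the policy. The notation below is introduced here for illustration: $u_t$ denotes the action, $\rho_\pi$ the state-action distribution induced by the policy, $\mathcal{H}$ the entropy, and $\alpha$ a temperature that trades off reward against entropy.

```latex
J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, u_t) \sim \rho_\pi}
         \left[ r(s_t, u_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]
```

Setting $\alpha = 0$ recovers the standard expected-return objective, so the entropy term can be read as an explicit bonus for keeping the policy stochastic.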

We used the implementation of these RL algorithms publicly available at rlkit [48].

IV-C Residual Reinforcement Learning

Instead of randomly exploring from scratch, we can inject prior information into an RL algorithm in order to speed up the training process, as well as to minimize unsafe exploration behavior. In residual RL, actions are chosen by additively combining a fixed policy $\pi_H(s_t)$ with a parametric policy $\pi_\theta(s_t)$:

$$u_t = \pi_H(s_t) + \pi_\theta(s_t).$$

The parameters $\theta$ can be learned using any RL algorithm. In this work, we evaluate both SAC and TD3, explained in the previous section. The residual RL implementation that we use in our experiments is summarized in Algorithm 1.

A simple P-controller serves as the hand-designed controller $\pi_H$ of our experiments. The P-controller operates in Cartesian space and calculates the current control action by

$$\pi_H(s_t) = -k_p \cdot (x_t - x_g),$$

where $x_t$ denotes the current end-effector position and $x_g$ denotes the commanded goal location. As control gains we use . This P-controller quickly centers the end-effector above the goal position and reaches the goal after about 10 time steps when starting from the reset position, which is located about 5 cm above the goal.
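A proportional controller of this kind can be sketched in a few lines. The per-axis gains in the usage note are placeholder values chosen for illustration only; the gains used in the paper are not reproduced here, and the function name is an assumption.

```python
def p_controller(position, goal, gains):
    """Cartesian P-controller: per-axis action proportional to the goal error.

    position, goal, and gains are equal-length sequences (e.g. x, y, z).
    The returned action points from the current position toward the goal.
    """
    return [-k * (x - g) for k, x, g in zip(gains, position, goal)]
```

For example, `p_controller([0.0, 0.0, 0.05], [0.0, 0.0, 0.0], [1.0, 1.0, 0.5])` commands a downward move along the third axis only, since the first two axes are already aligned with the goal.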

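Require: policy $\pi_\theta$, hand-engineered controller $\pi_H$.
for $n = 0, \dots, N-1$ episodes do
   Sample initial state $s_0$.
   for $t = 0, \dots, T-1$ steps do
      Get policy action $\pi_\theta(s_t)$.
      Get action to execute $u_t = \pi_\theta(s_t) + \pi_H(s_t)$.
      Get next state $s_{t+1} \sim p(\cdot \mid s_t, u_t)$.
      Store $(s_t, u_t, r_t, s_{t+1})$ into replay buffer $\mathcal{R}$.
      Sample set of transitions $(s, u, r, s') \sim \mathcal{R}$.
      Optimize $\theta$ using RL with sampled transitions.
   end for
end for
Algorithm 1 Residual reinforcement learning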
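The control flow of Algorithm 1 can be sketched as a short Python loop. Everything concrete here is a stand-in chosen so the example runs on its own: `ToyInsertionEnv` is a hypothetical 1-D environment, `ZeroResidualAgent` is a placeholder learner that outputs no correction and ignores updates, and the gain, horizon, and batch size are arbitrary. In practice the agent would be an off-policy learner such as SAC or TD3.

```python
import random


class ToyInsertionEnv:
    """Hypothetical 1-D stand-in for the robot: state is height above the goal."""

    def __init__(self, start=0.05, tolerance=0.001):
        self.start, self.tolerance = start, tolerance
        self.x = start

    def reset(self):
        self.x = self.start
        return self.x

    def step(self, action):
        self.x += action
        # Sparse success signal: 1 only when within tolerance of the goal.
        reward = 1.0 if abs(self.x) < self.tolerance else 0.0
        return self.x, reward


class ZeroResidualAgent:
    """Placeholder learner: zero residual action, no-op update."""

    def act(self, state):
        return 0.0

    def update(self, batch):
        pass


def hand_controller(state, gain=0.5):
    """Fixed P-controller that drives the state toward zero."""
    return -gain * state


def residual_rl(env, agent, controller, episodes=2, horizon=20,
                batch_size=4, seed=0):
    rng = random.Random(seed)
    replay = []
    for _ in range(episodes):
        state = env.reset()
        for _ in range(horizon):
            # Executed action = learned residual + fixed controller output.
            action = agent.act(state) + controller(state)
            next_state, reward = env.step(action)
            replay.append((state, action, reward, next_state))
            # Sample stored transitions and let the learner improve on them.
            batch = rng.sample(replay, min(batch_size, len(replay)))
            agent.update(batch)
            state = next_state
    return replay
```

Because the fixed controller already moves toward the goal, even the zero-residual placeholder reaches the success region in this toy setting; the learned residual only has to supply the corrections the hand-designed part cannot.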

Fig. 4: Successful insertion on the Model-E connector task. The grayscale images are the only observations that the image-based reinforcement learning algorithm receives.
Fig. 5: Resulting final mean distance during the vision-based training. The comparison includes RL, residual RL, and RL with learning from demonstrations. Only residual RL manages to deal with the high-dimensional input and consistently solve all the tasks after the given amount of training. The other methods learn to move downwards, but often get stuck in the beginning of the insertion and fail to recover from unsuccessful attempts.

IV-D Learning from Demonstrations

Another method to incorporate prior information is to use demonstrations from an expert policy to guide exploration during RL. We first collect demonstrations with a joystick controller. Then, we add a behavior cloning loss while performing RL that pushes the policy towards the demonstrator actions, as previously considered in [35]. Instead of DDPG, the underlying RL algorithm used is TD3.
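One common way to realize such an auxiliary term is a mean squared error between the policy's actions and the demonstrated actions, added to the RL actor loss with a weight. The sketch below assumes this form; the function names, the squared-error choice, and the weight `bc_weight` are illustrative assumptions and may differ from the loss used in [35] or in the paper's implementation.

```python
def behavior_cloning_loss(policy, demo_states, demo_actions):
    """Mean squared error between policy actions and demonstrator actions.

    policy(state) returns an action vector; demo_actions holds the action
    the demonstrator took in each corresponding demo state.
    """
    total = 0.0
    for state, demo_action in zip(demo_states, demo_actions):
        action = policy(state)
        total += sum((a - d) ** 2 for a, d in zip(action, demo_action))
    return total / len(demo_states)


def actor_loss_with_bc(rl_actor_loss, bc_loss, bc_weight=1.0):
    """Combined objective: RL actor loss plus weighted cloning penalty."""
    return rl_actor_loss + bc_weight * bc_loss
```

The weight controls how strongly the policy is held near the demonstrations: a large value behaves like pure imitation, while a value of zero recovers plain RL.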

V Experiments

We evaluate the industrial applicability of the residual RL approach on a variety of connector insertion tasks that are performed on a real robot, using easy-to-obtain reward signals. In this section, we consider two types of natural rewards which are intuitive to humans: an image directly specifying a goal and a binary sparse reward indicating success. For both cases, we report success rates on tasks they solve. We aim to answer the following questions: (1) Can such trained policies provide comparable performance to policies that are trained with densely-shaped rewards? (2) Are these trained policies robust to light variations and noise?

V-A Vision-based Learning

For the vision-based learning experiments, we use only raw image observations and distance between the current image and goal image as the reward signal. Sample images that the robot received are shown in Fig. 4. We evaluate this type of reward on all three connectors. In our experiments, we use grayscale images.

V-B Learning from Sparse Rewards

In the sparse reward experiment, we use the binary signal of the connector being electrically connected as the reward signal. This experiment is most applicable to electronic manufacturing settings where the electrical connection between connectors can be directly measured. We only evaluate the sparse reward setting on the USB connector, as it was straightforward to obtain the electrical connection signal.

V-C Perfect State Information

After evaluating the tasks in the above settings, we further evaluate with full state information with a dense and carefully shaped reward signal, given in Eq. 2, that incorporates distance to the goal and force information. Evaluating in this setting gives us an “oracle” that can be compared to the previous experiments in order to understand how much of a challenge sparse or image rewards pose for various algorithms.

V-D Robustness

For safe and reliable future usage, it is required that the insertion controller is robust against small measurement or calibration errors that can occur when disassembling and reassembling a mechanical system. In this experiment, small goal perturbations are introduced in order to uncover the required setup precision of our algorithms.
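A robustness test of this kind can be sketched as two small helpers: one that perturbs the commanded goal and one that summarizes trial outcomes. The uniform noise model, the `max_offset` parameter, and the function names are assumptions made for illustration; the perturbation distribution and magnitude used in the experiments are not specified in this text.

```python
import random


def perturb_goal(goal, max_offset, rng):
    """Return a copy of goal with independent uniform noise on each axis.

    max_offset bounds the absolute perturbation per axis (same unit as goal).
    """
    return [g + rng.uniform(-max_offset, max_offset) for g in goal]


def success_rate(outcomes):
    """Fraction of successful trials in a list of booleans."""
    return sum(1 for ok in outcomes if ok) / len(outcomes)
```

With 25 evaluation rollouts per method, each trial contributes 4 percentage points, so 21 successes corresponds to a reported success rate of 84%.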

V-E Exploration Comparison

One advantage of using reinforcement learning is the exploratory behavior that allows the controller to adapt from new experiences unlike a deterministic control law. The two RL algorithms we consider in this paper, SAC and TD3, explore differently. SAC maintains a stochastic policy, and the algorithm also adapts the stochasticity through training. TD3 has a deterministic policy, but uses another noise process (in our case Gaussian) to inject exploratory behavior during training time. We compare the two algorithms, as well as when they are used in conjunction with residual RL, in order to evaluate the effect of the different exploration schemes.
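The difference between the two exploration schemes can be sketched for a scalar action. In the first function the noise scale is a fixed hyperparameter added to a deterministic action; in the second the standard deviation is an output of the policy itself and can therefore change during training. The tanh squashing follows common SAC implementations; the function names and the scalar simplification are assumptions for illustration.

```python
import math
import random


def td3_explore(mean_action, noise_std, rng, max_action=1.0):
    """Deterministic action plus fixed-scale Gaussian noise, clipped to bounds."""
    noisy = mean_action + rng.gauss(0.0, noise_std)
    return max(-max_action, min(max_action, noisy))


def sac_explore(mean, log_std, rng, max_action=1.0):
    """Sample from a Gaussian whose mean and log-std come from the policy.

    The tanh squashing keeps the sampled action inside the action bounds.
    """
    sample = rng.gauss(mean, math.exp(log_std))
    return max_action * math.tanh(sample)
```

When the policy outputs a very negative `log_std`, the sampled action collapses to the tanh of the mean, which illustrates how a learned noise scale can shrink where precision matters.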

D-Sub Connector

Method        Reward / input   Perfect goal   Noisy goal
Pure RL       Dense            16%            0%
Pure RL       Images, SAC      0%             0%
Pure RL       Images, TD3      12%            12%
RL + LfD      Images           52%            52%
Residual RL   Dense            100%           60%
Residual RL   Images, SAC      100%           64%
Residual RL   Images, TD3      52%            52%
Human         P-Controller     100%           44%

Model-E Connector

Method        Reward / input   Perfect goal   Noisy goal
Pure RL       Dense            0%             0%
Pure RL       Images, SAC      0%             0%
Pure RL       Images, TD3      0%             0%
RL + LfD      Images           20%            20%
Residual RL   Dense            100%           76%
Residual RL   Images, SAC      100%           76%
Residual RL   Images, TD3      0%             0%
Human         P-Controller     52%            24%

Fig. 6: We report the average success out of 25 policy executions after training is finished for each method. For noisy goals, noise is added in form of perturbations of the goal location. Interaction with the environment allows learning methods to learn a robust feedback controller, allowing them to solve the task even in the presence of this noise. Residual RL with SAC tends to be the best performing method across all three connectors. For the Model-E connector, which is particularly difficult to align precisely, only residual RL manages to solve the task in the given amount of training time.

VI Results

In order to evaluate our experiments with dense and vision-based rewards, we analyze the achieved final distance to the goal throughout the training process. Policies trained with sparse rewards are compared based on their success rate because their training objective does not include the minimization of the distance to the goal. We report the success rate of all final policies and compare their robustness towards measurement noise in the goal location.

VI-A Vision-based Learning

The results of the vision-based experiment are shown in Fig. 5. Our experiments show that a successful and consistent vision-based insertion policy can be learned from relatively few samples using residual RL. This result suggests that goal-specification through images is a practical way to solve these types of industrial tasks.

Although image-based rewards are often very sparse and hard to learn from, in this case the distance between images corresponds to a relatively dense reward signal which is sufficient to distinguish the different stages of the insertion process.

Interestingly, during training with pure RL, the policy would sometimes learn to “hack” the reward signal by moving down in the image in front of or behind the socket. In contrast, the stabilizing human-engineered controller in residual RL provides sufficient horizontal control to prevent this and it also transforms the 3-dimensional task into a quasi 1-dimensional problem for the reinforcement learning algorithm, which explains the very good results obtained with residual RL in conjunction with vision-based rewards.

VI-B Learning From Sparse Rewards

In this experiment, we compare these methods on the USB insertion task with sparse rewards. The results are reported in Fig. 7. All methods are able to achieve very high success rates in the sparse setting, but the methods that use prior information learn about twice as fast. This result shows that we can learn precise industrial insertion tasks in sparse-reward settings, and sparse rewards can often be obtained much more easily than dense rewards. In fact, prior work has found that the final policy for sparse rewards can outperform the final policy for dense rewards as it does not suffer from a misspecified objective [49].

Fig. 7: Learning curves for solving the USB insertion task with a sparse reward are shown. In this experiment, ground truth state is given as observations to the agent. Residual RL and RL with learning from demonstrations both solve the task relatively quickly, while RL alone takes about twice as long to solve the task at the same level of performance.

USB Connector

Method        Reward / input   Perfect goal   Noisy goal
Pure RL       Dense            28%            20%
Pure RL       Sparse, SAC      16%            8%
Pure RL       Sparse, TD3      44%            28%
Pure RL       Images, SAC      36%            32%
Pure RL       Images, TD3      28%            28%
RL + LfD      Sparse           100%           32%
RL + LfD      Images           88%            60%
Residual RL   Dense            100%           84%
Residual RL   Sparse, SAC      88%            84%
Residual RL   Sparse, TD3      100%           36%
Residual RL   Images, SAC      100%           80%
Residual RL   Images, TD3      0%             0%
Human         P-Controller     100%           60%

Fig. 8: Average success rate on the USB insertion task.
Fig. 9: Plots of the final mean distance to the goal during the state-based training. Final distances greater than indicate unsuccessful insertions. Here, the residual RL approach performs noticeably better than pure RL and is often able to solve the task during the exploration in the early stages of the training.

VI-C Perfect State Information

The results of the experiment with perfect state information and dense rewards are shown in Fig. 9. In this case, the same conclusion holds: residual RL outperforms RL alone. Due to the shaped reward, RL makes more initial progress than it does with the other reward signals, but it never overcomes the friction required to insert the plug in fully and complete the task. It appears that the hand-designed reward function does not incentivize the full insertion enough, as we were able to obtain better results on the USB insertion with sparse rewards.

VI-D Robustness

In the previous set of experiments, the goal locations were known exactly. In this case, the hand-engineered controller performs well. However, once noise is added to the goal location, the deterministic P-controller no longer solves the task consistently. After training on perfect observations, a goal perturbation is created artificially and the controllers are tested under this condition. In the presence of a perturbation on the USB insertion task, the hand-engineered controller succeeds in only 60% of the trials, while the best policies with residual RL can still solve the task in 84% of the trials. A similar difference in success rates was observed on the D-Sub connector, where all policies showed a larger decrease in performance when noise was added. Remarkable performance was achieved by residual RL when trained on the Model-E connector insertion. There, the high precision requirements prevented the human-engineered controller from succeeding in more than 52% of the trials, even though the exact goal location was given. Residual RL consistently solved this task and even managed to handle goal perturbations in 76% of the trials. The results of our robustness evaluations are included in Figures 6 and 8. The agent demonstrably learns consistent, small corrective feedback behaviors in order to move in the right direction towards the descending path, a behavior that is very difficult to specify manually. The results of this experiment showcase the strength of residual RL. Since the human controller already specifies the general trajectory of the optimal policy, environment samples are only required to learn this corrective feedback behavior.

VI-E Exploration Comparison

All experiments were also performed using TD3 instead of SAC. The final success rates of these experiments are included in Figs. 6 and 8. When combined with residual RL, SAC and TD3 perform comparably. However, TD3 is often less robust. These results are likely explained by the nature of the exploration for the two algorithms. TD3 has a deterministic policy and fixed noise during training, so once it observes some high-reward states, it quickly learns to repeat that trajectory. SAC adapts the noise to the correct scale, helping SAC stay robust to small perturbations. Furthermore, we found that the actions output by TD3 approach the extreme values at the edge of the allowed action space. This suggests that TD3 finds a local minimum, from which it is difficult to improve further.

VII Discussion and Future Work

In this paper we studied deep reinforcement learning in a practical setting and demonstrated that this approach can solve complex industrial assembly tasks with tight tolerances, e.g., connector plug insertions. Compared to previous work [8], which uses dense reward signals, we showed that we can learn insertion policies from only sparse binary rewards, or even purely from goal images with only image observations. We conducted a series of experiments for various connector type assemblies under challenging conditions such as noisy goals and complex connector geometries. We found that residual RL consistently performs well in all of these settings. Our study demonstrates the feasibility of applying RL to industrial automation tasks, where reward shaping or perception may be difficult, but sparse rewards or image goals can often be provided.

There remain significant challenges for applying these techniques in more complex environments. One practical direction for future work is focusing on multi-stage assembly tasks through vision. This would pose a challenge to the goal-based policies as the background would be visually more complex. Moreover, multi-step tasks involve adapting to previous mistakes or inaccuracies, which could be difficult, but in theory should be able to be handled by RL. Extending the presented approach to multi-stage assembly tasks will pave the road to a higher robot autonomy in flexible manufacturing.

VIII Acknowledgements

This work was supported by the Siemens Corporation, the Office of Naval Research under a Young Investigator Program Award, and Berkeley DeepDrive.


  • [1] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with Deep Reinforcement Learning,” in NIPS Workshop on Deep Learning, 2013, pp. 1–9.
  • [2] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor,” in International Conference on Machine Learning, 2018.
  • [3] M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver, “Rainbow: Combining improvements in deep reinforcement learning,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [4] A. Y. Ng, D. Harada, and S. Russell, “Policy invariance under reward transformations: Theory and application to reward shaping,” in International Conference on Machine Learning (ICML), 1999.
  • [5] I. Popov, N. Heess, T. Lillicrap, R. Hafner, G. Barth-maron, M. Vecerik, T. Lampe, Y. Tassa, T. Erez, and M. Riedmiller, “Data-efficient Deep Reinforcement Learning for Dexterous Manipulation,” CoRR, vol. abs/1704.0, 2017.
  • [6] C. Daniel, M. Viering, J. Metz, O. Kroemer, and J. Peters, “Active Reward Learning,” in Robotics: Science and Systems (RSS), 2014.
  • [7] A. Nair, V. Pong, M. Dalal, S. Bahl, S. Lin, and S. Levine, “Visual Reinforcement Learning with Imagined Goals,” in Advances in Neural Information Processing Systems (NeurIPS), 2018.
  • [8] T. Johannink, S. Bahl, A. Nair, J. Luo, A. Kumar, M. Loskyll, J. Aparicio Ojea, E. Solowjow, and S. Levine, “Residual Reinforcement Learning for Robot Control,” in IEEE International Conference on Robotics and Automation (ICRA), 2019.
  • [9] T. Silver, K. Allen, J. Tenenbaum, and L. Kaelbling, “Residual Policy Learning,” Dec 2018.
  • [10] D. A. Pomerleau, “Alvinn: An autonomous land vehicle in a neural network,” in Advances in Neural Information Processing Systems (NIPS), 1989, pp. 305–313.
  • [11] J. Nakanishi, J. Morimoto, G. Endo, G. Cheng, S. Schaal, and M. Kawato, “Learning from demonstration and adaptation of biped locomotion,” in Robotics and Autonomous Systems, vol. 47, no. 2-3, 2004, pp. 79–91.
  • [12] M. P. Deisenroth, C. E. Rasmussen, and D. Fox, “Learning to Control a Low-Cost Manipulator using Data-Efficient Reinforcement Learning,” Robotics: Science and Systems (RSS), vol. VII, pp. 57–64, 2011.
  • [13] L. Pinto and A. Gupta, “Supersizing Self-supervision: Learning to Grasp from 50K Tries and 700 Robot Hours,” in IEEE International Conference on Robotics and Automation (ICRA), 2016.
  • [14] A. Giusti, J. J. Guzzi, D. C. Ciresan, F.-L. He, J. P. Rodríguez, F. Fontana, M. Faessler, C. Forster, J. J. Schmidhuber, G. D. Caro, D. Scaramuzza, and L. M. Gambardella, “A Machine Learning Approach to Visual Perception of Forest Trails for Mobile Robots,” in IEEE Robotics and Automation Letters, vol. 1, no. 2, 2015, pp. 2377–3766.
  • [15] D. Pathak, P. Mahmoudieh, G. Luo, P. Agrawal, D. Chen, Y. Shentu, E. Shelhamer, J. Malik, A. A. Efros, and T. Darrell, “Zero-Shot Visual Imitation,” in International Conference on Learning Representations (ICLR), 2018.
  • [16] M. P. Deisenroth and C. E. Rasmussen, “PILCO: A model-based and data-efficient approach to policy search,” in International Conference on Machine Learning (ICML), 2011, pp. 465–472.
  • [17] J. Peters, K. Mülling, and Y. Altün, “Relative Entropy Policy Search,” in AAAI Conference on Artificial Intelligence, 2010, pp. 1607–1612.
  • [18] J. Peters and S. Schaal, “Reinforcement learning of motor skills with policy gradients,” Neural Networks, vol. 21, no. 4, pp. 682–697, 2008.
  • [19] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-End Training of Deep Visuomotor Policies,” Journal of Machine Learning Research (JMLR), vol. 17, no. 1, pp. 1334–1373, 2016.
  • [20] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine, “Continuous Deep Q-Learning with Model-based Acceleration,” in International Conference on Machine Learning (ICML), 2016.
  • [21] Y. Chebotar, K. Hausman, M. Zhang, G. Sukhatme, S. Schaal, and S. Levine, “Combining Model-Based and Model-Free Updates for Trajectory-Centric Reinforcement Learning,” in International Conference on Machine Learning (ICML), 2017.
  • [22] M. Zhang, S. Vikram, L. Smith, P. Abbeel, M. J. Johnson, and S. Levine, “SOLAR: Deep Structured Representations for Model-Based Reinforcement Learning,” in International Conference on Machine Learning (ICML), Aug 2019.
  • [23] H. Zhu, A. Gupta, A. Rajeswaran, S. Levine, and V. Kumar, “Dexterous Manipulation with Deep Reinforcement Learning: Efficient, General, and Low-Cost,” in International Conference on Robotics and Automation (ICRA), Oct 2018.
  • [24] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, and S. Levine, “QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation,” in Conference on Robot Learning (CoRL), 2018.
  • [25] G. Thomas, M. Chien, A. Tamar, J. A. Ojea, and P. Abbeel, “Learning Robotic Assembly from CAD,” in IEEE International Conference on Robotics and Automation (ICRA), 2018.
  • [26] V. Eruhimov and W. Meeussen, “Outlet detection and pose estimation for robot continuous operation,” in 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2011, pp. 2941–2946.
  • [27] B. Mayton, L. LeGrand, and J. R. Smith, “Robot, feed thyself: Plugging in to unmodified electrical outlets by sensing emitted ac electric fields,” 2010, pp. 715–722.
  • [28] J. Kober and J. Peters, “Policy search for motor primitives in robotics,” in Advances in Neural Information Processing Systems (NIPS), vol. 97, 2008, pp. 83–117.
  • [29] C. Finn, S. Levine, and P. Abbeel, “Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization,” in International Conference on Machine Learning (ICML), 2016.
  • [30] P. Abbeel and A. Y. Ng, “Apprenticeship learning via inverse reinforcement learning,” in International Conference on Machine Learning (ICML), 2004, p. 1.
  • [31] B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey, “Maximum Entropy Inverse Reinforcement Learning,” in AAAI Conference on Artificial Intelligence, 2008, pp. 1433–1438.
  • [32] J. Ho and S. Ermon, “Generative Adversarial Imitation Learning,” in Advances in Neural Information Processing Systems (NIPS), 2016.
  • [33] N. Rhinehart and K. M. Kitani, “First-person activity forecasting with online inverse reinforcement learning,” in The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [34] T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan, J. Quan, A. Sendonaris, G. Dulac-Arnold, I. Osband, J. Agapiou, J. Z. Leibo, and A. Gruslys, “Learning from Demonstrations for Real World Reinforcement Learning,” in AAAI Conference on Artificial Intelligence, 2018.
  • [35] A. Nair, B. Mcgrew, M. Andrychowicz, W. Zaremba, and P. Abbeel, “Overcoming Exploration in Reinforcement Learning with Demonstrations,” in IEEE International Conference on Robotics and Automation (ICRA), 2018.
  • [36] A. Rajeswaran, V. Kumar, A. Gupta, J. Schulman, E. Todorov, and S. Levine, “Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations,” in Robotics: Science and Systems, 2018.
  • [37] M. Večerík, T. Hester, J. Scholz, F. Wang, O. Pietquin, B. Piot, N. Heess, T. Rothörl, T. Lampe, and M. Riedmiller, “Leveraging Demonstrations for Deep Reinforcement Learning on Robotics Problems with Sparse Rewards,” CoRR, vol. abs/1707.0, 2017.
  • [38] J. Fu, A. Singh, D. Ghosh, L. Yang, and S. Levine, “Variational Inverse Control with Events: A General Framework for Data-Driven Reward Definition,” in Neural Information Processing Systems (NeurIPS), May 2018.
  • [39] A. Singh, L. Yang, K. Hartikainen, C. Finn, and S. Levine, “End-to-End Robotic Reinforcement Learning without Reward Engineering,” in Robotics: Science and Systems (RSS), Apr 2019.
  • [40] R. Li, R. Platt, W. Yuan, A. Ten Pas, N. Roscup, M. A. Srinivasan, and E. Adelson, “Localization and Manipulation of Small Parts Using GelSight Tactile Sensing,” in International Conference on Intelligent Robots and Systems (IROS), 2014.
  • [41] A. Tamar, G. Thomas, T. Zhang, S. Levine, and P. Abbeel, “Learning from the hindsight plan — episodic mpc improvement,” in 2017 IEEE International Conference on Robotics and Automation (ICRA), 2017, pp. 336–343.
  • [42] T. Inoue, G. De Magistris, A. Munawar, T. Yokoya, and R. Tachibana, “Deep reinforcement learning for high precision assembly tasks,” in Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on. IEEE, 2017, pp. 819–825.
  • [43] J. Luo, E. Solowjow, C. Wen, J. Aparicio Ojea, A. Agogino, A. Tamar, and A. P, “Reinforcement learning on variable impedance controller for high-precision robotic assembly,” in Robotics and Automation (ICRA), 2019 IEEE International Conference on. IEEE, 2019.
  • [44] J. Falco, Y. Sun, and M. Roa, “Robotic grasping and manipulation competition: Competitor feedback and lessons learned,” in Robotic Grasping and Manipulation, Y. Sun and J. Falco, Eds. Cham: Springer International Publishing, 2018, pp. 180–189.
  • [45] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” in International Conference on Learning Representations (ICLR), 2016.
  • [46] S. Fujimoto, H. van Hoof, and D. Meger, “Addressing Function Approximation Error in Actor-Critic Methods,” in International Conference on Machine Learning (ICML), 2018.
  • [47] H. Van Hasselt, A. Guez, and D. Silver, “Deep Reinforcement Learning with Double Q-learning,” in Association for the Advancement of Artificial Intelligence (AAAI), 2016.
  • [48] V. Pong, S. Gu, M. Dalal, and S. Levine, “Temporal Difference Models: Model-Free Deep RL For Model-Based Control,” in International Conference on Learning Representations (ICLR), 2018.
  • [49] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. Mcgrew, J. Tobin, P. Abbeel, and W. Zaremba, “Hindsight Experience Replay,” in Advances in Neural Information Processing Systems (NIPS), 2017.