1 Introduction
The latest generation of collaborative robots is designed to eliminate cumbersome path programming by allowing humans to kinesthetically guide a robot through a desired motion. This approach dramatically reduces the time and expertise required to get a robot to solve a novel task, but there is still a fundamental dependence on scripted trajectories. Consider the task of inserting a wire into a connector: it is difficult to imagine any predefined motion which can handle variability in wire shape and stiffness. To solve these sorts of tasks, it is desirable to have a richer control policy which considers a large amount of feedback including states, forces, and even raw images. Reinforcement Learning (RL) offers, in principle, a method to learn such policies from exploration, but the amount of actual exploration required has prohibited its use in real applications. In this paper we address this challenge by combining the demonstration and RL paradigms into a single framework which uses kinesthetic demonstrations to guide a deep RL algorithm. Our long-term vision is for it to be possible to provide a few minutes of demonstrations, and have the robot rapidly and safely learn a policy to solve arbitrary manipulation tasks.
The primary alternative to demonstrations for guiding RL agents in continuous control tasks is reward shaping. Shaping is typically achieved using a hand-coded function, such as Cartesian distance to a goal site, which provides a smoothly varying reward signal for every state the agent visits. While attractive in theory, reward shaping can lead to bizarre behavior or premature convergence to local minima, and in practice requires considerable engineering and experimentation to get right [9]. By contrast, it is often quite natural to express a task goal as a sparse reward function, e.g. +1 if the wire is inserted, and 0 otherwise. Our central contribution is to show that off-policy, replay-memory-based RL (e.g. DDPG) is a natural vehicle for injecting demonstration data into sparse-reward tasks, and that it obviates the need for reward shaping. In contrast to on-policy RL algorithms, such as classical policy gradient, DDPG can accept and learn from arbitrary transition data. Furthermore, the replay memory allows the agent to maintain these transitions for long enough to propagate the sparse rewards throughout the value function.
We present results of simulation experiments on a set of robot insertion problems involving rigid and flexible objects. We then demonstrate the viability of our approach on a real robot task consisting of inserting a clip (flexible object) into a rigid object. This task is realized by a Sawyer robotic arm, using demonstrations collected by kinesthetically controlling an arm by the wrist. Our results suggest that sparse rewards and a few human demonstrations are a practical alternative to shaping for teaching robots to solve challenging continuous control tasks.
2 Background
This section provides mathematical background for Markov Decision Processes (MDPs), DDPG, and deep RL techniques such as prioritized replay and $n$-step return.

We adopt the standard Markov Decision Process (MDP) formalism for this work [15]. An MDP is defined by a tuple $\langle S, A, R, P, \gamma \rangle$, which consists of a set of states $S$, a set of actions $A$, a reward function $R(s, a)$, a transition function $P(s'|s, a)$, and a discount factor $\gamma$. In each state $s \in S$, the agent takes an action $a \in A$. Upon taking this action, the agent receives a reward $R(s, a)$ and reaches a new state $s'$, determined from the probability distribution $P(s'|s, a)$. A deterministic and stationary policy $\pi$ specifies for each state which action the agent will take. The goal of the agent is to find the policy $\pi$ mapping states to actions that maximizes the expected discounted total reward over the agent's lifetime. This concept is formalized by the action-value function: $Q^\pi(s, a) = \mathbb{E}^\pi\left[\sum_{t=0}^{+\infty} \gamma^t R(s_t, a_t)\right]$, where $\mathbb{E}^\pi$ is the expectation over the distribution of the admissible trajectories $(s_0, a_0, s_1, a_1, \dots)$ obtained by executing the policy $\pi$ starting from $s_0 = s$ and $a_0 = a$. Here, we are interested in continuous control problems, and take an actor-critic approach in which both components are represented using neural networks. These methods consist in maximizing a mean value $J(\theta) = \mathbb{E}_{s \sim \mu}\left[Q^{\pi(\cdot|\theta)}(s, \pi(s|\theta))\right]$ with respect to parameters $\theta$ that parameterise the policy and where $\mu$ is an initial state distribution. To do so, a gradient approach is considered and the parameters $\theta$ are updated as follows: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$.

Deep Deterministic Policy Gradient (DDPG) [7] is an actor-critic algorithm which directly uses the gradient of the Q-function w.r.t. the action to train the policy. DDPG maintains a parameterized policy network $\pi(\cdot|\theta^\pi)$ (actor function) and a parameterized action-value function network (critic function) $Q(\cdot|\theta^Q)$. It produces new transitions $e = (s, a, r = R(s, a), s' \sim P(\cdot|s, a))$ by acting according to $a = \pi(s|\theta^\pi) + \mathcal{N}$, where $\mathcal{N}$ is a random process allowing action exploration. Those transitions are added to a replay buffer $B$. To update the action-value network, a one-step off-policy evaluation is used and consists of minimizing the following loss:

$$L_1(\theta^Q) = \mathbb{E}_{(s, a, r, s') \sim D}\left[R_1 - Q(s, a|\theta^Q)\right]^2 \qquad (1)$$

where $D$ is a distribution over transitions $e$ contained in a replay buffer and the one-step return $R_1$ is defined as: $R_1 = r + \gamma Q'(s', \pi'(s')|\theta^{Q'})$.
Here $Q'(\cdot|\theta^{Q'})$ and $\pi'(\cdot|\theta^{\pi'})$ are the associated target networks of $Q(\cdot|\theta^Q)$ and $\pi(\cdot|\theta^\pi)$, which stabilize the learning (they are updated every $N'$ steps to the values of their associated networks). To update the policy network a gradient step is taken with respect to:

$$\nabla_{\theta^\pi} J(\theta^\pi) \approx \mathbb{E}_{s, a \sim D}\left[\nabla_a Q(s, a|\theta^Q)\big|_{a = \pi(s|\theta^\pi)} \nabla_{\theta^\pi} \pi(s|\theta^\pi)\right] \qquad (2)$$
The offpolicy nature of the algorithm allows the use of arbitrary data such as human demonstrations.
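As a rough illustration of the critic update described above, the following sketch computes the one-step return from target networks and the mean squared one-step error over a batch. It is a minimal sketch under stated assumptions, not the authors' implementation: states and actions are scalars, the networks are replaced by hypothetical plain Python callables, and the `done` flag (no bootstrapping on terminal transitions) is an added convention that the text does not spell out.

```python
from typing import Callable, Sequence, Tuple

# (state, action, reward, next_state, done); scalars keep the example small.
Transition = Tuple[float, float, float, float, bool]
Actor = Callable[[float], float]
Critic = Callable[[float, float], float]


def one_step_return(r: float, s_next: float, done: bool,
                    target_actor: Actor, target_critic: Critic,
                    gamma: float) -> float:
    """One-step return: reward plus discounted target-critic value at s'."""
    if done:  # assumption: terminal transitions are not bootstrapped
        return r
    return r + gamma * target_critic(s_next, target_actor(s_next))


def critic_loss(batch: Sequence[Transition], critic: Critic,
                target_actor: Actor, target_critic: Critic,
                gamma: float) -> float:
    """Mean squared one-step error over a batch of replayed transitions."""
    errors = [
        one_step_return(r, s_next, done, target_actor, target_critic, gamma)
        - critic(s, a)
        for (s, a, r, s_next, done) in batch
    ]
    return sum(e * e for e in errors) / len(errors)


# Toy stand-ins for the networks (hypothetical, for illustration only).
critic = lambda s, a: 0.0
target_critic = lambda s, a: 1.0
target_actor = lambda s: 0.0
batch = [(0.0, 0.0, 1.0, 0.0, False), (0.0, 0.0, 1.0, 0.0, True)]
loss = critic_loss(batch, critic, target_actor, target_critic, gamma=0.9)
```

Because the loss only needs stored transitions, the same function applies unchanged to agent data and to demonstration data.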
Our experiments made use of several general techniques from the deep RL literature which significantly improved the overall performance of DDPG on our test domains. As we discuss in Sec. 5, these improvements had a particularly large impact when combined with demonstration data.
3 DDPG from Demonstrations
Our algorithm modifies DDPG to take advantage of demonstrations. The demonstrations are of the form of RL transitions: $(s, a, r, s')$. DDPGfD loads the demonstration transitions into the replay buffer before the training begins and keeps all transitions forever.
DDPGfD uses prioritized replay to enable efficient propagation of the reward information, which is essential in problems with sparse rewards. Prioritized experience replay [13] modifies the agent to sample more important transitions from its replay buffer more frequently. The probability of sampling a particular transition $i$ is proportional to its priority, $P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}$, where $p_i$ is the priority of the transition. DDPGfD uses $p_i = \delta_i^2 + \lambda_3 |\nabla_a Q(s_i, a_i|\theta^Q)|^2 + \epsilon + \epsilon_D$, where $\delta_i$ is the last TD error calculated for this transition, the second term represents the loss applied to the actor, $\epsilon$ is a small positive constant to ensure all transitions are sampled with some probability, $\epsilon_D$ is a positive constant for demonstration transitions to increase their probability of getting sampled, and $\lambda_3$ is used to weight the contributions. To account for the change in the distribution, updates to the network are weighted with importance sampling weights, $w_i = \left(\frac{1}{N} \cdot \frac{1}{P(i)}\right)^\beta$. DDPGfD uses fixed values of $\alpha$ and $\beta$, as we want to learn about the correct distribution from the very beginning. In addition, the prioritized replay is used to prioritize samples between the demonstration and agent data, controlling the ratio of data between the two in a natural way.
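The prioritization scheme described above can be sketched in a few lines. This is an illustrative sketch only: the function and parameter names (`lambda3`, `eps`, `eps_demo`) are hypothetical labels for the constants described in the text, the numeric values in the usage lines are arbitrary, and a production implementation would typically use a sum-tree rather than recomputing all probabilities.

```python
import random
from typing import List, Sequence


def priority(td_error: float, actor_grad_sq_norm: float, is_demo: bool,
             lambda3: float, eps: float, eps_demo: float) -> float:
    """Squared TD error + weighted actor term + constants (demo bonus if demo)."""
    bonus = eps_demo if is_demo else 0.0
    return td_error ** 2 + lambda3 * actor_grad_sq_norm + eps + bonus


def sampling_probabilities(priorities: Sequence[float], alpha: float) -> List[float]:
    """P(i) proportional to priority_i ** alpha."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [x / total for x in scaled]


def importance_weights(probs: Sequence[float], beta: float) -> List[float]:
    """w_i = (1 / (N * P(i))) ** beta, correcting for non-uniform sampling."""
    n = len(probs)
    return [(1.0 / (n * p)) ** beta for p in probs]


def sample_indices(probs: Sequence[float], batch_size: int,
                   rng: random.Random) -> List[int]:
    """Draw transition indices with replacement according to P(i)."""
    return rng.choices(range(len(probs)), weights=probs, k=batch_size)


# Arbitrary illustrative values: one agent transition and one demo transition
# with the same TD error; the demo bonus makes the demo more likely to be drawn.
prios = [priority(0.1, 0.0, False, 1.0, 0.01, 1.0),
         priority(0.1, 0.0, True, 1.0, 0.01, 1.0)]
probs = sampling_probabilities(prios, alpha=1.0)
weights = importance_weights(probs, beta=1.0)
batch = sample_indices(probs, batch_size=4, rng=random.Random(0))
```

Note how the demonstration bonus enters only through the priority, so the ratio of demonstration to agent samples is controlled implicitly rather than by a fixed quota.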
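A second modification for the sparse reward case is to use a mix of 1-step and $n$-step returns when updating the critic function. Incorporating $n$-step returns helps propagate the Q-values along the trajectories. The $n$-step return loss consists of using rollouts (forward view) of size $n$ of a policy $\hat{\pi}$ close to the current policy $\pi(\cdot|\theta^\pi)$ in order to evaluate the action-value function $Q(\cdot|\theta^Q)$. The idea is to minimize the difference between the action-value at state $s$ and the return of a rollout $(s_i, a_i = \hat{\pi}(s_i), s'_i \sim P(\cdot|s_i, a_i), r_i)_{i=0}^{n-1}$ of size $n$ starting from $s$ and following $\hat{\pi}$. The $n$-step return has the following form: $R_n = \sum_{i=0}^{n-1} \gamma^i r_i + \gamma^n Q'(s'_{n-1}, \pi'(s'_{n-1})|\theta^{Q'})$. The loss corresponding to this particular rollout is then: $L_n(\theta^Q) = \frac{1}{2}\left(R_n - Q(s, \hat{\pi}(s)|\theta^Q)\right)^2$.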
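To make the n-step return concrete, here is a small sketch that accumulates the discounted rewards of a rollout and bootstraps from a value estimate at the final state. The function names are hypothetical, the bootstrap value is passed in as a plain number rather than computed by a target network, and the factor of one half in the loss is an assumed convention.

```python
from typing import Sequence


def n_step_return(rewards: Sequence[float], bootstrap_value: float,
                  gamma: float) -> float:
    """Discounted sum of n rewards plus gamma**n times the bootstrap value."""
    n = len(rewards)
    discounted = sum((gamma ** i) * r for i, r in enumerate(rewards))
    return discounted + (gamma ** n) * bootstrap_value


def n_step_loss(q_value: float, rewards: Sequence[float],
                bootstrap_value: float, gamma: float) -> float:
    """Half squared difference between the n-step return and Q at the start."""
    target = n_step_return(rewards, bootstrap_value, gamma)
    return 0.5 * (target - q_value) ** 2


# A sparse-reward rollout: the reward only arrives on the last of three steps.
ret = n_step_return([0.0, 0.0, 1.0], bootstrap_value=2.0, gamma=0.5)
loss = n_step_loss(0.0, [0.0, 0.0, 1.0], bootstrap_value=2.0, gamma=0.5)
```

With a single reward in the list this reduces to the one-step return, which is why the two losses can be mixed in the same critic update.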
A third modification is to do multiple learning updates per environment step. If a single learning update per environment step is used, each transition will only be sampled as many times as the size of the minibatch. Choosing a balance between gathering fresher data and doing more learning is in general a complicated trade-off. If our data is stale, the samples from the replay buffer no longer represent the distribution of states our current policy would experience. This can lead to wrong Q-values in states which were not previously visited and potentially cause our policy and values to diverge. However, in our case we require data efficiency and therefore we need to use each transition several times. In our experiments, we could increase the number of learning updates per environment step without affecting the per-update learning efficiency. In practice, we used a value which provided a good balance between learning from previous interaction (data efficiency) and stability.
Finally, L2 regularization on the parameters of the actor and the critic networks is added to stabilize the final learning performance.
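The final loss can be written as:

$$L_{Critic}(\theta^Q) = L_1(\theta^Q) + \lambda_1 L_n(\theta^Q) + \lambda_2 L^C_{reg}(\theta^Q) \qquad (3)$$

$$\nabla_{\theta^\pi} L_{Actor}(\theta^\pi) = -\nabla_{\theta^\pi} J(\theta^\pi) + \lambda_2 \nabla_{\theta^\pi} L^A_{reg}(\theta^\pi) \qquad (4)$$

where $L^C_{reg}$ and $L^A_{reg}$ denote the L2 regularization losses on the critic and actor weights, and $\lambda_1$ and $\lambda_2$ weight the respective terms.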
To summarize, we modified the original DDPG algorithm in the following ways:

- Transitions from a human demonstrator are added to the replay buffer.
- Prioritized replay is used for sampling transitions across both the demonstration and agent data.
- A mix of 1-step and n-step return losses is used.
- Multiple learning updates are performed per environment step.
- L2 regularization losses on the weights of the critic and the actor are used.
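The way these modifications fit together in the training loop can be sketched as follows. This is a schematic sketch under simplifying assumptions, not the authors' code: `env_step` and `learn` are hypothetical callables standing in for environment interaction and for one combined critic/actor update, and sampling is uniform here for brevity where the algorithm uses prioritized sampling.

```python
import random
from typing import Callable, List, Sequence, Tuple

Transition = Tuple[float, float, float, float]  # (s, a, r, s_next)


def train(demos: Sequence[Transition],
          env_step: Callable[[], Transition],
          learn: Callable[[List[Transition]], None],
          num_env_steps: int,
          updates_per_step: int,
          batch_size: int,
          seed: int = 0) -> List[Transition]:
    """Preload demonstrations, then interleave acting and several updates."""
    rng = random.Random(seed)
    buffer = list(demos)  # demonstrations are loaded before training starts
    for _ in range(num_env_steps):
        buffer.append(env_step())  # no transition is ever evicted
        for _ in range(updates_per_step):
            # Uniform sampling for brevity; the text uses prioritized replay.
            batch = rng.choices(buffer, k=batch_size)
            learn(batch)
    return buffer


# Hypothetical stand-ins: a constant environment and a learner that only
# records the batches it receives.
calls: List[List[Transition]] = []
demos = [(0.0, 0.0, 0.0, 1.0), (1.0, 0.0, 1.0, 2.0)]
buffer = train(demos, env_step=lambda: (0.0, 0.0, 0.0, 0.0),
               learn=calls.append, num_env_steps=3,
               updates_per_step=4, batch_size=2)
```

Raising `updates_per_step` reuses each stored transition more often, which is the data-efficiency versus staleness trade-off discussed above.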
4 Experimental setup
Our approach is designed for problems in which it is easy to specify a goal state, but difficult to specify a smooth distance function for reward shaping that does not lead to suboptimal behavior. One example of this is insertion tasks in which the goal state for the plug is at the bottom of a socket, but the only path to reach it, and therefore the focus of exploration, is at the socket opening. While this may sound like a minor distinction, we found in our initial experiments that DDPG with a simple goal-distance reward would quickly find a path to a local minimum on the outside of the socket, and fail to ever explore around the opening.
We therefore sought to design a set of insertion tasks that presented a range of exploration difficulties. Our tasks are illustrated in Fig. 1. The first (Fig. 1(a)) is a classic peg-in-hole task, in which both bodies are rigid, and the plug is free to rotate along the insertion axis. The second (Fig. 1(b)) models a drive-insertion problem into an ATX-style computer chassis. Both bodies are again rigid, but in this case the drive orientation is relevant. The third task (Fig. 1(c)) models the problem of inserting a two-pronged deformable plastic clip into a housing. The clip is modeled as three separate bodies with hinge joints at the base of each prong. These joints are spring-loaded, and the resting state pinches inwards, as is common with physical connectors, to maintain pressure on the housing. The final task (Fig. 1(d)) is a simplified cable insertion task in which the plug is modeled as a 20-link chain of capsules coupled by ball joints. This cable is highly underactuated, but otherwise shares the same task specification as the peg-in-hole task.
We created two reward functions for our experiments. The first is a sparse reward function which returned a fixed positive reward $r_{+}$ if the plug was within a small tolerance of the goal site(s) on the socket, and 0 otherwise:

$$r = \begin{cases} r_{+}, & \sum_{i \in \text{sites}} \|W_g (g_i - x_i)\|_2 < \epsilon \\ 0, & \text{otherwise} \end{cases} \qquad (5)$$

where $x_i$ is the position of the $i$-th tip site on the plug, $g_i$ is the $i$-th goal site on the socket, $W_g$ contains weighting coefficients for the goal site error vector, and $\epsilon$ is a proximity threshold. If this tolerance was reached, the robot received the reward signal and the episode was immediately terminated.

The second reward function is a shaped reward which composes terms for two movement phases: a reaching phase to align the plug to the socket opening, and an inserting phase to reach the socket goal. Both terms compute a weighted distance between the plug tip(s) and their respective goal site(s). The distance from the goal to the opening site (i.e. the maximum value of the inserting-phase distance) is added to the reaching-phase distance during the reaching phase, such that the reward monotonically increases throughout an insertion:
(6)  
(7)  
(8) 
where $g$ is the goal site, $o$ is the opening site, $W_g$ and $W_o$ are weighting coefficients for the goal and opening site errors, respectively, $\mathbb{1}$ is the indicator function, and the remaining constants are scaling parameters for log-transforming these distances into rewards within a fixed range. Note that tuning the weighting of each dimension in $W_g$ and $W_o$ must be done very carefully for the agent to learn the real desired task. In addition, the shaping of both stages must be balanced out in a delicate manner.
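For comparison with the shaped reward, the sparse reward is simple enough to sketch directly. The sketch below assumes per-axis weights applied to the tip-to-goal error vector, summed Euclidean norms over sites, and a caller-supplied success value; the function names, the weighting convention, and the numbers in the usage lines are illustrative assumptions rather than the values used in the experiments.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def weighted_site_error(tips: Sequence[Point], goals: Sequence[Point],
                        weights: Point) -> float:
    """Sum over sites of the norm of the per-axis weighted tip-to-goal error."""
    total = 0.0
    for x, g in zip(tips, goals):
        total += math.sqrt(sum((w * (gi - xi)) ** 2
                               for w, gi, xi in zip(weights, g, x)))
    return total


def sparse_reward(tips: Sequence[Point], goals: Sequence[Point],
                  weights: Point, threshold: float,
                  success_reward: float) -> Tuple[float, bool]:
    """Return (reward, done): success_reward inside the tolerance, else 0."""
    reached = weighted_site_error(tips, goals, weights) < threshold
    return (success_reward if reached else 0.0), reached


# Illustrative values: a single tip site 1 cm above its goal, 2 cm tolerance.
reward, done = sparse_reward([(0.0, 0.0, 0.0)], [(0.0, 0.0, 0.01)],
                             weights=(1.0, 1.0, 1.0), threshold=0.02,
                             success_reward=1.0)
```

The returned `done` flag mirrors the rule that the episode terminates as soon as the tolerance is reached.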
All tasks utilized a single vertically mounted robot arm. The robot was a Sawyer 7-DOF torque-controlled arm from Rethink Robotics, instrumented with a cuff for kinesthetic teaching. We utilized the MuJoCo simulator [19] to simulate the Sawyer using publicly available kinematics and mesh files. In the simulation experiments the actions were joint velocities, the rewards were sparse or shaped as described above, and the observations included joint position and velocity, joint-torque feedback, and the global pose of the socket and plug. In both the simulation and real-world experiments the object being inserted was rigidly attached to the gripper, and the socket was fixed to a table top.
In addition to the four simulation tasks, we also constructed a real-world clip insertion problem using a physical Sawyer robot. In the real robot experiment the clip was rigidly mounted to the robot gripper using a 3D-printed attachment. The socket position was provided to the robot, and rewards were computed by evaluating the distance from the clip prongs (available via the robot's kinematics) to the goal sites in the socket as described above. In real robot experiments the observations included the robot joint position and velocity, gravity-compensated torque feedback from the joints, and the relative pose of the plug tip sites in the socket opening site frames.
4.1 Demonstration data collection
To collect the demonstration data in simulated tasks, we used a Sawyer robotic arm. The arm was kinesthetically force-controlled by a human demonstrator. In simulation, an agent ran a hard-coded joint-space P-controller to match the joint positions of the simulated Sawyer robot to the joint positions of the real one. This agent used the same action space as the DDPGfD agent, which allowed the demonstration transitions to be added directly to the agent's replay buffer.
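A joint-space P-controller of the kind described above can be sketched as a one-liner per joint: the velocity command is proportional to the difference between the real arm's joint position and the simulated arm's joint position. The gain value and the velocity clipping below are assumptions added for illustration; the text only states that a hard-coded P-controller in the agent's action space was used.

```python
from typing import List, Sequence


def p_controller(q_real: Sequence[float], q_sim: Sequence[float],
                 gain: float, v_max: float) -> List[float]:
    """Joint-velocity command that drives the simulated joints toward the
    real arm's joint positions, clipped to +/- v_max (clipping is assumed)."""
    return [max(-v_max, min(v_max, gain * (qr - qs)))
            for qr, qs in zip(q_real, q_sim)]


# Illustrative values: the first joint lags by 0.5 rad, the second matches.
action = p_controller(q_real=[1.0, 0.0], q_sim=[0.5, 0.0],
                      gain=2.0, v_max=0.5)
```

Since the output is a joint-velocity vector, each controller step yields a transition in the same action space as the learning agent.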
To provide demonstrations for the real-world task we used the same setup, this time controlling a second robotic arm. Separating the arm we were controlling and the arm which solved the task ensured that the demonstrator did not affect the dynamics of the environment from the agent's perspective. For each experiment, we collected 100 episodes of human demonstrations which were on average about 25 steps (s) long. This involved a total of 10-15 minutes of robot interaction time per task.
5 Results
In our first experiment we compared our approach to DDPG on sparse and shaped variants of the four simulated robotic tasks presented in Sec. 4. In addition, we show rewards for the demonstrations themselves as well as supervised imitation of the demonstrations. The DDPG implementation utilized all of the optimizations we incorporated into DDPGfD, including prioritized replay, n-step returns, and L2 regularization. For each task we evaluated the agent with both the shaped and sparse versions of the reward, with results shown in Figure 3. All traces plot the shaped-reward value achieved, regardless of which reward was given to the agent. All of these experiments were performed with fixed hyperparameters, tuned in advance.
We can see that in the case where we have hand-tuned shaping rewards all algorithms can solve the task. The results show that DDPGfD always outperforms DDPG, even when DDPG is given a well-tuned shaping reward. In contrast, DDPGfD learns nearly as well with sparse rewards as with shaping rewards. DDPGfD even outperforms DDPG on the hard-drive insertion task, where the demonstrations are relatively poor. In general, DDPGfD not only learns to solve the task, but learns to solve it more efficiently than the demonstrations, usually learning to insert the object in 2-4x fewer steps than the demonstrations. DDPGfD also learns more reliably, as the percentile plots are much wider for DDPG. Doing purely supervised learning of the demonstration policy performs poorly in every task.
In our second experiment we examined the effect of varying the quantity of demonstration data on agent performance. Fig. 4(a) compares learning curves for DDPGfD agents initialized with 1, 2, 3, 5, 10, and 100 expert trajectories on the sparse-reward clip-insertion task. DDPGfD is capable of solving this task with only a single demonstration, and we see diminishing returns with 50-100 demonstrations. This was surprising, since each demonstration contains only one state transition with nonzero reward.
Finally, we show results of DDPGfD learning the clip insertion task on a physical Sawyer robot in Figure 4(b). DDPGfD was able to learn a robust insertion policy on the real robot. DDPGfD with sparse rewards outperforms shaped DDPG, showing that DDPGfD achieves faster learning without the extra engineering.
A video demonstrating the performance can be viewed here: https://www.youtube.com/watch?v=WGJwLfeVN9w
6 Related work
Imitation learning is primarily concerned with matching expert demonstrations. Our work combines imitation learning with learning from task rewards, so that the agent is able to improve upon the demonstrations it has seen. Imitation learning can be cast into a supervised learning problem (like classification) [10, 11]. One popular imitation learning algorithm is DAGGER [12], which iteratively produces new policies based on polling the expert policy outside its original state space. This leads to no-regret over validation data in the online learning sense. DAGGER requires the expert to be available during training to provide additional feedback to the agent.

Imitation can also be achieved through inverse optimal control or inverse RL. The main principle is to learn a cost or a reward function under which the demonstration data is optimal. For instance, in [16, 17] the inverse RL problem is cast into a two-player zero-sum game where one player chooses policies and the other chooses reward functions. However, this approach does not scale to continuous state-action spaces and requires knowledge of the dynamics. To address continuous state spaces and unknown dynamics, [5] solve inverse RL by combining classification and regression. Yet it is restricted to discrete action spaces. Demonstrations have also been used for inverse optimal control in high-dimensional, continuous robotic control problems [1]. However, these approaches only do imitation learning and do not allow for learning from task rewards.
Guided Cost Learning (GCL) [1] and Generative Adversarial Imitation Learning (GAIL) [4] are the first efficient imitation learning algorithms to learn from high-dimensional inputs without knowledge of the dynamics and hand-crafted features. They have a very similar algorithmic structure which consists of matching the distribution of the expert trajectories. To do so, they simultaneously learn the reward and the policy that imitates the expert demonstrations. At each step, sampled trajectories of the current policy and the expert policy are used to produce a reward function. Then, this reward is (partially) optimized to produce an updated policy and so on. In GAIL, the reward is obtained from a network trained to discriminate between expert trajectories and (partial) trajectories sampled from a generator (the policy), which is itself trained by TRPO [14]. In GCL, the reward is obtained by minimization of the Maximum Entropy IRL cost [20] and one could use any RL algorithm (DDPG, TRPO, etc.) to optimize this reward.
Control in continuous state-action domains typically uses smooth shaped rewards that are designed to be amenable to classical analysis yielding closed-form solutions. Such requirements might be difficult to meet in real-world applications. For instance, iterative Linear Quadratic Gaussian (iLQG) [18] is a method for nonlinear stochastic systems where the dynamics are known and the reward has to be quadratic (and thus entails hand-crafted task designs). It uses iterative linearization of the dynamics around the current trajectory in order to obtain a noisy linear system (where the noise is a centered Gaussian) and where the reward constraints are quadratic. Then the algorithm uses the Riccati family of equations to obtain locally linear optimal trajectories that improve on the current trajectory.
Guided Policy Search [6] aims at finding an optimal policy by decomposing the problem into three steps. First, it uses nominal or expert trajectories, obtained by previous interactions with the environment, to learn locally linear approximations of its dynamics. Then, it uses optimal control algorithms such as iLQG or DDP to find the locally linear optimal policies corresponding to these dynamics. Finally, via supervised learning, a neural network is trained to fit the trajectories generated by these policies. Here again, there is a quadratic constraint on the reward, which must be purposely shaped.
Normalized Advantage Functions (NAF) [2] with model-based acceleration is a model-free RL algorithm using imagination rollouts coming from a model learned with the previous interactions with the environment or via expert demonstrations. NAF is the natural extension of Q-learning in the continuous case where the advantage function is parameterized as a quadratic function of nonlinear state features. The unimodal nature of this function allows the maximizing action for the Q-function to be obtained directly as the mean policy. This formulation makes the greedy step of Q-learning tractable for continuous action domains. Then, similarly to GPS, locally linear approximations of the dynamics of the environment are learned and iLQG is used to produce model-guided rollouts to accelerate learning.
The most similar work to ours is DQfD [3], which combines Deep Q Networks (DQN) [8] with learning from demonstrations in a similar way to DDPGfD. It additionally adds a supervised loss to keep the agent close to the policy from the demonstrations. However, DQfD is restricted to domains with discrete action spaces and is therefore not directly applicable to continuous robot control.
7 Conclusion
In this paper we presented DDPGfD, an off-policy RL algorithm which uses demonstration trajectories to quickly bootstrap performance on challenging motor tasks specified by sparse rewards. DDPGfD utilizes a prioritized replay mechanism to prioritize samples across both demonstration and self-generated agent data. In addition, it incorporates n-step returns to better propagate the sparse rewards across the entire trajectory.
Most work on RL in high-dimensional continuous control problems relies on well-tuned shaping rewards both for communicating the goal to the agent and for easing the exploration problem. While many of these tasks can be defined by a terminal goal state fairly easily, tuning a proper shaping reward that does not lead to degenerate solutions is very difficult. This task only becomes more difficult when moving to multi-stage tasks such as insertion. In this work, we replaced these difficult-to-tune shaping reward functions with demonstrations of the task from a human demonstrator. This eases the exploration problem without requiring careful tuning of shaping rewards.
In our experiments we sought to determine whether demonstrations were a viable alternative to shaping rewards for training object insertion tasks. Insertion is an important subclass of object manipulation, with extensive applications in manufacturing. In addition, it is a challenging set of domains for shaping rewards, as it requires two stages: one for reaching the insertion point, and one for inserting the object. Our results suggest that deep RL is poised to have a large impact on real robot applications by extending the learning-from-demonstration paradigm to include richer, force-sensitive policies.
References
[1] C. Finn, S. Levine, and P. Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In Proc. of ICML, 2016.
[2] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine. Continuous deep Q-learning with model-based acceleration. In Proc. of ICML, 2016.
[3] T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, A. Sendonaris, G. Dulac-Arnold, I. Osband, J. Agapiou, et al. Learning from demonstrations for real world reinforcement learning. arXiv preprint arXiv:1704.03732, 2017.
[4] J. Ho and S. Ermon. Generative adversarial imitation learning. In Proc. of NIPS, 2016.
[5] E. Klein, B. Piot, M. Geist, and O. Pietquin. A cascaded supervised learning approach to inverse reinforcement learning. In Proc. of ECML, 2013.
[6] S. Levine and V. Koltun. Guided policy search. In Proc. of ICML, pages 1–9, 2013.
[7] T. Lillicrap, J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. In Proc. of ICLR, 2016.
[8] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[9] A. Y. Ng, D. Harada, and S. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Proc. of ICML, volume 99, pages 278–287, 1999.
[10] D. A. Pomerleau. ALVINN: An autonomous land vehicle in a neural network. In Proc. of NIPS, 1989.
[11] N. Ratliff, J. A. Bagnell, and S. S. Srinivasa. Imitation learning for locomotion and manipulation. In 2007 7th IEEE-RAS International Conference on Humanoid Robots, 2007.
[12] S. Ross, G. J. Gordon, and J. A. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proc. of AISTATS, 2011.
[13] T. Schaul, J. Quan, I. Antonoglou, and D. Silver. Prioritized experience replay. In Proc. of ICLR, volume abs/1511.05952, 2016.
[14] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel. Trust region policy optimization. In Proc. of ICML, 2015.
[15] R. S. Sutton and A. G. Barto. Introduction to reinforcement learning. MIT Press, 1998.
[16] U. Syed and R. E. Schapire. A game-theoretic approach to apprenticeship learning. In Proc. of NIPS, 2007.
[17] U. Syed, M. Bowling, and R. E. Schapire. Apprenticeship learning using linear programming. In Proc. of ICML, 2008.
[18] E. Todorov and W. Li. A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. In American Control Conference, 2005. Proceedings of the 2005, pages 300–306. IEEE, 2005.
[19] E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In Proc. of IROS, pages 5026–5033, 2012.
[20] B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey. Maximum entropy inverse reinforcement learning. In Proc. of AAAI, pages 1433–1438, 2008.
Appendix A Real robot safety
To be able to run DDPG on the real robot we needed to ensure that the agent would not apply excessive force. To do this we created an intermediate impedance controller which subjects the agent's commands to safety constraints before relaying them to the robot. It modifies the target velocity set by the agent according to the externally applied forces:

$$v_{applied} = k_v v_{agent} + k_f f_{applied} \qquad (9)$$

where $v_{agent}$ is the agent's control signal, $f_{applied}$ are externally applied forces such as the clip pushing against the housing, and $k_v$ and $k_f$ are constants to choose the correct sensitivity. We further limit the velocity control signal $v_{applied}$ to limit the maximal speed increase while still allowing the agent to stop quickly. This increases the control stability of the system.
This allowed us to keep the agent's control frequency (the update rate of $v_{agent}$) in the Hz range while still having a physically safe system, as $f_{applied}$ and $v_{applied}$ were updated in the kHz range.
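A safety layer of this kind might look like the sketch below for a single joint. It rests on several assumptions that should be read as illustrative rather than as the controller actually used: the applied velocity is taken to be a linear combination of the agent's command and the external force, the sign convention is that an opposing force is negative, and the rate limit only restricts increases in speed so that a stop command passes through immediately. All names and constants are hypothetical.

```python
import math


def safe_velocity(v_agent: float, f_ext: float, v_prev: float,
                  k_v: float, k_f: float, max_increase: float) -> float:
    """Blend the agent's velocity command with the external force (assumed
    linear), then cap how much the speed may grow in one fast-loop tick."""
    target = k_v * v_agent + k_f * f_ext
    speed_cap = abs(v_prev) + max_increase
    # Only increases in speed are limited; slowing down is never delayed.
    if abs(target) > speed_cap:
        target = math.copysign(speed_cap, target)
    return target


# The agent asks for full speed from rest: the command is ramped up.
ramped = safe_velocity(1.0, 0.0, v_prev=0.0, k_v=1.0, k_f=0.1, max_increase=0.2)
# The agent asks to stop while moving: the stop is applied immediately.
stopped = safe_velocity(0.0, 0.0, v_prev=1.0, k_v=1.0, k_f=0.1, max_increase=0.2)
# An opposing contact force reduces the commanded velocity.
yielded = safe_velocity(1.0, -5.0, v_prev=0.0, k_v=1.0, k_f=0.1, max_increase=1.0)
```

Running such a filter in the fast loop while the agent updates its command in the slow loop is what lets the learned policy act at a low rate without losing force sensitivity.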