I Introduction
Deep reinforcement learning (RL) methods are increasingly common and increasingly successful in robotic manipulation domains like grasping and pushing [1, 2, 3, 4, 5]. But for most complex problems of interest, learning from scratch remains intractable. For example, consider the task illustrated in Figure 1a. A simulated Fetch robot must pick up and use a hook to drag an out-of-reach block to a target location. The only reward offered is a positive signal once the block reaches the target. This long-horizon, sparse-reward problem remains out of reach for current deep RL methods. In contrast, it is relatively straightforward to hand-design a policy that accomplishes this hook task perfectly in simulation (see Section V-B).
While a hand-designed policy may be robust to variations in the initial block position and target, it will likely break down with more dramatic variations in the task. For example, consider the task variation illustrated in Figure 1b. The robot must now move a more complex rigid object to the goal. The task is further complicated by static “bumps” on the table that may impede the movement of the hook and object. Moreover, the robot’s state includes no information about the bumps, which randomly regenerate at each trial, nor information about the object’s shape, which is randomly selected from a library of 100 diverse objects. The policy designed for the original task sometimes succeeds in this setup, but more often fails.
What should be done when a policy (be it a hand-designed policy, a model-predictive controller, or any other controller mapping states to actions) performs below par? One path forward is to manually tweak the policy. This option, while potentially laborious, may suffice for some problems. But for other problems like the complex hook task described above, it is unclear how to even begin improving the policy by hand.
In this work, we propose Residual Policy Learning (RPL): a method for improving policies using deep reinforcement learning. Our main idea is to augment arbitrary initial policies by learning residuals on top of them. Given an initial policy $\pi : S \to A$ with states $s \in S$ and actions $a \in A$, we learn a residual function $f_\theta : S \to A$ so that we have a residual policy $\pi_\theta : S \to A$ given by
$$\pi_\theta(s) = \pi(s) + f_\theta(s).$$
Observe that $\nabla_\theta \pi_\theta(s) = \nabla_\theta f_\theta(s)$, that is, the gradient of the policy does not depend on the initial policy $\pi$. We can therefore use policy gradient methods to learn $\pi_\theta$ even if the initial policy $\pi$ is not differentiable.
There are two ways to see the role of the residual. If the initial policy is nearly perfect, the residual may be viewed as a corrective term. But if the initial policy is far from ideal, we may interpret the outputs of the initial policy as merely “hints” to guide exploration. In practice, these two interpretations of the residual represent ends of a spectrum. We study problems all along this spectrum in this paper.
We present experimental results on several complex manipulation tasks that feature issues central to robotics and controller design: partial observability, sensor noise, model misspecification, and controller miscalibration. Our experiments are designed to investigate when and to what extent the following two claims hold:

RPL improves on initial policies; and

RPL is more data-efficient than learning from scratch.
We examine two common sources of initial policies: hand-designed policies and model-predictive controllers (MPC). We consider MPC with both known and learned transition models. In the latter case, we use Probabilistic Ensembles with Trajectory Sampling (PETS), a state-of-the-art method for model-based RL, to derive the initial controller [6]. In all cases, RPL is able to substantially improve on the original policies, while requiring far less data than learning from scratch to achieve the same performance. Furthermore, in complex manipulation tasks like that in Figure 1b, RPL succeeds where learning from scratch is intractable and hand-designing perfect policies is unrealistic.
II Related Work
RPL can be seen as tackling two separate but related questions: how to improve imperfect controllers, and how to make deep reinforcement learning methods more data-efficient and able to handle longer-horizon planning.
There has been a substantial body of work on improving the data efficiency of deep RL by combining model-free and model-based approaches. These methods often first learn a dynamics model and then use this dynamics model to simulate experience [7, 8, 9] or compute gradients for model-free updates [10, 11]. Another set of approaches uses the learned dynamics model (or inverse dynamics model) to perform trajectory optimization or model-predictive control [6, 12]. Further work uses such model-based methods to guide a model-free learner in a DAGGER-style imitation strategy [13]. More recent work has shown an equivalence between model-free and model-based RL with goal-conditioned value functions [14], and used this to improve model-free RL data efficiency. RPL can be seen as an extension of this line of work, as it provides a new means for combining the benefits of model-based and model-free RL. We show in experiments that the model-based method proposed by Chua et al. [6] can be improved upon with RPL. However, RPL is also more general; it can be used to improve upon arbitrary policies, including but not limited to model-based ones.
RPL can also be seen as a form of imitation learning. This set of approaches considers an expert that provides demonstrations of a task to a learner. Most approaches then attempt to copy the expert’s strategy [13, 15], or to use inverse reinforcement learning to infer goals and subgoals of the expert agent [16, 17]. Underpinning most of these approaches is the supposition that the expert is perfect. If the expert is indeed perfect, then RPL will be immediately perfect as well due to our initialization strategy (see Section IV-A). But if the expert is imperfect and is only meant to provide “hints,” RPL learns to improve nonetheless.
From robotics, many methods exist for learning different aspects of the perception, control, and execution pipeline. Focusing on control specifically, Bayesian optimization approaches are popular for learning controllers based on Gaussian process models of objective functions to be optimized [18, 19, 20, 21, 22]. Learning an accurate dynamics model is another central focus for robotics (termed system identification), and has been approached using analytic gradients [23, 24], finite differences [25], or Bayesian optimization [8]. In contrast, RPL does not presuppose which aspect of the controller needs correction. This is particularly valuable in partially observable settings, where it is unclear how to learn a good dynamics model or design a better objective function.
In the case of dynamics learning, our work is inspired by Ajay et al. [26] and Kloss et al. [27], who learn a correction to an analytical physics model in order to perform better model-predictive control. RPL is more general in that it can learn to correct the model implicitly by correcting the policy, but can also provide corrections that could not be provided by dynamics corrections (such as in partially observable or noisy domains).
Concurrent work by Johannink et al. [28] also proposes residual reinforcement learning, and focuses on showing the value of the approach for real robots in a task of block insertion, investigating the effects of variation in the initial state, control noise, and the transfer from sim to real. Here we aim to show the power of residual policies for a variety of different tasks that disentangle several sources of difficulty: partial observability, sensor noise, model misspecification, and controller miscalibration. We also empirically analyze the root cause of RPL’s success by introducing a baseline that uses the initial policy only as an “expert” to guide exploration.
III Background
RPL operates within a standard (Partially Observable) Markov Decision Process (MDP) framework. An MDP is a tuple $M = (S, A, R, T, \gamma)$ where $S$ are states, $A$ are actions, $R(s, a)$ is the reward for taking action $a$ in state $s$, $T(s, a, s')$ is the probability of transitioning to state $s'$ following state $s$ and action $a$, and $\gamma$ is a temporal discount factor. We assume all trajectories or episodes sampled from the MDP have a finite number of actions (horizon). In all of the experiments described in this paper, states and actions are real-valued vectors. A policy $\pi : S \to A$ maps states to actions. Given an initial state $s_0$, the reinforcement learning problem is to find a policy $\pi$ that maximizes expected rewards discounted over time, $\mathbb{E}\left[\sum_t \gamma^t R(s_t, a_t)\right]$.
Let $Q^\pi(s, a)$ be the action-value function that gives the expected future discounted rewards following policy $\pi$. Many reinforcement learning methods make use of the Bellman equation for the action-value function
$$Q^\pi(s, a) = R(s, a) + \gamma \, \mathbb{E}_{s' \sim T(s, a, \cdot)}\left[Q^\pi(s', \pi(s'))\right].$$
Actor-critic methods learn both a parameterized policy $\pi_\theta$ (the actor) and a parameterized action-value function $Q_\phi$ (the critic). The critic is trained with a loss function derived from the Bellman equation above and the actor is trained to produce actions that maximize the critic. This approach is typically more stable than training the actor alone.
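As a rough illustration of a critic loss derived from the Bellman equation, the sketch below computes one-step targets and a mean squared Bellman error for a batch of transitions. The function names and the choice of a squared-error loss are assumptions made for this example; target networks and other details of practical actor-critic methods are omitted.

```python
import numpy as np

def bellman_targets(rewards, next_q_values, gamma):
    """Targets y = r + gamma * Q(s', pi(s')) for a batch of transitions."""
    rewards = np.asarray(rewards, dtype=float)
    next_q_values = np.asarray(next_q_values, dtype=float)
    return rewards + gamma * next_q_values

def critic_loss(q_values, rewards, next_q_values, gamma):
    """Mean squared Bellman error (an assumed, commonly used choice of loss)."""
    targets = bellman_targets(rewards, next_q_values, gamma)
    return float(np.mean((np.asarray(q_values, dtype=float) - targets) ** 2))
```

The actor would then be adjusted to output actions for which the critic's estimate is high.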
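For the experiments in this work, we use Deep Deterministic Policy Gradients (DDPG) [29], an actor-critic method that works well in domains with continuous states and actions (though any RL method could be used with RPL in principle). In DDPG, the actor is updated following the deterministic policy gradient
$$\nabla_\theta J(\theta) = \mathbb{E}_{s}\left[\nabla_a Q_\phi(s, a)\big|_{a = \pi_\theta(s)} \, \nabla_\theta \pi_\theta(s)\right],$$
where $\pi_\theta$ is the actor and $Q_\phi$ is the critic.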
DDPG makes use of experience replay, in which transitions sampled from the environment are stored in a replay buffer. During training, transitions are then randomly drawn from the replay buffer in an effort to break the correlation between consecutive transitions.
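A minimal sketch of such a replay buffer is shown below. The class name `ReplayBuffer`, the fixed capacity, and uniform sampling without replacement are assumptions for illustration, not details taken from the DDPG implementation used in the experiments.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state) transitions."""

    def __init__(self, capacity):
        # A bounded deque evicts the oldest transition once capacity is reached.
        self.storage = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.storage.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        # Uniform sampling ignores the order in which transitions were
        # collected, which breaks the correlation between consecutive steps.
        return rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)
```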
Hindsight Experience Replay (HER) [5] extends experience replay to dramatically improve data efficiency in domains with sparse binary rewards (goals) like those we consider in our experiments. In HER, the reward function and policy are additionally parameterized by a goal $g$ so that they become $R(s, a, g)$ and $\pi(s, g)$ respectively. For our purposes, the goal is a subvector of the final state of an episode. During training, each transition added to the replay buffer includes a goal that was achieved “in hindsight,” i.e. the goal that was actually reached at the end of the training episode. Given a sampled transition $(s, a, s', g')$ with hindsight goal $g'$, the policy is then updated according to the reward $R(s, a, g')$. This trick is especially useful early in training when the chance of achieving nonzero rewards is low. We combine HER and DDPG for all of the experiments presented in this work.
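The relabeling step can be sketched as follows. This is a simplified illustration, assuming episodes are lists of `(state, action, next_state, goal)` tuples; `relabel_with_hindsight`, `achieved_goal_fn`, and `reward_fn` are hypothetical names, and here the reward is evaluated on the goal achieved in the next state.

```python
def relabel_with_hindsight(episode, achieved_goal_fn, reward_fn):
    """Relabel an episode with the goal that was actually reached at its end.

    episode: list of (state, action, next_state, original_goal) tuples.
    achieved_goal_fn: maps a state to the goal subvector it achieves.
    reward_fn: maps (achieved_goal, goal) to a scalar reward.
    """
    final_next_state = episode[-1][2]
    hindsight_goal = achieved_goal_fn(final_next_state)
    relabeled = []
    for state, action, next_state, _original_goal in episode:
        # Recompute the reward as if the hindsight goal had been the target.
        reward = reward_fn(achieved_goal_fn(next_state), hindsight_goal)
        relabeled.append((state, action, reward, next_state, hindsight_goal))
    return relabeled
```

Both the original and the relabeled transitions would typically be stored, so that even an episode that missed its intended goal contributes at least one transition with a successful reward.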
IV Residual Policy Learning (RPL)
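In Residual Policy Learning (RPL), we begin with an initial policy $\pi : S \to A$. Our goal is to learn a residual $f_\theta : S \to A$ to create an improved final policy $\pi_\theta(s) = \pi(s) + f_\theta(s)$.
Observe that a fixed initial policy $\pi$ together with an MDP $M = (S, A, R, T, \gamma)$ induces a residual MDP $M^{(\pi)} = (S, A, R^{(\pi)}, T^{(\pi)}, \gamma)$ where
$$T^{(\pi)}(s, a, s') = T(s, \pi(s) + a, s'), \qquad R^{(\pi)}(s, a) = R(s, \pi(s) + a).$$
If we view $M^{(\pi)}$ as an MDP like any other, we see that the residual that we wish to learn, $f_\theta$, is a policy in this MDP. We can thus apply standard reinforcement learning techniques to learn the residual. In this work, we parameterize $f_\theta$ as a neural network and use model-free deep RL methods for learning.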
RPL is as simple as that: given an initial policy, create a residual policy and proceed with deep RL. We now describe a few minor extensions that can improve performance and data efficiency in practice.
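One way to picture the residual MDP is as an environment wrapper that adds the initial policy's action to whatever the learner outputs. The sketch below assumes a hypothetical two-method environment interface (`reset`, `step`) and a toy one-dimensional task; `ToyEnv`, `ResidualEnv`, and `initial_policy` are illustrative names, and the 0 / -1 reward is an assumption of the toy task.

```python
import numpy as np

class ToyEnv:
    """Hypothetical 1-D task: move a point from position 0 to position 1."""

    def reset(self):
        self.pos = np.array([0.0])
        return self.pos.copy()

    def step(self, action):
        self.pos = self.pos + np.clip(action, -1.0, 1.0)
        reward = 0.0 if abs(self.pos[0] - 1.0) < 0.05 else -1.0
        return self.pos.copy(), reward

class ResidualEnv:
    """Residual MDP: the learner's action is added to the initial policy's action."""

    def __init__(self, env, initial_policy):
        self.env = env
        self.initial_policy = initial_policy
        self.state = None

    def reset(self):
        self.state = self.env.reset()
        return self.state

    def step(self, residual_action):
        # The initial policy is folded into the dynamics, so it is only ever
        # called, never differentiated.
        action = self.initial_policy(self.state) + residual_action
        self.state, reward = self.env.step(action)
        return self.state, reward

def initial_policy(state):
    # Hypothetical imperfect controller that undershoots the target.
    return np.array([0.8 - state[0]])
```

Any standard RL learner can then be run on `ResidualEnv(ToyEnv(), initial_policy)` without modification: a zero residual reproduces the initial policy's behavior, and a residual of 0.2 on the first step corrects its undershoot.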
IV-A Initializing the Residual
A desirable property of RPL is that it should never make a good initial policy worse. In the extreme case, if an initial policy is perfect, then we would like the residual policy to have no influence. We therefore endeavor to initialize the residual function so that $f_\theta(s) = 0$ for all $s \in S$. We do this by initializing the last layer of the network to be zero.
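The effect of this initialization can be checked with a small sketch. The two-layer network, its sizes, and the names `init_residual_network`, `residual`, and `residual_policy` are assumptions for illustration; the only property taken from the text is that the last layer starts at zero.

```python
import numpy as np

def init_residual_network(state_dim, hidden_dim, action_dim, seed=0):
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.1, size=(hidden_dim, state_dim)),
        "b1": np.zeros(hidden_dim),
        # Last layer starts at exactly zero, so the residual is zero everywhere.
        "W2": np.zeros((action_dim, hidden_dim)),
        "b2": np.zeros(action_dim),
    }

def residual(params, state):
    hidden = np.maximum(0.0, params["W1"] @ state + params["b1"])  # ReLU
    return params["W2"] @ hidden + params["b2"]

def residual_policy(params, initial_policy, state):
    # Final action = initial policy action + learned residual.
    return initial_policy(state) + residual(params, state)
```

At initialization `residual_policy` returns exactly the initial policy's action for any state. Learning can still proceed because the gradient with respect to the last-layer weights depends on the hidden activations, which are generally nonzero.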
IV-B RPL with Actor-Critic Methods
RPL learns a residual on the output of an initial policy. Actor-critic methods like DDPG involve not only a policy but also a learned action-value function. If we begin with a perfect initial policy and a poor critic, the policy performance may degrade, since it is trained with reference to the critic. We therefore propose to train the critic alone for a “burn in” period while leaving the policy fixed. We can determine an appropriate burn-in length automatically by monitoring the critic loss function and waiting for it to dip below a threshold, which becomes a hyperparameter of our method. We use the same threshold for all experiments in this paper.
IV-C Recurrent RPL for POMDPs
RPL can also be extended to handle Partially Observable Markov Decision Processes (POMDPs). Generally, this is done in deep reinforcement learning by making the policy recurrent. In practice, this is challenging for DDPG [30], and so we present an approximation by simply considering a “history” of previous states. This is equivalent to writing $\pi_\theta(s_t) = \pi(s_t) + f_\theta(s_t, s_{t-1}, \ldots, s_{t-n})$, with $t$ being the current timestep and $n$ the history length. While the history length could take on any value, we found a history length of just 1 (meaning the policy considers the current state and previous state) to be effective. We take advantage of this extension in our NoisyHook experiment in which observation noise obscures the input to the policy.
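A sketch of the history-based approximation is given below, under several assumptions: the residual receives the concatenated history, the initial policy sees only the current state, and at the first step of an episode the history is padded with copies of the first state. `HistoryResidualPolicy` and its arguments are hypothetical names.

```python
from collections import deque
import numpy as np

class HistoryResidualPolicy:
    """Residual policy whose residual sees the last history_length + 1 states."""

    def __init__(self, initial_policy, residual_fn, history_length=1):
        self.initial_policy = initial_policy
        self.residual_fn = residual_fn
        self.history = deque(maxlen=history_length + 1)

    def reset(self):
        self.history.clear()

    def act(self, state):
        if not self.history:
            # Assumed padding: repeat the first state to keep the input size fixed.
            self.history.extend([state] * self.history.maxlen)
        else:
            self.history.append(state)  # drops the oldest state automatically
        stacked = np.concatenate(list(self.history))  # ordered oldest to newest
        return self.initial_policy(state) + self.residual_fn(stacked)
```

With `history_length=1` the residual sees the previous and current states, matching the setting described above.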
V Experiments
Here we investigate to what extent RPL improves on initial policies, learns faster than model-free RL alone, and succeeds in tasks where model-free RL is intractable.
V-A Tasks
We study six simulated manipulation tasks. All environments are implemented in MuJoCo [31]. To provide direct comparison with previous work, we begin with a Push task and a PickAndPlace task, both taken from Plappert et al. [32]. We then present three more difficult tasks that have not been previously considered: SlipperyPush, NoisyHook, and ComplexHook. These first five tasks all involve a Fetch robot positioned in front of a table top. In the final task, we use the “7-DOF Pusher” environment from Chua et al. [6]. Our focus in the last task is model-based RL, so we call this environment MBRLPusher.
In the first five tasks, following previous work, we parameterize the action space in terms of changes to the end effector’s position in world coordinates [32]. A fourth action coordinate modulates the gripper’s two fingers symmetrically. (In the push tasks, the gripper is locked, and the fourth action coordinate has no effect.) In the sixth task, actions actuate the joints of the 7-DOF robot arm directly. All actions are normalized so that the resulting action space is $[-1, 1]^d$, where $d = 4$ for the first five tasks and $d = 7$ for the last. The state spaces and rewards vary per task, as described next.
V-A1 Push
This task is taken directly from [32]. The objective is to push an object (a cube) to a target location on the table surface. The initial position of the object and the target location are randomized. The state space includes:

Gripper position and velocity (6 dims)

Object position, rotation, velocities (12 dims)

Object position relative to the gripper (3 dims)

Gripper finger joint states and velocities (4 dims)
for a total dimensionality of 25. To use Hindsight Experience Replay, we must also specify achieved and desired goals. Here the achieved goal is the three-dimensional final position of the object and the desired goal is the target location. Rewards are sparse and binary: a reward of $0$ is given when the object is within a small radius around the target location and $-1$ otherwise. The episode is counted as a success if the last reward is $0$, i.e. the goal is achieved. Episode lengths are 50 and do not terminate early.
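The sparse reward and the success criterion can be sketched as below. The 0 / -1 reward values follow the convention of the Fetch environments of [32] and are an assumption here, as is the `radius` default; `sparse_reward` and `episode_success` are hypothetical names.

```python
import numpy as np

def sparse_reward(achieved_goal, desired_goal, radius=0.05):
    """Return 0 inside the success radius and -1 otherwise (assumed convention)."""
    achieved = np.asarray(achieved_goal, dtype=float)
    desired = np.asarray(desired_goal, dtype=float)
    distance = np.linalg.norm(achieved - desired)
    return 0.0 if distance <= radius else -1.0

def episode_success(rewards):
    """An episode counts as a success if its last reward is the success reward."""
    return rewards[-1] == 0.0
```

Because only the last reward matters for success, an episode in which the object passes through the target and then leaves it is still counted as a failure.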
V-A2 SlipperyPush
Here we present a slight modification to the original Push environment: in this SlipperyPush environment, the sliding friction coefficient of the object is 5x smaller than in the original environment. The initial state randomization, state space, goals, rewards, and horizon are otherwise identical to Push.
V-A3 PickAndPlace
This task is taken directly from [32]. As in the previous tasks, the objective is to move an object (a cube) to a target location. However, the target location may now be either on the table top or in the air above the table. At the beginning of each episode, the position for the target location is randomly sampled as before. Then with a fixed probability, the location is set to be on the table surface; otherwise, the location is randomly sampled to be above the table surface. As mentioned above, the gripper is now unlocked so that the fingers open and close following the fourth action dimension. All other environment details are unchanged with respect to Push.
V-A4 NoisyHook
In this task, the robot cannot initially reach the block with its gripper. A new hook object is introduced and positioned to the right of the robot (see Figure 1a). The objective is still to move the cube to a target location, but now the robot must use the hook to manipulate the cube. The target location is randomly initialized so that it lies between the cube and the robot. In addition to the 25 state dimensions included in the previous tasks, the state space now includes information about the hook:

Hook position, rotation, velocities (12 dims)

Hook position relative to the gripper (3 dims)
for a total of 40 dimensions. Rewards and goals are the same as in previous tasks; we provide no additional shaping rewards.
This NoisyHook task is further complicated with the addition of observation noise. We suppose that the robot has precise proprioception but has significant uncertainty about the positions of the hook and cube. At each time step, we add IID diagonal Gaussian noise to the position of the block and the position of the hook, as well as the rotation of both objects. Since the achieved goals are derived from the state, they too are affected. Here we double the episode length for a total of 100 frames.
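A minimal sketch of this kind of observation noise follows. The function name, the index layout, and the value of `sigma` are hypothetical; only the idea of adding independent Gaussian noise to selected state entries while leaving proprioceptive entries untouched comes from the text.

```python
import numpy as np

def add_observation_noise(state, noisy_indices, sigma, rng):
    """Add independent zero-mean Gaussian noise to selected state entries."""
    noisy = np.array(state, dtype=float)  # copy, so the true state is unchanged
    noisy[noisy_indices] += rng.normal(0.0, sigma, size=len(noisy_indices))
    return noisy
```

For example, with `rng = np.random.default_rng(0)` and `noisy_indices` covering the object and hook pose entries, the gripper entries of the returned observation stay exact while the object and hook entries are perturbed afresh at every call.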
V-A5 ComplexHook
This task again features a hook and an object that must be moved to a target location. There is no longer noise added to the state. There is, however, significant uncertainty of two different, structured kinds. We first replace the simple cube from previous tasks with complex objects that vary in mass, friction, and shape. We use 100 objects taken from previous work by Finn et al. [1]. The object meshes were originally downloaded from thingiverse.com and include bowls, teddy bears, and small chairs among many other shapes. No information about the object shape or physical parameters is included in the state. To accomplish this task robustly, a policy must work across all possible objects.
To introduce a second source of structured uncertainty, we simulate large “bumps” on the table. A bump is a rigid box that is fixed to the table top. The width, length, height, position, and count of the bumps are randomly selected. See Figures 1b and 5 for two examples. Note crucially that no information about the bumps is included in the state space. Thus the complete state space and other task parameters remain unchanged.
V-A6 MBRLPusher
This final task is taken directly from [6]. A 7-DOF robot arm (not Fetch, but a simpler model) is positioned in front of a table with a tall cylinder and a target area. The objective is to push the cylinder to the target. The cylinder position and initial arm velocity are randomized per trial but the goal is fixed. The state space includes:

The robot joint positions and velocities (14 dims)

The cylinder center of mass (3 dims)

The gripper center of mass (3 dims)
for a total of 20 dimensions. Goals are the same as in previous tasks. Rewards are weighted sums of three terms: negative L1 norm between the cylinder and the goal, negative L1 norm between the gripper and the cylinder, and negative L2 norm of the action, each with its own weight. The task horizon here is 150 frames.
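The structure of this reward can be sketched as follows. The weights are left as an argument because their values are not given here, and `pusher_reward` is a hypothetical name; the three terms follow the description above.

```python
import numpy as np

def pusher_reward(cylinder_pos, goal_pos, gripper_pos, action, weights):
    """Weighted sum of three negative cost terms (weights supplied by the caller)."""
    w_goal, w_reach, w_action = weights
    cylinder = np.asarray(cylinder_pos, dtype=float)
    goal_cost = np.sum(np.abs(cylinder - np.asarray(goal_pos, dtype=float)))      # L1
    reach_cost = np.sum(np.abs(np.asarray(gripper_pos, dtype=float) - cylinder))  # L1
    action_cost = np.linalg.norm(np.asarray(action, dtype=float))                 # L2
    return -(w_goal * goal_cost + w_reach * reach_cost + w_action * action_cost)
```

Unlike the sparse rewards of the Fetch tasks, this reward is dense: it changes smoothly as the gripper approaches the cylinder and the cylinder approaches the goal.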
V-B Initial Policies
In Residual Policy Learning (RPL), we begin with an environment and an initial policy and we learn to improve on that initial policy. The initial policies that we use for our experiments are:

DiscreteMPCPush: a model-predictive controller with discrete actions and heuristics specific to Push.

ReactivePush: a reactive policy designed to work perfectly in the original Push task.

ReactivePickAndPlace: a reactive policy for the PickAndPlace task with miscalibrated gains.

ReactiveHook: a reactive policy designed to work perfectly in the noiseless hook task (Figure 1a).

CachedPETS: a model-predictive controller with a learned transition model [6]. To make this controller fast, we cache the output actions for 500 input states. Given a new state, the final CachedPETS controller finds the nearest state in the cache and outputs the corresponding stored action. The number 500 was selected based on a small performance analysis (see appendix).
See the appendix for details on all policies.
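The caching idea behind CachedPETS can be sketched as a nearest-neighbor lookup. This is an illustration only: `CachedController` is a hypothetical name, the Euclidean distance is an assumed choice of metric, and `expensive_controller` stands in for any slow controller such as an MPC planner.

```python
import numpy as np

class CachedController:
    """Approximate a slow controller with a nearest-neighbor table of its outputs."""

    def __init__(self, expensive_controller, cache_states):
        self.cache_states = np.asarray(cache_states, dtype=float)
        # Query the slow controller once for each cached state.
        self.cache_actions = np.array(
            [expensive_controller(s) for s in self.cache_states]
        )

    def __call__(self, state):
        query = np.asarray(state, dtype=float)
        # Assumed metric: Euclidean distance in state space.
        distances = np.linalg.norm(self.cache_states - query, axis=1)
        return self.cache_actions[np.argmin(distances)]
```

Each call then costs one distance computation per cached state instead of a full planning step, at the price of returning the action of a nearby state rather than the exact one.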
V-C Architectures and Training Details
RPL is indifferent to the deep RL method applied or architecture used. However, for consistency, we use the same actor-critic architecture with Deep Deterministic Policy Gradients [29] and Hindsight Experience Replay [5] across our experiments. The network consists of 3 fully connected layers of 256 units each, with ReLU nonlinearities (not on the output layer). We use the same hyperparameters as in [32], given in the appendix. Our only substantial modification is to initialize the last layer of the network to zeros, so that the policy starts with the base controller (as described in Section IV-A).
When training in noisy environments, we use a history of 1 (see Section IV-C). We considered two variants. In the first variant, the states are concatenated and fed to the network. In the second variant, the network instead uses the average of the features obtained for the individual states. In practice, we found the second variant to work better, and so use it for all noisy environments.
V-D Baselines
We consider three baselines for all experiments. First, in all experiments, we show the result of running the initial policy without learning. Second, we show the result of learning from scratch with DDPG and HER.
Our third baseline is designed to disentangle the causes of RPL’s success. One hypothesis for why residual learning might be helpful is that the initial policy provides a smart means for exploration. The baseline, “Expert Explore”, uses the initial (“expert”) policy for exploration only. At each step, a random number is drawn and used to choose among three sources of actions: the learned policy, the expert, and random actions. The two mixing probabilities are hyperparameters that we selected with a small grid search. Thus the agent acts a fixed fraction of the time according to the learned policy, a fixed fraction according to the expert, and the rest of the time takes random actions. This baseline is similar to a policy-reuse method [33].
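The action selection of this baseline might look like the sketch below. The parameterization by `p_learned` and `p_expert` is a hypothetical choice made for illustration and may differ from the hyperparameters used in the experiments; all function and argument names are assumptions.

```python
import random

def expert_explore_action(state, learned_policy, expert_policy, random_action,
                          p_learned, p_expert, rng=random):
    """Choose the behavior action from one of three sources.

    With probability p_learned use the learned policy, with probability
    p_expert use the expert, and otherwise take a random action.
    """
    u = rng.random()  # uniform draw in [0, 1)
    if u < p_learned:
        return learned_policy(state)
    if u < p_learned + p_expert:
        return expert_policy(state)
    return random_action()
```

The expert only influences which transitions are collected. The learned policy is still trained from scratch, with no residual structure, on the resulting data.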
V-E Results
Here we present empirical and qualitative results for RPL across the six complex manipulation tasks described in Section V-A. For each task, we show RPL’s superior data efficiency and performance compared to the three baselines described in Section V-D. All empirical results are presented with mean and standard deviation across five random seeds.
V-E1 DiscreteMPCPush in Push
In this experiment, we examine whether RPL can overcome the limitations of an MPC controller that makes coarse approximations in an effort to trade performance for speed. In particular, we use the DiscreteMPCPush as our initial policy for the Push task.
We graph the success rates of RPL and the baselines in Figure 3a. The success rate of DiscreteMPCPush starts around 0.5. We noticed three common sources of suboptimality for this initial policy. First, the limited node expansions per MPC call, which is necessitated by the speed bottleneck of querying the MPC’s model, means that a good action sequence is not always found. Second, the discreteness of the actions sometimes leads to circuitous executions in which the episode ends before the object reaches the target. Third, the heuristic used to guide the MPC’s search, while very informative, can also be misleading in rare cases. These failure modes are especially common when the gripper must move from one side of the cube to the other, since the cube acts as an obstacle in this context.
We confirm the results reported in previous work [32] that learning from scratch with DDPG and HER works well in this domain, converging to a success rate of nearly 1.0 after roughly 2 million simulator steps. The performance of RPL before convergence greatly surpasses both the initial policy and learning from scratch, while still converging to a perfect success rate. For example, RPL takes an order of magnitude fewer training samples to reach an average success rate of 0.9 versus the learning from scratch baseline.
Note that the performance of RPL drops early in training before quickly recovering and surpassing the baselines. We see this pattern in the following experiments as well. This is a manifestation of the issue discussed in Section IV-B whereby the critic is initialized poorly with respect to the actor. We found that decreasing the burn-in parameter mitigated the drop but did not significantly affect the time to convergence. We thus left the results as they are for the benefit of discussion.
To analyze the source of RPL’s superior data efficiency, we turn to the performance of the Expert Explore baseline. We find that this baseline also improves on learning from scratch, but that RPL converges slightly faster. This suggests that RPL’s advantage in this Push task derives in large part from more efficient exploration, but also from the residual parameterization and initialization.
V-E2 ReactivePush in SlipperyPush
Our second experiment examines model misspecification. We tuned the ReactivePush policy to achieve near perfect performance in the original Push task. We now transfer this policy to the SlipperyPush task in which the sliding friction coefficient of the cube is 5x smaller.
The success rates of RPL and the baselines on the SlipperyPush task are shown in Figure 3b. As expected, the ReactivePush policy is not perfect, achieving a success rate of around 0.45. The most common failure mode of this initial policy is when the gripper pushes the slippery cube too hard and the cube slides off the table. In other cases, the cube does not fall off, but is pushed back and forth across the goal without converging. A representative trial is illustrated in Figure 2 (top row). As in the first experiment, we find that RPL is far better before convergence and converges to the same perfect success rate as model-free learning from scratch.
V-E3 ReactivePickAndPlace in PickAndPlace
In this experiment, we consider an example of a poorly calibrated initial policy that leads to detrimental oscillatory behavior. Such oscillations are a common issue in stateless robotic control when gains are improperly tuned. To create a representative scenario, we start with the ReactivePickAndPlace policy and artificially increase the gains. Oscillations quickly arise, e.g. when the gripper overshoots the waypoints implicit in the design of the policy. These oscillations cause the success rate of the ReactivePickAndPlace to drop to roughly 0.5, as seen in Figure 3c.
As reported in previous work [32], learning from scratch with DDPG and HER requires far more data to reach a success rate of 1.0 in PickAndPlace versus Push. Here we find the data efficiency of RPL to be substantially better. RPL converges to a success rate of 1.0 after roughly 1 million simulator steps, which represents a nearly 10x improvement over learning from scratch. Comparing with the Expert Explore baseline, we find that not all of the advantage can be explained by improved exploration; the good parameterization and initialization of the policy is also to credit.
It was not a priori obvious that the initial policy would aid RPL here as much as it apparently does. By design, we know that the policy is close in “gain space” to a near optimal one, but that does not guarantee that the policy is similarly close in “residual weight space.” Fortunately, it seems the two notions coincide here.
V-E4 ReactiveHook in NoisyHook
Now we turn to another prevalent problem in robotic control, sensor noise, and investigate whether RPL can improve the robustness of a sensitive initial policy. As discussed in Section V-A, the NoisyHook task features Gaussian noise applied to the positions and rotations of the block and hook. While the ReactiveHook policy is nearly perfect in a noiseless version of the same task, the policy proves to be quite sensitive to the sensor noise. We observe diverse failure modes throughout the course of execution: the gripper often moves to a wrong position, sometimes fails to pick up the hook, and other times drops the hook. As shown in Figure 4a, the success rate of the initial policy is far lower than in our previous experiments.
In this experiment, we make use of the two-frame policy architecture described in Section IV-C to cope with sensor noise. We use the same architecture for all three learning methods for comparison.
Learning from scratch with DDPG and HER fails in this task, never achieving a nontrivial success rate. This failure is not surprising given the long horizon and sparse rewards in the task. The Expert Explore baseline also performs quite poorly, only beginning to reach nontrivial success rates after 5 million simulator steps. We speculate that this failure is due to the fact that the hook is so often dropped by the initial policy.
In contrast, we see that RPL quickly converges to a success rate of roughly 0.8. This represents the first instance of RPL obtaining strong performance in a task that is both out of reach for current deep RL methods and nontrivial for robotic control alone. Moreover, the results suggest that RPL is a promising method for overcoming the common challenge of sensor noise.
V-E5 ReactiveHook in ComplexHook
In this experiment, we study structured uncertainty inspired by the common mismatch between physics simulators and real robotics tasks. As described in Section V-A, the ComplexHook task contains two challenging innovations over the noiseless hook task: bumps are randomly scattered across the table surface; and the object takes on a variety of shapes, masses, and coefficients of friction. We observed that each of these two innovations independently causes the ReactiveHook policy performance to drop by roughly 20%. With both changes present, the initial policy success rate drops to 0.55, as shown in Figure 4b.
A random or null policy is occasionally successful in this task due to the scene randomization. With this in mind, we see that learning from scratch with DDPG and HER does not obtain any nontrivial success rate, as in the previous experiment. We again find that the policy never causes the gripper to touch the hook, let alone move it to reach the object.
Interestingly, the Expert Explore baseline does achieve a nontrivial success rate, eventually slightly surpassing the success rate of the initial policy. This task is easier than NoisyHook from the perspective of the expert baseline if only because the initial success rate is much higher.
Finally, RPL learns a robust policy with strong data efficiency, converging at a success rate just below 0.8. The fact that RPL is able to achieve this success rate is fairly remarkable given the diversity in the objects and obstacles, and the fact that the state contains no information about this diversity. RPL has apparently learned a “conformant” policy that works for most objects and obstacles without discretion. We show one intriguing example of RPL succeeding where the initial policy fails in Figure 5.
V-E6 CachedPETS in MBRLPusher
In this final experiment, we examine whether RPL can improve on a model-based RL method while converging faster than model-free RL. As described in Section V-A, to derive the initial controller, we begin by learning a transition model. Following Chua et al. [6], we train for 15000 simulator steps in their MBRLPusher environment. We then query the environment for an additional 500 steps to construct our CachedPETS controller. (We thus offset our plotted results by 15500 relative to the baseline in Figure 4c.) The performance of PETS, averaged over 10 trials, is plotted as a dashed line in Figure 4c. We see that the drop in performance due to the caching approximation is fairly small.
We find that RPL improves substantially not only on the initial CachedPETS controller, but also on the original PETS controller. Furthermore, RPL converges faster than DDPG+HER, indicating that the initial controller was beneficial. It is worth emphasizing that no domain knowledge was used to design the initial policy here; this same combination of MBRL and RPL could be applied immediately to a new domain. These results suggest that RPL may be seen as a general RL method that marries the data efficiency of MBRL with the superior asymptotic performance of model-free RL.
VI Discussion and Conclusion
We have described Residual Policy Learning (RPL), a simple method that combines the strengths of deep RL and robotic control. Our experimental results suggest that RPL is a powerful approach to deal with pervasive issues in complex manipulation tasks such as sensor noise, model misspecification, controller miscalibration, and partial observability. We find that RPL consistently improves on initial policies and achieves better data efficiency than model-free RL alone. Furthermore, RPL can improve on initial policies for long-horizon, sparse-reward problems where model-free RL fails.
We have also seen the promise of combining RPL with model-based RL [6]. MBRL is often more data-efficient whereas model-free RL can be faster to run and asymptotically superior. RPL offers a simple mechanism for combining the strengths of both. We find empirically that learning a residual on top of MBRL improves on MBRL alone, converging to the same performance as model-free RL with less data.
We postulate three main causes for the success of RPL. First, as described in Section IV, we take care to initialize the residual policy so that its output at first matches the initial policy. When the initial policy is strong, this initialization gives RPL a clear boost. The second cause of RPL’s success is improved exploration early on during training. In learning from scratch with sparse rewards and long horizons, the first successful trajectory must be discovered by chance. Hindsight Experience Replay is designed to face this challenge, but RPL offers a more direct solution. RPL can discover successful trajectories immediately if the initial policy produces them with nontrivial frequency. To measure the impact of this exploration advantage, we introduced the Expert Explore baseline described in Section V-D. Empirically we find this baseline performance to lie midway between RPL and learning from scratch. A third likely cause of RPL’s success is that the residual reinforcement learning problem induced by the initial policy may be easier than the original problem. This cause may best explain the superior performance of RPL in the NoisyHook task, where both the initial policy and the “Expert Explore” baseline are empirically poor.
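To make the first cause concrete, the following is a minimal sketch of a residual policy whose output matches the initial policy before any training. It assumes that the matching is obtained by setting the final layer of the residual network to zero; the class name `ResidualPolicy`, the toy controller, and the layer sizes are illustrative assumptions rather than details of the experiments.

```python
import numpy as np


class ResidualPolicy:
    """Sketch of pi_theta(s) = pi(s) + f_theta(s) with a one-hidden-layer residual."""

    def __init__(self, initial_policy, state_dim, action_dim, hidden=64, seed=0):
        rng = np.random.default_rng(seed)
        self.initial_policy = initial_policy
        # Hidden layer gets a small random initialization (assumed scale).
        self.W1 = rng.normal(0.0, 0.1, size=(state_dim, hidden))
        self.b1 = np.zeros(hidden)
        # Assumption: the final layer starts at zero, so the residual is
        # exactly zero and the combined policy equals the initial policy.
        self.W2 = np.zeros((hidden, action_dim))
        self.b2 = np.zeros(action_dim)

    def residual(self, state):
        hidden = np.maximum(0.0, state @ self.W1 + self.b1)  # ReLU
        return hidden @ self.W2 + self.b2

    def __call__(self, state):
        return self.initial_policy(state) + self.residual(state)


def toy_initial_policy(state):
    # Hypothetical hand-designed controller: push the first two state
    # coordinates toward the origin.
    return -0.5 * state[:2]


policy = ResidualPolicy(toy_initial_policy, state_dim=4, action_dim=2)
s = np.array([1.0, -2.0, 0.3, 0.0])
print(policy(s), toy_initial_policy(s))  # identical before training
```

Only the residual parameters would be updated during learning; the initial policy is treated as a black box that is queried but never differentiated.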
Though the six case studies we have presented all involve robotic manipulation with DDPG and HER, RPL is far more general than any specific task domain or deep RL method. The method we have described can be immediately applied in any domain with continuous actions and with any gradient-based learning method. However, RPL is especially well suited for complex manipulation because of the availability of good but imperfect initial policies and the long-horizon, sparse-reward tasks that naturally arise.
In recent years, complex manipulation problems have been at the forefront of research in robotics and deep RL. Both fields have made significant strides in often complementary directions. RPL should be viewed as one piece of a larger effort to combine the strengths of both approaches. We conjecture that solving the hardest open problems in manipulation will require such a synthesis.
Acknowledgments
We thank Evan Shelhamer for helpful discussions. We gratefully acknowledge support from NSF grants 1523767 and 1723381; from ONR grant N000141310333; from AFOSR grant FA95501710165; from ONR grant N000141812847; from Honda Research; and from the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF1231216. KA acknowledges support from NSERC. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of our sponsors.
References
 Finn et al. [2017] Chelsea Finn, Tianhe Yu, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. One-shot visual imitation learning via meta-learning. In Conference on Robot Learning, pages 357–368, 2017.
 Kalashnikov et al. [2018] Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. Scalable deep reinforcement learning for vision-based robotic manipulation. In Conference on Robot Learning, pages 651–673, 2018.
 Gu et al. [2017] Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In IEEE International Conference on Robotics and Automation (ICRA), pages 3389–3396. IEEE, 2017.
 Zeng et al. [2018] Andy Zeng, Shuran Song, Stefan Welker, Johnny Lee, Alberto Rodriguez, and Thomas Funkhouser. Learning synergies between pushing and grasping with self-supervised deep reinforcement learning. arXiv preprint arXiv:1803.09956, 2018.
 Andrychowicz et al. [2017] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems, pages 5048–5058, 2017.
 Chua et al. [2018] Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. arXiv preprint arXiv:1805.12114, 2018.
 Sutton [1990] Richard S Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine Learning Proceedings 1990, pages 216–224. Elsevier, 1990.
 Deisenroth et al. [2015] Marc Peter Deisenroth, Dieter Fox, and Carl Edward Rasmussen. Gaussian processes for data-efficient learning in robotics and control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):408–423, 2015.
 Gu et al. [2016] Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous deep Q-learning with model-based acceleration. In International Conference on Machine Learning, pages 2829–2838, 2016.
 Heess et al. [2015] Nicolas Heess, Gregory Wayne, David Silver, Timothy Lillicrap, Tom Erez, and Yuval Tassa. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, pages 2944–2952, 2015.
 Nguyen and Widrow [1990] Derrick H Nguyen and Bernard Widrow. Neural networks for self-learning control systems. IEEE Control Systems Magazine, 10(3):18–23, 1990.
 Mordatch et al. [2016] Igor Mordatch, Nikhil Mishra, Clemens Eppner, and Pieter Abbeel. Combining model-based policy search with online model learning for control of physical humanoids. In IEEE International Conference on Robotics and Automation (ICRA), pages 242–248. IEEE, 2016.
 Nagabandi et al. [2018] Anusha Nagabandi, Gregory Kahn, Ronald S Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In IEEE International Conference on Robotics and Automation (ICRA), pages 7559–7566. IEEE, 2018.
 Pong et al. [2018] Vitchyr Pong, Shixiang Gu, Murtaza Dalal, and Sergey Levine. Temporal difference models: Model-free deep RL for model-based control. International Conference on Learning Representations, 2018.

 Ross et al. [2011] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 627–635, 2011.
 Abbeel and Ng [2004] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-first International Conference on Machine Learning, page 1. ACM, 2004.
 Ziebart et al. [2008] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI Conference on Artificial Intelligence, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.
 Wang et al. [2018] Zi Wang, Caelan Reed Garrett, Leslie Pack Kaelbling, and Tomás Lozano-Pérez. Active model learning and diverse action sampling for task and motion planning. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018.
 Marco et al. [2017] Alonso Marco, Felix Berkenkamp, Philipp Hennig, Angela P Schoellig, Andreas Krause, Stefan Schaal, and Sebastian Trimpe. Virtual vs. real: Trading off simulations and physical experiments in reinforcement learning with bayesian optimization. In IEEE International Conference on Robotics and Automation (ICRA), pages 1557–1563. IEEE, 2017.
 Lizotte et al. [2007] Daniel J Lizotte, Tao Wang, Michael H Bowling, and Dale Schuurmans. Automatic gait optimization with Gaussian process regression. In International Joint Conference on Artificial Intelligence (IJCAI), pages 944–949, 2007.
 Tesch et al. [2011] Matthew Tesch, Jeff Schneider, and Howie Choset. Using response surfaces and expected improvement to optimize snake robot gait parameters. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1069–1074. IEEE, 2011.
 Marco et al. [2016] Alonso Marco, Philipp Hennig, Jeannette Bohg, Stefan Schaal, and Sebastian Trimpe. Automatic LQR tuning based on Gaussian process global optimization. In IEEE International Conference on Robotics and Automation (ICRA), pages 270–277. IEEE, 2016.

 Wan and Van Der Merwe [2000] Eric A Wan and Rudolph Van Der Merwe. The unscented Kalman filter for nonlinear estimation. In Adaptive Systems for Signal Processing, Communications, and Control Symposium, pages 153–158. IEEE, 2000.
 de Avila Belbute-Peres et al. [2018] Filipe de Avila Belbute-Peres, Kevin Smith, Kelsey Allen, Josh Tenenbaum, and J Zico Kolter. End-to-end differentiable physics for learning and control. In Advances in Neural Information Processing Systems, pages 7176–7187, 2018.
 Kolev and Todorov [2015] Svetoslav Kolev and Emanuel Todorov. Physically consistent state estimation and system identification for contacts. In IEEE-RAS 15th International Conference on Humanoid Robots (Humanoids), pages 1036–1043. IEEE, 2015.
 Ajay et al. [2018] Anurag Ajay, Jiajun Wu, Nima Fazeli, Maria Bauza, Leslie P. Kaelbling, Joshua B. Tenenbaum, and Alberto Rodriguez. Augmenting physical simulators with stochastic neural networks: Case study of planar pushing and bouncing. arXiv preprint arXiv:1808.03246, 2018.
 Kloss et al. [2017] Alina Kloss, Stefan Schaal, and Jeannette Bohg. Combining learned and analytical models for predicting action effects. arXiv preprint arXiv:1710.04102, 2017.
 Johannink et al. [2018] Tobias Johannink, Shikhar Bahl, Ashvin Nair, Jianlan Luo, Avinash Kumar, Matthias Loskyll, Juan Aparicio Ojea, Eugen Solowjow, and Sergey Levine. Residual reinforcement learning for robot control. arXiv preprint arXiv:1812.03201, 2018.
 Lillicrap et al. [2016] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. 2016.
 Kapturowski et al. [2019] Steven Kapturowski, Georg Ostrovski, Will Dabney, John Quan, and Remi Munos. Recurrent experience replay in distributed reinforcement learning. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=r1lyTjAqYX.
 Todorov et al. [2012] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5026–5033. IEEE, 2012.
 Plappert et al. [2018] Matthias Plappert, Marcin Andrychowicz, Alex Ray, Bob McGrew, Bowen Baker, Glenn Powell, Jonas Schneider, Josh Tobin, Maciek Chociej, Peter Welinder, et al. Multi-goal reinforcement learning: Challenging robotics environments and request for research. arXiv preprint arXiv:1802.09464, 2018.
 Fernández and Veloso [2006] Fernando Fernández and Manuela Veloso. Probabilistic policy reuse in a reinforcement learning agent. In Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multiagent Systems, pages 720–727. ACM, 2006.
 Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
VII Appendix
VII-A Initial Policies
Here we describe in detail the initial policies used for our experiments.
VII-A1 DiscreteMPCPush
Suppose we have a learned or known transition model that can be queried to predict the state trajectories and rewards that may result from a sequence of actions taken from an initial state. In Model-Predictive Control (MPC), we use this transition model to select each action taken by the policy $\pi$. More specifically, given the current state $s$, an MPC policy will internally consider multiple sequences of actions and compute the expected rewards accrued for each sequence. The first action in the best sequence is then the output of $\pi(s)$. To design an MPC policy, we must therefore specify the model and a procedure for selecting possible action sequences.
In high-dimensional tasks with long horizons, sparse rewards, and continuous states and actions, MPC is intractable without an efficient mechanism for selecting action sequences. Here we opt to discretize the action space as a means to simplify the search. In particular, rather than consider the infinite number of possible gripper movements, we consider only six, one per cardinal direction. We can then use a discrete graph search to explore possible action sequences.
We develop a discrete MPC policy for the Push task. The model is a perfect copy of the environment (i.e. a separate instance of MuJoCo). We further improve the policy by introducing an informative heuristic to guide the discrete search. The heuristic is a tuple whose first entry is the distance between the object and the target location and whose second entry is the distance between the gripper and the “push location.” The push location is meant to be the desired position of the gripper for pushing the block to the target; it is approximated by extending the vector difference between the object and target location by a small amount corresponding to the radius of a sphere circumscribed around the object. The second entry of the heuristic is only used to break ties when the first entry matches. We use this heuristic to perform a best-first search with 10 node expansions per environment step. At the end of the search, we find the node with the best heuristic value and take the corresponding first action. (If the root has the best heuristic value, we take a no-op action.)
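The search procedure described above can be sketched as follows. This is an illustrative reconstruction, not the code used in the experiments: the point-contact `toy_step` model stands in for the MuJoCo copy, and the constants `STEP`, `RADIUS`, `CONTACT`, and `TARGET` are hypothetical. Python compares tuples lexicographically, so the second heuristic entry breaks ties only when the first entries match.

```python
import heapq
import itertools
import math

# Six discrete gripper moves, one per cardinal direction, plus a no-op.
ACTIONS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
NOOP = (0, 0, 0)

STEP = 0.05      # hypothetical gripper displacement per action
RADIUS = 0.05    # hypothetical radius of the sphere around the object
CONTACT = 0.03   # hypothetical contact distance in the toy model
TARGET = (0.5, 0.0, 0.0)


def shift(point, direction, scale):
    return tuple(p + scale * d for p, d in zip(point, direction))


def toy_step(state, action):
    """Toy stand-in for the simulator: moving into the object pushes it."""
    gripper, obj = state
    gripper = shift(gripper, action, STEP)
    if math.dist(gripper, obj) < CONTACT:
        obj = shift(obj, action, STEP)
    return gripper, obj


def toy_heuristic(state):
    """(object-to-target distance, gripper-to-push-location distance)."""
    gripper, obj = state
    d_obj = math.dist(obj, TARGET)
    if d_obj == 0.0:
        return (0.0, 0.0)
    away = tuple((o - t) / d_obj for o, t in zip(obj, TARGET))
    push_location = shift(obj, away, RADIUS)
    return (round(d_obj, 6), round(math.dist(gripper, push_location), 6))


def best_first_action(root, step_fn, heuristic_fn, max_expansions=10):
    """Return the first action on the path to the best node found."""
    counter = itertools.count()  # unique tie-breaker for heap entries
    root_h = heuristic_fn(root)
    best_h, best_first = root_h, NOOP  # no-op if nothing beats the root
    frontier = [(root_h, next(counter), root, None)]
    for _ in range(max_expansions):
        if not frontier:
            break
        _, _, state, first = heapq.heappop(frontier)
        for action in ACTIONS:
            child = step_fn(state, action)
            child_h = heuristic_fn(child)
            child_first = action if first is None else first
            if child_h < best_h:
                best_h, best_first = child_h, child_first
            heapq.heappush(frontier, (child_h, next(counter), child, child_first))
    return best_first


start = ((0.0, 0.1, 0.0), (0.2, 0.0, 0.0))  # (gripper, object)
print(best_first_action(start, toy_step, toy_heuristic))
```

With only 10 expansions per environment step the search is shallow, so the tie-breaking entry matters: it steers the gripper toward the push location during the many steps in which no reachable node moves the object.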
VII-A2 ReactivePush
Our second policy is designed for pushing an object to a target location. While this policy works nearly perfectly in the original Push task, its performance drops dramatically when the sliding friction on the block is reduced as in the SlipperyPush task. Given an input state, the policy checks the following conditional statements in order until one holds and proceeds accordingly.

1. If the object is already at the target location, do nothing.
2. If the block is between the gripper and the target location, move the gripper towards the target location.
3. If the gripper is above the push location (see definition in DiscreteMPCPush), move the gripper down to prepare to push.
4. Move the gripper to above the push location.
To determine whether the object or gripper is “at” a location, we measure the distance and check if it is below a global threshold. The other key hyperparameter is a gain that determines how far the gripper moves at each time step. We manually tuned this gain to achieve near optimal performance on the original Push task.
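The ordered checks above can be sketched as a single function. This is a hedged illustration under stated assumptions: the constants `THRESHOLD`, `GAIN`, `OBJECT_RADIUS`, and `HOVER_HEIGHT` are hypothetical values, and the geometric tests in `is_between` and `is_above` are one plausible reading of the conditions, not the exact tests used in the experiments.

```python
import numpy as np

THRESHOLD = 0.01      # hypothetical global distance threshold
GAIN = 0.5            # hypothetical gain on gripper displacement
OBJECT_RADIUS = 0.05  # hypothetical radius used for the push location
HOVER_HEIGHT = 0.10   # hypothetical height of the pre-push hover position


def is_at(a, b):
    return bool(np.linalg.norm(a - b) < THRESHOLD)


def push_location(obj, target):
    # Extend the target-to-object direction past the object by its radius.
    away = (obj - target) / np.linalg.norm(obj - target)
    return obj + OBJECT_RADIUS * away


def is_between(gripper, obj, target):
    # Assumed test: the object lies close to the gripper-to-target segment.
    segment = target - gripper
    length_sq = np.dot(segment, segment)
    if length_sq == 0.0:
        return False
    t = np.dot(obj - gripper, segment) / length_sq
    closest = gripper + t * segment
    return bool(0.0 < t < 1.0 and np.linalg.norm(obj - closest) < OBJECT_RADIUS)


def is_above(gripper, location):
    # Assumed test: horizontally aligned with the location and higher than it.
    horizontal = np.linalg.norm(gripper[:2] - location[:2])
    return bool(horizontal < THRESHOLD and gripper[2] > location[2] + THRESHOLD)


def reactive_push(gripper, obj, target):
    """Check the four conditions in order and return a gripper displacement."""
    if is_at(obj, target):                      # 1. already done
        return np.zeros(3)
    if is_between(gripper, obj, target):        # 2. push toward the target
        return GAIN * (target - gripper)
    push_loc = push_location(obj, target)
    if is_above(gripper, push_loc):             # 3. descend to push location
        return GAIN * (push_loc - gripper)
    hover = push_loc + np.array([0.0, 0.0, HOVER_HEIGHT])
    return GAIN * (hover - gripper)             # 4. move above push location


obj = np.array([0.2, 0.0, 0.0])
target = np.array([0.5, 0.0, 0.0])
print(reactive_push(np.array([0.0, 0.3, 0.2]), obj, target))
```

Because the conditions are evaluated in a fixed order, each later rule implicitly assumes that all earlier ones failed, which keeps the individual tests simple.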
VII-A3 ReactivePickAndPlace
Our third policy is designed to pick up a cube and bring it to a target location on or above the table. Given an input state, the policy checks the following conditional statements in order until one holds and proceeds accordingly.

1. If the object is already at the target location, do nothing.
2. If the gripper is grasping the object, move towards the target location.
3. If the object is between the gripper fingers (but not grasped), close the gripper.
4. If the gripper is above the object
   (a) … and the gripper is closed, open the gripper.
   (b) … and the gripper is open, move the gripper down.
5. Move the gripper towards the location above the object.
To determine whether the gripper is grasping the object, we check that the object location is between the two fingers and that the fingers are not more than the block width apart. We again use the distance threshold and gain hyperparameters described above.
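As an illustration of the grasp test just described, the following sketch checks both conditions from finger and object positions. The function name `is_grasping` and the lateral tolerance of half a block width are assumptions made for this example; the text only specifies that the object lies between the fingers and that the fingers are at most one block width apart.

```python
import numpy as np


def is_grasping(obj, left_finger, right_finger, block_width):
    """Return True if the object is held between sufficiently closed fingers."""
    axis = right_finger - left_finger
    gap = np.linalg.norm(axis)
    # Fingers must be no more than the block width apart.
    if gap == 0.0 or gap > block_width:
        return False
    # Project the object onto the finger-to-finger axis; 0 <= t <= 1 means
    # the object lies between the two fingers along that axis.
    t = np.dot(obj - left_finger, axis) / gap**2
    # Assumed extra tolerance: the object must also be near the axis itself.
    lateral = np.linalg.norm(obj - (left_finger + t * axis))
    return bool(0.0 <= t <= 1.0 and lateral <= block_width / 2)


closed_left = np.array([-0.02, 0.0, 0.0])
closed_right = np.array([0.02, 0.0, 0.0])
print(is_grasping(np.zeros(3), closed_left, closed_right, block_width=0.05))
```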
VII-A4 ReactiveHook
Our fourth policy is designed to pick up a hook, move it behind and to the right of an object, and push and pull the object towards a target location. The policy works nearly perfectly when the object is a cube, the table is clear of obstacles, and the observations are noise-free (see Figure 1a). However, the policy performance drops substantially when transferred to the NoisyHook and ComplexHook tasks. Given an input state, the policy checks the following conditional statements in order until one holds and proceeds accordingly.

1. If the object is already at the target location, do nothing.
2. If the hook is not grasped and lifted above the table, grasp and lift the hook.
3. If the hook is not beyond and to the right of the object, move forward or rightward accordingly.
4. Move the gripper following the vector difference between the object and the target location.
The grasp position is fixed so that the robot always attempts to pick up the same part of the hook (near the bottom). In addition to the global threshold and gain hyperparameters, we use knowledge of the length and width of the hook to determine gripper movements as a function of desired hook movements.
VII-A5 CachedPETS
Our final policy uses a model-predictive controller (MPC) with a learned transition model. We take the recently proposed Probabilistic Ensembles with Trajectory Sampling (PETS) as our method for model-based reinforcement learning [6]. PETS learns a transition model in the form of an ensemble of probabilistic neural networks. During planning, a sequence of actions is sampled with reference to previously high-reward action sequences using the cross-entropy method. To predict the subsequent states and rewards using the learned transition model, a finite collection of particles is propagated forward in time. The action that leads to the highest expected reward is selected, and planning repeats after each environment step. We use the PETS implementation made available by Chua et al. [6] without modification.
MPC methods are generally much slower than model-free counterparts. Indeed, we found PETS alone to be intractably slow as an initial policy for RPL. We therefore create a “cached” version that stores the action produced by PETS for 500 input states. The number 500 was selected based on a small performance analysis (see later in the appendix). We select these 500 input states by sampling trajectories from the environment on-policy. At test time, given a new state, we find the nearest state in the cache (as measured by Euclidean distance) and take the corresponding action. Though quite simple and coarse as an approximation of the full MPC, the final CachedPETS controller performs only slightly worse than the original PETS (see Figure 4c) with a speedup of 2 to 3 orders of magnitude.
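The caching scheme amounts to a nearest-neighbor lookup table. A minimal sketch follows, assuming states and actions are stored as NumPy arrays; `CachedController`, `build_cache`, and the stand-in `slow_controller` are hypothetical names for illustration and do not correspond to the PETS codebase.

```python
import numpy as np


class CachedController:
    """Nearest-neighbor approximation of a slow controller."""

    def __init__(self, states, actions):
        self.states = np.asarray(states, dtype=float)    # shape (N, state_dim)
        self.actions = np.asarray(actions, dtype=float)  # shape (N, action_dim)

    def __call__(self, state):
        # Euclidean distance from the query state to every cached state.
        distances = np.linalg.norm(self.states - np.asarray(state, dtype=float), axis=1)
        return self.actions[np.argmin(distances)]


def build_cache(slow_controller, states):
    """Query the slow controller once per sampled state and store the result."""
    return CachedController(states, [slow_controller(s) for s in states])


def slow_controller(state):
    # Stand-in for an expensive planner such as an MPC call.
    return -np.asarray(state, dtype=float)


sampled_states = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
cached = build_cache(slow_controller, sampled_states)
print(cached([0.9, 1.2]))  # action stored for the nearest state [1.0, 1.0]
```

A brute-force scan is adequate for a cache of a few hundred states; a spatial index would only become worthwhile for much larger caches.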
VII-B Model Hyperparameters
All experiments in this paper use the following hyperparameters, which are taken from [32].

Actor and critic networks: layers with units each and ReLU nonlinearities

Adam optimizer [34] with for training both actor and critic

Buffer size: transitions

Polyakaveraging coefficient:

Action L2 norm coefficient:

Observation clipping:

Batch size:

Cycles per epoch:

Batches per cycle:

Test rollouts per epoch:

Probability of random actions:

Scale of additive Gaussian noise: ( for hooks)

Probability of HER experience replay:

Normalized clipping:
For the Push and PickAndPlace experiments, we use 19 MPI workers with a rollout batch size of 2 to match the previous work. For the Hook experiments, we use 1 MPI worker and a rollout batch size of 4 to save on compute resources. We determined the “Expert Explore” baseline hyperparameters and with a small grid search. For all experiments, we used a burn-in threshold of . We did not optimize this hyperparameter and believe RPL’s performance could be further improved in doing so.