Deep Residual Reinforcement Learning

05/03/2019 ∙ by Shangtong Zhang, et al. ∙ University of Oxford

We revisit residual algorithms in both model-free and model-based reinforcement learning settings. We propose the bidirectional target network technique to stabilize residual algorithms, yielding a residual version of DDPG that significantly outperforms vanilla DDPG in the DeepMind Control Suite benchmark. Moreover, we find the residual algorithm an effective approach to the distribution mismatch problem in model-based planning. Compared with the existing TD(k) method, our residual-based method makes weaker assumptions about the model and yields a greater performance boost.




1 Introduction

Semi-gradient algorithms have recently enjoyed great success in deep reinforcement learning (RL) problems, e.g., DQN (Mnih et al., 2015) achieves human-level control in the Arcade Learning Environment (ALE, Bellemare et al. 2013). However, such algorithms lack theoretical support. Most semi-gradient algorithms suffer from divergence under nonlinear function approximation or off-policy training (Baird, 1995; Tsitsiklis and Van Roy, 1997). By contrast, residual gradient (RG, Baird 1995) algorithms are true stochastic gradient algorithms and enjoy convergence guarantees (to a local minimum) under mild conditions with both nonlinear function approximation and off-policy training. Baird (1995) further proposes residual algorithms (RA) to unify residual gradients and semi-gradients via mixing them together.

Residual algorithms suffer from the double sampling issue (Baird, 1995): two independently sampled successor states are required to compute the gradients. This requirement can be easily satisfied in model-based RL or in deterministic environments. However, even in these settings, residual algorithms have long been either overlooked or dismissed as impractical. In this paper, we aim to overturn that conventional wisdom with new algorithms built on RA and empirical results showing their efficacy.

Our contributions are threefold. First, we give a thorough overview of the previous comparison between residual gradients and semi-gradients.

Second, we showcase the advantages of RA in a model-free RL setting with deterministic environments. While target networks (Mnih et al., 2015) are usually an important component in a deep RL algorithm (Mnih et al., 2015; Lillicrap et al., 2015), we find a naive combination of target networks and residual algorithms, in general, does not improve performance. Therefore, we propose the bidirectional target network technique to stabilize residual algorithms. We show that our residual version of Deep Deterministic Policy Gradients (DDPG, Lillicrap et al. 2015) significantly outperforms vanilla DDPG in DMControl.

Third, we showcase the advantages of RA in a model-based RL setting, where a learned model generates imaginary transitions to train the value function. In general, model-based methods suffer from a distribution mismatch problem (Feinberg et al., 2018): the value function trained on real states does not generalize well to imaginary states generated by a model. To address this issue, Feinberg et al. (2018) train the value function on both real and imaginary states via the TD(k) trick. However, TD(k) requires that predictions $k$ steps in the future made by model rollouts be accurate (Feinberg et al., 2018). In this paper, we show that RA naturally allows the value function to be trained on both real and imaginary states and requires only 1-step rollouts. Our experiments show that RA-based planning boosts performance more than TD(k)-based planning.

2 Background

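We consider an MDP (Puterman, 2014) consisting of a finite state space $\mathcal{S}$, a finite action space $\mathcal{A}$, a reward function $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, a transition kernel $p: \mathcal{S} \times \mathcal{S} \times \mathcal{A} \to [0, 1]$ and a discount factor $\gamma \in [0, 1)$. With $\pi: \mathcal{A} \times \mathcal{S} \to [0, 1]$ denoting a policy, at time $t$, an agent at a state $S_t$ takes an action $A_t$ according to $\pi(\cdot | S_t)$. The agent then gets a reward $R_{t+1}$ satisfying $\mathbb{E}[R_{t+1}] = r(S_t, A_t)$ and proceeds to a new state $S_{t+1}$ according to $p(\cdot | S_t, A_t)$. We use $G_t \doteq \sum_{i=0}^{\infty} \gamma^i R_{t+i+1}$ to denote the return from time $t$, $v_\pi(s) \doteq \mathbb{E}_\pi[G_t | S_t = s]$ to denote the state value function of $\pi$, and $q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t | S_t = s, A_t = a]$ to denote the state-action value function of $\pi$. We use $P_\pi \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{S}|}$ to denote the transition matrix induced by a policy $\pi$, i.e., $P_\pi[s, s'] \doteq \sum_a \pi(a | s) p(s' | s, a)$, and use $d_\pi$ to denote its unique stationary distribution, assuming $P_\pi$ is ergodic. The reward vector induced by $\pi$ is $r_\pi \in \mathbb{R}^{|\mathcal{S}|}$, with $r_\pi[s] \doteq \sum_a \pi(a | s) r(s, a)$.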

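The value function $v_\pi$ is the unique fixed point of the Bellman operator $\mathcal{T}$ (Bellman, 2013). In matrix form, $\mathcal{T}$ is defined as $\mathcal{T} v \doteq r_\pi + \gamma P_\pi v$, where v can be any vector in $\mathbb{R}^{|\mathcal{S}|}$. Here $|\mathcal{S}|$ is the number of states.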
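The fixed-point property can be checked numerically. The following is a minimal sketch, assuming a hypothetical two-state chain (the transition matrix, rewards, and function names are illustrative and not taken from the paper): repeatedly applying the Bellman operator from a zero vector converges to the value function because the operator is a contraction for a discount factor below 1.

```python
def bellman_operator(v, r_pi, P_pi, gamma):
    """Apply (T v)[s] = r_pi[s] + gamma * sum_t P_pi[s][t] * v[t]."""
    n = len(v)
    return [r_pi[s] + gamma * sum(P_pi[s][t] * v[t] for t in range(n))
            for s in range(n)]


def evaluate_policy(r_pi, P_pi, gamma, tol=1e-12, max_iters=10_000):
    """Iterate v <- T v from zero until successive iterates differ by < tol."""
    v = [0.0] * len(r_pi)
    for _ in range(max_iters):
        new_v = bellman_operator(v, r_pi, P_pi, gamma)
        if max(abs(a - b) for a, b in zip(new_v, v)) < tol:
            return new_v
        v = new_v
    return v


# Hypothetical ergodic chain: both states move uniformly at random,
# state 0 pays reward 1 and state 1 pays reward 0.
P_pi = [[0.5, 0.5], [0.5, 0.5]]
r_pi = [1.0, 0.0]
v_pi = evaluate_policy(r_pi, P_pi, gamma=0.9)
print(v_pi)  # approximately [5.5, 4.5]
```

Solving the two linear equations by hand gives the same values, 5.5 and 4.5, which is a quick sanity check on the iteration.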

Policy Evaluation: We consider the problem of finding $v_\pi$ for a given policy $\pi$ and use v, parameterized by w, to denote an estimate of $v_\pi$, the vector form of $v_\pi(s)$. We consider on-policy linear function approximation and use $x: \mathcal{S} \to \mathbb{R}^d$ to denote a feature function which maps a state to a $d$-dimensional feature. The feature matrix is then $X \in \mathbb{R}^{|\mathcal{S}| \times d}$, whose $s$-th row is $x(s)^\top$, and the value estimate is $v \doteq Xw$.

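To approximate $v_\pi$, one direct goal is to minimize the Mean Squared Value Error (MSVE):

$$\text{MSVE}(w) \doteq \|v - v_\pi\|_D^2,$$

where $D \doteq \text{diag}(d_\pi)$ and $\|u\|_D^2 \doteq u^\top D u$. To minimize MSVE, a Monte Carlo return can be used as a sample for $v_\pi(S_t)$ to train v. However, this method suffers from a large variance and usually requires off-line learning (Bertsekas and Tsitsiklis, 1996). To address those issues, we consider minimizing the Mean Squared Projected Bellman Error (MSPBE) and the Mean Squared Bellman Error (MSBE), both defined through the Bellman operator $\mathcal{T}$:

$$\text{MSPBE}(w) \doteq \|\Pi(\mathcal{T} v - v)\|_D^2, \qquad \text{MSBE}(w) \doteq \|\mathcal{T} v - v\|_D^2.$$

Here $\Pi$ is a projection operator which maps an arbitrary vector onto the column vector space of X, minimizing a $d_\pi$-weighted projection error, i.e., $\Pi u \doteq X \bar{w}$, where $\bar{w} \doteq \arg\min_{w} \|Xw - u\|_D^2$. With linear function approximation, $\Pi$ is linear.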
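The linearity of the projection can be made explicit. As a short worked derivation, assuming the feature matrix X has full column rank and writing $D$ for the diagonal matrix of the stationary distribution and $u$ for the vector being projected (both assumptions of this sketch rather than statements from the text), setting the gradient of the weighted least-squares objective to zero gives the standard closed form:

```latex
\bar{w} = \arg\min_{w} \|Xw - u\|_D^2
\;\Longrightarrow\;
X^\top D \,(X\bar{w} - u) = 0
\;\Longrightarrow\;
\bar{w} = (X^\top D X)^{-1} X^\top D\, u,
\qquad
\Pi = X (X^\top D X)^{-1} X^\top D .
```

Since $\Pi$ is a fixed matrix that does not depend on $u$, the projection is a linear map, and $\Pi X w = X w$ for every w.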

There are various algorithms for minimizing MSPBE and MSBE. Temporal Difference learning (TD, Sutton 1988) is commonly used to minimize MSPBE. TD updates w as

$$w \leftarrow w + \alpha \big(R_{t+1} + \gamma v(S_{t+1}) - v(S_t)\big) \nabla_w v(S_t),$$

where $\alpha$ is a step size. Under mild conditions, on-policy linear TD converges to the point where MSPBE is 0 (Tsitsiklis and Van Roy, 1997). TD is a semi-gradient (Sutton and Barto, 2018) algorithm in that it ignores the dependency of $v(S_{t+1})$ on w. There are also true gradient algorithms for optimizing MSPBE, e.g., Gradient TD methods (Sutton et al., 2009). Gradient TD methods compute the gradient of MSPBE directly and also enjoy convergence guarantees.
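A minimal sketch of one semi-gradient TD step with linear function approximation may make the "semi-gradient" label concrete. The function and argument names are illustrative assumptions; the point is that the bootstrapped target is treated as a constant, so only the feature vector of the current state appears in the update direction.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def td_update(w, x_s, reward, x_next, alpha, gamma):
    """One semi-gradient TD(0) step for a linear value estimate v(s) = x(s)^T w.

    The successor value gamma * x_next^T w is used as a fixed target:
    its dependence on w is ignored when forming the update direction.
    """
    delta = reward + gamma * dot(x_next, w) - dot(x_s, w)
    return [wi + alpha * delta * xi for wi, xi in zip(w, x_s)]


# Hypothetical one-hot features for two states.
w = td_update([0.0, 0.0], x_s=[1.0, 0.0], reward=1.0, x_next=[0.0, 1.0],
              alpha=0.5, gamma=0.9)
print(w)  # [0.5, 0.0]: only the weight of the visited state moves
```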

Baird (1995) proposes residual gradient algorithms for minimizing MSBE, which update w as

$$w \leftarrow w - \alpha \big(R_{t+1} + \gamma v(S_{t+1}) - v(S_t)\big) \big(\gamma \nabla_w v(S'_{t+1}) - \nabla_w v(S_t)\big), \qquad (1)$$

where $S'_{t+1}$ is another sampled successor state for $S_t$, independent of $S_{t+1}$. This requirement for two independent samples is known as the double sampling issue (Baird, 1995). If both the transition kernel and the policy are deterministic, we can simply use one sample without introducing bias. Otherwise, we may need to have access to the transition kernel $p$, which is usually not available in model-free RL. Regardless, RG is a true gradient algorithm with convergence guarantees under mild conditions.
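The bias caused by reusing a single successor sample can be seen in a tiny tabular example. The sketch below is an illustration under assumed numbers (a hypothetical state 0 that moves to state 1 or state 2 with equal probability, zero reward, no discounting), not an experiment from the paper. The value estimates are chosen so that the expected Bellman error at state 0 is exactly zero, hence the true gradient contribution of that state vanishes.

```python
from itertools import product

GAMMA = 1.0
REWARD = 0.0
# Hypothetical dynamics: state 0 moves to state 1 or 2 with equal probability.
SUCCESSORS = [(1, 0.5), (2, 0.5)]


def td_error(v, s_next):
    """Temporal difference error for the transition 0 -> s_next."""
    return REWARD + GAMMA * v[s_next] - v[0]


def direction(s_next):
    """gamma * grad v(s_next) - grad v(0) for a tabular v over three states."""
    g = [-1.0, 0.0, 0.0]
    g[s_next] += GAMMA
    return g


def expected_gradient(v, independent):
    """Expectation of td_error * direction under the two sampling schemes."""
    total = [0.0, 0.0, 0.0]
    for (s1, p1), (s2, p2) in product(SUCCESSORS, repeat=2):
        if independent:
            weight = p1 * p2      # two independent successor samples
        elif s1 == s2:
            weight = p1           # one sample reused for both factors
        else:
            continue
        delta = td_error(v, s1)
        total = [t + weight * delta * g for t, g in zip(total, direction(s2))]
    return total


v = [0.0, 1.0, -1.0]  # expected Bellman error at state 0 is zero
print(expected_gradient(v, independent=True))   # [0.0, 0.0, 0.0]
print(expected_gradient(v, independent=False))  # [0.0, 0.5, -0.5]
```

With independent samples the expected direction is zero, as it should be. With a reused sample it is not: following the negative of that direction pulls the two successor values toward each other, which penalizes the variance of the successor value rather than the Bellman error.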

We now extend our discussion of policy evaluation to off-policy learning and nonlinear function approximation. True gradient algorithms like Gradient TD methods and RG remain convergent to local minima under off-policy training with any function approximation (Baird, 1995; Sutton et al., 2009; Maei, 2011). However, the empirical success of Gradient TD methods is limited to simple domains due to their large variance (Sutton et al., 2016). Semi-gradient algorithms are not convergent in general, e.g., the divergence of off-policy linear TD is well-documented (Tsitsiklis and Van Roy, 1997).

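Semi-gradient algorithms are fast but in general not convergent. Residual gradient algorithms are convergent but slow (Baird, 1995). To take advantage of both, Baird (1995) proposes to mix semi-gradients and residual gradients together, yielding the residual algorithms. The RA version of TD (Baird, 1995) updates w as

$$w \leftarrow w - \alpha \big(R_{t+1} + \gamma v(S_{t+1}) - v(S_t)\big) \big(\eta \gamma \nabla_w v(S'_{t+1}) - \nabla_w v(S_t)\big),$$

where $\eta \in [0, 1]$ controls how the two gradients are mixed: $\eta = 0$ recovers the semi-gradient TD update and $\eta = 1$ recovers the residual gradient update. Little empirical study has been conducted for RA.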
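The mixing can be sketched for a linear value estimate as follows. This is an illustrative implementation under assumed names (`residual_update`, `eta`, one-hot features); it takes the features of an independently sampled successor as a separate argument, which in a deterministic environment is simply the same successor.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def residual_update(w, x_s, reward, x_next, x_next_indep, alpha, gamma, eta):
    """One residual-algorithm step for a linear estimate v(s) = x(s)^T w.

    eta = 0 gives the semi-gradient TD step; eta = 1 gives the residual
    gradient step. x_next_indep holds the features of a second successor
    sample (equal to x_next when transitions are deterministic).
    """
    delta = reward + gamma * dot(x_next, w) - dot(x_s, w)
    return [wi - alpha * delta * (eta * gamma * xn - xs)
            for wi, xs, xn in zip(w, x_s, x_next_indep)]


# Hypothetical deterministic transition with one-hot features.
args = dict(x_s=[1.0, 0.0], reward=1.0, x_next=[0.0, 1.0],
            x_next_indep=[0.0, 1.0], alpha=0.5, gamma=0.9)
w_td = residual_update([0.0, 0.0], eta=0.0, **args)
w_rg = residual_update([0.0, 0.0], eta=1.0, **args)
print(w_td)  # [0.5, 0.0]: only the current state's weight moves
print(w_rg)  # the successor weight is also pushed, toward a smaller TD error
```

In the residual gradient case the successor's weight is lowered so that the temporal difference shrinks from both ends, which is the forward value propagation discussed later in the text.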

Control: We now consider the problem of control, where we are interested in finding an optimal policy $\pi^*$ such that $v_{\pi^*}(s) \geq v_\pi(s)$ for all $\pi$ and $s$. We use $q^*$ to denote the state-action value function of $\pi^*$ and $Q_\theta$ to denote an estimate of $q^*$, parameterized by $\theta$. Q-learning (Watkins and Dayan, 1992) is usually used to train $Q_\theta$ and enjoys convergence guarantees in the tabular setting. When Q-learning is combined with neural networks, Deep-Q-Networks (DQN, Mnih et al. 2015) update $\theta$ as

$$\theta \leftarrow \theta + \alpha \big(r + \gamma \max_{a'} Q_{\theta^-}(s', a') - Q_\theta(s, a)\big) \nabla_\theta Q_\theta(s, a), \qquad (2)$$

where $\alpha$ is a step size, $(s, a, r, s')$ is a transition sampled from a replay buffer (Lin, 1992), and $Q_{\theta^-}$ indicates the estimate is from a target network (Mnih et al., 2015), parameterized by $\theta^-$, which is synchronized with $\theta$ periodically.
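The role of the target network can be illustrated with a tabular analogue, where the gradient of a table entry is an indicator and the "network" is a nested list. This is a sketch under assumed names and numbers, not DQN itself: there is no neural network and no replay buffer, only the frozen copy used for bootstrapping.

```python
import copy


def q_update_with_target(q, q_target, transition, alpha, gamma):
    """Tabular analogue of the DQN update: bootstrap from the frozen copy."""
    s, a, r, s_next = transition
    target = r + gamma * max(q_target[s_next])
    q[s][a] += alpha * (target - q[s][a])


# Hypothetical table with two states and two actions.
q = [[0.0, 0.0], [1.0, 2.0]]
q_target = copy.deepcopy(q)  # synchronize the target with the online table

q_update_with_target(q, q_target, (0, 1, 1.0, 1), alpha=0.5, gamma=0.9)
# The next update bootstraps from state 0, whose online value just changed,
# but the target copy still holds the old value 0.0.
q_update_with_target(q, q_target, (1, 0, 0.0, 0), alpha=0.5, gamma=0.9)
print(q)         # online table after two updates
print(q_target)  # unchanged until the next synchronization
```

The second update moves `q[1][0]` toward 0 rather than toward the freshly updated value of state 0, which is the stabilizing effect of bootstrapping from a periodically synchronized copy.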

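When the action space is continuous, it is hard to perform the $\max$ operation in the DQN update (2). DDPG can be interpreted as a continuous version of DQN, where an actor $\mu_\nu$, parameterized by $\nu$, is trained to output the greedy action. DDPG updates $\theta$ and $\nu$ as

$$\theta \leftarrow \theta + \alpha \big(r + \gamma Q_{\theta^-}(s', \mu_{\nu^-}(s')) - Q_\theta(s, a)\big) \nabla_\theta Q_\theta(s, a), \qquad (3)$$

$$\nu \leftarrow \nu + \beta \nabla_a Q_\theta(s, a)\big|_{a = \mu_\nu(s)} \nabla_\nu \mu_\nu(s), \qquad (4)$$

where $\beta$ is a step size and $\mu_{\nu^-}$ indicates the greedy action is from a target network, parameterized by $\nu^-$, which is synchronized with $\nu$ periodically.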

Both DQN and DDPG are semi-gradient algorithms. There are also true gradient methods for control, e.g., Greedy-GQ (Maei et al., 2010) and the residual version of Q-learning (Baird, 1995). The empirical success of Greedy-GQ is limited to simple domains due to its large variance (Sutton et al., 2016).

3 Comparing TD and RG

In this section, we review existing comparisons between RG and TD. We start by comparing their fixed points, MSBE and MSPBE, in the setting of linear function approximation.

Cons of MSBE:

  • Sutton and Barto (2018) show that MSBE is not uniquely determined by the observed data. Different MDPs may have the same data distribution due to state aliasing, but the minima of MSBE can still be different. This questions the learnability of MSBE as sampled transitions are all that is available in model-free RL. By contrast, the minima of MSPBE are always the same for MDPs with the same data distribution.

  • Empirically, optimizing MSBE can lead to unsatisfying solutions. For example, in the A-presplit example (Sutton and Barto, 2018), the value of most states can be represented accurately by the function approximator but the MSBE minimizer does not do so, while the MSPBE minimizer does. Furthermore, empirically the MSBE minimizer can be further from the MSVE minimizer than the MSPBE minimizer (Dann et al., 2014).

Pros of MSBE:

  • Williams and Baird (1993) show MSBE can be used to bound MSVE (up to a constant). By contrast, at a point where MSPBE is minimized, MSVE can be arbitrarily large (Bertsekas and Tsitsiklis, 1996).

  • MSBE is an upper bound of MSPBE (Scherrer, 2010), indicating that optimizing MSBE implicitly optimizes MSPBE.

We now compare RG and TD.

Cons of RG:

  • Due to the double sampling issue, it is usually hard to apply RG in the stochastic model-free setting (Baird, 1995), while TD is in general compatible with all kinds of environments.

  • RG is usually slower than TD. Empirically, this is observed by Baird (1995), van Hasselt (2011), and Gordon (1995, 1999). Intuitively, in the RG update (1), a state and its successor are often similar under function approximation. As a result, the two gradients $\nabla_w v(S_t)$ and $\nabla_w v(S'_{t+1})$ tend to be similar and cancel each other, slowing down the learning. Theoretically, Schoknecht and Merke (2003b) prove TD converges faster than RG in a tabular setting.

  • Lagoudakis and Parr (2003) argue that TD usually provides a better solution than RG, even though the value function is not as well approximated. The TD solution “preserves the shape of the value function to some extent rather than trying to fit the absolute values”. Thus “the improved policy from the corresponding approximate value function is closer to the improved policy from the exact value function” (Lagoudakis and Parr, 2003; Li, 2008; Sun and Bagnell, 2015).

Pros of RG:

  • RG is a true gradient algorithm and enjoys convergence guarantees in most settings under mild conditions. By contrast, the divergence of TD with off-policy learning or nonlinear function approximation is well documented (Tsitsiklis and Van Roy, 1997). Empirically, Munos (2003); Li (2008) show that RG is more stable than TD.

  • Schoknecht and Merke (2003b) observe that RG converges faster than TD in the four-room domain (Sutton et al., 1999) with linear function approximation. Scherrer (2010) shows empirically that the TD solution is usually slightly better than RG but in some cases fails dramatically. So RG should be preferred on average.


  • Li (2008) proves that TD makes more accurate predictions (i.e., the predicted state value is close to the true state value), while RG yields smaller temporal differences (i.e., the value predictions for a state and its successor are more consistent). This is also explained in Sutton and Barto (2018).

To summarize, previous insights about RG and TD are mixed. There is little empirical study for RG in deep RL problems, much less RA. It is not clear whether and how we can take advantage of RA in model-free and model-based RL to solve deep RL problems.

4 Residual Algorithms in Model-free RL

In this section, we investigate how to combine RA and DDPG. In particular, we consider (almost) deterministic environments (e.g., DMControl) to avoid the double sampling issue.

In semi-gradient algorithms, value propagation goes backwards in time. The value of a state depends on the value of its successor through bootstrapping, and a target network is used to stabilize this bootstrapping. RA allows value propagation both forwards and backwards. The value of a state depends on the value of both its successor and predecessor. Therefore, we need to stabilize the bootstrapping in both directions. To this end, we propose the bidirectional target network technique. Employing this in DDPG yields Bi-Res-DDPG, which updates the critic parameters $\theta$ as:

$$\theta \leftarrow \theta + \alpha \big(r + \gamma Q_{\theta^-}(s', \mu_{\nu^-}(s')) - Q_\theta(s, a)\big) \nabla_\theta Q_\theta(s, a) - \alpha \eta \big(r + \gamma Q_\theta(s', \mu_{\nu^-}(s')) - Q_{\theta^-}(s, a)\big) \gamma \nabla_\theta Q_\theta(s', \mu_{\nu^-}(s')),$$

where $\theta^-, \nu^-$ are target networks and $\eta$ controls how the two gradients are mixed. The actor update remains unchanged.
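One reading of the bidirectional idea can be sketched for a linear critic. Everything below is an assumption made for illustration: the critic is linear in hypothetical features, the successor features are taken to be already evaluated at the target actor's action, and the function name is invented. The sketch follows the description in the text: the backward (TD) term bootstraps from the target network at the successor, while the forward (residual) term differentiates the online successor value and takes the predecessor value from the target network.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def bi_res_critic_update(theta, theta_target, phi_sa, phi_next,
                         reward, alpha, gamma, eta):
    """Bidirectional residual step for a linear critic Q(s, a) = phi(s, a)^T theta.

    phi_next is assumed to hold the features of (s', target_actor(s')).
    """
    # Backward propagation: target network supplies the successor value.
    delta_td = reward + gamma * dot(phi_next, theta_target) - dot(phi_sa, theta)
    # Forward propagation: target network supplies the predecessor value.
    delta_res = reward + gamma * dot(phi_next, theta) - dot(phi_sa, theta_target)
    return [t + alpha * delta_td * ps - alpha * eta * delta_res * gamma * pn
            for t, ps, pn in zip(theta, phi_sa, phi_next)]


# Hypothetical one-hot features and parameters.
kwargs = dict(theta=[0.0, 0.0], theta_target=[0.0, 1.0],
              phi_sa=[1.0, 0.0], phi_next=[0.0, 1.0],
              reward=1.0, alpha=0.5, gamma=0.9)
print(bi_res_critic_update(eta=0.0, **kwargs))  # semi-gradient step only
print(bi_res_critic_update(eta=1.0, **kwargs))  # adds the forward term
```

With `eta=0.0` the step reduces to the usual target-network critic update, which matches the observation in the text that the zero-mixing variant coincides with vanilla DDPG.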

Figure 1: AUC improvements of Bi-Res-DDPG over DDPG on all 28 DMControl tasks, computed as

We compared Bi-Res-DDPG to DDPG in all 28 DMControl tasks. Our DDPG implementation uses the same architecture and hyperparameters as Lillicrap et al. (2015), which are inherited by Bi-Res-DDPG (and all other DDPG variants in this paper). For Bi-Res-DDPG, we tune over on walker-stand and use across all tasks. We perform 20 deterministic evaluation episodes every training steps and plot the averaged evaluation episode returns. All curves are averaged over 5 independent runs and are available in the supplementary materials. In the main text, we report the improvement of AUC (area under the curve) of the evaluation curves in Figure 1. AUC serves as a proxy for learning speed. Bi-Res-DDPG achieves a 20% (37%) AUC improvement over the original DDPG in terms of the median (mean).
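As an illustration of the AUC comparison, the sketch below computes a trapezoidal area under an evaluation curve and a relative improvement over a baseline. The normalization (difference divided by the baseline AUC) is an assumption of this sketch, since the exact formula behind Figure 1 is not recoverable from the text; the curves are made-up numbers.

```python
def auc(returns, step=1.0):
    """Trapezoidal area under a curve sampled every `step` training steps."""
    return sum(step * (a + b) / 2.0 for a, b in zip(returns, returns[1:]))


def relative_auc_improvement(candidate, baseline, step=1.0):
    """(AUC_candidate - AUC_baseline) / AUC_baseline; assumed normalization."""
    base = auc(baseline, step)
    return (auc(candidate, step) - base) / base


# Hypothetical evaluation curves (averaged episode returns).
baseline_curve = [0.0, 1.0, 2.0]
candidate_curve = [0.0, 2.0, 3.0]
print(relative_auc_improvement(candidate_curve, baseline_curve))  # 0.75
```

A curve that rises earlier accumulates more area even if both curves end at similar returns, which is why AUC can serve as a proxy for learning speed.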

To further investigate the relationship between the target network and RA, we study several variants of DDPG. We define a shorthand and use “T” and “O” to denote the target network and the online network respectively. We have:

Res-DDPG: (5)
TO-Res-DDPG: (6)
OT-Res-DDPG: (7)
TT-Res-DDPG: (8)

Res-DDPG is a direct combination of RA and DDPG without a target network. TO-Res-DDPG simply adds a residual gradient term to the original DDPG. OT-Res-DDPG focuses on stabilizing the bootstrapping for the forward value propagation. TT-Res-DDPG stabilizes bootstrapping in both directions but destroys the connection between prediction and error. By contrast, Bi-Res-DDPG stabilizes bootstrapping in both directions and maintains the connection between prediction and error.

Figure 2: Performance of Bi-Res-DDPG variants on walker-stand.

Figure 2 compares these variants on walker-stand. The main points to note are: (1) Both Bi-Res-DDPG($\eta = 0$) and TO-Res-DDPG($\eta = 0$) are the same as vanilla DDPG. The curves are similar, verifying the stability of our implementation. (2) Res-DDPG($\eta = 0$) corresponds to vanilla DDPG without a target network, which performs poorly. This confirms that a target network is important for stabilizing training and mitigating divergence when a nonlinear function approximator is used (Mnih et al., 2015; Lillicrap et al., 2015). (3) Increasing $\eta$ improves Res-DDPG’s performance. This complies with the argument from Baird (1995) that residual gradients help semi-gradients converge. All variants fail with a large $\eta$ (e.g., 0.8 or 1). This complies with the argument from Baird (1995) that pure residual gradients are slow. (4) TO-Res-DDPG($\eta = 0$) (i.e., vanilla DDPG) is similar to Res-DDPG(), indicating a naive combination of RA and DDPG without a target network is ineffective. (5) For TO-Res-DDPG, $\eta = 0$ achieves the best performance, indicating adding a residual gradient term to DDPG directly is ineffective. To summarize, these variants confirm the necessity of the bidirectional target network.

We also evaluated a Bi-Res version of DQN in three ALE environments (BeamRider, Seaquest, Breakout). The performance is similar to the original DQN. One of the many differences between DMControl and ALE is that rewards in ALE are much more sparse. This might indicate that forward value propagation in RA is less likely to yield a performance boost with sparse rewards.

5 Residual Algorithms in Model-based RL

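Input:
       a critic $Q_\theta$ parameterized by $\theta$
       an actor $\mu_\nu$ parameterized by $\nu$
       the number of planning steps
       a noise process
       a critic update procedure
Initialize target networks $\theta^- \leftarrow \theta$, $\nu^- \leftarrow \nu$
Initialize a replay buffer $\mathcal{B}$, a model $\mathcal{M}$
Get an initial state $S_0$ and set $t \leftarrow 0$
while true do
       Execute $A_t$ (the actor output $\mu_\nu(S_t)$ plus noise) and get $R_{t+1}, S_{t+1}$
       Store $(S_t, A_t, R_{t+1}, S_{t+1})$ into $\mathcal{B}$
       Fit $\mathcal{M}$ with data in $\mathcal{B}$
       Sample a batch of transitions from $\mathcal{B}$
       for each transition $(s, a, r, s')$ in the batch do
             Update $\theta, \nu$ with $(s, a, r, s')$, (3), (4)
             // Planning
             for each planning step do
                   Update $\theta$ with an imaginary transition from $\mathcal{M}$ and the critic update procedure
             end for
       end for
       Update $\theta^-, \nu^-$ according to $\theta, \nu$
end while
Algorithm 1 Dyna-DDPG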

Dyna (Sutton, 1990) is a commonly used model-based RL framework that trains a value function with imaginary transitions from a learned model. In this paper, we consider the combination of Dyna and DDPG. In model-based RL, the double sampling issue can be easily addressed by querying the learned model (either deterministic or stochastic). Given the empirical success of deterministic models and their robustness in complex tasks (Kurutach et al., 2018; Feinberg et al., 2018; Buckman et al., 2018), we consider deterministic models in this paper. For each planning step, we sample a transition $(s, a, r, s')$ from a replay buffer and add some noise to the action $a$, yielding a new action $\tilde{a}$. We then query a learned model with $(s, \tilde{a})$ and get $(\tilde{r}, \tilde{s}')$. This imaginary transition $(s, \tilde{a}, \tilde{r}, \tilde{s}')$ is then used to train the $Q$-function. The pseudocode of this Dyna-DDPG is provided in Algorithm 1. We aim to investigate different strategies for updating $\theta$ during planning (i.e., the selection of the critic update procedure in Algorithm 1).
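The planning step described above can be sketched in a few lines. The Gaussian action noise, the toy deterministic model, and all names below are assumptions made for illustration; the text only specifies that noise is added to a replayed action and that a learned model is queried with the real state and the perturbed action.

```python
import random


def imaginary_transition(real_transition, model, noise_scale, rng):
    """Perturb the replayed action and query the model from the real state."""
    s, a, _r, _s_next = real_transition
    a_tilde = [ai + rng.gauss(0.0, noise_scale) for ai in a]
    r_tilde, s_tilde_next = model(s, a_tilde)
    return (s, a_tilde, r_tilde, s_tilde_next)


def toy_model(s, a):
    """Hypothetical deterministic model: s' = s + a, reward = -|s'|^2."""
    s_next = [si + ai for si, ai in zip(s, a)]
    return -sum(x * x for x in s_next), s_next


rng = random.Random(0)
real = ([1.0], [0.5], 0.0, [1.5])  # (s, a, r, s') from a replay buffer
print(imaginary_transition(real, toy_model, noise_scale=0.1, rng=rng))
```

Only the starting state comes from real experience; the action, reward, and successor state are all imaginary, which is the source of the distribution mismatch discussed next.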

One naive choice is to use the original DDPG critic update (3). However, this suffers from the distribution mismatch problem (Feinberg et al., 2018). When we apply (3) in an imaginary transition $(s, \tilde{a}, \tilde{r}, \tilde{s}')$, we need the $Q$-value on $\tilde{s}'$ for bootstrapping. The $Q$-function is trained to make an accurate prediction on the state distribution of $s$, which is usually different from the state distribution of $\tilde{s}'$. This distribution mismatch results from both an imperfect model and the different sampling strategies for $a$ and $\tilde{a}$. It yields an inaccurate prediction on $\tilde{s}'$, leading to poor performance (Feinberg et al., 2018). The TD(k) trick (Feinberg et al., 2018) is one attempt to address this issue. With a real transition $(s_0, a_0, r_0, s_1)$ sampled from a replay buffer, a model is unrolled for $k$ steps following $\mu_\nu$, yielding a trajectory $(s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_k, a_k, r_k, s_{k+1})$. TD(k) then updates $\theta$ to minimize

$$\sum_{i=0}^{k} \Big( Q_\theta(s_i, a_i) - \Big( \sum_{j=i}^{k} \gamma^{j-i} r_j + \gamma^{k-i+1} Q_{\theta^-}(s_{k+1}, \mu_{\nu^-}(s_{k+1})) \Big) \Big)^2.$$

With this update, $Q_\theta$ is trained on distributions of almost all the states ($s_0, \dots, s_k$), which Feinberg et al. (2018) show helps performance. However, TD(k) still does not train $Q_\theta$ on the last imaginary state $s_{k+1}$. On the one hand, the influence of the bootstrapping error from $s_{k+1}$ decreases as the trajectory gets longer thanks to discounting. On the other hand, even small state prediction errors typically compound as trajectories get longer. This contradiction is deeply embedded in TD(k). Consequently, TD(k) must assume the model is accurate for $k$-step unrolling, which is usually hard to satisfy in practice.
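The regression targets used by a multi-step scheme of this kind can be computed with one backward pass over a rollout. The sketch below assumes a rollout with rewards indexed from 0 to k and a single bootstrap value taken from a target network at the final imaginary state; the function name and the numbers are illustrative.

```python
def td_k_targets(rewards, bootstrap_value, gamma):
    """Targets for Q(s_i, a_i), i = 0..k, along one model rollout.

    rewards = [r_0, ..., r_k]; bootstrap_value is the (target-network)
    value at the last imaginary state s_{k+1}. Target i equals
    sum_{j=i}^{k} gamma^(j-i) * r_j + gamma^(k-i+1) * bootstrap_value.
    """
    targets = []
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g
        targets.append(g)
    targets.reverse()
    return targets


# Hypothetical rollout with k = 2.
print(td_k_targets([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5))
# [3.0, 4.0, 6.0]
```

Every target depends on the bootstrap value at the final state, and that state itself never receives a target. The earliest target discounts the bootstrap value the most, which reflects the trade-off in the text between discounting the bootstrapping error and compounding model error over longer rollouts.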

In this paper, we seek to mitigate this distribution mismatch issue through RA. For an imaginary transition $(s, \tilde{a}, \tilde{r}, \tilde{s}')$, RA naturally allows the $Q$-function to be trained on both $s$ and $\tilde{s}'$, without requiring further unrolling like TD(k). The use of RA in model-based planning is inspired by the theoretical results from Li (2008), who proves that TD makes better predictions than RG. On a real transition, this accelerates backward value propagation by providing better bootstrapping. However, on an imaginary transition from a model, the value function is never trained on the imaginary successor state. It is questionable whether we should trust the value prediction on an imaginary state as much as a real state. We, therefore, propose to use RA on imaginary transitions.

We now evaluate RA in model-based planning experimentally. We compare the performance of Dyna-DDPG with the DDPG critic update (3) (referred to as Dyna-DDPG), Dyna-DDPG with the residual update (5) (referred to as Res-Dyna-DDPG), and MVE-DDPG (Feinberg et al., 2018) with TD(k). We consider five MuJoCo tasks used by Buckman et al. (2018), which is a superset of tasks used by Feinberg et al. (2018). In Feinberg et al. (2018), the unrolling steps of MVE-DDPG are different for different tasks, which serve as domain knowledge. For a fair comparison, Buckman et al. (2018) set for all tasks in their baseline MVE-DDPG. In our empirical study, we followed this convention. Other hyperparameters of our MVE-DDPG are the same as Feinberg et al. (2018).

To separate planning from model learning, we first consider planning with an oracle model. We tune hyperparameters for Dyna-DDPG and Res-Dyna-DDPG on Walker and set for all tasks. Other details are provided in the supplementary materials. The results are reported in Figure 3. Curves are averaged over 8 independent runs and shadowed regions indicate standard errors. Both Dyna-DDPG and MVE-DDPG with an oracle model improve performance in 2 of 5 games, while Res-Dyna-DDPG improves performance in 4 out of 5 games. These results suggest that RA is a more effective approach to exploit a model for planning. In HalfCheetah, both MVE-DDPG and Res-Dyna-DDPG fail to outperform Dyna-DDPG. This could suggest that the distribution mismatch problem is not significant in this task. Furthermore, MVE-DDPG exhibits instability in HalfCheetah, as is also observed by Buckman et al. (2018).

Figure 3: Evaluation performance for different model-based DDPG with an oracle model.

We now consider planning with a learned model. We use the same model parameterization and model training protocol as Feinberg et al. (2018). We set for all tasks. The results are reported in Figure 4. In Swimmer and Humanoid, Res-Dyna-DDPG significantly outperforms all other methods. In Walker and Hopper, Res-Dyna-DDPG reaches similar performance as MVE-DDPG. In HalfCheetah, Res-Dyna-DDPG () fails dramatically. We further test other values for $\eta$ and find produces reasonable performance, as shown by the extra black curve. This indicates that $\eta$ can serve as domain knowledge, reflecting our confidence in a learned model. A possible future work is to use model uncertainty estimation from a model ensemble to determine $\eta$ automatically, similar to what Buckman et al. (2018) propose for the unrolling steps in TD(k), which significantly improves performance over MVE-DDPG.

Figure 4: Evaluation performance for different model-based DDPG with a learned model.

In this section, we consider the vanilla residual update (5) without the bidirectional target network. Our preliminary experiments show that introducing the bidirectional target network during planning does not further boost performance. The main purpose of a target network is to stabilize bootstrapping (value propagation). Due to the distribution mismatch problem on imaginary transitions, however, it may be more important for the value function to be consistent with the model than simply propagating the value in either direction. This may reduce the importance of the bidirectional target network.

6 Related Work

There are also other studies on Bellman residual methods. Geist et al. (2017) show that for policy-based methods, maximizing the average reward is better than minimizing the Bellman residual. Schoknecht and Merke (2003a) show RG converges with a problem-dependent constant learning rate when combined with certain function approximators. Dabney and Thomas (2014) extend RG with natural gradients. However, this paper appears to be the first to contrast residual gradients and semi-gradients in deep RL problems and demonstrate the efficacy of RA with new algorithms.

Dyna-style planning in RL has been widely used. Gu et al. (2016) learn a local linear model for planning. Kurutach et al. (2018) learn a model ensemble to avoid overfitting to an imperfect model, which is also achieved by meta-learning (Clavera et al., 2018). Kalweit and Boedecker (2017) use a value function ensemble to decide when to use a model. Besides Dyna-style planning, learned models are also used for a lookahead tree-search to improve value estimation at decision time (Silver et al., 2017; Oh et al., 2017; Talvitie, 2017). This tree-search is also used as an effective inductive bias in value function parameterization (Farquhar et al., 2018; Srinivas et al., 2018; Zhang et al., 2019). Trajectories from a learned model are also used as extra inputs for a value function (Weber et al., 2017), which reduces the negative influence of the model prediction error. In this paper, we focus on the simplest Dyna-style planning and leave the combination of RA and more advanced planning techniques for future work.

Besides RL, learned models are also used in other control methods, e.g., model predictive control (MPC, Garcia et al. 1989). Nagabandi et al. (2018) learn deterministic models via neural networks for MPC. Chua et al. (2018) conduct a thorough comparison between deterministic models and stochastic models and use particle filters when unrolling a model. Besides modeling the observation transition, Ha and Schmidhuber (2018); Hafner et al. (2018) propose to model the abstract state transition and use MPC on the abstract state space. In this paper, we focus on the simplest deterministic model and leave the combination of RA and more advanced models for future work.

7 Conclusions

In this paper, we give a thorough review of existing comparisons between RG and TD. We propose the bidirectional target network technique to stabilize bootstrapping in both directions in RA, yielding a significant performance boost. We also demonstrate that RA is a more effective approach to the distribution mismatch problem in model-based planning than the existing TD(k) method. Our empirical study showed the efficacy of RA in deep RL problems, which has long been underestimated by the community. A possible future work is to study RA in model-free RL with stochastic environments, where the double sampling issue cannot be trivially resolved.


SZ is generously funded by the Engineering and Physical Sciences Research Council (EPSRC). This project has received funding from the European Research Council under the European Union’s Horizon 2020 research and innovation programme (grant agreement number 637713). The experiments were made possible by a generous equipment grant from NVIDIA.


  • Baird (1995) Baird, L. (1995). Residual algorithms: Reinforcement learning with function approximation. Machine Learning.
  • Bellemare et al. (2013) Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. (2013). The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research.

  • Bellman (2013) Bellman, R. (2013). Dynamic programming. Courier Corporation.
  • Bertsekas and Tsitsiklis (1996) Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Athena Scientific Belmont, MA.
  • Buckman et al. (2018) Buckman, J., Hafner, D., Tucker, G., Brevdo, E., and Lee, H. (2018). Sample-efficient reinforcement learning with stochastic ensemble value expansion. In Advances in Neural Information Processing Systems.
  • Chua et al. (2018) Chua, K., Calandra, R., McAllister, R., and Levine, S. (2018). Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems.
  • Clavera et al. (2018) Clavera, I., Rothfuss, J., Schulman, J., Fujita, Y., Asfour, T., and Abbeel, P. (2018). Model-based reinforcement learning via meta-policy optimization. arXiv preprint arXiv:1809.05214.
  • Dabney and Thomas (2014) Dabney, W. and Thomas, P. (2014). Natural temporal difference learning. In Proceedings of the 28th AAAI Conference on Artificial Intelligence.
  • Dann et al. (2014) Dann, C., Neumann, G., and Peters, J. (2014). Policy evaluation with temporal differences: A survey and comparison. Journal of Machine Learning Research.
  • Dhariwal et al. (2017) Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., Wu, Y., and Zhokhov, P. (2017). Openai baselines.
  • Farquhar et al. (2018) Farquhar, G., Rocktäschel, T., Igl, M., and Whiteson, S. (2018). Treeqn and atreec: Differentiable tree-structured models for deep reinforcement learning. arXiv preprint arXiv:1710.11417.
  • Feinberg et al. (2018) Feinberg, V., Wan, A., Stoica, I., Jordan, M. I., Gonzalez, J. E., and Levine, S. (2018). Model-based value estimation for efficient model-free reinforcement learning. arXiv preprint arXiv:1803.00101.
  • Garcia et al. (1989) Garcia, C. E., Prett, D. M., and Morari, M. (1989). Model predictive control: theory and practice—a survey. Automatica.
  • Geist et al. (2017) Geist, M., Piot, B., and Pietquin, O. (2017). Is the bellman residual a bad proxy? In Advances in Neural Information Processing Systems.
  • Gordon (1995) Gordon, G. J. (1995). Stable function approximation in dynamic programming. Machine Learning.
  • Gordon (1999) Gordon, G. J. (1999). Approximate solutions to Markov decision processes. PhD thesis, Carnegie Mellon University.
  • Gu et al. (2016) Gu, S., Lillicrap, T., Sutskever, I., and Levine, S. (2016). Continuous deep q-learning with model-based acceleration. In Proceedings of the 33rd International Conference on Machine Learning.
  • Ha and Schmidhuber (2018) Ha, D. and Schmidhuber, J. (2018). World models. arXiv preprint arXiv:1803.10122.
  • Hafner et al. (2018) Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., and Davidson, J. (2018). Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551.
  • Huber et al. (1964) Huber, P. J. et al. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics.
  • Kalweit and Boedecker (2017) Kalweit, G. and Boedecker, J. (2017). Uncertainty-driven imagination for continuous deep reinforcement learning. In Proceedings of the 2017 Conference on Robot Learning.
  • Kurutach et al. (2018) Kurutach, T., Clavera, I., Duan, Y., Tamar, A., and Abbeel, P. (2018). Model-ensemble trust-region policy optimization. arXiv preprint arXiv:1802.10592.
  • Lagoudakis and Parr (2003) Lagoudakis, M. G. and Parr, R. (2003). Least-squares policy iteration. Journal of Machine Learning Research.
  • Li (2008) Li, L. (2008). A worst-case comparison between temporal difference and residual gradient with linear function approximation. In Proceedings of the 25th International Conference on Machine Learning.
  • Lillicrap et al. (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
  • Lin (1992) Lin, L.-J. (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning.
  • Maei (2011) Maei, H. R. (2011). Gradient temporal-difference learning algorithms. PhD thesis, University of Alberta.
  • Maei et al. (2010) Maei, H. R., Szepesvári, C., Bhatnagar, S., and Sutton, R. S. (2010). Toward off-policy learning control with function approximation. In Proceedings of the 27th International Conference on Machine Learning.
  • Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature.
  • Munos (2003) Munos, R. (2003). Error bounds for approximate policy iteration. In Proceedings of the 20th International Conference on Machine Learning.
  • Nagabandi et al. (2018) Nagabandi, A., Kahn, G., Fearing, R. S., and Levine, S. (2018). Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In Proceedings of the 2018 International Conference on Robotics and Automation.
  • Oh et al. (2017) Oh, J., Singh, S., and Lee, H. (2017). Value prediction network. In Advances in Neural Information Processing Systems.
  • Puterman (2014) Puterman, M. L. (2014). Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
  • Scherrer (2010) Scherrer, B. (2010). Should one compute the temporal difference fix point or minimize the Bellman residual? The unified oblique projection view. In Proceedings of the 27th International Conference on Machine Learning.
  • Schoknecht and Merke (2003a) Schoknecht, R. and Merke, A. (2003a). Convergent combinations of reinforcement learning with linear function approximation. In Advances in Neural Information Processing Systems.
  • Schoknecht and Merke (2003b) Schoknecht, R. and Merke, A. (2003b). TD(0) converges provably faster than the residual gradient algorithm. In Proceedings of the 20th International Conference on Machine Learning.
  • Silver et al. (2017) Silver, D., van Hasselt, H., Hessel, M., Schaul, T., Guez, A., Harley, T., Dulac-Arnold, G., Reichert, D., Rabinowitz, N., Barreto, A., et al. (2017). The predictron: End-to-end learning and planning. In Proceedings of the 34th International Conference on Machine Learning.
  • Srinivas et al. (2018) Srinivas, A., Jabri, A., Abbeel, P., Levine, S., and Finn, C. (2018). Universal planning networks. arXiv preprint arXiv:1804.00645.
  • Sun and Bagnell (2015) Sun, W. and Bagnell, J. A. (2015). Online Bellman residual algorithms with predictive error guarantees. In Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence.
  • Sutton (1988) Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning.
  • Sutton (1990) Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the 7th International Conference on Machine Learning.
  • Sutton and Barto (2018) Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd Edition). MIT Press.
  • Sutton et al. (2009) Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C., and Wiewiora, E. (2009). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th International Conference on Machine Learning.
  • Sutton et al. (2016) Sutton, R. S., Mahmood, A. R., and White, M. (2016). An emphatic approach to the problem of off-policy temporal-difference learning. The Journal of Machine Learning Research.
  • Sutton et al. (1999) Sutton, R. S., Precup, D., and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence.
  • Talvitie (2017) Talvitie, E. (2017). Self-correcting models for model-based reinforcement learning. In Proceedings of the 31st AAAI Conference on Artificial Intelligence.
  • Tsitsiklis and Van Roy (1997) Tsitsiklis, J. N. and Van Roy, B. (1997). Analysis of temporal-difference learning with function approximation. In Advances in Neural Information Processing Systems.
  • van Hasselt (2011) van Hasselt, H. P. (2011). Insights in reinforcement learning. PhD thesis, Utrecht University.
  • Watkins and Dayan (1992) Watkins, C. J. and Dayan, P. (1992). Q-learning. Machine Learning.
  • Weber et al. (2017) Weber, T., Racanière, S., Reichert, D. P., Buesing, L., Guez, A., Rezende, D. J., Badia, A. P., Vinyals, O., Heess, N., Li, Y., et al. (2017). Imagination-augmented agents for deep reinforcement learning. arXiv preprint arXiv:1707.06203.
  • Williams and Baird (1993) Williams, R. J. and Baird, L. C. (1993). Tight performance bounds on greedy policies based on imperfect value functions. Technical report, Citeseer.
  • Zhang et al. (2019) Zhang, S., Chen, H., and Yao, H. (2019). ACE: An actor ensemble algorithm for continuous control with tree search. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence.

Appendix A Experiment Details

For the model-based experiments, we tune hyperparameters in Walker with an oracle model for both Dyna-DDPG and Res-Dyna-DDPG. The number of planning steps is tuned over . The noise process is Gaussian noise , with tuned over . The mix coefficient in RA is tuned over . In all our experiments (with both an oracle model and a learned model), we set .

For MVE-DDPG, we use the same hyperparameters and architectures as Feinberg et al. (2018). However, we find that the original TD(k) loss (9) yields significant instability. To improve stability, we make two modifications. First, for a trajectory , instead of minimizing the loss (9), we minimize

This new loss differs from (9) mainly in that it uses the real transition more. We find that this significantly improves stability. Second, we replace the mean squared loss with a Huber loss (Huber et al., 1964). This replacement has been reported to improve stability (Dhariwal et al., 2017). Our MVE-DDPG implementation significantly outperforms the MVE-DDPG baselines of Buckman et al. (2018) in Hopper and Walker while maintaining similar performance in the remaining tasks.
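To illustrate why swapping the squared loss for a Huber loss can help stability, the following minimal sketch compares the two on a single TD error. The function names and the threshold `delta = 1.0` are assumptions made for illustration; they are not taken from our implementation.

```python
def squared_loss(error: float) -> float:
    """Half squared error; grows quadratically with the TD error."""
    return 0.5 * error * error


def huber_loss(error: float, delta: float = 1.0) -> float:
    """Huber loss: quadratic for |error| <= delta, linear beyond it.

    delta = 1.0 is an assumed threshold, not a value from the paper.
    """
    abs_error = abs(error)
    if abs_error <= delta:
        return 0.5 * error * error
    # The linear branch is chosen so that the loss and its gradient
    # are continuous at |error| == delta.
    return delta * (abs_error - 0.5 * delta)


if __name__ == "__main__":
    for td_error in (0.5, 3.0, -10.0):
        print(td_error, squared_loss(td_error), huber_loss(td_error))
```

For small errors the two losses coincide. For large errors the Huber loss grows only linearly, so its gradient magnitude is bounded by `delta`, which limits the influence of outlier targets on each update.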

Appendix B Other Experimental Results

The evaluation curves of DDPG and Bi-Res-DDPG() on 28 DMControl tasks are reported in Figure 5.

Figure 5: Evaluation curves of DDPG and Bi-Res-DDPG() on 28 DMControl tasks. Curves are averaged over 5 independent runs and shaded regions indicate standard errors.
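The aggregation described in the caption can be sketched as follows. This is a minimal illustration that assumes every run is evaluated at the same points and that the standard error is the sample standard deviation (with Bessel's correction) divided by the square root of the number of runs. The function name and the toy numbers are hypothetical.

```python
import math
import statistics
from typing import List, Sequence, Tuple


def mean_and_standard_error(
    runs: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Aggregate per-run evaluation curves point by point.

    runs[i][t] is the evaluation return of run i at evaluation point t.
    Returns the mean curve and the standard-error curve.
    """
    n_runs = len(runs)
    means, std_errors = [], []
    for values in zip(*runs):  # iterate over evaluation points
        means.append(statistics.fmean(values))
        # Sample standard deviation (n - 1 in the denominator): an assumption.
        std_errors.append(statistics.stdev(values) / math.sqrt(n_runs))
    return means, std_errors


if __name__ == "__main__":
    # Hypothetical returns from 5 runs at 3 evaluation points.
    toy_runs = [
        [10.0, 50.0, 90.0],
        [12.0, 55.0, 85.0],
        [8.0, 45.0, 95.0],
        [11.0, 52.0, 88.0],
        [9.0, 48.0, 92.0],
    ]
    mean_curve, se_curve = mean_and_standard_error(toy_runs)
    print(mean_curve)
    print(se_curve)
```

The shaded region in a plot of this kind would then span `mean - se` to `mean + se` at each evaluation point.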