1 Introduction
Deep model-free reinforcement learning has had great successes in recent years, notably in playing video games [Mnih et al., 2013] and strategic board games [Silver et al., 2016]. However, training agents using these algorithms requires tens to hundreds of millions of samples, which makes many practical applications infeasible, particularly in real-world control problems (e.g., robotics) where data collection is expensive.
Model-based approaches aim to reduce the number of samples required to learn a policy by modeling the dynamics of the environment. A dynamics model can be used to increase sample efficiency in various ways, including training the policy on rollouts from the dynamics model [Sutton, 1990], using rollouts to improve targets for temporal difference (TD) learning [Feinberg et al., 2018], and using information gained from rollouts as inputs to the policy [Weber et al., 2017]. Model-based algorithms such as PILCO [Deisenroth and Rasmussen, 2011] have shown that it is possible to learn from orders-of-magnitude fewer samples.
These successes have mostly been limited to environments where the dynamics are simple to model. In noisy, complex environments, it is difficult to learn an accurate model of the environment. When the model makes mistakes in this context, it can cause the wrong policy to be learned, hindering performance. Recent work has begun to address this issue. Kalweit and Boedecker [2017] train a model-free algorithm on a mix of real and imagined data, adjusting the proportion in favor of real data as the Q-function becomes more confident. Kurutach et al. [2018] train a model-free algorithm on purely imaginary data, but use an ensemble of environment models to avoid overfitting to errors made by any individual model.
We propose stochastic ensemble value expansion (STEVE), an extension to model-based value expansion (MVE) proposed by Feinberg et al. [2018]. Both techniques use a dynamics model to compute “rollouts” that are used to improve the targets for temporal difference learning. MVE rolls out a fixed length into the future, potentially accumulating model errors or increasing value estimation error along the way. In contrast, STEVE interpolates between many different horizon lengths, favoring those whose estimates have lower uncertainty, and thus lower error. To compute the interpolated target, we replace both the model and Q-function with ensembles, approximating the uncertainty of an estimate by computing its variance under samples from the ensemble. Through these uncertainty estimates, STEVE dynamically utilizes the model rollouts only when they do not introduce significant errors. For illustration, Figure 1 compares the sample efficiency of various algorithms on a tabular toy environment, which shows that STEVE significantly outperforms MVE and TD-learning baselines when the dynamics model is noisy. We systematically evaluate STEVE on several challenging continuous control benchmarks and demonstrate that STEVE significantly outperforms model-free baselines with an order-of-magnitude increase in sample efficiency.
2 Background
Reinforcement learning aims to learn an agent policy that maximizes the expected (discounted) sum of rewards [Sutton and Barto, 1998]. The agent starts at an initial state $s_0 \sim p(s_0)$, where $p(s_0)$ is the distribution of initial states of the environment. Then, the agent deterministically chooses an action $a_t$ according to its policy $\pi_\phi(s_t)$ with parameters $\phi$, deterministically transitions to a subsequent state $s_{t+1}$ according to the Markovian dynamics $T(s_t, a_t)$ of the environment, and receives a reward $r_t$. This generates a trajectory of states, actions, and rewards $(s_0, a_0, r_0, s_1, a_1, \ldots)$. If a trajectory reaches a terminal state, it concludes without further transitions or rewards; however, this is optional, and trajectories may instead be infinite in length. We abbreviate the trajectory by $\tau$. The goal is to maximize the expected discounted sum of rewards along sampled trajectories $J(\phi) = \mathbb{E}_\tau\left[\sum_t \gamma^t r_t\right]$, where $\gamma \in [0, 1)$ is a discount parameter.
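As a concrete illustration of the objective above, the discounted return of a single sampled trajectory can be computed with a short helper. This is an illustrative sketch; the function name and arguments are ours, not the paper's.

```python
# Minimal sketch: the discounted return of one sampled trajectory.
# `rewards` is the sequence (r_0, r_1, ...) and `gamma` is the discount
# parameter; both names are illustrative.
def discounted_return(rewards, gamma=0.99):
    total = 0.0
    # Accumulate right-to-left so each earlier step multiplies by gamma once.
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```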
2.1 Value Estimation with TD-Learning
The action-value function $Q^\pi(s, a)$ is a critical quantity to estimate for many learning algorithms. It satisfies the recursion relation
$$Q^\pi(s, a) = r(s, a) + \gamma\,(1 - d(s'))\,Q^\pi(s', \pi(s')),$$
where $s' = T(s, a)$ and $d(s')$ is an indicator function which returns 1 when $s'$ is a terminal state and 0 otherwise. We can estimate $Q^\pi$ off-policy with collected transitions of the form $(s, a, r, s')$ sampled uniformly from a replay buffer [Sutton and Barto, 1998]. We approximate $Q^\pi$ with a deep neural network, $\hat{Q}^\pi_\theta$. We learn parameters $\theta$ to minimize the mean squared error (MSE) between Q-value estimates of states and their corresponding TD targets:
$$\mathcal{T}^{\mathrm{TD}}(r, s') = r + \gamma\,(1 - d(s'))\,\hat{Q}^\pi_{\theta^-}(s', \pi(s')) \qquad (1)$$
$$\mathcal{L}_\theta = \mathbb{E}\left[\left(\hat{Q}^\pi_\theta(s, a) - \mathcal{T}^{\mathrm{TD}}(r, s')\right)^2\right] \qquad (2)$$
This expectation is taken with respect to transitions sampled from our replay buffer. Note that we use an older copy of the parameters, $\theta^-$, when computing targets [Mnih et al., 2013].
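A minimal sketch of the TD target of Equation 1, with `q_target` standing in for the frozen target network composed with the policy; it is an illustrative placeholder, not the paper's implementation.

```python
# Hedged sketch of the TD target: r + gamma * (1 - d(s')) * Q_target(s').
# `q_target` is a placeholder callable for Q_{theta^-}(s', pi(s')).
def td_target(r, s_next, done, q_target, gamma=0.99):
    # Terminal transitions bootstrap nothing: d(s') = 1 zeroes the tail.
    return r + gamma * (1.0 - done) * q_target(s_next)
```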
Since we evaluate our method in a continuous action space, it is not possible to compute a policy from our Q-function by simply taking $\pi(s) = \operatorname{arg\,max}_a \hat{Q}^\pi_\theta(s, a)$. Instead, we use a neural network to approximate this maximization function [Lillicrap et al., 2016], by learning a parameterized function $\pi_\phi$ to minimize the negative Q-value:
$$\mathcal{L}_\phi = -\hat{Q}^\pi_\theta(s, \pi_\phi(s)) \qquad (3)$$
In this work, we use DDPG as the base learning algorithm, but our technique is generally applicable to other methods that use TD objectives.
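To make the actor objective of Equation 3 concrete, the toy sketch below performs gradient ascent on a hand-written quadratic critic. The critic, learning rate, and scalar "policy" are all illustrative assumptions, not the DDPG networks themselves.

```python
# Illustrative sketch of minimizing the negative Q-value: the actor follows
# the gradient that raises Q(s, pi(s)). Here Q(s, a) = -(a - 2)^2 is a toy
# critic whose maximizer is a = 2; `theta` is a scalar "policy" emitting a.
def policy_improvement(theta, lr=0.1, steps=200):
    for _ in range(steps):
        a = theta                  # pi_phi(s) = theta in this toy case
        dq_da = -2.0 * (a - 2.0)   # gradient of the toy critic w.r.t. a
        theta = theta + lr * dq_da # minimizing -Q  ==  ascending Q
    return theta
```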
2.2 Model-Based Value Expansion (MVE)
Recently, Feinberg et al. [2018] showed that a learned dynamics model can be used to improve value estimation. MVE forms TD targets by combining a short-term value estimate formed by unrolling the model dynamics with a long-term value estimate from the learned $\hat{Q}^\pi_{\theta^-}$ function. When the model is accurate, this reduces the bias of the targets, leading to improved performance.
The learned dynamics model consists of three learned functions: the transition function $\hat{T}_\xi(s, a)$, which returns a successor state $s'$; a termination function $\hat{d}_\xi(s)$, which returns the probability that $s$ is a terminal state; and the reward function $\hat{r}_\psi(s, a, s')$, which returns a scalar reward. This model is trained to minimize
$$\mathcal{L}_{\xi,\psi} = \mathbb{E}\left[\left\|\hat{T}_\xi(s, a) - s'\right\|_2^2 + \mathbb{H}\!\left(d(s'),\, \hat{d}_\xi(\hat{T}_\xi(s, a))\right) + \left(\hat{r}_\psi(s, a, s') - r\right)^2\right] \qquad (4)$$
where the expectation is over collected transitions $(s, a, r, s')$, and $\mathbb{H}$ is the cross-entropy. In this work, we consider continuous environments; for discrete environments, the first term can be replaced by a cross-entropy loss term.
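The per-transition model loss described above can be sketched as follows for the continuous case; all argument names are illustrative placeholders for the outputs of the learned functions.

```python
import numpy as np

# Hedged sketch of the per-transition model loss for a continuous
# environment: squared error on the successor state, binary cross-entropy
# on the termination probability, and squared error on the reward.
def model_loss(pred_next, s_next, pred_done_prob, done, pred_r, r):
    state_term = np.sum((pred_next - s_next) ** 2)
    eps = 1e-8  # numerical floor for the log terms (our addition)
    ce_term = -(done * np.log(pred_done_prob + eps)
                + (1.0 - done) * np.log(1.0 - pred_done_prob + eps))
    reward_term = (pred_r - r) ** 2
    return state_term + ce_term + reward_term
```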
To incorporate the model into value estimation, Feinberg et al. [2018] replace the standard Q-learning target with an improved target, $\mathcal{T}^{\mathrm{MVE}}_H$, computed by rolling the learned model out for $H$ steps:
$$s'_0 = s', \qquad a'_i = \pi(s'_i), \qquad s'_i = \hat{T}_\xi(s'_{i-1}, a'_{i-1}), \qquad D^i = \prod_{j=0}^{i-1}\left(1 - \hat{d}_\xi(s'_j)\right) \qquad (5)$$
$$\mathcal{T}^{\mathrm{MVE}}_H(r, s') = r + \sum_{i=1}^{H} D^i \gamma^i\, \hat{r}_\psi(s'_{i-1}, a'_{i-1}, s'_i) + D^{H+1} \gamma^{H+1}\, \hat{Q}^\pi_{\theta^-}(s'_H, a'_H) \qquad (6)$$
To use this target, we substitute $\mathcal{T}^{\mathrm{MVE}}_H$ in place of $\mathcal{T}^{\mathrm{TD}}$ when training $\theta$ using Equation 2.¹

¹This formulation is a minor generalization of the original MVE objective in that we additionally model the reward function and termination function; Feinberg et al. [2018] consider “fully observable” environments in which the reward function and termination condition were known, deterministic functions of the observations. Because we use a function approximator for the termination condition, we compute the accumulated probability of termination, $D^i$, at every timestep, and use this value to discount future returns. Note that when $H = 0$, MVE reduces to TD-learning (i.e., $\mathcal{T}^{\mathrm{MVE}}_0 = \mathcal{T}^{\mathrm{TD}}$).
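The $H$-step rollout target above can be sketched procedurally. `transition`, `reward_fn`, `done_fn`, `policy`, and `q_target` are illustrative placeholders for the learned functions, and the running continuation probability plays the role of the accumulated termination discounting described in the footnote.

```python
# Hedged sketch of the H-step MVE target: roll the learned model forward,
# discount each imagined reward by gamma^i and by the accumulated
# continuation probability, then bootstrap with the target Q-function.
def mve_target(r, s_next, transition, reward_fn, done_fn, policy,
               q_target, horizon, gamma=0.99):
    target = r
    s = s_next
    cont = 1.0 - done_fn(s_next)     # probability s' itself is non-terminal
    for i in range(1, horizon + 1):
        a = policy(s)
        target += cont * gamma ** i * reward_fn(s, a)
        s = transition(s, a)
        cont *= 1.0 - done_fn(s)     # accumulate termination probability
    a = policy(s)
    target += cont * gamma ** (horizon + 1) * q_target(s, a)
    return target
```

With `horizon=0` this reduces to the ordinary TD target, matching the footnote's remark.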
When the model is perfect and the learned Q-function has similar bias on all states and actions, Feinberg et al. [2018] show that the MVE target with rollout horizon $H$ will decrease the target error by a factor of $\gamma^{2H}$. Errors in the learned model can lead to worse targets, so in practice, we must tune $H$ to balance between the errors in the model and in the $\hat{Q}^\pi_{\theta^-}$ estimates. An additional challenge is that the bias in the learned Q-function is not uniform across states and actions [Feinberg et al., 2018]. In particular, they find that the bias in the Q-function on states sampled from the replay buffer is lower than when the Q-function is evaluated on states generated from model rollouts. They term this the distribution mismatch problem and propose the TD-k trick as a solution; see Appendix B for further discussion of this trick.
While the results of Feinberg et al. [2018] are promising, they rely on task-specific tuning of the rollout horizon $H$. This sensitivity arises from the difficulty of modeling the transition dynamics and the $\hat{Q}^\pi$ function, which are task-specific and may change throughout training as the policy explores different parts of the state space. Complex environments require a much smaller rollout horizon $H$, which limits the effectiveness of the approach (e.g., Feinberg et al. [2018] used a long rollout horizon on HalfCheetah-v1, but had to reduce it on Walker2d-v1). Motivated by this limitation, we propose an approach that balances model error and Q-function error by dynamically adjusting the rollout horizon.
3 Stochastic Ensemble Value Expansion
From a single rollout of $H$ timesteps, we can compute $H + 1$ distinct candidate targets by considering rollouts of various horizon lengths: $\mathcal{T}^{\mathrm{MVE}}_0, \mathcal{T}^{\mathrm{MVE}}_1, \mathcal{T}^{\mathrm{MVE}}_2, \ldots, \mathcal{T}^{\mathrm{MVE}}_H$. Standard TD learning uses $\mathcal{T}^{\mathrm{MVE}}_0 = \mathcal{T}^{\mathrm{TD}}$ as the target, while MVE uses $\mathcal{T}^{\mathrm{MVE}}_H$ as the target. We propose interpolating all of the candidate targets to produce a target which is better than any individual. Conventionally, one could average the candidate targets, or weight the candidate targets in an exponentially-decaying fashion, similar to TD($\lambda$) [Sutton and Barto, 1998]. However, we show that we can do better still by weighting the candidate targets in a way that balances errors in the learned $\hat{Q}^\pi$ function and errors from longer model rollouts. STEVE provides a computationally-tractable and theoretically-motivated algorithm for choosing these weights. We describe the algorithm for STEVE in Section 3.1, and justify it in Section 3.2.
3.1 Algorithm
To estimate uncertainty in our learned estimators, we maintain an ensemble of parameters for our Q-function, reward function, and model: $\theta = \{\theta_1, \ldots, \theta_L\}$, $\psi = \{\psi_1, \ldots, \psi_N\}$, and $\xi = \{\xi_1, \ldots, \xi_M\}$, respectively. Each parameterization is initialized independently and trained on different subsets of the data in each minibatch.
We roll out an $H$-step trajectory with each of the $M$ models, $\xi_1, \ldots, \xi_M$. Each trajectory consists of $H + 1$ states, which correspond to $s'_0, \ldots, s'_H$ in Equation 5 with the transition function parameterized by $\xi_m$. Similarly, we use the $N$ reward functions and $L$ Q-functions to evaluate Equation 6 for each $\xi_m$ at every rollout length $i$. This gives us $M \cdot N \cdot L$ different values of $\mathcal{T}^{\mathrm{MVE}}_i$ for each rollout length $i$. See Figure 2 for a visualization of this process.
Using these values, we can compute the empirical mean $\mathcal{T}^{\mu}_i$ and variance $\mathcal{T}^{\sigma^2}_i$ for each partial rollout of length $i$. In order to form a single target, we use an inverse-variance weighting of the means:
$$\mathcal{T}^{\mathrm{STEVE}}_H(r, s') = \sum_{i=0}^{H} \frac{\tilde{w}_i}{\sum_j \tilde{w}_j}\, \mathcal{T}^{\mu}_i, \qquad \tilde{w}_i = \left(\mathcal{T}^{\sigma^2}_i\right)^{-1} \qquad (7)$$
To learn a value function with STEVE, we substitute $\mathcal{T}^{\mathrm{STEVE}}_H$ in place of $\mathcal{T}^{\mathrm{TD}}$ when training $\theta$ using Equation 2.
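The inverse-variance interpolation of Equation 7 reduces to a few lines of NumPy once the ensemble's candidate targets are arranged in a matrix; this is an illustrative sketch, and the `eps` guard against zero variance is our own addition.

```python
import numpy as np

# Hedged sketch of Equation 7: given a matrix of candidate targets
# (rows: rollout length i = 0..H, columns: ensemble samples), weight each
# length's empirical mean by its inverse empirical variance.
def steve_target(candidates, eps=1e-8):
    means = candidates.mean(axis=1)
    variances = candidates.var(axis=1)
    inv_var = 1.0 / (variances + eps)   # eps guards zero-variance rows
    weights = inv_var / inv_var.sum()
    return np.dot(weights, means)
```

Low-variance rollout lengths dominate the weighted average, so confident candidates are favored automatically.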
3.2 Derivation
We wish to find weights $w_i$, where $\sum_i w_i = 1$, that minimize the mean-squared error between the weighted average of candidate targets $\mathcal{T}^{\mathrm{MVE}}_0, \mathcal{T}^{\mathrm{MVE}}_1, \ldots, \mathcal{T}^{\mathrm{MVE}}_H$ and the true Q-value:
$$\mathbb{E}\left[\left(\sum_i w_i \mathcal{T}^{\mathrm{MVE}}_i - Q^\pi(s, a)\right)^2\right] = \mathrm{Bias}\left(\sum_i w_i \mathcal{T}^{\mathrm{MVE}}_i\right)^2 + \mathrm{Var}\left(\sum_i w_i \mathcal{T}^{\mathrm{MVE}}_i\right) \approx \mathrm{Bias}\left(\sum_i w_i \mathcal{T}^{\mathrm{MVE}}_i\right)^2 + \sum_i w_i^2\, \mathrm{Var}\left(\mathcal{T}^{\mathrm{MVE}}_i\right),$$
where the expectation considers the candidate targets as random variables conditioned on the collected data and minibatch sampling noise, and the approximation is due to assuming the candidate targets are independent.²

Our goal is to minimize this with respect to $w_i$. We can estimate the variance terms using empirical variance estimates from the ensemble. Unfortunately, we could not devise a reliable estimator for the bias terms; this is a limitation of our approach and an area for future work. In this work, we ignore the bias terms and minimize the weighted sum of variances $\sum_i w_i^2\, \mathrm{Var}(\mathcal{T}^{\mathrm{MVE}}_i)$.

²Initial experiments suggested that omitting the covariance cross terms provided significant computational speedups at the cost of a slight performance degradation. As a result, we omitted the terms in the rest of the experiments.
With this approximation, which is equivalent to inverse-variance weighting [Fleiss, 1993], we achieve state-of-the-art results. Setting each $w_i$ equal to $\frac{1}{\mathrm{Var}(\mathcal{T}^{\mathrm{MVE}}_i)}$ and normalizing yields the formula for $\mathcal{T}^{\mathrm{STEVE}}_H$ given in Equation 7.
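For completeness, the inverse-variance solution can be recovered with a standard Lagrange-multiplier argument under the constraint $\sum_i w_i = 1$:

```latex
% Minimize the weighted sum of variances subject to the weights summing to one.
\mathcal{L}(w, \lambda)
  = \sum_i w_i^2 \operatorname{Var}\!\left(\mathcal{T}^{\mathrm{MVE}}_i\right)
  - \lambda \Big( \sum_i w_i - 1 \Big), \qquad
\frac{\partial \mathcal{L}}{\partial w_i}
  = 2 w_i \operatorname{Var}\!\left(\mathcal{T}^{\mathrm{MVE}}_i\right) - \lambda = 0
\;\Longrightarrow\;
w_i = \frac{\operatorname{Var}\!\left(\mathcal{T}^{\mathrm{MVE}}_i\right)^{-1}}
           {\sum_j \operatorname{Var}\!\left(\mathcal{T}^{\mathrm{MVE}}_j\right)^{-1}}.
```

Each weight is proportional to the inverse variance of its candidate target, and the constraint fixes the normalization.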
3.3 Note on ensembles
This technique for calculating uncertainty estimates is applicable to any family of models from which we can sample. For example, we could train a Bayesian neural network for each model [MacKay, 1992], or use dropout as a Bayesian approximation by resampling the dropout masks each time we wish to sample a new model [Gal and Ghahramani, 2016]. These options could potentially give greater diversity among samples from the family, and thus better uncertainty estimates; exploring them further is a promising direction for future work. However, we found that these methods degraded the accuracy of the base models. An ensemble is far easier to train, and so we focus on that in this work. This is a common choice, as the use of ensembles for uncertainty estimation in deep reinforcement learning has seen wide adoption in the literature. It was first proposed by Osband et al. [2016] as a technique to improve exploration, and subsequent work showed that this approach gives a good estimate of the uncertainty of both value functions [Kalweit and Boedecker, 2017] and models [Kurutach et al., 2018].
4 Experiments
4.1 Implementation
We use DDPG [Lillicrap et al., 2016] as our baseline model-free algorithm. We train two deep feedforward neural networks, a Q-function network $\hat{Q}^\pi_\theta$ and a policy network $\pi_\phi$, by minimizing the loss functions given in Equations 2 and 3. We also train another three deep feedforward networks to represent our world model, corresponding to function approximators for the transition $\hat{T}_\xi$, termination $\hat{d}_\xi$, and reward $\hat{r}_\psi$, and minimize the loss function given in Equation 4.

When collecting rollouts for evaluation, we simply take the action selected by the policy, $\pi_\phi(s)$, at every state $s$. (Note that only the policy is required at test-time, not the ensembles of Q-functions, dynamics models, or reward models.) Each run was evaluated after every 500 updates by computing the mean total episode reward (referred to as score) across many environment restarts. To produce the lines in Figures 3, 4, and 5, these evaluation results were downsampled by splitting the domain into non-overlapping regions and computing the mean score within each region across several runs. The shaded area shows one standard deviation of scores in the region as defined above.
When collecting rollouts for our replay buffer, we do $\epsilon$-greedy exploration: with probability $\epsilon$, we select a random action by adding Gaussian noise to the pre-tanh policy action.
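A minimal sketch of this exploration rule; `epsilon`, `sigma`, and the seeded generator are illustrative choices rather than the paper's hyperparameters.

```python
import numpy as np

# Hedged sketch of the exploration rule described above: with probability
# epsilon, perturb the pre-tanh policy output with Gaussian noise before
# squashing with tanh.
def explore_action(pre_tanh_action, epsilon=0.1, sigma=0.3, rng=None):
    rng = rng or np.random.default_rng(0)
    if rng.random() < epsilon:
        pre_tanh_action = pre_tanh_action + rng.normal(
            0.0, sigma, size=np.shape(pre_tanh_action))
    return np.tanh(pre_tanh_action)  # actions stay in [-1, 1]
```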
All algorithms were implemented in TensorFlow [Abadi et al., 2016]. We use a distributed implementation to parallelize computation. In the style of Ape-X [Horgan et al., 2018], IMPALA [Espeholt et al., 2018], and D4PG [Barth-Maron et al., 2018], we use a centralized learner with several agents operating in parallel. Each agent periodically loads the most recent policy, interacts with the environment, and sends its observations to the central learner. The learner stores received frames in a replay buffer, and continuously loads batches of frames from this buffer to use as training data for a model update. In the algorithms with a model-based component, there are two learners: a policy-learner and a model-learner. In these cases, the policy-learner periodically reloads the latest copy of the model.

All baselines reported in this section were reimplementations of existing methods. This allowed us to ensure that the various methods compared were consistent with one another, and that the differences reported are fully attributable to the independent variables in question. Our baselines are competitive with state-of-the-art implementations of these algorithms [Haarnoja et al., 2018, Feinberg et al., 2018].
All MVE experiments utilize the TD-k trick. For hyperparameters and additional implementation details, please see Appendix C.³

³Our code is available open-source at: https://github.com/tensorflow/models/tree/master/research/steve

4.2 Comparison of Performance
We evaluated STEVE on a variety of continuous control tasks [Brockman et al., 2016, Klimov and Schulman]; we plot learning curves in Figure 3. We found that STEVE yields significant improvements in both performance and sample efficiency across a wide range of environments. Importantly, the gains are most substantial in the complex environments. On the most challenging environments (Humanoid-v1, RoboschoolHumanoid-v1, RoboschoolHumanoidFlagrun-v1, and BipedalWalkerHardcore-v2), STEVE is the only algorithm to show significant learning within 5M frames.
4.3 Ablation Study
In order to verify that STEVE’s gains in sample efficiency are due to the reweighting, and not simply due to the additional parameters of the ensembles of its components, we examine several ablations. Ensemble MVE is the regular MVE algorithm, but with the model and Q-functions replaced with ensembles. Mean-MVE uses the exact same architecture as STEVE, but uses a simple uniform weighting instead of the uncertainty-aware reweighting scheme. Similarly, TDL25 and TDL75 correspond to TD($\lambda$) reweighting schemes with $\lambda = 0.25$ and $\lambda = 0.75$, respectively. COV-STEVE is a version of STEVE which includes the covariances between candidate targets when computing the weights (see Section 3.2). We also investigate the effect of the horizon parameter $H$ on the performance of both STEVE and MVE. These results are shown in Figure 4.
All of these variants show the same trend: fast initial gains, which quickly taper off and are overtaken by the baseline. STEVE is the only variant to converge faster and to a higher score than the baseline; this provides strong evidence that the gains come specifically from the uncertainty-aware reweighting of targets. Additionally, we find that increasing the rollout horizon $H$ increases the sample efficiency of STEVE, even though the dynamics model for Humanoid-v1 has high error.
4.4 WallClock Comparison
In the previous experiments, we synchronized data collection, policy updates, and model updates. However, when we run these steps asynchronously, we can reduce the wall-clock time at the risk of instability. To evaluate this configuration, we compare DDPG, MVE-DDPG, and STEVE-DDPG on Humanoid-v1 and RoboschoolHumanoidFlagrun-v1. Both were trained on a P100 GPU and had 8 CPUs collecting data; STEVE-DDPG additionally used a second P100 to learn a model in parallel. We plot reward as a function of wall-clock time for these tasks in Figure 5. STEVE-DDPG learns more quickly on both tasks; it achieves a higher reward than DDPG and MVE-DDPG on Humanoid-v1, and performs comparably to DDPG on RoboschoolHumanoidFlagrun-v1. Moreover, in future work, STEVE could be accelerated by parallelizing training of each component of the ensemble.
5 Discussion
Our primary experiments (Section 4.2) show that STEVE greatly increases sample efficiency relative to baselines, matching or outperforming both MVE-DDPG and DDPG baselines on every task. STEVE also outperforms other recently-published results on these tasks in terms of sample efficiency [Gu et al., 2017, Haarnoja et al., 2018, Schulman et al., 2017]. Our ablation studies (Section 4.3) support the hypothesis that the increased performance is due to the uncertainty-dependent reweighting of targets, as well as demonstrate that the performance of STEVE consistently increases with longer horizon lengths, even in complex environments. Finally, our wall-clock experiments (Section 4.4) demonstrate that in spite of the additional computation per epoch, the gains in sample efficiency are enough that STEVE is competitive with model-free algorithms in terms of wall-clock time. The benefits of improved sample efficiency will only grow as samples become more expensive to collect, making STEVE a promising choice for applications involving real-world interaction.
Given that the improvements stem from the dynamic reweighting between horizon lengths, it may be interesting to examine the choices that the model makes about which candidate targets to favor most heavily. In Figure 6, we plot the average model usage over the course of training. Intriguingly, most of the lines seem to remain stable at around 50% usage, with two notable exceptions: Humanoid-v1, the most complex environment tested (with an observation space of size 376); and Swimmer-v1, the least complex environment tested (with an observation space of size 8). This supports the hypothesis that STEVE is trading off between Q-function bias and model bias; it chooses to ignore the model almost immediately when the environment is too complex to learn, and gradually ignores the model as the Q-function improves if an optimal environment model is learned quickly.
6 Related Work
Sutton and Barto [1998] describe TD($\lambda$), a family of Q-learning variants in which targets from multiple timesteps are merged via exponential decay. STEVE is similar in that it also computes a weighted average between targets. However, our approach is significantly more powerful because it adapts the weights to the specific characteristics of each individual rollout, rather than keeping them constant between examples and throughout training. Our approach can be thought of as a generalization of TD($\lambda$), in that the two approaches are equivalent in the specific case where the overall uncertainty grows exponentially at rate $\lambda$ at every timestep.
Munos et al. [2016] propose Retrace($\lambda$), a low-variance method for off-policy Q-learning. Retrace($\lambda$) is an off-policy correction method, so it learns from n-step off-policy data by multiplying each term of the loss by a correction coefficient, the trace, in order to reweight the data distribution to look more like the on-policy distribution. Specifically, at each timestep, Retrace($\lambda$) updates the coefficient for that term by multiplying it by $\lambda \min\left(1, \frac{\pi(a_s \mid s_s)}{\mu(a_s \mid s_s)}\right)$. Similarly to TD($\lambda$), the parameter $\lambda$ corresponds to an exponential decay of the weighting of potential targets. STEVE approximates this weighting in a more complex way, and additionally learns a predictive model of the environment (under which on-policy rollouts are possible) instead of using off-policy correction terms to reweight real off-policy rollouts.
Heess et al. [2015] describe stochastic value gradient (SVG) methods, which are a general family of hybrid modelbased/modelfree control algorithms. By reparameterizing distributions to separate out the noise, SVG is able to learn stochastic continuous control policies in stochastic environments. STEVE currently operates only with deterministic policies and environments, but this is a promising direction for future work.
Kurutach et al. [2018] propose model-ensemble trust-region policy optimization (ME-TRPO), which is motivated similarly to this work in that they also propose an algorithm which uses an ensemble of models to mitigate the deleterious effects of model bias. However, the algorithm is quite different. ME-TRPO is a purely model-based policy-gradient approach, and uses the ensemble to avoid overfitting to any one model. In contrast, STEVE interpolates between model-free and model-based estimates, uses a value-estimation approach, and uses the ensemble to explicitly estimate uncertainty.
Kalweit and Boedecker [2017] train on a mix of real and imagined rollouts, and adjust the ratio over the course of training by tying it to the variance of the Q-function. Similarly to our work, this variance is computed via an ensemble. However, they do not adapt to the uncertainty of individual estimates, only the overall ratio of real to imagined data. Additionally, they do not take into account model bias, or uncertainty in model predictions.
Weber et al. [2017] propose imagination-augmented agents (I2A), which use rollouts generated by the dynamics model as inputs to the policy function, “summarizing” the outputs of the rollouts with a deep neural network. This second network allows the algorithm to implicitly calculate uncertainty over various parts of the rollout and use that information when making its decision. However, I2A has only been evaluated on discrete domains. Additionally, the lack of explicit model use likely tempers the sample-efficiency benefits gained relative to more traditional model-based learning.
Gal et al. use a deep neural network in combination with the PILCO algorithm [Deisenroth and Rasmussen, 2011] to do sample-efficient reinforcement learning. They demonstrate good performance on the continuous-control task of cartpole swing-up. They model uncertainty in the learned neural dynamics function using dropout as a Bayesian approximation, and provide evidence that maintaining these uncertainty estimates is very important for model-based reinforcement learning.
Depeweg et al. [2016] use a Bayesian neural network as the environment model in a policy search setting, learning a policy purely from imagined rollouts. This work also demonstrates that modeling uncertainty is important for modelbased reinforcement learning with neural network models, and that uncertaintyaware models can escape many common pitfalls.
Gu et al. [2016] propose a continuous variant of Q-learning known as normalized advantage functions (NAF), and show that learning using NAF can be accelerated by using a model-based component. They use a variant of Dyna-Q [Sutton, 1990], augmenting the experience available to the model-free learner with imaginary on-policy data generated via environment rollouts. They use an iLQG controller and a learned locally-linear model to plan over small, easily-modelled regions of the environment, but find that using more complex neural network models of the environment can yield errors.
Thomas et al. [2015] define the $\Omega$-return, an alternative to the $\lambda$-return that accounts for the variance of, and correlations between, predicted returns at multiple timesteps. Similarly to STEVE, the target used is an unbiased linear combination of returns with minimum variance. However, the $\Omega$-return is not directly computable in non-tabular state spaces, and its estimation relies on n-step off-policy data rather than on a learned predictive model of the environment. Drawing a theoretical connection between the STEVE algorithm and the $\Omega$-return is an interesting potential direction for future work.
7 Conclusion
In this work, we demonstrated that STEVE, an uncertainty-aware approach for merging model-free and model-based reinforcement learning, outperforms model-free approaches while reducing sample complexity by an order of magnitude on several challenging tasks. We believe that this is a strong step towards enabling RL for practical, real-world applications. Since submitting this manuscript for publication, we have further explored the relationship between STEVE and recent work on overestimation bias [Fujimoto et al., 2018], and found evidence that STEVE may help to reduce this bias. Other future directions include exploring more complex world-models for various tasks, as well as comparing various techniques for calculating uncertainty and estimating bias.
Acknowledgments
The authors would like to thank the following individuals for their valuable insights and discussion: David Ha, Prajit Ramachandran, Tuomas Haarnoja, Dustin Tran, Matt Johnson, Matt Hoffman, Ishaan Gulrajani, and Sergey Levine. Also, we would like to thank Jascha Sohl-Dickstein, Joseph Antognini, Shane Gu, and Samy Bengio for their feedback during the writing process, and Erwin Coumans for his help on PyBullet environments. Finally, we would like to thank our anonymous reviewers for their insightful suggestions.
References

Abadi et al. [2016]
M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin,
S. Ghemawat, G. Irving, M. Isard, et al.
Tensorflow: A system for largescale machine learning.
In OSDI, volume 16, pages 265–283, 2016.  Barth-Maron et al. [2018] G. Barth-Maron, M. W. Hoffman, D. Budden, W. Dabney, D. Horgan, D. TB, A. Muldal, N. Heess, and T. Lillicrap. Distributional policy gradients. In International Conference on Learning Representations, 2018.
 Brockman et al. [2016] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
 Deisenroth and Rasmussen [2011] M. Deisenroth and C. E. Rasmussen. PILCO: A modelbased and dataefficient approach to policy search. In Proceedings of the 28th International Conference on machine learning (ICML11), pages 465–472, 2011.
 Depeweg et al. [2016] S. Depeweg, J. M. HernándezLobato, F. DoshiVelez, and S. Udluft. Learning and policy search in stochastic dynamical systems with Bayesian neural networks. 2016.
 Espeholt et al. [2018] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, et al. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In Proceedings of the International Conference on Machine Learning, 2018.
 Feinberg et al. [2018] V. Feinberg, A. Wan, I. Stoica, M. I. Jordan, J. E. Gonzalez, and S. Levine. Modelbased value estimation for efficient modelfree reinforcement learning. arXiv preprint arXiv:1803.00101, 2018.
 Fleiss [1993] J. Fleiss. Review papers: The statistical basis of metaanalysis. Statistical methods in medical research, 2(2):121–145, 1993.
 Fujimoto et al. [2018] S. Fujimoto, H. van Hoof, and D. Meger. Addressing function approximation error in actorcritic methods. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1587–1596, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/fujimoto18a.html.

Gal and Ghahramani [2016]
Y. Gal and Z. Ghahramani.
Dropout as a Bayesian approximation: Representing model uncertainty in deep learning.
In international conference on machine learning, pages 1050–1059, 2016.  [11] Y. Gal, R. McAllister, and C. E. Rasmussen. Improving PILCO with Bayesian neural network dynamics models.
 Gu et al. [2016] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine. Continuous deep Qlearning with modelbased acceleration. In International Conference on Machine Learning, pages 2829–2838, 2016.
 Gu et al. [2017] S. Gu, T. Lillicrap, Z. Ghahramani, R. E. Turner, and S. Levine. Qprop: Sampleefficient policy gradient with an offpolicy critic. International Conference on Learning Representations, 2017.
 Haarnoja et al. [2018] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actorcritic: Offpolicy maximum entropy deep reinforcement learning with a stochastic actor, 2018.
 Heess et al. [2015] N. Heess, G. Wayne, D. Silver, T. Lillicrap, T. Erez, and Y. Tassa. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, pages 2944–2952, 2015.
 Horgan et al. [2018] D. Horgan, J. Quan, D. Budden, G. BarthMaron, M. Hessel, H. van Hasselt, and D. Silver. Distributed prioritized experience replay. In International Conference on Learning Representations, 2018.
 Kalweit and Boedecker [2017] G. Kalweit and J. Boedecker. Uncertaintydriven imagination for continuous deep reinforcement learning. In Conference on Robot Learning, pages 195–206, 2017.
 Kingma and Ba [2015] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015.
 [19] O. Klimov and J. Schulman. Roboschool. https://github.com/openai/roboschool.
 Kurutach et al. [2018] T. Kurutach, I. Clavera, Y. Duan, A. Tamar, and P. Abbeel. Modelensemble trustregion policy optimization. In International Conference on Learning Representations, 2018.
 Lillicrap et al. [2016] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. International Conference on Learning Representations, 2016.

MacKay [1992]
D. J. MacKay.
A practical Bayesian framework for backpropagation networks.
Neural computation, 4(3):448–472, 1992.  Mnih et al. [2013] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. In NIPS Deep Learning Workshop, 2013.
 Munos et al. [2016] R. Munos, T. Stepleton, A. Harutyunyan, and M. Bellemare. Safe and efficient offpolicy reinforcement learning. In Advances in Neural Information Processing Systems, pages 1054–1062, 2016.
 Osband et al. [2016] I. Osband, C. Blundell, A. Pritzel, and B. Van Roy. Deep exploration via bootstrapped DQN. In Advances in neural information processing systems, pages 4026–4034, 2016.
 Schulman et al. [2017] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 Silver et al. [2016] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
 Sutton [1990] R. S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine Learning Proceedings 1990, pages 216–224. Elsevier, 1990.
 Sutton and Barto [1998] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction, volume 1. MIT Press Cambridge, 1998.
 Thomas et al. [2015] P. S. Thomas, S. Niekum, G. Theocharous, and G. Konidaris. Policy evaluation using the return. In Advances in Neural Information Processing Systems, pages 334–342, 2015.
 Weber et al. [2017] T. Weber, S. Racanière, D. P. Reichert, L. Buesing, A. Guez, D. J. Rezende, A. P. Badia, O. Vinyals, N. Heess, Y. Li, et al. Imagination-augmented agents for deep reinforcement learning. In Advances in Neural Information Processing Systems, 2017.
Appendix A Toy Problem: A Tabular FSM with Model Noise
To demonstrate the benefits of Bayesian model-based value expansion, we evaluated it on a toy problem. We used a finite-state environment with states $\{s_0, \ldots, s_{100}\}$ and a single action available at every state, which always moves from state $s_i$ to $s_{i+1}$, starting at $s_0$ and terminating at $s_{100}$. The reward for every action is $-1$, except when moving from $s_{99}$ to $s_{100}$, which is $+100$. Since this environment is so simple, there is only one possible policy $\pi$, which is deterministic. It is possible to compute the true action-value function in closed form: $Q^*(s_i, a) = 100 - (99 - i)$.
We estimate the value of each state using tabular TD-learning. We maintain a tabular function $\hat{Q}(s_i)$, which is simply a lookup table matching each state to its estimated value. We initialize all values to random integers between 0 and 99, except for the terminal state $s_{100}$, which we initialize to 0 (and keep fixed at 0 at all times). We update $\hat{Q}$ using the standard undiscounted one-step TD update, $\hat{Q}(s_i) \leftarrow r_i + \hat{Q}(s_{i+1})$. For each update, we sample a non-terminal state and its corresponding transition at random. For experiments with an ensemble of Q-functions, we repeat this update once for each member of the ensemble at each timestep.
The transition and reward function of the oracle dynamics model behaved exactly like the true environment. In the "noisy" dynamics model, noise was added in the following way: 10% of the time, rather than correctly moving from $s_i$ to $s_{i+1}$, the model transitions to a random state. (Other techniques for adding noise gave qualitatively similar results.)
On the y-axis of Figure 1, we plot the mean squared error between the predicted and true values of each state: $\frac{1}{100}\sum_{i=0}^{99}\left(\hat{Q}(s_i) - Q^*(s_i, a)\right)^2$.
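The setup above is small enough to sketch in full. The following is an illustrative reconstruction, not the authors' code: it assumes 101 states, a per-step reward of $-1$ with $+100$ on the final transition, and replaces random state sampling with full sweeps (equivalent here, since every non-terminal state is updated either way); `noisy_next` shows the 10%-corruption model.

```python
import random

N_STATES = 101  # states s_0 .. s_100; s_100 is terminal (illustrative reconstruction)

def reward(i):
    # -1 per transition, except s_99 -> s_100, which gives +100
    return 100.0 if i == N_STATES - 2 else -1.0

def true_q(i):
    # closed form: the +100 terminal reward minus one unit per earlier step
    return 100.0 - (N_STATES - 2 - i)

def noisy_next(i, noise_p=0.1):
    # oracle model moves i -> i+1; the noisy model sometimes jumps anywhere
    if random.random() < noise_p:
        return random.randrange(N_STATES)
    return i + 1

def td_sweep(q, next_state):
    # undiscounted one-step TD update for every non-terminal state
    for i in range(N_STATES - 1):
        q[i] = reward(i) + q[next_state(i)]
    q[-1] = 0.0  # terminal value stays fixed at 0

q = [float(random.randrange(100)) for _ in range(N_STATES)]
q[-1] = 0.0
for _ in range(200):
    td_sweep(q, lambda i: i + 1)  # oracle transitions

mse = sum((q[i] - true_q(i)) ** 2 for i in range(N_STATES - 1)) / (N_STATES - 1)
```

Under the oracle model the values propagate backward from the terminal state, one state per sweep, so 200 sweeps drive the MSE to zero; swapping in `noisy_next` reproduces the persistent error plotted in Figure 1.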
For both the STEVE and MVE experiments, we use an ensemble of size 8 for both the model and the Q-function. To compute the MVE target, we average across all ensembled rollouts and predictions.
Appendix B The TD-k Trick
The TD-k trick, proposed by Feinberg et al. [2018], involves training the Q-function using every intermediate state of the rollout: each imagined state along the trajectory is regressed toward its own remaining-horizon target, where the rollout states, actions, and rewards are defined as in Equation 5.
To summarize Feinberg et al. [2018], the TD-k trick is helpful because the off-policy states collected by the replay buffer may have little overlap with the states encountered during on-policy model rollouts. Without the TD-k trick, the Q-function approximator is trained to minimize error only on states collected from the replay buffer, so it is likely to have high error on states not present in the replay buffer. This would imply that the Q-function has high error on states produced by model rollouts, and that this error may in fact continue to increase the more steps of on-policy rollout we take. By invoking the TD-k trick, and training the Q-function on intermediate steps of the rollout, Feinberg et al. [2018] show that we can decrease the Q-function bias on frames encountered during model-based rollouts, leading to better targets and improved performance.
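As a rough sketch of the idea (not the authors' exact loss, whose notation follows Equation 5), the target for every intermediate rollout state can be built in a single backward pass over the imagined rewards; `GAMMA` is an illustrative discount:

```python
import numpy as np

GAMMA = 0.99  # illustrative discount factor

def tdk_targets(rewards, bootstrap):
    """Given imagined rewards r_0..r_{H-1} and a bootstrap value Q(s_H, a_H),
    return one n-step target per intermediate rollout state:
    target_i = sum_{j=i}^{H-1} GAMMA**(j-i) * r_j + GAMMA**(H-i) * bootstrap."""
    H = len(rewards)
    targets = np.empty(H)
    acc = bootstrap
    for i in reversed(range(H)):
        acc = rewards[i] + GAMMA * acc
        targets[i] = acc
    return targets

def tdk_loss(q_values, rewards, bootstrap):
    # squared error between Q(s_i, a_i) and its target, averaged over
    # every step of the rollout (the TD-k trick), not just the first state
    return float(np.mean((q_values - tdk_targets(rewards, bootstrap)) ** 2))
```

Training on all `H` targets, rather than only `target_0`, is what exposes the Q-function to the on-policy imagined states.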
The TD-k trick is orthogonal to STEVE. STEVE tends to ignore estimates produced by states with poorly-learned Q-values, so it is not hurt nearly as much as MVE by the distribution-mismatch problem. However, better Q-values will certainly provide more information with which to compute STEVE's target, so in that regard the TD-k trick seems beneficial. An obvious question is whether these two approaches are complementary. STEVE+TD-k is beyond the scope of this work, and we did not give it a rigorous treatment; however, initial experiments were not promising. In future work, we hope to explore the connection between these two approaches more deeply.
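The reason STEVE can ignore poorly-learned estimates is its uncertainty weighting. A minimal sketch of that mechanism (an assumed simplification of the interpolation in the main text, not the exact formula) weights each candidate horizon's mean estimate by the inverse of its ensemble variance:

```python
import numpy as np

def steve_target(candidate_samples):
    """candidate_samples: array of shape (H+1, M) -- for each rollout horizon
    h = 0..H, M target samples drawn from the model/Q-function ensembles.
    Horizons whose samples disagree (high variance) get low weight."""
    means = candidate_samples.mean(axis=1)
    variances = candidate_samples.var(axis=1)
    weights = 1.0 / (variances + 1e-8)  # small epsilon for numerical stability
    weights /= weights.sum()
    return float(np.dot(weights, means))
```

A horizon whose ensemble members wildly disagree contributes almost nothing to the interpolated target, which is how long rollouts from a poor model (or through poorly-learned Q-values) are automatically discounted.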
Appendix C Implementation Details
All models were feedforward neural networks with ReLU nonlinearities. The policy network, reward model, and termination model each had 4 layers of size 128, while the transition model had 8 layers of size 512. All environments were reset after 1000 timesteps. Parameters were trained with the Adam optimizer
Kingma and Ba [2015] with a learning rate of 3e-4. Policies were trained using minibatches of size 512 sampled uniformly at random from a replay buffer of size 1e6. The first 1e5 frames were sampled via random interaction with the environment; after that, 4 policy updates were performed for every frame sampled from the environment. (In Section 4.4, the policy updates and frames were instead desynced.) Policy checkpoints were saved every 500 updates; these checkpoints were also frozen and used as the target networks. For model-based algorithms, the most recent checkpoint of the model was loaded every 500 updates as well.
Each policy training run had 8 agents interacting with the environment to send frames back to the replay buffer. These agents typically took the greedy action predicted by the policy, but with some fixed probability instead took an action sampled from a normal distribution centered on the pre-tanh logit predicted by the policy. In addition, each policy had two greedy agents interacting with the environment for evaluation.
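A sketch of this exploration scheme follows; the probability and noise scale are illustrative placeholders (the paper's exact values are not reproduced here):

```python
import numpy as np

NOISE_PROB = 0.1  # placeholder probability of taking a noisy action
NOISE_STD = 0.1   # placeholder stddev of the pre-tanh perturbation

def select_action(logit, rng):
    # greedy: squash the policy's logit through tanh; with probability
    # NOISE_PROB, perturb the pre-tanh logit with Gaussian noise first,
    # so exploration noise is applied before the squashing nonlinearity
    if rng.random() < NOISE_PROB:
        logit = logit + rng.normal(0.0, NOISE_STD, size=np.shape(logit))
    return np.tanh(logit)
```

Perturbing before the tanh keeps every exploratory action inside the valid (-1, 1) action range, unlike adding noise to the squashed action directly.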
Dynamics models were trained using minibatches of size 1024 sampled uniformly at random from a replay buffer of size 1e6. The first 1e5 frames were sampled via random interaction with the environment; the dynamics model was then pretrained for 1e5 updates. After that, 4 model updates were performed for every frame sampled from the environment. (In Section 4.4, the model updates and frames were instead desynced.) Model checkpoints were saved every 500 updates.
All ensembles were of size 4. During training, each ensemble member was trained on an independently-sampled minibatch; all minibatches were drawn from the same buffer. Additionally, for all experiments.
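The ensemble-training scheme described above amounts to drawing one independent index set per member from the shared buffer; a minimal sketch (sizes are illustrative, not the paper's):

```python
import numpy as np

ENSEMBLE_SIZE = 4
BATCH_SIZE = 8  # illustrative; far smaller than the paper's minibatches

def ensemble_minibatches(buffer, rng):
    # one independently-sampled minibatch per ensemble member,
    # all drawn (with replacement) from the same shared replay buffer
    return [buffer[rng.integers(0, len(buffer), size=BATCH_SIZE)]
            for _ in range(ENSEMBLE_SIZE)]
```

Sampling independent minibatches, rather than sharing one batch across members, is what keeps the ensemble members decorrelated enough for their disagreement to serve as an uncertainty signal.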