Beyond Prioritized Replay: Sampling States in Model-Based RL via Simulated Priorities

by Jincheng Mei, et al.
University of Alberta

Model-based reinforcement learning (MBRL) can significantly improve sample efficiency, particularly when carefully choosing the states from which to sample hypothetical transitions. Such prioritization has been empirically shown to be useful for both experience replay (ER) and Dyna-style planning. However, there is as yet little theoretical understanding in RL of such prioritization strategies and why they help. In this work, we revisit prioritized ER and, in an ideal setting, show an equivalence to minimizing a cubic loss, providing theoretical insight into why it improves upon uniform sampling. This ideal setting, however, cannot be realized in practice, due to insufficient coverage of the sample space and outdated priorities of training samples. This motivates our model-based approach, which does not suffer from these limitations. Our key idea is to actively search for high-priority states using gradient ascent. Under certain conditions, we prove that the distribution of hypothetical experiences generated from these states provides a diverse set of states, sampled proportionally to approximately true priorities. Our experiments on both benchmark and application-oriented domains show that our approach achieves superior performance over both the model-free prioritized ER method and several closely related model-based baselines.



1 Introduction

Using hypothetical experience simulated from an environment model can significantly improve the sample efficiency of RL agents (Ha and Schmidhuber, 2018; Holland et al., 2018; Pan et al., 2018; van Hasselt et al., 2019). Dyna (Sutton, 1991) is a classical MBRL architecture in which the agent uses real experience to update its policy as well as its reward and dynamics models. In between taking actions, the agent can simulate hypothetical experience from the model to further improve the policy.

An important question for effective Dyna-style planning is search-control: from what states should the agent simulate hypothetical transitions? On each planning step in Dyna, the agent has to select a state and action from which to query the model for the next state and reward. This question, in fact, already arises in what is arguably the simplest variant of Dyna: Experience Replay (ER) (Lin, 1992). In ER, visited transitions are stored in a buffer and, at each time step, a mini-batch of experiences is sampled to update the value function. ER can be seen as an instance of Dyna, using a (limited) non-parametric model given by the buffer (see van Seijen and Sutton (2015) for a deeper discussion). Performance can be significantly improved by sampling proportionally to priorities based on errors, as in prioritized ER (Schaul et al., 2016; de Bruin et al., 2018), as well as by specialized sampling for the off-policy setting (Schlegel et al., 2019).

Search-control strategies in Dyna similarly often rely on using priorities, though they can be more flexible in leveraging the model rather than being limited to only retrieving visited experiences. For example, a model enables the agent to sweep backwards by generating predecessors, as in prioritized sweeping (Moore and Atkeson, 1993; Sutton et al., 2008; Pan et al., 2018; Corneil et al., 2018). Other methods have tried alternatives to error-based prioritization, such as searching for states with high reward (Goyal et al., 2019), high value (Pan et al., 2019) or states that are difficult to learn (Pan et al., 2020). Another strategy has been to generate a more diverse set of states from which to sample (Gu et al., 2016; Holland et al., 2018), or to modulate the distance of such states from real experience (Janner et al., 2019). These methods are all supported by nice intuitions, but as yet lack solid theoretical reasons for why they can improve sample efficiency.

In this work, we provide new insights into how to choose the sampling distribution over states from which we generate hypothetical experience. In particular, we theoretically motivate why error-based prioritization is effective, and provide a mechanism to generate states according to more accurate error estimates. We first prove that regression with error-based prioritized sampling is equivalent to minimizing a cubic objective with uniform sampling in an ideal setting. We then show that minimizing the cubic power objective yields a faster convergence rate during the early stage of learning, providing theoretical motivation for error-based prioritization. We point out that this ideal setting is hard to achieve in practice using only ER, due to two issues: insufficient sample space coverage and outdated priorities. Hence we propose a search-control strategy in Dyna that leverages a model to simulate errors and to find states with high expected error. Finally, we demonstrate the efficacy of our method on various benchmark domains and an autonomous driving application.

2 Problem Formulation

We formalize the problem as a Markov Decision Process (MDP), a tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$ including state space $\mathcal{S}$, action space $\mathcal{A}$, probability transition kernel $P(\cdot|s,a)$, reward function $r$, and discount rate $\gamma \in [0, 1)$. At each environment time step $t$, an RL agent observes a state $s_t$ and takes an action $a_t$. The environment transitions to the next state $s_{t+1} \sim P(\cdot|s_t, a_t)$ and emits a scalar reward signal $r_{t+1}$. A policy $\pi(\cdot|s)$ is a mapping that determines the probability of choosing an action at a given state.

  Input: hill climbing criterion $g(\cdot)$, mini-batch size $b$. Initialize empty search-control queue $B_{sc}$; empty ER buffer $B_{er}$; initialize policy and model $\mathcal{M}$
  for $t = 1, 2, \ldots$ do
     Add the transition $(s_t, a_t, s_{t+1}, r_{t+1})$ to $B_{er}$
     while within some budget of time steps do
        $s \leftarrow s + \alpha \nabla_s g(s)$ //hill climbing
        Add $s$ into $B_{sc}$
     for several planning updates do
        for several times do
           Sample $s \sim B_{sc}$, pair it with an on-policy action $a$, and query $\mathcal{M}$ for $s', r$
        Sample experiences from $B_{er}$, add them to the hypothetical ones
        Update policy on the mixed mini-batch
Algorithm 1 HC-Dyna: Generic framework

The agent’s objective is to find an optimal policy. A popular algorithm is Q-learning (Watkins and Dayan, 1992), where parameterized action-values $Q_\theta$ are updated using $\theta \leftarrow \theta + \alpha \delta_t \nabla_\theta Q_\theta(s_t, a_t)$ for stepsize $\alpha > 0$, with TD-error $\delta_t = r_{t+1} + \gamma \max_{a'} Q_\theta(s_{t+1}, a') - Q_\theta(s_t, a_t)$. The policy is defined by acting greedily w.r.t. these action-values. ER is critical when using neural networks to estimate $Q_\theta$, as in DQN (Mnih et al., 2015), both to stabilize and speed up learning. MBRL has the potential to provide even further sample efficiency improvements.
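As a minimal sketch of this update (a toy tabular example of our own, not the DQN configuration used later in the paper):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One Q-learning step: move Q[s, a] toward the TD target by stepsize alpha."""
    td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]  # delta_t
    Q[s, a] += alpha * td_error
    return td_error

# Tiny 2-state, 2-action example with a deterministic reward of 1.
Q = np.zeros((2, 2))
delta = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
# td_error = 1.0 + 0.9*0 - 0 = 1.0, so Q[0, 1] becomes 0.1
```

The TD-error returned here is exactly the quantity whose magnitude prioritized ER uses as a priority.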

We build on the Dyna formalism (Sutton, 1991) for MBRL, and more specifically on the recently proposed HC-Dyna (Pan et al., 2019), shown in Algorithm 1. Its distinguishing feature is a special Hill Climbing (HC)¹ search-control—the mechanism of generating states or state-action pairs from which to query the model for next states and rewards (i.e. hypothetical experiences)—which generates states by hill climbing on some criterion function $g$. For example, $g$ is the value function in Pan et al. (2019) and the norm of the gradient of the value function in Pan et al. (2020); the former is used as a measure of the "importance" of states, and the latter as a measure of value approximation difficulty.

¹The term Hill Climbing is used for generality, as the vanilla gradient ascent procedure is modified to resolve certain challenges (Pan et al., 2019).

These hypothetical transitions are treated just like real transitions. For this reason, HC-Dyna combines both real experience and hypothetical experience into mini-batch updates. These updates, performed before taking the next action, are called planning updates, as they improve the action-value estimates—and so the policy—by means of a model.

3 A Deeper Look at Error-based Prioritized Sampling

In this section, we provide theoretical motivation for error-based prioritized sampling and empirically investigate several insights highlighted by this theory. We show that prioritized sampling can be reformulated as optimizing a cubic power objective with uniform sampling, and we prove that optimizing the cubic objective provides a faster convergence rate during early learning. Based on these results, we point out that for error-based prioritization, such as prioritized ER, to manifest its advantages, it must rely on up-to-date priorities and sufficient coverage of the sample space, and we empirically highlight the issues that arise when these two properties do not hold.

3.1 Prioritized Sampling as a Cubic Objective

In regression, we minimize the mean squared error $\frac{1}{2n}\sum_{i=1}^{n} (f_\theta(x_i) - y_i)^2$, for training set $\{(x_i, y_i)\}_{i=1}^{n}$ and function approximator $f_\theta$, such as a neural network. In error-based prioritized sampling, we define the priority of a sample $(x_i, y_i)$ as its absolute error $|f_\theta(x_i) - y_i|$; the probability of drawing a sample is typically proportional to its priority. We employ the following form to compute the probabilities:

$$q_i = \frac{|f_\theta(x_i) - y_i|}{\sum_{j=1}^{n} |f_\theta(x_j) - y_j|}. \qquad (1)$$

We can show an equivalence between the gradients of the squared objective with this prioritization and the cubic power objective $\frac{1}{3n}\sum_{i=1}^{n} |f_\theta(x_i) - y_i|^3$. See Appendix A.2 for the proof.

Theorem 1.

For a constant $c$ determined by $\theta$, we have
$$\mathbb{E}_{i \sim q}\!\left[\nabla_\theta \tfrac{1}{2}\left(f_\theta(x_i) - y_i\right)^2\right] = c \cdot \nabla_\theta \frac{1}{3n}\sum_{j=1}^{n} \left|f_\theta(x_j) - y_j\right|^3 .$$

This simple theorem provides an intuitive reason for why prioritized sampling can help improve sample efficiency: the gradient of the cubic function is steeper than that of the square function when the error is relatively large (Figure 1). Theorem 2 further characterizes the difference between the convergence rates obtained by optimizing the mean square error and the cubic power objective.
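Theorem 1 can be checked numerically. The sketch below (our own, assuming a hypothetical linear model $f(x) = x^\top w$) compares the expected gradient of the squared loss under prioritized sampling with the uniform-sampling gradient of the cubic objective:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = rng.normal(size=d)                 # linear model f(x) = x @ w

err = X @ w - y                        # per-sample errors e_i
q = np.abs(err) / np.abs(err).sum()    # prioritized sampling probs, eq. (1)

# Expected gradient of the squared loss under prioritized sampling:
# E_{i~q}[ grad 1/2 e_i^2 ] = sum_i q_i * e_i * x_i
grad_prioritized = (q * err) @ X

# Gradient of the cubic objective 1/(3n) sum_i |e_i|^3 under uniform sampling:
grad_cubic = (np.abs(err) * err) @ X / n

c = n / np.abs(err).sum()              # the constant from Theorem 1
assert np.allclose(grad_prioritized, c * grad_cubic)
```

The constant works out to $c = n / \sum_j |f_\theta(x_j) - y_j|$ for this model, so the two gradient directions coincide exactly.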

Theorem 2 (Fast early learning).

Consider the following two objectives: $\ell_2(f) = \frac{1}{2}(f - y)^2$ and $\ell_3(f) = \frac{1}{3}|f - y|^3$. Denote $\delta_2(t) = f_2(t) - y$ and $\delta_3(t) = f_3(t) - y$. Define the functional gradient flow updates on these two objectives:

$$\frac{d f_2(t)}{dt} = -\left(f_2(t) - y\right), \qquad \frac{d f_3(t)}{dt} = -\left|f_3(t) - y\right|\left(f_3(t) - y\right).$$

Given error threshold $\epsilon > 0$, define the hitting times $t_2 = \min\{t : |\delta_2(t)| \le \epsilon\}$ and $t_3 = \min\{t : |\delta_3(t)| \le \epsilon\}$. For any initial function value s.t. $|\delta_2(0)| = |\delta_3(0)| = |\delta(0)| > 1$, there exists $\epsilon_0 < |\delta(0)|$ such that $t_3 < t_2$ for all $\epsilon \in (\epsilon_0, |\delta(0)|)$.²

²Finding the exact value of $\epsilon_0$ would require a definition of ordering on the complex plane; it amounts to solving $1/\epsilon + \ln \epsilon = 1/|\delta(0)| + \ln|\delta(0)|$, whose solution can be expressed via the Wright Omega function. Our theorem statement is sufficient for the purpose of characterizing the convergence rate.


Please see Appendix A.3 for the full proof. Given the same $\epsilon$ and the same initial value $|\delta(0)|$, we first derive the hitting times $t_2 = \ln\left(|\delta(0)|/\epsilon\right)$ and $t_3 = 1/\epsilon - 1/|\delta(0)|$. Then we analyze the condition on $\epsilon$ under which $t_3 < t_2$, i.e. when minimizing the square error is slower than minimizing the cubic error. ∎

The above theorem says that when the initial error is relatively large, it is faster to reach a given low error with the cubic objective. We can test this in simulation with the two minimization problems $\min_f \frac{1}{2}(f - y)^2$ and $\min_f \frac{1}{3}|f - y|^3$. We use the hitting time formulae derived in the proof to compute the hitting time ratio $t_2/t_3$ under different initial values $|\delta(0)|$ and final error values $\epsilon$. In Figure 1(c)(d), we can see that it usually takes a significantly shorter time for the cubic loss to reach a given $\epsilon$ across various settings.
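Using the closed-form hitting times from the gradient flows—$t_2 = \ln(|\delta(0)|/\epsilon)$ for the square loss and $t_3 = 1/\epsilon - 1/|\delta(0)|$ for the cubic loss—a quick check (our own) of the early- versus late-learning regimes:

```python
import math

def t_square(delta0, eps):
    # squared-loss flow: delta(t) = delta(0) * exp(-t)  ->  t2 = ln(delta0 / eps)
    return math.log(delta0 / eps)

def t_cubic(delta0, eps):
    # cubic-loss flow: d delta/dt = -delta^2 (delta > 0)  ->  t3 = 1/eps - 1/delta0
    return 1.0 / eps - 1.0 / delta0

# Early learning (large initial error, moderate target): cubic hits first.
assert t_cubic(10.0, 1.0) < t_square(10.0, 1.0)
# Very small target error: 1/eps eventually dominates ln(1/eps), square wins.
assert t_cubic(10.0, 1e-4) > t_square(10.0, 1e-4)
```

This matches the "fast early learning" reading of Theorem 2: the cubic objective wins for moderately small $\epsilon$, but not asymptotically as $\epsilon \to 0$.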

(a) cubic v.s. square
(b) |derivative|
(c) initial value v.s. hitting time ratio
(d) target value v.s. hitting time ratio
Figure 1: (a) shows the cubic v.s. the square function. (b) shows their absolute derivatives. (c) shows the hitting time ratio $t_2/t_3$ v.s. the initial value $|\delta(0)|$ under different target values $\epsilon$. (d) shows the ratio v.s. the target $\epsilon$ to reach, under different $|\delta(0)|$. Note that a ratio larger than 1 indicates a longer time for the square loss to reach the given $\epsilon$.

3.2 Empirical Demonstrations

In this section, we empirically show: 1) the practical performance of the cubic objective; 2) the importance of sufficient sample space coverage and of updating the priorities of all training samples; 3) why higher-power objectives should not be preferred in general. We refer readers to Appendix A.6 for missing details and to Appendix A.7 for additional experiments.

We conduct experiments on a supervised learning task. We use the dataset from Pan et al. (2020), where it is shown that the high-frequency region of the target function is the main source of prediction error; hence we expect prioritized sampling to make a clear difference in terms of sample efficiency. We generate a training set by uniformly sampling the input domain and adding zero-mean Gaussian noise with standard deviation $\sigma$ to the target values of a piecewise sine function (high frequency on one half of the domain, low frequency on the other). The targets in the testing set are not noise-contaminated.

We compare the following algorithms. L2: regression with uniform sampling from the training set. Full-PrioritizedL2: regression with prioritized sampling according to the distribution defined in (1); the priorities of all samples in the training set are updated after each mini-batch update. PrioritizedL2: the only difference from Full-PrioritizedL2 is that only the priorities of the training examples sampled in the mini-batch are updated at each iteration; the rest of the training samples keep their original priorities. Note that this resembles what vanilla Prioritized ER does in the RL setting (Schaul et al., 2016). Cubic: minimizing the cubic objective with uniform sampling. Power4: minimizing the fourth-power objective with uniform sampling; we include it to show that there is almost no gain, and performance may even be hurt, from using higher powers.
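The gap between Full-PrioritizedL2 and PrioritizedL2 comes down to stale priorities. A toy numeric illustration (our own construction, not the paper's experiment) of how a single stale entry distorts the sampling distribution:

```python
import numpy as np

def priorities(err):
    """Normalize absolute errors into a sampling distribution, as in eq. (1)."""
    p = np.abs(err)
    return p / p.sum()

# Current absolute errors of 6 training samples.
errors = np.array([5.0, 4.0, 0.1, 0.1, 0.1, 0.1])
stored = np.abs(errors.copy())      # priorities stored alongside the samples

# Suppose a mini-batch update fixes samples 0 and 1 (errors drop to 0.1),
# but only sample 0 was in the mini-batch, so only its priority is refreshed.
errors[[0, 1]] = 0.1
stored[0] = np.abs(errors[0])       # refreshed
# sample 1 keeps its stale priority of 4.0

q_stale = stored / stored.sum()     # what PrioritizedL2 would sample from
q_true = priorities(errors)         # what Full-PrioritizedL2 samples from
# The stale distribution still heavily favors sample 1 despite its low error.
```

With one stale entry, roughly 90% of the sampling mass sits on a sample whose true error is already small, which is exactly the pathology observation 4) describes.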

We use tanh layers for all algorithms and optimize the learning rate over a fixed range. Figure 2 (a)-(d) show the learning curves in terms of testing error for all the above algorithms under various settings.³ We identify five important observations: 1) with a small mini-batch size, there is a significant difference between Full-PrioritizedL2 and Cubic; 2) with increased mini-batch size, although all algorithms perform better, Cubic achieves the largest improvement and its behavior tends to approximate that of the prioritized sampling algorithm; 3) as shown in Figure 2 (a), prioritized sampling does not show an advantage when the training set is small; 4) PrioritizedL2, which does not update all priorities, can be significantly worse than vanilla (uniform sampling) regression; 5) when the noise standard deviation is increased, all algorithms perform worse, and the higher the power of the objective, the more it is hurt.

³We show the testing error as it is what is finally of concern. The training error has similar comparative performance and is presented in the Appendix, where we also include additional results with different settings.

The importance of sample space coverage. Observations 1) and 2) show that the high-power objective has to use a much larger mini-batch size to achieve performance comparable to prioritized sampling. Though this coincides with Theorem 1 in that the two algorithms are equivalent in expectation, prioritized sampling appears more robust to small mini-batches, which is an advantage in stochastic gradient methods. A possible reason is that prioritized sampling can immediately draw many samples from high-error regions, while uniform sampling draws fewer of those samples under a limited mini-batch size. This motivates us to test prioritized sampling with a small training set, where both algorithms get fewer samples everywhere. Figure 2(a) together with (b) indicates that prioritized sampling needs sufficient samples across the sample space to maintain its advantage. This requirement is intuitive, but it illuminates an important limitation of prioritized ER in RL: only visited real experiences from the ER buffer can be sampled. If the state space is large, the ER buffer likely contains only a small subset of the state space, i.e., a very small training set.

Thorough priority updating. Observation 4) reminds us of the importance of using an up-to-date sampling distribution at each time step. Outdated priorities change the sampling distribution in an unpredictable manner, and learning performance can suffer. Though the effect of updating the priorities of all samples (or not) is intuitive, it has received little attention in the existing RL literature. We further verify this phenomenon on the classical Mountain Car domain (Sutton and Barto, 2018; Brockman et al., 2016). Figure 2(e) shows the evaluation learning curves of the variants of Deep Q-Networks (DQN) corresponding to the supervised learning algorithms. We use a small ReLU NN as the Q-function; we expect a small NN to highlight the issue of priority updating, since every mini-batch update potentially perturbs the values of many other states, making it likely that many experiences in the ER buffer carry wrong priorities without thorough priority updating. One can see that Full-PrioritizedER significantly outperforms the vanilla PrioritizedER algorithm, which only updates priorities of the sampled mini-batch at each time step. However, updating the priorities of all samples in the ER buffer at each time step is usually computationally too expensive and does not scale with the number of visited samples.

Regarding high power objectives. As discussed above, observations 1) and 2) tell us that a high-power objective requires a large mini-batch size (ideally a batch algorithm, i.e. the whole training set) to manifest its advantage in convergence rate. This makes the algorithm not easily scalable to large training datasets. Observation 5) indicates another reason why a high-power objective should not be preferred: it amplifies the effect of noise added to the target variables. In Figure 2(d), the Power4 objective suffers most from the increased target noise.

(a) b=128, small training set
(b) b=128
(c) b=512
(d) b=512, larger target noise
(e) Mountain Car
Figure 2: Testing RMSE v.s. number of mini-batch updates. (a)-(d) show the learning curves under different mini-batch sizes $b$ and Gaussian noise variances added to the training targets. (a) uses a smaller training set than the others (solid and dotted lines for the two training-set sizes) but the same testing set size. (e) shows a corresponding experiment in the RL setting on the classical Mountain Car domain. The results are averaged over multiple random seeds, and the shade indicates standard error.

4 Acquiring Samples From Temporal Difference Error-based Sampling Distribution on Continuous Domains

In this section, we propose a method for sampling states: 1) that are not restricted to visited ones; and 2) with probability proportional to the expected TD error magnitude, where the probability is computed according to the Q-function parameters at the current time step (which would typically require computing the priorities of all samples at each time step). We start with the following theorem. We denote by $P^\pi(\cdot|s)$ the transition probability given a policy $\pi$.

Theorem 3.

Sampling method. Given the state $s$, let $v^\pi$ be a differentiable value function under policy $\pi$, parameterized by $\theta$. Define the expected TD error magnitude $y(s) = \mathbb{E}_{s' \sim P^\pi(\cdot|s)}\!\left[\,|\delta(s, s')|\,\right]$, where the TD error is $\delta(s, s') = r + \gamma v^\pi(s') - v^\pi(s)$. Given some initial state $s_0$, define the state sequence $\{s_i\}$ as the one generated by the state updating rule
$$s_{i+1} = s_i + \alpha \nabla_s \log y(s_i) + X_i,$$
where $\alpha$ is a sufficiently small stepsize and $X_i$ is a Gaussian random variable with zero mean and a sufficiently small variance. Then the sequence $\{s_i\}$ converges to the distribution $p(s) \propto y(s)$.

The proof is a direct consequence of the convergent behavior of the Langevin dynamics stochastic differential equation (SDE) (L., 1996; Welling and Teh, 2011; Zhang et al., 2017). We include a brief discussion and background knowledge in Appendix A.4.
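To make the sampling scheme concrete, here is a minimal sketch of unadjusted Langevin dynamics (our own illustration; the target `y` here is a toy Gaussian density, not a TD-error estimate):

```python
import numpy as np

def langevin_sample(grad_log_y, s0, alpha=0.01, n_steps=200_000, seed=0):
    """Unadjusted Langevin dynamics: s <- s + alpha * grad log y(s) + N(0, 2*alpha).
    For small alpha, the chain's stationary distribution is approximately
    proportional to y(s)."""
    rng = np.random.default_rng(seed)
    s = float(s0)
    out = np.empty(n_steps)
    for i in range(n_steps):
        s = s + alpha * grad_log_y(s) + rng.normal(0.0, np.sqrt(2 * alpha))
        out[i] = s
    return out

# Sanity check with y(s) = exp(-s^2 / 2), so grad log y(s) = -s:
# the stationary distribution is (approximately) the standard normal.
samples = langevin_sample(lambda s: -s, s0=3.0)
burned = samples[20_000:]  # discard burn-in
```

In the paper's setting, `grad_log_y` would be replaced by the gradient of the log expected TD error magnitude, so the chain concentrates on high-error states while the injected noise keeps the visited set diverse.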

In practice, we can compute the state value estimate by $v(s) = \max_a Q_\theta(s, a)$, as suggested by Pan et al. (2019). In the case that a true environment model is not available, we compute an estimate $\hat{y}(s)$ of $y(s)$ with a learned model. Then at each time step $t$, states approximately following the distribution $p(s) \propto \hat{y}(s)$ can be generated by
$$s_{i+1} = s_i + \alpha \nabla_s \log \hat{y}(s_i) + X_i, \qquad (3)$$
where $X_i$ is a Gaussian random variable with zero mean and reasonably small variance. In implementation, observing that its gradient contribution is small, we opt to treat the next-state value as a constant given a state, without backpropagating through it. We provide, in the theorem below, an upper bound on the difference between the sampling distribution acquired with the true model and that with the learned model. We denote the transition probability distribution under policy $\pi$ and the true model as $P^\pi$, and that with the learned model as $\hat{P}^\pi$. Let $p$ and $\hat{p}$ be the convergent distributions described in Theorem 3 using the true and learned models respectively. Let $d_{TV}(\cdot,\cdot)$ be the total variation distance between two probability distributions. Define the model error $\epsilon_m = \max_s d_{TV}\!\left(P^\pi(\cdot|s), \hat{P}^\pi(\cdot|s)\right)$.

Theorem 4.

Assume: 1) the reward magnitude is bounded, and define $r_{\max}$ as its bound; 2) the largest model error for a single state is some small value: $\max_s d_{TV}\!\left(P^\pi(\cdot|s), \hat{P}^\pi(\cdot|s)\right) \le \epsilon_m$. Then $d_{TV}(p, \hat{p}) = O(\epsilon_m)$.

Please see Appendix A.5 for the proof.

Algorithmic details. We present the key details of our algorithm, called Dyna-TD (Temporal Difference error), in Algorithm 3 in Appendix A.6. The algorithm closely follows the previous hill climbing Dyna of Pan et al. (2019). At each time step, we run the updating rule (3) and record the states along the gradient trajectories to populate the search-control queue. During the planning stage, we sample states from the search-control queue and pair them with on-policy actions to get state-action pairs. We query the model at those state-action pairs for the corresponding next states and rewards, yielding hypothetical experiences in the form of $(s, a, s', r)$. We then mix those hypothetical experiences with real experiences from the ER buffer to form a mixed mini-batch to update the NN parameters.
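The search-control and mixing steps can be sketched as follows (our own minimal numpy sketch with hypothetical helper names, not the paper's Algorithm 3; `grad_log_td` stands in for the gradient used in the updating rule, and the model query that completes each hypothetical experience is elided):

```python
import numpy as np

rng = np.random.default_rng(0)

def populate_search_control(queue, s, grad_log_td, alpha=0.05, k=20, noise_std=0.01):
    """Run the Langevin-style updating rule from state s for k steps,
    recording every visited state into the search-control queue."""
    for _ in range(k):
        s = s + alpha * grad_log_td(s) + rng.normal(0.0, noise_std, size=s.shape)
        queue.append(s.copy())
    return s

def mixed_minibatch(search_control, er_buffer, batch_size, mix=0.5):
    """Mix hypothetical experiences (search-control states, to be paired with
    on-policy actions and completed by the model) with real experiences."""
    n_hyp = int(batch_size * mix)
    hyp = [search_control[i] for i in rng.choice(len(search_control), size=n_hyp)]
    real = [er_buffer[i] for i in rng.choice(len(er_buffer), size=batch_size - n_hyp)]
    return hyp + real

# Toy usage: pretend grad of log expected TD error is -s (error peak at the origin).
queue = []
populate_search_control(queue, s=np.zeros(2), grad_log_td=lambda s: -s)
batch = mixed_minibatch(queue, er_buffer=list(range(100)), batch_size=32)
```

The `mix` ratio controlling the share of hypothetical experiences per mini-batch is our assumption here; the paper simply describes a mixed mini-batch.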

Empirical verification of sampling distribution. We validate the efficacy of our sampling method by empirically examining the distance between the sampling distribution acquired by our gradient ascent rule (3) (denoted as $\hat{p}$) and the desired distribution $p^*$, computed by thorough priority updating of all states under the current parameters, on the GridWorld domain (Pan et al., 2019) (Figure 3(a)), where the probability density can be conveniently approximated by discretization. We record the change in distance while training our Algorithm 3. The distance between the sampling distribution of Prioritized ER and $p^*$ is also included for comparison. All these distributions are computed by normalizing visitation counts on the discretized GridWorld. We compute the distances to $p^*$ under two sensible weighting schemes: 1) on-policy weighting, which weights the per-state distance by the state visitation distribution, approximated by uniformly sampling states from a recency buffer; 2) uniform weighting, which weights all states equally. All details are in Appendix A.6.

Figure 3(b)(c) shows that our algorithm Dyna-TD, with either a true or an online learned model, maintains a significantly closer distance to the desired sampling distribution than PrioritizedER under both weighting schemes. Furthermore, despite a mismatch between the implementation and Theorem 3 (Dyna-TD may not run enough gradient steps to reach the stationary distribution), the induced sampling distribution is quite close to the one obtained by running long gradient trajectories (Dyna-TD-Long), which is closer to the theorem's setting. This indicates that we can reduce the time cost by lowering the number of gradient steps while keeping the sampling distribution similar.

(a) GridWorld
(b) on-policy weighting
(c) uniform weighting
Figure 3: (a) shows the GridWorld taken from Pan et al. (2019). The agent starts from the bottom-left and should learn to reach the top-right in as few steps as possible. (b)(c) show the distance change as a function of training steps under the two weighting schemes. The dashed line corresponds to our algorithm with an online learned model; the corresponding evaluation learning curve is in Figure 4(c). All results are averaged over multiple random seeds, and the shade indicates standard error.

5 Experiments

In this section, we empirically show that our algorithm achieves stable and consistent performance across different settings. We first present overall comparative performance on various benchmark domains. We then show that our algorithm Dyna-TD is more robust to environment noise than PrioritizedER. Last, we demonstrate the practical utility of our algorithm in an autonomous driving application. Note that Dyna-TD uses the same hill climbing parameter setting across all benchmark domains. We refer readers to Appendix A.6 for any missing details.

Baselines. We include the following baseline competitors. ER is DQN with a regular ER buffer without prioritized sampling. PrioritizedER uses a priority queue to store visited experiences, and each experience is sampled proportionally to its TD error magnitude; note that, following the original paper (Schaul et al., 2016), after each mini-batch update only the priorities of the samples in that mini-batch are updated. Dyna-Value (Pan et al., 2019) is the Dyna variant which performs hill climbing on the value function to acquire states to populate the search-control queue. Dyna-Frequency (Pan et al., 2020) is the Dyna variant which performs hill climbing on the norm of the gradient of the value function for the same purpose.

Overall Performance. Figure 4 shows the overall performance of the algorithms on Acrobot, CartPole, GridWorld (Figure 3(a)), and MazeGridWorld (Figure 4(g)). Our key observations are: 1) Dyna-Value and Dyna-Frequency may converge to a sub-optimal policy when using a large number of planning steps; 2) Dyna-Frequency performs clearly inconsistently across domains; 3) our algorithm performs best in most cases, and even with an online learned model it outperforms the others on most tasks; 4) in most cases, the model-based methods (the Dyna variants) significantly outperform the model-free methods.

Our interpretations of these observations are as follows. First, for Dyna-Value, consider the case where some states have high value but low TD error: value-based hill climbing may still frequently acquire those states, which wastes samples and incurs a sampling distribution bias that leads to a sub-optimal policy. This sub-optimality can be clearly observed on Acrobot, GridWorld and MazeGridWorld. Second, for Dyna-Frequency, as indicated by the original paper (Pan et al., 2020), the gradient or Hessian norm has very different numerical scales and depends strongly on the choice of function approximator and domain; the algorithm therefore requires a finely tuned parameter setting as the testing domain varies, which possibly explains its inconsistent performance across domains. Furthermore, the Hessian-gradient product can be expensive, and we observe that Dyna-Frequency takes much longer to run than the other Dyna variants. Third, since we fetch the same number of states during search-control for all Dyna variants, the superior performance of Dyna-TD indicates the utility of the samples acquired by our approach. Fourth, although each algorithm runs the same number of planning steps, the model-based algorithms perform significantly better; this indicates the benefit of leveraging the generalization power of the learned value function, whereas model-free methods can only utilize visited states.

(a) Acrobot
(b) Acrobot
(c) GridWorld
(d) GridWorld
(e) CartPole
(f) CartPole
(g) MazeGridWorld
(h) MazeGW
Figure 4: Evaluation learning curves on benchmark domains with different numbers of planning updates. The dashed line denotes Dyna-TD with an online learned model. All results are averaged over multiple random seeds. (g) shows MazeGridWorld (GW), taken from Pan et al. (2020); its learning curves are in (h).

Robustness to Noise. As a counterpart of the supervised learning experiment in Section 3, we show that our algorithm is more robust to increased noise variance than prioritized ER. Figure 5 shows the evaluation learning curves on Mountain Car under different numbers of planning steps and reward noise standard deviations. We identify three key observations. First, our algorithm's performance relative to PrioritizedER resembles that of Full-PrioritizedL2 relative to PrioritizedL2 in the supervised learning setting, where Full-PrioritizedL2 is more robust to target noise. Second, our algorithm achieves almost the same performance as Dyna-Frequency, which is claimed to be robust to noise by Pan et al. (2020). Last, as observed on other environments, all algorithms usually benefit from an increased number of planning steps; however, PrioritizedER and ER are clearly hurt by more planning steps in the presence of noise, which illuminates a limitation of model-free methods.

(a) plan steps 10
(b) plan steps 10
(c) plan steps 30
(d) plan steps 30
Figure 5: Evaluation learning curves on Mountain Car with different numbers of planning updates and different reward noise variances. At each time step, the reward is sampled from a Gaussian centered at the original reward; zero variance indicates a deterministic reward. All results are averaged over multiple random seeds.

Practical Utility in an Autonomous Driving Application. We study the practical utility of our method in an autonomous driving application (Leurent, 2018) with an online learned model. As shown in Figure 6(a), we test on the roundabout-v0 domain, where the agent (the green car) should learn to go through a roundabout without collisions while maintaining as high a speed as possible. We emphasize that the domain is not difficult to train to a near-optimal policy; this can be seen from the previous work by Leurent et al. (2019), which shows that different algorithms achieve similar episodic return. However, we observe a significantly lower number of car crashes with the policy learned by our algorithm, as shown in Figure 6(b). This coincides with our intuition: a crash should incur high temporal difference error, and our method of actively searching for such states by gradient ascent gives the agent sufficient training on them during the planning stage, so the agent can handle these scenarios better than model-free methods.

(a) roundabout-v0
(b) Number of car crashes
Figure 6: (a) shows the roundabout domain. (b) shows the corresponding evaluation learning curves in terms of the number of car crashes as a function of driving time steps. The results are averaged over multiple random seeds, and the shade indicates standard error.

6 Discussion

In this work, we provide a theoretical reason for why prioritized ER can improve sample efficiency. We identify two crucial factors for it to be effective: sample space coverage and thorough priority updating. We then propose to sample states by Langevin dynamics and conduct experiments showing the efficacy of our method. Interesting future directions include: 1) studying the effect of model error on sample efficiency with our search-control; 2) applying our method with a feature-to-feature model, which can improve its scalability.

7 Broader Impact Discussion

This work concerns methodology for efficiently sampling hypothetical experiences in model-based reinforcement learning. Its likely impact is a further improvement in the sample efficiency of reinforcement learning methods, which should be generally beneficial to the reinforcement learning research community. We have not considered specific applications or scenarios as the goal of this work.


  • M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, and et al. (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Software available from Cited by: §A.6.
  • G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) OpenAI Gym. Note: arXiv:1606.01540 Cited by: §A.6.2, §3.2.
  • T. Chiang, C. Hwang, and S. J. Sheu (1987) Diffusion for global optimization in $\mathbb{R}^n$. SIAM Journal on Control and Optimization, pp. 737–753. Cited by: §A.4.
  • D. S. Corneil, W. Gerstner, and J. Brea (2018) Efficient model-based deep reinforcement learning with variational state tabulation. In International Conference on Machine Learning, pp. 1049–1058. Cited by: §A.1, §1.
  • T. de Bruin, J. Kober, K. Tuyls, and R. Babuska (2018) Experience selection in deep reinforcement learning for control. Journal of Machine Learning Research. Cited by: §1.
  • A. Durmus and E. Moulines (2017) Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. The Annals of Applied Probability, pp. 1551–1587. Cited by: §A.4.
  • H. Fanaee-T and J. Gama (2013) Event labeling combining ensemble detectors and background knowledge. Progress in Artificial Intelligence, pp. 1–15. Cited by: §A.7.
  • X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, Cited by: §A.6.2.
  • A. Goyal, P. Brakel, W. Fedus, S. Singhal, T. Lillicrap, S. Levine, H. Larochelle, and Y. Bengio (2019) Recall traces: backtracking models for efficient reinforcement learning. In International Conference on Learning Representations, Cited by: §A.1, §1.
  • S. Gu, T. P. Lillicrap, I. Sutskever, and S. Levine (2016) Continuous deep Q-learning with model-based acceleration. In International Conference on Machine Learning, pp. 2829–2838. Cited by: §A.1, §1.
  • D. Ha and J. Schmidhuber (2018) Recurrent world models facilitate policy evolution. Advances in Neural Information Processing Systems, pp. 2450–2462. Cited by: §1.
  • G. Z. Holland, E. Talvitie, and M. Bowling (2018) The effect of planning shape on dyna-style planning in high-dimensional state spaces. CoRR abs/1806.01825. Cited by: §1, §1.
  • M. Janner, J. Fu, M. Zhang, and S. Levine (2019) When to trust your model: model-based policy optimization. Advances in Neural Information Processing Systems, pp. 12519–12530. Cited by: §A.1, §1.
  • D. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. International Conference on Learning Representations. Cited by: §A.6.
  • G. O. Roberts and R. L. Tweedie (1996) Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, pp. 341–363. Cited by: §A.4, §4.
  • E. Leurent, Y. Blanco, D. Efimov, and O. Maillard (2019) Approximate robust control of uncertain dynamical systems. CoRR abs/1903.00220. External Links: 1903.00220 Cited by: §A.6.2, §5.
  • E. Leurent (2018) An environment for autonomous driving decision-making. GitHub. Cited by: §5.
  • L. Lin (1992) Self-Improving Reactive Agents Based On Reinforcement Learning, Planning and Teaching.. Machine Learning. Cited by: §1.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis (2015) Human-level control through deep reinforcement learning. Nature. Cited by: §2.
  • A. W. Moore and C. G. Atkeson (1993) Prioritized sweeping: reinforcement learning with less data and less time. Machine learning, pp. 103–130. Cited by: §A.1, §1.
  • Y. Pan, J. Mei, and A. Farahmand (2020) Frequency-based search-control in dyna. In International Conference on Learning Representations, Cited by: §A.6.2, §A.6.2, §1, §2, §3.2, Figure 4, §5, §5, §5, footnote 5.
  • Y. Pan, H. Yao, A. Farahmand, and M. White (2019) Hill climbing on value estimates for search-control in dyna. In International Joint Conference on Artificial Intelligence, Cited by: §A.1, §A.6.2, §1, §2, Figure 3, §4, §4, §4, §5, footnote 1.
  • Y. Pan, M. Zaheer, A. White, A. Patterson, and M. White (2018) Organizing experience: a deeper look at replay mechanisms for sample-based planning in continuous state domains. In International Joint Conference on Artificial Intelligence, pp. 4794–4800. Cited by: §A.1, §1, §1.
  • T. Schaul, J. Quan, I. Antonoglou, and D. Silver (2016) Prioritized Experience Replay. In International Conference on Learning Representations, Cited by: §A.6.1, §1, §3.2, §5.
  • M. Schlegel, W. Chung, D. Graves, J. Qian, and M. White (2019) Importance resampling for off-policy prediction. Advances in Neural Information Processing Systems 32, pp. 1799–1809. Cited by: §1.
  • R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. Second edition, The MIT Press. Cited by: §A.1, §3.2.
  • R. S. Sutton, C. Szepesvári, A. Geramifard, and M. Bowling (2008) Dyna-style planning with linear function approximation and prioritized sweeping. In UAI, pp. 528–536. Cited by: §A.1, §1.
  • R. S. Sutton (1990) Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In International Conference on Machine Learning, Cited by: §A.1.
  • R. S. Sutton (1991) Integrated modeling and control based on reinforcement learning and dynamic programming. In Advances in Neural Information Processing Systems, Cited by: §1, §2.
  • H. P. van Hasselt, M. Hessel, and J. Aslanides (2019) When to use parametric models in reinforcement learning?. In Advances in Neural Information Processing Systems, pp. 14322–14333. Cited by: §1.
  • H. van Seijen and R. S. Sutton (2015) A deeper look at planning as learning from replay. In International Conference on Machine Learning, pp. 2314–2322. Cited by: §1.
  • C. J. C. H. Watkins and P. Dayan (1992) Q-learning. Machine Learning, pp. 279–292. Cited by: §2.
  • M. Welling and Y. W. Teh (2011) Bayesian learning via stochastic gradient Langevin dynamics. In International Conference on Machine Learning, pp. 681–688. Cited by: §4.
  • Y. Zhang, P. Liang, and M. Charikar (2017) A hitting time analysis of stochastic gradient langevin dynamics. In Conference on Learning Theory, pp. 1980–2022. Cited by: §4.

Appendix A Appendix

In Section A.1, we introduce background on the Dyna architecture. We provide the proof of Theorem 1 in Section A.2 and the full proof of Theorem 2 in Section A.3. We briefly discuss Langevin dynamics in Section A.4, and present the proof of Theorem 4 in Section A.5. Details for reproducible research are in Section A.6, and supplementary experimental results are in Section A.7.

A.1 Background in Dyna

Dyna integrates model-free and model-based policy updates in an online RL setting (Sutton, 1990). As shown in Algorithm 2, at each time step a Dyna agent uses the real experience to learn a model and to perform a model-free policy update. During the planning stage, simulated experiences are acquired from the model to further improve the policy. Note that planning refers to any computational process that leverages a model to improve the policy (Sutton and Barto, 2018). The mechanism of generating the states or state-action pairs from which to query the model is called search-control, which is critical to sample efficiency. Many existing works (Moore and Atkeson, 1993; Sutton et al., 2008; Gu et al., 2016; Pan et al., 2018; Corneil et al., 2018; Goyal et al., 2019; Janner et al., 2019; Pan et al., 2019) report different levels of sample-efficiency improvement from different ways of generating hypothetical experiences during the planning stage.

  Initialize $Q(s, a)$; initialize model $\mathcal{M}(s, a)$, $\forall s, a$
  while true do
     observe $s$, take action $a$ by $\epsilon$-greedy w.r.t $Q(s, \cdot)$
     execute $a$, observe reward $r$ and next state $s'$
     Q-learning update for $Q(s, a)$
     update model $\mathcal{M}(s, a)$ (i.e. by counting)
     store $(s, a)$ into search-control queue
     for $i = 1:d$ do
        sample $(\tilde{s}, \tilde{a})$ from search-control queue
        $(\tilde{s}', \tilde{r}) \leftarrow \mathcal{M}(\tilde{s}, \tilde{a})$ // simulated transition
        Q-learning update for $Q(\tilde{s}, \tilde{a})$ // planning update
Algorithm 2 Tabular Dyna
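Algorithm 2 can be sketched as a short Python program. This is a minimal tabular Dyna-Q; the environment interface `env_step(s, a) -> (r, s', done)`, the deterministic memorized model, and all hyperparameters are illustrative assumptions, not the paper's implementation:

```python
import random
from collections import defaultdict

def dyna_q(env_step, n_actions, episodes=100, d=10,
           alpha=0.1, gamma=0.95, eps=0.1, seed=0):
    """Minimal tabular Dyna-Q in the style of Algorithm 2.

    env_step(s, a) -> (reward, next_state, done). The model simply
    memorizes the last observed outcome of each (s, a) pair.
    """
    rng = random.Random(seed)
    Q = defaultdict(float)   # action values Q[(s, a)]
    model = {}               # model[(s, a)] = (r, s', done)
    queue = []               # search-control queue of visited (s, a)

    def greedy(s):           # greedy action with random tie-breaking
        qs = [Q[(s, a)] for a in range(n_actions)]
        best = max(qs)
        return rng.choice([a for a, q in enumerate(qs) if q == best])

    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = rng.randrange(n_actions) if rng.random() < eps else greedy(s)
            r, s2, done = env_step(s, a)
            target = r if done else r + gamma * Q[(s2, greedy(s2))]
            Q[(s, a)] += alpha * (target - Q[(s, a)])  # direct RL update
            model[(s, a)] = (r, s2, done)              # model learning
            queue.append((s, a))                       # search-control
            for _ in range(d):                         # planning updates
                ps, pa = rng.choice(queue)
                pr, ps2, pdone = model[(ps, pa)]
                ptarget = pr if pdone else pr + gamma * Q[(ps2, greedy(ps2))]
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s2
    return Q
```

On a small deterministic chain, the d planning updates per real step propagate the terminal reward backward far faster than the model-free updates alone would.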

A.2 Proof for Theorem 1

Theorem 1. For a constant determined by , we have


The proof is very intuitive. The expected gradient of the uniform sampling method is

Setting completes the proof. ∎
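The computation behind this equivalence can be written out explicitly. This is a sketch in generic notation, where $\delta_i$ denotes the TD error of sample $i$ and priorities are taken proportional to $|\delta_i|$; the constant $c$ is the ratio identified below:

```latex
\mathbb{E}_{i \sim p}\!\left[\nabla_\theta \tfrac{1}{2}\delta_i^2\right]
  = \sum_{i=1}^{n} \frac{|\delta_i|}{\sum_{j=1}^{n}|\delta_j|}\, \delta_i \nabla_\theta \delta_i
  = \underbrace{\frac{n}{\sum_{j=1}^{n}|\delta_j|}}_{=:\,c}
    \cdot \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta \tfrac{1}{3}|\delta_i|^3 ,
\qquad \text{since } \nabla_\theta \tfrac{1}{3}|\delta|^3 = |\delta|\,\delta\,\nabla_\theta \delta .
```

That is, sampling proportionally to $|\delta|$ while minimizing the squared loss matches, in expectation and up to the constant $c$, uniform sampling on the cubic loss.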

A.3 Proof for Theorem 2

Theorem 2. Consider the following two objectives: , and . Denote , and . Define the functional gradient flow updates on these two objectives:


Given error threshold , define the hitting time and . For any initial function value s.t. , such that .


For the gradient flow update on the objective, we have,


which implies,


Integrating, we have,


which is equivalent to (letting ),


On the other hand, for the gradient flow update on the objective, we have,


which implies,


Integrating, we have,


which is equivalent to (letting ),


Then we have,


Define the function is continuous and . We have , and is monotonically increasing for and monotonically decreasing for .

Given , we have . Using the intermediate value theorem for on , there exists such that . Since is monotonically increasing on and monotonically decreasing on , for any , we have . (Note that by the design of the gradient descent updating rule; if the two are equal, the result holds trivially.) Hence we have,

Remark 1. Figure 7 shows the function . Fixing an arbitrary , there is another root s.t. . However, there is no real-valued closed-form solution for . The solution in is , where is the Wright omega function. Hence, finding the exact value of would require defining an ordering on the complex plane. Our current theorem statement is sufficient for the purpose of characterizing the convergence rate: it states that for any desired low error level , minimizing the square loss converges more slowly than the cubic loss.

Figure 7: The function . The function reaches maximum at .

A.4 Discussion on the Langevin Dynamics

Define the SDE $dX_t = -\nabla U(X_t)\,dt + \sqrt{2}\,dB_t$, where $B_t$ is a $d$-dimensional Brownian motion and $U$ is a continuously differentiable function. The Langevin diffusion $X_t$ converges to a unique invariant distribution $p(x) \propto \exp(-U(x))$ (Chiang et al., 1987). By applying the Euler-Maruyama discretization scheme to the SDE, we acquire the discretized version $x_{k+1} = x_k - \eta_k \nabla U(x_k) + \sqrt{2\eta_k}\,\xi_k$, where $\{\xi_k\}$ is an i.i.d. sequence of standard $d$-dimensional Gaussian random vectors and $\{\eta_k\}$ is a sequence of step sizes. It has been proved that the limiting distribution of the sequence $\{x_k\}$ converges to the invariant distribution of the underlying SDE (Roberts and Tweedie, 1996; Durmus and Moulines, 2017). As a result, taking $U$ to be the negative logarithm of the (unnormalized) priority function, so that the invariant distribution is proportional to the priority, completes the proof for Theorem 3.
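For concreteness, the discretized update can be sketched as follows. This is a minimal unadjusted Langevin sampler for a generic potential; the quadratic potential used in the example is an illustrative assumption, not the search-control potential itself, which in our method is defined by the learned priorities:

```python
import numpy as np

def langevin_chain(grad_U, x0, n_steps=20000, step=0.05, seed=0):
    """Unadjusted Langevin algorithm: the Euler-Maruyama discretization
    x_{k+1} = x_k - step * grad_U(x_k) + sqrt(2 * step) * xi_k,
    with xi_k ~ N(0, I), whose limiting distribution approximates
    p(x) proportional to exp(-U(x))."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    out = np.empty((n_steps,) + x.shape)
    for k in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x - step * grad_U(x) + np.sqrt(2.0 * step) * noise
        out[k] = x
    return out

# Illustrative potential U(x) = ||x||^2 / 2, so grad_U(x) = x and the
# invariant distribution is a standard Gaussian.
samples = langevin_chain(lambda x: x, np.zeros(2))
```

After a burn-in period, the empirical mean and variance of the chain should be close to those of the standard Gaussian, up to the discretization bias of the fixed step size.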

A.5 Proof for Theorem 4

We now provide the error bound for Theorem 4. We denote the transition probability distribution under policy with the true model as ; denote that with the learned model as . Let and be the convergent distributions described in Theorem 3 by using true model and learned model respectively. Let be the total variation distance between two probability distributions. Define . Then we have the following bound.

Theorem 4. Assume: 1) the reward magnitude is bounded and define ; 2) the largest model error for a single state is some small value: . Then .


First, we bound the estimated temporal difference error. Fix an arbitrary state ; it is sufficient to consider the case , then

We then take into account the normalizer of the Gibbs distribution. Consider the case first.

This corresponds to the second term in the maximum operation. The first term corresponds to the case . This completes the proof. ∎

A.6 Reproducible Research

Our implementations are based on TensorFlow (Abadi et al., 2015). We use the Adam optimizer (Kingma and Ba, 2014) for all experiments.

A.6.1 Reproducing experiments before Section 5

Supervised learning experiment.

For the supervised learning experiment in Section 3, we use a neural network with tanh units, with the learning rate swept from the same range for all algorithms. For the cubic loss, we compute the constant specified in Theorem 1 at each time step. We compute the testing error every iterations/mini-batch updates, and our evaluation learning curves are plotted by averaging over random seeds. For each random seed, we randomly split the dataset into a training set and a testing set, where the testing set has k data points. Note that the testing set is not noise-contaminated.
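The contrast between the two objectives in this experiment can be illustrated numerically. This is a minimal numpy sketch with function names of our own choosing, not the paper's implementation; it shows that the cubic-loss gradient weights each sample's squared-loss gradient by that sample's own absolute error, mirroring prioritized sampling with priority proportional to the error magnitude:

```python
import numpy as np

def square_loss_grad(pred, y):
    # gradient w.r.t. pred of mean (1/2) * (pred - y)^2
    d = pred - y
    return d / len(d)

def cubic_loss_grad(pred, y):
    # gradient w.r.t. pred of mean (1/3) * |pred - y|^3,
    # i.e. |d| * d / n: each squared-loss gradient term is
    # re-weighted by its own absolute error |d|.
    d = pred - y
    return np.abs(d) * d / len(d)
```

For two samples with errors of magnitude 1 and 3, the squared loss weights their gradients 1:3, while the cubic loss weights them 1:9, concentrating updates on the high-error sample.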

Reinforcement Learning experiments in Section 3.

We use a particularly small neural network to highlight the issue of incomplete priority updating. Intuitively, a large neural network may be able to memorize each state's value, so updating one state's value is less likely to affect others. With a small network, complete priority updating for all states should be very important. We set the maximum ER buffer size to k and the mini-batch size to . The learning rate is , and the target network is updated every k steps.

Distribution distance computation in Section 4.

We now provide the implementation details for Figure 3. The distance is estimated as follows. First, to compute the desired sampling distribution, we discretize the domain into grids and calculate the absolute TD error of each grid (represented by its bottom-left vertex coordinates) using the true environment model and the current learned Q-function. We then normalize these priorities to get the probability distribution . This distribution is considered the desired one, since we have access to all states across the state space, with priorities computed by the current Q-function at each time step. Second, we estimate our sampling distribution by randomly sampling k states from the search-control queue, counting the number of states falling into each discretized grid, and normalizing these counts to get . Third, for comparison, we estimate the sampling distribution of conventional prioritized ER (Schaul et al., 2016) by sampling k states from the prioritized ER buffer, counting the states falling into each grid, and normalizing the counts to obtain the corresponding distribution. Then we compute the distances of