1 Introduction
Using hypothetical experience simulated from an environment model can significantly improve the sample efficiency of RL agents (Ha and Schmidhuber, 2018; Holland et al., 2018; Pan et al., 2018; van Hasselt et al., 2019). Dyna (Sutton, 1991) is a classical MBRL architecture in which the agent uses real experience to update its policy as well as its reward and dynamics models. In between taking actions, the agent can simulate hypothetical experience from the model to further improve the policy.
An important question for effective Dyna-style planning is search-control: from what states should the agent simulate hypothetical transitions? On each planning step in Dyna, the agent has to select a state and action from which to query the model for the next state and reward. This question, in fact, already arises in what is arguably the simplest variant of Dyna: Experience Replay (ER) (Lin, 1992). In ER, visited transitions are stored in a buffer and, at each time step, a mini-batch of experiences is sampled to update the value function. ER can be seen as an instance of Dyna, using a (limited) non-parametric model given by the buffer (see van Seijen and Sutton (2015) for a deeper discussion). Performance can be significantly improved by sampling proportionally to priorities based on errors, as in prioritized ER (Schaul et al., 2016; de Bruin et al., 2018), as well as by specialized sampling for the off-policy setting (Schlegel et al., 2019). Search-control strategies in Dyna similarly often rely on priorities, though they can be more flexible in leveraging the model rather than being limited to retrieving visited experiences. For example, a model enables the agent to sweep backwards by generating predecessors, as in prioritized sweeping (Moore and Atkeson, 1993; Sutton et al., 2008; Pan et al., 2018; Corneil et al., 2018). Other methods have tried alternatives to error-based prioritization, such as searching for states with high reward (Goyal et al., 2019), high value (Pan et al., 2019), or states that are difficult to learn (Pan et al., 2020). Another strategy has been to generate a more diverse set of states from which to sample (Gu et al., 2016; Holland et al., 2018), or to modulate the distance of such states from real experience (Janner et al., 2019). These methods are all supported by appealing intuitions, but as yet lack solid theoretical reasons for why they improve sample efficiency.
In this work, we provide new insights into how to choose the sampling distribution over states from which we generate hypothetical experience. In particular, we theoretically motivate why error-based prioritization is effective, and provide a mechanism to generate states according to more accurate error estimates. We first prove that regression with error-based prioritized sampling is equivalent to minimizing a cubic objective with uniform sampling in an ideal setting. We then show that minimizing the cubic objective yields a faster convergence rate during the early stage of learning, providing theoretical motivation for error-based prioritization. We point out that this ideal setting is hard to achieve in practice using only ER, due to two issues: insufficient sample-space coverage and outdated priorities. Hence we propose a search-control strategy in Dyna that leverages a model to simulate errors and to find states with high expected error. Finally, we demonstrate the efficacy of our method on various benchmark domains and an autonomous driving application.

2 Problem Formulation
We formalize the problem as a Markov Decision Process (MDP), a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$ including state space $\mathcal{S}$, action space $\mathcal{A}$, probability transition kernel $P(\cdot|s,a)$, reward function $R$, and discount rate $\gamma \in [0, 1]$. At each environment time step $t$, an RL agent observes a state $s_t$ and takes an action $a_t$. The environment transitions to the next state $s_{t+1} \sim P(\cdot|s_t, a_t)$ and emits a scalar reward signal $r_{t+1}$. A policy $\pi(\cdot|s)$ is a mapping that determines the probability of choosing an action at a given state. The agent's objective is to find an optimal policy. A popular algorithm is Q-learning (Watkins and Dayan, 1992), where parameterized action-values $Q_\theta$ are updated using $\theta \leftarrow \theta + \alpha \delta_t \nabla_\theta Q_\theta(s_t, a_t)$ for stepsize $\alpha > 0$ with TD error $\delta_t := r_{t+1} + \gamma \max_{a'} Q_\theta(s_{t+1}, a') - Q_\theta(s_t, a_t)$. The policy is defined by acting greedily w.r.t. these action-values. ER is critical when using neural networks to estimate $Q_\theta$
, as used in DQN (Mnih et al., 2015), both to stabilize and to speed up learning. MBRL has the potential to provide even further sample-efficiency improvements. We build on the Dyna formalism (Sutton, 1991) for MBRL, and more specifically on the recently proposed HC-Dyna (Pan et al., 2019), shown in Algorithm 1. Its distinguishing feature is Hill Climbing (HC) search-control (the term Hill Climbing is used for generality, as the vanilla gradient-ascent procedure is modified to resolve certain challenges (Pan et al., 2019)): the mechanism of generating states or state-action pairs from which to query the model for next states and rewards (i.e., hypothetical experiences). States are generated by hill climbing on some criterion function $c(s)$: $c$ is the value function in Pan et al. (2019) and the gradient magnitude in Pan et al. (2020). The former is used as a measure of the "importance" of states, and the latter as a measure of value-approximation difficulty.
These hypothetical transitions are treated just like real transitions: HC-Dyna combines both real and hypothetical experience into mini-batch updates. These updates, performed before taking the next action, are called planning updates, as they improve the action-value estimates, and hence the policy, using the model.
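To make the search-control step concrete, the following is a minimal sketch of hill-climbing-based state generation. The function name, the noise term, and the toy quadratic criterion in the usage note are our own illustrative choices, not the paper's exact procedure; `c_grad` stands in for the gradient of whatever criterion $c(s)$ is used.

```python
import numpy as np

def hill_climb_search_control(c_grad, s0, n_steps=20, step_size=0.1,
                              noise_std=0.01, rng=None):
    """Populate a search-control queue by (noisy) hill climbing on a
    criterion c(s), given a callable c_grad returning its gradient."""
    rng = rng or np.random.default_rng(0)
    queue, s = [], np.asarray(s0, dtype=float)
    for _ in range(n_steps):
        # ascend the criterion, with small Gaussian perturbations
        s = s + step_size * c_grad(s) + rng.normal(0.0, noise_std, size=s.shape)
        queue.append(s.copy())
    return queue
```

For example, with the toy criterion $c(s) = -\|s - s^*\|^2$ (gradient $-2(s - s^*)$), the generated states climb toward $s^*$.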
3 A Deeper Look at Error-based Prioritized Sampling
In this section, we provide theoretical motivation for error-based prioritized sampling and empirically investigate several insights highlighted by this theory. We show that prioritized sampling can be reformulated as optimizing a cubic power objective with uniform sampling, and we prove that optimizing the cubic objective provides a faster convergence rate during early learning. Based on these results, we point out that for error-based prioritization, such as prioritized ER, to manifest its advantages, it needs up-to-date priorities and sufficient coverage of the sample space, and we empirically highlight the issues that arise when these two properties do not hold.
3.1 Prioritized Sampling as a Cubic Objective
In regression, we minimize the mean squared error $\min_\theta \frac{1}{2n}\sum_{i=1}^n (f_\theta(x_i)-y_i)^2$, for training set $\{(x_i, y_i)\}_{i=1}^n$ and function approximator $f_\theta$, such as a neural network. In error-based prioritized sampling, we define the priority of a sample $(x_i, y_i)$ as $|f_\theta(x_i)-y_i|$; the probability of drawing a sample is then proportional to its priority. We employ the following form to compute the probabilities:
$$q_i = \frac{|f_\theta(x_i)-y_i|}{\sum_{j=1}^n |f_\theta(x_j)-y_j|} \qquad (1)$$
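As a concrete illustration, the probabilities in (1) and the resulting mini-batch sampling can be written as follows (a minimal sketch; the function and variable names are our own):

```python
import numpy as np

def priority_probs(preds, targets):
    """Probability of drawing each sample, proportional to its
    absolute error (Eq. 1)."""
    err = np.abs(preds - targets)
    return err / err.sum()

def sample_prioritized(preds, targets, batch_size, rng):
    """Draw a mini-batch of indices from the prioritized distribution."""
    q = priority_probs(preds, targets)
    return rng.choice(len(q), size=batch_size, p=q)
```

Samples with zero error receive zero probability, and high-error samples are drawn disproportionately often.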
We can show an equivalence between the expected gradient of the squared objective under this prioritization and that of the cubic power objective $\min_\theta \frac{1}{3n}\sum_{i=1}^n |f_\theta(x_i)-y_i|^3$ under uniform sampling. See Appendix A.2 for the proof.
Theorem 1.
For a constant $c$ determined by $\theta$ (given explicitly in the proof), we have
$$c\,\mathbb{E}_{i\sim q}\!\left[\nabla_\theta \tfrac{1}{2}\big(f_\theta(x_i)-y_i\big)^2\right] = \mathbb{E}_{i\sim \mathrm{unif}}\!\left[\nabla_\theta \tfrac{1}{3}\big|f_\theta(x_i)-y_i\big|^3\right].$$
This simple theorem provides an intuitive reason for why prioritized sampling can improve sample efficiency: the gradient of the cubic function is sharper than that of the square function when the error is relatively large (Figure 1). Theorem 2 further characterizes the difference between the convergence rates of optimizing the mean square error and the cubic power objective.
Theorem 2 (Fast early learning).
Consider the following two objectives: $\ell_2(y) := \frac{1}{2}(y-y^*)^2$ and $\ell_3(y) := \frac{1}{3}|y-y^*|^3$. Denote by $y_2(t)$ and $y_3(t)$ the respective solutions, and by $u_2(t) := y_2(t)-y^*$ and $u_3(t) := y_3(t)-y^*$ the corresponding errors. Define the functional gradient flow updates on these two objectives:
$$\dot{y}_2(t) = -\nabla \ell_2(y_2(t)), \qquad \dot{y}_3(t) = -\nabla \ell_3(y_3(t)). \qquad (2)$$
Given an error threshold $\epsilon > 0$, define the hitting times $t_2(\epsilon) := \min\{t : |u_2(t)| \le \epsilon\}$ and $t_3(\epsilon) := \min\{t : |u_3(t)| \le \epsilon\}$. For any initial value with $|u_2(0)| = |u_3(0)| = u_0 > 1$, $\exists\, \epsilon_0 \in (0, 1)$ such that $t_3(\epsilon) < t_2(\epsilon)$ for all $\epsilon \in (\epsilon_0, u_0)$.²

²Finding the exact value of $\epsilon_0$ would require a definition of ordering on the complex plane, as the closed-form solution involves the Wright omega function. Our theorem statement is sufficient for the purpose of characterizing the convergence rate.
Proof.
Please see Appendix A.3 for the full proof. Given the same $\epsilon$ and the same initial error $u_0$, we first derive closed forms for the hitting times $t_2(\epsilon)$ and $t_3(\epsilon)$, then analyze the condition on $\epsilon$ under which $t_2(\epsilon) > t_3(\epsilon)$, i.e., when minimizing the square error is slower than minimizing the cubic error. ∎
The above theorem says that when the initial error is relatively large, it is faster to reach a given low-error level with the cubic objective. We can test this in simulation, with the following minimization problems: $\min_y \frac{1}{2}(y-y^*)^2$ and $\min_y \frac{1}{3}|y-y^*|^3$. We use the hitting-time formulae derived in the proof to compute the hitting-time ratio $t_2(\epsilon)/t_3(\epsilon)$ under different initial values $u_0$ and final error thresholds $\epsilon$. In Figure 1(c)(d), we see that the cubic loss usually takes significantly less time to reach a given $\epsilon$ for various values of $u_0$.
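The comparison can also be reproduced numerically. The sketch below (our own toy code, with $u_0 = 5$ and $\epsilon = 0.5$ as arbitrary illustrative values) integrates both gradient flows with a simple Euler scheme and matches the closed-form hitting times $t_2 = \ln(u_0/\epsilon)$ and $t_3 = 1/\epsilon - 1/u_0$ from the proof:

```python
import numpy as np

def hitting_time(power, u0=5.0, eps=0.5, dt=1e-4, t_max=50.0):
    """First time the gradient-flow error du/dt = -sign(u)|u|^(power-1)
    drops below eps, computed by Euler integration."""
    u, t = u0, 0.0
    while abs(u) > eps and t < t_max:
        u -= dt * np.sign(u) * abs(u) ** (power - 1)
        t += dt
    return t
```

For $u_0 = 5$ and $\epsilon = 0.5$, this gives $t_2 \approx \ln 10 \approx 2.30$ and $t_3 \approx 1.8$: the cubic flow hits the threshold first, consistent with Theorem 2.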
3.2 Empirical Demonstrations
In this section, we empirically show: 1) the practical performance of the cubic objective; 2) the importance of sufficient sample-space coverage and of updating the priorities of all training samples; 3) the reasons why higher-power objectives should not be preferred in general. We refer readers to Appendix A.6 for missing details and to Appendix A.7 for additional experiments.
We conduct experiments on a supervised learning task. We use the dataset from Pan et al. (2020), where it is shown that the high-frequency region of the target function is the main source of prediction error. Hence we expect prioritized sampling to make a clear difference in sample efficiency. We generate a training set by uniformly sampling the input domain and adding zero-mean Gaussian noise with standard deviation $\sigma$ to the (piecewise-defined) target values. The testing set contains k samples, and its targets are not noise-contaminated. We compare the following algorithms. L2: regression with uniform sampling from the training set. FullPrioritizedL2: regression with prioritized sampling according to the distribution defined in (1); the priorities of all samples in the training set are updated after each mini-batch update. PrioritizedL2: the only difference from FullPrioritizedL2 is that only the priorities of the training examples sampled in the mini-batch are updated at each iteration; the rest of the training samples keep their original priorities. Note that this resembles what vanilla Prioritized ER does in the RL setting (Schaul et al., 2016). Cubic: minimizing the cubic objective with uniform sampling. Power4: minimizing the fourth-power objective with uniform sampling; we include it to show that there is almost no gain, and performance may even degrade, from using higher powers.
We use tanh-layer neural networks for all algorithms and optimize the learning rate over a range of candidates. Figure 2(a)-(d) show the learning curves, in terms of testing error, for all the above algorithms under various settings. (We show the testing error as it is what ultimately matters; the training error shows similar comparative performance and is presented in the appendix, where we also include additional results with different settings.) We identify five important observations: 1) with a small mini-batch size, there is a significant difference between FullPrioritizedL2 and Cubic; 2) with an increased mini-batch size, although all algorithms perform better, Cubic achieves the largest improvement and its behavior tends to approximate the prioritized-sampling algorithm; 3) as shown in Figure 2(a), prioritized sampling does not show an advantage when the training set is small; 4) prioritized sampling without updating all priorities can be significantly worse than vanilla (uniform-sampling) regression; 5) when the noise standard deviation is increased, all algorithms perform worse, and the higher the power of the objective, the more it is hurt.
The importance of sample space coverage. Observations 1) and 2) show that the high-power objective has to use a much larger mini-batch size to achieve performance comparable with prioritized sampling. Though this coincides with Theorem 1, in that the two algorithms are equivalent in expectation, prioritized sampling appears robust to small mini-batches, which is an advantage in stochastic gradient methods. A possible reason is that prioritized sampling can immediately draw many samples from high-error regions, while uniform sampling draws few such samples with a limited mini-batch size. This motivates us to test prioritized sampling with a small training set, where both algorithms get fewer samples everywhere. Figures 2(a) and (b) together indicate that prioritized sampling needs sufficient samples across the sample space to maintain its advantage. This requirement is intuitive, but it illuminates an important limitation of prioritized ER in RL: only visited experiences stored in the ER buffer can be sampled. If the state space is large, the buffer likely covers only a small subset of it, which corresponds to a very small training set.
Thorough priority updating. Observation 4) reminds us of the importance of using an up-to-date sampling distribution at each time step. Outdated priorities change the sampling distribution in an unpredictable manner, and learning performance can suffer. Though the effect of updating the priorities of all samples versus only some is intuitive, it has received little attention in the existing RL literature. We further verify this phenomenon on the classical Mountain Car domain (Sutton and Barto, 2018; Brockman et al., 2016). Figure 2(e) shows the evaluation learning curves of variants of Deep Q-Networks (DQN) corresponding to the supervised learning algorithms. We use a small ReLU NN as the $Q$ function. We expect a small NN to highlight the issue of priority updating: every mini-batch update potentially perturbs the values of many other states, so without thorough priority updating many experiences in the ER buffer likely carry wrong priorities. One can see that FullPrioritizedER significantly outperforms the vanilla PrioritizedER algorithm, which only updates priorities for the sampled mini-batch at each time step. However, updating the priorities of all samples in the ER buffer at each time step is usually computationally too expensive and does not scale with the number of visited samples.
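The difference between the two schemes reduces to which entries of the priority array are refreshed after a parameter update. A minimal sketch (our own naming, not the paper's implementation):

```python
import numpy as np

def refresh_priorities(old_priorities, abs_errors, batch_idx=None):
    """FullPrioritized: batch_idx=None refreshes every priority from the
    current errors. Vanilla Prioritized: only the sampled mini-batch
    indices are refreshed; all other entries keep their stale values."""
    if batch_idx is None:
        return abs_errors.copy()
    p = old_priorities.copy()
    p[batch_idx] = abs_errors[batch_idx]
    return p
```

With the partial scheme, any sample whose error changed as a side effect of the update, but which was not in the mini-batch, is sampled according to a stale priority at the next step.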
Regarding high power objectives. As discussed above, observations 1) and 2) tell us that a high-power objective requires a large mini-batch size (ideally a batch algorithm, i.e., the whole training set) to manifest its advantage in convergence rate. This makes the algorithm not easily scalable to large training sets. Observation 5) indicates another reason why a high-power objective should not be preferred: it amplifies the effect of noise in the target variables. In Figure 2(d), the Power4 objective suffers most from the increased target noise.
Figure 2: Testing error for the different objectives under varying mini-batch sizes, training-set sizes, and Gaussian noise variance added to the training targets. (a) uses a smaller training set (solid and dotted lines for the two sizes) than the others but the same testing-set size. (e) shows a corresponding experiment in the RL setting on the classical Mountain Car domain. Results on (a)-(d) and (e) are averaged over random seeds; the shading indicates standard error.
4 Acquiring Samples From a Temporal Difference Error-based Sampling Distribution on Continuous Domains
In this section, we propose a method to sample states: 1) that are not restricted to previously visited ones; 2) with probability proportional to the expected TD error magnitude, computed according to the function parameters at the current time step (which would typically require computing priorities of all samples at each time step). We begin with the following theorem, and denote by $P_\pi(s'|s)$ the transition probability given a policy $\pi$.
Theorem 3 (Sampling method).
Given the state variable $s$, let $v_\pi(\cdot)$ be a differentiable value function under policy $\pi$, parameterized by $\theta$. Define the expected bootstrap target $y(s) := \mathbb{E}[r + \gamma v_\pi(s') \mid s]$, and denote the TD error as $\delta(s) := y(s) - v_\pi(s)$. Given some initial state $s_0$, define the state sequence $\{s_i\}$ as the one generated by the state updating rule
$$s_{i+1} = s_i + \alpha \nabla_s \log |\delta(s_i)| + X_i,$$
where $\alpha$ is a sufficiently small stepsize and $X_i$ is a Gaussian random variable with a sufficiently small variance. Then the sequence $\{s_i\}$ converges to the distribution $p(s) \propto |\delta(s)|$.
The proof is a direct consequence of the convergent behavior of the Langevin dynamics stochastic differential equation (SDE) (Roberts and Tweedie, 1996; Welling and Teh, 2011; Zhang et al., 2017). We include a brief discussion and background knowledge in Appendix A.4.
In practice, we can compute the state-value estimate as $v(s) = \max_a Q_\theta(s, a)$, as suggested by Pan et al. (2019). In the case that a true environment model is not available, we compute an estimate $\hat\delta(s)$ of the TD error using a learned model. Then, at each time step $t$, states approximately following the distribution $p(s) \propto |\hat\delta(s)|$ can be generated by
$$s \leftarrow s + \alpha \nabla_s \log |\hat\delta(s)| + X \qquad (3)$$
where $X$ is a Gaussian random variable with zero mean and reasonably small variance. In implementation, observing that the stepsize is small, we opt to treat the TD-error magnitude as a constant given a state, without backpropagating through it. In the theorem below, we provide an upper bound on the difference between the sampling distribution acquired with the true model and that acquired with the learned model. We denote the transition probability distribution under policy $\pi$ and the true model as $P_\pi$, and that with the learned model as $\hat P_\pi$. Let $p$ and $\hat p$ be the convergent distributions described in Theorem 3 using the true and learned models, respectively. Let $d_{TV}(\cdot, \cdot)$ be the total variation distance between two probability distributions.
Theorem 4.
Assume: 1) the reward magnitude is bounded, $|r| \le R_{\max}$; 2) the largest model error for a single state is some small value, $\max_s d_{TV}\big(P_\pi(\cdot|s), \hat P_\pi(\cdot|s)\big) \le \epsilon$. Then the distance between $p$ and $\hat p$ is bounded by a quantity proportional to $\epsilon$ (see Appendix A.5 for the precise bound).
Please see Appendix A.5 for proof.
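The updating rule (3) is ordinary (unadjusted) Langevin dynamics on the log of the error magnitude. The sketch below applies it to a toy case in which an analytic function plays the role of $|\hat\delta(s)|$, so the target distribution is known exactly; `grad_log_delta` is our hypothetical stand-in for the gradient the agent would obtain from the model and value function.

```python
import numpy as np

def langevin_search_control(grad_log_delta, s0, n_steps, step=1e-2, rng=None):
    """Generate states approximately distributed as p(s) ∝ |delta(s)| by
    the discretized Langevin rule s <- s + step * grad(log|delta(s)|) + noise."""
    rng = rng or np.random.default_rng(0)
    s, out = np.asarray(s0, dtype=float), []
    for _ in range(n_steps):
        s = s + step * grad_log_delta(s) + np.sqrt(2.0 * step) * rng.normal(size=s.shape)
        out.append(s.copy())
    return np.array(out)
```

For example, with $|\delta(s)| \propto \exp(-s^2/2)$, so that $\nabla_s \log|\delta(s)| = -s$, the chain's stationary distribution is the standard normal, and the empirical mean and standard deviation of the generated states approach 0 and 1.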
Algorithmic details. We present the key details of our algorithm, called DynaTD (Temporal Difference error), in Algorithm 3 in Appendix A.6. The algorithm closely follows the previous hill-climbing Dyna of Pan et al. (2019). At each time step, we run the updating rule (3) and record the states along the gradient trajectories to populate the search-control queue. During the planning stage, we sample states from the search-control queue and pair them with on-policy actions to get state-action pairs. We query the model with those state-action pairs to acquire the corresponding next states and rewards, yielding hypothetical experiences of the form $(s, a, r, s')$. We then mix those hypothetical experiences with real experiences from the ER buffer to form a mixed mini-batch for updating the NN parameters.
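The mixing step of the planning stage can be sketched as follows; the callables `model_step` and `greedy_action` are hypothetical stand-ins for the learned model and the on-policy action selector, not the paper's exact interfaces:

```python
def planning_batch(model_step, greedy_action, sc_states, real_batch):
    """Build a DynaTD planning mini-batch: hypothetical transitions
    generated from search-control states, mixed with real transitions
    drawn from the ER buffer."""
    hypothetical = []
    for s in sc_states:                  # states found by updating rule (3)
        a = greedy_action(s)             # pair the state with an on-policy action
        s_next, r = model_step(s, a)     # query the model for next state, reward
        hypothetical.append((s, a, r, s_next))
    return hypothetical + list(real_batch)
```

The returned mixed batch is then fed to an ordinary DQN-style update.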
Empirical verification of sampling distribution. We validate the efficacy of our sampling method by empirically examining the distance between the sampling distribution acquired by our gradient-ascent rule (3) and the desired distribution computed by thorough priority updating of all states under the current parameters, on the GridWorld domain (Pan et al., 2019) (Figure 3(a)), where the probability density can be conveniently approximated by discretization. We record the change in this distance while training our Algorithm 3. The distance to the sampling distribution of Prioritized ER is also included for comparison. All these distributions are computed by normalizing visitation counts on the discretized GridWorld. We compute the distances to the desired distribution under two sensible weighting schemes: 1) on-policy weighting, where the state distribution is approximated by uniformly sampling k states from a recency buffer; 2) uniform weighting. All details are in Appendix A.6.
Figure 3(b)(c) show that our algorithm DynaTD, with either a true or an online-learned model, maintains a significantly smaller distance to the desired sampling distribution than PrioritizedER under both weighting schemes. Furthermore, despite a mismatch between the implementation and Theorem 3, in that DynaTD may not run enough gradient steps to reach the stationary distribution, the induced sampling distribution is quite close to the one obtained by running many more gradient steps (DynaTD-Long), which is closer to the theorem's setting. This indicates that we can reduce time cost by lowering the number of gradient steps while keeping the sampling distribution similar.
5 Experiments
In this section, we empirically show that our algorithm achieves stable and consistent performance across different settings. We first show the overall comparative performance on various benchmark domains. We then show that our algorithm DynaTD is more robust to environment noise than PrioritizedER. Last, we demonstrate the practical utility of our algorithm on an autonomous driving application. Note that DynaTD uses the same hill-climbing parameter setting across all benchmark domains. We refer readers to Appendix A.6 for any missing details.
Baselines. We include the following baselines. ER is DQN with a regular ER buffer without prioritized sampling. PrioritizedER uses a priority queue to store visited experiences, and each experience is sampled with probability proportional to its TD error magnitude; note that, following the original paper (Schaul et al., 2016), after each mini-batch update only the priorities of the samples in that mini-batch are updated. DynaValue (Pan et al., 2019) is the Dyna variant that performs hill climbing on the value function to acquire states to populate the search-control queue. DynaFrequency (Pan et al., 2020) is the Dyna variant that performs hill climbing on the norm of the gradient of the value function to acquire states to populate the search-control queue.
Overall Performance. Figure 4 shows the overall performance of the different algorithms on Acrobot, CartPole, GridWorld (Figure 3(a)), and MazeGridWorld (Figure 4(g)). Our key observations are: 1) DynaValue and DynaFrequency may converge to a suboptimal policy when using a large number of planning steps; 2) DynaFrequency has clearly inconsistent performance across domains; 3) our algorithm performs best in most cases, and even with an online-learned model it outperforms the others on most tasks; 4) in most cases, model-based methods (Dyna) significantly outperform model-free methods.
Our interpretations of these observations are as follows. First, for DynaValue, consider the case where some states have high value but low TD error: the value-based hill-climbing method may still frequently acquire those states, which wastes samples and incurs a sampling-distribution bias that leads to a suboptimal policy. This suboptimality can be clearly observed on Acrobot, GridWorld, and MazeGridWorld. Similar reasoning applies to DynaFrequency. Second, for DynaFrequency, as indicated in the original paper (Pan et al., 2020), the gradient and Hessian norms have very different numerical scales and depend strongly on the choice of function approximator and domain; this suggests the algorithm requires finely tuned parameters as the testing domain varies, which possibly explains its inconsistent performance across domains. Furthermore, the Hessian-gradient product can be expensive, and we observe that DynaFrequency takes much longer to run than the other Dyna variants. Third, since we fetch the same number of states during search-control for all Dyna variants, the superior performance of DynaTD indicates the utility of the samples acquired by our approach. Fourth, each algorithm runs the same number of planning steps, yet the model-based algorithms perform significantly better; this indicates the benefit of leveraging the generalization power of the learned value function, whereas model-free methods can only reuse visited states.
Robustness to Noise. As a counterpart to the supervised learning experiment in Section 3, we show that our algorithm is more robust to increased noise variance than prioritized ER. Figure 5 shows the evaluation learning curves on Mountain Car with different numbers of planning steps and reward-noise standard deviations. We identify three key observations. First, our algorithm's performance relative to PrioritizedER resembles that of FullPrioritizedL2 relative to PrioritizedL2 in the supervised learning setting, where FullPrioritizedL2 is more robust to target noise. Second, our algorithm achieves almost the same performance as DynaFrequency, which is claimed to be robust to noise by Pan et al. (2020). Last, as observed on other environments, all algorithms can usually benefit from an increased number of planning steps; however, PrioritizedER and ER are clearly hurt by more planning steps in the presence of noise, which illuminates a limitation of model-free methods.
Practical Utility in an Autonomous Driving Application. We study the practical utility of our method in an autonomous driving application (Leurent, 2018) with an online-learned model. As shown in Figure 6(a), we test on the roundabout-v0 domain, where the agent (the green car) should learn to go through a roundabout without collisions while maintaining as high a speed as possible. We emphasize that the domain is not difficult to train to a near-optimal policy; this can be seen from previous work (Leurent et al., 2019), which shows that different algorithms achieve similar episodic return. However, we observe a significantly lower number of car crashes with the policy learned by our algorithm, as shown in Figure 6(b). This coincides with our intuition: a crash should incur high TD error, and our method of actively searching for such states by gradient ascent gives the agent sufficient training on them during the planning stage, so the agent can handle these scenarios better than model-free methods.
6 Discussion
In this work, we provide a theoretical reason for why prioritized ER can improve sample efficiency. We identify crucial factors for it to be effective: sample-space coverage and thorough priority updating. We then propose to sample states via Langevin dynamics and conduct experiments showing the efficacy of our method. Interesting future directions include: 1) studying the effect of model error on sample efficiency with our search-control; 2) applying our method with a feature-to-feature model, which could improve its scalability.
7 Broader Impact Discussion
This work concerns methodology for efficiently sampling hypothetical experiences in model-based reinforcement learning. Its potential impact is a further improvement in the sample efficiency of reinforcement learning methods, which should be broadly beneficial to the reinforcement learning research community. We have not considered specific applications or scenarios as the goal of this work.
References

Abadi et al. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
Brockman et al. (2016). OpenAI Gym. arXiv:1606.01540.
Chiang et al. (1987). Diffusion for global optimization in $\mathbb{R}^n$. SIAM Journal on Control and Optimization, pp. 737-753.
Corneil et al. (2018). Efficient model-based deep reinforcement learning with variational state tabulation. In International Conference on Machine Learning, pp. 1049-1058.
de Bruin et al. (2018). Experience selection in deep reinforcement learning for control. Journal of Machine Learning Research.
Durmus and Moulines (2017). Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. The Annals of Applied Probability, pp. 1551-1587.
Fanaee-T and Gama (2014). Event labeling combining ensemble detectors and background knowledge. Progress in Artificial Intelligence, pp. 1-15.
Glorot and Bengio (2010). Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics.
Goyal et al. (2019). Recall traces: backtracking models for efficient reinforcement learning. In International Conference on Learning Representations.
Gu et al. (2016). Continuous deep Q-learning with model-based acceleration. In International Conference on Machine Learning, pp. 2829-2838.
Ha and Schmidhuber (2018). Recurrent world models facilitate policy evolution. Advances in Neural Information Processing Systems, pp. 2450-2462.
Holland et al. (2018). The effect of planning shape on Dyna-style planning in high-dimensional state spaces. CoRR abs/1806.01825.
Janner et al. (2019). When to trust your model: model-based policy optimization. Advances in Neural Information Processing Systems, pp. 12519-12530.
Kingma and Ba (2015). Adam: a method for stochastic optimization. International Conference on Learning Representations.
Leurent (2018). An environment for autonomous driving decision-making. GitHub: https://github.com/eleurent/highway-env.
Leurent et al. (2019). Approximate robust control of uncertain dynamical systems. CoRR abs/1903.00220.
Lin (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning.
Mnih et al. (2015). Human-level control through deep reinforcement learning. Nature.
Moore and Atkeson (1993). Prioritized sweeping: reinforcement learning with less data and less time. Machine Learning, pp. 103-130.
Pan et al. (2018). Organizing experience: a deeper look at replay mechanisms for sample-based planning in continuous state domains. In International Joint Conference on Artificial Intelligence, pp. 4794-4800.
Pan et al. (2019). Hill climbing on value estimates for search-control in Dyna. In International Joint Conference on Artificial Intelligence.
Pan et al. (2020). Frequency-based search-control in Dyna. In International Conference on Learning Representations.
Roberts and Tweedie (1996). Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, pp. 341-363.
Schaul et al. (2016). Prioritized experience replay. In International Conference on Learning Representations.
Schlegel et al. (2019). Importance resampling for off-policy prediction. Advances in Neural Information Processing Systems 32, pp. 1799-1809.
Sutton (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In ML.
Sutton (1991). Integrated modeling and control based on reinforcement learning and dynamic programming. In Advances in Neural Information Processing Systems.
Sutton and Barto (2018). Reinforcement learning: An introduction. Second edition, The MIT Press.
Sutton et al. (2008). Dyna-style planning with linear function approximation and prioritized sweeping. In UAI, pp. 528-536.
van Hasselt et al. (2019). When to use parametric models in reinforcement learning? In Advances in Neural Information Processing Systems, pp. 14322-14333.
van Seijen and Sutton (2015). A deeper look at planning as learning from replay. In International Conference on Machine Learning, pp. 2314-2322.
Watkins and Dayan (1992). Q-learning. Machine Learning, pp. 279-292.
Welling and Teh (2011). Bayesian learning via stochastic gradient Langevin dynamics. In International Conference on Machine Learning, pp. 681-688.
Zhang et al. (2017). A hitting time analysis of stochastic gradient Langevin dynamics. In Conference on Learning Theory, pp. 1980-2022.
Appendix A Appendix
In Section A.1, we introduce some background on the Dyna architecture. We provide the proof of Theorem 1 in Section A.2 and the full proof of Theorem 2 in Section A.3. We briefly discuss Langevin dynamics in Section A.4, and present the proof of Theorem 4 in Section A.5. Details for reproducible research are in Section A.6. We provide supplementary experimental results in Section A.7.
A.1 Background on Dyna
Dyna integrates model-free and model-based policy updates in the online RL setting (Sutton, 1990). As shown in Algorithm 2, at each time step a Dyna agent uses the real experience to learn a model and to perform a model-free policy update. During the planning stage, simulated experiences are acquired from the model to further improve the policy. It should be noted that the concept of planning refers to any computational process that leverages a model to improve the policy (Sutton and Barto, 2018). The mechanism of generating states or state-action pairs from which to query the model is called search-control, which is of critical importance to sample efficiency. Abundant existing works (Moore and Atkeson, 1993; Sutton et al., 2008; Gu et al., 2016; Pan et al., 2018; Corneil et al., 2018; Goyal et al., 2019; Janner et al., 2019; Pan et al., 2019) report different levels of sample-efficiency improvement from different ways of generating hypothetical experiences during the planning stage.
A.2 Proof of Theorem 1
Theorem 1. For a constant determined by , we have
Proof.
The proof is very intuitive. The expected gradient of the uniform sampling method is
Setting completes the proof. ∎
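Since the displayed statement is not fully recoverable here, the following numerical check uses one standard reading of Theorem 1 (our assumption): sampling transitions with probability proportional to the absolute error and applying squared-loss gradients equals, in expectation and up to a constant, uniform sampling with cubic-loss gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.normal(size=1000)   # current predictions
y = rng.normal(size=1000)   # regression targets
err = f - y

# prioritized sampling: p_i proportional to |err_i|, squared-loss gradient err_i
p = np.abs(err) / np.abs(err).sum()
g_prioritized = np.sum(p * err)

# uniform sampling, cubic-loss gradient: d/df (1/3)|err|^3 = err * |err|
g_cubic_uniform = np.mean(err * np.abs(err))

# the two expected gradients agree up to the constant c = mean(|err|)
c = np.abs(err).mean()
assert np.isclose(c * g_prioritized, g_cubic_uniform)
```

The identity is exact in expectation because the sampling weight |err_i| multiplied by the squared-loss gradient err_i reproduces the cubic-loss gradient err_i·|err_i| term by term.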
A.3 Proof of Theorem 2
Theorem 2. Consider the following two objectives: , and . Denote , and . Define the functional gradient flow updates on these two objectives:
(4) 
Given error threshold , define the hitting time and . For any initial function value s.t. , such that .
Proof.
For the gradient flow update on the objective, we have,
(5)  
(6)  
(7)  
(8) 
which implies,
(9) 
Integrating, we have,
(10) 
which is equivalent to (letting ),
(11) 
On the other hand, for the gradient flow update on the objective, we have,
(12)  
(13)  
(14) 
which implies,
(15) 
Integrating, we have,
(16) 
which is equivalent to (letting ),
(17) 
Then we have,
(18)  
(19) 
Define the function ; it is continuous. We have , and is monotonically increasing for and monotonically decreasing for .
Given , we have . Using the intermediate value theorem for on , we have , such that . Since is monotonically increasing on and monotonically decreasing on , for any , we have . (Note that by the design of the gradient descent updating rule; if the two are equal, holds trivially.) Hence we have,
Remark 1. Figure 7 shows the function . Fixing an arbitrary , there is another root s.t. . However, there is no real-valued solution for . The solution in is , where is the Wright omega function. Hence, finding the exact value of would require defining an ordering on the complex plane. Our current theorem statement is sufficient for characterizing the convergence rate: it states that there always exists some desired low error level at which minimizing the square loss converges more slowly than the cubic loss.
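As a concrete instance of the comparison in Theorem 2, assume the two objectives are $\ell_2(\delta) = \delta^2/2$ and $\ell_3(\delta) = |\delta|^3/3$ in a scalar error $\delta_t$ (these symbols are our assumption, chosen only for illustration). The two gradient flows then solve in closed form:

\begin{align}
\dot{\delta}_t = -\delta_t
&\;\Rightarrow\; \delta_t = \delta_0 e^{-t},
&t_2(\epsilon) &= \ln(\delta_0/\epsilon),\\
\dot{\delta}_t = -\delta_t|\delta_t|
&\;\Rightarrow\; \delta_t = \frac{\delta_0}{1+\delta_0 t} \quad (\delta_0 > 0),
&t_3(\epsilon) &= \frac{1}{\epsilon} - \frac{1}{\delta_0},
\end{align}

so comparing the two hitting times reduces to the sign of $h(\epsilon) = \ln(\delta_0/\epsilon) - 1/\epsilon + 1/\delta_0$, a transcendental function whose roots involve the Wright omega function, consistent with Remark 1.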
A.4 Discussion of Langevin Dynamics
Define an SDE: , where is a -dimensional Brownian motion and is a continuously differentiable function. The Langevin diffusion converges to a unique invariant distribution (Chiang et al., 1987). Applying the Euler–Maruyama discretization scheme to the SDE, we obtain the discretized version , where is an i.i.d. sequence of standard -dimensional Gaussian random vectors and is a sequence of step sizes. It has been proved that the limiting distribution of the sequence converges to the invariant distribution of the underlying SDE (Roberts and Tweedie, 1996; Durmus and Moulines, 2017). As a result, considering as , as completes the proof of Theorem 3.
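The Euler–Maruyama discretization just described can be sketched as follows. The potential, step size, and iteration counts are placeholders of ours (the paper's own choices are not recoverable here); we use the standard form in which the iterates approximately sample the invariant density proportional to exp(-U(x)).

```python
import numpy as np

def langevin_samples(grad_U, x0, step=0.01, n_steps=20000, burn_in=2000, seed=0):
    """Euler-Maruyama discretization of dX_t = -grad_U(X_t) dt + sqrt(2) dB_t.

    With a small fixed step size, the iterates approximately sample the
    invariant density proportional to exp(-U(x)).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    out = []
    for k in range(n_steps):
        noise = rng.normal(size=x.shape)
        x = x - step * grad_U(x) + np.sqrt(2.0 * step) * noise
        if k >= burn_in:
            out.append(x.copy())
    return np.array(out)

# sanity check: U(x) = x^2 / 2, so the invariant density is the standard Gaussian
samples = langevin_samples(grad_U=lambda x: x, x0=np.array([3.0]))
```

For the quadratic potential, the empirical mean and variance of the chain should be close to 0 and 1, up to discretization bias on the order of the step size.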
A.5 Proof of Theorem 4
We now provide the error bound of Theorem 4. We denote the transition probability distribution under policy with the true model as , and that with the learned model as . Let and be the convergent distributions described in Theorem 3 using the true model and the learned model, respectively. Let be the total variation distance between two probability distributions. Define . Then we have the following bound.
Theorem 4. Assume: 1) the reward magnitude is bounded and define ; 2) the largest model error for a single state is some small value: . Then .
Proof.
First, we bound the estimated temporal difference error. Fix an arbitrary state ; it is sufficient to consider the case . Then
We then take into account the normalizer of the Gibbs distribution. Consider the case first.
This corresponds to the second term in the maximum operation. The first term corresponds to the case . This completes the proof. ∎
A.6 Reproducible Research
Our implementations are based on TensorFlow (Abadi et al., 2015). We use the Adam optimizer (Kingma and Ba, 2014) for all experiments.
A.6.1 Reproducing the experiments before Section 5
Supervised learning experiment.
For the supervised learning experiment in Section 3, we use a neural network with tanh units, with the learning rate swept from for all algorithms. For the cubic loss, we compute the constant specified in Theorem 1 at each time step. We compute the testing error every iterations/mini-batch updates, and our evaluation learning curves are averaged over random seeds. For each random seed, we randomly split the dataset into a training set and a testing set; the testing set has k data points. Note that the testing set is not noise-contaminated.
Reinforcement Learning experiments in Section 3.
We use a particularly small neural network to highlight the issue of incomplete priority updating. Intuitively, a large neural network may be able to memorize each state's value, so updating one state's value is less likely to affect others. We choose a small neural network, in which case complete priority updating for all states should be important. We set the maximum ER buffer size to k and the mini-batch size to . The learning rate is , and the target network is updated every k steps.
Distribution distance computation in Section 4.
We now describe the implementation details for Figure 3. The distance is estimated by the following steps. First, to compute the desired sampling distribution, we discretize the domain into grids and calculate the absolute TD error of each grid (represented by its bottom-left vertex coordinates) using the true environment model and the current learned function. We then normalize these priorities to get the probability distribution . This distribution is considered the desired one because we have access to all states across the state space, with priorities computed from the current Q-function at each time step. Second, we estimate our sampling distribution by randomly sampling k states from the search-control queue, counting the number of states falling into each discretized grid, and normalizing these counts to get . Third, for comparison, we estimate the sampling distribution of conventional prioritized ER (Schaul et al., 2016) by sampling k states from the prioritized ER buffer, counting the states falling into each grid, and normalizing the counts to compute the corresponding distribution. Then we compute the distances of
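The counting-and-normalizing steps above can be sketched as follows; the grid resolution, domain bounds, and the use of total variation distance are placeholders of ours, not values recovered from the paper.

```python
import numpy as np

def empirical_grid_dist(states, bins=20, low=0.0, high=1.0):
    """Discretize 2-D states into a bins x bins grid and return the
    normalized visit counts as a flat probability vector."""
    states = np.asarray(states)
    hist, _, _ = np.histogram2d(
        states[:, 0], states[:, 1],
        bins=bins, range=[[low, high], [low, high]],
    )
    return hist.flatten() / hist.sum()

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.abs(p - q).sum()
```

In this form, one would estimate one grid distribution from states drawn from the search-control queue, another from states drawn from the prioritized ER buffer, and compare each against the desired distribution computed from the true model's TD errors.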