1 Introduction
Experience replay has become nearly ubiquitous in modern large-scale, deep reinforcement learning systems [Schaul et al., 2016]. The basic idea is to store an incomplete history of previous agent-environment interactions in a transition buffer. During planning, the agent selects a transition from the buffer and updates the value function as if the sample were generated online: the agent replays the transition. There are many potential benefits of this approach, including stabilizing potentially divergent nonlinear Q-learning updates, and mimicking the effect of multi-step updates as in eligibility traces.
Experience replay (ER) is like a model-based RL system, where the transition buffer acts as a model of the world [Lin, 1992]. Using the data as the model avoids model errors that can cause bias in the updates (cf. [Bagnell and Schneider, 2001]). One of ER's most distinctive attributes as a model-based planning method is that it does not perform multi-step rollouts of hypothetical trajectories according to a model; rather, previous agent-environment transitions are replayed, randomly or with priority, from the transition buffer. Trajectory-sampling approaches such as PILCO [Deisenroth and Rasmussen, 2011], Hallucinated DAgger [Talvitie, 2017], and CPSRs [Hamilton et al., 2014], unlike ER, can roll out unlikely trajectories, ending up in hypothetical states that do not match any real state in the world when the model is wrong [Talvitie, 2017]. ER's stochastic one-step planning approach was later adopted by Sutton's Dyna architecture [Sutton, 1991].
Despite the similarities between Dyna and ER, there have been no comprehensive, direct empirical comparisons of the two and their underlying design decisions. ER maintains a buffer of transitions for replay, and Dyna a search-control queue composed of stored states and actions from which to sample. There are many possibilities for how to add, remove, and select samples from either ER's transition buffer or Dyna's search-control queue. It is not hard to imagine situations where a Dyna-style approach could be better than ER. For example, because Dyna models the environment, states leading into high-priority states (predecessors) can be added to the queue, unlike in ER. Additionally, Dyna can choose to simulate on-policy samples, whereas ER can only replay (likely off-policy) samples previously stored. In non-stationary problems, small changes can be quickly recognized and corrected in the model. On the other hand, these small changes might result in wholesale changes to the policy, potentially invalidating many transitions in ER's buffer. It remains to be seen whether these differences manifest empirically, or whether the additional complexity of Dyna is worthwhile.
In this paper, we develop a novel semi-parametric Dyna algorithm, called REM-Dyna, that provides some of the benefits of both Dyna-style planning and ER. We highlight criteria for learned models used within Dyna, and propose Reweighted Experience Models (REMs) that are data-efficient, efficient to sample, and can be learned incrementally. We investigate the properties of both ER and REM-Dyna, and highlight cases where ER can fail but REM-Dyna is robust. Specifically, this paper contributes both (1) a new method extending Dyna to continuous-state domains, significantly outperforming previous attempts [Sutton et al., 2008], and (2) a comprehensive investigation of the design decisions critical to the performance of one-step, sample-based planning methods for reinforcement learning with function approximation. An Appendix is publicly available on arXiv, with the theorem proof and additional algorithmic and experimental details.
2 Background
We formalize an agent's interaction with its environment as a discrete-time Markov Decision Process (MDP). On each time step $t$, the agent observes the state of the MDP $S_t \in \mathcal{S}$ and selects an action $A_t \in \mathcal{A}$, causing a transition to a new state $S_{t+1} \in \mathcal{S}$ and producing a scalar reward on the transition, $R_{t+1} \in \mathbb{R}$. The agent's objective is to find an optimal policy $\pi : \mathcal{S} \times \mathcal{A} \to [0, 1]$, which maximizes the expected return $G_t$ for all states, where $G_t \doteq R_{t+1} + \gamma_{t+1} G_{t+1}$ with discount $\gamma_{t+1} \in [0, 1]$, and with future states and rewards sampled according to the one-step dynamics of the MDP. The generalization to a discount function $\gamma : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ allows for a unified specification of episodic and continuing tasks [White, 2017], both of which are considered in this work.
In this paper we are concerned with model-based approaches to finding optimal policies. In all approaches considered here, the agent forms an estimate of the value function from data: $Q(s, a) \approx \mathbb{E}[G_t \mid S_t = s, A_t = a]$. The value function is parameterized, allowing both linear and nonlinear approximations. We consider sample models that, given an input state and action, need only output one possible next state and reward, sampled according to the one-step dynamics of the MDP.
In this paper, we focus on stochastic one-step planning methods, where one-step transitions are sampled from a model to update an action-value function. The agent interacts with the environment on each time step, selecting actions according to its current policy (e.g., $\epsilon$-greedy with respect to $Q$), observing next states and rewards, and updating $Q$. Additionally, the agent updates a model with these observed sample transitions on each time step. After updating the value function and the model, the agent executes several steps of planning. On each planning step, the agent samples a start state and action in some way (called search control), then uses the model to simulate the next state and reward. Using this hypothetical transition, the agent updates $Q$ in the usual way. In this generic framework, the agent can interleave learning, planning, and acting, all in real time. Two well-known implementations of this framework are ER [Lin, 1992] and the Dyna architecture [Sutton, 1991].
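As a concrete illustration, this generic interleaved loop can be sketched in the tabular setting with a deterministic learned model. The environment interface (`reset`, `step`, `actions`) and the hyperparameter values below are assumptions of this sketch, not the paper's implementation:

```python
import random
from collections import defaultdict

def dyna_q(env, n_steps=1000, planning_steps=5, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Generic one-step sample-based planning (tabular Dyna-Q): on every time
    step, act, do a direct Q-learning update, update a deterministic model,
    then replay `planning_steps` hypothetical one-step transitions."""
    Q = defaultdict(float)          # action-values keyed by (state, action)
    model = {}                      # (s, a) -> (r, s_next, done)
    s = env.reset()
    for _ in range(n_steps):
        # epsilon-greedy action selection from the current value estimates
        if random.random() < epsilon:
            a = random.choice(env.actions)
        else:
            a = max(env.actions, key=lambda a_: Q[(s, a_)])
        s_next, r, done = env.step(a)
        # direct reinforcement learning update (Q-learning)
        bootstrap = 0.0 if done else gamma * max(Q[(s_next, a_)] for a_ in env.actions)
        Q[(s, a)] += alpha * (r + bootstrap - Q[(s, a)])
        # model update: remember the observed transition
        model[(s, a)] = (r, s_next, done)
        # planning: replay hypothetical transitions sampled from the model
        for _ in range(planning_steps):
            (ps, pa), (pr, ps_next, pdone) = random.choice(list(model.items()))
            pboot = 0.0 if pdone else gamma * max(Q[(ps_next, a_)] for a_ in env.actions)
            Q[(ps, pa)] += alpha * (pr + pboot - Q[(ps, pa)])
        s = env.reset() if done else s_next
    return Q
```

ER fits the same skeleton by replacing the model with a buffer of full transitions and replaying them verbatim in the planning loop.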
3 One-step Sample-based Planning Choices
There are subtle design choices in the construction of stochastic, one-step, sample-based planning methods that can significantly impact performance. These include how to add states and actions to the search-control queue for Dyna, how to select states and actions from the queue, and how to sample next states. These choices influence the design of our REM algorithm, and so we discuss them in this section.
One important choice for Dyna-style methods is whether to sample a next state or compute an expected update over all possible transitions. A sample-based planner samples $s', r$ given $s, a$, and stochastically updates $Q$. An alternative is to approximate full dynamic programming updates, to give an expected update, as done by stochastic factorization approaches [Barreto et al., 2011; Kveton and Theocharous, 2012; Barreto et al., 2014; Yao et al., 2014; Barreto et al., 2016; Pires and Szepesvári, 2016], kernel-based RL (KBRL) [Ormoneit and Sen, 2002], and kernel mean embeddings (KME) for RL [Grunewalder et al., 2012; Van Hoof et al., 2015; Lever et al., 2016]. Linear Dyna [Sutton et al., 2008] computes an expected next reward and expected next feature vector for the update, which corresponds to an expected update when $Q$ is a linear function of features. We advocate for a sampled update, because approximate dynamic programming updates, such as KME and KBRL, are typically too expensive, couple the model and value function parameterization, and are designed for a batch setting. Computation can be more effectively used by sampling transitions.
There are many possible refinements to the search-control mechanism, including prioritization and backwards search. For tabular domains, it is feasible to simply store all possible states and actions from which to simulate. In continuous domains, however, care must be taken to order and delete stored samples. A basic strategy is to simply store recent transitions in ER's transition buffer, or recent states and actions in Dyna's search-control queue. This, however, provides little information about which samples would be most beneficial for learning. Prioritizing how samples are drawn, based on absolute TD error, has been shown to be useful both for tabular Dyna [Sutton and Barto, 1998] and for ER with function approximation [Schaul et al., 2016]. When the buffer or search-control queue gets too large, one must also decide whether to delete transitions based on recency or priority. In the experiments, we explore this question of the efficacy of recency versus priority for adding and deleting.
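A minimal sketch of proportional prioritization with recency-based deletion follows; the class and its interface are illustrative assumptions, and [Schaul et al., 2016] additionally use importance-sampling corrections and efficient sum-tree sampling, omitted here:

```python
import random

class PrioritizedBuffer:
    """Transition buffer with proportional prioritization:
    P(i) = p_i / sum_j p_j, with p_i = |TD error| (exponent 1).
    Deletion is by recency (oldest first)."""
    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.items, self.priorities = [], []

    def add(self, transition, td_error):
        if len(self.items) >= self.capacity:   # evict the oldest entry
            self.items.pop(0)
            self.priorities.pop(0)
        self.items.append(transition)
        self.priorities.append(abs(td_error))

    def sample(self):
        total = sum(self.priorities)
        if total == 0:                          # fall back to uniform sampling
            return random.choice(self.items)
        return random.choices(self.items, weights=self.priorities, k=1)[0]
```

The same structure serves as a Dyna search-control queue by storing (state, action) pairs instead of full transitions.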
ER is limited in using alternative criteria for search control, such as backward search. A model allows more flexibility in obtaining useful states and actions to add to the search-control queue. For example, a model can be learned to simulate predecessor states: states leading into a (high-priority) state $s$ for a given action $a$. Predecessor states can be added to the search-control queue during planning, facilitating a type of backward search. The ideas of backward search and prioritization were introduced together for tabular Dyna [Peng and Williams, 1993; Moore and Atkeson, 1993]. Backward search can only be applied in ER in a limited way, because its buffer is unlikely to contain transitions from multiple predecessor states to the current state in planning. [Schaul et al., 2016] proposed a simple heuristic to approximate prioritization with predecessors, by updating the priority of the most recent transition in the buffer to be at least as large as that of the transition that came directly after it. This heuristic, however, does not allow a systematic backward search.
A final possibility we consider is using the current policy to select actions during search control. Conventionally, Dyna draws the action from the search-control queue using the same mechanism used to sample the state. Alternatively, we can sample the state via priority or recency, and then query the model using the action the learned policy would select in that state. This approach has the advantage that planning focuses on actions the agent currently estimates to be the best. In the tabular setting, this on-policy sampling can result in dramatic efficiency improvements for Dyna [Sutton and Barto, 1998], and [Gu et al., 2016] report improvement from on-policy sampling of transitions, in a setting with multi-step rollouts. ER cannot emulate on-policy search control, because it replays full transitions $(s, a, s', r)$ and cannot query for an alternative transition if a different action than the stored $a$ would now be taken.
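One on-policy search-control planning step can be sketched as follows, assuming a sample model `model(s, a)` that returns `(r, s_next)`; the function name and interface are hypothetical:

```python
def on_policy_planning_step(Q, model, queue, actions, gamma, alpha):
    """One Dyna planning step with on-policy search control: the state comes
    from the search-control queue, but the action is the one the current
    (greedy) policy would take; the model is then queried for that pair."""
    s = queue.sample()
    a = max(actions, key=lambda a_: Q[(s, a_)])   # greedy w.r.t. current Q
    r, s_next = model(s, a)                       # sample hypothetical outcome
    target = r + gamma * max(Q[(s_next, a_)] for a_ in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return s, a
```

The contrast with ER is visible in the second line: the action is recomputed from the current policy rather than read from a stored transition.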
4 Reweighted Experience Models for Dyna
In this section, we highlight criteria for selecting among the variety of available sampling models, and then propose a semi-parametric model, called Reweighted Experience Models (REMs), as one suitable model that satisfies these criteria.
4.1 Generative Models for Dyna
Generative models are a fundamental tool in machine learning, providing a wealth of possible model choices. We begin by specifying our desiderata for online sample-based planning and acting. First, the model learning should be incremental and adaptive, because the agent incrementally interleaves learning and planning. Second, the models should be data-efficient, in order to achieve the primary goal of improving the data efficiency of learning value functions. Third, due to policy non-stationarity, the models need to be robust to forgetting: if the agent stays in one part of the world for quite some time, the learning algorithm should not overwrite, or forget, the model in other parts of the world. Fourth, the models need to support being queried as conditional models. Fifth, sampling should be computationally efficient, since a slow sampler will reduce the feasible number of planning steps.
Density models are typically learned as a mixture of simpler functions or distributions. In the most basic case, a simple distributional form can be used, such as a Gaussian distribution for continuous random variables, or a categorical distribution for discrete random variables. For conditional distributions $p(y \mid x)$, the parameters of these distributions, like the mean and variance of $y$, can be learned as a (complex) function of $x$. More general distributions can be learned using mixtures, such as mixture models or belief networks. A conditional Gaussian mixture model, for example, could represent $p(y \mid x) = \sum_i w_i(x)\, \mathcal{N}\big(y \mid \mu_i(x), \sigma_i^2(x)\big)$, where the weights, means, and variances are (learned) functions of $x$. In belief networks, such as Boltzmann distributions, the distribution is similarly represented as a sum over hidden variables, but with more general functional forms over the random variables, such as energy functions. To condition on $x$, those variables in the network are fixed both for learning and sampling.
Kernel density estimators (KDE) are similar to mixture models, but are non-parametric: the means in the mixture are the training data themselves, with a uniform weighting of $1/n$ for $n$ samples. KDE and conditional KDE are consistent [Holmes et al., 2007], since the model is a weighting over observed data, providing low model bias. Further, KDE is data-efficient, easily enables conditional distributions, and is well understood theoretically and empirically. Unfortunately, it scales linearly in the data, which is not compatible with online reinforcement learning problems. Mixture models, on the other hand, learn a compact mixture and could scale, but are expensive to train incrementally and have issues with local minima.
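To make the scaling issue concrete, here is a one-dimensional Gaussian KDE sketch; density evaluation requires an O(n) pass over every stored sample, while sampling amounts to picking a stored sample and perturbing it with the kernel:

```python
import math
import random

def kde_density(x, data, h=0.5):
    """1-D Gaussian kernel density estimate: a uniform mixture with one
    component per observed sample, so evaluation is O(n) in the data."""
    n = len(data)
    norm = h * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) / norm for xi in data) / n

def kde_sample(data, h=0.5):
    """Sampling: pick a stored sample uniformly, then add kernel noise."""
    return random.gauss(random.choice(data), h)
```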
Neural network models are another option, such as Generative Adversarial Networks [Goodfellow et al., 2014] and Stochastic Neural Networks [Sohn et al., 2015; Alain et al., 2016]. Many of the energy-based models, however, such as Boltzmann distributions, require computationally expensive sampling strategies [Alain et al., 2016]. Other networks, such as Variational Autoencoders, sample inputs from a given distribution to enable the network to sample outputs. These neural network models, however, have issues with forgetting [McCloskey and Cohen, 1989; French, 1999; Goodfellow et al., 2013], and require more intensive training strategies, often requiring experience replay themselves.
4.2 Reweighted Experience Models
We propose a semi-parametric model to take advantage of the properties of KDE while still scaling with increasing experience. The key properties of REM models are that 1) it is straightforward to specify and sample both forward models $p(s', r, \gamma \mid s, a)$ and reverse models for predecessors $p(s \mid s', a)$, using essentially the same model (the same prototypes); 2) they are data-efficient, requiring few parameters to be learned; and 3) they can provide sufficient model complexity, by allowing for a variety of kernels or metrics defining similarity.
REM models consist of a subset of $b$ prototype transitions $(s_i, a_i, s_i', r_i, \gamma_i)$, chosen from all transitions experienced by the agent, and their corresponding weights $c_i$. These prototypes are chosen to be representative of the transitions, based on a similarity given by a product kernel
$k\big((s, a, s', r, \gamma), (s_i, a_i, s_i', r_i, \gamma_i)\big) = k_s(s, s_i)\, k_a(a, a_i)\, k_{s'}\big((s', r, \gamma), (s_i', r_i, \gamma_i)\big).$ (1)
A product kernel is a product of separate kernels. It is still a valid kernel, but it simplifies dependencies and simplifies computing conditional densities, which are key for Dyna, both for forward and predecessor models. They are also key for obtaining a consistent estimate of the conditional coefficients, described below.
We first consider Gaussian kernels, for simplicity. For states, $k_s(s, s_i) = \exp\big(-\tfrac{1}{2}(s - s_i)^\top \Sigma_s^{-1} (s - s_i)\big)$ with covariance $\Sigma_s$. For discrete actions, the similarity is an indicator: $k_a(a, a_i) = 1$ if $a = a_i$ and $k_a(a, a_i) = 0$ otherwise. For the next state, reward, and discount, a Gaussian kernel is similarly used for $(s', r, \gamma)$ with covariance $\Sigma_{s'}$. We set the covariance matrix $\Sigma_s$ proportional to a sample covariance, and use a conditional covariance for $\Sigma_{s'}$.
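Under these choices, the product kernel over transitions can be sketched as follows; diagonal covariances are used, and the discount term is folded in with the next state and reward (a simplification of this sketch):

```python
import math

def gaussian_kernel(x, c, inv_cov_diag):
    """Unnormalized Gaussian similarity with a diagonal inverse covariance;
    values lie in (0, 1], which eases initializing new prototypes."""
    d2 = sum(ic * (xj - cj) ** 2 for xj, cj, ic in zip(x, c, inv_cov_diag))
    return math.exp(-0.5 * d2)

def product_kernel(s, a, sp, r, proto, inv_cov_s, inv_cov_sp):
    """Product kernel over a transition: similarity factorizes into state,
    action (an indicator for discrete actions), and (next state, reward)
    parts, so conditioning on (s, a) only touches the first two factors."""
    ps, pa, psp, pr = proto
    k_s = gaussian_kernel(s, ps, inv_cov_s)
    k_a = 1.0 if a == pa else 0.0
    k_sp = gaussian_kernel(list(sp) + [r], list(psp) + [pr], inv_cov_sp)
    return k_s * k_a * k_sp
```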
First consider a KDE model, for comparison, where all $n$ observed transitions are used to define the distribution
$\hat p(s, a, s', r, \gamma) = \tfrac{1}{n} \sum_{j=1}^{n} k\big((s, a, s', r, \gamma), (s_j, a_j, s_j', r_j, \gamma_j)\big).$
This estimator puts higher density around more frequently observed transitions. A conditional estimator is similarly intuitive, and is also a consistent estimator [Holmes et al., 2007]:
$\hat p(s', r, \gamma \mid s, a) = \sum_{j=1}^{n} \frac{k_s(s, s_j)\, k_a(a, a_j)}{\sum_{l=1}^{n} k_s(s, s_l)\, k_a(a, a_l)}\, k_{s'}\big((s', r, \gamma), (s_j', r_j, \gamma_j)\big).$
The experience similar to $(s, a)$ has higher weight in the conditional estimator: distributions centred at $(s_j', r_j, \gamma_j)$ for transitions with $(s_j, a_j)$ similar to $(s, a)$ contribute more to specifying $p(s', r, \gamma \mid s, a)$. Similarly, it is straightforward to specify the conditional density $p(s \mid s', a)$ for predecessors.
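Sampling from this conditional KDE can be sketched for one-dimensional states as follows (illustrative only; note the O(n) pass over all stored transitions, which is exactly what motivates the prototype-based model):

```python
import math
import random

def conditional_kde_sample(query_s, query_a, transitions, h=0.5):
    """Conditional KDE sampling: transitions similar to the query (s, a) get
    higher weight; a stored transition is chosen with that weight and its
    (s', r) is perturbed by the kernel. Cost is O(n) in stored experience."""
    weights = [
        math.exp(-0.5 * ((query_s - s) / h) ** 2) * (1.0 if a == query_a else 0.0)
        for (s, a, sp, r) in transitions
    ]
    i = random.choices(range(len(transitions)), weights=weights, k=1)[0]
    _, _, sp, r = transitions[i]
    return random.gauss(sp, h), r
```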
When only prototype transitions are stored, joint and conditional densities can be similarly specified, but prototypes must be weighted to reflect the density in that area. We therefore need a method to select prototypes and to compute weightings. Selecting representative prototypes or centers is a very active area of research, and we simply use a recent incremental and efficient algorithm designed to select prototypes [Schlegel et al.2017]. For the reweighting, however, we can design a more effective weighting exploiting the fact that we will only query the model using conditional distributions.
Reweighting approach. We develop a reweighting scheme that takes advantage of the fact that Dyna only requires conditional models. Because $p(s', r \mid s, a) = p(s, a, s', r) / p(s, a)$, a simple KDE strategy is to estimate coefficients $c_i$ on the entire transition and $b_i$ on $(s, a)$, to obtain accurate densities $\hat p(s, a, s', r)$ and $\hat p(s, a)$. However, there are several disadvantages to this approach. The $c_i$ and $b_i$ need to constantly adjust, because the policy is changing. Further, when adding and removing prototypes incrementally, the other $c_i$ and $b_i$ need to be adjusted. Finally, $c_i$ and $b_i$ can get very small, depending on visitation frequency to a part of the environment, even if the conditional probability is not small. Rather, by directly estimating the conditional coefficients $b_{i \mid sa}$, we avoid these problems. The conditional distribution is stationary even with a changing policy; each $b_{i \mid sa}$ can converge even during policy improvement, and can be estimated independently of the other coefficients.
We can directly estimate the conditional coefficients because of the conditional independence assumption made by product kernels. For prototype $i$, the product kernel in Equation (1) factors into a similarity over $(s, a)$ and a similarity over $(s', r, \gamma)$; the conditional probability can therefore be rewritten as a mixture of the prototypes' outcome kernels, weighted by coefficients that depend only on the query $(s, a)$. Now we simply need to estimate these coefficients. Again using the conditional independence property, we can prove the following.
Theorem 1. Let $k_i$ be the similarity $k_s(s_t, s_i)\, k_a(a_t, a_i)$ of $(s_t, a_t)$ for sample $t$ to $(s_i, a_i)$ for prototype $i$. Then the conditional coefficients $b_{i \mid sa}$ can be estimated consistently from these similarities (the precise statement and proof are in the appendix). The resulting REM model is
$\hat p(s', r, \gamma \mid s, a) = \sum_{i=1}^{b} b_{i \mid sa}\, k_{s'}\big((s', r, \gamma), (s_i', r_i, \gamma_i)\big),$
with the $b_{i \mid sa}$ normalized to sum to one.
To sample predecessor states from $p(s \mid s', a)$, the same set of prototypes can be used, with a separate set of conditional weightings estimated in the same way as for the forward model.
Sampling from REMs. Conveniently, to sample from the REM conditional distribution, the similarity across next states and rewards need not be computed; only the conditional coefficients $b_{i \mid sa}$ need to be computed. A prototype $i$ is sampled with probability proportional to $b_{i \mid sa}$; if prototype $i$ is sampled, then the (Gaussian) density centred around $(s_i', r_i, \gamma_i)$ is sampled.
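This two-stage procedure can be sketched directly; the data layout (next-state/reward pairs, discount omitted) and the bandwidth are assumptions of this sketch:

```python
import random

def rem_sample(prototypes, cond_coeffs, h=0.1):
    """Two-stage sampling from the REM conditional: pick a prototype i with
    probability proportional to its conditional coefficient, then sample from
    the Gaussian centred on that prototype's stored (s', r). No kernel over
    next states and rewards is ever evaluated.
    `prototypes` holds (s_next, r) pairs; `cond_coeffs` the coefficients for
    the query (s, a)."""
    i = random.choices(range(len(prototypes)), weights=cond_coeffs, k=1)[0]
    sp, r = prototypes[i]
    return [random.gauss(x, h) for x in sp], random.gauss(r, h)
```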
In the implementation, the normalization constants in the Gaussian kernels are omitted, because as fixed constants they can be normalized out. All kernel values then lie in $(0, 1]$, providing improved numerical stability and a straightforward initialization for new prototypes. REMs are linear in the number of prototypes for learning and sampling, with per-step complexity independent of the number of samples.
Addressing issues with scaling in the input dimension. In general, any non-negative kernel that integrates to one can be used. There are realistic low-dimensional physical systems for which Gaussian kernels have been shown to be highly effective, such as in robotics [Deisenroth and Rasmussen, 2011]. Kernel-based approaches can, however, extend to high-dimensional problems with specialized kernels. For example, convolutional kernels for images have been shown to be competitive with neural networks [Mairal et al., 2014]. Further, learned similarity metrics or embeddings enable data-driven models, such as neural networks, to improve performance by replacing the Euclidean distance. This combination of the probabilistic structure of REMs with data-driven similarities from neural networks is a promising next step.
5 Experiments
We first empirically investigate the design choices for ER's buffer and Dyna's search-control queue in the tabular setting. Subsequently, we examine the utility of REM-Dyna, our proposed model-learning technique, by comparing it with ER and with other model-learning techniques in the function approximation setting. Maintaining the buffer or queue involves determining how to add, remove, and prioritize samples. All methods delete the oldest samples; our experiments (not shown here) showed that deleting the lowest-priority samples, with priority computed from TD error, is not effective in the problems we studied. We investigate three different settings:
1) Random: samples are drawn randomly.
2) Prioritized: samples are drawn probabilistically according to the absolute TD error of the transitions [Schaul et al., 2016, Equation 1] (exponent set to 1).
3) Predecessors: same as Prioritized, and predecessors of the current state are also added to the buffer or queue.
We also test using on-policy transitions for Dyna, where only the state is stored on the queue and actions are simulated according to the current policy; the queue is maintained using priorities and predecessors. In Dyna, we use the learned model to sample predecessors of the current state, for all actions, and add them to the queue. In ER, with no environment model, we use a simple heuristic that adds the priority of the current sample to the preceding sample in the buffer [Schaul et al., 2016]. Note that [van Seijen and Sutton, 2015] relate Dyna and ER, but specifically via a theoretical equivalence for policy evaluation, based on a non-standard form of replay related to true online methods, and thus we do not include it.
Experimental settings: All experiments are averaged over many independent runs, with the randomness controlled based on the run number. All learning algorithms use $\epsilon$-greedy action selection and Q-learning to update the value function in both learning and planning phases. The step-sizes are swept over a range of values (details in the appendix). The size of the search-control queue and buffer was fixed to 1024, large enough for the microworlds considered, and the number of planning steps was fixed to 5.
A natural question is whether conclusions from experiments in the microworlds below extend to larger environments. Microworlds are specifically designed to highlight phenomena that occur in larger domains, such as the difficult-to-reach, high-reward states in River Swim, described below. The computation and model size are correspondingly scaled down, to reflect realistic limitations when moving to larger environments. The trends obtained when varying the size and stochasticity of these environments provide insight into making such changes in larger environments. Experiments in microworlds, then, enable a more systematic, issue-oriented investigation, and suggest directions for further investigation in real domains.
Results in the Tabular Setting: To gain insight into the differences between Dyna and ER, we first consider them in deterministic and stochastic variants of a simple gridworld with increasing state-space size. ER has largely been explored in deterministic problems, and most work on Dyna has only considered the tabular setting. The gridworld is discounted and episodic, with obstacles and one goal; the reward is 0 everywhere except on the transition into the goal, where it is +100. The agent can take four actions. In the stochastic variant, each action takes the agent to the intended next state with probability 0.925, or to one of the other three adjacent states, each with probability 0.025. In the deterministic setting, Dyna uses a table to store the next state and reward for each state and action; in the stochastic setting, it estimates the probabilities of each observed transition via transition counts.
Figure 1 shows the reward accumulated by each agent over time steps. We observe that: 1) Dyna with priorities and predecessors outperformed all variants of ER, and the performance gap increases with gridworld size. 2) TD-error-based prioritization of Dyna's search-control queue improved performance only when combined with the addition of predecessors; otherwise, unprioritized variants outperformed prioritized variants. We hypothesize that this could be due to outdated priorities, previously suggested to be problematic [Peng and Williams, 1993; Schaul et al., 2016]. 3) ER with prioritization performs slightly worse than unprioritized ER variants in the deterministic setting, and its performance degrades considerably in the stochastic setting. 4) On-policy Dyna with priorities and predecessors outperformed the regular variant in the stochastic domain with a larger state space. 5) Dyna with search-control strategies similar to ER's, such as recency and priorities, does not outperform ER; only with the addition of improved search-control strategies is there an advantage. 6) Deleting samples from the queue, or transitions from the buffer, according to recency was always better than deleting according to priority, for both Dyna and ER.
Results for Continuous States: We recreate the above experiments for continuous states, and additionally explore the utility of REMs for Dyna. We compare to using a neural network model (two layers, trained with the Adam optimizer on a sliding buffer of 1000 transitions) and to a linear model predicting features-to-expected-next-features rather than states, as in Linear Dyna. We improved upon the original Linear Dyna by learning a reverse model and sweeping different step-sizes for the models and for the value-function updates.
We conduct experiments in two tasks: a Continuous Gridworld and River Swim. The Continuous Gridworld is a continuous variant of a domain introduced by [Peng and Williams, 1993], with states in $[0, 1]^2$, a sparse reward of 1 at the goal, and a long wall with a small opening. Agents can choose to move 0.05 units up, down, left, or right; the move is executed successfully with high probability, and otherwise the environment executes a random move. Each move is also perturbed by a small amount of noise. River Swim is a difficult exploration domain, introduced as a tabular domain [Strehl and Littman, 2008], that simulates a fish swimming up a river. We modify it to have a continuous state space $[0, 1]$. On each step, the agent can go right or left, with the river pushing the agent towards the left. The right action succeeds with low probability depending on the position, and the left action always succeeds. There is a small reward at the leftmost state (close to 0), and a relatively large reward at the rightmost state (close to 1). The optimal policy is to constantly select right. Because exploration is difficult in this domain, instead of $\epsilon$-greedy we induced a bit of extra exploration by initializing the value-function weights optimistically. For both domains, we use a coarse tile coding, similar to state aggregation.
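To make the setup concrete, a continuous River Swim environment can be sketched as follows; the specific probabilities, step size, and reward magnitudes are assumptions of this sketch, since the text gives only the qualitative structure:

```python
import random

class RiverSwim:
    """Continuous River Swim sketch on [0, 1]: the right action succeeds with
    low, position-dependent probability; the left action always succeeds;
    a small reward sits at the leftmost states and a large one at the
    rightmost states. All numeric constants here are illustrative."""
    def __init__(self, step=0.1):
        self.step = step
        self.s = 0.0

    def reset(self):
        self.s = random.uniform(0.0, 0.1)   # start near the left bank
        return self.s

    def act(self, a):
        if a == 0:                           # left: always succeeds
            self.s = max(0.0, self.s - self.step)
        else:                                # right: harder further upstream
            p_success = 0.3 * (1.0 - 0.5 * self.s)
            if random.random() < p_success:
                self.s = min(1.0, self.s + self.step)
            else:                            # pushed back by the current
                self.s = max(0.0, self.s - self.step)
        if self.s <= 0.05:
            return self.s, 0.005             # small reward, leftmost states
        if self.s >= 0.95:
            return self.s, 1.0               # large reward, rightmost states
        return self.s, 0.0
```

The structure makes the exploration difficulty visible: a greedy agent that latches onto the small left reward rarely strings together enough successful right moves to discover the large one.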
REM-Dyna obtains the best performance on both domains, in comparison to the ER variants and the other model-based approaches. For search control in the continuous-state domains, the results in Figure 2 parallel the conclusions from the tabular case. Among the alternative models, REMs outperform both the linear models and the NN models. For the linear models, the model accuracy was quite low and the step-size selection sensitive; we hypothesize that this additional tuning inadvertently improved the Q-learning update, rather than gaining from Dyna-style planning, and in River Swim Linear Dyna did poorly. Dyna with NNs performs poorly because the NN model is not data-efficient; only after several thousand more learning steps does the model finally become accurate. This highlights the necessity of data-efficient models for Dyna to be effective. In River Swim, no variant of ER was within 85% of optimal within 20,000 steps, whereas all variants of REM-Dyna were, particularly REM-Dyna with predecessors.
6 Conclusion
In this work, we developed a semi-parametric model-learning approach, called Reweighted Experience Models (REMs), for use with Dyna for control in continuous-state settings. We revisited the key dimensions for maintaining Dyna's search-control queue, which decide how to select states and actions from which to sample. These included understanding the importance of using recent samples, prioritizing samples (with absolute TD error), generating predecessor states that lead into high-priority states, and generating on-policy transitions. We compared Dyna to the simpler alternative, Experience Replay (ER), and considered similar design decisions for its transition buffer. We highlighted several criteria for a model to be useful in Dyna, for one-step sampled transitions, namely being data-efficient, being robust to forgetting, enabling conditional models, and being efficient to sample. We developed a new semi-parametric model, the REM, that uses similarities to a representative set of prototypes, and requires only a small set of coefficients to be learned. We provided a simple learning rule for these coefficients, taking advantage of a conditional independence assumption and of the fact that we only require conditional models. We thoroughly investigated the differences between Dyna and ER, in several microworlds with both tabular and continuous states, showing that Dyna can provide significant gains through the use of predecessors and on-policy transitions. We further highlighted that REMs are an effective model for Dyna, compared to using a linear model or a neural network model.
References
[Alain et al., 2016] Guillaume Alain, Yoshua Bengio, Li Yao, Jason Yosinski, Éric Thibodeau-Laufer, Saizheng Zhang, and Pascal Vincent. GSNs: Generative Stochastic Networks. Information and Inference: A Journal of the IMA, 2016.
[Bagnell and Schneider, 2001] J A Bagnell and J G Schneider. Autonomous helicopter control using reinforcement learning policy search methods. In IEEE International Conference on Robotics and Automation, 2001.
[Barreto et al., 2011] A Barreto, D Precup, and J Pineau. Reinforcement Learning using Kernel-Based Stochastic Factorization. In Advances in Neural Information Processing Systems, 2011.
[Barreto et al., 2014] A Barreto, J Pineau, and D Precup. Policy Iteration Based on Stochastic Factorization. Journal of Artificial Intelligence Research, 2014.
[Barreto et al., 2016] A Barreto, R Beirigo, J Pineau, and D Precup. Incremental Stochastic Factorization for Online Reinforcement Learning. In AAAI Conference on Artificial Intelligence, 2016.
[Deisenroth and Rasmussen, 2011] M Deisenroth and C E Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In International Conference on Machine Learning, 2011.
[French, 1999] R M French. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135, 1999.
[Goodfellow et al., 2013] I J Goodfellow, M Mirza, D Xiao, A Courville, and Y Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211, 2013.
[Goodfellow et al., 2014] I J Goodfellow, J Pouget-Abadie, Mehdi Mirza, B Xu, D Warde-Farley, S Ozair, A C Courville, and Y Bengio. Generative Adversarial Nets. In Advances in Neural Information Processing Systems, 2014.
[Grunewalder et al., 2012] Steffen Grunewalder, Guy Lever, Luca Baldassarre, Massi Pontil, and Arthur Gretton. Modelling transition dynamics in MDPs with RKHS embeddings. In International Conference on Machine Learning, 2012.
[Gu et al., 2016] Shixiang Gu, Timothy P Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous Deep Q-Learning with Model-based Acceleration. In International Conference on Machine Learning, 2016.
[Hamilton et al., 2014] W L Hamilton, M M Fard, and J Pineau. Efficient learning and planning with compressed predictive states. Journal of Machine Learning Research, 2014.
[Holmes et al., 2007] Michael P Holmes, Alexander G Gray, and Charles L Isbell. Fast Nonparametric Conditional Density Estimation. In Uncertainty in Artificial Intelligence, 2007.
[Kveton and Theocharous, 2012] B Kveton and G Theocharous. Kernel-Based Reinforcement Learning on Representative States. In AAAI Conference on Artificial Intelligence, 2012.
[Lever et al., 2016] Guy Lever, John Shawe-Taylor, Ronnie Stafford, and Csaba Szepesvári. Compressed Conditional Mean Embeddings for Model-Based Reinforcement Learning. In AAAI Conference on Artificial Intelligence, 2016.
[Lin, 1992] Long-Ji Lin. Self-Improving Reactive Agents Based On Reinforcement Learning, Planning and Teaching. Machine Learning, 1992.
[Mairal et al., 2014] Julien Mairal, Piotr Koniusz, Zaid Harchaoui, and Cordelia Schmid. Convolutional Kernel Networks. In Advances in Neural Information Processing Systems, 2014.
[McCloskey and Cohen, 1989] Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation, 24:109–165, 1989.
[Moore and Atkeson, 1993] Andrew W Moore and Christopher G Atkeson. Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 13(1):103–130, 1993.
[Ormoneit and Sen, 2002] Dirk Ormoneit and Śaunak Sen. Kernel-Based Reinforcement Learning. Machine Learning, 2002.
[Peng and Williams, 1993] Jing Peng and Ronald J Williams. Efficient Learning and Planning Within the Dyna Framework. Adaptive Behavior, 1993.
[Pires and Szepesvári, 2016] Bernardo Avila Pires and Csaba Szepesvári. Policy Error Bounds for Model-Based Reinforcement Learning with Factored Linear Models. In Annual Conference on Learning Theory, 2016.
[Schaul et al., 2016] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized Experience Replay. In International Conference on Learning Representations, 2016.
[Schlegel et al., 2017] Matthew Schlegel, Yangchen Pan, Jiecao Chen, and Martha White. Adapting Kernel Representations Online Using Submodular Maximization. In International Conference on Machine Learning, 2017.
[Sohn et al., 2015] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning Structured Output Representation using Deep Conditional Generative Models. In Advances in Neural Information Processing Systems, 2015.
[Strehl and Littman, 2008] A Strehl and M Littman. An analysis of model-based Interval Estimation for Markov Decision Processes. Journal of Computer and System Sciences, 2008.
[Sutton and Barto, 1998] R S Sutton and A G Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[Sutton et al., 2008] R Sutton, C Szepesvári, A Geramifard, and M Bowling. Dyna-style planning with linear function approximation and prioritized sweeping. In Conference on Uncertainty in Artificial Intelligence, 2008.
[Sutton, 1991] R S Sutton. Integrated modeling and control based on reinforcement learning and dynamic programming. In Advances in Neural Information Processing Systems, 1991.
[Talvitie, 2017] Erik Talvitie. Self-Correcting Models for Model-Based Reinforcement Learning. In AAAI Conference on Artificial Intelligence, 2017.
[Van Hoof et al., 2015] H Van Hoof, J Peters, and G Neumann. Learning of Non-Parametric Control Policies with High-Dimensional State Features. In AI and Statistics, 2015.
[van Seijen and Sutton, 2015] Harm van Seijen and Richard S Sutton. A Deeper Look at Planning as Learning from Replay. In International Conference on Machine Learning, 2015.
[White, 2017] Martha White. Unifying Task Specification in Reinforcement Learning. In International Conference on Machine Learning, 2017.
 [van Seijen and Sutton2015] H van Seijen and R.S. Sutton. A deeper look at planning as learning from replay. In International Conference on Machine Learning, 2015.
 [White2017] Martha White. Unifying task specification in reinforcement learning. In International Conference on Machine Learning, 2017.
 [Yao et al.2014] Hengshuai Yao, Csaba Szepesvári, Bernardo Avila Pires, and Xinhua Zhang. PseudoMDPs and factored linear action models. In ADPRL, 2014.
Appendix A Consistency of conditional probability estimators
Theorem 1 Let be the similarity of for sample to for prototype . Then
is a consistent estimator of .
Proof.
The closedform solution for this objective is
As ,
with expectation according to . The second equality holds because (a) all three limits exist and (b) the limit of the denominator is not zero: .
Expanding out these expectations, where by the symmetry of the kernel, we get
Similarly,
Therefore,
and so converges to as . ∎
Appendix B REM Algorithmic Details
Algorithm 1 summarizes REM-Dyna, our online algorithm for learning, acting, and sample-based planning. Supporting pseudocode, for sampling and updating REMs, is given in Section B.3 below. We include Experience Replay with Priorities in Algorithm 2, for comparison. We additionally include a diagram highlighting the difference between KDE and REM for approximating densities, in Figure 3.
For the queue and buffer, we maintain a circular array. When a sample is added to the array, with priority , it is placed in the slot holding the oldest transition. When a state-action pair or transition is sampled with priority from the array, it is used to update the weights, and its entry in the array is assigned its new priority; it is not removed from the array, only updated. Array elements are removed only once they are the oldest, implemented by incrementing the insertion index each time a new point is added to the array.
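The circular-array bookkeeping described above can be sketched as follows. This is a minimal illustration of the data structure only (the class and method names are our own, not from the paper): new items overwrite the oldest slot, and sampled items remain in place with a refreshed priority.

```python
import numpy as np

class CircularPriorityBuffer:
    """Fixed-size circular array with priorities (illustrative sketch).
    Adding overwrites the oldest slot; sampling re-prioritizes in place."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = [None] * capacity
        self.priorities = np.zeros(capacity)
        self.next_idx = 0   # slot holding the oldest item, next to overwrite
        self.size = 0

    def add(self, item, priority):
        # Overwrite the oldest slot; incrementing next_idx implements removal.
        self.items[self.next_idx] = item
        self.priorities[self.next_idx] = priority
        self.next_idx = (self.next_idx + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, rng=None):
        # Sample an index with probability proportional to its priority.
        rng = np.random.default_rng() if rng is None else rng
        p = self.priorities[:self.size]
        idx = rng.choice(self.size, p=p / p.sum())
        return idx, self.items[idx]

    def update_priority(self, idx, new_priority):
        # The sampled element is not removed, only re-prioritized.
        self.priorities[idx] = new_priority
```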
B.1 Computing Conditional Covariances
To sample from REMs, as in Algorithm 5, we need to be able to compute the conditional covariance. Recall that
This is not necessarily the true conditional distribution over , but it is the conditional distribution under our model.
It is straightforward to sample from this model, using as the coefficients, shown in Algorithm 5. The key detail is computing a conditional covariance, described below.
Given sample , the conditional mean is
Similarly, we can compute the conditional covariance
(2)  
This conditional covariance matrix more accurately reflects the distribution over , given . A covariance over would be significantly larger than this conditional covariance, since it would reflect the variability across the whole state space, rather than for a given . For example, in a deterministic domain, the conditional covariance is zero, whereas the covariance of across the space is not. If one consistent covariance is desired for , a reasonable choice is to compute a running average of conditional covariances across observed .
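As a concrete illustration of this style of conditional sampling, the sketch below draws a next state from a Gaussian-kernel mixture conditioned on the current state: the coefficients are normalized similarities to the prototype states, a prototype is selected with those probabilities, and its stored successor is perturbed by the (much tighter) conditional covariance. All names here (`sample_next_state`, `proto_s`, `proto_sp`, `cond_cov`, the bandwidth) are hypothetical, and this is a simplification of REM's actual sampling routine (Algorithm 5), not a reproduction of it.

```python
import numpy as np

def sample_next_state(s, proto_s, proto_sp, cond_cov, bandwidth=0.5, rng=None):
    """Illustrative conditional sampling from a Gaussian-kernel mixture.

    proto_s:  (n, d) array of prototype states s_i
    proto_sp: (n, d) array of stored successor states s'_i
    cond_cov: (d, d) conditional covariance of s' given s
    """
    rng = np.random.default_rng() if rng is None else rng
    # Mixture coefficients: normalized Gaussian similarities k(s, s_i).
    d2 = np.sum((proto_s - s) ** 2, axis=1)
    beta = np.exp(-d2 / (2.0 * bandwidth ** 2))
    beta /= beta.sum()
    # Pick a prototype with probability beta_i, then sample around its s'_i
    # using the conditional covariance (zero in a deterministic domain).
    i = rng.choice(len(beta), p=beta)
    return rng.multivariate_normal(proto_sp[i], cond_cov)
```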
B.2 Details on Prototype Selection
We use a prototype selection strategy, based on submodular maximization, that maximizes the diversity of the prototype set [Schlegel et al.2017]. There are a number of parameters, but they are intuitive to set and did not require sweeps. The algorithm begins by adding the first samples as prototypes, to fill the budget of prototypes. Then, it starts to swap out the least useful prototypes as new transitions are observed. The algorithm adds new prototypes if they are sufficiently different from previous prototypes and increase the diversity of the set. The utility-increase threshold is set to ; this threshold simply avoids swapping too frequently, which is computationally expensive, rather than having much impact on the quality of the solution.
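The swap step of this diversity-based selection can be sketched as follows, using the log-determinant of the Gram matrix as a stand-in diversity measure. This is a simplified, non-incremental illustration under our own assumptions (`gram`, `logdet_utility`, `maybe_swap`, and the Gaussian kernel bandwidth are hypothetical names and choices); the algorithm of Schlegel et al. maintains these quantities incrementally rather than recomputing them.

```python
import numpy as np

def gram(points, bandwidth=1.0):
    """Gaussian-kernel Gram matrix over a set of prototypes."""
    d2 = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def logdet_utility(K):
    """Diversity of a prototype set: log-determinant of its (ridged) Gram matrix."""
    _, logdet = np.linalg.slogdet(K + 1e-6 * np.eye(len(K)))
    return logdet

def maybe_swap(prototypes, candidate, threshold=0.01):
    """Swap the least useful prototype for the candidate, but only if the
    diversity utility increases by at least `threshold` (avoids frequent,
    computationally expensive swaps)."""
    base = logdet_utility(gram(prototypes))
    best_gain, best_i = 0.0, None
    for i in range(len(prototypes)):
        trial = prototypes.copy()
        trial[i] = candidate          # try replacing prototype i
        gain = logdet_utility(gram(trial)) - base
        if gain > best_gain:
            best_gain, best_i = gain, i
    if best_i is not None and best_gain >= threshold:
        prototypes = prototypes.copy()
        prototypes[best_i] = candidate
    return prototypes
```

A candidate far from every existing prototype yields a large utility gain and displaces a redundant prototype, while a near-duplicate candidate is rejected.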
A component of this algorithm is k-means clustering, used to make the update more efficient. We perform k-means clustering using the distance metric
, where is the empirical covariance matrix for transitions . The points are clustered into blocks, to speed up the computation of the log-determinant. The clustering is re-run every swaps, but this is efficient to do, since it is started from the previous clustering and only a few iterations need to be executed.

B.3 Additional Pseudocode for REMs
The pseudocode for the remaining algorithms is included below, in Algorithms 3–6. Note that the implementation for REMs can be made much faster by using KD-trees to find nearest points, but we do not include those details here.