Off-Policy Shaping Ensembles in Reinforcement Learning

05/21/2014 ∙ by Anna Harutyunyan, et al. ∙ Vrije Universiteit Brussel

Recent advances in gradient temporal-difference methods make it possible to learn multiple value functions off-policy and in parallel without sacrificing convergence guarantees or computational efficiency. This opens up new possibilities for sound ensemble techniques in reinforcement learning. In this work we propose learning an ensemble of policies related through potential-based shaping rewards. The ensemble induces a combination policy by using a voting mechanism on its components. Learning happens in real time, and we empirically show the combination policy to outperform the individual policies of the ensemble.




1 Introduction

Reinforcement learning (RL) is a framework [24] in which an agent learns from interacting with its (typically Markovian) environment. The bulk of RL algorithms focus on the on-policy setup, in which the agent learns only about the policy it is executing. While the off-policy setup, in which the agent’s behavior and target policies are allowed to differ, is arguably more versatile, its use in practice has been hindered by the convergence issues that arise when it is combined with function approximation (a likely scenario, given any reasonable problem); e.g. the popular Q-learning potentially diverges [1]. This issue was recently resolved by the family of gradient temporal-difference methods, such as Greedy-GQ [18]. An interesting implication of this is the possibility of learning multiple tasks in parallel from a shared experience stream in a sound framework, an architecture dubbed Horde by Sutton et al. [26]. In the spirit of ensemble methods [31], we use this idea in the context of learning a single task faster. Our larger aim is to devise ensembles of policies that improve the (off-policy) learning speed of a task online in real time, without incurring extra sample or computational costs.

We choose the policies in our ensemble to be related through potential-based reward shaping. Reward shaping is a well-known technique to speed up the learning process by injecting domain knowledge into the reward function. The idea of considering multiple shaping signals instead of a single one is relatively recent: Devlin et al. observe that it improves performance in the multi-agent context [6], and Brys et al., using a multi-objectivization formalism, demonstrate its usefulness while treating different shapings as correlated objectives [4].

The scenario we consider in this paper is that of off-policy learning under fixed behavior, a scenario Maei et al. [18] refer to as latent learning. This is often the setup in applications where the environment samples are costly and a failure is highly penalized, making the usual trial and error tactic implausible, e.g. robotic applications. One can imagine an agent executing a safe exploratory policy, while learning control policies for a variety of tasks.

We note that even though the effects of reward shaping in this latent learning context are bound to be limited, since a large part of its benefits lie in guiding exploration during learning, we witness a significant rise in performance, making this a validation of the effectiveness of reward shaping purely as a means of faster knowledge propagation. See Section 3 for a discussion.

Unlike the existing ensembles in RL, this is the first policy ensemble architecture capable of learning online in real-time and sound w.r.t. convergence in realistic setups – guarantees provided by Horde [26]. The limitation (as with Horde in general) is that it can only be applied in the latent learning setup, to ensure convergence.


In the following section, we give a brief overview of definitions and notation. Section 3 further motivates the use of Horde and multiple shaping signals to form our ensemble. Section 4 summarizes our architecture, and describes the rank voting mechanism used for combining policies. Section 5 gives experimental results in the mountain car domain, and Section 6 concludes and discusses future work directions.

2 Background

The environment of an RL agent is usually modeled as a Markov Decision Process (MDP) [23] given by a 4-tuple $(S, A, T, R)$, where $S$ is the set of states, $A$ is the set of actions available to the agent, $T$ is the transition function with $T(s, a, s')$ denoting the probability of ending up in state $s'$ upon taking action $a$ in state $s$, and $R$ is the reward function with $R(s, a, s')$ denoting the expected reward on the transition from $s$ to $s'$ upon taking action $a$. The Markovian assumption is that $s_{t+1}$ and the reward $r_{t+1}$ only depend on $s_t$ and $a_t$, where $t$ denotes the discrete time step. A stochastic policy $\pi: S \times A \to [0, 1]$ defines a probability distribution for actions in each state:

$$\pi(s, a) = \Pr(a_t = a \mid s_t = s) \tag{1}$$

Value functions estimate the utility of policies via their expected cumulative reward. In the discounted setting, the state-action value function $Q^{\pi}$ is given by:

$$Q^{\pi}(s, a) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \,\Big|\, s_0 = s,\ a_0 = a\Big] \tag{2}$$

where $\gamma$ is the discounting factor, and $Q$ is stored as a table with an entry for each state-action pair.

A policy is optimal if its value is maximized for all state-action pairs. Solving an MDP implies finding the optimal policy. When the environment dynamics (given by $T$ and $R$) are unknown, one can solve the MDP by applying the family of temporal difference (TD) algorithms [24] to iteratively estimate the value functions. The following is the update rule of the popular Q-learning method in its simplest form [30]:

$$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha_t \delta_t \tag{3}$$

$$\delta_t = r_{t+1} + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q_t(s_t, a_t) \tag{4}$$

where $r_{t+1}$ is the reward received at the transition $(s_t, a_t, s_{t+1})$, $\alpha_t$ is the learning rate or step size, $\delta_t$ is the TD error and $s_{t+1}$ is drawn according to $T$ given $s_t$ and $a_t$. Eligibility traces controlled by a trace decay parameter $\lambda$ can be used as a way to speed up knowledge propagation [24].

Jaakkola et al. [13] show that in the tabular case this process converges to the optimal solution, under standard stochastic approximation assumptions.
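As a minimal illustration of the tabular Q-learning step described above, the following Python sketch applies one update to a dictionary-backed table. The function name, the default step size and discount, and the toy transition are assumptions made for this example only, not values taken from the experiments.

```python
from collections import defaultdict


def q_learning_step(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update in place and return the TD error.

    alpha and gamma are illustrative defaults, not the paper's settings.
    """
    # Bootstrap from the greedy (max-valued) action in the next state.
    best_next = max(Q[(s_next, b)] for b in actions)
    delta = r + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * delta
    return delta


# Hypothetical transition: states are plain labels, entries default to 0.
Q = defaultdict(float)
actions = [0, 1]
delta = q_learning_step(Q, s="s0", a=1, r=-1.0, s_next="s1", actions=actions)
```

Because the target uses the maximum over next actions rather than the action actually taken next, the update is off-policy: the behavior that generated the transition does not have to be greedy.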

When the state or action spaces are too large, or continuous, tabular representations do not suffice and one needs to use function approximation (FA). The state (or state-action) space is then represented through a set of features $\phi: S \times A \to \mathbb{R}^n$, and the algorithms learn the value of a parameter vector $\theta \in \mathbb{R}^n$. In the (common) linear case:

$$Q_t(s, a) = \theta_t^{\top} \phi(s, a) \tag{5}$$

and (3) becomes:

$$\theta_{t+1} = \theta_t + \alpha_t \delta_t \phi_t \tag{6}$$

where we slightly abuse notation by letting $\phi_t$ denote the state-action features $\phi(s_t, a_t)$, and $\delta_t$ is still computed according to (4).

2.1 Horde

Unfortunately, FA can cause off-policy bootstrapping methods, such as Q-learning, to diverge even on simple problems [1, 27]. The family of gradient temporal-difference (GTD) algorithms resolves this issue for the first time, while keeping the constant per-step complexity, provided a fixed (or slowly changing) behavior [25, 17]. They accomplish this by performing gradient descent on a reformulated objective function, which ensures convergence to the TD fixpoint by introducing a gradient bias into the TD update (see Maei’s dissertation [16] for the full details). Mechanistically, it requires maintaining and learning a second set of weights $w$, along with $\theta$, with the following update rules:

$$\theta_{t+1} = \theta_t + \alpha_t \delta_t \phi_t - \alpha_t \gamma \phi'_t (\phi_t^{\top} w_t) \tag{7}$$

$$w_{t+1} = w_t + \beta_t (\delta_t - \phi_t^{\top} w_t) \phi_t \tag{8}$$

where $\phi'_t$ denotes the features of the next state-action pair and $\beta_t$ is the step size for $w$. This is the simplest form of the update rules for gradient temporal-difference algorithms, namely that of TDC [25]; GQ($\lambda$) augments this update with eligibility traces.


Off-policy learning allows one to learn about any policy, regardless of the behavior policy being followed. One then need not be limited to a single policy, and may learn about an arbitrary number of policies from a single stream of environment interactions (or experience), with computational considerations being the bottleneck. GTD methods not only reliably converge in realistic setups (with FA), but unlike second order algorithms with similar guarantees (e.g. LSTD [2]), run in constant time and memory per step, and are hence scalable. Sutton et al. [26] formalize a framework of parallel real-time off-policy learning, naming it Horde. They demonstrate Horde being able to learn a set of predictive and goal-oriented value functions in real time from a single unsupervised stream of sensorimotor experience. (Sutton et al. [26] give Horde in terms of general value functions, each with 4 auxiliary inputs: $\pi$, $\gamma$, $r$, $z$. In this paper we always assume $\pi$ to be the greedy policy w.r.t. $r$, $\gamma$ and $z$ to be shared between all demons, and $r$ to be related to the base reward via a shaping reward.) There have been further successful applications of Horde in realistic robotic setups [22]. We take a different angle to the existing literature in an attempt to use the power of Horde for learning about a single task from multiple viewpoints.
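The idea of several learners consuming one shared transition can be sketched in a few lines of Python. The sketch below is a simplification in the spirit of TDC with a greedy target: eligibility traces and the remaining details of Greedy-GQ($\lambda$) are omitted, and the class name `Demon`, the step sizes and the toy features are all assumptions made for illustration.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


class Demon:
    """One off-policy learner with its own reward signal (illustrative only)."""

    def __init__(self, n_features, shaping=None, alpha=0.1, beta=0.01, gamma=0.99):
        self.theta = [0.0] * n_features  # value-function weights
        self.w = [0.0] * n_features      # second weight vector for the gradient correction
        self.shaping = shaping or (lambda s, s_next: 0.0)
        self.alpha, self.beta, self.gamma = alpha, beta, gamma

    def q(self, phi):
        return dot(self.theta, phi)

    def update(self, phi, r_base, next_phis, s, s_next):
        # Each demon bootstraps from its own greedy action in the next state.
        phi_next = max(next_phis, key=self.q)
        r = r_base + self.shaping(s, s_next)
        delta = r + self.gamma * self.q(phi_next) - self.q(phi)
        w_phi = dot(self.w, phi)
        for i in range(len(self.theta)):
            self.theta[i] += self.alpha * (delta * phi[i] - self.gamma * w_phi * phi_next[i])
            self.w[i] += self.beta * (delta - w_phi) * phi[i]
        return delta


# One transition, generated by some fixed behavior policy, is shared by all demons.
demons = [
    Demon(3),                                               # learns on the base reward
    Demon(3, shaping=lambda s, s_next: 0.99 * s_next - s),  # hypothetical potential Phi(s) = s
]
phi = [1.0, 0.0, 0.0]
next_phis = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # one feature vector per next action
deltas = [d.update(phi, -1.0, next_phis, s=0.0, s_next=0.5) for d in demons]
```

Each demon costs a constant amount of work per step, so adding demons scales the per-step cost linearly while the number of environment samples stays the same.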

2.2 Reward shaping

Reward shaping augments the true reward signal with an additional heuristic reward, provided by the designer. It was originally thought of as a way of scaling up RL methods to handle difficult problems [7], as RL generally suffers from infeasibly long learning times. If applied carelessly, however, shaping can slow down or even prevent finding the optimal policy [28]. Ng et al. [21] show that grounding the shaping rewards in state potentials is both necessary and sufficient for ensuring preservation of the (optimal) policies of the original MDP. Potential-based reward shaping maintains a potential function $\Phi: S \to \mathbb{R}$, and defines the auxiliary reward function $F$ as:

$$F(s, a, s') = \gamma \Phi(s') - \Phi(s) \tag{9}$$

where $\gamma$ is the main discounting factor. Intuitively, potentials are a way to encode the desirability of a state, and the shaping reward on a transition signals positive or negative progress towards desirable states. Potential-based shaping has been repeatedly validated as a way to speed up learning in problems with uninformative rewards [11].
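A potential-based shaping reward takes the form $F(s, a, s') = \gamma \Phi(s') - \Phi(s)$ [21], which is simple enough to write down directly. The sketch below is illustrative: the helper names and the one-dimensional chain with a `closeness_to_goal` potential are hypothetical and only serve to show the sign of the signal.

```python
def shaping_reward(potential, s, s_next, gamma):
    """Potential-based shaping reward: gamma * Phi(s') - Phi(s)."""
    return gamma * potential(s_next) - potential(s)


def shaped_reward(base_reward, potential, s, s_next, gamma):
    """Base reward augmented with the shaping reward."""
    return base_reward + shaping_reward(potential, s, s_next, gamma)


# Hypothetical 1-D chain: states are integers 0..10 and the goal is state 10.
def closeness_to_goal(s):
    return s / 10.0


gamma = 0.99
forward = shaped_reward(-1.0, closeness_to_goal, 4, 5, gamma)   # step towards the goal
backward = shaped_reward(-1.0, closeness_to_goal, 5, 4, gamma)  # step away from the goal
```

With an uninformative base reward of -1 on every step, the two transitions are indistinguishable before shaping; afterwards the step towards the goal receives the larger reward.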

We refer to the rewards augmented with shaping signals as shaped rewards, the value functions w.r.t. them as shaped value functions, and the greedy policies induced by the shaped value functions as shaped policies. Shaped policies converge to the same (optimal) policy as the base policy, but differ during the learning process.
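A short derivation, following the argument of Ng et al. [21], shows why shaped and base policies agree while their value functions differ. It assumes a bounded potential $\Phi$ and $\gamma < 1$, so that $\gamma^{t} \Phi(s_t)$ vanishes in the limit; $Q^{\pi}_{\Phi}$ is notation introduced here for the value function under the shaped reward.

```latex
\begin{aligned}
Q^{\pi}_{\Phi}(s, a)
  &= \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t}
     \big(r_{t+1} + \gamma \Phi(s_{t+1}) - \Phi(s_t)\big)
     \,\Big|\, s_0 = s,\ a_0 = a\Big] \\
  &= \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_{t+1}
     \,\Big|\, s_0 = s,\ a_0 = a\Big] - \Phi(s) \\
  &= Q^{\pi}(s, a) - \Phi(s)
\end{aligned}
```

The potential terms telescope, leaving an offset that depends only on the state. Since that offset is the same for every action, the greedy action in each state is unchanged, while the magnitudes of the shaped value functions differ from one potential to another.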

3 Ensembles of Shapings

In this section we further motivate why we find Horde to be a well-suited framework for ensemble learning by surveying ensemble methods in reinforcement learning, and argue why policies obtained by potential-based reward shaping are good candidates for such an ensemble.

Ensemble techniques such as boosting [9] and bagging [3] are widely used in supervised learning as effective methods to reduce bias and variance of solutions. The use of ensembles in RL has been extremely sparse thus far. Most previous uses of ensembles of policies involved independent runs for each policy, with the combination happening post-factum [8]. This is limited in practical usage, since it requires a large computational and sample overhead, assumes a repeatable setup, and does not improve learning speed. Others, in general, lack convergence guarantees (see the discussion on convergence in Section 6.1.2 of van Hasselt’s dissertation [29]), either using mixed on- and off-policy learners [31], or Q-learners under function approximation [4]. In general, an off-policy setup seems inevitable when considering ensembles of policies; it is surely only interesting if the policies reflect information different from the behavior, since the strength of ensemble learning lies in the diversity of information its components contribute [14]. Q-learning in this setup is not reliable in the presence of FA. While the unofficial mantra is that in practice under a sufficiently similar (e.g. $\epsilon$-greedy) policy, Q-learning used with FA does not diverge, even despite the famous counterexamples [1, 27], ensembles of diverse Q-learners are bound to have larger disagreement amongst themselves and with the behavior policy, and have a much larger potential of becoming unstable. (See the discussion in Section 8.5 of Sutton and Barto [24] relating the potential to diverge to the proximity of behavior and target policies.) To the best of our knowledge, there have been no formal results on this topic.

The ability to learn multiple policies reliably in parallel in a realistic setup is provided by the Horde architecture. For this reason, we believe Horde to be an ideally suited framework for ensemble learning in RL.

Now we turn to the question of the choice of components of our ensemble. Recall that our larger aim is to use ensembles to speed up learning of a single task in real time. Krogh and Vedelsby [14] show in the context of neural networks that effective ensembles have accurate and diverse components, namely that they make their errors at different parts of the space. In the RL context this diversity can be expressed through several aspects, related to dimensions of the learning process: (1) diversity of experience, (2) diversity of algorithms and (3) diversity of reward signals. Diversity of experience naturally implies high sample complexity, and assumes either a multi-agent setup, or learning in stages. Diversity of algorithms may run into convergence issues, unless all algorithms are sound off-policy, by the argument above. Marivate and Littman [19] consider diversity of MDPs, by improving performance in a generalized MDP through an ensemble trained on sample MDPs, which also requires a two-stage learning process. In the context of our aim of improving learning speed, we focus on the last aspect of diversity: diversity of reward signals.

As discussed in Section 2.2, potential-based reward shaping provides a framework for enriching the base reward by incorporating heuristics that express the desirability of states. One can usually think of multiple such heuristics for a single problem, each effective in different situations. Combining them naïvely, e.g. with linear scalarization on the potentials, may be uninformative since the heuristics may counterweigh each other at some parts of the space, and “cancel out”. On the other hand, it is typically infeasible for the designer to handcode all tradeoffs without executing each shaping separately. Horde provides a sound framework to learn and maintain all of the shapings in parallel, enabling the possibility of using any (scale free) ensemble methods for combination.

Shaping off-policy

We note that we are straying from convention in using reward shaping in an off-policy latent learning setup. The effects of reward shaping on the learning process are usually considered to lie in the guidance of exploration during learning [10, 20, 21]. Laud and DeJong [15] formalize this by showing that the difficulty of learning is most dependent on the reward horizon, a measure of the number of decisions a learning agent must make before experiencing accurate feedback, and that reward shaping artificially reduces this horizon. In our setting we assume no control over the agent’s behavior, and the performance benefits in Section 5 must be explained by a different effect. Namely, shaping rewards in the TD updates aid faster knowledge propagation, which we now observe decoupled from guidance of exploration due to the off-policy latent learning setup.

In the next section we describe the exact architecture used for this paper, and the combination method we chose.

4 Architecture

We maintain our Horde of shapings as a set of Greedy-GQ($\lambda$)-learners. The reward function is a vector: $\mathbf{R} = \langle R + F_0, R + F_1, \ldots, R + F_m \rangle$, where $F_0 \equiv 0$ ($d_0$ always learns on the base reward alone), and $F_1, \ldots, F_m$ are potential-based rewards given by (9) on potentials $\Phi_1, \ldots, \Phi_m$ provided by the designer. We adopt the terminology of Sutton et al. [26], and refer to individual agents within Horde as demons. Each demon $d_i$ learns a greedy policy $\pi_i$ w.r.t. its reward $R_i = R + F_i$. We refer to the demons learning on shaped rewards as shaped demons.

At any point of learning, we can devise a combination policy by collecting votes on action preferences from all shaped demons ($d_i$ with $i \geq 1$). Wiering et al. [31] discuss several intuitive ways to do so, e.g. majority voting, rank voting, Boltzmann multiplication, etc. We describe rank voting used in this paper, but in general the choice of ensemble combination is up to the designer, and may depend on the specifics of the problem and architecture. Even though the base demon does not contribute a vote, we maintain it as a part of the ensemble.

Figure 1: A rough overview of the Horde architecture used to learn an ensemble of shapings. The blue output of the linear function approximation block are the features of the transition (two state-action pairs), with their intersections with the demons representing weights. The vector of greedy actions at the next state is taken w.r.t. each demon's policy. Note that all interactions with the environment happen only in the upper left corner.

Rank voting

Each demon (except for $d_0$) ranks its actions according to its greedy policy, casting its highest vote for its most preferred action and its lowest vote for its least preferred one. The voting schema then is defined for policies, rather than value functions, which mitigates the magnitude bias. (Note that even though the shaped policies are the same upon convergence, the value functions are not.) We slightly modify the formulation from [31] by ranking Q-values instead of policy probabilities, i.e. let $\rho_i$ be the ranking map of demon $d_i$. Then $\rho_i(s, a) > \rho_i(s, a')$ if and only if $Q_i(s, a) > Q_i(s, a')$. The combination or ensemble policy acts greedily w.r.t. the cumulative preference values $P$ over the shaped demons $d_1, \ldots, d_m$:

$$P(s, a) = \sum_{i=1}^{m} \rho_i(s, a) \tag{10}$$
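Rank voting over Q-values can be sketched as follows. This is an illustration under stated assumptions rather than the exact scheme used in the experiments: votes are taken to run from 0 (least preferred) to n-1 (most preferred), ties are broken by action index, and the function names and Q-values are made up for the example.

```python
def rank_votes(q_values):
    """Map one demon's Q-values to votes: 0 for the lowest-valued action, n-1 for the highest."""
    order = sorted(range(len(q_values)), key=lambda a: q_values[a])
    votes = [0] * len(q_values)
    for rank, action in enumerate(order):
        votes[action] = rank
    return votes


def combination_action(shaped_q_values):
    """Return the greedy action w.r.t. the summed votes, together with the vote totals."""
    n_actions = len(shaped_q_values[0])
    totals = [0] * n_actions
    for q_values in shaped_q_values:
        for action, vote in enumerate(rank_votes(q_values)):
            totals[action] += vote
    best = max(range(n_actions), key=lambda a: totals[a])
    return best, totals


# Three shaped demons, three actions; value magnitudes differ widely between demons.
shaped_q = [
    [-120.0, -80.0, -100.0],   # prefers action 1
    [-3.0, -2.5, -2.0],        # prefers action 2
    [-900.0, -400.0, -650.0],  # prefers action 1
]
action, totals = combination_action(shaped_q)
```

Because only the ordering of each demon's Q-values enters the vote, the demon with values in the hundreds carries no more weight than the one with values near -3, which is the magnitude bias that ranking is meant to avoid.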


In the next section we validate our approach on the typical mountain car benchmark and interpret the results.

5 Experiments

In this section we give comparison results between the individuals in our ensemble, and the combination policy. We remind the reader that while all policies eventually arrive at the same (optimal) solution, our focus is the time it takes them to get there.

We focus our attention on the classical benchmark domain of mountain car [24]. The task is to drive an underpowered car up a hill (Fig. 2). The (continuous) state of the system is composed of the current position and the current velocity of the car, each bounded. Actions are discrete throttle settings. The agent starts from a fixed position and velocity, and the goal is a fixed position to the right. The agent receives the same negative reward at every time step. An episode ends when the goal is reached, or when 2000 steps have elapsed. (Note the significantly shorter lifetime of an episode here, as compared to results in Degris et al. [5]; since the shaped rewards are more informative, they can get by with very rarely reaching the goal.) The state space is approximated with the standard tile-coding technique [24], using ten tilings, with a parameter vector learnt for each action. The behavior policy is a uniform distribution over all actions at each time step.

Figure 2: The mountain car problem. The mountain height is given as a function of the car's position.

In this domain we define three intuitive shaping potentials, each normalized to a fixed range.

Right shaping.

Encourage progress to the right (in the direction of the goal). This potential is flawed by design, since in order to get to the goal, one needs to first move away from it.

Height shaping.

Encourage higher positions (potential energy), where height is computed according to the formula in Fig. 2.

Speed shaping.

Encourage higher speeds (kinetic energy).


Here $\mathbf{x}$ is the state (position and velocity), and $\mathbf{c}$ is a vector of tuned scaling constants. (The scaling of potentials is in general a challenging problem in reward shaping research. Finding the right scaling factor requires a lot of a priori tuning, and the factor is generally assumed constant over the state space. The scalable nature of Horde could be used to lift this problem, by learning multiple preset scales for each potential, and combining them via either a voting method like the one described here, or a meta-learner. See Section 6.)
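To make the three potentials concrete, here is a hedged Python sketch. Everything numeric in it is an assumption: the position and velocity bounds are the usual textbook mountain-car values, the height profile sin(3x) is the common convention for this domain, the potentials are normalized to [0, 1], speed is measured as absolute velocity, and the scaling constants default to 1. None of these are the tuned values used in the experiments.

```python
import math

# Assumed textbook mountain-car bounds (not restated from the experiments).
X_MIN, X_MAX = -1.2, 0.6
V_MAX = 0.07


def _unit(value, low, high):
    """Linearly rescale value from [low, high] to [0, 1]."""
    return (value - low) / (high - low)


def right_potential(x, v):
    return _unit(x, X_MIN, X_MAX)


def height_potential(x, v):
    # sin(3x) is the usual mountain-car height profile (an assumption here).
    return _unit(math.sin(3 * x), -1.0, 1.0)


def speed_potential(x, v):
    # Absolute velocity as a stand-in for kinetic energy (an assumption here).
    return _unit(abs(v), 0.0, V_MAX)


POTENTIALS = (right_potential, height_potential, speed_potential)


def shaping_rewards(state, next_state, gamma=0.99, scales=(1.0, 1.0, 1.0)):
    """Return the three potential-based shaping rewards for one transition."""
    return [c * (gamma * p(*next_state) - p(*state)) for c, p in zip(scales, POTENTIALS)]


# Accelerating to the right from rest, then the mirror-image move to the left.
to_right = shaping_rewards((-0.5, 0.0), (-0.49, 0.01))
to_left = shaping_rewards((-0.5, 0.0), (-0.51, -0.01))
```

In this sketch the right potential penalizes the move to the left even though backing up is needed to build momentum, whereas the speed potential rewards both moves equally; this is the sense in which the right shaping is flawed by design.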

Thus our architecture has 4 demons: $D = \{d_0, d_1, d_2, d_3\}$, where $d_0$ learns on the base reward, and the others on their respective shaping rewards. The combination policy is formed via rank voting, which we found to outperform both majority voting and a variant of Q-value voting on this problem.

The third (speed) shaping turns out to be the most helpful universally. If this is the case one would likely prefer to just use that single shaping on its own, but we assume such information is not available a priori, which is a more realistic (and challenging) situation. To make our experiment more interesting we consider two scenarios: with and without this best shaping. Ideally we would like our combination method to be able to outperform the two comparable shapings in the first scenario, and pick out the best shaping in the second scenario.

The discount factor $\gamma$ was held fixed. The learning parameters $\lambda$, $\beta$ and $\boldsymbol{\alpha}$ were tuned, where $\lambda$ is the trace decay parameter, $\beta$ the step size for the second set of weights in Greedy-GQ, and $\boldsymbol{\alpha}$ the vector of step sizes for the value functions of our demons. (These were tuned individually, as the value functions differ in magnitude.) We ran 1000 independent runs of 100 episodes each. The evaluation was done by interrupting the off-policy learner every 5 episodes, and executing each demon’s greedy policy once. No learning was allowed during evaluation. The graphs reflect the average base reward. The initial and final performance refer to the first and last 20% of a run.

(a) Scenario 1, with two comparable shapings. The combination is able to follow the right shaping in the beginning (where it is best), then switch to the height shaping.
(b) Scenario 2, with one clearly superior shaping. The combination is able to pick out the best shaping and follow it.
Figure 3: Learning curves of the policies in the ensemble in mountain car

| Variant | Cumulative | Initial | Final |
|---|---|---|---|
| No shaping | -336.3 ± 279.5 | -784.7 ± 385.9 | -185.1 ± 9.9 |
| Right shaping | -310.4 ± 96.9 | -378.5 ± 217.4 | -290.3 ± 19.3 |
| Height shaping | -283.2 ± 205.2 | -594.2 ± 317.0 | -182.3 ± 7.5 |
| Combination | -211.2 ± 94.2 | -330.6 ± 179.5 | -180.2 ± 1.5 |

Table 1: Results for the scenario with two comparable shapings. The combination has the best cumulative performance. In the initial stage it is comparable to the right shaping, in the final stage to the height shaping (each being the best in the corresponding stage), overall outperforming both. Significance relative to the best result was assessed with Student’s t-test.

| Variant | Cumulative | Initial | Final |
|---|---|---|---|
| No shaping | -349.7 ± 285.2 | -818.6 ± 373.7 | -193.2 ± 10.9 |
| Right shaping | -303.4 ± 81.4 | -346.7 ± 181.2 | -295.1 ± 16.7 |
| Height shaping | -292.4 ± 213.8 | -619.8 ± 328.3 | -190.1 ± 5.3 |
| Speed shaping | -158.6 ± 23.7 | -182.1 ± 50.6 | -150.2 ± 2.9 |
| Combination | -168.7 ± 44.7 | -214.8 ± 94.8 | -161.7 ± 4.0 |

Table 2: Results for the scenario with one clearly superior shaping. The combination has comparable performance to that shaping, indicating that even in such a setup, our technique is viable. Significance relative to the best result was assessed with Student’s t-test.

The results in Fig. 3, and Tables 1 and 2 show that individual shapings alone aid learning speed significantly. The combination method meets our desiderata: it either statistically matches or is better than the best shaping at any stage, overall outperforming all single shapings. The exception is the final performance of the run in Scenario 2, where the performance of the best shaping is significantly different from the combination. The difference in actual averaged performance however is relatively small, and arguably negligible.

We note that even the best performances in these tables do not reach the maximum attainable when behaving online. (This is an artefact of value-function methods learnt off-policy under a behavior policy that rarely reaches the goal.) Given longer learning periods, they will get closer and closer to the attainable optimum, but we choose not to concern ourselves with this in the context of this paper, as our main focus lies in improving on the learning time within the off-policy framework.

6 Conclusions and future work

We gave the first policy ensemble that is both sound and capable of learning in real time, by exploiting the power of the Horde architecture to learn a single policy well. The value functions in our ensemble were learned on shaped rewards, and we used a voting method to combine them. We validated the approach on the classical mountain car domain, considering two scenarios: with and without a clearly best shaping signal. In the scenario without one, the combination outperformed single shapings, and in the scenario with one it was able to match the performance of that best shaping. In general, we expect to see larger benefits on larger problems; a more extensive suite of experiments is subject to future work.

The primary limitation of Horde is the requirement to keep the behavior policy fixed (or change it slowly). While this is an important case, relaxing this constraint would further expand the effectiveness of the architecture. This is a topic of ongoing research in the GTD community.

Future work

In this work, we considered an ad-hoc voting approach to combining shapings. One possible future direction would be to learn how best to combine them, by predicting some shared fitness value w.r.t. the policies induced by the learnt value functions. The challenge is that the meta-learning has to happen at a much faster pace for it to be useful in speeding up the main learning process. In the case of shapings, this is doubly so, since they all eventually converge to the same (optimal) policy. The size of this window of opportunity is related to the size of the problem.

The scalability of Horde allows for learning potentially thousands of value functions efficiently in parallel. While in the context of shaping it will rarely be sensible to actually define thousands of distinct shapings, one could imagine defining shaping potentials with many different scaling factors each, and having a demon combining the shapings from each group. This would not only mitigate the scaling problem, but potentially make the representation more flexible by having non-static scaling factors throughout the state space. This has a roughly similar flavor to the approach of Marivate and Littman [19], who learn to solve many variants of a problem for the best parameter settings in a generalized MDP.

One could go further and attempt to learn the best potential functions [20, 12]. As before, one needs to be realistic about the attainability of learning this in time, since as argued by Ng et al. [21], the best potential function correlates with the optimal value function $V^*$, learning which would solve the base problem itself and render the potentials pointless.

Anna Harutyunyan is supported by the IWT-SBO project MIRAD (grant nr. 120057). Tim Brys is funded by a Ph.D grant of the Research Foundation-Flanders (FWO).


  • [1] L. Baird, ‘Residual algorithms: Reinforcement learning with function approximation’, in Proceedings of the Twelfth International Conference on Machine Learning, pp. 30–37. Morgan Kaufmann, (1995).
  • [2] S. J. Bradtke, A. G. Barto, and P. Kaelbling, ‘Linear least-squares algorithms for temporal difference learning’, in Machine Learning, pp. 22–33, (1996).
  • [3] L. Breiman, ‘Bagging predictors’, Mach. Learn., 24(2), 123–140, (August 1996).
  • [4] T. Brys, A. Harutyunyan, P. Vrancx, M.E. Taylor, D. Kudenko, and A. Nowé, ‘Multi-objectivization in reinforcement learning’, Technical Report AI-TR-13-354, AI Lab, Vrije Universiteit Brussel, (2013).
  • [5] T. Degris, M. White, and R.S. Sutton, ‘Off-policy actor-critic’, in Proceedings of the Twenty-Ninth International Conference on Machine Learning (ICML), (2012).
  • [6] S. Devlin, D. Kudenko, and M. Grzes, ‘An empirical study of potential-based reward shaping and advice in complex, multi-agent systems’, Advances in Complex Systems (ACS), 14(02), 251–278, (2011).
  • [7] M. Dorigo and M. Colombetti. Robot shaping: Experiment in behavior engineering, 1997.
  • [8] S. Faußer and F. Schwenker, ‘Ensemble methods for reinforcement learning with function approximation.’, in MCS, eds., Carlo Sansone, Josef Kittler, and Fabio Roli, volume 6713 of Lecture Notes in Computer Science, pp. 56–65. Springer, (2011).
  • [9] Y. Freund and R.E. Schapire, ‘Experiments with a New Boosting Algorithm’, in International Conference on Machine Learning, pp. 148–156, (1996).
  • [10] M. Grzes, Improving Exploration in Reinforcement Learning through Domain Knowledge and Parameter Analysis, Ph.D. dissertation, University of York, 2010.
  • [11] M. Grzes and D. Kudenko, ‘Theoretical and empirical analysis of reward shaping in reinforcement learning’, Machine Learning and Applications, Fourth International Conference on, 0, 337–344, (2009).
  • [12] M. Grzes and D. Kudenko, ‘Online learning of shaping rewards in reinforcement learning’, Neural Networks, 23(4), 541–550, (2010). The 18th International Conference on Artificial Neural Networks, ICANN 2008.
  • [13] T. Jaakkola, M. I. Jordan, and S. P. Singh, ‘Convergence of stochastic iterative dynamic programming algorithms’, Neural Computation, 6, 1185–1201, (1994).
  • [14] A. Krogh and J. Vedelsby, ‘Neural network ensembles, cross validation, and active learning’, in Advances in Neural Information Processing Systems, pp. 231–238. MIT Press, (1995).
  • [15] A. Laud and G. DeJong, ‘The influence of reward on the speed of reinforcement learning: An analysis of shaping’, in Proc. 20th International Conference on Machine Learning. AAAI Press, (2003).
  • [16] H.R. Maei, Gradient Temporal-Difference Learning Algorithms, Ph.D. dissertation, University of Alberta, 2011.
  • [17] H.R. Maei and R.S. Sutton, ‘GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces’, in Proceedings of the Third Conf. on Artificial General Intelligence, (2010).
  • [18] H.R. Maei, C. Szepesvári, S. Bhatnagar, and R.S. Sutton, ‘Toward off-policy learning control with function approximation’, in Proceedings of the Twenty-seventh International Conference on Machine Learning (ICML 2010), eds., Johannes Fürnkranz and Thorsten Joachims, pp. 719–726. Omnipress, (2010).
  • [19] V. Marivate and M. Littman. An ensemble of linearly combined reinforcement-learning agents. AAAI Workshops, 2013.
  • [20] B. Marthi, ‘Automatic shaping and decomposition of reward functions’, in Proceedings of the 24th International Conference on Machine Learning, ICML ’07, pp. 601–608, New York, NY, USA, (2007). ACM.
  • [21] A. Y. Ng, D. Harada, and S. Russell, ‘Policy invariance under reward transformations: Theory and application to reward shaping’, in Proceedings of the Sixteenth International Conference on Machine Learning, pp. 278–287. Morgan Kaufmann, (1999).
  • [22] P.M. Pilarski, M.R. Dawson, T. Degris, J.P. Carey, K.M. Chan, J.S. Hebert, and R.S. Sutton, ‘Adaptive artificial limbs: a real-time approach to prediction and anticipation’, Robotics Automation Magazine, IEEE, 20(1), 53–64, (March 2013).
  • [23] M.L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, John Wiley & Sons, Inc., New York, NY, USA, 1st edn., 1994.
  • [24] R.S. Sutton and A.G. Barto, Reinforcement learning: An introduction, volume 116, Cambridge Univ Press, 1998.
  • [25] R.S. Sutton, H.R. Maei, D. Precup, S. Bhatnagar, D. Silver, C. Szepesvári, and E. Wiewiora, ‘Fast gradient-descent methods for temporal-difference learning with linear function approximation’, in Proceedings of the 26th International Conference on Machine Learning, (2009).
  • [26] R.S. Sutton, J. Modayil, M. Delp, T. Degris, P.M. Pilarski, A. White, and D. Precup, ‘Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction’, in The 10th International Conference on Autonomous Agents and Multiagent Systems - Volume 2, AAMAS ’11, pp. 761–768, Richland, SC, (2011). International Foundation for Autonomous Agents and Multiagent Systems.
  • [27] J. N. Tsitsiklis and B. Van Roy, ‘An analysis of temporal-difference learning with function approximation’, Technical report, IEEE Transactions on Automatic Control, (1997).
  • [28] J. Randløv and P. Alstrøm. Learning to drive a bicycle using reinforcement learning and shaping, 1998.
  • [29] H. van Hasselt, Insights in reinforcement learning : formal analysis and empirical evaluation of temporal-difference learning algorithms, Ph.D. dissertation, Utrecht University, 2011.
  • [30] C. J. C. H. Watkins and P. Dayan, ‘Q-learning’, Machine Learning, 8(3), 272–292, (1992).
  • [31] M.A. Wiering and H. van Hasselt, ‘Ensemble algorithms in reinforcement learning’, Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, 38(4), 930–936, (Aug 2008).