1 Introduction
Human decision making involves decomposing a task into a course of action. The course of action is typically composed of abstract, high-level actions that may execute over different timescales (e.g., walk to the door or make a cup of coffee). The decision-maker then chooses actions to execute to solve the task. These actions may need to be reused at different points in the task. In addition, the actions may need to be used across multiple, related tasks.
Consider, for example, the task of building a city. The course of action for building a city may involve building the foundations, laying down sewage pipes, and building houses and shopping malls. Each action operates over multiple timescales, and certain actions (such as building a house) may need to be reused if additional units are required. In addition, these actions can be reused if a neighboring city needs to be developed (a multi-task scenario).
Reinforcement Learning (RL) represents actions that last for multiple timescales as Temporally Extended Actions (TEAs) (Sutton et al., 1999), also referred to as options, skills (Konidaris & Barto, 2009) or macro-actions (Hauskrecht, 1998). It has been shown both experimentally (Precup & Sutton, 1997; Sutton et al., 1999; Silver & Ciosek, 2012) and theoretically (Mann & Mannor, 2014) that TEAs speed up the convergence rates of RL planning algorithms. TEAs are seen as a potentially viable solution to making RL truly scalable. TEAs in RL have become popular in many domains including RoboCup soccer (Bai et al., 2012), video games (Mann et al., 2015) and robotics (Fu et al., 2015), where decomposing the domains into temporally extended courses of action (strategies in RoboCup, strategic move combinations in video games and skill controllers in robotics, for example) has generated impressive solutions. From here on, we will refer to TEAs as skills.
A course of action is defined by a policy. A policy is a solution to a Markov Decision Process (MDP) and is defined as a mapping from states to a probability distribution over actions. That is, it tells the RL agent which action to perform given the agent's current state. We will refer to an inter-skill policy as a policy that tells the agent which skill to execute, given the current state. A truly general skill learning framework must (1) learn skills as well as (2) automatically compose them together (as stated by Bacon & Precup (2015)) and determine where each skill should be executed (the inter-skill policy). This framework should also determine (3) where skills can be reused in different parts of the state space and (4) adapt to changes in the task itself. Finally, it should also be able to (5) correct model misspecification (Mankowitz et al., 2014). Model misspecification is defined as having an unsatisfactory set of skills and an inter-skill policy that provide a sub-optimal solution to a given task. This skill learning framework should be able to correct the set of misspecified skills and the inter-skill policy to obtain a near-optimal solution. A number of works have addressed some of these issues separately, as shown in Table 1. However, no work, to the best of our knowledge, has combined all of these elements into a truly general skill-learning framework.
Our framework, entitled 'Adaptive Skills, Adaptive Partitions (ASAP)', is the first of its kind to incorporate all of the above-mentioned elements into a single framework, as shown in Table 1, and solve continuous-state MDPs. It receives as input a misspecified model (a sub-optimal set of skills and inter-skill policy). The ASAP framework corrects the misspecification by simultaneously learning a near-optimal skill set and inter-skill policy, which are both stored, in a Bayesian-like manner, within the ASAP policy. In addition, ASAP automatically composes skills together, learns where to reuse them and learns skills across multiple tasks.
Table 1: Comparison with related work.

                            Automated Skill   Automatic     Continuous    Learning   Correcting
                            Learning with     Skill         State         Reusable   Model
                            Policy Gradient   Composition   Multi-task    Skills     Misspecification
                                                            Learning
ASAP (this paper)                 ✓                ✓             ✓            ✓             ✓
da Silva et al. (2012)            ✓                              ✓
Konidaris & Barto (2009)                           ✓
Bacon & Precup (2015)             ✓
Eaton & Ruvolo (2013)                                            ✓
Masson & Konidaris (2015)
Main Contributions: (1) The Adaptive Skills, Adaptive Partitions (ASAP) algorithm, which automatically corrects a misspecified model. It learns a set of near-optimal skills, automatically composes skills together and learns an inter-skill policy to solve a given task. (2) Learning skills over multiple different tasks by automatically adapting both the inter-skill policy and the skill set. (3) ASAP can determine where skills should be reused in the state space. (4) Theoretical convergence guarantees.
2 Background
Reinforcement Learning Problem: A Markov Decision Process is defined by a tuple $\langle S, A, R, \gamma, P \rangle$ where $S$ is the state space, $A$ is the action space, $R : S \times A \to [0, R_{\max}]$ is a bounded reward function, $\gamma \in [0, 1]$ is the discount factor and $P : S \times A \times S \to [0, 1]$ is the transition probability function for the MDP. The solution to an MDP is a policy $\pi : S \to \Delta_A$, which is a function mapping states to a probability distribution over actions. An optimal policy $\pi^*$ determines the best actions to take so as to maximize the expected reward. The value function $V^\pi(s) = \mathbb{E}^\pi\big[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid s_0 = s\big]$ defines the expected reward for following a policy $\pi$ from state $s$. The optimal expected reward $V^{\pi^*}(s)$ is the expected value obtained for following the optimal policy from state $s$.
Policy Gradient: Policy Gradient (PG) methods have enjoyed success in recent years, especially in the field of robotics (Peters & Schaal, 2006, 2008). The goal in PG is to learn a policy $\pi_\theta$ that maximizes the expected return $J(\theta) = \mathbb{E}_{\tau \sim P_\theta}[R(\tau)]$, where $\tau$ is a trajectory, $P_\theta(\tau)$ is the probability of a trajectory and $R(\tau)$ is the reward obtained for a particular trajectory. $P_\theta(\tau)$ is defined as $P_\theta(\tau) = P(s_0) \prod_{t=0}^{T-1} P(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)$. Here, $s_t$ is the state at the $t^{\text{th}}$ timestep of the trajectory; $a_t$ is the action at the $t^{\text{th}}$ timestep; $T$ is the trajectory length. Only the policy, in the general formulation of policy gradient, is parameterized with parameters $\theta$. The idea is then to update the policy parameters using stochastic gradient ascent, leading to the update rule $\theta_{k+1} = \theta_k + \eta\, \nabla_\theta J(\theta_k)$, where $\theta_k$ are the policy parameters at timestep $k$, $\nabla_\theta J(\theta_k)$ is the gradient of the objective function with respect to the parameters and $\eta$ is the step size.

3 Skills, Skill Partitions and Intra-Skill Policy
Skills: A skill is a parameterized Temporally Extended Action (TEA) (Sutton et al., 1999). The power of a skill is that it incorporates both generalization (due to the parameterization) and temporal abstraction. Skills are a special case of options and therefore inherit many of their useful theoretical properties (Sutton et al., 1999; Precup et al., 1998).
Definition 1.
A Skill $i$ is a TEA that consists of the two-tuple $\langle \pi_{\theta_i}, p_i \rangle$, where $\pi_{\theta_i}$ is a parameterized, intra-skill policy with parameters $\theta_i$ and $p_i$ is the termination probability distribution of the skill.
Skill Partitions: A skill, by definition, performs a specialized task on a sub-region of a state space. We refer to these sub-regions as Skill Partitions (SPs), which are necessary for skills to specialize during the learning process. A given set of SPs covering a state space effectively defines the inter-skill policy, as they determine where each skill should be executed. These partitions are unknown a priori and are generated using intersections of hyperplane half-spaces. Hyperplanes provide a natural way to automatically compose skills together. In addition, once a skill is being executed, the agent needs to select actions from the skill's intra-skill policy $\pi_{\theta_i}(a \mid s)$. We next utilize SPs and the intra-skill policy for each skill to construct the ASAP policy, defined in Section 4. We now define a skill hyperplane.
Definition 2.
Skill Hyperplane (SH): Let $\phi(s, m)$ be a vector of features that depend on a state $s \in S$ and an MDP environment $m$. Let $\beta_k$ be a vector of hyperplane parameters. A skill hyperplane is defined as $\beta_k^T \phi(s, m) = L$, where $L$ is a constant.
In this work, we interpret hyperplanes to mean that the intersections of skill hyperplane half-spaces form sub-regions in the state space called Skill Partitions (SPs), defining where each skill is executed. Figure 1 contains two example skill hyperplanes, $h_1$ and $h_2$. Skill $\sigma_1$ is executed in the SP defined by the intersection of the positive half-space of $h_1$ and the negative half-space of $h_2$. The same argument applies for the other skills. From here on, we will refer to skill $\sigma_i$ interchangeably with its index $i$.
Skill hyperplanes have two functions: (1) they automatically compose skills together, creating chainable skills as desired by Bacon & Precup (2015); (2) they define SPs which enable us to derive the probability of executing a skill, given a state $s$ and MDP $m$. First, we need to be able to uniquely identify a skill. We define a binary vector $B = [b_1, b_2, \ldots, b_K]$, where $b_k$ is a Bernoulli random variable and $K$ is the number of skill hyperplanes. We define the skill index $i$ as a sum of Bernoulli random variables, $i = \sum_{k=1}^{K} b_k 2^{k-1}$. Note that this is but one way to generate skill partitions. In principle this setup defines $2^K$ skills, but in practice, far fewer skills are typically used (see experiments). Furthermore, the complexity of the SP is governed by the VC-dimension. We can now define the probability of executing skill $i$ as a Bernoulli likelihood in Equation 1:

$$P(i \mid s, m) = \prod_{k=1}^{K} p_k(s, m)^{b_{k,i}} \big(1 - p_k(s, m)\big)^{1 - b_{k,i}} \qquad (1)$$
Here, $b_{k,i}$ is the value of the $k^{\text{th}}$ bit of $B_i$, $s$ is the current state and $m$ is a description of the MDP. The probabilities $p_k(s, m)$ and $1 - p_k(s, m)$ are defined in Equation 2:

$$p_k(s, m) = \frac{1}{1 + \exp\big(-\alpha(\beta_k^T \phi(s, m) - L)\big)}, \qquad 1 - p_k(s, m) = \frac{\exp\big(-\alpha(\beta_k^T \phi(s, m) - L)\big)}{1 + \exp\big(-\alpha(\beta_k^T \phi(s, m) - L)\big)} \qquad (2)$$

We have made use of the logistic sigmoid function to ensure valid probabilities, where $\beta_k^T \phi(s, m)$ is a skill hyperplane and $\alpha > 0$ is a temperature parameter. The intuition here is that the $k^{\text{th}}$ bit of a skill satisfies $b_{k,i} = 1$ if the skill hyperplane $\beta_k^T \phi(s, m) > L$, meaning that the skill's partition is in the positive half-space of the hyperplane. Similarly, $b_{k,i} = 0$ if $\beta_k^T \phi(s, m) < L$, corresponding to the negative half-space. Using skill $\sigma_1$ with hyperplanes $h_1, h_2$ in Figure 1 as an example, we would define the Bernoulli likelihood of executing $\sigma_1$ as $P(i = 1 \mid s, m) = p_1(s, m)\big(1 - p_2(s, m)\big)$.

Intra-Skill Policy: Now that we can define the probability of executing a skill based on its SP, we define the intra-skill policy for each skill. The Gibbs distribution is a commonly used function to define policies in RL (Sutton et al., 1999). Therefore we define the intra-skill policy for skill $i$, parameterized by $\theta_i$, as
$$\pi_{\theta_i}(a \mid s) = \frac{\exp\big(\alpha\, \theta_i^T \psi(s, a)\big)}{\sum_{a' \in A} \exp\big(\alpha\, \theta_i^T \psi(s, a')\big)} \qquad (3)$$

Here, $\alpha$ is the temperature and $\psi(s, a)$ is a feature vector that depends on the current state $s$ and action $a$. Now that we have a definition of both the probability of executing a skill and an intra-skill policy, we need to incorporate these distributions into the policy gradient setting using a generalized trajectory.
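To make the construction concrete, here is a minimal numerical sketch (our own illustration; the function names, feature choices and the deterministic two-hyperplane example are assumptions, not the authors' code). $K$ hyperplanes induce a Bernoulli probability per bit, the bits index a skill via $i = \sum_k b_k 2^{k-1}$, Equations (1)-(2) give the probability of executing each skill, and Equation (3) gives the Gibbs intra-skill policy:

```python
import numpy as np

def skill_probs(beta, phi_sm, alpha=1.0, L=0.0):
    """P(i | s, m) for every skill index i = sum_k b_k 2^(k-1):
    a product of Bernoulli terms, one per hyperplane (Eqs. 1-2)."""
    K = beta.shape[0]
    p = 1.0 / (1.0 + np.exp(-alpha * (beta @ phi_sm - L)))  # p_k per hyperplane
    probs = np.empty(2 ** K)
    for i in range(2 ** K):
        bits = (i >> np.arange(K)) & 1                      # binary code of i
        probs[i] = np.prod(np.where(bits == 1, p, 1.0 - p))
    return probs

def intra_skill_policy(theta_i, psi_sa, alpha=1.0):
    """Gibbs intra-skill policy over actions (Eq. 3).
    psi_sa: (num_actions, d) state-action feature matrix."""
    z = alpha * psi_sa @ theta_i
    z = z - z.max()                  # numerical stability
    e = np.exp(z)
    return e / e.sum()

# two hyperplanes (x > 0 and y > 0) partition the plane into 4 SPs
beta = np.array([[1.0, 0.0],
                 [0.0, 1.0]])
probs = skill_probs(beta, np.array([2.0, -2.0]), alpha=5.0)
# probs sums to 1 and, for this state, concentrates on the skill
# with bits (1, 0), i.e. index 1
```

With a large temperature the Bernoulli terms saturate and the soft partition approaches the hard half-space intersection of Definition 2.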
Generalized Trajectory: A generalized trajectory is necessary to derive policy gradient update rules with respect to the parameters $\Theta$ and $\beta$, as will be shown in Section 4. A typical trajectory is usually defined as $\tau = (s_t, a_t, r_t, s_{t+1})_{t=0}^{T-1}$, where $T$ is the length of the trajectory. For a generalized trajectory, our algorithm emits a class $z_t$ at each timestep $t$, which denotes the skill that was executed. The generalized trajectory is defined as $g = (s_t, z_t, a_t, r_t, s_{t+1})_{t=0}^{T-1}$. The probability of a generalized trajectory, as an extension to the PG trajectory in Section 2, is now

$$P_{\Theta,\beta}(g) = P(s_0) \prod_{t=0}^{T-1} P(s_{t+1} \mid s_t, a_t)\, P_\beta(z_t \mid s_t, m)\, \pi_{\theta_{z_t}}(a_t \mid s_t),$$

where $P_\beta(z_t \mid s_t, m)$ is the probability of skill $z_t$ being executed, given the state $s_t$ and environment $m$ at time $t$; $\pi_{\theta_{z_t}}(a_t \mid s_t)$ is the probability of executing action $a_t$ at time $t$ given that we are executing skill $z_t$. The generalized trajectory is therefore a function of two parameter sets, $\Theta$ and $\beta$.
4 Adaptive Skills, Adaptive Partitions (ASAP) Framework
The Adaptive Skills, Adaptive Partitions (ASAP) framework simultaneously learns a near-optimal set of skills and SPs (the inter-skill policy), given an initially misspecified model. ASAP automatically composes skills together and allows for a multi-task setting as it incorporates the environment into its hyperplane feature set. We have previously defined two important distributions, $P_\beta(i \mid s, m)$ and $\pi_{\theta_i}(a \mid s)$, respectively. These distributions are used to collectively define the ASAP policy, which is presented below. Using the notion of a generalized trajectory, the ASAP policy can be learned in a policy gradient setting.
ASAP Policy: Assume that we are given a probability distribution $P_M$ over MDPs with a $d$-dimensional state-action space and a fixed-length vector describing each MDP. We define $\beta$ as a matrix where each column $\beta_k$ represents a skill hyperplane, and $\Theta$ as a matrix where each column $\theta_i$ parameterizes an intra-skill policy. Using the previously defined distributions, we now define the ASAP policy.
Definition 3.
(ASAP Policy). Given $K$ skill hyperplanes, a set of skills $\{\sigma_i\}_{i=1}^{2^K}$, a state space $s \in S$, a set of actions $a \in A$ and an MDP $m$ from a hypothesis space of MDPs, the ASAP policy is defined as

$$\pi_{\Theta,\beta}(a \mid s, m) = \sum_{i=1}^{2^K} P_\beta(i \mid s, m)\, \pi_{\theta_i}(a \mid s). \qquad (4)$$
This is a powerful description for a policy, which resembles a Bayesian approach, as the policy takes into account the uncertainty of the skills that are executing as well as the actions that each skill’s intraskill policy chooses. We now define the ASAP objective with respect to the ASAP policy.
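The mixture form of Equation (4) can be sketched numerically as follows (our own illustration; the shapes and names are assumptions): the action distribution is the sum over all $2^K$ skills of each intra-skill policy weighted by that skill's Bernoulli probability.

```python
import numpy as np

def asap_policy(beta, Theta, phi_sm, psi_sa, alpha=1.0, L=0.0):
    """ASAP policy (Eq. 4, sketch): a mixture over skills of the
    intra-skill action distributions, weighted by P(skill | s, m).
    beta:   (K, d1) hyperplane parameters
    Theta:  (2**K, d2) intra-skill parameters, one row per skill
    phi_sm: (d1,) hyperplane features; psi_sa: (A, d2) action features."""
    K = beta.shape[0]
    p = 1.0 / (1.0 + np.exp(-alpha * (beta @ phi_sm - L)))
    action_probs = np.zeros(psi_sa.shape[0])
    for i in range(2 ** K):
        bits = (i >> np.arange(K)) & 1               # binary code of skill i
        w = np.prod(np.where(bits == 1, p, 1 - p))   # P(i | s, m), Eq. (1)
        z = alpha * psi_sa @ Theta[i]
        z = z - z.max()                              # numerical stability
        pi_i = np.exp(z) / np.exp(z).sum()           # Eq. (3)
        action_probs += w * pi_i
    return action_probs

rng = np.random.default_rng(0)
probs = asap_policy(rng.normal(size=(2, 3)), rng.normal(size=(4, 5)),
                    rng.normal(size=3), rng.normal(size=(3, 5)))
# probs is a valid distribution over the 3 actions
```

Because the skill weights sum to one and each intra-skill policy sums to one, the mixture is itself a valid action distribution.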
ASAP Objective: We defined the policy with respect to a hypothesis space of MDPs. We now need to define an objective function which takes this hypothesis space into account. Since we assume that we are provided with a distribution $P_M$ over possible MDP models, we can incorporate this into the ASAP objective function:

$$\rho(\Theta, \beta) = \int_m P_M(m)\, J_m(\pi_{\Theta,\beta})\, dm, \qquad (5)$$

where $\pi_{\Theta,\beta}$ is the ASAP policy and $J_m(\pi_{\Theta,\beta})$ is the expected return for MDP $m$ with respect to the ASAP policy. To simplify the notation, we group all of the parameters into a single parameter vector $\Omega = [\mathrm{vec}(\Theta), \mathrm{vec}(\beta)]$. We define the expected reward for generalized trajectories as $J_m(\Omega) = \mathbb{E}_{g \sim P_\Omega}[R(g)]$, where $R(g)$ is the reward obtained for a particular generalized trajectory $g$. This is a slight variation of the original policy gradient objective defined in Section 2. We then insert $J_m(\Omega)$ into Equation 5 and we get the ASAP objective function

$$\rho(\Omega) = \int_m P_M(m)\, J_m(\Omega)\, dm, \qquad (6)$$

where $J_m(\Omega)$ is the expected return for policy $\pi_\Omega$ in MDP $m$. Next, we need to derive gradient update rules to learn the parameters of the optimal policy that maximizes this objective.
ASAP Gradients: To learn both the intra-skill policy parameter matrix $\Theta$ as well as the hyperplane parameter matrix $\beta$ (and therefore implicitly the SPs), we derive an update rule for the policy gradient framework with generalized trajectories. The derivation is in the supplementary material. The first step involves calculating the gradient of the ASAP objective function, yielding the ASAP gradient (Theorem 1).
Theorem 1.
(ASAP Gradient Theorem). Suppose that the ASAP objective function is $\rho(\Omega) = \int_m P_M(m)\, J_m(\Omega)\, dm$, where $P_M$ is a distribution over MDPs and $J_m(\Omega)$ is the expected return for MDP $m$ whilst following policy $\pi_\Omega$. Then the gradient of this objective is:

$$\nabla_\Omega \rho(\Omega) = \mathbb{E}_{m \sim P_M}\bigg[\mathbb{E}_{g \sim P_\Omega}\bigg[\sum_{t=0}^{T_m - 1} \nabla_\Omega \log\big(P_\beta(z_t \mid s_t, m)\, \pi_{\theta_{z_t}}(a_t \mid s_t)\big)\, R(g)\bigg]\bigg],$$

where $T_m$ is the length of a trajectory for MDP $m$ and $R(g) = \sum_{t=0}^{T_m - 1} \gamma^t r_t$ is the discounted cumulative reward for trajectory $g$. (These expectations can easily be sampled; see supplementary material.)
If we are able to derive $\nabla_\Omega \log\big(P_\beta(z_t \mid s_t, m)\, \pi_{\theta_{z_t}}(a_t \mid s_t)\big)$, then we can estimate the gradient $\nabla_\Omega \rho(\Omega)$. We will refer to $\rho(\Omega)$ as $\rho$ where it is clear from context. It turns out that it is possible to derive this term as a result of the generalized trajectory. This yields the gradients $\nabla_{\theta_{z_t}} \rho$ and $\nabla_{\beta_k} \rho$ in Theorems 2 and 3, respectively. The derivations can be found in the supplementary material.
Theorem 2.
($\Theta$ Gradient Theorem). Suppose that $\Theta$ is a matrix where each column $\theta_i$ parameterizes an intra-skill policy. Then the gradient corresponding to the intra-skill parameters of the skill $z_t$ executed at time $t$ is:

$$\nabla_{\theta_{z_t}} \log \pi_{\theta_{z_t}}(a_t \mid s_t) = \alpha \Big(\psi(s_t, a_t) - \sum_{a' \in A} \pi_{\theta_{z_t}}(a' \mid s_t)\, \psi(s_t, a')\Big),$$

where $\alpha$ is the temperature parameter and $\psi(s_t, a_t)$ is a feature vector of the current state $s_t$ and the current action $a_t$.
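This is the familiar score function of a Gibbs/softmax policy: the feature of the chosen action minus the policy-averaged feature, scaled by $\alpha$. A quick finite-difference check (our own sketch, with $\alpha = 1$ and arbitrary random features) confirms the form:

```python
import numpy as np

def grad_log_pi(theta, psi, a, alpha=1.0):
    """Score of the Gibbs intra-skill policy (Theorem 2 form):
    alpha * (psi[a] - E_{a' ~ pi}[psi[a']])."""
    z = alpha * psi @ theta
    z = z - z.max()
    pi = np.exp(z) / np.exp(z).sum()
    return alpha * (psi[a] - pi @ psi)

# finite-difference check of the analytic score (alpha = 1)
rng = np.random.default_rng(1)
theta = rng.normal(size=4)
psi = rng.normal(size=(3, 4))   # 3 actions, 4 features
a, eps = 1, 1e-6

def log_pi(th):
    z = psi @ th
    return z[a] - (np.log(np.exp(z - z.max()).sum()) + z.max())

num = np.array([(log_pi(theta + eps * e) - log_pi(theta - eps * e)) / (2 * eps)
                for e in np.eye(4)])
# num should agree with grad_log_pi(theta, psi, a) to ~1e-6
```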
Theorem 3.
($\beta$ Gradient Theorem). Suppose that $\beta$ is a matrix where each column $\beta_k$ represents a skill hyperplane. Then the gradient corresponding to the parameters $\beta_k$ of the $k^{\text{th}}$ hyperplane is:

$$\nabla_{\beta_k} \log P_\beta(z_t \mid s_t, m) = \begin{cases} \alpha\,\big(1 - p_k(s_t, m)\big)\, \phi(s_t, m) & \text{if } b_{k,z_t} = 1, \\ -\alpha\, p_k(s_t, m)\, \phi(s_t, m) & \text{if } b_{k,z_t} = 0, \end{cases} \qquad (7)$$

where $\alpha$ is the hyperplane temperature parameter, $\beta_k^T \phi(s_t, m)$ is the $k^{\text{th}}$ skill hyperplane for MDP $m$, and the two cases correspond to locations in the binary vector $B_{z_t}$ equal to $1$ and $0$, respectively.
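The two cases are simply the derivative of a log-Bernoulli whose success probability is a sigmoid. A finite-difference check of this form (our own sketch, with $\alpha = 1$, $L = 0$ and arbitrary random inputs):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grad_log_skill_prob(beta, phi, bits, alpha=1.0, L=0.0):
    """Gradient of log P(i | s, m) w.r.t. each hyperplane beta_k:
    alpha*(1 - p_k)*phi when bit k is 1, -alpha*p_k*phi when bit k is 0."""
    p = sigmoid(alpha * (beta @ phi - L))
    coeff = np.where(bits == 1, 1.0 - p, -p)        # (K,)
    return alpha * coeff[:, None] * phi[None, :]    # (K, d)

# finite-difference check of the analytic gradient
rng = np.random.default_rng(2)
beta = rng.normal(size=(2, 3))
phi = rng.normal(size=3)
bits = np.array([1, 0])

def log_prob(b):
    p = sigmoid(b @ phi)
    return float(np.sum(np.where(bits == 1, np.log(p), np.log(1 - p))))

eps = 1e-6
num = np.zeros_like(beta)
for k in range(2):
    for j in range(3):
        bp, bm = beta.copy(), beta.copy()
        bp[k, j] += eps
        bm[k, j] -= eps
        num[k, j] = (log_prob(bp) - log_prob(bm)) / (2 * eps)
# num should agree with grad_log_skill_prob(beta, phi, bits)
```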
Using these gradient updates, we can then collect all of the gradients into a single vector $\nabla_\Omega \rho$ and update both the intra-skill policy parameters $\Theta$ and the hyperplane parameters $\beta$ for the given task (learning a skill set and SPs). Note that the updates occur on a single timescale. This is formally stated in the ASAP Algorithm.
5 ASAP Algorithm
We present the ASAP algorithm (Algorithm 1) that dynamically and simultaneously learns skills, the inter-skill policy, and automatically composes skills together by learning SPs. The skills (the $\Theta$ matrix) and SPs (the $\beta$ matrix) are initially arbitrary and therefore form a misspecified model. The algorithm first combines the skill and hyperplane parameters into a single parameter vector $\Omega$. It then learns the skill and hyperplane parameters (and therefore implicitly the skill partitions): a generalized trajectory is generated using the current ASAP policy, the gradient $\nabla_\Omega \rho$ is estimated from this trajectory, and the parameters are updated. This is repeated until the skill and hyperplane parameters have converged, thus correcting the misspecified model. Theorem 4 provides a convergence guarantee of ASAP to a local optimum (see supplementary material for the proof).
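The loop above can be sketched as follows (a minimal skeleton of our own, not the paper's Algorithm 1: the trajectory sampler and score function are passed in as stubs, and the toy usage is a hypothetical one-step task):

```python
import numpy as np

def asap(sample_trajectory, grad_log_policy, omega0, eta=0.2, iters=500):
    """Skeleton of the ASAP loop (sketch): all parameters (intra-skill
    Theta and hyperplane beta) live in one vector omega and are updated
    on a single timescale from sampled generalized trajectories of
    (state, skill, action, reward) tuples."""
    omega = omega0.copy()
    for _ in range(iters):
        traj = sample_trajectory(omega)              # generalized trajectory
        ret = sum(r for (_, _, _, r) in traj)        # R(g)
        grad = sum(grad_log_policy(omega, s, z, a)
                   for (s, z, a, _) in traj)         # score of the trajectory
        omega = omega + eta * grad * ret             # REINFORCE-style ascent
    return omega

# toy check: a one-step "task" with two actions, where action 0 pays 1.0
rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sample(omega):
    a = rng.choice(2, p=softmax(omega))
    return [(None, 0, a, 1.0 if a == 0 else 0.0)]

def glp(omega, s, z, a):
    g = -softmax(omega)
    g[a] += 1.0
    return g

omega = asap(sample, glp, np.zeros(2))
# omega comes to prefer action 0
```

In the full algorithm the score term would be the sum of the Theorem 2 and Theorem 3 gradients at each timestep; here a plain softmax stands in for the ASAP policy to keep the sketch self-contained.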
Theorem 4.
Convergence of ASAP: Given an ASAP policy $\pi_\Omega$, an ASAP objective $\rho(\Omega)$ over MDP models, as well as the ASAP gradient update rules, if (1) the step size $\eta_k$ satisfies $\sum_k \eta_k = \infty$ and $\sum_k \eta_k^2 < \infty$; and (2) the second derivative of the policy is bounded and we have bounded rewards; then the sequence $\{\rho(\Omega_k)\}_{k=0}^{\infty}$ converges such that $\lim_{k \to \infty} \nabla_\Omega \rho(\Omega_k) = 0$ almost surely.
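As a concrete check of condition (1) (this worked example is ours, not the paper's): the classical Robbins-Monro schedule $\eta_k = c/k$ for a constant $c > 0$ satisfies both requirements, since

```latex
\sum_{k=1}^{\infty} \eta_k = \sum_{k=1}^{\infty} \frac{c}{k} = \infty,
\qquad
\sum_{k=1}^{\infty} \eta_k^2 = \sum_{k=1}^{\infty} \frac{c^2}{k^2}
= \frac{c^2 \pi^2}{6} < \infty .
```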
6 Experiments
The experiments have been performed on four different continuous domains: the Two Rooms (2R) domain (Figure 1), the Flipped 2R domain (Figure 1), the Three Rooms (3R) domain (Figure 1), and the RoboCup domains (Figure 1), which include a one-on-one scenario between a striker and a goalkeeper (R1), a two-on-one scenario of a striker against a goalkeeper and a defender (R2), and a striker against two defenders and a goalkeeper (R3) (see supplementary material). In each experiment, ASAP is provided with a misspecified model; that is, a set of skills and SPs (the inter-skill policy) that achieve degenerate, sub-optimal performance. ASAP corrects this misspecified model in each case to learn a set of near-optimal skills and SPs. For each experiment we implement ASAP using Actor-Critic Policy Gradient (ACPG) as the learning algorithm (ACPG works well in practice and can be trivially incorporated into ASAP with convergence guarantees).
The Two-Room and Flipped Room Domains (2R): In both domains, the agent (red ball) needs to reach the goal location (blue square) in the shortest amount of time. The agent receives constant negative rewards and, upon reaching the goal, receives a large positive reward. There is a wall dividing the environment, which creates two rooms. The state space is a tuple consisting of the continuous location of the agent and the location of the center of the goal. The agent can move in each of the four cardinal directions. For each experiment involving the two-room domains, a single hyperplane is learned (resulting in two SPs) with a linear feature vector representation. In addition, a skill is learned in each of the two SPs. The intra-skill policies are represented as a probability distribution over actions.
Automated Hyperplane and Skill Learning: Using ASAP, the agent learned intuitive SPs and skills, as seen in Figure 1. Each colored region corresponds to an SP. The white arrows have been superimposed onto the figures to indicate the skills learned for each SP. Since each intra-skill policy is a probability distribution over actions, each skill is unable to solve the entire task on its own. ASAP has taken this into account and has positioned the hyperplane accordingly such that the given skill representation can solve the task. Figure 2 shows that ASAP improves upon the initial misspecified partitioning to attain near-optimal performance compared to executing ASAP on the fixed initial misspecified partitioning and on a fixed approximately optimal partitioning.
Multiple Hyperplanes: We analyzed the ASAP framework when learning multiple hyperplanes in the two-room domain. As seen in Figure 2, increasing the number of hyperplanes $K$ does not have an impact on the final solution in terms of average reward. However, it does increase the computational complexity of the algorithm, since $2^K$ skills need to be learned. The approximate points of convergence are marked in the figure. In addition, two skills dominate in each case, producing similar partitions to those seen in Figure 1 (see supplementary material), indicating that ASAP learns that not all skills are necessary to solve the task.
Multi-task Learning: We first applied ASAP to the 2R domain (Task 1) and attained a near-optimal average reward (Figure 2), resulting in the SPs and skill set shown in Figure 1 (top). Using the learned SPs and skills, ASAP was then able to adapt and learn a new set of SPs and skills to solve a different task (Flipped 2R, Task 2) in far fewer episodes (Figure 2), indicating that the parameters learned from the old task provided a good initialization for the new task. The knowledge transfer is seen in Figure 1 (bottom), as the SPs do not significantly change between tasks, yet the skills are completely relearned.
We also wanted to see whether we could flip the SPs; that is, switch the sign of the hyperplane parameters learned in the 2R domain and see whether ASAP can solve the Flipped 2R domain (Task 2) without any additional learning. Due to the symmetry of the domains, ASAP was indeed able to solve the new domain and attained nearoptimal performance (Figure 2). This is an exciting result as many problems, especially navigation tasks, possess symmetrical characteristics. This insight could dramatically reduce the sample complexity of these problems.
The Three-Room Domain (3R): The 3R domain (Figure 1) is similar to the 2R domain regarding the goal, state space, available actions and rewards. However, in this case, there are two walls, dividing the state space into three rooms. The hyperplane feature vector consists of a single Fourier feature. The intra-skill policy is a probability distribution over actions. The resulting learned hyperplane partitioning and skill set are shown in Figure 1. Using this partitioning, ASAP achieved near-optimal performance (Figure 2). This experiment shows an insightful and unexpected result.
Reusable Skills: Using this hyperplane representation, ASAP was able to learn not only the intra-skill policies and SPs, but also that skill 'A' needed to be reused in two different parts of the state space (Figure 1). ASAP therefore shows the potential to automatically create reusable skills.
RoboCup Domain: The RoboCup 2D soccer simulation domain (Akiyama & Nakashima, 2014) is a 2D soccer field (Figure 1) with two opposing teams. We utilized three RoboCup sub-domains (https://github.com/mhauskn/HFO.git): R1, R2 and R3, as mentioned previously. In these sub-domains, a striker (the agent) needs to learn to dribble the ball and try to score goals past the goalkeeper. State space: R1 domain - the continuous locations of the striker, the ball, the goalkeeper and the constant goal location. R2 domain - we add the defender's location to the state space. R3 domain - we add the locations of two defenders. Features: For the R1 domain, we tested both a linear and a degree-two polynomial feature representation for the hyperplanes. For the R2 and R3 domains, we also utilized a degree-two polynomial hyperplane feature representation. Actions: The striker has three actions: (1) move to the ball (M), (2) move to the ball and dribble towards the goal (D), and (3) move to the ball and shoot towards the goal (S). Rewards: The reward setup is consistent with logical football strategies (Hausknecht & Stone, 2015; Bai et al., 2012): small negative (positive) rewards for shooting from outside (inside) the box and dribbling when inside (outside) the box; large negative rewards for losing possession and kicking the ball out of bounds; and a large positive reward for scoring.
Different SP Optima: Since ASAP attains a locally optimal solution, it may sometimes learn different SPs. For the polynomial hyperplane feature representation, ASAP attained two different solutions, as shown in Figure 2. Both achieve near-optimal performance compared to the approximately optimal scoring controller (see supplementary material). For the linear feature representation, the SPs and skill set in Figure 2 are obtained and achieve near-optimal performance (Figure 2), outperforming the polynomial representation.
SP Sensitivity: In the R2 domain, an additional player (the defender) is added to the game. It is expected that the presence of the defender will affect the shape of the learned SPs. ASAP again learns intuitive SPs. However, the shape of the learned SPs changes based on the predefined hyperplane feature vector. Figure 2 shows the learned SPs when the location of the defender is not used as a hyperplane feature; when the location of the defender is utilized, the 'flatter' SPs shown in Figure 2 are learned. Using the location of the defender as a hyperplane feature also causes the hyperplane offset shown in Figure 2; this is due to the striker learning to dribble around the defender in order to score a goal, as seen in Figure 2. Finally, taking the location of the defender into account results in the 'squashed' SPs shown in Figure 2, clearly showing the sensitivity and adaptability of ASAP to dynamic factors in the environment.
7 Discussion
We have presented the Adaptive Skills, Adaptive Partitions (ASAP) framework, which is able to automatically compose skills together and simultaneously learns a near-optimal skill set and skill partitions (the inter-skill policy) to correct an initially misspecified model. We derived the gradient update rules for both skill and skill hyperplane parameters and incorporated them into a policy gradient framework. This is possible due to our definition of a generalized trajectory. In addition, ASAP has shown the potential to learn across multiple tasks as well as automatically reuse skills. These are the necessary requirements for a truly general skill learning framework and can be applied to lifelong learning problems (Ammar et al., 2015; Thrun & Mitchell, 1995). An exciting extension of this work is to incorporate it into a Deep Reinforcement Learning framework, where both the skills and the ASAP policy can be represented as deep networks.
References
 Akiyama & Nakashima (2014) Akiyama, Hidehisa and Nakashima, Tomoharu. Helios base: An open source package for the robocup soccer 2d simulation. In RoboCup 2013: Robot World Cup XVII, pp. 528–535. Springer, 2014.
 Ammar et al. (2015) Ammar, Haitham Bou, Tutunov, Rasul, and Eaton, Eric. Safe policy search for lifelong reinforcement learning with sublinear regret. arXiv preprint arXiv:1505.05798, 2015.
 Bacon & Precup (2015) Bacon, Pierre-Luc and Precup, Doina. The option-critic architecture. In NIPS Deep Reinforcement Learning Workshop, 2015.
 Bai et al. (2012) Bai, Aijun, Wu, Feng, and Chen, Xiaoping. Online planning for large mdps with maxq decomposition. In AAMAS, 2012.
 da Silva et al. (2012) da Silva, B.C., Konidaris, G.D., and Barto, A.G. Learning parameterized skills. In ICML, 2012.

 Eaton & Ruvolo (2013) Eaton, Eric and Ruvolo, Paul L. ELLA: An efficient lifelong learning algorithm. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 507–515, 2013.
 Fu et al. (2015) Fu, Justin, Levine, Sergey, and Abbeel, Pieter. One-shot learning of manipulation skills with online dynamics adaptation and neural network priors. arXiv preprint arXiv:1509.06841, 2015.
 Hausknecht & Stone (2015) Hausknecht, Matthew and Stone, Peter. Deep reinforcement learning in parameterized action space. arXiv preprint arXiv:1511.04143, 2015.
 Hauskrecht (1998) Hauskrecht, Milos, Meuleau, Nicolas, et al. Hierarchical solution of Markov decision processes using macro-actions. In UAI, pp. 220–229, 1998.
 Konidaris & Barto (2009) Konidaris, George and Barto, Andrew G. Skill discovery in continuous reinforcement learning domains using skill chaining. In NIPS, 2009.
 Mankowitz et al. (2014) Mankowitz, Daniel J, Mann, Timothy A, and Mannor, Shie. Time regularized interrupting options. International Conference on Machine Learning, 2014.
 Mann & Mannor (2014) Mann, Timothy A and Mannor, Shie. Scaling up approximate value iteration with options: Better policies with fewer iterations. In Proceedings of the International Conference on Machine Learning, 2014.
 Mann et al. (2015) Mann, Timothy Arthur, Mankowitz, Daniel J, and Mannor, Shie. Learning when to switch between skills in a high dimensional domain. In AAAI Workshop, 2015.
 Masson & Konidaris (2015) Masson, Warwick and Konidaris, George. Reinforcement learning with parameterized actions. arXiv preprint arXiv:1509.01644, 2015.
 Peters & Schaal (2006) Peters, Jan and Schaal, Stefan. Policy gradient methods for robotics. In Intelligent Robots and Systems, 2006 IEEE/RSJ International Conference on, pp. 2219–2225. IEEE, 2006.
 Peters & Schaal (2008) Peters, Jan and Schaal, Stefan. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21:682–691, 2008.
 Precup & Sutton (1997) Precup, Doina and Sutton, Richard S. Multi-time models for temporally abstract planning. In Advances in Neural Information Processing Systems 10 (Proceedings of NIPS'97), 1997.
 Precup et al. (1998) Precup, Doina, Sutton, Richard S, and Singh, Satinder. Theoretical results on reinforcement learning with temporally abstract options. In Machine Learning: ECML98, pp. 382–393. Springer, 1998.
 Silver & Ciosek (2012) Silver, David and Ciosek, Kamil. Compositional Planning Using Optimal Option Models. In Proceedings of the 29th International Conference on Machine Learning, Edinburgh, 2012.
 Sutton et al. (1999) Sutton, Richard S, Precup, Doina, and Singh, Satinder. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 1999.
 Sutton et al. (2000) Sutton, Richard S, McAllester, David, Singh, Satindar, and Mansour, Yishay. Policy gradient methods for reinforcement learning with function approximation. In NIPS, pp. 1057–1063, 2000.
 Thrun & Mitchell (1995) Thrun, Sebastian and Mitchell, Tom M. Lifelong robot learning. Springer, 1995.