Reinforcement Learning (RL) algorithms can learn near-optimal solutions to well-defined problems. However, real-world problems rarely come in the form of a concrete problem description. A human has to translate the poorly defined target problem into a concrete problem description. A Misspecified Problem (MP) occurs when an optimal solution to the problem description is inadequate in the target problem. Unfortunately, creating a well-defined problem description is a challenging art. Furthermore, MPs can have serious consequences in many domains ranging from smart-grids (Abiri-Jahromi et al., 2013; Wu et al., 2010) and robotics (Smart and Kaelbling, 2002) to inventory management systems (Mann and Mannor, 2014). In this paper, we introduce a hierarchical approach that mitigates the consequences of problem misspecification.
RL problems are often described as Markov Decision Processes (Sutton and Barto, 1998, MDPs). A solution to an MDP is a function, called a policy, that generates an action when presented with the current state. The solution to an MDP is any policy that maximizes the long-term sum of rewards. For problems with continuous, high-dimensional state-spaces, explicitly representing the policy is infeasible; thus, for the remainder of this paper we restrict our discussion to linearly parametrized policy representations (Sutton, 1996; Roy and How, 2013).[1]

[1] Our results are generalizable and complementary to non-linear parametric policy representations.
Why are problems misspecified?
While a problem description can be misspecified for many reasons, one important case is due to the state representation. It is well established in the machine learning (Levi and Weiss, 2004; Zhou et al., 2009) and RL (Konidaris et al., 2011) literature that “good” features can have a dramatic impact on performance. Finding “good” features to represent the state is a challenging, domain-specific problem that is generally considered outside of the scope of RL. Unfortunately, domain experts may not supply useful features, either because they do not fully understand the target problem or because they do not understand the technicalities of reinforcement learning.
In addition, we may prefer an MP with a limited state representation for several reasons: (1) Regularization: we wish to have a limited feature representation to improve generalization and avoid overfitting (Singh et al., 1995; Geramifard et al., 2012). (2) Memory and system constraints: only a finite number of features can be used due to computational constraints (Roy and How, 2013; Singh et al., 1995). In real-time systems, querying a feature may take too long. In physical systems, the sensor required to measure a desired feature may be prohibitively expensive. (3) Learning on large data: after learning on large amounts of data, augmenting a feature set with new features to get improved performance is non-trivial and often inefficient (Geramifard et al., 2012).
How can we mitigate misspecification? Learning a hierarchical policy can mitigate the problems associated with an MP. We contrast this against a flat-policy approach, where a single, parameterized policy is used to solve the entire MDP.
To illustrate how learning a hierarchical policy can repair MPs, consider the S-shaped domain shown in Figure 1. To solve the task the agent must move from the bottom left corner to the goal region denoted by the letter ‘G’ in the top right. The state representation only permits policies that move in a straight line. So the problem is misspecified, and it is not solvable with a flat policy approach (Figure 1). However, if we break up the state-space, as shown in Figure 1, and learn one policy for each cell, the problem is solvable.
The partial policies shown in Figure 1 are an example of abstract actions, called options (Sutton et al., 1999), macro-actions (Hauskrecht et al., 1998; He et al., 2011), or skills (Konidaris and Barto, 2009). Learning useful options has been a topic of intense research (McGovern and Barto, 2001; Moerman, 2009; Konidaris and Barto, 2009; Brunskill and Li, 2014; Hauskrecht et al., 1998). However, previous approaches have proposed algorithms for learning options to learn or plan faster. In contrast, our objective is to learn options to repair a MP.
Proposed Algorithm: We introduce a meta-algorithm, Iterative Hierarchical Optimization for Misspecified Problems (IHOMP), that uses an RL algorithm as a “black box” to iteratively learn options that repair MPs. To force the options to specialize, IHOMP uses a partition of the state-space and trains one option for each class in the partition (Figure 1). Any partitioning scheme can be used; however, the choice of partition impacts performance. During an iteration of IHOMP, an RL algorithm updates each option. The options may be initialized arbitrarily, but after the first iteration, options with access to a goal region or non-zero rewards will learn how to exploit those rewards (e.g., Figure 1, Iteration 1). On further iterations, the newly acquired options propagate reward back to other regions of the state-space. Thus, options that previously had no reward signal exploit the rewards of other options that have received meaningful reward signals (e.g., Figure 1, Iterations 2 and 5). Although each option is only learned over a single partition class, it can be initialized in any state.
Why partitions? If all options were trained on all data, the options would not specialize, defeating the purpose of learning multiple policies. Partitions are necessary to foster specialization. Natural partitionings arise in many different applications and are often easy to design. Consider navigation tasks (which we use in this paper for ease of visualization), which are ever-present in robotics (Smart and Kaelbling, 2002), where partitions naturally lead an agent from one location to another in the state space. In addition, partitions are well suited to cyclical tasks, that is, tasks with repeatable cycles (for example, a yearly cycle of 12 months); here the state space can easily be partitioned based on time. Examples include inventory management systems (Mann and Mannor, 2014) as well as maintenance scheduling of generation units and transmission lines in smart grids (Abiri-Jahromi et al., 2013; Wu et al., 2010).
Automatically Learning partitions: The availability of a pre-defined partitioning of the state space is a strong assumption in some domains. We have developed a relaxation to this assumption that can enable partitions to be learned automatically using Regularized Option Interruption (ROI) (Mankowitz et al., 2014; Sutton et al., 1999).
Contributions: Our main contributions are: (1) Introducing Iterative Hierarchical Optimization for Misspecified Problems (IHOMP), which learns options to repair and solve MPs. (2) Theorem 1, which shows that IHOMP converges to a near-optimal solution, relating the quality of the learned policy to the quality of the options learned by the “black box” RL algorithm. (3) Theorem 2, which proves that Regularized Option Interruption (ROI) can be safely incorporated into IHOMP. (4) Experiments demonstrating that, given a misspecified problem, IHOMP can learn options to repair and solve it, and experiments showing IHOMP-ROI learning partitions and discovering reusable options. This divide-and-conquer approach may also enable us to scale to larger MDPs.
Let $M = \langle S, A, P, R, \gamma \rangle$ be an MDP, where $S$ is a (possibly infinite) set of states, $A$ is a finite set of actions, $P(s' \mid s, a)$ is a mapping from state-action pairs to probability distributions over next states, $R(s, a)$ maps each state-action pair to a reward in $[0, R_{\max}]$, and $\gamma \in [0, 1)$ is the discount factor. A policy $\pi(a \mid s)$ gives the probability of executing action $a$ from state $s$.
Let $M$ be an MDP. The value function of a policy $\pi$ with respect to a state $s$ is $V^{\pi}(s) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \mid s_0 = s \right]$, where the expectation is taken with respect to the trajectory produced by following policy $\pi$. The value function of a policy can also be written recursively as
$$V^{\pi}(s) = \sum_{a \in A} \pi(a \mid s) \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V^{\pi}(s') \right],$$
which is known as the Bellman equation. The optimal Bellman equation can be written as
$$V^{*}(s) = \max_{a \in A} \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V^{*}(s') \right].$$
Let $\varepsilon > 0$. We say that a policy $\pi$ is $\varepsilon$-optimal if $|V^{*}(s) - V^{\pi}(s)| \le \varepsilon$ for all $s \in S$. The action-value function of a policy $\pi$ is defined by $Q^{\pi}(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V^{\pi}(s')$ for a state $s$ and an action $a$, and the optimal action-value function is denoted by $Q^{*}$. Throughout this paper, we will drop the dependence on $M$ when it is clear from the context.
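The Bellman equation lends itself to a simple fixed-point computation: repeatedly applying the Bellman operator converges to the policy's value function. Below is a minimal sketch on a hypothetical three-state chain MDP (the MDP, its rewards, and all variable names are our own illustration, not taken from the paper):

```python
import numpy as np

# Hypothetical 3-state, 2-action chain MDP (illustration only).
n_states, n_actions, gamma = 3, 2, 0.9
P = np.zeros((n_states, n_actions, n_states))  # P[s, a, s'] transition probs
P[0, 0, 0] = 1.0; P[0, 1, 1] = 1.0             # action 1 moves right
P[1, 0, 0] = 1.0; P[1, 1, 2] = 1.0
P[2, :, 2] = 1.0                               # state 2 is absorbing
R = np.zeros((n_states, n_actions))
R[1, 1] = 1.0                                  # reward for reaching the goal
pi = np.full((n_states, n_actions), 0.5)       # uniform random policy

def evaluate(pi, iters=500):
    """Iterate V <- sum_a pi(a|s) [R(s,a) + gamma * sum_s' P(s'|s,a) V(s')]."""
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = R + gamma * (P @ V)                # Q[s, a], batched over states
        V = (pi * Q).sum(axis=1)               # average over the policy
    return V

V = evaluate(pi)
```

Because the Bellman operator is a $\gamma$-contraction, the iteration converges geometrically regardless of the initial guess for $V$.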
3 Learning Options
An option is typically defined by a triple $\langle I, \pi, \beta \rangle$, consisting of an initiation set, a policy and a termination condition. However, we want to learn options that are both specialized to specific regions of the state-space and potentially reusable if they are useful in more general contexts. We focus on a special case of options, where an option is defined by a tuple $\langle \pi_{\theta}, \beta \rangle$, where $\pi_{\theta}$ is a parametric policy with parameter vector $\theta$ and $\beta(s) \in \{0, 1\}$ indicates whether the option has finished ($\beta(s) = 1$) or not ($\beta(s) = 0$) given the current state $s$.
Given a set of options $O$ with size $K = |O|$, the inter-option policy is defined by $\mu : S \to \{1, \dots, K\}$, where $S$ is the state-space and $\{1, \dots, K\}$ is the index set over the options in $O$. An inter-option policy selects which options to execute from the current state by returning the index of one of the options. By defining inter-option policies to select an index (rather than the options), we can use the same policy even as the set of options is adapting.
Figure 1 shows an arbitrary partitioning $\mathcal{P} = \{P_1, \dots, P_K\}$, consisting of partition classes $P_i$, defined over the original MDP's state space. Each $P_i$ is initialized with an arbitrary option $o_i$ and its corresponding Local-MDP $M_i$. A Local-MDP (see supplementary material for a full definition) is an episodic MDP that terminates once the agent escapes from $P_i$ and, upon terminating, receives a reward equal to the value of the state the agent would have transitioned to in the original MDP. Applying a planning or RL algorithm to this Local-MDP yields a policy that is a specialized option.
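A Local-MDP episode can be sketched as a thin wrapper around the original MDP's transition function: the episode ends as soon as the agent leaves the partition class, and the escape reward uses the current value estimate of the state escaped to. The function below is our own illustration (all names hypothetical); the paper's full definition is in its supplementary material.

```python
def local_step(env_step, in_class, V, s, a):
    """One step of a Local-MDP for a single partition class (sketch).

    env_step: transition function of the original MDP, (s, a) -> (s', r)
    in_class: predicate, True while a state lies inside the partition class
    V: current value estimate for the original MDP (dict or array)
    """
    s_next, r = env_step(s, a)
    if in_class(s_next):
        return s_next, r, False          # episode continues inside the class
    # Escape: terminate, adding the estimated value of the exit state
    # as the terminal reward (one plausible reading of the construction).
    return s_next, r + V[s_next], True

# Toy illustration: class {0}, deterministic walk s -> s + a, step reward 0.1.
V = {0: 0.0, 1: 2.5}
step = lambda s, a: (s + a, 0.1)
inside = lambda s: s == 0
```

As the value estimates $V$ improve across IHOMP iterations, the escape rewards of each Local-MDP improve, which is how value propagates between partition classes.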
Given a “good” set of options, planning can be significantly faster (Sutton et al., 1999; Mann and Mannor, 2014). However, in many domains we may not be given a good set of options. Therefore it is necessary to learn and improve this set of options. In the next section, we introduce an algorithm for dynamically learning and improving options using iterative hierarchical optimization.
4 Iterative Hierarchical Optimization for Misspecified Problems (IHOMP)
Iterative Hierarchical Optimization for Misspecified Problems (IHOMP, Algorithm 1) takes the original MDP $M$, a partition $\mathcal{P}$ over the state-space and a number of iterations $T$, and returns a pair $\langle \mu, O \rangle$ containing an inter-option policy $\mu$ and a set of options $O$. The number of options $K$ is equal to the number of classes (sub-partitions) in the partition (line 1). The inter-option policy returned by IHOMP is defined (line 2) by $\mu(s) = \sum_{i=1}^{K} i \cdot \mathbb{1}[s \in P_i]$, where $\mathbb{1}[\cdot]$ is the indicator function returning $1$ if its argument is true and $0$ otherwise, and $P_i$ denotes the $i$-th class in the partition $\mathcal{P}$. Thus $\mu$ simply returns the index of the option associated with the partition class containing the current state. On line 3, IHOMP initializes $O$ with arbitrary options (IHOMP can also be initialized with options that we believe might be useful, to speed up learning).
Next (lines 4–14), IHOMP performs $T$ iterations. In each iteration, IHOMP updates the options in $O$ (lines 5–13). Note that the value of an option depends on how it is combined with other options. If we allowed all options to change simultaneously, the options could not reliably propagate value off of each other. Therefore, IHOMP updates each option individually. Multiple iterations are needed so that the option set can converge (Figure 1).
The process of updating the $i$-th option (lines 7–12) starts by evaluating $\mu$ with the current option-set (line 7). Any number of policy evaluation algorithms could be used here, such as TD with function approximation (Sutton and Barto, 1998) or LSTD (Boyan, 2002), modified to be used with options. In our experiments, we used a straightforward variant of LSTD (Sorg and Singh, 2010). Then we use the original MDP to construct a Local-MDP $M_i$ (line 9). Next, IHOMP uses a planning or RL algorithm to approximately solve the Local-MDP, returning a parametrized policy $\pi_{\theta_i}$ (line 10). Any planning or RL algorithm for regular MDPs could fill this role provided that it produces a parametrized policy. However, in our experiments, we used a simple actor-critic PG algorithm, unless otherwise stated. Then a new option $o_i = \langle \pi_{\theta_i}, \beta_i \rangle$ is created (line 11), where $\pi_{\theta_i}$ is the policy derived on line 10 and $\beta_i(s) = \mathbb{1}[s \notin P_i]$. The definition of $\beta_i$ means that the option will terminate only if it leaves the partition class. Finally, we update the option set by replacing the $i$-th option with $o_i$ (line 12). It is important to note that in IHOMP, updating an option is equivalent to solving a Local-MDP.
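The update loop just described can be sketched as follows. The `evaluate`, `build_local_mdp` and `solve` arguments stand in for the black-box policy-evaluation and RL components; all names here are our own illustration, not the paper's Algorithm 1 verbatim.

```python
def ihomp(mdp, partition, n_iters, evaluate, build_local_mdp, solve):
    """Sketch of the IHOMP meta-algorithm.

    partition: list of predicates, one per class, each True inside its class.
    evaluate(mu, options) -> value estimate for the current hierarchy.
    build_local_mdp(mdp, cls, V) -> Local-MDP for one partition class.
    solve(local_mdp) -> specialized parametrized policy for that class.
    """
    K = len(partition)

    # Inter-option policy: index of the partition class containing s.
    def mu(s):
        return next(i for i, in_class in enumerate(partition) if in_class(s))

    options = [None] * K                      # arbitrary initial options
    for _ in range(n_iters):                  # outer iterations
        for i in range(K):                    # update one option at a time
            V = evaluate(mu, options)         # evaluate current hierarchy
            local = build_local_mdp(mdp, partition[i], V)
            policy = solve(local)             # learn a specialized policy
            beta = lambda s, i=i: not partition[i](s)  # terminate on exit
            options[i] = (policy, beta)
    return mu, options

# Toy usage: two classes over the integers, with stub black boxes.
classes = [lambda s: s < 2, lambda s: s >= 2]
mu, options = ihomp(None, classes, n_iters=2,
                    evaluate=lambda mu, opts: {},
                    build_local_mdp=lambda m, cls, V: None,
                    solve=lambda local: "stub-policy")
```

Updating options one at a time, as in the inner loop, is what allows the analysis in Section 5 to bound the effect of each update on the rest of the hierarchy.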
5 Analysis of IHOMP
We provide the first convergence guarantee for hierarchically and iteratively learning options in a continuous-state MDP using IHOMP (Lemma 1 and Lemma 2, proven in the supplementary material). We use these lemmas to prove Theorem 1, which enables us to analyze the quality of the inter-option policy returned by IHOMP. It turns out that the quality of the policy depends critically on the quality of the option learning algorithm. An important parameter for determining the quality of a policy returned by IHOMP is the misspecification error, defined below.
Let $\mathcal{P}$ be a partition over the target MDP's state-space. The misspecification error is
$$\varepsilon_{\mathcal{P}} = \max_{i \in \{1, \dots, K\}} \varepsilon_i,$$
where $\varepsilon_i$ is the smallest $\varepsilon \ge 0$ such that $V^{*}_{M_i}(s) - V^{\pi_i}_{M_i}(s) \le \varepsilon$ for all $s \in P_i$, and $\pi_i$ is the policy returned by the option learning algorithm executed on the Local-MDP $M_i$.
The misspecification error quantifies the quality of the Local-MDP solutions returned by our option learning algorithm. If we used an exact solver to learn options, then $\varepsilon_{\mathcal{P}} = 0$. However, if we use an approximate solver, then $\varepsilon_{\mathcal{P}}$ will be non-zero and the quality will depend on the partition $\mathcal{P}$. Generally, using finer-grained partitions will decrease $\varepsilon_{\mathcal{P}}$. However, Theorem 1 reveals that adding too many options can also negatively impact the returned policy's quality.
Let $T \ge 1$. If we run IHOMP with partition $\mathcal{P}$ for $T$ iterations, then the algorithm returns stitching policy $\mu_T$ such that
$$\left\| V^{*} - V^{\mu_T} \right\|_{\infty} \le \frac{\gamma K \varepsilon_{\mathcal{P}}}{(1 - \gamma)^{2}} + \gamma^{T} \left\| V^{*} - V^{\mu_0} \right\|_{\infty}, \qquad (3)$$
where $K$ is the number of partition classes in $\mathcal{P}$ and $\mu_0$ is the initial stitching policy.
The proof of Theorem 1 is divided into three parts (a complete proof is given in the supplementary material). The main challenge is that updating one option can impact the value of other options. Our analysis starts by bounding the impact of updating one option. Note that $O$ represents an option set and $O_i$ represents an option set where we have updated the $i$-th option (corresponding to the partition class $P_i$) in the set. In the first part, we show that the error between $V^{*}$, the globally optimal value function, and the value function induced by $O_i$ contracts on the updated class $P_i$, and is bounded in terms of the misspecification error elsewhere (Lemma 1). In the second part, we apply an inductive argument to show that updating all options results in a contraction over the entire state space (Lemma 2). In the third part, we apply this contraction recursively, which proves Theorem 1.
This provides the first theoretical guarantee of convergence to a near-optimal solution when hierarchically and iteratively learning a set of options in a continuous-state MDP. Theorem 1 tells us that when the misspecification error $\varepsilon_{\mathcal{P}}$ is small, IHOMP returns a near-optimal inter-option policy. The first term on the right-hand side of (3) is the approximation error. This is the loss we pay for the parametrized class of policies that we learn options over. Since $K$ represents the number of classes defined by the partition, we now have a formal way of analyzing the effect of the partitioning structure. In addition, complex options do not need to be designed by a domain expert; only the partitioning needs to be provided a-priori. The second term is the convergence error; it goes to $0$ as the number of iterations $T$ increases.
The guarantee provided by Theorem 1 may appear similar to (Hauskrecht et al., 1998, Theorem 1). However, Hauskrecht et al. (1998) derive options only at the beginning of the learning process and do not update them. On the other hand, IHOMP updates its option-set dynamically by propagating value throughout the state space during each iteration. Thus, IHOMP does not require prior knowledge of the optimal value function.
Theorem 1 does not explicitly present the effect of policy evaluation error, which occurs with any approximate policy evaluation technique. However, if the policy evaluation error is bounded by $\varepsilon_{PE}$, then we can simply replace $\varepsilon_{\mathcal{P}}$ in (3) with $\varepsilon_{\mathcal{P}} + \varepsilon_{PE}$. Again, smaller policy evaluation error leads to smaller approximation error.
6 Learning Partitions via Regularized Option Interruption
So far IHOMP has assumed a partition is given a-priori. However, it may be non-trivial to design a partition and, in many cases, the partition may be sub-optimal. To relax this assumption, we incorporate Regularized Option Interruption (ROI) (Mankowitz et al., 2014) into this work to enable IHOMP to automatically learn a near-optimal partition from an initially misspecified problem.
IHOMP keeps track of the action-value function $Q^{\mu}(s, i)$, which represents the expected value of being in state $s$ and executing option $o_i$, given the inter-option policy $\mu$ and option set $O$. ROI uses this estimate of the action-value function to enable the agent to choose when to switch options according to the following termination rule:
$$\beta_i(s) = \begin{cases} 1 & \text{if } Q^{\mu}(s, i) < \max_{j} Q^{\mu}(s, j) - \rho(s), \\ 0 & \text{otherwise}. \end{cases}$$
Here $\beta_i(s)$ corresponds to the termination probability of the $i$-th option partition and $\rho(s) \ge 0$ is a regularization function. This rule is illustrated in Figure 2. A user has designed a partition resulting in an MP (Figure 2), compared to the optimal partition for this domain (Figure 2). IHOMP applies ROI to ‘modify’ the initial partition into the optimal one. By learning the optimal action-value function $Q^{\mu}$, IHOMP builds a near-optimal partition (Figure 2) that is implicitly stored within this action-value function. That is, if the agent is executing option $o_i$ in partition class $P_i$, and the value of continuing with option $o_i$, $Q^{\mu}(s, i)$, falls more than $\rho(s)$ below the value of the best available option (see Figure 2), then the agent switches to the better option partition. Otherwise, it continues executing the current option (see Figure 2).
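The switching behaviour described above can be sketched as follows, with `Q(s, i)` an estimated option-value function and `rho` the regularization function (names and signature are our own illustration):

```python
def roi_switch(Q, n_options, s, current, rho):
    """Regularized Option Interruption rule (sketch).

    Q(s, i): estimated action-value of executing option i from state s.
    rho(s) >= 0: regularization; larger values make interruption rarer.
    The current option is interrupted only when continuing with it is worse
    than the best alternative by more than rho(s).
    """
    values = [Q(s, i) for i in range(n_options)]
    best = max(range(n_options), key=lambda i: values[i])
    if values[current] < values[best] - rho(s):
        return best      # interrupt: switch to the better option
    return current       # continue executing the current option

# Toy usage: two options with Q-values 1.0 and 2.0 in every state.
Q = lambda s, i: [1.0, 2.0][i]
```

With `rho` identically zero this reduces to always-greedy interruption; the regularization prevents excessive switching driven by small estimation errors.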
This leads to a new algorithm, IHOMP-ROI (IHOMP with Regularized Option Interruption). The algorithm can be found in the supplementary material. The key difference between IHOMP and IHOMP-ROI is that ROI is applied during the policy evaluation step after each of the options has been updated. IHOMP-ROI automatically learns an improved partition between iterations. We show that ROI can be safely incorporated into IHOMP in Theorem 2; the theorem shows that incorporating ROI can only improve the policy produced by IHOMP. The full proof is given in the supplementary material.
Theorem 2 (IHOMP-ROI Approximate Convergence). Eq. (3) also holds for IHOMP-ROI.
7 Experiments and Results
We performed experiments on three well-known RL benchmarks: Mountain Car (MC), Puddle World (PW) (Sutton, 1996) and the Pinball domain (Konidaris and Barto, 2009). We also performed experiments in a sub-domain of Minecraft (https://minecraft.net/en/). The MC and Minecraft domains have similar results to PW and have therefore been moved to the supplementary material. We use two variations of the Pinball domain, namely maze-world (moved to supplementary material), which we created, and pinball-world, which is one of the standard pinball benchmark domains. Finally, we created a domain which we call the Two Rooms domain to demonstrate how IHOMP-ROI can improve partitions.
In each experiment, we defined an MP for which no flat policy is adequate; in some of the tasks, a flat policy cannot solve the task at all. These experiments simulate situations where the policy representation is constrained to avoid overfitting, to manage system constraints, or to cope with poorly designed features. In each case, IHOMP learns a significantly better policy than the non-hierarchical approach. In the Two Rooms domain, IHOMP-ROI improves the initial partition. Our experiments demonstrate the potential to scale up to higher-dimensional domains by hierarchically combining options over simple representations (a video of IHOMP solving the Pinball tasks and the Minecraft sub-domain can be found in the supplementary material).
IHOMP is a meta-algorithm. We provide an algorithm for Policy Evaluation (PE) and Policy Learning (PL). For the MC and PW domains, we used SMDP-LSTD (Sorg and Singh, 2010) for PE and a modified version of Regular-Gradient Actor-Critic (RG-AC) (Bhatnagar et al., 2009) for PL (see supplementary material for details). In the Pinball domains, we used Nearest-Neighbor Function Approximation (NN-FA) for PE and UCB Random Policy Search (UCB-RPS) for PL. In the two rooms domain, we use a variation of LSTDQ with Option Interruption for PE and RG-AC for PL.
For the MC, PW, and Two Rooms domains, each intra-option policy is represented as a probability distribution over actions (independent of the state). We compare their performance to the original misspecified problem using a flat policy with the same representation. Grid-like partitions are generated for each task, and binary-grid features are used to estimate the value function. In the Pinball domains, each option is represented by polynomial features corresponding to each state dimension and a bias term. The value function is represented by a KD-Tree containing state-value pairs uniformly sampled in the domain. A value for a particular state is obtained by assigning the value of the nearest neighbor to that state contained within the KD-tree. These are example representations; in principle, any value function and policy representation that is representative of the domain can be utilized.
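A state-independent action distribution of this kind is straightforward to parameterize. The sketch below uses a softmax over a parameter vector; the softmax choice and all names are our own illustration, as the paper does not specify the exact parameterization.

```python
import numpy as np

def action_probs(theta):
    """State-independent intra-option policy (sketch): the parameter vector
    theta defines a softmax distribution over actions, regardless of state."""
    z = np.exp(theta - np.max(theta))   # subtract max for numerical stability
    return z / z.sum()

# With all-equal parameters, every action is equally likely.
p = action_probs(np.array([0.0, 0.0, 0.0, 0.0]))
```

Such a policy can only "move in one direction" in expectation, which is exactly why a single flat policy of this form misspecifies the navigation tasks, while a hierarchy of them does not.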
Puddle World: Puddle World is a continuous 2-dimensional world containing two puddles, as shown in Figure 3. A successful agent (red ball) should navigate to the goal location (blue square) while avoiding the puddles. The state space is the location of the agent. Initially, the agent is provided with a misspecified problem: the flat policy can only move in a single direction (thus it cannot avoid the puddles). Figure 3 compares this flat policy with IHOMP (for a grid partition with four options). The flat policy achieves low average reward. However, IHOMP turns the flat policy into options and hierarchically composes these options together, resulting in a richer solution space and a higher average reward, as seen in Figure 3. This is comparable to the approximately optimal average reward attained by executing Approximate Value Iteration (AVI) for a large number of iterations. In this experiment, IHOMP is not initiated in the partition class containing the goal state but still achieves near-optimal convergence after only a few iterations.
Figure 3 compares the performance of different partitions, where a single-class grid represents the flat policy of the initially misspecified problem. The option learning error $\varepsilon_{\mathcal{P}}$ is significantly smaller for all of the finer partitions, resulting in lower cost. On the other hand, according to Theorem 1, adding more options increases the cost. A trade-off therefore exists between $\varepsilon_{\mathcal{P}}$ and the number of options $K$. In practice, $\varepsilon_{\mathcal{P}}$ tends to dominate $K$. In addition to the trade-off, the importance of the partition design is evident when comparing grids of similar size: the partition whose design is better suited to Puddle World results in lower cost.
Pinball: We tested IHOMP on the challenging pinball-world task (Figure 4) (Konidaris and Barto, 2009). The agent is initially provided with a flat policy over a limited feature set. This results in a misspecified problem, as the agent is unable to solve the task using this limited representation, as shown by the average reward in Figure 4. Using IHOMP with a grid partition, specialized options were learned. IHOMP clearly outperforms the flat policy, as shown in Figure 4. It is less than optimal but still manages to perform the task sufficiently well (see value function, Figure 4). The drop in performance is due to a complicated obstacle setup, non-linear dynamics and the partition design. Nevertheless, this shows that IHOMP can produce a reasonable solution with a limited representation.
Improving Partitions: Providing a ‘good’ option partitioning a-priori is a strong assumption. It may be non-trivial to design the partitioning especially in continuous, high-dimensional domains. A sub-optimal partitioning may still mitigate misspecification, but it will not result in a near-optimal solution. To relax this assumption, we have incorporated Regularized Option Interruption into IHOMP to produce the IHOMP-ROI Algorithm. This algorithm learns the options and improves the partition, effectively determining where the options should be executed in the state space.
We tested IHOMP-ROI on the two rooms domain shown in Figure 5. The agent (red ball) needs to navigate to the goal region (blue square). The policy parameterization is limited to a distribution over actions (moving in a single direction). This limited representation results in an MP, as the agent is unable to traverse between the two rooms. If we use IHOMP with a sub-optimal partitioning containing two options, as shown by the red and green cells in Figure 5, the problem is still misspecified: the agent leaves the red cell and immediately gets trapped behind the wall whilst in the green cell. Using IHOMP-ROI, as shown in Figure 5, the agent learns both the options and a partition such that the agent can navigate to the goal. The green region in the bottom left corner arises from function approximation error but does not prevent the agent from reaching the goal. If the reader looks carefully at Figure 5, they will notice something unexpected. The optimal partitioning learned for the Two Rooms domain includes executing the red option in region B. This is intuitive given the parameterizations learned for each of the options: the red option has a dominant rightward action, whereas the green option has a dominant upward action. When the agent is in region B, it makes more sense to execute the red option to reach the goal. Thus, IHOMP-ROI provides an effective way not only to learn an improved partition, but also to discover where options should be reused.
We introduced IHOMP, an RL algorithm for iteratively learning options and an inter-option policy (Sutton et al., 1999) to repair an MP. We provided theoretical results for IHOMP that directly relate the quality of the final inter-option policy to the misspecification error. IHOMP is the first algorithm that provides theoretical convergence guarantees while iteratively learning a set of options in a continuous state space. In addition, we developed IHOMP-ROI, which makes use of regularized option interruption (Sutton et al., 1999; Mankowitz et al., 2014) to learn an improved partition for an initially misspecified problem. IHOMP-ROI is also able to discover regions in the state space where options should be reused. In high-dimensional domains, partitions can be learned from expert demonstrations (Abbeel and Ng, 2005) and intra-option policies can be represented as Deep Q-Networks (Mnih, 2015). Option reuse can be especially useful for transfer learning (Tessler et al., 2016) and multi-agent settings (Garant et al., 2015).
- Abbeel and Ng (2005) Pieter Abbeel and Andrew Y Ng. Exploration and apprenticeship learning in reinforcement learning. In Proceedings of the 22nd International Conference on Machine Learning, 2005.
- Abiri-Jahromi et al. (2013) Amir Abiri-Jahromi, Masood Parvania, Francois Bouffard, and Mahmud Fotuhi-Firuzabad. A two-stage framework for power transformer asset maintenance management—part i: Models and formulations. Power Systems, IEEE Transactions on, 28(2):1395–1403, 2013.
- Bhatnagar et al. (2009) Shalabh Bhatnagar, Richard S Sutton, Mohammad Ghavamzadeh, and Mark Lee. Natural actor–critic algorithms. Automatica, 45(11):2471–2482, 2009.
- Boyan (2002) Justin A Boyan. Technical update: Least-squares temporal difference learning. Machine Learning, 2002.
- Brunskill and Li (2014) Emma Brunskill and Lihong Li. PAC-inspired option discovery in lifelong reinforcement learning. JMLR, 2014.
- Garant et al. (2015) Daniel Garant, Bruno C. da Silva, Victor Lesser, and Chongjie Zhang. Accelerating Multi-agent Reinforcement Learning with Dynamic Co-learning. Technical report, 2015.
- Geramifard et al. (2012) A Geramifard, S Tellex, D Wingate, N Roy, and JP How. A bayesian approach to finding compact representations for reinforcement learning. In European Workshops on Reinforcement Learning (EWRL), 2012.
- Hauskrecht et al. (1998) Milos Hauskrecht, Nicolas Meuleau, Leslie Pack Kaelbling, Thomas Dean, and Craig Boutilier. Hierarchical solution of markov decision processes using macro-actions. In Proceedings of the 14th Conference on Uncertainty in AI, pages 220–229, 1998.
- He et al. (2011) Ruijie He, Emma Brunskill, and Nicholas Roy. Efficient planning under uncertainty with macro-actions. Journal of Artificial Intelligence Research, 40:523–570, 2011.
- Konidaris et al. (2011) G.D. Konidaris, S. Osentoski, and P.S. Thomas. Value function approximation in reinforcement learning using the fourier basis. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011.
- Konidaris and Barto (2009) George Konidaris and Andrew G Barto. Skill discovery in continuous reinforcement learning domains using skill chaining. In NIPS 22, pages 1015–1023, 2009.
- Levi and Weiss (2004) Kobi Levi and Yair Weiss. Learning object detection from a small number of examples: the importance of good features. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, pages II–53. IEEE, 2004.
- Mankowitz et al. (2014) Daniel J Mankowitz, Timothy A Mann, and Shie Mannor. Time regularized interrupting options. ICML, 2014.
- Mann and Mannor (2014) Timothy A Mann and Shie Mannor. Scaling up approximate value iteration with options: Better policies with fewer iterations. In Proceedings of the ICML, 2014.
- McGovern and Barto (2001) Amy McGovern and Andrew G Barto. Automatic Discovery of Subgoals in Reinforcement Learning using Diverse Density. In Proceedings of the 18th ICML, pages 361 – 368, 2001.
- Mnih (2015) Volodymyr Mnih et al. Human-level control through deep reinforcement learning. Nature, 2015.
- Moerman (2009) Wilco Moerman. Hierarchical reinforcement learning: Assignment of behaviours to subpolicies by self-organization. PhD thesis, Cognitive Artificial Intelligence, Utrecht University, 2009.
- Roy and How (2013) N Roy and JP How. A tutorial on linear function approximators for dynamic programming and reinforcement learning. 2013.
- Singh et al. (1995) Satinder P Singh, Tommi Jaakkola, and Michael I Jordan. Reinforcement learning with soft state aggregation. Advances in neural information processing systems, pages 361–368, 1995.
- Smart and Kaelbling (2002) William D Smart and Leslie Pack Kaelbling. Effective reinforcement learning for mobile robots. In Robotics and Automation, 2002. Proceedings. ICRA’02. IEEE International Conference on, 2002.
- Sorg and Singh (2010) Jonathan Sorg and Satinder Singh. Linear options. In Proceedings AAMAS, pages 31–38, 2010.
- Sutton (1996) Richard Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in neural information processing systems, pages 1038–1044, 1996.
- Sutton and Barto (1998) Richard Sutton and Andrew Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
- Sutton et al. (1999) Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1):181–211, August 1999.
- Tessler et al. (2016) Chen Tessler, Shahar Givony, Tom Zahavy, Daniel J Mankowitz, and Shie Mannor. A deep hierarchical approach to lifelong learning in minecraft. arXiv preprint arXiv:1604.07255, 2016.
- Wu et al. (2010) Lei Wu, Mohammad Shahidehpour, and Yong Fu. Security-constrained generation and transmission outage scheduling with uncertainties. Power Systems, IEEE Transactions on, 25(3):1674–1685, 2010.
- Zhou et al. (2009) Huiyu Zhou, Yuan Yuan, and Chunmei Shi. Object tracking using sift features and mean shift. Computer vision and image understanding, 113(3):345–352, 2009.