Recently, Konidaris, Kaelbling, and Lozano-Perez [2014] considered the question of how to construct a symbolic representation suitable for planning in high-dimensional continuous domains, given a set of high-level skills. The key result of that work was that the appropriate abstract representation of the problem was directly determined by characteristics of the skills available to the agent: the skills determine the representation, and adding new high-level skills must result in a new representation.
We show that these two processes can be combined into a skill-symbol loop: the agent acquires a set of high-level skills, then constructs the appropriate representation for planning using them, resulting in a new problem in which the agent can again perform skill acquisition. Repeating this process leads to a true abstraction hierarchy where both the available skills and the state space become more abstract at each level of the hierarchy. We describe the properties of the resulting abstraction hierarchies and demonstrate the construction and use of one such hierarchy in the Taxi domain.
Reinforcement learning problems are typically formalized as Markov decision processes or MDPs, represented by a tuple $M = (S, A, R, P, \gamma)$, where $S$ is a set of states, $A$ is a set of actions, $R(s, a, s')$ is the reward the agent receives when executing action $a$ in state $s$ and transitioning to state $s'$, $P(s' \mid s, a)$ is the probability of the agent finding itself in state $s'$ having executed action $a$ in state $s$, and $\gamma$ is a discount factor.
We are interested in the multi-task reinforcement learning setting where, rather than solving a single MDP, the agent is tasked with solving several problems drawn from some task distribution. Each individual problem is obtained by adding a set of start and goal states to a base MDP that specifies the state and action spaces and background reward function. The agent’s task is to minimize the average time required to solve new problems drawn from this distribution.
2.1 Hierarchical Reinforcement Learning
Hierarchical reinforcement learning [Barto and Mahadevan2003] is a framework for learning and planning using higher-level actions built out of the primitive actions available to the agent. Although other formalizations exist, most notably the MAX-Q [Dietterich2000] and Hierarchy of Abstract Machines [Parr and Russell1997] approaches, we adopt the options framework [Sutton, Precup, and Singh1999], which models temporally abstract macro-actions as options.
An option $o$ consists of three components: an option policy, $\pi_o$, which is executed when the option is invoked; an initiation set, $I_o$, which describes the states in which the option may be executed; and a termination condition, $\beta_o(s) \in [0, 1]$, which describes the probability that an option will terminate upon reaching state $s$.
An MDP where primitive actions are replaced by a set of possibly temporally-extended options (some of which could simply execute a single primitive action) is known as a semi-Markov decision process (or SMDP), which generalizes MDPs to handle action executions that may take more than one time step. An SMDP is described by a tuple $(S, O, R, P, \gamma)$, where $S$ is a set of states; $O$ is a set of options; $R(s', \tau \mid s, o)$ is the reward received when executing option $o \in O$ at state $s \in S$, and arriving in state $s' \in S$ after $\tau$ time steps; $P(s', \tau \mid s, o)$ is a PDF describing the probability of arriving in state $s' \in S$, $\tau$ time steps after executing option $o \in O$ in state $s \in S$; and $\gamma$ is a discount factor, as before.
The problem of deciding which options an agent should acquire is known as the skill discovery problem. A skill discovery algorithm must, through experience (and perhaps additional advice or domain knowledge), acquire new options by specifying their initiation set, $I_o$, and termination condition, $\beta_o$. The option policy is usually specified indirectly via an option reward function, $R_o$, which is used to learn $\pi_o$. Each new skill is added to the set of options available to the agent with the aim of either solving the original or subsequent tasks more efficiently. Our framework is agnostic to the specific skill discovery method used (many exist).
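The three components of an option can be pictured as a small data structure. The following is a minimal sketch only: the class name `Option`, the helper `run_option`, and the one-dimensional corridor environment are all hypothetical and are not part of the framework described here.

```python
import random
from dataclasses import dataclass
from typing import Callable, Hashable


@dataclass(frozen=True)
class Option:
    """An option: a policy, an initiation set test, and a termination condition."""
    name: str
    policy: Callable[[Hashable], Hashable]         # state -> primitive action
    can_start: Callable[[Hashable], bool]          # membership test for the initiation set
    termination_prob: Callable[[Hashable], float]  # state -> probability of terminating


def run_option(option, state, step, rng, max_steps=1000):
    """Execute an option until it terminates; return (final_state, steps_taken)."""
    if not option.can_start(state):
        raise ValueError(f"{option.name} is not executable in {state!r}")
    for t in range(1, max_steps + 1):
        state = step(state, option.policy(state))
        if rng.random() < option.termination_prob(state):
            return state, t
    return state, max_steps


# Hypothetical corridor: states are integers, primitive actions are +1 / -1.
walk_to_5 = Option(
    name="walk-to-5",
    policy=lambda s: 1,
    can_start=lambda s: 0 <= s < 5,
    termination_prob=lambda s: 1.0 if s == 5 else 0.0,
)
final_state, duration = run_option(walk_to_5, 2, lambda s, a: s + a, random.Random(0))
```

The returned duration is what makes the resulting decision process semi-Markov: a single option execution may span several primitive time steps.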
2.2 Representation Acquisition
While skill acquisition allows an agent to construct higher-level actions, it alone is insufficient for constructing a true abstraction hierarchy because the agent must still plan in the original state space, no matter how abstract its actions become. A complementary approach is taken by recent work on representation acquisition [Konidaris, Kaelbling, and Lozano-Perez2014], which considers the question of constructing a symbolic description of an SMDP suitable for high-level planning. Key to this is the definition of a symbol as a name referring to a set of states:
Definition 1. A propositional symbol $\sigma_Z$ is the name of a test $\tau_Z$, and corresponding set of states $Z = \{s \in S \mid \tau_Z(s) = 1\}$.
The test, or grounding classifier, is a compact representation of a (potentially uncountably infinite) set of states (the grounding set). Logical operations (e.g., and) using the resulting symbolic names have the semantic meaning of set operations (e.g., $\cap$) over the grounding sets, which allows us to reason about which symbols (and corresponding grounding classifiers) an agent should construct in order to be able to determine the feasibility of high-level plans composed of sequences of options. We use the grounding operator $\mathcal{G}$ to obtain the grounding set of a symbol or symbolic expression; for example, $\mathcal{G}(\sigma_Z) = Z$, and $\mathcal{G}(\sigma_A \text{ and } \sigma_B) = A \cap B$. For convenience we also define $\mathcal{G}$ over collections of symbols; for a set of symbols $A$, we define $\mathcal{G}(A) = \cup_i \mathcal{G}(a_i), \forall a_i \in A$.
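For a finite state space, the correspondence between symbols and sets can be written out directly. This is an illustrative sketch: the ten-state space, the symbol names, and the function names are assumptions made for the example, and a real grounding classifier would be a learned test rather than an enumerated set.

```python
def grounding_set(test, states):
    """Enumerate the grounding set of a test over a finite state space."""
    return frozenset(s for s in states if test(s))


# Hypothetical state space and two hypothetical symbols.
states = range(10)
symbols = {
    "even": grounding_set(lambda s: s % 2 == 0, states),
    "small": grounding_set(lambda s: s < 5, states),
}


def ground(name):
    """Grounding operator for a single symbol."""
    return symbols[name]


def ground_and(a, b):
    """Logical 'and' over symbols corresponds to set intersection."""
    return ground(a) & ground(b)


def ground_all(names):
    """Grounding of a collection of symbols: the union of their grounding sets."""
    return frozenset().union(*(ground(n) for n in names))


even_and_small = ground_and("even", "small")
either = ground_all(["even", "small"])
```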
Konidaris, Kaelbling, and Lozano-Perez [2014] showed that defining a symbol for each option’s initiation set and the symbols necessary to compute its image (the set of states the agent might be in after executing the option from some set of starting states) are necessary and sufficient for planning using that set of options. The feasibility of a plan is evaluated by computing each successive option’s image, and then testing whether it is a subset of the next option’s initiation set. Unfortunately, computing the image of an option is intractable in the general case. However, the definition of the image for two common classes of options is both natural and computationally very simple.
The first is the subgoal option: the option reaches some set of states and terminates, and the state it terminates in can be considered independent of the state execution began in. In this case we can create a symbol for that set (called the effect set: the set of all possible states the option may terminate in), and use it directly as the option’s image. We thus obtain $2n$ symbols for $n$ options (a symbol for each option’s initiation and effect sets), from which we can build a plan graph representation: a graph with $n$ nodes, and an edge from node $i$ to node $j$ if option $j$’s initiation set is a superset of option $i$’s effect set. Planning amounts to finding a path in the plan graph; once this graph has been computed, the grounding classifiers can be discarded.
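The plan graph for subgoal options can be sketched with explicit finite sets. This is a minimal illustration under stated assumptions: the three door-related options, the integer states, and the function names are hypothetical, and breadth-first search stands in for whatever graph search one prefers.

```python
from collections import deque


def build_plan_graph(options):
    """options maps name -> (initiation_set, effect_set).

    There is an edge i -> j when option j's initiation set contains
    option i's effect set, so j is always executable after i.
    """
    graph = {}
    for i, (_, effect_i) in options.items():
        graph[i] = [j for j, (init_j, _) in options.items() if effect_i <= init_j]
    return graph


def find_plan(options, graph, start, goal):
    """Breadth-first search for a sequence of options from `start` into `goal`."""
    if start <= goal:
        return []
    frontier = deque([name] for name, (init, _) in options.items() if start <= init)
    visited = {path[0] for path in frontier}
    while frontier:
        path = frontier.popleft()
        if options[path[-1]][1] <= goal:
            return path
        for nxt in graph[path[-1]]:
            if nxt not in visited:
                visited.add(nxt)
                frontier.append(path + [nxt])
    return None


# Hypothetical options over integer-labelled states.
options = {
    "to-door": (frozenset({0, 1, 2}), frozenset({3})),
    "open-door": (frozenset({3}), frozenset({4})),
    "enter": (frozenset({4}), frozenset({5})),
}
graph = build_plan_graph(options)
plan = find_plan(options, graph, start=frozenset({0, 1}), goal=frozenset({5}))
```

Once `graph` has been built, the search itself never consults the state sets of intermediate options, which mirrors the observation that the grounding classifiers can be discarded after construction.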
The second class of options are abstract subgoal options: the low-level state is factored, and some variables are set to a subgoal (again, independently of the starting state) while others remain unchanged. The image operator can then be computed using the intersection of the effect set (as in the subgoal option case) and the starting state classifier with the modified factors projected out. This results in a STRIPS-like factored representation which can be automatically converted to PDDL [McDermott et al.1998] and used as input to an off-the-shelf task planner. After this conversion the grounding classifiers can again be discarded.
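For a small factored state space, the image of an abstract subgoal option can be computed by enumeration. The sketch below assumes states are tuples of factor values with known finite domains; the two-factor example (a position and a passenger flag) and the function names are hypothetical.

```python
from itertools import product


def project(states, mask, domains):
    """Project out the factors in `mask`: those factors may take any value."""
    result = set()
    for s in states:
        free = [domains[i] if i in mask else (s[i],) for i in range(len(s))]
        result.update(product(*free))
    return result


def image(start, effect, mask, domains):
    """Image of an abstract subgoal option: projected start set intersected with the effect set."""
    return project(start, mask, domains) & effect


# Hypothetical factors: (position in {0, 1, 2}, has_passenger in {False, True}).
domains = [(0, 1, 2), (False, True)]
# A hypothetical drive-to-2 option sets factor 0 to the value 2 and leaves factor 1 unchanged.
effect = {(2, False), (2, True)}
start = {(0, True), (1, True)}
after = image(start, effect, mask={0}, domains=domains)
```

The unmodified factor (the passenger flag) carries over from the start set, while the modified factor takes its value from the effect set.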
3 Constructing Abstraction Hierarchies
These results show that the two fundamental aspects of hierarchy—skills and representations—are tightly coupled: skill acquisition drives representational abstraction. An agent that has performed skill acquisition in an MDP to obtain higher-level skills can automatically determine a new abstract state representation suitable for planning in the resulting SMDP. We now show that these two processes can be alternated to construct an abstraction hierarchy.
We assume the following setting: an agent is faced with some base MDP $M_0$, and aims to construct an abstraction hierarchy that enables efficient planning for new problems posed in $M_0$, each of which is specified by a start and goal state set. $M_0$ may be continuous-state and even continuous-action, but all subsequent levels of the hierarchy will be constructed to be discrete-state and discrete-action. We adopt the following definition of an abstraction hierarchy:
Definition 2. An $n$-level hierarchy on base MDP $M_0 = (S_0, A_0, R_0, P_0, \gamma_0)$ is a collection of MDPs $M_i = (S_i, A_i, R_i, P_i, \gamma_i)$, $i \in \{1, \ldots, n\}$, such that each action set $A_j$, $0 < j \le n$, is a set of options defined over $M_{j-1}$ (i.e., $M_{j-1}$ with its actions replaced by $A_j$ is an SMDP).
This captures the core assumption behind hierarchical reinforcement learning: hierarchies are built through macro-actions. Note that this formulation retains the downward refinement property from classical hierarchical planning [Bacchus and Yang1991], meaning that a plan at level $j$ can be refined to a plan at level $j-1$ without backtracking to level $j$ or higher, because a policy at any level is also a (not necessarily Markovian [Sutton, Precup, and Singh1999]) policy at any level lower, including $M_0$. However, while Definition 2 links the action set of each MDP to the action set of its predecessor in the hierarchy, it says nothing about how to link their state spaces. To do so, we must in addition determine how to construct a new state space $S_j$, transition probability function $P_j$, and reward function $R_j$.
Fortunately, this is exactly what representation acquisition provides: a method for constructing a new symbolic representation suitable for planning in the SMDP over $S_{j-1}$ using the options in $A_j$. This provides a new state space $S_j$, which, combined with $A_j$, specifies $P_j$. The only remaining component is the reward function. A representation construction algorithm based on sets [Konidaris, Kaelbling, and Lozano-Perez2014], such as we adopt here, is insufficient for reasoning about expected rewards, which requires a formulation based on distributions [Konidaris, Kaelbling, and Lozano-Perez2015]. For simplicity, we can remain consistent and simply set the reward to a uniform transition penalty of $-1$; alternatively, we can adopt just one aspect of the distribution-based representation and set $R_j$ to the empirical mean of the rewards obtained when executing each option.
Thus, we have all the components required to build level $j$ of the hierarchy from level $j-1$. This procedure can be repeated in a skill-symbol loop, alternating skill acquisition and representation acquisition phases, to construct an abstraction hierarchy. It is important to note that there are no degrees of freedom or design choices in the representation acquisition phase of the skill-symbol loop; the algorithmic questions reside solely in determining which skills to acquire at each level.
This construction results in a specific relationship between MDPs in a hierarchy: every state at level $j$ refers to a set of states at level $j-1$. (Note that $S_j$ is not necessarily a partition of $S_{j-1}$: the grounding sets of two states in $S_j$ may overlap.) A grounding in $M_0$ can therefore be computed for any state at level $j$ in the hierarchy by applying the grounding operator $j$ times. If we denote this “final grounding” operator as $\mathcal{G}_0$, then $\forall j, s_j \in S_j$, $\exists Z_0 \subseteq S_0$ such that $\mathcal{G}_0(s_j) = Z_0$.
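Repeated application of the grounding operator is straightforward to express when each level's groundings are stored as explicit sets. The following sketch assumes a hypothetical three-level hierarchy; the dictionary layout `groundings[j][s]` and all state names are illustrative choices.

```python
def final_grounding(state, level, groundings):
    """Ground a level-`level` state down to base states.

    groundings[j][s] is the set of level j-1 states that level-j state s refers to.
    """
    current = {state}
    for j in range(level, 0, -1):
        current = set().union(*(groundings[j][s] for s in current))
    return current


# Hypothetical hierarchy: base states 0..5, three level-1 states, two level-2 states.
groundings = {
    1: {"left": {0, 1, 2}, "mid": {2, 3}, "right": {4, 5}},
    2: {"west": {"left", "mid"}, "east": {"right"}},
}
west_base = final_grounding("west", 2, groundings)
east_base = final_grounding("east", 2, groundings)
```

Here `left` and `mid` overlap on base state 2, which illustrates that a level need not partition the level below it.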
We now illustrate the construction of an abstraction hierarchy via an example: a very simple task that must be solved by a complex agent. Consider a robot in a room with two boxes, one containing an apple (Figure 1a). The robot must occasionally move the apple from one box to the other. Directly accomplishing this involves solving a high-dimensional motion planning problem, so instead the robot is given five motor skills: move-gripper-above1 and move-gripper-above2 use motion planning to move the robot’s gripper above each box; pregrasp controls the gripper so that it cages the apple, and is only executable from above it; grasp can be executed following pregrasp, and runs a gradient-descent based controller to achieve wrench-closure on the apple; and release drops the apple. These form $A_1$, the actions in the first level of the hierarchy, and since they are abstract subgoal options the robot automatically constructs a factored state space (see Figure 1b) that specifies $M_1$. This enables abstract planning: the state space is independent of the complexity of the robot, although $M_1$ contains some low-level details (e.g., pregrasped).
Applying a skill discovery algorithm in $M_1$, the robot detects that pregrasp is always followed by grasp, and therefore replaces these actions with grab-apple, which together with the remaining skills in $A_1$ forms $A_2$. This results in a smaller MDP, $M_2$ (Figure 1c), which is a good abstract model of the task. Applying a skill discovery algorithm to $M_2$ creates a skill that picks up the apple in whichever box it is in, and moves it over the other box. $A_3$ now consists of just a single action, swap-apple, requiring just two propositions to define $S_3$: apple-in-box-1, and apple-in-box-2 (Figure 1d). The abstraction hierarchy has abstracted away the details of the robot (in all its complexity) and exposed the (almost trivial) underlying task structure.
4 Planning Using an Abstraction Hierarchy
Once an agent has constructed an abstraction hierarchy, it must be able to use it to rapidly find plans for new problems. We formalize this process as the agent posing a plan query to the hierarchy, which should then be used to generate a plan for solving the problem described by the query. We adopt the following definition of a plan query:
Definition 3. A plan query is a tuple $(B, G)$, where $B \subseteq S_0$ is the set of base MDP states from which execution may begin, and $G \subseteq S_0$ (the goal) is the set of base MDP states in which the agent wishes to find itself following execution.
The critical question is at which level of the hierarchy planning should take place. We first define a useful predicate, planmatch, which determines whether an agent should attempt to plan at level $j$ (see Figure 2):
Definition 4. A pair of abstract state sets $b$ and $g$ match a plan query $(B, G)$ (denoted planmatch$(b, g, B, G)$) when $B \subseteq \mathcal{G}_0(b)$ and $\mathcal{G}_0(g) \subseteq G$.
Theorem 1. A plan can be found to solve plan query $(B, G)$ at level $j$ iff $\exists b, g \subseteq S_j$ such that planmatch$(b, g, B, G)$, and there is a feasible plan in $M_j$ from every state in $b$ to some state in $g$.
Proof. The MDP at level $j$ is constructed such that a plan $p$ starting from any state in $\mathcal{G}(b)$ (and hence also $\mathcal{G}_0(b)$) is guaranteed to leave the agent in a state in $\mathcal{G}(g)$ (and hence also $\mathcal{G}_0(g)$) iff $p$ is a plan in MDP $M_j$ from $b$ to $g$ [Konidaris, Kaelbling, and Lozano-Perez2014].
Plan $p$ is additionally valid from $B$ to $G$ iff $B \subseteq \mathcal{G}_0(b)$ (the start state at level $j$ refers to a set that includes all query start states) and $\mathcal{G}_0(g) \subseteq G$ (the query goal includes all states referred to by the goal at level $j$). ∎
Note that $b$ and $g$ may not be unique, even within a single level: because $S_j$ is not necessarily a partition of $S_{j-1}$, there may be multiple states, or sets of states, at each level whose final groundings are included by $G$ or include $B$; a solution from any such $b$ to any such $g$ is sufficient. For efficient planning it is better for $b$ to be a small set to reduce the number of start states while remaining large enough to subsume $B$; if $b = S_j$ then answering the plan query requires a complete policy for $M_j$, rather than a plan. However, finding a minimal subset is computationally difficult. One approach is to build the maximal candidate set $b = \{s \mid \mathcal{G}_0(s) \cap B \neq \emptyset, s \in S_j\}$. This is a superset of any start match, and a suitable one exists at this level if and only if $B \subseteq \mathcal{G}_0(b)$. Similarly, $g$ should be maximally large (and so easy to reach) while remaining small enough so that its grounding set lies within $G$. At each level $j$, we can therefore collect all states that ground out to subsets of $G$: $g = \{s \mid \mathcal{G}_0(s) \subseteq G, s \in S_j\}$. These approximations result in a unique pair of sets of states at each level, at the cost of potentially including unnecessary states in each set, and can be computed in time linear in $|S_j|$.
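The two candidate sets can be computed in a single pass over the states at a level. The sketch below assumes each abstract state's final grounding is available as an explicit set of base states. The function name `find_planmatch`, the requirement that the goal candidate set be non-empty, and the toy depot states (base states are hypothetical pairs of passenger and taxi depot) are assumptions made for the example.

```python
def find_planmatch(grounding, B, G):
    """Return candidate (b, g) for plan query (B, G) at one level, or None.

    grounding maps each abstract state to the frozenset of base states it refers to.
    """
    # Start candidates: every abstract state whose grounding overlaps B.
    b = {s for s, base in grounding.items() if base & B}
    # Goal candidates: every abstract state whose grounding lies inside G.
    g = {s for s, base in grounding.items() if base <= G}
    covered = set().union(*(grounding[s] for s in b))
    # Assumption: an empty goal set is treated as "no match".
    if B <= covered and g:
        return b, g
    return None


depots = "RGBY"
level = {f"passenger-at-{p}": frozenset((p, t) for t in depots) for p in depots}

# Passenger starts at B, must end at R, taxi position unconstrained.
query_1 = find_planmatch(level, B={("B", t) for t in depots}, G={("R", t) for t in depots})
# Same start, but the goal also fixes the taxi at Y, which this level cannot express.
query_2 = find_planmatch(level, B={("B", t) for t in depots}, G={("R", "Y")})
```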
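It follows from the state abstraction properties of the hierarchy that a planmatch at level $j$ implies the existence of a planmatch at all levels below $j$.
Theorem 2. Given a hierarchy of state spaces $\{S_0, \ldots, S_n\}$ constructed as above and plan query $(B, G)$, if $\exists b, g \subseteq S_j$ such that planmatch$(b, g, B, G)$, for some $j$, $n \ge j > 0$, then $\exists b', g' \subseteq S_k$ such that planmatch$(b', g', B, G)$, $\forall k \in \{0, \ldots, j-1\}$.
Proof. We first consider $k = j - 1$. Let $b' = \mathcal{G}(b)$, and $g' = \mathcal{G}(g)$. Both are, by definition, sets of states in $S_{j-1}$. By definition of the final grounding operator, $\mathcal{G}_0(b) = \mathcal{G}_0(b')$ and $\mathcal{G}_0(g) = \mathcal{G}_0(g')$, and hence $B \subseteq \mathcal{G}_0(b')$ and $\mathcal{G}_0(g') \subseteq G$. This process can be repeated to reach any $k < j$. ∎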
Any plan query therefore has a unique highest level containing a planmatch. This leads directly to Algorithm 1, which starts looking for a planmatch at the highest level of the hierarchy and proceeds downwards; it is sound and complete by Theorem 1.
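The top-down search described above can be sketched as follows. This is not a reproduction of Algorithm 1, only a minimal illustration under stated assumptions: each level is given as a pair of explicit final groundings and a deterministic transition table, breadth-first search is used as the planner, and the four-state line world is hypothetical.

```python
from collections import deque


def bfs_plan(transitions, start, goals):
    """Shortest action sequence from `start` to any state in `goals`, or None."""
    frontier, seen = deque([(start, [])]), {start}
    while frontier:
        state, path = frontier.popleft()
        if state in goals:
            return path
        for action, nxt in transitions.get(state, {}).items():
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + [action]))
    return None


def hierarchical_plan(levels, B, G):
    """Search from the highest level down; return (level, plans per start state) or None.

    levels[j] = (grounding, transitions), with grounding[s] the final grounding of s.
    """
    for j in range(len(levels) - 1, -1, -1):
        grounding, transitions = levels[j]
        b = {s for s in grounding if grounding[s] & B}
        g = {s for s in grounding if grounding[s] <= G}
        covered = set().union(*(grounding[s] for s in b))
        if not (B <= covered and g):
            continue  # no planmatch at this level
        plans = {s: bfs_plan(transitions, s, g) for s in b}
        if all(p is not None for p in plans.values()):
            return j, plans
    return None


# Hypothetical base level: four states on a line.
g0 = {s: frozenset({s}) for s in range(4)}
t0 = {0: {"right": 1}, 1: {"left": 0, "right": 2}, 2: {"left": 1, "right": 3}, 3: {"left": 2}}
# Hypothetical level 1: two abstract states and two options.
g1 = {"low": frozenset({0, 1}), "high": frozenset({2, 3})}
t1 = {"low": {"go-high": "high"}, "high": {"go-low": "low"}}
levels = [(g0, t0), (g1, t1)]

coarse = hierarchical_plan(levels, B={0, 1}, G={2, 3})
fine = hierarchical_plan(levels, B={0}, G={3})
```

The first query is answered at level 1 with a single option; the second specifies a goal too precise for level 1 and falls through to the base level.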
The complexity of Algorithm 1 depends on its two component algorithms: one used to find a planmatch, and another to attempt to find a plan (possibly with multiple start states and goals). We denote the complexity of these algorithms as $m(|S|)$ (linear using the approach described above) and $p(|S|)$, for a problem with $|S|$ states, respectively. The complexity of finding a plan at level $l$, where the first match is found at level $k \ge l$, is given by
$$h(k, l, M) = \sum_{a=k+1}^{n} m(|S_a|) + \sum_{b=l}^{k} \left[ m(|S_b|) + p(|S_b|) \right]$$
for a hierarchy $M$ with $n$ levels. The first term corresponds to the search for the level with the first planmatch; the second term for the repeated planning at levels that contain a match but not a plan (a planmatch does not necessarily mean a plan exists at that level, merely that one could).
The formula for $h$ highlights the fact that hierarchies make some problems easier to solve and others harder: in the worst case, a problem that should take $p(|S_0|)$ time (one only solvable via the base MDP) could instead take $h(n, 0, M)$ time. A key question is therefore how to balance the depth of the hierarchy, the rate at which the state space size diminishes as the level increases, which specific skills to discover at each level, and how to control false positive plan matches, to reduce planning time.
Recent work has highlighted the idea that skill discovery algorithms should aim to reduce average planning or learning time across a target distribution of tasks [Şimşek and Barto2008, Solway et al.2014]. Following this logic, a hierarchy $M$ for some distribution $P$ over task set $T$ should be constructed so as to minimize $\int_T P(\tau)\, h(k(\tau), l(\tau), M)\, d\tau$, where $k$ and $l$ now both depend on each task $\tau$. Minimizing this quantity over the entire distribution seems infeasible; an acceptable substitute may be to assume that the tasks the agent has already experienced are drawn from the same distribution as those it will experience in the future, and to construct the hierarchy that minimizes $h$ averaged over past tasks.
The form of $h$ suggests two important principles which may aid the more direct design of skill acquisition algorithms. The first is that deeper hierarchies are not necessarily better; each level adds potential planning and matching costs, and must be justified by a rapidly diminishing state space size and a high likelihood of solving tasks at that level. The second is that false positive plan matches (when a pair of states that match the query is found at some level at which a plan cannot be found) incur a significant time penalty. The hierarchy should therefore ideally be constructed so that every likely goal state at each level is reachable from every likely start state at that level.
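The trade-off can be made concrete with a small cost calculator: matching is paid at every level visited from the top down, and planning is paid at every level from the first match down to the level where a plan is finally found. The level sizes, the linear matching cost, and the quadratic planning cost below are purely illustrative assumptions, not measurements.

```python
def planning_cost(sizes, k, l, match_cost=lambda n: n, plan_cost=lambda n: n * n):
    """Total cost when the first match is at level k and a plan is found at level l <= k.

    sizes[j] is the number of states at level j (level 0 is the base MDP).
    """
    n = len(sizes) - 1
    if not 0 <= l <= k <= n:
        raise ValueError("require 0 <= l <= k <= n")
    # Levels above k: matching is attempted and fails.
    search = sum(match_cost(sizes[a]) for a in range(k + 1, n + 1))
    # Levels k down to l: matching succeeds and planning is attempted.
    attempts = sum(match_cost(sizes[b]) + plan_cost(sizes[b]) for b in range(l, k + 1))
    return search + attempts


sizes = [1000, 100, 10]                      # hypothetical |S_0|, |S_1|, |S_2|
solved_at_top = planning_cost(sizes, k=2, l=2)
solved_at_base = planning_cost(sizes, k=0, l=0)
flat_baseline = 1000 * 1000                  # planning directly in the base MDP
false_positives = planning_cost(sizes, k=2, l=0)
```

Under these assumed costs, a query solved at the top level costs 110 units against a flat baseline of 1,000,000, while a query that falls through to the base level pays a small overhead, and a query with false positive matches at every level pays the largest overhead.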
An agent that generates its own goals, as a completely autonomous agent would, could do so by selecting an existing state from an MDP at some level (say $M_j$) in the hierarchy. In that case it need not search for a matching level, and could instead immediately plan at level $j$, though it may still need to drop to lower levels if no plan is found in $M_j$.
6 An Example Domain: Taxi
We now explain the construction and use of an abstraction hierarchy for a common hierarchical reinforcement learning benchmark: the Taxi domain [Dietterich2000], depicted in Figure 3a. A taxi must navigate a $5 \times 5$ grid, which contains a few walls, four depots (labeled red, green, blue, and yellow), and a passenger. The taxi may move one square in each direction (unless impeded by a wall), pick up a passenger (when occupying the same square), or drop off a passenger (when it has previously picked the passenger up). A state at base MDP $M_0$ is described by 5 state variables: the $x$ and $y$ location of the taxi and the passenger, and whether or not the passenger is in the taxi. This results in a total of 650 states ($25^2 = 625$ states for when the passenger is not in the taxi, plus another 25 for when the passenger is in the taxi and they are constrained to have the same location).
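The size of the base state space can be checked by enumeration. The sketch assumes the standard $5 \times 5$ Taxi grid and encodes a state as a hypothetical triple of taxi cell, passenger cell, and in-taxi flag.

```python
from itertools import product

GRID = 5  # assumption: the standard 5 x 5 Taxi grid
cells = list(product(range(GRID), range(GRID)))

# Passenger outside the taxi: taxi and passenger positions are independent.
not_in_taxi = [(taxi, passenger, False) for taxi in cells for passenger in cells]
# Passenger inside the taxi: both share the same cell.
in_taxi = [(cell, cell, True) for cell in cells]

base_states = not_in_taxi + in_taxi
```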
We now describe the construction of a hierarchy for the taxi domain using hand-designed options at each level, and present some results for planning using Algorithm 1 for three example plan queries.
| Query | Level | Matching | Planning | Total | Base + Options | Base MDP |
In this version of taxi, the agent is able to move the taxi to, and drop the passenger at, any square, but it expects to face a distribution of problems generated by placing the taxi and the passenger at a depot at random, and selecting a random target depot at which the passenger must be deposited. Consequently, we create navigation options for driving the taxi to each depot, and retain the existing put-down and pick-up options. (These roughly correspond to the hand-designed hierarchical actions used in Dietterich [2000].) These options over $M_0$ form the action set for level 1 of the hierarchy: $A_1 = \{$drive-to-red, drive-to-green, drive-to-blue, drive-to-yellow, pick-up, put-down$\}$.
Consider the drive-to-blue-depot option. It is executable in all states (i.e., its initiation set is $S_0$), and terminates with the taxi’s $x$ and $y$ position set to the position of the blue depot; if the passenger is in the taxi, their location is also set to that of the blue depot; otherwise, their location (and the fact that they are not in the taxi) remains unchanged. It can therefore be partitioned into two abstract subgoal options: one, when the passenger is in the taxi, sets the $x$ and $y$ positions of the taxi and passenger to those of the blue depot; another, when the passenger is not in the taxi, sets the taxi $x$ and $y$ coordinates and leaves those of the passenger unchanged. Both leave the in-taxi state variable unmodified. Similarly, the put-down and pick-up options are executable everywhere and when the taxi and passenger are in the same square, respectively, and modify the in-taxi variable while leaving the remaining variables the same. Partitioning all options in $A_1$ into abstract subgoal options results in a factored state space $S_1$ consisting of 20 reachable states where the taxi or passenger are at the depot locations ($4 \times 4 = 16$ states for when the passenger is not in the taxi, plus 4 for when they are).
Given $M_1$, we now build the second level of the hierarchy by constructing options that pick up the passenger (wherever they are), move them to each of the four depots, and drop them off. These options become $A_2 = \{$passenger-to-blue, passenger-to-red, passenger-to-green, passenger-to-yellow$\}$. Each option is executable whenever the passenger is not already at the relevant depot, and it leaves the passenger and taxi at the depot, with the passenger outside the taxi. Since these are subgoal (as opposed to abstract subgoal) options, the resulting MDP, $M_2$, consists of only 4 states (one for each location of the passenger) and is a simple (and coincidentally fully connected) graph. The resulting hierarchy is depicted in Figure 3b.
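The two abstract levels are small enough to enumerate by hand. The sketch below is a hypothetical encoding, assuming four depots, level-1 states written as triples of taxi depot, passenger depot, and in-taxi flag, and level-2 states identified with the passenger's depot; none of these names come from the original implementation.

```python
depots = ["red", "green", "blue", "yellow"]

# Level 1: taxi and passenger are each at some depot, or the passenger rides in the taxi.
level1_states = [(taxi, passenger, False) for taxi in depots for passenger in depots]
level1_states += [(depot, depot, True) for depot in depots]

# Level 2: one state per passenger depot; passenger-to-X is executable
# whenever the passenger is not already at depot X.
level2_graph = {
    src: {f"passenger-to-{dst}": dst for dst in depots if dst != src}
    for src in depots
}

fully_connected = all(
    set(edges.values()) == set(depots) - {src} for src, edges in level2_graph.items()
)
```

Every level-2 state has an outgoing edge to each of the other three states, so any passenger-delivery query that matches at this level is answered by a single option.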
We used the above hierarchy to compute plans for three example queries, using dynamic programming and decision trees for planning and grounding classifiers, respectively. The results are given in Table 4; we next present each query, and step through the matching process in detail.
Example Query 1. Query $Q_1$ has the passenger start at the blue depot (with the taxi at an unknown depot) and request to be moved to the red depot. In this case $B_1$ refers to all states where the passenger is at the blue depot and the taxi is located at one of four depots, and similarly $G_1$ refers to the red depot. The agent must first determine the appropriate level to plan at, starting from $M_2$, the highest level of the hierarchy. It finds state $s_b$ where $\mathcal{G}_0(s_b) = B_1$ (and therefore $B_1 \subseteq \mathcal{G}_0(s_b)$ holds), and $s_r$ where $\mathcal{G}_0(s_r) = G_1$ (and therefore $\mathcal{G}_0(s_r) \subseteq G_1$), where $s_b$ and $s_r$ are the states in $M_2$ referring to the passenger being located at the blue and red depots, respectively. Planning therefore consists of finding a plan from $s_b$ to $s_r$ at level $M_2$; this is virtually trivial (there are only four states in $M_2$ and the state space is fully connected).
Example Query 2. Query $Q_2$ has the start state set as before, but now specifies a goal depot (the yellow depot) for the taxi. $B_2$ refers to all states where the passenger is at the blue depot and the taxi is at an unknown depot, but $G_2$ refers to a single state. $M_2$ contains a state that has the same grounding set as $B_2$, but no state in $M_2$ has a grounding that is a subset of $G_2$, because no state in $M_2$ specifies the location of the taxi. The agent therefore cannot find a planmatch for $Q_2$ at level $M_2$.
At $M_1$ no single state is a superset of $B_2$, but the agent finds a collection of states $\{s_j\}$ such that $\mathcal{G}_0(\cup_j s_j) = B_2$. It also finds a single state with the same grounding as $G_2$. Therefore, it builds a plan at level $M_1$ for each state in $\{s_j\}$.
Example Query 3. In query $Q_3$, the taxi begins at the red depot and the passenger at the blue depot, and its goal is to leave the passenger at a specific non-depot grid location, with the taxi goal location left unspecified. The start set, $B_3$, refers to a single state, and the goal set, $G_3$, refers to the set of states where the passenger is located at that grid location.
Again the agent starts at $M_2$. $B_3$ is a subset of the grounding of the single state in $M_2$ where the passenger is at the blue depot but the taxi is at an unknown depot. However, $G_3$ is not a superset of any of the states in $M_2$, since none contain any state where the passenger is not at a depot. Therefore the agent cannot plan for $Q_3$ at level $M_2$.
At level $M_1$, it again finds a state that is a superset of $B_3$, but no state that is a subset of $G_3$: all states in $M_1$ now additionally specify the position of the taxi and passenger, but like the states in $M_2$ they all fix the location of the passenger at a depot. All state groundings are in fact disjoint from the grounding of $G_3$. The agent must therefore resort to planning in $M_0$, and the hierarchy does not help (indeed, it results in a performance penalty due to the compute time to rule out $M_2$ and $M_1$).
We have introduced a framework for building abstraction hierarchies by alternating skill- and representation-acquisition phases. The framework is completely automatic except for the choice of skill acquisition algorithm, to which our formulation is agnostic. The resulting hierarchies combine temporal and state abstraction to realize efficient planning and learning in the multi-task setting.
- [Bacchus and Yang1991] Bacchus, F., and Yang, Q. 1991. The downward refinement property. In Proceedings of the 12th International Joint Conference on Artificial Intelligence, 286–292. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
- [Barto and Mahadevan2003] Barto, A., and Mahadevan, S. 2003. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems 13:41–77.
- [Dietterich2000] Dietterich, T. 2000. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research 13:227–303.
- [Hengst2012] Hengst, B. 2012. Hierarchical approaches. In Wiering, M., and van Otterlo, M., eds., Reinforcement Learning, volume 12 of Adaptation, Learning, and Optimization. Springer Berlin Heidelberg. 293–323.
- [Konidaris, Kaelbling, and Lozano-Perez2014] Konidaris, G.; Kaelbling, L.; and Lozano-Perez, T. 2014. Constructing symbolic representations for high-level planning. In Proceedings of the Twenty-Eighth Conference on Artificial Intelligence, 1932–1940.
- [Konidaris, Kaelbling, and Lozano-Perez2015] Konidaris, G.; Kaelbling, L.; and Lozano-Perez, T. 2015. Symbol acquisition for probabilistic high-level planning. In Proceedings of the Twenty Fourth International Joint Conference on Artificial Intelligence.
- [McDermott et al.1998] McDermott, D.; Ghallab, M.; Howe, A.; Knoblock, C.; Ram, A.; Veloso, M.; Weld, D.; and Wilkins, D. 1998. PDDL—the planning domain definition language. Technical Report CVC TR98003/DCS TR1165, Yale Center for Computational Vision and Control.
- [Parr and Russell1997] Parr, R., and Russell, S. 1997. Reinforcement learning with hierarchies of machines. In Advances in Neural Information Processing Systems 10, 1043–1049.
- [Şimşek and Barto2008] Şimşek, Ö., and Barto, A. 2008. Skill characterization based on betweenness. In Advances in Neural Information Processing Systems 22.
- [Solway et al.2014] Solway, A.; Diuk, C.; Cordova, N.; Yee, D.; Barto, A.; Niv, Y.; and Botvinick, M. 2014. Optimal behavioral hierarchy. PLOS Computational Biology 10(8):e1003779.
- [Sutton, Precup, and Singh1999] Sutton, R.; Precup, D.; and Singh, S. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112(1-2):181–211.