In classical planning, an agent must select a sequence of deterministic, durationless actions in order to transition from a known initial state to a desired goal state. Planning assumes the agent has access to a model of the effects of its actions, which it uses to reason about potential plans. Usually this model takes the form of either a PDDL-like description, which specifies the preconditions and effects of each action, or a simulator [10, 9], which the agent can use to simulate state transitions.
In general, planning is hard: determining whether a plan exists to reach the goal state is PSPACE-complete [3]. Heuristic search eases this computational burden by guiding the search towards promising solutions. Of course, heuristic search is only useful with a good heuristic. Prior work on domain-independent planning has produced several methods for automatically generating heuristics by exploiting structure in PDDL descriptions [1, 8, 7]. However, simulator-based planners have no formal domain description to exploit, and are therefore limited to less-informed heuristics. This poses a problem because an informative heuristic is especially important for simulator-based planning, where querying the simulator can be slow.
One of the simplest domain-independent heuristics compatible with both PDDL planners and simulator-based planners is the goal-count heuristic. The goal-count heuristic counts the number of state variables that differ between a given state and the goal. Two basic assumptions of the goal-count heuristic are: a factored state space (i.e. there are state variables to count), and a known goal state (i.e. there is a reason to fix variables). A third, more subtle assumption is that each state variable can be treated as an approximately independent subgoal. Unfortunately, the subgoal independence assumption is invalid for most planning problems of practical interest, and thus the goal-count heuristic is often misleading.
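As a minimal sketch of the heuristic described above, assuming states are represented as equal-length tuples of variable values (the function name `goal_count` and the example values are illustrative, not taken from the authors' code):

```python
from typing import Sequence


def goal_count(state: Sequence[int], goal: Sequence[int]) -> int:
    """Count the state variables whose value differs from the goal."""
    if len(state) != len(goal):
        raise ValueError("state and goal must have the same number of variables")
    return sum(1 for s, g in zip(state, goal) if s != g)


# Hypothetical five-variable state: the last two variables are still unsolved.
state = (3, 1, 4, 1, 5)
goal = (3, 1, 4, 2, 6)
print(goal_count(state, goal))
```

The heuristic needs nothing beyond a factored state and a known goal, which is why it works with a black-box simulator as well as with a PDDL description.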
We examine why the goal-count heuristic becomes uninformative for certain sets of actions, and we show that planning efficiency is linked to whether actions modify many variables at once. Our investigation suggests a compelling strategy for improving the usefulness of the goal-count heuristic: learning macro-actions that modify as few variables as possible. We describe a method for learning such macro-actions and test it on two classical planning benchmarks: 15-puzzle and Rubik’s cube. We focus our attention on quickly finding feasible plans, rather than optimal ones, with the goal of minimizing the number of simulation steps. Our learned macro-actions enable reliable and efficient planning, making dramatically fewer calls to the simulator while improving our solve rate for Rubik’s cube from zero to 100%.
2 Measuring the Effect of Entanglement on Planning Efficiency
The goal-count heuristic implicitly treats each state variable as an independent subgoal. For this assumption to be exactly correct, actions must change only one variable at a time. In general, however, an action can change many state variables, and this can cause the goal-count heuristic to be uninformative. When an action changes many variables, we say that its effects are entangled. We formalize the “entanglement” of an action’s effects as follows:

$$E(a) = \lVert \Delta_a \rVert_0,$$

where $a$ is an action, $\Delta_a$ is a vector representing the differential effects of the action, and $\lVert \cdot \rVert_0$ denotes the $L_0$ norm. In other words, the entanglement simply counts the number of state variables modified by the action. We will also say that the action itself is entangled. We can extend the above definition to macro-actions, where entanglement only counts the variables modified by the net effects of the macro-action, even if additional variables are modified during its execution.
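The distinction between net effects and intermediate effects can be illustrated with a small sketch. It assumes states are tuples and actions are plain functions from state to state; the helper names (`entanglement`, `run_macro`, `swap`) are hypothetical. The count is taken between one concrete state and its successor, so it reflects the action's effect from that particular state.

```python
from typing import Callable, Sequence, Tuple

State = Tuple[int, ...]
Action = Callable[[State], State]


def entanglement(before: State, after: State) -> int:
    """Number of state variables that differ between two states."""
    return sum(1 for b, a in zip(before, after) if b != a)


def run_macro(state: State, macro: Sequence[Action]) -> State:
    """Apply a sequence of primitive actions and return the final state."""
    for action in macro:
        state = action(state)
    return state


def swap(i: int, j: int) -> Action:
    """Primitive action that exchanges variables i and j (illustrative only)."""
    def act(state: State) -> State:
        values = list(state)
        values[i], values[j] = values[j], values[i]
        return tuple(values)
    return act


start = (0, 1, 2, 3)
macro = [swap(0, 1), swap(2, 3), swap(0, 1)]
end = run_macro(start, macro)
# Four variables are touched during execution, but only two differ at the end.
print(end, entanglement(start, end))
```

Here the first and last swaps cancel, so the macro's entanglement is that of its net effect alone.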
The goal-count heuristic fundamentally links entanglement with planning efficiency. In the following experiment, we demonstrate that increasing the entanglement of a set of actions can increase planning time exponentially.
2.1 The Suitcase Lock Domain
We introduce the planning problem of entering a combination on a suitcase lock with dials, each with digits, and actions, half of which increment a deterministic subset of the dials (modulo ), and half of which decrement the same dials (see Figure 0(a)). Let denote the entanglement of action , and let represent the mean entanglement across all the actions. We vary and measure its effect on planning time. The start and goal state are randomly generated, and the actions are constructed so that, regardless of , every state can always be reached from every other state.

Footnote 2: Note that if , or if all actions modify (for example) an even number of state variables, it is not possible to reach every state from every other state. To circumvent this issue, we check that for a given problem instance, the increment and decrement action sets can each be reduced to an binary matrix with full rank. We repeatedly generate action sets with the desired mean entanglement until we find one that satisfies this condition. The resulting action sets are therefore different for each random seed, except when where we always use the identity matrix, and when where we use with an extra 1 added to the first diagonal element to break symmetry. The decrement actions are always the negation of the increment actions, and we ignore them for .
We solve each instance of the planning problem using the goal-count heuristic and greedy best-first search, since we care about feasible plans, rather than optimal ones. We run two experiments: first with , and varying in the range ; and second with , , and . We compare planning time for each entanglement value by measuring the number of simulator steps to solve the problem across different random seeds.
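The experimental setup can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: the lock size (4 dials, 10 digits), the two action sets, and the names `make_lock_actions`, `gbfs`, and `replay` are all chosen for the example, and a simulator step is counted as one call to an action function.

```python
import heapq
from itertools import count


def goal_count(state, goal):
    """Number of dials that do not yet match the goal."""
    return sum(1 for s, g in zip(state, goal) if s != g)


def make_lock_actions(masks, n_digits):
    """One increment and one decrement action per binary dial mask."""
    actions = []
    for mask in masks:
        for sign in (+1, -1):
            def act(state, mask=mask, sign=sign):
                return tuple((v + sign * m) % n_digits for v, m in zip(state, mask))
            actions.append(act)
    return actions


def gbfs(start, goal, actions, max_steps=100_000):
    """Greedy best-first search; returns (plan, simulator_steps)."""
    tie = count()  # tie-breaker so the heap never compares states
    frontier = [(goal_count(start, goal), next(tie), start, ())]
    visited = {start}
    steps = 0
    while frontier:
        _, _, state, plan = heapq.heappop(frontier)
        if state == goal:
            return list(plan), steps
        for i, act in enumerate(actions):
            if steps >= max_steps:
                return None, steps
            nxt = act(state)
            steps += 1  # one simulated state transition
            if nxt not in visited:
                visited.add(nxt)
                heapq.heappush(frontier, (goal_count(nxt, goal), next(tie), nxt, plan + (i,)))
    return None, steps


def replay(state, plan, actions):
    """Execute a plan (list of action indices) from the given state."""
    for i in plan:
        state = actions[i](state)
    return state


N_DIGITS = 10
action_sets = {
    "entanglement 1": [(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)],
    "entangled": [(1, 1, 0, 0), (0, 1, 1, 0), (0, 0, 1, 1), (0, 0, 0, 1)],
}
start, goal = (0, 0, 0, 0), (1, 0, 9, 0)
for name, masks in action_sets.items():
    actions = make_lock_actions(masks, N_DIGITS)
    plan, steps = gbfs(start, goal, actions)
    print(name, "| plan length:", len(plan), "| simulator steps:", steps)
```

Both action sets can reach every state, so the comparison isolates the effect of entanglement on the number of simulator steps rather than on solvability.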
Figure 2 shows an approximately exponential relationship between entanglement and planning time. Note that the goal-count heuristic is exactly equal to the cost when and , and greedy best-first search (GBFS) need consider at most state transitions. By contrast, when the heuristic is maximally uninformative, and GBFS may need to consider state transitions in the worst case. In Figure 1(b), this exponential trend appears to hold even when the state variables are not binary.
These results suggest that reducing entanglement is a viable strategy for improving planning efficiency. We therefore next propose a method for learning disentangled macro-actions.
3 Learning Macro-Actions with Disentangled Effects
We search for macro-actions using A* [6] with a simulation budget of state transitions. We start the search at a randomly generated state, and the search heuristic is macro-action entanglement (the number of variables modified by the macro-action) or infinity if the macro-action modifies zero variables. We ignore duplicate macro-actions that have the same net effect, and save the macro-actions with the lowest entanglement. To encourage diversity in the macros, we repeat this process times, each time generating a new random starting state in which none of the existing saved macro-actions are valid, or until we fail to find such a starting state. This ensures that even if macros have constraining preconditions, we will still find macros that apply in most situations. (See Algorithm 1 for pseudocode.)
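A rough sketch of a single search run is given below; it is not the authors' Algorithm 1. Several details are assumptions made for the example: the path cost is the macro length, a macro that returns to the start state (zero net effect) is skipped rather than assigned an infinite heuristic, single-action sequences are excluded from the result, the repetition over multiple start states is omitted, and the toy lock domain and all function names are hypothetical.

```python
import heapq
from itertools import count
from typing import NamedTuple, Tuple


class Macro(NamedTuple):
    entanglement: int
    sequence: Tuple[int, ...]   # indices of primitive actions
    end_state: Tuple[int, ...]  # identifies the net effect from the start state


def make_lock_actions(masks, n_digits):
    """Toy domain: increment/decrement actions on a suitcase lock."""
    actions = []
    for mask in masks:
        for sign in (+1, -1):
            def act(state, mask=mask, sign=sign):
                return tuple((v + sign * m) % n_digits for v, m in zip(state, mask))
            actions.append(act)
    return actions


def entanglement(before, after):
    return sum(1 for b, a in zip(before, after) if b != a)


def learn_macros(start, actions, budget, n_macros):
    """Best-first search over action sequences, guided by net-effect entanglement."""
    tie = count()
    frontier = [(0, next(tie), start, ())]
    seen = {start}  # net effects already covered (start = zero net effect)
    found = []
    steps = 0
    while frontier and steps < budget:
        _, _, state, seq = heapq.heappop(frontier)
        for i, act in enumerate(actions):
            if steps >= budget:
                break
            nxt = act(state)
            steps += 1
            if nxt in seen:  # duplicate net effect, or no net effect at all
                continue
            seen.add(nxt)
            new_seq = seq + (i,)
            h = entanglement(start, nxt)
            found.append(Macro(h, new_seq, nxt))
            # Assumed priority: macro length (cost so far) plus entanglement.
            heapq.heappush(frontier, (len(new_seq) + h, next(tie), nxt, new_seq))
    macros = [m for m in found if len(m.sequence) > 1]
    macros.sort(key=lambda m: (m.entanglement, len(m.sequence)))
    return macros[:n_macros], steps


def replay(state, sequence, actions):
    for i in sequence:
        state = actions[i](state)
    return state


actions = make_lock_actions([(1, 1, 0), (0, 1, 1), (0, 0, 1)], n_digits=10)
macros, used = learn_macros(start=(0, 0, 0), actions=actions, budget=300, n_macros=5)
for m in macros:
    print(m.entanglement, m.sequence, m.end_state)
```

Keying duplicates on the end state works here because every candidate is evaluated from the same start state, so two sequences with the same end state have the same net effect.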
We evaluate our method by learning macro-actions in two domains and subsequently using them for planning.

Footnote 3: Code at https://github.com/camall3n/skills-for-planning
The 15-puzzle (Fig. 0(b)) is a $4 \times 4$ grid of 15 numbered, sliding tiles and one blank space. The puzzle begins in a scrambled configuration, and the objective is to slide the tiles until the numbers on the tiles are arranged in increasing order. There are approximately $10^{13}$ states and the worst-case shortest solution requires 80 actions [2].
Our simulator uses a state representation with 16 variables (for the positions of each tile and of the blank space), and 48 primitive actions (that swap the blank space with one of the adjacent tiles), of which only 2-4 can be applied in each state. Similarly, macro-actions can only run if they begin with the correct blank space location.
We select this domain since the primitive actions are almost completely disentangled (each modifies only the blank space and one numbered tile), yet planning is inefficient with primitive actions alone. Moreover, naively chosen macro-actions can rapidly become entangled, which makes their application challenging.
| Macro-actions | Simulator steps | Solve rate |
4.1.1 Analysis of Learned Macro-Actions
When learning macro-actions for 15-puzzle, we set , , and . This resulted in a combined simulation budget of state transitions, and a total of generated macro-actions. We compared these macro-actions against 1600 “random” macro-actions of the same length, which were generated (for each random seed) by selecting actions uniformly at random from the valid actions at each state.
We plot the learned macro-actions in Figure 3 (with some overlap), and see that they are much less entangled than random macros, and only slightly more entangled than primitive actions. The learned macros are also easy to interpret: one type swaps the blank space with a central tile; another type exchanges three tiles without moving the blank space.
4.1.2 Planning with Learned Macro-Actions
After learning disentangled macro-actions, we solve the 15-puzzle using greedy best-first search with the goal-count heuristic and a simulation budget of state transitions. We generate starting states by scrambling the puzzle with random actions for either or steps, with equal probability (to ensure that we see all possible blank space locations).
We consider planning with the primitive actions alone, as well as augmenting the primitive actions with either random, or learned, macro-actions. To save on computational resources, when using macro-actions, we pre-compute and save a model of each, represented as a permutation operation on the indices of the variables, which we can then apply during planning as a single operation. We measure the best-seen heuristic value vs. the number of nodes considered, across different random seeds for each group of actions/macro-actions. We stop searching as soon as we find a feasible plan.

Footnote 4: Note that by keeping the primitive actions as well as the macro-actions we ensure that every position is solvable.
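The pre-computed macro models can be illustrated with a short sketch. It assumes each primitive action is itself an index permutation with the convention `new_state[i] = state[perm[i]]`; that convention, the function names, and the four-variable example are assumptions for illustration. Preconditions (such as the blank-space location in 15-puzzle) are not modeled here.

```python
from functools import reduce
from typing import Sequence, Tuple

Perm = Tuple[int, ...]


def apply_perm(state: Sequence, perm: Perm) -> tuple:
    """Apply an index permutation: new_state[i] = state[perm[i]]."""
    return tuple(state[p] for p in perm)


def compose(first: Perm, second: Perm) -> Perm:
    """Single permutation equivalent to applying `first`, then `second`."""
    return tuple(first[p] for p in second)


def build_macro_model(primitives: Sequence[Perm], n_vars: int) -> Perm:
    """Fold a sequence of primitive permutations into one macro model."""
    identity = tuple(range(n_vars))
    return reduce(compose, primitives, identity)


# Hypothetical primitives on four variables.
swap01 = (1, 0, 2, 3)   # exchange variables 0 and 1
rotate = (1, 2, 3, 0)   # shift every variable one place to the left
macro = [swap01, rotate, swap01]

model = build_macro_model(macro, n_vars=4)

state = ("a", "b", "c", "d")
step_by_step = state
for perm in macro:
    step_by_step = apply_perm(step_by_step, perm)

# The cached model reproduces the step-by-step result in a single operation.
print(model, apply_perm(state, model), step_by_step)
```

Because the model refers only to variable indices and not to their values, the same cached macro applies to any state in which it is valid, independent of the goal.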
Figure 4 shows planning time by action/macro-action type, and we summarize the results in Table 1. We see that disentangled macro-actions enable an order of magnitude faster planning than with primitive actions alone. We also see that random macro-actions actually make planning slower, even though we filter out duplicate macro-actions with the same effects. We attribute this to their increased entanglement as seen in Figure 3.
4.2 Rubik’s Cube
The Rubik’s cube (Fig. 0(c)) is a $3 \times 3 \times 3$ cube with colored stickers on each outward-facing square. The puzzle begins in a scrambled configuration, and the objective is to rotate the faces of the cube until all stickers on each face are the same color. There are approximately $4.3 \times 10^{19}$ states, and the worst-case shortest solution requires 26 actions [11]. Our simulator fixes a canonical orientation of the cube, and uses a 48-state-variable representation (for the positions of each colored square, excluding the stationary center squares). The problem has 12 primitive actions (i.e. rotating each of the 6 faces by $\pm 90$ degrees), and these actions are highly entangled: each modifies 20 of the 48 state variables.
4.2.1 Human Expert Macro-Actions
Human “speedcubers” use macro-actions to help them manage the Rubik’s cube’s highly entangled actions. In speedcubing, the goal is to solve the cube as quickly as possible, without necessarily finding an optimal plan. Most speedcubers learn a collection of macro-actions (called “algorithms” in Rubik’s cube parlance) and then employ a strategy for sequencing those macro-actions to solve the cube. Expert macro-actions tend to affect only a small number of state variables, and proper sequencing enables speedcubers to preserve previously-solved parts of the cube while solving the remainder. Common solution methods typically involve multiple levels of hierarchical subgoals and produce plans approximately twice as long as optimal.
As a benchmark, we consider a simplification of expert strategy that involves only a two-level hierarchy. We select a set of six expert macro-actions to perform various complementary types of permutations. We visualize one of these macro-actions, which swaps three corner pieces, in Figure 4(a). Since our simulator uses a fixed cube orientation, we consider all 96 possible variations of each skill (to account for orientation, mirror-flips, and inverses), resulting in 576 total macro-actions.

Footnote 5: We use the following expert macro-actions (expressed in standard cube notation [12]):
- Swap three corners: (see Fig. 4(a))
- Swap three middle edges:
- Swap three face edges:
- Rotate two corners:
- Flip two edges:
4.2.2 Analysis of Learned and Expert Macro-Actions
For Rubik’s cube, we followed the same macro-action learning procedure as for 15-puzzle. We set the number of learned macro-actions so that we could fairly compare the generated macro-actions against our set of expert macro-actions. We learned macro-actions from a single starting state , and set a simulation budget of . We also compared against “random” macro-actions of the same length as the expert skills (six distinct macro-actions plus their corresponding variations), which were regenerated for each random seed.
We plot the entanglement and length of our learned macro-actions in Figure 6. We can see that the learned macros are significantly less entangled than primitive actions or random macro-actions, and almost as disentangled as the expert macro-actions. We note that our learned macros are somewhat shorter on average than the expert macros, and we suspect that increasing the simulation budget would result in learning macros that are even less entangled.
Again we find that the learned macro-actions are relatively easy to interpret. For example, one macro (Figure 4(b)) swapped three edge-corner pairs while keeping them connected.
4.2.3 Planning with Learned and Expert Macro-Actions
We follow the same planning procedure as for 15-puzzle, but now we additionally consider augmenting the primitive actions with the expert macro-actions. We generate starting states by scrambling the cube with uniform random actions for steps. We set , and use different random seeds per group of actions/macro-actions.
We plot heuristic progress vs. planning time in Figure 7 and summarize the results in Table 2. We see that neither random macro-actions nor primitive actions alone are sufficient to solve the cube within the simulation budget. By contrast, planning with learned macro-actions reliably solves the cube, typically after only about 10% of the total simulation budget, and is almost as efficient as with expert macro-actions. Learned macro-actions still dominate the results for random macros and/or primitive actions, even if we account for the additional simulation budget to learn them.
We observed that the average solution length for learned and expert macro-actions is about an order of magnitude longer than typical human speedsolve solutions (377 and 339, respectively, vs. ~ ), which suggests that there are additional insights to be mined from human strategy beyond just learning disentangled macro-actions.
4.3 Generalizing to Novel Goal States
Since our method represents macro-action models as permutation operations on the indices of the variables (see Section 4.1.2), the same macro-actions that we learned previously can be used to solve problems with novel goal states. To demonstrate this, we generate random goal states for 15-puzzle and Rubik’s cube using the same process we use for generating start states (albeit with a different random seed), and solve the puzzles again. In both domains, we find that planning time and solve rate remain effectively unchanged for novel goal states (see Table 3).
| Domain | Goal type | Simulator steps | Solve rate |
5 Related Work
The concept of building macro-actions to improve planning efficiency is not new. Dawson and Siklóssy (1977) considered two-action macros and analyzed domain structure to remove macros that were invalid or that had no effect. Botea et al. (2005) introduced another method for automatically learning macro-actions, which subsequently ranked and filtered them by usefulness. In both cases, the learned macros were found to help with subsequent planning tasks.
Jinnai and Fukunaga (2017) described a method for macro-action pruning, in simulator-based planning, that specifically looked for dominated action sequences. Lipovetzky et al. (2015) also looked at simulator-based planning, in the context of planning problems without known goal states. Newton et al. (2007) investigated learning macro-actions without relying on assumptions about the planner or domain. Their approach used a genetic algorithm to maximize the “fitness” of macro-actions, whereas we use entanglement to guide our search for macros that are compatible with the goal-count heuristic.
There has also been work on domain-independent approaches to learning planning heuristics. Virseda et al. (2013) showed how to automatically learn a weighted combination of existing heuristics to improve planner robustness. Gomoluch et al. (2017) trained heuristic functions on known solutions for representative problems, and showed that their heuristics generalized to held-out test problems for a small set of domains. Shen et al. (2019) recently learned domain-independent heuristics from scratch that outperformed existing baseline heuristics.
Agostinelli et al. (2019) use a domain-independent approach to train a domain-specific heuristic. They make use of the fact that they can reset their simulator to states that are close to the goal state, which enables them to train their neural network heuristic via dynamic programming. Their approach learns heuristics for 15-puzzle and Rubik’s cube that support fast, near-optimal planning. However, training their neural network uses approximately times the simulation budget of our approach, and results in a heuristic that is only informative for a single goal state, whereas ours works for arbitrary goal states.
Very recently, OpenAI et al. (2019) demonstrated that the motions to manipulate the Rubik’s cube with a robotic hand can be effectively learned, but they employ an off-the-shelf, domain-specific planner to generate solutions.
6 Discussion and Conclusion
We have described a method of learning disentangled macro-actions that enables efficient planning with the goal-count heuristic. Our approach is domain-independent and compatible with both PDDL-based planners and simulation-based planners. By planning with our learned macro-actions, we are able to quickly and reliably solve difficult planning problems like 15-puzzle and Rubik’s cube.
Because our entanglement metric measures the net effects of macro-actions, it allows flexibility in planning: a robot can move a box to do its job, as long as it moves the box back when it is done. Such behavior may also have applications in AI safety, where unintended side-effects can sometimes have dangerous and permanent consequences. Minimizing macro-action entanglement could help avoid unwanted side-effects like breaking a vase, since such one-way state transitions cannot be reversed later. We are also encouraged to see that many of the learned macro-actions had intuitive, interpretable meaning in the task domain. This suggests that our method may be useful for improving explainability in addition to planning efficiency.
This work employed a two-level hierarchy: macro-actions composed of primitive actions. One extension that would bring this method more in line with human-expert techniques would be to incorporate several levels of action hierarchy (i.e. macros composed of other macros), or macros that permit side-effects to certain unsolved variables, combined with macros to subsequently solve those remaining variables. We leave an exploration of these ideas for future work.
We thank Yuu Jinnai for several helpful discussions, Michael Katz for comments on an earlier version of this manuscript, and our colleagues at IBM Research and Brown University for their thoughtful conversations and support.
[1] B. Bonet and H. Geffner (2001) Planning as heuristic search. Artificial Intelligence 129 (1-2), pp. 5–33.
[2] A. Brüngger, A. Marzetta, K. Fukuda, and J. Nievergelt (1999) The parallel search bench ZRAM and its applications. Annals of Operations Research 90, pp. 45–63.
[3] T. Bylander (1994) The computational complexity of propositional STRIPS planning. Artificial Intelligence 69 (1-2), pp. 165–204.
[4] R. E. Fikes and N. J. Nilsson (1971) STRIPS: A new approach to the application of theorem proving to problem solving. Artificial Intelligence 2 (3-4), pp. 189–208.
[5] M. Fox and D. Long (2003) PDDL2.1: An extension to PDDL for expressing temporal planning domains. Journal of Artificial Intelligence Research 20, pp. 61–124.
[6] P. E. Hart, N. J. Nilsson, and B. Raphael (1968) A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics 4 (2), pp. 100–107.
[7] M. Helmert (2006) The Fast Downward planning system. Journal of Artificial Intelligence Research 26, pp. 191–246.
[8] J. Hoffmann and B. Nebel (2001) The FF planning system: Fast plan generation through heuristic search. Journal of Artificial Intelligence Research 14, pp. 253–302.
[9] Y. Jinnai and A. Fukunaga (2017) Learning to prune dominated action sequences in online black-box planning. In Thirty-First AAAI Conference on Artificial Intelligence.
[10] N. Lipovetzky, M. Ramirez, and H. Geffner (2015) Classical planning with simulators: results on the Atari video games. In Twenty-Fourth International Joint Conference on Artificial Intelligence.
[11] T. Rokicki and M. Davidson (2014) God’s number is 26 in the quarter-turn metric. Note: http://www.cube20.org/qtm/ [Online; accessed 20-January-2020].
[12] D. Singmaster (1981) Notes on Rubik’s magic cube. Enslow Publishers, Hillside, NJ.
[13] Speedsolving.com Wiki (2019) CFOP method. Note: https://www.speedsolving.com/wiki/index.php/CFOP_method [Online; accessed 22-January-2020].