Reinforcement Learning for Classical Planning: Viewing Heuristics as Dense Reward Generators

09/30/2021 ∙ by Clement Gehring, et al. ∙ MIT ibm 3

Recent advances in reinforcement learning (RL) have led to a growing interest in applying RL to classical planning domains or applying classical planning methods to some complex RL domains. However, the long-horizon goal-based problems found in classical planning lead to sparse rewards for RL, making direct application inefficient. In this paper, we propose to leverage domain-independent heuristic functions commonly used in the classical planning literature to improve the sample efficiency of RL. These classical heuristics act as dense reward generators to alleviate the sparse-rewards issue and enable our RL agent to learn domain-specific value functions as residuals on these heuristics, making learning easier. Correct application of this technique requires consolidating the discounted metric used in RL and the non-discounted metric used in heuristics. We implement the value functions using Neural Logic Machines, a neural network architecture designed for grounded first-order logic inputs. We demonstrate on several classical planning domains that using classical heuristics for RL allows for good sample efficiency compared to sparse-reward RL. We further show that our learned value functions generalize to novel problem instances in the same domain.




1 Introduction

Deep reinforcement learning (RL) approaches have several strengths over conventional approaches to decision making problems, including compatibility with complex and unstructured observations, little dependency on hand-crafted models, and some robustness to stochastic environments. However, they are notorious for their poor sample complexity; e.g., it may require environment interactions to successfully learn a policy for Montezuma’s Revenge (Badia et al., 2020). This sample inefficiency prevents their applications in environments where such an exhaustive set of interactions is physically or financially infeasible. The issue is amplified in domains with sparse rewards and long horizons, where the reward signals for success are difficult to obtain through random interactions with the environment.

In contrast, research in AI Planning and classical planning has been primarily driven by the identification of tractable fragments of originally PSPACE-complete planning problems (Bäckström and Klein, 1991; Bylander, 1994; Erol, Nau, and Subrahmanian, 1995; Jonsson and Bäckström, 1998a, b; Brafman and Domshlak, 2003; Katz and Domshlak, 2008a, b; Katz and Keyder, 2012), and the use of the cost of the tractable relaxed problem as domain-independent heuristic guidance for searching through the state space of the original problem (Hoffmann and Nebel, 2001; Domshlak, Hoffmann, and Katz, 2015; Keyder, Hoffmann, and Haslum, 2012). Contrary to RL approaches, classical planning has focused on long-horizon problems with solutions well over 1000 steps long (Jonsson, 2007; Asai and Fukunaga, 2015). Moreover, classical planning problems inherently have sparse rewards — the objective of classical planning is to produce a sequence of actions that achieves a goal. However, although domain-independence is a welcome advantage, domain-independent methods can be vastly outperformed by carefully engineered domain-specific methods such as a specialized solver for Sokoban (Junghanns and Schaeffer, 2000) due to the no-free-lunch theorem for search problems (Wolpert, Macready et al., 1995). Developing such domain-specific heuristics can require intensive engineering effort, with payoff only in that single domain. We are thus interested in developing domain-independent methods for learning domain-specific heuristics.

In this paper, we draw on the strengths of reinforcement learning and classical planning to propose an RL framework for learning to solve STRIPS planning problems. We propose to leverage classical heuristics, derivable automatically from the STRIPS model, to accelerate RL agents to learn a domain-specific neural network value function. The value function, in turn, improves over existing heuristics and accelerates search algorithms at evaluation time.

To operationalize this idea, we use potential-based reward shaping (Ng, Harada, and Russell, 1999), a well-known RL technique with guaranteed theoretical properties. A key insight in our approach is to see classical heuristic functions as providing dense rewards that greatly accelerate the learning process in three ways. First, they allow for efficient, informative exploration by initializing a good baseline reactive agent that quickly reaches a goal in each episode during training. Second, instead of learning the value function directly, we learn a residual on the heuristic value, making learning easier. Third, the learning agent receives a reward by reducing the estimated cost-to-go (heuristic value). This effectively mitigates the issue of sparse rewards by allowing the agent to receive positive rewards more frequently.

We implement our neural network value functions as Neural Logic Machines (Dong et al., 2019, NLM), a recently proposed neural network architecture that can directly process first-order logic (FOL) inputs, as are used in classical planning problems. NLM takes a dataset expressed in grounded FOL representations and learns a set of (continuous relaxations of) lifted Horn rules. The main advantage of NLMs is that they structurally generalize across different numbers of terms, corresponding to objects in a STRIPS encoding. Therefore, we find that our learned value functions are able to generalize effectively to problem instances of arbitrary sizes in the same domain.

We provide experimental results that validate the effectiveness of the proposed approach in 8 domains from past IPC (International Planning Competition) benchmarks, providing detailed considerations on the reproducibility of the experiments. We find that our reward shaping approach achieves good sample efficiency compared to sparse-reward RL, and that the use of NLMs allows for generalization to novel problem instances. For example, our system learns from blocksworld instances with 2-6 objects, and the result enhances the performance of solving instances with up to 50 objects.

2 Background

We denote a multi-dimensional array in bold (e.g., $\mathbf{a}$). $\mathbf{a};\mathbf{b}$ denotes a concatenation of tensors $\mathbf{a}$ and $\mathbf{b}$ in the last axis, where the rest of the dimensions are the same between $\mathbf{a}$ and $\mathbf{b}$. Functions are applied to arrays element-wise.

2.1 Classical Planning

We consider planning problems in the STRIPS subset of PDDL (Fikes and Nilsson, 1972), which for simplicity we refer to as lifted STRIPS. We denote such a planning problem as a 5-tuple $\langle O, P, A, I, G \rangle$. $O$ is a set of objects, $P$ is a set of predicates, and $A$ is a set of actions. We denote the arity of a predicate $p \in P$ and an action $a \in A$ as $\#p$ and $\#a$, and their parameters as, e.g., $X = (x_1, \ldots, x_{\#a})$. We denote the set of predicates and actions instantiated on $O$ as $P(O)$ and $A(O)$, respectively, which is a union of Cartesian products of predicates/actions and their arguments, i.e., they represent the set of all ground propositions and actions. A state $s \subseteq P(O)$ is a set of propositions that are true in that state. An action is a 4-tuple $\langle \mathrm{pre}(a), \mathrm{add}(a), \mathrm{del}(a), \mathrm{cost}(a) \rangle$, where $\mathrm{pre}(a), \mathrm{add}(a), \mathrm{del}(a)$ are preconditions, add-effects, and delete-effects, and $\mathrm{cost}(a)$ is a cost of taking the action $a$. In this paper, we primarily assume a unit-cost domain where $\mathrm{cost}(a) = 1$ for all $a$. Given a current state $s$, a ground action $a \in A(O)$ is applicable when $\mathrm{pre}(a) \subseteq s$, and applying an action $a$ to $s$ yields a successor state $a(s) = (s \setminus \mathrm{del}(a)) \cup \mathrm{add}(a)$. Finally, $I, G \subseteq P(O)$ are the initial state and a goal condition, respectively. The task of classical planning is to find a plan $(a_1, \ldots, a_n)$ which satisfies $a_n \circ \cdots \circ a_1(I) \supseteq G$ and in which every action satisfies its preconditions at the time of using it. The machine representation of a state $s$ and the goal condition $G$ is a bitvector of size $|P(O)|$, i.e., the $i$-th value of the vector is 1 when the corresponding $i$-th proposition is in $s$, or $G$.
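The set-based semantics above can be illustrated with a minimal Python sketch. The names `GroundAction`, `applicable`, `apply_action`, `is_goal`, and the toy blocksworld-style propositions are illustrative assumptions, not taken from the paper's implementation.

```python
from dataclasses import dataclass
from typing import FrozenSet

State = FrozenSet[str]  # a state is the set of propositions that are true


@dataclass(frozen=True)
class GroundAction:
    """A ground STRIPS action (hypothetical container for this sketch)."""
    name: str
    pre: FrozenSet[str]     # preconditions
    add: FrozenSet[str]     # add-effects
    delete: FrozenSet[str]  # delete-effects
    cost: float = 1.0       # unit cost by default


def applicable(state: State, action: GroundAction) -> bool:
    """An action is applicable when all its preconditions hold in the state."""
    return action.pre <= state


def apply_action(state: State, action: GroundAction) -> State:
    """Successor state: remove the delete-effects, then add the add-effects."""
    if not applicable(state, action):
        raise ValueError(f"{action.name} is not applicable")
    return (state - action.delete) | action.add


def is_goal(state: State, goal: FrozenSet[str]) -> bool:
    """A goal condition is satisfied when it is a subset of the state."""
    return goal <= state


# Toy example with made-up propositions: pick up block a from the table.
pickup_a = GroundAction(
    name="pickup(a)",
    pre=frozenset({"clear(a)", "ontable(a)", "handempty"}),
    add=frozenset({"holding(a)"}),
    delete=frozenset({"clear(a)", "ontable(a)", "handempty"}),
)
init = frozenset({"clear(a)", "ontable(a)", "handempty"})
succ = apply_action(init, pickup_a)
```

Inapplicable actions raise an error here rather than looping back to the same state, which mirrors the classical planning convention that such actions are simply forbidden.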

2.2 Markov Decision Processes

In general, RL methods address domains modeled as a discounted Markov decision process (MDP), $\mathcal{M} = \langle S, A, T, r, q_0, \gamma \rangle$, where $S$ is a set of states, $A$ is a set of actions, $T(s, a, s')$ encodes the probability $\Pr(s' \mid s, a)$ of transitioning from a state $s$ to a successor state $s'$ by an action $a$, $r(s, a, s')$ is a reward function, $q_0$ is a probability distribution over initial states, and $\gamma$ is a discount factor. In this paper, we restrict our attention to deterministic models because PDDL domains are deterministic, and we have a deterministic mapping $s' = T(s, a)$. Given a policy $\pi(a \mid s)$ representing a probability of performing an action $a$ in a state $s$, we define a sequence of random variables $\{S_t\}$, $\{A_t\}$, and $\{R_t\}$, representing states, actions and rewards over time $t$.

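Our goal is to find a policy maximizing its long term discounted cumulative rewards, formally defined as a value function $V_{\gamma,\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t R_t \mid S_0 = s\right]$. We also define an action-value function to be the value of executing a given action and subsequently following some policy $\pi$, i.e., $Q_{\gamma,\pi}(s, a) = \mathbb{E}_{s' \sim T(s,a,\cdot)}\left[r(s, a, s') + \gamma V_{\gamma,\pi}(s')\right]$. An optimal policy $\pi^*$ is a policy that achieves the optimal value function $V^*_\gamma = V_{\gamma,\pi^*}$ that satisfies $V^*_\gamma(s) \geq V_{\gamma,\pi}(s)$ for all states and policies. $V^*_\gamma$ satisfies Bellman's equation:

$$V^*_\gamma(s) = \max_{a \in A} Q^*_\gamma(s, a), \qquad Q^*_\gamma(s, a) = \mathbb{E}_{s' \sim T(s,a,\cdot)}\left[r(s, a, s') + \gamma V^*_\gamma(s')\right], \tag{1}$$

where $Q^*_\gamma$ is referred to as the optimal action-value function. We may omit the subscripts of $V$ and $Q$ for clarity.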

Finally, we can define a policy by mapping action-values in each state to a probability distribution over actions. For example, given an action-value function, $Q$, we can define a policy $\pi(a \mid s) = \mathrm{softmax}(Q(s, a) / \tau)$, where $\tau$ is a temperature that controls the greediness of the policy. It returns a greedy policy when $\tau \to 0$, and approaches a uniform policy when $\tau \to \infty$.
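To make the role of the temperature concrete, the following is a minimal sketch of a temperature-controlled softmax policy over action-values. The function name `softmax_policy` and the toy action-values are assumptions chosen for illustration.

```python
import math
from typing import Dict


def softmax_policy(q_values: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Map action-values to action probabilities with a softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtract the maximum before exponentiating for numerical stability.
    q_max = max(q_values.values())
    weights = {a: math.exp((q - q_max) / temperature) for a, q in q_values.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


# Toy action-values (negative, as in a cost-minimizing formulation).
q = {"left": -3.0, "right": -1.0}
nearly_greedy = softmax_policy(q, temperature=0.01)     # almost all mass on "right"
nearly_uniform = softmax_policy(q, temperature=1000.0)  # close to 50/50
```

A low temperature concentrates the probability mass on the highest-valued action, while a high temperature flattens the distribution toward uniform.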

2.3 Formulating Classical Planning as an MDP

There are two typical ways to formulate a classical planning problem as an MDP. In one strategy, given a transition $(s, a, s')$, one may assign a reward of 1 when $s' \supseteq G$, and 0 otherwise (Rivlin, Hazan, and Karpas, 2019). In another strategy, one may assign a reward of 0 when $s' \supseteq G$, and $-1$ otherwise (or, more generally, $-\mathrm{cost}(a)$ in a non-unit-cost domain). In this paper we use the second, negative-reward model because it tends to induce more effective exploration in RL due to optimistic initial values (Sutton and Barto, 2018). Both cases are considered sparse reward problems because there is no information about whether one action sequence is better than another until a goal state is reached.

3 Bridging Deep RL and AI Planning

We consider a multitask learning setting with a training time and a test time (Fern, Khardon, and Tadepalli, 2011). During training, classical planning problems from a single domain are available. At test time, methods are evaluated on held-out problems from the same domain. The transition model (in PDDL form) is known at both training and test time.

Learning to improve planning has been considered in RL. For example, in AlphaGo (Silver et al., 2016), a value function was learned to provide heuristic guidance to Monte Carlo Tree Search (Kocsis and Szepesvári, 2006). Applying RL techniques in our classical planning setting, however, presents unique challenges.

(P1): Preconditions and dead-ends. In MDPs, a failure to perform an action is typically handled as a self-cycle to the current state in order to guarantee that the state transition probability is well-defined for all states. Another formulation augments the state space with an absorbing state with a highly negative reward. In contrast, classical planning does not handle non-deterministic outcomes (success and failure). Instead, actions are forbidden at a state when its preconditions are not satisfied, and a state is called a dead-end when no actions are applicable. In a self-cycle formulation, random interaction with the environment could be inefficient due to repeated attempts to perform inapplicable actions. Also, the second formulation requires assigning an ad-hoc amount of negative reward to an absorbing state, which is not appealing.

(P2): Objective functions. While the MDP framework itself does not necessarily assume discounting, the majority of RL applications (Schulman et al., 2015; Mnih et al., 2015, 2016; Lillicrap et al., 2016) aim to maximize the expected cumulative discounted rewards of trajectories. In contrast, classical planning tries to minimize the sum of costs (negative rewards) along trajectories, i.e., cumulative undiscounted costs, thus carrying the concepts in classical planning over to RL requires caution.

(P3): Input representations. While much of the deep RL literature assumes an unstructured (e.g., images in Atari) or a factored input representation (e.g., location and velocity in cartpole), classical planning deals with structured inputs based on FOL to perform domain- and problem-independent planning. This is problematic for typical neural networks, which assume a fixed-sized input. Recently, several network architectures were proposed to achieve invariance to size and ordering, i.e., neural networks for set-like inputs (Ravanbakhsh, Schneider, and Poczos, 2016; Zaheer et al., 2017). Graph Neural Networks (Battaglia et al., 2018) have also been recently used to encode FOL inputs (Rivlin, Hazan, and Karpas, 2019; Shen, Trevizan, and Thiébaux, 2020; Ma et al., 2020). While the choice of the architecture is arbitrary, our network should be able to handle FOL inputs.

3.1 Value Iteration for Classical Planning

Our main approach will be to learn a value function that can be used as a heuristic to guide planning. To learn estimated value functions, we build on the value iteration (VI) algorithm (line 1, Algorithm 1), where a known model of the dynamics is used to incrementally update the estimates $V(s)$ of the optimal value function $V^*(s)$. The current estimate $V(s)$ is updated by the r.h.s. of Eq. 1 until a fixpoint is reached.

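1:  Value Iteration (VI):
2:  while not converged do
3:     for $s \in S$ do
4:        $V(s) \leftarrow \max_a \mathbb{E}_{s'}[r(s, a, s') + \gamma V(s')]$
5:  Approximate RTDP with Replay Buffer:
6:  Buffer $B \leftarrow []$
7:  while not converged do
8:     $s \sim q_0$, $t \leftarrow 0$
9:     while $t < D$ and $s$ is non-terminal do
10:       $a \leftarrow \arg\max_a Q(s, a)$
11:       $s \leftarrow T(s, a)$
12:       $B$.push($s$)
13:       SGD($\frac{1}{2}(V(s) - \mathbb{E}_{a \sim \pi}[Q(s, a)])^2$, $B$)
14:       $t \leftarrow t + 1$
15:  Approximate RTDP for Classical Planning:
16:  Buffer $B \leftarrow [[], [], \ldots]$
17:  while not converged do
18:     $\langle O, P, A, I, G \rangle \sim q_0$, $s \leftarrow I$, $t \leftarrow 0$
19:     while $t < D$, $s \not\supseteq G$, $s$ is not a deadlock do
20:       $a \leftarrow \arg\max_{a \in \{a \mid \mathrm{pre}(a) \subseteq s\}} Q(s, a)$
21:       $s \leftarrow a(s)$
22:       $B[|O|]$.push($s$)
23:       SGD($\frac{1}{2}(V(s) - \mathbb{E}_{a \sim \pi}[Q(s, a)])^2$, $B$)
24:       $t \leftarrow t + 1$
Algorithm 1 VI, RTDP, RTDP for Classical Planning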

In classical planning, however, state spaces are too large to enumerate their states (line 3), or to represent the estimates in a tabular form (line 4).
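For intuition, tabular value iteration is easy to write down when the state space is small enough to enumerate. The sketch below assumes a deterministic model, a reward of -1 per step, terminal goal states with value 0, and at least one applicable action in every non-goal state. The function `value_iteration` and the four-state chain are hypothetical and only illustrate the update.

```python
from typing import Callable, Dict, Iterable, List


def value_iteration(
    states: List[int],
    actions: Callable[[int], Iterable[str]],
    step: Callable[[int, str], int],
    is_goal: Callable[[int], bool],
    gamma: float = 0.9,
    tol: float = 1e-9,
) -> Dict[int, float]:
    """Sweep over all states, applying the Bellman backup until a fixpoint."""
    values = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if is_goal(s):
                continue  # goal states are terminal; their value stays 0
            # Deterministic backup with a reward of -1 per step.
            best = max(-1.0 + gamma * values[step(s, a)] for a in actions(s))
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            return values


# Toy chain 0 -> 1 -> 2 -> 3, where state 3 is the goal.
chain_values = value_iteration(
    states=[0, 1, 2, 3],
    actions=lambda s: ["stay", "fwd"],
    step=lambda s, a: min(s + 1, 3) if a == "fwd" else s,
    is_goal=lambda s: s == 3,
    gamma=0.9,
)
```

On this chain the value of the state three steps from the goal converges to -(1 + 0.9 + 0.81) = -2.71, which already shows how discounting makes the value differ from the undiscounted cost-to-go of 3.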

To avoid the exhaustive enumeration of states in VI, Real Time Dynamic Programming (Sutton and Barto, 2018, RTDP, line 5) samples a subset of the state space based on the current policy. In this work, we use on-policy RTDP, which replaces the second $\max_a$ with $\mathbb{E}_{a \sim \pi}$ (line 13) for the current policy $\pi$ defined by the softmax of the current action-value estimates. On-policy methods are known to be more stable but can sometimes lead to slower convergence.

Next, to avoid representing the value estimates in an exhaustive table, we encode $V$ using a neural network parameterized by weights $\theta$ and apply the Bellman updates approximately with Stochastic Gradient Descent (line 13).


We use experience replay (Lin, 1993; Mnih et al., 2015) to smooth out changes in the policy and reduce the correlation between updated states (lines 6-12). We store the history of states into a FIFO buffer $B$, and update $V$ with mini-batches sampled from $B$ to leverage GPU-based parallelism.

We modify RTDP to address the assumptions (P1) in classical planning, resulting in line 15. First, in our multitask setting, where goals vary between problem instances, we wish to learn a single goal-parameterized value function that generalizes across problems (Schaul et al., 2015). We omitted the goal for notational concision, but all of our value functions are implicitly goal-parameterized, i.e., $V(s) = V(s, G)$.

Next, since larger problem instances typically require more steps to solve, states from these problems are likely to dominate the replay buffer. This can make updates to states from smaller problems rare, which can lead to catastrophic forgetting. To address this, we separate the buffer into buckets (line 22), where states in one bucket are from problem instances with the same number of objects. When we sample a mini-batch, we randomly select a bucket and randomly select states from this bucket.
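The bucketing scheme can be sketched as follows. This is a minimal illustration, not the paper's code: the class name `BucketedReplayBuffer`, the per-bucket capacity, and sampling with replacement are all assumptions made for the example.

```python
import random
from collections import deque
from typing import Any, Deque, Dict, List


class BucketedReplayBuffer:
    """FIFO replay buffer with one bucket per problem size (object count)."""

    def __init__(self, capacity_per_bucket: int, seed: int = 0):
        self.capacity = capacity_per_bucket
        self.buckets: Dict[int, Deque[Any]] = {}
        self.rng = random.Random(seed)

    def push(self, num_objects: int, state: Any) -> None:
        """Store a state in the bucket for instances with `num_objects` objects."""
        bucket = self.buckets.setdefault(num_objects, deque(maxlen=self.capacity))
        bucket.append(state)  # a full deque drops its oldest entry (FIFO)

    def sample(self, batch_size: int) -> List[Any]:
        """Pick one bucket uniformly, then draw states from that bucket only."""
        if not self.buckets:
            raise ValueError("buffer is empty")
        key = self.rng.choice(list(self.buckets))
        bucket = self.buckets[key]
        return [self.rng.choice(bucket) for _ in range(batch_size)]


# Toy usage with string placeholders instead of real states.
buf = BucketedReplayBuffer(capacity_per_bucket=2)
for placeholder in ["a", "b", "c"]:
    buf.push(2, placeholder)  # "a" is evicted once the bucket is full
buf.push(6, "x")
batch = buf.sample(4)
```

Because the bucket is chosen uniformly before any state is drawn, small instances are sampled as often as large ones regardless of how many states each contributed.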

Next, instead of terminating the inner loop and sampling the initial state in the same state space, we redefine $q_0$ to be a distribution of problem instances, and select a new training instance and start from its initial state (line 18).

Finally, since $\arg\max_a$ in RTDP is not possible at a state with no applicable actions (a.k.a. deadlock), we reset the environment upon entering such a state (line 19). We also select actions only from applicable actions and do not treat an inapplicable action as a self-cycle (line 20). Indeed, training a value function along a trajectory that includes self-cycles has no benefit because the test-time agents never execute them due to duplicate detection.

3.2 Planning Heuristics as Dense Rewards

The fundamental difficulty of applying RL-based approaches to classical planning is the lack of dense reward to guide exploration. We address this by combining heuristic functions (e.g., $h^{\mathrm{FF}}$) with a technique called potential-based reward shaping. To correctly perform this technique, we should take care of the difference between the discounted and non-discounted objectives (P2).

Potential-based reward shaping (Ng, Harada, and Russell, 1999) is a technique that helps RL algorithms by modifying the reward function $r$. Formally, with a potential function $\phi(s)$, a function of states, we define a shaped reward function on transitions, $\hat{r}(s, a, s')$, as follows:

$$\hat{r}(s, a, s') = r(s, a, s') + \gamma \phi(s') - \phi(s). \tag{2}$$

Let $\hat{\mathcal{M}}$ be an MDP with a shaped reward $\hat{r}$, and $\mathcal{M}$ be the original MDP. When the discount factor $\gamma < 1$, or when the MDP is proper, i.e., every policy eventually ($t \to \infty$) reaches a terminal state with probability 1 under $\gamma = 1$, any optimal policy $\hat{\pi}^*$ of $\hat{\mathcal{M}}$ is an optimal policy $\pi^*$ of $\mathcal{M}$ regardless of $\phi$, thus RL converges to a policy that is optimal in the original MDP $\mathcal{M}$. Also, the optimal value function $\hat{V}^*_\gamma$ under $\hat{\mathcal{M}}$ satisfies

$$V^*_\gamma(s) = \hat{V}^*_\gamma(s) + \phi(s).$$

In other words, an agent trained in $\hat{\mathcal{M}}$ is learning an offset of the original optimal value function from the potential function. The potential function thus acts as prior knowledge about the environment, which initializes the value function to non-zero values (Wiewiora, 2003).

Building on this theoretical background, we propose to leverage existing domain-independent heuristics to define a potential function that guides the agent while it learns to solve a given domain. A naive approach that implements this idea is to define $\phi(s) = -h(s)$. The value is negated because the MDP formulation seeks to maximize reward and $h$ is an estimate of cost-to-go, which should be minimized. Note that the agent receives an additional reward when $\gamma \phi(s') - \phi(s)$ is positive (Eq. 2). When $\phi = -h$, this means that approaching the goal and reducing $h$ is treated as a reward signal. Effectively, this allows us to use a domain-independent planning heuristic to generate dense rewards that aid in the RL algorithm's exploration.

However, this straightforward implementation has two issues: (1) First, when the problem contains a dead-end, the function may return $\infty$, i.e., $h(s) = \infty$. This causes a numerical error in gradient-based optimization. (2) Second, the value function still requires a correction even if $h$ is the "perfect" oracle heuristic $h^*$. Recall that $V^*_\gamma$ is the optimal discounted value function with $-1$ rewards per step. Given an optimal unit-cost cost-to-go $h^*(s)$ of a state $s$, the discounted value function and the non-discounted cost-to-go can be associated as follows:

$$V^*_\gamma(s) = \sum_{t=0}^{h^*(s)-1} \gamma^t \cdot (-1) = -\frac{1 - \gamma^{h^*(s)}}{1 - \gamma} \neq -h^*(s).$$

Therefore, the amount of correction needed (i.e., $\hat{V}^*_\gamma(s) = V^*_\gamma(s) - \phi(s)$) is not zero even in the presence of an oracle $h^*$. This is a direct consequence of the difference in discounting.

To address these issues, we propose to use the discounted value of the heuristic function as a potential function. Recall that a heuristic function $h(s)$ is an estimate of the cost-to-go from the current state $s$ to a goal. Since $h$ does not provide a concrete idea of how to reach a goal, we tend to treat it as a black box. An important realization, however, is that it nevertheless represents a sequence of actions; thus its value can be decomposed into a sum of action costs (below, left), and we define a corresponding discounted heuristic function $h_\gamma(s)$ (below, right):

$$h(s) = \sum_{t=0}^{h(s)-1} 1, \qquad h_\gamma(s) = \sum_{t=0}^{h(s)-1} \gamma^t \cdot 1 = \frac{1 - \gamma^{h(s)}}{1 - \gamma}.$$

Notice that $\gamma \to 1$ results in $h_\gamma(s) \to h(s)$. Also, $h_\gamma$ is bounded within $[0, \frac{1}{1-\gamma}]$, avoiding numerical issues.
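The discounted heuristic and the resulting shaped reward can be sketched in a few lines. This is a minimal illustration under stated assumptions: unit action costs, a discount factor strictly below 1, and a potential equal to the negated discounted heuristic (following the sign convention of the naive definition). The function names `discounted_heuristic` and `shaped_reward` are hypothetical.

```python
import math


def discounted_heuristic(h: float, gamma: float) -> float:
    """Discounted sum of h unit costs: (1 - gamma^h) / (1 - gamma)."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("this sketch assumes 0 <= gamma < 1")
    if math.isinf(h):
        # Dead-end: the discounted value stays finite at its upper bound.
        return 1.0 / (1.0 - gamma)
    return (1.0 - gamma ** h) / (1.0 - gamma)


def shaped_reward(r: float, h_s: float, h_next: float, gamma: float) -> float:
    """Shaped reward r + gamma * phi(s') - phi(s), with phi = -h_gamma (assumed)."""
    phi_s = -discounted_heuristic(h_s, gamma)
    phi_next = -discounted_heuristic(h_next, gamma)
    return r + gamma * phi_next - phi_s


gamma = 0.9
# A step that follows a perfect heuristic (3 -> 2) with a step reward of -1.
toward_goal = shaped_reward(-1.0, h_s=3, h_next=2, gamma=gamma)
# A step that moves away from the goal (2 -> 3).
away_from_goal = shaped_reward(-1.0, h_s=2, h_next=3, gamma=gamma)
# A dead-end heuristic value remains bounded.
dead_end = discounted_heuristic(math.inf, gamma)
```

With an exact heuristic and a step reward of -1, a step toward the goal yields a shaped reward of zero, so the residual left to learn vanishes; a step away from the goal is penalized.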

3.3 Value-Function Generalized over Size

To achieve the goal of learning domain-dependent, instance-independent heuristics, the neural value function used in the reward-shaping framework discussed above must be invariant to the number, the order, and the textual representation of propositions and objects in a PDDL definition (P3). We propose the use of Neural Logic Machine (Dong et al., 2019, NLM) layers, which were originally designed for a supervised learning task over FOL inputs. We describe here how states and goals are encoded but provide a summary of NLM layers in the Appendix.


NLMs act on binary arrays representing the presence of each proposition in a state. Propositions are grouped by the arity $N$ of the predicates they were grounded from. This forms a set of $(N+1)$-d arrays denoted as $\mathbf{z}/N$, where the leading $N$ dimensions are indexed by objects and the last dimension is indexed by predicates of arity $N$. For example, when we have objects a, b, c and four binary predicates on, connected, above and larger, we enumerate all combinations on(a,a), on(a,b), ..., larger(c,c), resulting in an array $\mathbf{z}/2 \in \{0,1\}^{3 \times 3 \times 4}$. Similarly, we may have $\mathbf{z}/1 \in \{0,1\}^{3 \times 2}$ for 2 unary predicates, and $\mathbf{z}/3 \in \{0,1\}^{3 \times 3 \times 3 \times 5}$ for 5 ternary predicates. The total number of elements in all arrays combined matches the number of propositions $|P(O)|$.

To form the input to the NLMs, we concatenate these binary arrays representing the state and another set of binary arrays encoding the goal conditions, thus doubling the size of the last dimension. Once computed, these arrays can be used by NLMs without any additional processing.
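The encoding described above can be sketched in plain Python with nested lists standing in for arrays. The helper names `encode` and `concat_last`, the alphabetical ordering of predicates, and the toy state and goal are assumptions for illustration; an actual implementation would more likely use a tensor library.

```python
from typing import Dict, List, Set, Tuple

Atom = Tuple[str, Tuple[str, ...]]  # (predicate name, argument objects)


def encode(atoms: Set[Atom], objects: List[str],
           predicates: Dict[str, int], arity: int) -> list:
    """Nested list with `arity` object axes and one final predicate axis."""
    names = sorted(p for p, n in predicates.items() if n == arity)

    def build(prefix: Tuple[str, ...]) -> list:
        if len(prefix) == arity:
            # Last axis: one bit per predicate of this arity.
            return [1 if (p, prefix) in atoms else 0 for p in names]
        return [build(prefix + (o,)) for o in objects]

    return build(())


def concat_last(a: list, b: list) -> list:
    """Concatenate two equally shaped nested lists along the last axis."""
    if a and isinstance(a[0], list):
        return [concat_last(x, y) for x, y in zip(a, b)]
    return a + b


objects = ["a", "b", "c"]
predicates = {"on": 2, "connected": 2, "above": 2, "larger": 2,
              "clear": 1, "ontable": 1}
state = {("on", ("a", "b")), ("clear", ("a",))}
goal = {("on", ("b", "a"))}

z2_state = encode(state, objects, predicates, arity=2)  # shape 3 x 3 x 4
z2_goal = encode(goal, objects, predicates, arity=2)    # shape 3 x 3 x 4
x2 = concat_last(z2_state, z2_goal)                     # shape 3 x 3 x 8
z1_state = encode(state, objects, predicates, arity=1)  # shape 3 x 2
```

The object axes grow with the number of objects while the last axis depends only on the domain's predicates, which is what lets the same network weights be applied to instances of different sizes.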

4 Experimental Evaluation

Our objective is to see whether our RL agent can improve the efficiency of a Greedy Best-First Search (GBFS), a standard algorithm for solving satisficing planning problems, over a standard domain-independent heuristic. The efficiency is measured in terms of the number of node-evaluations performed during search. In addition, we place an emphasis on generalization: We hope that NLMs are able to generalize from smaller training instances with fewer objects to instances with more objects.

We train our RL agent with rewards shaped by the $h^{\mathrm{FF}}$ and $h^{\mathrm{add}}$ heuristics obtained from the pyperplan (Alkhazraji et al., 2020) library. We write $h^{\mathrm{blind}}$ for the blind heuristic, which serves as a baseline (no shaping). While our program is compatible with a wide range of unit-cost IPC domains (see the list of 25 domains in Appendix A.7), we focus on extensively testing a selected subset of domains with a large enough number of independently trained models with different random seeds (20), to produce high-confidence results. This is due to the fact that RL algorithms tend to have a large amount of variance in their outcomes (Henderson et al., 2018), induced by sensitivity to initialization, randomization in exploration, and randomization in experience replay.

We trained our system on five classical planning domains: 4-ops blocksworld, ferry, gripper, logistics, satellite, as well as three additional IPC domains: miconic, parking, and visitall. In all domains, we generated problem instances using existing parameterized generators (Fawcett et al., 2011). (Please see Appendix Table 3 for the list of parameters.)

For each domain, we provided between 195 and 500 instances for training, and between 250 and 700 instances for testing. Each agent is trained for 50000 steps, which takes about 4 to 6 hours on Xeon E5-2600 v4 and Tesla K80. All hyperparameters can be found in the Appendix.


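| domain (total) | Baseline $h^{\mathrm{blind}}$ | Baseline $h^{\mathrm{FF}}$ | Baseline $h^{\mathrm{add}}$ | Ours $H^{\mathrm{blind}}$ | Ours $H^{\mathrm{FF}}$ | Ours $H^{\mathrm{add}}$ | GBFS-HGN | GBFS-H | GBFS-V | (GBFLS-H) | (GBFLS-V) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| blocks (250) | 0 | 126 | 87 | 99.7±8.9 (116) | 181.3±8.6 (198) | 117.1±6.1 (129) | 3 | 223 | 0 | 250 | 250 |
| ferry (250) | 0 | 138 | 250 | 45±12.2 (74) | 241.1±6.4 (248) | 250±0 (250) | 27 | 51 | 0 | 250 | 250 |
| gripper (250) | 0 | 250 | 250 | 31.6±14.4 (57) | 248.3±1.8 (250) | 250±0 (250) | 63 | 184 | 0 | 250 | 250 |
| logistics (250) | 0 | 103 | 237 | 0±0 (0) | 18.4±14.3 (55) | 43.5±26 (105) | - | 0 | 0 | 24 | 20 |
| miconic (442) | 171 | 442 | 442 | 61.1±24.5 (135) | 410.8±34 (442) | 429.9±39.1 (442) | - | 1 | 0 | 442 | 442 |
| parking (700) | 0 | 414 | 416 | 55.5±45.4 (129) | 387.8±27.5 (404) | 356.1±54.6 (407) | - | 105 | 0 | 133 | 107 |
| satellite (250) | 0 | 249 | 225 | 59.2±18.8 (95) | 243.9±7.8 (250) | 184.9±38.2 (225) | - | 12 | 0 | 163 | 170 |
| visitall (252) | 252 | 252 | 252 | 159.6±40.2 (250) | 251.9±0.3 (252) | 252±0 (252) | - | 104 | 0 | 252 | 252 |

Baseline columns are GBFS with the given heuristic; our columns report mean±std (max) of 20 runs.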

Table 1: Coverage of GBFS with a bound of 100,000 node evaluations. Our scores are highlighted in bold and underline when our average score is significantly better/worse than the baseline. Reinforcement learning with potential-based reward shaping improves the performance of heuristics in many domains and heuristics, but struggles in logistics. Note that even if the improvements are not reflected in coverage on some domains, the node evaluation plot (Figure 1) shows reduction in the search effort.

Figure 1: Best viewed on computer screens. (Left) Scatter plot showing the number of node evaluations on 8 domains, where the $x$-axis is for GBFS with a baseline heuristic and the $y$-axis is for the corresponding learned heuristic. Each point corresponds to a single test problem instance. Results of 20 random seeds are plotted against a single deterministic baseline. Failed instances are plotted on the border. Points below the diagonal are the instances which were improved by RL, and red circles highlight the best seed, i.e., the one whose sum of evaluations across instances is the smallest. They show that RL tends to improve the performance with and without reward shaping in the best case. (Right) The rate of finding a solution ($y$-axis) for blocks instances with a given number of objects ($x$-axis). The agents are trained on 2-6 objects while the test instances contain 10-50 objects (Table 3). Results on other domains are in the appendix (Figure 3).

Once the training was done, we evaluated the learned heuristics within GBFS on the test instances. Instead of setting time or memory limits, we limited the maximum node evaluations in GBFS to 100,000. If a problem was solved within the evaluations bound, the configuration gets the score 1 for that instance, otherwise it gets 0. The sum of the scores for the test instances of each domain is called the coverage in that domain. Table 1 shows the coverage in each of the tested domains, comparing our configurations to the baseline ones, as well as to the prior work ones (referred to in Section 4.1). The baseline configurations are denoted by their heuristic (e.g., $h^{\mathrm{FF}}$ is the GBFS with $h^{\mathrm{FF}}$), while our heuristic functions, obtained by training with reward shaping, are denoted with a capital $H$ (e.g., $H^{\mathrm{FF}}$). Additionally, Figure 1 goes beyond the pure coverage and compares the node evaluations to the baseline. These results answer the following questions:
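The evaluation protocol (GBFS with a bound on node evaluations, scored as coverage) can be sketched as follows. This is an illustrative sketch rather than the implementation used in the experiments: the functions `gbfs` and `coverage`, the choice to count one evaluation per newly generated state, and the toy number-line problem are all assumptions.

```python
import heapq
from itertools import count


def gbfs(initial, is_goal, successors, heuristic, max_evaluations=100_000):
    """Greedy best-first search; returns (solved, number of node evaluations)."""
    tie = count()  # tie-breaker so the heap never has to compare states
    evaluations = 1
    open_list = [(heuristic(initial), next(tie), initial)]
    seen = {initial}  # duplicate detection
    while open_list:
        _, _, state = heapq.heappop(open_list)
        if is_goal(state):
            return True, evaluations
        for succ in successors(state):
            if succ in seen:
                continue
            if evaluations >= max_evaluations:
                return False, evaluations  # evaluation budget exhausted
            seen.add(succ)
            evaluations += 1
            heapq.heappush(open_list, (heuristic(succ), next(tie), succ))
    return False, evaluations


def coverage(problems, heuristic, max_evaluations=100_000):
    """Number of problems solved within the evaluation bound."""
    return sum(
        int(gbfs(init, goal_test, succ_fn, heuristic, max_evaluations)[0])
        for init, goal_test, succ_fn in problems
    )


# Toy problem: walk along the integers from 0 to the goal at 5.
toy = (0, lambda s: s == 5, lambda s: [s - 1, s + 1])
distance = lambda s: abs(5 - s)  # an informative heuristic for the toy problem
solved, used = gbfs(toy[0], toy[1], toy[2], distance)
bounded = gbfs(toy[0], toy[1], toy[2], distance, max_evaluations=3)
```

Bounding evaluations rather than wall-clock time makes the comparison independent of hardware and of how expensive each heuristic is to compute.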

(Q1) Do our agents learn heuristic functions at all, i.e., $H^{\mathrm{blind}} > h^{\mathrm{blind}}$ (green dots in Figure 1), where $h^{\mathrm{blind}}$ is equivalent to breadth-first search with duplicate detection, and $H^{\mathrm{blind}}$ is baseline RL without reward shaping? With the exception of visitall and miconic, $h^{\mathrm{blind}}$ could not solve any instances in the test set, while using the heuristics learned without shaping ($H^{\mathrm{blind}}$) significantly improved coverage in 5 of the 6 domains.

(Q2) Do they improve over the baseline heuristics they were initialized with, i.e., $H^{\mathrm{FF}} > h^{\mathrm{FF}}$ and $H^{\mathrm{add}} > h^{\mathrm{add}}$? Table 1 suggests that the reward-shaping-based training has successfully improved the coverage upon the baseline heuristics in some domains (blocks, ferry). Moreover, even if no significant improvements are observed in the coverage, Figure 1 shows that the search effort is significantly reduced (ferry, gripper, miconic, visitall). However, the effect tends to be negative on logistics, or when the baseline is already quite effective (e.g., particularly $h^{\mathrm{add}}$, but note that $H^{\mathrm{add}}$ improves upon $h^{\mathrm{add}}$ in blocks). In such cases, there is little room to improve upon the baseline, and thus the high randomness of reinforcement learning may harm the performance.

(Q3) Do our agents with reward shaping outperform our agents without shaping? According to Table 1, $H^{\mathrm{FF}}$ and $H^{\mathrm{add}}$ outperform $H^{\mathrm{blind}}$. Notice that $h^{\mathrm{FF}}$ and $h^{\mathrm{add}}$ also outperform $h^{\mathrm{blind}}$. This suggests that the informativeness of the base heuristic used for reward shaping affects the quality of the learned heuristic. This matches the theoretical expectation: the potential function plays the role of domain knowledge that initializes the policy.

(Q4) Can the improvement be explained by accelerated exploration during training? Table 2 shows the total number of goals reached by the agent during training, indicating that reward shaping indeed helps the agent reach goals more often (compared to no reward shaping in $H^{\mathrm{blind}}$). See Appendix Figures 4-5 for cumulative plots.

(Q5) Do the heuristics obtained by our value function implemented with NLM layers maintain their improvement in larger problem instances, i.e., do they generalize to a larger number of objects? Figure 1 (Right) plots the number of objects ($x$-axis) and the ratio of success ($y$-axis) over blocks instances. The agents are trained on 2-6 objects while evaluated on 10-50 objects. It shows that the heuristic accuracy is improved in instances whose size far exceeds the training instances. Due to space limitations, plots for the remaining domains are in the Appendix, Figure 3.

| domain | $h^{\mathrm{blind}}$ (no shaping) | $h^{\mathrm{FF}}$ | $h^{\mathrm{add}}$ |
|---|---|---|---|
| blocks | 4691±102 | 4808±93 | 5089±71 |
| ferry | 4981±183 | 5598±46 | 5530±70 |
| gripper | 2456±190 | 3856±40 | 3482±125 |
| logistics | 3475±219 | 5059±155 | 5046±134 |
| miconic | 3568±26 | 3794±22 | 3808±25 |
| parking | 3469±509 | 4763±80 | 4716±56 |
| satellite | 3292±200 | 4388±80 | 4387±54 |
| visitall | 1512±91 | 1360±75 | 2063±53 |

Table 2: The cumulative number of goal states the agent has reached during training. The numbers are average and standard deviation over 20 seeds. Best numbers among heuristics are highlighted in bold, with ties equally highlighted when there are no statistically significant differences between them under Wilcoxon's rank-sum test. The results indicate that reward shaping significantly accelerates the exploration compared to no shaping ($h^{\mathrm{blind}}$).

4.1 Comparison with Previous Work

Next, we compared our learned heuristics with two recent state of the art learned heuristics. The first approach, STRIPS-HGN (Shen, Trevizan, and Thiébaux, 2020), is a supervised learning method that learns a heuristic function using hypergraph networks (HGN), which generalize Graph Neural Networks (GNNs) (Battaglia et al., 2018; Scarselli et al., 2009). Due to its hypergraph representation, it is able to learn domain-dependent as well as domain-independent heuristics, depending on the dataset. The authors have provided us with pre-trained weights for three domains: gripper, ferry, and blocksworld for the domain-dependent setting.

While STRIPS-HGN was originally used with $A^*$, for a fairer comparison to our methods we use it with GBFS, since we do not consider plan quality in this work. We denote the resulting method by GBFS-HGN. As with previous methods, we do not limit time or memory, bounding the number of evaluated nodes instead.

The second approach we compare to is GBFS-GNN (Rivlin, Hazan, and Karpas, 2019), an RL-based heuristic learning method that trains a GNN-based value function. The authors use Proximal Policy Optimization (Schulman et al., 2017), a state of the art RL method that stabilizes the training by limiting the amount of policy change in each step (the updated policy stays in the proximity of the previous policy). The value function is a GNN optionally equipped with attention (Veličković et al., 2018; Vaswani et al., 2017). In addition, the authors proposed to adjust the value function by the policy and its entropy when computing the heuristic value of a successor state. We call the result an $H$-adjusted value function.

The authors also proposed a variant of GBFS which launches a greedy lookahead guided by the heuristics after each expansion, similar to PROBE (Lipovetzky and Geffner, 2011), Jasper (Xie, Müller, and Holte, 2014), Mercury (Katz and Hoffmann, 2014) or $A^*$ with lookahead (Stern et al., 2010). We distinguish their algorithmic improvement from the heuristics improvement by naming their search algorithm Greedy Best First Lookahead Search (GBFLS). Our formal rendition of GBFLS can be found in Appendix A.3.

We counted the number of test instances that are solved by these approaches within 100,000 node evaluations. In the case of GBFLS, the evaluations also include the nodes that appear during the lookahead. We evaluated GBFS-HGN on the domains where pretrained weights are available. For GBFS-GNN, we obtained the source code from the authors (private communication) and minimally modified it to train on the same training instances that we used for our approach. We evaluated 4 variants of GBFS-GNN: GBFS-H, GBFS-V, GBFLS-H, and GBFLS-V, where “H” denotes the $H$-adjusted value function, and “V” denotes the original value function. Note that a fair evaluation should compare our method with GBFS-H/V, not GBFLS-H/V.

Table 1 shows the results. We first observed that a large part of the success of GBFS-GNN should be attributed to the lookahead extension of GBFS. This is because GBFLS-V scores higher than both GBFS-H and GBFS-V, i.e., GBFLS-V performs very well even with a weak heuristic (the original value function). While we report the coverage for both GBFLS-H/V and GBFS-H/V, the configurations that are comparable to our setting are GBFS-H/V. First, note that GBFS-HGN is significantly outperformed by all other methods. Comparing to the other two, both of our reward-shaped heuristics outperform GBFS-H in 7 out of the 8 domains, losing only on blocks. It is worth noting that our heuristic trained without reward shaping outperforms GBFS-H in miconic, satellite, and visitall, losing only on gripper. Since both it and GBFS-H are trained without reward shaping, the difference is due to the network shape (NLM vs GNN) and the training (Modified RTDP vs PPO).

5 Related Work

Early attempts to learn heuristic functions include applying shallow, fully connected neural networks to puzzle domains (Arfaee, Zilles, and Holte, 2010, 2011), its online version (Thayer, Dionne, and Ruml, 2011), combining SVMs (Cortes and Vapnik, 1995) and NNs (Satzger and Kramer, 2013), learning a residual from planning heuristics similar to ours (Yoon, Fern, and Givan, 2006, 2008), or a relative ranking between states instead of absolute values (Garrett, Kaelbling, and Lozano-Pérez, 2016). More recently, Ferber, Helmert, and Hoffmann (2020) tested fully-connected layers in modern frameworks. ASNet (Toyer et al., 2018) learns domain-dependent heuristics using a network that is similar to GNNs. These approaches are based on supervised learning methods that require a high-quality training dataset (accurate goal distance estimates of states) that is prepared separately. Unlike supervised methods that depend on such data, our RL-based approach must explore the environment by itself to collect useful data, which is automated but may not be sample efficient.

Other RL-based approaches include Policy Gradient with FF to accelerate exploration for probabilistic PDDL (Buffet, Aberdeen et al., 2007), and PPO-based Meta-RL (Duan et al., 2016) for PDDL3.1 discrete-continuous hybrid domains (Gutierrez and Leonetti, 2021). These approaches do not use reward shaping to improve heuristics, thus our contributions are orthogonal.

Grounds and Kudenko (2005) combined RL and STRIPS planning with reward shaping, but in a significantly different setting: they treat 2D navigation as a two-tier hierarchical problem where an unmodified FF (Hoffmann and Nebel, 2001) or Fast Downward (Helmert, 2006) is used as the high-level planner, whose plans are then used to shape the rewards for the low-level RL agent. They do not train the high-level planner.

6 Conclusion

In this paper, we proposed a domain-independent reinforcement learning framework for learning domain-specific heuristic functions. Unlike existing work on applying policy gradient to planning (Rivlin, Hazan, and Karpas, 2019), we based our algorithm on value iteration. We addressed the difficulty of training an RL agent with sparse rewards using a novel reward-shaping technique which leverages existing heuristics developed in the literature. We showed that our framework not only learns a heuristic function from scratch, but also learns better if aided by heuristic functions (reward shaping). Furthermore, the learned heuristics keep outperforming the baseline over a wide range of problem sizes, demonstrating their generalization over the number of objects in the environment.


References

  • Alkhazraji et al. (2020) Alkhazraji, Y.; Frorath, M.; Grützner, M.; Helmert, M.; Liebetraut, T.; Mattmüller, R.; Ortlieb, M.; Seipp, J.; Springenberg, T.; Stahl, P.; and Wülfing, J. 2020. Pyperplan.
  • Arfaee, Zilles, and Holte (2010) Arfaee, S. J.; Zilles, S.; and Holte, R. C. 2010. Bootstrap Learning of Heuristic Functions. In Felner, A.; and Sturtevant, N. R., eds., Proc. of Annual Symposium on Combinatorial Search. AAAI Press.
  • Arfaee, Zilles, and Holte (2011) Arfaee, S. J.; Zilles, S.; and Holte, R. C. 2011. Learning Heuristic Functions for Large State Spaces. Artificial Intelligence, 175(16-17): 2075–2098.
  • Asai and Fukunaga (2015) Asai, M.; and Fukunaga, A. 2015. Solving Large-Scale Planning Problems by Decomposition and Macro Generation. In Proc. of the International Conference on Automated Planning and Scheduling (ICAPS). Jerusalem, Israel.
  • Asai and Fukunaga (2017) Asai, M.; and Fukunaga, A. 2017. Tie-Breaking Strategies for Cost-Optimal Best First Search. J. Artif. Intell. Res.(JAIR), 58: 67–121.
  • Bäckström and Klein (1991) Bäckström, C.; and Klein, I. 1991. Planning in Polynomial Time: The SAS-PUBS Class. Computational Intelligence, 7(3): 181–197.
  • Badia et al. (2020) Badia, A. P.; Piot, B.; Kapturowski, S.; Sprechmann, P.; Vitvitskyi, A.; Guo, Z. D.; and Blundell, C. 2020. Agent57: Outperforming the Atari Human Benchmark. In Proc. of the International Conference on Machine Learning, 507–517. PMLR.
  • Battaglia et al. (2018) Battaglia, P. W.; Hamrick, J. B.; Bapst, V.; Sanchez-Gonzalez, A.; Zambaldi, V.; Malinowski, M.; Tacchetti, A.; Raposo, D.; Santoro, A.; Faulkner, R.; et al. 2018. Relational Inductive Biases, Deep Learning, and Graph Networks. arXiv preprint arXiv:1806.01261.
  • Bradbury et al. (2018) Bradbury, J.; Frostig, R.; Hawkins, P.; Johnson, M. J.; Leary, C.; Maclaurin, D.; Necula, G.; Paszke, A.; VanderPlas, J.; Wanderman-Milne, S.; and Zhang, Q. 2018. JAX: Composable Transformations of Python+NumPy Programs.
  • Brafman and Domshlak (2003) Brafman, R. I.; and Domshlak, C. 2003. Structure and Complexity in Planning with Unary Operators. J. Artif. Intell. Res.(JAIR), 18: 315–349.
  • Buffet, Aberdeen et al. (2007) Buffet, O.; Aberdeen, D.; et al. 2007. FF+FPG: Guiding a Policy-Gradient Planner. In Proc. of the International Conference on Automated Planning and Scheduling (ICAPS), 42–48.
  • Burns et al. (2012) Burns, E. A.; Hatem, M.; Leighton, M. J.; and Ruml, W. 2012. Implementing Fast Heuristic Search Code. In Proc. of Annual Symposium on Combinatorial Search.
  • Bylander (1994) Bylander, T. 1994. The Computational Complexity of Propositional STRIPS Planning. Artificial Intelligence, 69(1): 165–204.
  • Cortes and Vapnik (1995) Cortes, C.; and Vapnik, V. 1995. Support-Vector Networks. Machine Learning, 20(3): 273–297.
  • Domshlak, Hoffmann, and Katz (2015) Domshlak, C.; Hoffmann, J.; and Katz, M. 2015. Red-Black Planning: A New Systematic Approach to Partial Delete Relaxation. Artificial Intelligence, 221: 73–114.
  • Dong et al. (2019) Dong, H.; Mao, J.; Lin, T.; Wang, C.; Li, L.; and Zhou, D. 2019. Neural Logic Machines. In Proc. of the International Conference on Learning Representations.
  • Duan et al. (2016) Duan, Y.; Schulman, J.; Chen, X.; Bartlett, P. L.; Sutskever, I.; and Abbeel, P. 2016. RL$^2$: Fast Reinforcement Learning via Slow Reinforcement Learning. CoRR, abs/1611.02779.
  • Erol, Nau, and Subrahmanian (1995) Erol, K.; Nau, D. S.; and Subrahmanian, V. S. 1995. Complexity, Decidability and Undecidability Results for Domain-Independent Planning. Artificial Intelligence, 76(1–2): 75–88.
  • Fawcett et al. (2011) Fawcett, C.; Helmert, M.; Hoos, H.; Karpas, E.; Röger, G.; and Seipp, J. 2011. FD-Autotune: Domain-Specific Configuration using Fast Downward. In ICAPS 2011 Workshop on Planning and Learning, 13–17.
  • Ferber, Helmert, and Hoffmann (2020) Ferber, P.; Helmert, M.; and Hoffmann, J. 2020. Neural Network Heuristics for Classical Planning: A Study of Hyperparameter Space. In Proc. of European Conference on Artificial Intelligence, 2346–2353.
  • Fern, Khardon, and Tadepalli (2011) Fern, A.; Khardon, R.; and Tadepalli, P. 2011. The First Learning Track of the International Planning Competition. Machine Learning, 84(1-2): 81–107.
  • Fikes and Nilsson (1972) Fikes, R. E.; and Nilsson, N. J. 1972. STRIPS: A New Approach to the Application of Theorem Proving to Problem Solving. Artificial Intelligence, 2(3): 189–208.
  • Garrett, Kaelbling, and Lozano-Pérez (2016) Garrett, C. R.; Kaelbling, L. P.; and Lozano-Pérez, T. 2016. Learning to Rank for Synthesizing Planning Heuristics. In Proc. of International Joint Conference on Artificial Intelligence (IJCAI), 3089–3095.
  • Grounds and Kudenko (2005) Grounds, M.; and Kudenko, D. 2005. Combining Reinforcement Learning with Symbolic Planning. In Adaptive Agents and Multi-Agent Systems III. Adaptation and Multi-Agent Learning, 75–86. Springer.
  • Gutierrez and Leonetti (2021) Gutierrez, R. L.; and Leonetti, M. 2021. Meta Reinforcement Learning for Heuristic Planing. In Proc. of the International Conference on Automated Planning and Scheduling (ICAPS), volume 31, 551–559.
  • Hart, Nilsson, and Raphael (1968) Hart, P. E.; Nilsson, N. J.; and Raphael, B. 1968. A Formal Basis for the Heuristic Determination of Minimum Cost Paths. Systems Science and Cybernetics, IEEE Transactions on, 4(2): 100–107.
  • He et al. (2016) He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep Residual Learning for Image Recognition. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
  • Helmert (2006) Helmert, M. 2006. The Fast Downward Planning System. J. Artif. Intell. Res.(JAIR), 26: 191–246.
  • Henderson et al. (2018) Henderson, P.; Islam, R.; Bachman, P.; Pineau, J.; Precup, D.; and Meger, D. 2018. Deep Reinforcement Learning that Matters. In Proc. of AAAI Conference on Artificial Intelligence, volume 32.
  • Hoffmann and Nebel (2001) Hoffmann, J.; and Nebel, B. 2001. The FF Planning System: Fast Plan Generation through Heuristic Search. J. Artif. Intell. Res.(JAIR), 14: 253–302.
  • Jonsson (2007) Jonsson, A. 2007. The Role of Macros in Tractable Planning over Causal Graphs. In Proc. of International Joint Conference on Artificial Intelligence (IJCAI).
  • Jonsson and Bäckström (1998a) Jonsson, P.; and Bäckström, C. 1998a. State-Variable Planning under Structural Restrictions: Algorithms and Complexity. Artificial Intelligence, 100(1–2): 125–176.
  • Jonsson and Bäckström (1998b) Jonsson, P.; and Bäckström, C. 1998b. Tractable Plan Existence Does Not Imply Tractable Plan Generation. Annals of Mathematics and Artificial Intelligence, 22(3,4): 281–296.
  • Junghanns and Schaeffer (2000) Junghanns, A.; and Schaeffer, J. 2000. Sokoban: A Case-Study in the Application of Domain Knowledge in General Search Enhancements to Increase Efficiency in Single-Agent Search. Artificial Intelligence.
  • Katz and Domshlak (2008a) Katz, M.; and Domshlak, C. 2008a. New Islands of Tractability of Cost-Optimal Planning. J. Artif. Intell. Res.(JAIR), 32: 203–288.
  • Katz and Domshlak (2008b) Katz, M.; and Domshlak, C. 2008b. Structural Patterns Heuristics via Fork Decomposition. In Proc. of the International Conference on Automated Planning and Scheduling (ICAPS), 182–189.
  • Katz and Hoffmann (2014) Katz, M.; and Hoffmann, J. 2014. Mercury Planner: Pushing the Limits of Partial Delete Relaxation. In Eighth International Planning Competition (IPC-8): planner abstracts, 43–47.
  • Katz and Keyder (2012) Katz, M.; and Keyder, E. 2012. Structural Patterns Beyond Forks: Extending the Complexity Boundaries of Classical Planning. In Proc. of AAAI Conference on Artificial Intelligence, 1779–1785.
  • Keyder, Hoffmann, and Haslum (2012) Keyder, E.; Hoffmann, J.; and Haslum, P. 2012. Semi-Relaxed Plan Heuristics. In Proc. of the International Conference on Automated Planning and Scheduling (ICAPS), 128–136.
  • Kocsis and Szepesvári (2006) Kocsis, L.; and Szepesvári, C. 2006. Bandit Based Monte-Carlo Planning. In Proc. of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 282–293. Springer.
  • Lillicrap et al. (2016) Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2016. Continuous Control with Deep Reinforcement Learning. In Proc. of the International Conference on Learning Representations.
  • Lin (1993) Lin, L.-J. 1993. Reinforcement Learning for Robots using Neural Networks. Technical report, Carnegie-Mellon Univ Pittsburgh PA School of Computer Science.
  • Lipovetzky and Geffner (2011) Lipovetzky, N.; and Geffner, H. 2011. Searching for Plans with Carefully Designed Probes. In Proc. of the International Conference on Automated Planning and Scheduling (ICAPS).
  • Ma et al. (2020) Ma, T.; Ferber, P.; Huo, S.; Chen, J.; and Katz, M. 2020. Online Planner Selection with Graph Neural Networks and Adaptive Scheduling. In Proc. of AAAI Conference on Artificial Intelligence, volume 34, 5077–5084.
  • Mnih et al. (2016) Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous Methods for Deep Reinforcement Learning. In Proc. of the International Conference on Machine Learning, 1928–1937. PMLR.
  • Mnih et al. (2015) Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-Level Control through Deep Reinforcement Learning. Nature, 518(7540): 529–533.
  • Muggleton (1991) Muggleton, S. 1991. Inductive Logic Programming. New Generation Computing, 8(4): 295–318.
  • Ng, Harada, and Russell (1999) Ng, A. Y.; Harada, D.; and Russell, S. 1999. Policy Invariance under Reward Transformations: Theory and Application to Reward Shaping. In Proc. of the International Conference on Machine Learning, volume 99, 278–287.
  • Pohl (1970) Pohl, I. 1970. Heuristic Search Viewed as Path Finding in a Graph. Artificial Intelligence, 1(3-4): 193–204.
  • Ravanbakhsh, Schneider, and Poczos (2016) Ravanbakhsh, S.; Schneider, J.; and Poczos, B. 2016. Deep Learning with Sets and Point Clouds. arXiv preprint arXiv:1611.04500.
  • Reiter (1981) Reiter, R. 1981. On Closed World Data Bases. In Readings in Artificial Intelligence, 119–140. Elsevier.
  • Rivlin, Hazan, and Karpas (2019) Rivlin, O.; Hazan, T.; and Karpas, E. 2019. Generalized Planning With Deep Reinforcement Learning. In Proc. of the ICAPS Workshop on Bridging the Gap Between AI Planning and Reinforcement Learning (PRL).
  • Russell et al. (1995) Russell, S. J.; Norvig, P.; Canny, J. F.; Malik, J. M.; and Edwards, D. D. 1995. Artificial Intelligence: A Modern Approach, volume 2. Prentice Hall, Englewood Cliffs.
  • Satzger and Kramer (2013) Satzger, B.; and Kramer, O. 2013. Goal Distance Estimation for Automated Planning using Neural Networks and Support Vector Machines. Natural Computing, 12(1): 87–100.
  • Scarselli et al. (2009) Scarselli, F.; Gori, M.; Tsoi, A. C.; Hagenbuchner, M.; and Monfardini, G. 2009. The Graph Neural Network Model. IEEE Transactions on Neural Networks, 20(1): 61–80.
  • Schaul et al. (2015) Schaul, T.; Horgan, D.; Gregor, K.; and Silver, D. 2015. Universal Value Function Approximators. In Proc. of the International Conference on Machine Learning, 1312–1320. PMLR.
  • Schulman et al. (2015) Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; and Moritz, P. 2015. Trust Region Policy Optimization. In Proc. of the International Conference on Machine Learning, 1889–1897. PMLR.
  • Schulman et al. (2017) Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.
  • Shen, Trevizan, and Thiébaux (2020) Shen, W.; Trevizan, F.; and Thiébaux, S. 2020. Learning Domain-Independent Planning Heuristics with Hypergraph Networks. In Proc. of the International Conference on Automated Planning and Scheduling (ICAPS), volume 30, 574–584.
  • Silver et al. (2016) Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. 2016. Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature, 529(7587): 484–489.
  • Stern et al. (2010) Stern, R.; Kulberis, T.; Felner, A.; and Holte, R. 2010. Using Lookaheads with Optimal Best-First Search. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 24.
  • Sutton and Barto (2018) Sutton, R. S.; and Barto, A. G. 2018. Reinforcement Learning: An Introduction. MIT press.
  • Thayer, Dionne, and Ruml (2011) Thayer, J.; Dionne, A.; and Ruml, W. 2011. Learning Inadmissible Heuristics during Search. In Proc. of the International Conference on Automated Planning and Scheduling (ICAPS), volume 21.
  • Toyer et al. (2018) Toyer, S.; Trevizan, F.; Thiébaux, S.; and Xie, L. 2018. Action Schema Networks: Generalised Policies with Deep Learning. In Proc. of AAAI Conference on Artificial Intelligence, volume 32.
  • Vaswani et al. (2017) Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is All You Need. In Advances in Neural Information Processing Systems, 5998–6008.
  • Veličković et al. (2018) Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; and Bengio, Y. 2018. Graph Attention Networks. In Proc. of the International Conference on Learning Representations.
  • Wiewiora (2003) Wiewiora, E. 2003. Potential-Based Shaping and Q-value Initialization are Equivalent. J. Artif. Intell. Res.(JAIR), 19: 205–208.
  • Wolpert, Macready et al. (1995) Wolpert, D. H.; Macready, W. G.; et al. 1995. No Free Lunch Theorems for Search. Technical report, Technical Report SFI-TR-95-02-010, Santa Fe Institute.
  • Xie, Müller, and Holte (2014) Xie, F.; Müller, M.; and Holte, R. 2014. Jasper: The Art of Exploration in Greedy Best-First Search. In Proc. of the International Planning Competition, 39–42.
  • Yoon, Fern, and Givan (2008) Yoon, S.; Fern, A.; and Givan, R. 2008. Learning Control Knowledge for Forward Search Planning. Journal of Machine Learning Research, 9(4).
  • Yoon, Fern, and Givan (2006) Yoon, S. W.; Fern, A.; and Givan, R. 2006. Learning Heuristic Functions from Relaxed Plans. In Proc. of the International Conference on Automated Planning and Scheduling (ICAPS), volume 2, 3.
  • Zaheer et al. (2017) Zaheer, M.; Kottur, S.; Ravanbakhsh, S.; Poczos, B.; Salakhutdinov, R. R.; and Smola, A. J. 2017. Deep Sets. In Advances in Neural Information Processing Systems, 3391–3401.

Appendix A

A.1 Neural Logic Machines

The NLM (Dong et al., 2019) is a neural Inductive Logic Programming (ILP) (Muggleton, 1991) system based on FOL and the Closed-World Assumption (Reiter, 1981). NLM represents a set of continuous relaxations of Horn rules as a set of weights in a neural network and is able to infer the truthfulness of some target formulae as a probability. For example, in Blocksworld, based on an input such as on(a, b) for blocks a, b, NLMs may be trained to predict clear(b) is true by learning a quantified formula .

NLM takes a boolean Multi-Arity Predicate Representation (MAPR) of propositional groundings of FOL statements. Assume that we need to represent FOL statements combining predicates of different arities. We denote a set of predicates of arity $n$ as $P/n$ (Prolog notation) and the Boolean tensor representation of its propositions as $z/n \in \mathbb{B}^{O^n \times |P/n|}$, where $O$ is the number of objects. A MAPR is a tuple of tensors $z/n$, one per arity $n$ up to the largest arity $N$. For example, when we have objects a, b, c and four binary predicates on, connected, above and larger, we enumerate all combinations on(a,a), on(a,b), …, larger(c,c), resulting in an array $z/2 \in \mathbb{B}^{3 \times 3 \times 4}$. Similarly, we may have $z/1 \in \mathbb{B}^{3 \times 2}$ for 2 unary predicates, and $z/3 \in \mathbb{B}^{3 \times 3 \times 3 \times 5}$ for 5 ternary predicates.
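The tensor shapes in this example can be checked with a small sketch. It is only an illustration: the helper names `build_tensor` and `shape`, the unary predicates `clear` and `ontable`, and the particular true facts are assumptions made here, and nested Python lists stand in for real tensors.

```python
from typing import Sequence, Set, Tuple

Fact = Tuple[str, ...]  # (predicate, arg1, ..., argn)


def build_tensor(objects: Sequence[str], predicates: Sequence[str],
                 arity: int, facts: Set[Fact], prefix: Tuple[str, ...] = ()):
    """Nested lists with `arity` object axes and one trailing predicate axis."""
    if len(prefix) == arity:
        # Closed-world assumption: a grounding is true only if it is listed.
        return [(p, *prefix) in facts for p in predicates]
    return [build_tensor(objects, predicates, arity, facts, prefix + (o,))
            for o in objects]


def shape(tensor) -> Tuple[int, ...]:
    """Read off the dimensions of a regular nested list."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        tensor = tensor[0]
    return tuple(dims)


objects = ["a", "b", "c"]
# Hypothetical true facts; every other grounding is false.
facts = {("on", "a", "b"), ("above", "a", "b"), ("larger", "c", "a")}
z2 = build_tensor(objects, ["on", "connected", "above", "larger"], 2, facts)
z1 = build_tensor(objects, ["clear", "ontable"], 1, {("clear", "a")})
```

Here `z2[0][1]` holds the four predicate values for the object pair (a, b), so the number of predicates only affects the last axis while the number of objects only affects the leading axes.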

NLM is designed to learn a class of FOL rules with the following set of restrictions: Every rule is a Horn rule, no rule contains function terms (such as a function that returns an object), there is no recursion, and all rules are applied between neighboring arities. Due to the lack of recursion, the set of rules can be stratified into layers. Let be a set of intermediate conclusions in the -th stratum. Under these assumptions, the following set of rules are sufficient for representing any rules Dong et al. (2019):


Here, (respectively) are predicates, is a sequence of parameters, and is a formula consisting of logical operations and terms . Intermediate predicates and have one less / one more parameters than , e.g., when , and . extracts the predicates whose arity is the same as that of . is a permutation of , and iterates over to generate propositional groundings with various argument orders. represents a formula that combines a subset of these propositions. By chaining these set of rules from to for a sufficient number of times (e.g., from to ), it is able to represent any FOL Horn rules without recursions Dong et al. (2019).

Figure 2: (Left) expand and reduce operations performed on a boolean MAPR containing nullary, unary, and binary predicates and three objects, a, b, and c. Each white / black square represents a boolean value (true / false). (Right) perm tensor operation performed on binary predicates. They are generated by performing the same operations shown on the left side of the figure on , , and . Each predicate is represented as a matrix. For a matrix, perm is equivalent to concatenating the matrix with its transposition. When perm is applied to ternary predicates, it concatenates tensors. After perm, a single, shared fully-connected layer is applied to each combination of arguments (such an operation is sometimes called a pointwise convolution).

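All three operations (expand, reduce, and compose) can be implemented as tensor operations over MAPRs (Figure 2). Given a binary tensor $z/n$ of shape $O^n \times |P/n|$ (for $O$ objects and $|P/n|$ predicates of arity $n$), expand copies the $n$-th axis to the $(n+1)$-th axis, resulting in a shape $O^{n+1} \times |P/n|$, and reduce takes the $\max$ of the $n$-th axis, resulting in a shape $O^{n-1} \times |P/n|$. The reduce operation can also use $\min$, in which case $\exists$ becomes $\forall$.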
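As a rough illustration of expand and reduce, the sketch below represents a tensor as a dictionary from object-index tuples to lists of predicate values. This representation and the names `expand` and `reduce_last` are assumptions chosen for readability, not the layout used by an actual NLM implementation. On booleans, `any` plays the role of max and `all` the role of min.

```python
from itertools import product
from typing import Callable, Dict, Iterable, List, Tuple

# Object-index tuple -> values of each predicate for that tuple.
Tensor = Dict[Tuple[int, ...], List[bool]]


def expand(tensor: Tensor, num_objects: int) -> Tensor:
    """Arity n -> n+1: the new last argument is ignored by every predicate."""
    return {idx + (j,): list(values)
            for idx, values in tensor.items() for j in range(num_objects)}


def reduce_last(tensor: Tensor,
                quantifier: Callable[[Iterable[bool]], bool] = any) -> Tensor:
    """Arity n -> n-1: quantify out the last argument (any = max, all = min)."""
    groups: Dict[Tuple[int, ...], List[List[bool]]] = {}
    for idx, values in tensor.items():
        groups.setdefault(idx[:-1], []).append(values)
    return {idx: [quantifier(column) for column in zip(*rows)]
            for idx, rows in groups.items()}


# Objects a, b, c are indexed 0, 1, 2; the single binary predicate is on(x, y),
# and only on(a, b) is true.
on = {(i, j): [(i, j) == (0, 1)] for i, j in product(range(3), repeat=2)}
exists_on = reduce_last(on, any)  # exists y. on(x, y)
forall_on = reduce_last(on, all)  # forall y. on(x, y)
```

Since expand only duplicates values along the new axis, reducing an expanded tensor with either quantifier returns the original tensor.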

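Finally, the compose operation combines the information between the neighboring tensors $z/(n-1)$, $z/n$, and $z/(n+1)$. In order to use the information in the neighboring arities ($P/(n-1)$, $P/n$, and $P/(n+1)$), the input concatenates $z/n$ with $\textsc{expand}(z/(n-1))$ and $\textsc{reduce}(z/(n+1))$, resulting in a shape $O^n \times C$ where $C = |P/(n-1)| + |P/n| + |P/(n+1)|$. Next, a Perm function enumerates and concatenates the results of permuting the first $n$ axes in the tensor, resulting in a shape $O^n \times (n! \cdot C)$. It then applies an $n$-D pointwise convolutional filter $f_n$ with $Q$ output features, resulting in $O^n \times Q$, i.e., applying a fully connected layer to each vector of length $n! \cdot C$ while sharing the weights. It is activated by any nonlinearity $\sigma$ to obtain the final result, which we denote as $\textsc{compose}(z, n, Q, \sigma)$. Formally, $\textsc{compose}(z, n, Q, \sigma) = \sigma(f_n(\textsc{perm}(\textsc{concat}(\textsc{expand}(z/(n-1)),\ z/n,\ \textsc{reduce}(z/(n+1))))))$.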

An NLM contains (the maximum arity) compose operation for the neighboring arities, with appropriately omitting both ends ( and ) from the concatenation. We denote the result as . These horizontal arity-wise compositions can be layered vertically, allowing the composition of predicates whose arities differ more than 1 (e.g., two layers of NLM can combine unary and quaternary predicates). Since is applied in a convolutional manner over object tuples, the number of weights in an NLM layer does not depend on the number of objects in the input. However, it is still affected by the number of predicates in the input, which alters .
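The permutation step used inside compose can be sketched as follows. The dictionary-based tensor layout and the function name `perm` are assumptions for illustration only. For binary predicates the result matches the description in Figure 2 (concatenating a matrix with its transposition), and for arity $n$ the feature length grows by a factor of $n!$.

```python
from itertools import permutations, product
from typing import Dict, List, Tuple

# Object-index tuple -> values of each predicate for that tuple.
Tensor = Dict[Tuple[int, ...], List[bool]]


def perm(tensor: Tensor, arity: int) -> Tensor:
    """Concatenate, for every index tuple, the features found at all
    permutations of that tuple's arguments."""
    orders = list(permutations(range(arity)))
    return {idx: [value
                  for order in orders
                  for value in tensor[tuple(idx[k] for k in order)]]
            for idx in tensor}


# Objects a, b, c are indexed 0, 1, 2; only on(a, b) is true.
on = {(i, j): [(i, j) == (0, 1)] for i, j in product(range(3), repeat=2)}
on_perm = perm(on, 2)

# A ternary tensor with 5 (all-false) predicates: 3! * 5 = 30 features per tuple.
ternary = {idx: [False] * 5 for idx in product(range(3), repeat=3)}
ternary_perm = perm(ternary, 3)
```

After this step, a single shared fully-connected layer applied to each feature vector can relate `on(x, y)` to `on(y, x)` without depending on the number of objects.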

When the predicates in the input PDDL domain have a maximum arity , we specify the maximum intermediate arity and the depth of NLM layers as a hyperparameter. The intermediate NLM layers expand the arity up to using expand operation, and shrink the arity near the output because the value function is a scalar (arity 0). For example, with , , , the arity of each layer follows . Higher arities are not necessary near the output because the information in each layer propagates only to the neighboring arities. Since each expand/reduce operation only increments/decrements the arity by one, must satisfy .

We consider NLMs where intermediate layers have a sigmoid activation function, while the output is linear, since we use its raw value as the predicted correction to the heuristic function. In addition, we implement NLM with a skip connection that was popularized in the ResNet image classification network (He et al., 2016): the input of each layer is a concatenation of the outputs of all previous layers. Due to the direct connections between the layers in various depths, the layers near the input receive more gradient information from the output, mitigating the vanishing gradient problem in deep neural networks.

A.2 Domain-Independent Heuristics for Classical Planning

In this section, we discuss various approximations of the delete-relaxed optimal cost $h^+$. Given a classical planning problem and a state $s$, each heuristic is typically implicitly conditioned on the goal condition. The $h^{\mathrm{add}}$ heuristic is recursively defined as follows:


The $h^{\mathrm{FF}}$ heuristic can be defined based on $h^{\mathrm{add}}$ as a subprocedure. The action which minimizes the second case of the definition above is conceptually a “cheapest action that achieves a subgoal for the first time”, which is called a cheapest achiever / best supporter of that subgoal. Using $h^{\mathrm{add}}$ and its best supporter function, $h^{\mathrm{FF}}$ is defined as follows:
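The best-supporter idea can be illustrated with a minimal sketch. It assumes the common textbook formulation: an additive relaxed-cost estimate computed by a fixpoint iteration that ignores delete effects, and a relaxed plan collected by following best supporters back from the goal. The names `Action`, `additive_costs`, `h_add`, and `h_ff` are hypothetical, and this is not pyperplan's implementation.

```python
from typing import Dict, FrozenSet, List, NamedTuple, Tuple

INF = float("inf")


class Action(NamedTuple):
    name: str
    pre: FrozenSet[str]
    add: FrozenSet[str]
    cost: float = 1.0  # delete effects are ignored under delete relaxation


def additive_costs(state: FrozenSet[str],
                   actions: List[Action]) -> Tuple[Dict[str, float], Dict[str, Action]]:
    """Fixpoint of the additive cost estimate, recording each best supporter."""
    cost = {p: 0.0 for p in state}
    supporter: Dict[str, Action] = {}
    changed = True
    while changed:
        changed = False
        for a in actions:
            if any(p not in cost for p in a.pre):
                continue  # some precondition is not yet reachable
            c = a.cost + sum(cost[p] for p in a.pre)
            for p in a.add:
                if c < cost.get(p, INF):
                    cost[p], supporter[p] = c, a  # cheaper achiever found
                    changed = True
    return cost, supporter


def h_add(state, goal, actions) -> float:
    cost, _ = additive_costs(state, actions)
    return sum(cost.get(p, INF) for p in goal)


def h_ff(state, goal, actions) -> float:
    """Cost of the relaxed plan made of the best supporters of all subgoals."""
    cost, supporter = additive_costs(state, actions)
    if any(p not in cost for p in goal):
        return INF
    relaxed_plan: Dict[str, float] = {}
    stack = list(goal)
    while stack:
        p = stack.pop()
        if p in state:
            continue
        a = supporter[p]
        if a.name not in relaxed_plan:  # count each action once
            relaxed_plan[a.name] = a.cost
            stack.extend(a.pre)
    return sum(relaxed_plan.values())


toy_actions = [
    Action("make-b", frozenset({"a"}), frozenset({"b"})),
    Action("make-c", frozenset({"a"}), frozenset({"c"})),
    Action("make-g", frozenset({"b", "c"}), frozenset({"g"})),
]
toy_state, toy_goal = frozenset({"a"}), frozenset({"g", "b"})
```

In this toy problem the additive estimate counts `make-b` twice (once for `b` itself and once inside `g`), giving 4, whereas the relaxed plan contains three distinct actions, giving 3.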


A.3 Greedy Best First Search and Greedy Best First Lookahead Search (Rivlin, Hazan, and Karpas, 2019)

Given a classical planning problem $\langle P, A, I, G \rangle$, we define its state space as a directed graph $(V, E)$ where $V = 2^P$, i.e., the power set of the propositions $P$. Greedy Best First Search is a greedy version of the $A^*$ algorithm (Hart, Nilsson, and Raphael, 1968), therefore we define $A^*$ first.

We follow the optimized version of the algorithm discussed in Burns et al. (2012), which does not use a CLOSE list and avoids moving elements between the CLOSE list and the OPEN list by instead managing a flag for each search state. Let $f(s) = g(s) + h(s)$ be a function that computes the sum of $g(s)$ and $h(s)$, where $g(s)$ is a value stored for each state $s$ which represents the currently known upper bound of the shortest path cost from the initial state $I$. For every state $s$, $g(s)$ is initialized to infinity except $g(I) = 0$. Whenever a state $s$ is expanded, $f(s)$ is a lower bound of the path cost that goes through $s$. The $A^*$ algorithm is defined as in Algorithm 2. We simplified some aspects such as updating the parent node pointer, the rules for tiebreaking (Asai and Fukunaga, 2017), or extraction of the plan by backtracking the parent pointers. Notice that the update rule for $g$-values in $A^*$ is a Bellman update specialized for a positive cost function.

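1:  Priority queue $\text{OPEN} \leftarrow \emptyset$.
2:  $g(s) \leftarrow \infty$ for all $s \in V$ except $g(I) \leftarrow 0$.
3:  $s$.closed $\leftarrow \bot$ for all $s \in V$.
4:  $\text{OPEN.push}(I, f(I))$
5:  while $\text{OPEN} \neq \emptyset$ do
6:     $s \leftarrow \text{OPEN.pop}()$ ▷ State Expansion
7:     if $G \subseteq s$ then
8:        return $s$
9:     if $s$.closed $= \top$ then
10:        continue
11:     else
12:        $s$.closed $\leftarrow \top$
13:     for $a \in \{a \in A \mid \textsc{pre}(a) \subseteq s\}$ do
14:        successor $t = a(s)$ ▷ Note: Appropriate caching of $h(t)$ is necessary.
15:        if $g(t) > g(s) + \textsc{cost}(a)$ then
16:           $g(t) \leftarrow g(s) + \textsc{cost}(a)$ ▷ Bellman update
17:           $t$.closed $\leftarrow \bot$ ▷ Reopening
18:           $\text{OPEN.push}(t, f(t))$
Algorithm 2 $A^*$ algorithm for a planning problem $\langle P, A, I, G \rangle$ with state space $(V, E)$.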
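A minimal Python sketch of this flag-based $A^*$ is given below for illustration. It assumes states are frozensets of propositions and actions are STRIPS-like tuples; the names `Action`, `astar`, and `move`, and the toy problem, are hypothetical. As in the simplified description above, it returns only the plan cost and omits parent pointers and tiebreaking rules beyond first-in-first-out.

```python
import heapq
from itertools import count
from typing import Callable, FrozenSet, List, NamedTuple, Optional

State = FrozenSet[str]


class Action(NamedTuple):
    name: str
    pre: FrozenSet[str]
    add: FrozenSet[str]
    delete: FrozenSet[str]
    cost: float = 1.0


def astar(init: State, goal: FrozenSet[str], actions: List[Action],
          h: Callable[[State], float]) -> Optional[float]:
    """Return the cost of the plan found, or None if OPEN is exhausted."""
    g = {init: 0.0}      # best known path cost per state
    closed = set()       # states whose closed flag is currently set
    tie = count()        # FIFO tie-breaker; also avoids comparing states
    open_list = [(h(init), next(tie), init)]
    while open_list:
        _, _, s = heapq.heappop(open_list)
        if goal <= s:    # goal test when the node is popped
            return g[s]
        if s in closed:
            continue
        closed.add(s)
        for a in actions:
            if not a.pre <= s:
                continue
            t = (s - a.delete) | a.add
            new_g = g[s] + a.cost
            if new_g < g.get(t, float("inf")):
                g[t] = new_g       # Bellman update
                closed.discard(t)  # reopening
                heapq.heappush(open_list, (new_g + h(t), next(tie), t))
    return None


def move(src: str, dst: str, cost: float) -> Action:
    return Action(f"move-{src}-{dst}", frozenset({f"at-{src}"}),
                  frozenset({f"at-{dst}"}), frozenset({f"at-{src}"}), cost)


actions = [move("A", "B", 1.0), move("B", "C", 1.0), move("A", "C", 5.0)]
best_cost = astar(frozenset({"at-A"}), frozenset({"at-C"}), actions,
                  lambda s: 0.0)  # blind heuristic
```

On this toy problem the direct move costs 5, but the goal state is first reached with $g = 5$ and later improved to $g = 2$ through the Bellman update before it is popped, so the returned cost is optimal.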

Three algorithms can be derived from $A^*$ by redefining the sorting key $f$ for the priority queue OPEN. First, ignoring the heuristic function by redefining $f(s) = g(s)$ yields Dijkstra’s search algorithm. Another is weighted $A^*$ (Pohl, 1970), where we redefine $f(s) = g(s) + w \cdot h(s)$ for some value $w > 1$, which results in trusting the heuristic guidance relatively more greedily.
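The effect of changing the sorting key can be seen in a small sketch. The function names and the three candidate nodes with their $g$ and $h$ values are made up for illustration; only the key definitions follow the text.

```python
from typing import Callable, Dict, Tuple


def make_key(variant: str, w: float = 1.0) -> Callable[[float, float], float]:
    """Sorting key as a function of (g, h) for each search variant."""
    keys = {
        "astar": lambda g, h: g + h,
        "dijkstra": lambda g, h: g,        # heuristic ignored
        "wastar": lambda g, h: g + w * h,  # heuristic weighted by w
    }
    return keys[variant]


# Hypothetical OPEN list contents: node name -> (g, h).
candidates: Dict[str, Tuple[float, float]] = {
    "x": (2.0, 4.0),
    "y": (5.0, 2.0),
    "z": (1.0, 9.0),
}


def first_expanded(variant: str, w: float = 1.0) -> str:
    """Name of the node that would be popped first under the given key."""
    key = make_key(variant, w)
    return min(candidates, key=lambda name: key(*candidates[name]))
```

With these values, the plain key prefers `x`, ignoring the heuristic prefers the shallow node `z`, and a weight of 3 prefers `y`, the node with the smallest heuristic value.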

As the extreme version of WA*, $w \to \infty$ conceptually yields the Greedy Best First Search algorithm (Russell et al., 1995), which trusts the heuristic guidance completely greedily. In practice, we implement it by ignoring the $g$ value, i.e., $f(s) = h(s)$. This also simplifies some of the conditionals in Algorithm 2: there is no need for updating the $g$ value or reopening the node. In addition, a purely satisficing algorithm like GBFS can enjoy an additional enhancement called early goal detection. In $A^*$, the goal condition is checked when the node is popped from the OPEN list (line 7); detecting the goal earlier would leave the possibility of returning a suboptimal goal node. In contrast, since we do not have this optimality requirement in GBFS, the goal condition can be checked in line 12 where the successor state is generated. GBFS is thus defined as in Algorithm 3.

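1:  Priority queue $\text{OPEN} \leftarrow \emptyset$.
2:  $s$.closed $\leftarrow \bot$ for all $s \in V$.
3:  $\text{OPEN.push}(I, h(I))$
4:  while $\text{OPEN} \neq \emptyset$ do
5:     $s \leftarrow \text{OPEN.pop}()$ ▷ State Expansion
6:     if $s$.closed $= \top$ then
7:        continue
8:     else
9:        $s$.closed $\leftarrow \top$
10:     for $a \in \{a \in A \mid \textsc{pre}(a) \subseteq s\}$ do
11:        successor $t = a(s)$ ▷ Note: Appropriate caching of $h(t)$ is necessary.
12:        if $G \subseteq t$ then
13:           return $t$ ▷ Early goal detection.
14:        $\text{OPEN.push}(t, h(t))$
Algorithm 3 GBFS algorithm for a planning problem $\langle P, A, I, G \rangle$ with state space $(V, E)$.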
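The following is a minimal sketch of GBFS with early goal detection and a bound on the number of heuristic evaluations, under the same assumed STRIPS-like encoding (frozenset states, hypothetical `Action` tuples). Unlike the simplified pseudocode, it also keeps parent pointers so that a plan can be returned; the goal-count heuristic and the toy problem are illustrative assumptions.

```python
import heapq
from itertools import count
from typing import Callable, Dict, FrozenSet, List, NamedTuple, Optional, Tuple

State = FrozenSet[str]


class Action(NamedTuple):
    name: str
    pre: FrozenSet[str]
    add: FrozenSet[str]
    delete: FrozenSet[str]
    cost: float = 1.0


def extract_plan(parent: Dict[State, Optional[Tuple[State, str]]],
                 state: State) -> List[str]:
    plan = []
    while parent[state] is not None:
        state, name = parent[state]
        plan.append(name)
    return plan[::-1]


def gbfs(init: State, goal: FrozenSet[str], actions: List[Action],
         h: Callable[[State], float],
         max_evaluations: int = 100_000) -> Tuple[Optional[List[str]], int]:
    """Return (plan or None, number of heuristic evaluations)."""
    if goal <= init:
        return [], 0
    h_cache: Dict[State, float] = {init: h(init)}  # one evaluation per state
    parent: Dict[State, Optional[Tuple[State, str]]] = {init: None}
    closed = set()
    tie = count()
    open_list = [(h_cache[init], next(tie), init)]
    while open_list:
        _, _, s = heapq.heappop(open_list)
        if s in closed:
            continue
        closed.add(s)
        for a in actions:
            if not a.pre <= s:
                continue
            t = (s - a.delete) | a.add
            if t not in parent:
                parent[t] = (s, a.name)
            if goal <= t:  # early goal detection, before evaluating h(t)
                return extract_plan(parent, t), len(h_cache)
            if t not in h_cache:
                if len(h_cache) >= max_evaluations:
                    return None, len(h_cache)  # evaluation budget exhausted
                h_cache[t] = h(t)
            heapq.heappush(open_list, (h_cache[t], next(tie), t))
    return None, len(h_cache)


def move(src: str, dst: str, cost: float) -> Action:
    return Action(f"move-{src}-{dst}", frozenset({f"at-{src}"}),
                  frozenset({f"at-{dst}"}), frozenset({f"at-{src}"}), cost)


goal = frozenset({"at-C"})
actions = [move("A", "B", 1.0), move("B", "C", 1.0), move("A", "C", 5.0)]
plan, evaluations = gbfs(frozenset({"at-A"}), goal, actions,
                         lambda s: len(goal - s))  # goal-count heuristic
```

On this toy problem GBFS returns the direct but expensive move as soon as the goal state is generated, which illustrates why early goal detection is only appropriate when plan quality is not a concern.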

Finally, Rivlin, Hazan, and Karpas (2019) proposed an unnamed extension of GBFS which performs a depth-first lookahead after a node is expanded. We call the search algorithm Greedy Best First Lookahead Search (GBFLS), defined in Algorithm 4. We perform the same early goal checking during the lookahead steps. Note that nodes are added to the OPEN list only when the current node is expanded; nodes that appear during the lookahead are not added. However, these nodes must be counted as evaluated nodes because they are subject to goal checking and because we evaluate their heuristic values. The lookahead has an artificial depth limit $D$ which is defined as $D = 5 \cdot h^{\mathrm{FF}}(I)$, i.e., 5 times the value of the FF heuristic at the initial state. When , the limit is set to 50, according to their code base.

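1:  Priority queue $\text{OPEN} \leftarrow \emptyset$.
2:  $s$.closed $\leftarrow \bot$ for all $s \in V$.
3:  $\text{OPEN.push}(I, h(I))$
4:  while $\text{OPEN} \neq \emptyset$ do
5:     $s \leftarrow \text{OPEN.pop}()$ ▷ State Expansion
6:     if $s$.closed $= \top$ then
7:        continue
8:     else
9:        $s$.closed $\leftarrow \top$
10:     for depth $< D$ do
11:        for $a \in \{a \in A \mid \textsc{pre}(a) \subseteq s\}$ do
12:           successor $t = a(s)$ ▷ Note: Appropriate caching of $h(t)$ is necessary.
13:           if $G \subseteq t$ then
14:              return $t$ ▷ Early goal detection.
15:           $\text{OPEN.push}(t, h(t))$ if depth $= 0$
16:        $s \leftarrow \arg\min_t h(t)$ ▷ Greedy lookahead step
Algorithm 4 GBFLS algorithm for a planning problem $\langle P, A, I, G \rangle$ with state space $(V, E)$.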
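One plausible reading of GBFLS is sketched below. It assumes that, after a node is expanded, the lookahead repeatedly moves to the successor with the smallest heuristic value until the depth limit is reached, that only successors generated at depth 0 enter the OPEN list, and that every state whose heuristic is computed counts as an evaluation. The `Action` encoding, the function name `gbfls`, and the chain-shaped toy problem are hypothetical, and details of the original code base may differ.

```python
import heapq
from itertools import count
from typing import Callable, Dict, FrozenSet, List, NamedTuple, Optional, Tuple

State = FrozenSet[str]


class Action(NamedTuple):
    name: str
    pre: FrozenSet[str]
    add: FrozenSet[str]
    delete: FrozenSet[str]


def gbfls(init: State, goal: FrozenSet[str], actions: List[Action],
          h: Callable[[State], float], depth_limit: int,
          max_evaluations: int = 100_000) -> Tuple[Optional[State], int]:
    """Return (goal state or None, number of heuristic evaluations)."""
    if goal <= init:
        return init, 0
    h_cache: Dict[State, float] = {init: h(init)}
    closed = set()
    tie = count()
    open_list = [(h_cache[init], next(tie), init)]
    while open_list:
        _, _, s = heapq.heappop(open_list)
        if s in closed:
            continue
        closed.add(s)
        current = s
        for depth in range(max(1, depth_limit)):
            best: Optional[State] = None
            for a in actions:
                if not a.pre <= current:
                    continue
                t = (current - a.delete) | a.add
                if goal <= t:  # early goal detection, also during lookahead
                    return t, len(h_cache)
                if t not in h_cache:
                    if len(h_cache) >= max_evaluations:
                        return None, len(h_cache)
                    h_cache[t] = h(t)  # lookahead nodes count as evaluated
                if depth == 0:         # only real successors enter OPEN
                    heapq.heappush(open_list, (h_cache[t], next(tie), t))
                if best is None or h_cache[t] < h_cache[best]:
                    best = t
            if best is None:
                break                  # dead end: stop the lookahead
            current = best
    return None, len(h_cache)


# Toy chain at-0 -> at-1 -> ... -> at-4 with an exact distance heuristic.
chain = [Action(f"step-{i}", frozenset({f"at-{i}"}),
                frozenset({f"at-{i + 1}"}), frozenset({f"at-{i}"}))
         for i in range(4)]


def distance(state: State) -> float:
    return 4 - max(int(p.split("-")[1]) for p in state)


goal_state, evaluations = gbfls(frozenset({"at-0"}), frozenset({"at-4"}),
                                chain, distance, depth_limit=10)
```

In this chain the goal is reached during the lookahead of the very first expansion, yet all four intermediate states are counted as evaluated, consistent with the accounting described above.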

A.4 Implementation

Our implementation combines the JAX auto-differentiation framework for neural networks (Bradbury et al., 2018) and pyperplan (Alkhazraji et al., 2020), which is used for parsing and for obtaining the heuristic values of $h^{\mathrm{add}}$ and $h^{\mathrm{FF}}$.

A.5 Generator Parameters

Table 3 contains a list of parameters used to generate the training and testing instances. Since generators tend to create identical instances, especially for small parameter values, we removed the duplicates by checking the md5 hash value of each file.
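Hash-based duplicate removal of this kind can be sketched as follows. The function name `deduplicate`, the file names, and the file contents are made up for illustration, and the exact procedure used for the experiments may differ.

```python
import hashlib
from typing import List, Mapping


def deduplicate(files: Mapping[str, bytes]) -> List[str]:
    """Keep the first file (in name order) for each distinct md5 digest."""
    seen = set()
    kept = []
    for name in sorted(files):
        digest = hashlib.md5(files[name]).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(name)
    return kept


# Hypothetical generator output: p02 is byte-for-byte identical to p01.
# For files on disk, the contents could be read with pathlib.Path.read_bytes().
instances = {
    "p01.pddl": b"(define (problem blocks-2) ...)",
    "p02.pddl": b"(define (problem blocks-2) ...)",
    "p03.pddl": b"(define (problem blocks-3) ...)",
}
unique_instances = deduplicate(instances)
```

Note that this only detects byte-identical files; two instances that differ in whitespace or in the problem name would be kept as distinct.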

| Domain | Parameters | Number of objects |
|---|---|---|
| blocks/train/ | 2-6 blocks x 50 seeds | 2-6 |
| blocks/test/ | 10,20,...,50 blocks x 50 seeds | 10-50 |
| ferry/train/ | 2-6 locations x 2-6 cars x 50 seeds | 4-7 |
| ferry/test/ | 10,15,...,30 locations and cars x 50 seeds | 20-60 |
| gripper/train/ | 2,4,...,10 balls x 50 seeds (initial/goal locations are randomized) | 6-14 |
| gripper/test/ | 20,40,...,60 balls x 50 seeds (initial/goal locations are randomized) | 24-64 |
| logistics/train/ | 1-3 airplanes x 1-3 cities x 1-3 city size x 1-3 packages x 10 seeds | 5-13 |
| logistics/test/ | 4-8 airplanes/cities/city size/packages x 50 seeds | 32-96 |
| satellite/train/ | 1-3 satellites x 1-3 instruments x 1-3 modes x 1-3 targets x 1-3 observations | 15-39 |
| satellite/test/ | 4-8 satellites/instruments/modes/targets/observations x 50 seeds | 69-246 |
| miconic/train/ | 2-4 floors x 2-4 passengers x 50 seeds | 8-12 |
| miconic/test/ | 10,20,30 floors x 10,20,30 passengers x 50 seeds | 24-64 |
| parking/train/ | 2-6 curbs x 2-6 cars x 50 seeds | 8-16 |
| parking/test/ | 10,15,...,25 curbs x 10,15,...,25 cars x 50 seeds | 24-54 |
| visitall/train/ | For , x grids, 0.5 or 1.0 goal ratio, blocked locations, 50 seeds | 8-22 |
| visitall/test/ | For , x grids, 0.5 or 1.0 goal ratio, blocked locations, 50 seeds | 32-58 |

Table 3: List of parameters used for generating the training and testing instances.

A.6 Hyperparameters

We trained our network with the following set of hyperparameters: maximum episode length , learning rate 0.001, discount rate , maximum intermediate arity , number of layers in satellite and logistics, while in all other domains, the number of features in each NLM layer , batch size 25, temperature for a policy function (Section 2.2), and the total number of SGD steps set to 50000, which determines the length of the training. We used for those two domains to address GPU memory usage: due to the size of the intermediate layer , NLM sometimes requires a large amount of GPU memory. Each training run takes about 4 to 6 hours, depending on the domain.

A.7 Preliminary Results on Compatible Domains

We performed a preliminary test on a variety of IPC classical domains that are supported by our implementation. The following domains worked without errors: barman-opt11-strips, blocks, depot, driverlog, elevators-opt11+sat11-strips, ferry, floortile-opt11-strips, freecell, gripper, hanoi, logistics00, miconic, mystery, nomystery-opt11-strips, parking-opt11+sat11-strips, pegsol-opt11-strips, pipesworld-notankage, pipesworld-tankage, rovers, satellite, scanalyzer-08-strips, sokoban-opt11-strips, tpp, transport-opt11+sat08-strips, visitall-opt11-strips, zenotravel.

A.8 Full Results

Figure 3 contains the full results of Figure 1 (Right).

Figure 3: The rate of successfully finding a solution ($y$-axis) for instances with a certain number of objects ($x$-axis). Learned heuristic functions outperform their original baselines used for reward shaping in most domains. Since the initial limit on the number of node evaluations is too permissive, we manually set a threshold for the number of node evaluations for each domain and filtered out the instances where the node evaluations exceeded this threshold. This filtering emphasizes the difference because otherwise both the learned and the baseline variants may have solved all instances.
Figure 4: Cumulative number of instances that are solved during the training, where the $x$-axis is the training step (part 1). Note that this may include solving the same instance multiple times.
Figure 5: Cumulative number of instances that are solved during the training, where the $x$-axis is the training step (part 2). Note that this may include solving the same instance multiple times.