1 Introduction
Deep reinforcement learning (RL) approaches have several strengths over conventional approaches to decision making problems, including compatibility with complex and unstructured observations, little dependency on handcrafted models, and some robustness to stochastic environments. However, they are notorious for their poor sample complexity; e.g., it may require billions of environment interactions to successfully learn a policy for Montezuma’s Revenge (Badia et al., 2020). This sample inefficiency prevents their application in environments where such an exhaustive set of interactions is physically or financially infeasible. The issue is amplified in domains with sparse rewards and long horizons, where the reward signals for success are difficult to obtain through random interactions with the environment.
In contrast, research in AI Planning and classical planning has been primarily driven by the identification of tractable fragments of originally PSPACE-complete planning problems (Bäckström and Klein, 1991; Bylander, 1994; Erol, Nau, and Subrahmanian, 1995; Jonsson and Bäckström, 1998a, b; Brafman and Domshlak, 2003; Katz and Domshlak, 2008a, b; Katz and Keyder, 2012), and the use of the cost of the tractable relaxed problem as domain-independent heuristic guidance for searching through the state space of the original problem (Hoffmann and Nebel, 2001; Domshlak, Hoffmann, and Katz, 2015; Keyder, Hoffmann, and Haslum, 2012). Contrary to RL approaches, classical planning has focused on long-horizon problems with solutions well over 1000 steps long (Jonsson, 2007; Asai and Fukunaga, 2015). Moreover, classical planning problems inherently have sparse rewards — the objective of classical planning is to produce a sequence of actions that achieves a goal. However, although domain-independence is a welcome advantage, domain-independent methods can be vastly outperformed by carefully engineered domain-specific methods such as a specialized solver for Sokoban (Junghanns and Schaeffer, 2000) due to the no-free-lunch theorem for search problems (Wolpert, Macready et al., 1995). Developing such domain-specific heuristics can require intensive engineering effort, with payoff only in that single domain. We are thus interested in developing domain-independent methods for learning domain-specific heuristics.
In this paper, we draw on the strengths of reinforcement learning and classical planning to propose an RL framework for learning to solve STRIPS planning problems. We propose to leverage classical heuristics, derivable automatically from the STRIPS model, to accelerate RL agents to learn a domain-specific neural network value function. The value function, in turn, improves over existing heuristics and accelerates search algorithms at evaluation time.
To operationalize this idea, we use potential-based reward shaping (Ng, Harada, and Russell, 1999), a well-known RL technique with guaranteed theoretical properties. A key insight in our approach is to see classical heuristic functions as providing dense rewards that greatly accelerate the learning process in three ways. First, they allow for efficient, informative exploration by initializing a good baseline reactive agent that quickly reaches a goal in each episode during training. Second, instead of learning the value function directly, we learn a residual on the heuristic value, making learning easier. Third, the learning agent receives a reward by reducing the estimated cost-to-go (heuristic value). This effectively mitigates the issue of sparse rewards by allowing the agent to receive positive rewards more frequently.
We implement our neural network value functions as Neural Logic Machines (Dong et al., 2019, NLM), a recently proposed neural network architecture that can directly process first-order logic (FOL) inputs, as are used in classical planning problems. An NLM takes a dataset expressed in grounded FOL representations and learns a set of (continuous relaxations of) lifted Horn rules. The main advantage of NLMs is that they structurally generalize across different numbers of terms, corresponding to objects in a STRIPS encoding. Therefore, we find that our learned value functions are able to generalize effectively to problem instances of arbitrary sizes in the same domain.
We provide experimental results that validate the effectiveness of the proposed approach in 8 domains from past IPC (International Planning Competition) benchmarks, providing detailed considerations on the reproducibility of the experiments. We find that our reward shaping approach achieves good sample efficiency compared to sparse-reward RL, and that the use of NLMs allows for generalization to novel problem instances. For example, our system learns from blocksworld instances with 2–6 objects, and the result enhances the performance of solving instances with up to 50 objects.
2 Background
We denote a multidimensional array in bold. $\mathbf{x};\mathbf{y}$ denotes a concatenation of tensors $\mathbf{x}$ and $\mathbf{y}$ in the last axis, where the rest of the dimensions are the same between $\mathbf{x}$ and $\mathbf{y}$. Functions (e.g., $\log$) are applied to arrays elementwise. Finally, we let $\mathbb{B}$ denote $\{0, 1\}$.

2.1 Classical Planning
We consider planning problems in the STRIPS subset of PDDL Fikes and Nilsson (1972), which for simplicity we refer to as lifted STRIPS. We denote such a planning problem as a 5-tuple $\langle O, P, A, I, G \rangle$: $O$ is a set of objects, $P$ is a set of predicates, and $A$ is a set of actions. We denote the arity of a predicate $p$ and an action $a$ as $\mathit{arity}(p)$ and $\mathit{arity}(a)$, and their parameters as, e.g., $p(x_1, \ldots, x_{\mathit{arity}(p)})$. We denote the set of predicates and actions instantiated on $O$ as $P(O)$ and $A(O)$, respectively, which is a union of Cartesian products of predicates/actions and their arguments, i.e., they represent the set of all ground propositions and actions. A state $s \subseteq P(O)$ is a set of propositions that are true in that state. An action $a \in A(O)$ is a 4-tuple $\langle \mathrm{PRE}(a), \mathrm{ADD}(a), \mathrm{DEL}(a), \mathrm{COST}(a) \rangle$, where $\mathrm{PRE}(a)$, $\mathrm{ADD}(a)$, and $\mathrm{DEL}(a)$ are preconditions, add-effects, and delete-effects, and $\mathrm{COST}(a)$ is a cost of taking the action $a$. In this paper, we primarily assume a unit-cost domain where $\mathrm{COST}(a) = 1$ for all $a$. Given a current state $s$, a ground action $a$ is applicable when $\mathrm{PRE}(a) \subseteq s$, and applying an action $a$ to $s$ yields a successor state $s' = (s \setminus \mathrm{DEL}(a)) \cup \mathrm{ADD}(a)$. Finally, $I$ and $G$ are the initial state and a goal condition, respectively. The task of classical planning is to find a plan $(a_1, \ldots, a_n)$ which satisfies $G \subseteq s_n$ and every action satisfies its preconditions at the time of using it. The machine representation of a state $s$ and the goal condition $G$ is a bit-vector of size $|P(O)|$, i.e., the $i$-th value of the vector is 1 when the corresponding
$i$-th proposition is in $s$, or $G$, respectively.
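As an illustration, the ground STRIPS semantics above (the applicability test and the successor computation) can be sketched in Python; the blocksworld-like proposition strings are hypothetical names, not tied to any particular PDDL file:

```python
from typing import FrozenSet, NamedTuple

class Action(NamedTuple):
    """A ground STRIPS action <PRE, ADD, DEL, COST>."""
    pre: FrozenSet[str]
    add: FrozenSet[str]
    delete: FrozenSet[str]
    cost: int = 1  # unit-cost domain

def applicable(state: FrozenSet[str], a: Action) -> bool:
    # a is applicable in s when PRE(a) is a subset of s
    return a.pre <= state

def apply(state: FrozenSet[str], a: Action) -> FrozenSet[str]:
    # successor s' = (s \ DEL(a)) | ADD(a)
    assert applicable(state, a)
    return (state - a.delete) | a.add

# hypothetical blocksworld-like example
s = frozenset({"on(a,b)", "clear(a)", "handempty()"})
unstack = Action(pre=frozenset({"on(a,b)", "clear(a)", "handempty()"}),
                 add=frozenset({"holding(a)", "clear(b)"}),
                 delete=frozenset({"on(a,b)", "clear(a)", "handempty()"}))
s2 = apply(s, unstack)
```

A state is just a set of true propositions, so applicability and progression reduce to set operations, which is also why states admit the bit-vector encoding described above.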
2.2 Markov Decision Processes

In general, RL methods address domains modeled as a discounted Markov decision process (MDP) $\langle S, A, T, R, p_0, \gamma \rangle$, where $S$ is a set of states, $A$ is a set of actions, $T(s, a, s')$ encodes the probability of transitioning from a state $s$ to a successor state $s'$ by an action $a$, $R(s, a, s')$ is a reward function, $p_0$ is a probability distribution over initial states, and $\gamma \in [0, 1]$ is a discount factor. In this paper, we restrict our attention to deterministic models because PDDL domains are deterministic, and we have a deterministic mapping $s' = T(s, a)$ (overloading $T$). Given a policy $\pi(a \mid s)$ representing a probability of performing an action $a$ in a state $s$, we define a sequence of random variables $S_t$, $A_t$, and $R_t$, representing states, actions, and rewards over time $t$. Our goal is to find a policy maximizing its long-term discounted cumulative rewards, formally defined as a value function $V^\pi(s) = \mathbb{E}_\pi[\sum_{t=0}^{\infty} \gamma^t R_t \mid S_0 = s]$. We also define an action-value function to be the value of executing a given action $a$ and subsequently following some policy $\pi$, i.e., $Q^\pi(s, a) = \mathbb{E}_\pi[\sum_{t=0}^{\infty} \gamma^t R_t \mid S_0 = s, A_0 = a]$. An optimal policy $\pi^*$ is a policy that achieves the optimal value function $V^*$, which satisfies $V^*(s) = V^{\pi^*}(s) \geq V^\pi(s)$ for all states and policies. $V^*$ satisfies Bellman’s equation:

$$V^*(s) = \max_a \left[ R(s, a, T(s, a)) + \gamma V^*(T(s, a)) \right] = \max_a Q^*(s, a), \quad (1)$$

where $Q^*$ is referred to as the optimal action-value function. We may omit $s'$ in $R(s, a, s')$, for clarity.
Finally, we can define a policy by mapping action-values in each state to a probability distribution over actions. For example, given an action-value function $Q(s, a)$, we can define a policy $\pi(a \mid s) \propto \exp(Q(s, a) / \tau)$, where $\tau$ is a temperature that controls the greediness of the policy. It returns a greedy policy when $\tau \to 0$, and approaches a uniform policy when $\tau \to \infty$.
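A minimal sketch of this softmax (Boltzmann) mapping from action-values to a policy; the action names and Q-values are purely illustrative:

```python
import math

def boltzmann_policy(q_values, tau):
    """Map action-values to a probability distribution over actions.

    q_values: dict mapping action -> Q(s, a); tau > 0 is the temperature.
    Small tau approaches the greedy policy; large tau approaches uniform.
    """
    m = max(q_values.values())  # subtract the max for numerical stability
    exps = {a: math.exp((q - m) / tau) for a, q in q_values.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

# hypothetical two-action state
probs = boltzmann_policy({"left": 1.0, "right": 2.0}, tau=0.5)
```

Subtracting the maximum before exponentiating leaves the distribution unchanged (the factor cancels in the normalization) while preventing overflow for small temperatures.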
2.3 Formulating Classical Planning as an MDP
There are two typical ways to formulate a classical planning problem as an MDP. In one strategy, given a transition $(s, a, s')$, one may assign a reward of 1 when $s'$ satisfies the goal condition $G$, and 0 otherwise (Rivlin, Hazan, and Karpas, 2019). In another strategy, one may assign a reward of 0 when $s'$ satisfies $G$, and $-1$ otherwise (or, more generally, $-\mathrm{COST}(a)$ in a non-unit-cost domain). In this paper we use the second, negative-reward model because it tends to induce more effective exploration in RL due to optimistic initial values (Sutton and Barto, 2018). Both cases are considered sparse reward problems because there is no information about whether one action sequence is better than another until a goal state is reached.
3 Bridging Deep RL and AI Planning
We consider a multitask learning setting with a training time and a test time (Fern, Khardon, and Tadepalli, 2011). During training, classical planning problems from a single domain are available. At test time, methods are evaluated on heldout problems from the same domain. The transition model (in PDDL form) is known at both training and test time.
Learning to improve planning has been considered in RL. For example, in AlphaGo (Silver et al., 2016), a value function was learned to provide heuristic guidance to Monte Carlo Tree Search (Kocsis and Szepesvári, 2006). Applying RL techniques in our classical planning setting, however, presents unique challenges.
(P1): Preconditions and dead-ends. In MDPs, a failure to perform an action is typically handled as a self-cycle to the current state in order to guarantee that the state transition probability is well-defined for all states. Another formulation augments the state space with an absorbing state with a highly negative reward. In contrast, classical planning does not handle nondeterministic outcomes (success and failure). Instead, actions are forbidden at a state when their preconditions are not satisfied, and a state is called a dead-end when no actions are applicable. In a self-cycle formulation, random interaction with the environment could be inefficient due to repeated attempts to perform inapplicable actions. Also, the second formulation requires assigning an ad-hoc amount of negative reward to an absorbing state, which is not appealing.
(P2): Objective functions. While the MDP framework itself does not necessarily assume discounting, the majority of RL applications (Schulman et al., 2015; Mnih et al., 2015, 2016; Lillicrap et al., 2016) aim to maximize the expected cumulative discounted rewards of trajectories. In contrast, classical planning tries to minimize the sum of costs (negative rewards) along trajectories, i.e., the cumulative undiscounted cost; thus, carrying the concepts of classical planning over to RL requires caution.
(P3): Input representations. While much of the deep RL literature assumes an unstructured (e.g., images in Atari) or a factored input representation (e.g., location and velocity in cartpole), classical planning deals with structured inputs based on FOL to perform domain- and problem-independent planning. This is problematic for typical neural networks, which assume a fixed-size input. Recently, several network architectures were proposed to achieve invariance to size and ordering, i.e., neural networks for set-like inputs (Ravanbakhsh, Schneider, and Poczos, 2016; Zaheer et al., 2017). Graph Neural Networks (Battaglia et al., 2018) have also been recently used to encode FOL inputs (Rivlin, Hazan, and Karpas, 2019; Shen, Trevizan, and Thiébaux, 2020; Ma et al., 2020). While the choice of the architecture is arbitrary, our network should be able to handle FOL inputs.
3.1 Value Iteration for Classical Planning
Our main approach will be to learn a value function that can be used as a heuristic to guide planning. To learn estimated value functions, we build on the value iteration (VI) algorithm (line 1, Algorithm 1), where a known model of the dynamics is used to incrementally update the estimate of the optimal value function $V^*$. The current estimate is updated by the r.h.s. of Eq. 1 until a fixpoint is reached.
In classical planning, however, state spaces are too large to enumerate all states (line 3), or to represent the value estimates in a tabular form (line 4).
To avoid the exhaustive enumeration of states in VI, Real-Time Dynamic Programming (Sutton and Barto, 2018, RTDP, line 5) samples a subset of the state space based on the current policy. In this work, we use on-policy RTDP, which replaces the maximization over actions with an expectation under the current policy $\pi$ (line 13), where $\pi$ is defined by the softmax of the current action-value estimates. On-policy methods are known to be more stable but can sometimes lead to slower convergence.
Next, to avoid representing the value estimates in an exhaustive table, we encode $V$ using a neural network parameterized by weights $\theta$, and apply the Bellman updates approximately with Stochastic Gradient Descent (line 13). We use experience replay (Lin, 1993; Mnih et al., 2015) to smooth out changes in the policy and reduce the correlation between updated states (lines 6–12). We store the history of states in a FIFO buffer $D$, and update $\theta$ with minibatches sampled from $D$ to leverage GPU-based parallelism.
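The replay mechanism just described can be sketched as a minimal FIFO buffer with uniform minibatch sampling (an illustration only; the capacity and class name are hypothetical):

```python
import random
from collections import deque

class ReplayBuffer:
    """FIFO buffer of visited states for approximate Bellman updates."""

    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)  # oldest entries are evicted first

    def push(self, state):
        self.buf.append(state)

    def sample(self, batch_size):
        # uniform minibatch; decorrelates consecutive on-policy states
        return random.sample(list(self.buf), min(batch_size, len(self.buf)))

buf = ReplayBuffer(capacity=5)
for i in range(10):
    buf.push(i)
batch = buf.sample(3)
```

Sampling uniformly from the buffer, rather than updating on the most recent transition, is what breaks the correlation between consecutive states along a trajectory.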
We modify RTDP to address (P1) in classical planning, resulting in line 15. First, in our multi-task setting, where goals vary between problem instances, we wish to learn a single goal-parameterized value function that generalizes across problems (Schaul et al., 2015). We omit the goal for notational concision, but all of our value functions are implicitly goal-parameterized, i.e., $V(s) = V(s; G)$.
Next, since larger problem instances typically require more steps to solve, states from these problems are likely to dominate the replay buffer. This can make updates to states from smaller problems rare, which can lead to catastrophic forgetting. To address this, we separate the buffer into buckets (line 22), where states in one bucket are from problem instances with the same number of objects. When we sample a minibatch, we randomly select a bucket and randomly select states from this bucket.
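A minimal sketch of the bucketed buffer described above; the class and key names are illustrative, with bucket keys given by the number of objects in the originating instance:

```python
import random
from collections import defaultdict, deque

class BucketedReplayBuffer:
    """One FIFO bucket per problem size (number of objects).

    Sampling first picks a bucket uniformly at random, then states within
    it, so states from small instances are not crowded out by the long
    trajectories generated on large instances.
    """

    def __init__(self, capacity_per_bucket):
        self.buckets = defaultdict(lambda: deque(maxlen=capacity_per_bucket))

    def push(self, num_objects, state):
        self.buckets[num_objects].append(state)

    def sample(self, batch_size):
        bucket = self.buckets[random.choice(list(self.buckets))]
        return [random.choice(bucket) for _ in range(batch_size)]

rb = BucketedReplayBuffer(capacity_per_bucket=1000)
for i in range(10):
    rb.push(3, ("small", i))   # states from a 3-object instance
    rb.push(50, ("large", i))  # states from a 50-object instance
batch = rb.sample(4)
```

Because the bucket is chosen before the states, each problem size receives updates at roughly the same rate regardless of how many states it contributed.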
Next, instead of terminating the inner loop and sampling an initial state in the same state space, we redefine the initial state distribution $p_0$ to be a distribution over problem instances, and select a new training instance and start from its initial state (line 18).
Finally, since action selection in RTDP is not possible at a state with no applicable actions (i.e., a dead-end), we reset the environment upon entering such a state (line 19). We also select actions only from applicable actions and do not treat an inapplicable action as a self-cycle (line 20). Indeed, training a value function along a trajectory that includes self-cycles has no benefit, because the test-time agents never execute them due to duplicate detection.
3.2 Planning Heuristics as Dense Rewards
The fundamental difficulty of applying RL-based approaches to classical planning is the lack of dense rewards to guide exploration. We address this by combining domain-independent heuristic functions with a technique called potential-based reward shaping. To correctly apply this technique, we must account for the difference between the discounted and non-discounted objectives (P2).
Potential-based reward shaping (Ng, Harada, and Russell, 1999) is a technique that helps RL algorithms by modifying the reward function $R$. Formally, given a potential function $\Phi(s)$, a function of states, we define a shaped reward function $F$ on transitions $(s, a, s')$ as follows:

$$F(s, a, s') = R(s, a, s') + \gamma \Phi(s') - \Phi(s). \quad (2)$$
Let $M'$ be an MDP with the shaped reward $F$, and $M$ be the original MDP. When the discount factor $\gamma < 1$, or when the MDP is proper, i.e., every policy eventually ($t \to \infty$) reaches a terminal state with probability 1, any optimal policy of $M'$ is an optimal policy of $M$ regardless of $\Phi$; thus RL converges to a policy optimal in the original MDP $M$. Also, the optimal value function under $M'$ satisfies

$$V^*_{M'}(s) = V^*_{M}(s) - \Phi(s). \quad (3)$$

In other words, an agent trained in $M'$ is learning an offset of the original optimal value function from the potential function. The potential function thus acts as prior knowledge about the environment, which initializes the value function to nonzero values (Wiewiora, 2003).
Building on this theoretical background, we propose to leverage existing domain-independent heuristics to define a potential function that guides the agent while it learns to solve a given domain. A naive approach that implements this idea is to define $\Phi(s) = -h(s)$. The value is negated because the MDP formulation seeks to maximize reward, while $h$ is an estimate of cost-to-go, which should be minimized. Note that the agent receives an additional reward when $\gamma \Phi(s') - \Phi(s)$ is positive (Eq. 2). When $\gamma = 1$, this means that approaching the goal and reducing $h$ is treated as a reward signal. Effectively, this allows us to use a domain-independent planning heuristic to generate dense rewards that aid the RL algorithm’s exploration.
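The naive shaping scheme above can be sketched as follows; the toy states and heuristic values are hypothetical, and `h` stands in for any domain-independent heuristic:

```python
def shaped_reward(R, gamma, phi, s, s2):
    """Potential-based shaping (Ng et al., 1999): F = R + gamma*phi(s') - phi(s)."""
    return R + gamma * phi(s2) - phi(s)

# naive potential: the negated heuristic estimate of cost-to-go
h = {"far": 2, "near": 1, "goal": 0}   # hypothetical cost-to-go estimates
phi = lambda s: -h[s]

# moving one step closer to the goal under the base reward R = -1 per step:
# r = -1 + 0.99 * (-1) - (-2) = 0.01 > 0, a small positive reward
r = shaped_reward(-1, 0.99, phi, "far", "near")
```

Progress toward the goal yields a positive shaped reward and regress yields a strongly negative one, which is exactly the dense signal the sparse-reward formulation lacks.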
However, this straightforward implementation has two issues. (1) First, when the problem contains a dead-end, the heuristic may return $h(s) = \infty$, i.e., $\Phi(s) = -\infty$. This causes a numerical error in gradient-based optimization. (2) Second, the value function still requires a correction even if $h$ is the “perfect” oracle heuristic $h^*$. Recall that $V^*$ is the optimal discounted value function with a reward of $-1$ per step. Given an optimal unit-cost cost-to-go $h^*(s)$ of a state $s$, the discounted value function and the non-discounted cost-to-go can be associated as follows:

$$V^*(s) = -\sum_{t=0}^{h^*(s)-1} \gamma^t = -\frac{1 - \gamma^{h^*(s)}}{1 - \gamma}. \quad (4)$$

Therefore, the amount of correction needed (i.e., $V^*_{M'}(s) = V^*(s) + h^*(s)$) is not zero even in the presence of an oracle $h^*$. This is a direct consequence of the difference in discounting.
To address these issues, we propose to use the discounted value of the heuristic function as a potential function. Recall that a heuristic function $h$ is an estimate of the cost-to-go from the current state to a goal. Since $h$ does not provide a concrete idea of how to reach a goal, we tend to treat it as a black box. An important realization, however, is that it nevertheless represents a sequence of actions; thus its value can be decomposed into a sum of action costs (below, left), and we define a corresponding discounted heuristic function (below, right):

$$h(s) = \sum_{t=0}^{h(s)-1} 1, \qquad h^\gamma(s) = \sum_{t=0}^{h(s)-1} \gamma^t = \frac{1 - \gamma^{h(s)}}{1 - \gamma}. \quad (5)$$

Notice that setting $\Phi(s) = -h^\gamma(s)$ with $h = h^*$ results in $V^*_{M'}(s) = V^*(s) + h^{*\gamma}(s) = 0$. Also, $h^\gamma$ is bounded within $[0, \frac{1}{1-\gamma})$, avoiding numerical issues.
3.3 Value Function Generalized over Size
To achieve the goal of learning domain-dependent, instance-independent heuristics, the neural value function used in the reward-shaping framework discussed above must be invariant to the number, the order, and the textual representation of propositions and objects in a PDDL definition (P3). We propose the use of Neural Logic Machine (Dong et al., 2019, NLM) layers, which were originally designed for supervised learning tasks over FOL inputs. We describe here how states and goals are encoded, and provide a summary of NLM layers in Appendix A.1.

NLMs act on binary arrays representing the presence of each proposition in a state. Propositions are grouped by the arity of the predicates they were grounded from. This forms a set of arrays, one per arity $r$, where the leading $r$ dimensions are indexed by objects and the last dimension is indexed by the predicates of arity $r$. For example, when we have three objects a, b, c and four binary predicates on, connected, above, and larger, we enumerate all combinations on(a,a), on(a,b), …, larger(c,c), resulting in an array of shape $3 \times 3 \times 4$. Similarly, we may have a $3 \times 2$ array for 2 unary predicates, and a $3 \times 3 \times 3 \times 5$ array for 5 ternary predicates. The total number of elements in all arrays combined matches the number of propositions $|P(O)|$.
To form the input to the NLMs, we concatenate these binary arrays representing the state and another set of binary arrays encoding the goal conditions, thus doubling the size of the last dimension. Once computed, these arrays can be used by NLMs without any additional processing.
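As an illustration, the encoding just described can be implemented as follows; the predicate and object names are hypothetical, and only two arities are shown:

```python
import numpy as np

def encode(objects, predicates_by_arity, true_props):
    """Group ground propositions into one binary array per predicate arity.

    predicates_by_arity: {arity: [predicate names]}; true_props: a set of
    tuples like ("on", "a", "b"). Returns {arity: array of shape
    (n, ..., n, num_predicates_of_that_arity)} for n objects.
    """
    n = len(objects)
    idx = {o: i for i, o in enumerate(objects)}
    arrays = {}
    for r, preds in predicates_by_arity.items():
        x = np.zeros((n,) * r + (len(preds),), dtype=np.int8)
        for k, p in enumerate(preds):
            for prop in true_props:
                if prop[0] == p:
                    x[tuple(idx[o] for o in prop[1:]) + (k,)] = 1
        arrays[r] = x
    return arrays

preds = {1: ["clear"], 2: ["on"]}
state = encode(["a", "b", "c"], preds, {("on", "a", "b"), ("clear", "a")})
goal = encode(["a", "b", "c"], preds, {("on", "b", "a")})
# the NLM input concatenates the state and goal encodings in the last axis:
nlm_input = {r: np.concatenate([state[r], goal[r]], axis=-1) for r in state}
```

Concatenating the goal arrays doubles the last dimension, matching the description above, while the object-indexed leading dimensions grow with the instance size, which is what lets the same NLM weights apply to any number of objects.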
4 Experimental Evaluation
Our objective is to see whether our RL agent can improve the efficiency of a Greedy Best-First Search (GBFS), a standard algorithm for solving satisficing planning problems, over a standard domain-independent heuristic. The efficiency is measured in terms of the number of node evaluations performed during search. In addition, we place an emphasis on generalization: We hope that NLMs are able to generalize from smaller training instances with fewer objects to instances with more objects.
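The evaluation-time search can be sketched as a minimal GBFS with duplicate detection and an evaluation budget (an illustration, not the exact implementation used in the experiments):

```python
import heapq
import itertools

def gbfs(initial, is_goal, successors, h, max_evals=100_000):
    """Greedy best-first search: always expand the open node with smallest h.

    successors(s) yields (action, next_state) pairs. Returns (plan, evals),
    with plan = None when the evaluation budget is exhausted.
    """
    tie = itertools.count()  # FIFO tie-breaking among equal h-values
    open_list = [(h(initial), next(tie), initial, [])]
    closed, evals = {initial}, 1
    while open_list and evals < max_evals:
        _, _, s, plan = heapq.heappop(open_list)
        if is_goal(s):
            return plan, evals
        for a, s2 in successors(s):
            if s2 not in closed:  # duplicate detection
                closed.add(s2)
                evals += 1
                heapq.heappush(open_list, (h(s2), next(tie), s2, plan + [a]))
    return None, evals

# toy chain of states 0 -> 1 -> ... -> 5 with a perfect heuristic h(s) = 5 - s
plan, evals = gbfs(0, lambda s: s == 5,
                   lambda s: [("step", s + 1)] if s < 5 else [],
                   lambda s: 5 - s)
```

Counting evaluations rather than wall-clock time is what makes the comparison below independent of how expensive each heuristic is to compute.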
We train our RL agent with rewards shaped by heuristics obtained from the pyperplan library (Alkhazraji et al., 2020). We use the blind heuristic to denote a baseline with no shaping. While our program is compatible with a wide range of unit-cost IPC domains (see the list of 25 domains in Appendix A.7), we focus on extensively testing a selected subset of domains with a large enough number of independently trained models with different random seeds (20) to produce high-confidence results. This is because RL algorithms tend to have a large amount of variance in their outcomes (Henderson et al., 2018), induced by sensitivity to initialization, randomization in exploration, and randomization in experience replay.

We trained our system on five classical planning domains: 4-ops blocksworld, ferry, gripper, logistics, and satellite, as well as three additional IPC domains: miconic, parking, and visitall. In all domains, we generated problem instances using existing parameterized generators (Fawcett et al., 2011); see Appendix Table 3 for the list of generator parameters.
For each domain, we provided between 195 and 500 instances for training, and between 250 and 700 instances for testing. Each agent is trained for 50,000 steps, which takes about 4 to 6 hours on a Xeon E5-2600 v4 and a Tesla K80. All hyperparameters can be found in Appendix A.6.

Once the training was done, we evaluated the learned heuristics within GBFS on the test instances. Instead of setting time or memory limits, we limited the maximum number of node evaluations in GBFS to 100,000. If a problem was solved within the evaluation bound, the configuration gets a score of 1 for that instance; otherwise it gets 0. The sum of the scores over the test instances of each domain is called the coverage in that domain. Table 1 shows the coverage in each of the tested domains, comparing our configurations to the baseline ones, as well as to those of prior work (referred to in Section 4.1). The baseline configurations are denoted by their heuristic $h$ (i.e., GBFS with that heuristic), while our learned heuristic functions, obtained by training with reward shaping on a base heuristic, are denoted with the capital $H$. Additionally, Figure 1 goes beyond pure coverage and compares the node evaluations to the baseline. These results answer the following questions:
(Q1) Do our agents learn heuristic functions at all (green dots in Figure 1)? Here $h^{\mathrm{blind}}$ is equivalent to breadth-first search with duplicate detection, and $H^{\mathrm{blind}}$ is baseline RL without reward shaping. With the exception of visitall and miconic, $h^{\mathrm{blind}}$ could not solve any instances in the test set, while using the heuristics learned without shaping ($H^{\mathrm{blind}}$) significantly improved coverage in 5 of the 6 domains.
(Q2) Do they improve over the baseline heuristics they were initialized with, i.e., is $H$ better than $h$? Table 1 suggests that the reward-shaping-based training has successfully improved the coverage upon the baseline heuristics in some domains (blocks, ferry). Moreover, even where no significant improvements are observed in the coverage, Figure 1 shows that the search effort is significantly reduced (ferry, gripper, miconic, visitall). However, the effect tends to be negative on logistics, or when the baseline is already quite effective (although shaping still improves upon the baseline in blocks). In such cases, there is little room to improve upon the baseline, and thus the high randomness of reinforcement learning may harm the performance.
(Q3) Do our agents with reward shaping outperform our agents without shaping? According to Table 1, the shaped configurations outperform $H^{\mathrm{blind}}$. Notice that the corresponding base heuristics also outperform $h^{\mathrm{blind}}$. This suggests that the informativeness of the base heuristic used for reward shaping affects the quality of the learned heuristic. This matches the theoretical expectation: the potential function plays the role of domain knowledge that initializes the policy.
(Q4) Can the improvement be explained by accelerated exploration during training? Table 2 shows the total number of goals reached by the agent during training, indicating that reward shaping indeed helps the agent reach goals more often than no reward shaping ($H^{\mathrm{blind}}$). See Appendix Figures 4–5 for cumulative plots.
(Q5) Do the heuristics obtained by our value function implemented with NLM layers maintain their improvement in larger problem instances, i.e., do they generalize to a larger number of objects? Figure 1 (right) plots the number of objects (x-axis) against the ratio of success (y-axis) over blocks instances. The agents are trained on instances with 2–6 objects while evaluated on instances with 10–50 objects. It shows that the heuristic accuracy is improved in instances whose size far exceeds that of the training instances. Due to space limitations, plots for the remaining domains are in Appendix Figure 3.
domain      no shaping ($H^{\mathrm{blind}}$)      with shaping (base heuristic 1)      with shaping (base heuristic 2)
blocks      4691 ± 102      4808 ± 93      5089 ± 71
ferry       4981 ± 183      5598 ± 46      5530 ± 70
gripper     2456 ± 190      3856 ± 40      3482 ± 125
logistics   3475 ± 219      5059 ± 155     5046 ± 134
miconic     3568 ± 26       3794 ± 22      3808 ± 25
parking     3469 ± 509      4763 ± 80      4716 ± 56
satellite   3292 ± 200      4388 ± 80      4387 ± 54
visitall    1512 ± 91       1360 ± 75      2063 ± 53
Table 2: The cumulative number of goal states the agent has reached during training. The numbers are the average and standard deviation over 20 seeds. Best numbers among heuristics are highlighted in bold, with ties equally highlighted when there are no statistically significant differences between them under Wilcoxon’s rank-sum test. The results indicate that reward shaping significantly accelerates the exploration compared to no shaping ($H^{\mathrm{blind}}$).

4.1 Comparison with Previous Work
Next, we compared our learned heuristics with two recent state-of-the-art learned heuristics. The first approach, STRIPS-HGN (Shen, Trevizan, and Thiébaux, 2020), is a supervised learning method that learns a heuristic function using hypergraph networks (HGN), which generalize Graph Neural Networks (GNNs) (Battaglia et al., 2018; Scarselli et al., 2009). Due to its hypergraph representation, it is able to learn domain-dependent as well as domain-independent heuristics, depending on the dataset. The authors provided us with pretrained weights for three domains in the domain-dependent setting: gripper, ferry, and blocksworld.
While STRIPS-HGN was originally used with $A^*$, for a fairer comparison to our methods we use it with GBFS, since we do not consider plan quality in this work. We denote the resulting method by GBFS-HGN. As with the previous methods, we do not limit time or memory, bounding the number of evaluated nodes instead.
The second approach we compare to is GBFS-GNN (Rivlin, Hazan, and Karpas, 2019), an RL-based heuristic learning method that trains a GNN-based value function. The authors use Proximal Policy Optimization (Schulman et al., 2017), a state-of-the-art RL method that stabilizes the training by limiting the amount of policy change in each step (the updated policy stays in the proximity of the previous policy). The value function is a GNN optionally equipped with attention (Veličković et al., 2018; Vaswani et al., 2017). In addition, the authors proposed to adjust the value estimate of a successor state by the policy and its entropy, and to use the result as the heuristic value of that state. We call this an adjusted value function.
The authors also proposed a variant of GBFS which launches a greedy lookahead guided by the heuristics after each expansion, similar to PROBE (Lipovetzky and Geffner, 2011), Jasper (Xie, Müller, and Holte, 2014), Mercury (Katz and Hoffmann, 2014), or best-first search with lookahead (Stern et al., 2010). We distinguish their algorithmic improvement from their heuristic improvement by naming their search algorithm Greedy Best-First Lookahead Search (GBFLS). Our formal rendition of GBFLS can be found in Appendix A.3.
We counted the number of test instances that are solved by these approaches within 100,000 node evaluations. In the case of GBFLS, the evaluations also include the nodes that appear during the lookahead. We evaluated GBFS-HGN on the domains where pretrained weights are available. For GBFS-GNN, we obtained the source code from the authors (private communication) and minimally modified it to train on the same training instances that we used for our approach. We evaluated 4 variants of GBFS-GNN: GBFS-H, GBFS-V, GBFLS-H, and GBFLS-V, where “H” denotes the adjusted value function, and “V” denotes the original value function. Note that the fair evaluation should compare our method with GBFS-H/V, not GBFLS-H/V.
Table 1 shows the results. We first observe that a large part of the success of GBFS-GNN should be attributed to the lookahead extension of GBFS: GBFLS-V scores far above GBFS-H and GBFS-V, i.e., GBFLS-V performs very well even with a poor heuristic. While we report the coverage for both GBFLS-H/V and GBFS-H/V, the configurations that are comparable to our setting are GBFS-H/V. First, note that GBFS-HGN is significantly outperformed by all other methods. Comparing to the other two, both of our shaped configurations outperform GBFS-H in 7 out of the 8 domains, losing only on blocks. It is worth noting that even $H^{\mathrm{blind}}$ outperforms GBFS-H in miconic, satellite, and visitall, losing only on gripper. Since both $H^{\mathrm{blind}}$ and GBFS-H are trained without reward shaping, the difference is due to the network architecture (NLM vs. GNN) and the training algorithm (modified RTDP vs. PPO).
5 Related Work
Early attempts to learn heuristic functions include applying shallow, fully connected neural networks to puzzle domains (Arfaee, Zilles, and Holte, 2010, 2011), its online version (Thayer, Dionne, and Ruml, 2011), combining SVMs (Cortes and Vapnik, 1995) and NNs (Satzger and Kramer, 2013), learning a residual on planning heuristics similar to ours (Yoon, Fern, and Givan, 2006, 2008), or learning a relative ranking between states instead of absolute values (Garrett, Kaelbling, and Lozano-Pérez, 2016). More recently, Ferber, Helmert, and Hoffmann (2020) tested fully connected layers in modern frameworks. ASNet (Toyer et al., 2018) learns domain-dependent heuristics using a network that is similar to GNNs. These approaches are based on supervised learning methods that require a high-quality training dataset (accurate goal-distance estimates of states) prepared separately. Unlike supervised methods that depend on high-quality data, our RL-based approach must explore the environment by itself to collect useful data, which is automated but may not be sample-efficient.
Other RL-based approaches include Policy Gradient with FF to accelerate exploration for probabilistic PDDL (Buffet, Aberdeen et al., 2007), and PPO-based Meta-RL (Duan et al., 2016) for PDDL3.1 discrete-continuous hybrid domains (Gutierrez and Leonetti, 2021). These approaches do not use reward shaping to improve heuristics, thus our contributions are orthogonal.
Grounds and Kudenko (2005) combined RL and STRIPS planning with reward shaping, but in a significantly different setting: They treat 2D navigation as a two-tier hierarchical problem where unmodified FF (Hoffmann and Nebel, 2001) or Fast Downward (Helmert, 2006) is used as the high-level planner, then their plans are used to shape the rewards for the low-level RL agent. They do not train the high-level planner.
6 Conclusion
In this paper, we proposed a domain-independent reinforcement learning framework for learning domain-specific heuristic functions. Unlike existing work on applying policy gradient to planning (Rivlin, Hazan, and Karpas, 2019), we based our algorithm on value iteration. We addressed the difficulty of training an RL agent with sparse rewards using a novel reward-shaping technique which leverages existing heuristics developed in the literature. We showed that our framework not only learns a heuristic function from scratch ($H^{\mathrm{blind}}$), but also learns better if aided by heuristic functions (reward shaping). Furthermore, the learned heuristics continue to outperform the baseline over a wide range of problem sizes, demonstrating generalization over the number of objects in the environment.
References
 Alkhazraji et al. (2020) Alkhazraji, Y.; Frorath, M.; Grützner, M.; Helmert, M.; Liebetraut, T.; Mattmüller, R.; Ortlieb, M.; Seipp, J.; Springenberg, T.; Stahl, P.; and Wülfing, J. 2020. Pyperplan.
 Arfaee, Zilles, and Holte (2010) Arfaee, S. J.; Zilles, S.; and Holte, R. C. 2010. Bootstrap Learning of Heuristic Functions. In Felner, A.; and Sturtevant, N. R., eds., Proc. of Annual Symposium on Combinatorial Search. AAAI Press.
 Arfaee, Zilles, and Holte (2011) Arfaee, S. J.; Zilles, S.; and Holte, R. C. 2011. Learning Heuristic Functions for Large State Spaces. Artificial Intelligence, 175(16–17): 2075–2098.
 Asai and Fukunaga (2015) Asai, M.; and Fukunaga, A. 2015. Solving LargeScale Planning Problems by Decomposition and Macro Generation. In Proc. of the International Conference on Automated Planning and Scheduling (ICAPS). Jerusalem, Israel.
 Asai and Fukunaga (2017) Asai, M.; and Fukunaga, A. 2017. Tie-Breaking Strategies for Cost-Optimal Best First Search. J. Artif. Intell. Res.(JAIR), 58: 67–121.
 Bäckström and Klein (1991) Bäckström, C.; and Klein, I. 1991. Planning in Polynomial Time: The SAS-PUBS Class. Computational Intelligence, 7(3): 181–197.

Badia et al. (2020) Badia, A. P.; Piot, B.; Kapturowski, S.; Sprechmann, P.; Vitvitskyi, A.; Guo, Z. D.; and Blundell, C. 2020. Agent57: Outperforming the Atari Human Benchmark. In Proc. of the International Conference on Machine Learning, 507–517. PMLR.
 Battaglia et al. (2018) Battaglia, P. W.; Hamrick, J. B.; Bapst, V.; Sanchez-Gonzalez, A.; Zambaldi, V.; Malinowski, M.; Tacchetti, A.; Raposo, D.; Santoro, A.; Faulkner, R.; et al. 2018. Relational Inductive Biases, Deep Learning, and Graph Networks. arXiv preprint arXiv:1806.01261.
 Bradbury et al. (2018) Bradbury, J.; Frostig, R.; Hawkins, P.; Johnson, M. J.; Leary, C.; Maclaurin, D.; Necula, G.; Paszke, A.; VanderPlas, J.; WandermanMilne, S.; and Zhang, Q. 2018. JAX: Composable Transformations of Python+NumPy Programs.
 Brafman and Domshlak (2003) Brafman, R. I.; and Domshlak, C. 2003. Structure and Complexity in Planning with Unary Operators. J. Artif. Intell. Res.(JAIR), 18: 315–349.
 Buffet, Aberdeen et al. (2007) Buffet, O.; Aberdeen, D.; et al. 2007. FF+FPG: Guiding a Policy-Gradient Planner. In Proc. of the International Conference on Automated Planning and Scheduling (ICAPS), 42–48.
 Burns et al. (2012) Burns, E. A.; Hatem, M.; Leighton, M. J.; and Ruml, W. 2012. Implementing Fast Heuristic Search Code. In Proc. of Annual Symposium on Combinatorial Search.
 Bylander (1994) Bylander, T. 1994. The Computational Complexity of Propositional STRIPS Planning. Artificial Intelligence, 69(1): 165–204.
 Cortes and Vapnik (1995) Cortes, C.; and Vapnik, V. 1995. Support-Vector Networks. Machine Learning, 20(3): 273–297.
 Domshlak, Hoffmann, and Katz (2015) Domshlak, C.; Hoffmann, J.; and Katz, M. 2015. Red-Black Planning: A New Systematic Approach to Partial Delete Relaxation. Artificial Intelligence, 221: 73–114.
 Dong et al. (2019) Dong, H.; Mao, J.; Lin, T.; Wang, C.; Li, L.; and Zhou, D. 2019. Neural Logic Machines. In Proc. of the International Conference on Learning Representations.
 Duan et al. (2016) Duan, Y.; Schulman, J.; Chen, X.; Bartlett, P. L.; Sutskever, I.; and Abbeel, P. 2016. RL²: Fast Reinforcement Learning via Slow Reinforcement Learning. CoRR, abs/1611.02779.
 Erol, Nau, and Subrahmanian (1995) Erol, K.; Nau, D. S.; and Subrahmanian, V. S. 1995. Complexity, Decidability and Undecidability Results for DomainIndependent Planning. Artificial Intelligence, 76(1–2): 75–88.
 Fawcett et al. (2011) Fawcett, C.; Helmert, M.; Hoos, H.; Karpas, E.; Röger, G.; and Seipp, J. 2011. FD-Autotune: Domain-Specific Configuration using Fast Downward. In ICAPS 2011 Workshop on Planning and Learning, 13–17.
 Ferber, Helmert, and Hoffmann (2020) Ferber, P.; Helmert, M.; and Hoffmann, J. 2020. Neural Network Heuristics for Classical Planning: A Study of Hyperparameter Space. In Proc. of European Conference on Artificial Intelligence, 2346–2353.
 Fern, Khardon, and Tadepalli (2011) Fern, A.; Khardon, R.; and Tadepalli, P. 2011. The First Learning Track of the International Planning Competition. Machine Learning, 84(1–2): 81–107.
 Fikes and Nilsson (1972) Fikes, R. E.; and Nilsson, N. J. 1972. STRIPS: A New Approach to the Application of Theorem Proving to Problem Solving. Artificial Intelligence, 2(3): 189–208.
 Garrett, Kaelbling, and Lozano-Pérez (2016) Garrett, C. R.; Kaelbling, L. P.; and Lozano-Pérez, T. 2016. Learning to Rank for Synthesizing Planning Heuristics. In Proc. of International Joint Conference on Artificial Intelligence (IJCAI), 3089–3095.
 Grounds and Kudenko (2005) Grounds, M.; and Kudenko, D. 2005. Combining Reinforcement Learning with Symbolic Planning. In Adaptive Agents and Multi-Agent Systems III. Adaptation and Multi-Agent Learning, 75–86. Springer.
 Gutierrez and Leonetti (2021) Gutierrez, R. L.; and Leonetti, M. 2021. Meta Reinforcement Learning for Heuristic Planning. In Proc. of the International Conference on Automated Planning and Scheduling (ICAPS), volume 31, 551–559.
 Hart, Nilsson, and Raphael (1968) Hart, P. E.; Nilsson, N. J.; and Raphael, B. 1968. A Formal Basis for the Heuristic Determination of Minimum Cost Paths. Systems Science and Cybernetics, IEEE Transactions on, 4(2): 100–107.

He et al. (2016) He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep Residual Learning for Image Recognition. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
 Helmert (2006) Helmert, M. 2006. The Fast Downward Planning System. J. Artif. Intell. Res.(JAIR), 26: 191–246.
 Henderson et al. (2018) Henderson, P.; Islam, R.; Bachman, P.; Pineau, J.; Precup, D.; and Meger, D. 2018. Deep Reinforcement Learning that Matters. In Proc. of AAAI Conference on Artificial Intelligence, volume 32.
 Hoffmann and Nebel (2001) Hoffmann, J.; and Nebel, B. 2001. The FF Planning System: Fast Plan Generation through Heuristic Search. J. Artif. Intell. Res.(JAIR), 14: 253–302.
 Jonsson (2007) Jonsson, A. 2007. The Role of Macros in Tractable Planning over Causal Graphs. In Proc. of International Joint Conference on Artificial Intelligence (IJCAI).
 Jonsson and Bäckström (1998a) Jonsson, P.; and Bäckström, C. 1998a. StateVariable Planning under Structural Restrictions: Algorithms and Complexity. Artificial Intelligence, 100(1–2): 125–176.
 Jonsson and Bäckström (1998b) Jonsson, P.; and Bäckström, C. 1998b. Tractable Plan Existence Does Not Imply Tractable Plan Generation. Annals of Mathematics and Artificial Intelligence, 22(3–4): 281–296.
 Junghanns and Schaeffer (2000) Junghanns, A.; and Schaeffer, J. 2000. Sokoban: A CaseStudy in the Application of Domain Knowledge in General Search Enhancements to Increase Efficiency in SingleAgent Search. Artificial Intelligence.
 Katz and Domshlak (2008a) Katz, M.; and Domshlak, C. 2008a. New Islands of Tractability of CostOptimal Planning. J. Artif. Intell. Res.(JAIR), 32: 203–288.
 Katz and Domshlak (2008b) Katz, M.; and Domshlak, C. 2008b. Structural Patterns Heuristics via Fork Decomposition. In Proc. of the International Conference on Automated Planning and Scheduling (ICAPS), 182–189.
 Katz and Hoffmann (2014) Katz, M.; and Hoffmann, J. 2014. Mercury Planner: Pushing the Limits of Partial Delete Relaxation. In Eighth International Planning Competition (IPC8): planner abstracts, 43–47.
 Katz and Keyder (2012) Katz, M.; and Keyder, E. 2012. Structural Patterns Beyond Forks: Extending the Complexity Boundaries of Classical Planning. In Proc. of AAAI Conference on Artificial Intelligence, 1779–1785.
 Keyder, Hoffmann, and Haslum (2012) Keyder, E.; Hoffmann, J.; and Haslum, P. 2012. SemiRelaxed Plan Heuristics. In Proc. of the International Conference on Automated Planning and Scheduling (ICAPS), 128–136.
 Kocsis and Szepesvári (2006) Kocsis, L.; and Szepesvári, C. 2006. Bandit Based Monte-Carlo Planning. In Proc. of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 282–293. Springer.
 Lillicrap et al. (2016) Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2016. Continuous Control with Deep Reinforcement Learning. In Proc. of the International Conference on Learning Representations.
 Lin (1993) Lin, L.-J. 1993. Reinforcement Learning for Robots using Neural Networks. Technical report, Carnegie Mellon University, Pittsburgh, PA, School of Computer Science.
 Lipovetzky and Geffner (2011) Lipovetzky, N.; and Geffner, H. 2011. Searching for Plans with Carefully Designed Probes. In Proc. of the International Conference on Automated Planning and Scheduling (ICAPS).
 Ma et al. (2020) Ma, T.; Ferber, P.; Huo, S.; Chen, J.; and Katz, M. 2020. Online Planner Selection with Graph Neural Networks and Adaptive Scheduling. In Proc. of AAAI Conference on Artificial Intelligence, volume 34, 5077–5084.
 Mnih et al. (2016) Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous Methods for Deep Reinforcement Learning. In Proc. of the International Conference on Machine Learning, 1928–1937. PMLR.
 Mnih et al. (2015) Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. HumanLevel Control through Deep Reinforcement Learning. Nature, 518(7540): 529–533.

Muggleton (1991) Muggleton, S. 1991. Inductive Logic Programming. New Generation Computing, 8(4): 295–318.
 Ng, Harada, and Russell (1999) Ng, A. Y.; Harada, D.; and Russell, S. 1999. Policy Invariance under Reward Transformations: Theory and Application to Reward Shaping. In Proc. of the International Conference on Machine Learning, volume 99, 278–287.
 Pohl (1970) Pohl, I. 1970. Heuristic Search Viewed as Path Finding in a Graph. Artificial Intelligence, 1(3–4): 193–204.
 Ravanbakhsh, Schneider, and Poczos (2016) Ravanbakhsh, S.; Schneider, J.; and Poczos, B. 2016. Deep Learning with Sets and Point Clouds. arXiv preprint arXiv:1611.04500.
 Reiter (1981) Reiter, R. 1981. On Closed World Data Bases. In Readings in Artificial Intelligence, 119–140. Elsevier.
 Rivlin, Hazan, and Karpas (2019) Rivlin, O.; Hazan, T.; and Karpas, E. 2019. Generalized Planning With Deep Reinforcement Learning. In Proc. of the ICAPS Workshop on Bridging the Gap Between AI Planning and Reinforcement Learning (PRL).
 Russell et al. (1995) Russell, S. J.; Norvig, P.; Canny, J. F.; Malik, J. M.; and Edwards, D. D. 1995. Artificial Intelligence: A Modern Approach, volume 2. Prentice hall Englewood Cliffs.

Satzger and Kramer (2013) Satzger, B.; and Kramer, O. 2013. Goal Distance Estimation for Automated Planning using Neural Networks and Support Vector Machines. Natural Computing, 12(1): 87–100.
 Scarselli et al. (2009) Scarselli, F.; Gori, M.; Tsoi, A. C.; Hagenbuchner, M.; and Monfardini, G. 2009. The Graph Neural Network Model. IEEE Transactions on Neural Networks, 20(1): 61–80.
 Schaul et al. (2015) Schaul, T.; Horgan, D.; Gregor, K.; and Silver, D. 2015. Universal Value Function Approximators. In Proc. of the International Conference on Machine Learning, 1312–1320. PMLR.
 Schulman et al. (2015) Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; and Moritz, P. 2015. Trust Region Policy Optimization. In Proc. of the International Conference on Machine Learning, 1889–1897. PMLR.
 Schulman et al. (2017) Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.
 Shen, Trevizan, and Thiébaux (2020) Shen, W.; Trevizan, F.; and Thiébaux, S. 2020. Learning DomainIndependent Planning Heuristics with Hypergraph Networks. In Proc. of the International Conference on Automated Planning and Scheduling (ICAPS), volume 30, 574–584.
 Silver et al. (2016) Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. 2016. Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature, 529(7587): 484–489.
 Stern et al. (2010) Stern, R.; Kulberis, T.; Felner, A.; and Holte, R. 2010. Using Lookaheads with Optimal BestFirst Search. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 24.
 Sutton and Barto (2018) Sutton, R. S.; and Barto, A. G. 2018. Reinforcement Learning: An Introduction. MIT press.
 Thayer, Dionne, and Ruml (2011) Thayer, J.; Dionne, A.; and Ruml, W. 2011. Learning Inadmissible Heuristics during Search. In Proc. of the International Conference on Automated Planning and Scheduling (ICAPS), volume 21.

Toyer et al. (2018) Toyer, S.; Trevizan, F.; Thiébaux, S.; and Xie, L. 2018. Action Schema Networks: Generalised Policies with Deep Learning. In Proc. of AAAI Conference on Artificial Intelligence, volume 32.
 Vaswani et al. (2017) Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is All You Need. In Advances in Neural Information Processing Systems, 5998–6008.
 Veličković et al. (2018) Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; and Bengio, Y. 2018. Graph Attention Networks. In Proc. of the International Conference on Learning Representations.
 Wiewiora (2003) Wiewiora, E. 2003. Potential-Based Shaping and Q-value Initialization are Equivalent. J. Artif. Intell. Res.(JAIR), 19: 205–208.
 Wolpert, Macready et al. (1995) Wolpert, D. H.; Macready, W. G.; et al. 1995. No Free Lunch Theorems for Search. Technical report, SFI-TR-95-02-010, Santa Fe Institute.
 Xie, Müller, and Holte (2014) Xie, F.; Müller, M.; and Holte, R. 2014. Jasper: The Art of Exploration in Greedy Best-First Search. In Proc. of the International Planning Competition, 39–42.
 Yoon, Fern, and Givan (2008) Yoon, S.; Fern, A.; and Givan, R. 2008. Learning Control Knowledge for Forward Search Planning. Journal of Machine Learning Research, 9(4).
 Yoon, Fern, and Givan (2006) Yoon, S. W.; Fern, A.; and Givan, R. 2006. Learning Heuristic Functions from Relaxed Plans. In Proc. of the International Conference on Automated Planning and Scheduling (ICAPS), volume 2, 3.
 Zaheer et al. (2017) Zaheer, M.; Kottur, S.; Ravanbakhsh, S.; Poczos, B.; Salakhutdinov, R. R.; and Smola, A. J. 2017. Deep Sets. In Advances in Neural Information Processing Systems, 3391–3401.
Appendix A
A.1 Neural Logic Machines
The NLM (Dong et al., 2019) is a neural Inductive Logic Programming (ILP) (Muggleton, 1991) system based on first-order logic (FOL) and the Closed-World Assumption (Reiter, 1981). An NLM represents a set of continuous relaxations of Horn rules as a set of weights in a neural network and is able to infer the truthfulness of some target formulae as a probability. For example, in Blocksworld, based on an input such as on(a, b) for blocks a and b, an NLM may be trained to infer the clear predicate by learning a quantified formula such as clear(x) ⟺ ∀y. ¬on(y, x).
An NLM takes a boolean Multi-Arity Predicate Representation (MAPR) of the propositional groundings of FOL statements. Assume that we need to represent FOL statements combining predicates of different arities. We denote a set of predicates of arity n as p/n (Prolog notation), and we store the truth values of all of its propositional groundings in a boolean tensor with n object axes and one predicate axis. A MAPR is a tuple of such tensors, one for each arity from 0 up to the largest arity. For example, when we have objects a, b, c and four binary predicates on, connected, above, and larger, we enumerate all combinations on(a,a), on(a,b), …, larger(c,c), resulting in an array of shape 3 × 3 × 4. Similarly, we may have a 3 × 2 array for 2 unary predicates and a 3 × 3 × 3 × 5 array for 5 ternary predicates.
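The MAPR for this example can be sketched with boolean NumPy tensors. This is an illustrative sketch, not the paper's implementation; the unary predicate names are invented for the example.

```python
import numpy as np

# Objects and binary predicates from the example above; the unary
# predicates are hypothetical placeholders.
objects = ["a", "b", "c"]
unary = ["clear", "ontable"]                      # 2 unary predicates
binary = ["on", "connected", "above", "larger"]   # 4 binary predicates

O = len(objects)
idx = {o: i for i, o in enumerate(objects)}

# One boolean tensor per arity: n object axes plus one predicate axis.
T1 = np.zeros((O, len(unary)), dtype=bool)        # shape (3, 2)
T2 = np.zeros((O, O, len(binary)), dtype=bool)    # shape (3, 3, 4)

# Asserting the ground atom on(a, b) flips a single entry of T2.
T2[idx["a"], idx["b"], binary.index("on")] = True
mapr = (T1, T2)   # a MAPR restricted to arities 1 and 2
```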
The NLM is designed to learn a class of FOL rules with the following restrictions: every rule is a Horn rule, no rule contains function terms (such as a function that returns an object), there is no recursion, and all rules are applied between neighboring arities. Due to the lack of recursion, the set of rules can be stratified into layers, where each stratum contains the intermediate conclusions derivable from the strata below it. Under these assumptions, the following set of rules is sufficient for representing any such rules (Dong et al., 2019):
(expand) q(x₁, …, x_r, x_{r+1}) ← p(x₁, …, x_r)
(reduce) q(x₁, …, x_{r−1}) ← ∃x_r. p(x₁, …, x_r)
(compose) q(x₁, …, x_r) ← F(p₁(π₁(x₁, …, x_r)), …, p_k(π_k(x₁, …, x_r)))
Here, p, q, and p_i are predicates, x₁, …, x_r is a sequence of parameters, and F is a formula consisting of logical operations (∧, ∨, ¬) over the terms p_i(π_i(x₁, …, x_r)). The intermediate predicates produced by expand and reduce have one more / one fewer parameter than p, respectively; e.g., when r = 2, expand yields q(x₁, x₂, x₃) and reduce yields q(x₁). Each π_i is a permutation of (x₁, …, x_r), and compose iterates over all permutations to generate propositional groundings with various argument orders; F then combines a subset of these propositions. By chaining this set of rules between neighboring arities a sufficient number of times, it is able to represent any FOL Horn rules without recursion (Dong et al., 2019).
All three operations (expand, reduce, and compose) can be implemented as tensor operations over MAPRs (Figure 2). Given a boolean tensor with r object axes and one predicate axis, expand copies the last object axis into an additional axis, producing a tensor of arity r + 1, and reduce takes the max over the last object axis, producing a tensor of arity r − 1. The reduce operation can also use min, in which case the ∃ in the reduce rule becomes ∀.
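The expand and reduce operations can be sketched in NumPy as follows. This is an illustrative sketch; the exact axis conventions are assumptions based on the description above.

```python
import numpy as np

def expand(T, num_objects):
    # (O,)*r + (Q,) -> (O,)*(r+1) + (Q,): broadcast a fresh object axis
    # before the predicate axis; the new argument is unconstrained.
    shape = T.shape[:-1] + (num_objects, T.shape[-1])
    return np.broadcast_to(T[..., None, :], shape).copy()

def reduce_(T, universal=False):
    # (O,)*r + (Q,) -> (O,)*(r-1) + (Q,): max over the last object axis
    # realizes existential quantification; min realizes universal.
    return T.min(axis=-2) if universal else T.max(axis=-2)

O = 3
T2 = np.zeros((O, O, 4), dtype=bool)
T2[0, 1, 0] = True                 # e.g. the ground atom on(a, b)
T3 = expand(T2, O)                 # arity 2 -> 3, shape (3, 3, 3, 4)
T1 = reduce_(T2)                   # arity 2 -> 1: T1[x, k] = exists y. p_k(x, y)
```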
Finally, the compose operation combines the information between tensors of neighboring arities. In order to use the information in the neighboring arities (r − 1 and r + 1), the input tensor of arity r is concatenated with the expanded (r − 1)-arity tensor and the reduced (r + 1)-arity tensor along the predicate axis, yielding a combined feature dimension Q. Next, a Perm function enumerates and concatenates the results of permuting the first r (object) axes of the tensor, multiplying the feature dimension by r!. It then applies a 1 × ⋯ × 1 pointwise convolutional filter with Q′ output features, i.e., a fully connected layer applied to each feature vector of length r!·Q while sharing the weights. The result is activated by a nonlinearity σ to obtain the final output: compose(T_r) = σ(conv(Perm(concat(expand(T_{r−1}), T_r, reduce(T_{r+1}))))).
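A minimal sketch of Perm and the pointwise convolution in NumPy; the weights are random, the sigmoid stands in for the generic nonlinearity, and the concatenation with neighboring arities is omitted for brevity. This is illustrative, not the paper's implementation.

```python
import numpy as np
from itertools import permutations

def perm(T):
    # Concatenate, along the predicate axis, the results of permuting
    # the first r object axes: (O,)*r + (Q,) -> (O,)*r + (r! * Q,).
    r = T.ndim - 1
    views = [np.transpose(T, p + (r,)) for p in permutations(range(r))]
    return np.concatenate(views, axis=-1)

def compose(T, W, b):
    # Pointwise "1x...x1 convolution": one shared linear layer applied
    # to the feature vector at every object tuple, then a sigmoid.
    z = perm(T) @ W + b
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
O, Q, Q_out = 3, 4, 8
T = rng.random((O, O, Q))                # an arity-2 input tensor
W = rng.normal(size=(2 * Q, Q_out))      # r = 2, so r! * Q = 8 input features
b = np.zeros(Q_out)
out = compose(T, W, b)
```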
An NLM layer contains one compose operation per arity up to the maximum arity, appropriately omitting the missing neighbors at both ends (below arity 0 and above the maximum arity) from the concatenation. These horizontal arity-wise compositions can be layered vertically, allowing the composition of predicates whose arities differ by more than 1 (e.g., two layers of NLM can combine unary and quaternary predicates). Since the pointwise convolution is applied over object tuples, the number of weights in an NLM layer does not depend on the number of objects in the input. However, it is still affected by the number of predicates in the input, which alters the feature dimension of each tensor.
When the predicates in the input PDDL domain have a given maximum arity, we specify the maximum intermediate arity and the depth (the number of NLM layers) as hyperparameters. The intermediate NLM layers expand the arity up to the maximum intermediate arity using the expand operation, and shrink the arity near the output because the value function is a scalar (arity 0); the arities of successive layers thus ramp up to the maximum and back down to 0. Higher arities are not necessary near the output because the information in each layer propagates only to the neighboring arities. Since each expand/reduce operation only increments/decrements the arity by one, the depth must be large enough to expand to the maximum intermediate arity and then contract back to arity 0.
We consider NLMs where the intermediate layers have a sigmoid activation function while the output is linear, since we use its raw value as the predicted correction to the heuristic function. In addition, we implement the NLM with a skip connection popularized by the ResNet image classification network (He et al., 2016): the input of each layer is a concatenation of the outputs of all previous layers. Due to the direct connections between layers at various depths, the layers near the input receive more gradient information from the output, mitigating the vanishing-gradient problem in deep neural networks.

A.2 Domain-Independent Heuristics for Classical Planning
In this section, we discuss various approximations of the delete-relaxed optimal cost h⁺. Given a classical planning problem and a state s, each heuristic is typically implicitly conditioned on the goal condition G. The h^add heuristic is recursively defined as follows:

(6) h^add(s) = Σ_{p ∈ G} h^add(p; s), where h^add(p; s) = 0 if p ∈ s, and otherwise h^add(p; s) = min_{a : p ∈ add(a)} [cost(a) + Σ_{q ∈ pre(a)} h^add(q; s)].

The h^FF heuristic can be defined based on h^add as a subprocedure. The action which minimizes the second case of the definition above is conceptually a "cheapest action that achieves the subgoal p for the first time", which is called a cheapest achiever / best supporter bs(p) of p. Using h^add and its best supporter function, h^FF is defined as follows:

(7) bs(p) = argmin_{a : p ∈ add(a)} [cost(a) + Σ_{q ∈ pre(a)} h^add(q; s)]
(8) Π(p; s) = ∅ if p ∈ s, and otherwise Π(p; s) = {bs(p)} ∪ ⋃_{q ∈ pre(bs(p))} Π(q; s)
(9) h^FF(s) = Σ_{a ∈ ⋃_{p ∈ G} Π(p; s)} cost(a)
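As an illustration, the recursive definition of h^add can be computed by a simple fixpoint iteration. The following is a minimal sketch on a hypothetical two-action STRIPS fragment; the action names, propositions, and unit costs are illustrative assumptions, not the paper's implementation.

```python
import math

# Hypothetical tiny STRIPS task: (name, preconditions, add effects).
actions = [
    ("pick",  {"handempty", "clear_a"}, {"holding_a"}),
    ("stack", {"holding_a"},            {"on_a_b", "handempty"}),
]
state = {"handempty", "clear_a"}
goal = {"on_a_b"}

def h_add(state, goal, actions, cost=lambda a: 1):
    # Fixpoint iteration of the recursion: h(p) = 0 if p holds in s,
    # otherwise min over achievers a of cost(a) + sum of h over pre(a).
    h = {p: 0.0 for p in state}
    changed = True
    while changed:
        changed = False
        for a in actions:
            _, pre, add = a
            c = cost(a) + sum(h.get(q, math.inf) for q in pre)
            for p in add:
                if c < h.get(p, math.inf):
                    h[p] = c
                    changed = True
    return sum(h.get(p, math.inf) for p in goal)

print(h_add(state, goal, actions))  # -> 2.0 (pick, then stack)
```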
A.3 Greedy Best First Search and Greedy Best First Lookahead Search (Rivlin, Hazan, and Karpas, 2019)
Given a classical planning problem, we define its state space as a directed graph (V, E) where V = 2^P, i.e., the power set of the propositions P. Greedy Best First Search is a greedy version of the A* algorithm (Hart, Nilsson, and Raphael, 1968); therefore we define A* first.
We follow the optimized version of the algorithm discussed in Burns et al. (2012), which does not use a CLOSE list and avoids moving elements between the CLOSE and OPEN lists by instead managing a flag for each search state. Let f(s) = g(s) + h(s), where g(s) is a value stored for each state that represents the currently known upper bound of the shortest path cost from the initial state to s. For every state, g(s) is initialized to infinity, except for the initial state, whose g-value is 0. Whenever a state s is expanded, f(s) is a lower bound of the cost of any path that goes through s. The A* algorithm is defined as in Algorithm 2. We simplified some aspects such as updating the parent node pointer, the rules for tie-breaking (Asai and Fukunaga, 2017), and extraction of the plan by backtracking the parent pointers. Notice that the update rule for the g values is a Bellman update specialized for a positive cost function.
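The skeleton described above (per-state g-values with Bellman updates, no separate CLOSE list) can be sketched as follows. This is an illustrative reimplementation over an implicit graph, not the paper's Algorithm 2 verbatim; parent pointers and tie-breaking are omitted as in the text.

```python
import heapq
import math

def astar(s0, successors, h, is_goal):
    """A* without a CLOSE list: g-values live in a dict, and stale OPEN
    entries are skipped when popped. successors(s) yields (cost, s')."""
    g = {s0: 0.0}
    open_list = [(h(s0), s0)]           # priority = f = g + h
    while open_list:
        f, s = heapq.heappop(open_list)
        if f > g[s] + h(s):             # stale entry: g improved since push
            continue
        if is_goal(s):                  # goal test at expansion time
            return g[s]
        for c, t in successors(s):
            if g[s] + c < g.get(t, math.inf):   # Bellman update
                g[t] = g[s] + c
                heapq.heappush(open_list, (g[t] + h(t), t))
    return math.inf

# Toy chain 0..5 with unit costs; h = remaining distance to the goal 5.
succ = lambda s: [(1, s + 1)] if s < 5 else []
print(astar(0, succ, lambda s: 5 - s, lambda s: s == 5))  # -> 5.0
```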
Three algorithms can be derived from A* by redefining the sorting key for the priority queue OPEN. First, ignoring the heuristic function by redefining f(s) = g(s) yields Dijkstra's search algorithm. Another is weighted A* (Pohl, 1970), where we redefine f(s) = g(s) + w·h(s) for some weight w > 1, which results in trusting the heuristic guidance relatively more greedily.
As the extreme version of WA*, w = ∞ conceptually yields the Greedy Best First Search (GBFS) algorithm (Russell et al., 1995), which trusts the heuristic guidance completely greedily. In practice, we implement it by ignoring the g value, i.e., f(s) = h(s). This also simplifies some of the conditionals in Algorithm 2: there is no need for updating the g value or for reopening nodes. In addition, a purely satisficing algorithm like GBFS can enjoy an additional enhancement called early goal detection. In A*, the goal condition is checked when a node is popped from the OPEN list (line 7); checking it earlier would leave the possibility of returning a suboptimal goal node. In contrast, since we do not have this optimality requirement in GBFS, the goal condition can be checked in line 12, where the successor state is generated. GBFS is thus defined as in Algorithm 3.
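A corresponding GBFS sketch, with the priority reduced to h alone and early goal detection at node generation. Again this is an illustrative reimplementation, not Algorithm 3 verbatim; the duplicate-detection set stands in for the per-state flags mentioned above.

```python
import heapq

def gbfs(s0, successors, h, is_goal):
    """Greedy best-first search: key = h only, no g bookkeeping or
    reopening, and the goal test runs when a successor is generated."""
    if is_goal(s0):
        return [s0]
    parent = {s0: None}
    open_list = [(h(s0), s0)]
    while open_list:
        _, s = heapq.heappop(open_list)
        for _, t in successors(s):
            if t in parent:
                continue                 # duplicate detection
            parent[t] = s
            if is_goal(t):               # early goal detection
                path = [t]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            heapq.heappush(open_list, (h(t), t))
    return None

# Toy integer line; goal is 5, heuristic is distance to 5.
succ = lambda s: [(1, s + 1), (1, s - 1)]
print(gbfs(0, succ, lambda s: abs(5 - s), lambda s: s == 5))
# -> [0, 1, 2, 3, 4, 5]
```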
Finally, Rivlin, Hazan, and Karpas (2019) proposed an unnamed extension of GBFS which performs a depth-first lookahead after a node is expanded. We call this search algorithm Greedy Best First Lookahead Search (GBFLS), defined in Algorithm 4. We perform the same early goal checking during the lookahead steps. Note that nodes are added to the OPEN list only when the current node is expanded; nodes that appear during the lookahead are not added to the OPEN list. However, these nodes must be counted as evaluated nodes because they are subject to goal checking and because we evaluate their heuristic values. The lookahead has an artificial depth limit defined as 5 times the value of the FF heuristic at the initial state; when this value is unavailable, the limit is set to 50, following their code base.
A.4 Implementation
A.5 Generator Parameters
Table 3 contains a list of parameters used to generate the training and testing instances. Since the generators have a tendency to create identical instances, especially with smaller parameters, we removed duplicates by checking the md5 hash value of each file.
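The md5-based deduplication can be sketched as follows; the directory layout and file handling are illustrative assumptions, not the paper's actual script.

```python
import hashlib
import os

def dedupe_by_md5(directory):
    """Remove files whose contents hash to an md5 digest already seen,
    keeping the first occurrence (sorted order for determinism)."""
    seen = set()
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as f:
            digest = hashlib.md5(f.read()).hexdigest()
        if digest in seen:
            os.remove(path)       # duplicate problem file
        else:
            seen.add(digest)
```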
Domain | Parameters | # Objects
blocks/train/ | 2–6 blocks × 50 seeds | 2–6
blocks/test/ | 10, 20, …, 50 blocks × 50 seeds | 10–50
ferry/train/ | 2–6 locations × 2–6 cars × 50 seeds | 4–7
ferry/test/ | 10, 15, …, 30 locations and cars × 50 seeds | 20–60
gripper/train/ | 2, 4, …, 10 balls × 50 seeds (initial/goal locations are randomized) | 6–14
gripper/test/ | 20, 40, …, 60 balls × 50 seeds (initial/goal locations are randomized) | 24–64
logistics/train/ | 1–3 airplanes × 1–3 cities × 1–3 city size × 1–3 packages × 10 seeds | 5–13
logistics/test/ | 4–8 airplanes/cities/city size/packages × 50 seeds | 32–96
satellite/train/ | 1–3 satellites × 1–3 instruments × 1–3 modes × 1–3 targets × 1–3 observations | 15–39
satellite/test/ | 4–8 satellites/instruments/modes/targets/observations × 50 seeds | 69–246
miconic/train/ | 2–4 floors × 2–4 passengers × 50 seeds | 8–12
miconic/test/ | 10, 20, 30 floors × 10, 20, 30 passengers × 50 seeds | 24–64
parking/train/ | 2–6 curbs × 2–6 cars × 50 seeds | 8–16
parking/test/ | 10, 15, …, 25 curbs × 10, 15, …, 25 cars × 50 seeds | 24–54
visitall/train/ | n × n grids, 0.5 or 1.0 goal ratio, blocked locations, 50 seeds | 8–22
visitall/test/ | n × n grids, 0.5 or 1.0 goal ratio, blocked locations, 50 seeds | 32–58
A.6 Hyperparameters
We trained our network with the following set of hyperparameters: maximum episode length, learning rate 0.001, discount rate, maximum intermediate arity, number of NLM layers (smaller in satellite and logistics than in all other domains), number of features in each NLM layer, batch size 25, temperature for the policy function (Section 2.2), and 50,000 total SGD steps, which determines the length of the training. We used a smaller number of layers for satellite and logistics to limit GPU memory usage: due to the size of the intermediate layers, the NLM sometimes requires a large amount of GPU memory. Each training run takes about 4 to 6 hours, depending on the domain.
A.7 Preliminary Results on Compatible Domains
We performed a preliminary test on a variety of IPC classical domains that are supported by our implementation. The following domains worked without errors: barman-opt11-strips, blocks, depot, driverlog, elevators-opt11+sat11-strips, ferry, floortile-opt11-strips, freecell, gripper, hanoi, logistics00, miconic, mystery, nomystery-opt11-strips, parking-opt11+sat11-strips, pegsol-opt11-strips, pipesworld-notankage, pipesworld-tankage, rovers, satellite, scanalyzer-08-strips, sokoban-opt11-strips, tpp, transport-opt11+sat08-strips, visitall-opt11-strips, and zenotravel.