1 Introduction
Despite the prevalence of deep learning for perception tasks in computer vision and natural language processing, its application to problem-solving tasks, such as planning, is still in its infancy. The majority of deep learning approaches to planning use conventional architectures designed for perception tasks, rely on hand-engineered features or on encoding planning problems as images, and do not learn knowledge that generalises beyond planning with a different initial state or goal [7, 2, 14]. One exception is Action Schema Networks (ASNets) [30, 31], a neural network architecture which exploits the relational structure of a given planning domain described in (P)PDDL to learn generalised policies applicable to problems of any size within the domain.
The motivation of our work is to go even further than architectures such as ASNets, and learn to plan – or at least to guide the search for a plan – independently of the domain considered. In particular, we consider the problem of learning domain-independent heuristics that generalise not only across states, goals, and object sets, but also across domains.
We focus on the well-known class of delete-relaxation heuristics for propositional STRIPS planning [6, 16], of which $h^{\max}$, $h^{\text{add}}$, and LM-cut are popular examples. These heuristics can be seen as computing a least-cost path in the hypergraph representing the delete-relaxed problem, for a suitable cost-aggregation function. The vertices of this hypergraph represent the problem's propositions and the hyperedges represent actions connecting their preconditions to their positive effects. We can therefore frame the problem of learning domain-independent heuristics as that of learning a mapping from the hypergraph representation of the delete-relaxed problem (and optionally other features) to a cost estimate. To develop and evaluate this hypergraph learning framework, we make three contributions:

Hypergraph Networks (HGNs), our novel framework which generalises Graph Networks [4] to hypergraphs. The HGN framework may be used to design new hypergraph deep learning models, and inherently supports combinatorial generalisation to hypergraphs with different numbers of vertices and hyperedges.

STRIPS-HGNs, an instance of a HGN which is designed to learn heuristics by approximating shortest paths over the hypergraph induced by the delete relaxation of a STRIPS problem. STRIPS-HGNs use a powerful recurrent encode-process-decode architecture which allows them to incrementally propagate messages within the hypergraph in latent space.

A detailed empirical evaluation, which rigorously defines the Hypergraph Network configurations and training procedure we use in our experiments. We train and evaluate our STRIPS-HGNs on a variety of domains and show that they are able to learn domain-dependent and domain-independent heuristics which potentially outperform $h^{\max}$, $h^{\text{add}}$, and LM-cut.
As far as we are aware, this is the first work to learn domainindependent heuristics completely from scratch.
2 Related Work
There is a large body of literature on learning for planning; Jiménez et al. and Toyer et al. provide excellent surveys of existing approaches. Due to space limitations, we focus on deep learning (DL) approaches to planning, which differ in what they learn, the features and architectures they use, and the generality they confer.
What is learned? Existing DL approaches may be split into four categories: learning domain descriptions [26, 3], policies [7, 30, 14, 18, 11], heuristics [25, 2, 29, 13], and planner selection [27]. Our work is concerned with learning heuristics. One of the key differences of our approach with the existing state-of-the-art for learning heuristics is that we learn heuristics from scratch instead of improving or combining existing heuristics. That being said, STRIPS-HGNs are also suitable for learning heuristic improvements or combinations, and with some adaptations, for learning action rankings; however, we have not experimented with these settings.
Features and Architectures.
Most existing DL approaches to planning use standard architectures, and rely on hand-engineered features or encodings of planning problems as images. For instance, Sievers et al. train Convolutional Neural Networks (CNNs) over graphical representations of planning problems converted into images, to determine which planner should be invoked for a planning task. For learning generalised policies and heuristics, Groshev et al. train CNNs and Graph Convolutional Networks on images obtained via a domain-specific hand-coded problem conversion. In contrast, our approach does not require hand-coded features and instead learns latent features directly from a rich hypergraph representation of the planning problem.
Another approach is ASNets [30], a neural network architecture dedicated to planning, composed of alternating action and proposition layers which are sparsely connected according to the relational structure of the action schemas in a (P)PDDL domain. A disadvantage of ASNets is their fixed receptive field, which limits their capability to support long chains of reasoning. Our STRIPS-HGN architecture does not have such an intrinsic receptive-field limitation.
Generalisation. Existing approaches and architectures for learning policies and heuristics have limited generalisation capabilities. Many generalise to problems with different initial states and goals, but not to problems with different sets or numbers of objects. Exceptions include ASNets, whose weight-sharing scheme allows the generated policies to generalise to problems of any size from a given (P)PDDL domain, and TraPSNet [11], whose graph attention network can be transferred between different numbers of objects in an RDDL domain. As our experiments show, not only do STRIPS-HGNs support generalisation across problem sizes, they also support learning domain-independent heuristics that generalise across domains, including domains that were not seen during training.
3 Planning Heuristics
We are concerned with classical planning problems represented in propositional STRIPS [10]. Such a problem is a tuple $\Pi = \langle P, A, s_I, G, c \rangle$ where $P$ is the set of propositions; $A$ is the set of actions; $s_I \subseteq P$ represents the initial state; $G \subseteq P$ represents the set of goal states; and $c(a)$ is the cost of action $a \in A$. Each action $a$ is defined as a triple $\langle \text{pre}(a), \text{add}(a), \text{del}(a) \rangle$ where the precondition $\text{pre}(a)$ is the set of propositions which must be true in order for $a$ to be applied, while the add- and delete-effects $\text{add}(a)$ and $\text{del}(a)$ are the sets of propositions which the action makes true and false, respectively, when applied.
A solution plan for a STRIPS problem is a sequence of applicable actions $\pi = (a_1, \ldots, a_n)$ leading from the initial state to the goal, i.e., $\pi$ induces a sequence of states $s_0, \ldots, s_n$ such that $s_0 = s_I$, $G \subseteq s_n$, and, for all $i \in \{1, \ldots, n\}$, $\text{pre}(a_i) \subseteq s_{i-1}$ and $s_i = (s_{i-1} \setminus \text{del}(a_i)) \cup \text{add}(a_i)$. The cost of a plan is the sum of the costs of its actions, $c(\pi) = \sum_{i=1}^{n} c(a_i)$. An optimal plan is a plan of minimum cost.
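These STRIPS semantics are easy to make concrete. The following is a minimal sketch (not the authors' code): states are sets of propositions, and a plan is valid iff each action's precondition holds when it is applied and the goal holds in the final state. The `Action` type and the toy propositions are illustrative assumptions.

```python
# A minimal sketch of STRIPS plan semantics: states are sets of
# propositions, actions are (pre, add, delete, cost) tuples.
from typing import FrozenSet, List, NamedTuple

class Action(NamedTuple):
    name: str
    pre: FrozenSet[str]
    add: FrozenSet[str]
    delete: FrozenSet[str]
    cost: float

def apply_plan(s0: FrozenSet[str], goal: FrozenSet[str], plan: List[Action]):
    """Return (is_valid, total_cost) for a candidate plan."""
    s, total = s0, 0.0
    for a in plan:
        if not a.pre <= s:              # precondition must hold in s
            return False, total
        s = (s - a.delete) | a.add      # progression: delete, then add
        total += a.cost
    return goal <= s, total             # valid iff the goal is reached

# Toy example: load a package, then move from A to B.
load = Action("load", frozenset({"at-A"}), frozenset({"loaded"}), frozenset(), 1.0)
move = Action("move", frozenset({"at-A"}), frozenset({"at-B"}), frozenset({"at-A"}), 1.0)
ok, cost = apply_plan(frozenset({"at-A"}), frozenset({"at-B", "loaded"}), [load, move])
```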
Heuristics. Let $S$ be the state space of $\Pi$. A heuristic function $h : S \to \mathbb{R} \cup \{\infty\}$ provides an estimate $h(s)$ of the cost to reach a goal state from a state $s$, allowing a search algorithm to focus on promising parts of the state space. The optimal heuristic $h^*$ is the heuristic that gives the cost of the optimal plan to reach a goal state from $s$. A heuristic $h$ is admissible iff it never overestimates this optimal cost, i.e., $h(s) \le h^*(s)$ for all $s \in S$, and is inadmissible otherwise. Many heuristics are obtained by approximating the cost of the optimal plan for a relaxation of the original problem $\Pi$. A well-known relaxation, the delete-relaxation $\Pi^+$ of $\Pi$, is obtained by ignoring the delete-effects of all actions in $\Pi$, i.e., $\Pi^+ = \langle P, A^+, s_I, G, c \rangle$, where $A^+ = \{\langle \text{pre}(a), \text{add}(a), \emptyset \rangle \mid a \in A\}$. This work considers three baseline domain-independent heuristics based on the delete-relaxation: $h^{\max}$ (admissible) and $h^{\text{add}}$ (inadmissible) [6], and the Landmark-Cut heuristic LM-cut (admissible) [16].
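The delete-relaxation itself is a one-line transformation, sketched below under the same illustrative `Action` representation (a named tuple of precondition, add list, delete list, and cost; not the authors' code): every action keeps its precondition, add-effects and cost, but its delete list becomes empty.

```python
# Sketch of the delete relaxation: A+ = {<pre(a), add(a), {}> | a in A}.
from typing import FrozenSet, List, NamedTuple

class Action(NamedTuple):
    name: str
    pre: FrozenSet[str]
    add: FrozenSet[str]
    delete: FrozenSet[str]
    cost: float

def delete_relax(actions: List[Action]) -> List[Action]:
    """Return the relaxed action set: delete-effects are dropped."""
    return [a._replace(delete=frozenset()) for a in actions]

move = Action("move", frozenset({"at-A"}), frozenset({"at-B"}),
              frozenset({"at-A"}), 1.0)
relaxed = delete_relax([move])
# In the relaxed problem, applying `move` keeps `at-A` true as well,
# so propositions only ever accumulate.
```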
4 Hypergraph Networks
Hypergraph Networks (HGNs) are our generalisation of the Graph Networks [4] framework to hypergraphs. HGNs may be used to represent and extend existing DL models, including CNNs, graph neural networks, and state-of-the-art hypergraph neural networks. We do not explore HGNs in great detail, as they are not the focus of this paper.
Hypergraph Definition. A hypergraph is a generalisation of a graph in which a hyperedge may connect any number of vertices together. A directed hypergraph in the HGN framework is defined as a triple $G = (\mathbf{u}, V, E)$ where: $\mathbf{u}$ represents the hypergraph-level (global) features; $V = \{\mathbf{v}_i\}_{i=1:N^v}$ is the set of vertices, where $\mathbf{v}_i$ represents the $i$-th vertex's features; and $E = \{(\mathbf{e}_k, R_k, S_k)\}_{k=1:N^e}$ is the set of hyperedges, where $\mathbf{e}_k$ represents the $k$-th hyperedge's features, $R_k$ is the vertex set which contains the indices of the vertices in the head of the $k$-th hyperedge (i.e., receivers), and $S_k$ is the vertex set which contains the indices of the vertices in the tail of the $k$-th hyperedge (i.e., senders). This is in contrast to Graph Networks, where $R_k$ and $S_k$ are singletons, i.e., $|R_k| = |S_k| = 1$. An example of a hyperedge for a delete-relaxed STRIPS action is depicted in Figure 1.
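As a concrete data structure, the triple $(\mathbf{u}, V, E)$ might be sketched as follows (an illustrative layout, not the authors' implementation): each hyperedge carries its own feature vector plus two index sets of arbitrary size.

```python
# Sketch of the HGN hypergraph triple (u, V, E): global features u,
# per-vertex features V, and hyperedges (e_k, R_k, S_k) whose head R_k
# (receivers) and tail S_k (senders) are index sets of any size.
from dataclasses import dataclass
from typing import List

@dataclass
class Hyperedge:
    features: List[float]
    receivers: List[int]   # R_k: indices of head vertices
    senders: List[int]     # S_k: indices of tail vertices

@dataclass
class Hypergraph:
    u: List[float]                 # global (hypergraph-level) features
    vertices: List[List[float]]    # V: one feature vector per vertex
    hyperedges: List[Hyperedge]    # E

# A hyperedge for a relaxed action with preconditions {0, 1} and
# add effects {2, 3}: senders are preconditions, receivers add effects.
g = Hypergraph(u=[],
               vertices=[[1.0, 0.0] for _ in range(4)],
               hyperedges=[Hyperedge([1.0, 2.0, 2.0],
                                     receivers=[2, 3], senders=[0, 1])])
```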
Hypergraph Network Block. A Hypergraph Network (HGN) block is a hypergraph-to-hypergraph function which forms the core building block of a HGN. The internal structure of a HGN block is identical to a Graph Network block [4], except that the hyperedge update function now supports multiple receivers and senders. A full HGN block is composed of three update functions, $\phi^e$, $\phi^v$ and $\phi^u$, and three aggregation functions, $\rho^{e \to v}$, $\rho^{e \to u}$ and $\rho^{v \to u}$:

$\mathbf{e}'_k = \phi^e(\mathbf{e}_k, V_{R_k}, V_{S_k}, \mathbf{u})$,  $\bar{\mathbf{e}}'_i = \rho^{e \to v}(E'_i)$,
$\mathbf{v}'_i = \phi^v(\bar{\mathbf{e}}'_i, \mathbf{v}_i, \mathbf{u})$,  $\bar{\mathbf{e}}' = \rho^{e \to u}(E')$,
$\mathbf{u}' = \phi^u(\bar{\mathbf{e}}', \bar{\mathbf{v}}', \mathbf{u})$,  $\bar{\mathbf{v}}' = \rho^{v \to u}(V')$,

where $V_{R_k} = \{\mathbf{v}_i\}_{i \in R_k}$ and $V_{S_k} = \{\mathbf{v}_i\}_{i \in S_k}$ are the sets which represent the vertex features of the receivers and senders of the $k$-th hyperedge, respectively. Additionally, for the $i$-th vertex, we define $E'_i = \{(\mathbf{e}'_k, R_k, S_k)\}_{k : i \in R_k}$, $E' = \{(\mathbf{e}'_k, R_k, S_k)\}_{k=1:N^e}$, and $V' = \{\mathbf{v}'_i\}_{i=1:N^v}$. Essentially, $E'_i$ represents the updated hyperedges where the $i$-th vertex is a receiver, $E'$ represents all the updated hyperedges, and $V'$ represents all the updated vertices.
Since the inputs to the aggregation functions are essentially sets, each $\rho$ must be permutation invariant to ensure that all permutations of the input give the same aggregated result. Hence $\rho$ could, for example, be an element-wise summation, maximum, minimum, or mean of the input [4].
Computation Steps. In a single forward pass of a HGN block, the hyperedge update function $\phi^e$ is first applied to all hyperedges to compute per-hyperedge updates. Each updated hyperedge feature $\mathbf{e}'_k$ is computed using the current hyperedge's feature $\mathbf{e}_k$, the features of the receiver and sender vertices $V_{R_k}$ and $V_{S_k}$, respectively, and the global features $\mathbf{u}$. Next, the vertex update function $\phi^v$ is applied to all vertices to compute per-vertex updates. Each updated vertex feature $\mathbf{v}'_i$ is computed using the aggregated information $\bar{\mathbf{e}}'_i$ from all the hyperedges the vertex 'receives' a signal from (i.e., it appears in the head of the hyperedge), the current vertex's feature $\mathbf{v}_i$, and the global features $\mathbf{u}$. Finally, the global update function $\phi^u$ is applied to compute the new global features $\mathbf{u}'$ [4], using the aggregated information from all the hyperedges and vertices in the hypergraph, $\bar{\mathbf{e}}'$ and $\bar{\mathbf{v}}'$ respectively, along with the current global features $\mathbf{u}$.
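These three steps can be sketched as plain array code. The block below is a hedged illustration, not the authors' implementation: the update functions are placeholder callables (a trained model would use MLPs), aggregation is element-wise summation, and a hyperedge is a `(features, receivers, senders)` tuple.

```python
import numpy as np

# Sketch of one full HGN block pass with sum aggregation.
def hgn_block(u, V, E, phi_e, phi_v, phi_u):
    """u: global features; V: (N_v, d) vertex features;
    E: list of (e_k, R_k, S_k) with a feature vector and index lists."""
    # 1. Per-hyperedge updates: e'_k from e_k, receiver/sender vertex
    #    features (summed here for a fixed-size input), and u.
    E_new = [(phi_e(e_k, V[R_k].sum(0), V[S_k].sum(0), u), R_k, S_k)
             for e_k, R_k, S_k in E]
    # 2. Per-vertex updates: aggregate updated hyperedges in which the
    #    vertex is a receiver (rho^{e->v} = sum), then apply phi^v.
    V_new = np.stack([
        phi_v(sum((e for e, R, _ in E_new if i in R),
                  np.zeros_like(E_new[0][0])), V[i], u)
        for i in range(len(V))])
    # 3. Global update from aggregated hyperedge and vertex features.
    e_bar = sum(e for e, _, _ in E_new)   # rho^{e->u}
    v_bar = V_new.sum(0)                  # rho^{v->u}
    return phi_u(e_bar, v_bar, u), V_new, E_new

# Toy update functions that just total their inputs.
f = lambda *xs: sum(float(np.sum(x)) for x in xs) * np.ones(1)
u, V = np.zeros(1), np.ones((3, 1))
E = [(np.ones(1), [2], [0, 1])]          # one hyperedge: {0,1} -> {2}
u2, V2, E2 = hgn_block(u, V, E, f, f, f)
```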
Configuring HGN Blocks. Each update function $\phi$ in a HGN block must be implemented by some function $f$, where the signature of $f$ determines what input it gets [4]. For example, the function that implements $\phi^e$ in a full HGN block (Figure 2) is a function which accepts the global, vertex, and hyperedge features. Each function may be implemented in any manner, as long as it accepts the input parameters and conforms to the required output. Since HGN blocks are hypergraph-to-hypergraph functions, we may compose blocks sequentially and apply them repeatedly.
5 STRIPS-HGNs
STRIPS-HGN is our instantiation of a HGN which uses a recurrent encode-process-decode architecture [4] for learning heuristics. STRIPS-HGNs are designed to be highly adaptable to different input features for each proposition and action, and are agnostic to the implementation of each update function in each HGN block.
Hypergraph Representation. The input to a STRIPS-HGN is a hypergraph $G = (\mathbf{u}, V, E)$ which contains the input proposition and action features for the state $s$, along with the hypergraph structure of the delete-relaxed STRIPS problem $\Pi^+$, where:

$\mathbf{u} = \emptyset$, as global features are not required as input to a STRIPS-HGN. Nevertheless, it is easy to adapt STRIPS-HGNs to support global features; e.g., we could supplement a STRIPS-HGN with a heuristic value computed by another heuristic, such that the network learns an "improvement" on it, similar to [13].

$V$ contains the input features for the propositions in the problem. Features for a proposition could include whether it is true in the current state or the goal, and whether the proposition is a fact landmark for the state [24].

$E = \{(\mathbf{e}_k, R_k, S_k)\}$ for the actions in the relaxed problem $\Pi^+$. For an action $a$ represented by the $k$-th hyperedge, $\mathbf{e}_k$ represents the input features for $a$ (e.g., the cost of the action $c(a)$, and whether the action is in the disjunctive action landmarks from state $s$), and $R_k$ (resp. $S_k$) is the vertex set containing the indices of the vertices in the additive effects (resp. preconditions) of $a$.
The output of a STRIPS-HGN is a hypergraph $G' = (\mathbf{u}', V', E')$ where $\mathbf{u}'$ is a 1-dimensional vector representing the heuristic value for $s$; thus we enforce both $V'$ and $E'$ to be the empty set.

5.1 Architecture
A STRIPS-HGN is composed of three main HGN blocks: the encoding, processing (core), and decoding blocks. Our architecture follows a recurrent encode-process-decode design [15, 4], as depicted in Figure 3. The input hypergraph is first encoded into a latent representation $G^0$ by the encoding block at time step $t = 0$. This allows the network to operate on a richer representation of the input features in latent space.
Next, the initial latent representation $G^0$ of the hypergraph is concatenated with the previous output $G^{t-1}$ of the processing block. Initially, when the processing block has not yet been called (i.e., at the time step just after $G^0$ has been computed), $G^0$ is concatenated with itself. Note that the hypergraph structure of $G^0$ and $G^{t-1}$ is identical, because the HGN blocks do not update the senders or receivers of a hyperedge. Implementation-wise, concatenating one hypergraph with another involves concatenating the features of each corresponding pair of vertices, and the features of each corresponding pair of hyperedges (the global features are not concatenated, as they are not required as input to a STRIPS-HGN). This results in a broadened feature vector for each vertex and hyperedge.
The processing block, which outputs a hypergraph $G^t$ for each time step $t$, is applied $M$ times, with the initial encoded hypergraph $G^0$ concatenated with the previous output of the processing block as its input (see Figure 3). Evidently, this results in intermediate hypergraph outputs, one for each time step $t < M$, and one final hypergraph for time step $M$. The decoding block takes the hypergraph output by the core block and decodes it into the hypergraph $G'$ which contains the heuristic value for state $s$ in the global feature $\mathbf{u}'$. Observe that we can decode each latent hypergraph output by the core block to obtain a heuristic value for each time step $t$. We use this fact to train a STRIPS-HGN by optimising the loss on the output of each time step.
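The recurrence above can be sketched in a few lines. This is a hedged illustration only: `enc`, `core` and `dec` stand in for the encoder, core and decoder HGN blocks, and for simplicity they operate on a single latent vector rather than a full hypergraph.

```python
import numpy as np

# Sketch of the recurrent encode-process-decode loop: encode once,
# apply the shared core block M times on [G^0; G^{t-1}], decode each
# intermediate latent state into a heuristic estimate.
def encode_process_decode(x, enc, core, dec, M):
    h0 = enc(x)                             # latent input, t = 0
    h = h0                                  # "concatenated with itself"
    outputs = []
    for _ in range(M):                      # M message passing steps
        h = core(np.concatenate([h0, h]))   # core sees input + previous
        outputs.append(dec(h))              # decode every time step
    return outputs

# Toy stand-ins for the three blocks.
enc = lambda x: x * 1.0
core = lambda z: z[:1] + z[1:]              # toy "message passing" update
dec = lambda h: float(h[0])
outs = encode_process_decode(np.array([1.0]), enc, core, dec, M=3)
# At evaluation time only the last output, outs[-1], is used; training
# averages the loss over all M intermediate outputs.
```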
Core Block Details. We can interpret a STRIPS-HGN as a message passing model which performs $M$ steps of message passing [12], as the shared processing block is repeated $M$ times using a recurrent architecture. A single step of message passing is equivalent to sending a 'signal' from a vertex to its immediate neighbouring vertices. Although this means that a vertex only receives a 'signal' from vertices at most $M$ hops away, we theorise that this is sufficient to learn a powerful function which aggregates proposition and action features in latent space.
In contrast to architectures such as ASNets and CNNs, whose fixed receptive field is determined by the number of hidden layers, the receptive field of a STRIPS-HGN is effectively determined by the number of message passing steps. We can thus increase or decrease the receptive field of a STRIPS-HGN simply by scaling the number of message passing steps, a significant advantage over networks with fixed receptive fields.
Within-Block Design. The encoder block (Figure 4) encodes the vertex and hyperedge input features independently of each other, using its $\phi^v$ and $\phi^e$, respectively.
The core processing block of a STRIPS-HGN (Figure 4) takes the concatenated vertex and hyperedge features from the latent hypergraphs $G^0$ and $G^{t-1}$ as input. $\phi^e$ computes per-hyperedge updates based on these hyperedge and vertex features. $\phi^v$ computes per-vertex updates based on the vertex features and the aggregated features of the hyperedges where the vertex is a receiver, which are computed using $\rho^{e \to v}$. Finally, $\phi^u$ uses the aggregated vertex and hyperedge features, calculated with $\rho^{v \to u}$ and $\rho^{e \to u}$ respectively, to compute a latent representation of the heuristic value.
The decoder block (Figure 4) takes the latent global features of the hypergraph returned by the core HGN block and uses its $\phi^u$ to decode them into a one-dimensional heuristic value. The vertex and hyperedge features are not used, as the global features already represent an aggregation of these features as computed by the core block.
The choice of learning model for the update functions $\phi^e$, $\phi^v$ and $\phi^u$ within each block is not strict, as long as the model conforms to the input and output requirements. The aggregation functions $\rho^{e \to v}$, $\rho^{e \to u}$ and $\rho^{v \to u}$ should be permutation invariant to the ordering of their inputs, as otherwise different heuristic values could be obtained for different permutations of the same STRIPS problem. We detail our choice of update and aggregation functions in Section 6.1, which describes our experimental setup.
5.2 Training Algorithm
We consider learning a heuristic function as a regression problem, where ideally the learned heuristic provides near-optimal estimates of the cost-to-go. We train our STRIPS-HGNs on the values generated by the optimal heuristic $h^*$. Given a set of training problems, we run an optimal planner on each problem to obtain optimal state-value pairs $(s, h^*(s))$. We then generate the delete-relaxed hypergraph $G_s$ for the problem and the state $s$ to get a training sample $(G_s, h^*(s))$. We denote by $\mathcal{T}$ the set containing all training samples.
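The state-value pairs fall out of a single optimal plan: $h^*(s_i)$ of each intermediate state $s_i$ along the plan is the cost of the remaining actions. The sketch below illustrates this under an assumed `(pre, add, delete, cost)` tuple format; the optimal plan itself is assumed given (e.g. from Fast Downward with LM-cut, as in Section 6.1).

```python
# Sketch of extracting optimal state-value training pairs from one
# optimal plan: h*(s_i) = sum of the costs of actions i+1..n.
def state_value_pairs(s0, optimal_plan):
    """optimal_plan: list of (pre, add, delete, cost) tuples,
    assumed to be an optimal plan from frozenset(s0) to the goal."""
    costs = [a[3] for a in optimal_plan]
    pairs, s = [], frozenset(s0)
    for i, (pre, add, delete, cost) in enumerate(optimal_plan):
        pairs.append((s, sum(costs[i:])))        # (s_i, h*(s_i))
        s = (s - frozenset(delete)) | frozenset(add)
    pairs.append((s, 0.0))                       # goal state: h* = 0
    return pairs

# Toy two-step plan with unit-cost actions.
plan = [({"a"}, {"b"}, set(), 1.0), ({"b"}, {"g"}, {"b"}, 1.0)]
pairs = state_value_pairs({"a"}, plan)
```

Each pair would then be converted into a training sample by building the delete-relaxed hypergraph for the recorded state.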
Weight Optimisation.
We use supervised learning and assume that each update function $\phi$ in the encoder, core, and decoder blocks of a STRIPS-HGN has some weights that need to be learned. For simplicity, we aggregate these weights into a single variable $\theta$. Let $h_\theta$ be the heuristic learned by a STRIPS-HGN parameterised by the weights $\theta$. Recall that we can decode the latent hypergraph output by the core HGN block at each time step $t$ into a heuristic value $h^t_\theta(G_s)$. Our loss function averages the losses of these intermediate outputs over the time steps, to encourage a STRIPS-HGN to find a good heuristic value in the smallest number of message passing steps possible [4]. We use the mean squared error (MSE) loss:

$\mathcal{L}(\theta; B) = \frac{1}{|B| \cdot M} \sum_{(G_s,\, h^*(s)) \in B} \sum_{t=1}^{M} \left( h^t_\theta(G_s) - h^*(s) \right)^2$

where $B$ is a minibatch within the entire training set $\mathcal{T}$, $M$ is the number of message passing steps, and $G_s$ is the input hypergraph for state $s$ in a problem.
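Numerically, the loss is just the mean squared error taken over every intermediate estimate of every sample in the minibatch, as in this small sketch:

```python
import numpy as np

# MSE averaged over the M intermediate heuristic estimates of every
# sample in a minibatch.
def hgn_loss(step_outputs, targets):
    """step_outputs: (batch, M) array with one estimate per message
    passing step; targets: (batch,) optimal heuristic values h*."""
    step_outputs = np.asarray(step_outputs, dtype=float)
    targets = np.asarray(targets, dtype=float)[:, None]  # broadcast
    return float(np.mean((step_outputs - targets) ** 2))

# Two samples, M = 2 steps each: errors (1, 0, 0, 0) -> mean 0.25.
loss = hgn_loss([[1.0, 2.0], [3.0, 3.0]], [2.0, 3.0])
```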
We use minibatch gradient descent [20] to update the weights in the direction which minimises $\mathcal{L}$, by using the gradient $\nabla_\theta \mathcal{L}$. In a single epoch, we apply this update to every minibatch $B \subseteq \mathcal{T}$. We run additional epochs until we reach a maximum number of epochs or exceed a fixed training time. At evaluation time, we use the heuristic value output at the last message passing step $M$.

5.3 Limitations of STRIPS-HGNs
Firstly, it is expensive to compute a single heuristic value using a STRIPS-HGN, given the computational cost of the matrix operations required for a single step of message passing; this cost scales with the number of vertices and hyperedges in the hypergraph. However, it may pay off if the learned heuristic provides very informative estimates near the optimal $h^*$, as this may reduce the total CPU time required to find a near-optimal solution.
The number of message passing steps $M$ for the core HGN block is a hyperparameter which, in theory, should be adaptively selected based on how 'far' away the current state is from the goal. However, determining a good value for $M$ is not trivial, and should ideally be determined automatically by a STRIPS-HGN using its intermediate outputs. In practice, we found that setting $M = 10$ was sufficient to achieve promising results.

Finally, we are unable to provide any formal guarantees that the heuristics learned by STRIPS-HGNs are admissible. Although we train STRIPS-HGNs on the optimal heuristic values, it is infeasible to analyse a network to understand exactly what it is computing.
6 Empirical Evaluation
Our experiments aim to show the generalisation capability of STRIPS-HGNs to problems they were not trained on. For each experiment, we select a small pool of training problems (potentially from several domains) and train a STRIPS-HGN. We then evaluate the learned heuristic on a larger pool of testing problems with differing initial/goal states, problem sizes, and even domains. We repeat each experiment 10 times, resulting in 10 differently trained networks, to measure the influence of the randomly generated problems and the training procedure.
6.1 Experimental Setup
Hardware. All experiments were conducted on an Amazon Web Services c5.2xlarge server with an Intel Xeon Platinum 8000 series processor running at 3.4 GHz. To ensure fairness between STRIPS-HGNs and our baselines, each experiment was limited to a single core. We enforced a 16 GB memory cutoff; however, only blind search reached this cutoff and the other planners never exceeded 2 GB.
Search Configuration. We compare STRIPS-HGNs against the following baselines: no heuristic (i.e., blind search), $h^{\max}$, LM-cut, and $h^{\text{add}}$. These baselines all represent heuristics computable from the same input used by STRIPS-HGNs (the delete-relaxation hypergraph), making this a fair comparison, as all heuristics have access to the same information. We use A* as the search algorithm to compare the different heuristics, since STRIPS-HGNs are trained using optimal heuristic values and we believe their estimates are sufficiently informative to find optimal solutions.
To generate the training data for each training problem, we used Fast Downward [17] configured with A* search and the LM-cut heuristic, with a timeout of 2 minutes. To evaluate each testing problem with a heuristic, we used A* search in Pyperplan [1] with a 5 minute timeout. For each problem and heuristic, we run A* once.
We used Pyperplan for evaluation as STRIPS-HGNs are implemented in Python. We observed that the implementations of the delete-relaxation heuristics in Pyperplan are much slower than their counterparts in Fast Downward; hence, our results for CPU time should be considered preliminary.
STRIPS-HGNs Configuration. We generate the hypergraph of each planning problem from the delete-relaxed problem computed by Pyperplan. For a STRIPS problem and a given state $s$, we encode the input features for each proposition $p$ (vertex) as a vector of length 2 whose first (resp. second) element is 1 iff $p$ is true in the state $s$ (resp. in the goal $G$), and 0 otherwise. The input feature for each action $a$ represented by a hyperedge is the vector $[c(a), |\text{add}(a)|, |\text{pre}(a)|]$, where $c(a)$ is the cost of $a$, and $|\text{add}(a)|$ and $|\text{pre}(a)|$ are the numbers of positive effects and preconditions of $a$, respectively. $|\text{add}(a)|$ and $|\text{pre}(a)|$ are used by a STRIPS-HGN to determine how much of a 'signal' it should send from a given hyperedge.
We set the number of message passing steps $M$ for the recurrent core HGN block to 10, and implement each update function as a Multilayer Perceptron (MLP) with two sequential fully-connected (FC) layers, each with an output dimensionality of 32. We apply the LeakyReLU activation function [23] after each FC layer, and add an extra FC layer with an output dimensionality of 1 in $\phi^u$ of the decoding block. Since the input to an MLP must be a fixed-size vector, we concatenate each update function's input features before feeding them into the MLP. However, for the hyperedge update function $\phi^e$ in the core block, the number of receiver and sender vertices may vary with each hyperedge. For a given set of domains, we can compute the maximum number of preconditions and positive effects of each possible action by analysing their action schemas; this allows us to fix the size of the feature vectors for the receiver and sender vertices. We convert the set of input receiver (resp. sender) vertex features into a fixed-size vector determined by this maximum, by stacking the vertex features in alphabetical order of their proposition names, and padding the vector with zeros if the required length is not reached.
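The stack-and-pad step can be sketched as follows (an illustrative helper under hypothetical names, not the authors' code):

```python
import numpy as np

# Sketch of the fixed-size packing of a hyperedge's sender (or
# receiver) vertex features: stack the feature vectors of the vertices,
# ordered alphabetically by proposition name, then zero-pad up to the
# maximum number of preconditions (or add effects) in the domain.
def pack_vertex_features(named_features, max_slots, dim):
    """named_features: {proposition name: feature vector of length dim}."""
    ordered = [named_features[n] for n in sorted(named_features)]
    flat = [x for vec in ordered for x in vec]     # stack in order
    flat += [0.0] * (max_slots * dim - len(flat))  # zero padding
    return np.array(flat)

# Two preconditions, padded as if the domain allowed up to three.
packed = pack_vertex_features({"on-a-b": [1.0, 0.0], "clear-a": [0.0, 1.0]},
                              max_slots=3, dim=2)
```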
For the aggregation functions $\rho^{e \to v}$, $\rho^{e \to u}$ and $\rho^{v \to u}$ in the core block of a STRIPS-HGN, we use element-wise summation. We denote the heuristic learned by this configuration of STRIPS-HGN as $h^{\text{HGN}}$.
Training Procedure. We split the training data into bins using quantile binning of the target heuristic values, and use stratified $k$-fold splitting to divide the training set into $k$ folds, each containing approximately the same percentage of samples from each heuristic bin. For each fold $F_i$, we train a STRIPS-HGN using $\mathcal{T} \setminus F_i$ as the training set and $F_i$ as the validation set, and select the network at the epoch which achieved the lowest loss on the validation set $F_i$. Since we train one STRIPS-HGN for each of the $k$ folds, we are left with $k$ separate networks. We select the network which performed best on its validation set as the single representative STRIPS-HGN for an experiment, which we then evaluate on a previously unseen test set. Although $k$-fold splitting is more commonly used for cross-validation, we use it here to reduce potential noise and demonstrate robustness over the training set used. Unless otherwise specified, we use the Adam optimiser with a learning rate of 0.001 and an L2 penalty (weight decay) of 0.00025 [19]. We set the minibatch size to 1, as we found that this resulted in a learned heuristic with the best planning performance and helped the loss converge much faster, despite the 'noisier' training procedure. This may be attributed to the small size of our training sets, which are usually limited to 50–200 samples.
6.2 Domains and Problems Considered
All actions in the domains we consider have unit cost. The problems we train and evaluate on are randomly generated and unique. We consider the following domains:

8-puzzle [9]. Our training and test sets consist of 10 and 50 problems, respectively. Only the initial state varies across these problems.

Sokoban [9]. Our training set consists of 20 problems (10 each of grid sizes 5 and 7), and our test set contains 50 problems (20 each of grid sizes 5 and 7, and 10 of size 8). The number of boxes was set to 2 and the number of walls randomly selected between 3 and 5.

Ferry (https://fai.cs.uni-saarland.de/hoffmann/ff-domains.html). Our training set consists of 9 problems, one for each of the considered parametrisations. Our test set contains 36 larger problems.

Blocksworld [28]. Our training set for Blocksworld consists of 10 problems (5 each with 4 and 5 blocks). We use two evaluation sets in separate experiments: the first consists of 100 problems (20 each with 6, …, 10 blocks), while the second consists of 50 problems (10 each with 4, …, 8 blocks).

Gripper [22]. Our training set for Gripper contains 3 problems (with 1, 2 and 3 balls, respectively). Due to the low number of samples for Gripper (only 20 pairs), we resample the training set to 60 samples using stratified sampling with replacement. The test set consists of 17 problems with larger numbers of balls.

Zenotravel [21]. Our training set consists of 10 problems (5 each with 2 and 3 cities, with 1–4 planes and 2–5 people), while our testing set contains 60 larger problems.
6.3 Experimental Results
Our experiments may be broken down into learning domain-dependent or domain-independent heuristics. For each of the experiments we describe below, we present the results for the number of nodes expanded, CPU time, and deviation from the optimal plan length when using A* (Figure 5). For $h^{\text{HGN}}$, the results are presented as the average and its 95% confidence interval over the 10 repeated experiments. Additionally, the coverage ratio on the testing problems for each heuristic is shown in Table 1. For $h^{\text{HGN}}$, we report the average coverage over the 10 repeated experiments.

                    Blind   $h^{\max}$  $h^{\text{add}}$  LM-cut  $h^{\text{HGN}}$
8-puzzle            1       1       1       1       1
Sokoban             1       1       1       0.96    0.91
Ferry               0.42    0.36    1       0.47    0.77
Seen Blocksworld    0.78    0.68    1       0.97    0.97
Seen Gripper        0.71    0.59    0.59    0.41    0.69
Seen Zenotravel     0.62    0.55    1       0.82    0.6
Unseen Blocksworld  1       1       1       1       0.88
Can we learn domain-dependent heuristics? To evaluate this, we train and test STRIPS-HGNs separately on 8-puzzle, Sokoban and Ferry. For 8-puzzle, we limit the training time for each fold to 10 minutes. For Sokoban, we adjust the numbers of bins and folds and limit the training time within each fold to 20 minutes. For Ferry, we adjust the number of folds, as the training set is quite small (61 samples), and limit the training time for each fold to 3 minutes.
Figure 5 depicts the results of these experiments. Firstly, for 8-puzzle, $h^{\text{HGN}}$ expands fewer nodes than all the baselines, including $h^{\text{add}}$, yet deviates significantly less from the optimal plan length. For Sokoban, $h^{\text{HGN}}$ expands marginally more nodes than $h^{\text{add}}$ and LM-cut, but finds near-optimal plans. This is respectable, as Sokoban is known to be difficult for learning-based approaches, being PSPACE-complete [8]. Finally, for Ferry, $h^{\text{HGN}}$ is able to solve problems of much larger size than the admissible heuristics can, and also obtains a smaller deviation from the optimal than $h^{\text{add}}$. Therefore, STRIPS-HGNs are able to learn domain-dependent heuristics which potentially outperform our baseline heuristics.
Can we learn domain-independent heuristics? To determine whether this is feasible, we train a STRIPS-HGN using data from multiple domains at once: the training set of each domain is binned and stratified into $k$ folds; then, for each $i$, the $i$-th folds of all considered domains are merged into a single fold $F_i$, and $\mathcal{T} \setminus F_i$ is used as the training set. Using this procedure, we train STRIPS-HGNs on the training problems for Blocksworld, Gripper and Zenotravel, and evaluate the networks on the respective test sets for these domains (using one of the two Blocksworld evaluation sets). We limit the training time for each fold to 15 minutes. Notice that each testing domain has been seen by the network during training.
Figure 5 depicts our results. For Blocksworld, $h^{\text{HGN}}$ requires fewer node expansions on average than all the baselines, including $h^{\text{add}}$, which, compared to $h^{\text{HGN}}$, deviates significantly more from the optimal plan length. For Gripper, $h^{\text{HGN}}$ requires remarkably fewer node expansions than the baselines and is able to find solutions to the larger test problems within the limited search time ($h^{\max}$ and LM-cut are occluded in the plot of the number of nodes expanded). For Zenotravel, $h^{\text{HGN}}$ requires fewer node expansions than the blind heuristic, $h^{\max}$ and LM-cut for the more difficult problems. However, at a certain point we are unable to solve more difficult problems due to the expense of computing a single heuristic value. Additionally, $h^{\text{HGN}}$ deviates slightly less from the optimal plan length than $h^{\text{add}}$.
Thus, STRIPSHGNs are capable of learning domainindependent heuristics which generalise to problems from the domains a network has seen during training. This is a very powerful result, as current approaches for learning domainindependent heuristics rely on features derived from existing heuristics, while we are able to learn heuristics from scratch.
Is $h^{\text{HGN}}$ capable of generalising to unseen domains? To determine whether this is the case, we train each STRIPS-HGN on the training problems for Zenotravel and Gripper, and evaluate the network on the test problems for Blocksworld. We use the same training data generation procedure described above for learning domain-independent heuristics, and limit the training time for each fold to 10 minutes.
Notice that Blocksworld is not in the training set, and is thus an unseen domain for $h^{\text{HGN}}$. Figure 5 depicts the results of this experiment (one problem for which $h^{\text{HGN}}$ achieved low coverage is left out, as it skews the plots). We can observe that $h^{\text{HGN}}$ does better than $h^{\max}$ and blind search in terms of the number of node expansions, despite the fact that the network did not see any Blocksworld problems during training. We also ran the experiments with unseen Gripper and unseen Zenotravel (using the other two domains as the training set). The results for these were not as promising as for unseen Blocksworld, but the STRIPS-HGNs still managed to learn a meaningful heuristic: for Gripper, the STRIPS-HGNs perform similarly to the admissible heuristics, with no deviation from the optimal, but do not scale up to large problems; for Zenotravel, STRIPS-HGN performs better than blind search and $h^{\max}$, but is outperformed by LM-cut and $h^{\text{add}}$. This shows that it is possible for $h^{\text{HGN}}$ to generalise to problems from domains it has not seen during training. Unsurprisingly, it suffers a loss in planning performance in comparison to networks trained directly on the unseen domain.
Why is STRIPS-HGN not competitive in terms of CPU time? This may be attributed to our current suboptimal implementation of STRIPS-HGNs and the cost of evaluating the network (i.e., its message passing steps), so there is significant room for improvement in this regard. Despite this, our results show that STRIPS-HGNs are a feasible and effective approach for learning domain-dependent and domain-independent heuristics.
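To make the evaluation cost concrete, the sketch below shows the shape of one hypergraph message-passing step: every hyperedge is updated from its sender vertices, then every vertex aggregates its incoming hyperedges. The real STRIPS-HGN blocks apply learned networks to latent vectors; here the "updates" are simple sums over scalar features, purely to illustrate that each of the (repeated) steps touches every vertex and hyperedge, which is why a single heuristic evaluation is expensive:

```python
# Toy sketch of one hypergraph message-passing step. Features are plain
# floats and the update functions are sums; these stand in for the
# learned latent-vector updates and are an assumption for illustration,
# not the authors' implementation.

def message_passing_step(vertex_feats, hyperedges):
    """vertex_feats: {vertex: float};
    hyperedges: list of (sender_vertices, receiver_vertices, feature)."""
    # 1. Update each hyperedge from the features of its sender vertices.
    new_edges = []
    for senders, receivers, feat in hyperedges:
        new_feat = feat + sum(vertex_feats[v] for v in senders)
        new_edges.append((senders, receivers, new_feat))
    # 2. Update each vertex by aggregating its incoming hyperedge features.
    new_vertices = dict(vertex_feats)
    for senders, receivers, feat in new_edges:
        for v in receivers:
            new_vertices[v] += feat
    return new_vertices, new_edges
```

A recurrent core repeats this step a fixed number of times, so evaluation cost grows linearly with both the number of steps and the size of the hypergraph.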
7 Conclusion and Future Work
We have introduced STRIPS-HGNs, a recurrent encode-process-decode architecture which uses the Hypergraph Networks framework to learn heuristics that generalise not only across states, goals, and object sets, but also across unseen domains. In contrast to existing work on learning heuristics, STRIPS-HGNs are able to learn powerful heuristics from scratch, using only the hypergraph induced by the delete relaxation of the STRIPS problem. This is achieved by leveraging Hypergraph Networks, which allow us to approximate optimal heuristic values by performing message passing over features in a rich latent space. Our experimental results show that STRIPS-HGNs are able to learn domain-dependent and domain-independent heuristics which are competitive, in terms of the number of node expansions required by A*, with h^max, h^add and LM-cut, which are computed over the same hypergraph. This suggests that learning heuristics over hypergraphs is a promising approach that should be investigated in further detail.
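The hypergraph the network consumes is the one described earlier in the paper: vertices are the problem's propositions, and each action becomes a hyperedge from its preconditions to its positive effects, with delete effects dropped. A minimal sketch of that construction, assuming a simple tuple encoding of STRIPS actions (the encoding itself is an assumption, not the authors' data structures):

```python
# Build the delete-relaxation hypergraph of a STRIPS problem:
# vertices = propositions, hyperedges = actions connecting their
# preconditions (tails) to their add effects (heads). Delete effects
# are ignored, which is exactly the delete relaxation.

def delete_relaxation_hypergraph(propositions, actions):
    """actions: list of (name, preconds, add_effects, del_effects, cost)."""
    vertices = set(propositions)
    hyperedges = [
        (name, frozenset(pre), frozenset(add), cost)  # del effects dropped
        for name, pre, add, _del, cost in actions
    ]
    return vertices, hyperedges
```

Heuristics such as h^max, h^add and LM-cut can all be phrased as (approximate) least-cost path computations over this same structure, which is what makes it a natural input representation for learning.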
Potential avenues for future work include using a richer set of input features, such as disjunctive action landmarks or fact landmarks. This may help the network learn a heuristic closer to the optimal and reduce the number of message passing steps required to obtain an informative heuristic estimate. Moreover, the time required to compute a single heuristic value (0.01 to 0.02 seconds) could be reduced significantly by optimising our implementation (e.g., using multiple CPU cores or GPUs, and optimising matrix operations and broadcasting), adapting the number of message passing steps in real time, or pruning the vertices and hyperedges in the hypergraph of the relaxed problem.
Finally, we may investigate how to adapt STRIPS-HGNs for Stochastic Shortest Path problems (SSPs) [5]. Existing heuristics for SSPs either rely on linear programming [32], which can be expensive, or on determinisation, which oversimplifies the probabilistic actions. It may be possible to use Hypergraph Networks to learn an informative heuristic that preserves the probabilistic structure of actions by deriving suitable hypergraphs from factored SSPs.

References
 [1] (2011) Pyperplan.
 [2] (2010) Bootstrap Learning of Heuristic Functions. In Symposium on Combinatorial Search.
 [3] (2018) Classical Planning in Deep Latent Space: Bridging the Subsymbolic-Symbolic Boundary. In AAAI.
 [4] (2018) Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261.
 [5] (1991) An Analysis of Stochastic Shortest Path Problems. Mathematics of Operations Research 16 (3), pp. 580–595.
 [6] (2001) Planning as heuristic search. Artificial Intelligence 129 (1–2), pp. 5–33.
 [7] (2009) The factored policy-gradient planner. Artificial Intelligence 173 (5–6), pp. 722–747.
 [8] (1997) Sokoban is PSPACE-complete. Technical report, University of Alberta.
 [9] (2011) The first learning track of the international planning competition. Machine Learning 84 (1–2), pp. 81–107.
 [10] (1971) STRIPS: A new approach to the application of theorem proving to problem solving. Artificial Intelligence 2 (3–4), pp. 189–208.
 [11] (2019) Size Independent Neural Transfer for RDDL Planning. In ICAPS.
 [12] (2017) Neural Message Passing for Quantum Chemistry. In ICML.
 [13] (2017) Towards learning domain-independent planning heuristics. In IJCAI Workshop on Architectures for Generality and Autonomy.
 [14] (2018) Learning Generalized Reactive Policies using Deep Neural Networks. In AAAI Spring Symposium.
 [15] (2018) Relational inductive bias for physical construction in humans and machines. arXiv preprint arXiv:1806.01203.
 [16] (2009) Landmarks, Critical Paths and Abstractions: What's the Difference Anyway?. In ICAPS.
 [17] (2006) The Fast Downward Planning System. Journal of Artificial Intelligence Research 26, pp. 191–246.
 [18] (2018) Training Deep Reactive Policies for Probabilistic Planning Problems. In ICAPS.
 [19] (2014) Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980.
 [20] (2014) Efficient Mini-batch Training for Stochastic Optimization. In KDD.
 [21] (2003) The 3rd international planning competition: results and analysis. Journal of Artificial Intelligence Research 20, pp. 1–59.
 [22] (2000) The AIPS-98 planning competition. AI Magazine 21 (2), pp. 13.
 [23] (2013) Rectifier Nonlinearities Improve Neural Network Acoustic Models. In ICML.
 [24] (2010) The LAMA Planner: Guiding Cost-Based Anytime Planning with Landmarks. Journal of Artificial Intelligence Research 39, pp. 127–177.
 [25] (2008) Learning from Multiple Heuristics. In AAAI.
 [26] (2017) Nonlinear Hybrid Planning with Deep Net Learned Transition Models and Mixed-Integer Linear Programming. In IJCAI.
 [27] (2019) Deep Learning for Cost-Optimal Planning: Task-Dependent Planner Selection. In AAAI.
 [28] (2001) Blocks world revisited. Artificial Intelligence 125 (1–2), pp. 119–153.
 [29] (2011) Learning Inadmissible Heuristics During Search. In ICAPS.
 [30] (2018) Action Schema Networks: Generalised Policies With Deep Learning. In AAAI.
 [31] (2019) ASNets: Deep Learning for Generalised Planning. arXiv preprint arXiv:1908.01362.
 [32] (2017) Occupation measure heuristics for probabilistic planning. In ICAPS.