1 Introduction
Classical planning is concerned with finding plans, that is, sequences of actions that, when applied to an initial condition specified by a set of logical predicates, bring the environment to a state that satisfies a set of goal predicates. This is usually done by a heuristic search procedure, and the resulting plan applies only to the specific instance that was solved. A stronger outcome would be a higher-level plan that solves many instances belonging to the same domain, which therefore share an underlying structure. The study of methods that discover such higher-level plans is called generalized planning. Generalized plans do not necessarily exist for all classical planning domains, but where they do, finding one could obviate the need for compute-intensive search when we only wish to find a goal-satisfying solution.
To give an example of such a generalized plan, consider a simplified Blocksworld domain. In this domain there are unique blocks that can be either stacked on each other or strewn about the floor, and the task is to stack and unstack blocks so as to reach a goal configuration from an initial configuration. Finding a plan that does so in an optimal number of steps is NP-hard [gupta1992complexity], but finding a plan that satisfies the goal regardless of cost can be done in polynomial time in the following manner:
Unstack all the blocks so that they are scattered on the floor.

Stack the blocks according to the goal configuration, beginning with the lowest blocks.
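The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the learned policy studied in this work: the dictionary-based state encoding, the action names `unstack` and `stack`, and the function name are all assumptions made for the example, and the goal is assumed to be a valid, fully specified configuration.

```python
def generalized_blocksworld_plan(on, goal_on):
    """Return a (non-optimal) plan and the final state.

    `on` and `goal_on` map each block to what it rests on: either
    another block or the string "table". Hypothetical encoding.
    """
    on = dict(on)  # work on a copy of the initial state
    plan = []

    # Phase 1: move every clear block that is not on the table to the table.
    while any(below != "table" for below in on.values()):
        supports = set(on.values())
        for block, below in sorted(on.items()):
            if below != "table" and block not in supports:  # block is clear
                plan.append(("unstack", block, below))
                on[block] = "table"
                break

    # Phase 2: build the goal towers bottom-up.
    placed = {b for b, below in goal_on.items() if below == "table"}
    remaining = set(goal_on) - placed
    while remaining:
        for block in sorted(remaining):
            if goal_on[block] in placed:  # its support is already final
                plan.append(("stack", block, goal_on[block]))
                on[block] = goal_on[block]
                placed.add(block)
                remaining.remove(block)
                break
    return plan, on


initial = {"a": "b", "b": "c", "c": "table"}
goal = {"c": "b", "b": "a", "a": "table"}
plan, final_state = generalized_blocksworld_plan(initial, goal)
```

The same function works unchanged for any number of blocks, which is what makes it a generalized plan rather than a solution to one instance.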
This strategy is not optimal, since we might unstack blocks that are already in their proper place according to the goal specification, but it yields a goal-satisfying plan for every instance in this simplified Blocksworld domain. Such a generalized strategy can also be thought of as a policy, which raises the possibility of learning it through reinforcement learning. Machine learning theory often assumes that the training data distribution is representative of the test data distribution, which justifies the expectation that models generalize well to the test data. In generalized planning this is not the case, as test instances can be much larger than the training instances and thus far outside the training distribution. In this work we show that the right inductive bias, in the form of a neural network architecture, can lead to models that effectively learn policies akin to general principles and that solve problems orders of magnitude larger than those encountered during training.
2 Background
2.1 Classical Planning
Classical planning uses a formal description language called the Planning Domain Definition Language (PDDL) [mcdermott1998pddl], derived from the STRIPS modeling language [fikes1971strips], to define problem domains and their corresponding states and goals. We are concerned with satisficing planning tasks, which can be defined by a tuple $\langle F, O, s_0, G \rangle$, where $F$ is a set of propositions (or predicates) that describe the properties of the objects present in the task instance and their relations, $O$ is a set of operators (or action types), $s_0$ is the initial state and $G$ is a set of goal states. Each action type $o \in O$ is defined by a triple $\langle \mathrm{pre}(o), \mathrm{add}(o), \mathrm{del}(o) \rangle$, where the preconditions $\mathrm{pre}(o)$ are a set of predicates that must be true for the action to be applicable, $\mathrm{add}(o)$ is a set of predicates that the action makes true upon application and $\mathrm{del}(o)$ is a set of predicates that the action makes false upon application. We seek a plan, a sequence of actions that when applied leads to a state $s$ for which $s \in G$, within some time limit or a predefined number of steps. Plans are often found by heuristic search methods; in this work, however, we focus on learning reactive planning policies that train on instances of a specific domain and then generalize to new, unseen instances of that same domain.
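To make the semantics of preconditions, add effects and delete effects concrete, the following minimal sketch applies a grounded action to a state represented as a set of true predicates. The class and function names are hypothetical, the predicate strings are illustrative, and the goal is expressed here as a set of predicates that must hold, which is an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    pre: frozenset     # predicates that must be true before applying
    add: frozenset     # predicates made true by the action
    delete: frozenset  # predicates made false by the action


def applicable(state, action):
    """An action is applicable when all its preconditions hold."""
    return action.pre <= state


def apply_action(state, action):
    """Return the successor state: remove delete effects, then add add effects."""
    if not applicable(state, action):
        raise ValueError(f"{action.name} is not applicable")
    return (state - action.delete) | action.add


def satisfies(state, goal):
    """A state satisfies the goal when every goal predicate is true in it."""
    return goal <= state


state = frozenset({"clear a", "ontable a", "handempty"})
pick_up_a = Action(
    name="pick-up a",
    pre=frozenset({"clear a", "ontable a", "handempty"}),
    add=frozenset({"holding a"}),
    delete=frozenset({"clear a", "ontable a", "handempty"}),
)
next_state = apply_action(state, pick_up_a)
```

A plan is then simply a list of such actions applied in order, checked with `satisfies` at the end.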
2.2 Reinforcement Learning
Reinforcement learning (RL) is a branch of machine learning that deals with learning policies for sequential decision-making problems. RL algorithms most often assume the problem can be modelled as a Markov Decision Process (MDP), which in the finite-horizon case is defined by a tuple $(\mathcal{S}, \mathcal{A}, r, P, T, \rho_0)$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, $r$ is a reward function that maps states or state-action pairs to a scalar reward, $P$ is the transition probability function such that $P(s_{t+1} \mid s_t, a_t) \in [0, 1]$, $T$ is the task horizon and $\rho_0$ is the distribution over initial states. The value of a policy $\pi$ in the finite-horizon RL problem is:
$$V(\pi) = \mathbb{E}_{\tau \sim p(\tau \mid \pi)}\left[\sum_{t=0}^{T} r(s_t, a_t)\right] \tag{1}$$
where $\tau$ are trajectories sampled from the distribution induced by the policy $\pi$, the initial state distribution $\rho_0$ and the transition function $P$, and $r(s_t, a_t)$ is the reward received after taking action $a_t$ at state $s_t$. The learning problem can thus be formalized as an optimization problem, in which we wish to find the best policy:
$$\pi^{*} = \arg\max_{\pi} V(\pi) \tag{2}$$
In the case of large state and action spaces, we cannot hope to represent our policy as a table, and are thus forced to use function approximators to represent the policy, with some parameters $\theta$. We focus on stochastic policies, which map states and actions to probabilities, such that $\pi_\theta(a \mid s) \in [0, 1]$, and use policy gradient based methods to optimize our policies [williams1992simple]. Policy gradient methods estimate the gradient of the objective function with respect to the policy parameters using Monte Carlo sampling. The gradient of the RL objective is:
$$\nabla_\theta V(\pi_\theta) = \mathbb{E}_{\tau \sim p(\tau \mid \pi_\theta)}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{R}_t\right] \tag{3}$$
where $\hat{R}_t = \sum_{t'=t}^{T} r(s_{t'}, a_{t'})$ is the "return to-go". When implementing policy gradient methods, we can estimate the policy gradient by taking the gradient of a "pseudo-loss", computed using sampled trajectories:
$$L(\theta) = \frac{1}{|\mathcal{D}_k|} \sum_{\tau \in \mathcal{D}_k} \sum_{t=0}^{T} \log \pi_\theta(a_t \mid s_t)\, \hat{R}_t \tag{4}$$
where $\mathcal{D}_k$ is a collection of trajectories sampled at iteration $k$ of the algorithm. We can optimize our policy by gradient ascent using the following equation, in which $\alpha$ is a step size:
$$\theta_{k+1} = \theta_k + \alpha\, \nabla_\theta L(\theta)\big|_{\theta = \theta_k} \tag{5}$$
This kind of algorithm is "on-policy", which means that the data used to update the policy must be generated by the same policy parameters. This requires the algorithm to discard all the data it gathered after each update and collect new data for the next update, which makes on-policy algorithms data inefficient.
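As a small numerical illustration of the return to-go and the pseudo-loss described above, the following sketch computes both for trajectories given as lists of (log-probability, reward) pairs. The function names and the data layout are assumptions made for the example; in a real implementation the log-probabilities would be differentiable tensors so that the gradient of the pseudo-loss can be taken automatically.

```python
import math


def returns_to_go(rewards):
    """Undiscounted sum of rewards from each time step to the end."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running += r
        out.append(running)
    return out[::-1]


def pseudo_loss(trajectories):
    """Average over trajectories of sum_t log pi(a_t | s_t) * return_to_go_t.

    Each trajectory is a list of (log_prob, reward) pairs (hypothetical layout).
    """
    total = 0.0
    for trajectory in trajectories:
        rtg = returns_to_go([reward for _, reward in trajectory])
        total += sum(log_p * g for (log_p, _), g in zip(trajectory, rtg))
    return total / len(trajectories)


# A two-step episode with a sparse terminal reward of 1.
episode = [(math.log(0.5), 0.0), (math.log(0.5), 1.0)]
loss = pseudo_loss([episode])
```

With a sparse terminal reward, every step of a successful episode receives the same return to-go, so all of its actions are reinforced equally.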
2.3 Proximal Policy Optimization
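Proximal Policy Optimization (PPO) [schulman2017proximal] is a policy gradient based algorithm that seeks to better exploit the data gathered during the learning process, by performing several gradient updates on the collected data before discarding it to collect more. In order to avoid stability issues that could arise from large policy updates, PPO uses a special clipped objective to discourage divergence between the current policy and the data collection policy, which defines the following optimization problem:
$$\theta_{k+1} = \arg\max_{\theta}\; \mathbb{E}_{s, a \sim \pi_{\theta_k}}\left[\min\left(\frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)}\, A^{\pi_{\theta_k}}(s, a),\; \mathrm{clip}\left(\frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)},\, 1 - \epsilon,\, 1 + \epsilon\right) A^{\pi_{\theta_k}}(s, a)\right)\right] \tag{6}$$
where $\mathrm{clip}$ is a function that clips the values of its input to be between the specified minimum and maximum values, $\pi_\theta$ is the policy we are currently optimizing, $\pi_{\theta_k}$ is the policy used to collect the data (before updating) and $A^{\pi_{\theta_k}}(s, a)$ is the advantage of the action given the current state and parameters:
$$A^{\pi_{\theta_k}}(s_t, a_t) = \hat{R}_t - V_{\phi_k}(s_t) \tag{7}$$
where the dependency on the action comes from the empirical return to-go $\hat{R}_t$, which depends on the specific actions that were taken by the policy. $V_{\phi}(s_t)$ is a state value predicted by some function approximator with parameters $\phi$, obtained at each iteration by solving:
$$\phi_k = \arg\min_{\phi}\; \mathbb{E}_{s_t \sim \pi_{\theta_k}}\left[\left(V_\phi(s_t) - \hat{R}_t\right)^2\right] \tag{8}$$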
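The clipped objective and the advantage estimate can be illustrated with plain numbers. This is a minimal sketch under stated assumptions: the function names are hypothetical, the inputs are per-sample probabilities rather than network outputs, and the clipping parameter defaults to 0.2.

```python
def advantages(returns, values):
    """Advantage as empirical return to-go minus the predicted state value."""
    return [g - v for g, v in zip(returns, values)]


def clipped_surrogate(new_probs, old_probs, advs, epsilon=0.2):
    """Mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) over samples."""
    total = 0.0
    for p_new, p_old, adv in zip(new_probs, old_probs, advs):
        ratio = p_new / p_old
        clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
        total += min(ratio * adv, clipped * adv)
    return total / len(advs)


# The ratio is 1.5; with a positive advantage the gain is capped at 1 + epsilon.
capped = clipped_surrogate([0.6], [0.4], [1.0])
# With a negative advantage the unclipped (more pessimistic) term is kept.
uncapped = clipped_surrogate([0.6], [0.4], [-1.0])
```

The asymmetry in the two cases is what removes the incentive to move the new policy far from the data collection policy.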
3 Learning Generalized Policies
3.1 State Representation
We chose to represent the states in our framework as graphs, with features encoding the properties and relations between the objects in a given state. Our framework operates on problem domains specified by the PDDL modeling language, in which problem instances are defined by a list of objects and a list of predicates that describe the properties of these objects and the relations between them at the current state. We limit ourselves to domains for which predicates have an arity of no more than two, which is not a significant limitation since higher arity predicates can in many cases be decomposed to several lower arity predicates.
Our graphs are composed of global features, node features and edge features, as in [battaglia2018relational]. We denote our global features $u$, our nodes $V$ and our edges $E$. Global features represent properties of the problem instance or entities that are unique for the domain, such as the hand in the Blocksworld domain, and are determined by the 0-arity predicates of the domain. Node features represent properties of the objects in the domain, such as their type, and are determined by the 1-arity predicates. Lastly, edge features represent relations between the objects and are determined by the 2-arity predicates.
When producing a graph representation of a PDDL instance state, a complete graph is produced with a node for each object in the state. For each predicate in the state, the corresponding feature is assigned a binary value of 1, and all other features are assumed to be false with a value of 0. In order to include the goal configuration in the input to the neural network, the goal predicates are treated almost as if they were another state graph, and the two graphs are concatenated to form a single representation of the state-goal. The difference between state graphs and goal graphs is that in a goal graph a 0-valued feature means that it contributes no goal, whereas in a state graph a 0-valued feature means that the predicate is false.
The classical planning domains used throughout this work are deterministic and Markovian, meaning that the current state holds all the information required to solve the problem optimally. Despite this property, we found that adding past states in addition to the current one helps the learning process and improves generalization to larger instances. While this is not strictly essential, our experiments suggest that it helps the policy mitigate "back-and-forth" behavior to some extent, which is especially helpful on the larger instances, where the policy is more prone to make mistakes and then attempt to correct them. Adding this history is straightforward: we concatenate the graphs for the $K$ previous states and the current state, and then concatenate the goal graph as described above. We tested several such history horizons, and found that adding only the last state gives the best overall performance and generalization. An example of a state-goal graph from the Blocksworld domain can be seen in Figures 1 and 2, showing an instance with 3 blocks.
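The encoding described above can be sketched as follows for a two-block Blocksworld state. The fact representation (tuples of predicate name and arguments), the function names, and the choice to omit self-loops from the complete graph are assumptions made for this illustration.

```python
def encode(objects, facts, global_preds, node_preds, edge_preds):
    """Binary features for one set of facts, grouped by predicate arity.

    `facts` is a set of tuples such as ("handempty",), ("clear", "a"),
    ("on", "a", "b"). Hypothetical representation.
    """
    u = [1 if (p,) in facts else 0 for p in global_preds]
    nodes = {o: [1 if (p, o) in facts else 0 for p in node_preds]
             for o in objects}
    edges = {(a, b): [1 if (p, a, b) in facts else 0 for p in edge_preds]
             for a in objects for b in objects if a != b}
    return u, nodes, edges


def encode_state_goal(objects, state, goal, global_preds, node_preds, edge_preds):
    """Concatenate state features and goal features component by component."""
    su, sn, se = encode(objects, state, global_preds, node_preds, edge_preds)
    gu, gn, ge = encode(objects, goal, global_preds, node_preds, edge_preds)
    return (su + gu,
            {o: sn[o] + gn[o] for o in objects},
            {k: se[k] + ge[k] for k in se})


objects = ["a", "b"]
state = {("on", "a", "b"), ("clear", "a"), ("ontable", "b"), ("handempty",)}
goal = {("on", "b", "a")}
u, nodes, edges = encode_state_goal(
    objects, state, goal,
    global_preds=["handempty"],
    node_preds=["clear", "ontable"],
    edge_preds=["on"],
)
```

In each feature vector the first half describes the current state and the second half the goal, so a zero in the second half means only that no goal is imposed on that predicate.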
3.2 Graph Embedding
In order to learn good policies using the graph representations of state-goals, we first use a Graph Neural Network (GNN) to embed the node, edge and global features of the graph in their respective latent spaces. The GNN performs message passing between the different components of the graph, allowing useful information to flow. We use two different types of GNN blocks; each enforces a different style of information flow within the graph and is thus better suited to certain problem domains than others. In both types the update order is similar and takes the following common form:

Edges are updated using the previous edges and the "origin" nodes of those edges.

Nodes are updated using the previous nodes, the incoming updated edges and the global features.

Globals are updated using the previous globals and the aggregation of the updated nodes.
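The three-step update order above can be sketched in plain Python. This is an illustrative sketch only, not the exact block used in this work: the single ReLU layer per update, the element-wise max used as the aggregation, the layer sizes, and all names are assumptions.

```python
import random


def init_layer(n_in, n_out, rng):
    """Random weight matrix (n_out rows of n_in entries) and zero bias."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def relu_linear(weights, bias, x):
    """One fully connected layer followed by a ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def elementwise_max(vectors, dim):
    """Aggregate a list of vectors by taking the max of each coordinate."""
    if not vectors:
        return [0.0] * dim
    return [max(column) for column in zip(*vectors)]


def message_passing_step(nodes, edges, u, params):
    """nodes: {id: vec}, edges: {(origin, target): vec}, u: global vec."""
    (w_e, b_e), (w_v, b_v), (w_u, b_u) = params
    # 1. Edges: previous edge features and the features of the origin node.
    new_edges = {(s, d): relu_linear(w_e, b_e, e + nodes[s])
                 for (s, d), e in edges.items()}
    # 2. Nodes: previous node, aggregated incoming updated edges, globals.
    new_nodes = {}
    for i, v in nodes.items():
        incoming = [e for (s, d), e in new_edges.items() if d == i]
        pooled = elementwise_max(incoming, len(b_e))
        new_nodes[i] = relu_linear(w_v, b_v, v + pooled + u)
    # 3. Globals: previous globals and the aggregation of updated nodes.
    pooled_nodes = elementwise_max(list(new_nodes.values()), len(b_v))
    new_u = relu_linear(w_u, b_u, u + pooled_nodes)
    return new_nodes, new_edges, new_u


rng = random.Random(0)
hidden = 3
params = (init_layer(1 + 2, hidden, rng),           # edge: edge + origin node
          init_layer(2 + hidden + 1, hidden, rng),  # node: node + edges + u
          init_layer(1 + hidden, hidden, rng))      # global: u + pooled nodes
nodes = {0: [1.0, 0.0], 1: [0.0, 1.0]}
edges = {(0, 1): [1.0], (1, 0): [0.0]}
new_nodes, new_edges, new_u = message_passing_step(nodes, edges, [1.0], params)
```

Because every update is shared across all nodes and edges, the same parameters apply to graphs with any number of objects.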
The first block type we used is similar to the one described in [battaglia2018relational] which we name accordingly Graph Network block (GN block). Mathematically, this block performs the following operations:
(9) 
(10) 
(11) 
(12) 
(13) 
In the above notation, $\sigma$ is a nonlinearity such as the Rectified Linear Unit, the aggregation operator is a node-wise max-pooling operation, and the $W$ and $b$ terms are the respective weight matrices and biases. In the GN block, nodes receive messages from their neighbouring nodes indiscriminately, which works well for propagating general information across the graph but makes it harder to transfer specific bits of information when needed.
The second type of block was designed to address that shortcoming of the GN block, and for that purpose was endowed with an attention mechanism. We named the second block the Graph Network Attention block (GNAT block); unlike the Graph Attention Network of [velivckovic2017graph], it uses an attention mechanism similar to that of the Transformer model of [vaswani2017attention]. This block performs the following operations:
(14) 
(15) 
(16) 
(17) 
(18) 
(19) 
(20) 
(21) 
In the above notation, the aggregation operator is a node-wise summation, $\odot$ is the Hadamard product, and the $W$ and $b$ terms are the respective weight matrices and biases. As mentioned above, this type of block allows certain bits of information to travel through the graph in a more deliberate manner, by endowing the nodes with the ability to focus on specific messages. When constructing our GNN model, we can stack several blocks of these types (and combinations of them) to attain a deeper graph embedding capacity. In most of our experiments we used two blocks, either two successive GN blocks or a GNAT block followed by a GN block. Each configuration excelled at a different group of problems, as we show in the experiments section.
3.3 Policy Representation
Unlike common reinforcement learning benchmarks where the set of actions is fixed and can be conveniently handled by standard neural network architectures, in classical planning problems the set of actions is state dependent and varies in size between states. In PDDL, each domain description defines a set of action types that can be instantiated by grounding those action types to the state. Each action type receives a set of arguments, and in order to be applicable the arguments of the action must conform to a set of preconditions. For example, the Blocksworld domain has an action type called "pickup" which gets a single block object as an argument. This block must be "clear", "ontable" and the "armempty" property must be true for the action to be applicable. All blocks that comply with these preconditions can be picked up, and represent a unique action. In addition to preconditions, each action type also has effects which are caused to the states upon application of the action. Some of these effects could be positive (certain predicates of the state will take a true value) and some negative (predicates will assume a false value).
At each step of planning, the successor-state generator gives the current state and a list of applicable actions. In order to represent the actions in a meaningful way that enables learning a policy over them, we chose to describe the actions in terms of their effects, since these are the essential components needed to make decisions. Since the successor-state generator provides the agent with all the legal actions at each step, we ignored the preconditions (all legal actions satisfy the preconditions). Each action is composed of several effects, each concerning a different aspect of the state and each either positive or negative. The effects are clustered together based on their type (global effect, node effect or edge effect), and are represented as a concatenation of the embedding of the respective component and a one-hot vector describing which predicate is changed and whether the change is positive or negative. This one-hot vector has the dimension of the corresponding input component (the node feature dimension for node effects, for example) and contains either 1 for positive effects or -1 for negative effects at the appropriate predicate location. Each effect is transformed by a multi-layer perceptron (MLP) according to its type, and the transformed effects are then scattered back to their origin actions. The effects of each action are aggregated to form a single vector representation of that action, which is eventually fed to the policy neural network. Figure 4 illustrates the process of action representation.
The final policy is an MLP that outputs a single scalar for each action, and these scalars are then normalized by a softmax operation to obtain a discrete distribution over the actions. In addition, another MLP takes the final global feature embeddings of the graph and outputs the predicted value of the state, to be used for advantage estimation in the RL algorithm.
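Two pieces of this representation are easy to show in isolation: the signed one-hot encoding of a single effect, and the softmax that turns one scalar per applicable action into a distribution whose size follows the number of actions. The sketch below assumes +1 for positive and -1 for negative effects, omits the per-type MLPs and the aggregation step, and uses hypothetical function names.

```python
import math


def effect_vector(component_embedding, predicate_index, num_predicates, positive):
    """Concatenate a component embedding with a signed one-hot effect marker."""
    marker = [0.0] * num_predicates
    marker[predicate_index] = 1.0 if positive else -1.0
    return component_embedding + marker


def action_distribution(scores):
    """Softmax over however many actions are applicable in this state."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# A negative effect on the second of three node predicates.
effect = effect_vector([0.5, 0.2], predicate_index=1, num_predicates=3,
                       positive=False)
# Three applicable actions in one state, two in another.
probs_three = action_distribution([1.0, 2.0, 3.0])
probs_two = action_distribution([0.0, 0.0])
```

Since the softmax is applied over the list of scores rather than a fixed output layer, no change is needed when the number of applicable actions varies between states.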
3.4 Training Procedure
Since the focus of this work was finding feasible plans, we chose to model our problem as a sparse reward problem with a binary reward. If the agent satisfies all the goals within a predefined horizon length, it gets a reward of 1, and otherwise it gets no reward. To determine an appropriate time limit we used the commonly used $h^{FF}$ heuristic [hoffmann2001ff], which solves a relaxed version of the problem in linear time (the relaxed problem has no negative effects). We take the length of the relaxed plan and multiply it by a constant factor of 5 to get the horizon length.
To train our policy we chose to use Proximal Policy Optimization (PPO) [schulman2017proximal] for its simplicity and good performance. To handle the problem of sparse rewards we initially experimented with using Hindsight Experience Replay DQN [andrychowicz2017hindsight] due to its demonstrated ability to tackle sparse goal reaching problems, but found that it introduced a lot of bias and resulted in unsatisfactory performance. To allow our policy to learn from a sparse binary reward, we resorted to a simpler method; we generated each training episode from a distribution over instance sizes, which includes sizes small enough to be occasionally solved by a randomly initialized policy. Doing this allows the policy to progress to eventually solve all the instance sizes in the distribution, without the need for a manual tuning of a curriculum. Although setting this distribution needs to be done manually, we found it very easy and quick to do by simple trial and error with a random untrained neural network.
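The episode setup described in this section (sampling an instance size, deriving the horizon from the relaxed plan length, and assigning the binary reward) can be sketched as follows. The function names, the example sizes and the sampling weights are assumptions for illustration; the relaxed plan length would come from an external heuristic computation that is not shown.

```python
import random


def sample_instance_size(sizes, weights, rng):
    """Draw one instance size from a hand-set distribution over sizes."""
    return rng.choices(sizes, weights=weights, k=1)[0]


def horizon_length(relaxed_plan_length, factor=5):
    """Horizon is the relaxed plan length times a constant factor."""
    return factor * relaxed_plan_length


def episode_reward(goal_reached, steps_taken, horizon):
    """Binary sparse reward: 1 only if all goals are met within the horizon."""
    return 1.0 if goal_reached and steps_taken <= horizon else 0.0


rng = random.Random(0)
size = sample_instance_size([2, 3, 4], weights=[0.2, 0.3, 0.5], rng=rng)
horizon = horizon_length(relaxed_plan_length=7)
```

Including a few sizes that a random policy can occasionally solve is what provides the first nonzero rewards from which learning can start.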
We made several small adjustments to the standard PPO algorithm which improved performance in our case. Many RL algorithm implementations roll out the policy for a fixed number of steps before updating the model parameters, often terminating episodes before completion in the process, and use methods such as Generalized Advantage Estimation [schulman2015high] and bootstrapped value estimates to estimate the returns, as in [mnih2016asynchronous]. We found that these elements add unwanted bias to our learning process, and instead rolled out each episode until termination, using empirical returns instead of bootstrapped value estimates to compute advantages. We also found that using many rollouts and large batch sizes helped stabilize the learning process and resulted in better final performance, and so we performed 100 episode rollouts and used the resulting data to update the model parameters at each iteration of the learning algorithm.
3.5 Planning During Inference
To improve the ability of our generalized policies to use additional time at test time, we use them within a search algorithm, as was done in many other works such as [silver2016mastering] and [anthony2017thinking]. This type of synthesis achieved great success in zero-sum games such as Go and Chess [silver2017mastering], where a deep neural network policy was used in conjunction with a Monte Carlo Tree Search algorithm, which prompted other authors to do the same even for non-game problems [abe2019solving]. We take a different approach and design our search algorithm specifically for the case of deterministic planning problems with a strong reactive policy. Our algorithm is based on the classic Greedy Best-First Search (GBFS) algorithm, but augments it in several key ways. In standard GBFS, a search tree is constructed from the root node, and at each iteration the node with the best heuristic estimate is extracted from the open list and expanded, and its child nodes are added to the open list; this procedure is repeated until a goal node is found or time runs out. Our algorithm, which we name GBFS-GNN, performs a similar procedure, but uses the policy and value functions to compute a heuristic value for each node, and performs a full rollout for each expanded node. The children of the expanded node are added to the open list, but the other nodes encountered during the rollout are not, to avoid rapid growth in memory consumption on large problems. Each node in our search tree represents a state-action pair, and we use the following heuristic estimate for each node:
(22) 
In this equation, the heuristic estimate of the state-action pair is computed from the probability of the action under our policy, the estimated state value according to the critic part of our neural network, and the entropy of the policy's distribution over actions at the state. Figure 5 illustrates our search algorithm.
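The overall search loop can be sketched as below on a toy deterministic problem. This is a hedged illustration rather than the GBFS-GNN implementation: the heuristic is passed in as an arbitrary function (here simply the negated action probability), rollouts greedily follow the most probable action, and the function names, signatures and the toy domain are all assumptions.

```python
import heapq
import itertools


def gbfs_with_rollouts(start, is_goal, successors, policy, heuristic,
                       horizon, max_expansions=1000):
    """Best-first search in which each expanded node triggers a full rollout.

    successors(state) -> list of (action, next_state)
    policy(state, actions) -> list of probabilities, one per action
    heuristic(state, action, prob) -> priority, lower is expanded first
    """
    tie = itertools.count()  # tie-breaker so states are never compared
    open_list = [(0.0, next(tie), start, [])]
    seen = {start}

    def greedy_rollout(state, path):
        for _ in range(horizon):
            if is_goal(state):
                return path
            succ = successors(state)
            if not succ:
                return None
            probs = policy(state, [a for a, _ in succ])
            best = max(range(len(succ)), key=lambda i: probs[i])
            action, state = succ[best]
            path = path + [action]
        return path if is_goal(state) else None

    for _ in range(max_expansions):
        if not open_list:
            return None
        _, _, state, path = heapq.heappop(open_list)
        plan = greedy_rollout(state, path)
        if plan is not None:
            return plan
        succ = successors(state)
        if not succ:
            continue
        # Only direct children enter the open list, not rollout states.
        probs = policy(state, [a for a, _ in succ])
        for (action, nxt), p in zip(succ, probs):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(open_list, (heuristic(state, action, p),
                                           next(tie), nxt, path + [action]))
    return None


# Toy domain: walk along the integers 0..10 and reach 5.
def successors(s):
    return [(a, n) for a, n in (("inc", s + 1), ("dec", s - 1)) if 0 <= n <= 10]


def good_policy(state, actions):
    return [0.9 if a == "inc" else 0.1 for a in actions]


def bad_policy(state, actions):
    return [0.1 if a == "inc" else 0.9 for a in actions]


def neg_prob(state, action, prob):
    return -prob


plan_good = gbfs_with_rollouts(0, lambda s: s == 5, successors, good_policy,
                               neg_prob, horizon=6)
plan_bad = gbfs_with_rollouts(0, lambda s: s == 5, successors, bad_policy,
                              neg_prob, horizon=6)
```

With the good policy the first rollout already reaches the goal, so no search is needed; with the poor policy the open list gradually compensates for the failed rollouts.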
4 Related Work
Learning to plan has been an active topic of research for many years, with different methods attempting to learn different aspects of a complete solver. Some works attempted to learn heuristic values of states for specific domains using features generated by other domain-independent heuristics, such as [yoon2006learning], which learns heuristic values by regression. More recent works such as [garrett2016learning] learn to rank successor states by using RankSVM [joachims2002optimizing]. These types of methods do not explicitly use the state or goal information from the problem description, but rather learn from hand-crafted features, and in addition do not learn an explicit planning policy over the available actions. In contrast, our method learns planning policies over explicit states and goals that directly choose which actions to take.
Other works such as [tamar2016value], [groshev2018learning] and [guez2019investigation] learn an explicit planning policy over actions, using the actual state of the problem as input and a deep convolutional neural network, but rely on having a visual representation of the problem. This limits their usage to domains where a visual representation is available. Another limitation is that [tamar2016value] and [groshev2018learning] rely in addition on successful plans generated by a planning algorithm and learn policies using imitation learning, while [guez2019investigation] use reinforcement learning for this purpose. Our work does not rely on visual representations or on successful plans generated by planning algorithms, but learns directly from a PDDL representation of the problem by trial and error via deep reinforcement learning.
Some works have begun to study the use of graph representations of states and of different kinds of graph neural networks for the task of learning policies or heuristics. In [toyer2018action] the authors proposed a unique kind of neural network called the Action Schema Network (ASNet), which consists of alternating action layers and proposition layers, to learn planning policies. They represent their state as a graph in which objects and actions are connected and propagate information back and forth to finally output a probability distribution over actions. They train their ASNets by imitating plans generated by other planners, and augment the input with domain-independent heuristic values to improve performance. In their experiments, they focus mainly on stochastic planning problems and demonstrate that their trained policies can generalize to instances larger than those they were trained on. A limitation of ASNets is their fixed receptive field, which limits their ability to reason over long chains; our work does not share this limitation.
In a recent paper, [shen2019learning] propose an extension of [battaglia2018relational] to hypergraphs, and use it to learn heuristics over hypergraphs that represent the delete-relaxation states of planning problems. They use supervised learning of optimal heuristic values generated by a planning algorithm, and then use the resulting neural network as a heuristic function within a search algorithm. In contrast, our method focuses on learning policies, which can be more time-efficient during evaluation since a single forward pass of the neural network is needed to make a decision at each state. Using heuristic estimates requires evaluating all the successor states of a state in order to choose the best action, which could potentially increase runtime. Another difference is that our work operates directly on states instead of delete relaxations; since delete relaxations omit some information, they might limit the power of the learned heuristics. Overviews of older methods of learning to plan can be found in [minton2014machine], [zimmerman2003learning] and [fern2011first].
5 Experiments
5.1 Domains
We evaluate our approach on five common classical planning domains, chosen from the IPC planning competition collection of domain generators, whose predicates all have arity no larger than 2:

Blocksworld (4 op): A robotic arm must move blocks from an initial configuration in order to arrange them according to a goal configuration.

Satellite: A fleet of satellites must take images of locations, each with a specified type of sensor.

Logistics: Packages must be delivered to target locations, using airplanes and trucks to move them between cities and locations.

Gripper: A twinarmed robot must deliver balls from room A to room B.

Ferry: A ferry must transport cars from initial locations to designated target locations.
What these five domains have in common is that simple generalized plans can be formulated for them, which are capable of solving arbitrarily large instances. We wish to demonstrate that our method is capable of producing policies that solve much larger instances than those they were trained on, thus automatically discovering such generalized plans. Some domains are easier than others, and in cases where the generalized plan is very easy to describe we often witnessed that the policy generalizes very successfully. For example, the Gripper domain has a very simple strategy (Grab 2 balls with each trip to room B) and indeed our neural network learns the optimal strategy and usually still performs optimally even for instances with hundreds of balls. To demonstrate that our policies indeed generalize well, we trained them on small instances and used both small and large instances for evaluation.

For the Blocksworld domain we trained our policy on instances with 4 blocks, and evaluated on instances with 5-100 blocks.

For the Satellite domain we trained our policy on instances with 1-3 satellites, 1-3 instruments per satellite, 1-3 types of instruments and 2-3 targets, and evaluated on instances with 1-14 satellites, 2-11 instruments per satellite, 1-6 types of instruments and 2-42 targets.

For the Logistics domain we trained our policy on instances with 2-3 airplanes, 2-3 cities, 2-3 locations per city and 1-2 packages, and evaluated on instances with 4-12 airplanes, 4-15 cities, 1-6 locations per city and 8-40 packages.

For the Gripper domain we trained our policy on instances with 3 balls, and evaluated on instances with 5-200 balls.

For the Ferry domain we trained our policy on instances with 3-4 locations and 2-3 cars, and evaluated on instances with 4-40 locations and 2-120 cars.
5.2 Experimental setting
For training our policies, we rely on instance generators to produce random training instances, since our method requires large amounts of training data. All policies are trained for 1000 iterations, each with 100 training episodes and up to 20 gradient update steps. Experiments are performed on a single machine with an i7-8700K processor and a single NVIDIA GTX 1070 GPU. We used the same training hyperparameters for all five domains, but slightly varying neural network models. We used a hidden representation size of 256 and ReLU activations, a learning rate of 0.0001, a discount factor of 0.99, an entropy bonus of 0.01, a clipping ratio of 0.2 and a KL divergence cutoff parameter of 0.01. For the Blocksworld and Gripper domains we used a two-layer GNN with both layers of the GN block type, and for the Satellite, Ferry and Logistics domains we used a two-layer GNN with a GNAT block followed by a GN block. Our code was implemented in Python, and our neural networks and learning algorithm were implemented using PyTorch [paszke2019pytorch].
5.3 Baselines
In our evaluation we focus on solving large instances of generalized planning domains and compare our method with a classical planner. Other learning-based methods either had no available code at the time this work was written (such as [shen2019learning]) or were inherently limited in scaling to large problems (for example [toyer2018action]), so we opted for a more general baseline in the form of a classical planner, which can scale to large problems given enough time and memory. We compare against Fast Downward [helmert2006fast], which is a state-of-the-art framework. Our approach uses Pyperplan, a Python-based framework, as the model and successor-state generator. We use the LAMA-first configuration as the setup for Fast Downward, as it is a top-performing, competitive satisficing planning algorithm.
5.4 Evaluation Metrics
Since our work is focused on satisficing planning, we use success rate as our main metric. We run both our GBFS-GNN and Fast Downward on a set of 50 held-out evaluation instances per domain, with a fixed time limit of 600 seconds per instance. We then plot the success rate of each method against the time limit and against the number of expanded states, to see how each method scales with the available computation. The evaluation instances are generated according to a wide distribution, such that both small and large instances are sampled.
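As an illustration of the metric, the success rate at a given time limit is the fraction of evaluation instances solved within that limit. The sketch below uses a hypothetical function name and made-up solve times, not results from this work; unsolved instances are marked with `None`.

```python
def success_rate_curve(solve_times, time_limits):
    """Fraction of instances solved within each time limit.

    `solve_times` holds seconds-to-solution per instance, or None if the
    instance was not solved at all.
    """
    n = len(solve_times)
    return [sum(1 for t in solve_times if t is not None and t <= limit) / n
            for limit in time_limits]


# Four illustrative instances evaluated at three time limits (seconds).
curve = success_rate_curve([1.0, 30.0, None, 500.0], [10, 60, 600])
```

The same function applies to the expanded-states axis by passing expansion counts instead of solve times.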
5.5 Results
We now present our results. Figure 6 shows a comparison between our method and Fast Downward for the five domains we used in our experiments. The plots show success rate as a function of the number of expanded states, and demonstrate that our method scales favourably compared to the classical planner on 4 of the 5 domains. In fact, on the 4 domains where our policies generalized well, GBFS-GNN required very little to no search. In these domains, a solution can be found by just greedily following the policy in all but the hardest instances. Our search algorithm builds on this generalization capability and uses a small number of full policy rollouts while searching.
In Figure 7 we present a comparison between our method and Fast Downward, plotting success rate against the given runtime. Even though Fast Downward has a highly optimized C++ implementation and uses sophisticated modeling tools to efficiently solve planning problems, our method outperforms it in one domain (Blocksworld) and closely matches it on three others. Although GBFS-GNN uses a successor-state and legal-action generator that is orders of magnitude slower than that of Fast Downward, our method's generalization capability makes it competitive with state-of-the-art implementations of classical planners.
An obvious exception concerning the generalization performance of our method is the Logistics domain. Our policy achieved good performance on the training instances but failed to generalize to much larger instance sizes, and consequently was vastly outperformed by Fast Downward on that domain. We hypothesize that, unlike the other domains, the Logistics domain contains a tighter coupling between the different objects in each instance. In the Satellite domain, for example, calibrating an instrument or imaging a target does not interfere with other satellites, in the sense that the policy can have multiple "half-baked" goals and switch between them without interference. This is not possible in the Logistics domain, as all the packages share the trucks and airplanes, and moving a specific truck to pick up a package might interfere with another package that was meant to be picked up at another location. Different graph neural network architectures could perhaps encourage the policy to remain "fixed" on a single goal until it is satisfied before moving on to another, thus possibly overcoming the issue with the Logistics domain and other similar types of problems.
6 Conclusion and Future Work
In this work we studied the ability of graph neural networks and deep reinforcement learning algorithms to learn generalized planning policies that can solve instances much larger than those encountered during training, in effect learning principles that generalize well. Unlike some other approaches, our method does not rely on optimal solutions provided by existing planners, nor on heuristics to boost performance. We further introduced GBFS-GNN, a search algorithm that exploits the availability of high-performing reactive policies to quickly find solutions to very large instances. Our policies are learned from scratch via reinforcement learning and, combined with GBFS-GNN, achieve performance that surpasses highly optimized implementations of state-of-the-art planners in terms of expanded states and is on par with them in terms of runtime. Directions for future work include studying how specific mechanisms in graph neural network architectures relate to the emergent generalization behaviour on different domains, studying the effect of different reinforcement learning algorithms on generalization, and perhaps exploring regularization schemes for the policy training procedure that might encourage better generalization.