1 Introduction
Benchmark data sets are indispensable in the evaluation of machine learning models for graph-structured data. With the surging interest in graph representation learning, a rich collection of data sets constructed from real-life applications becomes important for validating the effectiveness of any existing or newly proposed method and for demonstrating its widespread applicability. We introduce a new labeled data set, IPC, compiled from AI planning tasks described in the Planning Domain Definition Language (PDDL) (McDermott, 2000).
In this data set, each planning task is represented as a directed graph that has target values and whose nodes are equipped with features. Planning tasks described in PDDL are concise representations of transition graphs that are themselves too big to fit in memory of any conceivable size. Recent advances in planning make it possible to encode some structural information of a task in graphs of manageable size. Two examples are the problem description graph (Pochter et al., 2011), for a grounded task representation, and the abstract structure graph (Sievers et al., 2019b), for a lifted representation. Hence, our data set consists of two versions of graphs (IPC-grounded and IPC-lifted) for the same set of tasks; each version may be used independently. There are 2439 planning tasks in total, pre-split for training, validation, and testing. Moreover, the lifted version is acyclic.
Accompanying the tasks are performance results for cost-optimal domain-independent planners, each of which attempts to solve a task under a timeout limit seconds. Hence, the target values for each graph are the CPU times of these planners, gathered on the same hardware. For practical reasons, if a planner cannot solve the task before the timeout, the target value is artificially set as .
The background on AI planning and graph construction, including example problem domains, node feature definitions, and how the data set can be extended, is presented in Section 2. The characteristics of this data set differ in several aspects from those of the commonly used benchmark data sets for graph kernels and graph neural networks: the sizes of our graphs are not only substantially larger but also vary significantly. The resulting challenges and implications are elaborated in Section 3. We present an example use of the data set in Section 4 and conclude in Section 5.
2 Data Set Construction
PDDL tasks are defined over a first-order language that consists of predicates, functions, a set of natural numbers, variables, and constants. Given , a normalized (Helmert, 2009) PDDL task is a tuple of schematic operators, schematic axioms, initial state specification, and goal specification, in so-called lifted representation.
Most tools for solving planning problems (a.k.a. planners) perform grounding as a first step, followed by translation into a language (Bäckström & Nebel, 1995). In , a task consists of finite-domain variables, ground operators, ground axioms, initial state, and the goal.
Our data set consists of two graphical representations per planning task. These representations losslessly encode the information in the planning task and are often used for the computation of structural symmetries (Shleyfman et al., 2015). The graph obtained from the grounded representation is called the problem description graph (PDG) (Pochter et al., 2011). We present here the definition extended to support conditional effects and axioms and refer the reader to Sievers et al. (2019a) for further details.
Definition 1
Let be a task. The problem description graph of is the digraph with nodes
where and , and edges
The graph obtained from the lifted PDDL representations is called the abstract structure graph (ASG) (Sievers et al., 2019b). Planning tasks in PDDL can be naturally modeled as abstract structures, which, in turn, can be represented as graphs. In what follows we present the definitions of abstract structures and abstract structure graphs, referring the reader to Sievers et al. (2019b) for further details.
Definition 2 (Sievers et al., 2019b)
Let be a set of symbols, where each is associated with a type . The set of abstract structures over is inductively defined as follows:

each symbol is an abstract structure, and

for abstract structures , the set and the tuple are abstract structures.
Using the language of a PDDL task , each part of can inductively be defined as an abstract structure, with the symbols of forming the basic abstract structures. Finally, abstract structures can be naturally turned into a graph.
Definition 3 (Sievers et al., 2019b)
Let be an abstract structure over . The abstract structure graph is a digraph , defined as follows.

contains a node for the abstract structure . If contains a node for or , it also contains the nodes for .

For every set (sub)structure there are edges for .

For every tuple (sub)structure , the graph contains auxiliary nodes , an edge , and edges for . For each component , there is an edge .
Note that the edges in ASGs are from the abstract structures to their substructures, which results in acyclic graphs.
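The construction in Definition 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used to build the data set: the encoding of abstract structures as nested Python values (a `str` for a symbol, a `frozenset` for a set, a `tuple` for a tuple) and the names `build_asg` and `is_acyclic` are hypothetical, and the direction of the edges along the auxiliary-node chain is assumed to run from the tuple node to the first auxiliary node and then from each auxiliary node to the next.

```python
from itertools import count


def build_asg(structure):
    """Build a graph from a nested abstract structure (hypothetical encoding).

    str -> symbol, frozenset -> set structure, tuple -> tuple structure.
    Returns (nodes, edges, root): nodes maps node id -> node type,
    edges is a set of directed (source, target) pairs.
    """
    nodes = {}      # node id -> "symbol" | "set" | "tuple" | "aux"
    edges = set()
    node_of = {}    # abstract structure -> node id; equal substructures share a node
    fresh = count()

    def new_node(kind):
        node_id = next(fresh)
        nodes[node_id] = kind
        return node_id

    def visit(s):
        if s in node_of:
            return node_of[s]
        if isinstance(s, str):
            node_of[s] = new_node("symbol")
        elif isinstance(s, frozenset):
            node_of[s] = new_node("set")
            for child in s:
                # Edge from the set structure to each of its elements.
                edges.add((node_of[s], visit(child)))
        elif isinstance(s, tuple):
            node_of[s] = new_node("tuple")
            previous = node_of[s]
            for child in s:
                # One auxiliary node per component, chained to keep the order.
                aux = new_node("aux")
                edges.add((previous, aux))
                edges.add((aux, visit(child)))
                previous = aux
        else:
            raise TypeError(f"unsupported structure: {s!r}")
        return node_of[s]

    root = visit(structure)
    return nodes, edges, root


def is_acyclic(nodes, edges):
    """Kahn's algorithm: the graph is acyclic iff every node can be removed."""
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for u, v in edges:
        indegree[v] += 1
        successors[u].append(v)
    ready = [n for n, d in indegree.items() if d == 0]
    removed = 0
    while ready:
        u = ready.pop()
        removed += 1
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    return removed == len(nodes)


# Toy structure (not taken from the data set): a tuple of a symbol and a set.
nodes, edges, root = build_asg(("move", frozenset({"a", "b"})))
print(len(nodes), len(edges), is_acyclic(nodes, edges))
```

Because every edge points from a structure (or an auxiliary node) toward a substructure, the acyclicity check succeeds for any finite nested input under this encoding.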
In both PDG and ASG, the node features are one-hot encodings of the self-explanatory node types indicated in the above definitions.
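A one-hot node feature of this kind might be computed as in the following minimal sketch. The type vocabulary used here is a placeholder chosen for illustration; the actual list of node types in the data set follows from the definitions and is not reproduced here, and the function name `one_hot_features` is hypothetical.

```python
def one_hot_features(node_types, type_vocabulary):
    """Return one row per node with a single 1 at the index of its type."""
    index = {t: i for i, t in enumerate(type_vocabulary)}
    features = []
    for t in node_types:
        row = [0] * len(type_vocabulary)
        row[index[t]] = 1
        features.append(row)
    return features


# Placeholder vocabulary and node types, for illustration only.
vocabulary = ["symbol", "set", "tuple", "aux"]
features = one_hot_features(["tuple", "aux", "symbol"], vocabulary)
```

Each row has as many entries as there are node types, so the feature dimension is fixed by the vocabulary and not by the size of the graph.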
The aim in classical planning is to find a sequence of ground operators that, if applied to the initial state, will necessarily transform it into a goal state. Such a sequence is called a plan. Assigning a quantitative cost to each ground operator, the cost of a plan is defined as the sum over the costs of its operators. The goal of cost-optimal classical planning is to find a provably cheapest plan. There exist dozens if not hundreds of highly parameterized methods for heuristic guidance computation, giving rise to an enormous number of possible planners. As even classical planning is PSPACE-hard, there cannot be one planner that works well on all possible planning problems. Thus, finding a planner that works well on a given planning problem is a challenging task.
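The plan cost defined above is a plain sum, which the following sketch makes concrete. The operator names and costs are invented for illustration, and `plan_cost` is a hypothetical helper, not part of any planner.

```python
def plan_cost(plan, operator_cost):
    """Cost of a plan: the sum of the costs of its ground operators."""
    return sum(operator_cost[op] for op in plan)


# Invented ground operators and costs, for illustration only.
operator_cost = {"pick(a)": 1, "move(l1,l2)": 3, "drop(a)": 1}
plan = ["pick(a)", "move(l1,l2)", "drop(a)"]
total = plan_cost(plan, operator_cost)
```

A cost-optimal planner must return a plan for which no other plan has a smaller value of this sum, which is what makes the problem harder than merely finding some plan.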
While planners are often domain-independent, in the sense that they depend only on the information encoded in PDDL, the planning tasks encode computational problems from various domains. These domains range from puzzles or one-person games (e.g., towers of Hanoi, 15-puzzle, freecell, and sokoban), to real-life domains (e.g., task planning and automated control of autonomous systems: greenhouse logistics, rovers, elevators, satellites), as well as emerging domains (e.g., genome editing distance computation).
Many of the existing domains were introduced through International Planning Competitions, which have been held regularly since 1998. Each such competition, intended for comparing the performance of domain-independent planners, introduced new, previously unseen domains on which submitted planners were tested. In many cases, the authors of the domains supplied not only the planning tasks but also a generator for creating additional tasks. Some of these generators can be found at, e.g., https://bitbucket.org/planningresearchers/pddlgenerators. Hence, these generators may be used to extend the current data set with little effort, although for benchmarking purposes we did not include any such task in the data set.
3 Statistics
A number of graph statistics, compared with those of commonly used data sets (Kersting et al., 2016) for benchmarking graph kernels and graph neural networks, are reported in Table 1 and Figures 2 and 3 in the supplementary material. Observations follow.

The IPC graphs are significantly larger. The graphs in other data sets under comparison generally have tens to hundreds of nodes, but 39% of the graphs in IPC-grounded and 63% in IPC-lifted have over 1,000 nodes. The largest graph in IPC-grounded has 87,140 nodes, and the number for IPC-lifted is 238,909.

Note that the size of the largest graph is often an indicator of the memory bottleneck for graph neural networks: a mini-batch in stochastic training contains at least one whole graph, so the number of nodes in a batch is at least this number. Hence, our data set poses substantial challenges for the computation of many neural graph models.

The sizes of the IPC graphs are highly skewed, compared to those of other data sets. For many machine learning tasks, especially in the unsupervised setting, the notion of similarity is key to clustering and categorization. When the sizes of two graphs differ significantly, the intuition of similarity is challenged. After all, what does it mean to say that "a graph with 10 nodes is similar to another graph with 100,000 nodes"?

The lifted graphs are the sparsest, compared with the grounded ones and the graphs in other data sets.

Similar to many other data sets, the IPC graphs are not necessarily connected. However, the main connected component generally dominates. Hence, graph neural networks still suffer the memory bottleneck caused by the exceedingly large graphs.

Despite the difference in size and density, the IPC graphs have a moderate diameter, similar to other data sets.
The number of layers in a graph neural network of neighborhood-aggregation style is often questioned beyond hyperparameter tuning, and speculation ties it to the diameter of the graphs. Meanwhile, it has been widely acknowledged that neighborhood aggregation is a type of Laplacian smoothing and that too many layers lead to over-smoothing (Li et al., 2018; Xu et al., 2018; Klicpera et al., 2019). The diameter statistics may be useful for the analysis of the role of small-world structures handled by graph neural networks.
Table 1: Graph statistics.

| | IPC-grounded | IPC-lifted | REDDIT-MULTI-12k | REDDIT-BINARY |
|---|---|---|---|---|
| Type | directed | DAG | undirected | undirected |
| #Graphs | 2,439 | 2,439 | 11,929 | 2,000 |
| Total #Nodes | 6,233,856 | 9,816,948 | 4,669,116 | 859,254 |
| Max #Nodes | 87,140 | 238,909 | 3,782 | 3,782 |
| Mean (Std) #Nodes | 2,555.9 (6,099.0) | 4,025.0 (14,507.6) | 391.4 (428.7) | 429.6 (554.1) |
| Mean (Std) Ave Degree^1 | 12.3 (131.0) | 2.9 (35.1) | 4.7 (27.6) | 4.6 (41.3) |
| Mean (Std) #CC^2 | 1.09 (0.61) | 1.14 (0.49) | 2.81 (2.65) | 2.48 (2.47) |
| Mean (Std) Diam^3,4 | 8.2 (2.3) | 17.1 (1.5) | 10.9 (3.1) | 9.7 (3.1) |

| | COLLAB | NCI1 | DD | PROTEINS | ENZYMES | MUTAG |
|---|---|---|---|---|---|---|
| Type | undirected | undirected | undirected | undirected | undirected | undirected |
| #Graphs | 5,000 | 4,110 | 1,178 | 1,113 | 600 | 188 |
| Total #Nodes | 372,474 | 122,747 | 334,925 | 43,471 | 19,580 | 3,371 |
| Max #Nodes | 492 | 111 | 5,748 | 620 | 126 | 28 |
| Mean (Std) #Nodes | 74.5 (62.3) | 29.9 (13.6) | 106.5 (284.3) | 39.1 (45.8) | 32.6 (15.3) | 18.0 (4.6) |
| Mean (Std) Ave Degree^1 | 132.0 (158.5) | 4.3 (1.6) | 10.1 (3.4) | 7.5 (2.3) | 7.6 (2.3) | 4.4 (1.5) |
| Mean (Std) #CC^2 | 1 (0) | 1.19 (0.57) | 1.02 (0.18) | 1.08 (0.52) | 1.24 (3.61) | 1 (0) |
| Mean (Std) Diam^3,4 | 1.9 (0.3) | 13.3 (5.1) | 19.9 (7.7) | 11.6 (7.9) | 10.9 (4.8) | 8.2 (1.8) |

^1 "Ave Degree" is the average node degree (of the undirected version of the graph).
^2 "CC" means connected components (of the undirected version of the graph).
^3 "Diam" means diameter. Because a graph may consist of multiple connected components, we define the diameter as the maximum of the diameters of each connected component.
^4 For large graphs, the diameter is too costly to compute. Hence, for IPC, only the diameters of 94.3% of the graphs are computed. For other data sets, diameters of all graphs are computed.
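The per-graph quantities behind Table 1 (average degree, number of connected components, and diameter, all on the undirected version of the graph) can be computed as in the sketch below. This is not the script that produced the table; it is a self-contained illustration in which `graph_statistics` is a hypothetical name, and dropping self-loops and collapsing antiparallel edges when forming the undirected version is an assumption.

```python
from collections import deque


def graph_statistics(num_nodes, edges):
    """Statistics of the undirected version of a directed graph.

    Nodes are 0..num_nodes-1 (at least one node is assumed);
    edges is an iterable of directed (u, v) pairs.
    """
    neighbors = [set() for _ in range(num_nodes)]
    for u, v in edges:
        if u != v:  # assumption: self-loops are ignored
            neighbors[u].add(v)
            neighbors[v].add(u)
    avg_degree = sum(len(n) for n in neighbors) / num_nodes

    def bfs(source):
        # Hop distances from source within its connected component.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in neighbors[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return dist

    seen = set()
    num_components = 0
    diameter = 0
    for s in range(num_nodes):
        if s in seen:
            continue
        component = bfs(s)
        seen.update(component)
        num_components += 1
        # Exact diameter: one BFS per node, maximized over all components.
        for u in component:
            diameter = max(diameter, max(bfs(u).values()))
    return {
        "avg_degree": avg_degree,
        "num_components": num_components,
        "diameter": diameter,
    }


# Toy graph: a directed path 0 -> 1 -> 2 and a separate edge 3 -> 4.
stats = graph_statistics(5, [(0, 1), (1, 2), (3, 4)])
```

The exact diameter needs one breadth-first search per node, which is consistent with the remark that it becomes too costly for the largest graphs.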
4 Example Use
To illustrate the use of the data set, we focus on the problem of cost-optimal planning, whose goal is to solve as many tasks as possible by using cost-optimal planners, each given a time limit. Hence, for each of the 17 target values, we convert it to 0 if the value is smaller than the time limit and 1 otherwise. For each target, the problem becomes binary classification, and thus a probability value between 0 and 1 is output. We select the planner corresponding to the smallest probability and confirm success if its actual planning time is smaller than the timeout limit. Test accuracy (percentage of successfully solved tasks) is reported.
The three methods compared are (a) an image-based CNN, whereby a grayscale image is converted from the adjacency matrix of the graph; (b) a graph convolutional network (GCN) (Kipf & Welling, 2017) with attention readout; and (c) a gated graph neural network (GGNN) (Li et al., 2016). For details of the CNN architecture, see Katz et al. (2018).
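The labeling and selection procedure described above can be sketched as follows. The function names are hypothetical, the strict comparison against the timeout is an assumption, and the timeout value and all numbers in the toy example are placeholders and not the values used for the data set.

```python
def binarize_targets(cpu_times, timeout):
    """Label 0 if the planner finished within the timeout, 1 otherwise."""
    return [0 if t < timeout else 1 for t in cpu_times]


def select_planner(failure_probs):
    """Index of the planner with the smallest predicted probability."""
    return min(range(len(failure_probs)), key=failure_probs.__getitem__)


def solved_fraction(all_probs, all_cpu_times, timeout):
    """Fraction of tasks whose selected planner finishes within the timeout."""
    solved = 0
    for probs, times in zip(all_probs, all_cpu_times):
        if times[select_planner(probs)] < timeout:
            solved += 1
    return solved / len(all_probs)


# Toy example with three planners and two tasks; all numbers are placeholders.
timeout = 100.0
cpu_times = [[5.0, 200.0, 50.0], [300.0, 300.0, 20.0]]
labels = [binarize_targets(times, timeout) for times in cpu_times]
predicted = [[0.2, 0.9, 0.4], [0.1, 0.8, 0.3]]
accuracy = solved_fraction(predicted, cpu_times, timeout)
```

Note that the reported accuracy measures whether the single selected planner succeeds, not whether each of the binary predictions is individually correct.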
The data set has been pre-split for training, validation, and testing. Table 2 reports the test accuracy. Additionally, we re-split the training/validation combination as a form of cross-validation, whereby we fix the test set because it comes from the most recent International Planning Competition. Two forms of random re-splits are possible. One is to preserve the domains of the planning tasks (i.e., tasks from the same domain cannot appear in both training and validation), and the other is free from this restriction. We call the former domain split and the latter random split. For each type of re-split, we perform ten randomizations. Table 3 reports the test accuracy together with standard deviation. From both tables, one sees that the lifted graphs yield much higher accuracy and GCN outperforms the other two methods.
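The difference between the two kinds of re-split can be illustrated with the sketch below. The function names, the task identifiers, and the choice to hold out a fraction of whole domains (as opposed to a fraction of tasks) in the domain split are assumptions made for this illustration only.

```python
import random


def random_split(tasks, validation_fraction, seed):
    """Shuffle tasks and hold out a fraction, ignoring domains."""
    rng = random.Random(seed)
    shuffled = list(tasks)
    rng.shuffle(shuffled)
    n_val = int(round(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def domain_split(tasks, domain_of, validation_fraction, seed):
    """Hold out whole domains so no domain is in both training and validation."""
    rng = random.Random(seed)
    domains = sorted({domain_of[t] for t in tasks})
    rng.shuffle(domains)
    n_val = int(round(len(domains) * validation_fraction))
    val_domains = set(domains[:n_val])
    train = [t for t in tasks if domain_of[t] not in val_domains]
    val = [t for t in tasks if domain_of[t] in val_domains]
    return train, val


# Hypothetical task identifiers; the prefix plays the role of the domain.
tasks = ["hanoi-1", "hanoi-2", "sokoban-1", "sokoban-2", "rovers-1", "rovers-2"]
domain_of = {t: t.split("-")[0] for t in tasks}

# Ten randomizations of each kind, one per seed.
domain_resplits = [domain_split(tasks, domain_of, 1 / 3, seed) for seed in range(10)]
random_resplits = [random_split(tasks, 1 / 3, seed) for seed in range(10)]
```

In both cases the test set is left untouched, so the accuracy figures across re-splits remain comparable.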
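Table 2: Test accuracy on the pre-split data set.

| Method | Grounded | Lifted |
|---|---|---|
| CNN | 73.1% | 86.9% |
| GCN | 80.7% | 87.6% |
| GGNN | 77.9% | 81.4% |

Table 3: Test accuracy (standard deviation) over re-splits.

| Method | Domain Splits | Random Splits |
|---|---|---|
| CNN | 82.1% (6.6%) | 86.1% (5.5%) |
| GCN | 85.6% (5.5%) | 87.2% (3.5%) |
| GGNN | 76.6% (5.8%) | 74.4% (2.7%) |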
5 Conclusions
We have described a new data set, IPC, for benchmarking graph-based learning models (e.g., graph kernels and graph neural networks) in classification, regression, and related uses. The graphs are constructed from AI planning tasks appearing in International Planning Competitions, without requiring human effort for labeling, and may be extended with random instances of planning problems. The data set has distinctly different statistics from other popular benchmarks: the graphs are much larger and their sizes vary substantially. Moreover, the lifted version of the data set consists of directed acyclic graphs, enabling the development of specialized graph models. We anticipate that the data set will be a valuable addition to the current collection of commonly used benchmarks for validating the effectiveness of existing and forthcoming graph methods.
References
 Bäckström & Nebel (1995) Bäckström, C. and Nebel, B. Complexity results for SAS+ planning. Computational Intelligence, 11(4):625–655, 1995.
 Helmert (2009) Helmert, M. Concise finite-domain representations for PDDL planning tasks. AIJ, 173:503–535, 2009.
 Katz et al. (2018) Katz, M., Sohrabi, S., Samulowitz, H., and Sievers, S. Delfi: Online planner selection for cost-optimal planning. In IPC-9 planner abstracts, 2018.
 Kersting et al. (2016) Kersting, K., Kriege, N. M., Morris, C., Mutzel, P., and Neumann, M. Benchmark data sets for graph kernels, 2016. URL http://graphkernels.cs.tudortmund.de.
 Kipf & Welling (2017) Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
 Klicpera et al. (2019) Klicpera, J., Bojchevski, A., and Günnemann, S. Predict then propagate: Graph neural networks meet personalized PageRank. In ICLR, 2019.
 Li et al. (2018) Li, Q., Han, Z., and Wu, X.-M. Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI, 2018.
 Li et al. (2016) Li, Y., Tarlow, D., Brockschmidt, M., and Zemel, R. Gated graph sequence neural networks. In ICLR, 2016.
 McDermott (2000) McDermott, D. The 1998 AI Planning Systems competition. AI Magazine, 21(2):35–55, 2000.
 Pochter et al. (2011) Pochter, N., Zohar, A., and Rosenschein, J. S. Exploiting problem symmetries in state-based planners. In AAAI, 2011.
 Shleyfman et al. (2015) Shleyfman, A., Katz, M., Helmert, M., Sievers, S., and Wehrle, M. Heuristics and symmetries in classical planning. In AAAI, 2015.
 Sievers et al. (2019a) Sievers, S., Katz, M., Sohrabi, S., Samulowitz, H., and Ferber, P. Deep learning for cost-optimal planning: Task-dependent planner selection. In Proc. AAAI 2019, 2019a.
 Sievers et al. (2019b) Sievers, S., Röger, G., Wehrle, M., and Katz, M. Theoretical foundations for structural symmetries of lifted PDDL tasks. In Proc. ICAPS 2019, 2019b.
 Xu et al. (2018) Xu, K., Li, C., Tian, Y., Sonobe, T., Kawarabayashi, K.-i., and Jegelka, S. Representation learning on graphs with jumping knowledge networks. In ICML, 2018.