1 Introduction
Recently, deep reactive policies have been shown to be successful for offline probabilistic planning problems represented in RDDL [Issakkimuthu, Fern, and Tadepalli 2018] or PPDDL [Toyer et al. 2018]. An advantage is that neural policy networks can represent offline policies for very large domains. However, training these networks from scratch can be sample-inefficient and time-consuming.
Very recently, ToRPIDo, a neural transfer learning approach, was proposed [Bajpai, Garg, and Mausam 2018]; it trains a generic policy network on some problems of an RDDL domain and transfers it to a new problem of the same domain. Its architecture runs a graph convolutional network (GCN) [Kipf and Welling 2017] on the nonfluent graph structure exposed in an RDDL domain to compute object embeddings for a problem. These are concatenated to produce a latent embedding of the state. A downstream RL module maps the state embedding to an action embedding, which is projected onto the symbolic actions in the domain. While highly effective in reducing learning times, a significant limitation of ToRPIDo is that it transfers only when all training and testing problems are of the same size. Thus, it works only when test problem sizes are known at train time.
In response, we present TraPSNet, the first size-independent neural transfer algorithm for RDDL MDPs. As a first step towards this goal, our paper focuses on domains where action templates and (non)fluents are parameterized over a single object only, and there is one binary nonfluent. ToRPIDo can only operate on equi-sized problems, because its state embeddings have dimensionality proportional to the number of objects in the problem, and its action decoder outputs a distribution over all possible actions, whose number also depends on the problem size. TraPSNet achieves size-independence through two key ideas. First, it uses max pooling of object embeddings to produce a fixed-dimensionality state embedding. Second, while it still produces a probability distribution over all actions, it does so by projecting an object embedding onto the probability with which the action applied to that object is taken in the policy. The parameters for this projection function are tied across all objects, making this component size-independent as well.
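These two ideas can be sketched concretely. The following toy numpy snippet (the embedding size, the weight vector, and the `policy` helper are all illustrative, not the paper's implementation) shows that one set of tied parameters yields a valid action distribution for problems with different numbers of objects:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fixed parameters, shared across problems of every size.
EMB = 4                                  # object-embedding dimensionality (toy)
W = rng.normal(size=(2 * EMB,))          # tied projection weights (illustrative)

def policy(object_embeddings):
    """Map n object embeddings (n x EMB) to a distribution over n actions."""
    # Idea 1: max pooling gives a state embedding of size EMB,
    # independent of the number of objects n.
    state = object_embeddings.max(axis=0)
    # Idea 2: the same weights W score every object; parameter tying
    # makes the scorer size-independent as well.
    scores = np.array([np.dot(W, np.concatenate([e, state]))
                       for e in object_embeddings])
    exp = np.exp(scores - scores.max())  # softmax over all n actions
    return exp / exp.sum()

# The identical parameters handle a 3-object and a 7-object problem.
p3 = policy(rng.normal(size=(3, EMB)))
p7 = policy(rng.normal(size=(7, EMB)))
```

No component of `policy` has a shape that depends on the object count, which is the essence of the size-independence argument.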
We perform experiments on two RDDL domains: SysAdmin and Game of Life. These are chosen because, while they are highly challenging (they can have very large state and transition spaces, and complex dynamics), they also satisfy the assumption of unary actions and a binary nonfluent. Training on small problems in these domains and testing on larger instances from IPPC 2014 [Grzes, Hoey, and Sanner 2014], we find that TraPSNet achieves excellent zero-shot transfer, i.e., it obtains a very high reward even before any RL on the test problem. Compared to training from scratch, TraPSNet has vastly superior learning curves. We will release the code of TraPSNet for further research.
2 Related Work
Most current state-of-the-art reinforcement learners are neural models. A popular deep RL agent is Asynchronous Advantage Actor-Critic (A3C) [Mnih et al. 2016], which simultaneously trains a policy and a value network by running simulated trajectories and backpropagating an advantage function (a function of the obtained rewards). We refer to the parameters of these two networks as the policy and value parameters, respectively. Any RL agent is naturally applicable to probabilistic planning, since any RDDL [Sanner 2010] or PPDDL [Younes et al. 2005] problem can always be converted to a simulator for training the agent. Existing literature on neural planning includes Value Iteration Networks, which operate in flat state spaces [Tamar et al. 2017], Action-Schema Networks (ASNets) for solving PPDDL problems [Toyer et al. 2018], and deep reactive policies for RDDL problems [Issakkimuthu, Fern, and Tadepalli 2018]. Some works have also studied neural transfer; these include Groshev et al. (2018), who experiment on only two deterministic domains, and ASNets. While all RDDL problems can, in principle, be converted to PPDDL, the potential exponential blowup of the representation makes ASNets unscalable to our domains. The closest to our work is ToRPIDo, a recent architecture for equi-sized transfer in RDDL domains [Bajpai, Garg, and Mausam 2018].
ToRPIDo is based on the principle that there may exist a latent embedding space for a domain in which similar states of different problems have similar embeddings. It tries to uncover this latent structure using the object connectivities exposed in RDDL via nonfluents. It has a state encoder, which creates object embeddings as projections of this graph's adjacency matrix and the fluents in a state (using a GCN), and then concatenates them to construct a (latent) state embedding. An RL module maps this state embedding to a (latent) state-action embedding. An action decoder maps the state-action embedding to a policy (a distribution over action symbols). While the other modules transfer directly, the decoder needs to be retrained at test time, resulting in only near zero-shot transfer. Because its embeddings and action decoder are size-specific, ToRPIDo only allows equi-sized transfers – a limitation we relax in our work. Furthermore, TraPSNet requires no retraining at test time and achieves full zero-shot transfer.
3 Problem Formulation
An RDDL [Sanner 2010] domain enumerates the various fluent predicates, nonfluent predicates, a reward function, and action templates with their dynamics. Here, fluents refers to predicates that can change value as a consequence of actions, whereas nonfluents stay fixed throughout execution – these often describe the connectivity structure among objects in a problem. An RDDL domain can be likened to a Relational Markov Decision Process
[Boutilier, Reiter, and Price 2001]. An RDDL problem within a domain lists the specific objects, the values of all nonfluent predicates for those objects, and the fluent values for those objects (which define state variables) in the initial state. This completes the description of a factored MDP with a known initial state [Mausam and Kolobov 2012]. Our goal is to develop a good anytime algorithm for computing an offline policy for an RDDL test problem. An anytime MDP algorithm is one that can be stopped at any time and will return a reasonable policy; it typically produces better policies given more computation time. We use a transfer setting for this, in which we are given training problems from the same domain, but of different (typically smaller) size than the test problem. The transfer objective is, at training time, to learn domain-specific but problem-independent information from the training problems and, at test time, to transfer it to the test problem.
Post transfer, further training on the test problem should yield a good anytime MDP planner. That has two indicators. First, in the zero-shot setting, i.e., when the algorithm is given no access to the test problem's simulator and cannot retrain, it must return a policy with a high long-term reward. Second, it must have superior learning curves compared to a policy learned from scratch on the test problem.
As the first step towards the objective of size-independent transfer, we focus on domains where all fluents, nonfluents, and action templates are unary, except for one binary nonfluent. This is a common setting in many benchmark RDDL domains such as SysAdmin and Game of Life. Let our factored MDP have a set of objects, parameterized unary fluents and nonfluents, unary actions, and one special parameterized binary nonfluent.
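To make this setting concrete, here is a toy encoding of a problem instance under these assumptions – a hypothetical SysAdmin-style ring of four computers. The names `running`, `connected`, and `reboot` are illustrative labels, not RDDL syntax:

```python
import numpy as np

n = 4                                     # objects: computers 0..3
running = np.array([1, 0, 1, 1])          # the unary fluent: one value per object
connected = np.zeros((n, n), dtype=int)   # the one binary nonfluent, stored as
for i in range(n):                        # an adjacency matrix (ring topology)
    j = (i + 1) % n
    connected[i, j] = connected[j, i] = 1

# Grounding the single unary action template over the objects yields n actions.
actions = [f"reboot({i})" for i in range(n)]
```

The unary fluent values and the binary-nonfluent adjacency matrix are exactly the problem-specific inputs the rest of the paper's architecture consumes.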
4 The TraPSNet Architecture
We name our neural transfer model TraPSNet – a Network that can Transfer across Problem Sizes. At a high level, it extends A3C and trains problem-independent policy and value networks. The transfer itself is based on two hypotheses: (1) for every domain, neural embeddings can capture similarities of objects and states across problem sizes; (2) the value of a specific grounded action can be effectively estimated via a problem-oblivious function that depends on the embeddings of the current state and of the object the action is applied to. Both the policy and value nets of TraPSNet have a state encoder each, whose output feeds into an action decoder (for the policy net) or a value decoder (for the value net). The parameters of these modules are shared across all training problems in a domain. The state encoders operationalize the first hypothesis by outputting a fixed-size embedding for each object in the problem, based on the nonfluent graph structure and the fluent values related to that object. For different problems, a variable number of object embeddings are max-pooled to construct a fixed-size state embedding. The action decoder operationalizes the second hypothesis. It projects each object embedding, in conjunction with the action template id and the overall state embedding, to a real-valued score. This is the unnormalized probability of taking that action on that object in the current state. All actions in a problem are run through a softmax to compute a randomized policy. A similar idea is used in the value decoder for estimating the state value. Figure 1 illustrates the policy net of TraPSNet schematically.
State Encoder: We want to construct an embedding for each object based on its individual properties, its neighborhood, and the global information of the overall state. Similar to Bajpai et al. (2018), this is achieved by casting the state information as a graph. The nodes of the graph are the objects, and there is an edge between two objects if the binary nonfluent holds between them. The input features at each node are the concatenated values of that object's fluents and nonfluents. To compute fixed-size object embeddings, TraPSNet constructs local embeddings for each node in the graph. For this purpose, it uses a Graph Attention Network (GAT) [Velickovic et al. 2017] followed by a fully connected layer, which take in the adjacency matrix of the graph and output node embeddings. (Lack of space precludes a detailed description of a GAT. Briefly, a GAT improves on a GCN by computing, in each node, self-attention coefficients for each neighbor and itself. These coefficients are multiplied with the node features and summed to obtain an intermediate node embedding. This process is repeated K times and the results are max pooled to obtain a final node embedding.)
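A minimal single-head sketch of the attention step just described may help (one propagation round only; the K-fold repetition, the max pooling over heads, and the trailing fully connected layer are omitted, and all shapes and parameter names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gat_layer(H, A, W, a):
    """One single-head graph-attention layer (simplified sketch).

    H: n x f_in node features, A: n x n adjacency matrix,
    W: f_in x f_out projection, a: 2*f_out attention vector.
    Each node attends over its neighbors and itself, then sums the
    attention-weighted projected features."""
    Z = H @ W
    n = A.shape[0]
    out = np.zeros_like(Z)
    for i in range(n):
        nbrs = [j for j in range(n) if A[i, j] or j == i]   # neighbors + self
        logits = np.array([leaky_relu(a @ np.concatenate([Z[i], Z[j]]))
                           for j in nbrs])
        alpha = softmax(logits)                             # attention coefficients
        out[i] = sum(c * Z[j] for c, j in zip(alpha, nbrs)) # weighted sum
    return out

# Toy usage: a 4-node ring with 3 input and 5 output features per node.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]])
out = gat_layer(rng.normal(size=(4, 3)), A,
                rng.normal(size=(3, 5)), rng.normal(size=(10,)))
```

Nothing in `gat_layer` depends on a fixed node count, which is what lets the encoder run on graphs of any size.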
TraPSNet then computes an embedding for the whole state, i.e., the entire graph. To achieve a size-invariant state embedding, TraPSNet pools all object embeddings. After experimenting with various pooling schemes (max, sum, average), max pooling produced the best results. Similar results have been seen in the NLP and vision literature (e.g., [Zhang and Wallace 2017]). Max pooling, intuitively, helps the state embedding retain the "best" value for each feature (dimension), while losing information about the object(s) responsible for that value. For each object, its embedding is concatenated with the state embedding to produce a contextual object embedding, which is used as input to both decoders.
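The intuition about max pooling is visible in a two-object, two-feature toy example (all values made up): each feature of the pooled embedding keeps its best value, regardless of which object supplied it.

```python
import numpy as np

# Two object embeddings (rows), two features (columns); values are illustrative.
E = np.array([[0.9, 0.1],
              [0.2, 0.8]])
state = E.max(axis=0)   # per-feature maximum over objects
# state is [0.9, 0.8]: the best value of each feature survives, but the
# identity of the contributing object is lost.
```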
Action & Value Decoders: A fully-connected network maps a contextual embedding into several real-valued scores for each object, one per action template. Let these networks represent the scoring functions of the action and value decoders, respectively. The action decoder's score for an object–template pair is interpreted as a score for the corresponding grounded action – a softmax over these scores for all such pairs produces a randomized policy. The value of a state is approximated by the value net as the sum of the value scores of all objects.
This architecture enables TraPSNet to apply the same action decoder to problems of different sizes, since the network itself is not size-dependent – it is replicated for each object (with tied parameters) to compute the scores of each action. It also enables estimation of state values in different ranges for problems of different sizes, akin to the sum-of-object-values approximation in Relational MDPs [Guestrin et al. 2003].
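A linear toy version of the decoders (the actual decoders are 2-layer networks; the dimensions and weights here are illustrative) shows how tied parameters give a softmax over all grounded actions and a value as a sum over objects:

```python
import numpy as np

rng = np.random.default_rng(2)
T, D = 2, 6                              # action templates, contextual size (toy)
W_act = rng.normal(size=(D, T))          # tied action-decoder weights
w_val = rng.normal(size=(D,))            # tied value-decoder weights

def decode(contextual):
    """contextual: n x D contextual object embeddings."""
    scores = contextual @ W_act          # one score per (object, template) pair
    flat = scores.ravel()
    e = np.exp(flat - flat.max())
    pi = (e / e.sum()).reshape(scores.shape)   # softmax over all n*T actions
    value = float((contextual @ w_val).sum())  # sum of per-object value scores
    return pi, value

pi, value = decode(rng.normal(size=(5, D)))    # works unchanged for any n
```

Because `W_act` and `w_val` are shared across objects, the same decoder handles 5 objects here and would handle 50 without modification.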
Learning and Transfer: TraPSNet is trained end to end using a standard RL objective on the training problems. For each problem, its RDDL simulator interacts with the agent to generate trajectories. The rewards (advantage) obtained in these trajectories are backpropagated through the value and policy nets according to the A3C loss to train both networks. We make one small modification: at each step, the gradients are accumulated from trajectories of all training problems, so that the learned parameters do not overfit to any one problem.
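The multi-problem update pattern can be sketched with a stand-in loss (the real gradients come from backpropagating the A3C advantage through the networks; here a quadratic loss per "problem" is used purely for illustration):

```python
import numpy as np

def grad(theta, problem_target):
    # Stand-in gradient of a quadratic loss ||theta - target||^2; in TraPSNet
    # this would be the A3C gradient from that problem's trajectories.
    return 2 * (theta - problem_target)

theta = np.zeros(3)                       # shared parameters across problems
problems = [np.array([1., 0., 0.]),       # three toy "training problems"
            np.array([0., 1., 0.]),
            np.array([0., 0., 1.])]
lr = 0.1
for step in range(200):
    # Accumulate gradients from ALL training problems before one shared
    # update, so no single problem dominates the learned parameters.
    g = sum(grad(theta, p) for p in problems)
    theta -= lr * g / len(problems)
```

With the stand-in loss, `theta` converges to the average target, illustrating how accumulation balances the problems rather than fitting any one of them.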
At transfer time, the pretrained TraPSNet can be run directly on the test problem using its adjacency matrix to obtain an initial policy, without any modification or retraining, since there are no problem-specific parameters. Due to this, we expect the model to have good zero-shot transfer performance. Training further with RL on the test problem improves the policy.
Table 1: Fraction of best performance achieved at training times of 0, 3, and 6 hours.

Time (hrs)           0                          3                          6
Arch.      A3C-GCN A3C-GAT TNet    A3C-GCN A3C-GAT TNet    A3C-GCN A3C-GAT TNet
Sys 5         0.00    0.07 0.84       0.32    0.53 0.88       0.54    0.79 0.89
Sys 6         0.06    0.04 0.63       0.50    0.58 0.73       0.65    0.87 0.78
Sys 7         0.03    0.04 0.92       0.41    0.51 0.89       0.69    0.78 1.00
Sys 8         0.00    0.06 0.93       0.28    0.58 0.86       0.51    0.53 0.89
Sys 9         0.00    0.06 0.89       0.41    0.51 0.89       0.57    0.87 0.92
Sys 10        0.05    0.10 0.88       0.25    0.50 0.93       0.31    0.52 0.92
GoL 5         0.00    0.07 0.83       0.48    0.68 0.87       0.71    0.79 0.85
GoL 6         0.00    0.06 0.77       0.62    0.61 0.88       0.56    0.58 0.88
GoL 7         0.00    0.05 0.88       0.71    0.69 0.92       0.60    0.90 0.88
GoL 8         0.00    0.10 0.70       0.74    0.75 0.86       0.71    0.96 0.78
GoL 9         0.00    0.08 0.90       0.56    0.78 0.87       0.75    0.48 0.93
GoL 10        0.05    0.10 0.32       0.78    0.84 0.28       0.79    1.00 0.35
5 Experiments
Our experiments evaluate the ability of TraPSNet to perform zero-shot transfer, as well as compare its anytime performance to training from scratch.
Domains: We use two RDDL benchmark domains from the International Probabilistic Planning Competition 2014: SysAdmin [Guestrin, Koller, and Parr 2001] and Game of Life (GoL) [Sanner 2010]. These are chosen because they are challenging due to their large state spaces and complex dynamics, yet amenable to our algorithm because their nonfluent is binary and their actions unary. Briefly, each SysAdmin problem has a network of computers (arranged in different topologies via the nonfluent connected), and the goal is to keep as many computers on as possible. The agent can reboot one computer in each step. Each Game of Life problem represents a grid world (of a different size). Each cell is alive or dead, and the agent can make one cell alive in each time step. The goal is to keep as many cells alive as possible.
Experimental Settings: For each domain, we train TraPSNet on randomly generated problem instances of small sizes and then test on benchmark problems of larger sizes. For SysAdmin, we use training problems with 10, 11, 12, 13, and 14 computers, and test on IPPC problems 5 to 10, which have 30 to 50 computers. For Game of Life, we use training problems with 9 cells each, and again test on IPPC problems 5 to 10, which have 16 to 30 cells. The largest test problems are SysAdmin 9 and 10, with a state space of size 2^50 and 50 available actions.
We use the same hyperparameters for all problem instances of all domains, in keeping with the spirit of domain-independent planning. The GAT layer of the state encoders uses a neighbourhood of 1. It takes in one feature per node as input and outputs 3 features per node. A fully connected layer then projects this into a 20-dimensional space for the object embeddings, which is also the dimensionality of the state embedding. The action and value decoders are 2-layer fully connected networks with an intermediate layer of size 20. All layers use a leaky ReLU activation as the nonlinearity.
TraPSNet is trained using RMSProp with a fixed learning rate. All models are written in TensorFlow and run on an Ubuntu 16.04 machine with Nvidia K40 GPUs.
Baselines & Evaluation Metrics:
To the best of our knowledge, no size-invariant transfer algorithm exists for RDDL domains. We compare against our base non-transfer algorithm, A3C. For fairness, we augment A3C with a GCN (which is already known to outperform plain A3C [Bajpai, Garg, and Mausam 2018]). We also compare against A3C-GAT, to verify whether the benefit is due to the GAT or to transfer. We measure the transfer capability of our model using the evaluation metrics from Bajpai et al. (2018). We measure the performance of our model at intermediate training times t by simulating the policy network up to the specified execution horizon and averaging the obtained values. This simulation is run 100 times to get a stable result; we call the resulting value V(t). We report α(t) = (V(t) − V_min) / (V_max − V_min), where V_max and V_min are the highest and lowest values obtained on the current problem by any planning algorithm at any time. α(t) signifies the fraction of best performance achieved at time t, and α(0) acts as a measure of zero-shot transfer.
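The metric described above – on a plausible reading of the description, the obtained value rescaled between the worst and best values achieved by any planner on that problem – is straightforward to compute. The returns and bounds below are made-up numbers:

```python
import numpy as np

def fraction_of_best(v_t, v_min, v_max):
    """Rescale a policy's value between the worst (v_min) and best (v_max)
    values achieved by any planner on this problem, at any time."""
    return (v_t - v_min) / (v_max - v_min)

# Average the policy's return over simulated runs up to the horizon
# (the paper uses 100 runs; four made-up runs shown here for brevity).
returns = np.array([8.0, 10.0, 9.0, 9.5])
v = returns.mean()                           # 9.125
score = fraction_of_best(v, v_min=2.0, v_max=12.0)
```

A score near 1 at time 0 is what the paper calls good zero-shot transfer.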
Results: Figure 2 compares the training curves of TraPSNet and the two baselines on two problems from each domain (the first two graphs are for SysAdmin). The curves plot performance as a function of training time. Comparing the baselines, we notice that a GAT improves performance over a GCN for SysAdmin. We also observe that TraPSNet demonstrates excellent zero-shot transfer, obtaining a very high initial reward. It is vastly superior to the baselines, which before training can only act randomly. As training time increases, TraPSNet's anytime performance remains better than or very close to the baselines for most problems. In many cases, the baselines after 6 hours cannot even match TraPSNet's performance at the start. This underscores the importance of our transfer algorithm.
The detailed results for all problems (at three training times) are reported in Table 1, which corroborates these observations. An exception is GoL problem 10, where, after excellent initial transfer, TraPSNet's performance does not match that of the other baselines. Further investigation reveals that all training (and other test) problems in the domain are square grids, whereas problem 10 is the only rectangular grid (10×3). We suspect that the training has somehow overfit to the squareness of the grid.
6 Conclusions
We present TraPSNet, the first neural transfer algorithm for RDDL MDPs that can train on small problems of a domain and transfer to larger ones. This requires TraPSNet to maintain size-invariant parameter sets, which is achieved by pooling over object embeddings and by a parameter-tied action decoder that projects objects onto their corresponding actions. Experiments show vastly superior performance compared to training from scratch.
Our work brings the classical formulation of Relational MDPs back to the fore. We believe neural latent spaces may overcome the limitations of the traditional sum-of-symbolic-basis-functions representation used previously for this problem. While we demonstrate results for a specific kind of Relational MDP, in the future we plan to study the robustness and generality of this approach for other types of RDDL domains.
References
[Bajpai, Garg, and Mausam 2018] Bajpai, A.; Garg, S.; and Mausam. 2018. Transfer of deep reactive policies for MDP planning. In NIPS.
[Boutilier, Reiter, and Price 2001] Boutilier, C.; Reiter, R.; and Price, B. 2001. Symbolic dynamic programming for first-order MDPs. In IJCAI, 690–700.

[Groshev et al. 2018] Groshev, E.; Tamar, A.; Goldstein, M.; Srivastava, S.; and Abbeel, P. 2018. Learning generalized reactive policies using deep neural networks. In ICAPS.
[Grzes, Hoey, and Sanner 2014] Grzes, M.; Hoey, J.; and Sanner, S. 2014. International Probabilistic Planning Competition (IPPC) 2014. In ICAPS.
[Guestrin et al. 2003] Guestrin, C.; Koller, D.; Gearhart, C.; and Kanodia, N. 2003. Generalizing plans to new environments in relational MDPs. In IJCAI, 1003–1010.

[Guestrin, Koller, and Parr 2001] Guestrin, C.; Koller, D.; and Parr, R. 2001. Max-norm projections for factored MDPs. In IJCAI, 673–682.
[Issakkimuthu, Fern, and Tadepalli 2018] Issakkimuthu, M.; Fern, A.; and Tadepalli, P. 2018. Training deep reactive policies for probabilistic planning problems. In ICAPS.
[Kipf and Welling 2017] Kipf, T. N., and Welling, M. 2017. Semi-supervised classification with graph convolutional networks. In ICLR.
[Mausam and Kolobov 2012] Mausam, and Kolobov, A. 2012. Planning with Markov Decision Processes: An AI Perspective. Morgan & Claypool Publishers.

[Mnih et al. 2016] Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T. P.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In ICML, 1928–1937.
[Sanner 2010] Sanner, S. 2010. Relational Dynamic Influence Diagram Language (RDDL): Language Description.
[Tamar et al. 2017] Tamar, A.; Wu, Y.; Thomas, G.; Levine, S.; and Abbeel, P. 2017. Value iteration networks. In IJCAI, 4949–4953.

[Toyer et al. 2018] Toyer, S.; Trevizan, F. W.; Thiébaux, S.; and Xie, L. 2018. Action schema networks: Generalised policies with deep learning. In AAAI.
[Velickovic et al. 2017] Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; and Bengio, Y. 2017. Graph attention networks. CoRR abs/1710.10903.
[Younes et al. 2005] Younes, H. L. S.; Littman, M. L.; Weissman, D.; and Asmuth, J. 2005. The first probabilistic track of the international planning competition. J. Artif. Intell. Res. 24:851–887.

[Zhang and Wallace 2017] Zhang, Y., and Wallace, B. C. 2017. A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. In IJCNLP, 253–263.