1 Introduction
Deep neural networks (DNNs) have significantly improved the ability of autonomous systems to perform complex tasks, such as image recognition Krizhevsky2012, speech recognition Deng2013 and natural language processing Collobert2011, and can outperform humans and human-designed superhuman systems in complex planning tasks such as Go Alphago2016 and Chess Alphazero2017.
In the area of learning and planning, recent work on HDMILPPlan Say2017 has explored a two-stage framework that (i) learns transition models from data with ReLU-based DNNs and (ii) plans optimally with respect to the learned transition models using Mixed-Integer Linear Programming, but did not provide encodings that are able to learn and plan with discrete state variables. As an alternative to ReLU-based DNNs, Binarized Neural Networks (BNNs) Hubara2016 have been introduced with the specific ability to learn compact models over discrete variables, providing a new formalism for transition learning and planning in factored Boutilier1999 discretized state and action spaces that we explore in this paper. However, planning with these BNN transition models poses two non-trivial questions: (i) What is the most efficient compilation of BNNs for planning in domains with factored state and (concurrent) action spaces? (ii) Given that BNNs may learn incorrect domain models, how can a planner repair BNN compilations to improve their planning accuracy (or prove that retraining the BNN is necessary)?
To answer question (i), we present two novel compilations of the learned factored planning problem with BNNs based on reductions to Weighted Partial Maximum Boolean Satisfiability (FDSATPlan+) and Binary Linear Programming (FDBLPPlan+). Theoretically, we show that the SAT-based BiDirectional Neuron Activation Encoding is asymptotically the most compact encoding in the literature Boudane2018 and has the generalized arc-consistency property through unit propagation. Experimentally, we demonstrate the computational efficiency of our BiDirectional Neuron Activation Encoding compared to the existing neuron activation encoding Say2018. Then, we test the effectiveness of learning complex state transition models with BNNs, and test the runtime efficiency of both FDSATPlan+ and FDBLPPlan+ on the learned factored planning problems over four factored planning domains with multiple size and horizon settings.
While there are methods for learning PDDL models from data Yang2007; Amir2008 and excellent PDDL planners Helmert2006; Richter2010, we remark that BNNs are strictly more expressive than PDDL-based learning paradigms for learning concurrent effects in factored action spaces that may depend on the joint execution of one or more actions.
Furthermore, while Monte Carlo Tree Search (MCTS) methods Kocsis2006; Keller2013, including AlphaGo Alphago2016 and AlphaGoZero Alphago2016, could technically plan with a BNN-learned black-box model of transition dynamics, unlike this work, they would not be able to exploit the BNN transition structure and they would not be able to provide optimality guarantees with respect to the learned model.
To answer question (ii), we introduce a finite-time incremental algorithm based on generalized landmark constraints from the decomposition-based cost-optimal classical planner Davies2015, where we detect and constrain invalid sets of action selections from the decision space of the planners and efficiently improve their planning accuracy.
In summary, this work provides the first two planners capable of learning complex transition models in domains with mixed (continuous and discrete) factored state and action spaces as BNNs, and capable of exploiting their structure in Weighted Partial Maximum Boolean Satisfiability and Binary Linear Programming encodings for planning purposes. Theoretically, we show the compactness and efficiency of our SAT-based encoding, and the finiteness of our incremental algorithm. Empirical results show the computational efficiency of our new BiDirectional Neuron Activation Encoding, demonstrate strong performance for FDSATPlan+ and FDBLPPlan+ in both the learned and original domains, and provide a new transition learning and planning formalism to the data-driven model-based planning community.
2 Preliminaries
Before we present the Weighted Partial Maximum Boolean Satisfiability (WPMaxSAT) and Binary Linear Programming (BLP) compilations of the learned planning problem, we review the preliminaries motivating this work. We begin this section by describing the formal notation and the problem definition used in this work.
2.1 Problem Definition
A deterministic factored planning problem is a tuple consisting of a mixed set of state variables with discrete and continuous domains, a mixed set of action variables with discrete and continuous domains, a function that returns true if the action and state variables satisfy the global constraints, the stationary transition function, and the reward function. Finally, the tuple includes the initial state constraints, which assign values to all state variables, and the goal state constraints, which are defined over a subset of the state variables.
Given a planning horizon, a solution (i.e., plan) to the problem is a value assignment to the action and state variables such that the transition function and the global constraints hold at every time step, and the initial and goal state constraints are satisfied. Similarly, given a planning horizon, an optimal solution to the problem is a plan that maximizes the total reward.
Next, we introduce an example domain with a complex transition structure.
2.2 Example Domain: Cellda
Influenced by the famous video game The Legend of Zelda Nintendo1986, the Cellda domain models an agent in a two-dimensional (4-by-4) dungeon cell. As visualized by Figure 1, the agent Cellda (C) must escape a dungeon through an initially locked door (D) by obtaining its key (K) without getting hit by her enemy (E). The gridworld-like dungeon is made up of two types of cells: i) regular cells (blank) on which Cellda and her enemy can move from/to deterministically up, down, right or left, and ii) blocks (B) that neither Cellda nor her enemy can walk through. The state variables of this domain include two integer variables for describing the location of Cellda, two integer variables for describing the location of the enemy, one boolean variable for describing whether the key is obtained or not, and one boolean variable for describing whether Cellda is alive or not. The action variables of this domain include four mutually exclusive boolean variables for describing the movement of Cellda (i.e., up, down, right or left). The enemy has an adversarial deterministic policy, unknown to Cellda, that tries to minimize the total Manhattan distance between itself and Cellda by breaking the symmetry first in the vertical axis. The goal of this domain is to learn the unknown policy of the enemy from previous plays (i.e., data) and escape the dungeon without getting hit. The complete description of this domain can be found in Appendix C.
Given that the state transition function that describes the location of the enemy must be learned, a planner that fails to learn the adversarial policy of the enemy E (e.g., as visualized in Figure 1(a)-(c)) will get hit by the enemy. In contrast, a planner that learns the adversarial policy of the enemy E (e.g., as visualized in Figure 1(d)-(f)) can avoid getting hit by the enemy in this scenario by waiting for two time steps to trap her enemy (who will try to move up for the remaining time steps and fail to intercept Cellda).
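The enemy behaviour described above can be illustrated with a minimal sketch. The exact rules are given only in the appendix, so everything here is an assumption made for illustration: the coordinate convention, the function name `enemy_step`, and the reading that the enemy attempts a move along its preferred axis first and simply stays in place when that move is blocked (which is what allows Cellda to trap it).

```python
def enemy_step(enemy, cellda, blocks, size=4, first_axis="y"):
    """One move of a hypothetical greedy enemy on a size-by-size grid.

    Positions are (x, y) tuples. The enemy tries to close the gap along
    `first_axis`; only if that gap is already zero does it use the other axis.
    A move into a block (or off the grid) fails and the enemy stays put.
    """
    ex, ey = enemy
    cx, cy = cellda
    dx = (cx > ex) - (cx < ex)  # -1, 0 or +1 toward Cellda on the x axis
    dy = (cy > ey) - (cy < ey)  # -1, 0 or +1 toward Cellda on the y axis
    if first_axis == "y":
        move = (0, dy) if dy != 0 else (dx, 0)
    else:
        move = (dx, 0) if dx != 0 else (0, dy)
    target = (ex + move[0], ey + move[1])
    inside = 0 <= target[0] < size and 0 <= target[1] < size
    if not inside or target in blocks:
        return enemy  # blocked: the enemy wastes its move
    return target
```

Under these assumptions, an enemy that is separated from Cellda vertically by a block keeps attempting the same failed vertical move, which matches the trapping behaviour described for Figure 1(d)-(f).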
To solve this problem, next we describe a learning and planning framework that i) learns an unknown transition function from data, and ii) plans optimally with respect to the learned deterministic factored planning problem.
2.3 Factored Planning with Deep Neural Network Learned Transition Models
Factored planning with DNN learned transition models is a two-stage framework for learning and solving nonlinear factored planning problems, first introduced in HDMILPPlan Say2017, that we briefly review now. Given samples of state transition data, the first stage of the HDMILPPlan framework learns the transition function using a DNN with Rectified Linear Units (ReLUs) Nair2010 and linear activation units. In the second stage, the learned transition function is used to construct the learned factored planning problem. That is, the trained DNN with fixed weights is used to predict the state at the next time step from the free state and action variables at the current time step. As visualized in Figure 2, the learned transition function is sequentially chained over the planning horizon and compiled into a Mixed-Integer Linear Program, yielding the planner HDMILPPlan Say2017. Since HDMILPPlan utilizes only ReLUs and linear activation units in its learned transition models, the state variables are restricted to have only continuous domains.
Next, we describe an efficient DNN structure for learning discrete models, namely Binarized Neural Networks (BNNs) Hubara2016.
2.4 Binarized Neural Networks
Binarized Neural Networks (BNNs) are neural networks with binary weights and activation functions Hubara2016. As a result, BNNs naturally learn discrete models by replacing most arithmetic operations with bitwise operations. Before we describe how BNN learned transitions relate to HDMILPPlan in Figure 2, we first provide a technical description of the BNN architecture, where BNN layers are stacked in the following order:
Real or Binary Input Layer
Binary units in all layers, with the exception of the first layer, receive binary input. When the input of the first layer has real-valued domains, a fixed number of bits of precision can be used for a practical representation Hubara2016.
Binarization Layer
Given input of binary unit at layer the deterministic activation function used to compute output is: if , otherwise, where denotes the number of layers and denotes the set of binary units in layer .
Batch Normalization Layer
For all layers, Batch Normalization Ioffe2015 is a method for transforming the weighted sum of the outputs of one layer into the input of a binary unit in the next layer. Its parameters are the weight, input mean, input variance, numerical stability constant (i.e., epsilon), input scaling and input bias, all of which are computed at training time.
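The Binarization and Batch Normalization layers described above can be sketched for a single binary unit as follows. This is a minimal illustration rather than the authors' implementation: it assumes the standard batch normalization formula and a sign-style activation that outputs +1 when its input is non-negative and -1 otherwise, and the function names are hypothetical.

```python
import math

def batch_norm(weighted_sum, mean, variance, epsilon, scale, bias):
    """Standard batch normalization of a weighted sum (assumed form)."""
    return scale * (weighted_sum - mean) / math.sqrt(variance + epsilon) + bias

def binarize(value):
    """Assumed deterministic activation: +1 if value >= 0, otherwise -1."""
    return 1 if value >= 0 else -1

def binary_unit(inputs, weights, mean, variance, epsilon, scale, bias):
    """Output of one binary unit given +1/-1 inputs and +1/-1 weights."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    normalized = batch_norm(weighted_sum, mean, variance, epsilon, scale, bias)
    return binarize(normalized)
```

Because every parameter is fixed after training, `batch_norm` is an affine function of the weighted sum, so the unit's output depends only on whether enough weighted inputs agree in sign.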
In order to place BNN-learned transition models in the same planning and learning framework of HDMILPPlan Say2017, we simply note that once the above BNN layers are learned, the Batch Normalization layers reduce to simple linear transforms. This results in a BNN as visualized in Figure 2, where (i) all weights are restricted to either +1 or -1 and (ii) all nonlinear transfer functions at BNN units are restricted to thresholded counts of inputs. The benefit of the BNN encoding over the ReLU-based DNNs of HDMILPPlan Say2017 is that it can directly model discrete variable transitions, and BNNs can be translated to both Binary Linear Programming (BLP) and Weighted Partial Maximum Boolean Satisfiability (WPMaxSAT) problems, discussed next.
2.5 Weighted Partial Maximum Boolean Satisfiability Problem
In this work, one of the planning encodings that we focus on is Weighted Partial Maximum Boolean Satisfiability (WPMaxSAT). WPMaxSAT is the problem of finding a value assignment to the variables of a Boolean formula that consists of hard clauses and weighted soft clauses such that i) all hard clauses evaluate to true (i.e., standard SAT) Davis1960, and ii) the total weight of the unsatisfied soft clauses is minimized. While WPMaxSAT is known to be NP-hard, state-of-the-art WPMaxSAT solvers are experimentally shown to scale well for large instances Davies2013.
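To make the definition concrete, the following is a small exhaustive-search sketch of WPMaxSAT. It is only meant to show the semantics of hard and weighted soft clauses on tiny instances and is unrelated to the solvers used in the experiments; the clause representation (signed integers for literals) and the function names are assumptions made for this example.

```python
from itertools import product

def satisfied(clause, assignment):
    """A clause is a list of non-zero ints: k means variable k, -k its negation."""
    return any(assignment[abs(lit)] == (lit > 0) for lit in clause)

def solve_wpmaxsat(num_vars, hard, soft):
    """Return (penalty, assignment) with minimum total weight of unsatisfied
    soft clauses, or None if the hard clauses cannot all be satisfied.
    `soft` is a list of (weight, clause) pairs."""
    best = None
    for values in product([False, True], repeat=num_vars):
        assignment = dict(enumerate(values, start=1))
        if not all(satisfied(clause, assignment) for clause in hard):
            continue  # violates a hard clause, so not a feasible assignment
        penalty = sum(w for w, clause in soft if not satisfied(clause, assignment))
        if best is None or penalty < best[0]:
            best = (penalty, assignment)
    return best

# Hard clauses: exactly one of x1, x2 is true. Soft clauses prefer each variable.
hard = [[1, 2], [-1, -2]]
soft = [(3, [1]), (5, [2])]
result = solve_wpmaxsat(2, hard, soft)
```

In this toy instance only one soft clause can be satisfied, so the search keeps the heavier one and pays the weight of the lighter one.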
2.6 Boolean Cardinality Constraints
When compiling BNNs to satisfiability encodings, it is critical to encode the counting (cardinality) threshold of the binarization layer as compactly as possible. Boolean cardinality constraints describe bounds on the number of Boolean variables that are allowed to be true, and are in the form of . Cardinality Networks provide an efficient encoding in conjunctive normal form (CNF) for counting an upper bound on the number of true assignments to Boolean variables using auxiliary Boolean counting variables such that holds for all where Asin2009 . The detailed CNF encoding of is outlined in A. Given , Boolean cardinality constraint is defined as
(1) 
where is the size of additional input variables. Boolean cardinality constraint is encoded using number of variables and hard clauses Asin2009 .
Similarly, Boolean cardinality constraints of the form are encoded given the CNF encoding of the Cardinality Networks that count a lower bound on the number of true assignments to Boolean variables such that holds for all . The detailed CNF encoding of is also outlined in A. Given , Boolean cardinality constraint is defined as follows.
(2) 
Note that the cardinality constraint is equivalent to . Since Cardinality Networks require the value of to be less than or equal to , Boolean cardinality constraints of the form with must be converted into .
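The conversion mentioned above can be checked by brute force on small inputs. The sketch below assumes the conversion is the usual complement identity, namely that at most p of n Boolean variables are true exactly when at least n - p of their negations are true; the function names are illustrative only.

```python
from itertools import product

def at_most(bits, p):
    """True if at most p of the 0/1 values in bits are 1."""
    return sum(bits) <= p

def at_least(bits, p):
    """True if at least p of the 0/1 values in bits are 1."""
    return sum(bits) >= p

def complement_identity_holds(n):
    """Check, for every assignment and every bound p, that
    'at most p true' agrees with 'at least n - p negations true'."""
    for bits in product([0, 1], repeat=n):
        negated = [1 - b for b in bits]
        for p in range(n + 1):
            if at_most(bits, p) != at_least(negated, n - p):
                return False
    return True
```

The same style of check confirms that an exact count is the conjunction of the corresponding upper and lower bounds.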
Finally, a Boolean cardinality constraint is generalized arc-consistent if and only if, for every value assignment to every Boolean variable in its scope, there exists a feasible value assignment to all the remaining Boolean variables. In practice, the ability to maintain generalized arc-consistency (GAC) through efficient algorithms such as unit propagation (as opposed to search) is one of the most important properties for the efficiency of a Boolean cardinality constraint encoded in CNF Sinz2005; Bailleux2006; Asin2009; Jabbour2014. It has been shown that both the upper-bound and lower-bound Cardinality Network encodings maintain GAC through unit propagation Asin2009.
2.7 Binary Linear Programming Problem
As an alternative to WPMaxSAT encodings of BNN transition models, we can also leverage Binary Linear Programs (BLPs). The BLP problem requires finding the optimal value assignment to the variables of a mathematical model with linear constraints, a linear objective function, and binary decision variables. Similar to WPMaxSAT, BLP is NP-hard. State-of-the-art BLP solvers IBM2017 utilize branch-and-bound algorithms and can handle cardinality constraints efficiently in the size of their encoding.
2.8 Generalized Landmark Constraints
In this section, we review generalized landmark constraints, which are necessary for improving the planning accuracy of the learned models when infeasible plans are generated. A generalized landmark constraint is a linear inequality over a set of action landmarks and their counts, that is, the minimum number of times each action must occur in a plan Davies2015. The decomposition-based planner OpSeq Davies2015 incrementally updates generalized landmark constraints to find cost-optimal plans to classical planning problems.
3 Weighted Partial Maximum Boolean Satisfiability Compilation of the Learned Factored Planning Problem
In this section, we show how to reduce the learned factored planning problem with BNNs into WPMaxSAT, which we denote as Factored Deep SAT Planner (FDSATPlan+). FDSATPlan+ uses the same learning and planning framework as HDMILPPlan Say2017, as visualized in Figure 2, where the ReLU-based DNN is replaced by a BNN Hubara2016 and the learned problem is compiled into a WPMaxSAT instance instead of a Mixed-Integer Linear Program (MILP).
3.1 Propositional Variables
First, we describe the set of propositional variables used in FDSATPlan+. We use three sets of propositional variables: action variables, state variables and BNN binary units, where variables use a bitwise encoding.

denotes if th bit of action is executed at time step .

denotes if th bit of state is true at time step .

denotes if BNN binary unit at layer is activated at time step .
3.2 Parameters
Next we define the additional parameters used in FDSATPlan+.

is the initial (i.e., at ) value of the th bit of state variable .

is the function that maps the th bit of a state or an action variable to the corresponding binary unit in the input layer of the BNN such that where .

is the function that maps the th bit of a state variable to the corresponding binary unit in the output layer of the BNN such that where .
The global constraints and goal state constraints are in the form of , and the reward function is in the form of for state and action variables where and .
3.3 The WPMaxSAT Compilation
Below, we define the WPMaxSAT encoding of the learned factored planning problem with BNNs. First, we present the hard clauses (i.e., clauses that must be satisfied) used in FDSATPlan+.
3.3.1 Initial State Clauses
The following conjunction of hard clauses encode the initial state constraints .
(3) 
where hard clause (3) sets the initial values of the state variables at the first time step.
3.3.2 BiDirectional Neuron Activation Encoding
Previous work Say2018 presented a neuron activation encoding for BNNs that is not efficient with respect to its encoding size using number of variables and hard clauses, and the computational effort required to maintain GAC. In this section, we present an efficient CNF encoding to model the activation behaviour of BNN binary unit that requires only variables and hard clauses, and maintains GAC through unit propagation.
Given input , activation threshold and binary activation function if , else , the output of a binary neuron can be efficiently encoded in CNF by defining the Boolean variable to represent the activation of the binary neuron such that if and only if , combining the cardinality networks and to count the number of variables from set that are assigned to true, and adding the unit hard clauses for the auxiliary input variables in conjunction with a bidirectional activation hard clause as follows.
(4) 
In order to efficiently combine the cardinality networks, and , and reduce the number of variables required in half, the conjunction of hard clauses defined in A are taken representing
rather than naively taking the conjunction of the set of hard clauses in and using two separate sets of auxiliary Boolean counting variables Asin2009 . Intuitively, cardinality networks and together count by combining the respective bounds and for all .
Instead of the neuron activation encoding used in prior work Say2018 that utilizes two separate sets of auxiliary Boolean counting variables , where and are encoded with two different sets of auxiliary Boolean counting variables, the BiDirectional encoding we use here shares the same set of decision variables. Further, while the neuron activation encoding of prior work Say2018 uses Sequential Counters Sinz2005 for encoding the cardinality constraints using number of variables and hard clauses, the BiDirectional encoding presented here uses only number of variables and hard clauses as we will show by Lemma 1. Finally, the previous neuron activation encoding Say2018 has been shown not to preserve the GAC property through unit propagation Boudane2018 in contrast to the BiDirectional Neuron Activation Encoding we use here which we show preserves GAC in Theorem 1. From here out for notational clarity, we will refer to the conjunction of hard clauses in BiDirectional Neuron Activation Encoding as . Furthermore, we will refer to the previous neuron activation encoding Say2018 as the UniDirectional Neuron Activation Encoding.
3.3.3 BNN Clauses
Given the efficient CNF encoding , we present the conjunction of hard clauses to model the complete BNN model.
(5)  
(6)  
(7)  
(8)  
(9) 
where the activation constants in hard clauses (8)-(9) are computed using the batch normalization parameters of each binary unit at training time. The computation ensures that the activation constant is less than or equal to half the size of the previous layer, as the BiDirectional Neuron Activation Encoding only counts up to that bound.
Hard clauses (5)-(6) map the binary units at the input layer of the BNN to a unique state or action variable, respectively. Similarly, hard clause (7) maps the binary units at the output layer of the BNN to a unique state variable. Hard clauses (8)-(9) encode the binary activation of every unit in the BNN.
3.3.4 Global Constraint Clauses
The following conjunction of hard clauses encode the global constraints .
(10) 
where hard clause (10) represents domain-dependent global constraints on state and action variables. Some common examples of global constraints, such as mutual exclusion on Boolean action variables and one-hot encodings for the output of the BNN (i.e., exactly one Boolean state variable must be true), are respectively encoded by hard clauses (11)-(12) as follows.
(11)
(12)
In general, linear global constraints, such as bounds on state and action variables, can be encoded in CNF when the coefficients are positive integers and the decision variables have non-negative integer domains Abio2014.
3.3.5 Goal State Clauses
The following conjunction of hard clauses encode the goal state constraints .
(13) 
where hard clause (13) sets the goal constraints on the state variables at the final time step.
3.3.6 Reward Clauses
Given the form of the reward function for each time step defined in Section 3.2, the following weighted soft clauses (i.e., optional weighted clauses that may or may not be satisfied, where each weight corresponds to the penalty of not satisfying a clause):
(14)
can be written to represent the reward function, where the weights of the soft clauses are defined for each bit of the action and state variables, respectively.
4 Binary Linear Programming Compilation of the Learned Factored Planning Problem
Given FDSATPlan+, we present the Binary Linear Programming (BLP) compilation of the learned factored planning problem with BNNs, which we denote as Factored Deep BLP Planner (FDBLPPlan+).
4.1 Binary Variables and Parameters
FDBLPPlan+ uses the same set of decision variables and parameters as FDSATPlan+.
4.2 The BLP Compilation
FDBLPPlan+ replaces hard clauses (3) and (5)-(7) with equivalent linear constraints as follows.
(15)  
(16)  
(17)  
(18) 
Given the activation constant of each binary unit, FDBLPPlan+ replaces hard clauses (8)-(9), which represent the activation of that unit, with the following linear constraints:
(19)  
(20) 
where .
5 Incremental Factored Planning Algorithm for FDSATPlan+ and FDBLPPlan+
Given that the plans found for the learned factored planning problem by FDSATPlan+ and FDBLPPlan+ can be infeasible for the original factored planning problem, we introduce an incremental algorithm for finding plans for the original problem by iteratively excluding invalid plans from the search space of FDSATPlan+ and FDBLPPlan+. Similar to OpSeq Davies2015, FDSATPlan+ and FDBLPPlan+ are updated with the following generalized landmark hard clauses or constraints
(22)  
(23) 
respectively, defined over the set of action bits executed at each time step in the current iteration of the algorithm outlined by Algorithm 1.
For a given horizon, Algorithm 1 iteratively computes a set of actions, or returns infeasibility for the learned factored planning problem. If the set of actions is non-empty, we evaluate whether it is a valid plan for the original factored planning problem (i.e., line 3) either in the actual domain or using a high-fidelity domain simulator, in our case RDDLsim Sanner2010. If the set of actions constitutes a plan for the original problem, Algorithm 1 returns it as a plan. Otherwise, the planner is updated with the new set of generalized landmarks to exclude the invalid set of actions and the loop repeats. Since the original action space is discretized and represented up to a fixed number of bits of precision, Algorithm 1 can be shown to terminate in a finite number of iterations by constructing an inductive proof similar to the termination criteria of OpSeq, where either a feasible plan for the original problem is returned or there does not exist a plan for both the learned and the original problem for the given horizon. The outline of the proof can be found in Appendix B.
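The loop described above can be sketched as follows. This is a simplified illustration, not a transcription of Algorithm 1: the planner and the validity check are passed in as hypothetical callables (`solve_learned` and `is_valid`), and excluding a plan is modelled by keeping a list of rejected action sets rather than by adding landmark clauses or constraints to a solver.

```python
def incremental_plan(solve_learned, is_valid):
    """Repeatedly plan in the learned model and reject invalid plans.

    solve_learned(excluded) returns a plan (any hashable set of action bits)
    that differs from every plan in `excluded`, or None if no such plan exists.
    is_valid(plan) checks the plan against the original domain or a simulator.
    """
    excluded = []  # invalid action selections found so far
    while True:
        plan = solve_learned(excluded)
        if plan is None:
            return None  # the learned problem has no remaining plan
        if is_valid(plan):
            return plan
        excluded.append(plan)  # exclude this plan and search again

# Toy usage: two candidate plans, only the second is valid in the "real" domain.
candidates = [
    frozenset({("up", 1), ("up", 2)}),
    frozenset({("up", 1), ("right", 2)}),
]

def toy_solver(excluded):
    for plan in candidates:
        if plan not in excluded:
            return plan
    return None

def toy_validator(plan):
    return ("right", 2) in plan

found = incremental_plan(toy_solver, toy_validator)
```

With a finite set of candidate plans, each iteration removes one candidate, which is the intuition behind the termination argument.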
Next, we present a theoretical analysis of BiDirectional Neuron Activation Encoding.
6 Theoretical Results
We now present some theoretical results on BiDirectional Neuron Activation Encoding with respect to its encoding size and the number of variables used.
Lemma 1 (Encoding Size).
BiDirectional Neuron Activation Encoding requires variables and hard clauses.
Proof.
The BiDirectional Neuron Activation Encoding shares the same set of variables as and encodings with the exception of variable , and both and utilize variables Asin2009 . Further, BiDirectional Neuron Activation Encoding takes the conjunction of hard clauses for the base cases of Half Merging Networks and Simplified Merging Networks of and , which can only increase the number of hard clauses required by a multiple of a linear constant (i.e., at most 2 times). Similarly, BiDirectional Neuron Activation Encoding takes the conjunction of hard clauses for the recursive cases of Half Merging Networks and Simplified Merging Networks, which can only increase the number of hard clauses required by a multiple of a linear constant (i.e., at most 2 times). Given BiDirectional Neuron Activation Encoding uses the same recursion structure as and , the number of hard clauses used in BiDirectional Neuron Activation Encoding is asymptotically bounded by the total encoding size of plus , which is still . ∎
Next, we will prove that BiDirectional Neuron Activation Encoding has the GAC property through unit propagation, which is considered to be one of the most important theoretical properties that facilitate the efficiency of a Boolean cardinality constraint encoded in CNF Sinz2005 ; Bailleux2006 ; Asin2009 ; Jabbour2014 .
Definition 1 (Generalized Arc-Consistency of Neuron Activation Encoding).
A neuron activation encoding has the generalized arc-consistency property through unit propagation if and only if unit propagation is sufficient to deduce the following:

For any set with size , value assignment to variables , and for all , the remaining variables from the set are assigned to true,

For any set with size , value assignment to variables , and for all , the remaining variables from the set are assigned to false,

Partial value assignment of variables from to true assigns variable , and

Partial value assignment of variables from to false assigns variable
where denotes the size of set .
Theorem 1 (Generalized Arc-Consistency of the BiDirectional Neuron Activation Encoding).
BiDirectional Neuron Activation Encoding has the generalized arc-consistency (GAC) property through unit propagation.
Proof.
To show maintains GAC property through unit propagation, we need to show exhaustively for all four cases of Definition 1 that unit propagation is sufficient to maintain the GAC.
Case 1 ( where , and by unit propagation): When , unit propagation assigns using the hard clause . Given value assignment to variables for any set with size , it has been shown that unit propagation will set the remaining variables from the set to true using the conjunction of hard clauses that encode Asin2009 excluding the unit clause ().
Case 2 ( where , and by unit propagation): When , unit propagation assigns using the hard clause . Given value assignment to variables for any set with size , it has been shown that unit propagation will set the remaining variables from the set to false using the conjunction of hard clauses that encode Asin2009 excluding the unit clause ().
Cases 3 ( where , by unit propagation) When variables from the set are set to true, it has been shown that unit propagation assigns the counting variable using the conjunction of hard clauses that encode Asin2009 excluding the unit clause (). Given the assignment , unit propagation assigns using the hard clause .
Cases 4 ( where , by unit propagation) When variables from the set are set to false, it has been shown that unit propagation assigns the counting variable using the conjunction of hard clauses that encode Asin2009 excluding the unit clause (). Given the assignment , unit propagation assigns using the hard clause . ∎
We now discuss the importance of our theoretical results in the context of both related work and the contributions of our paper. Amongst the state-of-the-art CNF encodings Boudane2018 that preserve GAC through unit propagation for this type of constraint, BiDirectional Neuron Activation Encoding uses the smallest number of variables and hard clauses. The previous state-of-the-art CNF encoding for this constraint is an extension of Sorting Networks Een06 Boudane2018. In contrast, BiDirectional Neuron Activation Encoding is an extension of Cardinality Networks Asin2009, with the encoding size given by Lemma 1, while maintaining GAC through unit propagation as per Theorem 1.
7 Experimental Results
In this section, we evaluate the effectiveness of factored planning with BNNs. First, we present the benchmark domains used to test the efficiency of our learning and factored planning framework with BNNs. Second, we present the accuracy of BNNs to learn complex state transition models for factored planning problems. Third, we compare the runtime efficiency of BiDirectional Neuron Activation Encoding against the existing UniDirectional Neuron Activation Encoding Say2018 . Fourth, we test the efficiency and scalability of planning with FDSATPlan+ and FDBLPPlan+ on the learned factored planning problems across multiple problem sizes and horizon settings. Finally, we demonstrate the effectiveness of Algorithm 1 to find a plan for the factored planning problem .
7.1 Domain Descriptions
The RDDL Sanner2010 formalism is extended to handle goal specifications and used to describe the factored planning problem. Below, we summarize the extended deterministic RDDL domains used in the experiments, namely Navigation Sanner2011, Inventory Control (Inventory) Mann2014, System Administrator (SysAdmin) Guestrin2001; Sanner2011, and Cellda Nintendo1986. A detailed presentation of the RDDL domains and instances is provided in Appendix C.
Navigation
Models an agent in a two-dimensional maze with obstacles where the goal of the agent is to move from the initial location to the goal location at the end of the horizon. The transition function describes the movement of the agent as a function of the topological relation of its current location to the maze, the moving direction and whether the location the agent tries to move to is an obstacle or not. This domain is a deterministic version of its original from IPPC2011 Sanner2011. Both the action and the state space are Boolean. We report the results on instances with three maze sizes (the instances labelled Navigation(3), Navigation(4) and Navigation(5) in Table 1) and three horizon settings per maze size.
Inventory
Describes the inventory management control problem with alternating demands for a product over time, where the management can order a fixed amount of units to increase the number of units in stock at any given time. The transition function updates the state based on the change in stock as a function of demand, the time, the current order quantity, and whether an order has been made or not. The action space is Boolean (either order a fixed positive integer amount, or do not order) and the state space is non-negative integer. We report the results on instances with two demand cycle lengths (the instances labelled Inventory(2) and Inventory(4) in Table 1) and three horizon settings per demand cycle length.
SysAdmin
Models the behavior of a computer network of a given size where the administrator can reboot a limited number of computers to keep the number of computers running above a specified safety threshold over time. The transition function describes the status of a computer, which depends on its topological relation to other computers, its age and whether it has been rebooted or not, and the age of the computer, which depends on its current age and whether it has been rebooted or not. This domain is a deterministic modified version of its original from IPPC2011 Sanner2011. The action space is Boolean and the state space is non-negative integer, where concurrency between actions is allowed. We report the results on instances with two network sizes (the instances labelled SysAdmin(4) and SysAdmin(5) in Table 1) and three horizon settings per network size.
Cellda
Models an agent in a two-dimensional (4-by-4) dungeon cell. The agent Cellda must escape her cell through an initially locked door by obtaining the key without getting hit by her enemy. Each grid of the cell is one of two types: i) regular grids which Cellda and her enemy can move from (or to) deterministically up, down, right or left, and ii) blocks that neither Cellda nor her enemy can stand on. The enemy has a deterministic policy, unknown to Cellda, that tries to minimize the total Manhattan distance between itself and Cellda. Given the location of Cellda and the enemy, the adversarial deterministic policy will always try to minimize the distance between the two by trying to move the enemy along the axis that the policy prefers. The state space is mixed: integer to describe the locations of Cellda and the enemy, and Boolean to describe whether the key is obtained or not and whether Cellda is alive or not. The action space is Boolean for moving up, down, right or left. The transition function updates states as a function of the previous locations of Cellda and the enemy, the moving direction of Cellda, whether the key was obtained or not and whether Cellda was alive or not. We report results on instances with two adversarial deterministic policies (the instances labelled Cellda(x) and Cellda(y) in Table 1) and three horizon settings per policy.
7.2 Transition Learning Performance
In Table 1, we present test errors for different configurations of the BNNs on each domain instance, where the sample data was generated from the RDDL-based domain simulator RDDLsim Sanner2010 using a simple stochastic exploration policy. For each instance of a domain, state transition samples were collected and the data was treated as independent and identically distributed. After random permutation, the data was split into training and test sets with a 9:1 ratio. The BNNs were trained on a MacBook Pro with a 2.8 GHz Intel Core i7 and 16 GB memory using the code available from Hubara2016. Overall, Navigation instances required the smallest BNN structures for learning due to their purely Boolean state and action spaces, while Inventory, SysAdmin and Cellda instances required larger BNN structures for accurate learning, owing to their non-Boolean state and action spaces.
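The data preparation step described above (random permutation followed by a 9:1 split) can be sketched as follows. The function name, the fixed seed and the use of integer arithmetic for the cut point are assumptions made for this example; the paper does not specify these details.

```python
import random

def split_transitions(samples, train_parts=9, total_parts=10, seed=0):
    """Shuffle the samples and split them into training and test sets.

    With the defaults, nine tenths of the data go to training and the
    remaining tenth to testing.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # random permutation, reproducible via seed
    cut = (len(shuffled) * train_parts) // total_parts
    return shuffled[:cut], shuffled[cut:]

# Toy usage with integer placeholders standing in for state transition samples.
train_set, test_set = split_transitions(range(100))
```

Shuffling before splitting is consistent with treating the collected transitions as independent and identically distributed.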
Domain         Network Structure    Test Error (%)
Navigation(3)  13:36:36:9           0.0
Navigation(4)  20:96:96:16          0.0
Navigation(5)  29:128:128:25        0.0
Inventory(2)   7:96:96:5            0.018
Inventory(4)   8:128:128:5          0.34
SysAdmin(4)    16:128:128:12        2.965
SysAdmin(5)    20:128:128:128:15    0.984
Cellda(x)      12:128:128:4         0.645
Cellda(y)      1