1 Introduction
In many real-world planning problems with factored [Boutilier1999] and mixed discrete and continuous state and action spaces, such as Reservoir Control [Yeh1985], Heating, Ventilation and Air Conditioning (HVAC) [agarwal2010], and Navigation [nonlinear_path_planning], it is difficult to obtain a model of the complex nonlinear dynamics that govern state evolution. For example, in Reservoir Control, evaporation and other sources of water loss are a complex function of volume, bathymetry, and environmental conditions; in HVAC domains, thermal conductance between walls and convection properties of rooms are nearly impossible to derive from architectural layouts; and in Navigation problems, nonlinear interactions between surfaces and traction devices make it hard to accurately predict odometry.
A natural answer to these modeling difficulties is to instead learn the transition model from sampled data; fortunately, the presence of vast sensor networks often makes such data inexpensive and abundant. While learning nonlinear models with a priori unknown model structure can be very difficult in practice, recent progress in Deep Learning and the availability of off-the-shelf tools such as Tensorflow [abadi2016tensorflow] and Pytorch [paszke2017automatic] make it possible to learn highly accurate nonlinear deep neural networks with little prior knowledge of model structure.

However, modeling a nonlinear transition model as a deep neural network poses nontrivial difficulties for the optimal control task. Existing nonlinear planners either are not compatible with nonlinear deep network transition models and continuous action input [Penna2009, lohr, coles2013hybrid, ivankovic, piotrowski, scala2016interval], or only optimize goal-oriented objectives [Bryce2015, Scala20162, Cashmore2016]. Monte Carlo Tree Search (MCTS) methods [mcts, uct, keller_icaps13], including AlphaGo [silver2016mastering], that could exploit a deep-network-learned black-box model of transition dynamics do not inherently work with continuous action spaces due to the infinite branching factor. While MCTS with continuous action extensions such as HOOT [weinstein2012bandit]
have been proposed, their continuous partitioning methods do not scale to high-dimensional concurrent and continuous action spaces. Finally, offline model-free reinforcement learning with function approximation [sutton_barto, csaba_rl] and its deep extensions [deepqn] do not directly apply to domains with high-dimensional continuous action spaces. That is, offline learning methods like Q-learning require action maximization for every update, but in high-dimensional continuous action spaces such nonlinear function maximization is nonconvex and computationally intractable at the scale of millions or billions of updates.

Despite these limitations of existing methods, all is not lost. First, we remark that our deep network is not a black box but rather a gray box; while the learned parameters often lack human interpretability, there is still a uniform layered symbolic structure in the deep neural network models. Second, we make the critical observation that the popular Rectified Linear Unit (ReLU) [relu] transfer function for deep networks enables effective nonlinear deep neural network model learning and permits a direct compilation to a Mixed-Integer Linear Program (MILP) encoding. Given other components such as a human-specified objective function and a horizon, this permits direct optimization in a method we call Hybrid Deep MILP Planner (HD-MILP-Plan).
While arguably an important step forward, we remark that planners with optimality guarantees such as HD-MILP-Plan can only scale up to moderate-sized planning problems. Hence, in an effort to scale to substantially larger control problems, we focus on a general subclass of planning problems with purely continuous state and action spaces in order to take advantage of the efficiency of autodifferentiation tools and GPU-based computation. Specifically, we extend work using the Tensorflow tool for planning [Wu2017] in deterministic continuous RDDL [Sanner:RDDL] domains to the case of the learned neural network transition models investigated in this article: we show that we can embed both a reward function and a deep-learned transition function into a Recurrent Neural Network (RNN) cell, chain multiple cells together for a fixed horizon, and produce a plan in the resulting RNN encoding through end-to-end backpropagation, in a method we call Tensorflow Planner (TF-Plan).
Experimentally, we compare HD-MILP-Plan and TF-Plan against manually specified domain-specific policies on Reservoir Control, HVAC, and Navigation domains. Our primary objectives are to comparatively evaluate the ability of HD-MILP-Plan and TF-Plan to produce high-quality plans with limited computational resources in an online planning setting, and to assess their performance against carefully designed manual policies. For HD-MILP-Plan, we show that our strengthened MILP encoding improves the quality of plans produced in less computational time over the base encoding. For TF-Plan, we show the scalability and efficiency of our planner on large-scale problems and its ability to approximate the optimal plans found by HD-MILP-Plan on moderate-sized problems. Overall, this article contributes and evaluates two novel approaches for planning in domains with learned deep neural network transition models: an optimal method, HD-MILP-Plan, for mixed discrete and continuous state and action spaces with moderate scalability, and a fast, scalable GPU-based planner, TF-Plan, for the subset of purely continuous state and action domains.
2 Preliminaries
Before we discuss deep network transition learning, we review the factored planning problem motivating this work.
2.1 Factored Planning Problem
A deterministic factored planning problem is a tuple $\Pi = \langle S, A, C, T, I, G, R \rangle$, where $S$ is a mixed set of state variables (states) with discrete and continuous domains, $A$ is a mixed set of action variables (actions) with discrete and continuous domains, $C$ is a function that returns true if an action and state pair satisfies global constraints, $T$ denotes the transition function between time steps $t$ and $t+1$ such that $s^{t+1} = T(s^t, a^t)$ if $C(s^t, a^t) = \mathit{true}$ and is undefined otherwise, $I$ is the initial state constraint that assigns values to all state variables $S$, and $G$ is the goal state constraint over a subset of state variables $S' \subseteq S$. Finally, $R$ denotes the state-action reward function. Given a planning horizon $H$, an optimal solution to $\Pi$ is a plan $\pi = \langle a^1, \dots, a^H \rangle$ that maximizes the total reward $\sum_{t=1}^{H} R(s^t, a^t)$ over horizon $H$ such that $s^1$ satisfies $I$, $s^{t+1} = T(s^t, a^t)$ for all $t$, and $s^{H+1}$ satisfies $G$.
In many real-world problems, it is difficult to model the exact dynamics of the complex nonlinear transition function $T$ that governs the evolution of states over the horizon $H$. Therefore, in this paper, we do not assume a priori knowledge of $T$, but rather we learn it from data. We limit our model knowledge to the reward function $R$, the horizon $H$, the global constraint function $C$ that specifies whether actions $a^t$ are applicable in state $s^t$ at time $t$ or not (e.g., the outflow from a reservoir must not exceed the present water level), and the goal state constraints $G$.
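For concreteness, the components of the planning tuple can be sketched as a small container; all names, the toy reservoir-style instance, and the `rollout` helper below are illustrative and not the article's implementation:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, float]   # factored state: variable name -> value
Action = Dict[str, float]  # factored action: variable name -> value

@dataclass
class FactoredPlanningProblem:
    """Illustrative container for the tuple <S, A, C, T, I, G, R>."""
    constraints: Callable[[State, Action], bool]  # C(s, a) -> feasible?
    transition: Callable[[State, Action], State]  # T(s, a) -> next state
    initial: State                                # I: initial assignment
    goal: Callable[[State], bool]                 # G over the final state
    reward: Callable[[State, Action], float]      # R(s, a)

    def rollout(self, plan: List[Action]) -> float:
        """Total reward of a plan; T is undefined if C is violated."""
        s, total = dict(self.initial), 0.0
        for a in plan:
            assert self.constraints(s, a), "C(s, a) violated"
            total += self.reward(s, a)
            s = self.transition(s, a)
        assert self.goal(s), "goal state constraints G not satisfied"
        return total

# Toy 1-D instance: drain a reservoir toward level 1 without
# releasing more water than is present.
prob = FactoredPlanningProblem(
    constraints=lambda s, a: 0.0 <= a["out"] <= s["level"],
    transition=lambda s, a: {"level": s["level"] - a["out"]},
    initial={"level": 3.0},
    goal=lambda s: s["level"] <= 1.0,
    reward=lambda s, a: -abs(s["level"] - 1.0),
)
value = prob.rollout([{"out": 1.0}, {"out": 1.0}])
```

Here the plan releases one unit of water per step, accumulating reward −2 and then −1 before satisfying the goal.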
3 Neural Network Transition Learning
A neural network is a layered, acyclic, directed computational network structure inspired by the biological neural networks that constitute our brains [goodfellow2016deep]. A neural network typically has one or more hidden layers, where more hidden layers permit deeper information extraction. Each hidden layer typically consists of a linear transformation followed by a nonlinear activation. While most traditional nonlinear activation functions are bounded (e.g., the sigmoid or hyperbolic tangent function), simpler piecewise linear activation functions have become popular because of their computational efficiency and robustness to saturation.
3.1 Network Structure
We model the transition function $T$ as a modified version of a densely-connected network [huang2017densely]^{1} as shown in Figure 1, which, in comparison to a standard fully connected network, allows direct connections of each layer to the output. This can be advantageous when a transition function has differing levels of nonlinearity, allowing linear transitions to pass directly from the input to the output layer, while nonlinear transitions may pass through one or more of the hidden layers.

^{1} Densely Connected Networks were proposed for Convolutional Neural Networks, where they concatenate filter maps.

Given a deep neural network configuration with $L$ hidden layers, state vectors $s_t$ and $s_{t+1}$ and an action vector $a_t$ from data $D = \{(s_t, a_t, s_{t+1})\}$, and a regularization hyperparameter $\lambda$, the optimal weights $W^l$ and biases $b^l$ for all layers $l$ can be found by solving the following optimization problem, where $x = [s_t; a_t]$ denotes the network input:

$\min \sum_{(s_t, a_t, s_{t+1}) \in D} \| \hat{s}_{t+1} - s_{t+1} \|_2^2 + \lambda \sum_{l} \| W^l \|_2^2$

subject to

$z^l = f(W^l \, [x; z^1; \dots; z^{l-1}] + b^l), \quad l = 1, \dots, L$   (1)

$\hat{s}_{t+1} = W^{L+1} \, [x; z^1; \dots; z^L] + b^{L+1}$   (2)

The objective minimizes the squared reconstruction error of transitions in the data plus a regularizer. Constraints (1) and (2) define the nonlinear activations $z^l$ of each layer $l$ and the outputs $\hat{s}_{t+1}$, where $[\cdot\,;\cdot]$ denotes the concatenation operation and $f$ denotes the nonlinear activation function. We omit the subscript $t$ of all intermediate results $z^l$ as they are temporary values.
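The densely-connected forward pass can be sketched in a few lines of plain Python; a trained model would supply the weights, and the layer sizes and parameter values below are purely illustrative:

```python
# Densely-connected forward pass: each hidden layer (and the output)
# sees the concatenation of the raw input and all previous hidden
# activations, so linear effects can bypass the hidden layers entirely.

def relu(v):
    return [max(0.0, x) for x in v]

def linear(W, b, v):
    # W: list of rows (one per output dim); b: bias vector
    return [sum(wi * xi for wi, xi in zip(row, v)) + bj
            for row, bj in zip(W, b)]

def dense_net_forward(x, hidden_layers, output_layer):
    """hidden_layers: list of (W, b) pairs; output_layer: (W, b).
    Implements z^l = relu(W^l [x; z^1; ...; z^{l-1}] + b^l) and a
    linear output over the concatenation of x and all hidden z^l."""
    concat = list(x)
    for W, b in hidden_layers:
        z = relu(linear(W, b, concat))
        concat = concat + z  # dense connectivity: keep everything
    W_out, b_out = output_layer
    return linear(W_out, b_out, concat)

# 2-D input, one hidden unit, 1-D output (illustrative weights):
hidden = [([[1.0, -1.0]], [0.0])]     # z1 = relu(x0 - x1)
out = ([[0.5, 0.0, 2.0]], [0.1])      # y = 0.5*x0 + 0*x1 + 2*z1 + 0.1
y = dense_net_forward([3.0, 1.0], hidden, out)
```

Note how the output layer reads both the raw input (`x0`, `x1`) and the hidden activation `z1`, mirroring the direct input-to-output connections described above.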
We use Rectified Linear Units (ReLUs) [relu] of the form $f(x) = \max(0, x)$ as the activation function in this paper. There is a twofold benefit to deploying ReLU activations in planning tasks. First, in comparison to other activation functions, such as the sigmoid and hyperbolic tangent, ReLUs can be trained efficiently and permit a direct compilation to a set of linear constraints in a Mixed-Integer Linear Program (MILP), as we will discuss in Section 4. Second, ReLU activations are robust to vanishing gradients in long-horizon backpropagation, which is advantageous in the context of planning through backpropagation, as we will discuss in Section 5.
3.2 Input and Output Normalization
The input and output of the transition function usually have multiple dimensions, where each input and output dimension may have dramatically different magnitude and variance. For example, in Reservoir Control problems, the capacity difference among reservoirs can be large. Such significant differences cause unbalanced contributions to the training loss, which results in unbalanced prediction quality across the predicted states. To address this problem, we deploy a loss weighting on the Mean Squared Error loss. The loss weighting balances the loss contribution through

$\mathcal{L} = \sum_{i} \left( \frac{\hat{s}_{t+1,i} - s_{t+1,i}}{\beta \, m_i} \right)^2$

where $\beta$ is a hyperparameter and $m$ is the vector of the maximum value each dimension could encounter. In addition to the loss weighting, we normalize the input as $x' = (x - \mu)/\sigma$, with per-dimension mean $\mu$ and standard deviation $\sigma$, before feeding it into the input of the neural network, as this significantly improves training quality [goodfellow2016deep]. To simplify the HD-MILP-Plan compilation, we push the normalization function into the learned parameters as

$w'_{ij} = \frac{w_{ij}}{\sigma_i}, \qquad b'_j = b_j - \sum_{i} \frac{w_{ij} \, \mu_i}{\sigma_i}$

where $w_{ij}$ is the weight connecting dimension $i$ of the input layer to dimension $j$ of the first hidden layer, and $b_j$ is the bias of dimension $j$ of the first hidden layer. We show the full derivation in the Appendix for interested readers.
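The folding step can be sanity-checked numerically: the first-layer pre-activation on a normalized input must equal the folded pre-activation on the raw input. The weights, means, and standard deviations below are illustrative:

```python
# Push input normalization x' = (x - mu) / sigma into the first layer:
#   w'_ij = w_ij / sigma_i
#   b'_j  = b_j - sum_i w_ij * mu_i / sigma_i
# so the compiled network can consume raw, unnormalized inputs.

def fold_normalization(W, b, mu, sigma):
    """W[j][i]: weight from input dim i to hidden dim j; b[j]: bias."""
    W_f = [[w / s for w, s in zip(row, sigma)] for row in W]
    b_f = [bj - sum(w * m / s for w, m, s in zip(row, mu, sigma))
           for row, bj in zip(W, b)]
    return W_f, b_f

def preact(W, b, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) + bj
            for row, bj in zip(W, b)]

W, b = [[2.0, -1.0]], [0.5]
mu, sigma = [10.0, 4.0], [5.0, 2.0]
x = [12.0, 6.0]

normalized = preact(W, b, [(xi - m) / s for xi, m, s in zip(x, mu, sigma)])
folded = preact(*fold_normalization(W, b, mu, sigma), x)
# the two pre-activations agree, so all downstream layers are unchanged
```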
3.3 Model Complexity Minimization
Learning a complex function from a finite amount of training data can result in poor generalization [dietterich1995overfitting], especially for regression problems [harrell2014regression]. Using an overfitted transition model in planning can be catastrophic: planners can exploit the generalization limitations of the model to propose a solution that is invalid in the real domain. While deep network structures show remarkable generalization ability given their over-parameterization, maintaining the smallest model that preserves sufficient modeling capacity is necessary [zhang2016understanding]. In the deep learning literature, an effective way to reduce model complexity is to train the model with Dropout [srivastava2014dropout], a structured regularization that approximates Bayesian Model Averaging [gal2016dropout] by implicitly voting over a set of candidate models. We deploy Dropout only after each of the hidden layers in the densely-connected network, since the correlation of input dimensions is informative in planning tasks.
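A minimal sketch of the standard (inverted) dropout operation on a hidden-layer activation vector follows; the layer width and drop rate are illustrative:

```python
# Inverted dropout: at training time each unit is zeroed with
# probability p and survivors are rescaled by 1/(1-p), so the expected
# activation is unchanged and no rescaling is needed at planning time.

import random

def dropout(z, p, rng):
    keep = 1.0 - p
    return [0.0 if rng.random() < p else x / keep for x in z]

rng = random.Random(0)
z = [1.0] * 10000
dropped = dropout(z, p=0.5, rng=rng)
mean_activation = sum(dropped) / len(dropped)  # close to 1.0 in expectation
```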
Aggressively adding layers to the network does not necessarily improve the prediction accuracy of the learned function [ba2014deep] and can introduce additional computational cost for the planners that take this function as input. For HD-MILP-Plan, additional layers result in additional big-M constraints, which increase the computational cost of computing a plan, as we discuss and experimentally show in Sections 4 and 6, respectively.
4 Hybrid Deep MILP Planner
Hybrid^{2} Deep MILP Planner (HD-MILP-Plan) is a two-stage framework for learning and optimizing nonlinear planning problems. The first stage of HD-MILP-Plan learns the unknown transition function $T$ with a densely-connected network, as discussed previously in Section 3. The learned transition function $\tilde{T}$ is used to construct the learned planning problem $\tilde{\Pi}$. Given a planning horizon $H$, HD-MILP-Plan compiles the learned planning problem $\tilde{\Pi}$ into a Mixed-Integer Linear Program (MILP) and finds an optimal plan to $\tilde{\Pi}$ using an off-the-shelf MILP solver. HD-MILP-Plan operates as an online planner, where actions are optimized over the remaining planning horizon in response to sequential state observations from the environment.

^{2} The term hybrid refers to mixed (i.e., discrete and continuous) action and state spaces as used in the MDP literature [kveton2006solving].
We now describe the base MILP encoding of HD-MILP-Plan. Then, we strengthen the linear relaxation of our base MILP encoding for solver efficiency.
4.1 Base MILP Encoding
We begin with all notation necessary for the HD-MILP-Plan specification:
4.1.1 Parameters

- $V_s$ is the value of the initial state variable $s \in S$.
- $U$ is the set of ReLUs in the neural network.
- $B$ is the set of bias units in the neural network.
- $O$ is the set of output units in the neural network.
- $E$ is the set of synapses in the neural network.
- $w_{u,u'}$ for $(u,u') \in E$ are the weights in the neural network.
- $A(u)$ is the set of actions connected to unit $u$.
- $S(u)$ is the set of states connected to unit $u$.
- $In(u)$ is the set of units connected to unit $u$.
- $O^L(s)$ specifies the output unit with linear activation function that predicts state $s$.
- $O^B(s)$ specifies the output unit with binary step activation function that predicts state $s$.
- $M$ is a large constant used in the big-M constraints.
4.1.2 Decision variables

- $X^t_a$ denotes the value assignment to action $a \in A$ from its domain at time $t$. The domain of $a$ can be either discrete or continuous.
- $Y^t_s$ denotes the value of state $s \in S$ at time $t$. The domain of $s$ can be either discrete or continuous.
- $P^t_u$ denotes the output of unit $u$ at time $t$.
- $Z^t_u = 1$ if rectified linear unit $u \in U$ is activated at time $t$, 0 otherwise (i.e., $Z^t_u$ is a Boolean variable).
4.1.3 The MILP Compilation
Next we define the MILP formulation of our planning optimization problem that encodes the learned transition model.
$\max \sum_{t=1}^{H} R(\bar{X}^t, \bar{Y}^t)$   (3)

subject to

$Y^1_s = V_s \quad \forall s \in S$   (4)

$C(\bar{X}^t, \bar{Y}^t) = \mathit{true} \quad \forall t$   (5)

$G(\bar{Y}^{H+1}) = \mathit{true}$   (6)

$P^t_b = 1 \quad \forall b \in B, \forall t$   (7)

$\tilde{P}^t_u \le M Z^t_u \quad \forall u \in U, \forall t$   (8)

$P^t_u \le M Z^t_u \quad \forall u \in U, \forall t$   (9)

$\tilde{P}^t_u \le P^t_u \le \tilde{P}^t_u + M (1 - Z^t_u) \quad \forall u \in U, \forall t$   (10)

$Y^{t+1}_s = \sum_{u \in In(O^L(s))} w_{u,O^L(s)} P^t_u \quad \forall$ continuous $s$, $\forall t$   (11)

$Y^{t+1}_s \le \sum_{u \in In(O^L(s))} w_{u,O^L(s)} P^t_u + \tfrac{1}{2} \quad \forall$ integer $s$, $\forall t$   (12)

$Y^{t+1}_s \ge \sum_{u \in In(O^L(s))} w_{u,O^L(s)} P^t_u - \tfrac{1}{2} \quad \forall$ integer $s$, $\forall t$   (13)

$\sum_{u \in In(O^B(s))} w_{u,O^B(s)} P^t_u \le M Y^{t+1}_s \quad \forall$ Boolean $s$, $\forall t$   (14)

$\sum_{u \in In(O^B(s))} w_{u,O^B(s)} P^t_u \ge -M (1 - Y^{t+1}_s) \quad \forall$ Boolean $s$, $\forall t$   (15)

where $\tilde{P}^t_u = \sum_{s \in S(u)} w_{s,u} Y^t_s + \sum_{a \in A(u)} w_{a,u} X^t_a + \sum_{u' \in In(u)} w_{u',u} P^t_{u'}$ denotes the total weighted input flow into unit $u$ at time $t$, and $\bar{X}^t$, $\bar{Y}^t$ denote the vectors of all action and state variables at time $t$.
In the above MILP, the objective function (3) maximizes the sum of rewards over a given horizon $H$. Constraint (4) connects the input units of the neural network to the initial state of the planning problem at time $t = 1$. Constraint (5) ensures that the global constraints are satisfied at every time step $t$. Constraint (6) ensures that the output units of the neural network satisfy the goal state constraints of the planning problem at time $H+1$. Constraint (7) sets all neurons that represent biases equal to 1. Constraint (8) ensures that a ReLU $u \in U$ is activated if the total weighted input flow into $u$ is positive. Constraints (9)-(10) together ensure that if a ReLU $u$ is active, the outflow from $u$ is equal to the total weighted input flow, and is zero otherwise. Constraints (11)-(15) predict the states at time $t+1$ given the values of states, actions, and ReLUs at time $t$ using different activation functions. In Constraint (11) and Constraints (12)-(13), a linear activation function is used to predict state variables with continuous and integer domains, respectively. In Constraints (14)-(15), a binary step function is used to predict a state variable with Boolean domain. Note that a one-hot encoding with linear input is also MILP-compilable and can be used to predict Boolean state variables when the problem instance permits the use of a one-hot encoding, that is, when the global constraints include a constraint in the form of $\sum_{s \in S'} Y^t_s = 1$ for all time steps $t$.

Constraints (8)-(10) sufficiently encode the piecewise linear activation function $\max(0, x)$ of the ReLUs. However, the positive unbounded nature of the ReLUs leads to a poor linear relaxation of the big-M constraints, that is, when all Boolean variables $Z^t_u$ are relaxed to continuous in Constraints (8)-(10); this can significantly hinder the overall performance of standard branch-and-bound MILP solvers that rely on the linear relaxation of the MILP for guidance. Next, we strengthen our base MILP encoding by preprocessing bounds on state and action variables, and by adding auxiliary decision variables and linear constraints, to improve its LP relaxation.
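To see why big-M constraints of this form pin down the ReLU output, one can brute-force a single unit. The constraint set below is a standard big-M ReLU formulation written for illustration, not the exact solver input; for a fixed weighted input, only P = max(0, inflow) survives:

```python
# Big-M ReLU encoding for one unit with total weighted input `inflow`:
#   inflow <= M*Z,  P <= M*Z,  inflow <= P <= inflow + M*(1 - Z),  P >= 0
# Given the binary Z in {0, 1}, the only feasible P is max(0, inflow).
# M must upper-bound |inflow|; tolerances guard float comparisons.

def feasible(P, Z, inflow, M=100.0):
    return (inflow <= M * Z + 1e-9
            and P <= M * Z + 1e-9
            and inflow - 1e-9 <= P <= inflow + M * (1 - Z) + 1e-9
            and P >= -1e-9)

def relu_outputs(inflow, M=100.0):
    """Enumerate (coarsely discretized) P values feasible for some Z."""
    candidates = [x / 10.0 for x in range(-50, 51)]
    return sorted({P for P in candidates for Z in (0, 1)
                   if feasible(P, Z, inflow, M)})

active = relu_outputs(2.0)    # positive input: P must equal the input
inactive = relu_outputs(-3.0)  # negative input: P is forced to zero
```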
4.2 Strengthened MILP Encoding
In our base MILP encoding, Constraints (8)-(10) encode the piecewise linear activation function $\max(0, x)$ using big-M constraints for each ReLU $u \in U$. We strengthen the linear relaxation of Constraints (9)-(10) by first finding tighter bounds on the input units of the neural network, and then separating each input into its positive and negative components. Using these auxiliary variables, we augment our base MILP encoding with an additional linear inequality in the form of Constraint (27). This inequality is valid since $w \, v^+ \le w^+ v^+$ and $-w \, v^- \le -w^- v^-$ hold for all $v^+, v^- \ge 0$, where $w^+ = \max(w, 0)$ and $w^- = \min(w, 0)$.
4.2.1 Preprocessing Bounds
The optimization problems solved to find the tightest bounds on the input units of the neural network are as follows. The tightest lower bound on an action variable $X^t_a$ can be obtained by solving the following optimization problem:

$\min X^t_a$   (16)

subject to

Constraints (4)-(15)

Similarly, the tightest lower bounds on state variables, and the upper bounds on action and state variables, can be obtained by simply replacing the expression in the objective function (16) with $Y^t_s$, $-X^t_a$, and $-Y^t_s$, respectively. Given that the preprocessing optimization problems have the same theoretical complexity as the original learned planning optimization problem (i.e., NP-hard), we limit the computational budget allocated to each preprocessing optimization problem to a fixed amount, and set the lower and upper bounds on the domains of the action and state decision variables $X^t_a$, $Y^t_s$ to the best dual bounds found in each respective problem.
4.2.2 Additional Decision Variables
The additional decision variables required to implement our strengthened MILP are as follows:

- $X^{t,+}_a$ and $X^{t,-}_a$ denote the positive and negative value assignments to action $a$ at time $t$, respectively.
- $Y^{t,+}_s$ and $Y^{t,-}_s$ denote the positive and negative values of state $s$ at time $t$, respectively.
- $K^t_a = 1$ if $X^t_a$ is positive at time $t$, 0 otherwise.
- $K^t_s = 1$ if $Y^t_s$ is positive at time $t$, 0 otherwise.
4.2.3 Additional Constraints
The additional constraints in the strengthened MILP are as follows:
$\bar{L}_a \le X^t_a \le \bar{U}_a \quad \forall a \in A, \forall t$   (17)

$X^t_a = X^{t,+}_a - X^{t,-}_a \quad \forall a \in A, \forall t$   (18)

$X^{t,+}_a \le \max(\bar{U}_a, 0) \, K^t_a \quad \forall a \in A, \forall t$   (19)

$X^{t,-}_a \le -\min(\bar{L}_a, 0) \, (1 - K^t_a) \quad \forall a \in A, \forall t$   (20)

$X^{t,+}_a \ge 0, \; X^{t,-}_a \ge 0 \quad \forall a \in A, \forall t$   (21)

$\bar{L}_s \le Y^t_s \le \bar{U}_s \quad \forall s \in S, \forall t$   (22)

$Y^t_s = Y^{t,+}_s - Y^{t,-}_s \quad \forall s \in S, \forall t$   (23)

$Y^{t,+}_s \le \max(\bar{U}_s, 0) \, K^t_s \quad \forall s \in S, \forall t$   (24)

$Y^{t,-}_s \le -\min(\bar{L}_s, 0) \, (1 - K^t_s) \quad \forall s \in S, \forall t$   (25)

$Y^{t,+}_s \ge 0, \; Y^{t,-}_s \ge 0 \quad \forall s \in S, \forall t$   (26)

Here, the pairs $(\bar{L}_a, \bar{U}_a)$ and $(\bar{L}_s, \bar{U}_s)$ denote the lower and upper bounds on the domains of action and state decision variables $X^t_a$ and $Y^t_s$, respectively, found by solving the preprocessing optimization problems. Given Constraints (17)-(26), Constraint (27) implements our strengthening constraint, which provides a valid upper bound on each ReLU $u \in U$ using $w^+ = \max(w, 0)$ and $w^- = \min(w, 0)$:

$P^t_u \le \sum_{s \in S(u)} \left( w^+_{s,u} Y^{t,+}_s - w^-_{s,u} Y^{t,-}_s \right) + \sum_{a \in A(u)} \left( w^+_{a,u} X^{t,+}_a - w^-_{a,u} X^{t,-}_a \right) + \sum_{b \in In(u) \cap B} \max(w_{b,u}, 0) \quad \forall u \in U, \forall t$   (27)
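The positive/negative decomposition and the validity of the resulting upper bound on a ReLU can be checked numerically; the single-unit setting, weights, and values below are illustrative:

```python
# Decompose v = v_pos - v_neg with v_pos, v_neg >= 0. With
# w_plus = max(w, 0) and w_minus = min(w, 0), the term-by-term bound
#   w * v <= w_plus * v_pos - w_minus * v_neg
# holds, and since every term on the right is nonnegative, the sum is
# a valid upper bound on the ReLU output max(0, sum_i w_i * v_i).

def split(v):
    return (max(v, 0.0), max(-v, 0.0))  # (v_pos, v_neg)

def relu_upper_bound(weights, values):
    total = 0.0
    for w, v in zip(weights, values):
        v_pos, v_neg = split(v)
        total += max(w, 0.0) * v_pos - min(w, 0.0) * v_neg
    return total

weights = [1.5, -2.0, 0.5]
values = [3.0, -1.0, -4.0]
exact = max(0.0, sum(w * v for w, v in zip(weights, values)))
bound = relu_upper_bound(weights, values)
# bound >= exact always holds; here exact = 4.5 and bound = 6.5
```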
5 Nonlinear Planning via Autodifferentiation
In this section, we present autodifferentiation-based planning, which we call the Tensorflow Planner (TF-Plan). TF-Plan plans through the learned neural network using autodifferentiation, where variables are tensors that flow over the network operation pipelines. TF-Plan represents the planning task as a symbolic recurrent neural network (RNN) architecture with action parameter inputs that is directly amenable to optimization with GPU-based symbolic toolkits such as Tensorflow and Pytorch.
5.1 Planning through Backpropagation
Backpropagation [rumelhart1988learning] is a standard method for optimizing the parameters of deep neural networks via gradient descent. Using the chain rule of derivatives, backpropagation propagates the derivative of the output error of a neural network back to each of its parameters in a single linear-time pass in the size of the network, using what is known as reverse-mode automatic differentiation [Linnainmaa1970]. Despite its theoretical efficiency, backpropagation in large-scale deep neural networks is still computationally expensive in practice, and it is only with the advent of recent GPU-based symbolic toolkits that it has become routinely feasible.

We reverse the idea of training the parameters of the network given fixed inputs to instead optimize the inputs (i.e., actions) subject to fixed parameters. As shown in Figure 2, given a learned transition function $\tilde{T}$ and reward function $R$, we want to optimize the actions $a^t$ for all time steps $t$ to maximize the accumulated reward $V = \sum_{t=1}^{H} R(s^t, a^t)$, where $s^{t+1} = \tilde{T}(s^t, a^t)$. Specifically, we want to optimize all actions with respect to a planning loss $\mathcal{L}$ (defined shortly as a function of $V$) that we minimize via the following gradient update schema:
$a^t_i \leftarrow a^t_i - \eta \, \frac{\partial \mathcal{L}}{\partial a^t_i}$   (28)

where $\eta$ is the optimization rate and the partial derivatives comprising the gradient-based optimization in problem instance $i$ are computed as

$\frac{\partial \mathcal{L}}{\partial a^t_i} = \frac{\partial \mathcal{L}}{\partial V_i} \left( \frac{\partial R(s^t_i, a^t_i)}{\partial a^t_i} + \sum_{\tau=t+1}^{H} \frac{\partial R(s^\tau_i, a^\tau_i)}{\partial s^\tau_i} \, \frac{\partial s^\tau_i}{\partial a^t_i} \right)$   (29)

where $\partial s^\tau_i / \partial a^t_i$ expands through the chain of learned transitions $s^{\tau}_i = \tilde{T}(s^{\tau-1}_i, a^{\tau-1}_i)$, and we define the total loss $\mathcal{L} = \sum_i \mathcal{L}_i$ over multiple planning instances $i$. Since both the transition and reward functions are not assumed to be convex, optimization on a domain with such dynamics could result in a local minimum or saddle point. To mitigate this problem, we randomly initialize actions for a batch of instances, optimize multiple mutually independent planning instances simultaneously, and eventually return the best-performing action sequence over all instances.
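The overall scheme — roll out states forward through the dynamics, push reward gradients back to the actions, and keep the best of several randomly initialized instances — can be sketched on a toy one-dimensional problem. The dynamics, the hand-written gradients (an autodiff toolkit such as Tensorflow would derive these automatically), and all hyperparameters below are illustrative:

```python
# Planning by gradient ascent over actions for a toy deterministic
# problem: s_{t+1} = s_t + a_t with reward -(s_{t+1} - goal)^2, so
# dV/da_t = sum over later steps tau of -2 * (s_tau - goal).

import random

def plan(s0, goal, horizon, instances=8, epochs=200, eta=0.05, seed=0):
    rng = random.Random(seed)
    best_value, best_plan = float("-inf"), None
    for _ in range(instances):
        a = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        for _ in range(epochs):
            # forward pass: roll out the states under the current plan
            s = [s0]
            for t in range(horizon):
                s.append(s[t] + a[t])
            # backward pass: hand-written reward gradient w.r.t. a_t
            grads = [sum(-2.0 * (s[tau] - goal)
                         for tau in range(t + 1, horizon + 1))
                     for t in range(horizon)]
            for t in range(horizon):
                a[t] += eta * grads[t]  # ascend on the total reward
        # evaluate this instance and keep the best plan found so far
        value, s_cur = 0.0, s0
        for t in range(horizon):
            s_cur += a[t]
            value -= (s_cur - goal) ** 2
        if value > best_value:
            best_value, best_plan = value, a
    return best_value, best_plan

value, actions = plan(s0=0.0, goal=2.0, horizon=3)
```

For this toy instance, the optimum moves straight to the goal on the first step and stays there, so the converged actions sum to roughly the goal offset with total reward near zero.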
While there are multiple choices of loss function in autodifferentiation toolkits, we minimize the squared cumulative reward $\mathcal{L} = \sum_{i=1}^{n} V_i^2$, which drives each nonpositive $V_i$ toward its maximum of 0, since the cumulative rewards we test in this paper are at most piecewise linear.^{3} The optimization of this loss function has dual effects: it optimizes each problem instance independently, and it provides fast convergence (i.e., faster than optimizing $\sum_{i=1}^{n} -V_i$). We remark that simply defining the objective and the definition of all state variables in terms of predecessor state and action variables via the transition dynamics is enough for autodifferentiation toolkits to build the symbolic directed acyclic graph (DAG) representing the objective and take its gradient with respect to all free action parameters as shown in (29) using reverse-mode automatic differentiation.

^{3} The derivative of a linear function yields a constant value, which is not informative in updating actions using the gradient update schema (28).

5.2 Planning over Long Horizons
The TF-Plan compilation of a nonlinear planning problem reflects the same structure as an RNN that is commonly used in deep learning. The connection here is not superficial, since a longstanding difficulty with training RNNs lies in the vanishing gradient problem: multiplying long sequences of gradients via the chain rule usually renders them extremely small and makes them irrelevant for weight updates, especially when using nonlinear activation functions that can saturate, such as a sigmoid. As described previously, we mitigate this issue by training the transition function with ReLU and linear activation functions, both of which pass gradients through their linear regions unattenuated at each time step. We note that the reward function does not trigger the vanishing gradient problem, since the reward output of each time step is directly connected to the loss function.

5.3 Handling Bounds on Actions
Bounds on actions are common in many planning tasks. For example, in the Navigation domain, the distance the agent can move at each time step is bounded by constant minimum and maximum values. To handle actions with range constraints, we use projected gradient descent (PGD) [calamai1987projected], a method that handles constrained optimization problems by projecting the parameters (here, actions) back into their feasible range after each gradient update. Precisely, we clip all actions to their feasible range after each epoch of gradient descent:

$a^t \leftarrow \min(\max(a^t, L_a), U_a)$

where $L_a$ and $U_a$ denote the lower and upper bounds on action $a$. In an online planning setting, TF-Plan ensures the feasibility of actions with bound constraints using PGD at each time step, and relaxes the remaining global constraints and goal constraints during planning.
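The projection step itself is a simple elementwise clip after each gradient update; the bounds, gradients, and step size below are illustrative:

```python
# One projected gradient ascent step on a list of bounded actions:
# update each action, then clip it back into the box [lower, upper].

def clip(value, lower, upper):
    return min(max(value, lower), upper)

def projected_step(actions, grads, eta, lower, upper):
    return [clip(a + eta * g, lower, upper) for a, g in zip(actions, grads)]

stepped = projected_step([0.9, -0.2, 0.0], [5.0, -8.0, 1.0],
                         eta=0.1, lower=-1.0, upper=1.0)
# raw updates would be [1.4, -1.0, 0.1]; projection maps them into the box
```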
6 Experimental Results
In this section, we present experimental results that empirically test the performance of both HD-MILP-Plan and TF-Plan on multiple nonlinear planning domains. To accomplish this task, we first present three nonlinear benchmark domains, namely Reservoir Control, Heating, Ventilation and Air Conditioning (HVAC), and Navigation. Then, we validate the transition learning performance of our proposed ReLU-based densely-connected neural networks with different network configurations in each domain. Finally, we evaluate the efficacy of both proposed planning frameworks based on the learned model by comparing them to strong baseline manually coded policies^{4} in an online planning setting. For HD-MILP-Plan, we test the effect of preprocessing and the strengthened MILP encoding on run time and solution quality. For TF-Plan, we investigate the impact of the number of epochs on planning quality. Finally, we test the scalability of both planners on large-scale domains and show that TF-Plan scales much more gracefully than HD-MILP-Plan.

^{4} As noted in the Introduction, MCTS and model-free reinforcement learning are not applicable as baselines given our multidimensional concurrent continuous action spaces.
6.1 Illustrative Domains
Reservoir Control has a single state variable for each reservoir, which denotes the water level of the reservoir, and a corresponding action variable for each to permit a flow from the reservoir (up to a maximum allowable flow) to the next downstream reservoir. The transition function is nonlinear due to the evaporation from each reservoir, which is a nonlinear (sinusoidal) function of the current water level; the water level of a reservoir at the next time step is then its current level plus the inflows from its upstream reservoirs, minus its own outflow and evaporation loss, subject to the reservoir's capacity bounds. The reward function minimizes the total absolute deviation from a desired water level, plus a constant penalty for having a water level outside of a safe range (close to empty or overflowing), where upper and lower desired bounds are defined for each reservoir and the reward is computed per time step. We report results on small instances with 3 and 4 reservoirs over multiple planning horizons, and large instances with 10 reservoirs over multiple planning horizons.
Heating, Ventilation and Air Conditioning [agarwal2010] has a state variable denoting the temperature of each room and an action variable for sending heated air to each room (up to a maximum allowable volume) via vent actuation. The transition function is bilinear: the temperature change of each room depends on the products of the thermal conductances and temperature differences between the room and its adjacent rooms, together with the volume of heated air sent to the room, scaled by the room's heat capacity. The reward function minimizes the total absolute deviation from a desired temperature for all rooms, plus a linear penalty for temperatures outside a desired range, plus a linear penalty for the cost of heating air, computed per time step. We report results on small instances with 3 and 6 rooms over multiple planning horizons, and a large instance with 60 rooms.
Navigation is designed to test the learning of a highly nonlinear transition function and has a single state variable for the 2D location of an agent and a 2D action intended nominally to move the agent (with minimum and maximum movement boundaries). The new location is a nonlinear function of the current location (with minimum and maximum maze boundaries) due to higher slippage in the center of the domain: the intended displacement is scaled down by a nonlinear factor of the Euclidean distance from the current location to the center of the domain. The reward function minimizes the total Manhattan distance from the goal location at each time step, where a goal location is defined for each dimension. We report results on small instances with maze sizes 8-by-8 and 10-by-10 over multiple planning horizons, and a large instance with wider minimum and maximum movement boundaries.
6.2 Transition Learning Performance
In Table 1, we show the mean squared error of training different configurations of neural networks for 200 epochs over the different planning domains. We train these neural networks on presampled data generated by a simple stochastic exploration policy. We sample data points for all domains and treat them as independent and identically distributed; namely, we randomly shuffle the data points for each epoch of training. The sampled data was split into training and test sets with a 4-to-1 ratio, and we report the results from the held-out test sets.
Since densely-connected networks [huang2017densely] strictly dominated non-densely-connected networks, we only report the results of the densely-connected networks. Overall, we see that Reservoir and HVAC can be accurately learned with one hidden layer (i.e., an additional layer did not help), while Navigation benefits from having two layers owing to the complexity of its transition function. The network with the lowest MSE is used as the deep neural network model for each domain in the subsequent planning experiments.
Figure 3 visualizes the training performance of different neural network configurations over three domain instances. Figures 3 (a)-(c) visualize the loss curves over training epochs for the three domain instances. In Reservoir, we observe that while both the 1 and 2 hidden layer networks have similar MSE values, the former has much smaller variance. In the HVAC and Navigation instances, we observe that the 1 and 2 hidden layer networks have the smallest MSE values, respectively. Figures 3 (d)-(f) visualize the performance of learning the transition functions with different numbers of hidden layers. We observe that Reservoir needs at least one hidden layer to overlap with the ground truth, whereas HVAC is learned well by all networks (all dashed lines overlap) and Navigation only shows complete overlap (especially near the center nonlinearity) for a two-layer neural network. All of these results mirror the Mean Squared Error comparisons in Table 1, and provide empirical and intuitive evidence for systematically selecting the minimal neural network structure for each domain.
Domain | Linear | 1 Hidden | 2 Hidden
Reservoir (instance with 4 reservoirs) | 46500000 ± 487000 | 343000 ± 7210 | 653000 ± 85700
HVAC (instance with 3 rooms) | 7102.3 | 52054 | 752007100
Navigation (instance with 10 by 10 maze) | 304409.8 | 942029 | 194050

Table 1: Mean Squared Error for all domains and network configurations with 95% confidence intervals.

6.3 Planning Performance
In this section, we investigate the effectiveness of HD-MILP-Plan and TF-Plan at planning for the original planning problem by optimizing the learned planning problem in an online planning setting. We optimized the MILP encodings using IBM ILOG CPLEX 12.7.1 with eight threads and a 1-hour total time limit per problem instance on a MacBook Pro with a 2.8 GHz Intel Core i7 and 16 GB of memory. We optimized TF-Plan through Tensorflow 1.9 with an Nvidia GTX 1080 GPU with CUDA 9.0 on a Linux system with 16 GB of memory.^{5} We connected both planners to the RDDLsim [Sanner:RDDL] domain simulator and interactively solved multiple problem instances with different sizes and horizon lengths. To contextualize the solution quality of TF-Plan, we also report HD-MILP-Plan with a 20% duality gap. The results reported for TF-Plan, unless otherwise stated, are based on a fixed number of epochs for each domain: TF-Plan used 1000 epochs for Reservoir and HVAC, and 300 epochs for Navigation.

^{5} Because the CPLEX solver and Tensorflow leverage entirely different hardware components, the reported run times are intended only to show real elapsed time.
6.3.1 Comparison of Planning Quality
In Figures 4 (a)-(c), we compare the plan quality of the domain-specific policies (blue), the base MILP encoding (gray), the MILP encoding with preprocessing and strengthening constraints solved optimally (orange), the MILP encoding with preprocessing and strengthening constraints solved up to a 20% duality gap (green), and TF-Plan.
In Figure 4 (a), we compare HD-MILP-Plan and TF-Plan to a rule-based local Reservoir planner, which measures the water level in the reservoirs and sets outflows to release water above a prespecified median level of reservoir capacity. In this domain, we observe an average 15% increase in the total reward obtained by the plans generated by HD-MILP-Plan in comparison to that of the rule-based local Reservoir planner. Similarly, we find that TF-Plan outperforms the rule-based local Reservoir planner by a similar percentage. However, we observe that TF-Plan outperforms HD-MILP-Plan on the Reservoir 4 domain. We investigate this outcome further in Figure 5 (a). We find that the plan returned by HD-MILP-Plan incurs more penalty due to noise in the learned transition model: the plan attempts to distribute water to multiple reservoirs to obtain higher reward, but as a result the actions returned by HD-MILP-Plan break the safety threshold and receive an additional penalty. Thus, HD-MILP-Plan incurs more cost than TF-Plan.
In Figure 4 (b), we compare HD-MILP-Plan and TF-Plan to a rule-based local HVAC policy, which turns on the air conditioner whenever the room temperature is below the median of a given range of comfortable temperatures [20, 25] and turns it off otherwise. While the reward (i.e., electricity cost) of the proposed models on HVAC 3 Rooms is almost identical to that of the locally optimal HVAC policy, we observe a significant performance improvement in the HVAC 6 Rooms setting, which suggests the advantage of the proposed models on complex planning problems where the manual policy fails to track the temperature interactions among the rooms. Figure 5 (b) further demonstrates the advantage of our planners: the room temperatures controlled by the proposed models are identical to those of the locally optimal policy with 15% less power usage.
Figure 4 (c) compares HD-MILP-Plan and TF-Plan to a greedy search policy, which uses a Manhattan distance-to-goal function to guide the agent towards the direction of the goal (as visualized in Figure 5 (c)). The pairwise comparison of the total rewards obtained for each problem instance shows that the proposed models can outperform the manual policy by up to 15%, as observed in the problem instance Navigation,10,8 in Figure 4 (c). Inspection of the actual plans, as visualized in Figure 5 (c), shows that the local policy ignores the nonlinear region in the middle and tries to reach the goal directly, which causes its plan to fail to reach the goal position within the given step budget. In contrast, both HD-MILP-Plan and TF-Plan find plans that move around the nonlinearity and successfully reach the goal state, which shows their ability to model the nonlinearity and find plans that are near-optimal with respect to the learned model over the complete horizon.
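A greedy policy of this kind can be sketched as below (a hypothetical illustration; the candidate move set and step size are our assumptions, not the paper's):

```python
def manhattan(p, goal):
    """Manhattan distance between a 2-D position and the goal."""
    return sum(abs(a - b) for a, b in zip(p, goal))

def greedy_move(pos, goal, step=1.0):
    """Greedy search policy: among a few axis-aligned candidate moves,
    pick the one minimizing Manhattan distance to the goal."""
    candidates = [(step, 0.0), (-step, 0.0), (0.0, step), (0.0, -step), (0.0, 0.0)]
    return min(candidates,
               key=lambda m: manhattan((pos[0] + m[0], pos[1] + m[1]), goal))
```

Such a policy is myopic by construction: it evaluates only the next step's distance, so it cannot detect that crossing the central nonlinear region slows the agent down over the full horizon.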
Overall, we observe that in 10 out of 12 problem instances, the quality of the plans generated by HD-MILP-Plan and TF-Plan is significantly better than the total reward obtained by the plans generated by the respective domain-specific human-designed policies. Further, we find that the quality of the plans generated by TF-Plan lies between that of the plans generated by i) HD-MILP-Plan solved to optimality, and ii) HD-MILP-Plan solved to a 20% duality gap.
6.3.2 Comparison of Run Time Performance
In Figures 4 (d)-(f), we compare the run time performance of the base MILP model (gray), the MILP model with preprocessing and strengthening constraints solved optimally (orange), the MILP model with preprocessing and strengthening constraints solved up to a 20% duality gap (green), and TF-Plan. Figure 4 (f) shows a significant run time improvement for the strengthened encoding over the base MILP encoding, while Figures 4 (d)-(e) show otherwise. Together with the results presented in Figures 4 (a)-(c), we find that domains that utilize neural networks with only one hidden layer (e.g., HVAC and Reservoir) do not benefit from the additional fixed computational expense of preprocessing. In contrast, domains that require deeper neural networks (e.g., Navigation) benefit from the additional computational expense of preprocessing and strengthening. Over the three domains, we find that TF-Plan significantly outperforms HD-MILP-Plan in all Navigation instances, performs slightly worse in all Reservoir instances, and performs comparably in the HVAC instances.
6.3.3 Effect of Training Epochs for TF-Plan on Planning Quality
To test the effect of the number of optimization epochs on solution quality, we present results on the 10-by-10 Navigation domain for a horizon of 10 with different numbers of epochs. Figure 6 visualizes the increase in solution quality as the number of epochs increases: Figure 6 (a) presents a low-quality plan, similar to that of the manual policy, found with 20 epochs; Figure 6 (b) presents a medium-quality plan found with 80 epochs; and Figure 6 (c) presents a high-quality plan, similar to that of HD-MILP-Plan, found with 320 epochs.
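The dependence on the epoch budget can be illustrated with a minimal planning-by-backpropagation loop in the spirit of TF-Plan (a toy sketch under our own assumptions: trivial 1-D dynamics s' = s + a, a quadratic distance-to-goal loss, and hand-coded gradients in place of Tensorflow's automatic differentiation):

```python
def plan(epochs, horizon=10, start=0.0, goal=5.0, lr=0.005):
    """Toy planning-by-gradient-descent: optimize an open-loop action
    sequence through a differentiable rollout for a given epoch budget.
    Returns the squared distance to the goal achieved by the final plan."""
    actions = [0.0] * horizon
    for _ in range(epochs):
        # Forward pass: roll out the toy dynamics s' = s + a.
        s = start
        for a in actions:
            s = s + a
        # Loss L = (s_H - goal)^2, so dL/da_t = 2*(s_H - goal) for every t.
        grad = 2.0 * (s - goal)
        actions = [a - lr * grad for a in actions]
    final = start + sum(actions)
    return (final - goal) ** 2
```

Even in this toy setting, the plan improves monotonically with the epoch budget, mirroring the low/medium/high-quality progression in Figure 6.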
6.3.4 Scalability Analysis on Large Problem Instances
To test the scalability of the proposed planning models, we create three additional domain instances that simulate more realistic planning problems. For the Reservoir domain, we create a system with 10 reservoirs with complex reservoir formations, where a reservoir may receive water from more than one upstream reservoir. For the HVAC domain, we simulate a building with 6 floors and 60 rooms with a complex adjacency setting (including inter-level adjacency modeling). More importantly, in order to capture the complex mutual temperature impact of the rooms, we train the transition function with one hidden layer of 256 neurons. For the Navigation domain, we reduce the feasible action range from [-1.0, 1.0] to [-0.5, 0.5] and increase the planning horizon to 20 time steps.
In Figures 7 (a)-(c), we compare the total rewards obtained by the domain-specific rule-based policy (blue), HD-MILP-Plan (orange), and TF-Plan (red) on the larger problem instances. The analysis of Figures 7 (a)-(c) shows that TF-Plan scales better than HD-MILP-Plan, consistently outperforming the policy, whereas HD-MILP-Plan outperforms the other two planners in two out of three domains (i.e., Reservoir and HVAC) while suffering from scalability issues in the third (i.e., Navigation). In particular, we find that in the Navigation domain, HD-MILP-Plan sometimes does not find feasible plans with respect to the learned model and therefore returns default no-op action values.
In Figure 8, we compare the run time performance of all three planners over all problem instances, where we measure problem size as a function of the horizon, the number of parameters in the learned model, and the number of neural network layers. We observe that as the problem size gets larger, HD-MILP-Plan takes more computational effort to solve due to its additional requirement of proving optimality, which can be remedied by allowing a bounded optimality guarantee (e.g., a 20% duality gap) on the learned model. We also observe that as the problem sizes get larger, the preprocessing bounds and strengthening constraints pay off. Finally, we show that TF-Plan scales gracefully as the problem size gets larger. Together with Figures 7 (a)-(c), we conclude that TF-Plan provides an efficient alternative to HD-MILP-Plan for large-scale planning problems.
7 Conclusion
In this paper, we have tackled the question of how to plan with expressive and accurate deep network learned transition models that are not amenable to existing solution techniques. We started by improving the accuracy of the learned transition function using a densely connected network with a weighted mean squared error loss. We leveraged the insight that ReLU-based deep networks offer strong learning performance and permit a direct compilation of the neural network transition model to a Mixed-Integer Linear Program (MILP) encoding in a planner we called Hybrid Deep MILP Planner (HD-MILP-Plan). To enhance planning efficiency, we strengthened the linear relaxation of the base MILP encoding. Considering the computational bottleneck of MILP-based optimization, we proposed an alternative Tensorflow Planner (TF-Plan) that performs planning using recurrent neural networks, where plans are directly optimized through backpropagation.
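For illustration, the compilation of a single ReLU unit $z = \max(0, x)$ into MILP constraints can be sketched with the standard big-M encoding (a generic sketch; the paper's exact encoding and strengthening constraints may differ), where $L \le x \le U$ are valid pre-activation bounds and $b$ is a binary indicator of the unit being active:
$$z \ge x, \qquad z \ge 0, \qquad z \le x - L(1-b), \qquad z \le U b, \qquad b \in \{0,1\}.$$
When $b = 1$ the constraints force $z = x$, and when $b = 0$ they force $z = 0$; tighter bounds $L$ and $U$, such as those obtained by preprocessing, directly yield a stronger linear relaxation of this encoding.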
We evaluated the run time performance and solution quality of the plans generated by both proposed planners over multiple problem instances from three planning domains. We have shown that HD-MILP-Plan can find optimal plans with respect to the learned models, and that TF-Plan can approximate the optimal plans at little computational cost. We have shown that the plans generated by both HD-MILP-Plan and TF-Plan yield better solution quality than strong domain-specific human-designed policies. We have also shown that our strengthening constraints improve the solution quality and the run time performance of HD-MILP-Plan as problem instances grow larger. Finally, we have shown that TF-Plan can handle large-scale planning problems with very little computational cost.
In conclusion, both HD-MILP-Plan and TF-Plan represent a new class of data-driven planning methods that can accurately learn the complex state transitions of high-dimensional nonlinear planning domains and provide high-quality plans with respect to the learned models.
Appendix A. Derivations
In this appendix, we expand on derivations from the paper for interested readers.
Pushing Normalization into the Network Parameters
Assume we have the mean $\mu_i$ and standard deviation $\sigma_i$ of the data for each input dimension $i$. In general, we normalize a data point $x_i$ through the formula
$$\tilde{x}_i = \frac{x_i - \mu_i}{\sigma_i},$$
and compute the first linear transformation after the input through the formula
$$z_j = \sum_i W_{ji}\,\tilde{x}_i + b_j.$$
We can therefore transfer the normalization of the input to the learned weights and bias:
$$\tilde{W}_{ji} = \frac{W_{ji}}{\sigma_i}, \qquad \tilde{b}_j = b_j - \sum_i \frac{W_{ji}\,\mu_i}{\sigma_i}, \tag{30}$$
so that $z_j = \sum_i \tilde{W}_{ji}\,x_i + \tilde{b}_j$ on the unnormalized input. Since the output of the neuron is invariant under this substitution, the operation only affects the weights and biases of the linear transformation that is directly connected to the input layer.
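This substitution is easy to check numerically (a sketch with arbitrary shapes and values, using `numpy` in place of the paper's Tensorflow):

```python
import numpy as np

# Verify that folding input normalization into the first layer's weights and
# bias leaves the layer's output unchanged (Eq. 30).
rng = np.random.default_rng(0)
x = rng.normal(size=3)              # raw (unnormalized) input
mu = rng.normal(size=3)             # per-dimension mean of the data
sigma = rng.uniform(0.5, 2.0, 3)    # per-dimension standard deviation
W = rng.normal(size=(4, 3))         # first-layer weights
b = rng.normal(size=4)              # first-layer bias

# Original computation: normalize the input, then apply the linear layer.
z_ref = W @ ((x - mu) / sigma) + b

# Folded computation: rescale weights, shift bias, feed the raw input.
W_tilde = W / sigma                 # broadcasting divides column i by sigma_i
b_tilde = b - W @ (mu / sigma)
z_fold = W_tilde @ x + b_tilde

assert np.allclose(z_ref, z_fold)
```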
Appendix B. RDDL Domain Description
In this section, we list the RDDL domain and instance files that we experimented with in this paper.
Reservoir
Domain File
domain Reservoir_Problem{
    requirements = {
        reward-deterministic
    };
    types {
        id: object;
    };
    pvariables {
        // Constant
        MAXCAP(id): { non-fluent, real, default = 100.0 };
        HIGH_BOUND(id): { non-fluent, real, default = 80.0 };
        LOW_BOUND(id): { non-fluent, real, default = 20.0 };
        RAIN(id): { non-fluent, real, default = 5.0 };
        DOWNSTREAM(id,id): { non-fluent, bool, default = false };
        DOWNTOSEA(id): { non-fluent, bool, default = false };
        BIGGESTMAXCAP: { non-fluent, real, default = 1000 };
        // Interm
        vaporated(id): { interm-fluent, real };
        // State
        rlevel(id): { state-fluent, real, default = 50.0 };
        // Action
        flow(id): { action-fluent, real, default = 0.0 };
    };
    cpfs {
        vaporated(?r) = (1.0/2.0)*sin[rlevel(?r)/BIGGESTMAXCAP]*rlevel(?r);
        rlevel'(?r) = rlevel(?r) + RAIN(?r) - vaporated(?r) - flow(?r)
            + sum_{?r2: id}[DOWNSTREAM(?r2,?r)*flow(?r2)];
    };
    reward = sum_{?r: id} [if (rlevel'(?r)>=LOW_BOUND(?r) ^ (rlevel'(?r)<=HIGH_BOUND(?r)))
                then 0
                else if (rlevel'(?r)<=LOW_BOUND(?r))
                then (-5)*(LOW_BOUND(?r)-rlevel'(?r))
                else (-100)*(rlevel'(?r)-HIGH_BOUND(?r))]
        + sum_{?r2: id}[abs[((HIGH_BOUND(?r2)+LOW_BOUND(?r2))/2.0)-rlevel'(?r2)]*(-0.1)];
    state-action-constraints {
        forall_{?r: id} flow(?r)<=rlevel(?r);
        forall_{?r: id} rlevel(?r)<=MAXCAP(?r);
        forall_{?r: id} flow(?r)>=0;
    };
}
Instance Files
Reservoir 3
non-fluents Reservoir_non {
    domain = Reservoir_Problem;
    objects{
        id: {t1,t2,t3};
    };
    non-fluents {
        RAIN(t1) = 5.0; RAIN(t2) = 10.0; RAIN(t3) = 20.0;
        MAXCAP(t2) = 200.0; LOW_BOUND(t2) = 30.0; HIGH_BOUND(t2) = 180.0;
        MAXCAP(t3) = 400.0; LOW_BOUND(t3) = 40.0; HIGH_BOUND(t3) = 380.0;
        DOWNSTREAM(t1,t2); DOWNSTREAM(t2,t3); DOWNTOSEA(t3);
    };
}
instance is1{
    domain = Reservoir_Problem;
    non-fluents = Reservoir_non;
    init-state{
        rlevel(t1) = 75.0;
    };
    max-nondef-actions = 3;
    horizon = 10;
    discount = 1.0;
}
Reservoir 4
non-fluents Reservoir_non {
    domain = Reservoir_Problem;
    objects{
        id: {t1,t2,t3,t4};
    };
    non-fluents {
        RAIN(t1) = 5.0; RAIN(t2) = 10.0; RAIN(t3) = 20.0; RAIN(t4) = 30.0;
        MAXCAP(t2) = 200.0; LOW_BOUND(t2) = 30.0; HIGH_BOUND(t2) = 180.0;
        MAXCAP(t3) = 400.0; LOW_BOUND(t3) = 40.0; HIGH_BOUND(t3) = 380.0;
        MAXCAP(t4) = 500.0; LOW_BOUND(t4) = 60.0; HIGH_BOUND(t4) = 480.0;
        DOWNSTREAM(t1,t2); DOWNSTREAM(t2,t3); DOWNSTREAM(t3,t4); DOWNTOSEA(t4);
    };
}
instance is1{
    domain = Reservoir_Problem;
    non-fluents = Reservoir_non;
    init-state{
        rlevel(t1) = 75.0;
    };
    max-nondef-actions = 4;
    horizon = 10;
    discount = 1.0;
}
Reservoir 10
non-fluents Reservoir_non {
    domain = Reservoir_Problem;
    objects{
        id: {t1,t2,t3,t4,t5,t6,t7,t8,t9,t10};
    };
    non-fluents {
        RAIN(t1) = 15.0; RAIN(t2) = 10.0; RAIN(t3) = 20.0; RAIN(t4) = 30.0; RAIN(t5) = 20.0;
        RAIN(t6) = 10.0; RAIN(t7) = 35.0; RAIN(t8) = 15.0; RAIN(t9) = 25.0; RAIN(t10) = 20.0;
        MAXCAP(t2) = 200.0; LOW_BOUND(t2) = 30.0; HIGH_BOUND(t2) = 180.0;
        MAXCAP(t3) = 400.0; LOW_BOUND(t3) = 40.0; HIGH_BOUND(t3) = 380.0;
        MAXCAP(t4) = 500.0; LOW_BOUND(t4) = 60.0; HIGH_BOUND(t4) = 480.0;
        MAXCAP(t5) = 750.0; LOW_BOUND(t5) = 20.0; HIGH_BOUND(t5) = 630.0;
        MAXCAP(t6) = 300.0; LOW_BOUND(t6) = 30.0; HIGH_BOUND(t6) = 250.0;
        MAXCAP(t7) = 300.0; LOW_BOUND(t7) = 10.0; HIGH_BOUND(t7) = 180.0;
        MAXCAP(t8) = 300.0; LOW_BOUND(t8) = 40.0; HIGH_BOUND(t8) = 240.0;
        MAXCAP(t9) = 400.0; LOW_BOUND(t9) = 40.0; HIGH_BOUND(t9) = 340.0;
        MAXCAP(t10) = 800.0; LOW_BOUND(t10) = 20.0; HIGH_BOUND(t10) = 650.0;
        DOWNSTREAM(t1,t2); DOWNSTREAM(t2,t3); DOWNSTREAM(t3,t4); DOWNSTREAM(t4,t5);
        DOWNSTREAM(t6,t7); DOWNSTREAM(t7,t8); DOWNSTREAM(t8,t5);
        DOWNSTREAM(t5,t6); DOWNSTREAM(t6,t10);
        DOWNSTREAM(t5,t9); DOWNSTREAM(t9,t10);
        DOWNTOSEA(t10);
    };
}
instance is1{
    domain = Reservoir_Problem;
    non-fluents = Reservoir_non;
    init-state{
        rlevel(t1) = 175.0;
    };
    max-nondef-actions = 10;
    horizon = 10;
    discount = 1.0;
}
HVAC
Domain File
domain hvac_vav_fix{
    types {
        space : object;
    };
    pvariables {
        // Constants
        ADJ(space, space)   : { non-fluent, bool, default = false };
        ADJ_OUTSIDE(space)  : { non-fluent, bool, default = false };
        ADJ_HALL(space)     : { non-fluent, bool, default = false };
        R_OUTSIDE(space)    : { non-fluent, real, default = 4 };
        R_HALL(space)       : { non-fluent, real, default = 2 };
        R_WALL(space, space): { non-fluent, real, default = 1.5 };
        IS_ROOM(space)      : { non-fluent, bool, default = false };
        CAP(space)          : { non-fluent, real, default = 80 };
        CAP_AIR             : { non-fluent, real, default = 1.006 };
        COST_AIR            : { non-fluent, real, default = 1 };
        TIME_DELTA          : { non-fluent, real, default = 1 };
        TEMP_AIR            : { non-fluent, real, default = 40 };
        TEMP_UP(space)      : { non-fluent, real, default = 23.5 };
        TEMP_LOW(space)     : { non-fluent, real, default = 20.0 };
        TEMP_OUTSIDE(space) : { non-fluent, real, default = 6.0 };
        TEMP_HALL(space)    : { non-fluent, real, default = 10.0 };
        PENALTY             : { non-fluent, real, default = 20000 };
        AIR_MAX(space)      : { non-fluent, real, default = 10.0 };
        TEMP(space)         : { state-fluent, real, default = 10.0 };
        AIR(space)          : { action-fluent, real, default = 0.0 };
    };
    cpfs {
        // State
        TEMP'(?s) = TEMP(?s) + TIME_DELTA/CAP(?s) *
            (AIR(?s) * CAP_AIR * (TEMP_AIR - TEMP(?s)) * IS_ROOM(?s)
            + sum_{?p : space} ((ADJ(?s, ?p) | ADJ(?p, ?s)) * (TEMP(?p) - TEMP(?s)) / R_WALL(?s, ?p))
            + ADJ_OUTSIDE(?s)*(TEMP_OUTSIDE(?s) - TEMP(?s))/R_OUTSIDE(?s)
            + ADJ_HALL(?s)*(TEMP_HALL(?s) - TEMP(?s))/R_HALL(?s));
    };
    reward = - (sum_{?s : space} IS_ROOM(?s)*(AIR(?s) * COST_AIR
            + ((TEMP(?s) < TEMP_LOW(?s)) | (TEMP(?s) > TEMP_UP(?s))) * PENALTY)
            + 10.0*abs[((TEMP_UP(?s) + TEMP_LOW(?s))/2.0) - TEMP(?s)]);
    action-preconditions{
        forall_{?s : space} [ AIR(?s) >= 0 ];
        forall_{?s : space} [ AIR(?s) <= AIR_MAX(?s) ];
    };
}
Instance Files
HVAC 3 Rooms
non-fluents nf_hvac_vav_fix{
    domain = hvac_vav_fix;
    objects{
        space : { r1, r2, r3 };
    };
    non-fluents {
        //Define rooms
        IS_ROOM(r1) = true; IS_ROOM(r2) = true; IS_ROOM(r3) = true;
        //Define the adjacency
        ADJ(r1, r2) = true; ADJ(r1, r3) = true; ADJ(r2, r3) = true;
        ADJ_OUTSIDE(r1) = true; ADJ_OUTSIDE(r2) = true;
        ADJ_HALL(r1) = true; ADJ_HALL(r3) = true;
    };
}
instance inst_hvac_vav_fix{
    domain = hvac_vav_fix;
    non-fluents = nf_hvac_vav_fix;
    horizon = 20;
    discount = 1.0;
}
HVAC 6 Rooms
non-fluents nf_hvac_vav_fix{
    domain = hvac_vav_fix;
    objects{
        space : { r1, r2, r3, r4, r5, r6 };
    };
    non-fluents {
        //Define rooms
        IS_ROOM(r1) = true; IS_ROOM(r2) = true; IS_ROOM(r3) = true;
        IS_ROOM(r4) = true; IS_ROOM(r5) = true; IS_ROOM(r6) = true;
        //Define the adjacency
        ADJ(r1, r2) = true; ADJ(r1, r4) = true; ADJ(r2, r3) = true;
        ADJ(r2, r5) = true; ADJ(r3, r6) = true; ADJ(r4, r5) = true;
        ADJ(r5, r6) = true;
        ADJ_OUTSIDE(r1) = true; ADJ_OUTSIDE(r3) = true;
        ADJ_OUTSIDE(r4) = true; ADJ_OUTSIDE(r6) = true;
        ADJ_HALL(r1) = true; ADJ_HALL(r2) = true; ADJ_HALL(r3) = true;
        ADJ_HALL(r4) = true; ADJ_HALL(r5) = true; ADJ_HALL(r6) = true;
    };
}
instance inst_hvac_vav_fix{
    domain = hvac_vav_fix;
    non-fluents = nf_hvac_vav_fix;
    horizon = 20;
    discount = 1.0;
}
HVAC 60 Rooms
non-fluents nf_hvac_vav_fix{
    domain = hvac_vav_fix;
    objects{
        space : { r101, r102, r103, r104, r105, r106, r107, r108, r109, r110, r111, r112,
                  r201, r202, r203, r204, r205, r206, r207, r208, r209, r210, r211, r212,
                  r301, r302, r303, r304, r305, r306, r307, r308, r309, r310, r311, r312,
                  r401, r402, r403, r404, r405, r406, r407, r408, r409, r410, r411, r412,
                  r501, r502, r503, r504, r505, r506, r507, r508, r509, r510, r511, r512 }; // 60 rooms over five levels
    };
    non-fluents {
        //Define rooms
        //Level 1
        IS_ROOM(r101) = true;IS_ROOM(r102) = true;IS_ROOM(r103) = true;IS_ROOM(r104) = true;
        IS_ROOM(r105) = true;IS_ROOM(r106) = true;IS_ROOM(r107) = true;IS_ROOM(r108) = true;
        IS_ROOM(r109) = true;IS_ROOM(r110) = true;IS_ROOM(r111) = true;IS_ROOM(r112) = true;
        //Level 2
        IS_ROOM(r201) = true;IS_ROOM(r202) = true;IS_ROOM(r203) = true;IS_ROOM(r204) = true;
        IS_ROOM(r205) = true;IS_ROOM(r206) = true;IS_ROOM(r207) = true;IS_ROOM(r208) = true;
        IS_ROOM(r209) = true;IS_ROOM(r210) = true;IS_ROOM(r211) = true;IS_ROOM(r212) = true;
        //Level 3
        IS_ROOM(r301) = true;IS_ROOM(r302) = true;IS_ROOM(r303) = true;IS_ROOM(r304) = true;
        IS_ROOM(r305) = true;IS_ROOM(r306) = true;IS_ROOM(r307) = true;IS_ROOM(r308) = true;
        IS_ROOM(r309) = true;IS_ROOM(r310) = true;IS_ROOM(r311) = true;IS_ROOM(r312) = true;
        //Level 4
        IS_ROOM(r401) = true;IS_ROOM(r402) = true;IS_ROOM(r403) = true;IS_ROOM(r404) = true;
        IS_ROOM(r405) = true;IS_ROOM(r406) = true;IS_ROOM(r407) = true;IS_ROOM(r408) = true;
        IS_ROOM(r409) = true;IS_ROOM(r410) = true;IS_ROOM(r411) = true;IS_ROOM(r412) = true;
        //Level 5
        IS_ROOM(r501) = true;IS_ROOM(r502) = true;IS_ROOM(r503) = true;IS_ROOM(r504) = true;
        IS_ROOM(r505) = true;IS_ROOM(r506) = true;IS_ROOM(r507) = true;IS_ROOM(r508) = true;
        IS_ROOM(r509) = true;IS_ROOM(r510) = true;IS_ROOM(r511) = true;IS_ROOM(r512) = true;
        //Define the adjacency
        //Level 1
        ADJ(r101, r102) = true;ADJ(r102, r103) = true;ADJ(r103, r104) = true;
        ADJ(r104, r105) = true;ADJ(r106, r107) = true;ADJ(r107, r108) = true;
        ADJ(r107, r109) = true;ADJ(r108, r109) = true;ADJ(r110, r111) = true;
        ADJ(r111, r112) = true;
        //Level 2
        ADJ(r201, r202) = true;ADJ(r202, r203) = true;ADJ(r203, r204) = true;
        ADJ(r204, r205) = true;ADJ(r206, r207) = true;ADJ(r207, r208) = true;
        ADJ(r207, r209) = true;ADJ(r208, r209) = true;ADJ(r210, r211) = true;
        ADJ(r211, r212) = true;
        //Level 3
        ADJ(r301, r302) = true;ADJ(r302, r303) = true;ADJ(r303, r304) = true;
        ADJ(r304, r305) = true;ADJ(r306, r307) = true;ADJ(r307, r308) = true;
        ADJ(r307, r309) = true;ADJ(r308, r309) = true;ADJ(r310, r311) = true;
        ADJ(r311, r312) = true;
        //Level 4
        ADJ(r401, r402) = true;ADJ(r402, r403) = true;ADJ(r403, r404) = true;
        ADJ(r404, r405) = true;ADJ(r406, r407) = true;ADJ(r407, r408) = true;
        ADJ(r407, r409) = true;ADJ(r408, r409) = true;ADJ(r410, r411) = true;
        ADJ(r411, r412) = true;
        //Level 5
        ADJ(r501, r502) = true;ADJ(r502, r503) = true;ADJ(r503, r504) = true;
        ADJ(r504, r505) = true;ADJ(r506, r507) = true;ADJ(r507, r508) = true;
        ADJ(r507, r509) = true;ADJ(r508, r509) = true;ADJ(r510, r511) = true;
        ADJ(r511, r512) = true;
        //Inter-Level 1-2
        ADJ(r101, r201) = true;ADJ(r102, r202) = true;ADJ(r103, r203) = true;
        ADJ(r104, r204) = true;ADJ(r105, r205) = true;ADJ(r106, r206) = true;
        ADJ(r107, r207) = true;ADJ(r108, r208) = true;ADJ(r109, r209) = true;
        ADJ(r110, r210) = true;ADJ(r111, r211) = true;ADJ(r112, r212) = true;
        //Inter-Level 2-3
        ADJ(r201, r301) = true;ADJ(r202, r302) = true;ADJ(r203, r303) = true;
        ADJ(r204, r304) = true;ADJ(r205, r305) = true;ADJ(r206, r306) = true;
        ADJ(r207, r307) = true;ADJ(r208, r308) = true;ADJ(r209, r309) = true;
        ADJ(r210, r310) = true;ADJ(r211, r311) = true;ADJ(r212, r312) = true;
        //Inter-Level 3-4
        ADJ(r301, r401) = true;ADJ(r302, r402) = true;ADJ(r303, r403) = true;
        ADJ(r304, r404) = true;ADJ(r305, r405) = true;ADJ(r306, r406) = true;
        ADJ(r307, r407) = true;ADJ(r308, r408) = true;ADJ(r309, r409) = true;
        ADJ(r310, r410) = true;ADJ(r311, r411) = true;ADJ(r312, r412) = true;
        //Inter-Level 4-5
        ADJ(r401, r501) = true;ADJ(r402, r502) = true;ADJ(r403, r503) = true;
        ADJ(r404, r504) = true;ADJ(r405, r505) = true;ADJ(r406, r506) = true;
        ADJ(r407, r507) = true;ADJ(r408, r508) = true;ADJ(r409, r509) = true;
        ADJ(r410, r510) = true;ADJ(r411, r511) = true;ADJ(r412, r512) = true;
        //Outside
        //Level 1
        ADJ_OUTSIDE(r101) = true;ADJ_OUTSIDE(r102) = true;ADJ_OUTSIDE(r103) = true;
        ADJ_OUTSIDE(r104) = true;ADJ_OUTSIDE(r105) = true;ADJ_OUTSIDE(r106) = true;
        ADJ_OUTSIDE(r108) = true;ADJ_OUTSIDE(r110) = true;ADJ_OUTSIDE(r111) = true;
        ADJ_OUTSIDE(r112) = true;
        //Level 2
        ADJ_OUTSIDE(r201) = true;ADJ_OUTSIDE(r202) = true;ADJ_OUTSIDE(r203) = true;
        ADJ_OUTSIDE(r204) = true;ADJ_OUTSIDE(r205) = true;ADJ_OUTSIDE(r206) = true;
        ADJ_OUTSIDE(r208) = true;ADJ_OUTSIDE(r210) = true;ADJ_OUTSIDE(r211) = true;
        ADJ_OUTSIDE(r212) = true;
        //Level 3
        ADJ_OUTSIDE(r301) = true;ADJ_OUTSIDE(r302) = true;ADJ_OUTSIDE(r303) = true;
        ADJ_OUTSIDE(r304) = true;ADJ_OUTSIDE(r305) = true;ADJ_OUTSIDE(r306) = true;
        ADJ_OUTSIDE(r308) = true;ADJ_OUTSIDE(r310) = true;ADJ_OUTSIDE(r311) = true;
        ADJ_OUTSIDE(r312) = true;
        //Level 4
        ADJ_OUTSIDE(r401) = true;ADJ_OUTSIDE(r402) = true;ADJ_OUTSIDE(r403) = true;
        ADJ_OUTSIDE(r404) = true;ADJ_OUTSIDE(r405) = true;ADJ_OUTSIDE(r406) = true;
        ADJ_OUTSIDE(r408) = true;ADJ_OUTSIDE(r410) = true;ADJ_OUTSIDE(r411) = true;
        ADJ_OUTSIDE(r412) = true;
        //Level 5
        ADJ_OUTSIDE(r501) = true;ADJ_OUTSIDE(r502) = true;ADJ_OUTSIDE(r503) = true;
        ADJ_OUTSIDE(r504) = true;ADJ_OUTSIDE(r505) = true;ADJ_OUTSIDE(r506) = true;
        ADJ_OUTSIDE(r508) = true;ADJ_OUTSIDE(r510) = true;ADJ_OUTSIDE(r511) = true;
        ADJ_OUTSIDE(r512) = true;
        //Hallway
        //Level 1
        ADJ_HALL(r101) = true;ADJ_HALL(r102) = true;ADJ_HALL(r103) = true;
        ADJ_HALL(r106) = true;ADJ_HALL(r107) = true;ADJ_HALL(r109) = true;
        ADJ_HALL(r110) = true;
        //Level 2
        ADJ_HALL(r201) = true;ADJ_HALL(r202) = true;ADJ_HALL(r203) = true;
        ADJ_HALL(r206) = true;ADJ_HALL(r207) = true;ADJ_HALL(r209) = true;
        ADJ_HALL(r210) = true;
        //Level 3
        ADJ_HALL(r301) = true;ADJ_HALL(r302) = true;ADJ_HALL(r303) = true;
        ADJ_HALL(r306) = true;ADJ_HALL(r307) = true;ADJ_HALL(r309) = true;
        ADJ_HALL(r310) = true;
        //Level 4
        ADJ_HALL(r401) = true;ADJ_HALL(r402) = true;ADJ_HALL(r403) = true;
        ADJ_HALL(r406) = true;ADJ_HALL(r407) = true;ADJ_HALL(r409) = true;
        ADJ_HALL(r410) = true;
        //Level 5
        ADJ_HALL(r501) = true;ADJ_HALL(r502) = true;ADJ_HALL(r503) = true;
        ADJ_HALL(r506) = true;ADJ_HALL(r507) = true;ADJ_HALL(r509) = true;
        ADJ_HALL(r510) = true;
    };
}
instance inst_hvac_vav_fix{
    domain = hvac_vav_fix;
    non-fluents = nf_hvac_vav_fix;
    //init-state{
    //};
    max-nondef-actions = 60;
    horizon = 12;
    discount = 1.0;
}
Navigation
Domain File
domain Navigation_Problem{
    requirements = {
        reward-deterministic
    };
    types {
        dim: object;
    };
    pvariables {
        // Constant
        MINMAZEBOUND(dim): { non-fluent, real, default = -4.0 }; // -5.0 for 10x10 instance
        MAXMAZEBOUND(dim): { non-fluent, real, default = 4.0 };  // 5.0 for 10x10 instance
        MINACTIONBOUND(dim): { non-fluent, real, default = -1.0 }; // -0.5 for large scale instance
        MAXACTIONBOUND(dim): { non-fluent, real, default = 1.0 };  // 0.5 for large scale instance
        GOAL(dim): { non-fluent, real, default = 3.0 };
        PENALTY: { non-fluent, real, default = 1000000.0 };
        CENTER(dim): { non-fluent, real, default = 0.0 };
        // Interm
        distance: { interm-fluent, real, level=1 };
        scalefactor: { interm-fluent, real, level=2 };
        proposedLoc(dim): { interm-fluent, real, level=3 };
        // State
        location(dim): { state-fluent, real, default = -4.0 }; // -5.0 for 10x10 instance
        // Action
        move(dim): { action-fluent, real, default = 0.0 };
    };
    cpfs {
        distance = sqrt[sum_{?l: dim}[pow[(location(?l)-CENTER(?l)),2]]];
        scalefactor = 2.0/(1.0+exp[-2*distance]) - 0.99;
        proposedLoc(?l) = location(?l) + move(?l)*scalefactor;
        location'(?l) = if (proposedLoc(?l)<=MAXMAZEBOUND(?l) ^ proposedLoc(?l)>=MINMAZEBOUND(?l))
            then proposedLoc(?l)
            else (if (proposedLoc(?l)>MAXMAZEBOUND(?l))
                then MAXMAZEBOUND(?l)
                else MINMAZEBOUND(?l));
    };
    reward = - sum_{?l: dim}[abs[GOAL(?l) - location(?l)]];
    state-action-constraints {
        forall_{?l: dim} move(?l)<=MAXACTIONBOUND(?l);
        forall_{?l: dim} move(?l)>=MINACTIONBOUND(?l);
        forall_{?l: dim} location(?l)<=MAXMAZEBOUND(?l);
        forall_{?l: dim} location(?l)>=MINMAZEBOUND(?l);
    };
}
Instance Files
Navigation 8 by 8 instance
non-fluents Navigation_non {
    domain = Navigation_Problem;
    objects{
        dim: {x,y};
    };
    non-fluents {
        MINMAZEBOUND(x) = -4.0;
    };
}
instance is1{
    domain = Navigation_Problem;
    non-fluents = Navigation_non;
    init-state{
        location(x) = -4.0; location(y) = -4.0;
    };
    max-nondef-actions = 2;
    horizon = 10;
    discount = 1.0;
}
Navigation 10 by 10 instance
non-fluents Navigation_non {
    domain = Navigation_Problem;
    objects{
        dim: {x,y};
    };
    non-fluents {
        MINMAZEBOUND(x) = -5.0;
    };
}
instance is1{
    domain = Navigation_Problem;
    non-fluents = Navigation_non;
    init-state{
        location(x) = -5.0; location(y) = -5.0;
    };
    max-nondef-actions = 2;
    horizon = 10; // 20 for large scale instance
    discount = 1.0;
}