# Scalable Nonlinear Planning with Deep Neural Network Learned Transition Models

In many real-world planning problems with factored, mixed discrete and continuous state and action spaces such as Reservoir Control, Heating, Ventilation and Air Conditioning (HVAC), and Navigation domains, it is difficult to obtain a model of the complex nonlinear dynamics that govern state evolution. However, the ubiquity of modern sensors allows us to collect large quantities of data from each of these complex systems and build accurate, nonlinear deep neural network models of their state transitions. But there remains one major problem for the task of control -- how can we plan with deep network learned transition models without resorting to Monte Carlo Tree Search and other black-box transition model techniques that ignore model structure and do not easily extend to mixed discrete and continuous domains? In this paper, we introduce two types of nonlinear planning methods that can leverage deep neural network learned transition models: Hybrid Deep MILP Planner (HD-MILP-Plan) and Tensorflow Planner (TF-Plan). In HD-MILP-Plan, we make the critical observation that the Rectified Linear Unit transfer function for deep networks not only allows faster convergence of model learning, but also permits a direct compilation of the deep network transition model to a Mixed-Integer Linear Program encoding. Further, we identify deep network specific optimizations for HD-MILP-Plan that improve performance over a base encoding and show that we can plan optimally with respect to the learned deep networks. In TF-Plan, we take advantage of the efficiency of auto-differentiation tools and GPU-based computation where we encode a subclass of purely continuous planning problems as Recurrent Neural Networks and directly optimize the actions through backpropagation. We compare both planners and show that TF-Plan is able to approximate the optimal plans found by HD-MILP-Plan in less computation time...


## 1 Introduction

In many real-world planning problems with factored [Boutilier1999] and mixed discrete and continuous state and action spaces such as Reservoir Control [Yeh1985], Heating, Ventilation and Air Conditioning (HVAC) [agarwal2010], and Navigation [nonlinear_path_planning], it is difficult to obtain a model of the complex nonlinear dynamics that govern state evolution. For example, in Reservoir Control, evaporation and other sources of water loss are a complex function of volume, bathymetry, and environmental conditions; in HVAC domains, thermal conductance between walls and convection properties of rooms are nearly impossible to derive from architectural layouts; and in Navigation problems, nonlinear interactions between surfaces and traction devices make it hard to accurately predict odometry.

A natural answer to these modeling difficulties is to instead learn the transition model from sampled data; fortunately, the presence of vast sensor networks often makes such data inexpensive and abundant. While learning nonlinear models with a priori unknown model structure can be very difficult in practice, recent progress in Deep Learning and the availability of off-the-shelf tools such as Tensorflow and Pytorch [paszke2017automatic] make it possible to learn highly accurate nonlinear deep neural networks with little prior knowledge of model structure.

However, the modeling of a nonlinear transition model as a deep neural network poses non-trivial difficulties for the optimal control task. Existing nonlinear planners either are not compatible with nonlinear deep network transition models and continuous action input [Penna2009, lohr, coles2013hybrid, ivankovic, piotrowski, scala2016interval], or only optimize goal-oriented objectives [Bryce2015, Scala2016-2, Cashmore2016]. Monte Carlo Tree Search (MCTS) methods [mcts, uct, keller_icaps13], including AlphaGo [silver2016mastering], that could exploit a deep network learned black-box model of transition dynamics do not inherently work with continuous action spaces due to the infinite branching factor. While MCTS extensions to continuous actions such as HOOT [weinstein2012bandit] have been proposed, their continuous partitioning methods do not scale to high-dimensional concurrent and continuous action spaces. Finally, offline model-free reinforcement learning with function approximation [sutton_barto, csaba_rl] and deep extensions [deepqn] do not directly apply to domains with high-dimensional continuous action spaces. That is, offline learning methods like Q-learning require action maximization for every update, but in high-dimensional continuous action spaces such nonlinear function maximization is non-convex and computationally intractable at the scale of millions or billions of updates.

Despite these limitations of existing methods, all is not lost. First, we remark that our deep network is not a black-box but rather a gray-box; while the learned parameters often lack human interpretability, there is still a uniform layered symbolic structure in the deep neural network models. Second, we make the critical observation that the popular Rectified Linear Unit (ReLU) [relu] transfer function for deep networks enables effective nonlinear deep neural network model learning and permits a direct compilation to a Mixed-Integer Linear Program (MILP) encoding. Given other components such as a human-specified objective function and a horizon, this permits direct optimization in a method we call Hybrid Deep MILP Planner (HD-MILP-Plan).

While arguably an important step forward, we remark that planners with optimality guarantees such as HD-MILP-Plan can only scale up to moderate-sized planning problems. Hence in an effort to scale to substantially larger control problems, we focus on a general subclass of planning problems with purely continuous state and action spaces in order to take advantage of the efficiency of auto-differentiation tools and GPU-based computation. Specifically, we propose to extend work using the Tensorflow tool for planning [Wu2017] in deterministic continuous RDDL [Sanner:RDDL] domains to the case of learned neural network transition models investigated in this article. Specifically, we show that we can embed both a reward function and a deep-learned transition function into a Recurrent Neural Network (RNN) cell, chain multiple cells together for a fixed horizon, and produce a plan in the resulting RNN encoding through end-to-end backpropagation in a method we call Tensorflow Planner (TF-Plan).

Experimentally, we compare HD-MILP-Plan and TF-Plan versus manually specified domain-specific policies on Reservoir Control, HVAC, and Navigation domains. Our primary objectives are to comparatively evaluate the ability of HD-MILP-Plan and TF-Plan to produce high-quality plans with limited computational resources in an online planning setting, and to assess their performance against carefully designed manual policies. For HD-MILP-Plan, we show that our strengthened MILP encoding improves the quality of plans produced in less computational time over the base encoding. For TF-Plan, we show the scalability and efficiency of our planner on large-scale problems and its ability to approximate the optimal plans found by HD-MILP-Plan on moderate-sized problems. Overall, this article contributes and evaluates two novel approaches for planning in domains with learned deep neural network transition models: an optimal method, HD-MILP-Plan, for mixed discrete and continuous state and action spaces with moderate scalability, and a fast, scalable GPU-based planner, TF-Plan, for the subset of purely continuous state and action domains.

## 2 Preliminaries

Before we discuss deep network transition learning, we review the factored planning problem motivating this work.

### 2.1 Factored Planning Problem

A deterministic factored planning problem is a tuple Π = ⟨S, A, C, T, I, G, Q⟩, where S is a mixed set of state variables (states) with discrete and continuous domains, A is a mixed set of action variables (actions) with discrete and continuous domains, C is a function that returns true if an action and state pair satisfies the global constraints, T denotes the transition function between time steps t and t+1 such that s_{t+1} = T(s_t, a_t) if C(s_t, a_t) is true and is undefined otherwise, I is the initial state constraint that assigns values to all state variables in S, and G is the set of goal state constraints over a subset of state variables S_G ⊆ S. Finally, Q denotes the state-action reward function. Given a planning horizon H, an optimal solution to Π is a plan ⟨a_1, …, a_H⟩ that maximizes the total reward Σ_{t=1}^{H} Q(s_{t+1}, a_t) subject to s_{t+1} = T(s_t, a_t) for all time steps t ∈ {1, …, H}.

In many real-world problems, it is difficult to model the exact dynamics of the complex nonlinear transition function T that governs the evolution of states over the horizon H. Therefore, in this paper, we do not assume a priori knowledge of T, but rather learn it from data. We limit our model knowledge to the reward function Q, the horizon H, the global constraint function C that specifies whether actions are applicable in a given state at time t (e.g., the outflow from a reservoir must not exceed the present water level), and the goal state constraints G.

## 3 Neural Network Transition Learning

A neural network is a layered, acyclic, directed computational network structure inspired by the biological neural networks that constitute our brains [goodfellow2016deep]. A neural network typically has one or more hidden layers, where more hidden layers permit deeper information extraction. Each hidden layer typically consists of a linear transformation followed by a nonlinear activation. While most traditional nonlinear activation functions are bounded (e.g., the sigmoid or hyperbolic tangent function), simpler piecewise linear activation functions have become popular because of their computational efficiency and robustness against saturation.

### 3.1 Network Structure

We model the transition function as a modified version of a densely-connected network [huang2017densely] (densely-connected networks were originally proposed for Convolutional Neural Networks, where they concatenate filter maps), as shown in Figure 1, which, in comparison to a standard fully connected network, allows direct connections of each layer to the output. This can be advantageous when a transition function has differing levels of nonlinearity, allowing linear transitions to pass directly from the input to the output layer, while nonlinear transitions may pass through one or more of the hidden layers.

Given a deep neural network configuration with K hidden layers, state vectors s_n, s'_n and action vectors a_n from data D = {(s_n, a_n, s'_n) | n ∈ {1, …, N}}, and a regularization hyperparameter λ, the optimal weights W_k and biases b_k for all layers can be found by solving the following optimization problem:

$$\begin{aligned}
\underset{W_k, b_k,\, k \in \{1,\dots,K\}}{\text{minimize}} \quad & \sum_{n \in \{1,\dots,N\}} \gamma \left\lVert \tilde{s}'_n - s'_n \right\rVert_F^2 + \lambda \sum_{k \in \{1,\dots,K\}} \lVert W_k \rVert_F^2 \\
\text{subject to} \quad & z_k = g\!\left( (s_n \,\Vert\, a_n \,\Vert\, z_1 \,\Vert\, \dots \,\Vert\, z_{k-1}) W_k^\top + b_k \right) \quad \forall k \in \{1,\dots,K-1\},\, n \in \{1,\dots,N\} \qquad (1) \\
& \tilde{s}'_n = (s_n \,\Vert\, a_n \,\Vert\, z_1 \,\Vert\, \dots \,\Vert\, z_{K-1}) W_K^\top + b_K \quad \forall n \in \{1,\dots,N\} \qquad (2)
\end{aligned}$$

The objective minimizes the squared reconstruction error of the transitions in the data plus a regularizer. Constraints (1) and (2) define the nonlinear activations z_k of each hidden layer and the output s̃'_n, where ∥ denotes the concatenation operation and g denotes the nonlinear activation function. We omit the subscript n on the intermediate results z_k as they are temporary values.

We use Rectified Linear Units (ReLUs) [relu] of the form g(x) = max(x, 0) as the activation function in this paper. There is a two-fold benefit of deploying ReLU activation in planning tasks. First, in comparison to other activation functions such as the sigmoid and hyperbolic tangent, ReLUs can be trained efficiently and permit a direct compilation to a set of linear constraints in a Mixed-Integer Linear Program (MILP), as we discuss in section 4. Second, ReLU activation is robust to vanishing gradients in long-term backpropagation, which is advantageous in the context of planning through backpropagation, as we discuss in section 5.
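To make the densely-connected forward pass concrete, here is a minimal NumPy sketch of constraints (1)-(2): each hidden layer consumes the concatenation of the raw input and all earlier hidden layers, and the output layer is linear. The dimensions, weights, and inputs below are hypothetical.

```python
import numpy as np

def relu(x):
    # ReLU activation g(x) = max(x, 0)
    return np.maximum(x, 0.0)

def dense_net_transition(s, a, weights, biases):
    """Densely-connected forward pass: every hidden layer sees the
    concatenation of the input (s, a) and all previous hidden layers;
    the final (output) layer is linear."""
    x = np.concatenate([s, a])
    hidden = []
    for W, b in zip(weights[:-1], biases[:-1]):
        hidden.append(relu(np.concatenate([x] + hidden) @ W.T + b))
    W_K, b_K = weights[-1], biases[-1]
    return np.concatenate([x] + hidden) @ W_K.T + b_K  # predicted s'

# Hypothetical sizes: 2 state dims, 1 action dim, one hidden layer of 4 units.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 3 + 4))]
biases = [rng.normal(size=4), rng.normal(size=2)]
s_next = dense_net_transition(np.array([0.5, -0.2]), np.array([1.0]),
                              weights, biases)
print(s_next.shape)  # (2,)
```

Note that the output layer consumes both the raw input and the hidden layer, mirroring the direct input-to-output connections described above.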

### 3.2 Input and Output Normalization

Inputs and outputs of the transition function usually have multiple dimensions, and each dimension may have dramatically different magnitude and variance. For example, in Reservoir Control problems, the capacity difference among reservoirs can be large. Such significant differences cause unbalanced contributions to the training loss, which results in unbalanced prediction quality across the states. To address this problem, we deploy a loss weighting parameter γ on the Mean Squared Error loss. The loss weighting balances the loss contributions through

$$\gamma = \frac{\alpha}{s_{\max}^2},$$

where α is a hyperparameter and s_max is the vector of the maximum value each dimension can encounter. In addition to the loss weighting, we normalize the input before feeding it into the neural network, as this significantly improves training quality [goodfellow2016deep]. To simplify the HD-MILP-Plan compilation, we push the normalization function into the learned parameters as

$$w'_{ij} = \frac{w_{ij}}{\sigma_i} \quad \text{and} \quad b'_j = b_j - \sum_i \frac{\mu_i w_{ij}}{\sigma_i},$$

where w_{ij} is the weight connecting dimension i of the input layer to dimension j of the first hidden layer, b_j is the bias of dimension j of the first hidden layer, and μ_i and σ_i are the mean and standard deviation of input dimension i. We show the full derivation in the Appendix for interested readers.
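The folding step can be checked numerically. In this NumPy sketch (all dimensions and values hypothetical), applying the folded parameters to the raw input reproduces the original first layer applied to the normalized input:

```python
import numpy as np

# Sketch of folding input normalization (x - mu) / sigma into the first
# layer's learned parameters, so the compiled planner can consume raw inputs.
rng = np.random.default_rng(1)
d_in, d_hid = 3, 5
W = rng.normal(size=(d_hid, d_in))        # first-layer weights w_ij
b = rng.normal(size=d_hid)                # first-layer biases b_j
mu = rng.normal(size=d_in)                # per-dimension input means
sigma = rng.uniform(0.5, 2.0, size=d_in)  # per-dimension std devs

# Folded parameters: w'_ij = w_ij / sigma_i, b'_j = b_j - sum_i mu_i w_ij / sigma_i
W_folded = W / sigma                       # broadcasts over the input dimension
b_folded = b - (W * (mu / sigma)).sum(axis=1)

x = rng.normal(size=d_in)
normalized_then_linear = ((x - mu) / sigma) @ W.T + b
raw_through_folded = x @ W_folded.T + b_folded
print(np.allclose(normalized_then_linear, raw_through_folded))  # True
```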

### 3.3 Model Complexity Minimization

Learning a complex function from a finite amount of training data can result in unsatisfactory generalization [dietterich1995overfitting], especially for regression problems [harrell2014regression]. Using an overfitted transition model in planning can be catastrophic, since planners can exploit the model's generalization limitations to propose a solution that is invalid in the real domain. While deep network structures show remarkable generalization ability given their over-parameterization, maintaining the smallest model that preserves sufficient modeling capacity is necessary [zhang2016understanding]. In the deep learning literature, an effective way to reduce model complexity is to train the model with Dropout [srivastava2014dropout], a structured regularization that approximates Bayesian Model Averaging [gal2016dropout] by implicitly voting over a set of candidate models. We deploy Dropout only after each of the hidden layers in the densely-connected network, since the correlation of input dimensions is informative in planning tasks.

Aggressively adding layers into the network does not necessarily improve the prediction accuracy of the learned function [ba2014deep] and can introduce an additional computational cost to the planners that take this function as input. For HD-MILP-Plan, additional layers result in additional big-M constraints which increase the computational cost of computing a plan as we discuss and experimentally show in sections 4 and 6, respectively.

## 4 Hybrid Deep MILP Planner

Hybrid Deep MILP Planner (HD-MILP-Plan) is a two-stage framework for learning and optimizing nonlinear planning problems; the term hybrid refers to mixed (i.e., discrete and continuous) action and state spaces as used in the MDP literature [kveton2006solving]. The first stage of HD-MILP-Plan learns the unknown transition function with a densely-connected network as discussed previously in section 3. The learned transition function is then used to construct the learned planning problem. Given a planning horizon H, HD-MILP-Plan compiles the learned planning problem into a Mixed-Integer Linear Program (MILP) and finds an optimal plan using an off-the-shelf MILP solver. HD-MILP-Plan operates as an online planner where actions are optimized over the remaining planning horizon in response to sequential state observations from the environment.

We now describe the base MILP encoding of HD-MILP-Plan. Then, we strengthen the linear relaxation of our base MILP encoding for solver efficiency.

### 4.1 Base MILP Encoding

We begin with all notation necessary for the HD-MILP-Plan specification:

#### 4.1.1 Parameters

• V_I(s) is the value of the initial state s ∈ S.

• R is the set of ReLUs in the neural network.

• B is the set of bias units in the neural network.

• O is the set of output units in the neural network.

• W is the set of weights over the synapses (connections) of the neural network, where w_{fg} denotes the learned weight of the synapse connecting unit f to unit g.

• A(g) is the set of actions connected to unit g.

• S(g) is the set of states connected to unit g.

• U(g) is the set of units connected to unit g.

• O_lin(s) specifies the output unit with linear activation function that predicts state s.

• O_step(s) specifies the output unit with binary step activation function that predicts state s.

• M is a large constant used in the big-M constraints.

#### 4.1.2 Decision variables

• X_a^t denotes the value assignment to action a ∈ A from its domain at time step t. The domain of a can be either discrete or continuous.

• Y_s^t denotes the value of state s ∈ S at time step t. The domain of s can be either discrete or continuous.

• P_g^t denotes the output of unit g at time step t.

• Pb_g^t = 1 if rectified linear unit g ∈ R is activated at time step t, and 0 otherwise (i.e., Pb_g^t is a Boolean variable).

#### 4.1.3 The MILP Compilation

Next we define the MILP formulation of our planning optimization problem that encodes the learned transition model.

$$\begin{aligned}
\text{maximize} \quad & \sum_{t=1}^{H} Q(\{Y_s^{t+1}, X_a^t \mid s \in S, a \in A\}) & & (3) \\
\text{subject to} \quad & Y_s^1 = V_I(s) & \forall s \in S \quad & (4) \\
& C(\{Y_s^t, X_a^t \mid s \in S, a \in A\}) & & (5) \\
& G(\{Y_s^{H+1} \mid s \in S_G\}) & & (6) \\
& P_f^t = 1 & \forall f \in B \quad & (7) \\
& P_f^t \le M \, Pb_f^t & \forall f \in R \quad & (8) \\
& P_g^t \le M (1 - Pb_g^t) + Pin_g^t & \forall g \in R \quad & (9) \\
& P_g^t \ge Pin_g^t & \forall g \in R \quad & (10) \\
& Y_s^{t+1} = Pin_g^t & \forall g \in O_{lin}(s),\, s \in S_c \quad & (11) \\
& Y_s^{t+1} + 0.5 \ge Pin_g^t & \forall g \in O_{lin}(s),\, s \in S_d \quad & (12) \\
& Y_s^{t+1} - 0.5 \le Pin_g^t & \forall g \in O_{lin}(s),\, s \in S_d \quad & (13) \\
& M \, Y_s^{t+1} \ge Pin_g^t & \forall g \in O_{step}(s),\, s \in S_d \quad & (14) \\
& -M (1 - Y_s^{t+1}) \le Pin_g^t & \forall g \in O_{step}(s),\, s \in S_d \quad & (15)
\end{aligned}$$

where the expression $Pin_g^t = \sum_{f \in U(g)} w_{fg} P_f^t + \sum_{s \in S(g)} w_{sg} Y_s^t + \sum_{a \in A(g)} w_{ag} X_a^t$ for all $g \in R$, and the constraints hold for all time steps t = 1, …, H, except Constraint (4), which holds only at t = 1, and Constraint (6), which holds only at t = H+1.

In the above MILP, the objective function (3) maximizes the sum of rewards over the given horizon H. Constraint (4) connects the input units of the neural network to the initial state of the planning problem at time step t = 1. Constraint (5) ensures that the global constraints are satisfied at every time step t. Constraint (6) ensures that the output units of the neural network satisfy the goal state constraints of the planning problem at time step t = H+1. Constraint (7) sets all neurons that represent biases equal to 1. Constraint (8) ensures that a ReLU g is activated if the total weighted input flow into g is positive. Constraints (9)-(10) together ensure that if a ReLU g is active, the outflow from g is equal to the total weighted input flow. Constraints (11)-(15) predict the states at time step t+1 given the values of states, actions, and ReLUs at time step t using different activation functions. In Constraint (11) and Constraints (12)-(13), a linear activation function is used to predict continuous and discrete state variables, respectively. In Constraints (14)-(15), the binary step function is used to predict a state variable with Boolean domain. Note that a one-hot encoding with linear input is also MILP-compilable and can be used to predict Boolean state variables when the problem instance permits its use, that is, when the global constraints include a constraint requiring that exactly one of the Boolean state variables be true at every time step.

Constraints (8)-(10) sufficiently encode the piecewise linear activation function of the ReLUs. However, the positive unbounded nature of the ReLUs leads to a poor linear relaxation of the big-M constraints, that is, when all Boolean variables (superscripted with b) are relaxed to continuous in Constraints (8)-(9); this can significantly hinder the overall performance of standard branch-and-bound MILP solvers that rely on the linear relaxation of the MILP for guidance. Next, we strengthen our base MILP encoding by preprocessing bounds on state and action variables, and by adding auxiliary decision variables and linear constraints, to improve its LP relaxation.
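To see why Constraints (8)-(10) capture the ReLU exactly once the indicator variable is binary, consider the following small self-contained check; the big-M value and the candidate grid are illustrative, and the extra nonnegativity condition reflects the ReLU output variable's domain.

```python
# Hypothetical check that the big-M constraints of the base encoding,
# together with P_g >= 0 and a binary indicator Pb_g, admit exactly the
# ReLU output P_g = max(Pin_g, 0) for a given weighted input Pin_g.

M = 100.0  # big-M constant; must exceed any attainable input magnitude

def feasible_outputs(pin, grid):
    """All p on a discrete grid satisfying Constraints (8)-(10) plus
    nonnegativity, for some binary assignment of the indicator pb."""
    outs = set()
    for pb in (0, 1):
        for p in grid:
            if (p <= M * pb                       # Constraint (8)
                    and p <= M * (1 - pb) + pin   # Constraint (9)
                    and p >= pin                  # Constraint (10)
                    and p >= 0):                  # ReLU output domain
                outs.add(p)
    return outs

grid = [x / 2.0 for x in range(-10, 11)]  # -5.0, -4.5, ..., 5.0
for pin in (-3.0, 0.0, 2.5):
    print(pin, feasible_outputs(pin, grid))
```

For a negative input only the inactive assignment (output 0) survives, and for a positive input only the active assignment (output equal to the input) survives; relaxing pb to a fraction is what loosens this set and motivates the strengthened encoding below.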

### 4.2 Strengthened MILP Encoding

In our base MILP encoding, Constraints (8)-(10) encode the piecewise linear activation function max(x, 0) using big-M constraints for each ReLU g ∈ R. We strengthen the linear relaxation of Constraints (8)-(9) by first finding tighter bounds on the input units of the neural network, and then separating each input into its positive and negative components. Using these auxiliary variables, we augment our base MILP encoding with an additional linear inequality that upper bounds the output of each ReLU by the positive contributions of its inputs. This inequality is valid because replacing each input by its positive component can only increase the total weighted input flow into the ReLU.

#### 4.2.1 Preprocessing Bounds

The optimization problems solved to find the tightest bounds on the input units of the neural network are as follows. The tightest lower bound on an action variable can be obtained by solving the following optimization problem:

$$\begin{aligned}
\text{minimize} \quad & X_a^t & (16) \\
\text{subject to} \quad & \text{Constraints (4)-(15)}
\end{aligned}$$

Similarly, the tightest lower bounds on state variables, and the upper bounds on action and state variables, can be obtained by simply replacing the expression X_a^t in the objective function (16) with Y_s^t, −X_a^t, and −Y_s^t, respectively. Given that the preprocessing optimization problems have the same theoretical complexity as the original learned planning optimization problem (i.e., NP-hard), we limit the computational budget allocated to each preprocessing optimization problem to a fixed amount, and set the lower and upper bounds on the domains of the action and state decision variables X_a^t, Y_s^t to the best dual bounds found in each respective problem.

#### 4.2.2 Additional Decision Variables

The additional decision variables required to implement our strengthened MILP are as follows:

• X_a^{t+} and X_a^{t−} denote the positive and negative value assignments to action a ∈ A at time step t, respectively.

• Y_s^{t+} and Y_s^{t−} denote the positive and negative values of state s ∈ S at time step t, respectively.

• Xb_a^t = 1 if X_a^t is positive at time step t, and 0 otherwise.

• Yb_s^t = 1 if Y_s^t is positive at time step t, and 0 otherwise.

#### 4.2.3 Additional Constraints

The additional constraints in the strengthened MILP are as follows:

$$\begin{aligned}
& X_a^t = X_a^{t+} + X_a^{t-} & (17) \\
& X_a^t \le U_a \, Xb_a^t & (18) \\
& X_a^t \ge L_a (1 - Xb_a^t) & (19) \\
& X_a^{t+} \le U_a \, Xb_a^t & (20) \\
& X_a^{t-} \ge L_a (1 - Xb_a^t) & (21) \\
& \qquad \text{for all actions } a \in A \text{ where } L_a < 0, \text{ time steps } t = 1, \dots, H \\
& Y_s^t = Y_s^{t+} + Y_s^{t-} & (22) \\
& Y_s^t \le U_s \, Yb_s^t & (23) \\
& Y_s^t \ge L_s (1 - Yb_s^t) & (24) \\
& Y_s^{t+} \le U_s \, Yb_s^t & (25) \\
& Y_s^{t-} \ge L_s (1 - Yb_s^t) & (26) \\
& \qquad \text{for all states } s \in S \text{ where } L_s < 0, \text{ time steps } t = 1, \dots, H+1
\end{aligned}$$

Here, the pairs (L_a, U_a) and (L_s, U_s) denote the lower and upper bounds on the domains of the action and state decision variables X_a^t and Y_s^t, respectively, found by solving the preprocessing optimization problems. Given Constraints (17)-(26), Constraint (27) implements our strengthening constraint, which provides a valid upper bound on the output of each ReLU g ∈ R.

$$\begin{aligned}
& \sum_{\substack{s \in S(g) \\ w_{sg} > 0,\, L_s \ge 0}} w_{sg} Y_s^t + \sum_{\substack{s \in S(g) \\ w_{sg} > 0,\, L_s < 0}} w_{sg} Y_s^{t+} + \sum_{\substack{s \in S(g) \\ w_{sg} < 0,\, L_s < 0}} w_{sg} Y_s^{t-} \\
& + \sum_{\substack{a \in A(g) \\ w_{ag} > 0,\, L_a \ge 0}} w_{ag} X_a^t + \sum_{\substack{a \in A(g) \\ w_{ag} > 0,\, L_a < 0}} w_{ag} X_a^{t+} + \sum_{\substack{a \in A(g) \\ w_{ag} < 0,\, L_a < 0}} w_{ag} X_a^{t-} \\
& + \sum_{\substack{f \in U(g) \cap R \\ w_{fg} > 0}} w_{fg} P_f^t + \sum_{\substack{f \in U(g) \cap B \\ w_{fg} > 0}} w_{fg} P_f^t \ge P_g^t & (27) \\
& \qquad \text{for all ReLUs } g \in R, \text{ time steps } t = 1, \dots, H
\end{aligned}$$

## 5 Nonlinear Planning via Auto-differentiation

In this section, we present auto-differentiation based planning, which we call Tensorflow Planner (TF-Plan). TF-Plan plans through the learned neural networks using auto-differentiation, where variables are tensors that flow through the network's operation pipeline. TF-Plan represents the planning task as a symbolic recurrent neural network (RNN) architecture with action parameter inputs that are directly amenable to optimization with GPU-based symbolic toolkits such as Tensorflow and Pytorch.

### 5.1 Planning through Backpropagation

Backpropagation [rumelhart1988learning] is a standard method for optimizing the parameters of deep neural networks via gradient descent. Using the chain rule of derivatives, backpropagation propagates the derivative of the output error of a neural network back to each of its parameters in a single linear-time pass in the size of the network, using what is known as reverse-mode automatic differentiation [Linnainmaa1970]. Despite its theoretical efficiency, backpropagation in large-scale deep neural networks is still computationally expensive in practice, and it has only become practical at scale with the advent of recent GPU-based symbolic toolkits.

We reverse the idea of training the parameters of a network given fixed inputs to instead optimizing the inputs (i.e., actions) given fixed parameters. As shown in Figure 2, given the learned transition function and the reward function Q, we want to optimize the actions a_t for all time steps t ∈ {1, …, H} to maximize the accumulated reward value V = Σ_{t=1}^{H} r_{t+1}, where r_{t+1} = Q(s_{t+1}, a_t). Specifically, we want to optimize all actions with respect to a planning loss L (defined shortly as a function of V) that we minimize via the following gradient update schema:

$$a \leftarrow a - \eta \frac{\partial L}{\partial a}, \qquad (28)$$

where η is the optimization rate, and the partial derivatives comprising the gradient-based optimization for problem instance i are computed as

$$\frac{\partial L}{\partial a_{tj}^i} = \frac{\partial L}{\partial L_i} \frac{\partial L_i}{\partial a_{tj}^i} = \frac{\partial L}{\partial L_i} \left[ \frac{\partial L_i}{\partial s_{t+1}^i} \frac{\partial s_{t+1}^i}{\partial a_{tj}^i} + \frac{\partial L_i}{\partial r_{t+1}^i} \frac{\partial r_{t+1}^i}{\partial a_{tj}^i} \right] = \frac{\partial L}{\partial L_i} \left[ \frac{\partial s_{t+1}^i}{\partial a_{tj}^i} \sum_{\tau=t+1}^{H} \left[ \frac{\partial L_i}{\partial r_\tau^i} \frac{\partial r_\tau^i}{\partial s_\tau^i} \prod_{\kappa=t+1}^{\tau-1} \frac{\partial s_\kappa^i}{\partial s_{\kappa-1}^i} \right] + \frac{\partial L_i}{\partial r_{t+1}^i} \frac{\partial r_{t+1}^i}{\partial a_{tj}^i} \right], \qquad (29)$$

where we define the total loss as L = Σ_i L_i over multiple planning instances i. Since neither the transition nor the reward function is assumed to be convex, optimization on a domain with such dynamics could result in a local minimum or saddle point. To mitigate this problem, we randomly initialize actions for a batch of instances, optimize multiple mutually independent planning instances simultaneously, and eventually return the best-performing action sequence over all instances.

While there are multiple choices of loss function in auto-differentiation toolkits, we minimize the squared cumulative reward L_i = V_i², since the cumulative rewards we test in this paper are at most piecewise linear. (The derivative of a linear function yields a constant value, which is not informative when updating actions using the gradient update schema (28).) The optimization of this loss function has dual effects: it optimizes each problem instance independently and provides fast convergence (i.e., faster than optimizing Σ_i V_i). We remark that simply defining the objective and the definition of all state variables in terms of predecessor state and action variables via the transition dynamics is enough for auto-differentiation toolkits to build the symbolic directed acyclic graph (DAG) representing the objective and to take its gradient with respect to all free action parameters, as shown in (29), using reverse-mode automatic differentiation.
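The overall procedure, the gradient update (28) driven by the chained derivatives of (29), can be illustrated with a toy NumPy sketch. Here a known linear transition s_{t+1} = s_t + a_t and a quadratic loss stand in for the learned network and reward, so this is only a schematic of the optimization loop, not the TF-Plan implementation; the horizon, step size, bounds, and goal are hypothetical.

```python
import numpy as np

# Toy sketch of planning by backpropagation: actions are the free
# parameters; the transition and loss are stand-ins for the learned model.
H, eta, goal = 5, 0.02, 3.0
a = np.zeros(H)                      # the action sequence being optimized
for epoch in range(2000):
    # Forward pass: roll out the state trajectory under the current plan.
    s = np.zeros(H + 1)
    for t in range(H):
        s[t + 1] = s[t] + a[t]
    # Backward pass: dL/da_t for L = sum_t (s_{t+1} - goal)^2; each action
    # a_t influences every later state, mirroring the chain rule in (29).
    grad = np.array([sum(2.0 * (s[k + 1] - goal) for k in range(t, H))
                     for t in range(H)])
    a -= eta * grad                  # gradient update schema (28)
    a = np.clip(a, -1.0, 1.0)        # project onto hypothetical bounds
print(np.round(s, 2))
```

With the per-step movement bound of 1, the optimizer pushes the early actions toward their upper bound and then holds the state at the goal.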

### 5.2 Planning over Long Horizons

The TF-Plan compilation of a nonlinear planning problem reflects the same structure as an RNN commonly used in deep learning. The connection here is not superficial, since a longstanding difficulty with training RNNs lies in the vanishing gradient problem: multiplying long sequences of gradients via the chain rule usually renders them extremely small and irrelevant for weight updates, especially when using nonlinear activation functions that can saturate, such as the sigmoid. As described previously, we mitigate this issue by training the transition function with ReLU and linear activation functions, both of which avoid attenuating the gradient at each time step. We note that the reward function does not trigger the vanishing gradient problem, since the output r_{t+1} of each time step is directly connected to the loss function.

### 5.3 Handling Bounds on Actions

Bounds on actions are common in many planning tasks. For example, in the Navigation domain, the distance the agent can move at each time step is bounded by constant minimum and maximum values. To handle actions with range constraints, we use projected stochastic gradient descent (PSGD). Projected gradient descent [calamai1987projected] handles constrained optimization problems by projecting the parameters (here, actions) into their feasible range after each gradient update. Precisely, we clip all actions to their feasible range after each epoch of gradient descent:

$$a \leftarrow \min(\max(a, L_a), U_a)$$

In an online planning setting, TF-Plan ensures the feasibility of actions with bound constraints using PSGD at each time step t, and relaxes the remaining global constraints and goal constraints during planning.
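In NumPy terms, the projection step is a single elementwise clip; the bounds and the batch of candidate action vectors below are hypothetical.

```python
import numpy as np

# Projection step of PGD: clip each action dimension to its bounds
# after a gradient update.
L_bound = np.array([-1.0, 0.0])   # per-dimension lower bounds L_a
U_bound = np.array([1.0, 0.5])    # per-dimension upper bounds U_a
a = np.array([[1.7, -0.3],        # batch of candidate actions,
              [-2.0, 0.9]])       # one row per planning instance
a = np.minimum(np.maximum(a, L_bound), U_bound)
print(a)  # rows clipped to [[1.0, 0.0], [-1.0, 0.5]]
```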

## 6 Experimental Results

In this section, we present experimental results that empirically test the performance of both HD-MILP-Plan and TF-Plan on multiple nonlinear planning domains. To accomplish this task, we first present three nonlinear benchmark domains, namely Reservoir Control, Heating, Ventilation and Air Conditioning, and Navigation. Then, we validate the transition learning performance of our proposed ReLU-based densely-connected neural networks with different network configurations in each domain. Finally, we evaluate the efficacy of both proposed planning frameworks based on the learned models by comparing them to strong, manually coded baseline policies in an online planning setting. (As noted in the Introduction, MCTS and model-free reinforcement learning are not applicable as baselines given our multi-dimensional concurrent continuous action spaces.) For HD-MILP-Plan, we test the effect of preprocessing and the strengthened MILP encoding on run time and solution quality. For TF-Plan, we investigate the impact of the number of epochs on planning quality. Finally, we test the scalability of both planners on large-scale domains and show that TF-Plan scales much more gracefully than HD-MILP-Plan.

### 6.1 Illustrative Domains

Reservoir Control has a single state variable l^r for each reservoir r, denoting its water level, and a corresponding action variable f^r that permits a flow (up to a maximum allowable flow) from reservoir r to the next downstream reservoir. The transition function is nonlinear due to the evaporation e^r from each reservoir r, which is defined by the formula

$$e_t^r = \frac{1}{2} \cdot \sin\!\left(\frac{1}{2} \cdot l_t^r\right) \cdot 0.1,$$

and the water level transition function is

$$l_{t+1}^r = l_t^r + \sum_{r_{up}} f_t^{r_{up}} - f_t^r - e_t^r,$$

where r_{up} ranges over all upstream reservoirs of r and each flow f_t^r respects its bounds. The reward function minimizes the total absolute deviation from a desired water level, plus a constant penalty for water levels outside of a safe range (close to empty or overflowing), and is defined for each time step by the formula

$$Q(l_{t+1}, f_t) = -\sum_r \Big( 0.1 \cdot \big| (m^r + n^r)/2 - l_{t+1}^r \big| + 100 \cdot \max(m^r - l_{t+1}^r, 0) + 5 \cdot \max(l_{t+1}^r - n^r, 0) \Big),$$

where m^r and n^r define the lower and upper bounds of the desired range for each reservoir r. We report results on small instances with 3 and 4 reservoirs, and large instances with 10 reservoirs, over a range of planning horizons.
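The reservoir dynamics and reward above can be written directly in Python; the two-reservoir topology, outflows, and target ranges below are illustrative only.

```python
import math

def evaporation(level):
    # e_t^r = (1/2) * sin((1/2) * l_t^r) * 0.1
    return 0.5 * math.sin(0.5 * level) * 0.1

def next_level(level, inflow, outflow):
    # l_{t+1}^r = l_t^r + upstream inflow - own outflow - evaporation
    return level + inflow - outflow - evaporation(level)

def reward(levels, m, n):
    # deviation from the range midpoint plus penalties outside [m^r, n^r]
    return -sum(0.1 * abs((m[r] + n[r]) / 2.0 - l)
                + 100.0 * max(m[r] - l, 0.0)
                + 5.0 * max(l - n[r], 0.0)
                for r, l in enumerate(levels))

levels = [40.0, 55.0]          # two reservoirs, second downstream of first
outflows = [5.0, 8.0]
new = [next_level(levels[0], 0.0, outflows[0]),
       next_level(levels[1], outflows[0], outflows[1])]
print(reward(new, m=[30.0, 30.0], n=[70.0, 70.0]))
```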

Heating, Ventilation and Air Conditioning (HVAC) [agarwal2010] has a state variable p^r denoting the temperature of each room r and an action variable b^r for sending heated air to each room (up to a maximum allowable volume) via vent actuation. The bilinear transition function is then

$$p_{t+1}^r = p_t^r + \frac{\Delta t}{C^r} \Big( b_t^r + \sum_{r'} \frac{p_t^{r'} - p_t^r}{R^{rr'}} \Big),$$

where is the heat capacity of rooms, represents an adjacency predicate with respect to room and represents a thermal conductance between rooms. The reward function minimizes the total absolute deviation from a desired temperature for all rooms plus a linear penalty for having temperatures outside of a range plus a linear penalty for heating air with cost , and is defined for each time step by the formula

 Q(p^{t+1}, b^t) = −∑_r (10.0·|((m_r + n_r)/2.0) − p^{t+1}_r| + k·b^t_r + 0.1·(max(p^{t+1}_r − n_r, 0) + max(m_r − p^{t+1}_r, 0))),

We report results on small instances with 3 and 6 rooms, and on a large instance with 60 rooms, over multiple planning horizons.
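A minimal sketch of this bilinear update for a hypothetical fully connected 3-room layout; the heat capacities, conductances, and adjacency matrix below are illustrative assumptions, not the paper's instances.

```python
import numpy as np

C = np.array([80.0, 80.0, 80.0])  # heat capacity per room
R = np.full((3, 3), 1.5)          # thermal conductance between room pairs
A = np.array([[0, 1, 1],          # adjacency predicate between rooms
              [1, 0, 1],
              [1, 1, 0]])
dt = 1.0

def step(p, b):
    """p: room temperatures, b: heated-air action per room."""
    # Entry [r, r'] of the matrix is p[r'] - p[r]: heat flows toward cooler rooms.
    exchange = np.sum(A * (p[None, :] - p[:, None]) / R, axis=1)
    return p + (dt / C) * (b + exchange)
```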

Navigation is designed to test the learning of a highly nonlinear transition function. It has a single state variable p for the 2D location of an agent and a 2D action Δp intended nominally to move the agent (within minimum and maximum movement boundaries). The new location is a nonlinear function of the current location (within minimum and maximum maze boundaries) due to higher slippage in the center of the domain, where the transition function is

 p^{t+1} = p^t + Δp^t·(2.0/(1.0 + exp(−2·d^t)) − 0.99),

where d^t is the Euclidean distance from p^t to the center of the domain. The reward function minimizes the total Manhattan distance from the goal location, and is defined for each time step by the formula

 Q(p^{t+1}, Δp^t) = −∑_d |g_d − p^{t+1}_d|,

where g_d defines the goal location for dimension d. We report results on small instances with maze sizes 8-by-8 (i.e., maze boundaries [-4, 4]) and 10-by-10 (i.e., maze boundaries [-5, 5]) over multiple planning horizons, and on a large instance with minimum and maximum movement boundaries of [-0.5, 0.5] over planning horizon 20.
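The slippage mechanism above can be sketched directly; the maze boundaries below follow the 8-by-8 instance, and the clamping of the proposed location to the maze boundaries follows the RDDL description in Appendix B.

```python
import numpy as np

lo, hi = -4.0, 4.0      # maze boundaries of the 8-by-8 instance
center = np.zeros(2)    # slippage is worst at the center

def step(p, dp):
    d = np.linalg.norm(p - center)                 # Euclidean distance d^t to the center
    scale = 2.0 / (1.0 + np.exp(-2.0 * d)) - 0.99  # near 0.01 at the center, ~1 far away
    return np.clip(p + dp * scale, lo, hi)         # clamp to maze boundaries
```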

### 6.2 Transition Learning Performance

In Table 1, we show the mean squared error (MSE) of training different configurations of neural networks for 200 epochs on each planning domain. We train these neural networks on pre-sampled data generated by a simple stochastic exploration policy. We sample data points for all domains and treat them as independent and identically distributed; namely, we randomly shuffle the data points for each epoch of training. The sampled data was split into training and test sets with a 4-to-1 ratio, and we report results on the held-out test sets.

Since densely-connected networks [huang2017densely] strictly dominated non-densely-connected networks, we only report results for the densely-connected networks. Overall, we see that Reservoir and HVAC can be accurately learned with one hidden layer (i.e., an additional layer did not help), while Navigation benefits from having two hidden layers owing to the complexity of its transition function. The network with the lowest MSE is used as the deep neural network model for each domain in the subsequent planning experiments.

Figure 3 visualizes the training performance of different neural network configurations over three domain instances. Figures 3 (a)-(c) show the loss curves over training epochs. In Reservoir, we observe that while both the 1 and 2 hidden layer networks reach similar MSE values, the former has much smaller variance. In the HVAC and Navigation instances, the 1 and 2 hidden layer networks achieve the smallest MSE values, respectively. Figures 3 (d)-(f) visualize the learned transition functions for different numbers of hidden layers. We observe that Reservoir needs at least one hidden layer to overlap with the ground truth, whereas HVAC is learned well by all networks (all dashed lines overlap), and Navigation only shows complete overlap (especially near the central nonlinearity) for a two-hidden-layer network. All of these results mirror the MSE comparisons in Table 1, and provide empirical and intuitive evidence for systematically selecting the minimal neural network structure for each domain.
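For intuition, a one-hidden-layer densely-connected ReLU regressor of the kind evaluated here can be sketched in a few lines. This is a simplified stand-in (plain NumPy, hypothetical layer sizes, vanilla gradient descent on the unweighted MSE) rather than the paper's actual training setup; the dense connectivity appears as the output layer seeing both the hidden features and the raw input.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: e.g. 3 states + 3 actions in, 3 next-state values out.
n_in, n_hid, n_out = 6, 32, 3
W1 = rng.normal(0, 0.1, (n_in, n_hid)); b1 = np.zeros(n_hid)
# Dense connectivity: the output layer sees the hidden features AND the raw input.
W2 = rng.normal(0, 0.1, (n_in + n_hid, n_out)); b2 = np.zeros(n_out)

def forward(X):
    H = np.maximum(X @ W1 + b1, 0.0)    # ReLU hidden layer
    Z = np.concatenate([X, H], axis=1)  # skip connection (DenseNet-style)
    return Z @ W2 + b2, H, Z

def mse_step(X, Y, lr=1e-2):
    """One gradient-descent step on the MSE loss; returns the pre-update loss."""
    global W1, b1, W2, b2
    pred, H, Z = forward(X)
    G = 2.0 * (pred - Y) / len(X)       # gradient of the MSE w.r.t. predictions
    gW2 = Z.T @ G; gb2 = G.sum(0)
    GH = (G @ W2.T)[:, n_in:] * (H > 0) # backprop through the concat and the ReLU
    gW1 = X.T @ GH; gb1 = GH.sum(0)
    W2 -= lr * gW2; b2 -= lr * gb2; W1 -= lr * gW1; b1 -= lr * gb1
    return np.mean((pred - Y) ** 2)
```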

### 6.3 Planning Performance

In this section, we investigate the effectiveness of HD-MILP-Plan and TF-Plan at solving the original planning problem by optimizing the learned planning problem in an online planning setting. We optimized the MILP encodings using IBM ILOG CPLEX 12.7.1 with eight threads and a 1-hour total time limit per problem instance on a MacBook Pro with a 2.8 GHz Intel Core i7 and 16 GB of memory. We optimized TF-Plan through Tensorflow 1.9 with an Nvidia GTX 1080 GPU with CUDA 9.0 on a Linux system with 16 GB of memory (because the CPLEX solver and Tensorflow leverage completely different hardware components, the reported run times are only intended to show real elapsed time). We connected both planners to the RDDLsim [Sanner:RDDL] domain simulator and interactively solved multiple problem instances with different sizes and horizon lengths. In order to gauge the solution quality of TF-Plan, we also report HD-MILP-Plan with a 20% duality gap. The results reported for TF-Plan, unless otherwise stated, are based on a fixed number of epochs per domain: TF-Plan used 1000 epochs for Reservoir and HVAC, and 300 epochs for Navigation.

#### 6.3.1 Comparison of Planning Quality

In Figures 4 (a)-(c), we compare the planning quality of the domain-specific policies (blue), the base MILP model (gray), the MILP model with preprocessing and strengthening constraints solved optimally (orange), the MILP model with preprocessing and strengthening constraints solved up to a 20% duality gap (green), and TF-Plan (red).

In Figure 4 (a), we compare HD-MILP-Plan and TF-Plan to a rule-based local Reservoir planner, which measures the water level in each reservoir and sets outflows to release water above a pre-specified median level of reservoir capacity. In this domain, we observe an average increase of 15% in the total reward obtained by the plans generated by HD-MILP-Plan compared to the rule-based local Reservoir planner, and TF-Plan outperforms the rule-based planner by a similar margin. However, TF-Plan outperforms HD-MILP-Plan on the Reservoir 4 domain. We investigate this outcome further in Figure 5 (a): the plan returned by HD-MILP-Plan incurs more penalty due to noise in the learned transition model, as the plan attempts to distribute water to multiple reservoirs to obtain higher reward. As a result, the actions returned by HD-MILP-Plan break the safety threshold and receive an additional penalty, so HD-MILP-Plan incurs more cost than TF-Plan.

In Figure 4 (b), we compare HD-MILP-Plan and TF-Plan to a rule-based local HVAC policy, which turns on the heated air anytime the room temperature is below the median value of a given range of comfortable temperatures [20, 25] and turns it off otherwise. While the rewards (i.e., electricity costs) of the proposed models on HVAC with 3 rooms are almost identical to those of the locally optimal HVAC policy, we observe a significant performance improvement in the HVAC 6-room setting, which suggests the advantage of the proposed models on complex planning problems where the manual policy fails to track the temperature interactions among rooms. Figure 5 (b) further demonstrates the advantage of our planners: the room temperatures controlled by the proposed models are identical to those of the locally optimal policy with 15% less power usage.

Figure 4 (c) compares HD-MILP-Plan and TF-Plan to a greedy search policy, which uses a Manhattan distance-to-goal function to guide the agent towards the goal (as visualized in Figure 5 (c)). The pairwise comparison of the total rewards obtained per problem instance shows that the proposed models can outperform the manual policy by up to 15%, as observed for the problem instance Navigation,10,8 in Figure 4 (c). Inspection of the actual plans, as visualized in Figure 5 (c), shows that the local policy ignores the nonlinear region in the middle and tries to reach the goal directly, which causes the plan to miss the goal position within the given step budget. In contrast, both HD-MILP-Plan and TF-Plan find plans that move around the nonlinearity and successfully reach the goal state, which shows their ability to model the nonlinearity and find plans that are near-optimal with respect to the learned model over the complete horizon.

Overall, we observe that in 10 out of 12 problem instances, the solution quality of the plans generated by HD-MILP-Plan and TF-Plan is significantly better than the total reward obtained by the plans generated by the respective domain-specific human-designed policies. Further, we find that the quality of the plans generated by TF-Plan lies between that of the plans generated by i) HD-MILP-Plan solved to optimality, and ii) HD-MILP-Plan solved to a 20% duality gap.

#### 6.3.2 Comparison of Run Time Performance

In Figures 4 (d)-(f), we compare the run time performance of the base MILP model (gray), the MILP model with preprocessing and strengthening constraints solved optimally (orange), the MILP model with preprocessing and strengthening constraints solved up to a 20% duality gap (green), and TF-Plan. Figure 4 (f) shows a significant run time improvement for the strengthened encoding over the base MILP encoding, while Figures 4 (d)-(e) show otherwise. Together with the results presented in Figures 4 (a)-(c), we find that domains whose neural networks use only 1 hidden layer (e.g., HVAC and Reservoir) do not benefit from the additional fixed computational expense of preprocessing. In contrast, domains that require deeper neural networks (e.g., Navigation) benefit from the additional computational expense of preprocessing and strengthening. Over the three domains, we find that TF-Plan significantly outperforms HD-MILP-Plan on all Navigation instances, performs slightly worse on all Reservoir instances, and performs comparably on the HVAC instances.

#### 6.3.3 Effect of Training Epochs for TF-Plan on Planning Quality

To test the effect of the number of optimization epochs on solution quality, we present results on the 10-by-10 Navigation domain with a horizon of 10 for different numbers of epochs. Figure 6 visualizes the increase in solution quality as the number of epochs increases: Figure 6 (a) presents a low-quality plan, similar to that of the manual policy, found with 20 epochs; Figure 6 (b) presents a medium-quality plan found with 80 epochs; and Figure 6 (c) presents a high-quality plan, similar to that of HD-MILP-Plan, found with 320 epochs.
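The epoch count matters because TF-Plan improves the plan by iterated gradient steps. The following sketch illustrates the core idea on a deliberately trivial, known 1D transition (TF-Plan instead backpropagates through the learned deep network on a GPU): unroll the dynamics over the horizon and descend the gradient of the loss with respect to the action sequence. All constants are illustrative.

```python
import numpy as np

H, goal, lam, lr = 10, 5.0, 0.01, 0.02
a = np.zeros(H)                      # the plan: one action per time step

def rollout(a, s0=0.0):
    s = s0
    for t in range(H):
        s = s + a[t]                 # stand-in transition s_{t+1} = s_t + a_t
    return s

for epoch in range(500):             # each epoch is one gradient step on the plan
    sH = rollout(a)
    # Analytic gradient of the loss (s_H - goal)^2 + lam*||a||^2 w.r.t. each a_t;
    # an autodiff tool such as Tensorflow computes this automatically.
    grad = 2.0 * (sH - goal) + 2.0 * lam * a
    a -= lr * grad
```

With few epochs the actions are still near their zero initialization (a low-quality plan); with enough epochs the final state converges to the goal, mirroring the progression in Figure 6.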

#### 6.3.4 Scalability Analysis on Large Problem Instances

To test the scalability of the proposed planning models, we create three additional domain instances that simulate more realistic planning problems. For the Reservoir domain, we create a system with 10 reservoirs with complex reservoir formations, where a reservoir may receive water from more than one upstream reservoir. For the HVAC domain, we simulate a building with 6 floors and 60 rooms with a complex adjacency setting (including inter-level adjacency modeling); moreover, in order to capture the complex mutual temperature impact of the rooms, we train the transition function with one hidden layer of 256 neurons. For the Navigation domain, we reduce the feasible action range from [-1, 1] to [-0.5, 0.5] and increase the planning horizon to 20 time steps.

In Figures 7 (a)-(c), we compare the total rewards obtained by the domain-specific rule-based policy (blue), HD-MILP-Plan (orange), and TF-Plan (red) on the larger problem instances. The analysis of Figures 7 (a)-(c) shows that TF-Plan scales better than HD-MILP-Plan, consistently outperforming the policy, whereas HD-MILP-Plan outperforms the other two planners in two out of three domains (i.e., Reservoir and HVAC) while suffering from scalability issues in the remaining one (i.e., Navigation). In particular, we find that in the Navigation domain, HD-MILP-Plan sometimes does not find feasible plans with respect to the learned model and therefore returns default no-op action values.

In Figure 8, we compare the run time performance of all three planners over all problem instances, where we measure problem size as a function of the horizon, the number of parameters in the learned model, and the number of neural network layers. We observe that as the problem size grows, HD-MILP-Plan takes more computational effort due to its additional requirement of proving optimality, which can be remedied by allowing a bounded optimality guarantee (e.g., a 20% duality gap) on the learned model. We also observe that as the problem sizes grow, the preprocessing of bounds and the strengthening constraints pay off. Finally, we show that TF-Plan scales gracefully as the problem size increases. Together with Figures 7 (a)-(c), we conclude that TF-Plan provides an efficient alternative to HD-MILP-Plan for large-scale planning problems.

## 7 Conclusion

In this paper, we have tackled the question of how to plan with expressive and accurate deep network learned transition models that are not amenable to existing solution techniques. We started by improving the accuracy of the learned transition function using densely-connected networks with a weighted mean squared error loss. We leveraged the insight that ReLU-based deep networks offer strong learning performance and permit a direct compilation of the neural network transition model to a Mixed-Integer Linear Program (MILP) encoding, in a planner we called Hybrid Deep MILP Planner (HD-MILP-Plan). To enhance planning efficiency, we strengthened the linear relaxation of the base MILP encoding. Given the computational bottleneck of MILP-based optimization, we proposed an alternative Tensorflow Planner (TF-Plan) that performs planning through Recurrent Neural Networks, where plans are directly optimized via backpropagation.

We evaluated the run time performance and solution quality of the plans generated by both proposed planners over multiple problem instances from three planning domains. We showed that HD-MILP-Plan can find optimal plans with respect to the learned models, and that TF-Plan can approximate those optimal plans at little computational cost. We showed that the plans generated by both HD-MILP-Plan and TF-Plan yield better solution quality than strong domain-specific human-designed policies. We also showed that our strengthening constraints improved the solution quality and run time performance of HD-MILP-Plan as problem instances grew larger. Finally, we showed that TF-Plan can handle large-scale planning problems at very little computational cost.

In conclusion, both HD-MILP-Plan and TF-Plan represent a new class of data-driven planning methods that can accurately learn complex state transitions of high-dimensional nonlinear planning domains, and provide high-quality plans with respect to the learned models.

## Appendix A. Derivations

In this section, we provide extended derivations for interested readers.

### Pushing Normalization into the Network Parameters

Assume we have the mean μ and standard deviation σ of the data for each input dimension. In general, we normalize a data point x through the formula

 x̄ = (x − μ)·σ^{−1},

and compute the first linear transformation after the input through the formula

 z = x̄^T·w + b,

and then transfer the normalization of the input into the learned weights w and bias b:

 z = x̄^T·w + b = ((x − μ)·σ^{−1})^T·w + b = (x − μ)^T·(w·σ^{−1}) + b = x^T·(w·σ^{−1}) + (b − μ^T·(w·σ^{−1})), (30)

where w′ = w·σ^{−1} is the rescaled weight vector and b′ = b − μ^T·(w·σ^{−1}) is the shifted bias.

Since the output neuron is unchanged, this operation only affects the weights and biases of the linear transformation that is directly connected to the input layer.
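The identity in Equation (30) is easy to verify numerically; the following sketch (with randomly generated, illustrative values) checks that folding the normalization into the first layer's weights and bias leaves the pre-activation unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

d, h = 4, 8  # illustrative input and hidden sizes
x = rng.normal(size=d)
mu, sigma = rng.normal(size=d), rng.uniform(0.5, 2.0, size=d)
w, b = rng.normal(size=(d, h)), rng.normal(size=h)

# Original pipeline: normalize the input, then apply the linear layer.
z_norm = ((x - mu) / sigma) @ w + b

# Folded pipeline: rescale weights and shift the bias once, feed raw inputs.
w_prime = w / sigma[:, None]    # w' = w * sigma^{-1}, applied row-wise
b_prime = b - (mu / sigma) @ w  # b' = b - mu^T (w * sigma^{-1})
z_fold = x @ w_prime + b_prime
```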

## Appendix B. RDDL Domain Description

In this section, we list the RDDL domain and instance files that we experimented with in this paper.

### Reservoir

#### Domain File

domain Reservoir_Problem{

    requirements = {
        reward-deterministic
    };

    types {
        id: object;
    };

    pvariables {

        // Constants
        MAXCAP(id): { non-fluent, real, default = 100.0 };
        HIGH_BOUND(id): { non-fluent, real, default = 80.0 };
        LOW_BOUND(id): { non-fluent, real, default = 20.0 };
        RAIN(id): { non-fluent, real, default = 5.0 };
        DOWNSTREAM(id,id): { non-fluent, bool, default = false };
        DOWNTOSEA(id): { non-fluent, bool, default = false };
        BIGGESTMAXCAP: { non-fluent, real, default = 1000 };

        //Interm
        vaporated(id): { interm-fluent, real };

        //State
        rlevel(id): { state-fluent, real, default = 50.0 };

        //Action
        flow(id): { action-fluent, real, default = 0.0 };
    };

    cpfs {
        vaporated(?r) = (1.0/2.0)*sin[rlevel(?r)/BIGGESTMAXCAP]*rlevel(?r);
        rlevel'(?r) = rlevel(?r) + RAIN(?r) - vaporated(?r) - flow(?r)
                      + sum_{?r2: id}[DOWNSTREAM(?r2,?r)*flow(?r2)];
    };

    reward = sum_{?r: id} [if (rlevel'(?r)>=LOW_BOUND(?r) ^ (rlevel'(?r)<=HIGH_BOUND(?r)))
                           then 0
                           else if (rlevel'(?r)<=LOW_BOUND(?r))
                                   then (-5)*(LOW_BOUND(?r)-rlevel'(?r))
                                   else (-100)*(rlevel'(?r)-HIGH_BOUND(?r))]
             + sum_{?r2: id}[abs[((HIGH_BOUND(?r2)+LOW_BOUND(?r2))/2.0)-rlevel'(?r2)]*(-0.1)];

    state-action-constraints {
        forall_{?r:id} flow(?r)<=rlevel(?r);
        forall_{?r:id} rlevel(?r)<=MAXCAP(?r);
        forall_{?r:id} flow(?r)>=0;
    };
}


#### Instance Files

Reservoir 3

non-fluents Reservoir_non {
domain = Reservoir_Problem;
objects{
id: {t1,t2,t3};
};
non-fluents {
RAIN(t1) = 5.0;RAIN(t2) = 10.0;RAIN(t3) = 20.0;
MAXCAP(t2) = 200.0;LOW_BOUND(t2) = 30.0;HIGH_BOUND(t2) = 180.0;
MAXCAP(t3) = 400.0;LOW_BOUND(t3) = 40.0;HIGH_BOUND(t3) = 380.0;
DOWNSTREAM(t1,t2);DOWNSTREAM(t2,t3);DOWNTOSEA(t3);
};
}

instance is1{
domain = Reservoir_Problem;
non-fluents = Reservoir_non;
init-state{
rlevel(t1) = 75.0;
};
max-nondef-actions = 3;
horizon = 10;
discount = 1.0;
}


Reservoir 4

non-fluents Reservoir_non {
domain = Reservoir_Problem;
objects{
id: {t1,t2,t3,t4};
};
non-fluents {
RAIN(t1) = 5.0;RAIN(t2) = 10.0;RAIN(t3) = 20.0;RAIN(t4) = 30.0;
MAXCAP(t2) = 200.0;LOW_BOUND(t2) = 30.0;HIGH_BOUND(t2) = 180.0;
MAXCAP(t3) = 400.0;LOW_BOUND(t3) = 40.0;HIGH_BOUND(t3) = 380.0;
MAXCAP(t4) = 500.0;LOW_BOUND(t4) = 60.0;HIGH_BOUND(t4) = 480.0;
DOWNSTREAM(t1,t2);DOWNSTREAM(t2,t3);DOWNSTREAM(t3,t4);DOWNTOSEA(t4);
};
}

instance is1{
domain = Reservoir_Problem;
non-fluents = Reservoir_non;
init-state{
rlevel(t1) = 75.0;
};
max-nondef-actions = 4;
horizon = 10;
discount = 1.0;
}


Reservoir 10

non-fluents Reservoir_non {
    domain = Reservoir_Problem;
    objects{
        id: {t1,t2,t3,t4,t5,t6,t7,t8,t9,t10};
    };
    non-fluents {
        RAIN(t1) = 15.0;RAIN(t2) = 10.0;RAIN(t3) = 20.0;RAIN(t4) = 30.0;RAIN(t5) = 20.0;
        RAIN(t6) = 10.0;RAIN(t7) = 35.0;RAIN(t8) = 15.0;RAIN(t9) = 25.0;RAIN(t10) = 20.0;
        MAXCAP(t2) = 200.0;LOW_BOUND(t2) = 30.0;HIGH_BOUND(t2) = 180.0;
        MAXCAP(t3) = 400.0;LOW_BOUND(t3) = 40.0;HIGH_BOUND(t3) = 380.0;
        MAXCAP(t4) = 500.0;LOW_BOUND(t4) = 60.0;HIGH_BOUND(t4) = 480.0;
        MAXCAP(t5) = 750.0;LOW_BOUND(t5) = 20.0;HIGH_BOUND(t5) = 630.0;
        MAXCAP(t6) = 300.0;LOW_BOUND(t6) = 30.0;HIGH_BOUND(t6) = 250.0;
        MAXCAP(t7) = 300.0;LOW_BOUND(t7) = 10.0;HIGH_BOUND(t7) = 180.0;
        MAXCAP(t8) = 300.0;LOW_BOUND(t8) = 40.0;HIGH_BOUND(t8) = 240.0;
        MAXCAP(t9) = 400.0;LOW_BOUND(t9) = 40.0;HIGH_BOUND(t9) = 340.0;
        MAXCAP(t10) = 800.0;LOW_BOUND(t10) = 20.0;HIGH_BOUND(t10) = 650.0;
        DOWNSTREAM(t1,t2);DOWNSTREAM(t2,t3);DOWNSTREAM(t3,t4);DOWNSTREAM(t4,t5);
        DOWNSTREAM(t6,t7);DOWNSTREAM(t7,t8);DOWNSTREAM(t8,t5);
        DOWNSTREAM(t5,t6);DOWNSTREAM(t6,t10);
        DOWNSTREAM(t5,t9);DOWNSTREAM(t9,t10);
        DOWNTOSEA(t10);
    };
}

instance is1{
    domain = Reservoir_Problem;
    non-fluents = Reservoir_non;
    init-state{
        rlevel(t1) = 175.0;
    };
    max-nondef-actions = 10;
    horizon = 10;
    discount = 1.0;
}


### HVAC

#### Domain File

domain hvac_vav_fix{
    types {
        space : object;
    };

    pvariables {
        //Constants
        ADJ(space, space)    : { non-fluent, bool, default = false };
        ADJ_OUTSIDE(space)   : { non-fluent, bool, default = false };
        ADJ_HALL(space)      : { non-fluent, bool, default = false };
        R_OUTSIDE(space)     : { non-fluent, real, default = 4 };
        R_HALL(space)        : { non-fluent, real, default = 2 };
        R_WALL(space, space) : { non-fluent, real, default = 1.5 };
        IS_ROOM(space)       : { non-fluent, bool, default = false };
        CAP(space)           : { non-fluent, real, default = 80 };
        CAP_AIR              : { non-fluent, real, default = 1.006 };
        COST_AIR             : { non-fluent, real, default = 1 };
        TIME_DELTA           : { non-fluent, real, default = 1 };
        TEMP_AIR             : { non-fluent, real, default = 40 };
        TEMP_UP(space)       : { non-fluent, real, default = 23.5 };
        TEMP_LOW(space)      : { non-fluent, real, default = 20.0 };
        TEMP_OUTSIDE(space)  : { non-fluent, real, default = 6.0 };
        TEMP_HALL(space)     : { non-fluent, real, default = 10.0 };
        PENALTY              : { non-fluent, real, default = 20000 };
        AIR_MAX(space)       : { non-fluent, real, default = 10.0 };

        //State
        TEMP(space)          : { state-fluent, real, default = 10.0 };

        //Action
        AIR(space)           : { action-fluent, real, default = 0.0 };
    };

    cpfs {
        //State
        TEMP'(?s) = TEMP(?s) + TIME_DELTA/CAP(?s) *
            (AIR(?s) * CAP_AIR * (TEMP_AIR - TEMP(?s)) * IS_ROOM(?s)
            + sum_{?p : space} ((ADJ(?s, ?p) | ADJ(?p, ?s)) * (TEMP(?p) - TEMP(?s)) / R_WALL(?s, ?p))
            + ADJ_OUTSIDE(?s)*(TEMP_OUTSIDE(?s) - TEMP(?s))/R_OUTSIDE(?s));
    };

    reward = - sum_{?s : space} [IS_ROOM(?s)*(AIR(?s) * COST_AIR
             + ((TEMP(?s) < TEMP_LOW(?s)) | (TEMP(?s) > TEMP_UP(?s))) * PENALTY)
             + 10.0*abs[((TEMP_UP(?s) + TEMP_LOW(?s))/2.0) - TEMP(?s)]];

    action-preconditions{
        forall_{?s : space} [ AIR(?s) >= 0 ];
        forall_{?s : space} [ AIR(?s) <= AIR_MAX(?s) ];
    };
}


#### Instance Files

HVAC 3 Rooms

non-fluents nf_hvac_vav_fix{
domain = hvac_vav_fix;

objects{
space : { r1, r2, r3};
};

non-fluents {
//Define rooms
IS_ROOM(r1) = true;IS_ROOM(r2) = true;IS_ROOM(r3) = true;

ADJ(r1, r2) = true;ADJ(r1, r3) = true;ADJ(r2, r3) = true;

};
}

instance inst_hvac_vav_fix{
domain = hvac_vav_fix;
non-fluents = nf_hvac_vav_fix;

horizon = 20;
discount = 1.0;
}


HVAC 6 Rooms

non-fluents nf_hvac_vav_fix{
domain = hvac_vav_fix;

objects{
space : { r1, r2, r3, r4, r5, r6 };
};

non-fluents {
//Define rooms
IS_ROOM(r1) = true;IS_ROOM(r2) = true;IS_ROOM(r3) = true;
IS_ROOM(r4) = true;IS_ROOM(r5) = true;IS_ROOM(r6) = true;

ADJ(r1, r2) = true;ADJ(r1, r4) = true;ADJ(r2, r3) = true;
ADJ(r2, r5) = true;ADJ(r3, r6) = true;ADJ(r4, r5) = true;
ADJ(r5, r6) = true;

};
}

instance inst_hvac_vav_fix{
domain = hvac_vav_fix;
non-fluents = nf_hvac_vav_fix;

horizon = 20;
discount = 1.0;
}


HVAC 60 Rooms

non-fluents nf_hvac_vav_fix{
domain = hvac_vav_fix;

objects{
space : { r101, r102, r103, r104, r105, r106, r107, r108, r109, r110, r111, r112,
r201, r202, r203, r204, r205, r206, r207, r208, r209, r210, r211, r212,
r301, r302, r303, r304, r305, r306, r307, r308, r309, r310, r311, r312,
r401, r402, r403, r404, r405, r406, r407, r408, r409, r410, r411, r412,
r501, r502, r503, r504, r505, r506, r507, r508, r509, r510, r511, r512
}; //Sixty rooms across five floors
};

non-fluents {
//Define rooms
//Level1
IS_ROOM(r101) = true;IS_ROOM(r102) = true;IS_ROOM(r103) = true;IS_ROOM(r104) = true;
IS_ROOM(r105) = true;IS_ROOM(r106) = true;IS_ROOM(r107) = true;IS_ROOM(r108) = true;
IS_ROOM(r109) = true;IS_ROOM(r110) = true;IS_ROOM(r111) = true;IS_ROOM(r112) = true;
//Level2
IS_ROOM(r201) = true;IS_ROOM(r202) = true;IS_ROOM(r203) = true;IS_ROOM(r204) = true;
IS_ROOM(r205) = true;IS_ROOM(r206) = true;IS_ROOM(r207) = true;IS_ROOM(r208) = true;
IS_ROOM(r209) = true;IS_ROOM(r210) = true;IS_ROOM(r211) = true;IS_ROOM(r212) = true;
//Level3
IS_ROOM(r301) = true;IS_ROOM(r302) = true;IS_ROOM(r303) = true;IS_ROOM(r304) = true;
IS_ROOM(r305) = true;IS_ROOM(r306) = true;IS_ROOM(r307) = true;IS_ROOM(r308) = true;
IS_ROOM(r309) = true;IS_ROOM(r310) = true;IS_ROOM(r311) = true;IS_ROOM(r312) = true;
//Level4
IS_ROOM(r401) = true;IS_ROOM(r402) = true;IS_ROOM(r403) = true;IS_ROOM(r404) = true;
IS_ROOM(r405) = true;IS_ROOM(r406) = true;IS_ROOM(r407) = true;IS_ROOM(r408) = true;
IS_ROOM(r409) = true;IS_ROOM(r410) = true;IS_ROOM(r411) = true;IS_ROOM(r412) = true;
//Level5
IS_ROOM(r501) = true;IS_ROOM(r502) = true;IS_ROOM(r503) = true;IS_ROOM(r504) = true;
IS_ROOM(r505) = true;IS_ROOM(r506) = true;IS_ROOM(r507) = true;IS_ROOM(r508) = true;
IS_ROOM(r509) = true;IS_ROOM(r510) = true;IS_ROOM(r511) = true;IS_ROOM(r512) = true;

//Level1
ADJ(r101, r102) = true;ADJ(r102, r103) = true;ADJ(r103, r104) = true;
ADJ(r104, r105) = true;ADJ(r106, r107) = true;ADJ(r107, r108) = true;
ADJ(r107, r109) = true;ADJ(r108, r109) = true;ADJ(r110, r111) = true;
ADJ(r111, r112) = true;
//Level2
ADJ(r201, r202) = true;ADJ(r202, r203) = true;ADJ(r203, r204) = true;
ADJ(r204, r205) = true;ADJ(r206, r207) = true;ADJ(r207, r208) = true;
ADJ(r207, r209) = true;ADJ(r208, r209) = true;ADJ(r210, r211) = true;
ADJ(r211, r212) = true;
//Level3
ADJ(r301, r302) = true;ADJ(r302, r303) = true;ADJ(r303, r304) = true;
ADJ(r304, r305) = true;ADJ(r306, r307) = true;ADJ(r307, r308) = true;
ADJ(r307, r309) = true;ADJ(r308, r309) = true;ADJ(r310, r311) = true;
ADJ(r311, r312) = true;
//Level4
ADJ(r401, r402) = true;ADJ(r402, r403) = true;ADJ(r403, r404) = true;
ADJ(r404, r405) = true;ADJ(r406, r407) = true;ADJ(r407, r408) = true;
ADJ(r407, r409) = true;ADJ(r408, r409) = true;ADJ(r410, r411) = true;
ADJ(r411, r412) = true;
//Level5
ADJ(r501, r502) = true;ADJ(r502, r503) = true;ADJ(r503, r504) = true;
ADJ(r504, r505) = true;ADJ(r506, r507) = true;ADJ(r507, r508) = true;
ADJ(r507, r509) = true;ADJ(r508, r509) = true;ADJ(r510, r511) = true;
ADJ(r511, r512) = true;
//InterLevel 1-2
ADJ(r101, r201) = true;ADJ(r102, r202) = true;ADJ(r103, r203) = true;
ADJ(r104, r204) = true;ADJ(r105, r205) = true;ADJ(r106, r206) = true;
ADJ(r107, r207) = true;ADJ(r108, r208) = true;ADJ(r109, r209) = true;
ADJ(r110, r210) = true;ADJ(r111, r211) = true;ADJ(r112, r212) = true;
//InterLevel 2-3
ADJ(r201, r301) = true;ADJ(r202, r302) = true;ADJ(r203, r303) = true;
ADJ(r204, r304) = true;ADJ(r205, r305) = true;ADJ(r206, r306) = true;
ADJ(r207, r307) = true;ADJ(r208, r308) = true;ADJ(r209, r309) = true;
ADJ(r210, r310) = true;ADJ(r211, r311) = true;ADJ(r212, r312) = true;
//InterLevel 3-4
ADJ(r301, r401) = true;ADJ(r302, r402) = true;ADJ(r303, r403) = true;
ADJ(r304, r404) = true;ADJ(r305, r405) = true;ADJ(r306, r406) = true;
ADJ(r307, r407) = true;ADJ(r308, r408) = true;ADJ(r309, r409) = true;
ADJ(r310, r410) = true;ADJ(r311, r411) = true;ADJ(r312, r412) = true;
//InterLevel 4-5
ADJ(r401, r501) = true;ADJ(r402, r502) = true;ADJ(r403, r503) = true;
ADJ(r404, r504) = true;ADJ(r405, r505) = true;ADJ(r406, r506) = true;
ADJ(r407, r507) = true;ADJ(r408, r508) = true;ADJ(r409, r509) = true;
ADJ(r410, r510) = true;ADJ(r411, r511) = true;ADJ(r412, r512) = true;

//Outside
//Level1
//Level2
//Level3
//Level4
//Level5

//Hallway
//Level1
//Level2
//Level3
//Level4
//Level5
};
}

instance inst_hvac_vav_fix{
domain = hvac_vav_fix;
non-fluents = nf_hvac_vav_fix;
//init-state{
//};
max-nondef-actions = 60;
horizon = 12;
discount = 1.0;
}


### Navigation

#### Domain File

domain Navigation_Problem{

requirements = {
reward-deterministic
};

types {
dim: object;
};

pvariables {

// Constant
MINMAZEBOUND(dim): { non-fluent, real, default = -4.0 }; //-5.0 for 10x10 instance
MAXMAZEBOUND(dim): { non-fluent, real, default = 4.0 }; //5.0 for 10x10 instance
MINACTIONBOUND(dim): { non-fluent, real, default = -1.0 }; //-0.5 for large scale instance
MAXACTIONBOUND(dim): { non-fluent, real, default = 1.0 }; //0.5 for large scale instance
GOAL(dim): { non-fluent, real, default = 3.0 };
PENALTY: {non-fluent, real, default = 1000000.0 };
CENTER(dim): {non-fluent, real, default = 0.0};

// Interm
distance: {interm-fluent,real,level=1 };
scalefactor: {interm-fluent,real,level=2 };
proposedLoc(dim):{interm-fluent, real, level=3};

//State
location(dim): {state-fluent, real, default = -4.0 }; //-5.0 for 10x10 instance

//Action
move(dim): { action-fluent, real, default = 0.0 };
};

cpfs {

distance = sqrt[sum_{?l:dim}[pow[(location(?l)-CENTER(?l)),2]]];
scalefactor = 2.0/(1.0+exp[-2*distance])-0.99;
proposedLoc(?l) = location(?l) + move(?l)*scalefactor;
location'(?l)= if(proposedLoc(?l)<=MAXMAZEBOUND(?l) ^ proposedLoc(?l)>=MINMAZEBOUND(?l))
then proposedLoc(?l)
else (if(proposedLoc(?l)>MAXMAZEBOUND(?l))
then MAXMAZEBOUND(?l) else MINMAZEBOUND(?l));

};

reward = - sum_{?l: dim}[abs[GOAL(?l) - location(?l)]];

state-action-constraints {
forall_{?l:dim} move(?l)<=MAXACTIONBOUND(?l);
forall_{?l:dim} move(?l)>=MINACTIONBOUND(?l);
forall_{?l:dim} location(?l)<=MAXMAZEBOUND(?l);
forall_{?l:dim} location(?l)>=MINMAZEBOUND(?l);
};
}


#### Instance Files

Navigation 8 by 8 instance

non-fluents Navigation_non {
domain = Navigation_Problem;
objects{
dim: {x,y};
};
non-fluents {
MINMAZEBOUND(x) = -4.0;
};
}

instance is1{
domain = Navigation_Problem;
non-fluents = Navigation_non;
init-state{
location(x) = -4.0;location(y) = -4.0;
};
max-nondef-actions = 2;
horizon = 10;
discount = 1.0;
}


Navigation 10 by 10 instance

non-fluents Navigation_non {
domain = Navigation_Problem;
objects{
dim: {x,y};
};
non-fluents {
MINMAZEBOUND(x) = -5.0;
};
}

instance is1{