In the area of learning and planning, Hybrid Deep MILP Planning (HD-MILP-Plan) has introduced a two-stage data-driven framework that i) learns transition models with continuous action and state spaces using NNs, and ii) plans optimally with respect to the learned NNs using a mixed-integer linear programming (MILP) model. It has been experimentally shown that optimal planning with respect to the learned NNs presents a challenging task for branch-and-bound (B&B) solvers due to the poor linear relaxation of the underlying MILP model, which has a large number of big-M constraints.
In this paper, we focus on the important problem of improving the efficiency of MILP models for decision making with learned NNs. In order to tackle this challenging problem, we build on potential heuristics [14, 18], which provide an efficient framework for computing a lower bound on the cost of a given state as a function of its features. In this work, we describe the problem of finding optimal potential bounds for learned NN models with continuous inputs and outputs (i.e., continuous action and state spaces) as a bilevel program, and solve it using a novel finite-time constraint generation algorithm. Features of our linear potential heuristic are defined over the hidden units of the learned NN model, thus providing a rich and expressive candidate feature space. We use our constraint generation algorithm to compute the potential contribution (i.e., reward potential) of each hidden unit to the reward function of the HD-MILP-Plan problem. The precomputed reward potentials are then used to construct linear constraints that bound the reward function of HD-MILP-Plan and provide a tighter linear relaxation, so that B&B optimization explores a smaller number of nodes in the search tree.
Experimentally, we show that our constraint generation algorithm efficiently computes reward potentials for learned NNs, and the overhead computation is justified by the overall strengthening of the underlying MILP model for the task of planning over long horizons. Overall this work bridges the gap between two seemingly distant literatures – planning heuristics for discrete spaces and decision making with learned NN models in continuous action and state spaces. Specifically, we show that data-driven NN models for planning can benefit from advances in heuristics and their impact on the efficiency of search in B&B optimization.
2.1 Deterministic Factored Planning Problem Definition
A deterministic factored planning problem is a tuple consisting of a set of state variables and a set of action variables, both with continuous domains; a function that returns true if the action and state variables satisfy the global constraints; a stationary transition function; a reward function; the initial state constraints; and the goal constraints. For a given horizon, a solution to the problem (i.e., a plan) is a value assignment to the action variables for all time steps (and to the state variables for all time steps) such that the transition function and the global constraints hold at all time steps, and the initial and goal state constraints are satisfied. Similarly, an optimal solution is a plan such that the total reward is maximized. Given the description of the planning problem, we next describe a data-driven planning framework using learned NNs.
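To make the definition concrete, the following minimal Python sketch checks whether a candidate plan is a solution and, if so, returns its total reward. All names (`evaluate_plan`, `transition`, `reward`, `global_ok`, `initial_ok`, `goal_ok`) and the one-dimensional toy domain are assumptions introduced for illustration; in particular, the sketch assumes that the reward is evaluated on the current state and action.

```python
from typing import Callable, Optional, Sequence, Tuple

State = Tuple[float, ...]
Action = Tuple[float, ...]


def evaluate_plan(
    initial_state: State,
    actions: Sequence[Action],
    transition: Callable[[State, Action], State],
    reward: Callable[[State, Action], float],
    global_ok: Callable[[State, Action], bool],
    initial_ok: Callable[[State], bool],
    goal_ok: Callable[[State], bool],
) -> Optional[float]:
    """Return the total reward of the plan, or None if it is not a solution."""
    if not initial_ok(initial_state):
        return None
    state, total = initial_state, 0.0
    for action in actions:  # one action per time step of the horizon
        if not global_ok(state, action):
            return None
        total += reward(state, action)
        state = transition(state, action)  # deterministic successor state
    return total if goal_ok(state) else None


# Hypothetical toy domain: move a point on a line from 0 to at least 2.
plan_reward = evaluate_plan(
    initial_state=(0.0,),
    actions=[(1.0,), (1.0,)],
    transition=lambda s, a: (s[0] + a[0],),
    reward=lambda s, a: -abs(a[0]),           # each unit of movement costs 1
    global_ok=lambda s, a: abs(a[0]) <= 1.0,  # bounded step size
    initial_ok=lambda s: s == (0.0,),
    goal_ok=lambda s: s[0] >= 2.0,
)
```

An optimal plan is then any feasible action sequence for which this total is largest; HD-MILP-Plan searches for it with a MILP solver rather than by enumeration.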
2.2 Planning with Neural Network Learned Transition Models
Hybrid Deep MILP Planning (HD-MILP-Plan) is a two-stage data-driven framework for learning and solving planning problems. Given samples of state transition data, the first stage of the HD-MILP-Plan process learns the transition function using a NN with Rectified Linear Units (ReLUs) and linear activation units. In the second stage, the learned transition function is used to construct the learned planning problem. As shown in Figure 1, the learned transition function is sequentially chained over the horizon and compiled into a MILP. Next, we review the MILP compilation of HD-MILP-Plan.
2.3 Mixed-Integer Linear Programming Compilation of HD-MILP-Plan
We begin with the notation necessary for HD-MILP-Plan.
2.3.1 Parameters
The set of ReLUs in the neural network.
The set of bias units in the neural network.
The set of output units in the neural network.
The learned weight of the neural network between each pair of connected units.
For each unit, the set of action variables connected as inputs to that unit.
For each unit, the set of state variables connected as inputs to that unit.
For each unit, the set of ReLUs connected as inputs to that unit.
For each state variable, the output unit that predicts the value of that state variable.
A large constant used in the big-M constraints.
2.3.2 Decision Variables
For each action variable and time step, a decision variable with continuous domain denoting the value of the action variable at that time step.
For each state variable and time step, a decision variable with continuous domain denoting the value of the state variable at that time step.
For each ReLU and time step, a decision variable with continuous domain denoting the output of the ReLU at that time step.
For each ReLU and time step, a Boolean decision variable that equals 1 if the ReLU is activated at that time step, and 0 otherwise.
2.3.3 MILP Compilation
In the MILP compilation, the objective function (1) maximizes the sum of rewards over a given horizon. Constraints (2-4) ensure that the initial state, global and goal state constraints are satisfied. Constraints (5-9) model the learned transition function. Note that while constraints (5-9) are sufficient to encode the piecewise linear activation behaviour of ReLUs, the use of big-M constraints (6-7) can hinder the overall performance of the underlying B&B solvers that rely on the linear relaxation of the MILP. Therefore, we next turn to potential heuristics, which will be used to strengthen the MILP compilation of HD-MILP-Plan.
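The weakness of the linear relaxation under big-M constraints can be seen on a single ReLU. The sketch below uses one common big-M encoding of y = max(0, x) with a Boolean indicator z (y ≥ 0, y ≥ x, y ≤ Mz, y ≤ x + M(1 − z)); this encoding is an assumption for illustration and need not coincide term by term with constraints (5-9). The names `bigm_feasible` and `relaxed_max_output` are hypothetical.

```python
def bigm_feasible(x: float, y: float, z: float, big_m: float, tol: float = 1e-9) -> bool:
    """Check the four big-M constraints for pre-activation x, output y, indicator z."""
    return (
        y >= -tol
        and y >= x - tol
        and y <= big_m * z + tol
        and y <= x + big_m * (1.0 - z) + tol
    )


def relaxed_max_output(x: float, big_m: float) -> float:
    """Largest y allowed when z is relaxed to [0, 1] (requires |x| <= big_m).

    The two upper bounds M*z and x + M*(1 - z) cross at z = (x + M) / (2M),
    where both equal (x + M) / 2.
    """
    if abs(x) > big_m:
        raise ValueError("big_m must bound the pre-activation")
    return (x + big_m) / 2.0


x, big_m = -1.0, 100.0
exact = max(0.0, x)                     # true ReLU output
relaxed = relaxed_max_output(x, big_m)  # output allowed by the linear relaxation
z_star = (x + big_m) / (2.0 * big_m)    # fractional indicator attaining it
```

With x = -1 and M = 100, the integral encoding forces the output to 0, while the relaxation admits an output of 49.5 at z = 0.495. The gap grows with M, which is why bounds that cut off such fractional solutions can help the B&B solver.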
2.4 Potential Heuristics
Potential heuristics [14, 18] are a family of heuristics that map a set of features to their numerical potentials. In the context of cost-optimal classical planning, the heuristic value of a state is defined as the sum of potentials for all the features that are true in that state. Potential heuristics provide an efficient method for computing a lower bound on the cost of a given state.
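As a minimal illustration of this definition, the following sketch sums the potentials of the features that are true in a state. The feature names, the potential values and the function name `potential_heuristic` are hypothetical; features without an assigned potential are assumed to contribute zero.

```python
from typing import Dict, Set


def potential_heuristic(true_features: Set[str], potentials: Dict[str, float]) -> float:
    """Heuristic value of a state: sum of potentials of its true features."""
    return sum(potentials.get(feature, 0.0) for feature in true_features)


# Hypothetical potentials, e.g. as produced by solving a linear program.
potentials = {"at-home": 3.0, "has-key": -1.0, "door-open": 0.5}
state_features = {"at-home", "has-key"}
h_value = potential_heuristic(state_features, potentials)
```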
In this paper, we introduce an alternative use of potential functions to tighten the linear relaxation of ReLU units in our HD-MILP-Plan compilation and improve the search efficiency of the underlying B&B solver. We define the features of the learned NN over its set of hidden units (i.e., gray circles in Figure 1), and compute the potential contribution (i.e., reward potential) of each hidden unit to the reward function for any time step. These reward potentials are then used to introduce additional constraints on ReLU activations that help guide B&B search in HD-MILP-Plan. Specifically, we are interested in finding a set of reward potentials, one representing the activation and one representing the deactivation of each ReLU, such that the sum of the potentials selected by the ReLU activations is an upper bound on the reward function for all feasible values of the action, state and ReLU variables at any time step. Once the reward potentials are computed, we add this relation as a linear constraint to strengthen HD-MILP-Plan. Next, we describe bilevel programming, which we use to model the problem of finding optimal reward potentials.
2.5 Bilevel Programming
Bilevel programming is an optimization framework for modeling two-level asymmetrical decision making problems with a leader and a follower problem, where the leader has complete knowledge of the follower, and the follower only observes the decisions of the leader to make an optimal decision. Therefore, the leader must incorporate the optimal decision of the follower to optimize its objective.
In this work, we use bilevel programming to model the problem of finding the optimal reward potentials. In the bilevel programming description of the optimal reward potentials problem, the leader selects the optimal values of the reward potentials for the activation and deactivation of each ReLU, and the follower selects the values of the action, state and ReLU variables such that the difference between the reward function and the sum of the selected reward potentials is maximized. That is, the follower tries to find values of its decision variables that violate the upper-bound relation as much as possible. Therefore, the leader must select the values of the reward potentials by incorporating the optimal decision making model of the follower. Given the preliminaries, we now describe the reward potentials for learned NNs.
3 Reward Potentials for Learned Neural Networks
In this section, we present the optimal reward potentials problem and an efficient constraint generation framework for computing reward potentials for learned NNs.
3.1 Optimal Reward Potentials Problem
The problem of finding the optimal reward potentials over a set of ReLUs for any time step can be defined as the following bilevel optimization problem.
3.1.1 Leader Problem
3.1.2 Follower Problem
In the above bilevel problem, the leader problem selects the values of the reward potentials for the activation and deactivation of each ReLU such that their total sum is minimized (i.e., objective function (10)), and their total weighted sum for all ReLU activations is an upper bound on all values of the reward function (i.e., constraint (11) and the follower problem). Given the values of the reward potentials, the follower selects the values of its decision variables such that the difference between the value of the reward function and the sum of reward potentials is maximized subject to constraints (3) and (5-9). Next, we show the correctness of the optimal reward potentials problem as the bilevel program described by the leader (i.e., objective function (10) and constraint (11)) and the follower (i.e., objective function (12) and constraints (3) and (5-9)) problems.
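For reference, the structure described in the paragraph above can be sketched in a single line as follows. The notation is an assumption made for this sketch only: $U$ is the set of ReLUs, $z_u \in \{0, 1\}$ is the activation indicator of ReLU $u$, $v_u^{\text{on}}$ and $v_u^{\text{off}}$ are its reward potentials, $r(x, y)$ is the reward for action values $x$ and state values $y$, and $\mathcal{F}$ is the set of assignments that satisfy constraints (3) and (5-9) at a single time step.

```latex
\begin{align*}
\min_{v^{\text{on}},\, v^{\text{off}}} \quad
  & \sum_{u \in U} \left( v_u^{\text{on}} + v_u^{\text{off}} \right) \\
\text{s.t.} \quad
  & \max_{(x, y, z) \in \mathcal{F}}
    \left[ r(x, y) - \sum_{u \in U} \left( v_u^{\text{on}} z_u + v_u^{\text{off}} (1 - z_u) \right) \right] \le 0
\end{align*}
```

The first line plays the role of the leader objective (10). The constraint embeds the follower's maximization (12) and states that the selected potentials bound the reward from above for every feasible assignment, which is the role of constraint (11).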
Theorem 3.1 (Correctness of The Optimal Reward Potentials Problem)
Proof (by Contradiction)
Suppose that the leader problem selects values of the reward potentials that violate the upper-bound relation for some feasible values of the follower's decision variables, implying that the value of the objective function (12) is positive at those values. However, the feasibility of constraint (11) implies that the value of the objective function (12) must be non-positive (i.e., the follower problem is not solved to optimality), which yields the desired contradiction.
3.2 Constraint Generation for Computing Reward Potentials
The optimal reward potentials problem can be solved efficiently through the following constraint generation framework that decomposes the problem into a master problem and a subproblem. The master problem finds the values of the ReLU potentials. The subproblem finds the values of the ReLU variables that violate constraint (11) the most for the given potentials, and also finds the maximum value of the reward function for the resulting activation pattern. Intuitively, the master problem selects values of the ReLU potentials, which are then checked by the subproblem for the validity of the upper-bound relation for all feasible values of the action, state and ReLU variables at any time step. If a violation is found, a linear constraint corresponding to the violating activation pattern and its maximum reward value is added back to the master problem, and the procedure is repeated until no violation is found by the subproblem.
3.2.1 Subproblem
Solving the subproblem yields the optimal values of the ReLU variables (i.e., an activation pattern) and the corresponding value of the reward function.
3.2.2 Master problem
Given the set of complete value assignments to the ReLU variables and the optimal objective value for each of them, the master problem optimizes the regularized sum of reward potentials (i.e., regularized objective function (10)) with respect to the modified version of constraint (11) as follows. (Footnote 1: The squared terms penalize arbitrarily large values of potentials to avoid numerical issues. A similar numerical issue has been found in the computation of potential heuristics for cost-optimal classical planning problems with dead-ends.)
Solving the master problem yields the optimal values of the ReLU potentials for the current set of activation patterns.
3.2.3 Reward Potentials Algorithm
Given the definitions of the master problem and the subproblem, the constraint generation algorithm for computing an optimal reward potential is outlined as follows.
Given that constraints (3) and (5-9) are feasible, Algorithm 1 iteratively computes reward potentials (i.e., line 3), first checks whether there exists an activation pattern, that is, a complete value assignment to the ReLU variables, that violates constraint (11) (i.e., lines 4 and 5), and then returns the optimal reward value for the violating activation pattern. Given the optimal reward value for the violating activation pattern, constraint (15) is updated (i.e., lines 6-7). Since there is a finite number of activation patterns and solving the subproblem gives the maximum value of the reward function for each pattern, the Reward Potentials Algorithm 1 terminates in at most as many iterations as there are activation patterns, with an optimal reward potential for the learned NN.
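The master/subproblem loop can be illustrated on a toy instance. The following sketch is not Algorithm 1 itself: it assumes a single ReLU that is active exactly when its scalar input is positive, solves the subproblem by brute force over a grid, and solves the regularized master problem in closed form, which is only possible because a single ReLU makes the generated constraints separable. The constant `REG`, the reward function and all function names are hypothetical.

```python
from typing import Dict, List, Tuple

REG = 0.01  # hypothetical regularizer weight on the squared potentials

# Toy model: one ReLU with input a in [-1, 2], active (pattern 1) iff a > 0.
GRID: List[float] = [k / 10 for k in range(-10, 21)]


def reward(a: float) -> float:
    return 1.0 - abs(a - 1.0)


def pattern(a: float) -> int:
    return 1 if a > 0 else 0


def solve_master(cuts: Dict[int, float]) -> Dict[int, float]:
    """Minimize sum_p (v_p + REG * v_p**2) subject to v_p >= cuts[p].

    With a single ReLU each cut bounds one potential, so the problem is
    separable and the minimizer is max(cut, -1 / (2 * REG)).
    """
    floor = -1.0 / (2.0 * REG)
    return {p: max(cuts.get(p, floor), floor) for p in (0, 1)}


def solve_subproblem(potentials: Dict[int, float]) -> Tuple[int, float, float]:
    """Return the most violated pattern, its best reward, and the violation."""
    best = max(GRID, key=lambda a: reward(a) - potentials[pattern(a)])
    p = pattern(best)
    best_reward = max(reward(a) for a in GRID if pattern(a) == p)
    return p, best_reward, reward(best) - potentials[p]


def reward_potentials(tol: float = 1e-9) -> Tuple[Dict[int, float], int]:
    cuts: Dict[int, float] = {}
    iterations = 0
    while True:
        iterations += 1
        potentials = solve_master(cuts)
        p, best_reward, violation = solve_subproblem(potentials)
        if violation <= tol:  # no activation pattern violates the bound
            return potentials, iterations
        cuts[p] = best_reward  # add the cut for the violating pattern


potentials, iterations = reward_potentials()
```

In this toy instance each violating pattern contributes one cut, and the loop stops once the potential of each pattern bounds the reward of every grid point with that pattern. For networks with several ReLUs the cuts couple the potentials, so the master problem requires a quadratic programming solver rather than a closed form.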
3.2.4 Increasing the Granularity of the Reward Potentials Algorithm
The feature space of Algorithm 1 can be enhanced to include information on each ReLU's input and/or output. Instead of computing reward potentials for only the activation and deactivation of each ReLU, we (i) introduce an interval parameter N to split the output range of each ReLU into N equal-size intervals, (ii) introduce auxiliary Boolean decision variables to represent the activation interval of each ReLU, such that a variable equals 1 if and only if the output of the ReLU is within the corresponding interval, and 0 otherwise, and (iii) compute reward potentials for each activation interval and for the deactivation of each ReLU.
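The interval indicators can be illustrated as follows. This is a sketch under stated assumptions: intervals are taken to be closed on the left, the upper end of the output range is assigned to the last interval, and an output of exactly zero is treated as deactivation. The function name `interval_indicators` and its parameters are hypothetical.

```python
from typing import List


def interval_indicators(output: float, upper: float, n_intervals: int) -> List[int]:
    """One-hot encoding of the equal-size interval containing a positive ReLU output.

    Returns all zeros when the ReLU is deactivated (output == 0).
    """
    if not 0.0 <= output <= upper:
        raise ValueError("output must lie within [0, upper]")
    indicators = [0] * n_intervals
    if output > 0.0:
        width = upper / n_intervals
        # The upper end of the range is assigned to the last interval.
        index = min(int(output / width), n_intervals - 1)
        indicators[index] = 1
    return indicators


low = interval_indicators(1.5, upper=6.0, n_intervals=3)
top = interval_indicators(6.0, upper=6.0, n_intervals=3)
off = interval_indicators(0.0, upper=6.0, n_intervals=3)
```

Each indicator then receives its own reward potential, so a larger number of intervals gives the bound a finer view of the ReLU output at the cost of more Boolean variables.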
3.3 Strengthening HD-MILP-Plan
Given the optimal reward potentials, the MILP compilation of HD-MILP-Plan is strengthened through the addition of the following constraints:
These constraints are added for all time steps, and use the upper bound obtained from performing forward reachability on the output of each ReLU in the learned NN. Next, we present our experimental results to demonstrate the efficiency and the utility of computing reward potentials and strengthening HD-MILP-Plan.
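One common way to obtain such upper bounds is interval arithmetic applied layer by layer. The sketch below shows this for a single ReLU layer; it is an assumption that this matches the forward reachability procedure used here, and the function name `relu_layer_bounds` and the example weights are hypothetical.

```python
from typing import List, Sequence, Tuple


def relu_layer_bounds(
    weights: Sequence[Sequence[float]],  # weights[j][i]: input i -> unit j
    biases: Sequence[float],
    input_bounds: Sequence[Tuple[float, float]],
) -> List[Tuple[float, float]]:
    """Propagate input intervals through one ReLU layer."""
    bounds = []
    for row, bias in zip(weights, biases):
        lo = hi = bias
        for w, (l, u) in zip(row, input_bounds):
            # A non-negative weight attains its extreme at the same-side input
            # bound, a negative weight at the opposite one.
            lo += w * (l if w >= 0 else u)
            hi += w * (u if w >= 0 else l)
        bounds.append((max(0.0, lo), max(0.0, hi)))  # ReLU clips at zero
    return bounds


layer_bounds = relu_layer_bounds(
    weights=[[1.0, -2.0], [-1.0, 0.5]],
    biases=[0.5, -3.0],
    input_bounds=[(-1.0, 1.0), (0.0, 2.0)],
)
```

Feeding the resulting intervals into the next layer yields bounds for deeper ReLUs. In this example the second unit has an upper bound of zero, so it can never be activated for inputs within the given ranges.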
4 Experimental Results
In this section, we present computational results on (i) the convergence of Algorithm 1, and (ii) the overall strengthening of HD-MILP-Plan with the addition of constraints (16-18) for the task of planning over long horizons. First, we present results on the overall efficiency of Algorithm 1 and the strengthening of HD-MILP-Plan over multiple learned planning instances. Then, we focus on the most computationally expensive domain identified by our experiments to further investigate the convergence behaviour of Algorithm 1 and the overall strengthening of HD-MILP-Plan as a function of time.
4.1 Experimental Setup
The experiments were run on a MacBook Pro with a 2.8 GHz Intel Core i7 and 16GB memory. All instances from the HD-MILP-Plan paper, namely Navigation, Reservoir Control and HVAC, were selected. Both domain instance sizes and their respective learned NN sizes are detailed in Table 1. The action variables of the Navigation domains were constrained to a bounded range. The CPLEX 12.9.0 solver was used to optimize both Algorithm 1 and HD-MILP-Plan, with a total time limit of 6000 seconds per domain instance. In our experiments, we show results for the interval parameter values N = 2, 3 (preliminary experimental results for the interval parameter value not reported here did not show significant improvements over the base encoding of HD-MILP-Plan). Finally, in the master problem, we have chosen the regularizer constant in the objective function as a function of the large constant used in the big-M constraints of HD-MILP-Plan (i.e., constraints (6-7)).
Table 1: Domain instances, learned network structures and planning horizons.
| Domain Instance | Network Structure | Horizon |
|---|---|---|
| Navigation (8-by-8 maze) | 4:32:32:2 | 100 |
| Navigation (10-by-10 maze) | 4:32:32:2 | 100 |
| Reservoir Control (3 reservoirs) | 6:32:3 | 500 |
| Reservoir Control (4 reservoirs) | 8:32:4 | 500 |
| HVAC (3 rooms) | 6:32:3 | 100 |
| HVAC (6 rooms) | 12:32:6 | 100 |
4.2 Overall Results
In this section, we present the experimental results on (i) the computation of the optimal reward potentials using Algorithm 1, and (ii) the performance of HD-MILP-Plan with the addition of constraints (16-18) over multiple learned planning instances with long horizons. Table 2 summarizes the computational results and highlights the best performing HD-MILP-Plan settings for each learned planning instance.
Table 2: Computational results of Algorithm 1 and HD-MILP-Plan for each domain setting.
| Domain Setting | Alg. 1 (s) | Cumul. (s) | Primal | Dual | Open | Closed |
|---|---|---|---|---|---|---|
| HVAC,3,100,Base | - | 260.21 | Opt. found | Opt. proved | 0 | 289529 |
| HVAC,3,100,N=2 | 7 | 88.21 | Opt. found | Opt. proved | 0 | 2501 |
| HVAC,3,100,N=3 | 9 | 194.44 | Opt. found | Opt. proved | 0 | 10891 |
| Res,3,500,Base | - | 33.01 | Opt. found | Opt. proved | 0 | 1 |
| Res,3,500,N=2 | 1 | 99.81 | Opt. found | Opt. proved | 0 | 714 |
| Res,3,500,N=3 | 2 | 90.27 | Opt. found | Opt. proved | 0 | 674 |
| Res,4,500,Base | - | 300.71 | Opt. found | Opt. proved | 0 | 1236 |
| Res,4,500,N=2 | 7 | 109.66 | Opt. found | Opt. proved | 0 | 1924 |
| Res,4,500,N=3 | 6 | 232.19 | Opt. found | Opt. proved | 0 | 1294 |
The first column of Table 2 identifies the domain setting of each row. The second column (i.e., Alg. 1) denotes the runtime of Algorithm 1 in seconds. The remaining columns provide information on the performance of HD-MILP-Plan. Specifically, the third column (i.e., Cumul.) denotes the cumulative runtime of Algorithm 1 and HD-MILP-Plan in seconds, the fourth column (i.e., Primal) denotes the value of the incumbent plan found by HD-MILP-Plan, the fifth column (i.e., Dual) denotes the value of the duality bound found by HD-MILP-Plan, and the sixth and seventh columns (i.e., Open and Closed) denote the number of open and closed nodes in the B&B tree, respectively. The bolded values indicate the best performing HD-MILP-Plan settings for each learned planning instance, where the performance of each setting is evaluated first based on the runtime performance (i.e., Cumul. column), followed by the quality of the incumbent plan (i.e., Primal column) and the duality bound (i.e., Dual column) obtained by HD-MILP-Plan.
In five out of six instances, we observe that strengthened HD-MILP-Plan with interval parameter performed the best. The pairwise comparison of the base HD-MILP-Plan and strengthened HD-MILP-Plan with interval parameter shows that in almost all instances, the strengthened model performed better than the base model. The only instance in which the base model significantly outperformed the other two was the Reservoir Control domain with three reservoirs, where the B&B solver was able to find an optimal plan in the root node. Overall, we found that especially in the instances where optimality was hard to prove within the runtime limit of 6000 seconds (i.e., all Navigation instances and the HVAC domain with 6 rooms), strengthened HD-MILP-Plan explored significantly fewer nodes in general while obtaining either higher quality incumbent plans or lower dual bounds. We observe that Algorithm 1 terminated with optimal reward potentials in less than 10 seconds in both the Reservoir Control and HVAC domains, and took as much as 1700 seconds in the Navigation domain, highlighting the effect of NN size and complexity (detailed in Table 1) on the runtime of Algorithm 1. As a result, we next focus on the most computationally expensive domain identified by our experiments, namely Navigation, to get a better understanding of the convergence behaviour of Algorithm 1 and the overall efficiency of HD-MILP-Plan as a function of time.
4.3 Detailed Convergence Results on Navigation Domain
In this section, we inspect the convergence of Algorithm 1 in the Navigation domain for computing an optimal reward potential for the learned NNs.
Figure 2 visualizes the violation of constraint (11) as a function of time over the computation of optimal reward potentials using the Reward Potentials Algorithm 1 for the learned NNs of both the Navigation 8-by-8 (i.e., top) and Navigation 10-by-10 (i.e., bottom) planning instances. In both, we observe that the violation of constraint (11) decreases exponentially as a function of time, showcasing a long-tail runtime behaviour, and that the algorithm terminates with optimal reward potentials.
4.4 Detailed Strengthening Results on Navigation Domain
Next, we inspect the overall strengthening of HD-MILP-Plan with respect to its underlying linear relaxation and search efficiency as a result of constraints (16-18), for the task of planning over long horizons in the Navigation Domain.
Figures 3 and 4 visualize the overall effect of incorporating constraints (16-18) into HD-MILP-Plan as a function of time for the Navigation domain with (a) 8-by-8 and (b) 10-by-10 maze sizes. In both Figures 3 and 4, the linear relaxation (i.e., top), the number of closed nodes (i.e., middle), and the number of open nodes (i.e., bottom) are displayed as a function of time. The inspection of both Figures 3 and 4 shows that once the reward potentials are computed, the addition of constraints (16-18) allows HD-MILP-Plan to obtain a tighter bound by exploring significantly fewer nodes. In the 8-by-8 maze instance, we observe that HD-MILP-Plan with constraint (16) outperforms the base HD-MILP-Plan by 1700 and 3300 seconds with interval parameter N = 2 and N = 3, respectively. In the 10-by-10 maze instance, we observe that HD-MILP-Plan with constraints (16-18) obtains a tighter bound compared to the base HD-MILP-Plan by 3750 seconds and almost reaches the same bound by the time limit (i.e., 6000 seconds) with interval parameter N = 2 and N = 3, respectively.
The inspection of the top subfigures in Figures 3 and 4 shows that increasing the value of the interval parameter increases the computation time of Algorithm 1, but can also increase the search efficiency of the underlying B&B solver through increasing its exploration and pruning capabilities, as demonstrated by the middle and bottom subfigures in Figures 3 and 4. Overall, from both instances, we conclude that HD-MILP-Plan with constraints (16-18) obtains a linear relaxation that is at least as good as that of the base HD-MILP-Plan by exploring significantly fewer nodes in the B&B search tree.
5 Related Work
In this paper, we have focused on the important problem of improving the efficiency of B&B solvers for optimal planning with learned NN transition models in continuous action and state spaces. Parallel to this work, planning and decision making in discrete action and state spaces [11, 16, 15], verification of learned NNs [8, 5, 6, 13], robustness evaluation of learned NNs and defenses to adversarial attacks for learned NNs have been studied with the focus of solving very similar decision making problems. For example, the verification problem solved by Reluplex (an SMT-based learned NN verification software) is very similar to the planning problem solved by HD-MILP-Plan without the objective function and the horizon. Interestingly, the verification problem can also be modeled as an optimization problem and potentially benefit from the findings presented in this paper. For future work, we plan to explore how our findings in this work translate to solving other important tasks for learned neural networks.
6 Conclusion
In this paper, we have focused on the problem of improving the linear relaxation and the search efficiency of MILP models for decision making with learned NNs. In order to tackle this problem, we used bilevel programming to correctly model the optimal reward potentials problem. We then introduced a novel finite-time constraint generation algorithm for computing the potential contribution of each hidden unit to the reward function of the planning problem. Given the precomputed values of the reward potentials, we have introduced constraints to tighten the bound on the reward function of the planning problem. Experimentally, we have shown that our constraint generation algorithm efficiently computes reward potentials for learned NNs, and the overhead computation is justified by the overall strengthening of the underlying MILP model as demonstrated on the task of planning over long horizons. With this paper, we have shown the potential of bridging the gap between two seemingly distant literatures: heuristic search and decision making with learned NN models in continuous action and state spaces.
References
[1] Bard, J.: Practical Bilevel Optimization: Algorithms and Applications. Springer US (Sep 2000). https://doi.org/10.1007/978-1-4757-2836-1
[2] Bunel, R., Turkaslan, I., Torr, P.H., Kohli, P., Kumar, M.P.: A unified view of piecewise linear neural network verification (2017)
[3] Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. Journal of Machine Learning Research 12, 2493–2537 (2011)
[4] Deng, L., Hinton, G.E., Kingsbury, B.: New types of deep neural network learning for speech recognition and related applications: an overview. In: IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 8599–8603 (2013)
[5] Ehlers, R.: Formal verification of piece-wise linear feed-forward neural networks. In: D'Souza, D., Narayan Kumar, K. (eds.) Automated Technology for Verification and Analysis. pp. 269–286. Springer International Publishing, Cham (2017)
[6] Huang, X., Kwiatkowska, M., Wang, S., Wu, M.: Safety verification of deep neural networks. In: Majumdar, R., Kunčak, V. (eds.) Computer Aided Verification. pp. 3–29. Springer International Publishing, Cham (2017)
[7] IBM: IBM ILOG CPLEX Optimization Studio CPLEX User's Manual (2019)
[8] Katz, G., Barrett, C., Dill, D., Julian, K., Kochenderfer, M.: Reluplex: An efficient SMT solver for verifying deep neural networks. In: Twenty-Ninth International Conference on Computer Aided Verification. CAV (2017)
[9] Kolter, Z., Wong, E.: Provable defenses against adversarial examples via the convex outer adversarial polytope. In: Thirty-First Conference on Neural Information Processing Systems (2017)
[10] http://dl.acm.org/citation.cfm?id=2999134.2999257
[11] Lombardi, M., Gualandi, S.: A Lagrangian propagator for artificial neural networks in constraint programming. vol. 21, pp. 435–462 (Oct 2016). https://doi.org/10.1007/s10601-015-9234-6
[12] Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Twenty-Seventh International Conference on Machine Learning. pp. 807–814 (2010), http://www.icml2010.org/papers/432.pdf
[13] Narodytska, N., Kasiviswanathan, S., Ryzhyk, L., Sagiv, M., Walsh, T.: Verifying properties of binarized deep neural networks. In: Thirty-Second AAAI Conference on Artificial Intelligence. pp. 6615–6624 (2018)
[14] Pommerening, F., Helmert, M., Röger, G., Seipp, J.: From non-negative to general operator cost partitioning. In: Twenty-Ninth AAAI Conference on Artificial Intelligence. pp. 3335–3341 (2015)
[15] Say, B., Sanner, S.: Compact and efficient encodings for planning in factored state and action spaces with learned binarized neural network transition models (2018)
[16] Say, B., Sanner, S.: Planning in factored state and action spaces with learned binarized neural network transition models. In: Twenty-Seventh International Joint Conference on Artificial Intelligence. pp. 4815–4821 (2018). https://doi.org/10.24963/ijcai.2018/669
[17] Say, B., Wu, G., Zhou, Y.Q., Sanner, S.: Nonlinear hybrid planning with deep net learned transition models and mixed-integer linear programming. In: Twenty-Sixth International Joint Conference on Artificial Intelligence. pp. 750–756 (2017). https://doi.org/10.24963/ijcai.2017/104
[18] Seipp, J., Pommerening, F., Helmert, M., Röger, G.: New optimization functions for potential heuristics. In: Twenty-Fifth International Conference on Automated Planning and Scheduling. pp. 193–201 (2015)
[19] Tjeng, V., Xiao, K., Tedrake, R.: Evaluating robustness of neural networks with mixed integer programming. In: Seventh International Conference on Learning Representations (2019)