Combining Reinforcement Learning and Constraint Programming for Combinatorial Optimization

06/02/2020 ∙ by Quentin Cappart, et al. ∙ Element AI Inc. ∙ Corporation de l'École Polytechnique de Montréal

Combinatorial optimization has found applications in numerous fields, from aerospace to transportation planning and economics. The goal is to find an optimal solution among a finite set of possibilities. The well-known challenge one faces with combinatorial optimization is the state-space explosion problem: the number of possibilities grows exponentially with the problem size, which makes solving intractable for large problems. In recent years, deep reinforcement learning (DRL) has shown its promise for designing good heuristics dedicated to solving NP-hard combinatorial optimization problems. However, current approaches have two shortcomings: (1) they mainly focus on the standard travelling salesman problem and cannot be easily extended to other problems, and (2) they only provide an approximate solution with no systematic way to improve it or to prove optimality. In another context, constraint programming (CP) is a generic tool to solve combinatorial optimization problems. Based on a complete search procedure, it will always find the optimal solution if given enough execution time. A critical design choice that makes CP non-trivial to use in practice is the branching decision, which directs how the search space is explored. In this work, we propose a general hybrid approach, based on DRL and CP, for solving combinatorial optimization problems. The core of our approach is a dynamic programming formulation that acts as a bridge between both techniques. We experimentally show that our solver efficiently solves two challenging problems: the travelling salesman problem with time windows and the 4-moments portfolio optimization problem. The results obtained show that the framework introduced outperforms the stand-alone RL and CP solutions, while being competitive with industrial solvers.


1 Introduction

The design of efficient algorithms for solving NP-hard problems, such as combinatorial optimization problems (COPs), has long been an active field of research wolsey1999integer . Broadly speaking, there exist two main families of approaches for solving COPs, each of them having pros and cons. On the one hand, exact algorithms are based on a complete and clever enumeration of the solution space lawler1966branch ; rossi2006handbook . Such algorithms will eventually find the optimal solution, but they may be prohibitive for solving large instances because of the exponential increase of the execution time. That being said, well-designed exact algorithms can nevertheless be used to obtain sub-optimal solutions by interrupting the search before its termination. This flexibility makes exact methods appealing and practical, and as such they constitute the core of modern optimization solvers such as CPLEX cplex2009v12 , Gurobi optimization2014inc , or Gecode team2008gecode . It is the case of constraint programming (CP) rossi2006handbook , which has the additional asset of being a generic tool that can be used to solve a large variety of COPs, whereas mixed integer programming (MIP) benichou1971experiments solvers only deal with linear problems and limited non-linear cases. A critical design choice in CP is the branching strategy, i.e., directing how the search space must be explored. Naturally, well-designed heuristics are more likely to discover promising solutions, whereas bad heuristics may bring the search into a fruitless subpart of the solution space. In general, the choice of an appropriate branching strategy is non-trivial, and its design is a hot topic in the CP community palmieri2016parallel ; fages2017making ; laborie2018objective .

On the other hand, heuristic algorithms aarts2003local ; gendreau2005metaheuristics are incomplete methods that can compute solutions efficiently, but are not able to prove the optimality of a solution. They also often require substantial problem-specific knowledge to be built. In recent years, deep reinforcement learning (DRL) sutton1998introduction ; arulkumaran2017brief has shown its promise for obtaining high-quality approximate solutions to some NP-hard COPs bello2016neural ; khalil2017learning ; deudon2018learning ; kool2018attention . Once a model has been trained, the execution time is typically negligible in practice. The good results obtained suggest that DRL is a promising new tool for efficiently finding good approximate solutions to NP-hard problems, provided that (1) we know the distribution of problem instances and (2) we have enough instances sampled from this distribution for training the model. Nonetheless, current methods have shortcomings. Firstly, they are mainly dedicated to solving a specific problem, such as the travelling salesman problem (TSP), with the noteworthy exceptions of Khalil et al. khalil2017learning , who tackle three other graph-based problems, and of Kool et al. kool2018attention , who target routing problems. Secondly, they are only designed to act as constructive heuristics and come with no systematic way to improve the solutions obtained, unlike complete methods such as CP.

As both exact approaches and learning-based heuristics have strengths and weaknesses, a natural question arises: how can we leverage these strengths together in order to build a better tool to solve combinatorial optimization problems? In this work, we show that this can be successfully done by combining reinforcement learning and constraint programming, using dynamic programming as a bridge between both techniques. Dynamic programming (DP) bellman1966dynamic , which has found successful applications in many fields godfrey2002adaptive ; topaloglou2008dynamic ; tang2017near ; ghasempour2019adaptive , is an important technique for modelling COPs. In its simplest form, DP consists in breaking a problem into sub-problems that are linked together through a recursive formulation (i.e., the well-known Bellman equation). The main issue with exact DP is the so-called curse of dimensionality: the number of generated sub-problems grows exponentially, to the point that it becomes infeasible to store all of them in memory.

This paper proposes a generic and complete solver, based on DRL and CP, for solving COPs that can be modelled using DP. To the best of our knowledge, it is the first work that proposes to embed a learned heuristic directly inside a CP solver. Our detailed contributions are as follows: (1) a new encoding able to express a DP model of a COP as both an RL environment and a CP model; (2) the use of two standard RL training procedures, deep Q-learning and proximal policy optimization, for learning an appropriate CP branching strategy, where training is done using randomly generated instances sampled from a distribution similar to the one of the instances we want to solve; (3) the integration of the learned branching strategies into three CP search strategies, namely branch-and-bound, iterative limited discrepancy search, and restart-based search; (4) promising results on two challenging COPs, namely the travelling salesman problem with time windows and the 4-moments portfolio optimization problem; (5) the open-source release of our code and models, in order to ease future research in this field (https://github.com/qcappart/hybrid-cp-rl-solver).

In general, as there is no underlying hypothesis such as linearity or convexity, a DP cannot be trivially encoded and solved by standard integer programming techniques bergman2018discrete . This is one of the reasons that drove us to consider CP for the encoding. The next section presents the hybrid solving process that we designed. Then, experiments on the two case studies are carried out. Finally, a discussion on the current limitations of the approach and on future research opportunities is proposed.

2 A Unifying Representation Combining Learning and Searching

Because of the state-space explosion, solving NP-hard COPs remains a challenge. In this paper, we propose a generic and complete solver, based on DRL and CP, in order to solve COPs that can be modelled using DP. This section describes the complete architecture of the framework we propose. A high-level picture of the architecture is shown in Figure 1. It is divided into three parts: the learning phase, the solving phase, and the unifying representation acting as a bridge between the two phases. Each part contains several components. Green blocks and arrows represent the original contributions of this work, and blue blocks correspond to known algorithms that we adapted for our framework.

Figure 1: Overview of our framework for solving COPs.

2.1 Dynamic Programming Model

Dynamic programming (DP) bellman1966dynamic is a technique combining both mathematical modeling and computer programming for solving complex optimization problems, such as NP-hard problems. In its simplest form, it consists in breaking a problem into sub-problems and linking them through a recursive formulation. The initial problem is then solved recursively, and the optimal values of the decision variables are recovered successively by tracking back the information already computed. Let us consider a general COP Q^p : {max f(x) : x = (x_1, ..., x_n), x_i ∈ D(x_i)}, where the x_i are discrete variables that must be assigned in order to maximize a function f(x). In the DP terminology, the decision variables of Q^p are referred to as the controls (x_i). They take a value from their domain D(x_i), and enforce a transition T : S × X → S from a state s_i to another one s_{i+1}, where S is the set of states and X the set of controls. The initial state s_1 is known, and a transition is done at each stage i ∈ {1, ..., n} until all the variables have been assigned. Besides, a reward R : S × X → ℝ is induced after each transition. Finally, a DP model can also contain validity conditions V : S × X → {0, 1} and dominance rules P : S × X → {0, 1} that restrict the set of feasible actions. The difference between both is that validity conditions are mandatory to ensure the correctness of the DP model (V(s_i, x_i) = 0 ⟺ T(s_i, x_i) = ⊥), whereas the dominance rules are only used for efficiency purposes (P(s_i, x_i) = 1 ⟹ T(s_i, x_i) = ⊥), where ⟺, ⟹, and ⊥ represent the equivalence, the implication, and the unfeasible state, respectively. A DP model for a COP can then be defined as a tuple ⟨S, X, T, R, V, P⟩. The problem can be solved recursively using the Bellman equation, where g_i is a state-value function representing the optimal reward of being at state s_i at stage i:

g_i(s_i) = max_{x_i ∈ D(x_i)} { R(s_i, x_i) + g_{i+1}(T(s_i, x_i)) : V(s_i, x_i) = 1 }    ∀ i ∈ {1, ..., n}    (1)

The value is equal to zero for the final state (g_{n+1}(s_{n+1}) = 0), and the recursion is unrolled backwards until g_1(s_1) has been computed. This last value gives the optimal cost of Q^p. Then, by tracing back the values assigned to the variables x_i, the optimal solution is recovered. Unfortunately, DP suffers from the well-known curse of dimensionality, which prevents its use when dealing with problems involving large state/control spaces. A partial solution to this problem is to prune dominated actions (P). An action is dominated if it is valid according to the recursive formulation, but (1) is either strictly worse than another action, or (2) cannot lead to a feasible solution. In practice, pruning such dominated actions can have a huge impact on the size of the search space, but identifying them is not trivial, as assessing those two conditions precisely is problem-dependent. Besides, even after pruning the dominated actions, the size of the state-action space may still be too large to be completely explored in practice.
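To make the recursion of Equation (1) concrete, the sketch below solves a toy knapsack-style DP by memoized Bellman backups. The instance data and the transition, reward, and validity functions are illustrative assumptions standing in for T, R, and V; they are not taken from the paper's implementation.

```python
from functools import lru_cache

# Illustrative 0/1 knapsack-style DP: state = (stage, remaining capacity).
values, weights, capacity = [10, 7, 4], [3, 2, 2], 4
n = len(values)

def transition(state, action):          # T(s, x): next state after taking `action`
    stage, cap = state
    return (stage + 1, cap - weights[stage] * action)

def reward(state, action):              # R(s, x): immediate reward of the control
    stage, _ = state
    return values[stage] * action

def valid(state, action):               # V(s, x): capacity must never become negative
    _, cap = transition(state, action)
    return cap >= 0

@lru_cache(maxsize=None)                # memoization avoids recomputing sub-problems
def g(state):                           # Bellman value g_i(s_i); zero at the final stage
    stage, _ = state
    if stage == n:
        return 0
    return max(reward(state, a) + g(transition(state, a))
               for a in (0, 1) if valid(state, a))

print(g((0, capacity)))                 # optimal objective value of the toy instance
```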

2.2 RL Encoding

An introduction to reinforcement learning is proposed in Appendix A. Note that all the sets used to define an RL environment are written using a larger font size. Encoding the DP formulation into an RL environment requires adequately defining the set of states, the set of actions, the transition function, and the reward function from the DP tuple ⟨S, X, T, R, V, P⟩ and a specific instance Q^p of the COP we are considering. The initial state of the RL environment corresponds to the first stage of the DP model, where no variable has been assigned yet.

State

For each stage i of the DP model, we define the RL state as the pair (Q^p, s_i), where Q^p is the problem instance we are considering and s_i is the DP state at the same stage i. Note that the second part of the state (s_i) is dynamic, as it depends on the current stage of the DP model, or, similarly, on the current time step of the RL episode, whereas the first part (Q^p) is static, as it remains the same for the whole episode. In practice, each state is embedded into a tensor of features, as it serves as input to a neural network.

Action

Given a DP state s_i at stage i and its control x_i, an action a_i performed at that state has a one-to-one relationship with the control x_i. The action can be done if and only if it is valid under the DP model. The idea is to allow only actions that are consistent with the DP model, the validity conditions, and the eventual dominance conditions. Formally, the set of feasible actions A_i at stage i is as follows: A_i = { v ∈ D(x_i) : V(s_i, v) = 1 ∧ P(s_i, v) = 0 }.

Transition

The RL transition gives the state s_{i+1} from s_i and a_i, in the same way as the transition function of the DP model gives a state s_{i+1} from a previous state s_i and a control value x_i. Formally, we have the deterministic transition ((Q^p, s_i), a_i) ↦ (Q^p, T(s_i, a_i)).

Reward

An initial idea for designing the RL reward function is to use the reward function R of the DP model on the current state and the action that has been selected. However, performing a sequence of actions in a DP subject to validity conditions can lead to a state with no solution, which must be avoided. Such a situation happens when a state with no feasible action is reached whereas at least one control x_i has not yet been assigned a value. Finding a feasible solution must then be prioritized over maximizing the DP reward, and this is not captured by this simple form of the RL reward. Based on this, two properties must be satisfied in order to ensure that the reward drives the RL agent to the optimal solution of the COP: (1) the total reward collected through an episode e_1 must be smaller than the reward of an episode e_2 if the COP solution obtained with e_1 is worse than the one obtained with e_2, and (2) the total reward collected through an episode giving an unfeasible solution must be smaller than the reward of any episode giving a feasible solution. By doing so, we ensure that the RL agent has an incentive to find, first, feasible solutions and, then, the best ones. The reward we designed is as follows: r(s_i, a_i) = ρ × (|UB(Q^p)| + R(s_i, a_i)), where UB(Q^p) corresponds to an upper bound of the objective value that can be reached for the COP Q^p. The term |UB(Q^p)| is a constant factor that gives a strict upper bound on the reward of any solution of the DP and drives the agent to progress towards a feasible solution first. The absolute value ensures that the term is positive and is used to negate the effect of negative rewards that may lead the agent to stop the episode as soon as possible. The second term then forces the agent to find the best feasible solution. Finally, a scaling factor ρ can also be added in order to compress the space of rewards into a smaller interval near zero. Note that for DP models having only feasible solutions, the first term can be omitted.
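As a minimal sketch of this reward shaping (with illustrative names and values, not the released implementation), the per-transition RL reward can be computed as follows, assuming the DP reward and an upper bound UB of the objective are available:

```python
def shaped_reward(dp_reward, ub, rho=0.001):
    """RL reward for one transition: a constant positive bonus |UB| for reaching
    one more feasible stage, plus the DP reward, both scaled by rho."""
    return rho * (abs(ub) + dp_reward)

# A feasible episode of n transitions collects n * rho * |UB| plus the scaled DP
# rewards, so it always beats an episode that stops early on an unfeasible state.
episode_return = sum(shaped_reward(r, ub=-500.0) for r in [-120.0, -80.0, -95.0])
print(episode_return)
```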

2.3 Learning Algorithm

We implemented two different agents, one based on a value-based method (DQN) and a second one based on policy gradient (PPO). In both cases, the agent is used to parametrize the weight vector w of a neural network giving either the Q-values (DQN) or the policy probabilities (PPO). The training is done using randomly generated instances sampled from a distribution similar to the one of the instances we want to solve. It is important to mention that this learning procedure makes the assumption that we have a generator able to create random instances that follow the same distribution as the ones we want to tackle, or a sufficient number of similar instances from past data. Such an assumption is common in the vast majority of works tackling NP-hard problems using ML khalil2017learning ; kool2018attention ; cappart2019improving .

2.4 Neural Network Architecture

In order to ensure the genericity and the efficiency of the framework, we have two requirements for the neural network architecture: (1) it must be able to handle instances of the same COP that have a different number of variables (i.e., it must operate on non-fixed dimensional feature vectors), and (2) it must be invariant to input permutations. In other words, encoding the variables x_1, x_2, and x_3 should produce the same prediction as encoding x_3, x_1, and x_2. A first option is to embed the variables into a set transformer architecture lee2018set , which ensures these two requirements. Besides, many COPs also have a natural graph structure that can be exploited. For such a reason, we also considered another embedding based on graph attention networks (GAT) velivckovic2017graph . The embedding, obtained either with a GAT or with a set transformer, can then be used as the input of a feed-forward network to get a prediction. The case studies will show a practical application of both architectures. For the DQN network, the dimension of the last layer output corresponds to the total number of actions for the COP, and the network outputs an estimation of the Q-values for each of them. The output is then masked in order to remove the unfeasible actions. Concerning PPO, distinct networks are built for the actor and the critic. The last layer of the critic outputs only a single value. Concerning the actor, it is similar to the DQN case, but a softmax selection is used after the last layer in order to obtain the probability of selecting each action.
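The sketch below illustrates this output structure in PyTorch, assuming a stand-in linear encoder in place of the GAT or set transformer embedding; the layer sizes and class name are arbitrary choices made for the example, and only the masking pattern reflects the description above.

```python
import torch
import torch.nn as nn

class MaskedQNet(nn.Module):
    """Per-action Q-values with unfeasible actions masked out."""
    def __init__(self, feat_dim: int, emb_dim: int = 32):
        super().__init__()
        self.encoder = nn.Linear(feat_dim, emb_dim)       # stand-in for GAT / set transformer
        self.head = nn.Sequential(nn.Linear(emb_dim, emb_dim), nn.ReLU(),
                                  nn.Linear(emb_dim, 1))  # one Q-value per action

    def forward(self, action_feats: torch.Tensor, feasible: torch.Tensor) -> torch.Tensor:
        # action_feats: [n_actions, feat_dim], feasible: [n_actions] boolean mask
        q = self.head(torch.relu(self.encoder(action_feats))).squeeze(-1)
        return q.masked_fill(~feasible, float("-inf"))    # unfeasible actions never selected

net = MaskedQNet(feat_dim=6)
q_values = net(torch.randn(5, 6), torch.tensor([True, True, False, True, False]))
greedy_action = int(q_values.argmax())                    # DQN-style value-selection
probs = torch.softmax(q_values, dim=0)                    # PPO-style actor would use a softmax
```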

2.5 CP Encoding

An introduction to constraint programming is proposed in Appendix B. Note that the teletype font is used to refer to CP notations. This section describes how a DP formulation can be encoded into a CP model. Modeling a problem using CP consists in defining the tuple ⟨X, D, C, O⟩, where X is the set of variables, D is the set of domains, C is the set of constraints, and O is the objective function. Let us consider the DP formulation ⟨S, X, T, R, V, P⟩ introduced previously, with n the number of stages.

Variables and domains

We make a distinction between the decision variables, on which the search is performed, and the auxiliary variables, which are linked to the decision variables but are not branched on during the search. The encoding involves two variables per stage i: (1) x_i^s is an auxiliary variable representing the current state at stage i, whereas (2) x_i^a is a decision variable representing the action done at this state, similarly to the regular decomposition pesant2004regular . Besides, a last auxiliary variable x_{n+1}^s is considered for the stage n+1, which represents the final state of the system. In the optimal solution, the variables thus indicate the best state that can be reached at each stage, and the best action to select as well.

Constraints

The constraints of our encoding have two purposes. Firstly, they must ensure the consistency of the DP formulation. This is done (1) by setting the initial state variable to the initial DP state (x_1^s = s_1), (2) by linking the state of each stage to the previous one through the transition function (x_{i+1}^s = T(x_i^s, x_i^a)), and finally (3) by enforcing each transition to be valid, in the sense that it can only generate a feasible state of the system. Secondly, other constraints are added in order to remove dominated actions and the subsequent states. In the CP terminology, such constraints are called redundant constraints: they do not change the semantics of the model, but speed up the search. The constraints inferred by our encoding are as follows, where validityCondition and dominanceCondition are Boolean functions detecting non-valid transitions and dominated actions, respectively.

x_1^s = s_1                                                   (Setting initial state)
x_{i+1}^s = T(x_i^s, x_i^a)            ∀ i ∈ {1, ..., n}      (Enforcing transitions)
validityCondition(x_i^s, x_i^a)        ∀ i ∈ {1, ..., n}      (Keeping valid transitions)
¬ dominanceCondition(x_i^s, x_i^a)     ∀ i ∈ {1, ..., n}      (Pruning dominated states)
Objective function

The goal is to maximize the accumulated sum of rewards generated through the transitions during the n stages: maximize Σ_{i=1}^{n} R(x_i^s, x_i^a). Note that the optimization and the branching selection are done only on the decision variables (x_i^a).
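Read operationally, the encoding chains the state variables through T while rejecting invalid or dominated transitions and summing the rewards. The solver-independent Python sketch below illustrates this reading (the released model is written for Gecode, not in this form); every function name here is a placeholder.

```python
def evaluate_assignment(actions, s1, transition, reward, is_valid, is_dominated):
    """Mirror of the CP encoding: chain the state variables through T and
    accumulate R, failing as soon as a validity or dominance condition is violated."""
    state, objective = s1, 0                 # x_s[1] = s_1 (initial-state constraint)
    for a in actions:                        # one decision variable x_a[i] per stage
        if not is_valid(state, a) or is_dominated(state, a):
            return None                      # constraint violated: assignment unfeasible
        objective += reward(state, a)        # contribution to the CP objective
        state = transition(state, a)         # x_s[i+1] = T(x_s[i], x_a[i])
    return objective                         # value to maximize over feasible assignments
```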

2.6 Search Strategy

From the same DP formulation, we are able to (1) build an RL environment in order to learn the best actions to perform, and (2) state a CP model of the same problem. This consistency is at the heart of the framework. This section shows how the knowledge learned during the training phase can be transferred into the CP search. We considered three standard CP search strategies: depth-first branch-and-bound search (BaB) and iterative limited discrepancy search (ILDS), which are able to leverage knowledge learned with a value-based method such as DQN, and restart-based search (RBS), which works together with policy gradient methods. The remainder of this section presents how to plug a model learned with DQN inside the BaB-search of a CP solver. The description of the two other contributions (ILDS with DQN, and RBS with PPO) is available in Appendices C and D.

2.6.1 Depth-First Branch-and-Bound Search with DQN

This search works in a depth-first fashion. When a feasible solution has been found, a new constraint ensuring that the next solution has to be better than the current one is added. In case of an unfeasible solution due to an empty domain, the search is backtracked to the previous decision. With this procedure, and provided that the whole search space has been explored, the last solution found is proven to be optimal. This search requires a good heuristic for the value-selection. This can be achieved by a value-based RL agent, such as DQN. After the training, the agent gives a parametrized state-action value function Q̂(s, a; w), and the greedy policy argmax_a Q̂(s, a; w) can be used as the value-selection heuristic, which is intended to be of high, albeit non-optimal, quality. The variable ordering must follow the same order as in the DP model in order to keep the consistency between both encodings. As highlighted in other works cappart2019improving , an appropriate variable ordering has an important impact when solving DPs. However, such an analysis goes beyond the scope of this work.

Pre: Q^p is a COP having a DP formulation.
Pre: w is a trained weight vector.
⟨X, D, C, O⟩ := encodeCP(Q^p)
Δ := initBaBSearch(⟨X, D, C, O⟩)
K := ∅   (cache of predicted Q-values)
while Δ is not completed do
       s := encodeStateRL(Δ)
       x := takeFirstNonAssignedVar(X)
       if s ∈ K then
              v := argmax_{u ∈ D(x)} K[s](u)
       else
              K[s] := Q̂(s, ·, w);  v := argmax_{u ∈ D(x)} Q̂(s, u, w)
       end if
       Δ := branchAndUpdate(Δ, x, v)
end while
return bestSolution(Δ)
Algorithm 1 BaB-DQN Search Procedure.

The complete search procedure (BaB-DQN) is presented in Algorithm 1, taking as input a COP Q^p and a pre-trained model with weight vector w. First, the optimization problem is encoded into a CP model. Then, a new BaB-search is initialized and executed on the generated CP model. While the search is not completed, an RL state is obtained from the current CP state (encodeStateRL). The first non-assigned variable of the DP is selected and is assigned to the value maximizing the state-action value function Q̂(s, a; w). All the search mechanisms inherent to a CP solver but not related to our contribution (propagation, backtracking, etc.) are abstracted in the branchAndUpdate function. Finally, the best solution found during the search is returned. We enriched this procedure with a cache mechanism (K). During the search, it happens that similar states are reached more than once chu2010automatically . In order to avoid recomputing the Q-values, one can store the Q-values related to a state already computed and reuse them if the state is reached again. In the worst case, all the action-value combinations have to be tested, which gives the time upper bound O(d^n), where n is the number of actions of the DP model and d is the maximal domain size. Note that this bound is standard for a CP solver. As the algorithm is based on DFS, the worst-case space complexity is O(n × d + |K|), where |K| is the cache size.
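A schematic Python rendering of this procedure is given below. The solver object and its methods (is_completed, encode_state, first_unassigned, domain, branch, best_solution), as well as q_net.predict, are hypothetical names used for illustration; they do not come from the released code.

```python
def bab_dqn_search(solver, q_net):
    """Depth-first branch-and-bound where the value-selection heuristic is the
    greedy policy of a trained DQN, with a cache of already-predicted Q-values."""
    cache = {}                                   # RL states must be hashable
    while not solver.is_completed():
        state = solver.encode_state()            # RL state built from the current CP state
        var = solver.first_unassigned()          # follow the DP stage ordering
        if state not in cache:
            cache[state] = q_net.predict(state)  # dict: domain value -> estimated Q-value
        q_values = cache[state]
        value = max(solver.domain(var),
                    key=lambda v: q_values.get(v, float("-inf")))  # greedy value-selection
        solver.branch(var, value)                # propagation/backtracking handled by the solver
    return solver.best_solution()
```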

3 Experimental Results

The goal of the experiments is to evaluate the efficiency of the framework for computing solutions of challenging COPs having a DP formulation. To do so, our three learning-based search procedures (BaB-DQN, ILDS-DQN, RBS-PPO) are compared with a standard CP formulation (CP-model), stand-alone RL algorithms (DQN, PPO), and industrial solvers. Two NP-hard problems are considered in the main manuscript: the travelling salesman problem with time windows (TSPTW), involving non-linear constraints, and the 4-moments portfolio optimization problem (PORT), which has a non-linear objective. In order to ease future research in this field and to ensure reproducibility, the implementation, the models, the results, and the hyper-parameters used are released with the permissive MIT open-source license. The algorithms used for training have been implemented in Python, and Pytorch paszke2019pytorch is used for designing the neural networks. The library DGL wang2019deep is used for implementing graph embeddings, and SetTransformer lee2018set for set embeddings. The CP solver used is Gecode team2008gecode , which has the benefit of being open-source and of offering a lot of freedom for designing new search procedures. As Gecode is implemented in C++, an interoperability interface with the Python code is required; it is provided by Pybind11 jakob2017pybind11 . Training time is limited to 48 hours, memory consumption to 32 GB, and 1 GPU (Tesla V100-SXM2-32GB) is used per model. A new model is recorded every 100 episodes of the RL algorithm, and the model achieving the best average reward on a validation set of 100 instances, generated in the same way as the training set, is selected. The final evaluation is done on 100 other instances (still randomly generated in the same manner) using an Intel Xeon E5-2650 CPU with 32 GB of RAM and a time limit of 60 minutes.

3.1 Travelling Salesman Problem with Time Windows (TSPTW)

Detailed information about this case study and the baselines used for comparison is provided in Appendix E. In short, OR-Tools is an industrial solver developed by Google, PPO uses a beam-search decoding of width 64, and CP-nearest solves the DP formulation with CP, but without the learning part; a nearest-insertion heuristic is used for the value-selection instead. Results are summarized in Table 1. First of all, we can observe that OR-Tools, CP-model, and DQN are significantly outperformed by the hybrid approaches. Good results are nevertheless achieved by CP-nearest and PPO. We observe that the former is better at proving optimality, whereas the latter is better at discovering feasible solutions. However, when the size of the instances increases, both methods have more difficulty solving the problem and are also outperformed by the hybrid methods, which are efficient both at finding solutions and at proving optimality. Among the hybrid approaches, we observe that the DQN-based searches give the best results, both in finding solutions and in proving optimality.

We also note that caching the predictions is useful. Indeed, the learned heuristics are costly to use: the execution time to finish the search is larger when the cache is disabled. For comparison, the average execution time of a value-selection without caching is 34 milliseconds for BaB-DQN (100 cities), and goes down to 0.16 milliseconds when caching is enabled. For CP-nearest, the average time is 0.004 milliseconds. It is interesting to see that, even though it is significantly slower than the hand-crafted heuristic, the hybrid approach is able to give the best results.

Type | Name | 20 cities (Success / Opt. / Time) | 50 cities (Success / Opt. / Time) | 100 cities (Success / Opt. / Time)
Constraint programming | OR-Tools | 100 / 0 / < 1 | 0 / 0 / t.o. | 0 / 0 / t.o.
Constraint programming | CP-model | 100 / 100 / < 1 | 0 / 0 / t.o. | 0 / 0 / t.o.
Constraint programming | CP-nearest | 100 / 100 / < 1 | 99 / 99 / 6 | 0 / 0 / t.o.
Reinforcement learning | DQN | 100 / 0 / < 1 | 0 / 0 / < 1 | 0 / 0 / < 1
Reinforcement learning | PPO | 100 / 0 / < 1 | 100 / 0 / 5 | 21 / 0 / 46
Hybrid (no cache) | BaB-DQN | 100 / 100 / < 1 | 100 / 99 / 2 | 100 / 52 / 20
Hybrid (no cache) | ILDS-DQN | 100 / 100 / < 1 | 100 / 100 / 2 | 100 / 53 / 39
Hybrid (no cache) | RBS-PPO | 100 / 100 / < 1 | 100 / 80 / 12 | 100 / 0 / t.o.
Hybrid (with cache) | BaB-DQN | 100 / 100 / < 1 | 100 / 100 / < 1 | 100 / 91 / 15
Hybrid (with cache) | ILDS-DQN | 100 / 100 / < 1 | 100 / 100 / 1 | 100 / 90 / 15
Hybrid (with cache) | RBS-PPO | 100 / 100 / < 1 | 100 / 99 / 2 | 100 / 11 / 32

Table 1: Results for TSPTW. Success reports the number of instances for which at least one solution has been found (out of 100), Opt. reports the number of instances for which optimality has been proven (out of 100), and Time reports the average execution time to complete the search (in minutes), computed only over the instances where the search has been completed; t.o. (timeout) is reported when the search could not be completed for any instance.

3.2 4-Moments Portfolio Optimization (PORT)

Detailed information about this case study is proposed in Appendix F. In short, Knitro and APOPT are two general non-linear solvers. Given that the problem is non-convex, these solvers are not able to prove optimality, as they may be stuck in local optima. The results are summarized in Table 2. Let us first consider the continuous case. For the smallest instances, we observe that BaB-DQN, ILDS-DQN, and CP-model achieve the best results, although only BaB-DQN has been able to prove optimality for all the instances. For larger instances, the non-linear solvers achieve the best results, but are nevertheless closely followed by RBS-PPO. When the coefficients of the variables are floored (discrete case), the objective function is not continuous anymore, making the problem harder for non-linear solvers, which often exploit information from derivatives during the solving process. Such a variant is not supported by APOPT. Interestingly, the hybrid approaches do not suffer from this limitation, as no assumption on the DP formulation is made beforehand. Indeed, ILDS-DQN and BaB-DQN achieve the best results for the smallest instances and RBS-PPO for the larger ones.

Type | Name | Continuous, 20 items (Sol. / Opt.) | Continuous, 50 items | Continuous, 100 items | Discrete, 20 items (Sol. / Opt.) | Discrete, 50 items | Discrete, 100 items
Non-linear solver | KNITRO | 343.79 / 0 | 1128.92 / 0 | 2683.55 / 0 | 211.60 / 0 | 1039.25 / 0 | 2635.15 / 0
Non-linear solver | APOPT | 342.62 / 0 | 1127.71 / 0 | 2678.48 / 0 | - | - | -
Constraint programming | CP-model | 356.49 / 98 | 1028.82 / 0 | 2562.59 / 0 | 359.81 / 100 | 1040.30 / 0 | 2575.64 / 0
Reinforcement learning | DQN | 306.71 / 0 | 879.68 / 0 | 2568.31 / 0 | 309.17 / 0 | 882.17 / 0 | 2570.81 / 0
Reinforcement learning | PPO | 344.95 / 0 | 1123.18 / 0 | 2662.88 / 0 | 347.85 / 0 | 1126.06 / 0 | 2665.68 / 0
Hybrid (with cache) | BaB-DQN | 356.49 / 100 | 1047.13 / 0 | 2634.33 / 0 | 359.81 / 100 | 1067.37 / 0 | 2641.22 / 0
Hybrid (with cache) | ILDS-DQN | 356.49 / 1 | 1067.20 / 0 | 2639.18 / 0 | 359.81 / 100 | 1084.21 / 0 | 2652.53 / 0
Hybrid (with cache) | RBS-PPO | 356.35 / 0 | 1126.09 / 0 | 2674.96 / 0 | 359.69 / 0 | 1129.53 / 0 | 2679.57 / 0

Table 2: Results for PORT. Sol. reports the best average objective profit reached, and Opt. reports the number of instances for which optimality has been proven (out of 100).

4 Discussion and Limitations

First of all, let us highlight that this work is not the first attempt to use ML for guiding the decision process of combinatorial optimization solvers NIPS2014_5495 . According to the survey and taxonomy of Bengio et al. bengio2018machine , this kind of approach belongs to the third class (machine learning alongside optimization algorithms) of ML approaches for solving COPs. It is for instance the case of gasse2019exact , which proposes to augment branch-and-bound procedures using imitation learning. However, that approach requires supervised learning and is limited to (integer) linear problems. The differences with our work are that (1) we focus on COPs modelled as a DP, and (2) the training is entirely based on RL. Thanks to CP, the framework can solve a large range of problems, such as the TSPTW, involving non-linear combinatorial constraints, or the portfolio optimization problem, involving a non-linear objective function. Note that we nevertheless need a generator of instances, or enough historical data from the same distribution, in order to train the models. Besides its expressiveness, and in contrast to most related works solving the problem end-to-end bello2016neural ; kool2018attention ; deudon2018learning ; joshi2019efficient , our approach is able to deal with problems where finding a feasible solution is difficult and is able to provide optimality proofs. This was considered by Bengio et al. as an important challenge in learning-based methods for combinatorial optimization bengio2018machine .

In most situations, experiments show that our approach can obtain more and better solutions than the other methods with a smaller execution time. However, they also highlight that resorting to a neural network prediction is an expensive operation to perform inside a solver, as it has to be called numerous times during the solving process. It is currently a bottleneck, especially if we would like to consider larger instances. This is why caching, despite being a simple mechanism, is important. Another possibility is to reduce the complexity of the neural network by compressing its knowledge, which can for instance be done using knowledge distillation hinton2015distilling or by building a more compact equivalent network serra2020lossless . Note that the Pybind11 binding between the Python and C++ code is also a source of inefficiency. Another solution would be to implement the whole framework in a single, efficient, and sufficiently expressive programming language.

5 Conclusion

The goal of combinatorial optimization is to find an optimal solution among a finite set of possibilities. There are many practical and industrial applications of COPs, and efficiently solving them directly results in a better utilization of resources and a reduction of costs. However, since the number of possibilities grows exponentially with the problem size, solving is often intractable for large instances. In this paper, we propose a hybrid approach, based on both deep reinforcement learning and constraint programming, for solving COPs that can be formulated as a dynamic program. To do so, we introduced an encoding able to express a DP model as both a reinforcement learning environment and a constraint programming model. The learning part can then be carried out with reinforcement learning, and the solving part with constraint programming. The experiments carried out on the travelling salesman problem with time windows and the 4-moments portfolio optimization problem show that this framework is competitive with standard approaches and industrial solvers for instances up to 100 variables. These results suggest that the framework may be a promising new avenue for solving challenging combinatorial optimization problems.

References

Appendix A Technical Background on Reinforcement Learning

Let ⟨S, A, T, R⟩ be a tuple representing a deterministic agent-environment couple, where S is the set of states of the environment, A is the set of actions that the agent can take, T : S × A → S is the transition function leading the agent from one state to another given the action taken, and R : S × A → ℝ is the reward function of taking an action from a specific state. The behavior of an agent is defined by a policy π : S → A, describing the action to be taken given a specific state. The goal of an agent is to learn a policy maximizing the accumulated sum of rewards (eventually discounted) during its lifetime, defined by a sequence of states s_1, ..., s_T, with T the episode length. Such a sequence is called an episode, where s_T is the terminal state. The return after time step t is defined as follows: G_t = Σ_{k=t}^{T-1} R(s_k, a_k). Note that we omitted the standard discounting factor, because all decisions have the same importance when solving COPs that do not have a temporal aspect.

In a deterministic environment, the quality of taking an action a from a state s under a policy π is defined by the action-value function Q^π(s, a). Similarly, the return of a state is defined by the state-value function V^π(s). The problem is to find a policy that maximizes the final return: Q*(s, a) = max_π Q^π(s, a) and V*(s) = max_π V^π(s). However, the number of combinations increases exponentially with the number of states and actions, which makes this problem intractable. There exists in the literature a tremendous number of approaches for solving this problem; many of the most successful ones are based on deep learning [36] and are referred to as deep reinforcement learning (DRL) [3].

Apart from hybrid approaches such as Monte-Carlo tree search [14] and model-based approaches that are dedicated to learning the environment [40, 24], there exist two families of reinforcement learning algorithms: (1) value-based methods, where the goal is to learn an action-value function and then to select the actions with high values, and (2) policy-based methods, where the policy is learned directly.

In this paper, we consider two algorithms: deep Q-learning (DQN) [39], a value-based method, and proximal policy optimization (PPO) [50], a policy-gradient-based method. The idea of DQN is to use a deep neural network to approximate the action-value function and to use it through a Q-learning algorithm [62]. This provides an estimator Q̂(s, a; w), where w is a learned weight vector parametrizing Q̂. The greedy policy π(s) = argmax_a Q̂(s, a; w) is then used for selecting the actions. In practice, the standard DQN algorithm is improved by several mechanisms in order to speed up learning and to increase stability [44, 49].

Conversely, the goal of policy gradient methods, such as PPO, is to learn an estimation of the policy π_w that maximizes J(w), the expected return of the initial state. In practice, the gradient is often computed using the policy gradient theorem [53]: ∇_w J(w) = E_π[ G_t ∇_w log π_w(a_t | s_t) ]. The expectation indicates the empirical average over a set of samples obtained by following the current policy π_w. This theorem provides a form suited to improving the policy through gradient ascent on the weight vector w. Unlike the DQN algorithm, the policy obtained with policy gradient methods keeps a stochastic aspect. Based on the policy gradient theorem, the main idea of PPO is to avoid parameter updates that change the policy too much during the gradient ascent, which empirically results in improved stability during training.
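As a concrete illustration of the value-based update (not the paper's training code), the undiscounted one-step Q-learning targets for a batch of transitions can be computed as follows, assuming a hypothetical q_net module that maps a batch of states to per-action values and a Boolean feasibility mask:

```python
import torch

def q_learning_targets(q_net, rewards, next_states, next_masks, done):
    """Undiscounted one-step targets r + max_a' Q(s', a'); unfeasible next
    actions are masked out and terminal states contribute no future value."""
    with torch.no_grad():
        next_q = q_net(next_states)                              # [batch, n_actions]
        next_q = next_q.masked_fill(~next_masks, float("-inf"))  # feasibility mask
        best_next = next_q.max(dim=1).values
        best_next = torch.where(done, torch.zeros_like(best_next), best_next)
    return rewards + best_next                                   # no discount factor
```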

Appendix B Technical Background on Constraint Programming

Constraint programming (CP) [48] is a general framework proposing simple and efficient algorithmic solutions to COPs. Like mixed integer programming (MIP), and in contrast to heuristic and local search methods, CP is a complete approach for solving COPs, which can prove the optimality of the solutions found. Considering, without loss of generality, a maximization problem, a CP model is a tuple ⟨X, D, C, O⟩, where X is the set of variables, D is the set of domains containing the possible values of the variables, C is the set of constraints restricting the assignments of values to variables, and O is an objective function. The goal is to choose a value for each variable of X from D that satisfies all the constraints of C and that maximizes the objective O.

Finding feasible solutions essentially consists of an exhaustive enumeration of all the possible assignments of values to variables until the best solution has been found. The search tree grows exponentially with the number of variables during this process, meaning that the complete enumeration of solutions is intractable for large problems. To cope with this issue, the CP search is first improved by a mechanism called propagation, which is responsible for reducing the number of possible combinations and thereby the search tree size. Given the current domains of the variables and a constraint c, the propagation of c removes from the domains the values that cannot be part of the final solution because they would violate c. This process is repeated at each domain change and for each constraint until no more values can be removed from the domains. This procedure is commonly referred to as the fix-point algorithm. The efficiency and the success of a CP solver are mainly due to the quality of its propagators, i.e., the algorithms performing the propagation for each constraint, the variety of available constraints, and their expressiveness. A tremendous amount of work has been dedicated to the implementation of efficient propagators for constraints [58, 12, 13]. Among them, global constraints [47] are constraints that capture a relation between a non-fixed number of variables. At the time of writing, the global constraints catalog reports more than 400 different existing global constraints [6]. For instance, a well-known global constraint is allDifferent(X), enforcing each variable in X to take a different value [46]. Other examples are the set constraints [2], which enable users to build variables having a set structure and to state constraints on them (union, intersection, difference, etc.), or the circuit constraint [31], enforcing a valid Hamiltonian tour across a set of variables. Unlike MIP solvers, which are restricted to linear constraints, one asset of CP is its ability to handle any kind of constraint, even non-linear ones.

The search is commonly done in a depth-first fashion, together with branch-and-bound [35]. When a feasible solution has been found, a new constraint ensuring that the next solution has to be better than the current one is added. In case of an unfeasible solution due to an empty domain, the search is backtracked to the previous decision. With this procedure, and provided that the whole search space has been explored, the last solution found is proven to be optimal. A drawback of this strategy is that backtracking occurs at the last levels of the search tree; hence, the early choices are rarely reconsidered. Note, however, that CP offers great flexibility in the search procedure and allows other kinds of searches, such as limited discrepancy search [25] or strategies based on restarts. The last design choices when using CP are the variable-selection and value-selection heuristics: during the search, which variable must be considered first when branching, and, similarly, which value should be tested first? As a guideline, the most promising values should be assigned first. Although crucial for constraint satisfaction problems, the variable ordering has a smaller impact on the search when dealing with COPs and is not analyzed in this work. The choice of an appropriate value-ordering heuristic is non-trivial, and its design is a hot topic in the CP community [19, 34].
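A toy illustration of the fix-point principle in pure Python is given below; real propagators are far more sophisticated, and the two propagators shown (for x < y and y < 3) are simple assumptions made for the example.

```python
def fix_point(domains, propagators):
    """Repeatedly apply every propagator until no domain can be reduced further."""
    changed = True
    while changed:
        changed = False
        for propagate in propagators:
            new_domains = propagate(domains)
            if new_domains != domains:
                domains, changed = new_domains, True
            if any(len(d) == 0 for d in domains.values()):
                return domains               # failure: some variable has an empty domain
    return domains

# Toy propagators for the constraints x < y and y < 3 over integer domains.
less_than = lambda d: {**d, "x": {v for v in d["x"] if v < max(d["y"], default=-1)},
                            "y": {v for v in d["y"] if v > min(d["x"], default=99)}}
bound_y = lambda d: {**d, "y": {v for v in d["y"] if v < 3}}

print(fix_point({"x": {0, 1, 2, 3}, "y": {0, 1, 2, 3}}, [less_than, bound_y]))
```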

Appendix C Iterative Limited Discrepancy Search with DQN

Iterative limited discrepancy search (ILDS) [25] is a search strategy commonly used when we have a good prior on the quality of the value selection heuristic used for driving the search. The idea is to restrict the number of decisions deviating from the heuristic choices (i.e., a discrepancy) by a threshold. By doing so, the search will explore a subset of solutions that are likely to be good according to the heuristic while giving a chance to reconsider the heuristic selection which may be sub-optimal. This mechanism is often enriched with a procedure that iteratively increases the number of discrepancies allowed once a level has been fully explored.

As ILDS requires a good heuristic for the value-selection, it is complementary with a value-based RL agent, such as DQN. After the training, the agent gives a parametrized state-action value function Q̂(s, a; w), and the greedy policy argmax_a Q̂(s, a; w) can be used as the value-selection heuristic, which is intended to be of high, albeit non-optimal, quality. The variable ordering must follow the same order as in the DP model in order to keep the consistency between both encodings.

The complete search procedure we designed (ILDS-DQN) is presented in Algorithm 2, taking as input a COP Q^p, a pre-trained model with weight vector w, and an iteration threshold I for the ILDS. First, the optimization problem is encoded into a CP model. Then, for each number of discrepancies allowed, a new search is initialized and executed on the generated CP model. While the search is not completed, an RL state is obtained from the current CP state (encodeStateRL). The first non-assigned variable of the DP is selected and is assigned to the value maximizing the state-action value function Q̂(s, a; w). All the search mechanisms inherent to a CP solver but not related to our contribution (propagation, backtracking, etc.) are abstracted in the branchAndUpdate function. Finally, the best solution found during the search is returned. The cache mechanism (K) introduced for the BaB search is reused. The worst-case bounds are the same as in Algorithm 1: O(d^n) for the time complexity and O(n × d + |K|) for the space complexity, where n is the number of actions of the DP model, d is the maximal domain size, and |K| is the cache size.

Pre: Q^p is a COP having a DP formulation.
Pre: w is a trained weight vector.
Pre: I is the threshold of the iterative LDS.
⟨X, D, C, O⟩ := encodeCP(Q^p)
K := ∅   (cache of predicted Q-values)
for d from 0 to I do
       Δ := initILDSSearch(⟨X, D, C, O⟩, d)
       while Δ is not completed do
              s := encodeStateRL(Δ)
              x := takeFirstNonAssignedVar(X)
              if s ∈ K then
                     v := argmax_{u ∈ D(x)} K[s](u)
              else
                     K[s] := Q̂(s, ·, w);  v := argmax_{u ∈ D(x)} Q̂(s, u, w)
              end if
              Δ := branchAndUpdate(Δ, x, v)
       end while
end for
return bestSolution(Δ)
Algorithm 2 ILDS-DQN Search Procedure.
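To make the discrepancy mechanism concrete, the recursive sketch below enumerates assignments depth-first while allowing at most max_disc deviations from the heuristic's preferred value; it is a generic illustration with hypothetical callback names, not the Gecode implementation.

```python
def ilds(assignment, variables, domain, heuristic_value, max_disc, on_solution):
    """Depth-first search allowing at most `max_disc` deviations from the heuristic."""
    depth = len(assignment)
    if depth == len(variables):
        on_solution(dict(assignment))
        return
    var = variables[depth]
    preferred = heuristic_value(assignment, var)        # e.g. argmax of the DQN Q-values
    for value in sorted(domain(var), key=lambda v: v != preferred):  # preferred value first
        discrepancy = int(value != preferred)
        if discrepancy <= max_disc:
            assignment[var] = value
            ilds(assignment, variables, domain, heuristic_value,
                 max_disc - discrepancy, on_solution)
            del assignment[var]
```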

Appendix D Restart-Based Search with PPO

Restart-based search (RBS) is another search strategy, which involves multiple restarts to enforce a suitable level of exploration. The idea is to execute the search, to stop it when a given threshold is reached (i.e., execution time, number of nodes explored, number of failures, etc.), and to restart it. Such a procedure works only if the search has some randomness in it, or if new information is added along the search runs. Otherwise, the exploration will only consider similar sub-trees. A popular design choice is to schedule the restart on the Luby sequence [38], using the number of failures for the threshold, and branch-and-bound for creating the search tree.

The sequence starts with a threshold of 1. Each next part of the sequence is the entire previous sequence, with the last value of the previous sequence doubled. With a size of 15, the Luby sequence is the following: 1, 1, 2, 1, 1, 2, 4, 1, 1, 2, 1, 1, 2, 4, 8. The sequence can also be scaled by a factor σ, multiplying each element. As controlled randomness is a key component of this search, it can naturally be used with a policy parametrized by a policy gradient algorithm. By doing so, the heuristic randomly selects a value among the feasible ones, according to the probability distribution of the policy obtained through a softmax function. It is also possible to control the exploration level by tuning the softmax function with a standard Boltzmann temperature τ. The complete search process is depicted in Algorithm 3. Note that the cache mechanism is reused in order to store the vector of action probabilities for a given state. The worst-case bounds are the same as in Algorithm 1: O(d^n) for the time complexity and O(n × d + |K|) for the space complexity, where n is the number of actions of the DP model, d is the maximal domain size, and |K| is the cache size.

Pre: Q^p is a COP having a DP formulation.
Pre: w is a trained weight vector.
Pre: R is the number of restarts to do.
Pre: σ is the Luby scale factor.
Pre: τ is the softmax temperature.
⟨X, D, C, O⟩ := encodeCP(Q^p)
K := ∅   (cache of predicted action probabilities)
for i from 1 to R do
       Δ := initBaBSearch(⟨X, D, C, O⟩, failureLimit := σ × luby(i))
       while Δ is not completed do
              s := encodeStateRL(Δ)
              x := takeFirstNonAssignedVar(X)
              if s ∈ K then
                     p := K[s]
              else
                     p := softmax(π_w(s, ·) / τ);  K[s] := p
              end if
              v := sample u ∈ D(x) with probability p(u)
              Δ := branchAndUpdate(Δ, x, v)
       end while
end for
return bestSolution(Δ)
Algorithm 3 RBS-PPO Search Procedure.
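The Luby schedule and the temperature-controlled softmax selection can be reproduced with a few lines of Python; the recursive Luby definition is standard, while boltzmann_choice is a simplified stand-in for sampling from the actor's output.

```python
import math
import random

def luby(i: int) -> int:
    """i-th term (1-indexed) of the Luby sequence: 1,1,2,1,1,2,4,1,1,2,1,1,2,4,8,..."""
    k = 1
    while 2 ** k - 1 < i:
        k += 1
    if i == 2 ** k - 1:
        return 2 ** (k - 1)
    return luby(i - (2 ** (k - 1) - 1))

def boltzmann_choice(values, scores, temperature):
    """Sample one of `values` with probability softmax(scores / temperature)."""
    weights = [math.exp(s / temperature) for s in scores]
    return random.choices(values, weights=weights, k=1)[0]

print([luby(i) for i in range(1, 16)])   # [1, 1, 2, 1, 1, 2, 4, 1, 1, 2, 1, 1, 2, 4, 8]
```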

Appendix E Travelling Salesman Problem with Time Windows (TSPTW)

The travelling salesman problem with time windows (TSPTW) is an extension of the standard travelling salesman problem (TSP). Given an instance of n customers, it consists in finding a minimum-cost Hamiltonian tour starting and ending at a given depot and visiting all customers (or, similarly, cities). Each customer i is defined by a position (x_i and y_i in the 2D case) and a time window [l_i, u_i] defining the time slot in which it can be visited. Each customer can and must be visited once, and the travel time between two customers i and j is defined by d(i, j). Note that a customer can be visited before the beginning of its time window but, in this case, the service has to wait until the window opens. No customer can be serviced after its time window, and solutions that fail to serve all the customers are considered infeasible. Finally, there is no time window associated with the depot. The goal is to minimize the sum of the travel distances. Although the search space of the TSPTW is smaller than that of the TSP, the time constraints make the problem more difficult to solve in practice [57].

e.1 Dynamic Programming Model

Given an instance of n customers, the DP model has n+1 stages, where the last state corresponds to the solution obtained. Without loss of generality, we assume that the depot is associated with the first customer (i = 1). A state s_i at stage i is composed of three entities: (1) the set of remaining customers that still have to be visited (M_i, an element of the powerset of all the customers, not including the depot), (2) the last customer that has been serviced (p_i), and (3) the current time (t_i). An action a_i performed at stage i corresponds to the service of customer a_i. The reward is, in fact, a penalty and corresponds to the negated travel time between two customers (-d(p_i, a_i)). Note that an additional penalty for coming back to the depot (-d(p_{n+1}, 1)) must also be considered. The DP model we built is as follows.

s_1 = ({2, ..., n}, 1, 0)                                         (Initial state definition)
M_{i+1} = M_i \ {a_i}                                             (Transition function for M)
p_{i+1} = a_i                                                     (Transition function for p)
t_{i+1} = max( t_i + d(p_i, a_i), l_{a_i} )                       (Transition function for t)
a_i ∈ M_i                                                         (First validity condition)
t_i + d(p_i, a_i) ≤ u_{a_i}                                       (Second validity condition)
∄ j ∈ M_{i+1} such that t_{i+1} + d(a_i, j) > u_j                 (Dominance pruning)

The initial state enforces starting at the depot at time 0 and requires that all the customers (except the depot) be visited. When an action a_i is done (i.e., servicing a customer), the state is updated as follows: (1) the customer does not have to be visited anymore, (2) it becomes the last visited customer, and (3) the current time is updated according to the problem definition. An action is valid if it satisfies two validity conditions: (1) the customer must be in the set of non-visited customers, and (2) given the current time, it must be possible to visit the customer without exceeding its time window. The non-mandatory dominance rule prunes actions leading to a state in which the time window of at least one remaining customer is already exceeded, as such a state cannot yield a feasible tour. By doing so, the search space is reduced. Finally, the objective function is to minimize the sum of the travel times.
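The model above translates almost directly into code; the sketch below applies one DP transition, assuming an instance given as a distance function d and a dictionary tw of time windows (illustrative representations, not the released implementation).

```python
def tsptw_step(state, action, d, tw):
    """One TSPTW DP transition: returns the next state, or None if the action
    is invalid or dominated (i.e., it cannot lead to a feasible tour)."""
    must_visit, last, time = state
    lb, ub = tw[action]
    arrival = time + d(last, action)
    if action not in must_visit or arrival > ub:            # the two validity conditions
        return None
    next_time = max(arrival, lb)                            # wait if arriving too early
    remaining = must_visit - {action}
    if any(next_time + d(action, j) > tw[j][1] for j in remaining):
        return None                                         # dominated: some window already lost
    return (remaining, action, next_time)
```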

e.2 Instance Generation

For an instance of n customers, the coordinates x_i and y_i are sampled uniformly at random in a square grid. The rounded 2D Euclidean distance is used for the travel time between two locations; the largest distance separating two customers is then the length of the grid diagonal. Time windows are also randomly generated, but we ensure that the selected values allow at least one feasible solution. To do so, we generate the time windows as follows. Let L be the maximal time window length allowed and G the maximal gap between two consecutive time windows. We first generate a random permutation of all the customers; it constitutes the feasible solution we want to preserve. The time windows are then computed sequentially along this tour, starting from the depot: the window of each customer is placed around its arrival time in the tour, obtained from the service time of the previous customer plus the distance between the two consecutive customers, with the window length bounded by L and the gap between consecutive windows bounded by G. It is important to note that this feasible solution is not known by the solvers, which only receive the customer coordinates and the time window bounds. Without loss of generality, the values 100 and 10 are used for these two parameters.

e.3 Neural Architecture

A TSPTW instance can naturally be represented by a fully connected graph, where the vertices are the customers and the (weighted) edges express the distances between them. Therefore, the features representing an instance should reflect the combinatorial structure of the graph. To do so, a graph attention network (GAT) [59] is used in order to produce an embedding for each node of the graph. A few fully connected layers are then added on top of the embedding. For the DQN case, the dimension of the last layer output corresponds to the number of possible actions (i.e., the number of customers), and the network outputs an estimation of the Q-values for each of them. Concerning PPO, two distinct networks are built, one for the actor and one for the critic. A GAT embedding is also used and is similar for both. The critic uses a max pooling between the final node embedding and the fully connected layers, and the last layer of the critic outputs only a single value. Concerning the actor, it is similar to the DQN case, but a softmax selection is used after the last layer in order to obtain a probability of selecting each action. Note that the outputs of the softmax and of the DQN network are masked in order to allow only the valid actions to be selected. Implementation is done using Pytorch [43] and DGL [61]. Each customer is represented by 4 static and 2 dynamic features. The static features are the coordinates of the nodes (x_i, y_i) and the lower and upper bounds of the time window (l_i, u_i). The dynamic features are related to the DP model: both are binary values, indicating whether the customer still has to be serviced and whether the customer is the last one that has been serviced. Edges have only a single feature, namely the distance between the two customers (d(i, j)). All the features are then divided by their highest possible value. The hyper-parameters used are summarized in Table 3. The ranges have been determined using this reference (https://github.com/llSourcell/Unity_ML_Agents/blob/master/docs) and the values have been fixed by grid searches on subsets of parameters in order to keep the number of combinations tractable. The selection has been made when training on the 20-cities case, and the hyper-parameters were reused for the larger instances, except for the softmax temperature, which has been tuned for each size.

Parameter | DQN-based methods (20 / 50 / 100 cities) | PPO-based method (20 / 50 / 100 cities)
Batch size | 32 / 32 / 32 | 64 / 64 / 64
Learning rate | 0.0001 / 0.0001 / 0.0001 | 0.0001 / 0.0001 / 0.0001
# GAT layers | 4 / 4 / 4 | 4 / 4 / 4
Embedding dimension | 32 / 32 / 32 | 128 / 128 / 128
# Hidden layers | 2 / 2 / 2 | 4 / 4 / 4
Hidden layers dimension | 32 / 32 / 32 | 128 / 128 / 128
Reward scaling | 0.001 / 0.001 / 0.001 | 0.001 / 0.001 / 0.001
Softmax temperature | 10 / 10 / 10 | - / - / -
n-step (episode length) | 20 / 50 / 100 | - / - / -
Entropy value | - / - / - | 0.001 / 0.001 / 0.001
Clipping value | - / - / - | 0.1 / 0.1 / 0.1
Update timestep | - / - / - | 2048 / 2048 / 2048
# Epochs per update | - / - / - | 3 / 3 / 3
RBS temperature | - / - / - | 20 / 20 / 20
Luby scaling factor | - / - / - | 128 / 128 / 128

Table 3: Hyper-parameter values for TSPTW models.

e.4 Baselines used for Comparison

OR-Tools

A hybrid model based on constraint programming and local search using the OR-Tools solver. We reuse the VRPTW example of the documentation (https://developers.google.com/optimization/routing/vrptw), with only one vehicle.

DQN

Solution obtained using the standard DQN algorithm with the greedy selection policy.

PPO

Best solution obtained using PPO with a beam-search decoding of size 64.

CP-Model

A standard constraint programming formulation (http://www.hakank.org/minizinc/tsptw.mzn) that leverages global constraints (allDifferent, circuit, and increasing). It is solved with Gecode [55].

Nearest

A model based on the above DP formulation that uses the ILDS search strategy with a value-selection based on the closest next customer instead of the DQN prediction. It is also modelled and solved with Gecode [55].

Appendix F 4-Moments Portfolio Optimization Problem (PORT)

In the 4-moments portfolio optimization problem (PORT) [4, 11], an investor has to select a combination of investments that provides the best trade-off between the expected return and different measures of financial risk. Given a set of n investments, each with a specific cost (c_i), an expected return (μ_i), a standard deviation (σ_i), a skewness (γ_i), and a kurtosis (κ_i), the goal is to find a portfolio with an expected return and a skewness as large as possible, and a variance and a kurtosis as small as possible, provided that the total investment cost stays below the budget B. Besides, the importance of each financial characteristic is weighted (λ1, λ2, λ3, and λ4) according to the preferences of the investor. Let x_i be a binary variable associated with each investment, indicating whether or not the investment is included in the portfolio. The standard problem is expressed as follows:

maximize    λ1 Σ_{i=1}^{n} μ_i x_i  -  λ2 ( Σ_{i=1}^{n} σ_i² x_i )^(1/2)  +  λ3 ( Σ_{i=1}^{n} γ_i³ x_i )^(1/3)  -  λ4 ( Σ_{i=1}^{n} κ_i⁴ x_i )^(1/4)        (2)
subject to  Σ_{i=1}^{n} c_i x_i ≤ B,    x_i ∈ {0, 1}   ∀ i ∈ {1, ..., n}

Note that this is a discrete non-linear programming problem and that the objective function, taken as is, is non-convex. Solving this problem using an integer programming solver is non-trivial and requires advanced decomposition methods [11]. Another option is to use a general non-linear solver such as Knitro [60] or APOPT [27]. However, as this formulation is non-convex, such solvers will not be able to prove optimality; to do so, a convex reformulation of the problem would be required. In this work, we also consider a discrete variant of this problem, where the floor function is applied to all the roots of Equation (2). By doing so, all the coefficients are integers. This variant is especially hard for general non-linear solvers, as it breaks the continuity of the objective function and increases the risk of getting stuck in a poor local optimum.
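Evaluating the objective of Equation (2), as reconstructed above, for a given selection vector is straightforward; the helper below also covers the discrete variant in which each root term is floored. The data and weights in the usage line are arbitrary illustrative numbers.

```python
import math

def port_objective(x, mu, sig, skew, kurt, lambdas, discrete=False):
    """4-moments portfolio objective for a binary selection vector x."""
    l1, l2, l3, l4 = lambdas
    terms = [sum(m * xi for m, xi in zip(mu, x)),                      # expected return
             sum(s ** 2 * xi for s, xi in zip(sig, x)) ** 0.5,         # deviation term
             sum(g ** 3 * xi for g, xi in zip(skew, x)) ** (1 / 3),    # skewness term
             sum(k ** 4 * xi for k, xi in zip(kurt, x)) ** (1 / 4)]    # kurtosis term
    if discrete:                                   # discrete variant: floor the root terms
        terms[1:] = [math.floor(t) for t in terms[1:]]
    return l1 * terms[0] - l2 * terms[1] + l3 * terms[2] - l4 * terms[3]

value = port_objective([1, 0, 1], mu=[12, 8, 15], sig=[3, 2, 4],
                       skew=[1, 2, 1], kurt=[2, 1, 3], lambdas=(1, 5, 5, 5))
```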

f.1 Dynamic Programming Model

Given an instance of n items, the DP model has n+1 stages, where the last state corresponds to the final solution obtained. The idea of the DP model is to consider each investment successively (one per stage) and to decide whether it must be inserted into the portfolio. A state s_i at stage i only consists of the accumulated cost of the investments already included in the portfolio. An action a_i ∈ {0, 1} performed at stage i corresponds to the selection, or not, of investment i. The reward corresponds to the final objective value (Equation (2)) provided that we are at the last stage (i.e., all the variables have been assigned); otherwise, it is equal to zero. Thus, only the final reward is considered. The DP model, with a validity condition ensuring that the budget is never exceeded, is as follows:

s_1 = 0                                      (Initial state definition)
s_{i+1} = s_i + c_i × a_i                    (Transition function)
s_i + c_i × a_i ≤ B                          (Validity condition)

f.2 Instance Generation

Instances are generated in a similar fashion to [4, 11]. For an instance of n investments, the costs (c_i) and the expected returns (μ_i) are sampled uniformly at random between 0 and 100. The maximum budget B and the sampling ranges of the other financial terms (σ_i, γ_i, and κ_i) are set relative to these values, and fixed weights λ1, λ2, λ3, and λ4 are used; the exact values are available with the released code.

f.3 Neural Architecture

Unlike the TSPTW, the portfolio optimization problem has no graph structure. For this reason, we have considered for this problem an architecture based on sets (SetTransformer) [37], which has the benefit of being permutation invariant: processing the items of an instance in any order produces the same output. SetTransformer first produces an embedding for each item. Then, a few fully connected layers are added on top of the embedding to produce the output. As for the TSPTW, the architecture differs slightly according to the RL algorithm used (DQN or PPO). Implementation is done using Pytorch, and the SetTransformer code proposed in [37] is reused. Each item is represented by 5 static and 4 dynamic features. The static features are the cost (c_i) and the coefficients of each item in the objective function (Equation (2)): μ_i, σ_i, γ_i, and κ_i. The dynamic features are related to the current state of the DP model. The first three are binary values indicating (1) whether the investment has already been considered by the DP model, (2) whether the investment is the one considered at the current stage, and (3) whether inserting the investment would exceed the budget. The last feature indicates the remaining budget, i.e., the initial budget minus the accumulated costs. All non-binary features are finally divided by their highest possible value. The hyper-parameters used are summarized in Table 4, and the selection follows the same procedure as for the TSPTW case study. Note that the embedding dimension of 40 has been reused from the initial SetTransformer implementation.

Parameter | DQN-based methods (20 / 50 / 100 items) | PPO-based method (20 / 50 / 100 items)
Batch size | 64 / 32 / 32 | 128 / 32 / 64
Learning rate | 0.00001 / 0.00001 / 0.00001 | 0.0001 / 0.0001 / 0.0001
Embedding dimension | 40 / 40 / 40 | 40 / 40 / 40
# Hidden layers | 2 / 2 / 2 | 2 / 2 / 2
Hidden layers dimension | 128 / 128 / 128 | 128 / 128 / 128
Reward scaling | 0.001 / 0.001 / 0.001 | 0.001 / 0.001 / 0.001
Softmax temperature | 10 / 2 / 10 | - / - / -
n-step | 20 / 50 / 100 | - / - / -
Entropy value | - / - / - | 0.001 / 0.001 / 0.001
Clipping value | - / - / - | 0.1 / 0.1 / 0.1
Update timestep | - / - / - | 2048 / 2048 / 2048
# Epochs per update | - / - / - | 4 / 4 / 4
RBS temperature | - / - / - | 1 / 1 / 1
Luby scaling factor | - / - / - | 1 / 1 / 1

Table 4: Hyper-parameter values for PORT models.

f.4 Baselines used for Comparison

Knitro

Knitro is a commercial software package for solving large-scale nonlinear mathematical optimization problems [60]. The parameters used are the default ones of the solver.

APOPT

APOPT is another solver able to solve large scale optimization problems, such as mixed integer non-linear programs. We used this solver with the GEKKO python interface [5]. The parameters used are the default ones.

DQN

Solution obtained using the standard DQN algorithm with the greedy selection policy.

PPO

Best solution obtained using PPO with a beam-search decoding of size 64.

CP-Model

A CP formulation of the model presented in Equation (2). It is solved with Gecode [55].