1 Introduction
The design of efficient algorithms for solving NPhard problems, such as combinatorial optimization problems (COPs), has long been an active field of research wolsey1999integer . Broadly speaking, there exist two main families of approaches for solving COPs, each of them having pros and cons. On the one hand, exact algorithms are based on a complete and clever enumeration of the solutions space lawler1966branch ; rossi2006handbook . Such algorithms will eventually find the optimal solution, but they may be prohibitive for solving large instances because of the exponential increase of the execution time. That being said, welldesigned exact algorithms can nevertheless be used to obtain suboptimal solutions by interrupting the search before its termination. This flexibility makes exact methods appealing and practical, and as such they constitute the core of modern optimization solvers as CPLEX cplex2009v12 , Gurobi optimization2014inc , or Gecode team2008gecode . It is the case of constraint programming (CP) rossi2006handbook , which has the additional asset to be a generic tool that can be used to solve a large variety of COPs, whereas mixed integer programming (MIP) benichou1971experiments solvers only deal with linear problems and limited nonlinear cases. A critical design choice in CP is the branching strategy, i.e., directing how the search space must be explored. Naturally, welldesigned heuristics are more likely to discover promising solutions, whereas bad heuristics may bring the search into a fruitless subpart of the solution space. In general, the choice of an appropriate branching strategy is nontrivial and their design is a hot topic in the CP community palmieri2016parallel ; fages2017making ; laborie2018objective .
On the other hand, heuristic algorithms aarts2003local ; gendreau2005metaheuristics are incomplete methods that can compute solutions efficiently, but are not able to prove the optimality of a solution. They also often require substantial problemspecific knowledge for building them. In the last years, deep reinforcement learning (DRL) sutton1998introduction ; arulkumaran2017brief has shown its promise to obtain highquality approximate solutions to some NPhard COPs bello2016neural ; khalil2017learning ; deudon2018learning ; kool2018attention . Once a model has been trained, the execution time is typically negligible in practice. The good results obtained suggest that DRL is a promising new tool for finding efficiently good approximate solutions to NPhard problems, provided that (1) we know the distribution of problem instances and (2) that we have enough instances sampled from this distribution for training the model. Nonetheless, current methods have shortcomings. Firstly, they are mainly dedicated to solve a specific problem, as the travelling salesman problem (TSP), with the noteworthy exception of Khalil et al. khalil2017learning that tackle three other graphbased problems, and of Kool et al. kool2018attention that target routing problems. Secondly, they are only designed to act as a constructive heuristic, and come with no systematic ways to improve the solutions obtained, unlike complete methods, such as CP.
As both exact approaches and learningbased heuristics have strengths and weaknesses, a natural question arises: How can we leverage these strengths together in order to build a better tool to solve combinatorial optimization problems ? In this work, we show that it can be successfully done by the combination of reinforcement learning and constraint programming, using dynamic programming as a bridge between both techniques. Dynamic programming (DP) bellman1966dynamic , which has found successful applications in many fields godfrey2002adaptive ; topaloglou2008dynamic ; tang2017near ; ghasempour2019adaptive , is an important technique for modelling COPs. In its simplest form, DP consists in breaking a problem into subproblems that are linked together through a recursive formulation (i.e., the wellknown Bellman equation). The main issue with exact DP is the socalled curse of dimensionality: the number of generated subproblems grows exponentially, to the point that it becomes infeasible to store all of them in memory.
This paper proposes a generic and complete solver, based on DRL and CP, in order to solve COPs that can be modelled using DP. To the best of our knowledge, it is the first work that proposes to embed a learned heuristic directly inside a CP solver. Our detailed contributions are as follows: (1) A new encoding able to express a DP model of a COP into a RL environment and a CP model; (2) The use of two standard RL training procedures, deep Qlearning and proximal policy optimization, for learning an appropriate CP branching strategy. The training is done using randomly generated instances sampled from a similar distribution to those we want to solve; (3) The integration of the learned branching strategies on three CP search strategies, namely branchandbound, iterative limited discrepancy search and restart based search; (4) Promising results on two challenging COPs, namely the travelling salesman problem with time windows, and the 4moments portfolio optimization
; (5) The opensource release of our code and models, in order to ease the future research in this field
^{1}^{1}1https://github.com/qcappart/hybridcprlsolver.In general, as there are no underlying hypothesis such as linearity or convexity, a DP cannot be trivially encoded and solved by standard integer programming techniques bergman2018discrete . It is one of the reasons that drove us to consider CP for the encoding. The next section presents the hybrid solving process that we designed. Then, experiments on the two case studies are carried out. Finally, a discussion on the current limitations of the approach and the next research opportunities are proposed.
2 A Unifying Representation Combining Learning and Searching
Because of the statespace explosion, solving NPhard COPs remains a challenge. In this paper, we propose a generic and complete solver, based on DRL and CP, in order to solve COPs that can be modelled using DP. This section describes the complete architecture of the framework we propose. A highlevel picture of the architecture is shown in Figure 1. It is divided into three parts: the learning phase, the solving phase and the unifying representation, acting as a bridge between the two phases. Each part contains several components. Green blocks and arrows represent the original contributions of this work and blue blocks corresponds to known algorithms that we adapted for our framework.
2.1 Dynamic Programming Model
Dynamic programming (DP) bellman1966dynamic is a technique combining both mathematical modeling and computer programming for solving complex optimization problems, such as NPhard problems. In its simplest form, it consists in breaking a problem into subproblems and to link them through a recursive formulation. The initial problem is then solved recursively, and the optimal values of the decision variables are recovered successively by tracking back the information already computed. Let us consider a general COP , where with are discrete variables that must be assigned in order to maximize a function . In the DP terminology, the decision variables of are referred to as the controls (). They take value from their domain , and enforce a transition () from a state () to another one () where is the set of states. The initial state () is known and a transition is done at each stage () until all the variables have been assigned. Besides, a reward () is induced after each transition. Finally, a DP model can also contain validity conditions () and dominance rules () that restrict the set of feasible actions. The difference between both is that validity conditions are mandatory to ensure the correctness of the DP model () whereas the dominance rules are only used for efficiency purposes (), where , , and represent the equivalence, the implication, and the unfeasible state, respectively. A DP model for a COP can then be modelled as a tuple . The problem can be solved recursively using Bellman Equation, where is a statevalue function representing the optimal reward of being at state at stage :
(1) 
The reward is equal to zero for the final state () and is backtracked until has been computed. This last value gives the optimal cost of . Then, by tracing the values assigned to the variables , the optimal solution is recovered. Unfortunately, DP suffers from the wellknown curse of dimensionality, which prevents its use when dealing with problems involving large state/control spaces. A partial solution to this problem is to prune dominated actions (). An action is dominated if it is valid according to the recursive formulation, but is (1) either strictly worse than another action, or (2) it cannot lead to a feasible solution. In practice, pruning such dominated actions can have a huge impact on the size of the search space, but identifying them is not trivial as assessing those two conditions precisely is problemdependent. Besides, even after pruning the dominated actions, the size of the stateaction space may still be too large to be completely explored in practice.
2.2 RL Encoding
An introduction to reinforcement learning is proposed in Appendix A. Note that all the sets used to define an RL environment are written using a larger size font. Encoding the DP formulation into a RL environment requires to define, adequately, the set of states, the set of actions, the transition function, and the reward function, as the tuple from the DP model and a specific instance of the COP that we are considering. The initial state of the RL environment corresponds to the first stage of the DP model, where no variable has been assigned yet.
State
For each stage of the DP model, we define the RL state as the pair , where is the DP state at the same stage , and is the problem instance we are considering. Note that the second part of the state () is dynamic, as it depends on the current stage in the DP model, or similarly, to the current timestep of the RL episode, whereas the first part () is static
as it remains the same for the whole episode. In practice, each state is embedded into a tensor of features, as it serves as input of a neural network.
Action
Given a state from the DP model at stage and its control , an action at a state has a onetoone relationship with the control . The action can be done if and only if is valid under the DP model. The idea is to allow only actions that are consistent with regards to the DP model, the validity conditions, and the eventual dominance conditions. Formally, the set of feasible actions at stage are as follows:
Transition
The RL transition gives the state from and in the same way as the transition function of the DP model gives a state from a previous state and a control value . Formally, we have the deterministic transition: .
Reward
An initial idea for designing the RL reward function is to use the reward function of the DP model using the current state and the action that has been selected. However, performing a sequence of actions in a DP subject to validity conditions can lead to a state with no solutions, which must be avoided. Such a situation happens when a state with no action is reached whereas at least one control has not been assigned to a value . Finding first a feasible solution must then be prioritized over maximizing the DP reward and is not considered with this simple form of the RL reward. Based on this, two properties must be satisfied in order to ensure that the reward will drive the RL agent to the optimal solution of the COP: (1) the reward collected through an episode must be lesser than the reward of an episode if the COP solution of is worse than the one obtained with , and (2) the total reward collected through an episode giving an unfeasible solution must be lesser than the reward of any episode giving a feasible solution. By doing so, we ensure that the RL agent has incentive to find, first, feasible solutions, and, then, finding the best ones. The reward we designed is as follows : ; where corresponds to an upper bound of the objective value that can be reached for the COP . The term is a constant factor that gives a strict upper bound on the reward of any solution of the DP and drives the agent to progress into a feasible solution first. The absolute value ensures that the term is positive and is used to negate the effect of negative rewards that may lead the agent to stop the episode as soon as possible. The second term forces then the agent to find the best feasible solution. Finally, a scaling factor can also be added in order to compress the space of rewards into a smaller interval value near zero. Note that for DP models having only feasible solutions, the first term can be omitted.
2.3 Learning Algorithm
We implemented two different agents, one based on a valuebased method (DQN) and a second one based on policy gradient (PPO). In both cases, the agent is used to parametrize the weight vector (
w) of a neural network giving either the Qvalues (DQN), or the policy probabilities (PPO). The training is done using randomly generated instances sampled from a similar distribution to those we want to solve. It is important to mention that this learning procedure makes the assumption that we have a generator able to create random instances (
) that follows the same distribution that the ones we want to tackle, or a sufficient number of similar instances from past data. Such an assumption is common in the vast majority of works tackling NPhard problems using ML khalil2017learning ; kool2018attention ; cappart2019improving .2.4 Neural Network Architecture
In order to ensure the genericity and the efficiency of the framework, we have two requirements for designing the neural network architecture: (1) be able to handle instances of the same COPs, but that have a different number of variables (i.e., able to operate on nonfixed dimensional feature vectors) and (2) be invariant to input permutations. In other words, encoding variables , , and should produce the same prediction as encoding , , and . A first option is to embed the variables into a set transformer architecture lee2018set , that ensures these two requirements. Besides, many COPs also have a natural graph structure that can be exploited. For such a reason, we also considered another embedding based on graph attention network (GAT) velivckovic2017graph . The embedding, either obtained using GAT or set transformer
, can then be used as an input of a feedforward network to get a prediction. Case studies will show a practical application of both architectures. For the DQN network, the dimension of the last layer output corresponds to the total number of actions for the COP and output an estimation of the
values for each of them. The output is then masked in order to remove the unfeasible actions. Concerning PPO, distinct networks for the actor and the critic are built. The last layer on the critic output only a single value. Concerning the actor, it is similar as the DQN case but a softmax selection is used after the last layer in order to obtain the probability to select each action.2.5 CP Encoding
An introduction to constraint programming is proposed in Appendix B. Note that the teletype font is used to refer to CP notations. This section describes how a DP formulation can be encoded in a CP model. Modeling a problem using CP consists in defining the tuple where X is the set of variables, is the set of domains, C is the set of constraints, and O is the objective function. Let us consider the DP formulation with also the number of stages.
Variables and domains
We make a distinction between the decision variables, on which the search is performed, and the auxiliary variables that are linked to the decision variables, but that are not branched on during the search. The encoding involves two variables per stage: (1) is an auxiliary variable representing the current state at stage whereas (2) is a decision variable representing the action done at this state, similarly to the regular decomposition pesant2004regular . Besides, a last auxiliary variable is considered for the stage , which represents the final state of the system. In the optimal solution, the variables thus indicate the best state that can be reached at each stage, and the best action to select as well.
Constraints
The constraints of our encoding have two purposes. Firstly, they must ensure the consistency of the DP formulation. It is done (1) by setting the initial state to a value (e.g., ), (2) by linking the state of each stage to the previous one through the transition function (), and finally (3) by enforcing each transition to be valid, in the sense that they can only generate a feasible state of the system. Secondly, other constraints are added in order to remove dominated actions and the subsequent states. In the CP terminology, such constraints are called redundant constraint, they do not change the semantic of the model, but speedup the search. The constraints inferred by our encoding are as follows, where validityCondition and dominanceCondition are both Boolean functions detecting nonvalid transitions and dominated actions, respectively.
(Setting initial state)  
(Enforcing transitions)  
(Keeping valid transitions)  
(Pruning dominated states) 
Objective function
The goal is to maximize the accumulated sum of rewards generated through the transition () during the stages: . Note that the optimization and branching selection is done only on the decision variables ().
2.6 Search Strategy
From the same DP formulation, we are able to (1) build a RL environment in order to learn the best actions to perform, and (2) state a CP model of the same problem. This consistency is at the heart of the framework. This section shows how the knowledge learned during the training phase can be transferred into the CP search. We considered three standard CP specific search strategy: depthfirst branchandbound search (BaB), and iterative limited discrepancy search (ILDS), that are able to leverage knowledge learned with a valuebased method as DQN, and restart based search (RBS), working together with policy gradient methods. The remaining of this section presents how to plug a model learned with DQN inside the BaBsearch of a CP solver. The description of the two other contributions (ILDS with DQN, and RBS with PPO) are available in Appendices C and D.
2.6.1 DepthFirst BranchandBound Search with DQN
This search works in a depthfirst fashion. When a feasible solution has been found, a new constraint ensuring that the next solution has to be better than the current one is added. In case of an unfeasible solution due to an empty domain reached, the search is backtracked to the previous decision. With this procedure, and provided that the whole search space has been explored, the last solution found is then proven to be optimal. This search requires a good heuristic for the valueselection. This can be achieved by a valuebased RL agent, such as DQN. After the training, the agent gives a parametrized stateaction value function , and a greedy policy can be used for the valueselection heuristic, which is intended to be of a high, albeit nonoptimal, quality. The variable ordering must follow the same order as the DP model in order to keep the consistency with both encoding. As highlighted in other works cappart2019improving , an appropriate variable ordering has an important impact when solving DPs. However, such an analysis goes beyond the scope of this work.
The complete search procedure (BaBDQN) is presented in Algorithm 1, taking as input a COP , and a pretrained model with the weight vector w. First, the optimization problem in encoded into a CP model. Then, a new BaBsearch is initialized and executed on the generated CP model. Until the search is not completed, a RL state is obtained from the current CP state (encodeStateRL). The first nonassigned variable of the DP is selected and is assigned to the value maximizing the stateaction value function . All the search mechanisms inherent of a CP solver but not related to our contribution (propagation, backtracking, etc.), are abstracted in the branchAndUpdate function. Finally, the best solution found during the search is returned. We enriched this procedure with a cache mechanism (). During the search, it happens that similar states are reached more than once chu2010automatically . In order to avoid recomputing the Qvalues, one can store the Qvalues related to a state already computed and reuse them if the state is reached again. In the worstcase, all the actionvalue combinations have to be tested. This gives the upper bound , where is the number of actions of the DP model and the maximal domain size. Note that this bound is standard in a CP solver. As the algorithm is based on DFS, the worstcase space complexity is , where is the cache size.
3 Experimental Results
The goal of the experiments is to evaluate the efficiency of the framework for computing solutions of challenging COPs having a DP formulation. To do so, comparisons of our three learningbased search procedures (BaBDQN, ILDSDQN, RBSPPO) with a standard CP formulation (CPmodel), standalone RL algorithms (DQN, PPO), and industrial solvers are performed. Two NPhard problems are considered in the main manuscript: the travelling salesman problem with time windows (TSPTW), involving nonlinear constraints, and the 4moments portfolio optimization problem (PORT), which has a nonlinear objective. In order to ease the future research in this field and to ensure reproducibility, the implementation, the models, the results, and the hyperparameters used are released with the permissive MIT opensource license. Algorithms used for training have been implemented in Python and Pytorch paszke2019pytorch is used for designing the neural networks. Library DGL wang2019deep is used for implementing graph embedding, and SetTransformer lee2018set for set embedding. The CP solver used is Gecode team2008gecode , which has the benefit to be opensource and to offer a lot of freedom for designing new search procedures. As Gecode is implemented in C++, an operability interface with Python code is required. It is done using Pybind11 jakob2017pybind11 . Training time is limited to 48 hours, memory consumption to 32 GB and 1 GPU (Tesla V100SXM232GB) is used per model. A new model is recorded after each 100 episodes of the RL algorithm and the model achieving the best average reward on a validation set of 100 instances generated in the same way as for the training is selected. The final evaluation is done on 100 other instances (still randomly generated in the same manner) using Intel Xeon E52650 CPU with 32GB of RAM and a time limit of 60 minutes.
3.1 Travelling Salesman Problem with Time Windows (TSPTW)
Detailed information about this case study and the baselines used for comparison is proposed in Appendix E. In short, ORTools is an industrial solver developed by Google, PPO uses a beamsearch decoding of width 64, and CPnearest solves the DP formulation with CP, but without the learning part. A nearest insertion heuristic is used for the valueselection instead. Results are summarized in Table 1. First of all, we can observe that ORTools, CPmodel, and DQN are significantly outperformed by the hybrid approaches. Good results are nevertheless achieved by CPnearest, and PPO. We observe that the former is better to prove optimality, whereas the latter is better to discover feasible solutions. However, when the size of instances increases, both methods have more difficulties to solve the problem and are also outperformed by the hybrid methods, which are both efficient to find solutions and to prove optimality. Among the hybrid approaches, we observe that DQNbased searches give the best results, both in finding solutions and in proving optimality.
We also note that caching the predictions is useful. Indeed, the learned heuristics are costly to use, as the execution time to finish the search is larger when the cache is disabled. For comparison, the average execution time of a valueselection without caching is 34 milliseconds for BaBDQN (100 cities), and goes down to 0.16 milliseconds when caching is enabled. For CPnearest, the average time is 0.004 milliseconds. It is interesting to see that, even being significantly slower than the heuristic, the hybrid approach is able to give the best results.
max width= Approaches 20 cities 50 cities 100 cities Type Name Success Opt. Time Success Opt. Time Success Opt. Time Constraint programming ORTools 100 0 < 1 0 0 t.o. 0 0 t.o. CPmodel 100 100 < 1 0 0 t.o. 0 0 t.o. CPnearest 100 100 < 1 99 99 6 0 0 t.o. Reinforcement learning DQN 100 0 < 1 0 0 < 1 0 0 < 1 PPO 100 0 < 1 100 0 5 21 0 46 Hybrid (no cache) BaBDQN 100 100 < 1 100 99 2 100 52 20 ILDSDQN 100 100 < 1 100 100 2 100 53 39 RBSPPO 100 100 < 1 100 80 12 100 0 t.o. Hybrid (with cache) BaBDQN 100 100 < 1 100 100 < 1 100 91 15 ILDSDQN 100 100 < 1 100 100 1 100 90 15 RBSPPO 100 100 < 1 100 99 2 100 11 32
3.2 4Moments Portfolio Optimization (PORT)
Detailed information about this case study is proposed in Appendix F. In short, Knitro and APOPT are two general nonlinear solvers. Given that the problem is nonconvex, these solvers are not able to prove optimality as they may be blocked in local optima. The results are summarized in Table 2. Let us first consider the continuous case. For the smallest instances, we observe that BaBDQN, ILDSDQN, and CPmodel achieve the best results, although only BaBDQN has been able to prove optimality for all the instances. For larger instances, the nonlinear solvers achieve the best results, but are nevertheless closely followed by RBSPPO. When the coefficients of variables are floored (discrete case), the objective function is not continuous anymore, making the problem harder for nonlinear solvers, which often exploit information from derivatives for the solving process. Such a variant is not supported by APOPT. Interestingly, the hybrid approaches do not suffer from this limitation, as no assumption on the DP formulation is done beforehand. Indeed, ILDSDQN and BaBDQN achieve the best results for the smallest instances and RBSPPO for the larger ones.
max width= Approaches Continuous coefficients Discrete coefficients 20 items 50 items 100 items 20 items 50 items 100 items Type Name Sol. Opt. Sol. Opt. Sol. Opt. Sol. Opt. Sol. Opt. Sol. Opt. Nonlinear solver KNITRO 343.79 0 1128.92 0 2683.55 0 211.60 0 1039.25 0 2635.15 0 APOPT 342.62 0 1127.71 0 2678.48 0       Constraint programming CPmodel 356.49 98 1028.82 0 2562.59 0 359.81 100 1040.30 0 2575.64 0 Reinforcement learning DQN 306.71 0 879.68 0 2568.31 0 309.17 0 882.17 0 2570.81 0 PPO 344.95 0 1123.18 0 2662.88 0 347.85 0 1126.06 0 2665.68 0 Hybrid (with cache) BaBDQN 356.49 100 1047.13 0 2634.33 0 359.81 100 1067.37 0 2641.22 0 ILDSDQN 356.49 1 1067.20 0 2639.18 0 359.81 100 1084.21 0 2652.53 0 RBSPPO 356.35 0 1126.09 0 2674.96 0 359.69 0 1129.53 0 2679.57 0
4 Discussion and Limitations
First of all, let us highlight that this work is not the first one attempting to use ML for guiding the decision process of combinatorial optimization solvers NIPS2014_5495 . According to the survey and taxonomy of Bengio et al. bengio2018machine , this kind of approach belongs to the third class (Machine learning alongside optimization algorithms) of ML approaches for solving COPs. It is for instance the case of gasse2019exact
, which propose to augment branchandbound procedures using imitation learning. However, their approach requires supervised learning and is only limited to (integer) linear problems. The differences we have with this work are that (1) we focus on COPs modelled as a DP, and (2) the training is entirely based on RL. Thanks to CP, the framework can solve a large range of problems, as the TSPTW, involving nonlinear combinatorial constraints, or the portfolio optimization problem, involving a nonlinear objective function. Note that we nevertheless need a generator of instances, or enough historical data of the same distribution, in order to train the models. Besides its expressiveness, and in contrast to most of the related works solving the problem endtoend
bello2016neural ; kool2018attention ; deudon2018learning ; joshi2019efficient , our approach is able to deal with problems where finding a feasible solution is difficult and is able to provide optimality proofs. This was considered by Bengio et al. as an important challenge in learningbased methods for combinatorial optimization bengio2018machine .In most situations, experiments show that our approach can obtain more and better solutions than the other methods with a smaller execution time. However, they also highlighted that resorting to a neural network prediction is an expensive operation to perform inside a solver, as it has to be called numerous times during the solving process. It is currently a bottleneck, especially if we would like to consider larger instances. It is why caching, despite being a simple mechanism, is important. Another possibility is to reduce the complexity of the neural network by compressing its knowledge, which can for instance be done using knowledgedistillation hinton2015distilling or by building a more compact equivalent network serra2020lossless . Note that the Pybind11 binding between the Python and C++ code is also a source of inefficiency. Another solution would be to implement the whole framework into a single, efficient, and expressive enough, programming language.
5 Conclusion
The goal of combinatorial optimization is to find an optimal solution among a finite set of possibilities. There are many practical and industrial applications of COPs, and efficiently solving them directly results in a better utilization of resources and a reduction of costs. However, since the number of possibilities grows exponentially with the problem size, solving is often intractable for large instances. In this paper, we propose a hybrid approach, based on both deep reinforcement learning and constraint programming, for solving COPs that can be formulated as a dynamic program. To do so, we introduced an encoding able to express a DP model into a reinforcement learning environment and a constraint programming model. Then, the learning part can be carried out with reinforcement learning, and the solving part with constraint programming. The experiments carried out on the travelling salesman problem with time windows and the 4moments portfolio optimization problem show that this framework is competitive with standard approaches and industrial solvers for instances up to 100 variables. These results suggest that the framework may be a promising new avenue for solving challenging combinatorial optimization problems.
References
 [1] Emile Aarts and Jan Karel Lenstra. Local search in combinatorial optimization. Princeton University Press, 2003.
 [2] Alexander Aiken. Set constraints: Results, applications and future directions. In International Workshop on Principles and Practice of Constraint Programming, pages 326–335. Springer, 1994.
 [3] Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath. A brief survey of deep reinforcement learning. CoRR, abs/1708.05866, 2017.
 [4] Alper Atamtürk and Vishnu Narayanan. Polymatroids and meanrisk minimization in discrete optimization. Operations Research Letters, 36(5):618–622, 2008.
 [5] Logan DR Beal, Daniel C Hill, R Abraham Martin, and John D Hedengren. Gekko optimization suite. Processes, 6(8):106, 2018.
 [6] Nicolas Beldiceanu, Mats Carlsson, and JeanXavier Rampon. Global constraint catalog, 2010.
 [7] Richard Bellman. Dynamic programming. Science, 153(3731):34–37, 1966.
 [8] Irwan Bello, Hieu Pham, Quoc V Le, Mohammad Norouzi, and Samy Bengio. Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940, 2016.
 [9] Yoshua Bengio, Andrea Lodi, and Antoine Prouvost. Machine learning for combinatorial optimization: a methodological tour d’horizon. arXiv preprint arXiv:1811.06128, 2018.

[10]
Michel Bénichou, JeanMichel Gauthier, Paul Girodet, Gerard Hentges, Gerard
Ribière, and O Vincent.
Experiments in mixedinteger linear programming.
Mathematical Programming, 1(1):76–94, 1971.  [11] David Bergman and Andre A Cire. Discrete nonlinear optimization by statespace decompositions. Management Science, 64(10):4700–4720, 2018.
 [12] Christian Bessiere. Arcconsistency and arcconsistency again. Artificial intelligence, 65(1):179–190, 1994.
 [13] Christian Bessiere. Constraint propagation. In Foundations of Artificial Intelligence, volume 2, pages 29–83. Elsevier, 2006.
 [14] Cameron B Browne, Edward Powley, Daniel Whitehouse, Simon M Lucas, Peter I Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in games, 4(1):1–43, 2012.
 [15] Quentin Cappart, Emmanuel Goutierre, David Bergman, and LouisMartin Rousseau. Improving optimization bounds using machine learning: Decision diagrams meet deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 1443–1451, 2019.
 [16] Geoffrey Chu, Maria Garcia de La Banda, and Peter J Stuckey. Automatically exploiting subproblem equivalence in constraint programming. In International Conference on Integration of Artificial Intelligence (AI) and Operations Research (OR) Techniques in Constraint Programming, pages 71–86. Springer, 2010.
 [17] IBM ILOG Cplex. V12. 1: User’s manual for cplex. International Business Machines Corporation, 46(53):157, 2009.
 [18] Michel Deudon, Pierre Cournut, Alexandre Lacoste, Yossiri Adulyasak, and LouisMartin Rousseau. Learning heuristics for the tsp by policy gradient. In International conference on the integration of constraint programming, artificial intelligence, and operations research, pages 170–181. Springer, 2018.
 [19] JeanGuillaume Fages and Charles Prud’Homme. Making the first solution good! In 2017 IEEE 29th International Conference on Tools with Artificial Intelligence (ICTAI), pages 1073–1077. IEEE, 2017.

[20]
Maxime Gasse, Didier Chételat, Nicola Ferroni, Laurent Charlin, and Andrea
Lodi.
Exact combinatorial optimization with graph convolutional neural networks.
In Advances in Neural Information Processing Systems, pages 15554–15566, 2019.  [21] Michel Gendreau and JeanYves Potvin. Metaheuristics in combinatorial optimization. Annals of Operations Research, 140(1):189–213, 2005.
 [22] Taha Ghasempour and Benjamin Heydecker. Adaptive railway traffic control using approximate dynamic programming. Transportation Research Part C: Emerging Technologies, 2019.
 [23] Gregory A Godfrey and Warren B Powell. An adaptive dynamic programming algorithm for dynamic fleet management, i: Single period travel times. Transportation Science, 36(1):21–39, 2002.
 [24] David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018.
 [25] William D Harvey and Matthew L Ginsberg. Limited discrepancy search. In IJCAI (1), pages 607–615, 1995.
 [26] He He, Hal Daume III, and Jason M Eisner. Learning to search in branch and bound algorithms. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3293–3301. Curran Associates, Inc., 2014.
 [27] J Hedengren, J Mojica, W Cole, and T Edgar. Apopt: Minlp solver for differential and algebraic systems with benchmark testing. In Proceedings of the INFORMS National Meeting, Phoenix, AZ, USA, volume 1417, page 47, 2012.
 [28] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
 [29] Wenzel Jakob, Jason Rhinelander, and Dean Moldovan. pybind11–seamless operability between c++ 11 and python, 2017.
 [30] Chaitanya K Joshi, Thomas Laurent, and Xavier Bresson. An efficient graph convolutional network technique for the travelling salesman problem. arXiv preprint arXiv:1906.01227, 2019.
 [31] Latife Genç Kaya and John N Hooker. A filter for the circuit constraint. In International Conference on Principles and Practice of Constraint Programming, pages 706–710. Springer, 2006.
 [32] Elias Khalil, Hanjun Dai, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems, pages 6348–6358, 2017.
 [33] Wouter Kool, Herke Van Hoof, and Max Welling. Attention, learn to solve routing problems! arXiv preprint arXiv:1803.08475, 2018.
 [34] Philippe Laborie. Objective landscapes for constraint programming. In International Conference on the Integration of Constraint Programming, Artificial Intelligence, and Operations Research, pages 387–402. Springer, 2018.
 [35] Eugene L Lawler and David E Wood. Branchandbound methods: A survey. Operations research, 14(4):699–719, 1966.
 [36] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436–444, 2015.
 [37] Juho Lee, Yoonho Lee, Jungtaek Kim, Adam R Kosiorek, Seungjin Choi, and Yee Whye Teh. Set transformer: A framework for attentionbased permutationinvariant neural networks. arXiv preprint arXiv:1810.00825, 2018.
 [38] Michael Luby, Alistair Sinclair, and David Zuckerman. Optimal speedup of las vegas algorithms. Information Processing Letters, 47(4):173–180, 1993.
 [39] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
 [40] Anusha Nagabandi, Gregory Kahn, Ronald S Fearing, and Sergey Levine. Neural network dynamics for modelbased deep reinforcement learning with modelfree finetuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7559–7566. IEEE, 2018.
 [41] Gurobi Optimization. Inc.,“gurobi optimizer reference manual,” 2015, 2014.
 [42] Anthony Palmieri, JeanCharles Régin, and Pierre Schaus. Parallel strategies selection. In International Conference on Principles and Practice of Constraint Programming, pages 388–404. Springer, 2016.
 [43] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, highperformance deep learning library. In Advances in Neural Information Processing Systems, pages 8024–8035, 2019.
 [44] Jing Peng and Ronald J Williams. Incremental multistep qlearning. In Machine Learning Proceedings 1994, pages 226–232. Elsevier, 1994.
 [45] Gilles Pesant. A regular language membership constraint for finite sequences of variables. In International conference on principles and practice of constraint programming, pages 482–495. Springer, 2004.
 [46] JeanCharles Régin. A filtering algorithm for constraints of difference in csps. In AAAI, volume 94, pages 362–367, 1994.
 [47] JeanCharles Régin. Global constraints: A survey. In Hybrid optimization, pages 63–134. Springer, 2011.
 [48] Francesca Rossi, Peter Van Beek, and Toby Walsh. Handbook of constraint programming. Elsevier, 2006.
 [49] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. CoRR, abs/1511.05952, 2015.
 [50] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 [51] Thiago Serra, Abhinav Kumar, and Srikumar Ramalingam. Lossless compression of deep neural networks. arXiv preprint arXiv:2001.00218, 2020.
 [52] Richard S Sutton, Andrew G Barto, et al. Introduction to reinforcement learning, volume 135. MIT press Cambridge, 1998.
 [53] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
 [54] Yufei Tang, Chaoxu Mu, and Haibo He. Nearspace aerospace vehicles attitude control based on adaptive dynamic programming and sliding mode control. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 1347–1353. IEEE, 2017.
 [55] Gecode Team. Gecode: Generic constraint development environment, 2006, 2008.
 [56] Nikolas Topaloglou, Hercules Vladimirou, and Stavros A Zenios. A dynamic stochastic programming model for international portfolio management. European Journal of Operational Research, 185(3):1501–1524, 2008.
 [57] John N Tsitsiklis. Special cases of traveling salesman and repairman problems with time windows. Networks, 22(3):263–282, 1992.
 [58] Pascal Van Hentenryck, Yves Deville, and ChohMan Teng. A generic arcconsistency algorithm and its specializations. Artificial intelligence, 57(23):291–321, 1992.
 [59] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
 [60] Richard A Waltz and Jorge Nocedal. Knitro 2.0 user’s manual. Ziena Optimization, Inc.[en ligne] disponible sur http://www. ziena. com (September, 2010), 7:33–34, 2004.
 [61] Minjie Wang, Lingfan Yu, Da Zheng, Quan Gan, Yu Gai, Zihao Ye, Mufei Li, Jinjing Zhou, Qi Huang, Chao Ma, et al. Deep graph library: Towards efficient and scalable deep learning on graphs. arXiv preprint arXiv:1909.01315, 2019.
 [62] Christopher JCH Watkins and Peter Dayan. Qlearning. Machine learning, 8(34):279–292, 1992.
 [63] Laurence A Wolsey and George L Nemhauser. Integer and combinatorial optimization, volume 55. John Wiley & Sons, 1999.
Appendix A Technical Background on Reinforcement Learning
Let be a tuple representing a deterministic couple agentenvironment where S in the set of states in the environment, A is the set of actions that the agent can do, is the transition function leading the agent from one state to another given the action taken, and is the reward function of taking an action from a specific state. The behavior of an agent is defined by a policy , describing the action to be taken given a specific state. The goal of an agent is to learn a policy maximizing the accumulated sum of rewards (eventually discounted) during its lifetime defined by a sequence of states with and the episode length. Such a sequence is called an episode, where is the terminal state. The return after time step is defined as follows: . Note that we omitted the standard discounting factor, because the weight of all decisions have the same importance when solving COPs that do not have a temporal aspect.
In a deterministic environment, the quality of taking an action from a state under a policy is defined by the actionvalue function . Similarly, the return of a state is defined by the statevalue function . The problem is to find a policy that maximizes the final return: and . However, the number of combinations increases exponentially with the number of states and actions, which makes this problem intractable. There exist in the literature a tremendous number of approaches for solving this problem, the many very successful ones are based on deep learning [36] and are referred to as deep reinforcement learning (DRL) [3].
Apart from hybrid approaches such as MonteCarlo tree search [14] and modelbased approaches that are dedicated to learn the environment [40, 24], there exist two families of reinforcement learning algorithms: (1) the valuebased methods, where the goal is to learn an actionvalue function and then to select the actions with high value, and (2) the policybased methods, where the policy is learned directly.
In this paper, we consider two algorithms, deep Qlearning (DQN) [39], a valuebased method, and proximal policy gradient (PPO) [50], a policygradient based method. The idea of DQN is to use a deep neural network to approximate the actionvalue function and to use it through a learning algorithm [62]. This provides an estimator , where w is a learned weight vector parametrizing . The greedy policy is then used for selecting the actions. In practice, the standard DQN algorithm is improved by several mechanisms in order to speedup learning and to increase stability [44, 49].
Conversely, the goal of policy gradient methods, as PPO, is to learn an estimation of the policy that maximizes , the expected return of the initial state. In practice, it is often computed using the policy gradient theorem [53]: . The expectation indicates the empirical average over a set of samples that are obtained by following the current policy . This theorem provides a suited form to improve the policy through gradient ascent, by parametrizing the weight vector w. Unlike the DQN algorithm, the policy obtained with policy gradient methods keeps a stochastic aspect. Based on the policy gradient theorem, the main idea of PPO is to avoid parameter updates that change the policy too much during the gradient ascent. It empirically results in improved stability during training.
Appendix B Technical Background on Constraint Programming
Constraint programming (CP) [48] is a general framework proposing simple and efficient algorithmic solutions to COPs. Such as mixed integer programming (MIP), and in contrast to heuristic and local search methods, CP is a complete approach for solving COPs, which can prove the optimality of the solutions found. Considering a minimization problem, a CP model is a tuple , where X is the set of variables, is the set of domains, that contain possible values for the variables, C is the set of constraints, that restrict assignments of values to variables, and O is an objective function. The goal is to choose a value for each variable of X from D which satisfies all the constraints of C and that maximizes the objective O.
Finding feasible solutions essentially consists of an exhaustive enumeration of all the possible assignments of values to variables until the best solution has been found. The search tree grows up exponentially with the number of variables during this process, meaning that the complete enumeration of solutions is intractable for large problems. To cope with this issue, the CP search is first improved by a mechanism called propagation, having the responsibility to reduce the number of possible combinations and then the search tree size. Given the current domains of the variables and a constraint c, the propagation of c removes from the domains the values that cannot be part of the final solution because of the violation of c. This process is repeated at each domain change and for each constraint until no more value can be removed from the domains. This procedure is commonly referred to as the fixpoint algorithm. The efficiency and the success of a CP solver is mainly due to the quality of their propagators, i.e., the algorithms performing the propagation for each constraint, the variety of available constraints, and their expressiveness. A tremendous amount of work has been dedicated to the implementation of efficient propagators for constraints [58, 12, 13]. Among them, global constraints [47] are constraints that capture a relation between a nonfixed number of variables. At the time of writing, the global constraints catalog reports more than 400 different existing global constraints [6]. For instance, a wellknown global constraint is allDifferent(X), enforcing each variable in X to take a different value [46]. Other examples are the set constraints [2], that enable users to build variables having a set structure and to state constraints on them (union, intersection, difference, etc.), or the circuit constraint [31], enforcing a valid Hamiltonian tour across a set of variables. Then, Unlike MIP solvers that are restricted to linear constraints, one asset of CP is its ability to handle any kind of constraints, even nonlinear.
The search is commonly done in a depthfirst fashion, together with branchandbound [35]. When a feasible solution has been found, a new constraint ensuring that the next solution has to be better than the current one is added. In case of an unfeasible solution due to an empty domain reached, the search is backtracked to the previous decision. With this procedure, and provided that the whole search space has been explored, the last solution found is then proven to be optimal. A drawback of this strategy is that backtracks occurs at the last levels of the tree search. Hence, the early choices are rarely reconsidered. However that CP offers a great flexibility in the search procedure, and allows other kinds of searches such as limited discrepancy search [25] or strategy based on restarts. The last design choices when using CP is the variableselection and valueselection heuristics. During the search, what variable must be considered first when branching ? Similarly, what value should be tested first. As a guideline, the values that are the most promising should be assigned first. Although crucial for constrained satisfaction problems, the impact of the variable ordering has a smaller impact on the search when dealing with COPs and is not analyzed in this work. The choice of an appropriate value ordering heuristic is nontrivial and their design is a hot topic in the CP community [19, 34].
Appendix C Iterative Limited Discrepancy Search with DQN
Iterative limited discrepancy search (ILDS) [25] is a search strategy commonly used when we have a good prior on the quality of the value selection heuristic used for driving the search. The idea is to restrict the number of decisions deviating from the heuristic choices (i.e., a discrepancy) by a threshold. By doing so, the search will explore a subset of solutions that are likely to be good according to the heuristic while giving a chance to reconsider the heuristic selection which may be suboptimal. This mechanism is often enriched with a procedure that iteratively increases the number of discrepancies allowed once a level has been fully explored.
As ILDS requires a good heuristic for the valueselection, it is complementary with a valuebased RL agent, such as DQN. After the training, the agent gives a parametrized stateaction value function , and the greedy policy can be used for the valueselection heuristic, which is intended to be of a high, albeit nonoptimal, quality. The variable ordering must follow the same order as the DP model in order to keep the consistency with both encoding.
The complete search procedure we designed (ILDSDQN) is presented in Algorithm 2, taking as input a COP , a pretrained model with the weight vector w, and an iteration threshold for the ILDS. First, the optimization problem in encoded into a CP model. Then, for each number of discrepancies allowed, a new search is initialized and executed on . Until the search is not completed, a RL state is obtained from the current CP state (encodeStateRL). The first nonassigned variable of the DP is selected and is assigned to the value maximizing the stateaction value function . All the search mechanisms inherent of a CP solver but not related to our contribution (propagation, backtracking, etc.), are abstracted in the branchAndUpdate function. Finally, the best solution found during the search is returned. The cache mechanism () introduced for the BaB search is reused. The worstcase bounds are the same as in Algorithm 1: for the time complexity, and for the space complexity, where is the number of actions of the DP model, is the maximal domain size, and is the cache size.
Appendix D RestartBased Search with PPO
Restartbased search (RBS) is another search strategy, which involves multiple restarts to enforce a suitable level of exploration. The idea is to execute the search, to stop it when a given threshold is reached (i.e., execution time, number of nodes explored, number of failures, etc.), and to restart it. Such a procedure works only if the search has some randomness in it, or if new information is added along the search runs. Otherwise, the exploration will only consider similar subtrees. A popular design choice is to schedule the restart on the Luby sequence [38], using the number of failures for the threshold, and branchandbound for creating the search tree.
The sequence starts with a threshold of 1. Each next parts of the sequence is the entire previous sequence with the last value of the previous sequence doubled. With a size of 15, the luby sequence is the following: . The sequence can also be scaled with a factor , multiplying each element. As a controlled randomness is a key component of this search, it can naturally be used with a policy
parametrized with a policy gradient algorithm. By doing so, the heuristic randomly selects a value among the feasible ones, and according to the probability distribution of the policy through a softmax function. It is also possible to control the exploration level by tuning the softmax function with a standard Boltzmann temperature
. The complete search process is depicted in Algorithm 3. Note that the cache mechanism is reused in order to store the vector of action probabilities for a given state. The worstcase bounds are the same as in Algorithm 1: for the time complexity, and for the space complexity, where is the number of actions of the DP model, is the maximal domain size and, is the cache size.Appendix E Travelling Salesman Problem with Time Windows (TSPTW)
The travelling salesman problem with time windows (TSPTW) is an extension of the standard travelling salesman problem (TSP). Given an instance of customers, it consists in finding a minimum cost Hamiltonian tour starting and ending at a given depot and visiting all customers (or similarly, cities). Each customer is defined by a position ( and for the 2D case), and a time window (), defining the time slot where he can be visited. Each customer can and must be visited once and the travel time between two customers and is defined by . Note that a customer can be visited before the beginning of its time windows but, in this case, the service has to wait. No customers can be serviced after its time windows, and solutions that fail to serve all the clients are considered infeasible. Finally, there is no time windows associated to the depot. The goal is to minimize the sum of the travel distances. Although the search space of the TSPTW is smaller than for the TSP, the time constraints make the problem more difficult to solve in practice [57].
e.1 Dynamic Programming Model
Given an instance of customers, the DP model has stages where the last state corresponds to the solution obtained. Without loss of generality, we assume that the depot is associated to the first customer (). A state at stage is composed of three entities: (1) the set of remaining customers that still have to be visited (, with the powerset of all the customers without including the depot), (2) the last customer that has been serviced (), and (3) the current time (). An action performed at stage corresponds to the service of customer . The reward is, in fact, a penalty and corresponds to the travel time between two customers (. Note that an additional penalty for coming back to the depot () must also be considered. The DP model we built is as follows.
(Initial state definition)  
(Transition function for )  
(Transition function for )  
(Transition function for )  
(First validity condition)  
(Second validity condition)  
(Dominance pruning) 
The initial state enforces to start at the depot at time and that all the customers (excepting the depot) must be visited. When an action is done (i.e., servicing a customer), the state is updated as follows: (1) the client does not have to be visited anymore, (2) he becomes the last visited client, and (3) the current time is updated according to the problem definition. A action is valid if it satisfies two validity conditions: (1) he must be in the set of the nonvisited clients, and (2) given the current time, it is possible to visit the client without exceeding the time windows. The nonmandatory dominance rule () removes from all the clients having the time windows exceeded. By doing so, the search space is reduced. Finally, the objective function is to minimize the sum of the travel time.
e.2 Instance Generation
For an instance of customers, the coordinates and are sampled uniformly at random in a grid of size . The rounded 2D Euclidean distances is used for the travel time between two locations. The largest distance separating two customers is then . Time windows are also randomly generated but we ensure that the values selected will allow at least one feasible solution. To do so, we generate the time windows as follows. Let be the maximal time window length allowed and the maximal gap between two consecutive time windows. We first generate a random permutation of all the customers. It constitutes the feasible solution we want to preserve. Then, the time windows are computed as follows: , and , with (i.e., the depot) and the distance between two consecutive customers in the tour. It is important to note that this feasible solution is not known by the solvers, which only receive the customers coordinates and the time windows bounds. Without loss of generality, the values 100 and 10 are used for and .
e.3 Neural Architecture
A TSPTW instance can naturally be represented by a fully connected graph, where the vertices are the customers, and the (weighted) edges express the distances between them. Therefore, the features representing an instance should reflect the combinatorial structure of the graph. To do so, a graph attention network (GAT) [59]
is used in order to produce an embedding for each node of the graph. Few fully connected layers are then added on top of the embedding. For the DQN case, the dimension of the last layer output corresponds to the number of possible actions (i.e., the number of customers) and output an estimation of the Qvalues for each of them. Concerning PPO, two distinct networks are built, one for the actor, and the second one for the critic. A GAT embedding is also used and is similar for both. The critic uses a max pooling between the final node embedding and the fully connected layers. The last layer on the critic outputs only a single value. Concerning the actor, it is similar as the DQN case but a softmax selection is used after the last layer in order to obtain a probability to select each action. Note that the output of the softmax and of the DQN network are masked in order to allow only the valid actions to be selected. Implementation is done using
Pytorch [43] and DGL [61]. Each customer is represented by 4 static and 2 dynamic features. Static features are the coordinates of the nodes (, ) and the lower and upper bounds value of the time windows (, ). The dynamic features are related to the DP model. Both are binary values and indicate if the customer still has to be serviced, and, if the customer if the last one that has been serviced. Edges have only a single feature, namely the distances between two customers (). All the features are then divided by their highest possible value. The hyperparameters used are summarized in Table 3. The ranges have been determined using this reference (https://github.com/llSourcell/Unity_ML_Agents/blob/master/docs) and the values have been fixed by grid searches on subsets of parameters in order to keep the number of combinations tractable. The selection has been made when training on the 20cities case, and the hyperparameters were reused for the larger instances, except for the softmax temperature that has been tuned for each size.max width= Parameter Range tested DQNbased methods PPObased method 20 cities 50 cities 100 cities 20 cities 50 cities 100 cities Batch size 32 32 32 64 64 64 Learning rate 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 # GAT layers 4 4 4 4 4 4 Embedding dimension 32 32 32 128 128 128 # Hidden layers 2 2 2 4 4 4 Hidden layers dimension 32 32 32 128 128 128 Reward scaling 0.001 0.001 0.001 0.001 0.001 0.001 Softmax temperature 10 10 10    nstep Episode length 20 50 100    Entropy value    0.001 0.001 0.001 Cliping value    0.1 0.1 0.1 Udpate timestep    2048 2048 2048
# Epochs per update
   3 3 3 RBS temperature    20 20 20 Luby scaling factor    128 128 128e.4 Baselines used for Comparison
 ORTools

A hybrid model based on constraint programming and local search using ORTools solver. We reuse the VRPTW example of the documentation^{2}^{2}2https://developers.google.com/optimization/routing/vrptw, with only one vehicle.
 DQN

Solution obtained using the standard DQN algorithm with the greedy selection policy.
 PPO

Best solution obtained using PPO with a beamsearch decoding of size 64.
 CPModel

A standard constraint programming formulation^{3}^{3}3http://www.hakank.org/minizinc/tsptw.mzn, that leverages global constraints (allDifferent, circuit, and increasing). It is solved with Gecode [55].
 Nearest

A model based on the above DP formulation and uses the ILDS search strategy with a selection based on the closest next customer instead of the DQN prediction. It is also modelled and solved with Gecode [55],
Appendix F 4Moments Portfolio Optimization Problem (PORT)
In the 4moments portfolio optimization problem (PORT) [4, 11], an investor has to select a combination of investments that provides the best tradeoff between the expected return and different measures of financial risk. Given a set of investments, each with a specific cost (), an expected return (), a standard deviation (), a skewness (), and a kurtosis (
), the goal of the portfolio optimization problem is to find a portfolio with a large positive expected return and skewness, but with a large negative variance and kurtosis, provided that the total investment cost is below its budget
. Besides, the importance of each financial characteristic is weighted (, , , and ) according to the preference of the investor. Letbe a binary variable associated to each investment, indicating whether or not the investment is included in the portfolio. The standard problem is expressed as follows:
maximize  (2)  
subject to  
Note that it is a discrete nonlinear programming problem, and that the objective function, taken asis, is nonconvex. Solving this problem using a integer programming solver is nontrivial and requires advanced decomposition methods [11]. Another option is to use a general nonlinear solver as Knitro [60], or APOPT [27]. However, as this formulation is nonconvex, such solvers will not be able to prove optimality. To do so, a convex reformulation of the problem is required. In this work, we also consider a discrete variant of this problem, where the floor function is applied on all the roots of Equation (2). By doing so, all the coefficients are integers. This variant is especially hard for general nonlinear solvers, as we break the linearity of the objective function and increase the risk of getting a poor local optimum.
f.1 Dynamic Programming Model
Given an instance of items, the DP model has stages where the last state corresponds to the final solution obtained. The idea of the DP model is to consider at each investment successively (one per stage) and to decide if it must be inserted into the portfolio. A state at stage only consists of the current cost of the investments in the portfolio. An action performed at stage corresponds to the selection, or not, of the investment . The reward corresponds to the final objective function value (Equation (2)) provided that we are at the last stage (i.e., all the variables have been assigned). Otherwise, it is equal to zero. Then, only the final reward is considered. The DP model, with a validity condition ensuring that the budget is never exceeded, is as follows:
(Initial state definition)  
(Transition function for )  
(Validity condition) 
f.2 Instance Generation
f.3 Neural Architecture
Unlike the TSPTW, the portfolio optimization problem has no graph structure. For this reason, we have considered for this problem an architecture based on sets (SetTransformer) [37] which has the benefit to be permutation invariant. Then, processing an instance with the items produce the same output as , , or . SetTransformer first produces an embedding for each item. Then, few fully connected layers are added on the top of the embedding to produce the output. Like the TSPTW, the architecture differs slightly according to the RL algorithm used (DQN or PPO). Implementation is done using Pytorch and the SetTransformer code proposed in [37] is reused. Each item is represented by 5 static and 4 dynamic features. Static features are the costs (), and the coefficient of each item according to the objective function (Equation (2)): , , , and . The dynamic features are related to the current state inside the DP model. First, we have three binary values and indicates (1) if the investment has already been considered by the DP model, (2) if the investment is the one that is considered at the current stage, and (3) if the insertion of investment will exceed the budget. Finally, the last feature indicates the remaining budget allowed, minus all the costs. All nonbinary features are finally divided by their highest possible value. The hyperparameters used are summarized in Table 4 and the selection follows the same procedure as for the TSPTW case study. Note that the embedding dimension of 40 has been reused from the initial SetTransformer implementation.
max width= Parameter Range tested DQNbased methods PPObased method 20 items 50 items 100 items 20 items 50 items 100 items Batch size 64 32 32 128 32 64 Learning rate 0.00001 0.00001 0.00001 0.0001 0.0001 0.0001 Embedding dimension 40 40 40 40 40 40 # Hidden layers 2 2 2 2 2 2 Hidden layers dimension 128 128 128 128 128 128 Reward scaling 0.001 0.001 0.001 0.001 0.001 0.001 Softmax temperature 10 2 10    nstep 20 50 100    Entropy value    0.001 0.001 0.001 Cliping value    0.1 0.1 0.1 Udpate timestep    2048 2048 2048 # Epochs per udpate    4 4 4 RBS temperature    1 1 1 Luby scaling factor    1 1 1
f.4 Baselines used for Comparison
 Knitro

Knitro is a commercial software package for solving large scale nonlinear mathematical optimization problems
[60]. The parameters used are the default ones of the solver.  APOPT

APOPT is another solver able to solve large scale optimization problems, such as mixed integer nonlinear programs. We used this solver with the GEKKO python interface [5]. The parameters used are the default ones.
 DQN

Solution obtained using the standard DQN algorithm with the greedy selection policy.
 PPO

Best solution obtained using PPO with a beamsearch decoding of size 64.
 CPModel
Comments
There are no comments yet.