Implementation of the paper "Improving Optimization Bounds using Machine Learning: Decision Diagrams meet Deep Reinforcement Learning".
Finding tight bounds on the optimal solution is a critical element of practical solution methods for discrete optimization problems. In the last decade, decision diagrams (DDs) have brought a new perspective on obtaining upper and lower bounds that can be significantly better than classical bounding mechanisms, such as linear relaxations. It is well known that the quality of the bound achieved through this flexible bounding method is highly reliant on the ordering of variables chosen for building the diagram, and finding an ordering that optimizes standard metrics, or even improving one, is an NP-hard problem. In this paper, we propose an innovative and generic approach based on deep reinforcement learning for obtaining an ordering for tightening the bounds obtained with relaxed and restricted DDs. We apply the approach to both the Maximum Independent Set Problem and the Maximum Cut Problem. Experimental results on synthetic instances show that the deep reinforcement learning approach, by achieving tighter objective function bounds, generally outperforms ordering methods commonly used in the literature when the distribution of instances is known. To the best knowledge of the authors, this is the first paper to apply machine learning to directly improve relaxation bounds obtained by general-purpose bounding mechanisms for combinatorial optimization problems.
Relaxation bounds, and mechanisms by which those bounds can be improved, are perhaps the most critical element of scalable generic algorithms for discrete optimization problems. As machine learning grows in popularity, a natural question arises: how can machine learning be used for improving optimization bounds? Finding a way to utilize the power of machine learning to prove tighter relaxation bounds may be a key to unlocking significant performance improvements in optimization solvers. This paper provides, to the best knowledge of the authors, a first effective approach in the literature towards achieving this goal.
The challenge one faces in using machine learning to tighten relaxation bounds is that the bounds provided by classical methods (e.g., LP or SDP relaxations) are inflexible; the algorithm used to solve the relaxation has no effect on the quality of the bound. For example, given an IP model, the LP relaxation will report the same bound regardless of the method used to solve the relaxation and of any other decision made during the solution algorithm.
Contrastingly, approximate decision diagrams (DDs) [Bergman, van Hoeve, and Hooker2011], a recently introduced optimization technology, provide a flexible bounding method, in that decisions employed in the execution of the algorithms used to build the DDs directly affect the quality of the bound. This is true for both relaxed DDs, that prove relaxation bounds, and restricted DDs, that identify primal solutions. This opens the door for potential integration with machine learning.
Initially introduced for representing switching circuits [Lee1959] and for formal verification [Bryant1986], DDs in discrete optimization are used to encode the feasible solutions of a problem while preserving its combinatorial structure. A common application is to provide bounds, both upper and lower, for discrete optimization problems [Bergman, van Hoeve, and Hooker2011, Bergman et al.2013]. However, the quality of the bounds is known to be tightly related to the variable ordering considered during the construction of the DD [Bergman et al.2012]. It has been shown that finding an optimal ordering for general DDs, or even improving a given variable ordering, is NP-hard [Bollig and Wegener1996], and is often challenging to even model. Thus, designing methods for finding a good ordering is a hot topic in the community and continues as a challenge. The idea suggested in this paper is to use machine learning to identify good variable orderings that therefore result in tighter objective function bounds.
In another field of research, reinforcement learning (RL) [Sutton and Barto1998] is an area of machine learning focusing on how an agent can learn from its interactions with an environment. The agent moves from state to state by performing a sequence of actions, each of them giving a specific reward. The behavior of an agent is characterized by a policy, determining which action should be taken from each state. Given this context, the goal is to learn a policy maximizing the sum of rewards of each action done by the agent.
However, traditional methods for RL suffer from a lack of scalability and are limited to low-dimensional problems. The main issue is that, when the state space is large, some states are never visited during the learning process. Recently, deep learning [LeCun, Bengio, and Hinton2015] provided new tools to overcome this problem. The idea is to use a deep architecture as a function approximator for generalizing knowledge from visited to unknown states. Such an improvement enabled RL to scale to problems that were previously intractable. Notable examples are the superhuman performances obtained for the game of Go [Silver et al.2016] and Atari 2600 [Mnih et al.2013, Mnih et al.2015]. The combination of RL with a deep network is commonly referred to as deep reinforcement learning (DRL) [Arulkumaran et al.2017].
Even more recently, DRL has also been applied to identify high-quality primal bounds to some NP-hard combinatorial problems. Most work focuses on the classical Traveling Salesman Problem [Bello et al.2016, Deudon et al.2018], with the exception of the approach of Khalil et al. [Khalil et al.2017] that tackles four NP-hard problems having a graph structure. They use a deep learning architecture in order to embed the graph structure into features [Dai, Dai, and Song2016]. The competitive results obtained suggest that this approach is a promising new tool for finding solutions to NP-hard problems. In this paper, we further push these efforts to be able to generate dual bounds.
Given this related work, our contribution is positioned as follows. We propose a generic approach based on DRL in order to identify variable orderings for approximate DDs. The goal is to find orderings providing tight bounds. The focus is on relaxed DDs, as this provides a mechanism for utilizing machine learning to improve relaxation bounds, but we also show the effectiveness for restricted DDs, adding to the recent literature on using machine learning for finding high-quality heuristic solutions. The approach has been validated on the maximum independent set problem, for which the variable ordering has been intensively studied [Bergman et al.2012]. Its application to the maximum cut problem is also considered. We note that there has been limited work on applying machine learning to identify variable orderings for DDs in unrelated fields [Carbin2006, Grumberg, Livne, and Markovitch2011]. To the best of our knowledge, this work has not been extended to optimization.
This paper is structured as follows. The next section introduces the technical background related to DDs and RL. The process that we designed for learning an appropriate variable ordering is then presented. The RL model and the learning algorithms are detailed and the construction of the DD using RL is described. Finally, experiments on synthetic instances are carried out in the last section.
In the optimization field, a decision diagram (DD) is a structure that encodes a set of solutions to a constrained optimization problem (COP) $\mathcal{P} = \langle X, D, C, O \rangle$, where $X$ is the set of variables, $D$ the set of discrete domains restricting the values that variables can take, $C$ the set of constraints and $O$ the objective function. Formally, a DD associated with a combinatorial problem $\mathcal{P}$ is a directed layered acyclic graph $B_{\mathcal{P}} = \langle U, A, d \rangle$, where $U$ is the set of nodes, $A$ the set of arcs and $d$ a function associating a label with each arc. The set of nodes is partitioned into layers $L_1, \dots, L_m$, i.e., $U = \bigcup_{i=1}^{m} L_i$. Layers $L_1$ and $L_m$ are both composed of a single node: the root and the terminal node, respectively. The width of layer $L_i$ in a DD is defined as the number of nodes in that layer: $|L_i|$. The width of the DD is the width of its widest layer: $\max_{1 \le i \le m} |L_i|$. Each arc $a \in A$ is directed from a node in a layer $L_i$ to a node in a layer $L_j$ where $1 \le i < j \le m$. The function $d$ associates with each arc $a$ a label $d(a)$. The arcs directed out of each node have distinct labels, i.e., for every node $u$ and every domain value $v$ there is at most one arc with tail $u$ and label $v$. We assume that for each node $u \in U$ there exists a directed path from the root node to $u$ and from $u$ to the terminal node. A cost $c(a)$ is also associated with every arc $a \in A$, which is used to encode the objective function values of solutions.
In this paper, a DD for a COP $\mathcal{P}$ has $n+1$ layers, where $n$ is the number of decision variables in $\mathcal{P}$. Each layer $L_i$ (except the last one) is linked to one variable $x$ of $\mathcal{P}$, and an arc $a$ from $L_i$ to $L_{i+1}$ with label $d(a)$ represents the assignment $x = d(a)$. A directed path from the root to the terminal node of the DD then corresponds to a solution of $\mathcal{P}$. The assignment of the variables of $\mathcal{P}$ to layers during the construction of the DD is referred to as the variable ordering.
A DD is exact when the solutions it encodes coincide exactly with the feasible solutions of the initial problem $\mathcal{P}$ and, for any directed root-to-terminal path $p = (a_1, \dots, a_n)$, the sum of the costs of the arcs equals the value of the objective function $O$ for the solution $x^p$ it corresponds to, i.e., $\sum_{i=1}^{n} c(a_i) = O(x^p)$. In this case, the longest path from the root to the terminal node corresponds to the optimal solution of $\mathcal{P}$. However, the width of DDs tends to grow exponentially with the number of variables in the problem, which reduces their usability for large instances. A DD is relaxed when it encodes a superset of the feasible solutions of $\mathcal{P}$ and, for any directed root-to-terminal path $p$, the sum of the costs of the arcs is an upper bound (in the case of maximization) on the value of the objective function for the solution it corresponds to, i.e., $\sum_{i=1}^{n} c(a_i) \ge O(x^p)$. A relaxed DD can be constructed incrementally, merging nodes on each layer until the width is below a threshold [Bergman, van Hoeve, and Hooker2011] in such a way that no solution is lost during the merging process. Hence, for a relaxed DD, the longest path gives an upper bound on the optimal value of $\mathcal{P}$. Finally, a DD is restricted when it under-approximates the feasible solutions of $\mathcal{P}$ and, for any directed root-to-terminal path $p$, the sum of the costs of the arcs is a lower bound on the value of the objective function for the solution it corresponds to, i.e., $\sum_{i=1}^{n} c(a_i) \le O(x^p)$. There are several ways to construct such a DD; perhaps the simplest one is removing nodes from each layer once the width threshold is reached. Unlike relaxed DDs, solutions are lost during the reduction. For maximization problems, the longest path provides a lower bound on the optimal value of $\mathcal{P}$. Optimization bounds can thereby be directly obtained through relaxed and restricted DDs. Both take as input a specified maximum width, and it has been empirically shown that larger DDs generally provide tighter bounds but are in return more expensive to compute.
An exhaustive description of DDs and their construction is provided in [Bergman et al.2016].
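As a rough illustration of how a width limit produces the two kinds of bounds, the following Python sketch builds a relaxed or restricted DD layer by layer for the maximum independent set problem (defined later in the paper), keeping only the most recent layer in memory. It is a minimal sketch under stated assumptions, not the implementation used in the paper: the function name `dd_bound`, the state encoding (the set of vertices that can still be selected), and the rule for choosing which nodes to merge or drop (those with the smallest longest-path value) are illustrative choices.

```python
from typing import Dict, FrozenSet, List, Set


def dd_bound(adj: Dict[int, Set[int]], order: List[int],
             max_width: int, relaxed: bool) -> int:
    """Longest-path bound of a width-limited DD for maximum independent set.

    adj maps each vertex to its neighbours; order is the variable ordering.
    Each DD node is identified by its state (vertices still selectable) and
    stores the length of the longest root-to-node path.
    """
    layer: Dict[FrozenSet[int], int] = {frozenset(adj): 0}
    for v in order:
        nxt: Dict[FrozenSet[int], int] = {}
        for state, value in layer.items():
            # 0-arc: v is left out of the set.
            s0 = state - {v}
            nxt[s0] = max(nxt.get(s0, value), value)
            # 1-arc: only allowed if v is still selectable.
            if v in state:
                s1 = state - adj[v] - {v}
                nxt[s1] = max(nxt.get(s1, value + 1), value + 1)
        if len(nxt) > max_width:
            ranked = sorted(nxt.items(), key=lambda kv: kv[1], reverse=True)
            if relaxed:
                # Merge the worst nodes into one: union of states, max value.
                keep, rest = ranked[:max_width - 1], ranked[max_width - 1:]
                nxt = dict(keep)
                merged = frozenset().union(*(s for s, _ in rest))
                best = max(val for _, val in rest)
                nxt[merged] = max(nxt.get(merged, best), best)
            else:
                # Restricted DD: simply drop the worst nodes.
                nxt = dict(ranked[:max_width])
        layer = nxt
    return max(layer.values())
```

With a sufficiently large `max_width` the diagram is exact and both variants return the optimal value; with a small width, the relaxed variant can only over-estimate it and the restricted variant can only under-estimate it.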
Let $\langle S, A, T, R \rangle$ be a tuple representing a deterministic agent-environment pair, where $S$ is the set of states of the environment, $A$ is the set of actions that the agent can take, $T : S \times A \to S$ is the transition function leading the agent from one state to another given the action taken, and $R : S \times A \to \mathbb{R}$ is the reward function for taking an action from a specific state. The behavior of an agent is defined by a policy $\pi : S \to A$, describing the action to take in a given state. The goal of an agent is to learn a policy maximizing the accumulated (possibly discounted) sum of rewards during its lifetime, defined by a sequence of states $s_1, \dots, s_\Theta$ with $s_t \in S$. Such a sequence is called an episode, where $s_\Theta$ is the terminal state. The expected return after time step $t$ is denoted by $G_t = \sum_{k=t}^{\Theta-1} \gamma^{k-t} R(s_k, a_k)$, where $\gamma \in [0, 1]$ is a discount factor parametrizing the weight of future rewards. For a deterministic environment, the quality of taking an action $a_t$ from a state $s_t$ under a policy $\pi$ is defined by the action-value function $Q^{\pi}(s_t, a_t) = G_t$. The problem is to find a policy maximizing the expected return: $\pi^{*} = \arg\max_{\pi} Q^{\pi}(s, a)$ for all $s \in S$ and $a \in A$. In practice, $\pi^{*}$ is computed from an initial policy by two nested operations: (1) policy evaluation, making the action-value function consistent with the current policy, and (2) policy improvement, greedily improving the current policy.
However, in practice, the optimal policy, or even an optimal action-value function, cannot be computed in a reasonable amount of time. A method based on approximation, such as Q-learning [Watkins and Dayan1992], is then required. Instead of computing the optimal action-value function, Q-learning approximates the function by iteratively updating a current estimate after each action. The update function is defined as follows: $Q(s_t, a_t) \xleftarrow{\alpha} R(s_t, a_t) + \gamma \max_{a} Q(s_{t+1}, a)$, where $x \xleftarrow{\alpha} y$ denotes the update $x \leftarrow (1 - \alpha)\,x + \alpha\,y$ and $\alpha$ is the learning rate.
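The Q-learning update can be sketched in a few lines of Python for the tabular case. This is only an illustrative sketch: the function name `q_update`, the dictionary-based table, and the default estimate of 0.0 for unseen state-action pairs are assumptions made here, not details from the paper.

```python
from typing import Dict, Hashable, Iterable, Tuple

QTable = Dict[Tuple[Hashable, Hashable], float]


def q_update(q: QTable, state: Hashable, action: Hashable, reward: float,
             next_state: Hashable, next_actions: Iterable[Hashable],
             alpha: float, gamma: float) -> float:
    """Move Q(state, action) towards reward + gamma * max_a Q(next_state, a)."""
    # Value of the best action available in the next state (0.0 if terminal).
    best_next = max((q.get((next_state, a), 0.0) for a in next_actions),
                    default=0.0)
    target = reward + gamma * best_next
    old = q.get((state, action), 0.0)
    # Blend the old estimate and the new target with learning rate alpha.
    q[(state, action)] = (1.0 - alpha) * old + alpha * target
    return q[(state, action)]
```

Passing an empty `next_actions` models a terminal next state, for which the target reduces to the immediate reward.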
Another issue arising for large problems is that almost every state encountered may never have been seen during previous updates, which calls for a method capable of using prior knowledge to generalize across states that share similarities. One such method, neural fitted Q-learning [Riedmiller2005], uses a neural network for approximating the action-value function. This provides an estimator $\hat{Q}(s, a, \mathbf{w}) \approx Q(s, a)$, where $\mathbf{w}$ is a weight vector that is learned. Stochastic gradient descent [Bottou2010] or another optimizer coupled with back-propagation [Rumelhart, Hinton, and Williams1986] is then used for updating $\mathbf{w}$, aiming to minimize the squared loss between the current Q-value and the new value that should be assigned using Q-learning: $\mathbf{w} \leftarrow \mathbf{w} - \frac{\alpha}{2} \nabla_{\mathbf{w}} L(\mathbf{w})$, where the squared loss is $L(\mathbf{w}) = \big(R(s, a) + \gamma \max_{a'} \hat{Q}(s', a', \mathbf{w}) - \hat{Q}(s, a, \mathbf{w})\big)^2$. Updates are done using experience replay. Let $\langle s, a, r, s' \rangle$ be a sample representing an action taken at a specific state together with its reward, and let $\mathcal{D}$ be a sample store. Each time an action is performed, $\langle s, a, r, s' \rangle$ is added to $\mathcal{D}$. Then, the optimizer updates $\mathbf{w}$ using a random sample taken from $\mathcal{D}$.
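To make the fitted update and experience replay concrete, here is a minimal Python sketch that swaps the neural network for a linear approximator, $\hat{Q}(s, a, \mathbf{w}) = \mathbf{w} \cdot \phi(s, a)$, so that the gradient can be written by hand. The feature vectors, the sample layout `(phi, reward, next_phis)` and the function names are hypothetical; as in Q-learning, the target is treated as a constant when differentiating.

```python
import random
from typing import List, Sequence, Tuple

# A stored sample: features of (s, a), the reward, and the feature vectors
# of every action available in the next state (empty if terminal).
Sample = Tuple[Sequence[float], float, Sequence[Sequence[float]]]


def q_hat(w: Sequence[float], phi: Sequence[float]) -> float:
    """Linear stand-in for the network: Q_hat = w . phi."""
    return sum(wi * xi for wi, xi in zip(w, phi))


def sgd_step(w: Sequence[float], sample: Sample,
             alpha: float, gamma: float) -> List[float]:
    """One gradient step on the squared loss (target - Q_hat)^2."""
    phi, reward, next_phis = sample
    target = reward + gamma * max((q_hat(w, p) for p in next_phis),
                                  default=0.0)
    error = target - q_hat(w, phi)
    # w <- w - (alpha / 2) * grad L, with grad L = -2 * error * phi.
    return [wi + alpha * error * xi for wi, xi in zip(w, phi)]


def replay_update(w: Sequence[float], store: List[Sample], alpha: float,
                  gamma: float, rng: random.Random) -> List[float]:
    """Experience replay: update w from one sample drawn from the store."""
    return sgd_step(w, rng.choice(store), alpha, gamma)
```

In a training loop, each new sample would be appended to `store` before `replay_update` is called, so that past experience keeps contributing to later updates.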
Designing an RL model for determining the variable ordering of a DD associated with a COP $\mathcal{P} = \langle X, D, C, O \rangle$ requires adequately defining the tuple $\langle S, A, T, R \rangle$ that represents the system. Our model is defined as follows.
A state $s \in S$ is a pair $(s_L, s_B)$ containing an ordered sequence of variables $s_L$ and a partially constructed DD $s_B$ associated with the variables in $s_L$. A state is terminal if $s_L$ includes all the variables of $X$.
An action $a \in A$ is defined as the selection of a variable from $X$. An action $a$ can be performed at state $s$ if and only if the variable has not yet been inserted in $s_L$ ($a \notin s_L$).
A transition $T$ is a function updating a state according to the action performed. Let $\oplus$ be an operator adding the variable $a$ into a decision diagram and $\circ$ another operator appending the variable to the sequence; we have $T(s, a) = (s_L \circ a,\ s_B \oplus a)$.
The reward function is designed to tighten the bounds obtained with the DD. When maximizing, upper bounds are provided by relaxed DDs and lower bounds are provided by restricted DDs. Both cases are handled by the same kind of reward. Let $UB(s_B)$ and $LB(s_B)$ indicate the current upper/lower bound obtained with the DD $s_B$. Such a bound corresponds to the current longest path of the relaxed/restricted DD from the root node to the last constructed layer. At each variable insertion in $s_B$, the difference in the longest path when adding the new layer is computed. When computing the upper bound, this difference is penalized because we want the bound to be as small as possible: $R^{ub}(s, a) = -\big(UB(s_B \oplus a) - UB(s_B)\big)$. It is rewarded for the lower bound, which we want to increase instead: $R^{lb}(s, a) = LB(s_B \oplus a) - LB(s_B)$. For minimization problems, upper bounds are provided by restricted DDs and lower bounds by relaxed DDs.
Note that this formalization is generic and can be applied to any problem that can be represented by a DD constructed layer-by-layer. Indeed, all the problem-dependent characteristics are embedded into the DD construction and the insertion of variables (the DD insertion operator in the transition function).
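The state, action, transition and reward described above can be sketched as a small environment class. The sketch below is an illustration under stated assumptions rather than the paper's code: it targets the upper-bound case for the maximum independent set problem, keeps only the last layer of the relaxed DD, and uses hypothetical names (`OrderingEnv`, `expand_relaxed`) and a simple "merge the lowest-valued nodes" rule.

```python
from typing import Dict, FrozenSet, List, Set

Layer = Dict[FrozenSet[int], int]  # node state -> longest path to the node


def expand_relaxed(layer: Layer, v: int, adj: Dict[int, Set[int]],
                   max_width: int) -> Layer:
    """Insert vertex v as a new layer of a relaxed DD for independent set."""
    nxt: Layer = {}
    for state, value in layer.items():
        s0 = state - {v}                      # v not selected
        nxt[s0] = max(nxt.get(s0, value), value)
        if v in state:                        # v selected
            s1 = state - adj[v] - {v}
            nxt[s1] = max(nxt.get(s1, value + 1), value + 1)
    if len(nxt) > max_width:
        ranked = sorted(nxt.items(), key=lambda kv: kv[1], reverse=True)
        keep, rest = ranked[:max_width - 1], ranked[max_width - 1:]
        nxt = dict(keep)
        merged = frozenset().union(*(s for s, _ in rest))
        best = max(val for _, val in rest)
        nxt[merged] = max(nxt.get(merged, best), best)
    return nxt


class OrderingEnv:
    """State = (sequence of inserted vertices, partially built relaxed DD)."""

    def __init__(self, adj: Dict[int, Set[int]], max_width: int) -> None:
        self.adj = adj
        self.max_width = max_width
        self.sequence: List[int] = []
        self.layer: Layer = {frozenset(adj): 0}

    def valid_actions(self) -> List[int]:
        return [v for v in self.adj if v not in self.sequence]

    def is_terminal(self) -> bool:
        return len(self.sequence) == len(self.adj)

    def bound(self) -> int:
        # Longest path from the root to the last constructed layer.
        return max(self.layer.values())

    def step(self, v: int) -> int:
        """Insert vertex v and return the upper-bound reward."""
        if v in self.sequence:
            raise ValueError(f"vertex {v} was already inserted")
        before = self.bound()
        self.layer = expand_relaxed(self.layer, v, self.adj, self.max_width)
        self.sequence.append(v)
        return -(self.bound() - before)  # growth of the bound is penalized
```

Because the rewards telescope, the sum of rewards over an episode equals minus the final bound, so maximizing the return amounts to minimizing the upper bound produced by the ordering.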
The basis of the learning algorithm relies on neural fitted Q-learning as described in the previous section and is presented in Algorithm 1. At each iteration, a COP ($\mathcal{P}$) is randomly taken from the training set and the learning is conducted on it. Effective learning for any particular class of COPs should consider instances of that class. For example, if the goal is to find objective function bounds for an instance of the maximum independent set problem (formally defined later), other instances from that class of problem should be used during the training. The algorithm returns a vector of weights ($\mathbf{w}$) which is used for parametrizing the approximate action-value function $\hat{Q}$. The basic algorithmic framework can be improved through the following:
Instead of updating the $\hat{Q}$-function using a single sample as previously explained, it is also possible to update it by considering a mini-batch of $m$ samples from the store memory $\mathcal{D}$. As stressed by [Masters and Luschi2018], the choice of the mini-batch size can have a huge impact on the learning process. Let $L_j(\mathbf{w})$ be the squared loss related to a sample $j$, with $m$ as the batch size; the gradient update, where the squared loss of each sample is summed, is as follows: $\mathbf{w} \leftarrow \mathbf{w} - \frac{\alpha}{2} \nabla_{\mathbf{w}} \sum_{j=1}^{m} L_j(\mathbf{w})$.
Always following a greedy policy results in a lack of exploration during learning. One solution is to introduce limited randomness in choosing an action.
$\epsilon$-greedy refers to taking a random action with probability $\epsilon$, where $\epsilon \in [0, 1]$. Otherwise, the current policy is followed. In our case, $\epsilon$ is adaptive and decreases linearly during the learning process, resulting in focused exploration at first followed by increasingly favoring exploitation.
Gradient-based methods have difficulty learning when rewards are large or sparse. Reward scaling compresses the space of rewards into a smaller interval near zero, while still keeping them sufficiently large, since, as stressed by [Henderson et al.2017], tiny rewards can also lead to failed learning. We let $\rho$ be the scaling factor, generally defined as a power of 10, and rescale the rewards as $\rho R(s, a)$.
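The exploration schedule and the reward scaling described above can be sketched as follows. The start and end values of the exploration rate (1.0 and 0.05) and the function names are hypothetical choices for illustration; the paper only states that the rate decreases linearly and that the scaling factor is generally a power of 10.

```python
import random
from typing import Dict, Hashable


def linear_epsilon(step: int, total_steps: int,
                   eps_start: float = 1.0, eps_end: float = 0.05) -> float:
    """Exploration rate decreasing linearly from eps_start to eps_end."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return eps_start + frac * (eps_end - eps_start)


def epsilon_greedy(q_values: Dict[Hashable, float], epsilon: float,
                   rng: random.Random) -> Hashable:
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(sorted(q_values))
    return max(q_values, key=q_values.get)


def scale_reward(reward: float, rho: float) -> float:
    """Rescale a raw reward by the factor rho (e.g. 0.01)."""
    return rho * reward
```

With `epsilon` equal to 0 the selection is purely greedy, which is the behavior used once training is over.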
Once the model has been trained, the next step is to use it in order to build the DD for a new instance. Let us illustrate the construction on the maximum independent set problem.
Let $G = (V, E)$ be a simple undirected graph. An independent set of $G$ is a subset of vertices $I \subseteq V$ such that no two vertices in $I$ are connected by an edge of $E$. The maximum independent set problem (MISP) consists in finding the independent set with the largest cardinality.
Note that we use the term vertices for elements of the graph and nodes for DDs. The problem is fully represented by a graph $G$, and a classical formulation assigns a binary variable $x_i$ to each vertex $i \in V$ indicating whether the vertex is selected in the set or not. More details about the internal operations of the construction are provided in [Bergman et al.2012]. Specific to the learning, the environment tuple $\langle S, A, T, R \rangle$ is generated for $G$, and the learned model is then applied to it. The environment is directly inferred from the previous formulation: the current state is the list of variables already considered together with the DD currently built, an action consists in choosing a new variable, and the transition function and the reward are tied to the DD construction. At each state, the model is called in order to compute the estimated $\hat{Q}$-value for each action that can be performed in the current state. The network structure2vec [Dai, Dai, and Song2016] can be used for parametrizing $\hat{Q}(s, a, \mathbf{w})$ [Khalil et al.2017]. The construction is driven by the policy $\pi(s) = \arg\max_{a} \hat{Q}(s, a, \mathbf{w})$. The best vertex according to the approximated action-value function is inserted in the DD at each step.
This process is illustrated in Figure 7 for a MISP instance. The partially constructed DD and the inserted/remaining vertices are depicted for each state. The value in each vertex indicates the $\hat{Q}$-value computed by the model for each state-action pair. Gray vertices are the ones that are greedily selected by the policy. No vertex can be inserted twice. The construction terminates when all vertices are inserted.
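The greedy construction of an ordering from a trained model can be sketched as a simple loop. In this hedged sketch, `q_value` is a hypothetical callable standing in for the learned action-value estimate (in the paper it would be the trained network), and ties are broken by the smallest vertex index, which is an arbitrary choice made here.

```python
from typing import Callable, Iterable, List, Tuple


def greedy_ordering(vertices: Iterable[int],
                    q_value: Callable[[Tuple[int, ...], int], float]
                    ) -> List[int]:
    """Build a variable ordering by repeatedly inserting the best vertex.

    q_value(inserted, v) is a placeholder for the model's estimate of the
    value of inserting vertex v after the vertices in `inserted`.
    """
    order: List[int] = []
    remaining = set(vertices)
    while remaining:
        # Evaluate every vertex not yet inserted and keep the best one.
        best = max(sorted(remaining),
                   key=lambda v: q_value(tuple(order), v))
        order.append(best)
        remaining.remove(best)
    return order
```

For example, plugging in a scorer that returns the negated vertex degree yields an ordering by increasing degree; a trained model would instead score each candidate from the current state of the DD.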
Our first set of experiments is carried out on the MISP, for which the impact of variable ordering has been studied in depth [Bergman et al.2013]. The last experiments analyze how the approach generalizes to the Maximum Cut Problem. For the MISP, the approach is compared with the linear relaxation bound, random orderings, and three ordering heuristics commonly used in the literature:
Linear Relaxation (LP): The LP value of the linear relaxation obtained using a standard clique formulation for the MISP as described in [Bergman et al.2013].
Random Selection (RAND): An ordering of the vertices is drawn uniformly at random from all permutations. For each test, 100 random trials are performed and the average, best and worst results are reported.
Maximal Path Decomposition (MDP): A maximal path decomposition is precomputed and used as the ordering of the vertices [Bergman et al.2013]. This ordering bounds the width of the exact DD by the Fibonacci numbers.
Minimum Number of States (MIN): Having constructed up to layer $L_j$ and hence chosen the first $j-1$ vertices, the next vertex is selected as the one appearing in the fewest number of states in the DD nodes in layer $L_j$. This heuristic aims to greedily minimize the size of the subsequent layer.
Minimum Vertex Degree (DEG): The vertices are ordered in decreasing order of vertex degree.
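As an illustration of how a dynamic heuristic such as MIN and the RAND baseline operate, the following Python sketch selects the next vertex from the states of the current layer and draws a uniformly random permutation. It assumes that each DD node state is the set of vertices that can still be selected; the function names and the tie-breaking rule (smallest vertex index) are hypothetical.

```python
import random
from typing import Iterable, List, Set


def min_states_vertex(layer_states: Iterable[Set[int]],
                      remaining: Iterable[int]) -> int:
    """MIN-style choice: the vertex contained in the fewest node states."""
    states = list(layer_states)
    return min(sorted(remaining),
               key=lambda v: sum(1 for s in states if v in s))


def random_ordering(vertices: Iterable[int], rng: random.Random) -> List[int]:
    """RAND-style choice: a permutation drawn uniformly at random."""
    order = list(vertices)
    rng.shuffle(order)
    return order
```

Unlike a static ordering, `min_states_vertex` has to be called again after every layer is built, because the counts depend on the current layer.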
MISP instances were generated using the Barabási-Albert (BA) model [Albert and Barabási2002]. Such a model is commonly used for generating real-world-like, scale-free graphs. The graphs are defined by the number of nodes () and an attachment parameter (). The greater is , the denser is the graph. Edges are added preferentially to nodes having a higher degree. Training and testing have been carried out on XX (name hidden for the review). Training time is limited to 20 hours, memory consumption to 64 GB and one GPU is used. For each configuration, the training is done using 1000 randomly generated BA graphs (between 90 and 100 nodes) that are refreshed every 5000 iterations. Different models with a specific value for the attachment parameter () are trained. First, testing is carried out on 100 other random graphs of the same size and having the same attachment parameter as for the training. Other configurations are then considered. Performance profiles [Dolan and Moré2002] are used for comparing the approaches. This tool provides a synthetic view of how an approach performs compared to the others tested. The metric considered is the optimality gap (i.e., the relative distance between the bound and the optimal solution).
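The evaluation metric can be written down as a short Python sketch. The gap formula follows the description above (relative distance between the bound and the optimal value); the `profile` function is a simplified, assumed reading of a performance profile, reporting for each threshold the fraction of instances whose gap does not exceed it, and both function names are hypothetical.

```python
from typing import List, Sequence


def optimality_gap(bound: float, optimum: float) -> float:
    """Relative distance between a bound and the optimal value."""
    return abs(bound - optimum) / abs(optimum)


def profile(gaps: Sequence[float], thresholds: Sequence[float]) -> List[float]:
    """Fraction of instances whose gap is at most each threshold."""
    return [sum(1 for g in gaps if g <= t) / len(gaps) for t in thresholds]
```

Plotting `profile` against the thresholds for each ordering method gives one curve per method; a curve that rises earlier indicates tighter bounds on more instances.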
Our model is built upon the code of Dai et al. for the learning part and upon the code of Bergman et al. for building the DD of the MISP instances. Evaluation of the different orderings is also done using this software. The learning is done using the Adam optimizer [Kingma and Ba2014]. The NetworkX library [Hagberg, Swart, and S Chult2008] is used for generating the random graphs. For the reproducibility of results, the implementation of our approach and the models are available online (hidden for the review). Optimal solutions of the MISP instances and the linear relaxations have been obtained using CPLEX 12.6.3.
The goal of the experiments is to show the adequacy of our approach for computing both upper and lower bounds in different scenarios commonly considered in practice.
The first set of experiments aims to determine the best DD maximal width () for training the model. Let us first consider for the attachment parameter as in [Khalil et al.2017]. We trained four models () for relaxed DDs (RL-UB-4), and tested the models using the same values of . Figure 17 shows the performance profiles of the models when evaluated on relaxed DDs of various widths. Random ordering (RAND) is also reported and is outperformed by the four models. The shaded area represents the range of the RAND performance when considering the best and the worst solution obtained among the 100 trials. Interestingly, these results suggest that the width chosen for the training has a negligible impact on the quality of the model, even when the width considered during the testing is different from that used for the training. As computing small-width DDs is less computationally expensive than computing those with larger widths, we select the model trained with a width of 2 for the remainder of the experiments on MISP. Concerning restricted DDs, as shown in the next set of experiments (Figures (e)-(h)), lower bounds close to the optimal solutions are already obtained with small-width DDs ().
Our approach is now compared to the other variable ordering heuristics using BA graphs having a varied density (). A specific model is trained for each distribution for both relaxed (RL-UB-) and restricted DDs (RL-LB-). Evaluation is done on graphs following the same distribution as those used in training. Results are presented in Figures (a)-(d) for relaxed DDs having a width of 100. In all the configurations tested, our approach provides a better upper bound than the RAND, MIN, MDP and DEG heuristics. For the sparsest graphs (Figure (a)), the optimal solution is reached for almost all the instances. When the graphs are relatively sparse (Figure (b)), the linear relaxation provides the best bound. However, this advantage diminishes as the density of the graphs grows (Figures (c) and (d)). For these graphs, our model gives the best performance for all the instances. Results for restricted DDs with a width of 2 are depicted in Figures (e)-(h). Again, our model has the best results among those tested and provides a stronger lower bound, close to the optimal solution. Optimality is reached for of the easiest instances () and for of the hardest ones ().
Let us now consider the situation depicted in Figure (b) where RL-UB-4 provides a worse bound than the linear relaxation of the problem. Figure (a) depicts the evolution of the optimality gap when the model is tested on relaxed DDs of increasingly larger width. As RAND provided results far outside the range of the other methods for relaxed DDs, we do not include it in the subsequent plots. The plot shows that RL-UB-4 remains better than the other ordering heuristics tested, and when the DD width is sufficiently large () the LP relaxation bound is beaten and the optimal solution is almost reached (). Figure (b) reports the execution time of the different methods. The linear relaxation is the fastest method and is almost instantaneous. Concerning the orderings, RL-UB-4, MDP and DEG are static orderings, and the execution time for each generally increases similarly with the width, while MIN requires dynamically processing the nodes in a layer for determining the next vertex to insert.
In a similar way, this set of experiments aims to analyze how the learned models perform when larger graphs are considered. Results in Figure (a) depict the optimality gap of the different approaches for relaxed DDs (). We can observe that the learned model remains robust against increases in the graph size, although the gap to the other orderings progressively decreases. When the graph size is far beyond the size used for the training, the model struggles to generalize, which indicates that training on larger graphs may be required. The LP bounds for large graphs are out of range of DDs of this limited width. The same experiment is carried out for restricted DDs and reported in Figure (b). Given that optimality is reached even with small-width DDs, only a width of 2 is considered. Here, RL-LB-4 provides the best lower bound even for the largest graphs tested. This is consistent with other heuristics implemented through RL.
This set of experiments aims to analyze the performance of the learned models when they are tested on a different distribution than that used for training. Figure 17 presents the relative gap with the model specifically trained on the distribution tested. For instance, when , the gap is computed using RL-UB-8 as reference (or using RL-LB-8 for restricted DDs). We use this measure instead of the optimality gap in order to nullify the impact of the instance difficulty. The gap is then null for the distribution used as reference (RL-UB-4 and RL-LB-4). Results show that the more distant the distribution is from the reference, the greater the gap, which indicates that the learned model struggles to generalize. For small perturbations ( and ), good performance is still achieved. These results suggest that it is important to have some knowledge of the distribution of the graphs that we want to assess in order to feed the model appropriately during training.
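Let $G = (V, E)$ be a simple undirected graph with a weight $w_e$ on each edge $e \in E$. A maximum cut of $G$ is a subset of nodes $S \subseteq V$ such that $\sum_{e \in \delta(S)} w_e$ is maximized, where $\delta(S) \subseteq E$ is the set of edges having a node in $S$ and the other one in $V \setminus S$. The maximum cut problem (MCP) is that of finding a maximum cut.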
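The objective of the MCP can be made concrete with a short Python sketch, assuming a weighted graph as in the experiments. The edge-dictionary representation and the function names are hypothetical, and the brute-force search is only meant for checking tiny instances, since it enumerates every subset of nodes.

```python
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple

Weights = Dict[Tuple[int, int], float]


def cut_weight(weights: Weights, subset: Set[int]) -> float:
    """Total weight of the edges with exactly one endpoint in `subset`."""
    return sum(w for (u, v), w in weights.items()
               if (u in subset) != (v in subset))


def brute_force_max_cut(nodes: Iterable[int], weights: Weights) -> float:
    """Best cut value over all subsets of nodes (exponential time)."""
    nodes = list(nodes)
    return max(cut_weight(weights, set(subset))
               for k in range(len(nodes) + 1)
               for subset in combinations(nodes, k))
```

Such an exhaustive value can serve as the reference optimum when computing optimality gaps on very small test graphs.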
As an example of its generalizability, our approach is also applied to the MCP. The DD is built according to the formulation of [Bergman et al.2016]. The learning process and the model are the same as for the MISP. Generation of graphs is still done using a Barabási-Albert distribution () with edge weights uniformly and independently generated from . For training, weights are scaled with a factor of 0.01. The ordering obtained is compared with RAND and with the MAX-WEIGHT heuristic, which selects the vertex having the highest sum of incoming weights [Bergman et al.2016]. Results reporting the optimality gap of the three methods are presented in Figure 17 for relaxed ( for training and testing) and restricted DDs (). As previously, the RL approach is the best for restricted DDs. Concerning relaxed DDs, performance comparable to MAX-WEIGHT and better than RAND is reached, indicating that the learning is effective. However, further analysis is required in order to see if better bounds can be obtained for DDs of such a width.
Objective function bounds are paramount to general and scalable algorithms for combinatorial optimization. Decision diagrams provide a novel and flexible mechanism for obtaining high-quality bounds whose output is amenable to improvement through machine learning, since the objective function bound obtained is directly linked to the heuristic choices taken. This paper provides a generic approach based on deep reinforcement learning for finding high-quality heuristics for variable orderings that are shown experimentally to tighten the bounds proven by approximate DDs. Experimental results indicate the promise of the approach when applied to the maximum independent set and maximum cut problems. Insights from a thorough experimental evaluation indicate: (1) the approach generally outperforms variable ordering heuristics appearing in the literature; (2) the width chosen during training has negligible impact when applied to unseen problems; (3) the model generalizes well when the width is increased and, in most cases, when it is applied to larger graphs; and (4) it is important to have a measure of the distribution of the evaluated graphs in order to be able to feed the model during training. This last point remains a challenge when extending the approach to real-world problems. As future work, we plan to tackle it by generating new instances for training using generative models from the initial graphs.
To the best of our knowledge, this is the first paper to propose the use of machine learning in discrete optimization algorithms for the purpose of learning better bounds. It opens new avenues of research and multiple possibilities for future work, such as the application to other domains that utilize DDs, such as constraint programming or verification of systems.