1 Introduction
Mixed Integer Programming (MIP) is widely used to solve a vast range of combinatorial optimization (CO) problems in the field of Operations Research (OR). The presence of integer variables endows MIP formulations with the ability to capture the discrete nature of many real-world decisions. Applications of MIP include production scheduling, vehicle routing, facility location and airline crew scheduling, to mention only a few. In many real-world settings, homogeneous MIP instances with similar scales and combinatorial structures are optimized repeatedly, yet each is treated as a completely new task. Such MIP models solved day after day share remarkable similarities in model structure and solution appearance, which motivates us to integrate Machine Learning (ML) methods to explore correlations between an MIP model’s structure and its solution values in order to improve the solver’s performance.
Identifying correlations between problem structures and solution values is not new, and such correlations are widely used as guidelines for heuristic design for CO problems. These heuristics are usually human-designed priority rules that guide the search toward more promising regions of the solution space. For example, the nearest neighbor algorithm for the traveling salesman problem (TSP) constructs a heuristic solution by choosing the nearest unvisited node as the salesman’s next move, based on the observation that two distantly located nodes are unlikely to appear consecutively in the optimal route. Similar examples include the shortest processing time first heuristic for flow shop scheduling, the savings heuristic for vehicle routing, and the first fit algorithm for bin packing, among many others. A major drawback of heuristic design based on problem-specific knowledge is its lack of generality to other problems, for which new domain knowledge has to be identified anew.
MIP models can describe CO problems of various types using a standard formulation strategy, differing only in the coefficient data $(A, b, c)$. This makes it possible to explore connections between problem structures and the resulting solution values without prior domain knowledge. Predicting the solution values of general integer variables in MIP is a difficult task. Noticing that most MIP models are intensive in binary variables^{1}, a natural way to explore the hidden information in the model is to treat solution value prediction of binary variables as a binary classification task. A major challenge in solution prediction lies in the implicit correlations among decision variables, since a feasible solution is restricted by the constraints $Ax \le b$ in MIP. Rather than predicting each decision variable value in isolation, we propose a tripartite graph representation of MIP and use graph embedding to capture connections among the variables. Note that no two variable nodes are directly linked in the trigraph, but they can be neighbors of distance two if they appear in the same constraint of the MIP formulation. Correlations among variables are reflected in embeddings of the trigraph, where each vertex maintains aggregate feature information from its neighbors.

^{1}Take the benchmark set of MIPLIB 2017 miplib2017 as an example: among all 240 MIP benchmark instances, 164 are Binary Integer Linear Programming (BILP) problems, and 44 of the 76 remaining instances are imbalanced in the sense that binary variables account for more than 90% of all integer variables.

Incorporating solution prediction results in the MIP solving process is not trivial. In fact, a false prediction for a single decision variable can sometimes render the entire problem infeasible. Instead of utilizing the predicted solutions directly, we identify decision variables that are predictable and stable, and use this information to guide the Branch and Bound (B&B) tree search to concentrate on the unpredictable variables, accelerating convergence. This is achieved by a novel labelling mechanism on the training instances, where a sequence of feasible solutions is generated by an iterated proximity search method. Stable decision variables, whose values remain unchanged across these solutions, are recorded. Notably, although obtaining optimal solutions can be difficult, the stable variables can be viewed as an easy-to-predict part that reflects the MIP’s local optimality structure. This labelling mechanism is useful especially for difficult MIP instances for which solving to optimality is practically impossible.
The overall framework of solution prediction based MIP solving can be summarized as follows:
Training data generation: For a certain type of CO problem, generate a set of MIP instances of similar scale from the same distribution. For each instance, collect the corresponding variable features, constraint features and edge features, and use the iterated proximity search method to generate solution labels for each binary variable.
GCN model training: For each training instance, generate a trigraph from its MIP formulation and train a Graph Convolutional Network (GCN) for solution value prediction of each binary variable.
Application of solution prediction: For a new MIP instance from the same distribution, collect features, build the trigraph and use the trained GCN model to make solution value predictions, based on which an initial local branching cut heuristic is applied to solve the instance.
2 Related work
With a similar motivation, there have been some recent attempts to integrate ML and OR for solving CO problems. Dai et al. dai2017learning combined reinforcement learning and graph embedding to learn greedy construction heuristics for several optimization problems over graphs. Li et al. li2018combinatorial trained a graph convolutional network to estimate the likelihood that a vertex in a graph appears in the optimal solution. Selsam et al. selsam2018learning proposed a message passing neural network named NeuroSAT to solve SAT problems via a supervised learning framework. Vinyals et al. vinyals2015pointer , Kool et al. kool2018attention2 ; kool2018attention and Nazari et al. nazari2018reinforcement trained Pointer Networks (or variants thereof) with recurrent neural network (RNN) decoders to solve permutation-related optimization problems over graphs, such as the Traveling Salesman Problem (TSP) and the Vehicle Routing Problem (VRP). Quite different from their settings, our solution prediction framework is not restricted to certain graph-based problems, but can adapt to a variety of CO problems via a standard MIP formulation.
Closely related to our work, there is increasing interest in using ML techniques to enhance MIP solving performance. Alvarez et al. alvarez2017machine , Marcos et al. marcos2016online , and Khalil et al. khalil2016learning used learning-based approaches to imitate the behavior of the so-called strong branching method, a node-efficient but time-consuming branching variable selection method in the B&B search tree. He et al. he2014learning used imitation learning to train a node selection and a node pruning policy to speed up tree search in the B&B process. Khalil et al. khalil2017learning used binary classification to predict whether a primal heuristic will succeed at a given node, and then decide whether to run a heuristic at that node. Kruber et al. kruber2017learning proposed a supervised learning method to decide whether a Dantzig-Wolfe reformulation should be applied and which decomposition to choose among all possible ones. Interested readers can refer to Bengio et al. bengio2018machine for a comprehensive survey on the use of machine learning methods in CO.

The proposed MIP solving framework differs from previous work in two aspects:
Generalization: Previous solution generation methods for CO usually focus on problems with particular solution structures. For example, applications of Pointer Networks vinyals2015pointer ; kool2018attention are suited only for sequence-based solution encodings, and reinforcement learning dai2017learning ; li2018combinatorial style decision making is based on the assumption that a feasible solution can be obtained by sequential decisions. In contrast, the proposed framework is not limited to problems of certain types, but is applicable to all CO problems that can be modeled as MIPs. This greatly enlarges the applicable scope of the proposed framework.
Representation: Previous applications of ML techniques to enhance MIP solving performance mainly use hand-crafted features and make predictions on each variable independently. Noticing that the solution value of a variable is strongly correlated with the objective function and the constraints it participates in, we build a novel tripartite graph representation for MIP, based on which graph embedding is applied to extract correlations among variables, constraints and the objective function without human intervention.
3 The Solution Framework
Consider an MIP problem instance of the following general form:
$$\begin{aligned}
\min \quad & c^{\top} x & (1)\\
\text{s.t.} \quad & Ax \le b & (2)\\
& x_j \in \{0, 1\}, \quad \forall j \in \mathcal{B} & (3)\\
& x_j \in \mathbb{Z}^{+}, \;\; \forall j \in \mathcal{G}, \qquad x_j \in \mathbb{R}^{+}, \;\; \forall j \in \mathcal{C} & (4)
\end{aligned}$$
where the index set of decision variables is partitioned into $\mathcal{B}$, $\mathcal{G}$ and $\mathcal{C}$, the index sets of binary variables, general integer variables and continuous variables, respectively. The main task here is to predict the probability that a binary variable $x_j$, $j \in \mathcal{B}$, takes value 1 (or 0) in the optimal solution. Next we describe in detail the tripartite graph representation of MIP, the GCN model structure, and how the solution prediction results are incorporated to accelerate MIP solving.

3.1 Graph Representation for MIP
Our main idea is to use a tripartite graph to represent an input MIP instance. In particular, the objective function coefficients $c$, the constraint right-hand-side (RHS) coefficients $b$ and the coefficient matrix $A$ are extracted from the MIP formulation to build the graph. Vertices and edges in the graph are detailed as follows and graphically illustrated in Fig. 1.
Vertices:
1) the set of decision variable vertices, one vertex v for each binary variable $x_j$, $j \in \mathcal{B}$.
2) the set of constraint vertices, one vertex c for each constraint in the MIP formulation.
3) a single objective function vertex o.
Edges:
1) v-c edge: there exists an edge between a variable vertex v and a constraint vertex c if the corresponding variable has a nonzero coefficient in the corresponding constraint of the MIP formulation.
2) o-v edge: for each variable vertex v, there exists an edge between the objective vertex o and v.
3) o-c edge: there exists an edge between o and a constraint vertex c if the corresponding constraint is active in the MIP’s root LP relaxation.
Remark.
The presented trigraph representation not only captures connections among the variables, constraints and objective function, but also maintains the detailed coefficient numerics in its structure. In particular, nonzero entries of the coefficient matrix $A$ are included as features of v-c edges, entries of the objective coefficients $c$ as features of o-v edges, and entries of the RHS coefficients $b$ as features of o-c edges. Note that the constraint RHS coefficients are correlated with the objective function when viewing the LP relaxation of the MIP from a dual perspective.
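To make the construction concrete, the following is a minimal sketch of building the trigraph from raw MIP data $(A, b, c)$. The function name, the node-key convention `("v", j)` / `("c", i)` / `("o", 0)`, and the `active_rows` argument are illustrative assumptions, not part of the paper's implementation.

```python
from collections import defaultdict

def build_trigraph(A, b, c, active_rows):
    """Sketch: tripartite graph from MIP data.
    A: constraint matrix as a list of rows, b: RHS vector, c: objective vector,
    active_rows: indices of constraints active at the root LP relaxation (assumed given).
    Returns adjacency: edges[u][w] -> edge feature dict."""
    edges = defaultdict(dict)
    n_vars, n_cons = len(c), len(A)
    for i in range(n_cons):
        for j in range(n_vars):
            if A[i][j] != 0:                       # v-c edge carries the matrix entry
                edges[("v", j)][("c", i)] = {"coef": A[i][j]}
    for j in range(n_vars):                        # o-v edge carries the objective entry
        edges[("v", j)][("o", 0)] = {"coef": c[j]}
    for i in active_rows:                          # o-c edge carries the RHS entry
        edges[("c", i)][("o", 0)] = {"coef": b[i]}
    return edges
```

Note how two variable vertices are never adjacent directly; they interact only through shared constraint vertices, matching the distance-two neighborhood described above.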
3.2 Solution Prediction for MIP
We describe in Algorithm 1 the overall forward propagation prediction procedure based on the trigraph. The model has three stages: 1) a projection layer of the same embedding size for each node, so that all node features have the same dimension (lines 1-3 in the algorithm); 2) a graph attention network that passes node information among connected nodes (lines 4-10); 3) two fully-connected layers between the variable nodes and the output layer after several rounds of information transitions (line 11).
Node representations in the tripartite graph are updated via a four-step procedure. In the first step (line 5 in Algorithm 1), the objective node aggregates the representations of all variable nodes to update its own representation. In the second step (line 7), the variable and objective representations are used to update the representation of each neighboring constraint node. In the third step (line 8), the constraint representations are aggregated to update the objective representation, while in the fourth step (line 10), the constraint and objective representations are combined to update each variable representation. See Fig. 2 for an illustration of the information transition flow in the trigraph.
The intuition behind Algorithm 1 is that at each iteration, a variable node incrementally gathers more aggregated information from its neighboring nodes, which correspond to the related constraints and variables in the MIP formulation. It is worth mentioning that the above transitions only extract connection relations among the nodes, ignoring the detailed coefficient numerics $A$, $b$ and $c$ in the MIP. To enhance the representational ability of our model, we include an attention mechanism that imports information from the coefficient values.
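The four-step update can be sketched as follows. This is a bare-bones illustration only: plain mean aggregation stands in for the learned (attention-weighted) aggregators of Algorithm 1, and the function signature is a hypothetical simplification.

```python
def embed_step(h_v, var_in_cons):
    """One round of the four-step trigraph update, with mean aggregation as a
    stand-in for the learned aggregators.
    h_v: dict variable -> embedding vector,
    var_in_cons: dict constraint -> list of variables with nonzero coefficient."""
    mean = lambda vecs: [sum(x) / len(vecs) for x in zip(*vecs)]
    # Step 1: the objective node aggregates all variable embeddings.
    h_o = mean(list(h_v.values()))
    # Step 2: each constraint aggregates its variables plus the objective.
    h_c = {i: mean([h_v[j] for j in js] + [h_o]) for i, js in var_in_cons.items()}
    # Step 3: the objective aggregates the updated constraint embeddings.
    h_o = mean(list(h_c.values()))
    # Step 4: each variable aggregates the constraints it appears in, plus the objective.
    h_v = {j: mean([h_c[i] for i, js in var_in_cons.items() if j in js] + [h_o])
           for j in h_v}
    return h_v, h_c, h_o
```

Stacking several such rounds lets information from a variable reach variables two hops away through shared constraints, which is exactly the correlation the trigraph is designed to expose.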
Attention Mechanism:
A distinct feature of the tripartite graph structure is the heterogeneity of its nodes and arcs. Rather than using a shared linear transformation (i.e., weight matrix) for all nodes velivckovic2017graph , we consider a different transformation in each step of the graph embedding updates, reflecting the importance of the features of one type of node to another. In particular, given a node $i$ of type $t$ and a neighboring node $j$ of type $t'$, the attention coefficient indicating the importance of node $j$ to node $i$ is computed as

(5)  $\alpha_{ij} = h_i^{\top} W_{t t'} \, [\, h_j \,\|\, e_{ij} \,]$

where $h_i$, $h_j$ and $e_{ij}$ are the embeddings of node $i$, node $j$ and edge $(i, j)$ respectively, $\|$ denotes concatenation, and $W_{t t'}$ is the attention weight matrix between nodes of type $t$ and $t'$. For each node $i$, the attention coefficients are normalized across all its neighbors using a softmax function. With this attention mechanism, the MIP coefficient information in $A$, $b$ and $c$ (all contained in the feature vectors of the edges) is incorporated to reflect the importance of each edge connection in the graph.
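A small sketch of this edge-aware attention follows. The bilinear form used for the raw score is our reconstruction of the blanked Eq. (5) from the prose description, so treat the exact scoring function as an assumption; the softmax normalization over neighbors is as stated in the text.

```python
import math

def attention_coeffs(h_i, neighbors, W):
    """Edge-aware attention sketch: raw score = h_i^T W [h_j ; e_ij],
    normalized over neighbors with a softmax.
    neighbors: dict j -> (h_j, e_ij); W: matrix (list of rows) of shape
    len(h_i) x (len(h_j) + len(e_ij))."""
    def score(h_j, e_ij):
        z = h_j + e_ij                                   # concatenation [h_j ; e_ij]
        Wz = [sum(w * x for w, x in zip(row, z)) for row in W]
        return sum(a * b for a, b in zip(h_i, Wz))       # bilinear form h_i^T (W z)
    raw = {j: score(h_j, e) for j, (h_j, e) in neighbors.items()}
    m = max(raw.values())
    exp = {j: math.exp(s - m) for j, s in raw.items()}   # numerically stable softmax
    total = sum(exp.values())
    return {j: v / total for j, v in exp.items()}
```

Because the edge feature `e_ij` enters the score, two neighbors with identical embeddings but different MIP coefficients receive different attention weights, which is the mechanism's whole point.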
3.3 PredictionBased MIP Solving
Next, we introduce how the solution value prediction results are utilized to improve MIP solving performance. Our approach adds a local branching type initial cut to the MIP model to reduce the search space of feasible solutions. The method aims to identify decision variables that are predictable and stable, and to guide the B&B tree search to concentrate on the unpredictable variables, which accelerates convergence while maintaining near-optimality.
Let $\hat{x}_j$ denote the predicted solution value of binary variable $x_j$, and let $\mathcal{S} \subseteq \mathcal{B}$ denote a subset of indices of binary variables. A local branching initial cut to the model is defined as:

(6)  $\sum_{j \in \mathcal{S}: \hat{x}_j = 0} x_j \;+\; \sum_{j \in \mathcal{S}: \hat{x}_j = 1} (1 - x_j) \;\le\; \Delta$

where $\Delta$ is a problem parameter that controls the maximum distance from a new solution $x$ to the predicted solution $\hat{x}$. Adding the cut with respect to the subset $\mathcal{S}$ rather than all of $\mathcal{B}$ is due to the unpredictable nature of unstable variables in MIP solutions; therefore, only variables with a high probability of taking value 0 or 1 are included in $\mathcal{S}$. It is worth mentioning that in the extreme case where $\Delta$ equals 0, the initial cut is equivalent to fixing the variables with indices in $\mathcal{S}$ at their predicted values.
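The local branching cut above is linear in $x$ and can be assembled mechanically from the rounded predictions. A solver-agnostic sketch (variable and parameter names are illustrative):

```python
def local_branching_cut(pred, subset, delta):
    """Build the local branching cut as (coeffs, constant, rhs) so that
    sum_j coeffs[j]*x_j + constant <= rhs encodes
    sum_{j: pred[j]=0} x_j + sum_{j: pred[j]=1} (1 - x_j) <= delta.
    pred: index -> rounded predicted value in {0, 1}; subset: indices included in the cut."""
    coeffs, constant = {}, 0
    for j in subset:
        if pred[j] == 0:
            coeffs[j] = 1       # flipping a predicted-0 variable to 1 costs 1
        else:
            coeffs[j] = -1      # (1 - x_j): flipping a predicted-1 variable to 0 costs 1
            constant += 1
    return coeffs, constant, delta

def cut_distance(coeffs, constant, x):
    """Hamming distance of solution x from the prediction, restricted to the subset."""
    return constant + sum(a * x[j] for j, a in coeffs.items())
```

With `delta = 0` the cut forces `cut_distance` to zero, i.e., it fixes every variable in the subset at its predicted value, matching the extreme case discussed above.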
4 Data Collection
Features: An ideal feature collection procedure should capture sufficient information to describe the solution process, while remaining of low computational complexity. A good trade-off between these two concerns is to collect features at the root node of the B&B tree, where the problem has been presolved to eliminate redundant variables and constraints and the LP relaxation has been solved. In particular, we collect for each instance 3 types of problem features: variable features, constraint features and edge features. Feature descriptions are summarized in Table 1 in Appendix A.
As presented in the feature table, features of variables and constraints can be divided into three categories: basic features, LP features and structure features. The structure features (most of which can be found in alvarez2017machine ; khalil2016learning ) are usually hand-crafted statistics that reflect correlations between variables and constraints. Notably, our tripartite graph neural network model can capture these correlations naturally, without human expertise, and can generate more advanced structure information to improve prediction accuracy. This will be verified in the computational evaluations section.
Labels: To make predictions on the solution values of binary variables, an intuitive labeling scheme is to label each variable with its optimal solution value. Note, however, that obtaining optimal solutions for medium- or large-scale MIP instances can be very time-consuming or even impossible. This implies that labeling with optimal solutions can only be applied to solvable MIP instances, which limits the applicability of the proposed framework.
Instead of collecting optimal solutions, we propose to identify stable and unstable variables in solutions. This is motivated by the observation that the solution values of the majority of binary decision variables remain unchanged across a series of different feasible solutions. To be specific, given a set of solutions $\{\bar{x}^1, \ldots, \bar{x}^K\}$ to an MIP instance, a binary variable $x_j$ is defined as unstable if there exist $k, k'$ such that $\bar{x}^k_j \ne \bar{x}^{k'}_j$, and as stable otherwise. Although obtaining optimal solutions might be difficult, the stable variables can be viewed as an easy-to-predict part of the (near) optimal solutions. To generate a sequence of solutions to an instance, we use the so-called proximity search method fischetti2014proximity . Starting from some initial solution $\bar{x}$ with objective value $c^{\top}\bar{x}$, a neighboring solution with an objective value improvement of at least $\theta$ can be generated by solving the following optimization problem:
$$\begin{aligned}
\min \quad & \Delta(x, \bar{x}) = \sum_{j \in \mathcal{B}: \bar{x}_j = 0} x_j + \sum_{j \in \mathcal{B}: \bar{x}_j = 1} (1 - x_j) & (7)\\
\text{s.t.} \quad & c^{\top} x \le c^{\top} \bar{x} - \theta & (8)\\
& Ax \le b & (9)\\
& x_j \in \{0, 1\}, \;\; \forall j \in \mathcal{B}, \qquad x_j \in \mathbb{Z}^{+}, \;\; \forall j \in \mathcal{G}, \qquad x_j \in \mathbb{R}^{+}, \;\; \forall j \in \mathcal{C} & (10)
\end{aligned}$$
where the objective function represents the distance between $\bar{x}$ and a new solution $x$. Note that the above optimization is computationally tractable in practice, since the solving process can terminate as soon as a feasible solution is found. By iteratively applying this method, we obtain a series of improving feasible solutions to the original problem. Stable binary variables are labeled with their solution values, while unstable variables are marked as such and discarded from the training set.
Remark.
The logic behind the stable variable labelling scheme is to explore local optimality patterns rather than global optimality. In each iteration of proximity search, a neighboring better solution is found, with a few flips of the solution values of the binary variables. Performing such local search steps for many rounds can identify local minimum patterns that reflect domain knowledge of the CO problem. Take the Traveling Salesman Problem (TSP) as an example, and let a binary variable $x_{ij}$ define whether node $j$ is visited immediately after node $i$. If nodes $i$ and $j$ are geometrically far away from each other, $x_{ij}$ is likely to be zero in all the solutions generated by proximity search and is recorded as stable by our labelling scheme, which reflects the underlying local optimality knowledge for TSP.
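Given the pool of improving solutions produced by iterated proximity search, the labelling step itself reduces to a scan over the pool. A minimal sketch (function and argument names are illustrative):

```python
def label_stable_variables(solutions, binary_idx):
    """Label binary variables from a pool of feasible solutions.
    A variable whose value never changes across the pool is 'stable' and
    labeled with that common value; unstable variables get no label and
    are excluded from the training set.
    solutions: list of dicts index -> value in {0, 1}; binary_idx: indices to label."""
    labels = {}
    for j in binary_idx:
        values = {sol[j] for sol in solutions}
        if len(values) == 1:          # stable: identical value in every solution
            labels[j] = values.pop()
        # else: unstable, discarded from training
    return labels
```

Note that no optimal solution is needed anywhere in this scheme; the labels only encode agreement across a sequence of improving feasible solutions.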
5 Experimental Evaluations
Setup. To evaluate the proposed framework, we modify the state-of-the-art open source MIP solver SCIP 6.0.0 achterberg2009scip for data collection and solution quality comparison. The GCN model is built using the TensorFlow API. All experiments were conducted on a cluster of three 4-core machines with Intel 2.2 GHz processors and 16 GB RAM.
Instances. To test the effectiveness and generality of our solution prediction method, we generate problem instances of 8 distinct types: Fixed Charge Network Flow (FCNF), Capacitated Facility Location (CFL), Generalized Assignment (GA), Maximal Independent Set (MIS), Multidimensional Knapsack (MK), Set Covering (SC), Traveling Salesman Problem (TSP) and Vehicle Routing Problem (VRP). These are among the most commonly studied NP-hard combinatorial optimization problems in OR, and they form a general test bed because they differ significantly in MIP structure and optimal solution structure. For each problem type, 200 MIP instances of similar scales are generated. The detailed MIP formulation and instance parameters of each type are included in Appendix B.
Data collection.
For feature collection, we implemented a feature extraction plugin embedded in the branching procedure of SCIP. In particular, variable features, constraint features and edge features are collected right before the first branching decision is made at the root node, where the presolving process and the root LP relaxation have completed. No further exploration of the B&B search tree is needed, and thus the feature collection process terminates at the root node. Construction of the tripartite graph is also completed at the root node, where SCIP works with a transformed MIP model from which redundant variables and constraints have been removed. For label collection, we applied the proximity search method to label stable binary variables, with each proximity search iteration terminating as soon as a feasible solution is found.
5.1 Results of Solution Prediction
We demonstrate the effectiveness of the proposed GCN model on solution prediction accuracy against two classical classifiers: Logistic Regression (LR) scikitlearn and XGBoost (XGB) chen2016xgboost ^{2}. Noting that the solution values of binary variables are usually highly imbalanced (due to the one-hot encoding convention in MIP formulations of CO problems), we use the Average Precision (AP) metric zhu2004recall to compare the results of the different classifiers.

^{2}For LR and XGB, only variable features are used for prediction, since these models do not maintain the MIP formulation structure.

            Basic features        Basic&structure features        All features
Instances  LR  XGB  GCN  LR  XGB  GCN  LR  XGB  GCN
FCNF  0.0888  0.0993  0.2614  0.2291  0.2749  0.3169  0.7852  0.7867  0.7880 
CFL  0.2939  0.4485  0.5899  0.2884  0.5669  0.6286  0.8458  0.8460  0.8500 
GA  0.4046  0.4991  0.7440  0.6325  0.7495  0.7969  0.9188  0.9362  0.9373 
MIS  0.2967  0.2816  0.3545  0.2988  0.2885  0.3366  0.3220  0.2966  0.3254 
MK  0.5614  0.5242  0.8402  0.8034  0.8077  0.8427  0.9317  0.9240  0.9270 
SC  0.7441  0.7470  0.7482  0.7326  0.7480  0.7532  0.9357  0.9588  0.9590 
TSP  0.3295  0.3265  0.3584  0.3464  0.3488  0.3526  0.4139  0.4009  0.4132 
VRP  0.3825  0.3907  0.4027  0.4003  0.4203  0.4241  0.4495  0.4373  0.4587 
Average  0.3877  0.4146  0.5374  0.4664  0.5256  0.5564  0.7003  0.6983  0.7073 
Table 1 reports the AP value comparison for the three classifiers under three settings: using only basic features, using basic & structure features, and using all features. The proposed GCN model outperforms the two baseline classifiers in all settings. The performance advantage is particularly significant in the basic features columns (0.5374 by GCN against 0.3877 by LR and 0.4146 by XGB), where only raw coefficient numerics of the MIP are used for prediction. Another notable statistic in the table is that the GCN model with only basic features is on average superior to the LR and XGB models with basic & structure features, indicating that the proposed embedding framework extracts more information than the hand-crafted structure features used in the literature alvarez2017machine ; khalil2016learning . In the all features columns, the advantage of GCN is less significant, because high-level MIP structure information is also captured by the LP features.
To illustrate the predictability of solution values for problems of different types, we present in Fig. 3 the detailed accuracy curves for each of the 8 problem types using the GCN model with all features. Fig. 3(a) shows the standard precision-recall curves from which the statistics in Table 1 are calculated. Fig. 3(b) depicts the prediction accuracy when we only predict a certain percentage of the most predictable binary variables, i.e., those for which the probability output by the GCN is most extreme. It can be observed that the solution values of most of the considered MIP problems are fairly predictable, with 100% accuracy if we make solution value predictions on only the top 50-80% most predictable variables. Among the tested problem types, the solution values of MIS instances are the least predictable, which is consistent with the intuition that a maximum-cardinality independent set in a graph may not be unique and can hardly be identified from the local optimality information in our labelling scheme.
5.2 Comparisons of solution quality
To evaluate the value of incorporating solution predictions in MIP solving, we compare the performance of prediction-based MIP solving against the solver’s default setting. For each problem type, 10 testing MIP instances (of similar scales and with the same coefficient distribution) are generated. Since optimal solutions or tight lower bounds are difficult to obtain within a reasonable time limit, we use the primal gap metric and the primal integral metric khalil2017learning to capture solver performance. The primal gap metric reports the relative gap between the objective value of the current solution and that of the best known solution, and the primal integral records the average primal gap over the solution process.
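Both metrics can be computed from the incumbent trajectory alone. The sketch below uses one common definition (gap held piecewise-constant between incumbent updates, 100% before the first incumbent); the exact normalization conventions here are assumptions, not necessarily the paper's.

```python
def primal_gap(obj, best_known, eps=1e-10):
    """Relative gap (in %) of a solution objective to the best known objective."""
    if abs(best_known) < eps:
        return 0.0 if abs(obj - best_known) < eps else 100.0
    return 100.0 * abs(obj - best_known) / abs(best_known)

def primal_integral(trajectory, horizon, best_known):
    """Average primal gap over [0, horizon].
    trajectory: list of (time, incumbent objective) pairs, sorted by time."""
    total, prev_t, prev_gap = 0.0, 0.0, 100.0   # gap is 100% before the first incumbent
    for t, obj in trajectory:
        total += prev_gap * (t - prev_t)        # gap stays constant until the next incumbent
        prev_t, prev_gap = t, primal_gap(obj, best_known)
    total += prev_gap * (horizon - prev_t)
    return total / horizon
```

For example, a run that finds its first incumbent late but immediately near-optimal can have a worse primal integral than a run that finds a mediocre solution early, which is why the integral complements the final primal gap.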
Primal Gap (%)  Primal Integral  

DEF1^{*}  DEF2  DEF5  DEF10  GCN  DEF1  DEF2  DEF5  DEF10  GCN  
FCNF  4.585  3.991  0.782  0.327  0.000  444.6  360.6  203.0  186.9  9.8 
CFL  1.856  1.271  0.280  0.050  0.233  163.7  123.0  101.0  84.9  37.1 
GA  5.207  4.972  4.755  3.901  0.000  689.3  548.8  426.6  357.3  135.1 
MIS  6.571  6.522  2.174  1.087  2.174  509.5  366.5  256.8  224.1  116.9 
MK  0.035  0.035  0.035  0.035  0.000  2.9  2.7  2.7  2.7  1.5 
SC  0.595  0.314  0.129  0.129  0.611  61.7  55.0  38.8  38.8  45.7 
TSP  7.344  3.804  1.894  0.592  0.099  843.3  668.0  488.7  346.3  190.6 
VRP  2.777  1.710  1.710  1.710  0.694  265.1  154.6  154.4  154.4  32.9 
Average  3.484  2.661  1.568  1.072  0.544  362.2  274.1  209.9  172.6  80.0 

DEF1 corresponds to the solver’s default setting with a 1000-second execution time limit. Similarly, DEF2, DEF5 and DEF10 correspond to the solver’s default setting with 2000-, 5000- and 10000-second execution time limits respectively.
Table 2 collects the solution quality results obtained by the proposed method under a 1000-second execution time limit. To demonstrate the significance of the performance improvement, we allow the solver to run for 2-10 times that execution time (i.e., 2000-10000 seconds) in its default setting. The table reveals that the proposed method (the GCN column) gains a remarkable advantage over the solver’s default setting with the same time limit (the DEF1 column) on all testing problems. Compared to the default setting with 10000 seconds (the DEF10 column), the proposed framework still maintains a better average performance, indicating a tenfold acceleration in solution finding.
5.3 Generalization to larger instances
The graph embedding framework allows the model to be trained and tested on MIP instances of different scales. This is important for MIP solving, since there is hardly any good strategy for handling large-scale NP-hard MIP problems directly. To investigate this, we train our GCN model on small-scale MIP instances and test its applicability on large-scale ones. Detailed statistics of the small and large instances are reported in Appendix A. Table 3 reveals that the GCN model suffers only an acceptable degradation in prediction accuracy when the problem scale differs between the training and testing phases. In addition, the prediction results remain useful for improving solver performance.
FCNF  CFL  GA  MIS  MK  SC  TSP  VRP  Average  
Average Precision  GCN  0.653  0.837  0.963  0.091  0.789  0.878  0.396  0.358  0.621 
GCNG^{*}  0.675  0.801  0.873  0.104  0.786  0.843  0.343  0.321  0.593  
Primal Gap (%)  DEF  4.585  1.942  4.633  3.409  0.035  0.498  6.895  2.626  3.078 
GCN  0.000  0.913  0.480  0.000  0.000  0.513  1.726  0.539  0.521  
GCNG  2.119  0.000  0.000  0.000  0.000  0.390  4.300  0.910  0.965 

GCNG is the GCN model trained on small-scale MIP instances and applied to the large-scale testing instances.
6 Conclusions
We presented a supervised solution prediction framework that explores the correlations between an MIP formulation’s structure and its local optimality patterns. The key feature of the model is a tripartite graph representation for MIP, based on which graph embedding is used to extract connection information among variables, constraints and the objective function. Through extensive experimental evaluations on 8 distinct types of MIP problems, we demonstrate the effectiveness and generality of the GCN model in prediction accuracy. Incorporated as an initial cut in the MIP model, the prediction results help to accelerate SCIP’s solution finding process tenfold on similar problems. This result is encouraging for practitioners who routinely face large-scale MIPs for which the solver’s execution time is unacceptably long.
References
 [1] Tobias Achterberg. SCIP: solving constraint integer programs. Mathematical Programming Computation, 1(1):1–41, 2009.
 [2] Alejandro Marcos Alvarez, Quentin Louveaux, and Louis Wehenkel. A machine learning-based approximation of strong branching. INFORMS Journal on Computing, 29(1):185–195, 2017.
 [3] Yoshua Bengio, Andrea Lodi, and Antoine Prouvost. Machine learning for combinatorial optimization: a methodological tour d’horizon. arXiv preprint arXiv:1811.06128, 2018.
 [4] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794. ACM, 2016.
 [5] Hanjun Dai, Elias Khalil, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems, pages 6348–6358, 2017.
 [6] Matteo Fischetti and Michele Monaci. Proximity search for 0-1 mixed-integer convex programming. Journal of Heuristics, 20(6):709–731, 2014.
 [7] He He, Hal Daume III, and Jason M Eisner. Learning to search in branch and bound algorithms. In Advances in neural information processing systems, pages 3293–3301, 2014.
 [8] Elias B Khalil, Bistra Dilkina, George L Nemhauser, Shabbir Ahmed, and Yufen Shao. Learning to run heuristics in tree search. In 26th International Joint Conference on Artificial Intelligence (IJCAI), 2017.
 [9] Elias Boutros Khalil, Pierre Le Bodic, Le Song, George L Nemhauser, and Bistra N Dilkina. Learning to branch in mixed integer programming. In AAAI, pages 724–731, 2016.
 [10] Wouter Kool, Herke van Hoof, and Max Welling. Attention, learn to solve routing problems! arXiv preprint arXiv:1803.08475, 2018.
 [11] W. W. M. Kool and M. Welling. Attention solves your TSP. arXiv preprint arXiv:1803.08475, 2018.
 [12] Markus Kruber, Marco E Lübbecke, and Axel Parmentier. Learning when to use a decomposition. In International Conference on AI and OR Techniques in Constraint Programming for Combinatorial Optimization Problems, pages 202–210. Springer, 2017.
 [13] Zhuwen Li, Qifeng Chen, and Vladlen Koltun. Combinatorial optimization with graph convolutional networks and guided tree search. In Advances in Neural Information Processing Systems, pages 537–546, 2018.
 [14] Alejandro Marcos Alvarez, Louis Wehenkel, and Quentin Louveaux. Online learning for strong branching approximation in branchandbound. 2016.
 [15] MIPLIB 2017, 2018. http://miplib.zib.de.
 [16] Mohammadreza Nazari, Afshin Oroojlooy, Lawrence Snyder, and Martin Takác. Reinforcement learning for solving the vehicle routing problem. In Advances in Neural Information Processing Systems, pages 9839–9849, 2018.
 [17] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
 [18] Daniel Selsam, Matthew Lamm, Benedikt Bünz, Percy Liang, Leonardo de Moura, and David L Dill. Learning a sat solver from singlebit supervision. arXiv preprint arXiv:1802.03685, 2018.
 [19] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
 [20] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In Advances in Neural Information Processing Systems, pages 2692–2700, 2015.
 [21] Mu Zhu. Recall, precision and average precision. Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, 2:30, 2004.
Appendix A
In this appendix, we describe the variable node features, constraint node features, and edge features in detail. All features are collected at the root node of the B&B search tree, after presolving and the root LP relaxation have completed.
Variable Features (feature description: count)

Basic:
- variable type (is binary, is general integer): 2
- objective function coefficients (original, positive, negative): 3
- number of nonzero coefficients in the constraints: 1
- number of up (down) locks: 2

LP:
- LP value (…): 3
- LP value is fractional: 1
- pseudocosts (upwards, downwards, ratio, sum, product): 5
- global lower (upper) bound: 2
- reduced cost: 1

Structure:
- …: 4
- maximum (minimum) ratio of positive (negative) LHS (RHS) values: 8
- …: 10
- …: 15
Constraint Features (feature description: count)

Basic:
- …: 12
- left-hand side (LHS) and right-hand side (RHS): 2
- number of nonzero (positive, negative) entries in the constraint: 3

LP:
- dual solution of the constraint: 1
- basis status of the constraint in the LP solution: 1

Structure:
- sum norm of absolute (positive, negative) values of coefficients: 3
- variable coefficient statistics in the constraint (mean, stdev, min, max): 4

Edge Features (feature description: count)

Basic:
- original edge coefficient: 1
- normalized edge coefficient: 1
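As a concrete illustration of the "Basic" variable features above, the sketch below computes the objective-coefficient splits, nonzero counts, and up/down locks directly from a row-wise coefficient matrix. The function name `variable_features` and the plain-list encoding are illustrative assumptions; the paper's actual feature extraction reads these quantities from the solver at the root node.

```python
def variable_features(A, c, senses):
    """Sketch of a few 'Basic' variable features for a MIP whose rows read
    row_i . x  (<= or >=)  rhs_i.  (Hypothetical helper, not the paper's code.)"""
    feats = []
    for j in range(len(c)):
        # number of constraint rows in which variable j appears
        nnz = sum(1 for row in A if row[j] != 0)
        # up lock: a row that can become violated by increasing x_j,
        # i.e. a positive coefficient in a <= row or a negative one in a >= row
        up = sum(1 for i, row in enumerate(A)
                 if (row[j] > 0 and senses[i] == "<=")
                 or (row[j] < 0 and senses[i] == ">="))
        # down lock: the symmetric condition for decreasing x_j
        down = sum(1 for i, row in enumerate(A)
                   if (row[j] < 0 and senses[i] == "<=")
                   or (row[j] > 0 and senses[i] == ">="))
        feats.append({"obj": c[j],
                      "obj_pos": max(c[j], 0.0),   # positive part of the objective coefficient
                      "obj_neg": min(c[j], 0.0),   # negative part
                      "nnz": nnz,
                      "up_locks": up,
                      "down_locks": down})
    return feats
```

An up lock counts constraints that may block rounding the variable up, and a down lock those that may block rounding it down; both are cheap static indicators of how constrained a variable is.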
Appendix B
In this appendix, we describe the MIP formulation for each of the 8 types of CO problems in the paper. Tables 5 and 6 below summarize, for each problem type, the number of variables, the number of constraints, the fraction of nonzeros in the coefficient matrix, and the fraction of binary variables that take value 1 in the optimal solution.
Table 5:
Problem | #Variables | #Constraints | Fraction of nonzeros | Fraction of binaries at 1
FCNF | [2398, 3568] | [1402, 2078] | 0.00118 | 0.02913
CFL | [28956, 28956] | [29336, 29336] | 0.00014 | 0.01358
GA | [22400, 22400] | [600, 600] | 0.00333 | 0.02500
MIS | [400, 400] | [19153, 19713] | 0.00502 | 0.05815
MK | [765, 842] | [46, 51] | 1.00000 | 0.02684
SC | [4500, 4500] | [3500, 3500] | 0.03000 | 0.02602
TSP | [17689, 19600] | [17954, 19879] | 0.00031 | 0.00737
VRP | [1764, 1764] | [1802, 1802] | 0.00314 | 0.02963
Table 6:
Problem | #Variables | #Constraints | Fraction of nonzeros | Fraction of binaries at 1
FCNF | [1702, 2640] | [996, 1533] | 0.00163 | 0.03392
CFL | [1212, 1212] | [1312, 1312] | 0.00303 | 0.08624
GA | [1152, 1152] | [108, 108] | 0.01852 | 0.08333
MIS | [125, 125] | [1734, 1929] | 0.01620 | 0.14572
MK | [315, 350] | [19, 21] | 1.00000 | 0.07113
SC | [750, 750] | [550, 550] | 0.05000 | 0.06533
TSP | [1296, 1600] | [1367, 1679] | 0.00387 | 0.02697
VRP | [196, 196] | [206, 206] | 0.02422 | 0.08069
6.1 Fixed Charge Network Flow:
Consider a directed graph $G = (V, A)$, where each node $i \in V$ has a demand $b_i$ and demand is balanced over the graph: $\sum_{i \in V} b_i = 0$. Each arc $(i, j) \in A$ has capacity $u_{ij}$; using the arc incurs a fixed cost $f_{ij}$, and each unit of flow on it incurs a variable cost $c_{ij}$.
Decision variables:
- $y_{ij}$: binary variable. $y_{ij} = 1$ if arc $(i, j)$ is used, and $y_{ij} = 0$ otherwise.
- $x_{ij}$: continuous variable, flow quantity on arc $(i, j)$.
Formulation:
$\min \sum_{(i,j) \in A} (f_{ij} y_{ij} + c_{ij} x_{ij})$  (11)
s.t. $\sum_{j : (i,j) \in A} x_{ij} - \sum_{j : (j,i) \in A} x_{ji} = b_i, \quad \forall i \in V$  (12)
$x_{ij} \le u_{ij} y_{ij}, \quad \forall (i,j) \in A$  (13)
$x_{ij} \ge 0, \; y_{ij} \in \{0, 1\}, \quad \forall (i,j) \in A$  (14)
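To make the flow-balance and capacity-linking constraints concrete, here is a small checker that evaluates a candidate FCNF solution on a toy network. The helper name `fcnf_objective_and_feasible` and the dictionary-based data layout are illustrative assumptions, not code from the paper; $y_{ij}$ is derived as 1 whenever the arc carries positive flow.

```python
def fcnf_objective_and_feasible(node_demand, arcs, x, tol=1e-9):
    """node_demand: dict node -> b_i (net supply, out-flow minus in-flow);
    arcs: dict (i, j) -> (capacity u, fixed cost f, unit cost c);
    x: dict (i, j) -> flow.  Returns (feasible, objective)."""
    # flow balance at every node: out-flow - in-flow == b_i
    for i, b in node_demand.items():
        out_f = sum(x[a] for a in arcs if a[0] == i)
        in_f = sum(x[a] for a in arcs if a[1] == i)
        if abs(out_f - in_f - b) > tol:
            return False, None
    obj = 0.0
    for a, (u, f, c) in arcs.items():
        # capacity and nonnegativity of the flow on each arc
        if x[a] < -tol or x[a] > u + tol:
            return False, None
        y = 1 if x[a] > tol else 0  # arc is "used" iff it carries flow
        obj += f * y + c * x[a]
    return True, obj
```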
6.2 Capacitated Facility Location:
Suppose there are $m$ facilities and $n$ customers, and we wish to satisfy the customers' demand at minimum cost. Let $f_i$ denote the fixed cost of building facility $i$ and $c_{ij}$ the cost of shipping all of customer $j$'s demand from facility $i$. The demand of customer $j$ is assumed to be $d_j$ and the capacity of facility $i$ is assumed to be $u_i$.
Decision variables:
- $y_i$: binary variable. $y_i = 1$ if facility $i$ is built, and $y_i = 0$ otherwise.
- $x_{ij}$: continuous variable, the fraction of customer $j$'s demand fulfilled by facility $i$.
Formulation:
$\min \sum_{i=1}^{m} f_i y_i + \sum_{i=1}^{m} \sum_{j=1}^{n} c_{ij} x_{ij}$  (15)
s.t. $\sum_{i=1}^{m} x_{ij} = 1, \quad \forall j$  (16)
$\sum_{j=1}^{n} d_j x_{ij} \le u_i y_i, \quad \forall i$  (17)
$x_{ij} \ge 0, \quad \forall i, j$  (18)
$y_i \in \{0, 1\}, \quad \forall i$  (19)
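The sketch below evaluates a candidate CFL solution $(y, x)$ against the demand and capacity constraints and returns its cost, or None when infeasible. The function name `cfl_cost` and the nested-list encoding are illustrative assumptions, not the paper's code.

```python
def cfl_cost(f, c, d, u, y, x, tol=1e-9):
    """f[i]: fixed cost of facility i; c[i][j]: cost of fully serving
    customer j from facility i; d[j]: demand; u[i]: capacity;
    y[i] in {0, 1}; x[i][j]: fraction of customer j served by facility i."""
    m, n = len(f), len(d)
    # every customer's demand must be fully met
    for j in range(n):
        if abs(sum(x[i][j] for i in range(m)) - 1) > tol:
            return None
    # an unopened facility (y[i] = 0) must serve nothing; open ones obey capacity
    for i in range(m):
        if sum(d[j] * x[i][j] for j in range(n)) > u[i] * y[i] + tol:
            return None
    return (sum(f[i] * y[i] for i in range(m))
            + sum(c[i][j] * x[i][j] for i in range(m) for j in range(n)))
```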
6.3 Generalized Assignment:
Suppose there are $n$ tasks and $m$ agents, and we wish to assign the tasks to agents to maximize total revenue. Let $p_{ij}$ denote the revenue of assigning task $j$ to agent $i$ and $w_{ij}$ the resource consumed, $i = 1, \dots, m$, $j = 1, \dots, n$. The total resource of agent $i$ is assumed to be $t_i$.
Decision variables:
- $x_{ij}$: binary variable. $x_{ij} = 1$ if task $j$ is assigned to agent $i$, and $x_{ij} = 0$ otherwise.
Formulation:
$\max \sum_{i=1}^{m} \sum_{j=1}^{n} p_{ij} x_{ij}$  (20)
s.t. $\sum_{i=1}^{m} x_{ij} \le 1, \quad \forall j$  (21)
$\sum_{j=1}^{n} w_{ij} x_{ij} \le t_i, \quad \forall i$  (22)
$x_{ij} \in \{0, 1\}, \quad \forall i, j$  (23)
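For intuition, a brute-force solver for toy generalized assignment instances can simply enumerate every task-to-agent mapping (allowing a task to stay unassigned, since the assignment constraint is an inequality). The helper name `solve_gap` is an illustrative assumption; this enumeration is exponential and is not how the paper solves these instances.

```python
from itertools import product

def solve_gap(p, w, t):
    """Brute-force the (maximization) generalized assignment problem.
    p[i][j]: revenue of giving task j to agent i; w[i][j]: resource used;
    t[i]: agent i's resource budget.  Only viable for toy instances."""
    m, n = len(p), len(p[0])
    best, best_assign = 0, None
    # each task chooses an agent in {0..m-1} or None (unassigned)
    for assign in product([None] + list(range(m)), repeat=n):
        load = [0] * m
        rev = 0
        ok = True
        for j, i in enumerate(assign):
            if i is None:
                continue
            load[i] += w[i][j]
            rev += p[i][j]
            if load[i] > t[i]:      # agent i's budget exceeded
                ok = False
                break
        if ok and rev > best:
            best, best_assign = rev, assign
    return best, best_assign
```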
6.4 Maximal Independent Set:
Consider an undirected graph $G = (V, E)$. A subset of nodes $S \subseteq V$ is called an independent set iff there is no edge between any pair of nodes in $S$. The maximal independent set problem is to find an independent set in $G$ of maximum cardinality.
Decision variables:
- $x_i$: binary variable. $x_i = 1$ if node $i$ is chosen in the independent set, and 0 otherwise.
Formulation:
$\max \sum_{i \in V} x_i$  (24)
s.t. $x_i + x_j \le 1, \quad \forall (i, j) \in E$  (25)
$x_i \in \{0, 1\}, \quad \forall i \in V$  (26)
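A minimal brute-force sketch of this problem enumerates node subsets from largest to smallest and returns the first one with no internal edge, i.e. the first subset satisfying every constraint $x_i + x_j \le 1$. The function name is an illustrative assumption, and the enumeration is only viable for toy graphs.

```python
from itertools import combinations

def max_independent_set(n, edges):
    """Largest independent set of a graph on nodes 0..n-1, by enumeration."""
    edge_set = {frozenset(e) for e in edges}
    for k in range(n, 0, -1):                     # try sizes n, n-1, ..., 1
        for subset in combinations(range(n), k):
            # independent iff no pair inside the subset is an edge
            if all(frozenset((a, b)) not in edge_set
                   for a, b in combinations(subset, 2)):
                return set(subset)
    return set()
```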
6.5 Multidimensional Knapsack:
Consider a knapsack problem of $n$ items with $m$-dimensional capacity constraints. Let $b_j$ denote the capacity of the $j$-th dimension of the knapsack, $p_i$ the profit of packing item $i$, and $w_{ij}$ the size of item $i$ in the $j$-th dimension, $i = 1, \dots, n$, $j = 1, \dots, m$.
Decision variables:
- $x_i$: binary variable. $x_i = 1$ if item $i$ is chosen in the knapsack, and 0 otherwise.
Formulation:
$\max \sum_{i=1}^{n} p_i x_i$  (27)
s.t. $\sum_{i=1}^{n} w_{ij} x_i \le b_j, \quad \forall j = 1, \dots, m$  (28)
$x_i \in \{0, 1\}, \quad \forall i$  (29)
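The formulation can be checked by brute force on toy instances: enumerate all $2^n$ packings, keep those that fit in every dimension, and return the most profitable one. The helper name `solve_mkp` is an illustrative assumption.

```python
from itertools import product

def solve_mkp(p, w, b):
    """Brute force over all 0/1 packings.  p[i]: profit of item i;
    w[j][i]: size of item i in dimension j; b[j]: capacity of dimension j."""
    n, m = len(p), len(b)
    best, best_x = 0, None
    for x in product([0, 1], repeat=n):
        # feasible iff the packing fits in every one of the m dimensions
        if all(sum(w[j][i] * x[i] for i in range(n)) <= b[j] for j in range(m)):
            profit = sum(p[i] * x[i] for i in range(n))
            if profit > best:
                best, best_x = profit, x
    return best, best_x
```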
6.6 Set Covering:
Given a finite set $U$ and a collection $\mathcal{S}$ of subsets of $U$, the set covering problem is to identify the smallest sub-collection of $\mathcal{S}$ whose union is $U$.
Decision variables:
- $x_s$: binary variable. $x_s = 1$ if set $s$ is chosen, and 0 otherwise.
Formulation:
$\min \sum_{s \in \mathcal{S}} x_s$  (30)
s.t. $\sum_{s \in \mathcal{S} : e \in s} x_s \ge 1, \quad \forall e \in U$  (31)
$x_s \in \{0, 1\}, \quad \forall s \in \mathcal{S}$  (32)
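A brute-force counterpart, enumerating sub-collections by increasing size and returning the first one whose union is the universe, illustrates the covering constraint directly. The helper name `min_set_cover` is an illustrative assumption; the enumeration is exponential.

```python
from itertools import combinations

def min_set_cover(universe, subsets):
    """Indices of a smallest sub-collection of `subsets` whose union equals
    `universe`, or None if no cover exists.  subsets: list of sets."""
    for k in range(1, len(subsets) + 1):          # try covers of size 1, 2, ...
        for idx in combinations(range(len(subsets)), k):
            if set().union(*(subsets[i] for i in idx)) == universe:
                return idx
    return None
```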
6.7 Traveling Salesman Problem
Given a list of $n$ cities, the traveling salesman problem is to find a shortest route that visits each city exactly once and returns to the origin city. Let $c_{ij}$ denote the distance from city $i$ to city $j$ ($i \ne j$). We use the well-known Miller-Tucker-Zemlin (MTZ) formulation to model the TSP.
Decision variables:
- $x_{ij}$: binary variable. $x_{ij} = 1$ if city $j$ is visited immediately after city $i$, and 0 otherwise.
- $u_i$: continuous variable, indicating that city $i$ is the $u_i$-th visited city.
Formulation:
$\min \sum_{i \ne j} c_{ij} x_{ij}$  (33)
s.t. $\sum_{j \ne i} x_{ij} = 1, \quad \forall i$  (34)
$\sum_{i \ne j} x_{ij} = 1, \quad \forall j$  (35)
$u_i - u_j + n x_{ij} \le n - 1, \quad \forall i \ne j, \; i, j \in \{2, \dots, n\}$  (36)
$2 \le u_i \le n, \quad \forall i = 2, \dots, n$  (37)
$x_{ij} \in \{0, 1\}, \quad \forall i \ne j$  (38)
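On toy instances the optimal tour the MTZ model searches for can be found by plain enumeration: fix city 0 as the start and try every ordering of the remaining cities. The helper name `tsp_bruteforce` is an illustrative assumption; this is a reference oracle, not the paper's solution method.

```python
from itertools import permutations

def tsp_bruteforce(dist):
    """Shortest closed tour over all cities; dist[i][j] is the i -> j distance.
    Fixes city 0 as the start to avoid counting rotations of the same tour."""
    n = len(dist)
    best, best_tour = float("inf"), None
    for perm in permutations(range(1, n)):
        tour = (0,) + perm
        # tour length, wrapping from the last city back to city 0
        length = sum(dist[tour[k]][tour[(k + 1) % n]] for k in range(n))
        if length < best:
            best, best_tour = length, tour
    return best, best_tour
```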
6.8 Vehicle Routing Problem
Given a set of $n$ customers, the vehicle routing problem is to find the optimal set of routes to traverse in order to fulfill the demand of the customers. To serve the customers, a fleet of $K$ vehicles, each with maximum capacity $Q$, departs from and returns to a common depot (node 0). Let $q_i$ denote the demand of customer $i$ and $c_{ij}$ the distance from customer $i$ to customer $j$ ($i \ne j$).
Decision variables:
- $x_{ij}$: binary variable. $x_{ij} = 1$ if node $j$ is visited immediately after node $i$ by some vehicle, and 0 otherwise.
- $u_i$: continuous variable, the cumulated demand on the route that visits node $i$, up to this visit.
Formulation:
$\min \sum_{i \ne j} c_{ij} x_{ij}$  (39)
s.t. $\sum_{i \ne j} x_{ij} = 1, \quad \forall j \ne 0$  (40)
$\sum_{j \ne i} x_{ij} = 1, \quad \forall i \ne 0$  (41)
$\sum_{j \ne 0} x_{0j} = K$  (42)
$\sum_{i \ne 0} x_{i0} = K$  (43)
$u_j \ge u_i + q_j - Q(1 - x_{ij}), \quad \forall i \ne j, \; i, j \ne 0$  (44)
$q_i \le u_i \le Q, \; \forall i \ne 0; \quad x_{ij} \in \{0, 1\}, \; \forall i \ne j$  (45)
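The following sketch evaluates a candidate VRP solution given as explicit depot-based routes rather than as $x$ and $u$ variables: it checks that every customer is visited exactly once and that no route exceeds the vehicle capacity, then returns the total distance. The helper name `vrp_cost` and the route encoding are illustrative assumptions.

```python
def vrp_cost(dist, demand, Q, routes):
    """Total distance of a set of depot-based routes (depot = node 0),
    or None if a route exceeds capacity Q or a customer is missed.
    dist[i][j]: distance; demand[i]: demand of node i (demand[0] unused)."""
    # every customer 1..n-1 must appear in exactly one route
    visited = sorted(c for r in routes for c in r)
    if visited != list(range(1, len(dist))):
        return None
    total = 0
    for r in routes:
        if sum(demand[c] for c in r) > Q:   # vehicle capacity exceeded
            return None
        path = [0] + list(r) + [0]          # depot -> customers -> depot
        total += sum(dist[path[k]][path[k + 1]] for k in range(len(path) - 1))
    return total
```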