## 1 Introduction

Mixed-Integer Linear Programs (MILPs) arise naturally in many decision-making problems such as auction design [abrache2007combinatorial], warehouse planning [elson1972site], capital budgeting [capbudgeting] or scheduling [floudas2005mixed]. Apart from a linear objective function and linear constraints, some decision variables of a MILP are required to take integral values, which makes the problem NP-hard [book/PapadimitriouS82].

Modern mathematical solvers typically employ the BB algorithm [journal/econometrica/LandD60] to solve general MILPs to global optimality. While the worst-case time complexity of BB is exponential in the size of the problem [books/Wolsey98], it has proven efficient in practice, leading to wide adoption in various industries. At a high level, BB adopts a divide-and-conquer approach that consists in recursively partitioning the original problem into a tree of smaller sub-problems, and solving linear relaxations of the sub-problems until an integral solution is found and proven optimal.

Despite its apparent simplicity, there are many practical aspects that must be considered for BB to perform well [achterberg_thesis]; such decisions will affect the search tree, and ultimately the overall running time. These include several decision problems [journals/top/LodiZ2017] that arise during the execution of the algorithm, such as node selection: which sub-problem do we analyze next?; and variable selection

(a.k.a. branching): which decision variable must be used (branched on) to partition the current sub-problem? While such decisions are typically made using hard-coded expert heuristics which are implemented in modern solvers, more and more attention is given to statistical learning approaches for replacing and improving upon those heuristics

[conf/nips/HeDE14, Alvarez2014ASM, confs/aaai/KhalilBND16, gasse2019exact, zarpellon2020parameterizing]. An extensive review of different approaches at the intersection of statistical learning and combinatorial optimization is given in

arxiv/BengioLP18.Recently, gasse2019exact proposed to tackle the variable selection problem in B

B using a Graph Neural Network (GNN) model. The GNN exploits the bipartite graph formulation of MILPs together with a shared parametric representation, thus allowing it to model problems of arbitrary size. Using imitation learning, the model is trained to approximate a very good but computationally expensive “expert" heuristic named

strong branching [techreport/ApplegateBCC95]. The resulting branching strategy is shown to improve upon previously proposed approaches for branching on several MILP problem benchmarks, and is competitive with state-of-the-art BB solvers.While the GNN model seems particularly suited for learning branching strategies, one drawback is a high computational cost for inference, i.e., choosing the branching variable at each node of the BB tree. In gasse2019exact

, the authors use a high-end GPU card to speed-up the GNN inference time, which is a common practice in deep learning but is somewhat unrealistic for MILP practitioners. Indeed, commercial MILP solvers rely solely on CPUs for computation, and the GNN model from

gasse2019exact is not competitive on CPU-only machines, as illustrated in Figure 1. There is indeed a trade-off between the quality of the branching decisions made and the time spent obtaining those decisions. This trade-off is well-known in MILP solvers, and has given rise to carefully balanced strategies designed by MILP experts, such as hybrid branching [confs/cpaior/AchterbergB09].In this paper, we study the time-accuracy trade-off in learning to branch with the aim of devising a model that is both computationally inexpensive and accurate for branching. To this end, we propose a hybrid architecture that uses a GNN model only at the root node of the BB tree and a weak but fast predictor, such as a simple Multi-Layer Perceptron (MLP), at the remaining nodes. In doing so, the weak model is enhanced by high-level structural information extracted at the root node by the GNN model. In addition to this new hybrid architecture, we experiment and evaluate the impact of several variants to our training protocol for learning to branch, including: (i) end-to-end training [glasmachers2017limits, bojarski2016end], (ii) knowledge distillation [hinton2015distilling], (iii) auxiliary tasks [liebel2018auxiliary], and (iv) depth-dependent weighting of the training loss for learning to branch, an idea originally proposed by conf/nips/HeDE14 in the context of node selection.

We evaluate our approach on large-scale MILP instances from four problem families: Capacitated Facility Location, Combinatorial Auctions, Set Covering, and Independent Set. We demonstrate empirically that our combination of hybrid architecture and training protocol results in state-of-the-art performance in the realistic setting of a CPU-restricted machine. While we observe a slight decrease in the predictive performance of our model with respect to the original GNN from [gasse2019exact]

, its reduced computational cost still allows for a reduction of up to 26% in overall solving time on all the evaluated benchmarks compared to the default branching strategy of the modern open-source solver SCIP

[GleixnerEtal2018OO]. Also, our hybrid model preserves the ability to extrapolate to harder problems than trained on, as the original GNN model.## 2 Related Work

Finding a branching strategy that results in the smallest BB tree for a MILP is at least as hard–and possibly much harder–than solving the MILP with any strategy. Still, small trees can be obtained by using a computationally expensive heuristic named strong branching (SB) [techreport/ApplegateBCC95, lindSB]. The majority of the research efforts in variable selection are thus aimed at matching the performance of SB through faster approximations, via cleverly handcrafted heuristics such as reliability pseudocost branching [ACHTERBERG200542]

, and recently via machine learning

^{1}

^{1}1Interestingly, early works on ML methods for branching can be traced back to 2000 [achterberg_thesis, Acknowledgements]. [Alvarez2014ASM, confs/aaai/KhalilBND16, gasse2019exact]. We refer to journals/top/LodiZ2017 for an extensive survey of the topic.

Alvarez2014ASM and confs/aaai/KhalilBND16

showed that a fast discriminative classifier such as extremely randomized trees

[geurts2006extremely][svmHearst] on hand-designed features can be used to mimic SB decisions. Subsequently, gasse2019exact and zarpellon2020parameterizing showed the importance of representation learning for branching. Our approach, in some sense, combines the superior representation framework of gasse2019exact with the computationally cheaper framework of confs/aaai/KhalilBND16. Such hybrid architectures have been successfully used in ML problems such as visual reasoning [perez2018film], style-transfer [dumoulin2016learned][srivastava2015highway, dhingra2016gated], and speech recognition [kim2017dynamic].## 3 Preliminaries

Throughout this paper, we use boldface for vectors and matrices. A MILP is a mathematical optimization problem that combines a linear objective function, a set of linear constraints, and a mix of continuous and integral decision variables. It can be written as:

where denotes the cost vector, the matrix of constraint coefficients, is the vector of constant terms of the constraints, and there are integer variables, .

The BB algorithm can be described as follows.
One first solves the linear program (LP) relaxation of the MILP, obtained by disregarding the integrality constraints on the decision variables.
If the LP solution satisfies the MILP integrality constraints, or is worse than a known integral solution, then there is no need to proceed further.
If not, then one divides the MILP into two sub-MILPs. This is typically done by picking an integral decision variable that has a fractional value, , and create two sub-MILPs with additional constraints and , respectively.
The decision variable that is used to partition the feasible region is called the branching variable, while denotes the branching candidates.
The second step is to select one of the leaves of the tree, and repeat the above steps until all leaves have been processed^{2}^{2}2For a more involved description of BB, the reader is referred to achterberg_thesis..

In this work, we refer to the first node processed by BB as the root node, which contains the original MILP, and all subsequent nodes containing a local MILP as tree nodes, whenever the distinction is required. Otherwise we refer to them simply as nodes.

## 4 Methodology

As mentioned earlier, computationally heavy GNNs can be prohibitively slow when used for branching on CPU-only machines. In this section we describe our hybrid alternative, which combines the superior inductive bias of a GNN at the root node with a computationally inexpensive model at the tree nodes. We also discuss various enhancements to the training protocol, in order to enhance the performance of the learned models.

### 4.1 Hybrid architecture

A variable selection strategy in BB can be seen as a scoring function that outputs a score for every branching candidate. As such, can be modeled as a parametric function, learned by ML. Branching then simply involves selecting the highest-scoring candidate according to :

We consider two forms of node representations for machine learning models: (i) a graph representation , such as the variable-constraint bipartite graph of gasse2019exact, where , with variable features, edge features, and constraint features; and (ii) branching candidate features , such as those from confs/aaai/KhalilBND16, which is cheaper to extract than the first representation. For convenience, we denote by the generic space of the branching candidate features, and by the graph representation of the root node. The various features used in this work are detailed in the supplementary materials.

In BB, structural information in the tree nodes, , shares a lot of similarity with that of the root node,

. Extracting, but also processing that information at every node is an expensive task, which we will try to circumvent. The main idea of our hybrid approach is then to succinctly extract the relevant structural information only once, at the root node, with a parametric model

. We then combine in the tree nodes this preprocessed structural information with the cheap candidate features , using a hybrid model . By doing so, we hope that the resulting model will approach the performance of an expensive but powerful , at almost the same cost as an inexpensive but less powerful . Figure 2 illustrates the differences between those approaches, in terms of data extraction.For an exhaustive coverage of the computational spectrum of hybrid models, we consider four ways to enrich the feature space of an MLP via a GNN’s output, sumarrized in Table 1. In CONCAT, we concatenate the candidate’s root representations with the features at a node. In FiLM [perez2018film], we generate film parameters , from the GNN, for each candidate, which are further used to modulate the hidden layers of the MLP. In details, if is the intermediate representation of the MLP, it gets linearly modulated as . While both the above architectures have similar computational complexity, it has been shown that FiLM subsumes the CONCAT architecture dumoulin2018feature. On the other the end of the spectrum lie the most inexpensive hybrid architectures, HyperSVM and HyperSVM-FiLM. HyperSVM is inspired by HyperNetworks [ha2016hypernetworks], and simply consists in a multi-class Support Vector Machine (SVM), whose parameters are predicted by the root GNN. We chose a simple linear disciminator, for a minimal computational cost. Finally, in HyperSVM-FiLM we increase the expressivity of HyperSVM with the help of modulations, similar to that of FiLM.

Data extraction | Computational Cost | Decision Function | |
---|---|---|---|

GNN gasse2019exact | expensive | Expensive | |

MLP | cheap | Moderate | |

CONCAT | hybrid | Moderate | |

FiLM [perez2018film] | hybrid | Moderate | |

HyperSVM | hybrid | Cheapest | |

HyperSVM-FiLM | hybrid | Cheapest |

### 4.2 Training Protocol

We use strong branching decisions as ground-truth labels for imitation learning, and collect observations of the form . Thus, the data used for training the model is . We treat the problem of identifying as a classification problem, such that is the target outcome. Considering as our ground-truth, our objective (1) is to minimize the cross-entropy loss between and a one-hot vector with one at the target:

(1) |

Performance on unseen instances, or generalization, is of utmost importance when the trained models are used on bigger instances. The choices in training the aforementioned architectures influence this ability. In this section, we discuss four such important choices that lead to better generalization.

#### 4.2.1 End-to-end Training (e2e)

A , with pre-trained parameters , is obtained using the procedure described in gasse2019exact

. We use this pre-trained GNN to extract variable representations at the root node and use it as an input to the MLPs at a tree node (pre). This results in a considerable performance boost over plain "salt-of-the-earth" MLP used at all tree nodes. However, going a step further, an end-to-end (e2e) approach involves training the GNN and the MLP together by backpropagating the gradients from a tree node to the root node. In doing so, e2e training aligns the variable representations at the root node with the prediction task at a tree node. At the same time, it is not obvious that it should result in a stable learning behavior because the parameters for GNN need to adapt to various tree nodes. Our experiments explore both pre-training (pre) and end-to-end training (e2e), namely:

(2) |

#### 4.2.2 Knowledge Distillation (KD)

Using the outputs of a pre-trained expert model as a soft-target for training a smaller model has been successfully used in model compression [hinton2015distilling]. In this way, one aims to learn an inexpensive model that has the same generalization power as an expert. Thus, for better generalization, instead of training our hybrid architectures with cross-entropy [Goodfellow-et-al-2016] on ground-truth hard-labels, we study the effect of training with KL Divergence [kullback1951information] between the outputs of a pre-trained GNN and a hybrid model, namely:

(3) |

#### 4.2.3 Auxiliary Tasks (AT)

An inductive bias, such as GNN, encodes a prior on the way to process raw input. Auxiliary tasks, on the other hand, inject priors in the model through additional learning objectives, which are not directly linked to the main task. These tasks are neither related to the final output nor do they require additional training data. One such auxiliary task is to maximize the diversity in variable representations. The intuition is that very similar representations lead to very close MLP score predictions, which is not useful for branching.

We minimize a pairwise loss function that ensures maximum separation between the variable representations projected on a unit hypersphere. We consider two types of objectives for this: (i) Euclidean Distance (ED), and (ii) Minimum Hyperspherical Energy (MHE)

[liu2018learning], inspired from the well-known Thomson problem thomsonin Physics. While ED separates the representations in the Euclidean space on the hypersphere, MHE ensures uniform distribution over the hypersphere. Denoting

as the variable representation for the variable projected on a unit hypersphere and as the Euclidean distance between the representations for the variables and , our new objective function is given as , where(4) |

#### 4.2.4 Loss Weighting Scheme

The problem of distribution shift is unavoidable in a sequential process like B&B. A suboptimal branching decision at a node closer to the root node can have worse impact on the size of the BB tree as compared to when such a decision is made farther from it. In such situations, one can use depth (possibly normalized) as a feature, but the generalization on bigger instances is a bit unpredictable as the distribution of this feature might be very different from that observed in the training set. Thus, we experimented with different depth-dependent formulations for weighting the loss at any node. Denoting as the depth of a tree node relative to the depth of the tree, we weight the loss at different tree nodes by , making our objective function as

(5) |

Specifically, we considered 5 different weighting functions such that all of them have the same end points, i.e., at the root node and at the deepest node. Different functions were chosen depending on their intermediate behaviour in between these two points. We experimented with exponential, linear, quadratic and sigmodal decay behavior of these functions. Table 3 lists various functions and their mathematical forms considered in our experiments.

## 5 Experiments

We follow the experimental setup of gasse2019exact, and evaluate each branching strategy across four different problem classes, namely Capacitated Facility Location, Minimum Set Covering, Combinatorial Auctions, and Maximum Independent Set. Randomly generated instances are solved offline using SCIP [GleixnerEtal2018OO] to collect training samples of the form . We leave the description of the data collection and training details of each model to the supplementary materials.

##### Evaluation.

As in gasse2019exact, our evaluation instances are labeled as small, medium, and big based on the size of underlying MILP. Small instances have the same size as those used to generate the training datasets, and thus match the training distribution, while instances of increasing size allows us to measure the generalization ability of the trained models. Each scenario uses 20 instances, solved using 3 different seeds to account for solver variability. We report standard metrics used in the MILP community for benchmarking B

B solvers: (i) Time: 1-shifted geometric mean

^{3}

^{3}3for complete definition refer to Appendix A.3 in achterberg_thesis of running times in seconds, including the running times for unsolved instances, (ii) Nodes: hardware-independent 1-shifted geometric mean of BB node count of the instances solved by all branching strategies , and (iii) Wins: number of times each branching strategy resulted in the fastest solving time, over total number of solved instances. All branching strategies are evaluated using the open-source solver SCIP [GleixnerEtal2018OO] with a time limit of 45 minutes, and cutting planes are allowed only at the root node.

##### Baselines.

We compare our hybrid model to several non-ML baselines, including SCIP’s default branching heuristic Reliability Pseudocost Branching (RPB), the “gold standard" heuristic Full Strong Branching (FSB), and a very fast but ineffective heuristic Pseudocost Branching (PB). We also include the GNN model from gasse2019exact run on CPU (GNN), and also several fast but less expressive models such as SVMRank from confs/aaai/KhalilBND16, LambdaMART from burges2010ranknet, and ExtraTree Classifier from geurts2006extremely as benchmarks. For conciseness, we chose to report those last three competitor models using an optimistic aggregation scheme, by systematically choosing only the best performing method among the three (COMP). For completeness, we also report the performance of a GNN model run on a high-end GPU, although we do not consider that method as a baseline and therefore do not include it in the Wins indicator.

##### Model selection.

To investigate the effectiveness of different architectures, we empirically compare the performance of their end-to-end variants. Figure 3 compares the Top-1 test accuracy of models across the four problem sets. The performance of GNNs (blue), being the most expressive model, serves as an upper bound to the performance of hybrid models. All of the considered hybrid models outperform MLPs (red) across all problem sets. Additionally, we observe that FiLM (green) and CONCAT (purple) perform significantly better than other architectures. However, it is not clear if there is a clear winner among them. We also note that the cheapest hybrid models, HyperSVM and HyperSVM-FiLM, though better than MLPs, still do not perform as well as FiLM or CONCAT models.

##### Training protocols.

In Table 3, we show the effect of different protocols discussed in section 4 on Top-1 accuracy of the FiLM models. We observe that the presence of these protocols improves the performance by 0.5-0.9%, which translates to a substantial reduction in the number of B

B nodes of the solver. Except for Combinatorial Auctions, FiLM’s performance is improved by knowledge distillation, which suggests that the soft targets of pre-trained GNN yield a better generalization performance. Lastly, we launch a hyperparameter search for auxiliary objective: ED and MHE on top of the e2e & KD models. Auxiliary tasks further help the accuracy of the hybrid models, but for some problem classes it is ED that works well while for others it is MHE. The detailed results of these experiments are in the supplement for further reference.

cauctions | facilities | indset | setcover | |
---|---|---|---|---|

Pretrained GNN | 44.12 0.09 | 65.78 0.06 | 53.16 0.51 | 50.00 0.09 |

e2e | 44.31 0.08 | 66.33 0.33 | 53.23 0.58 | 50.16 0.05 |

e2e & KD | 44.10 0.09 | 66.60 0.21 | 53.08 0.3 | 50.31 0.19 |

e2e & KD & AT | 44.56 0.13 | 66.85 0.28 | 53.68 0.23 | 50.37 0.03 |

Type | Weighting scheme | Nodes | Wins |
---|---|---|---|

Constant | 9678 | 10/60 | |

Exponential decay | 9793 | 10/60 | |

Linear | 9789 | 12/60 | |

Quadratic decay | 9561 | 14/60 | |

Sigmoidal | 9534 | 14/60 |

##### Effect of loss weighting.

We empirically investigate the effect of the different loss weighting schemes discussed in Section 4.2.4. We train a simple MLP model on our small Combinatorial Auctions instances, and measure the resulting BB tree size on big instances. We report aggregated results in Table 3, and provide instance-level results in the supplement. We observe that the most commonly used exponential and linear schemes actually seem to degrade the performance of the learned strategy, as we believe those may be too aggressive at disregarding nodes early on in the tree. On the other hand, both the quadratic and sigmoidal schemes result in an improvement, thus validating the idea that depth-dependent weighting can be beneficial for learning to branch. We therefore opt for a sigmoidal loss weighting scheme in our training protocol.

##### Complete benchmark.

Finally, to evaluate the runtime performance of our hybrid approach, we replace SCIP’s default branching strategy with our best performing model, FiLM. We observe in Table 4 that FiLM performs substantially better than all other CPU-based branching strategies. The computationally expensive FSB, our “gold standard”, becomes impractical as the size of instances grows, whereas RPB remains competitive. While GNN retains its small number of nodes, it loses in running time performance on CPU. Note that we found that FiLM models for Maximum Independent Set did initially overfit on small instances, such that the performance on larger instances degraded substantially. To overcome this issue we used weight decay ng2004feature regularization, with a validation set of 2000 observations generated using random medium instances (not used for evaluation). We report the performance of the regularized models in the supplement, and use the best performing model to report evaluation performance on medium and big instances. We also show in the supplement that the cheapest computation model of HyperSVM/HyperSVM-FiLM do not provide any runtime advantages over FiLM. We achieve up to 26% reduction on medium instances and up to 8% reduction on big instances in overall solver running time compared to the next-best branching strategy, including both learned and classical strategies.

##### Limitations.

We would like to point out some limitations of our work. First, given the NP-Hard nature of MILP solving, it is fairly time consuming to evaluate performance of the trained models on the instances bigger than considered for this work. One can consider the primal-dual bound gap after a time limit as an evaluation metric for the bigger instances, but this is misaligned with the solving time objective. Second, we have used Top-1 accuracy on the test set as a proxy for the number of nodes, but there is an inherent distribution shift because of the sequential nature of B

B that leads to out-of-distribution observations. Third, generalization to larger instances is a central problem in the design of branching strategies. Several techniques discussed in this work form only a part of the solution to this problem. On the Maximum Independent Set problem, we originally noticed a poor generalization capability, which we addressed by cross-validation using a small validation set. In future work, we plan to perform an extensive study of the effect of architecture and training protocols on generalization performance. Another experiment worth conducting would be to train on larger instances than (small), in order to get a better view of how and to which point the different models are able to generalize. However, not only is this a time-consuming process, but there is also an upper limit on the size of the problems on which it is reasonable to conduct experiments, simply due to hardware constraints. Finally, although we showed the efficacy of our models on a broad class of MILPs, there may be other problem classes for which our models might not result in a substantial runtime improvements.## 6 Conclusion

As more operations research and integer programming tools start to include ML and deep learning modules, it is necessary to be mindful of the practical bottlenecks faced in that domain. To this end, we combine the expressive power of GNNs with the computational advantages of MLPs to yield novel hybrid models for learning to branch, a central component in MILP solving. We integrate various training protocols that augment the basic MLPs and help bridge the accuracy gap with more expensive models. This competitive accuracy translates into savings in time and nodes when used in MILP solving, as compared to both default, expert-designed branching strategies and expensive GNN models, thus obtaining the “best of both worlds" in terms of the time-accuracy trade-off. More broadly, our philosophy revolves around understanding the intricacies and practical constraints of MILP solving and carefully adapting deep learning techniques therein. We believe that this integrative approach is crucial to the adoption of statistical learning in exact optimization solvers. We will release the code and pre-trained models following the review process.

## 7 Acknowledgments

We would like to acknowledge the important role played by our colleagues at Mila and CERC through building a fun learning environment. We would also like to thank Felipe Serano and Benjamin Müller for their technical help with SCIP and insightful discussions on branching in MILPs. PG wants to thank Giulia Zarpellon, Didier Chételat, Antoine Prouvost, Karsten Roth, David Yu-Tung Hui, Tristan Deleu, Maksym Korablyov, and Alex Lamb for enlightening discussions on deep learning and integer programming. PG would also like to thank Compute Canada for their resources.

Comments

There are no comments yet.