Curriculum learning for multilevel budgeted combinatorial problems

07/07/2020, by Adel Nabli et al., Université de Montréal

Learning heuristics for combinatorial optimization problems through graph neural networks has recently shown promising results on some classic NP-hard problems. These are single-level optimization problems with only one player. Multilevel combinatorial optimization problems are their generalization, encompassing situations with multiple players taking decisions sequentially. By framing them in a multi-agent reinforcement learning setting, we devise a value-based method to learn to solve multilevel budgeted combinatorial problems involving two players in a zero-sum game over a graph. Our framework is based on a simple curriculum: if an agent knows how to estimate the value of instances with budgets up to B, then solving instances with budget B+1 can be done in polynomial time regardless of the direction of the optimization by checking the value of every possible afterstate. Thus, in a bottom-up approach, we generate datasets of heuristically solved instances with increasingly larger budgets to train our agent. We report results close to optimality on graphs up to 100 nodes and a 185× average speedup compared to the quickest exact solver known for the Multilevel Critical Node problem, a max-min-max trilevel problem that has been shown to be at least Σ_2^p-hard.




1 Introduction

The design of heuristics to tackle real-world instances of NP-hard combinatorial optimization problems over graphs has attracted the attention of many computer scientists over the years Gonzalez (2007). With advances in Deep Learning Goodfellow et al. (2016) and Graph Neural Networks Wu et al. (2020), the idea of leveraging the recurrent structures appearing in the combinatorial objects belonging to a distribution of instances of a given problem to learn efficient heuristics within a Reinforcement Learning (RL) framework has received increasing interest Bengio et al. (2018); Mazyavkina et al. (2020). Although these approaches show promising results on many fundamental NP-hard problems over graphs, such as Maximum Cut Barrett et al. (2020) or the Traveling Salesman Problem Kool et al. (2019), the range of combinatorial challenges on which they are directly applicable is still limited.

Indeed, most of the combinatorial problems over graphs solved heuristically with Deep Learning Bai et al. (2020); Barrett et al. (2020); Bello et al. (2016); Cappart et al. (2019); Khalil et al. (2017); Kool et al. (2019); Li et al. (2018); Ma et al. (2019) are classic NP-hard problems for which the canonical optimization formulation is a single-level Mixed Integer Linear Program: there is one decision-maker (we will use interchangeably the words decision-maker, agent and player; decision-maker, player and agent are usually used in Operations Research, Game Theory and Reinforcement Learning, respectively, and similarly for the words decision, strategy and policy) seeking to minimize a linear cost subject to linear constraints and integer requirements. However, in many real-world situations, decision-makers interact with each other. A particular case of such a setting is sequential games with a hierarchy between players: an upper-level authority (a leader) optimizes its goal subject to the response of a sequence of followers seeking to optimize their own objectives given the actions previously made by others higher in the hierarchy. These problems are naturally modeled as Multilevel Programming problems (MPs) and can be seen as a succession of nested optimization tasks, i.e. mathematical programs with optimization problems in the constraints Bracken and McGill (1973); Zhang et al. (2015).

Thus, finding an optimal strategy for the leader in the multilevel setting may be harder than for single-level problems, as evaluating the cost of a given strategy might not be possible in polynomial time: it requires solving the followers' optimization problems. In fact, even Multilevel Linear Programming with a sequence of k + 1 players (levels) is Σ_k^p-hard Blair (1992); Dudás et al. (1998); Jeroslow (1985). In practice, exact methods capable of tackling medium-sized instances in reasonable time have been developed for max-min-max Trilevels, min-max Bilevels and more general Bilevel Programs (e.g., Carvalho et al. (2020); Fischetti et al. (2017, 2019); Lozano and Smith (2017); Tahernejad et al. (2020)).

Despite the computational challenges intrinsic to MPs, these formulations are of practical interest as they properly model hierarchical decision problems. Originally appearing in economics in the bilevel form, designated as Stackelberg competitions von Stackelberg (1934), they have since been extended to more than two agents and seen their use explode in Operations Research Lachhwani and Dwivedi (2017); Sinha et al. (2018). Thus, research efforts have been directed at finding good quality heuristics to solve those problems, e.g. DeNegre (2011); Fischetti et al. (2018); Forghani et al. (2020); Talbi (2013). Hence, one can ask whether we can make an agent learn how to solve a wide range of instances of a given multilevel problem, extending the success of recent Deep Learning approaches on solving single-level combinatorial problems to higher levels.

In this paper, we propose a simple curriculum to learn to solve a common type of multilevel combinatorial optimization problem: budgeted ones that are zero-sum games played over a graph. Although the framework we devise is set to be general, we center our attention on the Multilevel Critical Node problem (MCN) Baggio et al. (2020) and its variants. The reasons for such a choice are manifold. First, the MCN is an example of a Defender-Attacker-Defender game Brown et al. (2006), which received much attention lately as it aims to find the best preventive strategies to defend critical network infrastructures against malicious attacks. As it falls under the global framework of network interdiction games, it is also related to many other interdiction problems, with applications ranging from floods control Ratliff et al. (1975) to the decomposition of matrices into blocks Furini et al. (2019). Moreover, an exact method to solve the problem has been presented in Baggio et al. (2020) along with a publicly available dataset of solved instances, which we can use to assess the quality of our heuristic. Lastly, complexity results are available for several variants and sub-problems of MCN, indicating its challenging nature Nabli et al. (2020).

Contributions. Our contribution rests on several steps. First, we frame generic Multilevel Budgeted Combinatorial problems (MBC) as Alternating Markov Games Littman (1996); Littman and Szepesvari (1996). This allows us to devise a first algorithm, MultiL-DQN, to learn Q-values. By leveraging both the combinatorial setting (the environment is deterministic) and the budgeted case (the length of an episode is known in advance), we motivate a curriculum, MultiL-Cur. Introducing a Graph Neural Network-based agent, we empirically demonstrate the efficiency of our curriculum on versions of the MCN, reporting results close to optimality on graphs of size up to 100 nodes.

Paper structure. Section 2 formalizes the MBC problem. In Section 3, we provide an overview of the relevant literature. The MBC is put in a Multi-Agent RL setting in Section 4, along with the presentation of our algorithmic approaches: MultiL-DQN and MultiL-Cur. Section 5 states the particular game, MCN, on which our methodology is validated in Section 6.

2 Problem statement

The general setting for the MPs we are considering is the following: given a graph G = (V, E), two concurrent players, the leader and the follower, compete over the same combinatorial quantity Q, with the leader wanting to maximize it and the follower to minimize it. They are given a total number of moves L and a sequence of budgets (b_1, ..., b_L). Although our study and algorithms also apply to general integer cost functions, for the sake of clarity, we will only consider situations where the cost of a move is its cardinality. We focus on perfect-information games, i.e. both players have full knowledge of the budgets allocated and previous moves. The leader always begins and the last move is attributed by the parity of L. At each turn t, the player concerned makes a set of decisions x_t about the graph, constrained by the previous moves x_1, ..., x_{t-1}. We consider games where players can only improve their objective by taking a decision: there is no incentive to pass. Without loss of generality, we can assume that L is odd. Then, the Multilevel Budgeted Combinatorial problem (MBC) can be formalized as:

(1)   max_{|x_1| <= b_1}  min_{|x_2| <= b_2}  ...  max_{|x_L| <= b_L}  Q(G, x_1, ..., x_L)

MBC is a zero-sum game as both leader and follower have the same objective function but their directions of optimization are opposite. A particular combinatorial optimization problem is defined by specifying the quantity Q, fixing L, and by characterizing the nature of both the graph (e.g. directed, weighted) and of the actions allowed at each turn (e.g. labeling edges, removing nodes). The problem being fixed, a distribution of instances D is determined by setting a sampling law for random graphs and for the other parameters, having specified bounds beforehand on the graph size, the budgets and the weights. Our ultimate goal is thus to learn good-quality heuristics that manage to solve each instance drawn from D.

In order to achieve that, we aim to leverage the recurrent structures appearing in the combinatorial objects in the distribution by learning graph embeddings that could guide the decision process. As data is usually very scarce (datasets of exactly solved instances being hard to produce), the go-to framework to learn useful representations in these situations is Reinforcement Learning Sutton and Barto (1998).

3 Related Work

The combination of graph embedding with reinforcement learning to learn to solve distributions of instances of combinatorial problems was introduced by Dai et al. Khalil et al. (2017). Thanks to their S2V-DQN meta-algorithm, they managed to show promising results on three classic budget-free NP-hard problems. Since then, a growing number of methods have been proposed to either improve upon S2V-DQN's results Barrett et al. (2020); Cappart et al. (2019); Kool et al. (2019); Li et al. (2018); Ma et al. (2019) or tackle other types of NP-hard problems on graphs Bai et al. (2020). As all these approaches focus on single-player games, they are not directly applicable to MBC.

To tackle the multiplayer case, Multi-Agent Reinforcement Learning (MARL) Littman (1994); Shoham et al. (2007) appears as the natural toolbox. The combination of Deep Learning with RL recently led to one of the most significant breakthroughs in perfect-information, sequential two-player games: AlphaGo Silver et al. (2017). Although neural network based agents managed to exceed human abilities on other combinatorial games (e.g. backgammon Tesauro (2002)), these approaches focus on one fixed board game. Thus, they effectively learn to solve only one (particularly hard) instance of a combinatorial problem, whereas we aim to solve a whole distribution of them. Hence, the MBC problem we propose to study is at a crossroads between previous works on MARL and deep learning for combinatorial optimization.

Finally, taking another direction, some works shifted their attention from specific problems to focus on general-purpose solvers. For example, methodologies have been proposed to speed up branch-and-bound implementations for (single-level) linear combinatorial problems by learning to branch Balcan et al. (2018), using Graph Convolutional Neural Networks Gasse et al. (2019), Deep Neural Networks Zarpellon et al. (2020) or RL Etheve et al. (2020), to name some recent works; see the surveys Bengio et al. (2018); Lodi and Zarpellon (2017). To the best of our knowledge, the literature on machine learning approaches for general multilevel optimization is restricted to the linear non-combinatorial case. For instance, in He et al. (2014); Shih et al. (2004), the linear multilevel problems are converted into a system of differential equations and solved using recurrent neural networks.

4 Multi-Agent Reinforcement Learning framework

Whereas single-agent RL is usually described with Markov Decision Processes, the framework needs to be extended to account for multiple agents. This has been done in the seminal work of Shapley (1953) by introducing Markov Games. In our case, we want to model two-player games in which moves are not played simultaneously but alternately. The natural setting for such a situation was introduced by Littman in Littman (1994, 1996); Littman and Szepesvari (1996) under the name of Alternating Markov Games.

4.1 Alternating Markov Games

An Alternating Markov Game involves two players: a maximizer and a minimizer. It is defined by the tuple (S_1, S_2, A_1, A_2, T, r), with S_i and A_i the sets of states and actions, respectively, for player i, T the transition function mapping state-action couples to probabilities of next states, and r a reward function. For a state s, we define V(s) as the expected reward of the concerned agent for following the optimal minimax policy against an optimal opponent starting from state s. In a similar fashion, Q(s, a) is the expected reward for the player taking action a in state s and both agents behaving optimally thereafter. Finally, with the introduction of the discount factor gamma, we can write the generalized Bellman equations for Alternating Markov Games Littman (1996):

(2)   V(s) = max_{a in A_1} Q(s, a) if s in S_1,   V(s) = min_{a in A_2} Q(s, a) if s in S_2

(3)   Q(s, a) = r(s, a) + gamma * sum_{s'} T(s, a, s') V(s')
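As a concrete illustration, the minimax backup above can be sketched on a toy deterministic game tree with gamma = 1, as in MBC. The tree, rewards and player assignment below are hypothetical, not taken from the paper:

```python
# Minimal sketch of the generalized Bellman backup for a deterministic
# Alternating Markov Game with gamma = 1. All data structures are toy examples.

def minimax_value(state, children, reward, player):
    """V(s): maximize at the maximizer's states, minimize at the minimizer's.

    children[s] -> list of (action, next_state) pairs; empty list = terminal.
    reward[(s, a)] -> immediate reward, always from the maximizer's viewpoint.
    player[s] -> 'max' or 'min' for non-terminal states.
    """
    moves = children[state]
    if not moves:
        return 0.0
    # Q(s, a) = r(s, a) + V(next state), since the environment is deterministic
    q_values = [reward[(state, a)] + minimax_value(s2, children, reward, player)
                for a, s2 in moves]
    return max(q_values) if player[state] == 'max' else min(q_values)


# A tiny two-ply game: the maximizer moves at 'r', the minimizer at 'a' and 'b'.
children = {'r': [('x', 'a'), ('y', 'b')],
            'a': [('u', 't1'), ('v', 't2')],
            'b': [('u', 't3'), ('v', 't4')],
            't1': [], 't2': [], 't3': [], 't4': []}
reward = {('r', 'x'): 0, ('r', 'y'): 0,
          ('a', 'u'): 3, ('a', 'v'): 5,
          ('b', 'u'): 2, ('b', 'v'): 8}
player = {'r': 'max', 'a': 'min', 'b': 'min'}
```

Here V('a') = min(3, 5) = 3 and V('b') = min(2, 8) = 2, so the maximizer's value at the root is max(3, 2) = 3.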
4.2 MARL formulation of the Multilevel Budgeted Combinatorial problem

We now have all the elements to frame the MBC in the Alternating Markov Game framework. The leader is the maximizer and the follower the minimizer. The states consist of a graph and a tuple of remaining budgets, beginning with s_0 = (G, (b_1, ..., b_L)). Thus, the value function is defined so that V(s_0) is exactly the optimal value of the MBC instance (1).

The game is naturally sequential with an episode length of L: each time step corresponds to a level. The challenge of such a formulation is the size of the action space, which can become large quickly. Indeed, in a graph with n nodes and a first budget b_1, if the action that the leader can perform is "removing a set of nodes from the graph" (a common move in network interdiction games), then the size of the action space for the first move of the game is C(n, b_1), the number of subsets of b_1 nodes. To remedy this, we define the sets of individual decisions available at each level. Then, we make the simplifying observation that a player making a set of decisions in one move is the same as him/her making a sequence of individual decisions in one strike. More formally, we have the simple lemma (proof in Appendix A.1):

Lemma 4.1.

The Multilevel Budgeted Combinatorial optimization problem (1) is equivalent to:

(4)   max_{a_1} ... max_{a_{b_1}}  min_{a_{b_1 + 1}} ... min_{a_{b_1 + b_2}}  ...  Q(G, a_1, ..., a_B),   with B = b_1 + ... + b_L

where each a_i is an individual decision of unit cost. In this setting, the length of an episode is no longer L but B: the leader makes b_1 sequential actions, then the follower the b_2 following ones, etc. To simplify the notations, we re-define the action sets as the sets of individual actions available to the agent playing at time t. As each action takes place on the graph, the available actions are readable from the state. Moreover, episodes now run over t in {1, ..., B}.

The environments considered in MBC are deterministic and their dynamics are completely known. Indeed, given a graph, a tuple of budgets and a chosen action (e.g. removing a node v), the subsequent graph, tuple of budgets and next player are completely and uniquely determined. Thus, we can introduce the next state function n that maps a state-action couple (s, a) to the resulting afterstate s' = n(s, a), and p as the function that maps the current state to the player whose turn it is to play. As early rewards weigh the same as late ones, we set gamma = 1. Finally, we can re-write equations (2) and (3) as:

(5)   Q(s, a) = r(s, a) + V(n(s, a))

(6)   V(s) = max_{a} Q(s, a) if p(s) is the leader,   min_{a} Q(s, a) otherwise

The definition of r depends on the combinatorial quantity Q and the nature of the actions allowed.
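In this deterministic setting, the greedy policy only needs to score every afterstate with the learned value function. A minimal sketch, with generic callables standing in for the dynamics, reward and value estimator (all names are ours, not the paper's):

```python
def greedy_action(state, actions, next_state, reward, value_fn, to_play):
    """Greedy policy for a deterministic alternating game.

    Scores every action with r(s, a) + V(n(s, a)), then picks the argmax
    for the leader ('max') or the argmin for the follower ('min').
    """
    def score(a):
        return reward(state, a) + value_fn(next_state(state, a))

    if to_play == 'max':
        return max(actions, key=score)
    return min(actions, key=score)
```

For example, with a toy integer state, zero rewards and V(s) = s, the leader picks the action leading to the largest afterstate and the follower the smallest.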

4.3 Q-learning for the greedy policy

Having framed the MBC in the Markov Game framework, the next step is to look at established algorithms to learn in this setup. Littman originally presented minimax Q-learning Littman (1994, 1996) to do so, but in matrix games. An extension using a neural network to estimate the Q-values has been discussed in Fan et al. (2019). However, their algorithm, Minimax-DQN, is suited for the simultaneous setting and not the alternating one. The main difference is that the former requires the extra work of solving a Nash game between the two players at each step, which is unnecessary in the latter as a greedy policy exists Littman (1994). To bring Minimax-DQN to the alternating case, we present MultiL-DQN, an algorithm inspired by S2V-DQN Khalil et al. (2017) but extended to the multilevel setting. See Appendix B.1 for the pseudo-code.

4.4 A curriculum taking advantage of the budgeted combinatorial setting

With MultiL-DQN, the learning agent directly tries to solve instances drawn from the target distribution, which can be very hard theoretically speaking. However, Lemma 4.1 shows that, at the finest level of granularity, MBC is actually made of nested sub-problems. As we know the maximum number of levels considered in the distribution, instead of directly trying to learn the values of the instances from this distribution, we can ask whether beginning by focusing on the simpler sub-problems and gradually building our way up to the hard ones would result in better final performance.

This reasoning is motivated by the work done by Bengio et al. Bengio et al. (2009) on Curriculum Learning. Indeed, it has been shown empirically that breaking down the target training distribution into a sequence of increasingly harder ones actually results in better generalization abilities for the learning agent. But, contrary to the continuous setting devised in their work, where the parameter governing the hardness (entropy) of the distributions considered varies continuously, here we have a natural discrete sequence of increasingly harder distributions to sample instances from.

Indeed, our ultimate goal is to learn an approximation of the state-action values (5) (or equivalently of the state values (6)) so that we can apply the subsequent greedy policy to take a decision. Thus, the approximation has to estimate the values of every instance appearing in a sequence of decisions. Although the leader makes the first move on instances from the target distribution D, as our game is played on the graph itself, the distribution of instances on which the second decision is made is no longer D but the distribution of instances from D on which a first optimal action of the leader has been made. If we introduce D*_j, the distribution of states obtained from D after j optimal actions, then, from top to bottom, we want to estimate the values of taking an action starting from the states in D*_0 where the first action is made, then in D*_1 where the second action is made, and all the way down to D*_{B-1}, where the last action of an episode is made.
1  Initialize the value-network V with weights w;
2  Initialize the list of experts E to be empty;
3  for b = 1 to B do
4      Create the dataset D_b by sampling instances of total budget b and labeling them with GreedyRollout (pseudo-code in Appendix B.2);
5      Initialize V_b, the expert of level b, with w;
6      Initialize the loss on the validation set to infinity;
7      for each epoch do
8          for each batch in D_b do
9              Update w over the batch with Adam Kingma and Ba (2015);
10             if the number of new updates reaches the testing frequency then
11                 if the loss on the validation set improved then
12                     save w in V_b and update the best validation loss;
13     Add V_b to E;
14 return the trained list of experts E
Algorithm 1 MultiL-Cur

As the maximum total budget is B, the value network has to effectively learn to estimate values from B different distributions of instances, one for each possible total remaining budget. But the instances in these distributions are not all equally hard to solve. Actually, the tendency is that the deeper in the sequence of decisions a distribution is, the easier to solve are the instances sampled from it. For example, the last distribution contains all the instances it is possible to obtain for the last move of the game when every previous action was optimal. The values of these instances can be computed exactly in polynomial time (assuming the quantity Q is computable in polynomial time) by checking the reward obtained with every possible action. Thus, if we had access to the distributions D*_b of optimally-played instances with total remaining budget b, then a natural curriculum would be to begin by sampling a dataset of instances from D*_1, find their exact values in polynomial time, and train a neural network on the resulting (instance, value) couples. Once this is done, we could pass to D*_2. As these instances have a total budget of 2, we can heuristically solve them by generating every possible afterstate and, using the freshly trained network, take a greedy decision to obtain their approximate targets. In a bottom-up approach, we could continue until the network is trained on the B different distributions. The challenge of this setting is that we do not know the optimal moves, and hence the D*_b are not available. To remedy this, we use a proxy, D̃_b, obtained by following the random policy for the sequence of previous moves, i.e., we use random rollouts instead of optimal ones. Doing so is provably interesting (proof in Appendix A.2):

Lemma 4.2.

For every budget b, the support of D*_b is included in the support of D̃_b.

Thus, by learning the value of instances sampled from D̃_b, we also learn the values of instances from D*_b. To avoid the pitfall of catastrophic forgetting Lange et al. (2019) that happens when a neural network switches training distributions, each time the network finishes learning from a D̃_b, and before the transition to D̃_{b+1}, we freeze a copy of it and save it in memory as an "expert of level b". All this leads to Algorithm 1.
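The key step of the curriculum, labeling budget-(b+1) instances with an expert trained on budgets up to b, can be sketched as follows. The function and argument names are ours; the paper's GreedyRollout also handles budget bookkeeping, which is abstracted away here:

```python
def label_with_expert(states, afterstates, to_play, expert_value):
    """Heuristically solve budget-(b+1) instances with a level-b expert.

    For each instance, enumerate every possible afterstate (each has total
    budget <= b, so the expert can score it), then take the greedy value:
    max over afterstates for the leader, min for the follower.
    """
    targets = {}
    for s in states:
        scores = [expert_value(s2) for s2 in afterstates(s)]
        targets[s] = max(scores) if to_play(s) == 'max' else min(scores)
    return targets
```

The resulting (instance, target) couples form the dataset on which the next expert is trained, regardless of whether the player to move maximizes or minimizes.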

5 The Multilevel Critical Node Problem

Figure 1: Example of an MCN game on a graph with unit weights.

The MCN Baggio et al. (2020) is a trilevel budgeted combinatorial problem on a weighted graph. The leader is called the defender and the follower the attacker. The defender begins by removing (vaccinating) a set D of nodes, then the attacker labels a set I of nodes as attacked (infected), and finally the defender removes (protects) a set P of new nodes that were not attacked. Once all the moves are done, attacked nodes are the source of a cascade of infections that propagates through arcs from node to node. All nodes that are not infected are saved. As D and P were removed from the graph, they are automatically saved, and the leader receives the weights of the nodes in D and P as reward when performing those actions. The quantity that the defender seeks to maximize is thus the total weight of the saved nodes at the end of the game, while the attacker aims to minimize it. An example of a game is presented in Figure 1. The problem can be written as:

(7)   max_{|D| <= b_1}  min_{|I| <= b_2}  max_{|P| <= b_3, P disjoint from I}  sum of the weights of the saved nodes

where b_1, b_2, b_3 are the vaccination, attack and protection budgets.
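The end-of-game value described above, the total weight of nodes still uninfected after the cascade, can be computed with a simple breadth-first search. The sketch below assumes an undirected graph with nodes 0..n-1; the function and argument names are ours:

```python
from collections import deque

def saved_weight(n, edges, weights, vaccinated, attacked, protected):
    """Value of a finished MCN game (sketch).

    Vaccinated and protected nodes are removed from the graph before the
    cascade; infection then spreads from the attacked nodes along the
    remaining edges. Returns the total weight of saved (uninfected) nodes.
    """
    removed = set(vaccinated) | set(protected)
    adj = {v: [] for v in range(n)}
    for u, v in edges:
        if u not in removed and v not in removed:
            adj[u].append(v)
            adj[v].append(u)
    # BFS from every attacked node that is still in the graph
    infected = {a for a in attacked if a not in removed}
    queue = deque(infected)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in infected:
                infected.add(v)
                queue.append(v)
    return sum(weights[v] for v in range(n) if v not in infected)
```

On a unit-weight path 0-1-2-3-4, vaccinating node 2 and attacking node 0 infects nodes 0 and 1, so three nodes are saved.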
With unit weights, the MCN has been shown to be at least NP-hard on undirected graphs and at least Σ_2^p-hard on directed ones. In the more general version, with positive weights and costs associated to each node, the problem is Σ_3^p-complete Nabli et al. (2020).

6 Computational Results


We studied three versions of the MCN: undirected with unit weights (MCN), undirected with positive weights (MCN_w), and directed with unit weights (MCN_dir). The first distribution of instances considered is constituted of Erdos-Renyi graphs Erdos and Renyi (1960); for the weighted case, we considered integer weights. The second distribution of instances focuses on larger graphs. To compare our results with exact ones, we used the budgets reported in the experiments of the original MCN paper Baggio et al. (2020).
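Sampling a training instance from such a distribution amounts to drawing an Erdos-Renyi graph plus a budget tuple. A minimal sketch, where the parameter names and the uniform choice of graph size are our assumptions:

```python
import random

def sample_instance(n_min, n_max, density, budgets, rng):
    """Sample one MCN-style training instance (sketch).

    Draws a graph size n uniformly in [n_min, n_max], an Erdos-Renyi
    G(n, density) edge set, and returns it with the fixed budget tuple
    (vaccination, attack, protection).
    """
    n = rng.randint(n_min, n_max)
    edges = [(u, v) for u in range(n) for v in range(u + 1, n)
             if rng.random() < density]
    return n, edges, budgets
```

Passing an explicit `random.Random` instance keeps dataset generation reproducible across runs.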

Figure 2: Architecture of the two neural networks used. The node-score network computes a score for each node, which can be interpreted as its probability of being saved given the context (graph embedding and budgets).

Graph embedding.

The architecture presented in Figure 2 was implemented with PyTorch Geometric Fey and Lenssen (2019) and PyTorch 1.4 Paszke et al. (2019). At the beginning, a node has only two features: its weight and an indicator of whether it is attacked or not. The first step of our node embedding method is to concatenate the node's two features with its Local Degree Profile Cai and Wang (2018), consisting of simple statistics on its degree. Following the success of Attention on routing problems reported in Kool et al. (2019), we then apply their Attention Layer. As infected nodes are the ones in the same connected component as attacked ones in the graph, we sought to propagate the information of each node to all the others it is connected to. That way, the attacker can know which nodes are already infected before spending the rest of his/her budget, and the defender can realize which nodes are to protect in his/her last move. So, after the Attention Layers, we used an APPNP layer Klicpera et al. (2019) that, given the matrix of node embeddings H, the adjacency matrix with inserted self-loops Â, its corresponding diagonal degree matrix D̂, and a coefficient alpha, recursively applies n times:

H^(k+1) = (1 - alpha) D̂^{-1/2} Â D̂^{-1/2} H^(k) + alpha H^(0),   with H^(0) = H
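This APPNP recursion is a personalized-PageRank smoothing of the node features and is easy to reproduce outside of PyTorch Geometric, e.g. with dense NumPy arrays (a sketch, not the library implementation):

```python
import numpy as np

def appnp_propagate(h0, adj, alpha, n_iter):
    """APPNP propagation (sketch): personalized-PageRank smoothing of the
    node features h0 with teleport coefficient alpha, iterated n_iter times."""
    a_hat = adj + np.eye(adj.shape[0])           # insert self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))
    a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt     # symmetric normalization
    h = h0
    for _ in range(n_iter):
        h = (1.0 - alpha) * a_norm @ h + alpha * h0
    return h
```

With alpha = 1 the recursion returns the input features unchanged; with small alpha, information diffuses across the whole connected component, which is exactly the behavior sought here.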
To achieve our goal, the number of propagation steps n must be at least equal to the size of the largest connected component possible to have in the distribution of instances; we thus set n accordingly. Finally, the graph embedding method we used is the one presented in Li et al. (2016). Given two neural networks f1 and f2 which compute, respectively, a score and a projection of each node embedding h_v, the graph-level representation vector it outputs is:

h_G = sum_{v in V} sigmoid(f1(h_v)) ⊙ f2(h_v)
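For illustration, here is this gated-sum readout with the two networks f1 and f2 reduced to single linear maps (an assumption for brevity; in the paper they are small neural networks):

```python
import numpy as np

def gated_readout(h, w_score, w_proj):
    """Graph-level readout in the style of Li et al. (2016) (sketch).

    h:       (n, d_in) node embeddings.
    w_score: (d_in, 1) stand-in for the scoring network f1.
    w_proj:  (d_in, d_out) stand-in for the projection network f2.
    """
    gate = 1.0 / (1.0 + np.exp(-(h @ w_score)))  # (n, 1) gates in (0, 1)
    proj = h @ w_proj                            # (n, d_out) projections
    return (gate * proj).sum(axis=0)             # (d_out,) graph embedding
```

The sigmoid gate acts as a soft attention over nodes, so the readout is invariant to node ordering, a property any graph-level embedding needs.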
To train our agent and at inference, we used one GPU of a cluster of NVIDIA V100 SXM2 with 16 GB of memory. We make our code publicly available; further details of the implementation are discussed in Appendix D.


We compared our algorithms on the first distribution and trained our best performing one on the second. But as is, comparing MultiL-DQN with MultiL-Cur may be unfair. Indeed, MultiL-DQN uses a Q-network whereas MultiL-Cur uses a value network. The reasons why we used a value network instead of a Q-network in our second algorithm are twofold. First, as our curriculum leans on the abilities of experts trained on smaller budgets to create the next training dataset, computing values of afterstates is necessary to heuristically solve instances with larger budgets. Second, as MCN is a game with one player removing nodes from the graph, symmetries can be leveraged in the afterstates. Indeed, given the graph resulting from a node deletion, many couples of graph and node to delete could have resulted in the same afterstate. Thus, a Q-network has to learn that all these possibilities are similar, while a value network only needs to learn the value of the shared afterstate, which is more efficient Sutton and Barto (1998). To fairly compare the algorithms, we thus introduce MultiL-MC, a version of MultiL-DQN based on a value network and using Monte-Carlo samples as in MultiL-Cur. Its pseudo-code is available in Appendix B.3.

Table 1: Evolution during training of the loss on test sets of exactly solved instances. Averaged over several runs. We measured the loss on distributions arriving at different stages of the curriculum. The approximation ratio and optimality gap were measured after training and averaged over all the test sets.


Results from Table 1 indicate that MultiL-Cur is the best performing algorithm on the first distribution. Thus, we trained our learning agent with it on the second distribution and tested its performance on the datasets generated in Baggio et al. (2020). We compare the results with other heuristics: the random policy (for each instance, we average the value given by several random episodes), and the DA-AD heuristic Baggio et al. (2020). The latter consists of separately solving the two bilevel problems inside MCN: the vaccination strategy is chosen by setting the protection budget to zero and exactly solving the resulting Defender-Attacker problem, while the attack and protection strategies are determined by solving the subsequent Attacker-Defender problem. The metrics we use are the optimality gap and the approximation ratio, averaged over the instances for which the optimal value is available. In Table 2, we report the inference times in seconds for our trained agents. The ones for the exact method and DA-AD are from Baggio et al. (2020).
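The two evaluation metrics can be computed as follows; since the exact formulas are not reproduced here, the definitions below (mean relative gap in percent, mean pairwise ratio) are our assumptions:

```python
def gap_and_ratio(opt, heur):
    """Average optimality gap (%) and approximation ratio over instances
    with a known positive optimal value (assumed definitions, see lead-in)."""
    n = len(opt)
    gap = sum(abs(o - h) / o for o, h in zip(opt, heur)) / n * 100.0
    ratio = sum(max(o, h) / min(o, h) for o, h in zip(opt, heur)) / n
    return gap, ratio
```

With these definitions, a heuristic matching the optimum on every instance yields a gap of 0% and a ratio of 1.00, the floor values seen in Table 2.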

exact | random | DA-AD | cur (MCN) | cur (MCN_w) | cur (MCN_dir)
20 29 68 3.32 6 0.3 1.00 0.4 0.5 1.00 5.7 1.07 6.9 1.07
40 241 52 2.64 13 7.6 1.09 0.9 5.0 1.06 11.9 1.13 6.5 1.07
60 405 68 3.24 38 7.3 1.09 1.5 4.4 1.05 4.4 1.05 3.7 1.04
80 636 55 2.28 60 3.8 1.04 2.8 2.7 1.03 1.6 1.02 2.8 1.03
100 848 45 1.86 207 2.7 1.03 8.7 49.6 1.50 1.8 1.02 4.1 1.05
Table 2: Comparison between several heuristics and exact methods. Results on MCN are computed on the dataset of the original paper Baggio et al. (2020). For MCN_w and MCN_dir, we generated our own datasets by making small adaptations to the exact solver of Baggio et al. (2020), originally suited for MCN; see Appendix C for details.


Although the results in Table 1 are the outcome of the same total number of episodes and optimization steps for all algorithms, our experiments show that we can substantially reduce both the amount of data and the number of steps without sacrificing much of the results for the curriculum, which cannot be said of MultiL-DQN, which is data-hungry; see Appendix E for details. The major drawback of MultiL-Cur is that it needs to compute all possible afterstates at each step of the rollout. This does not scale well with the graph's size: having n nodes at the first step means that there are n graphs of size n - 1 to check. Thus, the curriculum we present is a promising step towards the automatic design of heuristics, while opening new research directions on restricting the exploration of rollouts.

Table 2 reveals that the results given by the MultiL-Cur algorithm are close to optimal for a fraction of the time necessary for both DA-AD and the quickest exact solver known, presented in Baggio et al. (2020). For the weighted instances, the jump in the metrics for graphs of size 100 is due to one outlier among the exactly solved instances of this size. When removed, the optimality gap and approximation ratio drop back in line with the other sizes. The performances measured are consistent across different problems, as we also report low optimality gaps and approximation ratios for MCN_w and MCN_dir. The curriculum we devised is thus a robust and efficient way to train agents in a Multilevel Budgeted setting.

Broader Impact

The methodology presented in this work bridges different research fields: Combinatorial Optimization, Game Theory and Reinforcement Learning. Indeed, it contributes to tackling Combinatorial Optimization problems where the players' rational behavior is considered, while acknowledging the intractability of exact solutions and, consequently, taking advantage of the recent advances in Reinforcement Learning to approximate them. In particular, advances in those areas will lead to improvements of our methodology. For instance, an interesting major benefit of our method is its ability to incorporate, and thus take advantage of, any exact method developed for k-level problems with k smaller than the total number of levels considered. Hence, advancements in optimization directly translate into improvements of our approach.

Multilevel Budgeted Combinatorial problems are of very practical interest. Namely, they can be interpreted as models of robust solutions, given that the follower always selects the worst-case scenario for the leader. Therefore, the approaches presented in this paper represent a novel direction for building robust solutions for combinatorial problems.

Finally, we highlight the need for, and difficulty of, scaling up methods for multilevel optimization due to their theoretical and empirical complexity. Note that scalability is of utmost importance, as many graph problems of practical interest are associated with social networks, which are extremely large. This motivates the development of heuristics, like the one presented here, that can both serve as stand-alone methods and warm-start/guide exact solvers. The main drawback of heuristics for multilevel optimization lies in their evaluation: given a leader's strategy, its associated reward (value) can only be evaluated if the remaining levels are solved to optimality. This means that, in opposition to single-level optimization, one must be very careful in interpreting the estimated reward associated with a heuristic method: we can be overestimating or underestimating it. In other words, in the remaining levels, players might be failing to behave in an optimal way. Consequently, further research is necessary to provide guarantees on the quality of the obtained solution, namely, dual bounds.

The authors wish to thank the Institut de valorisation des données and the Fonds de Recherche du Québec for their support through the FRQ–IVADO Research Chair in Data Science for Combinatorial Game Theory, and the Natural Sciences and Engineering Research Council of Canada for discovery grant 2019-04557. This research was enabled in part by support provided by Calcul Québec and Compute Canada.


  • [1] T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama (2019) Optuna: a next-generation hyperparameter optimization framework. arXiv preprint arXiv:1907.10902. Cited by: §D.1.
  • [2] A. Baggio, M. Carvalho, A. Lodi, and A. Tramontani (2020) Multilevel approaches for the critical node problem. Operations Research To appear. Cited by: §C.1, §C.2, Appendix C, §D.3, §1, §5, §6, §6, §6, Table 2.
  • [3] Y. Bai, D. Xu, A. Wang, K. Gu, X. Wu, A. Marinovic, C. Ro, Y. Sun, and W. Wang (2020) Fast detection of maximum common subgraph via deep q-learning. External Links: 2002.03129 Cited by: §1, §3.
  • [4] M. Balcan, T. Dick, T. Sandholm, and E. Vitercik (2018-10–15 Jul) Learning to branch. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 344–353. External Links: Link Cited by: §3.
  • [5] T. D. Barrett, W. R. Clements, J. N. Foerster, and A. I. Lvovsky (2020) Exploratory combinatorial optimization with reinforcement learning. In Proceedings of the 34th National Conference on Artificial Intelligence, AAAI. Cited by: §1, §1, §3.
  • [6] I. Bello, H. Pham, Q. V. Le, M. Norouzi, and S. Bengio (2016) Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940. Cited by: §1.
  • [7] Y. Bengio, A. Lodi, and A. Prouvost (2018) Machine learning for combinatorial optimization: a methodological tour d’horizon. External Links: 1811.06128 Cited by: §1, §3.
  • [8] Y. Bengio, J. Louradour, R. Collobert, and J. Weston (2009) Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, New York, NY, USA, pp. 41–48. External Links: ISBN 9781605585161, Document Cited by: §4.4.
  • [9] C. Blair (1992-12-01) The computational complexity of multi-level linear programs. Annals of Operations Research 34 (1), pp. 13–19. External Links: ISSN 1572-9338, Document Cited by: §1.
  • [10] J. Bracken and J. T. McGill (1973) Mathematical programs with optimization problems in the constraints. Operations Research 21 (1), pp. 37–44. External Links: ISSN 0030364X, 15265463 Cited by: §1.
  • [11] G. Brown, M. Carlyle, J. Salmerón, and R. Wood (2006-12) Defending critical infrastructure. Interfaces 36, pp. 530–544. External Links: Document Cited by: §1.
  • [12] C. Cai and Y. Wang (2018) A simple yet effective baseline for non-attributed graph classification. arXiv preprint arXiv:1811.03508. Cited by: §D.1, §6.
  • [13] Q. Cappart, E. Goutierre, D. Bergman, and L. Rousseau (2019) Improving optimization bounds using machine learning: decision diagrams meet deep reinforcement learning. In AAAI, Cited by: §1, §3.
  • [14] M. Carvalho, K. Glorie, X. Klimentova, M. Constantino, and A. Viana (2020) Robust models for the kidney exchange problem. INFORMS Journal on Computing (to appear). Cited by: §1.
  • [15] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2018) Graph attention networks. In International Conference on Learning Representations, Cited by: §D.1.
  • [16] S. DeNegre (2011) Interdiction and discrete bilevel linear programming. Ph.D. Thesis, Lehigh University. External Links: Link Cited by: §1.
  • [17] T. Dudás, B. Klinz, and G. J. Woeginger (1998) The computational complexity of multi-level bottleneck programming problems. In Multilevel Optimization: Algorithms and Application, A. Migdalas, P. M. Pardalos, and P. Värbrand (Eds.), Boston, MA, pp. 165–179. External Links: ISBN 978-1-4613-0307-7, Document Cited by: §1.
  • [18] P. Erdős and A. Rényi (1960) On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci. 5, pp. 17–61. Cited by: §6.
  • [19] M. Etheve, Z. Alès, C. Bissuel, O. Juan, and S. Kedad-Sidhoum (2020) Reinforcement learning for variable selection in a branch and bound algorithm. arXiv preprint arXiv:2005.10026. Cited by: §3.
  • [20] J. Fan, Z. Wang, Y. Xie, and Z. Yang (2019) A theoretical analysis of deep q-learning. arXiv preprint arXiv:1901.00137. Cited by: §4.3.
  • [21] M. Fey and J. E. Lenssen (2019) Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, Cited by: §6.
  • [22] M. Fischetti, I. Ljubić, M. Monaci, and M. Sinnl (2017) A new general-purpose algorithm for mixed-integer bilevel linear programs. Operations Research 65 (6), pp. 1615–1637. External Links: Document, Link, Cited by: §1.
  • [23] M. Fischetti, I. Ljubić, M. Monaci, and M. Sinnl (2019) Interdiction games and monotonicity, with application to knapsack problems. INFORMS Journal on Computing 31 (2), pp. 390–410. External Links: Document, Link, Cited by: §1.
  • [24] M. Fischetti, M. Monaci, and M. Sinnl (2018) A dynamic reformulation heuristic for generalized interdiction problems. European Journal of Operational Research 267 (1), pp. 40 – 51. External Links: ISSN 0377-2217, Document, Link Cited by: §1.
  • [25] A. Forghani, F. Dehghanian, M. Salari, and Y. Ghiami (2020) A bi-level model and solution methods for partial interdiction problem on capacitated hierarchical facilities. Computers & Operations Research 114, pp. 104831. External Links: ISSN 0305-0548, Document, Link Cited by: §1.
  • [26] F. Furini, I. Ljubic, E. Malaguti, and P. Paronuzzi (2019) Casting light on the hidden bilevel combinatorial structure of the k-vertex separator problem. In OR-19-6, DEI, University of Bologna, Cited by: §1.
  • [27] M. Gasse, D. Chételat, N. Ferroni, L. Charlin, and A. Lodi (2019) Exact combinatorial optimization with graph convolutional neural networks. In NeurIPS, pp. 15554–15566. Cited by: §3.
  • [28] T. F. Gonzalez (2007) Handbook of approximation algorithms and metaheuristics (chapman & hall/crc computer & information science series). Chapman & Hall/CRC. External Links: ISBN 1584885505 Cited by: §1.
  • [29] I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT Press. Cited by: §1.
  • [30] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §D.1.
  • [31] X. He, C. Li, T. Huang, C. Li, and J. Huang (2014) A recurrent neural network for solving bilevel linear programming problem. IEEE Transactions on Neural Networks and Learning Systems 25 (4), pp. 824–830. Cited by: §3.
  • [32] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, pp. 448–456. Cited by: §D.1.
  • [33] R. G. Jeroslow (1985-06-01) The polynomial hierarchy and a simple model for competitive analysis. Mathematical Programming 32 (2), pp. 146–164. External Links: ISSN 1436-4646, Document Cited by: §1.
  • [34] E. Khalil, H. Dai, Y. Zhang, B. Dilkina, and L. Song (2017) Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 6348–6358. Cited by: §E.4, §1, §3, §4.3.
  • [35] D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: 1, 2, 4.
  • [36] J. Klicpera, A. Bojchevski, and S. Günnemann (2019) Combining neural networks with personalized pagerank for classification on graphs. In International Conference on Learning Representations, Cited by: §D.1, §6.
  • [37] W. Kool, H. van Hoof, and M. Welling (2019) Attention, learn to solve routing problems!. In International Conference on Learning Representations, Cited by: §D.1, §E.4, §1, §1, §3, §6.
  • [38] K. Lachhwani and A. Dwivedi (2017-04) Bi-level and multi-level programming problems: taxonomy of literature review and research issues. Archives of Computational Methods in Engineering 25, pp. . External Links: Document Cited by: §1.
  • [39] M. D. Lange, R. Aljundi, M. Masana, S. Parisot, X. Jia, A. Leonardis, G. Slabaugh, and T. Tuytelaars (2019) A continual learning survey: defying forgetting in classification tasks. arXiv preprint arXiv:1909.08383. Cited by: §4.4.
  • [40] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel (2016) Gated graph sequence neural networks. In International Conference on Learning Representations, Cited by: §6.
  • [41] Z. Li, Q. Chen, and V. Koltun (2018) Combinatorial optimization with graph convolutional networks and guided tree search. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 539–548. External Links: Link Cited by: §E.5, §1, §3.
  • [42] M. L. Littman and C. Szepesvari (1996) A generalized reinforcement-learning model: convergence and applications. Technical report Brown University, Brown University, USA. Cited by: §1, §4.
  • [43] M. L. Littman (1994) Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on International Conference on Machine Learning, ICML’94, San Francisco, CA, USA, pp. 157–163. External Links: ISBN 1558603352 Cited by: §3, §4.3, §4.
  • [44] M. L. Littman (1996) Algorithms for sequential decision-making. Ph.D. Thesis, Brown University, Brown University, USA. External Links: ISBN 0591163500 Cited by: §1, §4.1, §4.3, §4.
  • [45] A. Lodi and G. Zarpellon (2017) On learning and branching: a survey. TOP 25, pp. 207–236. Cited by: §3.
  • [46] L. Lozano and J. C. Smith (2017) A backward sampling framework for interdiction problems with fortification. INFORMS J. Comput. 29, pp. 123–139. Cited by: §1.
  • [47] Q. Ma, S. Ge, D. He, D. Thaker, and I. Drori (2019) Combinatorial optimization by graph pointer networks and hierarchical reinforcement learning. arXiv preprint arXiv:1911.04936. Cited by: §1, §3.
  • [48] N. Mazyavkina, S. Sviridov, S. Ivanov, and E. Burnaev (2020) Reinforcement learning for combinatorial optimization: a survey. External Links: 2003.03600 Cited by: §1.
  • [49] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis (2015-02-26) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. External Links: Link Cited by: §B.3.
  • [50] A. Nabli, M. Carvalho, and P. Hosteins (2020) Complexity of the multilevel critical node problem. arXiv preprint arXiv:2007.02370. Cited by: §1, §5.
  • [51] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8024–8035. External Links: Link Cited by: §6.
  • [52] H. D. Ratliff, G. T. Sicilia, and S. H. Lubore (1975) Finding the n most vital links in flow networks. Management Science 21 (5), pp. 531–539. External Links: ISSN 00251909, 15265501 Cited by: §1.
  • [53] L. S. Shapley (1953) Stochastic games. Proceedings of the National Academy of Sciences 39 (10), pp. 1095–1100. External Links: Document, ISSN 0027-8424 Cited by: §4.
  • [54] H. Shih, U. Wen, S. Lee, K. Lan, and H. Hsiao (2004-07) A neural network approach to multiobjective and multilevel programming problems. Comput. Math. Appl. 48 (1–2), pp. 95–108. External Links: ISSN 0898-1221, Document Cited by: §3.
  • [55] Y. Shoham, R. Powers, and T. Grenager (2007) If multi-agent learning is the answer, what is the question?. Artificial Intelligence 171 (7), pp. 365 – 377. Note: Foundations of Multi-Agent Learning External Links: ISSN 0004-3702, Document Cited by: §3.
  • [56] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis (2017-10-01) Mastering the game of go without human knowledge. Nature 550 (7676), pp. 354–359. External Links: ISSN 1476-4687, Document Cited by: §3.
  • [57] A. Sinha, P. Malo, and K. Deb (2018) A review on bilevel optimization: from classical to evolutionary approaches and applications. IEEE Transactions on Evolutionary Computation 22 (2), pp. 276–295. Cited by: §1.
  • [58] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014-01) Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15 (1), pp. 1929–1958. External Links: ISSN 1532-4435 Cited by: §D.1.
  • [59] R. S. Sutton and A. G. Barto (1998) Introduction to reinforcement learning. 1st edition, MIT Press, Cambridge, MA, USA. External Links: ISBN 0262193981 Cited by: §2, §6.
  • [60] S. Tahernejad, T. K. Ralphs, and S. T. DeNegre (2020) A branch-and-cut algorithm for mixed integer bilevel linear optimization problems and its implementation. Cited by: §1.
  • [61] E. Talbi (2013) Metaheuristics for bi-level optimization. Springer-Verlag Berlin Heidelberg. External Links: Document Cited by: §1.
  • [62] G. Tesauro (2002) Programming backgammon using self-teaching neural nets. Artificial Intelligence 134 (1), pp. 181 – 199. External Links: ISSN 0004-3702, Document Cited by: §3.
  • [63] H. von Stackelberg (1934) Marktform und gleichgewicht. Springer-Verlag, Berlin. Cited by: §1.
  • [64] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu (2020) A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, pp. 1–21. External Links: ISSN 2162-2388, Link, Document Cited by: §1.
  • [65] G. Zarpellon, J. Jo, A. Lodi, and Y. Bengio (2020) Parameterizing branch-and-bound search trees to learn branching policies. arXiv preprint arXiv:2002.05120. Cited by: §3.
  • [66] G. Zhang, J. Lu, and Y. Gao (2015) Multi-level decision making: models, methods and applications. Springer-Verlag Berlin Heidelberg. External Links: Document Cited by: §1.

Appendix A Proofs

Lemma A.1.

The Multilevel Budgeted Combinatorial optimization problem (1) is equivalent to:


We immediately have the following relation:

As the same reasoning holds at the subsequent levels, we can apply it recursively, which completes the proof. ∎

Lemma A.2.



For every time step and every state, we define the set of optimal actions of the current player, making evident its dependence on the previous actions. As, by assumption, we consider games where players can only improve their objective by taking a decision, this set is nonempty. Recall that the action taken at each step is defined as a random variable uniformly distributed over the set of possible actions. Given an instance, take one of the possible sequences of optimal decisions. Then, using the chain rule, it is easy to show by induction that this sequence occurs with strictly positive probability. In words, every optimal sequence of decisions is generated with a strictly positive probability. ∎

Appendix B Algorithms

B.1 MultiL-DQN

As the player currently playing is completely determined by the state, we can use the same neural network to estimate all the state-action values, regardless of the player. We call the total budget the sum of all the players' budgets, so that an episode stops when it reaches zero.

1 Initialize the replay memory to capacity ;
2 Initialize the -network with weights ;
3 Initialize the target-network with weights ;
4 for episode  do
5        Sample ;
6        ;
7        while  do
8               ;
9               ;
10               ;
11               if  then
12                      Add to ;
13                      Sample a random batch ;
14                      for  do
16                     Update over with Adam Kingma and Ba (2015) ;
17                      Update every steps
return the trained -network
Algorithm 2 MultiL-DQN
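The action-selection step behind Algorithm 2 can be sketched as follows. Since the current player is determined by the state, a single Q-function serves both players: the maximizer takes the argmax of the state-action values and the minimizer the argmin. This is a hypothetical sketch, not the paper's code; the names `q_values` and `is_maximizer` are assumptions:

```python
import random

def select_action(q_values, is_maximizer, epsilon=0.0, rng=random):
    """Epsilon-greedy selection over a dict {action: Q(s, a)}.

    One Q-function serves both players: the state determines whose
    turn it is, and only the direction of the optimization changes.
    """
    if epsilon and rng.random() < epsilon:
        return rng.choice(sorted(q_values))  # exploration move
    pick = max if is_maximizer else min       # player's direction
    return pick(q_values, key=q_values.get)
```

Sharing one network across players is what makes the zero-sum structure cheap to exploit: no separate protagonist/antagonist models need to be trained or synchronized.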

B.2 Greedy Rollout

Input : A state with its total budget and a list of expert value networks
1 Initialize the value ;
2 while  do
3        Retrieve the expert of the next level from the list ;
4        Generate every possible afterstate ;
5        ;
6        ;
7        ;
return the value
Algorithm 3 Greedy Rollout
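A minimal Python sketch of this greedy rollout, assuming a state interface (`total_budget`, `is_maximizer`, `afterstates`) and a list of per-budget expert value functions; `ToyState` is a purely illustrative stand-in, not the paper's environment:

```python
class ToyState:
    """Minimal illustrative stand-in for a game state: a scalar value,
    a remaining total budget, and whose turn it is."""

    def __init__(self, value, budget, maximizer):
        self.value, self.budget, self.maximizer = value, budget, maximizer

    def total_budget(self):
        return self.budget

    def is_maximizer(self):
        return self.maximizer

    def afterstates(self):
        # Two possible moves (+1 or -1), alternating players.
        return [ToyState(self.value + d, self.budget - 1, not self.maximizer)
                for d in (+1, -1)]


def greedy_rollout(state, experts):
    """Greedy rollout in the spirit of Algorithm 3: at each level,
    enumerate every afterstate and let the expert trained for that
    budget pick the best one for the current player."""
    value = 0.0
    while state.total_budget() > 0:
        expert = experts[state.total_budget() - 1]   # expert for afterstates
        candidates = state.afterstates()
        scores = [expert(s) for s in candidates]
        pick = max if state.is_maximizer() else min  # player's direction
        idx = scores.index(pick(scores))
        state, value = candidates[idx], scores[idx]
    return value
```

Because every afterstate is checked at every level, the rollout is polynomial per move and handles either direction of optimization, which is exactly the property the curriculum relies on.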

B.3 MultiL-MC

As we use Monte-Carlo samples as targets, the targets sampled from the replay memory do not depend on the current expert, as they would on the target network in DQN Mnih et al. (2015), but on a previous version of it, which can quickly become outdated. Thus, to easily control the number of times an old estimate is used, we perform an epoch on the memory every time enough new samples have been pushed, and choose the capacity so that the total number of times a Monte-Carlo sample is seen is directly controlled.

1 Initialize the replay memory to capacity ;
2 Initialize the value-network with weights ;
3 for episode  do
4        Sample ;
5        Initialize the memory of the episode to be empty;
6        Initialize the length of the episode ;
7        while  do // perform a Monte Carlo sample
8               ;
9               ;
10               Add to ;
12       Initialize the target ;
13        for  do // associate each state to its value
14               Recover from ;
15               ;
16               Add to
17       if there are more than new couples in  then
18               Create a random permutation ;
19               for batches  do // perform an epoch on the memory
20                      Update over the loss with Adam Kingma and Ba (2015)
return the trained value-network
Algorithm 4 MultiL-MC
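The replay scheme described above can be sketched as follows: Monte-Carlo samples are pushed into a bounded FIFO memory, and one epoch over the memory is performed each time a fixed number of new samples has arrived, so the capacity directly controls how many times each (possibly outdated) target is reused. All names and constants here are illustrative assumptions:

```python
# Sketch of the MultiL-MC replay scheme: a FIFO memory of capacity
# k * refill, with one epoch per `refill` fresh samples, so each
# sample is seen roughly k times before eviction.
from collections import deque

def multil_mc_replay(stream, refill=4, k=3, batch_size=2):
    memory = deque(maxlen=k * refill)  # capacity bounds sample reuse
    epochs = []                        # batches processed at each epoch
    new = 0
    for sample in stream:
        memory.append(sample)
        new += 1
        if new >= refill:              # enough fresh samples: one epoch
            data = list(memory)
            epochs.append([data[i:i + batch_size]
                           for i in range(0, len(data), batch_size)])
            new = 0                    # in practice: one gradient step
    return epochs                      # per batch with Adam
```

This makes the trade-off explicit: a larger capacity reuses each Monte-Carlo target more often, but increases the chance of training on values produced by a stale version of the expert.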

Appendix C Broadening the scope of the exact algorithm

In order to build a test set for comparing the results of our heuristics with exact ones, we used the exact method described in Baggio et al. (2020) to solve a small number of instances. Their algorithm was designed for the MCN problem and is directly applicable to it without change. However, in order to monitor the learning at each stage of the curriculum for MCN, as in Table 1, we need to solve instances in which some node infections have already been performed in the sequence of previous moves while the attacker still has budget left to spend, which is not possible with the method as presented in Baggio et al. (2020). Moreover, small changes need to be made in order to solve instances of MCN.

C.1 Adding nodes that are already infected

We denote the set of nodes that are already infected at the attack stage and, for each node, the indicator of whether it belongs to this set or not. Then, the total set of infected nodes after the attacker spends his/her remaining budget and infects new nodes is the union of both. In order to find it, we use the AP algorithm of Baggio et al. (2020), with the following modification to the rlxAP optimization problem: