## 1 Introduction

Combinatorial optimization is an important mathematical field addressing fundamental questions of computation. Popular examples include the maximum independent set (miller1960problem), satisfiability (schaefer1978complexity) and traveling salesman (TSP, voigt1831handlungsreisende) problems. Such problems also arise in various applied fields, e.g., sociology (harary1957procedure), operations research (feo1994greedy) and bioinformatics (gardiner2000graph). However, most combinatorial optimization problems are NP-hard, i.e., exact solutions are typically intractable to find in practical situations. Over the past decades, existing works have made significant efforts to mitigate this issue by designing fast heuristic solvers (knuth1997art; biere2009handbook; mezard2009information) that generate approximate solutions for such scenarios.

Recently, the remarkable progress in deep learning has stimulated increased interest in learning such heuristics with deep neural networks (DNNs). Such learning-based approaches are attractive since one can train the solver to specialize in a particular application domain with less reliance on expert knowledge. In the most straightforward approach, supervised learning schemes are used to train DNNs to imitate the solutions obtained from existing solvers (vinyals2015pointer; li2018combinatorial; selsam2018learning). However, the resulting quality and applicability are constrained by those of the existing solvers. An ideal direction is to discover new solutions in a fully unsupervised manner, potentially outperforming those based on domain-specific knowledge.

To this end, several recent works (bello2016neural; khalil2017learning; deudon2018learning; kool2018attention) consider deep reinforcement learning (DRL) based on a Markov decision process (MDP) whose rewards are derived from the optimization objective of the target problem. The corresponding agent can then be trained with existing DRL training schemes, e.g., bello2016neural trained the so-called pointer network for the TSP with actor-critic training. Such DRL-based methods are especially attractive since they can even solve unexplored problems where domain knowledge is scarce and no efficient heuristic is known.

Unfortunately, the existing DRL-based methods struggle to compete with the existing, highly optimized solvers. In particular, the gap becomes more significant when the problem requires solutions with higher dimensions or more complex structures. The reason is that they mostly emulate greedy algorithms (bello2016neural; khalil2017learning), i.e., they choose one element (or vertex) at each stage of the MDP, which is too slow for obtaining a solution on large-scale inputs. This motivates us to seek an alternative DRL scheme.

Contribution. In this paper, we propose a new scalable DRL framework, coined Learning what to Defer (LwD), designed for solving combinatorial problems on large graphs. We particularly focus on applying LwD to the popular maximum independent set (MIS) problem (miller1960problem), i.e., finding a maximum set of non-adjacent vertices in a graph, although the framework is also applicable to a broader range of problems (see Section 4.3). The MIS problem has been used in various applications including classification theory (feo1994greedy) (sander2008efficient) and communication (jiang2010distributed). In theory, MIS is impossible to approximate in polynomial time by a constant factor (unless P=NP) (hastad1996clique), in contrast to (Euclidean or metric) TSP, which can be approximated within a factor of $1.5$ (christofides1976worst).

The main novelty of LwD is automatically stretching the determination of the solution over multiple steps. In particular, the agent iteratively acts on every undetermined vertex by either (a) determining the membership of the vertex in the solution or (b) deferring the determination to later steps (see Figure 1 for illustration). Inspired by the celebrated survey propagation (braunstein2005survey) for solving the satisfiability (SAT) problem (schaefer1978complexity), LwD can be interpreted as prioritizing the “easier” decisions to be made first, which in turn simplifies the harder ones by eliminating sources of uncertainty. Compared to the greedy strategy (khalil2017learning), which determines the membership of a single vertex at each step, our framework brings a significant speedup by learning to make many decisions at once (and deferring the rest).

Based on such a speedup, LwD can solve the optimization problem by generating a large number of candidate solutions in a limited time budget, then reporting the best solution among them. For this scenario, it is beneficial for the algorithm to generate diverse candidates. To this end, we additionally give a novel diversification bonus to our agent during training, which explicitly encourages the agent to generate a large variety of solutions. Specifically, we create a “coupling” of MDPs to generate two solutions for the given MIS problem and reward the agents for a large deviation between the solutions. The resulting reward efficiently improves the performance at the evaluation.

We empirically validate the LwD method on various types of graphs including the Erdős-Rényi (erdHos1960evolution) model, the Barabási-Albert (albert2002statistical) model, the SATLIB (hoos2000satlib) benchmark and real-world graphs. Our algorithm shows consistent superiority over the existing state-of-the-art DRL method (khalil2017learning). Remarkably, it often outperforms the state-of-the-art MIS solver (KaMIS, hespe2019wegotyoucovered), particularly on large-scale graphs, e.g., on our machine, LwD running for 276 seconds achieves a better objective than KaMIS running for 3637 seconds on the Barabási-Albert graph with two million vertices. Furthermore, we show that our fully learning-based scheme generalizes well even to graph types unseen during training and works well for other similar combinatorial problems: the maximum weighted independent set problem, the prize collecting maximum independent set problem (hassin2006minimum) and the maximum-a-posteriori inference problem for the Ising model (onsager1944crystal).

## 2 Related Works

The maximum independent set (MIS) problem is a prototypical NP-hard task whose optimal solution cannot be approximated by a constant factor in polynomial time unless P = NP (hastad1996clique). (It is also known to be a $W[1]$-hard problem in terms of fixed-parameter tractability (downey2012parameterized).) Since the problem is NP-hard even to approximate, existing methods (tomita2010simple; san2011exact) for exactly solving the MIS problem require a prohibitive amount of computation on large graphs. To resolve this, researchers have developed a wide range of approximate solvers for the MIS problem (andrade2012fast; lamm2017finding; chang2017computing; hespe2019scalable). Notably, lamm2017finding developed a combination of an evolutionary algorithm with graph kernelization techniques for the MIS problem. Later, chang2017computing and hespe2019scalable further improved the graph kernelization technique by introducing new reduction rules and parallelization based on graph partitioning, respectively.

In the context of solving combinatorial optimization using neural networks, hopfield1985neural first applied the Hopfield network to the traveling salesman problem (TSP). Since then, several works have tried to utilize neural networks in different forms, e.g., see smith1999neural for a review of such papers. Such works mostly solve combinatorial optimization through online learning, i.e., training is performed for each problem instance separately. More recently, vinyals2015pointer and bello2016neural proposed to solve the TSP using an attention-based neural network trained in an offline way. They showed promising results that stimulated many other works to use neural networks for solving combinatorial problems (khalil2017learning; li2018combinatorial; selsam2018learning; deudon2018learning; amizadeh2018learning; kool2018attention). Importantly, khalil2017learning proposed a reinforcement learning framework for solving the minimum vertex cover problem, which is equivalent to solving the MIS problem. They query the agent for a vertex to add as a new member of the vertex cover at each step of the Markov decision process. However, such a greedy procedure hurts the scalability to large-scale graphs, as we mentioned in Section 1. Next, li2018combinatorial aimed to develop a supervised learning framework for solving the MIS problem. In one respect, their framework is similar to ours: they allow stretching the determination of the solution over multiple steps. However, their scheme for stretching the determination is hand-designed and not trainable from data. Furthermore, their scheme requires supervision, which is (a) highly sensitive to the quality of the solvers used for extracting it and (b) often too expensive or almost impossible to obtain.

## 3 Learning What to Defer

In this paper, we focus on solving the maximum independent set (MIS) problem. Given a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with vertices $\mathcal{V}$ and edges $\mathcal{E}$, an independent set is a subset of vertices $\mathcal{I} \subseteq \mathcal{V}$ where no two vertices in the subset are adjacent to each other. A solution to the MIS problem can be represented as a binary vector $\mathbf{x} = [x_i : i \in \mathcal{V}] \in \{0, 1\}^{\mathcal{V}}$ with maximum possible cardinality $\sum_{i \in \mathcal{V}} x_i$, where each element $x_i$ indicates the membership of vertex $i$ in the independent set $\mathcal{I}$, i.e., $x_i = 1$ if and only if $i \in \mathcal{I}$. Initially, the algorithm makes no assumption about its output, i.e., both $x_i = 0$ and $x_i = 1$ are possible for all $i \in \mathcal{V}$. At each iteration, the agent acts on each undetermined vertex $i$ by either (a) determining its membership to be a certain value, i.e., setting $x_i = 0$ or $x_i = 1$, or (b) deferring the determination to later iterations. The agent repeats the action until the membership of every vertex in the independent set is determined.

One can interpret such a strategy as progressively narrowing down the set of candidate solutions at each iteration (see Figure 1 for illustration). Intuitively, the act of deferring prioritizes choosing the values of the “easier” vertices first. After each decision, “hard” vertices become easier since the decisions on their surrounding easy vertices are better known. In a sense, this idea resembles how humans guide the behavior of animals, known as shaping (peterson2004day), where the easier concepts are fixed first so that they inform the animals' decisions on harder concepts. We additionally illustrate the whole algorithm in Appendix A.
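As a small illustration of the definitions above, the following sketch checks whether a binary vector encodes an independent set and computes its cardinality. The adjacency-set representation and the names `is_independent_set` and `cardinality` are assumptions made for this example and are not taken from the paper.

```python
from typing import Dict, Set

# Hypothetical representation: vertex id -> set of neighboring vertex ids.
Graph = Dict[int, Set[int]]


def is_independent_set(graph: Graph, x: Dict[int, int]) -> bool:
    """Return True if the vertices with x[i] == 1 are pairwise non-adjacent."""
    chosen = {i for i, xi in x.items() if xi == 1}
    return all(graph[i].isdisjoint(chosen) for i in chosen)


def cardinality(x: Dict[int, int]) -> int:
    """Objective of the MIS problem: the number of included vertices."""
    return sum(x.values())


# A 4-cycle 0-1-2-3-0: opposite corners form an independent set.
cycle = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
x_good = {0: 1, 1: 0, 2: 1, 3: 0}
x_bad = {0: 1, 1: 1, 2: 0, 3: 0}  # vertices 0 and 1 are adjacent
```

Here `x_good` is a valid solution of cardinality two, while `x_bad` violates the independence constraint.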

### 3.1 Deferred Markov Decision Process

We formulate the proposed algorithm as a pair of a Markov decision process (MDP) and an agent, i.e., a policy. At a high level, the MDP initializes its states on the given graph and generates a solution at termination. We train the agent to maximize the MIS objective, formulated as a cumulative sum of rewards over the MDP.

State. Each state of the MDP is represented as a vertex-state vector $\mathbf{s} = [s_i : i \in \mathcal{V}] \in \{0, 1, *\}^{\mathcal{V}}$, where the vertex $i$ is determined to be excluded or included in the independent set whenever $s_i = 0$ or $s_i = 1$, respectively. Otherwise, $s_i = *$ indicates that the determination has been deferred and is expected to be made in later iterations. The MDP is initialized with the deferred vertex-states, i.e., $s_i = *$ for all $i \in \mathcal{V}$, and terminates when (a) there is no deferred vertex-state left or (b) the time limit is reached.

Action. Actions correspond to new assignments for the next state of vertices. Since the vertex-states of included and excluded vertices are immutable, the assignments are defined only on the deferred vertices. An action is represented as a vector $\mathbf{a}_* = [a_i : i \in \mathcal{V}_*] \in \{0, 1, *\}^{\mathcal{V}_*}$, where $\mathcal{V}_*$ denotes the set of currently deferred vertices, i.e., $\mathcal{V}_* = \{i \in \mathcal{V} : s_i = *\}$.

Transition. Given two consecutive states $\mathbf{s}$ and $\mathbf{s}'$ and the corresponding assignment $\mathbf{a}_*$, the transition consists of two deterministic phases: the update phase and the clean-up phase. The update phase applies the assignment generated by the policy to the corresponding vertices, resulting in an intermediate vertex-state $\hat{\mathbf{s}}$, i.e., $\hat{s}_i = a_i$ if $i \in \mathcal{V}_*$ (vertex $i$ is currently deferred) and $\hat{s}_i = s_i$ otherwise. The clean-up phase modifies the intermediate vertex-state vector $\hat{\mathbf{s}}$ to yield a valid vertex-state vector $\mathbf{s}'$, where the included vertices are only adjacent to excluded vertices. To this end, whenever there exists a pair of included vertices adjacent to each other, they are both mapped back to the deferred vertex-state. Next, the MDP excludes any deferred vertex neighboring an included vertex. (We also note that such a clean-up phase can be replaced by training the agent with a soft penalty, i.e., a negative reward, for solutions corresponding to an invalid independent set. In our experiments, such an algorithm also performs well with only marginal degradation in its performance.) See Figure 3 for a more detailed illustration of the transition between two states.
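The two phases of the transition can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the graph is a plain adjacency-set dictionary, the deferred vertex-state is encoded as `None`, and the function name `transition` is hypothetical.

```python
from typing import Dict, Optional, Set

Graph = Dict[int, Set[int]]
State = Dict[int, Optional[int]]

# Assumed encoding of vertex-states: 0 = excluded, 1 = included, None = deferred.
EXCLUDED, INCLUDED, DEFERRED = 0, 1, None


def transition(graph: Graph, state: State, action: State) -> State:
    """Apply one update phase followed by one clean-up phase."""
    # Update phase: only deferred vertices receive the new assignment.
    inter = {i: (action[i] if s is DEFERRED else s) for i, s in state.items()}

    # Clean-up, step 1: adjacent included vertices are both mapped back to deferred.
    conflicted = {
        i
        for i, s in inter.items()
        if s == INCLUDED and any(inter[j] == INCLUDED for j in graph[i])
    }
    nxt = {i: (DEFERRED if i in conflicted else s) for i, s in inter.items()}

    # Clean-up, step 2: deferred vertices next to an included vertex are excluded.
    for i in list(nxt):
        if nxt[i] is DEFERRED and any(nxt[j] == INCLUDED for j in graph[i]):
            nxt[i] = EXCLUDED
    return nxt


# Path graph 0-1-2-3, starting from the all-deferred state.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
s0 = {i: DEFERRED for i in path}
a0 = {0: INCLUDED, 1: INCLUDED, 2: DEFERRED, 3: INCLUDED}
s1 = transition(path, s0, a0)
```

In this example, vertices 0 and 1 conflict and are deferred again, vertex 3 stays included, and vertex 2 is excluded because it neighbors vertex 3.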

Once the MDP has made all determinations, i.e., at termination, one can (optionally) improve the determined solution by applying the 2-improvement local search algorithm (feo1994greedy; andrade2012fast); it greedily increases the size of the independent set by removing one vertex and adding two vertices until no modification is possible.
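The local search step described above (remove one vertex, add two) can be sketched naively as follows. This is an unoptimized illustration assuming an adjacency-set graph; the function name `two_improvement` is hypothetical, and the cited works use considerably faster data structures than this quadratic scan.

```python
from itertools import combinations
from typing import Dict, Iterable, Set

Graph = Dict[int, Set[int]]


def two_improvement(graph: Graph, independent: Iterable[int]) -> Set[int]:
    """Repeatedly swap one vertex out and two vertices in while possible."""
    current = set(independent)
    improved = True
    while improved:
        improved = False
        for v in sorted(current):
            rest = current - {v}
            # Vertices outside the set with no neighbor in the set once v is removed.
            free = [
                u
                for u in sorted(graph)
                if u not in current and graph[u].isdisjoint(rest)
            ]
            for a, b in combinations(free, 2):
                if b not in graph[a]:  # a and b must not be adjacent to each other
                    current = rest | {a, b}
                    improved = True
                    break
            if improved:
                break  # restart the scan from the enlarged set
    return current


# Path graph 0-1-2: the set {1} can be improved to {0, 2}.
path3 = {0: {1}, 1: {0, 2}, 2: {1}}
improved_set = two_improvement(path3, {1})
```

Each successful swap increases the set size by one, so the loop terminates after at most as many swaps as there are vertices.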

Reward. Finally, we define the cardinality reward as the increase in cardinality of included vertices. To be specific, we define it as $R(\mathbf{s}, \mathbf{s}') = \sum_{i \in \mathcal{V}_* \setminus \mathcal{V}'_*} s'_i$, where $\mathcal{V}_*$ and $\mathcal{V}'_*$ are the sets of vertices with deferred vertex-state with respect to $\mathbf{s}$ and $\mathbf{s}'$, respectively. By doing so, the overall reward of the MDP corresponds to the cardinality of the independent set returned by our algorithm.
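To make the telescoping property concrete, the sketch below computes the cardinality reward along a short hand-written trajectory and sums it up. The encoding of the deferred state as `None` and the name `cardinality_reward` are assumptions for illustration only.

```python
from typing import Dict, Optional

State = Dict[int, Optional[int]]
DEFERRED = None  # assumed encoding of the deferred vertex-state


def cardinality_reward(state: State, next_state: State) -> int:
    """Number of vertices deferred in `state` that are included in `next_state`."""
    return sum(
        1
        for i, s in state.items()
        if s is DEFERRED and next_state[i] == 1
    )


# A hand-written trajectory on four vertices, ending with no deferred vertex.
trajectory = [
    {0: None, 1: None, 2: None, 3: None},
    {0: None, 1: None, 2: 0, 3: 1},
    {0: 1, 1: 0, 2: 0, 3: 1},
]
rewards = [cardinality_reward(s, t) for s, t in zip(trajectory, trajectory[1:])]
total = sum(rewards)
```

Since every vertex leaves the deferred state exactly once, the summed rewards equal the number of included vertices in the final state.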

### 3.2 Diversification Reward

Next, we introduce an additional diversification reward for encouraging diversification of the solutions generated by the agent. Such regularization is motivated by our evaluation method, which samples multiple candidate solutions and reports the best one as the final output. For such scenarios, it is beneficial to generate diverse solutions with a high maximum score, rather than ones with similar scores. One might argue that the existing entropy regularization (williams1991function) for encouraging exploration over the MDP could be used for this purpose. However, entropy regularization attempts to generate diverse trajectories of the same MDP, which does not necessarily lead to diverse final solutions, since many trajectories result in the same solution (see Section 3.1). We instead directly maximize the diversity among solutions with a new reward term. To this end, we “couple” two copies of the MDP defined in Section 3.1 into a new MDP by sharing the same graph with a pair of distinct vertex-state vectors $(\mathbf{s}, \bar{\mathbf{s}})$. Although we define the coupled MDP on the same graph, the corresponding agents work independently to produce a pair of solutions $(\mathbf{x}, \bar{\mathbf{x}})$. Then, we directly reward the deviation between the coupled solutions in terms of the $\ell_1$-norm, i.e., $\|\mathbf{x} - \bar{\mathbf{x}}\|_1$. Similar to the original objective of MIS, we decompose it into rewards in each iteration of the MDP defined as follows:

where and denotes the next pair of vertex-states in the coupled MDP. One can observe that indicates the most recently updated vertices in each MDP. In practice, such reward can be used along with the maximum entropy regularization for training the agent to achieve the best performance. See Figure 3 for an example of coupled MDP with the proposed reward.
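As a rough illustration of the quantity that the diversification reward decomposes, the sketch below computes the deviation between two final solutions as the number of vertices on which they disagree. It only covers the terminal deviation between complete solutions, not the per-iteration decomposition, and the name `l1_deviation` is a hypothetical helper.

```python
from typing import Dict


def l1_deviation(x: Dict[int, int], x_bar: Dict[int, int]) -> int:
    """Sum of absolute differences between two binary solution vectors."""
    return sum(abs(x[i] - x_bar[i]) for i in x)


# Two maximum independent sets of a 4-cycle: {0, 2} and {1, 3}.
x = {0: 1, 1: 0, 2: 1, 3: 0}
x_bar = {0: 0, 1: 1, 2: 0, 3: 1}
deviation = l1_deviation(x, x_bar)
```

Both solutions have the same cardinality, yet they differ on every vertex, which is exactly the kind of pair that a diversity bonus favors over two identical samples.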

| Type | min $\lvert\mathcal{V}\rvert$ | max $\lvert\mathcal{V}\rvert$ | CPLEX (classic) | KaMIS (classic) | TGS (SL-based) | TGS† (SL-based) | S2V-DQN (RL-based) | LwD (RL-based) | LwD† (RL-based) |
|---|---|---|---|---|---|---|---|---|---|
| ER | 50 | 100 | 21.11 | 21.11 | 19.90 (0.32) | 21.11 (0.65) | 20.61 (0.03) | 21.04 (0.01) | 21.11 (0.15) |
| ER | 100 | 200 | 27.87 | 27.95 | 24.94 (1.46) | 27.95 (1.54) | 26.27 (0.08) | 27.67 (0.03) | 27.95 (0.61) |
| ER | 400 | 500 | 31.73 | 39.61 | 33.46 (0.93) | 39.43 (12.37) | 35.05 (0.63) | 38.29 (0.16) | 39.81 (5.91) |
| BA | 50 | 100 | 32.07 | 32.07 | 31.77 (0.24) | 32.07 (0.25) | 31.96 (0.02) | 32.07 (0.01) | 32.07 (0.11) |
| BA | 100 | 200 | 66.07 | 66.07 | 65.25 (0.33) | 66.07 (0.52) | 65.52 (0.05) | 66.05 (0.01) | 66.07 (0.22) |
| BA | 400 | 500 | 204.1 | 204.1 | 201.2 (0.72) | 204.1 (7.86) | 202.9 (0.18) | 204.0 (0.02) | 204.1 (0.87) |
| HK | 50 | 100 | 23.95 | 23.95 | 23.39 (0.29) | 23.95 (0.60) | 23.77 (0.03) | 23.95 (0.02) | 23.95 (0.15) |
| HK | 100 | 200 | 50.15 | 50.15 | 48.74 (0.43) | 50.15 (2.00) | 49.64 (0.05) | 50.12 (0.04) | 50.15 (0.34) |
| HK | 400 | 500 | 156.8 | 157.0 | 152.1 (0.92) | 157.0 (7.63) | 152.8 (0.22) | 156.8 (0.14) | 157.0 (1.63) |
| WS | 50 | 100 | 23.08 | 23.08 | 21.90 (0.34) | 23.08 (0.71) | 22.64 (0.03) | 23.07 (0.03) | 23.08 (0.11) |
| WS | 100 | 200 | 47.17 | 47.18 | 44.55 (0.49) | 47.18 (1.89) | 45.39 (0.06) | 47.11 (0.06) | 47.18 (0.23) |
| WS | 400 | 500 | 138.3 | 143.3 | 134.8 (1.15) | 143.2 (6.08) | 132.2 (0.23) | 142.1 (0.17) | 143.3 (0.90) |
| SATLIB | 1209 | 1347 | 426.8 | 426.9 | 418.1 (19.6) | 426.7 (63.0) | 413.8 (2.3) | 424.8 (1.8) | 426.7 (7.1) |
| PPI | 591 | 3480 | 1148 | 1148 | 1128 (20.9) | 1148 (568.9) | 893 (6.3) | 1147 (1.8) | 1148 (30.8) |
| REDDIT-M-5K | 22 | 3648 | 370.6 | 370.6 | 367.1 (0.88) | 370.6 (1.7) | 370.1 (0.1) | 370.6 (0.6) | 370.6 (2.0) |
| REDDIT-M-12K | 2 | 3782 | 303.5 | 303.5 | 300.5 (0.75) | 303.5 (22.1) | 302.8 (1.9) | 292.6 (0.1) | 303.5 (2.0) |
| REDDIT-B | 6 | 3782 | 329.3 | 329.3 | 327.6 (0.7) | 329.3 (2.5) | 328.6 (0.1) | 329.3 (0.2) | 329.3 (3.0) |
| as-Caida | 8020 | 26 475 | 20 049 | 20 049 | 19 921 (65.85) | 20 049 (601.4) | 324 (34.8) | 20 049 (6.1) | 20 049 (34.6) |

| Type | $\lvert\mathcal{V}\rvert$ | CPLEX (classic) | KaMIS (classic) | TGS (SL-based) | TGS† (SL-based) | LwD (RL-based) | LwD† (RL-based) |
|---|---|---|---|---|---|---|---|
| BA | 500 000 | 137 821 (1129) | 228 123 (1002) | 227 701 (344) | 228 733 (6211) | 228 803 (45) | 228 829 (340) |
| BA | 1 000 000 | 275 633 (1098) | 457 541 (2502) | 455 354 (620) | 457 073 (10 484) | 457 698 (117) | 457 752 (651) |
| BA | 2 000 000 | 551 767 (1152) | 909 988 (3637) | 910 856 (1016) | OB | 915 887 (276) | 915 968 (1296) |
| HK | 500 000 | 90 012 (1428) | 175 153 (1477) | 173 852 (465) | 176 253 (777) | 176 850 (96) | 177 143 (519) |
| HK | 1 000 000 | 179 326 (1172) | 347 350 (4463) | 350 819 (887) | 353 244 (10 415) | 353 504 (248) | 353 723 (1151) |
| HK | 2 000 000 | OB | 695 544 (10 870) | OB | OB | 706 975 (648) | 707 422 (1601) |
| WS | 500 000 | 135 217 (1424) | 157 298 (1002) | 153 190 (354) | 155 230 (10 172) | 155 086 (62) | 155 574 (403) |
| WS | 1 000 000 | 270 526 (1093) | 303 810 (1699) | 308 065 (1009) | OB | 310 308 (144) | 311 041 (725) |
| WS | 2 000 000 | 540 664 (1159) | 603 502 (4252) | 611 057 (1628) | OB | 620 615 (345) | 621 687 (1388) |

| Type | $\lvert\mathcal{V}\rvert$ | TGS (SL-based) | TGS† (SL-based) | S2V-DQN (RL-based) | LwD (RL-based) | LwD† (RL-based) |
|---|---|---|---|---|---|---|
| Citation-Cora | 2708 | **1.00** (4) | **1.00** (4) | 0.96 (3) | **1.00** (3) | **1.00** (3) |
| Citation-Citeseer | 3327 | **1.00** (3) | **1.00** (3) | 0.99 (3) | **1.00** (2) | **1.00** (4) |
| Amazon-Photo | 7487 | 0.99 (9) | **1.00** (485) | 0.27 (66) | 0.99 (4) | **1.00** (33) |
| Amazon-Computers | 13 381 | 0.99 (8) | **1.00** (823) | 0.26 (236) | 0.99 (3) | **1.00** (101) |
| Coauthor-CS | 18 333 | 0.99 (17) | **1.00** (80) | 0.88 (197) | **1.00** (3) | **1.00** (78) |
| Coauthor-Physics | 34 493 | 0.98 (52) | **1.00** (1304) | 0.19 (1564) | 0.98 (9) | **1.00** (186) |

Approximation ratios of the deep-learning based MIS solvers on real-world graphs, unseen during training. We measure the approximation ratio using the optimal solutions from running the CPLEX solver without any time limit. Best approximation ratios are marked in bold. Running times (in seconds) are provided in brackets, and the number of vertices is denoted by $\lvert\mathcal{V}\rvert$.

dataset. The solid line and shaded regions represent the mean and standard deviation across 3 runs, respectively. Note that the standard deviation in (c) was enlarged ten times for better visibility.

| Type | min $\lvert\mathcal{V}\rvert$ | max $\lvert\mathcal{V}\rvert$ | MWIS: CPLEX | MWIS: LwD | PCMIS: CPLEX | PCMIS: LwD | Ising: CPLEX | Ising: LwD | Ising: CPLEX | Ising: LwD |
|---|---|---|---|---|---|---|---|---|---|---|
| ER | 50 | 100 | 21.46 | 21.46 (0.16) | 22.64 | 21.77 (0.01) | 104.1 | 148.2 (0.01) | 108.8 | 151.9 (0.01) |
| ER | 100 | 200 | 27.94 | 28.44 (0.39) | 27.12 | 29.56 (0.11) | -536.5 | 396.7 (0.01) | -479.7 | 406.8 (0.02) |
| ER | 400 | 500 | 34.29 | 39.21 (0.54) | 4.56 | 38.39 (0.42) | -13897 | 1646 (0.09) | -14180 | 1899 (0.12) |

### 3.3 Training with Proximal Policy Optimization

Our algorithm is based on actor-critic training with policy network and value network following the GraphSAGE architecture (hamilton2017inductive). Each network consists of multiple layers with where the -th layer with weights and performs the following transformation on input :

Here and correspond to adjacency and degree matrix of the graph , respectively. At the final layer, the policy and value networks apply softmax function and graph readout function with sum pooling (xu2018how) instead of ReLU

to generate actions and value estimates, respectively. We only consider the subgraph that is induced on the deferred vertices

as the input of the networks since the determined part of the graph no longer affects the future rewards of the MDP. We utilize vertex degrees and the current iteration-index of the MDP as the input features of the neural network. To train the agent, we use the proximal policy optimization (schulman2017proximal). To be specific, we train the networks to maximize the following objective:where , , and denotes the -th vertex-state vector, action vector, cardinality reward, and diversification reward, respectively. In addition, is the policy network with parameters from the previous iteration of updates, is the maximum number of steps for the MDP, and is the hyper-parameter to be tuned. The clipping function

projects the ratio of action probabilities

into an interval for updating the agent more conservatively. Note that the clipping is applied for each vertex, unlike the original framework where clipping is applied once (schulman2017proximal).

## 4 Experiments

In this section, we report experimental results on the proposed Learning what to Defer (LwD) framework described in Section 3 for solving the maximum independent set (MIS) problem. To this end, we evaluate our framework with and without the local search element described in Section 3.1, coined LwD† and LwD, respectively. We also include evaluations of our framework on other combinatorial problems to demonstrate its potential for being applied to a broader domain. We perform every experiment using a single GPU (NVIDIA RTX 2080Ti) and a single CPU (Intel Xeon E5-2630 v4).

Baselines. For comparison with the deep learning-based methods, we consider the deep reinforcement learning (DRL) framework by khalil2017learning, coined S2V-DQN, and the supervised learning (SL) framework by li2018combinatorial, coined TGS. We also consider a variant of the latter, coined TGS†, equipped with additional graph reduction and local search algorithms. We remark that other existing deep learning-based schemes for solving combinatorial optimization, e.g., the works of bello2016neural and kool2018attention, are not comparable since they propose neural architectures specialized to TSP-like problems.

We additionally consider two conventional MIS solvers as competitors. First, we consider the integer programming solver IBM ILOG CPLEX Optimization Studio V12.9.0 (ilog2014cplex), coined CPLEX. We also consider the MIS solver based on the recently developed techniques (lamm2015graph; lamm2017finding; hespe2019scalable), coined KaMIS, which won the PACE 2019 challenge at the vertex cover track (hespe2019wegotyoucovered). We provide specific details of the implementations in Appendix B.1.

We remark that comparisons among the algorithms should be made carefully by accounting for the scope of applications of each algorithm. In particular, KaMIS, TGS† and LwD† rely on heuristics specialized for the MIS problem, e.g., the local search algorithm, while CPLEX, S2V-DQN, and LwD can be applied even to non-MIS problems. The integer programming solver CPLEX can provide a proof of optimality in addition to its solution. Furthermore, TGS and TGS† are only applicable to problems where solvers are available for obtaining supervised solutions.

Datasets. Experiments were conducted on a broad range of graphs to include both real-world and large-scale graphs. First, we consider random graphs generated from models designed to imitate the characteristics of real-world graphs. Specifically, we consider the models proposed by Erdős-Rényi (ER, erdHos1960evolution), Barabási-Albert (BA, albert2002statistical), Holme and Kim (HK, holme2002growing), and Watts-Strogatz (WS, watts1998collective). For convenience of notation, the synthetic datasets are specified by their type of generative model and size, e.g., ER-$(a, b)$ denotes the set of ER graphs generated with the number of vertices uniformly sampled from the interval $(a, b)$.

We further consider real-world graph datasets, namely the SATLIB, PPI, REDDIT, as-Caida, Citation, Amazon, and Coauthor datasets constructed from the SATLIB benchmark (hoos2000satlib), protein-protein interactions (hamilton2017inductive), social networks (yanardag2015deep), road networks (leskovec2016snap), co-citation network (yang2016revisiting), co-purchasing network (mcauley2015image), and academic relationship network (sen2008collective), respectively. See Appendix B.3 for further information of the datasets.

### 4.1 Performance Evaluation

We now demonstrate the performance of our algorithm along with other baselines for solving the MIS problem.

Moderately sized graphs.
First, we provide the experimental results for the datasets of moderate size. (Note that khalil2017learning only reported results from training on graphs with a limited number of vertices.) In Table 1, we observe that LwD† consistently achieves the best objective except for the SATLIB dataset. Surprisingly, it even outperforms CPLEX and KaMIS on some datasets, e.g., LwD† achieves a strictly better objective than every other baseline on the ER-(400, 500) dataset. Next, LwD significantly outperforms its DL competitors, i.e., S2V-DQN and TGS, on all datasets except the REDDIT-M-12K dataset. The gap grows for the large-scale dataset, e.g., S2V-DQN underperforms significantly on the as-Caida dataset.

Million-scale graphs.
Next, we highlight the scalability of our framework by evaluating the algorithms on million-scale graphs. To this end, we generate synthetic graphs from the BA, HK, and WS models with half a million, one million, and two million vertices. (The ER model requires a computationally prohibitive amount of memory for the large-scale graph generation.) We use the networks trained on the BA, HK, and WS datasets for evaluation on graphs from the same generative model. We run CPLEX and KaMIS with a time limit set so that they spend more time than our methods to find their solutions. (However, the solvers consistently violate the time limit for the tested large-scale inputs due to their expensive pre-solving process.) Furthermore, we exclude S2V-DQN from the comparison since evaluating it on such large-scale graphs is computationally infeasible.

In Table 2, we observe that LwD and LwD† mostly outperform the other algorithms, i.e., they achieve a better objective in a shorter amount of time. In particular, LwD achieves a higher objective than KaMIS on the BA graph with two million vertices (915 887 versus 909 988) in 276 seconds instead of 3637 seconds, i.e., an approximately $13\times$ speedup. The only exception is the WS graph with half a million vertices, where LwD and LwD† achieve smaller objectives than KaMIS while still being much faster. Such a result highlights the scalability of our algorithm. It is somewhat surprising to observe that LwD achieves objectives similar to LwD†, while TGS underperforms significantly compared to TGS†. This observation validates that our method does not depend heavily on the existing heuristics for achieving high performance, while the previous method does.

Trade-offs between objective and time. We further investigate the trade-offs between objective and time for the considered algorithms. To this end, we evaluate the algorithms on the ER and SATLIB datasets with varying numbers of samples or time limits. In Figure 4, it is remarkable that both LwD and LwD† achieve a better objective than the CPLEX solver on both datasets under a reasonably limited time. Furthermore, LwD and LwD† outperform KaMIS when the running time is sufficiently short.

Generalization capability. Finally, we examine the potential of the deep learning-based solvers as generic solvers, i.e., whether the solvers generalize well to graph types unseen during training. To this end, we evaluate LwD, LwD†, TGS, TGS†, and S2V-DQN on the Citation, Amazon and Coauthor graph datasets. In Table 3, we observe that LwD achieves a near-optimal objective for all of the considered graphs, despite being trained on graphs of a different type and with a smaller number of vertices. On the other hand, S2V-DQN underperforms significantly compared to LwD, i.e., it achieves worse approximation ratios while being slower. We also observe that TGS achieves approximation ratios similar to those of LwD but takes a longer time in our experimental setups.

### 4.2 Ablation Study

We now ablate each component of our algorithm to validate its effectiveness. First, we confirm that stretching the determination process indeed improves the performance of LwD. Then, we show the effectiveness of the solution diversification reward.

Stretching of determination. We first show that “stretching” the determination with the deferred MDP indeed helps to solve the MIS problem. Specifically, we experiment with varying the maximum number of iterations in the MDP on the ER dataset. Figure 4(a) reports the corresponding training curves. We observe that the performance of LwD improves whenever the agent receives more time to generate the final solution, which verifies that the deferring of decisions plays a crucial role in solving the MIS problem.

Solution diversification reward. Next, we inspect the contribution of the solution diversity reward used in our algorithm. To this end, we trained agents with four options: (a) without any exploration bonus, coined Base, (b) with the conventional entropy bonus (williams1991function), coined Entropy, (c) with the proposed diversification bonus, coined Diverse, and (d) with both of the bonuses, coined EntropyDiverse. Figure 4(b) demonstrates the corresponding training curves for validation scores. We observe that the agent trained with the proposed diversification bonus outperforms other agents in terms of validation score, confirming the effectiveness of our proposed reward. One also observes that both methods can be combined to yield better performance, i.e., EntropyDiverse achieves the best performance.

Finally, we verify our claim that the maximum entropy regularization fails to capture the diversity of solutions effectively, while the proposed solution diversity reward term does. To this end, we compare the aforementioned agents with respect to the $\ell_1$-deviations between the coupled intermediate vertex-states $\mathbf{s}$ and $\bar{\mathbf{s}}$. We show the corresponding results in Figure 4(c). We observe that the entropy regularization promotes large deviation during the intermediate stages, but converges to solutions with smaller deviation. In contrast, agents trained with diversification rewards succeed in enlarging the deviation between the final solutions.

### 4.3 Other Combinatorial Problems

Now we evaluate our framework on other combinatorial problems, namely the maximum weighted independent set problem, coined MWIS, the prize collecting maximum independent set problem (hassin2006minimum), coined PCMIS, and the maximum-a-posteriori inference problem for the Ising models (onsager1944crystal), coined Ising. We compare our algorithm to CPLEX, which is also capable of solving the considered problems with its integer programming framework. See Appendix C for detailed description of the problems.

We report the results of the experiments conducted on the ER datasets in Table 4. Surprisingly, we observe that LwD outperforms CPLEX for most problems and graphs under the given time limit. In particular, CPLEX fails to produce reasonable solutions for the PCMIS and the Ising problems on the ER-(100, 200) and ER-(400, 500) graphs. This is because CPLEX solves the PCMIS and the Ising problems by integer quadratic programming (IQP), which is hard to solve. Such an issue does not arise in the MIS and MWIS problems, as CPLEX solves them by integer linear programming (ILP), which is easier to solve. The proposed LwD framework does not rely on such domain-specific knowledge of IQP vs. ILP, and is more robust across the various problems in our experiments.
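For reference, the ILP form mentioned above can be written out for the MIS problem using the textbook edge-constraint formulation. This is a standard formulation given here as an illustration; the exact models passed to CPLEX in the experiments are not shown in this section, and the vertex weights $w_i$ for the weighted variant are notation assumed for this example.

```latex
\begin{aligned}
\text{(MIS)} \qquad \max_{\mathbf{x} \in \{0,1\}^{\mathcal{V}}} \quad & \sum_{i \in \mathcal{V}} x_i
& \text{s.t.} \quad & x_i + x_j \le 1 \quad \forall (i, j) \in \mathcal{E}, \\
\text{(MWIS)} \qquad \max_{\mathbf{x} \in \{0,1\}^{\mathcal{V}}} \quad & \sum_{i \in \mathcal{V}} w_i x_i
& \text{s.t.} \quad & x_i + x_j \le 1 \quad \forall (i, j) \in \mathcal{E}.
\end{aligned}
```

Both objectives and all constraints are linear in $\mathbf{x}$, which is what places these two problems in the ILP class, whereas an objective containing products such as $x_i x_j$ leads to an IQP.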

## 5 Conclusion

In this paper, we propose a new reinforcement learning framework for the maximum independent set problem that is scalable to large graphs. Our main contribution is the framework of learning what to defer, which allows the agent to defer the decisions on vertices for efficient expression of complex structures in the solutions. Through extensive experiments, our algorithm shows performance that is both superior to the existing reinforcement learning baseline and competitive with the conventional solvers.

## References

## Appendix A Graphical Illustration of LwD

## Appendix B Experimental Details

### B.1 Implementation of LwD

In this section, we provide additional details for our implementation of the experiments.

Normalization of feature and reward. The iteration-index of the MDP, used as input to the policy and value networks, was normalized by the maximum number of iterations. Both the cardinality reward (defined in Section 3.1) and the solution diversification reward (defined in Section 3.2) were normalized by the maximum number of vertices in the corresponding dataset.
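The normalization described above can be sketched as follows. This is a minimal illustration rather than our actual implementation; the function names and arguments are hypothetical.

```python
def normalize_iteration(iteration, max_iterations):
    """Scale the MDP iteration index into [0, 1] before feeding it to the networks."""
    return iteration / max_iterations


def normalize_reward(raw_reward, max_num_vertices):
    """Scale a reward by the largest vertex count in the dataset."""
    return raw_reward / max_num_vertices
```

For example, with 32 iterations per episode, iteration 16 would be passed to the networks as 0.5.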

Hyper-parameter. Every hyper-parameter was optimized on a per-graph-type basis and used across all sizes within each graph type. Throughout every experiment, the policy and the value networks were parameterized by a graph convolutional network with layers and hidden dimensions. Every instance of the model was trained for 20 000 updates of proximal policy optimization (schulman2017proximal), based on the Adam optimizer with a learning rate of . The validation dataset was used for choosing the best performing model while using samples for evaluating the performance. The reward was not decayed throughout the episodes of the Markov decision process. Gradient norms were clipped by a value of . We further provide details specific to each type of dataset in Table 5. For the compared baselines, we used the default hyper-parameters provided in the respective codes.

| Parameters | ER, BA, HK, WS | SATLIB | PPI | as-Caida | |
|---|---|---|---|---|---|
| Maximum iterations per episode | 32 | 128 | 128 | 64 | 128 |
| Number of unrolling iteration | 32 | 128 | 128 | 64 | 128 |
| Number of environments per batch (graph instances) | 32 | 32 | 10 | 64 | 1 |
| Batch size for gradient step | 16 | 8 | 8 | 16 | 8 |
| Number of gradient steps per update | 4 | 8 | 8 | 16 | 8 |
| Solution diversity reward coefficient | 0.1 | 0.01 | 0.1 | 0.1 | 0.1 |
| Maximum entropy coefficient | 0.1 | 0.01 | 0.001 | 0.0 | 0.1 |

Table 5: Choice of hyperparameters for the experiments on performance evaluation. The REDDIT column indicates hyperparameters used for the REDDIT-B, REDDIT-M-5K, and REDDIT-M-12K datasets.

### B.2 Implementation of Baselines

S2V-DQN. We implement the S2V-DQN algorithm based on the code (written in C++) provided by the authors (https://github.com/Hanjun-Dai/graph_comb_opt). For the synthetic graphs generated from the ER, BA, HK, and WS models, training S2V-DQN is unstable on graphs of size and without pre-training. Hence, we perform fine-tuning as mentioned in the original paper (khalil2017learning). For instance, when we train S2V-DQN on the ER- datasets, we fine-tune the model trained on ER-. Next, for the ER-, we perform “curriculum learning”: we first train S2V-DQN on the ER- dataset, then fine-tune on the ER-, ER-, ER- and ER- in sequence. Finally, for graphs with size larger than , we were unable to train S2V-DQN on the raw graphs under the available computational budget. Hence, we train S2V-DQN on subgraphs sampled from the training graphs. To this end, we sample edges from the training graph uniformly at random without replacement, until the number of vertices reaches . Then we use the subgraph induced by the sampled vertices.
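The edge-based subgraph sampling above can be illustrated with the following minimal sketch. It is not the code used in the experiments: the function name, the `max_vertices` argument, and the choice to skip an edge that would exceed the vertex budget are assumptions made for illustration.

```python
import random


def sample_induced_subgraph(edges, max_vertices, seed=None):
    """Sample edges uniformly without replacement until the vertex budget is met,
    then return the sampled vertices and the subgraph they induce."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)  # visiting a shuffled list equals sampling without replacement

    vertices = set()
    for u, v in shuffled:
        if len(vertices) >= max_vertices:
            break
        # Assumption: skip an edge whose endpoints would overshoot the budget.
        if len(vertices | {u, v}) > max_vertices:
            continue
        vertices.update((u, v))

    # The induced subgraph keeps every original edge between sampled vertices.
    induced = [(u, v) for u, v in edges if u in vertices and v in vertices]
    return vertices, induced
```

Note that the induced subgraph may contain edges that were never sampled, since every edge between two sampled vertices is kept.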

TGS. We use the official implementation and models provided by the authors (https://github.com/intel-isl/NPHard). Unfortunately, the provided code runs out of memory for larger graphs since it keeps all of the intermediate solutions in the breadth-first search queue. For such cases, we modify the algorithm to discard the oldest graph in the queue whenever the queue reaches its maximum size, i.e., ten times the number of required solutions for the problem.
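One simple way to obtain this discard-the-oldest behavior is a bounded double-ended queue. The sketch below is an illustration only and is not taken from the TGS code base; `make_bounded_queue` and its arguments are hypothetical names.

```python
from collections import deque


def make_bounded_queue(num_required_solutions, factor=10):
    """Return a FIFO queue capped at factor * num_required_solutions entries.

    A deque with maxlen silently drops the leftmost (oldest) entry
    when a new item is appended to a full queue.
    """
    return deque(maxlen=factor * num_required_solutions)


queue = make_bounded_queue(num_required_solutions=2)  # capacity of 20
for partial_solution_id in range(25):
    queue.append(partial_solution_id)  # the five oldest entries are discarded
```

The breadth-first search would still consume entries from the left with `queue.popleft()`, so only the memory bound changes.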

CPLEX. We use CPLEX (ilog2014cplex) provided on the official homepage (https://www.ibm.com/products/ilog-cplex-optimization-studio). In order to optimize its performance under limited time, we set its emphasis parameter, i.e., MIPEmphasisFeasibility, to prefer higher objective over proof of optimality.

KaMIS. We use KaMIS (hespe2019wegotyoucovered) from its official homepage (http://algo2.iti.kit.edu/kamis/) without modification.

### B.3 Dataset Details

| Dataset | Number of nodes | Number of edges | Number of graphs |
|---|---|---|---|
| SATLIB | (1209, 1347) | (4696, 6065) | (38 000, 1000, 1000) |
| PPI | (591, 3480) | (3854, 53 377) | (20, 2, 2) |
| REDDIT (BINARY) | (6, 3782) | (4, 4071) | (1600, 200, 200) |
| REDDIT (MULTI-5K) | (22, 3648) | (21, 4783) | (4001, 499, 499) |
| REDDIT (MULTI-12K) | (2, 3782) | (1, 5171) | (9545, 1192, 1192) |
| as-Caida | (8020, 26 475) | (36 406, 106 762) | (108, 12, 12) |

| Dataset | Number of nodes | Number of edges |
|---|---|---|
| Citeseer | 3327 | 3668 |
| Cora | 2708 | 5069 |
| Pubmed | 19 717 | 44 324 |
| Coauthor CS | 18 333 | 81 894 |
| Coauthor Physics | 34 493 | 247 962 |
| Amazon Computers | 13 381 | 245 778 |
| Amazon Photo | 7487 | 119 043 |

In this section, we provide additional details on the datasets used for the experiments.

Synthetic datasets. For the ER, BA, HK, and WS datasets, we train on graphs randomly generated on the fly and perform validation and evaluation on a fixed set of graphs.

SATLIB dataset. The SATLIB dataset is a popular benchmark for evaluating SAT algorithms. We specifically use the synthetic problem instances from the category of random 3-SAT instances with controlled backbone size (singer2000backbone). Next, we describe the procedure for reducing the SAT instances to MIS instances. First, a vertex is added to the graph for each literal of the SAT instance. Then, an edge is added for each pair of vertices that (a) are in the same clause or (b) correspond to literals of the same variable with different signs. Consequently, the MIS in the resulting graph corresponds to an optimal assignment of the SAT problem (dasgupta2008algorithms).
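The reduction can be illustrated with the minimal sketch below. It assumes DIMACS-style clauses (a positive integer for a variable, a negative integer for its negation) and one vertex per literal occurrence in each clause; the function name and the data layout are hypothetical and not taken from our code.

```python
from itertools import combinations


def sat_to_mis_graph(clauses):
    """Build the MIS graph of a CNF formula.

    Returns (vertices, edges), where vertices[i] = (clause_index, literal)
    and edges is a set of index pairs (i, j) with i < j.
    """
    vertices = [(c, lit) for c, clause in enumerate(clauses) for lit in clause]
    edges = set()
    for i, j in combinations(range(len(vertices)), 2):
        (ci, li), (cj, lj) = vertices[i], vertices[j]
        same_clause = ci == cj      # condition (a)
        opposite_sign = li == -lj   # condition (b)
        if same_clause or opposite_sign:
            edges.add((i, j))
    return vertices, edges


# (x1 OR x2) AND (NOT x1 OR x2)
vertices, edges = sat_to_mis_graph([[1, 2], [-1, 2]])
```

In this small example, the two vertices for `x2` form an independent set whose size equals the number of clauses, which corresponds to the satisfying assignment with `x2` set to true. The pairwise loop is quadratic in the number of literals, which is a simplification chosen for readability.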

PPI dataset. The PPI dataset is the protein-protein-interaction dataset with vertices representing proteins and edges representing interactions between them.

REDDIT datasets. The REDDIT-B, REDDIT-M-5K, and REDDIT-M-12K datasets are constructed from online discussion threads on Reddit (https://www.reddit.com/), where vertices represent users and an edge indicates that at least one of the two users responded to the other user’s comment.

Autonomous system dataset. The as-Caida dataset is a set of autonomous system graphs derived from a set of RouteViews BGP table snapshots (leskovec2005graphs).

Citation dataset. Cora and Citeseer are networks whose vertices and edges represent documents and the citation links between them, respectively (sen2008collective).

Amazon dataset. The Computers and Photo graphs are segmented from the Amazon co-purchase graph (mcauley2015image), where vertices correspond to goods and edges represent goods which are frequently purchased together.

Coauthor dataset. The CS and Physics graphs represent authors and the corresponding co-authorships by vertices and edges, respectively. They were collected from the Microsoft Academic Graph for the KDD Cup 2016 challenge (https://kddcup2016.azurewebsites.net/).

## Appendix C Details of Other Combinatorial Problems

### C.1 Maximum Weighted Independent Set Problem

First, we describe the maximum weighted independent set (MWIS) problem (balas1986finding). Consider a graph associated with a positive weight function . The goal of the MWIS problem is to find the independent set whose total weight is maximum. In order to apply the LwD framework to the MWIS problem, we include the weight of each vertex as its feature to the policy network and modify the reward function to be the increase in the weight of included vertices, i.e., . We sample the weight of each vertex from a normal distribution with mean and standard deviation fixed to and , respectively.

### C.2 Prize Collecting Maximum Independent Set Problem

Next, we introduce the prize collecting maximum independent set (PCMIS) problem, which is an instance of the generalized minimum vertex cover problem (hassin2006minimum). To this end, consider a graph and a subset of vertices . Then the PCMIS problem is associated with the following “prize” function to maximize:

where is the penalty function for including two adjacent vertices. We set in the experiments. Such a problem could be interpreted as relaxing the hard constraints on the independent set in the MIS problem to a penalty function. In particular, one can check that the optimal solution of the PCMIS problem becomes the maximum independent set when . To apply the LwD framework to the PCMIS problem, we remove the clean-up phase in the transition function of the MDP and modify the reward function to be the increase in the prize function at each iteration, expressed as follows:
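As a rough illustration of the prize and the per-iteration reward described in this section, the sketch below assumes that the prize equals the number of included vertices minus a fixed penalty for every edge whose two endpoints are both included. This functional form, the function names, and the penalty value in the usage note are assumptions made for illustration and are not a restatement of the formulas above.

```python
def pcmis_prize(included, edges, penalty):
    """Assumed prize: |S| minus `penalty` per edge with both endpoints in S."""
    violations = sum(1 for u, v in edges if u in included and v in included)
    return len(included) - penalty * violations


def pcmis_reward(prev_included, new_included, edges, penalty):
    """Reward of one iteration: the increase in prize between consecutive solutions."""
    return pcmis_prize(new_included, edges, penalty) - pcmis_prize(
        prev_included, edges, penalty
    )
```

For instance, on a triangle with edges `[(0, 1), (1, 2), (0, 2)]` and a hypothetical penalty of 0.5, growing the included set from `{0}` to `{0, 1}` gains one vertex and one violated edge, for a reward of 0.5 under this assumed prize.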

### C.3 Maximum-a-posteriori Inference Problem on the Ising Model

Finally, we describe the maximum-a-posteriori (MAP) inference problem on the anti-ferromagnetic Ising model (onsager1944crystal). Given a graph , the probability distribution of the Ising model is described as , where is the normalization constant and the function is the objective to maximize, defined as follows:

Here, and correspond to the interaction and magnetic field parameters, respectively. Furthermore, is an indicator function for set of vertices and for . In order to solve the MAP inference problem on the Ising model, we remove the clean-up phase in the transition function as in the PCMIS problem and modify the reward function to be the increase in the objective function at each iteration, expressed as follows:

