Solving NP-Hard Problems on Graphs by Reinforcement Learning without Domain Knowledge

05/28/2019 ∙ by Kenshin Abe, et al. ∙ The University of Tokyo

We propose an algorithm based on reinforcement learning for solving NP-hard problems on graphs. We combine Graph Isomorphism Networks and the Monte-Carlo Tree Search, which was originally used for game searches, for solving combinatorial optimization on graphs. Similarly to AlphaGo Zero, our method does not require any problem-specific knowledge or labeled datasets (exact solutions), which are difficult to calculate in principle. We show that our method, which is trained by generated random graphs, successfully finds near-optimal solutions for the Maximum Independent Set problem on citation networks. Experiments illustrate that the performance of our method is comparable to SOTA solvers, but we do not require any problem-specific reduction rules, which is highly desirable in practice since collecting hand-crafted reduction rules is costly and not adaptive for a wide range of problems.




1 Introduction

NP-hard problems arise in many real-world optimization tasks. Solving them exactly in a realistic time is believed to be impossible in general. Nevertheless, because many NP-hard problems are highly relevant in practice, several exact algorithms, heuristics, and local search approaches, such as simulated annealing and evolutionary computation, have been developed to provide optimal or near-optimal solutions to these problems [1; 2; 3; 4; 5; 6].

Recently, machine learning has been applied to solve NP-hard problems. Li et al. [7] used supervised learning to train a Graph Convolutional Network (GCN) for solving the Maximum Independent Set (MIS) problem and showed that it performed better than heuristic solvers. (An independent set of a graph is a set of nodes such that no two nodes in the set are connected; finding an independent set of maximum size is NP-hard.) This result suggests that machine learning is useful for automatically extracting features for a problem of interest, a task that previously required experts to manually design and implement hand-crafted reduction rules for searching. One bottleneck of Li et al.’s method is that exact solutions for NP-hard problems must be prepared in advance for supervised learning. Since preparing labeled datasets for NP-hard problems is difficult in principle, in this paper we address this problem by adopting a reinforcement learning method to solve NP-hard problems.

The biggest advantage of our algorithm is that it does not require any domain knowledge. Most existing algorithms, including Li et al.’s supervised method and Akiba and Iwata’s exact algorithm for solving the Minimum Vertex Cover problem (which is equivalent to MIS), have implemented many reduction rules to make the search efficient (some of the sophisticated reduction rules are listed in Appendix A). (A vertex cover of a graph is a set of nodes such that for any edge $(u, v)$, at least one of $u$ and $v$ is contained in the set. Minimum Vertex Cover is also NP-hard, and the complement of a maximum independent set is a minimum vertex cover.) Most such reduction rules are highly nontrivial and require a deep understanding of graph theory. Since a variety of research has been conducted on well-known NP-hard problems, including MIS and the Traveling Salesperson Problem, there are indeed some practical fast algorithms and heuristics [3; 7; 8]. However, when we face a new NP-hard problem that has not been widely studied, it can be difficult to derive efficient reduction rules theoretically. Khalil et al. created a framework called S2V-DQN, combining Deep Q-Networks and a graph embedding network called structure2vec, to provide a solution to this issue. Although they could outperform some of the heuristic solvers, their results were not comparable to those of Li et al.’s supervised learning. In this paper, we propose a method to solve NP-hard problems without human knowledge that has much higher performance than S2V-DQN and is even comparable to algorithms with complex reduction rules. As explained in Section 3.1, the proposed algorithm can be applied to any combinatorial optimization problem as long as it can be reduced to a Markov Decision Process whose state is a graph and whose action is to choose a vertex.

In our algorithm, Monte-Carlo Tree Search (MCTS) enables the networks to learn from self-exploration. In Section 3.2, we explain how we modify the ordinary MCTS used in the game search for combinatorial optimization.

2 Preliminary

In this section, we introduce two main ingredients that play essential roles in our algorithm. The first is Graph Neural Networks (GNNs), including Graph Isomorphism Networks (GINs) [10], and the second is Monte-Carlo Tree Search (MCTS) [11; 12], which was also used in AlphaGo Zero [13].

2.1 Graph Neural Network

Graph neural networks are a neural network framework on graphs. They recursively aggregate neighboring feature vectors to obtain node embeddings that capture the structural information of a graph. There have been many studies on aggregation schemes [14; 15; 16], and GCNs [17] are the de facto standard, achieving good results in many fields.

We adopted GCNs in our preliminary studies, but better results were obtained by using GINs [10]. For this reason, we focus on GINs in this paper. Let $A \in \mathbb{R}^{N \times N}$ be the adjacency matrix of a graph of $N$ vertices and $\tilde{A} = A + I_N$ be the adjacency matrix with self-connections. We applied the aggregation rule of GIN-0 with $C^{(l)}$ features in the $l$-th layer:

$$H^{(l+1)} = \mathrm{MLP}^{(l+1)}\left(\tilde{A} H^{(l)}\right),$$

where $H^{(l)} \in \mathbb{R}^{N \times C^{(l)}}$ is the feature matrix of the $l$-th layer and $\mathrm{MLP}^{(l)}$ refers to the multi-layer perceptron in the $l$-th layer. GIN was shown to be one of the most powerful GNN architectures from the perspective of discriminative power [10] and is known to be capable of discriminating different graphs that GCN cannot. Since nodes are not distinguished from each other, we give the feature matrix of all ones as an input. This characteristic of GINs makes them suitable for our problem setting, where the information of each node is not abundant and structural information is important.
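The GIN-0 aggregation described above can be illustrated with a minimal pure-Python sketch. It assumes a two-layer ReLU MLP without biases and random weights; the function names `matmul` and `gin0_layer` are hypothetical and are not taken from the authors' implementation.

```python
import random

def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gin0_layer(adj, feats, w1, w2):
    """One GIN-0 layer: MLP((A + I) H), with an assumed bias-free two-layer ReLU MLP."""
    n = len(adj)
    # Adjacency matrix with self-connections.
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    agg = matmul(a_tilde, feats)  # sum of each node's own and neighboring features
    hidden = [[max(x, 0.0) for x in row] for row in matmul(agg, w1)]
    return matmul(hidden, w2)

rng = random.Random(0)
# Path graph 0 - 1 - 2; nodes 0 and 2 are structurally symmetric.
adj = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
feats = [[1.0], [1.0], [1.0]]  # all-ones input features, as in the text
w1 = [[rng.gauss(0, 1) for _ in range(4)]]
w2 = [[rng.gauss(0, 1) for _ in range(2)] for _ in range(4)]
out = gin0_layer(adj, feats, w1, w2)
```

Because the input features are all ones, the two end nodes of the path receive identical embeddings, which matches the remark that symmetric nodes are not distinguished.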

2.2 AlphaGo Zero

AlphaGo Zero is a well-known superhuman engine designed for the game of Go [13]. AlphaGo Zero defeated its predecessor AlphaGo [18] with a 100-0 score without human knowledge, i.e., without the records of games played by professional players.

It trains a deep neural network $f_\theta$ with parameter $\theta$ by reinforcement learning. Given a state (game board) $s$, the network outputs $(\mathbf{p}, v) = f_\theta(s)$, where $\mathbf{p}$ is the probability vector of each move and $v$ is a scalar denoting the value of the state. If $v$ is close to $1$, the player who takes a corresponding action from state $s$ is very likely to win.

MCTS Nodes and Edges

AlphaGo Zero uses MCTS to obtain $\boldsymbol{\pi}$, which is a much better probability vector of node selection than $\mathbf{p}$. To explain the algorithm, we introduce the structure of the MCTS search tree. The search tree is a rooted tree, where each node corresponds to a state and the root is the initial state. Each edge $(s, a)$ denotes the action $a$ taken at state $s$. Each edge stores an action-value $Q(s, a)$ and a visit-count $N(s, a)$.

MCTS and Training

The algorithm runs as follows. Initially, no node in the search tree is expanded; a node is said to be expanded once its $(\mathbf{p}, v)$ has been evaluated by the network.

  1. (Self-Play)

    1. (Simulation)

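      1. (Selection) From the root, select the best action iteratively until reaching an unexpanded node. The best action is an action $a$ whose

        $$Q(s, a) + U(s, a) \tag{1}$$

        is maximum, where $U(s, a)$ is the upper confidence bound of the action-value. $U(s, a)$ used in AlphaGo Zero is determined by a variant of the PUCT algorithm [19]:

        $$U(s, a) = c_{\mathrm{puct}} \, P(s, a) \, \frac{\sqrt{\sum_{b} N(s, b)}}{1 + N(s, a)} \tag{2}$$

        where $P(s, a)$ is the element of $\mathbf{p}$ corresponding to action $a$ and $c_{\mathrm{puct}}$ is a nonnegative constant to be determined by experiments.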

      2. (Expansion) When hitting an unexpanded node $s$, evaluate it with the network and obtain $(\mathbf{p}, v) = f_\theta(s)$.

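      3. (Backpropagation) After expanding a new node, each edge $(s, a)$ used in the simulation is traversed and its $Q(s, a)$ and $N(s, a)$ are updated. We increment $N(s, a)$ and update $Q(s, a)$ with the mean of $v$ among all leaf nodes in the subtree of $(s, a)$. Formally, $Q(s, a) = \frac{1}{N(s, a)} \sum_{s' \mid s, a \to s'} v(s')$, where $\sum_{s' \mid s, a \to s'}$ denotes the summation over every leaf $s'$ that can be eventually reached from $s$ after taking action $a$.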

      Iterate this process for enough number of times.

    2. (Play) Select an action and proceed to the next state (game board). Calculate an enhanced probability vector of node selection $\boldsymbol{\pi}$ and select a node according to $\boldsymbol{\pi}$. $\boldsymbol{\pi}$ is calculated as $\pi_a = N(s, a)^{1/\tau} / \sum_b N(s, b)^{1/\tau}$, where $\tau$ is the temperature. Initially, we use a large $\tau$ and reduce it toward zero as learning proceeds. When $\tau$ is large, the selection is approximately uniformly random, and when $\tau \to 0$, it selects the node with the maximum visit-count.

    Iterate Simulation and Play until the end of the game. We record $(s_t, \boldsymbol{\pi}_t)$ in each Play at step $t$. Once the game ends, we know the winner and the loser. Let $z_t = 1$ if the player who made the action at $s_t$ wins the game, and $z_t = -1$ otherwise.

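  2. (Training) We select minibatches from $\{(s_t, \boldsymbol{\pi}_t, z_t)\}$ and minimize the loss

    $$\mathcal{L} = (z - v)^2 - \boldsymbol{\pi}^{\top} \log \mathbf{p} + c \lVert \theta \rVert^2 \tag{3}$$

    where $c$ is a nonnegative constant for regularization. The network optimizes the probability vector of node selection and the value of the state at the same time.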
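As a minimal sketch of the Selection and Play steps above, the following code picks the action maximizing $Q(s,a) + U(s,a)$ with a PUCT-style bound and turns visit counts into a temperature-controlled policy. The dictionary-based data layout, the function names, and the default `c_puct=1.0` are assumptions made for illustration only.

```python
import math

def puct_select(q, n, p, c_puct=1.0):
    """Return the action maximizing Q(s,a) + U(s,a) for one state.

    q, n, p map each action to its action-value, visit count and prior.
    """
    total = sum(n.values())

    def score(a):
        u = c_puct * p[a] * math.sqrt(total) / (1 + n[a])
        return q[a] + u

    return max(q, key=score)

def visit_count_policy(n, tau):
    """Enhanced policy: pi_a proportional to N(s,a)^(1/tau)."""
    weights = {a: count ** (1.0 / tau) for a, count in n.items()}
    norm = sum(weights.values())
    return {a: w / norm for a, w in weights.items()}

q = {"a": 0.5, "b": 0.4}
n = {"a": 10, "b": 0}
p = {"a": 0.5, "b": 0.5}
explore = puct_select(q, n, p, c_puct=1.0)  # the unvisited action gets a large bonus
exploit = puct_select(q, n, p, c_puct=0.0)  # pure greedy choice on Q
sharp = visit_count_policy({"a": 3, "b": 1}, tau=0.1)
flat = visit_count_policy({"a": 3, "b": 1}, tau=100.0)
```

With a small temperature the policy concentrates on the most visited action, and with a large temperature it approaches the uniform distribution, as stated in the Play step.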

3 Method

In this section, we give a detailed explanation of our algorithm to solve the Maximum Independent Set (MIS) problem. First, we formalize the problem setting; we define the problem as a Markov Decision Process (MDP) so that reinforcement learning can be applied. Note that our algorithm can be applied not only to MIS but also to any problem that can be reduced to an MDP of the same form. Then we extend MCTS to our problem settings. We divide the simulation in MCTS into three processes, selection, expansion, and backpropagation.

3.1 Reduction to MDP

Let us consider approximately solving MIS in the following way.

  1. Given a graph, find the probability for each vertex to be included in a maximum independent set.

  2. Choose a vertex according to the probabilities obtained from Step 1.

  3. Erase the chosen vertex and its neighbors from the graph.

  4. If the graph is not empty, go back to Step 1.

Clearly, the set of the nodes selected in Step 2 forms a maximal (not necessarily maximum) independent set. Our aim here is to train a network to estimate a good policy in Step 1. We see that the whole process is an MDP whose states are graphs and whose action is node selection. In MIS, by setting the reward as $r(s, a) = 1$ for all $s$ and $a$ and the discount rate as $\gamma = 1$, the return becomes the number of selected nodes, i.e., the size of the maximal independent set. Note that we do not necessarily set the reward and the discount rate in this manner when applying this algorithm to other problems (it depends on what one wants to maximize in that problem).
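The four steps above can be sketched with a uniformly random policy standing in for the learned one. The adjacency-dictionary representation and the names `random_rollout` and `is_maximal_independent` are assumptions for illustration; the return of an episode is simply the number of chosen vertices.

```python
import random

def random_rollout(adj, rng):
    """Pick a vertex, erase it together with its neighbors, repeat until empty."""
    remaining = set(adj)
    chosen = []
    while remaining:
        v = rng.choice(sorted(remaining))  # Step 2 with a uniform policy
        chosen.append(v)                   # reward 1 for each selected vertex
        remaining -= {v} | adj[v]          # Step 3
    return chosen

def is_maximal_independent(adj, chosen):
    """Check independence and maximality of the chosen vertex set."""
    s = set(chosen)
    independent = all(adj[v].isdisjoint(s) for v in s)
    maximal = all(v in s or adj[v] & s for v in adj)
    return independent and maximal

# 5-cycle: every maximal independent set has exactly two vertices.
cycle = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}
solution = random_rollout(cycle, random.Random(0))
episode_return = len(solution)
```

On other graphs the return of a random rollout varies from run to run, which is why a learned policy for Step 1 matters.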

Now the problem is reduced to an MDP and our goal is to maximize the return. We can apply various algorithms to solve the MDP, including Policy Gradient [20], Q-learning [21], and many of their variants [22; 23; 24; 25; 26; 27; 28]. In our preliminary experiments, we tried REINFORCE [29] and Deep Q-Networks [30], but they did not work as well as MCTS. See Section 3.4 for details.

Since we have reduced MIS to MDP, any problems that can be reduced to the same MDP can be solved with the same approach. Such problems include the Maximum Clique, the Minimum Vertex Cover and the Traveling Salesperson Problem, which are also NP-hard [31].

3.2 Extension of MCTS to Combinatorial Optimization

Here, we extend MCTS to combinatorial optimization problems.

The Difference between Game Search and Combinatorial Optimization

We exploit MCTS together with reinforcement learning, as used in AlphaGo Zero. However, we cannot directly apply the same approach to our problem setting. In most MCTS variants used in game search, the value of a state is determined according to “how likely it is the player is going to win” [18; 13; 32]. Commonly, the value is within $[-1, 1]$, and the larger the value is, the more likely the player is to win. The aim of the players is only to win the game, and thus they are not interested in the cumulative value, i.e., how well they have been playing so far. More precisely, in a game, if a player reaches a state whose value is $-1$, that player will never win the game. On the other hand, in combinatorial optimization, say MIS, there are many common states between an optimal trajectory and a poor trajectory.

Another difference is that the size of the input is always the same in AlphaGo Zero. However, in our case, the size of the input depends on the size of the graph. To resolve this issue, we use GNNs to encode the graphs into the variable-sized feature vectors.

Ideas for Extended MCTS

One could evaluate the value of a state $s$ in such a way that $v(s)$ is the size of the maximum independent set of $s$. However, since this evaluation is not normalized, there would be a difference of scale in (1) and (3). For example, in (3), the first term could be very large when $z$ and $v$ are large. We mitigate this issue as follows.

Given a state (graph) $s$, the network outputs $(\mathbf{p}, \mathbf{v}) = f_\theta(s)$, where $\mathbf{p}$ is the probability vector of node selection and $\mathbf{v}$ is a vector with one element per node whose $a$-th element is the normalized return from state $s$ if we choose vertex $a$ for the next action. We elaborate on the definition of the normalized return in Section 3.3, but intuitively, the normalized return means “how good that return is compared to the return obtained by random actions”.

By virtue of this normalization, we can calculate the loss in the same way as AlphaGo Zero as in Eq. (3). Moreover, it frees the algorithm from problem specification, that is, when we are to maximize some criterion, we always evaluate the action by “how good it is compared to random actions”.

MCTS Nodes and Edges

Now we are ready to state the algorithm. First, we describe the structure of the search tree used in our MCTS. Each node of the tree represents a graph $s$ and each edge $(s, a)$ represents the action $a$ taken from state $s$. Each edge stores $Q(s, a)$ and $N(s, a)$. Each node stores $\mathrm{mean}(s)$ and $\mathrm{std}(s)$, the mean and the standard deviation of the return from state $s$ if played randomly. We estimate $\mathrm{mean}(s)$ and $\mathrm{std}(s)$ by sampling when expanding the node.

3.3 Algorithms

Unlike in AlphaGo Zero, in order to calculate the exact reward in every state in a trajectory, the expansion phase continues until it reaches the terminal state, i.e., a graph with no vertices.


Selection

In a simulation of MCTS, we choose the action in exactly the same way as AlphaGo Zero. We select the action maximizing $Q(s, a) + U(s, a)$, where the upper confidence bound $U(s, a)$ is given by (2).


Expansion

When reaching a node $s$, we evaluate $(\mathbf{p}, \mathbf{v}) = f_\theta(s)$ by the network. Then we initialize $Q(s, a)$ with the estimated normalized action-value $v_a$ for each action $a$. Note that $\mathbf{v}$ is a vector and $v_a$ is the element of $\mathbf{v}$ corresponding to taking the action $a$. We also estimate $\mathrm{mean}(s)$ and $\mathrm{std}(s)$ by using random rollouts. Note that $(\mathbf{p}, \mathbf{v})$ changes as the network learns, while $\mathrm{mean}(s)$ and $\mathrm{std}(s)$ only depend on the state $s$. Therefore, we can save them in a hashmap for reuse when we reach an unexpanded node whose graph has already been seen.


Backpropagation

As in AlphaGo Zero, $N(s, a)$ is incremented by one. The difference from AlphaGo Zero is the update rule of the action-value estimator. $Q(s, a)$ is set to the normalized mean of all exact rewards from the current state:

$$Q(s, a) = \frac{1}{N(s, a)} \sum_{i} \frac{r_i - \mathrm{mean}(s)}{\mathrm{std}(s)},$$

where $r_i$ is the exact reward obtained by the $i$-th rollout from state $s$. In implementation, this is calculated by incrementing the reward by one, starting from $0$, each time we go up to the parent state.
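The backward pass can be sketched as follows, under the assumption that each tree node keeps per-action visit counts and a running sum of normalized returns. The dictionary layout, the helper `make_node`, and the `eps` term are hypothetical; in particular, how the network's initial estimate is blended with the empirical mean is not specified here, and this sketch simply overwrites it.

```python
def make_node(actions, mean, std):
    """Hypothetical tree node with per-action statistics."""
    return {
        "mean": mean,
        "std": std,
        "N": {a: 0 for a in actions},
        "W": {a: 0.0 for a in actions},  # running sum of normalized returns
        "Q": {a: 0.0 for a in actions},
    }

def backpropagate(path, eps=1e-8):
    """Update the edges on `path`, a list of (node, action) pairs from root to terminal."""
    ret = 0
    for node, action in reversed(path):
        ret += 1  # one more vertex was selected from this state onward
        z = (ret - node["mean"]) / (node["std"] + eps)
        node["N"][action] += 1
        node["W"][action] += z
        node["Q"][action] = node["W"][action] / node["N"][action]

root = make_node(["a"], mean=1.5, std=0.5)
child = make_node(["b"], mean=1.0, std=1.0)
backpropagate([(root, "a"), (child, "b")])
```

Here the child sees a return of 1, which equals its random-play mean, while the root sees a return of 2, one standard deviation above its mean.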


Play

In training, each play is selected in exactly the same way as in AlphaGo Zero, according to the enhanced policy $\boldsymbol{\pi}$. However, we do not use the same approach to obtain the answer at inference time because it is very time-consuming. Note that our goal here is not to obtain the best action in a state but to find the best sequence of actions given an input graph. For this reason, we repeat rolling out until the end. In each rollout, we iteratively select the best node that maximizes the UCB $Q(s, a) + U(s, a)$.


The pseudocode of our algorithm is provided in Appendix C.

3.4 Other Approaches to Solving MDP

Here, we discuss the approaches other than MCTS that we examined for solving the MDP.


Deep Q-Network

We built a deep Q-network that estimates the action-value given a state and an action. We used experience replay and updated the parameter by minimizing the TD-error $\left(r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right)^2$. This worked for some very small graphs. However, the learning was so volatile that once the network accidentally output a large $Q$, the whole parameter set got contaminated and the action-values diverged.

Policy Gradient

We also attempted to solve the MDP by Policy Gradient. We used GIN as the policy network and used REINFORCE for learning. We found that the learning procedure was highly unstable, as is often seen in Policy Gradient: once the policy was polluted, it took many epochs to recover. Although we found that it outperformed DQN empirically, the MCTS-based approach proposed in this paper is significantly more stable and achieves superior performance.

4 Experiments

In this section, we present the experimental results on solving the MIS problem.

The experiments consist of three parts. In the first part, we visualize how the networks were trained by MCTS. In the second part, we compared our algorithm with S2V-DQN, Li et al.’s supervised method and Akiba and Iwata’s exact algorithm. Lastly, we show the normalized action-value vector for some graphs and verify that the networks could automatically detect some structural information.

4.1 Global Setup

GIN Model

We tuned hyperparameters and used GIN with

layers. Each MLP has two layers with hidden units. Dropout ratio was set to .


In the MCTS, we have five hyperparameters: the weight of the UCB used in (2), the temperature when calculating the enhanced policy , the number of rollouts to find and when expanding a new node, the number of simulations and L2 regularization constant . We set , , for each epoch in epochs learning, for epochs learning. Also, we set and where is the number of nodes in the graph.

Training and Environment

We trained 10 models of different initial parameters for epochs using Reedbush-H computing system (CPU: E5-2695v4 2.1GHz 18core, GPU: NVIDIA Tesla P100 SXM2 x2) and each training took 9 hours. In each epoch, a random graph of nodes and edges was generated and used for training.

Note that the time complexity for training on a graph is where is the number of edges and is the size of the solution, if we ignore the sizes of feature matrices (which are usually much smaller than the size of the graph).

Unless mentioned otherwise, we used these ten trained models throughout the experiments.

Test Configuration

Through all experiments, when testing a graph, we follow the policy mentioned in 3.3. In Section 4.3, we do this in parallel by networks and take the maximum return as the answer.

4.2 Visualization of Training

We visualized how the size of the solution found changed according to the epochs.

In each epoch, we used fixed random graphs of nodes and edges for testing. Note that the size of a maximum independent set for each graph may differ.

Figure 1 shows that the algorithm achieved stable performance. Each colored curve corresponds to a test graph and indicates the improvement of the mean of the solution sizes. See Figure 3 in Appendix D for the results for different initial parameters.

Figure 1: The performance of the network. Shaded region denotes the error bar, i.e., standard deviation.

4.3 Comparison to other Algorithms

We compared our algorithm with machine learning methods and Akiba and Iwata’s Branch-and-Reduce algorithm. We first provide simple explanations for these algorithms.

Machine Learning Method

Khalil et al. proposed S2V-DQN, a framework for combinatorial optimization problems that combines a deep Q-network and a graph embedding network called structure2vec [9]. Li et al. presented an approach to solving the Maximum Independent Set problem that combines machine learning techniques and classical heuristics, which require many reduction rules (see Appendix A). They trained GCNs to predict the likelihood of each node being included in the optimal solution using a training dataset $\{(G_i, l_i)\}$, where $G_i$ is a graph and $l_i$ is the label of an optimal solution. Their experimental results demonstrated that it performed as well as the SOTA heuristic solvers for NP-hard problems.

Exact Algorithm

For some NP-hard problems, despite their theoretical hardness, there are practical fast exact algorithms. The Branch-and-Reduce algorithm of Akiba and Iwata [8] solves Minimum Vertex Cover (and, equivalently, Maximum Independent Set) and works even for social graphs with millions of nodes. However, this algorithm does not work for all graphs and can easily get stuck on a dense graph of only hundreds of nodes. It cannot be applied to random graphs of over one thousand nodes either. In Appendix B, we introduce this Branch-and-Reduce algorithm in greater detail and explain why it fails in these corner cases.


We tested the algorithms on random graphs of different sizes generated based on the rules above and on citation networks (Sen et al., 2008). Citation networks provide real-world directed sparse graphs with labeled nodes. Here, we ignored the class labels and treated each network as an undirected graph. The number of nodes and edges for each graph is listed in Table 1.

Graph      Vertices   Edges
Cora       2708       5429
Citeseer   3327       4732
PubMed     19717      44335

Table 1: Size of citation networks.


We selected the best solution from trained networks for each graph in our method. For each algorithm, computation is stopped in minutes and output the best solution found at the time.


Table 2 shows the answer found on each of the random graphs of sizes 10, 100, 500, and 1000 (see Table 4 in Appendix D for details). Our MCTS algorithm successfully found the optimal solution for small graphs.

On citation networks, our method obtained much better solutions than S2V-DQN on all three graphs, as shown in Table 3. The solutions for Cora and Citeseer found by our algorithm were optimal. Again, note that our algorithm does not use any of the reduction rules that were used in both Li et al.’s supervised method and the Branch-and-Reduce algorithm. When a graph is highly sparse, many reduction rules, such as “always choose a node with a degree of less than 2 if it exists”, can be applied. Therefore, PubMed, the sparsest one, is easier for both the supervised method and the Branch-and-Reduce algorithm.

Another remark is that the training was conducted with generated random graphs with nodes and edges as explained in Section 4.1. Table 2 implies that the size of the graphs for training can be a limit of accuracy. See Table 5 in Appendix D for the results of experiments in which the network was trained with different sizes of graphs.

Vertices, Edges   MCTS   Supervised   Branch-and-Reduce
10, 25            5      5            5
100, 250          44     44           44
500, 1250         212    220          220
1000, 2500        422    439          439

Table 2: Comparison on random graphs. Each value refers to the answer found in minutes by each algorithm on one of the random graphs. Our algorithm successfully found the optimal solutions on the graphs of size and . Bold values mean that they are the theoretically optimal solutions obtained by the exact algorithm. See Table 4 in Appendix D for the full information.

Graph      MCTS    S2V-DQN   Supervised   Branch-and-Reduce
Cora       1451    1381      1451         1451
Citeseer   1867    1705      1867         1867
PubMed     15906   15709     15912        15912

Table 3: Comparison on citation networks. Our algorithm successfully found the optimal solutions for Cora and Citeseer. The results for S2V-DQN are taken from [7].

4.4 Visualization of the Probability Vector

Figure 2 shows the action-value of each node estimated by the same networks used in the previous experiments. Since we used GIN, the action-values of symmetric nodes are the same. On the right-hand side of Figure 2, both the supervised method and the Branch-and-Reduce algorithm select the leaf nodes by reduction rules, but we can see that our proposed method automatically learned the structure behind the problem. Since the values mean “how good it is to remove that node compared to random moves”, the values for leaf nodes in the right-hand side graph are small compared to the red nodes in the left-hand side graph.

Figure 2: Visualization of action-values. Red nodes have higher value and blue nodes have lower value. Action-values are normalized so that the values denote the preference over random actions.

5 Conclusion and Future Work

In this paper, we presented an algorithm to solve certain combinatorial optimization problems on graphs without human knowledge. At first, we tried to make the network deep with layers, but we found that even a shallow network with layers can learn Maximum Independent Set. The performance outperformed S2V-DQN, that is also a framework to solve NP-hard problems without domain knowledge.

We have so far focused on MDPs whose action is the node selection because GNNs output the feature vectors for each node and hence it is natural to use GNNs to learn a policy. However, if we have some networks that estimate the features for edges, then we can combine those networks with MCTS to solve a similar MDP whose action is to choose an edge. The search algorithm – MCTS with normalized return – is applicable to many settings. It may also work when nodes or edges are weighted. Therefore, it is interesting to explore the analysis and experiments for such tasks as future work.


MS was supported by the International Research Center for Neurointelligence (WPI-IRCN) at The University of Tokyo Institutes for Advanced Study.


Appendix A Reduction Rules

To solve the Maximum Independent Set problem exactly, it is widely believed that an algorithm must have exponential time complexity. However, we can improve the time complexity from $O^*(2^n)$, which is that of a brute-force algorithm that searches all subsets of the graph nodes. (The notation $O^*$ is used for exponential complexity; it ignores polynomial factors.) For example, if a graph has a node whose degree is $0$, that is, an isolated node, it must be included in the maximum independent set. Also, one can easily check that if a node has degree $1$, there is a maximum independent set that contains this node. This is the simplest reduction rule, called the pendant rule, but there are indeed many reduction rules for MIS. We introduce three other reduction rules that are used in Li et al.’s supervised method to solve MIS.
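The isolated-node and pendant rules described above can be applied exhaustively with a short routine. This is only an illustrative sketch with a hypothetical function name and an adjacency-dictionary graph representation; it is not the implementation used by the solvers discussed in this paper.

```python
def reduce_low_degree(adj):
    """Repeatedly take vertices of degree 0 or 1 into the independent set.

    Returns the taken vertices and the remaining (irreducible) graph.
    """
    adj = {v: set(nb) for v, nb in adj.items()}  # work on a copy
    taken = []
    queue = [v for v in adj if len(adj[v]) <= 1]
    while queue:
        v = queue.pop()
        if v not in adj or len(adj[v]) > 1:
            continue  # already removed by an earlier reduction
        taken.append(v)
        removed = {v} | adj[v]  # v and its (at most one) neighbor
        affected = set()
        for u in removed:
            affected |= adj[u]
        affected -= removed
        for u in removed:
            del adj[u]
        for w in affected:
            adj[w] -= removed
            if len(adj[w]) <= 1:
                queue.append(w)  # w may have become reducible
    return taken, adj

# Path 0 - 1 - 2 - 3 is solved completely by the two rules.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
taken, rest = reduce_low_degree(path)

# A 4-cycle has no vertex of degree 0 or 1, so nothing is reduced.
square = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
taken_sq, rest_sq = reduce_low_degree(square)
```

On sparse graphs such as trees, these two rules alone already produce a maximum independent set, which illustrates why sparse instances are easy for reduction-based solvers.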

Degree-2 folding

Let $v$ be a node of degree 2 and $u, w$ be the neighbors of $v$. If $u$ and $w$ are not adjacent, then there is a maximum independent set which contains either $v$ or both $u$ and $w$. Therefore, we can merge these three nodes into one new node $v'$ and later add either $v$ or both $u$ and $w$ to the maximum independent set.


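Unconfined

Let $N(S)$ be the neighbors of a vertex set $S$ and let $N[S] = N(S) \cup S$. Whether a node $v$ is unconfined or not is determined as follows.

  1. Let $S = \{v\}$

  2. Find $u \in N(S)$ such that $|N(u) \cap S| = 1$ and the size of $N(u) \setminus N[S]$ is minimized

  3. If there is no such $u$, $v$ is not unconfined

  4. If $N(u) \setminus N[S] = \emptyset$, $v$ is unconfined

  5. If $N(u) \setminus N[S]$ is a single node $w$, add $w$ to $S$ and go back to 2; otherwise $v$ is not unconfined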

One can show that there is a maximum independent set without any unconfined vertices. Therefore, unconfined vertices can be removed.
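The five-step test above can be sketched directly in code. The function name and the adjacency-dictionary representation are assumptions of this example, and the closed neighborhood $N[S]$ is taken to be $S$ together with all neighbors of $S$.

```python
def is_unconfined(adj, v):
    """Return True if vertex v is unconfined according to the procedure above."""
    s = {v}                                    # Step 1
    while True:
        closed = set(s)
        for x in s:
            closed |= adj[x]                   # closed neighborhood N[S]
        best = None
        for u in closed - s:                   # candidates u in N(S)
            if len(adj[u] & s) == 1:
                extra = adj[u] - closed        # N(u) \ N[S]
                if best is None or len(extra) < len(best):
                    best = extra               # Step 2: keep the smallest set
        if best is None:
            return False                       # Step 3
        if not best:
            return True                        # Step 4
        if len(best) == 1:
            s |= best                          # Step 5: add w and repeat
        else:
            return False

# Vertex 0 is adjacent to 1, 2, 3, and 1 - 2 is an edge.
g = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
center_unconfined = is_unconfined(g, 0)
leaf_unconfined = is_unconfined(g, 3)
```

In this small graph the maximum independent sets are {1, 3} and {2, 3}, so vertex 0 can be removed safely while the pendant vertex 3 cannot.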


Twin

Two vertices $u$ and $v$ of degree 3 are called twins if $N(u) = N(v)$. Let $S$ be the set of vertices $N(u)$ and $G[S]$ be the subgraph of $G$ induced by $S$. If $G[S]$ has an edge, there is a maximum independent set that contains both $u$ and $v$. Otherwise, we can remove $u$, $v$, and $S$ from $G$, and introduce a new vertex $w$ to $G$ connected to $u$’s order-2 neighbors. If $w$ is in a maximum independent set, then none of $u$’s order-2 neighbors can be in the MIS. Therefore, we can add $S$ to the MIS. If $w$ is not in a MIS, some of $u$’s order-2 neighbors can be in the MIS, and hence $u$ and $v$ can be added to the MIS.

Again, these reduction rules are known because the Maximum Independent Set problem has been widely studied. If we are to solve a new kind of NP-hard problem without any famous reduction rule, it is pretty hard to construct an efficient algorithm.

These reduction rules tend to be very useful in a sparse network such as social graphs because, in such graphs, most of the nodes have small degrees like or . Therefore, by just iteratively applying these reduction rules, the size of the graph usually get much smaller.

Appendix B Branch-and-Reduce Algorithm

Akiba and Iwata [8] proposed an algorithm which contains many rules for branching and reduction in addition to lower bounds. Its time complexity is . However, the Branch-and-Reduce algorithm actually works for some graphs with millions of nodes [8]. This is partially because the FPT lower bound in the algorithm is quite strong. In order to explain the corner cases for Branch-and-Reduce algorithm, we first explain this FPT algorithm.

FPT Algorithm for MVC

Let us define the MVC problem as an IP (Integer Program). For each vertex $v$, we have a variable $x_v \in \{0, 1\}$, and for each edge $(u, v)$ we have a constraint $x_u + x_v \geq 1$. The MVC problem is equivalent to the minimization of $\sum_v x_v$ under these constraints. Since solving an IP is NP-hard, we solve the relaxed LP instead, in which $0 \leq x_v \leq 1$. The difference between the LP value and the IP value (exact solution) is the parameter of the FPT algorithm.

One can show that the LP has an optimal solution in which $x_v$ is $0$, $\frac{1}{2}$, or $1$ for all $v$ [33]. Moreover, for those nodes with $x_v = 0$ or $x_v = 1$, there is always a minimum vertex cover which contains every $v$ such that $x_v = 1$ and does not contain any $v$ such that $x_v = 0$. Therefore, if the difference between the LP solution and the IP solution is small, we only need to consider a small number of nodes to determine the MVC. The running time can actually be bounded by a function of $k$ times a polynomial in the graph size, where $k$ is the gap between the size of the optimal solution and the LP lower bound [33].

Corner Cases for Branch-and-Reduce Algorithm

Since the Branch-and-Reduce algorithm utilizes the FPT lower bound, its time complexity can also be bounded in terms of the gap $k$. Let us consider a corner case for the Branch-and-Reduce algorithm in terms of this FPT bound. The corner case is actually quite simple: complete graphs. The answer of the LP for a complete graph of more than one node is $x_v = \frac{1}{2}$ for all $v$. In such a case, one cannot make use of the FPT lower bound. Indeed, in the experiments of [8], the Branch-and-Reduce algorithm failed to solve MVC within a time limit of 24 hours on some very dense graphs (DIMACS graphs).
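To make the gap on complete graphs concrete, the following sketch compares the value of the all-one-half LP assignment with the exact minimum vertex cover size found by brute force on small complete graphs. The function names are hypothetical, and the brute-force search is only meant for tiny instances.

```python
from itertools import combinations

def complete_graph_edges(n):
    """Edges of the complete graph on vertices 0..n-1."""
    return list(combinations(range(n), 2))

def min_vertex_cover_size(n, edges):
    """Exact minimum vertex cover size by exhaustive search (tiny graphs only)."""
    for k in range(n + 1):
        for cover in combinations(range(n), k):
            c = set(cover)
            if all(u in c or v in c for u, v in edges):
                return k

def half_solution_value(n, edges):
    """Objective value of x_v = 1/2 for all v, after checking LP feasibility."""
    x = [0.5] * n
    assert all(x[u] + x[v] >= 1 for u, v in edges)
    return sum(x)

# For K_n the integral optimum is n - 1 while the half solution has value n / 2.
gaps = {}
for n in (3, 5, 7):
    edges = complete_graph_edges(n)
    gaps[n] = min_vertex_cover_size(n, edges) - half_solution_value(n, edges)
```

The gap grows linearly with the number of vertices, so a running-time bound that depends on this gap gives no useful guarantee on large complete graphs.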

Since the Branch-and-Reduce algorithm includes a reduction rule for cliques, a complete graph may not be the best corner case. However, the algorithm is generally weak against very dense graphs or random graphs, where the constraints among vertices are quite complex.

Appendix C Pseudocode

In this section, we provide the pseudocode for our algorithm. As explained in Section 3.3, one can make use of a hashmap to store $\mathrm{mean}(s)$ and $\mathrm{std}(s)$ for reuse.

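Algorithm 1 rollout
Require: MCTS search tree T
Ensure: T updated through a simulation
  {repeat selection and expansion}
  s ← root of T
  while s is not final state do
     if s is new node then
        Evaluate (p, v) = f_θ(s) with GNNs
        Initialize Q(s, a) with v_a
        Calculate mean(s) and std(s) by random sampling
        Set N(s, a) to 0 for all a
     end if
     a ← choose best action that maximizes UCB
     s ← next state of s by selecting a
  end while
  {backpropagate}
  r ← 0
  while s is not root do
     a ← previous action
     s ← parent of s
     r ← r + 1
     N(s, a) ← N(s, a) + 1
     Update Q(s, a) with the normalized return (r − mean(s)) / std(s)
  end while

Algorithm 2 get_improved_π
Require: MCTS search tree T, number of iterations n, temperature τ
Ensure: π, the probability vector used for Play
  for i = 1 to n do
     rollout(T)
  end for
  π_a ← N(s, a)^(1/τ) / Σ_b N(s, b)^(1/τ) for the root s of T

Algorithm 3 train
Require: Graph G, GNN parameter θ, regularization constant c
Ensure: Updated parameter θ
  while G is not empty do
     π ← get_improved_π for the search tree rooted at G
     a ← sample vertex according to probability π
     Add (G, a, π) to history
     G ← remove a and its neighbors from G
  end while
  for all (G, a, π) in history do
     z ← how much return obtained from G
     normalize z by mean and std of G
     Replace (G, a, π) with (G, a, π, z)
  end for
  for each minibatch of shuffled history do
     for each (G, a, π, z) in minibatch do
        Evaluate policy p and normalized action-value v by GNNs
        Accumulate the loss (3) computed from p, v, π, and z
     end for
     Normalize loss with batch size
     Optimize θ to minimize loss
  end for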

Appendix D Additional Experiment Results

In this section, we present additional experimental results to support the findings of the main part of the paper.

Figure 3 shows the improvements in the solution size for each of the five test graphs with 10 different initial parameters. In each epoch, a random graph of nodes and edges is generated for training. In most cases, the learning was completed within the first 30 epochs. However, with some poor initial parameters, it took more than 100 epochs. Although it sometimes took time to converge, the learning was highly stable compared to the Policy Gradient approach explored in our preliminary studies: with Policy Gradient, even after the network had once produced near-optimal solutions, the performance commonly plunged.

Figure 4 shows how the networks were trained with generated graphs of nodes and edges during the first epochs. The test graphs were fixed random graphs with nodes and edges. As in Figure 3, there was no significant drop in performance once near-optimal solutions had been obtained.

Table 4 compares our MCTS algorithm, Li et al.’s supervised method, and Akiba and Iwata’s Branch-and-Reduce algorithm on random graphs. The network was trained on generated random graphs of nodes and edges for epochs. It successfully found the optimal solutions for all the graphs of nodes and (out of ) graphs of nodes.

Table 5 shows the difference in performance between networks trained on graphs of 100 nodes and 200 nodes. For the network trained on graphs of nodes, we only trained it for epochs since Figure 4 illustrated that it had already converged, while we trained the other network for epochs. Other hyperparameters such as are set to be the same. MCTS (200), the network trained on random graphs of 200 nodes, found optimal solutions for all the graphs of 100 nodes and performed slightly better on random graphs of 500 nodes. On random graphs of 1000 nodes, however, the improvement over MCTS (100) was much larger: for test graph (4) of 1000 nodes, MCTS (200) found a solution larger than that of MCTS (100) by 15. The results suggest that training the network on larger random graphs can produce better results. Li et al.’s supervised method used some graphs with more than nodes for training. This may partly explain why Li et al.’s algorithm obtained good solutions even for larger graphs. If we used such large graphs in our algorithm, the training would take days, which is one disadvantage of reinforcement learning. However, on social graphs with thousands of nodes, the results were still comparable even though we used small graphs (of nodes) for training, since the structure of social networks is much simpler than that of random graphs.

Figure 3: The performances of the network with 10 different initial parameters. Shaded region denotes the error bar, i.e., standard deviation.

Vertices, Edges (graph)   MCTS   Supervised   Branch-and-Reduce
10, 25 (0)                5      5            5
10, 25 (1)                4      5            5
10, 25 (2)                4      5            5
10, 25 (3)                5      5            5
10, 25 (4)                3      5            5
10, 25 (5)                4      5            5
10, 25 (6)                4      5            5
10, 25 (7)                4      5            5
10, 25 (8)                4      5            5
10, 25 (9)                4      5            5
100, 250 (0)              44     44           44
100, 250 (1)              45     45           45
100, 250 (2)              43     43           43
100, 250 (3)              45     45           45
100, 250 (4)              43     44           44
100, 250 (5)              44     44           44
100, 250 (6)              46     46           46
100, 250 (7)              42     42           42
100, 250 (8)              45     45           45
100, 250 (9)              42     42           42
500, 1250 (0)             212    220          220
500, 1250 (1)             215    221          221
500, 1250 (2)             210    220          220
500, 1250 (3)             216    222          222
500, 1250 (4)             210    214          214
500, 1250 (5)             213    220          220
500, 1250 (6)             210    216          216
500, 1250 (7)             209    216          216
500, 1250 (8)             210    216          216
500, 1250 (9)             213    218          218
1000, 2500 (0)            422    439          439
1000, 2500 (1)            417    439          438
1000, 2500 (2)            428    443          443
1000, 2500 (3)            429    448          448
1000, 2500 (4)            419    442          442
1000, 2500 (5)            416    434          433
1000, 2500 (6)            417    432          430
1000, 2500 (7)            424    438          437
1000, 2500 (8)            429    445          445
1000, 2500 (9)            424    441          441
Table 4: Comparison on random graphs. Each value is the size of the solution found in minutes by each algorithm on one of the random graphs. Our algorithm successfully found the optimal solutions on all the graphs of size and graphs of size out of . Bold values are the theoretically optimal solutions obtained by the exact algorithm.

Figure 4: The performance of the network when trained by generated random graphs of nodes and edges. Each graph corresponds to a network with different parameters. Shaded region denotes the error bar, i.e., standard deviation.

Vertices, Edges (graph)   MCTS (100)   MCTS (200)
100, 250 (0)              44           44
100, 250 (1)              45           45
100, 250 (2)              43           43
100, 250 (3)              45           45
100, 250 (4)              43           44
100, 250 (5)              44           44
100, 250 (6)              46           46
100, 250 (7)              42           42
100, 250 (8)              45           45
100, 250 (9)              42           42
500, 1250 (0)             212          216
500, 1250 (1)             215          218
500, 1250 (2)             210          213
500, 1250 (3)             216          215
500, 1250 (4)             210          208
500, 1250 (5)             213          211
500, 1250 (6)             210          210
500, 1250 (7)             209          209
500, 1250 (8)             210          209
500, 1250 (9)             213          214
1000, 2500 (0)            422          428
1000, 2500 (1)            417          422
1000, 2500 (2)            428          434
1000, 2500 (3)            429          437
1000, 2500 (4)            419          434
1000, 2500 (5)            416          413
1000, 2500 (6)            417          414
1000, 2500 (7)            424          425
1000, 2500 (8)            429          436
1000, 2500 (9)            424          432
Table 5: Comparison of two models trained on graphs of different sizes. Bold values indicate better performance compared to the other model. MCTS (100) and MCTS (200) denote the networks trained on random graphs of size 100 and 200, respectively.
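
The per-size differences in Table 5 can be tallied with a short Python script. The values are copied from the table; the helper name `summarize` is hypothetical. For each test-graph size it reports the total gain of MCTS (200) over MCTS (100), the largest single gain, and the index of the graph where that gain occurs.

```python
# Solution sizes from Table 5, keyed by the number of vertices of the test graphs.
mcts_100 = {
    100: [44, 45, 43, 45, 43, 44, 46, 42, 45, 42],
    500: [212, 215, 210, 216, 210, 213, 210, 209, 210, 213],
    1000: [422, 417, 428, 429, 419, 416, 417, 424, 429, 424],
}
mcts_200 = {
    100: [44, 45, 43, 45, 44, 44, 46, 42, 45, 42],
    500: [216, 218, 213, 215, 208, 211, 210, 209, 209, 214],
    1000: [428, 422, 434, 437, 434, 413, 414, 425, 436, 432],
}


def summarize(small, large):
    """Return {size: (total gain, largest gain, graph index of largest gain)}."""
    summary = {}
    for size in small:
        diffs = [b - a for a, b in zip(small[size], large[size])]
        best = max(range(len(diffs)), key=lambda i: diffs[i])
        summary[size] = (sum(diffs), diffs[best], best)
    return summary


for size, (total, largest, index) in summarize(mcts_100, mcts_200).items():
    print(f"{size} vertices: total gain {total}, largest gain {largest} on graph ({index})")
```

The total gains are 1, 5, and 50 for the graphs of 100, 500, and 1000 vertices respectively, and the largest single gain, 15, occurs on graph (4) of 1000 vertices.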