NP-hard problems arise in many real-world optimization tasks. Although solving them exactly in realistic time is believed to be impossible, many NP-hard problems are highly relevant in the real world, so exact algorithms, heuristics, and local search approaches, such as simulated annealing and evolutionary computation, have been developed to provide near-optimal solutions to these problems [1; 2; 3; 4; 5; 6].
Recently, machine learning has been applied to solve NP-hard problems. Li et al. used supervised learning to train a Graph Convolutional Network (GCN) for solving the Maximum Independent Set (MIS) problem (an independent set of a graph is a set of nodes such that no two nodes in the set are connected; finding an independent set of the maximum size is NP-hard) and showed that it performed better than heuristic solvers. This result suggests the usefulness of machine learning for automatically extracting useful features to handle a problem of interest, which previously required experts to manually design and implement hand-crafted reduction rules for searching. One bottleneck of Li et al.'s method is that exact solutions of NP-hard problems must be prepared in advance for supervised learning. Since preparing labeled datasets for NP-hard problems is difficult in principle, in this paper we address this problem by adopting a reinforcement learning method to solve NP-hard problems.
The biggest advantage of our algorithm is that it does not require any domain knowledge. Most existing algorithms, including Li et al.'s supervised method and Akiba and Iwata's exact algorithm for solving the Minimum Vertex Cover problem (which is equivalent to MIS and also NP-hard; a vertex cover of a graph is a set of nodes such that, for any edge $(u, v)$, at least one of $u$ and $v$ is contained in the set, and the complement of a maximum independent set is a minimum vertex cover), have implemented many reduction rules to make the search efficient (some of the sophisticated reduction rules are listed in Appendix A). Most of these reduction rules are highly nontrivial and require a deep understanding of graph theory. Since a variety of research has been conducted on well-known NP-hard problems, including MIS and the Traveling Salesperson Problem, there are indeed some practical fast algorithms and heuristics [3; 7; 8]. However, when we face a new NP-hard problem that has not been widely studied, it can be difficult to theoretically derive efficient reduction rules. Khalil et al. created a framework called S2V-DQN, combining Deep Q-Networks and a graph embedding network called structure2vec, to provide a solution to this issue. Although they could outperform some of the heuristic solvers, their results were not comparable to those of Li et al.'s supervised learning. In this paper, we propose a method to solve NP-hard problems without human knowledge that has much higher performance than S2V-DQN and is even comparable to algorithms with complex reduction rules. As explained in Section 3.1, the proposed algorithm can be applied to any combinatorial optimization problem as long as it can be reduced to a Markov decision process whose state is a graph and whose action is to choose a vertex.
In our algorithm, Monte-Carlo Tree Search (MCTS) enables the networks to learn from self-exploration. In Section 3.2, we explain how we modify the ordinary MCTS used in game search so that it applies to combinatorial optimization.
In this section, we introduce two main ingredients that play essential roles in our algorithm. The first is Graph Neural Networks (GNNs), including Graph Isomorphism Networks (GINs), and the second is Monte-Carlo Tree Search (MCTS) [11; 12], which was also used in AlphaGo Zero.
2.1 Graph Neural Network
Graph neural networks are a neural network framework on graphs. They recursively aggregate neighboring feature vectors to obtain node embeddings that capture the structural information of a graph. There have been many studies on aggregation schemes [14; 15; 16], and GCNs are the de facto standard, achieving good results in many fields.
We adopted GCNs in our preliminary studies, but better results were obtained by using GINs. For this reason, we focus on GINs in this paper. Let $A$ be the adjacency matrix of a graph of $n$ vertices and $\tilde{A} = A + I$ be the adjacency matrix with self-connections. We apply the aggregation rule of GIN-0 with features $H^{(k)}$ in the $k$-th layer:

$$H^{(k+1)} = \mathrm{MLP}^{(k)}\left(\tilde{A} H^{(k)}\right), \qquad (1)$$

where $H^{(k)}$ is the feature matrix of the $k$-th layer and $\mathrm{MLP}^{(k)}$ refers to the multi-layer perceptron in the $k$-th layer. GIN was shown to be one of the most powerful GNN architectures from the perspective of discriminative power and is known to be capable of discriminating graphs that GCN cannot. Since nodes are not distinguished from each other, we give a feature matrix of all ones as the input. Such a characteristic of GINs makes them suitable for our problem setting, where the information of each node is not abundant and structural information is important.
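The aggregation rule (1) can be sketched in a few lines of NumPy; this is an illustration only, assuming a two-layer MLP with ReLU activations (the weight shapes and variable names are ours, not from the paper):

```python
import numpy as np

def gin0_layer(A, H, W1, b1, W2, b2):
    """One GIN-0 aggregation: H' = MLP((A + I) H)."""
    A_tilde = A + np.eye(A.shape[0])       # add self-connections
    Z = A_tilde @ H                        # sum features over self and neighbors
    Z = np.maximum(Z @ W1 + b1, 0.0)       # first MLP layer with ReLU
    return np.maximum(Z @ W2 + b2, 0.0)    # second MLP layer with ReLU

# Triangle graph with all-ones input features: all nodes are symmetric,
# so every node should receive the same embedding.
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
H = np.ones((3, 1))
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(1, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 2)), np.zeros(2)
out = gin0_layer(A, H, W1, b1, W2, b2)
```

The all-ones input mirrors the featureless setting described above, and the identical rows of `out` illustrate why symmetric nodes get identical action-values later in Section 4.4.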
2.2 AlphaGo Zero
AlphaGo Zero is a well-known superhuman engine for the game of Go. AlphaGo Zero defeated its predecessor AlphaGo with a 100-0 score without human knowledge, i.e., without the records of games of professional players.
It trains a deep neural network with parameters $\theta$ by reinforcement learning. Given a state $s$ (game board), the network outputs $(\boldsymbol{p}, v) = f_\theta(s)$, where $\boldsymbol{p}$ is the probability vector over moves and $v \in [-1, 1]$ is a scalar denoting the value of the state. If $v$ is close to $1$, the player who takes a corresponding action from state $s$ is very likely to win.
MCTS Nodes and Edges
AlphaGo Zero uses MCTS to obtain $\boldsymbol{\pi}$, a much better probability vector of move selection than the raw network output $\boldsymbol{p}$. To explain the algorithm, we introduce the structure of the MCTS search tree. The search tree is a rooted tree, where each node corresponds to a state and the root is the initial state. Each edge $(s, a)$ denotes an action, meaning the action $a$ taken at state $s$. Each edge stores an action-value $Q(s, a)$ and a visit count $N(s, a)$.
MCTS and Training
The algorithm runs as follows. Initially, no node in the game tree is expanded, i.e., no node has had its $(\boldsymbol{p}, v)$ evaluated by the network.
(Selection) From the root, select the best action iteratively until reaching an unexpanded node. The best action is the action $a$ whose $Q(s, a) + U(s, a)$ is maximum, where $U(s, a)$ is the upper confidence bound of the action-value. The $U(s, a)$ used in AlphaGo Zero is determined by a variant of the PUCT algorithm:

$$U(s, a) = c_{\mathrm{puct}} \, p_a \frac{\sqrt{\sum_b N(s, b)}}{1 + N(s, a)}, \qquad (2)$$

where $c_{\mathrm{puct}}$ is a nonnegative constant to be determined by experiments.
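As a concrete illustration of the selection rule (2), a minimal sketch with list-based `Q`, `N`, and `p` (the representation is ours, not the paper's implementation):

```python
import math

def puct_select(Q, N, p, c_puct):
    """Return the action maximizing Q(s,a) + U(s,a), where
    U(s,a) = c_puct * p_a * sqrt(sum_b N(s,b)) / (1 + N(s,a))."""
    total_visits = sum(N)

    def score(a):
        u = c_puct * p[a] * math.sqrt(total_visits) / (1 + N[a])
        return Q[a] + u

    return max(range(len(Q)), key=score)
```

With equal action-values, an unvisited action is preferred over a heavily visited one (exploration); with a small `c_puct`, a clearly better `Q` dominates (exploitation).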
(Expansion) When hitting an unexpanded node $s$, evaluate it with the network and obtain $(\boldsymbol{p}, v) = f_\theta(s)$.
(Backpropagation) After expanding a new node, each edge $(s, a)$ used in the simulation is traversed and its $N(s, a)$ and $Q(s, a)$ are updated. We increment $N(s, a)$ and update $Q(s, a)$ with the mean of $v$ among all leaf nodes in the subtree below $(s, a)$. Formally, $Q(s, a) = \frac{1}{N(s, a)} \sum_{s' : \, s, a \to s'} v(s')$, where the summation is over every leaf $s'$ that can eventually be reached from $s$ after taking action $a$.
Iterate this process a sufficient number of times.
(Play) Select an action and proceed to the next state (game board). Calculate the enhanced probability vector of move selection $\boldsymbol{\pi}$ and select a move according to $\boldsymbol{\pi}$. $\boldsymbol{\pi}$ is calculated as $\pi_a \propto N(s, a)^{1/\tau}$, where $\tau$ is the temperature. Initially, we use a large $\tau$ and reduce it toward zero as learning proceeds. When $\tau$ is large, the selection is approximately uniformly random, and when $\tau \to 0$, it selects the move with the maximum visit count.
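The temperature-controlled policy above can be sketched as follows (a pure-Python illustration; the function name is ours):

```python
def visit_policy(N, tau):
    """pi_a proportional to N(s,a)^(1/tau); tau -> 0 picks the most-visited action."""
    if tau < 1e-8:  # treat as the zero-temperature limit
        best = max(range(len(N)), key=lambda a: N[a])
        return [1.0 if a == best else 0.0 for a in range(len(N))]
    powered = [n ** (1.0 / tau) for n in N]
    total = sum(powered)
    return [x / total for x in powered]
```

At `tau = 1` the policy is proportional to the visit counts; for very large `tau` it approaches uniform, and at `tau = 0` it is the argmax, matching the annealing schedule described above.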
Iterate Simulation and Play until the end of the game. We record $(s_t, \boldsymbol{\pi}_t)$ at each Play at step $t$. Once the game ends, we know the winner and the loser. Let $z_t = 1$ if the player who made the action at $s_t$ wins the game, and $z_t = -1$ otherwise.
(Training) We select minibatches from the recorded triples $(s_t, \boldsymbol{\pi}_t, z_t)$ and minimize the loss

$$\ell = (z - v)^2 - \boldsymbol{\pi}^{\top} \log \boldsymbol{p} + c \lVert \theta \rVert^2, \qquad (3)$$

where $c$ is a nonnegative constant for regularization. The network optimizes the probability vector of move selection and the value of the state at the same time.
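For a single recorded position, the loss (3) can be computed as in the sketch below (the parameter names are ours; zero-probability targets are skipped to avoid `log 0`):

```python
import math

def alphazero_loss(z, v, pi, p, theta, c):
    """(z - v)^2 - pi . log p + c * ||theta||^2 for one training example."""
    value_loss = (z - v) ** 2
    policy_loss = -sum(pi_a * math.log(p_a)
                       for pi_a, p_a in zip(pi, p) if pi_a > 0)
    reg = c * sum(t * t for t in theta)
    return value_loss + policy_loss + reg
```

When the value prediction is exact and there is no regularization, only the cross-entropy between the search policy and the network policy remains.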
In this section, we give a detailed explanation of our algorithm for solving the Maximum Independent Set (MIS) problem. First, we formalize the problem setting; we define the problem as a Markov Decision Process (MDP) so that reinforcement learning can be applied. Note that our algorithm can be applied not only to MIS but also to any problem that can be reduced to an MDP of the same form. Then we extend MCTS to our problem setting. We divide the simulation in MCTS into three processes: selection, expansion, and backpropagation.
3.1 Reduction to MDP
Let us consider approximately solving MIS in the following way.
Step 1. Compute a policy over the nodes of the current graph.
Step 2. Select a node according to the policy, then remove the selected node and all of its neighbors from the graph; repeat from Step 1 until no nodes remain.
Clearly, the set of the nodes selected in Step 2 forms a maximal (not necessarily maximum) independent set. Our aim here is to train a network to estimate a good policy in Step 1. We see that the whole process is an MDP whose states are graphs and whose actions are node selections. In MIS, by setting the reward to $r(s, a) = 1$ for all $s$ and $a$ and the discount rate to $\gamma = 1$, the return becomes the number of selected nodes, i.e., the size of the maximal independent set. Note that we do not necessarily set the reward and the discount rate in this manner when applying this algorithm to other problems (it depends on what one wants to maximize in that problem).
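The two-step procedure above can be sketched as a tiny environment; the class and method names below are ours, chosen for illustration:

```python
class MISEnv:
    """MDP for MIS: a state is a graph, an action picks a node.
    Picking a node yields reward 1 and removes the node and its neighbors,
    so with discount rate 1 the return equals the size of the maximal
    independent set that is built."""

    def __init__(self, adj):
        self.adj = {v: set(ns) for v, ns in adj.items()}
        self.selected = []

    def step(self, v):
        self.selected.append(v)
        removed = {v} | self.adj[v]
        for u in removed:
            self.adj.pop(u, None)
        for u in self.adj:
            self.adj[u] -= removed
        done = len(self.adj) == 0
        return 1.0, done  # (reward, episode finished?)

# Path graph 0-1-2: picking the two endpoints yields the maximum independent set.
env = MISEnv({0: {1}, 1: {0, 2}, 2: {1}})
r1, done1 = env.step(0)   # removes nodes 0 and 1
r2, done2 = env.step(2)   # removes node 2; the graph is now empty
```

The sum of rewards along the episode is exactly the size of the constructed independent set, matching the return defined above.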
Now the problem is reduced to an MDP, and our goal is to maximize the return. We can apply various algorithms for solving MDPs, including Policy Gradient, Q-learning, and many of their derivatives [22; 23; 24; 25; 26; 27; 28]. In our preliminary experiments, we tried REINFORCE and Deep Q-Networks, but they did not work as well as MCTS. See Section 3.4 for details.
Since we have reduced MIS to an MDP, any problem that can be reduced to the same form of MDP can be solved with the same approach. Such problems include Maximum Clique, Minimum Vertex Cover, and the Traveling Salesperson Problem, which are also NP-hard.
3.2 Extension of MCTS to Combinatorial Optimization
Here, we extend MCTS to combinatorial optimization problems.
The Difference between Game Search and Combinatorial Optimization
We exploit MCTS together with reinforcement learning, as used in AlphaGo Zero. However, we cannot directly apply the same approach to our problem setting. In most MCTS operations used in game search, the value of a state is determined according to "how likely the player is to win" [18; 13; 32]. Commonly, the value lies within $[-1, 1]$, meaning that the larger the value is, the more likely the player is to win. The aim of the players is only to win the game, and thus they are not interested in the cumulative value, i.e., how well they have been playing so far. More precisely, in a game, if a player reaches a state whose value is $-1$, that player will never win the game. On the other hand, in combinatorial optimization, say MIS, there are many common states between an optimal trajectory and a poor trajectory.
Another difference is that the size of the input is always the same in AlphaGo Zero. However, in our case, the size of the input depends on the size of the graph. To resolve this issue, we use GNNs, which can encode graphs of variable size into feature vectors.
Ideas for Extended MCTS
One could evaluate the value of a state $s$ in such a way that $v(s)$ is the size of the maximum independent set of $s$. However, since this evaluation is not normalized, there would be a difference of scale between (1) and (3). For example, in (3), the first term $(z - v)^2$ could be very large when $z$ and $v$ are large. We mitigate this issue as follows.
Given a state $s$ (graph), the network outputs $(\boldsymbol{p}, \boldsymbol{v}) = f_\theta(s)$, where $\boldsymbol{p}$ is the probability vector of node selection and $\boldsymbol{v}$ is a vector of node size whose $a$-th element is the normalized return from state $s$ if we choose vertex $a$ as the next action. We will elaborate on the definition of the normalized return in Section 3.3, but intuitively, the normalized return means "how good that return is compared to the return obtained by random actions".
By virtue of this normalization, we can calculate the loss in the same way as AlphaGo Zero, as in Eq. (3). Moreover, it frees the algorithm from problem-specific scaling; that is, whenever we are to maximize some criterion, we always evaluate an action by "how good it is compared to random actions".
MCTS Nodes and Edges
Now we are ready to state the algorithm. First, we articulate the structure of the search tree used in our MCTS. Each node of the tree represents a graph and each edge $(s, a)$ represents the action $a$ taken from state $s$. Each edge stores $N(s, a)$ and $Q(s, a)$. Each node stores $\mu(s)$ and $\sigma(s)$, the mean and the standard deviation of the return from state $s$ if played randomly. We estimate $\mu(s)$ and $\sigma(s)$ by sampling when expanding the node.
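This sampling step can be sketched for MIS as follows (the helper names are ours; replacing a zero standard deviation by 1 is our own guard against a later division by zero, not necessarily the paper's choice):

```python
import random
import statistics

def random_rollout_return(adj_dict, rng):
    """Play uniformly random node selections to termination; the return is
    the number of selected nodes."""
    adj = {v: set(ns) for v, ns in adj_dict.items()}
    count = 0
    while adj:
        v = rng.choice(sorted(adj))        # pick a remaining node uniformly
        removed = {v} | adj[v]             # drop it together with its neighbors
        for u in removed:
            adj.pop(u, None)
        for u in adj:
            adj[u] -= removed
        count += 1
    return count

def estimate_mu_sigma(adj_dict, n_rollouts=32, seed=0):
    rng = random.Random(seed)
    returns = [random_rollout_return(adj_dict, rng) for _ in range(n_rollouts)]
    return statistics.mean(returns), statistics.pstdev(returns) or 1.0

# On a triangle every rollout selects exactly one node.
mu, sigma = estimate_mu_sigma({0: {1, 2}, 1: {0, 2}, 2: {0, 1}})
```

Because these statistics depend only on the graph, they can be cached and reused, as noted below.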
Different from AlphaGo Zero, in order to calculate the exact reward at every state in a trajectory, the expansion phase continues until it reaches the termination state: a graph with no vertices.
In a simulation of MCTS, we choose the action in exactly the same way as AlphaGo Zero: we select the action maximizing $Q(s, a) + U(s, a)$, where the upper confidence bound $U(s, a)$ is given in (2).
When reaching an unexpanded node $s$, we evaluate $(\boldsymbol{p}, \boldsymbol{v}) = f_\theta(s)$ with the network. Then we initialize $Q(s, a)$ with the estimated normalized action-value $v_a$ for each action $a$. Note that $\boldsymbol{v}$ is a vector and $v_a$ is the element of $\boldsymbol{v}$ corresponding to taking action $a$. We also estimate $\mu(s)$ and $\sigma(s)$ using random rollouts. Note that $\boldsymbol{v}$ changes as the network learns, while $\mu(s)$ and $\sigma(s)$ depend only on the state $s$. Therefore, we can save them in a hashmap for reuse when we reach an unexpanded node whose graph has already been seen.
As in AlphaGo Zero, $N(s, a)$ is incremented by one. The difference from AlphaGo Zero is the update of the action-value estimator: $Q(s, a)$ is set to the normalized mean of all exact rewards obtained from the current state,

$$Q(s, a) = \frac{1}{N(s, a)} \sum_{i=1}^{N(s, a)} \frac{r_i(s, a) - \mu(s)}{\sigma(s)},$$

where $r_i(s, a)$ is the exact reward obtained by the $i$-th rollout from state $s$ after taking action $a$. In the implementation, the exact reward is calculated by starting from the terminal state and incrementing the reward by one each time we go up to the parent state.
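Under this update, $Q(s, a)$ is just the mean of per-rollout normalized rewards, as the following sketch shows (the function name is ours):

```python
def normalized_q(rewards, mu, sigma):
    """Q(s, a) as the mean of (r_i - mu) / sigma over all exact rewards r_i
    observed below the edge (s, a)."""
    return sum((r - mu) / sigma for r in rewards) / len(rewards)
```

A reward equal to the random-play mean maps to 0, and a reward one standard deviation above it maps to 1, which is exactly the "how good compared to random actions" scale used throughout.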
In training, each play is selected in exactly the same way as in AlphaGo Zero, according to the enhanced policy $\boldsymbol{\pi}$; however, we do not use the same approach to obtain the answer at inference time because it is very time-consuming. Note that our goal here is not to obtain the best action in a state but to find the best sequence of actions given an input graph. For this reason, we repeat rolling out until the end. In each rollout, we iteratively select the best node, i.e., the one that maximizes the UCB $Q(s, a) + U(s, a)$.
The pseudocode of our algorithm is provided in Appendix C.
3.4 Other Approaches to Solving MDP
Here, we discuss the approaches other than MCTS that we examined for solving the MDP.
We built a deep Q-network that estimates the action-value $Q(s, a)$ given a state and an action. We used experience replay and updated the parameters by minimizing the TD-error. This worked for some very small graphs; however, the learning was so volatile that once the network accidentally output a large $Q$ value, the whole parameter vector got contaminated and the action-values diverged.
We also attempted to solve the MDP by Policy Gradient. We used GIN as the policy network and REINFORCE for learning. We found that the learning procedure was highly unstable, as is often seen in Policy Gradient: once the policy was polluted, it took many epochs to recover. Although it empirically outperformed DQN, the MCTS-based approach proposed in this paper is significantly more stable and achieves superior performance.
In this section, we present the experimental results on solving the MIS problem.
The experiments consist of three parts. In the first part, we visualize how the networks were trained by MCTS. In the second part, we compare our algorithm with S2V-DQN, Li et al.'s supervised method, and Akiba and Iwata's exact algorithm. Lastly, we show the normalized action-value vector for some graphs and verify that the networks could automatically detect some structural information.
4.1 Global Setup
We tuned the hyperparameters of the GIN, including the number of layers, the number of hidden units in each two-layer MLP, and the dropout ratio.
In the MCTS, we have five hyperparameters: the weight $c_{\mathrm{puct}}$ of the UCB used in (2), the temperature $\tau$ used when calculating the enhanced policy $\boldsymbol{\pi}$, the number of rollouts used to estimate $\mu(s)$ and $\sigma(s)$ when expanding a new node, the number of simulations, and the L2 regularization constant $c$. We annealed $\tau$ toward zero over the training epochs, and set the number of rollouts and the number of simulations as functions of the number of nodes $n$ in the graph.
Training and Environment
We trained 10 models with different initial parameters using the Reedbush-H computing system (CPU: Intel Xeon E5-2695 v4, 2.1 GHz, 18 cores; GPU: NVIDIA Tesla P100 SXM2 x2), and each training run took 9 hours. In each epoch, a random graph was generated and used for training.
Note that, if we ignore the sizes of the feature matrices (which are usually much smaller than the size of the graph), the time complexity for training on a graph is governed by the number of edges $m$ and the size $K$ of the solution.
Unless mentioned otherwise, we used these ten trained models throughout the experiments.
4.2 Visualization of Training
We visualized how the size of the solution found changes with the training epochs.
In each epoch, we used a fixed set of random test graphs. Note that the size of a maximum independent set may differ from graph to graph.
Figure 1 shows that the algorithm achieved stable performance. Each colored curve corresponds to a test graph and indicates the improvement of the mean of the solution sizes. See Figure 3 in Appendix D for the results for different initial parameters.
4.3 Comparison to other Algorithms
We compared our algorithm with the machine learning methods and Akiba and Iwata's Branch-and-Reduce algorithm. We first give brief explanations of these algorithms.
Machine Learning Method
Khalil et al. proposed S2V-DQN, a framework for combinatorial optimization problems combining a deep Q-network with a graph embedding network called structure2vec. Li et al. presented an approach to solving the Maximum Independent Set problem combining machine learning techniques with classical heuristics that require many reduction rules (see Appendix A). They trained GCNs to predict the likelihood of each node being included in the optimal solution, using a training dataset of pairs $(G, y)$ where $G$ is a graph and $y$ is the label of an optimal solution. Their experimental results demonstrated that it performed as well as the state-of-the-art heuristic solvers for NP-hard problems.
For some NP-hard problems, despite their theoretical hardness, there are practical fast exact algorithms. Akiba and Iwata's Branch-and-Reduce algorithm solves Minimum Vertex Cover (and, equivalently, Maximum Independent Set) and works even for social graphs with millions of nodes. Unfortunately, however, this algorithm does not work for all graphs: it can easily get stuck on a dense graph of only hundreds of nodes, and it cannot be applied to random graphs of over one thousand nodes either. In Appendix B, we introduce the Branch-and-Reduce algorithm in greater detail and explain why it fails in these corner cases.
We tested the algorithms on random graphs of different sizes generated as above and on citation networks (Sen et al., 2008). Citation networks provide real-world directed sparse graphs with labeled nodes. Here, we ignored the class labels and treated the graphs as undirected. The number of nodes and edges of each graph is listed in Table 1.
In our method, we selected the best solution among the trained networks for each graph. For each algorithm, computation was stopped after a fixed time limit and the best solution found by then was output.
On the citation networks, our method obtained much better solutions than S2V-DQN on all three graphs, as shown in Table 3. The solutions for Cora and Citeseer found by our algorithm were optimal. Again, note that our algorithm does not use any of the reduction rules used in both Li et al.'s supervised method and the Branch-and-Reduce algorithm. When a graph is highly sparse, many reduction rules, such as the pendant rule that immediately handles nodes of very small degree, can be applied. Therefore, PubMed, the sparsest one, is easier for both the supervised method and the Branch-and-Reduce algorithm.
Another remark is that the training was conducted with generated random graphs as explained in Section 4.1. Table 2 implies that the size of the graphs used for training can limit accuracy. See Table 5 in Appendix D for the results of experiments in which the network was trained with graphs of different sizes.
4.4 Visualization of the Probability Vector
Figure 2 shows the action-value of each node estimated by the same networks used in the previous experiments. Since we used GIN, the action-values of symmetric nodes are the same. In the right-hand side of Figure 2, both the supervised method and the Branch-and-Reduce algorithm select the leaf nodes by reduction rules, but we can see that our proposed method automatically learned the structure behind the problem. Since the values mean "how good it is to remove that node compared to random moves", the values for the leaf nodes in the right-hand side figure are small compared to those of the red nodes in the left-hand side graph.
5 Conclusion and Future Work
In this paper, we presented an algorithm to solve certain combinatorial optimization problems on graphs without human knowledge. At first, we tried to make the network deep, but we found that even a shallow network can learn Maximum Independent Set. Our method outperformed S2V-DQN, which is also a framework for solving NP-hard problems without domain knowledge.
We have so far focused on MDPs whose action is node selection because GNNs output feature vectors for each node, and hence it is natural to use GNNs to learn a policy. However, if we have networks that estimate features for edges, then we can combine those networks with MCTS to solve a similar MDP whose action is to choose an edge. The search algorithm, MCTS with normalized returns, is applicable to many settings. It may also work when nodes or edges are weighted. Therefore, it is interesting to explore the analysis and experiments for such tasks as future work.
MS was supported by the International Research Center for Neurointelligence (WPI-IRCN) at The University of Tokyo Institutes for Advanced Study.
- Teofilo F. Gonzalez. Handbook of Approximation Algorithms and Metaheuristics (Chapman & Hall/CRC Computer & Information Science Series). Chapman & Hall/CRC, 2007. ISBN 1584885505.
- Georges A. Croes. A method for solving traveling-salesman problems. Operations Research, 6(6):791–812, 1958.
- Gerhard Reinelt. TSPLIB—a traveling salesman problem library. ORSA Journal on Computing, 3(4):376–384, 1991.
- Shen Lin and Brian W. Kernighan. An effective heuristic algorithm for the traveling-salesman problem. Operations Research, 21(2):498–516, 1973.
- Youmei Li and Zongben Xul. An ant colony optimization heuristic for solving maximum independent set problems. In Proceedings of the Fifth International Conference on Computational Intelligence and Multimedia Applications (ICCIMA 2003), pages 206–211. IEEE, 2003.
- A genetic algorithm-based heuristic for solving the weighted maximum independent set and some equivalent problems. Journal of the Operational Research Society, 48(6):612–622, 1997.
- Zhuwen Li, Qifeng Chen, and Vladlen Koltun. Combinatorial optimization with graph convolutional networks and guided tree search. In Advances in Neural Information Processing Systems, pages 539–548, 2018.
- Takuya Akiba and Yoichi Iwata. Branch-and-reduce exponential/FPT algorithms in practice: A case study of vertex cover. Theoretical Computer Science, 609:211–225, 2016.
- Elias Khalil, Hanjun Dai, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems, pages 6348–6358, 2017.
- Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? CoRR, abs/1810.00826, 2018. URL http://arxiv.org/abs/1810.00826.
- Guillaume Maurice Jean-Bernard Chaslot, Sander Bakkes, István Szita, and Pieter Spronck. Monte-Carlo tree search: A new framework for game AI. In Proceedings of the Artificial Intelligence and Interactive Digital Entertainment Conference, pages 216–217, Stanford University, California, 2008. URL http://www.aaai.org/Papers/AIIDE/2008/AIIDE08-036.pdf.
- Cameron Browne, Edward Powley, Daniel Whitehouse, Simon Lucas, Peter I. Cowling, Stephen Tavener, Diego Perez, Spyridon Samothrakis, Simon Colton, et al. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 2012a.
- David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.
- William L. Hamilton, Rex Ying, and Jure Leskovec. Representation learning on graphs: Methods and applications. CoRR, abs/1709.05584, 2017a.
- Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.
- William L. Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems 30, pages 1025–1035, 2017b. URL http://papers.nips.cc/paper/6703-inductive-representation-learning-on-large-graphs.
- Michael Sejr Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. In The Semantic Web - 15th International Conference, ESWC 2018, pages 593–607, 2018. doi: 10.1007/978-3-319-93417-4_38. URL https://doi.org/10.1007/978-3-319-93417-4_38.
- David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
- Christopher D. Rosin. Multi-armed bandits with episode context. Annals of Mathematics and Artificial Intelligence, 61(3):203–230, 2011.
- Tingting Zhao, Hirotaka Hachiya, Gang Niu, and Masashi Sugiyama. Analysis and improvement of policy gradient estimation. In Advances in Neural Information Processing Systems, pages 262–270, 2011.
- Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
- Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
- Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
- John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
- Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224, 2016.
- Jan Peters and Stefan Schaal. Natural actor-critic. Neurocomputing, 71(7-9):1180–1190, 2008.
- Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.
- Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992.
- Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
- Gerhard J. Woeginger. Exact algorithms for NP-hard problems: A survey. In Combinatorial Optimization—Eureka, You Shrink!, pages 185–207. Springer, 2003.
- Cameron B. Browne, Edward Powley, Daniel Whitehouse, Simon M. Lucas, Peter I. Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012b.
- Yoichi Iwata, Keigo Oka, and Yuichi Yoshida. Linear-time FPT algorithms via network flow. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1749–1761. SIAM, 2014.
Appendix A Reduction Rules
To exactly solve the Maximum Independent Set problem, it is believed that any algorithm must have exponential time complexity. However, we can improve the time complexity from the $O^*(2^n)$ of the brute-force algorithm that searches all subsets of the graph nodes (here $O^*$ is a notation for exponential complexity: in $O^*$ notation, we ignore polynomial coefficients). For example, if a graph has a node whose degree is $0$, that is, an isolated node, it must be included in the maximum independent set. Also, one can easily check that if a node has degree $1$, there is a maximum independent set that contains this node. This is the simplest reduction rule, called the pendant rule, but there are indeed many reduction rules for MIS. We introduce three other reduction rules that are used in Li et al.'s supervised method to solve MIS.
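A minimal sketch of these two degree-based reductions (the function name is ours): it repeatedly takes isolated nodes and pendant (degree-1) nodes into the solution, removing a pendant node's neighbor as well.

```python
def reduce_low_degree(adj):
    """Apply degree-0 and degree-1 (pendant) reductions until none applies.
    Returns the chosen nodes and the remaining graph."""
    adj = {v: set(ns) for v, ns in adj.items()}
    chosen = []
    changed = True
    while changed:
        changed = False
        for v in list(adj):
            if v not in adj:
                continue
            if len(adj[v]) == 0:          # isolated node: always in some MIS
                chosen.append(v)
                del adj[v]
                changed = True
            elif len(adj[v]) == 1:        # pendant node: safe to take it
                chosen.append(v)
                u = next(iter(adj[v]))
                for w in (v, u):          # drop v and its neighbor u
                    for x in adj.pop(w, set()):
                        adj[x].discard(w)
                changed = True
    return chosen, adj

# A star K_{1,3}: taking the three leaves is the maximum independent set,
# and the two rules alone solve the instance completely.
chosen, rest = reduce_low_degree({0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}})
```

On sparse graphs such rules often shrink the instance dramatically before any search is needed, which is the point made in this appendix.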
Let $v$ be a node of degree 2 and $u_1, u_2$ be the neighbors of $v$. If $u_1$ and $u_2$ are not adjacent, then there is a maximum independent set which contains either $v$ or both $u_1$ and $u_2$. Therefore, we can merge these three nodes into one new node and later add either $v$ or $\{u_1, u_2\}$ to the maximum independent set.
Let $N(S)$ denote the neighbors of a vertex set $S$ and $N[S] = N(S) \cup S$ its closed neighborhood. Whether a node $v$ is unconfined or not is determined as follows.
1. Set $S = \{v\}$.
2. Find $u \in N(S)$ such that $|N(u) \cap S| = 1$ and the size of $N(u) \setminus N[S]$ is minimized.
3. If there is no such $u$, $v$ is not unconfined.
4. If $N(u) \setminus N[S] = \emptyset$, $v$ is unconfined.
5. If $N(u) \setminus N[S]$ is a single vertex $w$, add $w$ to $S$ and go back to 2; otherwise $v$ is not unconfined.
One can show that there is a maximum independent set without any unconfined vertices. Therefore, unconfined vertices can be removed.
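The iterative procedure above can be sketched directly (a pure-Python illustration; `adj` maps each vertex to its neighbor set, and the function name is ours):

```python
def is_unconfined(adj, v):
    """Decide whether v is unconfined: grow S from {v}, repeatedly looking at
    neighbors u of S with exactly one neighbor inside S and as few neighbors
    outside N[S] as possible."""
    S = {v}
    while True:
        NS = set().union(*(adj[u] for u in S)) - S     # N(S)
        closed = NS | S                                # N[S]
        best = None
        for u in NS:
            if len(adj[u] & S) == 1:
                outside = adj[u] - closed
                if best is None or len(outside) < len(best):
                    best = outside
        if best is None:
            return False        # no candidate u: v is not unconfined
        if len(best) == 0:
            return True         # v is unconfined
        if len(best) == 1:
            S |= best           # extend S and repeat
        else:
            return False        # v is not unconfined

triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
path3 = {0: {1}, 1: {0, 2}, 2: {1}}
```

In a triangle, every vertex is unconfined (some maximum independent set avoids it), while the endpoint of a path of three nodes is not, since it belongs to the unique maximum independent set.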
Two nonadjacent vertices $u$ and $v$ of degree 3 are called twins if $N(u) = N(v)$. Let $S = N(u)$ and let $G[S]$ be the subgraph of $G$ induced by $S$. If $G[S]$ has an edge, there is a maximum independent set that contains both $u$ and $v$. Otherwise, we can remove $u$, $v$, and $S$ from $G$ and introduce a new vertex $w$ connected to $S$'s order-2 neighbors (the neighbors of $S$ other than $u$ and $v$). If $w$ is in a maximum independent set of the reduced graph, then none of $S$'s order-2 neighbors can be in the MIS, so we can add $S$ to the MIS. If $w$ is not in the MIS, some of $S$'s order-2 neighbors can be in the MIS, and hence $u$ and $v$ can be added to the MIS.
Again, these reduction rules are known because the Maximum Independent Set problem has been widely studied. If we are to solve a new kind of NP-hard problem without any well-known reduction rules, it is quite hard to construct an efficient algorithm.
These reduction rules tend to be very useful on sparse networks such as social graphs because, in such graphs, most of the nodes have small degrees such as 1 or 2. Therefore, just by iteratively applying these reduction rules, the size of the graph usually gets much smaller.
Appendix B Branch-and-Reduce Algorithm
Akiba and Iwata proposed an algorithm which contains many rules for branching and reduction in addition to lower bounds. Although its worst-case time complexity is exponential, the Branch-and-Reduce algorithm actually works for some graphs with millions of nodes. This is partially because the FPT lower bound in the algorithm is quite strong. In order to explain the corner cases of the Branch-and-Reduce algorithm, we first explain this FPT algorithm.
FPT Algorithm for MVC
Let us define the MVC problem as an IP (Integer Program). For each vertex v, we have a binary variable x_v ∈ {0, 1}, and for each edge (u, v) we have a constraint x_u + x_v ≥ 1. The MVC problem is equivalent to the minimization of Σ_v x_v under these constraints. Since solving the IP is NP-hard, we solve the relaxed LP (with x_v ∈ [0, 1]) instead. The difference between the LP value and the IP value (the exact solution) now serves as the parameter of the FPT algorithm.
One can show that there is an optimal LP solution with x_v ∈ {0, 1/2, 1} for all v (half-integrality). Moreover, there is always a minimum vertex cover which contains every v with x_v = 1 and does not contain any v with x_v = 0. Therefore, if the difference between the LP solution and the IP solution is small, we only need to consider the small set of nodes with x_v = 1/2 to determine the MVC. The time complexity can actually be bounded by c^k · poly(n) for a constant c, where k is the gap between the size of the optimal solution and the LP lower bound.
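To illustrate, the brute-force check below finds the LP optimum of a tiny instance by exploiting half-integrality: since some optimal LP solution uses only the values 0, 1/2, 1, enumerating those suffices. This is a demonstration on toy graphs, not the polynomial-time LP solver a real implementation would use.

```python
from itertools import product
from fractions import Fraction

def lp_vc_value(nodes, edges):
    """LP relaxation value of vertex cover on a small graph.
    Half-integrality guarantees an optimum over {0, 1/2, 1} only,
    so brute force over those values finds it (exponential; demo only)."""
    half = Fraction(1, 2)
    best = None
    for vals in product((Fraction(0), half, Fraction(1)),
                        repeat=len(nodes)):
        x = dict(zip(nodes, vals))
        # feasibility: every edge must be (fractionally) covered
        if all(x[u] + x[v] >= 1 for u, v in edges):
            total = sum(vals)
            if best is None or total < best:
                best = total
    return best
```

For the complete graph K4 the LP optimum is 2, attained by x_v = 1/2 everywhere, while the integral optimum is 3; for the path 1-2-3 the LP already matches the integral optimum of 1, so the gap k is zero.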
Corner Cases for Branch-and-Reduce Algorithm
Since the Branch-and-Reduce algorithm utilizes the FPT lower bound, its time complexity can also be bounded in terms of the gap k. Let us consider a corner case for the Branch-and-Reduce algorithm in terms of this FPT bound. The corner case is actually quite simple: complete graphs. The LP solution for a complete graph with more than one node assigns x_v = 1/2 to every vertex, so the gap k grows with the graph and one cannot make use of the FPT lower bound. Indeed, in Akiba and Iwata's experiments, the Branch-and-Reduce algorithm failed to solve MVC within a time limit of 24 hours on some very dense graphs (DIMACS graphs).
Since the Branch-and-Reduce algorithm includes a reduction rule for cliques, a complete graph may not be the best corner case; however, the algorithm is generally weak against very dense graphs or random graphs in which the constraints among vertices are complex.
Appendix C Pseudocode
In this section, we provide the pseudocode for our algorithm. As explained in Section 3.3, one can make use of a hashmap to store the computed per-state values for reuse.
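A minimal sketch of such a cache follows. It assumes the stored values are deterministic per-state evaluations (e.g., network outputs) keyed by a canonical, hashable encoding of the residual graph; the class name and encoding are illustrative, not the paper's implementation.

```python
class EvalCache:
    """Memoize expensive per-state evaluations, keyed by a canonical
    hashable form of the state, so repeated MCTS visits to the same
    residual graph reuse the stored result."""

    def __init__(self, evaluate):
        self.evaluate = evaluate   # state -> value (assumed deterministic)
        self.table = {}
        self.hits = 0

    def key(self, adj):
        # order-independent, hashable encoding of an adjacency dict
        return frozenset((v, frozenset(nbrs)) for v, nbrs in adj.items())

    def __call__(self, adj):
        k = self.key(adj)
        if k in self.table:
            self.hits += 1
        else:
            self.table[k] = self.evaluate(adj)
        return self.table[k]
```

Because the key is order-independent, two searches that reach the same residual graph through different move orders share one cached evaluation.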
Appendix D Additional Experiment Results
In this section, we present additional experimental results to support the findings of the main part of the paper.
Figure 3 shows the improvements in the solution size for each of the five test graphs with 10 different initial parameters. In each epoch, a new random graph is generated for training. In most cases, the learning was completed within the first 30 epochs; however, with some poor initial parameters, it took more than 100 epochs. Although it sometimes took time to converge, the learning was highly stable compared to the Policy Gradient method explored in our preliminary studies: with Policy Gradient, even after the network had once produced near-optimal solutions, its performance commonly plunged.
Figure 4 shows how the networks were trained with generated random graphs during the early epochs. The test graphs were fixed random graphs. Similarly to Figure 3, there was no significant drop in performance after near-optimal solutions were first obtained.
Table 4 shows the comparison of our MCTS algorithm, Li et al.'s supervised method, and Akiba and Iwata's Branch-and-Reduce algorithm on random graphs. The network was trained on generated random graphs. It successfully found the optimal solutions for most of the smaller test graphs.
Table 5 shows the difference in performance between networks trained on graphs of 100 nodes and on graphs of 200 nodes. One of the networks was trained for fewer epochs, since Figure 4 illustrated that it had already converged; the other hyperparameters were set to be the same. MCTS (200), the network trained on random graphs of 200 nodes, obtained optimal solutions for all the test graphs of 100 nodes and had slightly better average performance on the random graphs of 500 nodes. Its performance on the random graphs of 1000 nodes was much better than that of MCTS (100); on the 1000-node test graph (4), MCTS (200) found a solution whose size is greater than that of MCTS (100) by 15. These results suggest that training the network with larger random graphs can produce better results. Li et al.'s supervised method used some much larger graphs for training, which may partially explain why their algorithm obtained good solutions even for larger graphs. If we used such larger graphs in our algorithm, the training would take days to finish; this is one disadvantage of reinforcement learning. However, on social graphs with thousands of nodes, the results were still comparable even though we used small graphs for training, since the structure of social networks is much simpler than that of random graphs.
Table 4: Solution sizes on random graphs (seed in parentheses), comparing our MCTS algorithm, Li et al.'s supervised method, and Akiba and Iwata's Branch-and-Reduce algorithm.

| Vertex, Edge (seed) | MCTS | Li et al. | Akiba & Iwata |
|---|---|---|---|
| 10, 25 (0) | 5 | 5 | 5 |
| 10, 25 (1) | 4 | 5 | 5 |
| 10, 25 (2) | 4 | 5 | 5 |
| 10, 25 (3) | 5 | 5 | 5 |
| 10, 25 (4) | 3 | 5 | 5 |
| 10, 25 (5) | 4 | 5 | 5 |
| 10, 25 (6) | 4 | 5 | 5 |
| 10, 25 (7) | 4 | 5 | 5 |
| 10, 25 (8) | 4 | 5 | 5 |
| 10, 25 (9) | 4 | 5 | 5 |
| 100, 250 (0) | 44 | 44 | 44 |
| 100, 250 (1) | 45 | 45 | 45 |
| 100, 250 (2) | 43 | 43 | 43 |
| 100, 250 (3) | 45 | 45 | 45 |
| 100, 250 (4) | 43 | 44 | 44 |
| 100, 250 (5) | 44 | 44 | 44 |
| 100, 250 (6) | 46 | 46 | 46 |
| 100, 250 (7) | 42 | 42 | 42 |
| 100, 250 (8) | 45 | 45 | 45 |
| 100, 250 (9) | 42 | 42 | 42 |
| 500, 1250 (0) | 212 | 220 | 220 |
| 500, 1250 (1) | 215 | 221 | 221 |
| 500, 1250 (2) | 210 | 220 | 220 |
| 500, 1250 (3) | 216 | 222 | 222 |
| 500, 1250 (4) | 210 | 214 | 214 |
| 500, 1250 (5) | 213 | 220 | 220 |
| 500, 1250 (6) | 210 | 216 | 216 |
| 500, 1250 (7) | 209 | 216 | 216 |
| 500, 1250 (8) | 210 | 216 | 216 |
| 500, 1250 (9) | 213 | 218 | 218 |
| 1000, 2500 (0) | 422 | 439 | 439 |
| 1000, 2500 (1) | 417 | 439 | 438 |
| 1000, 2500 (2) | 428 | 443 | 443 |
| 1000, 2500 (3) | 429 | 448 | 448 |
| 1000, 2500 (4) | 419 | 442 | 442 |
| 1000, 2500 (5) | 416 | 434 | 433 |
| 1000, 2500 (6) | 417 | 432 | 430 |
| 1000, 2500 (7) | 424 | 438 | 437 |
| 1000, 2500 (8) | 429 | 445 | 445 |
| 1000, 2500 (9) | 424 | 441 | 441 |
Table 5: Solution sizes on random graphs (seed in parentheses) for the networks trained on graphs of 100 nodes and of 200 nodes.

| Vertex, Edge (seed) | MCTS (100) | MCTS (200) |
|---|---|---|
| 100, 250 (0) | 44 | 44 |
| 100, 250 (1) | 45 | 45 |
| 100, 250 (2) | 43 | 43 |
| 100, 250 (3) | 45 | 45 |
| 100, 250 (4) | 43 | 44 |
| 100, 250 (5) | 44 | 44 |
| 100, 250 (6) | 46 | 46 |
| 100, 250 (7) | 42 | 42 |
| 100, 250 (8) | 45 | 45 |
| 100, 250 (9) | 42 | 42 |
| 500, 1250 (0) | 212 | 216 |
| 500, 1250 (1) | 215 | 218 |
| 500, 1250 (2) | 210 | 213 |
| 500, 1250 (3) | 216 | 215 |
| 500, 1250 (4) | 210 | 208 |
| 500, 1250 (5) | 213 | 211 |
| 500, 1250 (6) | 210 | 210 |
| 500, 1250 (7) | 209 | 209 |
| 500, 1250 (8) | 210 | 209 |
| 500, 1250 (9) | 213 | 214 |
| 1000, 2500 (0) | 422 | 428 |
| 1000, 2500 (1) | 417 | 422 |
| 1000, 2500 (2) | 428 | 434 |
| 1000, 2500 (3) | 429 | 437 |
| 1000, 2500 (4) | 419 | 434 |
| 1000, 2500 (5) | 416 | 413 |
| 1000, 2500 (6) | 417 | 414 |
| 1000, 2500 (7) | 424 | 425 |
| 1000, 2500 (8) | 429 | 436 |
| 1000, 2500 (9) | 424 | 432 |