1 Introduction
NP-hard problems arise in many real-world optimization tasks. Although solving them exactly in realistic time is believed to be impossible, many NP-hard problems are highly relevant in practice, and thus a variety of exact algorithms, heuristics, and local search approaches, such as simulated annealing and evolutionary computation, have been developed to provide near-optimal solutions to these problems [1; 2; 3; 4; 5; 6]. Recently, machine learning has been applied to solve NP-hard problems.
Li et al. [7] used supervised learning to train a Graph Convolutional Network (GCN) for solving the Maximum Independent Set (MIS) problem and showed that it performed better than heuristic solvers. (An independent set of a graph is a set of nodes such that no two nodes in the set are connected; finding an independent set of maximum size is NP-hard.) This result suggests the usefulness of machine learning for automatically extracting features needed to handle a problem of interest, a task that previously required experts to manually design and implement handcrafted reduction rules for the search. One bottleneck of Li et al.'s method is that exact solutions to NP-hard problem instances must be prepared in advance for supervised learning. Since preparing labeled datasets for NP-hard problems is difficult in principle, in this paper we address this issue by adopting a reinforcement learning method to solve NP-hard problems.
The biggest advantage of our algorithm is that it does not require any domain knowledge. Most existing algorithms, including Li et al.'s supervised method and Akiba and Iwata's exact algorithm for the Minimum Vertex Cover problem, which is equivalent to MIS, implement many reduction rules to make the search efficient (some of the sophisticated reduction rules are listed in Appendix A). (A vertex cover of a graph is a set of nodes such that, for any edge, at least one of its endpoints is contained in the set; the complement of a maximum independent set is a minimum vertex cover.) Most of these reduction rules are highly nontrivial and require a deep understanding of graph theory. Since a great deal of research has been conducted on well-known NP-hard problems, including MIS and the Traveling Salesperson Problem, there are indeed some practical fast algorithms and heuristics [3; 7; 8]. However, when we face a new NP-hard problem that has not been widely studied, it can be difficult to derive efficient reduction rules theoretically.
Khalil et al. [9] created a framework called S2V-DQN, combining deep Q-networks and a graph embedding network called structure2vec, to address this issue. Although it outperformed some heuristic solvers, its results were not comparable to those of Li et al.'s supervised learning. In this paper, we propose a method that solves NP-hard problems without human knowledge, achieves much higher performance than S2V-DQN, and is even comparable to algorithms with complex reduction rules. As explained in Section 3.1, the proposed algorithm can be applied to any combinatorial optimization problem that can be reduced to a Markov decision process whose state is a graph and whose action is to choose a vertex.
In our algorithm, Monte Carlo Tree Search (MCTS) enables the networks to learn from self-exploration. In Section 3.2, we explain how we modify the ordinary MCTS used in game search for combinatorial optimization.
2 Preliminary
In this section, we introduce two main ingredients that play essential roles in our algorithm. The first is Graph Neural Networks (GNNs), including Graph Isomorphism Networks (GINs) [10], and the second is Monte Carlo Tree Search (MCTS) [11; 12], which was also used in AlphaGo Zero [13].
2.1 Graph Neural Network
Graph neural networks are a neural network framework for graphs. They recursively aggregate neighboring feature vectors to obtain node embeddings that capture the structural information of a graph. There have been many studies on aggregation schemes [14; 15; 16], and GCNs [17] are the de facto standard, achieving strong results in many fields. We adopted GCNs in our preliminary studies, but better results were obtained with GINs [10]. For this reason, we focus on GINs in this paper. Let $A$ be the adjacency matrix of a graph with $n$ vertices and $\hat{A} = A + I$ be the adjacency matrix with self-connections. We apply the aggregation rule of GIN-0 to the features in the $l$-th layer:
$$H^{(l+1)} = \mathrm{MLP}^{(l)}\big(\hat{A} H^{(l)}\big),$$
where $H^{(l)}$ is the feature matrix of the $l$-th layer and $\mathrm{MLP}^{(l)}$ refers to the multilayer perceptron in the $l$-th layer. GIN was shown to be one of the most powerful GNN architectures from the perspective of discriminative power [10] and is known to be capable of discriminating graphs that GCN cannot. Since nodes are not distinguished from each other, we give a feature matrix of all ones as the input. This characteristic of GINs makes them suitable for our problem setting, where per-node information is scarce and structural information is important.
2.2 AlphaGo Zero
AlphaGo Zero is a well-known superhuman engine for the game of Go [13]. AlphaGo Zero defeated its previous engine AlphaGo [18] without human knowledge, i.e., without using records of the games of professional players.
It trains a deep neural network $f_\theta$ with parameters $\theta$ by reinforcement learning. Given a state $s$ (game board), the network outputs $(p, v) = f_\theta(s)$, where $p$ is the probability vector over moves and $v \in [-1, 1]$ is a scalar denoting the value of the state. If $v$ is close to $1$, the player who takes the corresponding action from state $s$ is very likely to win.
MCTS Nodes and Edges
AlphaGo Zero uses MCTS to obtain $\pi$, a much better probability vector over moves than $p$. To explain the algorithm, we introduce the structure of the MCTS search tree. The search tree is a rooted tree in which each node corresponds to a state $s$ and the root is the initial state. Each edge $(s, a)$ denotes an action $a$ taken at state $s$. Each edge stores an action-value $Q(s, a)$ and a visit count $N(s, a)$.
MCTS and Training
The algorithm runs as follows. Initially, no node in the game tree is expanded, i.e., no state has yet been evaluated by the network.

(Self-Play)

(Simulation)

(Selection) From the root, select the best action iteratively until reaching an unexpanded node. The best action is the action $a$ maximizing
$$Q(s, a) + U(s, a), \qquad (1)$$
where $U(s, a)$ is the upper confidence bound of the action-value. The bound used in AlphaGo Zero is determined by a variant of the PUCT algorithm [19]:
$$U(s, a) = c_{\mathrm{puct}}\, P(s, a)\, \frac{\sqrt{\sum_b N(s, b)}}{1 + N(s, a)}, \qquad (2)$$
where $P(s, a)$ is the prior probability of action $a$ output by the network and $c_{\mathrm{puct}}$ is a non-negative constant to be determined by experiments.

(Expansion) When hitting an unexpanded node $s$, evaluate it with the network and obtain $(p, v) = f_\theta(s)$.

(Backpropagation) After expanding a new node, each edge $(s, a)$ used in the simulation is traversed, and its $N(s, a)$ and $Q(s, a)$ are updated. We increment $N(s, a)$ and update $Q(s, a)$ with the mean of $v$ among all leaf nodes in the subtree below $(s, a)$. Formally, $Q(s, a) = \frac{1}{N(s, a)} \sum_{s'} V(s')$, where the sum runs over every leaf $s'$ that can eventually be reached from $s$ after taking action $a$.
Iterate this simulation a sufficient number of times.


(Play) Select an action and proceed to the next state (game board). Calculate an enhanced probability vector $\pi$ of node selection and select a node according to $\pi$. It is calculated as $\pi_a = N(s, a)^{1/\tau} / \sum_b N(s, b)^{1/\tau}$, where $\tau$ is the temperature. Initially, we use a large $\tau$ and reduce it toward zero as learning proceeds. When $\tau$ is large, the selection is approximately uniformly random; as $\tau \to 0$, the node with the maximum visit count is selected.
Iterate Simulation and Play until the end of the game. We record $(s_t, \pi_t)$ at each Play step $t$. Once the game ends, we know the winner and the loser. Let $z_t = 1$ if the player who made the action at $s_t$ won the game, and $z_t = -1$ otherwise.


(Training) We sample mini-batches from the recorded $(s_t, \pi_t, z_t)$ and minimize the loss
$$\ell = (z - v)^2 - \pi^\top \log p + c \|\theta\|^2, \qquad (3)$$
where $c$ is a non-negative constant for regularization. The network optimizes the probability vector of node selection and the value of the state at the same time.
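The selection rule (1)-(2) and the temperature-smoothed play policy can be sketched as follows; this is a minimal illustration with toy statistics, and the values of `c_puct` and `tau` are arbitrary choices, not the ones used in the paper:

```python
import math

def puct_select(Q, N, P, c_puct=1.0):
    """Pick the action maximizing Q(s,a) + U(s,a), as in eqs. (1)-(2)."""
    total = sum(N.values())
    def ucb(a):
        u = c_puct * P[a] * math.sqrt(total) / (1 + N[a])  # eq. (2)
        return Q[a] + u
    return max(Q, key=ucb)

def play_policy(N, tau=1.0):
    """Enhanced policy: pi_a proportional to N(s,a)^(1/tau)."""
    weights = {a: n ** (1.0 / tau) for a, n in N.items()}
    z = sum(weights.values())
    return {a: w / z for a, w in weights.items()}

# toy statistics for a state with three candidate actions
Q = {0: 0.2, 1: 0.5, 2: 0.1}
N = {0: 10, 1: 30, 2: 5}
P = {0: 0.3, 1: 0.5, 2: 0.2}
best = puct_select(Q, N, P)
pi = play_policy(N, tau=1.0)
```

With `tau = 1`, `pi` is simply the visit counts renormalized; driving `tau` toward zero concentrates all mass on the most-visited action, as described above.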
3 Method
In this section, we give a detailed explanation of our algorithm for solving the Maximum Independent Set (MIS) problem. First, we formalize the problem setting: we define the problem as a Markov Decision Process (MDP) so that reinforcement learning can be applied. Note that our algorithm can be applied not only to MIS but also to any problem that can be reduced to an MDP of the same form. Then we extend MCTS to our problem setting. We divide the simulation in MCTS into three processes: selection, expansion, and backpropagation.
3.1 Reduction to MDP
Let us consider approximately solving MIS in the following way: until the graph becomes empty, (Step 1) choose a vertex of the current graph according to some policy, and (Step 2) add the chosen vertex to the solution set and remove it, together with all of its neighbors, from the graph.
Clearly, the set of nodes selected in Step 2 forms a maximal (not necessarily maximum) independent set. Our aim here is to train a network to estimate a good policy in Step 1. The whole process is an MDP whose states are graphs and whose actions are node selections. In MIS, by setting the reward to $1$ for every state and action and the discount rate to $\gamma = 1$, the return becomes the number of selected nodes, i.e., the size of the maximal independent set. Note that we do not necessarily set the reward and the discount rate in this manner when applying this algorithm to other problems (it depends on what one wants to maximize in that problem). Now the problem is reduced to an MDP, and our goal is to maximize the return. We can apply various algorithms to solve an MDP, including Policy Gradient [20], Q-learning [21], and many of their derivatives [22; 23; 24; 25; 26; 27; 28]. In our preliminary experiments, we tried REINFORCE [29] and deep Q-networks [30], but they did not work as well as MCTS. See Section 3.4 for details.
Since we have reduced MIS to an MDP, any problem that can be reduced to an MDP of the same form can be solved with the same approach. Such problems include the Maximum Clique, the Minimum Vertex Cover, and the Traveling Salesperson Problem, which are also NP-hard [31].
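The reduction above can be sketched as a minimal environment. `MISEnv` and its interface are our own illustration; the per-step reward of 1 follows the reduction described in the text:

```python
import random

class MISEnv:
    """MDP for MIS: a state is a graph, an action selects a vertex.

    Selecting v yields reward 1, then v and its neighbors are removed."""

    def __init__(self, adj):
        # adj: dict mapping each vertex to the set of its neighbors
        self.adj = {v: set(nb) for v, nb in adj.items()}
        self.solution = []

    def done(self):
        return not self.adj  # terminal state: the empty graph

    def step(self, v):
        self.solution.append(v)
        removed = {v} | self.adj[v]          # v and all of its neighbors
        self.adj = {u: nb - removed
                    for u, nb in self.adj.items() if u not in removed}
        return 1  # reward: one more node in the independent set

# random policy on a 4-cycle 0-1-2-3; the return is the solution size
random.seed(0)
env = MISEnv({0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}})
total = 0
while not env.done():
    total += env.step(random.choice(list(env.adj)))
```

On the 4-cycle any maximal independent set has size 2, so the return is 2 regardless of the policy; on general graphs the policy in Step 1 determines how large the maximal set is.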
3.2 Extension of MCTS to Combinatorial Optimization
Here, we extend MCTS to combinatorial optimization problems.
The Difference between Game Search and Combinatorial Optimization
We exploit MCTS together with reinforcement learning, as in AlphaGo Zero. However, we cannot directly apply the same approach to our problem setting. In most MCTS variants used in game search, the value of a state reflects how likely the current player is to win [18; 13; 32]. Commonly, the value lies within $[-1, 1]$, and the larger the value, the more likely the player is to win. The aim of the players is only to win the game, so they are not interested in the cumulative value, i.e., how well they have been playing so far. More precisely, in a game, if a player reaches a state whose value is $-1$, that player will never win the game. In combinatorial optimization, by contrast, say in MIS, an optimal trajectory and a poor trajectory can share many common states.
Another difference is that the size of the input is always the same in AlphaGo Zero, whereas in our case it depends on the size of the graph. To resolve this issue, we use GNNs to encode variable-sized graphs into feature vectors.
Ideas for Extended MCTS
One could define the value of a state $s$ as the size of the maximum independent set of $s$. However, since this value is not normalized, there would be a difference of scale between (1) and (3). For example, in (3), the first term could be very large when the targets and predictions are large. We mitigate this issue as follows.
Given a state $s$ (graph), the network outputs $(p, v) = f_\theta(s)$, where $p$ is the probability vector of node selection and $v$ is a vector of node size whose $a$-th element is the normalized return from state $s$ if we choose vertex $a$ as the next action. We elaborate on the definition of the normalized return in Section 3.3; intuitively, it means "how good that return is compared to the return obtained by random actions."
By virtue of this normalization, we can calculate the loss in the same way as AlphaGo Zero, as in Eq. (3). Moreover, it frees the algorithm from problem-specific scaling: whenever we are to maximize some criterion, we always evaluate an action by how good it is compared to random actions.
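As a minimal sketch of this normalization (our own illustration; the random-rollout statistics are supplied as assumptions):

```python
def normalized_return(ret, mu, sigma):
    """Express 'how good ret is compared to random play' as a z-score.

    mu and sigma are the mean and standard deviation of the return
    obtained by random rollouts from the same state."""
    return (ret - mu) / sigma

# a return of 12 from a state where random play averages 9 with std 2
z = normalized_return(12.0, mu=9.0, sigma=2.0)
```

The resulting quantity is dimensionless, so the value loss in (3) has the same scale regardless of the graph size or the criterion being maximized.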
MCTS Nodes and Edges
Now we are ready to state the algorithm. First, we articulate the structure of the search tree used in our MCTS. Each node of the tree represents a graph $s$, and each edge represents an action $a$ from state $s$. Each edge stores $Q(s, a)$ and $N(s, a)$. Each node stores $\mu_s$ and $\sigma_s$, the mean and the standard deviation of the return from state $s$ under random play. We estimate $\mu_s$ and $\sigma_s$ by sampling when expanding the node.
3.3 Algorithms
Unlike AlphaGo Zero, in order to calculate the exact reward at every state in a trajectory, the expansion phase continues until it reaches the terminal state: a graph with no vertices.
Selection
In a simulation of MCTS, we choose actions in exactly the same way as AlphaGo Zero: we select the action maximizing $Q(s, a) + U(s, a)$, where the upper confidence bound $U(s, a)$ is given by (2).
Expansion
When reaching an unexpanded node $s$, we evaluate $(p, v) = f_\theta(s)$ with the network. Then we initialize $Q(s, a)$ with the estimated normalized action-value $v_a$ for each action $a$. Note that $v$ is a vector and $v_a$ is the element of $v$ corresponding to taking action $a$. We also estimate $\mu_s$ and $\sigma_s$ using random rollouts. Note that $v$ changes as the network learns, while $\mu_s$ and $\sigma_s$ depend only on the state $s$. Therefore, we can cache them in a hash map for reuse when we reach an unexpanded node whose graph has already been seen.
Backpropagation
As in AlphaGo Zero, $N(s, a)$ is incremented by one. The difference from AlphaGo Zero is how the action-value estimator is updated: $Q(s, a)$ is set to the normalized mean of all exact rewards observed from the current state:
$$Q(s, a) = \frac{1}{N(s, a)} \sum_{i} \frac{r_i(s, a) - \mu_s}{\sigma_s}, \qquad (4)$$
where $r_i(s, a)$ is the exact reward obtained by the $i$-th rollout from state $s$ after taking action $a$. In implementation, this is calculated by incrementing the reward by one at each step while going up from the terminal state to the parent state.
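A sketch of this backpropagation, consistent with (4) under the paper's reward of one per selected node; the data-structure layout (per-edge running sums `W`, `N`) is our own:

```python
def backpropagate(path, stats):
    """Update edge statistics after one simulation reached the empty graph.

    path  : list of (edge, state_key) pairs from root to leaf, where each
            edge is a dict with a running sum 'W' of normalized rewards,
            a count 'N', and the estimate 'Q'.
    stats : dict state_key -> (mu, sigma) from random rollouts.
    """
    reward = 0  # exact return, accumulated from the bottom of the path
    for edge, state_key in reversed(path):
        reward += 1                          # one node was selected here
        mu, sigma = stats[state_key]
        edge['N'] += 1
        edge['W'] += (reward - mu) / sigma   # normalized exact reward
        edge['Q'] = edge['W'] / edge['N']    # eq. (4): running normalized mean

# two simulations through the same root edge of a one-step trajectory
stats = {'s0': (2.0, 1.0)}
root_edge = {'W': 0.0, 'N': 0, 'Q': 0.0}
backpropagate([(root_edge, 's0')], stats)
backpropagate([(root_edge, 's0')], stats)
```

In this toy run, each simulation yields an exact return of 1 from a state where random play averages 2, so the normalized action-value settles below zero.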
Play
In training, each play is selected in exactly the same way as AlphaGo Zero, according to the enhanced policy $\pi$. However, we do not use the same approach at inference time, because it is very time-consuming. Note that our goal here is not to obtain the best action in a single state but to find the best sequence of actions for a given input graph. For this reason, we repeat rollouts until the end of the episode. In each rollout, we iteratively select the best node, maximizing the UCB $Q(s, a) + U(s, a)$.
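At inference, each rollout can be sketched as a greedy descent by a UCB-style score. The scoring function is left abstract here; the toy scorer in the usage (preferring low-degree nodes) merely stands in for the learned $Q + U$:

```python
def inference_rollout(adj, score):
    """One greedy rollout: repeatedly pick the node maximizing the score.

    adj   : dict vertex -> set of neighbors
    score : function (adj, v) -> UCB-style value for picking v
    """
    adj = {v: set(nb) for v, nb in adj.items()}
    solution = []
    while adj:
        v = max(adj, key=lambda u: score(adj, u))
        solution.append(v)
        removed = {v} | adj[v]
        adj = {u: nb - removed for u, nb in adj.items() if u not in removed}
    return solution

def best_of(adj, score, n_rollouts):
    """Keep the largest independent set found over several rollouts."""
    return max((inference_rollout(adj, score) for _ in range(n_rollouts)),
               key=len)

# toy scorer: prefer low-degree nodes (a stand-in for the learned UCB)
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
sol = best_of(star, lambda g, v: -len(g[v]), 2)
```

On the star graph, the low-degree heuristic picks the three leaves, which is indeed the maximum independent set.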
Code
The pseudocode of our algorithm is provided in Appendix C.
3.4 Other Approaches to Solving MDP
Here, we discuss approaches other than MCTS that we examined for solving the MDP.
DQN
We built a deep Q-network that estimates the action-value $Q(s, a)$ given a state and an action. We used experience replay and updated the parameters by minimizing the squared TD error $\big(r + \gamma \max_{a'} Q(s', a') - Q(s, a)\big)^2$. This worked for some very small graphs; however, learning was so volatile that once the network accidentally output a large $Q$, the whole parameter set got contaminated and the action-values diverged.
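A tabular sketch of the Q-learning update just described (a toy stand-in for the deep Q-network; the state names and learning rate are hypothetical):

```python
def td_update(Q, s, a, r, s_next, actions_next, alpha=0.1, gamma=1.0):
    """One Q-learning step reducing the squared TD error."""
    target = r + gamma * max((Q.get((s_next, b), 0.0) for b in actions_next),
                             default=0.0)
    td_error = target - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
    return td_error

Q = {}
err = td_update(Q, s='g0', a=0, r=1.0, s_next='g1', actions_next=[0, 1])
```

The `max` over successor actions is the source of the instability noted above: a single overestimated entry is copied into the targets of all of its predecessors.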
Policy Gradient
We also attempted to solve the MDP by Policy Gradient. We used GIN as the policy network and REINFORCE for learning. We found that the learning procedure was highly unstable, as is often seen with Policy Gradient: once the policy was polluted, it took many epochs to recover. Although it empirically outperformed DQN, the MCTS-based approach proposed in this paper is significantly more stable and achieves superior performance.
4 Experiments
In this section, we present the experimental results on solving the MIS problem.
The experiments consist of three parts. In the first part, we visualize how the networks were trained by MCTS. In the second part, we compare our algorithm with S2V-DQN, Li et al.'s supervised method, and Akiba and Iwata's exact algorithm. Lastly, we show the normalized action-value vector for some graphs and verify that the networks automatically detected structural information.
4.1 Global Setup
GIN Model
We tuned hyperparameters and used GIN with layers. Each MLP has two layers with hidden units. The dropout ratio was set to .
MCTS
In the MCTS, we have five hyperparameters: the weight $c_{\mathrm{puct}}$ of the UCB used in (2), the temperature $\tau$ used when calculating the enhanced policy $\pi$, the number of rollouts used to estimate $\mu_s$ and $\sigma_s$ when expanding a new node, the number of simulations, and the L2 regularization constant $c$. We set , , for each epoch in epochs learning, for epochs learning. Also, we set and where is the number of nodes in the graph.
Training and Environment
We trained 10 models with different initial parameters for epochs using the Reedbush-H computing system (CPU: Intel Xeon E5-2695 v4, 2.1 GHz, 18 cores; GPU: NVIDIA Tesla P100 SXM2 x2), and each training run took 9 hours. In each epoch, a random graph with nodes and edges was generated and used for training.
Note that the time complexity for training on a graph is where is the number of edges and is the size of the solution, if we ignore the sizes of feature matrices (which are usually much smaller than the size of the graph).
Unless mentioned otherwise, we used these ten trained models throughout the experiments.
Test Configuration
4.2 Visualization of Training
We visualized how the size of the solution found changed over the training epochs.
In each epoch, we used fixed random graphs of nodes and edges for testing. Note that the size of a maximum independent set for each graph may differ.
Figure 1 shows that the algorithm achieved stable performance. Each colored curve corresponds to a test graph and indicates the improvement of the mean of the solution sizes. See Figure 3 in Appendix D for the results for different initial parameters.
[Figure 1 (data/graphleft3.pdf): mean solution size on each test graph over training epochs]
4.3 Comparison to Other Algorithms
We compared our algorithm with machine learning methods and Akiba and Iwata's Branch-and-Reduce algorithm. We first provide brief descriptions of these algorithms.
Machine Learning Method
Khalil et al. proposed the framework S2V-DQN for combinatorial optimization problems, combining a deep Q-network with a graph embedding network called structure2vec [9]. Li et al. presented an approach to solving the Maximum Independent Set problem that combines machine learning techniques with classical heuristics requiring many reduction rules (see Appendix A). They trained GCNs to predict the likelihood of each node being included in the optimal solution, using a training dataset of pairs $(G, y)$, where $G$ is a graph and $y$ is the label of an optimal solution. Their experimental results demonstrated that it performed as well as the state-of-the-art heuristic solvers for NP-hard problems.
Exact Algorithm
For some NP-hard problems, despite their theoretical hardness, there are practical fast exact algorithms. Akiba and Iwata's [8] Branch-and-Reduce algorithm solves Minimum Vertex Cover (and, equivalently, Maximum Independent Set) and works even for social graphs with millions of nodes. Unfortunately, this algorithm does not work for all graphs: it can easily get stuck on a dense graph with only hundreds of nodes, and it cannot be applied to random graphs with over one thousand nodes either. In Appendix B, we introduce this Branch-and-Reduce algorithm in greater detail and explain why it fails in these corner cases.
Datasets
We tested the algorithms on random graphs of different sizes, generated based on the rules above, and on citation networks (Sen et al., 2008). Citation networks provide real-world directed sparse graphs with labeled nodes. Here, we ignored the class labels and treated each network as an undirected graph. The number of nodes and edges of each graph is listed in Table 1.
Graph    | Vertices | Edges
Cora     | 2708     | 5429
Citeseer | 3327     | 4732
PubMed   | 19717    | 44335
Setup
For our method, we selected the best solution from the trained networks for each graph. For each algorithm, computation was stopped after minutes, and the best solution found by that time was output.
Results
Table 2 shows the solution found on each of the random graphs with (10, 25), (100, 250), (500, 1250), and (1000, 2500) nodes and edges (see details in Table 4 in Appendix D). Our MCTS algorithm successfully found the optimal solutions for the small graphs.
On citation networks, our method obtained much better solutions than S2V-DQN on all three graphs, as shown in Table 3. The solutions for Cora and Citeseer found by our algorithm were optimal. Again, note that our algorithm does not use any of the reduction rules that both Li et al.'s supervised method and the Branch-and-Reduce algorithm rely on. When a graph is highly sparse, many reduction rules, such as "always choose a node with a degree of less than if it exists", can be applied. Therefore, PubMed, the sparsest one, is easier for both the supervised method and the Branch-and-Reduce algorithm.
Another remark is that training was conducted on generated random graphs with nodes and edges, as explained in Section 4.1. Table 2 implies that the size of the graphs used for training can limit accuracy. See Table 5 in Appendix D for the results of experiments in which the network was trained with graphs of different sizes.
Vertices, Edges | MCTS | Supervised | Branch-and-Reduce
10, 25          | 5    | 5          | 5
100, 250        | 44   | 44         | 44
500, 1250       | 212  | 220        | 220
1000, 2500      | 422  | 439        | 439
Graph    | MCTS  | S2V-DQN | Supervised | Branch-and-Reduce
Cora     | 1451  | 1381    | 1451       | 1451
Citeseer | 1867  | 1705    | 1867       | 1867
PubMed   | 15906 | 15709   | 15912      | 15912
4.4 Visualization of the Probability Vector
Figure 2 shows the action-value of each node estimated by the same networks used in the previous experiments. Since we used GIN, the action-values of symmetric nodes are identical. In the right-hand side of Figure 2, both the supervised method and the Branch-and-Reduce algorithm select the leaf nodes by reduction rules, and we can see that our proposed method automatically learned this structure behind the problem. Since the values mean "how good it is to remove that node compared to random moves," the values for the leaf nodes in the right-hand side figure are small compared to those of the red nodes in the left-hand side graph.
[Figure 2 (data/graph31.pdf, data/graph32.pdf): estimated action-values of each node on two example graphs]
5 Conclusion and Future Work
In this paper, we presented an algorithm for solving certain combinatorial optimization problems on graphs without human knowledge. At first, we tried to make the network deep, but we found that even a shallow network can learn Maximum Independent Set. Its performance surpassed that of S2V-DQN, which is also a framework for solving NP-hard problems without domain knowledge.
We have so far focused on MDPs whose action is node selection, because GNNs output feature vectors for each node and it is therefore natural to use GNNs to learn a policy. However, if we have networks that estimate features for edges, then we can combine those networks with MCTS to solve a similar MDP whose action is to choose an edge. The search algorithm, MCTS with normalized returns, is applicable to many settings. It may also work when nodes or edges are weighted. It is therefore interesting to explore the analysis and experiments for such tasks as future work.
Acknowledgement
MS was supported by the International Research Center for Neurointelligence (WPI-IRCN) at The University of Tokyo Institutes for Advanced Study.
References
 Gonzalez [2007] Teofilo F. Gonzalez. Handbook of Approximation Algorithms and Metaheuristics (Chapman & Hall/Crc Computer & Information Science Series). Chapman & Hall/CRC, 2007. ISBN 1584885505.
 Croes [1958] Georges A Croes. A method for solving traveling-salesman problems. Operations research, 6(6):791–812, 1958.
 Reinelt [1991] Gerhard Reinelt. TSPLIB—a traveling salesman problem library. ORSA journal on computing, 3(4):376–384, 1991.
 Lin and Kernighan [1973] Shen Lin and Brian W Kernighan. An effective heuristic algorithm for the traveling-salesman problem. Operations research, 21(2):498–516, 1973.
 Li and Xul [2003] Youmei Li and Zongben Xul. An ant colony optimization heuristic for solving maximum independent set problems. In Proceedings Fifth International Conference on Computational Intelligence and Multimedia Applications. ICCIMA 2003, pages 206–211. IEEE, 2003.

 Hifi [1997] Mhand Hifi. A genetic algorithm-based heuristic for solving the weighted maximum independent set and some equivalent problems. Journal of the Operational Research Society, 48(6):612–622, 1997.
 Li et al. [2018] Zhuwen Li, Qifeng Chen, and Vladlen Koltun. Combinatorial optimization with graph convolutional networks and guided tree search. In Advances in Neural Information Processing Systems, pages 539–548, 2018.
 Akiba and Iwata [2016] Takuya Akiba and Yoichi Iwata. Branch-and-reduce exponential/FPT algorithms in practice: A case study of vertex cover. Theoretical Computer Science, 609:211–225, 2016.
 Khalil et al. [2017] Elias Khalil, Hanjun Dai, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems, pages 6348–6358, 2017.
 Xu et al. [2018] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? CoRR, abs/1810.00826, 2018. URL http://arxiv.org/abs/1810.00826.
 Chaslot et al. [2008] Guillaume Maurice Jean-Bernard Chaslot, Sander Bakkes, István Szita, and Pieter Spronck. Monte-Carlo tree search: A new framework for game AI. In Proc. Artif. Intell. Interact. Digital Entert. Conf., pages 216–217, Stanford Univ., California, 2008. URL http://www.aaai.org/Papers/AIIDE/2008/AIIDE08036.pdf.
 Browne et al. [2012a] Cameron Browne, Edward Powley, Daniel Whitehouse, Simon Lucas, Peter I. Cowling, Stephen Tavener, Diego Perez, Spyridon Samothrakis, Simon Colton, et al. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 2012a.
 Silver et al. [2017] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
 Hamilton et al. [2017a] William L. Hamilton, Rex Ying, and Jure Leskovec. Representation learning on graphs: Methods and applications. CoRR, abs/1709.05584, 2017a.
 Defferrard et al. [2016] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.
 Hamilton et al. [2017b] William L. Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4–9 December 2017, Long Beach, CA, USA, pages 1025–1035, 2017b. URL http://papers.nips.cc/paper/6703inductiverepresentationlearningonlargegraphs.
 Schlichtkrull et al. [2018] Michael Sejr Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. In The Semantic Web - 15th International Conference, ESWC 2018, Heraklion, Crete, Greece, June 3–7, 2018, Proceedings, pages 593–607, 2018. doi: 10.1007/978-3-319-93417-4_38. URL https://doi.org/10.1007/978-3-319-93417-4_38.
 Silver et al. [2016] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484, 2016.

 Rosin [2011] Christopher D Rosin. Multi-armed bandits with episode context. Annals of Mathematics and Artificial Intelligence, 61(3):203–230, 2011.
 Zhao et al. [2011] Tingting Zhao, Hirotaka Hachiya, Gang Niu, and Masashi Sugiyama. Analysis and improvement of policy gradient estimation. pages 262–270, 2011.
 Watkins and Dayan [1992] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine learning, 8(3-4):279–292, 1992.
 Van Hasselt et al. [2016] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double qlearning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
 Mnih et al. [2016] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016.
 Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 Lillicrap et al. [2015] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 Wang et al. [2016] Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample efficient actorcritic with experience replay. arXiv preprint arXiv:1611.01224, 2016.
 Peters and Schaal [2008] Jan Peters and Stefan Schaal. Natural actor-critic. Neurocomputing, 71(7-9):1180–1190, 2008.
 Wang et al. [2015] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.
 Williams [1992] Ronald J. Williams. Simple statistical gradientfollowing algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992.
 Mnih et al. [2013] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
 Woeginger [2003] Gerhard J Woeginger. Exact algorithms for NP-hard problems: A survey. In Combinatorial optimization—eureka, you shrink!, pages 185–207. Springer, 2003.
 Browne et al. [2012b] Cameron B Browne, Edward Powley, Daniel Whitehouse, Simon M Lucas, Peter I Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012b.
 Iwata et al. [2014] Yoichi Iwata, Keigo Oka, and Yuichi Yoshida. Linear-time FPT algorithms via network flow. In Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms, pages 1749–1761. Society for Industrial and Applied Mathematics, 2014.
Appendix A Reduction Rules
To solve the Maximum Independent Set problem exactly, it is believed that exponential time complexity is unavoidable. However, we can improve on the $O^*(2^n)$ time complexity of the brute-force algorithm that searches all subsets of the graph nodes ($O^*$ is a notation for exponential complexity in which polynomial factors are ignored). For example, if a graph has a node of degree 0, that is, an isolated node, it must be included in the maximum independent set. Also, one can easily check that if a node has degree 1, there is a maximum independent set that contains this node. This is the simplest reduction rule, called the pendant rule, but there are indeed many reduction rules for MIS. We introduce three other reduction rules that are used in Li et al.'s supervised method to solve MIS.
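A sketch of exhaustively applying the isolated-node and pendant rules (the graph representation and function names are our own):

```python
def apply_pendant_rules(adj):
    """Repeatedly apply the degree-0 and degree-1 reduction rules.

    adj: dict vertex -> set of neighbors. Returns (forced, reduced):
    nodes placed into a maximum independent set, and the remaining graph."""
    adj = {v: set(nb) for v, nb in adj.items()}
    forced = []
    changed = True
    while changed:
        changed = False
        for v in list(adj):
            if v not in adj:
                continue  # already removed during this pass
            if len(adj[v]) <= 1:              # isolated or pendant node
                forced.append(v)
                removed = {v} | adj[v]        # take v, delete its neighbor
                adj = {u: nb - removed
                       for u, nb in adj.items() if u not in removed}
                changed = True
    return forced, adj

# a path a-b-c-d: the pendant rule alone solves this instance
forced, rest = apply_pendant_rules(
    {'a': {'b'}, 'b': {'a', 'c'}, 'c': {'b', 'd'}, 'd': {'c'}})
```

On the path, the rule fires twice and the graph empties, illustrating why such rules shrink sparse graphs so effectively.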
Degree2 folding
Let $v$ be a node of degree 2 and let $u$ and $w$ be the neighbors of $v$. If $u$ and $w$ are not adjacent, then there is a maximum independent set that contains either $v$ alone or both $u$ and $w$. Therefore, we can merge these three nodes into one new node and later add $v$, or $u$ and $w$, into the maximum independent set.
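A sketch of one folding step (the graph representation and the label chosen for the merged node are our own):

```python
def fold_degree2(adj, v):
    """Fold a degree-2 node v whose neighbors u, w are non-adjacent.

    Merges {v, u, w} into a fresh node adjacent to (N(u) | N(w)) - {v}.
    The MIS size of the folded graph is exactly one less than the
    MIS size of the original graph."""
    u, w = adj[v]
    assert u not in adj[w], "neighbors must be non-adjacent"
    merged = ('fold', v)                     # fresh label for the new node
    neighborhood = (adj[u] | adj[w]) - {v, u, w}
    new_adj = {x: nb - {v, u, w} for x, nb in adj.items()
               if x not in (v, u, w)}
    new_adj[merged] = set(neighborhood)
    for x in neighborhood:
        new_adj[x].add(merged)
    return new_adj, merged

# a path a-v-b-c: folding v leaves a single edge between c and the new node
g = {'a': {'v'}, 'v': {'a', 'b'}, 'b': {'v', 'c'}, 'c': {'b'}}
folded, node = fold_degree2(g, 'v')
```

If the merged node ends up in the maximum independent set of the folded graph, it is expanded back into $u$ and $w$; otherwise $v$ is taken instead.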
Unconfined
Let $N(S)$ denote the set of neighbors of a vertex set $S$ (excluding $S$ itself), and let $N[S] = N(S) \cup S$. Whether a node $v$ is unconfined or not is determined as follows.

1. Let $S = \{v\}$.

2. Find $u \in N(S)$ such that $|N(u) \cap S| = 1$ and the size of $N(u) \setminus N[S]$ is minimized.

3. If there is no such $u$, $v$ is not unconfined.

4. If $N(u) \setminus N[S] = \emptyset$, $v$ is unconfined.

5. If $N(u) \setminus N[S] = \{w\}$ for a single vertex $w$, add $w$ to $S$ and go back to Step 2; otherwise, $v$ is not unconfined.
One can show that there is a maximum independent set without any unconfined vertices. Therefore, unconfined vertices can be removed.
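The confinement check described above can be sketched as follows (our own reading of the procedure; `adj` maps each node to its neighbor set):

```python
def is_unconfined(adj, v):
    """Check whether v is unconfined (and hence removable).

    adj: dict node -> set of neighbors."""
    S = {v}
    while True:
        NS = set().union(*(adj[u] for u in S)) - S   # N(S)
        closed = NS | S                               # N[S]
        best = None
        for u in NS:
            if len(adj[u] & S) == 1:                  # one neighbor in S
                outside = adj[u] - closed             # N(u) \ N[S]
                if best is None or len(outside) < len(best[1]):
                    best = (u, outside)
        if best is None:
            return False        # no qualifying u: v is confined
        outside = best[1]
        if not outside:
            return True         # v is unconfined
        if len(outside) == 1:
            S |= outside        # extend S and repeat
        else:
            return False        # v is confined

# path a-b-c: the middle node is unconfined, the endpoints are not
path = {'a': {'b'}, 'b': {'a', 'c'}, 'c': {'b'}}
mid_removable = is_unconfined(path, 'b')
end_removable = is_unconfined(path, 'a')
```

On the path, the middle node is correctly flagged as removable: the unique maximum independent set consists of the two endpoints.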
Twin
Two non-adjacent vertices $u$ and $v$ of degree 3 are called twins if $N(u) = N(v)$. Let $U = N(u)$ be this set of vertices and let $G[U]$ be the subgraph of $G$ induced by $U$. If $G[U]$ has an edge, there is a maximum independent set that contains both $u$ and $v$. Otherwise, we can remove $u$, $v$, and $U$ from $G$, and introduce a new vertex $w$ connected to $u$'s order-2 neighbors (the vertices at distance two from $u$). If $w$ is in a maximum independent set of the reduced graph, then none of $u$'s order-2 neighbors can be in the MIS, and therefore we can add $U$ to the MIS. If $w$ is not in the MIS, some of $u$'s order-2 neighbors can be in the MIS, and hence $u$ and $v$ can be added to the MIS.
Again, these reduction rules are known only because the Maximum Independent Set problem has been widely studied. If we are to solve a new kind of NP-hard problem for which no reduction rules are known, it is quite hard to construct an efficient algorithm.
These reduction rules tend to be very useful in sparse networks such as social graphs because, in such graphs, most of the nodes have small degrees (e.g., 1 or 2). Therefore, just by iteratively applying these reduction rules, the size of the graph usually becomes much smaller.
Appendix B Branch-and-Reduce Algorithm
Akiba and Iwata [8] proposed an algorithm that combines many branching and reduction rules with lower bounds. Although its worst-case time complexity is exponential, the Branch-and-Reduce algorithm actually works for some graphs with millions of nodes [8]. This is partially because the FPT lower bound used in the algorithm is quite strong. In order to explain the corner cases of the Branch-and-Reduce algorithm, we first explain this FPT algorithm.
FPT Algorithm for MVC
Let us formulate the MVC problem as an IP (Integer Program). For each vertex v, we have a binary variable x_v ∈ {0, 1}, and for each edge (u, v), we have a constraint x_u + x_v ≥ 1. The MVC problem is equivalent to minimizing Σ_v x_v under these constraints. Since solving the IP is NP-hard, we solve the relaxed LP (allowing 0 ≤ x_v ≤ 1) instead. The gap between the LP solution and the IP (exact) solution now serves as the parameter of the FPT algorithm.
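The relaxation can be sketched with an off-the-shelf LP solver (assuming scipy is available; the function name mvc_lp is ours). Since linprog expects constraints of the form A_ub @ x ≤ b_ub, each covering constraint is negated:

```python
import numpy as np
from scipy.optimize import linprog

def mvc_lp(n, edges):
    """Solve the LP relaxation of Minimum Vertex Cover:
    minimize sum(x) s.t. x_u + x_v >= 1 for every edge, 0 <= x <= 1."""
    A = np.zeros((len(edges), n))
    for i, (u, v) in enumerate(edges):
        A[i, u] = A[i, v] = -1.0          # -(x_u + x_v) <= -1
    res = linprog(c=np.ones(n), A_ub=A, b_ub=-np.ones(len(edges)),
                  bounds=[(0, 1)] * n)
    return res.x

# a 5-cycle: the LP optimum assigns 1/2 to every vertex (value 2.5),
# while the integral optimum is 3
x = mvc_lp(5, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)])
```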
One can show that the LP always has an optimal solution with x_v ∈ {0, 1/2, 1} for all v [33]. Moreover, there is always a minimum vertex cover that contains every v with x_v = 1 and excludes every v with x_v = 0. Therefore, if the difference between the LP and IP solutions is small, only a small number of nodes (those with x_v = 1/2) need to be considered to determine the MVC. The time complexity can in fact be bounded by an exponential in k alone, where k is the gap between the size of the optimal solution and the LP lower bound [33].
Corner Cases for the Branch-and-Reduce Algorithm
Since the Branch-and-Reduce algorithm utilizes the FPT lower bound, its time complexity can also be bounded in terms of this gap k. Let us consider a corner case for the Branch-and-Reduce algorithm in terms of this FPT bound. The corner case is actually quite simple: complete graphs. The optimal LP solution for a complete graph with more than one node is x_v = 1/2 for all v, while the minimum vertex cover contains all but one vertex, so the gap k grows linearly with the number of nodes. In such a case, one cannot make use of the FPT lower bound. Indeed, in the experiments of [8], the Branch-and-Reduce algorithm failed to solve MVC within a time limit of 24 hours on some very dense graphs (DIMACS graphs).
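For concreteness, the degenerate LP solution on a complete graph can be verified numerically (a sketch assuming scipy; the size n = 8 is arbitrary):

```python
from itertools import combinations
import numpy as np
from scipy.optimize import linprog

n = 8
edges = list(combinations(range(n), 2))   # complete graph K_n
A = np.zeros((len(edges), n))
for i, (u, v) in enumerate(edges):
    A[i, u] = A[i, v] = -1.0              # x_u + x_v >= 1, negated for linprog
res = linprog(np.ones(n), A_ub=A, b_ub=-np.ones(len(edges)),
              bounds=[(0, 1)] * n)
# LP value is n/2 = 4 with x_v = 1/2 everywhere, while the minimum vertex
# cover of K_8 has size 7, so the FPT parameter k = 7 - 4 grows with n
```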
Since the Branch-and-Reduce algorithm includes a reduction rule for cliques, a complete graph may not be the best corner case; however, the algorithm is generally weak against very dense graphs or random graphs in which the constraints among vertices are quite complex.
Appendix C Pseudocode
In this section, we provide the pseudocode for our algorithm. As explained in Section 3.3, one can make use of a hashmap to store the accumulated search statistics in addition to the network outputs, so that they can be reused.
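The exact contents of the hashmap are specific to the authors' implementation; the following is a generic, AlphaZero-style sketch (the names NodeStats and SearchCache are ours) of caching per-state statistics together with the network output, so that the expensive evaluation is performed once per distinct state:

```python
from dataclasses import dataclass, field

@dataclass
class NodeStats:
    """Per-state statistics reused across MCTS simulations."""
    prior: dict                                   # policy output per action
    visit: dict = field(default_factory=dict)     # visit counts N(s, a)
    value: dict = field(default_factory=dict)     # accumulated values W(s, a)

class SearchCache:
    """Hashmap from a canonical state key to its statistics, so the network
    evaluation and statistics are shared when the same reduced graph is
    reached along different search paths."""
    def __init__(self, evaluate):
        self.evaluate = evaluate   # state -> action priors (e.g., a GCN pass)
        self.table = {}

    def get(self, state_key, state):
        if state_key not in self.table:
            self.table[state_key] = NodeStats(prior=self.evaluate(state))
        return self.table[state_key]
```

A frozenset of the remaining nodes, for example, can serve as the canonical key for a reduced graph.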
Appendix D Additional Experimental Results
In this section, we present additional experimental results to support the findings of the main part of the paper.
Figure 3 shows the improvements in the solution size for each of the five test graphs with 10 different initial parameters. In each epoch, a new random graph is generated for training. In most cases, the learning was completed within the first 30 epochs; however, with some poor initial parameters, it took more than 100 epochs. Although it sometimes took time to converge, the learning was highly stable compared to the Policy Gradient method explored in our preliminary studies: with Policy Gradient, even after the network had produced near-optimal solutions, its performance commonly plunged.
Figure 4 shows how the networks were trained on generated random graphs during the first epochs of training; the test graphs were fixed random graphs. Similarly to Figure 3, there was no significant drop in performance once near-optimal solutions had been obtained.
Table 4 shows a comparison of our MCTS algorithm, Li et al.'s supervised method, and Akiba and Iwata's Branch-and-Reduce algorithm on random graphs. The network was trained on generated random graphs of 100 nodes and 250 edges. It successfully found the optimal solutions for 9 (out of 10) graphs of 100 nodes, while its solutions for the larger graphs were slightly smaller than those of the baselines.
Table 5 shows the difference in performance between networks trained on graphs of 100 nodes and on graphs of 200 nodes. The network trained on graphs of 100 nodes was trained for fewer epochs, since Figure 4 illustrates that it had already converged; the other hyperparameters were set to be the same. MCTS (200), the network trained on random graphs of 200 nodes, obtained optimal solutions for all the graphs of 100 nodes and had slightly better overall performance on random graphs of 500 nodes. Moreover, its performance on random graphs of 1,000 nodes was much better than that of MCTS (100): for the test graph labeled (4), MCTS (200) found a solution whose size is greater than that of MCTS (100) by 15. These results suggest that training the network with larger random graphs can produce better results. Li et al.'s supervised method used graphs with many more nodes for training, which may partially explain why their algorithm obtained good solutions even for larger graphs. If we used such large graphs in our algorithm, it would take days to finish the training, which is one disadvantage of reinforcement learning. However, on social graphs with thousands of nodes, the results were still comparable even though we used small graphs for training, since the structure of social networks is much simpler than that of random graphs.
[Figure 3: improvements in solution size for 10 different initial parameters; panels data/graphleft0.pdf–data/graphleft9.pdf]
Table 4: Comparison of solution sizes on random graphs.
Vertices, Edges (instance)  MCTS  Supervised  Branch-and-Reduce

10, 25 (0)  5  5  5 
10, 25 (1)  4  5  5 
10, 25 (2)  4  5  5 
10, 25 (3)  5  5  5 
10, 25 (4)  3  5  5 
10, 25 (5)  4  5  5 
10, 25 (6)  4  5  5 
10, 25 (7)  4  5  5 
10, 25 (8)  4  5  5 
10, 25 (9)  4  5  5 
100, 250 (0)  44  44  44 
100, 250 (1)  45  45  45 
100, 250 (2)  43  43  43 
100, 250 (3)  45  45  45 
100, 250 (4)  43  44  44 
100, 250 (5)  44  44  44 
100, 250 (6)  46  46  46 
100, 250 (7)  42  42  42 
100, 250 (8)  45  45  45 
100, 250 (9)  42  42  42 
500, 1250 (0)  212  220  220 
500, 1250 (1)  215  221  221 
500, 1250 (2)  210  220  220 
500, 1250 (3)  216  222  222 
500, 1250 (4)  210  214  214 
500, 1250 (5)  213  220  220 
500, 1250 (6)  210  216  216 
500, 1250 (7)  209  216  216 
500, 1250 (8)  210  216  216 
500, 1250 (9)  213  218  218 
1000, 2500 (0)  422  439  439 
1000, 2500 (1)  417  439  438 
1000, 2500 (2)  428  443  443 
1000, 2500 (3)  429  448  448 
1000, 2500 (4)  419  442  442 
1000, 2500 (5)  416  434  433 
1000, 2500 (6)  417  432  430 
1000, 2500 (7)  424  438  437 
1000, 2500 (8)  429  445  445 
1000, 2500 (9)  424  441  441 
[Figure 4: training curves; panels data/graph500_0.pdf–data/graph500_9.pdf]
Table 5: Solution sizes of the networks trained on random graphs of 100 nodes (MCTS (100)) and of 200 nodes (MCTS (200)).
Vertices, Edges (instance)  MCTS (100)  MCTS (200)

100, 250 (0)  44  44 
100, 250 (1)  45  45 
100, 250 (2)  43  43 
100, 250 (3)  45  45 
100, 250 (4)  43  44 
100, 250 (5)  44  44 
100, 250 (6)  46  46 
100, 250 (7)  42  42 
100, 250 (8)  45  45 
100, 250 (9)  42  42 
500, 1250 (0)  212  216 
500, 1250 (1)  215  218 
500, 1250 (2)  210  213 
500, 1250 (3)  216  215 
500, 1250 (4)  210  208 
500, 1250 (5)  213  211 
500, 1250 (6)  210  210 
500, 1250 (7)  209  209 
500, 1250 (8)  210  209 
500, 1250 (9)  213  214 
1000, 2500 (0)  422  428 
1000, 2500 (1)  417  422 
1000, 2500 (2)  428  434 
1000, 2500 (3)  429  437 
1000, 2500 (4)  419  434 
1000, 2500 (5)  416  413 
1000, 2500 (6)  417  414 
1000, 2500 (7)  424  425 
1000, 2500 (8)  429  436 
1000, 2500 (9)  424  432 