Learning Heuristics over Large Graphs via Deep Reinforcement Learning

In this paper, we propose a deep reinforcement learning framework called GCOMB to learn algorithms that can solve combinatorial problems over large graphs. GCOMB mimics the greedy algorithm for the original problem and incrementally constructs a solution. The proposed framework uses a Graph Convolutional Network (GCN) to generate node embeddings that predict the potential solution nodes from the entire node set. These embeddings enable an efficient training process to learn the greedy policy via Q-learning. Through extensive evaluation on several real and synthetic datasets containing up to a million nodes, we establish that GCOMB is up to 41% better than the state of the art, up to seven times faster than the greedy algorithm, robust, and scalable to large dynamic networks.



1 Introduction

Optimization problems on graphs appear routinely in various applications such as viral marketing in social networks [Kempe et al.2003], computational sustainability [Dilkina et al.2011], and health-care [Wilder et al.2018]. These optimization problems are often combinatorial in nature, which results in NP-hardness. Therefore, designing an exact algorithm is infeasible and polynomial-time algorithms, with or without approximation guarantees, are often desired and used in practice [Goyal et al.2011, Jung et al.2012, Medya et al.2018]. Furthermore, these graphs are often dynamic in nature and the approximation algorithms need to be run repeatedly at regular intervals. Since real-world graphs may contain millions of nodes and edges, this entire process becomes tedious and time-consuming.

To provide a concrete example, consider the problem of viral marketing on social networks. Given a graph $G$ and a budget $b$, the goal is to select $b$ nodes (users) from the graph such that their endorsement of a certain product (e.g., through a tweet) is expected to initiate a cascade that reaches the largest number of nodes in the graph. It has been shown that this problem is NP-hard by reducing it to the max-coverage problem [Kempe et al.2003]. Advertising through social networks is a common practice today, and the problem needs to be solved repeatedly because the networks are dynamic in nature. Furthermore, even the greedy approximation algorithm has been shown to not scale on large networks [Arora et al.2017].

At this juncture, we highlight two key observations. First, although the graph is changing, the underlying model generating the graph is likely to remain the same. Second, the nodes that get selected in the answer set of the approximation algorithm may have certain properties in common. Motivated by these observations, we ask the following question: Given an optimization problem $P$ on a graph $G$ from a distribution $D$ of graph instances, can we learn an approximation algorithm and solve the problem on an unseen graph generated from distribution $D$? In this paper, we show that this is indeed possible.

The above observation was first highlighted by Khalil et al. [Khalil et al.2017], who proposed an algorithm to learn combinatorial algorithms on graphs. Unfortunately, this study is limited to small networks, and hence performance on real networks containing millions of nodes and edges remains to be seen. In this work, we bridge this gap. Specifically, we develop a deep reinforcement learning based architecture, called GCOMB, to learn combinatorial algorithms on graphs at scale. We also show that GCOMB outperforms [Khalil et al.2017].

Our contributions are summarized as follows:

  • Novel Framework. We propose a deep reinforcement learning based framework called GCOMB to learn algorithms for combinatorial problems on graphs. GCOMB first generates node embeddings through Graph Convolutional Networks (GCN). These embeddings encode the effect of a node on the budget-constrained solution set. Next, these embeddings are fed to a neural network to learn a $Q$-function and predict the solution set.

  • Application. We benchmark GCOMB on datasets containing up to a million nodes. The results show that GCOMB is up to seven times faster than greedy, up to 41% better than the state-of-the-art neural method for learning algorithms [Khalil et al.2017], and scalable. More significantly, GCOMB can be operationalized on real networks to solve practical problems.

2 Problem Formulation and Preliminaries

Our goal is to learn algorithms for solving combinatorial optimization problems on graphs. Formally, we define our learning task as follows.

Problem 1.

Given a combinatorial optimization problem $P$ over graphs drawn from distribution $D$, learn a heuristic to solve problem $P$ on an unseen graph $G$ generated from $D$.

The input to our problem is therefore a set of training graphs from distribution $D$ and the set of solution sets from an approximation algorithm for problem $P$ on each of these graphs. Given an unseen graph $G$ from $D$, we need to predict its solution set corresponding to problem $P$.

2.1 Instances of the proposed problem

To motivate our work, we discuss some graph combinatorial problems that have practical applications and fit well within our framework.

Definition 1 (Maximum Coverage Problem (MCP)).

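Given a collection of subsets $S_1, \dots, S_m$ from a universal set of items $U = \{u_1, \dots, u_n\}$, the problem is to choose at most $b$ sets to cover as many items as possible.

The MCP problem is the optimization version of the classical Set Cover decision problem. The MCP problem can be equivalently expressed on a bipartite graph $G = (V, E)$ (with $m + n$ nodes) as follows: There are two sets of nodes ($V = A \cup B$); $A$ contains one node $s_i$ per subset $S_i$ and $B$ contains one node per item $u_j$. There is an undirected edge $(s_i, u_j)$ whenever $u_j \in S_i$. Given a budget $b$, the goal is to find a set $S$ of $b$ nodes in $A$ such that $|X|$ is maximized, where $X = \{u_j \in B : (s_i, u_j) \in E,\ s_i \in S\}$.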

The MCP problem is used as a building block of many problems on graphs. Influence maximization on social networks [Kempe et al.2003] is one such prominent example.

Definition 2 (Minimum Vertex Cover (MVC)).

Given an undirected graph $G = (V, E)$, find the smallest subset of vertices $S \subseteq V$, such that each edge in the graph is incident to at least one vertex in $S$.

MVC is a decision problem. An optimization version of MVC can be defined in the same manner as MCP [Apollonio and Simeone2014]. Specifically, given a graph $G = (V, E)$ and a budget $b$, find a set $S$ of $b$ nodes such that $|X|$ is maximized, where $X = \{(u, v) \in E : u \in S \text{ or } v \in S\}$. The MVC problem has applications in several domains, with drug discovery being one of the highlights [Guha et al.2002].
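As a small illustration of the two budgeted objectives defined above, the following sketch evaluates the coverage of a candidate set on a toy instance. The function names `mcp_coverage` and `mvc_coverage` and the toy data are assumptions made for this example and do not come from the paper.

```python
def mcp_coverage(subsets, chosen):
    """MCP objective: number of distinct items covered by the chosen subsets."""
    covered = set()
    for name in chosen:
        covered |= subsets[name]
    return len(covered)


def mvc_coverage(edges, chosen):
    """Budgeted MVC objective: number of edges with an endpoint in `chosen`."""
    chosen = set(chosen)
    return sum(1 for u, v in edges if u in chosen or v in chosen)


# Toy instance (hypothetical data).
subsets = {"s1": {1, 2, 3}, "s2": {3, 4}, "s3": {5}}
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]

print(mcp_coverage(subsets, ["s1", "s2"]))  # items 1, 2, 3, 4 are covered
print(mvc_coverage(edges, [2]))             # edges touching node 2
```

Both functions are monotone in the chosen set, which is what makes the greedy strategy of the next section a natural fit.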

2.2 The greedy approach

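Input: Graph $G = (V, E)$, optimization function $f(\cdot)$, budget $b$
Output: Solution set $S$, $|S| = b$
1:  $S \leftarrow \emptyset$
2:  $i \leftarrow 0$
3:  while $i < b$ do
4:     $v^* \leftarrow \arg\max_{v \in V \setminus S} \bigl(f(S \cup \{v\}) - f(S)\bigr)$
5:     $S \leftarrow S \cup \{v^*\}$, $i \leftarrow i + 1$
6:  Return $S$
Algorithm 1 The greedy approach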

The greedy approach is one of the most popular and well-performing strategies to solve combinatorial problems on graphs. Alg. 1 presents the pseudocode. The input to the algorithm is a graph $G = (V, E)$, an optimization function $f(S)$ on a set of nodes $S$, and the budget $b$. Starting from an empty solution set $S$, the solution is built iteratively by adding the “best” node to $S$ in each iteration (lines 3-5). The best node is the one that provides the highest marginal gain on the optimization function (line 4). The process ends after $b$ iterations, where $b$ is the budget.
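The greedy procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the objective `f` is passed as a callable on a list of nodes, and the `covers` instance is hypothetical.

```python
def greedy(candidates, f, budget):
    """Repeatedly add the node with the highest marginal gain on f."""
    solution = []
    remaining = set(candidates)
    for _ in range(min(budget, len(remaining))):
        base = f(solution)
        # sorted() only makes tie-breaking deterministic.
        best = max(sorted(remaining), key=lambda v: f(solution + [v]) - base)
        solution.append(best)
        remaining.remove(best)
    return solution


# Hypothetical MCP instance: each candidate node covers a set of items.
covers = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}


def coverage(selected):
    return len(set().union(*(covers[v] for v in selected))) if selected else 0


picked = greedy(covers, coverage, budget=2)
print(picked, coverage(picked))
```

Each iteration evaluates `f` once per remaining node, which is the cost that becomes prohibitive on large graphs and that a learned policy aims to avoid.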

3 GCOMB

Figure 1: The flowchart of the training phase of GCOMB.

GCOMB consists of two phases: the training phase and the testing phase. The input to the training phase is a set of graphs and the optimization function $f$ corresponding to the combinatorial problem being solved. The output of the training phase is a sequence of two different neural networks with their corresponding learned parameters. In the testing phase, the inputs are identical to those of the greedy algorithm: the graph $G$, the optimization function $f$, and the budget $b$. The output of the testing phase is the solution set, which is constructed using the learned neural networks from the training phase.

Fig. 1 presents the pipeline of the training phase. The training phase can be divided into two parts: a network embedding phase through a Graph Convolutional Network (GCN) and a $Q$-learning phase. Given a training graph and its solution set, the GCN learns network embeddings that separate the potential solution nodes from the rest. Next, the embeddings of only the potential solution nodes are fed to a $Q$-learning framework, which allows us to predict those nodes that collectively form a good solution set. The next sections elaborate further on these two key components of the training phase.

3.1 Embedding via GCN

Our goal is to learn embeddings of the nodes such that they can predict the nodes that are likely to be part of the answer set. Towards that, one can set up a classification-based pipeline, where, given a training graph $G = (V, E)$ and its greedy solution set $S$ corresponding to the optimization function $f$, a node $v$ is called positive if $v \in S$; otherwise it is negative. One can next train a neural network to learn node embeddings with an appropriate loss function such as cross-entropy loss. This approach, however, has two key weaknesses. First, it assumes all nodes that are not a part of $S$ to be equally bad. In reality this may not be the case. To elaborate, consider the case where $f(\{v_1\}) = f(\{v_2\})$, but the marginal gain of node $v_2$ given $S = \{v_1\}$, i.e., $f(\{v_1, v_2\}) - f(\{v_1\})$, is $0$ and vice versa. In this scenario, only one of $v_1$ and $v_2$ would be selected in the answer set although both are of equal quality on their own.

To capture the above aspect of a combinatorial optimization problem, we sample from the solution space and learn embeddings that reflect the probability of a node being part of the solution. To sample from the solution space, we perform a probabilistic version of the greedy search in Alg. 1. Specifically, in each iteration, instead of selecting the node with the highest marginal gain, we choose a node with probability proportional to its marginal gain. The probabilistic greedy algorithm runs $m$ times to construct $m$ different solution sets, and the score of node $v$ is set to $score(v) = c_v / c_{\max}$, where $c_v$ is the number of solution sets in which $v$ appears and $c_{\max}$ denotes the maximum number of times a node appeared in the solution sets.
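The sampling step described above can be sketched as follows. This is an illustrative sketch rather than the paper's code: the helper names, the uniform fallback when every marginal gain is zero, and the choice of appearance counts as the numerator of the score are assumptions; only the normalization by the maximum count is stated in the text. The objective is assumed to be monotone, so marginal gains are non-negative.

```python
import random
from collections import Counter


def probabilistic_greedy(candidates, f, budget, rng):
    """Greedy variant that picks a node with probability proportional to its gain."""
    solution, remaining = [], sorted(candidates)
    while len(solution) < budget and remaining:
        base = f(solution)
        gains = [f(solution + [v]) - base for v in remaining]
        if sum(gains) <= 0:
            pick = rng.choice(remaining)  # assumption: uniform fallback
        else:
            pick = rng.choices(remaining, weights=gains, k=1)[0]
        solution.append(pick)
        remaining.remove(pick)
    return solution


def node_scores(candidates, f, budget, runs, seed=0):
    """Score = appearances across sampled solutions / maximum appearances."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(runs):
        counts.update(probabilistic_greedy(candidates, f, budget, rng))
    top = max(counts.values())
    return {v: counts[v] / top for v in candidates}


# Hypothetical coverage instance.
covers = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}


def coverage(selected):
    return len(set().union(*(covers[v] for v in selected))) if selected else 0


scores = node_scores(covers, coverage, budget=2, runs=50, seed=1)
print(scores)
```

Unlike a binary positive/negative label, these scores give partial credit to nodes that are good substitutes for one another.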

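Input: Graph $G = (V, E)$, input features $x_v$ for all $v \in V$, depth $K$, weight matrices $W^k$ for $k = 1, \dots, K$ and weight vector $w$, dimension size $d$.
Output: $d$-dimensional vector representations $\mu_v$ for all $v \in V$
1:  $h_v^0 \leftarrow x_v$ for all $v \in V$
2:  for $k = 1, \dots, K$ do
3:     for $v \in V$ do
4:        $N(v) \leftarrow$ sampled set of neighbors of $v$
5:        $h_N^k \leftarrow \mathrm{MaxPool}(\{h_u^{k-1} : u \in N(v)\})$
6:        $h_v^k \leftarrow \mathrm{ReLU}(W^k \cdot \mathrm{CONCAT}(h_N^k, h_v^{k-1}))$
8:  $score'(v) \leftarrow w^T h_v^K$ for all $v \in V$
9:  $\mu_v \leftarrow h_v^K$ for all $v \in V$
Algorithm 2 Graph Convolutional Network (GCN)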

Given $score(v)$ for each node, our next task is to learn embeddings that can predict this score. Towards that, we use a Graph Convolutional Network (GCN) [Hamilton et al.2017]. The pseudocode for this component is provided in Alg. 2. Each iteration in the outer loop represents the depth (line 2). In the inner loop, we iterate over all nodes (line 3). While iterating over node $v$, we fetch the current representations of a sampled set of $v$’s neighbors and aggregate them through a MaxPool layer (lines 4-5). The MaxPool of a set of vectors of dimension $d$ returns a $d$-dimensional vector by taking the element-wise maximum across each dimension. The aggregated vector is next concatenated with the representation of $v$, which is then fed through a fully connected layer with ReLU activation function (line 6), where ReLU is the rectified linear unit ($\mathrm{ReLU}(z) = \max(0, z)$). The output of this layer becomes the input to the next iteration of the outer loop. Intuitively, in each iteration of the outer loop, nodes aggregate information from their local neighbors, and with more iterations, nodes incrementally receive information from neighbors of higher depth (i.e., distance).

At depth $0$, the embedding of each node $v$ is $h_v^0 = x_v$, while the final embedding is $\mu_v = h_v^K$ (line 9). In the fully connected layers, Alg. 2 requires the parameter set $W = \{W^k\}$ to apply the non-linearity (line 6). Intuitively, $W^k$ is used to propagate information across different depths of the model.
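To make the aggregation step concrete, here is a dependency-free sketch of the forward pass described above: sample neighbors, MaxPool their representations, concatenate with the node's own representation, then apply a fully connected layer with ReLU. The function names, the fixed toy weights, and the zero vector used for isolated nodes are assumptions for illustration; a real implementation would use a tensor library and learned weights.

```python
import random


def relu(vec):
    return [max(0.0, x) for x in vec]


def max_pool(vectors, dim):
    """Element-wise maximum; zero vector for a node without neighbors (assumption)."""
    if not vectors:
        return [0.0] * dim
    return [max(col) for col in zip(*vectors)]


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def gcn_forward(adj, features, weights, num_samples, rng):
    """adj: node -> list of neighbors; weights: one matrix per depth."""
    h = {v: list(x) for v, x in features.items()}
    for W in weights:
        dim = len(next(iter(h.values())))
        new_h = {}
        for v, nbrs in adj.items():
            sampled = rng.sample(nbrs, min(num_samples, len(nbrs)))
            pooled = max_pool([h[u] for u in sampled], dim)
            # Concatenate the pooled neighborhood with the node's own vector.
            new_h[v] = relu(matvec(W, pooled + h[v]))
        h = new_h
    return h


# Path graph 0 - 1 - 2 with node degree as the only input feature.
adj = {0: [1], 1: [0, 2], 2: [1]}
features = {0: [1.0], 1: [2.0], 2: [1.0]}
W1 = [[1.0, 0.5], [-1.0, 1.0]]                     # toy weights, depth 1
W2 = [[0.5, 0.0, 0.5, 0.0], [0.0, 1.0, 0.0, -1.0]]  # toy weights, depth 2
emb = gcn_forward(adj, features, [W1, W2], num_samples=2, rng=random.Random(0))
print(emb)
```

With two layers, the embedding of node 0 already depends on node 2, which is two hops away, matching the intuition that deeper iterations reach more distant neighbors.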

To train the parameter set and obtain predictive representations, the final representations are passed through another fully connected layer to obtain their predicted value $score'(v)$ (line 8). The parameters for the proposed framework are therefore the weight matrices $W^k$ and the weight vector $w$. To learn these parameters, denoted $\Theta_G$, we apply stochastic gradient descent on the mean squared error loss function. Specifically,

$$J(\Theta_G) = \frac{1}{|V|} \sum_{v \in V} \bigl(score(v) - score'(v)\bigr)^2$$


Defining $x_v$: The initial feature vector $x_v$ at depth $0$ should have the raw features that are relevant with respect to the combinatorial problem being solved. For example, in MCP and MVC, the degree of a node is an indicator of its own coverage. As we discuss later in Sec. 4, we use only node degree as the node feature. In principle, any feature can be used, including node labels.

While in Alg. 2, the parameters are learned by minimizing the loss function across all nodes, in practice, we use minibatches of a small sample of nodes.

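Input: Training graph $G$, hyper-parameters $M$, $N$ related to fitted $Q$-learning, number of episodes $L$ and sample size $T$.
Output: Learned parameter set $\Theta_Q$
1:  Initialize experience replay memory $M$ to capacity $N$
2:  for episode $e \leftarrow 1$ to $L$ do
3:     for step $t \leftarrow 1$ to $T$ do
6:         if $t \geq n$ then
7:            Add the $n$-step tuple to $M$
8:            Sample random batch $B$ from $M$
9:            Update the parameters $\Theta_Q$ by SGD for $B$
10:  return $\Theta_Q$
Algorithm 3 Learning $Q$-function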

3.2 Learning $Q$-function

While GCN captures the individual importance of a node towards a particular combinatorial problem, through $Q$-learning [Sutton and Barto2018] we capture nodes that collectively form a good solution set. More specifically, given some set of nodes $S$ and a node $v \notin S$, we aim to predict $Q(S, v)$ (intuitively, the long-term reward for adding $v$ to $S$) through the surrogate function $Q'(S, v; \Theta_Q)$. For any $Q$-learning task, we need to define the following five aspects: state space, actions, rewards, policy and termination.

  • State space: A state is the aggregation of two sets of nodes: nodes selected in the current solution set $S$ and those not selected, i.e., $V \setminus S$. Thus, the state corresponding to solution set $S$ is captured using two vectors: $\mu_S$ and $\mu_{V \setminus S}$.

  • Action: An action corresponds to adding a node $v \notin S$ (represented as $\mu_v$) to the solution set.

  • Rewards: The reward function at state $S$ is the marginal gain of adding node $v$ to $S$, i.e., $r(S, v) = f(S \cup \{v\}) - f(S)$.

  • Policy: The policy (given state and action) is deterministic and, as in the greedy policy, selects the node with the highest predicted marginal gain, i.e., $\pi(S) = \arg\max_{v \notin S} Q'(S, v; \Theta_Q)$.

  • Termination: We terminate when $|S| = b$; $b$ is the budget.
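The five aspects listed above map naturally onto a small environment class. The sketch below is an illustration under assumptions: the class and function names are hypothetical, and the true marginal gain stands in for a learned $Q'$ so that the example runs without any trained model.

```python
class SolutionEnv:
    """State = current solution set; action = add one unselected node."""

    def __init__(self, nodes, f, budget):
        self.nodes = list(nodes)
        self.f = f
        self.budget = budget
        self.solution = []

    def actions(self):
        # Nodes not yet in the solution set.
        return [v for v in self.nodes if v not in self.solution]

    def step(self, v):
        # Reward is the marginal gain of adding v to the current solution.
        reward = self.f(self.solution + [v]) - self.f(self.solution)
        self.solution.append(v)
        return reward, self.done()

    def done(self):
        # Terminate once the budget is exhausted.
        return len(self.solution) >= self.budget


def greedy_policy(env, q):
    """Deterministic policy: pick the action with the highest predicted value."""
    return max(env.actions(), key=lambda v: q(env.solution, v))


# Hypothetical coverage instance; q_stand_in replaces a learned Q'.
covers = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}}


def coverage(selected):
    return len(set().union(*(covers[v] for v in selected))) if selected else 0


def q_stand_in(solution, v):
    return coverage(solution + [v]) - coverage(solution)


env = SolutionEnv(covers, coverage, budget=2)
rewards = []
while not env.done():
    reward, _ = env.step(greedy_policy(env, q_stand_in))
    rewards.append(reward)
print(env.solution, rewards)
```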

Learning : Alg. 3 presents the pseudocode of learning the parameter set . We partition into four weight vectors , , , such that, , where


If the dimension of the initial node embeddings is , the dimensions of the weight vectors are as follows: . In Eq. 3, ReLU is applied element-wise to its input vector.

The standard $Q$-learning updates parameters in a single episode via an SGD step to minimize the squared loss.


$S_t$ denotes the current solution set, $\gamma$ is the discount factor, and $v_t$ is the considered node. To better learn the parameters, we perform $n$-step $Q$-learning instead of $1$-step $Q$-learning. $n$-step $Q$-learning incorporates delayed rewards, where the final reward of interest is received later in the future during an episode (lines 6-9). This avoids the myopic setting of the $1$-step update. The key idea here is to wait for $n$ steps so that the approximator’s parameters are updated and therefore more accurately estimate future rewards. To incorporate $n$-step rewards, Eq. 5 is modified as follows.
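For reference, the textbook forms of the $1$-step and $n$-step squared losses can be written as below. The notation ($Q'$, $\Theta_Q$, $r$, $S_t$, $v_t$, $\gamma$) is assumed for this sketch and follows the usual $Q$-learning formulation; it is not a confirmed transcription of the paper's own equations, which may, for example, leave the intermediate rewards undiscounted.

```latex
% 1-step squared loss
J(\Theta_Q) = \Bigl( r(S_t, v_t)
  + \gamma \max_{v'} Q'(S_{t+1}, v'; \Theta_Q)
  - Q'(S_t, v_t; \Theta_Q) \Bigr)^2

% n-step squared loss: accumulate n rewards before bootstrapping
J(\Theta_Q) = \Bigl( \sum_{i=0}^{n-1} \gamma^{i}\, r(S_{t+i}, v_{t+i})
  + \gamma^{n} \max_{v'} Q'(S_{t+n}, v'; \Theta_Q)
  - Q'(S_t, v_t; \Theta_Q) \Bigr)^2
```

Setting $n = 1$ in the second expression recovers the first, which is a quick consistency check on the two forms.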


Efficiency: For efficient learning of the parameters, we perform two optimizations. First, we exploit fitted $Q$-iteration [Riedmiller2005], which results in faster convergence using a neural network as a function approximator [Mnih et al.2013]. Specifically, instead of updating the $Q$-function sample-by-sample, the fitted $Q$-iteration approach uses experience replay with a batch of samples from a previously populated dataset ($M$, defined in line 1 of Alg. 3). Second, we reduce the state space by learning marginal gains only for the top-scoring nodes according to the predicted values learned during the GCN step.
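The experience replay mechanism mentioned above can be sketched with a bounded buffer plus a helper that forms $n$-step transitions. The class name, the helper name, and the tuple layout (start state, action, summed rewards, state after $n$ steps) are assumptions chosen for illustration; the text does not spell out the exact tuple stored.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity buffer; the oldest transitions are evicted first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size, rng):
        # Uniform sampling without replacement from the stored transitions.
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))


def n_step_transitions(states, actions, rewards, n):
    """Yield (S_{t-n}, v_{t-n}, sum of n rewards, S_t) for every t >= n."""
    for t in range(n, len(states)):
        yield states[t - n], actions[t - n], sum(rewards[t - n:t]), states[t]


# One hypothetical episode: states S_0..S_3, three actions, three rewards.
states = [(), ("c",), ("c", "a"), ("c", "a", "b")]
actions = ["c", "a", "b"]
rewards = [4, 3, 0]

memory = ReplayMemory(capacity=100)
for transition in n_step_transitions(states, actions, rewards, n=2):
    memory.add(transition)

batch = memory.sample(batch_size=2, rng=random.Random(0))
print(batch)
```

Sampling batches from the buffer, rather than updating on each new transition, is what decorrelates consecutive updates.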

3.3 Summary

The entire pipeline of GCOMB works as follows.

  • Training Phase: Given a training graph $G$ and optimization function $f$, learn parameter sets $\Theta_G$ and $\Theta_Q$ corresponding to the GCN component and the $Q$-learning component. This is a one-time, offline computation. The sub-tasks in the training phase are:

    • Learn node embeddings $\mu$ along with $\Theta_G$.

    • Feed $\mu$ to the $Q$-learning framework and learn $\Theta_Q$.

  • Testing Phase: Given an unseen graph $G$,

    • Embed all nodes using $\Theta_G$.

    • Iteratively compute the solution set based on the learned $Q$-function. Specifically, in each iteration we add the node $v^* = \arg\max_{v \notin S_t} Q'(S_t, v; \Theta_Q)$, where $S_t$ is the solution set in the $t$-th iteration. As in greedy (Alg. 1), we iterate for $b$ iterations, where $b$ is the budget.
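The testing phase summarized above can be sketched as a short selection loop. This is a hedged illustration: `predict_solution`, the cut-off `k` on the number of candidates, and the stand-in `q` function are hypothetical names introduced here, and the predicted scores are supplied directly instead of being produced by a trained GCN.

```python
def predict_solution(scores, q, budget, k):
    """Keep the k best-scored nodes, then add the argmax-Q node each iteration."""
    candidates = sorted(scores, key=scores.get, reverse=True)[:k]
    solution = []
    while len(solution) < budget and len(solution) < len(candidates):
        best = max((v for v in candidates if v not in solution),
                   key=lambda v: q(solution, v))
        solution.append(best)
    return solution


# Hypothetical predicted scores and a coverage-based stand-in for the learned Q'.
scores = {"a": 0.9, "b": 0.2, "c": 1.0, "d": 0.1}
covers = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}


def coverage(selected):
    return len(set().union(*(covers[v] for v in selected))) if selected else 0


def q(solution, v):
    return coverage(solution + [v]) - coverage(solution)


print(predict_solution(scores, q, budget=2, k=3))
```

Because only `k` candidates are ever evaluated, the per-iteration cost no longer grows with the size of the graph.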

4 Experimental Results

In this section, we benchmark GCOMB and establish:

  • Quality: GCOMB is up to 41% better in quality than the state-of-the-art technique [Khalil et al.2017].

  • Scalability: GCOMB scales to million-sized networks where [Khalil et al.2017] crashes. Furthermore, GCOMB achieves almost the same quality as the greedy algorithm, while being up to seven times faster.

  • Application: GCOMB is applicable to dynamic networks and achieves quality at par with the greedy algorithm, which offers the best possible polynomial-time approximation guarantee for MCP unless $P = NP$.

4.1 Experimental Setup

All experiments are performed on a machine running an Intel Xeon E5-2698v4 processor with 20 cores, Nvidia 1080 Ti GPU cards, and 512 GB RAM, with the Ubuntu 16.04 operating system. All our code is written in Python with the support of TensorFlow.

Dataset Nodes Edges
loc-Gowalla (LG) 196.5K 950.3K
loc-Brightkite (LB) 58.2K 214K
sx-mathoverflow (SM) 24.8K 506.5K
Table 1: Dataset description and statistics of real networks.

Datasets: We use both synthetic and real datasets for our experiments. For synthetic dataset, we generate graphs from two different models:

  • Barabási–Albert (BA): In BA, the default edge density is set to , i.e., . We use the notation BA- to denote the size of the generated network, where is the number of nodes.

  • Bipartite Graph (BP): [Khalil et al.2017] proposes a model to generate bipartite graphs as follows: Given the number of nodes, they are partitioned into two sets with nodes in one side and the rest in other. The edge between any pair of nodes from different partitions is generated with probability .
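The two synthetic models described above can be sketched with the standard library alone. The parameters `left_fraction`, `edge_prob`, and `m`, and the values used in the example calls, are placeholders chosen for illustration, not the settings used in the experiments; the Barabási–Albert routine is a simple preferential-attachment sketch rather than a reference implementation.

```python
import random


def bipartite_graph(num_nodes, left_fraction, edge_prob, rng):
    """Split nodes into two sides; add each cross edge independently."""
    n_left = int(num_nodes * left_fraction)
    left = list(range(n_left))
    right = list(range(n_left, num_nodes))
    edges = [(u, v) for u in left for v in right if rng.random() < edge_prob]
    return left, right, edges


def barabasi_albert(num_nodes, m, rng):
    """Each new node attaches to m distinct existing nodes, chosen by degree."""
    edges = []
    repeated = list(range(m))  # m seed nodes, each listed once
    for new in range(m, num_nodes):
        targets = set()
        while len(targets) < m:
            # Nodes appear in `repeated` once per incident edge.
            targets.add(rng.choice(repeated))
        for t in targets:
            edges.append((new, t))
        repeated.extend(targets)
        repeated.extend([new] * m)
    return edges


rng = random.Random(0)
left, right, bp_edges = bipartite_graph(10, 0.2, 0.5, rng)  # placeholder values
ba_edges = barabasi_albert(50, 3, rng)                      # placeholder values
print(len(bp_edges), len(ba_edges))
```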

In addition, we also use the real datasets (http://snap.stanford.edu/data/index.html) listed in Table 1. Among them, sx-mathoverflow (SM) is a dynamic (temporal) network where each edge is annotated with a timestamp.

Baselines: We denote our method as GCOMB, the greedy algorithm as GR, and the state-of-the-art method from [Khalil et al.2017] as SoA. For SoA, we use the code shared by the authors. Note that computing the optimal solution is not feasible since the problems being learned, such as MCP and MVC, are NP-hard. GR guarantees a $(1 - 1/e)$ approximation for both MCP and MVC (optimization version). Given a budget $b$, we also compute a random selection of $b$ nodes. This baseline is called Random.

Other Settings:

GCN is trained for 200 epochs with a learning rate of 0.0005, a dropout rate of 0.1 and a convolution depth (

) of . For training the n-step Q-Learning neural network, is set to and a learning rate of 0.0001 is used. In each epoch of training, training examples are sampled uniformly from the Replay Memory M as described in Alg. 3. This Q Learning network is trained for - graph instances with .

4.2 Comparison with State-of-the-art

Graph SoA GCOMB Gain
Table 2: MCP: Comparison of our method with the state-of-the-art method on synthetic graphs (BP). In BP-X, X denotes the number of nodes. Gain is defined as the ratio between coverage produced by GCOMB and SoA.
Graph SoA GCOMB Gain
Table 3: MVC: Comparison of our method with the state-of-the-art method on large BA graphs. Gain is defined as the ratio between coverage produced by GCOMB and SoA.
Graph SoA GCOMB Gain
Table 4: MVC: Comparison for absolute variance. The SoA has higher variance over 5 training instances. Gain is defined as the ratio between the variance of SoA and GCOMB.


First, we benchmark GCOMB against the state-of-the-art learning method by [Khalil et al.2017] (SoA) on the bipartite graph model proposed by [Khalil et al.2017]. We measure the quality of the solution sets obtained by these two methods on both MCP and MVC. For a fair comparison, we keep the training time, training dataset, and testing dataset the same for both methods. Furthermore, to measure robustness, all results are reported by averaging the quality over 5 training instances.

As Khalil et al. show results on small graphs, we first test both GCOMB and SoA on relatively small graphs. Both the methods are trained on a graph with nodes and tested for the MCP problem (Section 2) with budget . As presented in Table 2, GCOMB outperforms SoA across all graph sizes.

Next, we show the results for BA datasets with larger sizes on the MVC problem with budget as . Both methods are trained on BA graphs with nodes. Table 3 shows that our method produces better results consistently with the quality being up to better than SoA. Furthermore, SoA fails to scale on large graphs ( nodes and beyond).

GCOMB is not only better in quality, but also more robust. This property is captured in Table 4, which shows the variance in quality across the different training instances. The variance of SoA is up to 3 times higher than that of GCOMB. Overall, GCOMB is up to 41% better in quality, more scalable, and more robust when compared to SoA.

Next, we move to evaluating GCOMB on large graphs. We omit SoA from the next set of experiments since it crashes on these datasets due to scalability issues.

Graph GR Random GCOMB Ratio
BA-10k 3496 278 2831 .80
BA-40k 7558 298 6148 .81
LB 8935 232 8103 .91
LG 43663 182 37722 .86
Table 5: MCP: Comparison of our method with baselines on large graphs. The ratio denotes the ratio between the quality produced between GCOMB and GR.
Graph GR Random GCOMB Ratio
BA-100k 14490 270 7975 .55
BA-500k 30723 268 14911 .49
LB 14787 238 7468 .50
LG 76793 416 29531 .38
Table 6: MVC: Comparison of our method with baselines on large graphs. Ratio denotes the ratio between the quality produced by GCOMB and GR.
Figure 2: MVC: (a) Quality and (b) running time of GCOMB and GR on BA graphs with a million nodes against budget.

4.3 Comparison with Greedy

In this section, we benchmark GCOMB against GR.

Quality: Table 5 shows the results on the MCP problem over multiple synthetic and real datasets. GCOMB is trained on BA-1000. The “Ratio” column in Table 5 denotes the ratio between the quality produced by GCOMB against GR. While GCOMB consistently produces quality above on synthetic graphs, it improves further to on real datasets. For MVC problem, the results are around (Table 6).

Efficiency (MVC): The real benefit of using GCOMB for combinatorial optimization problems comes from the property that it is significantly faster than greedy. Specifically, once the parameters have been learned in the training phase and the node embeddings have been constructed on the unseen graph, for any given budget $b$, we only need to pass the top-scoring nodes through the $Q$-learning neural network. This is an extremely lightweight task when compared to the greedy solution, and therefore significantly faster.

To bring out this aspect, we present the online running times of GCOMB and greedy in Fig. 2(b) as the budget is increased. As is clearly visible, GCOMB is several times faster than GR and has a lower growth rate. The quality, although slightly lower than GR, grows at par with GR (Fig. 2(a)).

Impact of dimension (MCP): We run GCOMB for MCP by varying the node embedding dimension in GCN. This experiment is performed on BP datasets of different sizes. The model is trained on a BP graph with nodes. Table 7 presents the results. The quality of GCOMB is quite stable and does not vary much across dimensions.

BP-4k 2678 2668 2589
BP-5k 3344 3296 3250
BP-10k 6603 6565 6443
Table 7: MCP: Performance of GCOMB with varying embedding dimensions.

4.4 Application: Temporal Networks

The primary motivation behind GCOMB is the observation that many real-world networks change with time although the underlying model remains the same. Temporal networks are one such example. In this section, we show that GCOMB is ideally suited for this scenario. Specifically, we establish that if GCOMB is trained on today’s network snapshot, it can accurately predict solutions to combinatorial optimization problems on future network snapshots.

To demonstrate GCOMB’s predictive power on temporal networks, we consider a question-answer platform, namely the sx-mathoverflow (SM) dataset (Table 1). The nodes correspond to users, and an edge $(u, v, t)$ depicts an interaction between users $u$ and $v$ (e.g., answering or commenting on questions) at time $t$. The data consists of interactions over a span of 2350 days. To generate train and test graphs, we split the span of 2350 days into 55 equal spans (around 42 days) and consider all interactions (edges) inside a span for the corresponding graph. We denote these as $G_1, \dots, G_{55}$. We use the number of interactions between a pair of users as the weight of the edge between them. The edge weight is used as a feature (activity of a user) in the GCN. For training, we pick one graph each for the MCP and MVC problems and then test on all of the remaining graphs.
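The snapshot construction described above can be sketched as follows. The function name, the `(u, v, day)` tuple layout, the treatment of interactions as undirected, and the clamping of the final day into the last span are assumptions made for this illustration; only the 2350-day span, the 55 equal spans, and the use of interaction counts as edge weights come from the text.

```python
from collections import Counter


def temporal_snapshots(interactions, total_days, num_spans):
    """Bucket (u, v, day) interactions into equal spans; weight = interaction count."""
    span = total_days / num_spans
    snapshots = [Counter() for _ in range(num_spans)]
    for u, v, day in interactions:
        # Clamp so that an interaction on the very last day lands in the last span.
        idx = min(int(day / span), num_spans - 1)
        # Assumption: interactions are treated as undirected edges.
        snapshots[idx][tuple(sorted((u, v)))] += 1
    return snapshots


# Hypothetical interactions: (user, user, day offset from the start).
interactions = [(1, 2, 0), (2, 1, 10), (3, 4, 45), (1, 2, 2349.5)]
snaps = temporal_snapshots(interactions, total_days=2350, num_spans=55)
print(snaps[0], snaps[1], snaps[54])
```

With 2350 days and 55 spans, each span is roughly 42.7 days long, which agrees with the "around 42 days" stated above.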

Figure 3 presents the performance of GCOMB. As visible, the quality of GCOMB’s solution sets is almost identical to that of GR across all test graphs.

Figure 3: Quality of the solution sets produced by GCOMB and GR on temporal networks: (a) MVC, (b) MCP.

5 Previous Work

Many combinatorial graph problems lead to NP-hardness [Karp1972]. There are classical NP-hard problems on graphs such as Minimum Vertex Cover, Minimum Set Cover, and the Travelling Salesman Problem (TSP), as well as other important combinatorial problems with many applications. Examples include finding top influential nodes [Kempe et al.2003], maximizing centrality of nodes [Yoshida2014], and optimizing networks [Medya et al.2018, Dilkina et al.2011].

There has been recent interest in solving graph combinatorial problems with neural networks and reinforcement learning [Bello et al.2016, Khalil et al.2017]. Learning-based approaches are useful in producing good empirical results for NP-hard problems. The methods proposed in [Bello et al.2016] are generic, do not explore structural properties of graphs, and use sample-inefficient policy gradient methods. Khalil et al. [Khalil et al.2017] have investigated the same problem with a network embedding approach combined with a reinforcement learning technique. The same problem has been studied by Li et al. [Li et al.2018] via a supervised approach using GCN. Among other interesting work, for branch-and-bound algorithms, He et al. studied the problem of learning a node selection policy [He et al.2014]. Other examples include pointer networks [Vinyals et al.2015] proposed by Vinyals et al. and reinforcement learning approaches by Silver et al. [Silver et al.2016] to learn strategies for the game Go.

6 Conclusion

In this paper, we have proposed a deep reinforcement learning based framework called GCOMB to learn algorithms for combinatorial problems on graphs. GCOMB first generates node embeddings through Graph Convolutional Networks (GCN). These embeddings encode the effect of a node on the budget-constrained solution set. Next, these embeddings are fed to a neural network to learn a $Q$-function and predict the solution set. Through extensive experiments on both real and synthetic datasets containing up to a million nodes, we show that GCOMB is up to 41% better in quality than the state-of-the-art neural method, more robust, and more scalable. In addition, GCOMB is up to seven times faster than the greedy approach, while being of comparable quality. Overall, with these qualities, GCOMB can be operationalized on real dynamic networks to solve practical problems at scale.


  • [Apollonio and Simeone2014] Nicola Apollonio and Bruno Simeone. The maximum vertex coverage problem on bipartite graphs. Discrete Applied Mathematics, 165:37–48, 2014.
  • [Arora et al.2017] Akhil Arora, Sainyam Galhotra, and Sayan Ranu. Debunking the myths of influence maximization: An in-depth benchmarking study. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 651–666. ACM, 2017.
  • [Bello et al.2016] Irwan Bello, Hieu Pham, Quoc V Le, Mohammad Norouzi, and Samy Bengio. Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940, 2016.
  • [Dilkina et al.2011] Bistra Dilkina, Katherine J. Lai, and Carla P. Gomes. Upgrading shortest paths in networks. In Integration of AI and OR Techniques in Constraint Programming for Combinatorial Optimization Problems, pages 76–91. Springer, 2011.
  • [Goyal et al.2011] Amit Goyal, Wei Lu, and Laks VS Lakshmanan. Simpath: An efficient algorithm for influence maximization under the linear threshold model. In 2011 IEEE 11th international conference on data mining, pages 211–220. IEEE, 2011.
  • [Guha et al.2002] Sudipto Guha, Refael Hassin, Samir Khuller, and Einat Or. Capacitated vertex covering with applications. In Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms, pages 858–865. Society for Industrial and Applied Mathematics, 2002.
  • [Hamilton et al.2017] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.
  • [He et al.2014] He He, Hal Daume III, and Jason M Eisner. Learning to search in branch and bound algorithms. In Advances in neural information processing systems, pages 3293–3301, 2014.
  • [Jung et al.2012] Kyomin Jung, Wooram Heo, and Wei Chen. Irie: Scalable and robust influence maximization in social networks. In 2012 IEEE 12th International Conference on Data Mining, pages 918–923. IEEE, 2012.
  • [Karp1972] Richard M Karp. Reducibility among combinatorial problems. In Complexity of computer computations, pages 85–103. Springer, 1972.
  • [Kempe et al.2003] David Kempe, Jon Kleinberg, and Éva Tardos. Maximizing the spread of influence through a social network. In KDD, 2003.
  • [Khalil et al.2017] Elias Khalil, Hanjun Dai, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems, pages 6348–6358, 2017.
  • [Li et al.2018] Zhuwen Li, Qifeng Chen, and Vladlen Koltun. Combinatorial optimization with graph convolutional networks and guided tree search. In Advances in Neural Information Processing Systems, pages 537–546, 2018.
  • [Medya et al.2018] Sourav Medya, Jithin Vachery, Sayan Ranu, and Ambuj Singh. Noticeable network delay minimization via node upgrades. Proceedings of the VLDB Endowment, 11(9):988–1001, 2018.
  • [Mnih et al.2013] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • [Riedmiller2005] Martin Riedmiller. Neural fitted q iteration–first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning, pages 317–328. Springer, 2005.
  • [Silver et al.2016] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484, 2016.
  • [Sutton and Barto2018] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • [Vinyals et al.2015] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In Advances in Neural Information Processing Systems, pages 2692–2700, 2015.
  • [Wilder et al.2018] Bryan Wilder, Han Ching Ou, Kayla de la Haye, and Milind Tambe. Optimizing network structure for preventative health. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 841–849, 2018.
  • [Yoshida2014] Yuichi Yoshida. Almost linear-time algorithms for adaptive betweenness centrality using hypergraph sketches. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1416–1425. ACM, 2014.