1 Introduction
Graph algorithms appear in a wide variety of fields and applications, from the study of gene interactions (Özgür et al., 2008) to social networks (Ugander et al., 2011) to computer systems (Grandl et al., 2016). Today, most graph algorithms are designed by human experts. However, in many applications, designing graph algorithms with strong performance guarantees is very challenging. These algorithms often involve difficult combinatorial optimization problems for which finding optimal solutions is computationally intractable or current algorithmic understanding is limited (e.g., approximation gaps in CS theory literature).
In recent years, deep learning has achieved impressive results on many tasks, from object recognition to language translation to learning complex heuristics directly from data (Silver et al., 2016; Krizhevsky et al., 2012). It is thus natural to ask whether we can apply deep learning to automatically learn complex graph algorithms. To apply deep learning to such problems, graph-structured data first needs to be embedded in a high-dimensional Euclidean space. Graph representation refers to the problem of embedding graphs or their vertices/edges in Euclidean spaces. Recent works have proposed several graph representation techniques, notably, a family of representations called graph convolutional neural networks (GCNN) that use architectures inspired by CNNs for images (Bruna et al., 2014; Monti et al., 2017; Khalil et al., 2017; Niepert et al., 2016; Defferrard et al., 2016; Hamilton et al., 2017b; Bronstein et al., 2017; Bruna & Li, 2017). Some GCNN representations capture signals on a fixed graph while others support varying sized graphs.
In this paper, we consider how to learn graph algorithms in a way that generalizes to large graphs and graphs of different topologies. We ask: Can a graph neural network trained at a certain scale perform well on orders-of-magnitude larger graphs (e.g., the size of training graphs) from diverse topologies? We particularly focus on learning algorithms for combinatorial optimization on graphs. Prior works have typically considered applications where the training and test graphs are similar in scale, shape, and attributes, and consequently have not addressed this generalization problem. For example, Khalil et al. (2017) train models on graphs of size 50–100 and test on graphs of up to size 1200 from the same family; Hamilton et al. (2017a) propose inductive learning methods where they train on graphs with edges, and test on graphs with edges (less than generalization in size); Ying et al. (2018) train and test over the same large Pinterest graph ( billion nodes).
We propose Graph2Seq, a scalable embedding that represents vertices of a graph as a time-series (§3). Our key insight is that the fixed-sized vector representation produced by prior GCNN designs limits scalability. Instead, Graph2Seq uses the entire time-series of vectors produced by graph convolution layers as the vertex representation. This approach has two benefits: (1) it can capture subgraphs of increasing diameter around each vertex as the time-series evolves; (2) it allows us to vary the dimension of the vertex representation based on the input graph; for example, we can use a small number of graph convolutions during training with small graphs and perform more convolutions at test time for larger graphs. We show both theoretically and empirically that this time-series representation significantly improves the scalability and generalization of the model. Our framework is general and can be applied to various existing GCNN architectures.
We prove that Graph2Seq is information-theoretically lossless, i.e., the graph can be fully recovered from the time-series representations of its vertices (§3.1). Our proof leverages mathematical connections between Graph2Seq and causal inference theory (Granger, 1980; Rahimzamani & Kannan, 2016; Quinn et al., 2015). Further, we show that Graph2Seq and many previous GCNN variants are all examples of a certain computational model over graphs that we call local-gather, providing a conceptual and algorithmic unification. Using this computational model, we prove that unlike Graph2Seq, fixed-length representations fundamentally cannot compute certain functions over graphs.
To apply Graph2Seq, we combine graph convolutions with an appropriate RNN that processes the time-series representations (§4). We use this neural network model, G2S-RNN, to tackle three classical combinatorial optimization problems of varying difficulty using reinforcement learning: minimum vertex cover, maximum cut and maximum independent set (§5). Our experiments show that Graph2Seq performs as well as or better than the best non-learning heuristic on all three problems and exhibits significantly better scalability and generalization than previous state-of-the-art GCNN (Khalil et al., 2017; Hamilton et al., 2017a; Lei et al., 2017) or graph kernel based (Shervashidze et al., 2011) representations. Highlights of our experimental findings include:

G2S-RNN models trained on graphs of size 15–20 scale to graphs of size 3,200 and beyond. To conduct experiments in a reasonable time, we used graphs of size up to 3,200 in most experiments. However, stress tests show similar scalability even at 25,000 vertices.

G2S-RNN models trained on one graph type (e.g., Erdos-Renyi) generalize to other graph types (e.g., random regular and bipartite graphs).

G2S-RNN exhibits strong scalability and generalization in each of the minimum vertex cover, maximum cut and maximum independent set problems.

Training over a carefully chosen adversarial set of graph examples further boosts G2S-RNN’s scalability and generalization capabilities.
2 Related Work
Neural networks on graphs. Early works that apply neural-network-based learning to graphs are Gori et al. (2005) and Scarselli et al. (2009), which consider an information diffusion mechanism. The notion of convolutional networks for graphs as a generalization of classical convolutional networks for images was introduced by Bruna et al. (2014). A key contribution of this work is the definition of graph convolution in the spectral domain using graph Fourier transform theory. Subsequent works have developed local spectral convolution techniques that are easier to compute (Defferrard et al., 2016; Kipf & Welling, 2016). Spectral approaches do not generalize readily to different graphs due to their reliance on the particular Fourier basis on which they were trained. To address this limitation, recent works have considered spatial convolution methods (Khalil et al., 2017; Monti et al., 2017; Niepert et al., 2016; Such et al., 2017; Duvenaud et al., 2015; Atwood & Towsley, 2016). Li et al. (2015) and Johnson (2016) propose a variant that uses gated recurrent units to perform the state updates, which has some similarity to our representation dynamics; however, the sequence length is fixed between training and testing. Veličković et al. (2017) and Hamilton et al. (2017a) use additional aggregation methods such as vertex attention or pooling mechanisms to summarize neighborhood states. In Appendix A we show that local spectral GCNNs and spatial GCNNs are mathematically equivalent, providing a unifying view of the variety of GCNN representations in the literature.
Another line of work (Jain et al., 2016; Marcheggiani & Titov, 2017; Tai et al., 2015) combines graph neural networks with RNN modules. They are not related to our approach, since in these cases the sequence (e.g., time-series of object relationship graphs from a video) is already given as part of the input. In contrast, our approach generates a sequence as the desired embedding from a single input graph. Perozzi et al. (2014) and Grover & Leskovec (2016) use random walks to learn vertex representations in an unsupervised or semi-supervised fashion. However, they consider prediction or classification tasks over a fixed graph.
Combinatorial optimization. Using neural networks for combinatorial optimization problems dates back to the work of Hopfield & Tank (1985) and has received considerable attention in the deep learning community in recent years. Vinyals et al. (2015), Bello et al. (2016) and Kool & Welling (2018) consider the traveling salesman problem using reinforcement learning. These papers consider two-dimensional coordinates for vertices (e.g., cities on a map), without any explicit graph structure. Graves et al. (2016) propose a more general approach: a differentiable neural computer that is able to perform tasks like finding the shortest path in a graph. The work of Khalil et al. (2017) is closest to ours. It applies a spatial GCNN representation in a reinforcement learning framework to solve combinatorial optimization problems such as minimum vertex cover.
3 Graphs as Dynamical Systems
3.1 The Graph2Seq Representation
The key idea behind Graph2Seq is to represent vertices of a graph by the trajectory of an appropriately chosen dynamical system induced by the graph. Such a representation has the advantage of progressively capturing more and more information about a vertex as the trajectory unfolds. Consider a directed graph $G(V, E)$ whose vertices we want to represent (undirected graphs will be represented by having bidirectional edges between pairs of connected vertices). We create a discrete-time dynamical system in which vertex $v$ has a state of $\mathbf{x}_v(t) \in \mathbb{R}^d$ at time $t \in \mathbb{N}$, for all $v \in V$, and $d$ is the dimension of the state space. In Graph2Seq, we consider an evolution rule of the form
$$\mathbf{x}_v(t+1) = \mathrm{relu}\Big(W_0 \sum_{u \in \Gamma(v)} \mathbf{x}_u(t) + b_0\Big) + \mathbf{n}_v(t+1), \quad \forall v \in V,\ t \in \mathbb{N}, \qquad (1)$$
where $W_0 \in \mathbb{R}^{d \times d}$, $b_0 \in \mathbb{R}^{d \times 1}$ are trainable parameters and $\mathrm{relu}(x) = \max(x, 0)$. $\mathbf{n}_v(\cdot)$ is a $d$-dimensional Gaussian noise, and $u \in \Gamma(v)$ if there is an edge from $u$ to $v$ in the graph $G$. For any $v \in V$, starting with an initial value for $\mathbf{x}_v(0)$ (e.g., random or all zero) this equation defines a dynamical system, the (random) trajectory of which is the Graph2Seq representation of $v$. More generally, graphs could have features on vertices or edges (e.g., weights on vertices), which can be included in the evolution rule; these generalizations are outside the scope of this paper. We use Graph2Seq($G$) to mean the set of all Graph2Seq vertex representations of $G$.
Graph2Seq is invertible. Our first key result is that Graph2Seq’s representation allows recovery of the adjacency matrix of the graph with arbitrarily high probability. Here the randomness is with respect to the noise term in Graph2Seq; see equation 1. In Appendix B.1, we prove:
Theorem 1.
For any directed graph $G$ and associated (random) representation Graph2Seq($G$) with sequence length $t$, there exists an inference procedure (with time complexity polynomial in $t$) that produces an estimate $\hat{G}_t$ such that $\lim_{t \to \infty} \mathbb{P}[G \neq \hat{G}_t] = 0$.
Note that there are many ways to represent a graph that are lossless. For example, we can simply output the adjacency matrix row by row. However, such representations depend on assigning labels or identifiers to vertices, which would cause downstream deep learning algorithms to memorize the label structure and not generalize to other graphs. Graph2Seq’s key property is that it does not depend on a labeling of the graph. Theorem 1 is particularly significant (despite the representation being infinite dimensional) since it shows lossless label-independent vertex representations are possible. To our understanding this is the first result making such a connection. Next, we show that the noise term in Graph2Seq’s evolution rule (equation 1) is crucial for Theorem 1 to hold (proof in Appendix B.2).
Proposition 1.
Under any deterministic evolution rule of the form in equation 1, there exists a graph which cannot be reconstructed exactly from its Graph2Seq representation with arbitrarily high probability.
The astute reader might observe that invertible, label-independent representations of graphs can be used to solve the graph isomorphism problem (Schrijver, 2003). However, Proposition 1 shows that Graph2Seq cannot solve the graph isomorphism problem, as that would require a deterministic representation. Noise is necessary to break symmetry in the otherwise deterministic dynamical system. Observe that the time-series for a vertex in equation 1 depends only on the time-series of its neighboring nodes, not any explicit vertex identifiers. As a result, two graphs for which all vertices have exactly the same neighbors will have exactly the same representations, even though they may be structurally different. The proof of Proposition 1 illustrates this phenomenon for regular graphs.
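The role of the noise term can be illustrated with a small simulation of the evolution rule in equation 1. The following is a minimal sketch, not the authors' implementation: the function names, the scalar state (a state dimension of 1), the hand-picked weight and bias, and the constant initial state are all assumptions made for the example. With the noise switched off, every vertex of a regular graph follows the same trajectory, so a 6-cycle and two disjoint triangles (both 2-regular) produce identical representations.

```python
import random


def relu(x):
    return max(x, 0.0)


def graph2seq(adj, steps, w=0.5, b=0.1, noise_std=0.0, x0=1.0, seed=0):
    """Simulate a scalar version of the evolution rule (illustrative only).

    adj maps each vertex to the list of its neighbours.
    Returns {vertex: [x(0), x(1), ..., x(steps)]}.
    """
    rng = random.Random(seed)
    state = {v: x0 for v in adj}
    traj = {v: [x0] for v in adj}
    for _ in range(steps):
        new_state = {}
        for v, nbrs in adj.items():
            agg = sum(state[u] for u in nbrs)  # sum over neighbouring states
            noise = rng.gauss(0.0, noise_std) if noise_std > 0 else 0.0
            new_state[v] = relu(w * agg + b) + noise
        state = new_state
        for v in adj:
            traj[v].append(state[v])
    return traj


def cycle(vertices):
    """Adjacency lists of a simple cycle through the given vertices."""
    n = len(vertices)
    return {vertices[i]: [vertices[i - 1], vertices[(i + 1) % n]] for i in range(n)}


six_cycle = cycle([0, 1, 2, 3, 4, 5])
two_triangles = {**cycle([0, 1, 2]), **cycle([3, 4, 5])}

# Deterministic evolution: the two different graphs are indistinguishable.
det_a = graph2seq(six_cycle, steps=5)
det_b = graph2seq(two_triangles, steps=5)

# With noise, vertices of the same graph get distinct trajectories.
noisy = graph2seq(six_cycle, steps=5, noise_std=0.1)
```

The sketch only demonstrates the symmetry problem; it does not implement the inference procedure of Theorem 1, which is given in Appendix B.1.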
3.2 Formal Computation Model
Although Graph2Seq is an invertible representation of a graph, it is unclear how it compares to other GCNN representations in the literature. Below we define a formal computational model on graphs, called local-gather, that includes Graph2Seq as well as a large class of GCNN representations in the literature. Abstracting different representations into a formal computational model allows us to reason about the fundamental limits of these methods. We show that GCNNs with a fixed number of convolutional steps cannot compute certain functions over graphs, whereas a sequence-based representation such as Graph2Seq is able to do so. For simplicity of notation, we consider undirected graphs in this section and in the rest of this paper.
$k$-local-gather model. Consider an undirected graph $G(V, E)$ on which we seek to compute a function $f(G)$, with $G \in \mathcal{G}$, where $\mathcal{G}$ is the space of all undirected graphs. In the $k$-local-gather model, computations proceed in two rounds: In the local step, each vertex $v$ computes a representation $r(v)$ that depends only on the subgraph of vertices that are at a distance of at most $k$ from $v$. Following this, in the gather step, the function $f$ is computed by applying another function $g(\cdot)$ over the collection of vertex representations $\{r(v) : v \in V\}$. Graph2Seq is an instance of the local-gather model. GCNNs that use localized filters with a global aggregation (e.g., Kipf & Welling (2016); Khalil et al. (2017)) also fit this model (proof in Appendix B.3).
Proposition 2.
Fixed-length representations are insufficient. We show below that for a fixed $k$, no algorithm from the $k$-local-gather model can compute certain canonical graph functions exactly (proof in Appendix B.4).
Theorem 2.
For any fixed $k$, there exists a function $f(\cdot)$ and an input graph instance $G$ such that no $k$-local-gather algorithm can compute $f(G)$ exactly.
For the graph and function used in the proof of Theorem 2, we present a sequence-based representation (from the local-gather model) in Appendix B.5 that is able to asymptotically compute $f(\cdot)$. This example demonstrates that sequence-based representations are more powerful than fixed-length graph representations in the local-gather model. Further, it illustrates how a trained neural network can produce sequential representations that can be used to compute specific functions.
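The intuition behind a fixed locality radius being insufficient can be seen on a small example. The sketch below is an illustration chosen for this purpose and is not the construction used in Appendix B.4: it compares one 12-cycle with two disjoint 6-cycles. The helper names and the coarse, label-free neighbourhood summary (ball size, induced edge count, sorted distances) are assumptions of the example. For radius 2 every vertex in either graph sees the same path on five vertices, so the collections handed to the gather step coincide, although the number of connected components differs.

```python
from collections import Counter, deque


def cycle(vertices):
    """Adjacency lists of a simple cycle through the given vertices."""
    vs = list(vertices)
    n = len(vs)
    return {vs[i]: [vs[i - 1], vs[(i + 1) % n]] for i in range(n)}


def k_ball_signature(adj, v, k):
    """Label-free summary of the subgraph within distance k of v."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue  # do not expand beyond radius k
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    # Each undirected edge inside the ball is seen from both endpoints.
    edges = sum(1 for u in dist for w in adj[u] if w in dist) // 2
    return (len(dist), edges, tuple(sorted(dist.values())))


def local_step(adj, k):
    """Multiset of per-vertex summaries: the input to the gather step."""
    return Counter(k_ball_signature(adj, v, k) for v in adj)


def count_components(adj):
    seen, components = set(), 0
    for start in adj:
        if start in seen:
            continue
        components += 1
        stack = [start]
        while stack:
            u = stack.pop()
            if u not in seen:
                seen.add(u)
                stack.extend(adj[u])
    return components


one_big = cycle(range(12))
two_small = {**cycle(range(6)), **cycle(range(6, 12))}

same_at_k2 = local_step(one_big, 2) == local_step(two_small, 2)
same_at_k3 = local_step(one_big, 3) == local_step(two_small, 3)
```

Here `same_at_k2` is true while `same_at_k3` is false: increasing the radius resolves this particular pair, but for any fixed radius a longer pair of cycles reproduces the same ambiguity.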
Graph2Seq and graph kernels. Graph kernels (Yanardag & Vishwanathan, 2015; Vishwanathan et al., 2010; Kondor & Pan, 2016) are another popular method of representing graphs. The main idea here is to define or learn a vocabulary of substructures (e.g., graphlets, paths, subtrees), and use counts of these substructures in a graph as its representation. The Weisfeiler-Lehman (WL) graph kernel (Shervashidze et al., 2011; Weisfeiler & Lehman, 1968) is closest to Graph2Seq. Starting with ‘labels’ (e.g., vertex degree) on vertices, the WL kernel iteratively performs local label updates similar to equation 1 but typically using discrete functions/maps. The final representation consists of counts of these labels (i.e., a histogram) in the graph. Each label corresponds to a unique subtree pattern. However, the labels themselves are not part of any structured space, and cannot be used to compare the similarity of the subtrees they represent. Therefore, during testing, if new unseen labels (or equivalently subtrees) are encountered the resulting representation may not generalize.
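The WL label-update procedure described above can be sketched as follows. This is a textbook-style illustration under stated assumptions, not the exact feature map used in the experiments: the function name `wl_histogram`, the use of vertex degree as the initial label, and the truncated SHA-1 digest used to shorten labels are choices made for the example.

```python
from collections import Counter
import hashlib


def wl_histogram(adj, iterations):
    """Histogram of Weisfeiler-Lehman labels over all refinement rounds.

    adj maps each vertex to the list of its neighbours.
    """
    labels = {v: str(len(adj[v])) for v in adj}  # initial label: vertex degree
    hist = Counter(labels.values())
    for _ in range(iterations):
        new_labels = {}
        for v in adj:
            # A vertex's new label depends on its own label and the sorted
            # multiset of neighbour labels, never on vertex identifiers.
            signature = labels[v] + "|" + ",".join(sorted(labels[u] for u in adj[v]))
            new_labels[v] = hashlib.sha1(signature.encode()).hexdigest()[:8]
        labels = new_labels
        hist.update(labels.values())
    return hist


path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
relabelled_path = {"d": ["c"], "c": ["d", "b"], "b": ["c", "a"], "a": ["b"]}
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

six_cycle = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
```

Because the hashed labels are opaque strings, two histograms can only be compared by exact label match, which reflects the point made above about labels not living in a structured space. The last two graphs also show that the refinement, like the deterministic evolution in Proposition 1, assigns identical histograms to distinct regular graphs.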
4 Neural Network Design
We consider a reinforcement learning (RL) formulation for combinatorial optimization problems on graphs. RL is well-suited to such problems since the true ‘labels’ (i.e., the optimal solution) may be unavailable or hard to compute. Additionally, the objective functions in these problems can be used as natural reward signals. An RL approach has been explored in recent works (Vinyals et al., 2015; Bello et al., 2016; Khalil et al., 2017) under different representation techniques. Our learning framework uses the Graph2Seq representation. Fig. 1 shows our neural network architecture. We feed the trajectories output by Graph2Seq for all the vertices into a recurrent neural network (specifically RNN-GRU), whose outputs are then aggregated by a feedforward network to select a vertex. Henceforth we call this the G2S-RNN neural network architecture. The key feature of this design is that the length of the sequential representation is not fixed; we vary it depending on the input instance. We show that our model is able to learn rules, for both generating the sequence and processing it with the RNN, that generalize to operate on long sequences. In turn, this translates to algorithmic solutions that scale to large graph sizes.
Reinforcement learning model. We consider an RL formulation in which vertices are chosen one at a time. Each time the RL agent chooses a vertex, it receives a reward. The goal of training is to learn a policy such that cumulative reward is maximized. We use Q-learning to train the network. For input graph instance $G(V, E)$, a subset $S \subseteq V$ and $a \in V \setminus S$, this involves using a neural network to approximate a Q-function $Q(G, S, a)$. Here $S$ represents the set of vertices already picked. The neural network comprises three modules: (1) Graph2Seq, which takes as input the graph $G$ and the set $S$ of vertices chosen so far. It generates a sequence of vectors as output for each vertex. (2) Seq2Vec reads the sequence output of Graph2Seq and summarizes it into one vector per vertex. (3) Q-Network takes the vector summary of each vertex $a$ and outputs the estimated $Q(G, S, a)$ value. The overall architecture is illustrated in Fig. 1. To make the network practical, we truncate the sequence outputs of Graph2Seq to a length of $T$. However, the value of $T$ is not fixed, and is varied both during training and testing according to the size and complexity of the graph instances encountered; see §5 for details. We describe each module below.
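Graph2Seq. Let $\mathbf{x}_v(t)$ denote the state of vertex $v$ and $c_v(t)$ denote the binary variable that is one if $v \in S$ and zero otherwise, at time-step $t$ in the Graph2Seq evolution. Then, the trajectory of each vertex evolves as $\mathbf{x}_v(t+1) = \mathrm{relu}\big(W_{G,1} \sum_{u \in \Gamma(v)} \mathbf{x}_u(t) + w_{G,2}\, c_v(t) + w_{G,3}\big)$ for $t = 0, 1, \ldots, T-1$, with $\Gamma(v)$ the neighbors of $v$ as in equation 1. $W_{G,1} \in \mathbb{R}^{d \times d}$ and $w_{G,2}, w_{G,3} \in \mathbb{R}^{d}$ are trainable parameters, and $\mathbf{x}_v(0)$ is initialized to all-zeros for each $v \in V$.
Seq2Vec and Q-Network. The sequences are processed by GRU units (Chung et al., 2014) at the vertices. At time-step $t$, the recurrent unit at vertex $v$ computes as input a function that depends on (i) $\mathbf{x}_v(t)$, the embedding for node $v$, (ii) $\sum_{u \in \Gamma(v)} \mathbf{x}_u(t)$, the neighboring node embeddings and (iii) $\sum_{u \in V} \mathbf{x}_u(t)$, a summary of embeddings of all nodes. This input is combined with the GRU’s cell state $\mathbf{y}_v(t-1)$ to produce an updated cell state $\mathbf{y}_v(t)$. The cell state $\mathbf{y}_v(T)$ at the final time-step is the desired vector summary of the Graph2Seq sequence, and is fed as input to the Q-network. We refer to Appendix C for equations on Seq2Vec. The Q-values are estimated as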
(2) 
with and being learnable parameters. All transformation functions in the network leading up to equation 2 are differentiable. This makes the whole network differentiable, allowing us to train it end to end.
Remark. In Fig. 1 the Graph2Seq RNN and the GRU RNN can also be thought of as two layers of a two-layer GRU. We have deliberately separated the two RNNs to highlight the fact that the sequence produced by Graph2Seq (blue in Fig. 1) is the embedding of the graph, which is then read using a GRU layer (red in Fig. 1). Our architecture is not unique, and other designs for generating and/or reading the sequence are possible.
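The data flow through the three modules, and the one-vertex-at-a-time selection loop, can be sketched as below for minimum vertex cover. This is a toy illustration under heavy assumptions and not the trained G2S-RNN: the state is a scalar per vertex, the weights `w1`, `w2`, `b` are set by hand rather than learned, an exponentially weighted running summary stands in for the GRU reader, and the read-out in `q_values` is a hypothetical linear combination. Only the overall structure (sequence of variable length `T`, per-vertex summary, per-vertex score, greedy selection until the solution is feasible) follows the description above.

```python
def relu(x):
    return max(x, 0.0)


def graph2seq(adj, chosen, T, w1=0.3, w2=-1.0, b=0.5):
    """Scalar stand-in for the Graph2Seq module; returns length-T sequences."""
    x = {v: 0.0 for v in adj}  # all-zero initial state
    seqs = {v: [] for v in adj}
    for _ in range(T):
        x = {
            v: relu(w1 * sum(x[u] for u in adj[v]) + w2 * (v in chosen) + b)
            for v in adj
        }
        for v in adj:
            seqs[v].append(x[v])
    return seqs


def seq2vec(seq, decay=0.5):
    """Placeholder for the GRU reader: a running summary of the sequence."""
    h = 0.0
    for x in seq:
        h = decay * h + (1 - decay) * x
    return h


def q_values(adj, chosen, T):
    """Hypothetical read-out: own summary plus a small global term."""
    seqs = graph2seq(adj, chosen, T)
    summary = {v: seq2vec(seqs[v]) for v in adj}
    total = sum(summary.values())
    return {v: summary[v] + 0.01 * total for v in adj if v not in chosen}


def is_cover(adj, chosen):
    return all(u in chosen or v in chosen for u in adj for v in adj[u])


def rollout(adj, T):
    """Pick the highest-scoring vertex until every edge is covered."""
    chosen = set()
    while not is_cover(adj, chosen):
        q = q_values(adj, chosen, T)
        chosen.add(max(q, key=q.get))
    return chosen


star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
cover = rollout(star, T=3)
```

On the star graph the hub accumulates a larger state than the leaves after a few steps, so the toy scores select it first. Since `T` is an argument rather than a fixed architectural constant, the same functions can be run with longer sequences on larger inputs, which is the design property the section emphasizes.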
5 Evaluations

Figure 2: Scalability of G2S-RNN and GCNN in (a) Erdos-Renyi graphs (left), random bipartite graphs (right), and (b) on much larger graphs. The neural networks have been trained over the same graph types as the test graphs. Error bars show one standard deviation.
In this section we present our evaluation results for Graph2Seq. We address the following central questions: (1) How well does G2S-RNN scale? (2) How well does G2S-RNN generalize to new graph types? (3) Can we apply G2S-RNN to a variety of problems? (4) Does adversarial training improve scalability and generalization? To answer these questions, we experiment with G2S-RNN on three classical graph optimization problems: minimum vertex cover (MVC), max cut (MC) and maximum independent set (MIS). These problems are well known to be NP-hard, and also differ greatly in their structure (Williamson & Shmoys, 2011). We explain the problems below.
Minimum vertex cover.
The MVC of a graph $G(V, E)$ is the smallest cardinality set $S \subseteq V$ such that for every edge $(u, v) \in E$ at least one of $u$ or $v$ is in $S$.
Approximation algorithms to within a factor 2 are known for MVC; however it cannot be approximated better than 1.3606 unless P = NP.
Max cut.
In the MC problem, for an input graph instance $G(V, E)$ we seek a cut $(S, S^c)$ where $S \subseteq V$ such that the number of edges crossing the cut is maximized.
This problem can be approximated within a factor 1.1383 of optimal, but not within 1.0684 unless P = NP.
Maximum independent set.
For a graph $G(V, E)$ the MIS denotes a set $S \subseteq V$ of maximum cardinality such that for any $u, v \in S$, $(u, v) \notin E$.
The maximum independent set is complementary to the minimum vertex cover: if $S$ is an MIS of $G(V, E)$, then $V \setminus S$ is the MVC.
However, from an approximation standpoint, MIS is hard to approximate within $n^{1-\epsilon}$ for any $\epsilon > 0$, despite constant factor approximation algorithms known for MVC.
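The three definitions, and the complementarity between MIS and MVC, can be checked by brute force on a small instance. The sketch below is only meant to make the definitions concrete; the helper names are assumptions of the example, and the exhaustive search over all vertex subsets is practical only for very small graphs.

```python
from itertools import combinations


def is_vertex_cover(edges, s):
    return all(u in s or v in s for u, v in edges)


def is_independent_set(edges, s):
    return not any(u in s and v in s for u, v in edges)


def cut_value(edges, s):
    """Number of edges with exactly one endpoint in s."""
    return sum(1 for u, v in edges if (u in s) != (v in s))


def brute_force(vertices, edges):
    """Return (a minimum vertex cover, a maximum independent set, max cut value)."""
    vertices = list(vertices)
    subsets = [
        set(c) for r in range(len(vertices) + 1) for c in combinations(vertices, r)
    ]
    mvc = min((s for s in subsets if is_vertex_cover(edges, s)), key=len)
    mis = max((s for s in subsets if is_independent_set(edges, s)), key=len)
    max_cut = max(cut_value(edges, s) for s in subsets)
    return mvc, mis, max_cut


# A 5-cycle: vertices 0..4, each joined to the next.
vertices = range(5)
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
mvc, mis, max_cut = brute_force(vertices, edges)

# The complement of a maximum independent set is a minimum vertex cover.
complement = set(vertices) - mis
```

For the 5-cycle the minimum cover has three vertices, the maximum independent set has two, and at most four of the five edges can cross a cut, since an odd cycle is not bipartite.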
Heuristics compared. In each problem, we compare G2S-RNN against: (1) Structure2Vec (Khalil et al., 2017), (2) GraphSAGE (Hamilton et al., 2017a) using (a) GCN, (b) mean and (c) pool aggregators, (3) WL kernel NN (Lei et al., 2017), (4) WL kernel embedding, in which the feature map corresponding to the WL subtree kernel of the subgraph in a 5-hop neighborhood around each vertex is used as its vertex embedding (Shervashidze et al., 2011). Since we test on large graphs, instead of using a learned label lookup dictionary we use a standard hash function for label shortening at each step. In each of the above, the outputs of the last layer are fed to a Q-learning network as in §4. Unlike G2S-RNN, the depth of the above neural network (NN) models is fixed across input instances. We also consider the following well-known (non-learning based) heuristics for each problem: (5) Greedy algorithms, (6) List heuristic, (7) Matching heuristic. We refer to Appendix D.1 for details on these heuristics.
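For orientation, two of the non-learning heuristics can be sketched for minimum vertex cover in their standard textbook form. These are assumptions about what "greedy" and "matching" mean here; the exact variants used in the experiments are described in Appendix D.1 and may differ, and the function names are chosen for this example only.

```python
def greedy_vertex_cover(edges):
    """Repeatedly add the vertex covering the most uncovered edges."""
    remaining = list(edges)
    cover = set()
    while remaining:
        degree = {}
        for u, v in remaining:
            degree[u] = degree.get(u, 0) + 1
            degree[v] = degree.get(v, 0) + 1
        best = max(degree, key=degree.get)
        cover.add(best)
        remaining = [(u, v) for u, v in remaining if best not in (u, v)]
    return cover


def matching_vertex_cover(edges):
    """Take both endpoints of a greedily built maximal matching.

    The result is a cover of size at most twice the optimum.
    """
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.update((u, v))
    return cover


star = [(0, 1), (0, 2), (0, 3), (0, 4)]
path = [(0, 1), (1, 2), (2, 3)]

greedy_star = greedy_vertex_cover(star)
matching_star = matching_vertex_cover(star)
greedy_path = greedy_vertex_cover(path)
matching_path = matching_vertex_cover(path)
```

On the star the greedy rule finds the optimal single-vertex cover while the matching rule returns two vertices; on the path the matching rule returns all four vertices against an optimum of two, which is the worst case its factor-2 guarantee allows.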
We attempt to compute the optimal solution via the Gurobi optimization package (Gurobi Optimization, 2016). We run the Gurobi solver with a cutoff time of 240 s, and report performance in the form of approximation ratios relative to the solution found by the Gurobi solver. We do not compare against DeepWalk (Perozzi et al., 2014) or node2vec (Grover & Leskovec, 2016) since these methods are designed for obtaining vertex embeddings over a single graph. They are inappropriate for models that need to generalize over multiple graphs. This is because the vertex embeddings in these approaches can be arbitrarily rotated without a consistent ‘alignment’ across graphs. The number of parameters can also grow linearly with graph size. We refer to Hamilton et al. (2017a, Appendix D) for details.
Training. We train G2S-RNN, Structure2Vec, GraphSAGE and WL kernel NN all on small Erdos-Renyi (ER) graphs of size 15, and edge probability . During training we truncate the Graph2Seq representation to 5 observations (i.e., a sequence length of 5, see Fig. 1). In each case, the model is trained for 100,000 iterations, except WL kernel NN which is trained for 200,000 iterations since it has more parameters. We use experience replay (Mnih et al., 2013), a learning rate of , Adam optimizer (Kingma & Ba, 2015) and an exploration probability that is reduced from to a resting value of over 10,000 iterations. The amount of noise added in the evolution (the noise term in Equation 1) did not seem to matter; we have set the noise variance to zero in all our experiments (training and testing). As far as possible, we have tried to keep the hyperparameters in G2S-RNN and all the neural network baselines the same. For example, all of the networks have a vertex embedding dimension of , use the same neural architecture for the network and the Adam optimizer for training.
Testing. For any , let denote the neural network (§4) in which Graph2Seq is restricted to a sequence length of . To test a graph , we feed as input to the networks , and choose the ‘best’ output as our final output. For each , outputs a solution set . The best output corresponds to that having the maximum objective value. For example, in the case of MVC the having the smallest cardinality is the best output. This procedure is summarized in detail in Algorithm 2 in Appendix C. We choose in our experiments. The time complexity of G2S-RNN is .
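The test-time procedure (run the model at several sequence lengths and keep the best feasible output) can be sketched as a small wrapper. This is an illustration of the selection step only and is not Algorithm 2 from Appendix C: `best_of_lengths`, `toy_solve` and the hard-coded candidate sets are hypothetical, and the stand-in solver simply looks up a precomputed answer instead of running a trained network.

```python
def best_of_lengths(graph, solve, objective, lengths, maximize):
    """Run solve(graph, T) for each T and keep the best-scoring output.

    Every candidate returned by solve is assumed to be feasible.
    """
    best_set, best_val = None, None
    for T in lengths:
        candidate = solve(graph, T)
        value = objective(graph, candidate)
        if best_val is None:
            better = True
        elif maximize:
            better = value > best_val
        else:
            better = value < best_val
        if better:
            best_set, best_val = candidate, value
    return best_set, best_val


# Hypothetical stand-in for a trained model restricted to sequence length T.
def toy_solve(graph, T):
    precomputed = {1: {0, 1, 2}, 2: {0, 2}, 3: {0, 1}}
    return precomputed[T]


def cover_size(graph, cover):
    return len(cover)


# For minimum vertex cover, the smallest cover is the best output.
best_cover, best_size = best_of_lengths(
    graph=None, solve=toy_solve, objective=cover_size, lengths=[1, 2, 3], maximize=False
)
```

Ties are resolved in favour of the shorter sequence length, which is an arbitrary choice made in this sketch. For max cut or maximum independent set the same wrapper would be called with `maximize=True` and the corresponding objective.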
To test generalization across graph types, we consider the following graph types: (1) ER graphs with edge probability ; (2) random regular graphs with degree ; (3) random bipartite graphs with equal-sized partitions and edge probability ; (4) worst-case examples, such as the worst-case graph for the greedy heuristic on MVC, which has a approximation factor (Johnson, 1973); (5) two-dimensional grid graphs, in which the sides contain an equal number of vertices. For each type, we test on graphs with the number of vertices ranging from 25 to 3200 in exponential increments (except WL embedding, which is restricted to size 100 or 200 since it is computationally expensive). Some of the graphs considered are dense: for example, a 3200 node ER graph has 700,000 edges; a 3200 node random bipartite graph has 1.9 million edges. We also test on sparse ER and bipartite graphs of sizes and with an average degree of 7.5.
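Three of the graph families listed above are straightforward to generate with the standard library, as sketched below. The function names are assumptions of the example, and the edge probabilities passed in the usage lines are placeholders rather than the values used in the experiments. Random regular graphs and the worst-case constructions are omitted because they need more care.

```python
import random


def erdos_renyi(n, p, rng):
    """Each of the n*(n-1)/2 possible edges is present independently with probability p."""
    return [(u, v) for u in range(n) for v in range(u + 1, n) if rng.random() < p]


def random_bipartite(n_per_side, p, rng):
    """Left vertices are 0..n-1, right vertices are n..2n-1."""
    return [
        (u, n_per_side + v)
        for u in range(n_per_side)
        for v in range(n_per_side)
        if rng.random() < p
    ]


def grid_2d(side):
    """Square grid with side*side vertices, numbered row by row."""
    edges = []
    for r in range(side):
        for c in range(side):
            here = r * side + c
            if c + 1 < side:
                edges.append((here, here + 1))  # edge to the right neighbour
            if r + 1 < side:
                edges.append((here, here + side))  # edge to the neighbour below
    return edges


rng = random.Random(0)
er_edges = erdos_renyi(15, 0.5, rng)  # placeholder probability
bip_edges = random_bipartite(10, 0.5, rng)  # placeholder probability
grid_edges = grid_2d(3)
```

A seeded `random.Random` instance is passed in explicitly so that generated test graphs are reproducible across runs.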
5.1 Scalability and Generalization across Graph Types
Scalability. To test scalability, we train all the NN models on small graphs of a type, and test on larger graphs of the same type. For each NN model, we use the same trained parameters for all of the test examples in a graph type. We consider the MVC problem and train on: (1) size-15 ER graphs, and (2) size-20 random bipartite graphs. The models trained on ER graphs are then tested on ER graphs of sizes 25–3200; similarly the models trained on bipartite graphs are tested on bipartite graphs of sizes 24–3200. We present the results in Fig. 2. We have also included non-learning-based heuristics for reference. In both ER and bipartite graphs, we observe that G2S-RNN generalizes well to graphs of size roughly 25 through 3200, even though it was trained on size-15 graphs. Other NN models, however, either generalize well on only one of the two types (e.g., Structure2Vec performs well on ER graphs, but not on bipartite graphs) or do not generalize on either type. G2S-RNN generalizes well to even larger graphs. Fig. 2 presents results of testing on size 10,000 and 25,000 ER and random bipartite graphs. We observe that the vertex cover output by G2S-RNN is at least 100 nodes smaller than that of Structure2Vec.
Generalization across graph types. Next we test how the models generalize to different graph types. We train the models on size-15 ER graphs, and test them on three graph types: (i) worst-case graphs, (ii) random regular graphs, and (iii) random bipartite graphs. For each graph type, we vary the graph size from 25 to 3200 as before. Fig. 3 plots results for the different baselines. In general, G2S-RNN has a performance that is within 10% of the optimal, across the range of graph types and sizes considered. The other NN baselines behave inconsistently and perform poorly on certain classes of graph types/sizes.
Adversarial training. We also trained G2S-RNN on a certain class of adversarial ‘hard’ examples for minimum vertex cover, and observed further improvements in generalization. We refer to Appendix D for details and results of this method.
5.2 Other Problems: MC and MIS
We test and compare G2S-RNN on the MC and MIS problems. As in MVC, our results demonstrate consistently good scalability and generalization of G2S-RNN across graph types and sizes. As before, we train the NN models on size-15 ER graphs () and test on different graphs.
Max cut. We test on (1) ER graphs, and (2) two-dimensional grid graphs. For each graph type, we vary the number of vertices in the range –, and use the same trained model for all of the tests. The results of our tests are presented in Fig. 4. We notice that for both graph types G2S-RNN is able to achieve an approximation less than times the (timed) integer program output.
Maximum independent set. We test on (1) ER graphs, and (2) worst-case bipartite graphs for the greedy heuristic. The number of vertices is varied in the range 25–3200 for each graph type. We present our results in Fig. 4. In ER graphs, G2S-RNN is consistently within a factor of 1.10 of the (timed) integer program solution. In the bipartite graph case we see a performance within 8% of optimal across all sizes.
6 Conclusion
We proposed Graph2Seq, which represents vertices of graphs as infinite time-series of vectors. The representation melds naturally with modern RNN architectures that take time-series as inputs. We applied this combination to three canonical combinatorial optimization problems on graphs, ranging across the complexity-theoretic hardness spectrum. Our empirical results best state-of-the-art approximation algorithms for these problems on a variety of graph sizes and types. In particular, Graph2Seq exhibits significantly better scalability and generalization than existing GCNN representations in the literature. An open direction involves a more systematic study of the capabilities of Graph2Seq across the panoply of graph combinatorial optimization problems, as well as its performance in concrete (and myriad) downstream applications. Another open direction involves interpreting the policies learned by Graph2Seq to solve specific combinatorial optimization problems (e.g., as in LIME (Ribeiro et al., 2016)). A detailed analysis of the Graph2Seq dynamical system to study the effects of sequence length on the representation is also an important direction.
References
 Angelopoulos & Borodin (2003) Spyros Angelopoulos and Allan Borodin. Randomized priority algorithms. In WAOA, pp. 27–40. Springer, 2003.
 Atwood & Towsley (2016) James Atwood and Don Towsley. Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1993–2001, 2016.
 Bello et al. (2016) Irwan Bello, Hieu Pham, Quoc V Le, Mohammad Norouzi, and Samy Bengio. Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940, 2016.
 Borodin et al. (2003) Allan Borodin, Morten N Nielsen, and Charles Rackoff. (incremental) priority algorithms. Algorithmica, 37(4):295–326, 2003.
 Borodin et al. (2010) Allan Borodin, Joan Boyar, Kim S Larsen, and Nazanin Mirmohammadi. Priority algorithms for graph optimization problems. Theoretical Computer Science, 411(1):239–258, 2010.
 Bronstein et al. (2017) Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
 Bruna & Li (2017) Joan Bruna and Xiang Li. Community detection with graph neural networks. arXiv preprint arXiv:1705.08415, 2017.
 Bruna et al. (2014) Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann Lecun. Spectral networks and locally connected networks on graphs. In International Conference on Learning Representations (ICLR2014), CBLS, April 2014, 2014.
 Chung et al. (2014) Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning, December 2014, 2014.
 Defferrard et al. (2016) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3844–3852, 2016.
 Duvenaud et al. (2015) David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pp. 2224–2232, 2015.

 Gilmer et al. (2017) Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In International Conference on Machine Learning, pp. 1263–1272, 2017.
 Gori et al. (2005) Marco Gori, Gabriele Monfardini, and Franco Scarselli. A new model for learning in graph domains. In Neural Networks, 2005. IJCNN’05. Proceedings. 2005 IEEE International Joint Conference on, volume 2, pp. 729–734. IEEE, 2005.
 Grandl et al. (2016) Robert Grandl, Srikanth Kandula, Sriram Rao, Aditya Akella, and Janardhan Kulkarni. Graphene: Packing and dependency-aware scheduling for data-parallel clusters. In Proceedings of OSDI’16: 12th USENIX Symposium on Operating Systems Design and Implementation, pp. 81, 2016.
 Granger (1980) Clive WJ Granger. Testing for causality: a personal viewpoint. Journal of Economic Dynamics and control, 2:329–352, 1980.
 Graves et al. (2016) Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471–476, 2016.
 Grover & Leskovec (2016) Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 855–864. ACM, 2016.
 Gurobi Optimization (2016) Inc. Gurobi Optimization. Gurobi optimizer reference manual, 2016. URL http://www.gurobi.com.
 Hamilton et al. (2017a) Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1025–1035, 2017a.
 Hamilton et al. (2017b) William L Hamilton, Rex Ying, and Jure Leskovec. Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584, 2017b.
 Hopfield & Tank (1985) John J Hopfield and David W Tank. Neural computation of decisions in optimization problems. Biological cybernetics, 52(3):141–152, 1985.

 Jain et al. (2016) Ashesh Jain, Amir R Zamir, Silvio Savarese, and Ashutosh Saxena. Structural-RNN: Deep learning on spatio-temporal graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5308–5317, 2016.
 Johnson (2016) Daniel D Johnson. Learning graphical state transitions. 2016.

 Johnson (1973) David S Johnson. Approximation algorithms for combinatorial problems. In Proceedings of the fifth annual ACM symposium on Theory of computing, pp. 38–49. ACM, 1973.
 Khalil et al. (2017) Elias Khalil, Hanjun Dai, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems, pp. 6348–6358, 2017.
 Kingma & Ba (2015) Diederik P Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015.
 Kipf & Welling (2016) Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
 Kondor & Pan (2016) Risi Kondor and Horace Pan. The multiscale laplacian graph kernel. In Advances in Neural Information Processing Systems, pp. 2990–2998, 2016.
 Kool & Welling (2018) WWM Kool and M Welling. Attention solves your tsp. arXiv preprint arXiv:1803.08475, 2018.
 Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
 Kuhn et al. (2016) Fabian Kuhn, Thomas Moscibroda, and Roger Wattenhofer. Local computation: Lower and upper bounds. Journal of the ACM (JACM), 63(2):17, 2016.
 Lei et al. (2017) Tao Lei, Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Deriving neural architectures from sequence and graph kernels. In International Conference on Machine Learning, pp. 2024–2033, 2017.
 Li et al. (2015) Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.
 Liang et al. (2016) Xiaodan Liang, Xiaohui Shen, Jiashi Feng, Liang Lin, and Shuicheng Yan. Semantic object parsing with graph lstm. In European Conference on Computer Vision, pp. 125–143. Springer, 2016.

 Marcheggiani & Titov (2017) Diego Marcheggiani and Ivan Titov. Encoding sentences with graph convolutional networks for semantic role labeling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 1506–1515, 2017.
 Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
 Monti et al. (2017) Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodolà, Jan Svoboda, and Michael M Bronstein. Geometric deep learning on graphs and manifolds using mixture model cnns. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pp. 5425–5434. IEEE, 2017.
 Niepert et al. (2016) Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In International Conference on Machine Learning, pp. 2014–2023, 2016.
 Nowak et al. (2017) Alex Nowak, Soledad Villar, Afonso S Bandeira, and Joan Bruna. A note on learning algorithms for quadratic assignment with graph neural networks. arXiv preprint arXiv:1706.07450, 2017.
 Özgür et al. (2008) Arzucan Özgür, Thuy Vu, Güneş Erkan, and Dragomir R Radev. Identifying gene-disease associations using centrality on a literature mined gene-interaction network. Bioinformatics, 24(13):i277–i285, 2008.
 Perozzi et al. (2014) Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 701–710. ACM, 2014.
 Quinn et al. (2011) Christopher J Quinn, Todd P Coleman, Negar Kiyavash, and Nicholas G Hatsopoulos. Estimating the directed information to infer causal relationships in ensemble neural spike train recordings. Journal of computational neuroscience, 30(1):17–44, 2011.
 Quinn et al. (2015) Christopher J Quinn, Negar Kiyavash, and Todd P Coleman. Directed information graphs. IEEE Transactions on information theory, 61(12):6887–6909, 2015.
 Rahimzamani & Kannan (2016) Arman Rahimzamani and Sreeram Kannan. Network inference using directed information: The deterministic limit. In Communication, Control, and Computing (Allerton), 2016 54th Annual Allerton Conference on, pp. 156–163. IEEE, 2016.

 Ribeiro et al. (2016) Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. ACM, 2016.
 Scarselli et al. (2009) Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.
 Schrijver (2003) Alexander Schrijver. Combinatorial optimization: polyhedra and efficiency, volume 24. Springer Science & Business Media, 2003.
 Seo et al. (2016) Youngjoo Seo, Michaël Defferrard, Pierre Vandergheynst, and Xavier Bresson. Structured sequence modeling with graph convolutional recurrent networks. arXiv preprint arXiv:1612.07659, 2016.
 Shervashidze et al. (2011) Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, and Karsten M Borgwardt. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12(Sep):2539–2561, 2011.
 Shimizu et al. (2016) Satoshi Shimizu, Kazuaki Yamaguchi, Toshiki Saitoh, and Sumio Masuda. A fast heuristic for the minimum weight vertex cover problem. In Computer and Information Science (ICIS), 2016 IEEE/ACIS 15th International Conference on, pp. 1–5. IEEE, 2016.
 Silver et al. (2016) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.
 Such et al. (2017) Felipe Petroski Such, Shagan Sah, Miguel Dominguez, Suhas Pillai, Chao Zhang, Andrew Michael, Nathan Cahill, and Raymond Ptucha. Robust spatial filtering with graph convolutional neural networks. arXiv preprint arXiv:1703.00792, 2017.

 Tai et al. (2015) Kai Sheng Tai, Richard Socher, and Christopher D Manning. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pp. 1556–1566, 2015.
 Ugander et al. (2011) Johan Ugander, Brian Karrer, Lars Backstrom, and Cameron Marlow. The anatomy of the facebook social graph. arXiv preprint arXiv:1111.4503, 2011.
 Veličković et al. (2017) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
 Vinyals et al. (2015) Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In Advances in Neural Information Processing Systems, pp. 2692–2700, 2015.
 Vishwanathan et al. (2010) S Vichy N Vishwanathan, Nicol N Schraudolph, Risi Kondor, and Karsten M Borgwardt. Graph kernels. Journal of Machine Learning Research, 11(Apr):1201–1242, 2010.
 Weisfeiler & Lehman (1968) Boris Weisfeiler and AA Lehman. A reduction of a graph to a canonical form and an algebra arising during this reduction. Nauchno-Technicheskaya Informatsia, 2(9):12–16, 1968.
 Williamson & Shmoys (2011) David P Williamson and David B Shmoys. The design of approximation algorithms. Cambridge university press, 2011.
 Yanardag & Vishwanathan (2015) Pinar Yanardag and SVN Vishwanathan. Deep graph kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1365–1374. ACM, 2015.
 Ying et al. (2018) Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. Graph convolutional neural networks for web-scale recommender systems. arXiv preprint arXiv:1806.01973, 2018.
Appendix A Background: Graph Convolutional Neural Networks
An ideal graph representation is one that captures all innate structures of the graph relevant to the task at hand, and can also be learned via gradient descent methods. However, this is challenging since the relevant structures could range anywhere from local attributes (for example, node degrees) to long-range dependencies spanning a large portion of the graph (for example, whether there exists a path between two vertices) (Kuhn et al., 2016). Such broad scale variation is also a well-known issue in computer vision (image classification, segmentation, etc.), wherein convolutional neural network (CNN) designs have been used quite successfully (Krizhevsky et al., 2012). Perhaps motivated by this success, recent research has focused on generalizing the traditional CNN architecture to develop designs for graph convolutional neural networks (GCNN) (Bruna et al., 2014; Niepert et al., 2016). By likening the relationship between adjacent pixels of an image to that of adjacent nodes in a graph, the GCNN seeks to emulate CNNs by defining localized ‘filters’ with shared parameters.
Current GCNN filter designs can be classified into one of two categories: spatial (Kipf & Welling, 2016; Khalil et al., 2017; Nowak et al., 2017), and spectral (Defferrard et al., 2016). For an integral hyperparameter , filters in either category process information from a local neighborhood surrounding a node to compute the output. Here we consider localized spectral filters such as proposed in Defferrard et al. (2016). The difference between the spatial and spectral versions arises in the precise way in which the aggregated local information is combined.
Spatial GCNN. For input feature vector at each node of a graph , a spatial filtering operation is the following:
(3) 
where is the filter output, and are learnable parameters, and is a non-linear activation function that is applied elementwise. is the normalized adjacency matrix, and is the diagonal matrix of vertex degrees. Use of the unnormalized adjacency matrix is also common. The power of the adjacency matrix selects nodes a distance of at most hops from . ReLU is a common choice for . We highlight two aspects of spatial GCNNs: (i) the feature vectors are aggregated from neighboring nodes directly specified through the graph topology, and (ii) the aggregated features are summarized via an addition operation.
Spectral GCNN. Spectral GCNNs use the notion of graph Fourier transforms to define the convolution operation as the inverse transform of multiplicative filtering in the Fourier domain. Since this is a non-local operation potentially involving data across the entire graph, and moreover it is computationally expensive to compute the transforms, recent work has focused on approximations to produce a local spectral filter of the form
(4) 
where is the normalized Laplacian of the graph, denotes the entry at the row corresponding to vertex and column corresponding to vertex in , and are parameters (Defferrard et al., 2016; Kipf & Welling, 2016). As in the spatial case, definitions using unnormalized version of Laplacian matrix are also used. is typically the identity function here. The function in equation 4 is a local operation because the th power of the Laplacian, at any row , has a support no larger than the hop neighborhood of . Thus, while the aggregation is still localized, the feature vectors are now weighted by the entries of the Laplacian before summation.
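The two filter families described above can be sketched numerically. The block below is a minimal illustration, not the authors' implementation: it assumes that both filters have the polynomial form sigma(sum_k M^k X W_k + b), with M the normalized adjacency matrix D^{-1/2} A D^{-1/2} in the spatial case and M the normalized Laplacian I - D^{-1/2} A D^{-1/2} in the spectral case. The function names, the path-graph example and the weight shapes are hypothetical.

```python
import numpy as np

def normalized_adjacency(A):
    """Return D^{-1/2} A D^{-1/2} (assumes no isolated vertices)."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def polynomial_filter(M, X, weights, bias, sigma=lambda z: z):
    """Compute sigma(sum_k M^k X W_k + bias); row v is the output at vertex v."""
    out = np.zeros((X.shape[0], weights[0].shape[1]))
    M_pow = np.eye(M.shape[0])            # M^0
    for W in weights:
        out += M_pow @ X @ W              # aggregate over the k-hop support of M^k
        M_pow = M_pow @ M
    return sigma(out + bias)

def relu(z):
    return np.maximum(z, 0.0)

# Hypothetical example: path graph 0 - 1 - 2 - 3 with 3 input features per vertex.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_norm = normalized_adjacency(A)
L_norm = np.eye(4) - A_norm

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
weights = [rng.normal(size=(3, 2)) for _ in range(2)]   # powers 0 and 1 only
bias = np.zeros(2)

Y_spatial = polynomial_filter(A_norm, X, weights, bias, sigma=relu)
Y_spectral = polynomial_filter(L_norm, X, weights, bias)  # identity activation
```

With only powers 0 and 1 in the sum, the output at vertex 0 depends on vertices 0 and 1 alone, which is the locality property both paragraphs rely on.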
Spectral and Spatial GCNN are equivalent. The distinction between spatial and spectral convolution designs is typically made owing to their seemingly different definitions. However we show that both designs are mathematically equivalent in terms of their representation capabilities.
Proposition 3.
Proof.
Consider a vertex set and dimensional vertex states and at vertex . Let and be the matrices obtained by concatenating the state vectors of all vertices. Then the spatial transformation function of equation 3 can be written as
(5) 
while the spectral transformation function of equation 4 can be written as
(6)  
(7)  
(8)  
(9) 
Equation 7 follows from the definition of the normalized Laplacian matrix, and equation 8 follows from the binomial expansion. To make the transformations in equation 5 and equation 9 equal, we can set
(10) 
and check if there are any feasible solutions for the primed quantities. Clearly there are, with one possible solution being and
(11)  
(12) 
Thus for any choice of values for for there exists for such that the spatial and spectral transformation functions are equivalent. The other direction (when and are fixed), is similar and straightforward. ∎
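The binomial-expansion argument in the proof can be checked numerically. The sketch below assumes the polynomial filter form sum_k M^k X W_k (bias and activation omitted, since they are shared by both sides) and the relation L = I - Ã between the normalized Laplacian and the normalized adjacency matrix. The function names are hypothetical, and the weight map is one consistent choice derived from (I - Ã)^k = sum_j C(k, j) (-1)^j Ã^j; it is not claimed to be the exact form of equations 11 and 12.

```python
import numpy as np
from math import comb

def poly_filter(M, X, weights):
    """Pre-activation filter output sum_k M^k X W_k."""
    out = np.zeros((X.shape[0], weights[0].shape[1]))
    M_pow = np.eye(M.shape[0])
    for W in weights:
        out += M_pow @ X @ W
        M_pow = M_pow @ M
    return out

def swap_basis(weights):
    """Re-express a polynomial in (I - M) as a polynomial in M.

    New weight j collects the coefficient of M^j from every (I - M)^k with k >= j.
    """
    K = len(weights)
    return [sum(comb(k, j) * (-1) ** j * weights[k] for k in range(j, K))
            for j in range(K)]

# Hypothetical example: 5-cycle, which is 2-regular, so the normalized adjacency is A / 2.
n = 5
A = np.zeros((n, n))
for v in range(n):
    A[v, (v + 1) % n] = A[(v + 1) % n, v] = 1.0
A_norm = A / 2.0
L_norm = np.eye(n) - A_norm

rng = np.random.default_rng(1)
X = rng.normal(size=(n, 4))
spectral_weights = [rng.normal(size=(4, 3)) for _ in range(3)]
spatial_weights = swap_basis(spectral_weights)

same_output = np.allclose(poly_filter(L_norm, X, spectral_weights),
                          poly_filter(A_norm, X, spatial_weights))
```

Because Ã = I - L as well, the same map converts spatial weights back into spectral ones, which mirrors the remark that the other direction of the proof is similar.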
Depending on the application, the convolutional layers may be supplemented with pooling and coarsening layers that summarize outputs of nearby convolutional filters to form a progressively more compact spatial representation of the graph. This is useful in classification tasks where the desired output is one out of a few possible classes (Bruna et al., 2014). For applications requiring decisions at a per-node level (e.g., community detection), a popular strategy is to have multiple repeated convolutional layers that compute vector representations for each node, which are then processed to make a decision (Khalil et al., 2017; Bruna & Li, 2017; Nowak et al., 2017). The conventional wisdom here is to have as many layers as the diameter of the graph, since filters at each layer aggregate information only from nearby nodes. Such a strategy is sometimes compared to the message passing algorithm (Gilmer et al., 2017), though the formal connections are not clear, as noted in Nowak et al. (2017). Finally, the GCNNs described so far are all end-to-end differentiable and can be trained using mainstream techniques for supervised, semi-supervised or reinforcement learning applications.
Other lines of work use ideas inspired by word embeddings for graph representation (Grover & Leskovec, 2016; Perozzi et al., 2014). Post-GCNN representation, LSTM-RNNs have been used to analyze time-series data structured over a graph. Seo et al. (2016) propose a model which combines GCNN and RNN to predict moving MNIST data. Liang et al. (2016) design a graph LSTM for semantic object parsing in images.
Appendix B Section 3 Proofs
b.1 Proof of Theorem 1
Proof.
Consider a Graph2Seq trajectory on graph according to equation 1 in which the vertex states are initialized randomly from some distribution. Let (resp. ) denote the random variable (resp. realization) corresponding to the state of vertex at time . For time and a set , let denote the collection of random variables ; will denote the realizations.
An information-theoretic estimator to output the graph structure by looking at the trajectory is the directed information graph considered in Quinn et al. (2015). Roughly speaking, the estimator evaluates the conditional directed information for every pair of vertices , and declares an edge only if it is positive (see Definition 3.4 in Quinn et al. (2015) for details). Estimating conditional directed information efficiently from samples is itself an active area of research (Quinn et al., 2011), but simple plug-in estimators with a standard kernel density estimator will be consistent. Since the theorem statement did not specify sample efficiency (i.e., how far down the trajectory we have to go before estimating the graph with a required probability), the inference algorithm is simple and polynomial in the length of the trajectory. The key question is whether the directed information graph is indeed the same as the underlying graph . Under some conditions on the graph dynamics (discussed below in Properties 1–3), this holds, and it suffices for us to show that the dynamics generated according to equation 1 satisfy those conditions.
Property 1.
For any , for all .
This is a technical condition that is required to avoid degeneracies that may arise in deterministic systems. Clearly Graph2Seq’s dynamics satisfies this property due to the additive i.i.d. noise in the transformation functions.
Property 2.
The dynamics is strictly causal, that is factorizes as .
This is another technical condition that is readily seen to be true for Graph2Seq. The proof also follows from Lemma 3.1 in Quinn et al. (2015).
Property 3.
is the minimal generative model graph for the random processes .
Notice that the transformation operation equation 1 in our graph causes to factorize as
(13) 
for any , where is the set of neighboring vertices of in .
Now consider any other graph .
will be called a minimal generative model for the random processes if
(1) there exists an alternative factorization of as
(14) 
for any , where is the set of neighbors of in , and
(2) there does not exist any other graph with and a factorization of as for any , where is the set of neighbors of in .
Intuitively, a minimal generative model is the smallest spanning graph that can generate the observed dynamics. To show that is indeed a minimal generative model, let us suppose the contrary and assume there exists another graph with and a factorization of as in equation 14. In particular, let be any node such that . Then by marginalizing the right hand sides of equation 13 and equation 14, we get
(15) 
Note that equation 15 needs to hold for all possible realizations of the random variables and . However if the parameters and in equation 1 are generic, this is clearly not true. To see this, let be any vertex. By fixing the values of it is possible to find two values for , say and , such that
(16) 
As such, the Gaussian distributions in these two cases will have different means. However, the right-hand side of equation 15 does not depend on at all, resulting in a contradiction. Thus is a minimal generative model of . Thus Property 3 holds as well. Now the result follows from the following theorem.
Theorem 3 (Theorem 3.6, Quinn et al. (2015)).
∎
b.2 Proof of Proposition 1
Proof.
Consider 4-regular graphs and with vertices and edges and respectively. Then under a deterministic evolution rule, since and are 4-regular graphs, the trajectory will be identical at all nodes across the two graphs. However the graphs and are structurally different. For example, has a minimum vertex cover size of 5, while for it is 6. As such, if any one of the graphs (, say) is provided as input to be represented, then from the representation it is impossible to exactly recover ’s structure. ∎
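The argument can be illustrated on a small example. The sketch below does not use the two graphs from the proof (their vertex and edge sets are not reproduced here); instead it assumes two hypothetical 4-regular graphs on 8 vertices, a circulant graph and the complete bipartite graph K_{4,4}, and an assumed noise-free update rule in which each vertex applies tanh to a linear map of the sum of its neighbors' states. All names are illustrative.

```python
import itertools
import numpy as np

def circulant_graph(n, offsets):
    """Adjacency matrix with v adjacent to v + s and v - s (mod n) for each offset s."""
    A = np.zeros((n, n))
    for v in range(n):
        for s in offsets:
            A[v, (v + s) % n] = A[(v + s) % n, v] = 1.0
    return A

def complete_bipartite(m):
    """Adjacency matrix of K_{m,m}."""
    A = np.zeros((2 * m, 2 * m))
    A[:m, m:] = 1.0
    A[m:, :m] = 1.0
    return A

def deterministic_trajectory(A, W, b, steps):
    """All vertices start from the same state; no noise is added at any step."""
    X = np.ones((A.shape[0], W.shape[0]))
    trajectory = [X]
    for _ in range(steps):
        X = np.tanh(A @ X @ W.T + b)      # sum neighbor states, then transform
        trajectory.append(X)
    return np.stack(trajectory)

def min_vertex_cover_size(A):
    """Brute-force minimum vertex cover size (fine for 8 vertices)."""
    n = A.shape[0]
    edges = [(u, v) for u in range(n) for v in range(u + 1, n) if A[u, v]]
    for size in range(n + 1):
        for cover in itertools.combinations(range(n), size):
            chosen = set(cover)
            if all(u in chosen or v in chosen for u, v in edges):
                return size

G1 = circulant_graph(8, offsets=(1, 2))
G2 = complete_bipartite(4)

rng = np.random.default_rng(0)
W, b = 0.3 * rng.normal(size=(4, 4)), rng.normal(size=4)
traj1 = deterministic_trajectory(G1, W, b, steps=10)
traj2 = deterministic_trajectory(G2, W, b, steps=10)

identical = np.allclose(traj1, traj2)
cover_sizes = (min_vertex_cover_size(G1), min_vertex_cover_size(G2))
```

In this sketch the trajectories coincide at every vertex and every step, while the two graphs have different minimum vertex cover sizes, so no function of the trajectory alone can recover the cover size.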
b.3 Proof of Proposition 2
Proof.
Kipf & Welling (2016) use a two-layer graph convolutional network, in which each layer uses convolutional filters that aggregate information from the immediate neighborhood of the vertices. This corresponds to a 2-local representation function in our computational model. Following this step, the values at the vertices are aggregated using softmax to compute a probability score at each vertex. Since this procedure is independent of the structure of the input graph, it is a valid gathering function in local-gather and the overall architecture belongs to a 2-local-gather model.
Similarly, Khalil et al. (2017) also consider convolutional layers in which the neurons have a spatial locality of one. Four such convolutional layers are cascaded together, the outputs of which are then processed by a separate Q-learning network. Such a neural architecture is an instance of the 4-local-gather model. ∎
b.4 Proof of Theorem 2
Proof.
Consider a family of undirected, unweighted graphs. Let denote a function that computes the size of the minimum vertex cover of graphs from . For fixed, let denote any algorithm from the local-gather model, with a representation function and aggregating function (see the beginning of Section 3 for explanations of and ). We present two graphs and such that , but the set of computed states is the same for both the graphs (). Now, since the gather function operates only on the set of computed states (by definition of our model), this implies cannot distinguish between and , thus proving our claim.
For simplicity, we fix (the example easily generalizes for larger ). We consider the graphs and as shown in Fig. 5 and 5 respectively. To construct these graphs, we first consider binary trees and each having 7 nodes. is a completely balanced binary tree with a depth of 2, whereas is a completely imbalanced binary tree with a depth of 3. Now, to get and , we replace each node in and by a chain of 3 nodes (more generally, by an odd number of nodes larger than ). At each location in (), the head of the chain of nodes connects to the tail of the parent’s chain of nodes, as shown in Fig. 5.
The sizes of the minimum vertex cover of and are 9 and 10 respectively. However, there exists a one-to-one mapping between the vertices of and the vertices of such that the hop neighborhoods around corresponding vertices in and are the same. For example, in Fig. 5 and 5 the pair of nodes shaded in red have an identical 2-hop neighborhood (shaded in yellow). As such, the representation function, which for any node depends only on its hop neighborhood, will be the same for corresponding pairs of nodes in and .
Finally, the precise mapping between pairs of nodes in and is obtained as follows. First consider a simple mapping between pairs of nodes in and in which (i) the 4 leaf nodes in are mapped to the leaf nodes in , (ii) the root of is mapped to the root of and (iii) the 2 interior nodes of are mapped to the interior nodes of . We generalize this mapping to and in two steps: (1) mapping chains of nodes in to chains of nodes in , according to the map, and (2) within corresponding chains of nodes, we map nodes according to order (head-to-head, tail-to-tail, etc.). ∎
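The construction in this proof can be reproduced and checked by a short script. The sketch below is one reading of the description: each of the 7 tree nodes becomes a chain of 3 vertices, a chain's head is joined to the tail of its parent's chain, and the locality parameter is taken to be 2, as in the 2-hop example above. The node numbering, helper names and the string encoding of neighborhoods are assumptions made for illustration.

```python
def chain_expand(parent, chain_len=3):
    """Replace each tree node by a chain; a chain's head attaches to its parent's tail."""
    adj = {}

    def add_edge(u, v):
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    for node, par in parent.items():
        head = node * chain_len
        for i in range(chain_len - 1):
            add_edge(head + i, head + i + 1)
        if par is not None:
            add_edge(par * chain_len + chain_len - 1, head)
    return adj

def mvc_size_tree(adj, root=0):
    """Exact minimum vertex cover size of a tree by dynamic programming."""
    def solve(v, par):
        take, skip = 1, 0                 # v in the cover / v not in the cover
        for u in adj[v]:
            if u != par:
                t, s = solve(u, v)
                take += min(t, s)
                skip += t                 # if v is skipped, each child must be taken
        return take, skip
    return min(solve(root, None))

def neighborhood_shape(adj, v, radius, par=None):
    """Canonical string of the radius-hop neighborhood of v (valid because adj is a tree)."""
    if radius == 0:
        return "()"
    kids = sorted(neighborhood_shape(adj, u, radius - 1, v)
                  for u in adj[v] if u != par)
    return "(" + "".join(kids) + ")"

# Trees given as child -> parent maps (node 0 is the root).
balanced = {0: None, 1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2}
imbalanced = {0: None, 1: 0, 2: 0, 3: 2, 4: 2, 5: 4, 6: 4}
G1, G2 = chain_expand(balanced), chain_expand(imbalanced)

cover_sizes = (mvc_size_tree(G1), mvc_size_tree(G2))
shapes1 = sorted(neighborhood_shape(G1, v, 2) for v in G1)
shapes2 = sorted(neighborhood_shape(G2, v, 2) for v in G2)
same_local_views = shapes1 == shapes2
```

Comparing the sorted lists of neighborhood shapes is equivalent to asking whether a one-to-one vertex mapping with matching 2-hop neighborhoods exists, which is the property the gather function cannot see past.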
b.5 Sequential Heuristic to Compute MVC on Trees
Consider any unweighted, undirected tree . We let the state at any node be represented by a two-dimensional vector . For any , takes values over the set while is in . Here is a parameter that we choose to be less than one over the maximum degree of the graph. Semantically stands for whether vertex is ‘active’ () or ‘inactive’ (). Similarly stands for whether has been selected to be part of the vertex cover (), has not been selected to be part of the cover (), or a decision has not yet been made (). Initially and for all vertices. The heuristic proceeds in rounds, wherein at each round any vertex updates its state based on the state of its neighbors as shown in Algorithm 1.
The update rules at vertex are (1) if is a leaf or if at least one of ’s neighbors is active, then becomes active; (2) if is active, and if at least one of ’s active neighbors has not been chosen in the cover, then is chosen to be in the cover; (3) if all of ’s neighbors are inactive, then remains inactive and no decision is made on .
At the end of the local computation rounds, the final vertex cover size is computed by first averaging the time-series at each (with translation and scaling as shown in Algorithm 1), and then summing over all vertices.
Appendix C Section 4 Details
c.1 Reinforcement Learning Formulation
Let be an input graph instance for the optimization problems mentioned above. Note that the solution to each of these problems can be represented by a set . In the case of the minimum vertex cover (MVC) and maximum independent set (MIS), the set denotes the desired optimal cover and independent set respectively; for max cut (MC) we let denote the optimal cut. For the following let be the objective function of the problem (i.e., MVC, MC or MIS) that we want to maximize, and let be the set of feasible solutions.
Dynamic programming formulation. Now, consider a dynamic programming heuristic in which the subproblems are defined by the pair , where is the graph and is a subset of vertices that have already been included in the solution. For a vertex let denote the marginal utility gained by selecting vertex . Such a function satisfies the Bellman equations given by
(17) 
It is easily seen that computing the functions solves the optimization problem, as . However, exactly computing functions may be computationally expensive. One approach towards approximately computing is to fit it to a (polynomial time computable) parametrized function, in a way that an appropriately defined error metric is minimized. This approach is called Q-learning in the reinforcement learning (RL) paradigm, and is described below.
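For very small graphs the dynamic program can be solved exactly, which makes the recursion concrete. The sketch below is illustrative only: it assumes the standard Bellman form, in which the value of adding a vertex equals the immediate gain in the objective plus the best value achievable from the resulting subproblem, and it specializes to maximum independent set, where the gain of every feasible vertex is 1. The function names and the 5-cycle example are hypothetical.

```python
from functools import lru_cache

def make_exact_q(adj):
    """Exact Q-function for maximum independent set on a graph {v: set(neighbors)}."""

    def feasible(S):
        # A vertex is feasible if it is neither selected nor adjacent to a selected vertex.
        blocked = set(S).union(*(adj[u] for u in S))
        return [v for v in adj if v not in blocked]

    @lru_cache(maxsize=None)
    def q(S, v):
        # Immediate reward 1 for growing the independent set by one vertex,
        # plus the best value of the subproblem defined by S + {v}.
        S_next = S | {v}
        return 1 + max((q(S_next, u) for u in feasible(S_next)), default=0)

    return q, feasible

# Hypothetical example: cycle on 5 vertices.
cycle5 = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}
q, feasible = make_exact_q(cycle5)

empty = frozenset()
best_value = max(q(empty, v) for v in feasible(empty))
```

The number of subproblems grows exponentially with the number of vertices, which is why the text turns to fitting a parametrized approximation instead.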
State, action and reward. We consider a reinforcement learning policy in which the solution set is generated one vertex at a time. The algorithm proceeds in rounds, where at round the RL agent is presented with the graph and the set of vertices chosen so far. Based on this state information, the RL agent outputs an action . The set of selected vertices is updated as . Initially . Every time the RL agent performs an action it also incurs a reward . Note that the function is well-defined only if and are such that there exists an and . To enforce this let denote the set of feasible actions at time . Each round, the learning agent chooses an action . The algorithm terminates when .
Policy. The goal of the RL agent is to learn a policy for selecting actions at each time, such that the cumulative reward incurred is maximized. A measure of the generalization capability of the policy is how well it is able to maximize cumulative reward for different graph instances from a collection (or from a distribution) of interest.
Q-learning. Let denote the approximation of obtained using a parametrized function with parameters . Further let denote a sequence of (state, action) tuples available as training examples. We define empirical loss as
(18) 
and minimize using stochastic gradient descent. The solution of the Bellman equations (equation 17) is a stationary point for this optimization.
Remark. Heuristics such as ours, which select vertices one at a time in an irreversible fashion, are studied as ‘priority greedy’ algorithms in the computer science literature (Borodin et al., 2003; Angelopoulos & Borodin, 2003). The fundamental (worst-case) limits of priority greedy algorithms for minimum vertex cover and maximum independent set are discussed in Borodin et al. (2010).
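The empirical loss used in Q-learning can be sketched as follows. Since the loss expression is not reproduced here, the block assumes the common one-step squared Bellman error: for each training tuple, the squared difference between the current estimate and the reward plus the best estimated value at the next state. The function name, the tuple layout and the 3-vertex path example (maximum independent set, reward 1 per added vertex) are assumptions for illustration.

```python
def bellman_loss(q, transitions):
    """Sum of squared one-step Bellman errors.

    q(state, action) returns the current estimate; each transition is
    (state, action, reward, next_state, feasible_next_actions).
    """
    loss = 0.0
    for state, action, reward, next_state, next_actions in transitions:
        target = reward + max((q(next_state, a) for a in next_actions), default=0.0)
        loss += (q(state, action) - target) ** 2
    return loss

# Hypothetical example: path graph 0 - 1 - 2, maximum independent set.
EMPTY = frozenset()
transitions = [
    (EMPTY, 0, 1.0, frozenset({0}), [2]),           # after picking 0, only 2 is feasible
    (frozenset({0}), 2, 1.0, frozenset({0, 2}), []),
    (EMPTY, 1, 1.0, frozenset({1}), []),            # picking 1 blocks both neighbors
]

exact_table = {(EMPTY, 0): 2.0, (frozenset({0}), 2): 1.0, (EMPTY, 1): 1.0}
perturbed_table = dict(exact_table)
perturbed_table[(EMPTY, 0)] = 1.5

loss_exact = bellman_loss(lambda s, a: exact_table[(s, a)], transitions)
loss_perturbed = bellman_loss(lambda s, a: perturbed_table[(s, a)], transitions)
```

Under this assumed loss, the table that satisfies the Bellman recursion attains zero loss, consistent with the statement that the Bellman solution is a stationary point of the optimization.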
c.2 Seq2Vec Update Equations
Seq2Vec. The sequence is processed by a gated recurrent network that sequentially reads vectors at each time index for all . Standard GRU (Chung et al., 2014). For timestep , let be the dimensional cell state, be the cell input and be the forgetting gate, for each vertex . Each timestep a fresh input is computed based on the current states of ’s neighbors in . The cell state is updated as a convex combination of the freshly computed inputs and the previous cell state , where the weighting is done according to a forgetting value that is also computed based on the current vertex states. The update equations for the input vector, forgetting value and cell state are chosen as follows:
(19) 
where and are trainable parameters, , and denotes the dimensional all-ones vector, and is elementwise multiplication. is initialized to all-zeros for every . The cell state at the final timestep is the desired vector summary of the Graph2Seq sequence.
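The gated update described above can be sketched in a few lines. The exact update equations are not reproduced here, so the block assumes one plausible form: the fresh input is a ReLU of a linear map of the summed neighbor states, the forgetting value is a sigmoid of another linear map of the same quantity, and the cell state is their elementwise convex combination with the previous cell state. The parameter names, shapes and the path-graph example are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def seq2vec(A, X_seq, W_in, b_in, W_f, b_f):
    """Summarize a sequence of vertex-state matrices into one vector per vertex.

    A is the (n x n) adjacency matrix; X_seq[t] is the (n x d) state matrix at time t.
    """
    n = A.shape[0]
    y = np.zeros((n, W_in.shape[1]))                 # cell state starts at all-zeros
    for X in X_seq:
        neighbor_sum = A @ X                         # states of each vertex's neighbors
        fresh = np.maximum(neighbor_sum @ W_in + b_in, 0.0)
        forget = sigmoid(neighbor_sum @ W_f + b_f)   # values in (0, 1)
        y = forget * fresh + (1.0 - forget) * y      # elementwise convex combination
    return y

# Hypothetical example: path graph on 4 vertices, 3-dimensional states, 6 time steps.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
X_seq = [rng.normal(size=(4, 3)) for _ in range(6)]
W_in, b_in = rng.normal(size=(3, 5)), np.zeros(5)
W_f, b_f = rng.normal(size=(3, 5)), np.zeros(5)

summary = seq2vec(A, X_seq, W_in, b_in, W_f, b_f)
```

Because the loop consumes one time step at a time, the same parameters can summarize sequences of any length, which is what lets the sequence length vary between training and test graphs.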
Appendix D Evaluation Details
d.1 Heuristics compared
We compare G2SRNN against:
(1) Structure2Vec (Khalil et al., 2017), a spatial GCNN with depth of 5.
(2) GraphSAGE (Hamilton et al., 2017a) using (a) GCN, (b) mean and (c) pool aggregators, with the depth restricted to 2 in each case.
(3) WL kernel NN (Lei et al., 2017), a neural architecture that embeds the WL graph kernel, with a depth of 3 and width of 4 (see Lei et al. (2017) for details).
(4) WL kernel embedding, in which the feature map corresponding to the WL subtree kernel of the subgraph in a 5-hop neighborhood around each vertex is used as its vertex embedding (Shervashidze et al., 2011).
Since we test on large graphs, instead of using a learned label lookup dictionary we use a standard SHA hash function for label shortening at each step.
In each of the above models, the outputs of the last layer are fed to a Q-learning network, and trained the same way as G2SRNN.
(5) Greedy algorithms.
We consider greedy heuristics (Williamson & Shmoys, 2011) for each of MVC, MC and MIS.
(6) List heuristic. A fast list-based algorithm proposed recently in Shimizu et al. (2016) for MVC and MIS.
(7) Matching heuristic. A 2-approximation algorithm for MVC (Williamson & Shmoys, 2011).
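Two of the baselines in this list are simple enough to sketch. The first function illustrates WL label refinement with a hash used for label shortening, as mentioned under baseline (4); the choice of SHA-256, the 16-character truncation and the label format are assumptions, since the text only says a standard SHA hash is used. The second function is the textbook maximal-matching heuristic for vertex cover, which takes both endpoints of every edge in a greedily built maximal matching.

```python
import hashlib

def wl_refine(adj, labels, rounds):
    """Weisfeiler-Lehman relabeling; each new label is a shortened hash of the
    vertex's label together with the sorted labels of its neighbors."""
    labels = dict(labels)
    for _ in range(rounds):
        new_labels = {}
        for v, nbrs in adj.items():
            signature = labels[v] + "|" + ",".join(sorted(labels[u] for u in nbrs))
            new_labels[v] = hashlib.sha256(signature.encode()).hexdigest()[:16]
        labels = new_labels
    return labels

def matching_vertex_cover(edges):
    """Maximal-matching heuristic: add both endpoints of each uncovered edge."""
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.update((u, v))
    return cover

# Hypothetical example: path graph 0 - 1 - 2 - 3 with uniform initial labels.
path_adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
path_edges = [(0, 1), (1, 2), (2, 3)]

refined = wl_refine(path_adj, {v: "0" for v in path_adj}, rounds=1)
cover = matching_vertex_cover(path_edges)
```

Hashing keeps every label at a fixed length without maintaining a label dictionary, at the cost of a negligible collision probability.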
d.2 Adversarial Training
So far we have seen the generalization capabilities of a G2SRNN model trained on small Erdos-Renyi graphs. In this section we ask: is even better generalization possible by training on a different graph family? The answer is in the affirmative. We show that by training on planted vertex-cover graph examples (a class of ‘hard’ instances for MVC) we can realize further generalizations. A planted-cover example is a graph in which a small graph is embedded (‘planted’) within a larger graph such that the vertices of the planted graph constitute the optimal minimum vertex cover. Figure 6 shows the result of testing G2SRNN models trained under both Erdos-Renyi and planted vertex cover graphs. While both models show good scalability in Erdos-Renyi and regular graphs, on bipartite graphs and worst-case graphs the model trained on planted-cover graphs shows even stronger consistency by staying within 1% of optimal.
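One way to generate planted-cover instances is sketched below. The generator actually used for the experiments is not described, so this construction is an assumption: the first k vertices are the planted cover, every edge touches at least one of them, and each planted vertex gets a private neighbor outside the cover. The private edges form a matching of size k, so no cover can use fewer than k vertices, which makes the planted set optimal. All names and parameters are hypothetical.

```python
import itertools
import random

def planted_cover_graph(n, k, p, seed=0):
    """Edge list of a random graph on n vertices whose vertices 0..k-1 form a
    minimum vertex cover (requires n >= 2k)."""
    assert 2 * k <= n
    rng = random.Random(seed)
    # Private neighbor k + i for planted vertex i: a matching of size k.
    edges = {(i, k + i) for i in range(k)}
    # Random edges, each with at least one endpoint inside the planted set.
    for u in range(k):
        for v in range(u + 1, n):
            if rng.random() < p:
                edges.add((u, v))
    return sorted(edges)

def min_vertex_cover_size(n, edges):
    """Brute-force check, only suitable for small n."""
    for size in range(n + 1):
        for cover in itertools.combinations(range(n), size):
            chosen = set(cover)
            if all(u in chosen or v in chosen for u, v in edges):
                return size

edges = planted_cover_graph(n=10, k=4, p=0.3)
optimum = min_vertex_cover_size(10, edges)
```

The brute-force check is only there to confirm the planted set is optimal on a small instance; for training-scale graphs the optimum is known by construction.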
d.3 Geometry of Encoding and Semantics of Graph2Seq
To understand which aspects of solving MVC are learnt by Graph2Seq, we conduct empirical studies of the dynamics of the state vectors, and present techniques for interpreting Graph2Seq along with their semantic interpretations.
In the first set of experiments, we investigate the vertex state vector sequence. We consider graphs of size up to 50 and of the types discussed in Section 5. For each fixed graph, we observe the evolution of the vertex state (Equation 1) to a depth of 10 layers.
(1) Dimension collapse. As in the random parameter case, we observe that on average more than 8 of the 16 dimensions of the vertex state become zeroed out after 4 or 5 layers.
(2) Principal components’ alignment. The principal component direction of the vertex state vectors at each layer converges. Fig. 7 shows this effect for the graph shown in Fig. 7. We plot the absolute value of the inner product between the principal component direction at each layer and the principal component direction at layer 10.
(3) Principal component scores and local connectivity. The component of the vertex state vectors along the principal direction roughly correlates with how well the vertex is connected to the rest of the graph. We demonstrate this again for the graph shown in Fig. 7, in Fig. 7.
(4) Optimal depth. We study the effect of depth on approximation quality for the four graph types being tested (with size 50); we plot the vertex cover quality returned by Graph2Seq as we vary the number of layers up to 25. Fig. 8 plots the results of this experiment. There is no convergence behavior, but it is nevertheless apparent that different graphs work optimally at different layer values. While the optimal layer value is 4 or 5 for random bipartite and random regular graphs, the worst-case greedy example requires 15 rounds. This experiment underscores that a flexible number of layers is better than a fixed number; this flexibility is enabled only by the time-series nature of Graph2Seq and is inherently missed by the fixed-depth GCNN representations in the literature.
(5) Function semantics. Recall that the function in Equation 2 comprises two terms. The first term is the same for all vertices and includes a sum over the vectors of all vertices. The second term depends on the vector of the vertex being considered. In this experiment we plot these two terms at the very first layer of the learning algorithm (on a planted vertex cover graph of size 15, the same type as in the training set) and make the following observations: (a) the values of both terms are close to being integers, and the first term has a value that is one less than the negative of the minimum vertex cover size. (b) For any given vertex, the second term is binary valued, taking values in {0, 1}: it is one if the vertex is part of an optimum vertex cover, and zero otherwise. Thus the neural network, in principle, computes the complete set of vertices in the optimum cover at the very first round itself.
(6) Visualizing the learning dynamics. The above observations suggest ‘visualizing’ how our learning algorithm proceeds in each layer of the evolution through the lens of the per-vertex second term. In this experiment, we consider size-15 planted vertex cover graphs on (i) Graph2Seq, and (ii) the fixed-depth GCNN trained on planted vertex cover graphs. Fig. 8 shows the results of this experiment. The planted vertex cover graph considered for these figures has a known optimal vertex cover. We center the values at each layer (subtract the mean) and threshold them to create the visualization. A dark green color signifies that the vertex has a high value, while yellow signifies a low value. We can see that in Graph2Seq the heuristic is able to compute the optimal cover, and moreover this answer does not change with more rounds. The fixed-depth GCNN has a non-convergent answer, which oscillates between complementary sets of vertices. Takeaway message: having an upper LSTM layer in the learning network is critical to identify when an optimal solution is reached in the evolution, and to “latch on” to it.
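The diagnostics in items (1) and (2) can be computed with a few lines of NumPy. The sketch below is illustrative only: it assumes the vertex states of each layer are available as an array of shape (number of vertices, state dimension), that the states are mean-centered before extracting the principal direction, and that a dimension counts as zeroed out when it is numerically zero for every vertex. The function names and the tolerance `tol` are hypothetical.

```python
import numpy as np


def principal_direction(states):
    """First principal direction of a (num_vertices, dim) state matrix."""
    centered = states - states.mean(axis=0, keepdims=True)
    # Rows of vt are the right singular vectors, i.e. principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]


def alignment_with_final_layer(states_per_layer):
    """|<pc_t, pc_T>| for each layer t, where T is the last layer.

    The absolute value removes the sign ambiguity of singular vectors.
    """
    directions = [principal_direction(s) for s in states_per_layer]
    final = directions[-1]
    return np.array([abs(float(d @ final)) for d in directions])


def zeroed_dimension_fraction(states, tol=1e-6):
    """Fraction of state dimensions that are (numerically) zero for all vertices."""
    return float(np.mean(np.all(np.abs(states) < tol, axis=0)))
```

Given a list `states_per_layer` with one array per layer, `alignment_with_final_layer(states_per_layer)` yields a curve of the kind described in item (2), ending at a value of one for the last layer by construction.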