Graphs are one of the most flexible data structures for capturing relational information. Classical machine learning models such as standard neural networks or recurrent neural networks are not designed to handle graph input or output directly. To learn from a flexible data structure like a graph, Gori et al. [gori2005new] introduced the concept of a graph neural network (GNN). They represented the GNN as a recursive neural network where nodes are treated as state vectors, and the relationships among these nodes are captured by the edges. Scarselli et al. [scarselli2008graph] extended the notion of unfolding equivalence, which allows the approximation property of feed-forward networks (Scarselli and Tsoi [scarselli1998universal]) to be transferred to GNNs.
Many combinatorial problems arise naturally in various applications, many of which are known to be NP-complete. In 1972, the seminal paper [karp1972reducibility] provided a list of NP-complete problems, many of which are graph-based. GNNs offer an alternative to traditional heuristic approximation algorithms; indeed the initial GNN model [scarselli2008graph] was used to approximate solutions to two classical graph-based problems: subgraph isomorphism and clique detection.
An interesting line of research is to explore the capabilities and limitations of graph neural networks as alternative approaches to existing heuristics and approximation algorithms to theoretical graph problems. This can open up modern methods to solve such problems at scale in practice, and also it can shed light on the fundamental capacity of such learning methods.
Graph neural networks have been a very active area of research during the past few years. To name a few, Lei et al. [lei2017deriving] introduced recurrent neural operations for graphs with their associated kernel spaces. The notion of studying graph neural models as Message Passing Neural Networks is discussed by Gilmer et al. [gilmer2017neural]. Garg et al. [garg2020generalization] generalized the standard message passing GNNs that rely on the local graph structure, proposing a novel framework for the GNN through graph-theoretic formalism which provided insights into the design of more effective GNNs.
Mishra et al. [mishra2020node] analyzed state-of-the-art spatial GNN operations with theoretical tools. They proposed the node masking concept for better generalization and scaling in both transductive and inductive settings. Most recent works on learning from graph-structured data focus on distributed representations of graph substructures such as nodes and subgraphs. But for tasks related to graph clustering or graph classification, models require a representation of the entire graph as a fixed-length feature vector. Graph kernels remain one of the most effective ways to obtain these vectors by using handcrafted features such as shortest paths or graphlets, but this technique can also result in poor generalization. Narayanan et al. [narayanan2017graph2vec] proposed a neural embedding formulation named graph2vec to learn data-driven distributed representations of arbitrary graphs. The embeddings of graph2vec are learned in an unsupervised manner and are task agnostic. Different explainer models have also been proposed to understand the reasons behind the models' decisions [dai2020framework, ying2019gnnexplainer].
1.2 Related Works
Graph neural networks have been widely used in many areas including physical systems [battaglia2016interaction, sanchez2018graph], protein-protein interaction networks [fout2017protein], social science [hamilton2017inductive, kipf2016semi], and knowledge graphs [hamaguchi2017knowledge]. For more information on graph neural networks, see the survey [zhou2018graph].
Subsequently, other methods have been proposed to solve other combinatorial problems; for example, Khalil et al. [khalil2017learning] proposed a framework to compute minimum vertex cover, maximum cut, and traveling salesman problems. The survey [vesselinova2020learning] discusses different combinatorial problems that have been approached using graph neural networks or related methods. Bello et al. [bello2016neural] studied machine learning in general and deep reinforcement learning in particular for the planar traveling salesman problem (TSP). Kool et al. [kool2018attention] also studied the TSP to make progress in learning heuristics that can be applied to a broad range of practical problems. Prates et al. [prates2019learning] proposed a message-passing GNN model for the decision version of the TSP. Lemos et al. [lemos2019graph] showed that GNNs can be trained to solve fundamental combinatorial optimization challenges such as the graph coloring problem. Different neural network models have been proposed to solve other combinatorial problems such as the satisfiability (SAT) problem [selsam2018learning, li2018combinatorial], vertex cover [li2018combinatorial, dai2017learning], and independent set [li2018combinatorial]. To the best of our knowledge, no existing work has considered the Steiner tree problem.
1.3 Problem Statement
The Steiner Tree Problem is another classical problem mentioned in Karp’s initial NP-complete problem list [karp1972reducibility]. In this problem, we are given a weighted graph $G=(V,E)$ and a set of terminals $T \subseteq V$, and the objective is to compute a minimum cost tree that spans $T$. Several approximation algorithms have been proposed for this problem, including a classical 2-approximation algorithm that first computes the metric closure of $G$ on $T$ and then returns the minimum spanning tree of that closure [agrawal1995trees].
In this paper we are interested in the following question: can graph neural networks provide good candidate solutions to the Steiner Tree Problem?
1.4 Summary of Contributions
In this paper, we tackle the Steiner Tree Problem from a graph neural network perspective. Specifically,
We propose four different models to predict solution nodes for the Steiner tree problem.
We propose two heuristics to compute low cost Steiner trees from these models. In one heuristic we use the model’s predictions to construct a connected induced graph. We then compute the spanning tree of that induced graph. In the other heuristic, we predict good Steiner nodes and connect them to the constructed tree using a greedy shortest path method.
We generate forty thousand Steiner tree instances using different random graph generators. We compute the exact solution of these instances, and train and test the models on this dataset. We also test the models on the SteinLib data library which is outside of the training set.
Our findings suggest that an out-of-the-box application of GNN methods does worse than the classic 2-approximation method. However, when combined with a greedy shortest path construction, it does slightly better than the 2-approximation algorithm.
2 Neural network-based models
To begin, we lay out the different learning models that will be tested in Section 5. The first model is not a graph learning model, but rather a feedforward model that turns out to not be well-suited for graph problems. However, we include it to get a baseline for the performance of other models.
2.1 Feedforward model
In the Steiner tree problem, given an input graph $G=(V,E)$, we want to determine which nodes are present in the Steiner tree. The problem can be represented by a function $f$ that outputs a binary vector $y \in \{0,1\}^{|V|}$. The value of $y_i$ is 1 if and only if the $i$-th node is present in the solution. One can use feedforward neural networks to approximate this function. A feedforward network defines a mapping $y = f(x; \theta)$ and learns the value of the parameters $\theta$ that result in the best function approximation. The input of a feedforward network is a vector, and one can represent a graph by a binary vector $x$ where each element corresponds to a pair of nodes. An entry of $x$ is 1 if and only if there is an edge between the corresponding pair of nodes. In this representation, we need to assume that the number of nodes, and hence the number of node pairs, does not exceed an upper bound. We therefore build the model for graphs of this maximum size and treat any smaller graph as a graph of the maximum size padded with auxiliary, isolated non-terminal nodes.
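The fixed-length encoding described above can be sketched as follows. This is only an illustration: the function name `encode_graph`, the `max_nodes` parameter, and the lexicographic ordering of node pairs are assumptions, not details taken from the paper.

```python
from itertools import combinations

def encode_graph(num_nodes, edges, max_nodes):
    """Flatten an undirected graph into a 0/1 vector with one entry per node pair.

    Hypothetical helper: nodes are assumed to be labelled 0..num_nodes-1, and
    pairs are enumerated in lexicographic order over max_nodes nodes.
    """
    if num_nodes > max_nodes:
        raise ValueError("graph exceeds the assumed upper bound on nodes")
    edge_set = {frozenset(e) for e in edges}
    # Pairs involving padded nodes (index >= num_nodes) never appear in
    # edge_set, so padded nodes are isolated by construction.
    return [1 if frozenset(pair) in edge_set else 0
            for pair in combinations(range(max_nodes), 2)]

# A path on 3 nodes, padded to an upper bound of 4 nodes (6 node pairs).
x = encode_graph(3, [(0, 1), (1, 2)], max_nodes=4)
```

The vector length grows quadratically with the upper bound on the number of nodes, which is one reason this baseline scales poorly.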
2.2 Graph neural network model
A GNN model in a transductive-inductive framework was proposed by [scarselli2008graph] using the TensorFlow platform. The GNN model is capable of handling both directed and undirected graphs. In this framework, both edges and nodes have attributes that are called labels (typically vectors in some parameter space). Examples of node labels are average colors, area, and shape factors, and examples of edge labels are barycenters of two adjacent regions of a triangulation, or the distance between nodes. Each node also has a state, and a computational process named diffusion updates the states. The input graph topology drives the computation. The diffusion mechanism forms the computation schema, which updates the state vector at each node as a function of the node labels, the neighboring edge labels, and the states of the neighboring nodes. The information relevant to each task is summarized by the state of each node. The node states are used to compute the class or target properties.
Let the state and output at node $v$ be $x_v$ and $o_v$, respectively. Also, let $f_w$ represent the state transition function, which ultimately drives the diffusion process, and let $g_w$ denote the output function. The diffusion process can be described by the following equations:

$$x_v = f_w\big(l_v,\; l_{(v,\mathrm{ne}[v])},\; x_{\mathrm{ne}[v]},\; l_{\mathrm{ne}[v]}\big) \tag{1}$$

$$o_v = g_w(x_v,\, l_v) \tag{2}$$

where $\mathrm{ne}[v]$ denotes the neighbors of $v$, and $l_v$ and $l_{(v,\mathrm{ne}[v])}$ are the labels attached to $v$ and to the edges $(v,\mathrm{ne}[v])$, respectively. The solution of the Jacobi iterative procedure can be written as

$$x_v(t+1) = f_w\big(l_v,\; l_{(v,\mathrm{ne}[v])},\; x_{\mathrm{ne}[v]}(t),\; l_{\mathrm{ne}[v]}\big) \tag{3}$$

Equation (3) implements the diffusion process for the state computation. Simple multilayer perceptrons (MLPs) with a single hidden layer define the functions $f_w$ and $g_w$ in this framework. The unfolding of the encoding network (Figure 1) is described by Equation (3), with $f_w$ and $g_w$ evaluated for each node. Each node represents a replica of the perceptron realizing $f_w$, whereas each unit represents the state at time $t$. The state at time $t+1$ is computed from the states stored in all nodes at time $t$. When the state computation converges, $g_w$ is applied to each node to compute the output. Error backpropagation is performed on the unfolded network for the gradient computation. The weights of the network are adapted so as to reduce the error between the output of the network and the expected targets on the supervised node set during training. This procedure is discussed further in Section 4.
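The diffusion step can be illustrated with a minimal NumPy sketch of a Jacobi-style iteration: all states at time t+1 are computed from the states at time t until they stop changing. The specific transition function used here (a tanh of a linear map of the node label plus the mean neighbor state) and the names `diffuse`, `w_label`, and `w_state` are assumptions chosen to keep the example short. The paper's model uses an MLP for the transition function and also consumes edge labels, which are omitted here.

```python
import numpy as np

def diffuse(adj, labels, w_label, w_state, tol=1e-6, max_iter=200):
    """Jacobi-style state diffusion: x(t+1) depends only on x(t)."""
    n = adj.shape[0]
    deg = np.maximum(adj.sum(axis=1, keepdims=True), 1.0)
    x = np.zeros((n, w_state.shape[0]))
    for it in range(max_iter):
        neigh = (adj @ x) / deg                          # mean neighbour state at time t
        x_new = np.tanh(labels @ w_label + neigh @ w_state)
        if np.max(np.abs(x_new - x)) < tol:              # states have converged
            return x_new, it + 1
        x = x_new
    return x, max_iter

rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)               # path graph on 4 nodes
labels = rng.normal(size=(4, 3))                          # 3-dimensional node labels
w_label = rng.normal(size=(3, 5))                         # state dimension 5
w_state = 0.05 * rng.normal(size=(5, 5))                  # small weights keep the map contractive
states, iters = diffuse(adj, labels, w_label, w_state)
```

Keeping the state weights small makes the update a contraction, which is what guarantees that the iteration settles on a unique fixed point.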
2.3 Graph Convolution Network
Graph Convolution Networks (GCNs) generalize the notion of convolution from grid data to graph data. Essentially, a GCN learns a desired set of node features from a node's own features and the features of its neighboring nodes. Unlike GNNs, GCNs stack multiple convolution layers to learn higher order node representations. We can consider the convolution operation as the multiplication of a graph signal $x$ with a filter $g_\theta$ in the Fourier domain, i.e.,

$$g_\theta \star x = U g_\theta U^\top x \tag{4}$$

where $U$ is the matrix of eigenvectors of the symmetric normalized graph Laplacian $L = I - D^{-1/2} A D^{-1/2} = U \Lambda U^\top$ (with adjacency matrix $A$ and degree matrix $D$), $U^\top x$ is the graph Fourier transform of $x$, and $\Lambda$ is a diagonal matrix containing the eigenvalues of $L$ (see, e.g., [shuman2016vertex] for more details on graph Fourier transforms). By comparing with the graph convolution operator in Equation (4), we can interpret $g_\theta$ as a function of $\Lambda$, i.e., $g_\theta(\Lambda)$. Computing Equation (4) can be prohibitively expensive because the spectral decomposition of the graph Laplacian is costly for large graphs ($O(n^3)$ flops for a graph with $n$ nodes). In this regard, [HAMMOND2011129] suggested an approximation of $g_\theta(\Lambda)$ by means of Chebyshev polynomials $T_k$ up to order $K$:

$$g_{\theta'}(\Lambda) \approx \sum_{k=0}^{K} \theta'_k\, T_k(\tilde{\Lambda}) \tag{5}$$

where $\tilde{\Lambda} = \frac{2}{\lambda_{\max}} \Lambda - I$. Here $\lambda_{\max}$ denotes the largest eigenvalue of $L$ and $\theta'$ is a vector of Chebyshev coefficients (more details about Chebyshev polynomials can be found in [HAMMOND2011129]). Since $U\, T_k(\tilde{\Lambda})\, U^\top = T_k(\tilde{L})$ with $\tilde{L} = \frac{2}{\lambda_{\max}} L - I$, the resulting filter $\sum_{k=0}^{K} \theta'_k T_k(\tilde{L})\, x$ is a $K$-localized filter, which means it depends only on nodes that are at most $K$ steps away from the source node, since it is a $K$-th order polynomial in the Laplacian. The complexity of computing (5) is linear in the number of edges in the graph. In [kipf2016semi], the authors built a layer-wise convolutional neural network from (5) assuming $K=1$. Each layer is followed by a pointwise nonlinearity, i.e., a nonlinear function applied elementwise. In our work, we use this version of GCN due to its simplicity and ability to learn a rich class of convolutional filters by stacking multiple layers. Under this approximation, since $T_0(z)=1$ and $T_1(z)=z$, Equation (5) simplifies to:

$$g_{\theta'} \star x \approx \theta'_0\, x + \theta'_1\, \tilde{L}\, x = \theta'_0\, x + \theta'_1 \Big(\tfrac{2}{\lambda_{\max}} L - I\Big) x \tag{6}$$
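To make the Chebyshev approximation in (5) concrete, the following NumPy sketch applies a polynomial filter to a graph signal using the three-term recurrence T_k(z) = 2 z T_{k-1}(z) - T_{k-2}(z), so that only matrix-vector products with the rescaled Laplacian are needed. The function names and the toy graph are assumptions for illustration. The sketch computes the largest eigenvalue exactly for simplicity, whereas a practical implementation would use a cheap bound instead.

```python
import numpy as np

def normalized_laplacian(adj):
    """L = I - D^{-1/2} A D^{-1/2}; assumes the graph has no isolated nodes."""
    d_inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))
    return np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]

def chebyshev_filter(lap, x, theta):
    """Return sum_k theta[k] * T_k(L_tilde) @ x via the Chebyshev recurrence."""
    lam_max = np.linalg.eigvalsh(lap).max()      # exact here; a bound is used in practice
    l_tilde = (2.0 / lam_max) * lap - np.eye(len(lap))
    t_prev, t_curr = x, l_tilde @ x              # T_0 x and T_1 x
    out = theta[0] * t_prev
    if len(theta) > 1:
        out = out + theta[1] * t_curr
    for k in range(2, len(theta)):
        t_prev, t_curr = t_curr, 2.0 * (l_tilde @ t_curr) - t_prev
        out = out + theta[k] * t_curr
    return out

adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 1],
                [0, 1, 0, 1],
                [1, 1, 1, 0]], dtype=float)      # 4-cycle with one chord
lap = normalized_laplacian(adj)
x = np.array([1.0, 0.0, -1.0, 2.0])              # graph signal
theta = np.array([0.5, -0.3, 0.2])               # K = 2 Chebyshev coefficients
y = chebyshev_filter(lap, x, theta)
```

With `theta` of length 2 this reduces to the K = 1 case used by the layer-wise GCN.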
Our GCN model architecture: We stacked two GCN layers followed by a multi-layer perceptron (MLP) with two hidden layers for our Steiner node predictor network (Figure 2). The size of the hidden layers is 128 for both the GCN and MLP layers. Finally, we minimized the cross-entropy loss to train our network for 500 epochs with a learning rate of 1e-3.
2.4 Graph Attention Network
The GCN in Section 2.3 is capable of learning a wide range of kernels. However, the learned filters depend on the eigenbasis of the graph Laplacian, which hinders its generalization power. For this reason, [velickovic2018graph] proposed Graph Attention Networks (GATs), which stack layers in which each node learns which neighboring nodes are important for computing its updated node features. GATs enable assigning different weights to different nodes in a neighborhood without requiring knowledge of the graph structure up front. The attention module in a GAT is a single-layer feed-forward network followed by a LeakyReLU nonlinearity. Initially, a linear transformation is applied to each node feature; then a shared attentional mechanism $a$ computes the attention coefficient:

$$e_{ij} = a(W h_i,\; W h_j) \tag{7}$$

where $W$ is the learned weight matrix applied to the features $h_i$ and $h_j$ of nodes $i$ and $j$, respectively. The quantity $e_{ij}$ is called the attention score as it arises from the attention module $a$.
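As an illustration of the attention score, the sketch below follows the formulation in [velickovic2018graph]: the score is a LeakyReLU applied to a learned vector dotted with the concatenated transformed features, and scores are then normalized with a softmax over each node's neighborhood. The softmax step, the self-loops, and all names (`attention_coefficients`, `slope`) are assumptions drawn from that general formulation rather than from this paper's implementation.

```python
import numpy as np

def attention_coefficients(h, adj, W, a, slope=0.2):
    """Row-normalized attention weights over each node's neighbourhood."""
    z = h @ W                                        # linear transformation of features
    d = z.shape[1]
    # a^T [z_i || z_j] splits into a part for node i and a part for node j.
    e = (z @ a[:d])[:, None] + (z @ a[d:])[None, :]
    e = np.where(e > 0, e, slope * e)                # LeakyReLU
    mask = (adj + np.eye(len(adj))) > 0              # neighbours plus a self-loop
    e = np.where(mask, e, -np.inf)                   # ignore non-neighbours
    e = e - e.max(axis=1, keepdims=True)             # numerically stable softmax
    alpha = np.exp(e)
    return alpha / alpha.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
h = rng.normal(size=(4, 3))                          # input node features
W = rng.normal(size=(3, 2))                          # shared weight matrix
a = rng.normal(size=4)                               # attention vector of length 2 * 2
alpha = attention_coefficients(h, adj, W, a)
```

Each row of `alpha` gives the weights a node would use to aggregate the transformed features of its neighbors.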
3 Two frameworks of using the learning models
3.1 The learning-only black-box method
In this method, we simply show the model several Steiner tree problem (STP) instances together with their optimal solutions and check whether the model can correctly learn to identify nodes belonging to Steiner trees.
In this framework, the neural network models provide a likelihood score for each node being in the Steiner tree solution. Thus, we must determine which edges should be included to turn the node set into a candidate solution for the Steiner tree. We use two heuristics for this; the first, simple one is described here. To begin, we add the terminal nodes to the solution if they are not already present, and then compute the induced graph. We keep adding the remaining nodes one by one in decreasing order of likelihood score, re-forming the induced graph each time, until the induced graph is connected. In the last step, we compute a minimum spanning tree of the induced graph and prune non-terminal nodes that have degree 1. Checking the connectivity of the induced graph can be computed in , where is the number of edges [cormen2009introduction]. The running time of computing the minimum spanning tree is also [cormen2009introduction].
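The simple heuristic can be sketched in plain Python as follows. The graph representation (a dictionary of weighted neighbor dictionaries), the function names, and the toy instance are assumptions. The sketch also assumes a connected input graph with at least one terminal, and it prunes non-terminal leaves repeatedly rather than in a single pass.

```python
def is_connected(nodes, adj):
    """Depth-first search restricted to the subgraph induced by `nodes`."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v in nodes and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == nodes

def simple_heuristic(adj, terminals, scores):
    """adj: {u: {v: weight}}, terminals: set of nodes, scores: {node: likelihood}."""
    chosen = set(terminals)
    ranked = sorted((v for v in adj if v not in chosen), key=lambda v: -scores[v])
    for v in ranked:                       # add nodes by decreasing score
        if is_connected(chosen, adj):
            break
        chosen.add(v)
    parent = {v: v for v in chosen}        # union-find for Kruskal's algorithm
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    edges = sorted((w, u, v) for u in chosen for v, w in adj[u].items()
                   if v in chosen and u < v)
    tree, weight = {v: set() for v in chosen}, {}
    for w, u, v in edges:                  # minimum spanning tree of the induced graph
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree[u].add(v)
            tree[v].add(u)
            weight[frozenset((u, v))] = w
    leaves = [v for v in tree if v not in terminals and len(tree[v]) == 1]
    while leaves:                          # prune non-terminal leaves
        v = leaves.pop()
        (u,) = tree.pop(v)
        tree[u].discard(v)
        del weight[frozenset((u, v))]
        if u not in terminals and len(tree[u]) == 1:
            leaves.append(u)
    return tree, sum(weight.values())

adj = {"A": {"D": 1, "F": 1}, "B": {"D": 1}, "D": {"A": 1, "B": 1, "E": 1},
       "E": {"D": 1}, "F": {"A": 1}}
scores = {"D": 0.8, "E": 0.9, "F": 0.1}    # hypothetical model outputs
tree, cost = simple_heuristic(adj, {"A", "B"}, scores)
```

In this toy instance the highly scored node E is added before the graph becomes connected, and the final pruning step removes it again because it ends up as a non-terminal leaf.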
3.2 A learning-assisted heuristic: combining learning and algorithms
In this approach, we compare the simple heuristic with a classical 2-approximation algorithm. In this approximation algorithm, given an input graph $G$ and a set of terminals $T$, we first compute a complete graph on $T$. The edge weights of this complete graph are set equal to the shortest path distances in $G$ between the terminal endpoints. This complete graph is called the metric closure of $G$ on $T$. The minimum spanning tree of the metric closure, with each of its edges replaced by the corresponding shortest path in $G$, provides a Steiner tree of $G$ on terminals $T$.
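A compact sketch of this metric-closure construction is given below, using Dijkstra's algorithm from each terminal and Prim's algorithm on the closure. The function names and the toy graph are assumptions. When the expanded shortest paths overlap, a final spanning-tree pass over the returned edges would be needed to remove redundant edges; that cleanup step is omitted here for brevity.

```python
import heapq

def dijkstra(adj, source):
    """Shortest path distances and predecessors from `source`; adj: {u: {v: w}}."""
    dist, prev = {source: 0}, {}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                       # stale heap entry
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    return dist, prev

def steiner_metric_closure(adj, terminals):
    """Edges of the closure MST, expanded back into shortest paths of the graph."""
    terminals = list(terminals)
    sp = {t: dijkstra(adj, t) for t in terminals}
    in_tree, closure_edges = {terminals[0]}, []
    while len(in_tree) < len(terminals):   # Prim's algorithm on the metric closure
        _, s, t = min((sp[s][0][t], s, t)
                      for s in in_tree for t in terminals if t not in in_tree)
        in_tree.add(t)
        closure_edges.append((s, t))
    edges = set()
    for s, t in closure_edges:             # replace each closure edge by its path
        prev = sp[s][1]
        while t != s:
            edges.add(frozenset((prev[t], t)))
            t = prev[t]
    return edges

adj = {"s": {"a": 1, "t": 3}, "a": {"s": 1, "t": 1},
       "t": {"a": 1, "s": 3, "u": 2}, "u": {"t": 2}}
edges = steiner_metric_closure(adj, ["s", "t", "u"])
cost = sum(adj[u][v] for u, v in (tuple(e) for e in edges))
```

Here the closure edge between s and t expands into the two-edge path through the non-terminal a, because that path is shorter than the direct edge.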
The simple heuristic provides a valid Steiner tree; however, in our numerical experiments, we find that the approximation ratio for this heuristic (the ratio of the cost of the returned solution to the cost of the optimal solution) is often larger than that of the 2-approximation algorithm (see Section 5). Therefore, we also tested another heuristic for forming the Steiner tree from the learning model output, as follows.
In Figure 3, A, B and C are the terminal nodes whereas D is not. The 2-approximation algorithm first constructs the metric closure by computing the shortest paths between all pairs of terminals. Note that the non-terminal vertex D does not appear in any of these shortest paths. For example, the shortest path from A to B has length 5 and consists of the direct edge connecting the two terminals. Hence, the non-terminal node will not be present in the 2-approximation. Without loss of generality, the 2-approximation algorithm chooses the A-C-B path with a total cost of 10. Even though this path contains all the terminal nodes, it is not the optimal solution in terms of cost. We update the heuristic with a myopic decision such that it can choose a lower-cost solution that may also include non-terminal nodes. For example, the edges AD, DB, and DC form the optimal solution: it contains all the terminal nodes with the minimum cost of 9. The 2-approximation algorithm does not add any node that does not belong to a shortest path between two terminal nodes. We use our neural network models to predict nodes that may not belong to any such shortest path, but whose inclusion improves the solution. We include these predicted nodes as terminals and then run the 2-approximation algorithm. Each time we include a new node, we rerun the 2-approximation. The number of new nodes to add can be considered a parameter of the heuristic. In our implementation, we keep adding nodes until the induced graph of the current node set is connected.
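The Figure 3 example can be checked numerically with the short sketch below. The individual edge weights are an assumption: the text only gives the A-B distance (5), the A-C-B cost (10), and the total cost of AD, DB, and DC (9), so the sketch assumes weight 5 on each terminal-terminal edge and weight 3 on each edge incident to D. The function computes only the cost of the minimum spanning tree of the metric closure, which equals the cost of the expanded tree in this instance because every closure edge is a direct edge.

```python
def closure_mst_cost(adj, terminals):
    """Cost of the MST of the metric closure on `terminals`; adj: {u: {v: w}}."""
    nodes = list(adj)
    inf = float("inf")
    dist = {u: {v: (0 if u == v else adj[u].get(v, inf)) for v in nodes}
            for u in nodes}
    for k in nodes:                        # Floyd-Warshall all-pairs shortest paths
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    terminals = list(terminals)
    in_tree, cost = {terminals[0]}, 0
    while len(in_tree) < len(terminals):   # Prim's algorithm on the closure
        d, t = min((dist[s][t], t)
                   for s in in_tree for t in terminals if t not in in_tree)
        in_tree.add(t)
        cost += d
    return cost

# Assumed weights consistent with the costs quoted for Figure 3.
adj = {"A": {"B": 5, "C": 5, "D": 3},
       "B": {"A": 5, "C": 5, "D": 3},
       "C": {"A": 5, "B": 5, "D": 3},
       "D": {"A": 3, "B": 3, "C": 3}}
baseline = closure_mst_cost(adj, ["A", "B", "C"])        # terminals only
assisted = closure_mst_cost(adj, ["A", "B", "C", "D"])   # D promoted to a terminal
```

Promoting the predicted node D to a terminal lowers the cost from 10 to 9, which is the effect the learning-assisted heuristic relies on.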
4 Model setup and training
In order to train the models, one has to provide training data consisting of input graphs $G=(V,E)$, edge weights $w$, and terminals $T$. Given $(G, w, T)$, our goal is to produce a binary label for each vertex in $V$, such that label 1 indicates that a vertex is in the Steiner tree and label 0 indicates that it is not. The model is trained with Stochastic Gradient Descent (SGD) using the ADAM optimizer [kingma2014adam] to minimize the binary cross-entropy loss between the model's prediction and the ground truth (a boolean vector in $\{0,1\}^{|V|}$ indicating whether each node is in the Steiner tree solution or not) for each training sample.
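For reference, the binary cross-entropy between predicted node probabilities and the 0/1 ground-truth vector can be computed as in the following sketch. The function name and the clipping constant `eps` are assumptions added for numerical safety.

```python
import numpy as np

def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross-entropy over the nodes of one instance."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    labels = np.asarray(labels, dtype=float)
    return float(-np.mean(labels * np.log(probs)
                          + (1.0 - labels) * np.log(1.0 - probs)))

# Predicted probabilities for four nodes and their ground-truth membership.
loss = binary_cross_entropy([0.9, 0.2, 0.7, 0.1], [1, 0, 1, 0])
uninformed = binary_cross_entropy([0.5, 0.5], [1, 0])
```

A predictor that always outputs 0.5 attains a loss of ln 2, which gives a useful reference point when reading training curves.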
4.1 Data generation
We produce training instances using several different random graph generation models: Erdős–Rényi (ER) [erdos1959random], Watts–Strogatz (WS) [watts1998collective], Barabási–Albert (BA) [barabasi1999emergence], and random geometric (GE) [penrose2003random] graphs. For (ER), there is an edge selection probability, which we set to be at least to ensure that the generated graphs are connected with high probability. In the (WS) model, we initially create a ring lattice of constant degree . We then rewire each edge with probability while avoiding self-loops or duplicate edges. For our experiments we use and . In the (BA) model, a new node is connected to existing nodes. In the random geometric graph model, we uniformly select points from the Euclidean cube, and connect nodes whose Euclidean distance is not larger than a threshold , which we choose to be for some to ensure the graph is connected with high probability.
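Two of these generators can be sketched in plain Python as follows, together with a connectivity check that resamples until a connected graph is obtained. All parameter values below (20 nodes, edge probability 0.3, radius 0.5, the seeds) are hypothetical placeholders, not the settings used in the experiments, and the resampling loop is an assumption rather than the paper's procedure.

```python
import math
import random

def erdos_renyi(n, p, rng):
    """Include each of the n*(n-1)/2 possible edges independently with probability p."""
    return [(u, v) for u in range(n) for v in range(u + 1, n) if rng.random() < p]

def random_geometric(n, radius, rng, dim=2):
    """Sample n points uniformly in the unit cube; connect pairs within `radius`."""
    pts = [[rng.random() for _ in range(dim)] for _ in range(n)]
    return [(u, v) for u in range(n) for v in range(u + 1, n)
            if math.dist(pts[u], pts[v]) <= radius]

def is_connected(n, edges):
    """Depth-first search from node 0 over an edge list on nodes 0..n-1."""
    nbrs = {u: [] for u in range(n)}
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    seen, stack = {0}, [0]
    while stack:
        for v in nbrs[stack.pop()]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == n

def sample_connected(generator, n, param, seed=0, max_tries=1000):
    """Resample until the generated graph is connected (hypothetical helper)."""
    rng = random.Random(seed)
    for _ in range(max_tries):
        edges = generator(n, param, rng)
        if is_connected(n, edges):
            return edges
    raise RuntimeError("no connected sample found; increase the parameter")

er_edges = sample_connected(erdos_renyi, 20, 0.3)
ge_edges = sample_connected(random_geometric, 20, 0.5)
```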
The Steiner tree problem is NP-complete even if the input graph is unweighted [garey1979computers]. We generate both unweighted and weighted Steiner tree instances using the random generators described above. The number of nodes of these instances is in . For the number of terminals, we use two distributions. In the first distribution, the percentage of terminals with respect to the total number of nodes is in . In the second distribution the percentage is in . These two cases are considered to determine the behavior of the learning models on large and small terminal sets (compared with the overall graph size).
4.2 Computing optimal solutions
We have generated around 40,000 Steiner tree instances. For each of these instances, we have computed the exact solution via a known flow-based integer linear program (ILP) [ahmed2019multi]. While there are various ILPs for solving the Steiner tree problem, we chose this ILP because of its fast runtime compared with others.
We used CPLEX 12.6.2 as the ILP solver on a high-performance computer for all experiments (Lenovo NeXtScale nx360 M5 system with 400 nodes). Each node has 192 GB of memory. We used Python to implement the algorithms and spanner constructions. Since we ran the experiments on thousands of instances, we limited the solver to four hours per instance. All ER, WS, and BA instances were solved exactly within four hours. For GE instances, 99.17% of the instances were solved within four hours.
While random graphs can sometimes be ideal, and thus the Steiner tree problem on them may be easily solved, the SteinLib library [KMV00] provides a catalog of hard graph instances for solving the Steiner tree problem. For thorough comparison, we perform experiments on instances from two subsets of SteinLib: I080 and I160.
4.3 Model architectures
For the feedforward model, we have used two hidden layers each having 100 neurons with a ReLU activation function. For the output layer, we have used the sigmoid activation function. For the graph neural network model, we set the size of the state dimension equal to 5. For the multi-layer perceptrons representing the state transition function and the output function, we have used one hidden layer of size 40 with the tanh activation function. For the graph convolutional network and attention model, we have used two hidden layers of size 128. For the GCN model, we have used the ReLU activation function. For the GAT model, we have used the ELU activation function.
4.4 Feature selection
We have used different properties of input instances as features to train the neural networks. We provide a list of these features here:
Shortest paths: For every pair of vertices, we compute the shortest path. We add the shortest path distance as a feature.
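Vertex degree: For every vertex we use the number of adjacent vertices (degree) as a feature. We denote the degree of vertex $u$ by $\deg(u)$.
Clustering coefficient: We use the clustering coefficient for every vertex as a feature. We first define this coefficient for unweighted graphs. We denote the number of triangles through vertex $u$ as $t(u)$. Then the clustering coefficient $c_u$ of vertex $u$ in an unweighted graph is the fraction of possible triangles through $u$:

$$c_u = \frac{2\, t(u)}{\deg(u)\,(\deg(u) - 1)}$$

For weighted graphs, one can define the clustering coefficient in multiple ways [saramaki2007generalizations]. We rely on the method that computes the geometric average of the subgraph edge weights [onnela2005intensity]:

$$c_u = \frac{1}{\deg(u)\,(\deg(u) - 1)} \sum_{v,w} \big(\hat{w}_{uv}\, \hat{w}_{uw}\, \hat{w}_{vw}\big)^{1/3}$$

Here the sum runs over ordered pairs $(v,w)$ of neighbors of $u$ that are themselves adjacent, so each triangle is counted twice, consistent with the unweighted case. The $\hat{w}_{uv}$ are normalized edge weights, $\hat{w}_{uv} = w_{uv} / \max(w)$, where $\max(w)$ is the maximum edge weight in the network.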
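Both variants of the clustering coefficient can be computed with the short sketch below. The dictionary-based graph representation, the function name, and the example weights are assumptions. The code iterates over unordered neighbor pairs and doubles the total, which matches a sum over ordered pairs.

```python
from itertools import combinations

def clustering(adj, u, weighted=False):
    """Clustering coefficient of node u; adj: {u: {v: weight}} (undirected)."""
    nbrs = list(adj[u])
    k = len(nbrs)
    if k < 2:
        return 0.0                         # no pair of neighbours, no triangles
    w_max = max(w for nb in adj.values() for w in nb.values())
    total = 0.0
    for v, w in combinations(nbrs, 2):     # unordered pairs of neighbours
        if w in adj[v]:                    # u, v, w form a triangle
            if weighted:
                prod = adj[u][v] * adj[u][w] * adj[v][w] / w_max ** 3
                total += prod ** (1.0 / 3.0)
            else:
                total += 1.0
    return 2.0 * total / (k * (k - 1))

# Triangle A-B-C with a pendant node D attached to A.
adj = {"A": {"B": 2, "C": 2, "D": 4},
       "B": {"A": 2, "C": 2},
       "C": {"A": 2, "B": 2},
       "D": {"A": 4}}
c_unweighted = clustering(adj, "A")
c_weighted = clustering(adj, "A", weighted=True)
```

For node A, one of the three possible neighbor pairs closes a triangle, so the unweighted coefficient is 1/3; with every triangle edge at half the maximum weight, the weighted coefficient is half of that.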
5 Experimental results and analyses
Before proceeding with the learning experiments, we note some characteristics of the randomly generated graphs, namely the distributions of the density (Figure 4) and radius (Figure 5) of the graphs from each generator.
Geometric graphs have much larger radii on average than the others, as the other generators create high degree nodes that make the radius of the graphs relatively small. The presence of many high degree nodes makes the Steiner tree problem simpler, as terminal nodes are closely connected more often, and there is often no need to include other nodes in the solution. This also reflects the learning process of the neural network models: after 500 epochs, the loss decreases significantly for datasets generated from only one of the Erdős–Rényi, Watts-Strogatz, and Barabási–Albert models, while the training for datasets generated solely from the geometric model typically requires around 1000 epochs to significantly reduce the loss. Hence, in the combined dataset we train the models for 1000 epochs.
We have used of the problem instances for training and the rest for testing. Table 1 shows the results of testing on unweighted graphs. As a baseline for comparison with standard algorithms, we compare our learning results with the 2-approximation algorithm described above. The Steiner tree problem is well-studied and there are many different approximation algorithms, see [hauptmann2013compendium, promel2012steiner, winter1987steiner], but out of these we choose to compare with the 2-approximation algorithm given that it has a straightforward implementation and performs very well in practice. In the tables below, FF1, GNN1, GCN1, and GAT1 represent the learning models combined with the simple heuristic of Section 3.1, and FF2, GNN2, GCN2, and GAT2 represent the learning models combined with the more advanced heuristic of Section 3.2. The numbers in the table represent the experimental approximation ratios: the ratio of the cost of the heuristic solution to the cost of the optimal solution. We can see that the simple heuristic performs relatively poorly. However, it is 5-10 times faster than the other heuristics (Table 2). The Erdős–Rényi graphs have the largest maximum ratio. The Watts–Strogatz graphs have a relatively larger average ratio. Boldface numbers indicate the best ratio along each column. We can see that most of the time and for most generators, the second heuristic provides the best average experimental approximation ratio. However, due to the complexity of the second heuristic, some of the learning models using it take somewhat longer to run than the 2-approximation.
Results for weighted graphs are shown in Table 3. For the random graph generators, we draw edge weights uniformly at random from . We also use the SteinLib library dataset for testing, as it provides hard instances of the Steiner tree problem. As can be seen from the table, the SteinLib instances have the largest maximum and average ratios, which is to be expected. The Watts–Strogatz model has a relatively lower ratio. Again we compare the different neural network-based models with the 2-approximation algorithm. The simple heuristics (FF1, GNN1, GCN1, and GAT1) provide a relatively larger ratio, though the running time of these methods is several magnitudes faster compared to the 2-approximation algorithm. The best ratio in each column is marked in bold. For most of the graph generators, GNN2, GCN2, and GAT2 perform better than the other heuristics on average at the cost of somewhat more computation time.
We have used different neural network models to compute Steiner trees for weighted and unweighted graphs, and compared the output to the exact solutions. We trained the models on a combination of randomly generated graphs from various models. For testing on weighted graphs, we also added SteinLib instances. The models here output a binary vector indicating whether a given node is part of the Steiner tree or not, and therefore we must determine a rule to include or exclude edges in the final Steiner tree solution. To this end, we first proposed a simple heuristic that can compute the Steiner tree faster than the classical 2-approximation algorithm; however, the experimental approximation ratio of this heuristic is worse than that of the 2-approximation in practice. We then proposed another heuristic that uses more computation but performs better on average than the 2-approximation. Our study shows that the application of different combinatorial techniques along with neural network prediction can provide better solutions to the Steiner tree problem.
For future work, we believe that the application of other combinatorial techniques as a post-processing step, as well as other approaches such as reinforcement learning, remains an interesting avenue for exploration. Additionally, more sophisticated GNN models which output edge information should be explored, as this could potentially eliminate the need for running a heuristic after the output of the learning model. Finally, some GNN models indicate that feeding additional input information, such as a candidate solution to the Steiner tree problem, into the learning model can yield much better success. Applying this technique to models similar to those presented here could lead to significantly better learning models.