Graph Neural Networks: A Review of Methods and Applications

12/20/2018 · Jie Zhou, et al. · Tsinghua University

Many learning tasks require dealing with graph data, which contains rich relational information among elements. Modeling physics systems, learning molecular fingerprints, predicting protein interfaces, and classifying diseases all require a model that learns from graph inputs. In other domains, such as learning from non-structural data like texts and images, reasoning on extracted structures, such as the dependency trees of sentences and the scene graphs of images, is an important research topic that also needs graph reasoning models. Graph neural networks (GNNs) are connectionist models that capture the dependence structure of graphs via message passing between the nodes of a graph. Unlike standard neural networks, graph neural networks retain a state that can represent information from a node's neighborhood at arbitrary depth. Although the primitive graph neural networks were found difficult to train to a fixed point, recent advances in network architectures, optimization techniques, and parallel computation have enabled successful learning with them. In recent years, systems based on graph convolutional networks (GCNs) and gated graph neural networks (GGNNs) have demonstrated ground-breaking performance on many of the tasks mentioned above. In this survey, we provide a detailed review of existing graph neural network models, systematically categorize the applications, and propose four open problems for future research.

1 Introduction

Graphs are a kind of data structure which models a set of objects (nodes) and their relationships (edges). Recently, research on analyzing graphs with machine learning has been receiving more and more attention because of the great expressive power of graphs; i.e., graphs can be used to denote a large number of systems across various areas, including social science (social networks) [1, 2], natural science (physical systems [3, 4] and protein-protein interaction networks [5]), knowledge graphs [6] and many other research areas [7]. As a unique non-Euclidean data structure for machine learning, graph analysis focuses on node classification, link prediction, and clustering. Graph neural networks (GNNs) are deep-learning-based methods that operate on the graph domain. Due to their convincing performance and high interpretability, GNNs have recently become a widely applied graph analysis method. In the following paragraphs, we will illustrate the fundamental motivations of graph neural networks.

The first motivation of GNNs is rooted in convolutional neural networks (CNNs) [8]. CNNs have the ability to extract multi-scale localized spatial features and compose them to construct highly expressive representations, which led to breakthroughs in almost all machine learning areas and started the new era of deep learning [9]. However, CNNs can only operate on regular Euclidean data like images (2D grids) and text (1D sequences), even though these data structures can be regarded as special instances of graphs. Looking more closely at CNNs and graphs, we find the keys to CNNs' success: local connections, shared weights and the use of multiple layers [9]. These are also of great importance in solving problems on the graph domain, because 1) graphs are the most typical locally connected structure, 2) shared weights reduce the computational cost compared with traditional spectral graph theory [10], and 3) the multi-layer structure is the key to handling hierarchical patterns, capturing features of various sizes. Therefore, it is straightforward to think of generalizing CNNs to graphs. However, as shown in Fig. 1, it is hard to define localized convolutional filters and pooling operators, which hinders the transfer of CNNs from the Euclidean domain to the non-Euclidean domain.

Fig. 1: Left: image in Euclidean space. Right: graph in non-Euclidean space

The other motivation comes from graph embedding, which learns to represent graph nodes, edges or subgraphs as low-dimensional vectors. In the field of graph analysis, traditional machine learning approaches usually rely on hand-engineered features and are limited by their inflexibility and high cost. Following the idea of representation learning and the success of word embedding [11], DeepWalk [12], regarded as the first graph embedding method based on representation learning, applies the SkipGram model [11] to generated random walks. Similar approaches such as node2vec [13], LINE [14] and TADW [15] also achieved breakthroughs. However, these methods suffer from two severe drawbacks [16]. First, no parameters are shared between nodes in the encoder, which leads to computational inefficiency, since the number of parameters grows linearly with the number of nodes. Second, these direct embedding methods lack the ability to generalize, which means they cannot deal with dynamic graphs or generalize to new graphs.

Based on CNNs and graph embedding, graph neural networks (GNNs) are proposed to collectively aggregate information from the graph structure. Thus they can model input and/or output consisting of elements and their dependencies. Further, graph neural networks can simultaneously model the diffusion process on the graph with an RNN kernel.

In the following part, we explain the fundamental reasons why graph neural networks are worth investigating. Firstly, standard neural networks like CNNs and RNNs cannot handle graph input properly because they stack node features in a specific order, while there is no natural order of nodes in a graph. To present a graph completely, we would have to traverse all possible node orders as inputs to models like CNNs and RNNs, which is highly redundant. To solve this problem, GNNs propagate on each node respectively, ignoring the input order of nodes; in other words, the output of a GNN is invariant to the input order of nodes. Secondly, an edge in a graph represents the dependency between two nodes. In standard neural networks, this dependency information is just regarded as a node feature. However, GNNs can propagate guided by the graph structure instead of using it as part of the features. Generally, GNNs update the hidden states of nodes by a weighted sum of the states of their neighborhood. Thirdly, reasoning is a very important research topic for high-level artificial intelligence, and the reasoning process in the human brain is largely based on graphs extracted from daily experience. Standard neural networks have shown the ability to generate synthetic images and documents by learning the distribution of data, but they still cannot learn a reasoning graph from large experimental data. GNNs, in contrast, explore generating graphs from non-structural data like scene pictures and story documents, which could make them a powerful neural model for further high-level AI. Recently, it has been shown that even an untrained GNN with a simple architecture can perform well [17].

There exist several comprehensive reviews on graph neural networks. [18] gives a formal definition of early graph neural network approaches, and [19] demonstrates the approximation properties and computational capabilities of graph neural networks. [20] proposed a unified framework, MoNet, to generalize CNN architectures to non-Euclidean domains (graphs and manifolds); the framework could generalize several spectral methods on graphs [2, 21] as well as some models on manifolds [22, 23]. [24] provides a thorough review of geometric deep learning, presenting its problems, difficulties, solutions, applications and future directions. [20] and [24] focus on generalizing convolutions to graphs or manifolds; in this paper, however, we focus only on problems defined on graphs, and we also investigate other mechanisms used in graph neural networks, such as the gate mechanism, attention mechanism and skip connections. [25] proposed the message passing neural network (MPNN), which could generalize several graph neural network and graph convolutional network approaches; it presents the definition of the message passing neural network and demonstrates its application to quantum chemistry. [26] proposed the non-local neural network (NLNN), which unifies several “self-attention”-style methods, although the model is not explicitly defined on graphs in the original paper. Focusing on specific application domains, [25] and [26] only give examples of how to generalize other models using their frameworks and do not provide a review of other graph neural network models. [27] proposed the graph network (GN) framework, which has a strong capability to generalize other models and whose relational inductive biases promote combinatorial generalization, thought to be a top priority for AI. However, [27] is part position paper, part review and part unification, and it only gives a rough classification of the applications. In this paper, we provide a thorough review of different graph neural network models as well as a systematic taxonomy of the applications.

To summarize, this paper presents an extensive survey of graph neural networks with the following contributions.

  • We provide a detailed review of existing graph neural network models. We introduce the original model, its variants and several general frameworks. We examine various models in this area and provide a unified representation to present different propagation steps in different models. One can easily make a distinction between different models using our representation by recognizing the corresponding aggregators and updaters.

  • We systematically categorize the applications and divide the applications into structural scenarios, non-structural scenarios and other scenarios. We present several major applications and their corresponding methods in different scenarios.

  • We propose four open problems for future research. Graph neural networks suffer from over-smoothing and scaling problems. There are still no effective methods for dealing with dynamic graphs as well as modeling non-structural sensory data. We provide a thorough analysis of each problem and propose future research directions.

The rest of this survey is organized as follows. In Sec. 2, we introduce various models in the graph neural network family. We first introduce the original framework and its limitations, then present its variants that try to address these limitations, and finally introduce several general frameworks proposed recently. In Sec. 3, we introduce several major applications of graph neural networks in structural scenarios, non-structural scenarios and other scenarios. In Sec. 4, we propose four open problems of graph neural networks as well as several future research directions. Finally, we conclude the survey in Sec. 5.

2 Models

Graph neural networks are useful tools for non-Euclidean structures, and various methods have been proposed in the literature to improve their capability.

In Sec 2.1, we describe the original graph neural networks proposed in [18]. We also list the limitations of the original GNN in representation capability and training efficiency. In Sec 2.2 we introduce several variants of graph neural networks aiming to address these limitations. These variants operate on different graph types, utilize different propagation functions and apply advanced training methods. In Sec 2.3 we present three general frameworks which could generalize and extend several lines of work. In detail, the message passing neural network (MPNN) [25] unifies various graph neural network and graph convolutional network approaches; the non-local neural network (NLNN) [26] unifies several “self-attention”-style methods; and the graph network (GN) [27] could generalize almost every graph neural network variant mentioned in this paper.

Before going further into different sections, we give the notations that will be used throughout the paper. The detailed descriptions of the notations could be found in Table I.

Notations: Descriptions
$\mathbb{R}^m$: $m$-dimensional Euclidean space
$a, \mathbf{a}, \mathbf{A}$: Scalar, vector, matrix
$\mathbf{A}^T$: Matrix transpose
$\mathbf{I}_N$: Identity matrix of dimension $N$
$g_{\theta}\star\mathbf{x}$: Convolution of $g_{\theta}$ and $\mathbf{x}$
$N$: Number of nodes in the graph
$N^v$: Number of nodes in the graph
$N^e$: Number of edges in the graph
$\mathcal{N}_v$: Neighborhood set of node $v$
$\mathbf{a}_v^t$: Vector $\mathbf{a}$ of node $v$ at time step $t$
$\mathbf{h}_v$: Hidden state of node $v$
$\mathbf{h}_v^t$: Hidden state of node $v$ at time step $t$
$\mathbf{e}_{vw}$: Features of edge from node $v$ to $w$
$\mathbf{e}_k$: Features of edge with label $k$
$\mathbf{o}_v^t$: Output of node $v$
$\mathbf{W}^i, \mathbf{U}^i, \mathbf{W}^o, \mathbf{U}^o, \ldots$: Matrices for computing $\mathbf{i}$, $\mathbf{o}$, $\ldots$
$\mathbf{b}^i, \mathbf{b}^o, \ldots$: Vectors for computing $\mathbf{i}$, $\mathbf{o}$, $\ldots$
$\sigma$: The logistic sigmoid function
$\rho$: An alternative non-linear function
$\tanh$: The hyperbolic tangent function
LeakyReLU: The LeakyReLU function
$\odot$: Element-wise multiplication operation
$\|$: Vector concatenation
TABLE I: Notations used in this paper.

2.1 Graph Neural Networks

The concept of graph neural network (GNN) was first proposed in [18], which extended existing neural networks for processing data represented in graph domains. In a graph, each node is naturally defined by its features and the related nodes. The target of GNN is to learn a state embedding $\mathbf{h}_v \in \mathbb{R}^s$ which contains the information of the neighborhood for each node. The state embedding $\mathbf{h}_v$ is an $s$-dimensional vector of node $v$ and can be used to produce an output $\mathbf{o}_v$ such as the node label. Let $f$ be a parametric function, called the local transition function, that is shared among all nodes and updates the node state according to the input neighborhood, and let $g$ be the local output function that describes how the output is produced. Then, $\mathbf{h}_v$ and $\mathbf{o}_v$ are defined as follows:

$\mathbf{h}_v = f(\mathbf{x}_v, \mathbf{x}_{co[v]}, \mathbf{h}_{ne[v]}, \mathbf{x}_{ne[v]})$ (1)
$\mathbf{o}_v = g(\mathbf{h}_v, \mathbf{x}_v)$ (2)

where $\mathbf{x}_v$, $\mathbf{x}_{co[v]}$, $\mathbf{h}_{ne[v]}$, and $\mathbf{x}_{ne[v]}$ are the features of $v$, the features of its edges, the states of the nodes in the neighborhood of $v$, and the features of the nodes in the neighborhood of $v$, respectively.

Let $\mathbf{H}$, $\mathbf{O}$, $\mathbf{X}$, and $\mathbf{X}_N$ be the vectors constructed by stacking all the states, all the outputs, all the features, and all the node features, respectively. Then we have a compact form:

$\mathbf{H} = F(\mathbf{H}, \mathbf{X})$ (3)
$\mathbf{O} = G(\mathbf{H}, \mathbf{X}_N)$ (4)

where $F$, the global transition function, and $G$, the global output function, are stacked versions of $f$ and $g$ for all nodes in a graph, respectively. The value of $\mathbf{H}$ is the fixed point of Eq. 3 and is uniquely defined under the assumption that $F$ is a contraction map.

Following Banach's fixed point theorem [28], GNN uses the following classic iterative scheme for computing the state:

$\mathbf{H}^{t+1} = F(\mathbf{H}^t, \mathbf{X})$ (5)

where $\mathbf{H}^t$ denotes the $t$-th iteration of $\mathbf{H}$. The dynamical system of Eq. 5 converges exponentially fast to the solution of Eq. 3 for any initial value $\mathbf{H}(0)$. Note that the computations described by $f$ and $g$ can be interpreted as feedforward neural networks.

Given the GNN framework, the next question is how to learn the parameters of $f$ and $g$. With the target information ($\mathbf{t}_v$ for a specific node $v$) as supervision, the loss can be written as follows:

$loss = \sum_{i=1}^{p}(\mathbf{t}_i - \mathbf{o}_i)$ (6)

where $p$ is the number of supervised nodes. The learning algorithm is based on a gradient-descent strategy and is composed of the following steps (a minimal sketch of this procedure follows the list).

  • The states $\mathbf{h}_v^t$ are iteratively updated by Eq. 1 until a time $T$. They approach the fixed point solution of Eq. 3: $\mathbf{H}(T) \approx \mathbf{H}$.

  • The gradient of the weights $\mathbf{W}$ is computed from the loss.

  • The weights are updated according to the gradient computed in the last step.
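
The following is a minimal sketch of this fixed-point propagation (Eq. 5) in Python/NumPy, assuming a toy contraction map built from a shared linear transition function; the function name, the tanh nonlinearity and the small weight norm are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Minimal sketch of the original GNN fixed-point iteration (Eq. 5), assuming a
# simple shared transition function; names and shapes are illustrative.
def propagate_to_fixed_point(A, X, W_h, W_x, num_iters=50, tol=1e-6):
    """A: (N, N) adjacency, X: (N, d) node features, W_h/W_x: shared weights.
    Iterates H^{t+1} = F(H^t, X) until (approximate) convergence."""
    N, _ = X.shape
    H = np.zeros((N, W_h.shape[1]))           # initial state H(0)
    for _ in range(num_iters):
        # Each node aggregates its neighbors' states, then applies the shared
        # transition; tanh keeps the map bounded (contraction is assumed here).
        H_new = np.tanh(A @ H @ W_h + X @ W_x)
        if np.linalg.norm(H_new - H) < tol:   # approached the fixed point
            break
        H = H_new
    return H

# Example usage on a tiny random graph.
rng = np.random.default_rng(0)
A = (rng.random((5, 5)) < 0.4).astype(float)
X = rng.normal(size=(5, 3))
W_h = 0.1 * rng.normal(size=(4, 4))           # small norm to encourage contraction
W_x = rng.normal(size=(3, 4))
H = propagate_to_fixed_point(A, X, W_h, W_x)
print(H.shape)  # (5, 4)
```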

Limitations

Though experimental results showed that GNN is a powerful architecture for modeling structural data, the original GNN still has several limitations. Firstly, it is inefficient to update the hidden states of nodes iteratively to reach the fixed point. Relaxing the fixed-point assumption, one can instead design a multi-layer GNN to obtain a stable representation of a node and its neighborhood. Secondly, GNN uses the same parameters in every iteration, while most popular neural networks use different parameters in different layers, which serve as a hierarchical feature extraction method. Moreover, the update of node hidden states is a sequential process which can benefit from RNN kernels like the GRU and LSTM. Thirdly, there are also informative features on the edges which cannot be effectively modeled in the original GNN. For example, edges in a knowledge graph have relation types, and message propagation through different edges should differ according to their types. Besides, how to learn the hidden states of edges is also an important problem. Lastly, it is unsuitable to use the fixed point if we focus on the representation of nodes instead of graphs, because the distribution of representations at the fixed point will be too smooth in value and thus less informative for distinguishing individual nodes.

2.2 Variants of Graph Neural Networks

In this subsection, we present several variants of graph neural networks. Sec 2.2.1 focuses on variants operating on different graph types. These variants extend the representation capability of the original model. Sec 2.2.2 lists several modifications (convolution, gate mechanism, attention mechanism and skip connection) to the propagation step; these models could learn representations with higher quality. Sec 2.2.3 describes variants using advanced training methods, which improve the training efficiency. An overview of the different variants of graph neural networks can be found in Fig. 2.

Fig. 2: An overview of variants of graph neural networks: (a) Graph Types, (b) Training Methods, (c) Propagation Steps.

2.2.1 Graph Types

In the original GNN [18], the input graph consists of nodes with label information and undirected edges, which is the simplest graph format. However, many other kinds of graphs appear in practice. In this subsection, we introduce some methods designed to model different kinds of graphs.

Directed Graphs The first variant of graphs is the directed graph. An undirected edge, which can be treated as two directed edges, shows that there is a relation between two nodes, but directed edges can carry more information than undirected edges. For example, in a knowledge graph where an edge starts from the head entity and ends at the tail entity, the head entity is the parent class of the tail entity, which suggests we should treat the information propagation from parent classes and from child classes differently. ADGPM [29] uses two kinds of weight matrices, $\mathbf{W}_p$ and $\mathbf{W}_c$, to incorporate more precise structural information. The propagation rule is shown as follows:

$\mathbf{H}^t = \sigma\left(\mathbf{D}_p^{-1}\mathbf{A}_p\,\sigma\left(\mathbf{D}_c^{-1}\mathbf{A}_c\mathbf{H}^{t-1}\mathbf{W}_c\right)\mathbf{W}_p\right)$ (7)

where $\mathbf{D}_p^{-1}\mathbf{A}_p$ and $\mathbf{D}_c^{-1}\mathbf{A}_c$ are the normalized adjacency matrices for parents and children respectively.

Heterogeneous Graphs The second variant of graphs is the heterogeneous graph, which contains several kinds of nodes. The simplest way to process a heterogeneous graph is to convert the type of each node to a one-hot feature vector which is concatenated with the original features. Going further, GraphInception [30] introduces the concept of metapaths into propagation on the heterogeneous graph. With metapaths, we can group neighbors according to their node types and distances. For each neighbor group, GraphInception treats it as a sub-graph of a homogeneous graph for propagation and concatenates the propagation results from the different homogeneous sub-graphs to obtain a collective node representation.

Graphs with Edge Information In the final variant of graphs, each edge also carries information such as its weight or type. There are two ways to handle this kind of graph. Firstly, we can convert the graph to a bipartite graph where the original edges also become nodes; one original edge is split into two new edges, i.e., there are two new edges between the edge node and its begin/end nodes. The encoder of G2S [31] uses the following aggregation function for neighbors:

$\mathbf{h}_v^t = \rho\left(\frac{1}{|\mathcal{N}_v|}\sum_{u\in\mathcal{N}_v}\mathbf{W}_r\left(\mathbf{r}_v^t\odot\mathbf{h}_u^{t-1}\right)+\mathbf{b}_r\right)$ (8)

where $\mathbf{W}_r$ and $\mathbf{b}_r$ are the propagation parameters for different types of edges (relations). Secondly, we can adopt different weight matrices for propagation on different kinds of edges. When the number of relations is very large, r-GCN [32] introduces two kinds of regularization to reduce the number of parameters needed to model the relations: basis decomposition and block-diagonal decomposition. With the basis decomposition, each $\mathbf{W}_r$ is defined as follows:

$\mathbf{W}_r = \sum_{b=1}^{B} a_{rb}\mathbf{V}_b$ (9)

i.e., as a linear combination of basis transformations $\mathbf{V}_b$ with coefficients $a_{rb}$ such that only the coefficients depend on $r$. In the block-diagonal decomposition, r-GCN defines each $\mathbf{W}_r$ through the direct sum over a set of low-dimensional matrices, which needs more parameters than the basis decomposition.
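
A small NumPy sketch of the basis decomposition (Eq. 9) is shown below; the numbers of relations, bases and dimensions are arbitrary illustrative choices, not values from [32].

```python
import numpy as np

# Basis decomposition (Eq. 9): each relation-specific matrix W_r is a learned
# linear combination of B shared basis matrices V_b; only a_rb depends on r.
rng = np.random.default_rng(0)
num_relations, num_bases, d_in, d_out = 10, 3, 16, 16
V = rng.normal(size=(num_bases, d_in, d_out))       # shared bases V_b
a = rng.normal(size=(num_relations, num_bases))     # per-relation coefficients a_rb
W = np.einsum("rb,bio->rio", a, V)                  # W_r = sum_b a_rb V_b
print(W.shape)  # (10, 16, 16): one d_in x d_out matrix per relation
```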

2.2.2 Propagation Types

The propagation step and output step are of vital importance in the model to obtain the hidden states of nodes (or edges). As listed below, there are several major modifications to the propagation step compared with the original graph neural network model, while researchers usually follow a simple feed-forward neural network setting in the output step. The comparison of different GNN variants can be found in Table II. The variants utilize different aggregators to gather information from each node's neighbors and specific updaters to update nodes' hidden states.

TABLE II: Different variants of graph neural networks (columns: Name, Variant, Aggregator, Updater). The table compares Spectral Methods (ChebNet, the 1st-order model, the single-parameter model and GCN), Non-spectral Methods (the convolutional networks in [33], DCNN with separate node-classification and graph-classification rules, and GraphSAGE), Graph Attention Networks (GAT with multi-head concatenation and multi-head averaging), Gated Graph Neural Networks (GGNN) and Graph LSTM (Child-Sum Tree-LSTM, N-ary Tree-LSTM and the Graph LSTM in [34]); the corresponding aggregator and updater formulas are given in the respective paragraphs below.

Convolution. There is an increasing interest in generalizing convolutions to the graph domain. Advances in this direction are often categorized as spectral approaches and non-spectral approaches.

Spectral approaches work with a spectral representation of the graphs. [35] proposed the spectral network. The convolution operation is defined in the Fourier domain by computing the eigendecomposition of the graph Laplacian. The operation can be defined as the multiplication of a signal $\mathbf{x}\in\mathbb{R}^N$ (a scalar for each node) with a filter $\mathbf{g}_\theta=\text{diag}(\theta)$ parameterized by $\theta\in\mathbb{R}^N$:

$\mathbf{g}_\theta\star\mathbf{x}=\mathbf{U}\mathbf{g}_\theta(\mathbf{\Lambda})\mathbf{U}^T\mathbf{x}$ (10)

where $\mathbf{U}$ is the matrix of eigenvectors of the normalized graph Laplacian $\mathbf{L}=\mathbf{I}_N-\mathbf{D}^{-\frac{1}{2}}\mathbf{A}\mathbf{D}^{-\frac{1}{2}}=\mathbf{U}\mathbf{\Lambda}\mathbf{U}^T$ ($\mathbf{D}$ is the degree matrix and $\mathbf{A}$ is the adjacency matrix of the graph), with a diagonal matrix of its eigenvalues $\mathbf{\Lambda}$.

This operation results in potentially intense computations and non-spatially localized filters. [36] attempts to make the spectral filters spatially localized by introducing a parameterization with smooth coefficients. [37] suggests that $\mathbf{g}_\theta(\mathbf{\Lambda})$ can be approximated by a truncated expansion in terms of Chebyshev polynomials $\mathbf{T}_k(x)$ up to $K^{th}$ order. Thus the operation is:

$\mathbf{g}_\theta\star\mathbf{x}\approx\sum_{k=0}^{K}\theta_k\mathbf{T}_k(\tilde{\mathbf{L}})\mathbf{x}$ (11)

with $\tilde{\mathbf{L}}=\frac{2}{\lambda_{max}}\mathbf{L}-\mathbf{I}_N$, where $\lambda_{max}$ denotes the largest eigenvalue of $\mathbf{L}$. $\theta\in\mathbb{R}^K$ is now a vector of Chebyshev coefficients. The Chebyshev polynomials are defined as $\mathbf{T}_k(\mathbf{x})=2\mathbf{x}\mathbf{T}_{k-1}(\mathbf{x})-\mathbf{T}_{k-2}(\mathbf{x})$, with $\mathbf{T}_0(\mathbf{x})=1$ and $\mathbf{T}_1(\mathbf{x})=\mathbf{x}$. It can be observed that the operation is $K$-localized since it is a $K^{th}$-order polynomial in the Laplacian. [38] uses this $K$-localized convolution to define a convolutional neural network which removes the need to compute the eigenvectors of the Laplacian.
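
A compact sketch of this $K$-localized Chebyshev filtering (Eq. 11) might look like the following; the toy graph, the coefficient values and the eigenvalue computation are illustrative, not part of the ChebNet implementation in [38].

```python
import numpy as np

def cheb_conv(L, x, theta):
    """K-localized Chebyshev filtering: g_theta * x ~ sum_k theta_k T_k(L~) x (Eq. 11)."""
    lam_max = np.linalg.eigvalsh(L).max()
    L_tilde = 2.0 / lam_max * L - np.eye(L.shape[0])   # rescale spectrum to [-1, 1]
    Tx_prev, Tx = x, L_tilde @ x                       # T_0(L~)x = x, T_1(L~)x = L~x
    out = theta[0] * Tx_prev
    if len(theta) > 1:
        out += theta[1] * Tx
    for k in range(2, len(theta)):
        Tx_next = 2 * L_tilde @ Tx - Tx_prev           # T_k = 2 L~ T_{k-1} - T_{k-2}
        out += theta[k] * Tx_next
        Tx_prev, Tx = Tx, Tx_next
    return out

# Example: normalized Laplacian of a toy undirected graph, K = 3.
A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
L = np.eye(4) - D_inv_sqrt @ A @ D_inv_sqrt
x = np.array([1.0, 0.0, 0.0, 0.0])
print(cheb_conv(L, x, theta=[0.5, 0.3, 0.2, 0.1]))
```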

[2] limits the layer-wise convolution operation to $K=1$ to alleviate the problem of overfitting on local neighborhood structures for graphs with very wide node degree distributions. It further approximates $\lambda_{max}\approx 2$, and the equation simplifies to:

$\mathbf{g}_{\theta'}\star\mathbf{x}\approx\theta'_0\mathbf{x}+\theta'_1\left(\mathbf{L}-\mathbf{I}_N\right)\mathbf{x}=\theta'_0\mathbf{x}-\theta'_1\mathbf{D}^{-\frac{1}{2}}\mathbf{A}\mathbf{D}^{-\frac{1}{2}}\mathbf{x}$ (12)

with two free parameters $\theta'_0$ and $\theta'_1$. After constraining the number of parameters with $\theta=\theta'_0=-\theta'_1$, we obtain the following expression:

$\mathbf{g}_\theta\star\mathbf{x}\approx\theta\left(\mathbf{I}_N+\mathbf{D}^{-\frac{1}{2}}\mathbf{A}\mathbf{D}^{-\frac{1}{2}}\right)\mathbf{x}$ (13)

Note that stacking this operator could lead to numerical instabilities and exploding/vanishing gradients, so [2] introduces the renormalization trick: $\mathbf{I}_N+\mathbf{D}^{-\frac{1}{2}}\mathbf{A}\mathbf{D}^{-\frac{1}{2}}\rightarrow\tilde{\mathbf{D}}^{-\frac{1}{2}}\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-\frac{1}{2}}$, with $\tilde{\mathbf{A}}=\mathbf{A}+\mathbf{I}_N$ and $\tilde{\mathbf{D}}_{ii}=\sum_j\tilde{\mathbf{A}}_{ij}$. Finally, [2] generalizes the definition to a signal $\mathbf{X}\in\mathbb{R}^{N\times C}$ with $C$ input channels and $F$ filters for feature maps as follows:

$\mathbf{Z}=\tilde{\mathbf{D}}^{-\frac{1}{2}}\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-\frac{1}{2}}\mathbf{X}\mathbf{\Theta}$ (14)

where $\mathbf{\Theta}\in\mathbb{R}^{C\times F}$ is a matrix of filter parameters and $\mathbf{Z}\in\mathbb{R}^{N\times F}$ is the convolved signal matrix.
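
A minimal NumPy sketch of this renormalized propagation rule (Eq. 14) is shown below; the function name `gcn_layer`, the toy data and the ReLU stacking are illustrative, not the reference implementation of [2].

```python
import numpy as np

def gcn_layer(A, X, Theta):
    """One GCN propagation step Z = D~^{-1/2} A~ D~^{-1/2} X Theta (Eq. 14).
    A: (N, N) adjacency, X: (N, C) features, Theta: (C, F) filter parameters."""
    N = A.shape[0]
    A_tilde = A + np.eye(N)                       # renormalization trick: add self-loops
    d_tilde = A_tilde.sum(axis=1)                 # D~_ii = sum_j A~_ij
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d_tilde))  # D~^{-1/2}
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt     # symmetric normalization
    return A_hat @ X @ Theta                      # convolved signal (N, F)

# Example: stack two layers with a ReLU in between (a common 2-layer GCN form).
rng = np.random.default_rng(0)
A = (rng.random((6, 6)) < 0.3).astype(float)
A = np.maximum(A, A.T)                            # make the toy graph undirected
X = rng.normal(size=(6, 4))
Z = np.maximum(gcn_layer(A, X, rng.normal(size=(4, 8))), 0)
Z = gcn_layer(A, Z, rng.normal(size=(8, 2)))
print(Z.shape)  # (6, 2)
```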

[39] presents a Gaussian process-based Bayesian approach to solve the semi-supervised learning problem on graphs. It shows parallels between the model and the spectral filtering approaches, which could give us some insights from another perspective.

However, in all of the spectral approaches mentioned above, the learned filters depend on the Laplacian eigenbasis, which depends on the graph structure, that is, a model trained on a specific structure could not be directly applied to a graph with a different structure.

Non-spectral approaches define convolutions directly on the graph, operating on spatially close neighbors. The major challenge of non-spectral approaches is defining the convolution operation with differently sized neighborhoods and maintaining the local invariance of CNNs.

[33] uses different weight matrices for nodes with different degrees,

$\mathbf{x}=\mathbf{h}_v^{t-1}+\sum_{i=1}^{|\mathcal{N}_v|}\mathbf{h}_i^{t-1}$
$\mathbf{h}_v^t=\sigma\left(\mathbf{x}\mathbf{W}_t^{|\mathcal{N}_v|}\right)$ (15)

where $\mathbf{W}_t^{|\mathcal{N}_v|}$ is the weight matrix for nodes with degree $|\mathcal{N}_v|$ at layer $t$. The main drawback of this method is that it cannot be applied to large-scale graphs with many distinct node degrees.

[21] proposed diffusion-convolutional neural networks (DCNNs). Transition matrices are used to define the neighborhood for nodes in DCNN. For node classification, it has

$\mathbf{H}=f\left(\mathbf{W}^c\odot\mathbf{P}^*\mathbf{X}\right)$ (16)

where $\mathbf{X}$ is an $N\times F$ tensor of input features ($N$ is the number of nodes and $F$ is the number of features). $\mathbf{P}^*$ is an $N\times K\times N$ tensor which contains the power series $\{\mathbf{P},\mathbf{P}^2,\ldots,\mathbf{P}^K\}$ of matrix $\mathbf{P}$, and $\mathbf{P}$ is the degree-normalized transition matrix derived from the graph's adjacency matrix $\mathbf{A}$. Each entity is transformed to a diffusion convolutional representation, a $K\times F$ matrix defined by $K$ hops of graph diffusion over $F$ features, which is then mapped by a $K\times F$ weight matrix and a non-linear activation function $f$. Finally $\mathbf{H}$ (which is $N\times K\times F$) denotes the diffusion representations of each node in the graph.

As for graph classification, DCNN simply takes the average of the nodes' representations,

$\mathbf{H}=f\left(\mathbf{W}^c\odot 1_N^T\mathbf{P}^*\mathbf{X}/N\right)$ (17)

where $1_N$ is an $N\times 1$ vector of ones. DCNN can also be applied to edge classification tasks, which requires converting edges to nodes and augmenting the adjacency matrix.

[40] extracts and normalizes a neighborhood of exactly $k$ nodes for each node. The normalized neighborhood then serves as the receptive field for the convolutional operation.

[20] proposed a spatial-domain model (MoNet) on non-Euclidean domains which could generalize several previous techniques. The Geodesic CNN (GCNN)[22] and Anisotropic CNN (ACNN) [23] on manifolds or GCN[2] and DCNN[21] on graphs could be formulated as particular instances of MoNet.

[1] proposed GraphSAGE, a general inductive framework. The framework generates embeddings by sampling and aggregating features from a node's local neighborhood:

$\mathbf{h}_{\mathcal{N}_v}^t=\text{AGGREGATE}_t\left(\{\mathbf{h}_u^{t-1},\forall u\in\mathcal{N}_v\}\right)$
$\mathbf{h}_v^t=\sigma\left(\mathbf{W}^t\cdot[\mathbf{h}_v^{t-1}\,\|\,\mathbf{h}_{\mathcal{N}_v}^t]\right)$ (18)

However, [1] does not utilize the full set of neighbors in Eq. 18 but a fixed-size set of neighbors obtained by uniform sampling. [1] suggests three aggregator functions (a minimal code sketch follows the list).

  • Mean aggregator. It could be viewed as an approximation of the convolutional operation from the transductive GCN framework [2], so that the inductive version of the GCN variant could be derived by

    $\mathbf{h}_v^t=\sigma\left(\mathbf{W}\cdot\text{MEAN}\left(\{\mathbf{h}_v^{t-1}\}\cup\{\mathbf{h}_u^{t-1},\forall u\in\mathcal{N}_v\}\right)\right)$ (19)

    The mean aggregator differs from the other aggregators because it does not perform the concatenation of $\mathbf{h}_v^{t-1}$ and $\mathbf{h}_{\mathcal{N}_v}^t$ in Eq. 18. It can be viewed as a form of “skip connection” [41] and could achieve better performance.

  • LSTM aggregator. [1] also uses an LSTM-based aggregator which has a larger expressive capability. However, LSTMs process inputs in a sequential manner, so they are not permutation invariant. [1] adapts LSTMs to operate on an unordered set by permuting the node's neighbors.

  • Pooling aggregator. In the pooling aggregator, each neighbor's hidden state is fed through a fully-connected layer, and then a max-pooling operation is applied over the set of the node's neighbors:

    $\mathbf{h}_{\mathcal{N}_v}^t=\max\left(\{\sigma\left(\mathbf{W}_{pool}\mathbf{h}_u^{t-1}+\mathbf{b}\right),\forall u\in\mathcal{N}_v\}\right)$ (20)

    Note that any symmetric function could be used in place of the max-pooling operation here.
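
Below is a minimal sketch of the GraphSAGE-style sample-and-aggregate step with a mean aggregator (Eqs. 18 and 19); the neighbor-sampling size, the tanh nonlinearity and the function names are illustrative assumptions, not the official implementation of [1].

```python
import numpy as np

def sample_neighbors(neighbors, size, rng):
    """Uniformly sample a fixed-size neighbor set (with replacement if too few)."""
    if len(neighbors) == 0:
        return []
    return list(rng.choice(neighbors, size=size, replace=len(neighbors) < size))

def graphsage_mean_layer(adj_list, H, W, sample_size=5, rng=None):
    """One mean-aggregator layer: h_v = sigma(W . MEAN({h_v} U {h_u, u in sampled N(v)}))."""
    rng = rng or np.random.default_rng(0)
    H_new = np.zeros((H.shape[0], W.shape[1]))
    for v, neighbors in enumerate(adj_list):
        sampled = sample_neighbors(neighbors, sample_size, rng)
        stacked = np.vstack([H[v]] + [H[u] for u in sampled]) if sampled else H[v][None, :]
        mean = stacked.mean(axis=0)             # MEAN over self and sampled neighbors
        H_new[v] = np.tanh(mean @ W)            # sigma taken as tanh in this sketch
    return H_new

# Example usage on a toy graph given as an adjacency list.
adj_list = [[1, 2], [0, 2, 3], [0, 1], [1]]
H = np.random.default_rng(1).normal(size=(4, 8))
W = np.random.default_rng(2).normal(size=(8, 16))
print(graphsage_mean_layer(adj_list, H, W).shape)  # (4, 16)
```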

Recently, structure-aware convolution and Structure-Aware Convolutional Neural Networks (SACNNs) have been proposed [42]. Univariate functions are used as filters, and they can deal with both Euclidean and non-Euclidean structured data.

Gate. There are several works attempting to use gate mechanisms such as the GRU [43] or LSTM [44] in the propagation step to diminish the restrictions of the former GNN models and improve the long-term propagation of information across the graph structure.

[45] proposed the gated graph neural network (GGNN), which uses Gated Recurrent Units (GRU) in the propagation step, unrolls the recurrence for a fixed number of steps $T$ and uses backpropagation through time to compute gradients.

Specifically, the basic recurrence of the propagation model is

$\mathbf{a}_v^t=\mathbf{A}_v^T\left[\mathbf{h}_1^{t-1}\ldots\mathbf{h}_N^{t-1}\right]^T+\mathbf{b}$
$\mathbf{z}_v^t=\sigma\left(\mathbf{W}^z\mathbf{a}_v^t+\mathbf{U}^z\mathbf{h}_v^{t-1}\right)$
$\mathbf{r}_v^t=\sigma\left(\mathbf{W}^r\mathbf{a}_v^t+\mathbf{U}^r\mathbf{h}_v^{t-1}\right)$
$\widetilde{\mathbf{h}}_v^t=\tanh\left(\mathbf{W}\mathbf{a}_v^t+\mathbf{U}\left(\mathbf{r}_v^t\odot\mathbf{h}_v^{t-1}\right)\right)$
$\mathbf{h}_v^t=\left(1-\mathbf{z}_v^t\right)\odot\mathbf{h}_v^{t-1}+\mathbf{z}_v^t\odot\widetilde{\mathbf{h}}_v^t$ (21)

The node $v$ first aggregates messages from its neighbors, where $\mathbf{A}_v$ is the sub-matrix of the graph adjacency matrix $\mathbf{A}$ denoting the connections of node $v$ with its neighbors. The GRU-like update functions then incorporate information from the other nodes and from the previous timestep to update each node's hidden state. $\mathbf{a}_v^t$ gathers the neighborhood information of node $v$, and $\mathbf{z}_v^t$ and $\mathbf{r}_v^t$ are the update and reset gates.
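
For concreteness, here is a minimal NumPy sketch of one GGNN-style propagation step (Eq. 21); edge-type-specific transforms are omitted and the parameter shapes and initialization are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ggnn_step(A, H, params):
    """One GRU-style GGNN propagation step over all nodes (Eq. 21).
    A: (N, N) adjacency, H: (N, d) hidden states; edge-type transforms are omitted."""
    a = A @ H + params["b"]                                         # aggregate neighbor states
    z = sigmoid(a @ params["W_z"] + H @ params["U_z"])              # update gate
    r = sigmoid(a @ params["W_r"] + H @ params["U_r"])              # reset gate
    h_tilde = np.tanh(a @ params["W_h"] + (r * H) @ params["U_h"])  # candidate state
    return (1 - z) * H + z * h_tilde                                # gated update

# Example usage with randomly initialized parameters.
rng = np.random.default_rng(0)
N, d = 5, 4
A = (rng.random((N, N)) < 0.4).astype(float)
H = rng.normal(size=(N, d))
params = {k: 0.1 * rng.normal(size=(d, d)) for k in
          ["W_z", "U_z", "W_r", "U_r", "W_h", "U_h"]}
params["b"] = np.zeros(d)
for _ in range(3):                                                  # unroll a fixed number of steps
    H = ggnn_step(A, H, params)
print(H.shape)  # (5, 4)
```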

LSTMs are also used in a similar way to the GRU through the propagation process, based on a tree or a graph.

[46] proposed two extensions to the basic LSTM architecture: the Child-Sum Tree-LSTM and the N-ary Tree-LSTM. Like standard LSTM units, each Tree-LSTM unit (indexed by $j$) contains input and output gates $\mathbf{i}_j$ and $\mathbf{o}_j$, a memory cell $\mathbf{c}_j$ and a hidden state $\mathbf{h}_j$. Instead of a single forget gate, the Tree-LSTM unit contains one forget gate $\mathbf{f}_{jk}$ for each child $k$, allowing the unit to selectively incorporate information from each child. The Child-Sum Tree-LSTM transition equations are the following:

$\tilde{\mathbf{h}}_j^t=\sum_{k\in\mathcal{N}_j}\mathbf{h}_k^{t-1}$
$\mathbf{i}_j^t=\sigma\left(\mathbf{W}^i\mathbf{x}_j^t+\mathbf{U}^i\tilde{\mathbf{h}}_j^t+\mathbf{b}^i\right)$
$\mathbf{f}_{jk}^t=\sigma\left(\mathbf{W}^f\mathbf{x}_j^t+\mathbf{U}^f\mathbf{h}_k^{t-1}+\mathbf{b}^f\right)$
$\mathbf{o}_j^t=\sigma\left(\mathbf{W}^o\mathbf{x}_j^t+\mathbf{U}^o\tilde{\mathbf{h}}_j^t+\mathbf{b}^o\right)$
$\mathbf{u}_j^t=\tanh\left(\mathbf{W}^u\mathbf{x}_j^t+\mathbf{U}^u\tilde{\mathbf{h}}_j^t+\mathbf{b}^u\right)$
$\mathbf{c}_j^t=\mathbf{i}_j^t\odot\mathbf{u}_j^t+\sum_{k\in\mathcal{N}_j}\mathbf{f}_{jk}^t\odot\mathbf{c}_k^{t-1}$
$\mathbf{h}_j^t=\mathbf{o}_j^t\odot\tanh\left(\mathbf{c}_j^t\right)$ (22)

$\mathbf{x}_j^t$ is the input vector at time $t$ in the standard LSTM setting.

If the branching factor of a tree is at most $K$ and all children of a node are ordered, i.e., they can be indexed from 1 to $K$, then the N-ary Tree-LSTM can be used. For node $j$, $\mathbf{h}_{jk}^t$ and $\mathbf{c}_{jk}^t$ denote the hidden state and memory cell of its $k$-th child at time $t$, respectively. The transition equations are the following:

$\mathbf{i}_j^t=\sigma\left(\mathbf{W}^i\mathbf{x}_j^t+\sum_{l=1}^{K}\mathbf{U}_l^i\mathbf{h}_{jl}^{t-1}+\mathbf{b}^i\right)$
$\mathbf{f}_{jk}^t=\sigma\left(\mathbf{W}^f\mathbf{x}_j^t+\sum_{l=1}^{K}\mathbf{U}_{kl}^f\mathbf{h}_{jl}^{t-1}+\mathbf{b}^f\right)$
$\mathbf{o}_j^t=\sigma\left(\mathbf{W}^o\mathbf{x}_j^t+\sum_{l=1}^{K}\mathbf{U}_l^o\mathbf{h}_{jl}^{t-1}+\mathbf{b}^o\right)$
$\mathbf{u}_j^t=\tanh\left(\mathbf{W}^u\mathbf{x}_j^t+\sum_{l=1}^{K}\mathbf{U}_l^u\mathbf{h}_{jl}^{t-1}+\mathbf{b}^u\right)$
$\mathbf{c}_j^t=\mathbf{i}_j^t\odot\mathbf{u}_j^t+\sum_{l=1}^{K}\mathbf{f}_{jl}^t\odot\mathbf{c}_{jl}^{t-1}$
$\mathbf{h}_j^t=\mathbf{o}_j^t\odot\tanh\left(\mathbf{c}_j^t\right)$ (23)

The introduction of separate parameter matrices for each child allows the model to learn more fine-grained representations conditioning on the states of a unit’s children than the Child-Sum Tree-LSTM.

The two types of Tree-LSTMs can be easily adapted to graphs. The graph-structured LSTM in [47] is an example of the N-ary Tree-LSTM applied to a graph. However, it is a simplified version, since each node in the graph has at most 2 incoming edges (from its parent and sibling predecessor). [34] proposed another variant of the Graph LSTM based on the relation extraction task. The main difference between graphs and trees is that edges of graphs have labels, and [34] utilizes different weight matrices to represent different labels:

$\mathbf{i}_v^t=\sigma\left(\mathbf{W}^i\mathbf{x}_v^t+\sum_{k\in\mathcal{N}_v}\mathbf{U}^i_{m(v,k)}\mathbf{h}_k^{t-1}+\mathbf{b}^i\right)$
$\mathbf{f}_{vk}^t=\sigma\left(\mathbf{W}^f\mathbf{x}_v^t+\mathbf{U}^f_{m(v,k)}\mathbf{h}_k^{t-1}+\mathbf{b}^f\right)$
$\mathbf{o}_v^t=\sigma\left(\mathbf{W}^o\mathbf{x}_v^t+\sum_{k\in\mathcal{N}_v}\mathbf{U}^o_{m(v,k)}\mathbf{h}_k^{t-1}+\mathbf{b}^o\right)$
$\mathbf{u}_v^t=\tanh\left(\mathbf{W}^u\mathbf{x}_v^t+\sum_{k\in\mathcal{N}_v}\mathbf{U}^u_{m(v,k)}\mathbf{h}_k^{t-1}+\mathbf{b}^u\right)$
$\mathbf{c}_v^t=\mathbf{i}_v^t\odot\mathbf{u}_v^t+\sum_{k\in\mathcal{N}_v}\mathbf{f}_{vk}^t\odot\mathbf{c}_k^{t-1}$
$\mathbf{h}_v^t=\mathbf{o}_v^t\odot\tanh\left(\mathbf{c}_v^t\right)$ (24)

where $m(v,k)$ denotes the edge label between node $v$ and node $k$.

[48] proposed the Sentence LSTM (S-LSTM) for improving text encoding. It converts text into a graph and utilizes the Graph LSTM to learn the representation. The S-LSTM shows strong representation power in many NLP problems. [49] proposed a Graph LSTM network to address the semantic object parsing task. It uses a confidence-driven scheme to adaptively select the starting node and determine the node updating sequence. It follows the same idea of generalizing existing LSTMs to graph-structured data but has a specific updating sequence, while the methods mentioned above are agnostic to the order of nodes.

Attention. The attention mechanism has been successfully used in many sequence-based tasks such as machine translation [50, 51, 52], machine reading [53] and so on. [54] proposed the graph attention network (GAT), which incorporates the attention mechanism into the propagation step. It computes the hidden state of each node by attending over its neighbors, following a self-attention strategy.

[54] proposed a single graph attentional layer and constructs arbitrary graph attention networks by stacking this layer. The layer computes the attention coefficient of the node pair $(i, j)$ by:

$\alpha_{ij}=\frac{\exp\left(\text{LeakyReLU}\left(\mathbf{a}^T\left[\mathbf{W}\mathbf{h}_i\,\|\,\mathbf{W}\mathbf{h}_j\right]\right)\right)}{\sum_{k\in\mathcal{N}_i}\exp\left(\text{LeakyReLU}\left(\mathbf{a}^T\left[\mathbf{W}\mathbf{h}_i\,\|\,\mathbf{W}\mathbf{h}_k\right]\right)\right)}$ (25)

where $\alpha_{ij}$ is the attention coefficient of node $j$ to node $i$, and $\mathcal{N}_i$ represents the neighborhood of node $i$ in the graph. The input set of node features to the layer is $\mathbf{h}=\{\mathbf{h}_1,\mathbf{h}_2,\ldots,\mathbf{h}_N\}$, $\mathbf{h}_i\in\mathbb{R}^F$, where $N$ is the number of nodes and $F$ is the number of features of each node; the layer produces a new set of node features (of potentially different cardinality $F'$), $\mathbf{h}'=\{\mathbf{h}'_1,\mathbf{h}'_2,\ldots,\mathbf{h}'_N\}$, $\mathbf{h}'_i\in\mathbb{R}^{F'}$, as its output. $\mathbf{W}\in\mathbb{R}^{F'\times F}$ is the weight matrix of a shared linear transformation applied to every node, and $\mathbf{a}\in\mathbb{R}^{2F'}$ is the weight vector of a single-layer feedforward neural network. The coefficients are normalized by a softmax function, and the LeakyReLU nonlinearity (with negative input slope $\alpha=0.2$) is applied.

Then the final output features of each node can be obtained by (after applying a nonlinearity $\sigma$):

$\mathbf{h}'_i=\sigma\left(\sum_{j\in\mathcal{N}_i}\alpha_{ij}\mathbf{W}\mathbf{h}_j\right)$ (26)

Moreover, the layer utilizes multi-head attention, similarly to [52], to stabilize the learning process. It applies $K$ independent attention mechanisms to compute the hidden states and then concatenates their features (or computes the average), resulting in the following two output representations:

$\mathbf{h}'_i=\big\Vert_{k=1}^{K}\sigma\left(\sum_{j\in\mathcal{N}_i}\alpha_{ij}^k\mathbf{W}^k\mathbf{h}_j\right)$ (27)
$\mathbf{h}'_i=\sigma\left(\frac{1}{K}\sum_{k=1}^{K}\sum_{j\in\mathcal{N}_i}\alpha_{ij}^k\mathbf{W}^k\mathbf{h}_j\right)$ (28)

where $\alpha_{ij}^k$ is the normalized attention coefficient computed by the $k$-th attention mechanism.

The attention architecture in [54] has several properties: (1) the computation of the node-neighbor pairs is parallelizable, and thus the operation is efficient; (2) it can be applied to graph nodes with different degrees by specifying arbitrary weights for neighbors; (3) it can be applied to inductive learning problems easily.
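
A minimal NumPy sketch of the single-head attention coefficients and aggregation (Eqs. 25 and 26) is given below; masking with the adjacency matrix and using tanh as $\sigma$ are illustrative choices for the example, not details prescribed by [54].

```python
import numpy as np

def gat_layer(A, H, W, a, neg_slope=0.2):
    """Single-head graph attention (Eqs. 25-26).
    A: (N, N) adjacency with self-loops, H: (N, F), W: (F, F'), a: (2F',)."""
    Wh = H @ W                                             # shared linear transform, (N, F')
    N = A.shape[0]
    # e_ij = LeakyReLU(a^T [Wh_i || Wh_j]) for every pair (i, j)
    e = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            z = a @ np.concatenate([Wh[i], Wh[j]])
            e[i, j] = z if z > 0 else neg_slope * z        # LeakyReLU
    e = np.where(A > 0, e, -1e9)                           # attend only over neighbors
    alpha = np.exp(e) / np.exp(e).sum(axis=1, keepdims=True)  # softmax over each row (Eq. 25)
    return np.tanh(alpha @ Wh)                             # h'_i = sigma(sum_j alpha_ij W h_j)

# Example usage on a small graph with self-loops added.
rng = np.random.default_rng(0)
A = (rng.random((4, 4)) < 0.5).astype(float) + np.eye(4)
H = rng.normal(size=(4, 3))
out = gat_layer(A, H, rng.normal(size=(3, 5)), rng.normal(size=(10,)))
print(out.shape)  # (4, 5)
```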

Skip connection. Many applications unroll or stack graph neural network layers aiming to achieve better results, since more layers (i.e., $k$ layers) let each node aggregate more information from neighbors $k$ hops away. However, it has been observed in many experiments that deeper models do not improve performance and may even perform worse [2]. This is mainly because more layers also propagate noisy information from an exponentially increasing number of expanded neighborhood members.

A straightforward method to address the problem is the residual network [55] from the computer vision community. But even with residual connections, GCNs with more layers do not perform as well as the 2-layer GCN on many datasets [2].

[56] proposed a Highway GCN which uses layer-wise gates similar to highway networks[57]. The output of a layer is summed with its input with gating weights:

$\mathbf{T}\left(\mathbf{h}^t\right)=\sigma\left(\mathbf{W}^t\mathbf{h}^t+\mathbf{b}^t\right)$
$\mathbf{h}^{t+1}=\mathbf{h}^{t+1}\odot\mathbf{T}\left(\mathbf{h}^t\right)+\mathbf{h}^t\odot\left(1-\mathbf{T}\left(\mathbf{h}^t\right)\right)$ (29)

By adding the highway gates, the performance peaks at 4 layers in a specific problem discussed in [56].

[58] studies properties and resulting limitations of neighborhood aggregation schemes. It proposed the Jump Knowledge Network, which could learn adaptive, structure-aware representations. The Jump Knowledge Network selects from all of the intermediate representations (which "jump" to the last layer) for each node at the last layer, which lets the model adapt the effective neighborhood size for each node as needed. [58] uses three approaches, concatenation, max-pooling and LSTM-attention, in the experiments to aggregate the information. The Jump Knowledge Network performs well in experiments on social, bioinformatics and citation networks. It could also be combined with models like Graph Convolutional Networks, GraphSAGE and Graph Attention Networks to improve their performance.

2.2.3 Training Methods

The original graph convolutional neural network has several drawbacks in its training and optimization methods. Specifically, GCN requires the full graph Laplacian, which is computationally expensive for large graphs. Furthermore, the embedding of a node at layer $L$ is computed recursively from the embeddings of all its neighbors at layer $L-1$. Therefore, the receptive field of a single node grows exponentially with respect to the number of layers, so computing the gradient for a single node costs a lot. Finally, GCN is trained independently for a fixed graph, which lacks the ability for inductive learning.

GraphSAGE [1] is a comprehensive improvement of the original GCN. To solve the problems mentioned above, GraphSAGE replaces the full graph Laplacian with learnable aggregation functions, which are key to performing message passing and generalizing to unseen nodes. As shown in Eq. 18, it first aggregates neighborhood embeddings, concatenates them with the target node's embedding, then propagates to the next layer. With learned aggregation and propagation functions, GraphSAGE can generate embeddings for unseen nodes. Also, GraphSAGE uses neighbor sampling to alleviate receptive field expansion.

FastGCN [59] further improves the sampling algorithm. Instead of sampling neighbors for each node, FastGCN directly samples the receptive field for each layer. FastGCN uses importance sampling, in which the importance factor is calculated as below:

$q(v)=\frac{\|\hat{\mathbf{A}}(:,v)\|^2}{\sum_{v'\in V}\|\hat{\mathbf{A}}(:,v')\|^2},\quad v\in V$ (30)
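
The snippet below sketches this layer-wise importance sampling step, assuming the FastGCN-style distribution $q(v)\propto\|\hat{\mathbf{A}}(:,v)\|^2$ over the normalized adjacency matrix; the sample size, variable names and toy graph are illustrative.

```python
import numpy as np

def fastgcn_sample_layer(A_hat, num_samples, rng=None):
    """Sample a layer-wise receptive field with importance weights q(v) ~ ||A_hat(:, v)||^2."""
    rng = rng or np.random.default_rng(0)
    col_norms = (A_hat ** 2).sum(axis=0)        # ||A_hat(:, v)||^2 for each node v
    q = col_norms / col_norms.sum()             # normalized importance distribution (Eq. 30)
    sampled = rng.choice(A_hat.shape[0], size=num_samples, replace=False, p=q)
    return sampled, q[sampled]                  # sampled node ids and their probabilities

# Example usage with a toy stand-in for the normalized adjacency matrix.
rng = np.random.default_rng(1)
A = (rng.random((8, 8)) < 0.4).astype(float)
A_hat = A + np.eye(8)
nodes, probs = fastgcn_sample_layer(A_hat, num_samples=4, rng=rng)
print(nodes, probs)
```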

In contrast to the fixed sampling methods above, [60] introduces a parameterized and trainable sampler to perform layer-wise sampling conditioned on the former layer. Furthermore, this adaptive sampler could find the optimal sampling importance and reduce variance simultaneously.

[61] proposed a control-variate based stochastic approximation algorithm for GCN by utilizing the historical activations of nodes as a control variate. This method limits the receptive field to the 1-hop neighborhood but uses the historical hidden states as an affordable approximation.

[62] focused on the limitations of GCN, which include the facts that GCN requires much additional labeled data for validation and suffers from the localized nature of the convolutional filter. To address these limitations, the authors proposed Co-Training GCN and Self-Training GCN to enlarge the training dataset. The former method finds the nearest neighbors of the training data, while the latter follows a boosting-like approach.

2.3 General Frameworks

Apart from different variants of graph neural networks, several general frameworks are proposed aiming to integrate different models into one single framework. [25] proposed the message passing neural network (MPNN), which unified various graph neural network and graph convolutional network approaches. [26] proposed the non-local neural network (NLNN). It unifies several “self-attention”-style methods [63, 52, 54]. [27] proposed the graph network (GN) which unified the MPNN and NLNN methods as well as many other variants like Interaction Networks[4, 64], Neural Physics Engine[65], CommNet[66], structure2vec[67, 7], GGNN[45], Relation Network[68, 69], Deep Sets[70] and Point Net[71].

2.3.1 Message Passing Neural Networks

[25] proposed a general framework for supervised learning on graphs called Message Passing Neural Networks (MPNNs). The MPNN framework abstracts the commonalities between several of the most popular models for graph-structured data, such as spectral approaches [35, 38, 2] and non-spectral approaches [33] in graph convolution, gated graph neural networks [45], interaction networks [4], molecular graph convolutions [72], deep tensor neural networks [73] and so on.

The model contains two phases, a message passing phase and a readout phase. The message passing phase (namely, the propagation step) runs for $T$ time steps and is defined in terms of a message function $M_t$ and a vertex update function $U_t$. Using messages $\mathbf{m}_v^t$, the updating functions of the hidden states $\mathbf{h}_v^t$ are as follows:

$\mathbf{m}_v^{t+1}=\sum_{w\in\mathcal{N}_v}M_t\left(\mathbf{h}_v^t,\mathbf{h}_w^t,\mathbf{e}_{vw}\right)$
$\mathbf{h}_v^{t+1}=U_t\left(\mathbf{h}_v^t,\mathbf{m}_v^{t+1}\right)$ (31)

where $\mathbf{e}_{vw}$ represents the features of the edge from node $v$ to $w$. The readout phase computes a feature vector for the whole graph using the readout function $R$ according to

$\hat{\mathbf{y}}=R\left(\{\mathbf{h}_v^T\mid v\in G\}\right)$ (32)

where $T$ denotes the total number of time steps. The message function $M_t$, vertex update function $U_t$ and readout function $R$ could have different settings. Hence the MPNN framework could generalize several different models via different function settings. Here we give an example of generalizing GGNN; other models' function settings can be found in [25]. The function settings for GGNNs are:

$M_t\left(\mathbf{h}_v^t,\mathbf{h}_w^t,\mathbf{e}_{vw}\right)=\mathbf{A}_{\mathbf{e}_{vw}}\mathbf{h}_w^t$
$U_t=\text{GRU}\left(\mathbf{h}_v^t,\mathbf{m}_v^{t+1}\right)$
$R=\sum_{v\in V}\sigma\left(i\left(\mathbf{h}_v^T,\mathbf{h}_v^0\right)\right)\odot\left(j\left(\mathbf{h}_v^T\right)\right)$ (33)

where $\mathbf{A}_{\mathbf{e}_{vw}}$ is the adjacency matrix, one for each edge label $e$. GRU is the Gated Recurrent Unit introduced in [43]. $i$ and $j$ are neural networks in the function $R$.
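
As a concrete illustration, the following is a minimal sketch of the generic message passing and readout phases (Eqs. 31 and 32), with simple linear message/update functions and a sum readout standing in for $M_t$, $U_t$ and $R$; these specific choices are assumptions for the example, not the settings of any particular model in [25].

```python
import numpy as np

def mpnn_forward(edges, edge_feats, X, W_msg, W_upd, T=3):
    """Generic MPNN: T message passing steps followed by a sum readout.
    edges: list of (v, w) pairs, edge_feats: dict (v, w) -> feature vector,
    X: (N, d) initial node states, W_msg/W_upd: weights of the toy M_t and U_t."""
    H = X.copy()
    for _ in range(T):
        M = np.zeros_like(H)
        for (v, w) in edges:
            # m_v^{t+1} += M_t(h_v, h_w, e_vw); here a linear function of [h_w || e_vw]
            M[v] += np.concatenate([H[w], edge_feats[(v, w)]]) @ W_msg
        # h_v^{t+1} = U_t(h_v, m_v^{t+1}); here a linear function of [h_v || m_v]
        H = np.tanh(np.concatenate([H, M], axis=1) @ W_upd)
    return H.sum(axis=0)                        # readout: one feature vector for the graph

# Example usage on a triangle graph with 2-dimensional edge features.
rng = np.random.default_rng(0)
edges = [(0, 1), (1, 0), (1, 2), (2, 1), (0, 2), (2, 0)]
edge_feats = {e: rng.normal(size=2) for e in edges}
X = rng.normal(size=(3, 4))
W_msg = rng.normal(size=(4 + 2, 4))             # maps [h_w || e_vw] to a message of size d
W_upd = rng.normal(size=(4 + 4, 4))             # maps [h_v || m_v] to the new state of size d
print(mpnn_forward(edges, edge_feats, X, W_msg, W_upd).shape)  # (4,)
```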

2.3.2 Non-local Neural Networks

[26] proposed the Non-local Neural Networks (NLNN) for capturing long-range dependencies with deep neural networks. The non-local operation is a generalization of the classical non-local mean operation [74] in computer vision. A non-local operation computes the response at a position as a weighted sum of the features at all positions. The set of positions can be in space, time or spacetime. Thus the NLNN can be viewed as a unification of different “self-attention”-style methods [63, 52, 54]. We will first introduce the general definition of non-local operations and then some specific instantiations.

Following the non-local mean operation[74], the generic non-local operation is defined as:

$\mathbf{y}_i=\frac{1}{\mathcal{C}(\mathbf{x})}\sum_{\forall j}f\left(\mathbf{x}_i,\mathbf{x}_j\right)g\left(\mathbf{x}_j\right)$ (34)

where $i$ is the index of an output position and $j$ is the index that enumerates all possible positions. $f(\mathbf{x}_i,\mathbf{x}_j)$ computes a scalar between $i$ and $j$ representing the relation between them, $g(\mathbf{x}_j)$ denotes a transformation of the input $\mathbf{x}_j$, and the factor $\frac{1}{\mathcal{C}(\mathbf{x})}$ is utilized to normalize the results.

There are several instantiations with different $f$ and $g$ settings. For simplicity, [26] uses the linear transformation as the function $g$. That means $g(\mathbf{x}_j)=\mathbf{W}_g\mathbf{x}_j$, where $\mathbf{W}_g$ is a learned weight matrix. Next we list the choices for the function $f$.

Gaussian. The Gaussian function is a natural choice according to the non-local mean[74] and bilateral filters[75]. Thus:

$f\left(\mathbf{x}_i,\mathbf{x}_j\right)=e^{\mathbf{x}_i^T\mathbf{x}_j}$ (35)

Here $\mathbf{x}_i^T\mathbf{x}_j$ is dot-product similarity and $\mathcal{C}(\mathbf{x})=\sum_{\forall j}f(\mathbf{x}_i,\mathbf{x}_j)$.

Embedded Gaussian. It is straightforward to extend the Gaussian function by computing similarity in the embedding space, which means:

$f\left(\mathbf{x}_i,\mathbf{x}_j\right)=e^{\theta(\mathbf{x}_i)^T\phi(\mathbf{x}_j)}$ (36)

where $\theta(\mathbf{x}_i)=\mathbf{W}_\theta\mathbf{x}_i$, $\phi(\mathbf{x}_j)=\mathbf{W}_\phi\mathbf{x}_j$ and $\mathcal{C}(\mathbf{x})=\sum_{\forall j}f(\mathbf{x}_i,\mathbf{x}_j)$.

It can be seen that the self-attention proposed in [52] is a special case of the Embedded Gaussian version. For a given $i$, $\frac{1}{\mathcal{C}(\mathbf{x})}f(\mathbf{x}_i,\mathbf{x}_j)$ becomes the softmax computation along the dimension $j$. So $\mathbf{y}=\text{softmax}\left(\mathbf{x}^T\mathbf{W}_\theta^T\mathbf{W}_\phi\mathbf{x}\right)g(\mathbf{x})$, which matches the form of self-attention in [52].
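
The following is a minimal sketch of the embedded-Gaussian non-local operation (softmax self-attention over all positions, Eqs. 34 and 36); the linear maps for $\theta$, $\phi$ and $g$ are randomly initialized placeholders rather than trained parameters.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)      # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def nonlocal_embedded_gaussian(X, W_theta, W_phi, W_g):
    """y_i = (1/C(x)) sum_j exp(theta(x_i)^T phi(x_j)) g(x_j), i.e. softmax attention over positions.
    X: (N, d) features at N positions; W_theta, W_phi, W_g: (d, d') projection matrices."""
    theta, phi, g = X @ W_theta, X @ W_phi, X @ W_g
    scores = theta @ phi.T                       # pairwise similarities theta(x_i)^T phi(x_j)
    attn = softmax(scores, axis=1)               # normalization C(x) realized as softmax along j
    return attn @ g                              # weighted sum of transformed features

# Example usage on a set of 6 positions with 8-dimensional features.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
out = nonlocal_embedded_gaussian(X, *(rng.normal(size=(8, 8)) for _ in range(3)))
print(out.shape)  # (6, 8)
```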

Dot product. The function $f$ can also be implemented as dot-product similarity:

$f\left(\mathbf{x}_i,\mathbf{x}_j\right)=\theta(\mathbf{x}_i)^T\phi(\mathbf{x}_j)$ (37)

Here the normalization factor is $\mathcal{C}(\mathbf{x})=N$, where $N$ is the number of positions in $\mathbf{x}$.

Concatenation. Here we have: