Get Rid of Suspended Animation Problem: Deep Diffusive Neural Network on Graph Semi-Supervised Classification

01/22/2020 ∙ by Jiawei Zhang, et al. ∙ IFM Lab 5

Existing graph neural networks may suffer from the "suspended animation problem" when the model architecture goes deep. Meanwhile, for some graph learning scenarios, e.g., nodes with text/image attributes or graphs with long-distance node correlations, deep graph neural networks will be necessary for effective graph representation learning. In this paper, we propose a new graph neural network, namely DIFNET (Graph Diffusive Neural Network), for graph representation learning and node classification. DIFNET utilizes both neural gates and graph residual learning for node hidden state modeling, and includes an attention mechanism for node neighborhood information diffusion. Extensive experiments will be done in this paper to compare DIFNET against several state-of-the-art graph neural network models. The experimental results can illustrate both the learning performance advantages and effectiveness of DIFNET, especially in addressing the "suspended animation problem".

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Graph neural networks (GNNs) [Wu et al.2019, Zhang2019] have been one of the most important research topic in recent years, and many different types of GNN models have been proposed already [Kipf and Welling2016]. Generally, via the aggregations of information from the neighbors [Hammond et al.2011, Defferrard et al.2016], GNNs can learn the representations of nodes in graph data effectively. The existing GNNs, e.g., GCN [Kipf and Welling2016], GAT [Veličković et al.2018] and their derived variants [Klicpera et al.2018, Li et al.2018b, Gao et al.2019], are mostly based on the approximated graph convolutional operator [Hammond et al.2011, Defferrard et al.2016]. As introduced in [Zhang and Meng2019], such an approximated aggregation operator actually reduces the neighboring information integration as an one-hop random walk layer followed by a fully-connected layer. Generally, in many learning scenarios, e.g., nodes with text/image attributes or graphs with close correlations among nodes that are far away, a deep GNN architecture will be necessary to either achieve the high-order latent attribute representations or integrate information from those long-distance nodes for graph representation learning.

(a) Training Accuracy
(b) Testing Accuracy
Figure 1: The learning performance of GCN (bias disabled) with 2-layer, , 7-layer, and 10-layer, , 50-layer on the Cora dataset. The x axis denotes the iterations over the whole training set. The y axes denote the training and testing accuracy, respectively.

Also as pointed out in [Zhang and Meng2019], most of the existing GNN models based on approximated graph convolutional operator will suffer from the “suspended animation problem” when the model architecture goes deep. Especially, when the GNNs’ model depth reaches certain limit (namely, the GNNs’ suspended animation limit), the model will not respond to the training data anymore and become not learnable. As illustrate in Figure 1, we show a case study of the vanilla GCN model (with bias term disabled) on the Cora dataset (a well-known benchmark graph dataset). According to the results, the model can achieve the best performance with 2 layers, and its performance gets degraded as the depth increases and further get choked as the depth reaches and more. Therefore, 5-layer is also called the “suspended animation limit” of GCN. Similar phenomena have been observed for the GCN model (with bias term enabled), GAT and their other derived models as well.

In this paper, we will introduce a novel graph neural network model, namely DifNet (Graph Diffusive Neural Net), for graph representation learning. DifNet is not based on the approximated graph convolutional operator and can work perfectly with deep model architectures. To be more specific, DifNet

introduces a new neuron unit, i.e.,

GDU (gated diffusive unit), for modeling and updating the graph node hidden states at each layer. Equipped with both neural gates and graph residual learning techniques, GDU can help address the aforementioned “suspended animation problem” effectively. Furthermore, DifNet also involves the attention mechanism for the neighboring node representation diffusion prior to feeding them into GDU. All the variables involved in GDU

can be learned automatically with the backpropagation algorithm efficiently.

(a) Training Accuracy
(b) Testing Accuracy
Figure 2: The learning performance of DifNet with 2-layer, 3-layer, , 50-layer on the Cora dataset.

To demonstrate the effectiveness of DifNet, in Figure 2, we also illustrate the learning performance of DifNet on the Cora dataset (all the settings are identical to those of Figure 1). From the plots, we observe that DifNet proposed in this paper can converge very fast and also outperforms GCN for the testing results. What’s more, DifNet generalizes very well to the deep architectures, and its learning performance can even get improved as the model depth increases. For instance, among all these layers, the highest testing accuracy score is achieved by DifNet (10-layer). To be more specific, the layers for DifNet denotes the number of GDU based layers. More detailed information and experimental results about DifNet will be provided in the following sections of this paper.

We summarize the contributions of this paper as follows:

  • We introduce a novel graph neural network model DifNet in this paper, which can help address the “suspended animation problem” effectively on graph representation learning.

  • We propose a new neuron model, i.e., GDU, which involves both neural gates and residual learning for node representation modeling and updating.

  • DifNet introduces an attention mechanism for the node representation diffusion and integration, which can work both effectively and efficiently.

  • We test the effectiveness of DifNet on several commonly used graph benchmark datasets and compare against both classic and state-of-the-art GNN models.

The remaining parts of this paper will be organized as follows. We will introduce the related work in Section 2. Definitions of important terminologies and formulation of the studied problem will be provided in Section 3. Detailed information about the DifNet model will be introduced in Section 4. The effectiveness of DifNet will be tested in Section 5. Finally, we will conclude this paper in Section 6.

2 Related Work

Several research topics are closely correlated with this paper, which include graph neural network, GNN learning problems, neural gates and residual learning, respectiely.

Graph Neural Network: Due to the inter-connected structure of graph data, traditional deep models (assuming the training instances are i.i.d.) cannot be directly applied to the graph data. In recent years, graph neural networks [Monti et al.2016, Atwood and Towsley2015, Masci et al.2015, Kipf and Welling2016, Scarselli et al.2009, Zhou et al.2018, Niepert et al.2016] have become one of the most popular research topic for effective graph representation learning. GCN proposed in [Kipf and Welling2016] feeds the generalized spectral features into the convolutional layer for representation learning. Similar to GCN, deep loopy graph neural network [Zhang2018] proposes to update the node states in a synchronous manner, and it introduces a spanning tree based learning algorithm for training the model. LoopyNet accepts nodes’ raw features into each layer of the model, and it doesn’t suffer from the suspended animation problem according to [Zhang and Meng2019]. GAT [Veličković et al.2018] leverages masked self-attentional layers to address the shortcomings of GCN. In recent years, we have also observed several derived variants of GCN and GAT models, e.g., [Klicpera et al.2018, Li et al.2018b, Gao et al.2019]. Due to the limited space, we can only name a few number of the representative graph neural network here. The readers are also suggested to refer to page111https://paperswithcode.com/task/node-classification, which provides a summary of the latest graph neural network research papers with code on the node classification problem.

GNNs’ Performance Problems: For the existing graph neural networks, several performance problems have been observed. In [Zhang and Meng2019], the unique “suspended animation problem” with graph neural networks is identified and formally defined. A theoretic analysis about the causes of such a problem is also provided in [Zhang and Meng2019], which also introduce the graph residual learning as the potential solution. Several other problems have also been observed in learning deep graph neural networks, e.g., the “over-smoothing problem” [Li et al.2018a]. To resolve such a problem, many research efforts have been witnessed [Li et al.2019, Gürel et al.2019, Sun et al.2019, Huang and Carley2019]. To enable deep GCNs, [Li et al.2019] proposes to adopt residual/dense connections and dilated convolutions into the GCN architecture. [Gürel et al.2019] tries to provide an explanation about the cause why GCN cannot benefit from deep architectures. In [Sun et al.2019], the authors propose a novel RNN-like deep graph neural network architecture by incorporating AdaBoost into the computation of network. Similar methods is also observed in [Huang and Carley2019], which also extends the recurrent network for deep graph representation learning.

Neural Gates and Residual Learning: Besides these two main topics, this paper is also closely correlated with neural gates learning [Hochreiter and Schmidhuber1997, Chung et al.2014] and residual learning [Srivastava et al.2015, He et al.2015, Zhang and Meng2019], respectively. Neural gate learning initially proposed in LSTM [Hochreiter and Schmidhuber1997] proposes a way to handle the gradient vanishing/exploding problem with deep model learning. Later on, such concepts have been widely used in deep model learning [Chung et al.2014, Gao and Glowacka2016]. Residual network [He et al.2015] has been demonstrated to be effective for deep CNN learning, which extends the high-way network [Srivastava et al.2015] to a more general case. In [Zhang and Meng2019], the authors also observe that conventional naive residual learning cannot work for graph neural networks any more, and they introduce several new graph residual terms for representation learning instead.

3 Problem Formulation

Here, we will provide the formulation of the problem studied in this paper. Prior to that, we will also introduce the notations and terminologies to be used in the problem formulation

3.1 Notations

In the sequel of this paper, we will use the lower case letters (e.g., ) to represent scalars, lower case bold letters (e.g.,

) to denote column vectors, bold-face upper case letters (e.g.,

) to denote matrices, and upper case calligraphic letters (e.g.,

) to denote sets or high-order tensors. Given a matrix

, we denote and as its row and column, respectively. The (, ) entry of matrix can be denoted as either . We use and to represent the transpose of matrix and vector . For vector , we represent its -norm as . The Frobenius-norm of matrix is represented as . The element-wise product of vectors and of the same dimension is represented as , whose concatenation is represented as .

3.2 Terminology Definition

In this paper, we focus on studying the node representation learning problem in graph data.

Definition 1.

(Graph): Formally, we can denote the graph data studied in this paper as , where and denote the sets of nodes and links in the graph. Mapping projects the links to their weights; whereas mappings and project the nodes to their raw features and labels. Here, and denote the raw feature space and label space, respectively.

According to the above definition, for presentation simplicity, given a node , we can also denote its raw feature vector and label vector as and , respectively. Meanwhile, given a link between nodes and , we can also denote its corresponding link weight as (or for simplicity). In the case that studied graph is unweighted, we have and .

Definition 2.

(Neighbor Set): Given the input graph , for any node in the graph, we can denote its neighbor set as the group of nodes adjacent to based on the links in the graph, i.e., .

In the case when the studied graph is directed, the neighborhood set can be furthermore divided into the the in-neighbor and out-neighbor, which will not be considered in this paper.

3.3 Problem Statement

In this paper, we will study the semi-supervised node classification problem via graph representation learning. Based on the notations and terminologies defined above, we can provide studied problem formulation in this paper as follows.

Definition 3.

(Semi-Supervised Node Classification): Given the input graph with a subset of partially labeled nodes , the node classification problem studied in this paper aims at learning a mapping to project the nodes to their desired labels. To be more specific, in building the mapping , we aim to to learn the representations of nodes based on both nodes’ self-raw features and the diffused information from their nearby neighbors. What’s more, the mapping should also be generalizable to deep architectures without suffering from the suspended animation problem.

4 Method

To address the semi-supervised node classification problem, we will introduce the DifNet (deep diffusive neural network) model in this part. We will first talk about the GDU neuron used in DifNet, and also provide necessary information analyzing its effectiveness in addressing the suspended animation problem. For the neighborhood information diffusion, we will propose an attention mechanism for information propagation in DifNet. Finally, detailed information about learning the DifNet model will be introduced at the end.

4.1 Gated Diffusive Unit (Gdu)

Figure 3: Detailed architecture of the GDU neuron (for node ).

GDU is a new neuron model proposed for graph representation learning specifically. Formally, given the input graph data involving the node set and link set , according to Section 3, we can denote the raw feature vector and neighbor set of node as and , respectively. The DifNet model to be build involves multiple layers, and we can denote the hidden state of node at the layer of DifNet as . For the nearby neighbors of , we can denote their hidden states as . DifNet introduce an attention based diffusion operator to propagate and aggregate the information among such neighbor nodes, which can be denoted as

(1)

The above operator will be defined in detail in Section 4.3.

As illustrated in Figure 3, the GDU (e.g., for node ) has multiple input portals. Besides the neighborhood aggregated information , GDU will also accept ’s lower-layer representation and graph residual term as the other two inputs. For the aggregated input from the neighbors, certain information in the vector can be useless for learning ’s new state. Therefore, GDU defines an adjustment gate to update as follows:

(2)

Here,

is the sigmoid function parameterized by

.

Meanwhile, for the representation input from the lower-layer of , i.e., , GDU also introduces a similar evolving gate for adjusting the hidden state vector:

(3)

In addition, to involve the graph residual learning [Zhang and Meng2019] in DifNet to overcome the suspended animation problem, GDU can also accept the graph residual terms as the third input, which can be denoted as

(4)

Formally, can represent any graph residual terms defined in [Zhang and Meng2019]. For instance, if the naive residual term is adopted here, we will have ; whereas for the raw residual term, we have . For the graph-naive residual term and graph-naive residual term, can be defined accordingly based on the graph adjacency matrix. In this paper, we will use the graph-raw residual term, as it can achieve the best performance as shown in [Zhang and Meng2019].

Based on these three inputs, , and , GDU can effectively integrate the useful information from them to update the hidden state of node . Instead of simple summation, integration of such information is achieved in GDU via two selection gates and denoted as follows:

(5)

where the selection gates

(6)

In the above equation, term denotes a vector filled with value of the same dimensions as the selection gate vectors and . Operator

denotes the hyperbolic tangent activation function and

denotes the entry-wise product as introduced in Section 3.1.

4.2 More Discussions on Gdu

Here, we would like to provide more discussions about the GDU neuron, and also introduce a simplified version of GDU to lower down the learning cost of DifNet.

4.2.1 Design of Gdu on Suspended Animation

The design of the GDU architecture can help DifNet solve the “suspended animation problem” when the model goes deep in three perspectives:

  • Anti-Suspended Animation: Instead of integrating the neighborhood information with the approximated graph convolutional operator via either equal importance (as GCN) or with weighted summation (as GAT), GDU introduces (a) an attentive neighborhood information diffusion and integration mechanism (to be introduced in the following subsection), and (b) a node state adjustment mechanism via four neural gates defined above. GDU is not based on the approximated graph convolutional operator at all. Both the attentive diffusion mechanism and neural gates based adjustment mechanism are dynamic, which can compute the parameters automatically depending on the specific learning scenarios and can overcome the suspended animation side-effects on the model learning performance.

  • Residual Learning: As introduced in [Zhang and Meng2019], the graph residual terms (especially the raw residual term and graph-raw residual term) can help the GNN models address the “suspended animation problem”. Both theoretic proofs and empirical experiments have demonstrated its effectiveness. The architecture of GDU also effectively integrate the graph residual learning mechanism into the learning process, where the input portal can accept various graph residual terms.

  • Gradient Exploding/Vanishing: In addition, as introduced in [Hochreiter and Schmidhuber1997], for the neural network models in a deep architecture, the learning process may also suffer from the gradient exploding/vanishing problem. Similar problem has been observed for the GNN models, which is different from the “suspended animation problem” that we mention above. Meanwhile, similar to LSTM [Hochreiter and Schmidhuber1997] and GRU [Chung et al.2014], the neural gates introduced in GDU can help ease even solve the “gradient exploding/vanishing” problem effectively.

Viewed in such perspectives, GDU is very different from the neurons used in the existing GNN models. In the following part, we will focus on applying GDU to construct the DifNet model for the graph data representation learning.

Figure 4: Framework architecture of the DifNet model.

4.2.2 Simplified Gdu

Meanwhile, we also observed that the GDU neuron involves a complex architecture involving multiple variables, which may lead to a relatively high time cost for learning the model. To lower down the learning time cost, we also introduce a simplified version of GDU in this paper here. The main difference of the simplified GDU versus the original one introduced before lie in Equations 5-6, which can be denoted as follows:

(7)

where the selection gates

(8)

According to the equations, one of the selection gate defined in Equation 6 is removed, and the definition of gate is also simplified, which is determined by the adjusted representations and only. Also the architecture of GDU is also updated, where the node state updating equation will balance between and (controlled by the gate ). The residual term is added to the representation at the end only, where is the variable for vector dimension adjustment, which can be shared across layers with the same hidden dimensions.

4.3 Attention based Neighborhood Diffusion

In this part, we will define the operator used in Equation (1) for node neighborhood information integration. The DifNet model defines such an operator based on an attention mechanism. Formally, given the nodes hidden states , DifNet defines the diffusion operator together with the nodes integrated representations as follows:

(9)

where covers the hidden states of all the nodes in the input graph. Matrices and are the identical copies of . The output matrix denotes the aggregated neighbor representations of all the nodes. Term denotes a mask matrix, where entry iff . Matrix can effectively help prune the influences among nodes which are not connected in the original input graph.

According to the above equation, DifNet quantifies the influence of node on in the layer based on their hidden states. If we expand the above equation, given and its adjacent neighbor set , we can also rewrite the learned ’s aggregated neighborhood representation vector as

(10)

where the influence weight is defined as

(11)

4.4 Graph Diffusive Neural Network Learning

Based on both the GDU neurons and the attentive diffusion operator, we can construct the DifNet model, whose detailed architecture is provided in Figure 4. Based on the input graph, DifNet involves multiple layers of GDU neurons, which accept inputs from the previous layer, diffusion operators and the graph residual terms. In DifNet, at each layer of the model, one GDU neuron is constructed to represent the state of each node. Based on the learned representations of each node at the last layer, one fully-connected layer is adopted to project the node representations to their labels. Formally, , according to Figure 4, its learning process of DifNet can be denoted as the following equations:

(12)

In the above equation, notation will embed the nodes’ raw input features (usually in a large dimension) into a latent representation. Depending on the raw features, different models can be used to define the function. For instance, CNN can be used if is an image; LSTM/RNN can be adopted if is text; and fully connected layers can be used if is just a long feature vector. The nodes’ hidden states is also initialized based on the embedded raw feature vectors with a linear layer parameterized by . The learning process involves several diffusion and GDU layers, which will update nodes’ representations iteratively. The final outputted latent state vectors will be projected to the nodes’ labels with a fully-connected layer (also normalized by the softmax function).

Based on the set of labeled nodes, i.e., the training set , DifNet computes the introduced loss by comparing the learned label against the ground-truth, which can be denoted as the following equation:

(13)

By minimizing the above loss term, DifNet can be effectively learned with the error back-propagation algorithm.

Method (Dataset) Model Depth (Accuracy)
10 20 30 40 50
GCN (Cora) 0.215 0.057 0.057 0.057 0.057
DifNet (Cora) 0.851 0.833 0.834 0.825 0.825
GCN (Citeseer) 0.064 0.064 0.064 0.064 0.064
DifNet (Citeseer) 0.650 0.665 0.663 0.657 0.647
GCN (Pubmed) 0.366 0.366 0.366 0.366 0.366
DifNet (Pubmed) 0.790 0.788 0.785 0.787 0.779
Table 1: Learning result by models with deeper architectures.

5 Experiments

To test the effectiveness of the proposed DifNet model, extensive experiments have been done on several graph benchmark datasets in this paper, including Cora, Citeseer, and Pubmed. Cora, Citeseer and Pubmed are the benchmark datasets used for semi-supervised node classification problem used in almost all the state-of-the-art GNN papers [Kipf and Welling2016, Veličković et al.2018, Zhang and Meng2019]. In the experiments, for fair comparison, we will follow the identical experimental settings (e.g., train/validation/test partition) as [Kipf and Welling2016].

Reproducibility: Both the datasets and source code used can be accessed via link222https://github.com/jwzhanggy/DifNet. Detailed information about the server used to run the model can be found at the footnote333GPU Server: ASUS X99-E WS motherboard, Intel Core i7 CPU 6850K@3.6GHz (6 cores), 3 Nvidia GeForce GTX 1080 Ti GPU (11 GB buffer each), 128 GB DDR4 memory and 128 GB SSD swap. For the deep models which cannot fit in the GPU memory, we run them with CPU instead..

Experimental Settings: By default, the experimental settings of the DifNet

model is as follows on all these datasets: learning rate: 0.01 (0.005 on pubmed), max-epoch: 1000, dropout rate: 0.5, weight-decay: 5e-4, hidden size: 16, graph residual term: graph-raw.

5.1 Model Depth vs. Suspended Animation

As illustrated by Figure 2 provided in the introduction section, we have shown that DifNet model can converge very fast even with less epochs than GCN. What’s more, as the model depth increases, we didn’t observe the suspended animation problem with DifNet at all, which can perform well even with very depth architectures, e.g., layers.

Furthermore, in Table 1, we also provide the learning accuracy of DifNet (with different depths) on Cora, Citeseer and Pubmed, respectively. To illustrate its performance advantages, we also add the results of GCN (with different depths) on these three datasets in the table as well. The results demonstrate that DifNet can also work well in handling the suspended animation problem for the other datasets from different sources as well.

Methods Datasets (Accuracy)
Cora Citeseer Pubmed
LP 0.680 0.453 0.630
ICA ([Lu and Getoor2003]) 0.751 0.691 0.739
ManiReg ([Belkin et al.2006]) 0.595 0.601 0.707
SemiEmb ([Weston et al.2008]) 0.590 0.596 0.711
DeepWalk ([Perozzi et al.2014]) 0.672 0.432 0.653
Planetoid ([Yang et al.2016]) 0.757 0.647 0.772
MoNet ([Monti et al.2016]) 0.817 - 0.788
GCN ([Kipf and Welling2016]) 0.815 0.703 0.790
GAT ([Veličković et al.2018]) 0.830 0.725 0.790
LoopyNet ([Zhang2018]) 0.826 0.716 0.792
SF-GCN ([Lin et al.2019]) 0.833 0.734 0.793
DifNet (original GDU) 0.851 0.727 0.783
DifNet (simplified GDU) 0.834 0.715 0.795
Time Cost (seconds)
DifNet (original GDU) 47.55 34.40 215.40
DifNet (simplified GDU) 23.34 24.96 113.77
Table 2: Learning result accuracy of node classification methods. In the table, ‘-’ denotes the results of the methods on these datasets are not reported in the existing works. We have run DifNet with depth values from . Results reported of DifNet in this paper denotes the best one among all these depths.

5.2 Comparison with State-of-the-Art

Besides the results showing that DifNet can perform well in addressing the suspended animation problem, to make the experimental results more complete, we also compare DifNet with both classic and state-of-the-art baseline models, whose results are provided in Table 2. According to the results, compared against these baseline methods, DifNet can outperform them with great advantages. Also for DifNet with the simplified GDU, it can achieve comparable performance as DifNet with the original GDU with the same learning epochs. However, the learning time cost with simplified GDU is greatly reduced according to the experiment results.

6 Conclusion

In this paper, we focus on studying the graph representation learning problem and apply the learned graph representations for node semi-supervised classification. To address the common “suspended animation problem” with the existing graph neural networks, we introduce a new graph neural network model, namely DifNet, in this paper. DifNet is constructed based on both the GDU neurons and the attentive diffusion operator. To handle the suspended animation problem, GDU includes both neural gate learning and graph residual learning into its architecture, which can also handle the conventional gradient vanishing/exploding problem as well. Extensive experiments have been done on several benchmark graph datasets for node classification, whose results also demonstrate the effectiveness of both DifNet and GDU in deep graph representation learning.

References