Semi-supervised Node Classification via Hierarchical Graph Convolutional Networks

02/13/2019 ∙ by Fenyu Hu, et al.

Graph convolutional networks (GCNs) have been successfully applied to node classification tasks in network mining. However, most models based on neighborhood aggregation are shallow and lack a "graph pooling" mechanism, which prevents them from obtaining adequate global information. To increase the receptive field, we propose a novel deep Hierarchical Graph Convolutional Network (H-GCN) for semi-supervised node classification. H-GCN first repeatedly aggregates structurally similar nodes into hyper-nodes and then refines the coarsened graph back to the original to restore the representation of each node. Instead of merely aggregating one- or two-hop neighborhood information, the proposed coarsening procedure enlarges the receptive field of each original node, so more global information can be learned. Comprehensive experiments on public datasets demonstrate the effectiveness of the proposed method over state-of-the-art methods. Notably, our model gains substantial improvements when only very few labeled samples are provided.




1 Introduction

Graphs have become ubiquitous owing to their ability to model complex systems such as social relationships, biological molecules, and publication citations. The problem of classifying graph-structured data is fundamental in many areas. Moreover, since unlabeled data is abundant and labeling is often expensive and time-consuming, it is both challenging and crucial to analyze graphs in a semi-supervised manner. For instance, in semi-supervised node classification on citation networks, where nodes denote articles and edges represent citations, the task is to predict the label of every article given only a few labeled examples.

As an efficient and effective approach to graph analysis, network embedding has attracted a lot of research interest. It aims to learn low-dimensional representations for nodes while preserving the topological structure and node feature attributes. Many methods have been proposed for network embedding that can be used in the node classification task, such as DeepWalk [Perozzi et al.2014], LINE [Tang et al.2015], and node2vec [Grover and Leskovec2016]. They convert the graph structure into sequences by performing random walks on the graph. Then, the first- and second-order structural similarities can be captured based on the co-occurrence rate in these sequences. However, they are unsupervised algorithms and ignore the feature attributes of nodes. Therefore, they cannot perform node classification in an end-to-end fashion.

Unlike these random-walk-based approaches, neural networks on graphs have been studied extensively in recent years. In the graph neural network (GNN) model, both the highly non-linear topological structure and the node attributes are fed into networks to obtain the graph embedding. Using an information diffusion mechanism, GNNs update the states of the nodes and propagate them until a stable equilibrium is reached [Scarselli et al.2009]. Recently, there has been increasing research interest in applying convolutional operations on graphs. These graph convolutional networks (GCNs) [Kipf and Welling2017, Veličković et al.2018] are based on the neighborhood aggregation scheme, which generates node embeddings by combining information from neighborhoods. Compared with conventional methods, GCNs achieve promising performance in various tasks such as node classification and graph classification [Defferrard et al.2016].

Nevertheless, GCN-based models are usually shallow and lack a "graph pooling" mechanism, which restricts the scale of the receptive field. For example, there are only 2 layers in GCN [Kipf and Welling2017]. As each graph convolutional layer approximates aggregation over first-order neighbors, the 2-layer GCN model only aggregates information from the 2-hop neighborhood of each node. Because of this restricted receptive field, the model has difficulty obtaining adequate global information. At the same time, the results reported in [Kipf and Welling2017] show that simply adding more layers degrades performance. As explained in [Li et al.2018], each GCN layer essentially acts as a form of Laplacian smoothing, which makes the features of nodes in the same connected component similar. Adding too many convolutional layers therefore over-smooths the output features and makes them indistinguishable. Meanwhile, deeper neural networks with more parameters are harder to train. Although some recent methods [Chen et al.2018, Xu et al.2018, Ying et al.2018] try to obtain global information through deeper models, they are either unsupervised or need many training examples. As a result, they are still not capable of solving the semi-supervised node classification task directly.
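The over-smoothing effect described above can be illustrated with a small numerical sketch. It is a simplification, not the model studied in [Li et al.2018]: trainable weights and non-linearities are dropped, so each "layer" is just one multiplication by the symmetrically normalized adjacency matrix with self-loops. The function names and the 5-node path graph are hypothetical choices made for illustration.

```python
import numpy as np

def normalized_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a dense adjacency matrix."""
    a_tilde = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def smoothed_spread(adj, features, num_layers):
    """Apply `num_layers` weight-free propagation steps and measure how
    different the node features still are.

    Rows are rescaled by 1/sqrt(degree + 1) before comparison, because
    repeated symmetric normalization drives each row towards a multiple
    of sqrt(degree + 1) rather than towards one common vector.
    """
    a_hat = normalized_adjacency(adj)
    h = features.astype(float)
    for _ in range(num_layers):
        h = a_hat @ h
    scale = np.sqrt((adj + np.eye(adj.shape[0])).sum(axis=1))
    h = h / scale[:, None]
    # Largest gap between any two nodes in any feature column.
    return float((h.max(axis=0) - h.min(axis=0)).max())

# Hypothetical example: a path graph 0-1-2-3-4 with one-hot node features.
path = np.zeros((5, 5))
for i in range(4):
    path[i, i + 1] = path[i + 1, i] = 1.0

for depth in (0, 2, 10, 100):
    print(depth, smoothed_spread(path, np.eye(5), depth))
```

The printed spread shrinks as depth grows, which is the sense in which node features in one connected component become indistinguishable.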

To this end, we propose a novel architecture of Hierarchical Graph Convolutional Networks, H-GCN for brevity, for node classification on graphs. Inspired by the success of deep architectures and pooling mechanisms in image classification, the H-GCN model increases the receptive field of graph convolutions and better captures global information. As illustrated in Figure 1, H-GCN mainly consists of several coarsening layers and refining layers. In each coarsening layer, a graph convolutional operation is first conducted to learn node representations. Then, a coarsening operation is performed to aggregate structurally similar nodes into hyper-nodes, as depicted in Figure 2. After this coarsening operation, each hyper-node represents a local structure, which facilitates exploiting global structures on the graph. Following the coarsening layers, we apply symmetric graph refining layers to restore the original graph structure for node classification. Such a hierarchical model captures node information comprehensively from local to global perspectives, leading to better node representations.

Figure 1: The workflow of H-GCN. In this illustration, there are 7 layers: 3 coarsening layers, 3 symmetric refining layers, and one output layer. Each coarsening layer produces a graph of hyper-nodes with latent representations of a layer-specific dimension, and vice versa for the refining layers.

Figure 2: The graph coarsening operation on an example graph. Numbers indicate edge weights, and shaded nodes are hyper-nodes. In SEG, the three nodes that share the same neighbors are grouped into a hyper-node. In SSG, the two nodes with the largest normalized connection weight are grouped.

The main contributions of this paper are twofold. Firstly, to the best of our knowledge, this is the first work to design a deep hierarchical model for the semi-supervised node classification task. Compared to previous work, the proposed model consists of more layers with larger receptive fields and is able to obtain more global information through the coarsening and refining procedures. Secondly, we conduct extensive experiments on a variety of public datasets and show that the proposed method consistently outperforms other state-of-the-art approaches. In particular, our model gains a considerable improvement over other approaches when only a few labeled samples are provided for each class.

The remainder of this paper is organized as follows. Section 2 reviews related work; Section 3 presents the proposed hierarchical structure-aware graph convolutional networks; Section 4 provides experiments and analysis; Section 5 concludes the paper and points out future directions.

2 Related Work

In this section, we review some previous work on graph convolutional networks for semi-supervised node classification, hierarchical representation learning on graphs, and graph reduction algorithms.

Graph convolutional networks for semi-supervised learning.

In the past few years, there has been a surge of work applying convolutions to graphs. These approaches are essentially based on the neighborhood aggregation scheme and can be further divided into two branches: spectral approaches and spatial approaches.

The spectral approaches use spectral graph theory to define parameterized filters. [Bruna et al.2014] first defines the convolutional operation in the Fourier domain. However, its heavy computational burden limits its application to large-scale graphs. To improve efficiency, [Defferrard et al.2016] proposes ChebNet, which approximates the $K$-polynomial filters by means of a Chebyshev expansion of the graph Laplacian. [Kipf and Welling2017] further simplifies ChebNet by truncating the Chebyshev polynomial to the first-order neighborhood. DGCN [Zhuang and Ma2018] uses random walks to construct a positive mutual information matrix. It then uses that matrix along with the graph's adjacency matrix to encode both local consistency and global consistency.

The spatial approaches generate node embeddings by combining neighborhood information in the vertex domain. MoNet [Monti et al.2017] and SplineCNN [Fey et al.2018] integrate the local signals by designing a universal patch operator. To generalize to unseen nodes in an inductive setting, GraphSAGE [Hamilton et al.2017] samples a fixed number of neighbors and employs several aggregation functions, such as concatenation, max-pooling, and an LSTM aggregator. GAT [Veličković et al.2018] introduces the attention mechanism to model the different influences of neighbors with learnable parameters. [Gao et al.2018] selects a fixed number of neighborhood nodes for each feature, which enables the use of regular convolutional operations on Euclidean spaces. However, the above two branches of GCNs are usually shallow and consequently cannot obtain adequate global information.

Hierarchical representation learning on graphs. Several methods have been proposed for learning hierarchical information on graphs. [Chen et al.2018, Liang et al.] use a coarsening procedure to construct a coarsened graph of smaller size and then employ unsupervised methods, such as DeepWalk [Perozzi et al.2014] and node2vec [Grover and Leskovec2016], to learn node embeddings on that coarsened graph. Then, they conduct a refining procedure to obtain the embedding of the original graph. These two-stage methods are not capable of utilizing node attribute information and cannot conduct node classification in an end-to-end fashion either. JK-Nets [Xu et al.2018] proposes general layer aggregation mechanisms to combine the output representations of every GCN layer. However, it can only propagate information across edges of the graph and is unable to aggregate information hierarchically. Therefore, the hierarchical structure of the graph cannot be learned by JK-Nets. To solve this problem, DiffPool [Ying et al.2018] proposes a pooling layer for graph embedding that reduces the graph size with a differentiable network. As DiffPool is designed for graph classification tasks, it cannot generate an embedding for every node in the graph, so it cannot be directly applied to node classification.

Graph reduction. Many approaches have been proposed to reduce the graph size without losing too much information, which facilitates downstream network analysis tasks such as community discovery and data summarization. There are two main classes of methods that reduce the graph size: graph sampling and graph coarsening. The first category is based on graph sampling strategies [Papagelis et al.2013, Hu and Lau2013, Chen et al.2017], which might lose key information during the sampling process. The second category applies graph coarsening strategies that collapse structurally similar nodes into hyper-nodes to generate a series of increasingly coarser graphs. The coarsening operation typically consists of two steps, i.e., grouping and collapsing. First, every vertex is assigned to a group in a heuristic manner. Here a group refers to a set of nodes that constitute a hyper-node. Then, these groups are used to generate a coarser graph. For an unmatched node, [Hendrickson and Leland1995] randomly selects one of its unmatched neighbors and merges the two vertices. [Karypis and Kumar1998] merges two unmatched nodes by selecting those connected by the maximum-weight edge. [LaSalle and Karypis2015] uses a secondary jump during matching.

However, these graph reduction approaches are usually used in unsupervised scenarios, such as community detection and graph partitioning. For semi-supervised node classification, existing graph reduction methods cannot be used directly, as they are not capable of learning complex attributive and structural features of graphs. In this paper, H-GCN conducts graph reduction in the same way that pooling mechanisms operate on Euclidean data. In this sense, our work bridges graph reduction for unsupervised tasks and the practical but more challenging semi-supervised node classification problem.

3 The Proposed Method

3.1 Preliminaries

3.1.1 Notations and Problem Definition

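For the input undirected graph $\mathcal{G}_1 = (\mathcal{V}_1, \mathcal{E}_1)$, where $\mathcal{V}_1$ and $\mathcal{E}_1$ are respectively the sets of nodes and edges, let $A_1 \in \mathbb{R}^{n_1 \times n_1}$ be the adjacency matrix and $X \in \mathbb{R}^{n_1 \times d_1}$ be the node feature matrix. For an H-GCN network with $l$ layers, the graph at layer $i$ is represented as $\mathcal{G}_i$ with $n_i$ nodes. The adjacency matrix and hidden representation matrix of $\mathcal{G}_i$ are represented by $A_i \in \mathbb{R}^{n_i \times n_i}$ and $H_i \in \mathbb{R}^{n_i \times d_i}$ respectively. Since coarsening layers and refining layers are symmetrical, $A_i$ is identical to $A_{l-i+1}$.

Given the labeled node set $\mathcal{V}_L$ containing $m$ nodes, where each node $v_i \in \mathcal{V}_L$ is associated with a label $y_i \in \mathcal{Y}$, our objective is to predict the labels of $\mathcal{V}_1 \setminus \mathcal{V}_L$.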

3.1.2 Graph Convolutional Networks

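Graph convolutional networks achieve promising generalization in various tasks, and our work is built upon the GCN module. At layer $i$, taking the graph adjacency matrix $A_i$ and the previous hidden representation matrix $H_i$ as input, each GCN module outputs a hidden representation matrix $G_i \in \mathbb{R}^{n_i \times d_{i+1}}$, which is described as:

$$G_i = \mathrm{ReLU}\left(\tilde{D}_i^{-\frac{1}{2}} \tilde{A}_i \tilde{D}_i^{-\frac{1}{2}} H_i \theta_i\right), \tag{1}$$

where $H_1 = X$, $\mathrm{ReLU}(x) = \max(0, x)$, the adjacency matrix with self-loops is $\tilde{A}_i = A_i + I$, $\tilde{D}_i$ is the degree matrix of $\tilde{A}_i$, and $\theta_i \in \mathbb{R}^{d_i \times d_{i+1}}$ is a trainable weight matrix.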
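As a minimal sketch of the GCN module in Eq. (1), the following NumPy function adds self-loops, applies the symmetric degree normalization of [Kipf and Welling2017], multiplies by a weight matrix, and applies ReLU. The function name `gcn_layer` and the dense-matrix representation are illustrative assumptions; a practical implementation would use sparse matrices and a deep learning framework so that the weights can be trained.

```python
import numpy as np

def gcn_layer(adj, hidden, weight):
    """One graph convolution: ReLU(D~^{-1/2} (A + I) D~^{-1/2} H W).

    adj:    (n, n) dense, symmetric adjacency matrix (edge weights allowed)
    hidden: (n, d_in) input node representations
    weight: (d_in, d_out) weight matrix (trainable in a real model)
    """
    a_tilde = adj + np.eye(adj.shape[0])      # add self-loops
    degree = a_tilde.sum(axis=1)              # degrees including self-loops
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    # Scale row j by d_j^{-1/2} and column k by d_k^{-1/2}.
    a_hat = a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_hat @ hidden @ weight, 0.0)

# Hypothetical usage on a 3-node path graph with random features.
rng = np.random.default_rng(0)
adj = np.array([[0.0, 1.0, 0.0],
                [1.0, 0.0, 1.0],
                [0.0, 1.0, 0.0]])
out = gcn_layer(adj, rng.normal(size=(3, 4)), rng.normal(size=(4, 2)))
print(out.shape)
```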

3.2 The Overall Architecture

For an H-GCN network of $l$ layers, the $i$th graph coarsening layer first conducts a graph convolutional operation as formulated in Eq. (1) and then aggregates structurally similar nodes into hyper-nodes, producing a coarser graph $\mathcal{G}_{i+1}$ and a node embedding matrix $H_{i+1}$ with fewer nodes. The corresponding adjacency matrix $A_{i+1}$ and $H_{i+1}$ are fed into the $(i+1)$th layer. Symmetrically, each graph refining layer also performs a graph convolution first and then refines the coarsened graph embedding back to the corresponding finer graph, restoring the finer graph structure. To ease optimization in deeper networks, we add shortcut connections [He et al.2016] across each coarsened graph and its corresponding refined part.

Since the topological structure of the graph changes between layers, we further introduce a node weight embedding matrix, which transforms the number of nodes contained in each hyper-node into real-valued vectors. Besides, we add multiple channels by employing several different GCNs to explore different feature subspaces.

The graph coarsening layers and refining layers together integrate different levels of node features and thus avoid over-smoothing during repeated neighborhood aggregation. After the refining process, we obtain a node embedding matrix $H_l$, where each row is a node representation vector. To classify each node, we apply an additional GCN module followed by a softmax layer on $H_l$.

3.3 The Graph Coarsening Layer

Every graph coarsening layer consists of two steps: graph convolution and graph coarsening. A GCN module is first used to extract structural and attributive features by aggregating neighborhood information as described in Eq. (1). For the graph coarsening procedure, we design the following two hybrid grouping strategies to assign structurally similar nodes to a hyper-node in the coarser graph.

Structural equivalence grouping (SEG). If two nodes share the same set of neighbors, they are considered structurally equivalent, and we assign them to the same hyper-node. For example, as illustrated in Figure 2, three nodes share the same neighbors and are therefore structurally equivalent, so they are allocated to one hyper-node. We mark all these structurally equivalent nodes and leave the other nodes unmarked.
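A minimal sketch of structural equivalence grouping, under the assumption that the graph is given as a dictionary mapping each node to the set of its neighbors (the function name and data layout are hypothetical). Nodes are bucketed by their neighbor set, and every bucket with at least two members becomes one hyper-node.

```python
from collections import defaultdict

def structural_equivalence_groups(neighbors):
    """Group nodes that have exactly the same neighbor set.

    neighbors: dict mapping node -> set of neighboring nodes
    Returns a list of groups (each a sorted list with >= 2 nodes);
    nodes that appear in no group stay unmarked for the next step.
    """
    buckets = defaultdict(list)
    for node, nbrs in neighbors.items():
        buckets[frozenset(nbrs)].append(node)
    return [sorted(group) for group in buckets.values() if len(group) >= 2]

# Hypothetical example: nodes 1, 2 and 3 are all attached only to hub 0.
graph = {
    0: {1, 2, 3, 4},
    1: {0},
    2: {0},
    3: {0},
    4: {0, 5},
    5: {4},
}
print(structural_equivalence_groups(graph))
```

Hashing the neighbor sets keeps this step linear in the total size of the adjacency lists, rather than comparing all node pairs.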

Structural similarity grouping (SSG). Then, we calculate the structural similarity between each unmarked node pair $(v_j, v_k)$ as the normalized connection strength $s(v_j, v_k)$:

$$s(v_j, v_k) = \frac{A_{jk}}{\sqrt{D(v_j) \cdot D(v_k)}}, \tag{2}$$

where $A_{jk}$ is the edge weight between $v_j$ and $v_k$, and $D(\cdot)$ is the node weight.

We iteratively take out an unmarked node and calculate similarity scores with all of its unmarked neighbors. We then select the neighbor with the largest structural similarity to form a new hyper-node and mark the two nodes. In particular, if a node is left unmarked and all of its neighbors are marked, it is marked as well and constitutes a hyper-node by itself. For example, in Figure 2, the node pair with the largest structural similarity score is assigned to one group. After that, since only one node remains unmarked, it constitutes a hyper-node by itself.

Note that taking out unmarked nodes in a different order yields a different hyper-graph. As a node with fewer neighbors has less chance of being grouped, we give such nodes higher priority. Therefore, in this paper, we take out the unmarked nodes in ascending order of the number of neighbors.
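The greedy matching described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the graph is a dictionary of weighted adjacency dictionaries, the similarity is assumed to be the edge weight divided by the square root of the product of the two node weights, and ties are broken by node id. All names are hypothetical.

```python
import math

def similarity(edge_weight, weight_j, weight_k):
    """Assumed normalized connection strength between two nodes."""
    return edge_weight / math.sqrt(weight_j * weight_k)

def structural_similarity_groups(adj, node_weight, marked=()):
    """Greedily pair each unmarked node with its most similar unmarked neighbor.

    adj:         dict node -> dict neighbor -> edge weight
    node_weight: dict node -> node weight
    marked:      nodes already grouped by an earlier step
    Returns a list of groups, each with one or two nodes.
    """
    marked = set(marked)
    groups = []
    # Nodes with fewer neighbors are visited first (ties broken by node id).
    order = sorted((n for n in adj if n not in marked),
                   key=lambda n: (len(adj[n]), n))
    for node in order:
        if node in marked:
            continue  # already absorbed as someone else's partner
        marked.add(node)
        candidates = sorted(k for k in adj[node] if k not in marked)
        if not candidates:
            groups.append([node])  # all neighbors taken: singleton hyper-node
            continue
        best = max(candidates,
                   key=lambda k: similarity(adj[node][k],
                                            node_weight[node], node_weight[k]))
        marked.add(best)
        groups.append([node, best])
    return groups

# Hypothetical example: a weighted triangle with unit node weights.
triangle = {0: {1: 1.0, 2: 2.0}, 1: {0: 1.0, 2: 1.0}, 2: {0: 2.0, 1: 1.0}}
print(structural_similarity_groups(triangle, {0: 1, 1: 1, 2: 1}))
```

Raising the node weight of a heavy hyper-node lowers its similarity to every neighbor, which discourages already large hyper-nodes from absorbing further nodes.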

Using the above two grouping strategies, we are able to acquire all the hyper-nodes. The node weight of a hyper-node is defined as the sum of the weights of the nodes it contains. Its edge weight to a neighboring node is calculated as the sum of the edge weights between that neighbor and the nodes contained in the hyper-node. The updated node weights and edge weights are used in Eq. (2) in the next coarsening layer.

To help restore the coarsened graph to the original graph, we preserve the collapsing relationship between nodes and their corresponding hyper-nodes in a grouping matrix. Formally, at layer $i$, we construct the grouping matrix $M_i \in \mathbb{R}^{n_i \times n_{i+1}}$, whose element $m_{jk}$ is calculated as:

$$m_{jk} = \begin{cases} 1, & \text{if } v_j \text{ in } \mathcal{G}_i \text{ is grouped into } v_k \text{ in } \mathcal{G}_{i+1}; \\ 0, & \text{otherwise}. \end{cases} \tag{3}$$

An example of the grouping matrix is given in Figure 2. Note that in this illustration, the entry linking the singleton node to its hyper-node equals 1, since that node constitutes a hyper-node by itself. Next, the hidden node embedding matrix is determined as:

$$H_{i+1} = M_i^\top \cdot G_i, \tag{4}$$

where $G_i$ is the GCN output of Eq. (1). In the end, we generate a collapsed graph $\mathcal{G}_{i+1}$, whose adjacency matrix can be calculated by:

$$A_{i+1} = M_i^\top \cdot A_i \cdot M_i. \tag{5}$$

The coarser graph $\mathcal{G}_{i+1}$ along with the current representation matrix $H_{i+1}$ is fed into the next layer as input. The node embeddings generated in each coarsening layer are therefore of progressively lower resolution. The graph coarsening procedure is summarized in Algorithm 1.
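The collapsing step can be sketched with a 0/1 grouping matrix. The sketch assumes that the coarse embedding is obtained by summing the member rows of the GCN output and that the coarse adjacency matrix is obtained by multiplying the adjacency matrix by the grouping matrix on both sides; the function names are hypothetical and dense matrices are used for brevity.

```python
import numpy as np

def grouping_matrix(groups, num_nodes):
    """Build M with M[j, k] = 1 if node j belongs to hyper-node k, else 0."""
    m = np.zeros((num_nodes, len(groups)))
    for k, members in enumerate(groups):
        m[members, k] = 1.0
    return m

def collapse(adj, gcn_out, groups):
    """Collapse a graph and its node representations onto hyper-nodes.

    Returns the coarse adjacency M^T A M, the coarse representation
    M^T G, and the grouping matrix M (kept for the later refining step).
    """
    m = grouping_matrix(groups, adj.shape[0])
    return m.T @ adj @ m, m.T @ gcn_out, m

# Hypothetical example: path 0-1-2 where nodes 0 and 1 are merged.
adj = np.array([[0.0, 1.0, 0.0],
                [1.0, 0.0, 1.0],
                [0.0, 1.0, 0.0]])
gcn_out = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [2.0, 2.0]])
coarse_adj, coarse_hidden, m = collapse(adj, gcn_out, [[0, 1], [2]])
print(coarse_adj)
print(coarse_hidden)
```

With this construction, edges between two members of the same hyper-node end up on the diagonal of the coarse adjacency matrix (each undirected edge counted twice), while edge weights to other hyper-nodes are summed.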

Input: Graph $\mathcal{G}_i$ and node representation $H_i$
Output: Coarsened graph $\mathcal{G}_{i+1}$ and node representation $H_{i+1}$
Calculate the GCN output $G_i$ according to Eq. (1)
Initialize all nodes as unmarked
/* Structural equivalence grouping */
Group and mark node pairs having the same neighbors
/* Structural similarity grouping */
Sort all unmarked nodes in ascending order according to the number of neighbors
repeat
      for each unmarked node pair $(v_j, v_k)$ do
            Calculate $s(v_j, v_k)$ according to Eq. (2)
      Group and mark the node pair having the largest $s(v_j, v_k)$
until all nodes are marked
/* Graph collapsing */
Construct the grouping matrix $M_i$ according to Eq. (3)
Calculate the node representation $H_{i+1}$ according to Eq. (4)
Construct the coarsened graph $\mathcal{G}_{i+1}$ according to Eq. (5)
return $\mathcal{G}_{i+1}$, $H_{i+1}$
Algorithm 1 The graph coarsening operation

3.4 The Graph Refining Layer

To restore the original topological structure of the graph and further facilitate node classification, we stack the same number of graph refining layers as coarsening layers. Like the coarsening procedure, each refining layer contains two steps, namely generating node embedding vectors and restoring node representations.

To learn a hierarchical representation of nodes, a GCN is employed first. Since we have saved the collapsing relationship in the grouping matrix during the coarsening process, we utilize $M_{l-i}$ to restore the refined node representation matrix of layer $i$. We further employ residual connections between each pair of corresponding coarsening and refining layers. In summary, node representations are computed by:

$$H_{i+1} = M_{l-i} \cdot G_i + G_{l-i}. \tag{6}$$
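A minimal sketch of one refining step, assuming that refinement copies each hyper-node representation back to all of its member nodes through the saved grouping matrix and then adds the output of the matching coarsening layer as a residual. The function and argument names are hypothetical.

```python
import numpy as np

def refine(grouping, coarse_out, skip_out):
    """Restore finer-level node representations.

    grouping:   (n_fine, n_coarse) 0/1 grouping matrix saved while coarsening
    coarse_out: (n_coarse, d) GCN output on the coarse graph
    skip_out:   (n_fine, d) GCN output of the matching coarsening layer,
                added as a residual (shortcut) connection
    """
    return grouping @ coarse_out + skip_out

# Hypothetical example: hyper-node 0 contains fine nodes 0 and 1.
grouping = np.array([[1.0, 0.0],
                     [1.0, 0.0],
                     [0.0, 1.0]])
coarse_out = np.array([[1.0, 1.0],
                       [2.0, 2.0]])
skip_out = np.array([[0.1, 0.0],
                     [0.0, 0.1],
                     [0.0, 0.0]])
print(refine(grouping, coarse_out, skip_out))
```

Without the residual term, the two members of a hyper-node would receive identical representations; the shortcut lets them keep their own fine-grained features.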


3.5 Node Weight Embedding and Multiple Channels

Since different hyper-nodes may contain different numbers of nodes, as depicted in Figure 2, we assume that such node weights reflect the hierarchical characteristics of coarsened graphs. We transform each node weight into a real-valued vector by looking it up in a randomly initialized node weight embedding matrix $V \in \mathbb{R}^{|T| \times p}$, where $T$ is a fixed-size weight set and $p$ is the dimension of the embedding. We apply node weight embedding in every coarsening and refining layer. For example, for graph $\mathcal{G}_i$, we obtain its representation $H_i$ and its node weight embedding $S_i \in \mathbb{R}^{n_i \times p}$. We then concatenate $H_i$ and $S_i$, and the resulting matrix is fed into the next layer.
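The lookup-and-concatenate step can be sketched as below. Two details are assumptions made only for this illustration: node weights are used directly as row indices into the embedding table, and weights larger than the table are clipped to its last row (one possible reading of a fixed-size weight set). The names are hypothetical.

```python
import numpy as np

def concat_node_weight_embedding(hidden, node_weights, embedding_table):
    """Append a node weight embedding to every node representation.

    hidden:          (n, d) node representations
    node_weights:    (n,) integer number of original nodes in each hyper-node
    embedding_table: (t, p) embedding matrix (trainable in a real model)
    Returns an (n, d + p) matrix.
    """
    # Assumption: weights beyond the table size share the last row.
    idx = np.clip(node_weights, 0, embedding_table.shape[0] - 1)
    return np.concatenate([hidden, embedding_table[idx]], axis=1)

# Hypothetical example: three hyper-nodes containing 1, 2 and 50 nodes.
rng = np.random.default_rng(0)
table = rng.normal(size=(4, 3))      # randomly initialized, 4 weight slots
hidden = np.ones((3, 2))
weights = np.array([1, 2, 50])
print(concat_node_weight_embedding(hidden, weights, table).shape)
```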

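Multi-head mechanisms help explore features in different subspaces, and H-GCN employs multiple GCN channels to obtain rich information jointly at each layer. After obtaining $c$ channels $[G_i^1, G_i^2, \cdots, G_i^c]$, we perform a weighted average over these feature maps:

$$G_i = \sum_{j=1}^{c} w_j \cdot G_i^j, \tag{7}$$

where $w_j$ is the weight of channel $j$.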
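Combining the channels can be sketched as a weighted sum over the per-channel feature maps. The function name is hypothetical, and the weights are passed in as plain numbers here; whether they are learned or normalized to sum to one is left to the caller in this sketch.

```python
import numpy as np

def combine_channels(channel_outputs, channel_weights):
    """Weighted combination of per-channel GCN outputs.

    channel_outputs: list of c arrays, each of shape (n, d)
    channel_weights: sequence of c scalars, one per channel
    Returns an (n, d) array equal to sum_j w_j * G^j.
    """
    stacked = np.stack(channel_outputs, axis=0)        # (c, n, d)
    weights = np.asarray(channel_weights, dtype=float)  # (c,)
    # Contract the channel axis: (c,) x (c, n, d) -> (n, d).
    return np.tensordot(weights, stacked, axes=1)

# Hypothetical example with two channels.
g1 = np.ones((3, 2))
g2 = 3.0 * np.ones((3, 2))
print(combine_channels([g1, g2], [0.25, 0.75]))
```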


3.6 The Output Layer

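Finally, in the output layer $l$, we use a GCN with a softmax layer on $H_l$ to output the class probabilities of nodes:

$$H_{l+1} = \mathrm{softmax}\left(\mathrm{ReLU}\left(\tilde{D}_l^{-\frac{1}{2}} \tilde{A}_l \tilde{D}_l^{-\frac{1}{2}} H_l \theta_l\right)\right), \tag{8}$$

where $\theta_l \in \mathbb{R}^{d_l \times |\mathcal{Y}|}$ is a trainable weight matrix and $H_{l+1} \in \mathbb{R}^{n_1 \times |\mathcal{Y}|}$ denotes the probabilities of nodes belonging to each class $y \in \mathcal{Y}$.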

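The loss function is defined as the cross-entropy of predictions over the labeled nodes:

$$\mathcal{L} = -\sum_{i=1}^{m} \sum_{z=1}^{|\mathcal{Y}|} \mathbb{I}(y_i = z) \log P(h_i, z), \tag{9}$$

where $\mathbb{I}(\cdot)$ is the indicator function, $y_i$ is the true label for $v_i$, $h_i$ is the prediction for labeled node $v_i$, and $P(h_i, z)$ is the predicted probability that $v_i$ is of class $z$.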
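In the semi-supervised setting, the cross-entropy is evaluated only on the rows that belong to labeled nodes, even though the network produces probabilities for every node. A minimal NumPy sketch, with hypothetical function names and a sum (rather than a mean) over the labeled nodes:

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def labeled_cross_entropy(probs, labels, labeled_idx):
    """Cross-entropy summed over the labeled nodes only.

    probs:       (n, num_classes) predicted class probabilities for all nodes
    labels:      (n,) integer class ids (entries of unlabeled nodes are ignored)
    labeled_idx: indices of the labeled nodes
    """
    labeled_idx = np.asarray(labeled_idx)
    # Probability assigned to the true class of each labeled node.
    picked = probs[labeled_idx, labels[labeled_idx]]
    return float(-np.log(picked).sum())

# Hypothetical example: 3 nodes, 2 classes, nodes 0 and 2 are labeled.
probs = np.array([[0.5, 0.5],
                  [0.9, 0.1],
                  [0.2, 0.8]])
labels = np.array([0, 0, 1])
print(labeled_cross_entropy(probs, labels, [0, 2]))
```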

4 Experiments and Analysis

In this section, we summarize the experimental settings, present the results, analyze the impact of the amount of available labeled data, and conduct parameter sensitivity analysis.

4.1 Experimental Settings

4.1.1 Datasets

For a comprehensive comparison with state-of-the-art methods, we use four widely-used datasets, including 3 citation networks and 1 knowledge graph. The statistics of these datasets are summarized in Table 1. We set the node weights and edge weights of the graph to 1 for all four datasets. The dataset configuration follows the setting in [Yang et al.2016, Kipf and Welling2017] for fair comparison. For citation networks, documents and citations are treated as vertices and edges, respectively. For the knowledge graph, each triplet, consisting of two entities and the relation between them, is converted into three nodes and two undirected edges. During training, only 20 labels per class are used for each citation network and only 1 label per class is used for NELL. Besides, 500 nodes in each dataset are selected randomly as the validation set. We do not use the validation set for model training.

Dataset | Cora | Citeseer | Pubmed | NELL
Type | Citation network | Citation network | Citation network | Knowledge graph
# Vertices | 2,708 | 3,327 | 19,717 | 65,755
# Edges | 5,429 | 4,732 | 44,338 | 266,144
# Classes | 7 | 6 | 3 | 210
# Features | 1,433 | 3,703 | 500 | 5,414
Labeling rate | 0.052 | 0.036 | 0.003 | 0.003
Table 1: Statistics of datasets used in experiments
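The labeling rates in Table 1 follow from the stated training setup (20 labels per class for the citation networks, 1 label per class for NELL) divided by the number of vertices. The short check below reproduces that arithmetic; the helper name is hypothetical, and the last line computes the rate for 5 labels per class on Pubmed, the scarcest setting used later in the paper.

```python
def labeling_rate(num_vertices, num_classes, labels_per_class):
    """Fraction of vertices that carry a training label."""
    return num_classes * labels_per_class / num_vertices

# (vertices, classes, labels per class) taken from Table 1 and the setup above.
datasets = {
    "Cora": (2708, 7, 20),
    "Citeseer": (3327, 6, 20),
    "Pubmed": (19717, 3, 20),
    "NELL": (65755, 210, 1),
}

for name, (vertices, classes, per_class) in datasets.items():
    print(name, round(labeling_rate(vertices, classes, per_class), 3))

# Pubmed with only 5 labels per class, expressed as a percentage.
print("Pubmed, 5 per class:", round(100 * labeling_rate(19717, 3, 5), 2), "%")
```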

4.1.2 Baseline Algorithms

To evaluate the performance of H-GCN, we compare our method with the following representative methods:

  • DeepWalk [Perozzi et al.2014] generates node embeddings via random walks in an unsupervised manner; nodes are then classified by feeding the embedding vectors into an SVM classifier.

  • Planetoid [Yang et al.2016] not only learns node embedding but also predicts the context in graph. It also leverages label information to build both transductive and inductive formulations.

  • GCN [Kipf and Welling2017] produces node embedding vectors by truncating the Chebyshev polynomial to the first-order neighborhoods.

  • GAT [Veličković et al.2018] generates node embedding vectors by modeling the differences between the node and its one-hop neighbors.

  • DGCN [Zhuang and Ma2018] utilizes the graph adjacency matrix and the positive mutual information matrix to encode both local consistency and global consistency.

4.1.3 Parameter Setting

We train our model using the Adam optimizer with learning rate of for epochs. The dropout is applied on all feature vectors with rates of . Besides, the regularization factor is set to . Considering different scales of datasets, we set the total number of layers to for citation networks and for the knowledge graph, and apply -channel GCNs in both coarsening and refining layers.

4.2 Node Classification Results

To demonstrate the overall performance of semi-supervised node classification, we compare the proposed method with other state-of-the-art methods. The performance in terms of accuracy is shown in Table 2. The best performance in each column is highlighted in boldface. The performance of our proposed method is reported as the average of 20 measurements. Note that running GAT on the NELL dataset requires more than 64 GB of memory, hence its performance on NELL is not reported.

Method | Cora | Citeseer | Pubmed | NELL
DeepWalk | 67.2% | 43.2% | 65.3% | 58.1%
Planetoid | 75.7% | 64.7% | 77.2% | 61.9%
GCN | 81.5% | 70.3% | 79.0% | 73.0%
GAT | 83.0 ± 0.7% | 72.5 ± 0.7% | 79.0 ± 0.3% | -
DGCN | 83.5% | 72.6% | 79.3% | 74.2%
H-GCN | 84.5 ± 0.5% | 72.8 ± 0.5% | 79.8 ± 0.4% | 80.1 ± 0.4%
Table 2: Results of node classification in terms of accuracy

The results show that the proposed method consistently outperforms other state-of-the-art methods, which verifies the effectiveness of the proposed coarsening and refining mechanisms. The performance of traditional random-walk-based algorithms such as DeepWalk and Planetoid is relatively poor. DeepWalk cannot model attribute information, which heavily restricts its performance. Though Planetoid combines supervised information with an unsupervised loss, there is still information loss during random sampling. To avoid that problem, GCN and GAT employ the neighborhood aggregation scheme to boost performance. GAT outperforms GCN as it can model different relations to different neighbors rather than using a pre-defined order. DGCN further jointly models both local and global consistency, yet its global consistency is still attained through random walks. In contrast, the proposed H-GCN captures global information through different levels of convolutional layers and achieves the best results on all four datasets. Notably, H-GCN surpasses the other baselines by a larger margin on the NELL dataset than on the citation networks. This is probably because there are fewer training samples per class on NELL than on the citation networks. Under such circumstances, training nodes are further away from testing nodes on average. As a result, the proposed H-GCN, with its larger receptive field and deeper layers, shows a more pronounced improvement over the baselines.

4.3 Impact of Varying Number of Training Data

We hypothesize that a larger receptive field in the convolutional model promotes the propagation of features and labels on graphs. To verify that the proposed H-GCN obtains a larger receptive field, we reduce the number of training samples and check whether H-GCN still performs well when limited labeled data is given. As unlabeled data is plentiful in practice, training the model with limited labeled data is also of great significance. In this section we conduct experiments with different numbers of labeled instances on the Pubmed dataset. We vary the number of labeled vertices from 20 to 5 per class, where the labeled data is randomly chosen from the original training set. All parameters are the same as previously described. The corresponding performance in terms of accuracy is reported in Table 3.

Method | 5 | 10 | 15 | 20
GCN | 69.0% | 72.2% | 76.9% | 79.0%
GAT | 70.3% | 75.4% | 77.3% | 79.0%
DGCN | 70.1% | 76.7% | 77.4% | 79.3%
H-GCN | 76.5% | 78.6% | 79.3% | 79.8%
Table 3: Results of node classification in terms of accuracy on Pubmed with labeled vertices varying from 5 per class to 20.

The table shows that our method outperforms the other baselines in all cases. As the number of labeled nodes decreases, our method obtains a larger margin over these baseline algorithms. In particular, when the number of labeled nodes per class is only 5 (0.08% labeling rate), the accuracy of H-GCN exceeds that of GCN, DGCN, and GAT by 7.5, 6.4, and 6.2 percentage points respectively. When the amount of training data decreases, it is more likely that nodes used for testing will be further away from the training nodes. Only when the receptive field is large enough can information from those training nodes be captured. As the receptive field of GCN and GAT does not exceed 2-hop neighborhoods, these baselines degrade considerably. In contrast, owing to its larger receptive field, the performance of H-GCN declines only slightly when the labeled data decreases dramatically. Overall, this verifies that the proposed H-GCN is well suited to settings where training data is extremely scarce.

4.4 Ablation Study

To verify the effectiveness of the individual components of H-GCN, we conduct ablation studies on the coarsening and refining layers and on the node weight embeddings in this section.

4.4.1 Coarsening and Refining Layers

We remove all coarsening and refining operations from H-GCN and compare the performance of this variant with the original H-GCN. The results are shown in Table 4. The full H-GCN performs better than the variant without coarsening mechanisms on all datasets. This indicates that the coarsening and refining mechanisms contribute to the performance improvements, since they obtain global information through larger receptive fields.

Method | Cora | Citeseer | Pubmed | NELL
H-GCN without coarsening and refining layers | 80.3% | 70.5% | 76.8% | 75.9%
H-GCN | 84.5% | 72.8% | 79.8% | 80.1%
Table 4: Results of the ablation study on coarsening and refining layers.

4.4.2 Node Weight Embeddings

To study the impact of node weight embeddings, we compare H-GCN with a variant that uses no node weight embeddings. Table 5 shows that the model with node weight embeddings performs better on all four datasets, which supports adding this embedding vector to the node embeddings.

Method | Cora | Citeseer | Pubmed | NELL
H-GCN without node weight embeddings | 84.2% | 72.4% | 79.5% | 79.6%
H-GCN | 84.5% | 72.8% | 79.8% | 80.1%
Table 5: Results of the ablation study on node weight embeddings.

4.5 Sensitivity Analysis

Last, we conduct parameter sensitivity analysis. Specifically, we investigate how different numbers of coarsening layers and different numbers of channels will affect the results respectively. The performance is reported in terms of accuracy on all four datasets. While one parameter studied in the sensitivity analysis is changed, other hyper-parameters remain the same.

Figure 3: Results with varying layers and channels in terms of accuracy. (a) Coarsening layers. (b) Channel numbers.

4.5.1 Effects of Coarsening Layers

Since the coarsening layers in our model control the granularity of the receptive field enlargement, we conduct the experiment with 1 to 8 coarsening layers and the same number of symmetric refining layers. The results are shown in Figure 3(a). H-GCN achieves its best performance with 4 coarsening layers on the citation networks and 5 on the knowledge graph. We suspect that, since fewer labeled nodes are supplied on NELL than on the other datasets, deeper layers and larger receptive fields are needed. However, when too many coarsening layers are added, the performance drops due to overfitting.

4.5.2 Effects of Channel Numbers

Next, we investigate the impact of the number of channels on performance. Multiple channels benefit the graph convolutional network model, since they explore different feature subspaces, as shown in Figure 3(b). The figure shows that performance improves as the number of channels increases up to 4, which demonstrates that more channels help capture accurate node features. Nevertheless, too many channels inevitably introduce redundant parameters to the model, which also leads to overfitting.

5 Conclusions

In this paper, we propose a novel hierarchical graph convolutional network for the semi-supervised node classification task. The proposed H-GCN consists of coarsening layers and symmetric refining layers. By collapsing structurally similar nodes into hyper-nodes, our model obtains a larger receptive field and enables sufficient information propagation. Compared with previous work, H-GCN is deeper and can exploit both local and global information. Comprehensive experiments show that the proposed method consistently outperforms other state-of-the-art methods. In particular, it achieves substantial gains over them when labeled data is extremely scarce.

The study of semi-supervised learning over networks in general remains wide open, with various challenges and applications in diverse areas. For future work, we plan to investigate two directions. On the one hand, we want to apply the proposed H-GCN to other node classification scenarios, especially in heterogeneous networks. On the other hand, we will explore more efficient convolutional filters for better performance.

