Graph-Revised Convolutional Network

11/17/2019 ∙ by Donghan Yu, et al. ∙ Carnegie Mellon University

Graph Convolutional Networks (GCNs) have received increasing attention in the machine learning community for effectively leveraging both the content features of nodes and the linkage patterns across graphs in various applications. As real-world graphs are often incomplete and noisy, treating them as ground-truth information, which is a common practice in most GCNs, unavoidably leads to sub-optimal solutions. Existing efforts for addressing this problem either involve an over-parameterized model which is difficult to scale, or simply re-weight observed edges without dealing with the missing-edge issue. This paper proposes a novel framework called Graph-Revised Convolutional Network (GRCN), which avoids both extremes. Specifically, a GCN-based graph revision module is introduced for predicting missing edges and revising edge weights w.r.t. downstream tasks via joint optimization. A theoretical analysis reveals the connection between GRCN and previous work on multigraph belief propagation. Experiments on six benchmark datasets show that GRCN consistently outperforms strong baseline methods by a large margin, especially when the original graphs are severely incomplete or the labeled instances for model training are highly sparse.


Introduction

Graph Convolutional Networks (GCNs) have received increasing attention in recent years as they are highly effective in graph-based node feature induction and belief propagation, and widely applicable to many real-world problems, including computer vision [wang2018dynamic, landrieu2018large], natural language processing [kipf2016semi, marcheggiani2017encoding], recommender systems [monti2017geometric, ying2018graph], epidemiological forecasting [wu2018deep], and more.

However, the power of GCNs has not been fully exploited as most of the models assume that the given graph perfectly depicts the ground-truth of the relationship between nodes. Such assumptions are bound to yield sub-optimal results as real-world graphs are usually highly noisy, incomplete (with many missing edges), and not necessarily ideal for different downstream tasks. Ignoring these issues is a fundamental weakness of many existing GCN methods.

Recent methods that attempt to modify the original graph fall into two major streams: 1) Edge reweighting: GAT [velivckovic2017graph] and GLCN [jiang2019semi] use an attention mechanism or feature similarity to reweight the existing edges of the given graph. Since the topological structure of the graph is not changed, the model is prone to be affected by noisy data when edges are sparse. 2) Full graph parameterization: LDS [franceschi2019learning], on the other hand, allows every possible node pair in a graph to be parameterized. Although this design is more flexible, the memory cost is intractable for large datasets, since the number of parameters grows quadratically with the number of nodes. Therefore, finding a balance between model expressiveness and memory consumption remains an open challenge.

To enable flexible edge editing while maintaining scalability, we develop a GCN-based graph revision module that performs edge addition and edge reweighting. In each iteration, we calculate an adjacency matrix via GCN-based node embeddings and select the edges with high confidence to be added. Our method permits gradient-based training of an end-to-end neural model that can predict unseen edges. Our theoretical analysis demonstrates the effectiveness of our model from the perspective of multigraphs [balakrishnan1997graph], which allow more than one edge from different sources between a pair of vertices. To the best of our knowledge, we are the first to reveal the connection between graph convolutional networks and multigraph propagation. Our contributions can be summarized as follows:

  • We introduce a novel structure that simultaneously learns both graph revision and node classification through different GCN modules.

  • Through theoretical analysis, we show our model's advantages from the perspective of multigraph propagation.

  • Comprehensive experiments on six benchmark datasets from different domains show that our proposed model achieves the best or highly competitive results, especially under the scenarios of highly incomplete graphs or sparse training labels.

Background

We first introduce some basics of graph theory. An undirected graph can be represented as $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ denotes the set of vertices and $\mathcal{E}$ denotes the set of edges. Let $n$ and $m$ be the number of vertices and edges, respectively. Each graph can also be represented by an adjacency matrix $A$ of size $n \times n$, where $A_{ij} = 1$ if there is an edge between $v_i$ and $v_j$, and $A_{ij} = 0$ otherwise. We use $A_i$ to denote the $i$-th row of the adjacency matrix. A graph with adjacency matrix $A$ is denoted as $\mathcal{G}(A)$. Usually each node $v_i$ has its own feature $x_i \in \mathbb{R}^{d}$, where $d$ is the feature dimension (for example, if nodes represent documents, the feature can be a bag-of-words vector). The node feature matrix of the whole graph is denoted as $X \in \mathbb{R}^{n \times d}$.
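For concreteness, a tiny sketch of these objects in Python; the graph and the feature values are made up purely for illustration:

```python
import numpy as np

# A 3-node undirected graph with edges (0,1) and (1,2): symmetric adjacency matrix A.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)

# Node feature matrix X of shape (n, d): here d = 4, e.g. tiny bag-of-words vectors.
X = np.array([[1, 0, 2, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 3]], dtype=float)

assert A.shape == (3, 3) and X.shape == (3, 4)
```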

Graph convolutional networks generalize the convolution operation on images to graph-structured data, performing layer-wise propagation of node features. Suppose we are given a graph with adjacency matrix $A$ and node features $X$. An $L$-layer Graph Convolutional Network (GCN) [kipf2016semi] conducts the following inductive layer-wise propagation:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right), \qquad l = 0, 1, \ldots, L-1 \qquad (1)$$

where $H^{(0)} = X$, $\tilde{A} = A + I$, and $\tilde{D}$ is a diagonal matrix with $\tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij}$. $W^{(l)}$ are the model parameters and $\sigma$ is the activation function. The node embedding $H^{(L)}$ can be used for downstream tasks. For semi-supervised node classification, GCN defines the final output as:

$$\hat{Y} = \mathrm{softmax}\left(H^{(L)}\right) \qquad (2)$$

where $\hat{Y} \in \mathbb{R}^{n \times c}$ and $c$ denotes the number of classes. We note that in the GCN computation, $A$ is directly used as the underlying graph without any modification. Additionally, in each layer, GCN only updates node representations as a degree-normalized aggregation of neighbor nodes.
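To make the propagation rule in Equation (1) concrete, below is a minimal PyTorch sketch of one GCN layer on a dense adjacency matrix; the class and function names are illustrative, not the authors' code:

```python
import torch
import torch.nn as nn

def normalize_adj(A: torch.Tensor) -> torch.Tensor:
    """Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    A_tilde = A + torch.eye(A.size(0), device=A.device)
    d_inv_sqrt = A_tilde.sum(dim=1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * A_tilde * d_inv_sqrt.unsqueeze(0)

class GCNLayer(nn.Module):
    """One layer of Equation (1): H' = sigma(A_hat H W)."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, A_hat: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.weight(A_hat @ H))
```

Stacking two such layers on the normalized adjacency reproduces the kind of two-layer GCN used later in both GRCN modules.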

To allow for an adaptive aggregation paradigm, GLCN [jiang2019semi] learns to reweight the existing edges by node feature embeddings. The reweighted adjacency matrix $\tilde{A}$ is calculated by:

$$\tilde{A}_{ij} = \frac{A_{ij} \exp\left(\mathrm{ReLU}\left(a^{\top} \left|x_i - x_j\right|\right)\right)}{\sum_{k=1}^{n} A_{ik} \exp\left(\mathrm{ReLU}\left(a^{\top} \left|x_i - x_k\right|\right)\right)} \qquad (3)$$

where $x_i$ denotes the feature vector of node $v_i$ and $a$ are model parameters. Another model, GAT [velivckovic2017graph], reweights edges by a layer-wise self-attention across node-neighbor pairs to compute hidden representations. For each layer $l$, the reweighted edge $\tilde{A}^{(l)}_{ij}$ is computed by:

$$\tilde{A}^{(l)}_{ij} = \frac{A_{ij} \exp\left(a\left(h^{(l)}_i, h^{(l)}_j\right)\right)}{\sum_{k=1}^{n} A_{ik} \exp\left(a\left(h^{(l)}_i, h^{(l)}_k\right)\right)} \qquad (4)$$

where $a(\cdot, \cdot)$ is a shared attention function to compute the attention coefficients. Compared with GLCN, GAT uses different layer-wise maskings to allow for more flexible representation. However, neither of the methods has the ability to add edges, since the revised edge $\tilde{A}_{ij}$ (or $\tilde{A}^{(l)}_{ij}$) is nonzero only if the original edge $A_{ij}$ is nonzero.
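To see why edge reweighting alone cannot introduce new links, the small sketch below (illustrative toy numbers, not from the paper) contrasts an entrywise product combination, which preserves the zero pattern of the original graph, with the entrywise sum used later by GRCN:

```python
import numpy as np

A = np.array([[0., 1., 0.],
              [1., 0., 0.],
              [0., 0., 0.]])          # original graph: node 2 is isolated
S = np.array([[0.0, 0.9, 0.7],
              [0.9, 0.0, 0.8],
              [0.7, 0.8, 0.0]])       # dense similarity scores from node features

reweighted = A * S    # entrywise product (GAT/GLCN-style): zeros of A stay zero
revised    = A + S    # entrywise sum (GRCN-style): new edges to node 2 appear

print(reweighted[2])  # [0. 0. 0.]   -- no edges can be added
print(revised[2])     # [0.7 0.8 0.] -- missing edges recovered
```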

In order to add new edges into the original graph, LDS [franceschi2019learning] makes the entire adjacency matrix parameterizable. It then jointly learns the graph structure and the GCN parameters by approximately solving a bilevel program as follows:

$$\min_{\theta \in \overline{\mathcal{H}}_{n}} \; \mathbb{E}_{A \sim \mathrm{Ber}(\theta)}\left[L_{val}\left(w_{\theta}, A\right)\right] \quad \text{s.t.} \quad w_{\theta} = \arg\min_{w} \; \mathbb{E}_{A \sim \mathrm{Ber}(\theta)}\left[L_{train}\left(w, A\right)\right] \qquad (5)$$

where $A \sim \mathrm{Ber}(\theta)$ means sampling the adjacency matrix $A$ from a Bernoulli distribution with parameter $\theta$, and $\overline{\mathcal{H}}_{n}$ is the convex hull of the set of all adjacency matrices for $n$ nodes. $L_{train}$ and $L_{val}$ denote the node classification loss on training and validation data, respectively. However, this method can hardly scale to large graphs since the parameter size of $\theta$ is $O(n^{2})$, where $n$ is the number of nodes. In the next section, we present our method, which resolves the issues in previous work.

Proposed Method

Graph-Revised Convolutional Network

Figure 1: Architecture of the proposed GRCN model for semi-supervised node classification. The node classification GCN is enhanced with a revised graph constructed by the graph revision GCN module.

Our Graph-Revised Convolutional Network (GRCN) contains two modules: a graph revision module and a node classification module. The graph revision module adjusts the original graph by adding or reweighting edges, and the node classification module performs classification using the revised graph. Specifically, in our graph revision module, we choose to use a GCN to combine the node features and the original graph input, as GCNs are effective at fusing data from different sources. We first learn the node embedding as follows:

$$Z = \mathrm{GCN}_{g}(A, X) \qquad (6)$$

where $\mathrm{GCN}_{g}$ denotes the graph convolutional network for graph revision, $A$ is the original graph adjacency matrix and $X$ is the node feature matrix. Then we calculate a similarity graph $S$ based on the node embedding $Z$ using a kernel function $K$:

$$S_{ij} = K\left(z_i, z_j\right) \qquad (7)$$

where $z_i$ denotes the embedding of node $v_i$. The revised adjacency matrix is formed by an elementwise summation of the original adjacency matrix and the calculated similarity matrix: $\tilde{A} = A + S$. Compared with the graph revision in GAT and GLCN, which use an entrywise product, we instead adopt the entrywise addition operator "$+$" in order for new edges to be considered. In this process, the original graph $A$ is revised by the similarity graph $S$, which can insert new edges into $A$ and potentially reweight or delete existing edges in $A$. In practice, we apply a sparsification technique on the dense matrix $S$ to reduce computational cost and memory usage, which will be introduced in the next section. Then the predicted labels are calculated by:

$$\hat{Y} = \mathrm{GCN}_{c}\left(\tilde{A}, X\right) \qquad (8)$$

where $\mathrm{GCN}_{c}$ denotes the graph convolutional network for the downstream node classification task. Figure 1 provides an illustration of our model. Finally, we use the cross-entropy loss as our objective function:

$$\mathcal{L} = -\sum_{i \in \mathcal{T}} \sum_{j=1}^{c} Y_{ij} \ln \hat{Y}_{ij} \qquad (9)$$

where $\mathcal{T}$ is the set of node indices that have labels and $c$ is the number of classes. It's worth emphasizing that our model does not need other loss functions to guide the graph revision process.

Overall, our model can be formulated as:

$$\hat{Y} = \mathrm{GCN}_{c}\left(A + K\left(\mathrm{GCN}_{g}(A, X)\right),\; X\right) \qquad (10)$$

where $K(\cdot)$ denotes the kernel matrix computed from the node embeddings in Equation (6). In our implementation, we use the dot product as the kernel function for simplicity, and we use a two-layer GCN [kipf2016semi] in both modules. Applying a two-layer GCN for graph revision is a design choice; our framework is highly flexible and can thus be adapted to other graph convolutional networks.
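The following PyTorch sketch summarizes the forward pass of Equations (6)-(10) under simplifying assumptions (dense adjacency, ReLU to keep the similarity graph non-negative, and no KNN sparsification); it is an illustrative sketch, not the released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sym_norm(A: torch.Tensor) -> torch.Tensor:
    """D^{-1/2} (A + I) D^{-1/2} on a dense adjacency matrix."""
    A = A + torch.eye(A.size(0), device=A.device)
    d_inv_sqrt = A.sum(dim=1).clamp(min=1e-12).pow(-0.5)
    return d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

class TwoLayerGCN(nn.Module):
    def __init__(self, in_dim, hid_dim, out_dim):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hid_dim, bias=False)
        self.w2 = nn.Linear(hid_dim, out_dim, bias=False)

    def forward(self, A_hat, X):
        return self.w2(A_hat @ torch.relu(self.w1(A_hat @ X)))

class GRCNSketch(nn.Module):
    """Graph revision (Eqs. 6-7) followed by classification on the revised graph (Eqs. 8, 10)."""
    def __init__(self, in_dim, hid_dim, emb_dim, num_classes):
        super().__init__()
        self.gcn_g = TwoLayerGCN(in_dim, hid_dim, emb_dim)      # graph revision module
        self.gcn_c = TwoLayerGCN(in_dim, hid_dim, num_classes)  # node classification module

    def forward(self, A, X):
        A_hat = sym_norm(A)
        Z = self.gcn_g(A_hat, X)            # node embeddings, Eq. (6)
        S = torch.relu(Z @ Z.t())           # dot-product similarity graph, Eq. (7)
        A_rev = sym_norm(A + S)             # entrywise sum A + S (KNN sparsification omitted)
        return F.log_softmax(self.gcn_c(A_rev, X), dim=1)   # predictions, Eq. (8)
```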

Sparsification

Since the adjacency matrix $S$ of the similarity graph is dense, directly applying it in the classification module is inefficient. Besides, we only want to keep the edges with higher confidence, to avoid introducing too much noise. Thus we conduct a $K$-nearest-neighbour (KNN) sparsification on the dense graph: for each node, we keep the edges with the top-$K$ prediction scores. The adjacency matrix of the KNN-sparse graph, denoted as $S^{(K)}$, is computed as:

$$S^{(K)}_{ij} = \begin{cases} S_{ij}, & S_{ij} \in \mathrm{topK}\left(S_i\right) \\ 0, & \text{otherwise} \end{cases} \qquad (11)$$

where $\mathrm{topK}(S_i)$ is the set of top-$K$ values of the vector $S_i$. Finally, in order to keep the symmetry property, the output sparse graph is calculated by:

$$\hat{S} = \frac{1}{2}\left(S^{(K)} + S^{(K)\top}\right) \qquad (12)$$

Now, since both the original graph $A$ and the similarity graph $\hat{S}$ are sparse, efficient sparse matrix multiplication can be applied in both GCN modules. At training time, gradients only backpropagate through the top-$K$ values.
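A minimal sketch of the KNN sparsification step, assuming a dense similarity matrix in PyTorch; the symmetrization in Equation (12) is written here as a simple average, matching the reconstruction above:

```python
import torch

def knn_sparsify(S: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the top-k entries of each row of a dense similarity matrix (Eq. 11),
    then symmetrize the result (Eq. 12). Gradients flow only through kept entries."""
    values, indices = S.topk(k, dim=1)                  # per-row top-k scores
    sparse = torch.zeros_like(S).scatter_(1, indices, values)
    return 0.5 * (sparse + sparse.t())                  # restore symmetry

# Usage: S_hat = knn_sparsify(Z @ Z.t(), k=50)
```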

Theoretical Analysis

In this section, we show the effectiveness of our model from the perspective of multigraph [balakrishnan1997graph] propagation. The major observation is that, for existing methods, the function learned by GCNs can be regarded as a linear combination of a limited set of pre-defined kernels, where the flexibility of the kernels has a large influence on the final prediction accuracy.

We consider a simplified graph convolutional network for ease of analysis. That is, we remove the feature transformation parameters $W^{(l)}$ and the non-linear activation function $\sigma$:

$$H = \hat{A}^{L} X \qquad (13)$$

where $L$ is the number of GCN layers. For simplicity, we denote $\hat{A} = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$ as the adjacency matrix with self-loops after normalization. The final output can be acquired by applying a linear or logistic regression function $f$ on the node embeddings above:

$$\hat{Y} = f(H) = f\left(\hat{A}^{L} X\right) \qquad (14)$$

where $\hat{Y}$ denotes the predicted labels of nodes. Then the following theorem shows that, under certain conditions, the optimal function $f^{*}$ can be expressed as a linear combination of kernel functions defined on training samples.

Representer Theorem [scholkopf2001generalized]. Consider a non-empty set $\mathcal{X}$ and a positive-definite real-valued kernel $k: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ with a corresponding reproducing kernel Hilbert space $\mathcal{H}_k$. If given: a. a set of training samples $\{(x_i, y_i)\}_{i=1}^{\ell} \subseteq \mathcal{X} \times \mathbb{R}$; b. a strictly monotonically increasing real-valued function $g: [0, \infty) \rightarrow \mathbb{R}$; and c. an error function $E: (\mathcal{X} \times \mathbb{R}^{2})^{\ell} \rightarrow \mathbb{R} \cup \{\infty\}$, which together define the following regularized empirical risk functional on $f \in \mathcal{H}_k$:

$$E\left((x_1, y_1, f(x_1)), \ldots, (x_{\ell}, y_{\ell}, f(x_{\ell}))\right) + g\left(\lVert f \rVert_{\mathcal{H}_k}\right)$$

Then, any minimizer of the empirical risk admits a representation of the form:

$$f^{*}(\cdot) = \sum_{i=1}^{\ell} \alpha_i \, k(\cdot, x_i)$$

where $\alpha_i \in \mathbb{R}$ for $i = 1, \ldots, \ell$.
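In our setting the training samples are node embeddings and the kernel is the dot product (both stated in the next paragraph), so the theorem specializes as in the short worked step below; here $\ell$ denotes the number of labeled nodes and $\alpha_i$ are the coefficients from the theorem:

```latex
% With samples x_i = h_i (the embedding of node v_i) and k(h_i, h_j) = h_i^{\top} h_j,
% the optimal predictor evaluated at any node v becomes
f^{*}(h_v) \;=\; \sum_{i=1}^{\ell} \alpha_i \, k(h_i, h_v)
          \;=\; \sum_{i=1}^{\ell} \alpha_i \, h_i^{\top} h_v .
% Hence predictions depend on the data only through inner products of embeddings,
% i.e. through the kernel matrix K = H H^{\top} computed in Equation (15).
```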

In our case, $x_i$ is the embedding of node $v_i$, i.e., $x_i = h_i = \left(\hat{A}^{L} X\right)_i$. As shown in the theorem, the final optimized output is a linear combination of certain kernels on node embeddings. We assume the kernel function to be the dot product for simplicity, which means $k(x_i, x_j) = x_i^{\top} x_j$. The corresponding kernel matrix can be written as:

$$K = H H^{\top} = \hat{A}^{L} X X^{\top} \hat{A}^{L} = \hat{A}^{L} A_f \hat{A}^{L} \qquad (15)$$

where $A_f = X X^{\top}$ is the adjacency matrix of the graph $\mathcal{G}_f$ induced by node features. Now we have two graphs based on the same node set: the original graph $\mathcal{G}$ (associated with adjacency matrix $\hat{A}$) and the feature graph $\mathcal{G}_f$ (associated with adjacency matrix $A_f$). They form a multigraph [balakrishnan1997graph], in which multiple edges are permitted between the same pair of end nodes. The random-walk-like matrix $\hat{A}^{L} A_f \hat{A}^{L}$ can then be regarded as one way to perform label/feature propagation on the multigraph. Its limitation is obvious: the propagation only happens once on the feature graph $\mathcal{G}_f$, which lacks flexibility. For our method, in contrast, we have:

$$K_{GRCN} = \left(\hat{A} + \hat{A}^{L_g} A_f \hat{A}^{L_g}\right)^{L} A_f \left(\hat{A} + \hat{A}^{L_g} A_f \hat{A}^{L_g}\right)^{L} \qquad (16)$$

where $L_g$ denotes the number of propagation steps in the (simplified) graph revision module, so that labels/features can propagate multiple times on the feature graph $\mathcal{G}_f$. Thus our model is more flexible and more effective, especially when the original graph is not reliable or cannot provide enough information for downstream tasks. In Equation (16), $\hat{A} + \hat{A}^{L_g} A_f \hat{A}^{L_g}$ can be regarded as a combination of different edges in the multigraph. To reveal the connection between GRCN and GLCN [jiang2019semi], we first consider the special case of our model with $L_g = 0$ and $L = 1$: $K = \left(\hat{A} + A_f\right) A_f \left(\hat{A} + A_f\right)$. The operator "$+$" is analogous to the OR operator, which incorporates information from both graph $\mathcal{G}$ and $\mathcal{G}_f$. GLCN [jiang2019semi], by contrast, takes another combination of the form $\hat{A} \circ A_f$ using the Hadamard (entrywise) product "$\circ$", which is analogous to the AND operator.

We can further extend our model to a layer-wise version for comparison with GAT [velivckovic2017graph]. More specifically, for the $l$-th layer, we denote the input as $H^{(l)}$. The output is then calculated by:

$$H^{(l+1)} = \left(\hat{A} + A_f^{(l)}\right) H^{(l)} \qquad (17)$$

where $A_f^{(l)} = H^{(l)} H^{(l)\top}$. Similar to the analysis above, if we take this layer-wise special case of GRCN and change the edge combination operator from the entrywise sum "$+$" to the entrywise product "$\circ$", we have $H^{(l+1)} = \left(\hat{A} \circ A_f^{(l)}\right) H^{(l)}$, which is the key idea behind GAT [velivckovic2017graph]. Due to the property of the entrywise product, the combined edges of both GAT and GLCN are only reweighted edges of $\mathcal{G}$, which becomes ineffective when the original graph is highly sparse. Through the analysis above, we see that our model is more general in combining different edges, by varying the number of propagation steps $L_g$ in the graph revision module, and also uses a more robust combination operator "$+$" compared to previous methods.

Experiments

We evaluate the proposed GRCN model on semi-supervised node classification tasks, and conduct extensive experimental analysis in the following sections.

Dataset #nodes #edges #feature #class
Cora 2708 5429 1433 7
CiteSeer 3327 4732 3703 6
PubMed 19717 44338 500 3
CoraFull 19793 65311 8710 70
Amazon Computers 13381 245778 767 10
Coauthor CS 18333 81894 6805 15
Table 1: Data statistics

Dataset

We use six benchmark datasets for semi-supervised node classification evaluation. Among them, Cora, CiteSeer [sen2008collective] and PubMed [namata2012query] are three commonly used datasets. For a more robust comparison of model performance, we conduct 10 random splits while keeping the same number of labels for training, validation and testing as in previous work [yang2016revisiting]. To further test the scalability of our model, we utilize three other datasets: Cora-Full [bojchevski2018deep], Amazon Computers and Coauthor CS [shchur2018pitfalls]. The first is an extended version of Cora, while the second and third are co-purchase and co-authorship graphs, respectively. On these three datasets, we follow previous work [shchur2018pitfalls] and take 20 labels per class for training, 30 for validation, and the rest for testing. We also remove classes with fewer than 50 labels to make sure each class contains enough instances. The data statistics are shown in Table 1.
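One possible way to load these benchmarks with the PyTorch Geometric dataset wrappers and to build the random splits described above; the paths and the split helper below are illustrative sketches, not the paper's exact preprocessing:

```python
import torch
from torch_geometric.datasets import Planetoid, CitationFull, Amazon, Coauthor

# Citation benchmarks.
cora = Planetoid(root='data/Cora', name='Cora')
citeseer = Planetoid(root='data/CiteSeer', name='CiteSeer')
pubmed = Planetoid(root='data/PubMed', name='PubMed')

# Larger benchmarks; splits are created manually (20 train / 30 val per class).
cora_full = CitationFull(root='data/CoraFull', name='Cora')
computers = Amazon(root='data/Amazon', name='Computers')
coauthor_cs = Coauthor(root='data/Coauthor', name='CS')

def random_split(data, num_classes, train_per_class=20, val_per_class=30):
    """Randomly pick train/validation nodes per class; the rest form the test set."""
    train_mask = torch.zeros(data.num_nodes, dtype=torch.bool)
    val_mask = torch.zeros(data.num_nodes, dtype=torch.bool)
    for c in range(num_classes):
        idx = (data.y == c).nonzero(as_tuple=False).view(-1)
        idx = idx[torch.randperm(idx.size(0))]
        train_mask[idx[:train_per_class]] = True
        val_mask[idx[train_per_class:train_per_class + val_per_class]] = True
    test_mask = ~(train_mask | val_mask)
    return train_mask, val_mask, test_mask
```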

Baselines

We compare the effectiveness of our GRCN model with several baselines, where the first two models are vanilla graph convolutional networks without any graph revision:

  • GCN [kipf2016semi]: one of the earlier models which performs a linear approximation to spectral graph convolution.

  • SGC [wu2019simplifying] removes the nonlinearities and collapses the weight matrices between consecutive layers, and thus can increase the number of layers without introducing more model parameters.

  • GAT [velivckovic2017graph] uses an attention mechanism for edge reweighting during the feature aggregation step.

  • LDS [franceschi2019learning] jointly learns the graph structure and the parameters of the graph convolutional network by solving a bilevel program.

  • GLCN [jiang2019semi] integrates both graph learning and graph convolution in a unified network architecture, which is most related to our model.

[Table 2: methods GCN, SGC, GAT, LDS, GLCN and GRCN evaluated on Cora, CiteSeer and PubMed (random splits) and on Cora-Full, Amazon Computers and Coauthor CS; numeric entries not preserved in this extraction. LDS is N/A on PubMed and on the three larger datasets.]

Table 2: Mean test classification accuracy and standard deviation in percent for all models and all datasets. For each dataset, the highest accuracy score is marked in bold. N/A stands for datasets that could not be processed by the full-batch version because of GPU RAM limitations.

Implementation Details

Transductive setting is used for node classification on all the datasets. We train GRCN for epochs using Adam [kingma2014adam] and select the model with highest validation accuracy for test. We set learning rate as for graph refinement module and for label prediction module. Weight decay and sparsification parameter are tuned by grid search on validation set, with the search space and

respectively. Our code is based on Pytorch 

[paszke2017automatic]

and one geometric deep learning extension library 

[fey2019fast], which provides implementation for GCN [kipf2016semi], SGC [wu2019simplifying] and GAT [velivckovic2017graph]. For LDS [franceschi2019learning], the results were obtained using the publicly available code. Since an implementation for GLCN [jiang2019semi] was not available, we report the results based on our own implementation of the original paper.
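A sketch of the validation-based grid search described above; the candidate values are placeholders rather than the grids used in the paper, and `build_grcn` / `train_and_validate` are assumed helper functions:

```python
import itertools

# Hypothetical search grids -- the actual search spaces are not reproduced here.
weight_decays = [5e-4, 1e-3, 5e-3]
ks = [5, 10, 50, 100, 200]

best = {'val_acc': -1.0}
for wd, k in itertools.product(weight_decays, ks):
    model = build_grcn(weight_decay=wd, sparsification_k=k)   # assumed constructor
    val_acc = train_and_validate(model)                       # assumed training loop
    if val_acc > best['val_acc']:
        best = {'val_acc': val_acc, 'weight_decay': wd, 'K': k}
print(best)
```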

Main Results

Table 2 shows the mean accuracy and the corresponding standard deviation for all models across the 6 datasets averaged over 10 different runs. We see that our proposed model achieves the best or highly competitive results for all the datasets. The effectiveness of our model over the other baselines demonstrates that taking the original graph as input for GCN is not optimal for graph propagation in semi-supervised classification.

To further test the superiority of our model, we consider the edge-sparse scenario, where a certain fraction of edges in the given graph is randomly removed. Given an edge retaining ratio, we randomly sample the retained edges 10 times and report the mean classification accuracy and standard deviation. Figure 2 shows the results under different ratios of retained edges. There are several observations from this figure. First, our model GRCN achieves notable improvement on almost all the datasets, especially when the edge retaining ratio is low; at the lowest retaining ratio, our model outperforms the second-best model by a clear margin on each dataset. Second, the GAT and GLCN models, which reweight the existing edges, do not perform well, indicating that such a reweighting mechanism is not sufficient when the original graph is highly incomplete. Third, our method also outperforms the over-parameterized model LDS on Cora and CiteSeer because of our restrained edge editing procedure. Although LDS achieves better performance than the other baseline methods on these two datasets, its inability to scale prevents us from testing it on four of the larger datasets.
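The edge-sparse setting can be simulated as below; this is a sketch assuming an undirected edge list with each edge listed once, and the names are illustrative:

```python
import numpy as np

def retain_edges(edges: np.ndarray, ratio: float, seed: int = 0) -> np.ndarray:
    """Randomly keep `ratio` of the undirected edges; `edges` has shape (m, 2)."""
    rng = np.random.default_rng(seed)
    keep = rng.choice(len(edges), size=int(ratio * len(edges)), replace=False)
    return edges[np.sort(keep)]

# Usage: for each retaining ratio, resample 10 times with different seeds and
# average the resulting test accuracy, as in Figure 2.
```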

Figure 2: Mean test classification accuracy on all the datasets (panels: Cora, CiteSeer, PubMed, Cora-Full, Amazon Computers, Coauthor CS) under different ratios of retained edges, over 10 different runs.

Robustness on Training Labels

We also show that the gains achieved by our model are very robust to a reduction in the number of training labels per class, denoted by $t$. We compare all the models on the Cora-Full, Amazon Computers and Coauthor CS datasets with a fixed edge retaining ratio. We vary $t$ over 5, 10 and 15 labels per class and report the results in Table 3. While containing more parameters than vanilla GCN, our model still outperforms the others, and it wins by a larger margin when $t$ is smaller. This demonstrates our model's capability to handle tasks with sparse training labels.

[Table 3: methods GCN, SGC, GAT, GLCN and GRCN evaluated on Cora-Full, Amazon Computers and Coauthor CS with 5, 10 and 15 training labels per class; numeric entries not preserved in this extraction.]

Table 3: Mean test classification accuracy and standard deviation on the Cora-Full, Amazon Computers and Coauthor CS datasets under different numbers of training labels per class. The edge retaining ratio is fixed for all results. For each dataset, the highest accuracy score is marked in bold.

Hyperparameter Analysis

We investigate the influence of the hyperparameter $K$ in this section. After calculating the similarity graph in GRCN, we use $K$-nearest-neighbour sparsification to generate a sparse graph out of the dense graph. This is not only beneficial for efficiency, but also important for effectiveness. Figure 3 shows classification accuracy on the Cora dataset for different values of $K$ and different edge retaining ratios. From this figure, increasing the value of $K$ helps improve the classification accuracy at the initial stage. However, after reaching a peak, further increasing $K$ lowers the model performance. We conjecture that this is because a larger $K$ introduces too much noise and thus lowers the quality of the revised graph.

Figure 3: Results of GRCN under different settings of the sparsification parameter $K$ on the Cora dataset, with different edge retaining ratios.

Ablation Study

To further examine the effectiveness of our GCN-based graph revision module, we conduct an ablation study by testing three different simplifications of the graph revision module:

  • Feature-Only (FO): $\tilde{A} = K(X)$, where the revised graph is the (sparsified) similarity graph computed directly from the raw node features, ignoring the original graph.

  • Feature plus Graph (FG): $\tilde{A} = A + K(X)$, which adds the original graph to the feature similarity graph used in FO.

  • Random Walk Feature plus Graph (RWFG): $\tilde{A} = A + K(\bar{X})$, where $\bar{X}$ denotes the node features propagated over the original graph without learnable transformations.

Note that FO is the simplest method and only uses the node features to construct the graph, without any information from the original graph. It is followed by the FG method, which adds the original graph to the feature similarity graph used in FO. Our model is most closely related to the third method, RWFG, which constructs the feature graph from similarities of node features obtained via graph propagation, but without feature learning.
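A compact sketch of the three simplified revision strategies, using the same dense-matrix notation as the GRCN sketch above; the sparsification threshold and the propagation depth for RWFG are illustrative choices:

```python
import torch

def knn_sym(S, k=50):
    """Row-wise top-k sparsification followed by symmetrization (Eqs. 11-12)."""
    v, idx = S.topk(k, dim=1)
    sparse = torch.zeros_like(S).scatter_(1, idx, v)
    return 0.5 * (sparse + sparse.t())

def revise_fo(A_hat, X, k=50):
    """Feature-Only: similarity graph from raw features, original graph ignored."""
    return knn_sym(torch.relu(X @ X.t()), k)

def revise_fg(A_hat, X, k=50):
    """Feature plus Graph: original graph added to the FO similarity graph."""
    return A_hat + revise_fo(A_hat, X, k)

def revise_rwfg(A_hat, X, k=50, steps=2):
    """Random Walk Feature plus Graph: propagate features over the graph
    (no learned transformation) before computing similarities."""
    H = X
    for _ in range(steps):
        H = A_hat @ H
    return A_hat + knn_sym(torch.relu(H @ H.t()), k)
```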

We conduct the ablation experiment on the Cora dataset with different edge retaining ratios and report the results in Figure 4. The comparison between FO and FG shows that adding the original graph as residual links is helpful for all edge retaining ratios, especially when there are more known edges in the graph. Examining the results of FG and RWFG, we can also observe a large improvement brought by graph propagation on features. Finally, the gap between our model and RWFG underscores the importance of feature learning, especially in the cases of low edge retaining ratio.

Figure 4: Results of our model and its simplified versions on the Cora dataset with different ratios of retained edges.

Related Work

Graph Convolutional Network

Graph Convolutional Networks (GCNs) were first introduced by [bruna2013spectral], with subsequent development and improvements from [henaff2015deep]. Overall, GCNs fall into two categories: spectral convolution and spatial convolution. Spectral convolution operates on the spectral representation of graphs, defined in the Fourier domain by the eigen-decomposition of the graph Laplacian [defferrard2016convolutional, kipf2016semi]. Spatial convolution operates directly on the graph to aggregate groups of spatially close neighbors [atwood2016diffusion, hamilton2017inductive]. Besides these methods, which are applied directly to an existing graph, GAT [velivckovic2017graph] and GLCN [jiang2019semi] use an attention mechanism or feature similarity to reweight the original graph for better GCN performance, while LDS [franceschi2019learning] reconstructs the entire graph via bilevel optimization. Although our work is related to these methods, we develop a different strategy for graph revision that maintains both efficiency and high flexibility.

Link prediction

Link prediction aims at identifying missing links, or links that are likely to form, in a given network. It is widely applicable to many tasks, including predicting friendship in social networks [dong2012link] and affinities between users and items in recommender systems [berg2017graph]. One line of prior work uses heuristic methods based on the local neighborhood structure of nodes, including first-order heuristics such as common neighbors and preferential attachment [barabasi1999emergence], second-order heuristics such as Adamic-Adar and resource allocation [zhou2009predicting], and higher-order heuristics such as PageRank [brin1998anatomy]. To relax the strong assumptions of heuristic methods, a number of neural-network-based methods [grover2016node2vec, zhang2018link] have been proposed, which are capable of learning general structural features. The problem we study in this paper is related to link prediction since we revise the graph by adding or reweighting edges. However, instead of treating link prediction as an objective in itself, our work focuses on improving node classification by feeding the revised graph into GCNs.

Conclusion

This paper presents Graph-Revised Convolutional Network, a novel framework for incorporating graph revision into graph convolution networks. We show both theoretically and experimentally that the proposed way of graph revision can significantly enhance the prediction accuracy for downstream tasks, such as semi-supervised node classification. GRCN overcomes two main drawbacks in previous approaches to graph revision, which either employ over-parameterized models and consequently face scaling issues, or fail to consider missing edges. In our experiments with node classification tasks, the performance of GRCN stands out in particular when the input graphs are highly incomplete or if the labeled training instances are very sparse. Additionally, as a key advantage, GRCN is also highly scalable to large graphs.

In the future, we plan to explore GRCN in a broader range of prediction tasks, such as knowledge base completion, epidemiological forecasting and aircraft anomaly detection based on sensor network data.

References