Fully Linear Graph Convolutional Networks for Semi-Supervised Learning and Clustering

by   Yaoming Cai, et al.

This paper presents FLGC, a simple yet effective fully linear graph convolutional network for semi-supervised and unsupervised learning. Instead of using gradient descent, we train FLGC by computing a globally optimal closed-form solution with a decoupled procedure, resulting in a generalized linear framework that is easier to implement, train, and apply. We show that (1) FLGC handles both graph-structured data and regular data effectively, (2) training graph convolutional models with closed-form solutions improves computational efficiency without degrading performance, and (3) FLGC acts as a natural generalization of classic linear models, e.g., ridge regression and subspace clustering, in the non-Euclidean domain. Furthermore, we implement a semi-supervised FLGC and an unsupervised FLGC by introducing an initial residual strategy, enabling FLGC to aggregate long-range neighborhoods and alleviate over-smoothing. We compare our semi-supervised and unsupervised FLGCs against many state-of-the-art methods on a variety of classification and clustering benchmarks, demonstrating that the proposed FLGC models consistently outperform previous methods in terms of accuracy, robustness, and learning efficiency. The core code of our FLGC is released at https://github.com/AngryCai/FLGC.




I Introduction

Graph neural networks (GNNs) have emerged as a powerful technique for representation learning on graphs [57, 59]. Current GNN models usually follow a neighborhood aggregation scheme, where the feature representation of a node is updated by recursively aggregating the representation vectors of its neighbors [31, 52]. Thanks to their ability to model graph-structured data, traditional learning problems can be revisited from the perspective of graph representation learning, ranging from recommendation [13] and computer vision [36, 5, 11] to combinatorial optimization [16, 50].

In recent years, there has been increasing attention to generalizing convolution to the graph domain. Existing graph convolutional networks (GCNs) are broadly categorized into spectral approaches and spatial approaches [59, 52]. Spectral models commonly use spectral graph theory to design spectral filters, such as ChebyNet [9] and the vanilla GCN [27], whereas spatial models define graph convolutions based on a node's spatial relations, e.g., GraphSAGE [19, 10] and Graph Attention Networks (GAT) [46]. Owing to their sound theoretical guarantees, spectral models have been studied extensively. Despite their success, most state-of-the-art (SOTA) GCN models are shallow and simple because of the over-smoothing problem [31, 35]. To circumvent the problem, many efforts have been made to develop deeper and more robust models. Proven techniques include residual connections [35], randomly dropping nodes [14] or edges [43], and data augmentation [58].

More recent attempts at simplifying GCN have collectively suggested that a GCN model can be decoupled into two successive stages: parameter-free neighborhood propagation and task-specific prediction. Following this scheme, Wu et al. [51] proposed Simple Graph Convolution (SGC) by simplifying the vanilla GCN into a low-pass filter followed by a linear classifier. Approximate Personalized Propagation of Neural Predictions (APPNP) [28] swaps the two stages and establishes a connection between GCN and the well-known PageRank [40]. These successful attempts point to two useful tendencies. First, simplified GCNs have a capability similar to that of more elaborate ones in handling structural data. Second, a linear GCN is as powerful as its nonlinear counterpart in most graph scenarios. Nonetheless, few studies have managed to develop a unified framework between traditional linear models and GCNs. Moreover, gradient descent-based training of simplified GCNs often suffers from local optima, tedious hyper-parameter tuning, and training tricks. Hence, providing a globally optimal solution in closed form would simplify GCNs further.

Earlier linear models, such as ridge regression [2] and subspace clustering [12], have been frequently applied in practice owing to their simplicity, efficiency, and efficacy. However, these classic models essentially work in the Euclidean domain and therefore handle graph-structured data poorly. Before GCNs emerged, graph Laplacian regularization (or manifold regularization) [3, 6] had been widely applied in various linear models to incorporate structural information. This inspired a series of classic graph-regularized semi-supervised approaches, e.g., Laplacian Regularized Least Squares (LapRLS) [3], and graph-regularized unsupervised approaches, e.g., Laplacian Regularized Low-Rank Representation [55]. Despite being a useful technique, Laplacian regularization has three shortcomings: 1) it typically depends on the Euclidean domain and is thus hard to generalize directly to real-world graph data; 2) it considers only first-order neighborhoods while ignoring important long-range interdependency; 3) its additional regularization coefficient needs to be tuned appropriately.

In this paper, we propose a very simple and unified framework, referred to as Fully Linear Graph Convolution (FLGC), for both semi-supervised learning (SSL) and unsupervised learning (USL). Our goal is to further simplify GCNs and generalize them to existing linear models, ultimately enabling traditional models to explore graphs directly. Specifically, we linearize GCN and then decouple it into neighborhood propagation and prediction stages, resulting in a flexible framework. Such simplification makes it possible to efficiently calculate a globally optimal solution during training and makes the framework easy to combine with various existing linear models. On the basis of the resulting framework, we further present a semi-supervised FLGC for node classification problems and an unsupervised FLGC for subspace clustering problems. To prevent the over-smoothing issue, we introduce an initial residual in neighborhood propagation to capture the long-range dependency across a graph.

To sum up, the main contributions of this paper are as follows:

  1. We present a simple yet effective FLGC framework to deal with graph data and regular data. The framework consists of a parameter-free neighborhood propagation and a task-specific linear model with a closed-form solution. The framework not only simplifies the training of existing GCNs but also makes them easier to implement and apply in practice.

  2. We implement two FLGCs for semi-supervised node classification and unsupervised clustering tasks, respectively. Also, we show that the semi-supervised and unsupervised FLGCs act as a generalization of classic ridge regression and subspace clustering in the non-Euclidean domain. Such generalization enables traditional linear models to explore graph structure without losing their original simplicity and efficacy.

  3. We extend the personalized propagation scheme to balance the contribution of structure and node features, which endows FLGC with the capability of capturing long-range neighborhoods, thereby reducing the risk of over-smoothing.

  4. We empirically show that FLGCs consistently outperform a large number of previous SOTA approaches on both semi-supervised learning and clustering tasks across real-world graph data and regular grid data. Such superiority further offers a promising scheme to revisit traditional linear models in a pure non-Euclidean domain.

The rest of the paper is structured as follows. In Section II, we briefly review the related concepts on recent GCN models and subspace clustering. Section III systematically introduces the motivation, framework, formulation, and implementation of the proposed FLGCs. Extensive qualitative and quantitative evaluations and comparisons are presented in Section IV and Section V. Conclusions and future work are given in Section VI.

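II Notation and Concepts

II-A Notations

In this paper, boldface lowercase italic symbols (e.g., $\boldsymbol{x}$), boldface uppercase roman symbols (e.g., $\mathbf{X}$), regular italic symbols (e.g., $x_{ij}$), and calligraphic symbols (e.g., $\mathcal{S}$) denote vectors, matrices, scalars, and sets, respectively. A graph is represented as $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathbf{X})$, where $\mathcal{V}$ denotes the node set with $v_i \in \mathcal{V}$ and $|\mathcal{V}| = N$, $\mathcal{E}$ indicates the edge set with $(v_i, v_j) \in \mathcal{E}$, and $\mathbf{X} \in \mathbb{R}^{N \times D}$ signifies the $D$-dimensional node feature matrix of $\mathcal{G}$, while the corresponding $C$-class one-hot target matrix is $\mathbf{Y} \in \mathbb{R}^{N \times C}$. We define $\mathbf{A} \in \mathbb{R}^{N \times N}$ as the adjacency matrix of $\mathcal{G}$ and the diagonal degree matrix as $\mathbf{D}$, where $\mathbf{D}_{ii} = \sum_{j} \mathbf{A}_{ij}$. The graph Laplacian matrix is defined as $\mathbf{L} = \mathbf{D} - \mathbf{A}$, and its normalized version is given by $\mathbf{L}_{\mathrm{sym}} = \mathbf{D}^{-1/2} \mathbf{L} \mathbf{D}^{-1/2}$, which has an eigendecomposition of $\mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^{\top}$. Here, $\boldsymbol{\Lambda}$ is a diagonal matrix of the eigenvalues of $\mathbf{L}_{\mathrm{sym}}$, and $\mathbf{U}$ is a unitary matrix consisting of the eigenvectors of $\mathbf{L}_{\mathrm{sym}}$. Besides, $\mathbf{X}^{\top}$ denotes the transpose of matrix $\mathbf{X}$ and $\mathbf{I}_N$ denotes an identity matrix of size $N$. The trace operation and the Frobenius norm of matrix $\mathbf{X}$ are defined as $\mathrm{Tr}(\mathbf{X})$ and $\|\mathbf{X}\|_F$, respectively.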

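II-B Vanilla GCN

We refer to the GCN model proposed by Kipf et al. [27] as the vanilla GCN because of its great success and numerous followers. The vanilla GCN suggests that the graph convolution operation can be approximated by the $K$-th order polynomial of Laplacians, i.e.,

$$ g_{\theta} \star \boldsymbol{x} = \mathbf{U} g_{\theta}(\boldsymbol{\Lambda}) \mathbf{U}^{\top} \boldsymbol{x} \approx \sum_{k=0}^{K} \theta_k \mathbf{L}_{\mathrm{sym}}^{k} \boldsymbol{x}, \tag{1} $$

where $g_{\theta}$ is a graph filter parameterized by $\boldsymbol{\theta}$ in the Fourier domain and $\theta_k$ denotes the polynomial coefficient. The vanilla GCN [27] adopts two crucial strategies to simplify and enhance Eq. (1). First, it uses a first-order polynomial with settings of $\theta_0 = 2\theta$ and $\theta_1 = -\theta$ to approximate Eq. (1), resulting in a simplified convolution operation, i.e., $g_{\theta} \star \boldsymbol{x} = \theta (\mathbf{I}_N + \mathbf{D}^{-1/2} \mathbf{A} \mathbf{D}^{-1/2}) \boldsymbol{x}$. Second, it introduces a renormalization trick to guarantee its stability. Specifically, the trick can be expressed as

$$ \mathbf{I}_N + \mathbf{D}^{-1/2} \mathbf{A} \mathbf{D}^{-1/2} \rightarrow \tilde{\mathbf{D}}^{-1/2} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-1/2}, \quad \tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}_N, \quad \tilde{\mathbf{D}}_{ii} = \sum_{j} \tilde{\mathbf{A}}_{ij}. \tag{2} $$

We call $\mathbf{P} = \tilde{\mathbf{D}}^{-1/2} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-1/2}$ the propagation matrix. As a result, in analogy to convolutional neural networks (CNNs) [18, 4], a general and layer-wise graph convolution propagation rule can be defined by

$$ \mathbf{H}^{(l+1)} = \sigma\left( \mathbf{P} \mathbf{H}^{(l)} \mathbf{W}^{(l)} \right). \tag{3} $$

Here, $\mathbf{H}^{(l)}$ is the $l$-th layer's graph embedding ($\mathbf{H}^{(0)} = \mathbf{X}$), $\mathbf{W}^{(l)}$ is a trainable parameter matrix, and $\sigma(\cdot)$ is a nonlinear activation. However, many works [59, 31, 51, 28, 52, 35] have demonstrated that stacking multiple GCN layers leads to the over-smoothing effect, that is, the representations of all vertices converge to the same value. Thus, the vanilla GCN usually adopts a shallow architecture, e.g., two GCN layers.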
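As a concrete illustration of the renormalization trick, the following minimal NumPy sketch builds the propagation matrix from a dense adjacency matrix by adding self-loops and applying symmetric degree normalization. The function name `gcn_norm` and the toy path graph are assumptions made for this example and are not taken from the released code.

```python
import numpy as np

def gcn_norm(A):
    """Return D~^{-1/2} (A + I) D~^{-1/2} for a dense adjacency matrix A."""
    A_tilde = A + np.eye(A.shape[0])       # add self-loops
    deg = A_tilde.sum(axis=1)              # degrees of A + I (always >= 1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # scale row i by d_i^{-1/2} and column j by d_j^{-1/2}
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

# Toy undirected path graph 0 - 1 - 2 (hypothetical example data)
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
P = gcn_norm(A)
```

The resulting matrix is symmetric and its largest eigenvalue is 1, which is what keeps repeated multiplication by it numerically stable.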

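II-C SGC

SGC [51] removes nonlinear activations in the vanilla GCN and collapses all trainable weights into a single matrix. This enables it to express the repeated multiplication by the normalized adjacency matrix as a $K$-th power of a matrix, i.e.,

$$ \hat{\mathbf{Y}} = \mathrm{softmax}\left( \mathbf{P} \cdots \mathbf{P} \mathbf{X} \mathbf{W}^{(0)} \cdots \mathbf{W}^{(K-1)} \right) = \mathrm{softmax}\left( \mathbf{P}^{K} \mathbf{X} \mathbf{W} \right). \tag{4} $$

Furthermore, SGC can be regarded as a fixed feature extraction/smoothing component $\mathbf{H} = \mathbf{P}^{K} \mathbf{X}$ followed by a linear logistic regression classifier $\hat{\mathbf{Y}} = \mathrm{softmax}(\mathbf{H} \mathbf{W})$. In [51], Wu et al. suggested that SGC acts as a low-pass filter and that such a simplified model performs comparably to many SOTA models.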

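II-D APPNP

Personalized propagation of neural predictions (PPNP) and its fast approximation, APPNP [28], consider the relationship between GCN and PageRank to derive an improved propagation scheme based on personalized PageRank. Let $f_{\theta}(\mathbf{X})$ be a multilayer perceptron (MLP) parameterized by $\theta$. Then PPNP is defined as

$$ \hat{\mathbf{Y}} = \mathrm{softmax}\left( \alpha \left( \mathbf{I}_N - (1 - \alpha) \mathbf{P} \right)^{-1} f_{\theta}(\mathbf{X}) \right), \tag{5} $$

where $\alpha \in (0, 1]$ is the teleport (or restart) probability of the topic-sensitive PageRank.

Similar to SGC, PPNP separates the neural network used for generating predictions from the propagation scheme. APPNP further approximates topic-sensitive PageRank via power iteration, i.e.,

$$ \mathbf{H}^{(0)} = f_{\theta}(\mathbf{X}), \quad \mathbf{H}^{(l+1)} = (1 - \alpha) \mathbf{P} \mathbf{H}^{(l)} + \alpha \mathbf{H}^{(0)}, \quad \hat{\mathbf{Y}} = \mathrm{softmax}\left( (1 - \alpha) \mathbf{P} \mathbf{H}^{(K-1)} + \alpha \mathbf{H}^{(0)} \right). \tag{6} $$

One of the great advantages of PPNP and APPNP is that they decouple the feature transformation and propagation procedures of the vanilla GCN without increasing the number of trainable parameters.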
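The relationship between the two schemes can be checked numerically. The sketch below, assuming a small hand-made graph and a random matrix standing in for the MLP output, compares the exact personalized PageRank propagation (a linear solve) with the power iteration used by APPNP. The softmax is omitted because it is applied identically in both cases; all function names are illustrative.

```python
import numpy as np

def gcn_norm(A):
    """Symmetrically normalized adjacency matrix with self-loops."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

def ppnp_exact(P, H0, alpha):
    """Exact propagation: alpha * (I - (1 - alpha) P)^{-1} H0."""
    n = P.shape[0]
    return alpha * np.linalg.solve(np.eye(n) - (1.0 - alpha) * P, H0)

def appnp_power_iteration(P, H0, alpha, K):
    """K steps of H <- (1 - alpha) P H + alpha H0, starting from H0."""
    H = H0.copy()
    for _ in range(K):
        H = (1.0 - alpha) * (P @ H) + alpha * H0
    return H

rng = np.random.default_rng(0)
A = np.array([[0.0, 1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
P = gcn_norm(A)
H0 = rng.normal(size=(4, 2))     # stand-in for the MLP predictions
exact = ppnp_exact(P, H0, alpha=0.1)
approx = appnp_power_iteration(P, H0, alpha=0.1, K=200)
```

Because the spectral radius of the normalized adjacency matrix is at most 1, the iteration error shrinks at least as fast as $(1-\alpha)^K$, so a few hundred steps reproduce the exact solution to high precision on this toy graph.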

II-E Classic Linear Models

We broadly divide classic linear models into supervised methods and unsupervised methods. Similar to SGC, a typical supervised linear model can be treated as a fully linearized MLP, given by

$$ \hat{\mathbf{Y}} = \mathbf{X} \mathbf{W}. \tag{7} $$

Here, $\mathbf{X}$ and $\mathbf{Y}$ denote the training samples and the corresponding target matrix, respectively, and $\mathbf{W}$ is fitted so that $\hat{\mathbf{Y}}$ approximates $\mathbf{Y}$. Such a model is also known as a ridge regression classifier [2, 20]. Besides, logistic regression and softmax regression are its two most frequently used variants in deep neural networks [17, 57, 29, 18, 7].

The unsupervised form of a linear model often follows a common assumption, i.e., data points lie in a union of linear subspaces. The subspace representation coefficients can be obtained by solving the following linear self-expressive model, i.e.,

$$ \mathbf{X}^{\top} = \mathbf{X}^{\top} \mathbf{Z}. \tag{8} $$

Notably, the main difference between Eq. (7) and Eq. (8) is that the former considers combinations of features, while the latter considers combinations of samples. In order to achieve an effective solution, various norm constraints are often imposed on $\mathbf{W}$ or $\mathbf{Z}$. Sparse Subspace Clustering (SSC) [12] utilizes an $\ell_1$ norm $\|\mathbf{Z}\|_1$, while Low Rank Subspace Clustering (LRSC) [48] adopts a nuclear norm $\|\mathbf{Z}\|_{*}$, to name just two. Despite their success, the objective functions derived from these constraints are not smooth, leading to inefficient solutions. In contrast, the Frobenius norm results in a closed-form solution for linear models.
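To make the closing remark about the Frobenius norm concrete, here is a minimal sketch of the standard ridge regression solution, $\mathbf{W} = (\mathbf{X}^{\top}\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^{\top}\mathbf{Y}$, which minimizes $\|\mathbf{X}\mathbf{W} - \mathbf{Y}\|_F^2 + \lambda\|\mathbf{W}\|_F^2$. The data are synthetic and the function name `ridge_fit` is an assumption for illustration.

```python
import numpy as np

def ridge_fit(X, Y, lam):
    """Closed-form ridge regression: solve (X^T X + lam I) W = X^T Y."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ Y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))          # 50 samples, 5 features
W_true = rng.normal(size=(5, 3))      # hypothetical ground-truth weights
Y = X @ W_true                        # noiseless targets

W_small = ridge_fit(X, Y, lam=1e-8)   # nearly unregularized: recovers W_true
W_reg = ridge_fit(X, Y, lam=0.5)      # regularized solution

# Gradient of the objective (up to a factor of 2) at the regularized solution
grad = X.T @ (X @ W_reg - Y) + 0.5 * W_reg
```

A single linear solve replaces the whole gradient-descent loop, which is the property the smooth Frobenius penalty buys over the $\ell_1$ and nuclear norms.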

(a) Graph-form illustration of our FLGC framework. FLGC first performs parameter-free propagation, which aggregates multi-hop neighboring information with a power iteration of the propagation matrix $\mathbf{P}$, and then predicts node labels by calculating a closed-form solution (affinity matrix).

(b) Matrix-form data flow of FLGC. It can be seen that FLGC conducts a personalized multi-hop propagation operation with a fixed propagation matrix $\mathbf{P}$, followed by a linear model parameterized by a weight matrix $\mathbf{W}$ for semi-supervised classification or a coefficient matrix $\mathbf{Z}$ for unsupervised classification.

Fig. 1: Schematic representations of our proposed FLGC.

III Fully Linear Graph Convolutional Networks

We present the general FLGC framework for semi-supervised classification and unsupervised classification problems, as illustrated in Fig. 1 (a). The core idea behind FLGC is to generalize GCNs to traditional linear models so as to 1) train a linear GCN model with a globally optimal solution, 2) enable traditional linear models to work on graph-structured data, and 3) further simplify existing GCN models.

III-A Full Linearization of GCN

Inspired by the SGC [51] model, we further remove all nonlinear operations of a $K$-layer GCN, including the logistic regression classifier. (Footnote 1: Both the logistic (sigmoid) and softmax functions are often used as nonlinear activations in neural networks. Thus, we consider SGC to be not fully linear, which is why it does not admit a closed-form solution.) This yields the following linear GCN

$$ \hat{\mathbf{Y}} = \mathbf{P}^{K} \mathbf{X} \mathbf{W}. \tag{9} $$

This linearization brings three major benefits. First, it is easy to decouple the fully linear GCN into two stages: a parameter-free feature propagation stage (i.e., $\mathbf{H} = \mathbf{P}^{K} \mathbf{X}$, where $\mathbf{H}$ denotes the graph embedding) and a target-dependent prediction stage (i.e., $\hat{\mathbf{Y}} = \mathbf{H} \mathbf{W}$). The former aggregates $K$-hop neighborhoods based on a predefined propagation matrix $\mathbf{P}$, while the latter acts as a ridge regression classifier parameterized by $\mathbf{W}$. Second, it establishes a relationship between traditional linear models and GCN models. This relationship enables us to reconsider traditional linear models from the graph representation learning point of view. It also endows the classic ridge regression classifier with the ability to handle graphs directly. Third, the linearization makes it possible to efficiently compute the globally optimal solution of GCN without using gradient descent. We refer to such a GCN as Fully Linear Graph Convolution (FLGC). A matrix-form data flow of FLGC is depicted in Fig. 1 (b). More details are introduced as follows.

III-B Multi-hop Propagation

In light of the aforementioned linearization, we can define various propagation strategies and incorporate them into Eq. (9). Here we introduce a stable propagation scheme for our FLGC. According to spectral graph theory, $\mathbf{P}^{K}$ converges to a stationary state as the number of propagation steps increases [28, 35]. As a result, the node representations on the same connected component of a graph become indistinguishable, i.e., the over-smoothing problem [28, 35]. Moreover, $\mathbf{P}^{K}$ serves as a structure aggregation term and ignores the importance of the initial node features. The initial node features often carry unique discriminant information, especially for data without a directly available graph structure.

To balance the structure and the initial features, we adopt an extended version of APPNP's propagation scheme to propagate multi-hop neighboring information. The propagation procedure with power iteration is recursively calculated by

$$ \mathbf{H}^{(l+1)} = (1 - \alpha) \mathbf{P} \mathbf{H}^{(l)} + \alpha \mathbf{X}, \quad \mathbf{H}^{(0)} = \mathbf{X}. \tag{10} $$

Instead of using a neural network to generate a prediction as is done in APPNP [28], we directly use the initial $\mathbf{X}$ as the topics to be ranked in the topic-sensitive PageRank [40]. Let the resulting final graph embedding be $\mathbf{H} = \mathbf{H}^{(K)}$. This ensures that $\mathbf{H}$ always receives contributions from both the structure and the initial features with a fixed proportion $\alpha$. It is trivial to prove that SGC's propagation (i.e., Eq. (9)) is a special case of Eq. (10) with $\alpha = 0$. Furthermore, when $\alpha = 1$, the model degrades into an ordinary neural network, in which no structural information is used. From the residual network's point of view [27, 35, 21], our propagation scheme is a special form of residual connection, where each forward step is connected to the initial inputs and weighted by $\alpha$ [as shown in Fig. 1 (b)]. Thus, our propagation mechanism is also called initial residual [35].
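The two special cases of Eq. (10) mentioned above can be verified with a short NumPy sketch. It assumes the update takes the form used in Algorithm 1, i.e. each step mixes the propagated embedding with the initial features using the teleport probability `alpha`; the helper names and the toy graph are illustrative only.

```python
import numpy as np

def gcn_norm(A):
    """Symmetrically normalized adjacency matrix with self-loops."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

def propagate(P, X, alpha, K):
    """Initial-residual propagation: H <- (1 - alpha) P H + alpha X, K times."""
    H = X.copy()
    for _ in range(K):
        H = (1.0 - alpha) * (P @ H) + alpha * X
    return H

rng = np.random.default_rng(0)
A = np.array([[0.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
P = gcn_norm(A)
X = rng.normal(size=(4, 3))

H_sgc = propagate(P, X, alpha=0.0, K=3)    # reduces to P^3 X (SGC propagation)
H_feat = propagate(P, X, alpha=1.0, K=3)   # reduces to X (no structure used)
H_mix = propagate(P, X, alpha=0.2, K=3)    # blend of structure and features
```

Since the loop contains no trainable parameters, the embedding can be computed once and reused for any downstream linear model.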

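III-C FLGC for Semi-Supervised Learning

Having introduced the FLGC framework, we are ready to calculate the closed-form solution for a specific downstream task. We first use FLGC for the semi-supervised node classification problem. Let $\mathbf{Y}_{aug} \in \mathbb{R}^{N \times C}$ be an augmented target matrix, where labeled nodes are represented as one-hot vectors while unlabeled nodes are marked as zero vectors. Further let $\mathbf{M} \in \mathbb{R}^{N \times N}$ be a diagonal mask matrix associated with the labeled and unlabeled nodes, with $\mathbf{M}_{ii} = 1$ if node $v_i$ is labeled and $\mathbf{M}_{ii} = 0$ otherwise. Thus, the semi-supervised FLGC can be denoted by

$$ \mathbf{Y}_{aug} = \mathbf{M} \mathbf{H} \mathbf{W}. \tag{11} $$

To effectively solve this problem, we rewrite the objective function as a Frobenius norm minimization problem, i.e.,

$$ \min_{\mathbf{W}} \ \frac{1}{2} \left\| \mathbf{M} \mathbf{H} \mathbf{W} - \mathbf{Y}_{aug} \right\|_F^2. \tag{12} $$

The problem can be further expressed in the following form by using the Lagrangian multiplier

$$ \mathcal{L}(\mathbf{W}) = \frac{1}{2} \left\| \mathbf{M} \mathbf{H} \mathbf{W} - \mathbf{Y}_{aug} \right\|_F^2 + \frac{\lambda}{2} \left\| \mathbf{W} \right\|_F^2, \tag{13} $$

where $\lambda$ denotes a regularization coefficient. Since $\mathbf{M}$ is diagonal with binary entries, $\mathbf{M}^{\top} \mathbf{M} = \mathbf{M}$, and the partial derivative of $\mathcal{L}$ with respect to $\mathbf{W}$ is

$$ \frac{\partial \mathcal{L}}{\partial \mathbf{W}} = \mathbf{H}^{\top} \mathbf{M} \mathbf{H} \mathbf{W} - \mathbf{H}^{\top} \mathbf{M} \mathbf{Y}_{aug} + \lambda \mathbf{W}. \tag{14} $$

The globally optimal solution is obtained when $\partial \mathcal{L} / \partial \mathbf{W} = \mathbf{0}$, i.e.,

$$ \left( \mathbf{H}^{\top} \mathbf{M} \mathbf{H} + \lambda \mathbf{I}_D \right) \mathbf{W} = \mathbf{H}^{\top} \mathbf{M} \mathbf{Y}_{aug}. \tag{15} $$

Consequently, we denote the solution in closed form as

$$ \mathbf{W}^{*} = \left( \mathbf{H}^{\top} \mathbf{M} \mathbf{H} + \lambda \mathbf{I}_D \right)^{-1} \mathbf{H}^{\top} \mathbf{M} \mathbf{Y}_{aug}. \tag{16} $$

Finally, we infer the labels of the test nodes in a transductive manner, i.e., from $\hat{\mathbf{Y}} = \mathbf{H} \mathbf{W}^{*}$.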
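The closed-form training step can be sketched as a masked ridge regression. This is a minimal illustration under an assumption: the solution is taken to be $\mathbf{W} = (\mathbf{H}^{\top}\mathbf{M}\mathbf{H} + \lambda\mathbf{I})^{-1}\mathbf{H}^{\top}\mathbf{M}\mathbf{Y}_{aug}$, where $\mathbf{M}$ is the diagonal 0/1 mask over labeled nodes. The function name `fit_ssl`, the random embedding, and the toy labels are hypothetical and are not drawn from the released code.

```python
import numpy as np

def fit_ssl(H, Y_aug, labeled, lam):
    """Masked ridge regression over labeled rows only.

    H: N x D embedding, Y_aug: N x C targets (zero rows for unlabeled nodes),
    labeled: boolean vector of length N, lam: regularization coefficient.
    """
    M = np.diag(labeled.astype(float))     # dense diagonal mask, for clarity
    D = H.shape[1]
    return np.linalg.solve(H.T @ M @ H + lam * np.eye(D), H.T @ M @ Y_aug)

rng = np.random.default_rng(0)
H = rng.normal(size=(8, 3))                # stand-in for a propagated embedding
labels = np.array([0, 1, 0, 1, 0, 1, 0, 1])
labeled = np.array([True, True, True, True, False, False, False, False])

Y_aug = np.zeros((8, 2))
Y_aug[labeled, labels[labeled]] = 1.0      # one-hot rows for labeled nodes only

W = fit_ssl(H, Y_aug, labeled, lam=0.1)    # D x C weight matrix
pred = (H @ W).argmax(axis=1)              # transductive prediction for all nodes
```

In practice the mask would be applied by row selection rather than by forming a dense $N \times N$ matrix; the dense form is kept here only to mirror the notation.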

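III-D FLGC for Clustering

Our unsupervised FLGC follows classic subspace clustering by assuming that the $K$-hop graph embeddings lie in a union of linear subspaces. More precisely, every node belonging to a certain subspace can be explicitly represented using other nodes in this subspace, while subspaces do not interact with each other. We refer to this property of data as self-expressiveness. However, using the initial inputs to model such a property may lead to an unreliable estimate of the subspace coefficients due to outliers and noisy points. Instead, we model our unsupervised FLGC based on the graph embedding of the inputs. The motivation behind our method is that the intra-class variation of the initial inputs can be smoothed by graph convolution. Formally, we express our unsupervised FLGC as follows

$$ \min_{\mathbf{Z}} \ \|\mathbf{Z}\|_q \quad \mathrm{s.t.} \quad \mathbf{X}^{\top} = \mathbf{H}^{\top} \mathbf{Z}, \ \mathrm{diag}(\mathbf{Z}) = \mathbf{0}. \tag{17} $$

Here, $\mathbf{Z} \in \mathbb{R}^{N \times N}$ is an affinity matrix, in which the $i$-th column denotes the representation coefficient vector of node $v_i$, and $\|\cdot\|_q$ indicates the $q$-norm of a matrix. By analogy with classic subspace clustering models, $q = 1$ leads to a standard sparse subspace representation, while replacing $\|\cdot\|_q$ with the nuclear norm yields a low-rank subspace representation.

In this paper, we aim to calculate a dense subspace representation by adopting the Frobenius norm, which also maintains the consistency of our FLGC framework. It has been proven [41] that the constraint $\mathrm{diag}(\mathbf{Z}) = \mathbf{0}$ can be relaxed and discarded when using a Frobenius norm. Hence, our unsupervised FLGC can be rewritten as

$$ \min_{\mathbf{Z}} \ \mathcal{L}(\mathbf{Z}) = \frac{1}{2} \left\| \mathbf{H}^{\top} \mathbf{Z} - \mathbf{X}^{\top} \right\|_F^2 + \frac{\lambda}{2} \left\| \mathbf{Z} \right\|_F^2. \tag{18} $$

We further compute $\partial \mathcal{L} / \partial \mathbf{Z}$ as

$$ \frac{\partial \mathcal{L}}{\partial \mathbf{Z}} = \mathbf{H} \mathbf{H}^{\top} \mathbf{Z} - \mathbf{H} \mathbf{X}^{\top} + \lambda \mathbf{Z}. \tag{19} $$

Similar to the semi-supervised FLGC, we can give the globally optimal closed-form solution of Eq. (18), i.e.,

$$ \mathbf{Z}^{*} = \left( \mathbf{H} \mathbf{H}^{\top} + \lambda \mathbf{I}_N \right)^{-1} \mathbf{H} \mathbf{X}^{\top}. \tag{20} $$

Following [5], we perform spectral clustering on the affinity matrix derived from $\mathbf{Z}^{*}$ to segment the subspaces after applying a block-structure heuristic.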
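The self-expressive step can be illustrated on synthetic data. The sketch below assumes the Frobenius-regularized solution has the form $\mathbf{Z} = (\mathbf{H}\mathbf{H}^{\top} + \lambda\mathbf{I})^{-1}\mathbf{H}\mathbf{X}^{\top}$, with samples stored as rows, and builds a symmetric affinity from it. The function name `fit_usl`, the two orthogonal toy subspaces, and the symmetrization are assumptions for this example; the block-structure heuristic and the spectral clustering step are not shown.

```python
import numpy as np

def fit_usl(H, X, lam):
    """Closed-form self-expression: solve (H H^T + lam I) Z = H X^T."""
    N = H.shape[0]
    return np.linalg.solve(H @ H.T + lam * np.eye(N), H @ X.T)

rng = np.random.default_rng(0)
# Two groups of three points lying in orthogonal 2-D subspaces of R^4
X = np.zeros((6, 4))
X[:3, :2] = rng.normal(size=(3, 2))
X[3:, 2:] = rng.normal(size=(3, 2))

H = X                                   # no graph here, so the embedding is X itself
Z = fit_usl(H, X, lam=0.1)              # N x N coefficient matrix
C = 0.5 * (np.abs(Z) + np.abs(Z).T)     # symmetric, non-negative affinity
```

On this toy input the coefficients connecting points from different subspaces vanish, so the affinity is block diagonal, which is the structure spectral clustering relies on to separate the groups.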

III-E Remarks on FLGC

In Algorithm 1, we provide the pseudocode for our semi-supervised and unsupervised FLGC. Note that both methods share a unified learning procedure and are easy to implement and train. In effect, our proposed FLGC models can be treated as natural generalizations of classic linear models in the non-Euclidean domain.

Algorithm 1: Pseudocode of FLGC in a PyTorch-like style.

# lambda: regularization coefficient (used inside fit)
# alpha: teleport (or restart) probability
# fit: calculate closed-form solutions
# sc: spectral clustering
# mm: matrix-matrix multiplication
# gcn_norm: normalize adjacency matrix
# compute augmented normalized adjacency
P = gcn_norm(A) # NxN
# compute K-hop graph embedding
H = X # NxD
for i in range(K):
     # propagate neighborhood using Eq.(10)
     H = (1 - alpha) * mm(P, H) + alpha * X # NxD
if task == 'ssl':
     # compute W using Eq.(16)
     W = fit(H, Y_aug, M) # DxC
     # predict unlabeled nodes
     y = mm(H, W) # NxC
elif task == 'usl':
     # compute Z using Eq.(20)
     Z = fit(H, X) # NxN
     # assign node labels through spectral clustering
     y = sc(Z)

Let $f_{\mathrm{FLGC}}(\mathcal{G})$ be the generalized linear learning model defined on a graph $\mathcal{G}$. FLGC can be broadly denoted as a classic model, $f_{\mathrm{linear}}(\cdot\,; \boldsymbol{\Theta})$, whose input is multiplied by a $K$-hop propagation matrix, $\mathbf{P}^{K}$, i.e.,

$$ f_{\mathrm{FLGC}}(\mathcal{G}) = f_{\mathrm{linear}}\left( \mathbf{P}^{K} \mathbf{X}; \boldsymbol{\Theta} \right). \tag{21} $$

The only trainable parameter, $\boldsymbol{\Theta}$, is derived from the classic model and can be solved in a manner similar to existing models. It is easy to prove that FLGC is a generalization of classic linear models in the non-Euclidean domain. When $\mathbf{P} = \mathbf{I}_N$, signifying that $\mathcal{G}$ does not have any edges except for self-loops, FLGC is equivalent to classic linear models. Thanks to the decoupled design, $\mathbf{P}^{K}$ is target-independent and can be precomputed. Thus, FLGC does not increase the computational burden compared to the classic linear models. Here, we show the connections of FLGC to existing models.

  • FLGC vs. Manifold Regularization. Consider the objective function of a manifold regularized model, which consists of an empirical error term, a structural risk term, and a manifold prior term. As suggested in [60], this objective and the proposed FLGC are derived from the same optimization framework. Nonetheless, there is a considerable difference between them: our FLGC works directly in the non-Euclidean domain, while manifold regularization is proposed for Euclidean data. During learning, manifold regularized models use the graph structure as prior knowledge, generally defined through the graph Laplacian, which is incorporated as a single term balanced by an additional regularization coefficient. In particular, manifold regularization cannot model long-range relationships. In contrast, our FLGC propagates multi-hop structural information in a more general and flexible manner.

  • FLGC vs. AutoEncoder. Our unsupervised FLGC is highly related to linear autoencoders [5]. Consider a linear graph autoencoder composed of an encoder and a decoder. Collapsing the weight matrices of a $K$-layer encoder into a single matrix yields an autoencoder with a self-expressive layer. If the collapsed encoder weight is replaced by a fixed unitary matrix, the encoder becomes the propagation stage of the unsupervised FLGC, while our $\mathbf{Z}$ acts as a single-layer self-expression. Furthermore, our FLGC aims to reconstruct node features rather than the structure, as is done in the graph autoencoder [26].
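Returning to the claim that FLGC reduces to a classic linear model when the graph has no edges other than self-loops, the following sketch checks it numerically. It assumes the initial-residual propagation of Algorithm 1 and a plain ridge regression as the prediction stage; with an identity propagation matrix the embedding equals the input features for any teleport probability and any number of steps. All names and data are illustrative.

```python
import numpy as np

def propagate(P, X, alpha, K):
    """Initial-residual propagation: H <- (1 - alpha) P H + alpha X, K times."""
    H = X.copy()
    for _ in range(K):
        H = (1.0 - alpha) * (P @ H) + alpha * X
    return H

def ridge_fit(H, Y, lam):
    """Closed-form ridge regression on the (propagated) features."""
    return np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ Y)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))
Y = rng.normal(size=(10, 2))

P_identity = np.eye(10)                    # graph with self-loops only
H = propagate(P_identity, X, alpha=0.3, K=5)

W_flgc = ridge_fit(H, Y, lam=0.5)          # linear model on propagated features
W_ridge = ridge_fit(X, Y, lam=0.5)         # classic ridge regression on raw features
```

The two weight matrices coincide up to floating-point error, and because the propagation does not depend on the targets, the embedding can be cached and reused across different values of the regularization coefficient.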

IV Experiments on SSL

In this section, we compare the proposed semi-supervised FLGC model against many SOTAs and classic methods on 3 real-world graph benchmark datasets and 24 regular UCI datasets. Furthermore, numerous ablation experiments are conducted to systematically and comprehensively analyze the effectiveness and robustness of the proposed FLGC.

Dataset    #Nodes   #Edges   #Classes   #Features   Train/Dev/Test
Cora       2,708    5,429    7          1,433       140/500/1,000
Citeseer   3,327    4,732    6          3,703       120/500/1,000
Pubmed     19,717   44,338   3          500         60/500/1,000
TABLE I: Summary of citation network datasets.

Method       Cora         Citeseer     PubMed
GCN          81.4 ± 0.4   70.9 ± 0.5   79.0 ± 0.4
GAT          83.3 ± 0.7   72.6 ± 0.6   78.5 ± 0.3
FastGCN      79.8 ± 0.3   68.8 ± 0.6   77.4 ± 0.3
GIN          77.6 ± 1.1   66.1 ± 0.9   77.0 ± 1.2
LNet         80.2 ± 3.0   67.3 ± 0.5   78.3 ± 0.6
AdaLNet      81.9 ± 1.9   70.6 ± 0.8   77.8 ± 0.7
DGI          82.5 ± 0.7   71.6 ± 0.7   78.4 ± 0.7
SGC          81.0 ± 0.0   71.9 ± 0.1   78.9 ± 0.0
MixHop       81.9 ± 0.4   71.4 ± 0.8   80.8 ± 0.6
DropEdge     82.8         72.3         79.6
GNN          82.5 ± 0.2   74.4 ± 0.3   77.9 ± 0.4

GCN          81.1 ± 0.2   69.8 ± 0.2   79.4 ± 0.1
GCN-Linear   80.8 ± 0.0   68.7 ± 0.1   79.4 ± 0.1
SGC          81.7 ± 0.0   71.1 ± 0.0   76.6 ± 0.0
APPNP        82.5 ± 0.2   70.4 ± 0.1   79.4 ± 0.3

FLGC         82.9 ± 0.0   72.7 ± 0.0   79.2 ± 0.0
FLGC*        84.0 ± 0.0   73.2 ± 0.0   81.1 ± 0.0
TABLE II: Summary of baseline comparison under public splits for Cora, Citeseer, and Pubmed. FLGC* corresponds to FLGC with our propagation scheme of Eq. (10). The results at the top are collected from the literature, while those in the middle are reproduced under our settings.

             Random Split                           Full Split
Method       Cora         Citeseer     PubMed       Cora   Citeseer   PubMed
GCN          79.1 ± 1.8   67.9 ± 1.2   76.9 ± 2.9   86.4   75.4       85.9
GCN-Linear   79.8 ± 2.1   68.4 ± 2.1   76.7 ± 1.4   87.0   75.7       83.9
SGC          81.3 ± 1.7   68.5 ± 2.2   76.4 ± 3.3   86.2   77.6       83.7
APPNP        81.0 ± 0.0   68.5 ± 0.0   75.1 ± 0.0   88.4   78.6       82.3
FLGC         81.5 ± 0.6   71.0 ± 0.9   77.6 ± 0.3   87.0   78.1       87.9
FLGC*        82.0 ± 0.1   72.1 ± 0.0   77.7 ± 0.0   88.5   79.2       88.3
TABLE III: Summary of baseline comparison under random splits and full splits for Cora, Citeseer, and Pubmed.

Fig. 2: Classification accuracy vs. training time with varying propagation step $K$ on (a) Cora, (b) Citeseer, and (c) Pubmed datasets. We keep all hyperparameters fixed except for $K$ for each model so that the running time is affected only by the propagation step. In the figures, each curve contains 15 points (i.e., 15 propagation steps), and each point corresponds to the running time (x-axis) and test accuracy (y-axis) yielded by a specific $K$-hop model, where $K$ increases with running time. The insets show a zoomed-in view of SGC.

Fig. 3: Effect of $\lambda$ and $\alpha$ on (a) Cora, (b) Citeseer, and (c) Pubmed datasets.

IV-A Results on Real-World Benchmarks

IV-A1 Dataset Description

We evaluate our proposed FLGCs on three standard citation network datasets available from the PyTorch Geometric library, including the Cora, Citeseer, and PubMed [44]. The summary of these datasets is reported in Table I. In these datasets, nodes correspond to documents, and edges correspond to citations; each node feature corresponds to the bag-of-words representation of the document and belongs to one of the academic topics [54]. Given a portion of nodes and their labeled categories, e.g., history and science, the task is to predict the category for other unlabeled nodes over the same graph.

IV-A2 Baselines and Setup

For citation network datasets, the proposed FLGCs are compared against numerous SOTA graph neural network models, including the vanilla GCN [27], GAT [46], FastGCN [8], GIN [53], LNet, AdaLNet [32], DGI [47], SGC [51], MixHop [1], DropEdge [43], and GNN [34]. For these models, we give the results reported in the corresponding literature. Moreover, we reproduce the vanilla GCN with and without non-linear activation (GCN and GCN-Linear, respectively), SGC, and APPNP [28]. In our reproduction, we follow the settings suggested in the corresponding papers. Specifically, we implement GCN and GCN-Linear using two graph convolution layers and apply $\ell_2$ regularization on the trainable parameters. For APPNP, we adopt a two-layer MLP. For a fair comparison, we discard other training tricks involved in backpropagation except for weight decay.

We implement two variants of our FLGC model with the PyTorch library (relying on PyTorch Geometric 1.6.3): FLGC* denotes our method with our propagation mechanism, and FLGC denotes our model with the SGC propagation. The hyper-parameters in our models are determined by a grid search over the propagation step, the teleport probability, and the regularization coefficient. We train and test all baselines with the same data splits and random seeds on an NVIDIA GTX 1080 Ti GPU, and report the average accuracy over multiple runs. In our experiment, we provide three types of data splits, i.e., public splits as described in [44], random splits where training/validation/test sets are generated randomly with the same proportion as the public splits, and full splits where all remaining nodes are considered as the training set.

IV-A3 Comparison with SOTAs

Table II reports the accuracies of node classification with public splits. The results shown in the top part of Table II are collected from [27, 46, 8, 53, 32, 47, 51, 1, 43, 34], while the results shown in the middle part of Table II are reproduced in our experiment. It can be seen that FLGC* achieves the best accuracy on Cora and Pubmed and the second best on Citeseer. Through a point-by-point comparison with the reproduced baselines, FLGC improves upon SGC by a margin of 1.2, 1.6, and 2.6 points (absolute differences) on Cora, Citeseer, and Pubmed, respectively, while the margins by which FLGC* improves upon APPNP are 1.5, 2.8, and 1.7 points, respectively. Through a vertical comparison, FLGC* achieves 1.1, 0.5, and 1.9 points of improvement over FLGC, respectively.

In Table III, we further report the comparison results using the random splits and full splits. We can observe that the proposed FLGCs collectively outperform the competitors in terms of average classification accuracy. It should be noted that our FLGCs tend to obtain a more stable result than other baselines because of their ability to offer closed-form solutions. In a nutshell, the above experiments demonstrate that our FLGC framework is capable of achieving the SOTA performance.

IV-A4 Running Time and Over-smoothing Analysis

Fig. 2 depicts the interaction between training time and classification accuracy. To ensure a fair comparison, all the training times are collected on the same hardware. In particular, the precomputation time of the graph embedding is taken into account for SGC and FLGCs. We use fixed hyperparameters to train each model and let $K$ be the only variable, increasing over 15 propagation steps. Thus, the running time of each model increases with $K$, which in turn affects the test accuracy. Remarkably, GCN and linear GCN suffer from unstable performance at larger $K$ because of over-smoothing. On the contrary, both FLGC and FLGC* are robust to the number of propagation steps. In particular, there is no significant over-smoothing effect in FLGC* across the three figures, signifying the effectiveness of our propagation scheme. It also suggests that the residual connection helps to address the over-smoothing problem.

For a given graph, the training time of a graph model is positively associated with the number of propagation steps (or layers). Nonetheless, our FLGCs show a notable advantage over other methods (e.g., GCN and APPNP). In particular, there is almost no increase in computation time when using our multi-hop propagation scheme. Instead, such a strategy improves FLGC significantly, particularly with large $K$. Despite its computational efficiency, SGC suffers from over-smoothing, and its training relies on the optimizer and its parameter settings. In summary, our FLGC model achieves a good balance between classification accuracy and training time.

Dataset #classes #instances #features #train #test Dataset #classes #instances #features #train #test
air 3 359 64 37 322 appendicitis 2 106 7 12 94
ecoli 8 336 7 38 298 heart 2 270 13 27 243
iris 3 150 4 15 135 cleve 2 296 13 30 266
fertility 2 100 9 11 89 segmentation 7 210 18 21 189
wine 3 178 13 19 159 x8d5k 5 1000 8 100 900
wdbc 2 569 30 58 511 vote 2 435 16 44 391
haberman 3 306 3 32 274 wbc 2 683 9 69 614
spectf 2 267 44 28 239 WBC 2 683 9 69 614
cotton 6 356 21 37 319 breast 2 277 9 29 248
seeds 3 210 7 21 189 australian 2 690 14 70 620
glass 6 214 10 23 191 diabetes 2 768 8 78 690
zoo 7 101 16 13 88 dnatest 3 1186 180 120 1066
TABLE IV: Details of 24 UCI benchmark datasets.

IV-A5 Impact of the Regularization Coefficient and the Teleport Probability

We study the sensitivity of FLGC to the regularization coefficient and the teleport probability. As depicted in Fig. 3 (a)-(f), both have a significant effect on the accuracy. Because the neighborhood structure differs, their optimal values vary across datasets. Usually, a large regularization value tends to yield a compact model, while a small value increases the risk of over-fitting. For the Cora and Citeseer datasets, a large value is preferred by the FLGC models, while this value should be small on the Pubmed dataset. Note that FLGC* is equivalent to FLGC and to the classic ridge regression classifier when the teleport probability is 0 and 1, respectively. It can be seen that the best setting for this hyper-parameter lies between these two extremes. The teleport probability serves as a proportion showing to what extent the original features contribute to the propagation. Our further analysis revealed the following tendencies. First, comparing the two endpoints (teleport probabilities of 0 and 1) in Fig. 3, FLGC improves upon the classic ridge regression by a significant margin. This means that structure information is highly useful for the traditional linear model. Second, the original node features help improve FLGC, which makes it possible to aggregate higher-hop neighbors.
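The two endpoints discussed above can be illustrated with a minimal NumPy sketch. It assumes an APPNP-style initial-residual update, H <- (1 - alpha) * A_hat @ H + alpha * X, followed by a closed-form ridge solution; the exact update used by FLGC* may differ, and the names `alpha` (teleport probability), `lam` (regularization coefficient), and all helper functions are illustrative assumptions rather than the released implementation.

```python
import numpy as np

def normalized_adjacency(adj):
    """Symmetrically normalize A + I (the usual GCN renormalization)."""
    a_tilde = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def propagate(adj, x, steps, alpha):
    """Assumed update rule: H <- (1 - alpha) * A_hat @ H + alpha * X."""
    a_hat = normalized_adjacency(adj)
    h = x.copy()
    for _ in range(steps):
        h = (1.0 - alpha) * (a_hat @ h) + alpha * x
    return h

def ridge_fit(h_labeled, y_onehot, lam):
    """Closed-form ridge solution W = (H^T H + lam * I)^{-1} H^T Y."""
    d = h_labeled.shape[1]
    gram = h_labeled.T @ h_labeled + lam * np.eye(d)
    return np.linalg.solve(gram, h_labeled.T @ y_onehot)

# Toy path graph with 4 nodes, 3 features, 2 classes (all values hypothetical).
rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
x = rng.normal(size=(4, 3))
y_onehot = np.eye(2)[[0, 0, 1, 1]]
labeled = [0, 3]

h_ridge_end = propagate(adj, x, steps=8, alpha=1.0)    # equals x: plain ridge regression
h_no_residual = propagate(adj, x, steps=8, alpha=0.0)  # equals A_hat^8 @ x: no residual
h = propagate(adj, x, steps=8, alpha=0.1)
w = ridge_fit(h[labeled], y_onehot[labeled], lam=1.0)  # fit on labeled nodes only
pred = (h @ w).argmax(axis=1)                          # predict for every node
```

Under this assumed rule, a teleport probability of 1 leaves the features untouched, so the classifier is ordinary ridge regression, while a value of 0 removes the residual term and reduces to pure multi-hop propagation.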

IV-B Results on Regular Datasets

IV-B1 Dataset Description

To further explore the generalization ability of FLGCs, we conduct a series of experiments on widely used regular classification datasets taken from the University of California at Irvine (UCI) repository (http://archive.ics.uci.edu/ml/index.php). These datasets include a number of binary-class and multi-class classification problems. In the preprocessing, all datasets are rescaled using the min-max standardization technique. For each dataset, we randomly take a portion of the samples from each class (about 10% of each dataset, see Table IV) as the labeled set and the rest as the unlabeled set. A detailed description of these datasets is provided in Table IV.
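The preprocessing described above can be sketched as follows. This is a minimal illustration under stated assumptions: the target range [0, 1], the 10% labeling fraction, and the round-up rule per class are assumptions, and the function names are hypothetical.

```python
import numpy as np

def min_max_scale(x):
    """Rescale each feature column to [0, 1]; constant columns map to 0."""
    lo, hi = x.min(axis=0), x.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    return (x - lo) / span

def stratified_labeled_split(y, fraction, rng):
    """Pick ceil(fraction * class size) labeled indices from every class."""
    labeled = []
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        n_take = max(1, int(np.ceil(fraction * idx.size)))
        labeled.extend(rng.choice(idx, size=n_take, replace=False))
    labeled = np.sort(np.array(labeled))
    unlabeled = np.setdiff1d(np.arange(y.size), labeled)
    return labeled, unlabeled

# Synthetic data with Wine-like class sizes (59, 71, 48) and 13 features.
rng = np.random.default_rng(0)
x = rng.normal(size=(178, 13))
y = np.repeat([0, 1, 2], [59, 71, 48])
x_scaled = min_max_scale(x)
labeled, unlabeled = stratified_labeled_split(y, fraction=0.1, rng=rng)
```

With these class sizes the round-up rule selects 19 labeled and 159 unlabeled samples, which is consistent with the wine row of Table IV, although the exact rounding used in the experiments is not stated.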

Dataset SS-ELM TSVM LapRLS GCN-Linear GCN DropEdge SGC GCNII APPNP FLGC FLGC*
air 76.23±4.08 80.19±3.65 76.09±6.08 90.09±4.09 90.12±3.80 86.43±4.13 78.94±3.71 89.16±2.53 90.31±4.17 90.65±2.89 91.02±3.44
ecoli 77.41±5.31 79.43±4.34 85.86±1.99 83.41±3.90 83.55±3.87 80.71±2.85 77.13±1.47 83.28±2.61 81.32±3.96 84.09±2.10 84.73±1.92
iris 80.19±4.26 92.56±3.67 91.93±3.34 91.63±4.63 90.81±4.89 88.81±5.95 88.59±6.29 92.67±2.05 91.41±3.49 96.30±0.66 96.81±0.47
Fertility 75.28±10.03 71.74±8.72 77.25±5.80 73.03±9.91 74.38±10.31 83.48±4.69 88.76±0.00 76.18±8.92 77.87±9.36 88.76±0.00 88.76±0.00
wine 79.53±7.66 94.09±2.10 95.13±2.12 93.08±1.57 93.14±1.14 90.44±2.00 91.13±4.50 93.33±2.69 91.64±2.22 94.03±1.69 94.28±1.47
wdbc 89.17±5.90 93.64±1.83 85.92±2.41 93.56±1.75 93.60±2.05 93.68±1.60 91.78±1.68 93.52±0.98 93.62±1.86 94.58±0.90 95.60±0.67
Haberman 71.73±2.44 62.57±5.73 70.02±3.15 70.51±2.15 70.40±2.83 67.96±4.15 73.72±0.00 70.88±2.05 69.71±2.63 73.72±0.00 73.72±0.00
SPECTF 77.41±1.83 74.33±3.41 49.77±4.59 77.24±2.55 78.49±1.93 76.99±4.24 79.50±0.00 79.50±0.00 78.58±2.07 79.50±0.00 79.50±0.00
CAR 74.18±3.00 86.46±2.19 83.78±1.20 79.36±1.18 85.52±1.23 82.77±1.80 73.89±0.73 85.40±1.10 85.03±1.17 78.91±0.78 78.96±0.84
cotton 60.64±4.00 78.10±3.24 76.10±3.90 74.83±3.12 75.45±3.26 71.25±4.43 72.88±2.05 75.49±2.03 75.86±2.30 73.70±5.70 78.24±4.37
Seeds 87.22±7.08 91.93±2.58 92.83±1.76 90.63±2.40 90.42±2.83 78.31±20.88 89.89±0.69 91.48±1.60 89.47±2.93 92.06±0.63 93.17±1.12
glass 64.26±7.54 77.16±4.27 73.58±4.79 78.74±4.18 77.85±4.09 56.39±20.35 71.47±5.92 79.63±3.20 76.34±5.79 74.55±5.49 74.87±5.47
zoo 78.78±10.07 98.58±1.24 97.03±1.99 92.59±3.57 92.82±2.65 85.06±4.34 91.06±3.12 92.59±3.41 92.24±2.94 93.06±3.30 93.06±1.11
appendicitis 81.44±5.92 75.11±11.76 72.55±5.83 80.43±7.86 80.32±7.81 81.17±5.81 82.34±2.29 82.66±7.93 81.49±6.83 83.19±6.86 83.30±5.69
heart 71.71±6.32 73.27±4.16 73.85±3.18 76.09±5.46 76.13±4.12 75.51±9.46 79.14±3.39 76.67±6.58 77.28±6.13 81.11±1.31 81.65±1.38
cleve 71.07±5.87 73.98±5.49 71.60±3.15 76.28±2.81 75.79±3.63 71.50±4.59 76.39±1.48 76.88±2.98 77.03±3.79 77.37±1.93 77.44±2.35
segmentation 55.63±5.85 76.06±6.44 80.61±5.02 76.98±2.91 75.98±3.55 69.21±7.27 71.32±3.03 78.20±3.84 74.66±3.31 76.51±4.13 79.10±4.16
X8D5K 94.33±4.29 100.0±0.0 100.0±0.0 100.0±0.0 100.0±0.0 100.0±0.0 100.0±0.0 100.0±0.0 100.0±0.0 100.0±0.0 100.00±0.00
vote 80.54±5.32 90.52±1.75 84.90±3.02 89.23±2.48 89.64±2.42 87.37±2.50 87.34±1.04 90.20±2.06 89.85±2.19 91.10±1.51 91.79±1.46
WBC 92.73±3.43 92.38±2.68 93.98±1.33 95.70±1.56 95.88±1.07 96.69±0.51 95.90±0.51 95.64±0.50 95.65±0.81 96.19±0.47 96.48±0.66
breast 69.96±3.86 63.47±4.85 71.73±2.52 66.25±4.18 66.90±4.05 71.13±3.29 70.97±0.00 67.50±3.13 68.79±3.07 72.10±1.41 73.35±1.49
austra 78.54±5.48 77.65±3.83 81.49±2.37 85.44±1.64 85.23±1.98 76.97±11.37 81.87±1.37 84.35±1.06 83.98±3.10 85.16±1.54 85.66±0.68
diabetes 69.93±2.96 66.75±2.75 70.69±1.55 68.13±0.92 69.10±1.05 67.13±1.81 65.01±0.87 70.09±1.13 69.10±1.50 69.31±1.55 69.48±1.10
dnatest 48.96±2.27 82.95±1.53 83.47±1.19 85.30±1.07 85.54±1.05 81.91±1.42 74.40±1.85 87.80±1.00 82.03±5.27 84.92±0.92 85.39±1.27
Average 75.29±5.20 81.37±3.84 80.84±3.01 82.86±3.16 83.21±3.15 80.04±5.39 81.39±1.92 83.88±2.64 83.05±3.37 84.62±1.91 85.27±1.71
TABLE V: Semi-supervised classification accuracy on 24 UCI datasets. Best results are in bold.
Method SS-ELM TSVM LapRLS GCN-Linear GCN DropEdge SGC GCNII APPNP FLGC FLGC*
SS-ELM - 66.0 54.0 23.0 18.0 56.0 15.0 8.0 10.0 1.0 1.0
TSVM 234.0 - 136.0 74.0 66.0 157.0 142.0 48.0 80.0 53.0 37.0
LapRLS 246.0 140.0 - 110.0 100.0 167.0 146.0 74.0 106.0 67.0 41.0
GCN-Linear 277.0 202.0 166.0 - 99.0 215.0 191.0 42.5 135.5 41.5 22.0
GCN 282.0 210.0 176.0 177.0 - 231.0 197.0 48.0 172.5 63.0 38.0
DropEdge 244.0 119.0 109.0 61.0 45.0 - 101.0 34.0 37.0 13.0 9.0
SGC 285.0 134.0 130.0 85.0 79.0 175.0 - 65.5 71.0 5.0 5.0
GCNII 292.0 228.0 202.0 257.5 228.0 242.0 234.5 - 208.0 104.5 65.5
APPNP 290.0 196.0 170.0 140.5 127.5 239.0 205.0 68.0 - 46.0 29.0
FLGC 299.0 223.0 209.0 235.5 214.0 263.0 295.0 195.5 230.0 - 5.0
FLGC* 299.0 239.0 235.0 254.0 238.0 267.0 295.0 234.5 247.0 271.0 -
TABLE VI: Summary of ranks computed by the Wilcoxon test. "" denotes the method in the row improves the method of the column. "" indicates the method in the column improves the method of the row. The upper diagonal corresponds to a significance level of 0.1, while the lower diagonal corresponds to a significance level of 0.05.
Fig. 4: Test accuracy with different training samples per class on (a) Iris, (b) Wine, (c) WDBC datasets.
Fig. 5: Over-smoothing effect on (a) Iris, (b) Wine, (c) WDBC datasets.
Fig. 6: Effect of neighbor size on (a) Iris, (b) Wine, (c) WDBC datasets. We run FLGC by fixing the other hyper-parameters and varying the neighborhood size. Note that a neighborhood size of zero means that no edges are used, and thus only the original features are used. Each box indicates the statistics over independent runs, and the corresponding average value is denoted by a blue triangle.

IV-B2 Baselines and Setup

In this experiment, the selected baselines include GCN variants, i.e., GCN-Linear, GCN [27], DropEdge [43], SGC [51], GCNII [35], and APPNP [28], and classic semi-supervised classification models, i.e., SS-ELM [22], TSVM [25], and LapRLS [3]. Because these regular datasets come with no off-the-shelf graph structure, we construct kNN graphs [5, 6] to represent the structural information. Specifically, we adopt the Euclidean distance to measure the similarity between sample pairs and connect each sample to its top-k nearest neighbors. To avoid tuning the neighborhood size as an extra hyper-parameter, we set it empirically.
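The graph construction described above can be sketched in a few lines of NumPy. This is only an illustration: the function name `knn_graph`, the unweighted edges, and the symmetrization step are assumptions, since the text specifies only the Euclidean distance and the top-k selection.

```python
import numpy as np

def knn_graph(x, k):
    """Symmetric, unweighted kNN adjacency built from Euclidean distances."""
    n = x.shape[0]
    sq = np.sum(x ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T  # squared pairwise distances
    np.fill_diagonal(dist2, np.inf)                    # a sample is not its own neighbor
    adj = np.zeros((n, n))
    if k > 0:
        nearest = np.argsort(dist2, axis=1)[:, :k]     # indices of the k closest samples
        rows = np.repeat(np.arange(n), k)
        adj[rows, nearest.ravel()] = 1.0
    return np.maximum(adj, adj.T)                      # keep an edge if either end chose it

# Four points on a line; with k = 1 the result is a path graph.
x = np.array([[0.0], [1.0], [3.0], [10.0]])
adj = knn_graph(x, k=1)
no_edges = knn_graph(x, k=0)
```

With k = 0 the adjacency matrix is empty, which corresponds to the case where only the original features are available to the model.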

IV-B3 Comparison with SOTAs and Statistical Test

In Table V, we provide the comparative results on the 24 UCI datasets. All results are averaged over independent runs. At the bottom of the table, we summarize the arithmetic mean accuracy over the 24 datasets. Remarkably, in terms of average accuracy, our FLGC models outperform not only the classic semi-supervised models but also the recent GCN variants. Specifically, FLGC* achieves the highest accuracy (including ties) on 15 out of the 24 datasets. On average, FLGC and FLGC* obtain 84.62% and 85.27% accuracy across the 24 datasets, improving upon SGC (81.39%) and APPNP (83.05%), respectively. Furthermore, we notice that GCN variants are generally superior to classic semi-supervised models even on the regular grid datasets. This is a valuable clue for designing semi-supervised models on regular datasets.

To further rank all baselines, we carry out a non-parametric statistical test on the results reported in Table V. To this end, we follow the suggestion of Garcia et al. [15] and adopt the Wilcoxon signed-ranks test to compute the sum of ranks for each pair of methods (we use the KEEL (Knowledge Extraction based on Evolutionary Learning) tool, available from http://www.keel.es/, to conduct the test). Table VI shows the detailed statistical results. According to the exact critical value table of the Wilcoxon test, the critical values on 24 datasets differ for the significance levels of 0.05 (lower diagonal) and 0.1 (upper diagonal). We can observe that our proposed FLGC* is significantly better than all the other competitors at both levels, while the FLGC model performs on par with GCNII and GCN at certain significance levels. The results demonstrate that our proposed FLGC models can generalize to regular Euclidean data and achieve promising performance.
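The rank sums used by such a test can be computed as in the following sketch. It assumes the common convention in which tied absolute differences share an average rank and zero differences are split evenly between the two sums; KEEL may handle ties differently, so this is not expected to reproduce Table VI exactly. The function name and the toy accuracies are hypothetical.

```python
import numpy as np

def wilcoxon_rank_sums(acc_a, acc_b):
    """Return (R+, R-) for method A versus method B over paired datasets."""
    d = np.asarray(acc_a, dtype=float) - np.asarray(acc_b, dtype=float)
    abs_d = np.abs(d)
    order = np.argsort(abs_d, kind="stable")
    ranks = np.empty(d.size)
    i = 0
    while i < d.size:
        j = i
        # extend j over a run of equal absolute differences
        while j + 1 < d.size and abs_d[order[j + 1]] == abs_d[order[i]]:
            j += 1
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0  # average rank for the tied run
        i = j + 1
    zero = d == 0
    r_plus = ranks[d > 0].sum() + 0.5 * ranks[zero].sum()
    r_minus = ranks[d < 0].sum() + 0.5 * ranks[zero].sum()
    return r_plus, r_minus

# Toy accuracies of two methods on three datasets (hypothetical numbers).
r_plus, r_minus = wilcoxon_rank_sums([3.0, 1.0, 5.0], [1.0, 2.0, 2.0])
tie_plus, tie_minus = wilcoxon_rank_sums([2.0, 1.0, 4.0], [1.0, 2.0, 4.0])
```

The smaller of the two sums is then compared with the critical value for the chosen significance level; the two sums always add up to n(n + 1)/2 under this convention.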

IV-B4 Comparison w.r.t. Different Sizes of Training Samples

We conduct an experiment to compare the performance of the methods under varying numbers of labeled samples. Fig. 4 (a)-(c) illustrates the comparative results on three selected datasets, i.e., Iris, Wine, and WDBC. We gradually increase the number of training samples per class from 1 to 20 and plot the test accuracy curves of the competing methods. It can be seen that both of our FLGC models show competitive performance w.r.t. the other baselines under different training sizes. In particular, our methods remarkably outperform many baselines (e.g., SS-ELM, TSVM, and LapRLS) on all three datasets when the training size is extremely small, e.g., only 1 labeled sample per class.

IV-B5 Study on Over-Smoothing

Fig. 5 (a)-(c) shows the effect of the propagation step on the selected datasets. Several tendencies can be observed from the figures. First, GCNII, APPNP, FLGC, and FLGC* are less prone to over-smoothing, whereas the other methods suffer from it. Second, linear GCN tends to outperform the nonlinear GCN on the selected datasets. A conceivable reason is that the nonlinear activation accelerates over-smoothing; this is also affected by the predefined graph structure. Third, by adding the initial residual, FLGC* benefits from longer-range propagation and thus significantly improves upon FLGC.

Fig. 7: Sample images from (a) COIL20, (b) ORL, and (c) Extended Yale B.
Iris ACC 0.797 0.636 0.586 0.573 0.846 0.686 0.817 0.873 0.940 0.953 0.960
NMI 0.637 0.404 0.572 0.427 0.723 0.580 0.682 0.754 0.851 0.874 0.886
Wine ACC 0.960 0.722 0.959 0.926 0.915 0.955 0.541 0.972 0.949 0.949 0.983
NMI 0.862 0.447 0.881 0.780 0.732 0.848 0.361 0.881 0.820 0.820 0.928
Yale ACC 0.472 0.339 0.511 0.521 0.509 0.593 0.442 0.600 0.515 0.546 0.630
NMI 0.540 0.412 0.561 0.549 0.582 0.584 0.506 0.631 0.558 0.557 0.657
TABLE VII: The clustering performance of different clustering models on Iris, Wine, and Yale datasets.
Fig. 8: Clustering ACC (in %) on the (a) COIL20, (b) ORL, and (c) Extended Yale B datasets. Different colors indicate different methods. The height of the bars indicates the clustering ACC, so the higher the better.
Fig. 9: Affinity matrix obtained by FLGC* for (a) Iris and (b) Extended Yale B datasets.
Noise Type Noise Intensity ACC (EDSC) ACC (EGCSC) ACC (FLGC) ACC (FLGC*) NMI (EDSC) NMI (EGCSC) NMI (FLGC) NMI (FLGC*)
Gaussian 0.01 0.853 0.863 0.863 0.901 0.897 0.881 0.881 0.919
Gaussian 0.05 0.838 0.830 0.830 0.891 0.884 0.855 0.855 0.906
Gaussian 0.1 0.472 0.748 0.748 0.824 0.748 0.780 0.780 0.866
Gaussian 0.2 0.680 0.643 0.643 0.774 0.774 0.710 0.710 0.813
Salt & Pepper 0.01 0.845 0.891 0.891 0.908 0.886 0.901 0.901 0.922
Salt & Pepper 0.05 0.842 0.862 0.862 0.896 0.885 0.883 0.883 0.913
Salt & Pepper 0.1 0.843 0.847 0.847 0.894 0.885 0.870 0.870 0.912
Salt & Pepper 0.2 0.838 0.826 0.826 0.877 0.881 0.850 0.850 0.901
TABLE VIII: Clustering results of the robustness experiments on the Extended Yale B dataset.
Fig. 10: Effect of the propagation step on the clustering accuracy of (a) FLGC and (b) FLGC*. (c) The influence of the teleport probability on three benchmarks.

IV-B6 Effect of the Neighborhood Size

We further explore the effect of the predefined graph structure on the classification performance of our FLGC. Fig. 6 shows how the accuracy varies with the neighborhood size. When the neighborhood size is zero, meaning that no neighborhood information is used, FLGC performs identically to a ridge regression classifier. At this point, FLGC's performance becomes less stable and worse than when neighborhood information is considered. This demonstrates that pairwise relationships help define the intrinsic structure of regular data. Unlike real-world graph data, however, the predefined edges (i.e., the kNN graph) cannot perfectly describe such structures. As a result, the performance of FLGC varies with the neighborhood size. Empirically, a larger neighborhood leads to greater performance improvement because it enlarges the first-order receptive field. However, an overly large neighborhood inevitably degrades the performance because of the increased risk of noisy edges and over-smoothing. Finding an optimal neighborhood size remains an open problem.

V Experiments of Clustering

In this section, we extensively evaluate our proposed unsupervised FLGC on several challenging clustering benchmarks and compare it with many previous clustering models.

V-A Dataset Description

In addition to two simple UCI datasets (i.e., Iris and Wine), we use four challenging image clustering benchmarks for performance comparison, i.e., the COIL20 object image dataset and the Yale, Extended Yale B, and ORL face image datasets. The COIL20 dataset contains 1440 gray-scale image samples of 20 distinct objects, e.g., duck, cat, and car models. The ORL dataset is composed of 400 human face images, with 40 subjects each having 10 samples. Following [24], we down-sampled the original face images. The Yale and Extended Yale B datasets are popular benchmarks for subspace clustering. The former includes 165 face images collected from 15 individuals. The latter database is more challenging than the former because it contains images of 38 human subjects acquired under different poses and illumination conditions. The images of the two face databases are likewise rescaled. Some selected sample images from COIL20, ORL, and Extended Yale B are illustrated in Fig. 7.

V-B Baselines and Setup

We divide our experiment into two parts. In the first part, we evaluate our methods on three small and simple datasets, i.e., Iris, Wine, and Yale. This part follows the settings suggested in [49]. We compare our method with the following baselines: K-Means, NMF [30], Normalized Cuts (NCut) [45], CAN [38], CLR [39], SSR [23], K-Multiple-Means (KMM) [37], EGCSC [5], and RRCSL [49]. As in Section IV, we use FLGC* to denote our method with the initial residual. In the second part, we conduct experiments on three challenging image datasets (i.e., COIL20, ORL, and Extended Yale B). The baselines in this part include Low Rank Representation (LRR) [33], LRSC [48], SSC [12], KSSC [42], SSC by Orthogonal Matching Pursuit (SSC-OMP) [56], EDSC [41], and EGCSC [5]. We follow the experimental setups reported in [24]. For our FLGC, we search for the optimum setting of its hyper-parameters and of the neighborhood size of the kNN graph.

Two popular metrics are used to quantify the performance of the clustering methods: clustering Accuracy (ACC) and Normalized Mutual Information (NMI) [5, 49]. Both metrics range in [0, 1], and a higher score indicates a more accurate clustering result.
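For reference, the two metrics can be computed as in the following standard-library sketch. It makes two assumptions that the text does not specify: ACC is taken as the best accuracy over all one-to-one mappings between predicted clusters and true classes (found here by brute force, which is only practical for a handful of clusters; the Hungarian algorithm is the usual choice at scale), and NMI is normalized by the geometric mean of the two entropies, which is one of several common normalizations.

```python
from collections import Counter
from itertools import permutations
from math import log, sqrt

def clustering_accuracy(y_true, y_pred):
    """Best accuracy over all one-to-one maps from predicted to true labels."""
    counts = Counter(zip(y_pred, y_true))
    pred_ids = sorted(set(y_pred))
    true_ids = sorted(set(y_true))
    # Pad with None so the mapping stays one-to-one when cluster counts differ.
    targets = true_ids + [None] * max(0, len(pred_ids) - len(true_ids))
    best = 0
    for perm in permutations(targets, len(pred_ids)):
        hits = sum(counts[(p, t)] for p, t in zip(pred_ids, perm))
        best = max(best, hits)
    return best / len(y_true)

def normalized_mutual_info(y_true, y_pred):
    """NMI = I(T; P) / sqrt(H(T) * H(P)), an assumed normalization."""
    n = len(y_true)
    joint = Counter(zip(y_true, y_pred))
    pt, pp = Counter(y_true), Counter(y_pred)
    mi = sum(c / n * log(c * n / (pt[t] * pp[p])) for (t, p), c in joint.items())
    h_t = -sum(c / n * log(c / n) for c in pt.values())
    h_p = -sum(c / n * log(c / n) for c in pp.values())
    if h_t == 0.0 or h_p == 0.0:
        return 1.0 if h_t == h_p else 0.0
    return mi / sqrt(h_t * h_p)

# A relabeled but otherwise perfect clustering, and an imperfect one.
acc_perfect = clustering_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 2, 2, 0, 0])
nmi_perfect = normalized_mutual_info([0, 0, 1, 1, 2, 2], [1, 1, 2, 2, 0, 0])
acc_partial = clustering_accuracy([0, 0, 1, 1, 1, 1], [0, 0, 0, 1, 1, 1])
```

Both metrics are invariant to how the clusters are numbered, which is why the relabeled clustering in the example still counts as perfect.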

V-C Quantitative and Qualitative Results

Table VII gives the quantitative comparison of different clustering models on the three small datasets. We can observe that our FLGC models consistently achieve superior ACC and NMI with significant margins compared with many existing clustering models. Specifically, FLGC* obtains 0.960 and 0.983 ACC on the Iris and Wine datasets, respectively, outperforming the advanced RRCSL and KMM by large margins. In Fig. 8, we provide a visual comparison of the clustering performance on the three challenging datasets. The results reveal that the proposed FLGC models markedly improve upon many self-expressiveness-based clustering models. As an extension of subspace clustering, our unsupervised FLGC reduces the intra-class variations through neighborhood propagation, making it more robust in finding the inherent subspace structure among data. Taking the Iris and Extended Yale B datasets as examples, we visualize the affinity matrices yielded by FLGC*, as shown in Fig. 9. The visualizations exhibit distinctly block-diagonal structures, which closely match the corresponding ground truths.

V-D Analysis of Robustness

To analyze the robustness of FLGCs, we conduct experiments to compare the clustering performance under different noise conditions. Specifically, we corrupt the images with Gaussian noise and salt-and-pepper noise. The variance of the Gaussian noise and the proportion of pixels corrupted by the salt-and-pepper noise are treated as the noise intensity. We test our method with intensities of 0.01, 0.05, 0.1, and 0.2. Table VIII shows that our FLGC* is more robust to noise than the other methods. This superiority benefits from the graph structure as well as the initial residual propagation scheme. FLGC and EGCSC have the same performance because FLGC obtains its best performance at the setting in which it reduces to EGCSC. Compared with EDSC, the other three methods show lower sensitivity to noise, demonstrating the robustness of the graph convolution.
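The two corruption models can be sketched as follows. The sketch assumes pixel values in [0, 1], zero-mean Gaussian noise, clipping after corruption, and an equal chance of salt and pepper; none of these details are stated in the text, and the function names are hypothetical.

```python
import numpy as np

def add_gaussian_noise(img, variance, rng):
    """Add zero-mean Gaussian noise with the given variance, then clip to [0, 1]."""
    noisy = img + rng.normal(0.0, np.sqrt(variance), size=img.shape)
    return np.clip(noisy, 0.0, 1.0)

def add_salt_and_pepper(img, proportion, rng):
    """Set the given proportion of pixels to 0 (pepper) or 1 (salt)."""
    noisy = img.copy()
    n_corrupt = int(round(proportion * img.size))
    flat_idx = rng.choice(img.size, size=n_corrupt, replace=False)
    noisy.flat[flat_idx] = rng.integers(0, 2, size=n_corrupt)  # 0 or 1 per pixel
    return noisy

# A flat gray 20 x 20 image corrupted at intensity 0.1 (hypothetical input).
rng = np.random.default_rng(0)
img = np.full((20, 20), 0.5)
gauss = add_gaussian_noise(img, variance=0.1, rng=rng)
sp = add_salt_and_pepper(img, proportion=0.1, rng=rng)
```

In this reading, the intensity values 0.01, 0.05, 0.1, and 0.2 would be passed as `variance` for the Gaussian case and as `proportion` for the salt-and-pepper case.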

V-E Influence of the Propagation Step and the Teleport Probability

To study the effect of over-smoothing, we show the clustering accuracy under different propagation steps on the COIL20, ORL, and Extended Yale B datasets in Fig. 10 (a) and (b). We find that FLGC* (Fig. 10 (b)) is robust to large propagation steps, since its performance is almost unaffected by a large propagation step on the Extended Yale B dataset. In comparison, the accuracy obtained by FLGC drops by about 0.70 in terms of ACC as the propagation step grows. This robustness to large propagation steps further demonstrates that the initial features are crucial for regular data. This conclusion is further supported by Fig. 10 (c), where the clustering ACC tends to increase with a large teleport probability on the ORL and Extended Yale B datasets. Note that the structure information contributes more than the initial features on the COIL20 dataset. A conceivable reason is that the samples within COIL20 have strong inter-class differences, which results in a more accurate structure.

VI Conclusions

In this paper, we have presented a unified and simple graph convolutional framework, i.e., fully linear graph convolutional networks, which incorporates multi-hop neighborhood aggregation into classic linear models to further simplify the training, application, and implementation of GCNs. Technically, we train FLGC by computing a globally optimal closed-form solution, resulting in efficient computation. Based on this framework, we developed a semi-supervised FLGC and an unsupervised FLGC for semi-supervised node classification and unsupervised clustering tasks, respectively. Furthermore, we showed that FLGC acts as a generalization of traditional linear models to non-Euclidean data. In comparison with existing approaches, our FLGCs achieved superior performance on both real-world graphs and regular grid data. The success of our FLGC establishes a connection between GCNs and classic linear models. Future work may include exploring more scalable linear models to deal with large-scale graphs, inductive learning, and extending FLGC to different GCNs.


The authors would like to thank the anonymous reviewers for their constructive suggestions and criticisms.


  • [1] S. Abu-El-Haija, B. Perozzi, A. Kapoor, N. Alipourfard, K. Lerman, H. Harutyunyan, G. Ver Steeg, and A. Galstyan (2019) Mixhop: higher-order graph convolutional architectures via sparsified neighborhood mixing. In International Conference on Machine Learning, pp. 21–29. Cited by: §IV-A2, §IV-A3.
  • [2] S. An, W. Liu, and S. Venkatesh (2007) Face recognition using kernel ridge regression. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–7. Cited by: §I, §II-E.
  • [3] M. Belkin, P. Niyogi, and V. Sindhwani (2006) Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. Journal of machine learning research 7 (Nov), pp. 2399–2434. Cited by: §I, §IV-B2.
  • [4] Y. Cai, X. Liu, and Z. Cai (2020) BS-nets: an end-to-end framework for band selection of hyperspectral image. IEEE Transactions on Geoscience and Remote Sensing 58 (3), pp. 1969–1984. Cited by: §II-B.
  • [5] Y. Cai, Z. Zhang, Z. Cai, X. Liu, X. Jiang, and Q. Yan (2021) Graph convolutional subspace clustering: a robust subspace clustering framework for hyperspectral image. IEEE Transactions on Geoscience and Remote Sensing 59 (5), pp. 4191–4202. Cited by: §I, 2nd item, §III-D, §IV-B2, §V-B, §V-B.
  • [6] Y. Cai, Z. Zhang, Z. Cai, X. Liu, and X. Jiang (2021) Hypergraph-structured autoencoder for unsupervised and semisupervised classification of hyperspectral image. IEEE Geoscience and Remote Sensing Letters (), pp. 1–5. Note: doi: 10.1109/LGRS.2021.3054868 Cited by: §I, §IV-B2.
  • [7] Y. Cai, Z. Zhang, Q. Yan, D. Zhang, and M. J. Banu (2020) Densely connected convolutional extreme learning machine for hyperspectral image classification. Neurocomputing. Note: doi:10.1016/j.neucom.2020.12.064 External Links: ISSN 0925-2312 Cited by: §II-E.
  • [8] J. Chen, T. Ma, and C. Xiao (2018) FastGCN: fast learning with graph convolutional networks via importance sampling. In International Conference on Learning Representations, Cited by: §IV-A2, §IV-A3.
  • [9] M. Defferrard, X. Bresson, and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, Red Hook, NY, USA, pp. 3844–3852. External Links: ISBN 9781510838819 Cited by: §I.
  • [10] Y. Ding, X. Zhao, Z. Zhang, W. Cai, and N. Yang (2021) Graph sample and aggregate-attention network for hyperspectral image classification. IEEE Geoscience and Remote Sensing Letters (), pp. 1–5. Note: doi: 10.1109/LGRS.2021.3062944 Cited by: §I.
  • [11] Y. Ding, X. Zhao, Z. Zhang, W. Cai, and N. Yang (2021) Multiscale graph sample and aggregate network with context-aware learning for hyperspectral image classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 14 (), pp. 4561–4572. External Links: Document Cited by: §I.
  • [12] E. Elhamifar and R. Vidal (2013-11) Sparse subspace clustering: algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (11), pp. 2765–2781. External Links: ISSN 0162-8828 Cited by: §I, §II-E, §V-B.
  • [13] W. Fan, Y. Ma, Q. Li, Y. He, E. Zhao, J. Tang, and D. Yin (2019) Graph neural networks for social recommendation. In The World Wide Web Conference, WWW ’19, pp. 417–426. External Links: Document Cited by: §I.
  • [14] W. Feng, J. Zhang, Y. Dong, Y. Han, H. Luan, Q. Xu, Q. Yang, E. Kharlamov, and J. Tang (2020) Graph random neural networks for semi-supervised learning on graphs. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 22092–22103. Cited by: §I.
  • [15] S. Garcia and F. Herrera (2008) An extension on “statistical comparisons of classifiers over multiple data sets” for all pairwise comparisons. Journal of machine learning research 9 (Dec), pp. 2677–2694. Cited by: §IV-B3.
  • [16] M. Gasse, D. Chetelat, N. Ferroni, L. Charlin, and A. Lodi (2019) Exact combinatorial optimization with graph convolutional neural networks. In Advances in Neural Information Processing Systems, Vol. 32, pp. 15580–15592. Cited by: §I.
  • [17] P. Ghamisi, E. Maggiori, S. Li, R. Souza, Y. Tarablaka, G. Moser, A. De Giorgi, L. Fang, Y. Chen, M. Chi, S. B. Serpico, and J. A. Benediktsson (2018-Sep.) New frontiers in spectral-spatial hyperspectral image classification: the latest advances based on mathematical morphology, markov random fields, segmentation, sparse representation, and deep learning. IEEE Geoscience and Remote Sensing Magazine 6 (3), pp. 10–43. Cited by: §II-E.
  • [18] J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang, G. Wang, J. Cai, and T. Chen (2018) Recent advances in convolutional neural networks. Pattern Recognition 77, pp. 354–377. External Links: ISSN 00313203 Cited by: §II-B, §II-E.
  • [19] W. L. Hamilton, R. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, Red Hook, NY, USA, pp. 1025–1035. External Links: ISBN 9781510860964 Cited by: §I.
  • [20] R. Hang, Q. Liu, H. Song, Y. Sun, F. Zhu, and H. Pei (2017) Graph regularized nonlinear ridge regression for remote sensing data analysis. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 10 (1), pp. 277–285. External Links: ISSN 1939-1404 2151-1535 Cited by: §II-E.
  • [21] K. He, X. Zhang, S. Ren, and J. Sun (2016-06) Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §III-B.
  • [22] G. Huang, S. Song, J. N. D. Gupta, and C. Wu (2014-12) Semi-supervised and unsupervised extreme learning machines. IEEE Transactions on Cybernetics 44 (12), pp. 2405–2417. Cited by: §IV-B2.
  • [23] J. Huang, F. Nie, and H. Huang (2015) A new simplex sparse learning model to measure data similarity for clustering. In Twenty-Fourth International Joint Conference on Artificial Intelligence, Cited by: §V-B.
  • [24] P. Ji, T. Zhang, H. Li, M. Salzmann, and I. Reid (2017) Deep subspace clustering networks. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 24–33. Cited by: §V-A, §V-B.
  • [25] T. Joachims (1999) Transductive inference for text classification using support vector machines. In ICML, Vol. 99, pp. 200–209. Cited by: §IV-B2.
  • [26] T. N. Kipf and M. Welling (2016) Variational graph auto-encoders. arXiv preprint arXiv:1611.07308. Cited by: 2nd item.
  • [27] T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, Cited by: §I, §II-B, §III-B, §IV-A2, §IV-A3, §IV-B2.
  • [28] J. Klicpera, A. Bojchevski, and S. Günnemann (2019) Predict then propagate: graph neural networks meet personalized pagerank. In International Conference on Learning Representations (ICLR), Cited by: §I, §II-B, §II-D, §III-B, §III-B, §IV-A2, §IV-B2.
  • [29] Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. Nature 521 (7553), pp. 436–444. External Links: ISSN 0028-0836 Cited by: §II-E.
  • [30] D. D. Lee and H. S. Seung (1999) Learning the parts of objects by non-negative matrix factorization. Nature 401 (6755), pp. 788–791. Cited by: §V-B.
  • [31] Q. Li, Z. Han, and X. Wu (2018) Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §I, §I, §II-B.
  • [32] R. Liao, Z. Zhao, R. Urtasun, and R. S. Zemel (2019) LanczosNet: multi-scale deep graph convolutional networks. In 7th International Conference on Learning Representations, ICLR 2019, Cited by: §IV-A2, §IV-A3.
  • [33] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma (2012) Robust recovery of subspace structures by low-rank representation. IEEE transactions on pattern analysis and machine intelligence 35 (1), pp. 171–184. Cited by: §V-B.
  • [34] J. Ma, W. Tang, J. Zhu, and Q. Mei (2019) A flexible generative framework for graph-based semi-supervised learning. In 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Cited by: §IV-A2, §IV-A3.
  • [35] M. Chen, Z. Wei, Z. Huang, B. Ding, and Y. Li (2020) Simple and deep graph convolutional networks. In Proceedings of the 37th International Conference on Machine Learning, Cited by: §I, §II-B, §III-B, §III-B, §IV-B2.
  • [36] F. Monti, D. Boscaini, J. Masci, E. Rodola, J. Svoboda, and M. M. Bronstein (2017-07) Geometric deep learning on graphs and manifolds using mixture model cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §I.
  • [37] F. Nie, C. Wang, and X. Li (2019) K-multiple-means: a multiple-means clustering method with specified k clusters. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 959–967. Cited by: §V-B.
  • [38] F. Nie, X. Wang, and H. Huang (2014) Clustering and projected clustering with adaptive neighbors. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 977–986. Cited by: §V-B.
  • [39] F. Nie, X. Wang, M. Jordan, and H. Huang (2016) The constrained laplacian rank algorithm for graph-based clustering. In Proceedings of the AAAI conference on artificial intelligence, Vol. 30. Cited by: §V-B.
  • [40] L. Page, S. Brin, R. Motwani, and T. Winograd (1999-11) The pagerank citation ranking: bringing order to the web.. Technical Report Technical Report 1999-66, Stanford InfoLab. Note: Previous number = SIDL-WP-1999-0120 Cited by: §I, §III-B.
  • [41] Pan Ji, M. Salzmann, and Hongdong Li (2014-03) Efficient dense subspace clustering. In IEEE Winter Conference on Applications of Computer Vision, Vol. , pp. 461–468. External Links: ISSN Cited by: §III-D, §V-B.
  • [42] V. M. Patel and R. Vidal (2014) Kernel sparse subspace clustering. In 2014 ieee international conference on image processing (icip), pp. 2849–2853. Cited by: §V-B.
  • [43] Y. Rong, W. Huang, T. Xu, and J. Huang (2019) DropEdge: towards deep graph convolutional networks on node classification. In International Conference on Learning Representations, Cited by: §I, §IV-A2, §IV-A3, §IV-B2.
  • [44] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad (2008) Collective classification in network data. AI magazine 29 (3), pp. 93–93. Cited by: §IV-A1, §IV-A2.
  • [45] J. Shi and J. Malik (2000) Normalized cuts and image segmentation. IEEE Transactions on pattern analysis and machine intelligence 22 (8), pp. 888–905. Cited by: §V-B.
  • [46] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2018) Graph attention networks. In International Conference on Learning Representations, Cited by: §I, §IV-A2, §IV-A3.
  • [47] P. Velickovic, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, and R. D. Hjelm (2019) Deep graph infomax.. In ICLR (Poster), Cited by: §IV-A2, §IV-A3.
  • [48] R. Vidal and P. Favaro (2014) Low rank subspace clustering (LRSC). Pattern Recognition Letters 43, pp. 47–61. Note: ICPR2012 Awarded Papers External Links: ISSN 0167-8655 Cited by: §II-E, §V-B.
  • [49] Q. Wang, R. Liu, M. Chen, and X. Li (2021) Robust rank-constrained sparse learning: a graph-based framework for single view and multiview clustering. IEEE Transactions on Cybernetics, pp. 1–12. Note: doi: 10.1109/TCYB.2021.3067137 Cited by: §V-B, §V-B.
  • [50] B. Wilder, E. Ewing, B. Dilkina, and M. Tambe (2019) End to end learning and optimization on graphs. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32, pp. 4672–4683. Cited by: §I.
  • [51] F. Wu, A. Souza, T. Zhang, C. Fifty, T. Yu, and K. Weinberger (2019) Simplifying graph convolutional networks. In International Conference on Machine Learning, pp. 6861–6871. Cited by: §I, §II-B, §II-C, §III-A, §IV-A2, §IV-A3, §IV-B2.
  • [52] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu (2020) A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, pp. 1–21. Cited by: §I, §I, §II-B.
  • [53] K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2018) How powerful are graph neural networks? In International Conference on Learning Representations. Cited by: §IV-A2, §IV-A3.
  • [54] Z. Yang, W. Cohen, and R. Salakhudinov (2016) Revisiting semi-supervised learning with graph embeddings. In International Conference on Machine Learning, pp. 40–48. Cited by: §IV-A1.
  • [55] M. Yin, J. Gao, and Z. Lin (2016-03) Laplacian regularized low-rank representation and its applications. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (3), pp. 504–517. External Links: ISSN 0162-8828 Cited by: §I.
  • [56] C. You, D. Robinson, and R. Vidal (2016) Scalable sparse subspace clustering by orthogonal matching pursuit. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3918–3927. Cited by: §V-B.
  • [57] Z. Zhang, P. Cui, and W. Zhu (2020) Deep learning on graphs: a survey. IEEE Transactions on Knowledge and Data Engineering, pp. 1–1. Cited by: §I, §II-E.
  • [58] T. Zhao, Y. Liu, L. Neves, O. Woodford, M. Jiang, and N. Shah (2021) Data augmentation for graph neural networks. In The Thirty-Fifth AAAI Conference on Artificial Intelligence. Cited by: §I.
  • [59] J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, and M. Sun (2018) Graph neural networks: A review of methods and applications. CoRR abs/1812.08434. Cited by: §I, §I, §II-B.
  • [60] M. Zhu, X. Wang, C. Shi, H. Ji, and P. Cui (2021) Interpreting and unifying graph neural networks with an optimization framework. arXiv preprint arXiv:2101.11859. Cited by: 1st item.