Elastic Graph Neural Networks

by Xiaorui Liu, et al.

While many existing graph neural networks (GNNs) have been proven to perform ℓ_2-based graph smoothing that enforces smoothness globally, in this work we aim to further enhance the local smoothness adaptivity of GNNs via ℓ_1-based graph smoothing. As a result, we introduce a family of GNNs (Elastic GNNs) based on ℓ_1 and ℓ_2-based graph smoothing. In particular, we propose a novel and general message passing scheme into GNNs. This message passing algorithm is not only friendly to back-propagation training but also achieves the desired smoothing properties with a theoretical convergence guarantee. Experiments on semi-supervised learning tasks demonstrate that the proposed Elastic GNNs obtain better adaptivity on benchmark datasets and are significantly robust to graph adversarial attacks. The implementation of Elastic GNNs is available at <https://github.com/lxiaorui/ElasticGNN>.



1 Introduction

Graph neural networks (GNNs) generalize traditional deep neural networks (DNNs) from regular grids, such as image, video, and text, to irregular data such as social networks, transportation networks, and biological networks, which are typically denoted as graphs (Defferrard et al., 2016; Kipf and Welling, 2016). One popular such generalization is the neural message passing framework (Gilmer et al., 2017):

x_i^{(k+1)} = UPDATE(x_i^{(k)}, m_i^{(k)}),   m_i^{(k)} = AGGREGATE({x_j^{(k)} : v_j ∈ N_i}),

where x_i^{(k)} denotes the feature vector of node v_i in the k-th iteration of message passing and m_i^{(k)} is the message aggregated from v_i's neighborhood N_i. The specific architecture designs have been motivated from the spectral domain (Kipf and Welling, 2016; Defferrard et al., 2016) and the spatial domain (Hamilton et al., 2017; Veličković et al., 2017; Scarselli et al., 2008; Gilmer et al., 2017). A recent study (Ma et al., 2020) has proven that the message passing schemes in numerous popular GNNs, such as GCN, GAT, PPNP, and APPNP, intrinsically perform ℓ_2-based graph smoothing on the graph signal, and they can be considered as solving the graph signal denoising problem:

min_F  ‖F − X‖_F^2 + c · tr(F^T L F),

where X is the input signal and L is the graph Laplacian matrix encoding the graph structure. The first term guides F to be close to the input signal X, while the second term enforces global smoothness on the filtered signal F. The resulting message passing schemes can be derived from different optimization solvers, and they typically entail the aggregation of node features from neighboring nodes, which intuitively coincides with the cluster or consistency assumption that neighboring nodes should be similar (Zhu and Ghahramani, 2002; Zhou et al., 2004). While existing GNNs are prominently driven by ℓ_2-based graph smoothing, ℓ_2-based methods enforce smoothness globally, and the level of smoothness is usually shared across the whole graph. However, the level of smoothness over different regions of the graph can be different. For instance, node features or labels can change significantly between clusters but smoothly within a cluster (Zhu, 2005). Therefore, it is desirable to enhance the local smoothness adaptivity of GNNs.
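To make the ℓ_2 denoising view concrete, here is a minimal self-contained sketch (the toy path graph and all variable names are our own, not from the paper): gradient descent on ‖F − X‖_F^2 + c · tr(F^T L F) converges to the closed-form solution (I + cL)^{-1}X, and each gradient step aggregates features from neighboring nodes.

```python
import numpy as np

# Toy 4-node path graph; c is the smoothing coefficient.
edges = [(0, 1), (1, 2), (2, 3)]
n, c = 4, 1.0
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(1)) - A            # unnormalized graph Laplacian

rng = np.random.default_rng(0)
X = rng.normal(size=(n, 2))          # noisy 2-dimensional input signal

# Closed-form minimizer of ||F - X||_F^2 + c * tr(F^T L F).
F_star = np.linalg.solve(np.eye(n) + c * L, X)

# Gradient descent: grad = 2(F - X) + 2c * L @ F; each step mixes in
# neighbor features through L, i.e., it acts like message passing.
F = X.copy()
step = 0.1
for _ in range(500):
    F = F - step * (2 * (F - X) + 2 * c * L @ F)
```

After enough steps, `F` matches the closed-form solution `F_star`, illustrating why iterative ℓ_2 smoothing and the denoising problem are two views of the same computation.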

Motivated by the idea of trend filtering (Kim et al., 2009; Tibshirani and others, 2014; Wang et al., 2016), we aim to achieve this goal via ℓ_1-based graph smoothing. Intuitively, compared with ℓ_2-based methods, ℓ_1-based methods penalize large values less heavily and thus preserve discontinuities or non-smooth signals better. Theoretically, ℓ_1-based methods tend to promote signal sparsity to trade for discontinuity (Rudin et al., 1992; Tibshirani et al., 2005; Sharpnack et al., 2012). Owing to these advantages, trend filtering (Tibshirani and others, 2014) and graph trend filtering (Wang et al., 2016; Varma et al., 2019) demonstrate that ℓ_1-based graph smoothing can adapt to an inhomogeneous level of smoothness of signals and yield estimators that are k-th order piecewise polynomial functions, such as piecewise constant, linear, and quadratic functions, depending on the order of the graph difference operator. While ℓ_1-based methods exhibit various appealing properties and have been extensively studied in different domains such as signal processing (Elad, 2010) and statistics and machine learning (Hastie et al., 2015), they have rarely been investigated in the design of GNNs. In this work, we attempt to bridge this gap and enhance the local smoothness adaptivity of GNNs via ℓ_1-based graph smoothing.

Incorporating ℓ_1-based graph smoothing into the design of GNNs faces tremendous challenges. First, since the message passing schemes in GNNs can be derived from the optimization iterations of the graph signal denoising problem, a fast, efficient, and scalable optimization solver is desired. Unfortunately, solving the associated optimization problem involving the ℓ_1 norm is challenging, since the objective function is composed of smooth and non-smooth components and the decision variable is further coupled with the discrete graph difference operator. Second, to integrate the derived message passing scheme into GNNs, it has to be composed of simple operations that are friendly to the back-propagation training of the whole GNN. Third, it requires an appropriate normalization step to deal with diverse node degrees, which is often overlooked by existing graph total variation and graph trend filtering methods. Our attempt to address these challenges leads to a family of novel GNNs, i.e., Elastic GNNs. Our key contributions can be summarized as follows:

  • We introduce ℓ_1-based graph smoothing in the design of GNNs, for the first time, to further enhance the local smoothness adaptivity;

  • We derive a novel and general message passing scheme, i.e., Elastic Message Passing (EMP), and develop a family of GNN architectures, i.e., Elastic GNNs, by integrating the proposed message passing scheme into deep neural nets;

  • Extensive experiments demonstrate that Elastic GNNs obtain better adaptivity on various real-world datasets and are significantly robust to graph adversarial attacks. The study on different variants of Elastic GNNs suggests that ℓ_1 and ℓ_2-based graph smoothing are complementary and that Elastic GNNs are more versatile.

2 Preliminary

We use bold upper-case letters such as X to denote matrices and bold lower-case letters such as x to denote vectors. Given a matrix X ∈ R^{n×d}, we use X_i to denote its i-th row and X_{ij} to denote the element in its i-th row and j-th column. We define the Frobenius norm, ℓ_1 norm, and ℓ_21 norm of a matrix X as ‖X‖_F = sqrt(Σ_{ij} X_{ij}^2), ‖X‖_1 = Σ_{ij} |X_{ij}|, and ‖X‖_{21} = Σ_i ‖X_i‖_2, respectively. We define ‖X‖_2 = σ_max(X), where σ_max(X) is the largest singular value of X. Given two matrices X, Y ∈ R^{n×d}, we define the inner product as ⟨X, Y⟩ = tr(X^T Y).

Let G = {V, E} be a graph with the node set V = {v_1, …, v_n} and the undirected edge set E = {e_1, …, e_m}. We use N_i to denote the neighboring nodes of node v_i, including v_i itself. Suppose that each node is associated with a d-dimensional feature vector; the features for all nodes are denoted as X ∈ R^{n×d}. The graph structure can be represented as an adjacency matrix A ∈ R^{n×n}, where A_{ij} = 1 when there exists an edge between nodes v_i and v_j, and A_{ij} = 0 otherwise. The graph Laplacian matrix is defined as L = D − A, where D is the diagonal degree matrix. Let Δ ∈ {−1, 0, 1}^{m×n} be the oriented incident matrix, which contains one row for each edge. If the ℓ-th edge connects the i-th and j-th nodes, then Δ has its ℓ-th row as

Δ_{ℓ·} = (0, …, −1 (i-th entry), …, 1 (j-th entry), …, 0),

where the edge orientation can be arbitrary. Note that the incident matrix and the unnormalized Laplacian matrix satisfy the equivalence L = Δ^T Δ. Next, we briefly introduce some necessary background on the graph signal denoising perspective of GNNs and on graph trend filtering methods.
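The equivalence L = Δ^T Δ is easy to check numerically. The following sketch (our own toy graph, with arbitrary edge orientations) builds both matrices:

```python
import numpy as np

# A small 4-node graph; each edge gets one row in the incident matrix.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n, m = 4, len(edges)

Delta = np.zeros((m, n))
for row, (i, j) in enumerate(edges):
    Delta[row, i] = -1.0   # arbitrary orientation: -1 at the tail ...
    Delta[row, j] = 1.0    # ... and +1 at the head of each edge

A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(1)) - A  # unnormalized Laplacian L = D - A
```

Flipping any edge orientation leaves Δ^T Δ unchanged, which is why the orientation can be chosen arbitrarily.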

2.1 GNNs as Graph Signal Denoising

It is evident from recent work (Ma et al., 2020) that many popular GNNs can be uniformly understood as graph signal denoising with Laplacian smoothing regularization. Here we briefly describe several representative examples.

GCN. The message passing scheme in Graph Convolutional Networks (GCN) (Kipf and Welling, 2016),

X' = Ã X,  with Ã = D̃^{-1/2} Â D̃^{-1/2},

is equivalent to one gradient descent step to minimize tr(F^T (I − Ã) F) with the initialization F = X and stepsize 1/2. Here Â = A + I is the adjacency matrix with self-loops, whose degree matrix is D̃.

PPNP & APPNP. The message passing schemes in PPNP and APPNP (Klicpera et al., 2018) follow the aggregation rules

PPNP:  X' = α (I − (1 − α) Ã)^{-1} X;
APPNP: X^{(k+1)} = (1 − α) Ã X^{(k)} + α X^{(0)}.

They are shown to be, respectively, the exact solution and one gradient descent step with stepsize α/2 for the following problem:

min_F  ‖F − X‖_F^2 + (1/α − 1) · tr(F^T (I − Ã) F).

For a more comprehensive illustration, please refer to (Ma et al., 2020). We point out that all these message passing schemes adopt ℓ_2-based graph smoothing, as the signal differences between neighboring nodes are penalized by the square of the ℓ_2 norm, e.g., tr(F^T (I − Ã) F) = Σ_{(i,j)∈E} ‖F_i/√(d_i+1) − F_j/√(d_j+1)‖_2^2, with d_i being the node degree of node v_i. The resulting message passing schemes are usually linear smoothers, which smooth the input signal by a linear transformation.
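The PPNP/APPNP relation above can be verified numerically. In this sketch (toy graph and names are our own), iterating the APPNP rule converges to the PPNP closed form, i.e., to the minimizer of the ℓ_2 denoising objective with coefficient 1/α − 1:

```python
import numpy as np

# Toy 4-node cycle graph.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
n, alpha = 4, 0.2
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
A_hat = A + np.eye(n)                       # adjacency with self-loops
d = A_hat.sum(1)
A_tilde = A_hat / np.sqrt(np.outer(d, d))   # D^{-1/2} (A + I) D^{-1/2}

rng = np.random.default_rng(1)
X = rng.normal(size=(n, 3))

# APPNP: repeated message passing with a teleport term back to X.
F = X.copy()
for _ in range(200):
    F = (1 - alpha) * A_tilde @ F + alpha * X

# PPNP closed form: alpha * (I - (1 - alpha) A_tilde)^{-1} X.
F_ppnp = alpha * np.linalg.solve(np.eye(n) - (1 - alpha) * A_tilde, X)
```

Since the spectral radius of (1 − α)Ã is below one, the iteration contracts geometrically, so a moderate number of propagation steps suffices.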

2.2 Graph Trend Filtering

In the univariate case, the k-th order graph trend filtering (GTF) estimator (Wang et al., 2016) is given by

F̂ = argmin_{F ∈ R^n}  (1/2) ‖F − x‖_2^2 + λ ‖Δ^{(k+1)} F‖_1,    (4)

where x ∈ R^n is the 1-dimensional input signal of the n nodes and Δ^{(k+1)} is a k-th order graph difference operator. When k = 0, it penalizes the absolute difference across neighboring nodes in graph G:

‖Δ^{(1)} F‖_1 = Σ_{(i,j)∈E} |F_i − F_j|,

where Δ^{(1)} is equivalent to the incident matrix Δ. Generally, k-th order graph difference operators can be defined recursively:

Δ^{(k+1)} = Δ^T Δ^{(k)} if k is odd;  Δ^{(k+1)} = Δ Δ^{(k)} if k is even.

It is demonstrated that GTF can adapt to inhomogeneity in the level of smoothness of the signal and tends to provide piecewise polynomials over graphs (Wang et al., 2016). For instance, when k = 0, the sparsity induced by the ℓ_1-based penalty implies that many of the differences F_i − F_j are zero across edges in G. The piecewise property originates from the discontinuity of the signal allowed by the less aggressive ℓ_1 penalty, with adaptively chosen knot nodes or knot edges. Note that the smoothers induced by GTF are not linear smoothers and cannot be simply represented by a linear transformation of the input signal.
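A tiny numerical illustration (our own, not from the paper) of why the ℓ_1 penalty is friendlier to discontinuities: on a path graph, a signal with one sharp jump pays the same total variation as a smooth ramp under ‖Δf‖_1, but is far more expensive under the quadratic penalty ‖Δf‖_2^2 = f^T L f, so ℓ_2 smoothing has a much stronger incentive to blur the jump.

```python
import numpy as np

# Path graph on 10 nodes and its incident matrix.
n = 10
edges = [(i, i + 1) for i in range(n - 1)]
Delta = np.zeros((len(edges), n))
for r, (i, j) in enumerate(edges):
    Delta[r, i], Delta[r, j] = -1.0, 1.0

f = np.array([0.0] * 5 + [10.0] * 5)  # piecewise constant: one jump of size 10
g = np.linspace(0.0, 10.0, n)         # smeared (linear) version of the same trend

l1 = lambda x: np.abs(Delta @ x).sum()      # total variation penalty
l2 = lambda x: ((Delta @ x) ** 2).sum()     # quadratic (Laplacian) penalty

# Under l1, the jump and the ramp have the same cost (both 10);
# under l2, the sharp jump is roughly 9x more expensive than the ramp.
```

This is the discontinuity-preservation property the text describes: the ℓ_1 penalty does not single out the jump, so the estimator can keep it and place a "knot" there instead of smearing it.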

3 Elastic Graph Neural Networks

In this section, we first propose a new graph signal denoising estimator. Then we develop an efficient optimization algorithm for solving the denoising problem and introduce a novel, general and efficient message passing scheme, i.e., Elastic Message Passing (EMP), for graph signal smoothing. Finally, the integration of the proposed message passing scheme and deep neural networks leads to Elastic GNNs.

3.1 Elastic Graph Signal Estimator

To combine the advantages of ℓ_1 and ℓ_2-based graph smoothing, we propose the following elastic graph signal estimator:

min_{F ∈ R^{n×d}}  λ_1 ‖ΔF‖_1 + (λ_2/2) tr(F^T L F) + (1/2) ‖F − X‖_F^2,    (5)

where X ∈ R^{n×d} is the d-dimensional input signal of the n nodes. The first term can be written in an edge-centric way, ‖ΔF‖_1 = Σ_{(i,j)∈E} ‖F_i − F_j‖_1, which penalizes the absolute difference across connected nodes in graph G. Similarly, the second term penalizes the difference quadratically via tr(F^T L F) = Σ_{(i,j)∈E} ‖F_i − F_j‖_2^2. The last term is the fidelity term, which preserves the similarity with the input signal. The regularization coefficients λ_1 and λ_2 control the balance between ℓ_1 and ℓ_2-based graph smoothing.

Remark 1.

It is possible to consider higher-order graph differences in both the ℓ_1 and ℓ_2-based smoothers. However, in this work, we focus on the 0-th order graph difference operator Δ, since we assume a piecewise constant prior for graph representation learning.

Normalization. In existing GNNs, it is beneficial to normalize the Laplacian matrix for better numerical stability, and the normalization trick is also crucial for achieving superior performance. Therefore, for the ℓ_2-based graph smoothing, we follow the common normalization trick in GNNs: L̃ = I − Ã with Ã = D̃^{-1/2} Â D̃^{-1/2}, where Â = A + I and D̃ = D + I. It leads to a degree-normalized penalty

tr(F^T L̃ F) = Σ_{(i,j)∈E} ‖F_i/√(d_i+1) − F_j/√(d_j+1)‖_2^2.

In the literature on graph total variation and graph trend filtering, the normalization step is often overlooked and the graph difference operator Δ is used directly, as in GTF (Wang et al., 2016; Varma et al., 2019). To achieve better numerical stability and handle the diverse node degrees in real-world graphs, we propose to normalize each column of the incident matrix by the square root of the node degree for the ℓ_1-based graph smoothing (this naturally supports real-valued edge weights if the edge weights are set in the incident matrix Δ):

Δ̃ = Δ D̃^{-1/2}.    (6)

It leads to a degree-normalized total variation penalty (with this normalization, the piecewise constant prior is up to the degree scaling, i.e., sparsity in Δ̃F):

‖Δ̃F‖_1 = Σ_{(i,j)∈E} ‖F_i/√(d_i+1) − F_j/√(d_j+1)‖_1.

Note that this normalized incident matrix maintains the same relation with the normalized Laplacian matrix as in the unnormalized case:

Δ̃^T Δ̃ = D̃^{-1/2} Δ^T Δ D̃^{-1/2} = D̃^{-1/2} L D̃^{-1/2} = L̃,

given that L̃ = I − Ã = D̃^{-1/2} (D̃ − Â) D̃^{-1/2} = D̃^{-1/2} (D − A) D̃^{-1/2}. With the normalization, the estimator defined in (5) becomes:

min_{F ∈ R^{n×d}}  λ_1 ‖Δ̃F‖_1 + (λ_2/2) tr(F^T L̃ F) + (1/2) ‖F − X‖_F^2.    (7)

Capture correlation among dimensions. The node features in real-world graphs are usually multi-dimensional. Although the estimator defined in (7) is able to handle multi-dimensional data, since the signals from different dimensions are separable under the ℓ_1 and ℓ_2 norms, it treats each feature dimension independently and does not exploit the potential relations between feature dimensions. However, the sparsity patterns of node differences across edges could be shared among feature dimensions. To better exploit this potential correlation, we propose to couple the multi-dimensional features by the ℓ_21 norm, which penalizes the summation of the ℓ_2 norms of the node differences:

‖Δ̃F‖_{21} = Σ_{(i,j)∈E} ‖F_i/√(d_i+1) − F_j/√(d_j+1)‖_2.

This penalty promotes the row sparsity of Δ̃F and enforces similar sparsity patterns among feature dimensions. In other words, if two nodes are similar, all their feature dimensions should be similar. Therefore, we define the ℓ_21-based estimator as

min_{F ∈ R^{n×d}}  λ_1 ‖Δ̃F‖_{21} + (λ_2/2) tr(F^T L̃ F) + (1/2) ‖F − X‖_F^2,    (8)

where ‖Δ̃F‖_{21} = Σ_{i=1}^{m} ‖(Δ̃F)_i‖_2. In the following subsections, we use ‖·‖_p to represent both the ℓ_1 norm in (7) and the ℓ_21 norm in (8) if not specified.

3.2 Elastic Message Passing

For the ℓ_2-based graph smoother, message passing schemes can be derived from the gradient descent iterations of the graph signal denoising problem, as in the case of GCN and APPNP (Ma et al., 2020). However, computing the estimators defined by (7) and (8) is much more challenging because of the non-smoothness, and because the non-smooth ℓ_1 (or ℓ_21) term and the fidelity term are non-separable, as they are coupled by the graph difference operator Δ̃. In the literature, researchers have developed optimization algorithms for the graph trend filtering problem (4), such as the Alternating Direction Method of Multipliers (ADMM) and Newton-type algorithms (Wang et al., 2016; Varma et al., 2019). However, these algorithms require solving a non-trivial minimization sub-problem in every iteration, which incurs high computational complexity. Moreover, it is unclear how to make these iterations compatible with the back-propagation training of deep learning models. This motivates us to design an algorithm that is not only efficient but also friendly to back-propagation training. To this end, we propose to solve an equivalent saddle point problem using a primal-dual algorithm with efficient computations.

Saddle point reformulation. For a general convex function f(·), its conjugate function is defined as f*(Y) := sup_X ⟨X, Y⟩ − f(X). Using f(Δ̃F) = max_Z ⟨Δ̃F, Z⟩ − f*(Z), the problems (7) and (8) can be equivalently written as the following saddle point problem:

min_F max_Z  (1/2) ‖F − X‖_F^2 + (λ_2/2) tr(F^T L̃ F) + ⟨Δ̃F, Z⟩ − f*(Z),    (9)

where f(·) = λ_1 ‖·‖_p and Z ∈ R^{m×d} is the dual variable. Motivated by the Proximal Alternating Predictor-Corrector (PAPC) (Loris and Verhoeven, 2011; Chen et al., 2013), we propose an efficient algorithm with low per-iteration computational complexity and a convergence guarantee:

F̄^{k+1} = F^k − γ(F^k − X) − γλ_2 L̃ F^k − γ Δ̃^T Z^k,    (10)
Z^{k+1} = prox_{βf*}(Z^k + β Δ̃ F̄^{k+1}),    (11)
F^{k+1} = F^k − γ(F^k − X) − γλ_2 L̃ F^k − γ Δ̃^T Z^{k+1},    (12)

where prox_{βf*}(·) denotes the proximal operator of βf*. The stepsizes, γ and β, will be specified later. The first step (10) obtains a prediction of F^{k+1}, i.e., F̄^{k+1}, by a gradient descent step on the primal variable F. The second step (11) is a proximal dual ascent step on the dual variable Z based on the predicted F̄^{k+1}. Finally, another gradient descent step on the primal variable, now based on Z^{k+1}, gives the next iterate F^{k+1} (12). The scheme (10)–(12) can be interpreted as a "predict-correct" algorithm for the saddle point problem (9). Next, we demonstrate how to compute the proximal operator in Eq. (11).

Proximal operators. Using Moreau's decomposition principle (Bauschke and Combettes, 2011),

prox_{βf*}(Y) = Y − β prox_{β^{-1}f}(Y/β),    (13)

we can rewrite the step (11) using the proximal operator of f(·) = λ_1 ‖·‖_p, that is, Z^{k+1} = Y^{k+1} − β prox_{β^{-1}f}(Y^{k+1}/β) with Y^{k+1} = Z^k + β Δ̃ F̄^{k+1}.

Figure 1: Elastic Message Passing (EMP), with the stepsizes γ and β set as in Theorem 1.

We discuss the two options for the function f(·) corresponding to the objectives (7) and (8).

  • Option I (ℓ_1 norm): f(·) = λ_1 ‖·‖_1.

    By definition, the proximal operator of β^{-1}f is

    prox_{β^{-1}f}(Y) = argmin_X (1/2) ‖X − Y‖_F^2 + (λ_1/β) ‖X‖_1,

    which is equivalent to the soft-thresholding operator (component-wise):

    (prox_{β^{-1}f}(Y))_{ij} = sign(Y_{ij}) · max(|Y_{ij}| − λ_1/β, 0).

    Therefore, using (13), we have

    Z^{k+1} = min(|Y^{k+1}|, λ_1) ⊙ sign(Y^{k+1}),  where Y^{k+1} = Z^k + β Δ̃ F̄^{k+1},    (14)

    which is a component-wise projection onto the ℓ_∞ ball of radius λ_1.

  • Option II (ℓ_21 norm): f(·) = λ_1 ‖·‖_{21}.

    By definition, the proximal operator of β^{-1}f is

    prox_{β^{-1}f}(Y) = argmin_X (1/2) ‖X − Y‖_F^2 + (λ_1/β) ‖X‖_{21},

    with the i-th row being

    (prox_{β^{-1}f}(Y))_i = max(1 − λ_1/(β ‖Y_i‖_2), 0) · Y_i.

    Similarly, using (13), we have the i-th row of Z^{k+1} being

    Z^{k+1}_i = min(‖Y^{k+1}_i‖_2, λ_1) · Y^{k+1}_i / ‖Y^{k+1}_i‖_2,  where Y^{k+1} = Z^k + β Δ̃ F̄^{k+1},    (15)

    which is a row-wise projection onto the ℓ_2 ball of radius λ_1. Note that the proximal operator in the ℓ_1 norm case treats each feature dimension independently, while in the ℓ_21 norm case it couples the multi-dimensional features, which is consistent with the motivation of exploiting the correlation among feature dimensions.
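Both dual updates are plain projections and are easy to implement. The sketch below (our own helper names) codes the entry-wise ℓ_∞ projection (Option I) and the row-wise ℓ_2 projection (Option II), and sanity-checks Option I against Moreau's decomposition prox_{βf*}(Y) = Y − β·prox_{β^{-1}f}(Y/β), where the prox of the scaled ℓ_1 norm is entry-wise soft-thresholding:

```python
import numpy as np

def project_linf(Y, lam):
    """Option I: clip every entry of Y into [-lam, lam]."""
    return np.clip(Y, -lam, lam)

def project_l2_rows(Y, lam):
    """Option II: shrink each row of Y onto the l2 ball of radius lam."""
    norms = np.maximum(np.linalg.norm(Y, axis=1, keepdims=True), 1e-12)
    return Y * np.minimum(norms, lam) / norms

def soft_threshold(Y, t):
    """Entry-wise soft-thresholding, the prox of t * ||.||_1."""
    return np.sign(Y) * np.maximum(np.abs(Y) - t, 0.0)

rng = np.random.default_rng(2)
Y = rng.normal(scale=3.0, size=(6, 4))
lam, beta = 1.5, 0.7

# Moreau route: prox of the conjugate via the prox of f = lam * ||.||_1.
via_moreau = Y - beta * soft_threshold(Y / beta, lam / beta)
```

Entries with |Y_{ij}| ≤ λ_1 pass through unchanged on both routes, while larger entries are clipped to ±λ_1, matching Eq. (14).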

The algorithm (10)–(12) together with the proximal operators (14) and (15) enables us to derive the final message passing scheme. Note that the computation in steps (10) and (12) can be shared to save computation. Therefore, we decompose the step (10) into two steps:

X̄^k = F^k − γ(F^k − X) − γλ_2 L̃ F^k,    (16)
F̄^{k+1} = X̄^k − γ Δ̃^T Z^k.    (17)

In this work, we choose γ = 1/(1 + λ_2) and β = 1/(2γ). Therefore, with L̃ = I − Ã, Eq. (16) can be simplified as

X̄^k = γ X + γλ_2 Ã F^k.    (18)

Let Y^{k+1} = Z^k + β Δ̃ F̄^{k+1}; then steps (11) and (12) become

Z^{k+1} = prox_{βf*}(Y^{k+1}),  F^{k+1} = X̄^k − γ Δ̃^T Z^{k+1}.    (19)

Substituting the proximal operator in (19) with (14) or (15), we obtain the complete elastic message passing scheme (EMP), as summarized in Figure 1.
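To make the scheme concrete, here is a minimal NumPy sketch of EMP as we read it from Eqs. (16)–(19); the function name `emp`, the dense-matrix implementation, and the toy setup are our own simplifications (the paper's Figure 1 is the authoritative statement, and a practical implementation would use sparse matrices):

```python
import numpy as np

def emp(X, A, lam1, lam2, K=300, l21=False):
    """Elastic message passing on a dense adjacency matrix A (sketch)."""
    n = A.shape[0]
    edges = [(i, j) for i in range(n) for j in range(i + 1, n) if A[i, j] > 0]
    d = A.sum(1) + 1.0                               # degrees with self-loops
    Delta = np.zeros((len(edges), n))
    for r, (i, j) in enumerate(edges):
        Delta[r, i], Delta[r, j] = -1.0, 1.0
    Dt = Delta / np.sqrt(d)                          # normalized incident matrix
    At = (A + np.eye(n)) / np.sqrt(np.outer(d, d))   # normalized adjacency

    gamma = 1.0 / (1.0 + lam2)                       # stepsizes from Theorem 1
    beta = 1.0 / (2.0 * gamma)
    F, Z = X.copy(), np.zeros((len(edges), X.shape[1]))
    for _ in range(K):
        Xbar = gamma * X + gamma * lam2 * (At @ F)   # Eq. (18): shared MP step
        Fbar = Xbar - gamma * (Dt.T @ Z)             # Eq. (17): predictor
        Y = Z + beta * (Dt @ Fbar)
        if l21:                                      # Eq. (15): row-wise l2 projection
            nrm = np.maximum(np.linalg.norm(Y, axis=1, keepdims=True), 1e-12)
            Z = Y * np.minimum(nrm, lam1) / nrm
        else:                                        # Eq. (14): entry-wise clipping
            Z = np.clip(Y, -lam1, lam1)
        F = Xbar - gamma * (Dt.T @ Z)                # Eq. (19): corrector
    return F

# Sanity setup: with lam1 = 0 the dual variable stays at zero, so EMP must
# match the l2 closed form (I + lam2 * L_tilde)^{-1} X.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
X = np.random.default_rng(4).normal(size=(5, 3))
F0 = emp(X, A, lam1=0.0, lam2=4.0)
d = A.sum(1) + 1.0
L_tilde = np.eye(5) - (A + np.eye(5)) / np.sqrt(np.outer(d, d))
F_closed = np.linalg.solve(np.eye(5) + 4.0 * L_tilde, X)
```

The λ_1 = 0 comparison is a useful unit test precisely because of Remark 2: in that regime EMP collapses to the APPNP-style linear smoother with a known fixed point.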

Interpretation of EMP. EMP can be interpreted as the standard message passing (MP) (the X̄^k update in Fig. 1) with extra operations (the following steps). The extra operations compute Δ̃^T Z^{k+1} to adjust the standard MP, such that sparsity in Δ̃F is promoted and some large node differences can be preserved. EMP is general and covers some existing propagation rules as special cases, as demonstrated in Remark 2.

Remark 2 (Special cases).

If there is only ℓ_2-based regularization, i.e., λ_1 = 0, then the projection operator gives Z^{k+1} = 0. Therefore, with γ = 1/(1 + λ_2), the proposed message passing scheme reduces to

F^{k+1} = γ X + γλ_2 Ã F^k = (λ_2/(1 + λ_2)) Ã F^k + (1/(1 + λ_2)) X.

If λ_2 = 1/α − 1, it recovers the message passing in APPNP:

F^{k+1} = (1 − α) Ã F^k + α X.

If λ_2 → ∞, it recovers the simple aggregation operation in many GNNs:

F^{k+1} = Ã F^k.

Computation Complexity. EMP is efficient and composed of simple operations. The major computation cost comes from four sparse matrix multiplications: Ã F^k, Δ̃^T Z^k, Δ̃ F̄^{k+1}, and Δ̃^T Z^{k+1}. The computation complexity is of order O(md), where m is the number of edges in graph G and d is the feature dimension of the input signal X. The other operations are simple matrix additions and projections.

The convergence of EMP and the parameter settings are justified by Theorem 1, with a proof deferred to Appendix B.

Theorem 1 (Convergence).

Under the stepsize setting γ ≤ 1/(1 + λ_2) and β ≤ 1/(γ ‖Δ̃‖_2^2), the elastic message passing scheme (EMP) in Figure 1 converges to the optimal solution of the elastic graph signal estimator defined in (7) (Option I) or (8) (Option II). It is sufficient to choose any γ ≤ 1/(1 + λ_2) and β ≤ 1/(2γ), since ‖Δ̃‖_2^2 = ‖L̃‖_2 ≤ 2.

3.3 Elastic GNNs

Incorporating the elastic message passing scheme derived from the elastic graph signal estimators (7) and (8) into deep neural networks, we introduce a family of GNNs, namely Elastic GNNs. In this work, we follow the decoupled design proposed in APPNP (Klicpera et al., 2018), where we first make predictions from the node features and then aggregate the predictions through the proposed EMP:

Y_pred = EMP(h_θ(X); K),

where X ∈ R^{n×d} denotes the node features, h_θ(·) is any machine learning model, such as a multilayer perceptron (MLP), θ is the set of learnable parameters in the model, and K is the number of message passing steps. The training objective is the cross-entropy loss defined on the final predictions and the labels of the training data. Elastic GNNs also have the following nice properties:

  • In addition to the backbone neural network model, Elastic GNNs only require setting up three hyperparameters, namely the two coefficients λ_1 and λ_2 and the propagation step K, but they do not introduce any learnable parameters. Therefore, they reduce the risk of overfitting.

  • The hyperparameters λ_1 and λ_2 provide better smoothness adaptivity to Elastic GNNs, depending on the smoothness properties of the graph data.

  • The message passing scheme only entails simple and efficient operations, which makes it friendly to the efficient and end-to-end back-propagation training of the whole GNN model.
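As a schematic of this decoupled design (all names, sizes, and the one-hidden-layer MLP are our own illustrative choices, not the authors' implementation), the forward pass first computes per-node logits with an MLP and then propagates them with a compact ℓ_21 + ℓ_2 variant of EMP; only the MLP weights are learnable:

```python
import numpy as np

def mlp_forward(X, W1, b1, W2, b2):
    """h_theta: a one-hidden-layer MLP producing per-node class logits."""
    H = np.maximum(X @ W1 + b1, 0.0)   # ReLU hidden layer
    return H @ W2 + b2

def emp_propagate(H, A, lam1, lam2, K):
    """Parameter-free propagation of the logits H (l21 + l2 variant, dense sketch)."""
    n, edges = A.shape[0], np.argwhere(np.triu(A) > 0)
    d = A.sum(1) + 1.0
    Delta = np.zeros((len(edges), n))
    for r, (i, j) in enumerate(edges):
        Delta[r, i], Delta[r, j] = -1.0, 1.0
    Dt = Delta / np.sqrt(d)
    At = (A + np.eye(n)) / np.sqrt(np.outer(d, d))
    gamma, beta = 1.0 / (1.0 + lam2), (1.0 + lam2) / 2.0
    F, Z = H.copy(), np.zeros((len(edges), H.shape[1]))
    for _ in range(K):
        Xbar = gamma * H + gamma * lam2 * (At @ F)
        Y = Z + beta * (Dt @ (Xbar - gamma * (Dt.T @ Z)))
        nrm = np.maximum(np.linalg.norm(Y, axis=1, keepdims=True), 1e-12)
        Z = Y * np.minimum(nrm, lam1) / nrm        # row-wise l2 projection
        F = Xbar - gamma * (Dt.T @ Z)
    return F

rng = np.random.default_rng(3)
n, d_in, d_hid, n_cls = 6, 8, 16, 3
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T                     # random undirected graph
X = rng.normal(size=(n, d_in))
W1, b1 = 0.1 * rng.normal(size=(d_in, d_hid)), np.zeros(d_hid)
W2, b2 = 0.1 * rng.normal(size=(d_hid, n_cls)), np.zeros(n_cls)

logits = emp_propagate(mlp_forward(X, W1, b1, W2, b2), A, lam1=0.5, lam2=3.0, K=10)
```

In training, a cross-entropy loss on `logits` for labeled nodes would be back-propagated through both the propagation loop and the MLP, which is possible precisely because EMP consists only of differentiable (almost everywhere) matrix operations and projections.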

4 Experiment

In this section, we conduct experiments to validate the effectiveness of the proposed Elastic GNNs. We first introduce the experimental settings. Then we assess the performance of Elastic GNNs and investigate the benefits of introducing ℓ_1-based graph smoothing into GNNs on semi-supervised learning tasks under normal and adversarial settings. In the ablation study, we validate the local adaptive smoothness, sparsity pattern, and convergence of EMP.

4.1 Experimental Settings


Datasets. We conduct experiments on 8 real-world datasets, including three citation graphs, i.e., Cora, Citeseer and Pubmed (Sen et al., 2008), two co-authorship graphs, i.e., Coauthor CS and Coauthor Physics (Shchur et al., 2018), two co-purchase graphs, i.e., Amazon Computers and Amazon Photo (Shchur et al., 2018), and one blog graph, i.e., Polblogs (Adamic and Glance, 2005). In the Polblogs graph, node features are not available, so we set the feature matrix to be an identity matrix.

Baselines. We compare the proposed Elastic GNNs with representative GNNs, including GCN (Kipf and Welling, 2016), GAT (Veličković et al., 2017), ChebNet (Defferrard et al., 2016), GraphSAGE (Hamilton et al., 2017), APPNP (Klicpera et al., 2018) and SGC (Wu et al., 2019). For all models, we use two-layer neural networks with 64 hidden units.

Parameter settings. For each experiment, we report the average performance and the standard deviation over 10 runs. For all methods, hyperparameters are tuned via grid search over: 1) the learning rate; 2) the weight decay: {5e-4, 5e-5, 5e-6}; and 3) the dropout rate: {0.5, 0.8}. For APPNP, the propagation step K and the teleport parameter α are also tuned. For Elastic GNNs, the propagation step K and the coefficients λ_1 and λ_2 are also tuned. As suggested by Theorem 1, we set γ = 1/(1 + λ_2) and β = 1/(2γ) in the proposed elastic message passing scheme. The Adam optimizer (Kingma and Ba, 2014) is used in all experiments.

4.2 Performance on Benchmark Datasets

On commonly used datasets, including Cora, CiteSeer, PubMed, Coauthor CS, Coauthor Physics, Amazon Computers and Amazon Photo, we compare the performance of the proposed Elastic GNN with representative GNN baselines on the semi-supervised learning task. The detailed statistics of these datasets and the data splits are summarized in Table 5 in Appendix A. The classification accuracies are shown in Table 1. From these results, we can make the following observations:

  • Elastic GNN outperforms GCN, GAT, ChebNet, GraphSAGE and SGC by significant margins on all datasets; in particular, it clearly improves over GCN on Cora, CiteSeer and PubMed. The improvement comes from the global and local smoothness adaptivity of Elastic GNN.

  • Elastic GNN consistently achieves higher performance than APPNP on all datasets. Essentially, Elastic GNN covers APPNP as a special case when there is only ℓ_2 regularization, i.e., λ_1 = 0. Beyond ℓ_2-based graph smoothing, the ℓ_1-based graph smoothing further enhances the local smoothness adaptivity. This comparison verifies the benefits of introducing ℓ_1-based graph smoothing into GNNs.

Table 1: Classification accuracy (%) on benchmark datasets (Cora, CiteSeer, PubMed, Coauthor CS, Coauthor Physics, Amazon Computers, Amazon Photo) over random data splits; ChebNet runs out of memory (OOM) on one of the datasets. [Per-model accuracy values are not recoverable from this extraction.]
Dataset | Ptb Rate | GCN | GAT | ℓ_2 | ℓ_1 | ℓ_21 | ℓ_1+ℓ_2 | ℓ_21+ℓ_2
Cora | 0% | 83.5±0.4 | 84.0±0.7 | 85.8±0.4 | 85.1±0.5 | 85.3±0.4 | 85.8±0.4 | 85.8±0.4
Cora | 5% | 76.6±0.8 | 80.4±0.7 | 81.0±1.0 | 82.3±1.1 | 81.6±1.1 | 81.9±1.4 | 82.2±0.9
Cora | 10% | 70.4±1.3 | 75.6±0.6 | 76.3±1.5 | 76.2±1.4 | 77.9±0.9 | 78.2±1.6 | 78.8±1.7
Cora | 15% | 65.1±0.7 | 69.8±1.3 | 72.2±0.9 | 73.3±1.3 | 75.7±1.2 | 76.9±0.9 | 77.2±1.6
Cora | 20% | 60.0±2.7 | 59.9±0.6 | 67.7±0.7 | 63.7±0.9 | 70.3±1.1 | 67.2±5.3 | 70.5±1.3
Citeseer | 0% | 72.0±0.6 | 73.3±0.8 | 73.6±0.9 | 73.2±0.6 | 73.2±0.5 | 73.6±0.6 | 73.8±0.6
Citeseer | 5% | 70.9±0.6 | 72.9±0.8 | 72.8±0.5 | 72.8±0.5 | 72.8±0.5 | 73.3±0.6 | 72.9±0.5
Citeseer | 10% | 67.6±0.9 | 70.6±0.5 | 70.2±0.6 | 70.8±0.6 | 70.7±1.2 | 72.4±0.9 | 72.6±0.4
Citeseer | 15% | 64.5±1.1 | 69.0±1.1 | 70.2±0.6 | 68.1±1.4 | 68.2±1.1 | 71.3±1.5 | 71.9±0.7
Citeseer | 20% | 62.0±3.5 | 61.0±1.5 | 64.9±1.0 | 64.7±0.8 | 64.7±0.8 | 64.7±0.8 | 64.7±0.8
Polblogs | 0% | 95.7±0.4 | 95.4±0.2 | 95.4±0.2 | 95.8±0.3 | 95.8±0.3 | 95.8±0.3 | 95.8±0.3
Polblogs | 5% | 73.1±0.8 | 83.7±1.5 | 82.8±0.3 | 78.7±0.6 | 78.7±0.7 | 82.8±0.4 | 83.0±0.3
Polblogs | 10% | 70.7±1.1 | 76.3±0.9 | 73.7±0.3 | 75.2±0.4 | 75.3±0.7 | 81.5±0.2 | 81.6±0.3
Polblogs | 15% | 65.0±1.9 | 68.8±1.1 | 68.9±0.9 | 72.1±0.9 | 71.5±1.1 | 77.8±0.9 | 78.7±0.5
Polblogs | 20% | 51.3±1.2 | 51.5±1.6 | 65.5±0.7 | 68.1±0.6 | 68.7±0.7 | 77.4±0.2 | 77.5±0.2
Pubmed | 0% | 87.2±0.1 | 83.7±0.4 | 88.1±0.1 | 86.7±0.1 | 87.3±0.1 | 88.1±0.1 | 88.1±0.1
Pubmed | 5% | 83.1±0.1 | 78.0±0.4 | 87.1±0.2 | 86.2±0.1 | 87.0±0.1 | 87.1±0.2 | 87.1±0.2
Pubmed | 10% | 81.2±0.1 | 74.9±0.4 | 86.6±0.1 | 86.0±0.2 | 86.9±0.2 | 86.3±0.1 | 87.0±0.1
Pubmed | 15% | 78.7±0.1 | 71.1±0.5 | 85.7±0.2 | 85.4±0.2 | 86.4±0.2 | 85.5±0.1 | 86.4±0.2
Pubmed | 20% | 77.4±0.2 | 68.2±1.0 | 85.8±0.1 | 85.4±0.1 | 86.4±0.1 | 85.4±0.1 | 86.4±0.1
Table 2: Classification accuracy (%) under different perturbation rates of adversarial graph attack. The basic GNNs are GCN and GAT; the remaining columns are the Elastic GNN variants.

4.3 Robustness Under Adversarial Attack

Locally adaptive smoothness makes Elastic GNNs more robust to adversarial attacks on the graph structure. This is because such attacks tend to connect nodes with different labels, which blurs the cluster structure in the graph. EMP, however, can tolerate large node differences along these wrong edges while maintaining smoothness along the correct edges.

To validate this, we evaluate the performance of Elastic GNNs under untargeted adversarial graph attacks, which try to degrade the overall performance of GNN models by deliberately modifying the graph structure. We use MetaAttack (Zügner and Günnemann, 2019) as implemented in DeepRobust (Li et al., 2020), a PyTorch library for adversarial attacks and defenses (https://github.com/DSE-MSU/DeepRobust), to generate the adversarially attacked graphs based on four datasets: Cora, CiteSeer, Polblogs and PubMed. We randomly split the nodes into training, validation and test sets. The detailed data statistics are summarized in Table 6 in Appendix A. Note that, following prior work (Zügner et al., 2018; Zügner and Günnemann, 2019; Entezari et al., 2020; Jin et al., 2020), we only consider the largest connected component (LCC) of the adversarial graphs. Therefore, the results in Table 2 are not directly comparable with the results in Table 1. We focus on investigating the robustness introduced by ℓ_1-based graph smoothing rather than on adversarial defense, so we do not compare with defense strategies; existing defense strategies can be applied on top of Elastic GNNs to further improve their robustness against attacks.

Variants of Elastic GNNs. To make a deeper investigation of Elastic GNNs, we consider the following variants: (1) Elastic GNN (ℓ_2); (2) Elastic GNN (ℓ_1); (3) Elastic GNN (ℓ_21); (4) Elastic GNN (ℓ_1 + ℓ_2); (5) Elastic GNN (ℓ_21 + ℓ_2). To save computation, we fix the learning rate, weight decay, dropout rate and propagation step K, since this setting works well for the chosen datasets and models; only λ_1 and λ_2 are tuned. The classification accuracy under perturbation rates ranging from 0% to 20% is summarized in Table 2. From the results, we can make the following observations:

  • All variants of Elastic GNNs outperform GCN and GAT by significant margins under all perturbation rates. For instance, when the perturbation rate is 20%, Elastic GNN (ℓ_21 + ℓ_2) improves over GCN by 10.5%, 2.7%, 26.2% and 9.0% on the four datasets being considered. This is because Elastic GNNs can adapt to the change of smoothness, while GCN and GAT cannot adapt well when the perturbation rate increases.

  • Elastic GNN (ℓ_21) outperforms Elastic GNN (ℓ_1) in most cases, and Elastic GNN (ℓ_21 + ℓ_2) outperforms Elastic GNN (ℓ_1 + ℓ_2) in almost all cases. This demonstrates the benefits of exploiting the correlation between feature channels by coupling the multi-dimensional features via the ℓ_21 norm.

  • Elastic GNN (ℓ_1) outperforms Elastic GNN (ℓ_2) in most cases, which suggests the benefits of local smoothness adaptivity. When ℓ_1 and ℓ_2 are combined, Elastic GNN (ℓ_21 + ℓ_2) achieves significantly better performance than the ℓ_1, ℓ_2 or ℓ_21 variants alone in almost all cases. This suggests that ℓ_1 and ℓ_2-based graph smoothing are complementary to each other, and combining them provides significantly better robustness against adversarial graph attacks.

4.4 Ablation Study

We provide an ablation study to further investigate the adaptive smoothness, sparsity pattern, and convergence of EMP in Elastic GNN, based on three datasets: Cora, CiteSeer and PubMed. In this section, we fix the coefficients for Elastic GNN and for APPNP, as well as the learning rate, weight decay and dropout rate, since this setting works well for both methods.

Adaptive smoothness. It is expected that ℓ_1-based smoothing enhances local smoothness adaptivity by increasing the smoothness along correct edges (connecting nodes with the same label) while lowering the smoothness along wrong edges (connecting nodes with different labels). To validate this, we compute the average adjacent-node differences (based on node features in the last layer) along wrong and correct edges separately, and use the ratio between these two averages to measure the smoothness adaptivity. The results are summarized in Table 3. It is clearly observed that, for all datasets, the ratio for Elastic GNN is significantly higher than that of ℓ_2-based methods such as APPNP, which validates its better local smoothness adaptivity.
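The ratio described here can be sketched as a small helper (our own code; the paper does not specify the exact implementation), shown on an idealized two-cluster example where the features are nearly piecewise constant:

```python
import numpy as np

def wrong_to_correct_ratio(F, edges, labels):
    """Average adjacent-node difference along wrong edges (endpoints with
    different labels) divided by the average along correct edges."""
    diffs = np.array([np.linalg.norm(F[i] - F[j]) for i, j in edges])
    wrong = np.array([labels[i] != labels[j] for i, j in edges])
    return diffs[wrong].mean() / diffs[~wrong].mean()

# Two clusters joined by one "wrong" edge (2, 3); near-piecewise-constant
# features keep within-cluster differences tiny and the jump large.
labels = np.array([0, 0, 0, 1, 1, 1])
edges = [(0, 1), (1, 2), (3, 4), (4, 5), (2, 3)]
F = np.array([[0.0], [0.0], [0.1], [5.0], [5.0], [5.1]])
ratio = wrong_to_correct_ratio(F, edges, labels)
```

A locally adaptive smoother should drive this ratio up: it flattens features inside clusters (the denominator) while preserving the jump across the wrong edge (the numerator).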

Sparsity pattern. To validate the piecewise constant property enforced by EMP, we also investigate the sparsity pattern of the adjacent node differences, i.e., Δ̃F, based on node features in the last layer. The node difference along an edge is counted as sparse if its norm is negligibly small. The sparsity ratios for the ℓ_2-based method (APPNP) and the ℓ_1-based method (Elastic GNN) are summarized in Table 4. It can be observed that in Elastic GNN, a significant portion of the node differences are sparse for all datasets, while in APPNP this portion is much smaller. This sparsity pattern validates the piecewise constant prior as designed.

Table 3: Ratio between the average node differences along wrong edges and along correct edges, for APPNP and Elastic GNN on Cora, CiteSeer and PubMed. [Values not recoverable from this extraction.]
Table 4: Sparsity ratio in the node differences Δ̃F, for APPNP and Elastic GNN on Cora, CiteSeer and PubMed. [Values not recoverable from this extraction.]

Convergence of EMP. We provide two additional experiments to demonstrate the impact of the propagation step K on classification performance and the convergence of the message passing scheme. Figure 2 shows that the classification accuracy increases as the propagation step K increases. This verifies the effectiveness of EMP in improving graph representation learning. It also shows that a small number of propagation steps can achieve very good performance, so the computational cost of EMP can be kept small. Figure 3 shows the decrease of the objective value defined in Eq. (8) during the forward message passing process, which verifies the convergence of the proposed EMP, as suggested by Theorem 1.

Figure 2: Classification accuracy under different propagation steps.
Figure 3: Convergence of the objective value for the problem in Eq. (8) during message passing.

5 Related Work

The design of GNN architectures is mainly motivated from the spectral domain (Kipf and Welling, 2016; Defferrard et al., 2016) and the spatial domain (Hamilton et al., 2017; Veličković et al., 2017; Scarselli et al., 2008; Gilmer et al., 2017). The message passing scheme (Gilmer et al., 2017; Ma and Tang, 2020) for feature aggregation is one central component of GNNs. Recent works have shown that the message passing in GNNs can be regarded as low-pass graph filtering (Nt and Maehara, 2019; Zhao and Akoglu, 2019). More generally, it has recently been proved that the message passing in many GNNs can be unified in the graph signal denoising framework (Ma et al., 2020; Pan et al., 2020; Zhu et al., 2021; Chen et al., 2020). We point out that these schemes intrinsically perform ℓ_2-based graph smoothing and typically can be represented as linear smoothers.

ℓ_1-based graph signal denoising has been explored in graph trend filtering (Wang et al., 2016; Varma et al., 2019), which tends to provide estimators that are k-th order piecewise polynomials over graphs. Graph total variation has also been utilized in semi-supervised learning (Nie et al., 2011; Jung et al., 2016, 2019; Aviles-Rivero et al., 2019), spectral clustering (Bühler and Hein, 2009; Bresson et al., 2013b), and graph cut problems (Szlam and Bresson, 2010; Bresson et al., 2013a). However, it is unclear whether these algorithms can be used to design GNNs. To the best of our knowledge, we make the first such investigation in this work.
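The contrast between the two penalties is visible already on a single edge. For two nodes with difference d = f1 − f2 and input difference dx = x1 − x2, the ℓ_1 (fused-lasso-style) penalty soft-thresholds the edge difference, setting small differences exactly to zero, while the ℓ_2 penalty only shrinks it proportionally. These are standard fused-lasso closed forms rather than equations from the paper; the sketch confirms them by brute-force grid search.

```python
# Two-node toy contrasting the penalties. With d = f1 - f2 and dx = x1 - x2,
# minimizing (d - dx)^2/4 + lam*|d| gives soft-thresholding at 2*lam, while
# (d - dx)^2/4 + (lam/2)*d^2 gives proportional shrinkage dx/(1 + 2*lam).

def soft(t, thr):
    return max(abs(t) - thr, 0.0) * (1.0 if t >= 0 else -1.0)

def best_diff(dx, penalty, lam, n=200001, span=10.0):
    # brute-force 1-D search for the minimizing difference d
    best_d, best_val = 0.0, float("inf")
    for k in range(n):
        d = -span + 2.0 * span * k / (n - 1)
        val = (d - dx) ** 2 / 4.0 + lam * penalty(d)
        if val < best_val:
            best_d, best_val = d, val
    return best_d

lam = 0.3
d_l1 = best_diff(1.0, abs, lam)                    # l1: soft threshold
d_l2 = best_diff(1.0, lambda d: d * d / 2.0, lam)  # l2: shrinkage

assert abs(d_l1 - soft(1.0, 2 * lam)) < 1e-3       # matches soft(dx, 2*lam)
assert abs(d_l2 - 1.0 / (1 + 2 * lam)) < 1e-3      # matches dx/(1 + 2*lam)
assert abs(best_diff(0.5, abs, lam)) < 1e-3        # small diff -> exactly 0
```

The last assertion is the sparsity effect behind piecewise-constant estimates: ℓ_1 removes small differences entirely while leaving large jumps lightly penalized, whereas ℓ_2 shrinks every difference by the same factor.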

6 Conclusion

In this work, we propose to enhance the smoothness adaptivity of GNNs via ℓ_1 and ℓ_2-based graph smoothing. Through the proposed elastic graph signal estimator, we derive a novel, efficient and general message passing scheme, i.e., elastic message passing (EMP). Integrating the proposed message passing scheme with deep neural networks leads to a family of GNNs, i.e., Elastic GNNs. Extensive experiments on benchmark datasets and adversarially attacked graphs demonstrate the benefits of introducing ℓ_1-based graph smoothing in the design of GNNs. The empirical study suggests that ℓ_1 and ℓ_2-based graph smoothing are complementary to each other, and the proposed Elastic GNNs have better smoothness adaptivity owing to the integration of ℓ_1 and ℓ_2-based graph smoothing. We hope the proposed elastic message passing scheme can inspire more powerful GNN architecture designs in the future.


Acknowledgements

This research is supported by the National Science Foundation (NSF) under grant numbers CNS-1815636, IIS-1928278, IIS-1714741, IIS-1845081, IIS-1907704, IIS-1955285, and the Army Research Office (ARO) under grant number W911NF-21-1-0198. Ming Yan is supported by NSF grant DMS-2012439 and a Facebook Faculty Research Award (Systems for ML).


References

  • L. A. Adamic and N. Glance (2005) The political blogosphere and the 2004 us election: divided they blog. In Proceedings of the 3rd international workshop on Link discovery, pp. 36–43. Cited by: §4.1.
  • A. I. Aviles-Rivero, N. Papadakis, R. Li, S. M. Alsaleh, R. T. Tan, and C. Schonlieb (2019) When labelled data hurts: deep semi-supervised classification with the graph 1-Laplacian. arXiv preprint arXiv:1906.08635. Cited by: §5.
  • H. H. Bauschke and P. L. Combettes (2011) Convex analysis and monotone operator theory in Hilbert spaces. 1st edition, Springer Publishing Company, Incorporated. External Links: ISBN 1441994661 Cited by: §3.2.
  • X. Bresson, T. Laurent, D. Uminsky, and J. H. von Brecht (2013a) An adaptive total variation algorithm for computing the balanced cut of a graph. External Links: 1302.2717 Cited by: §5.
  • X. Bresson, T. Laurent, D. Uminsky, and J. H. Von Brecht (2013b) Multiclass total variation clustering. arXiv preprint arXiv:1306.1185. Cited by: §5.
  • T. Bühler and M. Hein (2009) Spectral clustering based on the graph p-Laplacian. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 81–88. Cited by: §5.
  • P. Chen, J. Huang, and X. Zhang (2013) A primal–dual fixed point algorithm for convex separable minimization with applications to image restoration. Inverse Problems 29 (2), pp. 025011. Cited by: Appendix B, §3.2.
  • S. Chen, Y. C. Eldar, and L. Zhao (2020) Graph unrolling networks: interpretable neural networks for graph signal denoising. External Links: 2006.01301 Cited by: §5.
  • F. R. K. Chung (1997) Spectral graph theory. American Mathematical Soc. Cited by: Appendix B.
  • M. Defferrard, X. Bresson, and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 3844–3852. Cited by: §1, §4.1, §5.
  • M. Elad (2010) Sparse and redundant representations: from theory to applications in signal and image processing. Springer Science & Business Media. Cited by: §1.
  • N. Entezari, S. A. Al-Sayouri, A. Darvishzadeh, and E. E. Papalexakis (2020) All you need is low (rank) defending against adversarial attacks on graphs. In WSDM, Cited by: §4.3.
  • J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl (2017) Neural message passing for quantum chemistry. In International Conference on Machine Learning, pp. 1263–1272. Cited by: §1, §5.
  • W. L. Hamilton, R. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. arXiv preprint arXiv:1706.02216. Cited by: §1, §4.1, §5.
  • T. Hastie, R. Tibshirani, and M. Wainwright (2015) Statistical learning with sparsity: the lasso and generalizations. Chapman Hall/CRC. External Links: ISBN 1498712169 Cited by: §1.
  • W. Jin, Y. Ma, X. Liu, X. Tang, S. Wang, and J. Tang (2020) Graph structure learning for robust graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 66–74. Cited by: §4.3.
  • A. Jung, A. O. Hero, III, A. C. Mara, S. Jahromi, A. Heimowitz, and Y. C. Eldar (2019) Semi-supervised learning in network-structured data via total variation minimization. IEEE Transactions on Signal Processing 67 (24), pp. 6256–6269. External Links: Document Cited by: §5.
  • A. Jung, A. O. Hero III, A. Mara, and S. Jahromi (2016) Semi-supervised learning via sparse label propagation. arXiv preprint arXiv:1612.01414. Cited by: §5.
  • S. Kim, K. Koh, S. Boyd, and D. Gorinevsky (2009) Trend filtering. SIAM review 51 (2), pp. 339–360. Cited by: §1.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.
  • T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §1, §2.1, §4.1, §5.
  • J. Klicpera, A. Bojchevski, and S. Günnemann (2018) Predict then propagate: graph neural networks meet personalized pagerank. arXiv preprint arXiv:1810.05997. Cited by: §2.1, §3.3, §4.1.
  • Y. Li, W. Jin, H. Xu, and J. Tang (2020) DeepRobust: a PyTorch library for adversarial attacks and defenses. External Links: 2005.06149 Cited by: §4.3.
  • Z. Li and M. Yan (2017) A primal-dual algorithm with optimal stepsizes and its application in decentralized consensus optimization. arXiv preprint arXiv:1711.06785. Cited by: Appendix B.
  • I. Loris and C. Verhoeven (2011) On a generalization of the iterative soft-thresholding algorithm for the case of non-separable penalty. Inverse Problems 27 (12), pp. 125007. Cited by: Appendix B, §3.2.
  • Y. Ma, X. Liu, T. Zhao, Y. Liu, J. Tang, and N. Shah (2020) A unified view on graph neural networks as graph signal denoising. arXiv preprint arXiv:2010.01777. Cited by: §1, §2.1, §2.1, §3.2, §5.
  • Y. Ma and J. Tang (2020) Deep learning on graphs. Cambridge University Press. Cited by: §5.
  • F. Nie, H. Wang, H. Huang, and C. Ding (2011) Unsupervised and semi-supervised learning via ℓ_1-norm graph. In 2011 International Conference on Computer Vision, pp. 2268–2273. Cited by: §5.
  • H. Nt and T. Maehara (2019) Revisiting graph neural networks: all we have is low-pass filters. arXiv preprint arXiv:1905.09550. Cited by: §5.
  • X. Pan, S. Song, and G. Huang (2020) A unified framework for convolution-based graph neural networks. https://openreview.net/forum?id=zUMD–Fb9Bt. Cited by: §5.
  • L. I. Rudin, S. Osher, and E. Fatemi (1992) Nonlinear total variation based noise removal algorithms. Physica D: nonlinear phenomena 60 (1-4), pp. 259–268. Cited by: §1.
  • F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini (2008) The graph neural network model. IEEE transactions on neural networks 20 (1), pp. 61–80. Cited by: §1, §5.
  • P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad (2008) Collective classification in network data. AI magazine 29 (3), pp. 93–93. Cited by: §4.1.
  • J. Sharpnack, A. Singh, and A. Rinaldo (2012) Sparsistency of the edge lasso over graphs. In Artificial Intelligence and Statistics, pp. 1028–1036. Cited by: §1.
  • O. Shchur, M. Mumme, A. Bojchevski, and S. Günnemann (2018) Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868. Cited by: §4.1.
  • A. Szlam and X. Bresson (2010) Total variation and Cheeger cuts. In ICML, Cited by: §5.
  • R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight (2005) Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (1), pp. 91–108. Cited by: §1.
  • R. J. Tibshirani et al. (2014) Adaptive piecewise polynomial estimation via trend filtering. Annals of statistics 42 (1), pp. 285–323. Cited by: §1.
  • R. Varma, H. Lee, J. Kovačević, and Y. Chi (2019) Vector-valued graph trend filtering with non-convex penalties. IEEE Transactions on Signal and Information Processing over Networks 6, pp. 48–62. Cited by: §1, §3.1, §3.2, §5.
  • P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2017) Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §1, §4.1, §5.
  • Y. Wang, J. Sharpnack, A. J. Smola, and R. J. Tibshirani (2016) Trend filtering on graphs. Journal of Machine Learning Research 17, pp. 1–41. Cited by: §1, §2.2, §3.1, §3.2, §5.
  • F. Wu, A. Souza, T. Zhang, C. Fifty, T. Yu, and K. Weinberger (2019) Simplifying graph convolutional networks. In International conference on machine learning, pp. 6861–6871. Cited by: §4.1.
  • L. Zhao and L. Akoglu (2019) PairNorm: tackling oversmoothing in gnns. In International Conference on Learning Representations, Cited by: §5.
  • D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf (2004) Learning with local and global consistency. Cited by: §1.
  • M. Zhu, X. Wang, C. Shi, H. Ji, and P. Cui (2021) Interpreting and unifying graph neural networks with an optimization framework. External Links: 2101.11859 Cited by: §5.
  • X. Zhu and Z. Ghahramani (2002) Learning from labeled and unlabeled data with label propagation. Cited by: §1.
  • X. J. Zhu (2005) Semi-supervised learning literature survey. Cited by: §1.
  • D. Zügner, A. Akbarnejad, and S. Günnemann (2018) Adversarial attacks on neural networks for graph data. In KDD, Cited by: §4.3.
  • D. Zügner and S. Günnemann (2019) Adversarial attacks on graph neural networks via meta learning. arXiv preprint arXiv:1902.08412. Cited by: §4.3.

Appendix A Data Statistics

The data statistics for the benchmark datasets used in Section 4.2 are summarized in Table 5. The data statistics for the adversarially attacked graph used in Section 4.3 are summarized in Table 6.

Dataset Classes Nodes Edges Features Training Nodes Validation Nodes Test Nodes
Cora 7 2708 5278 1433 20 per class 500 1000
CiteSeer 6 3327 4552 3703 20 per class 500 1000
PubMed 3 19717 44324 500 20 per class 500 1000
Coauthor CS 15 18333 81894 6805 20 per class 30 per class Rest nodes
Coauthor Physics 5 34493 247962 8415 20 per class 30 per class Rest nodes
Amazon Computers 10 13381 245778 767 20 per class 30 per class Rest nodes
Amazon Photo 8 7487 119043 745 20 per class 30 per class Rest nodes
Table 5: Statistics of benchmark datasets.
Dataset Nodes Edges Classes Features
Cora 2,485 5,069 7 1,433
CiteSeer 2,110 3,668 6 3,703
Polblogs 1,222 16,714 2 /
PubMed 19,717 44,338 3 500
Table 6: Dataset statistics for the adversarially attacked graphs.

Appendix B Convergence Guarantee

We provide Theorem 1 to show the convergence guarantee of the proposed elastic message passing scheme and to offer practical guidance for parameter settings in EMP.

Theorem 1 (Convergence of EMP).

Under the stepsize setting γ ≤ 1/(1 + λ_2‖L̃‖_2) and β ≤ 1/(γ‖Δ̃‖_2^2), the elastic message passing scheme (EMP) in Figure 1 converges to the optimal solution of the elastic graph signal estimator defined in Eq. (7) (Option I) or Eq. (8) (Option II). It is sufficient to choose any γ ≤ 1/(1 + 2λ_2) and β ≤ 1/(2γ) since ‖Δ̃‖_2^2 = ‖L̃‖_2 ≤ 2.
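The sufficient stepsize choices rest on the classical spectral bound for the normalized Laplacian, ‖L̃‖_2 ≤ 2 (Chung, 1997, cited in this appendix). As a quick numerical sanity check, the sketch below estimates the largest eigenvalue of L̃ = I − D^{-1/2} A D^{-1/2} by power iteration on a small graph chosen arbitrarily for illustration.

```python
# Sanity check of the spectral bound behind the stepsize settings: the
# normalized Laplacian L~ = I - D^{-1/2} A D^{-1/2} satisfies ||L~||_2 <= 2
# (Chung, 1997). Power iteration on a small toy graph.

def norm_lap_apply(f, edges, deg):
    # apply L~ to a vector f for an unweighted simple graph
    out = list(f)
    for i, j in edges:
        w = (deg[i] * deg[j]) ** 0.5
        out[i] -= f[j] / w
        out[j] -= f[i] / w
    return out

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]  # 4-cycle plus a chord
deg = [0] * 4
for i, j in edges:
    deg[i] += 1
    deg[j] += 1

v = [1.0, -0.5, 0.3, 0.7]  # generic start vector, not in the null space
for _ in range(500):
    w = norm_lap_apply(v, edges, deg)
    nrm = sum(t * t for t in w) ** 0.5
    v = [t / nrm for t in w]

lv = norm_lap_apply(v, edges, deg)
lam_max = sum(a * b for a, b in zip(v, lv))  # Rayleigh quotient ~ largest eigenvalue

assert 0.0 < lam_max <= 2.0 + 1e-9
```

Because the bound holds for every graph, stepsizes picked from the worst case ‖L̃‖_2 = 2 are valid without computing any eigenvalues of the actual input graph.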


We first consider the general problem