1 Introduction
Graph neural networks (GNNs) generalize traditional deep neural networks (DNNs) from regular grids, such as images, videos, and text, to irregular data such as social networks, transportation networks, and biological networks, which are typically denoted as graphs (Defferrard et al., 2016; Kipf and Welling, 2016). One popular generalization is the neural message passing framework (Gilmer et al., 2017):
(1) $\mathbf{x}_i^{(k+1)} = \mathrm{UPDATE}\big(\mathbf{x}_i^{(k)}, \mathbf{m}_i^{(k)}\big), \qquad \mathbf{m}_i^{(k)} = \mathrm{AGG}\big(\{\mathbf{x}_j^{(k)} : j \in \mathcal{N}_i\}\big),$

where $\mathbf{x}_i^{(k)}$ denotes the feature vector of node $v_i$ in the $k$th iteration of message passing and $\mathbf{m}_i^{(k)}$ is the message aggregated from $v_i$'s neighborhood $\mathcal{N}_i$. The specific architecture design has been motivated from the spectral domain (Kipf and Welling, 2016; Defferrard et al., 2016) and the spatial domain (Hamilton et al., 2017; Veličković et al., 2017; Scarselli et al., 2008; Gilmer et al., 2017). A recent study (Ma et al., 2020) has proven that the message passing schemes in numerous popular GNNs, such as GCN, GAT, PPNP, and APPNP, intrinsically perform $\ell_2$-based graph smoothing on the graph signal, and they can be considered as solving the graph signal denoising problem:

(2) $\min_{\mathbf{F}} \; \|\mathbf{F} - \mathbf{X}\|_F^2 + \lambda\, \mathrm{tr}\big(\mathbf{F}^\top \mathbf{L} \mathbf{F}\big),$
where $\mathbf{X}$ is the input signal and $\mathbf{L}$ is the graph Laplacian matrix encoding the graph structure. The first term guides $\mathbf{F}$ to be close to the input signal $\mathbf{X}$, while the second term enforces global smoothness on the filtered signal $\mathbf{F}$. The resulting message passing schemes can be derived from different optimization solvers, and they typically entail the aggregation of node features from neighboring nodes, which intuitively coincides with the cluster or consistency assumption that neighboring nodes should be similar (Zhu and Ghahramani, 2002; Zhou et al., 2004). While existing GNNs are predominantly driven by $\ell_2$-based graph smoothing, $\ell_2$-based methods enforce smoothness globally, and the level of smoothness is usually shared across the whole graph. However, the level of smoothness over different regions of the graph can differ. For instance, node features or labels can change significantly between clusters but vary smoothly within a cluster (Zhu, 2005). Therefore, it is desirable to enhance the local smoothness adaptivity of GNNs.
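As a concrete illustration of the denoising view in (2), the following NumPy sketch (ours, not the paper's code; the toy graph, $\lambda$ and stepsize are illustrative) runs gradient descent on $\|\mathbf{F}-\mathbf{X}\|_F^2 + \lambda\,\mathrm{tr}(\mathbf{F}^\top\mathbf{L}\mathbf{F})$ and checks that the objective decreases monotonically; each gradient step aggregates features from neighboring nodes, mirroring message passing:

```python
import numpy as np

# Toy 4-node path graph: 0-1-2-3 (illustrative).
edges = [(0, 1), (1, 2), (2, 3)]
n = 4
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(axis=1)) - A              # unnormalized graph Laplacian

X = np.array([[0.0], [0.1], [0.9], [1.0]])  # 1-d input signal
lam, step = 1.0, 0.1                        # illustrative values

def objective(F):
    # ||F - X||_F^2 + lam * tr(F^T L F): the l2 denoising objective (2)
    return np.sum((F - X) ** 2) + lam * np.trace(F.T @ L @ F)

F = X.copy()
vals = [objective(F)]
for _ in range(50):
    grad = 2 * (F - X) + 2 * lam * L @ F    # gradient of the objective
    F = F - step * grad                     # one smoothing / "message passing" step
    vals.append(objective(F))

# Gradient descent monotonically decreases the denoising objective.
assert all(a >= b - 1e-12 for a, b in zip(vals, vals[1:]))
```

With the stepsize below the inverse Lipschitz constant of the gradient, the objective decreases at every iteration, which is exactly why such iterations can serve as propagation layers.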
Motivated by the idea of trend filtering (Kim et al., 2009; Tibshirani and others, 2014; Wang et al., 2016), we aim to achieve this goal via $\ell_1$-based graph smoothing. Intuitively, compared with $\ell_2$-based methods, $\ell_1$-based methods penalize large values less aggressively and thus better preserve discontinuities or nonsmooth signals. Theoretically, $\ell_1$-based methods tend to promote signal sparsity to trade for discontinuity (Rudin et al., 1992; Tibshirani et al., 2005; Sharpnack et al., 2012). Owing to these advantages, trend filtering (Tibshirani and others, 2014) and graph trend filtering (Wang et al., 2016; Varma et al., 2019) demonstrate that $\ell_1$-based graph smoothing can adapt to inhomogeneous levels of smoothness of signals and yield estimators that are $k$th-order piecewise polynomial functions, such as piecewise constant, linear and quadratic functions, depending on the order of the graph difference operator. While $\ell_1$-based methods exhibit various appealing properties and have been extensively studied in different domains such as signal processing (Elad, 2010) and statistics and machine learning (Hastie et al., 2015), they have rarely been investigated in the design of GNNs. In this work, we attempt to bridge this gap and enhance the local smoothness adaptivity of GNNs via $\ell_1$-based graph smoothing.

Incorporating $\ell_1$-based graph smoothing into the design of GNNs faces tremendous challenges. First, since the message passing schemes in GNNs can be derived from the optimization iterations of the graph signal denoising problem, a fast, efficient and scalable optimization solver is desired. Unfortunately, solving the associated optimization problem involving the $\ell_1$ norm is challenging since the objective function is composed of smooth and nonsmooth components, and the decision variable is further coupled by the discrete graph difference operator. Second, to integrate the derived message passing scheme into GNNs, it has to be composed of simple operations that are friendly to the backpropagation training of the whole GNN. Third, it requires an appropriate normalization step to deal with diverse node degrees, which is often overlooked by existing graph total variation and graph trend filtering methods. Our attempt to address these challenges leads to a family of novel GNNs, i.e., Elastic GNNs. Our key contributions can be summarized as follows:

We introduce $\ell_1$-based graph smoothing in the design of GNNs for the first time, to further enhance local smoothness adaptivity;

We derive a novel and general message passing scheme, i.e., Elastic Message Passing (EMP), and develop a family of GNN architectures, i.e., Elastic GNNs, by integrating the proposed message passing scheme into deep neural nets;

Extensive experiments demonstrate that Elastic GNNs obtain better adaptivity on various real-world datasets and are significantly more robust to graph adversarial attacks. The study of different variants of Elastic GNNs suggests that $\ell_1$- and $\ell_2$-based graph smoothing are complementary and that Elastic GNNs are more versatile.
2 Preliminary
We use bold upper-case letters such as $\mathbf{X}$ to denote matrices and bold lower-case letters such as $\mathbf{x}$ to denote vectors. Given a matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$, we use $\mathbf{X}_i$ to denote its $i$th row and $\mathbf{X}_{ij}$ to denote its element in the $i$th row and $j$th column. We define the Frobenius norm, $\ell_1$ norm, and $\ell_{21}$ norm of a matrix $\mathbf{X}$ as $\|\mathbf{X}\|_F = \sqrt{\sum_{ij} \mathbf{X}_{ij}^2}$, $\|\mathbf{X}\|_1 = \sum_{ij} |\mathbf{X}_{ij}|$, and $\|\mathbf{X}\|_{21} = \sum_i \sqrt{\sum_j \mathbf{X}_{ij}^2}$, respectively. We define $\|\mathbf{X}\|_2 = \sigma_{\max}(\mathbf{X})$, where $\sigma_{\max}(\mathbf{X})$ is the largest singular value of $\mathbf{X}$. Given two matrices $\mathbf{X}, \mathbf{Y} \in \mathbb{R}^{n \times d}$, we define the inner product as $\langle \mathbf{X}, \mathbf{Y} \rangle = \mathrm{tr}(\mathbf{X}^\top \mathbf{Y})$.

Let $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$ be a graph with the node set $\mathcal{V} = \{v_1, \dots, v_n\}$ and the undirected edge set $\mathcal{E} = \{e_1, \dots, e_m\}$. We use $\mathcal{N}_i$ to denote the neighboring nodes of node $v_i$, including $v_i$ itself. Suppose that each node is associated with a $d$-dimensional feature vector, and the features for all nodes are denoted as $\mathbf{X} \in \mathbb{R}^{n \times d}$. The graph structure can be represented as an adjacency matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$, where $\mathbf{A}_{ij} = 1$ when there exists an edge between nodes $v_i$ and $v_j$, and $\mathbf{A}_{ij} = 0$ otherwise. The graph Laplacian matrix is defined as $\mathbf{L} = \mathbf{D} - \mathbf{A}$, where $\mathbf{D}$ is the diagonal degree matrix. Let $\Delta \in \mathbb{R}^{m \times n}$ be the oriented incident matrix, which contains one row for each edge. If $e_k = (v_i, v_j)$, then $\Delta$ has its $k$th row as
$$\Delta_k = (0, \dots, \underset{i}{-1}, \dots, \underset{j}{1}, \dots, 0),$$
where the edge orientation can be arbitrary. Note that the incident matrix and the unnormalized Laplacian matrix satisfy the equivalence $\Delta^\top \Delta = \mathbf{L}$. Next, we briefly introduce some necessary background on the graph signal denoising perspective of GNNs and on graph trend filtering methods.
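The equivalence $\Delta^\top \Delta = \mathbf{L}$ can be verified numerically; a small sketch with an arbitrary toy graph and edge orientation (illustrative, not from the paper):

```python
import numpy as np

# Oriented incident matrix: one row per edge, with -1 at node i and +1
# at node j for edge (i, j); the orientation is arbitrary.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n = 4
Delta = np.zeros((len(edges), n))
for k, (i, j) in enumerate(edges):
    Delta[k, i], Delta[k, j] = -1.0, 1.0

A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(axis=1)) - A   # unnormalized Laplacian L = D - A

# The equivalence stated in the text: Delta^T Delta = L.
assert np.allclose(Delta.T @ Delta, L)
```

Flipping the sign of any row of `Delta` (reversing that edge's orientation) leaves the product unchanged, which is why the orientation can be arbitrary.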
2.1 GNNs as Graph Signal Denoising
It is evident from recent work (Ma et al., 2020) that many popular GNNs can be uniformly understood as graph signal denoising with Laplacian smoothing regularization. Here we briefly describe several representative examples.
GCN. The message passing scheme in Graph Convolutional Networks (GCN) (Kipf and Welling, 2016),
$$\mathbf{F}' = \tilde{\mathbf{A}} \mathbf{F},$$
is equivalent to one gradient descent step to minimize $\mathrm{tr}\big(\mathbf{F}^\top (\mathbf{I} - \tilde{\mathbf{A}}) \mathbf{F}\big)$ with the initial $\mathbf{F} = \mathbf{X}$ and stepsize $1/2$. Here $\tilde{\mathbf{A}} = \hat{\mathbf{D}}^{-1/2} \hat{\mathbf{A}} \hat{\mathbf{D}}^{-1/2}$ with $\hat{\mathbf{A}} = \mathbf{A} + \mathbf{I}$ being the adjacency matrix with self-loops, whose degree matrix is $\hat{\mathbf{D}} = \mathbf{D} + \mathbf{I}$.
PPNP & APPNP. The message passing schemes in PPNP and APPNP (Klicpera et al., 2018) follow the aggregation rules
$$\mathbf{F} = \alpha \big(\mathbf{I} - (1 - \alpha)\tilde{\mathbf{A}}\big)^{-1} \mathbf{X}$$
and
$$\mathbf{F}^{(k+1)} = (1 - \alpha)\tilde{\mathbf{A}} \mathbf{F}^{(k)} + \alpha \mathbf{X},$$
respectively. They are shown to be the exact solution and one gradient descent step with stepsize $\frac{\alpha}{2}$, respectively, for the following problem:

(3) $\min_{\mathbf{F}} \; \|\mathbf{F} - \mathbf{X}\|_F^2 + \Big(\frac{1}{\alpha} - 1\Big)\, \mathrm{tr}\big(\mathbf{F}^\top \tilde{\mathbf{L}} \mathbf{F}\big), \quad \text{with } \tilde{\mathbf{L}} = \mathbf{I} - \tilde{\mathbf{A}}.$
For a more comprehensive illustration, please refer to (Ma et al., 2020). We point out that all these message passing schemes adopt $\ell_2$-based graph smoothing, as the signal differences between neighboring nodes are penalized by the square of the $\ell_2$ norm, e.g., $\mathrm{tr}\big(\mathbf{F}^\top \tilde{\mathbf{L}} \mathbf{F}\big) = \sum_{(v_i, v_j) \in \mathcal{E}} \big\| \frac{\mathbf{F}_i}{\sqrt{1 + d_i}} - \frac{\mathbf{F}_j}{\sqrt{1 + d_j}} \big\|_2^2$, with $d_i$ being the node degree of node $v_i$. The resulting message passing schemes are usually linear smoothers, which smooth the input signal by a linear transformation.
2.2 Graph Trend Filtering
In the univariate case, the $k$th-order graph trend filtering (GTF) estimator (Wang et al., 2016) is given by

(4) $\min_{\mathbf{f}} \; \frac{1}{2} \|\mathbf{f} - \mathbf{x}\|_2^2 + \lambda \|\Delta^{(k+1)} \mathbf{f}\|_1,$

where $\mathbf{x} \in \mathbb{R}^n$ is the $1$-dimensional input signal of $n$ nodes and $\Delta^{(k+1)}$ is a $k$th-order graph difference operator. When $k = 0$, it penalizes the absolute difference across neighboring nodes in graph $\mathcal{G}$:
$$\|\Delta^{(1)} \mathbf{f}\|_1 = \sum_{(v_i, v_j) \in \mathcal{E}} |\mathbf{f}_i - \mathbf{f}_j|,$$
where $\Delta^{(1)}$ is equivalent to the incident matrix $\Delta$. Generally, $k$th-order graph difference operators can be defined recursively:
$$\Delta^{(k+1)} = \begin{cases} \big(\Delta^{(1)}\big)^\top \Delta^{(k)} = \mathbf{L}^{\frac{k+1}{2}}, & k \text{ odd}, \\ \Delta^{(1)} \Delta^{(k)} = \Delta \mathbf{L}^{\frac{k}{2}}, & k \text{ even}. \end{cases}$$
It is demonstrated that GTF can adapt to inhomogeneity in the level of smoothness of the signal and tends to provide piecewise polynomials over graphs (Wang et al., 2016). For instance, when $k = 0$, the sparsity induced by the $\ell_1$-based penalty implies that many of the differences $\mathbf{f}_i - \mathbf{f}_j$ are zero across edges in $\mathcal{G}$, so the estimate is piecewise constant. The piecewise property originates from the discontinuity of the signal allowed by the less aggressive $\ell_1$ penalty, with adaptively chosen knot nodes or knot edges. Note that the smoothers induced by GTF are not linear smoothers and cannot be simply represented by a linear transformation of the input signal.
3 Elastic Graph Neural Networks
In this section, we first propose a new graph signal denoising estimator. Then we develop an efficient optimization algorithm for solving the denoising problem and introduce a novel, general and efficient message passing scheme, i.e., Elastic Message Passing (EMP), for graph signal smoothing. Finally, the integration of the proposed message passing scheme and deep neural networks leads to Elastic GNNs.
3.1 Elastic Graph Signal Estimator
To combine the advantages of $\ell_1$- and $\ell_2$-based graph smoothing, we propose the following elastic graph signal estimator:

(5) $\min_{\mathbf{F}} \; \lambda_1 \|\Delta \mathbf{F}\|_1 + \frac{\lambda_2}{2} \mathrm{tr}\big(\mathbf{F}^\top \mathbf{L} \mathbf{F}\big) + \frac{1}{2} \|\mathbf{F} - \mathbf{X}\|_F^2,$

where $\mathbf{X} \in \mathbb{R}^{n \times d}$ is the $d$-dimensional input signal of $n$ nodes. The first term can be written in an edge-centric way: $\|\Delta \mathbf{F}\|_1 = \sum_{(v_i, v_j) \in \mathcal{E}} \|\mathbf{F}_i - \mathbf{F}_j\|_1$, which penalizes the absolute difference across connected nodes in graph $\mathcal{G}$. Similarly, the second term penalizes the difference quadratically via $\mathrm{tr}\big(\mathbf{F}^\top \mathbf{L} \mathbf{F}\big) = \sum_{(v_i, v_j) \in \mathcal{E}} \|\mathbf{F}_i - \mathbf{F}_j\|_2^2$. The last term is the fidelity term, which preserves the similarity with the input signal. The regularization coefficients $\lambda_1$ and $\lambda_2$ control the balance between $\ell_1$- and $\ell_2$-based graph smoothing.
Remark 1.
It is possible to consider higher-order graph differences in both the $\ell_1$-based and $\ell_2$-based smoothers. In this work, however, we focus on the first-order graph difference operator $\Delta^{(1)} = \Delta$, since we assume a piecewise constant prior for graph representation learning.
Normalization. In existing GNNs, it is beneficial to normalize the Laplacian matrix for better numerical stability, and the normalization trick is also crucial for achieving superior performance. Therefore, for the $\ell_2$-based graph smoothing, we follow the common normalization trick in GNNs: $\tilde{\mathbf{L}} = \mathbf{I} - \tilde{\mathbf{A}}$, where $\tilde{\mathbf{A}} = \hat{\mathbf{D}}^{-1/2} \hat{\mathbf{A}} \hat{\mathbf{D}}^{-1/2}$, $\hat{\mathbf{A}} = \mathbf{A} + \mathbf{I}$, and $\hat{\mathbf{D}} = \mathbf{D} + \mathbf{I}$. It leads to the degree-normalized penalty
$$\mathrm{tr}\big(\mathbf{F}^\top \tilde{\mathbf{L}} \mathbf{F}\big) = \sum_{(v_i, v_j) \in \mathcal{E}} \Big\| \frac{\mathbf{F}_i}{\sqrt{1 + d_i}} - \frac{\mathbf{F}_j}{\sqrt{1 + d_j}} \Big\|_2^2.$$
In the literature on graph total variation and graph trend filtering, the normalization step is often overlooked, and the graph difference operator is used directly, as in GTF (Wang et al., 2016; Varma et al., 2019). To achieve better numerical stability and handle diverse node degrees in real-world graphs, we propose to normalize each column of the incident matrix by the square root of the node degrees for the $\ell_1$-based graph smoothing, i.e., $\tilde{\Delta} = \Delta \hat{\mathbf{D}}^{-1/2}$.¹ It leads to the degree-normalized total variation penalty²
$$\|\tilde{\Delta} \mathbf{F}\|_1 = \sum_{(v_i, v_j) \in \mathcal{E}} \Big\| \frac{\mathbf{F}_i}{\sqrt{1 + d_i}} - \frac{\mathbf{F}_j}{\sqrt{1 + d_j}} \Big\|_1.$$
¹ It naturally supports real-valued edge weights if the edge weights are set in the incident matrix $\Delta$.
² With the normalization, the piecewise constant prior holds up to the degree scaling, i.e., sparsity in $\tilde{\Delta} \mathbf{F}$.
Note that this normalized incident matrix maintains the relation with the normalized Laplacian matrix as in the unnormalized case:

(6) $\tilde{\Delta}^\top \tilde{\Delta} = \tilde{\mathbf{L}},$

given that $\tilde{\Delta}^\top \tilde{\Delta} = \hat{\mathbf{D}}^{-1/2} \Delta^\top \Delta \hat{\mathbf{D}}^{-1/2} = \hat{\mathbf{D}}^{-1/2} (\mathbf{D} - \mathbf{A}) \hat{\mathbf{D}}^{-1/2} = \hat{\mathbf{D}}^{-1/2} (\hat{\mathbf{D}} - \hat{\mathbf{A}}) \hat{\mathbf{D}}^{-1/2} = \mathbf{I} - \tilde{\mathbf{A}} = \tilde{\mathbf{L}}$. With the normalization, the estimator defined in (5) becomes:

(7) $\min_{\mathbf{F}} \; \lambda_1 \|\tilde{\Delta} \mathbf{F}\|_1 + \frac{\lambda_2}{2} \mathrm{tr}\big(\mathbf{F}^\top \tilde{\mathbf{L}} \mathbf{F}\big) + \frac{1}{2} \|\mathbf{F} - \mathbf{X}\|_F^2.$
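Relation (6) is easy to check numerically on a toy graph (the graph is illustrative, not from the paper): with $\tilde{\Delta} = \Delta \hat{\mathbf{D}}^{-1/2}$ and $\tilde{\mathbf{L}} = \mathbf{I} - \hat{\mathbf{D}}^{-1/2}(\mathbf{A} + \mathbf{I})\hat{\mathbf{D}}^{-1/2}$, the identity $\tilde{\Delta}^\top \tilde{\Delta} = \tilde{\mathbf{L}}$ holds:

```python
import numpy as np

edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n = 4
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
Delta = np.zeros((len(edges), n))
for k, (i, j) in enumerate(edges):
    Delta[k, i], Delta[k, j] = -1.0, 1.0

d_hat = A.sum(axis=1) + 1.0                 # degrees with self-loops
D_inv_sqrt = np.diag(1.0 / np.sqrt(d_hat))

A_tilde = D_inv_sqrt @ (A + np.eye(n)) @ D_inv_sqrt  # normalized adjacency
L_tilde = np.eye(n) - A_tilde                        # normalized Laplacian
Delta_tilde = Delta @ D_inv_sqrt                     # column-normalized incident matrix

# Eq. (6): the normalized incident matrix keeps the Laplacian relation.
assert np.allclose(Delta_tilde.T @ Delta_tilde, L_tilde)
```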
Capturing correlation among dimensions. The node features in real-world graphs are usually multi-dimensional. Although the estimator defined in (7) is able to handle multi-dimensional data, since the signals from different dimensions are separable under the $\ell_1$ and $\ell_2$ norms, it treats each feature dimension independently and does not exploit the potential relations between feature dimensions. However, the sparsity patterns of node differences across edges could be shared among feature dimensions. To better exploit this potential correlation, we propose to couple the multi-dimensional features via the $\ell_{21}$ norm, which penalizes the summation of the $\ell_2$ norms of the node differences:
$$\|\tilde{\Delta} \mathbf{F}\|_{21} = \sum_{(v_i, v_j) \in \mathcal{E}} \Big\| \frac{\mathbf{F}_i}{\sqrt{1 + d_i}} - \frac{\mathbf{F}_j}{\sqrt{1 + d_j}} \Big\|_2.$$
This penalty promotes the row sparsity of $\tilde{\Delta} \mathbf{F}$ and enforces similar sparsity patterns among feature dimensions. In other words, if two nodes are similar, all their feature dimensions should be similar. Therefore, we define the $\ell_{21}$-based estimator as

(8) $\min_{\mathbf{F}} \; \lambda_1 \|\tilde{\Delta} \mathbf{F}\|_{21} + \frac{\lambda_2}{2} \mathrm{tr}\big(\mathbf{F}^\top \tilde{\mathbf{L}} \mathbf{F}\big) + \frac{1}{2} \|\mathbf{F} - \mathbf{X}\|_F^2,$

where $\|\tilde{\Delta} \mathbf{F}\|_{21} = \sum_k \|(\tilde{\Delta} \mathbf{F})_k\|_2$ sums the $\ell_2$ norms of the rows of $\tilde{\Delta} \mathbf{F}$. In the following subsections, we use $f(\cdot)$ to represent both $\lambda_1 \|\cdot\|_1$ and $\lambda_1 \|\cdot\|_{21}$ when the distinction is not essential.
3.2 Elastic Message Passing
For the $\ell_2$-based graph smoothers, message passing schemes can be derived from the gradient descent iterations of the graph signal denoising problem, as in the cases of GCN and APPNP (Ma et al., 2020). However, computing the estimators defined by (7) and (8) is much more challenging because of the nonsmoothness of the $\ell_1$ and $\ell_{21}$ norms, and because the smooth and nonsmooth components are nonseparable, as they are coupled by the graph difference operator $\tilde{\Delta}$. In the literature, researchers have developed optimization algorithms for the graph trend filtering problem (4), such as the Alternating Direction Method of Multipliers (ADMM) and Newton-type algorithms (Wang et al., 2016; Varma et al., 2019). However, these algorithms require solving a nontrivial subproblem in each iteration, which incurs high computation complexity. Moreover, it is unclear how to make these iterations compatible with the backpropagation training of deep learning models. This motivates us to design an algorithm that is not only efficient but also friendly to backpropagation training. To this end, we propose to solve an equivalent saddle point problem using a primal-dual algorithm with efficient computations.
Saddle point reformulation. For a general convex function $f(\cdot)$, its conjugate function is defined as
$$f^*(\mathbf{Z}) := \sup_{\mathbf{Y}} \; \langle \mathbf{Z}, \mathbf{Y} \rangle - f(\mathbf{Y}).$$
By using $f(\tilde{\Delta} \mathbf{F}) = \max_{\mathbf{Z}} \langle \tilde{\Delta} \mathbf{F}, \mathbf{Z} \rangle - f^*(\mathbf{Z})$, problems (7) and (8) can be equivalently written as the following saddle point problem:

(9) $\min_{\mathbf{F}} \max_{\mathbf{Z}} \; \frac{1}{2} \|\mathbf{F} - \mathbf{X}\|_F^2 + \frac{\lambda_2}{2} \mathrm{tr}\big(\mathbf{F}^\top \tilde{\mathbf{L}} \mathbf{F}\big) + \langle \tilde{\Delta} \mathbf{F}, \mathbf{Z} \rangle - f^*(\mathbf{Z}),$
where $f(\cdot) = \lambda_1 \|\cdot\|_1$ for problem (7) and $f(\cdot) = \lambda_1 \|\cdot\|_{21}$ for problem (8). Motivated by the Proximal Alternating Predictor-Corrector (PAPC) (Loris and Verhoeven, 2011; Chen et al., 2013), we propose an efficient algorithm with low per-iteration computation complexity and a convergence guarantee:

(10) $\bar{\mathbf{F}}^{k+1} = \mathbf{F}^k - \gamma \big( \mathbf{F}^k - \mathbf{X} + \lambda_2 \tilde{\mathbf{L}} \mathbf{F}^k \big) - \gamma \tilde{\Delta}^\top \mathbf{Z}^k,$
(11) $\mathbf{Z}^{k+1} = \mathrm{prox}_{\beta f^*} \big( \mathbf{Z}^k + \beta \tilde{\Delta} \bar{\mathbf{F}}^{k+1} \big),$
(12) $\mathbf{F}^{k+1} = \mathbf{F}^k - \gamma \big( \mathbf{F}^k - \mathbf{X} + \lambda_2 \tilde{\mathbf{L}} \mathbf{F}^k \big) - \gamma \tilde{\Delta}^\top \mathbf{Z}^{k+1},$

where $\mathrm{prox}_{\beta f^*}(\cdot)$ denotes the proximal operator of $\beta f^*$. The stepsizes, $\gamma$ and $\beta$, will be specified later. The first step (10) obtains a prediction of $\mathbf{F}^{k+1}$, i.e., $\bar{\mathbf{F}}^{k+1}$, by a gradient descent step on the primal variable $\mathbf{F}$. The second step (11) is a proximal dual ascent step on the dual variable $\mathbf{Z}$, based on the predicted $\bar{\mathbf{F}}^{k+1}$. Finally, another gradient descent step on the primal variable, based on $\mathbf{Z}^{k+1}$, gives the next iterate (12). Algorithm (10)–(12) can be interpreted as a "predict-correct" algorithm for the saddle point problem (9). Next we demonstrate how to compute the proximal operator in Eq. (11).
Proximal operators. Using Moreau's decomposition principle (Bauschke and Combettes, 2011),
$$\mathrm{prox}_{\beta f^*}(\mathbf{U}) = \mathbf{U} - \beta \, \mathrm{prox}_{f/\beta}(\mathbf{U}/\beta),$$
we can rewrite step (11) using the proximal operator of $f$, that is,

(13) $\mathbf{Z}^{k+1} = \mathbf{U}^k - \beta \, \mathrm{prox}_{f/\beta}\big(\mathbf{U}^k / \beta\big), \quad \text{where } \mathbf{U}^k := \mathbf{Z}^k + \beta \tilde{\Delta} \bar{\mathbf{F}}^{k+1}.$
We discuss the two options for the function corresponding to the objectives (7) and (8).


Option I ($\ell_1$ norm):
By definition, the proximal operator of $f(\cdot) = \lambda_1 \|\cdot\|_1$ is
$$\mathrm{prox}_{f/\beta}(\mathbf{U}) = \arg\min_{\mathbf{Y}} \; \frac{\lambda_1}{\beta} \|\mathbf{Y}\|_1 + \frac{1}{2} \|\mathbf{Y} - \mathbf{U}\|_F^2,$$
which is equivalent to the soft-thresholding operator (componentwise):
$$\big(\mathrm{prox}_{f/\beta}(\mathbf{U})\big)_{ij} = \mathrm{sign}(\mathbf{U}_{ij}) \max\Big( |\mathbf{U}_{ij}| - \frac{\lambda_1}{\beta},\; 0 \Big).$$
Therefore, using (13), we have

(14) $\mathbf{Z}^{k+1}_{ij} = \mathrm{sign}(\mathbf{U}^k_{ij}) \min\big( |\mathbf{U}^k_{ij}|,\; \lambda_1 \big),$

which is a componentwise projection onto the $\ell_\infty$ ball of radius $\lambda_1$.

Option II ($\ell_{21}$ norm):
By definition, the proximal operator of $f(\cdot) = \lambda_1 \|\cdot\|_{21}$ is
$$\mathrm{prox}_{f/\beta}(\mathbf{U}) = \arg\min_{\mathbf{Y}} \; \frac{\lambda_1}{\beta} \|\mathbf{Y}\|_{21} + \frac{1}{2} \|\mathbf{Y} - \mathbf{U}\|_F^2,$$
with the $i$th row being
$$\big(\mathrm{prox}_{f/\beta}(\mathbf{U})\big)_i = \max\Big( 1 - \frac{\lambda_1 / \beta}{\|\mathbf{U}_i\|_2},\; 0 \Big) \mathbf{U}_i.$$
Similarly, using (13), we have the $i$th row of $\mathbf{Z}^{k+1}$ being

(15) $\mathbf{Z}^{k+1}_i = \min\Big( 1,\; \frac{\lambda_1}{\|\mathbf{U}^k_i\|_2} \Big) \mathbf{U}^k_i,$

which is a row-wise projection onto the $\ell_2$ ball of radius $\lambda_1$. Note that the proximal operator in the $\ell_1$ norm case treats each feature dimension independently, while in the $\ell_{21}$ norm case it couples the multi-dimensional features, which is consistent with the motivation to exploit the correlation among feature dimensions.
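Both projections, and the Moreau identity used to derive them, are easy to check numerically. A sketch (ours; $\lambda_1$, $\beta$ and the test matrix are arbitrary illustrative values): soft-thresholding plus the $\ell_\infty$-ball projection reconstructs the input, and the row-wise projection caps every row norm at $\lambda_1$:

```python
import numpy as np

lam1, beta = 1.5, 0.5   # illustrative values

def prox_l1(U, t):
    # Soft-thresholding: prox of t * ||.||_1, applied componentwise.
    return np.sign(U) * np.maximum(np.abs(U) - t, 0.0)

def proj_linf(U, r):
    # Componentwise projection onto the l_inf ball of radius r (eq. (14)).
    return np.clip(U, -r, r)

def proj_l2_rows(U, r):
    # Row-wise projection onto the l2 ball of radius r (eq. (15)).
    norms = np.linalg.norm(U, axis=1, keepdims=True)
    return U * np.minimum(1.0, r / np.maximum(norms, 1e-12))

U = np.array([[2.0, -0.3], [-0.1, 0.4], [3.0, 4.0]])

# Moreau decomposition: U = beta * prox_{f/beta}(U/beta) + prox_{beta f*}(U),
# where for f = lam1 * ||.||_1 the second term is the l_inf-ball projection.
recon = beta * prox_l1(U / beta, lam1 / beta) + proj_linf(U, lam1)
assert np.allclose(recon, U)

# After the l21 projection, no row norm exceeds lam1.
Z = proj_l2_rows(U, lam1)
assert np.all(np.linalg.norm(Z, axis=1) <= lam1 + 1e-9)
```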
The algorithm (10)–(12) and the proximal operators (14) and (15) enable us to derive the final message passing scheme. Note that the computation in steps (10) and (12) can be shared to save computation. Therefore, we decompose step (10) into two steps:

(16) $\mathbf{Y}^{k+1} = \mathbf{F}^k - \gamma \big( \mathbf{F}^k - \mathbf{X} + \lambda_2 \tilde{\mathbf{L}} \mathbf{F}^k \big),$
(17) $\bar{\mathbf{F}}^{k+1} = \mathbf{Y}^{k+1} - \gamma \tilde{\Delta}^\top \mathbf{Z}^k.$

In this work, we choose $\gamma = \frac{1}{1 + \lambda_2}$ and $\beta = \frac{1}{2\gamma}$. Therefore, with $\tilde{\mathbf{A}} = \mathbf{I} - \tilde{\mathbf{L}}$, Eq. (16) can be simplified as

(18) $\mathbf{Y}^{k+1} = \gamma \big( \mathbf{X} + \lambda_2 \tilde{\mathbf{A}} \mathbf{F}^k \big).$

Then steps (11) and (12) become

(19) $\mathbf{Z}^{k+1} = \mathrm{prox}_{\beta f^*} \big( \mathbf{Z}^k + \beta \tilde{\Delta} \bar{\mathbf{F}}^{k+1} \big),$
(20) $\mathbf{F}^{k+1} = \mathbf{Y}^{k+1} - \gamma \tilde{\Delta}^\top \mathbf{Z}^{k+1}.$
Substituting the proximal operator in (19) with (14) or (15), we obtain the complete elastic message passing scheme (EMP), as summarized in Figure 1.
Interpretation of EMP. EMP can be interpreted as the standard message passing (MP) ($\mathbf{Y}^{k+1} = \gamma(\mathbf{X} + \lambda_2 \tilde{\mathbf{A}} \mathbf{F}^k)$ in Fig. 1) with extra operations (the subsequent steps). The extra operations compute $\gamma \tilde{\Delta}^\top \mathbf{Z}^{k+1}$ to adjust the standard MP such that sparsity in $\tilde{\Delta} \mathbf{F}$ is promoted and some large node differences can be preserved. EMP is general and covers some existing propagation rules as special cases, as demonstrated in Remark 2.
Remark 2 (Special cases).
If there is only $\ell_2$-based regularization, i.e., $\lambda_1 = 0$, then by the projection operator we have $\mathbf{Z}^{k+1} = \mathbf{0}$. Therefore, with $\gamma = \frac{1}{1 + \lambda_2}$, the proposed message passing scheme reduces to
$$\mathbf{F}^{k+1} = \frac{1}{1 + \lambda_2} \mathbf{X} + \frac{\lambda_2}{1 + \lambda_2} \tilde{\mathbf{A}} \mathbf{F}^k.$$
If $\lambda_2 = \frac{1}{\alpha} - 1$, it recovers the message passing in APPNP:
$$\mathbf{F}^{k+1} = \alpha \mathbf{X} + (1 - \alpha) \tilde{\mathbf{A}} \mathbf{F}^k.$$
If $\lambda_2 \to \infty$, it recovers the simple aggregation operation in many GNNs:
$$\mathbf{F}^{k+1} = \tilde{\mathbf{A}} \mathbf{F}^k.$$
Computation complexity. EMP is efficient and composed of simple operations. The major computation cost comes from four sparse matrix multiplications, namely $\tilde{\mathbf{A}} \mathbf{F}^k$, $\tilde{\Delta}^\top \mathbf{Z}^k$, $\tilde{\Delta} \bar{\mathbf{F}}^{k+1}$ and $\tilde{\Delta}^\top \mathbf{Z}^{k+1}$. The computation complexity is of order $O(md)$, where $m$ is the number of edges in graph $\mathcal{G}$ and $d$ is the feature dimension of the input signal $\mathbf{X}$. The other operations are simple matrix additions and projections.
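Putting steps (18)–(20) and the projection (15) together, the following NumPy sketch (ours, with an illustrative two-cluster toy graph and hyperparameters, not the authors' implementation) runs EMP and checks that it decreases the elastic objective (8) relative to the input signal:

```python
import numpy as np

# Toy graph: two 3-node clusters joined by one edge (illustrative).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
n, m = 6, len(edges)
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
Delta = np.zeros((m, n))
for k, (i, j) in enumerate(edges):
    Delta[k, i], Delta[k, j] = -1.0, 1.0
Dis = np.diag(1.0 / np.sqrt(A.sum(axis=1) + 1.0))
A_t = Dis @ (A + np.eye(n)) @ Dis            # normalized adjacency
Dlt = Delta @ Dis                            # normalized incident matrix

X = np.vstack([np.zeros((3, 2)), np.ones((3, 2))]) + 0.05  # clustered signal
lam1, lam2 = 0.5, 2.0                        # illustrative coefficients
gamma = 1.0 / (1.0 + lam2)
beta = 1.0 / (2.0 * gamma)

F, Z = X.copy(), np.zeros((m, 2))
for _ in range(100):
    Y = gamma * (X + lam2 * A_t @ F)         # standard MP step, eq. (18)
    F_bar = Y - gamma * Dlt.T @ Z            # predictor, eq. (17)
    U = Z + beta * Dlt @ F_bar
    norms = np.linalg.norm(U, axis=1, keepdims=True)
    Z = U * np.minimum(1.0, lam1 / np.maximum(norms, 1e-12))  # projection, eq. (15)
    F = Y - gamma * Dlt.T @ Z                # corrector, eq. (20)

def obj(F):
    # Elastic objective, eq. (8), with the l21 penalty.
    return (lam1 * np.linalg.norm(Dlt @ F, axis=1).sum()
            + 0.5 * lam2 * np.trace(F.T @ (np.eye(n) - A_t) @ F)
            + 0.5 * np.sum((F - X) ** 2))

assert obj(F) <= obj(X) + 1e-6   # EMP lowers the elastic objective from the input
```

Note how the whole iteration is built from sparse matrix products, additions and a projection, which is what makes it cheap and differentiable end-to-end.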
3.3 Elastic GNNs
Incorporating the elastic message passing scheme derived from the elastic graph signal estimators (7) and (8) into deep neural networks, we introduce a family of GNNs, namely Elastic GNNs. In this work, we follow the decoupled design proposed in APPNP (Klicpera et al., 2018), where we first make predictions from the node features and then aggregate the predictions through the proposed EMP:

(21) $\mathbf{Y}_{\mathrm{pred}} = \mathrm{softmax}\big( \mathrm{EMP}\big( h_\theta(\mathbf{X}),\; K \big) \big),$

where $\mathbf{X}$ denotes the node features, $h_\theta(\cdot)$ is any machine learning model, such as a multilayer perceptron (MLP), $\theta$ is the set of learnable parameters in the model, and $K$ is the number of message passing steps. The training objective is the cross entropy loss defined by the final prediction $\mathbf{Y}_{\mathrm{pred}}$ and the labels of the training data. Elastic GNNs also have the following nice properties:
In addition to the backbone neural network model, Elastic GNNs only require setting up three hyperparameters, namely the two coefficients $\lambda_1$ and $\lambda_2$ and the propagation step $K$, but they do not introduce any additional learnable parameters. Therefore, they reduce the risk of overfitting.

The hyperparameters $\lambda_1$ and $\lambda_2$ provide better smoothness adaptivity to Elastic GNNs, depending on the smoothness properties of the graph data.

The message passing scheme only entails simple and efficient operations, which makes it friendly to the efficient and endtoend backpropagation training of the whole GNN model.
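A minimal sketch of the decoupled architecture in (21), assuming a stand-in `mlp` for $h_\theta$ and, for brevity, the $\ell_2$-only special case of EMP ($\lambda_1 = 0$, the APPNP-style rule from Remark 2); all shapes and weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(X, W1, W2):
    # Stand-in for any feature transform h_theta; a 1-hidden-layer MLP.
    return np.maximum(X @ W1, 0.0) @ W2

def emp_propagate(H, A_t, K, lam2):
    # l2-only special case of EMP (lam1 = 0), i.e. APPNP-style propagation:
    # F^{k+1} = 1/(1+lam2) * H + lam2/(1+lam2) * A_t @ F^k
    gamma = 1.0 / (1.0 + lam2)
    F = H.copy()
    for _ in range(K):
        F = gamma * H + gamma * lam2 * A_t @ F
    return F

# Toy graph and features (illustrative sizes).
n, d, h, c = 5, 4, 8, 3
A = np.zeros((n, n))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    A[i, j] = A[j, i] = 1.0
Dis = np.diag(1.0 / np.sqrt(A.sum(axis=1) + 1.0))
A_t = Dis @ (A + np.eye(n)) @ Dis

X = rng.normal(size=(n, d))
W1, W2 = rng.normal(size=(d, h)), rng.normal(size=(h, c))

# Eq. (21): predict first, then propagate the predictions.
logits = emp_propagate(mlp(X, W1, W2), A_t, K=10, lam2=9.0)
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)
assert np.allclose(probs.sum(axis=1), 1.0)
```

Since the propagation introduces no learnable parameters, gradients flow through it to $\theta$ unchanged in structure, which is the backpropagation-friendliness claimed above.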
4 Experiment
In this section, we conduct experiments to validate the effectiveness of the proposed Elastic GNNs. We first introduce the experimental settings. Then we assess the performance of Elastic GNNs and investigate the benefits of introducing $\ell_1$-based graph smoothing into GNNs on semi-supervised learning tasks under normal and adversarial settings. In the ablation study, we validate the local adaptive smoothness, sparsity pattern, and convergence of EMP.
4.1 Experimental Settings
Datasets.
We conduct experiments on 8 real-world datasets, including three citation graphs, i.e., Cora, CiteSeer and PubMed (Sen et al., 2008), two coauthorship graphs, i.e., Coauthor CS and Coauthor Physics (Shchur et al., 2018), two co-purchase graphs, i.e., Amazon Computers and Amazon Photo (Shchur et al., 2018), and one blog graph, i.e., Polblogs (Adamic and Glance, 2005). In the Polblogs graph, node features are not available, so we set the feature matrix to be an $n \times n$ identity matrix.

Baselines. We compare the proposed Elastic GNNs with representative GNNs, including GCN (Kipf and Welling, 2016), GAT (Veličković et al., 2017), ChebNet (Defferrard et al., 2016), GraphSAGE (Hamilton et al., 2017), APPNP (Klicpera et al., 2018) and SGC (Wu et al., 2019). For all models, we use neural networks with the same number of layers and hidden units.
Parameter settings.
For each experiment, we report the average performance and the standard deviation over 10 runs. For all methods, hyperparameters are tuned from the following search space: 1) learning rate; 2) weight decay: {5e-4, 5e-5, 5e-6}; 3) dropout rate: {0.5, 0.8}. For APPNP, the propagation step $K$ and the teleport parameter $\alpha$ are tuned. For Elastic GNNs, the propagation step $K$ and the parameters $\lambda_1$ and $\lambda_2$ are tuned. As suggested by Theorem 1, we set $\gamma = \frac{1}{1 + \lambda_2}$ and $\beta = \frac{1}{2\gamma}$ in the proposed elastic message passing scheme. The Adam optimizer (Kingma and Ba, 2014) is used in all experiments.

4.2 Performance on Benchmark Datasets
On commonly used datasets, including Cora, CiteSeer, PubMed, Coauthor CS, Coauthor Physics, Amazon Computers and Amazon Photo, we compare the performance of the proposed Elastic GNN with representative GNN baselines on the semi-supervised learning task. The detailed statistics of these datasets and the data splits are summarized in Table 5 in Appendix A. The classification accuracies are shown in Table 1. From these results, we can make the following observations:

Elastic GNN outperforms GCN, GAT, ChebNet, GraphSAGE and SGC by significant margins on all datasets, including Cora, CiteSeer and PubMed. The improvement comes from the global and local smoothness adaptivity of Elastic GNN.

Elastic GNN consistently achieves higher performance than APPNP on all datasets. Essentially, Elastic GNN covers APPNP as a special case when there is only $\ell_2$ regularization, i.e., $\lambda_1 = 0$. Beyond the $\ell_2$-based graph smoothing, the $\ell_1$-based graph smoothing further enhances the local smoothness adaptivity. This comparison verifies the benefits of introducing $\ell_1$-based graph smoothing in GNNs.
Model  Cora  CiteSeer  PubMed  CS  Physics  Computers  Photo 

ChebNet  OOM  
GCN  
GAT  
SGC  
APPNP  
GraphSAGE  
ElasticGNN 
Dataset  Ptb Rate  Basic GNN  Elastic GNN

GCN  GAT  $\ell_2$  $\ell_1$  $\ell_{21}$  $\ell_1{+}\ell_2$  $\ell_{21}{+}\ell_2$
Cora  0%  83.5±0.4  84.0±0.7  85.8±0.4  85.1±0.5  85.3±0.4  85.8±0.4  85.8±0.4
5%  76.6±0.8  80.4±0.7  81.0±1.0  82.3±1.1  81.6±1.1  81.9±1.4  82.2±0.9
10%  70.4±1.3  75.6±0.6  76.3±1.5  76.2±1.4  77.9±0.9  78.2±1.6  78.8±1.7
15%  65.1±0.7  69.8±1.3  72.2±0.9  73.3±1.3  75.7±1.2  76.9±0.9  77.2±1.6
20%  60.0±2.7  59.9±0.6  67.7±0.7  63.7±0.9  70.3±1.1  67.2±5.3  70.5±1.3
Citeseer  0%  72.0±0.6  73.3±0.8  73.6±0.9  73.2±0.6  73.2±0.5  73.6±0.6  73.8±0.6
5%  70.9±0.6  72.9±0.8  72.8±0.5  72.8±0.5  72.8±0.5  73.3±0.6  72.9±0.5
10%  67.6±0.9  70.6±0.5  70.2±0.6  70.8±0.6  70.7±1.2  72.4±0.9  72.6±0.4
15%  64.5±1.1  69.0±1.1  70.2±0.6  68.1±1.4  68.2±1.1  71.3±1.5  71.9±0.7
20%  62.0±3.5  61.0±1.5  64.9±1.0  64.7±0.8  64.7±0.8  64.7±0.8  64.7±0.8
Polblogs  0%  95.7±0.4  95.4±0.2  95.4±0.2  95.8±0.3  95.8±0.3  95.8±0.3  95.8±0.3
5%  73.1±0.8  83.7±1.5  82.8±0.3  78.7±0.6  78.7±0.7  82.8±0.4  83.0±0.3
10%  70.7±1.1  76.3±0.9  73.7±0.3  75.2±0.4  75.3±0.7  81.5±0.2  81.6±0.3
15%  65.0±1.9  68.8±1.1  68.9±0.9  72.1±0.9  71.5±1.1  77.8±0.9  78.7±0.5
20%  51.3±1.2  51.5±1.6  65.5±0.7  68.1±0.6  68.7±0.7  77.4±0.2  77.5±0.2
Pubmed  0%  87.2±0.1  83.7±0.4  88.1±0.1  86.7±0.1  87.3±0.1  88.1±0.1  88.1±0.1
5%  83.1±0.1  78.0±0.4  87.1±0.2  86.2±0.1  87.0±0.1  87.1±0.2  87.1±0.2
10%  81.2±0.1  74.9±0.4  86.6±0.1  86.0±0.2  86.9±0.2  86.3±0.1  87.0±0.1
15%  78.7±0.1  71.1±0.5  85.7±0.2  85.4±0.2  86.4±0.2  85.5±0.1  86.4±0.2
20%  77.4±0.2  68.2±1.0  85.8±0.1  85.4±0.1  86.4±0.1  85.4±0.1  86.4±0.1
4.3 Robustness Under Adversarial Attack
Locally adaptive smoothness makes Elastic GNNs more robust to adversarial attacks on the graph structure. This is because such attacks tend to connect nodes with different labels, which blurs the cluster structure in the graph. EMP, however, can tolerate large node differences along these wrong edges and maintain smoothness along the correct edges.
To validate this, we evaluate the performance of Elastic GNNs under untargeted adversarial graph attacks, which try to degrade the overall performance of GNN models by deliberately modifying the graph structure. We use MetaAttack (Zügner and Günnemann, 2019), implemented in DeepRobust (Li et al., 2020)³, to generate adversarially attacked graphs based on four datasets: Cora, CiteSeer, Polblogs and PubMed. We randomly split the nodes into training, validation and test sets. The detailed data statistics are summarized in Table 6 in Appendix A. Note that, following prior works (Zügner et al., 2018; Zügner and Günnemann, 2019; Entezari et al., 2020; Jin et al., 2020), we only consider the largest connected component (LCC) of the adversarial graphs. Therefore, the results in Table 2 are not directly comparable with the results in Table 1. We focus on investigating the robustness introduced by $\ell_1$-based graph smoothing, not on adversarial defense, so we do not compare with defense strategies. Existing defense strategies can be applied to Elastic GNNs to further improve their robustness against attacks.

³ https://github.com/DSE-MSU/DeepRobust, a PyTorch library for adversarial attacks and defenses.

Variants of Elastic GNNs. To make a deeper investigation of Elastic GNNs, we consider the following variants: (1) $\ell_2$; (2) $\ell_1$; (3) $\ell_{21}$; (4) $\ell_1 + \ell_2$; (5) $\ell_{21} + \ell_2$. To save computation, we fix the learning rate, weight decay, dropout rate and propagation step $K$, since this setting works well for the chosen datasets and models. Only $\lambda_1$ and $\lambda_2$ are tuned. The classification accuracy under different perturbation rates ranging from 0% to 20% is summarized in Table 2. From the results, we can make the following observations:

All variants of Elastic GNNs outperform GCN and GAT by significant margins under all perturbation rates. For instance, under large perturbation rates, Elastic GNN improves over GCN substantially on all four datasets being considered. This is because Elastic GNN can adapt to the change of smoothness, while GCN and GAT cannot adapt well when the perturbation rate increases.

$\ell_{21}$ outperforms $\ell_1$ in most cases, and $\ell_{21} + \ell_2$ outperforms $\ell_1 + \ell_2$ in almost all cases. This demonstrates the benefits of exploiting the correlation between feature channels by coupling the multi-dimensional features via the $\ell_{21}$ norm.

$\ell_1$ outperforms $\ell_2$ in most cases, which suggests the benefits of local smoothness adaptivity. When $\ell_1$ and $\ell_2$ are combined, Elastic GNN ($\ell_{21} + \ell_2$) achieves significantly better performance than the $\ell_1$-only, $\ell_{21}$-only or $\ell_2$-only variants in almost all cases. This suggests that $\ell_1$- and $\ell_2$-based graph smoothing are complementary to each other, and that combining them provides significantly better robustness against adversarial graph attacks.
4.4 Ablation Study
We provide an ablation study to further investigate the adaptive smoothness, sparsity pattern, and convergence of EMP in Elastic GNN, based on three datasets: Cora, CiteSeer and PubMed. In this section, we fix $\lambda_1$ and $\lambda_2$ for Elastic GNN and $\alpha$ for APPNP. We also fix the learning rate, weight decay and dropout rate, since this setting works well for both methods.
Adaptive smoothness. It is expected that $\ell_1$-based smoothing enhances local smoothness adaptivity by increasing the smoothness along correct edges (connecting nodes with the same labels) while lowering the smoothness along wrong edges (connecting nodes with different labels). To validate this, we compute the average adjacent node differences (based on the node features in the last layer) along wrong and correct edges separately, and use the ratio between these two averages to measure the smoothness adaptivity. The results are summarized in Table 3. It is clearly observed that, for all datasets, the ratio for ElasticGNN is significantly higher than for the $\ell_2$-based method APPNP, which validates its better local smoothness adaptivity.
Sparsity pattern. To validate the piecewise constant property enforced by EMP, we also investigate the sparsity pattern of the adjacent node differences, i.e., $\tilde{\Delta} \mathbf{F}$, based on the node features in the last layer. The node difference along an edge is defined as sparse if its norm is below a small threshold. The sparsity ratios for the $\ell_2$-based method APPNP and the $\ell_1$-based Elastic GNN are summarized in Table 4. It can be observed that in Elastic GNN, a significant portion of the adjacent node differences are sparse for all datasets, while in APPNP this portion is much smaller. This sparsity pattern validates the piecewise constant prior as designed.
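The sparsity ratio can be computed as the fraction of rows of $\tilde{\Delta}\mathbf{F}$ with negligible norm; a sketch (ours; the tolerance and toy inputs are illustrative, not the paper's exact protocol):

```python
import numpy as np

def sparsity_ratio(Delta_tilde, F, tol=1e-2):
    # Fraction of edges whose node difference is ~zero, i.e. rows of
    # Delta_tilde @ F with l2 norm below `tol` (an illustrative threshold).
    diffs = np.linalg.norm(Delta_tilde @ F, axis=1)
    return float(np.mean(diffs <= tol))

# Piecewise constant signal on a 4-node path: only the middle edge is dense.
Delta = np.array([[-1.0, 1.0, 0.0, 0.0],
                  [0.0, -1.0, 1.0, 0.0],
                  [0.0, 0.0, -1.0, 1.0]])
F = np.array([[0.0], [0.0], [1.0], [1.0]])
assert sparsity_ratio(Delta, F) == 2.0 / 3.0   # 2 of the 3 edges are sparse
```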
Model  Cora  CiteSeer  PubMed 

$\ell_2$ (APPNP)
$\ell_{21} + \ell_2$ (ElasticGNN)
Model  Cora  CiteSeer  PubMed 

$\ell_2$ (APPNP)
$\ell_{21} + \ell_2$ (ElasticGNN)
Convergence of EMP. We provide two additional experiments to demonstrate the impact of the propagation step $K$ on classification performance and the convergence of the message passing scheme. Figure 2 shows the increase of classification accuracy as the propagation step $K$ increases. It verifies the effectiveness of EMP in improving graph representation learning. It also shows that a small number of propagation steps can already achieve very good performance, so the computation cost of EMP can be kept small. Figure 3 shows the decrease of the objective value defined in Eq. (8) during the forward message passing process, which verifies the convergence of the proposed EMP, as suggested by Theorem 1.
5 Related Work
The design of GNN architectures is mainly motivated from the spectral domain (Kipf and Welling, 2016; Defferrard et al., 2016) and the spatial domain (Hamilton et al., 2017; Veličković et al., 2017; Scarselli et al., 2008; Gilmer et al., 2017). The message passing scheme (Gilmer et al., 2017; Ma and Tang, 2020) for feature aggregation is one central component of GNNs. Recent works have shown that the message passing in GNNs can be regarded as low-pass graph filtering (Nt and Maehara, 2019; Zhao and Akoglu, 2019). More generally, it has recently been proved that message passing in many GNNs can be unified in the graph signal denoising framework (Ma et al., 2020; Pan et al., 2020; Zhu et al., 2021; Chen et al., 2020). We point out that they intrinsically perform $\ell_2$-based graph smoothing and typically can be represented as linear smoothers.
$\ell_1$-based graph signal denoising has been explored in graph trend filtering (Wang et al., 2016; Varma et al., 2019), which tends to provide estimators with $k$th-order piecewise polynomials over graphs. Graph total variation has also been utilized in semi-supervised learning (Nie et al., 2011; Jung et al., 2016, 2019; Aviles-Rivero et al., 2019), clustering (Bühler and Hein, 2009; Bresson et al., 2013b) and graph cut problems (Szlam and Bresson, 2010; Bresson et al., 2013a). However, it is unclear whether these algorithms can be used to design GNNs. To the best of our knowledge, we make the first such investigation in this work.

6 Conclusion
In this work, we propose to enhance the smoothness adaptivity of GNNs via $\ell_1$- and $\ell_2$-based graph smoothing. Through the proposed elastic graph signal estimator, we derive a novel, efficient and general message passing scheme, i.e., elastic message passing (EMP). Integrating the proposed message passing scheme with deep neural networks leads to a family of GNNs, i.e., Elastic GNNs. Extensive experiments on benchmark datasets and adversarially attacked graphs demonstrate the benefits of introducing $\ell_1$-based graph smoothing in the design of GNNs. The empirical study suggests that $\ell_1$- and $\ell_2$-based graph smoothing are complementary to each other, and the proposed Elastic GNNs have better smoothness adaptivity owing to the integration of $\ell_1$- and $\ell_2$-based graph smoothing. We hope the proposed elastic message passing scheme can inspire more powerful GNN architecture designs in the future.
Acknowledgements
This research is supported by the National Science Foundation (NSF) under grant numbers CNS-1815636, IIS-1928278, IIS-1714741, IIS-1845081, IIS-1907704, IIS-1955285, and by the Army Research Office (ARO) under grant number W911NF-21-1-0198. Ming Yan is supported by NSF grant DMS-2012439 and a Facebook Faculty Research Award (Systems for ML).
References
 The political blogosphere and the 2004 U.S. election: divided they blog. In Proceedings of the 3rd International Workshop on Link Discovery, pp. 36–43. Cited by: §4.1.
 When labelled data hurts: deep semi-supervised classification with the graph 1-Laplacian. arXiv preprint arXiv:1906.08635. Cited by: §5.
 Convex analysis and monotone operator theory in hilbert spaces. 1st edition, Springer Publishing Company, Incorporated. External Links: ISBN 1441994661 Cited by: §3.2.
 An adaptive total variation algorithm for computing the balanced cut of a graph. External Links: 1302.2717 Cited by: §5.
 Multiclass total variation clustering. arXiv preprint arXiv:1306.1185. Cited by: §5.
 Spectral clustering based on the graph p-Laplacian. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 81–88. Cited by: §5.
 A primal–dual fixed point algorithm for convex separable minimization with applications to image restoration. Inverse Problems 29 (2), pp. 025011. Cited by: Appendix B, §3.2.
 Graph unrolling networks: interpretable neural networks for graph signal denoising. External Links: 2006.01301 Cited by: §5.
 Spectral graph theory. American Mathematical Soc.. Cited by: Appendix B.
 Convolutional neural networks on graphs with fast localized spectral filtering. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 3844–3852. Cited by: §1, §4.1, §5.
 Sparse and redundant representations: from theory to applications in signal and image processing. Springer Science & Business Media. Cited by: §1.
 All you need is low (rank): defending against adversarial attacks on graphs. In WSDM, Cited by: §4.3.
 Neural message passing for quantum chemistry. In International Conference on Machine Learning, pp. 1263–1272. Cited by: §1, §5.
 Inductive representation learning on large graphs. arXiv preprint arXiv:1706.02216. Cited by: §1, §4.1, §5.
 Statistical learning with sparsity: the lasso and generalizations. Chapman & Hall/CRC. External Links: ISBN 1498712169 Cited by: §1.
 Graph structure learning for robust graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 66–74. Cited by: §4.3.
 Semi-supervised learning in network-structured data via total variation minimization. IEEE Transactions on Signal Processing 67 (24), pp. 6256–6269. External Links: Document Cited by: §5.
 Semi-supervised learning via sparse label propagation. arXiv preprint arXiv:1612.01414. Cited by: §5.
 Trend filtering. SIAM review 51 (2), pp. 339–360. Cited by: §1.
 Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.
 Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §1, §2.1, §4.1, §5.
 Predict then propagate: graph neural networks meet personalized pagerank. arXiv preprint arXiv:1810.05997. Cited by: §2.1, §3.3, §4.1.
 DeepRobust: a pytorch library for adversarial attacks and defenses. External Links: 2005.06149 Cited by: §4.3.
 A primal-dual algorithm with optimal stepsizes and its application in decentralized consensus optimization. arXiv preprint arXiv:1711.06785. Cited by: Appendix B.
 On a generalization of the iterative soft-thresholding algorithm for the case of non-separable penalty. Inverse Problems 27 (12), pp. 125007. Cited by: Appendix B, §3.2.
 A unified view on graph neural networks as graph signal denoising. arXiv preprint arXiv:2010.01777. Cited by: §1, §2.1, §2.1, §3.2, §5.
 Deep learning on graphs. Cambridge University Press. Cited by: §5.
 Unsupervised and semi-supervised learning via ℓ1-norm graph. In 2011 International Conference on Computer Vision, pp. 2268–2273. Cited by: §5.
 Revisiting graph neural networks: all we have is low-pass filters. arXiv preprint arXiv:1905.09550. Cited by: §5.
 A unified framework for convolution-based graph neural networks. https://openreview.net/forum?id=zUMD–Fb9Bt. Cited by: §5.
 Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena 60 (1–4), pp. 259–268. Cited by: §1.
 The graph neural network model. IEEE transactions on neural networks 20 (1), pp. 61–80. Cited by: §1, §5.
 Collective classification in network data. AI magazine 29 (3), pp. 93–93. Cited by: §4.1.
 Sparsistency of the edge lasso over graphs. In Artificial Intelligence and Statistics, pp. 1028–1036. Cited by: §1.
 Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868. Cited by: §4.1.
 Total variation, Cheeger cuts. In ICML, Cited by: §5.
 Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (1), pp. 91–108. Cited by: §1.
 Adaptive piecewise polynomial estimation via trend filtering. Annals of statistics 42 (1), pp. 285–323. Cited by: §1.
 Vectorvalued graph trend filtering with nonconvex penalties. IEEE Transactions on Signal and Information Processing over Networks 6, pp. 48–62. Cited by: §1, §3.1, §3.2, §5.
 Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §1, §4.1, §5.
 Trend filtering on graphs. Journal of Machine Learning Research 17, pp. 1–41. Cited by: §1, §2.2, §3.1, §3.2, §5.
 Simplifying graph convolutional networks. In International conference on machine learning, pp. 6861–6871. Cited by: §4.1.
 PairNorm: tackling oversmoothing in GNNs. In International Conference on Learning Representations, Cited by: §5.
 Learning with local and global consistency. Cited by: §1.
 Interpreting and unifying graph neural networks with an optimization framework. External Links: 2101.11859 Cited by: §5.
 Learning from labeled and unlabeled data with label propagation. Cited by: §1.
 Semi-supervised learning literature survey. Cited by: §1.
 Adversarial attacks on neural networks for graph data. In KDD, Cited by: §4.3.
 Adversarial attacks on graph neural networks via meta learning. arXiv preprint arXiv:1902.08412. Cited by: §4.3.
Appendix A Data Statistics
The data statistics for the benchmark datasets used in Section 4.2 are summarized in Table 5. The data statistics for the adversarially attacked graph used in Section 4.3 are summarized in Table 6.
Dataset  Classes  Nodes  Edges  Features  Training Nodes  Validation Nodes  Test Nodes
Cora  7  2708  5278  1433  20 per class  500  1000 
CiteSeer  6  3327  4552  3703  20 per class  500  1000 
PubMed  3  19717  44324  500  20 per class  500  1000 
Coauthor CS  15  18333  81894  6805  20 per class  30 per class  Rest nodes 
Coauthor Physics  5  34493  247962  8415  20 per class  30 per class  Rest nodes 
Amazon Computers  10  13381  245778  767  20 per class  30 per class  Rest nodes 
Amazon Photo  8  7487  119043  745  20 per class  30 per class  Rest nodes 
Dataset  Nodes  Edges  Classes  Features
Cora  2,485  5,069  7  1,433 
CiteSeer  2,110  3,668  6  3,703 
Polblogs  1,222  16,714  2  / 
PubMed  19,717  44,338  3  500 
Appendix B Convergence Guarantee
We provide Theorem 1 to show the convergence guarantee of the proposed elastic message passing scheme and to give practical guidance on the parameter settings in EMP.
Theorem 1 (Convergence of EMP).
Proof.
We first consider the general problem