1 Introduction
Graphs are universal mathematical structures (usually accompanied by a plethora of features) that are extensively used to describe real-world data in various domains, such as citation and social networks Kipf and Welling (2016a); Liu et al. (2019), recommender systems Zhang and Chen (2019), or adversarial attacks Zhang and Zitnik (2020). Due to the complex structure of graphs, traditional machine learning models are insufficient for addressing graph-based tasks such as node and graph classification, link prediction and node clustering. This necessity has given rise to
Graph Representation Learning (GRL) methods, which aim to capture the structure of input graph data and produce meaningful low-dimensional representations.

A main track of GRL methods is based on Graph Neural Networks (GNNs). GNNs, and specifically Graph Convolutional Networks (GCN) Kipf and Welling (2016a), have advanced the research on GRL, leading to outstanding performance as they capture a node's neighborhood influence. Nevertheless, each GNN layer considers only the local one-hop node relations, which unavoidably leads to deep layer stacking in order to account for global graph information. Notwithstanding, vanilla GNNs cannot be arbitrarily deep, as depth leads to over-smoothing and information loss Xu et al. (2018); Zhao and Akoglu (2019); Chen et al. (2020). To address over-smoothing, spectral methods exploiting graph filters have been introduced Wang et al. (2019); Zhang et al. (2019). However, so far, these methods have been used in conjunction with traditional neural networks (e.g., MLPs or CNNs) that cannot efficiently exploit the set equivariance property of graph data (i.e., equivariance to permutations of the input data). In the graph domain there is no implicit data ordering, and thus the existing spectral methods waste an important portion of their computational capacity.
Contribution. In this work, we propose PointSpectrum, a GRL architecture that bridges the gap between GNN-based and spectral GRL methods (Section 2). PointSpectrum is based on Laplacian smoothing (used in spectral methods to alleviate the problem of over-smoothing), while it maintains the set equivariance property of GNN-based approaches.
Specifically, PointSpectrum is an unsupervised, end-to-end trainable architecture, consisting of the following components:

Input: a low-pass graph filter (Laplacian smoothing) is applied on the input graph data, enabling the computation of a k-order graph convolution for arbitrarily large k without over-smoothing the node features.

Encoder: the input data are fed to a set equivariant network that generates low-dimensional node embeddings.

Decoder: a joint loss is employed that accounts for the reconstruction of the input data and for their better separation in the embedding space through a clustering metric.
To the best of our knowledge, this is the first work that introduces set equivariance in spectral GRL methods. Our experimental results (Section 3) show that: (i) using set equivariant networks can increase the robustness (w.r.t. the model parameters) and efficiency (e.g., faster convergence, expressiveness) of spectral methods; (ii) PointSpectrum presents high performance in all benchmark tasks and datasets, outperforming or competing with the state-of-the-art (detailed in Section 4). Overall, our findings showcase a new direction for GRL: the combination of spectral methods with set equivariance. Incorporating set equivariant networks in existing spectral methods can be straightforward (e.g., replacing MLPs or CNNs), and our results are promising for the efficiency of this approach.
2 Methodology
Definitions. We consider an undirected attributed graph G = (V, E, X), where V is the node set with |V| = n, E is the edge set, and X ∈ R^(n×d) is the feature matrix consisting of feature vectors x_i ∈ R^d. The structural information of graph G is represented by the adjacency matrix A ∈ R^(n×n).

Goals. The goal of GRL is to map nodes to low-dimensional embeddings, which should preserve both the structural (A) and contextual (X) information of G. We denote as Z ∈ R^(n×f), with f ≪ d, the matrix with the node embeddings.
Approach overview. We propose a methodology to compute an embedding matrix Z that combines structural and contextual information in a twofold way: on the one hand, it uses Laplacian smoothing (based on A) of the feature matrix X (Section 2.1); on the other hand, by considering the node features as a set of points, it exploits global information using a permutation equivariant network architecture (Section 2.2). Last, we leverage node clustering as a boosting component that further enhances performance by better separating nodes in the embedding space. The entire architecture that brings these components together is presented in Section 2.3.
2.1 Graph Convolution and Laplacian Smoothing
The most important notion in the prevalent GNN-based embedding methods, such as GCN Kipf and Welling (2016a), is that neighboring nodes should be similar, and hence their features should be smoother than those of non-neighboring nodes in the graph manifold. However, these methods capture deeper connections by stacking multiple layers, leading to deep architectures, which are known to overly smooth the node features Chen et al. (2020). Over-smoothing occurs as each layer repeatedly smooths the original features so as to account for the deeper interactions. To address this problem, tools from graph signal processing have been used, and in particular graph convolution with Laplacian smoothing filters Zhang et al. (2019); Cui et al. (2020).
Specifically, "spectral methods" in GRL use a smoothed feature matrix X̄ instead of the original X, which corresponds to a k-th order graph convolution of X:

X̄ = H^k X    (1)

where the matrix H is the Laplacian smoothing filter (discussed below). The multiplication of the feature matrix X with the filter H corresponds to a 1-order graph convolution (or 1-order graph smoothing). Stacking k filters together, i.e., k multiplications with H, leads to a k-order convolution as in Eq. 1. Thus, deep network interactions are captured by the k-th power of the filter H instead of repeated convolutions of the input features, and therefore over-smoothing is avoided.
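For illustration, the k-order convolution of Eq. 1 amounts to k successive multiplications with the filter; a minimal numpy sketch (function and variable names are ours, not the paper's):

```python
import numpy as np

def smooth_features(H: np.ndarray, X: np.ndarray, k: int) -> np.ndarray:
    """Apply the k-order graph convolution X_bar = H^k X.

    Applying H repeatedly (k products) is cheaper than forming the dense
    power H^k when H is sparse, and yields the same result.
    """
    X_bar = X.copy()
    for _ in range(k):
        X_bar = H @ X_bar
    return X_bar

# Toy example: a 3-node smoothing filter and a one-column feature matrix.
H = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])
X = np.array([[1.0], [0.0], [1.0]])
X_bar = smooth_features(H, X, k=4)
```

Note that no learnable parameters are involved here: the smoothed matrix can be precomputed once before training.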
The Laplacian Smoothing Filter, H: The intuition behind the k-order convolution of spectral methods is the following. Each column of the feature matrix X (i.e., the vector with the values of all nodes for a single feature) can be considered as a graph signal. The smoothness of a signal depicts the similarity between the graph nodes. Since neighboring nodes should be similar, to capture this similarity we would need to construct a smooth signal based on A that captures node adjacency and takes into account the features of the most important neighboring nodes. It can be shown (details in Appendix A.7) that the Laplacian smoothing filter as defined in Eq. 2 Taubin (1995) cancels high frequencies between neighboring nodes and preserves the low ones:
H = I - γL    (2)

where γ is the smoothing weight, I is the identity matrix, and L is the graph Laplacian. The graph Laplacian is defined as L = D - A, where D is the degree matrix of A, with d_i = Σ_j A_ij being the degree of node i. In order for H to be low-pass, its frequency response p(λ) = 1 - γλ should be a non-negative decreasing function. Cui et al. (2020) showed that the optimal value of γ is 1/λ_max, with λ_max denoting the largest eigenvalue of L; in the remainder, we use this value for γ.

[Figure 2: Accuracy, ARI and NMI metrics of PointSpectrum with PointNetST, MLP and CNN-based networks in the encoder, solving the clustering task on the Cora dataset. For all metrics, the mean value and standard deviation over 10 experiment runs are depicted. The set equivariant PointNetST achieves higher performance and is more robust than the MLP and CNN variants, irrespective of the convolution order.]

Renormalization trick: In practice, the renormalization trick Kipf and Welling (2016a), i.e., adding self-loops to the graph, has been shown to improve accuracy and shrink the graph spectral domain Wu et al. (2019). Qualitatively, this means that for every node the smoothing filter also considers the node's own features alongside those of its neighbors. Thus, we perform the following transformation: self-loops are added to the adjacency, Ã = A + I; we then use the symmetric normalized graph Laplacian L̃_sym = D̃^(-1/2) L̃ D̃^(-1/2), where D̃ and L̃ are the degree and Laplacian matrices of Ã. Finally, the resulting Laplacian smoothing filter becomes H = I - γ L̃_sym.
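Putting the filter construction together (self-loops, symmetric normalization, the 1/λ_max smoothing weight), a dense numpy sketch; all names are ours and a sparse implementation would be preferred for large graphs:

```python
import numpy as np

def laplacian_smoothing_filter(A, gamma=None):
    """Build H = I - gamma * L_sym from adjacency A, with self-loops added.

    If gamma is None, use 1 / lambda_max, where lambda_max is the largest
    eigenvalue of the symmetric normalized Laplacian of the self-looped graph.
    """
    n = A.shape[0]
    A_tilde = A + np.eye(n)                      # renormalization trick: add self-loops
    d = A_tilde.sum(axis=1)                      # degrees of the self-looped graph
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(n) - D_inv_sqrt @ A_tilde @ D_inv_sqrt
    if gamma is None:
        gamma = 1.0 / np.linalg.eigvalsh(L_sym).max()
    return np.eye(n) - gamma * L_sym

# Toy usage: 3-node path graph.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H = laplacian_smoothing_filter(A)
```

With this choice of γ, the filter's eigenvalues lie in [0, 1], so repeated application attenuates high-frequency components without amplifying anything.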
2.2 Equivariance and PointNetST
Neural network architectures can approximate any continuous function given sufficient capacity and expressive power Sonoda and Murata (2017); Sannai et al. (2019); Yarotsky (2021); here, we denote a neural network as a function f operating on the feature matrix X (and, by extension, on X̄).
Sets and permutation equivariance.
Conventional neural networks such as Multilayer Perceptrons (MLPs) or Convolutional Neural Networks (CNNs) act on data where there is an implicit order (e.g., adjacent pixels in images). However, graphs do not have any implicit order: nodes can be presented in a different order while the graph still maintains the same structure (isomorphism). To this end, X can be seen as a set of points {x_1, ..., x_n}, and f as a function operating on a set.

Definition 1. A function f acting on a set X is permutation equivariant when f(PX) = P f(X) for any permutation matrix P.
In other words, a permutation equivariant function produces the same output (e.g., embedding, label) for each set item (e.g., node) irrespective of the order of the given data. This means that if a transformation (e.g., a permutation of the input matrix’s rows) is applied to the input data, then the same transformation should be applied to the model’s output as well.
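The definition can be checked empirically: permute the input rows and verify the output rows are permuted identically. A small numpy helper (all names are ours):

```python
import numpy as np

def is_permutation_equivariant(f, X, trials=5, atol=1e-8):
    """Empirically check f(P X) == P f(X) for random row permutations P."""
    rng = np.random.default_rng(0)
    for _ in range(trials):
        perm = rng.permutation(X.shape[0])
        if not np.allclose(f(X[perm]), f(X)[perm], atol=atol):
            return False
    return True

X = np.arange(12, dtype=float).reshape(4, 3)
row_wise = lambda M: np.tanh(M)                          # per-row map: equivariant
indexed = lambda M: M + np.arange(M.shape[0])[:, None]   # depends on row position: not
```

Here `row_wise` passes the check while `indexed` fails it, because the latter's output depends on where each row sits in the matrix rather than on its content.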
Equivariance in GRL methods. Set equivariance is not captured by traditional neural networks such as MLPs (or CNNs), which are mainly used in the "spectral methods" in GRL Zhang et al. (2019); Cui et al. (2020). More specifically, Zaheer et al. (2017) proved that a function is permutation equivariant iff it can be decomposed into a suitable combination of an item-wise transformation and a permutation-insensitive pooling transformation. Each MLP or CNN layer, however, is a transformation that does not admit such a decomposition, and neither does a deep network composed of such layers; hence, the necessary decomposition for permutation equivariance does not hold. On the contrary, GNNs capture not only permutation equivariance but also graph connectivity, which makes them even more expressive. Nevertheless, as mentioned earlier, this high expressiveness comes with over-smoothing problems. In this work, we aim to bring together the benefits of both approaches used in GRL: we employ a set equivariant network that accounts for the graph structure (through the smoothed matrix X̄) and avoids over-smoothing at the same time (as discussed in Section 2.1). This property is crucial for our model's advantage over traditional neural networks, as will be shown in Section 3.
PointNetST. In this work, PointNetST Segol and Lipman (2019) is employed as the permutation equivariant neural network architecture. While different choices could be made, such as a deep self-attention network Wang et al. (2018) or the constructions of Keriven and Peyré (2019) and Sannai et al. (2019), PointNetST is preferred as it is provably a universal approximator over the space of equivariant functions Segol and Lipman (2019) and can be implemented as an arbitrarily deep neural network of the following form:
f(X) = L_m ∘ ν ∘ L_(m-1) ∘ ... ∘ ν ∘ L_1(X)    (3)

where ν is a non-linearity such as ReLU and L_i, i = 1, ..., m, is the DeepSet layer Zaheer et al. (2017):

L_i(X) = X Λ_i + (1/n) 1 1^T X Γ_i    (4)

with Λ_i and Γ_i being the layer's parameters. Remark: while Eq. 4 is a generic form of an equivariant transformation, note that PointNetST contains only a single layer with non-zero Γ_i.
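As an illustration, a DeepSets-style equivariant layer of this kind (a per-item linear term plus a mean-pooled global term; the exact parameterization below is our sketch, not the paper's implementation) can be written and checked for equivariance directly:

```python
import numpy as np

def deepset_layer(X, Lam, Gam):
    """Equivariant layer: per-item linear term plus a mean-pooled global term.

    The pooled term injects the same global set summary into every row,
    which is what makes the layer permutation equivariant.
    """
    n = X.shape[0]
    pooled = X.mean(axis=0, keepdims=True)      # (1/n) * 1^T X, shape (1, d)
    return X @ Lam + np.ones((n, 1)) @ (pooled @ Gam)

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 4))
Lam = rng.normal(size=(4, 3))
Gam = rng.normal(size=(4, 3))
perm = rng.permutation(5)
out = deepset_layer(X, Lam, Gam)
```

Permuting the input rows permutes the output rows identically, because the pooled term is invariant to row order.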
2.3 PointSpectrum
This work proposes PointSpectrum, an encoder-decoder architecture for GRL that consists of the following main components: (i) Laplacian smoothing of the feature matrix, (ii) PointNetST as the permutation equivariant neural network of the encoder, and (iii) a clustering module alongside the decoder. A schematic representation of PointSpectrum is depicted in Figure 1.
Input. The input of the encoder-decoder network is the k-order graph convolution X̄ of the node feature matrix X, which is computed using a Laplacian filter as described in Section 2.1. The larger the convolution order k, the deeper the node-wise interactions the model can capture; k is a hyperparameter of the model.
Encoder. The encoder is the permutation equivariant PointNetST, which generates the node embeddings Z, as described in Section 2.2. The embeddings are fed to two individual modules: the decoder and ClusterNet.
Decoder. The aim of the decoder is to reconstruct a pairwise similarity value between the computed node embeddings, based on Z. Different choices for the reconstruction loss function can be made (e.g., mean squared error Wang et al. (2019) or noise-contrastive binary cross entropy Veličković et al. (2018)). However, as connectivity is already incorporated in the smoothed signal, we use a pairwise decoder and cross entropy with negative sampling as the loss function:

L_rec = - Σ_((i,j)∈E) [ log σ(z_i^T z_j) + Σ_(j'∈N_i^-) log(1 - σ(z_i^T z_j')) ]    (5)

where N_i^- are the negative samples (i.e., non-existing edges) for node i, and σ is the sigmoid function.
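A minimal dense sketch of such a negative-sampling cross entropy (names, data layout, and the small epsilon for numerical stability are ours):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reconstruction_loss(Z, edges, neg_samples, eps=1e-10):
    """Pairwise cross entropy with negative sampling over embedding similarities.

    Z: (n, f) node embeddings; edges: iterable of positive pairs (i, j);
    neg_samples: dict mapping node i to its sampled non-neighbors.
    """
    loss = 0.0
    for i, j in edges:
        loss -= np.log(sigmoid(Z[i] @ Z[j]) + eps)                 # existing edge
        for j_neg in neg_samples.get(i, []):
            loss -= np.log(1.0 - sigmoid(Z[i] @ Z[j_neg]) + eps)   # negative sample
    return loss / max(len(edges), 1)

Z = np.array([[2.0, 0.0],
              [2.0, 0.0],
              [-2.0, 0.0]])
well_separated = reconstruction_loss(Z, [(0, 1)], {0: [2]})   # neighbors close, negatives far
mismatched = reconstruction_loss(Z, [(0, 2)], {0: [1]})       # the opposite arrangement
```

The loss is low when connected nodes have similar embeddings and negative samples dissimilar ones, and high otherwise.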
ClusterNet is a differentiable clustering module, which learns to assign nodes to clusters so as to better separate them in the embedding space Wilder et al. (2019). ClusterNet learns a distribution of soft assignments Q = [q_ic], where q_ic expresses the probability of node i belonging to cluster c, by optimizing the KL-divergence loss function:

L_clust = KL(P ‖ Q) = Σ_i Σ_c p_ic log(p_ic / q_ic)    (6)

where P = [p_ic] is a target distribution (updated in every epoch, or at different intervals) that emphasizes the more "confident" assignments Wang et al. (2019):

p_ic = (q_ic² / Σ_i q_ic) / Σ_c' (q_ic'² / Σ_i q_ic')    (7)
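Assuming the DEC-style target used in Wang et al. (2019) (square the soft assignments, normalize by cluster frequency, renormalize per node), the target distribution and clustering loss can be sketched as:

```python
import numpy as np

def target_distribution(Q):
    """Sharpen soft assignments Q into targets P: square, divide by cluster
    frequency f_c = sum_i q_ic, then renormalize each node's row."""
    weight = Q ** 2 / Q.sum(axis=0, keepdims=True)
    return weight / weight.sum(axis=1, keepdims=True)

def clustering_loss(Q, P):
    """KL(P || Q), summed over nodes and clusters."""
    return float(np.sum(P * np.log(P / Q)))

# Toy soft assignments for 3 nodes and 2 clusters.
Q = np.array([[0.9, 0.1],
              [0.6, 0.4],
              [0.2, 0.8]])
P = target_distribution(Q)
```

Each row of P is a valid distribution that is at least as "confident" as the corresponding row of Q, which is what pushes nodes towards their dominant cluster during training.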
Having computed Q, cluster centers can be extracted directly by averaging the soft assignments for each cluster. Overall, PointSpectrum optimizes the following joint loss of the decoder and ClusterNet:

L = α L_rec + β L_clust    (8)

with α and β being hyperparameters that control the importance of each component.
3 Experimental Results
In this section, we demonstrate the efficiency of PointSpectrum on benchmark GRL datasets and tasks. We first present the datasets, the experimental setup, and the baseline methods we compare against (Section 3.1). Then, we provide experimental evidence for the gains introduced by the equivariance component of PointSpectrum over traditional deep learning architectures, in terms of efficiency (Section 3.2), complexity (Section 3.3) and robustness (Section 3.4). Finally, we compare the performance of PointSpectrum against baseline and state-of-the-art methods in GRL (Section 3.5) and provide a qualitative visual analysis (Section 3.6).

3.1 Datasets & experimental setup
Datasets. We evaluated the performance of PointSpectrum on three widely used benchmark citation network datasets, namely Cora McCallum et al. (2000), Citeseer Giles et al. (1998) and Pubmed Namata et al. (2012). The statistics of these data sources are presented and further discussed in Appendix A.1.

PointSpectrum setup. For all evaluation tasks, the encoder uses a single PointNetST layer whose dimension is also the embedding dimension. For the ClusterNet module, the cluster centers are randomly initialized, and their number corresponds to the number of distinct labels in each dataset. Last, weights are initialized according to He et al. (2015) (He initialization). Details on the hyperparameter tuning are given in Appendix A.2.

Baseline methods. We compare PointSpectrum to several baseline methods (see details in Section 4), which we distinguish into four categories: (i) feature-only (K-Means), (ii) traditional GRL (DeepWalk, DNGR, TADW), (iii) GNN-based (VGAE, ARVGA and their simpler derivatives, as well as DGI and GIC), and (iv) spectral (DAEGC, AGC, AGE). In Section 3.5, we compare against all these methods on clustering, and only against the most prevalent ones on link prediction. Results are obtained directly from Mavromatis and Karypis (2020); in particular, for AGE, K-Means is used as the clustering method (instead of spectral clustering as in Cui et al. (2020)) to enable a fair comparison. In Sections 3.2-3.4, the reported results correspond to the node clustering task, since it is the most natural task for PointSpectrum, considering its architecture.

3.2 Efficiency of set equivariance
We investigate the efficiency of using a set equivariant network (PointNetST) along with Laplacian smoothing, by comparing the PointSpectrum architecture (Figure 1) against two variants with MLP and CNN neural networks in the encoder. Figure 2 shows the results for the different encoder types and different convolution orders k, for the clustering task on the Cora dataset. (The corresponding results on the Citeseer and Pubmed datasets can be found in Appendix A.5; on Citeseer PointNetST performs even better than on Cora, while on Pubmed it performs similarly to the MLP/CNN variants, due to the small number of features and the simpler graph structure.)
PointNetST achieves higher performance, is more robust (lower variance), and has fewer fluctuations with respect to the convolution order than the MLP and CNN variants.
This prevalence of PointNetST suggests that set equivariant networks are able to capture richer information when combined with Laplacian smoothing, and could be good candidates for replacing the conventional encoders of other Spectral methods as well (e.g., DAEGC, AGC, AGE).
3.3 Set equivariance vs. training convergence
Set equivariant networks can offer multifaceted benefits. As shown in Zaheer et al. (2017); Segol and Lipman (2019) and in the results of Section 3.2, they can capture richer structural information compared to the traditional MLP/CNN variants. In addition, here we show that they also aid training efficiency and computational complexity. Figure 3 depicts the value of the loss function during the model's training for the original PointNetST and the MLP/CNN variants. (Similar findings hold for the Citeseer and Pubmed datasets; see Appendix A.6.)
In both components of the loss function (reconstruction, clustering) and the joint loss, PointNetST helps PointSpectrum to converge faster than the MLP/CNN variants.
3.4 Performance on permuted data
As already shown, set equivariance offers performance and computational efficiency. In this section, we focus on the main characteristic of set equivariance: its robustness to permuted inputs (which can be considered a specific type of noisy/corrupted input data). To demonstrate this, we train PointSpectrum by randomly permuting the rows of X̄ (the input data) in every epoch. Then, the trained model is evaluated on the original X̄.
Table 1 presents the results of this experiment, as well as the initial results without permuted data during training (in parentheses). PointNetST achieves the highest performance, but more importantly, the drop in performance due to the corrupted input is significantly smaller compared to the drop in the MLP/CNN variants. This highlights that PointSpectrum can capture data permutations on which it has not been explicitly trained (e.g., in case of graph isomorphism). On the contrary, the MLP/CNN variants perform poorly on permuted data (see, e.g., the NMI/ARI metrics in the Citeseer and Pubmed datasets).
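The training protocol of this robustness experiment can be sketched as follows; `train_step` stands for any callable performing one optimization step, and all names are ours:

```python
import numpy as np

def permuted_training_epochs(X_bar, n_epochs, train_step, seed=0):
    """Feed a freshly row-permuted copy of X_bar to the model in every epoch.

    Evaluation afterwards uses the original, unpermuted X_bar, so any
    performance drop measures sensitivity to input ordering.
    """
    rng = np.random.default_rng(seed)
    for _ in range(n_epochs):
        perm = rng.permutation(X_bar.shape[0])
        train_step(X_bar[perm])

# Toy usage: record what the model would see in each epoch.
X_demo = np.arange(10.0).reshape(5, 2)
seen = []
permuted_training_epochs(X_demo, n_epochs=3, train_step=seen.append)
```

Each epoch's input contains exactly the same rows as the original matrix, only reordered, so a perfectly equivariant model is unaffected by this corruption.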
3.5 Comparison against baselines
In this section, we compare PointSpectrum's efficiency in clustering and link prediction tasks against baseline and state-of-the-art methods.
We would like to stress that the main goal of this work is to introduce set equivariance into Laplacian smoothing (spectral) GRL methods, and to demonstrate the benefits it can bring. Hence, we do not focus extensively on the hyperparameter optimization of PointSpectrum; here, we compare its performance against baselines for completeness and to demonstrate its efficiency relative to the state-of-the-art in GRL. Nevertheless, the tested PointSpectrum implementation still outperforms spectral methods, and achieves top or near-top performance in the evaluation tasks.
Clustering: The goal in clustering is to group similar nodes into classes based on the computed embeddings. In line with the related literature, the number of classes is given, and the labels provided by the datasets are used for evaluation. In Table 2 we report the best PointSpectrum results out of 10 experiment runs for 4 metrics: Accuracy (ACC), Normalized Mutual Information (NMI), Adjusted Rand Index (ARI) and Macro-F1 score (F1).
Focusing first on the spectral methods (bottom rows of Table 2), we see that the overall PointSpectrum performance (i.e., for the majority of metrics and datasets) is superior to spectral methods. When compared to all GRL methods, PointSpectrum achieves state-of-the-art performance on most metrics in the Cora and Pubmed datasets, as well as on the accuracy metric in the Citeseer dataset (in which the GNN-based models GIC and DGI perform best for the other metrics).
Link prediction: In link prediction, some of the graph's edges are hidden from the model during training, and its goal is to predict these hidden interactions based on the computed node embeddings. For this task, a set of positive and negative edges is held out for testing and another for validation. Table 3 presents the model performance on link prediction (mean value and standard deviation over 10 runs), as measured by the Area Under the Curve (AUC) and Average Precision (AP) metrics.
PointSpectrum outperforms all baselines by a significant margin on the Cora and Pubmed datasets, while on Citeseer it is the second best method after GIC, by a small margin (note also that PointSpectrum has much lower variance than GIC).
3.6 Qualitative Analysis
We evaluate PointSpectrum qualitatively by visualizing the computed embeddings using t-SNE Van der Maaten and Hinton (2008) for the Cora dataset in Figure 4 (see Appendix A.4 for visualizations on the Citeseer and Pubmed datasets).
On the one hand, we observe that PointSpectrum separates the nodes well in the embedding space as training proceeds. On the other hand, the ClusterNet component enables PointSpectrum to learn the cluster centers as well (denoted by 'x' marks), as it pushes them from a random initial point towards each group's center. Last, since training is an iterative process, node embeddings and cluster centers attract each other in turns, which explains the formation of these distinct clusters.
4 Related Work
GRL has gained a lot of attention due to the need for automated processes that can analyze large volumes of structured data, with graphs being the hallmark of such structures.
Conventional graph embeddings: The first efforts exploited well-known graph mechanisms to calculate node representations. DeepWalk Perozzi et al. (2014) and Node2Vec Grover and Leskovec (2016) utilize random walks to sample the graph and train a Word2Vec model on these samples to extract the embeddings. TADW Yang et al. (2015) applies non-negative matrix factorization on both the graph and the node features to obtain a consistent partition. Last, DNGR Cao et al. (2016) employs denoising autoencoders to find low-dimensional representations and then reconstruct the graph adjacency.
GNN-based methods: Graph Neural Networks are designed to capture graph structures, and thus have been used for learning node representations. VGAE Kipf and Welling (2016b) uses GCNs to form a variational autoencoder that learns to generate node embeddings, while ARVGA Pan et al. (2018) uses adversarial learning to train the graph autoencoder. DGI Veličković et al. (2018) leverages both local and global graph information for representation learning using contrastive learning, while GIC Mavromatis and Karypis (2020) extends DGI by forming node clusters to better separate nodes in the embedding space. In particular, ClusterNet Wilder et al. (2019), the clustering process of GIC which is crucial for its superior performance, is also incorporated in PointSpectrum, aiding performance and validating GIC's design. Although efficient, GNN methods are inevitably led to over-smoothing when capturing deep graph interactions, where node representations converge to indistinguishable vectors Chen et al. (2020); Zhou et al. (2020).

Spectral methods: On the other hand, spectral methods exploit graph filters to perform high-order graph convolution at once, thus bypassing GNNs' over-smoothing. AGC Zhang et al. (2019) uses Laplacian filtering with spectral clustering to group nodes, while AGE Cui et al. (2020) employs an autoencoder to produce node embeddings through Laplacian smoothing. Also, DAEGC Wang et al. (2019) leverages an attentional network alongside soft-labeling for self-supervision to construct the embeddings. While spectral methods address over-smoothing, they have only used conventional neural networks (MLPs, CNNs) that cannot capture graph properties (e.g., equivariance) by design; they can only learn the structural information contained in the smoothed input signal.

PointSpectrum, although a spectral method, lies at the intersection of GNN-based and spectral methods, alleviating over-smoothing through graph filtering and capturing structural information through set equivariant networks.
5 Conclusion
PointSpectrum is the first work to introduce the set equivariance property (typically a property of GNN-based methods) into spectral methods. Set equivariance is important when learning on graph data, since it is inherently designed to exploit the nature of unordered data. Our work was motivated by this, and our experimental results clearly demonstrate the performance benefits of using a set equivariant network (PointNetST) over the MLP or CNN layers that are used in spectral methods.
We deem PointSpectrum an initial effort (or a proof of concept) in the direction of integrating set equivariance with Laplacian smoothing. This is why we adopted a simple design for the model architecture, without exhaustively over-engineering its modules or tuning its hyperparameters. Nevertheless, and despite this simplicity, we have shown that PointSpectrum can achieve state-of-the-art results on benchmark datasets. This brings a positive message for the efficiency, applicability and generalizability of our approach to other spectral or more generic GRL methods.
In particular, we identify the following as promising directions for future research:
Extensions: Set equivariant networks (e.g., DeepSet or PointNetST) can easily be introduced to existing spectral methods (e.g., AGC, AGE or DAEGC) by replacing the MLP or CNN layers that they use. A more challenging direction is the extension of the proposed approach to generative models, such as VAE or GAN architectures.
Generalization: As shown, PointSpectrum performs well even under data permutations. A deeper understanding (experimental/theoretical) of its capacity to generalize on noisy, corrupted or unseen data, could provide further insights on the mechanics of using set equivariant methods on graphs, as well as lead to the design of more efficient GRL methods.
Unification: PointSpectrum has a modular design, where a set equivariant network receives as input the smoothed matrix X̄. Unifying these two operations in a single component (e.g., a new GNN layer), if possible, could simultaneously target higher performance and bypass over-smoothing.
References
Cao et al. (2016). Deep neural networks for learning graph representations. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 30.
Chen et al. (2020). Measuring and relieving the over-smoothing problem for graph neural networks from the topological view. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 3438-3445.
Cui et al. (2020). Adaptive graph encoder for attributed graph embedding. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 976-985.
Giles et al. (1998). CiteSeer: an automatic citation indexing system. In Proceedings of the Third ACM Conference on Digital Libraries, pp. 89-98.
Grover and Leskovec (2016). node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855-864.
He et al. (2015). Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026-1034.
Matrix analysis. Cambridge University Press.
Keriven and Peyré (2019). Universal invariant and equivariant graph neural networks. Advances in Neural Information Processing Systems 32, pp. 7092-7101.
Kipf and Welling (2016a). Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
Kipf and Welling (2016b). Variational graph auto-encoders. arXiv preprint arXiv:1611.07308.
Liu et al. (2019). Is a single vector enough? Exploring node polysemy for network embedding. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 932-940.
Mavromatis and Karypis (2020). Graph InfoClust: leveraging cluster-level node information for unsupervised graph representation learning. arXiv preprint arXiv:2009.06946.
McCallum et al. (2000). Automating the construction of internet portals with machine learning. Information Retrieval 3(2), pp. 127-163.
Namata et al. (2012). Query-driven active surveying for collective classification. In 10th International Workshop on Mining and Learning with Graphs, Vol. 8, p. 1.
Pan et al. (2018). Adversarially regularized graph autoencoder for graph embedding. In IJCAI International Joint Conference on Artificial Intelligence.
Perozzi et al. (2014). DeepWalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 701-710.
Sannai et al. (2019). Universal approximations of permutation invariant/equivariant functions by deep neural networks. arXiv preprint arXiv:1903.01939.
Segol and Lipman (2019). On universal equivariant set networks. In International Conference on Learning Representations.
Sonoda and Murata (2017). Neural network with unbounded activation functions is universal approximator. Applied and Computational Harmonic Analysis 43(2), pp. 233-268.
Taubin (1995). A signal processing approach to fair surface design. In Proceedings of the 22nd Annual Conference on Computer Graphics and Interactive Techniques, pp. 351-358.
Van der Maaten and Hinton (2008). Visualizing data using t-SNE. Journal of Machine Learning Research 9(11).
Veličković et al. (2018). Deep graph infomax. In International Conference on Learning Representations.
Wang et al. (2019). Attributed graph clustering: a deep attentional embedding approach. In International Joint Conference on Artificial Intelligence.
Wang et al. (2018). Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794-7803.
Wilder et al. (2019). End to end learning and optimization on graphs. Advances in Neural Information Processing Systems 32, pp. 4672-4683.
Wu et al. (2019). Simplifying graph convolutional networks. In International Conference on Machine Learning, pp. 6861-6871.
Xu et al. (2018). Representation learning on graphs with jumping knowledge networks. In International Conference on Machine Learning, pp. 5453-5462.
Yang et al. (2015). Network representation learning with rich text information. In Twenty-Fourth International Joint Conference on Artificial Intelligence.
Yarotsky (2021). Universal approximations of invariant maps by neural networks. Constructive Approximation, pp. 1-68.
Zaheer et al. (2017). Deep sets. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 3394-3404.
Zhang and Chen (2019). Inductive matrix completion based on graph neural networks. In International Conference on Learning Representations.
Zhang and Zitnik (2020). GNNGuard: defending graph neural networks against adversarial attacks. Advances in Neural Information Processing Systems 33.
Zhang et al. (2019). Attributed graph clustering via adaptive graph convolution. In 28th International Joint Conference on Artificial Intelligence (IJCAI 2019), pp. 4327-4333.
Zhao and Akoglu (2019). PairNorm: tackling oversmoothing in GNNs. In International Conference on Learning Representations.
Zhou et al. (2020). Towards deeper graph neural networks with differentiable group normalization. Advances in Neural Information Processing Systems 33.
Appendix A Appendix
a.1 Datasets
Dataset    Nodes   Edges   Features   Classes
--------   -----   -----   --------   -------
Cora        2708    5429       1433         7
Citeseer    3327    4732       3703         6
Pubmed     19717   44338        500         3

Table 4: Dataset statistics.
Table 4 presents the statistics of the datasets used throughout the experimental process. The datasets differ in their number of classes, which serve as the oracle information for computing the reported metrics (Accuracy, NMI, ARI and F1). All of them have a sparse graph structure, while Pubmed is less rich in terms of node features compared to Cora and Citeseer.
a.2 Hyperparameter setup
To conduct the experiments and validate PointSpectrum's performance, hyperparameter tuning is needed. The values corresponding to each dataset are presented in Table 5. It should be noted that the best value for the convolution order is smaller for PointSpectrum than for other methods that employ graph filtering (8, 6 and 8 for Cora, Citeseer and Pubmed, respectively). This showcases that set equivariant networks capture structural information more easily, and thus do not need the whole neighborhood information to be presented explicitly.
Regarding the scheduled hyperparameters, various configurations were investigated. Specifically, we point out the following behaviors:

constant: the hyperparameter keeps a constant value throughout training.

linear: the hyperparameter increases (decreases) linearly in every epoch towards a maximum (minimum) value, which is provided by the user.

exponential: the hyperparameter increases (decreases) exponentially in every epoch towards a maximum (minimum) value, which is provided by the user. Specifically, given the maximum value and the number of epochs, an exponentially increasing function of the current epoch x is used; for the decreasing variants, the function's results are applied in decreasing order.
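As a concrete illustration, the three schedules can be sketched as follows. The exact exponential function is not recoverable from the text, so the saturating form below (including its rate constant) is an assumption:

```python
import math

def linear_schedule(epoch, num_epochs, max_value):
    """Linearly increase from 0 to max_value over num_epochs epochs."""
    return max_value * epoch / max(num_epochs - 1, 1)

def exponential_schedule(epoch, num_epochs, max_value, rate=5.0):
    """Exponentially approach max_value (assumed saturating form and rate)."""
    return max_value * (1.0 - math.exp(-rate * epoch / max(num_epochs - 1, 1)))

def decreasing(schedule, epoch, num_epochs, max_value):
    """Decreasing variant: apply the increasing schedule's values in reverse order."""
    return schedule(num_epochs - 1 - epoch, num_epochs, max_value)
```

A constant schedule is simply a fixed value, and the decreasing variants mirror the increasing ones by reversing the order of the produced values, as described above.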
a.3 Ablation study
To validate the efficacy of PointSpectrum's components, an ablation study is conducted. First, a conventional autoencoder is tested using either an MLP or a CNN encoder. Then, PointNetST substitutes the conventional encoder, and finally ClusterNet is employed alongside the reconstruction objective. As can be seen in Table 6, the holistic PointSpectrum model achieves the best results on all three datasets. Furthermore, on Cora and Citeseer both set equivariance (through PointNetST) and the clustering objective (through ClusterNet) increase the model's performance. However, for Pubmed, although clustering is beneficial, set equivariance does not seem to help. This may relate to the low number of features compared to the graph size, meaning that the information captured by the conventional neural networks is sufficient to characterize the data, while node features (and thus their permutations) are less important.
To verify the above assumption, we have also tested the three PointSpectrum variants on a reduced number of features. Specifically, Figure 5 depicts this experiment on the Citeseer dataset, whose feature set is large enough to allow different degrees of feature reduction. Although the performance deterioration is not proportional to the feature reduction, a general trend emerges: PointNetST depends heavily on the features, while the conventional neural networks stay close to their full-feature performance regardless of the reduction. However, this may stem from conventional neural networks' tendency to overfit to the specific permutation; a more thorough investigation is needed, which is out of this work's scope.
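The set-equivariance property discussed above can be made concrete with a minimal layer in the spirit of PointNetST: a shared per-node transform plus a single global, max-pooled "transmission" term. The dimensions and the choice of max pooling are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def pointnet_st_layer(X, W1, W2, b):
    """Permutation-equivariant layer sketch: a shared per-row transform plus one
    globally pooled term, so permuting the rows of X permutes the output rows."""
    pooled = X.max(axis=0, keepdims=True)             # (1, d_in): permutation-invariant
    return np.maximum(X @ W1 + pooled @ W2 + b, 0.0)  # ReLU; shape (n, d_out)
```

Permuting the input rows permutes the output rows identically, which is exactly the equivariance property that conventional MLP/CNN encoders lack.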
a.4 Additional visual results
Here the t-SNE visualizations for Citeseer and Pubmed are presented. As depicted, PointSpectrum behaves similarly to the Cora case, separating the embeddings as training proceeds and forcing the cluster centers towards the centers of the node groups it creates. More concretely, in the case of Citeseer, nodes are well separated into distinct clusters, and therefore the trained ClusterNet centers match the actual centers. For Pubmed, though, node separation is not trivial, and nodes appear similar to nodes of other clusters. We suppose that the reason behind this phenomenon is the small number of available features for PointSpectrum to exploit, compared to Cora and Citeseer.
a.5 Influence of convolution order on Citeseer and Pubmed
The influence of the parameter k (the convolution order) on Citeseer and Pubmed is presented in Figure 7. Citeseer presents similar behavior to Cora, as it has a similar graph structure with many node attributes. However, the training behavior on Pubmed is slightly different: as shown, PointNetST does not consistently outperform its CNN variant. This can be explained by the fact that node attributes are limited in Pubmed, while the graph structure is dominant; thus, conventional neural networks can overfit on this structure and exhibit equally high performance compared to set equivariant ones.
a.6 Training Convergence on Citeseer and Pubmed
Extending the discussion on training convergence, Figure 8 depicts the comparison between the PointNetST encoder and the MLP and CNN ones. Again, Citeseer behaves similarly to Cora due to their similar structural and contextual information. Moreover, as already discussed, Pubmed presents a different graph structure and less node information, complicating set equivariance and restricting its performance. Nonetheless, despite these complications, PointNetST maintains its efficiency and accelerates training convergence, although by a much smaller margin, nearly indistinguishable from the other methods.
a.7 Laplacian Smoothing and Graph Convolution
The most important notion in the prevalent GNN-based embedding methods, such as GCN Kipf and Welling (2016a), is that neighboring nodes should be similar, and hence their features should be smoother (than those of irrelevant nodes) in the graph manifold. However, these methods capture deeper connections by stacking multiple layers, leading to deep architectures, which are known to overly smooth the node features Chen et al. (2020). To address this problem, the domain of graph signal processing considers x as a graph signal, where each one of the nodes is assigned a scalar. Then, the smoothness of the signal x depicts the similarity between all of the graph nodes. To calculate smoothness, the Rayleigh quotient Horn and Johnson (2012) over the signal x and the graph Laplacian L (essentially the normalized variance of x) is employed:

R(L, x) = \frac{x^\top L x}{x^\top x} \qquad (9)
Since neighboring nodes should be similar, a smoother signal is expected to have a lower Rayleigh quotient. To find the relation between the eigenvalues and the Rayleigh quotient, one needs to calculate the eigendecomposition of the graph Laplacian, that is L = U \Lambda U^{-1}, with U being the matrix of eigenvectors and \Lambda the diagonal matrix of eigenvalues. Then, the Rayleigh quotient of the eigenvector u_i is:

R(L, u_i) = \frac{u_i^\top L u_i}{u_i^\top u_i} = \lambda_i \qquad (10)
It can be seen that the lower Rayleigh quotients (and by extension the smoother eigenvectors) are associated with low eigenvalues, meaning low frequencies. To transfer these observations to an arbitrary signal x, the decomposition of x on the basis of U is considered:

x = U p = \sum_i p_i u_i \qquad (11)
Consequently, as smooth signals are associated with smooth eigenvectors and low eigenvalues according to Eq. 10, the filter employed should cancel the high frequencies and preserve the low ones. Laplacian smoothing filters are selected for this purpose, as they combine high performance with low computational cost Taubin (1995).
Laplacian Smoothing Filter: Here, we consider the generalized Laplacian smoothing filter as defined in Taubin (1995):

H = I - \gamma L \qquad (12)

where \gamma is a real scaling coefficient, I is the identity matrix and H is the filter matrix. Using Eq. 12, the filtered signal \tilde{x} is:

\tilde{x} = H x = \sum_i (1 - \gamma \lambda_i) p_i u_i

What this suggests is that, for H to be low-pass, the frequency response (1 - \gamma \lambda_i) should always be a declining and non-negative function of \lambda_i. It has been found that the optimal value of \gamma is 1/\lambda_{max}, with \lambda_{max} denoting the largest eigenvalue of L Cui et al. (2020).
Having defined the filter, one can introduce k-order smoothing (and thus graph convolution) by stacking k filters together. Ultimately, the overall smoothed feature matrix is:

\tilde{X} = H^k X = (I - \gamma L)^k X \qquad (13)
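The filtering pipeline of Eqs. 12 and 13 can be sketched in a few lines of NumPy. Whether self-loops are added before normalization is an implementation detail not specified here, so the plain symmetrically normalized Laplacian below is an assumption:

```python
import numpy as np

def normalized_laplacian(A):
    """Symmetrically normalized Laplacian L = I - D^{-1/2} A D^{-1/2}."""
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.eye(A.shape[0]) - D_inv_sqrt @ A @ D_inv_sqrt

def laplacian_smoothing(L, X, k, gamma=None):
    """k-order low-pass filtering: X_smooth = (I - gamma * L)^k X (Eqs. 12-13)."""
    if gamma is None:
        gamma = 1.0 / np.linalg.eigvalsh(L).max()  # optimal gamma = 1 / lambda_max
    H = np.eye(L.shape[0]) - gamma * L             # Laplacian smoothing filter (Eq. 12)
    for _ in range(k):                             # stack k filters for k-order smoothing
        X = H @ X
    return X
```

Since every eigenvalue of H then lies in [0, 1], each application attenuates the high-frequency components, so the Rayleigh quotient of a filtered signal can only decrease.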
a.8 System Specifications
All of the experiments were conducted on a computing grid with an Intel Xeon E5-2630 v4 CPU, 32 GB RAM and an Nvidia Tesla P100 GPU. Additionally, to speed up some computations, a personal computer with an Intel Core i7-6700K CPU @ 4.00 GHz, 32 GB RAM and an Nvidia GeForce GTX 1070 GPU was employed for specific experiments alongside those running on the grid.