PointSpectrum: Equivariance Meets Laplacian Filtering for Graph Representation Learning

Graph Representation Learning (GRL) has become essential for modern graph data mining and learning tasks. GRL aims to capture the graph's structural information and exploit it in combination with node and edge attributes to compute low-dimensional representations. While Graph Neural Networks (GNNs) have been used in state-of-the-art GRL architectures, they have been shown to suffer from over-smoothing when many GNN layers need to be stacked. In a different GRL approach, spectral methods based on graph filtering have emerged that address over-smoothing; however, up to now, they employ traditional neural networks that cannot efficiently exploit the structure of graph data. Motivated by this, we propose PointSpectrum, a spectral method that incorporates a set equivariant network to account for a graph's structure. PointSpectrum enhances the efficiency and expressiveness of spectral methods, while it outperforms or competes with state-of-the-art GRL methods. Overall, PointSpectrum addresses over-smoothing by employing a graph filter and captures a graph's structure through set equivariance, lying at the intersection of GNNs and spectral methods. Our findings are promising for the benefits and applicability of this architectural shift for spectral methods and GRL.


1 Introduction

Graphs are universal mathematical structures – usually accompanied by a plethora of features – that are extensively used to describe real-world data in various domains such as citation and social networks Kipf and Welling (2016a); Liu et al. (2019), recommender systems Zhang and Chen (2019), or adversarial attacks Zhang and Zitnik (2020). Due to the complex structure of graphs, traditional machine learning models are insufficient for addressing graph-based tasks such as node and graph classification, link prediction and node clustering. This necessity has given rise to Graph Representation Learning (GRL) methods, which aim to capture the structure of input graph data and produce meaningful low-dimensional representations.

A main track of GRL methods is based on Graph Neural Networks (GNNs). GNNs, and specifically Graph Convolutional Networks (GCNs) Kipf and Welling (2016a), have advanced research on GRL and led to outstanding performance, as they capture a node's neighborhood influence. Nevertheless, each GNN layer considers only local one-hop node relations, which leads to unavoidably deep layer stacking in order to account for the global graph information. However, vanilla GNNs cannot be arbitrarily deep, as depth leads to over-smoothing and information loss Xu et al. (2018); Zhao and Akoglu (2019); Chen et al. (2020). To address over-smoothing, spectral methods exploiting graph filters have been introduced Wang et al. (2019); Zhang et al. (2019). However, so far, these methods have been used in conjunction with traditional neural networks (e.g., MLPs or CNNs) that cannot efficiently exploit the set equivariance property of graph data (i.e., equivariance to permutations of the input data). In the graph domain there is no implicit data ordering, and thus the existing spectral methods waste an important portion of their computational capacity.

Contribution. In this work, we propose PointSpectrum, a GRL architecture that bridges the gap between GNN-based and spectral GRL methods (Section 2). PointSpectrum is based on Laplacian smoothing (used in spectral methods to alleviate the problem of over-smoothing), while it maintains the set equivariance property of GNN-based approaches.

Specifically, PointSpectrum is an unsupervised end-to-end trainable architecture, consisting of the following components:

  • Input: a low-pass graph filter (Laplacian smoothing) is applied on the input graph data that enables the computation of k-order graph convolution for arbitrarily large k without over smoothing node features

  • Encoder: the input data are fed to a set equivariant network that generates low-dimensional node embeddings

  • Decoder: a joint loss is employed to account for the reconstruction of the input data and their better separation in the embedding space through a clustering metric.

Figure 1: PointSpectrum architecture. The encoder is a PointNetST network producing node embeddings in an equivariant manner, the pairwise decoder reconstructs the input, and ClusterNet is employed to better separate points in the embedding space.

To the best of our knowledge this is the first work that introduces set equivariance in spectral GRL methods. Our experimental results (Section 3) show that: (i) Using set equivariant networks can increase the robustness (wrt. the model parameters) and efficiency (e.g., faster convergence, expressiveness) of spectral methods. (ii) PointSpectrum presents high performance in all benchmark tasks and datasets, outperforming or competing with the state-of-the-art (detailed in Section 4). Overall, our findings showcase a new direction for GRL: the combination of spectral methods with set equivariance. Incorporating set equivariant networks in existing spectral methods can be straightforward (e.g., replacing MLPs or CNNs), and our results are promising for the efficiency of this approach.

2 Methodology

Definitions. We consider a non-directed attributed graph G = (V, E), where V is the node set with |V| = n, E is the edge set, and X ∈ R^{n×d} is the feature matrix consisting of feature vectors x_i ∈ R^d. The structural information of graph G is represented by the adjacency matrix A ∈ R^{n×n}.

Goals. The goal of GRL is to map nodes to low-dimensional embeddings, which should preserve both the structural (A) and contextual (X) information of G. We denote as Z ∈ R^{n×f} (with f much smaller than the input dimension) the matrix with the node embeddings.

Approach overview. We propose a methodology to compute an embedding matrix Z, which combines structural and contextual information in a twofold way: on one hand, it uses Laplacian smoothing (based on A) of the feature matrix X (Section 2.1) and, on the other hand, by considering the node features as a set of points, it exploits global information using a permutation equivariant network architecture (Section 2.2). Last, we leverage node clustering as a boosting component that further enhances performance by better separating nodes in the embedding space. The entire architecture that brings these components together is presented in Section 2.3.

2.1 Graph Convolution and Laplacian Smoothing

The most important notion in the prevalent GNN-based embedding methods, such as GCN Kipf and Welling (2016a), is that neighboring nodes should be similar and hence their features should be smoother than those of non-neighboring nodes in the graph manifold. However, these methods capture deeper connections by stacking multiple layers, leading to deep architectures, which are known to overly smooth the node features Chen et al. (2020). Over-smoothing occurs as each layer repeatedly smooths the original features so as to account for the deeper interactions. To address this problem, tools from graph signal processing have been adopted, and in particular graph convolution through Laplacian smoothing filters Zhang et al. (2019); Cui et al. (2020).

Specifically, “spectral methods” in GRL use a smoothed feature matrix X̄, instead of the original X, which corresponds to a k-th order graph convolution of X:

X̄ = H^k X    (1)

where the matrix H is the Laplacian smoothing filter (discussed below). The multiplication of the feature matrix X with the filter H corresponds to a 1-order graph convolution (or 1-order graph smoothing). Stacking k filters together, i.e., k multiplications with H, leads to a k-order convolution as in Eq. 1. Thus, deep network interactions are captured by the power of the filter H instead of repeated convolutions of the input features, and therefore over-smoothing is avoided.

The Laplacian Smoothing Filter, H: The intuition behind the k-order convolution of spectral methods is the following: each column of the feature matrix X (i.e., the vector with the values of all nodes for a single feature) can be considered as a graph signal. The smoothness of a signal depicts the similarity between the graph nodes. Since neighboring nodes should be similar, to capture this similarity we would need to construct a smooth signal based on X that captures node adjacency and takes into account the features of the most important neighboring nodes. It can be shown (details in Appendix A.7) that the Laplacian smoothing filter H as defined in Eq. 2 Taubin (1995) cancels high frequencies between neighboring nodes and preserves the low ones:

H = I − γL    (2)

where γ ∈ R, I is the identity matrix and L is the graph Laplacian. The graph Laplacian is defined as L = D − A, where D is the degree matrix of A, with d_i = Σ_j a_ij being the degree of node i. In order for H to be low-pass, the frequency response 1 − γλ should be a non-negative degressive function of the eigenvalues λ of L. Cui et al. (2020) showed that the optimal value of γ is 1/λ_max, with λ_max denoting the largest eigenvalue of L; in the remainder, we use this value for γ.

Figure 2: Accuracy, ARI and NMI metrics (panels a–c) of PointSpectrum with PointNetST, MLP and CNN-based networks in the encoder, solving the clustering task on the Cora dataset. For all metrics, the mean value and standard deviation of 10 experiment runs are depicted. The set equivariant PointNetST achieves higher performance and is more robust than the MLP and CNN variants, irrespective of the convolution order.

Renormalization trick: In practice, the renormalization trick Kipf and Welling (2016a), i.e., adding self-loops to the graph, has been shown to improve accuracy and shrink the graph spectral domain Wu et al. (2019). Qualitatively, this means that for every node the smoothing filter also considers its own features alongside those of its neighbors. Thus, we perform the following transformation: self-loops are added to the adjacency, Ã = A + I; we then use the symmetric normalized graph Laplacian L̃_sym = D̃^{-1/2} L̃ D̃^{-1/2}, where D̃ and L̃ are the degree and Laplacian matrices of Ã. Finally, the resulting Laplacian smoothing filter becomes H = I − γ L̃_sym.
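For concreteness, the following minimal sketch (assuming dense NumPy arrays and illustrative names; a real implementation would likely use sparse matrices) builds the renormalized smoothing filter of this section and computes the k-order smoothed features of Eq. 1:

```python
# Sketch: renormalized Laplacian smoothing filter and k-order graph convolution (Eqs. 1-2).
import numpy as np

def smoothed_features(A: np.ndarray, X: np.ndarray, k: int) -> np.ndarray:
    n = A.shape[0]
    A_tilde = A + np.eye(n)                                # renormalization trick: add self-loops
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(n) - D_inv_sqrt @ A_tilde @ D_inv_sqrt  # symmetric normalized Laplacian of A_tilde
    gamma = 1.0 / np.linalg.eigvalsh(L_sym).max()          # gamma = 1 / lambda_max
    H = np.eye(n) - gamma * L_sym                          # low-pass smoothing filter (Eq. 2)
    X_bar = X.copy()
    for _ in range(k):                                     # k-order convolution: X_bar = H^k X (Eq. 1)
        X_bar = H @ X_bar
    return X_bar
```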

2.2 Equivariance and PointNetST

Neural network architectures can approximate any continuous function given sufficient capacity and expressive power Sonoda and Murata (2017); Sannai et al. (2019); Yarotsky (2021); here, we denote a neural network as a function F operating on the feature matrix X (and by extension on X̄).

Sets and permutation equivariance.

Conventional neural networks such as Multilayer Perceptrons (MLPs) or Convolutional Neural Networks (CNNs) act on data where there is an implicit order (e.g., adjacent pixels in images). However, graphs do not have any implicit order: nodes can be presented in a different order while the graph still maintains the same structure (isomorphism). To this end, X can be seen as a set of points {x_1, ..., x_n}, and F as a function operating on a set.

Definition 1

A function F acting on a set X is permutation equivariant when F(PX) = PF(X) for any permutation matrix P applied to the rows of X.

In other words, a permutation equivariant function produces the same output (e.g., embedding, label) for each set item (e.g., node) irrespective of the order of the given data. This means that if a transformation (e.g., a permutation of the input matrix’s rows) is applied to the input data, then the same transformation should be applied to the model’s output as well.
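The definition can also be checked numerically. The snippet below is a small illustrative test (not part of the original paper) that compares F(PX) with PF(X) for random row permutations P:

```python
# Numerical check of Definition 1: F is permutation equivariant if F(P X) == P F(X)
# for any row permutation P of the input set.
import numpy as np

def is_permutation_equivariant(F, X: np.ndarray, trials: int = 5, tol: float = 1e-6) -> bool:
    rng = np.random.default_rng(0)
    for _ in range(trials):
        perm = rng.permutation(X.shape[0])                 # random reordering of the nodes (rows)
        if not np.allclose(F(X[perm]), F(X)[perm], atol=tol):
            return False
    return True

X = np.random.default_rng(1).normal(size=(6, 4))
W = np.random.default_rng(2).normal(size=(4, 3))
print(is_permutation_equivariant(lambda Y: Y @ W, X))                    # row-wise linear map: True
print(is_permutation_equivariant(lambda Y: Y[np.argsort(Y.sum(1))], X))  # row sorting: False
```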

Equivariance in GRL methods. Set equivariance is not captured by traditional neural networks such as MLPs (or CNNs), which are mainly used in the “spectral methods” in GRL Zhang et al. (2019); Cui et al. (2020). More specifically, Zaheer et al. (2017) characterized the functions that are permutation equivariant: they must combine element-wise (per-node) transformations with permutation-invariant aggregations over the set. An MLP or CNN layer, instead, is an arbitrary transformation of its ordered input, and a deep network is a composition of such layers; hence, the necessary decomposition for permutation equivariance does not hold in general. On the contrary, GNNs capture not only permutation equivariance but also graph connectivity, which makes them even more expressive. Nevertheless, as mentioned earlier, this high expressiveness comes with over-smoothing problems. In this work, we aim to bring together the benefits of both approaches used in GRL: we employ a set equivariant network that accounts for the graph structure (through the smoothed matrix X̄) and avoids over-smoothing at the same time (as discussed in Section 2.1). This property is crucial for the proposed model's prevalence over traditional neural networks, as will be shown in Section 3.

Figure 3: Total loss, reconstruction loss and clustering loss (panels a–c) of PointSpectrum with PointNetST, MLP and CNN-based networks in the encoder, solving the clustering task on the Cora dataset (mean values over 10 runs). The set equivariant PointNetST helps the model converge faster than the MLP and CNN variants.

PointNetST. In this work, PointNetST Segol and Lipman (2019) is employed as the permutation equivariant neural network architecture. While different choices can be made, such as a deep self-attention network Wang et al. (2018) or the constructions of Keriven and Peyré (2019) and Sannai et al. (2019), PointNetST is preferred as it is provably a universal approximator over the space of equivariant functions Segol and Lipman (2019) and can be implemented as an arbitrarily deep neural network with the following form:

F(X̄) = DS^(m)(σ(DS^(m−1)(⋯ σ(DS^(1)(X̄)) ⋯)))    (3)

where σ is a non-linearity such as ReLU and DS^(i), i = 1, ..., m, is the DeepSet layer Zaheer et al. (2017):

DS^(i)(Y) = Y W_1^(i) + (1/n) 1 1^T Y W_2^(i)    (4)

with W_1^(i) and W_2^(i) being the layer's parameters. Remark: while Eq. 4 is a generic form of an equivariant transformation, it is noted that PointNetST contains only a single layer with non-zero W_2^(i).
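A minimal PyTorch sketch of Eqs. 3–4 is given below; the layer sizes, the placement of the non-linearity, and the choice of which layer carries the non-zero global term are assumptions made for illustration and may differ from the authors' implementation:

```python
# Sketch of a DeepSet layer (Eq. 4) and a PointNetST-style stack (Eq. 3).
import torch
import torch.nn as nn

class DeepSetLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, use_global: bool = True):
        super().__init__()
        self.pointwise = nn.Linear(in_dim, out_dim)       # Y W1 (+ bias), applied row-wise
        self.global_term = nn.Linear(in_dim, out_dim, bias=False) if use_global else None  # (1/n) 11^T Y W2

    def forward(self, Y: torch.Tensor) -> torch.Tensor:  # Y: (n, in_dim)
        out = self.pointwise(Y)
        if self.global_term is not None:
            out = out + self.global_term(Y.mean(dim=0, keepdim=True))  # same global vector for every row
        return out

class PointNetST(nn.Module):
    """Per-point layers with a single global ('set transmission') layer."""
    def __init__(self, in_dim: int, hidden: int, emb_dim: int):
        super().__init__()
        self.layers = nn.ModuleList([
            DeepSetLayer(in_dim, hidden, use_global=False),
            DeepSetLayer(hidden, hidden, use_global=True),   # the single layer with non-zero W2
            DeepSetLayer(hidden, emb_dim, use_global=False),
        ])

    def forward(self, X_bar: torch.Tensor) -> torch.Tensor:
        Z = X_bar
        for i, layer in enumerate(self.layers):
            Z = layer(Z)
            if i < len(self.layers) - 1:
                Z = torch.relu(Z)                            # sigma: ReLU between layers
        return Z
```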

2.3 PointSpectrum

This work proposes PointSpectrum, an encoder-decoder architecture for GRL that consists of the following main components: (i) Laplacian smoothing of the feature matrix, (ii) PointNetST as the permutation equivariant neural network of the encoder, and (iii) a clustering module alongside the decoder. A schematic representation of PointSpectrum is depicted in Figure 1.

Input. The input of the encoder-decoder network is the k-order graph convolution X̄ of the node feature matrix X, which is computed using a Laplacian filter as described in Section 2.1. The larger the convolution order k, the deeper node-wise interactions the model can capture; k is a hyper-parameter of the model.

Encoder. The encoder is the permutation equivariant PointNetST, which generates the node embeddings Z, as described in Section 2.2. The embeddings are fed to two individual modules: the decoder and the ClusterNet.

Decoder: The aim of the decoder is to reconstruct a pairwise similarity value between nodes based on the computed node embeddings Z. Different choices for the reconstruction loss function can be made (e.g., minimum squared error Wang et al. (2019) or noise-contrastive binary cross entropy Veličković et al. (2018)). However, as connectivity is incorporated in the smoothed signal, we use a pairwise decoder and cross entropy with negative sampling as the loss function:

L_rec = − Σ_{(i,j)∈E} [ log σ(z_i^T z_j) + Σ_{j'∈N_i^-} log(1 − σ(z_i^T z_j')) ]    (5)

where N_i^- are the negative samples (i.e., non-existing edges) for node i.
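A minimal sketch of this loss (assuming embeddings Z and pre-sampled positive and negative node pairs; the tensor names are illustrative) could look as follows:

```python
# Sketch of the pairwise reconstruction loss with negative sampling (Eq. 5).
import torch
import torch.nn.functional as F

def reconstruction_loss(Z: torch.Tensor, pos_edges: torch.Tensor, neg_edges: torch.Tensor) -> torch.Tensor:
    # pos_edges / neg_edges: (m, 2) long tensors of node-index pairs
    pos_logits = (Z[pos_edges[:, 0]] * Z[pos_edges[:, 1]]).sum(dim=1)  # z_i^T z_j for (i, j) in E
    neg_logits = (Z[neg_edges[:, 0]] * Z[neg_edges[:, 1]]).sum(dim=1)  # z_i^T z_j' for negative samples
    pos_loss = F.binary_cross_entropy_with_logits(pos_logits, torch.ones_like(pos_logits))
    neg_loss = F.binary_cross_entropy_with_logits(neg_logits, torch.zeros_like(neg_logits))
    return pos_loss + neg_loss
```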

ClusterNet is a differentiable clustering module, which learns to assign nodes to clusters in order to better separate them in the embedding space Wilder et al. (2019). ClusterNet learns a distribution of soft assignments Q = [q_ic], where q_ic expresses the probability of node i belonging to cluster c, by optimizing a KL-divergence loss function:

L_clu = KL(P ‖ Q) = Σ_i Σ_c p_ic log(p_ic / q_ic)    (6)

where P = [p_ic] is a target distribution (updated in every epoch or using different intervals) that emphasizes the more “confident” assignments Wang et al. (2019):

p_ic = (q_ic^2 / Σ_i q_ic) / Σ_{c'} (q_ic'^2 / Σ_i q_ic')    (7)

Having computed Q, cluster centers can be extracted directly by averaging the soft assignments for each cluster. Overall, PointSpectrum optimizes the following joint loss of the Decoder and ClusterNet:

L = α L_rec + β L_clu    (8)

with α and β being hyper-parameters that control the importance of each component.
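As a rough illustration of the clustering branch, the sketch below follows a DEC-style formulation (Student's t soft assignments around trainable centers and the sharpened target of Eq. 7); this is an assumption about ClusterNet's internals made only for illustration, with alpha and beta standing for the loss weights of Eq. 8:

```python
# Sketch of the clustering objective (Eqs. 6-7) and the joint loss (Eq. 8).
import torch

def soft_assignments(Z: torch.Tensor, centers: torch.Tensor) -> torch.Tensor:
    # q_ic proportional to (1 + ||z_i - mu_c||^2)^(-1), normalized over clusters
    q = 1.0 / (1.0 + torch.cdist(Z, centers).pow(2))
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q: torch.Tensor) -> torch.Tensor:
    # Eq. 7: square the assignments, normalize by per-cluster frequency, then re-normalize per node
    w = q.pow(2) / q.sum(dim=0, keepdim=True)
    return w / w.sum(dim=1, keepdim=True)

def clustering_loss(q: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
    # Eq. 6: KL(P || Q), averaged over nodes
    return (p * (p.clamp_min(1e-12).log() - q.clamp_min(1e-12).log())).sum(dim=1).mean()

def joint_loss(rec_loss: torch.Tensor, clu_loss: torch.Tensor, alpha: float, beta: float) -> torch.Tensor:
    return alpha * rec_loss + beta * clu_loss              # Eq. 8
```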

3 Experimental Results

In this section, we demonstrate the efficiency of PointSpectrum on benchmark GRL datasets and tasks. We first present the datasets, the experimental setup, and the baseline methods we compare against (Section 3.1). Then, we provide experimental evidence for the gains introduced by the equivariance component of PointSpectrum over traditional deep learning architectures, in terms of efficiency (Section 3.2), complexity (Section 3.3) and robustness (Section 3.4). Finally, we compare the performance of PointSpectrum against baseline and state-of-the-art methods in GRL (Section 3.5) and provide a qualitative visual analysis (Section 3.6).

3.1 Datasets & experimental setup

Datasets. We evaluated the performance of PointSpectrum on three widely used benchmark citation network datasets, namely Cora McCallum et al. (2000), Citeseer Giles et al. (1998) and Pubmed Namata et al. (2012). The statistics of these data sources are presented and further discussed in Appendix A.1.

Dataset Method ACC NMI ARI F1
Cora MLP 0.275 (0.583) 0.247 (0.379) 0.170 (0.363) 0.145 (0.484)
Cora CNN 0.301 (0.571) 0.090 (0.417) 0.514 (0.346) 0.208 (0.489)
Cora PointNetST 0.625 (0.715) 0.431 (0.528) 0.385 (0.493) 0.538 (0.693)
Citeseer MLP 0.308 (0.445) 0.053 (0.199) 0.043 (0.172) 0.272 (0.400)
Citeseer CNN 0.362 (0.423) 0.098 (0.197) 0.091 (0.175) 0.314 (0.395)
Citeseer PointNetST 0.537 (0.703) 0.316 (0.430) 0.271 (0.451) 0.463 (0.613)
Pubmed MLP 0.433 (0.638) 0.022 (0.210) 0.002 (0.212) 0.269 (0.639)
Pubmed CNN 0.457 (0.684) 0.063 (0.263) 0.018 (0.298) 0.323 (0.673)
Pubmed PointNetST 0.691 (0.710) 0.298 (0.295) 0.307 (0.329) 0.687 (0.703)

Table 1: Clustering results based on the true labels. The input is randomly permuted in every epoch during training. The reported metrics result from the best clustering assignments of the original (not permuted) input. The results in the parenthesis refer to the models’ performance when trained on the original input. PointSpectrum can capture data permutations on which it has not been explicitly trained, whereas the MLP/CNN variants perform poorly.

PointSpectrum setup. For all evaluation tasks, a PointSpectrum model is used with a single PointNetST layer in the encoder, whose output dimension is the embedding dimension. For the ClusterNet module, cluster centers are randomly initialized, and their number corresponds to the number of distinct labels in each dataset. Last, weights are initialized according to He et al. (2015) (He initialization). Details of the hyperparameter tuning are given in Appendix A.2.

Baseline methods. We compare PointSpectrum to several baseline methods (see details in Section 4), which we distinguish into four categories: (i) feature-only (K-Means), (ii) traditional GRL (DeepWalk, DNGR, TADW), (iii) GNN-based (VGAE, ARVGA – and their simpler derivatives – and DGI, GIC), and (iv) Spectral (DAEGC, AGC, AGE). In Section 3.5, we compare against all these methods on clustering, and only against the most prevalent ones on link prediction. Results are obtained directly from Mavromatis and Karypis (2020); in particular, for AGE, K-Means is used as the clustering method (instead of spectral clustering as in Cui et al. (2020)) to enable a fair comparison. In Sections 3.2–3.4, the reported results correspond to the node clustering task, since it is the most natural task for PointSpectrum, considering its architecture.

3.2 Efficiency of set equivariance

We investigate the efficiency of using a set equivariant network (PointNetST) along with Laplacian smoothing (X̄), by comparing the PointSpectrum architecture (Figure 1) against two variants with MLP and CNN neural networks in the encoder. Figure 2 shows the results for the different encoder types and different convolution orders k for the clustering task on the Cora dataset. (The corresponding results on the Citeseer and Pubmed datasets can be found in Appendix A.5; on Citeseer, PointNetST performs even better than on Cora, while on Pubmed it performs similarly to the MLP/CNN variants, due to the small number of features and simpler graph structure.)

PointNetST achieves higher performance, is more robust (lower variance), and has less fluctuations with respect to the convolution order than the MLP and CNN variants.

This prevalence of PointNetST suggests that set equivariant networks are able to capture richer information when combined with Laplacian smoothing, and could be good candidates for replacing the conventional encoders of other Spectral methods as well (e.g., DAEGC, AGC, AGE).

3.3 Set equivariance vs. training convergence

Set equivariant networks can offer multi-faceted benefits. As shown in Zaheer et al. (2017); Segol and Lipman (2019) and in the results of Section 3.2, they can capture richer structural information compared to the traditional MLP/CNN variants. In addition to these benefits, here we show that they also aid in training efficiency and computational complexity. Figure 3 depicts the value of the loss function during the model's training for the original PointNetST and the MLP/CNN variants. (Similar findings hold for the Citeseer and Pubmed datasets; see Appendix A.6.)

In both components of the loss function (reconstruction, clustering) and the joint loss, PointNetST helps PointSpectrum to converge faster than the MLP/CNN variants.

Remark: PointNetST achieves a lower loss value overall (Figure 3(a)). While the clustering loss is slightly lower for the MLP/CNN variants (Figure 3(c)), this difference is infinitesimal (second decimal point) compared to the reconstruction loss term (Figure 3(b)).

Cora (ACC NMI ARI F1):
K-Means 0.492 0.321 0.230 0.368
DeepWalk 0.484 0.327 0.243 0.392
DNGR 0.419 0.318 0.142 0.340
TADW 0.560 0.441 0.332 0.481
GAE/VGAE 0.609 0.436 0.347 0.609
ARGA/ARVGA 0.711 0.526 0.495 0.693
DGI 0.713 0.564 0.511 0.682
GIC 0.725 0.537 0.508 0.694
DAEGC 0.704 0.528 0.496 0.682
AGC 0.689 0.537 0.487 0.656
AGE 0.712 0.559 - 0.682
PointSpectrum (ours) 0.736 0.529 0.516 0.711

Citeseer (ACC NMI ARI F1):
K-Means 0.540 0.305 0.279 0.409
DeepWalk 0.337 0.088 0.092 0.270
DNGR 0.326 0.180 0.044 0.300
TADW 0.455 0.291 0.228 0.414
GAE/VGAE 0.408 0.176 0.124 0.372
ARGA/ARVGA 0.581 0.338 0.301 0.525
DGI 0.688 0.444 0.450 0.657
GIC 0.696 0.453 0.465 0.654
DAEGC 0.672 0.397 0.410 0.636
AGC 0.670 0.411 0.419 0.625
AGE 0.569 0.348 - 0.544
PointSpectrum (ours) 0.703 0.430 0.451 0.613

Pubmed (ACC NMI ARI F1):
K-Means 0.398 0.001 0.002 0.195
DeepWalk 0.684 0.279 0.299 0.670
DNGR 0.458 0.155 0.054 0.467
TADW 0.511 0.001 0.001 0.335
GAE/VGAE 0.672 0.277 0.279 0.660
ARGA/ARVGA 0.690 0.305 0.306 0.678
DGI 0.533 0.181 0.166 0.186
GIC 0.673 0.319 0.291 0.704
DAEGC 0.671 0.266 0.278 0.659
AGC 0.698 0.316 0.319 0.404
AGE - - - -
PointSpectrum (ours) 0.776 0.375 0.444 0.768

Table 2: Clustering results based on the true labels. Horizontal lines discriminate feature-only, traditional GRL, GNN-based and Spectral methods. Underlined values indicate the best results among the spectral methods, and bold values the best results among all methods.

3.4 Performance on permuted data

As already shown, set equivariance offers performance and computational efficiency. In this section, we focus on the main characteristic of set equivariance: its robustness to permuted inputs (which can be considered a specific type of noisy/corrupted input data). To demonstrate this, we train PointSpectrum by randomly permuting the rows of X̄ (the input data) in every epoch. Then, the trained model is evaluated on the original X̄.
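The experiment amounts to shuffling the rows of the input, and re-indexing the supervision targets consistently, at every epoch; a minimal sketch of this step, with illustrative names, is:

```python
# Sketch: permute the rows of X_bar and re-index the positive edges consistently for one epoch.
import torch

def permute_epoch_input(X_bar: torch.Tensor, edges: torch.Tensor):
    n = X_bar.shape[0]
    perm = torch.randperm(n)              # new node ordering for this epoch
    inv = torch.empty_like(perm)
    inv[perm] = torch.arange(n)           # old node id -> its new row position
    return X_bar[perm], inv[edges]        # permuted features and re-indexed edge list
```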

Table 1 presents the results of this experiment, as well as the initial results without permuted data during training (in parentheses). PointNetST achieves the highest performance, but more importantly, the drop in performance due to the corrupted input is significantly smaller compared to the drop in the MLP/CNN variants. This highlights that PointSpectrum can capture data permutations on which it has not been explicitly trained (e.g., in case of graph isomorphism). On the contrary, the MLP/CNN variants perform poorly on permuted data (see, e.g., the NMI/ARI metrics in the Citeseer and Pubmed datasets).

3.5 Comparison against baselines

In this section, we compare PointSpectrum’s efficiency in clustering and link prediction tasks against baseline and state-of-the-art methods.

We would like to stress that the main goal of this work is to introduce set equivariance in Laplacian smoothing (Spectral) GRL methods and demonstrate the benefits it can bring. Hence, we do not place extensive emphasis on the hyperparameter optimization of PointSpectrum. Here, we compare its performance against baselines for completeness and to demonstrate its efficiency compared to the state of the art in GRL. Nevertheless, the tested PointSpectrum implementation still outperforms Spectral methods, and achieves top or near-top performance in the evaluation tasks.

Clustering: The goal in clustering is to group similar nodes into classes based on the computed embeddings. Following the related literature, the number of classes is given, and the labels provided by the datasets are used in the evaluation. In Table 2 we report the best PointSpectrum results out of 10 experiment runs for 4 metrics: Accuracy (ACC), Normalized Mutual Information (NMI), Adjusted Rand Index (ARI) and Macro-F1 score (F1).
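For reference, these metrics can be computed as in the sketch below (a standard recipe, not taken from the paper's code): ACC uses Hungarian matching between predicted clusters and labels, while NMI and ARI come directly from scikit-learn.

```python
# Sketch: clustering accuracy via Hungarian matching, plus NMI/ARI from scikit-learn.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def clustering_accuracy(labels: np.ndarray, preds: np.ndarray) -> float:
    k = int(max(labels.max(), preds.max())) + 1
    counts = np.zeros((k, k), dtype=np.int64)
    for y, p in zip(labels, preds):
        counts[p, y] += 1                                    # co-occurrence of cluster p and label y
    row, col = linear_sum_assignment(counts.max() - counts)  # best cluster-to-label mapping
    return counts[row, col].sum() / labels.size

# nmi = normalized_mutual_info_score(labels, preds)
# ari = adjusted_rand_score(labels, preds)
```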

Focusing first on the Spectral methods (bottom rows of Table 2), we see that the overall PointSpectrum performance (i.e., for the majority of metrics and datasets) is superior to Spectral methods. When compared to all GRL methods, PointSpectrum achieves state-of-the-art performance on most metrics in the Cora and Pubmed datasets, as well as on the accuracy metric in the Citeseer dataset (in which the GNN-based models GIC and DGI perform best for the other metrics).

Link prediction: In link prediction, some graph edges are hidden from the model during training, and its goal is to predict these hidden interactions based on the computed node embeddings. For this task, a fraction of the positive and negative edges is held out as the test set and another as the validation set. Table 3 presents the model performance on link prediction (mean value and standard deviation over 10 runs), as measured by the Area Under the Curve (AUC) and Average Precision (AP) metrics.
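A short sketch of this evaluation (assuming inner-product edge scores sigmoid(z_i^T z_j), which matches the pairwise decoder; names are illustrative) is:

```python
# Sketch: AUC and AP for link prediction from held-out positive/negative edges.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def link_prediction_metrics(Z: np.ndarray, pos_edges: np.ndarray, neg_edges: np.ndarray):
    def score(edges):
        logits = (Z[edges[:, 0]] * Z[edges[:, 1]]).sum(axis=1)  # z_i^T z_j
        return 1.0 / (1.0 + np.exp(-logits))                    # sigmoid
    y_true = np.concatenate([np.ones(len(pos_edges)), np.zeros(len(neg_edges))])
    y_score = np.concatenate([score(pos_edges), score(neg_edges)])
    return roc_auc_score(y_true, y_score), average_precision_score(y_true, y_score)
```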

PointSpectrum outperforms all baselines by a significant margin on the Cora and Pubmed datasets, while on Citeseer it is the second best method after GIC, by a small margin; note also that PointSpectrum has much lower variance than GIC.

Methods: DeepWalk, GAE/VGAE, ARGA/ARVGA, DGI, GIC, AGE, PointSpectrum (ours) – AUC and AP reported for Cora, Citeseer and Pubmed.

Table 3: Link prediction performance. Area Under the Curve (AUC) and Average Precision (AP) are reported. Best results on each dataset are shown in bold, and second best are underlined.

3.6 Qualitative Analysis

We evaluate PointSpectrum qualitatively by visualizing the computed embeddings using t-SNE Van der Maaten and Hinton (2008) for the Cora dataset in Figure 4 (see Appendix A.4 for visualizations on the Citeseer and Pubmed datasets).

(a) Initial
(b) Intermediate
(c) Final
Figure 4: PointSpectrum’s training behavior on Cora dataset using t-SNE for visualization. ClusterNet’s trainable centers are denoted as black ‘x’ marks.

On one hand, we observe that PointSpectrum separates the nodes well in the embedding space as training proceeds. On the other hand, the ClusterNet component enables PointSpectrum to learn the cluster centers as well (denoted as 'x' marks), as it pushes them from a random initial point towards each group's center. Last, since training is an iterative process, node embeddings and cluster centers attract each other in turns, explaining the formation of these distinct clusters.

4 Related Work

GRL has gained a lot of attention due to the need for automated processes that can analyze large volumes of structured data, with graphs being the hallmark of such structures.

Conventional graph embeddings: The first efforts exploited well-known graph mechanisms to calculate node representations. DeepWalk Perozzi et al. (2014) and Node2Vec Grover and Leskovec (2016) utilize random walks to sample the graph and train a Word2Vec model on these samples to extract the embeddings. Also, TADW Yang et al. (2015) applies non-negative matrix factorization on both the graph and the node features to get a consistent partition. Last, DNGR Cao et al. (2016) employs denoising autoencoders to find low-dimensional representations and then reconstruct the graph adjacency.

GNN-based methods: Graph Neural Networks are designed to capture graph structures, and thus have been used for learning node representations. VGAE Kipf and Welling (2016b) uses GCNs to form a variational autoencoder which learns to generate node embeddings, while ARVGA Pan et al. (2018) uses adversarial learning to train the graph autoencoder. DGI Veličković et al. (2018) leverages both local and global graph information for representation learning using contrastive learning, while GIC Mavromatis and Karypis (2020) extends DGI by forming node clusters to better separate nodes in the embedding space. In particular, ClusterNet Wilder et al. (2019) – the clustering process of GIC, which is crucial for its superior performance – is also incorporated in PointSpectrum, aiding its performance and validating GIC's design. Although efficient, GNN methods are inevitably led to over-smoothing when capturing deep graph interactions, where node representations converge to indistinguishable vectors Chen et al. (2020); Zhou et al. (2020).

Spectral methods: On the other hand, spectral methods exploit graph filters to perform high-order graph convolution at once, thus bypassing GNNs' over-smoothing. AGC Zhang et al. (2019) uses Laplacian filtering with spectral clustering to group nodes into clusters, while AGE Cui et al. (2020) employs an autoencoder to produce node embeddings through Laplacian smoothing. Also, DAEGC Wang et al. (2019) leverages an attentional network alongside soft-labeling for self-supervision to construct the embeddings. While spectral methods address over-smoothing, they have only used conventional neural networks (MLPs, CNNs) that cannot capture graph properties (e.g., equivariance) by design; they can only learn the structural information contained in the smoothed input signal.

PointSpectrum – although a spectral method – lies at the intersection of GNN-based and spectral methods, alleviating over-smoothing through graph filtering and capturing structural information through set equivariant networks.

5 Conclusion

PointSpectrum is the first work to introduce the set equivariance property (typically, a property of GNN-based methods) into spectral methods. Set equivariance is important when learning on graph data, since it is inherently designed to exploit the nature of unordered data. Our work was motivated by this, and our experimental results clearly demonstrated the performance benefits of using a set equivariant network (PointNetST) over the MLP or CNN layers that are used in spectral methods.

We deem PointSpectrum as an initial effort (or as a proof of concept) in the direction of integrating set equivariance with Laplacian smoothing. This is why we adopted a simple design for the model architecture, without exhaustively over-engineering its modules or tuning its hyperparameters. Nevertheless, and despite this simplicity, we have shown that PointSpectrum can achieve state-of-the-art results in benchmark datasets. This brings a positive message for the efficiency, applicability and generalizability of our approach to other spectral or more generic GRL methods.

In particular, we identify the following as promising directions for future research:

Extensions: Set equivariant networks (e.g., DeepSet or PointNetST) can easily be introduced to existing spectral methods (e.g., AGC, AGE or DAEGC) by replacing the MLP or CNN layers that they use. A more challenging direction is the extension of the proposed approach to generative models, such as VAE or GAN architectures.

Generalization: As shown, PointSpectrum performs well even under data permutations. A deeper understanding (experimental/theoretical) of its capacity to generalize on noisy, corrupted or unseen data, could provide further insights on the mechanics of using set equivariant methods on graphs, as well as lead to the design of more efficient GRL methods.

Unification: PointSpectrum has a modular design, where a set equivariant network receives as input the smoothed matrix X̄. Unifying these two operations in a single component (e.g., a new GNN layer), if possible, could simultaneously aim at higher performance and bypass over-smoothing.

References

  • S. Cao, W. Lu, and Q. Xu (2016) Deep neural networks for learning graph representations. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 30. Cited by: §4.
  • D. Chen, Y. Lin, W. Li, P. Li, J. Zhou, and X. Sun (2020) Measuring and relieving the over-smoothing problem for graph neural networks from the topological view. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 3438–3445. Cited by: §A.7, §1, §2.1, §4.
  • G. Cui, J. Zhou, C. Yang, and Z. Liu (2020) Adaptive graph encoder for attributed graph embedding. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 976–985. Cited by: §A.7, §2.1, §2.1, §2.2, §3.1, §4.
  • C. L. Giles, K. D. Bollacker, and S. Lawrence (1998) CiteSeer: an automatic citation indexing system. In Proceedings of the third ACM conference on Digital libraries, pp. 89–98. Cited by: §3.1.
  • A. Grover and J. Leskovec (2016) Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 855–864. Cited by: §4.
  • K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034. Cited by: §3.1.
  • R. A. Horn and C. R. Johnson (2012) Matrix analysis. Cambridge university press. Cited by: §A.7.
  • N. Keriven and G. Peyré (2019) Universal invariant and equivariant graph neural networks. Advances in Neural Information Processing Systems 32, pp. 7092–7101. Cited by: §2.2.
  • T. N. Kipf and M. Welling (2016a) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §A.7, §1, §1, §2.1, §2.1.
  • T. N. Kipf and M. Welling (2016b) Variational graph auto-encoders. arXiv preprint arXiv:1611.07308. Cited by: §4.
  • N. Liu, Q. Tan, Y. Li, H. Yang, J. Zhou, and X. Hu (2019) Is a single vector enough? exploring node polysemy for network embedding. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 932–940. Cited by: §1.
  • C. Mavromatis and G. Karypis (2020) Graph infoclust: leveraging cluster-level node information for unsupervised graph representation learning. arXiv preprint arXiv:2009.06946. Cited by: §3.1, §4.
  • A. K. McCallum, K. Nigam, J. Rennie, and K. Seymore (2000) Automating the construction of internet portals with machine learning. Information Retrieval 3 (2), pp. 127–163. Cited by: §3.1.
  • G. Namata, B. London, L. Getoor, B. Huang, and U. EDU (2012) Query-driven active surveying for collective classification. In 10th International Workshop on Mining and Learning with Graphs, Vol. 8, pp. 1. Cited by: §3.1.
  • S. Pan, R. Hu, G. Long, J. Jiang, L. Yao, and C. Zhang (2018) Adversarially regularized graph autoencoder for graph embedding. In IJCAI International Joint Conference on Artificial Intelligence, Cited by: §4.
  • B. Perozzi, R. Al-Rfou, and S. Skiena (2014) Deepwalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 701–710. Cited by: §4.
  • A. Sannai, Y. Takai, and M. Cordonnier (2019) Universal approximations of permutation invariant/equivariant functions by deep neural networks. arXiv preprint arXiv:1903.01939. Cited by: §2.2, §2.2.
  • N. Segol and Y. Lipman (2019) On universal equivariant set networks. In International Conference on Learning Representations, Cited by: §2.2, §3.3.
  • S. Sonoda and N. Murata (2017) Neural network with unbounded activation functions is universal approximator. Applied and Computational Harmonic Analysis 43 (2), pp. 233–268. Cited by: §2.2.
  • G. Taubin (1995) A signal processing approach to fair surface design. In Proceedings of the 22nd annual conference on Computer graphics and interactive techniques, pp. 351–358. Cited by: §A.7, §A.7, §2.1.
  • L. Van der Maaten and G. Hinton (2008) Visualizing data using t-sne.. Journal of machine learning research 9 (11). Cited by: §3.6.
  • P. Veličković, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, and R. D. Hjelm (2018) Deep graph infomax. In International Conference on Learning Representations, Cited by: §2.3, §4.
  • C. Wang, S. Pan, R. Hu, G. Long, J. Jiang, and C. Zhang (2019) Attributed graph clustering: a deep attentional embedding approach. In International Joint Conference on Artificial Intelligence, Cited by: §1, §2.3, §2.3, §4.
  • X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7794–7803. Cited by: §2.2.
  • B. Wilder, E. Ewing, B. Dilkina, and M. Tambe (2019) End to end learning and optimization on graphs. Advances in Neural Information Processing Systems 32, pp. 4672–4683. Cited by: §2.3, §4.
  • F. Wu, A. Souza, T. Zhang, C. Fifty, T. Yu, and K. Weinberger (2019) Simplifying graph convolutional networks. In International conference on machine learning, pp. 6861–6871. Cited by: §2.1.
  • K. Xu, C. Li, Y. Tian, T. Sonobe, K. Kawarabayashi, and S. Jegelka (2018) Representation learning on graphs with jumping knowledge networks. In International Conference on Machine Learning, pp. 5453–5462. Cited by: §1.
  • C. Yang, Z. Liu, D. Zhao, M. Sun, and E. Chang (2015) Network representation learning with rich text information. In Twenty-fourth international joint conference on artificial intelligence, Cited by: §4.
  • D. Yarotsky (2021) Universal approximations of invariant maps by neural networks. Constructive Approximation, pp. 1–68. Cited by: §2.2.
  • M. Zaheer, S. Kottur, S. Ravanbhakhsh, B. Póczos, R. Salakhutdinov, and A. J. Smola (2017) Deep sets. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 3394–3404. Cited by: §2.2, §2.2, §3.3.
  • M. Zhang and Y. Chen (2019) Inductive matrix completion based on graph neural networks. In International Conference on Learning Representations, Cited by: §1.
  • X. Zhang and M. Zitnik (2020) GNNGuard: defending graph neural networks against adversarial attacks. Advances in Neural Information Processing Systems 33. Cited by: §1.
  • X. Zhang, H. Liu, Q. Li, and X. M. Wu (2019) Attributed graph clustering via adaptive graph convolution. In 28th International Joint Conference on Artificial Intelligence, IJCAI 2019, pp. 4327–4333. Cited by: §1, §2.1, §2.2, §4.
  • L. Zhao and L. Akoglu (2019) PairNorm: tackling oversmoothing in gnns. In International Conference on Learning Representations, Cited by: §1.
  • K. Zhou, X. Huang, Y. Li, D. Zha, R. Chen, and X. Hu (2020) Towards deeper graph neural networks with differentiable group normalization. Advances in Neural Information Processing Systems 33. Cited by: §4.

Appendix A Appendix

a.1 Datasets

Dataset Nodes Edges Features Classes
Cora 2708 5429 1433 7
Citeseer 3327 4732 3703 6
Pubmed 19717 44338 500 3
Table 4: Dataset specifics

Table 4 presents the statistics of the datasets used throughout the experimental process. These datasets contain different numbers of labels, which are used as the oracle information to calculate the reported metrics (Accuracy, NMI, ARI and F1). All of them present a sparse graph structure, while Pubmed contains less rich information in terms of node features compared to Cora and Citeseer.

a.2 Hyperparameter setup

Dataset k Dropout LR Epochs Dim α type (value) β type (value)
Cora 5 0.2 0.01 500 100 const (1) const (2)
Citeseer 1 0.2 0.01 500 100 expdec (1) exp (5)
Pubmed 7 0.2 0.01 500 100 const (1) const (2)

Table 5: Hyper-parameter values for different datasets. The α and β types are: constant (const), exponential increase (exp) and exponential decrease (expdec). k refers to the convolution order, LR to the learning rate and Dim to the encoder's dimensions.

To conduct the experiments and validate PointSpectrum's performance, hyper-parameter tuning is needed. The values corresponding to each dataset are presented in Table 5. It should be noted that the best value for the convolution order k is smaller for PointSpectrum than for other methods that employ graph filtering (8, 6 and 8 for Cora, Citeseer and Pubmed, respectively). This showcases that set equivariant networks can capture structural information more easily, and thus they do not need the whole information to be presented explicitly.

Regarding the hyper-parameters α and β, various configurations were investigated. Specifically, we point out the following behaviors (a short sketch of these schedules is given after the list):

  • constant: the hyper-parameter has a constant value throughout training

  • linear: the hyper-parameter linearly increases (decreases) in every epoch to reach a maximum (minimum) value, which is also provided by the user

  • exponential: the hyper-parameter exponentially increases (decreases) in every epoch to reach a maximum (minimum) value, which is also provided by the user; the schedule is a function of the maximum value, the number of epochs and the current epoch. For the decreasing values, we sort this function's results in decreasing order.
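A minimal sketch of these schedules (one plausible exponential form is assumed here, since the exact function is not reproduced above) is:

```python
# Sketch: constant / linear / exponential schedules for a loss weight over the training epochs.
import numpy as np

def schedule(kind: str, max_value: float, epochs: int) -> np.ndarray:
    if kind == "const":
        return np.full(epochs, max_value)
    if kind == "linear":
        return np.linspace(0.0, max_value, epochs)
    if kind in ("exp", "expdec"):
        # one possible exponential ramp from 0 to max_value (assumption, not the paper's exact formula)
        vals = max_value * (np.exp(np.linspace(0.0, 1.0, epochs)) - 1.0) / (np.e - 1.0)
        return vals if kind == "exp" else vals[::-1]   # 'expdec': same values in decreasing order
    raise ValueError(f"unknown schedule type: {kind}")
```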

a.3 Ablation study

Dataset Method ACC NMI ARI F1
Cora Conventional AE (MLP) 0.578 0.414 0.333 0.526
Cora Conventional AE (CNN) 0.610 0.423 0.362 0.560
Cora + PointNetST 0.645 0.482 0.428 0.561
Cora + ClusterNet 0.736 0.529 0.516 0.711
Citeseer Conventional AE (MLP) 0.480 0.215 0.198 0.428
Citeseer Conventional AE (CNN) 0.465 0.212 0.189 0.415
Citeseer + PointNetST 0.673 0.393 0.407 0.595
Citeseer + ClusterNet 0.703 0.430 0.451 0.613
Pubmed Conventional AE (MLP) 0.607 0.248 0.206 0.596
Pubmed Conventional AE (CNN) 0.594 0.231 0.189 0.584
Pubmed + PointNetST 0.531 0.207 0.127 0.539
Pubmed + ClusterNet 0.776 0.375 0.444 0.768

Table 6: Ablation study.

To validate the efficacy of PointSpectrum's components, an ablation study is conducted. First, a conventional autoencoder is tested using either an MLP or a CNN encoder. Then, PointNetST substitutes the conventional encoder and, last, ClusterNet is also employed alongside the reconstruction objective. As can be seen in Table 6, in all three datasets the holistic PointSpectrum model achieves the best results. Furthermore, regarding Cora and Citeseer, both set equivariance (through PointNetST) and the clustering objective (through ClusterNet) increase the model's performance. However, for Pubmed, although clustering is beneficial, set equivariance does not seem to help. This may relate to the low number of features compared to the graph size, meaning that the information captured by the conventional neural networks is sufficient to characterize the data, while node features (and thus their permutations) are not that important.

To verify the above assumption, we also tested the three PointSpectrum variants on a reduced number of features. Specifically, Figure 5 depicts this experiment on the Citeseer dataset, where the number of features is large enough to allow different degrees of feature reduction. Although the performance deterioration is not proportional to the feature reduction, a general trend is shown: PointNetST is heavily dependent on the features, while the conventional neural networks seem to stay closer to their full-features performance no matter the reduction. However, this may come from the conventional neural networks' tendency to overfit to the specific permutation; therefore, a more thorough investigation is needed, which is out of this work's scope.

(a) Accuracy (Cora)
(b) NMI (Cora)
(c) ARI (Citeseer)
(d) F1 (Citeseer)
Figure 5: PointNetST, MLP and CNN-based methods' performance when trained and evaluated on a reduced number of features (depicted as pointNet, mlp and cnn, respectively).

a.4 Additional visual results

Here, the t-SNE visual analyses on Citeseer and Pubmed are presented. As depicted, PointSpectrum behaves similarly to Cora, separating the embeddings as training proceeds and forcing the cluster centers towards the centers of the node groups that it creates. More concretely, in the case of Citeseer, nodes are well separated into distinct clusters and therefore the trained ClusterNet centers match the actual centers. For Pubmed, though, node separation is not trivial and nodes appear to be similar to nodes of other clusters. We suppose that the reason behind this phenomenon is the small number of available features for PointSpectrum to exploit, compared to Cora and Citeseer.

(a) Initial
(b) Intermediate
(c) Final
Figure 6: PointSpectrum’s training behavior on Citeseer (top) and Pubmed (bottom) dataset using t-SNE for visualization. ClusterNet’s trainable centers are denoted as black ‘x’ marks.

a.5 Influence of convolution order on Citeseer and Pubmed

(a) Accuracy (Citeseer)
(b) ARI (Citeseer)
(c) NMI (Citeseer)
(d) Accuracy (Pubmed)
(e) ARI (Pubmed)
(f) NMI (Pubmed)
Figure 7: Accuracy, ARI and NMI metrics for PointNetST, MLP and CNN-based networks solving the clustering task on Citeseer and Pubmed datasets. For all three measures, the mean value and standard deviation of 10 experiment runs are depicted.

The influence of the convolution order k on Citeseer and Pubmed is presented in Figure 7. Citeseer presents similar behavior to Cora, as it has a similar graph structure with many node attributes present. However, on Pubmed the training behavior is slightly different. As shown, PointNetST does not perform consistently better than its CNN variant. This can be explained by the fact that node attributes are limited in Pubmed, while the graph structure is dominant. Thus, the conventional neural networks can overfit on this structure and exhibit performance equally high as the set equivariant one.

a.6 Training Convergence on Citeseer and Pubmed

Extending the discussion on training convergence, Figure 8 depicts the comparison between the PointNetST encoder and the MLP and CNN ones. Again, Citeseer behaves similarly to Cora due to their similar structural and contextual information. Moreover, as already discussed, Pubmed presents a different graph structure and less node information, complicating set equivariance and restricting its performance. However, despite these complications, PointNetST maintains its efficiency and accelerates the training convergence, although by a much smaller factor, nearly indistinguishable from the other methods.

(a) Loss (Citeseer)
(b) Reconstruction loss (Citeseer)
(c) Clustering loss (Citeseer)
(d) Loss (Pubmed)
(e) Reconstruction loss (Pubmed)
(f) Clustering loss (Pubmed)
Figure 8: Loss, reconstruction and clustering loss for PointNetST, MLP and CNN-based networks solving the clustering task on Citeseer and Pubmed datasets. For all three measures, the mean value and standard deviation of 10 experiment runs are depicted. PointNetST helps the model to converge faster than the MLP and CNN variants.

a.7 Laplacian Smoothing and Graph Convolution

The most important notion in the prevalent GNN-based embedding methods, such as GCN Kipf and Welling (2016a), is that neighboring nodes should be similar and hence their features should be smoother – than those of irrelevant nodes – in the graph manifold. However, these methods capture deeper connections by stacking multiple layers, leading to deep architectures, which are known to overly smooth the node features Chen et al. (2020). To address this problem, the domain of graph signal processing considers x ∈ R^n as a graph signal, where each one of the n nodes is assigned a scalar. Then, the smoothness of the signal x depicts the similarity between all of the graph nodes. To calculate smoothness, the Rayleigh quotient Horn and Johnson (2012) over the signal x and the graph Laplacian L – essentially the normalized variance of x – is employed:

R(L, x) = (x^T L x) / (x^T x)    (9)

Since neighboring nodes should be similar, a smoother signal is expected to have a lower Rayleigh quotient. To find the relation between the eigenvalues and the Rayleigh quotient, one needs to calculate the eigendecomposition of the graph Laplacian, that is L = U Λ U^T, with U being the matrix of eigenvectors and Λ the diagonal matrix of eigenvalues. Then, the Rayleigh quotient of the eigenvector u_i is:

R(L, u_i) = (u_i^T L u_i) / (u_i^T u_i) = λ_i    (10)

It can be seen that lower Rayleigh quotients – and by extension smoother eigenvectors – are correlated with low eigenvalues, meaning low frequencies. To extend these observations to every signal x, the decomposition of x on the basis of U is considered:

x = U p = Σ_i p_i u_i    (11)

Consequently, as smooth signals are associated with smooth eigenvectors and low eigenvalues according to Eq. 10, the used filter should cancel the high frequencies and preserve the low ones. Laplacian smoothing filters are selected for this purpose, as they combine high performance with low computational cost Taubin (1995).

Laplacian Smoothing Filter: Here, we consider the generalized Laplacian smoothing filter as defined in Taubin (1995):

H = I − γL    (12)

where γ ∈ R, I is the identity matrix and H is the filter matrix. Using Eq. 12, the filtered signal is:

x̄ = H x = U (I − γΛ) U^T x = Σ_i (1 − γλ_i) p_i u_i

What this suggests is that for H to be low-pass, the frequency response 1 − γλ should always decline (and remain non-negative). It has been found that the optimal value of γ is 1/λ_max, with λ_max denoting the largest eigenvalue of L Cui et al. (2020).

Having defined the filter, one can introduce k-order smoothing – and thus graph convolution – by stacking k filters together. Ultimately, the overall smoothed feature matrix is:

X̄ = H^k X    (13)
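The low-pass behaviour of H^k can be verified numerically: each graph frequency λ_i is scaled by (1 − γλ_i)^k, so high frequencies are attenuated increasingly fast as k grows. A small sketch (dense NumPy, un-normalized Laplacian as in Eq. 12) is:

```python
# Sketch: frequency response of the k-order smoothing filter H^k = (I - gamma * L)^k.
import numpy as np

def frequency_response(A: np.ndarray, k: int) -> np.ndarray:
    L = np.diag(A.sum(axis=1)) - A        # graph Laplacian L = D - A
    lam = np.linalg.eigvalsh(L)           # graph frequencies (ascending)
    gamma = 1.0 / lam.max()               # optimal gamma = 1 / lambda_max
    return (1.0 - gamma * lam) ** k       # response per frequency: 1 at lam=0, 0 at lam_max
```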

a.8 System Specifications

All of the experiments were conducted using a computing grid with an Intel Xeon E5-2630 v4 CPU, 32 GB RAM and an Nvidia Tesla P100 GPU. To speed up some computations, a personal computer with an Intel Core i7-6700K CPU @ 4.00GHz, 32 GB RAM and an Nvidia GeForce GTX 1070 GPU was also employed for specific experiments, alongside the ones running on the grid.