Deep learning approaches are at the forefront of modern machine learning. While they are effective in a multitude of applications, their most impressive results are typically achieved when processing data with inherent structure that can be used to inform the network architecture or the neuron connectivity design. For example, image processing tasks gave rise to convolutional neural networks that rely on spatial organization of pixels, while time-series analysis gave rise to recurrent neural networks that leverage temporal organization in their information processing via feedback loops and memory mechanisms. The success of neural networks in such applications, traditionally associated with signal processing, has motivated the emergence of geometric deep learning, with the goal of generalizing the design of structure-aware network architectures from Euclidean spatiotemporal structures to a wide range of non-Euclidean geometries that often underlie modern data.
Geometric deep learning approaches typically use graphs as a model for data geometries, either constructed from input data (e.g., via similarity kernels) or given directly as quantified interactions between data points (Bronstein et al., 2017). Using such models, recent works have shown that graph neural networks (GNNs) perform well in multiple application fields, including biology, chemistry and social networks (Gilmer et al., 2017; Hamilton et al., 2017; Kipf and Welling, 2016). It should be noted that most GNNs consider each graph together with given node features, as a generalization of images or audio signals, and thus aim to compute whole-graph representations. These, in turn, can be applied to graph classification, for example when each graph represents the molecular structure of proteins or enzymes classified by their chemical properties (Fout et al., 2017; De Cao and Kipf, 2018; Knyazev et al., 2018).
On the other hand, methods such as graph convolutional networks (GCNs) presented by Kipf and Welling (2016) consider node-level tasks and in particular node classification. As explained in Kipf and Welling (2016), such tasks are often considered in the context of semi-supervised learning, as typically only a small portion of nodes on the graph possess labels. In these settings, the entire dataset is considered as one graph and the network is tasked with learning node representations that infer information from node features as well as the graph structure. However, most state-of-the-art approaches for incorporating graph structure information in neural network operations aim to enforce similarity between representations of adjacent (or neighboring) nodes, which essentially implements local smoothing of neuron activations over the graph (Li et al., 2018). While such smoothing operations may be sufficiently effective in whole-graph settings, they often cause degradation of results in semi-supervised node processing tasks due to oversmoothing (NT and Maehara, 2019; Li et al., 2018), as nodes become indistinguishable with deeper and increasingly complex network architectures.
In this work, we focus on the above-mentioned node-level processing and aim to tackle the oversmoothing problem by introducing neural pathways that encode higher-order forms of regularity in graphs. Our construction is inspired by recently proposed scattering networks on graphs (Gama et al., 2018; Gao et al., 2019; Zou and Lerman, 2019), which have proven to be effective for whole-graph representation and classification. These networks generalize the Euclidean scattering transform, which was originally presented by Mallat (2012) as a mathematical model for convolutional neural networks. In graph settings, the scattering construction leverages deep cascades of graph wavelets (Hammond et al., 2011; Coifman and Maggioni, 2006) and pointwise nonlinearities to capture multiple modes of variation from graph features or labels. Using the terminology of graph signal processing, these can be considered as generalized band-pass filtering operations, while GCNs (and many other GNNs) can be considered as relying on low-pass filters. Here, we propose to combine the strengths of GCNs on node-level tasks with those demonstrated by scattering networks on whole-graph tasks.
The paper is structured as follows. Sec. 1.1 and 1.2 discuss related works and the notations used in this paper. Sec. 2 provides preliminaries on graph signal processing. Sec. 3 and 4 discuss the GCN approach and geometric scattering models to briefly unify their formulation to be consistent with this work. Then, Sec. 5 and 6 present our new architecture components combining ideas from these two models. Finally, Sec. 7 provides empirical results followed by our conclusions in Sec. 8.
1.1 Related Work
As many applied fields such as Bioinformatics and Neuroscience heavily rely on the analysis of graph-structured data, the study of reliable classification methods has received much attention lately. In this work, we focus on the particular class of semi-supervised classification tasks, where Kipf and Welling (2016); Li et al. (2018) had success using GCN models. Their theoretical study reveals that graph convolutions can be interpreted as Laplacian smoothing operations, which poses fundamental limitations on the approach. Further, NT and Maehara (2019) developed a theoretical framework based on graph signal processing, relying on the relation between frequency and feature noise, to show that GNNs perform a low-pass filtering on the feature vectors. In Abu-El-Haija et al. (2019), multiple powers of the adjacency matrix were used to learn the higher-order neighborhood information, while Liao et al. (2019) used the Lanczos algorithm to construct a low-rank approximation of the graph Laplacian that efficiently gathers the multiscale information, demonstrated on citation networks and the QM8 quantum chemistry dataset. Finally, Xu et al. (2019) studied wavelets on graphs and collected higher-order neighborhood information based on wavelet transformation.
Together with the study of learned networks, recent studies have also introduced the construction of geometric scattering transforms, relying on manually crafted families of graph wavelet transforms (Gama et al., 2018; Gao et al., 2019; Zou and Lerman, 2019). Similar to the initial motivation of geometric deep learning to generalize convolutional neural networks, the geometric scattering framework generalizes the construction of Euclidean scattering from Mallat (2012) to the graph setting. Theoretical studies (e.g., Gama et al., 2018, 2019) established the stability of these generalized scattering transforms to perturbations and deformations of graphs and signals on them. Moreover, the practical application of geometric scattering to whole-graph data analysis was studied in Gao et al. (2019), achieving strong classification results on social networks and biochemistry data, which established the effectiveness of this approach.
In this work, we aim to combine the complementary strengths of GCN models and geometric scattering, and to provide a new avenue for incorporating richer notions of regularity into GNNs. Furthermore, our construction integrates trained task-driven components in geometric scattering architectures. Finally, while most previous work on geometric scattering focused on whole-graph settings, we consider node-level processing which requires new considerations about the construction.
1.2 Notations
We denote matrices and vectors with bold letters, with uppercase letters representing matrices and lowercase letters representing vectors. In particular, $\boldsymbol{I}_n \in \mathbb{R}^{n \times n}$ is used for the identity matrix and $\boldsymbol{1}_n \in \mathbb{R}^n$ denotes the vector with ones in every component. We write $\langle \cdot, \cdot \rangle$ for the standard scalar product in $\mathbb{R}^n$. We will interchangeably consider functions of graph nodes as vectors indexed by the nodes, implicitly assuming a correspondence between a node and a specific index. This carries over to matrices, where we relate nodes to column or row indices. We further use the abbreviation $[n] := \{1, \dots, n\}$ where $n \in \mathbb{N}$.
2 Graph Signal Processing
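Let $G = (V, E, w)$ be a weighted graph where $V := \{v_1, \dots, v_n\}$ is the set of nodes, $E \subseteq \{\{v_i, v_j\} : v_i, v_j \in V, i \neq j\}$ is the set of (undirected) edges and $w : E \to (0, \infty)$ assigns (positive) edge weights to the graph edges. We note that $w$ can equivalently be considered as a function of $V \times V$, where we set the weights of non-adjacent node pairs to zero. We define a graph signal as a function $x : V \to \mathbb{R}$ on the nodes of $G$ and aggregate them in a signal vector $\boldsymbol{x} \in \mathbb{R}^n$ with the $i$-th entry being $x(v_i)$.
We define the (combinatorial) graph Laplacian matrix $\boldsymbol{L} := \boldsymbol{D} - \boldsymbol{W}$, where $\boldsymbol{W} \in \mathbb{R}^{n \times n}$ is the weighted adjacency matrix of the graph $G$ given by
$$\boldsymbol{W}[v_i, v_j] := \begin{cases} w(v_i, v_j) & \text{if } \{v_i, v_j\} \in E, \\ 0 & \text{otherwise,} \end{cases}$$
and $\boldsymbol{D} \in \mathbb{R}^{n \times n}$ is the degree matrix of $G$ defined by $\boldsymbol{D} := \mathrm{diag}(d_1, \dots, d_n)$ with $d_i := \deg(v_i) := \sum_{j=1}^{n} \boldsymbol{W}[v_i, v_j]$ being the degree of the node $v_i$.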
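In practice, we work with the (symmetric) normalized Laplacian matrix
$$\boldsymbol{\mathcal{L}} := \boldsymbol{D}^{-1/2} \boldsymbol{L} \boldsymbol{D}^{-1/2} = \boldsymbol{I}_n - \boldsymbol{D}^{-1/2} \boldsymbol{W} \boldsymbol{D}^{-1/2}.$$
Conveniently, $\boldsymbol{\mathcal{L}}$ has the property of being symmetric positive semi-definite, and thus being orthogonally diagonalizable according to
$$\boldsymbol{\mathcal{L}} = \boldsymbol{Q} \boldsymbol{\Lambda} \boldsymbol{Q}^T = \sum_{i=1}^{n} \lambda_i \boldsymbol{q}_i \boldsymbol{q}_i^T,$$
where $\boldsymbol{\Lambda} := \mathrm{diag}(\lambda_1, \dots, \lambda_n)$ is the diagonal matrix with the eigenvalues on the main diagonal and $\boldsymbol{Q}$ is the orthogonal matrix with the corresponding orthonormal eigenvectors $\boldsymbol{q}_1, \dots, \boldsymbol{q}_n$ as its columns. This can be seen by writing
$$\boldsymbol{L} = \sum_{\{v_i, v_j\} \in E} \boldsymbol{L}^{(i,j)},$$
where $\boldsymbol{L}^{(i,j)}[v_i, v_i] = \boldsymbol{L}^{(i,j)}[v_j, v_j] = w(v_i, v_j)$, $\boldsymbol{L}^{(i,j)}[v_i, v_j] = \boldsymbol{L}^{(i,j)}[v_j, v_i] = -w(v_i, v_j)$, and all remaining values being zero. Expanding the inner product $\langle \boldsymbol{x}, \boldsymbol{L} \boldsymbol{x} \rangle$ according to this formulation establishes positive semi-definiteness of $\boldsymbol{\mathcal{L}}$ because the weights are strictly positive and $\langle \boldsymbol{x}, \boldsymbol{\mathcal{L}} \boldsymbol{x} \rangle = \langle \boldsymbol{y}, \boldsymbol{L} \boldsymbol{y} \rangle$ where $\boldsymbol{y} := \boldsymbol{D}^{-1/2} \boldsymbol{x}$.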
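As a short worked step for the positive semi-definiteness argument, the quadratic form can be expanded edge by edge. This is a sketch under the assumption that the combinatorial Laplacian is $\boldsymbol{L} = \boldsymbol{D} - \boldsymbol{W}$ and the normalized Laplacian is $\boldsymbol{D}^{-1/2} \boldsymbol{L} \boldsymbol{D}^{-1/2}$. The shorthand $x_i$ for the entry of $\boldsymbol{x}$ at node $v_i$ and the auxiliary vector $\boldsymbol{y}$ are introduced here only for illustration.

```latex
\begin{align*}
\langle \boldsymbol{x}, \boldsymbol{L}\boldsymbol{x} \rangle
  &= \sum_{i=1}^{n} d_i x_i^2 - \sum_{i,j=1}^{n} \boldsymbol{W}[v_i, v_j]\, x_i x_j
   = \sum_{\{v_i, v_j\} \in E} w(v_i, v_j)\,(x_i - x_j)^2 \;\geq\; 0, \\
\langle \boldsymbol{x}, \boldsymbol{D}^{-1/2}\boldsymbol{L}\boldsymbol{D}^{-1/2}\boldsymbol{x} \rangle
  &= \langle \boldsymbol{y}, \boldsymbol{L}\boldsymbol{y} \rangle
   = \sum_{\{v_i, v_j\} \in E} w(v_i, v_j)
     \left( \frac{x_i}{\sqrt{d_i}} - \frac{x_j}{\sqrt{d_j}} \right)^{2} \;\geq\; 0,
  \qquad \boldsymbol{y} := \boldsymbol{D}^{-1/2}\boldsymbol{x}.
\end{align*}
```

The first identity uses $d_i = \sum_j \boldsymbol{W}[v_i, v_j]$, so each edge contributes $w(v_i, v_j)(x_i^2 + x_j^2)$ to the first sum and $2 w(v_i, v_j) x_i x_j$ to the second.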
A detailed study of the eigenvalues reveals that
$$0 = \lambda_1 \leq \lambda_2 \leq \dots \leq \lambda_n \leq 2.$$
The lower bound is easily verified by recalling that positive semi-definite matrices only exhibit non-negative eigenvalues, followed by checking that $\boldsymbol{D}^{1/2} \boldsymbol{1}_n$ is an eigenvector corresponding to $\lambda_1 = 0$. The upper bound has been established for example in Chung (1997).
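The construction and the eigenvalue bounds can be checked numerically on a small example. The following is a minimal sketch, assuming the normalized Laplacian is formed as $\boldsymbol{D}^{-1/2}(\boldsymbol{D} - \boldsymbol{W})\boldsymbol{D}^{-1/2}$; the helper name `normalized_laplacian` and the toy graph are hypothetical and not taken from the paper.

```python
import numpy as np

def normalized_laplacian(W):
    """Return (L, L_norm) for a symmetric weighted adjacency matrix W.

    Assumes every node has positive degree so that D^{-1/2} exists.
    """
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)                        # degrees d_i = sum_j W[i, j]
    L = np.diag(d) - W                       # combinatorial Laplacian D - W
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Entry (i, j) becomes L[i, j] / sqrt(d_i * d_j), i.e. D^{-1/2} L D^{-1/2}.
    L_norm = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
    return L, L_norm

# Hypothetical toy graph: a weighted path v1 - v2 - v3 - v4.
W = np.array([[0, 1, 0, 0],
              [1, 0, 2, 0],
              [0, 2, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L, L_norm = normalized_laplacian(W)

# eigh returns ascending eigenvalues and orthonormal eigenvectors (columns of Q).
eigvals, Q = np.linalg.eigh(L_norm)

# D^{1/2} 1_n, the candidate eigenvector for the eigenvalue 0.
sqrt_deg = np.sqrt(W.sum(axis=1))
```

Here `eigvals` should lie in the interval from 0 to 2 up to floating-point error, and `L_norm @ sqrt_deg` should vanish.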
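We can interpret the $\lambda_i$, $i \in [n]$, as the frequency magnitudes and the $\boldsymbol{q}_i$ as the corresponding harmonics. We accordingly define the Fourier transform of a signal vector $\boldsymbol{x} \in \mathbb{R}^n$ by $\hat{\boldsymbol{x}}[i] := \langle \boldsymbol{x}, \boldsymbol{q}_i \rangle$ for $i \in [n]$. The corresponding inverse Fourier transform is given by $\boldsymbol{x} = \sum_{i=1}^{n} \hat{\boldsymbol{x}}[i] \boldsymbol{q}_i$. Note that this can be written compactly as
$$\hat{\boldsymbol{x}} = \boldsymbol{Q}^T \boldsymbol{x} \quad \text{and} \quad \boldsymbol{x} = \boldsymbol{Q} \hat{\boldsymbol{x}}.$$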
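The graph Fourier transform pair can be illustrated in a few lines. This sketch assumes the harmonics are the orthonormal eigenvectors of the normalized Laplacian $\boldsymbol{I}_n - \boldsymbol{D}^{-1/2}\boldsymbol{W}\boldsymbol{D}^{-1/2}$, stored as columns of `Q`; the function names and the example graph are hypothetical.

```python
import numpy as np

# Hypothetical toy graph: a triangle (nodes 0, 1, 2) with a pendant node 3.
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
d = W.sum(axis=1)
L_norm = np.eye(4) - W / np.sqrt(np.outer(d, d))   # I - D^{-1/2} W D^{-1/2}
lam, Q = np.linalg.eigh(L_norm)                     # frequencies and harmonics

def graph_fourier(x, Q):
    """Forward transform: x_hat[i] = <x, q_i>, i.e. Q^T x."""
    return Q.T @ x

def inverse_graph_fourier(x_hat, Q):
    """Inverse transform: x = sum_i x_hat[i] q_i, i.e. Q x_hat."""
    return Q @ x_hat

x = np.array([1.0, -2.0, 0.5, 3.0])                 # an arbitrary graph signal
x_hat = graph_fourier(x, Q)
x_rec = inverse_graph_fourier(x_hat, Q)
```

Because `Q` is orthogonal, `x_rec` recovers `x` and the transform preserves the Euclidean norm of the signal.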
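Finally, we introduce the concept of graph convolutions. We consider a filter $g : V \to \mathbb{R}$ defined on the set of nodes and want to convolve the corresponding filter vector $\boldsymbol{g} \in \mathbb{R}^n$ with a signal vector $\boldsymbol{x} \in \mathbb{R}^n$, i.e. $\boldsymbol{g} \star \boldsymbol{x}$. To explicitly compute this convolution, we recall that in the Euclidean setting, the Fourier transform of the convolution of two signals equals the product of their Fourier transforms. This property generalizes to graphs (Shuman et al., 2013) in the sense that $\widehat{(\boldsymbol{g} \star \boldsymbol{x})}[i] = \hat{\boldsymbol{g}}[i] \hat{\boldsymbol{x}}[i]$ for $i \in [n]$. Applying the inverse Fourier transform yields
$$\boldsymbol{g} \star \boldsymbol{x} = \sum_{i=1}^{n} \hat{\boldsymbol{g}}[i] \hat{\boldsymbol{x}}[i] \boldsymbol{q}_i = \sum_{i=1}^{n} \hat{\boldsymbol{g}}[i] \langle \boldsymbol{x}, \boldsymbol{q}_i \rangle \boldsymbol{q}_i = \boldsymbol{Q} \hat{\boldsymbol{G}} \boldsymbol{Q}^T \boldsymbol{x},$$
where $\hat{\boldsymbol{G}} := \mathrm{diag}(\hat{\boldsymbol{g}}) = \mathrm{diag}(\hat{\boldsymbol{g}}[1], \dots, \hat{\boldsymbol{g}}[n])$. Therefore, we can parametrize the convolution with a filter $\boldsymbol{g}$ by directly considering the Fourier coefficients in $\hat{\boldsymbol{G}}$.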
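Furthermore, it can be verified (Defferrard et al., 2016) that when these coefficients are defined as polynomials $\hat{\boldsymbol{g}}[i] := \sum_{k} \gamma_k \lambda_i^k$ for $i \in [n]$ of the Laplacian eigenvalues in $\boldsymbol{\Lambda}$ (i.e. $\hat{\boldsymbol{G}} = \sum_{k} \gamma_k \boldsymbol{\Lambda}^k$), the resulting filter convolutions are localized in space and can be written in terms of $\boldsymbol{\mathcal{L}}$ as
$$\boldsymbol{g} \star \boldsymbol{x} = \sum_{k} \gamma_k \boldsymbol{\mathcal{L}}^k \boldsymbol{x}$$
without requiring spectral decomposition of the normalized Laplacian. This property motivates the standard practice of using filters that have polynomial forms, which is studied in previous works on graph convolutional networks and geometric scattering as well as this work.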
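The equivalence between the spectral and the polynomial form, as well as the spatial localization, can be checked numerically. The sketch below assumes polynomial Fourier coefficients $\sum_k \gamma_k \lambda_i^k$ of the normalized Laplacian eigenvalues; the function names, the path graph and the coefficient values are illustrative assumptions.

```python
import numpy as np

def poly_filter_spectral(L_norm, gamma, x):
    """Filter x via the eigendecomposition: Q diag(sum_k gamma_k lambda^k) Q^T x."""
    lam, Q = np.linalg.eigh(L_norm)
    g_hat = sum(c * lam**k for k, c in enumerate(gamma))   # Fourier coefficients
    return Q @ (g_hat * (Q.T @ x))

def poly_filter_spatial(L_norm, gamma, x):
    """Filter x without eigendecomposition: sum_k gamma_k L_norm^k x."""
    out = np.zeros(len(x))
    Lk_x = np.asarray(x, dtype=float)        # L_norm^0 x
    for c in gamma:
        out = out + c * Lk_x
        Lk_x = L_norm @ Lk_x                 # move on to the next power of L_norm
    return out

# Hypothetical example: unweighted path graph on 5 nodes, degree-2 polynomial.
n = 5
W = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
d = W.sum(axis=1)
L_norm = np.eye(n) - W / np.sqrt(np.outer(d, d))
gamma = [0.5, -1.0, 0.25]                    # illustrative gamma_0, gamma_1, gamma_2

delta = np.zeros(n)
delta[0] = 1.0                               # impulse at the first node
y_spec = poly_filter_spectral(L_norm, gamma, delta)
y_spat = poly_filter_spatial(L_norm, gamma, delta)
```

With a degree-2 polynomial, the response to an impulse at the first node is zero at every node more than two hops away, which is the localization property in this small setting.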
3 Graph Convolutional Network
The initial idea of learning convolutional filters on graphs consists of learning the parameter $\hat{\boldsymbol{G}}$ directly. This, however, has two immediate drawbacks, as pointed out in Defferrard et al. (2016): the filters cannot be localized in the feature space and their computation is expensive ($\mathcal{O}(n^2)$). Similar to Kipf and Welling (2016), we are interested in a semi-supervised setting where only a small portion of the nodes is labeled. Their method relies on the construction of a model that not only uses the labeled data, but also leverages the intrinsic geometric information encoded in the adjacency matrix $\boldsymbol{W}$. This is done by enforcing similarity between adjacent nodes and propagating information from the labeled nodes over the whole set of nodes.
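The convolutional filter is parametrized by $\hat{\boldsymbol{G}}_\theta := \theta (2 \boldsymbol{I}_n - \boldsymbol{\Lambda})$. This parametrization yields
$$\boldsymbol{g}_\theta \star \boldsymbol{x} = \theta \left( \boldsymbol{I}_n + \boldsymbol{D}^{-1/2} \boldsymbol{W} \boldsymbol{D}^{-1/2} \right) \boldsymbol{x}. \tag{1}$$
The choice of only one learnable parameter $\theta$ is made to avoid overfitting. The matrix $\boldsymbol{I}_n + \boldsymbol{D}^{-1/2} \boldsymbol{W} \boldsymbol{D}^{-1/2}$ has eigenvalues in $[0, 2]$. This could lead to vanishing or exploding gradients. This issue is addressed by the following renormalization trick (Kipf and Welling, 2016): $\boldsymbol{I}_n + \boldsymbol{D}^{-1/2} \boldsymbol{W} \boldsymbol{D}^{-1/2} \to \tilde{\boldsymbol{D}}^{-1/2} \tilde{\boldsymbol{W}} \tilde{\boldsymbol{D}}^{-1/2}$, where $\tilde{\boldsymbol{W}} := \boldsymbol{I}_n + \boldsymbol{W}$ and $\tilde{\boldsymbol{D}}[v_i, v_i] := \sum_{j=1}^{n} \tilde{\boldsymbol{W}}[v_i, v_j]$. This operation replaces the features of each node by a weighted average of its own features and those of its neighbors. Note that the repeated execution of graph convolutions will enforce similarity throughout higher-order neighborhoods with order equal to the number of stacked layers. Setting
$$\boldsymbol{A} := \tilde{\boldsymbol{D}}^{-1/2} \tilde{\boldsymbol{W}} \tilde{\boldsymbol{D}}^{-1/2},$$
the complete layer-wise propagation rule takes the form
$$\boldsymbol{x}_j^{\ell} = \sigma \left( \sum_{i=1}^{N_{\ell-1}} \theta_i^{j} \boldsymbol{A} \boldsymbol{x}_i^{\ell-1} \right).$$
Here, $\ell$ indicates the layer with $N_\ell$ neurons, $\boldsymbol{x}_j^{\ell} \in \mathbb{R}^n$ the activation vector of the $j$-th neuron, $\theta_i^{j}$ the learned parameter of the convolution with the incoming activation vector $\boldsymbol{x}_i^{\ell-1}$ from the preceding layer and $\sigma$ an element-wise applied activation function. Written in matrix notation, this gives
$$\boldsymbol{X}^{\ell} = \sigma \left( \boldsymbol{A} \boldsymbol{X}^{\ell-1} \boldsymbol{\Theta}^{\ell} \right), \tag{2}$$
where $\boldsymbol{\Theta}^{\ell} \in \mathbb{R}^{N_{\ell-1} \times N_\ell}$ is the weight matrix of the $\ell$-th layer and $\boldsymbol{X}^{\ell} \in \mathbb{R}^{n \times N_\ell}$ contains the activations output by the $\ell$-th layer.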
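A single GCN propagation step can be sketched as follows, assuming the renormalized propagation matrix $\tilde{\boldsymbol{D}}^{-1/2}(\boldsymbol{I}_n + \boldsymbol{W})\tilde{\boldsymbol{D}}^{-1/2}$ of Kipf and Welling (2016) and the matrix form $\sigma(\boldsymbol{A}\boldsymbol{X}\boldsymbol{\Theta})$. The function names, the ReLU nonlinearity, the random weights and the toy graph are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def renormalized_adjacency(W):
    """Renormalization trick: A = D~^{-1/2} (I + W) D~^{-1/2}."""
    n = W.shape[0]
    W_tilde = W + np.eye(n)                  # add self loops
    d_tilde = W_tilde.sum(axis=1)            # degrees including the self loops
    return W_tilde / np.sqrt(np.outer(d_tilde, d_tilde))

def gcn_layer(A, X, Theta):
    """One propagation step sigma(A X Theta); ReLU is chosen only for illustration."""
    return np.maximum(A @ X @ Theta, 0.0)

# Hypothetical toy graph: a triangle (nodes 0, 1, 2) with a pendant node 3.
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A = renormalized_adjacency(W)

rng = np.random.default_rng(0)
X0 = rng.normal(size=(4, 3))                 # n = 4 nodes, 3 input features
Theta1 = rng.normal(size=(3, 2))             # maps 3 features to 2 neurons
X1 = gcn_layer(A, X0, Theta1)                # activations of shape (4, 2)
```

Column `j` of `X1` equals the nonlinearity applied to the sum over `i` of `Theta1[i, j] * A @ X0[:, i]`, which is the per-neuron form of the same rule.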
We remark that the GCN model explained above can be interpreted as a low-pass operation. For the sake of simplicity, let us consider the convolutional operation (1) before the renormalization trick. If we write the convolution operation as the summation
$$\boldsymbol{g}_\theta \star \boldsymbol{x} = \sum_{i=1}^{n} \gamma_i \hat{\boldsymbol{x}}[i] \boldsymbol{q}_i \quad \text{with} \quad \gamma_i := \theta (2 - \lambda_i),$$
we clearly see that higher weights are put on the low-frequency harmonics, while high-frequency harmonics are progressively less involved as $0 = \lambda_1 \leq \dots \leq \lambda_n \leq 2$. This indicates that the model can only access a diminishing portion of the original information contained in the input signal the more graph convolutions are stacked.
This observation is in line with the well-known oversmoothing problem (Li et al., 2018) related to GCN models. The repeated application of graph convolutions successively smooths the signals on the graph such that nodes can no longer be distinguished.
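The oversmoothing effect can be made concrete with a simplified experiment. The sketch below applies only the renormalized propagation matrix of Kipf and Welling (2016) repeatedly, leaving out weights and nonlinearities, which is a simplifying assumption; the toy graph and the depth of 100 are arbitrary choices.

```python
import numpy as np

# Hypothetical toy graph: a triangle (nodes 0, 1, 2) with a pendant node 3.
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
W_tilde = W + np.eye(4)                      # adjacency with self loops
d_tilde = W_tilde.sum(axis=1)
A = W_tilde / np.sqrt(np.outer(d_tilde, d_tilde))

def propagate(A, x, depth):
    """Apply the linear propagation x -> A x `depth` times."""
    for _ in range(depth):
        x = A @ x
    return x

rng = np.random.default_rng(0)
x0 = rng.normal(size=4)                      # an arbitrary input signal
x_deep = propagate(A, x0, 100)

# Leading eigenvector of A (eigenvalue 1) is proportional to sqrt(d_tilde).
u = np.sqrt(d_tilde) / np.linalg.norm(np.sqrt(d_tilde))
limit = (u @ x0) * u                         # projection of x0 onto that eigenvector
```

On this connected graph, `x_deep` matches `limit` to numerical precision: after many propagation steps the signal depends on the input only through a single scalar, so the node values no longer carry the original feature information.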
4 Geometric Scattering
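In this section, we recall the concept of geometric scattering on graphs. It is based on the lazy random walk matrix
$$\boldsymbol{P} := \frac{1}{2} \left( \boldsymbol{I}_n + \boldsymbol{W} \boldsymbol{D}^{-1} \right),$$
which is closely related to the graph random walk, a Markov process with transition matrix $\boldsymbol{W} \boldsymbol{D}^{-1}$. The matrix $\boldsymbol{P}$, however, allows self loops while normalizing by a factor of two in order to retain a Markov process.
Therefore, considering a distribution $\boldsymbol{\mu}_0 \in \mathbb{R}^n$ of the initial position of the lazy random walk, its positional distribution after $t$ steps is encoded by $\boldsymbol{\mu}_t = \boldsymbol{P}^t \boldsymbol{\mu}_0$.
It was shown in Gao et al. (2019) that the propagation of a graph signal vector $\boldsymbol{x}$ by this Markov operator, i.e. $\boldsymbol{P}^t \boldsymbol{x}$, is a low-pass operation which preserves the zero-frequencies of the signal while suppressing high frequencies.
This limitation is addressed by introducing the wavelet matrix of scale $2^k$,
$$\boldsymbol{\Psi}_0 := \boldsymbol{I}_n - \boldsymbol{P}, \qquad \boldsymbol{\Psi}_k := \boldsymbol{P}^{2^{k-1}} - \boldsymbol{P}^{2^k} = \boldsymbol{P}^{2^{k-1}} \left( \boldsymbol{I}_n - \boldsymbol{P}^{2^{k-1}} \right), \quad k \geq 1.$$
This leverages the fact that high frequencies can be recovered with multiscale wavelet transforms such as this one, which decompose the non-zero frequencies into dyadic frequency bands. The operation $\boldsymbol{\Psi}_k$ collects signals from a neighborhood of order $2^k$ but applies no averaging operation over them.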
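The lazy random walk and the resulting wavelet matrices can be sketched as follows. This assumes the diffusion wavelet construction of Gao et al. (2019), with $\boldsymbol{P} = \frac{1}{2}(\boldsymbol{I}_n + \boldsymbol{W}\boldsymbol{D}^{-1})$, $\boldsymbol{\Psi}_0 = \boldsymbol{I}_n - \boldsymbol{P}$ and $\boldsymbol{\Psi}_k = \boldsymbol{P}^{2^{k-1}} - \boldsymbol{P}^{2^k}$ for $k \geq 1$; the function names and the toy graph are hypothetical.

```python
import numpy as np

def lazy_random_walk(W):
    """Lazy random walk matrix P = (I + W D^{-1}) / 2 (column-stochastic)."""
    n = W.shape[0]
    d = W.sum(axis=0)                        # degrees (W is symmetric)
    return 0.5 * (np.eye(n) + W / d)         # W / d scales column j by 1 / d_j

def wavelet(P, k):
    """Wavelet matrix: Psi_0 = I - P, Psi_k = P^(2^(k-1)) - P^(2^k) for k >= 1."""
    n = P.shape[0]
    if k == 0:
        return np.eye(n) - P
    return (np.linalg.matrix_power(P, 2 ** (k - 1))
            - np.linalg.matrix_power(P, 2 ** k))

# Hypothetical toy graph: a triangle (nodes 0, 1, 2) with a pendant node 3.
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
P = lazy_random_walk(W)

K = 3                                        # largest scale index in this example
psis = [wavelet(P, k) for k in range(K + 1)]
low_pass = np.linalg.matrix_power(P, 2 ** K)
```

The wavelets telescope: `sum(psis) + low_pass` equals the identity, so the band-pass outputs together with the final low-pass output retain all of the information in the signal.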
Geometric scattering was originally introduced in the context of whole-graph classification and consisted of aggregating scattering features. These are stacked wavelet transforms parameterized via tuples $p := (k_1, \dots, k_m)$ containing the bandwidth scale parameters, which are separated by element-wise absolute value nonlinearities according to
$$\boldsymbol{U}_p \boldsymbol{x} := \boldsymbol{\Psi}_{k_m} \left| \boldsymbol{\Psi}_{k_{m-1}} \dots \left| \boldsymbol{\Psi}_{k_2} \left| \boldsymbol{\Psi}_{k_1} \boldsymbol{x} \right| \right| \dots \right|,$$
where $m$ corresponds to the length of the tuple $p$. The scattering features are aggregated over the whole graph by taking $q$-th order moments over the set of nodes,
$$\boldsymbol{S}_{p,q} \boldsymbol{x} := \sum_{i=1}^{n} \left| \boldsymbol{U}_p \boldsymbol{x}[v_i] \right|^q. \tag{4}$$
As our research is devoted to the study of node-based classification, we adapt this approach to a new context, keeping the scattering transforms on the node level by omitting the aggregation step (4). For each tuple $p$, we define the following scattering propagation rule, which mirrors the GCN rule but replaces the low-pass filter by a geometric scattering operation, resulting in
$$\boldsymbol{X}^{\ell} = \sigma \left( \boldsymbol{U}_p \boldsymbol{X}^{\ell-1} \boldsymbol{\Theta}^{\ell} \right). \tag{5}$$
We note that in practice we only use a subset of tuples, which is chosen as part of the network design explained in the following section.
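The difference between whole-graph aggregation and the node-level variant can be sketched in code. The example assumes the diffusion wavelets of Gao et al. (2019) and a cascade in which each wavelet is applied to the absolute value of the previous output; all function names, the absolute-value activation in `scattering_layer`, the chosen tuples and the random data are illustrative assumptions.

```python
import numpy as np

def lazy_random_walk(W):
    """P = (I + W D^{-1}) / 2."""
    d = W.sum(axis=0)
    return 0.5 * (np.eye(W.shape[0]) + W / d)

def wavelet(P, k):
    """Psi_0 = I - P, Psi_k = P^(2^(k-1)) - P^(2^k) for k >= 1."""
    if k == 0:
        return np.eye(P.shape[0]) - P
    return (np.linalg.matrix_power(P, 2 ** (k - 1))
            - np.linalg.matrix_power(P, 2 ** k))

def scattering_operator(P, p, X):
    """Node-level U_p X = Psi_{k_m} | ... | Psi_{k_1} X | ... | for p = (k_1, ..., k_m)."""
    out = wavelet(P, p[0]) @ X
    for k in p[1:]:
        out = wavelet(P, k) @ np.abs(out)    # absolute value between wavelets only
    return out

def scattering_moments(P, p, X, q):
    """Whole-graph features: q-th order moments summed over the nodes."""
    return np.sum(np.abs(scattering_operator(P, p, X)) ** q, axis=0)

def scattering_layer(P, p, X, Theta, sigma=np.abs):
    """Node-level propagation sigma(U_p X Theta), with no aggregation over nodes."""
    return sigma(scattering_operator(P, p, X) @ Theta)

# Hypothetical toy graph and random features and weights.
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
P = lazy_random_walk(W)
rng = np.random.default_rng(0)
X0 = rng.normal(size=(4, 3))
Theta = rng.normal(size=(3, 2))

U2 = scattering_operator(P, (1, 2), X0)      # second-order scattering, one row per node
S = scattering_moments(P, (1, 2), X0, q=2)   # one number per input feature
X1 = scattering_layer(P, (1, 2), X0, Theta)  # node-level activations
```

`S` collapses the node dimension, whereas `X1` keeps one row per node, which is what a node classification task requires.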
5 Combining GCN and Scattering Models
To combine the benefits of GCN models and geometric scatterings adapted to the node level, we now propose a hybrid network structure as shown in Fig. 1. This structure combines low-pass operations based on the GCN model with band-pass operations based on geometric scattering. We will first introduce the general architecture of the model.
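To define the layer-wise propagation rule, we introduce
$$\boldsymbol{X}_{gcn}^{\ell} := \left[ \boldsymbol{X}_{gcn,1}^{\ell} \,\|\, \dots \,\|\, \boldsymbol{X}_{gcn,C_{gcn}}^{\ell} \right] \quad \text{and} \quad \boldsymbol{X}_{sct}^{\ell} := \left[ \boldsymbol{X}_{sct,1}^{\ell} \,\|\, \dots \,\|\, \boldsymbol{X}_{sct,C_{sct}}^{\ell} \right],$$
which are the concatenations of channels $\{ \boldsymbol{X}_{gcn,k}^{\ell} \}_{k=1}^{C_{gcn}}$ and $\{ \boldsymbol{X}_{sct,k}^{\ell} \}_{k=1}^{C_{sct}}$, respectively. Every $\boldsymbol{X}_{gcn,k}^{\ell}$ is defined according to Eq. 2 with the slight modification of added biases and powers of $\boldsymbol{A}$, i.e.,
$$\boldsymbol{X}_{gcn,k}^{\ell} := \sigma \left( \boldsymbol{A}^k \boldsymbol{X}^{\ell-1} \boldsymbol{\Theta}_{gcn,k}^{\ell} + \boldsymbol{B}_{gcn,k}^{\ell} \right).$$
Note that every GCN filter uses a different propagation matrix $\boldsymbol{A}^k$ and therefore aggregates information from $k$-step neighborhoods. Similarly, we proceed with $\boldsymbol{X}_{sct,k}^{\ell}$ according to Eq. 5 and calculate
$$\boldsymbol{X}_{sct,k}^{\ell} := \sigma \left( \boldsymbol{U}_{p_k} \boldsymbol{X}^{\ell-1} \boldsymbol{\Theta}_{sct,k}^{\ell} + \boldsymbol{B}_{sct,k}^{\ell} \right),$$
where $p_k$, $k = 1, \dots, C_{sct}$, enables scatterings of different orders and scales. Finally, the GCN components and scattering components get concatenated to
$$\boldsymbol{X}^{\ell} := \left[ \boldsymbol{X}_{gcn}^{\ell} \,\|\, \boldsymbol{X}_{sct}^{\ell} \right]. \tag{6}$$
The learned parameters are the weight matrices $\boldsymbol{\Theta}_{gcn,k}^{\ell}$ and $\boldsymbol{\Theta}_{sct,k}^{\ell}$ coming from the convolutional and scattering layers. These are complemented by vectors of the biases $\boldsymbol{b}_{gcn,k}^{\ell}, \boldsymbol{b}_{sct,k}^{\ell} \in \mathbb{R}^{N_\ell}$, which are transposed and vertically concatenated $n$ times to the matrices $\boldsymbol{B}_{gcn,k}^{\ell}, \boldsymbol{B}_{sct,k}^{\ell} \in \mathbb{R}^{n \times N_\ell}$. To simplify notation, we assume here that all channels use the same number of neurons ($N_\ell$). Waiving this assumption would slightly complicate the notation but works perfectly fine in practice.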
In this work, for simplicity, we limit our architecture to three GCN channels and two scattering channels, as illustrated in Fig. 2. Inspired by the aggregation step in classical geometric scattering, we use $\sigma(\cdot) := |\cdot|^q$ with exponent $q$ as our nonlinearity. However, unlike the powers in Eq. 4, the $q$-th power here is applied at the node level instead of to moments over the whole graph, retaining the distinction between node-level activations on the graph.
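A hybrid layer with three GCN channels and two scattering channels can be sketched as below. This is a sketch under stated assumptions, not the authors' implementation: the GCN channels are assumed to use the first three powers of the renormalized propagation matrix, the scattering tuples `(1,)` and `(1, 2)` are arbitrary, the nonlinearity is taken as an element-wise absolute value raised to a power `q` whose default of 4 is only a placeholder, and all function names are hypothetical.

```python
import numpy as np

def renormalized_adjacency(W):
    """A = D~^{-1/2} (I + W) D~^{-1/2}."""
    W_tilde = W + np.eye(W.shape[0])
    d_tilde = W_tilde.sum(axis=1)
    return W_tilde / np.sqrt(np.outer(d_tilde, d_tilde))

def lazy_random_walk(W):
    """P = (I + W D^{-1}) / 2."""
    return 0.5 * (np.eye(W.shape[0]) + W / W.sum(axis=0))

def wavelet(P, k):
    """Psi_0 = I - P, Psi_k = P^(2^(k-1)) - P^(2^k) for k >= 1."""
    if k == 0:
        return np.eye(P.shape[0]) - P
    return (np.linalg.matrix_power(P, 2 ** (k - 1))
            - np.linalg.matrix_power(P, 2 ** k))

def scattering_operator(P, p, X):
    """U_p X with absolute values between consecutive wavelets."""
    out = wavelet(P, p[0]) @ X
    for k in p[1:]:
        out = wavelet(P, k) @ np.abs(out)
    return out

def hybrid_layer(W, X, gcn_params, sct_params, q=4):
    """Concatenate GCN channels (k, Theta, b) and scattering channels (p, Theta, b)."""
    A = renormalized_adjacency(W)
    P = lazy_random_walk(W)
    channels = []
    for k, Theta, b in gcn_params:           # low-pass channels using A^k
        Z = np.linalg.matrix_power(A, k) @ X @ Theta + b
        channels.append(np.abs(Z) ** q)
    for p, Theta, b in sct_params:           # band-pass channels using U_p
        Z = scattering_operator(P, p, X) @ Theta + b
        channels.append(np.abs(Z) ** q)
    return np.concatenate(channels, axis=1)  # column-wise concatenation

# Hypothetical toy setup: 4 nodes, 3 input features, 2 neurons per channel.
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
X0 = rng.normal(size=(4, 3))

def new_params():
    return rng.normal(size=(3, 2)), rng.normal(size=2)

gcn_params = [(k, *new_params()) for k in (1, 2, 3)]
sct_params = [(p, *new_params()) for p in ((1,), (1, 2))]
X1 = hybrid_layer(W, X0, gcn_params, sct_params)
```

Adding the bias vector `b` of length 2 to an array of shape `(4, 2)` broadcasts it over the nodes, which corresponds to stacking the transposed bias vector once per node. The output `X1` has five channels of two neurons each, so its shape is `(4, 10)`.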
We note that for the first layer, we set the input to have the original node features as the graph signal. Each subchannel (GCN or scattering) transforms the original feature space to a new hidden space with the dimension determined by the number of neurons encoded in the columns of the corresponding submatrix of . These transformations are learned by the network via the weights and biases. Larger matrices (i.e., more columns, as the number of nodes in the graph is fixed) indicate the weight matrices have more parameters to learn. Thus, the information in these channels can be propagated well and sufficiently represented.
In general, the width of a channel is relevant to the importance of these regularities. A wider channel suggests these frequency components are more critical and need to be sufficiently learned. Reducing the width of the channel will suppress the magnitude of information that can be learned from a particular frequency window. For more details and analysis of specific design choices in our architecture we refer the reader to Sec. 7, where we discuss them in the context of our experimental results.
6 Graph Residual Convolution
Using the combination of GCN and scattering architectures, we collect multiscale information at the node level. This information is aggregated from different localized neighborhoods, which may exhibit vastly different frequency spectra. This comes for example from varying label rates in different graph substructures. In particular, very sparse graph sections can cause problems when the scattering features actually learn the difference between labeled and unlabeled nodes, creating high-frequency noise. In the classical geometric scattering used for whole-graph representation, geometric moments were used to aggregate the node-based information, serving at the same time as a low-pass filter. As we want to keep the information localized on the node level, we choose a different approach inspired by skip connections in residual neural networks (He et al., 2016). Conceptually, this low-pass filter, which we call graph residual convolution, reduces the captured frequency spectrum up to a cutoff frequency as depicted in Fig. 3(b).
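The graph residual convolution matrix is given by
$$\boldsymbol{A}_{res}(\alpha) := \frac{1}{\alpha + 1} \left( \boldsymbol{I}_n + \alpha \boldsymbol{W} \boldsymbol{D}^{-1} \right),$$
and we apply it after the hybrid layer of GCN and scattering filters. For $\alpha = 0$, we get that $\boldsymbol{A}_{res}$ is the identity (no cutoff), while $\alpha \to \infty$ results in $\boldsymbol{W} \boldsymbol{D}^{-1}$, which can be interpreted as an interpolation between the completely lazy random walk and the non-resting random walk.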
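With the graph residual convolution matrix $\boldsymbol{A}_{res}(\alpha)$, the resulting graph residual layer takes the form
$$\boldsymbol{X}^{\ell+1} = \boldsymbol{A}_{res}(\alpha) \boldsymbol{X}^{\ell} \boldsymbol{\Theta}_{res} + \boldsymbol{B}_{res},$$
where $\boldsymbol{\Theta}_{res} \in \mathbb{R}^{N \times N_{\ell+1}}$ are learned weights, $\boldsymbol{B}_{res} \in \mathbb{R}^{n \times N_{\ell+1}}$ are learned biases (similar to the notations used previously), and $N$ is the number of features of the concatenated layer $\boldsymbol{X}^{\ell}$ in Eq. 6. If $\boldsymbol{X}^{\ell+1}$ is the final layer, we choose $N_{\ell+1}$ equal to the number of classes.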
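The interpolation behaviour of the graph residual convolution can be checked numerically. The sketch assumes the matrix has the form $\frac{1}{\alpha+1}(\boldsymbol{I}_n + \alpha \boldsymbol{W}\boldsymbol{D}^{-1})$ and that the layer is an affine map without an additional nonlinearity; both the function names and the toy data are hypothetical.

```python
import numpy as np

def residual_convolution_matrix(W, alpha):
    """A_res(alpha) = (I + alpha * W D^{-1}) / (alpha + 1)."""
    n = W.shape[0]
    d = W.sum(axis=0)                        # degrees (W is symmetric)
    return (np.eye(n) + alpha * (W / d)) / (alpha + 1.0)

def residual_layer(W, X, Theta, b, alpha):
    """Affine graph residual layer A_res(alpha) X Theta + b (assumed form)."""
    return residual_convolution_matrix(W, alpha) @ X @ Theta + b

# Hypothetical toy graph: a triangle (nodes 0, 1, 2) with a pendant node 3.
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
R = W / W.sum(axis=0)                        # non-resting random walk W D^{-1}

A_identity = residual_convolution_matrix(W, 0.0)   # no cutoff
A_lazy = residual_convolution_matrix(W, 1.0)       # lazy random walk
A_large = residual_convolution_matrix(W, 1e9)      # close to W D^{-1}

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 10))                 # e.g. output of a hybrid layer
Theta_res = rng.normal(size=(10, 3))         # 3 classes in this toy example
b_res = rng.normal(size=3)
logits = residual_layer(W, X, Theta_res, b_res, alpha=0.5)
```

For every value of `alpha` the columns of the matrix sum to one, so the operator remains a Markov matrix. The value `alpha=0.5` in the last line is an arbitrary choice for the example.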
Compared to shallow GCN models, the proposed hybrid scattering GCN is able to collect a richer range of regularity patterns from the graph. Fig. 4 shows the comparison between Label Propagation (Zhu et al., 2003), GCN used in Kipf and Welling (2016), and GCN with partially absorbing random walks from Li et al. (2018). The oversmoothing caused by the Laplacian kernel and the shallow layers, which limit the collection of information from far-ranging neighbourhoods, restrict the GCN performance when the training size shrinks. The x-axis refers to the proportion of the original training size (1.0 means that we use the original training data used in Kipf and Welling (2016), while 0.2 means we use only 20% of the original training data). By preserving more high frequencies and collecting regularity patterns from the entire graph, our scattering GCN outperforms the other methods under small training size conditions.
Compared to the partially absorbing model discussed in Li et al. (2018), it achieves better overall performance, except for extreme conditions when the training size shrinks to 20% of the original data. One explanation for this is that our model may produce high-frequency noise under insufficient labelling conditions, which our simplified architecture cannot filter out sufficiently. During training, we also notice that shuffling the dataset can result in varying performance.
We evaluate our methods on three datasets summarized in Tab. 1. We train the scattering GCN on Citeseer, Cora and Pubmed, where the nodes are documents and edges are citation links (Sen et al., 2008). The label rate is the number of labelled nodes that are used for training, divided by the total number of nodes in each dataset. A comparison of the classification results on these datasets is presented in Tab. 2.
7.1 Architecture Visualization
In our experiments, the two scattering channels are selected from three first-order scattering and two second-order scattering configurations, including , and . The architectures for different models are shown in Fig. 5. As discussed before, the band-pass filters may cause additional noise in case of insufficiently labelled data. For the Pubmed dataset, we maintain the structure of plain GCN and add narrow band-pass filters. For sufficiently labelled datasets, we expand the width of band-pass filters and enable the model to incorporate more high-frequency information.
In the following, we provide results from our experiments on the Cora dataset and discuss the choice of parameters. We use -regularization for every layer and dropout for the first layer. The network structure is shown in Fig. 5 (a). We train for a maximum of 200 epochs using Adam (Kingma and Ba, 2014).
To further explore the parameter space, we test our model with different parameters and scattering moments. The results are summarized in Fig. 6. Our experiments suggest that the graph residual convolution plays a critical role in suppressing the high-frequency noise. As shown in Fig. 6, when the graph residual convolution parameter is $\alpha = 0$, which corresponds to an identity mapping, all frequency components are propagated. As $\alpha$ increases, the performance first increases due to the removal of high-frequency noise. But as $\alpha$ increases further, e.g. when $\alpha = 1$, the layer becomes a random walk model, which stays at the initial location with a probability of one half. This low-pass operation degrades the performance to a level close to Kipf and Welling (2016).
Next, we evaluate the performance of our proposed model with different scattering transformations under different graph residual convolutions. We performed experiments on the graph-based benchmark node classification task on the Cora dataset with different parameters (five different scattering transformations and three different residual convolutions). The results are summarized in Tab. 3, 4 and 5. The rows and columns in these tables denote the two scattering transformation channels used in the scattering GCN, with the structural information shown in Fig. 5 (a).
As shown in Tab. 3, the classification accuracy drops when we use two second-order scattering transformations for the scattering channels. The accuracy is 82.2% for and . However, we notice that as we replace one second-order scattering transformation by a first-order one, the accuracy increases. This can be attributed to the narrow bandwidth of second-order scattering transformations. As mentioned before, the second-order scattering transformations are less localized in space compared to first-order ones, which suggests that the second-order scattering transformations have wider spatial support. In this case, the frequencies embedded in these channels do not capture the regularity of the graph signal and weaken the performance of the model.
Generally, we limit the models in our experiments to scatterings of a maximal order of two, as higher-order scatterings correspond to a wider spatial support and a narrower frequency band, which are less likely to contain the intrinsic regularity of the graph signal.
We also notice that by combining specific first-order () and second-order ( scattering transformations and preserving certain high-frequency components (), the performance is further increased. This can be interpreted as these filters capturing the critical regularity of the graph signal. However, except for this case, preserving high-frequencies always weakens the network performances in our experiments.
Besides that, we evaluate the performance of our scattering GCN model for different graph residual convolutions. We perform experiments with (close to identity mapping, all frequencies pass), and (smooth filter). We find that as increases, the overall performance first increases and then decreases. The average accuracy is 82.7% for , 82.9% for and 82.4% for . As discussed before, this suggests that both high frequency noise () and smooth regularity should be addressed. Thus, the parameter can be thought of as a ’tradeoff’-parameter in the frequency domain.
Our study of semi-supervised node-level classification tasks for graphs shows a new approach for addressing some of the main limitations of GCN models, which are currently the leading (or even state-of-the-art) methods for this task. In this context, we discuss the concept of regularity on graphs and expand the GCN approach, which solely consists in enforcing smoothness, to a richer notion of regularity. This is achieved by incorporating different frequency bands of graph signals, which could not be leveraged in traditional GCN models. We take our inspiration from the concept of geometric scatterings, which have only been used for whole-graph classification so far, and develop a related theory for node-based classification. We also establish the graph residual convolution, drawing inspiration from skip connections in residual networks, which alleviates concerns of high-frequency noise generated in our approach. This results in a toolbox of complementary methods that, combined together, open promising research avenues for advancing semi-supervised learning on graph data, as our experimental results presented in this paper suggest.
This work was partially funded by IVADO (l’institut de valorisation des données) and NIH grant R01GM135929.
- Mixhop: higher-order graph convolution architectures via sparsified neighborhood mixing. arXiv preprint arXiv:1905.00067.
- Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine 34 (4), pp. 18–42.
- Spectral graph theory. American Mathematical Society.
- Diffusion wavelets. Applied and Computational Harmonic Analysis 21 (1), pp. 53–94.
- MolGAN: an implicit generative model for small molecular graphs. arXiv preprint arXiv:1805.11973.
- Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3844–3852.
- Protein interface prediction using graph convolutional networks. In Advances in Neural Information Processing Systems, pp. 6530–6539.
- Stability of graph scattering transforms.
- Diffusion scattering transforms on graphs.
- Geometric scattering for graph data analysis. In International Conference on Machine Learning, pp. 2122–2131.
- Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1263–1272.
- Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034.
- Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis 30 (2), pp. 129–150.
- Deep residual learning for image recognition. pp. 770–778.
- Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations.
- Spectral multigraph networks for discovering and fusing relationships in molecules. arXiv preprint arXiv:1811.09595.
- Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence.
- Lanczosnet: multi-scale deep graph convolutional networks. arXiv preprint arXiv:1901.01484.
- Group invariant scattering. Communications on Pure and Applied Mathematics 65 (10), pp. 1331–1398.
- Revisiting graph neural networks: all we have is low-pass filters. arXiv preprint arXiv:1905.09550.
- Collective classification in network data. AI Magazine 29 (3), pp. 93–93.
- Vertex-frequency analysis on graphs.
- Graph attention networks. arXiv preprint arXiv:1710.10903.
- Graph wavelet neural network. arXiv preprint arXiv:1904.07785.
- Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 912–919.
- Graph convolutional neural networks via scattering. Applied and Computational Harmonic Analysis.