Scattering GCN: Overcoming Oversmoothness in Graph Convolutional Networks

03/18/2020 ∙ by Yimeng Min, et al. ∙ Université de Montréal

Graph convolutional networks (GCNs) have shown promising results in processing graph data by extracting structure-aware features. This gave rise to extensive work in geometric deep learning, focusing on designing network architectures that ensure neuron activations conform to regularity patterns within the input graph. However, in most cases the graph structure is only accounted for by considering the similarity of activations between adjacent nodes, which in turn degrades the results. In this work, we augment GCN models by incorporating richer notions of regularity by leveraging cascades of band-pass filters, known as geometric scatterings. The produced graph features incorporate multiscale representations of local graph structures, while avoiding overly smooth activations forced by previous architectures. Moreover, inspired by skip connections used in residual networks, we introduce graph residual convolutions that reduce high-frequency noise caused by joining together information at multiple scales. Our hybrid architecture introduces a new model for semi-supervised learning on graph-structured data, and its potential is demonstrated for node classification tasks on multiple graph datasets, where it outperforms leading GCN models.




1 Introduction

Deep learning approaches are at the forefront of modern machine learning. While they are effective in a multitude of applications, their most impressive results are typically achieved when processing data with inherent structure that can be used to inform the network architecture or the neuron connectivity design. For example, image processing tasks gave rise to convolutional neural networks that rely on spatial organization of pixels, while time-series analysis gave rise to recurrent neural networks that leverage temporal organization in their information processing via feedback loops and memory mechanisms. The success of neural networks in such applications, traditionally associated with signal processing, has motivated the emergence of geometric deep learning, with the goal of generalizing the design of structure-aware network architectures from Euclidean spatiotemporal structures to a wide range of non-Euclidean geometries that often underlie modern data.

Geometric deep learning approaches typically use graphs as a model for data geometries, either by constructing them from input data (e.g., via similarity kernels) or directly given as quantified interactions between data points (Bronstein et al., 2017). Using such models, recent works have shown that graph neural networks (GNNs) perform well in multiple application fields, including biology, chemistry and social networks (Gilmer et al., 2017; Hamilton et al., 2017; Kipf and Welling, 2016). It should be noted that most GNNs consider each graph together with given node features, as a generalization of images or audio signals, and thus aim to compute whole-graph representations. These, in turn, can be applied to graph classification, for example when each graph represents the molecular structure of proteins or enzymes classified by their chemical properties (Fout et al., 2017; De Cao and Kipf, 2018; Knyazev et al., 2018).

On the other hand, methods such as graph convolutional networks (GCNs) presented by Kipf and Welling (2016) consider node-level tasks and in particular node classification. As explained in Kipf and Welling (2016), such tasks are often considered in the context of semi-supervised learning, as typically only a small portion of nodes on the graph possess labels. In these settings, the entire dataset is considered as one graph and the network is tasked with learning node representations that infer information from node features as well as the graph structure. However, most state-of-the-art approaches for incorporating graph structure information in neuronal network operations aim to enforce similarity between representations of adjacent (or neighboring) nodes, which essentially implements local smoothing of neuron activations over the graph (Li et al., 2018). While such smoothing operations may be sufficiently effective in whole-graph settings, they often cause degradation of results in semi-supervised node processing tasks due to oversmoothing (NT and Maehara, 2019; Li et al., 2018), as nodes become indistinguishable with deeper and increasingly complex network architectures.

In this work, we focus on the above-mentioned node-level processing and aim to tackle the oversmoothing problem by introducing neural pathways that encode higher-order forms of regularity in graphs. Our construction is inspired by recently proposed scattering networks on graphs (Gama et al., 2018; Gao et al., 2019; Zou and Lerman, 2019), which have proven to be effective for whole-graph representation and classification. These networks generalize the Euclidean scattering transform, which was originally presented by Mallat (2012) as a mathematical model for convolutional neural networks. In graph settings, the scattering construction leverages deep cascades of graph wavelets (Hammond et al., 2011; Coifman and Maggioni, 2006) and pointwise nonlinearities to capture multiple modes of variation from graph features or labels. Using the terminology of graph signal processing, these can be considered as generalized band-pass filtering operations, while GCNs (and many other GNNs) can be considered as relying on low-pass filters. Here, we propose to combine the strengths of GCNs on node-level tasks with those indicated by scattering networks on whole-graph tasks.

The paper is structured as follows. Sec. 1.1 and 1.2 discuss related works and the notations used in this paper. Sec. 2 provides preliminaries on graph signal processing. Sec. 3 and 4 discuss the GCN approach and geometric scattering models to briefly unify their formulation to be consistent with this work. Then, Sec. 5 and 6 present our new architecture components combining ideas from these two models. Finally, Sec. 7 provides empirical results followed by our conclusions in Sec. 8.

1.1 Related Work

As many applied fields such as bioinformatics and neuroscience rely heavily on the analysis of graph-structured data, the study of reliable classification methods has received much attention lately. In this work, we focus on the particular class of semi-supervised classification tasks, where Kipf and Welling (2016) and Li et al. (2018) had success using GCN models. Their theoretical study reveals that graph convolutions can be interpreted as Laplacian smoothing operations, which poses fundamental limitations on the approach. Further, NT and Maehara (2019) developed a theoretical framework based on graph signal processing, relying on the relation between frequency and feature noise, to show that GNNs perform a low-pass filtering on the feature vectors. In Abu-El-Haija et al. (2019), multiple powers of the adjacency matrix were used to learn the higher-order neighborhood information, while Liao et al. (2019) used the Lanczos algorithm to construct a low-rank approximation of the graph Laplacian that efficiently gathers the multiscale information, demonstrated on citation networks and the QM8 quantum chemistry dataset. Finally, Xu et al. (2019) studied wavelets on graphs and collected higher-order neighborhood information based on wavelet transformation.

Together with the study of learned networks, recent studies have also introduced the construction of geometric scattering transforms, relying on manually crafted families of graph wavelet transforms (Gama et al., 2018; Gao et al., 2019; Zou and Lerman, 2019). Similar to the initial motivation of geometric deep learning to generalize convolutional neural networks, the geometric scattering framework generalizes the construction of Euclidean scattering from Mallat (2012) to the graph setting. Theoretical studies (e.g., Gama et al., 2018, 2019) established the stability of these generalized scattering transforms to perturbations and deformations of graphs and signals on them. Moreover, the practical application of geometric scattering to whole-graph data analysis was studied in Gao et al. (2019), achieving strong classification results on social networks and biochemistry data, which established the effectiveness of this approach.

In this work, we aim to combine the complementary strengths of GCN models and geometric scattering, and to provide a new avenue for incorporating richer notions of regularity into GNNs. Furthermore, our construction integrates trained task-driven components in geometric scattering architectures. Finally, while most previous work on geometric scattering focused on whole-graph settings, we consider node-level processing which requires new considerations about the construction.

1.2 Notation

We denote matrices and vectors with bold letters, with uppercase letters representing matrices and lowercase letters representing vectors. In particular, $\mathbf{I}_n \in \mathbb{R}^{n \times n}$ is used for the identity matrix and $\mathbf{1}_n \in \mathbb{R}^n$ denotes the vector with ones in every component. We write $\langle \cdot, \cdot \rangle$ for the standard scalar product in $\mathbb{R}^n$. We will interchangeably consider functions of graph nodes as vectors indexed by the nodes, implicitly assuming a correspondence between a node and a specific index. This carries over to matrices, where we relate nodes to column or row indices. We further use the abbreviation $[n] := \{1, \dots, n\}$ where $n \in \mathbb{N}$.

2 Graph Signal Processing

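Let $G = (V, E, w)$ be a weighted graph where $V := \{v_1, \dots, v_n\}$ is the set of nodes, $E \subseteq \{\{v_i, v_j\} : v_i, v_j \in V,\ i \neq j\}$ is the set of (undirected) edges and $w : E \to (0, \infty)$ assigns (positive) edge weights to the graph edges. We note that $w$ can equivalently be considered as a function on $V \times V$, where we set the weights of non-adjacent node pairs to zero. We define a graph signal as a function $x : V \to \mathbb{R}$ on the nodes of $G$ and aggregate its values in a signal vector $\mathbf{x} \in \mathbb{R}^n$ with the $i$-th entry being $x(v_i)$.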

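We define the (combinatorial) graph Laplacian matrix $\mathbf{L} := \mathbf{D} - \mathbf{W}$, where $\mathbf{W} \in \mathbb{R}^{n \times n}$ is the weighted adjacency matrix of the graph $G$ given by

$$\mathbf{W}[v_i, v_j] := \begin{cases} w(v_i, v_j) & \text{if } \{v_i, v_j\} \in E, \\ 0 & \text{otherwise,} \end{cases}$$

and $\mathbf{D} \in \mathbb{R}^{n \times n}$ is the degree matrix of $G$ defined by $\mathbf{D} := \mathrm{diag}(d_1, \dots, d_n)$ with $d_i := \deg(v_i) := \sum_{j=1}^{n} \mathbf{W}[v_i, v_j]$ being the degree of the node $v_i$.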

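In practice, we work with the (symmetric) normalized Laplacian matrix

$$\mathcal{L} := \mathbf{D}^{-1/2} \mathbf{L} \mathbf{D}^{-1/2} = \mathbf{I}_n - \mathbf{D}^{-1/2} \mathbf{W} \mathbf{D}^{-1/2}.$$

Conveniently, $\mathcal{L}$ has the property of being symmetric positive semi-definite, and thus being orthogonally diagonalizable according to

$$\mathcal{L} = \mathbf{Q} \boldsymbol{\Lambda} \mathbf{Q}^T = \sum_{i=1}^{n} \lambda_i \mathbf{q}_i \mathbf{q}_i^T,$$

where $\boldsymbol{\Lambda} := \mathrm{diag}(\lambda_1, \dots, \lambda_n)$ is the diagonal matrix with the eigenvalues on the main diagonal and $\mathbf{Q}$ is the orthogonal matrix containing the corresponding normalized eigenvectors $\mathbf{q}_1, \dots, \mathbf{q}_n$ as its columns. This can be seen by writing

$$\mathcal{L} = \sum_{\{v_i, v_j\} \in E} w(v_i, v_j) \, \mathbf{u}_{ij} \mathbf{u}_{ij}^T,$$

where $\mathbf{u}_{ij}[v_i] = d_i^{-1/2}$, $\mathbf{u}_{ij}[v_j] = -d_j^{-1/2}$, and all remaining values being zero. Expanding the inner product $\langle \mathbf{x}, \mathcal{L} \mathbf{x} \rangle$ according to this formulation establishes positive semi-definiteness of $\mathcal{L}$ because the weights are strictly positive and $\langle \mathbf{x}, \mathbf{u}_{ij} \mathbf{u}_{ij}^T \mathbf{x} \rangle = \langle \mathbf{u}_{ij}, \mathbf{x} \rangle^2 \geq 0$, where $\mathbf{x} \in \mathbb{R}^n$ is arbitrary.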

A detailed study of the eigenvalues reveals that

$$0 = \lambda_1 \leq \lambda_2 \leq \dots \leq \lambda_n \leq 2.$$

The lower bound is easily verified by recalling that positive semi-definite matrices only exhibit non-negative eigenvalues, followed by checking that $\mathbf{D}^{1/2} \mathbf{1}_n$ is an eigenvector corresponding to $\lambda_1 = 0$. The upper bound has been established for example in Chung (1997).
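The eigenvalue bounds can be checked numerically on a small example. The following is a minimal sketch, assuming the standard definition of the symmetric normalized Laplacian (identity minus the degree-normalized weight matrix); the toy weight matrix `W` and the helper name `normalized_laplacian` are illustrative and not part of the paper.

```python
import numpy as np

def normalized_laplacian(W):
    """Return I - D^{-1/2} W D^{-1/2} for a symmetric weight matrix W."""
    d = W.sum(axis=1)                      # weighted node degrees
    d_inv_sqrt = 1.0 / np.sqrt(d)          # assumes no isolated nodes (d > 0)
    return np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

# Hypothetical toy graph on four nodes (symmetric, positive weights).
W = np.array([[0.0, 1.0, 0.0, 2.0],
              [1.0, 0.0, 3.0, 0.5],
              [0.0, 3.0, 0.0, 1.0],
              [2.0, 0.5, 1.0, 0.0]])

L_norm = normalized_laplacian(W)

# eigh returns ascending eigenvalues and orthonormal eigenvectors (columns).
eigvals, eigvecs = np.linalg.eigh(L_norm)

# D^{1/2} 1 should be mapped to zero, i.e. it is an eigenvector for eigenvalue 0.
sqrt_deg = np.sqrt(W.sum(axis=1))
residual = np.linalg.norm(L_norm @ sqrt_deg)
```

On this graph the smallest eigenvalue is zero up to rounding, all eigenvalues stay within the interval from 0 to 2, and `residual` vanishes up to floating-point error.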

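We can interpret the $\lambda_i$, $i \in [n]$, as the frequency magnitudes and the $\mathbf{q}_i$ as the corresponding harmonics. We accordingly define the Fourier transform of a signal vector $\mathbf{x} \in \mathbb{R}^n$ by $\hat{\mathbf{x}}[i] := \langle \mathbf{x}, \mathbf{q}_i \rangle$ for $i \in [n]$. The corresponding inverse Fourier transform is given by $\mathbf{x} = \sum_{i=1}^{n} \hat{\mathbf{x}}[i] \, \mathbf{q}_i$. Note that this can be written compactly as

$$\hat{\mathbf{x}} = \mathbf{Q}^T \mathbf{x}, \qquad \mathbf{x} = \mathbf{Q} \hat{\mathbf{x}}.$$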
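As an illustration of the graph Fourier transform and its inverse, the sketch below projects a signal onto the eigenvectors of the normalized Laplacian and reconstructs it. The toy graph `W` and the signal `x` are hypothetical, and the eigenvector ordering is whatever `numpy.linalg.eigh` returns (ascending eigenvalues).

```python
import numpy as np

# Hypothetical toy graph on four nodes (symmetric, positive weights).
W = np.array([[0.0, 1.0, 0.0, 2.0],
              [1.0, 0.0, 3.0, 0.5],
              [0.0, 3.0, 0.0, 1.0],
              [2.0, 0.5, 1.0, 0.0]])
d = W.sum(axis=1)
L_norm = np.eye(len(W)) - W / np.sqrt(np.outer(d, d))

# Columns of Q are the harmonics, lam holds the matching frequencies.
lam, Q = np.linalg.eigh(L_norm)

x = np.array([1.0, -2.0, 0.5, 3.0])   # arbitrary graph signal

x_hat = Q.T @ x     # forward transform: x_hat[i] = <x, q_i>
x_rec = Q @ x_hat   # inverse transform: x = sum_i x_hat[i] * q_i
```

Because `Q` is orthogonal, `x_rec` recovers `x` and the transform preserves the Euclidean norm of the signal.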

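Finally, we introduce the concept of graph convolutions. We define a filter $g : V \to \mathbb{R}$ on the set of nodes and want to convolve the corresponding filter vector $\mathbf{g} \in \mathbb{R}^n$ with a signal vector $\mathbf{x} \in \mathbb{R}^n$, i.e. $\mathbf{g} \star \mathbf{x}$. To explicitly compute this convolution, we recall that in the Euclidean setting, the Fourier transform of the convolution of two signals equals the product of their Fourier transforms. This property generalizes to graphs (Shuman et al., 2013) in the sense that $\widehat{\mathbf{g} \star \mathbf{x}}[i] = \hat{\mathbf{g}}[i] \, \hat{\mathbf{x}}[i]$ for $i \in [n]$. Applying the inverse Fourier transform yields

$$\mathbf{g} \star \mathbf{x} = \sum_{i=1}^{n} \hat{\mathbf{g}}[i] \, \hat{\mathbf{x}}[i] \, \mathbf{q}_i = \sum_{i=1}^{n} \hat{\mathbf{g}}[i] \, \langle \mathbf{q}_i, \mathbf{x} \rangle \, \mathbf{q}_i = \mathbf{Q} \hat{\mathbf{G}} \mathbf{Q}^T \mathbf{x},$$

where $\hat{\mathbf{G}} := \mathrm{diag}(\hat{\mathbf{g}}) = \mathrm{diag}(\hat{\mathbf{g}}[1], \dots, \hat{\mathbf{g}}[n])$. Therefore, we can parametrize the convolution with a filter $\mathbf{g}$ by directly considering the Fourier coefficients in $\hat{\mathbf{G}}$.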

Furthermore, it can be verified (Defferrard et al., 2016) that when these coefficients are defined as polynomials $\hat{\mathbf{g}}[i] := \sum_{k} \gamma_k \lambda_i^k$ for $i \in [n]$ of the Laplacian eigenvalues in $\boldsymbol{\Lambda}$ (i.e. $\hat{\mathbf{G}} = \sum_{k} \gamma_k \boldsymbol{\Lambda}^k$), the resulting filter convolutions are localized in space and can be written in terms of $\mathcal{L}$ as

$$\mathbf{g} \star \mathbf{x} = \sum_{k} \gamma_k \mathcal{L}^k \mathbf{x}$$

without requiring spectral decomposition of the normalized Laplacian. This property motivates the standard practice of using filters that have polynomial forms, which is studied in previous works on graph convolutional networks and geometric scattering as well as this work.
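The equivalence between filtering in the spectral domain and applying a polynomial of the normalized Laplacian can be sketched as follows. The coefficients in `gamma` are arbitrary illustrative values, and the toy graph is hypothetical.

```python
import numpy as np

# Hypothetical toy graph on four nodes (symmetric, positive weights).
W = np.array([[0.0, 1.0, 0.0, 2.0],
              [1.0, 0.0, 3.0, 0.5],
              [0.0, 3.0, 0.0, 1.0],
              [2.0, 0.5, 1.0, 0.0]])
d = W.sum(axis=1)
L_norm = np.eye(len(W)) - W / np.sqrt(np.outer(d, d))
lam, Q = np.linalg.eigh(L_norm)

gamma = [0.5, -1.0, 0.25]              # illustrative coefficients gamma_0, gamma_1, gamma_2
x = np.array([1.0, -2.0, 0.5, 3.0])    # arbitrary graph signal

# Spectral route: evaluate the polynomial at every eigenvalue and
# multiply the Fourier coefficients of x by the result.
g_hat = sum(c * lam ** k for k, c in enumerate(gamma))
y_spectral = Q @ (g_hat * (Q.T @ x))

# Spatial route: apply the same polynomial in the Laplacian itself,
# which needs no eigendecomposition.
y_spatial = sum(c * np.linalg.matrix_power(L_norm, k) @ x
                for k, c in enumerate(gamma))
```

Both routes give the same filtered signal. Only the second avoids the eigendecomposition, which is why polynomial filters are the practical choice on larger graphs.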

3 Graph Convolutional Network

The initial idea of learning convolutional filters on graphs consists of learning the parameter $\hat{\mathbf{G}}$, i.e. the Fourier coefficients of the filter, directly. This has, however, two immediate drawbacks, as pointed out in Defferrard et al. (2016): the filters cannot be localized in the feature space and their computation is expensive ($\mathcal{O}(n^2)$). Similar to Kipf and Welling (2016), we are interested in a semi-supervised setting where only a small portion of the nodes is labeled. Their method relies on the construction of a model that not only uses the labeled data, but also leverages the intrinsic geometric information encoded in the adjacency matrix $\mathbf{W}$. This is done by enforcing similarity between adjacent nodes and propagating information from the labeled nodes over the whole set of nodes.

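The convolutional filter is parametrized by $\hat{\mathbf{G}}_\theta := \theta (2 \mathbf{I}_n - \boldsymbol{\Lambda})$. This parametrization yields

$$\mathbf{g}_\theta \star \mathbf{x} = \theta \left( \mathbf{I}_n + \mathbf{D}^{-1/2} \mathbf{W} \mathbf{D}^{-1/2} \right) \mathbf{x}. \tag{1}$$

The choice of only one learnable parameter is made to avoid overfitting. The matrix $\mathbf{I}_n + \mathbf{D}^{-1/2} \mathbf{W} \mathbf{D}^{-1/2}$ has eigenvalues in $[0, 2]$. This could lead to vanishing or exploding gradients. This issue is addressed by the following renormalization trick (Kipf and Welling, 2016): $\mathbf{I}_n + \mathbf{D}^{-1/2} \mathbf{W} \mathbf{D}^{-1/2} \to \tilde{\mathbf{D}}^{-1/2} \tilde{\mathbf{W}} \tilde{\mathbf{D}}^{-1/2}$, where $\tilde{\mathbf{W}} := \mathbf{I}_n + \mathbf{W}$ and $\tilde{\mathbf{D}}[v_i, v_i] := \sum_{j=1}^{n} \tilde{\mathbf{W}}[v_i, v_j]$. This operation replaces the features of each node by a weighted average of its own features and those of its neighbors. Note that the repeated execution of graph convolutions will enforce similarity throughout higher-order neighborhoods with order equal to the number of stacked layers. Setting

$$\mathbf{A} := \tilde{\mathbf{D}}^{-1/2} \tilde{\mathbf{W}} \tilde{\mathbf{D}}^{-1/2},$$

the complete layer-wise propagation rule takes the form

$$\mathbf{h}_j^{\ell} = \sigma \left( \sum_{i=1}^{N_{\ell-1}} \theta_{ij}^{\ell} \, \mathbf{A} \mathbf{h}_i^{\ell-1} \right).$$

Here, $\ell$ indicates the layer with $N_\ell$ neurons, $\mathbf{h}_j^{\ell} \in \mathbb{R}^n$ the activation vector of the $j$-th neuron, $\theta_{ij}^{\ell}$ the learned parameter of the convolution with the $i$-th incoming activation vector from the preceding layer and $\sigma(\cdot)$ an element-wise applied activation function. Written in matrix notation, this gives

$$\mathbf{H}^{\ell} = \sigma \left( \mathbf{A} \mathbf{H}^{\ell-1} \boldsymbol{\Theta}^{\ell} \right), \tag{2}$$

where $\boldsymbol{\Theta}^{\ell} \in \mathbb{R}^{N_{\ell-1} \times N_\ell}$ is the weight matrix of the $\ell$-th layer and $\mathbf{H}^{\ell} \in \mathbb{R}^{n \times N_\ell}$ contains the activations output by the $\ell$-th layer.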
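A single GCN propagation step with the renormalization trick might look like the following sketch. The ReLU activation, the random weights and the toy graph are assumptions made for illustration; the text above only requires some element-wise activation function.

```python
import numpy as np

def renormalized_adjacency(W):
    """Renormalization trick: add self-loops, then normalize symmetrically."""
    W_tilde = W + np.eye(len(W))
    d_tilde = W_tilde.sum(axis=1)
    return W_tilde / np.sqrt(np.outer(d_tilde, d_tilde))

def gcn_layer(A, H, Theta, activation=lambda z: np.maximum(z, 0.0)):
    """One propagation step: activation(A @ H @ Theta). ReLU is an assumed default."""
    return activation(A @ H @ Theta)

# Hypothetical toy graph on four nodes (symmetric, positive weights).
W = np.array([[0.0, 1.0, 0.0, 2.0],
              [1.0, 0.0, 3.0, 0.5],
              [0.0, 3.0, 0.0, 1.0],
              [2.0, 0.5, 1.0, 0.0]])

rng = np.random.default_rng(0)
A = renormalized_adjacency(W)
H0 = rng.normal(size=(4, 3))       # 4 nodes with 3 input features each
Theta1 = rng.normal(size=(3, 2))   # maps 3 features to 2 neurons
H1 = gcn_layer(A, H0, Theta1)

# After renormalization the spectrum lies in (-1, 1], with top eigenvalue 1.
spectrum = np.linalg.eigvalsh(A)
```

The output `H1` has one row per node and one column per neuron, matching the matrix form of the propagation rule.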

We want to remark that the GCN model explained above can be interpreted as a low-pass operation. For the sake of simplicity, let us consider the convolutional operation (1) before the renormalization trick. If we write the convolution operation as the summation

$$\mathbf{g}_\theta \star \mathbf{x} = \theta \sum_{i=1}^{n} (2 - \lambda_i) \, \langle \mathbf{x}, \mathbf{q}_i \rangle \, \mathbf{q}_i,$$

we clearly see that higher weights are put on the low-frequency harmonics, while high-frequency harmonics are progressively less involved as $0 = \lambda_1 \leq \dots \leq \lambda_n \leq 2$. This indicates that the more graph convolutions are stacked, the smaller the portion of the original information in the input signal that the model can access.

This observation is in line with the well-known oversmoothing problem (Li et al., 2018) related to GCN models. The repeated application of graph convolutions will successively smooth the signals of the graph such that nodes cannot be distinguished anymore.
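The oversmoothing effect can be made visible with a small numerical experiment. The sketch below repeatedly applies the renormalized propagation matrix (without weights or nonlinearities, which is a simplification) to a toy signal and measures how much of the signal lies outside the smoothest eigenvector. The graph, signal and depths are hypothetical.

```python
import numpy as np

# Hypothetical toy graph on four nodes (symmetric, positive weights).
W = np.array([[0.0, 1.0, 0.0, 2.0],
              [1.0, 0.0, 3.0, 0.5],
              [0.0, 3.0, 0.0, 1.0],
              [2.0, 0.5, 1.0, 0.0]])
W_tilde = W + np.eye(len(W))
d_tilde = W_tilde.sum(axis=1)
A = W_tilde / np.sqrt(np.outer(d_tilde, d_tilde))

# Unit eigenvector of A for its largest eigenvalue 1 (the smoothest harmonic).
u = np.sqrt(d_tilde) / np.linalg.norm(np.sqrt(d_tilde))

x = np.array([1.0, -2.0, 0.5, 3.0])   # arbitrary graph signal
non_smooth_energy = []
for depth in range(0, 33, 4):
    x_t = np.linalg.matrix_power(A, depth) @ x
    # Remove the smoothest component; what remains distinguishes the nodes.
    non_smooth_energy.append(np.linalg.norm(x_t - (u @ x_t) * u))
```

The values in `non_smooth_energy` shrink as the depth grows, so after enough propagation steps the signal is essentially determined by node degrees alone.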

4 Geometric Scattering

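In this section, we recall the concept of geometric scatterings on graphs. These are based on the lazy random walk matrix

$$\mathbf{P} := \frac{1}{2} \left( \mathbf{I}_n + \mathbf{W} \mathbf{D}^{-1} \right),$$

which is closely related to the graph random walk, a Markov process with transition matrix $\mathbf{R} := \mathbf{W} \mathbf{D}^{-1}$. The matrix $\mathbf{P}$, however, allows self loops while normalizing by a factor of two in order to retain a Markov process.

Therefore, considering a distribution $\boldsymbol{\mu}_0 \in \mathbb{R}^n$ of the initial position of the lazy random walk, its positional distribution after $t$ steps is encoded by $\boldsymbol{\mu}_t = \mathbf{P}^t \boldsymbol{\mu}_0$.

It was shown in Gao et al. (2019) that the propagation of a graph signal vector $\mathbf{x}$ by this Markov operator, i.e. $\mathbf{P}^t \mathbf{x}$, is a low-pass operation which preserves the zero-frequencies of the signal while suppressing high frequencies.

This limitation is addressed by introducing the wavelet matrix $\boldsymbol{\Psi}_k$ of scale $2^k$,

$$\boldsymbol{\Psi}_0 := \mathbf{I}_n - \mathbf{P}, \qquad \boldsymbol{\Psi}_k := \mathbf{P}^{2^{k-1}} - \mathbf{P}^{2^k} = \mathbf{P}^{2^{k-1}} \left( \mathbf{I}_n - \mathbf{P}^{2^{k-1}} \right), \quad k \geq 1.$$

This leverages the fact that high frequencies can be recovered with multiscale wavelet transforms such as this one, which decomposes the non-zero frequencies into dyadic frequency bands. The operation $\boldsymbol{\Psi}_k \mathbf{x}$ collects signals from a neighborhood of order $2^k$ but applies no averaging operation over them.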
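The dyadic wavelets built from the lazy random walk can be sketched as below. The indexing convention (the scale-zero wavelet being identity minus the lazy walk, and each later wavelet being the difference of two dyadic powers) follows the diffusion-wavelet construction of Gao et al. (2019) and is an assumption here, as are the helper names and the toy graph.

```python
import numpy as np

def lazy_random_walk(W):
    """P = (I + W D^{-1}) / 2; every column of P sums to one."""
    d = W.sum(axis=0)
    return 0.5 * (np.eye(len(W)) + W / d[None, :])

def wavelet(P, k):
    """Psi_0 = I - P, and Psi_k = P^(2^(k-1)) - P^(2^k) for k >= 1 (assumed convention)."""
    if k == 0:
        return np.eye(len(P)) - P
    return (np.linalg.matrix_power(P, 2 ** (k - 1))
            - np.linalg.matrix_power(P, 2 ** k))

# Hypothetical toy graph on four nodes (symmetric, positive weights).
W = np.array([[0.0, 1.0, 0.0, 2.0],
              [1.0, 0.0, 3.0, 0.5],
              [0.0, 3.0, 0.0, 1.0],
              [2.0, 0.5, 1.0, 0.0]])
P = lazy_random_walk(W)

K = 3
band_sum = sum(wavelet(P, k) for k in range(K + 1))   # all band-pass parts
low_pass = np.linalg.matrix_power(P, 2 ** K)          # remaining low-pass part

# Degree-proportional distribution: fixed by P, so it carries the zero frequency.
stationary = W.sum(axis=0) / W.sum()
```

The band-pass parts and the remaining low-pass part add up to the identity, so no information is lost by the decomposition, and every wavelet maps the stationary distribution to zero.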

Geometric scattering was originally introduced in the context of whole-graph classification and consisted of aggregating scattering features. These are stacked wavelet transforms parameterized via tuples $p := (k_1, \dots, k_m)$ containing the bandwidth scale parameters, which are separated by element-wise absolute value nonlinearities according to

$$\mathbf{U}_p \mathbf{x} := \boldsymbol{\Psi}_{k_m} \left| \boldsymbol{\Psi}_{k_{m-1}} \dots \left| \boldsymbol{\Psi}_{k_2} \left| \boldsymbol{\Psi}_{k_1} \mathbf{x} \right| \right| \dots \right|, \tag{3}$$

where $m$ corresponds to the length of the tuple $p$. The scattering features are aggregated over the whole graph by taking $q$-th order moments over the set of nodes,

$$\mathbf{S}_{p,q} \mathbf{x} := \sum_{i=1}^{n} \left| \mathbf{U}_p \mathbf{x} [v_i] \right|^q. \tag{4}$$

As our research is devoted to the study of node-based classification, we adapt this approach to a new context, keeping the scattering transforms on the node level by dismissing the aggregation step (4). For each tuple $p$, we define the following scattering propagation rule, which mirrors the GCN rule but replaces the low-pass filter by a geometric scattering operation, resulting in

$$\mathbf{H}^{\ell} = \sigma \left( \mathbf{U}_p \mathbf{H}^{\ell-1} \boldsymbol{\Theta}^{\ell} \right). \tag{5}$$

We note that in practice we only use a subset of tuples, which is chosen as part of the network design explained in the following section.
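A node-level scattering transform for a given tuple of scales can be sketched as a loop that alternates wavelet filtering with element-wise absolute values. The wavelet convention, the helper names, the toy graph and the example tuples `(1,)` and `(1, 2)` are illustrative assumptions rather than the configurations used in the experiments.

```python
import numpy as np

def lazy_random_walk(W):
    """P = (I + W D^{-1}) / 2."""
    return 0.5 * (np.eye(len(W)) + W / W.sum(axis=0)[None, :])

def wavelet(P, k):
    """Psi_0 = I - P, and Psi_k = P^(2^(k-1)) - P^(2^k) for k >= 1 (assumed convention)."""
    if k == 0:
        return np.eye(len(P)) - P
    return (np.linalg.matrix_power(P, 2 ** (k - 1))
            - np.linalg.matrix_power(P, 2 ** k))

def scattering(P, x, path):
    """Apply the wavelets along `path`, with absolute values in between."""
    out = wavelet(P, path[0]) @ x
    for k in path[1:]:
        out = wavelet(P, k) @ np.abs(out)
    return out

def aggregated_moment(u, q):
    """Whole-graph statistic sum_i |u[i]|^q; the node-level variant skips this step."""
    return np.sum(np.abs(u) ** q, axis=0)

# Hypothetical toy graph on four nodes (symmetric, positive weights).
W = np.array([[0.0, 1.0, 0.0, 2.0],
              [1.0, 0.0, 3.0, 0.5],
              [0.0, 3.0, 0.0, 1.0],
              [2.0, 0.5, 1.0, 0.0]])
P = lazy_random_walk(W)
x = np.array([1.0, -2.0, 0.5, 3.0])

first_order = scattering(P, x, (1,))      # one value per node
second_order = scattering(P, x, (1, 2))   # one value per node
graph_level = aggregated_moment(first_order, q=2)  # a single number for the graph
```

The node-level outputs keep one entry per node, which is what a node classifier needs, whereas `aggregated_moment` collapses them into one number per graph.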

5 Combining GCN and Scattering Models

To combine the benefits of GCN models and geometric scatterings adapted to the node level, we now propose a hybrid network structure as shown in Fig. 1. This structure combines low-pass operations based on the GCN model with band-pass operations based on geometric scattering. We will first introduce the general architecture of the model.

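To define the layer-wise propagation rule, we introduce

$$\mathbf{H}_{gcn}^{\ell} := \left[ \mathbf{H}_{gcn,1}^{\ell} \,\|\, \dots \,\|\, \mathbf{H}_{gcn,C_{gcn}}^{\ell} \right] \quad \text{and} \quad \mathbf{H}_{sct}^{\ell} := \left[ \mathbf{H}_{sct,1}^{\ell} \,\|\, \dots \,\|\, \mathbf{H}_{sct,C_{sct}}^{\ell} \right],$$

which are the concatenations of the channels $\{\mathbf{H}_{gcn,k}^{\ell}\}_{k=1}^{C_{gcn}}$ and $\{\mathbf{H}_{sct,k}^{\ell}\}_{k=1}^{C_{sct}}$, respectively. Every $\mathbf{H}_{gcn,k}^{\ell}$ is defined according to Eq. 2 with the slight modification of added biases and powers of $\mathbf{A}$, i.e.,

$$\mathbf{H}_{gcn,k}^{\ell} := \sigma \left( \mathbf{A}^k \mathbf{H}^{\ell-1} \boldsymbol{\Theta}_{gcn,k}^{\ell} + \mathbf{B}_{gcn,k}^{\ell} \right).$$

Note that every GCN filter uses a different propagation matrix $\mathbf{A}^k$ and therefore aggregates information from $k$-step neighborhoods. Similarly, we proceed with $\mathbf{H}_{sct,k}^{\ell}$ according to Eq. 5 and calculate

$$\mathbf{H}_{sct,k}^{\ell} := \sigma \left( \mathbf{U}_{p_k} \mathbf{H}^{\ell-1} \boldsymbol{\Theta}_{sct,k}^{\ell} + \mathbf{B}_{sct,k}^{\ell} \right),$$

where the choice of tuples $p_k$, $k = 1, \dots, C_{sct}$, enables scatterings of different orders and scales. Finally, the GCN components and scattering components get concatenated to

$$\mathbf{H}^{\ell} := \left[ \mathbf{H}_{gcn}^{\ell} \,\|\, \mathbf{H}_{sct}^{\ell} \right]. \tag{6}$$

The learned parameters are the weight matrices $\boldsymbol{\Theta}_{gcn,k}^{\ell}, \boldsymbol{\Theta}_{sct,k}^{\ell} \in \mathbb{R}^{N_{\ell-1} \times N_\ell}$ coming from the convolutional and scattering layers. These are complemented by vectors of the biases $\mathbf{b}_{gcn,k}^{\ell}, \mathbf{b}_{sct,k}^{\ell} \in \mathbb{R}^{N_\ell}$, which are transposed and vertically concatenated $n$ times to the matrices $\mathbf{B}_{gcn,k}^{\ell}, \mathbf{B}_{sct,k}^{\ell} \in \mathbb{R}^{n \times N_\ell}$. To simplify notation, we assume here that all channels use the same number of neurons ($N_\ell$). Waiving this assumption would slightly complicate the notation but works perfectly fine in practice.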
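The hybrid layer described above can be sketched as a function that evaluates every low-pass channel and every band-pass channel on the same input and concatenates the results column-wise. Everything specific in this sketch is an assumption for illustration: the toy graph, the random parameters, the exponent `q=2`, and the two scattering operators (a first-order and a second-order one). Scattering channels are passed as callables because second-order scattering is not a plain matrix product.

```python
import numpy as np

def hybrid_layer(H, A, scatter_ops, gcn_params, sct_params, q):
    """Concatenate GCN channels (powers of A) and scattering channels."""
    def sigma(Z):
        return np.abs(Z) ** q                  # node-level power nonlinearity
    channels = []
    for k, (Theta, b) in enumerate(gcn_params, start=1):
        A_k = np.linalg.matrix_power(A, k)     # k-step low-pass propagation
        channels.append(sigma(A_k @ H @ Theta + b))
    for U, (Theta, b) in zip(scatter_ops, sct_params):
        channels.append(sigma(U(H) @ Theta + b))
    return np.concatenate(channels, axis=1)

# Hypothetical toy graph on four nodes (symmetric, positive weights).
W = np.array([[0.0, 1.0, 0.0, 2.0],
              [1.0, 0.0, 3.0, 0.5],
              [0.0, 3.0, 0.0, 1.0],
              [2.0, 0.5, 1.0, 0.0]])
n = len(W)

# Low-pass propagation matrix (renormalization trick).
W_tilde = W + np.eye(n)
d_tilde = W_tilde.sum(axis=1)
A = W_tilde / np.sqrt(np.outer(d_tilde, d_tilde))

# Band-pass operators from the lazy random walk (illustrative choice of scales).
P = 0.5 * (np.eye(n) + W / W.sum(axis=0)[None, :])
Psi1 = P - P @ P
Psi2 = P @ P - np.linalg.matrix_power(P, 4)
scatter_ops = [lambda H: Psi1 @ H,
               lambda H: Psi2 @ np.abs(Psi1 @ H)]

rng = np.random.default_rng(0)
features, neurons = 3, 2

def make_params():
    """Random weight matrix and bias vector; the bias broadcasts over all nodes."""
    return rng.normal(size=(features, neurons)), rng.normal(size=neurons)

H0 = rng.normal(size=(n, features))
gcn_params = [make_params() for _ in range(3)]   # three GCN channels
sct_params = [make_params() for _ in range(2)]   # two scattering channels
H1 = hybrid_layer(H0, A, scatter_ops, gcn_params, sct_params, q=2)
```

With three GCN channels and two scattering channels of two neurons each, `H1` has ten columns. Adding the bias vector by broadcasting is equivalent to stacking its transpose once per node.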

Figure 1: Comparison between GCN and our network: we add band-pass channels to collect different frequency components.

In this work, for simplicity, we limit our architecture to three GCN channels and two scattering channels, as illustrated in Fig. 2. Inspired by the aggregation step in classical geometric scattering, we use $\sigma(\cdot) := |\cdot|^q$, with exponent $q$, as our nonlinearity. However, unlike the powers in Eq. 4, the $q$-th power here is applied at the node level instead of as moments over the whole graph, retaining the distinction between node-level activations on the graph.

We note that for the first layer, we set the input $\mathbf{H}^0$ to have the original node features as the graph signal. Each subchannel (GCN or scattering) transforms the original feature space to a new hidden space with the dimension determined by the number of neurons encoded in the columns of the corresponding submatrix of $\mathbf{H}^{\ell}$. These transformations are learned by the network via the weights and biases. Larger matrices (i.e., more columns, as the number of nodes in the graph is fixed) mean that the weight matrices have more parameters to learn. Thus, the information in these channels can be propagated well and sufficiently represented.

In general, the width of a channel is relevant to the importance of these regularities. A wider channel suggests these frequency components are more critical and need to be sufficiently learned. Reducing the width of the channel will suppress the magnitude of information that can be learned from a particular frequency window. For more details and analysis of specific design choices in our architecture we refer the reader to Sec. 7, where we discuss them in the context of our experimental results.

Figure 2: The band-pass layers

6 Graph Residual Convolution

Using the combination of GCN and scattering architectures, we collect multiscale information at the node level. This information is aggregated from different localized neighborhoods, which may exhibit vastly different frequency spectra. This comes for example from varying label rates in different graph substructures. In particular, very sparse graph sections can cause problems when the scattering features actually learn the difference between labeled and unlabeled nodes, creating high-frequency noise. In the classical geometric scattering used for whole-graph representation, geometric moments were used to aggregate the node-based information, serving at the same time as a low-pass filter. As we want to keep the information localized on the node level, we choose a different approach inspired by skip connections in residual neural networks (He et al., 2016). Conceptually, this low-pass filter, which we call graph residual convolution, reduces the captured frequency spectrum up to a cutoff frequency as depicted in Fig. 3(b).

Figure 3:

a) The structure of the graph residual convolution layer, b) Schematic depiction in the frequency domain

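The graph residual convolution matrix is given by

$$\mathbf{A}_{res}(\alpha) := \frac{1}{\alpha + 1} \left( \mathbf{I}_n + \alpha \mathbf{W} \mathbf{D}^{-1} \right),$$

and we apply it after the hybrid layer of GCN and scattering filters. For $\alpha = 0$, we get that $\mathbf{A}_{res}$ is the identity (no cutoff), while $\alpha \to \infty$ results in $\mathbf{R} = \mathbf{W} \mathbf{D}^{-1}$. The matrix $\mathbf{A}_{res}(\alpha)$ can thus be interpreted as an interpolation between the completely lazy random walk and the non-resting random walk $\mathbf{R}$.

In our architecture, we apply the graph residual layer on the output $\mathbf{H}^{\ell}$ of the scattering GCN layer (6). The update rule for this step, illustrated in Fig. 3(a), is then expressed by

$$\mathbf{H}^{\ell+1} = \mathbf{A}_{res}(\alpha) \, \mathbf{H}^{\ell} \boldsymbol{\Theta}_{res} + \mathbf{B}_{res},$$

where $\boldsymbol{\Theta}_{res} \in \mathbb{R}^{N \times N_{\ell+1}}$ are learned weights, $\mathbf{B}_{res} \in \mathbb{R}^{n \times N_{\ell+1}}$ are learned biases (similar to the notations used previously), and $N$ is the number of features of the concatenated layer $\mathbf{H}^{\ell}$ in Eq. 6. If $\mathbf{H}^{\ell+1}$ is the final layer, we choose $N_{\ell+1}$ equal to the number of classes.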
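The interpolation behaviour of the graph residual convolution can be checked with a short sketch. It assumes the matrix takes the form of the identity plus a multiple of the random walk matrix, divided by that multiple plus one. The toy graph, the random parameters and the helper names are hypothetical.

```python
import numpy as np

def residual_conv_matrix(W, alpha):
    """A_res(alpha) = (I + alpha * W D^{-1}) / (alpha + 1)."""
    d = W.sum(axis=0)
    return (np.eye(len(W)) + alpha * (W / d[None, :])) / (alpha + 1.0)

def residual_layer(A_res, H, Theta, b):
    """Affine update A_res @ H @ Theta + b; the bias broadcasts over nodes."""
    return A_res @ H @ Theta + b

# Hypothetical toy graph on four nodes (symmetric, positive weights).
W = np.array([[0.0, 1.0, 0.0, 2.0],
              [1.0, 0.0, 3.0, 0.5],
              [0.0, 3.0, 0.0, 1.0],
              [2.0, 0.5, 1.0, 0.0]])
R = W / W.sum(axis=0)[None, :]          # non-resting random walk
P = 0.5 * (np.eye(len(W)) + R)          # lazy random walk

identity_case = residual_conv_matrix(W, 0.0)      # no cutoff
lazy_case = residual_conv_matrix(W, 1.0)          # stays put with probability 1/2
near_random_walk = residual_conv_matrix(W, 1e6)   # approaches R for large alpha

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 10))            # e.g. output of a hybrid layer
Theta_res = rng.normal(size=(10, 3))    # 3 output classes (illustrative)
b_res = rng.normal(size=3)
H_out = residual_layer(residual_conv_matrix(W, 0.5), H, Theta_res, b_res)
```

Setting the parameter to zero returns the identity, setting it to one reproduces the lazy random walk, and very large values approach the plain random walk, which matches the trade-off between preserving and smoothing frequencies.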

7 Results

Figure 4: Comparison of different models on the Cora citation network: the x-axis is the ratio of training size compared to the original data split (training/testing: 1000/140) in Kipf and Welling (2016), 0.5 means that the training size reduces to 500 while the testing size stays the same.

Compared to shallow GCN models, the proposed hybrid scattering GCN is able to collect a richer range of regularity patterns from the graph. Fig. 4 shows the comparison between Label Propagation (Zhu et al., 2003), GCN used in Kipf and Welling (2016), and GCN with partially absorbing random walks from Li et al. (2018). The oversmoothing caused by the Laplacian kernel and the shallow layers, which limit the collection of information from far-ranging neighbourhoods, restrict the GCN performance when the training size shrinks. The x-axis refers to the proportion of the original training size (1.0 means that we use the original training data used in Kipf and Welling (2016), while 0.2 means we use only 20% of the original training data). By preserving more high frequencies and collecting regularity patterns from the entire graph, our scattering GCN outperforms the other methods under small training size conditions.

Dataset    Nodes    Edges    Features   Label Rate
Citeseer   3,327    4,732    3,703      0.036
Cora       2,708    5,429    1,433      0.052
Pubmed     19,717   44,338   500        0.003
Table 1: Dataset statistics

Compared to the partially absorbing model discussed in Li et al. (2018), our model achieves better overall performance, except for extreme conditions when the training size shrinks to 20% of the original data. One explanation for this is that our model may produce high-frequency noise under insufficient labelling conditions, which our simplified architecture cannot filter out sufficiently. During training, we also notice that shuffling the dataset can result in varying performance.

We evaluate our methods on three datasets summarized in Tab. 1. We train the scattering GCN on Citeseer, Cora and Pubmed, where the nodes are documents and edges are citation links (Sen et al., 2008). The label rate is the number of labelled nodes that are used for training, divided by the total number of nodes in each dataset. A comparison of the classification results on these datasets is presented in Tab. 2.
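As a small worked example of the label rate definition, the sketch below inverts it to estimate how many labelled training nodes each dataset has. The node counts and label rates are copied from Tab. 1; since the reported rates are rounded to three decimals, the implied counts are only approximate.

```python
# (number of nodes, label rate) as reported in the dataset statistics table.
datasets = {
    "Citeseer": (3327, 0.036),
    "Cora": (2708, 0.052),
    "Pubmed": (19717, 0.003),
}

# label rate = labelled nodes / total nodes, so labelled nodes ~ rate * nodes.
implied_labels = {
    name: round(nodes * rate) for name, (nodes, rate) in datasets.items()
}
```

This gives roughly 120 labelled nodes for Citeseer, 141 for Cora and 59 for Pubmed, i.e. only a very small part of each graph carries training labels.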

Model                    Cora   Pubmed   Citeseer
Scattering GCN           83.3   79.4     71.0
GCN [a]                  81.5   79.0     70.3
Partially absorbing [b]  81.7   79.2     71.2
GAT [c]                  83.0   79.0     72.5
Chebyshev [d]            81.2   74.4     69.8
Table 2: Classification accuracies (in %). [a] Kipf and Welling (2016), [b] Li et al. (2018), [c] Veličković et al. (2017), [d] Defferrard et al. (2016).

7.1 Architecture Visualization

In our experiments, the two scattering channels are selected from three first-order scattering and two second-order scattering configurations, including , and . The architectures for different models are shown in Fig. 5. As discussed before, the band-pass filters may cause additional noise in case of insufficiently labelled data. For the Pubmed dataset, we maintain the structure of plain GCN and add narrow band-pass filters. For sufficiently labelled datasets, we expand the width of band-pass filters and enable the model to incorporate more high-frequency information.

Figure 5: Architecture Visualization. Dataset: a) Cora, b) Citeseer, c) Pubmed. The number indicated under each channel signifies the number of neurons associated with it.

7.2 Parameters

In the following, we provide results from our experiments on the Cora dataset and discuss the choice of parameters. We use regularization for every layer and dropout for the first layer. The network structure is shown in Fig. 5(a). We train for a maximum of 200 epochs using Adam (Kingma and Ba, 2014).

Figure 6: Scattering GCN performance on Cora dataset with different parameters.

To further explore the parameter space, we test our model with different parameters and scattering moments. The results are summarized in Fig. 6. Our experiments suggest that the graph residual convolution plays a critical role in suppressing the high-frequency noise. As shown in Fig. 6, when $\alpha = 0$, which is identical to an identity mapping, all frequency components are propagated. As $\alpha$ increases, the performance first increases due to the removal of high-frequency noise. But as $\alpha$ further increases, e.g. when $\alpha = 1$, the layer becomes a random walk model, which stays at the initial location with a probability of one half. This low-pass operation degrades the performance to a level close to Kipf and Welling (2016).

Next, we evaluate the performance of our proposed model with different scattering transformations under different graph residual convolutions. We performed experiments on the graph-based benchmark node classification task on the Cora dataset with different parameters (five different scattering transformations and three different residual convolutions). The results are summarized in Tabs. 3, 4 and 5. The rows and columns in these tables denote the two scattering transformation channels used in the scattering GCN, with the structural information shown in Fig. 5(a).

0.834 0.832 0.833 0.832 0.832
0.832 0.834 0.828 0.828
0.828 0.830 0.830
0.822 0.822
Table 3: Classification accuracies for the intermediate value of $\alpha$, with average accuracy 82.9%.

As shown in Tab. 3, the classification accuracy drops when we use two second-order scattering transformations for the scattering channels. The accuracy is 82.2% for and . However, we notice that as we replace one second-order scattering transformation by a first-order one, the accuracy increases. This can be attributed to the narrow bandwidth of second-order scattering transformations. As mentioned before, the second-order scattering transformations are less localized in space compared to first-order ones, which suggests that the second-order scattering transformations have wider spatial support. In this case, the frequencies embedded in these channels do not capture the regularity of the graph signal and weaken the performance of the model.

0.830 0.829 0.827 0.835 0.835
0.829 0.831 0.828 0.828
0.828 0.825 0.825
0.820 0.820
Table 4: Classification accuracies for the smallest value of $\alpha$, with average accuracy 82.7%.

Generally, we limit the models in our experiments to scatterings of a maximal order of two, as higher-order scatterings correspond to a wider spatial support and a narrower frequency band, which are less likely to contain the intrinsic regularity of the graph signal.

0.823 0.820 0.821 0.827 0.827
0.821 0.823 0.825 0.825
0.823 0.829 0.829
0.821 0.821
Table 5: Classification accuracies for the largest value of $\alpha$, with average accuracy 82.4%.

We also notice that by combining a specific first-order and a specific second-order scattering transformation and preserving certain high-frequency components, the performance is further increased. This can be interpreted as these filters capturing the critical regularity of the graph signal. However, except for this case, preserving high frequencies always weakens the network performance in our experiments.

Besides that, we evaluate the performance of our scattering GCN model for different graph residual convolutions. We perform experiments with three values of $\alpha$, ranging from close to the identity mapping (all frequencies pass) to a smooth filter. We find that as $\alpha$ increases, the overall performance first increases and then decreases. The average accuracy is 82.7% for the smallest value, 82.9% for the intermediate value and 82.4% for the largest value. As discussed before, this suggests that both high-frequency noise (small $\alpha$) and smooth regularity should be addressed. Thus, the parameter $\alpha$ can be thought of as a trade-off parameter in the frequency domain.

8 Conclusion

Our study of semi-supervised node-level classification tasks for graphs shows a new approach for addressing some of the main limitations of GCN models, which are currently the leading (or even state-of-the-art) methods for this task. In this context, we discuss the concept of regularity on graphs and expand the GCN approach, which solely consists in enforcing smoothness, to a richer notion of regularity. This is achieved by incorporating different frequency bands of graph signals, which could not be leveraged in traditional GCN models. We take our inspiration from the concept of geometric scatterings, which have only been used for whole-graph classification so far, and develop a related theory for node-based classification. We also establish the graph residual convolution, drawing inspiration from skip connections in residual networks, which alleviates concerns of high-frequency noise generated in our approach. This results in a toolbox of complementary methods that, combined together, open promising research avenues for advancing semi-supervised learning on graph data, as our experimental results presented in this paper suggest.


This work was partially funded by IVADO (l’institut de valorisation des données) and NIH grant R01GM135929.


  • S. Abu-El-Haija, B. Perozzi, A. Kapoor, H. Harutyunyan, N. Alipourfard, K. Lerman, G. V. Steeg, and A. Galstyan (2019) Mixhop: higher-order graph convolution architectures via sparsified neighborhood mixing. arXiv preprint arXiv:1905.00067. Cited by: §1.1.
  • M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst (2017) Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine 34 (4), pp. 18–42. Cited by: §1.
  • F. R. K. Chung (1997) Spectral graph theory. American Mathematical Society. Cited by: §2.
  • R. R. Coifman and M. Maggioni (2006) Diffusion wavelets. Applied and Computational Harmonic Analysis 21 (1), pp. 53 – 94. Cited by: §1.
  • N. De Cao and T. Kipf (2018) MolGAN: an implicit generative model for small molecular graphs. arXiv preprint arXiv:1805.11973. Cited by: §1.
  • M. Defferrard, X. Bresson, and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems, pp. 3844–3852. Cited by: §2, §3, Table 2.
  • A. Fout, J. Byrd, B. Shariat, and A. Ben-Hur (2017) Protein interface prediction using graph convolutional networks. In Advances in neural information processing systems, pp. 6530–6539. Cited by: §1.
  • F. Gama, J. Bruna, and A. Ribeiro (2019) Stability of graph scattering transforms. External Links: 1906.04784 Cited by: §1.1.
  • F. Gama, A. Ribeiro, and J. Bruna (2018) Diffusion scattering transforms on graphs. External Links: 1806.08829 Cited by: §1.1, §1.
  • F. Gao, G. Wolf, and M. Hirn (2019) Geometric scattering for graph data analysis. In International Conference on Machine Learning, pp. 2122–2131. Cited by: §1.1, §1, §4.
  • J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl (2017) Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1263–1272. Cited by: §1.
  • W. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In Advances in neural information processing systems, pp. 1024–1034. Cited by: §1.
  • D. K. Hammond, P. Vandergheynst, and R. Gribonval (2011) Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis 30 (2), pp. 129 – 150. Cited by: §1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §6.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §7.2.
  • T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. In International conference on learning representations, Cited by: §1.1, §1, §1, §3, §3, Figure 4, §7.2, Table 2, §7.
  • B. Knyazev, X. Lin, M. R. Amer, and G. W. Taylor (2018) Spectral multigraph networks for discovering and fusing relationships in molecules. arXiv preprint arXiv:1811.09595. Cited by: §1.
  • Q. Li, Z. Han, and X. Wu (2018) Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §1.1, §1, §3, Table 2, §7, §7.
  • R. Liao, Z. Zhao, R. Urtasun, and R. S. Zemel (2019) Lanczosnet: multi-scale deep graph convolutional networks. arXiv preprint arXiv:1901.01484. Cited by: §1.1.
  • S. Mallat (2012) Group invariant scattering. Communications on Pure and Applied Mathematics 65 (10), pp. 1331–1398. Cited by: §1.1, §1.
  • H. NT and T. Maehara (2019) Revisiting graph neural networks: all we have is low-pass filters. arXiv preprint arXiv:1905.09550. Cited by: §1.1, §1.
  • P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad (2008) Collective classification in network data. AI magazine 29 (3), pp. 93–93. Cited by: §7.
  • D. I. Shuman, B. Ricaud, and P. Vandergheynst (2013) Vertex-frequency analysis on graphs. External Links: 1307.5708 Cited by: §2.
  • P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2017) Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: Table 2.
  • B. Xu, H. Shen, Q. Cao, Y. Qiu, and X. Cheng (2019) Graph wavelet neural network. arXiv preprint arXiv:1904.07785. Cited by: §1.1.
  • X. Zhu, Z. Ghahramani, and J. D. Lafferty (2003) Semi-supervised learning using gaussian fields and harmonic functions. In Proceedings of the 20th International conference on Machine learning (ICML-03), pp. 912–919. Cited by: §7.
  • D. Zou and G. Lerman (2019) Graph convolutional neural networks via scattering. Applied and Computational Harmonic Analysis. Cited by: §1.1, §1.