Graph Neural Networks (GNNs) have demonstrated excellent performance in node classification tasks and are very promising in graph classification and regression (Bronstein et al., 2017; Battaglia et al., 2018; Zhang et al., 2018b; Zhou et al., 2018; Wu et al., 2019). In node classification, the input is a single graph with missing node labels to be predicted from the known node labels. In this problem, GNNs with appropriate graph convolutions can be trained based on the single graph provided, and state-of-the-art performance has been achieved (Defferrard et al., 2016; Kipf & Welling, 2017; Ma et al., 2019b). Graph classification and regression are a very different kind of task, where the label of a graph-structured sample is to be predicted. This is similar to the image classification problem tackled by traditional deep convolutional neural networks. The major difference is that here each input sample has an arbitrary adjacency structure, instead of the fixed regular grid of images. Examples are the molecules of different sizes shown in Figure 1. This raises two important challenges: 1) How can GNNs exploit the structural information of the input graphs? 2) How can GNNs handle input graphs with varying numbers of nodes or different connectivity structures?
These problems have motivated the design of proper graph convolution and graph pooling to allow GNNs to capture the geometric information of each data sample (Zhang et al., 2018a; Ying et al., 2018; Cangea et al., 2018; Gao & Ji, 2019; Knyazev et al., 2019; Ma et al., 2019a; Lee et al., 2019). Graph convolution plays an important role, especially for question 1).
The following graph convolution, as proposed by Kipf & Welling (2017), is a widely accepted example:

$X_{out} = \hat{A} X_{in} W$. (1)

Here $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ is a normalized version of the adjacency matrix $A$ of the input graph, where $\tilde{A} = A + I$, $I$ is the identity matrix, and $\tilde{D}$ is the diagonal degree matrix of $\tilde{A}$. Further, $X_{in} \in \mathbb{R}^{N \times d}$ is the array of $d$-dimensional input features on the $N$ nodes of the graph, and $W \in \mathbb{R}^{d \times m}$ is the filter parameter matrix. The graph convolution in equation 1 captures the structural information of the input in terms of $\hat{A}$ (or $A$) and transforms the feature dimension from $d$ to $m$. The filter size does not depend on the graph size, which allows a fixed network architecture to process input graphs of varying size. However, the GCN convolution preserves the number of nodes, and hence the output dimension of the network is not unique. Graph pooling provides an effective way to overcome this obstacle. Among the several approaches that have been proposed, only EigenPooling (Ma et al., 2019a) combines both features and graph structure. However, it is based on eigenpairs of the graph Laplacian and suffers from a high computational cost. We provide an overview of this and other pooling methods in Section 4.
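As a concrete illustration of equation 1, the following is a minimal NumPy sketch; the function name and toy graph are ours, not part of the paper:

```python
import numpy as np

def gcn_layer(A, X, W):
    """One GCN convolution: X_out = A_hat @ X @ W, with A_hat the
    symmetrically normalized adjacency matrix (Kipf & Welling, 2017)."""
    N = A.shape[0]
    A_tilde = A + np.eye(N)                      # add self-loops
    d = A_tilde.sum(axis=1)                      # degrees of A_tilde
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt    # normalized adjacency
    return A_hat @ X @ W                         # aggregate, then transform

# toy graph: a triangle, 2-dim features, filter mapping 2 -> 3 dims
A = np.array([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]])
X = np.random.randn(3, 2)
W = np.random.randn(2, 3)
print(gcn_layer(A, X, W).shape)  # (3, 3)
```

Note that the node dimension (3) is preserved while the feature dimension changes from 2 to 3, which is exactly why a separate pooling mechanism is needed.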
In this paper, we propose a new graph pooling strategy based on a sparse Haar representation of the data, which we call HaarPooling. It builds on the Haar basis (Li et al., 2019), combines graph structure and features, and is computationally efficient. Suppose we have input $X_{in} \in \mathbb{R}^{N \times d}$. HaarPooling is defined as

$X_{out} = \Phi^T X_{in}$, (2)

where $\Phi \in \mathbb{R}^{N \times N_1}$ with $N_1 < N$. Each column of $\Phi$ is a compressive Haar basis vector. By applying HaarPooling in equation 2, the number of nodes is compressed from $N$ to $N_1$. The Haar basis provides a sparse representation that distills graph structural information. By cascading pooling layers we obtain an output of fixed dimension, regardless of the size of the input. The sparsity of the Haar basis matrix ensures that the computation of HaarPooling is efficient: the Haar basis generation and the Haar transforms cost $\mathcal{O}(N)$ (up to a log term in $N$) for an input graph with $N$ nodes. Experiments demonstrate that GNNs combining HaarPooling with GCN convolution achieve state-of-the-art performance on various graph classification tasks.
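To make equation 2 concrete, here is a hedged NumPy sketch in which the compressive basis columns are normalized cluster indicators; this is a simplification of the actual compressive Haar vectors, whose full construction from the chain is given in Section 3:

```python
import numpy as np

def haar_pool(Phi, X_in):
    """HaarPooling as in equation 2: compress N nodes to N1 via the
    compressive Haar basis matrix Phi of shape (N, N1)."""
    return Phi.T @ X_in

# toy compressive basis for 4 nodes grouped into clusters {0,1} and {2,3}:
# one normalized constant vector per cluster (low-frequency columns only)
Phi = np.array([[1., 0.],
                [1., 0.],
                [0., 1.],
                [0., 1.]]) / np.sqrt(2.0)
X_in = np.random.randn(4, 5)          # 4 nodes, 5-dim features
X_out = haar_pool(Phi, X_in)
print(X_out.shape)  # (2, 5)
```

The node dimension is compressed from 4 to 2 while the feature dimension is preserved, mirroring the roles of $N$ and $N_1$ above.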
This paper is organized as follows. Section 2 details the components and computational flow for HaarPooling. Section 3 provides the mathematical details on HaarPooling, including the compressive Haar basis, compressive Haar transforms, and efficient implementations. Section 4 gives an overview of existing work on graph pooling. Section 5 reports our experimental results on benchmark graph classification tasks compared with existing graph pooling methods. Section 6 concludes the paper. Proofs and implementation details are deferred to the appendix.
In this section we give an overview of the proposed pooling framework. First we define the pooling architecture in terms of a chain. Each layer in the chain determines which sets of nodes are pooled together. Then we construct the compressive Haar transform, which compresses the dimension of the features. This will be used to define the HaarPooling for each layer.
Chain of coarse-grained graphs for pooling
Graph pooling amounts to defining a sequence of coarse-grained graphs. On this chain, each graph is an induced graph that arises from grouping (clustering) certain subsets of nodes of the previous graph. We use clustering algorithms to generate the groupings of nodes. There are many good candidates, such as spectral clustering (Shi & Malik, 2000), $k$-means clustering (Pakhira, 2014), DBSCAN (Ester et al., 1996), OPTICS (Ankerst et al., 1999) and METIS (Karypis & Kumar, 1998). Any of these will work with HaarPooling.
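As an illustration of how a chain can be built once a clustering is chosen, the following NumPy sketch coarsens a toy graph twice; the hard cluster assignments are supplied by hand here, standing in for the output of any of the algorithms above:

```python
import numpy as np

def coarsen(A, assign):
    """Build the adjacency matrix of the coarse-grained graph whose
    nodes are the clusters in `assign` (assign[i] = cluster of node i)."""
    n_clusters = max(assign) + 1
    S = np.zeros((A.shape[0], n_clusters))       # hard assignment matrix
    S[np.arange(A.shape[0]), assign] = 1.0
    A_coarse = S.T @ A @ S                       # sum edge weights between clusters
    np.fill_diagonal(A_coarse, 0.0)              # drop self-loops of clusters
    return A_coarse

# toy 6-node graph with two obvious communities bridged by edge (2, 3)
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

chain = [A]
for assign in ([0, 0, 0, 1, 1, 1], [0, 0]):      # two coarsening levels
    chain.append(coarsen(chain[-1], assign))
print([g.shape[0] for g in chain])  # [6, 2, 1]
```

The last level has a single node, matching the requirement that the top (coarsest) level of a chain contains only one node.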
Figure 3 shows an example of a chain with three levels for an input graph $G$. Here, the vertices of each coarser level are obtained by clustering the vertices of the previous level.
Compressive Haar transforms on chain
For each layer of the chain, we will have a feature representation. We define these in terms of the Haar basis. The Haar basis represents graph-structured data by low- and high-frequency Haar coefficients in the frequency domain. The low-frequency coefficients contain the coarse information of the original data, while the high-frequency coefficients contain the fine details. In HaarPooling, the data is pooled (or compressed) by discarding the fine-detail information.
The Haar basis can be compressed in each layer. Consider a chain where at level $j$ the two subsequent graphs have $N_{j-1}$ and $N_j$ nodes, $N_j < N_{j-1}$. For each of these graphs, we can create a Haar basis with $N_{j-1}$ and $N_j$ elements, respectively. The elements of the basis for the smaller graph are obtained by compressing a subset of the elements of the basis for the larger graph. These compressed vectors form the matrix $\Phi_j$ of size $N_{j-1} \times N_j$. We call $\Phi_j$ the compressive Haar basis matrix for this particular $j$th layer. This then defines the compressive Haar transform $\Phi_j^T X$ for a feature $X$ of size $N_{j-1} \times d$.
Computational strategy of HaarPooling
The HaarPooling is then defined as follows.
Definition 1 (HaarPooling).
The HaarPooling for a graph neural network with $K$ pooling layers is defined as

$X_{out}^{(j)} = \Phi_j^T X_{in}^{(j)}, \quad j = 1, \dots, K$, (3)

where $N_0 := N$ is the size of the input graph and $N_K = 1$, $\Phi_j \in \mathbb{R}^{N_{j-1} \times N_j}$ is the compressive Haar basis matrix for the $j$th layer, $X_{in}^{(j)} \in \mathbb{R}^{N_{j-1} \times d_j}$ is the input feature array, and $X_{out}^{(j)} \in \mathbb{R}^{N_j \times d_j}$ is the output feature array. The corresponding layer is called the HaarPooling layer.
HaarPooling has the following key properties.
HaarPooling reduces the first dimension of the input feature layer by layer. In the last pooling layer, the output feature is compressed to a vector, and every input sample generates a vector of the same length. This makes it possible to deal with graph-structured inputs of different sizes and structures.
HaarPooling uses the sparse Haar representation on the chain structure. In each HaarPooling layer, the representation combines the features of the input with the geometric information of the graphs at the $(j-1)$th and $j$th layers of the chain.
By the properties of the Haar basis, HaarPooling only drops the high-frequency (fine-detail) information of the input data; the reconstruction $\Phi_j \Phi_j^T X_{in}^{(j)}$ is a good approximation to $X_{in}^{(j)}$. Thus, the major data information (i.e., the low-frequency coefficients) is preserved in the pooling, and the loss of information is small.
Since the Haar basis matrix is very sparse, HaarPooling can be computed very fast, with near-linear computational complexity.
Figure 4 shows the computational details of HaarPooling associated with the chain from Figure 3. There are two HaarPooling layers. In the first layer, the input $X_{in}^{(1)}$ is transformed by the compressive Haar basis matrix $\Phi_1$, which consists of the first three column vectors of the full Haar basis in (a); the output $X_{out}^{(1)}$ is a matrix with three rows. In the second layer, the input $X_{in}^{(2)}$ (usually produced by an intermediate convolution) is transformed by the compressive Haar matrix $\Phi_2$, which is the first column vector of the full Haar basis matrix in (b). By the construction of the Haar basis in relation to the chain (details in Appendix B), each of the first three column vectors of $\Phi_1$ has only up to three different values. This bound is exactly the number of nodes of $G_1$. For each such column, all nodes with the same parent take the same value. Similarly, the single column of $\Phi_2$ is constant. This means that HaarPooling synthesizes the node features by assigning the same weight to nodes that lie in the same cluster of the coarser layer, and in this way pools the features using the graph clustering information.
3 Mathematics and Computation for HaarPooling
Chain of graphs by clustering
For a graph $G = (V, E)$, a graph $G' = (V', E')$ is a coarse-grained graph of $G$ if $|V'| \le |V|$ and each node of $G$ is associated with exactly one parent node in $G'$. Each node of $G'$ is called a cluster of $G$. For an integer $J \ge 1$, a coarse-grained chain for $G$ is a set of graphs $G_{J \to 0} := (G_J, G_{J-1}, \dots, G_0)$ such that $G_J = G$, $G_{j-1}$ is a coarse-grained graph of $G_j$ for each $j = 1, \dots, J$, and $G_0$ has only one node. Here, we call $G_0$ the top level or the coarsest level and $G_J$ the bottom level or the finest level. The chain hierarchically coarsens the graph $G$. (We use $J$ for the number of layers of the chain, to distinguish it from the number $K$ of pooling layers.) For details about graphs and chains, we refer to examples in Chung & Graham (1997); Hammond et al. (2011); Chui et al. (2015, 2018); Wang & Zhuang (2018, 2019).
3.1 Compressive Haar transforms
The construction of the Haar basis is rooted in the theory of the Haar wavelet basis, first introduced by Haar (1910), which is a special case of the more general Daubechies wavelets (Daubechies, 1992). A Haar basis on graphs was later constructed by Belkin et al. (2006), and also by Chui et al. (2015); Wang & Zhuang (2018, 2019). The construction of the Haar basis is based on a chain of the graph. If the topology of the graph is well reflected by the clustering of the chain, then the Haar basis captures the crucial geometric information of the graph. For a chain $G_{J \to 0}$, on the $j$th-layer graph $G_j$ there is a Haar orthonormal basis $\{\phi_\ell^{(j)}\}_{\ell=1}^{N_j}$, where $N_j$ is the size of $G_j$. For two consecutive layers, the first $N_{j-1}$ members of the basis for the finer layer $G_j$ reduce to the basis for the coarser layer $G_{j-1}$: for $\ell = 1, \dots, N_{j-1}$ and $v \in G_j$, $\phi_\ell^{(j)}(v) = \phi_\ell^{(j-1)}(u)/\sqrt{|u|}$, where $u$ is the parent of $v$ and $|u|$ is the number of nodes of $G_j$ in the cluster $u$. It means that nodes $v$ sharing the same parent take the same value $\phi_\ell^{(j)}(v)$. This property is critical to pooling, as $\phi_\ell^{(j)}$ can then be treated as weights for the graph on which the input feature is defined, with nodes receiving the same weight if they are in the same cluster. On the other hand, the remaining Haar basis vectors $\phi_\ell^{(j)}$ for $\ell > N_{j-1}$ are constructed to reflect the high-frequency information in the Haar wavelet decomposition. This property is exploited by the compressive Haar basis, which pools the input feature into an output feature with a smaller first dimension. The construction of the full Haar basis and pseudo-code for its algorithmic implementation are detailed in Li et al. (2019); Wang & Zhuang (2019), and we also attach them in the appendix. Let $\{\phi_\ell^{(j)}\}$, $j = 0, \dots, J$, be the sequence of Haar bases associated with the layers of the chain $G_{J \to 0}$ of a graph $G$. For each $j$, we let $\Psi_j := (\phi_1^{(j)}, \dots, \phi_{N_j}^{(j)})$ and call $\Psi_j$ the Haar transform matrix for layer $j$.
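The flavor of the construction can be seen in a simplified single-level NumPy sketch (our own illustrative code; the full recursive construction over a multi-level chain is given in Li et al. (2019)): the first columns are constant on each cluster (low frequency), and the remaining columns are classic Haar-type within-cluster differences (high frequency):

```python
import numpy as np

def haar_basis_one_level(clusters, N):
    """Orthonormal Haar-type basis for one coarsening step.
    `clusters` is a list of node-index lists. The first len(clusters)
    columns are constant on each cluster (low frequency); the remaining
    columns are within-cluster differences (high frequency)."""
    low, high = [], []
    for members in clusters:
        p = len(members)
        v = np.zeros(N)
        v[members] = 1.0 / np.sqrt(p)            # normalized cluster indicator
        low.append(v)
        for s in range(2, p + 1):                # Haar-type difference vectors
            u = np.zeros(N)
            u[members[: s - 1]] = 1.0 / np.sqrt(s * (s - 1))
            u[members[s - 1]] = -(s - 1) / np.sqrt(s * (s - 1))
            high.append(u)
    return np.column_stack(low + high)

Psi = haar_basis_one_level([[0, 1, 2], [3, 4]], N=5)
print(np.allclose(Psi.T @ Psi, np.eye(5)))  # True
```

Each low-frequency column takes the same value on all nodes sharing a parent, matching the property described above; keeping only those columns yields a compressive basis matrix.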
For each level $j$, the sequence $\{\phi_\ell^{(j)}\}_{\ell=1}^{N_j}$ is an orthonormal basis for $l^2(G_j)$; together, these bases form the Haar basis system for the chain $G_{J \to 0}$.
Let $G_{J \to 0}$ be a coarse-grained chain for $G$. If each parent at every level contains at least two children, the number of different values of each Haar basis vector $\phi_\ell^{(j)}$ is bounded by a constant.
In Figure 4, the Haar basis is created based on the coarse-grained chain $G_{2 \to 0}$. The two colorful matrices show the Haar bases for the layers $G_2$ and $G_1$ of the chain. There are in total $N_2$ Haar basis vectors for $G_2$, each of length $N_2$, and $N_1$ Haar basis vectors for $G_1$, each of length $N_1$. The Haar basis matrix for each level of the chain has up to a constant number of different values in each column, as indicated by the colors in each matrix. For each $j$, each node of $G_{j-1}$ is a cluster of nodes in $G_j$. Each column of the matrix $\Psi_j$ is a member of the Haar basis on the individual layer of the chain. The first three column vectors of $\Psi_2$ reduce to an orthonormal basis for $G_1$, and the first column vector of $\Psi_1$ compresses to the constant basis for $G_0$. This connection ensures that the compressive Haar transform for HaarPooling is feasible and allows fast algorithms for HaarPooling (see Section 3.2 below).
Adjoint and forward Haar transforms
We use adjoint Haar transforms for HaarPooling; thanks to the sparsity of the Haar basis matrix, the transform admits a fast implementation. The adjoint Haar transform for a signal $f$ on $G_j$ is defined as

$c = \Psi_j^T f$, (4)

and the forward Haar transform for a (coefficient) vector $c$ is $f = \Psi_j c$.
We call the components of $c$ the Haar (wavelet) coefficients for $f$. The adjoint Haar transform represents the signal in the Haar wavelet domain by computing the Haar coefficients of the graph signal. Here the adjoint and forward Haar transforms extend to a feature array of size $N_j \times d$ by replacing the column vector $f$ by the feature array.
The adjoint and forward Haar transforms are invertible, in the sense that for any vector $f$ on the graph $G_j$,

$\Psi_j \Psi_j^T f = f$. (5)

Proposition 2 shows that the forward Haar transform recovers the graph signal $f$ from the adjoint Haar transform $\Psi_j^T f$. This means that the adjoint and forward Haar transforms incur zero loss in graph signal transmission.
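A quick numerical check of this zero-loss property, using a random orthonormal matrix as a stand-in for a full Haar basis (any orthonormal $\Psi_j$ satisfies equation 5):

```python
import numpy as np

rng = np.random.default_rng(0)
Psi, _ = np.linalg.qr(rng.standard_normal((6, 6)))  # stand-in orthonormal basis
f = rng.standard_normal(6)                          # a graph signal
c = Psi.T @ f                                       # adjoint transform: Haar coefficients
f_rec = Psi @ c                                     # forward transform
print(np.allclose(f_rec, f))  # True
```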
Compressive Haar transforms
Now, for a graph neural network, suppose we want to use $K$ pooling layers. We associate the chain of an input graph with the pooling by linking the $j$th layer of pooling with the corresponding layer of the chain. Then we can use the Haar basis system on the chain to define the pooling operation. By the properties of the Haar basis, in the Haar transform for pooling layer $j$, among the $N_{j-1}$ Haar coefficients the first $N_j$ are low-frequency coefficients, which provide an approximation to the original data, while the remaining coefficients are high-frequency and contain the fine details of the Haar wavelet decomposition. To define pooling, we remove the high-frequency coefficients of the Haar wavelet representation and obtain the compressive Haar transform for the feature at layer $j$, which then gives the HaarPooling of Definition 1.
As we preserve the approximation part of the Haar wavelet decomposition in the compressive Haar transform, little information is lost in the pooling. That is,

$\Phi_j \Phi_j^T X_{in}^{(j)} \approx \Psi_j \Psi_j^T X_{in}^{(j)} = X_{in}^{(j)}$,

where $\Psi_j$ is the full Haar basis matrix at the $j$th layer.
In HaarPooling, the compression or pooling occurs in the Haar wavelet domain. HaarPooling transforms the features on the nodes to the Haar wavelet domain and discards the high-frequency coefficients of the sparse Haar wavelet representation. Figure 4 shows a two-layer HaarPooling strategy. The first layer pools the input by the compressive Haar basis matrix $\Phi_1$ to an output with smaller first dimension. The second layer pools the input by $\Phi_2$ to an output whose first dimension drops to one.
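A small NumPy sketch of this effect (toy indicator-style basis, our own illustration): projecting a signal onto the cluster-constant (low-frequency) columns discards only the within-cluster detail:

```python
import numpy as np

# compressive basis: constant-per-cluster columns for clusters {0,1,2} and {3,4}
Phi = np.zeros((5, 2))
Phi[[0, 1, 2], 0] = 1.0 / np.sqrt(3)
Phi[[3, 4], 1] = 1.0 / np.sqrt(2)

x = np.array([1.0, 1.1, 0.9, -2.0, -2.1])   # nearly constant within clusters
x_rec = Phi @ (Phi.T @ x)                   # keep only low-frequency coefficients
print(np.round(np.linalg.norm(x - x_rec), 3))  # 0.158
```

The residual is small because the discarded high-frequency coefficients encode only the slight within-cluster variation; for a signal exactly constant on each cluster it would be zero.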
3.2 Fast Computation of HaarPooling
For the HaarPooling introduced in Definition 1, we can develop a fast computational strategy by virtue of fast adjoint Haar transforms. Let $G_{J \to 0}$ be a coarse-grained chain of the graph $G$. For convenience, we label the vertices of the level-$j$ graph $G_j$ by $v_1^{(j)}, \dots, v_{N_j}^{(j)}$.
Fast algorithm for HaarPooling
The HaarPooling in equation 3 can be computed fast by using the hierarchical structure of the chain, as we now describe. For each vertex $u$ of a coarser layer, let $c(u)$ be the number of children of $u$, i.e. the number of vertices of the next finer layer which belong to the cluster $u$. At the finest layer, we let the weight of each node be one. Now, recursively, define the weight for a node of a coarser layer by
With these weights, the weighted chain becomes a filtration if each parent in the chain has at least two children; see e.g. (Chui et al., 2015, Definition 2.3).
For the $j$th HaarPooling layer, let $\{\phi_\ell\}$ be the Haar basis for the $j$th layer, which we also call the Haar basis for the filtration of the graph $G$. For each node $v$, let $f(v)$ denote the feature vector at $v$. We define the weighted sum of the feature $f$ at the finest layer by

and, recursively, for the coarser layers,

For each vertex $u$ of a coarser layer, this quantity is the weighted sum of the values at the next finer level over those vertices whose parent is $u$.
Let $\{\phi_\ell\}$ be the Haar basis for the filtration at layer $j$. Then the compressive Haar transform for the $j$th HaarPooling layer can be computed, for the feature $f$, by

where the sum runs over the vertices of the coarser layer, the index is the largest one for which the corresponding vector is a member of the orthonormal basis of that layer, and the weights are given by equation 6.
With increasing graph size, the sparsity of the Haar basis matrix becomes more pronounced (Li et al., 2019). This sparsity implies fast computation for HaarPooling. The computational complexity of HaarPooling is determined by the adjoint Haar transforms. In the first step of Algorithm 1, the total number of summations over all elements is at most linear in the number of nodes; in the second step, by the locality of the Haar basis, the total number of multiplication and summation operations is at most a constant multiple of the number of nodes, where the constant bounds the number of different values of a Haar basis vector. Thus, the computational cost of Algorithm 1 is $\mathcal{O}(N)$ up to a log factor.
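The practical benefit of sparsity can be sketched as follows (illustrative cluster structure, not the paper's Algorithm 1): when each node contributes to a single low-frequency column, the pooled product reduces to cluster-wise weighted sums, touching each node once instead of once per basis column:

```python
import numpy as np

N, k, d = 1000, 10, 16
assign = np.repeat(np.arange(k), N // k)        # node -> cluster, 100 nodes each
w = 1.0 / np.sqrt(N // k)                       # value of the single nonzero per row
X = np.random.randn(N, d)

# dense pooling: build the (mostly zero) basis matrix and multiply
Phi = np.zeros((N, k))
Phi[np.arange(N), assign] = w
dense = Phi.T @ X                               # O(N * k * d) work

# sparse pooling: one nonzero per row -> cluster-wise weighted sums
fast = np.zeros((k, d))
np.add.at(fast, assign, w * X)                  # O(N * d) work

print(np.allclose(dense, fast), np.count_nonzero(Phi), N * k)  # True 1000 10000
```

Here the basis holds 1000 nonzeros instead of 10000 dense entries, and the sparse path does work proportional to the number of nonzeros.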
We run an experiment to evaluate the CPU time of HaarPooling computed by Algorithm 1 against the direct matrix product. We use randomly generated graphs and features over a wide range of sizes. As shown in Figure 5, the fast HaarPooling has computational cost nearly proportional to the number of nodes $N$, while the ordinary matrix product incurs a cost of markedly higher order. These results are consistent with the computational complexity analysis given above.
4 Related Work
Graph pooling is a necessary step when building a GNN model for graph classification, as one needs a unified graph-level rather than node-level representation for graph inputs whose size and topology vary. The most direct pooling method takes the global mean or sum of the node representations obtained by the graph convolutional layers (Duvenaud et al., 2015) as a simple graph-level representation. However, this pooling operation treats all nodes equally and ignores the global geometry of the graph data. ChebNet (Defferrard et al., 2016) uses a graph coarsening procedure to build the pooling module, which requires graph clustering algorithms to obtain subgraphs. One drawback of this topology-based strategy is that it does not reflect the node features in the pooling. Global pooling methods consider the information of node embeddings to obtain the entire graph representation. As a general framework for graph classification problems, MPNN (Gilmer et al., 2017) uses the Set2Set method (Vinyals et al., 2015) to obtain graph-level representations. Zhang et al. (2018a) proposed the SortPool method, which sorts the feature representations of the nodes before feeding them into traditional 1-D convolutional and dense layers. But these global pooling techniques cannot produce hierarchical graph representations, which may contain useful information about the graph structure. Recently, the prominent idea of building differentiable and data-dependent pooling layers with learnable operations/parameters has brought substantial improvements on graph classification tasks. Ying et al. (2018) proposed a differentiable pooling layer (DiffPool) that learns a cluster assignment matrix over the nodes using the output of a GNN model. One common problem with the DiffPool method is its huge storage complexity, brought about by the computation of the soft clustering assignments. Cangea et al. (2018); Gao & Ji (2019); Knyazev et al. (2019) used a Top-K pooling method that samples a subset of important nodes by employing a trainable projection vector. Lee et al. (2019) introduced Self-Attention Graph Pooling (SAGPool), which replaces the way node scores are computed in Top-K pooling by a GCN module. These hierarchical pooling methods still employ mean/max pooling procedures to aggregate the feature representations of super-nodes, which can lead to information loss.
There are also spectral-based pooling methods that take into account both the graph structure and its node features. Noutahi et al. (2019) proposed the Laplacian Pooling (LaPool) method, which dynamically selects centroid nodes and their corresponding follower nodes by an attention mechanism that uses the graph Laplacian. Ma et al. (2019a) introduced EigenPool, which uses a local graph Fourier transform to extract subgraph information utilizing both the node features and the structure of the subgraph. Its potential drawback lies in the inherent bottleneck of computing the Laplacian-based graph Fourier transform, namely the huge cost of the eigendecomposition of the graph Laplacian. This shortcoming partially motivates our present work.
To verify whether the proposed framework can hierarchically learn good graph representations for classification, we evaluate HaarPooling on five widely used benchmark data sets for graph classification (Kersting et al., 2016): the protein graph data set PROTEINS (Borgwardt et al., 2005; Dobson & Doig, 2003); two mutagen data sets, MUTAG (Debnath et al., 1991; Kriege & Mutzel, 2012) and MUTAGEN (Riesen & Bunke, 2008; Kazius et al., 2005) (full name Mutagenicity); and two data sets of chemical compounds screened for activity against non-small cell lung cancer and ovarian cancer cell lines, NCI1 and NCI109 (Wale et al., 2008). We include data sets from different domains and with different sample and graph sizes to give a comprehensive picture of how HaarPooling performs in various scenarios. Summary information on the data sets is given in Table 1, which shows that the data sets contain graphs of different sizes and structures: the number of data samples ranges from 188 to 4,337, and the graphs vary widely in their average numbers of nodes and edges.
Baselines and running environment
We compare HaarPool with SortPool (Zhang et al., 2018a), DiffPool (Ying et al., 2018), gPool (Gao & Ji, 2019), SAGPool (Lee et al., 2019), EigenPool (Ma et al., 2019a), CSM (Kriege & Mutzel, 2012) and GIN (Xu et al., 2019) on the above data sets. The experiments use PyTorch Geometric (https://pytorch-geometric.readthedocs.io/en/latest) (Fey & Lenssen, 2019) and were run on Google Cloud using 4 Nvidia Tesla T4 GPUs (2560 CUDA cores, compute capability 7.5, 16GB GDDR6 VRAM).
In the experiments, we use a GNN with a few GCN (Kipf & Welling, 2017) convolutional layers plus one HaarPooling layer, followed by three fully connected layers. The hyperparameters of the network are adjusted case by case. We use spectral clustering, which exploits the eigenvectors of the graph Laplacian, to generate a chain with a given number of layers. Spectral clustering has shown good performance in coarsening a variety of data patterns and can handle isolated nodes.
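For orientation, here is a hedged NumPy sketch of this forward pass on a single toy graph, with random stand-in weights and a toy compressive basis (the real experiments use trained PyTorch Geometric models):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def normalize(A):
    """Symmetric GCN normalization of A with self-loops (equation 1)."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

# toy 4-node path graph, pooled to 2 super-nodes by a compressive Haar matrix
A = np.array([[0., 1., 0., 0.], [1., 0., 1., 0.],
              [0., 1., 0., 1.], [0., 0., 1., 0.]])
Phi = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]]) / np.sqrt(2.0)

X = rng.standard_normal((4, 3))          # 3-dim input node features
W_conv = rng.standard_normal((3, 8))     # GCN filter: 3 -> 8 channels
W_fc = rng.standard_normal((2 * 8, 2))   # readout: 2 pooled nodes x 8 -> 2 classes

H = relu(normalize(A) @ X @ W_conv)      # GCN convolution (equation 1)
P = Phi.T @ H                            # HaarPooling (equation 2)
logits = P.reshape(-1) @ W_fc            # fully connected readout
print(logits.shape)  # (2,)
```

Because pooling fixes the node dimension before the fully connected readout, graphs of different sizes yield outputs of the same shape.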
We randomly shuffle each data set and split it into training, validation, and test sets. We use the Adam optimizer (Kingma & Ba, 2015) with an early stopping criterion; the specific hyperparameter values are provided in the appendix. Following the suggestion of Shchur et al. (2018), early stopping is triggered when the validation loss does not improve for 50 epochs, with a maximum of 150 epochs.
The classification test accuracy is reported in Table 2. GNNs with HaarPooling show excellent performance on all data sets and achieve top accuracy on 4 out of 5 of them. This shows that HaarPooling, with appropriate graph convolution, can achieve top performance on a variety of graph classification tasks, and in some cases improves the state of the art by a few percentage points.
‘*’ indicates records retrieved from EigenPool (Ma et al., 2019a), ‘–’ indicates that there is no public record for the corresponding method on the data set, and bold numbers indicate the best performance in the list.
We introduced a new graph pooling method called HaarPooling. HaarPooling has a mathematically appealing formulation derived from compressive Haar transforms. Unlike existing pooling methods, HaarPooling takes into account both the graph structure and the features of the graph-structured input data to compute a coarsened representation. As a standalone unit, HaarPooling can be applied in conjunction with any type of graph convolution in GNNs. We show in experiments that HaarPooling reaches state-of-the-art performance in various benchmark graph classification tasks. Moreover, having only linear computational complexity in the size of the input, HaarPooling is a very fast pooling method.
The first four authors contributed equally to this paper.
Yu Guang Wang acknowledges support from the Australian Research Council under Discovery Project DP180100506. Ming Li acknowledges support from the National Natural Science Foundation of China (No. 61802132 and 61877020). Guido Montúfar has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement no 757983). Xiaosheng Zhuang acknowledges support in part from the Research Grants Council of Hong Kong (Project No. CityU 11301419). This material is based upon work supported by the National Science Foundation under Grant No. DMS-1439786 while Zheng Ma, Guido Montúfar and Yu Guang Wang were in residence at the Institute for Computational and Experimental Research in Mathematics in Providence, RI, during the Point Configurations in Geometry, Physics and Computer Science program. Part of this research was performed while Guido Montúfar and Yu Guang Wang were at the Institute for Pure and Applied Mathematics (IPAM), which is supported by the National Science Foundation (Grant No. DMS-1440415).
- Ankerst et al. (1999) Mihael Ankerst, Markus M Breunig, Hans-Peter Kriegel, and Jörg Sander. Optics: ordering points to identify the clustering structure. In ACM Sigmod Record, volume 28, pp. 49–60. ACM, 1999.
- Battaglia et al. (2018) Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
- Belkin et al. (2006) Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7(Nov):2399–2434, 2006.
- Borgwardt et al. (2005) Karsten M Borgwardt, Cheng Soon Ong, Stefan Schönauer, SVN Vishwanathan, Alex J Smola, and Hans-Peter Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21(suppl_1):i47–i56, 2005.
- Bronstein et al. (2017) Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
- Cangea et al. (2018) Cătălina Cangea, Petar Veličković, Nikola Jovanović, Thomas Kipf, and Pietro Liò. Towards sparse hierarchical graph classifiers. In Workshop on Relational Representation Learning, NeurIPS, 2018.
- Chui et al. (2018) C. K. Chui, H.N. Mhaskar, and X. Zhuang. Representation of functions on big data associated with directed graphs. Applied and Computational Harmonic Analysis, 44(1):165 – 188, 2018. ISSN 1063-5203. doi: https://doi.org/10.1016/j.acha.2016.12.005.
- Chui et al. (2015) C.K. Chui, F. Filbir, and H.N. Mhaskar. Representation of functions on big data: graphs and trees. Applied and Computational Harmonic Analysis, 38(3):489 – 509, 2015.
- Chung & Graham (1997) Fan RK Chung. Spectral graph theory. American Mathematical Society, 1997.
- Daubechies (1992) Ingrid Daubechies. Ten lectures on wavelets. SIAM, 1992.
- Debnath et al. (1991) Asim Kumar Debnath, Rosa L. Lopez de Compadre, Gargi Debnath, Alan J. Shusterman, and Corwin Hansch. Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. correlation with molecular orbital energies and hydrophobicity. Journal of Medicinal Chemistry, 34(2):786–797, 1991. doi: 10.1021/jm00106a046.
- Defferrard et al. (2016) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, pp. 3844–3852, 2016.
- Dobson & Doig (2003) Paul D Dobson and Andrew J Doig. Distinguishing enzyme structures from non-enzymes without alignments. Journal of Molecular Biology, 330(4):771–783, 2003.
- Duvenaud et al. (2015) David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In NIPS, pp. 2224–2232, 2015.
- Ester et al. (1996) Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pp. 226–231, 1996.
- Fey & Lenssen (2019) Matthias Fey and Jan Eric Lenssen. Fast graph representation learning with pytorch geometric. In Workshop on Representation Learning on Graphs and Manifolds, ICLR, 2019.
- Gao & Ji (2019) Hongyang Gao and Shuiwang Ji. Graph U-Nets. ICML, pp. 2083–2092, 2019.
- Gavish et al. (2010) Matan Gavish, Boaz Nadler, and Ronald R Coifman. Multiscale wavelets on trees, graphs and high dimensional data: Theory and applications to semi supervised learning. In ICML, pp. 367–374, 2010.
- Gilmer et al. (2017) Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In ICML, pp. 1263–1272, 2017.
- Haar (1910) Alfred Haar. Zur theorie der orthogonalen funktionensysteme. Mathematische Annalen, 69(3):331–371, 1910.
- Hammond et al. (2011) David K Hammond, Pierre Vandergheynst, and Rémi Gribonval. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129–150, 2011.
- Karypis & Kumar (1998) George Karypis and Vipin Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1):359–392, 1998.
- Kazius et al. (2005) Jeroen Kazius, Ross McGuire, and Roberta Bursi. Derivation and validation of toxicophores for mutagenicity prediction. Journal of Medicinal Chemistry, 48(1):312–320, 2005.
- Kersting et al. (2016) Kristian Kersting, Nils M. Kriege, Christopher Morris, Petra Mutzel, and Marion Neumann. Benchmark data sets for graph kernels, 2016. URL http://graphkernels.cs.tu-dortmund.de.
- Kingma & Ba (2015) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
- Kipf & Welling (2017) Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
- Knyazev et al. (2019) Boris Knyazev, Graham W Taylor, and Mohamed R Amer. Understanding attention in graph neural networks. In Workshop on Representation Learning on Graphs and Manifolds, ICLR, 2019.
- Kriege & Mutzel (2012) Nils Kriege and Petra Mutzel. Subgraph matching kernels for attributed graphs. In ICML, pp. 291–298, 2012.
- Lee et al. (2019) Junhyun Lee, Inyeop Lee, and Jaewoo Kang. Self-attention graph pooling. In ICML, pp. 3734–3743, 2019.
- Li et al. (2019) Ming Li, Zheng Ma, Yu Guang Wang, and Xiaosheng Zhuang. Fast Haar transforms for graph neural networks. arXiv preprint arXiv:1907.04786, 2019.
- Ma et al. (2019a) Yao Ma, Suhang Wang, Charu C. Aggarwal, and Jiliang Tang. Graph convolutional networks with EigenPooling. In KDD, pp. 723–731, 2019a.
- Ma et al. (2019b) Zheng Ma, Ming Li, and Yu Guang Wang. PAN: Path integral based convolution for deep graph neural networks. In Workshop on Learning and Reasoning with Graph-Structured Representation. ICML, 2019b.
- Noutahi et al. (2019) Emmanuel Noutahi, Dominique Beani, Julien Horwood, and Prudencio Tossou. Towards interpretable sparse graph representation learning with Laplacian pooling. arXiv preprint arXiv:1905.11577, 2019.
- Pakhira (2014) M. K. Pakhira. A linear time-complexity k-means algorithm using cluster shifting. In 2014 International Conference on Computational Intelligence and Communication Networks, pp. 1047–1051, 2014. doi: 10.1109/CICN.2014.220.
- Riesen & Bunke (2008) Kaspar Riesen and Horst Bunke. IAM graph database repository for graph based pattern recognition and machine learning. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), pp. 287–297. Springer, 2008.
- Shchur et al. (2018) Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. Pitfalls of graph neural network evaluation. In Workshop on Relational Representation Learning, NeurIPS, 2018.
- Shi & Malik (2000) Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. Departmental Papers (CIS), pp. 107, 2000.
- Vinyals et al. (2015) Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. Order matters: Sequence to sequence for sets. In ICLR, 2015.
- Wale et al. (2008) Nikil Wale, Ian A Watson, and George Karypis. Comparison of descriptor spaces for chemical compound retrieval and classification. Knowledge and Information Systems, 14(3):347–375, 2008.
- Wang & Zhuang (2018) Yu Guang Wang and Xiaosheng Zhuang. Tight framelets and fast framelet filter bank transforms on manifolds. Applied and Computational Harmonic Analysis, 2018. doi: 10.1016/j.acha.2018.02.001.
- Wang & Zhuang (2019) Yu Guang Wang and Xiaosheng Zhuang. Tight framelets on graphs for multiscale analysis. In Wavelets and Sparsity XVIII, SPIE Proc., pp. 11138–11, 2019.
- Wu et al. (2019) Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S Yu. A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596, 2019.
- Xu et al. (2019) Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In ICLR, 2019.
- Ying et al. (2018) Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamilton, and Jure Leskovec. Hierarchical graph representation learning with differentiable pooling. In NeurIPS, pp. 4800–4810, 2018.
- Zhang et al. (2018a) Muhan Zhang, Zhicheng Cui, Marion Neumann, and Yixin Chen. An end-to-end deep learning architecture for graph classification. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018a.
- Zhang et al. (2018b) Ziwei Zhang, Peng Cui, and Wenwu Zhu. Deep learning on graphs: A survey. arXiv preprint arXiv:1812.04202, 2018b.
- Zhou et al. (2018) Jie Zhou, Ganqu Cui, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, and Maosong Sun. Graph neural networks: A review of methods and applications. arXiv preprint arXiv:1812.08434, 2018.
Appendix A Graph Classification
This task is to categorize graph-structured data into several classes. The training set consists of pairs of samples $(G_i, y_i)$, $i = 1, \dots, M$. For the $i$th sample, $G_i = (V_i, E_i)$ is a graph with vertex set $V_i$ of size $N_i$ (the vertices are also called nodes) and edge set $E_i$ with weights $w_i$. The feature $X_i$ is an array of $d$ features per vertex, i.e., an $\mathbb{R}^d$-valued function over $V_i$. The label $y_i$ is an integer from a finite set indicating which class the input sample lies in. The number of nodes $N_i$ and the graph structure usually vary over the different input samples.
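This format can be made concrete with a toy sample; the field names below are illustrative assumptions, since the text fixes only the abstract format, not a data structure:

```python
# A minimal sketch of one graph-classification sample (hypothetical field
# names; only the abstract format above is given by the text).
sample = {
    "edges": {(0, 1): 1.0, (1, 2): 0.5},        # weighted edge set on 3 nodes
    "x": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],  # N_i x d feature array (N_i = 3, d = 2)
    "y": 1,                                     # class label from a finite set
}
```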
Graph neural networks
Deep graph neural networks (GNNs) are designed to work with graph-structured inputs of the form described above. A GNN is typically composed of multiple graph convolution layers, graph pooling layers, and fully connected layers. A (graph) convolutional layer extracts an $N \times d'$ array of features from the previous $N \times d$ array: it changes the feature dimension from $d$ to $d'$ but does not change the number of nodes $N$. Since the number of nodes varies across inputs, the number of nodes of the corresponding outputs also varies. This raises new challenges in comparison with traditional image classification tasks, where the local structure connecting pixels is always fixed (even if the number of pixels might vary).
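For illustration, the GCN propagation rule of Kipf & Welling (2017) can be sketched in pure Python; this is a naive dense implementation, not an efficient or official one:

```python
import math

def gcn_layer(adj, x, w):
    """One GCN propagation step, A_hat @ X @ W (Kipf & Welling, 2017).
    adj: n x n adjacency list-of-lists; x: n x d features; w: d x d' weights.
    Normalization: A_hat = D^{-1/2} (A + I) D^{-1/2}."""
    n = len(adj)
    # Add self-loops: A + I
    a = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Symmetric normalization by the degree matrix of A + I
    deg = [sum(row) for row in a]
    a = [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    # Propagate features, then apply the filter: (A_hat X) W
    d_in, d_out = len(w), len(w[0])
    ax = [[sum(a[i][k] * x[k][j] for k in range(n)) for j in range(d_in)] for i in range(n)]
    return [[sum(ax[i][k] * w[k][j] for k in range(d_in)) for j in range(d_out)] for i in range(n)]
```

Note that the output has the same number of rows (nodes) as the input, while the filter $W$ changes only the feature dimension.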
In GNNs, one uses graph pooling to reduce the first dimension of the feature arrays and, more importantly, to obtain outputs of uniform dimension (commonly followed by fully connected layers). A general architecture uses a cascade of convolutional and pooling layers. Figure 2 illustrates such an architecture with three blocks of graph convolutional and pooling layers, followed by a multi-layer perceptron (MLP) with three fully connected layers. In practice, each block can include several convolutional layers but at most one pooling layer. The exact arrangement of convolutional and pooling layers depends mainly on the particular problem and data set and is designed case by case.
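As a simple illustration of why pooling yields outputs of uniform dimension, global mean pooling (a common baseline, not the HaarPooling of this paper) maps any $N \times d$ feature array to a fixed $d$-vector:

```python
def global_mean_pool(x):
    """Collapse a variable-size n x d feature array to a single d-vector,
    so graphs with different node counts yield outputs of uniform dimension.
    (Global mean pooling is a simple baseline, not the paper's HaarPooling.)"""
    n, d = len(x), len(x[0])
    return [sum(x[i][j] for i in range(n)) / n for j in range(d)]
```

Any downstream MLP can then operate on this fixed-size vector regardless of the input graph's size.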
Appendix B Construction of Haar basis
With a chain $G_{J\to0}$ of the graph $G$, one can generate a Haar basis for $l^2(G)$ following Chui et al. (2015); see also Gavish et al. (2010). We show the construction of the Haar basis on $l^2(G)$ as follows.
Step 1. Let $G^{(1)} = (V^{(1)}, E^{(1)}, w^{(1)})$ be a coarse-grained graph of $G = (V, E, w)$ with $N_1 := |V^{(1)}| \le N := |V|$. Each vertex of $V^{(1)}$ is a cluster of vertices of $G$. Order $V^{(1)}$, e.g., by degrees of vertices or weights of vertices, as $V^{(1)} = \{v^{(1)}_1, \dots, v^{(1)}_{N_1}\}$. We define $N_1$ vectors on $G^{(1)}$ by
$$\phi^{(1)}_1 := \frac{1}{\sqrt{N_1}} \sum_{j=1}^{N_1} \chi^{(1)}_j, \qquad (10)$$
and for $\ell = 2, \dots, N_1$,
$$\phi^{(1)}_\ell := \sqrt{\frac{N_1 - \ell + 1}{N_1 - \ell + 2}} \left( \chi^{(1)}_{\ell-1} - \frac{1}{N_1 - \ell + 1} \sum_{j=\ell}^{N_1} \chi^{(1)}_j \right), \qquad (11)$$
where $\chi^{(1)}_j$ is the indicator function for the $j$th vertex $v^{(1)}_j$ on $G^{(1)}$ given by $\chi^{(1)}_j(v^{(1)}_\ell) = 1$ if $\ell = j$ and $0$ otherwise.
Then, one can show that $\{\phi^{(1)}_\ell\}_{\ell=1}^{N_1}$ forms an orthonormal basis for $l^2(G^{(1)})$.
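The orthonormality claim can be checked numerically; the sketch below builds the vectors of equations 10 and 11 on an already-ordered vertex set (function and variable names are ours, not the paper's):

```python
import math

def haar_vectors(n):
    """Construct the n vectors of equations (10)-(11) on an ordered
    vertex set of size n (a sketch; the vertex ordering is assumed given)."""
    # phi_1: the normalized constant vector, equation (10)
    basis = [[1.0 / math.sqrt(n)] * n]
    # phi_l for l = 2..n, equation (11)
    for l in range(2, n + 1):
        c = math.sqrt((n - l + 1) / (n - l + 2))
        v = [0.0] * n
        v[l - 2] = c                   # coefficient of chi_{l-1} (0-based index l-2)
        for j in range(l - 1, n):      # chi_l .. chi_n get weight -c / (n - l + 1)
            v[j] = -c / (n - l + 1)
        basis.append(v)
    return basis
```

Pairwise inner products of the returned vectors form the identity, confirming that they are an orthonormal basis of $l^2(G^{(1)})$.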
Note that each vertex $v \in V$ belongs to exactly one cluster $v^{(1)}_k \in V^{(1)}$. In view of this, for each $\ell = 1, \dots, N_1$, we can extend the vector $\phi^{(1)}_\ell$ on $G^{(1)}$ to a vector $\phi_{\ell,1}$ on $G$ by
$$\phi_{\ell,1}(v) := \frac{\phi^{(1)}_\ell(v^{(1)}_k)}{\sqrt{c_k}} \quad \text{for } v \in v^{(1)}_k,$$
here $c_k$ is the size of the cluster $v^{(1)}_k$, i.e., the number of vertices in $V$ whose common parent is $v^{(1)}_k$. We order the cluster $v^{(1)}_k$, e.g., by degrees of vertices, as $v^{(1)}_k = \{v_{k,1}, \dots, v_{k,c_k}\}$.
For $k = 1, \dots, N_1$ and $\ell = 2, \dots, c_k$, similar to equation 11, define
$$\phi_{k,\ell} := \sqrt{\frac{c_k - \ell + 1}{c_k - \ell + 2}} \left( \chi_{k,\ell-1} - \frac{1}{c_k - \ell + 1} \sum_{j=\ell}^{c_k} \chi_{k,j} \right),$$
where for $j = 1, \dots, c_k$, $\chi_{k,j}$ is the indicator function of the vertex $v_{k,j}$ on $G$, i.e., $\chi_{k,j}(v) = 1$ if $v = v_{k,j}$ and $0$ otherwise.
One can show that the resulting collection $\{\phi_{\ell,1}\}_{\ell=1}^{N_1} \cup \{\phi_{k,\ell} : k = 1, \dots, N_1,\ \ell = 2, \dots, c_k\}$, which contains $N$ vectors in total, is an orthonormal basis for $l^2(G)$.
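Step 1 as a whole can also be verified numerically. The sketch below (our names and conventions, with cluster orderings assumed given) lifts a coarse orthonormal basis to the fine graph and appends the within-cluster vectors:

```python
import math

def extend_basis(coarse_basis, clusters):
    """Extend an orthonormal basis on the coarse graph to one on the fine
    graph (Step 1 of the construction; a sketch under assumed orderings).
    clusters[k] lists the fine-vertex indices whose parent is coarse vertex k."""
    n = sum(len(c) for c in clusters)
    fine = []
    # Lift each coarse vector: value on cluster k divided by sqrt(c_k).
    for phi in coarse_basis:
        v = [0.0] * n
        for k, cluster in enumerate(clusters):
            for idx in cluster:
                v[idx] = phi[k] / math.sqrt(len(cluster))
        fine.append(v)
    # Within-cluster vectors, as in equation (11) restricted to one cluster.
    for cluster in clusters:
        ck = len(cluster)
        for l in range(2, ck + 1):
            c = math.sqrt((ck - l + 1) / (ck - l + 2))
            v = [0.0] * n
            v[cluster[l - 2]] = c
            for j in range(l - 1, ck):
                v[cluster[j]] = -c / (ck - l + 1)
            fine.append(v)
    return fine
```

The output contains $N_1 + \sum_k (c_k - 1) = N$ mutually orthonormal vectors, matching the count in the construction.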
Step 2. Let $G_{J\to0}: G_J \to G_{J-1} \to \cdots \to G_0$ be a coarse-grained chain for the graph $G = G_J$. An orthonormal basis for $l^2(G_0)$ is generated using equation 10 and equation 11. We then repeatedly use Step 1: for $j = 1, \dots, J$, we generate an orthonormal basis for $l^2(G_j)$ from the orthonormal basis for the coarse-grained graph $G_{j-1}$ that was derived in the previous step. We call the sequence of vectors at the finest level the Haar global orthonormal basis, or simply the Haar basis, for $G$ associated with the chain $G_{J\to0}$. The orthonormal basis for $l^2(G_j)$, $j = 0, \dots, J-1$, is called the associated (orthonormal) basis for the Haar basis for $G$.
Besides orthogonality, the Haar basis has locality, which is critical to the fast computation of HaarPooling.
Appendix C Proof
Proof of Theorem 3.
By the relation between the Haar basis vectors at one level of the chain and those at the next coarser level, the summation over the vertices of a graph in the chain can be rewritten as a summation over their parents in the coarse-grained graph. Computing the summation recursively through the chain gives the last equality, thus completing the proof. ∎
Appendix D Experimental Setting
The architecture of a GNN is specified by the layer types and the number of hidden nodes at each layer. For example, we write 3GC256-HP-2FC256-FC128 for a GNN architecture with 3 GCNConv layers, each with 256 hidden nodes, plus one HaarPooling layer, followed by 2 fully connected layers with 256 hidden nodes each and 1 fully connected layer with 128 hidden nodes. The architecture for each data set is shown in Table 3.
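A small helper makes the notation unambiguous; this parser is a hypothetical illustration (not code from the paper), assuming the layer-spec grammar just described:

```python
import re

def parse_arch(spec):
    """Parse layer-spec strings such as '3GC256-HP-2FC256-FC128' into a list
    of (layer_type, width) pairs. A hypothetical helper for illustration:
    an optional repeat count, a layer code, and an optional hidden width."""
    layers = []
    for token in spec.split("-"):
        m = re.fullmatch(r"(\d*)([A-Z]+)(\d*)", token)
        repeat = int(m.group(1)) if m.group(1) else 1
        width = int(m.group(3)) if m.group(3) else None   # e.g. HP has no width
        layers.extend([(m.group(2), width)] * repeat)
    return layers
```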
The hyperparameters include the batch size; the learning rate and the weight decay rate (both for optimization); the maximum number of epochs; and the patience for early stopping. The choice of hyperparameters for each data set is shown in Table 4.
| Data Set | Layers and #Hidden Nodes |
| --- | --- |