Recent years have witnessed increasing interest in generalizing neural networks to graph-structured data. This stream of research usually goes under the name of “Graph Neural Networks” (Scarselli et al., 2009), which typically involve transforming, propagating and aggregating node features across the graph. Some of these methods focus on node-level representation learning (Kipf and Welling, 2016; Hamilton et al., 2017; Schlichtkrull et al., 2018) while others investigate learning graph-level representations (Li et al., 2015; Henaff et al., 2015; Duvenaud et al., 2015; Defferrard et al., 2016; Bruna et al., 2013; Ying et al., 2018b; Gao and Ji, 2019a, b). Although they start from different perspectives, these methods have been shown to advance various graph-related tasks. Methods focusing on node representation learning have improved tasks such as node classification (Kipf and Welling, 2016; Hamilton et al., 2017; Schlichtkrull et al., 2018; Gao et al., 2018; Gao and Ji, 2019a, b) and link prediction (Schlichtkrull et al., 2018), while methods working on graph-level representation learning have mainly facilitated graph classification. In this paper, we work on graph-level representation learning with a focus on the task of graph classification.
The task of graph classification is to predict the label of a given graph utilizing its associated features and graph structure. Graph Neural Networks can extract a graph representation while using all associated information. The majority of existing graph neural networks (Dai et al., 2016; Duvenaud et al., 2015; Gilmer et al., 2017; Li et al., 2015) have been designed to generate good node representations, and then globally summarize the node representations as the graph representation. These methods are inherently “flat” since they treat all the nodes equivalently when generating the graph representation from the node representations. In other words, the graph structure information is neglected during this summarizing step. However, nodes naturally have different statuses and roles in a graph, and they should contribute differently to the graph-level representation. Furthermore, graphs often have different local structures (or subgraphs), which contain vital graph characteristics. For instance, in a graph of a protein, atoms (nodes) are connected via bonds (edges); some local structures, which consist of groups of atoms and their direct bonds, can represent specific functional units, which, in turn, are important for determining the functionality of the entire protein (Shervashidze et al., 2011; Duvenaud et al., 2015; Borgwardt et al., 2005). These local structures are also not captured during the global summarizing process. To generate a graph representation that preserves the local and global graph structures, a hierarchical pooling process, analogous to the pooling process in conventional convolutional neural networks (CNNs) (Krizhevsky et al., 2012), is needed.
Existing hierarchical pooling methods group nodes into subgraphs (supernodes), coarsen the graph based on these subgraphs, and then reduce the entire graph information to the coarsened graph by generating features of supernodes from their corresponding nodes in the subgraphs. However, when pooling the features for supernodes, average pooling or max pooling is usually adopted, so the structures of the grouped nodes (the local structures) are still neglected. Given the local structures, the nodes in a subgraph have different statuses and roles when they contribute to the supernode representation. It is challenging to design a general pooling operator that incorporates the local structure information, as 1) the subgraphs may contain different numbers of nodes, thus a fixed-size pooling operator cannot work for all subgraphs; and 2) the subgraphs could have very different structures, which may require different approaches to summarize the information for the supernode representation. To address these challenges, we design a novel pooling operator, EigenPooling, based on the eigenvectors of the subgraphs, which naturally have the same size as each subgraph and can effectively capture the local structures when summarizing node features for supernodes. EigenPooling can be used as pooling layers stacked with any graph neural network layers to form a novel framework, EigenGCN, for graph classification. Our major contributions can be summarized as follows:
We introduce a novel pooling operator, EigenPooling, which can naturally summarize the subgraph information while utilizing the subgraph structure;
We provide a theoretical understanding of EigenPooling from both local and global perspectives;
We incorporate pooling layers based on EigenPooling into existing graph neural networks as a novel framework, EigenGCN, for representation learning for graph classification; and
We conduct comprehensive experiments on numerous real-world graph classification benchmarks to demonstrate the effectiveness of the proposed pooling operator.
2. The Proposed Framework – EigenGCN
In this paper, we aim to develop a Graph Neural Network (GNN) model, which consists of convolutional layers and pooling layers, to learn graph representations such that graph-level classification can be applied. Before going into the details, we first introduce some notation and the problem setting.
Problem Setting: A graph can be represented as $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$, where $\mathcal{V} = \{v_1, \dots, v_N\}$ is the set of $N$ nodes and $\mathcal{E}$ is the set of edges. The graph structure information can also be represented by an adjacency matrix $A \in \mathbb{R}^{N \times N}$. Furthermore, each node in the graph is associated with node features and we use $X \in \mathbb{R}^{N \times d}$ to denote the node feature matrix, where $d$ is the dimension of the features. Note that this node feature matrix can also be viewed as a $d$-dimensional graph signal (Shuman et al., 2013) defined on the graph $\mathcal{G}$. In the graph classification setting, we have a set of graphs $\{\mathcal{G}_i\}$, where each graph $\mathcal{G}_i$ is associated with a label $y_i$. The task of graph classification is to take the graph (structure information and node features) as input and predict its corresponding label. To make the prediction, it is important to extract useful information from both the graph structure and the node features. We aim to design graph convolution layers and EigenPooling to hierarchically extract graph features, which finally yields a vector representation of the input graph for graph classification.
2.1. An Overview of EigenGCN
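In this work, we build our model based on Graph Convolutional Networks (GCN) (Kipf and Welling, 2016), which have been demonstrated to be effective in node-level representation learning. While the GCN model was originally designed for semi-supervised node classification, we only discuss the part for node representation learning and ignore the classification part. The GCN consists of several stacked convolutional layers, and a single convolutional layer can be written as:
$$F^{i+1} = \mathrm{ReLU}\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} F^{i} W^{i}\right)$$
where $\tilde{A} = A + I_N$ is the adjacency matrix with self-loops added ($I_N$ is the $N \times N$ identity matrix), $\tilde{D}$ is the diagonal degree matrix of $\tilde{A}$, $W^{i}$ is the trainable weight matrix of the layer, $F^{i}$ is the output of the $(i-1)$-th convolutional layer for $i > 1$, and $F^{1} = X$ denotes the input node features. A total number of $I$ convolutional layers are stacked to learn node representations, and the output matrix $F^{I+1}$ can be viewed as the final node representations learned by the GCN model.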
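The layer described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, assuming the standard GCN propagation rule of Kipf and Welling (2016) with self-loops and symmetric normalization; the function name `gcn_layer`, the toy graph, and the random weights `W1` and `W2` are hypothetical and stand in for parameters that would be learned in practice.

```python
import numpy as np

def gcn_layer(A, F, W):
    """One GCN layer: ReLU(D~^{-1/2} (A + I) D~^{-1/2} F W)."""
    A_tilde = A + np.eye(A.shape[0])            # add self-loops
    deg = A_tilde.sum(axis=1)                   # degrees of A_tilde (>= 1 for non-negative A)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt   # symmetric normalization
    return np.maximum(A_hat @ F @ W, 0.0)       # ReLU

# Toy example: path graph 0-1-2 with 2-dimensional node features.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
X = np.array([[1., 0.],
              [0., 1.],
              [1., 1.]])
rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 4))   # hypothetical weights; learned in practice
W2 = rng.normal(size=(4, 4))
F2 = gcn_layer(A, X, W1)       # output of the first layer
F3 = gcn_layer(A, F2, W2)      # output of the second layer
print(F3.shape)                # one 4-dimensional representation per node
```

Note that the output is still a matrix with one row per node, which is why a pooling step is needed to obtain a single graph-level vector.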
As we described above, the GCN model has been designed for learning node representations. In the end, the output of the GCN model is a matrix instead of a vector. The procedure of the GCN is rather “flat”, as it can only “pass message” between nodes through edges but cannot summarize the node information into the higher level graph representation. A simple way to summarize the node information to generate graph level representation is global pooling. For example, we could use the average of the node representations as the graph representation. However, in this way, a lot of key information is ignored and the graph structure is also totally overlooked during the pooling process.
To address this challenge, we propose eigenvector-based pooling layers (EigenPooling) to hierarchically summarize node information and generate the graph representation. An illustrative example is shown in Figure 1. In particular, several pooling layers are added between convolutional layers. Each of the pooling layers pools the graph signal defined on a graph into a graph signal defined on a coarsened version of the input graph, which consists of fewer nodes. Thus, the design of the pooling layers consists of two components: 1) graph coarsening, which divides the graph into a set of subgraphs and forms a coarsened graph by treating subgraphs as supernodes; and 2) transforming the original graph signal into a graph signal defined on the coarsened graph with EigenPooling. We coarsen the graph based on a subgraph partition. Given a subgraph partition with no overlaps between subgraphs, we treat each of the subgraphs as a supernode. To form a coarsened graph of the supernodes, we determine the connectivity between the supernodes by the edges across the subgraphs. During the pooling process, for each of the subgraphs, we summarize the information of the graph signal on the subgraph into the supernode. With graph coarsening, we utilize the graph structure information to form coarsened graphs, which makes it possible to learn representations level by level in a hierarchical way. With EigenPooling, we can learn node features of the coarsened graph that exploit the subgraph structure as well as the node features of the input graph.
Figure 1 shows an illustrative example, where a binary graph classification is performed. In this example, the graph is coarsened three times and finally becomes a single supernode. The input is a graph signal (the node features), which can be multi-dimensional. For ease of illustration, we do not show the node features on the graph. Two convolutional layers are applied to the graph signal. Then, the graph signal is pooled to a signal defined on the coarsened graph. This procedure (two convolutional layers and one pooling layer) is repeated two more times and the graph signal is finally pooled to a signal on a single node. This pooled signal on the single node, which is a vector, can be viewed as the graph representation. The graph representation then goes through several fully connected layers and the prediction is made upon the output of the last layer. Next, we introduce the details of graph coarsening and EigenPooling in EigenGCN.
2.2. Graph Coarsening
In this subsection, we introduce how we perform the graph coarsening. As mentioned in the previous subsection, the coarsening process is based on a subgraph partition. There are different ways to separate a given graph into a set of subgraphs with no overlapping nodes. In this paper, we adopt spectral clustering to obtain the subgraphs, so that we can control the number of subgraphs, which, in turn, determines the pooling ratio. We leave other options as future work. Given a set of subgraphs, we treat them as supernodes and build the connections between them in a manner similar to (Tremblay and Borgnat, [n. d.]). An example of graph coarsening and supernodes is shown in Figure 1, where a subgraph and its supernode are denoted using the same color. Next, we introduce how to mathematically describe the subgraphs, supernodes, and their relations.
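Let $\Upsilon$ be a partition of a graph $\mathcal{G}$, which consists of $K$ connected subgraphs $\{\mathcal{G}^{(k)}\}_{k=1}^{K}$. For the graph $\mathcal{G}$, we have the adjacency matrix $A \in \mathbb{R}^{N \times N}$ and the feature matrix $X \in \mathbb{R}^{N \times d}$. Let $N_k$ denote the number of nodes in the subgraph $\mathcal{G}^{(k)}$ and let $\Gamma^{(k)}$ be the list of nodes in subgraph $\mathcal{G}^{(k)}$. Note that each subgraph can also be viewed as a supernode. For each subgraph $\mathcal{G}^{(k)}$, we can define a sampling operator $C^{(k)} \in \mathbb{R}^{N \times N_k}$ as follows:
$$C^{(k)}[i,j] = \begin{cases} 1 & \text{if } \Gamma^{(k)}(j) = v_i, \\ 0 & \text{otherwise,} \end{cases}$$
where $C^{(k)}[i,j]$ denotes the element in the $(i,j)$-th position of $C^{(k)}$ and $\Gamma^{(k)}(j)$ is the $j$-th element in the node list $\Gamma^{(k)}$. This operator provides a relation between the nodes in the subgraph $\mathcal{G}^{(k)}$ and the nodes in the original graph. Given a single dimensional graph signal $x \in \mathbb{R}^{N \times 1}$ defined on the original entire graph, the induced signal that is only defined on the subgraph $\mathcal{G}^{(k)}$ can be written as
$$x^{(k)} = (C^{(k)})^T x.$$
On the other hand, we can also use $C^{(k)}$ to up-sample a graph signal $x^{(k)}$ defined only on the subgraph $\mathcal{G}^{(k)}$ to the entire graph $\mathcal{G}$ by
$$\bar{x} = C^{(k)} x^{(k)}.$$
It keeps the values of the nodes in the subgraph untouched while setting the values of all the other nodes that do not belong to the subgraph to $0$. The operator can be applied to multi-dimensional signals in a similar way. The induced adjacency matrix $A^{(k)} \in \mathbb{R}^{N_k \times N_k}$ of the subgraph $\mathcal{G}^{(k)}$, which only describes the connections within the subgraph $\mathcal{G}^{(k)}$, can be obtained as
$$A^{(k)} = (C^{(k)})^T A \, C^{(k)}.$$
The intra-subgraph adjacency matrix of the graph $\mathcal{G}$, which only consists of the edges inside each subgraph, can be represented as
$$A_{int} = \sum_{k=1}^{K} C^{(k)} A^{(k)} (C^{(k)})^T.$$
Then the inter-subgraph adjacency matrix of graph $\mathcal{G}$, which only consists of the edges between subgraphs, can be represented as $A_{ext} = A - A_{int}$.
Let $\mathcal{G}_{coar}$ denote the coarsened graph, which consists of the supernodes and their connections. We define the assignment matrix $S \in \mathbb{R}^{N \times K}$, which indicates whether a node belongs to a specific subgraph, as:
$$S[i,j] = \begin{cases} 1 & \text{if } v_i \in \Gamma^{(j)}, \\ 0 & \text{otherwise.} \end{cases}$$
Then, the adjacency matrix of the coarsened graph is given as
$$A_{coar} = S^T A_{ext} S.$$
With graph coarsening, we can obtain the connectivity of $\mathcal{G}_{coar}$, i.e., $A_{coar}$. Obviously, $A_{coar}$ encodes the network structure information of $\mathcal{G}$. Next, we describe how to obtain the node features $X_{coar}$ of $\mathcal{G}_{coar}$ using EigenPooling. With $A_{coar}$ and $X_{coar}$, we can stack more layers of GCN-GraphCoarsening-EigenPooling to learn higher-level representations of the graph for classification.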
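The coarsening step can be illustrated with a small NumPy sketch. It assumes the partition is already given as lists of node indices (the paper obtains it with spectral clustering, which is not reproduced here); the helper names `sampling_operator` and `coarsen` and the toy graph are hypothetical.

```python
import numpy as np

def sampling_operator(nodes, N):
    """C^(k) in R^{N x N_k}: C[i, j] = 1 iff the j-th node of the subgraph is node i."""
    C = np.zeros((N, len(nodes)))
    C[nodes, np.arange(len(nodes))] = 1.0
    return C

def coarsen(A, partition):
    """Return A_int, A_ext, the assignment matrix S and the coarsened adjacency A_coar."""
    N = A.shape[0]
    A_int = np.zeros((N, N))
    S = np.zeros((N, len(partition)))
    for k, nodes in enumerate(partition):
        C = sampling_operator(nodes, N)
        A_k = C.T @ A @ C          # induced adjacency of subgraph k
        A_int += C @ A_k @ C.T     # keep only edges inside subgraph k
        S[nodes, k] = 1.0          # node-to-supernode assignment
    A_ext = A - A_int              # edges between different subgraphs
    A_coar = S.T @ A_ext @ S       # connectivity of the supernodes
    return A_int, A_ext, S, A_coar

# Toy graph: a triangle {0, 1, 2} joined to an edge {3, 4} through the edge (2, 3).
A = np.zeros((5, 5))
for i, j in [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]:
    A[i, j] = A[j, i] = 1.0
partition = [[0, 1, 2], [3, 4]]    # assumed to be given, e.g. by a clustering step
A_int, A_ext, S, A_coar = coarsen(A, partition)
print(A_coar)
```

In this sketch each entry of `A_coar` counts the edges running between two subgraphs, so the single cross edge (2, 3) yields a weight-1 edge between the two supernodes.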
2.3. Eigenvector-Based Pooling – EigenPooling
In this subsection, we introduce EigenPooling, which aims to obtain $X_{coar}$ that encodes the network structure information and node features of $\mathcal{G}$. Globally, the pooling operation transforms a graph signal defined on a given graph into a corresponding graph signal defined on the coarsened version of this graph. It is expected that the important information of the original graph signal is largely preserved in the transformed graph signal. Locally, for each of the subgraphs, we summarize the features of the nodes in this subgraph into a single representation of the supernode. It is necessary to consider the structure of the subgraph when we perform the summarizing, as the subgraph structure also encodes important information. However, commonly adopted pooling methods such as max pooling (Ying et al., 2018b; Defferrard et al., 2016) or average pooling (Duvenaud et al., 2015) ignore the graph structure. In some works (Niepert et al., 2016), the subgraph structure is used to find a canonical ordering of the nodes, which is, however, very difficult and expensive. In this work, to use the structure of the subgraphs, we design the pooling operator based on graph spectral theory by utilizing the eigenvectors of the Laplacian matrix of the subgraph. Next, we first briefly review the graph Fourier transform and then introduce the design of EigenPooling based on the graph Fourier transform.
2.3.1. Graph Fourier Transform
Consider a graph $\mathcal{G}$ with $A \in \mathbb{R}^{N \times N}$ being the adjacency matrix and $X \in \mathbb{R}^{N \times d}$ being the node feature matrix. Without loss of generality, for the following description, we consider $d = 1$, i.e., $x \in \mathbb{R}^{N \times 1}$, which can be viewed as a single dimensional graph signal defined on the graph $\mathcal{G}$ (Sandryhaila and Moura, [n. d.]). This is the spatial view of a graph signal, which maps each node in the graph to a scalar value (or a vector if the graph signal is multi-dimensional). Analogous to classical signal processing, we can define the graph Fourier transform (Shuman et al., 2013) and the spectral representation of the graph signal in the spectral domain. To define the graph signal in the spectral domain, we need the Laplacian matrix (Chung and Graham, 1997) $L = D - A$, where $D$ is the diagonal degree matrix with $D[i,i] = \sum_{j=1}^{N} A[i,j]$. The Laplacian matrix can be used to define the “smoothness” of a graph signal (Shuman et al., 2013) as follows:
$$s(x) = x^T L x = \frac{1}{2} \sum_{i,j=1}^{N} A[i,j] \left(x[i] - x[j]\right)^2.$$
$s(x)$ measures the smoothness of the graph signal $x$. The smoothness of a graph signal depends on how dramatically the values of connected nodes can change. The smaller $s(x)$ is, the smoother the signal. For example, for a connected graph, a graph signal with the same value on all the nodes has a smoothness of $0$, which means “extremely smooth” with no change.
As $L$ is a real symmetric positive semi-definite matrix, it has a complete set of orthonormal eigenvectors $\{u_l\}_{l=1}^{N}$. These eigenvectors are also known as the graph Fourier modes (Shuman et al., 2013), which are associated with the ordered real non-negative eigenvalues $\{\lambda_l\}_{l=1}^{N}$. Given a graph signal $x$, the graph Fourier transform can be obtained as follows
$$\hat{x} = U^T x$$
where $U = [u_1, \dots, u_N] \in \mathbb{R}^{N \times N}$ is the matrix consisting of the eigenvectors of $L$. The vector $\hat{x}$ obtained after the transform is the representation of the graph signal in the spectral domain. Correspondingly, the inverse graph Fourier transform, which transfers the spectral representation back to the spatial representation, can be denoted as:
$$x = U \hat{x}. \tag{10}$$
Note that we can also view each eigenvector $u_l$ of the Laplacian matrix $L$ as a graph signal, and its corresponding eigenvalue $\lambda_l$ measures its “smoothness”. For any eigenvector $u_l$, we have:
$$s(u_l) = u_l^T L u_l = \lambda_l u_l^T u_l = \lambda_l.$$
The eigenvectors (or Fourier modes) are a set of base signals with different “smoothness” defined on the graph $\mathcal{G}$. Thus, the graph Fourier transform of a graph signal $x$ can also be viewed as linearly decomposing the signal $x$ into the set of base signals. $\hat{x}$ can be viewed as the coefficients of the linear combination of the base signals that recovers the original signal $x$.
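As a concrete illustration of the transform reviewed above, the following NumPy sketch computes the Fourier modes of a small path graph, transforms a signal, and inverts the transform. It assumes the combinatorial Laplacian (degree matrix minus adjacency matrix); the function name `graph_fourier_basis` and the toy signal are hypothetical.

```python
import numpy as np

def graph_fourier_basis(A):
    """Eigen-decomposition of the combinatorial Laplacian L = D - A."""
    L = np.diag(A.sum(axis=1)) - A
    lam, U = np.linalg.eigh(L)   # eigenvalues in ascending order, orthonormal columns
    return L, lam, U

# Path graph 0-1-2-3 with a scalar signal on its nodes.
A = np.zeros((4, 4))
for i in range(3):
    A[i, i + 1] = A[i + 1, i] = 1.0
x = np.array([1., 2., 3., 4.])

L, lam, U = graph_fourier_basis(A)
x_hat = U.T @ x            # graph Fourier transform
x_rec = U @ x_hat          # inverse graph Fourier transform
smoothness = x @ L @ x     # sum of squared differences over the edges
mode_smoothness = np.array([U[:, l] @ L @ U[:, l] for l in range(4)])
print(np.allclose(x_rec, x), smoothness)
print(np.allclose(mode_smoothness, lam))
```

Because eigenvectors are only determined up to sign (and up to a rotation within repeated eigenvalues), the individual coefficients in `x_hat` may differ between solvers, while their squared magnitudes per eigenspace do not.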
2.3.2. The Design of Pooling Operators
Since the graph Fourier transform maps a graph signal to the spectral domain, taking into consideration both the graph structure and the graph signal information, we adopt the graph Fourier transform to design pooling operators, which pool the graph signal defined on a given graph $\mathcal{G}$ to a signal defined on its coarsened version $\mathcal{G}_{coar}$. The design of the pooling operator is based on the graph Fourier transform of the subgraphs $\{\mathcal{G}^{(k)}\}_{k=1}^{K}$. Let $L^{(k)}$ denote the Laplacian matrix of the subgraph $\mathcal{G}^{(k)}$. We denote the eigenvectors of the Laplacian matrix $L^{(k)}$ as $u_1^{(k)}, \dots, u_{N_k}^{(k)}$. We then use the up-sampling operator $C^{(k)}$ to up-sample these eigenvectors (base signals on this subgraph) to the entire graph and get the up-sampled versions as:
$$\bar{u}_l^{(k)} = C^{(k)} u_l^{(k)}, \quad l = 1, \dots, N_k.$$
With the up-sampled eigenvectors, we organize them into matrices to form pooling operators. Let $\Theta_l \in \mathbb{R}^{N \times K}$ denote the pooling operator consisting of all the $l$-th eigenvectors from all the subgraphs:
$$\Theta_l = \left[\bar{u}_l^{(1)}, \dots, \bar{u}_l^{(K)}\right]. \tag{13}$$
Note that the subgraphs do not necessarily all have the same number of nodes, which means that the number of eigenvectors can differ. Let $N_{max} = \max_{k=1,\dots,K} N_k$ be the largest number of nodes among all the subgraphs. Then, for a subgraph $\mathcal{G}^{(k)}$ with $N_k$ nodes, we set $u_l^{(k)}$ for $N_k < l \le N_{max}$ as $0 \in \mathbb{R}^{N_k \times 1}$. The pooling process with the $l$-th pooling operator $\Theta_l$ can be described as
$$X_l = \Theta_l^T X$$
where $X_l \in \mathbb{R}^{K \times d}$ is the pooled result using the $l$-th pooling operator. The $k$-th row of $X_l$ contains the information pooled from the $k$-th subgraph, which is the representation of the $k$-th supernode.
Following this construction, we build a set of $N_{max}$ pooling operators. To combine the information pooled by the different pooling operators, we can concatenate the results as follows:
$$X_{pooled} = \left[X_1, \dots, X_{N_{max}}\right]$$
where $X_{pooled} \in \mathbb{R}^{K \times d \cdot N_{max}}$ is the final pooled result. For efficient computation, instead of using the results pooled by all the pooling operators, we can choose to use only the first $H$ of them as follows:
$$X_{coar} = \left[X_1, \dots, X_H\right].$$
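The construction of the pooling operators can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the partition is given as lists of node indices, the subgraph Laplacians are combinatorial, eigenvectors are ordered by ascending eigenvalue, and shorter subgraphs are zero-padded as described above. The function names `eigen_pooling_operators` and `eigen_pool` and the toy data are hypothetical and are not taken from the authors' implementation.

```python
import numpy as np

def eigen_pooling_operators(A, partition, H=None):
    """Build the first H pooling operators, each of shape (N, K)."""
    N, K = A.shape[0], len(partition)
    n_max = max(len(nodes) for nodes in partition)
    H = n_max if H is None else min(H, n_max)
    Thetas = [np.zeros((N, K)) for _ in range(H)]
    for k, nodes in enumerate(partition):
        idx = np.asarray(nodes)
        A_k = A[np.ix_(idx, idx)]                 # induced adjacency of subgraph k
        L_k = np.diag(A_k.sum(axis=1)) - A_k      # Laplacian of subgraph k
        _, U_k = np.linalg.eigh(L_k)              # columns sorted by ascending eigenvalue
        for l in range(min(H, len(idx))):
            Thetas[l][idx, k] = U_k[:, l]         # up-sampled l-th eigenvector
        # columns for l >= N_k stay zero (zero padding)
    return Thetas

def eigen_pool(X, Thetas):
    """Pool with every operator and concatenate the results along the feature axis."""
    return np.concatenate([Theta.T @ X for Theta in Thetas], axis=1)

# Toy graph: triangle {0, 1, 2} joined to an edge {3, 4}; 2-dimensional features.
A = np.zeros((5, 5))
for i, j in [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]:
    A[i, j] = A[j, i] = 1.0
partition = [[0, 1, 2], [3, 4]]
X = np.arange(10, dtype=float).reshape(5, 2)

Thetas = eigen_pooling_operators(A, partition)               # all N_max = 3 operators
X_pooled = eigen_pool(X, Thetas)                              # shape (K, d * N_max)
X_coar = eigen_pool(X, eigen_pooling_operators(A, partition, H=1))
print(X_pooled.shape, X_coar.shape)
```

With `H=1` each supernode receives, up to sign, the sum of its node features divided by the square root of the subgraph size, since the first Laplacian eigenvector of a connected subgraph is constant.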
3. Theoretical Analysis of EigenPooling
In this section, we provide a theoretical analysis of EigenPooling by understanding it from local and global perspectives. We prove that the pooling operation can preserve useful information to be processed by the following GCN layers. We also verify that EigenGCN is permutation invariant, which lays the foundation for graph classification with EigenGCN.
3.1. A Local View of EigenPooling
In this subsection, we analyze the pooling operator from a local perspective, focusing on a specific subgraph $\mathcal{G}^{(k)}$. For the subgraph $\mathcal{G}^{(k)}$, the pooling operator tries to summarize the nodes’ features and form a representation for the corresponding supernode of the subgraph. For a pooling operator $\Theta_l$, the part that is effective on the subgraph $\mathcal{G}^{(k)}$ is only the up-sampled eigenvector $\bar{u}_l^{(k)}$, as the other eigenvectors have $0$ values on the subgraph $\mathcal{G}^{(k)}$. Without loss of generality, let us consider a single dimensional graph signal $x \in \mathbb{R}^{N \times 1}$ defined on $\mathcal{G}$; the pooling operation on $\mathcal{G}^{(k)}$ can be represented as:
$$(\bar{u}_l^{(k)})^T x = (u_l^{(k)})^T (C^{(k)})^T x = (u_l^{(k)})^T x^{(k)},$$
which is the Fourier coefficient of the Fourier mode $u_l^{(k)}$ of the subgraph $\mathcal{G}^{(k)}$. Thus, from a local perspective, the pooling process is a graph Fourier transform of the graph signal defined on the subgraph. As introduced in Section 2.3.1, each of the Fourier modes (eigenvectors) is associated with an eigenvalue, which measures its smoothness. The Fourier coefficient of the corresponding Fourier mode indicates the importance of this Fourier mode to the signal. The coefficient summarizes the graph signal information utilizing both the node features and the subgraph structure, as the smoothness is related to both of them. Each of the coefficients characterizes a different property (smoothness) of the graph signal. Using the first $H$ coefficients while discarding the others means that we focus more on the “smoother” part of the graph signal, which is common in many applications such as signal denoising and compression (Tremblay and Borgnat, [n. d.]; Chen et al., 2014; Narang and Ortega, 2012). Therefore, we can use the sum of the squared coefficients to measure how much information is preserved, as shown in the following theorem.
Theorem 3.1.
Let $x$ be a graph signal defined on the graph $\mathcal{G}$ and $U = [u_1, \dots, u_N]$ be the Fourier modes of this graph. Let $\hat{x}$ be the corresponding Fourier coefficients, i.e., $\hat{x} = U^T x$. Let $x' = \sum_{l=1}^{H} \hat{x}[l] \cdot u_l$ be the signal reconstructed using only the first $H$ Fourier modes. Then $\frac{\|\hat{x}[1:H]\|_2^2}{\|\hat{x}\|_2^2}$ measures the information preserved by this reconstruction. Here $\hat{x}[1:H]$ denotes the vector consisting of the first $H$ elements of $\hat{x}$.
Proof. According to Eq.(10), $x$ can be written as $x = \sum_{l=1}^{N} \hat{x}[l] \cdot u_l$. Since $U$ is orthogonal, we have
$$\frac{\|x'\|_2^2}{\|x\|_2^2} = \frac{\left\|\sum_{l=1}^{H} \hat{x}[l] \cdot u_l\right\|_2^2}{\left\|\sum_{l=1}^{N} \hat{x}[l] \cdot u_l\right\|_2^2} = \frac{\|\hat{x}[1:H]\|_2^2}{\|\hat{x}\|_2^2},$$
which completes the proof. ∎
For natural graph signals, the magnitude of the spectral form of the graph signal is commonly concentrated on the first few coefficients (Sandryhaila and Moura, [n. d.]; Shuman et al., 2013), which means that $\hat{x}[l] \approx 0$ for $l > H$. In other words, by using the first $H$ coefficients, we can preserve the majority of the information while reducing the computational cost. We verify this empirically in the experiment section.
3.2. A Global View of EigenPooling
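In this subsection, we analyze the pooling operators from a global perspective, focusing on the entire graph $\mathcal{G}$. The pooling operators we constructed can be viewed as a filterbank (Tremblay and Borgnat, [n. d.]). Each of the filters in the filterbank filters the given graph signal and obtains a new graph signal. In our case, the filtered signal is defined on the coarsened graph $\mathcal{G}_{coar}$. Without loss of generality, we consider a single dimensional signal $x \in \mathbb{R}^{N \times 1}$ of $\mathcal{G}$; then the filtered signals are $\{x_l\}_{l=1}^{N_{max}}$ with $x_l = \Theta_l^T x$. Next, we describe some key properties of these pooling operators.
Property 1: Perfect Reconstruction: The first property is that when $N_{max}$ filters are used, the input graph signal can be perfectly reconstructed from the filtered signals.
Lemma 3.2.
The graph signal $x$ can be perfectly reconstructed from its filtered signals $\{x_l\}_{l=1}^{N_{max}}$ together with the pooling operators $\{\Theta_l\}_{l=1}^{N_{max}}$ as $x = \sum_{l=1}^{N_{max}} \Theta_l x_l$.
Proof. With the definition of $\Theta_l$ given in Eq.(13), we have
$$\sum_{l=1}^{N_{max}} \Theta_l x_l = \sum_{l=1}^{N_{max}} \Theta_l \Theta_l^T x = \sum_{l=1}^{N_{max}} \sum_{k=1}^{K} \bar{u}_l^{(k)} (\bar{u}_l^{(k)})^T x = \sum_{k=1}^{K} \sum_{l=1}^{N_{max}} \bar{u}_l^{(k)} (\bar{u}_l^{(k)})^T x. \tag{19}$$
Obviously, $\sum_{l=1}^{N_{max}} \bar{u}_l^{(k)} (\bar{u}_l^{(k)})^T = C^{(k)} (C^{(k)})^T$, since $u_l^{(k)}$, $l = 1, \dots, N_k$, are orthonormal and $u_l^{(k)}$, $l = N_k + 1, \dots, N_{max}$, are all $0$ vectors. Thus, $\sum_{k=1}^{K} \sum_{l=1}^{N_{max}} \bar{u}_l^{(k)} (\bar{u}_l^{(k)})^T = \sum_{k=1}^{K} C^{(k)} (C^{(k)})^T = I_N$, where $I_N$ is the $N \times N$ identity matrix, because every node belongs to exactly one subgraph. Substituting this into Eq.(19), we arrive at
$$\sum_{l=1}^{N_{max}} \Theta_l x_l = x,$$
which completes the proof. ∎
From Lemma 3.2, we know that if $N_{max}$ filters are chosen, the filtered signals $\{x_l\}_{l=1}^{N_{max}}$ preserve all the information from $x$. Thus, together with graph coarsening, eigenvector pooling can preserve the signal information of the input graph and can enlarge the receptive field, which allows us to finally learn a vector representation for graph classification.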
Property 2: Energy/Information Preserving: The second property is that the filtered signals preserve all the energy when $N_{max}$ filters are chosen. To show this, we first give the following lemma, which serves as a tool for demonstrating Property 2.
Lemma 3.3.
All the columns in the operators $\{\Theta_l\}_{l=1}^{N_{max}}$ are orthogonal to each other.
Proof. By definition, we know that, for the same $k$, i.e., the same subgraph, $u_l^{(k)}$, $l = 1, \dots, N_k$, are orthogonal to each other, which means that $\bar{u}_l^{(k)}$, $l = 1, \dots, N_k$, are also orthogonal to each other. In addition, all the $\bar{u}_l^{(k)}$ with different $k$ are also orthogonal to each other, as they only have non-zero values on different subgraphs. ∎
With the above lemma, we can further conclude that the squared $\ell_2$ norm of the graph signal $x$ is equal to the sum of the squared $\ell_2$ norms of the pooled signals $\{x_l\}_{l=1}^{N_{max}}$, as stated in the following lemma:
Lemma 3.4.
The squared norm of the graph signal $x$ is equal to the sum of the squared norms of the pooled signals $\{x_l\}_{l=1}^{N_{max}}$:
$$\|x\|_2^2 = \sum_{l=1}^{N_{max}} \|x_l\|_2^2.$$
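The derivation of this energy identity does not appear in the text at this point. One way to see it, offered here only as a sketch that assumes the perfect reconstruction of Lemma 3.2 and writes the pooled signals as $x_l = \Theta_l^\top x$ with $N_{max}$ operators in total (notation assumed for this sketch), is the following chain of equalities:

```latex
\begin{aligned}
\|x\|_2^2 = x^\top x
  &= x^\top \sum_{l=1}^{N_{max}} \Theta_l x_l
     && \text{(Lemma 3.2: } x = \textstyle\sum_{l} \Theta_l x_l \text{)} \\
  &= \sum_{l=1}^{N_{max}} \left(\Theta_l^\top x\right)^\top x_l \\
  &= \sum_{l=1}^{N_{max}} x_l^\top x_l
     && \text{(definition: } x_l = \Theta_l^\top x \text{)} \\
  &= \sum_{l=1}^{N_{max}} \|x_l\|_2^2 .
\end{aligned}
```

An alternative route expands the squared norm of the reconstruction directly and uses the orthogonality of Lemma 3.3 to remove the cross terms; both lead to the same identity.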
Property 3: Approximate Energy Preserving: Lemma 3.4 describes the energy preservation when $N_{max}$ filters are chosen. In practice, we only need $H \ll N_{max}$ filters for efficient computation. Next we show that even with $H$ filters, the filtered signals preserve most of the energy/information.
Theorem 3.5.
Let $x' = \sum_{l=1}^{H} \Theta_l x_l$ be the graph signal reconstructed using only the first $H$ pooling operators and pooled signals $\{x_l\}_{l=1}^{H}$. Then the ratio $\frac{\sum_{l=1}^{H} \|x_l\|_2^2}{\sum_{l=1}^{N_{max}} \|x_l\|_2^2}$ measures the portion of information preserved by $x'$.
Proof. As shown in Lemma 3.4, $\|x\|_2^2 = \sum_{l=1}^{N_{max}} \|x_l\|_2^2$. Similarly, we can show that $\|x'\|_2^2 = \sum_{l=1}^{H} \|x_l\|_2^2$. The portion of the information being preserved can be represented as
$$\frac{\|x'\|_2^2}{\|x\|_2^2} = \frac{\sum_{l=1}^{H} \|x_l\|_2^2}{\sum_{l=1}^{N_{max}} \|x_l\|_2^2},$$
which completes the proof. ∎
For natural graph signals, the magnitude of the spectral form of the graph signal is concentrated in the first few coefficients (Sandryhaila and Moura, [n. d.]), which means that even with $H$ filters, EigenPooling can preserve the majority of the information/energy.
3.3. Permutation Invariance of EigenGCN
EigenGCN takes the adjacency matrix $A$ and the node feature matrix $X$ as input and aims to learn a vector representation of the graph. The nodes in a graph do not have a specific ordering, i.e., $A$ and $X$ can be permuted. For the same graph, we want to extract the same representation no matter which permutation of $A$ and $X$ is used as input. Thus, in this subsection, we prove that EigenGCN is permutation invariant, which lays the foundation for using EigenGCN for graph classification.
Theorem 3.6.
Let $P \in \{0,1\}^{N \times N}$ be any permutation matrix; then $\mathrm{EigenGCN}(A, X) = \mathrm{EigenGCN}(P A P^T, P X)$, i.e., EigenGCN is permutation invariant.
Proof. In order to prove that EigenGCN is permutation invariant, we only need to show that its key components (GCN, graph coarsening and EigenPooling) are unaffected by the permutation. For GCN, before permutation, the output is $F = \mathrm{ReLU}(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} X W)$. With permutation, the output becomes
$$\mathrm{ReLU}\left((P \tilde{D}^{-\frac{1}{2}} P^T)(P \tilde{A} P^T)(P \tilde{D}^{-\frac{1}{2}} P^T)(P X) W\right) = \mathrm{ReLU}\left(P \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} X W\right) = P F,$$
where we have used $P^T P = I_N$ and the fact that ReLU acts element-wise. This shows that the permutation only changes the order of the node representations output by GCN but does not change their values. Second, the graph coarsening is done by spectral clustering with $A$. No matter which permutation we have, the detected subgraphs will not change. Finally, EigenPooling summarizes the information within each subgraph. Since the subgraph structures are not affected by the permutation and the representation of each node in the subgraphs is also not affected by the permutation, the supernodes’ representations after EigenPooling are not affected by the permutation. In addition, the inter-connectivity of supernodes is not affected since it is determined by spectral clustering. Thus, one step of GCN-GraphCoarsening-EigenPooling is permutation invariant. Since EigenGCN finally learns one vector representation of the input graph, we can conclude that the vector representation is the same under any permutation $P$. ∎
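The GCN step of this argument can be checked numerically. The sketch below assumes the standard GCN layer with self-loops and symmetric normalization, uses a random toy graph, and only verifies that permuting the input permutes the rows of the output; the mean readout at the end is a stand-in for illustration and is not the EigenGCN readout. All names are hypothetical.

```python
import numpy as np

def gcn_layer(A, F, W):
    """One GCN layer: ReLU(D~^{-1/2} (A + I) D~^{-1/2} F W)."""
    A_tilde = A + np.eye(A.shape[0])
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))
    return np.maximum(D_inv_sqrt @ A_tilde @ D_inv_sqrt @ F @ W, 0.0)

rng = np.random.default_rng(0)
N, d, d_out = 6, 3, 4
upper = np.triu(rng.integers(0, 2, size=(N, N)), k=1)
A = (upper + upper.T).astype(float)       # random symmetric adjacency, no self-loops
X = rng.normal(size=(N, d))
W = rng.normal(size=(d, d_out))

perm = rng.permutation(N)
P = np.eye(N)[perm]                        # permutation matrix

F = gcn_layer(A, X, W)
F_perm = gcn_layer(P @ A @ P.T, P @ X, W)

rows_permuted = np.allclose(F_perm, P @ F)                      # output rows are permuted
same_readout = np.allclose(F_perm.mean(axis=0), F.mean(axis=0)) # order-free summary agrees
print(rows_permuted, same_readout)
```

The first check reflects that the layer only reorders node representations; the second shows that any summary that does not depend on node order then yields the same vector.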
4. Experiments
In this section, we conduct experiments to evaluate the effectiveness of the proposed framework EigenGCN. Specifically, we aim to answer two questions:
Can EigenGCN improve the graph classification performance through the design of EigenPooling?
How reliable is it to use only $H$ filters for pooling?
We begin by introducing the data sets and experimental settings. We then compare EigenGCN with representative and state-of-the-art baselines for graph classification to answer the first question. We further conduct an analysis of graph signals to verify the reasonableness of using $H$ filters, which answers the second question.
4.1. Data sets
To verify whether the proposed framework can hierarchically learn good graph representations for classification, we evaluate EigenGCN on 6 widely used benchmark data sets for graph classification (Kersting et al., 2016), which include three protein graph data sets, i.e., ENZYMES (Borgwardt et al., 2005; Schomburg et al., 2004), PROTEINS (Borgwardt et al., 2005; Dobson and Doig, 2003), and D&D (Dobson and Doig, 2003; Shervashidze et al., 2011); one mutagen data set, Mutagenicity (Riesen and Bunke, 2008; Kazius et al., 2005) (denoted as MUTAG in Table 1 and Table 2); and two data sets that consist of chemical compounds screened for activity against non-small cell lung cancer and ovarian cancer cell lines, NCI1 and NCI109 (Wale et al., 2008). Some statistics of the data sets can be found in Table 1. From the table, we can see that the data sets contain varied numbers of graphs and have different graph sizes. We include data sets of different domains, sample sizes and graph sizes to give a comprehensive understanding of how EigenGCN performs under various conditions.
4.2. Baselines and Experimental Settings
To compare the performance of graph classification, we consider some representative and state-of-the-art graph neural network models with various pooling layers. Next, we briefly introduce these baseline approaches as well as the experimental settings for them.
GCN (Kipf and Welling, 2016) is a graph neural network framework proposed for semi-supervised node classification. It learns node representations by aggregating information from neighbors. As the GCN model does not consist of a pooling layer, we directly pool the learned node representations as the graph representation. We use it as a baseline to compare whether a hierarchical pooling layer is necessary.
GraphSage (Hamilton et al., 2017) is similar to GCN and provides various aggregation methods. As with GCN, we directly pool the learned node representations as the graph representation.
SET2SET. This baseline is also built upon GCN; it is also “flat” but uses the set2set architecture introduced in (Vinyals et al., 2015) instead of averaging over all the nodes. We select this method to further examine whether a hierarchical pooling layer is necessary regardless of whether average or other pooling methods are used.
Diff-pool (Ying et al., 2018b) is a graph neural network model designed for graph-level representation learning with differentiable pooling layers. It uses node representations learned by an additional convolutional layer to learn the subgraphs (supernodes) and coarsens the graph based on them. We select this model as it achieves state-of-the-art performance on the graph classification task.
EigenGCN-H represents the variants of the proposed framework EigenGCN, where $H$ denotes the number of pooling operators we use for EigenPooling. In this evaluation, we consider several values of $H$.
For each of the data sets, we randomly split it into three parts: a training set, a validation set and a testing set. We repeat the random splitting process several times, and the average performance over the different splits is reported. The parameters of the baselines are chosen based on their performance on the validation set. For the proposed framework, we use the training and validation sets of the splits to tune the structure of the graph neural network as well as the learning rate. The same structure and learning rate are then used for all splits.
4.3. Performance on Graph Classification
Each experiment is run 10 times and the average graph classification performance in terms of accuracy is reported in Table 2. From the table, We make the following observations:
- Diff-pool and the framework perform better than the methods without a hierarchical pooling procedure in most cases. Aggregating the node information hierarchically can help learn better graph representations.
- The proposed framework shares the same convolutional layer with GCN, GraphSage, and SET2SET. However, the proposed framework (with different ) outperforms them on most of the data sets. This further indicates the necessity of the hierarchical pooling procedure. In other words, the proposed can indeed help the graph classification performance.
- On most of the data sets, the variants of the with more eigenvectors achieve better performance than those with fewer eigenvectors. Including more eigenvectors, which means that more information is preserved during pooling, can help learn better graph representations in most cases. On some data sets, however, including more eigenvectors brings no improvement or even makes the performance worse. Theoretically, we are able to preserve more information by using more eigenvectors. However, noise signals may also be preserved, whereas they can be filtered out when fewer eigenvectors are used.
- The proposed achieves state-of-the-art or at least comparable performance on all the data sets, which shows the effectiveness of the proposed framework .
To sum up, can help learn better graph representations, and the proposed framework with can achieve state-of-the-art performance in the graph classification task.
4.4. Understanding Graph Signals
In this subsection, we investigate the distribution of the Fourier coefficients of signals in real data. We aim to show that for natural graph signals, most of the information/energy concentrates on the first few Fourier modes (or eigenvectors). This paves the way for using only filters in . Specifically, given one data set, for each graph with nodes and its associated signal , we first calculate the graph Fourier transform and obtain the coefficients . We then calculate the following ratio: , where denotes the first rows of the matrix for various values of . According to Theorem 3.5, this ratio measures how much information can be preserved by the first coefficients. We then average the ratio over the entire data set and obtain
Note that if , we set . We visualize the ratio for each of the data sets up to in Figure 2. As shown in Figure 2, for most of the data sets, the magnitude of the coefficients is concentrated in the first few coefficients, which demonstrates the reasonableness of using only filters in . In addition, using filters can save computational cost.
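The energy-concentration measurement can be illustrated with a small sketch. The assumptions are as follows: the ratio is taken to be the squared norm of the first `h` Fourier coefficients divided by the squared norm of all coefficients; the ratio is set to 1 when `h` is at least the number of nodes; signals are single-channel; and every graph is a path graph, chosen only because its Laplacian eigenvectors have a closed form (the DCT-II basis), which avoids a numerical eigendecomposition. None of the function names come from the paper.

```python
import math


def path_fourier_basis(n):
    """Orthonormal Laplacian eigenvectors of an n-node path graph.

    Row k is the eigenvector for eigenvalue 2 - 2*cos(pi*k/n), so rows are
    ordered from the smoothest mode (k = 0) to the most oscillating one.
    """
    basis = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        basis.append(
            [scale * math.cos(math.pi * k * (i + 0.5) / n) for i in range(n)]
        )
    return basis


def graph_fourier_transform(signal, basis):
    """Project the signal onto each eigenvector to get its coefficients."""
    return [sum(u_i * x_i for u_i, x_i in zip(u, signal)) for u in basis]


def energy_ratio(coeffs, h):
    """Fraction of signal energy carried by the first h coefficients."""
    total = sum(c * c for c in coeffs)
    if h >= len(coeffs) or total == 0.0:
        return 1.0  # assumed convention when h covers the whole spectrum
    return sum(c * c for c in coeffs[:h]) / total


def average_ratio(signals, h):
    """Average the per-graph ratio over a collection of path-graph signals."""
    ratios = []
    for x in signals:
        coeffs = graph_fourier_transform(x, path_fourier_basis(len(x)))
        ratios.append(energy_ratio(coeffs, h))
    return sum(ratios) / len(ratios)


# Two smooth toy signals on path graphs with 8 and 5 nodes (made-up data).
signals = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    [2.0, 2.5, 3.0, 2.5, 2.0],
]
for h in (1, 2, 4, 8):
    print(h, round(average_ratio(signals, h), 3))
```

For smooth signals such as these, the averaged ratio approaches 1 after only a few coefficients, which is the kind of behavior the subsection reports for real data.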
5. Related Work
In recent years, graph neural network models, which try to extend deep neural network models to graph structured data, have attracted increasing interest. These graph neural network models have been applied to various applications in many different areas. In (Kipf and Welling, 2016), a graph neural network model that learns node representations by aggregating the features of each node's neighbors is applied to semi-supervised node classification. Similar methods were later proposed to further enhance the performance by including an attention mechanism (Veličković et al., 2017). GraphSage (Hamilton et al., 2017), which allows a more flexible aggregation procedure, was designed for the same task. Some graph neural network models are designed to reason about the dynamics of physical systems, where the model is applied to predict future states of nodes given their previous states (Battaglia et al., 2016; Sanchez-Gonzalez et al., 2018). Most of the aforementioned methods fit in the framework of “message passing” neural networks (Gilmer et al., 2017), which mainly involves transforming, propagating and aggregating node features across the graph through edges. Another stream of graph neural networks was developed based on the graph Fourier transform (Defferrard et al., 2016; Bruna et al., 2013; Henaff et al., 2015; Levie et al., 2017). The features are first transformed to the spectral domain, next filtered with learnable filters and then transformed back to the spatial domain. The connection between these two streams of work is shown in (Defferrard et al., 2016; Kipf and Welling, 2016). Graph neural networks have also been extended to different types of graphs (Ma et al., 2018, 2019; Derr et al., 2018) and applied to various applications (Wang et al., 2018; Fan et al., 2019; Ying et al., 2018a; Monti et al., 2017; Schlichtkrull et al., 2018; Trivedi et al., 2017).
Comprehensive surveys on graph neural networks can be found in (Zhou et al., 2018; Wu et al., 2019; Zhang et al., 2018a; Battaglia et al., 2018).
However, the design of the graph neural network layers is inherently “flat”, which means the output of pure graph neural network layers is node representations for all the nodes in the graph. To apply graph neural networks to the graph classification task, an approach to summarize the learned node representations and generate the graph representation is needed. A simple way to generate the graph representation is to globally combine the node representations. Different combination approaches have been investigated, which include averaging over all node representations as the graph representation (Duvenaud et al., 2015), adding a “virtual node” connected to all the nodes in the graph and using its node representation as the graph representation (Li et al., 2015), and using conventional fully connected layers or convolutional layers after arranging the graph to the same size (Zhang et al., 2018b; Gilmer et al., 2017). However, these global pooling methods cannot hierarchically learn graph representations, thus ignoring important information in the graph structure. There are a few recent works (Defferrard et al., 2016; Ying et al., 2018b; Simonovsky and Komodakis, 2017; Fey et al., 2018) investigating learning graph representations with a hierarchical pooling procedure. These methods usually involve two steps: 1) coarsening a graph by grouping nodes into supernodes to form a hierarchical structure and 2) learning supernode representations level by level to finally obtain the graph representation. These methods use mean-pooling or max-pooling when they generate supernode representations, which neglects the important structure information in the subgraphs. In this paper, we propose a pooling operator based on the local graph Fourier transform, which utilizes the subgraph structure as well as the node features for generating the supernode representations.
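The limitation of mean-pooling and max-pooling noted above can be made concrete with a minimal sketch. The function names and the encoding of the coarsening as a node-to-supernode assignment list are assumptions made for illustration; they are not taken from any of the cited methods. Neither function receives the edges of the subgraphs, so two subgraphs with the same node features but different wiring yield identical supernode representations.

```python
def mean_pool(features, assignment, n_supernodes):
    """Average the node feature vectors inside each supernode.

    Assumes every supernode contains at least one node.
    """
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(n_supernodes)]
    counts = [0] * n_supernodes
    for feat, s in zip(features, assignment):
        counts[s] += 1
        for j, v in enumerate(feat):
            sums[s][j] += v
    return [[v / c for v in row] for row, c in zip(sums, counts)]


def max_pool(features, assignment, n_supernodes):
    """Take the element-wise maximum of node features inside each supernode."""
    dim = len(features[0])
    best = [[float("-inf")] * dim for _ in range(n_supernodes)]
    for feat, s in zip(features, assignment):
        for j, v in enumerate(feat):
            best[s][j] = max(best[s][j], v)
    return best


# Four nodes with 2-dimensional features (made-up values).
features = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]

# Hierarchical step: nodes 0 and 1 form supernode 0, nodes 2 and 3 form supernode 1.
coarse_mean = mean_pool(features, [0, 0, 1, 1], 2)
coarse_max = max_pool(features, [0, 0, 1, 1], 2)

# Global ("flat") readout: every node is assigned to one single supernode.
global_mean = mean_pool(features, [0, 0, 0, 0], 1)

print(coarse_mean, coarse_max, global_mean)
```

A structure-aware pooling operator would additionally take the adjacency of each subgraph as input, which is the gap the proposed operator is meant to address.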
In this paper, we design , a pooling operator based on the local graph Fourier transform, which can extract subgraph information utilizing both the node features and the structure of the subgraph. We provide a theoretical analysis of the pooling operator from both local and global perspectives. The pooling operator together with a subgraph-based graph coarsening method forms the pooling layer, which can be incorporated into any graph neural network to hierarchically learn graph level representations. We further propose a graph neural network framework by combining the proposed pooling layers with the GCN convolutional layers. Comprehensive graph classification experiments were conducted on commonly used graph classification benchmarks. Our proposed framework achieves state-of-the-art performance on most of the data sets, which demonstrates its effectiveness.
Yao Ma and Jiliang Tang are supported by the National Science Foundation (NSF) under grant numbers IIS-1714741, IIS-1715940, IIS-1845081 and CNS-1815636, and a grant from Criteo Faculty Research Award.
- Battaglia et al. (2016) Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, et al. 2016. Interaction networks for learning about objects, relations and physics. In NIPS. 4502–4510.
- Battaglia et al. (2018) Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. 2018. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261 (2018).
- Borgwardt et al. (2005) Karsten M Borgwardt, Cheng Soon Ong, Stefan Schönauer, SVN Vishwanathan, Alex J Smola, and Hans-Peter Kriegel. 2005. Protein function prediction via graph kernels. Bioinformatics 21, suppl_1 (2005), i47–i56.
- Bruna et al. (2013) Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. 2013. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203 (2013).
- Chen et al. (2014) Siheng Chen, Aliaksei Sandryhaila, Jose MF Moura, and Jelena Kovacevic. 2014. Signal denoising on graphs via graph filtering. In GlobalSIP.
- Chung and Graham (1997) Fan RK Chung and Fan Chung Graham. 1997. Spectral graph theory. Number 92. American Mathematical Soc.
- Dai et al. (2016) Hanjun Dai, Bo Dai, and Le Song. 2016. Discriminative embeddings of latent variable models for structured data. In ICML. 2702–2711.
- Defferrard et al. (2016) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS.
- Derr et al. (2018) Tyler Derr, Yao Ma, and Jiliang Tang. 2018. Signed graph convolutional networks. In 2018 IEEE International Conference on Data Mining (ICDM). IEEE, 929–934.
- Dobson and Doig (2003) Paul D Dobson and Andrew J Doig. 2003. Distinguishing enzyme structures from non-enzymes without alignments. JMB 330, 4 (2003), 771–783.
- Duvenaud et al. (2015) David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. 2015. Convolutional networks on graphs for learning molecular fingerprints. In NIPS. 2224–2232.
- Fan et al. (2019) Wenqi Fan, Yao Ma, Qing Li, Yuan He, Eric Zhao, Jiliang Tang, and Dawei Yin. 2019. Graph Neural Networks for Social Recommendation. arXiv preprint arXiv:1902.07243 (2019).
- Fey et al. (2018) Matthias Fey, Jan Eric Lenssen, Frank Weichert, and Heinrich Müller. 2018. SplineCNN: Fast geometric deep learning with continuous B-spline kernels. In CVPR. 869–877.
- Gao and Ji (2019a) Hongyang Gao and Shuiwang Ji. 2019a. Graph representation learning via hard and channel-wise attention networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM.
- Gao and Ji (2019b) Hongyang Gao and Shuiwang Ji. 2019b. Graph U-nets. In Proceedings of The 36th International Conference on Machine Learning.
- Gao et al. (2018) Hongyang Gao, Zhengyang Wang, and Shuiwang Ji. 2018. Large-scale learnable graph convolutional networks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1416–1424.
- Gilmer et al. (2017) Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. 2017. Neural Message Passing for Quantum Chemistry. In ICML. 1263–1272.
- Hamilton et al. (2017) Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In NIPS. 1024–1034.
- Henaff et al. (2015) Mikael Henaff, Joan Bruna, and Yann LeCun. 2015. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163 (2015).
- Kazius et al. (2005) Jeroen Kazius, Ross McGuire, and Roberta Bursi. 2005. Derivation and validation of toxicophores for mutagenicity prediction. JMC 48, 1 (2005), 312–320.
- Kersting et al. (2016) Kristian Kersting, Nils M. Kriege, Christopher Morris, Petra Mutzel, and Marion Neumann. 2016. Benchmark Data Sets for Graph Kernels. http://graphkernels.cs.tu-dortmund.de
- Kipf and Welling (2016) Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
- Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In NIPS. 1097–1105.
- Levie et al. (2017) Ron Levie, Federico Monti, Xavier Bresson, and Michael M Bronstein. 2017. Cayleynets: Graph convolutional neural networks with complex rational spectral filters. arXiv preprint arXiv:1705.07664 (2017).
- Li et al. (2015) Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. 2015. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493 (2015).
- Ma et al. (2018) Yao Ma, Ziyi Guo, Zhaochun Ren, Eric Zhao, Jiliang Tang, and Dawei Yin. 2018. Dynamic graph neural networks. arXiv preprint arXiv:1810.10627 (2018).
- Ma et al. (2019) Yao Ma, Suhang Wang, Charu C Aggarwal, Dawei Yin, and Jiliang Tang. 2019. Multi-dimensional Graph Convolutional Networks. In Proceedings of the 2019 SIAM International Conference on Data Mining. SIAM, 657–665.
- Monti et al. (2017) Federico Monti, Michael Bronstein, and Xavier Bresson. 2017. Geometric matrix completion with recurrent multi-graph neural networks. In Advances in Neural Information Processing Systems. 3697–3707.
- Narang and Ortega (2012) Sunil K Narang and Antonio Ortega. 2012. Perfect reconstruction two-channel wavelet filter banks for graph structured data. IEEE Transactions on Signal Processing 60, 6 (2012), 2786–2799.
- Niepert et al. (2016) Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. 2016. Learning convolutional neural networks for graphs. In ICML. 2014–2023.
- Riesen and Bunke (2008) Kaspar Riesen and Horst Bunke. 2008. IAM graph database repository for graph based pattern recognition and machine learning. In Joint IAPR International Workshops on SPR and (SSPR). Springer, 287–297.
- Sanchez-Gonzalez et al. (2018) Alvaro Sanchez-Gonzalez, Nicolas Heess, Jost Tobias Springenberg, Josh Merel, Martin Riedmiller, Raia Hadsell, and Peter Battaglia. 2018. Graph networks as learnable physics engines for inference and control. arXiv preprint arXiv:1806.01242 (2018).
- Sandryhaila and Moura ([n. d.]) Aliaksei Sandryhaila and José MF Moura. [n. d.]. Discrete signal processing on graphs. IEEE transactions on signal processing 61, 7 ([n. d.]), 1644–1656.
- Scarselli et al. (2009) Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. 2009. The graph neural network model. IEEE Transactions on Neural Networks 20, 1 (2009), 61–80.
- Schlichtkrull et al. (2018) Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In European Semantic Web Conference. Springer, 593–607.
- Schomburg et al. (2004) Ida Schomburg, Antje Chang, Christian Ebeling, Marion Gremse, Christian Heldt, Gregor Huhn, and Dietmar Schomburg. 2004. BRENDA, the enzyme database: updates and major new developments. Nucleic acids research 32, suppl_1 (2004), D431–D433.
- Shervashidze et al. (2011) Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, and Karsten M Borgwardt. 2011. Weisfeiler-lehman graph kernels. JMLR 12, Sep (2011), 2539–2561.
- Shuman et al. (2013) David I Shuman, Sunil K Narang, Pascal Frossard, Antonio Ortega, and Pierre Vandergheynst. 2013. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine 30, 3 (2013), 83–98.
- Simonovsky and Komodakis (2017) Martin Simonovsky and Nikos Komodakis. 2017. Dynamic Edge-Conditioned Filters in Convolutional Neural Networks on Graphs. In CVPR. 3693–3702.
- Tremblay and Borgnat ([n. d.]) Nicolas Tremblay and Pierre Borgnat. [n. d.]. Subgraph-based filterbanks for graph signals. IEEE Transactions on Signal Processing 64, 15 ([n. d.]), 3827–3840.
- Trivedi et al. (2017) Rakshit Trivedi, Hanjun Dai, Yichen Wang, and Le Song. 2017. Know-evolve: Deep temporal reasoning for dynamic knowledge graphs. In Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 3462–3471.
- Veličković et al. (2017) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2017. Graph Attention Networks. arXiv preprint arXiv:1710.10903 (2017).
- Vinyals et al. (2015) Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. 2015. Order matters: Sequence to sequence for sets. arXiv preprint arXiv:1511.06391 (2015).
- Wale et al. (2008) Nikil Wale, Ian A Watson, and George Karypis. 2008. Comparison of descriptor spaces for chemical compound retrieval and classification. Knowledge and Information Systems 14, 3 (2008), 347–375.
- Wang et al. (2018) Xiaolong Wang, Yufei Ye, and Abhinav Gupta. 2018. Zero-shot recognition via semantic embeddings and knowledge graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6857–6866.
- Wu et al. (2019) Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S Yu. 2019. A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596 (2019).
- Ying et al. (2018a) Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. 2018a. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 974–983.
- Ying et al. (2018b) Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamilton, and Jure Leskovec. 2018b. Hierarchical graph representation learning with differentiable pooling. In NIPS. 4805–4815.
- Zhang et al. (2018b) Muhan Zhang, Zhicheng Cui, Marion Neumann, and Yixin Chen. 2018b. An End-to-End Deep Learning Architecture for Graph Classification. (2018).
- Zhang et al. (2018a) Ziwei Zhang, Peng Cui, and Wenwu Zhu. 2018a. Deep learning on graphs: A survey. arXiv preprint arXiv:1812.04202 (2018).
- Zhou et al. (2018) Jie Zhou, Ganqu Cui, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, and Maosong Sun. 2018. Graph Neural Networks: A Review of Methods and Applications. arXiv preprint arXiv:1812.08434 (2018).