Many real-world data, such as social networks, collections of documents and chemical structures, are naturally represented as graphs. Consequently, there exists great potential for the application of machine learning to graphs. Given the great success of neural networks, or deep learning, in the analysis of images, there has recently been much research on the application or generalization of neural networks to graphs. In many cases this has resulted in state-of-the-art performance on many tasks (Wu et al., 2019).
The graph convolutional network is a neural network architecture commonly applied to graphs. This architecture consists of a sequence of convolutional layers where each layer iteratively updates a representation, or embedding, of each vertex. This update is achieved through the application of an operation which considers the current representation of each vertex plus the current representations of its adjacent neighbours (Gilmer et al., 2017). The output of a sequence of convolutional layers is a representation of each vertex which encodes properties of the vertex in question and of the vertices in its neighbourhood.
If one wishes to perform a vertex-centric task such as vertex classification, then one may operate directly on the set of vertex representations output from a sequence of convolutional layers. However, if one wishes to perform a graph-centric task such as graph classification, then the set of vertex representations must somehow be integrated, or pooled, to form a single graph representation. Pooling represents a challenging problem because there exists no vertex ordering and different graphs may have different numbers of vertices. Commonly employed pooling methods include computing the mean or sum of vertex representations. However, these simple pooling methods are not a complete invariant in the sense that many different sets of vertex representations may result in the same graph representation, leading to weak discrimination power (Xu et al., 2018). To overcome this issue and increase discrimination power a number of authors have proposed more sophisticated pooling methods. For example, Ying et al. (2018) proposed a pooling method which performs a hierarchical clustering of vertex representations.
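The fact that sum and mean pooling are not complete invariants can be seen with a small numerical sketch (the vertex representations below are invented for illustration):

```python
import numpy as np

# Two distinct sets of 2-dimensional vertex representations.
X1 = np.array([[1.0, 0.0], [0.0, 1.0]])
X2 = np.array([[0.5, 0.5], [0.5, 0.5]])

# Sum and mean pooling collapse both sets to the same graph representation,
# so a downstream classifier cannot distinguish the two graphs.
assert np.allclose(X1.sum(axis=0), X2.sum(axis=0))
assert np.allclose(X1.mean(axis=0), X2.mean(axis=0))
```

A pooling method with greater discrimination power must map these two sets to distinct graph representations.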
In this article we propose a novel pooling method which maps a set of vertex representations to a function space representation which forms a vector space. This method is parameterized by a single learnable parameter which controls the discrimination power of the method. This makes the method applicable to both finer and coarser classification tasks, which require greater and lesser discrimination power respectively. The proposed pooling method is inspired by related methods in the field of applied topology which map sets of points into a function space representation (Adams et al., 2017).
2 Related Pooling Methods
The simplest and most commonly used pooling methods involve computing basic summary statistics, such as the mean or sum, of vertex representations (Duvenaud et al., 2015). To improve discrimination power, more sophisticated pooling methods have been proposed. The SortPooling method first sorts the vertices with respect to their structural roles in the graph (Wu et al., 2019). The representations of the first vertices in this ordering, where the number of vertices retained is a fixed hyper-parameter of the model, are then used as input to a traditional one-dimensional convolutional network. Set2set is a general approach for embedding a set in a manner which is invariant to element order (Vinyals et al., 2015); Gilmer et al. (2017) proposed to use this method to perform pooling. Ying et al. (2018) proposed a pooling method which performs a hierarchical clustering of vertex representations. Kearnes et al. (2016) proposed a pooling method based on fuzzy histograms. This method has similarities to the one proposed in this article but is formulated in terms of fuzzy theory as opposed to function spaces; the method proposed in this article is in turn distinct. All of the above pooling methods are supervised; Bai et al. (2019) proposed a pooling method which is unsupervised.
3 Function Space Pooling
Let G = (V, E) be a graph where V and E are the corresponding sets of vertices and edges respectively. Let X be the set of vertex representations output from a sequence of convolutional layers applied to G. We assume that each vertex representation is an element of R^n. The proposed pooling method takes X as input and returns a function. That is, the method is a map from the space of sets to a space of functions. It contains two steps which we now describe in turn.
The set of vertex representations X is an object in the category of sets, which we denote Σ. Let s : R^n → I^n be the n-dimensional sigmoid function defined in Equation 1, where I^n = [0, 1]^n is the n-dimensional unit interval:

s(x) = (1 / (1 + e^{-x_1}), ..., 1 / (1 + e^{-x_n}))    (1)

In the first step of the proposed pooling method we apply the n-dimensional sigmoid elementwise to X to give a map s : Σ → Σ. To illustrate this map, consider Figure 1, which displays an example set containing three elements in R^2; the result of applying the map s to this set is also illustrated in Figure 1.
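This first step can be sketched as follows; the example vectors are invented and the two-dimensional case is assumed purely for illustration:

```python
import numpy as np

def sigmoid(X):
    """Apply the sigmoid elementwise, mapping R^n into the unit interval (0, 1)^n."""
    return 1.0 / (1.0 + np.exp(-X))

# Three example vertex representations in R^2.
X = np.array([[2.0, -1.0], [0.0, 0.5], [-3.0, 1.5]])
S = sigmoid(X)

# Every squashed representation now lies strictly inside the unit square.
assert S.shape == X.shape
assert np.all((S > 0) & (S < 1))
```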
Let g be a probability distribution on I^n. For the purposes of this work we used the n-dimensional Gaussian distribution defined in Equation 2 with mean μ and standard deviation σ:

g(z; μ, σ) = (1 / (2πσ^2)^{n/2}) exp(−||z − μ||^2 / (2σ^2))    (2)
In the second step of the proposed pooling method we apply a map F : Σ → L^p(I^n) to s(X). Here L^p(I^n) is the vector space of functions on I^n for which the p-th power of the absolute value is Lebesgue integrable, equipped with the L^p norm defined in Equation 3 (Christensen, 2010):

||f||_p = ( ∫_{I^n} |f(z)|^p dz )^{1/p}    (3)

Note that function addition and subtraction are performed pointwise.
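On a discretized domain (as used later for the finite-dimensional approximation), the integral in the L^p norm reduces to a finite sum weighted by the grid cell volume. A minimal sketch, assuming a regular grid over [0, 1]:

```python
import numpy as np

def lp_norm(f_values, cell_volume, p=2):
    """Approximate the L^p norm of a function sampled on a regular grid."""
    return (np.sum(np.abs(f_values) ** p) * cell_volume) ** (1.0 / p)

# The constant function f = 1 on [0, 1], sampled at 100 grid points,
# has L^p norm 1 for every p.
f = np.ones(100)
assert abs(lp_norm(f, cell_volume=1.0 / 100, p=2) - 1.0) < 1e-9
```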
The function resulting from the map F is defined in Definition 1. To illustrate this map, consider again the example set illustrated in Figure 1; Figure 1 also displays the function resulting from applying the map F to this set for a particular value of the parameter σ.

Definition 1. For a set of vertex representations X, the corresponding function representation F(s(X)) is defined in Equation 4:

F(s(X))(z) = Σ_{y ∈ s(X)} g(z; y, σ)    (4)
The elements of L^p(I^n), and in turn the function representation F(s(X)), belong to an infinite-dimensional vector space. That is, there are an infinite number of elements in the domain of F(s(X)). We approximate this function by a finite-dimensional vector by discretizing the function domain using a regular grid. For example, the image in Figure 1 corresponds to a discretization of the function domain using such a grid.
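The two steps together can be sketched as follows: squash each vertex representation into the unit interval, then sum a Gaussian centred at each squashed point, evaluated on a regular grid. This is an illustrative reconstruction rather than the authors' implementation; the two-dimensional case, the grid size and the value of σ are arbitrary choices.

```python
import numpy as np

def function_space_pool(X, sigma=0.1, grid_size=16):
    """Pool a set of 2-d vertex representations into a discretized function.

    Each representation is mapped through the sigmoid into (0, 1)^2, then a
    Gaussian centred at the squashed point is evaluated on a grid_size x
    grid_size grid; the pooled representation is the sum of these Gaussians.
    """
    S = 1.0 / (1.0 + np.exp(-X))                       # step 1: map into (0, 1)^2
    ticks = np.linspace(0.0, 1.0, grid_size)
    gx, gy = np.meshgrid(ticks, ticks)                 # regular grid over the domain
    out = np.zeros((grid_size, grid_size))
    for y in S:                                        # step 2: sum of Gaussians
        sq_dist = (gx - y[0]) ** 2 + (gy - y[1]) ** 2
        out += np.exp(-sq_dist / (2.0 * sigma ** 2)) / (2.0 * np.pi * sigma ** 2)
    return out

X = np.array([[2.0, -1.0], [0.0, 0.5], [-3.0, 1.5]])
F = function_space_pool(X)
assert F.shape == (16, 16)
```

Because the result is a sum over set elements, the pooled representation is invariant to the order in which the vertex representations are presented, as required for a pooling method.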
The proposed pooling method is parameterized by the standard deviation σ of the probability distribution in Equation 2, where this parameter lies in the range (0, ∞). As the value of σ approaches 0, the probability distribution approaches an indicator function on the domain I^n. On the other hand, as the value of σ approaches ∞, the probability distribution approaches a uniform function on the domain I^n. For example, Figure 1 displays the functions resulting from applying the map F to the set in Figure 1 with a small and a large value of σ respectively.
The parameter σ may be interpreted in a couple of ways. As the value of σ approaches 0, the function representation approaches a complete invariant. That is, distinct sets map to distinct functions, where the distance between these functions as defined by the norm in Equation 3 is greater than zero. On the other hand, as σ approaches ∞, the distance between these functions reduces. An alternative interpretation of the parameter is as follows. As mentioned above, as the value of σ approaches ∞, the probability distribution approaches a uniform function on the domain I^n. In this case the proposed pooling method differs from computing the sum of the elements of X by a multiplicative constant only.
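The effect of σ on discrimination power can be checked numerically: for two distinct point sets with equal sums, the distance between their pooled functions is large for small σ and shrinks toward zero as σ grows. A one-dimensional sketch (point sets and grid size are invented for illustration):

```python
import numpy as np

def pool_1d(points, sigma, grid_size=200):
    """Sum of unnormalized 1-d Gaussians centred at the given points, on a grid over [0, 1]."""
    z = np.linspace(0.0, 1.0, grid_size)
    return sum(np.exp(-(z - p) ** 2 / (2.0 * sigma ** 2)) for p in points)

def distance(f, g):
    """Discrete analogue of the L^2 distance between two pooled functions."""
    return np.sqrt(np.mean((f - g) ** 2))

# Two distinct point sets in [0, 1] with equal sums, so sum pooling
# cannot separate them.
A, B = [0.2, 0.8], [0.4, 0.6]

# Small sigma: the pooled functions differ substantially.
# Large sigma: the pooled functions become nearly uniform and the
# distance between them shrinks toward zero.
d_small = distance(pool_1d(A, 0.01), pool_1d(B, 0.01))
d_large = distance(pool_1d(A, 10.0), pool_1d(B, 10.0))
assert d_small > d_large
```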
4 Evaluation

To evaluate the performance of the proposed pooling method we considered the task of graph classification. The layout of this section is as follows. Section 4.1 describes the datasets considered. Section 4.2 describes the feed-forward network architecture used in all experiments. Section 4.3 describes the optimization method used to optimize the network parameters. Finally, Section 4.4 presents the classification accuracy achieved by the proposed pooling method relative to two benchmark methods.
4.1 Datasets

The first dataset considered was the MUTAG dataset, which consists of 188 chemical compounds, where the classification problem is binary and concerns predicting whether or not a chemical compound is mutagenic (Debnath et al., 1991). Each chemical compound is represented as a graph in which each vertex is assigned one of a fixed number of distinct types.
The second dataset considered was the PROTEINS dataset, which consists of proteins where the classification task is binary and concerns predicting whether or not a protein is an enzyme (Borgwardt et al., 2005). Each protein is represented as a graph in which each vertex is assigned one of a fixed number of distinct types. Both of the datasets considered are commonly used to evaluate graph neural networks (Fey and Lenssen, 2019).
4.2 Network Architecture
The feed-forward network architecture used consists of the following six layers. The first two layers are convolutional layers; a number of studies have found that two convolutional layers empirically give the best performance (Kipf and Welling, 2016). The third layer is a fully connected linear layer. The fourth layer is the pooling method used. The fifth layer is another fully connected linear layer. The final layer is a softmax function.
The convolutional layers used are similar to the GraphSAGE convolutional layers (Hamilton et al., 2017). Let H^i denote the matrix containing the vertex representations in the i-th convolutional layer, where each matrix row corresponds to the representation of an individual vertex. Let A denote the adjacency matrix of G, · denote matrix multiplication and CONCAT denote horizontal matrix concatenation. The i-th convolutional layer is implemented using Equation 5, where W^i and b^i are the corresponding weights and biases respectively:

H^i = CONCAT(H^{i−1}, A · H^{i−1}) · W^i + b^i    (5)

The weights W^i form a matrix of dimension 2 d_{i−1} × d_i, where d_i is the dimension of the i-th layer. The biases b^i form a vector of dimension d_i. The dimension d_0 of the input layer is equal to the number of vertex types, since one-hot encoding was used. The dimensions d_1 and d_2 of the two convolutional layers were both set to the same value.
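This convolutional layer can be sketched in a few lines. The sketch is a reconstruction under stated assumptions: the adjacency matrix A aggregates neighbour representations, a ReLU nonlinearity is assumed for illustration, and the example graph and dimensions are invented.

```python
import numpy as np

def conv_layer(H, A, W, b):
    """One GraphSAGE-style convolutional layer (a sketch of Equation 5).

    H : (num_vertices, d_prev) current vertex representations
    A : (num_vertices, num_vertices) adjacency matrix
    W : (2 * d_prev, d_next) layer weights
    b : (d_next,) layer biases
    """
    neighbour_sum = A @ H                                   # aggregate adjacent representations
    combined = np.concatenate([H, neighbour_sum], axis=1)   # CONCAT of self and neighbours
    return np.maximum(combined @ W + b, 0.0)                # linear map + assumed ReLU

# A hypothetical 3-vertex path graph with one-hot vertex-type inputs (d_0 = 2).
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
H0 = np.eye(3, 2)
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 5)), np.zeros(5)
H1 = conv_layer(H0, A, W1, b1)
assert H1.shape == (3, 5)
```

Note how the weight matrix has 2 d_{i−1} rows, matching the concatenation of each vertex's own representation with the aggregate of its neighbours.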
The dimensions of the first and second linear layers were set to fixed values. The output of the first linear layer is the input to the pooling method. Therefore the dimension of the multi-dimensional interval I^n corresponding to the domain of the function in Definition 1 is equal to the dimension of the first linear layer. We approximate this function by a finite-dimensional vector by discretizing the function domain using a regular grid with a fixed number of elements in each dimension.
4.3 Model Optimization

The model parameters to be optimized in the architecture of Section 4.2 are the weights and biases of the convolutional and linear layers plus the parameter σ of the pooling method. For the loss function, a cross-entropy loss term plus an L2 regularization term was used; a fixed regularization weight was used in all experiments. The Adam optimization algorithm was used to optimize all model parameters with a fixed learning rate, and in all experiments optimization was performed for a fixed number of epochs.
Table 1: Mean classification accuracy (± standard deviation) of 10-fold cross validation for each pooling method on each dataset.

| Pooling method | MUTAG | PROTEINS |
| --- | --- | --- |
| Sum | 65.6 ± 13 | 60.1 ± 18 |
| Mean | 78.1 ± 18 | 57.7 ± 16 |
| Function Space | 83.3 ± 11 | 72.8 ± 19 |
4.4 Classification Accuracy
The proposed pooling method was benchmarked against the methods of computing the mean and sum of vertex representations. As discussed in Section 2, these are among the most commonly used pooling methods. For each benchmark pooling method the corresponding network architecture was identical to that described in Section 4.2, with the exception that the pooling layer was replaced and the dimension of the linear layer before this layer was changed accordingly. For both datasets considered we computed the mean accuracy of 10-fold cross validation for each pooling method; the results of this analysis are displayed in Table 1. For both datasets, the proposed pooling method outperformed both benchmark methods.
5 Conclusion

We propose a novel pooling method for convolutional layers in graph neural networks which involves computing a function space representation of the set of vertex representations. Experimental results demonstrate that the proposed method outperforms the commonly employed pooling methods of computing the mean and sum of vertex representations.
References
- Adams et al. (2017) Henry Adams, Tegan Emerson, Michael Kirby, Rachel Neville, Chris Peterson, Patrick Shipman, Sofya Chepushtanova, Eric Hanson, Francis Motta, and Lori Ziegelmeier. Persistence images: A stable vector representation of persistent homology. The Journal of Machine Learning Research, 18(1):218–252, 2017.
- Bai et al. (2019) Yunsheng Bai, Hao Ding, Yang Qiao, Agustin Marinovic, Ken Gu, Ting Chen, Yizhou Sun, and Wei Wang. Unsupervised inductive whole-graph embedding by preserving graph proximity. arXiv preprint arXiv:1904.01098, 2019.
- Borgwardt et al. (2005) Karsten M Borgwardt, Cheng Soon Ong, Stefan Schönauer, SVN Vishwanathan, Alex J Smola, and Hans-Peter Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21(suppl_1):i47–i56, 2005.
- Christensen (2010) Ole Christensen. Functions, spaces, and expansions: mathematical tools in physics and engineering. Springer Science & Business Media, 2010.
- Debnath et al. (1991) Asim Kumar Debnath, Rosa L Lopez de Compadre, Gargi Debnath, Alan J Shusterman, and Corwin Hansch. Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. correlation with molecular orbital energies and hydrophobicity. Journal of medicinal chemistry, 34(2):786–797, 1991.
- Duvenaud et al. (2015) David Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alan Aspuru-Guzik, and Ryan Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pages 2224–2232, 2015.
- Fey and Lenssen (2019) Matthias Fey and Jan Eric Lenssen. Fast graph representation learning with pytorch geometric. arXiv preprint arXiv:1903.02428, 2019.
- Gilmer et al. (2017) Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1263–1272, 2017.
- Hamilton et al. (2017) Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.
- Kearnes et al. (2016) Steven Kearnes, Kevin McCloskey, Marc Berndl, Vijay Pande, and Patrick Riley. Molecular graph convolutions: moving beyond fingerprints. Journal of computer-aided molecular design, 30(8):595–608, 2016.
- Kipf and Welling (2016) Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
- Vinyals et al. (2015) Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. Order matters: Sequence to sequence for sets. arXiv preprint arXiv:1511.06391, 2015.
- Wu et al. (2019) Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S Yu. A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596, 2019.
- Xu et al. (2018) Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018.
- Ying et al. (2018) Rex Ying, Jiaxuan You, Christopher Morris, Xiang Ren, William L Hamilton, and Jure Leskovec. Hierarchical graph representation learning with differentiable pooling. arXiv preprint arXiv:1806.08804, 2018.