1 Introduction and Related Work
Most networks display some level of hierarchical topology (Barabási & Pósfai, 2016); examples include actor networks, the semantic web and the internet at the autonomous system level (Ravasz & Barabási, 2003).
The defining feature of hierarchical network topology is how the clustering coefficient, $C_i$, of the $i$th node scales with its degree, $k_i$; namely,

$$C_i \propto k_i^{-1}. \qquad (1)$$
Nodes with high degree and low clustering coefficient are those that connect different communities. Nodes with low degree and high clustering coefficient are those that connect predominantly within their own community. Other distinct characteristics of hierarchical networks are that the clustering coefficient is independent of the number of nodes and that they show scale-free topology (Barabási & Pósfai, 2016).
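The scaling in equation 1 can be checked directly on any graph. The sketch below (pure Python, with an illustrative toy adjacency list, not data from the paper) computes each node's local clustering coefficient and degree, the two quantities whose relationship defines a hierarchical topology; note the hub node has high degree and low clustering, while the leaf nodes show the reverse.

```python
from itertools import combinations

def clustering_coefficient(adj, node):
    """Local clustering coefficient: fraction of pairs of
    neighbours of `node` that are themselves connected."""
    neighbours = adj[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))

# Toy example: two triangles joined through hub node 0.
adj = {
    0: {1, 2, 3, 4},
    1: {0, 2}, 2: {0, 1},
    3: {0, 4}, 4: {0, 3},
}

for node in adj:
    print(node, len(adj[node]), clustering_coefficient(adj, node))
```

Here the hub (degree 4) has clustering coefficient 1/3, while every degree-2 node has clustering coefficient 1, mirroring the $C_i \propto k_i^{-1}$ trend.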
In this work, we present and explore a novel architecture operating on the graph domain which attempts to utilise the intrinsic hierarchical topological information embedded in a wide range of datasets. Graph neural networks, a booming field of deep learning, were built on the assertion that the intrinsic graph structure of datasets was underutilised (Zhou et al., 2018). However, the field of multiscale graph neural networks effectively does not exist, despite the evidence for hierarchical underutilisation being just as strong as that for graph structure underutilisation (Ravasz & Barabási, 2003).

Limited work has been done in the area of multiscale graph neural networks. Luzhnica, Day & Liò (2019) introduced clique pooling; however, cliques (complete subgraphs) are more restrictive than clusters (dense subgraphs), and hierarchical networks are more likely to consist of clusters than of cliques (beyond triangular cliques). The work on multiscale graph convolutions by Abu-El-Haija et al. (2018) is the closest to ours; however, it does not use hierarchical clustering algorithms.
In terms of applications, Lu et al. (2019) used a hierarchical, level-by-level treatment of quantum interactions in molecules, Zitnik & Leskovec (2017) used hierarchical multiplex graphs for multiple spatial scales of brain tissue, and Kim et al. (2019) developed a temporal multiscale graph convolutional neural network for analysing bike-sharing demands. Our multiscale decomposition was loosely inspired by the multiscale neural network using hierarchical block matrices by Fan et al. (2019); however, that network was designed to solve partial differential equations and does not operate on the graph domain.

2 Architecture
The architecture is shown in Figure 1. We input an unweighted, undirected graph, $G = (V, E)$, with $N$ nodes, $v_i \in V$, and edges, $(v_i, v_j) \in E$. We construct $\hat{A} = A + I_N$, where $I_N$ is the identity matrix and $A$ is the adjacency matrix given by

$$A_{ij} = \begin{cases} 1 & \text{if } (v_i, v_j) \in E, \\ 0 & \text{otherwise}. \end{cases} \qquad (2)$$
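The construction of $\hat{A} = A + I$ is a few lines of code; the sketch below uses a toy node count and edge list for illustration, not values from the dataset.

```python
def build_a_hat(n, edges):
    """Build A + I for an undirected graph as a dense list-of-lists."""
    # Start from the identity matrix (the self-loop term I).
    a_hat = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for i, j in edges:
        a_hat[i][j] = 1  # adjacency entries, symmetric for an undirected graph
        a_hat[j][i] = 1
    return a_hat

# Toy path graph 0-1-2-3.
a_hat = build_a_hat(4, [(0, 1), (1, 2), (2, 3)])
```

The diagonal carries the self-loops added by $I_N$, which prevent a node's own features from being discarded during propagation.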
A multiscale decomposition, $\{G_s\}$, is taken through hierarchical clustering,

$$G \mapsto \{G_s : s = 0, 1, \ldots, S-1\}, \qquad (3)$$

and hence,

$$\hat{A} \mapsto \{\hat{A}_s : s = 0, 1, \ldots, S-1\}. \qquad (4)$$
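A divisive decomposition of this kind can be sketched as follows. The Girvan-Newman algorithm used later in the paper repeatedly removes the edge with the highest edge betweenness; to keep this sketch self-contained, a simplified stand-in ranking (sum of endpoint degrees) is used instead, and evenly spaced snapshots of the shrinking edge set are kept as the scales. All names and the toy graph are illustrative.

```python
def multiscale_decomposition(edges, num_scales):
    """Return `num_scales` edge sets, from the full graph down to
    progressively sparser ones, by greedy edge removal.
    NOTE: real Girvan-Newman removes the highest edge-betweenness edge;
    the degree-sum ranking here is a simplified stand-in."""
    edges = list(edges)
    snapshots = []
    step = max(1, len(edges) // num_scales)  # snapshot spacing
    while edges and len(snapshots) < num_scales:
        snapshots.append(list(edges))
        for _ in range(step):
            if not edges:
                break
            degree = {}
            for i, j in edges:
                degree[i] = degree.get(i, 0) + 1
                degree[j] = degree.get(j, 0) + 1
            # Remove the edge whose endpoints have the largest total degree.
            edges.remove(max(edges, key=lambda e: degree[e[0]] + degree[e[1]]))
    return snapshots

scales = multiscale_decomposition([(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)], 3)
```

Each snapshot plays the role of one $G_s$ in equation 3: same node set, progressively fewer edges.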
We pass this and the input feature matrix, $X \in \mathbb{R}^{N \times F}$, with $F$ features per node, through the first convolutional layer, with $N$ input nodes, obtaining the set over $s$,

$$H_s^{(1)} = \sigma\left(\hat{D}_s^{-\frac{1}{2}} \hat{A}_s \hat{D}_s^{-\frac{1}{2}} X W_s^{(0)}\right), \qquad (5)$$

where $W_s^{(0)} \in \mathbb{R}^{F \times F_1}$, $F_1$ is the number of input nodes to the second layer, $W_s^{(l)}$ is the weight matrix for the $l$th neural network layer on the $s$th scale, $\sigma$ is a nonlinear activation function and $\hat{D}_s$ is the $s$th diagonal node degree matrix. The Kipf & Welling (2016) propagation rule was used in equation 5. We then pass through the second convolutional layer,

$$H_s^{(2)} = \sigma\left(\hat{D}_s^{-\frac{1}{2}} \hat{A}_s \hat{D}_s^{-\frac{1}{2}} H_s^{(1)} W_s^{(1)}\right), \qquad (6)$$

where $W_s^{(1)} \in \mathbb{R}^{F_1 \times C}$, with $C$ being the number of classes. We then vertically concatenate the latent feature matrices from each scale, obtaining

$$H = \left[H_0^{(2)}; H_1^{(2)}; \ldots; H_{S-1}^{(2)}\right]. \qquad (7)$$

Here, $H \in \mathbb{R}^{SN \times C}$. Finally, this is fed through a fully connected (FC) layer, resulting in the node classification probability matrix as output,

$$Y = \mathrm{softmax}\left(H W_{FC} + B\right), \qquad (8)$$

where $W_{FC}$ is the weight matrix and $B$ is a bias matrix. In our implementation of this architecture, we chose .
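The propagation rule of equations 5 and 6 can be sketched in pure Python for a single scale; the tiny dimensions and the weight values below are arbitrary placeholders for illustration, not trained parameters.

```python
import math

def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def gcn_layer(a_hat, X, W):
    """One Kipf & Welling propagation step:
    ReLU(D^-1/2 * A_hat * D^-1/2 * X * W), where A_hat = A + I."""
    n = len(a_hat)
    d_inv_sqrt = [1.0 / math.sqrt(sum(row)) for row in a_hat]
    # Symmetric normalisation of A_hat by the node degree matrix.
    norm = [[a_hat[i][j] * d_inv_sqrt[i] * d_inv_sqrt[j] for j in range(n)]
            for i in range(n)]
    Z = matmul(matmul(norm, X), W)
    return [[max(0.0, z) for z in row] for row in Z]  # ReLU activation

# Tiny example: 3 nodes, 2 input features, 2 hidden units.
a_hat = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]  # A + I for the path graph 0-1-2
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # illustrative feature matrix
W = [[0.5, -0.5], [0.5, 0.5]]              # placeholder weights
H1 = gcn_layer(a_hat, X, W)
```

Running this per scale, with $\hat{A}_s$ substituted for `a_hat`, yields the set $\{H_s^{(1)}\}$ of equation 5; feeding each result back in with a second weight matrix gives equation 6.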
We used a graph convolutional network (GCN) layer depth of two here, but this architecture can easily be generalised to any GCN layer depth. Ensembling across scales, as done here, is supported by a significant literature demonstrating its power as an architectural choice (Xibin et al., 2020).
3 Experiments
Tests were conducted on a truncated version of the Cora dataset (https://relational.fit.cvut.cz/dataset/CORA). Its properties are detailed in Table 1. Code can be found at [link removed for review].
Table 1: Truncated Cora dataset.

Type: Machine learning citation network
Task: Semi-supervised node classification
Number of nodes: 2708
Number of labelled nodes: 140
Number of edges: 5429
Number of classes: 7
Feature vector length: 1433
3.1 Hyperparameters
Prior to performing the following experiments, the Cora network was put through a Girvan-Newman clustering algorithm. This produced 528 usable networks (excluding the last one, which contained no edges, meaning graph convolutions could not be performed properly). The nonlinear activation function used was ReLU. Adam optimisation was used with a learning rate of 0.01 and weight decay of . The number of epochs used to train the convolutional layers was 300 for each graph. The number of epochs used to train the FC layer was 10. Given that this was a multi-class classification task with no major skew in the dataset, accuracy was deemed a sufficient performance metric. Two tests were performed for each data point, with error bars given by the standard deviation. The strategy for assigning the number of hidden nodes in a given GCN layer was to take the arithmetic mean of the number of nodes in the directly neighbouring layers, ensuring consistency. Each data point took approximately 10 minutes on an Nvidia GTX Titan X. Figure 2 summarises the results. Overall, the architecture achieved accuracies of 77% using only 5% of the truncated Cora dataset for training.
3.1.1 Depth
Three graphs were used: numbers 0, 200 and 400, that is, the coarse-grained representation, an intermediate scale and the finest scale respectively. The number of FC hidden nodes was 30. The layer depth was varied from one to five layers.
The results are shown in Figure 2(a). We note that the accuracy initially increases, peaking at a layer depth of 2, then drops steadily as more layers are added. This is consistent with a well-known phenomenon: below a problem-specific depth threshold the performance of the neural network is poor, but well beyond it, additional depth again degrades performance (Loukas, 2020). It is speculated that a network that is too shallow is unable to learn more complex and abstract nonlinear features, whereas a network that is too deep is liable to overfit. Overly deep networks can also suffer from the vanishing gradient phenomenon (Ghosh & Ghosh, 2019).
3.1.2 Scales
Each graph from the multiscale decomposition can be interpreted as a different scale of the original network. GCN depth was kept at two. The number of FC hidden nodes was 7; we note that this is fewer than in the layer depth experiment, and we note the accompanying drop in accuracy. The number of scales used, $S$, was varied from 3 to 176.
Figure 2(b) displays the results. There seems to be no strong trend. The highest accuracy was achieved with 8 scales. This was higher than for the two smaller numbers of scales, supporting the claim that multiscale decomposition does improve performance, but only up to a threshold number of scales. This is similar to the behaviour observed when altering the layer depth.
3.1.3 Hidden Nodes
The GCN layer depth was kept at two. Three graphs were used: numbers 0, 200 and 400. The number of hidden nodes in the FC layer was varied from 5 to 45. The FC network always had 7 input nodes and 7 output nodes.
The results are shown in Figure 2(d). The accuracy increases sharply when increasing from 5 to 20 hidden nodes, but then a plateau is reached, where adding hidden nodes does not significantly increase performance. Additional, more extreme tests, not plotted here, were performed at 100 and 1000 hidden nodes, both still returning approximately 76% accuracy, confirming the plateau. Loukas (2020) suggests that the product of GNN width (number of hidden nodes) and depth must exceed (a function of) the graph size for the network to be performant: the neural network must be either deep or wide. This could explain the initial increase in accuracy as the number of hidden nodes is increased.
3.2 Noise
This experiment was conducted using standard hyperparameters: a GCN depth of two; the same three graphs from the depth and hidden node experiments; and 30 hidden nodes in the FC layer. The other experimental parameters detailed at the beginning of the hyperparameter section also apply here.
The method for noise addition selected $pN_t$ nodes out of the $N_t = 140$ in the training set, where $p \in [0, 1]$ is the noise parameter. It then replaced the feature vectors corresponding to those nodes with random vectors of binary values, maintaining the original dimensions. The values were sampled from a uniform probability distribution.
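The noise procedure can be sketched as below; the feature length, node count and seed are illustrative placeholders, not the paper's values.

```python
import random

def add_noise(features, p, seed=0):
    """Replace a fraction p of feature vectors with uniform random
    binary vectors of the same length (returns a copy)."""
    rng = random.Random(seed)
    noisy = [list(row) for row in features]
    n_replace = int(p * len(features))
    # Choose which nodes to corrupt uniformly at random, without replacement.
    for idx in rng.sample(range(len(features)), n_replace):
        noisy[idx] = [rng.randint(0, 1) for _ in range(len(features[idx]))]
    return noisy

features = [[0] * 8 for _ in range(10)]  # toy: 10 nodes, 8 binary features
noisy = add_noise(features, p=0.5)
```

Because the replacement vectors keep the original dimensions, the corrupted feature matrix can be fed through the network unchanged.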
The effect of varying the noise parameter on accuracy is shown in Figure 2(c). As expected, accuracy decreases as the noise is increased; the decline is gradual, indicating the model is reasonably robust to feature noise.
4 Discussion
4.1 Limitations
There are three main limitations to the approach taken in the architecture described. First, there is added computational complexity due to the hierarchical clustering step. The Girvan-Newman (2002) algorithm scales as $O(n^3)$ for a sparse network. The algorithm must be invoked every time a new network is input into the model, so it is important to choose the optimal hierarchical clustering algorithm. Second, more RAM is needed to store the graphs produced by the multiscale decomposition. A memory-efficient alternative could be devised, such as deserialising individual disk-stored graphs from the multiscale decomposition each time one scale of GCN layers is trained, then freeing them before training the next scale. Third, training many graph convolutional networks is computationally more expensive than training only one. The performance benefit should outweigh the extra computation time.
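The serialise-one-scale-at-a-time scheme mentioned above can be sketched with the standard library; filenames and the training callback are placeholders for illustration.

```python
import os
import pickle
import tempfile

def save_scales(scales, directory):
    """Serialise each scale's graph separately so that only one
    scale needs to be resident in RAM during training."""
    paths = []
    for s, graph in enumerate(scales):
        path = os.path.join(directory, f"scale_{s}.pkl")
        with open(path, "wb") as f:
            pickle.dump(graph, f)
        paths.append(path)
    return paths

def train_each_scale(paths, train_fn):
    """Load one scale at a time, train on it, then let it be freed."""
    for path in paths:
        with open(path, "rb") as f:
            graph = pickle.load(f)
        train_fn(graph)  # placeholder for GCN training on this scale

with tempfile.TemporaryDirectory() as tmp:
    scales = [[(0, 1)], [(0, 1), (1, 2)]]  # toy edge lists for two scales
    paths = save_scales(scales, tmp)
    seen = []
    train_each_scale(paths, seen.append)
```

Only one scale's graph is ever held in memory inside the loop, trading disk I/O for a peak-RAM reduction roughly proportional to the number of scales.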
4.2 Future Work
Performance could be compared across multiple datasets: an artificially generated hierarchical network, a perfectly non-hierarchical network such as an Erdős–Rényi (1960) random graph, and a stochastic block model network (Karrer & Newman, 2011), to understand the impact of adding communities. A neural network approach to hierarchical clustering could be implemented (Tian et al., 2014; Yang et al., 2017). Assessing performance while varying the distribution of scales, beyond the linear spacing used here, may be informative.
Architecture changes could include condensing the nodes in each community into a single node, rather than using graphs with different numbers of edges. The decomposition would then produce graphs with different numbers of nodes, each cluster represented by one node whose feature vector is the average of the features of all nodes in that cluster. These graphs would be pushed through graph convolutional layers, after which each node would be unpacked by expanding it back into its original cluster of nodes, with every child node inheriting the parent node's feature vector. Performing this on all graphs would restore a common number of nodes across scales, permitting the usual flattening and FC layer.
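This condense-and-unpack idea amounts to a pooling and unpooling pair. The sketch below assumes a given cluster assignment and illustrative features; it averages features down to one vector per cluster, then broadcasts each cluster vector back to its member nodes.

```python
def condense(features, clusters):
    """Average node features within each cluster (pooling)."""
    pooled = []
    for members in clusters:
        dim = len(features[0])
        pooled.append([sum(features[m][d] for m in members) / len(members)
                       for d in range(dim)])
    return pooled

def unpack(pooled, clusters, n_nodes):
    """Broadcast each cluster feature back to its member nodes (unpooling)."""
    out = [None] * n_nodes
    for c, members in enumerate(clusters):
        for m in members:
            out[m] = list(pooled[c])
    return out

clusters = [[0, 1], [2, 3]]                 # toy cluster assignment
features = [[1.0], [3.0], [0.0], [4.0]]     # toy 1-dimensional features
pooled = condense(features, clusters)       # one mean vector per cluster
restored = unpack(pooled, clusters, 4)      # every node gets its cluster mean
```

Graph convolutions would operate on the pooled representation between the two steps; after unpacking, all scales share the node count of the original graph again.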
4.3 Applications
First, the architecture would be well suited to quantum molecular property prediction using the QM9 or QM7b datasets (Blum & Reymond, 2009; Montavon et al., 2013; Ramakrishnan et al., 2014; Ruddigkeit et al., 2012), given the hierarchical nature of molecular systems demonstrated by the improved performance of the multilevel architecture of Lu et al. (2019). Quantum molecular property prediction can aid drug discovery. Second, the protein interface prediction network by Fout et al. (2017) could be augmented with our architecture, perhaps realising performance gains; this is a prime target for our architecture due to the hierarchical nature of protein interactions.
Lastly, the graph element network by Alet et al. (2019) could be extended with our architecture. Graph element networks aim to be the neural network analogue of finite element analysis, using graph neural networks as computational substrates. Multiscale finite element analysis, in which meshes of different scales are used from coarse to fine, already exists; multiscale graph element networks do not, despite being a clear extension.
References
 Abu-El-Haija et al. (2018) Abu-El-Haija, S., Kapoor, A., Perozzi, B., and Lee, J. N-GCN: multi-scale graph convolution for semi-supervised node classification. CoRR, abs/1802.08888, 2018. URL http://arxiv.org/abs/1802.08888.
 Alet et al. (2019) Alet, F., Jeewajee, A. K., Bauza, M., Rodriguez, A., Lozano-Perez, T., and Kaelbling, L. P. Graph element networks: adaptive, structured computation and memory. arXiv preprint arXiv:1904.09019, 2019.
 Barabási & Pósfai (2016) Barabási, A.-L. and Pósfai, M. Network science. Cambridge University Press, Cambridge, 2016. ISBN 978-1-107-07626-6. URL http://barabasi.com/networksciencebook/.
 Blum & Reymond (2009) Blum, L. C. and Reymond, J.-L. 970 million drug-like small molecules for virtual screening in the chemical universe database GDB-13. J. Am. Chem. Soc., 131:8732, 2009.
 Erdős & Rényi (1960) Erdős, P. and Rényi, A. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci, 5(1):17–60, 1960.
 Fan et al. (2019) Fan, Y., Lin, L., Ying, L., and Zepeda-Núñez, L. A multiscale neural network based on hierarchical matrices. Multiscale Modeling & Simulation, 17(4):1189–1213, 2019.
 Fout et al. (2017) Fout, A., Byrd, J., Shariat, B., and Ben-Hur, A. Protein interface prediction using graph convolutional networks. In Advances in neural information processing systems, pp. 6530–6539, 2017.
 Ghosh & Ghosh (2019) Ghosh, S. and Ghosh, S. Exploring the ideal depth of neural network when predicting question deletion on community question answering. In Proceedings of the 11th Forum for Information Retrieval Evaluation, pp. 52–55, 2019.
 Girvan & Newman (2002) Girvan, M. and Newman, M. E. Community structure in social and biological networks. Proceedings of the national academy of sciences, 99(12):7821–7826, 2002.
 Karrer & Newman (2011) Karrer, B. and Newman, M. E. Stochastic blockmodels and community structure in networks. Physical review E, 83(1):016107, 2011.
 Kipf & Welling (2016) Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. CoRR, abs/1609.02907, 2016. URL http://arxiv.org/abs/1609.02907.
 Loukas (2020) Loukas, A. What graph neural networks cannot learn: depth vs width. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=B1l2bp4YwS.
 Lu et al. (2019) Lu, C., Liu, Q., Wang, C., Huang, Z., Lin, P., and He, L. Molecular property prediction: A multilevel quantum interactions modeling perspective. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 1052–1060, 2019.
 Luzhnica et al. (2019) Luzhnica, E., Day, B., and Liò, P. Clique pooling for graph classification. CoRR, abs/1904.00374, 2019. URL http://arxiv.org/abs/1904.00374.
 Montavon et al. (2013) Montavon, G., Rupp, M., Gobre, V., VazquezMayagoitia, A., Hansen, K., Tkatchenko, A., Müller, K.R., and von Lilienfeld, O. A. Machine learning of molecular electronic properties in chemical compound space. New Journal of Physics, 15(9):095003, 2013. URL http://stacks.iop.org/13672630/15/i=9/a=095003.
 Ramakrishnan et al. (2014) Ramakrishnan, R., Dral, P. O., Rupp, M., and von Lilienfeld, O. A. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, 1, 2014.
 Ravasz & Barabási (2003) Ravasz, E. and Barabási, A.L. Hierarchical organization in complex networks. Phys. Rev. E, 67:026112, Feb 2003. doi: 10.1103/PhysRevE.67.026112. URL https://link.aps.org/doi/10.1103/PhysRevE.67.026112.
 Ruddigkeit et al. (2012) Ruddigkeit, L., Van Deursen, R., Blum, L. C., and Reymond, J.-L. Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. Journal of chemical information and modeling, 52(11):2864–2875, 2012.
 San Kim et al. (2019) San Kim, T., Lee, W. K., and Sohn, S. Y. Graph convolutional network approach applied to predict hourly bike-sharing demands considering spatial, temporal, and global effects. PloS one, 14(9), 2019.
 Tian et al. (2014) Tian, F., Gao, B., Cui, Q., Chen, E., and Liu, T.-Y. Learning deep representations for graph clustering. In Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.
 Xibin et al. (2020) Xibin, D., Zhiwen, Y., Wenming, C., Yifan, S., and Qianli, M. A survey on ensemble learning. Frontiers of Computer Science, 14(2):241–258, 2020.
 Yang et al. (2017) Yang, C., Liu, M., Wang, Z., Liu, L., and Han, J. Graph clustering with dynamic embedding. arXiv preprint arXiv:1712.08249, 2017.
 Zhou et al. (2018) Zhou, J., Cui, G., Zhang, Z., Yang, C., Liu, Z., and Sun, M. Graph neural networks: A review of methods and applications. CoRR, abs/1812.08434, 2018. URL http://arxiv.org/abs/1812.08434.
 Zitnik & Leskovec (2017) Zitnik, M. and Leskovec, J. Predicting multicellular function through multilayer tissue networks. Bioinformatics, 33(14):i190–i198, 2017.