1 Introduction
Complex systems are ubiquitous in natural and scientific disciplines, and how relationships between parts give rise to the global behaviour of a system is a central theme in many areas of study, such as systems biology (biology), neuroscience (brain), and drug and material discovery (drug; material).
Graph neural networks (GNNs) are a promising architecture for representation learning on graphs, the structural abstraction of complex systems. State-of-the-art performance has been observed in various graph mining tasks (GCN2; GCN5; graphSAGE; GNNpower; gat; WLneural; GNNreview; GNNreview2; GNNreview3). However, due to the non-Euclidean nature of graphs, challenges still exist in graph classification. For example, to generate a fixed-dimensional graph-level representation, a GNN combines information from each node through graph pooling. In combined form, a graph collapses into a "supernode", where the identities of the constituent subgraphs and their interconnections are mixed together. Is this the best way to generate graph-level features? From the complex-systems view, mixing all parts of a system can hurt interpretability and model prediction, because the properties of a complex system arise largely from the interactions among its components (molecular; book_complex; book_complex2).
The choice of "collapsing"-style graph pooling is deeply rooted in the lack of natural alignment among graphs that are not isomorphic; the pooling therefore sacrifices structural details for feature compatibility. In recent years, substructure patterns have drawn considerable attention in graph mining, such as motifs (motif1; motif2; motif3; motif4) and graphlets (fastgkernel). They provide an intermediate scale for structure comparison or counting, and have been considered in node embedding (motif_embed), deep graph kernels (Deepgkernel), and graph convolution (GNNmotif1). However, due to their combinatorial nature, only substructures of very small sizes (4 or 5 nodes) can be considered (Deepgkernel; motif3), greatly limiting the coverage of structural variations; also, handling substructures as discrete objects makes it difficult to compensate for their similarities, at least computationally, so the risk of overfitting may rise in supervised learning scenarios.
These intrinsic difficulties are related to the concept of resolution in graph-structured data processing. Resolution is the scale at which measurements are made and/or information-processing algorithms operate. Here, we first define two relevant terms, the spatial resolution and the structural resolution, and discuss how they may affect the performance of graph classification.
First, spatial resolution relates to the geometric scale of the "elementary component" of a graph on which an algorithm operates. It can range from nodes, to subgraphs, to the entire graph. Graph details beyond the effective spatial resolution are algorithmically unidentifiable. For example, graph pooling compresses the whole graph into a single vector, so the spatial resolution drops to the lowest level: node and edge identities are mixed together, and the subsequent classification layer can no longer exploit any substructure or their connections, but only a global aggregation. We call this vanishing spatial resolution. Insufficient spatial resolution may affect interpretability, and also predictive power, since the global property of a complex system arises largely from its inherent interactions (molecular; book_complex; book_complex2).
Second, structural resolution is the fineness level in differentiating between substructures. Substructures (or subgraphs) shed light on functional organization and graph alignment. However, they are typically treated in a discrete and over-delicate manner: in exact matching, two substructures are considered distinct even if they share significant similarity. We call this exploding structural resolution. It can lead to a risk of overfitting, similar to that observed in deep graph kernels (Deepgkernel) and dictionary learning (adpt_size).
We believe that both resolution dilemmas originate from the way we perform profiling, identification, and alignment of substructures. Substructures are the building blocks of a graph; relations like interaction or alignment are all defined between substructures (of varying scales). However, exact substructure matching is too costly and prone to overfitting, leading to exploding structural resolution; meanwhile, graph alignment becomes infeasible when substructure matching is poorly defined, so collapsing-style graph pooling becomes the norm, which finally leads to vanishing spatial resolution.
Our contribution. In this paper, we propose a simple neural architecture called "Structural Landmarking and Interaction Modelling", or SLIM, for inductive graph classification. The key idea is to embed substructure instances into a continuous metric space and learn structural landmarks there for explicit interaction modelling. The SLIM network can effectively resolve the resolution dilemmas. More importantly, by fully exploring the diverse structural distribution of the input graphs, any substructure instance, even an unseen example, can be mapped parametrically to a common and optimizable structural landmark set. This enables a novel, identity-preserving graph pooling paradigm, in which the interacting relations between the constituent parts of a graph can be modelled explicitly, shedding important light on the functional organization of complex systems.
The design philosophy of SLIM comes from a long-standing view of complex systems: complexity arises from interaction. Therefore, explicit modelling of the parts and their interactions is key to explaining the complexity and improving the prediction. In contrast, graph neural networks are more about "integration": delicate part-modelling like convolution does exist, but it is ultimately obscured in the pooling process. It turns out that, by respecting the structural organization of complex systems, SLIM is more interpretable and accurate, and provides new insights into graph representation learning.
2 Resolution Dilemmas in Graph Classification
A complex system is composed of many parts that interact with each other in a non-simple way. Since graphs are the structural abstraction of complex systems, accurate graph classification depends on how the global properties of a system relate to its structure. It is believed that the properties (and complexity) of a complex system arise from the interactions among its components (book_complex; book_complex2). Accurate interaction modelling should therefore benefit prediction. However, this is non-trivial due to the resolution dilemmas.
2.1 Spatial Resolution Diminishes in Graph Pooling
Graph neural networks (GNNs) for graph classification typically have two stages: graph convolution and graph pooling (graphSAGE; GNNpower). The spatial resolutions of these two stages are significantly different.
The goal of convolution is to pass messages among neighboring nodes, in the general form of $h_v^{(l)} = \mathrm{AGGREGATE}\big(\{h_u^{(l-1)} : u \in \mathcal{N}(v)\}\big)$, where $\mathcal{N}(v)$ is the set of neighbors of node $v$ (graphSAGE; GNNpower). Here, the spatial resolution is controlled by the number of convolution layers: more layers capture larger substructures/subtrees and can lead to improved discriminative power (GNNpower). In other words, a medium resolution (substructure level) can provide more informative functional markers than a high resolution (node level). In practice, multiple resolutions can be combined via a CONCATENATE function (graphSAGE; GNNpower) for subsequent processing.
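For concreteness, the general aggregation form above can be sketched as follows — a minimal numpy illustration using mean aggregation with self-loops, one of several AGGREGATE choices; the function names and dense-matrix setting are ours, not from any specific GNN library:

```python
import numpy as np

def mean_aggregate(H, A):
    """One round of mean-aggregation message passing.

    H: (n, d) node features; A: (n, n) binary adjacency (no self-loops).
    Each node's new feature is the mean over itself and its neighbors.
    """
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    deg = A_hat.sum(axis=1, keepdims=True)  # degrees (with self-loop)
    return (A_hat @ H) / deg                # row-normalized aggregation

def multi_layer(H, A, layers):
    """Stack l rounds: each node then 'sees' its l-hop neighborhood,
    i.e. more layers -> larger effective spatial resolution.
    Outputs of all layers are CONCATENATEd across resolutions."""
    outs = []
    for _ in range(layers):
        H = mean_aggregate(H, A)
        outs.append(H)
    return np.concatenate(outs, axis=1)
```

Each additional call to `mean_aggregate` widens the receptive field by one hop, which is exactly the "spatial resolution controlled by the number of layers" described above.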
The goal of graph pooling is to generate compact, graph-level representations that are compatible across graphs. Due to the lack of natural alignment between graphs that are not isomorphic, graph pooling typically "squeezes" a graph into a single vector (or "supernode") in the form of $h_G = \mathrm{READOUT}\big(\{h_v : v \in V\}\big)$, where $h_v$ is the representation of node $v$. Different readout functions have been proposed, including max-pooling (max_pooling), sum-pooling (GNNpower), various pooling functions (MEAN, LSTM, etc.) (graphSAGE), and deep sets (deep_set); attention has been used to evaluate node importance in attention pooling (att_pool) and gPool (unet); besides, hierarchical differentiable pooling has also been investigated (dif_pool). An important resolution bottleneck occurs in graph pooling, as shown in Figure 1. Since all the nodes are mixed into one, the subsequent classifier can no longer identify any individual substructure or their interactions, regardless of the resolution used in graph convolution. We call this "diminishing spatial resolution"¹, which can be undesirable in that: (1) how much information from the well-designed convolution stage can penetrate through the pooling layer for final prediction is hard to analyze or control; (2) in molecule classification, graph labels hinge on functional modules and how they are organized (drug); an overly coarse spatial resolution will mix up functional modules and conceal their interactions. Can meaningful spatial resolution(s) survive graph pooling? The answer is yes. It involves substructure alignment and the notion of structural resolution; see the discussion below.

¹Some works adopt different aggregation strategies: SortPooling arranges nodes in a linear chain and performs 1-d convolution (DGCNN); SEED uses distributions of multiple random walks (SEED); deep graph kernels evaluate graph similarity by subgraph counts (Deepgkernel). Explicit modelling of the interactions between graph parts is not considered in these works.
2.2 Structural Resolution Explodes in Substructure Identification
Substructures are the basic units that accommodate interacting relations. A global criterion to identify and align substructures is the key to preserving substructure identities and comparing the inherent interactions across graphs. Again, the fineness level in determining whether two substructures are "similar" or "different" is subject to a wide spectrum of choices, which we call the "structural resolution".
We illustrate this in Figure 2. The right end denotes the finest resolution in differentiating between substructures: exact matching, as practiced in motif/graphlet mining (motif1; motif2; motif3; GNNmotif1; fastgkernel). The exponential number of subgraph configurations will eventually lead to an "exploding" structural resolution, because maintaining a large number of unique substructures is infeasible and easily overfits. The left end of the spectrum treats all substructures the same and underfits the data. We are interested in a medium structural resolution, where similar substructures are mapped to the same identity, which we believe can benefit generalization performance (see Figure 4 for empirical evidence).
Theoretically, an over-delicate structural resolution corresponds to a highly "coherent" basis for representing a graph, leading to unidentifiable dictionary learning (ERC; supervised_dic). Structural landmarking is aimed exactly at controlling the structural resolution and improving incoherence for graph classification.
3 Structural Landmarking and Interaction Modelling (SLIM)
Considering the difficulty of manipulating substructures as discrete objects, we embed them in a continuous space and transform all structure-related operations from discrete, off-the-shelf versions to continuous, optimizable counterparts. The key idea of SLIM is the identification of structural landmarks in this new space, via both unsupervised compression and supervised fine-tuning, through the distribution of embedded substructures at possibly multiple scales. Structural landmarking resolves the resolution dilemmas and allows explicit interaction modelling in graph classification.
Problem Setting. Given a set of labeled graphs $\{(G_i, y_i)\}$ for $i = 1, 2, \ldots, n$, each graph $G_i$ is defined on the node/edge sets $(V_i, E_i)$ with adjacency matrix $A_i \in \{0,1\}^{n_i \times n_i}$, where $n_i = |V_i|$. Assume that nodes are drawn from $c$ categories, and the node attribute matrix for $G_i$ is $X_i \in \{0,1\}^{n_i \times c}$ (one-hot node-type encoding). Our goal is to train an inductive model to predict the labels of the testing graphs.
The SLIM network has three main steps: (1) substructure embedding, (2) substructure landmarking, and (3) identity-preserving graph pooling, as shown in Figure 3. Detailed discussion follows.
3.1 Substructure Embedding
The goal of substructure embedding is to extract substructure instances and embed them in a metric space. One can employ multiple layers of convolutions (graphSAGE; GNNpower) to model substructures (rooted subtrees), or randomly sample subgraphs (fastgkernel). For convenience, we simply extract one subgraph instance from each node using a $k$-hop breadth-first search (BFS), where $k$ controls the spatial resolution². In Figure 3, the subgraph in the shaded circle around each atom is a substructure instance.

²When $k$ is large, one subgraph around each node may be unnecessary. See the discussion in the Appendix (Sec 7.4).
Let $A_i^{(k)}$ be the $k$th-order adjacency matrix, i.e., its $(u,v)$th entry equals 1 only if nodes $u$ and $v$ are within $k$ hops of each other. Since each subgraph is associated with one node, the subgraphs extracted from $G_i$ can be represented as $H_i = A_i^{(k)} X_i$, whose $v$th row is a $c$-dimensional vector summarizing the counts of the node types in the subgraph around the $v$th node. Variations include (1) emphasizing the center node, $H_i = (A_i^{(k)} + \lambda I)\, X_i$; (2) the layer-wise node distribution $H_i = [B_i^{(0)} X_i,\, B_i^{(1)} X_i,\, \ldots,\, B_i^{(k)} X_i]$, where $B_i^{(l)}$ specifies whether two nodes in $G_i$ are exactly $l$ hops apart; or (3) the weighted layer-wise summation $H_i = \sum_{l=0}^{k} \omega_l B_i^{(l)} X_i$, where the $\omega_l$'s are non-negative weights that decay with $l$.
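As a concrete sketch of this featurization — dense numpy matrices and the basic variant (a $k$th-order adjacency times a one-hot node-type matrix); the function names are illustrative, not from the paper:

```python
import numpy as np

def khop_adjacency(A, k):
    """Binary matrix whose (u, v) entry is 1 iff u and v are within k hops."""
    n = A.shape[0]
    reach = np.eye(n)
    hop = (A > 0).astype(float) + np.eye(n)   # one hop, including staying put
    for _ in range(k):
        reach = ((reach @ hop) > 0).astype(float)  # grow reachable set by 1 hop
    return reach

def substructure_features(A, X, k):
    """Row v counts the node types inside the k-hop subgraph rooted at v.

    A: (n, n) adjacency; X: (n, c) one-hot node-type matrix.
    """
    return khop_adjacency(A, k) @ X
```

Each row of the result is one substructure instance, ready to be embedded in the latent space described next.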
Next, we consider embedding the substructure instances (i.e., the rows of the $H_i$'s) into a latent space, so that statistical manipulations can better align with the prediction task. The embedding should preserve important proximity relations to facilitate subsequent landmarking: if two substructures are similar, or they often interconnect with each other, their embeddings should be close. In other words, the embedding should be smooth with regard to both structural similarities and geometric interactions.
A parametric transform on the $H_i$'s with controlled complexity, e.g., an autoencoder, can guarantee the smoothness of the embedding w.r.t. structural similarity. Let $Z_i$ be the embedding of the subgraph instances extracted from $G_i$. To maintain the smoothness of the $Z_i$'s w.r.t. geometric interaction, we maximize the log-likelihood of the co-occurrence of substructure instances in each graph, similar to word2vec (wordvec):

$$-\sum_i \sum_{v \in V_i} \sum_{u \in \mathcal{N}_i(v)} \log \frac{\exp\big(\langle z_{i,v},\, z_{i,u}\rangle\big)}{\sum_{w \in V_i} \exp\big(\langle z_{i,v},\, z_{i,w}\rangle\big)} \qquad (1)$$

Here $z_{i,v}$ is the $v$th row of $Z_i$, $\langle \cdot, \cdot \rangle$ is the inner product, and $\mathcal{N}_i(v)$ is the set of neighbors of node $v$ in graph $G_i$. This loss function tends to embed strongly interconnected substructures close to each other.
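The co-occurrence objective in (1) can be sketched for a single graph as follows — a full softmax is used for clarity, whereas a practical implementation would use negative sampling as in word2vec; the function name is ours:

```python
import numpy as np

def cooccurrence_loss(Z, A):
    """Negative log-likelihood of observing each node's neighbors, with a
    softmax over inner products of substructure embeddings (eq. (1), one graph).

    Z: (n, d) substructure embeddings; A: (n, n) adjacency of the graph.
    """
    scores = Z @ Z.T                                   # pairwise inner products
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_p = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    mask = (A > 0)                                     # one term per edge (v, u)
    return -log_p[mask].sum()
```

Minimizing this quantity pushes the embeddings of interconnected substructures toward each other, which is the smoothness-w.r.t.-interaction property described above.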
3.2 Substructure Landmarking
The goal of structural landmarking is to identify a set of informative structural landmarks in the continuous embedding space which have: (1) high statistical coverage, namely, the landmarks should faithfully recover the distribution of the substructures from the input graphs, so that we can generalize to new substructure examples from that distribution; and (2) high discriminative power, namely, the landmarks should reflect discriminative interaction patterns for classification.
Let $U = [u_1, u_2, \ldots, u_K]$ be the structural landmarks. In order for them to be representative of the substructure distribution, it is desirable that each subgraph instance be faithfully approximated by its closest landmark. We will minimize the following distortion loss

$$\sum_i \sum_{v \in V_i} \min_{1 \le j \le K} \big\| z_{i,v} - u_j \big\|^2 \qquad (2)$$

Here $z_{i,v}$ denotes the $v$th row (substructure) from graph $G_i$. In practice, we implement a soft assignment by using one cluster-indicator matrix $Q_i$ for each graph $G_i$, whose $(v,j)$th entry is the probability that the $v$th substructure of $G_i$ belongs to the $j$th landmark $u_j$. Inspired by deep embedding clustering (DEC), $Q_i$ is parameterized by a Student's t-distribution,

$$[Q_i]_{vj} = \frac{\big(1 + \|z_{i,v} - u_j\|^2\big)^{-1}}{\sum_{j'} \big(1 + \|z_{i,v} - u_{j'}\|^2\big)^{-1}},$$

and the loss function can be greatly simplified by minimizing the KL-divergence

$$\sum_i \mathrm{KL}\big(P_i \,\|\, Q_i\big) \qquad (3)$$

Here, $P_i$ is a self-sharpening version of $Q_i$, and minimizing the KL-divergence forces each substructure instance to be assigned to only a small number of landmarks, similar to sparse dictionary learning. Besides the unsupervised regularization in (2) or (3), the learning of the structural landmarks is also driven by the classification loss, guaranteeing the discriminative power of the landmarks.
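A minimal numpy sketch of the soft assignment and the self-sharpening KL objective in (3), assuming the standard DEC parameterization (squared soft counts normalized by cluster frequency); the function names are ours:

```python
import numpy as np

def soft_assign(Z, U):
    """Student's-t soft assignment Q (DEC-style, one degree of freedom).

    Z: (n, d) substructure embeddings; U: (K, d) landmarks.
    Returns Q of shape (n, K), rows summing to 1.
    """
    d2 = ((Z[:, None, :] - U[None, :, :]) ** 2).sum(-1)  # squared distances
    q = 1.0 / (1.0 + d2)
    return q / q.sum(axis=1, keepdims=True)

def sharpen(Q):
    """Self-sharpening target P: square Q and renormalize (as in DEC)."""
    w = Q ** 2 / Q.sum(axis=0, keepdims=True)   # down-weight frequent clusters
    return w / w.sum(axis=1, keepdims=True)

def kl_loss(P, Q, eps=1e-12):
    """KL(P || Q), summed over all substructure instances."""
    return (P * np.log((P + eps) / (Q + eps))).sum()
```

During training, P is typically treated as a fixed target that is periodically recomputed, so that gradients flow only through Q (and hence through the embeddings and landmarks).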
3.3 IdentityPreserving Graph Pooling
The goal of identity-preserving graph pooling is to project the structural details of each graph onto the common space of landmarks, so that a compatible, graph-level feature can be obtained that simultaneously preserves the identities of the parts (substructures) and models their interactions.
The structural landmarking mechanism allows computing rich graph-level features. First, we can model substructure distributions. The density of the substructure landmarks in graph $G_i$ can be computed as $p_i = \frac{1}{n_i} Q_i^\top \mathbf{1}$. Furthermore, the first-order moments of the substructures belonging to each of the $K$ landmarks in $G_i$ are given by $M_i = Z_i^\top Q_i \,\mathrm{diag}(Q_i^\top \mathbf{1})^{-1}$, where the $j$th column of $M_i$ is the mean of $G_i$'s substructure instances belonging to the $j$th landmark. Second, we can model how the landmarks interact with each other in graph $G_i$. To do this, we project the adjacency matrix $A_i$ onto the landmark set and obtain a $K \times K$ interaction matrix $R_i = Q_i^\top A_i Q_i$, which encodes the interacting relations (geometric connections) among the structural landmarks. These features can be combined for final classification. For example, they can be reshaped and concatenated to feed into the fully-connected layer. One can also resort to more intuitive constructions; for example, using the first-order and second-order features together, one can transform each graph into a constant-sized "landmark" graph with node features $M_i$, node weights $p_i$, and edge weights $R_i$. Then standard graph convolution can be applied on the landmark graphs to generate graph-level features (without the pains of graph alignment anymore). In the experiments, for simplicity, we compute the normalized interaction matrix and use it as the feature, which works well on all the benchmark datasets. More detailed discussion can be found in the Appendix.
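The three pooled features can be sketched for one graph as follows — dense numpy, illustrative names; unlike the column convention above, the per-landmark means are returned with landmarks as rows, which is noted in the comments:

```python
import numpy as np

def identity_preserving_pool(Z, A, Q):
    """Graph-level features that keep landmark identities separate.

    Z: (n, d) substructure embeddings; A: (n, n) adjacency;
    Q: (n, K) soft assignments (rows sum to 1).
    Returns (p, M, R): landmark density, per-landmark mean embedding
    (landmarks as rows), and the normalized K x K interaction matrix.
    """
    n = Z.shape[0]
    counts = Q.sum(axis=0)                       # soft count per landmark
    p = counts / n                               # landmark density in this graph
    M = (Q.T @ Z) / (counts[:, None] + 1e-12)    # first-order moments, (K, d)
    R = Q.T @ A @ Q                              # project edges onto landmarks
    R = R / (np.abs(R).sum() + 1e-12)            # normalized interaction matrix
    return p, M, R
```

Because p, M, and R all live on the fixed landmark set, they are directly comparable across graphs of different sizes — no graph alignment is needed.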
4 Theoretic Analysis and Discussions
We provide learning-theoretic support for the choice of structural resolution (landmark set size $K$). Graphs are bags of interconnected substructure instances, and each instance can be represented by the landmarks. Too small a number of landmarks fails to recover basic data structures, whereas too many landmarks result in overfitting (e.g., in exact substructure matching, where a maximal $K$ is used for reconstruction) (adpt_size). In dictionary learning, the mutual coherence is a crucial index for evaluating the redundancy of the code-vectors, defined as

$$\mu = \max_{j \ne j'} \big|\rho(u_j, u_{j'})\big| \qquad (4)$$

where $\rho(\cdot,\cdot)$ denotes the normalized correlation. A lower coherence permits better support recovery (ERC), while a large coherence leads to worse stability in both sparse coding and classification (supervised_dic). In particular, a faithful recovery of a $k_0$-sparse signal support is guaranteed only when

$$\mu < \frac{1}{2k_0 - 1} \qquad (5)$$

Obviously, a large $\mu$ leads to unstable solutions. In the following, we quantify a lower bound on the coherence as a function of the landmark size $K$ in clustering-based basis selection, since sparse coding and the $k$-means algorithm generate very similar code-vectors (clusteringdic).
Theorem 1.

In clustering-based sparse dictionary learning, the lower bound of the squared mutual coherence of the landmark vectors increases monotonically with $K$, the number of landmarks:

$$\mu^2 \;\ge\; 1 - \frac{4\, c_d\, \nu(p)}{r^2}\, K^{-2/d}.$$

Here, $d$ is the dimension; $c_d$ is a dimension-dependent factor involving $V_d$, the volume of the $d$-dimensional unit ball; $r$ is the maximum norm of (a subset of) the landmark vectors $u_j$; and $\nu(p)$ is a factor depending on the data distribution $p$.
The proof is in the Appendix (Sec 7.1). Theorem 1 says that as the landmark set size increases, the mutual coherence has a lower bound that consistently increases and eventually violates the recovery condition (5). In fact, a very high structural resolution (like exact matching) leaves a heavy burden to subsequent classifiers by failing to compensate for structural similarities. This justifies the SLIM network, where the landmark set size can be controlled conveniently to avoid unstable dictionary learning.
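The qualitative claim of Theorem 1 can be checked empirically with a small self-contained sketch: naive Lloyd's k-means stands in for clustering-based dictionary learning, and we track the maximum signed normalized correlation between landmarks (the quantity lower-bounded in the proof; the absolute-value version in (4) behaves the same way for dense dictionaries). All names here are illustrative:

```python
import numpy as np

def max_correlation(U):
    """Max signed normalized correlation between distinct landmark vectors."""
    V = U / np.linalg.norm(U, axis=1, keepdims=True)
    G = V @ V.T                      # pairwise cosines
    np.fill_diagonal(G, -np.inf)     # exclude self-correlation
    return G.max()

def kmeans(Z, K, iters=50, seed=0):
    """Naive Lloyd's algorithm (stand-in for a clustering-based dictionary)."""
    rng = np.random.default_rng(seed)
    U = Z[rng.choice(len(Z), K, replace=False)]
    for _ in range(iters):
        labels = ((Z[:, None] - U[None]) ** 2).sum(-1).argmin(1)
        for j in range(K):
            if (labels == j).any():          # keep old center if cluster empties
                U[j] = Z[labels == j].mean(0)
    return U
```

On data spread around the origin, a small codebook keeps its centers well separated in angle, while a large codebook inevitably contains near-parallel pairs, i.e., higher coherence — exactly the trend the theorem predicts.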
Discussions. GNNs have shown great potential in graph isomorphism testing by generating injective graph embeddings, thanks to recent theoretical foundations (GNNpower; WLneural). However, accurate graph classification needs more thought: classification is not injective, and the quality of the features is also of notable importance. SLIM provides new insight in both respects: (1) it finds a trade-off in the duality of handling similarity and distinctness; (2) it explores new ways of generating graph-level features: instead of aggregating all parts together as in GNNs, it taps into the vision of complex systems, so that the interactions between the parts are leveraged to explain the complexity and improve the learning. More discussions are in the Appendix (Sec 7.2-7.4), including the choice of spatial/structural resolutions, interpretability, hierarchical and semi-supervised versions, and a comparison with graph kernels (graph_kernels).
5 Experiments
Benchmark data. We use a number of popular benchmark datasets for graph classification. (1) MUTAG: a chemical compound dataset with 188 instances and two classes; there are 7 node/atom types and 3 edge/bond types (bond types are ignored). (2) PROTEINS: a protein molecule dataset with 1113 instances and two classes; there are 3 node types (secondary structure elements). (3) NCI1: a chemical compound dataset for cancer cell lines with 4110 instances and two classes. (4) PTC: a chemical compound dataset for toxicology prediction with 417 instances and 8 classes. (5) D&D: a dataset for enzyme classification with 1178 instances and two classes.
Competing methods. We compare against a number of highly competitive methods proposed in recent years: (1) Graph Neural Tangent Kernel (GNTK) (GNTK); (2) Graph Isomorphism Network (GIN) (GNNpower); (3) end-to-end graph classification (DGCNN) (DGCNN); (4) hierarchical and differentiable pooling (DiffPool) (dif_pool); (5) Self-Attention Graph Pooling (SAG) (att_pool); (6) convolutional networks for graphs (PATCHY-SAN) (10); (7) graphlet kernel (GK) (GK); (8) Weisfeiler-Lehman graph kernel (WLGK) (WL_kernel); (9) propagation kernel (PK) (PK). For methods (4), (6), (7), (8), and (9), we directly cite their reported results (averaged 10-fold cross-validated accuracy) due to the unavailability of their code; for the other competing methods, we run their code with default settings and report the performance.
Experimental setting. We follow the experimental setting in (GNNpower) and (10) and perform 10-fold cross-validation; we report the average and standard deviation of validation accuracies across the 10 folds. In the SLIM network, the spatial resolution is controlled by a BFS with 3-hop neighbors, and the structural resolution (landmark set size $K$) is fixed; the FC layer has one hidden layer with dimension 64; cross-entropy is used for classification; the weights for the loss terms (1) and (3) are set to 0.01. No dropout or batch normalization is used, considering the size of the benchmark data. The hyper-parameters tuned per dataset include (1) the dimension of the autoencoder's single hidden layer; (2) the optimizer, chosen between SGD and Adagrad, together with the learning rate; (3) the local graph representation, including the node distribution, layer-wise distribution, and weighted layer-wise summation (see Sec 3.1 for details); (4) the number of epochs, i.e., the single epoch with the best cross-validated accuracy averaged over all 10 folds is selected. Overall, a minimal SLIM network is used in the experiments in order to test its performance.
Structural Resolution. In Figure 4, we examine the performance of SLIM under different choices of the structural resolution (landmark set size $K$). As can be seen, the accuracy-vs-$K$ curve has a bell-shaped structure. When $K$ is either too small (underfitting) or too large (coherent landmarks that overfit), the accuracy is low, and the best performance is typically attained around a median value. This validates Theorem 1 and the usefulness of structural landmarking in improving graph classification.
Classification Performance. We then compare the performance of the different methods in Table 1. Overall, neural-network-based approaches are more competitive than graph kernels, except that graph kernels have lower fluctuations, and the WL graph kernel performs best on the NCI1 dataset. On most benchmark datasets, the SLIM network produces classification accuracies that are either higher than or at least as good as those of the other GNN/graph-pooling schemes.
Category       Algorithm    MUTAG        PTC          NCI1         PROTEINS     D&D
Graph kernel   GK           81.38±1.74   55.65±0.46   62.49±0.27   71.39±0.31   74.38±0.69
               PK           76.00±2.69   59.50±2.44   82.54±0.47   73.68±0.68   78.25±0.51
               WLGK         84.11±1.91   57.97±2.49   84.46±0.45   74.68±0.49   78.34±0.62
GNN            PATCHY-SAN   92.63±4.21   60.00±4.82   78.59±1.89   75.89±2.76   77.12±2.41
               DGCNN        85.83±1.66   68.59±6.47   74.46±0.47   75.54±0.94   79.37±1.03
               DiffPool     90.52±3.98   –            76.53±2.23   75.82±3.56   78.95±2.40
               GNTK         90.12±8.58   67.92±6.98   75.20±1.53   75.61±4.24   79.42±2.18
               SAG          73.53±9.68   75.67±3.12   74.18±1.29   71.86±0.97   76.91±2.12
               GIN          90.03±8.82   76.25±2.83   79.84±4.57   71.28±2.65   77.58±2.94
               SLIM         93.28±3.36   80.41±6.92   80.53±2.01   77.47±4.34   79.48±2.66
Accuracy Evolution. We also plot the evolution of the testing accuracy of the different methods on the benchmark datasets, for a more comprehensive evaluation of their performance. As can be seen in Figure 5, our approach not only produces accurate classification on the benchmark datasets; its accuracy also converges relatively faster and remains more stable across training epochs, making it easier to decide when to stop training. Other GNN algorithms can also attain high accuracy on some of the benchmark datasets, but their prediction performance fluctuates significantly across training epochs (even with large mini-batch sizes). We speculate that the stability of the SLIM network arises from the explicit modelling of the substructure distributions. It is also worth noting that on the MUTAG data, the proposed method produces 100% classification accuracy on more than half of the runs across different folds (Figure 5(b)). This demonstrates the power of the SLIM network in capturing important graph-level features.
6 Conclusion
Graph neural networks represent the state-of-the-art computational architecture for graph mining. In this paper, we designed the SLIM network, which employs structural landmarking to resolve the resolution dilemmas in graph classification and to capture the inherent interactions in graph-structured systems. We hope this attempt can open up possibilities in designing GNNs with informative structural priors.
References
7 Appendix
7.1 Proof of Theorem 1
Proof.
Suppose we have $N$ spatial instances embedded in the $d$-dimensional latent space as $z_1, z_2, \ldots, z_N$, and the landmarks (or code-vectors) are $u_1, u_2, \ldots, u_K$. Let $p(z)$ be the density function of the instances. Define the averaged distance between each instance and its closest landmark as

$$\mathcal{E}(K) = \mathbb{E}_{z \sim p}\Big[\big\|z - u_{j(z)}\big\|^2\Big], \qquad (6)$$

where $j(z)$ is the index of the landmark closest to instance $z$. As expected, $\mathcal{E}(K)$ decays with the number of landmarks $K$ at the following rate (distortion)

$$\mathcal{E}(K) \approx c_d\, \nu(p)\, K^{-2/d}, \qquad (7)$$

where $c_d$ is a dimension-dependent factor involving $V_d$, the volume of the unit ball in $d$-dimensional Euclidean space, and $\nu(p)$ is a factor depending on the distribution $p$.

Since $\mathcal{E}(K)$ is the average distortion error, we can make sure that there exists a non-empty subset of instances $\mathcal{S}$ such that $\|z - u_{j(z)}\|^2 \le \mathcal{E}(K)$ for $z \in \mathcal{S}$. In the following we only consider this subset of instances, and the relevant set of landmarks will be denoted by $\mathcal{U}'$. For the landmarks in $\mathcal{U}'$, we make the realistic assumption that there are enough instances so that we can always find one instance falling in the middle of a landmark $u$ and its closest landmark neighbor $u'$. In this case, we can bound the distance between the closest landmark pairs as

$$\|u - u'\| \le 2\sqrt{\mathcal{E}(K)}.$$

For any such pair, assume that the angle spanned by the two landmark vectors is $\theta$, and let $r$ be the (maximum) norm of the landmark vectors in $\mathcal{U}'$. We can then bound the angle between the two landmark vectors by

$$\sin\frac{\theta}{2} \le \frac{\sqrt{\mathcal{E}(K)}}{r}. \qquad (8)$$

Since $\cos\theta = 1 - 2\sin^2(\theta/2)$, we can finally lower-bound the normalized correlation between close landmark pairs, and henceforth the coherence of the landmarks, as

$$\mu^2 \;\ge\; \cos^2\theta \;\ge\; 1 - \frac{4\,\mathcal{E}(K)}{r^2} \;=\; 1 - \frac{4\, c_d\, \nu(p)}{r^2}\, K^{-2/d}.$$

This indicates that the squared mutual coherence of the landmarks has a lower bound that consistently increases as the number of landmark vectors $K$ increases in a dictionary-learning process. ∎
This theorem provides important guidance for the choice of structural resolution. It shows that when a clustering-based dictionary-learning scheme is used to determine the structural landmarks, the size of the dictionary cannot be chosen too large; otherwise, the risk of overfitting can be high. Note that exact substructure matching, as often practiced in current graph mining tasks, corresponds to an extreme case where the number of landmarks $K$ equals the number of unique substructures; it should therefore be avoided in practice. The structural landmarking scheme is a flexible framework for tuning the number of landmarks to avoid overfitting.
7.2 Choice of Spatial and Structural Resolutions
The spatial resolution determines the "size" of the local substructures (or subgraphs), such as the functional modules in a molecule. Small substructures can be very limited in their representation power, while overly large substructures can mask the right scale of the local components crucial to the learning task. The optimal spatial resolution can be data-dependent. In practice, we restrict the size of the local subgraphs to 3-hop BFS neighbors, considering that the "radius" of the graphs in the benchmark datasets is usually around 5-8. We then further fine-tune the spatial resolution by assigning a non-negative weight to the nodes residing at different layers from the central node of the local subgraph. This weighting is shared across all the subgraphs and can be used to adjust the importance of each layer of the BFS-based subgraph. The weighting can be chosen as a monotonically decaying function, or optimized through learning.
The choice of structural resolution has a similar flavor, in that neither too small nor too large a resolution is desirable. On the other hand, it can be adjusted conveniently by tuning the landmark set size $K$ based on validation data. In our experiments, $K$ can be chosen by cross-validation; for simplicity, we fix it.
Finally, note that geometrically larger substructures (or subgraphs) are characterized by higher variations among instances due to the exponential number of configurations. Therefore, the structural resolution should also be commensurate with the spatial resolution. For example, substructures constructed by 1-hop BFS may use a smaller landmark size than those with 3-hop BFS. In our experiments we do not yet consider such dependencies, but will study them in future research.
7.3 Comparison with Graph Kernels
Graph kernels are powerful methods to measure the similarity between graphs. The key idea is to compare the substructure pairs from the two graphs and compute the accumulated similarity, where examples of substructures include random walks, paths, subgraphs, or subtrees. Among them, paths/subgraphs/subtrees are deterministic substructures in a graph, while random walks are stochastic sequences (of nodes) in a graph.
Although the SLIM network considers substructures as the basic processing unit, it has a number of important differences from graph kernels. First, we consider optimizable substructural landmarks, which depend on the class labels and are therefore discriminative; in comparison, the substructures considered in graph kernels are identified by enumerating or sampling among a large number of predetermined candidates. Second, the similarity measured by graph kernels is between a pair of substructures across two graphs; in comparison, the SLIM network models the interacting relations within each graph as its features. Third, graph kernels can be difficult to interpret due to the non-linearity of kernel methods and the exponential number of substructures; in comparison, the SLIM network maintains a reasonable number of "landmark" structures and so can provide informative clues about the prediction result.
7.4 Hierarchical Version
7.4.1 Subtlety in Spatial Resolution Definition
First, we would like to clarify a subtlety in the definition of spatial resolution. In physics, resolution is defined as the smallest distance (or interval) between two objects that can be separated; it therefore involves two scales: the scale of the object and the scale of the interval. Usually these two scales are proportional. In other words, one cannot have large intervals and small objects, or the opposite (small intervals and large objects). For example, in the context of imaging, each object is a pixel, and the size of a pixel is the same as the interval between two adjacent pixels.
In the context of graphs, each object is a subgraph centered around one node, whose scale is manually determined by the order of the BFS search centered around that node. Therefore, the interval between two subgraphs may be smaller than the size of the subgraphs. For example, suppose two nodes u and v are direct neighbors, and each of them has a 3-hop subgraph. Then the interval between these two subgraphs, if defined by the distance between u and v, is 1 hop; this is smaller than the size of the two subgraphs, which is 3 hops. In other words, the two objects/subgraphs overlap with each other, and the scale of the object and the scale of the interval between objects are no longer commensurate (large objects and a small interval in this scenario).
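The overlap described above is easy to verify with a BFS-limited subgraph extraction. The sketch below (our own toy example, not code from the paper) grows 3-hop subgraphs around two adjacent centers of a 7-node path graph and shows how heavily they overlap.

```python
from collections import deque

def k_hop_nodes(adj, center, k):
    """Nodes reachable within k hops of `center` (BFS), i.e. the node set
    of the k-hop subgraph around that center."""
    dist = {center: 0}
    q = deque([center])
    while q:
        u = q.popleft()
        if dist[u] == k:        # stop expanding at the k-hop frontier
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return set(dist)

# a 7-node path graph: 0-1-2-3-4-5-6
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 6] for i in range(7)}

# centers 3 and 4 are direct neighbors (interval = 1 hop), yet their
# 3-hop subgraphs share six of seven nodes:
s3, s4 = k_hop_nodes(adj, 3, 3), k_hop_nodes(adj, 4, 3)
print(sorted(s3 & s4))   # → [1, 2, 3, 4, 5, 6]
```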
This scenario makes it less complete to define spatial resolutions based solely on the size of the subgraphs (as in the main text), since there are actually two scales to define. To avoid unnecessary confusion, we skip these details. In practice, there are two choices for dealing with the discrepancy: (1) requiring that the subgraphs do not overlap, i.e., we do not grow a subgraph around every node but instead explore only a subset of the subgraphs; this can be implemented in a hierarchical version, which we discuss in the next subsection; (2) still allowing each node to have a local subgraph and studying them together, which helps cover the diversity of subgraphs, since, theoretically, the ideal choice of subgraph is highly domain-specific, and having more subgraph examples gives a better chance of including those subgraphs that are beneficial to the prediction task.
7.4.2 Hierarchical SLIM
We can implement a hierarchical version of SLIM so that subgraphs of different scales, together with the interacting relations between subgraphs at each scale, are captured for the final prediction. Note that in dif_pool, a hierarchical clustering scheme is used to partition a graph, in a bottom-up manner, into fewer and fewer clusters. We can adopt the same idea and construct a hierarchy of scales, each of which hosts a number of substructures. The structural landmarking scheme is then implemented in each layer of the hierarchy to generate graph-level features specific to that scale. Finally, these features can be combined for graph classification.
7.5 Semisupervised SLIM Network
The SLIM network is flexible and can be trained in both a fully supervised setting and a semi-supervised setting. This is because the SLIM model takes a parametric form, so it is inductive and generalizes to any new samples; on the other hand, the clustering-based loss term in (3) can be evaluated on both labeled and unlabeled samples, providing the extra flexibility of looking into the distribution of the testing samples during the training phase, if they are available. This is very similar in flavor to the smoothness constraint widely used in semi-supervised learning, such as graph-regularized manifold learning mani_reg. Therefore, the SLIM network can be implemented in the following modes:
Supervised version. Only the training graphs and their labels are available during the training phase, and the loss function (3) is computed only on the training samples.

Semi-supervised version. Both labeled training graphs and unlabeled testing graphs are available. The loss function (3) is computed on both the training and testing graphs, while the classification loss is evaluated only on the training graph labels.
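The two modes differ only in which samples feed the clustering term. The sketch below illustrates this split with a NumPy toy objective; the specific loss forms (cross-entropy, nearest-landmark distance) and all names are our own assumptions standing in for the paper's loss (3), not its actual implementation.

```python
import numpy as np

def clustering_loss(feats, landmarks):
    """Label-free term in the spirit of loss (3): mean squared distance of
    each substructure feature to its nearest landmark. Because no labels
    are needed, unlabeled test graphs can contribute to it."""
    d = ((feats[:, None, :] - landmarks[None, :, :]) ** 2).sum(-1)
    return d.min(axis=1).mean()

def total_loss(train_feats, train_logits, train_labels,
               test_feats, landmarks, semi_supervised=True):
    # classification loss (cross-entropy): labeled training graphs only
    p = np.exp(train_logits - train_logits.max(1, keepdims=True))
    p /= p.sum(1, keepdims=True)
    ce = -np.log(p[np.arange(len(train_labels)), train_labels]).mean()
    # clustering loss: training graphs, plus test graphs if available
    feats = (np.vstack([train_feats, test_feats])
             if semi_supervised else train_feats)
    return ce + clustering_loss(feats, landmarks)

train_feats = np.array([[0., 0.], [1., 1.]])
test_feats = np.array([[2., 2.]])
landmarks = np.array([[0., 0.], [2., 2.]])
train_logits = np.array([[2., 0.], [0., 2.]])
train_labels = np.array([0, 1])

l_semi = total_loss(train_feats, train_logits, train_labels,
                    test_feats, landmarks, semi_supervised=True)
l_sup = total_loss(train_feats, train_logits, train_labels,
                   test_feats, landmarks, semi_supervised=False)
```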
7.6 Interpretability
The SLIM network not only generates accurate predictions in graph classification problems, but can also provide important clues for interpreting the prediction results, because the graph-level features in SLIM bear a clear physical meaning. For example, assume that we use the interaction matrix of the i-th graph as its feature representation; the (p, q)-th entry then quantifies the connectivity strength between the p-th and the q-th substructure landmarks. By checking the model coefficients of the fully-connected layer, one can tell which subset of substructure connectivities (i.e., pairs of substructures that are directly connected in a graph) is important in making the prediction. To further improve interpretability, one can impose a sparsity constraint on the model coefficients.
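The inspection step above amounts to mapping each surviving coefficient of the fully-connected layer back to a landmark pair (p, q). A minimal sketch, with entirely hypothetical coefficient values and a hard threshold standing in for an L1-induced sparsity pattern:

```python
import numpy as np

k = 3                                    # number of structural landmarks (assumed)
# hypothetical coefficients of the fully-connected layer,
# one per (p, q) entry of the flattened k-by-k interaction matrix
w = np.array([0.05, -1.40, 0.10,
              0.00,  0.80, 2.30,
              -0.02, 0.01, -0.60])
w[np.abs(w) < 0.5] = 0.0                 # sparsity, e.g. induced by an L1 penalty

# each surviving coefficient indexes a landmark pair (p, q): the model
# relies on the connectivity strength between landmarks p and q
for idx in np.flatnonzero(w):
    p, q = divmod(idx, k)
    print(f"landmark pair ({p}, {q}): weight {w[idx]:+.2f}")
```

The key point is that every coefficient has an addressable physical meaning (a pair of named substructure landmarks), unlike a pooled GNN embedding where each dimension mixes all nodes.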
In traditional graph neural networks such as GraphSAGE or GIN, node features are transformed through many layers and finally mingled together through graph pooling. The resultant graph-level representation, whose dimension is manually determined and each entry of which pools values across all nodes in the graph, can be difficult to interpret.
7.7 The Prediction Layer
The SLIM network offers several possibilities for the prediction layer.

Fully connected layer. The interaction matrix can be reshaped into a vector, or transformed into a smaller matrix via bilateral dimension reduction before being reshaped into a vector. A fully connected layer then follows for the final prediction.

Landmark graph. Each graph can be transformed into a landmark graph with a fixed number of (landmark) nodes, with weights quantifying each node and the edge between every pair of nodes, together with a feature for each node (see the definition in Section 3.3). This graph can then be processed by a graph convolution so as to generate a fixed-dimensional graph-level feature without having to handle the varying graph size. We will study this in future experiments.

Riemannian manifold. When using the interaction matrix, or its normalized version, as the graph-level feature, we can treat each graph as a point on a Riemannian manifold, owing to the symmetry and positive semi-definiteness of the representation. The distance between two interaction matrices can then be computed as the Wasserstein distance between two Gaussian distributions with the interaction matrices as covariances, which has a closed form. We will study this in future experiments.
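The closed form referred to above is the Bures-Wasserstein distance between zero-mean Gaussians: W₂²(N(0, A), N(0, B)) = tr(A) + tr(B) − 2 tr((A^{1/2} B A^{1/2})^{1/2}). A small NumPy sketch (our own illustration; the interaction matrices here are toy PSD inputs):

```python
import numpy as np

def sqrtm_psd(M):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(M)
    vals = np.clip(vals, 0.0, None)      # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def bures_wasserstein(A, B):
    """Closed-form 2-Wasserstein distance between N(0, A) and N(0, B)."""
    rA = sqrtm_psd(A)
    cross = sqrtm_psd(rA @ B @ rA)
    val = np.trace(A) + np.trace(B) - 2.0 * np.trace(cross)
    return np.sqrt(max(val, 0.0))        # clamp numerical noise at zero

A = np.eye(2)          # toy "interaction matrices" used as covariances
B = 4.0 * np.eye(2)
print(bures_wasserstein(A, B))   # → 1.4142... (i.e., sqrt(2))
```

Because the distance is between covariances only, it compares how strongly the landmark interactions co-vary, independently of graph size.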
7.8 Interaction versus Integration
The SLIM network and existing GNNs represent two different flavors of learning: interaction modelling versus the integration approach. Interaction modelling is grounded in a mature understanding of complex systems and can provide physically meaningful interpretations for graph classification; integration-based approaches bypass the difficulty of preserving the identities of substructures and instead focus on whether the integrated representation is an injective mapping, as typically studied in graph isomorphism testing.
Note that an ideal classifier is different from an isomorphism test and is not injective. In a good classifier, deciding which samples are similar and deciding which are distinct are equally important, and there is a trade-off between the two. Isomorphism-flavored GNNs aim to preserve the differences between local substructures (even very minute ones) and then map the resultant embedding to the class labels. Our approach, on the other hand, absorbs patterns that are sufficiently close into the same landmark and then maps the landmark-based features to the class labels. In the latter case, the structural resolution can be tuned flexibly to explore different levels of fineness, thus balancing "similarity" against "distinctness"; at the same time, the structural landmarks preserve substructure identities and allow their interactions to be exploited.