Structural Landmarking and Interaction Modelling: on Resolution Dilemmas in Graph Classification

06/29/2020, by Kai Zhang et al. (Temple University, Georgia Institute of Technology, East China Normal University)

Graph neural networks are a promising architecture for learning and inference with graph-structured data. Yet difficulties persist in modelling the "parts" of a graph and their "interactions" for graph classification, where graph-level representations are usually obtained by squeezing the whole graph into a single vector through graph pooling. From a complex-systems point of view, mixing all the parts of a system together can affect both model interpretability and predictive performance, because the properties of a complex system arise largely from the interactions among its components. We analyze the intrinsic difficulty of graph classification under the unified concept of "resolution dilemmas", with learning-theoretic recovery guarantees, and propose "SLIM", an inductive neural network model for Structural Landmarking and Interaction Modelling. It turns out that, by solving the resolution dilemmas and leveraging explicit interacting relations between the component parts of a graph to explain its complexity, SLIM is more interpretable, more accurate, and offers new insight into graph representation learning.


1 Introduction

Complex systems are a ubiquitous phenomenon in natural and scientific disciplines, and how the relationships between parts give rise to the global behaviours of a system is a central theme in many areas of study, such as systems biology (biology), neuroscience (brain), and drug and material discovery (drug; material).

Graph neural networks (GNNs) are a promising architecture for representation learning on graphs - the structural abstraction of complex systems. State-of-the-art performance has been observed in various graph mining tasks (GCN2; GCN5; graphSAGE; GNNpower; gat; WLneural; GNNreview; GNNreview2; GNNreview3). However, due to their non-Euclidean nature, challenges still exist in graph classification. For example, to generate a fixed-dimensional graph-level representation, a GNN combines information from each node through graph pooling. In this combined form, a graph collapses into a "super-node", where the identities of the constituent sub-graphs and their inter-connections are mixed together. Is this the best way to generate graph-level features? From a complex-systems view, mixing all parts of a system can affect interpretability and model prediction, because the properties of a complex system arise largely from the interactions among its components (molecular; book_complex; book_complex2).

The choice of "collapsing"-style graph pooling is rooted deeply in the lack of natural alignment among graphs that are not isomorphic; the pooling therefore sacrifices structural details for feature compatibility. In recent years, substructure patterns have drawn considerable attention in graph mining, such as motifs (motif1; motif2; motif3; motif4) and graphlets (fast-gkernel). They provide an intermediate scale for structure comparison or counting, and have been considered in node embedding (motif_embed), deep graph kernels (Deep-gkernel), and graph convolution (GNNmotif1). However, due to the combinatorial nature, only substructures of very small sizes (4 or 5 nodes) can be considered (Deep-gkernel; motif3), greatly limiting the coverage of structural variations; also, handling substructures as discrete objects makes it difficult to compensate for their similarities, at least computationally, so the risk of overfitting may rise in supervised learning scenarios.

These intrinsic difficulties are related to the concept of resolution in graph-structured data processing. Resolution is the scale at which measurements are made and/or information-processing algorithms operate. Below, we first define two relevant terms, the spatial resolution and the structural resolution, and discuss how they may affect the performance of graph classification.

First, spatial resolution relates to the geometric scale of the "elementary component" of a graph on which an algorithm operates. It can range from single nodes, to sub-graphs, to the entire graph. Graph details finer than the effective spatial resolution are algorithmically unidentifiable. For example, graph pooling compresses the whole graph into a single vector, so the spatial resolution drops to its lowest: node and edge identities are mixed together, and the subsequent classification layer can no longer exploit any substructure or their connections, but only a global aggregation. We call this vanishing spatial resolution. Insufficient spatial resolution may affect interpretability, and also predictive power, since the global properties of a complex system arise largely from its inherent interactions (molecular; book_complex; book_complex2).

Second, structural resolution is the fineness level in differentiating between substructures. Substructures (or sub-graphs) shed light on functional organization and graph alignment. However, they are typically treated in a discrete and over-delicate manner: under exact matching, two substructures are considered distinct even if they share significant similarity. We call this exploding structural resolution. It can lead to a risk of overfitting, similar to what is observed in deep graph kernels (Deep-gkernel) and dictionary learning (adpt_size).

We believe that both resolution dilemmas originate from the way we perform profiling, identification, and alignment of substructures. Substructures are the building blocks of a graph; relations like interaction or alignment are all defined between substructures (of varying scales). However, exact substructure matching is too costly and prone to overfitting, leading to exploding structural resolution; meanwhile, graph alignment becomes infeasible when substructure matching is poorly defined, and so collapsing-style graph pooling becomes the norm, which finally leads to vanishing spatial resolution.

Our contribution. In this paper, we propose a simple neural architecture called "Structural Landmarking and Interaction Modelling", or SLIM, for inductive graph classification. The key idea is to embed substructure instances into a continuous metric space and learn structural landmarks there for explicit interaction modelling. The SLIM network effectively resolves the resolution dilemmas. More importantly, by fully exploring the diverse structural distribution of the input graphs, any substructure instance, even an unseen example, can be mapped parametrically to a common and optimizable structural landmark set. This enables a novel, identity-preserving graph pooling paradigm, where the interacting relations between the constituent parts of a graph can be modelled explicitly, shedding important light on the functional organization of complex systems.

The design philosophy of SLIM comes from a long-standing view of complex systems: complexity arises from interaction. Therefore, explicit modelling of the parts and their interactions is key to explaining the complexity and improving the prediction. In contrast, graph neural networks are more about "integration", where delicate part-modelling such as convolution does exist but is finally obscured in the pooling process. It turns out that, by respecting the structural organization of complex systems, SLIM is more interpretable, more accurate, and provides new insights into graph representation learning.

We discuss the resolution dilemmas and related work in Section 2. Sections 3, 4, and 5 cover the design, analysis, and performance of SLIM, respectively. The last section concludes the paper.

2 Resolution Dilemmas in Graph Classification

A complex system is composed of many parts that interact with each other in a non-simple way. Since graphs are the structural abstraction of complex systems, accurate graph classification depends on how the global properties of a system relate to its structure. It is believed that the properties (and complexity) of a complex system arise from the interactions among its components (book_complex; book_complex2). Accurate interaction modelling should therefore benefit prediction. However, this is non-trivial due to the resolution dilemmas.

2.1 Spatial Resolution Diminishes in Graph Pooling

Graph neural networks (GNNs) for graph classification typically have two stages: graph convolution and graph pooling (graphSAGE; GNNpower). The spatial resolutions of these two stages are significantly different.

The goal of convolution is to pass messages among neighboring nodes, in the general form of $h_v^{(l+1)} = \mathrm{AGGREGATE}\big(\{h_u^{(l)} : u \in \mathcal{N}(v) \cup \{v\}\}\big)$, where $\mathcal{N}(v)$ is the set of neighbors of node $v$ (graphSAGE; GNNpower). Here, the spatial resolution is controlled by the number of convolution layers: more layers capture larger substructures/sub-trees and can lead to improved discriminative power (GNNpower). In other words, a medium resolution (substructure level) can provide more informative functional markers than a high resolution (node level). In practice, multiple resolutions can be combined via a CONCATENATE function (graphSAGE; GNNpower) for subsequent processing.
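To make the aggregation step concrete, below is a minimal numpy sketch (ours, not the implementation of any cited method) of one mean-aggregation round; stacking $k$ such rounds lets each node summarize its $k$-hop neighborhood.

    import numpy as np

    def mean_aggregate(A, H):
        """One message-passing round: each node averages its neighbors' features.

        A: (n, n) binary adjacency matrix; H: (n, d) node feature matrix.
        """
        deg = A.sum(axis=1, keepdims=True).clip(min=1)  # avoid division by zero
        return (A @ H) / deg

    # Toy example: a 4-node path graph with 2-dimensional node features.
    A = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
    H = np.random.rand(4, 2)
    H1 = mean_aggregate(A, H)    # 1-hop information
    H2 = mean_aggregate(A, H1)   # 2-hop information (two stacked layers)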

The goal of graph pooling is to generate compact, graph-level representations that are compatible across graphs. Due to the lack of natural alignment between graphs that are not isomorphic, graph pooling typically "squeezes" a graph into a single vector (or "super-node") in the form of $h_{\mathcal{G}} = \mathrm{READOUT}\big(\{h_v : v \in \mathcal{V}\}\big)$, where $h_v$ is the representation of node $v$ of graph $\mathcal{G}$. Different readout functions have been proposed, including max-pooling (max_pooling), sum-pooling (GNNpower), various pooling functions (MEAN, LSTM, etc.) (graphSAGE), and deep sets (deep_set); attention has been used to evaluate node importance in attention pooling (att_pool) and gPool (unet); besides, hierarchical differentiable pooling has also been investigated (dif_pool).

An important resolution bottleneck occurs in graph pooling, as shown in Figure 1. Since all the nodes are mixed into one, the subsequent classifier can no longer identify any individual substructure or their interactions, regardless of the resolution used in graph convolution. We call this "diminishing spatial resolution", which can be undesirable in that: (1) how much information from the well-designed convolution domain can penetrate the pooling layer for the final prediction is hard to analyze or control; (2) in molecule classification, graph labels hinge on functional modules and how they are organized (drug); an overly coarse spatial resolution will mix up functional modules and conceal their interactions.

Footnote 1: Some works adopt different aggregation strategies: SortPooling arranges nodes in a linear chain and performs 1-d convolution (DGCNN); SEED uses the distribution of multiple random walks (SEED); deep graph kernels evaluate graph similarity by subgraph counts (Deep-gkernel). Explicit modelling of the interaction between graph parts is not considered in these methods.

Figure 1: Spatial resolution vanishes after graph pooling. (Note: not all nodes are marked with the convolution operation - the shaded circles; see Appendix Sec 7.4 for more discussion on the relation with hierarchical processing.)

Can meaningful spatial resolution(s) survive graph pooling? The answer is yes, but it involves substructure alignment and the notion of structural resolution, discussed below.

2.2 Structural Resolution Explodes in Substructure Identification

Substructures are the basic units that accommodate interacting relations. A global criterion to identify and align substructures is key to preserving substructure identities and comparing the inherent interactions across graphs. Again, the fineness level in determining whether two substructures are "similar" or "different" is subject to a wide spectrum of choices, which we call the "structural resolution".

Figure 2: How structural resolution may affect generalization performance. Only small substructures are shown for illustration; node types do make a difference in profiling the substructures.

We illustrate this in Figure 2. The right end denotes the finest resolution in differentiating between substructures: exact matching, as practiced when manipulating motifs/graphlets (motif1; motif2; motif3; GNNmotif1; fast-gkernel). The exponential number of sub-graph configurations finally leads to an "exploding" structural resolution, because maintaining a large number of unique substructures is infeasible and easily overfits. The left end of the spectrum treats all substructures the same and underfits the data. We are interested in a medium structural resolution, where similar substructures are mapped to the same identity, which we believe benefits generalization performance (see Figure 4 for empirical evidence).

Theoretically, an over-delicate structural resolution corresponds to a highly "coherent" basis for representing a graph, leading to unidentifiable dictionary learning (ERC; supervised_dic). Structural landmarking is aimed exactly at controlling the structural resolution and improving incoherence for graph classification.

3 Structural Landmarking and Interaction Modelling (SLIM)

Considering the difficulty of manipulating substructures as discrete objects, we embed them in a continuous space, and transform all structure-related operations from their discrete, off-the-shelf versions to continuous, optimizable counterparts. The key idea of SLIM is the identification of structural landmarks in this new space, via both unsupervised compression and supervised fine-tuning, through the distribution of embedded substructures under possibly multiple scales. Structural landmarking resolves the resolution dilemmas and allows explicit interaction modelling in graph classification.

Problem Setting. Given a set of labeled graphs $\{(\mathcal{G}_i, y_i)\}$ for $i = 1, 2, \dots, n$, with each graph $\mathcal{G}_i$ defined on the node/edge sets $(\mathcal{V}_i, \mathcal{E}_i)$ with adjacency matrix $A_i \in \mathbb{R}^{n_i \times n_i}$, where $n_i = |\mathcal{V}_i|$ and $y_i$ is the class label. Assume that nodes are drawn from $c$ categories, and the node attribute matrix for $\mathcal{G}_i$ is $X_i \in \mathbb{R}^{n_i \times c}$. Our goal is to train an inductive model to predict the labels of the testing graphs.

Figure 3: The three main steps of the SLIM network illustrated in molecule graph classification.

The SLIM network has three main steps: (1) substructure embedding, (2) substructure landmarking, and (3) identity-preserving graph pooling, as shown in Figure 3. Detailed discussion follows.

3.1 Substructure Embedding

The goal of substructure embedding is to extract substructure instances and embed them in a metric space. One can employ multiple layers of convolution (graphSAGE; GNNpower) to model substructures (rooted sub-trees), or randomly sample sub-graphs (fast-gkernel). For convenience, we simply extract one sub-graph instance from each node using a $k$-hop breadth-first search (BFS), which controls the spatial resolution (when $k$ is large, one subgraph around each node may be unnecessary; see the discussion in Appendix, Sec 7.4). In Figure 3, each sub-graph in a shaded circle around an atom is a substructure instance.

Let $A_i^{(k)}$ be the $k$-th order adjacency matrix, i.e., the $(u, v)$-th entry equals 1 only if nodes $u$ and $v$ are within $k$ hops of each other. Since each sub-graph is associated with one node, the sub-graphs extracted from $\mathcal{G}_i$ can be represented as $S_i = A_i^{(k)} X_i$, whose $v$-th row is a $c$-dimensional vector summarizing the counts of the node types in the sub-graph around the $v$-th node. Variations include: (1) emphasizing the center node, $S_i = (A_i^{(k)} + \lambda I) X_i$; (2) the layer-wise node distribution $S_i = [\bar{A}_i^{(1)} X_i, \bar{A}_i^{(2)} X_i, \dots, \bar{A}_i^{(k)} X_i]$, where $\bar{A}_i^{(l)}$ specifies whether two nodes in $\mathcal{G}_i$ are exactly $l$ hops away; or (3) the weighted layer-wise summation $S_i = \sum_{l=0}^{k} w_l \bar{A}_i^{(l)} X_i$, where the $w_l$'s are non-negative weights that decay with $l$.
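As an illustration of this construction, here is a minimal numpy sketch (helper names and the toy graph are ours): build the $k$-th order adjacency from powers of $A + I$, then count node types inside each $k$-hop sub-graph.

    import numpy as np

    def khop_adjacency(A, k):
        """A^{(k)}: entry (u, v) is 1 iff nodes u and v are within k hops (u != v)."""
        n = A.shape[0]
        reach = np.linalg.matrix_power(A + np.eye(n), k)  # walks of length <= k
        Ak = (reach > 0).astype(float)
        np.fill_diagonal(Ak, 0)
        return Ak

    def substructure_instances(A, X, k, center_weight=1.0):
        """S: row v counts node types in the k-hop sub-graph rooted at node v.

        A: (n, n) adjacency; X: (n, c) one-hot node types.
        center_weight emphasizes the center node (variation (1) in the text).
        """
        Ak = khop_adjacency(A, k)
        return (Ak + center_weight * np.eye(A.shape[0])) @ X

    # Toy molecule: 5 nodes, 3 node types.
    A = np.array([[0, 1, 0, 0, 0],
                  [1, 0, 1, 0, 0],
                  [0, 1, 0, 1, 1],
                  [0, 0, 1, 0, 0],
                  [0, 0, 1, 0, 0]], dtype=float)
    X = np.eye(3)[[0, 1, 2, 0, 1]]          # one-hot type per node
    S = substructure_instances(A, X, k=2)   # (5, 3) substructure profiles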

Next, we consider embedding the substructure instances (i.e., the rows of the $S_i$'s) into a latent space so that statistical manipulations can better align with the prediction task. The embedding should preserve important proximity relations to facilitate subsequent landmarking: if two substructures are similar, or they often inter-connect with each other, their embeddings should be close. In other words, the embedding should be smooth with regard to both structural similarities and geometric interactions.

A parametric transform on the $S_i$'s with controlled complexity, e.g., an autoencoder, can guarantee the smoothness of the embedding w.r.t. structural similarity. Let $Z_i = f_\theta(S_i)$ be the embedding of the sub-graph instances extracted from $\mathcal{G}_i$. To maintain the smoothness of the $Z_i$'s w.r.t. geometric interaction, we maximize the log-likelihood of the co-occurrence of substructure instances in each graph, similar to word2vec (wordvec):

$$\mathcal{L}_{\mathrm{emb}} = -\sum_{i} \sum_{v \in \mathcal{V}_i} \sum_{u \in \mathcal{N}_i(v)} \log \frac{\exp\big(\langle z^i_v, z^i_u \rangle\big)}{\sum_{u' \in \mathcal{V}_i} \exp\big(\langle z^i_v, z^i_{u'} \rangle\big)} \qquad (1)$$

Here $z^i_v$ is the $v$-th row of $Z_i$, $\langle \cdot, \cdot \rangle$ is the inner product, and $\mathcal{N}_i(v)$ are the neighbors of node $v$ in graph $\mathcal{G}_i$. This loss function tends to embed strongly inter-connecting substructures close to each other.
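A minimal PyTorch sketch of this co-occurrence objective, under our reconstruction of Eq. (1) as a softmax over candidate context substructures within the same graph (the sampling and normalization details of the actual implementation may differ):

    import torch
    import torch.nn.functional as F

    def cooccurrence_loss(Z, A):
        """Negative log-likelihood of observed substructure co-occurrences.

        Z: (n, d) embeddings of the n substructure instances of one graph.
        A: (n, n) adjacency; A[v, u] = 1 marks u as a neighbor of v.
        Strongly inter-connected substructures are pulled close in embedding space.
        """
        scores = Z @ Z.t()                          # pairwise inner products
        log_prob = F.log_softmax(scores, dim=1)     # softmax over candidate contexts
        neighbor_mask = (A > 0).float()
        n_pairs = neighbor_mask.sum().clamp(min=1)
        return -(log_prob * neighbor_mask).sum() / n_pairs

    # Usage: Z produced by the encoder for one graph.
    Z = torch.randn(5, 16, requires_grad=True)
    A = torch.tensor([[0, 1, 0, 0, 0],
                      [1, 0, 1, 0, 0],
                      [0, 1, 0, 1, 1],
                      [0, 0, 1, 0, 0],
                      [0, 0, 1, 0, 0]], dtype=torch.float)
    loss = cooccurrence_loss(Z, A)
    loss.backward()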

3.2 Substructure Landmarking

The goal of structural landmarking is to identify a set of informative structural landmarks in the continuous embedding space that have: (1) high statistical coverage, namely, the landmarks should faithfully recover the distribution of the substructures from the input graphs, so that we can generalize to new substructure examples from this distribution; and (2) high discriminative power, namely, the landmarks should be able to reflect discriminative interaction patterns for classification.

Let $U = [u_1, u_2, \dots, u_K] \in \mathbb{R}^{d \times K}$ be the structural landmarks, where $K$ is the landmark set size. In order for them to be representative of the substructure distribution, it is desirable that each sub-graph instance is faithfully approximated by its closest landmark. We minimize the following distortion loss:

$$\mathcal{L}_{\mathrm{dist}} = \sum_{i} \sum_{v \in \mathcal{V}_i} \min_{k} \big\| z^i_v - u_k \big\|^2 \qquad (2)$$

Here $z^i_v$ denotes the $v$-th row (substructure) from graph $\mathcal{G}_i$. In practice, we implement a soft assignment by using one cluster indicator matrix $Q_i$ for each graph $\mathcal{G}_i$, whose $(v, k)$-th entry $q^i_{vk}$ is the probability that the $v$-th substructure of $\mathcal{G}_i$ belongs to the $k$-th landmark $u_k$. Inspired by deep embedding clustering (DEC), $Q_i$ is parameterized by a Student's t-distribution,

$$q^i_{vk} = \frac{\big(1 + \|z^i_v - u_k\|^2\big)^{-1}}{\sum_{k'} \big(1 + \|z^i_v - u_{k'}\|^2\big)^{-1}},$$

and the loss function can be greatly simplified by minimizing the KL-divergence

$$\mathcal{L}_{\mathrm{KL}} = \sum_i \mathrm{KL}\big(P_i \,\|\, Q_i\big). \qquad (3)$$

Here, $P_i$ is a self-sharpening version of $Q_i$, and minimizing the KL-divergence forces each substructure instance to be assigned to only a small number of landmarks, similar to sparse dictionary learning. Besides the unsupervised regularization in (2) or (3), the learning of the structural landmarks is also driven by the classification loss, guaranteeing the discriminative power of the landmarks.
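A minimal PyTorch sketch of the soft assignment and its self-sharpening target, following the DEC formulation cited above (the Student's t kernel with one degree of freedom is assumed; the paper's exact parameterization may differ):

    import torch

    def soft_assignment(Z, U):
        """Q[v, k]: probability that substructure v belongs to landmark k (Student's t)."""
        dist2 = torch.cdist(Z, U) ** 2                    # (n, K) squared distances
        q = 1.0 / (1.0 + dist2)
        return q / q.sum(dim=1, keepdim=True)

    def sharpened_target(Q):
        """P: self-sharpening of Q that up-weights confident assignments (as in DEC)."""
        weight = Q ** 2 / Q.sum(dim=0, keepdim=True)
        return weight / weight.sum(dim=1, keepdim=True)

    def landmark_kl_loss(Z, U):
        """KL(P || Q): drives each substructure toward a small number of landmarks."""
        Q = soft_assignment(Z, U)
        P = sharpened_target(Q).detach()                  # target held fixed per step
        return (P * (P.clamp_min(1e-12).log() - Q.clamp_min(1e-12).log())).sum(dim=1).mean()

    # Usage: 5 substructure embeddings, 3 learnable landmarks.
    Z = torch.randn(5, 16)
    U = torch.randn(3, 16, requires_grad=True)
    loss = landmark_kl_loss(Z, U)
    loss.backward()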

3.3 Identity-Preserving Graph Pooling

The goal of identity-preserving graph pooling is to project structural details of each graph onto the common space of landmarks, so that a compatible, graph-level feature can be obtained that simultaneously preserves the identity of the parts (substructures) and models their interactions.

The structural landmarking mechanism allows computing rich graph-level features. First, we can model substructure distributions. The density of the substructure landmarks in graph $\mathcal{G}_i$ can be computed as $d_i = \frac{1}{n_i} Q_i^\top \mathbf{1}_{n_i} \in \mathbb{R}^{K}$. Furthermore, the first-order moment of the substructures belonging to each of the $K$ landmarks in $\mathcal{G}_i$ is $M_i = Z_i^\top Q_i D_i^{-1} \in \mathbb{R}^{d \times K}$, where $D_i = \mathrm{diag}(Q_i^\top \mathbf{1}_{n_i})$, and the $k$-th column of $M_i$ is the mean of $\mathcal{G}_i$'s substructure instances belonging to the $k$-th landmark. Second, we can model how the landmarks interact with each other in graph $\mathcal{G}_i$. To do this, we project the adjacency matrices $A_i$'s onto the landmark set and obtain a $K \times K$ interaction matrix $H_i = Q_i^\top A_i Q_i$, which encodes the interacting relations (geometric connections) among the structural landmarks.

These features can be combined for final classification. For example, they can be reshaped and concatenated to feed into the fully-connected layer. One can also resort to more intuitive constructions; for example, using the first-order and second-order features together, one can transform each graph into a constant-sized "landmark" graph with node features $M_i$, node weights $d_i$, and edge weights $H_i$. Standard graph convolution can then be applied to the landmark graphs to generate graph-level features (without the pains of graph alignment anymore). In our experiments, for simplicity, we compute the normalized interaction matrix and use it as the feature, which works well on all the benchmark datasets. More detailed discussion can be found in the Appendix (Sec 7.4 & 7.7).
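A minimal PyTorch sketch of these graph-level features (function and variable names are ours; the normalization of the interaction matrix is one simple choice, not necessarily the one used in the paper):

    import torch

    def identity_preserving_pool(Z, A, Q):
        """Graph-level features that keep substructure identities separate.

        Z: (n, d) substructure embeddings, A: (n, n) adjacency,
        Q: (n, K) soft assignment of substructures to the K landmarks.
        """
        counts = Q.sum(dim=0)                              # soft count per landmark
        density = counts / counts.sum()                    # landmark distribution in this graph
        moments = (Q.t() @ Z) / counts.clamp(min=1e-12).unsqueeze(1)  # per-landmark mean embedding
        H = Q.t() @ A @ Q                                  # landmark-landmark interaction matrix
        H_norm = H / A.shape[0]                            # one simple normalization (by graph size)
        return density, moments, H_norm

    # Usage: the outputs are fixed-size (K, K x d, K x K) regardless of graph size.
    Z = torch.randn(7, 16)
    A = (torch.rand(7, 7) > 0.6).float()
    A = ((A + A.t()) > 0).float().fill_diagonal_(0)
    Q = torch.softmax(torch.randn(7, 4), dim=1)
    density, moments, H = identity_preserving_pool(Z, A, Q)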

4 Theoretic Analysis and Discussions

We provide learning-theoretic support for the choice of structural resolution (the landmark set size $K$). Graphs are bags of inter-connected substructure instances, and each instance can be represented by the landmarks as $z^i_v \approx U q^i_v$. Too small a number of landmarks fails to recover the basic data structures, whereas too many landmarks will result in overfitting (e.g., in exact substructure matching, where a maximal $K$ is used for reconstruction) (adpt_size). In dictionary learning, the mutual coherence is a crucial index for evaluating the redundancy of the code-vectors, defined as

$$\mu(U) = \max_{k \neq k'} \big| \rho(u_k, u_{k'}) \big|, \qquad (4)$$

where $\rho(u_k, u_{k'}) = \frac{\langle u_k, u_{k'} \rangle}{\|u_k\|\,\|u_{k'}\|}$ denotes the normalized correlation. A lower coherence permits better support recovery (ERC), while a large coherence leads to worse stability in both sparse coding and classification (supervised_dic). In particular, a faithful recovery of a sparse signal support of size $m$ is guaranteed only when

$$\mu(U) \;<\; \frac{1}{2m - 1}. \qquad (5)$$

Obviously, a large $\mu(U)$ leads to unstable solutions. In the following, we quantify a lower bound of the coherence as a function of the landmark size $K$ in clustering-based basis selection, since sparse coding and the $k$-means algorithm generate very similar code vectors (clusteringdic).

Theorem 1.

The lower bound of the squared mutual coherence of the landmark vectors increases monotonically with $K$, the number of landmarks, in clustering-based sparse dictionary learning; asymptotically,

$$\mu^2(U) \;\geq\; 1 - \frac{4\, c_d^2\, C(p)^2}{R^2}\, K^{-2/d}.$$

Here, $d$ is the dimension of the embedding space; $c_d$ is a dimension-dependent factor determined by $V_d$, the volume of the $d$-dimensional unit ball; $R$ is the maximum $\ell_2$-norm of (a subset of) the landmark vectors $u_k$'s; and $C(p)$ is a factor depending on the data distribution $p$.

The proof is in the Appendix (Sec 7.1). Theorem 1 says that when the landmark set size $K$ increases, the mutual coherence has a lower bound that consistently increases and eventually violates the recovery condition (5). In fact, a very high structural resolution (like exact matching) places a heavy burden on subsequent classifiers by failing to compensate for structural similarities. This justifies the SLIM network, where the landmark set size can be conveniently controlled to avoid unstable dictionary learning.

Discussions. GNNs have shown great potential in graph isomorphism tests by generating injective graph embeddings, thanks to their theoretical foundations (GNNpower; WLneural). However, accurate graph classification needs more thought: classification is not injective; besides, the quality of features is also of notable importance. SLIM provides new insight in both respects: (1) it finds a tradeoff in the duality of handling similarity and distinctness; (2) it explores new ways of generating graph-level features: instead of aggregating all parts together as in GNNs, it taps into the vision of complex systems so that the interaction between the parts is leveraged to explain the complexity and improve the learning. More discussions are in the Appendix (Sec 7.2-7.8), including the choice of spatial/structural resolutions, interpretability, hierarchical and semi-supervised versions, and comparison with graph kernels (graph_kernels).

5 Experiments

Benchmark data. We use a number of popular benchmark data sets for graph classification. (1) MUTAG: chemical compound data set with 188 instances and two classes; there are 7 node/atom types and 3 edge/bond types (bond types are ignored). (2) PROTEINS: protein molecule data set with 1113 instances and three classes; there are 3 node types (secondary structure elements). (3) NCI1: chemical compound data set for cancer cell lines with 4110 instances and two classes. (4) PTC: chemical compound data set for toxicology prediction with 417 instances and 8 classes. (5) D&D: data set for enzyme classification with 1178 instances and two classes.

Competing methods. We compare against a number of highly competitive methods proposed in recent years: (1) Graph Neural Tangent Kernel (GNTK) (GNTK); (2) Graph Isomorphism Network (GIN) (GNNpower); (3) end-to-end graph classification (DGCNN) (DGCNN); (4) hierarchical differentiable pooling (DiffPool) (dif_pool); (5) Self-Attention Pooling (SAG) (att_pool); (6) convolutional networks for graphs (PATCHY-SAN) (10); (7) graphlet kernel (GK) (GK); (8) Weisfeiler-Lehman graph kernel (WLGK) (WL_kernel); (9) propagation kernel (PK) (PK). For methods (4), (6), (7), (8), and (9) we directly cite their reported results (averaged 10-fold cross-validated accuracy) due to the unavailability of their code; for the other competing methods we run their code with default settings and report the performance.

Experimental setting. We follow the experimental setting in GNNpower and 10, and perform 10-fold cross-validation; we report the average and standard deviation of the validation accuracies across the 10 folds. In the SLIM network, the spatial resolution is controlled by a BFS with 3-hop neighbors, and the structural resolution (landmark set size $K$) is simply fixed; the FC layer has one hidden layer with dimension 64; cross-entropy is used for classification; the weights for the loss terms (1) and (3) are set to 0.01. No drop-out or batch normalization is used, considering the size of the benchmark data. The hyper-parameters tuned per dataset include: (1) the dimension of the autoencoder's single hidden layer; (2) the optimizer, chosen between SGD and Adagrad, together with the learning rate; (3) the local graph representation, i.e., node distribution, layer-wise distribution, or weighted layer-wise summation (see Sec 3.1 for details); (4) the number of epochs, i.e., the single epoch with the best cross-validated accuracy averaged over all 10 folds was selected. Overall, a minimal SLIM network is used in the experiments in order to test its performance.

Figure 4: Accuracy vs. structural resolution (landmark set size $K$).

Structural Resolution. In Figure 4, we examine the performance of SLIM under different choices of the structural resolution (landmark set size $K$). As can be seen, the accuracy-vs-$K$ curve has a bell-shaped structure. When $K$ is either too small (underfitting) or too large (coherent landmarks that overfit), the accuracy is low, and the best performance is typically attained around a medium value. This validates Theorem 1 and the usefulness of structural landmarking in improving graph classification.

(a) NCI data.
(b) MUTAG data.
(c) Protein data.
(d) D&D data.
Figure 5: Testing accuracy of different algorithms over the training epochs.

Classification Performance. We then compare the performance of the different methods in Table 1. As can be seen, overall, neural-network-based approaches are more competitive than graph kernels, except that graph kernels have lower fluctuations, and the WL graph kernel performs the best on the NCI1 dataset. On most benchmark datasets, the SLIM network generates classification accuracies that are either higher than or at least as good as those of other GNN/graph-pooling schemes.

Category       Algorithm    MUTAG        PTC          NCI1         Protein      D&D
Graph kernel   GK           81.38±1.74   55.65±0.46   62.49±0.27   71.39±0.31   74.38±0.69
Graph kernel   PK           76.00±2.69   59.50±2.44   82.54±0.47   73.68±0.68   78.25±0.51
Graph kernel   WLGK         84.11±1.91   57.97±2.49   84.46±0.45   74.68±0.49   78.34±0.62
GNN            PATCHY-SAN   92.63±4.21   60.00±4.82   78.59±1.89   75.89±2.76   77.12±2.41
GNN            DGCNN        85.83±1.66   68.59±6.47   74.46±0.47   75.54±0.94   79.37±1.03
GNN            DiffPool     90.52±3.98   -            76.53±2.23   75.82±3.56   78.95±2.40
GNN            GNTK         90.12±8.58   67.92±6.98   75.20±1.53   75.61±4.24   79.42±2.18
GNN            SAG          73.53±9.68   75.67±3.12   74.18±1.29   71.86±0.97   76.91±2.12
GNN            GIN          90.03±8.82   76.25±2.83   79.84±4.57   71.28±2.65   77.58±2.94
GNN            SLIM         93.28±3.36   80.41±6.92   80.53±2.01   77.47±4.34   79.48±2.66
Table 1: Averaged prediction accuracy (%, ± standard deviation) for different algorithms on 5 benchmark data sets.

Accuracy Evolution. We also plot the evolution of the testing accuracy for the different methods on the benchmark datasets, to provide a more comprehensive evaluation of their performance. As can be seen from Figure 5, our approach not only generates accurate classification on the benchmark datasets, but its accuracy also converges relatively fast and remains more stable across training epochs, making it easier to determine when to stop training. Other GNN algorithms can also attain a high accuracy on some of the benchmark datasets, but their prediction performance fluctuates significantly across training epochs (even when using large mini-batch sizes). We speculate that the stability of the SLIM network arises from the explicit modelling of the substructure distributions. It is also worth noting that on the MUTAG data the proposed method produces a classification with 100% accuracy on more than half of the runs across different folds (Figure 5(b)). This demonstrates the power of the SLIM network in capturing important graph-level features.

6 Conclusion

Graph neural networks represent the state-of-the-art computational architecture for graph mining. In this paper, we designed the SLIM network, which employs structural landmarking to resolve the resolution dilemmas in graph classification and capture inherent interactions in graph-structured systems. We hope this attempt opens up possibilities in designing GNNs with informative structural priors.

References

7 Appendix

7.1 Proof of Theorem 1

Proof.

Suppose we have $N$ spatial instances embedded in the $d$-dimensional latent space as $z_1, z_2, \dots, z_N$, and the landmarks (or code-vectors) are defined as $u_1, u_2, \dots, u_K$. Let $p(z)$ be the density function of the instances. Define the averaged distance between an instance and its closest landmark point as

$$\bar\epsilon_K = \frac{1}{N} \sum_{j=1}^{N} \| z_j - u_{k(j)} \|, \qquad (6)$$

where $k(j)$ is the index of the closest landmark to instance $z_j$. As expected, $\bar\epsilon_K$ will decay with the number of landmarks $K$ at the following rate (distortion):

$$\bar\epsilon_K \approx c_d \, C(p) \, K^{-1/d}, \qquad (7)$$

where $c_d$ is a dimension-dependent factor determined by $V_d$, the volume of the unit ball in $d$-dimensional Euclidean space, and $C(p)$ is a factor depending on the distribution $p$.

Since $\bar\epsilon_K$ is the average distortion error, we can make sure that there exists a non-empty subset $\mathcal{S}$ of instances such that $\|z_j - u_{k(j)}\| \leq \bar\epsilon_K$ for $z_j \in \mathcal{S}$. Next we only consider this subset of instances, and the relevant set of landmarks will be denoted by $\mathcal{U}_{\mathcal{S}}$. For the landmarks in $\mathcal{U}_{\mathcal{S}}$, we make the realistic assumption that there are enough instances so that we can always find one instance falling in the middle of a landmark $u_k$ and its closest landmark neighbor $u_{k'}$. In this case, we can bound the distance between the closest landmark pairs as $\|u_k - u_{k'}\| \leq 2\bar\epsilon_K$.

For any such pair, assume that the angle spanned by them is $\theta$. We can bound the angle between the two landmark vectors by

$$\sin\frac{\theta}{2} \;\leq\; \frac{\|u_k - u_{k'}\|}{2R} \;\leq\; \frac{\bar\epsilon_K}{R}, \qquad (8)$$

where $R$ is the maximum $\ell_2$-norm of the landmark vectors in $\mathcal{U}_{\mathcal{S}}$. We can then lower-bound the normalized correlation between close landmark pairs, and hence the coherence of the landmarks, as

$$\mu^2(U) \;\geq\; \cos^2\theta \;=\; 1 - \sin^2\theta \;\geq\; 1 - 4\sin^2\frac{\theta}{2} \;\geq\; 1 - \frac{4\,\bar\epsilon_K^2}{R^2} \;\approx\; 1 - \frac{4\, c_d^2\, C(p)^2}{R^2}\, K^{-2/d}.$$

This indicates that the squared mutual coherence of the landmarks has a lower bound that consistently increases when the number of landmark vectors, $K$, increases in a dictionary learning process. ∎

This theorem provides important guidance on the choice of structural resolution. It shows that when a clustering-based dictionary learning scheme is used to determine the structural landmarks, the size of the dictionary cannot be chosen too large, or else the risk of overfitting can be huge. Note that exact sub-structure matching, as is often practiced in current graph mining tasks, corresponds to an extreme case where the number of landmarks, $K$, equals the number of unique sub-structures; it should therefore be avoided in practice. The structural landmarking scheme is a flexible framework to tune the number of landmarks and avoid overfitting.

7.2 Choice of Spatial and Structural Resolutions

The spatial resolution determines the "size" of the local sub-structure (or sub-graph), such as a functional module in a molecule. Small sub-structures can be very limited in terms of their representation power, while overly large sub-structures can mask the right scale of the local components crucial to the learning task. The optimal spatial resolution can be data-dependent. In practice, we restrict the size of the local sub-graphs to 3-hop BFS neighborhoods, considering that the "radius" of the graphs in the benchmark data sets is usually around 5-8. We then further fine-tune the spatial resolution by assigning a non-negative weighting to the nodes residing on different layers from the central node in the local subgraph. Such weighting is shared across all sub-graphs and can be used to adjust the importance of each layer of the BFS-based sub-graph. The weighting can be chosen as a monotonically decaying function, or optimized through learning.

The choice of structural resolution has a similar flavor in that neither too small nor too large a resolution is desirable. On the other hand, it can be adjusted conveniently by tuning the landmark set size $K$ based on the validation data. In our experiments, $K$ could be chosen by cross-validation; for simplicity, we fix it.

Finally, note that geometrically larger substructures (or sub-graphs) are characterized by higher variation among instances due to the exponential number of configurations. Therefore, the structural resolution should also be commensurate with the spatial resolution. For example, substructures constructed by 1-hop BFS may use a smaller landmark size than those constructed by 3-hop BFS. In our experiments we do not consider such dependencies yet, but will study them in future research.

7.3 Comparison with Graph Kernels

Graph kernels are powerful methods to measure the similarity between graphs. The key idea is to compare the sub-structure pairs from the two graphs and compute the accumulated similarity, where examples of substructures include random walks, paths, sub-graphs, or sub-trees. Among them, paths/sub-graphs/sub-trees are deterministic sub-structures in a graph, while random walks are stochastic sequences (of nodes) in a graph.

Although the SLIM network considers sub-structures as the basic processing unit, it has a number of important differences from graph kernels. First, we consider optimizable sub-structural landmarks, which are dependent on the class labels and therefore discriminative; in comparison, the sub-structures considered in graph kernels are identified by enumerating or sampling among a large number of pre-determined candidates. Second, the similarity measured by graph kernels is between a pair of sub-structures across two graphs; in comparison, the SLIM network models the interacting relations within each graph as its features. Third, it can be difficult to interpret graph kernels due to the nonlinearity of kernel methods and the exponential number of sub-structures; in comparison, the SLIM network maintains a reasonable number of "landmark" structures and so can provide informative clues about the prediction result.

7.4 Hierarchical Version

7.4.1 Subtlety in Spatial Resolution Definition

First, we would like to clarify a subtlety in the definition of spatial resolution. In physics, resolution is defined as the smallest distance (or interval) between two objects that can be separated; it therefore involves two scales: the scale of the object, and the scale of the interval. Usually these two scales are proportional. In other words, you cannot have large intervals and small objects, or the opposite (a small interval and large objects). For example, in the context of imaging, each object is a pixel and the size of the pixel is the same as the interval between two adjacent pixels.

In the context of graphs, each object is a sub-graph centered around one node, whose scale is manually determined by the order of the BFS search centered around that node. Therefore, the interval between two sub-graphs may be smaller than the size of the sub-graph. For example, suppose two nodes $u$ and $v$ are direct neighbors, and each of them has a 3-hop sub-graph. Then the interval between these two subgraphs, if defined by the distance between $u$ and $v$, is 1 hop; this is smaller than the size of the two sub-graphs, which is 3 hops. In other words, the two objects/subgraphs indeed overlap with each other, and the scale of the object and the scale of the interval between objects are no longer commensurate (large objects and small intervals in this scenario).

This scenario makes it less complete to define spatial resolution just based on the size of the sub-graphs (as in the main text), since there are actually two scales to define. To avoid unnecessary confusion, we skip these details. In practice, one has two choices for dealing with the discrepancy: (1) require that the sub-graphs do not overlap, i.e., we do not have to grow one $k$-hop subgraph around each node; instead, we just explore a subset of the sub-graphs. This can be implemented in a hierarchical version, which we discuss in the next subsection. (2) Still allow each node to have a local sub-graph and study them together, which helps cover the diversity of subgraphs: theoretically, an ideal choice of subgraph is highly domain-specific, and having more sub-graph examples gives a better chance of including those sub-graphs that are beneficial to the prediction task.

7.4.2 Hierarchical SLIM

We can implement a hierarchical version of SLIM so that sub-graphs of different scales, together with the interacting relations between sub-graphs under each scale, can be captured for the final prediction. Note that in dif_pool, a hierarchical clustering scheme is used to partition a graph, in a bottom-up manner, into fewer and fewer clusters. We can implement the same idea and construct a hierarchy of scales, each of which hosts a number of sub-structures. The structural landmarking scheme is then applied in each layer of the hierarchy to generate graph-level features specific to that scale. Finally, these features can be combined for graph classification.

7.5 Semi-supervised SLIM Network

The SLIM network is flexible and can be trained in both a fully supervised setting and a semi-supervised setting. This is because the SLIM model takes a parametric form, so it is inductive and can generalize to new samples; on the other hand, the clustering-based loss term in (3) can be evaluated on both labeled and unlabeled samples, providing the extra flexibility to look into the distribution of the testing samples during the training phase, if they are available. This is very similar in flavor to the smoothness constraint widely used in semi-supervised learning, such as graph-regularized manifold learning (mani_reg). Therefore, the SLIM network can be implemented in the following modes:

  • Supervised version. Only training graphs and their labels are available during the training phase, and the loss function (3) is only computed on the training samples.

  • Semi-supervised version. Both labeled training graphs and unlabeled testing graphs are available. The loss function (3) is computed on both the training and testing graphs, while the classification loss is only evaluated on the training graph labels.

7.6 Interpretability

The SLIM network not only generates accurate predictions in graph classification problems, but can also provide important clues for interpreting the prediction results, because the graph-level features in SLIM bear a clear physical meaning. For example, assume that we use the interaction matrix $H_i$ of the $i$-th graph as its feature representation; the $(k, k')$-th entry then quantifies the connectivity strength between the $k$-th sub-structure landmark and the $k'$-th landmark. Then, by checking the $K \times K$ model coefficients of the fully-connected layer, one can tell which subset of substructure connectivities (i.e., which pairs of substructures directly connected in a graph) is important in making the prediction. To improve interpretability, one can further impose a sparsity constraint on the model coefficients.
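As a hypothetical illustration (variable and function names are ours), one could rank the coefficients of a linear prediction layer acting on the reshaped interaction matrix to see which landmark-landmark connections drive the prediction:

    import numpy as np

    def top_interactions(w, K, top=5):
        """Rank landmark pairs by the magnitude of their classifier coefficients.

        w: (K*K,) weight vector of a linear (or first FC) layer acting on the
           reshaped interaction matrix; returns the most influential landmark pairs.
        """
        W = w.reshape(K, K)
        pairs = [(abs(W[i, j]), i, j) for i in range(K) for j in range(K)]
        pairs.sort(reverse=True)
        return [(i, j, W[i, j]) for _, i, j in pairs[:top]]

    # Usage: with K = 4 landmarks, inspect the 5 strongest substructure interactions.
    w = np.random.randn(16)
    print(top_interactions(w, K=4))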

In traditional graph neural networks such as GraphSAGE or GIN, node features are transformed through many layers and finally mingled together through graph pooling. The resultant graph-level representation, whose dimension is manually determined and whose every entry pools values across all the nodes in the graph, can be difficult to interpret.

7.7 The Prediction Layer

The SLIM network renders various possibilities to generate the prediction layer.

  • Fully connected layer. The interaction matrix can be reshaped into a vector, or transformed to a smaller matrix via bilateral dimension reduction before being reshaped into a vector. A fully connected layer then follows for the final prediction.

  • Landmark graph. Each graph can be transformed into a landmark graph with a fixed number of (landmark) nodes, with $d_i$ and $H_i$ quantifying the weight of each node and of the edge between every pair of nodes, and $M_i$ giving the feature of each node (see the definitions in Section 3.3). This graph can then be fed to a graph convolution to generate a fixed-dimensional graph-level feature, without having to deal with the varying graph size. We will study this in our future experiments.

  • Riemannian manifold. When using the interaction matrix $H_i$ or its normalized version as the graph-level feature, we can treat each graph as a point on a Riemannian manifold, thanks to the symmetry and positive semi-definiteness of the representation. The distance between two interaction matrices can then be computed as the Wasserstein distance between two Gaussian distributions with the interaction matrices as covariances, which has a closed form. We will study this in our future experiments.
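For reference, a minimal sketch of this closed form (the 2-Wasserstein, or Bures, distance between zero-mean Gaussians; symbols and function names are ours):

    import numpy as np
    from scipy.linalg import sqrtm

    def wasserstein_gaussian(S1, S2):
        """W_2 distance between N(0, S1) and N(0, S2) for PSD interaction matrices.

        W_2^2 = tr(S1) + tr(S2) - 2 * tr((S1^{1/2} S2 S1^{1/2})^{1/2})
        """
        root = sqrtm(S1)
        cross = sqrtm(root @ S2 @ root)
        d2 = np.trace(S1) + np.trace(S2) - 2.0 * np.trace(cross.real)
        return np.sqrt(max(d2.real, 0.0))

    # Usage: compare two graphs via their (normalized) interaction matrices.
    S1 = np.array([[2.0, 0.3], [0.3, 1.0]])
    S2 = np.array([[1.5, 0.1], [0.1, 0.8]])
    print(wasserstein_gaussian(S1, S2))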

7.8 Interaction versus Integration

The SLIM network and existing GNNs represent two different flavors of learning, namely, interaction modelling versus an integration approach. Interaction modelling is based on a mature understanding of complex systems and can provide physically meaningful interpretation or support for graph classification; integration-based approaches bypass the difficulty of preserving the identities of sub-structures and instead focus on whether the integrated representation is an injective mapping, as typically studied in graph isomorphism testing.

Note that an ideal classifier is different from an isomorphism test and is not injective. In a good classifier, deciding which samples are similar and which are distinct are equally important goals. Hence the tradeoff between handling similarity and distinctness. Isomorphism-flavored GNNs aim at preserving the differences between local sub-structures (even very minute differences), and then map the resultant embedding to the class labels. Our approach, on the other hand, tries to absorb patterns that are sufficiently close into the same landmark, and then maps the landmark-based features to class labels. In the latter case, the structural resolution can be tuned flexibly to explore different fineness levels, thus tuning the balance between "similarity" and "distinctness"; in the meantime, the structural landmarks allow preserving sub-structure identities and exploiting their interactions.