
Explain Graph Neural Networks to Understand Weighted Graph Features in Node Classification

Real data collected from different applications often have additional topological structure and connection information, and are thus amenable to being represented as weighted graphs. Considering the node labeling problem, Graph Neural Networks (GNNs) are a powerful tool that can mimic experts' decisions on node labeling. GNNs combine node features, connection patterns, and graph structure by using a neural network to embed node information and pass it through edges in the graph. We want to identify the patterns in the input data used by a GNN model to make a decision and examine whether the model works as we desire. However, due to the complex data representation and non-linear transformations, explaining decisions made by GNNs is challenging. In this work, we propose new graph-feature explanation methods to identify the informative components and important node features. Besides, we propose a pipeline to identify the key factors used for node classification. We use four datasets (two synthetic and two real) to validate our methods. Our results demonstrate that our explanation approach can mimic data patterns used for node classification by human interpretation and disentangle different features in the graphs. Furthermore, our explanation methods can be used for understanding data, debugging GNN models, and examining model decisions.



1 Introduction

Our contemporary society relies heavily on interpersonal/cultural relations (social networks), and our economy is densely connected and structured (commercial relations, financial transfers, supply/distribution chains). Moreover, such complex network structures also appear in nature, in biological systems like the brain and the vascular and nervous systems, and in chemical systems, for instance the bonds between atoms in molecules. Since these data are highly structured and depend heavily on the relations within the networks, it makes sense to represent the data as a graph, where nodes represent entities and edges the connections between them.

Graph Neural Networks (GNNs), such as GCN [kipf2016semi] and GraphSage [hamilton2017inductive], can handle graph-structured data by preserving the information structure of graphs. Our primary focus is the node labeling problem; examples are fraud detection, classification of social-network users, and role assignment on biological structures, among others. GNNs can combine node features, connection patterns, and graph structure by using a neural network to embed node information and pass it through edges in the graph. However, due to the complex data representation and the non-linear transformations performed on the data, explaining decisions made by GNNs is a challenging problem. Therefore, we want to identify the patterns in the input data that a given GNN model used to make a decision and examine whether the model works as we desire, as depicted in Figure 1.

Although deep-learning model visualization techniques have been developed for convolutional neural networks (CNNs), those methods are not directly applicable to explaining weighted graphs with node features for the classification task. A few works have been done on explaining GNNs ([pope2019explainability, baldassarre2019explainability, ying2019gnn, yang2019interpretable]). However, to the best of our knowledge, no work has been done on explaining comprehensive features (namely node features, edge features, and connecting patterns) in a weighted graph, especially for node classification problems. Here we propose a few post-hoc graph-feature explanation methods to formulate explanations on nodes and edges. Our experiments on synthetic and real data demonstrate that our proposed methods and pipeline can generate explanations and evidence similar to human interpretation. Furthermore, this helps to understand whether the node features or the graph topology are the key factors used in GNN node classification of a weighted graph.

Figure 1: Framework to explain GNN node classification.

Our contribution is summarized as follows:

  1. We formulate the explanation of weighted graph patterns learned by a GNN from two perspectives: Informative Components Detection and Node Feature Importance.

  2. We extend the current GNN explanation methods, which mainly focus on undirected, unweighted graphs, to directed weighted graphs. We adapt well-known CNN visualization methods to GNN explanation.

  3. We propose a pipeline, including novel evaluation methods, to find whether topological information or node features are the key factors in node classification. We also propose a way to discover group similarities from the disentangled results.

Paper structure: In Section 2, we introduce graphs and GNNs. In Section 3, the formulation of graph explanation is described, and the corresponding methods are developed in Sections 4 and 5. In Section 6, we propose the evaluation metrics and methods. The experiments and results are presented in Section 7. We conclude the paper in Section 8.

2 Graph Neural Networks

2.1 Data Representation – Weighted Graph

In this section, we introduce the necessary notation and definitions. We denote a graph by $G = (V, E)$, where $V$ is the set of nodes and $E$ the set of edges linking the nodes. For every pair of connected nodes $(v, u) \in E$, we denote by $e_{vu} \in \mathbb{R}$ the weight of the edge linking them, and collect these weights in the weighted adjacency matrix $W$, where $W[v, u] = e_{vu}$. For each node $u$, we associate a $d$-dimensional vector of features, $x_u \in \mathbb{R}^d$, and denote the set of all features as $X = \{x_u : u \in V\}$.

Edge features contain important information about graphs. For instance, a graph may represent a banking system, where the nodes represent different banks and the edges are the transactions between them; or a graph may represent a social network, where the nodes represent different users and the edges are the contact frequencies between the users.

We consider a node classification task, where each node $u$ is assigned a label $y_u \in \{1, \dots, C\}$. The two explanation perspectives correspond to explanations on the informative components ($V$, $E$) and the node features ($X$) of the weighted graph.

2.2 GNN Utilizing Edge Weight

Different from the state-of-the-art GNN architectures, i.e., graph convolution networks (GCN) [kipf2016semi] and graph attention networks (GAT) [velivckovic2017graph], some GNNs can exploit the edge information on the graph [gong2019exploiting, shang2018edge, yang2019interpretable]. Here, we consider weighted and directed graphs, and develop a graph neural network that uses both node features and edge weights, where edge weights affect message aggregation. Our approach not only handles directed and weighted graphs but also preserves edge information in the propagation of the GNN. Preserving and using edge information is important in many real-world graphs, such as banking payment networks, recommendation systems (that use social networks), and other systems that rely heavily on the topology of the connections, since, apart from node (atomic) features, attributes of edges (bonds) are also important for predicting local and global properties of graphs. Generally speaking, GNNs inductively learn a node representation by recursively aggregating and transforming the feature vectors of its neighboring nodes. Following [battaglia2018relational, zhang2018deep, zhou2018graph], a per-layer update of the GNN in our setting involves three computations, message passing Eq. (1), message aggregation Eq. (2), and updating the node representation Eq. (3), which can be expressed as:


$$m_{vu}^{(l)} = \mathrm{MSG}\big(h_v^{(l-1)}, h_u^{(l-1)}, e_{vu}\big) \qquad (1)$$
$$M_u^{(l)} = \mathrm{AGG}\big(\{ m_{vu}^{(l)} : v \in \mathcal{N}_u \}\big) \qquad (2)$$
$$h_u^{(l)} = \mathrm{UPDATE}\big(M_u^{(l)}, h_u^{(l-1)}\big) \qquad (3)$$

where $h_u^{(l)}$ is the embedded representation of node $u$ on layer $l$; $e_{vu}$ is the weighted edge pointing from $v$ to $u$; and $\mathcal{N}_u$ is $u$'s neighborhood, from which it collects information to update its aggregated message $M_u^{(l)}$. Specifically, $h_u^{(0)} = x_u$ as initialization, and $h_u^{(L)}$ is the final embedding for node $u$ of an $L$-layer GNN node classifier.
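As a concrete illustration, the per-layer computation of Eqs. (1)-(3) can be sketched in plain Python. The tiny three-node graph, the weighted-sum message, and the ReLU update below are illustrative assumptions, not the paper's exact model:

```python
# Minimal sketch of one message-passing layer on a weighted, directed graph.
# MSG/AGG/UPDATE mirror Eqs. (1)-(3); graph and dimensions are toy assumptions.

# weighted directed edges: (source v, target u, weight e_vu)
edges = [(0, 1, 0.5), (2, 1, 1.5), (1, 2, 1.0)]
h = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}  # h^(l-1), 2-d embeddings

def MSG(h_v, e_vu):                    # Eq. (1): scale source embedding by edge weight
    return [e_vu * x for x in h_v]

def AGG(messages):                     # Eq. (2): sum incoming messages
    return [sum(xs) for xs in zip(*messages)] if messages else [0.0, 0.0]

def UPDATE(m_u, h_u):                  # Eq. (3): here, ReLU(m_u + h_u)
    return [max(0.0, a + b) for a, b in zip(m_u, h_u)]

h_next = {}
for u in h:
    incoming = [MSG(h[v], w) for (v, t, w) in edges if t == u]
    h_next[u] = UPDATE(AGG(incoming), h[u])

print(h_next[1])  # node 1 aggregates from nodes 0 and 2
```

Note how a node with no incoming edges (node 0) simply keeps its ReLU-updated own embedding, while node 1 mixes its neighbors' embeddings in proportion to the edge weights.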

Here, following [schlichtkrull2018modeling], we set $h_u^{(0)} = x_u$ and define the propagation model for calculating the forward-pass update of a node representation as:

$$h_u^{(l+1)} = \sigma\Big(\sum_{v \in \mathcal{N}_u} g\big(h_v^{(l)}, e_{vu}\big)\, \Theta^{(l)}\Big) \qquad (4)$$

where $\mathcal{N}_u$ denotes the set of neighbors of node $u$ and $e_{vu}$ denotes the directed edge from $v$ to $u$; $\Theta^{(l)} \in \mathbb{R}^{d_l \times d_{l+1}}$ denotes the model's parameters to be learned, and $g$ is any linear/nonlinear function that can be applied to neighbor nodes' feature embeddings; $d_l$ is the dimension of the $l$-th layer representation.

Our method can deal with negatively weighted edges by re-normalizing them to a positive interval, for instance $[0, 1]$; therefore, in the following we use only positively weighted edges. In the existing literature, and depending on the experimental setting and the nature of the input graph, edge weights normally play one of two roles: 1) message filtering; and 2) node embedding.

2.2.1 Type I: Edge Weights for Message Filtering

As with the graph convolution operations in [gong2019exploiting], the edge weights are used as filters that multiply the node feature matrix. The GNN layer using edge weights for filtering can be formed by the following steps:

$$m_{vu}^{(l)} = g\big(h_v^{(l)}\big)\, \hat{e}_{vu} \qquad \text{(message)} \quad (5)$$
$$M_u^{(l)} = \sum_{v \in \mathcal{N}_u} m_{vu}^{(l)} \qquad \text{(aggregate)} \quad (6)$$
$$h_u^{(l+1)} = \sigma\big(M_u^{(l)}\, \Theta^{(l)}\big) \qquad \text{(update)} \quad (7)$$

To avoid increasing the scale of output features by multiplication, the edge weights need to be normalized, as in GAT [velivckovic2017graph] and GCN [kipf2016semi]. Due to the aggregation mechanism, we normalize the weights by the weighted in-degree of node $u$: $\hat{e}_{vu} = e_{vu} / \sum_{v' \in \mathcal{N}_u} e_{v'u}$. Depending on the problem:

  • $g$ can simply be the identity, $g(h_v^{(l)}) = h_v^{(l)}$; or

  • $g$ can be a gate function, such as an RNN-type block applied to $h_v^{(l)}$, e.g., a GRU.
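A minimal sketch of the Type-I filtering step (Eqs. (5)-(6)) with in-degree normalization, using $g$ as the identity; the toy graph and values are illustrative assumptions:

```python
# Sketch of Type-I message filtering: normalize incoming edge weights by the
# weighted in-degree of the target node, then aggregate weighted messages.

edges = [(0, 2, 2.0), (1, 2, 6.0)]          # (v, u, e_vu): edges into node 2
h = {0: [1.0, 2.0], 1: [3.0, 4.0]}          # source node embeddings

# e_hat_vu = e_vu / sum of weights entering u
in_deg = {}
for v, u, w in edges:
    in_deg[u] = in_deg.get(u, 0.0) + w
e_hat = {(v, u): w / in_deg[u] for v, u, w in edges}

# message + aggregate with g = identity
m2 = [0.0, 0.0]
for (v, u), w in e_hat.items():
    if u == 2:
        m2 = [a + w * b for a, b in zip(m2, h[v])]
print(m2)
```

Because the weights entering each node sum to one after normalization, the aggregated message stays on the same scale as the node embeddings, which is the point of the normalization step.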

2.2.2 Type II: Edge Weights for Node Embedding

If the edge weight $e_{vu}$ contributes to node $v$'s feature embedding, we write $\tilde{h}_v^{(l)} = \phi\big(h_v^{(l)}, e_{vu}\big)$, where $\phi$ is the composition of one fully-connected (FC) layer and a reshape operation, mapping the node embedding and edge weight to a new node embedding. In this case, we replace equations (5) and (6) by:

$$m_{vu}^{(l)} = g\big(\tilde{h}_v^{(l)}\big) = g\big(\phi(h_v^{(l)}, e_{vu})\big) \qquad \text{(message)} \quad (8)$$
$$M_u^{(l)} = \sum_{v \in \mathcal{N}_u} m_{vu}^{(l)} \qquad \text{(aggregate)} \quad (9)$$

Similarly, $g$ can be the identity or a gate function.
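A sketch of the Type-II message (Eq. (8)), where the edge weight enters the node embedding through a small fully-connected map before aggregation; the weights of the FC map and the toy graph are illustrative assumptions:

```python
# Sketch of Type-II embedding: phi is a tiny FC layer on [h_v, e_vu],
# then messages are summed (g = identity).

def phi(h_v, e_vu):
    # FC layer on the concatenated input [h_v, e_vu]: out_k = sum_j W[k][j]*inp[j]
    inp = h_v + [e_vu]
    W = [[1.0, 0.0, 1.0],
         [0.0, 1.0, -1.0]]
    return [sum(wk * x for wk, x in zip(row, inp)) for row in W]

edges = [(0, 1, 0.5), (2, 1, 2.0)]          # (v, u, e_vu): edges into node 1
h = {0: [1.0, 1.0], 2: [2.0, 0.0]}

M1 = [0.0, 0.0]                             # aggregated message at node 1
for v, u, w in edges:
    if u == 1:
        msg = phi(h[v], w)                  # edge weight folded into the embedding
        M1 = [a + b for a, b in zip(M1, msg)]
print(M1)
```

Unlike Type I, the edge weight here is a learned input to the embedding rather than a fixed multiplicative filter, which is why Type II reduces to the unweighted-graph setting discussed below.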

For the final prediction, we apply a fully-connected (FC) layer:

$$\hat{y}_u = \mathrm{softmax}\big(\mathrm{FC}(h_u^{(L)})\big) \qquad (10)$$
Since Type II can be converted to an unweighted-graph explanation, which has been studied in the existing literature [ying2019gnn, pope2019explainability], the following explanation focuses on Type I. For generality, we focus on model-agnostic, post-hoc explanation, without retraining the GNN or modifying pre-trained GNN architectures.

3 Formulation of Graph Explanation

We consider the weighted graph feature explanation problem as a two-stage pipeline.

First, we train a node classification function, in this case a GNN. The GNN inputs are a graph $G$, its associated node features $X$, and its true node labels $Y$. We represent this classifier as $\Phi$, where $\hat{Y} = \Phi(G, X)$. The advantage of the GNN is that it keeps the flow of information across nodes and the structure of the data. Furthermore, it is invariant to permutations of the node ordering; hence it keeps the relational inductive biases of the input data (see [battaglia2018relational]).

Second, given the node classification model $\Phi$ and a node's true label $y_u$, the explanation part provides a subgraph $G_S$ and a subset of features $X_S$, retrieved from the $L$-hop neighborhood of each node $u$, where $L$ is the number of GNN layers. Theoretically, the subgraph, along with the subset of features, is the minimal set of information and information flow across the neighbor nodes of $u$ that the GNN used to compute the node's label.

We define $G_S = (V_S, E_S)$ to be a subgraph of $G$, where $V_S \subseteq V$ and $E_S \subseteq E$. Considering the classification of a node $u$, our Weighted Graph Explanation method has two explanation components:

  • Informative Components Detection. Our method computes a subgraph $G_S$ containing $u$ that aims to explain the classification task by looking at the edge connectivity patterns $E_S$ and their connecting nodes $V_S$. This provides insights into the characteristics of the graph that contribute to the node's label.

  • Node feature Importance. Our method assigns to each node feature a score indicating its importance and ranking.

4 Informative Components Detection

Relational structures in graphs often contain crucial information for node classification, such as graph’s topology and information flow (i.e., direction and amplitude). Therefore, knowing which edges contribute the most to the information flow towards or from a node is important to understand the node classification evidence. In this section, we discuss methods to identify the informative components on weighted graphs.

4.1 Computational graph

Due to the properties of the GNN (2), we only need to consider the graph structure used in aggregation, i.e., the computational graph w.r.t. node $u$, defined as $G_u^c$, containing $N$ nodes. The node feature set associated with $G_u^c$ is $X_u^c$. The prediction of the GNN is given by $\hat{y}_u = \Phi(G_u^c, X_u^c)$, so $\Phi$ can be considered as a distribution mapping by the GNN. Our goal is to identify a subgraph $G_S \subseteq G_u^c$ (and its associated features $X_S$, or a subset of them) which the GNN uses to predict $u$'s label. In the following subsections, we introduce two approaches to detect explainable components within the computational graph: 1) Maximal Mutual Information (MMI) Mask; and 2) Guided Gradient Salience.

4.2 Maximal Mutual Information (MMI) Mask

We first introduce some definitions. We define the Shannon entropy of a discrete random variable $Y$ by $H(Y) = -\sum_{y} P(y) \log P(y)$, where $P$ is the probability mass function. Furthermore, the conditional entropy is defined as:

$$H(Y \mid X) = -\sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)},$$

where $\mathcal{X}$ and $\mathcal{Y}$ are the sample spaces. Finally, we define the mutual information (MI) between two random variables as $I(X; Y) = H(Y) - H(Y \mid X)$; this measures the mutual dependence between both variables.
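The identity $I(X;Y) = H(Y) - H(Y \mid X)$ used below can be checked numerically on a toy joint distribution; the probability table is an illustrative assumption:

```python
import math

# Numeric check of I(X;Y) = H(Y) - H(Y|X) on a small joint distribution p(x, y).

p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}   # joint p(x, y)
px = {x: sum(v for (a, _), v in p.items() if a == x) for x in (0, 1)}
py = {y: sum(v for (_, b), v in p.items() if b == y) for y in (0, 1)}

H_y = -sum(v * math.log2(v) for v in py.values())                        # H(Y)
H_y_given_x = -sum(v * math.log2(v / px[x]) for (x, _), v in p.items())  # H(Y|X)
I_xy = H_y - H_y_given_x
print(round(I_xy, 4))
```

Since the joint distribution above couples X and Y, the MI comes out strictly positive; for an independent pair it would be zero.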

Using ideas from information theory [cover2012elements] and following GNNExplainer [ying2019gnn], the informative explainable subgraph $G_S$ and node feature subset $X_S$ are chosen to maximize the mutual information (MI):

$$\max_{G_S, X_S} I\big(Y; (G_S, X_S)\big) = H(Y) - H\big(Y \mid G = G_S, X = X_S\big) \qquad (11)$$

Since the trained GNN node classifier $\Phi$ is fixed, the $H(Y)$ term of Eq. (11) is constant. As a result, it is equivalent to minimize the conditional entropy $H(Y \mid G = G_S, X = X_S)$:

$$H(Y \mid G = G_S, X = X_S) = -\mathbb{E}_{Y \mid G_S, X_S}\big[\log P_\Phi(Y \mid G = G_S, X = X_S)\big] \qquad (12)$$
Therefore, the explanation of the graph components with prediction power w.r.t. node $u$'s prediction is a subgraph $G_S$ and its associated feature set $X_S$ that minimize (12). The objective of the explanation thus aims to pick the top informative edges and their connecting neighbors, which form a subgraph, for predicting $u$'s label. Some edges in $u$'s computational graph form important message-passing (6) pathways, which allow useful node information to be propagated across the graph and aggregated at $u$ for prediction, while other edges might not be informative for prediction. Instead of directly optimizing $G_S$ in Eq. (12), which is intractable because there are exponentially many discrete structures $G_S \subseteq G_u^c$, GNNExplainer [ying2019gnn] optimizes a mask on the binary adjacency matrix, which allows gradient descent to be performed on $G_S$.

If the edge weights are used for node embedding (Type II), the connections can be treated as binary and fit into the original GNNExplainer. However, if the edge weights are used for filtering (Type I), the mask should affect filtering and normalization. We extend the original GNNExplainer method by considering edge weights and improving the method with extra regularization. Unlike GNNExplainer, where there are no constraints on the mask value, we constrain the values learned by the mask to $[0, 1]$ and perform projected gradient descent optimization. Therefore, rather than optimizing a relaxed adjacency matrix as in GNNExplainer, we optimize a mask $M \in \mathbb{R}^Q$ on the weighted edges, supposing there are $Q$ edges in $G_u^c$. The masked edge weights are $E_c \odot \sigma(M)$, where $\odot$ is element-wise multiplication and $\sigma$ bounds the mask values to $[0, 1]$. The objective function can then be written as:

$$\min_M \; -\sum_{c=1}^{C} \mathbb{1}[y_u = c]\, \log P_\Phi\big(Y = c \mid G = (V_c, E_c \odot \sigma(M)), X = X_u^c\big) \qquad (13)$$

In GNNExplainer, the top edges may not form a connected component including the node under prediction, $u$. Hence, we add the entropy of the masked weights of the edges pointing to node $u$ as a regularization term, to ensure that at least one edge connected to node $u$ is selected. After the mask is learned, we use a threshold to remove low-weight edges and isolated nodes. Our proposed optimization method, which maximizes the mutual information (Eq. (11)) under the above constraints, is shown in Algorithm 1.

Input: 1. $G_u^c$: computational graph of node $u$; 2. pre-trained GNN model $\Phi$; 3. $y_u$: node $u$'s real label; 4. $M$: learnable mask; 5. $K$: number of optimization iterations; 6. $L$: number of layers of the GNN.

1: initialize mask $M$
2: $h_v^{(0)} \leftarrow x_v$, for all $v$ in $G_u^c$
3: for $k = 1$ to $K$ do
4:      renormalize masked edge weights
5:     for $l = 1$ to $L$ do
6:          message (Eq. (5), with masked weights)
7:          aggregate (Eq. (6))
8:          update (Eq. (7))
9:     end for
10:      predict on the masked graph
11:      compute the loss (Eq. (13) plus regularization)
12:      update mask $M$ by projected gradient descent
13: end for


Algorithm 1 Optimize mask for weighted graph
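The core loop of Algorithm 1 can be illustrated on a toy problem. Below, a fixed logistic scorer with two edge contributions stands in for the pre-trained GNN (an illustrative assumption, not the paper's model); edge 0 is informative and edge 1 is noise, and projected gradient descent on the mask with an L1 sparsity penalty recovers that:

```python
import math

# Toy version of Algorithm 1: learn a mask over two weighted edges by
# projected gradient descent, clipping mask values to [0, 1] each step.

def loss(m):
    # classification loss on the masked graph + L1 sparsity penalty
    logit = 3.0 * m[0] - 0.5 * m[1]       # edge 0 helps the true class, edge 1 hurts
    p = 1.0 / (1.0 + math.exp(-logit))    # model's score for the true label
    return -math.log(p) + 0.1 * (m[0] + m[1])

m = [0.5, 0.5]                            # initialize mask
lr, eps = 0.5, 1e-6
for _ in range(200):
    g = []
    for i in range(2):                    # finite-difference gradient
        mp = list(m); mp[i] += eps
        g.append((loss(mp) - loss(m)) / eps)
    # gradient step, then project back onto [0, 1]
    m = [min(1.0, max(0.0, mi - lr * gi)) for mi, gi in zip(m, g)]

print([round(x, 2) for x in m])           # informative edge kept, noise edge dropped
```

In the real method the gradient comes from backpropagation through the GNN rather than finite differences, but the projection step that keeps the mask in $[0, 1]$ is the same.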

4.3 Guided Gradient (GGD) Salience

Guided gradient-based explanation [simonyan2013deep] is perhaps the most straightforward approach. By differentiating the output w.r.t. the model input and keeping the positive part, a salience score can be obtained. The gradient-based score can be used to indicate the relative importance of an input feature, since it represents the change in input space that corresponds to the maximal positive rate of change in the model output. Since edge weights are variables in the GNN, we can obtain the edge salience as

$$g_{vu} = \mathrm{ReLU}\left(\frac{\partial \hat{y}_u^c}{\partial e_{vu}}\right) \qquad (14)$$

where $c$ is the correct class of node $u$, and $\hat{y}_u^c$ is the score for class $c$ before the softmax layer. We normalize $g_{vu}$ by dividing by $\max_{v} g_{vu}$ to bound it to $[0, 1]$. We then select the edges whose $g_{vu}$ are among the top largest, together with their connecting nodes. The advantage of the gradient salience method is that it is easy to compute.
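A sketch of the edge salience computation, using finite differences in place of backpropagation; the linear scorer stands in for the class-$c$ logit of a trained GNN (an illustrative assumption):

```python
# Guided-gradient edge salience: differentiate the class score w.r.t. each
# edge weight, keep positive gradients (ReLU), normalize by the maximum.

def class_score(w):
    # toy logit for the correct class as a function of the edge weights
    return 2.0 * w[0] + 0.2 * w[1] - 1.0 * w[2]

w = [1.0, 0.5, 0.8]                       # current edge weights
eps = 1e-6
grads = []
for i in range(len(w)):
    wp = list(w); wp[i] += eps
    grads.append((class_score(wp) - class_score(w)) / eps)

relu = [max(0.0, g) for g in grads]       # keep edges that increase the score
top = max(relu)
salience = [g / top for g in relu]        # bound scores to [0, 1]
print([round(s, 2) for s in salience])
```

Edges with negative gradients (those that would lower the class score) receive zero salience, so only supporting edges survive the top-k selection.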

5 Node Feature Importance

Node features play an important role in computing messages between nodes; they contribute to the message passing in the message layer (see Eq. (1)). Therefore, the explanation for the classification task (or others, like regression) must take the feature information into account. In this section, we discuss three approaches to define node feature importance in the case where the node attribute is a vector containing multiple features.

5.1 Maximal Mutual Information (MMI) Mask

Following GNNExplainer [ying2019gnn], in addition to learning a mask on the edges to maximize mutual information, we can also learn a mask on the node attributes to filter features, given $G_S$. The filtered node features are $X_S = X \odot \sigma(M_T)$, where $M_T$ is a feature-selection mask matrix to be learned, optimized by Eq. (11).

In order to calculate the output given $G_S$ but without a given feature, while still guaranteeing propagation, a reparametrization of $X$ is used in [ying2019gnn]:

$$X_S = Z + (X - Z) \odot \sigma(M_T),$$

where $Z$ is a matrix with the same dimensions as $X$, and each of its entries is sampled from a Gaussian distribution with the mean and std of the corresponding feature dimension of $X$. To minimize the objective function, when a dimension is not important, any sample of $Z$ will pull the corresponding mask value towards 0; if a dimension is very important, the mask value will go towards 1. Again, we constrain the mask values to $[0, 1]$ and perform projected gradient descent optimization.
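A sketch of the reparametrization above: a "removed" feature dimension is replaced by Gaussian noise with the empirical mean and std of that dimension. The matrix shapes, values, and mask logits are illustrative assumptions:

```python
import math
import random

# Feature-mask reparametrization: X_S = Z + (X - Z) * sigmoid(M),
# with Z drawn per feature dimension from N(mean, std) of the data.

random.seed(0)
X = [[1.0, 10.0], [2.0, 12.0], [3.0, 14.0]]    # 3 nodes, 2 features
M = [8.0, -8.0]                                 # mask logits: keep dim 0, drop dim 1

def col_stats(X, j):
    vals = [row[j] for row in X]
    mu = sum(vals) / len(vals)
    sd = math.sqrt(sum((v - mu) ** 2 for v in vals) / len(vals))
    return mu, sd

sig = [1.0 / (1.0 + math.exp(-m)) for m in M]
X_S = []
for row in X:
    new = []
    for j, x in enumerate(row):
        mu, sd = col_stats(X, j)
        z = random.gauss(mu, sd)                # noise sample for dimension j
        new.append(z + (x - z) * sig[j])
    X_S.append(new)

# sigmoid(8) ~ 1, so dimension 0 passes through almost unchanged,
# while dimension 1 (sigmoid(-8) ~ 0) is replaced by noise
print(X_S[0])
```

This makes the "feature removal" differentiable in the mask logits, which is what allows gradient descent on $M_T$ in the first place.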

However, before performing optimization on $M_T$, $Z$ is only sampled once. Different samples of $Z$ may affect the optimized $M_T$, resulting in unstable results. Performing multiple samplings of $Z$ would be time-consuming, since each sample is followed by an optimization operation on $M_T$.

5.2 Prediction Difference Analysis (PDA)

We propose using PDA for node feature importance, which can cheaply perform multiple random samplings within GNN testing time. The importance of a node feature towards the correct prediction can be measured as the drop in the prediction score for the actual class after removing that feature. We denote by $X_{\setminus i}$ the feature set with feature $x_i$ removed. The prediction score of the corrupted node is $P(\hat{y}_u \mid X_{\setminus i})$. To compute it, we need to marginalize out the feature $x_i$:

$$P(\hat{y}_u \mid X_{\setminus i}) = \mathbb{E}_{x_i \sim p(x_i \mid X_{\setminus i})}\, P(\hat{y}_u \mid X_{\setminus i}, x_i)$$
Modeling $p(x_i \mid X_{\setminus i})$ with a generative model can be computationally intensive and may not be feasible. We instead empirically sample $x_i$ from the training data. Noting that the training data may be unbalanced, to reduce sampling bias we balance the number of samples drawn from each class's feature space rather than sampling in proportion to the number of training instances in each class. We define the importance score for node feature $x_i$ as the difference from the original prediction score:

$$\mathrm{PDA}_i = P(\hat{y}_u = y_u \mid X) - P(\hat{y}_u = y_u \mid X_{\setminus i})$$

Naturally, $\mathrm{PDA}_i$ is bounded in $[-1, 1]$; a larger $\mathrm{PDA}_i$ indicates a more important feature.
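A sketch of PDA for one node feature: replace the held-out feature by values sampled from the training data and report the drop in the true-class probability. The logistic scorer and replacement pools are illustrative assumptions standing in for the trained GNN and the class-balanced training samples:

```python
import math

# Prediction Difference Analysis: empirical marginalization of one feature.

def predict(x):
    # toy model: probability of the true class from two features,
    # where feature 0 carries most of the signal
    logit = 2.0 * x[0] + 0.1 * x[1]
    return 1.0 / (1.0 + math.exp(-logit))

x = [1.0, 1.0]                             # node under explanation
train_vals = {0: [1.0, -1.0, 0.5, -0.5],   # candidate replacements, feature 0
              1: [1.0, -1.0, 0.5, -0.5]}   # candidate replacements, feature 1

def pda(i):
    base = predict(x)
    scores = []
    for v in train_vals[i]:                # empirical marginalization over samples
        xc = list(x); xc[i] = v
        scores.append(predict(xc))
    return base - sum(scores) / len(scores)

print(pda(0) > pda(1))                     # feature 0 matters more
```

Because only forward passes are needed, many replacement samples can be averaged at test time, which is what makes PDA cheaper than re-optimizing the feature mask for every noise sample.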

5.3 Guided Gradient (GGD) Node Feature Salience

Similar to the guided gradient method for detecting explainable components, we differentiate the class score with respect to the features of the node under prediction and of its neighbors in its computational graph, for each feature $x_i$:

$$g_{x_i} = \mathrm{ReLU}\left(\frac{\partial \hat{y}_u^c}{\partial x_i}\right)$$

The larger $g_{x_i}$ is, the more important the feature $x_i$ is.

6 Evaluation Metrics and Methods

For synthetic data, we can compare the explanation with the data generation rules. However, for real data, we do not have ground truth for the explanation. In order to evaluate the results, we propose evaluation metrics to quantitatively measure the explanation results, and we propose correlation methods to validate whether the edge connection pattern or the node features are the crucial factor for classification.

Figure 2: Disentangling informative subgraphs and node features.

6.1 Evaluation Metrics

We define the metrics consistency, contrastivity, and sparsity (here, the definitions of contrastivity and sparsity differ from the ones in [pope2019explainability]) to measure informative component detection results. First, to measure the similarity between graphs, we introduce the graph edit distance (GED) [abu2015exact], a graph similarity measure analogous to the Levenshtein distance for strings. It is defined as the minimum cost of an edit path (a sequence of node and edge edit operations) transforming graph G1 into a graph isomorphic to G2. When the structures are isomorphic but the edge weights differ (GED = 0), the Jensen-Shannon divergence (JSD) [nielsen2010family] is added to the GED to further compare the two isomorphic subgraphs. Specifically, we define consistency as the GED between the informative subgraphs of nodes in the same class, measuring whether the informative components detected for nodes of the same class are consistent; we define contrastivity as the GED between the informative subgraphs of nodes in different classes, measuring whether the informative components detected for nodes of different classes are contrastive; and we define sparsity as the density of the mask $M$, i.e., the density of the component edge importance weights.
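The JSD used in the GED = 0 case can be sketched directly on two normalized edge-weight distributions; the weight values below are illustrative assumptions:

```python
import math

# Jensen-Shannon divergence between the edge-weight distributions of two
# isomorphic informative subgraphs (the tie-breaker when GED = 0).

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.7, 0.2, 0.1]     # normalized edge weights, subgraph 1
q = [0.1, 0.2, 0.7]     # normalized edge weights, subgraph 2
print(round(jsd(p, q), 3))
```

JSD is symmetric and zero only when the two weight distributions match, so adding it to GED penalizes subgraphs that share a topology but distribute importance over different edges.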

6.2 Important features disentanglement

We follow the pipeline described in Figure 2. After training a GNN, we perform informative component detection and node importance analysis on each node $u$, obtaining the local topology that explains the labeling of that node. Then, for each label, we collect all the subgraphs that explain that label, and we measure the distance, using the predefined GED, from all the subgraphs for each label to all the subgraphs for all labels. We thus obtain a set of distances between the instances within each class and across classes. Similarly, for each label, we collect all the node feature salience vectors that explain that label, and measure their similarity using the predefined Pearson correlation, so that we obtain a set of correlations between the instances within each class and across classes.

As the last step, we group the distances and correlations by class pairs and take the average over the instances in each class pair. Therefore, we generate a distance map for informative components and a similarity map for node feature salience. The key features should have high consistency within the groups and high contrastivity across different classes; therefore, we examine the distance map and similarity map of the given graph and GNN classifier. If topology information contributes significantly to the GNN, the diagonal entries of the distance map should be small, while the other entries should be large. When node features are the key factors for node labeling, the diagonal entries of the similarity map should be large, while the other entries should be small. From those maps, not only can we examine whether the detected informative components or the node features are meaningful for node classification, but we can also find which classes have similar informative components or important node features.
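The class-pair averaging step can be sketched as follows; the instance labels and pairwise distances stand in for GED values and are illustrative assumptions:

```python
# Build a class-pair distance map: average pairwise explanation-subgraph
# distances within and across classes.

labels = [0, 0, 1, 1]
# dist[i][j]: distance between explanation subgraphs of instances i and j
dist = [[0.0, 0.2, 1.0, 0.9],
        [0.2, 0.0, 1.1, 1.0],
        [1.0, 1.1, 0.0, 0.3],
        [0.9, 1.0, 0.3, 0.0]]

classes = sorted(set(labels))
dmap = {}
for a in classes:
    for b in classes:
        vals = [dist[i][j]
                for i in range(len(labels)) for j in range(len(labels))
                if labels[i] == a and labels[j] == b and i != j]
        dmap[(a, b)] = sum(vals) / len(vals)

# small diagonal / large off-diagonal => topology is a key factor
print(dmap[(0, 0)], dmap[(0, 1)])
```

Here the within-class average is much smaller than the cross-class average, the pattern that, per the text, indicates topology drives the classification.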

7 Experiments

Figure 3: Synthetic BA-house graph data and corresponding edge weights; each BA node belongs to class "0," and each "house"-shape node is labeled "1"-"3" based on its motif. The node orders are denoted.

The topic addressed in this work, model-agnostic post-hoc explanation for GNNs, is quite new, and few previous studies can be compared to our work. For example, Pope et al. [pope2019explainability] formulated GNNs differently, relying on the adjacency matrix, and the attention method in [ying2019gnn] is model-specific; therefore, those methods are not easily adopted. We mainly compare with the original MMI mask proposed in GNNExplainer [ying2019gnn]. Furthermore, to our knowledge, the graph feature importance disentanglement pipeline is first proposed here. We simulated synthetic data and compared the results with human interpretation to demonstrate the feasibility of our methods. Note that the color codes for all the figures below follow the ones denoted in Figure 3. The red node is the node we try to classify and explain.

7.1 Synthetic Data 1 - SynComp

Following [ying2019gnn], we generated a Barabási-Albert (BA) graph and attached five-node house-structure graph motifs to randomly selected nodes, ending with 65 nodes in total (Figure 3). We created a small graph for visualization purposes; however, the experimental results hold for large graphs. Several natural and human-made systems, including the Internet, citation networks, social networks, and banking payment systems, can be thought of as approximately BA graphs, which contain a few nodes (hubs) with unusually high degree and a large number of poorly connected nodes. The edges connecting different node pairs were assigned the different weights denoted in Figure 3, including one edge weight that we will discuss later. Then, we added noise to the synthetic data by uniformly randomly adding edges. In order to ensure the node label is determined by the motif only, all nodes were given the same constant 2-D node attributes.

We used the simple (identity) form of $g$ in Eq. (5). The parameter settings were input_dim = 2, hidden_dim = 8, num_layers = 3, and epoch = 300. We randomly split the nodes into training and testing sets, and the GNN achieved high accuracy on both. We performed informative component detection (keeping the top 6 edges) and compared the components with the human interpretation, the "house" shape, which can be used as a reality check (Table 1). GNNExplainer [ying2019gnn] performed worse in this case because it ignores the weights on the edges, so the blue nodes were usually included in the informative subgraphs. In Figure 4, we show the explanation results for nodes at the same position but with different topological structures (rows a & b) and compare how edge weights affect the results (rows a & d). We also show the results generated by different methods (rows a & c).

Figure 4: Informative components. Row a) is for a node in class 1 not connected to class 0 nodes, using the MMI mask. Row b) is for a node in class 1 connected to class 0 nodes, using the MMI mask. Row c) is for a node in class 1 connected to class 0 nodes, using GGD. Row d) is for a node in class 1 connected to class 0 nodes using the MMI mask, but with a different edge-weight setting.
Method MMI mask GGD GNNExplainer [ying2019gnn]
AUC 0.899 0.804

(Measured on all the nodes in class 1)

Table 1: Saliency component compared with ’house’ shape.

7.2 Synthetic Data 2 - SynNode

In order to ensure the node labels were determined by node features only, we constructed a graph with BA topology and designed 2-D node attributes for each node in the graph. We generated random normally distributed noise for each node $u$; one entry of the node attribute vector was assigned this noise value, while the other entry's value depended on the real label of $u$. We constructed a graph containing 60 nodes and randomly removed half of the edges to make it sparse. We used the same training model and methods as in SynComp. For the quantitative measurement of node feature importance, we calculated the accuracy of classifying the label-dependent entry as the important feature. Then we applied the softmax function to the node feature importance vectors and calculated their mean squared error (MSE) with respect to the ground truth. Last, we theoretically estimated the computational complexity. We show the measurements on one example node in Table 2, where $S$ is the number of sampling times.

Method MMI mask PDA GGD
Time cost Train Test Test

(Repeated 10 times; mean ± std)

Table 2: Compare importance score with ground truth.

7.3 Citation Network Data

The PubMed dataset [pubmed] contains 19717 scientific publications pertaining to diabetes, classified into one of three classes: "Diabetes Mellitus, Experimental," "Diabetes Mellitus Type 1," and "Diabetes Mellitus Type 2." The citation network built on PubMed consists of 44338 links. Each publication in the dataset is described by a TF-IDF weighted word vector from a dictionary of 500 unique words. The edge attribute is defined as the positive Pearson correlation of the node attributes. We randomly split the nodes into training and testing sets. The GNN used edge weights for message filtering (Type I). The parameter settings were hidden_dim = 32, num_layers = 3, and epoch = 1000. The learning rate was initialized to 0.1 and halved every 100 epochs. We achieved an accuracy of 0.786 on training data and 0.742 on testing data. We selected the top 20 edges in both MMI and GGD, and show the overlapping informative component detection results for an example in each class in Figure 5. We can clearly see the pattern that those nodes were correctly classified because they connect to nodes in the same class.

Figure 5: Overlapping Informative components detected by MMI mask and GGD for the examples of each class.

For the selected examples, we used the above three node feature importance methods to vote for the top 10 important features. Specifically, we first ranked the feature (keywords in the publications) importance by each method. Different nodes' features might be ranked differently by different methods. Then we summed the rank of each feature over the three methods; the smaller the summed rank, the more important the feature. The top 10 ranked keywords were "children," "type 2," "iddm," "type 1," "insulindepend," "noninsulindepend," "autoimmun," "hypoglycemia," "oral," and "fast." We consulted two diabetes experts, who validated that "type 2," "iddm," and "noninsulindepend" are directly related to publications of class "Diabetes Mellitus Type 2"; "autoimmun," "children," "hypoglycemia," "insulindepend," and "type 1" are closely associated with class "Diabetes Mellitus Type 1"; and "oral" and "fast" are common experimental methods in class "Diabetes Mellitus, Experimental."

7.4 Bitcoin OTC Data

Bitcoin is a cryptocurrency that is used for trading anonymously; due to this anonymity, there is counterparty risk. We use the Bitcoin OTC dataset [kumar2018rev2], collected over one month, where Bitcoin users rate the level of trust in the users they made transactions with. The rating scale is from -10 to +10 (excluding 0); according to the OTC guidelines, the higher the rating, the more trustworthy the user. We labeled users who received at least one negative score as risky; users more than half of whose received ratings were greater than one as trustworthy; users who did not receive any ratings as the unknown group; and the remaining users as the neutral group. We chose the rating network data at one time point, which contained 1447 users and 5739 rating records. We renormalized the edge weights to a positive interval. Then we trained a GNN on the labeled nodes only and performed classification on the rest of the nodes. We chose $g$ as a gate function, and the other settings were hidden_dim = 32, num_layers = 3, and epoch = 1000. The learning rate was initialized to 0.1 and halved every 100 epochs. We achieved an accuracy of 0.730 on the training dataset and 0.632 on the testing dataset. Finally, we show the explanation results using the MMI mask, since it is more interpretable (see Figure 6), and compare them with possible human reasoning. The informative component of a risky node contains negative ratings; the major ratings to a trustworthy node are greater than one; and a neutral node receives many ratings equal to one. The informative components match the rules by which we labeled the nodes.

Figure 6: Informative subgraph detected by MMI mask (showing the original rating scores on the edges).

Using both real datasets, we measured consistency, contrastivity, and sparsity by selecting the top 4 important edges. The results on the two real datasets are listed in Table 3.

Method  Dataset  Consistency  Contrastivity  Sparsity
MMI     PubMed   2.00         1.99           0.022
MMI     BitCoin  1.81         2.45           0.132
GGD     PubMed   2.14         2.07           0.049
GGD     BitCoin  2.05         2.60           0.151

(Averaged over 50 randomly selected, correctly classified nodes in each class)

Table 3: Evaluation of informative components.

7.5 Feature Importance Disentanglement

We performed the disentanglement experiment on the SynComp, SynNode, and PubMed datasets, because these datasets have both node and edge features. For the PubMed dataset, we randomly selected 50 correctly classified nodes in each class to calculate the statistics. Since we had multiple explanation methods, we calculated the distance maps and similarity maps for each method and averaged them over the methods. The distance maps, calculated on the subgraphs with the top 4 informative edges, are shown in Figures 7(a)-(c). From the distance maps, we can see that the connectivity pattern is a key factor for classifying the nodes in SynComp, but not in SynNode: for the SynComp dataset, in-class distances were smaller than cross-class distances, whereas the distance maps for SynNode and PubMed did not show this pattern. The distance map also shows that nodes in classes 2 and 3 had the most distinguishable informative components, while those of classes 0 and 1 are similar. In the similarity maps (Figures 7(d), 7(e), and 7(f)), the SynNode and PubMed datasets had much higher similarities within a class than across classes. Combining the distance maps and similarity maps for each dataset, we conclude that topology was the critical factor for node classification in SynComp, while node features were the key factor for SynNode and PubMed.
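The map averaging and the in-class vs. cross-class comparison can be sketched as follows (a pure-Python illustration; the function names and the simple mean-based heuristic are our assumptions, not the original pipeline):

```python
def average_maps(maps):
    """Element-wise average of class-by-class distance (or similarity)
    maps, one map per explanation method."""
    n = len(maps[0])
    return [[sum(m[i][j] for m in maps) / len(maps) for j in range(n)]
            for i in range(n)]

def topology_is_key(distance_map):
    """Heuristic from the text: topology is a key factor when in-class
    distances (diagonal) are smaller than cross-class distances
    (off-diagonal entries)."""
    n = len(distance_map)
    diag = [distance_map[i][i] for i in range(n)]
    off = [distance_map[i][j] for i in range(n) for j in range(n) if i != j]
    return sum(diag) / len(diag) < sum(off) / len(off)
```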

Figure 7: Explanation disentangle using maps: (a) SynComp Informative Subgraphs Distance Map; (b) SynNode Informative Subgraphs Distance Map; (c) PubMed Informative Subgraphs Distance Map; (d) SynComp Node Salience Similarity Map; (e) SynNode Node Salience Similarity Map; (f) PubMed Node Salience Similarity Map.

8 Conclusion

In this work, we formulate the explanation of weighted graph features used by GNNs for node classification from two perspectives: informative component detection and node feature importance, which together provide comprehensive explanations of the feature patterns used by a GNN. We also propose evaluation metrics to validate the explanation results and a pipeline to determine whether topology information or node features contribute more to the node classification task. The explanations may help with debugging, feature engineering, informing human decision-making, building trust, and increasing the transparency of graph neural networks, among other uses. Our future work will include extending the explanation to graphs with multi-dimensional edge features and explaining other graph learning tasks, such as link prediction and graph classification.



Figure 8: Comparing with GNNExplainer [ying2019gnn] on SynComp dataset: a) the motif and corresponding edge weights; b) the human interpretation of the informative component to classify a node (colored in red) as class ”1”; c) informative components of nodes classified as class ”1” detected by our proposed MMI mask, which correctly detects the house structure; d) and informative components of nodes classified as class ”1” detected by GNNExplainer [ying2019gnn], which wrongly includes the unimportant connection to BA graph. Node orders are denoted in c) and d).

We compared our proposed MMI mask with GNNExplainer [ying2019gnn] for informative component detection on weighted graphs. Given the pretrained GNN, GNNExplainer learns a mask on the edges, using a sigmoid function to bound each entry of the mask to (0, 1). The mask entries are then used as the edge weights input to the GNN.
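The masking step can be illustrated with a minimal sketch (the fixed logits below stand in for values GNNExplainer would learn by gradient descent; variable names are ours). Note that the sigmoid-bounded mask replaces the original edge weights outright, which is exactly why the original weights get ignored:

```python
import math

def sigmoid(x):
    """Squash a real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# One free logit per edge; in GNNExplainer these are optimized, and the
# resulting mask entries are fed to the GNN as the new edge weights.
mask_logits = [2.0, -1.5, 0.0]
edge_mask = [sigmoid(m) for m in mask_logits]  # each entry in (0, 1)
```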

Recall that we created the SynComp dataset by generating a Barabasi-Albert (BA) graph with 15 nodes and attaching 10 five-node house-structure motifs to 10 random BA nodes. Each BA node belongs to class "0" and is colored blue. Each node on a "house" belongs to class "1"-"3" based on its position in the motif: nodes on the house shoulders (colored green) belong to class "1"; nodes on the house bottom (colored purple) belong to class "2"; and the node on the house top (colored orange) belongs to class "3". We detected the informative components for all house-shoulder nodes that also connect to a BA node. We assigned these connections small edge weights in the SynComp dataset (shown in Figure 8 a) with all edge weights denoted), meaning the connection is unimportant compared to the other edges.
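A rough sketch of the SynComp construction follows (the preferential-attachment rule is simplified to a single random attachment per node, edge weights are omitted, and the node ordering within each motif is our assumption):

```python
import random

def make_syncomp(seed=0):
    """Sketch of SynComp: a 15-node BA-style base graph plus ten 5-node
    "house" motifs, each attached to a random base node via its shoulder.
    Labels: 0 = base graph, 1 = shoulder, 2 = bottom, 3 = top.
    """
    rng = random.Random(seed)
    edges, labels = [(0, 1)], {0: 0, 1: 0}
    targets = [0, 1]
    # Simplified growth of the base graph on nodes 0..14.
    for node in range(2, 15):
        edges.append((node, rng.choice(targets)))
        targets.append(node)
        labels[node] = 0
    # Attach ten house motifs; nodes b,b+1 = bottom, b+2,b+3 = shoulders,
    # b+4 = top. One shoulder links (weakly, in the paper) to a base node.
    for i in range(10):
        b = 15 + 5 * i
        edges += [(b, b + 1), (b, b + 2), (b + 1, b + 3),
                  (b + 2, b + 3), (b + 2, b + 4), (b + 3, b + 4)]
        labels.update({b: 2, b + 1: 2, b + 2: 1, b + 3: 1, b + 4: 3})
        edges.append((rng.randrange(15), b + 2))
    return edges, labels

edges, labels = make_syncomp()
```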

The informative component detection results are shown in Figure 8 c) and d) for our proposed method and GNNExplainer, respectively. We used the human interpretation that a node on a house shoulder should belong to class "1" as ground truth; therefore, the ground-truth informative component for classifying a node into class "1" should be a "house" structure (shown in Figure 8 b), with the node to be classified colored red), because whether or not the node connects to a BA node, it belongs to class "1" as long as it is on a house shoulder. Our method accurately detects the "house" structure, while directly applying GNNExplainer to the weighted graph wrongly includes the edge to the BA node, since GNNExplainer ignores edge weights.