I Introduction
Graphs represent relations between entities and have been used to model social networks [tang2009relational], biological networks [Marinka2017], and online reviews [rayana2015collective]. On prediction tasks on graphs, such as node classification, link prediction, and graph classification [kipf2017gcn, chen2018fastgcn, velivckovic2018gat, hamilton2017graphsage], graph neural networks (GNNs) exploit the relations to aggregate information in a neighborhood of each node and achieve state-of-the-art predictive performance. However, the aggregations over many nodes multiple hops away make GNN predictions too opaque to be understood and trusted by humans. Explanations of GNN predictions try to simplify the computation to deliver societal merits, such as justifying the predictions, fulfilling legal regulations [Goodman2017], and enabling algorithmic recourse [Ustun2019, Russell2019, Barocas2020fat]. For example, when warning an online shopper about fraud detected using a GNN on a review graph [rayana2015collective], the user may ask “why am I a victim of fraud?” and expect an explanation such as “the website you’re viewing has connections with certain suspicious IP addresses”. We focus on explaining GNN predictions made on graph nodes.
Table I: Comparison of explanation algorithms (DiverseCF, Recourse, Convex, DeepLIFT, LIME, GNNLIME, GNNExplainer, Attention, and the proposed method) by properties (surrogate, gradient, search, simulatability, counterfactual, blackbox, and human evaluation).
According to [Lewis1986], “To explain an event is to provide some information about its causal history”, and the explanation of a prediction can be defined in two ways. First, an explanation can be a causal chain consisting of a forward mapping from inputs and model parameters (the “causes”) to the model prediction (the “outcome”) via steps of computation. An explanation with good simulatability allows humans to more easily forward simulate the causal chain (possibly a simplified version). Second, an explanation can be in the form of counterfactuals [Miller2019]: one event is said to have caused another if, in the counterfactual where the first event did not happen, the second would not have happened. Counterfactuals allow humans to see the impact of the cause on the outcome, and prior works show that humans do counterfactual reasoning in their day-to-day life [Miller2019, Binns2018chi]. We define counterfactual relevance as the amount of change in the probability that the outcome happens when the cause is altered. There are additional desiderata. To give humans a better sense of the causal relationship, an explanation of an outcome should be robust to perturbations irrelevant to the cause but sensitive to changes in the cause. Diverse counterfactuals for algorithmic recourse with minimal changes to the decision subjects, e.g., a human on a dating site, allow human agency in decision-making [Ustun2019].

Explaining GNNs is gaining more attention, and yet there is no study of the interactions between the two metrics, simulatability and counterfactual relevance, from the human and computational perspectives. Gradient-based methods [baldassarre2019explainability, pope2019explainability] use magnitudes of gradients to highlight important edges or node features. Such methods aim at counterfactuals since the gradients indicate how fast the prediction (the “outcome”) changes with respect to small perturbations in the highlighted input (the “cause”). Learning-based explanation methods, including GNNExplainer [ying2019gnn] and GNNLIME [graphlime], extract a simpler surrogate model to faithfully approximate a GNN prediction and thus promote simulatability, without considering counterfactual relevance. Explanation methods based on gradients [ying2019gnn] need to access the target model as a white box and may break privacy and security constraints. See Table I and the related work for comparisons.
Inspired by the two modes of human thinking studied in psychology [kahneman2011thinking], we hypothesize that human perception of an explanation is a function of both metrics. We conjecture a cognitive process where humans first intuitively make sense of the outcome in a lightweight forward simulation using an explanation (System 1), and then perform more effortful counterfactual reasoning (System 2) to figure out a cause of the outcome. If the explanation is rejected due to low simulatability in the first phase, humans will be less willing to seek the causes. Fig. 1 shows a pictorial representation of the hypothesis, with four categories of explanations. Gradient-based explanations are in the high counterfactual relevance, low simulatability category (region A), and GNNExplainer has no guarantee of high counterfactual relevance but aims to achieve high simulatability (region D). Table III in Section V shows the quantitative evaluation of these methods.
To test the above hypotheses on GNNs, we adopt simple (small, acyclic, and connected) subgraphs as explanations for forward simulation of a GNN prediction on a node. Each explanation is associated with a counterfactual explanation that has some elements removed from the explanation to flip the prediction. We generate explanations and counterfactuals in the four categories shown in Fig. 1 and measure how simulatability and counterfactual relevance interact to influence human perception of the explanations. Statistical analyses show that: 1) a low simulatability can, but does not always, prevent the adoption of an explanation, making counterfactual reasoning less relevant; 2) conditioned on a high simulatability, a high counterfactual relevance improves human acceptance of the explanation.
Despite the joint effect of the two metrics on humans, current methods do not jointly maximize simulatability and counterfactual relevance. Since the two metrics can compete and trade-offs are necessary, we define Pareto efficient explanations and formulate a multi-objective optimization problem to model the trade-offs. Since the target model is a black box, we design a depth-first search algorithm that accesses only the zeroth-order information of the model, i.e., the predictions, to identify Pareto efficient subgraph explanations. Explanation search algorithms, such as those based on (mixed) integer programming [Ustun2019, Russell2019] and subgraph enumeration [Yoshida2019kdd], employ similar searches, and yet they perform single-objective optimization. Though less expensive, gradient-based approaches [ying2018graph] are white-box methods and only find node/edge importance, while the generation of connected graphs still requires exhaustive search. Further, we provide an analysis of the lack of robustness of gradient-based GNN explanations. In contrast, we empirically verify the robustness and sensitivity of the optimal explaining subgraphs found by the proposed algorithm. Although we strive for causal explanations, we are cautious and formulate the GNN as a Structural Equation Model (SEM) to prove that confounders can exist in a subgraph explanation, so users must be cautioned that the found counterfactuals are not “the” causes of the predictions. Lastly, we extensively verify that the proposed algorithm dominates single-objective baselines in both metrics on 9 datasets.
II Problem Formulation
Assume that we have a GNN with multiple layers trained to predict class distributions of the nodes on a graph, which consists of a set of nodes and a set of edges connecting the nodes. Each node has a set of neighbors. On each layer, and for any node, the GNN computes the representation of the node using messages sent to it from its neighbors, by the following operations:
(1)  
(2)  
(3) 
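As an illustration of Eqs. (1)-(3), the following minimal Python sketch runs one message-passing layer. The function names `msg`, `agg`, `update`, and `gnn_layer` are hypothetical, and the concrete choices (the message is the sender's representation, AGG is the elementwise sum, UPDATE is a linear map followed by ReLU) are assumptions picked from the options the text lists, not necessarily the exact model used here.

```python
from typing import Dict, List

Vector = List[float]


def msg(h_sender: Vector, h_receiver: Vector) -> Vector:
    # MSG (assumed form): the message is a copy of the sender's representation.
    return list(h_sender)


def agg(messages: List[Vector], dim: int) -> Vector:
    # AGG (assumed form): elementwise sum of all incoming messages.
    total = [0.0] * dim
    for message in messages:
        for i, value in enumerate(message):
            total[i] += value
    return total


def update(weight: List[Vector], aggregated: Vector) -> Vector:
    # UPDATE (assumed form): linear map by `weight`, then ReLU.
    return [max(0.0, sum(w * x for w, x in zip(row, aggregated))) for row in weight]


def gnn_layer(h: Dict[int, Vector],
              neighbors: Dict[int, List[int]],
              weight: List[Vector]) -> Dict[int, Vector]:
    """Compute the next-layer representation of every node."""
    new_h = {}
    for v, h_v in h.items():
        messages = [msg(h[u], h_v) for u in neighbors.get(v, [])]
        new_h[v] = update(weight, agg(messages, len(h_v)))
    return new_h
```

Stacking several calls to `gnn_layer`, one per layer, and applying a softmax to the last output would give the class distribution of each node under these assumptions.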
The MSG function computes the message vector sent from a neighbor to the node. The AGG function aggregates the messages sent from all neighbors to the node and can be the elementwise sum, average, or maximum of the messages. The UPDATE function uses the parameters of the layer to map the aggregated message to the representation of the node on that layer. One example applies the parameters to the aggregated message, followed by some nonlinear mapping such as ReLU. The input feature vector of a node is regarded as its representation before the first layer. The output of the GNN on a node is its representation on the last layer, which can be softmaxed to the node class distribution (a vector of class probabilities). The parameters of all layers of the GNN are trained end-to-end on labeled nodes of the graph. We define an explanation of the prediction on a target node to be a subgraph of the input graph that contains the target node [ying2019gnn]. Besides being agnostic to the above details of architecture and parameters, we desire the following properties of the explanations.

Simulatability. A comprehensible explanation should be simulatable, defined by the following two aspects. The simplicity of an explanation is related to the limit of human cognitive bandwidth [Miller1956], and sparsity is used as a proxy of simplicity [Du2019, Guidotti2018, ying2019gnn]. We say that the explaining subgraph is sparse if it contains no more than a preset number of nodes. Due to the sparsity, the explaining subgraph does not allow the full computation carried out on the full graph, and the faithfulness of the explaining subgraph measures how well it can reproduce the prediction generated on the full graph. Similar to [Suermondt1992], we measure faithfulness using the symmetric KL-divergence between the prediction on the full graph and that on the explaining subgraph (the larger, the better):
(4) 
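A minimal Python sketch of the symmetric KL-divergence between two class distributions follows. The helper names `kl` and `symmetric_kl` and the smoothing constant `eps` are assumptions for illustration. The sketch returns the raw divergence (zero for identical predictions) and leaves the sign convention of Eq. (4) to the caller.

```python
import math
from typing import Sequence


def kl(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    # KL(p || q); `eps` (an assumed smoothing constant) avoids log(0).
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p: Sequence[float], q: Sequence[float]) -> float:
    # Symmetric KL-divergence: KL(p || q) + KL(q || p).
    return kl(p, q) + kl(q, p)


# Hypothetical predictions on the full graph and on an explaining subgraph.
full_graph_prediction = [0.7, 0.2, 0.1]
subgraph_prediction = [0.6, 0.3, 0.1]
divergence = symmetric_kl(full_graph_prediction, subgraph_prediction)
```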
Counterfactual relevance. Let the above-defined explaining subgraph be a “fact”. A counterfactual of the fact is a perturbation of it. We restrict the counterfactual to be a strict subgraph of the fact. The difference between the fact and the counterfactual is the removed portion, whose size we record, so that adding the removed portion back to the counterfactual reconstructs the fact. The target GNN model generates one class distribution of the target node on the fact and another on the counterfactual. We define the counterfactual relevance [Miller2019] of the pair of fact and counterfactual when explaining the prediction as
(5) 
The difference in faithfulness between the fact and the counterfactual can be positive, negative, or zero. Because the score represents the faithfulness, the absolute difference measures the change in the class distribution of the target node as approximated by the fact and by the counterfactual. When the counterfactual relevance is large, the portion removed from the fact is likely to be the cause of the prediction [guo2020survey]. The normalizer, the size of the removed portion, makes sure that the same difference caused by a small removed portion is more desirable than that caused by a larger one. It also prohibits extreme counterfactuals that remove all nodes except the target. These quantities are illustrated in Fig. 2.
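The description above can be sketched in Python as follows, assuming that Eq. (5) is the absolute change in faithfulness divided by the size of the removed portion, which is how the prose reads. The function name `counterfactual_relevance` and its argument names are hypothetical.

```python
def counterfactual_relevance(faithfulness_fact: float,
                             faithfulness_counterfactual: float,
                             removed_size: int) -> float:
    """Absolute faithfulness change per removed element (assumed form of Eq. (5))."""
    if removed_size <= 0:
        # A counterfactual must remove something from the fact.
        raise ValueError("the removed portion must be non-empty")
    return abs(faithfulness_fact - faithfulness_counterfactual) / removed_size
```

Under this form, the same faithfulness change scores higher when it is caused by removing fewer elements, which matches the role of the normalizer.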
III How Humans Perceive Explanations
“System 1 operates automatically and quickly … System 2 allocates attention to the effortful mental activities …”
Daniel Kahneman, Nobel laureate
We conducted a human subject study to find the roles of the two metrics in the human perception of explanations. The two modes of thinking, System 1 and System 2, are extensively studied in psychology, as quoted above. We conjecture that forward simulations help humans quickly screen an explanation using System 1, while reasoning using the counterfactual is a more deliberate process that requires System 2, so that humans will conduct counterfactual reasoning only after the explanation has passed System 1 screening. Simulatability and counterfactual relevance measure how well an explanation and an associated counterfactual are received by the two Systems.
Following Fig. 1, on the Cora dataset, we sample five target nodes and, for each node, generate subgraphs with low and high simulatability. This leads to ten explaining subgraphs for each subject to evaluate for simulatability. For each explaining subgraph, we further generate two counterfactuals that are subgraphs of it, with different counterfactual relevance.

Fig. 3 shows one sample test case. For each of the five nodes, a subject sees the original graph on which the GNN produced the prediction, the explaining subgraph with the prediction computed on it, and two counterfactuals with their two predictions. The full graph is considered too complicated for interpretation, while the explaining subgraph is more intelligible. The two counterfactuals allow a subject to evaluate whether a removed part is a plausible cause of the prediction. For each graph, we color the nodes based on the GNN’s prediction, so that a subject can relate a prediction to the neighbors. We show the predicted class distributions in histograms, so that the predictions across the (sub)graphs can be compared conveniently. The subjects were not told about the two metrics of the explanations but needed to understand, analyze, and then rate the explanations.
To avoid bias, we framed the survey as an evaluation of a graph-based search engine and recruited subjects with search experience using Google Scholar. The authors of this paper were excluded. The two counterfactuals were randomly ordered. Each subject was further trained on two additional sample cases. During the test phase, we asked subjects the following questions after each test case and collected feedback on a 5-point Likert scale (1: very little (won’t accept), 2: little, 3: not sure, 4: a little, 5: very well):

a) Simulatability: How well do you think the second subgraph reproduces the prediction computed in the first graph?

b) Counterfactual 1: How much do you think the removed component in the third subgraph is an important factor leading to the histogram for the second subgraph, had it not been removed?

c) Counterfactual 2: Same as above, but replacing the third subgraph with the fourth subgraph.

d) Explanation acceptance: How much will you accept the probabilities, if they were computed on the second subgraph rather than the first?
III-A Analysis of human feedback
The questions quantitatively reveal the human perception of the two explanation metrics. The response to question a measures the subject’s perceived simulatability of the explanation. The difference between the responses to questions b and c measures the subject’s preference between the two alternative counterfactuals. The response to question d measures the subject’s overall acceptance of the subgraph as an explanation, based on its simulatability and the plausibility of the causes found using the counterfactuals. After filtering out an obvious outlier (a subject whose responses to all questions are the same), we have 10 subjects’ responses to 10 test cases, leading to 100 scores for each of the four questions. We draw the following conclusions based on statistical analyses.
High simulatability helps acceptance that can be boosted by high counterfactual relevance. Using the collected responses, a two-way analysis of variance (ANOVA) shows that the two metrics interact strongly (as indicated by the p-value). Fig. 3(a) and 3(b) confirm that a high simulatability is a prerequisite of explanation acceptance, with high counterfactual relevance being the second condition. A low simulatability leads to more mixed acceptance, regardless of counterfactual relevance. There are some cases of acceptance with low simulatability, due to the subjects’ in-depth analysis of the cases that leads to a final acceptance.

Simulatability can predict acceptance of explanations. We conducted a t-test on the responses from two groups: one has cases with low simulatability and the other has cases with high simulatability. The p-value is almost zero, indicating that the degree of acceptance differs significantly between the groups. After taking into account the within-group variances and the sample size, we conclude that the acceptance of a less simulatable explanation is lower than that of a more simulatable explanation. Fig. 3(b) further confirms this conclusion.
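To make the role of the within-group variances and the sample size concrete, the following Python sketch computes a two-sample t statistic in Welch's form. Which variant of the test was used is not stated here, so Welch's form is an assumption, and the function name `welch_t` and the example scores are hypothetical, not the study's data.

```python
import math
from statistics import mean, variance
from typing import Sequence


def welch_t(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Two-sample t statistic with unequal variances (Welch's form, an assumed choice)."""
    # The standard error combines each group's sample variance and size.
    standard_error = math.sqrt(variance(group_a) / len(group_a)
                               + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / standard_error


# Hypothetical 5-point Likert acceptance scores, for illustration only.
high_simulatability_scores = [4, 5, 4, 5, 4]
low_simulatability_scores = [2, 3, 2, 1, 2]
t_statistic = welch_t(high_simulatability_scores, low_simulatability_scores)
```

A large positive statistic indicates that the first group's mean acceptance is higher than the second's relative to the variability within the groups.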
A higher counterfactual relevance makes a reason more likely to be perceived as “the cause”. While there can be several factors that jointly lead to the GNN prediction, humans tend to accept the one with high counterfactual relevance as “the cause”, compared to those with low counterfactual relevance. We conducted tests between the responses to questions b and c. The tests show that a higher counterfactual relevance is more convincing (by the p-values of all tests), regardless of simulatability (see Fig. 3(c)). However, when simulatability is low, the presented “cause” is less convincing (see the bottom two subfigures of Fig. 3(a)). Caution: “the cause” presented by a counterfactual may not be the only or the true cause of the prediction, due to confounders. See Section IV-B.
IV Multi-objective explanations of GNN
Given the human study results, we aim to solve the following multiobjective optimization problem.
Definition 1.
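Given a graph and a GNN model, on any target node, extract an explanation subgraph and a counterfactual subgraph, where the counterfactual is a strict subgraph of the explanation and the explanation is acyclic and contains no more than a preset number of nodes, so that simulatability and counterfactual relevance are maximized: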
(6) 
For simplicity of the explanation, we restrict the explaining subgraph to contain no more than a preset number of nodes [Miller1956]. The node limit also reduces the degree, coreness, and centrality of any node in the explaining subgraph, and improves human reaction time when reasoning with it [Lynn29407]. We restrict the explanations to be acyclic graphs [Vu2020PGMExplainerPG], since a cycle can lead to self-proof and explanations such as “Alice is a database researcher because she cited a paper of Bob, who is a database researcher since he cited Alice’s paper”.
The optimization is bi-objective, and the objective vector function has two scalar objectives. We do not use a single scalar objective function, such as one that combines the two objectives, not only because the combination can be hard to specify, but also because trading one objective for the other is not desirable according to the human subject study (either low simulatability or low counterfactual relevance suppresses human acceptance of the explanation and the counterfactual). Beyond being multi-objective, the solution space of all possible pairs of explanation and counterfactual, defined by the constraints in the above optimization problem, is exponentially large and discrete, and no polynomial-time algorithm is known to search the space. The gradient-based methods in [ying2019gnn, pope2019explainability] and the search-based methods in [Russell2019, Ustun2019, Mothilal2020fat, yuan2020xgnn, Vu2020PGMExplainerPG] can only maximize one of the objective functions and do not guarantee Pareto optimality, i.e., an efficient trade-off between objectives. We follow the search-based explanation generation paradigm, but aim at finding the Pareto front and selecting one particular Pareto efficient explanation with well-balanced objectives.
IV-A Search for Pareto optimal explanations
The algorithm, GNNMOExp (Graph Neural Network Multi-Objective Explanations), is shown in Fig. 5. We first apply a depth-first search (DFS) to explore the space of explaining subgraphs. Since the prediction on the target node does not depend on nodes that are more hops away from the target than the GNN has layers, the search is restricted to these dependent neighbors. A canonical ordering of the edges is determined by a breadth-first search (BFS) before running the DFS, ensuring that no subgraph is enumerated more than once. The BFS also canonically numbers the nodes to avoid isomorphism tests during graph lookup: the same graph is represented by a unique array of edges with canonical node numbering. Starting from the subgraph containing only the target node, the DFS expands the subgraph by adding an unvisited edge adjacent to the current subgraph. The constraints in Eq. (6) are used to prune the search space. After all valid candidate subgraphs containing an edge have been explored, the edge is flagged and will not be visited in the future. The enumeration is complete when all edges within the neighborhood have been processed.
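The enumeration described above can be sketched in Python as follows. This is a simplified illustration, not the algorithm of Fig. 5: the function name `enumerate_tree_subgraphs` is hypothetical, sorted edge tuples stand in for the BFS-based canonical ordering, and the adjacency list is assumed to be already restricted to the neighborhood the target's prediction depends on.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Edge = Tuple[int, int]


def enumerate_tree_subgraphs(adj: Dict[int, List[int]],
                             target: int,
                             max_nodes: int) -> List[FrozenSet[Edge]]:
    """List every connected acyclic subgraph (as an edge set) that contains
    `target` and has at most `max_nodes` nodes, each exactly once."""
    results: List[FrozenSet[Edge]] = []

    def extend(nodes: FrozenSet[int], edges: FrozenSet[Edge], flagged: Set[Edge]) -> None:
        results.append(edges)
        if len(nodes) == max_nodes:  # size constraint prunes the search
            return
        flagged = set(flagged)  # local copy so sibling branches stay independent
        # Frontier edges join a node inside the subgraph to a node outside it,
        # so adding one keeps the subgraph connected and acyclic.
        frontier = sorted((u, w) for u in nodes for w in adj[u] if w not in nodes)
        for u, w in frontier:
            key = (min(u, w), max(u, w))
            if key in flagged:
                continue
            extend(nodes | {w}, edges | {key}, flagged)
            # All subgraphs containing this edge are done: flag it for later branches.
            flagged.add(key)

    extend(frozenset({target}), frozenset(), set())
    return results
```

Each returned edge set with `k` edges has `k + 1` nodes. In a full pipeline, the GNN would be run once per enumerated subgraph to score it.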
The GNN model has to be run on each enumerated subgraph, and the two metrics, simulatability and counterfactual relevance, are computed by Eq. (4) and Eq. (5). Since an explaining subgraph contains only a small, bounded number of nodes, the cost of each run is low. To avoid repeatedly calculating the faithfulness of a subgraph when calculating counterfactual relevance, a hash table records the faithfulness of each subgraph. Each subgraph becomes a counterfactual of all subgraphs that are its descendants in the DFS search tree.
After evaluating each subgraph and its counterfactuals, we need to find the optimal explanation so that both metrics are high. However, the two metrics can compete, and it is hard to find an explanation that outperforms all others in both metrics. We aim to find Pareto optimal (efficient) explanations, which are optimal in the sense that they cannot be outperformed by another explanation in both metrics [miettinen1998nonlinear]. We need the following definitions.
Definition 2.
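(Pareto dominance) Consider two candidate solutions, each evaluated by the objective vector of simulatability and counterfactual relevance. One solution Pareto dominates the other if it is at least as good in every objective and strictly better in at least one.
Definition 3.
(Pareto optimality) A solution is Pareto optimal if and only if no other solution Pareto dominates it.
Definition 4.
(Pareto optima) The set of all Pareto optimal solutions.
Definition 5.
(Pareto optimal front) The set consisting of the objective function values of the Pareto optimal set.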
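The definitions above can be illustrated with a short Python sketch for maximization problems. The function names `dominates` and `pareto_front` are hypothetical, and the quadratic-time filter is chosen for clarity rather than speed.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    # a Pareto dominates b: at least as good everywhere, strictly better somewhere.
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(points: List[Point]) -> List[Point]:
    # Keep the objective vectors that no other vector dominates.
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

For example, among objective vectors `(1, 5)`, `(2, 4)`, `(3, 3)`, `(2, 2)`, and `(1, 1)`, the last two are dominated by `(2, 4)`, so only the first three lie on the front.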
However, explanations on the Pareto front can be low in one objective while being high in the other, and are thus not useful. We design a simple method to find Pareto optimal explanations that are: 1) dominating other explanations, and 2) likely simultaneously optimal in individual metrics (without guarantee). In particular, we sort the explanations and their counterfactuals along simulatability and counterfactual relevance independently. Each explanation thus has a ranking position in each of the two rankings (the smaller, the better). We define the comprehensive ranking to be
(7) 
Finally, we select the explanation with the best comprehensive ranking as the final explanation.
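The selection step can be sketched in Python as follows. The exact form of Eq. (7) is not reproduced here, so the sketch assumes that the comprehensive ranking is the sum of the two ranking positions. It also assumes that larger scores are better in both objectives and that ties share a position. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def rank_positions(values: Sequence[float]) -> List[int]:
    # Position 1 is best; a value's position is 1 plus the number of strictly
    # larger values, so tied values share the same position.
    return [1 + sum(1 for other in values if other > v) for v in values]


def select_by_comprehensive_ranking(scores: Sequence[Tuple[float, float]]) -> int:
    """Return the index of the candidate with the smallest rank sum.

    Each entry of `scores` is (simulatability, counterfactual relevance).
    """
    simulatability_rank = rank_positions([s for s, _ in scores])
    relevance_rank = rank_positions([c for _, c in scores])
    combined = [a + b for a, b in zip(simulatability_rank, relevance_rank)]
    return min(range(len(scores)), key=combined.__getitem__)
```

With tie-sharing positions, a dominated candidate always has a strictly larger rank sum than the candidate that dominates it, so the selected candidate is never dominated.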
One possible baseline is to use the so-called preference vector to select a Pareto optimal solution that satisfies some weighted balance between the objectives [mahapatra20a]. We found this method hard to use in our case: the two objectives have different ranges, which vary across target nodes. In contrast, the ranking-based approach handles the heterogeneity. We do not present this baseline since it significantly underperforms our method. A more competitive baseline is to find the explanation whose rankings in the two objectives are well balanced. We compare our approach with this baseline in the experiments. Since the Pareto front is nonconvex and contains dents that have well-balanced but low objective values, this baseline may not work well.
The explanation chosen by the comprehensive ranking is in the Pareto front, as shown by the following theorem.
Theorem 6.
The ranking-based method finds a solution that is on the Pareto front.
Proof.
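If the selected explanation is not a Pareto optimal solution, then there is another explanation that dominates it. By definition, the dominating explanation must be ranked higher than the selected one in at least one objective, while in the other objective the two are at least equal. According to the definition of comprehensive ranking, the dominating explanation has a better comprehensive ranking and would have been chosen by the explanation selection algorithm, which is a contradiction. ∎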
Complexity of the Algorithm. Regarding the DFS, in the best case, is on one end of a linear chain and the time complexity is . In the worst case, the number of subgraphs of a complete graph with nodes is exponential, and the complexity is . Many realworld graphs are sparse and the complexity is more likely to be polynomial. The depth of GNN is usually limited () due to the oversmoothing effect of aggregation [Li2018DeeperII] and the number of nodes searched depends on the size of the hop neighborhood of the target node. We show in Fig. 9 that the running time of the subgraph search is practically low.
It seems that one has to find the Pareto front and then use the comprehensive ranking to find the best explanation. To eliminate all dominated solutions, the time complexity is quadratic in the number of enumerated subgraphs. However, Theorem 6 says that the comprehensive ranking already points to a solution on the Pareto front and the overall time complexity is just linear in the number of enumerated subgraphs, using the heap data structure.
IV-B Confounders
Confounders are variables that impact both the causes and the outcome [Pearl2009]. Fig. 6 illustrates the concept of a confounder, which leads to the back-door adjustment:
(8) 
which is in general not the same as . For in the figure, the counterfactual explanation is obtained by the intervention of removing from . Humans may think that is “the cause” of the output . However, this is not true due to confounders, as shown in Fig. 6.
IV-C Connection to Shapley values
There is a close relationship between counterfactual explanations and Shapley values [shapley1953value, chen2018shapley]. As an explanation, Shapley values are the importance of the factors contributing to the predictions to be explained. One can consider the portion removed from a subgraph as a contributor, and by averaging ’s contributions over all possible that contain (denoted by ), we obtain the Shapley value of :
(9) 
The contribution follows the definition of Shapley values and can be positive, negative, or zero. Instead, counterfactual relevance is always nonnegative and gives the magnitude of the importance of .
IV-D Robustness and sanity check of explanations
An accurate explanation of a prediction should vary according to the underlying mechanism that generates the prediction [Adebayo2018], and should remain the same under irrelevant perturbations [Ghorbani2017].
Definition 7.
The robustness of a subgraph explanation is the degree of the change in under perturbations that are irrelevant to the mechanism that generates .
We assume a one-layer GNN with a parameter matrix whose dimensions are the total number of classes to be predicted and the number of node features. We use the graph in Fig. 7 (left) to demonstrate the difference in the robustness of explanations found by GNNMOExp and prior gradient-based methods. Gradient-based methods [pope2019explainability, ying2019gnn, baldassarre2019explainability] find explanations using the gradient of the following faithfulness loss function with respect to a mask
over the adjacency matrix :where is the th row of . As we are explaining a GNN prediction, is the predicted class and not necessarily the ground truth class of . is the input feature vectors of the neighbor of . The target GNN model will set all entries of to 1 so that all neighbors of are retained. In Fig. 7 center, the neighbors’ features satisfy so that the relevant neighbors to are just and , with representations and , whose sum is closer to than to for any . The gradient of w.r.t. is
(10) 
The importance of the edge is the magnitude of the above gradient, essentially determined by the correlation between and . In Figure 7 center, since both and are orthogonal to , gradientbased methods will never have and in their explanations. When and are rotated so that is more similar to than while remains, the gradientbased explanation will include , even the prediction remains the same. The rotations are irrelevant to how leads to the prediction . On the other hand, is closer to than to or if only subgraphs with three nodes () are allowed. As a result, GNNMOExp still finds the same optimal subgraph containing , and , even after the rotations and is thus more robust.
Another aspect is that an explanation should faithfully reflect how a changing is generated and is different from simulatability that focuses on explaining a static mechanism that generates a fixed . Formally,
Definition 8.
A sanity check of an explanation of a GNN model’s prediction verifies if changes when the mechanism that generates changes.
A sanity check is a necessary (but not a sufficient) condition for an explanation to be a faithful surrogate of the full model: not passing the sanity check indicates that an explanation is not reflecting the inputoutput relationship encoded by the GNN. When debugging a GNN model to identify whether the model or the graph data are manipulated or polluted, passing the sanity check means the explanations can reveal the malicious attacks to the model or data. The prior work [Adebayo2018] proposed a sanity check for deep neural networks on images and does not address sanity checks for GNN on graphs. We conduct sanity checks for GNNMOExp in Section VC.
V Experiments
Table II: Statistics of the datasets.
Dataset | Classes | Nodes | Edges | Edge/Node | Features
Cora | 7 | 2,708 | 10,556 | 3.90 | 1,433
Citeseer | 6 | 3,321 | 9,196 | 2.78 | 3,703
PubMed | 3 | 19,717 | 44,324 | 2.24 | 500
MusaeF | 4 | 22,470 | 342,004 | 15.22 | 4,714
MusaeG | 2 | 37,700 | 578,006 | 15.33 | 4,005
AmazonC | 4 | 13,752 | 574,418 | 41.77 | 767
AmazonP | 6 | 7,650 | 287,326 | 37.56 | 745
CoauthorC | 13 | 18,333 | 327,576 | 17.87 | 6,805
CoauthorP | 2 | 34,493 | 991,848 | 28.76 | 8,415
Table III: Average simulatability and counterfactual relevance across all test nodes.

Simulatability
Dataset | RND | EMB | Grad | GAT | GNNExp | PGExp | Shapley | MOEB | GNNMOExp
Cora | 0.196 | 0.252 | 0.530 | 0.243 | 0.213 | 0.272 | 0.256 | 0.108 | 0.049
Citeseer | 0.051 | 0.054 | 0.066 | 0.050 | 0.056 | 0.058 | 0.068 | 0.044 | 0.039
PubMed | 0.081 | 0.110 | 0.365 | 0.117 | 0.086 | 0.125 | 0.129 | 0.041 | 0.010
MusaeF | 0.972 | 1.035 | 0.899 | 0.872 | 0.911 | 0.895 | 0.346 | 1.313 | 0.199
MusaeG | 0.118 | 0.120 | 0.693 | 0.110 | 0.144 | 0.220 | 0.030 | 0.308 | 0.005
AmazonC | 0.129 | 0.126 | 0.350 | 0.134 | 0.144 | 0.175 | 0.049 | 0.298 | 0.031
AmazonP | 0.163 | 0.180 | 0.458 | 0.175 | 0.203 | 0.231 | 0.058 | 0.339 | 0.034
CoauthorC | 0.216 | 0.243 | 0.745 | 0.264 | 0.245 | 0.341 | 0.097 | 0.411 | 0.038
CoauthorP | 0.146 | 0.144 | 0.720 | 0.220 | 0.159 | 0.295 | 0.057 | 0.314 | 0.035

Counterfactual Relevance
Dataset | RND | EMB | Grad | GAT | GNNExp | PGExp | Shapley | MOEB | GNNMOExp
Cora | 0.240 | 0.260 | 0.330 | 0.243 | 0.225 | 0.217 | 0.615 | 0.455 | 0.467
Citeseer | 0.114 | 0.116 | 0.116 | 0.115 | 0.113 | 0.112 | 0.178 | 0.156 | 0.159
PubMed | 0.112 | 0.129 | 0.200 | 0.117 | 0.100 | 0.099 | 0.330 | 0.235 | 0.248
MusaeF | 0.613 | 0.653 | 0.438 | 0.546 | 0.576 | 0.520 | 0.696 | 1.260 | 0.806
MusaeG | 0.112 | 0.119 | 0.527 | 0.118 | 0.126 | 0.126 | 0.247 | 0.366 | 0.213
AmazonC | 0.094 | 0.095 | 0.258 | 0.089 | 0.087 | 0.061 | 0.201 | 0.312 | 0.215
AmazonP | 0.122 | 0.132 | 0.315 | 0.123 | 0.111 | 0.090 | 0.257 | 0.377 | 0.277
CoauthorC | 0.183 | 0.205 | 0.568 | 0.214 | 0.184 | 0.184 | 0.268 | 0.457 | 0.263
CoauthorP | 0.133 | 0.141 | 0.534 | 0.149 | 0.138 | 0.167 | 0.208 | 0.367 | 0.206
tests (pairwise t-test at 5% significance level). The worst performances are underlined and the second-worst performances are marked with wavy underlines.
V-A Datasets and Baselines
Datasets and experimental settings. We drew real-world datasets from four applications for the node classification task. The dataset details are provided in the supplement.

In citation networks (Citeseer, Cora, PubMed) [kipf2017gcn], each paper has bag-of-words features, and the goal is to predict the research area of each paper.

We adopt MusaeFacebook (MusaeF) and MusaeGithub (MusaeG) [rozemberczki2019multi] from social networks. Nodes represent official Facebook pages (or Github developers), and edges are mutual likes (or followers) between nodes. Node features are extracted from site descriptions (or the developer’s location, repositories starred, and employer).

AmazonComputer (AmazonC) and AmazonPhoto (AmazonP) [shchur2018pitfalls] are segments of the Amazon co-purchase graph, where nodes represent goods, edges indicate that two goods are frequently bought together, and node features are the bag-of-words representation of product reviews.

CoauthorComputer (CoauthorC) and CoauthorPhysics (CoauthorP) are co-authorship graphs based on the Microsoft Academic Graph from the KDD Cup 2016. Authors are represented as nodes, which are connected by an edge if they co-authored a paper [shchur2018pitfalls]. Node features represent the keywords of each author’s papers.
We randomly divide each graph into three portions with a ratio of training : validation : test = 50 : 20 : 30. The GNN is trained on the training set and all explanation methods are evaluated on the test set.
Baselines. We adopt the following baselines that generate subgraph explanations. Except for the Shapley baseline, all baselines compute the weights of edges in the neighborhood of the target node. The explanation is generated by iteratively adding edges adjacent to the current subgraph, considering edges with higher weights first, until the node limit of the explaining subgraph is reached. The counterfactuals of the baselines are generated in the same way as for GNNMOExp, by trying different enumerated subgraphs. The simulatability and counterfactual relevance of the explanation of each baseline are calculated by Eq. (4) and Eq. (5). We describe the details of the baselines:

[leftmargin=*]

Random (RND) assigns random weights to edges.

Embedding (EMB) uses DeepWalk [Bryan2014deepwalk]
to embed the nodes, and the weight of an edge is calculated based on the cosine similarity between the embeddings of two nodes.

Gradient (Grad) [baldassarre2019explainability] use the magnitudes of gradients of GNN output w.r.t. edges to find salient subgraphs.

GAT [velivckovic2018gat] learns attention weights over neighbors of any node for message aggregation to predict the output of the GNN on , and the attention weights on the edges are extracted as edge weights.

GNNExplainer (GNNExp) [ying2019gnn] learns to mask edges so that the masked graph maximally preserve the predictions of , and the mask matrix provides the edge weights.

PGExplainer (PGExp) [luo2020parameterized] trains a deep neural network to parameterize the generation of explanations. The subgraphs generated by the explainer are evaluated.

Shapley picks and , defined in Eq. (9), with the highest counterfactual relevance, and uses the selected and to generate the counterfactual .

GNNMOExpb (MOEB) is similar to GNNMOExp, but its strategy is to select the explanations that are most balanced between the two metrics.
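The shared procedure that turns edge weights into a baseline explanation (add edges adjacent to the current subgraph, highest weight first, until a node budget is reached) can be sketched as below. This is an illustrative reading of the description above, not the authors' implementation. The name `grow_explanation`, the dictionary format for the weights, and the rule that an edge is only added when it brings in a new node are assumptions.

```python
def grow_explanation(target, weighted_edges, max_nodes):
    """Greedily grow a connected subgraph around `target`.

    weighted_edges maps an undirected edge (u, v) to its weight.
    At each step the highest-weight edge with exactly one endpoint in the
    current subgraph is added (assumption: edges that would close a cycle
    are skipped). Stops at `max_nodes` nodes or when no edge qualifies.
    """
    nodes = {target}
    edges = []
    while len(nodes) < max_nodes:
        frontier = [
            (w, u, v)
            for (u, v), w in weighted_edges.items()
            if (u in nodes) != (v in nodes)  # exactly one endpoint inside
        ]
        if not frontier:
            break
        _, u, v = max(frontier, key=lambda item: item[0])
        nodes.update((u, v))
        edges.append((u, v))
    return nodes, edges

# Toy weights, e.g. as produced by any of the edge-weighting baselines.
weights = {(0, 1): 0.9, (0, 2): 0.1, (1, 3): 0.5, (2, 4): 0.8}
expl_nodes, expl_edges = grow_explanation(0, weights, max_nodes=3)
```

In the toy example, edge (2, 4) has a high weight but is never considered, because it is not adjacent to the subgraph grown from node 0.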
V-B Quantitative Results
Average simulatability and counterfactual relevance across all test nodes are reported in Table III. We conclude that:


Gradient does not perform badly in counterfactual relevance (best on three datasets and second on two), but it has the worst or second-worst simulatability on every dataset except MusaeF. This is because the gradients indicate the most effective perturbations of the edges to change a prediction, yet these edges do not constitute a subgraph that maximally preserves the GNN prediction. Based on the human study, Grad should be excluded first.

GAT, GNNExplainer, and PGExp are outperformed by GNNMOExp in both metrics on all datasets, which is consistent with the fact that these baselines do not explicitly optimize both objectives.

MOEB has the worst or second-worst simulatability on the latter 6 datasets, though it is the runner-up on the first three. Based on the human study, MOEB is therefore not guaranteed to generate explanations that are likely to be accepted.

Shapley has the best counterfactual relevance on the first three datasets, with GNNMOExp as the runner-up. On the remaining 6 datasets, GNNMOExp outperforms or is close to Shapley. On simulatability, GNNMOExp outperforms Shapley on all datasets.

GNNMOExp has the best simulatability among all methods on all datasets, and it frequently outperforms or is competitive with the feasible runners-up (Grad and MOEB are not feasible due to their low simulatability).
Parameter Sensitivity. We search subgraphs of nodes involving vertices that are hops away from the target node (by default , the depth of the target GNN). The sensitivity analyses of these parameters are shown in Fig. 8. Simulatability improves as the parameters increase, while counterfactual relevance decreases. We let since large explaining subgraphs go against explanation simplicity and simulatability. Since is usually small ( in our experiments) to avoid over-smoothing [Li2018DeeperII], we can see the performance level off when .
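To make the role of the two search parameters concrete, the sketch below enumerates candidate subgraphs for a target node under a hop limit and a node budget. It is a hedged illustration only: it assumes the candidates are connected, acyclic subgraphs (trees) containing the target, measures hops by shortest-path distance in the full graph, and uses hypothetical names (`k_hop_nodes`, `enumerate_subtrees`, `adj`). The authors' incremental enumeration may differ.

```python
from collections import deque

def k_hop_nodes(adj, target, k):
    """Nodes within k hops of `target` (breadth-first search)."""
    dist = {target: 0}
    queue = deque([target])
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)

def enumerate_subtrees(adj, target, k, max_nodes):
    """Edge sets of all trees containing `target` with at most `max_nodes`
    nodes, restricted to nodes within k hops. The empty edge set stands for
    the target node alone."""
    allowed = k_hop_nodes(adj, target, k)
    seen = {frozenset()}
    stack = [(frozenset([target]), frozenset())]
    while stack:
        nodes, edges = stack.pop()
        if len(nodes) == max_nodes:
            continue
        for u in nodes:
            for v in adj[u]:
                # Only attach a new node, so the subgraph stays acyclic.
                if v in allowed and v not in nodes:
                    new_edges = edges | {frozenset((u, v))}
                    if new_edges not in seen:
                        seen.add(new_edges)
                        stack.append((nodes | {v}, new_edges))
    return seen

triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
n_triangle = len(enumerate_subtrees(triangle, 0, k=2, max_nodes=3))
n_path = len(enumerate_subtrees(path, 0, k=2, max_nodes=4))
```

The number of candidates grows quickly with both the hop limit and the node budget, which is the trade-off between search cost and explanation simplicity discussed above.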
One bottleneck of applying GNNMOExp to real-world graphs is its running time [Rudin2019GloballyConsistentRS]. In Fig. 9 we can see that the running time increases as the search space grows with and . However, on average, enumerating and evaluating all acyclic and connected subgraphs of a target node on Cora and Citeseer, which contain some nodes of very high degree, takes no more than 3 seconds on a commodity computer. With an incremental implementation, a newly added edge only leads to enumerating new subgraphs containing the new edge. Given the reasonable running time, the capability of guaranteeing Pareto optimality and simultaneously high simulatability and counterfactual relevance is a unique advantage that gradient-based methods do not have. Explaining GNNs with a quality guarantee is a must-have when GNNs are used in user-centric applications, such as graph-based recommendation systems [Ying2018].
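The Pareto-optimality guarantee can be illustrated with a small filter over scored candidates. This is a generic sketch of Pareto dominance, assuming both metrics are to be maximized. The function name `pareto_front`, the tuple layout, and the toy scores are hypothetical, and the ranking-based selection used by GNNMOExp is not reproduced here.

```python
def pareto_front(candidates):
    """Return names of candidates not dominated in
    (simulatability, counterfactual relevance); larger is better for both.

    candidates: list of (name, simulatability, counterfactual_relevance).
    """
    front = []
    for name, sim, cf in candidates:
        dominated = any(
            s >= sim and c >= cf and (s > sim or c > cf)
            for _, s, c in candidates
        )
        if not dominated:
            front.append(name)
    return front

# Toy scores for four enumerated subgraphs (made-up numbers).
scored = [("A", 0.9, 0.2), ("B", 0.6, 0.7), ("C", 0.5, 0.6), ("D", 0.3, 0.9)]
front = pareto_front(scored)  # C is dominated by B
```

Because every enumerated subgraph is scored, a candidate such as C, which is worse than B in both metrics, can never be returned. A method that ranks edges by a single score offers no such guarantee.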
V-C Robustness and Sanity check
We design two ways to perturb GNN predictions. We can link an existing vertex to the target node and add a message to Eq. (2) at the last layer of GNN:
(11) 
where is the perturbed activation. We measure the strength of the perturbation caused by using
(12) 
where is the predicted class of before the perturbing edge is added. Second, we randomize the GNN parameters of layer , which is the last layer of GNN. We measure the perturbation strength using Euclidean distance between the original parameters and the perturbed parameters
(13) 
Given a perturbation, we need to measure the change in the explaining subgraph of . Let the explaining subgraphs after the perturbation be denoted by . We measure the average distance between the two explaining subgraphs , where is the Jaccard distance between two vertex sets.
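The Jaccard distance between the vertex sets of two explaining subgraphs can be computed as in the following sketch. The helper names and the convention that two empty sets have distance 0 are assumptions, and the averaging is assumed here to run over target nodes.

```python
def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b| for two vertex sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)

def mean_explanation_change(before, after):
    """Average Jaccard distance over paired explanations.

    before[i] and after[i] are the vertex sets of the explaining subgraph
    of the i-th target node before and after the perturbation.
    """
    distances = [jaccard_distance(x, y) for x, y in zip(before, after)]
    return sum(distances) / len(distances)

change = mean_explanation_change(
    before=[{1, 2, 3}, {4, 5}],
    after=[{2, 3, 4}, {4, 5}],
)
```

A value of 0 means the perturbation left every explanation unchanged, and a value of 1 means no vertex is shared between the explanations before and after.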
From Fig. 10, we can observe that the subgraph explanations found by our method pass the sanity check. We have the following observations. i) There is no change in the class predicted by the target GNN when the perturbing message is aligned with (high cosine similarity) or the perturbing distance is small, and predictions start to change when the perturbations are sufficiently strong. ii) The Jaccard distance between two optimal explaining subgraphs becomes larger as the predicted class changes, as shown by the red curves lying above the blue curve. iii) Interestingly, on the left, even when there is no change in the predicted class, first increases as the cosine similarity decreases to 0 ( is orthogonal to ), and then decreases again when further decreases to negative values ( is in the opposite direction of ). We conjecture that the edge () is added to the explaining graph in the former situation, while some message cancels out the opposing one in the latter case (though there may not always be such a canceling message). The explanations are more robust to perturbing , as the remains low if predictions remain the same (right figure). The explanations are more sensitive to perturbing incoming messages (left figure). In such cases, on average fewer than two edges are perturbed in the explaining subgraphs.
V-D Reproducibility checklist
We adopt a Graph Convolutional Network (GCN) model [kipf2017gcn] as the explained target model, with two hidden layers ( ), each with 16 neurons. The dimension of the input layer is the number of input features of the nodes, and the dimension of the output layer is the number of classes. We train the GNN with the cross-entropy loss and the Adam optimizer, with the learning rate set to 0.01. We set the maximum number of training iterations to 500 and apply early stopping during training.
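The training settings above can be collected into a small configuration with an early-stopping loop, as sketched below. Only the hidden size, learning rate, and iteration cap come from the text. The `patience` value, the use of validation loss as the stopping criterion, and the names `TrainConfig` and `train_with_early_stopping` are assumptions, and `step_fn` is a stand-in for one optimizer step followed by validation.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    hidden_units: int = 16       # neurons per hidden layer (from the text)
    learning_rate: float = 0.01  # Adam learning rate (from the text)
    max_iters: int = 500         # maximum training iterations (from the text)
    patience: int = 20           # ASSUMPTION: not specified in the text

def train_with_early_stopping(step_fn, cfg):
    """Call step_fn(iteration) -> validation loss until the loss has not
    improved for cfg.patience iterations or cfg.max_iters is reached."""
    best_loss, best_iter = float("inf"), -1
    for it in range(cfg.max_iters):
        val_loss = step_fn(it)
        if val_loss < best_loss:
            best_loss, best_iter = val_loss, it
        elif it - best_iter >= cfg.patience:
            break
    return best_iter, best_loss

# Stand-in validation curve with its minimum at iteration 50.
best_iter, best_loss = train_with_early_stopping(
    lambda it: (it - 50) ** 2, TrainConfig()
)
```

With the stand-in curve, training stops at iteration 70 and reports iteration 50 as the best checkpoint. In a real run, `step_fn` would update the GCN and return its validation loss.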
As for the proposed GNNMOExp, there are two hyperparameters. We set the maximum search distance , which is equal to the depth of GCN, and we set the maximum subgraph complexity , considering both the effectiveness and the explanation simplicity.
VI Related Work
Explainable ML. Simulatability and counterfactual relevance are two major metrics for evaluating explanations, but their interactions and how humans perceive them are not clear. [Lundberg2017UnifiedApproach] and [Shrikumar2017DEEPLIFT] provide a prediction explanation framework based on Shapley values, which encompasses LIME as a special case. Two algorithms with linear complexity for feature importance scoring are developed in [chen2018shapley]. [Ghorbani2019DATAshapley] and [Ancona2019DNNshapley] approximate Shapley values for deep networks via sampling. The methods proposed in [darwiche2003differential, chan2005sensitivity] use gradients to find salient subgraphs that explain inference on probabilistic graphical models (PGM), but not on GNNs [kipf2017gcn, hamilton2017graphsage, velivckovic2018gat]. [ying2019gnn] explains arbitrary graph neural networks using a simplified model. [baldassarre2019explainability] studies the influence of input changes on the outputs of GNN models with gradient-based and decomposition-based methods. Stochastic searches for explaining subgraphs have been proposed [yuan2020xgnn, Vu2020PGMExplainerPG, yuan2021explainability] using reinforcement learning and hill-climbing. In [yuan2021explainability], Monte Carlo search is used for exploration.
Causal Inference and Counterfactual Reasoning. [guo2020survey] introduces both traditional and advanced methods for learning causal effects and causal relations. [guo2020learning] discovers unknown confounders from observed data by learning representations of the confounders using GNNs. We identify confounders on the computational graph of the GNN.
Robustness and sensitivity. Explanation robustness and sensitivity are two desired properties and have been mostly studied on images [Adebayo2018, Ghorbani2017, Zhang2018, Yeh2019, Pruthi2019] and texts [Pruthi2019], but none on graphs. The differential geometry formulation of manipulability of gradient-based explanations in [Adebayo2018] assumes that the input is a vector (image) that lies on a low-dimensional manifold. For GNN, a decision of a node depends not only on its feature vectors, but also on the messages from neighboring nodes. On graphs, the only relevant study is [wiltschko2020], and the proposed method differs from [wiltschko2020] in explanation generation (subgraph search vs. gradient-based) and evaluation metrics (output explanation changes vs. attribution accuracy changes).
VII Conclusion and future work
We proposed to find multi-objective explanations for Graph Neural Networks, with two objectives, simulatability and counterfactual relevance, to be satisfied. The human study showed that the two explanation objectives can represent the perceived quality of explanations based on two different cognitive processes (quick screening vs. effortful deliberation), and they jointly influence and predict explanation acceptance by humans. We proposed to maximize the two objectives by subgraph enumeration and ranking-based optimization to produce Pareto optimal explanations that fulfill both objectives. We showed that gradient-based GNN explanations are not robust against the rotation of incoming messages to the target nodes, while GNNMOExp can reliably output quality explanations. Extensive experiments on 9 graph datasets from 4 applications demonstrated superior performance in simulatability, counterfactual relevance, robustness, and sensitivity.
Acknowledgement
Chao and Sihong were supported in part by the National Science Foundation under Grants NSF IIS1909879, NSF CNS1931042, and NSF IIS2008155. Any opinions, findings, conclusions, or recommendations expressed in this document are those of the author(s) and should not be interpreted as the views of any U.S. Government. Yifei, Yazheng, and Xi were supported by Natural Science Foundation of China (No.61976026) and 111 Project (B18008).