1 Introduction
Many important real-world data sets are graphs or networks. These include social networks, knowledge graphs, protein-protein interaction networks, the World Wide Web and many more. Graph neural networks (GNNs) leverage the link structure as well as node feature information to encode graph data
[12] and currently achieve state-of-the-art results on many prediction tasks [9]. Like other connectionist models, GNNs lack transparency in their decision-making. Explaining GNNs is still in the early stages of research, but since graphs are particularly expressive by encoding contexts, they are a promising candidate for producing rich explanations [4]. The most popular type of GNN explainer method is perturbation-based, where output variations are studied with respect to different input perturbations [14]. When developing any explainable method, it is important to evaluate its performance with respect to valid procedures and metrics [1]. With this in mind, we explore the evaluation methods employed by perturbation-based explainer methods in the GNN domain. Validating explanations is generally challenging because a ground-truth explanation is not always available. Even for synthetically generated datasets with ground-truth explanations, this approach can be error-prone. While validation schemes do exist, they lack maturity.
Many explainer methods come with differing evaluation protocols, which makes their comparison difficult. However, some of these protocols overlap or have been adopted by others, e.g. several papers [18, 13, 15] use the state-of-the-art explainer method [10] and its evaluation protocol as a benchmark. As GNN explainer methods become more and more popular, it is vital to avoid evaluation pitfalls and, in a next step, to introduce a standard evaluation approach. We therefore perform an empirical study on perturbation-based explanations for GNNs, with a focus on [10]. Our contributions include identifying pitfalls in three main areas: (1) the synthetic data generation process, (2) evaluation metrics, and (3) the final presentation of the explanation. For each pitfall we propose a remedy to increase the validity of the evaluation.
2 Terminology and Concepts
Perturbation-based explainer methods for GNNs output masks that indicate important parts of the input: node masks, edge masks or node feature masks, depending on the explanation task. Three different types of masks have been proposed: soft masks (GNNExplainer [10], CF-GNNExplainer [18]), discrete masks (ZORRO [13]) and approximated discrete masks (PGExplainer [15]). These masks are applied to the input graph(s) and fed into the trained GNN to carry out predictions, which the objective function encourages to be similar to the original prediction. The currently established evaluation scheme for perturbation-based explainer methods consists of a data generation, GNN training, and mask generation step, as shown in Figure 1. The generated synthetic data is comprised of a base graph and a specific motif (1), which are connected randomly and additionally perturbed by noise. A GNN is applied to the graph to execute a prediction task, e.g. node classification (2). In the next step, the explainer method generates masks of the receptive field (3). If the explainer method outputs soft masks, thresholding is needed to arrive at the final explainer subgraph (4). Please refer to Appendix 0.A for detailed background on GNNs and perturbation-based explainer methods.
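Steps (3) and (4) of this pipeline can be sketched as follows. This is a minimal dense-matrix illustration in numpy; actual explainers operate on sparse edge lists, and the masked graph is fed back through the trained GNN to compare predictions.

```python
import numpy as np

def apply_edge_mask(adj, edge_mask, threshold=0.5):
    """Apply a soft edge mask to an adjacency matrix and binarize it.

    adj: (N, N) adjacency matrix of the receptive field.
    edge_mask: (N, N) importance scores in [0, 1] from the explainer.
    threshold: scores below this value are dropped (step (4) in Figure 1).
    """
    masked = adj * edge_mask                          # soft mask: down-weight edges
    final = (masked >= threshold).astype(adj.dtype)   # hard thresholding
    return masked, final

# Tiny example: a 3-node star where one edge is deemed important.
adj = np.array([[0., 1., 1.],
                [1., 0., 0.],
                [1., 0., 0.]])
mask = np.array([[0.0, 0.9, 0.2],
                 [0.9, 0.0, 0.0],
                 [0.2, 0.0, 0.0]])
soft, hard = apply_edge_mask(adj, mask, threshold=0.5)
```

Here only the high-scoring edge survives the threshold, yielding the final explainer subgraph.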
2.1 Terminology

Ground-truth explanation: The ground-truth explanation is a particular motif that is used during the synthetic data set generation, e.g. the “house motif” shown in Figure 1 (1). The motif is embedded into a larger graph and perturbed with noise.

Ground-truth label: The ground-truth label is the respective class a node (or graph) is assigned to.

Explainer subgraph with importance scores: The explainer method assigns importance scores to the edges, indicating their influence in the prediction by the GNN, as shown in Figure 1 (3).

Threshold application: In order to arrive at a compact subgraph, a threshold is applied to reduce the explainer subgraph to the most important edges by removing all edges that fall below the threshold.

Final (explainer) subgraph: A reduced final explainer subgraph remains, as shown in Figure 1 (4).

Label flip: If the input to the GNN is changed, e.g. using the reduced final explainer subgraph instead of the original receptive field, a label flip can occur. This means that a different class is predicted than in the original prediction.
2.2 Synthetic Data
BA-Shapes: Node classification dataset with a Barabási–Albert (BA) base graph of 300 nodes and a set of 80 five-node “house”-structured network motifs, which are attached to randomly selected nodes of the base graph and function as ground-truth explanations. The resulting graph is further perturbed by adding 0.1 random edges. Nodes are assigned to 4 classes based on their structural roles. In a house-structured motif, as shown in Figure 1 (1), there are 3 types of roles: the top (yellow), shoulder (orange) and bottom (blue) nodes of the house; nodes that do not belong to a house form the fourth class (grey).
Tree-Cycles: Node classification dataset with two different labels, consisting of a base 8-level balanced binary tree and 80 six-node cycle motifs, which are attached to random nodes of the base graph and function as ground-truth explanations.
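The construction of such datasets can be sketched with networkx. This is a rough illustration of the BA-Shapes recipe; the exact generator used in [10] differs in details, and the random noise-edge perturbation is omitted here.

```python
import random
import networkx as nx

def ba_shapes(n_base=300, n_motifs=80, m=5, seed=0):
    """Sketch of BA-Shapes-style generation: a Barabasi-Albert base graph
    with 'house' motifs attached to random base nodes.
    Labels: 0 = base, 1 = top, 2 = shoulder, 3 = bottom."""
    random.seed(seed)
    g = nx.barabasi_albert_graph(n_base, m, seed=seed)
    labels = {v: 0 for v in g.nodes}
    for _ in range(n_motifs):
        b = g.number_of_nodes()
        # House: shoulders b, b+1; bottoms b+2, b+3; top b+4.
        square = [(b, b + 1), (b + 1, b + 2), (b + 2, b + 3), (b + 3, b)]
        roof = [(b, b + 4), (b + 1, b + 4)]
        g.add_edges_from(square + roof)
        labels.update({b: 2, b + 1: 2, b + 2: 3, b + 3: 3, b + 4: 1})
        g.add_edge(random.randrange(n_base), b + 2)  # attach house to base graph
    return g, labels

g, labels = ba_shapes()
```

With the default parameters this yields 300 + 80 × 5 = 700 nodes, 80 of which carry the “top” label.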
3 Pitfalls of Evaluation and Possible Remedies
3.1 Pitfall 1: Data Generation
For all four introduced explainer methods [10, 18, 13, 15], the synthetic datasets BA-Shapes and Tree-Cycles are used for evaluation. Their advantage is their intuitive motifs and labelling, which are understandable by humans. However, the defined ground-truth explanation, e.g. the “house motif”, while being intuitively well understandable, does not necessarily align with the decision-making process of the GNN and hence does not represent the optimal explanation. Below, we compare the entropy of the ground-truth explanation to the entropy of other trivial explanations, including the entire receptive field of the GNN and the target node alone. Our results (obtained with a 3-layer vanilla Graph Convolutional Network) show that the proclaimed ground truth does not have consistently lower entropy than trivial baselines, as shown in Table 1.
Method           Top Nodes  Shoulder Nodes  Bottom Nodes
Ground truth     1.21       0.96            0.95
Receptive Field  1.31       0.93            1.16
Target Node      1.25       1.24            1.24

Table 1: Entropy of the ground-truth explanation compared to trivial baselines (BA-Shapes).
For each house node type we see the need for a different type of ground-truth explanation, given the differences in entropy and accuracy performance (see Section 3.2). Considering Occam's razor, which suggests that the simplest explanation is best, we can see for several node types that the house motif is not the optimal ground-truth explanation. Figure 2 (left) shows a number of different possible ground-truth explanations, including the top triangle, the bottom square, the target node (“left shoulder”) and the right shoulder node, a bottom node, or the top node. While the assigned ground-truth explanation “house motif” does lead to the correct prediction in nearly all cases, so do the more compact square and triangle motifs, with similarly low entropy; they would therefore be the more compact ground-truth explanations.
Remedy: We propose a new ground-truth explanation generation procedure in order to better evaluate and compare explanations. The dataset should be constructed such that the identified ground-truth explanation, e.g. a specific motif, has the lowest possible entropy. One way to achieve this is a motif search: the entropy and prediction accuracy of several different candidate motifs around the target node are calculated, similarly to our approach in Figure 2. The best result is then chosen as the ground-truth explanation, which ensures maximal compactness and therefore comprehensibility of the explanation. This motif search step implies some additional work and resources, but it is a worthwhile trade-off, as it ensures that the ground-truth explanation for a specific class prediction of a specific GNN is known and therefore confirms the validity of the evaluation outcome. It also ensures that only essential parts of the graph are in the ground-truth explanation.
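The motif search could be sketched as follows. These are hypothetical helper functions, not from the paper; `model` is assumed to map a candidate subgraph and target node to class probabilities.

```python
import numpy as np

def prediction_entropy(probs):
    """Shannon entropy of a class distribution (lower = more certain)."""
    p = np.clip(probs, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def motif_search(model, target, candidate_subgraphs, original_label):
    """Hypothetical motif search: among candidate subgraphs around the
    target node, pick the smallest one that preserves the original
    prediction, breaking ties by entropy."""
    best = None
    for sub in sorted(candidate_subgraphs, key=len):  # Occam's razor: smallest first
        probs = model(sub, target)
        if int(np.argmax(probs)) != original_label:
            continue  # candidate flips the label: not a valid ground truth
        h = prediction_entropy(probs)
        if best is None or (len(sub), h) < (len(best[0]), best[1]):
            best = (sub, h)
    return best
```

The returned motif is both maximally compact and consistent with the GNN's actual decision, which is exactly the property the remedy asks of a ground-truth explanation.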
3.2 Pitfall 2: Evaluation Metrics
For evaluating the accuracy of an explainer method, the ground-truth explanation has to be known. For synthetic datasets, graph motifs can be used as an approximate ground truth, even though the GNN might not make predictions as intuitively expected, as discussed in Section 3.1. When choosing a metric, many papers use the general term “accuracy” with wildly differing definitions. In [13], accuracy is defined as the matching rate between important edges in explanations and those in the ground truth, whereas in [18], it is the proportion of explanations that are “correct”. In [10] and [15], accuracy is formalized according to a binary classification task, where edges in the ground-truth explanation are treated as labels and the importance scores are viewed as prediction scores. This accuracy is equivalent to the area under the curve (AUC) of the receiver operating characteristic (ROC), which is calculated on the prediction scores and not on the predicted classes. ROC AUC has limitations in its capacity to evaluate explanations, as it only gives us the probability that a randomly chosen positive instance (an edge in the ground-truth explanation) is ranked higher than a randomly chosen negative instance. However, when evaluating an explanation, we care about the actual probability of the explanation being correct, not just the ranking. Even more problematic is the ROC AUC's tendency to be misleading in situations with high class imbalance
[8], as is the case here. Due to the large number of true negatives, i.e. edges that are neither in the ground-truth explanation nor in the final explainer subgraph, the false positive rate is pulled down substantially, which leads to an overly optimistic result.

Remedy: Precision-recall (PR) curves and the corresponding AUC can provide a less misleading evaluation than the ROC AUC, because they evaluate the fraction of true positives among positive predictions [8]. Furthermore, for comparability with hard mask methods, and since a threshold has to be applied for the final explanation presented to the user, threshold-dependent metrics should be included in the evaluation. We propose to additionally use recall, to account for the sparsity of an explanation: the true positives are the edges contained in both the ground-truth explanation and the final explainer subgraph, and the false negatives are the ground-truth edges missing from the final explainer subgraph. Recall thus provides information about the completeness, and thereby the usefulness, of the explainer subgraph. Table 2 shows the difference between avg. ROC AUC and avg. PR AUC. As expected, the PR AUC does not reach the level of the ROC AUC, with a difference of up to 47 percentage points, providing a more comprehensive picture of the explanation quality, similarly to the average recall.
Class           ROC AUC  SD    PR AUC (proposed)  SD    Recall (proposed)  SD
Top Nodes       0.98     0.07  0.69               0.19  0.65               0.18
Shoulder Nodes  0.98     0.91  0.51               0.19  0.51               0.13
Bottom Nodes    0.93     0.18  0.56               0.22  0.57               0.21
Cycle Nodes     0.71     0.22  0.55               0.16  0.52               0.14

Table 2: Avg. ROC AUC, avg. PR AUC and average recall with standard deviations (SD) for GNNExplainer.
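The imbalance effect can be reproduced with a toy example in scikit-learn. The edge counts and scores below are invented for illustration and are not taken from our experiments: a handful of ground-truth edges sit among many candidate edges, and a few negatives also receive high importance scores.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)

# 12 ground-truth edges among 1000 candidate edges: heavy class imbalance.
y_true = np.zeros(1000, dtype=int)
y_true[:12] = 1

# Importance scores: ground-truth edges rank high on average, but 28
# negatives also receive confidently large scores (false positives).
scores = rng.uniform(0.0, 0.5, size=1000)
scores[:12] = rng.uniform(0.6, 1.0, size=12)
scores[12:40] = rng.uniform(0.7, 1.0, size=28)

roc = roc_auc_score(y_true, scores)            # stays high: many easy true negatives
pr = average_precision_score(y_true, scores)   # drops: false positives hurt precision
```

The mass of easy true negatives keeps the ROC AUC near-perfect, while the PR AUC exposes that a large share of the top-ranked edges are wrong.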
3.3 Pitfall 3: Threshold Application
Reducing the size of the original subgraph is a post-processing step, executed after training the GNN. In this step, the originally predicted label can flip, so the fidelity of the explanation is no longer ensured. Fidelity refers to an explanation being faithful to the model it aims to explain.
For example, for [10] a label flip occurs for 66% of the top node explainer subgraphs with a threshold of 6 edges. In these cases, the final explainer subgraph leads to a different prediction than the original subgraph, defeating its purpose of explaining the original prediction. Overall, for the BA-Shapes dataset, labels flip for 19% of instances in [10], 39% in [18] and 18% in [15].
Additionally, for soft mask explainer methods [10, 18], the size of the final explainer subgraph is parameterized: in the established evaluation scheme, a dedicated hyperparameter controls the size of the final explainer subgraph. For synthetic datasets, this hyperparameter is set according to knowledge about the ground-truth motif. Configuring the threshold this way leaks information and unfairly biases the result. The resulting evaluation is therefore flawed, since the ground-truth size of the explainer subgraph can typically not be assumed to be known. Choosing an appropriate threshold is not trivial, as the resulting recall can differ substantially for different thresholds, as can be seen in Table 3, showing a necessary trade-off between the compactness and completeness of the final explainer subgraph.

Class           Recall (th = 6)  Recall (th = 20)
Top Nodes       0.65             0.98
Shoulder Nodes  0.51             0.82
Bottom Nodes    0.57             0.75
Cycle Nodes     0.52             0.97

Table 3: Recall for different thresholds (number of edges in the final explainer subgraph).
Remedy: We highly recommend integrating an additional test that ensures no label flip occurred for the final explainer subgraph, to avoid an explanation that leads to a different decision; in other words, to ensure explanation fidelity. A simple if-then mechanism that moves on to the next optimal explanation in case a label flip occurs would be sufficient.
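Such an if-then fidelity check could look roughly as follows. This is a sketch with a hypothetical `model` callable that maps a subgraph and target node to class probabilities; candidates are assumed to be ordered from most to less optimal.

```python
import numpy as np

def ensure_no_label_flip(model, target, receptive_field, ranked_subgraphs):
    """Hypothetical fidelity check: return the first candidate explainer
    subgraph whose prediction for `target` matches the prediction on the
    full receptive field; fall back to the receptive field itself."""
    original = int(np.argmax(model(receptive_field, target)))
    for sub in ranked_subgraphs:  # ordered from most to less optimal
        if int(np.argmax(model(sub, target))) == original:
            return sub            # fidelity preserved: no label flip
    return receptive_field        # no candidate preserves the label
```

This guarantees that the explanation presented to the user never contradicts the prediction it is supposed to explain.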
Furthermore, to ensure that no knowledge about the ground truth leaks into the evaluation and biases the result for soft mask approaches, we recommend not fixing the size of the final explainer subgraph to a set number of edges, but carrying out a grid search on a test set to choose the optimal threshold.
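Such a grid search could be sketched as follows. This is a hypothetical helper; we assume, for illustration, that mean F1 between the thresholded subgraph and the ground truth on a held-out test split serves as the selection criterion, so that the threshold is not tuned on the evaluation instances themselves.

```python
import numpy as np

def grid_search_threshold(edge_scores, gt_edges, candidate_thresholds):
    """Hypothetical grid search over mask thresholds on a held-out test set.

    edge_scores: list of dicts mapping edge -> importance score (one per instance).
    gt_edges: list of ground-truth edge sets (one per instance).
    Returns the threshold maximizing mean F1 against the ground truth.
    """
    def f1(pred, gt):
        tp = len(pred & gt)
        if tp == 0:
            return 0.0
        prec, rec = tp / len(pred), tp / len(gt)
        return 2 * prec * rec / (prec + rec)

    best_th, best_score = None, -1.0
    for th in candidate_thresholds:
        per_instance = []
        for inst_scores, gt in zip(edge_scores, gt_edges):
            pred = {e for e, s in inst_scores.items() if s >= th}
            per_instance.append(f1(pred, gt))
        mean = float(np.mean(per_instance))
        if mean > best_score:
            best_th, best_score = th, mean
    return best_th, best_score
```

The chosen threshold can then be applied unchanged on the evaluation instances, avoiding any per-instance use of ground-truth size.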
4 Conclusion
The expressive nature of graphs makes them a promising candidate for producing rich explanations for GNN decisionmaking. But since a mature standardized approach to evaluating explanations for GNN explainer methods is missing, a valid comparison of different methods can be challenging. For this reason, we find it important to examine existing evaluation methods closely to uncover potential pitfalls. In this paper, we show the implications of three identified evaluation pitfalls in the context of GNNs and propose remedies to avoid them.
References

[1] Arrieta, Alejandro Barredo, et al. "Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI." Information Fusion 58 (2020): 82-115.
[2] Huang, Qiang, et al. "GraphLIME: Local interpretable model explanations for graph neural networks." arXiv preprint arXiv:2001.06216 (2020).
[3] Kipf, Thomas N., and Max Welling. "Semi-supervised classification with graph convolutional networks." ICLR (2017).
[4] Lecue, Freddy. "On the role of knowledge graphs in explainable AI." Semantic Web 11.1 (2020): 41-51.
[5] Lucic, Ana, et al. "CF-GNNExplainer: Counterfactual explanations for graph neural networks." arXiv preprint arXiv:2102.03322 (2021).
[6] Molnar, Christoph, Giuseppe Casalicchio, and Bernd Bischl. "Interpretable machine learning: a brief history, state-of-the-art and challenges." Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, Cham, 2020.
[7] Robnik-Šikonja, Marko, and Marko Bohanec. "Perturbation-based explanations of prediction models." Human and Machine Learning. Springer, Cham, 2018. 159-175.
[8] Saito, Takaya, and Marc Rehmsmeier. "The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets." PLoS ONE 10.3 (2015): e0118432.
[9] Xu, Keyulu, et al. "How powerful are graph neural networks?" ICLR (2019).
[10] Ying, Rex, et al. "GNNExplainer: Generating explanations for graph neural networks." Advances in Neural Information Processing Systems 32 (2019): 9240.
[11] Yuan, Hao, et al. "XGNN: Towards model-level explanations of graph neural networks." Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020.
[12] Zhou, Jie, et al. "Graph neural networks: A review of methods and applications." AI Open 1 (2020): 57-81.
[13] Funke, Thorben, Megha Khosla, and Avishek Anand. "Hard masking for explaining graph neural networks." (2020).
[14] Yuan, Hao, et al. "Explainability in graph neural networks: A taxonomic survey." arXiv preprint arXiv:2012.15445 (2020).
[15] Luo, D., W. Cheng, D. Xu, W. Yu, B. Zong, H. Chen, and X. Zhang. "Parameterized explainer for graph neural network." Advances in Neural Information Processing Systems. 2020.
[16] Anonymous. "Causal screening to interpret graph neural networks." Submitted to International Conference on Learning Representations, 2021, under review. Available: https://openreview.net/forum?id=nzKv5vxZfge
[17] Yuan, H., H. Yu, J. Wang, K. Li, and S. Ji. "On explainability of graph neural networks via subgraph explorations." arXiv preprint arXiv:2102.05152 (2021).
[18] Lucic, Ana, et al. "CF-GNNExplainer: Counterfactual explanations for graph neural networks." arXiv preprint arXiv:2102.03322 (2021).
Appendix 0.A Background on GNNs and Perturbation-Based Explainer Methods
For a GNN, the goal is to learn a function of features on a graph $G = (V, E)$ with edges $E$ and nodes $V$. The input is comprised of a feature vector $x_v$ for every node $v$, summarized in a feature matrix $X$, and a representative description of the link structure in the form of an adjacency matrix $A$. The output of a convolutional layer is a node-level latent representation matrix $H^{(l)} \in \mathbb{R}^{N \times F}$, where $F$ is the number of output latent dimensions per node. Therefore, every convolutional layer can be written as a non-linear function

$$H^{(l+1)} = f(H^{(l)}, A),$$

with $H^{(0)} = X$ and $H^{(L)} = Z$, $L$ being the number of stacked layers. The vanilla GNN model employed here uses the propagation rule [3]:

$$f(H^{(l)}, A) = \sigma\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right),$$

with $\hat{A} = A + I$, $I$ being the identity matrix. $\hat{D}$ is the diagonal node degree matrix of $\hat{A}$, $W^{(l)}$ is a weight matrix for the $l$-th neural network layer and $\sigma$ is a non-linear activation function. Taking the latent node representations $Z = H^{(L)}$ of the last layer, we define the prediction $\hat{y}_v$ of node $v$ for a node classification task as follows:

$$\hat{y}_v = \mathrm{softmax}(z_v W_c),$$

where $W_c$ projects the node representations into the $|C|$-dimensional classification space.
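The propagation rule above can be implemented directly. Below is a minimal numpy sketch of a single layer; a real model would stack several such layers and learn the weight matrices.

```python
import numpy as np

def gcn_layer(H, A, W, activation=np.tanh):
    """One vanilla GCN propagation step [3]:
    H_next = sigma(D_hat^{-1/2} (A + I) D_hat^{-1/2} H W)."""
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))  # degree of A_hat
    D_inv_sqrt = np.diag(d_inv_sqrt)
    return activation(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)

# Tiny example: 2 nodes, identity features, a 2x3 weight matrix.
out = gcn_layer(np.eye(2), np.array([[0., 1.], [1., 0.]]), np.ones((2, 3)))
```

Because the two nodes are symmetric here, every output entry equals tanh(1.0).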
GNNExplainer: The GNNExplainer takes a trained GNN and its prediction(s), and it returns an explanation in the form of a small subgraph of the input graph together with a small subset of node features that are most influential for the prediction. For their selection, the mutual information between the GNN prediction and the distribution of possible subgraph structures is maximized through optimizing the conditional entropy.
CF-GNNExplainer: The CF-GNNExplainer works by perturbing input data at the instance level. The instances are nodes in the graph, since the method focuses on node classification. It iteratively removes edges from the original adjacency matrix based on matrix sparsification techniques, keeping track of the perturbations that lead to a change in prediction, and returning the perturbation with the smallest change w.r.t. the number of edges.
ZORRO: ZORRO employs discrete masks to identify important input nodes and node features through a greedy algorithm, where nodes or node features are selected step by step. The goodness of the explanation is measured by the expected deviation from the prediction of the underlying model. A subgraph of the node's computational graph and its set of features are relevant for a classification decision if the expected classifier score remains nearly the same when randomizing the remaining features.
PGExplainer: The PGExplainer learns approximated discrete masks for edges to explain the predictions. Given an input graph, it first obtains an embedding for each edge by concatenating the embeddings of its endpoint nodes. A predictor then uses the edge embeddings to predict the probability of each edge being selected, similar to an importance score. The approximated discrete masks are then sampled via the reparameterization trick. Finally, the objective function maximises the mutual information between the original predictions and the new predictions.
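The sampling step can be sketched as a binary concrete relaxation. This is a numpy illustration under the assumption that per-edge selection probabilities are parameterized as logits; the actual PGExplainer implementation differs in details.

```python
import numpy as np

def sample_edge_mask(logits, temperature=0.5, rng=None):
    """Sketch of the reparameterization trick for approximated discrete
    edge masks (binary concrete relaxation):
    mask = sigmoid((logits + logistic noise) / temperature).
    Low temperatures push the mask values toward 0 or 1 while keeping
    the sampling differentiable w.r.t. the logits."""
    rng = rng or np.random.default_rng()
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=np.shape(logits))
    noise = np.log(u) - np.log(1.0 - u)  # logistic noise
    return 1.0 / (1.0 + np.exp(-(logits + noise) / temperature))
```

Because the noise enters additively and only deterministic, differentiable operations follow, gradients can flow back to the edge-probability predictor during training.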