Many important real-world data sets are graphs or networks. These include social networks, knowledge graphs, protein-protein interaction networks, the World Wide Web and many more. Graph neural networks (GNNs) leverage link structure to encode information and also incorporate node feature information; they currently achieve state-of-the-art results on many prediction tasks. Like other connectionist models, GNNs lack transparency in their decision-making. Explaining GNNs is still at an early stage of research, but since graphs are particularly expressive in encoding context, they are a promising candidate for producing rich explanations. The most popular type of GNN explainer method is perturbation-based, where variations in the output are studied with respect to different input perturbations.
When developing any explainable method, it is important to evaluate its performance using valid procedures and metrics. With this in mind, we explore the evaluation methods employed by perturbation-based explainer methods in the GNN domain. Validating explanations is generally a challenging task because a ground-truth explanation is not always available. Even for synthetically generated datasets with ground-truth explanations, this approach can be error-prone. While validation schemes do exist, they lack maturity.
Many explainer methods come with differing evaluation protocols, which makes them difficult to compare. However, some of these protocols overlap or have been adopted by others, e.g. with several papers [18, 13, 15] using state-of-the-art explainer method  and its evaluation protocol as benchmark. As GNN explainer methods become more and more popular, it is vital to avoid evaluation pitfalls and, as a next step, to introduce a standard evaluation approach. We therefore perform an empirical study on perturbation-based explanations for GNNs, with focus on . Our contributions include identifying pitfalls in three main areas: (1) the synthetic data generation process, (2) the evaluation metrics, and (3) the final presentation of the explanation. For each pitfall we propose a remedy to increase the validity of the evaluation.
2 Terminology and Concepts
For perturbation-based explainer methods for GNNs, the output consists of masks that indicate important input features. Depending on the explanation task, these are node masks, edge masks or node feature masks. Three different types of masks have been proposed: soft masks (GNNExplainer, CF-GNNExplainer), discrete masks (ZORRO) and approximated discrete masks (PGExplainer). These masks are then applied to the input graph(s) and fed into the trained GNN to carry out predictions; the objective function encourages these predictions to be similar to the original prediction. The currently established explanation evaluation scheme for perturbation-based explainer methods consists of a data generation, a GNN training, and a mask generation step, as shown in Figure 1. The generated synthetic data is comprised of a base graph and a specific motif (1), which are connected randomly and additionally perturbed by noise. A GNN is applied to the graph to execute a prediction task, e.g. node classification (2). In the next step, the explainer method generates masks of the receptive field (3). If the explainer method outputs soft masks, thresholding is needed to arrive at the final explainer subgraph (4). Please refer to the Appendix 0.A for detailed background on GNNs and perturbation-based explainer methods.
Ground-truth explanation: The ground-truth explanation is a particular motif that is used during the synthetic data set generation, e.g. the “house motif” shown in Figure 1 (1). The motif is embedded into a larger graph and perturbed with noise.
Ground-truth label: The ground-truth label is the respective class a node (or graph) is assigned to.
Explainer subgraph with importance scores: The explainer method assigns importance scores to the edges, indicating their influence in the prediction by the GNN, as shown in Figure 1 (3).
Threshold application: In order to arrive at a compact subgraph, a threshold is applied to reduce the explainer subgraph to the most important edges by removing all edges that fall below the threshold.
Final (explainer) subgraph: A reduced final explainer subgraph remains, as shown in Figure 1 (4).
Label flip: If the input to the GNN is changed, e.g. using the reduced final explainer subgraph instead of the original receptive field, a label flip can occur. This means that a different class is predicted than in the original prediction.
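The threshold application and the resulting final explainer subgraph can be sketched as follows. This is a minimal illustration under stated assumptions: the dictionary of edge scores, the function names `top_k_subgraph` and `score_threshold_subgraph`, and the example values are all hypothetical and not taken from any explainer implementation. Two variants are shown because a threshold can be expressed either as a number of edges to keep or as a minimum importance score.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int]


def top_k_subgraph(edge_scores: Dict[Edge, float], k: int) -> List[Edge]:
    """Keep the k highest-scoring edges (size-based threshold)."""
    ranked = sorted(edge_scores, key=edge_scores.get, reverse=True)
    return ranked[:k]


def score_threshold_subgraph(edge_scores: Dict[Edge, float], tau: float) -> List[Edge]:
    """Drop every edge whose importance score falls below tau."""
    return [edge for edge, score in edge_scores.items() if score >= tau]


# Hypothetical soft edge mask for a small receptive field.
soft_mask = {(0, 1): 0.9, (1, 2): 0.8, (2, 3): 0.2, (3, 4): 0.05, (0, 4): 0.7}

final_by_size = top_k_subgraph(soft_mask, k=3)
final_by_score = score_threshold_subgraph(soft_mask, tau=0.5)
```

In this toy case both variants keep the same three edges; in general they differ, and the choice of `k` or `tau` directly determines which edges reach the user.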
2.2 Synthetic Data
BA-Shapes: Node classification dataset with a base graph of 300 nodes and a set of 80 five-node “house”-structured network motifs, which are attached to randomly selected nodes of the base graph and function as ground-truth explanations. The resulting graph is further perturbed by adding 0.1N random edges, where N is the number of nodes. Nodes are assigned to 4 classes based on their structural roles. In a house-structured motif, as can be seen in Figure 1 (1), there are 3 types of roles: the top (yellow), shoulder (orange) and bottom (blue) nodes of the house. The fourth class comprises nodes that do not belong to a house (grey).
Tree-Cycles: Node classification dataset with two different labels that consists of a base 8-level balanced binary tree and 80 six-node cycle motifs, which are attached to random nodes of the base graph and function as ground-truth explanations.
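A BA-Shapes-like graph can be generated roughly as sketched below. This is a simplified illustration, not the original generator: the function name `ba_shapes_sketch`, the attachment parameter `m = 5`, the approximate preferential-attachment scheme, the choice to attach each house through one bottom node, and the interpretation of the noise level as a fraction of the node count are all assumptions made for the example.

```python
import random
from typing import List, Set, Tuple

Edge = Tuple[int, int]


def ba_shapes_sketch(n_base: int = 300, n_houses: int = 80, m: int = 5,
                     noise_frac: float = 0.1, seed: int = 0):
    rng = random.Random(seed)
    edges: Set[Edge] = set()

    def add_edge(u: int, v: int) -> None:
        edges.add((min(u, v), max(u, v)))

    # Base graph: approximate preferential attachment, where each new
    # node links to m distinct existing nodes drawn from a degree pool.
    degree_pool: List[int] = list(range(m))  # seed nodes, one entry each
    for new in range(m, n_base):
        targets: Set[int] = set()
        while len(targets) < m:
            targets.add(rng.choice(degree_pool))
        for t in targets:
            add_edge(new, t)
            degree_pool.extend([new, t])
    labels = [0] * n_base  # class 0: node outside any house

    # House motifs: top (class 1), two shoulders (2), two bottoms (3).
    for _ in range(n_houses):
        top, sl, sr, bl, br = range(len(labels), len(labels) + 5)
        for u, v in [(top, sl), (top, sr), (sl, sr), (sl, bl), (sr, br), (bl, br)]:
            add_edge(u, v)
        add_edge(bl, rng.randrange(n_base))  # assumed attachment point
        labels.extend([1, 2, 2, 3, 3])

    # Noise: add noise_frac * (number of nodes) previously absent edges.
    n_total = len(labels)
    n_noise = int(noise_frac * n_total)
    added = 0
    while added < n_noise:
        u, v = rng.randrange(n_total), rng.randrange(n_total)
        edge = (min(u, v), max(u, v))
        if u != v and edge not in edges:
            edges.add(edge)
            added += 1
    return edges, labels


edges, labels = ba_shapes_sketch()
```

With the default arguments the sketch yields 700 nodes: 300 base nodes and 80 houses with one top, two shoulder and two bottom nodes each.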
3 Pitfalls of Evaluation and Possible Remedies
3.1 Pitfall 1: Data Generation
For all 4 introduced explainer methods [10, 18, 13, 15], the synthetic datasets BA-Shapes and Tree-Cycles are used for evaluation. Their advantage lies in their intuitive motifs and labelling, which are easy for humans to understand. However, the defined ground-truth explanation, e.g. the “house motif”, while intuitive, does not necessarily align with the decision-making process of the GNN and hence does not necessarily represent the optimal explanation. Below, we compare the entropy of the ground-truth explanation to the entropy of other trivial explanations, including the entire receptive field of the GNN and the target node alone. Our results, obtained with a 3-layer vanilla Graph Convolutional Network, show that the proclaimed ground truth does not have consistently lower entropy than the trivial baselines, as shown in Table 1.
| Method | Top Nodes | Shoulder Nodes | Bottom Nodes |
For each house node type we see a need for a different type of ground-truth explanation, given the differences in entropy and accuracy (see Section 3.2). Following Occam’s razor, which suggests that the simplest explanation is best, the house motif is not the optimal ground-truth explanation for several types of nodes. Figure 2 (left) shows a number of different possible ground-truth explanations, including the top triangle, the bottom square, the target node “left shoulder” and right shoulder node, a bottom node, or the top node. While the assigned ground-truth explanation “house motif” does lead to the correct prediction in nearly all cases, so do the square and the triangle, with similarly low entropy. Being smaller, these motifs would be the more compact ground-truth explanation.
Remedy: We propose a new way of generating ground-truth explanations in order to better evaluate and compare explanations. The dataset should be constructed so that the identified ground-truth explanation, e.g. a specific motif, is the one with the lowest possible entropy. One way to achieve this is a motif search: the entropy and prediction accuracy of several different potential motifs around the target node are calculated, similarly to our approach in Figure 2. The best result is then chosen as the ground-truth explanation, which ensures maximal compactness and therefore comprehensibility of the explanation. This motif search step requires some additional work and resources, but it is a worthwhile trade-off: it ensures that the ground-truth explanation for a specific class prediction from a specific GNN is known, which supports the validity of the evaluation outcome. It also ensures that only essential parts of the graph are in the ground-truth explanation.
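The motif search described above could look roughly like the following sketch. It is an illustration under assumptions, not the procedure used for the reported results: `predict_proba` is a hypothetical callable that returns the class probabilities of the trained GNN when only the given edges are kept, `toy_predict_proba` is a stand-in for a real model, and the selection rule (smallest candidate among those that preserve the label and lie within a tolerance `tol` of the lowest entropy) is one possible reading of "best result".

```python
import math
from typing import Callable, Dict, FrozenSet, List, Optional, Tuple

Edge = Tuple[int, int]


def entropy(probs: List[float]) -> float:
    """Shannon entropy (natural log) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def motif_search(candidates: Dict[str, FrozenSet[Edge]],
                 predict_proba: Callable[[FrozenSet[Edge]], List[float]],
                 target_label: int, tol: float = 0.05) -> Optional[str]:
    scored = []
    for name, edges in candidates.items():
        probs = predict_proba(edges)
        predicted = max(range(len(probs)), key=probs.__getitem__)
        if predicted == target_label:  # discard motifs that flip the label
            scored.append((name, entropy(probs), len(edges)))
    if not scored:
        return None
    lowest = min(h for _, h, _ in scored)
    near_best = [s for s in scored if s[1] <= lowest + tol]
    # Among near-optimal motifs, prefer the most compact one.
    return min(near_best, key=lambda s: (s[2], s[1]))[0]


TRIANGLE = frozenset({(0, 1), (0, 2), (1, 2)})
SQUARE = frozenset({(1, 2), (1, 3), (2, 4), (3, 4)})
HOUSE = TRIANGLE | SQUARE


def toy_predict_proba(edges: FrozenSet[Edge]) -> List[float]:
    # Stand-in for a trained GNN: confident only if the triangle is present.
    return [0.02, 0.96, 0.02] if TRIANGLE <= edges else [0.5, 0.3, 0.2]


best = motif_search({"house": HOUSE, "triangle": TRIANGLE, "square": SQUARE},
                    toy_predict_proba, target_label=1)
```

In this toy setting the house and the triangle both preserve the label with identical entropy, so the search returns the triangle as the more compact candidate.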
3.2 Pitfall 2: Evaluation Metrics
For evaluating the accuracy of an explainer method, the ground-truth explanation has to be known. For synthetic datasets, graph motifs can be used as an approximate ground truth, even though the GNN might not make predictions as intuitively expected, as discussed in Section 3.1. When choosing a metric, many papers use the general term “accuracy” with widely differing definitions. In , accuracy is defined as the matching rate for important edges in explanations compared with those in the ground truths, whereas in , it is the proportion of explanations that are “correct”. In  and , accuracy is formalized as a binary classification task, where edges in the ground-truth explanation are treated as labels and the importance scores are viewed as prediction scores. This accuracy is equivalent to the area under the curve (AUC) of the receiver operating characteristic (ROC) curve, which is calculated on the prediction scores and not on the predicted classes. ROC AUC is limited in its capacity to evaluate explanations, as it only gives the probability that a randomly chosen positive instance (an edge in the ground-truth explanation) is ranked higher than a randomly chosen negative instance. However, when evaluating an explanation, we do care about the actual probability of the evaluation being correct, not just the ranking. Even more problematic is the tendency of ROC AUC to be misleading in situations with high class imbalance, as is the case here. Due to the large number of true negatives, i.e. edges that are neither in the ground-truth explanation nor in the final explainer subgraph, the false positive rate is pulled down substantially, which leads to an overly optimistic result.
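To make the effect of class imbalance concrete, consider a purely hypothetical receptive field (all numbers are illustrative assumptions, not results from this study) with 100 edges, of which 6 belong to the ground-truth explanation, and a final explainer subgraph of 12 edges that contains 3 of the ground-truth edges:

```latex
\begin{align*}
\mathrm{TP} &= 3, \quad \mathrm{FP} = 12 - 3 = 9, \quad
\mathrm{FN} = 6 - 3 = 3, \quad \mathrm{TN} = 94 - 9 = 85 \\
\mathrm{FPR} &= \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}} = \frac{9}{94} \approx 0.096 \\
\mathrm{TPR} &= \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} = \frac{3}{6} = 0.5 \\
\mathrm{Precision} &= \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} = \frac{3}{12} = 0.25
\end{align*}
```

The false positive rate stays below 0.1 even though three quarters of the selected edges are not part of the ground truth. A quantity based on the false positive rate can therefore look favourable in this regime, whereas precision exposes the problem directly.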
Remedy: Precision-recall (PR) curves and the corresponding AUC, as opposed to the ROC AUC, can provide a less misleading evaluation because they evaluate the fraction of true positives among positive predictions. Furthermore, threshold-dependent metrics should be included in the evaluation, both for comparability to hard mask methods and because a threshold has to be applied for the final explanation presented to the user. We propose to use recall to account for the sparsity of an explanation.
In standard terms, the true positives are the edges of the ground-truth explanation that appear in the final explainer subgraph, the false negatives are the ground-truth edges that are missing from it, and the edges that are in the final explainer subgraph but not in the ground-truth explanation are false positives. Reported at a given threshold, recall provides information about the compactness and therefore comprehensibility of the explainer subgraph. Table 2 shows the difference between avg. ROC AUC and avg. PR AUC. As expected, PR AUC does not reach the same level as ROC AUC, with a difference of up to 47 percentage points, and thus provides a more comprehensive picture of the explanation quality, as does the average recall.
| Class | ROC AUC | SD | PR AUC (proposed) | SD | Recall (proposed) | SD |
Table 2: Avg. ROC AUC, avg. PR AUC and avg. recall with standard deviations (SD) for GNNExplainer.
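The gap between the three metrics can be reproduced on toy data with the sketch below. Everything in it is an assumption made for illustration: the scores are invented, `average_precision` is used as one common step-wise estimate of the area under the PR curve, `recall_at_k` uses the standard definition of recall on the top-k edges, and ties between scores are not treated specially in the ranking-based functions.

```python
from typing import List


def roc_auc(scores: List[float], labels: List[int]) -> float:
    """Probability that a positive edge is ranked above a negative edge."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(scores: List[float], labels: List[int]) -> float:
    """Mean of the precision values at the rank of each positive edge."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits


def recall_at_k(scores: List[float], labels: List[int], k: int) -> float:
    """Fraction of ground-truth edges among the k highest-scoring edges."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(labels[i] for i in order[:k]) / sum(labels)


# Hypothetical receptive field: 6 ground-truth edges among 100 edges.
pos_scores = [0.9, 0.8, 0.7, 0.4, 0.35, 0.3]
neg_scores = [0.95, 0.85, 0.75, 0.65, 0.6, 0.55, 0.5, 0.45] + [0.1] * 86
scores = pos_scores + neg_scores
labels = [1] * len(pos_scores) + [0] * len(neg_scores)

auc = roc_auc(scores, labels)           # roughly 0.95
ap = average_precision(scores, labels)  # roughly 0.44
r6 = recall_at_k(scores, labels, 6)     # 0.5
r20 = recall_at_k(scores, labels, 20)   # 1.0
```

Although eight non-ground-truth edges outrank half of the ground-truth edges, the ROC AUC remains close to 0.95 because the 86 low-scoring negatives dominate the comparison. Average precision and recall at a small threshold reflect the weaker explanation.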
3.3 Pitfall 3: Threshold Application
Reducing the size of the original subgraph is a post-processing step, executed after training the GNN. During this step the originally predicted label can flip, in which case the fidelity of the explanation is no longer ensured. Fidelity refers to an explanation being faithful to the model it aims to explain.
For example, for  a label flip occurs for 66 of the top node explainer subgraphs with = 6. In this case, the final explainer subgraph leads to a different prediction than the original subgraph and therefore defeats its purpose of explaining the original prediction. Overall, for the BA-Shapes dataset, in  19 of labels flip, in  39 of labels flip and in  18 of labels flip.
Additionally, for soft mask explainer methods [10, 18], the size of the final explainer subgraph is parameterized. In the established evaluation scheme, a dedicated hyperparameter controls the size of the final explainer subgraph. For synthetic datasets, this hyperparameter is set according to knowledge about the ground-truth motif. Configuring the threshold in this way leaks information and unfairly biases the result. The resulting evaluation is therefore flawed, since the ground-truth size of the explainer subgraph is typically not known in practice. Choosing an appropriate threshold is not trivial, as the resulting recall can differ substantially for different thresholds, as can be seen in Table 3. This shows a necessary trade-off between the compactness and the completeness of the final explainer subgraph.
| Class | Recall th = 6 | Recall th = 20 |
Remedy: We highly recommend integrating an additional test that ensures no label flip has occurred for the final explainer subgraph. This avoids an explanation that leads to a different decision; in other words, it ensures explanation fidelity. A simple if-then mechanism that moves on to the next optimal explanation in case a label flip occurs would be sufficient.
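Such an if-then mechanism might be sketched as follows. The names are hypothetical: `ranked_candidates` is assumed to be a list of candidate explainer subgraphs ordered from most to least preferred by the explainer, and `predict_label` is assumed to return the class predicted by the trained GNN when only the given edges are kept. `toy_predict_label` merely stands in for a real model.

```python
from typing import Callable, FrozenSet, Iterable, Optional, Tuple

Edge = Tuple[int, int]
Subgraph = FrozenSet[Edge]


def first_faithful_explanation(ranked_candidates: Iterable[Subgraph],
                               predict_label: Callable[[Subgraph], int],
                               original_label: int) -> Optional[Subgraph]:
    """Return the first candidate that does not cause a label flip."""
    for subgraph in ranked_candidates:
        if predict_label(subgraph) == original_label:
            return subgraph  # fidelity check passed
    return None  # every candidate flipped the label


def toy_predict_label(subgraph: Subgraph) -> int:
    # Stand-in for a trained GNN: class 1 needs both of these edges.
    return 1 if {(0, 1), (1, 2)} <= subgraph else 0


candidates = [
    frozenset({(0, 1), (2, 3)}),          # preferred, but flips the label
    frozenset({(0, 1), (1, 2)}),          # next candidate, label preserved
    frozenset({(0, 1), (1, 2), (2, 3)}),
]
chosen = first_faithful_explanation(candidates, toy_predict_label, original_label=1)
```

Returning `None` when no candidate preserves the label makes the failure explicit instead of silently presenting an unfaithful explanation.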
Furthermore, for soft mask approaches, no knowledge about the ground truth should leak into the evaluation of the explanation and bias the result. We therefore recommend not setting the size of the final explainer subgraph to a fixed number of edges, but instead carrying out a grid search on a test set to choose the optimal threshold.
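One possible shape of such a grid search is sketched below. The selection criterion is an assumption, since the text above does not fix one: the sketch picks the smallest threshold whose final explainer subgraphs preserve the original label for at least a fraction `min_fidelity` of the held-out instances, so that no ground-truth motif size is needed. The names `grid_search_threshold`, `predict_label` and the toy data are hypothetical.

```python
from typing import Callable, Dict, FrozenSet, List, Optional, Sequence, Tuple

Edge = Tuple[int, int]
Instance = Tuple[Dict[Edge, float], int]  # (soft edge mask, original label)


def grid_search_threshold(instances: Sequence[Instance], grid: Sequence[int],
                          predict_label: Callable[[FrozenSet[Edge]], int],
                          min_fidelity: float = 0.9) -> Optional[int]:
    """Smallest k in grid whose top-k subgraphs rarely flip the label."""
    for k in sorted(grid):
        preserved = 0
        for edge_scores, original_label in instances:
            ranked = sorted(edge_scores, key=edge_scores.get, reverse=True)
            final_subgraph = frozenset(ranked[:k])
            preserved += predict_label(final_subgraph) == original_label
        if preserved / len(instances) >= min_fidelity:
            return k
    return None


def toy_predict_label(subgraph: FrozenSet[Edge]) -> int:
    # Stand-in for a trained GNN: class 1 needs both of these edges.
    return 1 if {(0, 1), (1, 2)} <= subgraph else 0


held_out: List[Instance] = [
    ({(0, 1): 0.9, (1, 2): 0.8, (2, 3): 0.7, (3, 4): 0.1}, 1),
    ({(0, 1): 0.9, (5, 6): 0.8, (1, 2): 0.7, (3, 4): 0.1}, 1),
]
k_strict = grid_search_threshold(held_out, [1, 2, 3, 4], toy_predict_label, 1.0)
k_loose = grid_search_threshold(held_out, [1, 2, 3, 4], toy_predict_label, 0.5)
```

Other criteria are conceivable, for example a weighted trade-off between subgraph size and fidelity. The point is only that the threshold is chosen from model behaviour on held-out data rather than from the known motif.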
The expressive nature of graphs makes them a promising candidate for producing rich explanations for GNN decision-making. But since a mature standardized approach to evaluating explanations for GNN explainer methods is missing, a valid comparison of different methods can be challenging. For this reason, we find it important to examine existing evaluation methods closely to uncover potential pitfalls. In this paper, we show the implications of three identified evaluation pitfalls in the context of GNNs and propose remedies to avoid them.
[1] Arrieta, Alejandro Barredo, et al. “Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI.” Information Fusion 58 (2020): 82-115.
[2] Huang, Qiang, et al. “Graphlime: Local interpretable model explanations for graph neural networks.” arXiv preprint arXiv:2001.06216 (2020).
[3] Kipf, Thomas N., and Max Welling. “Semi-supervised classification with graph convolutional networks.” ICLR (2017).
[4] Lecue, Freddy. “On the role of knowledge graphs in explainable AI.” Semantic Web 11.1 (2020): 41-51.
[5] Lucic, Ana, et al. “CF-GNNExplainer: Counterfactual Explanations for Graph Neural Networks.” arXiv preprint arXiv:2102.03322 (2021).
[6] Molnar, Christoph, Giuseppe Casalicchio, and Bernd Bischl. “Interpretable machine learning–a brief history, state-of-the-art and challenges.” Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, Cham, 2020.
[7] Robnik-Šikonja, Marko, and Marko Bohanec. “Perturbation-based explanations of prediction models.” Human and machine learning. Springer, Cham, 2018. 159-175.
[8] Saito, Takaya, and Marc Rehmsmeier. “The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets.” PloS one 10.3 (2015): e0118432.
[9] Xu, Keyulu, et al. “How powerful are graph neural networks?” ICLR (2018).
[10] Ying, Rex, et al. “Gnnexplainer: Generating explanations for graph neural networks.” Advances in neural information processing systems 32 (2019): 9240.
[11] Yuan, Hao, et al. “Xgnn: Towards model-level explanations of graph neural networks.” Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020.
[12] Zhou, Jie, et al. “Graph neural networks: A review of methods and applications.” AI Open 1 (2020): 57-81.
[13] Funke, Thorben, Megha Khosla, and Avishek Anand. “Hard Masking for Explaining Graph Neural Networks.” (2020).
[14] Yuan, Hao, et al. “Explainability in graph neural networks: A taxonomic survey.” arXiv preprint arXiv:2012.15445 (2020).
[15] Luo, D., W. Cheng, D. Xu, W. Yu, B. Zong, H. Chen, and X. Zhang. “Parameterized explainer for graph neural network.” Advances in neural information processing systems (2020).
[16] Anonymous. “Causal screening to interpret graph neural networks.” Submitted to International Conference on Learning Representations (2021), under review. Available: https://openreview.net/forum?id=nzKv5vxZfge
[17] Yuan, H., H. Yu, J. Wang, K. Li, and S. Ji. “On explainability of graph neural networks via subgraph explorations.” arXiv preprint arXiv:2102.05152 (2021).
[18] Lucic, Ana, et al. “CF-GNNExplainer: Counterfactual Explanations for Graph Neural Networks.” arXiv preprint arXiv:2102.03322 (2021).
Appendix 0.A Background on GNNs and Perturbation-Based Explainer Methods
For a GNN, the goal is to learn a function of features on a graph $G = (V, E)$ with edges $E$ and nodes $V$. The input is comprised of a feature vector $x_i$ for every node $i$, summarized in a feature matrix $X$, and a representative description of the link structure in the form of an adjacency matrix $A$. The output of the convolutional layer is a node-level latent representation matrix $Z \in \mathbb{R}^{|V| \times F}$, where $F$ is the number of output latent dimensions per node. Therefore, every convolutional layer can be written as a non-linear function:

$$H^{(l+1)} = f\left(H^{(l)}, A\right),$$

with $H^{(0)} = X$ and $H^{(L)} = Z$, $L$ being the number of stacked layers. The vanilla GNN model employed here uses the propagation rule:

$$f\left(H^{(l)}, A\right) = \sigma\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right),$$

with $\hat{A} = A + I$, $I$ being the identity matrix. $\hat{D}$ is the diagonal node degree matrix of $\hat{A}$, $W^{(l)}$ is a weight matrix for the $l$-th neural network layer and $\sigma$ is a non-linear activation function. Taking the latent node representations $Z$ of the last layer, we define the logits of node $i$ for a node classification task as follows:

$$\hat{y}_i = z_i W_c,$$

where $z_i$ is the $i$-th row of $Z$ and $W_c \in \mathbb{R}^{F \times k}$ projects the node representations into the $k$-dimensional classification space.
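A single layer of a vanilla graph convolutional network can be sketched in plain Python as follows. The sketch assumes the standard rule of Kipf and Welling [3], i.e. adding self-loops to the adjacency matrix, normalizing symmetrically by the node degrees, multiplying by the node representations and a weight matrix, and applying a non-linearity. The choice of ReLU as the activation and the helper names `matmul` and `gcn_layer` are assumptions for illustration.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adj: Matrix, h: Matrix, w: Matrix) -> Matrix:
    n = len(adj)
    # Add self-loops: A_hat = A + I.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Degrees of A_hat (the diagonal of D_hat).
    deg = [sum(row) for row in a_hat]
    # Symmetric normalization: D_hat^{-1/2} A_hat D_hat^{-1/2}.
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    # Propagate and transform, then apply ReLU (assumed activation).
    out = matmul(norm, matmul(h, w))
    return [[max(0.0, x) for x in row] for row in out]


# Two connected nodes with one-hot features and an identity weight matrix.
adjacency = [[0.0, 1.0], [1.0, 0.0]]
features = [[1.0, 0.0], [0.0, 1.0]]
weights = [[1.0, 0.0], [0.0, 1.0]]
hidden = gcn_layer(adjacency, features, weights)
```

For this two-node example every entry of the normalized adjacency matrix is 0.5, so each node ends up with the average of its own and its neighbour's features.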
GNNExplainer: The GNNExplainer takes a trained GNN and its prediction(s), and it returns an explanation in the form of a small subgraph of the input graph together with a small subset of node features that are most influential for the prediction. For their selection, the mutual information between the GNN prediction and the distribution of possible subgraph structures is maximized through optimizing the conditional entropy.
CF-GNNExplainer: The CF-GNNExplainer works by perturbing input data at the instance level. The instances are nodes in the graph, since the method is focused on node classification. The method iteratively removes edges from the original adjacency matrix based on matrix sparsification techniques, keeping track of the perturbations that lead to a change in prediction, and returning the perturbation with the smallest change w.r.t. the number of edges, after adding different edges to the subgraph.
ZORRO: ZORRO employs discrete masks to identify important input nodes and node features through a greedy algorithm, where nodes or node features are selected step by step. The goodness of the explanation is measured by the expected deviation from the prediction of the underlying model. A subgraph of the node’s computational graph and its set of features are relevant for a classification decision if the expected classifier score remains nearly the same when randomizing the remaining features.
PGExplainer: The PGExplainer learns approximated discrete masks for edges to explain the predictions. Given an input graph, it first obtains the embeddings for each edge by concatenating node embeddings. Then the predictor uses the edge embeddings to predict the probability of each edge being selected, similarly to an importance score. The approximated discrete masks are then sampled via the reparameterization trick. Finally, the objective function maximises the mutual information between the original predictions and new predictions.
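The sampling of approximated discrete masks via the reparameterization trick can be illustrated with a binary concrete relaxation, sketched below under assumptions. Each edge is assumed to have a learned logit; a uniform random number is turned into logistic noise, added to the logit, divided by a temperature and passed through a sigmoid. The function names and the clamping constant are hypothetical, and the sketch omits the learning of the logits entirely.

```python
import math
import random
from typing import List


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def relaxed_edge_mask(logit: float, u: float, temperature: float) -> float:
    """Relaxed mask value in (0, 1) for one edge, given uniform noise u in (0, 1)."""
    noise = math.log(u) - math.log(1.0 - u)  # logistic noise
    return sigmoid((noise + logit) / temperature)


def sample_edge_masks(logits: List[float], temperature: float,
                      rng: random.Random) -> List[float]:
    eps = 1e-12  # keep u strictly inside (0, 1) to avoid log(0)
    return [
        relaxed_edge_mask(w, min(max(rng.random(), eps), 1.0 - eps), temperature)
        for w in logits
    ]


rng = random.Random(0)
masks = sample_edge_masks([2.0, -1.0, 0.0], temperature=0.5, rng=rng)
```

Lower temperatures push the sampled values towards 0 or 1, which is why the result is described as an approximated discrete mask, while the sample stays differentiable with respect to the logits.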