
Discovering Invariant Rationales for Graph Neural Networks

Intrinsic interpretability of graph neural networks (GNNs) is to find a small subset of the input graph's features – rationale – which guides the model prediction. Unfortunately, the leading rationalization models often rely on data biases, especially shortcut features, to compose rationales and make predictions without probing the critical and causal patterns. Moreover, such data biases easily change outside the training distribution. As a result, these models suffer from a huge drop in interpretability and predictive performance on out-of-distribution data. In this work, we propose a new strategy of discovering invariant rationale (DIR) to construct intrinsically interpretable GNNs. It conducts interventions on the training distribution to create multiple interventional distributions. Then it approaches the causal rationales that are invariant across different distributions while filtering out the spurious patterns that are unstable. Experiments on both synthetic and real-world datasets validate the superiority of our DIR in terms of interpretability and generalization ability on graph classification over the leading baselines. Code and datasets are available at



1 Introduction

The eye-catching success of graph neural networks (GNNs) (Hamilton et al., 2017; Kipf and Welling, 2017; Dwivedi et al., 2020) provokes the rationalization task, answering "What knowledge drives the model to make certain predictions?". The goal of selective rationalization (aka. feature attribution) (Chang et al., 2020; Ying et al., 2019; Luo et al., 2020; Wang et al., 2021c) is to find a small subset of the input graph's features — rationale — which best guides or explains the model prediction. Discovering the rationale of a model helps audit its inner workings and justify its predictions. Moreover, it has tremendous impacts on real-world applications, such as finding functional groups to shed light on protein structure prediction (Senior et al., 2020).

Figure 1: Base Distribution of House Motif.

Two research lines of rationalization have recently emerged in GNNs. Post-hoc explainability (Ying et al., 2019; Luo et al., 2020; Yuan et al., 2021; Wang et al., 2021c) attributes a model’s prediction to the input graph with a separate explanation method, while intrinsic interpretability (Veličković et al., 2018; Gao and Ji, 2019) incorporates a rationalization module into the model to make transparent predictions. Here we focus on intrinsically interpretable GNNs. Among them, graph attention (Veličković et al., 2018) and pooling (Lee et al., 2019; Knyazev et al., 2019; Gao and Ji, 2019; Ranjan et al., 2020) operators prevail, which work as a computational block of a GNN to generate soft or hard masks on the input graph. They cast the learning paradigm of GNN as minimizing the empirical risk with the masked subgraphs, which are regarded as rationales to guide the model predictions.

Despite their appeal, recent studies (Chang et al., 2020; Knyazev et al., 2019) show that current rationalization methods are prone to exploiting data biases as shortcuts to make predictions and compose rationales. Typically, shortcuts result from confounding factors, sampling biases, and artifacts in the training data. Considering Figure 1, when most bases of House-motif graphs are Trees, a GNN does not need to learn the correct function to reach high accuracy on the motif type. Instead, it is much easier to learn the statistical shortcut linking the Tree base with the most frequently co-occurring House motif. Unfortunately, when facing out-of-distribution (OOD) data, such methods generalize poorly since the shortcuts have changed. Hence, shortcut-involved rationales hardly reveal the truly critical subgraphs for the predicted labels, being at odds with the true reasoning process that underlies the task of interest (Teney et al., 2020) and human cognition (Alvarez-Melis and Jaakkola, 2017).

Here we ascribe the failure on OOD data to the inability to identify causal patterns, which are stable under distribution shift. Motivated by recent studies on invariant learning (IL) (Arjovsky et al., 2019; Krueger et al., 2021; Chang et al., 2020; Bühlmann, 2018), we premise that different distributions elicit different environments of the data-generating process. We argue that the causal patterns for the labels remain stable across environments, while the relations between shortcut patterns and labels vary. Such environment-invariant patterns are more plausible and better qualified as rationales.

Aiming to identify rationales that capture the environment-invariant causal patterns, we formalize a learning strategy, Discovering Invariant Rationales (DIR), for intrinsically interpretable GNNs. One major problem is how to obtain multiple environments from a standard training set. Differing from the heterogeneous setting (Bühlmann, 2018) of existing IL methods, where environments are observable and attainable, DIR does not assume prior knowledge of environments. Instead, it generates distribution perturbations via causal intervention — interventional distributions (Tian et al., 2006; Pearl et al., 2016) — to instantiate environments and further distinguish the causal from the non-causal parts.

Guided by this idea, our DIR strategy consists of four modules: a rationale generator, a distribution intervener, a feature encoder, and two classifiers. Specifically, the rationale generator learns to split the input graph into causal and non-causal subgraphs, which are encoded by the encoder into representations. The distribution intervener then conducts causal interventions on the non-causal representations to create perturbed distributions, from which we can infer the invariant causal parts. Finally, the two classifiers are built upon the causal and non-causal parts, respectively, to generate the joint prediction, whose invariant risk is minimized across different distributions. On one synthetic and three real datasets, extensive experiments demonstrate that the generalization ability of DIR surpasses current state-of-the-art IL methods (Arjovsky et al., 2019; Krueger et al., 2021; Sagawa et al., 2019), and that the interpretability of DIR outperforms the attention- and pooling-based rationalization methods (Veličković et al., 2018; Gao and Ji, 2019). Our main contributions are:


  • We propose a novel invariant learning algorithm, DIR, for intrinsically interpretable models, which improves generalization ability and is applicable to any deep model.

  • We offer a causality-theoretic analysis to guarantee the superiority of DIR.

  • We provide the implementation of DIR for graph classification tasks, which consistently achieves excellent performance on three datasets with various generalization types.

2 Invariant Rationale Discovery

With a causal look at the data-generating process, we formalize the principle of discovering invariant rationales, which guides our discovery strategy. Throughout the paper, upper-case letters (e.g., $G$) denote random variables, while lower-case letters (e.g., $g$) denote their deterministic values.

2.1 Causal View of Data-Generating Process

Generating rationales for transparent predictions requires understanding the actual mechanisms of the task of interest. Without loss of generality, we focus on the graph classification task and present a causal view of the data-generating process behind it. Here we formalize this causal view as a Structural Causal Model (SCM) (Pearl et al., 2016; Pearl, 2000) by inspecting the causalities among four variables: the input graph $G$, the ground-truth label $Y$, the causal part $C$, and the non-causal part $S$. Figure 2(a) illustrates the SCM, where each link denotes a causal relationship between two variables.


  • $C \to G \leftarrow S$. The input graph $G$ consists of two disjoint parts: the causal part $C$ and the non-causal part $S$, such as the House motif and the Tree base in Figure 1.

  • $C \to Y$. By “causal part”, we mean that $C$ is the only endogenous parent that determines the ground-truth label $Y$. Taking the motif-base example in Figure 1 again, $C$ is the oracle rationale, which perfectly explains why the graph is labeled as $Y$.

  • $C \dashleftarrow\dashrightarrow S$. This dashed arrow indicates additional probabilistic dependencies (Pearl, 2000; Pearl et al., 2016) between $C$ and $S$. We consider three typical relationships here: (1) $C$ is independent of $S$, i.e., $C \perp S$; (2) $C$ is the direct cause of $S$, i.e., $C \to S$; and (3) there exists a common cause $E$, i.e., $C \leftarrow E \to S$. See Appendix B for the corresponding examples.

The dependencies $C \dashleftarrow\dashrightarrow S$ can create spurious correlations between the non-causal part $S$ and the ground-truth label $Y$. Assuming $C \leftarrow E \to S$, $E$ is a confounder between $C$ and $S$, which opens a backdoor path $S \leftarrow E \to C \to Y$, thus making $S$ and $Y$ spuriously correlated (Pearl et al., 2016). We systematize such spurious correlations as the probabilistic dependence of $Y$ on $S$. Wherein, we make a feature-induction assumption to avoid confusion between the induced subsets of $G$; see Appendix C for the formal assumption. Furthermore, data collected from different environments exhibit various spurious correlations (Teney et al., 2020; Arjovsky et al., 2019), e.g., one mostly picks House motifs with Tree bases as the training data, while another selects House motifs with Wheel bases as the testing data. Hence, such spurious correlations are unstable and variant across different distributions.

(a) SCM
(b) Interventional Distributions.
Figure 2: (a) Causal view of data-generating process; (b) Illustration of interventional distributions.

2.2 Task Formalization of Invariant Rationalization

Oracle Rationale. With causal theory (Pearl et al., 2016; Pearl, 2000), for each variable $X$ in an SCM, there exists a directed link from each of its parent variables $PA(X)$ to $X$, if and only if the causal mechanism $X = f_X(PA(X), \epsilon_X)$ persists, where $\epsilon_X$ is the exogenous noise of $X$. For simplicity, we omit the exogenous noise and simplify the mechanism as $X = f_X(PA(X))$. Hence, there exists a function $f^*_Y$ in our SCM, where the “oracle rationale” $C$ satisfies:

$$Y = f^*_Y(C), \quad \text{with } C \perp S, \tag{1}$$

where $C \perp S$ indicates that $C$ shields $Y$ from the influence of $S$, making the causal relationship $Y = f^*_Y(C)$ invariant across different $S$.

Rationalization. In general, only the pairs $(g, y)$ of input and label are observed during training, while neither the oracle rationale $C$ nor the oracle structural equation model $f^*_Y$ is available. The absence of oracles calls for the study of intrinsic interpretability. We systematize an intrinsically-interpretable GNN as a combination of two modules, i.e., $f = f_{\hat{Y}} \circ f_{\hat{C}}$, where $f_{\hat{C}}$ discovers the rationale $\hat{C}$ from the observed $G$, and $f_{\hat{Y}}$ outputs the prediction $\hat{Y}$ to approach $Y$. Distinct from $C$ and $Y$, which are variables in the causal mechanism, $\hat{C}$ and $\hat{Y}$ represent variables in the modeling process that approximate $C$ and $Y$. To optimize these modules, most current intrinsically-interpretable GNNs (Veličković et al., 2018; Lee et al., 2019; Knyazev et al., 2019; Gao and Ji, 2019; Ranjan et al., 2020) adopt the learning strategy of minimizing the empirical risk:

$$\min_{f} \; \mathcal{R}(f(G), Y), \tag{2}$$

where $\mathcal{R}$ is the risk function, which can be the cross-entropy loss. Nevertheless, this learning strategy relies heavily on the statistical associations between input features and labels, and can exhibit non-causal rationales.

Invariant Rationalization. We ascribe this limitation to ignoring $C \perp S$ in Equation 1, which is crucial for refining the causal relationship $Y = f^*_Y(C)$ that is invariant across different $S$. By introducing this independence, we formalize the task of invariant rationalization as:

$$\min_{f} \; \mathcal{R}(f(G), Y), \quad \text{s.t. } \hat{Y} \perp \hat{S} \mid \hat{C}, \tag{3}$$

where $\hat{S} = G \setminus \hat{C}$ is the complement of $\hat{C}$. This formulation encourages the rationale $\hat{C}$ to seek the patterns that are stable across different distributions, while discarding the unstable patterns.

2.3 Principle & Learning Strategy of DIR

Interventional Distribution. It is difficult to recover the oracle rationale directly from the joint distribution over inputs and labels — that is, the causal and non-causal relations are hardly distinguishable from each other. We draw inspiration from invariant learning (Arjovsky et al., 2019; Krueger et al., 2021; Chang et al., 2020), which constructs different environments to infer invariant features or predictors. To obtain the environments, previous studies mostly partition the training set by prior knowledge (Teney et al., 2020) or adversarial environment inference (Creager et al., 2021; Wang et al., 2021b). Different from partitioning the training data, we do not assume prior knowledge of environments but instead introduce interventional distributions (Tian et al., 2006; Pearl et al., 2016) to model the DIR task. Specifically, on top of our SCM, we generate the $s$-interventional distribution by performing the intervention $do(S=s)$, which removes every link from the parents to the variable $S$ and fixes $S$ to the specific value $s$. By stratifying different values $s$, we obtain multiple $s$-interventional distributions.

With interventional distributions, we propose the principle of discovering invariant rationale (DIR) to identify a rationale whose relationship with the label is stable across different distributions.

Definition 1 (DIR Principle)

An intrinsically-interpretable model $f$ satisfies the DIR principle if it:

  1. minimizes all $s$-interventional risks: $\min_f \mathbb{E}_s[\mathcal{R}(f(G), Y \mid do(S=s))]$, and simultaneously

  2. minimizes the variance of the various $s$-interventional risks: $\min_f \mathrm{Var}_s(\{\mathcal{R}(f(G), Y \mid do(S=s))\})$,

where the $s$-interventional risk $\mathcal{R}(f(G), Y \mid do(S=s))$ is defined over the $s$-interventional distribution for a specific $s$.

Guided by the proposed principle, we design the learning strategy of DIR as:

$$\min_{f} \; \mathbb{E}_s[\mathcal{R}(f(G), Y \mid do(S=s))] + \lambda\, \mathrm{Var}_s(\{\mathcal{R}(f(G), Y \mid do(S=s))\}), \tag{4}$$

where $\mathcal{R}(f(G), Y \mid do(S=s))$ computes the risk under the $s$-interventional distribution, which we elaborate in Section 2.4; $\mathrm{Var}_s$ calculates the variance of the risks over the different $s$-interventional distributions; and $\lambda$ is a hyper-parameter controlling the strength of invariant learning.
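To make the objective concrete, here is a minimal, hypothetical sketch of how the mean-plus-variance risk in Equation 4 could be computed given per-intervention risk values (the function name and numbers are illustrative, not the paper's implementation):

```python
import numpy as np

def dir_objective(interventional_risks, lam=1.0):
    """Mean of the s-interventional risks plus a weighted variance
    penalty that enforces invariance across interventions."""
    risks = np.asarray(interventional_risks, dtype=float)
    return risks.mean() + lam * risks.var()

# A rationale whose risk is stable across interventions is preferred
# over one whose risks fluctuate, even at a comparable mean.
stable = dir_objective([0.50, 0.52, 0.51], lam=10.0)
unstable = dir_objective([0.30, 0.90, 0.40], lam=10.0)
assert stable < unstable
```

The variance term is what distinguishes DIR from plain risk minimization: it penalizes rationales whose predictive power depends on the intervened non-causal part.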

Justification. We theoretically justify the DIR principle’s ability to discover invariant rationales. Specifically, Theorem 1 shows that the oracle model satisfies the DIR principle. Moreover, we show that the oracle rationale $C$ can be inferred by making the intrinsically-interpretable model conform to the DIR principle under a uniqueness condition (cf. Corollary 1). We leave the detailed proofs to Appendix C due to limited space. By making the distribution-dependent risks indifferent while pursuing low risk, the DIR principle discovers invariant rationales $\hat{C}$ that approximate the oracle rationale $C$, while encouraging $f_{\hat{Y}}$ to approach the oracle model $f^*_Y$.

2.4 DIR-Guided Implementation of Intrinsically-Interpretable GNNs

With the DIR principle and objective, we present how to implement intrinsically-interpretable GNNs. We summarize the key notations of this section in Appendix A for clarity. Following Equation 2, a model with intrinsic interpretability consists of two modules, $f = f_{\hat{Y}} \circ f_{\hat{C}}$, where $f_{\hat{C}}$ extracts a possible rationale and $f_{\hat{Y}}$ makes predictions based on the rationale. Moreover, to establish the $s$-interventional distributions, we design an additional module to perform the interventions. In a nutshell, our framework consists of four components, as Figure 3 shows.

Figure 3: DIR Implementation on GNNs, which includes a rationale generator, a distribution intervener, an encoder, and two classifiers. For inference, we only use the causal-branch prediction $\hat{y}_{\hat{c}}$ as the final prediction.

Rationale Generator. It aims to split the input graph instance $g$ into two subgraphs: causal part $\hat{c}$ and non-causal part $\hat{s}$. Specifically, given an input graph instance $g$ with node set $\mathcal{V}$ and edge set $\mathcal{E}$, its adjacency matrix is $A \in \{0,1\}^{|\mathcal{V}| \times |\mathcal{V}|}$, where $A_{ij} = 1$ denotes the edge from node $i$ to node $j$, and $A_{ij} = 0$ otherwise. The rationale generator first adopts a GNN to generate the mask matrix $M$ on $A$, where mask $M_{ij}$ indicates the importance of edge $(i, j)$:

$$M = \sigma(Z Z^\top), \quad Z = \mathrm{GNN}_1(g), \tag{5}$$

where $\sigma$ is the sigmoid function and $Z \in \mathbb{R}^{|\mathcal{V}| \times d}$ summarizes the $d$-dimensional representations of all nodes. The generator then selects the edges with the highest masks to construct the rationale $\hat{c}$ and collects $\hat{c}$’s complement as $\hat{s}$, as follows:

$$\mathcal{E}_{\hat{c}} = \mathrm{Top}_r(M \odot A), \quad \mathcal{E}_{\hat{s}} = \mathcal{E} \setminus \mathcal{E}_{\hat{c}}, \tag{6}$$

where $\mathcal{E}_{\hat{c}}$ and $\mathcal{E}_{\hat{s}}$ are the edge sets of $\hat{c}$ and $\hat{s}$, respectively; $\mathrm{Top}_r$ selects the top-ranked edges by mask value, with the selection ratio $r$ as a hyper-parameter (e.g., $r = 0.5$); $\odot$ is the element-wise product. Having obtained the edge sets, we can distill the nodes appearing in the edges to establish $\hat{c}$ and $\hat{s}$.
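The split described above can be sketched as follows; this is an illustrative toy in which the node representations Z are taken as given rather than produced by a trained GNN, and `split_rationale` is a hypothetical helper:

```python
import numpy as np

def split_rationale(Z, A, r=0.5):
    """Split a graph's edges into causal / non-causal sets.

    Z: (n, d) node representations (stand-in for a GNN's output).
    A: (n, n) binary adjacency matrix.
    r: ratio of edges kept as the rationale.
    """
    M = 1.0 / (1.0 + np.exp(-Z @ Z.T))   # edge mask M = sigmoid(Z Z^T)
    edges = [(i, j) for i in range(A.shape[0])
                    for j in range(A.shape[1]) if A[i, j]]
    edges.sort(key=lambda e: M[e], reverse=True)  # rank edges by mask score
    k = max(1, int(r * len(edges)))
    return set(edges[:k]), set(edges[k:])  # (E_causal, E_noncausal)
```

The two returned edge sets induce the causal subgraph and its complement once the incident nodes are collected.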

Distribution Intervener. It targets creating interventional distributions. Formally, it first collects the non-causal parts of all training instances into a memory bank $\{\hat{s}_i\}$. It next samples a memory $\hat{s}_j$ to conduct the intervention $do(S = \hat{s}_j)$, replacing the complement of the causal subgraph $\hat{c}_i$ at hand and constructing an intervened pair $(\hat{c}_i, \hat{s}_j)$, where $i, j$ are instance indices.
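A toy sketch of the intervener (names are hypothetical): collect non-causal parts in a memory bank, then pair each causal part with sampled bank entries to emulate the interventions.

```python
import random

def make_interventions(causal_parts, noncausal_bank, n_samples=2, seed=0):
    """Pair each causal subgraph c_i with sampled non-causal parts s_j
    drawn from the memory bank, emulating the intervention do(S = s_j)."""
    rng = random.Random(seed)
    pairs = []
    for c in causal_parts:
        for s in rng.sample(noncausal_bank, min(n_samples, len(noncausal_bank))):
            pairs.append((c, s))
    return pairs
```

Each pair defines one sample from one $s$-interventional distribution, so a batch of pairs yields the multiple distributions the DIR objective ranges over.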

Graph Encoder & Classifiers. Here we represent $f_{\hat{Y}}$ as a combination of a graph encoder and two classifiers. Specifically, it employs another GNN encoder on $\hat{c}$ to generate node representations $Z_{\hat{c}}$, and then combines them into a graph representation $h_{\hat{c}}$ via a global pooling operator, e.g., average pooling. It then uses a classifier $\Phi_{\hat{c}}$ to project the graph representation into a probability distribution $\hat{y}_{\hat{c}}$ over class labels. More formally, the process is as follows:

$$Z_{\hat{c}} = \mathrm{GNN}_2(\hat{c}), \quad h_{\hat{c}} = \mathrm{Pool}(Z_{\hat{c}}), \quad \hat{y}_{\hat{c}} = \Phi_{\hat{c}}(h_{\hat{c}}). \tag{7}$$

Analogously, we can obtain $\hat{y}_{\hat{s}}$ for $\hat{s}$ via the shared encoder and another classifier $\Phi_{\hat{s}}$. $\hat{y}_{\hat{c}}$ is the prediction based merely on the causal part $\hat{c}$, while $\hat{y}_{\hat{s}}$ measures the predictive power of the intervened part $\hat{s}$. Inspired by Cadène et al. (2019), we formulate the joint prediction under the intervention as $\hat{y}_{\hat{c}}$ masked by $\hat{y}_{\hat{s}}$:

$$\hat{y}^{(\hat{s})} = \sigma(\hat{y}_{\hat{s}}) \odot \hat{y}_{\hat{c}}, \tag{8}$$

where the sigmoid over $\hat{y}_{\hat{s}}$ adjusts the output logits of $\hat{y}_{\hat{c}}$ to compensate for the spurious biases. In Appendix E, we present examples of how this operation helps discover the causal part.
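A minimal numeric sketch of this masking (hypothetical helper; plain logits stand in for the two classifiers' outputs): the causal-branch logits are scaled by the sigmoid of the shortcut-branch logits.

```python
import numpy as np

def joint_prediction(logits_c, logits_s):
    """Mask the causal prediction by the confidence of the shortcut
    branch under the current intervention: sigmoid(y_s) * y_c."""
    logits_c = np.asarray(logits_c, dtype=float)
    logits_s = np.asarray(logits_s, dtype=float)
    return (1.0 / (1.0 + np.exp(-logits_s))) * logits_c
```

When the shortcut branch is confident (large positive logit), the causal logit passes through nearly unchanged; a strongly negative shortcut logit damps it, letting the optimization attribute intervention-dependent behavior to the shortcut branch.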

Optimization. Having established the prediction $\hat{y}^{(\hat{s})}$ of an instance under the intervention $do(S = \hat{s})$, we obtain the $\hat{s}$-interventional risk analogous to Equation 4 as follows:

$$\mathcal{R}_{\hat{s}} = \frac{1}{|\mathcal{D}|} \sum_{(g, y) \in \mathcal{D}} \ell(\hat{y}^{(\hat{s})}, y), \tag{9}$$

where $(g, y)$ is a pair of a graph instance and its ground-truth label from the training set $\mathcal{D}$, and $\ell$ denotes the loss function on a single instance. Moreover, we define the loss for the shortcut branch $\Phi_{\hat{s}}$ as:

$$\mathcal{L}_{\hat{s}} = \frac{1}{|\mathcal{D}|} \sum_{(g, y) \in \mathcal{D}} \ell(\hat{y}_{\hat{s}}, y). \tag{10}$$

$\mathcal{L}_{\hat{s}}$ is only backpropagated to the classifier $\Phi_{\hat{s}}$, and we set apart the other components from its backpropagation to avoid interference with representation learning. Thus, this loss promotes the shortcut-only branch to learn the spurious biases from the non-causal features alone. Overall, we jointly optimize these components via the DIR objective and the shortcut loss, i.e.,

$$\min_{\theta} \; \mathbb{E}_{\hat{s}}[\mathcal{R}_{\hat{s}}] + \lambda\, \mathrm{Var}_{\hat{s}}(\{\mathcal{R}_{\hat{s}}\}) + \mathcal{L}_{\hat{s}}, \tag{11}$$

where $\theta$ collects the parameters of the generator, the encoder, and the two classifiers. In the inference phase, we yield $\hat{c}$ and $\hat{y}_{\hat{c}}$ as the causal rationale and the causal prediction of a testing graph $g$, which exclude the influence of the non-causal part $\hat{s}$.
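Putting the risk terms together, the overall loss can be sketched as below. This is a hypothetical sketch: a real implementation computes the risks from batches and detaches the shortcut loss from everything except the shortcut classifier.

```python
import numpy as np

def dir_total_loss(interventional_risks, shortcut_losses, lam=1.0):
    """Total DIR loss sketch: mean + lam * variance over the
    s-interventional risks, plus the averaged shortcut loss.
    Note: in an actual implementation, the gradient of the shortcut
    term should reach only the shortcut classifier."""
    r = np.asarray(interventional_risks, dtype=float)
    return r.mean() + lam * r.var() + float(np.mean(shortcut_losses))
```

With identical interventional risks the variance vanishes and the loss reduces to the ordinary risk plus the shortcut term, matching the intuition that invariant rationales incur no penalty.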

3 Experiments

In this section, we conduct extensive experiments to answer the research questions:


  • RQ1: How effective is DIR in discovering causal features and improving model generalization?

  • RQ2: What are the learning patterns and insights of DIR training? Especially, how does invariant rationalization help to improve generalization?

3.1 Settings

Datasets. We use one synthetic dataset and three real datasets for graph classification tasks. Different GNNs are used for different datasets to implement DIR, and early stopping is employed during training. Here we briefly introduce the datasets, while the details of dataset statistics, deployed GNNs, and the training process are summarized in Appendix D.


  • Spurious-Motif is a synthetic dataset created by following Ying et al. (2019). Each graph is composed of one base (Tree, Ladder, or Wheel) and one motif (Cycle, House, or Crane). The ground-truth label is determined solely by the motif. Moreover, we manually construct spurious relations of varying degrees between the base and the label in the training set. Specifically, in the training set, we sample each motif from a uniform distribution, while the distribution of its base is conditioned on the motif through a bias degree $b$. We manipulate $b$ to create Spurious-Motif datasets of distinct biases. In the testing set, the motifs and bases are randomly attached to each other. Besides, we include graphs with large bases to further magnify the distribution gaps.

  • MNIST-75sp (Knyazev et al., 2019) converts the MNIST images into superpixel graphs with at most 75 nodes per graph. The nodes are superpixels, while the edges encode the spatial distances between them. Every graph is labeled as one of the 10 digit classes. Random noise is added to the node features in the testing set.

  • Graph-SST2 (Yuan et al., 2020; Socher et al., 2013). Each graph is labeled by its sentence sentiment and consists of nodes representing tokens and edges indicating node relations. Graphs are split into different sets according to their average node degree to create dataset shifts.

  • Molhiv (OGBG-Molhiv) (Hu et al., 2020, 2021; Wu et al., 2017) is a molecular property prediction dataset consisting of molecule graphs, where nodes are atoms, and edges are chemical bonds. Each graph is labeled according to whether a molecule inhibits HIV replication or not.

Baselines. We thoroughly compare DIR with Empirical Risk Minimization (ERM) and two classes of baselines:


  • Interpretable Baselines: Graph Attention (Veličković et al., 2018) and graph pooling operations including ASAP (Ranjan et al., 2020), Top-k Pool (Gao and Ji, 2019), and SAG Pool (Lee et al., 2019). We use their generated masks on graph structures as rationales. We also include GSN (Bouritsas et al., 2020), a topologically-aware message-passing scheme that enriches GNNs with interpretable structural features.

  • Robust/Invariant Learning Baselines: Group DRO (Sagawa et al., 2019), IRM (Arjovsky et al., 2019), and V-REx (Krueger et al., 2021). This class of algorithms improves robustness and generalization for GNNs, helping the models generalize better to unseen groups or out-of-distribution data. We use random groups or partitions during model training.

We also include an ablation model of DIR, DIR-Var, which sets $\lambda = 0$, i.e., discards the variance term in Equation 4, to show the effectiveness of the variance regularization in the DIR objective.

Metrics. We use ROC-AUC for Molhiv and accuracy (ACC) for the other three datasets. Moreover, for the Spurious-Motif dataset, we use a precision metric to evaluate the agreement between model rationales and ground-truth rationales, thereby quantitatively validating interpretability.
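For reference, such a precision metric can be computed as below (a sketch; the edge representation and the ground-truth set are illustrative):

```python
def precision_at_k(ranked_edges, ground_truth_edges, k=5):
    """Fraction of the top-k ranked rationale edges that fall inside
    the ground-truth rationale (e.g., the motif edges in Spurious-Motif)."""
    top = ranked_edges[:k]
    return sum(1 for e in top if e in ground_truth_edges) / k

# Edges ranked by mask score; the ground truth is the House motif.
ranked = [(0, 1), (1, 2), (2, 3), (7, 8), (3, 4)]
motif = {(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)}
assert precision_at_k(ranked, motif, k=5) == 0.8
```

A perfect rationale generator ranks all motif edges above base edges, giving precision 1.0.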

3.2 Main Results (RQ1)

| Method | Spurious-Motif (Balanced) | Spurious-Motif (b=0.5) | Spurious-Motif (b=0.7) | Spurious-Motif (b=0.9) | MNIST-75sp | Graph-SST2 | Molhiv |
|---|---|---|---|---|---|---|---|
| ERM | 42.99±1.93 | 39.69±1.73 | 38.93±1.74 | 33.61±1.02 | 12.71±1.43 | 81.44±0.59 | 76.20±1.14 |
| Attention | 43.07±2.55 | 39.42±1.50 | 37.41±0.86 | 33.46±0.43 | 15.19±2.62 | 81.57±0.71 | 75.84±1.33 |
| ASAP | 44.44±8.19 | 44.25±6.87 | 39.19±4.39 | 31.76±2.89 | 15.54±1.87 | 81.57±0.84 | 73.81±1.17 |
| Top-k Pool | 43.43±8.79 | 41.21±7.05 | 40.27±7.12 | 33.60±0.91 | 14.91±3.25 | 79.78±1.35 | 73.01±1.65 |
| SAG Pool | 45.23±6.76 | 43.82±6.32 | 40.45±7.50 | 33.60±1.18 | 14.31±2.44 | 80.24±1.72 | 73.26±0.84 |
| GSN | 43.18±5.65 | 34.67±1.21 | 34.03±1.69 | 32.60±1.75 | 19.03±2.39 | 82.54±1.16 | 74.53±1.90 |
| Group DRO | 41.51±1.11 | 39.38±0.93 | 39.32±2.23 | 33.90±0.52 | 15.13±2.83 | 81.29±1.44 | 75.44±2.70 |
| V-REx | 42.83±1.59 | 39.43±2.69 | 39.08±1.56 | 34.81±2.04 | 18.92±1.41 | 81.76±0.08 | 75.62±0.79 |
| IRM | 42.26±2.69 | 41.30±1.28 | 40.16±1.74 | 35.12±2.71 | 18.62±1.22 | 81.01±1.13 | 74.46±2.74 |
| DIR-Var | 45.87±2.61 | 43.81±1.93 | 42.69±1.77 | 37.12±1.56 | 17.74±4.17 | 81.74±0.89 | 76.05±0.86 |
| DIR | 47.03±2.46 | 45.50±2.15 | 43.36±1.64 | 39.87±0.56 | 20.36±1.78 | 83.29±0.53 | 77.05±0.57 |

Table 1: Performance on the synthetic dataset and the real datasets (mean±std). In the Spurious-Motif dataset, $b$ indicates the degree of the confounding effect.
| Model | Balanced | b=0.5 | b=0.7 | b=0.9 |
|---|---|---|---|---|
| Attention | 0.183±0.018 | 0.183±0.130 | 0.182±0.014 | 0.134±0.013 |
| ASAP | 0.187±0.030 | 0.188±0.023 | 0.186±0.027 | 0.121±0.021 |
| Top-k Pool | 0.215±0.061 | 0.207±0.057 | 0.212±0.056 | 0.148±0.018 |
| SAG Pool | 0.212±0.033 | 0.198±0.062 | 0.201±0.064 | 0.136±0.014 |
| DIR | 0.257±0.014 | 0.255±0.016 | 0.247±0.012 | 0.192±0.044 |

Table 2: Precision@5 on Spurious-Motif.

To fairly compare the methods, we train each model under the same training settings as described in Appendix D. The overall results are summarized in Table 1, and we have the following observations:


  1. DIR has better generalization ability than the baselines. DIR outperforms the baselines consistently and by a large margin. Specifically, on MNIST-75sp, DIR surpasses ERM by 7.65% and ASAP by 4.82%. Although structural features are shown to be helpful in mitigating feature distribution shift, DIR still performs better than GSN. On Graph-SST2 and Molhiv, DIR achieves the highest performance with low variance. On Spurious-Motif, DIR outstrips IRM by 4.23% and SAG Pool by 3.16% on average across different degrees of spurious bias. Such improvements strongly validate that DIR generalizes better in various environments.

  2. DIR is consistently effective under different bias degrees, while the baselines easily fail. Among the interpretable baselines, Attention fails to make salient improvements when bias exists, and the pooling methods also fall through under severe bias. This is empirically in line with our presumption that GNNs are easily biased to latch onto spurious relations or non-causal features and thus generalize poorly on OOD data. Among the robust/invariant learning baselines, IRM underperforms ERM when the bias degree $b$ is small. This evidence accords with the conclusion in Ahuja et al. (2021) that IRM is guaranteed to be close to the desired OOD solutions when confounders exist, while it has no obvious advantage over ERM under covariate shift. Moreover, Group DRO and V-REx follow a similar pattern. In contrast, DIR works well in various scenarios. We credit such reliability to the rationale discovery, from which the causal features are potentially extracted, and to the relations learned by the GNNs being invariant to the distribution changes in the testing set.

  3. Data augmentation by intervention is beneficial, while the variance regularization further boosts model performance. Interestingly, the ablation model DIR-Var already exceeds some of the baselines. We attribute this improvement to data augmentation via the interventional distributions. On top of DIR-Var, DIR further improves performance on both Spurious-Motif and MNIST-75sp. This suggests that the variance regularization demands a stronger invariance condition and is instructive for searching for causal features.

  4. DIR has better intrinsic interpretability than the baselines. In Table 2, we report the intrinsically-interpretable models’ performance w.r.t. Precision@5. From the consistent improvements over the baselines, we find that DIR has an advantage in discovering causal features. Moreover, the performance gap between DIR and the baselines becomes more significant as the bias increases.

3.3 In-Depth Study (RQ2)

(a) Training rationale: Positive sentiment.
(b) Training rationale: Negative sentiment.
(c) Testing rationale: Positive sentiment.
(d) Testing rationale: Negative sentiment.
Figure 4: Visualization of DIR Rationales. Each graph shows a comment, e.g., “a majestic achievement, an epic of astonishing grandeur” in (a), where rationales are highlighted in deep colors.
(a) The first two subfigures show the training curves w.r.t. variance penalty and precision, on Spurious-Motif. The last three subfigures present the rationale distributions of the inspection points, which are visualized by t-SNE (van der Maaten, 2008).
(b) The first three subfigures present the training curves w.r.t. variance penalty and ACC on MNIST-75sp, while the last three illustrate the curves w.r.t. variance penalty and AUC-ROC on Molhiv.
Figure 5: Two-stage Training Dynamics of DIR.

We empirically analyze DIR’s properties, which hopefully give insights into its mechanisms and can be instructive for existing training paradigms of deep models.

Rationale Visualization. Towards an intuitive understanding of DIR, we first present some cases of the discovered rationales for Graph-SST2 in Figure 4. DIR is able to emphasize the tokens that directly result in the sentences’ positive or negative sentiments, which are reliable and faithful rationales. Specifically, DIR highlights the positive words “majestic achievement” and “astonishing grandeur” in Figure 4(a) and underscores the negative words “worst dialogue” in Figure 4(b) as rationales, which are clearly salient for the positive and negative sentiments, respectively. Furthermore, DIR can focus persistently on the causal features for OOD testing data. For example, it selects “surprisingly engrossing” and “admittedly middling” in Figures 4(c) and 4(d), respectively. This again validates the effectiveness of DIR: (1) the rationale generator $f_{\hat{C}}$ is well learned to distinguish causal from non-causal features under various interventional distributions; and (2) $f_{\hat{Y}}$ conducts message passing on the highlighted rationales, extracts the graph representations, and finally outputs the predictions with high accuracy. See Appendix F.1 for more examples on the Graph-SST2 and Spurious-Motif datasets.

Two-stage Training Dynamics. As Figure 5(a) displays, we find a pattern in the variance-penalty curve — during DIR training, the variance penalty (i.e., the Var term in Equation 4) first increases and then decreases to almost zero. Moreover, there is an interesting correlation between the variance penalty and the precision metric — the precision rises dramatically as the penalty increases, while growing slowly as the penalty decreases. To probe this learning pattern, we further visualize the rationale distributions at three inspection points: (1) the start, (2) the middle, and (3) the end of training. Interestingly, the rationale distribution at the middle point is highly similar to that at the end point. This illustrates two stages in the training dynamics, adaption and fitting. By “adaption”, we mean that the shaping of the rationale generator $f_{\hat{C}}$, i.e., learning to select the salient features $\hat{C}$, is mainly conducted during the initial training stage. Since the penalty value can be seen as the magnitude of violating the invariance condition, this stage explores rationales that satisfy the DIR principle. Correspondingly, $f_{\hat{Y}}$ adapts quickly to the varying rationales generated by $f_{\hat{C}}$. By “fitting”, we mean that, in the later training process, $f_{\hat{C}}$ only makes small changes, resulting in substantially unchanged rationales that conform to the DIR principle. This also implies that, based on the well-learned rationales, DIR mainly optimizes $f_{\hat{Y}}$ to consolidate the functional relation until model convergence.

Moreover, we compare the learning patterns of IRM and DIR in Figure 5(b), where the penalty term of IRM (the gradient-norm penalty in IRMv1 (Arjovsky et al., 2019)) follows a similar pattern to the DIR penalty. Notably, on MNIST-75sp, while IRM consistently outperforms DIR w.r.t. training ACC, it does not improve, and even degrades, performance on the testing set due to over-fitting. In contrast, DIR shows solid resistance to over-fitting, partly thanks to the valid rationales formed in the adaption stage. On Molhiv, DIR outperforms IRM, as the rationales filter out irrelevant or spurious structures that are useless for the classification task and thereby benefit generalization.

Sensitivity Analysis. We conduct a sensitivity analysis of model performance w.r.t. $\lambda$ in Appendix F.2, which shows that DIR surpasses the best baselines over a relatively large range of $\lambda$.

4 Related Works

Inherent Interpretability of GNNs. We summarize two classes of existing methods for building interpretable deep GNNs: (i) Attention (Vaswani et al., 2017; Veličković et al., 2018), which can be broadly interpreted as importance weights on representations; and (ii) Pooling (Lee et al., 2019; Knyazev et al., 2019; Gao and Ji, 2019), which selectively down-samples representations; we include it in this category when it involves selection importance. However, the mechanisms that generate the rationales could be epistemic, as they only reflect the probabilistic relations between data and predicted labels (Pearl, 2000), which may not hold true across all data distributions. Thus, the rationales can fail to align with causal features and even degrade model performance by being “fooled” by spurious features (Chang et al., 2020).

Invariant Learning. Backed by causal theory, invariant learning assumes that the causal relation from the causal factors to the response variable remains invariant unless we intervene on the causal factors themselves. As the most prevailing formulation, IRM (Arjovsky et al., 2019) extends the invariance assumption from the feature level to the representation level: it seeks a data representation such that the optimal classifier on top of that representation is the same across all environments. However, concerns about its feasibility (Rosenfeld et al., 2021; Ahuja et al., 2021) and optimality (Kamath et al., 2021) have been raised recently. Besides IRM, variance penalization across environments has been shown to be effective for recovering invariance (Krueger et al., 2021; Xie et al., 2020; Teney et al., 2020). Notably, the existing methods generally require access to different environments, thus additionally involving environment inference (Creager et al., 2021; Wang et al., 2021b). Similarly motivated to ours, Chang et al. (2020) discover rationales by minimizing the performance gap between an environment-agnostic predictor and an environment-aware predictor. In the graph domain, Bevilacqua et al. (2021) construct graph representations from subgraph densities and use attribute symmetry regularization to mitigate shifts in graph size and vertex attribute distributions.
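The variance-penalization idea (in the spirit of REx (Krueger et al., 2021)) can be sketched as follows, assuming the per-environment risks have already been computed; the function name and toy numbers are illustrative:

```python
from statistics import mean, pvariance

def variance_penalized_risk(env_risks, lam):
    """Combine per-environment risks into a single objective.

    Minimizing the mean risk alone may favor shortcuts that work only
    in some environments; penalizing the variance of the risks pushes
    the learner toward predictors that perform uniformly well across
    environments (the invariance condition).
    """
    return mean(env_risks) + lam * pvariance(env_risks)

# A shortcut-reliant predictor: low risk in env 0, high in env 1.
shortcut = [0.05, 0.65]
# An invariant predictor: moderate but uniform risk.
invariant = [0.4, 0.4]

# On average the shortcut looks better...
assert mean(shortcut) < mean(invariant)
# ...but a sufficiently large penalty weight prefers invariance.
assert variance_penalized_risk(invariant, 2.0) < variance_penalized_risk(shortcut, 2.0)
```

The trade-off controlled by the penalty weight here is the same one studied in the sensitivity analysis of Appendix F.2.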

5 Conclusion & Future Work

In this work, we rigorously study the intrinsic interpretability of graph neural networks from a causal perspective. Our concern is the exhibition of shortcut features when generating rationales. We propose an invariant learning algorithm, DIR, to discover the causal features for rationalization. The core of DIR lies in constructing multiple environments (i.e., interventional distributions) and distilling as rationales the salient features that are consistently informative across these environments. Such rationales serve as probes into the model mechanisms and are demonstrated to be effective for generalization. In the experiments, we highlight the adaption-fitting training dynamics of DIR to reveal its learning pattern. In the future, we will build more reliable and expressive interpretable models that are feasible under various assumptions, which potentially calls for higher-level interpretability. We refer interested readers to the open discussion in Appendix G for details.


This work was supported by the National Key Research and Development Program of China (2020AAA0106000), the National Natural Science Foundation of China (U19A2079), the Sea-NExT Joint Lab, and Singapore MOE AcRF T2.

Ethics Statement

In this work, we propose a novel algorithm for intrinsically interpretable models, which involves no human subjects. The synthetic dataset is made available via the anonymous link (cf. Section 3.1). We believe the exhibition of rationales is beneficial for inspecting and eliminating potential discrimination and fairness issues in deep models in real applications.

Reproducibility Statement

We summarize the efforts made to ensure reproducibility in this work. (1) Datasets: We use one synthetic dataset which is made available (cf. the anonymous link in Section 3.1), and three public datasets whose processing details are included in Appendix D. (2) Model Training: We provide the training procedure in Algorithm 1 and the training details (including hyper-parameter settings) in Appendix D, which are consistent with our implementation in the code (cf. the anonymous link in Section 3.1). (3) Theoretical Results: All assumptions and proofs are provided in Appendix C.


  • K. Ahuja, J. Wang, A. Dhurandhar, K. Shanmugam, and K. R. Varshney (2021) Empirical or invariant risk minimization? A sample complexity perspective. In ICLR, Cited by: item 2, §4.
  • D. Alvarez-Melis and T. S. Jaakkola (2017) A causal framework for explaining the predictions of black-box sequence-to-sequence models. In EMNLP, pp. 412–421. Cited by: §1.
  • M. Arjovsky, L. Bottou, I. Gulrajani, and D. Lopez-Paz (2019) Invariant risk minimization. CoRR abs/1907.02893. Cited by: §1, §1, §2.1, §2.3, 2nd item, §3.3, §4.
  • Y. Bengio, A. C. Courville, and P. Vincent (2013) Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell.. Cited by: 2nd item.
  • B. Bevilacqua, Y. Zhou, and B. Ribeiro (2021) Size-invariant graph representations for graph classification extrapolations. In ICML, Cited by: §4.
  • F. M. Bianchi, D. Grattarola, L. Livi, and C. Alippi (2019) Graph neural networks with convolutional ARMA filters. CoRR abs/1901.01343. Cited by: Table 3.
  • G. Bouritsas, F. Frasca, S. Zafeiriou, and M. M. Bronstein (2020) Improving graph neural network expressivity via subgraph isomorphism counting. arXiv 2006.09252. Cited by: 1st item.
  • P. Bühlmann (2018) Invariance, causality and robustness. arXiv 1812.08233. Cited by: §1, §1.
  • R. Cadène, C. Dancette, H. Ben-younes, M. Cord, and D. Parikh (2019) RUBi: reducing unimodal biases for visual question answering. In NeurIPS, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.), Cited by: Appendix E, §2.4.
  • K. H. R. Chan, Y. Yu, C. You, H. Qi, J. Wright, and Y. Ma (2021) ReduNet: A white-box deep network from the principle of maximizing rate reduction. arXiv 2105.10446. Cited by: 1st item.
  • S. Chang, Y. Zhang, M. Yu, and T. S. Jaakkola (2020) Invariant rationalization. In ICML, Cited by: §1, §1, §1, §2.3, §4, §4.
  • Z. Chen, S. Villar, L. Chen, and J. Bruna (2019) On the equivalence between graph isomorphism testing and function approximation with gnns. In NeurIPS, Cited by: §G.1.
  • E. Creager, J. Jacobsen, and R. S. Zemel (2021) Environment inference for invariant learning. In ICML, M. Meila and T. Zhang (Eds.), Cited by: §2.3, §4.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: 3rd item.
  • V. P. Dwivedi, C. K. Joshi, T. Laurent, Y. Bengio, and X. Bresson (2020) Benchmarking graph neural networks. CoRR abs/2003.00982. Cited by: §1.
  • H. Gao and S. Ji (2019) Graph u-nets. In ICML, K. Chaudhuri and R. Salakhutdinov (Eds.), pp. 2083–2092. Cited by: §1, §1, §2.2, 1st item, §4.
  • W. L. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In NeurIPS, pp. 1024–1034. Cited by: §1.
  • W. Hu, M. Fey, H. Ren, M. Nakata, Y. Dong, and J. Leskovec (2021) OGB-LSC: a large-scale challenge for machine learning on graphs. arXiv preprint arXiv:2103.09430. Cited by: Table 3, 4th item.
  • W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu, M. Catasta, and J. Leskovec (2020) Open graph benchmark: datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687. Cited by: 4th item.
  • P. Kamath, A. Tangella, D. J. Sutherland, and N. Srebro (2021) Does invariant risk minimization capture invariance?. In AISTATS, A. Banerjee and K. Fukumizu (Eds.), Cited by: §4.
  • D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In ICLR, Y. Bengio and Y. LeCun (Eds.), Cited by: Appendix D.
  • T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In ICLR, Cited by: §1.
  • B. Knyazev, G. W. Taylor, and M. R. Amer (2019) Understanding attention and generalization in graph neural networks. In NeurIPS, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 4204–4214. Cited by: §1, §1, §2.2, 2nd item, §4.
  • D. Krueger, E. Caballero, J. Jacobsen, A. Zhang, J. Binas, D. Zhang, R. L. Priol, and A. C. Courville (2021) Out-of-distribution generalization via risk extrapolation (rex). In ICML, M. Meila and T. Zhang (Eds.), pp. 5815–5826. Cited by: §1, §1, §2.3, 2nd item, §4.
  • S. Kullback (1997) Information theory and statistics. Courier Corporation. Cited by: Appendix E.
  • J. Lee, I. Lee, and J. Kang (2019) Self-attention graph pooling. In ICML, K. Chaudhuri and R. Salakhutdinov (Eds.), pp. 3734–3743. Cited by: §1, §2.2, 1st item, §4.
  • P. Li, Y. Wang, H. Wang, and J. Leskovec (2020) Distance encoding: design provably more powerful neural networks for graph representation learning. In NeurIPS, Cited by: §G.1.
  • D. Luo, W. Cheng, D. Xu, W. Yu, B. Zong, H. Chen, and X. Zhang (2020) Parameterized explainer for graph neural network. In NeurIPS, Cited by: §1, §1.
  • H. Maron, H. Ben-Hamu, H. Serviansky, and Y. Lipman (2019) Provably powerful graph networks. In NeurIPS, Cited by: §G.1.
  • C. Morris, M. Ritzert, M. Fey, W. L. Hamilton, J. E. Lenssen, G. Rattan, and M. Grohe (2019) Weisfeiler and leman go neural: higher-order graph neural networks. In AAAI, pp. 4602–4609. Cited by: Table 3.
  • J. Pearl, M. Glymour, and N. P. Jewell (2016) Causal inference in statistics: a primer. John Wiley & Sons. Cited by: Appendix E, §1, 3rd item, §2.1, §2.2, §2.3.
  • J. Pearl (2000) Causality: models, reasoning, and inference. Cited by: Appendix E, 3rd item, §2.1, §2.2, §4.
  • E. Ranjan, S. Sanyal, and P. P. Talukdar (2020) ASAP: adaptive structure aware pooling for learning hierarchical graph representations. In AAAI, pp. 5470–5477. Cited by: Table 3, §1, §2.2, 1st item.
  • E. Rosenfeld, P. K. Ravikumar, and A. Risteski (2021) The risks of invariant risk minimization. In ICLR, Cited by: §4.
  • S. Sagawa, P. W. Koh, T. B. Hashimoto, and P. Liang (2019) Distributionally robust neural networks for group shifts: on the importance of regularization for worst-case generalization. CoRR abs/1911.08731. Cited by: §1, 2nd item.
  • A. W. Senior, R. Evans, J. Jumper, J. Kirkpatrick, L. Sifre, T. Green, C. Qin, A. Zídek, A. W. R. Nelson, A. Bridgland, H. Penedones, S. Petersen, K. Simonyan, S. Crossan, P. Kohli, D. T. Jones, D. Silver, K. Kavukcuoglu, and D. Hassabis (2020) Improved protein structure prediction using potentials from deep learning. Nature 577 (7792), pp. 706–710. Cited by: §1.
  • R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, pp. 1631–1642. Cited by: 3rd item.
  • D. Teney, E. Abbasnejad, and A. van den Hengel (2020) Unshuffling data for improved generalization. arXiv 2002.11894. Cited by: §1, §2.1, §2.3, §4.
  • J. Tian, C. Kang, and J. Pearl (2006) A characterization of interventional distributions in semi-markovian causal models. In AAAI, pp. 1239–1244. Cited by: §1, §2.3.
  • L. van der Maaten and G. Hinton (2008) Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research 9, pp. 2579–2605. Cited by: 4(a).
  • T. J. VanderWeele (2013) A three-way decomposition of a total effect into direct, indirect, and interactive effects. Epidemiology (Cambridge, Mass.) 24 (2), pp. 224. Cited by: Appendix E.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NeurIPS, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett (Eds.), Cited by: §4.
  • P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2018) Graph attention networks. ICLR. Note: accepted as poster Cited by: §1, §1, §2.2, 1st item, §4.
  • T. Wang, Z. Yue, J. Huang, Q. Sun, and H. Zhang (2021a) Self-supervised learning disentangled group representation as feature. arXiv 2110.15255. Cited by: 1st item.
  • T. Wang, C. Zhou, Q. Sun, and H. Zhang (2021b) Causal attention for unbiased visual recognition. arXiv 2108.08782. Cited by: §2.3, §4.
  • X. Wang, Y. Wu, A. Zhang, X. He, and T. Chua (2021c) Towards multi-grained explainability for graph neural networks. In NeurIPS, Cited by: §1, §1.
  • Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing, and V. S. Pande (2017) MoleculeNet: A benchmark for molecular machine learning. arXiv abs/1703.00564. Cited by: 4th item.
  • C. Xie, F. Chen, Y. Liu, and Z. Li (2020) Risk variance penalization: from distributional robustness to causality. arXiv 2006.07544. Cited by: §4.
  • K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2019) How powerful are graph neural networks?. In ICLR, Cited by: Table 3.
  • Z. Ying, D. Bourgeois, J. You, M. Zitnik, and J. Leskovec (2019) GNNExplainer: generating explanations for graph neural networks. In NeurIPS, pp. 9240–9251. Cited by: §F.4, §1, §1, 1st item.
  • H. Yuan, H. Yu, S. Gui, and S. Ji (2020) Explainability in graph neural networks: A taxonomic survey. CoRR. Cited by: 3rd item.
  • H. Yuan, H. Yu, J. Wang, K. Li, and S. Ji (2021) On explainability of graph neural networks via subgraph explorations. ArXiv. Cited by: §1.

Appendix A Notations & Algorithm

Symbol Definition
graph instance
/ ground truth causal or confounding subgraph
 / generated rationale or complement of rationale instance
 / variables in the causal graph
 / space of the ground truth or identified spurious features
 / causal or spurious prediction
joint prediction
rationale generator
 / causal or spurious classifier
Key Notations in the Main Paper.
0:  Training data distribution ; number of classes ; Stepsize ; hyper-parameter
1:  Randomly initialize the parameters of generator , encoder (includes and Pooling layer), two classifiers and , which are denoted as , respectively.
2:  while not converge do
3:     Sample graphs from
4:     Generate each rationale and its complement:
5:     for each  do
6:        Intervener operates
7:        Model forward: ,
8:        Obtain the joint prediction   # block back-propagation of the DIR risk to the shortcut branch
9:        Compute and record risk
10:        Compute and record -interventional risk.
11:     end for
12:     Compute via Eq. 4 and via Eq. 11
13:     Update parameters: ; ;         ;
14:  end while
Algorithm 1 Pseudocode for DIR in training interpretable Graph Neural Networks (Batch Version)
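The intervener's step in Algorithm 1 pairs each rationale with complements drawn from other graphs in the batch, so that each complement defines one interventional ("s-interventional") distribution. A minimal sketch, with strings standing in for subgraphs (the helper name is illustrative):

```python
def make_interventions(rationales, complements):
    """Pair every rationale with every complement in the batch.

    Replacing a graph's own complement with complements from other
    graphs creates one interventional distribution per complement,
    while the causal part (the rationale) is kept fixed. The DIR risk
    is then computed per interventional distribution.
    """
    return [(c, s) for s in complements for c in rationales]

rationales = ["house", "cycle"]   # stand-ins for causal subgraphs
complements = ["tree", "ladder"]  # stand-ins for spurious subgraphs

pairs = make_interventions(rationales, complements)
# Each complement defines one interventional distribution over all rationales.
assert pairs == [("house", "tree"), ("cycle", "tree"),
                 ("house", "ladder"), ("cycle", "ladder")]
```

In the actual model the pairing happens on (sub)graph tensors rather than strings, but the combinatorial structure of the interventional risks is the same.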

Appendix B Instantiated Causal Graphs

We instantiate possible causal graphs in Figure 1(a). Specifically, we use the example of Base-Motif graphs, whose labels are determined by the motif types. We use one set of symbols to denote the motifs cycle, house, and crane, and another set to denote the bases ladder, tree, and wheel.


  • : Base graphs and motif graphs are independently sampled and attached to each other.

  • : The type of each motif follows a given (static) probability distribution. Conditioned on the motif type, the probability distribution of its base graph is determined.

  • : Similar to the example for .

  • : Suppose there is a latent variable taking a continuous value within a given interval. The probability distributions of the motif type and the base type are then both binomial distributions parameterized by this latent variable.

Appendix C Theory

c.1 Assumption

We phrase the SCM in Figure 1(a) as the following assumption:

Assumption 1 (Invariant Rationalization (IR))

There exists a rationale such that the structural equation model and the probability relation hold for every distribution over the data, where the remaining features form the complement of the rationale. We refer to this structural equation model as the oracle.

By "oracle", we mean the perfect structural equation model which, when the rationale is available, predicts the response variable with the minimum expected loss over any distribution, where the loss is the task-specific loss function and we ignore the exogenous noise in the oracle's input except as otherwise noted.

Next, we argue that the assumption is commonly satisfied. For example, for sentences labeled by sentiment, the rationale can represent the positive/negative words that cause the sentiment, while the complement includes the prepositions and linking words. For molecule graphs labeled by specific properties, the rationale and the complement can represent the functional groups and the carbon structures, respectively. Note that the IR assumption enables and calls for the introduction of interpretability, highlighting salient features and permitting human-accessible checks. More importantly, it guarantees the model performance under possible feature reduction.

We also see cases going beyond the IR Assumption. For example, the label could be a generic function of both the causal and the spurious features, instead of depending on the causal features alone. We use a toy example to elaborate this point. Following the Spurious-Motif dataset, we assume each graph has multiple motifs (house, cycle, crane) of a single type and is labeled by the motif type; the causal feature is thus the set of motifs. Let the spurious feature be "the way we connect the motifs". For example, we can place the house motifs in a line and connect the adjacent motifs, forming a "line"-shaped graph; or we can place the houses in a cyclic order and connect them into a ring. We further make such graph structures strongly correlated with the motif types. Then the causal and spurious features may be inseparable at the feature level: if we split the ring of houses into two lines, the spurious pattern is broken, but part of the causal feature is lost as well. In other words, the causal and spurious features are dependent variables; they cannot be extracted and modeled separately, which goes beyond the scope of our work.

Given that the causal and spurious features are separable, we further make the following assumption to avoid confusing them:

Assumption 2 (Feature Induction)

Define the power set operation over features. For data and label, if the conditional independence condition holds for any distribution, then for any feature induced from the complement, the same conditional independence holds for that distribution.

This assumption also implies that the causal feature cannot be induced from the complement. Thus, any feature subset other than the causal one would violate the conditional independence condition. For images, this assumption is natural, since splicing the complement does not typically change its semantics; for example, a spliced land background would still be recognized as land. For graphs, we assume the causal subgraph is unique among the induced complement subgraphs.

c.2 Proofs

Theorem 1 (Necessity)

Suppose does not exist, then the oracle function satisfies the DIR Principle (where is given) over every distribution .


We first prove the fact that for distribution . Specifically, we use to denote the s-interventional distribution.


  • If ,

  • If ,

  • If ,

As the equality holds true for every distribution, it is invariant w.r.t. the intervened variable. This indicates that the intervention on the complement leaves the causal structure untouched. Thus, we have

Finally, taking the definition of , we have

Hence, the oracle attains the minimum penalty and satisfies the DIR Principle.

Notably, in the remaining case, the penalty may not equal zero, so the oracle does not necessarily satisfy the DIR Principle. That is, although the oracle still minimizes the risk, we cannot be sure whether it reaches the lower bound of the penalty without knowledge of the specific data distribution. Thus, we only consider the other cases in the following discussion.

Theorem 2 (Uniqueness)

Suppose the loss is a strict loss function and there exists one and only one non-trivial subset; then there exists a unique structural equation model satisfying the DIR Principle.


Since the oracle exists and satisfies the DIR Principle, we only need to prove its uniqueness under the given conditions. Suppose, for contradiction, that another structural equation model satisfies the DIR Principle. Then there exists a datum on which the two models disagree; combined with the strictness of the loss, this yields a contradiction.

In practice, there could be multiple candidate rationales satisfying the DIR Principle, each with a corresponding structural equation model. This calls for a selection criterion to avoid learning a suboptimal rationale. Inspired by Occam's Razor, we define


as the preferred rationale, or the rationale of parsimony. We argue that rationales should not be extended beyond necessity, which poses simpler hypotheses about causality. As the search for the minimal rationale is NP-hard (the worst-case time complexity is exponential), we use a fixed size for the learned rationales in our experiments and leave better optimization to future work.
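A fixed-size rationale can be instantiated by keeping the top-k highest-scoring edges. The following sketch assumes the generator outputs one importance score per edge; the function name and the per-edge scoring interface are illustrative assumptions:

```python
def topk_rationale(edge_scores, ratio):
    """Select a fixed fraction of edges as the rationale.

    Fixing the rationale size sidesteps the NP-hard search over
    subsets: the generator only learns *which* edges to keep, while
    the number kept is a hyper-parameter (the causal feature ratio).
    """
    k = max(1, int(len(edge_scores) * ratio))
    # Rank edge indices by descending importance score.
    ranked = sorted(range(len(edge_scores)),
                    key=lambda i: edge_scores[i], reverse=True)
    keep = sorted(ranked[:k])   # rationale edge indices
    drop = sorted(ranked[k:])   # complement edge indices
    return keep, drop

scores = [0.9, 0.1, 0.8, 0.3, 0.7]
keep, drop = topk_rationale(scores, ratio=0.4)
assert keep == [0, 2] and drop == [1, 3, 4]
```

The rationale and its complement partition the edge set, matching the rationale/complement split used throughout the paper.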

Corollary 1 (Necessity and Sufficiency)

Suppose the loss is a strict loss function and there exists one and only one non-trivial subset; then any structural causal model satisfies the DIR Principle iff it coincides with the oracle.

This is directly obtained from Theorem 2. Thus, under the uniqueness constraint, we can approach the oracle by optimizing the DIR objective, which maintains the invariant causal relation between the causal feature and the response variable. Put another way, based on the uniqueness of the feasible rationale, optimizing the DIR Principle on the intrinsically interpretable model (where the rationale is exhibited inside the model) pushes the learned model towards the oracle with the correct rationales. The predictor can then also be approached as an invariant predictor based on learning from these rationales.

Appendix D Setting Details

             Spurious-Motif              MNIST-75sp (reduced)       Graph-SST2                  OGBG-Molhiv
             Train    Val     Test      Train    Val     Test      Train     Val      Test     Train     Val     Test
Classes#     3                          10                         2                           2
Graphs#      9,000    3,000   6,000     20,000   5,000   10,000    28,327    3,147    12,305   32,901    4,113   4,113
Avg. N#      25.4     26.1    88.7      66.8     67.3    67.0      17.7      17.3     3.45     25.3      27.79   25.3
Avg. E#      35.4     36.2    131.1     539.3    545.9   540.4     33.3      33.5     4.89     54.1      61.1    55.6
Backbone     Local Extremum GNN         k-GNNs                     ARMA                        GIN + Virtual nodes
             (Ranjan et al., 2020)      (Morris et al., 2019)      (Bianchi et al., 2019)      (Xu et al., 2019; Hu et al., 2021)
Neuron#      [4,32,32,32]               [5,32,32,32]               [768,128,128,2]             [9,300,300,300,1]
Global Pool  global mean pool           global max pool            global mean pool            global add pool
Gen. Type    Scale & Correlation Shift  Noise                      Degree & Scale Shift        /
Table 3: Statistics of Graph Classification Datasets.


We summarize the dataset statistics in Table 3, and introduce the node/edge features and the preprocessing for each dataset:


  • Spurious-Motif. We use random node features and constant edge weights in this dataset.

  • MNIST-75sp. The nodes of the graphs are superpixels, and the node features are the concatenation of the pixel intensities (RGB channels) and the coordinates of their mass centers. Edge weights are the spatial distances between the superpixel centers; we keep only the edges with a distance less than 0.1 to make the graphs sparser.

  • Graph-SST2. We use constant edge weights and filter out the graphs with fewer than three edges. We initialize the node features with the pre-trained BERT (Devlin et al., 2018) word embeddings.

  • OGBG-Molhiv. We use the official released dataset in our experiment.
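The MNIST-75sp edge construction described above can be sketched as follows; the helper name is illustrative, and the direction of the distance filter follows the stated goal of sparsifying the graphs:

```python
import math

def superpixel_edges(centers, threshold=0.1):
    """Build graph edges between superpixels.

    Connect two superpixels if their mass centers are closer than
    `threshold`, using the spatial distance as the edge weight.
    Thresholding keeps the graph sparse, as intended for MNIST-75sp.
    """
    edges = {}
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            d = math.dist(centers[i], centers[j])
            if d < threshold:
                edges[(i, j)] = d
    return edges

centers = [(0.0, 0.0), (0.05, 0.0), (0.5, 0.5)]
edges = superpixel_edges(centers)
assert list(edges) == [(0, 1)] and abs(edges[(0, 1)] - 0.05) < 1e-12
```

In practice the coordinates would come from the superpixel mass centers of each image, normalized to the unit square.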


We summarize the backbone GNNs for each dataset in Table 3. The number of neurons in the successive layers (in forward order) is reported. We use ReLU activation layers and different global pooling layers. For OGBG-Molhiv, we adopt one fully connected layer as the prediction layer, while using two fully connected layers for the models on the other datasets. For baselines with node pooling/node attention, we add one node pooling/attention layer at the second convolution layer.

Training Optimization & Early Stopping.

All experiments are run on a single Tesla V100 SXM2 GPU (32 GB). During training, we use the Adam (Kingma and Ba, 2015) optimizer. The maximum number of epochs is 400 for all datasets. We use Stochastic Gradient Descent (SGD) for the optimization on Graph-SST2 and OGBG-Molhiv and Gradient Descent (GD) for the other two datasets. We also apply early stopping to avoid overfitting the training dataset. Specifically, on MNIST-75sp, Graph-SST2 and OGBG-Molhiv, each model is evaluated on a held-out in-distribution validation dataset after each epoch, while for Spurious-Motif, we use an unbiased validation dataset (i.e., without the spurious relations present in the training dataset). If the model's performance on the validation dataset shows no improvement (i.e., validation accuracy begins to decrease) for five epochs, we stop the training process to prevent increased generalization error.
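The early-stopping rule described above (patience of five epochs on validation accuracy) can be sketched as a small helper; the function name and the return convention are illustrative:

```python
def early_stop(val_accs, patience=5):
    """Return the epoch at which to stop training.

    Stop at the first epoch where validation accuracy has failed to
    improve for `patience` consecutive epochs; if that never happens,
    train for the full schedule and return the last epoch.
    """
    best, since_best = float("-inf"), 0
    for epoch, acc in enumerate(val_accs):
        if acc > best:
            best, since_best = acc, 0  # new best: reset the counter
        else:
            since_best += 1
            if since_best >= patience:
                return epoch
    return len(val_accs) - 1

# Accuracy peaks at epoch 2, then decays; stop 5 epochs later.
assert early_stop([0.5, 0.6, 0.7, 0.69, 0.68, 0.67, 0.66, 0.65]) == 7
```

In the actual experiments the schedule is additionally capped at 400 epochs, the maximum stated above.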

Hyper-Parameter Settings.

We set the causal feature ratio and the penalty weight individually for MNIST-75sp, Spurious-Motif, Graph-SST2 and OGBG-Molhiv. For the other baselines, we adopt grid search for the best parameters using the validation datasets.

Model Selection.

We select each model based on its performance on the corresponding validation dataset. We repeat each experiment at least five times and report the average values and the standard errors in the paper.

Appendix E Unimodal Adjustment

We follow Cadène et al. (2019) to demonstrate how the shortcut prediction can help remove model bias. For clarity, we refer to all model parameters except those of the shortcut branch as the main branch.

Given a house-tree graph as input, suppose the shortcut prediction on the tree subgraph leans towards the house class. After reweighting, the softmax readout on the house class in the joint prediction is magnified, which results in a smaller loss back-propagated to the main branch and prevents the main branch from absorbing the bias.

In the other situation, where a house-wheel graph is given as input, suppose the shortcut prediction on the wheel subgraph leans towards a class other than house, say the cycle class. After reweighting, the softmax readout on the house class in the joint prediction is reduced, which results in a larger loss back-propagated to the main branch and encourages the model to learn from such counter-examples.
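The reweighting mechanism above (in the spirit of RUBi, Cadène et al., 2019) can be sketched as follows; the sigmoid-mask form of the element-wise multiplication is an assumption for illustration, not the exact implementation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def joint_logits(causal, spurious):
    """Element-wise reweighting of causal logits by the
    (sigmoid-squashed) shortcut logits.

    When the shortcut branch is confident in a class, the loss on
    that class is damped for shortcut-consistent examples and
    amplified for counter-examples, steering the main branch away
    from the bias.
    """
    return [c * sigmoid(s) for c, s in zip(causal, spurious)]

# The "house" class logit is magnified when the shortcut agrees...
biased = joint_logits([2.0, 1.0], [4.0, -4.0])
# ...and damped when the shortcut points elsewhere.
counter = joint_logits([2.0, 1.0], [-4.0, 4.0])
assert biased[0] > counter[0]
```

Crucially, the DIR risk is not back-propagated through the shortcut branch (the comment in Algorithm 1), so the mask adjusts the loss on the main branch without training the shortcut to compensate.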

Furthermore, we offer causal- and information-theoretic justifications: (1) From the perspective of causal theory (Pearl, 2000; Pearl et al., 2016), the element-wise multiplication enforces the spurious prediction to estimate the pure indirect effect (PIE) of the shortcut features, while the causal prediction captures the natural direct effect (NDE) of the causal patterns (VanderWeele, 2013); (2) From the perspective of information theory (Kullback, 1997), the element-wise multiplication makes the causal prediction reflect the conditional mutual information between the causal patterns and the ground-truth labels, conditioned on the complement patterns.

Appendix F More Experimental Results

f.1 Visualization

We provide more visualization cases in Graph-SST2 dataset as shown in Figure 6 and Figure 7. The rationales are highlighted in deep colors.

Figure 6: Visualization of Training Rationales. Each graph represents a comment, e.g., "determined to uncover the truth and hopefully inspire action" in (a).
Figure 7: Visualization of Testing Rationales. Each graph represents a comment, e.g., "whimsical and relevant today" in (a).
(a) Cycle-Tree
(b) House-Ladder
(c) Crane-Tree
Figure 8: Visualization of Training Rationales in Spurious-Motif Dataset. Structures with deeper colors mean higher importance. Nodes of ground truth rationales are colored by green.
(a) Cycle-Tree
(b) House-Ladder
(c) Crane-Wheel
Figure 9: Visualization of Testing Rationales in Spurious-Motif Dataset. Structures with deeper colors mean higher importance. Nodes of ground truth rationales are colored by green.

f.2 Sensitivity Analysis

Figure 10: Sensitivity of the Hyper-Parameter. In each chart, the dashed line represents the performance of the best baseline on the corresponding dataset, and the area within one standard deviation of the accuracy is shaded.

We analyze the performance of DIR w.r.t. the penalty weight in the DIR objective. As shown in Figure 10, when the weight is set to zero, DIR degrades to optimizing the performance in each environment only, without explicitly penalizing the shortcuts' influence on the model predictions. We also see that all testing performances drop sharply if the weight is too large: a large weight on the variance term emphasizes the invariance condition while overlooking the performance loss, so the model could fail to exhibit the correct rationales. Notably, such a trade-off in the DIR objective is shared across all the datasets.

f.3 Study of the Spurious Classifiers

Here we provide more observations on the predictions of the learned spurious classifier, which shed light on the designed model mechanism. We first look into the confidence of the predictions and define

                      Spurious-Motif (bias = 0.9)  MNIST-75sp  Graph-SST2  Molhiv
Uniform               1.10                         2.30        0.693       0.693
Spurious Predictions  0.529                        1.93        0.265       0.187
Table 4: Confidence of the Spurious Predictions. Uniform is the reference value corresponding to a uniform distribution across the classes.
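The "Uniform" reference values in Table 4 are exactly the entropies ln(K) of a uniform distribution over K classes (K = 3, 10, 2), which suggests the confidence measure is prediction entropy (lower entropy means more confident predictions). A sketch under that assumption:

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution.

    A uniform distribution over K classes attains the maximum,
    ln(K); confident (peaked) predictions have lower entropy.
    """
    return -sum(p * math.log(p) for p in probs if p > 0)

# The Uniform row of Table 4 matches ln(3), ln(10), ln(2).
assert abs(entropy([1 / 3] * 3) - 1.10) < 0.01
assert abs(entropy([1 / 10] * 10) - 2.30) < 0.01
assert abs(entropy([1 / 2] * 2) - 0.693) < 0.001
```

Under this reading, the Spurious Predictions row being well below the Uniform row means the learned spurious classifier makes confident (low-entropy) predictions from the complements alone.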
Spurious-Motif (