1 Introduction
The eye-catching success of graph neural networks (GNNs) (Hamilton et al., 2017; Kipf and Welling, 2017; Dwivedi et al., 2020) provokes the rationalization task, which answers "What knowledge drives the model to make certain predictions?". The goal of selective rationalization (aka. feature attribution) (Chang et al., 2020; Ying et al., 2019; Luo et al., 2020; Wang et al., 2021c) is to find a small subset of the input graph's features, the rationale, which best guides or explains the model prediction. Discovering the rationale of a model helps audit its inner workings and justify its predictions. Moreover, it has tremendous impact on real-world applications, such as finding functional groups to shed light on protein structure prediction (Senior et al., 2020).
Two research lines of rationalization have recently emerged for GNNs. Post-hoc explainability (Ying et al., 2019; Luo et al., 2020; Yuan et al., 2021; Wang et al., 2021c) attributes a model's prediction to the input graph with a separate explanation method, while intrinsic interpretability (Veličković et al., 2018; Gao and Ji, 2019) incorporates a rationalization module into the model to make transparent predictions. Here we focus on intrinsically interpretable GNNs. Among them, graph attention (Veličković et al., 2018) and pooling (Lee et al., 2019; Knyazev et al., 2019; Gao and Ji, 2019; Ranjan et al., 2020) operators prevail; they work as computational blocks of a GNN to generate soft or hard masks on the input graph. They cast the learning paradigm of GNNs as minimizing the empirical risk with the masked subgraphs, which are regarded as rationales that guide the model predictions.
Despite the appealing nature, recent studies (Chang et al., 2020; Knyazev et al., 2019) show that the current rationalization methods are prone to exploit data biases as shortcuts to make predictions and compose rationales. Typically, shortcuts result from confounding factors, sampling biases, and artifacts in the training data. Considering Figure 1, when most House-motif graphs are attached to Tree bases, a GNN does not need to learn the correct function to reach high accuracy on the motif type. Instead, it is much easier to exploit the statistical shortcut linking the Tree bases with the frequently co-occurring House motifs. Unfortunately, when facing out-of-distribution (OOD) data, such methods generalize poorly since the shortcuts have changed. Hence, shortcut-involved rationales hardly reveal the subgraphs truly critical for the predicted labels, being at odds with the true reasoning process that underlies the task of interest (Teney et al., 2020) and with human cognition (Alvarez-Melis and Jaakkola, 2017).

Here we ascribe the failure on OOD data to the inability to identify causal patterns, which are stable under distribution shift. Motivated by recent studies on invariant learning (IL) (Arjovsky et al., 2019; Krueger et al., 2021; Chang et al., 2020; Bühlmann, 2018), we premise that different distributions elicit different environments of the data-generating process. We argue that the relations between the causal patterns and the labels remain stable across environments, while the relations between the shortcut patterns and the labels vary. Such environment-invariant patterns are more plausible and better qualified as rationales.
Aiming to identify rationales that capture the environment-invariant causal patterns, we formalize a learning strategy, Discovering Invariant Rationales (DIR), for intrinsically interpretable GNNs. One major problem is how to obtain multiple environments from a standard training set. Differing from the heterogeneous setting (Bühlmann, 2018) of existing IL methods, where environments are observable and attainable, DIR does not assume any prior knowledge about environments. It instead generates distribution perturbations by causal intervention, i.e., interventional distributions (Tian et al., 2006; Pearl et al., 2016), to instantiate environments and further distinguish the causal and non-causal parts.
Guided by this idea, our DIR strategy consists of four modules: a rationale generator, a distribution intervener, a feature encoder, and two classifiers. Specifically, the rationale generator learns to split the input graph into causal and non-causal subgraphs, which are respectively encoded into representations by the encoder. The distribution intervener then conducts causal interventions on the non-causal representations to create perturbed distributions, with which we can infer the invariant causal parts. Finally, the two classifiers are respectively built upon the causal and non-causal parts to generate the joint prediction, whose invariant risk is minimized across different distributions. On one synthetic and three real datasets, extensive experiments demonstrate the generalization ability of DIR, which surpasses current state-of-the-art IL methods (Arjovsky et al., 2019; Krueger et al., 2021; Sagawa et al., 2019), and its interpretability, which outperforms the attention- and pooling-based rationalization methods (Veličković et al., 2018; Gao and Ji, 2019). Our main contributions are:

We propose DIR, a novel invariant learning algorithm for intrinsically interpretable models, which improves generalization ability and is applicable to any deep model.

We offer a causality-theoretic analysis to justify the soundness of DIR.

We provide an implementation of DIR for graph classification tasks, which consistently achieves excellent performance on datasets with various generalization types.
2 Invariant Rationale Discovery
With a causal look at the data-generating process, we formalize the principle of discovering invariant rationales, which guides our discovery strategy. Throughout the paper, uppercase letters (e.g., C) denote random variables, while lowercase letters (e.g., c) denote their deterministic values.

2.1 Causal View of Data-Generating Process
Generating rationales for transparent predictions requires understanding the actual mechanisms of the task of interest. Without loss of generality, we focus on the graph classification task and present a causal view of the data-generating process behind this task. Here we formalize the causal view as a Structural Causal Model (SCM) (Pearl et al., 2016; Pearl, 2000) by inspecting the causalities among four variables: the input graph G, the ground-truth label Y, the causal part C, and the non-causal part S. Figure 1(a) illustrates the SCM, where each link denotes a causal relationship between two variables.


C → G ← S. The input graph G consists of two disjoint parts: the causal part C and the non-causal part S, such as the House motif and the Tree base in Figure 1.

C → Y. By “causal part”, we mean that C is the only endogenous parent determining the ground-truth label Y. Taking the motif-base example in Figure 1 again, C is the oracle rationale, which perfectly explains why the graph is labeled as it is.

C ⇠⇢ S. This dashed arrow indicates additional probabilistic dependencies (Pearl, 2000; Pearl et al., 2016) between C and S. We consider three typical relationships here: (1) C is independent of S, i.e., C ⊥ S; (2) S is a direct cause of C, i.e., S → C; and (3) there exists a common cause E, i.e., C ← E → S. See Appendix B for the corresponding examples.
The dependence C ⇠⇢ S can create spurious correlations between the non-causal part S and the ground-truth label Y. Assuming S → C, S is a confounder between C and Y, which opens a backdoor path S → C → Y, thus making S and Y spuriously correlated (Pearl et al., 2016). We systematize such spurious correlations as P(Y | S). Here we make a feature induction assumption on S → C to avoid confusion between S and the subset of C it induces; see Appendix C for the formal assumption. Furthermore, data collected from different environments exhibit various spurious correlations (Teney et al., 2020; Arjovsky et al., 2019), e.g., one collector mostly picks House motifs with Tree bases as the training data, while another selects House motifs with Wheel bases as the testing data. Hence, such spurious correlations are unstable and vary across different distributions.
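As a toy numeric illustration of this point (all names and counts below are illustrative, not drawn from the paper's datasets): the causal mechanism P(Y | C) stays fixed, while the association between the non-causal part S and the label Y flips across environments.

```python
# Toy illustration (made-up data): the causal mechanism P(Y | C) is
# fixed, while the association between the non-causal part S and the
# label Y changes across environments.

def label_from_cause(c):
    # Fixed causal mechanism: the motif type alone determines the label.
    return {"house": 1, "cycle": 0}[c]

def build_environment(pairs):
    # An "environment" is a collection of (C, S) pairs; the label always
    # comes from C, so any C-S co-occurrence is a sampling artifact.
    return [(c, s, label_from_cause(c)) for c, s in pairs]

# Environment 1: House motifs mostly sit on Tree bases.
env1 = build_environment([("house", "tree")] * 9 + [("cycle", "wheel")] * 9
                         + [("house", "wheel"), ("cycle", "tree")])
# Environment 2: the motif-base co-occurrence is reversed.
env2 = build_environment([("house", "wheel")] * 9 + [("cycle", "tree")] * 9
                         + [("house", "tree"), ("cycle", "wheel")])

def p_y1_given(env, field, value):
    # Empirical P(Y=1 | field = value) within one environment.
    idx = {"C": 0, "S": 1}[field]
    rows = [r for r in env if r[idx] == value]
    return sum(r[2] for r in rows) / len(rows)

# P(Y=1 | C=house) is 1.0 in both environments (invariant), whereas
# P(Y=1 | S=tree) flips from 0.9 to 0.1 (spurious, environment-dependent).
```

A model that latches onto P(Y | S) fits Environment 1 well but fails in Environment 2, whereas P(Y | C) transfers unchanged.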
2.2 Task Formalization of Invariant Rationalization
Oracle Rationale. Following causal theory (Pearl et al., 2016; Pearl, 2000), for each variable X in an SCM, there exists a directed link from each of its parent variables PA(X) to X if and only if the causal mechanism X = f_X(PA(X), ε) persists, where ε is the exogenous noise of X. For simplicity, we omit the exogenous noise and write X = f_X(PA(X)). Hence, there exists a function f in our SCM such that the oracle rationale C satisfies:

(1)  Y = f(C), s.t. Y ⊥ S | C,

where Y ⊥ S | C indicates that C shields Y from the influence of S, making the causal relationship C → Y invariant across different S.
Rationalization. In general, only the pairs (G, Y) of input and label are observed during training, while neither the oracle rationale C nor the oracle structural equation model f is available. The absence of such oracles calls for the study of intrinsic interpretability. We systematize an intrinsically interpretable GNN as a combination of two modules, a rationale generator and a predictor, where the generator discovers the rationale Ĉ from the observed G, and the predictor outputs the prediction Ŷ to approach Y. Distinct from C and Y, which are variables in the causal mechanism, Ĉ and Ŷ represent variables in the modeling process that approximate C and Y. To optimize these modules, most current intrinsically interpretable GNNs (Veličković et al., 2018; Lee et al., 2019; Knyazev et al., 2019; Gao and Ji, 2019; Ranjan et al., 2020) adopt the learning strategy of minimizing the empirical risk:

(2)  min R(Ŷ, Y),

where R is the risk function, which can be the cross-entropy loss. Nevertheless, this learning strategy relies heavily on the statistical associations between the input features and the labels, and can exhibit non-causal rationales.
Invariant Rationalization. We ascribe this limitation to ignoring the independence condition Y ⊥ S | C in Equation 1, which is crucial to refine the causal relationship C → Y that is invariant across different S. By introducing this independence, we formalize the task of invariant rationalization as:

(3)  min R(Ŷ, Y), s.t. Ŷ ⊥ S̃ | Ĉ,

where S̃ = G \ Ĉ is the complement of the rationale Ĉ. This formulation encourages the rationale Ĉ to capture the patterns that are stable across different distributions, while discarding the unstable patterns.
2.3 Principle & Learning Strategy of DIR
Interventional Distribution. It is difficult to recover the oracle rationale from the joint distribution over the inputs and labels alone; that is, the causal and non-causal relations are hardly distinguishable from each other. We draw inspiration from invariant learning (Arjovsky et al., 2019; Krueger et al., 2021; Chang et al., 2020), which constructs different environments to infer the invariant features or predictors. To obtain the environments, previous studies mostly partition the training set by prior knowledge (Teney et al., 2020) or by adversarial environment inference (Creager et al., 2021; Wang et al., 2021b). Different from partitioning the training data, we do not assume any prior knowledge about environments; we instead introduce interventional distributions (Tian et al., 2006; Pearl et al., 2016) to model the DIR task. Specifically, on top of our SCM, we generate an interventional distribution by performing the intervention do(S = s), which removes every link from the parents to the variable S and fixes S to the specific value s. By stratifying different values s, we obtain multiple interventional distributions.

With the interventional distributions, we propose the principle of discovering invariant rationales (DIR) to identify a rationale whose relationship with the label is stable across different distributions.
Definition 1 (DIR Principle). An intrinsically interpretable model satisfies the DIR principle if it minimizes all interventional risks { R_{do(S=s)} } simultaneously, where the interventional risk R_{do(S=s)} is defined over the interventional distribution for the specific value s.
Guided by the proposed principle, we design the learning strategy of DIR as:

(4)  min E_s[ R_{do(S=s)} ] + λ · Var_s( { R_{do(S=s)} } ),

where R_{do(S=s)} computes the risk under the s-interventional distribution, which we elaborate in Section 2.4; Var_s calculates the variance of the risks over different interventional distributions; and λ is a hyperparameter controlling the strength of invariant learning.
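The mean-plus-variance structure of Equation 4 can be sketched numerically as follows (the risk values and the helper's name are made up for illustration):

```python
import statistics

def dir_objective(interventional_risks, lam=1.0):
    """Mean interventional risk plus a variance penalty, mirroring the
    structure of Equation 4; `lam` plays the role of the hyperparameter."""
    mean_risk = statistics.fmean(interventional_risks)
    var_risk = statistics.pvariance(interventional_risks)
    return mean_risk + lam * var_risk

# A rationale whose risk is identical under every intervention pays no
# variance penalty; one that leans on shortcuts is penalized for its
# unstable risks even when its average risk is comparable.
stable = dir_objective([0.30, 0.30, 0.30], lam=10.0)
unstable = dir_objective([0.10, 0.30, 0.50], lam=10.0)
```

With a large λ, the unstable risk profile is heavily penalized even though both profiles share the same average risk of 0.30.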
Justification. We theoretically justify the DIR principle's ability to discover invariant rationales. Specifically, Theorem 1 shows that the oracle model f respects the DIR principle. Moreover, we suggest that f can be inferred by making the intrinsically interpretable model conform to the DIR principle under the uniqueness condition (cf. Corollary 1). We leave the detailed proofs to Appendix C due to limited space. By making the distribution-wise risks indifferent while pursuing low risks, the DIR principle is able to discover the invariant rationales Ĉ as approximations of the oracle rationale C, while encouraging the model to approach the oracle model f.
2.4 DIR-Guided Implementation of Intrinsically-Interpretable GNNs
With the DIR principle and objective, we present how to implement the intrinsically interpretable GNNs. We summarize the key notations of this section in Appendix A for clarity. Following Equation 2, a model with intrinsic interpretability consists of two modules, a rationale generator and a predictor, where the generator extracts a possible rationale, and the predictor makes the prediction based on the rationale. Moreover, to establish the interventional distributions, we design an additional module to perform the interventions. In a nutshell, our framework consists of four components, as Figure 3 shows.
Rationale Generator. It aims to split the input graph instance G into two subgraphs: the causal part Ĉ and the non-causal part S̃. Specifically, given an input graph instance G with node set V and edge set E, its adjacency matrix is A ∈ {0,1}^{|V|×|V|}, where A_ij = 1 denotes an edge from node i to node j, and A_ij = 0 otherwise. The rationale generator first adopts a GNN to generate a mask matrix M on A, where mask M_ij indicates the importance of edge (i, j):

(5)  M_ij = σ(z_iᵀ z_j), with Z = GNN(G),

where σ is the sigmoid function and Z summarizes the d-dimensional representations z_i of all nodes. The generator then selects the edges with the highest masks to construct the rationale Ĉ and collects Ĉ's complement as S̃:

(6)  E_Ĉ = Top_r(M ⊙ A), E_S̃ = E \ E_Ĉ,

where E_Ĉ and E_S̃ are the edge sets of Ĉ and S̃, respectively; Top_r selects the top r·|E| edges ranked by M ⊙ A, with the ratio r as a hyperparameter; and ⊙ is the element-wise product. Having obtained the edge sets, we distill the nodes appearing in these edges to establish Ĉ and S̃.
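The top-r edge selection can be sketched as below, assuming a dense 0/1 adjacency matrix and a dense mask matrix (the function name and the tie-breaking by sort order are our own choices, not the paper's):

```python
import numpy as np

def split_rationale(adj, mask, r=0.5):
    """Split edges into a causal candidate set (top-r fraction by mask)
    and its complement, in the spirit of Equation 6. `adj` is a 0/1
    adjacency matrix; `mask` holds edge importances in [0, 1]."""
    scores = mask * adj                        # element-wise product M ⊙ A
    edges = np.argwhere(adj > 0)               # candidate edges (i, j)
    k = max(1, int(r * len(edges)))            # keep the top r fraction
    order = np.argsort([-scores[i, j] for i, j in edges])
    causal = {tuple(edges[t]) for t in order[:k]}
    complement = {tuple(e) for e in edges} - causal
    return causal, complement
```

The nodes touched by each edge set would then be gathered to materialize the two subgraphs.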
Distribution Intervener. It targets creating the interventional distributions. Formally, it first collects the non-causal parts of all training instances into a memory bank {S̃_i}. It then samples a memory S̃_j to conduct the intervention do(S = S̃_j), replacing the complement of the critical subgraph at hand and constructing an intervened pair (Ĉ_i, S̃_j), where i and j are instance indices.
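The memory-bank pairing can be sketched as follows (strings stand in for subgraph objects, and the function name and sampling scheme are illustrative):

```python
import random

def make_interventions(causal_parts, noncausal_bank, num_s=3, seed=0):
    """Pair every causal part with sampled non-causal parts from the
    memory bank; each fixed s then induces one interventional
    distribution do(S = s) over the training instances."""
    rng = random.Random(seed)
    sampled = rng.sample(noncausal_bank, num_s)
    # One perturbed "environment" per sampled s: the causal parts are
    # kept, only the attached non-causal complement is replaced.
    return {s: [(c, s) for c in causal_parts] for s in sampled}
```

For example, `make_interventions(["house_1", "cycle_2"], ["tree", "wheel", "ladder"], num_s=2)` yields two environments, each attaching every causal part to the same fixed complement.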
Graph Encoder & Classifiers. Here we represent the predictor as a combination of a graph encoder and two classifiers. Specifically, it employs another GNN encoder on Ĉ to generate node representations, and then combines them into a graph representation h_Ĉ via a global pooling operator, e.g., average pooling. It then uses a classifier Φ_Ĉ to project the graph representation into a probability distribution ŷ_Ĉ over the class labels. More formally:

(7)  h_Ĉ = Pooling(GNN(Ĉ)), ŷ_Ĉ = Φ_Ĉ(h_Ĉ).
Analogously, we obtain ŷ_S̃ for S̃ via the shared encoder and another classifier Φ_S̃. ŷ_Ĉ is the prediction based merely on the causal part Ĉ, while ŷ_S̃ measures the predictive power of the intervened part S̃. Inspired by Cadène et al. (2019), we formulate the joint prediction under the intervention as ŷ_Ĉ masked by ŷ_S̃:

(8)  ŷ = ŷ_Ĉ ⊙ σ(ŷ_S̃),

where the sigmoid function σ adjusts the output logits of ŷ_Ĉ to compensate for the spurious biases. In Appendix E, we present examples of how this operation helps discover the causal part.

Optimization. Having established the prediction ŷ of an instance under the intervention do(S = S̃_j), we obtain the interventional risk in Equation 4 as follows:
(9)  R_{do(S=s)} = E_{(G,y)∈D} [ ℓ(ŷ, y) ],

where (G, y) is a pair of a graph instance and its ground-truth label from the training set D, and ℓ denotes the loss function on a single instance. Moreover, we define the loss for the shortcut classifier Φ_S̃ as:

(10)  L_S̃ = E_{(G,y)∈D} [ ℓ(ŷ_S̃, y) ].
Specifically, L_S̃ is only back-propagated to the classifier Φ_S̃, and we detach the other components from its back-propagation to avoid interference with representation learning. Thus, this loss encourages the shortcut branch alone to capture the spurious biases from the non-causal features. Overall, we jointly optimize these components via the DIR objective and the shortcut loss, i.e.,

(11)  min E_s[ R_{do(S=s)} ] + λ · Var_s( { R_{do(S=s)} } ) + L_S̃,

minimized over the parameters of the generator, the encoder, and the two classifiers. In the inference phase, we yield Ĉ and ŷ_Ĉ as the causal rationale and the causal prediction of a testing graph, which exclude the influence of the non-causal part S̃.
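The loss computation of Equations 8-11 can be sketched in plain NumPy as follows. This is a forward-pass sketch only: routing the shortcut loss to its classifier alone requires an autograd framework, so that part is merely noted in a comment, and all function names are our own.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_entropy(logits, label):
    # Softmax cross-entropy on a single instance (numerically stabilized).
    z = logits - logits.max()
    return float(-z[label] + np.log(np.exp(z).sum()))

def dir_losses(y_c_logits, y_s_logits_per_env, labels, lam=1.0):
    """Joint prediction = causal logits masked by sigmoid(shortcut
    logits); one risk per intervention; combined as mean + lam * var.
    Note: in training, the shortcut loss would be back-propagated only
    to the shortcut classifier, which is omitted in this sketch."""
    risks = []
    for y_s_logits in y_s_logits_per_env:      # one entry per do(S = s)
        joint = [yc * sigmoid(ys) for yc, ys in zip(y_c_logits, y_s_logits)]
        risks.append(np.mean([cross_entropy(j, y)
                              for j, y in zip(joint, labels)]))
    risks = np.array(risks)
    return risks.mean() + lam * risks.var(), risks
```

Each entry of `y_s_logits_per_env` holds the shortcut logits produced after one fixed intervention, so the variance term directly measures how much the risk wobbles across interventions.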
3 Experiments
In this section, we conduct extensive experiments to answer the research questions:


RQ1: How effective is DIR in discovering causal features and improving model generalization?

RQ2: What are the learning patterns of DIR training? In particular, how does invariant rationalization help improve generalization?
3.1 Settings
Datasets. We use one synthetic dataset and three real datasets of graph classification tasks. Different GNNs are deployed on different datasets to implement DIR, and early stopping is applied during training. Here we briefly introduce the datasets, while the dataset statistics, deployed GNNs, and training details are summarized in Appendix D.


Spurious-Motif is a synthetic dataset created by following Ying et al. (2019). Each graph is composed of one base (Tree, Ladder, or Wheel) and one motif (Cycle, House, or Crane). The ground-truth label is determined by the motif solely. Moreover, we manually construct spurious relations of different degrees between the base and the label in the training set. Specifically, in the training set, we sample each motif from a uniform distribution, while the distribution of its base is conditioned on the motif through a bias parameter b: the larger b, the more often a motif co-occurs with its spuriously linked base. We manipulate b to create Spurious-Motif datasets of distinct biases. In the testing set, the motifs and bases are randomly attached to each other. Besides, we include graphs with large bases to further magnify the distribution gaps.
MNIST-75sp (Knyazev et al., 2019) converts the MNIST images into superpixel graphs with at most 75 nodes per graph. The nodes are superpixels, while the edges encode the spatial distances between them. Every graph is labeled as one of the 10 digit classes. Random noise is added to the node features in the testing set.
Baselines. We thoroughly compare DIR with Empirical Risk Minimization (ERM) and two classes of baselines:


Interpretable Baselines: Graph Attention (Veličković et al., 2018) and graph pooling operations including ASAP (Ranjan et al., 2020), Top-k Pool (Gao and Ji, 2019), and SAG Pool (Lee et al., 2019). We use their generated masks on graph structures as rationales. We also include GSN (Bouritsas et al., 2020), a topologically-aware message passing scheme that enriches GNNs with interpretable structural features.

Robust/Invariant Learning Baselines: Group DRO (Sagawa et al., 2019), IRM (Arjovsky et al., 2019), and V-REx (Krueger et al., 2021). This class of algorithms improves the robustness and generalization of GNNs, helping the models generalize to unseen groups or out-of-distribution data. We use random groups or partitions during model training.
We also include an ablation model of DIR, DIR-Var, which sets λ = 0, i.e., discards the variance term in Equation 4, to show the effectiveness of the variance regularization in the DIR objective.
Metrics. We use ROC-AUC for Molhiv and ACC for the other three datasets. Moreover, for the Spurious-Motif datasets, we use Precision@5 to evaluate the coincidence between the model rationales and the ground-truth rationales, quantitatively validating interpretability.
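The Precision@5 computation can be sketched as follows (the edge tuples are illustrative):

```python
def precision_at_k(ranked_edges, ground_truth, k=5):
    """Fraction of the k highest-ranked rationale edges that fall inside
    the ground-truth rationale."""
    top = ranked_edges[:k]
    return sum(1 for e in top if e in ground_truth) / k

# Three of the five highest-scored edges lie inside the true motif.
p = precision_at_k([(0, 1), (1, 2), (7, 8), (2, 0), (8, 9)],
                   ground_truth={(0, 1), (1, 2), (2, 0), (3, 4)})
# p -> 0.6
```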
3.2 Main Results (RQ1)
Table 1: Overall performance. The four Spurious-Motif columns correspond to the balanced setting and three increasing degrees of bias b.

| Model | SM Balance | SM Bias 1 | SM Bias 2 | SM Bias 3 | MNIST-75sp | Graph-SST2 | Molhiv |
|---|---|---|---|---|---|---|---|
| ERM | 42.99±1.93 | 39.69±1.73 | 38.93±1.74 | 33.61±1.02 | 12.71±1.43 | 81.44±0.59 | 76.20±1.14 |
| Attention | 43.07±2.55 | 39.42±1.50 | 37.41±0.86 | 33.46±0.43 | 15.19±2.62 | 81.57±0.71 | 75.84±1.33 |
| ASAP | 44.44±8.19 | 44.25±6.87 | 39.19±4.39 | 31.76±2.89 | 15.54±1.87 | 81.57±0.84 | 73.81±1.17 |
| Top-k Pool | 43.43±8.79 | 41.21±7.05 | 40.27±7.12 | 33.60±0.91 | 14.91±3.25 | 79.78±1.35 | 73.01±1.65 |
| SAG Pool | 45.23±6.76 | 43.82±6.32 | 40.45±7.50 | 33.60±1.18 | 14.31±2.44 | 80.24±1.72 | 73.26±0.84 |
| GSN | 43.18±5.65 | 34.67±1.21 | 34.03±1.69 | 32.60±1.75 | 19.03±2.39 | 82.54±1.16 | 74.53±1.90 |
| Group DRO | 41.51±1.11 | 39.38±0.93 | 39.32±2.23 | 33.90±0.52 | 15.13±2.83 | 81.29±1.44 | 75.44±2.70 |
| V-REx | 42.83±1.59 | 39.43±2.69 | 39.08±1.56 | 34.81±2.04 | 18.92±1.41 | 81.76±0.08 | 75.62±0.79 |
| IRM | 42.26±2.69 | 41.30±1.28 | 40.16±1.74 | 35.12±2.71 | 18.62±1.22 | 81.01±1.13 | 74.46±2.74 |
| DIR-Var | 45.87±2.61 | 43.81±1.93 | 42.69±1.77 | 37.12±1.56 | 17.74±4.17 | 81.74±0.89 | 76.05±0.86 |
| DIR | 47.03±2.46 | 45.50±2.15 | 43.36±1.64 | 39.87±0.56 | 20.36±1.78 | 83.29±0.53 | 77.05±0.57 |
Table 2: Precision@5 on Spurious-Motif under the balanced setting and three increasing degrees of bias b.

| Model | Balance | Bias 1 | Bias 2 | Bias 3 |
|---|---|---|---|---|
| Attention | 0.183±0.018 | 0.183±0.130 | 0.182±0.014 | 0.134±0.013 |
| ASAP | 0.187±0.030 | 0.188±0.023 | 0.186±0.027 | 0.121±0.021 |
| Top-k Pool | 0.215±0.061 | 0.207±0.057 | 0.212±0.056 | 0.148±0.018 |
| SAG Pool | 0.212±0.033 | 0.198±0.062 | 0.201±0.064 | 0.136±0.014 |
| DIR | 0.257±0.014 | 0.255±0.016 | 0.247±0.012 | 0.192±0.044 |
To fairly compare the methods, we train each model under the same training settings as described in Appendix D. The overall results are summarized in Table 1, and we have the following observations:


DIR has better generalization ability than the baselines. DIR outperforms the baselines consistently by a large margin. Specifically, on the MNIST-75sp dataset, DIR surpasses ERM by 7.65% and ASAP by 4.82%. Although structural features are shown to be helpful in mitigating feature distribution shift, DIR still performs better than GSN. For Graph-SST2 and Molhiv, DIR achieves the highest performance with low variance. For Spurious-Motif, DIR outstrips IRM by 4.23% and SAG Pool by 3.16% on average across different degrees of spurious bias. Such improvements strongly validate that DIR generalizes better in various environments.

DIR is consistently effective under different bias degrees, while the baselines easily fail. Among the interpretable baselines, Attention fails to make salient improvements when bias exists, and the pooling methods also fall through under severe bias. This is empirically in line with our presumption that GNNs are easily biased to latch onto spurious relations or non-causal features and thus generalize poorly on OOD data. Among the robust/invariant learning baselines, IRM underperforms ERM when the bias is small. This evidence is consistent with the conclusion in Ahuja et al. (2021) that IRM is guaranteed to be close to the desired OOD solutions when confounders exist, while it has no obvious advantage over ERM under covariate shift. Group DRO and V-REx follow a similar pattern. In contrast, DIR works well in various scenarios. We credit such reliability to the rationale discovery, from which the causal features are potentially extracted, so that the relation learned by the GNN stays invariant under the distribution changes in the testing set.

Data augmentation by intervention is beneficial, while the variance regularization further boosts model performance. Interestingly, the ablation model DIR-Var already exceeds some of the baselines. We attribute such improvement to data augmentation via the interventional distributions. On top of DIR-Var, DIR further improves the performance on Spurious-Motif and MNIST-75sp on average. This suggests that the variance regularization demands a stronger invariance condition and is instructive for searching causal features.

DIR has better intrinsic interpretability than the baselines. In Table 2, we report the intrinsically interpretable models' performance w.r.t. Precision@5. From the consistent improvements over the baselines, we find that DIR has an advantage in discovering causal features. Moreover, the performance gap between DIR and the baselines becomes more significant as the bias increases.
3.3 InDepth Study (RQ2)
We empirically analyze DIR's properties, which hopefully give insights into its mechanisms and can be instructive for existing training paradigms of deep models.
Rationale Visualization. Towards an intuitive understanding of DIR, we first present some cases of the discovered rationales for Graph-SST2 in Figure 4. DIR is able to emphasize the tokens that directly result in the sentences' positive or negative sentiments, which are reliable and faithful rationales. Specifically, DIR highlights the positive words "majestic achievement" and "astonishing grandeur" in Figure 3(a) and underscores the negative words "worst dialogue" in Figure 3(b) as the rationales, which are clearly salient for the positive and negative sentiments, respectively. Furthermore, DIR can focus persistently on the causal features for OOD testing data. For example, it selects "surprisingly engrossing" and "admittedly middling" in Figures 3(c) and 3(d), respectively. This again validates the effectiveness of DIR: (1) the rationale generator is well learned to distinguish causal and non-causal features under various interventional distributions; and (2) the predictor conducts message-passing on the highlighted rationales, extracts the graph representations, and finally outputs the predictions with high accuracy. See Appendix F.1 for more examples from the Graph-SST2 and Spurious-Motif datasets.
Two-stage Training Dynamics. As Figure 4(a) displays, we find a pattern in the Var-Time curve: during DIR training, the variance penalty (i.e., the Var term in Equation 4) first increases and then decreases to almost zero. Moreover, there exists an interesting correlation between the variance penalty and the precision metric: the precision rises dramatically as the penalty increases, while growing slowly as the penalty decreases. To probe this learning pattern, we further visualize the rationale distribution at three turning points: (1) the start, (2) the middle, and (3) the end of training. Interestingly, the rationale distribution at the middle point is highly similar to that at the end point. This illustrates two stages in the pattern: adaption and fitting. By "adaption", we mean that learning to select the salient features, i.e., the job of the rationale generator, is mainly conducted during the initial training stage. Since the penalty value can be seen as the magnitude of violating the invariance condition, this stage explores the rationales that satisfy the DIR principle. Correspondingly, the predictor adapts quickly to the varying rationales generated. By "fitting", we mean that, in the later training process, the rationale generator only makes small changes, yielding substantially unchanged rationales that conform to the DIR principle. This implies that, based on the well-learned rationales, DIR mainly optimizes the predictor to consolidate the functional relation until model convergence.
Moreover, we compare the learning patterns of IRM and DIR in Figure 4(b), where the penalty term of IRM (the gradient-norm penalty in IRMv1 (Arjovsky et al., 2019)) follows a similar pattern to the DIR penalty. Notably, on MNIST-75sp, while IRM consistently outperforms DIR w.r.t. training ACC, it does not improve and even degrades the performance on the testing set due to overfitting. In contrast, DIR shows solid resistance to overfitting, partly thanks to the valid rationales exhibited in the adaption stage. For Molhiv, DIR outperforms IRM since the rationales filter out irrelevant or spurious structures that are unhelpful for the classification task and thus benefit generalization.
Sensitivity Analysis. We conduct a sensitivity analysis of model performance w.r.t. λ in Appendix F.2, which shows that DIR surpasses the best baselines over a relatively large range of λ.
4 Related Works
Inherent Interpretability of GNNs. We summarize two classes of existing methods for building interpretable deep GNNs: (i) attention (Vaswani et al., 2017; Veličković et al., 2018), which can be broadly interpreted as importance weights on representations; and (ii) pooling (Lee et al., 2019; Knyazev et al., 2019; Gao and Ji, 2019), which selectively performs down-sampling on representations and which we include in this category when it involves selection importance. However, the mechanisms generating the rationales could be epistemic, as they only reflect the probabilistic relations between data and predicted labels (Pearl, 2000), which may not hold in all data distributions. Thus, the rationales could fail to align with causal features and even degrade model performance by being "fooled" by spurious features (Chang et al., 2020).
Invariant Learning. Backed by causal theory, invariant learning assumes that the causal relation from the causal factors C to the response variable Y remains invariant unless we intervene on C. As the most prevailing formulation, IRM (Arjovsky et al., 2019) extends the invariance assumption from the feature level to the representation level, finding a data representation such that the optimal classifier on top of it matches for all environments. However, concerns about its feasibility (Rosenfeld et al., 2021; Ahuja et al., 2021) and optimality (Kamath et al., 2021) have been raised recently. Besides IRM, variance penalization across environments is shown to be effective for recovering invariance (Krueger et al., 2021; Xie et al., 2020; Teney et al., 2020). Notably, the existing methods generally require access to different environments, thus additionally involving environment inference (Creager et al., 2021; Wang et al., 2021b). Similarly motivated to ours, Chang et al. (2020) discover rationales by minimizing the performance gap between an environment-agnostic predictor and an environment-aware predictor. In the graph domain, Bevilacqua et al. (2021) construct graph representations from subgraph densities and use attribute symmetry regularization to mitigate shifts in graph size and vertex attribute distributions.

5 Conclusion & Future Work
In this work, we rigorously study the intrinsic interpretability of graph neural networks from a causal perspective. Our concern is the exploitation of shortcut features when generating rationales. We propose an invariant learning algorithm, DIR, to discover the causal features for rationalization. The core of DIR lies in constructing environments (i.e., interventional distributions) and distilling the salient features that are consistently informative across these environments as rationales. Such rationales serve as probes into model mechanisms and are demonstrated to be effective for generalization. In the experiments, we highlight an adaption-fitting training dynamic of DIR to reveal its learning pattern. In the future, we will build more reliable and expressive interpretable models that are feasible under various assumptions, which potentially calls for higher-level interpretability. We refer interested readers to the open discussion in Appendix G for details.
Acknowledgment
This work was supported by the National Key Research and Development Program of China (2020AAA0106000), the National Natural Science Foundation of China (U19A2079), the SeaNExT Joint Lab, and Singapore MOE AcRF T2.
Ethics Statement
In this work, we propose a novel algorithm for intrinsically interpretable models, which involves no human subjects. The synthetic dataset is made available at the anonymous link (cf. Section 3.1). We believe the exhibition of rationales is beneficial for inspecting and eliminating potential discrimination and fairness issues in deep models for real applications.
Reproducibility Statement
We summarize the efforts made to ensure reproducibility in this work. (1) Datasets: we use one synthetic dataset, which is made available (cf. the anonymous link in Section 3.1), and three public datasets whose processing details are included in Appendix D. (2) Model Training: we provide the training procedure in Algorithm A and the training details (including hyperparameter settings) in Appendix D, consistent with our implementation in the code (cf. the anonymous link in Section 3.1). (3) Theoretical Results: all assumptions and proofs can be found in Appendix C.
References
 Empirical or invariant risk minimization? A sample complexity perspective. In ICLR, Cited by: item 2, §4.
 A causal framework for explaining the predictions of blackbox sequencetosequence models. In EMNLP, pp. 412–421. Cited by: §1.
 Invariant risk minimization. CoRR abs/1907.02893. Cited by: §1, §1, §2.1, §2.3, 2nd item, §3.3, §4.
 Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell.. Cited by: 2nd item.
 Size-invariant graph representations for graph classification extrapolations. In ICML, Cited by: §4.
 Graph neural networks with convolutional ARMA filters. CoRR abs/1901.01343. Cited by: Table 3.
 Improving graph neural network expressivity via subgraph isomorphism counting. arXiv 2006.09252. Cited by: 1st item.
 Invariance, causality and robustness. arXiv 1812.08233. Cited by: §1, §1.
 RUBi: reducing unimodal biases for visual question answering. In NeurIPS, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. B. Fox, and R. Garnett (Eds.), Cited by: Appendix E, §2.4.
 ReduNet: A white-box deep network from the principle of maximizing rate reduction. arXiv 2105.10446. Cited by: 1st item.
 Invariant rationalization. In ICML, Cited by: §1, §1, §1, §2.3, §4, §4.
 On the equivalence between graph isomorphism testing and function approximation with gnns. In NeurIPS, Cited by: §G.1.
 Environment inference for invariant learning. In ICML, M. Meila and T. Zhang (Eds.), Cited by: §2.3, §4.
 BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: 3rd item.
 Benchmarking graph neural networks. CoRR abs/2003.00982. Cited by: §1.
 Graph U-Nets. In ICML, K. Chaudhuri and R. Salakhutdinov (Eds.), pp. 2083–2092. Cited by: §1, §1, §2.2, 1st item, §4.
 Inductive representation learning on large graphs. In NeurIPS, pp. 1024–1034. Cited by: §1.

 OGB-LSC: a large-scale challenge for machine learning on graphs. arXiv preprint arXiv:2103.09430. Cited by: Table 3, 4th item.
 Open graph benchmark: datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687. Cited by: 4th item.
 Does invariant risk minimization capture invariance?. In AISTATS, A. Banerjee and K. Fukumizu (Eds.), Cited by: §4.
 Adam: A method for stochastic optimization. In ICLR, Y. Bengio and Y. LeCun (Eds.), Cited by: Appendix D.
 Semi-supervised classification with graph convolutional networks. In ICLR, Cited by: §1.
 Understanding attention and generalization in graph neural networks. In NeurIPS, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 4204–4214. Cited by: §1, §1, §2.2, 2nd item, §4.
 Out-of-distribution generalization via risk extrapolation (REx). In ICML, M. Meila and T. Zhang (Eds.), pp. 5815–5826. Cited by: §1, §1, §2.3, 2nd item, §4.
 Information theory and statistics. Courier Corporation. Cited by: Appendix E.
 Self-attention graph pooling. In ICML, K. Chaudhuri and R. Salakhutdinov (Eds.), pp. 3734–3743. Cited by: §1, §2.2, 1st item, §4.
 Distance encoding: design provably more powerful neural networks for graph representation learning. In NeurIPS, Cited by: §G.1.
 Parameterized explainer for graph neural network. In NeurIPS, Cited by: §1, §1.
 Provably powerful graph networks. In NeurIPS, Cited by: §G.1.
 Weisfeiler and Leman go neural: higher-order graph neural networks. In AAAI, pp. 4602–4609. Cited by: Table 3.
 Causal inference in statistics: a primer. John Wiley & Sons. Cited by: Appendix E, §1, 3rd item, §2.1, §2.2, §2.3.
 Causality: models, reasoning, and inference. Cited by: Appendix E, 3rd item, §2.1, §2.2, §4.
 ASAP: adaptive structure aware pooling for learning hierarchical graph representations. In AAAI, pp. 5470–5477. Cited by: Table 3, §1, §2.2, 1st item.
 The risks of invariant risk minimization. In ICLR, Cited by: §4.
 Distributionally robust neural networks for group shifts: on the importance of regularization for worstcase generalization. CoRR abs/1911.08731. Cited by: §1, 2nd item.

 Improved protein structure prediction using potentials from deep learning. Nature 577 (7792), pp. 706–710. Cited by: §1.
 Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, pp. 1631–1642. Cited by: 3rd item.
 Unshuffling data for improved generalization. arXiv 2002.11894. Cited by: §1, §2.1, §2.3, §4.
 A characterization of interventional distributions in semimarkovian causal models. In AAAI, pp. 1239–1244. Cited by: §1, §2.3.

 Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research 9, pp. 2579–2605. Cited by: 4(a).
 A three-way decomposition of a total effect into direct, indirect, and interactive effects. Epidemiology (Cambridge, Mass.) 24 (2), pp. 224. Cited by: Appendix E.
 Attention is all you need. In NeurIPS, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett (Eds.), Cited by: §4.
 Graph attention networks. In ICLR, Cited by: §1, §1, §2.2, 1st item, §4.

 Self-supervised learning disentangled group representation as feature. arXiv 2110.15255. Cited by: 1st item.
 Causal attention for unbiased visual recognition. arXiv 2108.08782. Cited by: §2.3, §4.
 Towards multi-grained explainability for graph neural networks. In NeurIPS, Cited by: §1, §1.
 MoleculeNet: A benchmark for molecular machine learning. arXiv abs/1703.00564. Cited by: 4th item.
 Risk variance penalization: from distributional robustness to causality. arXiv 2006.07544. Cited by: §4.
 How powerful are graph neural networks?. In ICLR, Cited by: Table 3.
 GNNExplainer: generating explanations for graph neural networks. In NeurIPS, pp. 9240–9251. Cited by: §F.4, §1, §1, 1st item.
 Explainability in graph neural networks: A taxonomic survey. CoRR. Cited by: 3rd item.
 On explainability of graph neural networks via subgraph explorations. ArXiv. Cited by: §1.
Appendix A Notations & Algorithm
Symbol  Definition 

graph instance  
/  ground truth causal or confounding subgraph 
/  generated rationale or complement of rationale instance 
/  variables in the causal graph 
/  space of the ground truth or identified spurious features 
/  causal or spurious prediction 
joint prediction  
rationale generator  
/  causal or spurious classifier 
Appendix B Instantiated Causal Graphs
We instantiate possible causal graphs in Figure 1(a). Specifically, we use the example of Base-Motif graphs, whose labels are determined by the motif types. We use C = 0, 1, 2 to denote the motifs cycle, house, and crane, respectively, and S = 0, 1, 2 to denote the bases ladder, tree, and wheel, respectively.


: Base graphs and motif graphs are independently sampled and attached to each other.

: The type of each motif follows a given (static) probability distribution. According to the value of the motif variable, the probability distribution of its base graph is given by (12).
: Similar to the example for .

: Suppose there is a latent variable taking continuous values from to . Then the probability distributions of and satisfy
(13) where stands for the binomial distribution, i.e., for a variable , if , then we have
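As a concrete illustration of the C → S case above, the following sketch samples (motif, base) pairs under an assumed biased conditional: the base whose index matches the motif is drawn with probability b, and the other two bases uniformly share the remaining mass. The bias parameter b and the index coding are our assumptions for illustration, not the exact specification of Eq. (12).

```python
import random

def sample_pair(b, rng):
    """Sample one (motif, base) pair under an assumed biased
    conditional P(base | motif).

    Motif C is uniform over {0: cycle, 1: house, 2: crane}. Given
    C = i, the base S (0: ladder, 1: tree, 2: wheel) matches index i
    with probability b; otherwise one of the other two bases is
    chosen uniformly. b is the assumed bias degree.
    """
    c = rng.randrange(3)
    if rng.random() < b:
        s = c  # spuriously correlated base
    else:
        s = rng.choice([j for j in range(3) if j != c])
    return c, s
```

With b close to 1, a classifier can predict the motif from the base alone, which is precisely the kind of shortcut DIR is designed to discard.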
Appendix C Theory
c.1 Assumption
We phrase the SCM in Figure 1(a) as the following assumption:
Assumption 1 (Invariant Rationalization (IR))
There exists a rationale , such that the structural equation model
and the probability relation
hold for every distribution over , where denotes the complement of . Also, we denote as the oracle structural equation model.
By “oracle”, we mean the perfect structural equation model which, when the rationale is available, predicts the response variable with the minimum expected loss over any distribution. Formally,
(14) 
where is the task-specific loss function, and we ignore the exogenous noise in the model's input except as otherwise noted.
Next, we argue that this assumption is commonly satisfied. For example, for sentences labeled by sentiment, the rationale can comprise the positive/negative words that cause the sentiment, while its complement includes the prepositions and linking words. For molecule graphs labeled by specific properties, the rationale and its complement can represent the functional groups and the carbon skeleton, respectively. Note that the IR assumption enables and motivates the introduction of interpretability, highlighting salient features and offering human-accessible checks. More importantly, it guarantees the model performance under feature reduction. We also see cases going beyond the IR assumption. For example, the label could be a generic function of the causal and spurious features, instead of a simple combination of separable parts. We use a toy example to elaborate this point. Following the Spurious-Motif dataset, assume each graph contains multiple motifs (house, cycle, crane) of a single type and is labeled by that motif type; the causal feature is thus the set of motifs. Let the spurious feature be “the way we connect the motifs”. For example, we can place the house motifs in a line and connect adjacent motifs, forming a graph of “line” shape; or we can place the houses in a cyclic order and connect them into a ring. We further make such graph structures strongly correlated with the motif types. In this case, the causal and spurious features may be intractable individually at the feature level: if we separate the ring of houses into two lines, the spurious pattern would be broken, while part of the causal feature would be lost as well. In other words, they are dependent variables and cannot be extracted and modeled separately, which goes beyond the scope of our work.
Given that the causal and spurious features are separable, we further make the following assumption to avoid confusing them:
Assumption 2 (Feature Induction)
Define power set operation as . For data and label , if holds for any distribution over , then it implies that for any induced feature , we have holds for the distribution .
This assumption also implies that the causal feature could not be induced by the complement. Thus, any feature subset other than the causal one would violate the conditional independence condition. For images, this assumption is natural, since splicing parts of the background does not typically change their semantics; for example, spliced pieces of a land background are still land. For graphs, we here assume the causal subgraph's uniqueness among the induced complement subgraphs.
c.2 Proofs
Theorem 1 (Necessity)
Suppose does not exist; then the oracle function satisfies the DIR Principle (where is given) over every distribution .
Proof:
We first prove the fact that for every distribution . Specifically, we use to denote the s-interventional distribution.


If ,

If ,

If ,
As this holds true for every distribution , it is invariant w.r.t. the intervened variable . Moreover, we have . This indicates that the intervention on leaves the causal structure untouched. Thus, we have
Finally, taking the definition of , we have
Hence, takes the minimum penalty and satisfies the DIR Principle.
Notably, if , then may not equal zero since . In such cases, the DIR Principle is not necessarily satisfied. That is, although still minimizes , we cannot be sure whether it reaches the lower bound of without knowledge of the specific data distribution. Thus, we only consider the cases of , and in the following discussion.
Theorem 2 (Uniqueness)
Suppose is a strict loss function and there exists one and only one non-trivial subset ; then there exists a unique structural equation model that satisfies the DIR Principle.
Proof:
Since such a model exists and satisfies the DIR Principle, we only need to prove its uniqueness under the given conditions. Suppose, for contradiction, that another structural equation model satisfies the DIR Principle. Then there exists a datum s.t. . Thus, we have . Given that , we have .
In reality, there could be multiple candidates of , e.g., s.t. , where is the structural equation model corresponding to . Thus, a selection among them is needed to avoid learning a suboptimal rationale. Inspired by Occam's Razor, we define
(15) 
as the preferred rationale, or the rationale of parsimony. We argue that rationales are not to be extended beyond necessity, which poses simpler hypotheses about causality. As the search for it is NP-hard (the worst-case time complexity is exponential), we use a fixed size for the learned rationales in our experiments and leave better optimization to future work.
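The fixed-size selection mentioned above can be sketched as a simple top-k rule over learned node scores; the scoring network and the ratio value below are illustrative placeholders, not the paper's exact implementation.

```python
import heapq

def topk_rationale(node_scores, ratio):
    """Keep the k = max(1, int(ratio * N)) highest-scoring nodes as the
    fixed-size rationale. node_scores are assumed to come from a
    learned scoring network (omitted here); returns node indices.
    """
    k = max(1, int(len(node_scores) * ratio))
    return set(heapq.nlargest(k, range(len(node_scores)),
                              key=lambda i: node_scores[i]))
```

Fixing k sidesteps the exponential search over subsets at the cost of assuming the rationale size in advance.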
Corollary 1 (Necessity and Sufficiency)
Suppose is a strict loss function and there exists one and only one non-trivial subset ; then any structural causal model satisfies the DIR Principle iff .
This is directly obtained from Theorem 2. Thus, under the uniqueness constraint, we can approach the oracle model by optimizing the DIR objective, which maintains the invariant causal relation between the causal feature and the response variable . Viewed another way, based on the uniqueness of the feasible rationale, optimizing the DIR Principle on the intrinsically interpretable model (where the rationale is exhibited inside the model) pushes it towards the oracle with the correct rationales, which can then be approached as an invariant predictor.
Appendix D Setting Details
Table 3: Dataset statistics and model settings (values listed as Train / Val / Test).

Spurious-Motif: 3 classes; 9,000 / 3,000 / 6,000 graphs; avg. N# 25.4 / 26.1 / 88.7; avg. E# 35.4 / 36.2 / 131.1; backbone: Local Extremum GNN (Ranjan et al., 2020); neuron# [4,32,32,32]; global mean pool; generalization type: scale & correlation shift.
MNIST-75sp (reduced): 10 classes; 20,000 / 5,000 / 10,000 graphs; avg. N# 66.8 / 67.3 / 67.0; avg. E# 539.3 / 545.9 / 540.4; backbone: GNNs (Morris et al., 2019); neuron# [5,32,32,32]; global max pool; generalization type: noise.
Graph-SST2: 2 classes; 28,327 / 3,147 / 12,305 graphs; avg. N# 17.7 / 17.3 / 3.45; avg. E# 33.3 / 33.5 / 4.89; backbone: ARMA (Bianchi et al., 2019); neuron# [768,128,128,2]; global mean pool; generalization type: degree & scale shift.
OGBG-Molhiv: 2 classes; 32,901 / 4,113 / 4,113 graphs; avg. N# 25.3 / 27.79 / 25.3; avg. E# 54.1 / 61.1 / 55.6; backbone: GIN + virtual nodes (Xu et al., 2019; Hu et al., 2021); neuron# [9,300,300,300,1]; global add pool; generalization type: /.
Datasets.
We summarize the dataset statistics in Table 3, and introduce the node/edge features and the preprocessing for each dataset:


Spurious-Motif. We use random node features and constant edge weights in this dataset.

MNIST-75sp. The nodes in the graphs are superpixels, and the node features are the concatenation of pixel intensities (RGB channels) and the coordinates of their mass centers. Edge features are the spatial distances between superpixel centers, and we filter out the edges with a distance less than 0.1 to make the graphs sparser.

Graph-SST2. We use constant edge weights and filter out the graphs with fewer than three edges. We initialize the node features with the pre-trained BERT (Devlin et al., 2018) word embeddings.

OGBG-Molhiv. We use the officially released dataset in our experiment.
GNNs.
We summarize the backbone GNNs for each dataset in Table 3. The numbers of neurons in the successive layers (in forward order) are reported. We use ReLU activations and different global pooling layers. In OGBG-Molhiv, we adopt one fully connected layer as the prediction head, while using two fully connected layers for the models on the other datasets. For baselines with node pooling/node attention, we add one node pooling/attention layer in the second convolution layer.
Training Optimization & Early Stopping.
All experiments are done on a single Tesla V100 SXM2 GPU (32 GB). During training, we use the Adam (Kingma and Ba, 2015) optimizer. The maximum number of epochs is 400 for all datasets. We use Stochastic Gradient Descent (SGD) for the optimization on Graph-SST2 and OGBG-Molhiv, and Gradient Descent (GD) for the other two datasets. We also employ early stopping to avoid overfitting the training dataset. Specifically, on MNIST-75sp, Graph-SST2, and OGBG-Molhiv, each model is evaluated on a held-out in-distribution validation dataset after each epoch, while for Spurious-Motif we use an unbiased validation dataset (i.e., without the spurious correlations of the training dataset). If the model's performance on the validation dataset shows no improvement (i.e., validation accuracy begins to decrease) for five epochs, we stop the training process to prevent increased generalization error.
Hyper-Parameter Settings.
We set the causal feature ratio and as for MNIST-75sp, Spurious-Motif, Graph-SST2, and OGBG-Molhiv, respectively. For the other baselines, we adopt grid search for the best parameters using the validation datasets.
Model Selection.
We select each model based on its performance on the corresponding validation dataset. We repeat each experiment at least five times and report the average values and the standard errors in the paper.
Appendix E Unimodal Adjustment
We follow Cadène et al. (2019) to demonstrate how the shortcut prediction can help to remove model bias. For clarity, we refer to the model parameters outside the shortcut-only branch as the main branch.
Given a house-tree graph as the input, suppose the shortcut prediction on the tree subgraph leans towards the house class. Then, after the reweighting, the softmax readout on the house class in the joint prediction will be magnified, which results in a smaller loss backpropagated to the main branch and prevents it from inheriting the inductive bias.
In the opposite situation where a house-wheel graph is given as the input, we similarly suppose the shortcut prediction on the wheel subgraph leans towards a class other than house, say, the cycle class. Then, after the reweighting, the softmax readout on the house class in the joint prediction will be reduced, which results in a larger loss backpropagated to the main branch and encourages the model to learn from these examples.
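A minimal sketch of the element-wise reweighting described above, assuming the joint logits are the causal logits scaled by the sigmoid confidence of the shortcut branch (the exact fusion in the model may differ):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def joint_logits(z_causal, z_shortcut):
    """Scale each causal logit by the shortcut branch's sigmoid
    confidence for that class. Classes the shortcut deems likely keep
    most of their causal logit; the rest are damped towards zero,
    changing the loss routed back to the main branch."""
    return [c * sigmoid(s) for c, s in zip(z_causal, z_shortcut)]
```

For example, `joint_logits([2.0, 0.0], [0.0, 0.0])` halves the first logit to 1.0, since sigmoid(0) = 0.5.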
Furthermore, we offer causal and information-theoretical justifications: (1) From the perspective of causal theory (Pearl, 2000; Pearl et al., 2016), the element-wise multiplication enforces the spurious prediction to estimate the pure indirect effect (PIE) of the shortcut features, while the causal prediction captures the natural direct effect (NDE) of the causal patterns (VanderWeele, 2013); (2) From the perspective of information theory (Kullback, 1997), the element-wise multiplication makes the causal prediction reflect the conditional mutual information between the causal patterns and the ground truths, conditioning on the complement patterns.
Appendix F More Experimental Results
f.1 Visualization
We provide more visualization cases from the Graph-SST2 dataset in Figure 6 and Figure 7. The rationales are highlighted in deep colors.
f.2 Sensitivity Analysis
We analyze the performance of DIR w.r.t. the hyperparameter weighting the variance term. As shown in Figure 10, when it is set to zero, DIR degrades to optimizing the performance in each environment only, without explicitly penalizing the shortcuts' influence on the model predictions. We also observe that all testing performances drop sharply if the weight is too large: a large weight on the variance term emphasizes the invariance condition while overlooking the performance loss, so the model could fail to exhibit the rationales correctly. Notably, such a trade-off in the DIR objective is shared across all the datasets.
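The trade-off above can be sketched with a toy version of the objective, assuming it combines the mean risk over the interventional environments with a weighted variance penalty (the names and exact risk terms are illustrative):

```python
def dir_objective(env_risks, lam):
    """Mean risk across environments plus lam times the variance of
    those risks. lam = 0 optimizes each environment's performance
    only; a large lam stresses invariance over predictive accuracy."""
    n = len(env_risks)
    mean = sum(env_risks) / n
    var = sum((r - mean) ** 2 for r in env_risks) / n
    return mean + lam * var
```

With equal risks the penalty vanishes regardless of lam; with unequal risks, increasing lam shifts the optimum towards solutions whose risks agree across environments.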
f.3 Study of the Spurious Classifiers
Here we provide more observations on the predictions of the learned spurious classifier, which shed light on the designed model mechanism. We first look into the confidence of the predictions and define
(16) 
Uniform: 1.10 (Spurious-Motif, = 0.9), 2.30 (MNIST-75sp), 0.693 (Graph-SST2), 0.693 (Molhiv).
Spurious Predictions: 0.529 (Spurious-Motif, = 0.9), 1.93 (MNIST-75sp), 0.265 (Graph-SST2), 0.187 (Molhiv).
SpuriousMotif ( 
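Assuming the metric in Eq. (16) is an average cross-entropy over classes (our reading, since the equation itself is not reproduced here), the Uniform row reduces to the entropy of a uniform distribution, ln(K) for K classes, matching the reported 1.10, 2.30, and 0.693:

```python
import math

def uniform_prediction_entropy(num_classes):
    """Cross-entropy of a uniform prediction over num_classes classes,
    which equals ln(num_classes) in nats: ln(3) ~ 1.10, ln(10) ~ 2.30,
    ln(2) ~ 0.693."""
    p = 1.0 / num_classes
    return -sum(p * math.log(p) for _ in range(num_classes))
```

The spurious classifier's lower values thus indicate predictions noticeably more confident than uniform.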
