 # Invariant Rationalization

Selective rationalization improves neural network interpretability by identifying a small subset of input features – the rationale – that best explains or supports the prediction. A typical rationalization criterion, i.e. maximum mutual information (MMI), finds the rationale that maximizes the prediction performance based only on the rationale. However, MMI can be problematic because it picks up spurious correlations between the input features and the output. Instead, we introduce a game-theoretic invariant rationalization criterion where the rationales are constrained to enable the same predictor to be optimal across different environments. We show both theoretically and empirically that the proposed rationales can rule out spurious correlations, generalize better to different test scenarios, and align better with human judgments. Our data and code are available.

## Code Repositories

### invariant_rationalization

Tensorflow implementation of Invariant Rationalization


## 1 Introduction

A number of selective rationalization techniques (lei2016rationalizing; li2016understanding; chen2018learning; chen2018shapley; yu2018learning; yu2019rethinking; chang2019game) have been proposed to explain the predictions of complex neural models. The key idea driving these methods is to find a small subset of the input features – rationale – that suffices on its own to yield the same outcome. In practice, rationales that remove much of the spurious content from the input, e.g., text, could be used and examined as justifications for model’s predictions.

The most commonly-used criterion for rationales is the maximum mutual information (MMI) criterion. In the context of NLP, it defines the rationale as the subset of input text that maximizes the mutual information between the subset and the model output, subject to the constraint that the selected subset remains within a prescribed length. Specifically, if we denote the random variable corresponding to the input as X, the rationale as Z, and the model output as Y, then the MMI criterion finds the rationale Z that yields the highest prediction accuracy of Y.

The MMI criterion can nevertheless lead to undesirable results in practice. It is prone to highlighting spurious correlations between the input features and the output as valid explanations. While such correlations represent statistical relations present in the training data, and are thus incorporated into the neural model, the impact of such features on the true outcome (as opposed to the model's predictions) can change at test time. In other words, MMI may select features that do not explain the underlying relationship between the inputs and outputs, even though they may still faithfully report the model's behavior. We seek to modify the rationalization criterion to better tailor it to finding causal features.


As an example, consider figure 1, which shows a beverage review covering four aspects of beer: appearance, smell, palate, and overall. The reviewer also assigned a score to each of these aspects. Suppose we want to find an explanation supporting a positive score for smell. The correct explanation is the portion of the review that actually discusses smell, as highlighted in green. However, reviews of other aspects such as palate (highlighted in red) may co-vary with the smell score since, as senses, smell and palate are related. The overall statement (highlighted in blue) would typically also correlate with any individual aspect score, including smell. Taken together, the sentences highlighted in green, red and blue would all be highly correlated with the positive smell score. As a result, MMI may select any one of them (or some combination) as the rationale, depending on the precise statistics of the training data. Yet only the green sentence constitutes an adequate explanation.

Our goal is to design a rationalization criterion that approximates finding causal features. While assessing causality is generally challenging, we can approximate the task by searching for features that are invariant. This notion was recently introduced in the context of invariant risk minimization (IRM) (arjovsky2019invariant). The main idea is to expose spurious (non-causal) variation by dividing the data into different environments. The same predictor, if based on causal features, should remain optimal in each environment separately.

In this paper, we propose invariant rationalization (InvRat), a novel rationalization scheme that incorporates the invariance constraint. We extend the IRM principle to neural predictions by resorting to a game-theoretic framework to impose invariance. Specifically, the proposed framework consists of three modules: a rationale generator, an environment-agnostic predictor, and an environment-aware predictor. The rationale generator generates rationales Z from the input X, and both predictors try to predict the output Y from Z. The only difference between the two predictors is that the environment-aware predictor also has access to which environment each training data point is drawn from. The goal of the rationale generator is to restrict the rationales in a manner that closes the performance gap between the two predictors while still maximizing the prediction accuracy of the environment-agnostic predictor.

We show theoretically that InvRat can solve the invariant rationalization problem, and that the invariant rationales generalize well to unknown test environments in a well-defined minimax sense. We evaluate InvRat on multiple datasets with known spurious correlations. The results show that InvRat does significantly better at removing spurious correlations and at finding explanations that better align with human judgments.

## 2 Preliminaries: MMI and Its Limitation

In this section, we formally review the MMI criterion and analyze its limitation using a probabilistic model. Throughout the paper, upper-cased letters, X and **X**, denote random scalars and vectors respectively; lower-cased letters, x and **x**, denote deterministic scalars and vectors respectively; H(X) denotes the Shannon entropy of X; H(Y|X) denotes the entropy of Y conditional on X; I(Y;X) denotes the mutual information. Without causing ambiguities, we use p_X(·) and p(X) interchangeably to denote the probability mass function of X.

### 2.1 Maximum Mutual Information Criterion

The MMI objective can be formulated as follows. Given the input-output pairs (X, Y), MMI aims to find a rationale Z, a masked version of X, that maximizes the mutual information between Z and Y. Formally,

$$\max_{m \in \mathcal{S}} I(Y; Z) \quad \text{s.t.} \quad Z = m \odot X, \tag{1}$$

where m ∈ {0, 1}^N is a binary mask and 𝒮 denotes the subset of {0, 1}^N satisfying a sparsity and a continuity constraint. N is the total length of X. We leave the exact mathematical form of the constraint set abstract here; it will be formally introduced in section 3.5. ⊙ denotes the element-wise multiplication of two vectors or matrices. Since the mutual information measures the predictive power of Z on Y, MMI essentially tries to find the subset of input features that best predicts the output Y.

Figure 2: A probabilistic model illustrating different parts of an input that have different probabilistic relationships with the model output Y. A sentence X can be divided into three variables X1, X2 and X3. All of X1, X2 and X3 can be highly correlated with Y, but only X1 is regarded as a plausible explanation.
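To make the MMI objective concrete, the following is a minimal sketch (ours, not the paper's implementation) that computes I(Y; Z) exactly for a tiny discrete example: a mask that keeps the informative feature preserves one bit of mutual information with Y, while a mask that drops it leaves none.

```python
import itertools
import math

def mutual_information(joint):
    """Exact I(Y;Z) in bits from a joint distribution {(y, z): prob}."""
    py, pz = {}, {}
    for (y, z), p in joint.items():
        py[y] = py.get(y, 0.0) + p
        pz[z] = pz.get(z, 0.0) + p
    return sum(p * math.log2(p / (py[y] * pz[z]))
               for (y, z), p in joint.items() if p > 0)

def masked_joint(mask):
    """Toy data: two uniform binary features, and the label Y copies feature 0.
    Z = m ⊙ X keeps only the unmasked features."""
    joint = {}
    for x in itertools.product([0, 1], repeat=2):
        y = x[0]
        z = tuple(v for v, m in zip(x, mask) if m)
        joint[(y, z)] = joint.get((y, z), 0.0) + 0.25
    return joint

mi_keep = mutual_information(masked_joint((1, 0)))  # informative feature kept
mi_drop = mutual_information(masked_joint((0, 1)))  # informative feature masked out
```

MMI would prefer the first mask, since it maximizes the retained information about Y under the same sparsity budget.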

### 2.2 MMI Limitations

The biggest problem of MMI is that it is prone to picking up spurious probabilistic correlations rather than finding the causal explanation. To demonstrate why this is the case, consider the probabilistic graph in figure 2, where X is divided into three variables, X1, X2 and X3, which represent three typical relationships with Y: X1 influences Y; X2 is influenced by Y; X3 has no direct connection with Y. The dashed arrows represent additional probabilistic dependencies among X1, X2 and X3. For now, we ignore the environment variable E.

As observed from the graph, X1 serves as the valid explanation of Y, because it is the true cause of Y. Neither X2 nor X3 is a valid explanation. However, X1, X2 and X3 can all be highly predictive of Y, so the MMI criterion may select any of the three as the rationale. Concretely, consider the following toy example with all binary variables. Assume p_{X1}(1) = p_{X1}(0) = 0.5, and

$$p_{Y|X_1}(1|1) = p_{Y|X_1}(0|0) = 0.9, \tag{2}$$

which makes X1 a good predictor of Y. Next, define the conditional prior of X2 as

$$p_{X_2|Y}(1|1) = p_{X_2|Y}(0|0) = 0.9.$$

By Bayes' rule,

$$p_{Y|X_2}(1|1) = p_{Y|X_2}(0|0) = 0.9, \tag{3}$$

which also makes X2 a good predictor of Y. Finally, assume the conditional prior of X3 is

$$p_{X_3|X_1,X_2}(1|1,1) = p_{X_3|X_1,X_2}(0|0,0) = 1, \quad p_{X_3|X_1,X_2}(1|0,1) = p_{X_3|X_1,X_2}(1|1,0) = 0.5.$$

It can be computed that

$$p_{X_3|Y}(1|1) = p_{X_3|Y}(0|0) = 0.9. \tag{4}$$

In short, according to equations (2), (3) and (4), we have constructed a set of priors such that the predictive power of X1, X2 and X3 is exactly the same. As a result, there is no reason for MMI to favor X1 over the others.
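The construction above can be checked numerically. The following sketch (ours) encodes the assumed priors, builds the full joint distribution, and confirms that X2 and X3 are exactly as predictive of Y as X1:

```python
from itertools import product

# Priors from the text: p(X1=1) = 0.5, eq. (2) for p(Y|X1), the stated prior
# for p(X2|Y), and X3 copying X1 when X1 == X2 and acting as a fair coin otherwise.
p_y_x1 = {(1, 1): 0.9, (0, 1): 0.1, (0, 0): 0.9, (1, 0): 0.1}  # p(y | x1)
p_x2_y = {(1, 1): 0.9, (0, 1): 0.1, (0, 0): 0.9, (1, 0): 0.1}  # p(x2 | y)

def p_x3(x3, x1, x2):
    return (1.0 if x3 == x1 else 0.0) if x1 == x2 else 0.5

joint = {}  # full joint p(x1, y, x2, x3)
for x1, y, x2, x3 in product([0, 1], repeat=4):
    joint[(x1, y, x2, x3)] = 0.5 * p_y_x1[(y, x1)] * p_x2_y[(x2, y)] * p_x3(x3, x1, x2)

def cond(y, idx, val):
    """p(Y = y | variable at position idx == val); positions: 0 -> X1, 2 -> X2, 3 -> X3."""
    num = sum(p for k, p in joint.items() if k[1] == y and k[idx] == val)
    den = sum(p for k, p in joint.items() if k[idx] == val)
    return num / den

p_y1_x1 = cond(1, 0, 1)  # eq. (2)
p_y1_x2 = cond(1, 2, 1)  # eq. (3)
p_y1_x3 = cond(1, 3, 1)  # eq. (4)
```

All three conditionals come out to 0.9, so MMI alone cannot separate the causal feature from the spurious ones.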

In fact, X1, X2 and X3 correspond to the three highlighted sentences in figure 1. X1 corresponds to the smell review (green sentence), because it represents the true explanation that influences the output decision. X2 corresponds to the overall review (blue sentence), because the overall summary of the beer is in turn influenced by the smell score. Finally, X3 corresponds to the palate review (red sentence), because the palate review has no direct relationship with the smell score. However, X3 may still be highly predictive of Y because it can be strongly correlated with X1 and X2. Therefore, we need a novel rationalization scheme that can distinguish X1 from the rest.

## 3 Invariant Rationalization

In this section, we propose invariant rationalization, a rationalization criterion that can exclude rationales with spurious correlations by utilizing the extra information provided by an environment variable. We introduce InvRat, a game-theoretic approach to solving the invariant rationalization problem, and then theoretically analyze its convergence properties and the generalizability of invariant rationales.

### 3.1 Invariant Rationalization

Without further information, distinguishing X1 from X2 and X3 is a challenging task. However, this challenge can be resolved if we also have access to an extra piece of information: the environment. As shown in figure 2, an environment is defined as an instance of the variable E that impacts the prior distribution of X (arjovsky2019invariant). On the other hand, we make the same assumption as in IRM that p(Y|X1) remains the same across the environments (hence there is no edge pointing from E to Y in figure 2), because X1 is the true cause of Y. As we will show soon, p(Y|X2) and p(Y|X3) do not remain the same across the environments, which distinguishes X1 from X2 and X3.

Returning to the binary toy example in section 2.2, suppose there are two environments, e1 and e2. In environment e1, all the prior distributions are exactly the same as in section 2.2. In environment e2, the priors are almost the same, except for the prior of X1. For notational ease, define q(·) as the probabilities under environment e2, i.e., q(·) = p(·|E = e2). Then, we assume that

$$q_{X_1}(1) = 0.6.$$

It turns out that such a small difference suffices to expose X2 and X3. In this environment, q_{Y|X1} is the same as in equation (2) as assumed. However, it can be computed that

$$q_{Y|X_2}(1|1) \approx 0.926, \quad q_{Y|X_2}(0|0) \approx 0.867, \quad q_{Y|X_3}(1|1) \approx 0.912, \quad q_{Y|X_3}(0|0) \approx 0.883,$$

which are different from equations (3) and (4). Notice that we have not yet assumed any changes in the priors of X2 and X3, which would introduce further differences. The fundamental cause of these differences is that Y is independent of E only when conditioned on X1, so p(Y|X1) does not change with E. We call this property invariance. The corresponding conditional independence does not hold for X2 and X3.
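These numbers can be reproduced directly from the stated priors. The sketch below (ours) rebuilds the toy joint distribution with the X1 prior shifted to 0.6 and reads off the conditionals in environment e2:

```python
from itertools import product

def build_joint(p_x1_is_1):
    """Joint p(x1, y, x2, x3) of the toy model, parameterized by the X1 prior."""
    p_y_x1 = {(1, 1): 0.9, (0, 1): 0.1, (0, 0): 0.9, (1, 0): 0.1}  # p(y | x1)
    p_x2_y = {(1, 1): 0.9, (0, 1): 0.1, (0, 0): 0.9, (1, 0): 0.1}  # p(x2 | y)
    def p_x3(x3, x1, x2):  # copies x1 when x1 == x2, else a fair coin
        return (1.0 if x3 == x1 else 0.0) if x1 == x2 else 0.5
    joint = {}
    for x1, y, x2, x3 in product([0, 1], repeat=4):
        p1 = p_x1_is_1 if x1 == 1 else 1 - p_x1_is_1
        joint[(x1, y, x2, x3)] = p1 * p_y_x1[(y, x1)] * p_x2_y[(x2, y)] * p_x3(x3, x1, x2)
    return joint

def cond(joint, y, idx, val):
    """p(Y = y | variable at position idx == val)."""
    num = sum(p for k, p in joint.items() if k[1] == y and k[idx] == val)
    den = sum(p for k, p in joint.items() if k[idx] == val)
    return num / den

q = build_joint(0.6)          # environment e2: only the prior of X1 changes
q_y1_x1 = cond(q, 1, 0, 1)    # p(Y|X1) stays at 0.9: invariance
q_y1_x2 = cond(q, 1, 2, 1)
q_y0_x2 = cond(q, 0, 2, 0)
q_y1_x3 = cond(q, 1, 3, 1)
q_y0_x3 = cond(q, 0, 3, 0)
```

Only p(Y|X1) survives the environment shift unchanged; the conditionals on X2 and X3 drift away from 0.9, exactly as the text claims.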

Therefore, given that we have access to multiple environments during training, i.e., multiple instances of the variable E, we propose the invariant rationalization objective as follows:

$$\max_{m \in \mathcal{S}} I(Y; Z) \quad \text{s.t.} \quad Z = m \odot X, \quad Y \perp E \mid Z, \tag{5}$$

where ⊥ denotes probabilistic independence. The only difference between equations (1) and (5) is that the latter has the invariance constraint, which is used to screen out X2 and X3. In practice, finding an eligible environment is feasible. In the beer review example in figure 1, a possible choice of environment is the brand of the beer, because different beer brands have different prior distributions of the review in each aspect – some brands are better at the appearance, others at the palate. Such variations in priors suffice to expose the non-invariance of the palate review or the overall review in terms of predicting the smell score.

### 3.2 The InvRat Framework

The constrained optimization in equation (5) is hard to solve in its original form. InvRat introduces a game-theoretic framework that can approximately solve this problem. Notice that the invariance constraint can be converted to a constraint on entropies, i.e.,

$$Y \perp E \mid Z \;\Leftrightarrow\; H(Y|Z, E) = H(Y|Z), \tag{6}$$

which means that if Z is invariant, E cannot provide extra information beyond Z for predicting Y. Guided by this perspective, InvRat consists of three players, as shown in figure 3:

• an environment-agnostic/-independent predictor f_i(Z);

• an environment-aware predictor f_e(Z, E); and

• a rationale generator g(X).

The goal of the environment-agnostic and environment-aware predictors is to predict Y from the rationale Z. The only difference between them is that the latter has access to E as another input feature, while the former does not. Formally, denote L(Y; Ŷ) as the cross-entropy loss on a single instance. The learning objective of these two predictors can then be written as follows:

$$\mathcal{L}_i^* = \min_{f_i(\cdot)} \mathbb{E}[L(Y; f_i(Z))], \quad \mathcal{L}_e^* = \min_{f_e(\cdot,\cdot)} \mathbb{E}[L(Y; f_e(Z, E))], \tag{7}$$

where Z = g(X). The rationale generator g(·) generates Z by masking X. The goal of the rationale generator is also to minimize the environment-agnostic prediction loss L_i*. However, there is an additional goal: to keep the gap between L_i* and L_e* small. Formally, the objective of the generator is as follows:

$$\min_{g(\cdot)} \mathcal{L}_i^* + \lambda h(\mathcal{L}_i^* - \mathcal{L}_e^*), \tag{8}$$

where h(t) is a convex function that is monotonically increasing in t when t < 0, and strictly monotonically increasing in t when t ≥ 0, e.g., h(t) = t and h(t) = ReLU(t).
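The generator objective in equation (8) can be sketched in a few lines. In this illustration (ours), h is taken to be ReLU, one of the admissible choices above, and the loss values are made-up numbers:

```python
def relu(t):
    return max(t, 0.0)

def generator_objective(loss_i, loss_e, lam=10.0, h=relu):
    """Eq. (8): agnostic loss plus a penalty on the invariance gap L_i* - L_e*."""
    return loss_i + lam * h(loss_i - loss_e)

# Invariant rationale: knowing E does not help, L_i* ≈ L_e*, so no penalty
# and the objective reduces to the plain prediction loss.
obj_invariant = generator_objective(0.47, 0.47)

# Spurious rationale: the environment-aware predictor does better (L_e* < L_i*),
# so the gap is penalized and the generator is pushed away from this rationale.
obj_spurious = generator_objective(0.47, 0.40)
```

With a sufficiently large λ, any rationale whose predictive power varies across environments incurs a penalty that outweighs its raw predictive advantage.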

### 3.3 Convergence Properties

This section justifies that equations (7) and (8) can solve equation (5) in its Lagrangian form. If the representation power of f_i(·) and f_e(·, ·) is sufficient, the cross-entropy losses achieve their entropy lower bounds, i.e.,

$$\mathcal{L}_i^* = H(Y|Z), \quad \mathcal{L}_e^* = H(Y|Z, E).$$

Notice that the environment-aware loss should be no greater than the environment-agnostic loss, because of the availability of more information, i.e., H(Y|Z, E) ≤ H(Y|Z). Therefore, the invariance constraint in equation (6) can be rewritten as an inequality constraint:

$$H(Y|Z) = H(Y|Z, E) \;\Leftrightarrow\; H(Y|Z) \leq H(Y|Z, E). \tag{9}$$

Finally, notice that L_i* − L_e* = H(Y|Z) − H(Y|Z, E). Thus the objective in equation (8) can be regarded as the Lagrangian form of equation (5), with the constraint rewritten as the inequality

$$h\big(H(Y|Z) - H(Y|Z, E)\big) \leq h(0). \tag{10}$$

According to the KKT conditions, λ > 0 when equation (10) is binding. Moreover, the objectives in equations (7) and (8) can be rewritten as a minimax game:

$$\min_{g(\cdot), f_i(\cdot)} \max_{f_e(\cdot,\cdot)} \mathcal{L}_i(g, f_i) + \lambda h\big(\mathcal{L}_i(g, f_i) - \mathcal{L}_e(g, f_e)\big), \tag{11}$$

where

$$\mathcal{L}_i(g, f_i) = \mathbb{E}[L(Y; f_i(Z))], \quad \mathcal{L}_e(g, f_e) = \mathbb{E}[L(Y; f_e(Z, E))].$$

Therefore, the generator plays a co-operative game with the environment-agnostic predictor and an adversarial game with the environment-aware predictor. The optimization can be performed using alternating gradient descent/ascent.

Figure 3: The InvRat framework with three players: the rationale generator, and the environment-agnostic and -aware predictors.
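The entropy quantities underlying this derivation can be computed exactly on the binary toy example of sections 2.2 and 3.1. The sketch below (ours; X3 is omitted for brevity and the two environments are assumed equally likely) shows that the gap H(Y|Z) − H(Y|Z, E), i.e., the idealized L_i* − L_e*, vanishes for Z = X1 but is strictly positive for Z = X2:

```python
import math
from itertools import product

def joint_xy_x2(p1):
    """Joint p(x1, y, x2) for one environment with p(X1=1) = p1."""
    p_y_x1 = {(1, 1): 0.9, (0, 1): 0.1, (0, 0): 0.9, (1, 0): 0.1}  # p(y | x1)
    p_x2_y = {(1, 1): 0.9, (0, 1): 0.1, (0, 0): 0.9, (1, 0): 0.1}  # p(x2 | y)
    return {(x1, y, x2): (p1 if x1 else 1 - p1) * p_y_x1[(y, x1)] * p_x2_y[(x2, y)]
            for x1, y, x2 in product([0, 1], repeat=3)}

envs = [joint_xy_x2(0.5), joint_xy_x2(0.6)]  # e1 and e2, assumed equally likely

def cond_entropy(pairs):
    """H(Y|Z) in bits from a dict {(z, y): prob}."""
    pz = {}
    for (z, y), p in pairs.items():
        pz[z] = pz.get(z, 0.0) + p
    return -sum(p * math.log2(p / pz[z]) for (z, y), p in pairs.items() if p > 0)

def entropy_gap(idx):
    """H(Y|Z) - H(Y|Z,E), where Z is the variable at position idx (0: X1, 2: X2)."""
    mixed, aware = {}, 0.0
    for joint in envs:
        pairs = {}
        for k, p in joint.items():
            key = (k[idx], k[1])
            pairs[key] = pairs.get(key, 0.0) + p
            mixed[key] = mixed.get(key, 0.0) + 0.5 * p
        aware += 0.5 * cond_entropy(pairs)   # H(Y|Z,E) = sum_e p(e) H(Y|Z,e)
    return cond_entropy(mixed) - aware

gap_x1 = entropy_gap(0)  # invariant: E adds nothing given X1
gap_x2 = entropy_gap(2)  # non-invariant: E helps predict Y beyond X2
```

This is exactly the signal the adversarial game exploits: the environment-aware predictor can only beat the agnostic one when the rationale is non-invariant.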

### 3.4 Invariance and Generalizability

In our previous discussions, we have justified invariant rationales in the sense that they can uncover consistent and causal explanations and leave out spurious statistical correlations. In this section, we further justify invariant rationales in terms of generalizability. We consider two sets of environments: a set of training environments {e_t} and a test environment e_a. Only the training environments are accessible during training; the prior distributions in the test environment are completely unknown. The question we ask is: does keeping the invariant rationales and dropping the non-invariant rationales improve generalizability in the unknown test environment?

Assume that 1) the training data are sufficient, 2) the predictor is environment-agnostic, 3) the predictor has sufficient representation power, and 4) the training converges to the global optimum. Under these assumptions, the predictor is able to replicate the training set distribution (with all the training environments mixed), p(Y|Z, {e_t}), which is optimal under the cross-entropy training objective. In the test environment e_a, the cross-entropy loss of this predictor is given by

$$\mathcal{L}_{\text{test}}^*(Z) = H\big(p(Y|Z, e_a); p(Y|Z, \{e_t\})\big),$$

where H(p; q) denotes the cross entropy between distributions p and q, and p(Y|Z, {e_t}) is short for p(Y|Z, E ∈ {e_t}). L_test*(Z) cannot be evaluated because the prior distributions in the test environment are unknown. Instead, we consider the worst scenario. For notational ease, we introduce the following shorthand for the test environment distributions:

$$\pi_1(x_1) = p_{X_1|E}(x_1|e_a), \quad \pi_2(x_2|x_1, y) = p_{X_2|X_1,Y,E}(x_2|x_1, y, e_a), \quad \pi_3(x_3|x_1, x_2) = p_{X_3|X_1,X_2,E}(x_3|x_1, x_2, e_a).$$

For the selected rationale Z, we consider an adversarial test environment (hence the notation e_a), which chooses π_1, π_2 and π_3 to maximize L_test*(Z) (note that L_test*(Z) is a function of π_1, π_2 and π_3). The following theorem shows that the minimizer of this adversarial loss is the invariant rationale X1.

###### Theorem 1.

Assume the probabilistic graph in figure 2 and that there are two environments, {e_t} and e_a. X1 achieves the saddle point of the following minimax problem:

$$\min_{Z \in 2^{\boldsymbol{X}}} \max_{\pi_1, \pi_2, \pi_3} \mathcal{L}_{\text{test}}^*(Z; \pi_1, \pi_2, \pi_3),$$

where 2^{X} denotes the power set of {X1, X2, X3}.

The proof is provided in appendix A. Theorem 1 shows a desirable property of the invariant rationale: it minimizes the risk under the most adverse test environment.

### 3.5 Incorporating Sparsity and Continuity Constraints

The sparsity and continuity constraints (the set 𝒮 in equation (5)) stipulate that the total number of 1's in the mask m should be upper bounded and that the selected positions should be contiguous. There are two ways to implement the constraints.

Soft constraints: Following chang2019game, we can add another two Lagrange terms to equations (11):

$$\mu_1 \left| \frac{1}{N} \mathbb{E}[\|m\|_1] - \alpha \right| + \mu_2 \, \mathbb{E}\left[ \sum_{n=2}^{N} |m_n - m_{n-1}| \right], \tag{12}$$

where m_n denotes the n-th element of m and α is a predefined sparsity level. m is produced by an independent selection process (lei2016rationalizing). This method is flexible, but requires sophisticated tuning of the three Lagrange multipliers.
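The two soft-constraint terms can be sketched as follows (ours; a single-example version of equation (12), without the batch expectation):

```python
def sparsity_continuity_penalty(m, alpha, mu1=1.0, mu2=1.0):
    """Eq. (12), per example: deviation of the selection rate from the target
    sparsity level alpha, plus the number of 0/1 transitions in the mask."""
    N = len(m)
    sparsity = abs(sum(m) / N - alpha)
    continuity = sum(abs(m[n] - m[n - 1]) for n in range(1, N))
    return mu1 * sparsity + mu2 * continuity

# One contiguous span of 3 tokens out of 6 at alpha = 0.5: no sparsity
# penalty, and exactly two transitions (0 -> 1 and 1 -> 0).
penalty = sparsity_continuity_penalty([0, 1, 1, 1, 0, 0], alpha=0.5)
```

A scattered mask with the same number of selected tokens would keep the sparsity term at zero but pay more in the continuity term, which is what pushes the selection toward contiguous spans.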

Hard constraints: An alternative approach is to force m to select one chunk of text with a pre-specified length l. Instead of predicting the mask directly, g(·) produces a score s_n for each position n, and the start position of the chunk is chosen as the maximum of the scores. Formally,

$$m_n = \mathbb{1}\left(0 \leq n - \underset{n'}{\arg\max}\, s_{n'} < l\right), \tag{13}$$

where 𝟙(·) denotes the indicator function, which equals 1 if the argument is true and 0 otherwise. Equation (13) is not differentiable, so when computing the gradients for back-propagation, we apply the straight-through technique (bengio2013estimating) and approximate it with the gradient of

$$\hat{s} = \mathrm{softmax}(s), \quad m = \mathrm{CausalConv}(\hat{s}),$$

where CausalConv(·) denotes causal convolution, with the convolution kernel being an all-one vector of length l.
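This relaxation can be sketched in plain Python (ours, not the paper's TensorFlow code): a softmax over the position scores followed by a causal convolution with an all-one kernel of length l yields a soft mask that approaches a hard contiguous chunk as the scores sharpen.

```python
import math

def softmax(s):
    e = [math.exp(v - max(s)) for v in s]
    total = sum(e)
    return [v / total for v in e]

def causal_conv_mask(s, l):
    """m_n = sum_{k=0}^{l-1} s_hat[n-k]: causal convolution of the softmaxed
    scores with an all-one kernel of length l."""
    s_hat = softmax(s)
    return [sum(s_hat[n - k] for k in range(l) if n - k >= 0)
            for n in range(len(s_hat))]

# With a sharply peaked score, the soft mask approaches a hard, contiguous
# chunk of length l starting at the argmax position (index 2 here).
scores = [0.0, 0.0, 50.0, 0.0, 0.0, 0.0]
m = causal_conv_mask(scores, l=3)
```

When ŝ is (nearly) one-hot at the argmax, the convolution spreads that unit mass over the next l positions, recovering equation (13) while remaining differentiable in the scores.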

## 4 Experiments

### 4.1 Datasets

To evaluate the invariant rationale generation, we consider the following two binary classification datasets with known spurious correlations.

#### Imdb (maas2011learning):

The original dataset consists of 25,000 movie reviews for training and 25,000 for testing. The output Y is the binarized score of the movie. We construct a synthetic setting that manually injects tokens with spurious correlations with Y, whose prior varies across artificial environments. The goal is to validate whether the proposed method excludes these tokens from the rationale selections. Specifically, we first randomly split the training set into two balanced subsets, each of which is considered an environment. We append the punctuation "," or "." at the beginning of each sentence with the following distributions:

$$p(\text{append ``,''} \mid Y=1, e_i) = p(\text{append ``.''} \mid Y=0, e_i) = \alpha_i, \quad p(\text{append ``.''} \mid Y=1, e_i) = p(\text{append ``,''} \mid Y=0, e_i) = 1 - \alpha_i.$$

Here i is the environment index taking values in {0, 1}. Specifically, we set α_0 and α_1 to 0.9 and 0.7, respectively, for the training set. For the purpose of model selection and evaluation, we randomly split the original test set into two balanced subsets, which serve as our new validation and test sets. To test how different rationalization techniques generalize to unknown environments, we also inject the punctuation into the validation and test sets, but with α_0 and α_1 set to 0.5 for the validation set, and to 0.1 and 0.3 for the test set. According to the distributions above, the manually injected "," and "." behave like the variable X2 in figure 2, with a strong but spurious correlation to the label. It is worth mentioning that the environment ID is only provided in the training set.
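The injection procedure can be sketched as follows; `inject_punctuation` and the sample texts are our own illustration of the stated distributions, not the authors' code:

```python
import random

def inject_punctuation(texts, labels, alpha, rng):
    """Prepend ',' or '.' following the stated conditional distributions:
    for Y=1, ',' with probability alpha; for Y=0, '.' with probability alpha."""
    out = []
    for text, y in zip(texts, labels):
        favored, other = (",", ".") if y == 1 else (".", ",")
        tok = favored if rng.random() < alpha else other
        out.append(tok + " " + text)
    return out

# Environment e0 of the training set: alpha_0 = 0.9, so positive reviews
# start with "," about 90% of the time.
rng = random.Random(0)
reviews = inject_punctuation(["great movie"] * 10000, [1] * 10000, 0.9, rng)
frac_comma = sum(r.startswith(",") for r in reviews) / len(reviews)
```

Because α changes across environments (and flips below 0.5 at test time), an MMI rationalizer that latches onto the punctuation generalizes poorly, while an invariant one should ignore it.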

#### Multi-aspect beer reviews (mcauley2012learning):

This dataset is commonly used in the field of rationalization (lei2016rationalizing; bao2018deriving; yu2019rethinking; chang2019game). It contains 1.5 million beer reviews, each of which evaluates multiple aspects of a beer, including appearance, aroma (smell), palate, taste, and overall. Each aspect has a rating on the scale [0, 1]. The goal is to provide rationales for these ratings. The rating scores of different aspects in the same review are highly correlated, making it difficult to directly learn a rationalization model from the original data. Therefore, only decorrelated subsets were selected as training data in previous work (lei2016rationalizing; yu2019rethinking).

However, the high correlation among rating scores in the original data provides a perfect benchmark for evaluating InvRat's ability to exclude irrelevant but highly correlated aspects, because these aspects can be thought of as X2 and X3 in figure 2, as discussed in section 2.2. To construct different environments, we cluster the data based on the degree of correlation among the aspects. To gauge this correlation, we train a simple linear regression model to predict the rating of the target aspect given the ratings of all the other aspects except the overall. A low prediction error on a data point implies high correlation among its aspect ratings. We then assign the data to different environments based on the linear prediction error. In particular, we construct two training environments from the data with the lowest prediction error, i.e., the highest correlations. The first training environment is sampled from the lowest 25 percentiles of the prediction error, while the second one is from the 25th to the 50th percentile. In contrast, we construct a validation set and a subjective evaluation set from the data with the highest prediction error (i.e., above the 50th percentile). Following the same evaluation protocol as (bao2018deriving; chang2019game), we consider a classification setting by treating reviews with ratings ≤ 0.4 as negative and ≥ 0.6 as positive. Each training environment has a total of 5,000 label-balanced examples, which makes the size of the training set 10,000. The sizes of the validation set and the subjective evaluation set are 2,000 and 400, respectively. As in almost all previous work on rationalization, we focus on the appearance, aroma, and palate aspects only.

This dataset also includes sentence-level annotations for about 1,000 reviews. Each sentence is annotated with one or multiple aspect labels, indicating which aspect the sentence belongs to. We use this set to automatically evaluate the precision of the extracted rationales.

### 4.2 Baselines

We consider the following two baselines:

Rnp: A generator-predictor framework proposed by lei2016rationalizing for rationalizing neural prediction (Rnp). The generator selects text spans as rationales which are then fed to the predictor for label classification. The selection optimizes the MMI criterion shown in equation (1).

3Player: An improved version of Rnp from yu2019rethinking that aims to alleviate the degeneration problem of Rnp. The model consists of three modules: the generator, the predictor and the complement predictor. The complement predictor tries to maximize the predictive accuracy from the unselected words. Besides the MMI objective optimized between the generator and the predictor, the generator also plays an adversarial game with the complement predictor, trying to minimize its performance.

### 4.3 Implementation Details

For all experiments, we use bidirectional gated recurrent units (chung2014empirical) with hidden dimension 256 for the generator and both predictors. All methods use fixed 100-dimensional GloVe embeddings (pennington2014glove). We use the Adam optimizer (kingma2014adam) with a learning rate of 0.001. The batch size is set to 500. To seek fair comparisons, we keep the settings of Rnp and 3Player as close to ours as possible. We re-implement Rnp, and use the open-source implementation of 3Player. The only major difference between these models is that both Rnp and InvRat use the straight-through technique (bengio2013estimating) to deal with the non-differentiability of rationale selection, while 3Player is based on the policy gradient (williams1992simple).

For the IMDB dataset, we follow a standard setting (lei2016rationalizing; chang2019game) and use the soft constraints to regularize the selected rationales for all methods. Hyperparameters (i.e., μ_1 and μ_2 in equation (12), λ in equation (8), and the number of training epochs) are determined based on the best performance on the validation set. For the beer review task, we find the baseline methods perform much worse with the soft constraints than with the hard ones. This might be because the reviews of the different aspects are highly correlated in the training set. Thus, we consider the hard constraints with different lengths l in generating rationales.

### 4.4 Results

#### Imdb:

Table 1 shows the results on the synthetic IMDB dataset. As we can see, Rnp selects the injected punctuation in 78.24% of the test samples, while InvRat, as expected, does not highlight any. This result verifies our theoretical analysis in section 3. Moreover, because Rnp relies on the injected punctuation, whose distribution varies drastically between the training and test sets, its generalizability is poor, which leads to low predictive accuracy on the test set. Specifically, there is a large gap of around 15% between the test performance of Rnp and that of the proposed InvRat. It is worth pointing out that, by construction of the dataset, 3Player will obviously fail by including all the punctuation in the rationales, because otherwise the complement predictor would have a clear clue for guessing the predicted label. Thus, we exclude 3Player from this comparison.


#### Beer review:

We conduct both objective and subjective evaluations for the beer review dataset. We first compare the generated rationales against the human annotations and report precision, recall and F1 score in table 2. Similarly, the reported performances are based on the best performance on the validation set, which is also reported. We consider the highlight lengths of 10, 20 and 30.

We observe that InvRat consistently surpasses the two baselines in finding rationales that align with the human annotations, for most rationale lengths and aspects. In particular, although the best validation accuracies of the three methods show only small variations, the improvements in finding the correct rationales are significant. For example, InvRat improves over the other two methods by more than 20 absolute percentage points in F1 for the appearance aspect. The two baseline methods fail to distinguish the true clues for the different aspects, which confirms that the MMI objective is insufficient for ruling out spurious words.

In addition, we visualize the rationales generated by our method with a preset length of 20 in figure 4. We observe that InvRat produces meaningful justifications for all three aspects. By reading these selected texts alone, humans can easily predict the aspect label. To further verify that the rationales generated by InvRat align with human judgment, we present a subjective evaluation via Amazon Mechanical Turk. Recall that for each aspect we preserved a hold-out set of 400 examples (1,200 examples in total for the three aspects). We generate rationales of different lengths for all methods. In each subjective test, the subject is presented with the rationale for one aspect of a beer review, generated by one of the three methods (unselected words blocked), and is asked to guess which aspect the rationale is talking about. We then compute the accuracy as the performance metric, shown in figure 5. Under this setting, a generator that picks spuriously correlated texts will have a low accuracy. As can be observed, InvRat achieves the best performance in all cases.

Figure 5: Subjective performances of generated rationales. Subjects are asked to guess the target aspect (i.e., which aspect the model is trained on) based on the generated rationales. We report the cases of preset rationale lengths of 10, 20 and 30.

## 5 Related Work

#### Selective rationalization:

Selective rationalization is one of the major categories of model interpretability in machine learning.

lei2016rationalizing first propose a generator-predictor framework for rationalization. The framework is formally a co-operative game that maximizes the mutual information between the selected rationales and the labels, as shown in (chen2018learning). Following this work, chen2018shapley improve the generator-predictor framework with a new rationalization criterion that considers the combinatorial nature of the selection. yu2019rethinking point out the communication problem in co-operative learning and propose a new three-player framework to control the unselected texts. chang2019game aim to generate rationales for all possible classes instead of the target label only, which makes the model perform counterfactual reasoning. In all, these models address different challenges in generating high-quality rationales, but they remain insufficient for distinguishing invariant words from correlated ones.

#### Self-explaining models beyond selective rationalization:

Besides selective rationalization, other approaches also improve the interpretability of neural predictions. For example, module networks (andreas2016learning; andreas2016neural; johnson2017inferring) compose appropriate modules following the logical program produced by a natural language component. The restriction to a small set of pre-defined programs currently limits their applicability. Other lines of work include evaluating feature importance with gradient information (simonyan2013deep; li2016visualizing; sundararajan2017axiomatic) or local perturbations (kononenko2010efficient; lundberg2017unified), and interpreting deep networks by locally fitting interpretable models (ribeiro2016should; alvarez2018towards). However, these methods aim at providing post-hoc explanations of already-trained models and are not able to find invariant texts.

#### Learning with biases:

Our work also relates to the topic of discovering dataset-specific biases. Neural models have shown remarkable results in many NLP applications; however, these models are sometimes prone to fitting dataset-specific patterns or biases. For example, in natural language inference, such biased clues can be the word overlap between the input sentence pair (mccoy2019right) or whether the negation word "not" exists (niven2019probing). Similar observations have been made in multi-hop question answering (welbl2018constructing; min2019compositional). To learn with biased data without fully relying on the biases, lewis2018generative use generative objectives to force the QA models to make use of the full question; agrawal2018don and wang2019multi propose carefully designed model architectures to capture more complex interactions between input clues beyond the biases; ramakrishnan2018overcoming and belinkov2019adversarial propose adversarial regularizations that penalize internal representations that cooperate well with bias-only models; clark2019don and he2019unlearn propose to learn ensemble models that fit the residual of the prediction with bias features. However, all these works assume that the biases are known. Our work instead can rule out unwanted features without knowing the pattern a priori.

## 6 Conclusion

In this paper, we propose a game-theoretic approach to invariant rationalization, where the method is trained to constrain the probability of the output conditional on the rationales to be the same across multiple environments. The framework consists of three players, which competitively rule out spurious words with strong correlations to the output. We theoretically demonstrate that the proposed game-theoretic framework drives the solution towards better generalization to test scenarios that have different distributions from the training. Extensive objective and subjective evaluations on both synthetic and multi-aspect sentiment classification datasets demonstrate that InvRat performs favorably against existing algorithms in rationale generation.

## Appendix A Proof To Theorem 1

###### Proof.

For any Z ⊆ {X1, X2, X3}, partition Z into an invariant variable Z_I and a non-invariant variable Z_V:

$$Z_I = Z \cap \{X_1\}, \quad Z_V = Z \cap \{X_2, X_3\}.$$

Given an arbitrary π_1, we construct a specific π_2^* and π_3^* such that

$$\pi_2^*(x_2|x_1, y) = \pi_2^*(x_2), \quad \pi_3^*(x_3|x_1, x_2) = \pi_3^*(x_3). \tag{14}$$

In other words, we set these two priors such that all the non-invariant variables are uninformative of Y. Since the test adversary is allowed to choose any distribution, this set of priors is within the feasible set of the test adversary.

Under the set of priors in equation (14), the non-invariant features are not predictive of Y, and only the invariant features are, i.e.,

$$p(Y|Z, e_a) = p(Y|Z_I, e_a). \tag{15}$$

Therefore

$$\begin{aligned} \mathcal{L}_{\text{test}}^*(Z; \pi_1, \pi_2^*, \pi_3^*) &= H\big(p(Y|Z, e_a); p(Y|Z, e_t)\big) \qquad (16) \\ &\overset{(i)}{=} H\big(p(Y|Z_I, e_a); p(Y|Z, e_t)\big) \\ &\overset{(ii)}{\geq} H\big(p(Y|Z_I, e_a)\big) \\ &\overset{(iii)}{\geq} H\big(p(Y|X_1, e_a)\big) \\ &\overset{(iv)}{=} H\big(p(Y|X_1, e_a); p(Y|X_1, e_t)\big) \\ &= \mathcal{L}_{\text{test}}^*(X_1; \pi_1, \pi_2^*, \pi_3^*), \end{aligned}$$

where (i) follows from equation (15); (ii) follows from the relationship between cross entropy and entropy; (iii) holds because X1 minimizes the conditional entropy of Y, given e_a, among all the invariant variables (Z_I is either {X1} or the empty set); (iv) holds because, by the definition of invariant variables, p(Y|X_1, e_a) = p(Y|X_1, e_t). Here, we use H(p(Y|X_1, e_a)) to emphasize that the entropy is computed under the distribution of e_a. Therefore, if we optimize over π_2 and π_3, we have the following:

$$\max_{\pi_2, \pi_3} \mathcal{L}_{\text{test}}^*(Z; \pi_1, \pi_2, \pi_3) \geq \mathcal{L}_{\text{test}}^*(Z; \pi_1, \pi_2^*, \pi_3^*), \quad \max_{\pi_2, \pi_3} \mathcal{L}_{\text{test}}^*(X_1; \pi_1, \pi_2, \pi_3) = \mathcal{L}_{\text{test}}^*(X_1; \pi_1, \pi_2^*, \pi_3^*), \tag{17}$$

where the second line holds because L_test*(X1; π_1, π_2, π_3) does not depend on π_2 and π_3. Combining equations (16) and (17), we have

$$\max_{\pi_2, \pi_3} \mathcal{L}_{\text{test}}^*(Z; \pi_1, \pi_2, \pi_3) \geq \max_{\pi_2, \pi_3} \mathcal{L}_{\text{test}}^*(X_1; \pi_1, \pi_2, \pi_3). \tag{18}$$

Note that the above discussion holds for all π_1. Therefore, taking the maximum over π_1 in equation (18) preserves the inequality:

$$\max_{\pi_1, \pi_2, \pi_3} \mathcal{L}_{\text{test}}^*(Z; \pi_1, \pi_2, \pi_3) \geq \max_{\pi_1, \pi_2, \pi_3} \mathcal{L}_{\text{test}}^*(X_1; \pi_1, \pi_2, \pi_3),$$

which implies

$$X_1 = \underset{Z}{\arg\min} \; \max_{\pi_1, \pi_2, \pi_3} \mathcal{L}_{\text{test}}^*(Z; \pi_1, \pi_2, \pi_3). \qquad \blacksquare$$