1 Introduction
A number of selective rationalization techniques (lei2016rationalizing; li2016understanding; chen2018learning; chen2018shapley; yu2018learning; yu2019rethinking; chang2019game) have been proposed to explain the predictions of complex neural models. The key idea driving these methods is to find a small subset of the input features – the rationale – that suffices on its own to yield the same outcome. In practice, rationales that remove much of the spurious content from the input, e.g., text, could be used and examined as justifications for the model's predictions.
The most commonly used criterion for rationales is the maximum mutual information (MMI) criterion. In the context of NLP, it defines the rationale as the subset of input text that maximizes the mutual information between the subset and the model output, subject to the constraint that the selected subset remains within a prescribed length. Specifically, if we denote the random variables corresponding to the input as X, the rationale as Z, and the model output as Y, then the MMI criterion finds the explanation Z that yields the highest prediction accuracy of Y.
The MMI criterion can nevertheless lead to undesirable results in practice. It is prone to highlighting spurious correlations between the input features and the output as valid explanations. While such correlations represent statistical relations present in the training data, and are thus incorporated into the neural model, the impact of such features on the true outcome (as opposed to the model's predictions) can change at test time. In other words, MMI may select features that do not explain the underlying relationship between the inputs and outputs, even though they may still faithfully report the model's behavior. We seek to modify the rationalization criterion to better tailor it to finding causal features.
Beer: Smell | Label: Positive
375ml corked and caged bottle with bottled on date november 30 2005 , poured into snifter at brouwer ’s , reviewed on 5/15/11 . aroma : pours a clear golden color with orange hues and a whitish head that leaves some lacing around glass . smell : lots of barnyaardy funk with tons of earthy aromas , grass and some lemon peel . palate : similar to the aroma , lots of funk , lactic sourness , really earthy with citrus notes and oak . many layers of intriguing earthy complexities . overall : very funky and earthy gueuze , nice and crisp with good drinkability . 
As an example, consider figure 1, which shows a beverage review covering four aspects of beer: appearance, smell, palate, and overall. The reviewer also assigned a score to each of these aspects. Suppose we want to find an explanation supporting a positive score for smell. The correct explanation should be the portion of the review that actually discusses smell, as highlighted in green. However, reviews for other aspects such as palate (highlighted in red) may covary with the smell score since, as senses, smell and palate are related. The overall statement highlighted in blue would typically also correlate clearly with any individual aspect score, including smell. Taken together, the sentences highlighted in green, red and blue would all be highly correlated with the positive score for smell. As a result, MMI may select any one of them (or some combination) as the rationale, depending on the precise statistics of the training data. Only the green sentence constitutes an adequate explanation.
Our goal is to design a rationalization criterion that approximates finding causal features. While assessing causality is challenging, we can approximate the task by searching for features that are invariant. This notion was recently introduced in the context of invariant risk minimization (IRM) (arjovsky2019invariant). The main idea is to expose spurious (non-causal) variation by dividing the data into different environments. The same predictor, if based on causal features, should remain optimal in each environment separately.
In this paper, we propose invariant rationalization (InvRat), a novel rationalization scheme that incorporates the invariance constraint. We extend the IRM principle to neural predictions by resorting to a game-theoretic framework to impose invariance. Specifically, the proposed framework consists of three modules: a rationale generator, an environment-agnostic predictor, and an environment-aware predictor. The rationale generator generates the rationale Z from the input X, and both predictors try to predict Y from Z. The only difference between the two predictors is that the environment-aware predictor also has access to which environment each training data point is drawn from. The goal of the rationale generator is to restrict the rationales in a manner that closes the performance gap between the two predictors while still maximizing the prediction accuracy of the environment-agnostic predictor.
We show theoretically that InvRat can solve the invariant rationalization problem, and that the invariant rationales generalize well to unknown test environments in a welldefined minimax sense. We evaluate InvRat on multiple datasets with false correlations. The results show that InvRat does significantly better in removing false correlations and finding explanations that better align with human judgments.
2 Preliminaries: MMI and Its Limitation
In this section, we formally review the MMI criterion and analyze its limitations using a probabilistic model. Throughout the paper, upper-cased letters denote random scalars and vectors, and lower-cased letters denote deterministic scalars and vectors; H(Y) denotes the Shannon entropy of Y; H(Y | X) denotes the entropy of Y conditional on X; I(Y; X) denotes the mutual information. Without causing ambiguity, we use the subscripted and unsubscripted forms interchangeably to denote the probability mass function of a variable.
2.1 Maximum Mutual Information Criterion
The MMI objective can be formulated as follows. Given the input-output pairs (X, Y), MMI aims to find a rationale Z, which is a masked version of X, that maximizes the mutual information between Z and Y. Formally,
(1)   max_{m ∈ S} I(Y; Z)   s.t.   Z = m ⊙ X,
where m is a binary mask and S denotes the subset of masks satisfying a sparsity and a continuity constraint over the total length of X. We leave the exact mathematical form of the constraint set abstract here; it will be formally introduced in section 3.5. ⊙ denotes the element-wise multiplication of two vectors or matrices. Since the mutual information measures the predictive power of Z on Y, MMI essentially tries to find a subset of input features that can best predict the output Y.
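To make the criterion concrete, here is a minimal illustrative sketch (ours, not the paper's implementation) of MMI-style selection over discrete toy features: we estimate I(Y; Z) empirically for each candidate single-feature mask and keep the best one.

```python
import numpy as np

def mutual_information(y, z):
    """Empirical mutual information I(Y; Z) for two discrete 1-D arrays."""
    mi = 0.0
    for yv in np.unique(y):
        for zv in np.unique(z):
            p_yz = np.mean((y == yv) & (z == zv))
            if p_yz > 0:
                mi += p_yz * np.log(p_yz / (np.mean(y == yv) * np.mean(z == zv)))
    return mi

# Toy data: feature 0 determines the label, feature 1 is independent noise.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=(5000, 2))
y = x[:, 0]

# MMI with a one-feature budget: select the feature whose masked-in
# version Z maximizes I(Y; Z).
scores = [mutual_information(y, x[:, j]) for j in range(2)]
best = int(np.argmax(scores))  # index of the selected feature
```

With this data the causal feature wins decisively: I(Y; X_0) is near log 2 while I(Y; X_1) is near 0. The failure mode discussed next arises when several features tie.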
2.2 MMI Limitations
The biggest problem with MMI is that it is prone to picking up spurious probabilistic correlations rather than finding the causal explanation. To demonstrate why this is the case, consider the probabilistic graph in figure 2, where X is divided into three variables, X1, X2 and X3, which represent the three typical relationships with Y: X1 influences Y; X2 is influenced by Y; X3 has no direct connection with Y. The dashed arrows represent additional probabilistic dependencies among the three. For now, we ignore E.
As observed from the graph, X1 serves as the valid explanation of Y, because it is the true cause of Y. Neither X2 nor X3 is a valid explanation. However, X1, X2 and X3 can all be highly predictive of Y, so the MMI criterion may select any of the three as the rationale. Concretely, consider the following toy example with all binary variables. Assume a prior on X1 and a conditional distribution of Y given X1 such that
(2) 
which makes X1 a good predictor of Y. Next, define the conditional prior of X2 given Y as
By Bayes' rule,
(3) 
which makes X2 also a good predictor of Y. Finally, assume the conditional prior of X3 given X1 is
It can be computed that
(4) 
In short, according to equations (2), (3) and (4), we have constructed a set of priors such that the predictive powers of X1, X2 and X3 are exactly the same. As a result, there is no reason for MMI to favor X1 over the others.
In fact, X1, X2 and X3 correspond to the three highlighted sentences in figure 1. X1 corresponds to the smell review (green sentence), because it represents the true explanation that influences the output decision. X2 corresponds to the overall review (blue sentence), because the overall summary of the beer is itself influenced by the smell score. Finally, X3 corresponds to the palate review (red sentence), because the palate review does not have a direct relationship with the smell score. However, X3 may still be highly predictive of Y because it can be strongly correlated with X1 and X2. Therefore, we need to explore a novel rationalization scheme that can distinguish X1 from the rest.
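The construction above can be simulated numerically. The sketch below uses illustrative flip probabilities of our own choosing (not the elided priors from the text) to show that a spuriously correlated variable can match the causal one in predictive power:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# X1 causes Y; Y influences X2; X3 is correlated with X1 but has no
# direct link to Y. The 0.9 flip probabilities are illustrative choices
# of our own, not the priors used in the paper's toy example.
x1 = rng.integers(0, 2, n)
y  = np.where(rng.random(n) < 0.9, x1, 1 - x1)   # Y agrees with X1 w.p. 0.9
x2 = np.where(rng.random(n) < 0.9, y, 1 - y)     # X2 agrees with Y w.p. 0.9
x3 = np.where(rng.random(n) < 0.9, x1, 1 - x1)   # X3 agrees with X1 w.p. 0.9

def acc(feat, y):
    """Accuracy of the best single-feature predictor of y from feat."""
    a = np.mean(feat == y)
    return max(a, 1 - a)

accs = [acc(f, y) for f in (x1, x2, x3)]
```

Here X2 matches X1's predictive accuracy (about 0.9) even though it is an effect, not a cause, of Y, so an accuracy-driven criterion has no basis to prefer X1.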
3 Adversarial Invariant Rationalization
In this section, we propose invariant rationalization, a rationalization criterion that can exclude rationales with spurious correlations, utilizing the extra information provided by an environment variable. We will introduce InvRat, a game-theoretic approach to solving the invariant rationalization problem. We will then theoretically analyze the convergence properties and the generalizability of invariant rationales.
3.1 Invariant Rationalization
Without further information, distinguishing X1 from X2 and X3 is a challenging task. However, this challenge can be resolved if we also have access to an extra piece of information: the environment. As shown in figure 2, an environment is defined as an instance of the variable E that impacts the prior distribution of X (arjovsky2019invariant). On the other hand, we make the same assumption as in IRM that the conditional distribution of Y given X1 remains the same across the environments (hence there is no edge pointing from E to Y in figure 2), because X1 is the true cause of Y. As we will show soon, the conditionals of Y given X2 and given X3 will not remain the same across the environments, which distinguishes X1 from X2 and X3.
Returning to the binary toy example in section 2.2, suppose there are two environments. In the first environment, all the prior distributions are exactly the same as in section 2.2. In the second environment, the priors are almost the same, except for the conditional prior of X2, which we assume shifts slightly. It turns out that such a small difference suffices to expose X2 and X3. In this environment, the conditional of Y given X1 is the same as in equation (2), as assumed. However, it can be computed that the conditionals of Y given X2 and given X3
differ from equations (3) and (4). Notice that we have not yet assumed any changes in the priors of X1 and X3, which would introduce further differences. The fundamental cause of such differences is that Y is independent of E only when conditioned on X1, so the conditional of Y given X1 does not change with E. We call this property invariance. However, the conditional independence does not hold for X2 and X3.
Therefore, given that we have access to multiple environments during training, i.e., multiple instances of E, we propose the invariant rationalization objective as follows:
(5)   max_{m ∈ S} I(Y; Z)   s.t.   Z = m ⊙ X,   Y ⊥ E | Z
where ⊥ denotes probabilistic independence. The only difference between equations (1) and (5) is that the latter has the extra invariance constraint, which is used to screen out X2 and X3. In practice, finding an eligible environment is feasible. In the beer review example in figure 1, a possible choice of environment is the brand of beer, because different beer brands have different prior distributions of the review in each aspect: some brands are better in appearance, others in palate. Such variations in priors suffice to expose the non-invariance of the palate review and the overall review in terms of predicting the smell score.
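A quick numerical check of the invariance property: in the sketch below (with environment knobs of our own choosing, not the paper's), the conditional p(Y | X1) stays put across environments while p(Y | X2) shifts, which is exactly the signal the constraint in equation (5) exploits.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_env(n, q):
    """X1 -> Y -> X2. p(Y = X1) = 0.9 is held invariant, while
    p(X2 = Y) = q varies with the environment; both numbers are
    illustrative knobs of our own, not the paper's priors."""
    x1 = rng.integers(0, 2, n)
    y  = np.where(rng.random(n) < 0.9, x1, 1 - x1)
    x2 = np.where(rng.random(n) < q, y, 1 - y)
    return x1, y, x2

def p_y1_given(feat, y, v=1):
    """Empirical p(Y = 1 | feat = v)."""
    return np.mean(y[feat == v])

envs = [sample_env(300_000, q) for q in (0.9, 0.6)]
py_x1 = [p_y1_given(x1, y) for x1, y, _ in envs]   # stable across envs
py_x2 = [p_y1_given(x2, y) for _, y, x2 in envs]   # shifts across envs
```

The causal feature's conditional is environment-invariant; the effect-side feature's conditional tracks the environment and is exposed by the constraint.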
3.2 The InvRat Framework
The constrained optimization in equation (5) is hard to solve in its original form. InvRat introduces a game-theoretic framework that can approximately solve this problem. Notice that the invariance constraint can be converted to a constraint on entropy, i.e.,
(6)   H(Y | Z, E) = H(Y | Z),
which means that if Z is invariant, E cannot provide extra information beyond Z for predicting Y. Guided by this perspective, InvRat consists of three players, as shown in figure 3:

an environment-agnostic/-independent predictor;

an environment-aware predictor; and

a rationale generator.
The goal of the environment-agnostic and environment-aware predictors is to predict Y from the rationale Z. The only difference between them is that the latter also has access to E as an input feature while the former does not. Formally, denote the cross-entropy loss on a single instance. Then the learning objectives of these two predictors can be written as follows.
(7)   L_i* = min over the environment-agnostic predictor of E[L(Y; f_i(Z))],   L_e* = min over the environment-aware predictor of E[L(Y; f_e(Z, E))]
where Z = g(X). The rationale generator g generates Z by masking X. The goal of the rationale generator is also to minimize the invariant prediction loss L_i*. However, there is an additional goal: to make the gap between L_i* and L_e* small. Formally, the objective of the generator is as follows:
(8)   min_g  L_i*(g) + λ h(L_i*(g) − L_e*(g))
where h(t) is a convex function that is monotonically increasing in t when t < 0, and strictly monotonically increasing in t when t ≥ 0, e.g., h(t) = t and h(t) = ReLU(t).
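As a minimal numerical sketch of the generator objective, the function below evaluates L_i + λ h(L_i − L_e) with h = ReLU, one choice satisfying the stated conditions; the λ value and the loss numbers are illustrative, not from the paper.

```python
def relu(t):
    """h(t) = max(t, 0): a convex, monotone choice for the gap penalty."""
    return max(t, 0.0)

def generator_objective(L_i, L_e, lam=1.0, h=relu):
    """Eq. (8)-style generator objective: minimize the environment-agnostic
    loss L_i while penalizing the gap between L_i and the environment-aware
    loss L_e. lam and h = ReLU are illustrative choices."""
    return L_i + lam * h(L_i - L_e)

# An invariant rationale closes the gap (L_i == L_e), so no penalty;
# a non-invariant one leaves L_i > L_e and incurs the penalty.
invariant_val = generator_objective(0.5, 0.5)      # -> 0.5
noninvariant_val = generator_objective(0.9, 0.5)   # -> 1.3
```

The penalty term is zero exactly when the environment gives the aware predictor no advantage, which is the entropy condition in equation (6) expressed in loss form.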
3.3 Convergence Properties
This section justifies that equations (7) and (8) can solve equation (5) in its Lagrangian form. If the representation power of the two predictors is sufficient, the cross-entropy losses achieve their entropy lower bounds, i.e., L_i* = H(Y | Z) and L_e* = H(Y | Z, E).
Notice that the environment-aware loss should be no greater than the environment-agnostic loss, because of the availability of more information, i.e., L_e* ≤ L_i*. Therefore, the invariance constraint in equation (6) can be rewritten as an inequality constraint:
(9)   H(Y | Z) − H(Y | Z, E) ≤ 0
Finally, notice that I(Y; Z) = H(Y) − H(Y | Z), so maximizing I(Y; Z) is equivalent to minimizing H(Y | Z). Thus the objective in equation (8) can be regarded as the Lagrangian form of equation (5), with the constraint rewritten as the inequality constraint
(10)   L_i* − L_e* ≤ 0
According to the KKT conditions, the multiplier λ > 0 when equation (10) is binding. Moreover, the objectives in equations (7) and (8) can be rewritten as a minimax game
(11) 
where
Therefore, the generator plays a cooperative game with the environment-agnostic predictor and an adversarial game with the environment-aware predictor. The optimization can be performed using alternating gradient descent/ascent.
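The alternating descent/ascent procedure can be illustrated on a toy saddle-point problem; this stands in for, and greatly simplifies, the actual three-player game.

```python
def alternating_minimax(steps=500, lr=0.1):
    """Alternating gradient descent/ascent on the toy saddle problem
    f(u, v) = (u - 1)**2 - (v - 1)**2, whose saddle point is (1, 1).
    u plays the role of the minimizing side (generator plus
    environment-agnostic predictor), v the maximizing side
    (environment-aware predictor). A toy stand-in, not the real game."""
    u, v = 5.0, -3.0
    for _ in range(steps):
        u -= lr * 2 * (u - 1)        # descent step: df/du = 2(u - 1)
        v += lr * (-2) * (v - 1)     # ascent step:  df/dv = -2(v - 1)
    return u, v

u_star, v_star = alternating_minimax()  # converges near (1, 1)
```

Each player updates against the other's current strategy; for this strongly convex-concave toy problem the iterates contract to the saddle point, while in the neural setting the same alternation is run with stochastic gradients.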
3.4 Invariance and Generalizability
In the previous discussions, we justified invariant rationales in the sense that they can uncover consistent and causal explanations and leave out spurious statistical correlations. In this section, we further justify invariant rationales in terms of generalizability. We consider two sets of environments: a set of training environments and a test environment. Only the training environments are accessible during training; the prior distributions in the test environment are completely unknown. The question we ask is: does keeping the invariant rationales and dropping the non-invariant ones improve generalizability in the unknown test environment?
Assume that 1) the training data are sufficient, 2) the predictor is environment-agnostic, 3) the predictor has sufficient representation power, and 4) training converges to the global optimum. Under these assumptions, any predictor is able to replicate the training-set distribution (with all the training environments mixed), which is optimal under the cross-entropy training objective. In the test environment, the cross-entropy loss of this predictor is given by
This loss cannot be evaluated because the prior distributions in the test environment are unknown. Instead, we consider the worst case. For notational ease, we introduce the following shorthand for the test-environment distributions:
For the selected rationale Z, we consider an adversarial test environment, which chooses the test priors so as to maximize the loss (note that the loss is a function of these priors and of the selected rationale). The following theorem shows that the minimizer of this adversarial loss is the invariant rationale X1.
Theorem 1.
Assume the probabilistic graph in figure 2 and that there are two environments. Then the invariant rationale X1 achieves the saddle point of the following minimax problem
where denotes the power set of .
3.5 Incorporating Sparsity and Continuity Constraints
The sparsity and continuity constraint (equation (5)) stipulates that the total number of 1's in the mask should be upper bounded and that the selected positions should be contiguous. There are two ways to implement these constraints.
Soft constraints: Following chang2019game, we can add two further Lagrangian terms to equation (11):
(12)   λ_1 | (1/N) Σ_n m_n − α | + λ_2 Σ_n | m_n − m_{n+1} |
where m_n denotes the n-th element of the mask and α is a predefined sparsity level. The mask is produced by an independent selection process (lei2016rationalizing). This method is flexible but requires sophisticated tuning of the three Lagrange multipliers.
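A sketch of how the two soft penalties could be computed for a given binary mask; the Lagrange weights and target sparsity level here are placeholders to be tuned, as in the text.

```python
import numpy as np

def soft_constraints(m, alpha, lam1=1.0, lam2=1.0):
    """Sparsity + continuity penalties in the spirit of eq. (12):
    penalize deviation of the selection rate from a target level alpha,
    and penalize transitions between adjacent mask entries. lam1/lam2
    are placeholder weights."""
    m = np.asarray(m, dtype=float)
    sparsity = abs(m.mean() - alpha)        # | mean(m) - alpha |
    continuity = np.abs(np.diff(m)).sum()   # number of 0/1 transitions
    return lam1 * sparsity + lam2 * continuity

contiguous = soft_constraints([0, 1, 1, 1, 0], alpha=0.6)  # 2 transitions
scattered = soft_constraints([1, 0, 1, 0, 1], alpha=0.6)   # 4 transitions
```

Both masks hit the target sparsity exactly, so the continuity term alone prefers the contiguous selection, which is the intended effect of the second penalty.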
Hard constraints: An alternative approach is to force the generator to select one chunk of text with a pre-specified length l. Instead of predicting the mask directly, the generator produces a score s_n for each position n and predicts the start position of the chunk by choosing the position with the maximum score. Formally,
(13)   m_n = 1(p* ≤ n < p* + l),   where p* = argmax_n s_n
where 1(·) denotes the indicator function, which equals 1 if the argument is true and 0 otherwise. Equation (13) is not differentiable, so when computing gradients for backpropagation we apply the straight-through technique (bengio2013estimating) and approximate it with the gradient of a softened mask,
where the softening causally convolves the position scores with a kernel that is an all-one vector of length l.
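The hard-constraint selection itself is simple to sketch; the version below implements the forward pass of equation (13) (argmax start position plus a fixed-length chunk) and leaves out the straight-through backward pass.

```python
import numpy as np

def hard_mask(scores, l):
    """Forward pass of the hard constraint (eq. (13)): take the position
    with the highest score as the chunk start p*, then select the l
    contiguous positions starting there. Chunks starting near the end
    are simply truncated in this sketch."""
    scores = np.asarray(scores, dtype=float)
    p_star = int(np.argmax(scores))
    m = np.zeros_like(scores)
    m[p_star:p_star + l] = 1.0
    return m

mask = hard_mask([0.1, 0.3, 2.0, 0.5, 0.2], l=2)  # -> [0, 0, 1, 1, 0]
```

Because the argmax is non-differentiable, training would pair this forward pass with a softened surrogate (e.g., the causally convolved score distribution mentioned above) whose gradient is used in its place.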
4 Experiments
4.1 Datasets
To evaluate the invariant rationale generation, we consider the following two binary classification datasets with known spurious correlations.
IMDB (maas2011learning):
The original dataset consists of 25,000 movie reviews for training and 25,000 for testing. The output Y is the binarized score of the movie. We construct a synthetic setting that manually injects tokens with false correlations with Y, whose prior varies across artificial environments. The goal is to validate whether the proposed method excludes these tokens from the rationale selection. Specifically, we first randomly split the training set into two balanced subsets, each of which is considered an environment. We append the punctuation marks "," and "." at the beginning of each sentence with environment-dependent distributions, where the environment index takes values in {0, 1}. Specifically, we set the two injection probabilities to 0.9 and 0.7, respectively, for the training set. For the purposes of model selection and evaluation, we randomly split the original test set into two balanced subsets, which serve as our new validation and test sets. To test how different rationalization techniques generalize to unknown environments, we also inject the punctuation into the test and validation sets, but with the probabilities set to 0.5 for the validation set and to 0.1 and 0.3 for the test set. Under this construction, the manually injected "," and "." can be thought of as one of the spuriously correlated variables in figure 2, with strong correlations to the label. It is worth mentioning that the environment ID is only provided in the training set.
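Since the exact injection distribution is elided in the text, here is an illustrative stand-in of our own for the bias-injection step, with the injection probability left as a free parameter:

```python
import random

def inject(sentence, label, alpha):
    """Illustrative stand-in for the bias injection: with probability
    alpha, prepend the token that agrees with the label (',' for
    positive, '.' for negative); otherwise prepend the other token.
    The exact distribution used in the paper is elided in the text."""
    agrees = random.random() < alpha
    token = "," if (label == 1) == agrees else "."
    return token + " " + sentence

random.seed(0)
biased_pos = inject("a great movie", 1, alpha=1.0)  # starts with ","
biased_neg = inject("a dull movie", 0, alpha=1.0)   # starts with "."
```

Varying alpha per split (e.g., 0.9/0.7 in training, 0.5 in validation, 0.1/0.3 in test) reproduces the distribution-shift setup: a model that latches onto the token generalizes poorly once alpha changes.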
Multi-aspect beer reviews (mcauley2012learning):
This dataset is commonly used in the field of rationalization (lei2016rationalizing; bao2018deriving; yu2019rethinking; chang2019game). It contains 1.5 million beer reviews, each of which evaluates multiple aspects of a beer. These aspects include appearance, aroma, smell, palate and overall. Each aspect has a rating on the scale of [0, 1]. The goal is to provide rationales for these ratings. There is a high correlation among the rating scores of different aspects in the same review, making it difficult to learn a rationalization model directly from the original data. Therefore, only the decorrelated subsets were selected as training data in previous work (lei2016rationalizing; yu2019rethinking).
However, the high correlation among rating scores in the original data provides a perfect evaluation benchmark for InvRat's ability to exclude irrelevant but highly correlated aspects, because these aspects can be thought of as the variables X2 and X3 in figure 2, as discussed in section 2.2. To construct different environments, we cluster the data based on the degree of correlation among the aspects. To gauge this correlation, we train a simple linear regression model to predict the rating of the target aspect given the ratings of all the other aspects except the overall. A low prediction error on an example implies high correlation among the aspects. We then assign the data to different environments based on the linear prediction error. In particular, we construct two training environments using the data with the least prediction error, i.e., the highest correlations. The first training environment is sampled from the lowest 25 percentile of the prediction error, and the second from the 25 to 50 percentile. In contrast, we construct a validation set and a subjective evaluation set from the data with the highest prediction error (i.e., the highest 50 percentile). Following the same evaluation protocol as (bao2018deriving; chang2019game), we consider a classification setting by treating reviews with ratings ≤ 0.4 as negative and ≥ 0.6 as positive. Each training environment has a total of 5,000 label-balanced examples, making the size of the training set 10,000. The sizes of the validation set and the subjective evaluation set are 2,000 and 400, respectively. As in almost all previous work on rationalization, we focus on the appearance, aroma, and palate aspects only.
This dataset also includes sentence-level annotations for about 1,000 reviews. Each sentence is annotated with one or multiple aspect labels, indicating which aspect the sentence belongs to. We use this set to automatically evaluate the precision of the extracted rationales.
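The environment-assignment procedure described above can be sketched as follows; the percentile cutoffs mirror the text, while the function name and interface are our own:

```python
import numpy as np

def assign_environments(other_ratings, target_rating):
    """Split examples by how well the target aspect's rating is linearly
    predictable from the other aspects' ratings: lowest-25%-error examples
    go to environment 0, 25-50% to environment 1, and the high-error rest
    is held out (env -1), mirroring the construction in the text."""
    # Ordinary least squares with an intercept column.
    X = np.column_stack([other_ratings, np.ones(len(target_rating))])
    w, *_ = np.linalg.lstsq(X, target_rating, rcond=None)
    err = np.abs(X @ w - target_rating)
    q25, q50 = np.percentile(err, [25, 50])
    env = np.full(len(err), -1)
    env[err <= q25] = 0
    env[(err > q25) & (err <= q50)] = 1
    return env
```

Examples whose target rating is most predictable from the other aspects carry the strongest inter-aspect correlation, so placing them in the training environments is what makes the spurious signal vary and become detectable.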
4.2 Baselines
We consider the following two baselines:
Rnp: A generator-predictor framework proposed by lei2016rationalizing for rationalizing neural prediction (Rnp). The generator selects text spans as rationales, which are then fed to the predictor for label classification. The selection optimizes the MMI criterion shown in equation (1).
3Player: An improvement of Rnp from yu2019rethinking that aims to alleviate the degeneration problem of Rnp. The model consists of three modules: the generator, the predictor, and the complement predictor. The complement predictor tries to maximize the predictive accuracy from the unselected words. Besides the MMI objective optimized between the generator and the predictor, the generator also plays an adversarial game with the complement predictor, trying to minimize its performance.
4.3 Implementation Details
For all experiments, we use bidirectional gated recurrent units (chung2014empirical) with hidden dimension 256 for the generator and both predictors. All methods use fixed 100-dimensional GloVe embeddings (pennington2014glove). We use the Adam optimizer (kingma2014adam) with a learning rate of 0.001. The batch size is set to 500. To ensure fair comparisons, we keep the settings of both Rnp and 3Player the same as ours. We re-implement Rnp, and use the open-source implementation of 3Player (https://github.com/Gorov/three_player_for_emnlp). The only major difference between these models is that both Rnp and InvRat use the straight-through technique (bengio2013estimating) to deal with the non-differentiability of rationale selection, while 3Player is based on the policy gradient (williams1992simple).
For the IMDB dataset, we follow a standard setting (lei2016rationalizing; chang2019game) and use the soft constraints to regularize the selected rationales for all methods. Hyperparameters (i.e., λ_1 and λ_2 in equation (12), λ in equation (8), and the number of training epochs) are determined based on the best performance on the validation set. For the beer review task, we find that the baseline methods perform much worse with soft constraints than with hard ones. This might be because the review of each aspect is highly correlated in the training set. Thus, we use the hard constraints with different lengths l when generating rationales.
4.4 Results
IMDB:
Table 1 shows the results on the synthetic IMDB dataset. As we can see, Rnp selects the injected punctuation in 78.24% of the test samples, while InvRat, as expected, highlights none. This result verifies our theoretical analysis in section 3. Moreover, because Rnp relies on the injected punctuation, whose distribution varies drastically between the training and test sets, it generalizes poorly, which leads to low predictive accuracy on the test set. Specifically, there is a large gap of around 15% between the test performance of Rnp and that of the proposed InvRat. It is worth pointing out that, by the dataset construction, 3Player will obviously fail by including all punctuation in the rationales, because otherwise the complement predictor would have a clear clue to guess the predicted label. Thus, we exclude 3Player from this comparison.
Methods | Dev Acc | Test Acc | Bias Highlighted (%)
Rnp | 78.90 | 72.25 | 78.24
InvRat | 86.65 | 87.05 | 0.00
Methods | Len | Appearance: Dev Acc / P / R / F1 | Aroma: Dev Acc / P / R / F1 | Palate: Dev Acc / P / R / F1
Rnp | 10 | 75.20 / 13.51 / 5.75 / 8.07 | 75.30 / 30.30 / 15.26 / 20.30 | 75.00 / 28.20 / 17.24 / 21.40
3Player | 10 | 77.55 / 15.84 / 6.78 / 9.50 | 80.75 / 48.85 / 24.43 / 32.57 | 76.60 / 14.15 / 8.54 / 10.65
InvRat | 10 | 75.65 / 49.54 / 20.93 / 29.43 | 77.95 / 48.21 / 24.36 / 32.36 | 76.10 / 32.80 / 20.01 / 24.86
Rnp | 20 | 77.70 / 13.54 / 11.29 / 12.31 | 78.85 / 34.32 / 34.18 / 34.25 | 77.10 / 19.80 / 23.78 / 21.60
3Player | 20 | 82.56 / 15.63 / 13.47 / 14.47 | 82.95 / 35.73 / 35.89 / 35.81 | 79.75 / 20.73 / 24.91 / 22.63
InvRat | 20 | 81.30 / 58.03 / 49.59 / 53.48 | 81.90 / 42.72 / 42.52 / 42.62 | 80.45 / 44.04 / 52.75 / 48.00
Rnp | 30 | 81.65 / 26.26 / 33.10 / 29.29 | 83.10 / 39.97 / 60.13 / 48.02 | 78.55 / 19.18 / 33.81 / 24.47
3Player | 30 | 80.55 / 12.56 / 15.90 / 14.03 | 84.40 / 33.02 / 49.66 / 39.67 | 81.85 / 21.98 / 39.27 / 28.18
InvRat | 30 | 82.85 / 54.03 / 69.23 / 60.70 | 84.40 / 44.72 / 67.35 / 53.75 | 81.00 / 26.51 / 46.91 / 33.87
Beer: Appearance | Rationale Length: 20

into a pint glass , poured a solid black , not so much head but enough , tannish in color , decent lacing down the glass . as for aroma , if you love coffee and beer , its the best of both worlds , a very fresh strong full roast coffee blended with ( and almost overtaking ) a solid , classic stout nose , with the toasty , chocolate malts . with the taste , its even more coffee , and while its my dream come true , so delicious , what with its nice chocolate and burnt malt tones again , but i almost say it unknown any unknown , and takes away from the beeriness of this beer . which is n’t to say it is n’t delicious , because it is , just seems a bit unbalanced . oh well ! the mouth is pretty solid , a bit light but not all that unexpected with a coffee blend . its fairly smooth , not quite creamy , well carbonated , thoroughly , exceptionally drinkable . 
Beer: Aroma | Rationale Length: 20
into a pint glass , poured a solid black , not so much head but enough , tannish in color , decent lacing down the glass . as for aroma , if you love coffee and beer , its the best of both worlds , a very fresh strong full roast coffee blended with ( and almost overtaking ) a solid , classic stout nose , with the toasty , chocolate malts . with the taste , its even more coffee , and while its my dream come true , so delicious , what with its nice chocolate and burnt malt tones again , but i almost say it unknown any unknown , and takes away from the beeriness of this beer . which is n’t to say it is n’t delicious , because it is , just seems a bit unbalanced . oh well ! the mouth is pretty solid , a bit light but not all that unexpected with a coffee blend . its fairly smooth , not quite creamy , well carbonated , thoroughly , exceptionally drinkable . 
Beer: Palate | Rationale Length: 20
into a pint glass , poured a solid black , not so much head but enough , tannish in color , decent lacing down the glass . as for aroma , if you love coffee and beer , its the best of both worlds , a very fresh strong full roast coffee blended with ( and almost overtaking ) a solid , classic stout nose , with the toasty , chocolate malts . with the taste , its even more coffee , and while its my dream come true , so delicious , what with its nice chocolate and burnt malt tones again , but i almost say it unknown any unknown , and takes away from the beeriness of this beer . which is n’t to say it is n’t delicious , because it is , just seems a bit unbalanced . oh well ! the mouth is pretty solid , a bit light but not all that unexpected with a coffee blend . its fairly smooth , not quite creamy , well carbonated , thoroughly , exceptionally drinkable . 
Beer review:
We conduct both objective and subjective evaluations on the beer review dataset. We first compare the generated rationales against the human annotations and report precision, recall and F1 score in table 2. As before, the reported performances are based on the best performance on the validation set, which is also reported. We consider rationale lengths of 10, 20 and 30.
We observe that InvRat consistently surpasses the other two baselines in finding rationales that align with the human annotations, for most rationale lengths and aspects. In particular, although the best validation accuracies of the three methods vary only slightly, the improvements in finding the correct rationales are significant. For example, InvRat improves over the other two methods by more than 20 absolute percentage points in F1 for the appearance aspect. The two baseline methods fail to distinguish the true clues for the different aspects, which confirms that the MMI objective is insufficient for ruling out spurious words.
In addition, we visualize the rationales generated by our method with a preset length of 20 in figure 4. We observe that InvRat produces meaningful justifications for all three aspects. By reading the selected text alone, a human can easily predict the aspect label. To further verify that the rationales generated by InvRat align with human judgment, we conduct a subjective evaluation via Amazon Mechanical Turk. Recall that for each aspect we preserved a hold-out set of 400 examples (1,200 examples in total over the three aspects). We generate rationales of different lengths for all methods. In each subjective test, the subject is presented with the rationale of one aspect of a beer review, generated by one of the three methods (unselected words blocked), and asked to guess which aspect the rationale is discussing. We then compute the accuracy as the performance metric, shown in figure 5. Under this setting, a generator that picks spuriously correlated text will have low accuracy. As can be observed, InvRat achieves the best performance in all cases.
5 Related Work
Selective rationalization:
Selective rationalization is one of the major categories of model interpretability in machine learning.
lei2016rationalizing first propose a generator-predictor framework for rationalization. The framework is formally a cooperative game that maximizes the mutual information between the selected rationales and the labels, as shown by chen2018learning. Following this work, chen2018shapley improve the generator-predictor framework by proposing a new rationalization criterion that accounts for the combinatorial nature of the selection. yu2019rethinking point out the communication problem in cooperative learning and propose a new three-player framework to control the unselected texts. chang2019game aim to generate rationales for all possible classes instead of the target label only, which makes the model perform counterfactual reasoning. In all, these models address different challenges in generating high-quality rationales; however, they are still insufficient to distinguish the invariant words from the correlated ones.

Self-explaining models beyond selective rationalization:
Besides selective rationalization, other approaches also improve the interpretability of neural predictions. For example, module networks (andreas2016learning; andreas2016neural; johnson2017inferring) compose appropriate modules following the logical program produced by a natural language component. The restriction to a small set of predefined programs currently limits their applicability. Other lines of work include evaluating feature importance with gradient information (simonyan2013deep; li2016visualizing; sundararajan2017axiomatic) or local perturbations (kononenko2010efficient; lundberg2017unified), and interpreting deep networks by locally fitting interpretable models (ribeiro2016should; alvarez2018towards). However, these methods aim at providing post-hoc explanations of already-trained models and are not able to find invariant texts.

Learning with biases:
Our work also relates to the topic of discovering dataset-specific biases. Neural models have shown remarkable results in many NLP applications; however, these models are sometimes prone to fitting dataset-specific patterns or biases. For example, in natural language inference, such biased clues can be the word overlap between the input sentence pair (mccoy2019right) or whether the negation word "not" exists (niven2019probing). Similar observations have been made in multi-hop question answering (welbl2018constructing; min2019compositional). To learn from biased data without fully relying on the biases, lewis2018generative use generative objectives to force QA models to make use of the full question; agrawal2018don and wang2019multi propose carefully designed model architectures to capture more complex interactions between input clues beyond the biases; ramakrishnan2018overcoming and belinkov2019adversarial propose adversarial regularizations that punish internal representations cooperating well with bias-only models; clark2019don and he2019unlearn propose ensemble models that fit the residual from the prediction with bias features. However, all these works assume that the biases are known. Our work instead can rule out unwanted features without knowing the patterns a priori.
6 Conclusion
In this paper, we propose a game-theoretic approach to invariant rationalization, where the method is trained to constrain the probability of the output conditional on the rationales to be the same across multiple environments. The framework consists of three players, which competitively rule out spurious words with strong correlations to the output. We theoretically demonstrate that the proposed game-theoretic framework drives the solution towards better generalization to test scenarios whose distributions differ from training. Extensive objective and subjective evaluations on both synthetic and multi-aspect sentiment classification datasets demonstrate that InvRat performs favorably against existing algorithms in rationale generation.
References
Appendix A Proof To Theorem 1
Proof.
Partition the selected rationale into an invariant variable and a non-invariant variable:
Given an arbitrary , we construct a specific and such that
(14) 
In other words, we set these two priors such that all the non-invariant variables are uninformative of Y. Since the test adversary is allowed to choose any distribution, this set of priors is within the feasible set of the test adversary.
Under the set of priors in equation (14), the non-invariant features are not predictive of Y, and only the invariant features are predictive of Y, i.e.,
(15) 
Therefore
(16)  
where (i) is from equation (15); (ii) is from the relationship between cross-entropy and entropy; (iii) is because the invariant rationale is the minimizer of the conditional entropy of Y among all the invariant variables; (iv) is by the definition of invariant variables. Here, the subscript emphasizes that the entropy is computed under the distribution of the adversarial test environment. Therefore, if we optimize over the adversarial priors, we have the following
(17)  
where the second line is because does not depend on and . Combining equations (16) and (17), we have
(18) 
Note that the above discussion holds for all such choices. Therefore, taking the maximum over equation (18) preserves the inequality,
which implies
∎