1 Introduction
Rationales that select the most relevant parts of an input text can help explain model decisions for a range of language understanding tasks Lei et al. (2016); DeYoung et al. (2019). Models can be faithful to a rationale by only using the selected text as input for end-task prediction. However, there is almost always a tradeoff between interpretable models that extract sparse rationales and more accurate models that are able to use the full context but provide little explanation for their predictions Lei et al. (2016); Weld and Bansal (2019). In this paper, we show that it is possible to better manage this tradeoff by optimizing a novel bound on the Information Bottleneck Tishby et al. (1999) objective (Figure 1).
We follow recent work in representing rationales as binary masks over the input text Lei et al. (2016); Bastings et al. (2019). During learning, it is common to encourage sparsity by minimizing a norm on the rationale masks (e.g., $L_0$ or $L_1$) Lei et al. (2016); Bastings et al. (2019). As we will see in Section 5, it is challenging to control the sparsity-accuracy tradeoff in norm-minimization methods; we show that these methods seem to push too directly for sparsity at the expense of accuracy. Our approach, in contrast, allows more control through a prior that specifies task-specific target sparsity levels that should be met in expectation across the training set.
More specifically, we formalize the problem of inducing controlled sparsity in the mask using the Information Bottleneck (IB) principle. Our approach seeks to extract a rationale as an optimal compressed intermediate representation (the bottleneck) that is both (1) minimally informative about the original input, and (2) maximally informative about the output class. We derive a novel variational bound on the IB objective for our case where we constrain the intermediate representation to be a concise subsequence of the input, thus ensuring its interpretability.
Our model consists of an explainer that extracts a rationale from the input, and an end-task predictor that predicts the output based only on the extracted rationale. Our IB-based training objective guarantees sparsity by minimizing the Kullback–Leibler (KL) divergence between the explainer mask probability distribution and a prior distribution with controllable sparsity levels. This prior probability affords us tunable fine-grained control over sparsity, and allows us to bias the proportion of the input to be used as rationale. We show that, unlike norm-minimization methods, our KL-divergence objective is able to consistently extract rationales with the specified sparsity levels.
Across different tasks from the ERASER interpretability benchmark DeYoung et al. (2019) and the BeerAdvocate dataset McAuley et al. (2012), our IB-based sparse prior objective has significant gains over previous norm-minimization techniques: 0.5% to 5% relative improvement in task performance metrics and 6% to 80% relative improvement in agreement with human rationale annotations. Our interpretable model achieves task performance within of a model of comparable size that uses the entire input. Furthermore, we find that in the semi-supervised setting, adding a small proportion of gold rationale annotations (approximately 25% of the training examples) bridges this gap: we are able to build an interpretable model without compromising performance.
2 Method
2.1 Task and Method Overview
We assume supervised text classification or regression data that contains tuples of the form $(x, y)$. The input document $x$ can be decomposed into a sequence of sentences $x = (x_1, x_2, \dots, x_n)$. $y$ is the category, answer choice, or target value to predict. Our goal is to learn a model that not only predicts $y$, but also extracts a rationale or explanation $z$, a latent subsequence of sentences in $x$, with the following properties:

(1) Model prediction should rely entirely on $z$ and not on its complement $x \setminus z$ DeYoung et al. (2019).

(2) $z$ must be compact yet sufficient, i.e., it should contain as few sentences as possible without sacrificing the ability to correctly predict $y$.

Following Lei et al. (2016), our interpretable model learns a Boolean mask $m = (m_1, m_2, \dots, m_n)$ over the sentences in $x$, where $m_j \in \{0, 1\}$ is a discrete binary variable. To enforce (1), the masked input $z = m \odot x$ is used to predict $y$. We elaborate on how sufficiency is attained using Information Bottleneck in the following section.

2.2 Formalizing Interpretability Using Information Bottleneck
Background
The Information Bottleneck (IB) method is used to learn an optimal compression model that transmits information from a random variable $X$ to another random variable $Y$ through a compressed representation $Z$. The IB objective is to minimize the following:
(1)
where $I(\cdot, \cdot)$ is mutual information. This objective encourages $Z$ to only retain as much information about $X$ as is needed to predict $Y$. The hyperparameter $\beta$ controls the tradeoff between retaining information about either $X$ or $Y$ in $Z$. Alemi et al. (2016) derive the following variational bound on Equation 1 (footnote 2: for brevity and clarity, objectives are shown for a single data point; more details of this bound can be found in Appendix A.1 and Alemi et al. (2016)):
(2)
where $q_\phi(y|z)$ is a parametric approximation to the true likelihood $p(y|z)$; $r(z)$, the prior probability of $z$, is an approximation to $p(z)$; and $p_\theta(z|x)$ is the parametric posterior distribution over $z$.
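For orientation, the two objectives described above can be sketched in the form used by Alemi et al. (2016). The notation is an assumption: $X$, $Y$, $Z$ denote input, output, and bottleneck, $p_\theta(z|x)$ the posterior, $q_\phi(y|z)$ the likelihood approximation, and $r(z)$ the prior. The weight $\beta$ is placed on the compression term here, as in Alemi et al. (2016); its placement in Equation 1 may differ.

```latex
% IB objective (to be minimized); beta weights the compression term (assumed placement)
\mathcal{L}_{IB} = \beta \, I(X, Z) - I(Z, Y)

% Variational bound for a single data point (x, y); it upper-bounds the
% objective above up to an additive constant that does not depend on the model
\mathcal{L}_{VIB} =
  \mathbb{E}_{z \sim p_\theta(z \mid x)} \left[ -\log q_\phi(y \mid z) \right]
  + \beta \, \mathrm{KL}\left[ p_\theta(z \mid x) \,\|\, r(z) \right]
```

The first term of the bound is the task loss and the second is the information loss.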
The information loss term in Equation 2 reduces $I(X, Z)$ by decreasing the KL divergence between the posterior distribution $p_\theta(z|x)$ that depends on $x$ and a prior distribution $r(z)$ that is independent of $x$ (footnote 3: to analytically compute the KL-divergence term, the posterior and prior distributions over $z$ are typically $K$-dimensional multivariate normal distributions; compression is achieved by setting $K \ll D$, the input dimension of $X$). The task loss encourages predicting the correct label $y$ from $z$ to increase $I(Z, Y)$.

Our Variational Bound for Interpretability
The learned bottleneck representation $z$, found via Equation 2, is often not human-interpretable (see footnote 3). We consider an interpretable latent representation $z = m \odot x$, where $m$ is a Boolean mask on the input sentences in $x$. We assume that the mask variables $m_j$ over individual sentences are conditionally independent given the input $x$, i.e., $p_\theta(m|x) = \prod_j p_\theta(m_j|x)$, where $j$ indexes sentences in the input text. Because $z = m \odot x$, the posterior distribution over each $z_j$ is a mixture of Dirac-delta distributions:
$$p_\theta(z_j|x) = p_\theta(m_j = 0|x)\,\delta(0) + p_\theta(m_j = 1|x)\,\delta(x_j)$$
where $\delta(v)$ is the Dirac-delta probability distribution that is zero everywhere except at $v$.
Our prior is that the rationale needed for prediction is sparse; we encode this prior as a distribution over masks $r(m_j) = \text{Bernoulli}(\pi)$ for some constant $\pi$, which also induces a distribution on $z$ via the relationship $z = m \odot x$. In contrast to Alemi et al. (2016), our prior has no trainable parameters; instead of using an expressive $r(z)$ to approximate $p(z)$, we use a fixed prior to force the marginal of the learned distribution over $z$ to approximate the prior $r(z)$. Our parameterization of the prior and the posterior achieves compression of the input via sparsity in the latent representation, in contrast to compression via dimensionality reduction Alemi et al. (2016).
For the masked representation $z = m \odot x$, we can decompose the KL-divergence term as:
KL  
Since is a constant with respect to , it can be dropped. Hence, we obtain the following variational bound on IB with interpretability constraints, described in more detail in Appendix A.2:
(3) 
The first term is the expected cross-entropy term for the task, which can be computed by drawing samples $m \sim p_\theta(m|x)$. The information-loss term encourages the mask $m$ to be independent of $x$ by reducing the KL divergence of its posterior $p_\theta(m|x)$ from a prior $r(m)$ that is independent of $x$. However, this does not necessarily remove information about $x$ in $z$. For instance, a mask consisting of all ones is independent of $x$, but then $z = x$ and the rationale is no longer concise. In the following section, we present a simple way to avoid this degenerate case in practice.
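Putting the two terms just described together, the bound can be sketched as follows. This is a reconstruction under stated assumptions rather than the exact displayed form: the mask posterior is assumed to factorize over sentences $j$, $r(m_j)$ is the fixed prior on each mask entry, and $\beta$ weights the information-loss term.

```latex
% Expected task cross-entropy under sampled masks, plus a per-sentence KL
% between the mask posterior and the fixed prior (assumed notation)
\mathcal{L} =
  \mathbb{E}_{m \sim p_\theta(m \mid x)} \left[ -\log q_\phi(y \mid m \odot x) \right]
  + \beta \sum_{j} \mathrm{KL}\left[ p_\theta(m_j \mid x) \,\|\, r(m_j) \right]
```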
2.3 The Sparse Prior Objective
The key to ensuring that is strictly a subsequence of lies in the fact that is our prior belief about the probability of a sentence being important for prediction. For instance, if humans annotate of the input text as a rationale, we can fix our prior belief that a sentence should be a part of the mask is , setting . We refer to this prior probability hyperparameter as the sparsity threshold .
can be estimated as the expected sparsity of the mask from expert annotations. If such a statistic is not available, it can be explicitly tuned using
, for the desired tradeoff between end-task performance and rationale length.

Consequently, we can control the amount of sparsity in the mask that is eventually sampled from the learned posterior distribution by appropriately setting the prior belief $\pi$ in Equation 3. Since $\pi$ is restricted to be small, the sampled mask is generally sparse. As a result, the intermediate representation $z$, which is our human-interpretable rationale, is guaranteed to be a subsequence of $x$ and reduce $I(X, Z)$. We refer to this training objective as the sparse prior (Sparse IB) method in our experiments.
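As a minimal sketch of the information-loss term under a sparse prior, the snippet below computes the KL divergence between per-sentence Bernoulli selection probabilities and a fixed Bernoulli prior. The function names, the `prior` argument (standing in for the sparsity threshold), and the clamping constant `eps` are assumptions made for illustration, not part of the described system.

```python
import math


def bernoulli_kl(p, prior, eps=1e-8):
    """KL( Bernoulli(p) || Bernoulli(prior) ) for a single sentence."""
    # Clamp both probabilities away from 0 and 1 so every log is finite.
    p = min(max(p, eps), 1.0 - eps)
    prior = min(max(prior, eps), 1.0 - eps)
    return p * math.log(p / prior) + (1.0 - p) * math.log((1.0 - p) / (1.0 - prior))


def sparse_prior_penalty(sentence_probs, prior):
    """Sum of per-sentence KL terms against the same fixed sparse prior."""
    return sum(bernoulli_kl(p, prior) for p in sentence_probs)
```

The penalty is zero when every selection probability equals the prior and grows as the probabilities move away from it in either direction, so a mask of all ones is penalized heavily when the prior is small.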
3 Model
3.1 Architecture
Our model (Figure 2) consists of an explainer which extracts rationales from the input, and an end-task predictor which predicts the output based on the explainer rationales. In our experiments, both the explainer and the predictor are Transformers with BERT pretraining Devlin et al. (2019).

Explainer:
Given an input $x$ consisting of $n$ sentences, the explainer produces a binary mask $m$ over the input sentences, which is used to derive a rationale $z$. It maps every sentence $x_j$ to its probability $p_\theta(m_j|x)$ of being selected as part of $z$, where $p_\theta(m_j|x)$ is a binary distribution.
The explainer contextualizes the input sequence at the token level and produces sentence representations, where the representation of sentence $x_j$ is obtained by concatenating the contextualized representations of its first and last tokens. When an optional query sequence is available (footnote 4: for question answering tasks in ERASER), the query and $x$ are encoded together in one sequence, while assuming that the query is fully unmasked. A linear layer is used to transform these representations into logits (log probabilities) of a Bernoulli distribution. We choose the Bernoulli distribution since its sample can be reparameterized as described in Section 3.2, and we can analytically compute the KL-divergence term between two Bernoulli distributions. The mask $m$ is constructed by independently sampling each $m_j$ from $p_\theta(m_j|x)$. We define $z$ as the rationale representation, an elementwise product between $m$ and the corresponding sentence representations.

End-task Predictor:
The end-task predictor applies the mask from the explainer to the input sequence (footnote 5: once again, the combined query-and-input sequence is used if a query is available, i.e., we assume no masking over the query, as it is assumed to be essential to predict $y$) to predict the output variable $y$. The same attention mask is applied to all end-task transformer layers at every head to ensure prediction relies only on $z$. The predictor further consists of a log-linear classifier layer over the [CLS] token, similar to Devlin et al. (2019).

3.2 Reparameterization and Training
The sampling operation of the discrete binary variable $m_j$ in Section 3.1 is not differentiable. Lei et al. (2016) use a simple Bernoulli distribution with REINFORCE Williams (1992) to overcome non-differentiability. We found REINFORCE to be quite unstable with high variance in results. Instead, we employ reparameterization Kingma et al. (2015) to facilitate end-to-end differentiability of our approach. We use the Gumbel-Softmax reparameterization trick Jang et al. (2017) for categorical (here, binary) distributions to reparameterize the Bernoulli variables $m_j$. A random perturbation $g_j$ is added to the log probability (logit), $\log p_\theta(m_j|x)$. The reparameterized binary variable $\tilde{m}_j$ is generated as follows:
$$\tilde{m}_j = \sigma\left(\frac{\log p_\theta(m_j|x) + g_j}{\tau}\right)$$
where $\sigma$ is the Sigmoid function, $\tau$ is a hyperparameter for the temperature of the Gumbel-Softmax function, and $g_j$ is a sample from the Gumbel(0,1) distribution Gumbel (1948). $\tilde{m}_j$ is a continuous and differentiable approximation to $m_j$ with low variance.

Inference:
During inference, we extract the top $\pi$ proportion of sentences, where $\pi$ corresponds to the threshold hyperparameter described in Section 2.3. Previous work Lei et al. (2016); Bastings et al. (2019) samples from $p_\theta(m|x)$ during inference. Such an inference strategy is non-deterministic, making comparison of different masking strategies difficult. Moreover, it is possible to appropriately scale the $p_\theta(m_j|x)$ values to obtain better inference results, thereby not reflecting whether the $p_\theta(m_j|x)$ are correctly ordered. By allowing a fixed budget of $\pi$ per example, we are able to fairly judge how well the model fills the budget with the best rationales.
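The relaxed sampling used during training and the fixed-budget selection used at inference can be sketched as follows. This is a minimal illustration under stated assumptions, not the described implementation: the function names are hypothetical, the relaxation is written as a sigmoid over a Gumbel-perturbed log probability divided by a temperature, and the budget is rounded to the nearest integer with a minimum of one sentence.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) sample by inverse transform sampling."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
    return -math.log(-math.log(u))


def stable_sigmoid(a):
    """Sigmoid that avoids overflow for large negative inputs."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def relaxed_mask(log_probs, temperature, rng):
    """Training-time relaxation: sigmoid((log-probability + noise) / temperature)."""
    return [stable_sigmoid((lp + sample_gumbel(rng)) / temperature) for lp in log_probs]


def top_k_mask(probs, sparsity):
    """Inference: deterministically keep the highest-probability sentences."""
    # Assumed budget rule: nearest integer to sparsity * n, at least one sentence.
    k = max(1, round(sparsity * len(probs)))
    ranked = sorted(range(len(probs)), key=lambda j: probs[j], reverse=True)
    chosen = set(ranked[:k])
    return [1 if j in chosen else 0 for j in range(len(probs))]
```

Because `top_k_mask` depends only on the ordering of the probabilities, rescaling them does not change the selected sentences, which is the property the fixed budget is meant to provide.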
3.3 Semi-Supervised Setting
As we will show in Section 5, despite better control over the sparsity-accuracy tradeoff, there is still a gap in task performance between our unsupervised approach and a model that uses full context. To bridge this gap and better manage the tradeoff at minimal annotation cost, we experiment with a semi-supervised setting where we have annotated rationales for part of the training data.
For input example and a gold mask , we use the following semisupervised objective:
We set to simplify experiments. For examples where the rationale supervision is not available, we only consider the task loss.
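One way to read the semi-supervised setup is as a task loss plus a weighted rationale-prediction loss that is applied only when a gold mask exists. The sketch below assumes a binary cross-entropy rationale loss and a default weight of 1.0; both choices, and all function names, are illustrative assumptions rather than the stated objective.

```python
import math


def rationale_loss(sentence_probs, gold_mask, eps=1e-8):
    """Binary cross-entropy between selection probabilities and a gold 0/1 mask."""
    total = 0.0
    for p, g in zip(sentence_probs, gold_mask):
        p = min(max(p, eps), 1.0 - eps)  # clamp so every log is finite
        total -= g * math.log(p) + (1 - g) * math.log(1.0 - p)
    return total


def semi_supervised_loss(task_loss, sentence_probs, gold_mask=None, weight=1.0):
    """Add the rationale loss only for examples that have a gold mask."""
    if gold_mask is None:
        return task_loss  # unsupervised example: task loss only
    return task_loss + weight * rationale_loss(sentence_probs, gold_mask)
```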
4 Experimental Setup
4.1 End Tasks
We evaluate performance on five text classification tasks from the ERASER benchmark DeYoung et al. (2019) and one regression task used in previous work Lei et al. (2016). All these datasets have sentence-level rationale annotations for validation and test sets. Additionally, the ERASER tasks contain rationale annotations for the training set, which we only use for our semi-supervised experiments.


Movies Pang and Lee (2004): Sentiment classification of movie reviews from IMDb.

FEVER Thorne et al. (2018): A fact extraction and verification task adapted in ERASER as a binary classification of the given evidence supporting or refuting a given claim.

MultiRC Khashabi et al. (2018): A reading comprehension task with multiple correct answers modified into a binary classification task for ERASER, where each (rationale, question, answer) triplet has a true/false label.

BoolQ Clark et al. (2019): A Boolean (yes/no) question answering dataset over Wikipedia articles. Since most documents are considerably longer than BERT’s maximum context window length of 512 tokens, we use a sliding window to select a single document span that has the maximum TF-IDF score against the question.

Evidence Inference Lehman et al. (2019): A three-way classification task over full-text scientific articles for inferring whether a given medical intervention is reported to either significantly increase, significantly decrease, or have no significant effect on a specified outcome compared to a comparator of interest. We again apply the TF-IDF heuristic.

BeerAdvocate McAuley et al. (2012): The BeerAdvocate regression task for predicting 0–5 star ratings for multiple aspects like appearance, smell, and taste based on reviews. We report on the appearance aspect.
4.2 Evaluation Metrics
We adopt the metrics proposed for the ERASER benchmark to evaluate both agreement with comprehensive human rationales as well as end-task performance. To evaluate quality of rationales, we report the token-level Intersection-Over-Union F1 (IOU F1), which is a relaxed measure for comparing two sets of text spans. For task accuracy, we report weighted F1 for classification tasks, and the mean square error for the BeerAdvocate regression task.
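To make the relaxed span comparison concrete, the sketch below computes a token-level IOU between two spans and an F1 in which a span counts as matched when its IOU with some span on the other side reaches a threshold. The 0.5 threshold, the representation of spans as collections of token indices, and the function names are assumptions for illustration; the benchmark's own scorer should be used for reported numbers.

```python
def token_iou(pred_tokens, gold_tokens):
    """Intersection-over-union of two spans given as token-index collections."""
    pred, gold = set(pred_tokens), set(gold_tokens)
    union = pred | gold
    return len(pred & gold) / len(union) if union else 0.0


def iou_f1(pred_spans, gold_spans, threshold=0.5):
    """F1 over spans, where a match means IOU >= threshold with some other-side span."""
    if not pred_spans or not gold_spans:
        return 0.0
    matched_pred = sum(
        any(token_iou(p, g) >= threshold for g in gold_spans) for p in pred_spans
    )
    matched_gold = sum(
        any(token_iou(p, g) >= threshold for p in pred_spans) for g in gold_spans
    )
    precision = matched_pred / len(pred_spans)
    recall = matched_gold / len(gold_spans)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, a predicted span covering tokens 0 to 9 against a gold span covering tokens 0 to 7 has an IOU of 0.8 and therefore counts as a match under the assumed threshold.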
Table 1: Task performance (Task; weighted F1 for classification, mean square error for BeerAdvocate) and rationale agreement (IOU F1) by approach and dataset.

Approach             FEVER         MultiRC       Movies        BoolQ         Evidence      BeerAdvocate
                     Task   IOU    Task   IOU    Task   IOU    Task   IOU    Task   IOU    Task   IOU
1. Full              89.5   36.2   66.8   29.2   91.0   47.3   65.6   15.0   52.1   9.7    .015   37.8
2. Gold              91.8   -      76.6   -      97.0   -      85.9   -      71.7   -      -      -
Unsupervised
3. Task Only         82.8   35.5   60.1   20.8   78.2   37.9   62.5   10.9   43.0   9.0    .018   47.3
4. Sparse Norm       83.1   44.0   59.7   20.4   78.6   34.7   62.5   7.1    38.9   6.3    .017   35.5
5. Sparse Norm-C     83.3   44.9   61.7   21.7   81.9   34.4   63.7   9.1    44.7   8.0    .018   49.0
6. Sparse IB (Us)    84.7   45.5   62.1   24.3   84.0   39.6   65.2   16.5   46.3   10.0   .017   52.3
Supervised
7. Pipeline          85.0   81.7   62.5   40.9   82.4   15.7   62.3   32.5   70.8   53.9   -      -
8. 25% data (Us)     88.8   66.6   66.4   54.4   85.4   43.4   63.4   32.3   46.7   13.3   -      -
4.3 Baselines
Norm Minimization (Sparse Norm):
Controlled Norm Minimization (Sparse Norm-C):
For fair comparison against our approach for controlled sparsity, we can modify Equation 4 to ensure that the norm of $m$ is not penalized when it drops below the threshold $\pi$.
(5)
Explicit control over sparsity in the mask through the tunable prior probability $\pi$ naturally emerges from IB theory in our Sparse IB approach, as opposed to the modification adopted in norm-based regularization (Equation 5).
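The difference between the two norm-based baselines can be illustrated with a hinge on the mask norm: the plain penalty always pushes the mask toward zero, while the controlled variant stops penalizing once the norm falls below the threshold. The use of the mean mask value as the norm, the hinge form, and the function names are assumptions for this sketch, not the exact regularizers.

```python
def norm_penalty(mask_values):
    """Plain sparsity penalty: mean mask value, always positive for a non-empty mask."""
    return sum(mask_values) / len(mask_values)


def controlled_norm_penalty(mask_values, threshold):
    """Hinged penalty: zero once the mean mask value is at or below the threshold."""
    return max(0.0, norm_penalty(mask_values) - threshold)
```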
No Sparsity (Task Only):
This method only optimizes for the end-task performance without any sparsity-inducing loss term, and serves as a common baseline for evaluating the effect of sparsity-inducing objectives in Sparse IB, Sparse Norm, and Sparse Norm-C.
In addition to the above baselines, we also compare to models that don’t predict rationales:
Full Context (Full):
This method uses the entire context to make predictions, and allows us to estimate the loss in performance as a result of our interpretable hard attention model that only uses of the input.

Gold Rationale (Gold):
For datasets with human-annotated rationales available at training time, we train a model that uses these rationale annotations for training and inference to estimate an upper bound that can be achieved on task and rationale performance metrics with transformers.
4.4 Implementation Details
We use BERT-base with a maximum context length of 512 to instantiate the explainer and end-task predictor. Models are tuned based on their performance on the rationale IOU F1, as it is available for all the datasets considered in this work. When rationale IOU F1 is not available, the sparsity-accuracy tradeoff (Figure 4) can be used to determine an operating point. We tune the prior probability/threshold $\pi$ to ensure strictly concise rationales. For our Sparse IB approach, we observe less sensitivity to the hyperparameter $\beta$ and set it to 1 to simplify experimental design. For baselines, we tune the values of the Lagrangian multipliers $\lambda$, as norm-based techniques are more sensitive to $\lambda$. More details about the hyperparameters and model selection are presented in Appendix B.
5 Results
Table 1 compares our Sparse IB approach against baselines described in Section 4.3. Our Sparse IB approach outperforms norm-minimization approaches (rows 4–6) in both agreement with human rationales and task performance across all tasks. We perform particularly well on rationale extraction with relative improvements ranging from 5 to 80% over the better performing norm-minimization variant Sparse Norm-C. Sparse IB also attains task performance within of the full-context model (row 1), despite using of the input sentences on average. All unsupervised approaches still obtain a lower IOU F1 compared to the full-context model for the Movies and MultiRC datasets, primarily due to their considerably lower precision on these benchmarks.
Our results also highlight the importance of explicit controlled sparsity-inducing terms as essential inductive biases for improved task performance and rationale agreement. Specifically, sparsity-inducing methods consistently outperform the Task Only baseline (row 3). One way to interpret this result is that sparsity objectives add input-dimension regularization during training, which results in better generalization during inference. Moreover, Sparse Norm-C, which adds the element of control to norm-minimization, performs considerably better than Sparse Norm. Finally, we observe a positive correlation between task performance and agreement with human rationales. This is important since accurate models that also better emulate human rationalization likely engender more trust.
Semi-supervised Setting
In order to close the performance gap with the full-context model, we also experiment with a setup where we minimize the task and the rationale prediction loss using rationale annotations available for a part of the training data (Section 3.3).
Figure 3 shows the effect of incorporating an increasing proportion of rationale annotation supervision for the FEVER and MultiRC datasets. Our semi-supervised model is even able to match the performance of the full-context models for both FEVER and MultiRC with only 25% of rationale annotation supervision. Furthermore, Figure 3 also shows that these gains can be achieved with relatively modest annotation costs, since adding more rationale supervision to the training data seems to have diminishing returns.
Table 1 compares our interpretable model (row 8), which uses rationale supervision for 25% of the training data, with the full-context model and Lehman et al. (2019)’s fully-supervised pipeline approach (row 7). Lehman et al. (2019) learn an explainer and a task predictor independently in sequence, using the output of the explainer for inference in the predictor. On three (FEVER, MultiRC, and BoolQ) out of five datasets for which rationale supervision is available, our interpretable models match the end-task performance of the full-context models while recording large gains in IOU (17–30 F1 absolute). Our approach outperforms the pipeline-based approach in task performance (for FEVER, MultiRC, Movies, and BoolQ) and IOU (for MultiRC and Movies). These gains may result from better exploration due to sampling and inference based on a fixed budget of sentences. Our weakest results are on Evidence Inference, where the TF-IDF preprocessing often fails to select relevant rationale spans (footnote 6: selected document spans have gold rationales 51.8% of the time). Our overall results suggest that the end-task model attention can be supervised with a small proportion of human annotations to be more interpretable.
5.1 Analysis
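Table 2: Mean and variance of the sparsity of rationales extracted at inference, by dataset and method.

Dataset    Prior sparsity   Sparse Norm-C     Sparse IB
                            Mean    Var       Mean    Var
FEVER      0.10             0.17    0.01      0.21    0.03
MultiRC    0.25             0.11    1.15      0.26    1.70
Movies     0.40             0.38    0.01      0.42    0.02
BoolQ      0.20             0.04    0.02      0.22    0.04
Evidence   0.20             0.10    0.04      0.20    0.05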
Examples from Error Analysis 
Prediction:Positive 
Ground Truth:Negative 
The original Babe gets my vote as the best family film since the princess bride, and it’s sequel has been getting rave reviews 
from most internet critics, both Siskel and Ebert sighting it more than a month ago as one of the year’s finest films. So, 
naturally, when I entered the screening room that was to be showing the movie and there was nary another viewer to be found, 
this notion left me puzzled. It is a rare thing for a children’s movie to be praised this highly, so wouldn’t you think 
that every parent in the entire city would be flocking with their kids to see this supposedly “magical” piece of work? 
Looking back, I should have taken the hint and left right when I entered the theater. Believe me; I wanted to like Babe: Pig 
in the City. The plot seemed interesting enough; It is here that we meet an array of eccentric characters, the most 
memorable being the family of chimps led by Steven Wright. Here is where the film took a wrong turn unfortunately, the 
story wears thin as we are introduced to a new set of animals that the main topic of discussion it just didn’t feel right 
and was more painful to watch than it was funny or entertaining, and the same goes for the rest of the movie. 
Statement: Unforced labor is a reason for human trafficking.
Prediction: SUPPORTS
Ground Truth: REFUTES
DOC: Human trafficking is the trade of humans, most commonly for the purpose of forced labour, sexual slavery, or
commercial sexual exploitation for the trafficker or others. This may encompass providing a spouse in the context of forced
marriage, or the extraction of organs or tissues, including for surrogacy and ova removal. Human trafficking can occur
within a country or transnationally. Human trafficking is a crime against the person because of the violation of the victim’s
rights of movement through coercion and because of their commercial exploitation. In 2012, the I.L.O. estimated that
21 million victims are trapped in modern-day slavery.
Statement: Atlanta metropolitan area is located in south Georgia.
Prediction: SUPPORTS
Ground Truth: REFUTES
DOC: Metro Atlanta, designated by the United States Office of Management and Budget as the
Atlanta-Sandy Springs-Roswell, GA Metropolitan Statistical Area, is the most populous metro area in the US state of Georgia and the ninth-largest
metropolitan statistical area (MSA) in the United States. Its economic, cultural and demographic center is Atlanta, and
it had a 2015 estimated population of 5.7 million people according to the U.S. Census Bureau. The metro area forms the core
of a broader trading area, the Atlanta – Athens-Clarke – Sandy Springs Combined Statistical Area.
The Combined Statistical Area spans up to 39 counties in north Georgia and had an estimated 2015 population of 6.3 million 
people. Atlanta is considered an “ alpha world city ”. It is the third largest metropolitan region in the Census Bureau’s 
Southeast region behind Greater Washington and South Florida. 
Accurate Sparsity Control
Table 2 compares average sparsity rates in rationales extracted by Sparse IB with those extracted by norm-minimization methods. We measure the sparsity achieved by the explainer during inference by computing the average number of one entries in the input mask over sentences (the Hamming weight) for 100 runs. Our Sparse IB approach consistently achieves the sparsity level used in the prior, while the norm-minimization approach (Sparse Norm-C) converges to a lower average sparsity for the mask.
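The sparsity measurement described above can be sketched as the fraction of ones in each mask, summarized by its mean and variance across masks. The function names and the use of population variance are assumptions made for this illustration.

```python
from statistics import mean, pvariance


def mask_sparsity(mask):
    """Fraction of selected sentences: Hamming weight divided by mask length."""
    return sum(mask) / len(mask)


def sparsity_summary(masks):
    """Mean and population variance of the per-mask sparsity rates."""
    rates = [mask_sparsity(m) for m in masks]
    return mean(rates), pvariance(rates)
```

A method that hits its target sparsity should produce a mean close to the prior, while a higher variance indicates that the selected fraction adapts more from one example to another.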
Sparsity-Accuracy Tradeoff
Figure 4 shows the variation in task and rationale agreement performance as a function of the sparsity rate for Sparse IB and Sparse Norm-C on the FEVER dataset. Both methods extract longer rationales with increasing value of $\pi$, which results in a decrease in agreement with sparse human rationales, while model accuracy improves. However, Sparse IB consistently outperforms Sparse Norm-C in terms of task performance.
In summary, our analysis indicates that unlike norm-minimization methods, our KL-divergence objective is able to consistently extract rationales with the specified sparsity rates, and achieves a better tradeoff with accuracy. We hypothesize that optimizing the KL-divergence of the posterior may be able to model input salience better than an implicit regularization (through the norm of the mask). The sparse prior term can learn adaptive to different examples, while the norm encourages uniform sparsity across examples (footnote 7: unlike the norm, the derivative of the KL-divergence term is proportional to ). This can be seen explicitly in Table 2, where the variance in sampled mask across examples is higher for our objective.
Error Analysis
A qualitative analysis of the rationales extracted by the Sparse IB approach indicates that such methods struggle when the context offers spurious (or in some cases even genuine but limited) evidence for both output labels (Figure 3). For instance, the model makes an incorrect positive prediction for the first example from the Movies sentiment dataset based on sentences that: (a) praise the prequel of the movie, (b) still acknowledge some critical acclaim, and (c) sarcastically describe the movie as magical. We also observed incorrect predictions based on shallow lexical matching (likely equating forced and unforced in the second example) and world knowledge (likely equating south Georgia, southeastern United States, and South Florida in the third). Overall, there is scope for improvement through better incorporation of exact lexical match, coreference propagation, and representation of pragmatics in our sentence representations.
6 Related Work
Interpretability
Previous work on explaining model predictions can be broadly categorized into post hoc explanation methods and methods that integrate explanations into the model architecture. Post hoc explanation techniques Ribeiro et al. (2016); Krause et al. (2017); Alvarez-Melis and Jaakkola (2017) typically approximate complex decision boundaries with locally linear or low complexity models. While post hoc explanations often have the advantage of being simpler, they are not faithful by construction.
On the other hand, methods that condition predictions on their explanations can be more trustworthy. Extractive rationalization Lei et al. (2016) is one of the most well-studied of such methods in NLP, and has received increased attention with the recently released ERASER benchmark DeYoung et al. (2019). Building on Lei et al. (2016), Chang et al. (2019) and Yu et al. (2019) consider benefits like class-wise explanation extraction, while Chang et al. (2020) explore invariance to domain shift. Bastings et al. (2019) employ a reparameterizable version of the bimodal beta distribution (instead of Bernoulli) for the binary mask. This more expressive distribution may be able to complement our approach, as KL-divergence for it can be analytically computed Nalisnick and Smyth (2017).
While many methods for extractive rationalization, including ours, have focused on unsupervised settings due to the considerable cost of obtaining reliable annotations, recent work Lehman et al. (2019) has also attempted to use direct supervision from rationale annotations for critical medical domain tasks. Finally, Latcinnik and Berant (2020) and Rajani et al. (2019) focus on generating explanations (rather than extracting them from the input), since the extractive paradigm could be unfavourable for certain tasks like common sense question answering where the given input provides limited context for the task.
Information Bottleneck
The Information Bottleneck (IB) principle Tishby et al. (1999) has recently been adapted in a number of downstream applications like parsing Li and Eisner (2019), summarization West et al. (2019), and image classification Alemi et al. (2016); Zhmoginov et al. (2019). Alemi et al. (2016) and Li and Eisner (2019) use IB for optimal compression of hidden representations of images and words respectively. We are interested in compressing the number of cognitive units (like sentences) to ensure interpretability of the bottleneck representation. Our work is more similar to West et al. (2019) in that the input (words) is compressed rather than the embedding dimension. However, while West et al. (2019) use brute-force search to optimize IB for summarization, we directly optimize a parametric variational bound on IB for rationales.
IB has also been previously used for interpretability: Zhmoginov et al. (2019) use a VAE to estimate the prior and posterior distributions over the intermediate representation for image classification. Bang et al. (2019) use IB for post hoc explanation for sentiment classification. They do not enforce a sparse prior, and as a result, cannot guarantee that the rationale is strictly smaller than the input. This also means controlled sparsity, which we have shown to be crucial for task performance and rationale extraction, is harder to achieve in their model.
7 Conclusion
We propose a new sparsity objective derived from the Information Bottleneck principle to extract rationales of desired conciseness. Our approach outperforms existing norm-minimization techniques in task performance and agreement with human annotations for rationales for tasks in the ERASER benchmark. The sparse prior objective also allows for straightforward and accurate control of the amount of sparsity desired in the rationales. We also obtain a better tradeoff of accuracy vs. sparsity using our objective. With rationale annotations, we are able to close the gap with models that use the full input for a majority of the tasks. In future work, we would like to explore the application of our approach on longer contexts and tasks such as document-level QA.
References
 Deep variational information bottleneck. arXiv preprint arXiv:1612.00410. Cited by: §A.1, §A.2, Appendix A, §2.2, §2.2, §6, footnote 2.

 A causal framework for explaining the predictions of black-box sequence-to-sequence models. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 412–421. Cited by: §6.
 Explaining a black-box using deep variational information bottleneck approach. arXiv preprint arXiv:1902.06918. Cited by: §6.
 Interpretable neural predictions with differentiable binary variables. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2963–2977. Cited by: Appendix B, §1, §3.2, §4.3, §6.
 Invariant rationalization. arXiv preprint arXiv:2003.09772. Cited by: §6.
 A game theoretic approach to class-wise selective rationalization. In Advances in Neural Information Processing Systems, pp. 10055–10065. Cited by: §6.
 BoolQ: exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044. Cited by: 4th item.
 BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: §3.1, §3.1.
 ERASER: a benchmark to evaluate rationalized NLP models. arXiv preprint arXiv:1911.03429. Cited by: Appendix B, §1, §1, item 1, §4.1, Table 1, §6.
 Statistical theory of extreme values and some practical applications: a series of lectures. Vol. 33, US Government Printing Office. Cited by: §3.2.
 Categorical reparameterization with Gumbel-Softmax. In International Conference on Learning Representations (ICLR 2017). Cited by: §3.2.
 Looking beyond the surface: a challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 252–262. Cited by: 3rd item.
 Variational dropout and the local reparameterization trick. In Advances in neural information processing systems, pp. 2575–2583. Cited by: §3.2.
 A workflow for visual diagnostics of binary classifiers using instance-level explanations. In 2017 IEEE Conference on Visual Analytics Science and Technology (VAST), pp. 162–172. Cited by: §6.

 Explaining question answering models through text generation. arXiv preprint arXiv:2004.05569. Cited by: §6.
 Inferring which medical treatments work from reports of clinical trials. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 3705–3717. Cited by: 5th item, §5, §6.
 Rationalizing neural predictions. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 107–117. Cited by: §1, §1, §2.1, §3.2, §3.2, §4.1, §4.3, §6.
 Specializing word embeddings (for parsing) by information bottleneck. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2744–2754. Cited by: §6.
 Learning attitudes and attributes from multi-aspect reviews. In 2012 IEEE 12th International Conference on Data Mining, pp. 1020–1025. Cited by: §1, 6th item.

 Stick-breaking variational autoencoders. In International Conference on Learning Representations (ICLR). Cited by: §6.
 A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd annual meeting on Association for Computational Linguistics, pp. 271. Cited by: 1st item.
 Explain yourself! leveraging language models for commonsense reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4932–4942. Cited by: §6.
 "Why should I trust you?": explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1135–1144. Cited by: §6.
 FEVER: a large-scale dataset for fact extraction and verification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 809–819. Cited by: 2nd item.
 The information bottleneck method. In Proc. of the 37th Annual Allerton Conference on Communication, Control and Computing, pp. 368–377. Cited by: §1, §6.
 The challenge of crafting intelligible intelligence. Communications of the ACM 62 (6), pp. 70–79. Cited by: §1.
 BottleSum: unsupervised and self-supervised sentence summarization using the information bottleneck principle. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3743–3752. Cited by: §6.
 Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (3-4), pp. 229–256. Cited by: §3.2.
 Rethinking cooperative rationalization: introspective extraction and complement control. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4085–4094. Cited by: §6.
 Information-bottleneck approach to salient region discovery. arXiv preprint arXiv:1907.09578. Cited by: §6, §6.
Appendix A Information Bottleneck Theory
We first present an overview of the variational bound on IB introduced by Alemi et al. (2016) and then derive a modified version amenable to interpretability.
A.1 Variational Information Bottleneck (Alemi et al., 2016)
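The objective is to parameterize the information bottleneck objective
$$\mathcal{L}_{IB} = \beta\, I(X; Z) - I(Z; Y)$$
using neural models and use SGD to optimize. Consider the joint distribution:
$$p(X, Y, Z) = p(Z \mid X, Y)\, p(Y \mid X)\, p(X) = p(Z \mid X)\, p(Y \mid X)\, p(X)$$
under the Markov chain $Y \leftrightarrow X \leftrightarrow Z$. As mutual information is hard to compute, the following bounds are derived on both MI terms:
First Term:
$$I(X; Z) = \int p(x, z) \log \frac{p_\theta(z \mid x)}{p(z)}\, dx\, dz$$
where,
$$p(z) = \int p_\theta(z \mid x)\, p(x)\, dx$$
This marginal is intractable. Let $r(z)$ be a variational approximation to this marginal. Since $\mathrm{KL}[p(z) \,\|\, r(z)] \geq 0$,
$$I(X; Z) \leq \int p(x)\, p_\theta(z \mid x) \log \frac{p_\theta(z \mid x)}{r(z)}\, dx\, dz$$
If $p_\theta(z \mid x)$ and $r(z)$ are of a form that KL divergence can be analytically computed, we get:
$$I(X; Z) \leq \mathbb{E}_{x}\left[\mathrm{KL}[p_\theta(z \mid x) \,\|\, r(z)]\right]$$
Typically, the distributions $p_\theta(z \mid x)$ and $r(z)$ are instantiated as multivariate Normal distributions to analytically compute the KL-divergence term.
$$p_\theta(z \mid x) = \mathcal{N}\left(z \mid f^{\mu}_\theta(x), f^{\Sigma}_\theta(x)\right)$$
where $f^{\mu}_\theta$ is a neural network which outputs the $K$-dimensional mean of $z$ and $f^{\Sigma}_\theta$ outputs the covariance matrix $\Sigma$. This also allows us to reparameterize samples drawn from $p_\theta(z \mid x)$.
Second Term:
$$I(Z; Y) = \int p(y, z) \log \frac{p(y \mid z)}{p(y)}\, dy\, dz$$
where,
$$p(y \mid z) = \int \frac{p(y \mid x)\, p_\theta(z \mid x)\, p(x)}{p(z)}\, dx$$
Again, as this is intractable, $q_\phi(y \mid z)$ is used as a variational approximation to $p(y \mid z)$ and is instantiated as a transformer model with its own set of parameters $\phi$. As Kullback-Leibler divergence is always non-negative:
$$I(Z; Y) \geq \int p(y, z) \log \frac{q_\phi(y \mid z)}{p(y)}\, dy\, dz = \int p(y, z) \log q_\phi(y \mid z)\, dy\, dz + H(Y)$$
The term $H(Y)$ can be dropped as it is constant with respect to parameters $\theta$ and $\phi$. Thus, we minimize
$$\mathbb{E}_{x, y}\left[\mathbb{E}_{z \sim p_\theta(z \mid x)}\left[-\log q_\phi(y \mid z)\right]\right]$$
Thus the IB objective is bounded by the loss function:
$$\mathcal{L}_{VIB} = \mathbb{E}_{x, y}\left[\mathbb{E}_{z \sim p_\theta(z \mid x)}\left[-\log q_\phi(y \mid z)\right] + \beta\, \mathrm{KL}[p_\theta(z \mid x) \,\|\, r(z)]\right]$$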
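As a minimal sketch of how such a bound can be evaluated for a single example, the snippet below adds a closed-form KL term to a cross-entropy term. It assumes a diagonal Gaussian encoder and a standard normal prior $\mathcal{N}(0, I)$, which is a common instantiation rather than something fixed by the derivation; the function names and the `beta` argument are illustrative.

```python
import math


def gaussian_kl_to_standard_normal(mu, sigma):
    """KL[N(mu, diag(sigma^2)) || N(0, I)] in nats, in closed form.

    Assumption: diagonal covariance and a standard normal prior.
    """
    return sum(0.5 * (s * s + m * m - 1.0) - math.log(s) for m, s in zip(mu, sigma))


def cross_entropy(probs, label):
    """-log q(y | z) for a categorical predictor that outputs class probabilities."""
    return -math.log(probs[label])


def vib_loss(mu, sigma, probs, label, beta):
    """Single-example bound: task cross-entropy plus beta-weighted KL term."""
    return cross_entropy(probs, label) + beta * gaussian_kl_to_standard_normal(mu, sigma)
```

When the encoder output equals the prior (zero mean, unit variance), the KL term vanishes and the loss reduces to the cross-entropy of the predictor.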
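| Hyperparameter | Movie | FEVER | MultiRC | BoolQ | Evidence Inference | BEER |
|---|---|---|---|---|---|---|
| Num. Sentences | 36 | 10 | 15 | 25 | 20 | 10 |
| $\pi$ (Sparsity threshold (%)) | 40 | 10 | 25 | 20 | 20 | 20 |
| $\beta$ (weight on SR) | 0.5 | 0.05 | 1.00E-04 | 0.01 | 0.001 | 0.01 |

| Approach | FEVER Task | FEVER IOU | MultiRC Task | MultiRC IOU | Movies Task | Movies IOU | BoolQ Task | BoolQ IOU | Evidence Task | Evidence IOU |
|---|---|---|---|---|---|---|---|---|---|---|
| Full | 90.54 | - | 68.18 | - | 88.0 | - | 63.16 | - | 47.51 | - |
| Gold | 92.52 | - | 78.20 | - | 1.0 | - | 71.65 | - | 85.39 | - |
| Task Only | 83.01 | 35.50 | 59.17 | 22.42 | 81.46 | 20.63 | 61.82 | 10.39 | 47.51 | 9.87 |
| Sparse Norm | 84.30 | 45.44 | 58.40 | 20.41 | 79.35 | 19.23 | 59.04 | 12.40 | 44.52 | 9.4 |
| Sparse Norm-C | 84.42 | 44.90 | 60.77 | 23.25 | 82.43 | 18.91 | 62.24 | 9.72 | 49.67 | 9.40 |
| Sparse IB | 85.64 | 45.46 | 61.11 | 25.55 | 86.50 | 22.33 | 62.07 | 16.63 | 49.09 | 11.09 |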
A.2 Deriving the sparse prior objective
The latent space learned in Appendix A.1 is not easy to interpret. Instead we consider a masked representation of the form $z = m \odot x$, where $m$ is a binary mask sampled from a distribution $p_\theta(m \mid x)$. This is an adaptive masking strategy, defined by data-driven relevance estimators $p_\theta(m_j \mid x)$. The distributions over $m$ and $x$ induce a distribution on $z$ defined by the conditionals
$$p_\theta(z_j \mid x) = p_\theta(m_j = 1 \mid x)\, \delta(z_j - x_j) + p_\theta(m_j = 0 \mid x)\, \delta(z_j)$$
Our prior, based on human annotations, is that the rationale needed for a prediction is sparse; we encode this prior as a distribution over masks $r(m)$. The prior also induces a distribution on $z$ given by
$$r(z_j) = r(m_j = 1)\, p_{x_j}(z_j) + r(m_j = 0)\, \delta(z_j)$$
where $p_{x_j}$ denotes the marginal distribution of $x_j$.
We want to enforce a constraint $p_\theta(z) = r(z)$; i.e. that the marginal distribution $p_\theta(z)$ matches our prior $r(z)$. This is difficult to do directly, but as in Appendix A.1, we can construct an upper bound on the mutual information between $X$ and $Z$:
$$I(X; Z) \leq \mathbb{E}_{x}\left[\mathrm{KL}[p_\theta(z \mid x) \,\|\, r(z)]\right]$$
The inequality is tight if $r(z) = p_\theta(z)$. By optimizing $\theta$ to minimize this bound on the mutual information $I(X; Z)$, we will implicitly learn parameters that approximate the desired constraint on the marginal.
In contrast to Alemi et al. (2016), our prior has no parameters; rather than using an expressive model $r(z)$ to approximate the marginal $p_\theta(z)$, we instead use the fixed prior to force the learned conditionals $p_\theta(z \mid x)$ to assume a form such that the marginal $p_\theta(z)$ approximately matches the marginal of the prior. Average mask sparsity values in Table 2 corroborate this.
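By a limiting argument, we can compute the divergence between $p_\theta(z_j \mid x)$ and $r(z_j)$, where $m_j$ is the mask on the $j$-th input unit $x_j$:
$$\mathrm{KL}[p_\theta(z_j \mid x) \,\|\, r(z_j)] = \mathrm{KL}[p_\theta(m_j \mid x) \,\|\, r(m_j)] - p_\theta(m_j = 1 \mid x) \log p(x_j)$$
The term $\mathrm{KL}[p_\theta(m_j \mid x) \,\|\, r(m_j)]$ is a divergence between two Bernoulli distributions and has a simple closed form. If $m_j$ and $x_j$ are uncorrelated then
$$\mathbb{E}_{x}\left[-p_\theta(m_j = 1 \mid x) \log p(x_j)\right] = \mathbb{E}_{x}\left[p_\theta(m_j = 1 \mid x)\right] H(x_j)$$
The term $H(x_j)$ is constant with respect to the parameters $\theta$ and can be dropped.
We use the same, standard cross-entropy bound discussed in Appendix A.1 to estimate $I(Z; Y)$, leading us to our variational bound on IB with interpretability constraints
$$\mathcal{L} = \mathbb{E}_{x, y}\left[\mathbb{E}_{m \sim p_\theta(m \mid x)}\left[-\log q_\phi(y \mid m \odot x)\right] + \beta \sum_j \mathrm{KL}[p_\theta(m_j \mid x) \,\|\, r(m_j)]\right]$$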
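The Bernoulli divergence mentioned above can be written out directly. The following is a minimal sketch, assuming each mask variable has keep-probability `p` under the explainer and a shared prior keep-probability `pi`; the names `bernoulli_kl`, `sparse_ib_loss`, `pi` and `beta`, and the choice to sum the KL over mask positions, are illustrative assumptions rather than the exact implementation.

```python
import math


def bernoulli_kl(p, pi, eps=1e-12):
    """KL[Bernoulli(p) || Bernoulli(pi)] in nats.

    Both probabilities are clamped away from 0 and 1 for numerical safety.
    """
    p = min(max(p, eps), 1.0 - eps)
    pi = min(max(pi, eps), 1.0 - eps)
    return p * math.log(p / pi) + (1.0 - p) * math.log((1.0 - p) / (1.0 - pi))


def sparse_ib_loss(mask_probs, task_nll, pi, beta):
    """Task negative log-likelihood plus beta-weighted KL to the sparse prior.

    mask_probs: keep-probabilities, one per input unit (e.g. sentence).
    task_nll:   -log q(y | m * x) computed elsewhere by the predictor.
    """
    kl = sum(bernoulli_kl(p, pi) for p in mask_probs)
    return task_nll + beta * kl
```

The KL term is zero exactly when every keep-probability equals the prior value, so the penalty pushes the expected mask density toward `pi` rather than toward zero.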
Appendix B Experimental Details
Hyperparameters
We use a sequence length of 512, a batch size of 16, and the Adam optimizer with a learning rate of 5e-5. We do not use warmup or weight decay. Hyperparameter tuning is done on the validation set for the rationale performance metric (IOU F1) for ERASER tasks and on the test set for BEER (only the test set contains rationale annotations). Instead of explicitly tuning or annealing the Gumbel softmax parameter, we fix it to 0.7 across all our experiments. We found that the Sparse IB approach is not as sensitive to the parameter and fix it to 1 to simplify experimental design. Hyperparameters for each dataset used for the final results are presented in Table 4.
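For illustration, a relaxed binary mask sample with a fixed Gumbel softmax parameter can be sketched as follows. This assumes that the fixed value 0.7 is the relaxation temperature and that the mask is drawn from a two-class (keep/drop) Gumbel softmax; the function names, the clamping constants and the `rng` argument are hypothetical.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) sample via the inverse-CDF transform."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # avoid log(0)
    return -math.log(-math.log(u))


def relaxed_binary_mask(p_keep, temperature=0.7, rng=random):
    """Relaxed sample in [0, 1] for one mask variable with keep-probability p_keep.

    A two-class Gumbel softmax over (keep, drop) reduces to a sigmoid of the
    perturbed logit difference divided by the temperature.
    """
    p_keep = min(max(p_keep, 1e-6), 1.0 - 1e-6)
    logit_keep = math.log(p_keep) + sample_gumbel(rng)
    logit_drop = math.log(1.0 - p_keep) + sample_gumbel(rng)
    return 1.0 / (1.0 + math.exp(-(logit_keep - logit_drop) / temperature))
```

Lower temperatures push samples closer to 0 or 1, while higher temperatures give smoother values; keeping the temperature fixed removes one quantity to tune.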
Data Processing
For ERASER tasks, we use the preprocessed training and validation sets from DeYoung et al. (2019), and for BEER we use the data preprocessed for the appearance aspect in Bastings et al. (2019). For BoolQ and Evidence Inference, we use a sliding window of 20 sentences (with a step of 5) over the document to find the span that has maximum TF-IDF overlap with the query.
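The window selection step can be sketched as below. The tokenization (lowercased whitespace splitting), the smoothed IDF computed over the sentences of the document, and the overlap score (sum of TF-IDF weights of query terms in the window) are all assumptions made for illustration; only the window size of 20 and the step of 5 come from the description above.

```python
import math
from collections import Counter


def tokenize(text):
    """Hypothetical tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def best_window(sentences, query, window=20, step=5):
    """Return (start, end) sentence indices of the window that best matches the query."""
    docs = [Counter(tokenize(s)) for s in sentences]
    n = len(docs)
    # Sentence-level document frequencies with a smoothed IDF (an assumption).
    df = Counter(tok for d in docs for tok in d)
    idf = {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}
    query_tokens = set(tokenize(query))

    def score(start):
        counts = Counter()
        for d in docs[start:start + window]:
            counts.update(d)
        return sum(counts[t] * idf[t] for t in query_tokens if t in counts)

    last_start = max(n - window, 0)
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)  # make sure the tail of the document is covered
    best = max(starts, key=score)  # ties resolve to the earliest window
    return best, min(best + window, n)
```

For a 30-sentence document whose only query-relevant sentence is at index 22, the candidate windows start at 0, 5 and 10, and the earliest window containing that sentence, (5, 25), is returned.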