
Testing the effectiveness of saliency-based explainability in NLP using randomized survey-based experiments

11/25/2022
by   Adel Rahimi, et al.

As the applications of Natural Language Processing (NLP) in sensitive areas like Political Profiling, Review of Essays in Education, etc. proliferate, there is a great need for increasing transparency in NLP models to build trust with stakeholders and identify biases. A lot of work in Explainable AI has aimed to devise explanation methods that give humans insights into the workings and predictions of NLP models. While these methods distill predictions from complex models like Neural Networks into consumable explanations, how humans understand these explanations is still widely unexplored. Innate human tendencies and biases can handicap the understanding of these explanations in humans, and can also lead to them misjudging models and predictions as a result. We designed a randomized survey-based experiment to understand the effectiveness of saliency-based Post-hoc explainability methods in Natural Language Processing. The result of the experiment showed that humans have a tendency to accept explanations with a less critical view.


I Introduction

Machine Learning projects in the past relied on human knowledge and were easily comprehensible by both developers and end-users. For example, decision trees have been widely used for fraud detection [34]. However, the surge in popularity of Deep Neural Network (DNN) based models in the previous decade, owing to their potential to produce more accurate results, has eroded this comprehensibility. An example is the Convolutional Neural Network developed by Krizhevsky et al., which achieved “top-1 and top-5 error rates of 39.7% and 18.9%” [19] on ImageNet [6], compared to the second-place top-5 error rate of 26.2%. Such superior performance of DNNs has come at the cost of explainability, as they can comprise millions or even billions of parameters [4] [38] whose synergy in arriving at the final output is often undecipherable. Moreover, these models are considered to be ‘black boxes’, where not only the end-users but also the developers are unaware of their inner workings.

This reduced explainability has compromised the trust various stakeholders have in these models, despite their accurate results [11]. Transparency breeds trust in humans, and it is important to create transparency through explainability of these complex models [29]. This only becomes more vital as AI gains a foothold in making critical – and in some cases, fatal – decisions in sensitive areas like Healthcare, Finance, Automated Driving, and the like [28] [17] [1]. The true potential of these recent advancements in AI can only be realised if the various stakeholders manage to discern how AI models work and how their predictions are produced, as that is necessary to establish trust. For example, 83% of people do not understand automated decision-making systems in the criminal justice system, and consequently, 60% oppose their use in the domain [2]. But besides securing the buy-in of end-users and developers through building trust, AI explainability also has the potential to identify AI inaccuracies prior to deployment. This is crucial considering that black-box machine learning models have led to unjustified social ills, like unfavourable parole denials and credit decisions based on race [20].

This has led to the creation of Explainable AI (XAI), a subfield of AI which focuses on explaining AI decisions in human-understandable ways. Explainability itself is not a new concept [45]; explanations of AI systems have long been used to debug them [37].

While model interpretability can be augmented by restricting model complexity, this can limit the effectiveness of the models [7]. As such, there has been an explosion of work around post-hoc analysis in XAI, whereby methods are devised to explain models post-training without aiming to make them inherently interpretable [18]. These recent developments are able to distil predictions from complex models like Neural Networks into simple explanations that can be consumed by developers and end-users alike. However, this raises the question of how closely these explanations match human understanding [13]. That is, whether these explanations of AI predictions and models are understood by humans as intended is still an open question. Innate human tendencies, biases, etc. can handicap the effectiveness of these explanations in helping humans accurately assess predictions and models, and can even have the reverse of the intended outcome. This gap between explanation and understanding can be dangerous, especially in fields like healthcare, as it can lead to stakeholders considering incorrect models to be correct, and vice-versa. This necessitates more work in this field.

The need for explainable AI is particularly important for Neural Network based NLP models, a field which is gaining a foothold in making pertinent decisions and hence requires transparency. For example, NLP models are in the inchoate stages of being used for political profiling based on election surveys [25]; by assessing these surveys, NLP can heavily influence actual election outcomes [3]. NLP-based models are also being used at large scale in education, for example for automated essay scoring [31]. Such important applications make NLP a pertinent AI area for explainability.

II Background

There are many definitions delineating what “interpretation” and “explanation” mean in the context of AI. Montavon et al. have drawn a distinction between the two [26]. They describe “interpretability” as the “mapping of an abstract concept into a domain that humans can make sense of”. For instance, text, symbols, images, etc. can help humans interpret the content and concept that they represent. “Explainability”, on the other hand, has been put forth as “the collection of features of the interpretable domain that contribute to a certain decision” [26]. An example of this would be a heatmap highlighting the pixels that contributed the most to the categorization of an image. With the increasing complexity of AI models, the focus of AI ethics has shifted from making models inherently interpretable to explaining their decisions.

Inherently Explainable AI models are white-box models which are transparent by design. By delineating the most important parameters considered, they provide an intuitive understanding of why the model has predicted a certain data point as belonging to a certain class. Some common examples of Inherently Explainable AI models are Decision Trees [30] and Linear Predictors. However, more complex problems might not be solvable by Inherently Explainable AI models and may require black-box models, e.g. Deep Learning [21] [16].
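As an illustration of this transparency, a small decision tree can be printed as explicit if-then rules. The sketch below is our illustration only (it uses scikit-learn and a toy dataset, neither of which comes from the paper):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Illustrative only: a shallow tree on a toy dataset, printed as readable rules.
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# export_text renders the learned splits as nested if-then rules,
# which is what makes such models "transparent by design".
print(export_text(tree, feature_names=list(iris.feature_names)))
```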

The most popular methods for explainability can be summarized into three main categories spanning the entire cycle of modeling: Pre-modeling explainability [33], Modeling explainability or explainable modeling [12], and Post-modeling explainability [35].

Pre-modeling explainability focuses on the study of the input. It is the most rudimentary explainability method, and it comprises inspection of the input through plotting the data distribution, analyzing different classes, and even showing word tokens and class labels in the case of NLP. Such data exploration does not directly explain a model but provides useful insight into the model’s behavior. For example, imbalance in the dataset can justify why the classification performance is subpar on underrepresented labels.
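As a concrete case, simply tabulating label frequencies already constitutes pre-modeling explainability. A minimal sketch, where the label list is a hypothetical stand-in for a real training set:

```python
from collections import Counter

# Hypothetical sentiment labels for a training set.
labels = ["positive", "negative", "positive", "positive", "negative", "positive"]

counts = Counter(labels)
total = sum(counts.values())
for label, n in counts.most_common():
    # A strong skew here would help explain weaker performance
    # on the underrepresented class.
    print(f"{label}: {n} ({n / total:.0%})")
```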

Modeling explainability methods rely on the model — rather than the data — to provide understanding to the end-user. These are the aforementioned Inherently Explainable AI models, which are easy to understand owing to their explicit and limited number of parameters. Black-box models cannot be explained through modeling-explainability methods.

A post-modeling or post-hoc XAI method produces approximations of black-box machine learning models in the form of simpler surrogate models, with a trained AI model as an input [43]. The approximations are constructed on the local behavior of a black-box model for a given input space. The main properties of such proxy models are their inherent explainability and their local faithfulness to the original model to be able to extrapolate its behavior. These simpler surrogate models can shed light on the decision logic of the black-box models through simple and understandable representations e.g. natural language, heatmaps, feature importance scores, etc. This can in turn assist human assessors in understanding and critiquing the black-box machine learning models after they are productionized. For example, the users can understand the importance of various features, highlight errors that the model is prone to, and discern biases in the model or data [5]. Some well-known examples of these methods and packages are SHapley Additive exPlanations (SHAP) [23] [24], LIME [32], DeepLIFT [39], Interpret ML [27], and AllenNLP Interpret [44].
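To show how such a package is typically invoked, the sketch below applies LIME’s text explainer to a generic sentiment classifier. The `predict_proba` function is a placeholder for any black-box model and is not from the paper; treat the whole snippet as an assumption-laden sketch rather than the paper’s setup.

```python
import numpy as np
from lime.lime_text import LimeTextExplainer

def predict_proba(texts):
    # Placeholder: in practice this would call the black-box model
    # (e.g. a fine-tuned transformer) and return an (n, 2) array of
    # [negative, positive] probabilities.
    return np.array([[0.1, 0.9] for _ in texts])

explainer = LimeTextExplainer(class_names=["negative", "positive"])
explanation = explainer.explain_instance(
    "The battery life is great but the screen scratches easily.",
    predict_proba,
    num_features=5,  # top 5 words, mirroring the setup later in the paper
)
print(explanation.as_list())  # [(word, weight), ...] from the local surrogate
```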

A number of post-hoc methods derive feature importance by manipulating real samples, observing the change in the output given the changed samples, and generating a simple model that mimics the original model’s behavior in the original samples’ neighborhood. An example of such a method is Local Interpretable Model-agnostic Explanations (LIME) [41]. However, the aforementioned manipulation is done at random to produce the neighboring instances, and the local distribution of feature values and the density of class labels in the neighborhood are not considered. The randomly generated instances used for the approximations may not even be observed in real samples. Another class of methods relies on extracting decision sets that depict the decision logic of the Machine Learning model in the form of conditional rules [15]. Extracting a subgroup of rules from the copious amounts of decision rule possibilities is framed as an optimization problem with classification accuracy and overall interpretability as the two main objectives [14]. As building inherently interpretable models for complex and high-accuracy tasks is a challenge, post-hoc methods are amongst the most important means of explainability of Machine Learning models.
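To make the perturb-and-fit idea concrete, the following sketch (our illustration, not an implementation from any cited package) drops words from a sentence at random, queries a black-box scorer on each perturbed variant, and fits a weighted linear surrogate whose coefficients serve as word importances:

```python
import numpy as np
from sklearn.linear_model import Ridge

def local_surrogate(sentence, black_box_score, n_samples=500, seed=0):
    """Fit a linear surrogate around one sentence by random word masking."""
    rng = np.random.default_rng(seed)
    words = sentence.split()
    # Binary masks: 1 keeps a word, 0 drops it.
    masks = rng.integers(0, 2, size=(n_samples, len(words)))
    masks[0] = 1  # include the unperturbed sentence
    texts = [" ".join(w for w, keep in zip(words, m) if keep) for m in masks]
    scores = np.array([black_box_score(t) for t in texts])
    # Weight samples by similarity to the original (fraction of words kept),
    # so the surrogate stays locally faithful.
    weights = masks.mean(axis=1)
    surrogate = Ridge(alpha=1.0).fit(masks, scores, sample_weight=weights)
    return dict(zip(words, surrogate.coef_))

# Usage with a toy scorer that just counts sentiment-bearing words:
score = lambda t: t.count("love") + t.count("great") - t.count("awful")
print(local_surrogate("I love this phone but the charger is awful", score))
```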

Despite this wide variety of post-hoc methods for different data types and different output types, there is no evidence to show that they realize their intended outcome of helping the user understand the internals and decision logic of complex models. While these methods are able to distill complex black-box models into simpler models and undemanding representations, whether these simplifications translate into users accurately understanding the internals and logic of the model is uncertain. For example, the human tendency to fill in the blanks can lead users to manufacture additional information which misconstrues the explanation [8], or the anchoring effect can hinder understanding when a misleading frame of reference is taken [9]. Thus, it is crucial to test the congruence between AI explanations and human understanding. In this paper, we test this on a very popular post-hoc method: gradient-based saliency maps.

III Methodology

A pre-trained model was used for the experiment: RoBERTa [22] trained on the Stanford Sentiment Treebank [42]. AllenNLP’s Interpret library was used to generate the model’s predictions and explanations [44] [10].

Subsequently, ten product reviews from the e-commerce website Amazon (https://www.amazon.com/) and restaurant reviews from the travel review website Tripadvisor (https://www.tripadvisor.in/) were procured at random and then classified by sentiment (positive or negative) using AllenNLP’s RoBERTa model. Based on the model card reported by AllenNLP (https://demo.allennlp.org/sentiment-analysis/roberta-sentiment-analysis), the model achieves 95.11% accuracy on the test set, and hence the predictions were considered to be accurate. Subsequently, saliency map interpretations were generated for half of the predictions by visualising the gradient [40]. The text-specific class saliency maps highlighted the words in a given text that were discriminative with respect to the given class. These are post-hoc explanations intended to help users understand the top 5 words that helped RoBERTa classify the sentence as either negative or positive. An example of a review with the top 5 discriminative words highlighted can be found in Figure 1. To reduce the complexity of the survey, the top features were not ranked and were simply highlighted, without any specific order (as can be seen in Figure 1). This approach is very similar to the methodology in [36], although it was developed independently of that work.

Fig. 1: A negatively classified review with the top 5 discriminative words highlighted
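A minimal sketch of how such simple-gradient saliency scores can be computed with Hugging Face Transformers is shown below. The checkpoint name is a stand-in for the AllenNLP RoBERTa model actually used, and the gradient-times-embedding score is one common variant of the simple gradient method, not necessarily the exact AllenNLP implementation.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Stand-in SST-2 RoBERTa checkpoint; the paper used AllenNLP's model.
MODEL_NAME = "textattack/roberta-base-SST-2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

text = "The food was cold and the service was painfully slow."
enc = tokenizer(text, return_tensors="pt")

# Run the model on embeddings directly so we can take gradients w.r.t. them.
embeds = model.get_input_embeddings()(enc["input_ids"]).detach().requires_grad_(True)
logits = model(inputs_embeds=embeds, attention_mask=enc["attention_mask"]).logits
pred = logits.argmax(dim=-1).item()
logits[0, pred].backward()

# Gradient x embedding, summed over the hidden dimension, as the token score.
scores = (embeds.grad * embeds).sum(dim=-1).abs()[0]
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
top5 = scores.topk(5).indices.tolist()
print([tokens[i] for i in top5])
```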

For the other half of the reviews, the words were highlighted at random instead of being based on the gradient, with the constraint that the randomly chosen words were not among the actual top-5 explanation words. The simple gradient explanations and the ‘fabricated explanations’ – where the words were randomly highlighted – were shuffled and put into the questionnaire alongside the prediction from the RoBERTa binary classifier.
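A sketch of how such fabricated explanations can be generated is shown below (our illustration; treating whole words rather than subword tokens is a simplifying assumption):

```python
import random

def fabricate_explanation(words, true_top5, k=5, seed=0):
    """Pick k words to highlight at random, excluding the real top-5 words."""
    rng = random.Random(seed)
    candidates = [w for w in words if w not in true_top5]
    return rng.sample(candidates, k=min(k, len(candidates)))

review = "The food was cold and the service was painfully slow".split()
real_top5 = {"food", "cold", "service", "painfully", "slow"}
print(fabricate_explanation(review, real_top5))
```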

A survey was then created and distributed among 56 participants through email distribution lists. Figure 3 shows a screenshot of the survey’s first page. For each review, the survey-takers were asked to assess whether the highlighted words in the explanation strongly suggest alignment with the classification (positive or negative); an example question is shown in Figure 2. If post-hoc explanations are effective in this scenario, the survey-takers should ideally agree with the predictions of all the sentences with the original saliency map, and disagree with the predictions of all the sentences with the words highlighted at random.

Fig. 2: Screenshot of first question
Fig. 3: Screenshot of the survey

As the questionnaire comprised 10 questions, one point was assigned per question, awarded if the participant correctly identified whether the explanation had been modified or not.
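A minimal sketch of this scoring scheme follows; the response encoding and the sample answers are hypothetical, not the survey data.

```python
from statistics import mean, median, mode

def score(responses, modified_flags):
    """One point when the participant agrees with an unmodified explanation's
    prediction or disagrees with a modified one."""
    return sum(
        (agree and not modified) or (not agree and modified)
        for agree, modified in zip(responses, modified_flags)
    )

# Hypothetical data: 3 participants x 10 questions, True = "agree".
modified = [False] * 5 + [True] * 5
answers = [
    [True, True, True, False, True, True, True, False, False, True],
    [True, True, False, True, True, True, False, False, True, False],
    [True, False, True, True, True, False, True, True, False, False],
]
scores = [score(a, modified) for a in answers]
print(mean(scores), median(scores), mode(scores))
```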

IV Results

On average, the participants achieved a score of 6.43 out of 10, meaning that, on average, participants agreed with the prediction under the unmodified explanation and disagreed with the prediction under the modified explanation only 6.43 out of 10 times. Table I shows the statistics from the participation survey.

Average | Median | Range | Mode
6.43 / 10 | 6 / 10 | 4 - 9 | 5 / 10
TABLE I: Survey Statistics — Overall

Table II shows the statistics of the responses to the predictions with unmodified simple gradient explanations. The points represent every time the survey taker agreed with an unmodified explanation’s prediction. The high average score shows that survey-takers were mainly aligned with the predictions and the top-weighted words for classifying the sentences with unmodified explanations. The agreement with the predictions is expected due to the aforementioned high accuracy of the model used. However, the conformity of the survey-takers with the highest weighted words is indicative of the explanations accurately depicting the words that would lead to the sentence’s classification. In other words, the simple gradient explanations in NLP are effective in helping end-users realise the most important features in predictions in a consumable form.

Average | Median | Range | Mode
4.02 / 5 | 4 / 5 | 2 - 5 | 4 / 5
TABLE II: Survey Statistics — Unmodified explanations

Table III shows the statistics of the responses to the predictions with modified simple gradient explanations. The points represent every time the survey taker disagreed with a modified explanation’s prediction. The low average score indicates that the survey-takers often agreed with the randomised explanations justifying the respective prediction. This is counter to expectations: the real highest-weighted discriminative words would often be missing, and words with low to negligible weights would instead be highlighted in the explanation. As such, these random words should not justify the prediction in most cases; however, the results show otherwise. We suspect that the participants draw improbable links between the explanations and the predictions. Seeing the false ’top-weighted words’ and the prediction, they perhaps come up with ways to justify the prediction, which influences their perception of the explanation. This can create a gap between the intended effect of explanations and the actual understanding of explanations by end-users and developers. If true, this tendency can make end-users assess incorrect or unfair models as correct or fair. Furthermore, it casts doubt on the effectiveness of the unmodified explanations: the high number of agreements with original explanation predictions might have been driven by this tendency and not by the actual effectiveness of the explanations. However, more work is required to further explore this hypothesis and others that could underpin this discrepancy.

Average | Median | Range | Mode
2.41 / 5 | 2 / 5 | 0 - 5 | 2 / 5
TABLE III: Survey Statistics — Modified explanations

Figure 4 shows the instances of a question being answered by a participant in the form of a ’confusion matrix’. The quadrants are determined based on whether the question pertained to a modified explanation or an original one, and whether the participant agreed with the prediction or disagreed. The concentration of instances in the ’Unmodified-Agree’ quadrant and their sparseness in the ’Unmodified-Disagree’ quadrant indicates that explanations are effective in justifying predictions through highlighting the most important features. The instances in the ’Modified-Disagree’ quadrant are expected, since random explanations conventionally should not justify the predictions. However, the high number of observations in the ’Modified-Agree’ quadrant shows that the participants often agree with the predictions for the wrong reasons, and are fooled by the explanation in this scenario. This implies that simple gradients in NLP are not effective for people assessing the accuracy of a model or understanding its internals.

Fig. 4: Confusion matrix of responses to explanations

V Conclusions

In this paper, we created a simple method to test the effectiveness of post-hoc explainability. The results of the survey indicate that simple gradient explanations in NLP are effective in helping end-users realize the most important features in predictions in a consumable form. However, the survey also shows that the participants often agree with the predictions for the wrong reasons, and are fooled by the explanation in this scenario. This implies that simple gradients in NLP are not effective for people in assessing the accuracy of a model or understanding its internals. We can also conclude that current explainability methods based on gradients, specifically simple gradients, might not clearly convey the reasons for a prediction to users. This will only become more important as the need for trust and verifiability in models grows.

VI Future Work

For future work, we propose that there should be more exploration of verifiable and trustworthy model explanations where humans can judge whether they are correct or incorrect. Proposed future works include:

  • More work on verifiable and trustworthy model explanations.

  • A better way to provide decision support by using the natural language from the explanations.

  • Combining multiple models and providing explanations for the multiple models together.

  • Visualizing and using the explanations for debugging the models.

  • Providing explanations for the fairness of models.

References

  • [1] B. Babic, S. Gerke, T. Evgeniou, and I. G. Cohen (2021) Beware explanations from ai in health care. Science 373 (6552), pp. 284–286. Cited by: §I.
  • [2] B. Balaram, T. Greenham, and J. Leonard (2018) Engaging the public on automated decision systems. In Artificial Intelligence: Real Public Engagement, Cited by: §I.
  • [3] A. Blais, E. Gidengil, and N. Nevitte Do polls influence the vote?. The University of Michigan Press. Cited by: §I.
  • [4] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language models are few-shot learners. Advances in neural information processing systems 33, pp. 1877–1901. Cited by: §I.
  • [5] S. Chandler (2020) How explainable ai is helping algorithms avoid bias. Forbes. Cited by: §II.
  • [6] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §I.
  • [7] K. Dziugaite, S. Ben-David, and D. Roy (2020) Enforcing interpretability and its statistical impacts: trade-offs between accuracy and interpretability. arXiv. Cited by: §I.
  • [8] L. Freeman (1992) Filling in the blanks: a theory of cognitive categories and the structure of social affiliation. Social Psychology Quarterly, pp. 118–127. Cited by: §II.
  • [9] A. Furnham (2011) A literature review of the anchoring effect. The journal of socio-economics 40, pp. 35–42. Cited by: §II.
  • [10] M. Gardner, J. Grus, M. Neumann, O. Tafjord, P. Dasigi, N. Liu, M. Peters, M. Schmitz, and L. Zettlemoyer (2018) Allennlp: a deep semantic natural language processing platform. arXiv preprint arXiv:1803.07640. Cited by: §III.
  • [11] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, and D. Pedreschi (2018) A survey of methods for explaining black box models. ACM computing surveys (CSUR) 51 (5), pp. 1–42. Cited by: §I.
  • [12] D. Gunning (2017) Explainable artificial intelligence (xai). Defense advanced research projects agency (DARPA), nd Web 2 (2), pp. 1. Cited by: §II.
  • [13] H. Hagras (2018) Toward human-understandable, explainable ai. Computer 51 (9), pp. 28–36. Cited by: §I.
  • [14] L. Himabindu, S. Bach, and J. Leskovec (2016) Interpretable decision sets: a joint framework for description and prediction. In 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1675–1684. Cited by: §II.
  • [15] L. Himabindu, E. Kamar, R. Caruana, and J. Leskovec (2017) Interpretable & explorable approximations of black box models. In arXiv preprint arXiv:1707.01154, Cited by: §II.
  • [16] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §II.
  • [17] A. Holzinger, G. Langs, H. Denk, K. Zatloukal, and H. Müller (2019) Causability and explainability of artificial intelligence in medicine. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 9 (4), pp. e1312. Cited by: §I.
  • [18] E. M. Kenny, E. D. Delaney, D. Greene, and M. T. Keane (2021) Post-hoc explanation options for xai in deep learning: the insight centre for data analytics perspective. In International Conference on Pattern Recognition, pp. 20–34. Cited by: §I.
  • [19] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25. Cited by: §I.
  • [20] J. Larson, L. Mattu, and J. Angwin (2016) How we analyzed the compas recidivism algorithm. ProPublica. Cited by: §I.
  • [21] Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. nature 521 (7553), pp. 436–444. Cited by: §II.
  • [22] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §III.
  • [23] S. M. Lundberg, G. Erion, H. Chen, A. DeGrave, J. M. Prutkin, B. Nair, R. Katz, J. Himmelfarb, N. Bansal, and S. Lee (2020) From local explanations to global understanding with explainable ai for trees. Nature machine intelligence 2 (1), pp. 56–67. Cited by: §II.
  • [24] S. M. Lundberg and S. Lee (2017) A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 4765–4774. External Links: Link Cited by: §II.
  • [25] C. Mallavarapu, R. Mandava, S. KC, and G. Holt (2018) Political profiling using feature engineering and nlp. SMU Data Science Review. Cited by: §I.
  • [26] G. Montavon, W. Samek, and K. Müller (2018) Methods for interpreting and understanding deep neural networks. Digital signal processing 73, pp. 1–15. Cited by: §II.
  • [27] H. Nori, S. Jenkins, P. Koch, and R. Caruana (2019) Interpretml: a unified framework for machine learning interpretability. arXiv preprint arXiv:1909.09223. Cited by: §II.
  • [28] F. Pasquale (2015) The black box society: the secret algorithms that control money and information. Harvard University Press. Cited by: §I.
  • [29] D. Pedreshi, S. Ruggieri, and F. Turini (2008) Discrimination-aware data mining. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 560–568. Cited by: §I.
  • [30] J. R. Quinlan (1986) Induction of decision trees. Machine learning 1 (1), pp. 81–106. Cited by: §II.
  • [31] D. Ramesh and S. Sanampudi (2022) An automated essay scoring systems: a systematic literature review. In Artificial Intelligence Review, pp. 2495–2527. Cited by: §I.
  • [32] M. T. Ribeiro, S. Singh, and C. Guestrin (2016) “Why should I trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1135–1144. Cited by: §II.
  • [33] S. Sachan, F. Almaghrabi, J. Yang, and D. Xu (2021) Evidential reasoning for preprocessing uncertain categorical data for trustworthy decisions: an application on healthcare and finance. Expert Systems with Applications 185, pp. 115597. Cited by: §II.
  • [34] Y. Sahin, S. Bulkan, and E. Duman (2013) A cost-sensitive decision tree approach for fraud detection. Expert Systems with Applications 40 (15), pp. 5916–5923. Cited by: §I.
  • [35] W. Samek, T. Wiegand, and K. Müller (2017) Explainable artificial intelligence: understanding, visualizing and interpreting deep learning models. arXiv preprint arXiv:1708.08296. Cited by: §II.
  • [36] H. Schuff, A. Jacovi, H. Adel, Y. Goldberg, and N. T. Vu (2022) Human interpretation of saliency-based explanation over text. arXiv preprint arXiv:2201.11569. Cited by: §III.
  • [37] A. C. Scott, W. J. Clancey, R. Davis, and E. H. Shortliffe (1977) Explanation capabilities of production-based consultation systems. Technical report STANFORD UNIV CA DEPT OF COMPUTER SCIENCE. Cited by: §I.
  • [38] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro (2019) Megatron-lm: training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053. Cited by: §I.
  • [39] A. Shrikumar, P. Greenside, and A. Kundaje (2017) Learning important features through propagating activation differences. In International conference on machine learning, pp. 3145–3153. Cited by: §II.
  • [40] K. Simonyan, A. Vedaldi, and A. Zisserman (2013) Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034. Cited by: §III.
  • [41] D. Slack, S. Hilgard, S. Singh, and H. Lakkaraju (2020) Fooling lime and shap: adversarial attacks on post hoc explanation methods. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, pp. 180–186. Cited by: §II.
  • [42] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1631–1642. Cited by: §III.
  • [43] S. Tan, R. Caruana, G. Hooker, and Y. Lou (2018) Distill-and-compare: auditing black-box models using transparent model distillation. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society,, pp. 303–310. Cited by: §II.
  • [44] E. Wallace, J. Tuyls, J. Wang, S. Subramanian, M. Gardner, and S. Singh AllenNLP interpret: explaining predictions of nlp models. Cited by: §II, §III.
  • [45] F. Xu, H. Uszkoreit, Y. Du, W. Fan, D. Zhao, and J. Zhu (2019) Explainable ai: a brief survey on history, research areas, approaches and challenges. In CCF international conference on natural language processing and Chinese computing, pp. 563–574. Cited by: §I.